Neural network examples

This document gives examples of writing code that uses the neural network class in Waffles.

Logistic regression

Logistic regression fits your data with a single layer of logistic units. Here are the #includes that we are going to need for this example:

    #include <GClasses/GActivation.h>
    #include <GClasses/GHolders.h>
    #include <GClasses/GMatrix.h>
    #include <GClasses/GNeuralNet.h>
    #include <GClasses/GRand.h>
    #include <GClasses/GTransform.h>

We are going to need some data, so let's load some data from an ARFF file. Let's use Iris, a well-known dataset for machine learning examples:

    GMatrix data;
    data.loadArff("iris.arff");

"data" is a 150x5 matrix. Next, we need to divide this data into a feature matrix (the inputs) and a label matrix (the outputs):

    GDataColSplitter cs(data, 1); // the "iris" dataset has only 1 column of "labels"
    const GMatrix& inputs = cs.features();
    const GMatrix& outputs = cs.labels();

"inputs" is a 150x4 matrix of real values, and "outputs" is a 150x1 matrix of categorical values. Neural networks typically only support continuous values, but the labels in the iris dataset are categorical, so we will convert them to a real-valued representation (also known as a categorical distribution, a one-hot representation, or a binarized representation):

    GNominalToCat nc;
    nc.train(outputs);
    GMatrix* pRealOutputs = nc.transformBatch(outputs);

pRealOutputs points to a 150x3 matrix of real values. Now, let's further divide our data into a training portion and a testing portion:

    GRand r(0);
    GDataRowSplitter rs(inputs, *pRealOutputs, r, 75);
    const GMatrix& trainingFeatures = rs.features1();
    const GMatrix& trainingLabels = rs.labels1();
    const GMatrix& testingFeatures = rs.features2();
    const GMatrix& testingLabels = rs.labels2();

Now, we are ready to train a layer of logistic units that takes 4 inputs and gives 3 outputs. In this example, we will also specify the learning rate:

    GNeuralNet nn;
    nn.addLayer(new GLayerClassic(4, 3, new GActivationLogistic()));
    nn.setLearningRate(0.05);
    nn.train(trainingFeatures, trainingLabels);

Let's test our model to see how well it performs:

    double sse = nn.sumSquaredError(testingFeatures, testingLabels);
    double mse = sse / testingLabels.rows();
    double rmse = sqrt(mse);
    std::cout << "The root-mean-squared test error is " << to_str(rmse) << "\n";

Finally, don't forget to delete pRealOutputs:

    delete(pRealOutputs);

Or, preferably:

    Holder<GMatrix> hRealOutputs(pRealOutputs);

Classification

(This example builds on the previous one.)

The previous example was not actually very useful, because root-mean-squared error only tells us how poorly the neural network fit our continuous representation of the data. It does not really tell us how accurately the neural network classifies this data. So, instead of transforming the data to meet the model, we can transform the model to meet the data. Specifically, we can use the GAutoFilter class to turn the neural network into a classifier:

    GAutoFilter af(&nn, false); // false means the auto-filter does not need to "delete(&nn)" when it is destroyed

Now, we can train the auto-filter using the original data (with nominal labels). We no longer need to explicitly train the neural network, because training the auto-filter trains the inner model:

    af.train(inputs, outputs);

The auto-filter automatically filters the data as needed for its inner model, but ultimately behaves as a model that can handle whatever types of data you have available. In this case, it turns the neural network into a classifier, since "outputs" contains 1 column of categorical values.
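At this point you could also ask the trained auto-filter to classify individual samples. The following is a minimal sketch that is not part of the original example: the sample values are a hypothetical, hand-made iris measurement, and it assumes the pointer-based GSupervisedLearner::predict(in, out) interface that matches the other row-based calls on this page (if your version of the library passes rows as GVec objects, the call is analogous):

    // A hypothetical iris sample (sepal length, sepal width, petal length, petal width).
    double sample[4] = {5.1, 3.5, 1.4, 0.2};

    // The auto-filter converts the network's real-valued output back into a nominal class label.
    double prediction[1];
    af.predict(sample, prediction);
    std::cout << "Predicted class: " << to_str(prediction[0]) << "\n";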
So, now we can obtain the misclassification rate as follows:

    double mis = af.sumSquaredError(inputs, outputs);
    std::cout << "The model misclassified " << to_str(mis) << " out of "
        << to_str(outputs.rows()) << " instances.\n";

Why does a method named "sumSquaredError" return the total number of misclassifications? Because it uses Hamming distance for categorical labels, which reports a squared error of 1 for each misclassification.

(Note that if you are training a big network with big data, then efficiency may be critical. In such cases, it is generally better to use the approach of transforming the data to meet the model, so that a lot of time is not wasted transforming data during training.)
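For illustration, here is a minimal sketch of that data-side approach. It is not part of the original example; it simply reuses the GNominalToCat filter from the logistic regression example above, so that the label conversion happens once, before training, rather than repeatedly inside the training loop:

    // Convert the nominal labels to their one-hot, real-valued representation once, up front.
    GNominalToCat nc;
    nc.train(outputs);
    GMatrix* pRealOutputs = nc.transformBatch(outputs);
    Holder<GMatrix> hRealOutputs(pRealOutputs);

    // Train the raw neural network directly on the pre-converted labels.
    nn.train(inputs, *pRealOutputs);

(The MNIST example near the bottom of this page uses GAutoFilter::prefilterLabels to accomplish the same thing.)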
Adding layers

To make a deeper neural network, you add more layers prior to training. For example:

    nn.addLayer(new GLayerClassic(784, 1000));
    nn.addLayer(new GLayerClassic(1000, 300));
    nn.addLayer(new GLayerClassic(300, 90));
    nn.addLayer(new GLayerClassic(90, 10));

The layers are added in feed-forward order. The first layer added is the input layer, and the last layer added is the output layer. Notice how the number of inputs into each layer should be the same as the number of outputs of the previous layer. If they do not line up, then the existing layer will be resized to fit the newly added layer. In other words, if there is a conflict, the layer most recently added to the neural network wins. Alternatively, you can use the constant FLEXIBLE_SIZE to indicate that you want the adjacent layer to determine the size. For example, the following code results in the very same topology:

    nn.addLayer(new GLayerClassic(784, 1000));
    nn.addLayer(new GLayerClassic(FLEXIBLE_SIZE, 300));
    nn.addLayer(new GLayerClassic(FLEXIBLE_SIZE, 90));
    nn.addLayer(new GLayerClassic(FLEXIBLE_SIZE, 10));

When training begins, the number of inputs and the number of outputs will also be resized to fit the training data, so you can use FLEXIBLE_SIZE for those too:

    nn.addLayer(new GLayerClassic(FLEXIBLE_SIZE, 1000));
    nn.addLayer(new GLayerClassic(FLEXIBLE_SIZE, 300));
    nn.addLayer(new GLayerClassic(FLEXIBLE_SIZE, 90));
    nn.addLayer(new GLayerClassic(FLEXIBLE_SIZE, FLEXIBLE_SIZE));

That is as far as FLEXIBLE_SIZE can take you. If you used FLEXIBLE_SIZE anywhere else, then the size of one of the layers would be ambiguous, which would be problematic.

Activation functions

If no activation function is specified, then the layer uses a default activation function. The default activation function for GLayerClassic is GActivationTanH, which implements the tanh activation function. Many other activation functions are also available:

    nn.addLayer(new GLayerClassic(FLEXIBLE_SIZE, 1000, new GActivationRectifiedLinear()));
    nn.addLayer(new GLayerClassic(FLEXIBLE_SIZE, 300, new GActivationSoftPlus()));
    nn.addLayer(new GLayerClassic(FLEXIBLE_SIZE, 90, new GActivationGaussian()));
    nn.addLayer(new GLayerClassic(FLEXIBLE_SIZE, FLEXIBLE_SIZE, new GActivationIdentity()));

You can use the GLayerMixed class to make layers that contain a mixture of activation functions. In the following example, the second layer has a size of 300, but it uses the softplus activation function for 150 of its units and a sinusoid activation function for the other 150 units:

    nn.addLayer(new GLayerClassic(FLEXIBLE_SIZE, 1000, new GActivationRectifiedLinear()));
    GLayerMixed* pMix = new GLayerMixed();
    pMix->addComponent(new GLayerClassic(1000, 150, new GActivationSoftPlus()));
    pMix->addComponent(new GLayerClassic(1000, 150, new GActivationSin()));
    nn.addLayer(pMix);
    nn.addLayer(new GLayerClassic(300, 90, new GActivationSinc()));
    nn.addLayer(new GLayerClassic(FLEXIBLE_SIZE, FLEXIBLE_SIZE, new GActivationBend()));

Layer types

There are several types of layers that you can use in your neural networks. The ones that appear in the examples on this page are:
- GLayerClassic, a classic fully-connected layer.
- GLayerMixed, which combines two or more component layers side-by-side into one layer (see the example above).
- GLayerConvolutional2D, a 2D convolutional layer (see the MNIST example below).
- GLayerClassicCuda, a GPU-parallelized version of GLayerClassic (see the GPU Parallelization section below).
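As a quick illustration of mixing layer types in one network, here is a minimal sketch that is not part of the original page. The GLayerConvolutional2D constructor arguments are simply copied from the MNIST example near the bottom of this page (a 28x28 single-channel input), and the layer sizes are arbitrary:

    GNeuralNet nn;
    nn.addLayer(new GLayerConvolutional2D(28, 28, 1, 7, 5, new GActivationBend())); // arguments as in the MNIST example below
    nn.addLayer(new GLayerClassic(FLEXIBLE_SIZE, 100, new GActivationTanH()));      // an ordinary fully-connected layer
    nn.addLayer(new GLayerClassic(FLEXIBLE_SIZE, FLEXIBLE_SIZE));                   // output layer sized to fit the training labels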
Stopping criteria

The GNeuralNet::train method divides the training data into a training portion and a validation portion. By default, it uses 65% for training and 35% for validation. Suppose you wanted to change this ratio to 60/40. This would be done as follows:

    nn.setValidationPortion(0.4);

By default, training continues until validation accuracy does not improve by 0.2% over a window of 100 epochs. If you wanted to change this to 0.1% over a window of 10 epochs, then you could do:

    nn.setImprovementThresh(0.001);
    nn.setWindowSize(10);

Training incrementally

Sometimes, it is preferable to train your neural network incrementally, instead of simply calling the "train" method. For example, you might want to use a custom stopping criterion, report validation accuracy before each training epoch, decay the learning rate in a particular manner, and so forth. The following example shows how such things can be implemented:

    nn.beginIncrementalLearning(trainingFeatures.relation(), trainingLabels.relation());
    GRandomIndexIterator ii(trainingFeatures.rows(), nn.rand());
    for(size_t epoch = 0; epoch < total_epochs; epoch++)
    {
        // Report validation accuracy
        // (validateFeatures and validateLabels denote a held-out validation set)
        double rmse = sqrt(nn.sumSquaredError(validateFeatures, validateLabels) / validateLabels.rows());
        std::cout << to_str(rmse) << "\n";
        std::cout.flush();

        // Train, visiting the rows in random order
        ii.reset();
        size_t index;
        while(ii.next(index))
        {
            nn.trainIncremental(trainingFeatures[index], trainingLabels[index]);
        }

        // Decay the learning rate
        nn.setLearningRate(nn.learningRate() * 0.98);
    }

Serialization

You can write your neural network to a file:

    GDom doc;
    doc.setRoot(nn.serialize(&doc));
    doc.saveJson("my_neural_net.json");

Then, you can load it from the file again:

    GDom doc;
    doc.loadJson("my_neural_net.json");
    GLearnerLoader ll;
    GNeuralNet* pNN = (GNeuralNet*)ll.loadLearner(doc.root());

GPU Parallelization

Suppose you want to train your neural network in parallel on a GPU. For classic neural network layers, this is done by simply replacing GLayerClassic with GLayerClassicCuda. For example, if you make your neural network like this,

    GNeuralNet nn;
    nn.addLayer(new GLayerClassic(FLEXIBLE_SIZE, 2000));
    nn.addLayer(new GLayerClassic(FLEXIBLE_SIZE, 5000));
    nn.addLayer(new GLayerClassic(FLEXIBLE_SIZE, 1000));

then all you need to do to parallelize it is change these lines to:

    GCudaEngine e;
    GNeuralNet nn;
    nn.addLayer(new GLayerClassicCuda(e, FLEXIBLE_SIZE, 2000));
    nn.addLayer(new GLayerClassicCuda(e, FLEXIBLE_SIZE, 5000));
    nn.addLayer(new GLayerClassicCuda(e, FLEXIBLE_SIZE, 1000));

The GLayerClassicCuda class is not in the GClasses library; it is in the GCuda library. To build this library, go into "src/depends/GCuda" and do "sudo make install". (Subfolders of "depends" are not built by default, because they depend on things that some users may not have installed.) Of course, you also need an NVIDIA GPGPU, and you have to link to a few extra libraries: GCuda, curand, cublas, and cudart. The latter three can all be obtained by installing the CUDA SDK. To help you get started, there is a demo app in demos/cuda, which shows how this is done. Currently, only classic layers have a GPU counterpart, but I am working on parallelizing the other layer types too. I plan to have all functionality available on the GPU eventually. (If you would like to help with this effort, I would very much appreciate it.)

MNIST

These days, it has become very popular to test deep neural networks with the MNIST dataset. So, let's do an example with it.
First, some header code:

    #include <exception>
    #include <iostream>
    #include <GClasses/GApp.h>
    #include <GClasses/GError.h>
    #include <GClasses/GNeuralNet.h>
    #include <GClasses/GActivation.h>
    #include <GClasses/GTransform.h>
    #include <GClasses/GHolders.h>
    #include <GClasses/GVec.h>
    #include <GClasses/GTime.h>
    #include <GClasses/GDom.h>

    using namespace GClasses;
    using std::cerr;
    using std::cout;

    int main(int argc, char *argv[])
    {

Now, let's load the dataset and divide it into features and labels. I am also going to scale all of the features by 1/256. (Note that manually scaling the features in this way is not the same as applying a normalizing transform to the features. A normalizing transform would compute a unique range for each column, which would cause background pixels with little variance to be scaled to fill the full range of values. In this case, that would amplify the noise in the features. Since all of the pixels have meaning relative to each other, we don't want to mess up those relationships.)

        // Load and prep the training data
        GRand rand(0);
        GMatrix train;
        train.loadArff("mnist/train.arff");
        GDataColSplitter trainSplit(train, 1);
        trainSplit.features().multiply(1.0 / 256.0);
        GMatrix test;
        test.loadArff("mnist/test.arff");
        GDataColSplitter testSplit(test, 1);
        testSplit.features().multiply(1.0 / 256.0);

Now, let's make a neural network. One of my favorite activation functions is GActivationBend, so I will use it. This activation function requires a smaller learning rate, because its derivative is never close to zero. I will use a regular sigmoidal layer for the output layer, however, because that tends to stabilize training. Note that this is actually a really small network for this problem. We could get much better results with a bigger network, but it is good practice to keep your initial experiments fast and simple, so you can make adjustments without paying a heavy cost. Notice how I alternate between big and small layers--that is a good trick for keeping the total number of network weights small.

        // Make the network
        GNeuralNet nn;
        nn.addLayer(new GLayerClassic(FLEXIBLE_SIZE, 40, new GActivationBend()));
        nn.addLayer(new GLayerClassic(FLEXIBLE_SIZE, 120, new GActivationBend()));
        nn.addLayer(new GLayerClassic(FLEXIBLE_SIZE, 40, new GActivationBend()));
        nn.addLayer(new GLayerClassic(FLEXIBLE_SIZE, 120, new GActivationBend()));
        nn.addLayer(new GLayerClassic(FLEXIBLE_SIZE, FLEXIBLE_SIZE));
        nn.setLearningRate(0.01);
        GAutoFilter af(&nn, false);
        af.beginIncrementalLearning(trainSplit.features().relation(), trainSplit.labels().relation());
        cout << "Total weights: " << to_str(nn.countWeights()) << "\n";

Now, let's train it. I don't really do anything fancy here.

        // Train
        GRandomIndexIterator ii(trainSplit.features().rows(), rand);
        for(size_t epoch = 0; epoch < 10; epoch++)
        {
            // Validate
            double mis = af.sumSquaredError(testSplit.features(), testSplit.labels());
            cout << "Misclassifications: " << to_str(mis) << "/" << to_str(testSplit.labels().rows()) << "\n";

            // Train
            ii.reset();
            size_t index;
            while(ii.next(index))
                af.trainIncremental(trainSplit.features()[index], trainSplit.labels()[index]);
        }
        return 0;
    }

Training takes less than 2 minutes on a modest laptop computer.
Here are the results that I get:

    Total weights: 47290
    Misclassifications: 8865/10000
    Misclassifications: 807/10000
    Misclassifications: 649/10000
    Misclassifications: 488/10000
    Misclassifications: 512/10000
    Misclassifications: 494/10000
    Misclassifications: 412/10000
    Misclassifications: 360/10000
    Misclassifications: 346/10000
    Misclassifications: 306/10000

Okay, now let's throw some more tricks onto the pile. Instead of going through the GAutoFilter class, we will want to fiddle with the neural network directly, so that we can access some of its more specialized methods. Therefore, we will prefilter the labels to get them into the right format for the neural network (a representation consisting of all continuous values). There is no need to prefilter the features in this case, because they are already in a form that neural networks can handle. (We could if we wanted to, but it would be a no-op.) So, let's use a convolutional input layer, and add two types of regularization: dropout and max-norm. Here is the full program; the changes from the previous example are the convolutional input layer (with adjusted layer sizes), the prefiltered labels, and the dropout and max-norm calls in the training loop:

    #include <exception>
    #include <iostream>
    #include <GClasses/GApp.h>
    #include <GClasses/GError.h>
    #include <GClasses/GNeuralNet.h>
    #include <GClasses/GActivation.h>
    #include <GClasses/GTransform.h>
    #include <GClasses/GHolders.h>
    #include <GClasses/GVec.h>
    #include <GClasses/GTime.h>
    #include <GClasses/GDom.h>

    using namespace GClasses;
    using std::cerr;
    using std::cout;

    int main(int argc, char *argv[])
    {
        // Load and prep the training data
        GRand rand(0);
        GMatrix train;
        train.loadArff("mnist/train.arff");
        GDataColSplitter trainSplit(train, 1);
        trainSplit.features().multiply(1.0 / 256.0);
        GMatrix test;
        test.loadArff("mnist/test.arff");
        GDataColSplitter testSplit(test, 1);
        testSplit.features().multiply(1.0 / 256.0);

        // Make the network
        GNeuralNet nn;
        nn.addLayer(new GLayerConvolutional2D(28, 28, 1, 7, 5, new GActivationBend()));
        nn.addLayer(new GLayerClassic(FLEXIBLE_SIZE, 60, new GActivationBend()));
        nn.addLayer(new GLayerClassic(FLEXIBLE_SIZE, 40, new GActivationBend()));
        nn.addLayer(new GLayerClassic(FLEXIBLE_SIZE, 120, new GActivationBend()));
        nn.addLayer(new GLayerClassic(FLEXIBLE_SIZE, FLEXIBLE_SIZE));
        nn.setLearningRate(0.01);
        GAutoFilter af(&nn, false);
        af.beginIncrementalLearning(trainSplit.features().relation(), trainSplit.labels().relation());
        cout << "Total weights: " << to_str(nn.countWeights()) << "\n";

        // Prefilter the labels into the continuous representation the network expects
        GMatrix* pInnerLabels = af.prefilterLabels(trainSplit.labels());
        Holder<GMatrix> hInnerLabels(pInnerLabels);

        // Train
        GRandomIndexIterator ii(trainSplit.features().rows(), rand);
        for(size_t epoch = 0; epoch < 10; epoch++)
        {
            // Validate
            // (Temporarily scale the weights to compensate for dropout while measuring test error,
            // then undo the scaling before resuming training.)
            nn.scaleWeights(1.0 - 0.35, false, 1);
            double mis = af.sumSquaredError(testSplit.features(), testSplit.labels());
            nn.scaleWeights(1.0 / (1.0 - 0.35), false, 1);
            cout << "Misclassifications: " << to_str(mis) << "/" << to_str(testSplit.labels().rows()) << "\n";

            // Train
            ii.reset();
            size_t index;
            while(ii.next(index))
            {
                // Max-norm regularization: constrain the weight magnitudes in each layer
                for(size_t i = 0; i < nn.layerCount(); i++)
                    nn.layer(i).maxNorm(2.0);

                // Train on one pattern with a dropout rate of 0.35
                nn.trainIncrementalWithDropout(trainSplit.features()[index], pInnerLabels->row(index), 0.35);
            }
        }
        return 0;
    }

And here are some results. Notice that they are not actually better. Is that because these tricks are no good? Is it because this network has more weights and I didn't train it for long enough? (Notice that the accuracy is still improving.) Is it because those tricks are really designed for larger networks, and this one is still so very small? Is it because I didn't tune my hyperparameters at all?
I will leave answering such questions as an exercise for the reader--I'm just trying to show you how to use my code =).

    Total weights: 154080
    Misclassifications: 8968/10000
    Misclassifications: 812/10000
    Misclassifications: 659/10000
    Misclassifications: 551/10000
    Misclassifications: 572/10000
    Misclassifications: 609/10000
    Misclassifications: 503/10000
    Misclassifications: 563/10000
    Misclassifications: 464/10000
    Misclassifications: 414/10000