Usage of validation set in neural networks - algorithm

I want to describe you an algorithm that I used to choose number of hidden layers and number of neurons within each of it (I couldn't find anywhere any approval of this approach, whereas it seems very logical for me):
Divide data into 60% (training), 20% (validation) and 20% (test) parts.
Now I want to check all possibilities in one hidden layer and two hidden layer network. To do so:
I'll train network with one hidden layer and with 1, 2 and 3 neurons within this one layer (3 different networks). After that I will calculate error on validation set (RMSE and MAE) for each of those networks.
I'll do exactly same thing for network with 2 hidden layers i.e. I'm gonna to estimate network with two hidden layers with all the possibilities of number of neurons in each layer (but number of neurons can only be 1, 2 or 3). It means that I will have 9 pairs of possible outcomes : 1 neuron in first hidden layer, 1 neuron in second hidden layer. 2 neurons in first hidden layer, 1 neuron in second hidden layer and so on... For each of the 9 architectures I'll calculate RMSE and MAE between predictions on validation set and actual values.
Out of all those iterations I'll pick the architecture for which error on the validation set is the lowest.
Could you please tell me if this algorithm make any sense? If not, what else I can do with validation set to choose the best architecture for neural network?

This is the absolutely standard method of hyperparameter/model selection. The "trick" one relies upon is that you select from a very limited set of models (in your case 12), and thus your estimation of error will be very tight. You can refer to "statistical learning theory" book by Vladimir Vapnik giving you exact bounds why this sort of approach is great.
That being said you are working under the assumption that 20% of your data is "big enough" to estimate your performance, if it is not you can look at Cross Validation

Related

Neural Networks - why is my training error increasing as I add hidden units (neurons)?

I'm trying to optimise the number of hidden units in my MLP.
I'm using k-fold cross validation, with 10 folds - 16200 training points and 1800 validation points in each fold.
When I run the network with hidden units varying from 1:10, I find the minimum error always occurs at 2 (NMSE of about 7).
3 is slightly higher (NMSE of about 11) and 4 or more hidden units and the error remains constant at about 14 or 15 regardless of many I add.
Why is this?
I find it hard to believe that overfitting is occurring, because of the very large amount of data points being used (with all 10 folds, that's 162000 training points, albeit each repeated 9 times).
Many thanks for any help or advice!
If the input is voltage and current, and question is about the power generated, then it's just P=V*I. Even if you have some noise, the relationship will be still linear. In this case simple linear model would do just fine - and would be far nicer to interpret! That's why simple ANN works best and more complex is overfitting, as it looks for non-linear relationships (which are not there, but it does whatever will minimise cost function).
To summarise, I would recommend to check a simple linear model. Also, since you have a lot of data points, make a 50-25-25 split for training, test and validation sets. Look at your cost function and see how it changes with error rate.

Understanding Perceptrons

I just started a Machine learning class and we went over Perceptrons. For homework we are supposed to:
"Choose appropriate training and test data sets of two dimensions (plane). Use 10 data points for training and 5 for testing. " Then we are supposed to write a program that will use a perceptron algorithm and output:
a comment on whether the training data points are linearly
separable
a comment on whether the test points are linearly separable
your initial choice of the weights and constants
the final solution equation (decision boundary)
the total number of weight updates that your algorithm made
the total number of iterations made over the training set
the final misclassification error, if any, on the training data and
also on the test data
I have read the first chapter of my book several times and I am still having trouble fully understanding perceptrons.
I understand that you change the weights if a point is misclassified until none are misclassified anymore, I guess what I'm having trouble understanding is
What do I use the test data for and how does that relate to the
training data?
How do I know if a point is misclassified?
How do I go about choosing test points, training points, threshold or a bias?
It's really hard for me to know how to make up one of these without my book providing good examples. As you can tell I am pretty lost, any help would be so much appreciated.
What do I use the test data for and how does that relate to the
training data?
Think about a Perceptron as young child. You want to teach a child how to distinguish apples from oranges. You show it 5 different apples (all red/yellow) and 5 oranges (of different shape) while telling it what it sees at every turn ("this is a an apple. this is an orange). Assuming the child has perfect memory, it will learn to understand what makes an apple an apple and an orange an orange if you show him enough examples. He will eventually start to use meta-features (like shapes) without you actually telling him. This is what a Perceptron does. After you showed him all examples, you start at the beginning, this is called a new epoch.
What happens when you want to test the child's knowledge? You show it something new. A green apple (not just yellow/red), a grapefruit, maybe a watermelon. Why not show the child the exact same data as before during training? Because the child has perfect memory, it will only tell you what you told him. You won't see how good it generalizes from known to unseen data unless you have different training data that you never showed him during training. If the child has a horrible performance on the test data but a 100% performance on the training data, you will know that he has learned nothing - it's simply repeating what he has been told during training - you trained him too long, he only memorized your examples without understanding what makes an apple an apple because you gave him too many details - this is called overfitting. To prevent your Perceptron from only (!) recognizing training data you'll have to stop training at a reasonable time and find a good balance between the size of the training and testing set.
How do I know if a point is misclassified?
If it's different from what it should be. Let's say an apple has class 0 and an orange has 1 (here you should start reading into Single/MultiLayer Perceptrons and how Neural Networks of multiple Perceptrons work). The network will take your input. How it's coded is irrelevant for this, let's say input is a string "apple". Your training set then is {(apple1,0), (apple2,0), (apple3,0), (orange1,1), (orange2,1).....}. Since you know the class beforehand, the network will either output 1 or 0 for the input "apple1". If it outputs 1, you perform (targetValue-actualValue) = (1-0) = 1. 1 in this case means that the network gives a wrong output. Compare this to the delta rule and you will understand that this small equation is part of the larger update equation. In case you get a 1 you will perform a weight update. If target and actual value are the same, you will always get a 0 and you know that the network didn't misclassify.
How do I go about choosing test points, training points, threshold or
a bias?
Practically the bias and threshold isn't "chosen" per se. The bias is trained like any other unit using a simple "trick", namely using the bias as an additional input unit with value 1 - this means the actual bias value is encoded in this additional unit's weight and the algorithm we use will make sure it learns the bias for us automatically.
Depending on your activation function, the threshold is predetermined. For a simple perceptron, the classification will occur as follows:
Since we use a binary output (between 0 and 1), it's a good start to put the threshold at 0.5 since that's exactly the middle of the range [0,1].
Now to your last question about choosing training and test points: This is quite difficult, you do that by experience. Where you're at, you start off by implementing simple logical functions like AND, OR, XOR etc. There's it's trivial. You put everything in your training set and test with the same values as your training set (since for x XOR y etc. there are only 4 possible inputs 00, 10, 01, 11). For complex data like images, audio etc. you'll have to try and tweak your data and features until you feel like the network can work with it as good as you want it to.
What do I use the test data for and how does that relate to the training data?
Usually, to asses how well a particular algorithm performs, one first trains it and then uses different data to test how well it does on data it has never seen before.
How do I know if a point is misclassified?
Your training data has labels, which means that for each point in the training set, you know what class it belongs to.
How do I go about choosing test points, training points, threshold or a bias?
For simple problems, you usually take all the training data and split it around 80/20. You train on the 80% and test against the remaining 20%.

Back propagation algorithm when we have two outputs

I have a big problem I want to implement my neuronal neutwork with 2 neurons outputs. Sth like that :
And I want to use backpropagation algorithm, but I don't know how to calculate a error, because I have a output with 2 neurons, when I have a only one neuron on a output that's very easy to use a backpropagation algorithm from one exit error, but with two neurons? I thinking about calculate error for every output seperately but then I must calculate seperately back propagation for 2 cases and I get "two different hidden layers" (For every neuron in hidden layer I have a weights for two cases). Mayby anyone knows some better solutions?
I will be very gratefull for any help.
Logically thinking, the first layer of weights should give you a representation (the hidden layer) that is useful for predicting both outputs. So, this layer should be updated based on the error made in both outputs. But the next layer of weights are separate for each output node, so should get separate weight updates.
So, on second layer weights, the weight updates will be calculated separately based on the respective outputs. For the first layer of weights, I would first calculate error derivatives backpropagating from each output separately and then simply combine them to get the final error derivative. Then apply learning rate to get the weight updates.
Watch out for the dynamic range of your outputs. For example, if one output is producing some real value of range [0,10] and another is producing values in range [-1000,1000] then your updates will be dominated by the one with larger range. You can
add a preprocessing step that would change your data set to have same dynamic range in both outputs. Also, add a postprocessing step to restore the actual range.
formulate the error functions for each output so that they produce error values of same dynamic range.

Artificial neural network activation function

I've programmed an ANN with backpropagation algorithm to forecast number of customers with 3 layers, 1 output neuron, 3 hidden neurons and 4 input neurons. so i need a continuous output. what activation functions should I use?
In this case what you can do (and I've seen this works well) is to apply a PureLin function to the input and output layer and use the Tanh or Sigmod in the hidden layers. The rest of the work is done by the weights!
Hope this helps!

Confusion with neural networks in MATLAB

I'm working on character recognition (and later fingerprint recognition) using neural networks. I'm getting confused with the sequence of events. I'm training the net with 26 letters. Later I will increase this to include 26 clean letters and 26 noisy letters. If I want to recognize one letter say "A", what is the right way to do this? Here is what I'm doing now.
1) Train network with a 26x100 matrix; each row contains a letter from segmentation of the bmp (10x10).
2) However, for the test targets I use my input matrix for "A". I had 25 rows of zeros after the first row so that my input matrix is the same size as my target matrix.
3) I run perform(net, testTargets,outputs) where outputs are the outputs from the net trained with the 26x100 matrix. testTargets is the matrix for "A".
This doesn't seem right though. Is training supposed by separate from recognizing any character? What I want to happen is as follows.
1) Training the network for an image file that I select (after processing the image into logical arrays).
2) Use this trained network to recognize letter in a different image file.
So train the network to recognize A through Z. Then pick an image, run the network to see what letters are recognized from the picked image.
Okay, so it seems that the question here seems to be more along the lines of "How do I neural networks" I can outline the basic procedure here to try to solidify the idea in your mind, but as far as actually implementing it goes you're on your own. Personally I believe that proprietary languages (MATLAB) are an abomination, but I always appreciate intellectual zeal.
The basic concept of a neural net is that you have a series of nodes in layers with weights that connect them (depending on what you want to do you can either just connect each node to the layer above and beneath, or connect every node, or anywhere in betweeen.). Each node has a "work function" or a probabilistic function that represents the chance that the given node, or neuron will evaluate to "on" or 1.
The general workflow starts from whatever top layer neurons/nodes you've got, initializing them to the values of your data (in your case, you would probably start each of these off as the pixel values in your image, normalized to be binary would be simplest). Each of those nodes would then be multiplied by a weight and fed down towards your second layer, which would be considered a "hidden layer" depending on the sum (either geometric or arithmetic sum, depending on your implementation) which would be used with the work function to determine the state of your hidden layer.
That last point was a little theoretical and hard to follow, so here's an example. Imagine your first row has three nodes ([1,0,1]), and the weights connecting the three of those nodes to the first node in your second layer are something like ([0.5, 2.0, 0.6]). If you're doing an arithmetic sum that means that the weighting on the first node in your "hidden layer" would be
1*0.5 + 0*2.0 + 1*0.6 = 1.1
If you're using a logistic function as your work function (a very common choice, though tanh is also common) this would make the chance of that node evaluating to 1 approximately 75%.
You would probably want your final layer to have 26 nodes, one for each letter, but you could add in more hidden layers to improve your model. You would assume that the letter your model predicted would be the final node with the largest weighting heading in.
After you have that up and running you want to train it though, because you probably just randomly seeded your weights, which makes sense. There are a lot of different methods for this, but I'll generally outline back-propagation which is a very common method of training neural nets. The idea is essentially, since you know which character the image should have been recognized, you compare the result to the one that your model actually predicted. If your model accurately predicted the character you're fine, you can leave the model as is, since it worked. If you predicted an incorrect character you want to go back through your neural net and increment the weights that lead from the pixel nodes you fed in to the ending node that is the character that should have been predicted. You should also decrement the weights that led to the character it incorrectly returned.
Hope that helps, let me know if you have any more questions.

Resources