Does validation accuracy/loss impact training in caffe?

I have a simple question about the validation set in caffe: does the validation set have any impact on training? I know the validation set is used to check that the network isn't overfitting, and as I understand it, it has no impact on weight updates. But does it have some kind of impact on selecting or modifying hyper-parameters, or is it just there for the user to see and estimate how well the network has learned?

No, the results of the validation set are not used by the neural network during training to adjust any hyperparameters. Using the validation set during training is the same as applying the network at some point in time to predict values for the validation set, and then scoring how well it did.
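For instance, with the pycaffe interface you can see this separation directly: stepping the solver updates weights from the training net only, while the validation (test) net is just forwarded and scored. This is a rough sketch; the solver file path, the number of validation batches, and the 'accuracy'/'loss' blob names are assumptions about your particular network definition.

    import caffe

    caffe.set_mode_cpu()
    solver = caffe.SGDSolver('solver.prototxt')   # hypothetical solver definition

    solver.step(500)   # 500 weight updates, driven by the training net only

    # Scoring the validation net is a pure forward pass; no weights change here.
    n_val_batches = 50                            # assumed number of validation batches
    acc = loss = 0.0
    for _ in range(n_val_batches):
        solver.test_nets[0].forward()
        acc  += float(solver.test_nets[0].blobs['accuracy'].data)
        loss += float(solver.test_nets[0].blobs['loss'].data)
    print('validation accuracy:', acc / n_val_batches,
          'validation loss:', loss / n_val_batches)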
You might decide that you want to run the same network training procedure many times over using different values for the hyperparameters. In its fully exhaustive form, that would mean doing a grid search over the hyperparameter space, with a separate training session for each combination. In practice, a fully exhaustive grid search is not a great idea with neural networks because the number of combinations can be extremely large.
Often with neural networks you can tune one parameter at a time until they each seem "about right". Of course this might not get you the absolute best result, but it's not a bad first approach.
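A minimal sketch of that one-at-a-time tuning; train_and_validate is a hypothetical helper that trains a fresh network with the given settings and returns its validation accuracy:

    def tune(param_name, candidates, best):
        # try each candidate value while keeping the other settings fixed
        best_score = float('-inf')
        for value in candidates:
            trial = dict(best, **{param_name: value})
            score = train_and_validate(**trial)   # hypothetical: returns validation accuracy
            if score > best_score:
                best_score, best = score, trial
        return best

    settings = {'lr': 0.01, 'hidden': 32}
    settings = tune('lr', [0.1, 0.01, 0.001], settings)      # tune the learning rate first
    settings = tune('hidden', [16, 32, 64, 128], settings)   # then the hidden-layer width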

Related

What is the difference between test and validation specifically in Mask-R-CNN?

I have my own image dataset and use Mask R-CNN for training. There you divide your dataset into train, validation and test.
I want to know the difference between validation and test.
I know that validation in general is used to see the quality of the NN after each epoch. Based on that you can see how good the NN is and whether overfitting is happening.
But I want to know if the NN learns based on the validation set.
Based on the training set the NN learns after each image and adjusts each neuron to reduce the loss. And after the NN is finished learning, we use the test set to see how good our NN really is on new, unseen images.
But what exactly happens in Mask R-CNN based on the validation set? Is the validation set only there for seeing the results? Or will some parameters be adjusted based on the validation result to avoid overfitting? And even if this is the case, how much influence does the validation set have on the parameters? Will the neurons themselves be adjusted or not?
If the influence is very, very small, then I will choose the validation set equal to the test set, because I don't have many images (800).
So basically I want to know the difference between test and validation in Mask R-CNN, that is, how and how much the validation set influences the NN.
The model does not learn from the validation set. The validation set is just used to give an approximation of the generalization error at any epoch, but also, crucially, for hyperparameter optimization. So I can iterate over several different hyperparameter configurations and evaluate the accuracy of each on the validation set.
After we choose the best model based on the validation set accuracies, we can calculate the test error on the test set. Ideally there is not a large difference between test set and validation set accuracies. Sometimes your model can essentially 'overfit' to the validation set if you iterate over lots of different hyperparameters.
Reserving another set, the test set, to evaluate on after this validation set evaluation is a luxury you may have if you have a lot of data. Often you may not have enough labelled data for it even to be worth holding back a separate test set.
Lastly, these things are not specific to Mask R-CNN. Validation sets never affect the training of a model, i.e. the weights or biases. Validation sets, like test sets, are purely for evaluation purposes.
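A sketch of that workflow, using hypothetical train(config, data) and evaluate(model, data) helpers and pre-made train/val/test splits; the validation set only ranks the candidate models, and the test set is touched exactly once at the end:

    from itertools import product

    configs = [dict(lr=lr, layers=n) for lr, n in product([0.01, 0.001], [2, 3, 4])]

    best_model, best_val_acc = None, float('-inf')
    for cfg in configs:
        model = train(cfg, train_set)          # weights come only from train_set
        val_acc = evaluate(model, val_set)     # val_set just scores the candidate
        if val_acc > best_val_acc:
            best_model, best_val_acc = model, val_acc

    test_acc = evaluate(best_model, test_set)  # final, one-off generalization estimate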

Comparing two neural networks (nntool in Matlab)

I'm new to the Neural Network Toolbox (nntool) in Matlab. I have trained two networks using the same data set. One of these networks contains a higher number of neurons than the other one.
Now I'm wondering: how can I compare these networks? How can I say network A is better than network B?
Is it all about the number of correctly classified patterns in my test set? Let's say both networks were shown the same test set and network A classified more patterns correctly. Can I say network A is (in general) better than network B?
Or should I also look at the performance according to my performance function?
Are there any other measures for comparing two networks trained with different parameters?
That mainly depends on what your concern is. As I see it, in most cases analyzing the predicted labels, or the accuracy of the nets, can lead to a good decision, especially when your networks have shallow architectures. However, there are some secondary issues that may become more important when you decide to look at the nets more broadly.
For example, in the training phase, adding even one hidden unit to the first hidden layer adds d (the dimension of the input layer) free parameters (weights) to your model that have to be estimated (see the quick tally sketched below). On the other hand, the more free parameters your model has, the more training data is required to come up with a reliable model. Therefore, bigger networks are acceptable as long as you have enough data to compensate for the added free parameters. As a rule of thumb, adding more free parameters increases the chance of over-fitting, which has been a vital problem in deep neural networks, and many efforts have been made to resolve it.
Another issue, which is less important in shallow nets, is the computational cost imposed by extra hidden nodes. Since we are taking the broader view, it is worth mentioning. When your network goes deeper, this computational cost becomes more challenging. The computational cost of the training phase is also an important issue when you use back-propagation to update the parameters.
One other thing that you mainly see in deep neural networks is the memory requirement. As the number of layers or neurons increases, the number of free parameters grows dramatically, such that in deep networks you may see millions of parameters. Clearly, loading this number of parameters requires sufficient hardware.
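As a quick illustration of the free-parameter point above (a sketch, not nntool-specific): in a fully connected network, a layer with n_in inputs and n_out units contributes (n_in + 1) * n_out weights and biases, so the counts can be tallied like this:

    def count_parameters(layer_sizes):
        # weights plus one bias per unit, for each consecutive pair of layers
        return sum((n_in + 1) * n_out
                   for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

    print(count_parameters([10, 5, 2]))   # 10-5-2 network: 67 free parameters
    print(count_parameters([10, 6, 2]))   # one extra hidden unit: 80, i.e. d + 1 + 2 = 13 more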
hope it helps.

Is a validation set necessary in neural networks, when cross-validation alone works well in regression-based models?

Do we need the validation in neural networks because neural networks do not always converge to the same answer?
I have never heard of a validation set in models such as regression or ensemble learning. We cross-validate the entire dataset, dividing it into k-fold train and test sets. However, for neural networks we also need a validation set that we extract from the training set. Now I know why we need the validation set in neural networks. What I need to know is why we don't do the same procedure in, let's say, logistic regression.
There is a somewhat good discussion on the purpose of the training, testing and validation sets here.
As for the need for a testing set: if you are not modifying any parameters of your model, there probably isn't much of a need for a second set to test on. A neural network has a large number of parameters that can be adjusted (hidden layers, number of neurons, training runs, epochs, momentum, learning rate, etc.), and these can be influenced by the results on the validation set. The additional testing set can then be used to confirm that the model generalises well on unseen data after it has been tuned (no further alterations should occur once the testing set has been evaluated).
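For contrast, the plain k-fold workflow typically used for a model like logistic regression looks roughly like this sketch (assuming scikit-learn is available; the data is just a placeholder). Every fold serves once as the held-out evaluation split, and no separate validation set is carved out:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X = np.random.rand(500, 10)           # placeholder features
    y = np.random.randint(0, 2, 500)      # placeholder binary labels

    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
    print('mean accuracy over 5 folds:', scores.mean())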
I have also used a testing set for ensemble configurations in the past where the model had some adjustable parameters (number of ensemble members, combination parameters) and this also verified its ability to estimate unseen testing data after tuning.
Hope this helps!

Neural Network settings for fast training

I am creating a tool for predicting the time and cost of software projects based on past data. The tool uses a neural network to do this and so far, the results are promising, but I think I can do a lot more optimisation just by changing the properties of the network. There don't seem to be any rules or even many best-practices when it comes to these settings so if anyone with experience could help me I would greatly appreciate it.
The input data is made up of a series of integers that could go up as high as the user wants to go, but most will be under 100,000 I would have thought. Some will be as low as 1. They are details like number of people on a project and the cost of a project, as well as details about database entities and use cases.
There are 10 inputs in total and 2 outputs (the time and cost). I am using Resilient Propagation to train the network. Currently it has: 10 input nodes, 1 hidden layer with 5 nodes and 2 output nodes. I am training to get under a 5% error rate.
The algorithm must run on a webserver so I have put in a measure to stop training when it looks like it isn't going anywhere. This is set to 10,000 training iterations.
Currently, when I try to train it with some data that is a bit varied, but well within the limits of what we expect users to put into it, it takes a long time to train, hitting the 10,000 iteration limit over and over again.
This is the first time I have used a neural network and I don't really know what to expect. If you could give me some hints on what sort of settings I should be using for the network and for the iteration limit I would greatly appreciate it.
Thank you!
First of all, thanks for providing so much information about your network! Here are a few pointers that should give you a clearer picture.
You need to normalize your inputs. If one node sees a mean value of 100,000 and another just 0.5, you won't see an equal impact from the two inputs, which is why you'll need to scale them to a common range (see the short normalization sketch after these pointers).
Only 5 hidden neurons for 10 input nodes? I remember reading somewhere that you need at least double the number of inputs; try 20+ hidden neurons. This gives the network the capacity to model more complex relationships. However, with too many neurons your network will just memorize the training data set.
Resilient backpropagation is fine. Just remember that there are other training algorithms out there like Levenberg-Marquardt.
How many training sets do you have? Neural networks usually need a large dataset to be good at making useful predictions.
Consider adding a momentum factor to your weight-training algorithm to speed things up if you haven't done so already.
Online training tends to be better for making generalized predictions than batch training. The former updates the weights after each individual training example, while the latter updates them only after a full pass through the training data. It's your call.
Is your data discrete or continuous? Neural networks tend to do a better job with 0s and 1s than with continuous functions. If it is the former, I'd recommend using the sigmoid activation function. A combination of tanh and linear activation functions for the hidden and output layers tends to do a good job with continuously varying data.
Do you need another hidden layer? It may help if your network is dealing with a complex input-output mapping.
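Following up on the first pointer, a minimal normalization sketch in NumPy; the per-column ranges are computed from the training data and must be reused unchanged when scaling any new project's inputs:

    import numpy as np

    X_train = np.random.rand(50, 10) * 100000   # placeholder: 50 past projects, 10 raw inputs

    col_min = X_train.min(axis=0)
    col_max = X_train.max(axis=0)

    def normalize(X):
        # min-max scale each input column to [0, 1] using the training ranges
        return (X - col_min) / (col_max - col_min)

    X_train_scaled = normalize(X_train)
    # apply normalize() to every new project's inputs before feeding the network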

Artificial Neural Network Question

Generally speaking what do you get out of extending an artificial neural net by adding more nodes to a hidden layer or more hidden layers?
Does it allow for more precision in the mapping, or does it allow for more subtlety in the relationships it can identify, or something else?
There's a very well known result in machine learning that states that a single hidden layer is enough to approximate any smooth, bounded function (the paper was called "Multilayer feedforward networks are universal approximators" and it's now almost 20 years old). There are several things to note, however.
The single hidden layer may need to be arbitrarily wide.
This says nothing about the ease with which an approximation may be found; in general, large networks are hard to train properly and fall victim to overfitting quite frequently (the exception is so-called "convolutional neural networks", which are really only meant for vision problems).
This also says nothing about the efficiency of the representation. Some functions require an exponential number of hidden units if done with one layer, but scale much more nicely with more layers (for more discussion of this, read Scaling Learning Algorithms Towards AI).
The problem with deep neural networks is that they're even harder to train. You end up with very very small gradients being backpropagated to the earlier hidden layers and the learning not really going anywhere, especially if weights are initialized to be small (if you initialize them to be of larger magnitude you frequently get stuck in bad local minima). There are some techniques for "pre-training" like the ones discussed in this Google tech talk by Geoff Hinton which attempt to get around this.
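To make the single-hidden-layer point concrete, here is a minimal NumPy sketch that fits sin(x) with one tanh hidden layer trained by plain full-batch gradient descent; widening the hidden layer (H) improves the fit, which is the "arbitrarily wide" caveat in practice:

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1)
    y = np.sin(X)

    H = 20                                # hidden width; try 2 vs 20 to see the difference
    W1 = rng.normal(0.0, 0.5, (1, H)); b1 = np.zeros(H)
    W2 = rng.normal(0.0, 0.5, (H, 1)); b2 = np.zeros(1)
    lr = 0.05

    for step in range(5000):
        h = np.tanh(X @ W1 + b1)          # hidden layer
        pred = h @ W2 + b2                # linear output
        err = pred - y

        # gradients of the mean squared error
        g_pred = 2.0 * err / len(X)
        g_W2, g_b2 = h.T @ g_pred, g_pred.sum(axis=0)
        g_h = g_pred @ W2.T
        g_pre = g_h * (1.0 - h ** 2)      # derivative of tanh
        g_W1, g_b1 = X.T @ g_pre, g_pre.sum(axis=0)

        W1 -= lr * g_W1; b1 -= lr * g_b1
        W2 -= lr * g_W2; b2 -= lr * g_b2

    print('final MSE:', float(np.mean(err ** 2)))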
This is a very interesting question, but it's not so easy to answer. It depends on the problem you are trying to solve and what kind of neural network you are using. There are several types of neural networks.
In general, it's not so clear that more nodes equal more precision. Research shows that in most cases you only need one hidden layer. The number of nodes should be the minimum number required to solve the problem. If you don't have enough of them, you will not reach a solution.
On the other hand, once you have reached the number of nodes that is sufficient to solve the problem, you can add more and more of them and you will not see any further progress in the results.
That's why there are so many types of neural networks. They try to solve different kinds of problems. So you have NNs for static problems, for time-related problems, and so on. The number of nodes is not as important as their design.
When you add a hidden layer, you are creating combined features of the input. So, is the problem better tackled by more features of the existing input, or through higher-order features that come from combining existing features? This is the trade-off for a standard feed-forward network.
You have a theoretical reassurance that any function can be represented by a neural network with two hidden layers and non-linear activation.
Also, consider using additional resources for boosting, instead of adding more nodes, if you're not certain of the appropriate topology.
Very rough rules of thumb
generally more elements per layer for bigger input vectors.
more layers may let you model more non-linear systems.
If the kind of network you are using has delays in propagation, more layers may allow modelling of time series. Take care to have time jitter in the delays or it won't work very well. If this is just gobbledegook to you, ignore it.
More layers let you insert recurrent features. This can be very useful for discrimination tasks. Your ANN implementation may not permit this.
HTH
The number of units per hidden layer accounts for the ANN's potential to describe an arbitrarily complex function. Some (complicated) functions may require many hidden nodes, or possibly more than one hidden layer.
When a function can be roughly approximated by a certain number of hidden units, any extra nodes will provide more accuracy, but this is only true if there are enough training samples to justify the addition; otherwise what happens is "overconvergence". Overconvergence means that your ANN has lost its generalization ability because it has over-emphasized the particular samples.
In general, it is best to use the fewest hidden units possible, as long as the resulting network can give good results. The additional training patterns required to justify more hidden nodes cannot be found easily in most cases, and accuracy is not the NNs' strong point.
