Why should we compute the image mean when we train CNNs? - image

When I use caffe for image classification, it often computes the image mean. Why is that the case?
Someone said that it can improve the accuracy, but I don't understand why this should be the case.

Refer to image whitening technique in Deep learning. Actually it has been proved that it improve the accuracy but not widely used.
To understand why it helps refer to the idea of normalizing data before applying machine learning method. which helps to keep the data in the same range. Actually there is another method now used in CNN which is Batch normalization.

Neural networks (including CNNs) are models with thousands of parameters which we try to optimize with gradient descent. Those models are able to fit a lot of different functions by having a non-linearity φ at their nodes. Without a non-linear activation function, the network collapses to a linear function in total. This means we need the non-linearity for most interesting problems.
Common choices for φ are the logistic function, tanh or ReLU. All of them have the most interesting region around 0. This is where the gradient either is big enough to learn quickly or where a non-linearity is at all in case of ReLU. Weight initialization schemes like Glorot initialization try to make the network start at a good point for the optimization. Other techniques like Batch Normalization also keep the mean of the nodes input around 0.
So you compute (and subtract) the mean of the image so that the first computing nodes get data which "behaves well". It has a mean of 0 and thus the intuition is that this helps the optimization process.
In theory, a network can be able to "subtract" the mean by itself. So if you train long enough, this should not matter too much. However, depending on the activation function "long enough" can be important.

Related

How to test if my implementation of back propagation neural Network is correct

I am working on an implementation of the back propagation algorithm. What I have implemented so far seems working but I can't be sure that the algorithm is well implemented, here is what I have noticed during training test of my network :
Specification of the implementation :
A data set containing almost 100000 raw containing (3 variable as input, the sinus of the sum of those three variables as expected output).
The network does have 7 layers all the layers use the Sigmoid activation function
When I run the back propagation training process:
The minimum of costs of the error is found at the fourth iteration (The minimum cost of error is 140, is it normal? I was expecting much less than that)
After the fourth Iteration the costs of the error start increasing (I don't know if it is normal or not?)
The short answer would be "no, very likely your implementation is incorrect". Your network is not training as can be observed by the very high cost of error. As discussed in comments, your network suffers very heavily from vanishing gradient problem, which is inevitable in deep networks. In essence, the first layers of you network learn much slower than the later. All neurons get some random weights at the beginning, right? Since the first layer almost doesn't learn anything, the large initial error propagates through the whole network!
How to fix it? From the description of your problem it seems that a feedforward network with just a single hidden layer in should be able to do the trick (as proven in universal approximation theorem).
Check e.g. free online book by Michael Nielsen if you'd like to learn more.
so I do understand from that the back propagation can't deal with deep neural networks? or is there some method to prevent this problem?
It can, but it's by no mean a trivial challenge. Deep neural networks have been used since 60', but only in 90' researchers came up with methods how to deal with them efficiently. I recommend reading "Efficient BackProp" chapter (by Y.A. LeCun et al.) of "Neural Networks: Tricks of the Trade".
Here is the summary:
Shuffle the examples
Center the input variables by subtracting the mean
Normalize the input variable to a standard deviation of 1
If possible, decorrelate the input variables.
Pick a network with the sigmoid function f(x)=1.7159*(tanh(2/3x): it won't saturate at +1 / -1, but instead will have highest gain at these points (second derivative is at max.)
Set the target values within the range of the sigmoid, typically +1 and -1.
The weights should be randomly drawn from a distribution with mean zero and a standard deviation given by m^(-1/2), where m is the number of inputs to the unit
The preferred method for training the network should be picked as follows:
If the training set is large (more than a few hundred samples) and redundant, and if the task is classification, use stochastic gradient with careful tuning, or use the stochastic diagonal Levenberg Marquardt method.
If the training set is not too large, or if the task is regression, use conjugate gradient.
Also, some my general remarks:
Watch for numerical stability if you implement it yourself. It's easy to get into troubles.
Think of the architecture. Fully-connected multi-layer networks are rarely a smart idea. Unfortunately ANN are poorly understood from theoretical point of view and one of the best things you can do is just check what worked for others and learn useful patterns (with regularization, pooling and dropout layers and such).

Layers and Neurons of a Neural Network

I would like to know a bit more about Neural Network, I'm developing a C++ program to make a NN but I'm stuck with the BackPropagation algorithm, sorry for not offering some working code.
I know that there are so many libraries for creating a NN in many languages, but I prefer to make one from my self. The point is that I don't know how many layers and how many neurons should be necessary for achieving a particular goal such as pattern recognition, or functions approximations, or whatever.
My questions are: if I'd like to recognize some particulars patterns, like in image detection, how many layers and neurons-per-layer should be necessary? Let's say my images are all 8x8 pixels, I would start naturally with an input layer of 64 neurons, but I don't have any idea of how many neurons I have to put in hidden layers, and also in output layer. Let's say I have to distinguish from cats and dogs, or whatever you may think, how could be the output layer? I can imagine an output layer with only-one neuron outputting a value between 0 and 1 with the classical logistic function (1/(1+exp(-x)) and when it is near 0 the input was a cat and when approaches 1 it was a dog, but ... is it correct? What if I add a new pattern like a fish? and what if the input contains a dog and a cat ( ..and a fish)? This make me thinking that the logistic function in the output layer is not very suitable for pattern recognition like this, only because 1/(1+exp(-x)) has a range in (0,1). Do I have to change the activation function or maybe add some other neurons to the output layer? Are there some other activations function more accurate to do this? Do every neurons in every layers have the same activation function, or it is different from layer to layer?
Sorry for all of this questions, but this topic is not very clear to me.
I read a lot around internet, and I found libraries all-yet-implemented and hard to read from, and many explanations to what a NN can do, but not how it can do.
I read a lot from https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/ and http://neuralnetworksanddeeplearning.com/chap1.html, and here I understood how to approximate a function (because every neurons in a layer can be thought as a step-function with a particular step for weights and bias) and how back-propagation algorithm works, but other tutorials and similars were more focused on preexisting libraries. I also read this question Determining the proper amount of Neurons for a Neural Network but I would like to involve also the activation functions of a NN, which is the best and for what is the best.
Thanks in advance for your answers!
Your questions are quite general, so I can only give some general recommendations:
The number of layers you need depends on the complexity of the problem you want to solve. The more calculation is required to obtain an output from a given input, the more layers you need.
Only very simple problems can be solved with a single layer network. These are called linearly separable and are usually trivial. With two layers it gets better and with three layers, at least in theory, all kinds of classification tasks can be performed if you have enough cells within the layers. In practice, however it is often better to add a 4th or 5th layer to the network while reducing the number of cells within a single layer.
Be aware that the standard backpropagation algorithm performs badly with more than 4 or 5 layers. If you need more layers, have a look at Deep Learning.
The numbers of cells within each layer mainly depends on the number of inputs and, if you solve a classification task, the number of classes you want to detect. In practice it is quite common to reduce the number of cells from layer to layer, but there are exceptions.
Concerning your question about the output function: In most cases you should stick with one type of sigmoid function. The case you describe is not really an issue because you could add another output cell for your "fish" class. The choice of a specific activation function is not that critical. Basically you use one whose values and derivative can be calculated efficiently.
#Frank Puffer has already provided some nice information, but let me add my two cents. First off, much of what you're asking is in the area of hyperparameter optimization. Although there are various "rules of thumb", the reality is that determining the optimal architecture (number/size of layers, connectivity structure, etc.) and other parameters like the learning rate typically requires extensive experimentation. The good news is that the parameterization of these hyperparameters is among the simplest aspects of the implementation of a neural network. So I would recommend focusing on building your software such that the number of layers, size of layers, learning rate, etc., are all easily configurable.
Now you specifically asked about detecting patterns in an image. It's worth mentioning that using standard multi-layer perceptrons (MLPs) to perform classification on raw image data can be computationally expensive, especially for larger images. It's common to use architectures that are designed to extract useful, spacially-local features (i.e.: Convolutional Neural Networks or CNNs).
You could still use standard MLPs for this, but the computational complexity can make it an untenable solution. The sparse connectivity of CNNs for example dramatically reduce the number of parameters requiring optimization and simultaneously build a conceptual hierarchy of representations better suited for classification of images.
Regardless, I would recommend implementing backpropagation using stochastic gradient descent for optimization. This is still the approach typically used for training neural nets, CNNs, RNNs, etc.
Regarding the number of output neurons, this is one question that does have a simple answer: use "one-hot" encoding. For each class you want to recognize, you have an output neuron. In your example of the dog, cat, and fish classes, you have three neurons. For an input image representing a dog, you would expect a value of 1 for the "dog" neuron, and 0 for all the others. Then, during inference, you can interpret the output as a probability distribution reflecting the confidence of the NN. For example, if you get output dog:0.70, cat:0.25, fish:0.05, then you have a 70% confidence that the image is a dog, and so on.
For activation functions, the most recent research I've seen seems to indicate that Rectified Linear Units are generally a good choice since they're easy to differentiate and compute, and they avoid a problem that plagues deeper networks called the "vanishing gradient problem".
Best of luck!

Edge detection : Any performance evaluation technique?

I am working on edge detection in images and would like to evaluate the performance of algorithm, if any any one could give me a reference or method on how to proceed it will be really helpful. :)
I do not have ground truth and data set includes color as well as gray images.
Thank you.
Create a synthetic data set with known edges, for example by 3D rendering, by compositing 2D images with precise masks (as may be obtained in royalty free photosets), or by introducing edges directly (thin/faint lines). Remember to add some confounding non-edges that look like edges, of a type appropriate for what you're tuning for.
Use your (non-synthetic) data set. Run the reference algorithms that you want to compare against. Also produce combinations of the reference algorithms, for example by voting (majority, at least K out of N, etc). Calculate stats on your algo vs reference algo performance, in terms of (a) number of points your algo classifies as edge which each reference algo, or the combination, does not classify as edge (false positive), or (b) number of points which the reference algo classifies as edge that your algo does not (false negative). You can also calculate a rank correlation-type number for algos by looking at each point and looking at which algos do (or don't) classify that as an edge.
Create ground truth manually. Use reference edge-finding algos as a starting point, then fix up by hand. Probably valuable to do for a small number of images in any case.
Good luck!
For comparisons, quantitative measures like what #Alex I explained is best. To do so, you need to define what is "correct" with a ground truth set and a way to consistently determine if a given image is correct or on a more granular level, how correct (some number like a percentage) it is. #Alex I gave a way to do that.
Another option that is often used in graphics research where there is no ground truth is user studies. Usually less desirable as they are time consuming and often more costly. However, if it is a qualitative improvement that you are after or if a quantitative measurement is just too hard to do, a user study is an appropriate solution.
When I mean user study I mean to poll people on how well a result is given the input image. You could give them a scale to rate things on and randomly give them samples from both your results and the results of another algorithm
And of course, if you still want more ideas, be sure to check out edge detection papers to see how they measured their results (I'd actually look here first as they've already gone through this same process and determined what was best for them: google scholar).

Continuous vs Discrete artificial neural networks

I realize that this is probably a very niche question, but has anyone had experience with working with continuous neural networks? I'm specifically interested in what a continuous neural network may be useful for vs what you normally use discrete neural networks for.
For clarity I will clear up what I mean by continuous neural network as I suppose it can be interpreted to mean different things. I do not mean that the activation function is continuous. Rather I allude to the idea of a increasing the number of neurons in the hidden layer to an infinite amount.
So for clarity, here is the architecture of your typical discreet NN:
(source: garamatt at sites.google.com)
The x are the input, the g is the activation of the hidden layer, the v are the weights of the hidden layer, the w are the weights of the output layer, the b is the bias and apparently the output layer has a linear activation (namely none.)
The difference between a discrete NN and a continuous NN is depicted by this figure:
(source: garamatt at sites.google.com)
That is you let the number of hidden neurons become infinite so that your final output is an integral. In practice this means that instead of computing a deterministic sum you instead must approximate the corresponding integral with quadrature.
Apparently its a common misconception with neural networks that too many hidden neurons produces over-fitting.
My question is specifically, given this definition of discrete and continuous neural networks, I was wondering if anyone had experience working with the latter and what sort of things they used them for.
Further description on the topic can be found here:
http://www.iro.umontreal.ca/~lisa/seminaires/18-04-2006.pdf
I think this is either only of interest to theoreticians trying to prove that no function is beyond the approximation power of the NN architecture, or it may be a proposition on a method of constructing a piecewise linear approximation (via backpropagation) of a function. If it's the latter, I think there are existing methods that are much faster, less susceptible to local minima, and less prone to overfitting than backpropagation.
My understanding of NN is that the connections and neurons contain a compressed representation of the data it's trained on. The key is that you have a large dataset that requires more memory than the "general lesson" that is salient throughout each example. The NN is supposedly the economical container that will distill this general lesson from that huge corpus.
If your NN has enough hidden units to densely sample the original function, this is equivalent to saying your NN is large enough to memorize the training corpus (as opposed to generalizing from it). Think of the training corpus as also a sample of the original function at a given resolution. If the NN has enough neurons to sample the function at an even higher resolution than your training corpus, then there is simply no pressure for the system to generalize because it's not constrained by the number of neurons to do so.
Since no generalization is induced nor required, you might as well just memorize the corpus by storing all of your training data in memory and use k-nearest neighbor, which will always perform better than any NN, and will always perform as well as any NN even as the NN's sampling resolution approaches infinity.
The term hasn't quite caught on in the machine learning literature, which explains all the confusion. It seems like this was a one off paper, an interesting one at that, but it hasn't really led to anything, which may mean several things; the author may have simply lost interest.
I know that Bayesian neural networks (with countably many hidden units, the 'continuous neural networks' paper extends to the uncountable case) were successfully employed by Radford Neal (see his thesis all about this stuff) to win the NIPS 2003 Feature Selection Challenge using Bayesian neural networks.
In the past I've worked on a few research projects using continuous NN's. Activation was done using a bipolar hyperbolic tan, the network took several hundred floating point inputs and output around one hundred floating point values.
In this particular case the aim of the network was to learn the dynamic equations of a mineral train. The network was given the current state of the train and predicted speed, inter-wagon dynamics and other train behaviour 50 seconds into the future.
The rationale for this particular project was mainly about performance. This was being targeted for an embedded device and evaluating the NN was much more performance friendly then solving a traditional ODE (ordinary differential equation) system.
In general a continuous NN should be able to learn any kind of function. This is particularly useful when its impossible/extremely difficult to solve a system using deterministic methods. As opposed to binary networks which are often used for pattern recognition/classification purposes.
Given their non-deterministic nature NN's of any kind are touchy beasts, choosing the right kinds of inputs/network architecture can be somewhat a black art.
Feed forward neural networks are always "continuous" -- it's the only way that backpropagation learning actually works (you can't backpropagate through a discrete/step function because it's non-differentiable at the bias threshold).
You might have a discrete (e.g. "one-hot") encoding of the input or target output, but all of the computation is continuous-valued. The output may be constrained (i.e. with a softmax output layer such that the outputs always sum to one, as is common in a classification setting) but again, still continuous.
If you mean a network that predicts a continuous, unconstrained target -- think of any prediction problem where the "correct answer" isn't discrete, and a linear regression model won't suffice. Recurrent neural networks have at various times been a fashionable method for various financial prediction applications, for example.
Continuous neural networks are not known to be universal approximators (in the sense of density in $L^p$ or $C(\mathbb{R})$ for the topology of uniform convergence on compacts, i.e.: as in the universal approximation theorem) but only universal interpolators in the sense of this paper:
https://arxiv.org/abs/1908.07838

Artificial Neural Network Question

Generally speaking what do you get out of extending an artificial neural net by adding more nodes to a hidden layer or more hidden layers?
Does it allow for more precision in the mapping, or does it allow for more subtlety in the relationships it can identify, or something else?
There's a very well known result in machine learning that states that a single hidden layer is enough to approximate any smooth, bounded function (the paper was called "Multilayer feedforward networks are universal approximators" and it's now almost 20 years old). There are several things to note, however.
The single hidden layer may need to be arbitrarily wide.
This says nothing about the ease with which an approximation may be found; in general large networks are hard to train properly and fall victim to overfitting quite frequently (the exception are so-called "convolutional neural networks" which really are only meant for vision problems).
This also says nothing about the efficiency of the representation. Some functions require exponential numbers of hidden units if done with one layer but scale much more nicely with more layers (for more discussion of this read Scaling Learning Algorithms Towards AI)
The problem with deep neural networks is that they're even harder to train. You end up with very very small gradients being backpropagated to the earlier hidden layers and the learning not really going anywhere, especially if weights are initialized to be small (if you initialize them to be of larger magnitude you frequently get stuck in bad local minima). There are some techniques for "pre-training" like the ones discussed in this Google tech talk by Geoff Hinton which attempt to get around this.
This is very interesting question but it's not so easy to answer. It depends on the problem you try to resolve and what neural network you try to use. There are several neural network types.
I general it's not so clear that more nodes equals more precision. Research show that you need mostly only one hidden layer. The numer of nodes should be the minimal numer of nodes that are required to resolve a problem. If you don't have enough of them - you will not reach solution.
From the other hand - if you have reached the number of nodes that is good to resolve solution - you can add more and more of them and you will not see any further progress in result estimation.
That's why there are so many types of neural networks. They try to resolve different types of problems. So you have NN to resolve static problems, to resolve time related problems and so one. The number of nodes is not so important like the design of them.
When you have a hidden layer is that you are creating a combined feature of the input. So, is the problem better tackled by more features of the existing input, or through higher-order features that come from combining existing features? This is the trade-off for a standard feed-forward network.
You have a theoretical reassurance that any function can be represented by a neural network with two hidden layers and non-linear activation.
Also, consider using additional resources for boosting, instead of adding more nodes, if you're not certain of the appropriate topology.
Very rough rules of thumb
generally more elements per layer for bigger input vectors.
more layers may let you model more non-linear systems.
If the kind of network you are using has delays in propagation , more layers may allow modelling of time series . Take care to have time jitter in the delays or it wont work very well. If this is just gobbledegook to you, ignore it.
More layers lets you insert recurrent features. This can be very useful for discrimination tasks. You ANN implementation my not permit this.
HTH
The number of units per hidden layer accounts for the ANN's potential to describe an arbitrarily complex function. Some (complicated) functions may require many hidden nodes, or possibly more than one hidden layer.
When a function can be roughly approximated by a certain number of hidden units, any extra nodes will provide more accuracy...but this is only true if the training samples used are enough to justify this addition - otherwise what will happen is "overconvergence". Overconvergence means that your ANN has lost its generalization abilities because it has overemphasized on the particular samples.
In general it is best to use the less hidden units possible, if the resulting network can give good results. The additional training patterns required to justify more hidden nodes can not be found easily in most cases, and accuracy is not the NNs' strong point.

Resources