Comparing two neural networks (nntool in Matlab) - performance

I'm new to the Neural Network Toolbox (nntool) in Matlab. I have trained two networks on the same data set. One of them has more neurons than the other.
Now I'm wondering: how can I compare these networks? How can I say network A is better than network B?
Is it all about the number of correctly classified patterns in my test set? Let's say both networks were shown the same test set and network A classified more patterns correctly. Can I say network A is (in general) better than network B?
Or should I also look at the performance according to my performance function?
Are there any other measures for comparing two networks trained with different parameters?

That mainly depends on what you care about. In most cases, comparing the predicted labels, or the accuracy of the nets, leads to a good choice, especially when your networks have shallow architectures. However, there are some side issues that become more important when you look at the nets from a wider perspective.
For example, in the training phase, adding even one hidden unit to the first hidden layer adds d input weights (d being the dimension of the input layer), plus a bias and its outgoing weights, as free parameters that must be estimated. On the other hand, the more free parameters your model has, the more training data is required to obtain a reliable model. Therefore, bigger networks are acceptable as long as you have enough data to compensate for the added free parameters. As a rule of thumb, adding free parameters increases the chance of over-fitting, which has been a serious problem in deep neural networks, and many efforts have been made to resolve it.
Another point, less important in shallow nets, is the computational cost imposed by extra hidden nodes. Since we are taking the wider view, it is worth mentioning. As your network gets deeper, this cost becomes more of a challenge. The computational cost of the training phase also matters when you use back-propagation to update the parameters.
One other thing that you mainly see in deep neural networks is the memory requirement. As the number of layers or neurons increases, the number of free parameters grows dramatically, so that deep networks may have millions of parameters. Loading that many parameters clearly requires adequate hardware.
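To make the parameter and memory argument concrete, here is a minimal sketch in Python (not Matlab) that counts the weights and biases of a fully connected feed-forward network and estimates the storage they need. The layer sizes and the 8 bytes per parameter are assumptions chosen for illustration, not values from the question.

```python
def count_parameters(layer_sizes):
    """Number of weights and biases in a fully connected feed-forward net.

    layer_sizes lists the units per layer, input layer first, e.g. [20, 10, 2].
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # weights between two consecutive layers
        total += n_out         # one bias per unit in the receiving layer
    return total


def memory_bytes(layer_sizes, bytes_per_parameter=8):
    """Rough storage for the parameters alone (8 bytes assumes double precision)."""
    return count_parameters(layer_sizes) * bytes_per_parameter


# Hypothetical sizes: d = 20 inputs, 2 outputs; network B has one more hidden unit.
net_a = [20, 10, 2]
net_b = [20, 11, 2]

print(count_parameters(net_a))                            # 232
print(count_parameters(net_b))                            # 255
print(count_parameters(net_b) - count_parameters(net_a))  # 23 = d weights + 1 bias + 2 outgoing weights
print(memory_bytes(net_b))                                # 2040
```

If two networks reach a similar test accuracy, the one with fewer parameters is usually the safer pick for the reasons above.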
hope it helps.

Related

Layers and Neurons of a Neural Network

I would like to know a bit more about neural networks. I'm developing a C++ program to build a NN, but I'm stuck on the backpropagation algorithm; sorry for not offering any working code.
I know that there are many libraries for creating a NN in many languages, but I prefer to write one myself. The point is that I don't know how many layers and how many neurons are necessary for achieving a particular goal such as pattern recognition, function approximation, or whatever.
My questions are: if I'd like to recognize some particular patterns, as in image detection, how many layers and neurons per layer are necessary? Let's say my images are all 8x8 pixels. I would naturally start with an input layer of 64 neurons, but I have no idea how many neurons to put in the hidden layers, or in the output layer. Let's say I have to distinguish cats from dogs, or whatever you like: what should the output layer look like? I can imagine an output layer with a single neuron outputting a value between 0 and 1 using the classical logistic function 1/(1+exp(-x)), where a value near 0 means the input was a cat and a value near 1 means it was a dog, but ... is that correct? What if I add a new pattern like a fish? And what if the input contains a dog and a cat (... and a fish)? This makes me think that the logistic function in the output layer is not very suitable for pattern recognition like this, simply because 1/(1+exp(-x)) has range (0,1). Do I have to change the activation function, or maybe add some other neurons to the output layer? Are there other activation functions better suited to this? Does every neuron in every layer have the same activation function, or does it differ from layer to layer?
Sorry for all these questions, but this topic is not very clear to me.
I have read a lot around the internet, and I found ready-made libraries that are hard to read, and many explanations of what a NN can do, but not of how it does it.
I read a lot from https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/ and http://neuralnetworksanddeeplearning.com/chap1.html, and there I understood how to approximate a function (because every neuron in a layer can be thought of as a step function with a particular step determined by its weights and bias) and how the back-propagation algorithm works, but other tutorials and the like were more focused on preexisting libraries. I also read this question, Determining the proper amount of Neurons for a Neural Network, but I would also like to cover the activation functions of a NN: which is the best, and what is it best for?
Thanks in advance for your answers!
Your questions are quite general, so I can only give some general recommendations:
The number of layers you need depends on the complexity of the problem you want to solve. The more calculation is required to obtain an output from a given input, the more layers you need.
Only very simple problems can be solved with a single-layer network. These are called linearly separable and are usually trivial. With two layers it gets better, and with three layers, at least in theory, all kinds of classification tasks can be performed if you have enough cells within the layers. In practice, however, it is often better to add a 4th or 5th layer to the network while reducing the number of cells within a single layer.
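XOR is the classic example of a problem that is not linearly separable: no single layer of weights can separate its two classes, but one extra layer of two cells can. The sketch below uses hand-picked weights and a step activation purely for illustration; nothing here is trained, and the function names are made up for this example.

```python
def step(z):
    """Threshold activation: 1 if the weighted input is positive, else 0."""
    return 1 if z > 0 else 0


def xor_net(x1, x2):
    # Hidden layer: two cells with hand-picked weights and thresholds.
    h_or = step(x1 + x2 - 0.5)   # fires if at least one input is 1
    h_and = step(x1 + x2 - 1.5)  # fires only if both inputs are 1
    # Output cell: "OR but not AND", which is exactly XOR.
    return step(h_or - h_and - 0.5)


for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_net(a, b))
```

Each hidden cell draws one straight line through the input plane, and the output cell combines the two half-planes; that combination is what a single layer cannot express.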
Be aware that the standard backpropagation algorithm performs badly with more than 4 or 5 layers. If you need more layers, have a look at Deep Learning.
The number of cells within each layer mainly depends on the number of inputs and, if you solve a classification task, the number of classes you want to detect. In practice it is quite common to reduce the number of cells from layer to layer, but there are exceptions.
Concerning your question about the output function: In most cases you should stick with one type of sigmoid function. The case you describe is not really an issue because you could add another output cell for your "fish" class. The choice of a specific activation function is not that critical. Basically you use one whose values and derivative can be calculated efficiently.
@Frank Puffer has already provided some nice information, but let me add my two cents. First off, much of what you're asking is in the area of hyperparameter optimization. Although there are various "rules of thumb", the reality is that determining the optimal architecture (number/size of layers, connectivity structure, etc.) and other parameters like the learning rate typically requires extensive experimentation. The good news is that the parameterization of these hyperparameters is among the simplest aspects of the implementation of a neural network. So I would recommend focusing on building your software such that the number of layers, size of layers, learning rate, etc., are all easily configurable.
Now you specifically asked about detecting patterns in an image. It's worth mentioning that using standard multi-layer perceptrons (MLPs) to perform classification on raw image data can be computationally expensive, especially for larger images. It's common to use architectures that are designed to extract useful, spatially local features (i.e., Convolutional Neural Networks or CNNs).
You could still use standard MLPs for this, but the computational complexity can make it an untenable solution. The sparse connectivity of CNNs, for example, dramatically reduces the number of parameters requiring optimization and simultaneously builds a conceptual hierarchy of representations better suited for classification of images.
Regardless, I would recommend implementing backpropagation using stochastic gradient descent for optimization. This is still the approach typically used for training neural nets, CNNs, RNNs, etc.
Regarding the number of output neurons, this is one question that does have a simple answer: use "one-hot" encoding. For each class you want to recognize, you have an output neuron. In your example of the dog, cat, and fish classes, you have three neurons. For an input image representing a dog, you would expect a value of 1 for the "dog" neuron, and 0 for all the others. Then, during inference, you can interpret the output as a probability distribution reflecting the confidence of the NN, provided the outputs are normalized to sum to 1 (for example with a softmax output layer). For example, if you get output dog:0.70, cat:0.25, fish:0.05, then you have a 70% confidence that the image is a dog, and so on.
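As a rough sketch of the idea (in Python for brevity, although the question is about a C++ implementation): the training target is a one-hot vector, and the raw outputs are normalized so they can be read as confidences. The softmax normalization is my assumption about how the outputs are made to sum to 1, and the raw scores below are invented for the example.

```python
import math

CLASSES = ["dog", "cat", "fish"]


def one_hot(label, classes=CLASSES):
    """Target vector: 1.0 at the position of the true class, 0.0 elsewhere."""
    return [1.0 if c == label else 0.0 for c in classes]


def softmax(scores):
    """Turn raw output-layer scores into values that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(scores, classes=CLASSES):
    """Return the most likely class and the confidence assigned to it."""
    probs = softmax(scores)
    best = max(range(len(classes)), key=lambda i: probs[i])
    return classes[best], probs[best]


print(one_hot("dog"))             # [1.0, 0.0, 0.0]
print(predict([2.0, 1.0, -1.0]))  # made-up raw scores; "dog" has the highest one
```

Adding a fourth class later only means appending a name to the class list and one neuron to the output layer.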
For activation functions, the most recent research I've seen seems to indicate that Rectified Linear Units are generally a good choice since they're easy to differentiate and compute, and they avoid a problem that plagues deeper networks called the "vanishing gradient problem".
Best of luck!

Neural network with categorical variables (enum) as inputs

I'm trying to solve some machine-learning problems using neural networks, mostly with NEAT (NeuroEvolution of Augmenting Topologies).
Some of my input variables are continuous, but some of them are of a categorical nature, like:
Species: {Lion,Leopard,Tiger,Jaguar}
Branches of Trade: {Health care,Insurances,Finance,IT,Advertising}
At first I wanted to model such a variable by mapping the categories to discrete numbers, like:
{Lion:1, Leopard:2, Tiger:3, Jaguar:4}
But I'm afraid this adds some kind of arbitrary topology on the variable. A Tiger is not the sum of a Lion and a Leopard.
What approaches to this problem are usually employed?
Unfortunately there is no good solution; each approach leads to some kind of problem:
1. Your solution adds a topology, as you mentioned. It may not be that bad, as a NN can fit arbitrary functions and represent "ifs", but in many cases it will be (as NNs often fall into local minima).
2. You can encode your data in the form is_categorical_feature_i_equal_j, which won't induce any additional topology, but will grow the number of features: each categorical variable is replaced by one feature per possible value. So instead of "species" you get the features "is_lion", "is_leopard", etc., and only one of them is equal to 1 at a time.
3. If you have a large amount of data compared to the number of possible categorical values (for example, 10000 data points and only 10 possible categorical values), you can also split the problem into 10 independent ones, each trained on one particular value (so we have a "neural network for lions", a "neural network for jaguars", etc.).
The first two approaches are the two "extreme" cases: the first is computationally very cheap but can lead to high bias, while the second introduces a lot of complexity but should not influence the classification process itself. The last one is rarely usable (because it assumes a small number of categorical values), yet quite reasonable in terms of machine learning.
Update
So many things change in 8 years. Solution 2 is definitely the most popular one, and with the growth of compute, the wide adoption of neural networks, and support for sparse inputs, its cost is now negligible.
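A minimal sketch of the indicator-feature encoding (is_lion, is_leopard, ...) in Python, using the species list from the question. The helper names and the two continuous values are made up for the example.

```python
SPECIES = ["Lion", "Leopard", "Tiger", "Jaguar"]


def encode_species(name, categories=SPECIES):
    """One indicator feature per category; exactly one of them is 1.0."""
    if name not in categories:
        raise ValueError(f"unknown category: {name!r}")
    return [1.0 if name == c else 0.0 for c in categories]


def build_input(continuous_values, species):
    """Concatenate the continuous inputs with the encoded categorical one."""
    return list(continuous_values) + encode_species(species)


print(encode_species("Tiger"))           # [0.0, 0.0, 1.0, 0.0]
print(build_input([0.3, 1.7], "Tiger"))  # [0.3, 1.7, 0.0, 0.0, 1.0, 0.0]
```

No category ends up numerically "between" two others, so the network sees no artificial ordering; the price is one extra input per category value.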

Continuous vs Discrete artificial neural networks

I realize that this is probably a very niche question, but has anyone had experience with working with continuous neural networks? I'm specifically interested in what a continuous neural network may be useful for vs what you normally use discrete neural networks for.
For clarity, I will explain what I mean by continuous neural network, as I suppose it can be interpreted to mean different things. I do not mean that the activation function is continuous. Rather, I allude to the idea of increasing the number of neurons in the hidden layer to an infinite amount.
So for clarity, here is the architecture of your typical discrete NN:
(source: garamatt at sites.google.com)
The x are the input, the g is the activation of the hidden layer, the v are the weights of the hidden layer, the w are the weights of the output layer, the b is the bias and apparently the output layer has a linear activation (namely none.)
The difference between a discrete NN and a continuous NN is depicted by this figure:
(source: garamatt at sites.google.com)
That is, you let the number of hidden neurons become infinite so that your final output is an integral. In practice this means that instead of computing a deterministic sum you must approximate the corresponding integral with quadrature.
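Since the two figures are not reproduced here, the following is a sketch of what they presumably express, written only from the symbol description above (x input, g activation, v hidden weights, w output weights, b bias, linear output). The index set E, the continuous index u, and the quadrature nodes u_k with weights c_k are notation I am assuming for illustration.

```latex
% Discrete network with n hidden units
\hat{f}(x) = b + \sum_{j=1}^{n} w_j \, g(v_j \cdot x)

% Continuous network: the hidden-unit index j becomes a continuous variable u
\hat{f}(x) = b + \int_{E} w(u) \, g\big(v(u) \cdot x\big) \, du

% Quadrature with m nodes u_k and weights c_k turns the integral back into a finite sum
\hat{f}(x) \approx b + \sum_{k=1}^{m} c_k \, w(u_k) \, g\big(v(u_k) \cdot x\big)
```

Read this way, the quadrature approximation is again a finite network with m hidden units, whose output weights are c_k w(u_k).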
Apparently it's a common misconception with neural networks that too many hidden neurons produce over-fitting.
My question is specifically, given this definition of discrete and continuous neural networks, I was wondering if anyone had experience working with the latter and what sort of things they used them for.
Further description on the topic can be found here:
http://www.iro.umontreal.ca/~lisa/seminaires/18-04-2006.pdf
I think this is either only of interest to theoreticians trying to prove that no function is beyond the approximation power of the NN architecture, or it may be a proposition on a method of constructing a piecewise linear approximation (via backpropagation) of a function. If it's the latter, I think there are existing methods that are much faster, less susceptible to local minima, and less prone to overfitting than backpropagation.
My understanding of NN is that the connections and neurons contain a compressed representation of the data it's trained on. The key is that you have a large dataset that requires more memory than the "general lesson" that is salient throughout each example. The NN is supposedly the economical container that will distill this general lesson from that huge corpus.
If your NN has enough hidden units to densely sample the original function, this is equivalent to saying your NN is large enough to memorize the training corpus (as opposed to generalizing from it). Think of the training corpus as also a sample of the original function at a given resolution. If the NN has enough neurons to sample the function at an even higher resolution than your training corpus, then there is simply no pressure for the system to generalize because it's not constrained by the number of neurons to do so.
Since no generalization is induced or required, you might as well just memorize the corpus by storing all of your training data in memory and using k-nearest neighbor, which will perform at least as well as such a NN, even as the NN's sampling resolution approaches infinity.
The term hasn't quite caught on in the machine learning literature, which explains all the confusion. It seems like this was a one off paper, an interesting one at that, but it hasn't really led to anything, which may mean several things; the author may have simply lost interest.
I know that Bayesian neural networks (with countably many hidden units; the 'continuous neural networks' paper extends this to the uncountable case) were successfully employed by Radford Neal (see his thesis, which is all about this stuff) to win the NIPS 2003 Feature Selection Challenge.
In the past I've worked on a few research projects using continuous NN's. Activation was done using a bipolar hyperbolic tan, the network took several hundred floating point inputs and output around one hundred floating point values.
In this particular case the aim of the network was to learn the dynamic equations of a mineral train. The network was given the current state of the train and predicted speed, inter-wagon dynamics and other train behaviour 50 seconds into the future.
The rationale for this particular project was mainly about performance. This was being targeted for an embedded device, and evaluating the NN was much more performance friendly than solving a traditional ODE (ordinary differential equation) system.
In general a continuous NN should be able to learn any kind of function. This is particularly useful when it's impossible or extremely difficult to solve a system using deterministic methods, as opposed to binary networks, which are often used for pattern recognition/classification purposes.
Given their non-deterministic nature, NNs of any kind are touchy beasts; choosing the right kinds of inputs and network architecture can be something of a black art.
Feed forward neural networks are always "continuous" -- it's the only way that backpropagation learning actually works (you can't backpropagate through a discrete/step function because it's non-differentiable at the bias threshold).
You might have a discrete (e.g. "one-hot") encoding of the input or target output, but all of the computation is continuous-valued. The output may be constrained (i.e. with a softmax output layer such that the outputs always sum to one, as is common in a classification setting) but again, still continuous.
If you mean a network that predicts a continuous, unconstrained target -- think of any prediction problem where the "correct answer" isn't discrete, and a linear regression model won't suffice. Recurrent neural networks have at various times been a fashionable method for various financial prediction applications, for example.
Continuous neural networks are not known to be universal approximators (in the sense of density in $L^p$ or $C(\mathbb{R})$ for the topology of uniform convergence on compacts, i.e.: as in the universal approximation theorem) but only universal interpolators in the sense of this paper:
https://arxiv.org/abs/1908.07838

Neural Network settings for fast training

I am creating a tool for predicting the time and cost of software projects based on past data. The tool uses a neural network to do this and so far, the results are promising, but I think I can do a lot more optimisation just by changing the properties of the network. There don't seem to be any rules or even many best-practices when it comes to these settings so if anyone with experience could help me I would greatly appreciate it.
The input data is made up of a series of integers that could go up as high as the user wants to go, but most will be under 100,000 I would have thought. Some will be as low as 1. They are details like number of people on a project and the cost of a project, as well as details about database entities and use cases.
There are 10 inputs in total and 2 outputs (the time and cost). I am using Resilient Propagation to train the network. Currently it has: 10 input nodes, 1 hidden layer with 5 nodes and 2 output nodes. I am training to get under a 5% error rate.
The algorithm must run on a webserver so I have put in a measure to stop training when it looks like it isn't going anywhere. This is set to 10,000 training iterations.
Currently, when I try to train it with some data that is a bit varied, but well within the limits of what we expect users to put into it, it takes a long time to train, hitting the 10,000 iteration limit over and over again.
This is the first time I have used a neural network and I don't really know what to expect. If you could give me some hints on what sort of settings I should be using for the network and for the iteration limit I would greatly appreciate it.
Thank you!
First of all, thanks for providing so much information about your network! Here are a few pointers that should give you a clearer picture.
You need to normalize your inputs. If one node sees a mean value of 100,000 and another just 0.5, the two inputs won't have an equal impact on the network, so bring them onto a comparable scale first.
Only 5 hidden neurons for 10 input nodes? I remember reading somewhere that you need at least double the number of inputs; try 20+ hidden neurons. This will provide your neural network model the capability to develop a more complex model. However, too many neurons and your network will just memorize the training data set.
Resilient backpropagation is fine. Just remember that there are other training algorithms out there like Levenberg-Marquardt.
How many training samples do you have? Neural networks usually need a large dataset to be good at making useful predictions.
Consider adding a momentum factor to your weight-training algorithm to speed things up if you haven't done so already.
Online training tends to be better for making generalized predictions than batch training. The former updates the weights after every single training example is run through the network, while the latter updates them only after the whole data set has been passed through. It's your call.
Is your data discrete or continuous? Neural networks tend to do a better job with 0s and 1s than continuous functions. If it is the former, I'd recommend using the sigmoid activation function. A combination of tanh and linear activation functions for the hidden and output layers tends to do a good job with continuously-varying data.
Do you need another hidden layer? It may help if your network is dealing with complex input-output surface mapping.
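As a sketch of the normalization pointer above: one common option is to standardize each input column to zero mean and unit variance, using statistics computed on the training data only. The choice of standardization (rather than, say, min-max or log scaling) and the example rows (team size, cost) are my assumptions, not details from the question.

```python
import math


def fit_standardizer(rows):
    """Compute the per-column mean and standard deviation of the training rows."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    stds = []
    for col, mean in zip(columns, means):
        variance = sum((v - mean) ** 2 for v in col) / n
        stds.append(math.sqrt(variance) or 1.0)  # constant column: avoid dividing by zero
    return means, stds


def standardize(row, means, stds):
    """Rescale one row so every input has a comparable range."""
    return [(v - m) / s for v, m, s in zip(row, means, stds)]


# Hypothetical training rows: [people on the project, cost of the project]
training_rows = [[3, 20000], [10, 95000], [5, 40000]]
means, stds = fit_standardizer(training_rows)
scaled = [standardize(row, means, stds) for row in training_rows]
print(scaled)
```

Keep the fitted means and standard deviations and apply the same transformation to every new project before it is fed to the trained network; the outputs (time and cost) can be scaled the same way and converted back after prediction.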

Artificial Neural Network Question

Generally speaking what do you get out of extending an artificial neural net by adding more nodes to a hidden layer or more hidden layers?
Does it allow for more precision in the mapping, or does it allow for more subtlety in the relationships it can identify, or something else?
There's a very well known result in machine learning that states that a single hidden layer is enough to approximate any smooth, bounded function (the paper was called "Multilayer feedforward networks are universal approximators" and it's now almost 20 years old). There are several things to note, however.
The single hidden layer may need to be arbitrarily wide.
This says nothing about the ease with which an approximation may be found; in general large networks are hard to train properly and fall victim to overfitting quite frequently (the exception is so-called "convolutional neural networks", which really are only meant for vision problems).
This also says nothing about the efficiency of the representation. Some functions require exponential numbers of hidden units if done with one layer but scale much more nicely with more layers (for more discussion of this read Scaling Learning Algorithms Towards AI)
The problem with deep neural networks is that they're even harder to train. You end up with very very small gradients being backpropagated to the earlier hidden layers and the learning not really going anywhere, especially if weights are initialized to be small (if you initialize them to be of larger magnitude you frequently get stuck in bad local minima). There are some techniques for "pre-training" like the ones discussed in this Google tech talk by Geoff Hinton which attempt to get around this.
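A back-of-the-envelope sketch of why those gradients shrink, assuming sigmoid activations and looking at a single path through the network: each layer multiplies the backpropagated signal by roughly |w * g'(z)|, and the sigmoid derivative never exceeds 0.25. The weight value 0.5 and the pre-activation 0 are illustrative assumptions, and a real network sums over many such paths.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # peaks at 0.25 when z == 0


def gradient_scale(depth, weight=0.5, pre_activation=0.0):
    """Product of the per-layer factors |w * g'(z)| along one path of `depth` layers."""
    factor = abs(weight) * sigmoid_derivative(pre_activation)
    return factor ** depth


for depth in (1, 2, 5, 10):
    print(depth, gradient_scale(depth))
```

With these numbers the factor per layer is 0.125, so after 10 layers the signal reaching the first hidden layer is scaled by about 1e-9, which is the "learning not really going anywhere" effect described above.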
This is a very interesting question, but it's not so easy to answer. It depends on the problem you are trying to solve and which neural network you are trying to use. There are several neural network types.
In general, it's not so clear that more nodes equals more precision. Research shows that you mostly need only one hidden layer. The number of nodes should be the minimal number required to solve the problem. If you don't have enough of them, you will not reach a solution.
On the other hand, once you have reached a number of nodes that is good enough to solve the problem, you can add more and more of them and you will not see any further progress in the resulting estimate.
That's why there are so many types of neural networks. They try to solve different types of problems. So you have NNs for static problems, for time-related problems, and so on. The number of nodes is not as important as their design.
When you have a hidden layer, you are creating a combined feature of the input. So, is the problem better tackled by more features of the existing input, or through higher-order features that come from combining existing features? This is the trade-off for a standard feed-forward network.
You have a theoretical reassurance that any function can be represented by a neural network with two hidden layers and non-linear activation.
Also, consider using additional resources for boosting, instead of adding more nodes, if you're not certain of the appropriate topology.
Very rough rules of thumb:
Generally, more elements per layer for bigger input vectors.
More layers may let you model more non-linear systems.
If the kind of network you are using has delays in propagation, more layers may allow modelling of time series. Take care to have time jitter in the delays or it won't work very well. If this is just gobbledegook to you, ignore it.
More layers let you insert recurrent features. This can be very useful for discrimination tasks. Your ANN implementation may not permit this.
HTH
The number of units per hidden layer accounts for the ANN's potential to describe an arbitrarily complex function. Some (complicated) functions may require many hidden nodes, or possibly more than one hidden layer.
When a function can be roughly approximated by a certain number of hidden units, any extra nodes will provide more accuracy... but this is only true if the training samples are sufficient to justify the addition; otherwise what will happen is "overconvergence". Overconvergence means that your ANN has lost its generalization abilities because it has overemphasized the particular samples.
In general it is best to use as few hidden units as possible, provided the resulting network gives good results. The additional training patterns required to justify more hidden nodes cannot easily be found in most cases, and accuracy is not the NNs' strong point.
