Proper way to implement biases in Neural Networks - separate matrix or within the weight matrix?

I can make a neural network; I just need clarification on bias implementation. Which way is better: implement the bias matrices B1, B2, ..., Bn for each layer in their own separate matrix from the weight matrix, or include the biases in the weight matrix by appending a 1 to the previous layer's output (the input for this layer)? In other words, is the separate-bias implementation or the augmented-weight-matrix implementation the best? Thank you.

I think the best way is to have two separate matrices, one for the weights and one for the biases. Why?
I don't believe there is any increase in computational load: W*x + b with a separate bias and the formulation with the bias folded into W are mathematically and computationally equivalent, on GPU or otherwise (see the sketch below).
Greater modularity. Say you want to initialize the weights and the biases using different initializers (ones, zeros, glorot...). With two separate matrices this is straightforward.
Easier to read and maintain.
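For concreteness, here is a minimal NumPy sketch (my own illustration, not from the original post) showing that the two formulations give identical outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)          # previous layer's output (3 units)
W = rng.normal(size=(4, 3))     # weight matrix for a 4-unit layer
b = rng.normal(size=4)          # separate bias vector

# Formulation 1: separate weight matrix and bias vector
y1 = W @ x + b

# Formulation 2: bias folded into the weight matrix,
# with a constant 1 appended to the layer input
W_aug = np.hstack([W, b[:, None]])  # shape (4, 4)
x_aug = np.append(x, 1.0)           # shape (4,)
y2 = W_aug @ x_aug

print(np.allclose(y1, y2))  # True: mathematically equivalent
```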

include the biases in the weight matrix by adding a 1 to the previous layer output (input for this layer)
This seems to be what is implemented in "Machine Learning with Python: Training and Testing the Neural Network with MNIST data set", in the paragraph "Networks with multiple hidden layers".
I don't know whether it's the best way to do it, though. (Maybe unrelated, but still: the example code there worked with sigmoid, yet failed when I replaced it with ReLU.)

In my opinion, implementing the bias matrices separately for each layer is the way to go. This adds parameters that your model has to learn, but it gives your model more freedom to converge.
For more information read this.

Continuous action-state-space and tiling

After getting used to the Q-Learning algorithm in discrete action-state-spaces, I would like to expand this to continuous spaces. To do this I read the chapter "On-Policy Control with Approximation" of Sutton's introduction. There, the use of differentiable functions like a linear function or an ANN is recommended to handle continuous action-state-spaces. Nevertheless, Sutton then describes the tiling method, which maps the continuous variables onto a discrete representation. Is this always necessary?
Trying to understand these methods, I tried to implement the Hill Climbing Car example from the book without the tiling method, using a linear base function q. As my state space is two-dimensional and my action is one-dimensional, I used a three-dimensional weight vector w in this equation:

$\hat{q}(s, a, \mathbf{w}) = w_0 s_0 + w_1 s_1 + w_2 a$

When I now try to choose the action that maximizes the output, the obvious answer is a = 1 whenever w_2 > 0. Therefore the weight slowly converges toward zero from the positive side, and the agent does not learn anything useful. As Sutton is able to solve the problem using tiling, I am wondering whether my problem is caused by the absence of the tiling method or whether I am doing something else wrong.
So: Is the tiling always necessary?
Regarding your main question about tiling: no, tiling is not always necessary.
As you tried, it's a good idea to implement an easy example such as the Hill Climbing Car in order to fully understand the concepts. Here, however, you are misunderstanding something important. When the book talks about linear methods, it is referring to methods that are linear in the parameters, which means you can extract a set of (non-linear) features and combine them linearly. This kind of approximator can represent functions much more complex than a standard linear regression.
The parametrization you have proposed is not able to represent a non-linear Q function. Taking into account that in the Hill Climbing problem you want to learn Q functions that are strongly non-linear in the state, you will need something more powerful than a function that is linear in the raw state and action. An easy solution for your problem could be to use a Radial Basis Function (RBF) network. In this case, you use a set of features (or basis functions, for example Gaussians) to map your state space:

$\hat{q}(s, \mathbf{w}) = \sum_i w_i \, \phi_i(s), \qquad \phi_i(s) = \exp\left(-\frac{\lVert s - c_i \rVert^2}{2\sigma_i^2}\right)$
Additionally, if your action space is discrete and small, the easiest solution is to maintain an independent RBF network for each action. To select an action, simply compute the Q value for each action and pick the one with the highest value. In this way you avoid the (complex) optimization problem of selecting the best action over a continuous function.
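As a concrete illustration (my own sketch, not taken from the book; the grid of centers, the width, and the learning rate are arbitrary assumptions), here is a minimal per-action RBF Q-approximator:

```python
import numpy as np

class RBFQApproximator:
    """One linear-in-the-weights Q(s) approximator per discrete action."""

    def __init__(self, centers, sigma, n_actions):
        self.centers = centers  # (n_features, state_dim) grid of BF centers
        self.sigma = sigma      # common width of the Gaussian basis functions
        self.w = np.zeros((n_actions, len(centers)))  # one weight vector per action

    def features(self, s):
        # Gaussian basis functions: phi_i(s) = exp(-||s - c_i||^2 / (2 sigma^2))
        d2 = np.sum((self.centers - s) ** 2, axis=1)
        return np.exp(-d2 / (2 * self.sigma ** 2))

    def q(self, s, a):
        return self.w[a] @ self.features(s)

    def best_action(self, s):
        # Evaluate Q for every discrete action and take the argmax
        return int(np.argmax(self.w @ self.features(s)))

    def update(self, s, a, target, alpha=0.05):
        # Semi-gradient step toward a TD target
        phi = self.features(s)
        self.w[a] += alpha * (target - self.w[a] @ phi) * phi

# Example: a 2-D state space covered by a 5x5 grid of Gaussians, 3 actions
g = np.linspace(-1.0, 1.0, 5)
centers = np.array([[x, y] for x in g for y in g])
approx = RBFQApproximator(centers, sigma=0.5, n_actions=3)
print(approx.best_action(np.array([0.2, -0.4])))
```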
You can find a more detailed explanation in the Busoniu et al. book, Reinforcement Learning and Dynamic Programming Using Function Approximators, pages 49-51. It's available for free here.

Why should we compute the image mean when we train CNNs?

When I use Caffe for image classification, it often computes the image mean. Why is that?
Someone said it can improve the accuracy, but I don't understand why that should be the case.
Refer to the image whitening technique in deep learning. It has actually been shown to improve accuracy, although it is not widely used.
To understand why it helps, refer to the idea of normalizing data before applying a machine learning method, which keeps the data in the same range. There is also another method now used in CNNs: batch normalization.
Neural networks (including CNNs) are models with thousands of parameters which we try to optimize with gradient descent. Those models are able to fit a lot of different functions by having a non-linearity φ at their nodes. Without a non-linear activation function, the network collapses to a linear function in total. This means we need the non-linearity for most interesting problems.
Common choices for φ are the logistic function, tanh or ReLU. All of them have their most interesting region around 0: this is where the gradient is big enough to learn quickly, or, in the case of ReLU, where the non-linearity actually sits. Weight initialization schemes like Glorot initialization try to make the network start at a good point for the optimization. Other techniques like Batch Normalization also keep the mean of a node's input around 0.
So you compute (and subtract) the mean of the image so that the first computing nodes get data which "behaves well". It has a mean of 0 and thus the intuition is that this helps the optimization process.
In theory, a network should be able to "subtract" the mean by itself, so if you train long enough this should not matter too much. However, depending on the activation function, "long enough" can be very long.
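A minimal sketch of this preprocessing (my own illustration in NumPy; Caffe computes a mean image in a similar spirit, but this is not Caffe's API):

```python
import numpy as np

# train_images: (N, H, W, C) array of training images (random stand-in here)
train_images = np.random.rand(100, 32, 32, 3).astype(np.float32)

# Compute the mean image over the training set once...
mean_image = train_images.mean(axis=0)

# ...and subtract it from every image at train and test time,
# so the inputs to the first layer are centered around 0.
def preprocess(img):
    return img - mean_image

centered = preprocess(train_images[0])
print(centered.mean())  # roughly 0
```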

How to find common characteristics of the three matrices?

I have three matrices A, B and C, each of size 120*1000 (double), where 120 is the number of time points and 1000 is the total number of features. For each matrix there is a corresponding regressor matrix of size 120*5 (double). The regressor matrices contain only "1" and "0", where "1" means there is a stimulus at that time point and "0" marks a rest time point. I want to find the common characteristics of the three matrices A, B and C combined with the three regressor matrices. Then I want to train a classifier based on matrices A and B. In the end, I want to classify matrix C based on the training data. How can I realize this? Thank you!
I was hoping that someone more qualified would step in, but it looks like the lack of specific info on the OP's side is discouraging them from answering. My comments were meant as a guideline, not as an answer, but as requested I have moved my comment to an answer.
First of all, this is nowhere near my cup of tea, so handle with extreme prejudice, but:
If the features/subjects are not related,
then you should handle each as a separate 1D function/array/vector and train a neural network classifier on it (one for each feature).
If the features are dependent on each other,
then you need to use all of them as input to your neural network classifier and have a network architecture with a large enough number of nodes (weights) to handle that amount of data.
You need to find the dependency yourself only if you want to reduce the input to the classifier;
but as you are going for a neural network, you do not need to, as the network tends to do this itself. Of course, if you do, it will reduce the needed architecture complexity.
Anyway, if you really need to do it, then PCA (Principal Component Analysis) is your way... This step is usually done for deterministic classifiers (not neural network ones; for example, classifiers based on correlation coefficients or on distance in some metric, etc.). PCA has the advantage that you do not need to know much about the data. All the other reduction approaches I know of usually exploit some feature of the dependency or data, but for that you would need to know the properties of the input in high detail, which I assume is not the case here.
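For illustration, here is a minimal PCA reduction via SVD (my own sketch; the number of retained components k is an arbitrary choice):

```python
import numpy as np

# X: 120 time points x 1000 features, as in the question (random stand-in)
X = np.random.rand(120, 1000)

# Center the features, then take the SVD
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Keep the top k principal components (k chosen arbitrarily here)
k = 10
X_reduced = Xc @ Vt[:k].T   # 120 x k matrix to feed to the classifier

# Fraction of variance explained by the k components
print((S[:k] ** 2).sum() / (S ** 2).sum())
```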

Layers and Neurons of a Neural Network

I would like to know a bit more about Neural Networks. I'm developing a C++ program to build a NN, but I'm stuck on the backpropagation algorithm; sorry for not offering any working code.
I know there are many libraries for creating a NN in many languages, but I prefer to build one myself. The point is that I don't know how many layers and how many neurons are necessary to achieve a particular goal, such as pattern recognition, function approximation, or whatever else.
My questions are: if I'd like to recognize particular patterns, as in image detection, how many layers and how many neurons per layer are necessary? Say my images are all 8x8 pixels; I would naturally start with an input layer of 64 neurons, but I have no idea how many neurons to put in the hidden layers, or in the output layer. Say I have to distinguish between cats and dogs, or whatever you may think of: what should the output layer look like? I can imagine an output layer with a single neuron outputting a value between 0 and 1 via the classic logistic function 1/(1+exp(-x)), where near 0 means the input was a cat and near 1 means it was a dog, but... is that correct? What if I add a new pattern like a fish? And what if the input contains a dog and a cat (...and a fish)? This makes me think that the logistic function in the output layer is not very suitable for pattern recognition like this, if only because 1/(1+exp(-x)) has range (0,1). Do I have to change the activation function, or maybe add some other neurons to the output layer? Are there other activation functions better suited to this? Does every neuron in every layer have the same activation function, or does it differ from layer to layer?
Sorry for all these questions, but this topic is not very clear to me.
I have read a lot around the internet and found libraries that are already implemented and hard to read from, and many explanations of what a NN can do, but not how it does it.
I read a lot from https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/ and http://neuralnetworksanddeeplearning.com/chap1.html, and there I understood how to approximate a function (because every neuron in a layer can be thought of as a step function whose step is set by its weights and bias) and how the backpropagation algorithm works, but other tutorials and the like were more focused on pre-existing libraries. I also read the question Determining the proper amount of Neurons for a Neural Network, but I would also like to cover the activation functions of a NN: which is best, and for what.
Thanks in advance for your answers!
Your questions are quite general, so I can only give some general recommendations:
The number of layers you need depends on the complexity of the problem you want to solve. The more calculation is required to obtain an output from a given input, the more layers you need.
Only very simple problems can be solved with a single-layer network. These are called linearly separable and are usually trivial. With two layers it gets better, and with three layers, at least in theory, all kinds of classification tasks can be performed if you have enough cells within the layers. In practice, however, it is often better to add a 4th or 5th layer to the network while reducing the number of cells within a single layer.
Be aware that the standard backpropagation algorithm performs badly with more than 4 or 5 layers. If you need more layers, have a look at Deep Learning.
The number of cells within each layer depends mainly on the number of inputs and, if you are solving a classification task, on the number of classes you want to detect. In practice it is quite common to reduce the number of cells from layer to layer, but there are exceptions.
Concerning your question about the output function: in most cases you should stick with one type of sigmoid function. The case you describe is not really an issue, because you could add another output cell for your "fish" class. The choice of a specific activation function is not that critical; basically, you use one whose value and derivative can be calculated efficiently.
@Frank Puffer has already provided some nice information, but let me add my two cents. First off, much of what you're asking about falls under hyperparameter optimization. Although there are various "rules of thumb", the reality is that determining the optimal architecture (number/size of layers, connectivity structure, etc.) and other parameters like the learning rate typically requires extensive experimentation. The good news is that exposing these hyperparameters as configuration is among the simplest aspects of implementing a neural network. So I would recommend building your software such that the number of layers, the size of each layer, the learning rate, etc. are all easily configurable.
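For instance (a hypothetical Python sketch, not the poster's C++ program; the layer sizes and initialization scheme are arbitrary choices), the architecture can be kept configurable by parameterizing the layer sizes:

```python
import numpy as np

def init_network(layer_sizes, seed=0):
    """Create weight matrices and bias vectors for an MLP with the
    given layer sizes, e.g. [64, 32, 16, 3] for an 8x8 input and 3 classes."""
    rng = np.random.default_rng(seed)
    params = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        W = rng.normal(0, np.sqrt(2.0 / n_in), size=(n_out, n_in))  # He-style init
        b = np.zeros(n_out)
        params.append((W, b))
    return params

# Architecture and learning rate live in one easily edited place
config = {"layer_sizes": [64, 32, 16, 3], "learning_rate": 0.01}
params = init_network(config["layer_sizes"])
print([W.shape for W, _ in params])  # [(32, 64), (16, 32), (3, 16)]
```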
Now, you specifically asked about detecting patterns in an image. It's worth mentioning that using standard multi-layer perceptrons (MLPs) to perform classification on raw image data can be computationally expensive, especially for larger images. It's common to use architectures designed to extract useful, spatially local features (i.e. Convolutional Neural Networks, or CNNs).
You could still use standard MLPs for this, but the computational complexity can make it an untenable solution. The sparse connectivity of CNNs, for example, dramatically reduces the number of parameters requiring optimization and simultaneously builds a conceptual hierarchy of representations better suited to classifying images.
Regardless, I would recommend implementing backpropagation using stochastic gradient descent for optimization. This is still the approach typically used for training neural nets, CNNs, RNNs, etc.
Regarding the number of output neurons, this is one question that does have a simple answer: use "one-hot" encoding. For each class you want to recognize, you have one output neuron. For your example of the dog, cat, and fish classes, that means three neurons. For an input image representing a dog, you would expect a value of 1 for the "dog" neuron and 0 for all the others. Then, during inference, you can interpret the output as a probability distribution reflecting the NN's confidence. For example, if you get the output dog: 0.70, cat: 0.25, fish: 0.05, you have 70% confidence that the image is a dog, and so on.
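Here is a minimal sketch of one-hot targets and a softmax output layer (my own illustration; the class names and scores are made up):

```python
import numpy as np

classes = ["dog", "cat", "fish"]

def one_hot(label):
    """Target vector: 1 for the true class, 0 elsewhere."""
    t = np.zeros(len(classes))
    t[classes.index(label)] = 1.0
    return t

def softmax(z):
    """Turn raw output-layer scores into a probability distribution."""
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

print(one_hot("dog"))                      # [1. 0. 0.]
probs = softmax(np.array([2.0, 1.0, -1.0]))
print(dict(zip(classes, probs.round(2))))  # roughly {'dog': 0.71, 'cat': 0.26, 'fish': 0.04}
```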
For activation functions, the most recent research I've seen seems to indicate that rectified linear units (ReLU) are generally a good choice, since they're easy to differentiate and compute, and they avoid the "vanishing gradient problem" that plagues deeper networks.
Best of luck!

Bimodal distribution characterization algorithm?

What algorithms can be used to characterize, from an array of samples, a distribution that is expected to be clearly bimodal, say a mixture of 2 normal distributions with well-separated peaks? Something that spits out 2 means, 2 standard deviations, and some sort of robustness estimate would be the desired result.
I am interested in an algorithm that can be implemented in any programming language (for an embedded controller), not an existing C or Python library or stat package.
Would it be easier if I knew that the two modal means differ by a ratio of approximately 3:1 +- 50%, the standard deviations are "small" relative to the peak separation, but the pair of peaks could be anywhere in a 100:1 range?
There are two separate possibilities here. One is that you have a single distribution that is bimodal. The other is that you are observing data from two different distributions. The usual way to estimate the latter is with something called, unsurprisingly, a mixture model.
Your options for estimation are a maximum likelihood approach, or Markov chain Monte Carlo methods if you want to take a Bayesian view of the problem. If you state your assumptions in a bit more detail, I'd be willing to help figure out what objective function you'd want to maximize.
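To make the maximum-likelihood route concrete, here is a sketch of the standard EM algorithm for a two-component Gaussian mixture (my own illustration, not from the original answer; the initialization and iteration count are arbitrary choices):

```python
import numpy as np

def em_two_gaussians(x, n_iter=100):
    """EM for a 2-component 1-D Gaussian mixture.
    Returns (means, stds, mixing weights)."""
    x = np.asarray(x, dtype=float)
    # Crude initialization: split the data at the median
    mu = np.array([x[x <= np.median(x)].mean(), x[x > np.median(x)].mean()])
    sd = np.array([x.std(), x.std()])
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample
        p = pi * np.exp(-0.5 * ((x[:, None] - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
        r = p / p.sum(axis=1, keepdims=True)
        # M-step: reweighted means, stds, and mixing proportions
        n = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / n
        sd = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / n)
        pi = n / len(x)
    return mu, sd, pi

# Example: well-separated peaks with means in roughly a 3:1 ratio
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(1.0, 0.1, 500), rng.normal(3.0, 0.2, 500)])
print(em_two_gaussians(x))
```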
These types of models can be computationally intensive, so I am not sure you'd want to run the whole statistical approach on an embedded controller. A hack might be a better fit: if the peaks are in fact well separated, I think it would be easier to identify the two peaks, split your data between them, and estimate the mean and standard deviation of each distribution independently.
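And a sketch of that hack in dependency-free Python, in the spirit of an embedded controller (my own illustration; the bin count and the peak-separation heuristic are arbitrary choices):

```python
def mean_std(vals):
    m = sum(vals) / len(vals)
    return m, (sum((v - m) ** 2 for v in vals) / len(vals)) ** 0.5

def characterize_bimodal(samples, bins=32):
    # Histogram the samples
    lo, hi = min(samples), max(samples)
    width = (hi - lo) / bins or 1.0
    counts = [0] * bins
    for v in samples:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    # First peak: the fullest bin; second peak: the fullest bin far from it
    p1 = counts.index(max(counts))
    far = [c if abs(i - p1) >= bins // 4 else -1 for i, c in enumerate(counts)]
    p2 = far.index(max(far))
    a, b = sorted((p1, p2))
    # Split point: the emptiest bin in the valley between the two peaks
    valley = a + counts[a:b + 1].index(min(counts[a:b + 1]))
    cut = lo + (valley + 0.5) * width
    low = [v for v in samples if v <= cut]
    high = [v for v in samples if v > cut]
    return mean_std(low), mean_std(high)

# Example: two well-separated clusters of readings
data = [0.9, 1.0, 1.02, 1.05, 1.1, 2.9, 2.95, 3.0, 3.05, 3.1]
print(characterize_bimodal(data))
```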
