Error calculation in backpropagation (gradient descent)

Can someone please give an explanation about the calculation of the error in backpropagation which is found in many code examples such as:
error=calculated-target
// then calculate error with respect to each parameter...
Is this the same for squared error and cross-entropy error? How?
Thanks...

I will write x for an example from the training set, f(x) for the prediction of your network on this particular example, and g_x for the ground truth (label) associated with x.
The short answer is: the root mean squared (RMS) error is used when your network can directly, and differentiably, predict the values that you want. The cross-entropy error is used when your network predicts scores for a set of discrete labels.
To clarify, you usually use the root mean squared (RMS) error when you want to predict values that can change continuously. Imagine you want your network to predict vectors in R^n. This is the case when, for example, you want to predict surface normals or optical flow. These values change continuously, and ||f(x)-g_x|| is differentiable, so you can use backprop and train your network.
Cross-entropy, on the other hand, is useful in classification with m labels, for example in image classification. In that case, g_x takes one of the discrete values c_1, c_2, ..., c_m, where m is the number of classes.
Now you cannot use RMS, because if you assume that your network predicts the exact labels (i.e. f(x) in {c_1,...,c_m}), then ||f(x)-g_x|| is no longer differentiable and you cannot use back-propagation. So you build a network that does not compute class labels directly, but instead computes a set of scores s_1,...,s_m, one per class label. Then you maximize the probability of the correct class by maximizing its softmax value over the scores. This makes the loss function differentiable.
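As a small illustration (my own sketch, not from the question's code), here is why error = calculated - target shows up for both losses: with squared error and a linear output layer, and with cross-entropy and a softmax output layer, the derivative of the loss with respect to the output pre-activations takes exactly that form.
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))            # shift for numerical stability
    return e / e.sum()

def mse_output_delta(prediction, target):
    # L = 0.5 * ||prediction - target||^2 with a linear (identity) output layer
    return prediction - target

def cross_entropy_output_delta(z, target_one_hot):
    # L = -sum(target * log(softmax(z))); dL/dz simplifies to softmax(z) - target
    return softmax(z) - target_one_hot

z = np.array([2.0, 1.0, 0.1])            # raw scores from the last layer
t = np.array([0.0, 1.0, 0.0])            # one-hot ground truth
print(cross_entropy_output_delta(z, t))
print(mse_output_delta(z, t))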

Related

How to form precision-recall curve using one test dataset for my algorithm?

I'm working on a knowledge graph, more precisely in the natural language processing field. To evaluate the components of my algorithm, it is necessary to be able to classify the good and the poor candidates. For this purpose, we manually classified pairs in a dataset.
My system returns the relevant pairs according to the implementation logic. Now I'm able to calculate:
Precision = X
Recall = Y
To establish a complete curve I need the rest of the points (X,Y). What should I do?
build another dataset for testing?
split my dataset?
or some other solution?
Neither of your two proposed methods. In short, a precision-recall or ROC curve is designed for classifiers with probabilistic output. That is, instead of simply producing a 0 or 1 (in the case of binary classification), you need a classifier that can provide a probability in the [0,1] range. The function to do this in sklearn is precision_recall_curve; note how the 2nd parameter is called probas_pred.
To turn these probabilities into concrete class predictions, you can then set a threshold, say at 0.5. Setting such a threshold is problematic, however, since you can trade off precision against recall by varying the threshold, and an arbitrary choice can give a false impression of a classifier's performance. To circumvent this, threshold-independent measures like the area under the ROC or precision-recall curve are used. They create thresholds at different intervals, say 0.1, 0.2, 0.3, ..., 0.9, turn the probabilities into binary classes, and then compute precision and recall for each such threshold.
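For reference, a minimal usage sketch of that sklearn function, with made-up labels and scores (y_true would be your manually classified pairs, probas_pred the probabilistic scores your system would need to output):
from sklearn.metrics import precision_recall_curve, auc

# Made-up example: 1 = good candidate, 0 = poor candidate.
y_true      = [1, 0, 1, 1, 0, 0, 1, 0]
probas_pred = [0.9, 0.4, 0.65, 0.8, 0.3, 0.55, 0.7, 0.2]   # probabilistic scores

precision, recall, thresholds = precision_recall_curve(y_true, probas_pred)
print(precision, recall, thresholds)
print("Area under the precision-recall curve:", auc(recall, precision))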

Finding optimal solution to multivariable function with non-negligible solution time?

So I have this issue where I have to find the best distribution that, when passed through a function, matches a known surface. I have written a script that creates the distribution given some parameters and spits out a metric comparing the resulting surface to the known one, but this script takes a non-negligible time to run, so I can't just sweep a very large set of parameters to find the optimal one. I looked into the simplex method, and it seems to be on the right path, but it's not quite what I need, because I don't exactly have a set of linear equations, and I don't know the constraints on the parameters; I only have one method that gives a single output (and that's all). Can anyone point me in the right direction for how to solve this problem? Thanks!
To quickly go over my process/problem again: I have a set of parameters (at this point 2, but it will be expanded to more later) that defines a distribution. This distribution is used to create a surface, which is compared to a known surface, and an error metric is produced. I want to find the optimal set of parameters, but I cannot run through an arbitrarily large number of parameter settings due to the time constraint.
One situation consistent with what you have asked is a model in which you have a reasonably tractable probability distribution which generates an unknown value. This unknown value goes through a complex and not mathematically nice process and generates an observation. Your surface corresponds to the observed probability distribution on the observations. You would be happy finding the parameters that give a good least squares fit between the theoretical and real life surface distribution.
One approximation for the fitting process is that you compute a grid of values in the space output by the probability distribution. Each set of parameters gives you a probability for each point on this grid. The not nice process maps each grid point here to a nearest grid point in the space of the surface. The least squares fit is a quadratic in the probabilities calculated for the first grid, because the probabilities calculated for a grid point in the surface are the sums of the probabilities calculated for values in the first grid that map to something nearer to that point in the surface than any other point in the surface. This means that it has first (and even second) derivatives that you can calculate. If your probability distribution is nice enough you can use the chain rule to calculate derivatives for the least squares fit in the initial parameters. This means that you can use optimization methods to calculate the best fit parameters which require not just a means to calculate the function to be optimized but also its derivatives, and these are generally more efficient than optimization methods which require only function values, such as Nelder-Mead or Torczon Simplex. See e.g. http://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math4/optim/package-summary.html.
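As a concrete baseline before deriving analytic gradients, a derivative-free optimizer such as Nelder-Mead is easy to wire up. Here is a hedged sketch with scipy, where surface_error is only a placeholder for your slow distribution-to-surface-to-metric script:
import numpy as np
from scipy.optimize import minimize

def surface_error(params):
    # Placeholder: replace with your script that builds the distribution from
    # `params`, generates the surface, and returns the comparison metric.
    a, b = params
    return (a - 1.3) ** 2 + (b + 0.7) ** 2

x0 = np.array([0.0, 0.0])                     # initial guess for the two parameters
result = minimize(surface_error, x0, method="Nelder-Mead",
                  options={"xatol": 1e-4, "fatol": 1e-4, "maxiter": 200})
print(result.x, result.fun)                   # best parameters and their error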
Another possible approach is via something called the EM Algorithm. Here EM stands for Expectation-Maximization. It can be used for finding maximum likelihood fits in cases where the problem would be easy if you could see some hidden state that you cannot actually see. In this case the output produced by the initial distribution might be such a hidden state. One starting point is http://www-prima.imag.fr/jlc/Courses/2002/ENSI2.RNRF/EM-tutorial.pdf.
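To make the hidden-state idea concrete, here is a tiny, self-contained EM iteration for a one-dimensional two-component Gaussian mixture (my own toy example, not tied to your surface problem); the component assignment of each point plays the role of the hidden state:
import numpy as np

rng = np.random.RandomState(0)
data = np.r_[rng.normal(-2, 1, 200), rng.normal(3, 1, 200)]   # toy observations

mu = np.array([-1.0, 1.0])          # initial means
sigma = np.array([1.0, 1.0])        # initial standard deviations
mix = np.array([0.5, 0.5])          # initial mixing proportions

for _ in range(50):
    # E-step: responsibility of each component for each data point
    dens = mix * np.exp(-0.5 * ((data[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from the responsibility-weighted data
    n_k = resp.sum(axis=0)
    mu = (resp * data[:, None]).sum(axis=0) / n_k
    sigma = np.sqrt((resp * (data[:, None] - mu) ** 2).sum(axis=0) / n_k)
    mix = n_k / len(data)

print(mu, sigma, mix)               # should roughly recover (-2, 3), (1, 1), (0.5, 0.5)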

Converting SVM hyperplane distance (response) to Likelihood

I am trying to use an SVM to train some image models. However, the SVM is not a probabilistic framework, so it outputs the distance to the hyperplane as a raw number.
Platt converted the output of the SVM to a likelihood by using some optimisation function, but I fail to understand it. Does the method assume that each class has the same prior probability, i.e., for a binary classifier, if the training set is balanced, do the labels 1 and -1 each occur 50% of the time?
Secondly, in some papers I read that for a binary SVM classifier they convert the -1 and 1 labels to the range 0 to 1 and compute the likelihood. But they do not mention anything about how to convert the SVM distance to a probability.
Sorry for my English. I would welcome any suggestions and comments. Thank you.
link to paper
Well, as far as I can tell, that paper is proposing a mapping from the SVM output to the range [0,1] using a sigmoid function.
From a simplified point of view, it would be something like Sigmoid(RAWSVM(X)) in [0,1], so there is no explicit "weight" on the labels. The idea is that you take one label (let's say Y=+1), then take the output of the SVM and see how close the prediction for that pattern is to that label; if it is close, the sigmoid will give you a number close to 1, otherwise it will give you a number close to 0. And hence you have a sense of probability.
Secondly, in some papers I read that for a binary SVM classifier they convert the -1 and 1 labels to the range 0 to 1 and compute the likelihood. But they do not mention anything about how to convert the SVM distance to a probability.
Yes, you are correct, and some implementations work in the range [0,1] instead of [-1,+1]; some even map the labels to a factor depending on the value of C. In any case, that shouldn't affect the method proposed in the paper, since it maps any range to [0,1]. Keep in mind that this "probabilistic" distribution is just a map from any range to [0,1] assuming uniformity. I am oversimplifying, but the effect is the same.
One last thing: that sigmoid map is not static but data-driven, which means there is some training on the dataset to parametrize the sigmoid and adjust it to the data. In other words, for two different datasets you would probably get two different mapping functions.
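A rough sklearn sketch of that data-driven sigmoid (essentially Platt scaling): fit a logistic sigmoid on the SVM's raw decision values. Note that SVC(probability=True) does a cross-validated version of this internally; fitting on the same data as below is only for illustration.
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

# Toy two-class data with labels in {-1, +1}.
rng = np.random.RandomState(0)
X = np.r_[rng.randn(50, 2) + 2, rng.randn(50, 2) - 2]
y = np.r_[np.ones(50), -np.ones(50)]

svm = SVC(kernel="linear").fit(X, y)
raw = svm.decision_function(X).reshape(-1, 1)   # signed distances to the hyperplane

# Data-driven sigmoid: in practice fit it on held-out decision values.
platt = LogisticRegression().fit(raw, y)
probs = platt.predict_proba(raw)[:, 1]          # P(y = +1 | x), mapped into [0, 1]
print(probs[:5])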

neural network training set

My question is about the training set in a supervised artificial neural network (ANN).
The training set, as some of you probably know, consists of pairs (input, desired output).
The training phase itself is the following:
for every pair in the training set
- we input the first value of the pair and calculate the output error, i.e., how far the generated output is from the desired output, which is the second value of the pair
- based on that error value we use the backpropagation algorithm to calculate the weight gradients and update the weights of the ANN
end for
Now assume that there are pair 1, pair 2, ..., pair m, ... in the training set.
We take pair 1, produce some error, update the weights, then take pair 2, etc.
Later we reach pair m, produce some error, and update the weights.
My question is: what if that weight update after pair m cancels out some weight update, or even several updates, that happened before?
For example, if pair m is going to cancel out the weight updates that happened for pair 1, or pair 2, or both, then although the ANN will produce a reasonable output for input m, it will kind of forget the updates for pair 1 and pair 2, and the results for inputs 1 and 2 will be poor.
Then what's the point of training?
Unless we train the ANN with pair 1 and pair 2 again, after pair m.
For example, if pair m is going to cancel out the weight updates that happened for pair 1, or pair 2, or both, then although the ANN will produce a reasonable output for input m, it will kind of forget the updates for pair 1 and pair 2, and the results for inputs 1 and 2 will be poor, then what's the point of training?
The aim of training a neural network is to end up with weights that give you the desired output for all possible input values. What you're doing here is traversing the error surface as you back-propagate, so that you end up in an area where the error is below the error threshold. Keep in mind that when you backpropagate the error for one set of inputs, it doesn't mean that the neural network automatically recognizes that particular input and immediately produces the exact response when that input is presented again. When you backpropagate, all it means is that you have changed your weights in such a manner that your neural network will get better at recognizing that particular input (that is, the error keeps decreasing).
So if you present pair 1 and then pair 2, it is possible that pair 2 may negate the changes to a certain degree. However, in the long run the neural network's weights will tend towards recognizing all inputs properly. The thing is, you cannot look at the result of a particular training attempt for a particular set of inputs/outputs and be concerned that the changes will be negated. As I mentioned before, when you're training a neural network you are traversing an error surface to find a location where the error is the lowest. Think of it as walking along a landscape that has a bunch of hills and valleys. Imagine that you don't have a map and that you have a special compass that tells you in what direction you need to move, and by what distance. The compass is basically trying to direct you to the lowest point in this landscape. Now this compass doesn't know the landscape well either, so in trying to send you to the lowest point, it may go in a slightly wrong direction (i.e., send you some way up a hill), but it will try and correct itself after that. In the long run, you will eventually end up at the lowest point in the landscape (unless you're in a local minimum, i.e., a low point, but one that is not the lowest point).
Whenever you're doing supervised training, you should run several (or even thousands of) passes through the training dataset. Each such pass through the training dataset is called an epoch.
There are also two different ways of updating the parameters in the neural network during supervised training: stochastic training and batch training. Batch training is one loop through the dataset, accumulating the total error over the set, and updating the parameters (weights) only once, when all the error has been accumulated. Stochastic training is the method you describe, where the weights are adjusted for each (input, desired output) pair.
In almost all cases, where the training dataset is reasonably representative of the general case, you should prefer stochastic training over batch training. Stochastic training beats batch training in 99 of 100 cases! (Citation needed :-)) (Simple XOR training cases and other toy problems are the exceptions.)
Back to your question (which applies to stochastic training): yes, the second pair could indeed adjust the weights in the opposite direction from the first pair. However, it is not really likely that all weights are adjusted in opposite directions for two cases. And since you will run several epochs through the set, the effect diminishes with each epoch. You should also randomize the order of the pairs for each epoch (use some kind of Fisher-Yates shuffle). This diminishes the effect even more.
Next tip: keep a benchmark dataset separate from the training data. Every n epochs of training, benchmark the neural network against this benchmark set, that is, calculate the total error over the pairs in the benchmark dataset. When that error stops decreasing, it's time to stop the training. A sketch of such a training loop is given below.
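A short sketch of that loop, assuming hypothetical train_pair() and error() methods on your network object (they stand in for your own backprop and error code):
import random

def train_stochastic(network, training_pairs, benchmark_pairs,
                     max_epochs=10000, check_every=10):
    best_error = float("inf")
    for epoch in range(max_epochs):
        random.shuffle(training_pairs)               # Fisher-Yates shuffle each epoch
        for inputs, target in training_pairs:
            network.train_pair(inputs, target)       # one backprop update per pair

        if epoch % check_every == 0:
            # Benchmark on the held-out set; stop when the error stops decreasing.
            error = sum(network.error(i, t) for i, t in benchmark_pairs)
            if error >= best_error:
                break
            best_error = error
    return network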
Good luck!
If you were performing stochastic gradient descent (SGD), then this probably wouldn't happen, because the parameter updates for pair 1 would take effect before the parameter updates for pair 2 were computed. That is why SGD may converge faster.
If you are computing your parameter updates using all your data simultaneously (or even a chunk of it), then these two pairs may cancel each other out. However, that is not a bad thing, because clearly these two pairs of data points are giving conflicting information. This is why batch backprop is typically considered more stable. A toy contrast of the two update schemes is sketched below.
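A toy contrast of the two schemes, with grad() standing in for whatever computes the gradient for one pair (both names are illustrative placeholders):
def sgd_epoch(w, pairs, grad, lr=0.01):
    # Each pair's update takes effect before the next pair's gradient is computed.
    for x, t in pairs:
        w = w - lr * grad(w, x, t)
    return w

def batch_epoch(w, pairs, grad, lr=0.01):
    # All gradients are computed at the same w; opposing pairs cancel inside the sum.
    total = sum(grad(w, x, t) for x, t in pairs)
    return w - lr * total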

Question about Backpropagation Algorithm with Artificial Neural Networks -- Order of updating

Hey everyone, I've been trying to get an ANN I coded to work with the backpropagation algorithm. I have read several papers on them, but I'm noticing a few discrepancies.
Here seems to be the super general format of the algorithm:
Give input
Get output
Calculate error
Calculate change in weights
Repeat steps 3 and 4 until we reach the input level
But here's the problem: the weights need to be updated at some point, obviously. However, because we're back-propagating, we need to use the weights of previous layers (the ones closer to the output layer, I mean) when calculating the error for layers closer to the input layer. But we already calculated the weight changes for the layers closer to the output layer! So, when we use these weights to calculate the error for layers closer to the input, do we use their old values or their updated values?
In other words, if we were to put the step of updating the weights into my super general algorithm, would it be:
(Updating the weights immediately)
Give input
Get output
Calculate error
Calculate change in weights
Update these weights
Repeat steps 3,4,5 until we reach the input level
OR
(Using the "old" values of the weights)
Give input
Get output
Calculate error
Calculate change in weights
Store these changes in a matrix, but don't change these weights yet
Repeat steps 3,4,5 until we reach the input level
Update the weights all at once using our stored values
In this paper I read, in both abstract examples (the ones based on figures 3.3 and 3.4), they say to use the old values, not to immediately update the values. However, in their "worked example 3.1", they use the new values (even though what they say they're using are the old values) for calculating the error of the hidden layer.
Also, in my book "Introduction to Machine Learning by Ethem Alpaydin", though there is a lot of abstract stuff I don't yet understand, he says "Note that the change in the first-layer weight delta-w_hj, makes use of the second layer weight v_h. Therefore, we should calculate the changes in both layers and update the first-layer weights, making use of the old value of the second-layer weights, then update the second-layer weights."
To be honest, it really seems like they just made a mistake and all the weights are updated simultaneously at the end, but I want to be sure. My ANN is giving me strange results, and I want to be positive that this isn't the cause.
Anyone know?
Thanks!
As far as I know, you should update the weights immediately. The purpose of back-propagation is to find weights that minimize the error of the ANN, and it does so by doing gradient descent. I think the algorithm description on the Wikipedia page is quite good. You may also double-check its implementation in the joone engine.
You are usually backpropagating deltas, not errors. These deltas are calculated from the errors, but they do not mean the same thing. Once you have the deltas for layer n (counting from input to output), you use these deltas and the weights of layer n to calculate the deltas for layer n-1 (one closer to the input). The deltas only have a meaning for the old state of the network, not for the new state, so you should always use the old weights when propagating the deltas back towards the input.
Deltas express, in a sense, how much each part of the NN has contributed to the error so far, not how much it will contribute to the error in the next step (because you do not know the actual error yet).
As with most machine-learning techniques, it will probably still work if you use the updated weights, but it might converge more slowly.
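To make that concrete, here is a small sketch (my own, assuming a fully connected net with sigmoid units and squared error) in which all deltas are computed with the old weights and the updates are applied only at the end:
import numpy as np

def backprop_step(weights, activations, target, lr=0.1):
    # weights[k] connects layer k to layer k+1; activations[k] is layer k's output.
    deltas = [None] * len(weights)
    out = activations[-1]
    deltas[-1] = (out - target) * out * (1 - out)         # output-layer delta

    # Propagate deltas towards the input using the OLD weights.
    for k in range(len(weights) - 2, -1, -1):
        a = activations[k + 1]
        deltas[k] = (weights[k + 1].T @ deltas[k + 1]) * a * (1 - a)

    # Only now apply all the weight updates at once.
    return [w - lr * np.outer(d, activations[k])
            for k, (w, d) in enumerate(zip(weights, deltas))]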
If you simply train it on a single input-output pair, my intuition would be to update the weights immediately, because the gradient is not constant. But I don't think your book is talking about only a single input-output pair. Usually you come up with an ANN because you have many input-output samples from a function you would like to model with the ANN. Thus your loop should repeat from step 1 instead of from step 3.
If we label your two methods as new->online and old->offline, then we have two algorithms.
The online algorithm is good when you don't know how many sample input-output relations you are going to see, and you don't mind some randomness in the way the weights update.
The offline algorithm is good if you want to fit a particular set of data optimally. To avoid overfitting the samples in your data set, you can split it into a training set and a test set. You use the training set to update the weights, and the test set to measure how good a fit you have. When the error on the test set begins to increase, you are done.
Which algorithm is best depends on the purpose of using the ANN. Since you talk about training until you "reach the input level", I assume you train until the output is exactly the target value in the data set. In this case the offline algorithm is what you need. If you were building a backgammon-playing program, the online algorithm would be better, because you have an unlimited data set.
In this book, the author talks about how the whole point of the backpropagation algorithm is that it allows you to compute all the weight updates efficiently in one go. In other words, using the "old values" is efficient. Using the new values would be more computationally expensive, and that's why people use the "old values" to update the weights.
