Question about Backpropagation Algorithm with Artificial Neural Networks -- Order of updating - algorithm

Hey everyone, I've been trying to get an ANN I coded to work with the backpropagation algorithm. I have read several papers on them, but I'm noticing a few discrepancies.
Here seems to be the super general format of the algorithm:
Give input
Get output
Calculate error
Calculate change in weights
Repeat steps 3 and 4 until we reach the input level
But here's the problem: The weights need to be updated at some point, obviously. However, because we're back propagating, we need to use the weights of previous layers (ones closer to the output layer, I mean) when calculating the error for layers closer to the input layer. But we already calculated the weight changes for the layers closer to the output layer! So, when we use these weights to calculate the error for layers closer to the input, do we use their old values, or their "updated values"?
In other words, if we were to put the step of updating the weights in my super general algorithm, would it be:
(Updating the weights immediately)
Give input
Get output
Calculate error
Calculate change in weights
Update these weights
Repeat steps 3,4,5 until we reach the input level
OR
(Using the "old" values of the weights)
Give input
Get output
Calculate error
Calculate change in weights
Store these changes in a matrix, but don't change these weights yet
Repeat steps 3,4,5 until we reach the input level
Update the weights all at once using our stored values
In this paper I read, in both abstract examples (the ones based on figures 3.3 and 3.4), they say to use the old values, not to immediately update the values. However, in their "worked example 3.1", they use the new values (even though what they say they're using are the old values) for calculating the error of the hidden layer.
Also, in my book "Introduction to Machine Learning by Ethem Alpaydin", though there is a lot of abstract stuff I don't yet understand, he says "Note that the change in the first-layer weight delta-w_hj, makes use of the second layer weight v_h. Therefore, we should calculate the changes in both layers and update the first-layer weights, making use of the old value of the second-layer weights, then update the second-layer weights."
To be honest, it really seems like they just made a mistake and all the weights are updated simultaneously at the end, but I want to be sure. My ANN is giving me strange results, and I want to be positive that this isn't the cause.
Anyone know?
Thanks!

As far as I know, you should update weights immediately. The purpose of back-propagation is to find weights that minimize the error of the ANN, and it does so by doing a gradient descent. I think the algorithm description in the Wikipedia page is quite good. You may also double-check its implementation in the joone engine.

You are usually backpropagating deltas, not errors. These deltas are calculated from the errors, but they do not mean the same thing. Once you have the deltas for layer n (counting from input to output), you use these deltas and the weights from layer n to calculate the deltas for layer n-1 (one closer to the input). The deltas only have a meaning for the old state of the network, not for the new state, so you should always use the old weights for propagating the deltas back to the input.
Deltas mean in a sense how much each part of the NN has contributed to the error before, not how much it will contribute to the error in the next step (because you do not know the actual error yet).
As with most machine-learning techniques, it will probably still work if you use the updated weights, but it might converge more slowly.
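To make the "old weights" ordering concrete, here is a minimal sketch for a two-layer network. It assumes sigmoid units, a squared-error loss and no bias terms; the names `forward`, `backprop_step`, `W1`, `W2` and `lr` are made up for illustration and do not come from the paper or the book.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, W2):
    """Forward pass; returns hidden activations h and outputs y."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    y = [sigmoid(sum(w * hj for w, hj in zip(row, h))) for row in W2]
    return h, y

def backprop_step(x, t, W1, W2, lr=0.5):
    """One step on one (input, target) pair; returns NEW weight matrices."""
    h, y = forward(x, W1, W2)

    # Output deltas from the error and the sigmoid derivative y * (1 - y).
    delta_out = [(yk - tk) * yk * (1.0 - yk) for yk, tk in zip(y, t)]

    # Hidden deltas: propagated back through the OLD second-layer weights W2.
    delta_hid = [
        sum(W2[k][j] * delta_out[k] for k in range(len(W2))) * h[j] * (1.0 - h[j])
        for j in range(len(h))
    ]

    # Build the updated matrices only after all deltas are known.
    new_W2 = [[W2[k][j] - lr * delta_out[k] * h[j] for j in range(len(h))]
              for k in range(len(W2))]
    new_W1 = [[W1[j][i] - lr * delta_hid[j] * x[i] for i in range(len(x))]
              for j in range(len(W1))]
    return new_W1, new_W2
```

Because the function returns new matrices instead of modifying `W1` and `W2` in place, the hidden deltas cannot accidentally pick up already-updated second-layer weights.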

If you simply train it on a single input-output pair my intuition would be to update weights immediately, because the gradient is not constant. But I don't think your book mentions only a single input-output pair. Usually you come up with an ANN because you have many input-output samples from a function you would like to model with the ANN. Thus your loops should repeat from step 1 instead of from step 3.
If we label your two methods as new->online and old->offline, then we have two algorithms.
The online algorithm is good when you don't know how many sample input-output relations you are going to see, and you don't mind some randomness in the way the weights update.
The offline algorithm is good if you want to fit a particular set of data optimally. To avoid overfitting the samples in your data set, you can split it into a training set and a test set. You use the training set to update the weights, and the test set to measure how good a fit you have. When the error on the test set begins to increase, you are done.
Which algorithm is best depends on the purpose of using an ANN. Since you talk about training until you "reach input level", I assume you train until the output exactly matches the target value in the data set. In this case the offline algorithm is what you need. If you were building a backgammon playing program, the online algorithm would be better because you have an unlimited data set.

In this book, the author talks about how the whole point of the backpropagation algorithm is that it allows you to efficiently compute all the weights in one go. In other words, using the "old values" is efficient. Using the new values is more computationally expensive, and so that's why people use the "old values" to update the weights.

Related

Exploration Algorithm

Massively edited this question to make it easier to understand.
Given an environment with arbitrary dimensions and arbitrary positioning of an arbitrary number of obstacles, I have an agent exploring the environment with a limited range of sight (obstacles don't block sight). It can move in the four cardinal directions of NSEW, one cell at a time, and the graph is unweighted (each step has a cost of 1). Linked below is a map representing the agent's (yellow guy) current belief of the environment at the instant of planning. Time does not pass in the simulation while the agent is planning.
http://imagizer.imageshack.us/a/img913/9274/qRsazT.jpg
What exploration algorithm can I use to maximise the cost-efficiency of utility, given that revisiting cells is allowed? Each cell holds a utility value. Ideally, I would seek to maximise the sum of utility of all cells SEEN (not visited) divided by the path length, although if that is too complex for any suitable algorithm then the number of cells seen will suffice. There is a maximum path length but it is generally in the hundreds or higher. (The actual test environments used on my agent are at least 4x bigger, although theoretically there is no upper bound on the dimensions that can be set, and the maximum path length would thus increase accordingly)
I consider BFS and DFS to be intractable, A* to be non-optimal given a lack of suitable heuristics, and Dijkstra's inappropriate for generating a single unbroken path. Is there any algorithm you can think of? Also, I need help with loop detection, as I've never done that before; this is my first time allowing revisits.
One approach I have considered is to reduce the map into a spanning tree, except that instead of defining it as a tree that connects all cells, it is defined as a tree that can see all cells. My approach would result in the following:
http://imagizer.imageshack.us/a/img910/3050/HGu40d.jpg
In the resultant tree, the agent can go from a node to any adjacent nodes that are 0-1 turn away at intersections. This is as far as my thinking has gotten right now. A solution generated using this tree may not be optimal, but it should at least be near-optimal with much fewer cells being processed by the algorithm, so if that would make the algorithm more likely to be tractable, then I guess that is an acceptable trade-off. I'm still stuck with thinking how exactly to generate a path for this however.
Your problem is very similar to a canonical Reinforcement Learning (RL) problem, the Grid World. I would formalize it as a standard Markov Decision Process (MDP) and use any RL algorithm to solve it.
The formalization would be:
States s: your NxM discrete grid.
Actions a: UP, DOWN, LEFT, RIGHT.
Reward r: the value of the cells that the agent can see from the destination cell s', i.e. r(s,a,s') = sum(value(seen(s'))).
Transition function: P(s' | s, a) = 1 if s' is not out of the boundaries or a black cell, 0 otherwise.
Since you are interested in the average reward, the discount factor is 1 and you have to normalize the cumulative reward by the number of steps. You also said that each step has a cost of one, so you could subtract 1 from the immediate reward r at each time step, but this would not add anything since you will already average by the number of steps.
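A rough sketch of this formalization, under several assumptions of mine: the grid is a list of lists holding each cell's utility, obstacles are stored as `None`, the agent sees a square window of radius `sight` around its cell, and a blocked move leaves the agent where it is. The reward follows the formula literally, so it does not track which cells were already seen. All names are hypothetical.

```python
ACTIONS = {"UP": (-1, 0), "DOWN": (1, 0), "LEFT": (0, -1), "RIGHT": (0, 1)}

def step(grid, state, action):
    """Deterministic transition: move unless the target is outside or an obstacle."""
    r, c = state
    dr, dc = ACTIONS[action]
    nr, nc = r + dr, c + dc
    inside = 0 <= nr < len(grid) and 0 <= nc < len(grid[0])
    if inside and grid[nr][nc] is not None:
        return (nr, nc)
    return state  # assumption: an invalid move keeps the agent in place

def reward(grid, state, sight=1):
    """Sum of the utilities of all non-obstacle cells visible from `state`."""
    r, c = state
    total = 0.0
    for i in range(max(0, r - sight), min(len(grid), r + sight + 1)):
        for j in range(max(0, c - sight), min(len(grid[0]), c + sight + 1)):
            if grid[i][j] is not None:
                total += grid[i][j]
    return total

def average_reward(grid, start, actions, sight=1):
    """Cumulative reward along a path, normalized by the number of steps."""
    state, total = start, 0.0
    for a in actions:
        state = step(grid, state, a)
        total += reward(grid, state, sight)
    return total / len(actions)
```

If only newly seen cells should count, the set of seen cells has to become part of the state, which makes the state space much larger.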
Since the problem is discrete the policy could be a simple softmax (or Gibbs) distribution.
As solving algorithm you can use Q-learning, which guarantees the optimality of the solution provided a sufficient number of samples. However, if your grid is too big (and you said that there is no limit) I would suggest policy search algorithms, like policy gradient or relative entropy (although they guarantee convergence only to local optima). You can find something about Q-learning basically everywhere on the Internet. For a recent survey on policy search I suggest this.
The cool thing about these approaches is that they encode the exploration in the policy (e.g., the temperature in a softmax policy, the variance in a Gaussian distribution) and will try to maximize the cumulative long term reward as described by your MDP. So usually you initialize your policy with a high exploration (e.g., a complete random policy) and by trial and error the algorithm will make it deterministic and converge to the optimal one (however, sometimes also a stochastic policy is optimal).
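As an illustration of how the temperature encodes exploration, here is a small sketch of a softmax (Gibbs) distribution over action preferences. The function name and the preference values are hypothetical.

```python
import math

def softmax_policy(preferences, temperature):
    """Action probabilities; high temperature is near uniform, low is near greedy."""
    m = max(preferences)  # subtract the max for numerical stability
    exps = [math.exp((p - m) / temperature) for p in preferences]
    total = sum(exps)
    return [e / total for e in exps]

prefs = [1.0, 2.0, 3.0]              # e.g. estimated values of three actions
print(softmax_policy(prefs, 100.0))  # almost uniform: lots of exploration
print(softmax_policy(prefs, 0.01))   # almost all mass on the best action
```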
The main difference between all the RL algorithms is how they perform the update of the policy at each iteration and manage the tradeoff exploration-exploitation (how much should I explore VS how much should I exploit the information I already have).
As suggested by Demplo, you could also use Genetic Algorithms (GA), but they are usually slower and require more tuning (elitism, crossover, mutation...).
I have also tried some policy search algorithms on your problem and they seem to work well, although I initialized the grid randomly and do not know the exact optimal solution. If you provide some additional details (a test grid, the max number of steps and if the initial position is fixed or random) I can test them more precisely.

How do I extend a support vector machine algorithm to a high dimensional data set?

I'm trying to implement an SVM algorithm, but I'm having a hard time understanding how d-dimensional data sets are actually handled. In my particular case, each 'point' has nearly 400 identifying features.
In the two dimensional space, it basically tries to find a line between the two groups that maximizes the margin from any point on either side. I can sort of imagine what such a 'line' would look like in a d-dimensional space, but I'm completely lost on how the classification would actually work.
There is a similar question here, but I'm not getting it. I sort of get how the separation would occur after you have the classifier, but I'm lost on how to actually get the classifier.
If you can imagine how the line of the 2D case would become a d-dimensional hyperplane for higher dimensions, then you are pretty much done. The actual classification occurs when you test a point over the hyperplane, which will give you a positive number if the point belongs to class 1 or negative if it belongs to class 2.
Notice that in the formula there is no restriction for the dimension of each point:
[Image courtesy of wikipedia]
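A minimal sketch of the test step, assuming the hyperplane is written as w · x − b = 0 (one common convention; the original image is not available to confirm it). The weight vector `w` and offset `b` would normally come from training; the values below are made up.

```python
def classify(w, b, x):
    """Return +1 or -1 according to the side of the hyperplane w.x - b = 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) - b
    return 1 if score >= 0 else -1

# Hypothetical 400-feature example; nothing in the code depends on d.
d = 400
w = [1.0] * d
b = 0.0
print(classify(w, b, [0.5] * d))   # point on the positive side
print(classify(w, b, [-0.5] * d))  # point on the negative side
```

The same function works unchanged for 2 or 400 features, which is the point made above: the dot product does not care about the dimension.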
And in case you are curious about what happens with the non-linear case when you use the kernel trick, I would like to share with you a video that illustrates very well the idea.
http://www.youtube.com/watch?v=3liCbRZPrZA

neural network training set

My question is about a training set in a supervised artificial neural network (ANN)
Training set, as some of you probably know, consists of pairs (input, desired output)
Training phase itself is the following
for every pair in a training set
-we input the first value of the pair and calculate the output error i.e. how far is the generated output from the desired output, which is the second value of the pair
-based on that error value we use the backpropagation algorithm to calculate weight gradients and update the weights of the ANN
end for
Now assume that there are pair1, pair2, ...pair m, ... in the training set
we take pair1, produce some error, update weights, then take pair2, etc.
later we reach pair m, produce some error, and update weights,
My question is: what if the weight update after pair m undoes some weight update, or even several updates, that happened before?
For example, if pair m is going to undo the weight updates that happened after pair1, or pair2, or both, then although the ANN will produce a reasonable output for input m, it will kind of forget the updates for pair1 and pair2, and the result for inputs 1 and 2 will be poor,
then what's the point of training?
Unless we train ANN with pair1 and pair2 again, after pair m
The aim of training a neural network is to end up with weights that give you the desired output for all-possible input values. What you're doing here is traversing the error surface as you back-propagate so that you end up in an area where the error is below the error threshold. Keep in mind that when you backpropagate the error for one set of inputs, it doesn't mean that the neural network automatically recognizes that particular input and immediately produces the exact response when that input is presented again. When you backpropagate, all it means is that you have changed your weights in such a manner that your neural network will get better at recognizing that particular input (that is, the error keeps decreasing).
So if you present pair-1 and then pair-2, it is possible that pair-2 may negate the changes to a certain degree. However, in the long run the neural network's weights will tend towards recognizing all inputs properly. The thing is, you cannot look at the result of a particular training attempt for a particular set of inputs/outputs and be concerned that the changes will be negated. As I mentioned before, when you're training a neural network you are traversing an error surface to find a location where the error is the lowest. Think of it as walking along a landscape that has a bunch of hills and valleys. Imagine that you don't have a map and that you have a special compass that tells you in what direction you need to move, and by what distance. The compass is basically trying to direct you to the lowest point in this landscape. Now this compass doesn't know the landscape well either, and so in trying to send you to the lowest point, it may go in a slightly wrong direction (i.e., send you some way up a hill), but it will try to correct itself after that. In the long run, you will eventually end up at the lowest point in the landscape (unless you're in a local minimum, i.e., a low point, but one that is not the lowest point).
Whenever you're doing supervised training, you should run several (or even thousands) rounds through a training dataset. Each such round through the training dataset is called an epoch.
There are also two different ways of updating the parameters in the neural network during supervised training: stochastic training and batch training. Batch training is one loop through the dataset, accumulating the total error through the set, and updating the parameters (weights) only once when all the error has been accumulated. Stochastic training is the method you describe, where the weights are adjusted for each (input, desired output) pair.
In almost all cases, where the training data set is relatively representative for the general case, you should prefer stochastic training over batch training. Stochastic training beats batch training in 99 of 100 cases! (Citation needed :-)). (Simple XOR training cases and other toy problems are the exceptions)
Back to your question (which applies to stochastic training): Yes, the second pair could indeed adjust the weights in the opposite direction from the first pair. However, it is not really likely that all weights are adjusted in the opposite direction for two cases. Also, since you will run several epochs through the set, the effect will diminish with each epoch. You should also randomize the order of the pairs for each epoch. (Use some kind of Fisher-Yates algorithm.) This will diminish the effect even more.
Next tip: Keep a benchmark dataset separate from the training data. After every n epochs of training, benchmark the neural network with the benchmark set, that is, calculate the total error over the pairs in this benchmark dataset. When the error no longer decreases, it's time to stop the training.
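The training loop described above (several epochs, a fresh shuffle per epoch, an update after every pair, and a separate benchmark set for stopping) could look roughly like this. To keep it short, the "network" is a single linear neuron y = w*x + b; that simplification and all names and parameters are assumptions for illustration.

```python
import random

def train_sgd(train, benchmark, lr=0.05, max_epochs=500, check_every=5, seed=0):
    """Stochastic training of one linear neuron with benchmark-based stopping."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    best_error, best_w, best_b = float("inf"), w, b
    data = list(train)
    for epoch in range(1, max_epochs + 1):
        rng.shuffle(data)              # new random order of the pairs each epoch
        for x, t in data:
            err = (w * x + b) - t
            w -= lr * err * x          # weights are adjusted after every pair
            b -= lr * err
        if epoch % check_every == 0:
            # Total squared error over the separate benchmark set.
            error = sum(((w * x + b) - t) ** 2 for x, t in benchmark)
            if error >= best_error:
                break                  # benchmark error stopped decreasing
            best_error, best_w, best_b = error, w, b
    return best_w, best_b

train = [(x, 2.0 * x + 1.0) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]
benchmark = [(x, 2.0 * x + 1.0) for x in (-0.75, 0.25, 0.75)]
w, b = train_sgd(train, benchmark)
```

Python's `random.shuffle` is itself a Fisher-Yates shuffle, so no hand-written version is needed here.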
Good luck!
If you were performing a stochastic gradient descent (SGD), then this probably wouldn't happen because the parameter updates for pair 1 would take effect before the parameter updates for pair 2 would be computed. That is why SGD may converge faster.
If you are computing your parameter updates using all your data simultaneously (or even a chunk of it) then these two pairs may cancel each other out. However, that is not a bad thing because, clearly, these two pairs of data points are giving conflicting information. This is why batch backprop is typically considered to be more stable.

Graph Simplification Algorithm Advice Needed

I have a need to take a 2D graph of n points and reduce it to r points (where r is a specific number less than n). For example, I may have two datasets with slightly different numbers of total points, say 1021 and 1001, and I'd like to force both datasets to have 1000 points. I am aware of a couple of simplification algorithms: Lang Simplification and Douglas-Peucker. I have used Lang in a previous project with slightly different requirements.
The specific properties of the algorithm I am looking for are:
1) must preserve the shape of the line
2) must allow me to reduce the dataset to a specific number of points
3) is relatively fast
This post is a discussion of the merits of the different algorithms. I will post a second message for advice on implementations in Java or Groovy (why reinvent the wheel).
I am concerned about requirement 2 above. I am not an expert enough in these algorithms to know whether I can dictate the exact number of output points. The implementation of Lang that I've used took lookAhead, tolerance and the array of Points as input, so I don't see how to dictate the number of points in the output. This is a critical requirement of my current needs. Perhaps this is due to the specific implementation of Lang we had used, but I have not seen a lot of information on Lang on the web. Alternatively we could use Douglas-Peucker but again I am not sure if the number of points in the output can be specified.
I should add I am not an expert on these types of algorithms or any kind of math wiz, so I am looking for mere mortal type advice :) How do I satisfy requirements 1 and 2 above? I would sacrifice performance for the right solution.
I think you can adapt Douglas-Peucker quite straightforwardly. Adapt the recursive algorithm so that rather than producing a list it produces a tree mirroring the structure of the recursive calls. The root of the tree will be the single-line approximation P0-Pn; the next level will represent the two-line approximation P0-Pm-Pn where Pm is the point between P0 and Pn which is furthest from P0-Pn; the next level (if full) will represent a four-line approximation, etc. You can then trim the tree either on the basis of depth or on the basis of distance of the inserted point from the parent line.
Edit: in fact, if you take the latter approach you don't need to build a tree. Instead you populate a priority queue where the priority is given by the distance of the inserted point from the parent line. Then when you've finished the queue tells you which points to remove (or keep, according to the order of the priorities).
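One way the priority-queue idea could be turned into code is sketched below. It grows the kept set from the two endpoints, always adding the candidate point that lies farthest from its parent line, and stops at exactly r points. This is an illustrative variant with made-up names, not the code of any particular library, and it assumes r >= 2 and points given as (x, y) tuples.

```python
import heapq
import math

def _line_distance(p, a, b):
    """Distance from p to the line through a and b (to a itself if a == b)."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    norm = math.hypot(dx, dy)
    if norm == 0.0:
        return math.hypot(p[0] - a[0], p[1] - a[1])
    return abs(dx * (a[1] - p[1]) - dy * (a[0] - p[0])) / norm

def _farthest(points, lo, hi):
    """(distance, index) of the interior point farthest from line lo-hi, or None."""
    best = None
    for i in range(lo + 1, hi):
        d = _line_distance(points[i], points[lo], points[hi])
        if best is None or d > best[0]:
            best = (d, i)
    return best

def simplify_to_count(points, r):
    """Return exactly r of the input points (all of them if r >= len(points))."""
    n = len(points)
    if r >= n:
        return list(points)
    keep = {0, n - 1}
    heap = []

    def push(lo, hi):
        far = _farthest(points, lo, hi)
        if far is not None:
            d, i = far
            heapq.heappush(heap, (-d, i, lo, hi))  # negated: heapq is a min-heap

    push(0, n - 1)
    while len(keep) < r and heap:
        _, i, lo, hi = heapq.heappop(heap)
        keep.add(i)      # keep the farthest candidate, then split its segment
        push(lo, i)
        push(i, hi)
    return [points[i] for i in sorted(keep)]
```

Each kept point costs a scan over its segment, so this is slower than plain Douglas-Peucker with a tolerance, but the output size is exact.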
You can find my C++ implementation and article on Douglas-Peucker simplification here and here. I also provide a modified version of the Douglas-Peucker simplification that allows you to specify the number of points of the resulting simplified line. It uses a priority queue as mentioned by 'Peter Taylor'. It's a lot slower though, so I don't know if it would satisfy the 'is relatively fast' requirement.
I'm planning on providing an implementation for Lang simplification (and several others). Currently I don't see any easy way to adjust Lang to reduce to a fixed point count. If you could live with a less strict requirement, 'must allow me to reduce the dataset to an approximate number of points', then you could use an iterative approach. Guess an initial value for lookahead: point count / desired point count. Then slowly increase the lookahead until you approximately hit the desired point count.
I hope this helps.
p.s.: I just remembered something, you could also try the Visvalingam-Whyatt algorithm. In short:
-compute the triangle area for each point with its direct neighbors
-sort these areas
-remove the point with the smallest area
-update the area of its neighbors
-re-sort
-continue until the desired number of points remain
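The steps above can be sketched in a few lines. For simplicity this version recomputes every triangle area after each removal instead of updating only the neighbours and re-sorting, so it is quadratic in the number of points; the function names are hypothetical and the endpoints are always kept.

```python
def triangle_area(a, b, c):
    """Area of the triangle formed by three (x, y) points."""
    return abs((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])) / 2.0

def visvalingam_whyatt(points, target):
    """Drop the point with the smallest triangle area until `target` points remain."""
    pts = list(points)
    while len(pts) > max(target, 2):
        # Area of the triangle each interior point forms with its two neighbours.
        areas = [triangle_area(pts[i - 1], pts[i], pts[i + 1])
                 for i in range(1, len(pts) - 1)]
        smallest = areas.index(min(areas)) + 1  # +1: areas[0] belongs to pts[1]
        del pts[smallest]
    return pts
```

For large inputs, a heap of areas with neighbour updates (as in the steps listed) would bring this down to roughly n log n.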

Checking if a graph is random using the Erdős–Rényi model?

Given some graph, I would like to determine how likely it is that it was generated randomly. I was told that a comparison to the Erdős–Rényi model was a good way to get this information, but I can't quite figure out how to do that.
Any advice?
The simplest way would probably be to compare the expected number of links with what you observed in the given graph. A slightly smarter method would be to examine the degree distribution. Erdős–Rényi graphs will have a binomial degree distribution, while real-world networks typically follow a power law.
It might also be easier to test if you had an idea as to what other kinds of models were being used to generate the graph.
You can have a look at the ERGM package for R (www.r-project.org) at www.statnet.org. Although you might not be able to say with 100% certainty that your observed network is produced by a random process, you will be able to assess the likelihood that it was produced by random or non random partner selection processes. ERGM has a function called gof which stands for goodness-of-fit and will compare your observed network with simulated random networks and looks at network statistics such as: geodesic distance distribution, edgewise shared partner distribution, degree distribution and the triad census distribution. This will allow you to make an informed decision whether you consider your network to be random or not.
You will not be able to say whether a single graph is generated randomly. If the generating algorithm is random, then you have to check for randomness of the distribution of edges. But you will need many instances generated by that algorithm. Better check with the notion of randomness in mathematics, cryptography and information theory. [or maybe you want to start with rfc 1750]
The Erdős–Rényi model basically states that you take a number n of nodes and every possible edge has probability p of existence [G(n,p)-model]. Thus by p you can generate the expected number of edges and deviation from this expectation. If a significant ratio of graphs is within standard deviation of this expectation, well, you might not state that your algorithm is random at all, but you have at least one feature uncovered, the expected number of edges.
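For the edge-count feature, the expectation and deviation follow directly from the binomial distribution. The sketch below assumes an undirected graph without self-loops or multi-edges and 0 < p < 1; the function name is made up.

```python
import math

def edge_count_z_score(n, observed_edges, p):
    """Standard deviations between the observed edge count and the G(n, p) mean."""
    pairs = n * (n - 1) // 2                  # number of possible edges
    expected = pairs * p                      # mean of Binomial(pairs, p)
    std = math.sqrt(pairs * p * (1.0 - p))    # its standard deviation
    return (observed_edges - expected) / std

# Hypothetical example: 10 nodes, p = 0.2, 15 observed edges.
print(edge_count_z_score(10, 15, 0.2))
```

A score far from zero says the edge count is unusual for that p; a score near zero only says this one feature is consistent with the model, as noted above.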
But again, without having a lot of states (graphs, intermediary graph generation steps or similar) you will be lost there. Say, I give you a number: 4. Is it generated randomly or not?

Resources