I am trying to use the Viterbi min-sum algorithm which tries to find the pathway through a bunch of nodes that minimizes the overall Hamming distance (fancy term for "xor two numbers and count the resulting bits") against some fixed input.
I understand how to use DP to compute the minimal overall distance, but I am having trouble also capturing the path that corresponds to that minimal distance.
It seems like memoizing the path at each node would be really memory-intensive. Is there a standard way to handle these kinds of problems?
Here is a sample trellis with what I am talking about. The general idea is to find the path through the trellis that most closely emulates the input bitstring, with minimal error (measured by minimizing overall Hamming distance, or the number of mismatched bits).
As you can see, the first chunk of my input string is 01, and I can traverse there in column 1 of the trellis. The next chunk is 10, and I can move there in column 2. Next chunk is 11. Fine so far. Next chunk is 10, which is a problem because I can't reach that state from where I am now, so I have to go to the next best thing (00) and the rest can be filled fine.
But this can become more complex. I'd need to be able to somehow get the corresponding path to the minimal Hamming distance.
(The point of this exercise is that the trellis represents what are ACTUALLY valid transitions, whereas the input string is something you receive through telecommunications and might get garbled and have incorrect bits here and there. This program tries to figure out what the input string SHOULD be by minimizing error).

There's the usual "follow path backwards" technique, requiring only the table of values (but the whole table of values, no cheating with "keep only the most recent part"). The algorithm is simple: start at the end, decide which way you came from. You can make that decision, because either there's exactly one way such that if you came from it you'd compute the value that matches the stored one, or several result in the same value and it wouldn't matter which one you chose.
Also storing a table of "back-pointers" doesn't take much space (about as much as the table of weights, but you can actually omit most of the table of weights if you do this), and it allows a much simpler backwards phase: just follow the pointers. That really is the path, just stored backwards.
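To make the back-pointer idea concrete, here is a minimal sketch in Python. It assumes each trellis state is labelled with the chunk it emits, and that the trellis is given as a dict of allowed transitions. The names viterbi_min_sum, transitions and start_states are made up for this illustration, and the example trellis is not the one from the question.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integers."""
    return bin(a ^ b).count("1")


def viterbi_min_sum(chunks, transitions, start_states):
    """Return (total_distance, path) for the cheapest valid path.

    chunks       -- received symbols, one integer per trellis column
    transitions  -- dict: state -> iterable of states reachable next (assumed)
    start_states -- states allowed in the first column (assumed)
    """
    # cost[s] = best distance of any valid path ending in state s.
    # Only the current column of costs is kept.
    cost = {s: hamming(s, chunks[0]) for s in start_states}
    back = []  # one dict per later column: state -> predecessor on its best path

    for chunk in chunks[1:]:
        new_cost, pointers = {}, {}
        for prev, prev_cost in cost.items():
            for nxt in transitions.get(prev, ()):
                candidate = prev_cost + hamming(nxt, chunk)
                if nxt not in new_cost or candidate < new_cost[nxt]:
                    new_cost[nxt] = candidate
                    pointers[nxt] = prev
        cost = new_cost
        back.append(pointers)

    # Backward phase: start at the cheapest final state and follow the pointers.
    state = min(cost, key=cost.get)
    total = cost[state]
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return total, path


# Hypothetical 4-state trellis, for illustration only.
transitions = {0b00: [0b00, 0b10], 0b01: [0b00, 0b10],
               0b10: [0b01, 0b11], 0b11: [0b01, 0b11]}
received = [0b01, 0b10, 0b11, 0b10, 0b01]
total, path = viterbi_min_sum(received, transitions, [0b00, 0b01, 0b10, 0b11])
print(total, [format(s, "02b") for s in path])
```

Note that the weights are kept for one column only, while the pointers are kept for every column. That is the trade described above.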

You are correct that the immediate approach for calculating the paths is space-expensive.
This problem comes up often in DNA sequencing, where the cost is prohibitive. There are a number of ways to overcome it (see more here):
You can reduce up to a square root of the space if you are willing to double the execution time (see 2.1.1 in the link above).
Using a compressed tree, you can reduce one of the dimensions logarithmically (see 2.1.2 in the link above).


Reconstructing a signal from random samples with holes

I've encountered the following problem as part of my master's thesis, and having been unable to find a suitable solution over the last few weeks, I will ask the masses.
The problem 1
Assume there exists an (unknown) sequence of symbols of a known length. Say for instance
Now, given N samples from arbitrary positions in the sequence, the task is to reconstruct the original sequence. For instance:
The problem 2 (Harder)
Now, on the bright side, there is no limit to how many samples I can make, whilst on the not so bright side there is more to the story.
The samples are noisy. i.e. There might be errors.
There are known holes in the samples. I am only able to observe every 4-6th symbol.
Thus the samples are actually looking more like this:
B B C* # The C should have been an A.
I have tried the following:
Let S be the set of all partial noisy sequences with holes.
Greedy algorithm with random sampling and sliding window.
Let X be the "best" sequence thus far.
Set X as a random sample from S.
Choose a sequence v from S
Slide v along X and score the match, and choose the "best" sequence as the new X.
Repeat from 3.
The problem with this algorithm is that I have been unable to find a good metric to score the sequences. Especially when considering the holes + noise. The result tended to favor shorter sequences, and the result was highly divergent in subsequent runs. Ideas to resolve this are most welcome.
Trying to align the start of the sequence.
This approach attempted to use the fact that I might be able to identify a suffix in the strings that likely makes up the beginning of the unknown sequence. However, due to the holes in the samples, I would need to shift even the matching sequences a few steps right or left. This results in exponential complexity and makes the problem intractable.
I have also played with the idea of using a Hidden Markov Model, but am thwarted on how to deal with the missing data.
Other ideas include trying max flow through a graph built from the strings (don't think this will work), trellis decoding [Viterbi] (don't see how I can deal with samples starting in the middle of the unknown sequence) and more.
Any fresh Ideas are very welcome. Links/references to relevant articles are like manna!
Specific information about my data set
I have three symbols S (start), A and B.
I am < 60% certain any given symbol is sampled correctly.
The S symbol should only appear a few times at the start of the master sequence, but does occur more often due to misclassification.
The symbol B occurs about 1.5 times as often as A in the master sequence.
Problem 1 is known as the Shortest Common Supersequence problem. It is NP-hard for more than two input strings, even with only two symbols. Problem 2 is an instance of Multiple Sequence Alignment. There are many algorithms and implementations for it, mostly heuristic since it is also NP-hard in general.

Algorithm for Connect 4 Evaluation of Data Set

I am working on a connect 4 AI, and saw many people were using this data set, containing all the legal positions at 8 ply, and their eventual outcome.
I am using a standard minimax with alpha/beta pruning as my search algorithm. It seems like this data set could be really useful for my AI. However, I'm trying to find the best way to implement it. I thought the best approach might be to process the list, and use the board state as a hash for the eventual result (win, loss, draw).
What is the best way to design an AI to use a data set like this? Is my idea of hashing the board state, and using it in a traditional search algorithm (e.g. minimax), on the right track? Or is there a better way?
Update: I ended up converting the large move database to a plain text format, where 1 represented X and -1 O. Then I used a string of the board state and an integer representing the eventual outcome, and put them in an std::unordered_map (see Stack Overflow With Unordered Map for a problem I ran into). The performance of the map was excellent. It built quickly, and the lookups were fast. However, I never quite got the search right. Is the right way to approach the problem to just search the database when the number of turns in the game is less than 8, then switch over to a regular alpha-beta?
Your approach seems correct.
For the first 8 moves, use alpha-beta algorithm, and use the look-up table to evaluate the value of each node at depth 8.
Once you have "exhausted" the table (exceeded 8 moves in the game) - you should switch to regular alpha-beta algorithm, that ends with terminal states (leaves in the game tree).
This is extremely helpful because:
Remember that the complexity of searching the tree is O(B^d) - where B is the branch factor (number of possible moves per state) and d is the needed depth until the end.
By using this approach you effectively decrease both B and d for the maximal waiting times (longest moves needed to be calculated) because:
Your maximal depth shrinks significantly to d-8 (only for the last moves), effectively decreasing d!
The branch factor itself tends to shrink in this game after a few moves (many moves become impossible or lead to defeat and should not be explored); this decreases B.
In the first move, you shrink the number of developed nodes as well, to B^8 instead of B^d.
So, because of these - the maximal waiting time decreases significantly by using this approach.
Also note: if you find the optimization insufficient, you can always expand your look-up table (to the first 9, 10, ... moves). Of course this increases the needed space exponentially; it is a tradeoff you need to examine, choosing what best serves your needs (you might even consider storing the entire game in the file system if main memory is not enough).
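A rough sketch of how the table lookup could slot into the search, written in Python for brevity. Several things here are assumptions: `ply` counts the moves already played in the game, the table stores exact outcomes from the point of view of the side to move, and the tiny take-away game only stands in for Connect 4 so that the example runs. The names alphabeta, table_ply and TakeAway are made up for this illustration.

```python
import math


def alphabeta(state, ply, alpha, beta, game, table, table_ply=8):
    """Negamax alpha-beta; the value is from the side to move's point of view."""
    if ply == table_ply and state in table:
        return table[state]          # exact outcome, no need to search deeper
    if game.is_terminal(state):
        return game.score(state)
    best = -math.inf
    for move in game.moves(state):
        value = -alphabeta(game.play(state, move), ply + 1,
                           -beta, -alpha, game, table, table_ply)
        best = max(best, value)
        alpha = max(alpha, value)
        if alpha >= beta:
            break                    # cutoff: the opponent will avoid this line
    return best


class TakeAway:
    """Toy stand-in for Connect 4: take 1-3 stones, taking the last one wins."""

    def is_terminal(self, stones):
        return stones == 0

    def score(self, stones):
        return -1                    # side to move has nothing to take: it lost

    def moves(self, stones):
        return range(1, min(3, stones) + 1)

    def play(self, stones, move):
        return stones - move


game = TakeAway()
# Build a "database" by plain search (an empty table means no lookups happen).
table = {n: alphabeta(n, 0, -math.inf, math.inf, game, {}) for n in range(20)}
# Search from the start; positions reached at ply 2 are answered by the table.
value = alphabeta(10, 0, -math.inf, math.inf, game, table, table_ply=2)
print(value)
```

Once the game itself has gone past table_ply, the lookup branch is never taken and the same function behaves as a regular alpha-beta down to the terminal states.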

Simple k-nearest-neighbor algorithm for euclidean data with variable density?

An elaboration on this question, but with more constraints.
The idea is the same, to find a simple, fast algorithm for k-nearest-neighbors in 2 euclidean dimensions. The bucketing grid seems to work nicely if you can find a grid size that will suitably partition your data. However, what if the data is not uniformly distributed, but has areas with both very high and very low density (for example, the US population), so that no fixed grid size could guarantee both enough neighbors and efficiency? Can this method still be salvaged?
If not, other suggestions would be helpful, though I hope for answers less complex than moving to kd-trees, etc.
If you don't have too many elements, just compare each with all the others. This can be a lot faster than you'd think; today's machines are fast. Unfortunately, the square factor will catch you sooner or later; I figure a linear search of a million objects won't take too long, so you may be okay with up to 1000 elements. Using a grid, or even stripes, might boost that number substantially.
But I think you're stuck with a quadtree (a specific form of k-d tree). Your whole map is one block, which can contain four subblocks (upper left, upper right, lower left, lower right). When a block fills up with more elements than you want to do a linear search on, break it into smaller ones and transfer the elements. (Only leaf nodes have elements.) It's easy to search within a given radius of a given point. Start at the top and if a part of a block is within range of the point, check out its subblocks the same way if it has them. If it doesn't, check its elements.
(When searching for "closest", take care. The square grid means a nearer object might be in a farther block. You have to get everything within a given radius, then check 'em all. If you want the 10 closest and your radius of 20 only picked up 5, you need to try a larger radius. You may have a rejected item that proved to be 30 away and think you should grab it and a few others to make up your 10. However, there may be a few items at 25 away whose whole blocks were rejected, and you want them instead. There ought to be a better solution for this, but I haven't figured it out yet. I just make a guess at the radius and double it till I get enough.)
Quadtrees are fun. If you can set up your data and then access it, it's easy. The problems come when your mapped elements appear, disappear, and move while you are trying to figure out who's near what.
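Here is a small sketch of the quadtree and the "guess a radius and double it" search, in Python. It assumes all points lie inside the root rectangle and that the starting radius is positive. The capacity and max_depth parameters are illustrative; max_depth is only there so that many identical points cannot split a block forever.

```python
import math


class QuadTree:
    """Point quadtree over the rectangle [x0, x1] x [y0, y1]."""

    def __init__(self, x0, y0, x1, y1, capacity=8, depth=0, max_depth=16):
        self.bounds = (x0, y0, x1, y1)
        self.capacity, self.depth, self.max_depth = capacity, depth, max_depth
        self.points = []      # only leaf blocks hold points
        self.children = None  # four sub-blocks once this block has split

    def _child_for(self, p):
        x0, y0, x1, y1 = self.bounds
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        return self.children[(p[0] >= mx) + 2 * (p[1] >= my)]

    def insert(self, p):
        if self.children is not None:
            self._child_for(p).insert(p)
            return
        self.points.append(p)
        if len(self.points) > self.capacity and self.depth < self.max_depth:
            x0, y0, x1, y1 = self.bounds
            mx, my = (x0 + x1) / 2, (y0 + y1) / 2
            args = (self.capacity, self.depth + 1, self.max_depth)
            self.children = [QuadTree(x0, y0, mx, my, *args),
                             QuadTree(mx, y0, x1, my, *args),
                             QuadTree(x0, my, mx, y1, *args),
                             QuadTree(mx, my, x1, y1, *args)]
            points, self.points = self.points, []
            for q in points:  # transfer the elements to the new leaves
                self._child_for(q).insert(q)

    def within(self, cx, cy, r, out):
        """Append to `out` every stored point at distance <= r from (cx, cy)."""
        x0, y0, x1, y1 = self.bounds
        dx = max(x0 - cx, 0.0, cx - x1)  # distance from the query to the block
        dy = max(y0 - cy, 0.0, cy - y1)
        if dx * dx + dy * dy > r * r:
            return  # the whole block is out of range
        if self.children is None:
            out.extend(p for p in self.points
                       if math.hypot(p[0] - cx, p[1] - cy) <= r)
        else:
            for child in self.children:
                child.within(cx, cy, r, out)


def k_nearest(tree, cx, cy, k, radius=1.0):
    """k closest points to (cx, cy), found by doubling the search radius."""
    x0, y0, x1, y1 = tree.bounds
    # Beyond this radius the circle covers the whole map, so stop growing.
    limit = max(math.hypot(x - cx, y - cy) for x in (x0, x1) for y in (y0, y1))
    while True:
        found = []
        tree.within(cx, cy, radius, found)
        if len(found) >= k or radius >= limit:
            found.sort(key=lambda p: math.hypot(p[0] - cx, p[1] - cy))
            return found[:k]
        radius *= 2
```

The reason the doubling loop is safe: once a radius has picked up at least k points, every point outside that circle is farther away than all of them, so sorting what was found and keeping the first k gives the true k nearest.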
Have you looked at this?
kd-trees are quite simple to implement, and there are standard Java/C implementations.

Question about Backpropagation Algorithm with Artificial Neural Networks -- Order of updating

Hey everyone, I've been trying to get an ANN I coded to work with the backpropagation algorithm. I have read several papers on them, but I'm noticing a few discrepancies.
Here seems to be the super general format of the algorithm:
Give input
Get output
Calculate error
Calculate change in weights
Repeat steps 3 and 4 until we reach the input level
But here's the problem: The weights need to be updated at some point, obviously. However, because we're back propagating, we need to use the weights of previous layers (ones closer to the output layer, I mean) when calculating the error for layers closer to the input layer. But we already calculated the weight changes for the layers closer to the output layer! So, when we use these weights to calculate the error for layers closer to the input, do we use their old values, or their "updated values"?
In other words, if we were to put the the step of updating the weights in my super general algorithm, would it be:
(Updating the weights immediately)
Give input
Get output
Calculate error
Calculate change in weights
Update these weights
Repeat steps 3,4,5 until we reach the input level
(Using the "old" values of the weights)
Give input
Get output
Calculate error
Calculate change in weights
Store these changes in a matrix, but don't change these weights yet
Repeat steps 3,4,5 until we reach the input level
Update the weights all at once using our stored values
In this paper I read, in both abstract examples (the ones based on figures 3.3 and 3.4), they say to use the old values, not to immediately update the values. However, in their "worked example 3.1", they use the new values (even though what they say they're using are the old values) for calculating the error of the hidden layer.
Also, in my book "Introduction to Machine Learning by Ethem Alpaydin", though there is a lot of abstract stuff I don't yet understand, he says "Note that the change in the first-layer weight delta-w_hj, makes use of the second layer weight v_h. Therefore, we should calculate the changes in both layers and update the first-layer weights, making use of the old value of the second-layer weights, then update the second-layer weights."
To be honest, it really seems like they just made a mistake and all the weights are updated simultaneously at the end, but I want to be sure. My ANN is giving me strange results, and I want to be positive that this isn't the cause.
Anyone know?
As far as I know, you should update weights immediately. The purpose of back-propagation is to find weights that minimize the error of the ANN, and it does so by doing a gradient descent. I think the algorithm description in the Wikipedia page is quite good. You may also double-check its implementation in the joone engine.
You are usually backpropagating deltas, not errors. These deltas are calculated from the errors, but they do not mean the same thing. Once you have the deltas for layer n (counting from input to output), you use these deltas and the weights from layer n to calculate the deltas for layer n-1 (one closer to input). The deltas only have a meaning for the old state of the network, not for the new state, so you should always use the old weights for propagating the deltas back to the input.
Deltas mean in a sense how much each part of the NN has contributed to the error before, not how much it will contribute to the error in the next step (because you do not know the actual error yet).
As with most machine-learning techniques, it will probably still work if you use the updated weights, but it might converge more slowly.
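A minimal sketch of the "old weights" ordering, in plain Python. It assumes a tiny network with one sigmoid hidden layer, a single linear output, squared error and no bias terms. The names w (first layer) and v (second layer) follow the Alpaydin quote in the question; train_step and eta are made up for this illustration.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def forward(x, w, v):
    """w[h][j]: first-layer weights, v[h]: second-layer weights."""
    z = [sigmoid(sum(w_hj * x_j for w_hj, x_j in zip(w_h, x))) for w_h in w]
    y = sum(v_h * z_h for v_h, z_h in zip(v, z))
    return z, y


def train_step(x, target, w, v, eta=0.1):
    """One gradient step on E = 0.5 * (target - y)**2; returns new (w, v)."""
    z, y = forward(x, w, v)
    err = target - y

    # Compute every change first, all from the pre-update weights.
    dv = [eta * err * z_h for z_h in z]
    dw = [[eta * err * v_h * z_h * (1.0 - z_h) * x_j for x_j in x]
          for v_h, z_h in zip(v, z)]          # note: the OLD v_h is used here

    # Only now apply the stored changes to both layers.
    new_v = [v_h + d for v_h, d in zip(v, dv)]
    new_w = [[w_hj + d for w_hj, d in zip(w_h, dw_h)]
             for w_h, dw_h in zip(w, dw)]
    return new_w, new_v


w = [[0.1, -0.2], [0.3, 0.4]]
v = [0.5, -0.5]
x, target = [1.0, 2.0], 1.0
for _ in range(200):
    w, v = train_step(x, target, w, v)
print(abs(target - forward(x, w, v)[1]))
```

Computed this way, dw is exactly the negative gradient of the error with respect to the first-layer weights, scaled by eta. If v were updated before dw is computed, dw would no longer be that gradient.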
If you simply train it on a single input-output pair my intuition would be to update weights immediately, because the gradient is not constant. But I don't think your book mentions only a single input-output pair. Usually you come up with an ANN because you have many input-output samples from a function you would like to model with the ANN. Thus your loops should repeat from step 1 instead of from step 3.
If we label your two methods as new->online and old->offline, then we have two algorithms.
The online algorithm is good when you don't know how many sample input-output relations you are going to see, and you don't mind some randomness in the way the weights update.
The offline algorithm is good if you want to fit a particular set of data optimally. To avoid overfitting the samples in your data set, you can split it into a training set and a test set. You use the training set to update the weights, and the test set to measure how good a fit you have. When the error on the test set begins to increase, you are done.
Which algorithm is best depends on the purpose of using an ANN. Since you talk about training until you "reach input level", I assume you train until the output is exactly the target value in the data set. In this case the offline algorithm is what you need. If you were building a backgammon-playing program, the online algorithm would be better because you have an unlimited data set.
In this book, the author talks about how the whole point of the backpropagation algorithm is that it allows you to efficiently compute all the weights in one go. In other words, using the "old values" is efficient. Using the new values is more computationally expensive, and so that's why people use the "old values" to update the weights.

Graph Simplification Algorithm Advice Needed

I have a need to take a 2D graph of n points and reduce it to r points (where r is a specific number less than n). For example, I may have two datasets with slightly different numbers of total points, say 1021 and 1001, and I'd like to force both datasets to have 1000 points. I am aware of a couple of simplification algorithms: Lang Simplification and Douglas-Peucker. I have used Lang in a previous project with slightly different requirements.
The specific properties of the algorithm I am looking for is:
1) must preserve the shape of the line
2) must allow me to reduce the dataset to a specific number of points
3) is relatively fast
This post is a discussion of the merits of the different algorithms. I will post a second message for advice on implementations in Java or Groovy (why reinvent the wheel).
I am concerned about requirement 2 above. I am not an expert enough in these algorithms to know whether I can dictate the exact number of output points. The implementation of Lang that I've used took lookAhead, tolerance and the array of Points as input, so I don't see how to dictate the number of points in the output. This is a critical requirement of my current needs. Perhaps this is due to the specific implementation of Lang we had used, but I have not seen a lot of information on Lang on the web. Alternatively we could use Douglas-Peucker but again I am not sure if the number of points in the output can be specified.
I should add I am not an expert on these types of algorithms or any kind of math wiz, so I am looking for mere mortal type advice :) How do I satisfy requirements 1 and 2 above? I would sacrifice performance for the right solution.
I think you can adapt Douglas-Peucker quite straightforwardly. Adapt the recursive algorithm so that rather than producing a list it produces a tree mirroring the structure of the recursive calls. The root of the tree will be the single-line approximation P0-Pn; the next level will represent the two-line approximation P0-Pm-Pn where Pm is the point between P0 and Pn which is furthest from P0-Pn; the next level (if full) will represent a four-line approximation, etc. You can then trim the tree either on the basis of depth or on the basis of distance of the inserted point from the parent line.
Edit: in fact, if you take the latter approach you don't need to build a tree. Instead you populate a priority queue where the priority is given by the distance of the inserted point from the parent line. Then when you've finished the queue tells you which points to remove (or keep, according to the order of the priorities).
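One possible reading of the priority-queue idea, sketched in Python. It is a best-first variant: the segment whose farthest point is the farthest overall is always split next, and splitting stops once exactly r points are kept. The function names are made up for this illustration, and distance is measured to the segment rather than to the infinite line, which is a simplification on my part.

```python
import heapq
import math


def point_segment_distance(p, a, b):
    """Distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def farthest(points, lo, hi):
    """Index and distance of the interior point farthest from segment lo-hi."""
    best_i, best_d = lo + 1, -1.0
    for i in range(lo + 1, hi):
        d = point_segment_distance(points[i], points[lo], points[hi])
        if d > best_d:
            best_i, best_d = i, d
    return best_i, best_d


def simplify_to_count(points, r):
    """Keep exactly r points (r >= 2), always including both endpoints."""
    n = len(points)
    if r >= n:
        return list(points)
    if r < 2:
        raise ValueError("r must be at least 2")
    keep = {0, n - 1}
    heap = []  # entries: (-distance, split index, lo, hi); max-heap via negation

    def push(lo, hi):
        if hi - lo > 1:  # the segment still has interior points
            i, d = farthest(points, lo, hi)
            heapq.heappush(heap, (-d, i, lo, hi))

    push(0, n - 1)
    while len(keep) < r:
        _, i, lo, hi = heapq.heappop(heap)
        keep.add(i)      # insert the most significant remaining point
        push(lo, i)
        push(i, hi)
    return [points[i] for i in sorted(keep)]


line = [(0, 0), (1, 0.1), (2, 5), (3, 0.1), (4, 0)]
print(simplify_to_count(line, 3))  # [(0, 0), (2, 5), (4, 0)]
```

Each split rescans its segment, so the worst case is quadratic in the number of points. For inputs of about a thousand points that is probably acceptable, but it is slower than a tolerance-based Douglas-Peucker.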
You can find my C++ implementation and article on Douglas-Peucker simplification here and here. I also provide a modified version of the Douglas-Peucker simplification that allows you to specify the number of points of the resulting simplified line. It uses a priority queue as mentioned by 'Peter Taylor'. It's a lot slower though, so I don't know if it would satisfy the 'is relatively fast' requirement.
I'm planning on providing an implementation for Lang simplification (and several others). Currently I don't see any easy way to adjust Lang to reduce to a fixed point count. If you could live with a less strict requirement, 'must allow me to reduce the dataset to an approximate number of points', then you could use an iterative approach. Guess an initial value for lookahead: point count / desired point count. Then slowly increase the lookahead until you approximately hit the desired point count.
I hope this helps.
p.s.: I just remembered something, you could also try the Visvalingam-Whyatt algorithm. In short:
-compute the triangle area for each point with its direct neighbors
-sort these areas
-remove the point with the smallest area
-update the area of its neighbors
-continue until n points remain
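The steps above can be sketched in Python roughly as follows. This is only an illustration of the outline, using a heap instead of a full sort and calling the target count r as in the question. The names visvalingam_whyatt and triangle_area are made up here. Endpoints are never removed, and outdated heap entries are skipped rather than deleted.

```python
import heapq


def triangle_area(a, b, c):
    """Area of the triangle a-b-c."""
    return abs((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])) / 2.0


def visvalingam_whyatt(points, r):
    """Remove smallest-area points until exactly r remain (r >= 2)."""
    n = len(points)
    if r >= n:
        return list(points)
    if r < 2:
        raise ValueError("r must be at least 2")

    prev = list(range(-1, n - 1))  # index of the surviving left neighbour
    nxt = list(range(1, n + 1))    # index of the surviving right neighbour
    area = [float("inf")] * n      # endpoints keep an infinite area
    alive = [True] * n

    heap = []
    for i in range(1, n - 1):
        area[i] = triangle_area(points[i - 1], points[i], points[i + 1])
        heap.append((area[i], i))
    heapq.heapify(heap)

    remaining = n
    while remaining > r:
        a, i = heapq.heappop(heap)
        if not alive[i] or a != area[i]:
            continue               # stale entry: removed or recomputed since
        alive[i] = False
        remaining -= 1
        p, q = prev[i], nxt[i]
        nxt[p], prev[q] = q, p     # unlink i from the neighbour chain
        for j in (p, q):           # update the area of its neighbours
            if 0 < j < n - 1:
                area[j] = triangle_area(points[prev[j]], points[j], points[nxt[j]])
                heapq.heappush(heap, (area[j], j))

    return [pt for pt, ok in zip(points, alive) if ok]


line = [(0, 0), (1, 0.01), (2, 3), (3, 0.02), (4, 0)]
print(visvalingam_whyatt(line, 3))  # [(0, 0), (2, 3), (4, 0)]
```

Because every removal is followed by a check on the remaining count, the output has exactly r points, which is the requirement the question was most concerned about.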
