I was looking for a precise step by step example of the Viterbi algorithm.
Considering sentence tagging with the input sentence as:
The cat saw the angry dog jump
And from this I would like to generate the most probable output as:
D N V T A N V
How do we use the Viterbi algorithm to get the above output using a trigram-HMM?
(PS: I'm looking for a precise step-by-step explanation, not a piece of code or a mathematical representation. Assume all probabilities are given as numbers.)
Thanks a ton!
I suggest that you look it up in one of the available books, e.g. Chris Bishop's "Pattern Recognition and Machine Learning". The Viterbi algorithm is a really basic thing and has been described at various levels of detail in the literature.
For the Viterbi algorithm and hidden Markov models, you first need the transition probabilities and the emission probabilities.
In your example, the transition probabilities are P(D->N) and P(N->V), and the emission probabilities (assuming a bigram model) are P(the|D) and P(cat|N).
Of course, in a real-world example there are a lot more words than the, cat, saw, etc. You have to loop through all your training data to get estimates of P(the|D), P(cat|N), P(car|N). Then we use the Viterbi algorithm to find the most likely sequence of tags, such as
D N V T A N V
given your observation.
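For concreteness, here is a rough sketch of how one might estimate those probabilities from a tagged training corpus; the two training sentences and all numbers below are made up purely for illustration:

from collections import Counter

# Hypothetical toy training data: sentences as (word, tag) pairs.
tagged_sentences = [
    [("the", "D"), ("cat", "N"), ("saw", "V"), ("the", "D"), ("dog", "N")],
    [("the", "D"), ("angry", "A"), ("dog", "N"), ("jump", "V")],
]

tag_count, transition_count, emission_count = Counter(), Counter(), Counter()
for sentence in tagged_sentences:
    for i, (word, tag) in enumerate(sentence):
        tag_count[tag] += 1
        emission_count[(word, tag)] += 1
        if i > 0:
            transition_count[(sentence[i - 1][1], tag)] += 1

# Maximum-likelihood estimates (no smoothing, for illustration only):
def p_transition(prev_tag, tag):
    return transition_count[(prev_tag, tag)] / float(tag_count[prev_tag])

def p_emission(word, tag):
    return emission_count[(word, tag)] / float(tag_count[tag])

print(p_transition("D", "N"))   # P(N | D) = 2/3 with this toy data
print(p_emission("cat", "N"))   # P(cat | N) = 1/3 with this toy data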
Here is my implementation of Viterbi.
import numpy as np

def viterbi(vocab, vocab_tag, words, tags, t_bigram_count, t_unigram_count, e_bigram_count, e_unigram_count, ADD_K):
    vocab_size = len(vocab)
    V = [{}]
    # Probability of the very first word under each tag (add-k smoothed, in log2 space)
    for t in vocab_tag:
        prob = np.log2(float(e_bigram_count.get((words[0], t), 0) + ADD_K)) \
               - np.log2(float(e_unigram_count[t] + vocab_size * ADD_K))
        V[0][t] = {"prob": prob, "prev": None}
    for i in range(1, len(words)):
        V.append({})
        for t in vocab_tag:
            V[i][t] = {"prob": np.log2(0), "prev": None}
        for t in vocab_tag:
            # Find the best previous tag for the transition into t
            max_trans_prob = np.log2(0)
            best_prev_tag = None
            for prev_tag in vocab_tag:
                trans_prob = np.log2(float(t_bigram_count.get((t, prev_tag), 0) + ADD_K)) \
                             - np.log2(float(t_unigram_count[prev_tag] + vocab_size * ADD_K))
                if V[i - 1][prev_tag]["prob"] + trans_prob > max_trans_prob:
                    max_trans_prob = V[i - 1][prev_tag]["prob"] + trans_prob
                    best_prev_tag = prev_tag
            max_prob = max_trans_prob \
                       + np.log2(e_bigram_count.get((words[i], t), 0) + ADD_K) \
                       - np.log2(float(e_unigram_count[t] + vocab_size * ADD_K))
            V[i][t] = {"prob": max_prob, "prev": best_prev_tag}
    opt = []
    previous = None
    max_prob = max(value["prob"] for value in V[-1].values())
    # Get the most probable final state and start the backtrack from it
    for st, data in V[-1].items():
        if data["prob"] == max_prob:
            opt.append(st)
            previous = st
            break
    # Follow the back-pointers to recover the full tag sequence
    for t in range(len(V) - 2, -1, -1):
        opt.insert(0, V[t + 1][previous]["prev"])
        previous = V[t + 1][previous]["prev"]
    return opt
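For reference, here is a hypothetical toy call; all counts below are made up, and the count dictionaries follow the same key conventions as the function above ((word, tag) for emissions, (tag, previous_tag) for transitions):

vocab = ["the", "cat", "saw"]
vocab_tag = ["D", "N", "V"]
words = ["the", "cat", "saw"]
t_bigram_count = {("N", "D"): 8, ("V", "N"): 6, ("D", "V"): 3}   # keyed (tag, prev_tag)
t_unigram_count = {"D": 10, "N": 10, "V": 6}
e_bigram_count = {("the", "D"): 9, ("cat", "N"): 4, ("saw", "V"): 3}
e_unigram_count = {"D": 10, "N": 10, "V": 6}

print(viterbi(vocab, vocab_tag, words, None, t_bigram_count, t_unigram_count,
              e_bigram_count, e_unigram_count, ADD_K=0.1))
# prints ['D', 'N', 'V'] with these made-up counts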
I am trying to implement the A-Chao version of weighted reservoir sampling, as shown in https://en.wikipedia.org/wiki/Reservoir_sampling#Algorithm_A-Chao
But I found that the pseudocode described in the wiki seems to be wrong, especially in the initialization part. I read the paper; it mentions we need to handle over-weighted data points, but I still cannot get the idea of how to initialize correctly.
In my understanding, in the initialization step, we want to make sure all initial data points chosen have the same probability*weight to be chosen. However, I don't understand how the over-weighted points are related to that.
Here is the code I implemented according to the wiki, but the results show it is incorrect.
// Note: getRandomInt(min, max) is assumed here to return a uniform random integer in [min, max].
const reservoirSampling = <T>(dataList: T[], k: number, getWeight: (point: T) => number): T[] => {
  const sampledList = dataList.slice(0, k);
  let currentWeightSum: number = sampledList.reduce((sum, item) => sum + getWeight(item), 0);
  for (let i = k; i < dataList.length; i++) {
    const currentItem = dataList[i];
    currentWeightSum += getWeight(currentItem);
    const probOfChoosingCurrentItem = getWeight(currentItem) / currentWeightSum;
    const rand = Math.random();
    if (rand <= probOfChoosingCurrentItem) {
      sampledList[getRandomInt(0, k - 1)] = currentItem;
    }
  }
  return sampledList;
};
The best way to get the distribution that Chao's algorithm produces is to implement VarOpt_k sampling, as in the pseudocode labeled Algorithm 1 in the paper by Cohen et al. that introduced VarOpt_k sampling.
That's an arXiv link and hence very stable, but to summarize, the idea is to separate the items into "heavy" (weight high enough to guarantee inclusion in the sample so far) and "light" (the others). Keep the heavy items in a priority queue where it is easy to remove the lightest of them. When a new item comes in, we have to determine whether it is heavy or light, and which heavy items became light (if any). Then there's a sampling procedure for dropping an item that treats the heavy → light items specially using weighted sampling and then falls back to choosing a uniform random light item (as in the easy case of Chao's algorithm).
The one trick with the pseudocode is that, if you use floating-point arithmetic, you have to be a little careful about "impossible" cases. Post your finished code on Code Review and ping me here if you would like feedback.
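To illustrate just the heavy/light split (not the full VarOpt_k machinery), here is a rough sketch; the function name and the iterative capping loop are my own paraphrase of "weight high enough to guarantee inclusion":

import numpy as np

def split_heavy_light(weights, k):
    # An item is "heavy" if its capped inclusion probability is 1: repeatedly mark
    # any item whose naive inclusion probability (remaining slots * w / light weight)
    # exceeds 1, then recompute for the items that are still light.
    weights = np.asarray(weights, dtype=float)
    heavy = np.zeros(len(weights), dtype=bool)
    while True:
        remaining = k - heavy.sum()
        light_weight = weights[~heavy].sum()
        newly_heavy = ~heavy & (remaining * weights > light_weight)
        if not newly_heavy.any():
            return heavy
        heavy |= newly_heavy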
You will find a Python implementation of Chao's strategy below. Here is a plot of 10000 samples from 0, ..., 99 with weights indicated by the yellow lines. The y-coordinate denotes how many times a given item was sampled.
I first implemented the pseudocode on Wikipedia, and agree completely with the OP that it is dead wrong. It then took me more than a day to understand Chao's paper. I also found the section of Tillé's book on Chao's method (see Algorithm 6.14 on page 120) helpful. (I don't know what the OP means by the issues with initialization.)
Disclaimer: I am new to python, and just tried to do my best. I think posting code might be more helpful than posting pseudocode. (Mainly I want to save someone a day's work getting to the bottom of Chao's paper!) If you do end up using this, I'd appreciate any feedback. Standard health warnings apply!
First, Chao's computation of inclusion probabilities:
import numpy as np
import random
def compute_Chao_probs(weights, total_weight, sample_size):
    """
    Consider a weighted population, some of its members, and their weights.
    This function returns a list of probabilities that these members are selected
    in a weighted sample of sample_size members of the population.
    Example 1: If all weights are equal, this probability is sample_size / (size of population).
    Example 2: If the size of our population is sample_size, then these probabilities are all 1.
    Naively we expect these probabilities to be given by sample_size*weight/total_weight; however,
    this may lead to a probability greater than 1. For example, consider a population
    of 3 with weights [3,1,1], and suppose we want to select 2 elements. The naive
    probability of selecting the first element is 2*3/5 > 1.
    We follow Chao's description: compute the naive guess, set any probs which are bigger
    than 1 to 1, rinse and repeat.
    We expect to call this routine many times, so we avoid for loops and try to make numpy do the work.
    """
    assert all(w > 0 for w in weights), "weights must be strictly positive."
    # heavy_items is a True/False array of length len(weights).
    # True indicates items deemed "heavy" (i.e. assigned probability 1).
    # At the outset, no items are heavy:
    heavy_items = np.zeros(len(weights), dtype=bool)
    while True:
        new_probs = (sample_size - np.sum(heavy_items)) / (total_weight - np.sum(heavy_items * weights)) * weights
        valid_probs = np.less_equal(np.logical_not(heavy_items) * new_probs, np.ones((len(weights))))
        if all(valid_probs):  # we are done
            return np.logical_not(heavy_items) * new_probs + heavy_items
        else:  # we need to declare some more items heavy
            heavy_items = np.logical_or(heavy_items, np.logical_not(valid_probs))
Then Chao's rejection rule:
def update_sample(current_sample, new_item, new_weight):
    """
    We have a weighted population, from which we have selected n items.
    We know their weights, the total_weight of the population, and the
    probability of their inclusion in the sample when we selected them.
    Now new_item arrives, with a new_weight. Should we take it or not?
    current_sample is a dictionary, with keys 'items', 'weights', 'probs'
    and 'total_weight'. This function updates current_sample according to
    Chao's recipe.
    """
    items = current_sample['items']
    weights = current_sample['weights']
    probs = current_sample['probs']
    total_weight = current_sample['total_weight']
    assert len(items) == len(weights) and len(weights) == len(probs)
    fixed_sample_size = len(weights)
    total_weight = total_weight + new_weight
    new_Chao_probs = compute_Chao_probs(np.hstack((weights, [new_weight])), total_weight, fixed_sample_size)
    if random.random() <= new_Chao_probs[-1]:  # we should take new_item
        #
        # Now we need to decide which element should be replaced.
        # Fix an index i in items, and let P denote probability. We have:
        # P(i is selected in previous step) = probs[i]
        # P(i is selected at current step) = new_Chao_probs[i]
        # Hence (by the law of conditional probability)
        # P(i is selected at current step | i is selected at previous step) = new_Chao_probs[i] / probs[i]
        # Thus:
        # P(i is not selected at current step | i is selected at previous step) = 1 - new_Chao_probs[i] / probs[i]
        # Now if we condition this on the assumption that the new element is taken, we get
        # 1/new_Chao_probs[-1]*(1 - new_Chao_probs[i] / probs[i]).
        #
        # (*I think* this is what Chao is talking about in the two paragraphs just before Section 3 in his paper.)
        rejection_weights = 1 / new_Chao_probs[-1] * (np.ones((fixed_sample_size)) - (new_Chao_probs[0:-1] / probs))
        # assert np.isclose(np.sum(rejection_weights),1)
        # In examples we see that np.sum(rejection_weights) is not necessarily 1.
        # I am a little confused by this, but ignore it for the moment.
        rejected_index = random.choices(range(fixed_sample_size), rejection_weights)[0]
        # Make the changes:
        current_sample['items'][rejected_index] = new_item
        current_sample['weights'][rejected_index] = new_weight
        current_sample['probs'] = new_Chao_probs[0:-1]
        current_sample['probs'][rejected_index] = new_Chao_probs[-1]
        current_sample['total_weight'] = total_weight
Finally, code to test and plot:
# Now we test Chao on some different distributions.
#
# This also illustrates how to use update_sample.
#
from collections import Counter
import matplotlib.pyplot as plt
n = 10  # sample size (each of the 10000 trials below draws a sample of n items)
items_in = list(range(100))
weights_in = [random.random() for _ in range(10)]
# other possible tests:
weights_in = [i+1 for i in range(10)] # staircase
#weights_in = [9-i+1 for i in range(10)] # upside down staircase
#weights_in = [(i+1)**2 for i in range(10)] # parabola
#weights_in = [10**i for i in range(10)] # a very heavy tailed distribution (to check numerical stability)
random.shuffle(weights_in) # sometimes it is fun to shuffle
weights_in = np.array([w for w in weights_in for _ in range(10)])
count = Counter({})
for j in range(10000):
    # we take the first n with probability 1:
    current_sample = {}
    current_sample['items'] = items_in[:n]
    current_sample['weights'] = np.array(weights_in[:n])
    current_sample['probs'] = np.ones((n))
    current_sample['total_weight'] = np.sum(current_sample['weights'])
    for i in range(n, len(items_in)):
        update_sample(current_sample, items_in[i], weights_in[i])
    count.update(current_sample['items'])
plt.figure(figsize=(20,10))
plt.plot(100000*np.array(weights_in)/np.sum(weights_in), 'yo')
plt.plot(list(count.keys()), list(count.values()), 'ro')
plt.show()
Consider this way of solving the Subset sum problem:
def subset_summing_to_zero(activities):
    subsets = {0: []}
    for (activity, cost) in activities.iteritems():
        old_subsets = subsets
        subsets = {}
        for (prev_sum, subset) in old_subsets.iteritems():
            subsets[prev_sum] = subset
            new_sum = prev_sum + cost
            new_subset = subset + [activity]
            if 0 == new_sum:
                new_subset.sort()
                return new_subset
            else:
                subsets[new_sum] = new_subset
    return []
I have it from here:
http://news.ycombinator.com/item?id=2267392
There is also a comment which says that it is possible to make it "more efficient".
How?
Also, are there any other ways to solve the problem which are at least as fast as the one above?
Edit
I'm interested in any kind of idea which would lead to speed-up. I found:
https://en.wikipedia.org/wiki/Subset_sum_problem#cite_note-Pisinger09-2
which mentions a linear time algorithm. But I don't have the paper, perhaps you, dear people, know how it works? An implementation perhaps? Completely different approach perhaps?
Edit 2
There is now a follow-up:
Fast solution to Subset sum algorithm by Pisinger
I respect the alacrity with which you're trying to solve this problem! Unfortunately, you're trying to solve a problem that's NP-complete, meaning that any further improvement that breaks the polynomial time barrier will prove that P = NP.
The implementation you pulled from Hacker News appears to be consistent with the pseudo-polytime dynamic programming solution, where any additional improvements must, by definition, progress the state of current research into this problem and all of its algorithmic isoforms. In other words: while a constant speedup is possible, you're very unlikely to see an algorithmic improvement to this solution to the problem in the context of this thread.
However, you can use an approximate algorithm if you require a polytime solution with a tolerable degree of error. In pseudocode blatantly stolen from Wikipedia, this would be:
initialize a list S to contain one element 0.
for each i from 1 to N do
    let T be a list consisting of xi + y, for all y in S
    let U be the union of T and S
    sort U
    make S empty
    let y be the smallest element of U
    add y to S
    for each element z of U in increasing order do
        // trim the list by eliminating numbers close to one another
        // and throw out elements greater than s
        if y + cs/N < z ≤ s, set y = z and add z to S
if S contains a number between (1 − c)s and s, output yes, otherwise no
Python implementation, preserving the original terms as closely as possible:
from bisect import bisect
def ssum(X, c, s):
    """ Simple impl. of the polytime approximate subset sum algorithm
    Returns True if the subset exists within our given error; False otherwise
    """
    S = [0]
    N = len(X)
    for xi in X:
        T = [xi + y for y in S]
        U = set().union(T, S)
        U = sorted(U)  # Coercion to list
        S = []
        y = U[0]
        S.append(y)
        for z in U:
            if y + (c*s)/N < z and z <= s:
                y = z
                S.append(z)
    if not c:  # For zero error, check equivalence
        return S[bisect(S, s) - 1] == s
    return bisect(S, (1-c)*s) != bisect(S, s)
... where X is your bag of terms, c is your precision (between 0 and 1), and s is the target sum.
For more details, see the Wikipedia article.
(Additional reference, further reading on CSTheory.SE)
While my previous answer describes the polytime approximate algorithm for this problem, a request was specifically made for an implementation of Pisinger's polytime dynamic programming solution when all xi in X are positive:
from bisect import bisect
def balsub(X, c):
    """ Simple impl. of Pisinger's generalization of KP for subset sum problems
    satisfying xi >= 0, for all xi in X. Returns the state array "st", which may
    be used to determine if an optimal solution exists to this subproblem of SSP.
    """
    if not X:
        return False
    X = sorted(X)
    n = len(X)
    b = bisect(X, c)
    r = X[-1]
    w_sum = sum(X[:b])
    stm1 = {}
    st = {}
    for u in range(c-r+1, c+1):
        stm1[u] = 0
    for u in range(c+1, c+r+1):
        stm1[u] = 1
    stm1[w_sum] = b
    for t in range(b, n+1):
        for u in range(c-r+1, c+r+1):
            st[u] = stm1[u]
        for u in range(c-r+1, c+1):
            u_tick = u + X[t-1]
            st[u_tick] = max(st[u_tick], stm1[u])
        for u in reversed(range(c+1, c+X[t-1]+1)):
            for j in reversed(range(stm1[u], st[u])):
                u_tick = u - X[j-1]
                st[u_tick] = max(st[u_tick], j)
    return st
Wow, that was headache-inducing. This needs proofreading, because, while it implements balsub, I can't define the right comparator to determine if the optimal solution to this subproblem of SSP exists.
I don't know much Python, but there is an approach called meet in the middle.
Pseudocode:
Divide activities into two subarrays, A1 and A2.
For both A1 and A2, calculate the subset-sum hashes, H1 and H2, the way you do it in your question.
for each (cost, a1) in H1
    if (H2.contains(-cost))
        return a1 + H2[-cost];
This will allow you to double the number of elements of activities you can handle in reasonable time.
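If it helps, here is a rough Python sketch of that idea (my own code, not the answerer's); activities is assumed to be a dict mapping each activity to its cost, as in the question:

def subset_summing_to_zero_mitm(activities):
    """Meet-in-the-middle sketch: split the items into two halves, enumerate
    all subset sums of each half, then look for a sum in one half whose
    negation appears in the other half."""
    items = list(activities.items())
    mid = len(items) // 2

    def all_subset_sums(half):
        sums = {0: []}          # subset sum -> one subset achieving it
        zero_subset = None      # non-empty subset of this half summing to zero, if any
        for activity, cost in half:
            for prev_sum, subset in list(sums.items()):
                new_sum = prev_sum + cost
                new_subset = subset + [activity]
                if new_sum == 0 and zero_subset is None:
                    zero_subset = new_subset
                if new_sum not in sums:
                    sums[new_sum] = new_subset
        return sums, zero_subset

    left, left_zero = all_subset_sums(items[:mid])
    right, right_zero = all_subset_sums(items[mid:])
    if left_zero is not None:
        return sorted(left_zero)
    if right_zero is not None:
        return sorted(right_zero)
    for s, subset in left.items():
        if s != 0 and -s in right:
            return sorted(subset + right[-s])
    return []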
I apologize for "discussing" the problem, but a "Subset Sum" problem where the x values are bounded is not the NP version of the problem. Dynamic programming solutions are known for bounded-x-value problems; they work by representing the x values as sums of unit lengths, and the number of fundamental iterations is linear in that total unit length of the x's. However, Subset Sum is in NP when the precision of the numbers equals N, that is, when the number of base-2 place values needed to state the x's is N. For N = 40, the x's have to be in the billions. In the NP problem the unit length of the x's increases exponentially with N, which is why the dynamic programming solutions are not a polynomial-time solution to the NP Subset Sum problem. That being the case, there are still practical instances of the Subset Sum problem where the x's are bounded and the dynamic programming solution is valid.
Here are three ways to make the code more efficient:
The code stores a list of activities for each partial sum. It is more efficient in terms of both memory and time to just store the most recent activity needed to make the sum, and work out the rest by backtracking once a solution is found.
For each activity the dictionary is repopulated with the old contents (subsets[prev_sum] = subset). It is faster to simply grow a single dictionary.
Splitting the values in two and applying a meet in the middle approach.
Applying the first two optimisations results in the following code which is more than 5 times faster:
def subset_summing_to_zero2(activities):
    subsets = {0: -1}
    for (activity, cost) in activities.iteritems():
        for prev_sum in subsets.keys():
            new_sum = prev_sum + cost
            if 0 == new_sum:
                new_subset = [activity]
                while prev_sum:
                    activity = subsets[prev_sum]
                    new_subset.append(activity)
                    prev_sum -= activities[activity]
                return sorted(new_subset)
            if new_sum in subsets: continue
            subsets[new_sum] = activity
    return []
Also applying the third optimisation results in something like:
def subset_summing_to_zero3(activities):
    A = activities.items()
    mid = len(A) // 2
    def make_subsets(A):
        subsets = {0: -1}
        for (activity, cost) in A:
            for prev_sum in subsets.keys():
                new_sum = prev_sum + cost
                if new_sum and new_sum in subsets: continue
                subsets[new_sum] = activity
        return subsets
    subsets = make_subsets(A[:mid])
    subsets2 = make_subsets(A[mid:])
    def follow_trail(new_subset, subsets, s):
        while s:
            activity = subsets[s]
            new_subset.append(activity)
            s -= activities[activity]
    new_subset = []
    for s in subsets:
        if -s in subsets2:
            follow_trail(new_subset, subsets, s)
            follow_trail(new_subset, subsets2, -s)
            if len(new_subset):
                break
    return sorted(new_subset)
Define bound to be the largest absolute value of the elements.
The algorithmic benefit of the meet in the middle approach depends a lot on bound.
For a low bound (e.g. bound=1000 and n=300) the meet in the middle only gets a factor of about 2 improvement over the first improved method. This is because the dictionary called subsets is densely populated.
However, for a high bound (e.g. bound=100,000 and n=30) the meet in the middle takes 0.03 seconds compared to 2.5 seconds for the first improved method (and 18 seconds for the original code).
For high bounds, the meet in the middle will take about the square root of the number of operations of the normal method.
It may seem surprising that meet in the middle is only twice as fast for low bounds. The reason is that the number of operations in each iteration depends on the number of keys in the dictionary. After adding k activities we might expect there to be 2**k keys, but if bound is small then many of these keys will collide, so we will only have O(bound*k) keys instead.
Thought I'd share my Scala solution for the pseudo-polytime algorithm discussed in the Wikipedia article. It's a slightly modified version: it figures out how many unique subsets there are. This is very much related to a HackerRank problem described at https://www.hackerrank.com/challenges/functional-programming-the-sums-of-powers. The coding style might not be excellent; I'm still learning Scala :) Maybe this is still helpful for someone.
import java.io.ByteArrayInputStream

object Solution extends App {
  var input = "1000\n2"
  System.setIn(new ByteArrayInputStream(input.getBytes()))
  println(calculateNumberOfWays(readInt, readInt))

  def calculateNumberOfWays(X: Int, N: Int) = {
    val maxValue = Math.pow(X, 1.0/N).toInt
    val listOfValues = (1 until maxValue + 1).toList
    val listOfPowers = listOfValues.map(value => Math.pow(value, N).toInt)
    val lists = (0 until maxValue).toList.foldLeft(List(List(0)): List[List[Int]]) ((newList, i) =>
      newList :+ (newList.last union (newList.last.map(y => y + listOfPowers.apply(i)).filter(z => z <= X)))
    )
    lists.last.count(_ == X)
  }
}
Give a Set S, partition the set into k disjoint subsets such that the difference of their sums is minimal.
Say S = {1,2,3,4,5} and k = 2, so { {3,4}, {1,2,5} }, since their sums {7,8} have minimal difference. For S = {1,2,3}, k = 2 it will be {{1,2},{3}}, since the difference in sums is 0.
The problem is similar to The Partition Problem from The Algorithm Design Manual, except Steven Skiena discusses a method to solve it without rearrangement.
I was going to try simulated annealing. So I was wondering if there is a better method?
Thanks in advance.
The pseudo-polytime algorithm for a knapsack can be used for k=2. The best we can do is sum(S)/2. Run the knapsack algorithm
for s in S:
    for i in 0 to sum(S):
        if arr[i] then arr[i+s] = true;
then look at sum(S)/2, followed by sum(S)/2 +/- 1, etc.
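A small Python sketch of that k = 2 approach (the function name is mine; it assumes positive integers, as in the examples above):

def best_two_way_partition(S):
    """Mark every reachable subset sum with the standard subset-sum DP,
    then return the reachable sum closest to sum(S)/2, plus its complement."""
    total = sum(S)
    reachable = [False] * (total + 1)
    reachable[0] = True
    for s in S:
        # iterate downwards so each element is used at most once
        for i in range(total - s, -1, -1):
            if reachable[i]:
                reachable[i + s] = True
    best = min((i for i in range(total + 1) if reachable[i]),
               key=lambda i: abs(total - 2 * i))
    return best, total - best

# best_two_way_partition([1, 2, 3, 4, 5]) -> (7, 8)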
For 'k>=3' I believe this is NP-complete, like the 3-partition problem.
The simplest way to do it for k>=3 is just to brute force it, here's one way, not sure if it's the fastest or cleanest.
import copy

arr = [1,2,3,4]

def t(k, accum, index):
    print accum, k
    if index == len(arr):
        if k == 0:
            return copy.deepcopy(accum)
        else:
            return []
    element = arr[index]
    result = []
    for set_i in range(len(accum)):
        if k > 0:
            clone_new = copy.deepcopy(accum)
            clone_new[set_i].append([element])
            result.extend(t(k-1, clone_new, index+1))
        for elem_i in range(len(accum[set_i])):
            clone_new = copy.deepcopy(accum)
            clone_new[set_i][elem_i].append(element)
            result.extend(t(k, clone_new, index+1))
    return result

print t(3, [[]], 0)
Simulated annealing might be good, but since the 'neighbors' of a particular solution aren't really clear, a genetic algorithm might be better suited to this. You'd start out by randomly picking a group of subsets and 'mutate' by moving numbers between subsets.
If the sets are large, I would definitely go for stochastic search. I don't know exactly what spinning_plate means by writing that "the neighborhood is not clearly defined". Of course it is: you either move one item from one set to another, or swap items between two different sets, and this is a simple neighborhood. I would use both operations in stochastic search (which in practice could be tabu search or simulated annealing).
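For what it's worth, here is a toy sketch of that kind of stochastic search in Python, using exactly those two moves (move one item, or swap two items between groups). The cooling schedule, iteration count, and names are made up and untuned; it is an illustration, not a tuned solver, and it assumes k >= 2:

import math
import random

def anneal_partition(S, k, iters=20000):
    """Toy simulated-annealing sketch for the k-way partition problem."""
    items = list(S)
    groups = [items[i::k] for i in range(k)]          # arbitrary initial split
    def cost(gs):
        sums = [sum(g) for g in gs]
        return max(sums) - min(sums)
    best = [g[:] for g in groups]
    for step in range(iters):
        T = max(1e-3, 1.0 - step / iters)             # crude linear cooling
        a, b = random.sample(range(k), 2)
        candidate = [g[:] for g in groups]
        if candidate[a] and (not candidate[b] or random.random() < 0.5):
            # move a random item from group a to group b
            candidate[b].append(candidate[a].pop(random.randrange(len(candidate[a]))))
        elif candidate[a] and candidate[b]:
            # swap a random item of group a with a random item of group b
            i = random.randrange(len(candidate[a]))
            j = random.randrange(len(candidate[b]))
            candidate[a][i], candidate[b][j] = candidate[b][j], candidate[a][i]
        else:
            continue
        delta = cost(candidate) - cost(groups)
        if delta <= 0 or random.random() < math.exp(-delta / T):
            groups = candidate
            if cost(groups) < cost(best):
                best = [g[:] for g in groups]
    return best

# anneal_partition({1, 2, 3, 4, 5}, 2) typically finds a split with sums 7 and 8,
# e.g. [[3, 4], [1, 2, 5]]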
How would you find the correct words in a long stream of characters?
Input :
"The revised report onthesyntactictheoriesofsequentialcontrolandstate"
Google's Output:
"The revised report on syntactic theories sequential controlandstate"
(which is close enough considering the time that they produced the output)
How do you think Google does it?
How would you increase the accuracy?
I would try a recursive algorithm like this:
Try inserting a space at each position. If the left part is a word, then recur on the right part.
Count the number of valid words / number of total words in all the final outputs. The one with the best ratio is likely your answer.
For example, giving it "thesentenceisgood" would run:
thesentenceisgood
    the sentenceisgood
        sent enceisgood
            enceisgood: OUT1: the sent enceisgood, 2/3
        sentence isgood
            is good
                go od: OUT2: the sentence is go od, 4/5
                is good: OUT3: the sentence is good, 4/4
        sentenceisgood: OUT4: the sentenceisgood, 1/2
    these ntenceisgood
        ntenceisgood: OUT5: these ntenceisgood, 1/2
So you would pick OUT3 as the answer.
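Here is a rough Python sketch of that recursion and ratio scoring. The code and names are mine; dictionary is assumed to be a set of valid words, and without memoization the recursion is exponential, so it is only meant to mirror the trace above:

def segmentations(text, dictionary):
    """All candidate splits: try a space at every position, recurse on the right
    part whenever the left part is a word, and keep a fallback that leaves the
    remaining text as one (possibly invalid) token."""
    if not text:
        return [[]]
    results = [[text]]      # fallback: keep the rest as a single token
    for i in range(1, len(text) + 1):
        left = text[:i]
        if left in dictionary:
            for rest in segmentations(text[i:], dictionary):
                results.append([left] + rest)
    return results

def best_segmentation(text, dictionary):
    """Pick the candidate with the highest ratio of valid words to total words."""
    def score(words):
        return sum(w in dictionary for w in words) / len(words)
    return max(segmentations(text, dictionary), key=score)

# best_segmentation("thesentenceisgood", {"the", "these", "sent", "sentence", "is", "go", "good"})
# -> ['the', 'sentence', 'is', 'good']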
Try a stochastic regular grammar (equivalent to hidden Markov models) with the following rules:
for every word in a dictionary:
    stream -> word_i stream with probability p_w
    word_i -> letter_i1 ... letter_in with probability q_w (this is the spelling of word_i)
stream -> letter stream with prob p (for any letter)
stream -> epsilon with prob 1
The probabilities could be derived from a training set, but see the following discussion.
The most likely parse is computed using the Viterbi algorithm, which has quadratic time complexity in the number of hidden states, in this case your vocabulary, so you could run into speed issues with large vocabularies. But what if you set all the p_w = 1, q_w = 1 and p = .5? This means these are probabilities in an artificial language model where all words are equally likely and all non-words are equally unlikely. Of course you could segment better if you didn't use this simplification, but the algorithm complexity goes down by quite a bit.
If you look at the recurrence relation in the Wikipedia entry you can try to simplify it for this special case. The Viterbi parse probability up to position k simplifies to
VP(k) = max_l ( VP(k-l) * (1 if text[k-l:k] is a word else .5^l) )
You can bound l by the maximum length of a word and check whether l letters form a word with a hash lookup. The complexity of this is independent of the vocabulary size and is O(<text length> * <max l>). Sorry, this is not a proof, just a sketch, but it should get you going.
Another potential optimization: if you create a trie of the dictionary, you can check whether a substring is a prefix of any correct word. So when you query text[k-l:k] and get a negative answer, you already know that the same is true for text[k-l:k+d] for any d. To take advantage of this you would have to rearrange the recursion significantly, so I am not sure this can be fully exploited (it can, see comment).
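For what it's worth, here is a small Python sketch of that simplified recurrence; the function and parameter names are mine, dictionary is assumed to be a set of words, and stretches that form no word come out as single-character tokens:

import math

def segment(text, dictionary, max_word_len=20, nonword_penalty=0.5):
    """best[k] holds log VP(k); back[k] remembers the split point achieving it."""
    n = len(text)
    best = [0.0] + [float("-inf")] * n
    back = [0] * (n + 1)
    for k in range(1, n + 1):
        for l in range(1, min(max_word_len, k) + 1):
            piece = text[k - l:k]
            gain = 0.0 if piece in dictionary else l * math.log(nonword_penalty)
            if best[k - l] + gain > best[k]:
                best[k] = best[k - l] + gain
                back[k] = k - l
    # Reconstruct the segmentation by walking the back-pointers.
    tokens, k = [], n
    while k > 0:
        tokens.append(text[back[k]:k])
        k = back[k]
    return tokens[::-1]

# segment("thesentenceisgood", {"the", "sentence", "is", "good"})
# -> ['the', 'sentence', 'is', 'good']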
Here is some code in Mathematica I started to develop for a recent code golf.
It is a minimal-matching, non-greedy, recursive algorithm. That means that the sentence "the pen is mighter than the sword" (without spaces) returns {"the pen is might er than the sword"} :)
findAll[s_] :=
  Module[{a = s, b = "", c, sy = "="},
    While[StringLength[a] != 0,
      j = "";
      While[(c = findFirst[a]) == {} && StringLength[a] != 0,
        j = j <> StringTake[a, 1];
        sy = "~";
        a = StringDrop[a, 1];
      ];
      b = b <> " " <> j;
      If[c != {},
        b = b <> " " <> c[[1]];
        a = StringDrop[a, StringLength[c[[1]]]];
      ];
    ];
    Return[{StringTrim[StringReplace[b, "  " -> " "]], sy}];
  ]

findFirst[s_] :=
  If[s != "" && (c = DictionaryLookup[s]) == {},
    findFirst[StringDrop[s, -1]], Return[c]];
Sample Input
ss = {"twodreamstop",
"onebackstop",
"butterfingers",
"dependentrelationship",
"payperiodmatchcode",
"labordistributioncodedesc",
"benefitcalcrulecodedesc",
"psaddresstype",
"ageconrolnoticeperiod",
"month05",
"as_benefits",
"fname"}
Output
twodreamstop = two dreams top
onebackstop = one backstop
butterfingers = butterfingers
dependentrelationship = dependent relationship
payperiodmatchcode = pay period match code
labordistributioncodedesc ~ labor distribution coded es c
benefitcalcrulecodedesc ~ benefit c a lc rule coded es c
psaddresstype ~ p sad dress type
ageconrolnoticeperiod ~ age con rol notice period
month05 ~ month 05
as_benefits ~ as _ benefits
fname ~ f name
HTH
Check out spelling-correction algorithms. Here is a link to an article on the algorithm used at Google: http://www.norvig.com/spell-correct.html. There you will also find a scientific paper on this topic from Google.
After doing the recursive splitting and dictionary lookup, to increase the quality of word pairs in your phrase you might be interested in employing the mutual information of word pairs.
This is essentially going through a training set and finding out the M.I. values of word pairs, which tell you that Albert Simpson is less likely than Albert Einstein :)
You can try searching Science Direct for academic papers on this theme. For basic information on mutual information see http://en.wikipedia.org/wiki/Mutual_information
Last year I was involved in the phrase-search part of a search engine project, in which I was trying to parse through the Wikipedia dataset and rank each word pair. I've got the code in C++ and could share it with you if you can find some use for it. It parses wikimedia and, for every word pair, finds out the mutual information.
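If it helps, here is a tiny sketch of that word-pair scoring using pointwise mutual information (one common flavour of the idea; the names and the toy corpus are mine):

import math
from collections import Counter

def make_pmi(corpus_sentences):
    """PMI(x, y) = log( P(x, y) / (P(x) * P(y)) ), with the probabilities
    estimated from adjacent-word-pair counts in a training corpus."""
    unigrams, bigrams, total = Counter(), Counter(), 0
    for sentence in corpus_sentences:
        words = sentence.split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
        total += len(words)
    pair_total = sum(bigrams.values())
    def pmi(x, y):
        if bigrams[(x, y)] == 0 or unigrams[x] == 0 or unigrams[y] == 0:
            return float("-inf")
        p_xy = bigrams[(x, y)] / pair_total
        p_x, p_y = unigrams[x] / total, unigrams[y] / total
        return math.log(p_xy / (p_x * p_y))
    return pmi

# pmi = make_pmi(["albert einstein was a physicist", "albert einstein won the nobel prize"])
# pmi("albert", "einstein") > pmi("albert", "was")   # the latter pair never occurs -> -inf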
I'm trying to create my own implementation of a puzzle game.
To create my game board, I need to traverse each square in my array once and only once.
The traversal needs to be linked to an adjacent neighbor (horizontal, vertical or diagonal).
I'm using an array structure of the form:
board[n,m] = byte
Each bit of the byte represents a direction 0..7 and exactly 2 bits are always set
Directions are numbered clockwise
0 1 2
7 . 3
6 5 4
Board[0,0] must have some combination of bits 3,4,5 set
My current approach for constructing a random path is:
Start at a random position
While (Squares remain)
    if no directions from this square are valid, step backwards
    Pick random direction from those remaining in bitfield for this square
    Erase the direction to this square from those not picked
    Move to the next square
This algorithm takes far too long on some medium sized grids, because earlier choices remove areas from consideration.
What I'd like to have is a function that takes an index into every possible path, and returns with the array filled in for that path. This would let me provide a 'seed' value to return to this particular board.
Other suggestions are welcome.
Essentially, you want to construct a Hamiltonian path: A path that visits each node of a graph exactly once.
In general, even if you only want to test whether a graph contains a Hamiltonian path, that's already NP-complete. In this case, it's obvious that the graph contains at least one Hamiltonian path, and the graph has a nice regular structure -- but enumerating all Hamiltonian paths still seems to be a difficult problem.
The Wikipedia entry on the Hamiltonian path problem has a randomized algorithm for finding a Hamiltonian path that is claimed to be "fast on most graphs". It's different from your algorithm in that it swaps around a whole branch of the path instead of backtracking by just one node. This more "aggressive" strategy might be faster -- try it and see.
You could let your users enter the seed for the random number generator to reconstruct a certain path. You still wouldn't be enumerating all possible paths, but I guess that's probably not necessary for your application.
This was originally a comment to Martin B's post, but it grew a bit, so I'm posting it as an "answer".
Firstly, the problem of enumerating all possible Hamiltonian paths is closely related to the #P complexity class, NP's terrifying older brother. Wikipedia, as always, can explain. It is for this reason that I hope you don't really need to know every path, but just most. If not, well, there's still hope if you can exploit the structure of your graphs.
(Here's a fun way of looking at why enumerating all paths is really hard: let's say you have 10 paths for the graph, and you think you're done. And evil old me comes over and says "well, prove to me that there isn't an 11th path". Having an answer that could both convince me and not take a while to demonstrate would be pretty impressive! Proving that there doesn't exist any path is co-NP, and that's also quite hard. Proving that there only exists a certain number, that sounds very hard. It's late here, so I'm a bit fuzzy-headed, but so far I think everything I've said is correct.)
Anyways, my google-fu seems particularly weak on this subject, which is surprising, so I don't have a cold answer for you, but I have some hints?
Firstly, if each node has at most 2 outgoing edges, then the number of edges is linear in the number of vertices, and I'm pretty sure that means it's a "sparse" graph, which is typically happier to deal with. But the Hamiltonian path problem is exponential in the number of edges, so there's no really easy way out. :)
Secondly, if the structure of your graph lends itself to this, it may be possible to break the problem up into smaller chunks. Min-cut and strongly-connected-components come to mind. The basic idea would be, ideally, finding that your big graph is in fact, really, 4 smaller graphs with just a few connecting edges between the smaller graphs. Running your HamPath algorithm on those smaller chunks would be considerably faster, and the hope would be that it would be somewhat easy to piece together the sub-graph-paths into the whole-graph-paths. Coincidentally, these are great tongue-twisters. The downside is that these algorithms are a bit difficult to implement, and will require a lot of thought and care to use properly (in the combining of the HamPaths). Furthermore, if your graphs are in fact quite thoroughly connected, this whole paragraph was for naught! :)
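As a hypothetical illustration of that decomposition idea, here is a small NetworkX sketch that uses connected components and articulation points as a cheap stand-in for the min-cut / strongly-connected-component machinery (the two joined grid blocks are made up):

import networkx as nx

# Two 3x3 grid blocks joined by a single edge, standing in for "your big graph".
left = nx.grid_2d_graph(3, 3)
right = nx.relabel_nodes(nx.grid_2d_graph(3, 3), lambda n: (n[0] + 10, n[1]))
G = nx.union(left, right)
G.add_edge((2, 2), (10, 0))   # the lone connection between the two blocks

print(list(nx.connected_components(G)))   # one component here; several would mean separate chunks
print(list(nx.articulation_points(G)))    # the bridge endpoints (2, 2) and (10, 0) show where to cut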
So those were just a few ideas. If there is any other aspect of your graph (a particular starting point is required, or a particular ending point, or additional rules about what node can connect to which), it may be quite helpful! The border between NP-complete problems and P (fast) algorithms is often quite thin.
Hope this helps, -Agor.
I would start with a trivial path through the grid, then randomly select locations on the board (using the input seed to seed the random number generator) and perform "switches" on the path where possible. e.g.:
------ to: --\/--
------ --/\--
With enough switches you should get a random-looking path.
Bizarrely enough, I spent an undergrad summer in my Math degree studying this problem as well as coming up with algorithms to address it. First I will comment on the other answers.
Martin B: Correctly identified this as a Hamiltonian path problem. However, if the graph is a regular grid (as you two discussed in the comments) then a Hamiltonian path can trivially be found (snaking row-major order for example.)
agnorest: Correctly talked about the Hamiltonian path problem being in a class of hard problems. agnorest also alluded to possibly exploiting the graph structure, which funny enough, is trivial in the case of a regular grid.
I will now continue the discussion by elaborating on what I think you are trying to achieve. As you mentioned in the comments, it is trivial to find certain classes of "space-filling" non intersecting "walks" on a regular lattice/grid. However, it seems you are not satisfied only with these and would like to find an algorithm that finds "interesting" walks that cover your grid at random. But before I explore that possibility, I'd like to point out that the "non-intersecting" property of these walks is extremely important and what causes the difficulties of enumerating them.
In fact, the thing that I studied in my summer internship is called the Self Avoiding Walk. Surprisingly, the study of SAWs (self-avoiding walks) is very important to a few sub-domains of modeling in physics (and was a key ingredient of the Manhattan project!). The algorithm you gave in your question is actually a variant of an algorithm known as the "growth" algorithm or Rosenbluth's algorithm (named after none other than Marshall Rosenbluth). More details on both the general version of this algorithm (used to estimate statistics on SAWs) as well as its relation to physics can be readily found in literature like this thesis.
SAWs in 2 dimensions are notoriously hard to study. Very few theoretical results are known about SAWs in 2 dimensions. Oddly enough, in higher than 3 dimensions you can say that most properties of SAWs are known theoretically, such as their growth constant. Suffice it to say, SAWs in 2 dimensions are very interesting things!
Reining in this discussion to talk about your problem at hand: you are probably finding that your implementation of the growth algorithm gets "cut off" quite frequently and cannot expand to fill the whole of your lattice. In this case, it is more appropriate to view your problem as a Hamiltonian path problem. My approach to finding interesting Hamiltonian paths would be to formulate the problem as an integer linear program (ILP) and fix random edges to be used in the ILP. The fixing of random edges would give randomness to the generation process, and the ILP portion would efficiently compute whether certain configurations are feasible and, if they are, return a solution.
The Code
The following code implements the Hamiltonian path or cycle problem for arbitrary initial conditions. It implements it on a regular 2D lattice with 4-connectivity. The formulation follows the standard sub-tour elimination ILP Lagrangian. The sub-tours are eliminated lazily, meaning there can be potentially many iterations required.
You could augment this to satisfy random or otherwise initial conditions that you deem "interesting" for your problem. If the initial conditions are infeasible it terminates early and prints this.
This code depends on NetworkX and PuLP.
"""
Hamiltonian path formulated as ILP, solved using PuLP, adapted from
https://projects.coin-or.org/PuLP/browser/trunk/examples/Sudoku1.py
Authors: ldog
"""
# Import PuLP modeler functions
from pulp import *
# Solve for Hamiltonian path or cycle
solve_type = 'cycle'
# Define grid size
N = 10
# If solving for path a start and end must be specified
if solve_type == 'path':
    start_vertex = (0,0)
    end_vertex = (5,5)
# Assuming 4-connectivity (up, down, left, right)
Edges = ["up", "down", "left", "right"]
Sequence = [ i for i in range(N) ]
# The Rows and Cols sequences follow this form, Vals will be which edge is used
Vals = Edges
Rows = Sequence
Cols = Sequence
# The prob variable is created to contain the problem data
prob = LpProblem("Hamiltonian Path Problem",LpMinimize)
# The problem variables are created
choices = LpVariable.dicts("Choice",(Vals,Rows,Cols),0,1,LpInteger)
# The arbitrary objective function is added
prob += 0, "Arbitrary Objective Function"
# A constraint ensuring that exactly two edges per node are used
# (a requirement for the solution to be a walk or path.)
for r in Rows:
    for c in Cols:
        if solve_type == 'cycle':
            prob += lpSum([choices[v][r][c] for v in Vals]) == 2, ""
        elif solve_type == 'path':
            if (r,c) == end_vertex or (r,c) == start_vertex:
                prob += lpSum([choices[v][r][c] for v in Vals]) == 1, ""
            else:
                prob += lpSum([choices[v][r][c] for v in Vals]) == 2, ""
# A constraint ensuring that edges between adjacent nodes agree
for r in Rows:
    for c in Cols:
        # The up direction
        if r > 0:
            prob += choices["up"][r][c] == choices["down"][r-1][c], ""
        # The down direction
        if r < N-1:
            prob += choices["down"][r][c] == choices["up"][r+1][c], ""
        # The left direction
        if c > 0:
            prob += choices["left"][r][c] == choices["right"][r][c-1], ""
        # The right direction
        if c < N-1:
            prob += choices["right"][r][c] == choices["left"][r][c+1], ""
# Ensure boundaries are not used
for c in Cols:
    prob += choices["up"][0][c] == 0, ""
    prob += choices["down"][N-1][c] == 0, ""
for r in Rows:
    prob += choices["left"][r][0] == 0, ""
    prob += choices["right"][r][N-1] == 0, ""
# Random conditions can be specified to give "interesting" paths or cycles
# that have this condition.
# In the simplest case, just specify one node with fixed edges used.
prob += choices["down"][2][2] == 1,""
prob += choices["up"][2][2] == 1,""
# Keep solving while the smallest cycle is not the whole thing
while True:
    # The problem is solved using PuLP's choice of Solver
    prob.solve()
    # The status of the solution is printed to the screen
    status = LpStatus[prob.status]
    print "Status:", status
    if status == 'Infeasible':
        print 'The set of conditions imposed are impossible to solve for. Change the conditions.'
        break

    # Build an undirected graph from the chosen edges so we can detect subtours
    import networkx as nx
    g = nx.Graph()
    g.add_nodes_from([i for i in range(N*N)])
    for r in Rows:
        for c in Cols:
            if value(choices['up'][r][c]) == 1:
                nr = r - 1
                nc = c
                g.add_edge(r*N+c, nr*N+nc)
            if value(choices['down'][r][c]) == 1:
                nr = r + 1
                nc = c
                g.add_edge(r*N+c, nr*N+nc)
            if value(choices['left'][r][c]) == 1:
                nr = r
                nc = c - 1
                g.add_edge(r*N+c, nr*N+nc)
            if value(choices['right'][r][c]) == 1:
                nr = r
                nc = c + 1
                g.add_edge(r*N+c, nr*N+nc)

    # Get connected components sorted by length
    cc = sorted(nx.connected_components(g), key=len)

    # For the shortest, add the remove-cycle (subtour elimination) condition
    def ngb(idx, v):
        r = idx/N
        c = idx%N
        if v == 'up':
            nr = r - 1
            nc = c
        if v == 'down':
            nr = r + 1
            nc = c
        if v == 'left':
            nr = r
            nc = c - 1
        if v == 'right':
            nr = r
            nc = c + 1
        return nr*N+nc

    prob += lpSum([choices[v][idx/N][idx%N] for v in Vals for idx in cc[0] if ngb(idx,v) in cc[0]]) \
            <= 2*(len(cc[0]))-1, ""

    # Pretty print the solution
    if len(cc[0]) == N*N:
        print ''
        print '***************************'
        print ' This is the final solution'
        print '***************************'
    for r in Rows:
        s = ""
        for c in Cols:
            if value(choices['up'][r][c]) == 1:
                s += " | "
            else:
                s += "   "
        print s
        s = ""
        for c in Cols:
            if value(choices['left'][r][c]) == 1:
                s += "-"
            else:
                s += " "
            s += "X"
            if value(choices['right'][r][c]) == 1:
                s += "-"
            else:
                s += " "
        print s
        s = ""
        for c in Cols:
            if value(choices['down'][r][c]) == 1:
                s += " | "
            else:
                s += "   "
        print s

    if len(cc[0]) != N*N:
        print 'Press key to continue to next iteration (which eliminates a suboptimal subtour) ...'
    elif len(cc[0]) == N*N:
        print 'Press key to terminate'
    raw_input()
    if len(cc[0]) == N*N:
        break