All combination tree algorithm

I have a question related to trees. I have about 100 sentences on a topic like "car"; those sentences all basically talk about a car. If a user submits a query such as "Find all combinations of word links between the words 'engine' and 'oil'", I want to find all the possible word links so that "engine" and "oil" are connected through any number of shared words across sentences.
For instance:
Engine is hot when it runs.
Car has an engine.
Car use an oil.
In this case the answer will be: engine->car->oil (a three-word combination). And I want to find all the possible combinations so that in the end "engine" and "oil" connect with each other. It is not the shortest path or the longest path, but all possible paths, running through any words in any direction. It is even possible to have 1,000-word combinations connecting "engine" and "oil", as long as the paths are distinct, of course.
Is there a way to do this? I tried using breadth-first search, but it is a little tricky. For instance, combinations could be:
engine->car->run->stop->oil
engine->car->oil
engine->fast->brake->oil
Can anyone please help me with this? What is the logic and idea here? I can't ignore a word I have already visited, because that would stop the algorithm right there and not give me all the links.
Any help and insights are appreciated.
Thanks.
fa323

Your question is so underspecified... that is, even in your example, it is unclear why the answer doesn't include engine->an->oil.
Also, it doesn't actually have anything to do with trees, rather it deals with graphs.
The first thing you need to do is determine how to build your graph. A reasonable way of doing this would be to have an edge between two words if they both appear in a particular sentence.
Then you have to decide what you want as output. I highly doubt that you want all paths. Why? Well, if you build the graph I described, then even using just your first sentence there are 24 paths from engine to oil, not counting paths with cycles. However, if that is what you want anyway, you can find all non-cyclic paths in a graph via depth-first search, where you mark a node as visited when you push it onto the stack and unmark it when you take it off.
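A minimal sketch of both steps in Python (naive tokenization; it ignores stop words entirely, which is exactly why a path through "an" also shows up):

from collections import defaultdict
from itertools import combinations

def build_graph(sentences):
    # Edge between two words whenever they co-occur in a sentence.
    graph = defaultdict(set)
    for sentence in sentences:
        words = set(sentence.lower().rstrip('.').split())
        for a, b in combinations(words, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph

def all_simple_paths(graph, start, goal):
    # Depth-first search: a word is "visited" only while it is on the
    # current path, so it can still be reused on other paths.
    paths, path = [], [start]
    def dfs(node):
        if node == goal:
            paths.append(list(path))
            return
        for nxt in graph[node]:
            if nxt not in path:      # skip cycles only
                path.append(nxt)
                dfs(nxt)
                path.pop()           # unmark on backtrack
    dfs(start)
    return paths

sentences = ["Engine is hot when it runs.",
             "Car has an engine.",
             "Car use an oil."]
print(all_simple_paths(build_graph(sentences), "engine", "oil"))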

Related

Shortest path from one word to another via valid words (no graph)

I came across this variation of edit-distance problem:
Find the shortest path from one word to another, for example storm->power, validating each intermediate word by using an isValidWord() function. There is no other access to the dictionary of words, and therefore a graph cannot be constructed.
I am trying to figure this out, but it doesn't seem to be a distance-related problem per se. Use simple recursion, maybe? But then how do you know that you're going in the right direction?
Anyone else find this interesting? Looking forward to some help, thanks!
This is a puzzle from Lewis Carroll known as Word Ladders. Donald Knuth covers this in The Stanford GraphBase.
You can view it as a breadth-first search. You will need access to a dictionary of words, otherwise the space you have to search will be huge. If you only have access to isValidWord(), you can generate all single-letter edits of the current word and use isValidWord() to filter them down (Norvig's "How to Write a Spelling Corrector" is a great explanation of generating the edits).
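A sketch of that BFS, with isValidWord() passed in as a callback; neighbors() here generates only single-letter substitutions (add inserts and deletes if the puzzle allows them):

from collections import deque
from string import ascii_lowercase

def neighbors(word):
    # All single-letter substitutions of word.
    for i in range(len(word)):
        for c in ascii_lowercase:
            if c != word[i]:
                yield word[:i] + c + word[i+1:]

def word_ladder(start, goal, is_valid_word):
    # Breadth-first search guarantees the first ladder found is shortest.
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in neighbors(path[-1]):
            if nxt not in seen and is_valid_word(nxt):
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no ladder exists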
You can guide the search by trying to minimize the edit distance between where you currently are and where you want to be. For example, generate the space of all nodes to search and sort by minimum edit distance, then follow the links that are closest (i.e. minimize the edit distance) to the target first. In the example, follow the nodes that are closest to "power".
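For instance, a greedy best-first version of the search (a hypothetical sketch; it reuses neighbors() from the BFS snippet above and uses a letters-still-wrong count as the distance):

import heapq

def hamming(a, b):
    # Letters still wrong; a cheap stand-in for edit distance
    # when the two words have the same length.
    return sum(x != y for x, y in zip(a, b))

def guided_ladder(start, goal, is_valid_word):
    # Always expand the candidate word closest to the goal first.
    # Greedy, so fast but not guaranteed to find the shortest ladder.
    heap, seen = [(hamming(start, goal), [start])], {start}
    while heap:
        _, path = heapq.heappop(heap)
        if path[-1] == goal:
            return path
        for nxt in neighbors(path[-1]):
            if nxt not in seen and is_valid_word(nxt):
                seen.add(nxt)
                heapq.heappush(heap, (hamming(nxt, goal), path + [nxt]))
    return None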
I found this interesting as well, so there's a Haskell implementation here which works reasonably well. There's a link in the comments to a Clojure version which has some really nice visualizations.
You can search from two sides at the same time. I.e. change a letter in storm and run it through isValidWord(), and change a letter in power and run it through isValidWord(). If those two searches ever produce the same word, you have found a path.
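A sketch of that bidirectional idea (again reusing neighbors() from above; it only reports that a ladder exists, since reconstructing the actual words requires keeping parent pointers on both sides):

def bidirectional_ladder(start, goal, is_valid_word):
    # Grow BFS frontiers from both ends and stop when they touch.
    front, back = {start}, {goal}
    seen_front, seen_back = {start}, {goal}
    while front and back:
        if front & seen_back or back & seen_front:
            return True               # the frontiers met: a ladder exists
        if len(front) > len(back):    # always expand the smaller side
            front, back = back, front
            seen_front, seen_back = seen_back, seen_front
        front = {n for w in front for n in neighbors(w)
                 if is_valid_word(n) and n not in seen_front}
        seen_front |= front
    return False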

Number of simple mutations to change one string to another?

I'm sure you've all heard of the "Word game", where you try to change one word to another by changing one letter at a time, and only going through valid English words. I'm trying to implement an A* Algorithm to solve it (just to flesh out my understanding of A*) and one of the things that is needed is a minimum-distance heuristic.
That is, the minimum number of one of these three mutations that can turn an arbitrary string a into another string b:
1) Change one letter for another
2) Add one letter at a spot before or after any letter
3) Remove any letter
Examples
aabca => abaca:
aabca
abca
abaca
= 2
abcdebf => bgabf:
abcdebf
bcdebf
bcdbf
bgdbf
bgabf
= 4
I've tried many algorithms out; I can't seem to find one that gives the actual answer every time. In fact, sometimes I'm not sure if even my human reasoning is finding the best answer.
Does anyone know any algorithm for such purpose? Or maybe can help me find one?
(Just to clarify, I'm asking for an algorithm that can turn any arbitrary string into any other, disregarding whether they are valid English words.)
You want the minimum edit distance (or Levenshtein distance):
The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character. It is named after Vladimir Levenshtein, who considered this distance in 1965.
And one algorithm to determine the editing sequence is on the same page here.
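For reference, a minimal sketch of the standard dynamic program, checked against the two examples from the question:

def edit_distance(a, b):
    # dp[i][j] = minimum edits to turn a[:i] into b[:j].
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                        # delete everything
    for j in range(n + 1):
        dp[0][j] = j                        # insert everything
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # remove a letter
                           dp[i][j - 1] + 1,        # add a letter
                           dp[i - 1][j - 1] + cost) # change a letter
    return dp[m][n]

assert edit_distance("aabca", "abaca") == 2
assert edit_distance("abcdebf", "bgabf") == 4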
An excellent reference on "Edit distance" is section 6.3 of the Algorithms textbook by S. Dasgupta, C. H. Papadimitriou, and U. V. Vazirani, a draft of which is available freely here.
If you have a reasonably sized (small) dictionary, a breadth-first tree search might work.
So start with all the words your word can mutate into, then all the words those can mutate into (except the original), then go down to the third level... until you find the word you are looking for.
You could eliminate divergent words (ones further away from the target), but doing so might cause you to fail in a case where you must go through some divergent state to reach the shortest path.

Calculate shortest path through a grocery store

I'm trying to find a way to find the shortest path through a grocery store, visiting a list of locations (shopping list). The path should start at a specified start position and can end at multiple end positions (there are multiple checkout counters).
Also, I have some predefined constraints on the path, such as "item x on the shopping list needs to be the last, second last, or third last item on the path". There is a function that will return true or false for a given path.
Finally, this needs to be calculated with limited CPU power (on a smartphone) and within a second or so. If this isn't possible, then an approximation to the optimal path is also ok.
Is this possible? So far I think I need to start by calculating the distance between every item on the list using something like A* or Dijkstra's. After that, should I treat it like the traveling salesman problem? Because in my problem there is a specified start node, specified end nodes, and some constraints, which are not in the traveling salesman problem.
It seems like viewing it as a TSP makes it more difficult. Someone pointed out that grocery stores aren't that complicated. In the grocery stores I am familiar with (in the US), there aren't that many reasonable routes, especially if you have a given starting point. I think a well thought-out heuristic will probably do the trick.
For example: I typically start at one end. If it's a big trip I make sure I go through the frozen foods last, but often it doesn't matter and I'll start closest to where I enter the store. I generally walk around the outside, only going down individual aisles if I need something in them. Once you go into an aisle, pick up everything you need in it. With some aisles it's better to drop into one end, grab the item, and go back to your starting point; with others you just commit to the whole aisle. It's a function of the last item you need in that aisle and where you need to be next: how you get out of the aisle depends on the next item needed, and it may or may not involve a backtrack, but the computer can easily calculate the shortest path to the next item.
So I agree with the helpful hints of the other answers above, but maybe a less general algorithm will work, and it will probably work better with limited resources. TSP tells us, however, that you can't prove such an approach is optimal, but I suspect that's not really necessary...
Well, basically it is a traveling salesman problem, but with your requirements you limit the search space pretty well. If there are not too many locations, you could try to calculate all possible options, but if that is not feasible, you need to limit the search space even more.
You can make the search time-limited so that you return the shortest path found within one second. It might not be the shortest of them all, but it should be pretty short anyway.
You could also apply a greedy algorithm to pick the next-closest location, then use backtracking to try the next-closest alternatives, and so on.
Pseudocode for a possible solution (cleaned up into a runnable sketch; calc_len() and distance() are assumed helpers built from your precomputed pairwise distances):

import time

START = time.time()
TIME_LIMIT = 1.0   # seconds of CPU budget

def find_shortest_path(current_path, remaining, best_path):
    if time.time() - START > TIME_LIMIT:
        return best_path
    if not remaining:
        # destination reached
        if best_path is None or calc_len(current_path) < calc_len(best_path):
            best_path = current_path
        return best_path
    # You need to sort the possible locations well (nearest first) to
    # maximize your chances of finding the shortest solution early.
    for location in sorted(remaining, key=lambda l: distance(current_path[-1], l)):
        best_path = find_shortest_path(current_path + [location],
                                       remaining - {location}, best_path)
    return best_path

# e.g. find_shortest_path([start_node], set(shopping_list), None)
Well, you could limit the search-space using information about the layout of the store. For instance, a typical store (at least here in Germany) has many shelves in it that can be considered to form lanes. In between there are orthogonal lanes that connect the shelf-lanes. Now you define the crossings to be the nodes and the lanes to be edges in a graph. The edges are labeled with all the items in the shelves of that section of lane. Now, even for a big store this graph would be pretty small. You would now have to find the shortest path that includes all the edge-labels (items) you need. This should be possible by using the greedy/backtracking approach Tuomas Pelkonen suggested.
This is just an idea, and I do not know if it really works, but maybe you can take it from here.
Only breadth-first searching will make sure you don't "miss" a path through the store which is better than your current "best" solution, but you need not search every node in the path. Nodes which are "obviously" longer than the current "best" solution may be expanded later.
This means you approach the problem like a "breadth-first" search, but alter the expansion of your nodes based on the current distance travelled. Some branches of the search tree will expand faster than others, because they manage to visit more nodes in the same amount of time.
So if node expansion is not truly breadth-first, why do I keep using that word? Because after you do find a solution, you must still expand the current set of "considered nodes" until each one of those search paths exceeds the solution. This avoids missing a path which has a lot of time-consuming initial legs but finishes faster than the current solution because the last step is lightning fast.
The requirement of a start node is fictitious. Using the TSP you'll end up with a tour where you can choose whatever start node you want without altering the solution cost.
It's somewhat trickier when it comes to the counters: what you need is to solve the problem on a directed graph with some arcs missing (or, which amounts to the same thing, where some arcs have a really high cost).
Starting with the complete directed graph, you should modify the costs of the proper arcs (a sketch follows the list) in order to:
deny going from the items to the start node
deny going from the counters to the items
deny going from the start node to the counters
allow going from the counters to the start node at zero cost (this is only needed to close the tour)
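A minimal sketch of those cost modifications (assuming a hypothetical dict-of-dicts cost matrix):

import math

def constrain_costs(cost, start, items, counters):
    # cost[u][v] is the directed arc cost on the complete graph;
    # mutate it so any optimal tour runs start -> items -> one counter.
    INF = math.inf
    for i in items:
        cost[i][start] = INF      # deny item -> start
        for c in counters:
            cost[c][i] = INF      # deny counter -> item
    for c in counters:
        cost[start][c] = INF      # deny start -> counter
        cost[c][start] = 0.0      # counter -> start is free: closes the tour
    return cost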
After you have drawn the thing out, tell me if I missed something :)
HTH

Efficient word scramble algorithm

I'm looking for an efficient algorithm for scrambling a set of letters into a permutation containing the maximum number of words.
For example, say I am given the list of letters: {e, e, h, r, s, t}. I need to order them in such a way as to contain the maximum number of words. If I order those letters into "theres", it contains the words "the", "there", "her", "here", and "ere". So that example would have a score of 5, since it contains 5 words. I want to order the letters in such a way as to have the highest score (contain the most words).
A naive algorithm would be to try and score every permutation. I believe this is O(n!), so 720 different permutations would be tried for just the 6 letters above (including some duplicates, since the example has e twice). For more letters, the naive solution quickly becomes impossible, of course.
The algorithm doesn't have to actually produce the very best solution, but it should find a good solution in a reasonable amount of time. For my application, simply guessing (Monte Carlo) at a few million permutations works quite poorly, so that's currently the mark to beat.
I am currently using the Aho-Corasick algorithm to score permutations. It searches for each word in the dictionary in just one pass through the text, so I believe it's quite efficient. This also means I have all the words stored in a trie, but if another algorithm requires different storage that's fine too. I am not worried about setting up the dictionary, just the run time of the actual ordering and searching. Even a fuzzy dictionary could be used if needed, like a Bloom Filter.
For my application, the list of letters given is about 100, and the dictionary contains over 100,000 entries. The dictionary never changes, but several different lists of letters need to be ordered.
I am considering trying a path-finding algorithm. I believe I could start with a random letter from the list as a starting point, and then each remaining letter would be used to create a "path." I think this would work well with the Aho-Corasick scoring algorithm, since scores could be built up one letter at a time. I haven't tried path finding yet though; maybe it's not even a good idea? I don't know which path-finding algorithm might be best.
Another algorithm I thought of also starts with a random letter. Then the dictionary trie would be searched for "rich" branches containing the remaining letters. Dictionary branches containing unavailable letters would be pruned. I'm a bit foggy on the details of how this would work exactly, but it could completely eliminate scoring permutations.
Here's an idea, inspired by Markov Chains:
Precompute the letter transition probabilities in your dictionary. Create a table with the probability that some letter X is followed by another letter Y, for all letter pairs, based on the words in the dictionary.
Generate permutations by randomly choosing each next letter from the remaining pool of letters, based on the previous letter and the probability table, until all letters are used up. Run this many times.
You can experiment by increasing the "memory" of your transition table: don't look only one letter back, but say 2 or 3. This increases the size of the probability table, but gives you a better chance of creating valid words.
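A rough sketch of that sampler with one letter of memory (the +1 smoothing, so unseen letter pairs remain possible, is my addition):

import random
from collections import defaultdict

def transition_table(dictionary_words):
    # counts[x][y] = how often letter y follows letter x in the dictionary.
    counts = defaultdict(lambda: defaultdict(int))
    for word in dictionary_words:
        for a, b in zip(word, word[1:]):
            counts[a][b] += 1
    return counts

def sample_permutation(letters, table):
    # Draw each next letter from the remaining pool, weighted by how
    # often it follows the previous letter in real words.
    pool = list(letters)
    out = [pool.pop(random.randrange(len(pool)))]
    while pool:
        weights = [table[out[-1]][c] + 1 for c in pool]   # +1 smoothing
        out.append(pool.pop(random.choices(range(len(pool)), weights)[0]))
    return ''.join(out)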
You might try simulated annealing, which has been used successfully for complex optimization problems in a number of domains. Basically you do randomized hill-climbing while gradually reducing the randomness. Since you already have the Aho-Corasick scoring you've done most of the work already. All you need is a way to generate neighbor permutations; for that something simple like swapping a pair of letters should work fine.
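A sketch of such an annealer, with your Aho-Corasick scorer passed in as score() and a simple pair swap as the neighbor move (the cooling schedule and iteration count here are guesses to tune):

import math
import random

def anneal(letters, score, iters=100_000, t0=2.0):
    current = list(letters)
    random.shuffle(current)
    cur_score = score(current)
    best, best_score = list(current), cur_score
    for step in range(iters):
        t = max(t0 * (1 - step / iters), 1e-9)            # linear cooling
        i, j = random.sample(range(len(current)), 2)
        current[i], current[j] = current[j], current[i]   # swap two letters
        s = score(current)
        # Accept improvements always; accept worsenings with a
        # probability that shrinks as the temperature drops.
        if s >= cur_score or random.random() < math.exp((s - cur_score) / t):
            cur_score = s
            if s > best_score:
                best, best_score = list(current), s
        else:
            current[i], current[j] = current[j], current[i]  # undo the swap
    return ''.join(best)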
Have you thought about using a genetic algorithm? You have the beginnings of your fitness function already. You could experiment with the mutation and crossover (thanks Nathan) algorithms to see which do the best job.
Another option would be for your algorithm to build the smallest possible word from the input set, and then add one letter at a time so that the new string is, or contains, another word. Start with a few different starting words for each input set and see where it leads.
Just a few idle thoughts.
It might be useful to check how others solved this:
http://sourceforge.net/search/?type_of_search=soft&words=anagram
On this page you can generate anagrams online. I've played around with it for a while and it's great fun. It doesn't explain in detail how it does its job, but the parameters give some insight.
http://wordsmith.org/anagram/advanced.html
Using JavaScript and Node.js, I implemented a jumble solver that uses a dictionary to build a tree and then traverses the tree to get all possible words. I explained the algorithm in detail in this article and put the source code on GitHub:
Scramble or Jumble Word Solver with Express and Node.js

Calculating the semantic distance between words

Does anyone know of a good way to calculate the "semantic distance" between two words?
Immediately an algorithm that counts the steps between words in a thesaurus springs to mind.
OK, looks like a similar question has already been answered: Is there an algorithm that tells the semantic similarity of two phrases.
In text mining there is an important maxim: "You shall know a word by the company it keeps." It means that it is possible to learn the meaning of a word based on the terms that frequently appear close to it.
Without going into extensive detail, let me give two simple options to estimate the semantic distance between terms:
Use a resource similar to WordNet (a large lexical database of English). WordNet superficially resembles a thesaurus, in that it groups words together based on their meanings. The semantic distance between words can be estimated as the number of vertices that connect the two words.
Using a large corpus (e.g. Wikipedia), count the terms that appear close to the words you are analyzing. Create two such context vectors and compute a distance between them (e.g. cosine).
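A minimal sketch of that second option (naive whitespace tokenization; the window size is a guess to tune):

import math
from collections import Counter

def context_vector(corpus_sentences, word, window=5):
    # Count the terms appearing within `window` tokens of `word`.
    vec = Counter()
    for sentence in corpus_sentences:
        tokens = sentence.lower().split()
        for i, tok in enumerate(tokens):
            if tok != word:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vec[tokens[j]] += 1
    return vec

def cosine_distance(u, v):
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) \
         * math.sqrt(sum(x * x for x in v.values()))
    return 1 - dot / norm if norm else 1.0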
You can check these materials to get a good picture of the subject:
http://www.saifmohammad.com/WebDocs/Mohammad_Saif_Thesis-slides.pdf
http://www.umiacs.umd.edu/~saif/WebDocs/distributionalmeasures.pdf
http://www.umiacs.umd.edu/~saif/WebDocs/Measuring-Semantic-Distance.pdf
The thesaurus idea has some merit. One approach would be to create a graph based on a thesaurus, with the nodes being the words and an edge indicating that they are listed as synonyms in the thesaurus. You could then use a shortest-path algorithm to give you the distance between two nodes as a measure of their similarity.
One difficulty here is that some words have different meanings in different contexts. Your algorithm may need to take this into account and use directed links with the weight of the outgoing link dependent on the incoming link being followed (or ignore some outgoing links based on the incoming link).
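A sketch of the basic, undirected version (ignoring the word-sense issue above; synonyms is assumed to map each word to the set of words the thesaurus lists for it):

from collections import deque

def thesaurus_distance(synonyms, a, b):
    # Breadth-first search gives the fewest synonym hops between a and b.
    queue, seen = deque([(a, 0)]), {a}
    while queue:
        word, hops = queue.popleft()
        if word == b:
            return hops
        for nxt in synonyms.get(word, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return None  # not connected in the thesaurus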
Possible hack: send the two words to Google search, and return the # of pages found.
