Finding a list of adjacent words between two words - ruby

I am working on a programming challenge for practice and am having trouble finding a good data structure/algorithm to use to implement a solution.
Background:
Call two words “adjacent” if you can change one word into the other by adding, deleting, or changing a single letter.
A “word list” is an ordered list of unique words where successive words are adjacent.
The problem:
Write a program which takes two words as inputs and walks through the dictionary and creates a list of words between them.
Examples:
hate → love: hate, have, hove, love
dogs → wolves: dogs, does, doles, soles, solves, wolves
man → woman: man, ran, roan, roman, woman
flour → flower: flour, lour, dour, doer, dower, lower, flower
I am not quite sure how to approach this problem. My first attempt involved creating permutations of the first word and then trying to replace letters in it. My second thought was to use something like a suffix tree.
Any thoughts or ideas toward at least breaking the problem down would be appreciated. Keep in mind that this is not homework, but a programming challenge I am working on myself.

This puzzle was first stated by Charles Dodgson, who wrote Alice's Adventures in Wonderland under his pseudonym Lewis Carroll.
The basic idea is to create a graph structure in which the nodes are words in a dictionary and the edges connect words that are one letter apart, then do a breadth-first search through the graph, starting at the first word, until you find the second word.
I discuss this problem, and give an implementation that includes a clever algorithm for identifying "adjacent to" words, at my blog.
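For concreteness, here is a minimal Ruby sketch of the adjacency test. This is the straightforward check, not the cleverer algorithm the blog post describes:

    # True if the words are one add/delete/change apart.
    def adjacent?(a, b)
      a, b = b, a if a.length < b.length  # ensure a is the longer word
      case a.length - b.length
      when 0  # same length: exactly one position differs (change)
        a.chars.zip(b.chars).count { |x, y| x != y } == 1
      when 1  # one longer: b equals a with one letter deleted
        i = 0
        i += 1 while i < b.length && a[i] == b[i]
        a[0...i] + a[i + 1..] == b
      else
        false
      end
    end

    adjacent?("hate", "have")   # => true  (change)
    adjacent?("flour", "lour")  # => true  (delete)
    adjacent?("hate", "love")   # => false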

I have done this myself and used it to create a (not very good) Windows game.
I used the approach recommended by others of implementing this as a graph, where each node is a word and two nodes are connected if the words differ in one letter. This means you can use well-known graph theory results to find paths between words (e.g. simple recursion, where knowing the words at distance 1 lets you find the words at distance 2).
The tricky part is building up the graph. The bad news is that it is O(n^2). The good news is that it doesn't have to be done in real time - rather than your program reading the dictionary words from a file, it reads in the data structure you baked earlier.
The key insight is that the order doesn't matter, in fact it gets in the way. You need to construct another form in which to hold the words which strips out the order information and allows words to be compared more easily. You can do this in O(n). You have lots of choices; I will show two.
For word puzzles I quite often use an encoding which I call an anagram dictionary. A word is represented by another word which has the same letters but in alphabetic sequence. So "cars" becomes "acrs". Both "lists" and "slits" become "ilsst". This is a better structure for comparison than the original word, but much better ones exist (it is, however, a very useful structure for other word puzzles).
Letter counts. An array of 26 values which give the frequency of each letter in the word. So for "cars" it starts 1,0,1,0,0... as there is one "a" and one "c". Hold an external list of the non-zero entries (which letters appear in the word) so you only have to check 5 or 6 values at most instead of 26. Two words held in this form can be compared quickly by checking that at most two counts differ. This is the one I would use.
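A sketch of that representation in Ruby (assuming lowercase a-z words; the method names here are mine, not from the answer):

    # Frequency representation: 26 counts, one per letter.
    def letter_counts(word)
      counts = Array.new(26, 0)
      word.each_char { |c| counts[c.ord - "a".ord] += 1 }
      counts
    end

    # Fast filter: for two words to be one letter apart, their counts may
    # differ in at most two positions. Check only the letters that appear
    # in either word, typically 5 or 6 values rather than all 26.
    def counts_close?(ca, cb, present)
      present.count { |i| ca[i] != cb[i] } <= 2
    end

    ca = letter_counts("cars")  # a:1, c:1, r:1, s:1
    cb = letter_counts("card")
    present = (0...26).select { |i| ca[i] > 0 || cb[i] > 0 }
    counts_close?(ca, cb, present)  # => true ("cars"/"card" pass the filter)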
So, this is how I did it.
I wrote a program which implemented the data structure up above.
It had a class called WordNode. This contains the original word; a List of all other WordNodes which are one letter different; an array of 26 integers giving the frequency of each letter; and a list of the non-zero values in the letter-count array.
The initialiser populates the letter frequency array and the corresponding list of non-zero values. It sets the list of connected WordNodes to empty.
After I have created an instance of the WordNode class for every word, I run a compare method which checks whether the frequency counts differ in no more than two places. That normally takes slightly fewer comparisons than there are letters in the words; not too bad. If they differ in exactly two places, the words are candidates for being one letter apart (the count check is a fast filter; a direct letter-by-letter comparison confirms a true one-letter difference), and I add that WordNode into the list of WordNodes differing in only one letter.
This means we now have a graph of all the words one letter different.
You can export either the whole data structure or strip out the letter frequency and other stuff you don't need and save it (I used serialized XML. If you go that way, make sure you check it handles the List of WordNodes as references and not embedded objects).
Your actual game then only has to read in this data structure (instead of a dictionary) and it can find the words one letter different with a direct lookup, in essentially zero time.
Pity my game was crap.

I don't know if this is the type of solution that you're looking for, but an active area of research is in constructing "edit distance 1" dictionaries for quickly looking up adjacent words (to use your parlance) for search term suggestions, data entry correction, and bioinformatics (e.g. finding similarities in chromosomes). See for example this research paper. Short of indexing your entire dictionary, at the very least this might suggest a search heuristic that you can use.

The simplest (recursive) algorithm I can think of (well, the only one I can think of at the moment) is:
Initialize an empty blacklist.
Take all words from your dictionary that are a valid step from the current word.
Remove the ones that are in the blacklist.
Check if you can find the target word among them.
If not, repeat the algorithm for all words you found in the last step.
If yes, you found it. Return up through the recursion, printing all the words in the path you found.
Maybe someone with a bit more time can add the Ruby code for this?
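Since the question asks for Ruby, here is a minimal breadth-first-search sketch of the algorithm above. The blacklist becomes the parent hash, which doubles as the path record; the dictionary path is a placeholder, and the adjacency test from the first answer's sketch is repeated so the block runs on its own:

    # Hypothetical word-list path; one lowercase word per line.
    DICTIONARY = File.readlines("/usr/share/dict/words", chomp: true)
                     .map(&:downcase).uniq

    # One-letter-apart test (same as sketched in the first answer).
    def adjacent?(a, b)
      a, b = b, a if a.length < b.length
      return false if a.length - b.length > 1
      if a.length == b.length
        a.chars.zip(b.chars).count { |x, y| x != y } == 1
      else
        i = 0
        i += 1 while i < b.length && a[i] == b[i]
        a[0...i] + a[i + 1..] == b
      end
    end

    def word_ladder(from, to)
      parent = { from => nil }  # visited set ("blacklist") plus path links
      queue  = [from]
      until queue.empty?
        word = queue.shift
        if word == to           # found: walk the parent links back
          path = []
          while word
            path.unshift(word)
            word = parent[word]
          end
          return path
        end
        DICTIONARY.each do |candidate|
          next if parent.key?(candidate) || !adjacent?(word, candidate)
          parent[candidate] = word
          queue << candidate
        end
      end
      nil  # no ladder exists
    end

    p word_ladder("hate", "love")  # e.g. ["hate", "have", "hove", "love"] (dictionary-dependent)

Scanning the whole dictionary for each dequeued word is quadratic overall, which matches the O(n^2) graph-building cost discussed in the next answer; precomputing the adjacency graph, as that answer suggests, moves the cost offline.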

Try this
x = 'hate'
puts x = x.next until x == 'love'
And if you couple it with a dictionary lookup, you will get a list of all the valid words in between that appear in that dictionary.

Related

Generate or find a shortest text given list of words

Let's say I have a list of 1000+ words and I would like to generate a text that includes these words from the list. I would like to use as few extra words outside of the list as possible. How would one tackle such a problem? Or alternatively, is there a way to efficiently search for a smaller portion of text containing these words the most, given some larger text (millions of words)? Basically, the resulting text from the search should be optimized to be shortest but to contain all the words from the list.
I am not sure how you'd like the text to be generated, so I'll attempt to answer the second question:
Is there a way to efficiently search for a smaller portion of text containing these words the most, given some larger text (millions of words)? Basically, the resulting text from the search should be optimized to be shortest but to contain all the words from the list.
This is obviously a computationally demanding endeavour so I'll assume you are alright with spending like a gig of RAM on this and some time (but maybe not too long). Since you are looking for the shortest continuous text which satisfies some condition, one can conclude the following:
If the text satisfies the condition, you want to shorten it.
If it doesn't, you want to make it longer so that hopefully it will start satisfying the condition.
Now, when it comes to the condition, it is whatever predicate that will say whether the continuous section of the text is "good enough" or not, based on some relatively simple statistics. For instance, the predicate could check if some cumulative index based on what ratio of the words from your list are included in the section, modified by the number of words from outside the list, is greater than some expected value.
What my mind races to when I see something like this is the sliding window technique, described in this article. I do not know whether it is a good article; I did not take the time to read it closely, but from a quick scan it seems decent. The technique is also known as the caterpillar method, which is a particularly common name for it in Poland.
Basically, you have two pointers, a left pointer and a right pointer. Suppose you are looking for the shortest continuous fragment of a larger text that satisfies a condition, and the condition is monotone in the sense that if it is met for a fragment, it is also met for any larger fragment containing it. Then you advance the right pointer as long as the condition is unmet, and once it is met, you advance the left pointer until the condition stops being met. This repeats until either or both pointers reach the end of the text.
This is a neat technique, which allows you to iterate over the whole text exactly once, linearly. It is clearly desirable in your case to have an algorithm linear with respect to the length of the text.
Now, we have to consider the statistics you will be collecting. You will probably want to know how many words from the list, and how many words from outside of the list are present in a continuous fragment. An extra condition for these statistics is that they will need to be relatively easily modifiable (preferably in constant time, but that will be hard to achieve) every time one of the pointers advances.
In order to keep track of the words, we will use a hashmap of ordered sets of indices. In Java these data structures are called HashMap and TreeSet; in C++ they're unordered_map and set. The keys of the hashmap will be strings representing words. The values will be sets of indices of where the words appear in the text. Note that lookup in a hashmap is linear in the length of the key, which we can treat as constant since most words are under 10 characters long, and checking how many values in a set lie between two given values is logarithmic in the size of the set. So getting the number of times a word appears in a fragment of the text is easy and fast. Keeping track of whether a word is in the given list or not can also be achieved with a hashmap (or a hashset).
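The question names Java and C++, but as a sketch the same bookkeeping in Ruby is compact. Because positions are appended while scanning the text left to right, plain arrays stay sorted, so two binary searches replace the TreeSet range count ("corpus.txt" is a hypothetical input):

    # word -> ascending array of the positions where it occurs.
    text_words  = File.read("corpus.txt").split  # hypothetical input file
    occurrences = Hash.new { |h, k| h[k] = [] }
    text_words.each_with_index { |w, i| occurrences[w] << i }

    # Occurrences of a word inside the fragment [left, right): two binary
    # searches, logarithmic in the number of occurrences, as described above.
    def count_in_fragment(positions, left, right)
      lo = positions.bsearch_index { |i| i >= left }  || positions.size
      hi = positions.bsearch_index { |i| i >= right } || positions.size
      hi - lo
    end

    count_in_fragment(occurrences["the"], 0, 100)  # times "the" appears in the first 100 words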
So let's get back to the statistics. Say you want to keep track of the number of words from inside and from outside your list in a given fragment. This can be achieved very simply:
Every time you add a word to the fragment by advancing its right end, you check if it appears in the list in constant time and if so, you add one to the "good words" number, and otherwise, you add one to the "bad words" number.
Every time you remove a word from the fragment by advancing the left end, you do the same but you decrement the counters instead.
Now if you want to track how many unique words from inside and from outside the list there are in the fragment, every time you will need to check the number of times a given word exists in the fragment. We established earlier that this can be done logarithmically relative to the length of the fragment, so now the trick is simple. You only modify the counters if the number of appearances of a word in the fragment either
rose from 0 to 1 when advancing the right pointer, or
fell from 1 to 0 when advancing the left pointer.
Otherwise, you ignore the word, not changing the counters.
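To make the loop concrete, here is a hedged Ruby sketch of the caterpillar with the counters described above. For simplicity the predicate is fixed to "every list word is present" (which is monotone, as required); the question's softer predicates slot into the same structure:

    require "set"

    # words:  the text as an array of words
    # wanted: a Set of the words from your list
    # Returns the shortest fragment containing every list word, or nil.
    def shortest_good_fragment(words, wanted)
      in_window   = Hash.new(0)  # occurrences of each word in the window
      unique_good = 0            # distinct list words currently in the window
      best        = nil
      left        = 0

      words.each_with_index do |w, right|
        in_window[w] += 1
        unique_good += 1 if wanted.include?(w) && in_window[w] == 1  # rose 0 -> 1

        # Condition met; shrink from the left while it still holds.
        while unique_good == wanted.size
          best = [left, right + 1] if best.nil? || right + 1 - left < best[1] - best[0]
          lw = words[left]
          in_window[lw] -= 1
          unique_good -= 1 if wanted.include?(lw) && in_window[lw].zero?  # fell 1 -> 0
          left += 1
        end
      end
      best && words[best[0]...best[1]]
    end

    text = "the quick brown fox jumps over the lazy dog".split
    p shortest_good_fragment(text, Set["fox", "dog"])
    # => ["fox", "jumps", "over", "the", "lazy", "dog"]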
Additional memory optimisations include removing indices from the sets of indices when they are out of scope of the fragment and removing hashmap entries from the hashmap if a set of indices becomes empty.
It is now up to you to perhaps find a better heuristic: other statistical values that you can track easily and that capture whatever it is you intend to check in your predicate. It is important, though, that whenever a fragment meets your condition, a bigger fragment must meet it too.
In the case described above you could keep track of all the fragments which had at least... I don't know... 90% of the words from your list and from those choose the shortest one or the one with the fewest foreign words.

What would be a good data structure to store a dictionary(of words) to optimize the search time?

Provided a list of valid words, and a search word, I want to find whether the search word is a valid word or not ALLOWING 2 typo characters.
What would be a good data structure to store a dictionary of words(assumingly it contains a million words) and algorithm to find whether the word exists in the dictionary(allowing 2 typo characters).
If no typo characters were allowed, then a trie would be a good way to store the words, but I am not sure it remains the best way to store the dictionary when typos are allowed. I am also not sure what the complexity of a backtracking search (looking up a word in the trie while allowing 2 typos) would be. Any idea about it?
You might want to check out the Directed Acyclic Word Graph, or DAWG. It has more of an automaton structure than a tree or graph structure. The multiple possibilities branching from one place may provide you with your solution.
If there is no need to also store all mistyped words, I would consider using a two-step approach for this problem.
Build a set containing hashes of all valid words (not including typos). So probably we are talking about some 10,000 entries here, which should still allow quite fast lookups with a binary search. If the hash of a word is found in this set, it is typed correctly.
If a word's hash is not found in the set, the word is probably mistyped. So calculate the Damerau-Levenshtein distance between the word and all known words to figure out what the user might have meant. To gain some performance here, modify the DL algorithm to abort the calculation once the distance gets bigger than your allowed threshold of 2 typos.
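The answer doesn't fix a language, so here is a Ruby sketch of step 2: the restricted Damerau-Levenshtein (optimal string alignment) distance with the early abort, returning nil as soon as every entry in the current row exceeds the threshold:

    # Distance between a and b if it is <= max_dist, else nil.
    def dl_distance_within(a, b, max_dist)
      return nil if (a.length - b.length).abs > max_dist
      prev2 = nil
      prev  = (0..b.length).to_a
      (1..a.length).each do |i|
        curr = [i] + Array.new(b.length, 0)
        (1..b.length).each do |j|
          cost    = a[i - 1] == b[j - 1] ? 0 : 1
          curr[j] = [prev[j] + 1,             # deletion
                     curr[j - 1] + 1,         # insertion
                     prev[j - 1] + cost].min  # substitution
          if i > 1 && j > 1 && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1]
            curr[j] = [curr[j], prev2[j - 2] + 1].min  # transposition
          end
        end
        return nil if curr.min > max_dist  # abort: the distance can only grow
        prev2, prev = prev, curr
      end
      prev[b.length] <= max_dist ? prev[b.length] : nil
    end

    dl_distance_within("recieve", "receive", 2)  # => 1 (one transposition)
    dl_distance_within("word", "wildly", 2)      # => nil (aborts early)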

Comparing word to targeted words in dictionary

I'm trying to write a program in Java that stores a dictionary in a HashMap (each word under a different key) and compares a given word to the words in the dictionary, coming up with a spelling suggestion if it is not found in the dictionary -- basically a spell-check program.
I already came up with the comparison algorithm (i.e. Needleman-Wunsch, then Levenshtein distance), etc., but got stuck when it came to figuring out which words in the dictionary hashmap to compare the word to, e.g. "hellooo".
I cannot compare "ohelloo" [which should be corrected to "hello"] to each word in the dictionary because that would take too long, and I cannot compare it only to the words in the dictionary starting with 'o' because it's supposed to be "hello".
Any ideas?
The most common spelling mistakes are
Delete a letter (smaller word OR word split)
Swap adjacent letters
Alter letter (QWERTY adjacent letters)
Insert letter
Some reports say that 70-90% of mistakes fall in the above categories (edit distance 1)
Take a look at the URL below, which provides a solution for single or double mistakes (edit distance 1 or 2). Almost everything you'll need is there!
How to write a spelling corrector
FYI: You can find implementations in various programming languages at the bottom of the aforementioned article. I've used it in some of my projects; practical accuracy is really good, sometimes 95% or more, as claimed by the author.
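For illustration, the article's edit-distance-1 candidate generation translates almost line for line into Ruby (a sketch; the article itself uses Python):

    LETTERS = ("a".."z").to_a

    # All strings one edit away from word: deletes, transposes, replaces, inserts.
    def edits1(word)
      splits     = (0..word.length).map { |i| [word[0...i], word[i..]] }
      deletes    = splits.reject { |_, r| r.empty? }.map { |l, r| l + r[1..] }
      transposes = splits.select { |_, r| r.length > 1 }.map { |l, r| l + r[1] + r[0] + r[2..] }
      replaces   = splits.reject { |_, r| r.empty? }.flat_map { |l, r| LETTERS.map { |c| l + c + r[1..] } }
      inserts    = splits.flat_map { |l, r| LETTERS.map { |c| l + c + r } }
      (deletes + transposes + replaces + inserts).uniq
    end

    edits1("helo").size  # a few hundred candidates to check against the dictionary

For edit distance 2, apply edits1 to each result, as the article does.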
--Based on OP's comment--
If you don't want to pre-compute every possible alteration and then search the map, I suggest that you use a Patricia trie (radix tree) instead of a HashMap. You will, unfortunately, again need to handle the "first-letter mistake" separately (e.g. remove the first letter, or swap it with the second, or just replace it with a QWERTY-adjacent one), but this way you can limit your search with high probability.
You can even combine it with an extra index map or trie of "reversed" words, or an extra index that omits the first N characters (e.g. the first 2), so you can catch errors that occur only in the prefix.

What data structure for maintaining candidate words in the Wheel of Fortune?

Background on the Wheel of Fortune, for those unfamiliar with it: in a Wheel of Fortune game, players initially see a set of blanks, representing words with letters hidden. (So players know the length of each word, but not what letters the words contain.) As the game progresses, players guess letters; if the phrase contains that letter, all the locations of that letter in the phrase are revealed. For example, a game (with the hidden phrase "stack overflow") would initially be represented as ????? ????????, and after the letter "o" is guessed, the game would display ????? o?????ow.
For simplicity, let's suppose our games contain only a single hidden word. What data structure would I use to hold all the possible candidates for that word? (I'm playing around with an AI that picks what letter to guess next, so in order to make the choice, I want to be able to calculate statistics like the most common letter out of the remaining candidates.) To be clear, initially I know that my word contains N letters, and then I learn the positions of various letters in the word, as well as what letters the word does not contain.
There's a similar question at Good algorithm and data structure for looking up words with missing letters?, but I think this question is slightly different in that I have more than two blanks, and I'm also iteratively pruning my candidate list (whereas that question seems to only use a single search). My current idea is to just maintain a candidate list of words, initialized to all English words (there should be at most 300k-500k words), and then (taking a similar approach to that question) to use a Regex to iteratively prune this candidate list as I guess more letters, but I'm curious if there's a better data structure or algorithm.
You should start by splitting the words based on size. Within each size, a trie (http://en.wikipedia.org/wiki/Trie) seems a good start. You'll get to prune whole subtrees at once, and you could keep the trie intact by only flipping a flag at each node indicating whether the subtree rooted at that node is out of consideration in the current game.
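The candidate-list pruning from the question is only a few lines in, say, Ruby (a sketch with illustrative names). A '?' position can be any letter not yet guessed, since revealed letters show all their positions and wrong guesses appear nowhere:

    # Prune candidates against a pattern like "?o??ow", given the guessed letters.
    def prune(candidates, pattern, guessed)
      unknown = guessed.empty? ? "." : "[^#{guessed.join}]"
      body    = pattern.chars.map { |c| c == "?" ? unknown : Regexp.escape(c) }.join
      candidates.grep(/\A#{body}\z/)
    end

    # Statistic for the AI: most common letter among the remaining candidates.
    def best_guess(candidates, guessed)
      counts = Hash.new(0)
      candidates.each do |w|
        w.chars.uniq.each { |c| counts[c] += 1 unless guessed.include?(c) }
      end
      counts.max_by { |_, n| n }&.first
    end

    words = %w[hollow follow shadow yellow window]
    pool  = prune(words, "?o??ow", ["o", "w"])  # => ["hollow", "follow"]
    p best_guess(pool, ["o", "w"])              # => "l" (appears in both)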

Efficient word scramble algorithm

I'm looking for an efficient algorithm for scrambling a set of letters into a permutation containing the maximum number of words.
For example, say I am given the list of letters: {e, e, h, r, s, t}. I need to order them in such a way as to contain the maximum number of words. If I order those letters into "theres", it contains the words "the", "there", "her", "here", and "ere". So that example could have a score of 5, since it contains 5 words. I want to order the letters in such a way as to have the highest score (contain the most words).
A naive algorithm would be to try and score every permutation. I believe this is O(n!), so 720 different permutations would be tried for just the 6 letters above (including some duplicates, since the example has e twice). For more letters, the naive solution quickly becomes impossible, of course.
The algorithm doesn't have to actually produce the very best solution, but it should find a good solution in a reasonable amount of time. For my application, simply guessing (Monte Carlo) at a few million permutations works quite poorly, so that's currently the mark to beat.
I am currently using the Aho-Corasick algorithm to score permutations. It searches for each word in the dictionary in just one pass through the text, so I believe it's quite efficient. This also means I have all the words stored in a trie, but if another algorithm requires different storage that's fine too. I am not worried about setting up the dictionary, just the run time of the actual ordering and searching. Even a fuzzy dictionary could be used if needed, like a Bloom Filter.
For my application, the list of letters given is about 100, and the dictionary contains over 100,000 entries. The dictionary never changes, but several different lists of letters need to be ordered.
I am considering trying a path-finding algorithm. I believe I could start with a random letter from the list as a starting point. Then each remaining letter would be used to create a "path." I think this would work well with the Aho-Corasick scoring algorithm, since scores could be built up one letter at a time. I haven't tried path finding yet though; maybe it's not even a good idea? I don't know which path-finding algorithm might be best.
Another algorithm I thought of also starts with a random letter. Then the dictionary trie would be searched for "rich" branches containing the remaining letters. Dictionary branches containing unavailable letters would be pruned. I'm a bit foggy on the details of how this would work exactly, but it could completely eliminate scoring permutations.
Here's an idea, inspired by Markov Chains:
Precompute the letter transition probabilities in your dictionary. Create a table with the probability that some letter X is followed by another letter Y, for all letter pairs, based on the words in the dictionary.
Generate permutations by randomly choosing each next letter from the remaining pool of letters, based on the previous letter and the probability table, until all letters are used up. Run this many times.
You can experiment by increasing the "memory" of your transition table: don't look only one letter back, but say 2 or 3. This increases the size of the probability table, but gives you a better chance of creating a valid word.
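A rough Ruby sketch of this idea with one letter of memory (names are illustrative; add-one smoothing keeps unseen pairs possible):

    # Bigram transition weights from a dictionary.
    def transition_table(dictionary)
      table = Hash.new { |h, k| h[k] = Hash.new(0) }
      dictionary.each do |word|
        word.chars.each_cons(2) { |a, b| table[a][b] += 1 }
      end
      table
    end

    # Draw one permutation: pick each next letter from the remaining pool,
    # weighted by how often it follows the previous letter in the dictionary.
    def sample_permutation(letters, table)
      pool   = letters.dup
      result = [pool.delete_at(rand(pool.size))]
      until pool.empty?
        weights = pool.map { |c| table[result.last][c] + 1 }  # +1 smoothing
        r = rand(weights.sum)
        i = 0
        i += 1 while (r -= weights[i]) >= 0
        result << pool.delete_at(i)
      end
      result.join
    end

    table = transition_table(%w[the there her here ere three sheet])
    p sample_permutation(%w[e e h r s t], table)  # e.g. "theres"

Score each sampled permutation with your existing Aho-Corasick pass and keep the best.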
You might try simulated annealing, which has been used successfully for complex optimization problems in a number of domains. Basically you do randomized hill-climbing while gradually reducing the randomness. Since you already have the Aho-Corasick scoring you've done most of the work already. All you need is a way to generate neighbor permutations; for that something simple like swapping a pair of letters should work fine.
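A hedged Ruby sketch of that loop, assuming you already have a score method (e.g. your Aho-Corasick word counter; it is not defined here) and using a pair swap as the neighbor move; the cooling schedule numbers are arbitrary:

    # Assumes score(string) exists -- e.g. your Aho-Corasick counter that
    # returns the number of dictionary words contained in the string.
    def anneal(letters, temp: 10.0, cooling: 0.9995, min_temp: 0.01)
      current = letters.shuffle
      current_score = score(current.join)
      best, best_score = current.dup, current_score

      while temp > min_temp
        i, j = rand(current.size), rand(current.size)
        current[i], current[j] = current[j], current[i]  # neighbor: swap two letters
        new_score = score(current.join)
        if new_score >= current_score ||
           rand < Math.exp((new_score - current_score) / temp)  # sometimes accept worse
          current_score = new_score
          best, best_score = current.dup, new_score if new_score > best_score
        else
          current[i], current[j] = current[j], current[i]  # undo the swap
        end
        temp *= cooling  # gradually reduce the randomness
      end
      best.join
    end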
Have you thought about using a genetic algorithm? You have the beginnings of your fitness function already. You could experiment with the mutation and crossover (thanks Nathan) algorithms to see which do the best job.
Another option would be for your algorithm to build the smallest possible word from the input set, and then add one letter at a time so that the result is, or contains, a new word. Start with a few different starting words for each input set and see where it leads.
Just a few idle thoughts.
It might be useful to check how others solved this:
http://sourceforge.net/search/?type_of_search=soft&words=anagram
On this page you can generate anagrams online. I've played around with it for a while and it's great fun. It doesn't explain in detail how it does its job, but the parameters give some insight.
http://wordsmith.org/anagram/advanced.html
Using JavaScript and Node.js, I implemented a jumble solver that uses a dictionary to build a tree and then traverses the tree, after which you can get all possible words. I explained the algorithm in detail in this article and put the source code on GitHub:
Scramble or Jumble Word Solver with Express and Node.js
