Number of simple mutations to change one string to another? - algorithm

I'm sure you've all heard of the "Word game", where you try to change one word to another by changing one letter at a time, and only going through valid English words. I'm trying to implement an A* Algorithm to solve it (just to flesh out my understanding of A*) and one of the things that is needed is a minimum-distance heuristic.
That is, the minimum number of one of these three mutations that can turn an arbitrary string a into another string b:
1) Change one letter for another
2) Add one letter at a spot before or after any letter
3) Remove any letter
Examples
aabca => abaca:
aabca
abca
abaca
= 2
abcdebf => bgabf:
abcdebf
bcdebf
bcdbf
bgdbf
bgabf
= 4
I've tried many algorithms out; I can't seem to find one that gives the actual answer every time. In fact, sometimes I'm not sure if even my human reasoning is finding the best answer.
Does anyone know any algorithm for such purpose? Or maybe can help me find one?
(Just to clarify, I'm asking for an algorithm that can turn any arbitrary string to any other, disregarding their English validity-ness.)

You want the minimum edit distance (or Levenshtein distance):
The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character. It is named after Vladimir Levenshtein, who considered this distance in 1965.
And one algorithm to determine the editing sequence is on the same page here.

An excellent reference on "Edit distance" is section 6.3 of the Algorithms textbook by S. Dasgupta, C. H. Papadimitriou, and U. V. Vazirani, a draft of which is available freely here.

If you have a reasonably sized (small) dictionary, a breadth first tree search might work.
So start with all words your word can mutate into, then all those can mutate into (except the original), then go down to the third level... Until you find the word you are looking for.
You could eliminate divergent words (ones further away from the target), but doing so might cause you to fail in a case where you must go through some divergent state to reach the shortest path.

Related

Algorithm Problem: Word transformation from a given word to another using only the words in a given dict

The detailed description of the problem is as follows:
Given two words (beginWord and endWord), and a dictionary's word list, find if there's a transformation sequence from beginWord to endWord, such that:
Only one letter can be changed at a time
Each transformed word must exist in the word list. Note that beginWord is not a transformed word.
I know this word can be solved using breadth-first-search. After I proposed the normal BFS solution, the interviewer asked me if I can make it faster. I didn't figure out a way to speed up. And the interviewer told me I should use a PriorityQueue instead to do a "Best-First-Search". And the priority is given by the hamming distance between the current word and target.
I don't quite understand why this can speed up the search. I feel by using priorityQueue we try to search the path that makes progress (i.e. reducing hamming distance).
This seems to be a greedy method. My questions is:
Why this solution is faster than the breadth-first-search solution? I feel the actual path can be like this: at first not making any progress, or even increasing the hamming distance, but after reaching a word the hamming distance goes down gradually. In this scenario, I think the priority queue solution will be slower.
Any suggestions will be appreciated! Thanks
First, I'd recommend to do some thorough reading on graph searching algorithms, that will explain the question to any detail you want (and far beyond).
TL;DR:
Your interviewer effectively recommended something close to the A* algorithm.
It differs from BFS in one aspect: which node to expand first. It uses a notion of distance score, composed from two elements:
At a node X, we already "traveled" a distance given by the number of transformations so far.
To reach the target from X, we still need to travel some more, at least N steps, where N is the number of characters different between node and target.
If we are to follow the path through X, the total number of steps from start to target can't be less than this score. It can be more if the real rest distance turns out to be longer (some words necessary for the direct path don't exist in the dictionary).
A* tells us: of all open (unexpanded) nodes, try the one first that potentially gives the shortest overall solution path, i.e. the one with the lowest score. And to implement that, a priority queue is a good fit.
In many cases, A* can dramatically reduce the search space (compared to BFS), and it still guarantees to find the best solution.
A* is NOT a greedy algorithm. It will eventually explore the whole search space, only in a much better ordering than a blind BFS.

Minimum-size Trie Construction by Dynamic Programming (War Story: What’s Past is Prolog from The Algorithm Design Manual by Steven S Skiena)

I'm reading the The Algorithm Design Manual by Steven S Skiena, and trying to understand the solution to the problem of War Story: What’s Past is Prolog.
The problem is also well described here.
Basically, the problem is, given an ordered list of strings, give a optimal solution to construct a trie with minimum size (the string character as the node), with the constraint that the order of the strings must be reserved, while the character index can be reordered.
Maybe this is not an appropriate question here for stackoverflow, still I'm wondering if anyone could give me some hint on the solution, especially what this recurrence means by its arguments:
the recurrence for the Dynamic Programming algorithm
You can think about it this way:
Let's assume that we fix the index of the first character. All strings get split into r bins based on the value of the character in this position (bins are essentially subtrees).
We can work with each bin independently. It won't change the order across different bins because two strings in different bins are different in the first character.
Thus, we can solve the problem for each bin independently. After that, we need exactly one edge to connect the root to each bin (that is, subtree). That's where the formula
C[i_k, j_k] + 1 comes from.
As we want to minimize the total number of edges and we're free to pick the first position, we just try all possible options among m positions.
Note: this algorithm is correct under assumption that we can reorder the rest of the characters in each subtree independently. If it's not the case, the dynamic programming solution is incorrect.

Levenstein Transpose Distance

How can I implement the transpose/swap/twiddle/exchange distance alone using dynamic programming. I must stress that I do not want to check for the other operations (ie copy, delete, insert, kill etc) just transpose/swap.
I wish to apply Levenstein's algorithm just for swap distance. How would the code look like?
I'm not sure that Levenstein's algorithm can be used in this case. Without insert or delete operation, distance is good defined only between strings with same length and same characters. Examples of strings that isn't possible to transform to same string with only transpositions:
AB, ABC
AAB, ABB
With that, algorithm can be to find all possible permutations of positions of characters not on same places in both strings and look for one that can be represent with minimum number of transpositions or swaps.
An efficient application of dynamic programming usually requires that the task decompose into several instances of the same task for a shorter input. In case of the Levenstein distance, this boils down to prefixes of the two strings and the number of edits required to get from one to the other. I don't see how such a decomposition can be achieved in your case. At least I don't see one that would result in a polynomial time algorithm.
Also, it is not quite clear what operations you are talking about. Depending on the context, a swap or exchange can mean either the same thing as transposition or a replacement of a letter with an arbitrary other letter, e.g. test->text. If by "transpose/swap/twiddle/exchange" you try to say just "transpose", than you should have a look at Counting the adjacent swaps required to convert one permutation into another. If not, please clarify the question.

Shortest path from one word to another via valid words (no graph)

I came across this variation of edit-distance problem:
Find the shortest path from one word to another, for example storm->power, validating each intermediate word by using a isValidWord() function. There is no other access to the dictionary of words and therefore a graph cannot be constructed.
I am trying to figure this out but it doesn't seem to be a distance related problem, per se. Use simple recursion maybe? But then how do you know that you're going the right direction?
Anyone else find this interesting? Looking forward to some help, thanks!
This is a puzzle from Lewis Carroll known as Word Ladders. Donald Knuth covers this in The Stanford Graphbase. This also
You can view it as a breadth first search. You will need access to a dictionary of words, otherwise the space you will have to search will be huge. If you just have access to a valid word you can generate all the permutations of words and then just use isValidWord() to filter it down (Norvig's "How to Write a Spelling Corrector" is a great explanation of generating the edits).
You can guide the search by trying to minimize the edit distance between where you currently are and where you can to be. For example, generate the space of all nodes to search, and sort by minimum edit distance. Follow the links first that are closest (e.g. minimize the edit distance) to the target. In the example, follow the nodes that are closest to "power".
I found this interesting as well, so there's a Haskell implementation here which works reasonably well. There's a link in the comments to a Clojure version which has some really nice visualizations.
You can search from two sides at the same time. I.e. change a letter in storm and run it through isValidWord(), and change a letter in power and run it through isValidWord(). If those two words are the same, you have found a path.

Efficient word scramble algorithm

I'm looking for an efficient algorithm for scrambling a set of letters into a permutation containing the maximum number of words.
For example, say I am given the list of letters: {e, e, h, r, s, t}. I need to order them in such a way as to contain the maximum number of words. If I order those letters into "theres", it contain the words "the", "there", "her", "here", and "ere". So that example could have a score of 5, since it contains 5 words. I want to order the letters in such a way as to have the highest score (contain the most words).
A naive algorithm would be to try and score every permutation. I believe this is O(n!), so 720 different permutations would be tried for just the 6 letters above (including some duplicates, since the example has e twice). For more letters, the naive solution quickly becomes impossible, of course.
The algorithm doesn't have to actually produce the very best solution, but it should find a good solution in a reasonable amount of time. For my application, simply guessing (Monte Carlo) at a few million permutations works quite poorly, so that's currently the mark to beat.
I am currently using the Aho-Corasick algorithm to score permutations. It searches for each word in the dictionary in just one pass through the text, so I believe it's quite efficient. This also means I have all the words stored in a trie, but if another algorithm requires different storage that's fine too. I am not worried about setting up the dictionary, just the run time of the actual ordering and searching. Even a fuzzy dictionary could be used if needed, like a Bloom Filter.
For my application, the list of letters given is about 100, and the dictionary contains over 100,000 entries. The dictionary never changes, but several different lists of letters need to be ordered.
I am considering trying a path finding algorithm. I believe I could start with a random letter from the list as a starting point. Then each remaining letter would be used to create a "path." I think this would work well with the Aho-Corasick scoring algorithm, since scores could be built up one letter at a time. I haven't tried path finding yet though; maybe it's not a even a good idea? I don't know which path finding algorithm might be best.
Another algorithm I thought of also starts with a random letter. Then the dictionary trie would be searched for "rich" branches containing the remain letters. Dictionary branches containing unavailable letters would be pruned. I'm a bit foggy on the details of how this would work exactly, but it could completely eliminate scoring permutations.
Here's an idea, inspired by Markov Chains:
Precompute the letter transition probabilities in your dictionary. Create a table with the probability that some letter X is followed by another letter Y, for all letter pairs, based on the words in the dictionary.
Generate permutations by randomly choosing each next letter from the remaining pool of letters, based on the previous letter and the probability table, until all letters are used up. Run this many times.
You can experiment by increasing the "memory" of your transition table - don't look only one letter back, but say 2 or 3. This increases the probability table, but gives you more chance of creating a valid word.
You might try simulated annealing, which has been used successfully for complex optimization problems in a number of domains. Basically you do randomized hill-climbing while gradually reducing the randomness. Since you already have the Aho-Corasick scoring you've done most of the work already. All you need is a way to generate neighbor permutations; for that something simple like swapping a pair of letters should work fine.
Have you thought about using a genetic algorithm? You have the beginnings of your fitness function already. You could experiment with the mutation and crossover (thanks Nathan) algorithms to see which do the best job.
Another option would be for your algorithm to build the smallest possible word from the input set, and then add one letter at a time so that the new word is also is or contains a new word. Start with a few different starting words for each input set and see where it leads.
Just a few idle thoughts.
It might be useful to check how others solved this:
http://sourceforge.net/search/?type_of_search=soft&words=anagram
On this page you can generate anagrams online. I've played around with it for a while and it's great fun. It doesn't explain in detail how it does its job, but the parameters give some insight.
http://wordsmith.org/anagram/advanced.html
With javascript and Node.js I implemented a jumble solver that uses a dictionary and create a tree and then traversal the tree after that you can get all possible words, I explained the algorithm in this article in detail and put the source code on GitHub:
Scramble or Jumble Word Solver with Express and Node.js

Resources