Algorithm to find sentences in text

Friends, I am looking for a good algorithm to search for given multi-word phrases in a large text. For simplicity, I consider the text tokenized and the positions of all its words already found. Thus, if I have a phrase of three words (in fact there may be more), I first look up the positions of each of these words in the text, so an array of integer positions is associated with each of the three words of the phrase. These arrays do not necessarily have the same length.
An example may help here. Assume we need to find the phrase "all white cats" in this text:
...this is just a dummy text about cats. In this text I want to write the phrase that all cats are white but the fact is not all cats are white. But in case there are some white cats, anyway, we need to write about them. All facts about the cats...
If we assume the word "this" has position 30, then we can assign such positions to each word from the initial phrase:
all: 48, 57, 76
white: 51, 60, 67
cats: 37, 49, 58, 68, 80
As you can see, we can combine those words into different phrases, and each "phrase" will have its own "quality". The quality can be calculated as the sum of the distances from each word to the virtual "phrase center".
For example, the two occurrences of "all cats are white" give the position triplets (48, 49, 51) and (57, 58, 60); both are good phrases with a quality of 3.33 (for (48, 49, 51) the center is 49.33, so the quality is |48 - 49.33| + |49 - 49.33| + |51 - 49.33| = 3.33). The words can be combined into other phrases as well, but those will be low-quality.
My question is to find a good algorithm that builds a list of all phrases, where each phrase has a center coordinate and the positions of its words. I know it can be done by directly calculating the distance from each word to each word, but that can take ages for a big enough text and long enough phrases.
To simplify, I am thinking of limiting the lookup distance (let's say 5 words) from each word.
But beyond that, I can't figure out how to make this calculation faster.
I feel there is a ready-made algorithm for this, but I can't find one.
Thanks!

Let's prepare an intermediate data structure of sorted positions with corresponding words (see pos_words below). For each triplet of consecutive positions we check that all required words are present, and for valid triplets we calculate the score/quality value.
See the model implementation in Python:
def calculate_score(data):
    def score(positions):
        center = sum(positions) / len(positions)
        return sum(abs(p - center) for p in positions)

    word_set = set(data)
    word_count = len(word_set)
    pos_words = {p: word for word, positions in data.items() for p in positions}
    positions = sorted(pos_words)
    return [
        (positions[i], score(positions[i:i + word_count]))
        for i in range(len(positions) - word_count + 1)
        if set(pos_words[positions[i + j]] for j in range(word_count)) == word_set
    ]

data = {
    "all": [48, 57, 76],
    "white": [51, 60, 67],
    "cats": [37, 49, 58, 68, 80],
}
print(calculate_score(data))
The result contains the position of the first word of each triplet together with the calculated score:
[(48, 3.3333333333333357),
(49, 9.333333333333336),
(51, 8.666666666666664),
(57, 3.3333333333333357),
(67, 11.333333333333329)]

Related

Algorithm Complexity: Time and Space

I have two solutions to one coding problem. Both of the solutions work fine. I just need to find the time/space complexity for both of them.
Question: Given a string of words, return the largest set of unique words in the given string that are anagrams of each other.
Example:
Input: 'I am bored and robed, nad derob'
Correct output: {bored, robed, derob}
Wrong output: {and, nad}
Solution 1:
In the first solution, I iterate over the given string of words, take each word, sort the characters in it, and add the result to a dictionary as a key. The original (unsorted) word is added to a set of words that serves as the value for that key. Sorting reveals which words are anagrams of each other. At each iteration, I also keep track of the key that has the longest set of words as its value. At the end, I return the set stored under that key.
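Only Solution 2's code was posted; for reference, here is a minimal sketch of Solution 1 as described above (sorted letters as the dictionary key):

from collections import defaultdict

def solution_one(string):
    # Key: the word's letters sorted (lowercased); value: set of original words.
    d = defaultdict(set)
    longest_key = None
    for word in string.replace(",", "").replace(".", "").split():
        key = ''.join(sorted(word.lower()))
        d[key].add(word)
        # Keep track of the key whose set of words is currently the longest.
        if longest_key is None or len(d[key]) > len(d[longest_key]):
            longest_key = key
    return d[longest_key]

print(solution_one('I am bored and robed, nad derob'))  # {'bored', 'robed', 'derob'}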
Solution 2:
In the second solution, I do almost the same. The only difference is that I compute a product of prime numbers for each word and use it as the key in my dictionary. In other words, I don't sort the characters of the word.
from collections import defaultdict

def solution_two(string):
    primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47,
              53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]
    d = defaultdict(set)
    longest_set = ''
    for word in string.replace(",", "").replace(".", "").split():
        # The product of one prime per letter is the same for all anagrams.
        key = 1
        for ch in word:
            key *= primes[ord(ch.lower()) - 97]
        d[key].add(word)
        if len(d[key]) > len(d[longest_set]):
            longest_set = key
    return d[longest_set]
My thoughts:
I think that the runtime of the first solution is O(n), where n is the number of words in the string. But if I sort each word of the string, wouldn't it make the runtime O(n) * O(n log n)?
As for the second one, I think that it has the linear runtime too. But I have the second loop inside the first one where I iterate through each character of a word...
I am also confused about the space complexity for both of the solutions.
Any guidance would be greatly appreciated. Thank you.
ALG1: time is O(n) * O(awl log(awl)) and space is O(n) * O(awl), where awl is the average word length. But be careful: this looks pretty good, but if awl is much smaller than n you get roughly O(n) time, while if awl dominates you get O(awl log(awl)); the worst case of all is n = awl.
ALG2: time is O(n) * O(awl), where O(n) is for iterating over the words of the string and O(awl) is for calculating the prime product of each word (i.e. take a word, find a prime for each char of the word in the list primes, and multiply them); space is O(n).
The same considerations on n and awl apply as for the previous algorithm.
So if I am correct, your second algorithm in the worst case has a complexity of n², better than the other one in time, and also better in space!
Just to recap :)

Is it possible to predict the next number

I have a query relating to pattern recognition in a series: can we get the next number just by inputting a series, based on some mathematical logic?
Input : 1, 10, 30, 68, ?
Output: 130. (or whatever logic fits )
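There is no single answer: for any finite prefix, more than one rule fits. As an illustration (not the rule behind the 130 above), here is a small Python sketch that extrapolates by finite differences, assuming the deepest difference level stays constant; for this series it predicts 131, which shows that several "logics" fit the same input.

def extrapolate(seq):
    # Build the table of successive differences.
    levels = [list(seq)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([b - a for a, b in zip(prev, prev[1:])])
    # Assume the deepest difference repeats, then fold back up.
    next_val = levels[-1][-1]
    for level in reversed(levels[:-1]):
        next_val += level[-1]
    return next_val

print(extrapolate([1, 10, 30, 68]))  # 131 under this particular logic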

Algorithm to find similar strings in a list of many strings

I know about approximate string searching and things like the Levenshtein distance, but what I want to do is take a large list of strings and quickly pick out any matching pairs that are similar to each other (say, 1 Damerau-Levenshtein distance apart). So something like this
l = ["moose", "tiger", "lion", "mouse", "rat", "fish", "cat"]
matching_strings(l)
# Output
# [["moose","mouse"],["rat", "cat"]]
I only really know how to use R and Python, so bonus points if your solution can be easily implemented in one of those languages.
UPDATE:
Thanks to Collapsar's help, here is a solution in Python
import numpy
import functools

alphabet = {'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4, 'f': 5, 'g': 6, 'h': 7, 'i': 8, 'j': 9, 'k': 10, 'l': 11, 'm': 12, 'n': 13, 'o': 14, 'p': 15, 'q': 16, 'r': 17, 's': 18, 't': 19, 'u': 20, 'v': 21, 'w': 22, 'x': 23, 'y': 24, 'z': 25}
l = ["moose", "tiger", "lion", "mouse", "rat", "fish", "cat"]

fvlist = []
for string in l:
    fv = [0] * 26
    for letter in string:
        fv[alphabet[letter]] += 1
    fvlist.append(fv)

fvlist.sort(key=functools.cmp_to_key(
    lambda fv1, fv2: numpy.sign(numpy.sum(numpy.subtract(fv1, fv2)))))
However, the sorted vectors are returned in the following order:
"rat" "cat" "lion" "fish" "moose" "tiger" "mouse"
I would consider this sub-optimal, because I would want "moose" and "mouse" to end up next to each other. I understand that, however I sort these words, there is no way to get every word next to all of its closest pairs. Still, I am open to alternative solutions.
One way to do that (with complexity O(n k^2), where n is number of strings and k is the longest string) is to convert every string into a set of masks like this:
rat => ?at, r?t, ra?, ?rat, r?at, ra?t, rat?
This way, if two words differ in one letter, like 'rat' and 'cat', they will both have the mask ?at among others, while if one word extends the other by one letter, like 'rat' and 'rats', they will both have the mask rat?.
Then you just group strings by their masks and print the groups that have more than one string. You might want to dedup your array first if it has duplicates.
Here's an example code, with an extra cats string in it.
l = ["moose", "tiger", "lion", "mouse", "rat", "fish", "cat", "cats"]
d = {}
def add(mask, s):
if mask not in d:
d[mask] = []
d[mask].append(s)
for s in l:
for pos in range(len(s)):
add(s[:pos] + '?' + s[pos + 1:], s)
add(s[:pos] + '?' + s[pos:], s)
add(s + '?', s)
for k, v in d.items():
if len(v) > 1:
print v
Outputs
['moose', 'mouse']
['rat', 'cat']
['cat', 'cats']
First, index your list with some fuzzy-search indexing.
Second, iterate over your list and find the neighbors of each entry by a quick lookup in the pre-built index.
About fuzzy indexing:
Approximately 15 years ago I wrote a fuzzy search that can find the N closest neighbors. It is my modification of Wilbur's trigram algorithm, and this modification is named the "Wilbur-Khovayko algorithm".
The basic idea: split strings into trigrams and search for the maximal intersection score.
For example, the string "hello world" generates the trigrams: hel, ell, llo, "lo ", "o_w", and so on. It also produces special prefix/suffix trigrams for each word, like $he, $wo, lo$, ld$.
Thereafter, for each trigram an index is built, recording in which terms it is present.
So, this is a list of term_IDs for each trigram.
When the user submits a string, it is also split into trigrams, and the program searches for the maximal intersection score and generates an N-sized result list.
It works quickly: I remember that on an old Sun/Solaris box (256 MB RAM, 200 MHz CPU), it found the 100 closest terms in a dictionary of 5,000,000 terms in 0.25 s.
You can get my old source from: http://olegh.ftp.sh/wilbur-khovayko.tgz
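The linked archive is the real implementation; just to illustrate the idea, here is a minimal Python sketch of trigram scoring (a real index would be inverted, mapping each trigram to the terms containing it, rather than scanning linearly):

def trigrams(s):
    # Collect letter trigrams plus prefix/suffix markers for each word.
    grams = set()
    for word in s.lower().split():
        padded = '$' + word + '$'
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams

def closest(query, terms, n=3):
    q = trigrams(query)
    # Rank terms by the size of the trigram intersection with the query.
    return sorted(terms, key=lambda t: -len(q & trigrams(t)))[:n]

print(closest("helo world", ["hello world", "hold the door", "world peace"]))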
The naive implementation amounts to setting up a boolean matrix indexed by the strings (i.e. their position in the sorted list) and comparing each pair of strings, setting the corresponding matrix element to true iff the strings are 'similar' wrt your criterion. This will run in O(n^2).
You might be better off by transforming your strings into tuples of character frequencies ( e.g. 'moose' -> (0,0,0,0,1,0,0,0,0,0,0,0,1,0,2,0,0,0,1,0,0,0,0,0,0,0) where the i-th vector component represents the i-th letter in the alphabet). Note that the frequency vectors will differ in 'few' components only ( e.g. for D-L distance 1 in at most 2 components, the respective differences being +1,-1 ).
Sort your transformed data. Candidates for the pairs you wish to generate will be adjacent or at least 'close' to each other in your sorted list of transformed values. You check the candidates by comparing each list entry with at most k of its successors in the list, k being a small integer (actually comparing the corresponding strings, of course). This algorithm will run in O(n log n).
You have to trade off between the added overhead of transformation / sorting (with complex comparison operations depending on the representation you choose for the frequency vectors ) and the reduced number of comparisons. The method does not consider the intra-word position of characters but only their occurrence. Depending on the actual set of strings there'll be many candidate pairs that do not turn into actually 'similar' pairs.
As your data appears to consist of English lexemes, a potential optimisation would be to define character classes ( e.g. vowels/consonants, 'e'/other vowels/syllabic consonants/non-syllabic consonants ) instead of individual characters.
Additional optimisation:
Note that precisely those pairs of strings in your data set that are permutations of each other (e.g. [art, tar]) produce identical values under the given transformation. So, if you limit yourself to a D-L distance of 1 and do not count the transposition of adjacent characters as a single edit step, you never need to pick list items with identical transformation values as candidates.
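A minimal Python sketch of this transform-sort-scan scheme, assuming lowercase ASCII input; the window size k and the distance-1 check are the knobs described above:

def within_one_edit(a, b):
    # True iff a and b are at Damerau-Levenshtein distance <= 1.
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        diffs = [i for i in range(len(a)) if a[i] != b[i]]
        if len(diffs) == 1:
            return True  # single substitution
        return (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and a[diffs[0]] == b[diffs[1]] and a[diffs[1]] == b[diffs[0]])  # transposition
    if len(a) > len(b):
        a, b = b, a
    # Lengths differ by one: try deleting each character of the longer string.
    return any(a == b[:i] + b[i + 1:] for i in range(len(b)))

def matching_strings(strings, k=4):
    def fv(s):
        v = [0] * 26
        for ch in s:
            v[ord(ch) - ord('a')] += 1
        return tuple(v)
    ordered = sorted(strings, key=fv)  # sort by character-frequency vector
    return [[s, t]
            for i, s in enumerate(ordered)
            for t in ordered[i + 1:i + 1 + k]
            if within_one_edit(s, t)]

print(matching_strings(["moose", "tiger", "lion", "mouse", "rat", "fish", "cat"]))
# [['mouse', 'moose'], ['rat', 'cat']]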

Finding anagrams for a given word

Two words are anagrams if one of them has exactly the same characters as the other word.
Example : Anagram & Nagaram are anagrams (case-insensitive).
Now there are many questions similar to this. A couple of approaches to determine whether two strings are anagrams are:
1) Sort the strings and compare them.
2) Create a frequency map for these strings and check if they are the same or not.
But in this case, we are given a word (for the sake of simplicity, let us assume a single word, which will have single-word anagrams only) and we need to find its anagrams.
The solution I have in mind is to generate all permutations of the word and check which of them exist in the dictionary. But clearly this is highly inefficient. Yes, the dictionary is available too.
So what alternatives do we have here?
I also read in a similar thread that something can be done using tries, but the person didn't explain what the algorithm was or why a trie was used in the first place; just an implementation was provided, and that too in Python or Ruby. So that wasn't really helpful, which is why I have created this new thread. If someone wants to share their implementation (other than C, C++, or Java), kindly explain it too.
Example algorithm:
Open dictionary
Create empty hashmap H
For each word in dictionary:
Create a key that is the word's letters sorted alphabetically (and forced to one case)
Add the word to the list of words accessed by the hash key in H
To check for all anagrams of a given word:
Create a key that is the letters of the word, sorted (and forced to one case)
Look up that key in H
You now have a list of all anagrams
Relatively fast to build, blazingly fast on look-up.
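For illustration, a minimal Python sketch of this build/look-up scheme (the dictionary is assumed to be a plain list of strings):

from collections import defaultdict

def build_index(dictionary):
    # Key: the word's letters sorted alphabetically and forced to one case.
    index = defaultdict(list)
    for word in dictionary:
        index[''.join(sorted(word.lower()))].append(word)
    return index

index = build_index(["Anagram", "Nagaram", "pots", "stop", "tops"])
print(index[''.join(sorted("nagaram"))])  # ['Anagram', 'Nagaram']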
I came up with a new solution, I guess. It uses the Fundamental Theorem of Arithmetic. The idea is to use an array of the first 26 prime numbers. Then for each letter in the input word we take the corresponding prime number (A = 2, B = 3, C = 5, D = 7, ...) and calculate the product over the input word. Next we do this for each word in the dictionary, and if a word's product matches the input word's, we add it to the resulting list. All anagrams have the same signature, because:
Any integer greater than 1 is either a prime number, or can be written
as a unique product of prime numbers (ignoring the order).
Here's the code. I convert the word to UPPERCASE, and 65 is the ASCII code of 'A', which corresponds to my first prime number:
private int[] PRIMES = new int[] { 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31,
37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103,
107, 109, 113 };
This is the method:
private long calculateProduct(char[] letters) {
    long result = 1L;
    for (char c : letters) {
        if (c < 65) {
            return -1;
        }
        int pos = c - 65;
        result *= PRIMES[pos];
    }
    return result;
}
We know that if two words don't have the same length, they are not anagrams. So you can partition your dictionary in groups of words of the same length.
Now we focus on only one of these groups and basically all words have exactly the same length in this smaller universe.
Treat each letter position as a dimension, with the value in that dimension based on the letter (say, its ASCII code). Then you can calculate the length of the word vector.
For example, say 'A'=65, 'B'=66, then length("AB") = sqrt(65*65 + 66*66). Obviously, length("AB") = length("BA").
Clearly, if two words are anagrams, then their vectors have the same length. The next question is: if two word vectors (of the same number of letters) have the same length, are they anagrams? Intuitively, I'd say no, since all vectors with that length form a sphere, and there are many. Not sure, since we're in integer space in this case, how many there actually are.
But at the very least it allows you to partition your dictionary even further. For each word in your dictionary, calculate the vector's distance:
for(each letter c) { distance += c*c }; distance = sqrt(distance);
Then create a map for all words of length n, and key it with the distance and the value is a list of words of length n that yield that particular distance.
You'll create a map for each distance.
Then your lookup becomes the following algorithm:
Use the correct dictionary map based on the length of the word
Compute the length of your word's vector
Lookup the list of words that match that length
Go through the list and pick the anagrams using a naive algorithm, now that the list of candidates is greatly reduced
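A minimal Python sketch of this partition-by-length-and-vector-length scheme; using the integer sum of squares as the key (the square of the distance) avoids floating-point keys while preserving the same grouping:

from collections import defaultdict

def vector_key(word):
    # Sum of squared letter codes: equal for anagrams, cheap to compare.
    return sum(ord(c) ** 2 for c in word)

# maps[word_length][vector_key] -> candidate words
maps = defaultdict(lambda: defaultdict(list))
for w in ["pots", "stop", "tops", "opts", "spin"]:
    maps[len(w)][vector_key(w)].append(w)

def anagrams(word):
    candidates = maps[len(word)][vector_key(word)]
    # Final naive check over the (greatly reduced) candidate list.
    return [c for c in candidates if sorted(c) == sorted(word)]

print(anagrams("post"))  # ['pots', 'stop', 'tops', 'opts']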
Reduce the words to - say - lower case (clojure.string/lower-case).
Classify them (group-by) by letter frequency-map (frequencies).
Drop the frequency maps, leaving the collections of anagrams.
The parenthesised names are the corresponding functions in the Lisp dialect Clojure.
The whole function can be expressed so:
(defn anagrams [dict]
  (->> dict
       (map clojure.string/lower-case)
       (group-by frequencies)
       vals))
For example,
(anagrams ["Salt" "last" "one" "eon" "plod"])
;(["salt" "last"] ["one" "eon"] ["plod"])
An indexing function that maps each thing to its collection is
(defn index [xss]
  (into {} (for [xs xss, x xs] [x xs])))
So that, for example,
((comp index anagrams) ["Salt" "last" "one" "eon" "plod"])
;{"salt" ["salt" "last"], "last" ["salt" "last"], "one" ["one" "eon"], "eon" ["one" "eon"], "plod" ["plod"]}
... where comp is the functional composition operator.
Well, tries would make it easier to check whether the word exists.
So if you put the whole dictionary in a trie:
http://en.wikipedia.org/wiki/Trie
then you can afterwards take your word and do simple backtracking: take a char and recursively check whether we can "walk" down the trie with any combination of the remaining chars (adding one char at a time). When all chars are used up in a recursion branch and there was a valid path in the trie, the word exists.
The trie helps because it gives a nice stopping condition:
We can check whether a part of a string, e.g. "Anag", is a valid path in the trie; if not, we can cut that particular recursion branch. This means we don't have to check every single permutation of the characters.
In pseudo-code
checkAllChars(currentPositionInTrie, currentlyUsedChars, restOfWord)
    if (restOfWord == 0)
    {
        AddWord(currentlyUsedChars)
    }
    else
    {
        foreach (char in restOfWord)
        {
            nextPositionInTrie = Trie.Walk(currentPositionInTrie, char)
            if (nextPositionInTrie != Positions.NOT_POSSIBLE)
            {
                checkAllChars(nextPositionInTrie, currentlyUsedChars.With(char), restOfWord.Without(char))
            }
        }
    }
Obviously you need a nice trie data structure which allows you to progressively "walk" down the tree and check at each node whether there is a path with the given char to some next node...
static void Main(string[] args)
{
    string str1 = "Tom Marvolo Riddle";
    string str2 = "I am Lord Voldemort";
    str2 = str2.Replace(" ", string.Empty);
    str1 = str1.Replace(" ", string.Empty);
    if (str1.Length != str2.Length)
        Console.WriteLine("Strings are not anagram");
    else
    {
        str1 = str1.ToUpper();
        str2 = str2.ToUpper();
        int countStr1 = 0;
        int countStr2 = 0;
        for (int i = 0; i < str1.Length; i++)
        {
            countStr1 += str1[i];
            countStr2 += str2[i];
        }
        // NOTE: equal character sums are necessary but not sufficient for
        // anagrams (e.g. "AD" and "BC" have the same sum), so this is only
        // a quick filter, not a proof.
        if (countStr2 != countStr1)
            Console.WriteLine("Strings are not anagram");
        else Console.WriteLine("Strings are anagram");
    }
    Console.Read();
}
Generating all permutations is easy, I guess you are worried that checking their existence in the dictionary is the "highly inefficient" part. But that actually depends on what data structure you use for the dictionary: of course, a list of words would be inefficient for your use case. Speaking of Tries, they would probably be an ideal representation, and quite efficient, too.
Another possibility would be to do some pre-processing on your dictionary, e.g. build a hashtable where the keys are the word's letters sorted, and the values are lists of words. You can even serialize this hashtable so you can write it to a file and reload quickly later. Then to look up anagrams, you simply sort your given word and look up the corresponding entry in the hashtable.
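A minimal Python sketch of this pre-processing plus serialization idea (the file name is just an example):

import pickle
from collections import defaultdict

def build_table(words):
    table = defaultdict(list)
    for word in words:
        table[''.join(sorted(word))].append(word)
    return dict(table)

table = build_table(["pots", "stop", "tops"])
with open('anagrams.pickle', 'wb') as f:
    pickle.dump(table, f)    # build once, write to a file

with open('anagrams.pickle', 'rb') as f:
    table = pickle.load(f)   # ...reload quickly later
print(table.get(''.join(sorted("spot")), []))  # ['pots', 'stop', 'tops']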
That depends on how you store your dictionary. If it is a simple array of words, no algorithm will be faster than linear.
If it is sorted, then here's an approach that may work. I've just invented it, but I guess it's faster than the linear approach.
Denote your dictionary by D and the current prefix by S. Initially S is empty.
Create a frequency map for your word. Let's denote it by F.
Using binary search, find pointers to the start of each letter in the dictionary. Let's denote this array of pointers by P.
For each char c from A to Z, if F[c] == 0, skip it; else:
    S += c;
    F[c]--;
    P <- for every character i, P[i] = pointer to the first word beginning with S+i.
    Recursively apply step 4 until you find a match for your word or find that no such match exists.
This is how I would do it, anyway. There should be a more conventional approach, but this is faster than linear.
I tried to implement the hashmap solution:

import java.util.HashMap;

public class Dictionary {
    public static void main(String[] args) {
        String[] dictionary = new String[] { "dog", "god", "tool", "loot", "rose", "sore" };
        HashMap<String, String> h = new HashMap<String, String>();
        QuickSort q = new QuickSort();
        for (int i = 0; i < dictionary.length; i++) {
            String temp = q.quickSort(dictionary[i]); // sorted word, e.g. "dgo" for "dog"
            if (!h.containsKey(temp)) {
                h.put(temp, dictionary[i]);
            } else {
                String s = h.get(temp);
                h.put(temp, s + " , " + dictionary[i]);
            }
        }
        String word = "tolo";
        String sortedword = q.quickSort(word);
        if (h.containsKey(sortedword.toLowerCase())) { // lowercased to make the lookup case-insensitive
            System.out.println("anagrams from Dictionary : " + h.get(sortedword.toLowerCase()));
        }
    }
}
Compute the frequency count vector for each word in the dictionary, a vector whose length is the size of the alphabet.
Generate a random Gaussian vector of the same length.
Project each dictionary word's count vector onto this random direction and store the value (inserting so that the array of values stays sorted).
Given a new test word, project it onto the same random direction used for the dictionary words.
Do a binary search to find the list of words that map to the same value.
Verify whether each word obtained above is indeed a true anagram. If not, remove it from the list.
Return the remaining elements of the list.
PS: The above procedure is a generalization of the prime number procedure which may potentially lead to large numbers (and hence computational precision issues)
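A minimal Python sketch of this random-projection scheme, assuming numpy; a hash bucket keyed on the rounded projection stands in for the sorted array plus binary search described above:

import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
direction = rng.normal(size=26)  # one Gaussian direction shared by all words

def count_vector(word):
    v = np.zeros(26)
    for ch in word.lower():
        v[ord(ch) - ord('a')] += 1
    return v

dictionary = ["pots", "stop", "tops", "spin", "pins"]
buckets = defaultdict(list)
for w in dictionary:
    # Anagrams have identical count vectors, hence identical projections.
    buckets[round(float(count_vector(w) @ direction), 9)].append(w)

def anagrams(word):
    candidates = buckets[round(float(count_vector(word) @ direction), 9)]
    # Distinct letter multisets can (rarely) collide, so verify each candidate.
    return [c for c in candidates if sorted(c) == sorted(word)]

print(anagrams("opts"))  # ['pots', 'stop', 'tops']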
# list of words
words = ["ROOPA", "TABU", "OOPAR", "BUTA", "BUAT", "PAROO", "Soudipta",
         "Kheyali Park", "Tollygaunge", "AROOP", "Love", "AOORP",
         "Protijayi", "Paikpara", "dipSouta", "Shyambazaar",
         "jayiProti", "North Calcutta", "Sovabazaar"]

# Method 1
A = [''.join(sorted(word)) for word in words]
dict = {}
for indexofsamewords, samewords in enumerate(A):
    dict.setdefault(samewords, []).append(indexofsamewords)
print(dict)
# {'AOOPR': [0, 2, 5, 9, 11], 'ABTU': [1, 3, 4], 'Sadioptu': [6, 14], ' KPaaehiklry': [7], 'Taeggllnouy': [8], 'Leov': [10], 'Paiijorty': [12, 16], 'Paaaikpr': [13], 'Saaaabhmryz': [15], ' CNaachlortttu': [17], 'Saaaaborvz': [18]}

for index in dict.values():
    print([words[i] for i in index])

The output:
['ROOPA', 'OOPAR', 'PAROO', 'AROOP', 'AOORP']
['TABU', 'BUTA', 'BUAT']
['Soudipta', 'dipSouta']
['Kheyali Park']
['Tollygaunge']
['Love']
['Protijayi', 'jayiProti']
['Paikpara']
['Shyambazaar']
['North Calcutta']
['Sovabazaar']
One solution is:
Map prime numbers to the alphabet characters and multiply the primes together.
For example:
a -> 2
b -> 3
...
z -> 101
So:
'ab' -> 6
'ba' -> 6
'bab' -> 18
'abba' -> 36
'baba' -> 36
Compute this MUL_number for the given word, then return all the words from the dictionary that have the same MUL_number as the given word.
First check whether the lengths of the strings are the same.
Then check whether the sums of the character codes (i.e. the ASCII code sums) in both strings are the same.
Note that an equal sum is necessary but not sufficient: different letter multisets can give the same sum (e.g. "ad" and "bc"), so a final letter-count comparison is still needed before declaring the words anagrams.

Algorithm to generate anagrams

What would be the best strategy to generate anagrams?
An anagram is a type of word play, the result of rearranging the letters
of a word or phrase to produce a new word or phrase, using all the original
letters exactly once;
ex.
Eleven plus two is anagram of Twelve plus one
A decimal point is anagram of I'm a dot in place
Astronomers is anagram of Moon starers
At first it looks straightforward: just jumble the letters and generate all possible combinations. But what would be an efficient approach to generate only the words that are in the dictionary?
I came across this page, Solving anagrams in Ruby.
But what are your ideas?
Most of these answers are horribly inefficient and/or will only give one-word solutions (no spaces). My solution will handle any number of words and is very efficient.
What you want is a trie data structure. Here's a complete Python implementation. You just need a word list saved in a file named words.txt. You can try the Scrabble dictionary word list here:
http://www.isc.ro/lists/twl06.zip
MIN_WORD_SIZE = 4  # min size of a word in the output

class Node(object):
    def __init__(self, letter='', final=False, depth=0):
        self.letter = letter
        self.final = final
        self.depth = depth
        self.children = {}

    def add(self, letters):
        node = self
        for index, letter in enumerate(letters):
            if letter not in node.children:
                node.children[letter] = Node(letter, index == len(letters) - 1, index + 1)
            node = node.children[letter]
        node.final = True  # also marks words that are prefixes of previously added words

    def anagram(self, letters):
        tiles = {}
        for letter in letters:
            tiles[letter] = tiles.get(letter, 0) + 1
        min_length = len(letters)
        return self._anagram(tiles, [], self, min_length)

    def _anagram(self, tiles, path, root, min_length):
        if self.final and self.depth >= MIN_WORD_SIZE:
            word = ''.join(path)
            length = len(word.replace(' ', ''))
            if length >= min_length:
                yield word
            # Complete word found: jump back to the root to start the next word.
            path.append(' ')
            for word in root._anagram(tiles, path, root, min_length):
                yield word
            path.pop()
        for letter, node in self.children.items():
            count = tiles.get(letter, 0)
            if count == 0:
                continue
            tiles[letter] = count - 1
            path.append(letter)
            for word in node._anagram(tiles, path, root, min_length):
                yield word
            path.pop()
            tiles[letter] = count

def load_dictionary(path):
    result = Node()
    for line in open(path, 'r'):
        word = line.strip().lower()
        result.add(word)
    return result

def main():
    print('Loading word list.')
    words = load_dictionary('words.txt')
    while True:
        letters = input('Enter letters: ')
        letters = letters.lower()
        letters = letters.replace(' ', '')
        if not letters:
            break
        count = 0
        for word in words.anagram(letters):
            print(word)
            count += 1
        print('%d results.' % count)

if __name__ == '__main__':
    main()
When you run the program, the words are loaded into a trie in memory. After that, just type in the letters you want to search with and it will print the results. It will only show results that use all of the input letters, nothing shorter.
It filters short words from the output, otherwise the number of results is huge. Feel free to tweak the MIN_WORD_SIZE setting. Keep in mind, just using "astronomers" as input gives 233,549 results if MIN_WORD_SIZE is 1. Perhaps you can find a shorter word list that only contains more common English words.
Also, the contraction "I'm" (from one of your examples) won't show up in the results unless you add "im" to the dictionary and set MIN_WORD_SIZE to 2.
The trick to getting multiple words is to jump back to the root node in the trie whenever you encounter a complete word in the search. Then you keep traversing the trie until all letters have been used.
For each word in the dictionary, sort the letters alphabetically. So "foobar" becomes "abfoor."
Then when the input anagram comes in, sort its letters too, then look it up. It's as fast as a hashtable lookup!
For multiple words, you could do combinations of the sorted letters, sorting as you go. Still much faster than generating all combinations.
(see comments for more optimizations and details)
See this assignment from the University of Washington CSE department.
Basically, you have a data structure that just has the counts of each letter in a word (an array works for ascii, upgrade to a map if you want unicode support). You can subtract two of these letter sets; if a count is negative, you know one word can't be an anagram of another.
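A minimal Python sketch of that subtract-the-counts check (Counter plays the role of the letter-count array):

from collections import Counter

def can_form(word, letters):
    # Subtract the word's letter counts from the available letters;
    # any negative count means the word cannot be built from them.
    remaining = Counter(letters)
    remaining.subtract(Counter(word))
    return all(count >= 0 for count in remaining.values())

print(can_form("moon", "astronomers"))   # True
print(can_form("comet", "astronomers"))  # False: no 'c' available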
Pre-process:
Build a trie with each leaf as a known word, keyed in alphabetical order.
At search time:
Consider the input string as a multiset. Find the first sub-word by traversing the index trie as in a depth-first search. At each branch you can ask, is letter x in the remainder of my input? If you have a good multiset representation, this should be a constant time query (basically).
Once you have the first sub-word, you can keep the remainder multiset and treat it as a new input to find the rest of that anagram (if any exists).
Augment this procedure with memoization for faster look-ups on common remainder multisets.
This is pretty fast - each trie traversal is guaranteed to give an actual subword, and each traversal takes linear time in the length of the subword (and subwords are usually pretty darn small, by coding standards). However, if you really want something even faster, you could include all n-grams in your pre-process, where an n-gram is any string of n words in a row. Of course, if W = #words, then you'll jump from index size O(W) to O(W^n). Maybe n = 2 is realistic, depending on the size of your dictionary.
One of the seminal works on programmatic anagrams was by Michael Morton (Mr. Machine Tool), using a tool called Ars Magna. Here is a light article based on his work.
So here's the working solution, in Java, that Jason Cohen suggested and it performs somewhat better than the one using trie. Below are some of the main points:
Only load dictionary with the words that are subsets of given set of words
Dictionary will be a hash of sorted words as key and set of actual words as values (as suggested by Jason)
Iterate through each word from dictionary key and do a recursive forward lookup to see if any valid anagram is found for that key
Only do forward lookup because, anagrams for all the words that have already been traversed, should have already been found
Merge all the words associated with the keys. For example, if 'enlist' is the word whose anagrams are to be found, and one pair of keys to merge is [ins] and [elt], where the actual words for key [ins] are [sin] and [ins] and for key [elt] it is [let], then the final sets of merged words are [sin, let] and [ins, let], which will be part of our final anagram list
Also to note that, this logic will only list unique set of words i.e. "eleven plus two" and "two plus eleven" would be same and only one of them would be listed in the output
Below is the main recursive code which finds the set of anagram keys:
// recursive function to find all the anagrams for charInventory characters
// starting with the word at dictionaryIndex in dictionary keyList
private Set<Set<String>> findAnagrams(int dictionaryIndex, char[] charInventory, List<String> keyList) {
// terminating condition if no words are found
if (dictionaryIndex >= keyList.size() || charInventory.length < minWordSize) {
return null;
}
String searchWord = keyList.get(dictionaryIndex);
char[] searchWordChars = searchWord.toCharArray();
// this is where you find the anagrams for whole word
if (AnagramSolverHelper.isEquivalent(searchWordChars, charInventory)) {
Set<Set<String>> anagramsSet = new HashSet<Set<String>>();
Set<String> anagramSet = new HashSet<String>();
anagramSet.add(searchWord);
anagramsSet.add(anagramSet);
return anagramsSet;
}
// this is where you find the anagrams with multiple words
if (AnagramSolverHelper.isSubset(searchWordChars, charInventory)) {
// update charInventory by removing the characters of the search
// word as it is subset of characters for the anagram search word
char[] newCharInventory = AnagramSolverHelper.setDifference(charInventory, searchWordChars);
if (newCharInventory.length >= minWordSize) {
Set<Set<String>> anagramsSet = new HashSet<Set<String>>();
for (int index = dictionaryIndex + 1; index < keyList.size(); index++) {
Set<Set<String>> searchWordAnagramsKeysSet = findAnagrams(index, newCharInventory, keyList);
if (searchWordAnagramsKeysSet != null) {
Set<Set<String>> mergedSets = mergeWordToSets(searchWord, searchWordAnagramsKeysSet);
anagramsSet.addAll(mergedSets);
}
}
return anagramsSet.isEmpty() ? null : anagramsSet;
}
}
// no anagrams found for current word
return null;
}
You can fork the repo from here and play with it. There are many optimizations that I might have missed. But the code works and does find all the anagrams.
And here is my novel solution.
Jon Bentley’s book Programming Pearls contains a problem about finding anagrams of words.
The statement:
Given a dictionary of english words, find all sets of anagrams. For
instance, “pots”, “stop” and “tops” are all anagrams of one another
because each can be formed by permuting the letters of the others.
I thought a bit and it came to me that the solution would be to obtain the signature of the word you’re searching and comparing it with all the words in the dictionary. All anagrams of a word should have the same signature. But how to achieve this? My idea was to use the Fundamental Theorem of Arithmetic:
The fundamental theorem of arithmetic states that
every positive integer (except the number 1) can be represented in
exactly one way apart from rearrangement as a product of one or more
primes
So the idea is to use an array of the first 26 prime numbers. Then for each letter in the word we get the corresponding prime number A = 2, B = 3, C = 5, D = 7 … and then we calculate the product of our input word. Next we do this for each word in the dictionary and if a word matches our input word, then we add it to the resulting list.
The performance is more or less acceptable. For a dictionary of 479828 words, it takes 160 ms to get all anagrams. This is roughly 0.0003 ms / word, or 0.3 microsecond / word. Algorithm’s complexity seems to be O(mn) or ~O(m) where m is the size of the dictionary and n is the length of the input word.
Here’s the code:
package com.vvirlan;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.Scanner;

public class Words {
    private int[] PRIMES = new int[] { 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73,
            79, 83, 89, 97, 101, 103, 107, 109, 113 };

    public static void main(String[] args) {
        Scanner s = new Scanner(System.in);
        String word = "hello";
        System.out.println("Please type a word:");
        if (s.hasNext()) {
            word = s.next();
        }
        Words w = new Words();
        w.start(word);
    }

    private void start(String word) {
        measureTime();
        char[] letters = word.toUpperCase().toCharArray();
        long searchProduct = calculateProduct(letters);
        System.out.println(searchProduct);
        try {
            findByProduct(searchProduct);
        } catch (Exception e) {
            e.printStackTrace();
        }
        measureTime();
        System.out.println(matchingWords);
        System.out.println("Total time: " + time);
    }

    private List<String> matchingWords = new ArrayList<>();

    private void findByProduct(long searchProduct) throws IOException {
        File f = new File("/usr/share/dict/words");
        FileReader fr = new FileReader(f);
        BufferedReader br = new BufferedReader(fr);
        String line = null;
        while ((line = br.readLine()) != null) {
            char[] letters = line.toUpperCase().toCharArray();
            long p = calculateProduct(letters);
            if (p == -1) {
                continue;
            }
            if (p == searchProduct) {
                matchingWords.add(line);
            }
        }
        br.close();
    }

    private long calculateProduct(char[] letters) {
        long result = 1L;
        for (char c : letters) {
            if (c < 65) {
                return -1;
            }
            int pos = c - 65;
            result *= PRIMES[pos];
        }
        return result;
    }

    private long time = 0L;

    private void measureTime() {
        long t = new Date().getTime();
        if (time == 0L) {
            time = t;
        } else {
            time = t - time;
        }
    }
}
I used the following way of computing anagrams a couple of months ago:
Compute a "code" for each word in your dictionary: create a lookup table from letters of the alphabet to prime numbers, e.g. starting with ['a', 2] and ending with ['z', 101]. As a pre-processing step, compute the code of each word in your dictionary by looking up the prime number for each letter it consists of and multiplying them together. For later lookup, create a multimap from codes to words.
Compute the code of your input word as outlined above.
Compute codeInDictionary % inputCode for each code in the multimap. If the result is 0, you've found an anagram and can look up the corresponding word. This works for 2-or-more-word anagrams as well.
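A minimal Python sketch of this code-multimap scheme; note that in the divisibility test below I check inputCode % code == 0, i.e. whether a dictionary word's letters fit inside the input, which is the building block for multi-word anagrams:

from collections import defaultdict

PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47,
          53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]

def code(word):
    product = 1
    for ch in word.lower():
        product *= PRIMES[ord(ch) - ord('a')]
    return product

# Pre-processing: multimap from code to words.
by_code = defaultdict(list)
for w in ["pots", "stop", "tops", "opt", "so"]:
    by_code[code(w)].append(w)

word = "post"
print(by_code[code(word)])  # exact anagrams: ['pots', 'stop', 'tops']
# Words whose letters are a sub-multiset of the input:
print([w for c, ws in by_code.items() if code(word) % c == 0 for w in ws])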
Hope that was helpful.
The book Programming Pearls by Jon Bentley covers this kind of stuff quite nicely. A must-read.
How I see it:
you'd want to build a table that maps unordered sets of letters to lists of words, i.e. go through the dictionary so you'd wind up with, say,
lettermap[set(a,e,d,f)] = { "deaf", "fade" }
then from your starting word, you find the set of letters:
astronomers => (a,e,m,n,o,o,r,r,s,s,t)
then loop through all the partitions of that set ( this might be the most technical part, just generating all the possible partitions), and look up the words for that set of letters.
edit: hmmm, this is pretty much what Jason Cohen posted.
edit: furthermore, the comments on the question mention generating "good" anagrams, like the examples :). after you build your list of all possible anagrams, run them through WordNet and find ones that are semantically close to the original phrase :)
A while ago I wrote a blog post about how to quickly find two-word anagrams. It works really fast: finding all 44 two-word anagrams for a word, with a text file of more than 300,000 words (4 megabytes), takes only 0.6 seconds in a Ruby program.
Two Word Anagram Finder Algorithm (in Ruby)
It is possible to make the application faster when it is allowed to preprocess the wordlist into a large hash mapping from words sorted by letters to a list of words using these letters. This preprocessed data can be serialized and used from then on.
If I take the dictionary as a hash map (as every word is unique), with the key being a binary (or hex) representation of the word, then given a word I can find its meaning with O(1) complexity.
Now, to generate all the valid anagrams, we need to verify whether each generated anagram is in the dictionary; if it is present in the dictionary, it's valid, else we ignore it.
I will assume that a word can have at most 100 characters (or more, but there is some limit).
So we can treat any word as a sequence of indexes: a word like "hello" can be represented as "12345".
Now the anagrams of "12345" are "12354", "12435", etc.
The only thing we need to do is to store all such permutations of indexes for each particular number of characters. This is a one-time task.
Words can then be generated from these permutations by picking the characters at those indexes; hence we get the anagrams.
To verify whether an anagram is valid, just index into the dictionary and check.
The only thing that needs to be handled is duplicates, and that can be done easily by comparing against the previous words that have been looked up in the dictionary.
The solution emphasizes performance.
Off the top of my head, the solution that makes the most sense would be to pick a letter out of the input string randomly and filter the dictionary based on words that start with that. Then pick another, filter on the second letter, etc. In addition, filter out words that can't be made with the remaining text. Then when you hit the end of a word, insert a space and start it over with the remaining letters. You might also restrict words based on word type (e.g. you wouldn't have two verbs next to each other, you wouldn't have two articles next to each other, etc).
As Jason suggested, prepare a dictionary hashtable whose key is the word sorted alphabetically and whose value is the word itself (you may have multiple values per key).
Remove whitespace and sort your query before looking it up.
After this, you'd need to do some sort of a recursive, exhaustive search. Pseudo code is very roughly:
function FindWords(solutionList, wordsSoFar, sortedQuery)
    // base case
    if sortedQuery is empty
        solutionList.Add(wordsSoFar)
        return
    // recursive case
    // InitialStrings("abc") is {"a","ab","abc"}
    foreach initialStr in InitialStrings(sortedQuery)
        // Remaining letters after initialStr
        sortedQueryRec := sortedQuery.Substring(initialStr.Length)
        words := words matching initialStr in the dictionary
        // Note that sometimes the words list will be empty
        foreach word in words
            // Append should return a new list, not change wordsSoFar
            wordsSoFarRec := Append(wordsSoFar, word)
            FindWords(solutionList, wordsSoFarRec, sortedQueryRec)
In the end, you need to iterate through the solutionList, and print the words in each sublist with spaces between them. You might need to print all orderings for these cases (e.g. "I am Sam" and "Sam I am" are both solutions).
Of course, I didn't test this, and it's a brute force approach.
