Algorithm to generate anagrams

What would be the best strategy to generate anagrams?
An anagram is a type of word play, the result of rearranging the letters
of a word or phrase to produce a new word or phrase, using all the original
letters exactly once.
For example:
"Eleven plus two" is an anagram of "Twelve plus one"
"A decimal point" is an anagram of "I'm a dot in place"
"Astronomers" is an anagram of "Moon starers"
At first it looks straightforward: just jumble the letters and generate all possible combinations. But what would be an efficient approach that generates only words found in a dictionary?
I came across this page, Solving anagrams in Ruby.
But what are your ideas?

Most of these answers are horribly inefficient and/or will only give one-word solutions (no spaces). My solution will handle any number of words and is very efficient.
What you want is a trie data structure. Here's a complete Python implementation. You just need a word list saved in a file named words.txt. You can try the Scrabble dictionary word list here:
http://www.isc.ro/lists/twl06.zip
MIN_WORD_SIZE = 4 # min size of a word in the output

class Node(object):
    def __init__(self, letter='', final=False, depth=0):
        self.letter = letter
        self.final = final
        self.depth = depth
        self.children = {}

    def add(self, letters):
        node = self
        for index, letter in enumerate(letters):
            if letter not in node.children:
                node.children[letter] = Node(letter, False, index+1)
            node = node.children[letter]
            # mark the last node as a word even if it already existed
            # as a prefix of a longer word added earlier
            if index == len(letters)-1:
                node.final = True

    def anagram(self, letters):
        # count how many of each letter (tile) we have to spend
        tiles = {}
        for letter in letters:
            tiles[letter] = tiles.get(letter, 0) + 1
        min_length = len(letters)
        return self._anagram(tiles, [], self, min_length)

    def _anagram(self, tiles, path, root, min_length):
        if self.final and self.depth >= MIN_WORD_SIZE:
            word = ''.join(path)
            length = len(word.replace(' ', ''))
            if length >= min_length:
                yield word
            # a complete word: insert a space and restart from the root
            path.append(' ')
            for word in root._anagram(tiles, path, root, min_length):
                yield word
            path.pop()
        for letter, node in self.children.items():
            count = tiles.get(letter, 0)
            if count == 0:
                continue
            tiles[letter] = count - 1
            path.append(letter)
            for word in node._anagram(tiles, path, root, min_length):
                yield word
            path.pop()
            tiles[letter] = count

def load_dictionary(path):
    result = Node()
    with open(path, 'r') as f:
        for line in f:
            word = line.strip().lower()
            result.add(word)
    return result

def main():
    print('Loading word list.')
    words = load_dictionary('words.txt')
    while True:
        letters = input('Enter letters: ')
        letters = letters.lower()
        letters = letters.replace(' ', '')
        if not letters:
            break
        count = 0
        for word in words.anagram(letters):
            print(word)
            count += 1
        print('%d results.' % count)

if __name__ == '__main__':
    main()
When you run the program, the words are loaded into a trie in memory. After that, just type in the letters you want to search with and it will print the results. It will only show results that use all of the input letters, nothing shorter.
It filters short words from the output, otherwise the number of results is huge. Feel free to tweak the MIN_WORD_SIZE setting. Keep in mind, just using "astronomers" as input gives 233,549 results if MIN_WORD_SIZE is 1. Perhaps you can find a shorter word list that only contains more common English words.
Also, the contraction "I'm" (from one of your examples) won't show up in the results unless you add "im" to the dictionary and set MIN_WORD_SIZE to 2.
The trick to getting multiple words is to jump back to the root node in the trie whenever you encounter a complete word in the search. Then you keep traversing the trie until all letters have been used.

For each word in the dictionary, sort the letters alphabetically. So "foobar" becomes "abfoor."
Then when the input anagram comes in, sort its letters too, then look it up. It's as fast as a hashtable lookup!
For multiple words, you could do combinations of the sorted letters, sorting as you go. Still much faster than generating all combinations.
(see comments for more optimizations and details)
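As a rough illustration of the sorted-letters lookup described above, here is a minimal Python sketch. The function names (signature, build_index, lookup) and the tiny word list are made up for the example and are not from the answer.

```python
from collections import defaultdict

def signature(word):
    """Letters of `word` in alphabetical order, e.g. 'foobar' -> 'abfoor'."""
    return ''.join(sorted(word.lower()))

def build_index(words):
    """Map each signature to the list of dictionary words sharing it."""
    index = defaultdict(list)
    for word in words:
        index[signature(word)].append(word)
    return index

def lookup(index, letters):
    """Return the dictionary words that use exactly these letters."""
    return index.get(signature(letters.replace(' ', '')), [])

# Hypothetical word list, just for demonstration.
index = build_index(["pots", "stop", "tops", "foobar", "spot"])
print(lookup(index, "opts"))
```

Building the index costs one sort per dictionary word; after that, each single-word query is one sort plus one hash lookup.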

See this assignment from the University of Washington CSE department.
Basically, you have a data structure that just has the counts of each letter in a word (an array works for ascii, upgrade to a map if you want unicode support). You can subtract two of these letter sets; if a count is negative, you know one word can't be an anagram of another.
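A small Python sketch of that letter-count idea, assuming collections.Counter as the "map" variant of the data structure. The helper names subtract and is_anagram are invented for this example and are not taken from the assignment.

```python
from collections import Counter

def subtract(letters, word):
    """Return the letters left after removing `word` from `letters`,
    or None if `word` needs more of some letter than `letters` has."""
    remaining = Counter(letters)
    remaining.subtract(Counter(word))
    if any(count < 0 for count in remaining.values()):
        return None  # a negative count means `word` does not fit
    return +remaining  # unary plus drops the zero counts

def is_anagram(a, b):
    """True when `b` uses every letter of `a` exactly once."""
    left = subtract(a, b)
    return left is not None and not left
```

When the result of subtract is a non-empty Counter, it is the inventory still to be covered by further words, which is what a multi-word search would recurse on.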

Pre-process:
Build a trie with each leaf as a known word, keyed in alphabetical order.
At search time:
Consider the input string as a multiset. Find the first sub-word by traversing the index trie as in a depth-first search. At each branch you can ask, is letter x in the remainder of my input? If you have a good multiset representation, this should be a constant time query (basically).
Once you have the first sub-word, you can keep the remainder multiset and treat it as a new input to find the rest of that anagram (if any exists).
Augment this procedure with memoization for faster look-ups on common remainder multisets.
This is pretty fast - each trie traversal is guaranteed to give an actual subword, and each traversal takes linear time in the length of the subword (and subwords are usually pretty darn small, by coding standards). However, if you really want something even faster, you could include all n-grams in your pre-process, where an n-gram is any string of n words in a row. Of course, if W = #words, then you'll jump from index size O(W) to O(W^n). Maybe n = 2 is realistic, depending on the size of your dictionary.
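The "keep the remainder and memoize on it" part can be sketched in Python as below. This is only an approximation of the procedure: for brevity it scans a flat word list instead of walking a trie, and the name multiword_anagrams is an assumption of the example. The remainder is kept as a sorted string so that equal multisets share one cache entry.

```python
from collections import Counter
from functools import lru_cache

def multiword_anagrams(letters, words):
    """Return every ordered sequence of words that uses all of `letters`."""
    word_counts = [(w, Counter(w)) for w in words if w]

    @lru_cache(maxsize=None)
    def solve(remaining):
        # `remaining` is a sorted string standing in for the multiset.
        if not remaining:
            return ((),)
        have = Counter(remaining)
        results = []
        for word, need in word_counts:
            # The word fits if no letter is needed more often than we have it.
            if all(have[c] >= n for c, n in need.items()):
                rest = ''.join(sorted((have - need).elements()))
                results.extend((word,) + tail for tail in solve(rest))
        return tuple(results)

    return [' '.join(seq) for seq in solve(''.join(sorted(letters)))]
```

Each distinct remainder is solved once, however many different first words lead to it.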

One of the seminal works on programmatic anagrams was by Michael Morton (Mr. Machine Tool), using a tool called Ars Magna. Here is a light article based on his work.

So here's a working solution in Java, following what Jason Cohen suggested; it performs somewhat better than the one using a trie. Below are some of the main points:
Only load the dictionary with words that are subsets of the given set of letters
The dictionary is a hash with sorted words as keys and sets of actual words as values (as suggested by Jason)
Iterate through each dictionary key and do a recursive forward lookup to see if any valid anagram is found for that key
Only do a forward lookup, because anagrams for all the words that have already been traversed should already have been found
Merge all the words associated with the keys. For example, if 'enlist' is the word for which anagrams are to be found, and one set of keys to merge is [ins] and [elt], and the actual words for key [ins] are [sin] and [ins], and for key [elt] is [let], then the final sets of merged words would be [sin, let] and [ins, let], which become part of our final anagrams list
Also note that this logic only lists unique sets of words, i.e. "eleven plus two" and "two plus eleven" are the same and only one of them is listed in the output
Below is the main recursive code which finds the set of anagram keys:
// recursive function to find all the anagrams for charInventory characters
// starting with the word at dictionaryIndex in dictionary keyList
private Set<Set<String>> findAnagrams(int dictionaryIndex, char[] charInventory, List<String> keyList) {
    // terminating condition if no words are found
    if (dictionaryIndex >= keyList.size() || charInventory.length < minWordSize) {
        return null;
    }
    String searchWord = keyList.get(dictionaryIndex);
    char[] searchWordChars = searchWord.toCharArray();
    // this is where you find the anagrams for whole word
    if (AnagramSolverHelper.isEquivalent(searchWordChars, charInventory)) {
        Set<Set<String>> anagramsSet = new HashSet<Set<String>>();
        Set<String> anagramSet = new HashSet<String>();
        anagramSet.add(searchWord);
        anagramsSet.add(anagramSet);
        return anagramsSet;
    }
    // this is where you find the anagrams with multiple words
    if (AnagramSolverHelper.isSubset(searchWordChars, charInventory)) {
        // update charInventory by removing the characters of the search
        // word as it is subset of characters for the anagram search word
        char[] newCharInventory = AnagramSolverHelper.setDifference(charInventory, searchWordChars);
        if (newCharInventory.length >= minWordSize) {
            Set<Set<String>> anagramsSet = new HashSet<Set<String>>();
            for (int index = dictionaryIndex + 1; index < keyList.size(); index++) {
                Set<Set<String>> searchWordAnagramsKeysSet = findAnagrams(index, newCharInventory, keyList);
                if (searchWordAnagramsKeysSet != null) {
                    Set<Set<String>> mergedSets = mergeWordToSets(searchWord, searchWordAnagramsKeysSet);
                    anagramsSet.addAll(mergedSets);
                }
            }
            return anagramsSet.isEmpty() ? null : anagramsSet;
        }
    }
    // no anagrams found for current word
    return null;
}
You can fork the repo from here and play with it. There are many optimizations that I might have missed. But the code works and does find all the anagrams.

And here is my novel solution.
Jon Bentley’s book Programming Pearls contains a problem about finding anagrams of words.
The statement:
Given a dictionary of english words, find all sets of anagrams. For
instance, “pots”, “stop” and “tops” are all anagrams of one another
because each can be formed by permuting the letters of the others.
I thought a bit and it came to me that the solution would be to obtain the signature of the word you’re searching for and compare it with the signature of every word in the dictionary. All anagrams of a word should have the same signature. But how to achieve this? My idea was to use the Fundamental Theorem of Arithmetic:
The fundamental theorem of arithmetic states that
every positive integer (except the number 1) can be represented in
exactly one way apart from rearrangement as a product of one or more
primes
So the idea is to use an array of the first 26 prime numbers. Then for each letter in the word we get the corresponding prime number A = 2, B = 3, C = 5, D = 7 … and then we calculate the product of our input word. Next we do this for each word in the dictionary and if a word matches our input word, then we add it to the resulting list.
The performance is more or less acceptable. For a dictionary of 479828 words, it takes 160 ms to get all anagrams. This is roughly 0.0003 ms / word, or 0.3 microsecond / word. Algorithm’s complexity seems to be O(mn) or ~O(m) where m is the size of the dictionary and n is the length of the input word.
Here’s the code:
package com.vvirlan;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.Scanner;

public class Words {

    private int[] PRIMES = new int[] { 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73,
            79, 83, 89, 97, 101, 103, 107, 109, 113 };

    public static void main(String[] args) {
        Scanner s = new Scanner(System.in);
        String word = "hello";
        System.out.println("Please type a word:");
        if (s.hasNext()) {
            word = s.next();
        }
        Words w = new Words();
        w.start(word);
    }

    private void start(String word) {
        measureTime();
        char[] letters = word.toUpperCase().toCharArray();
        long searchProduct = calculateProduct(letters);
        System.out.println(searchProduct);
        try {
            findByProduct(searchProduct);
        } catch (Exception e) {
            e.printStackTrace();
        }
        measureTime();
        System.out.println(matchingWords);
        System.out.println("Total time: " + time);
    }

    private List<String> matchingWords = new ArrayList<>();

    private void findByProduct(long searchProduct) throws IOException {
        File f = new File("/usr/share/dict/words");
        FileReader fr = new FileReader(f);
        BufferedReader br = new BufferedReader(fr);
        String line = null;
        while ((line = br.readLine()) != null) {
            char[] letters = line.toUpperCase().toCharArray();
            long p = calculateProduct(letters);
            // -1 marks a word containing characters we cannot encode
            if (p == -1) {
                continue;
            }
            if (p == searchProduct) {
                matchingWords.add(line);
            }
        }
        br.close();
    }

    private long calculateProduct(char[] letters) {
        long result = 1L;
        for (char c : letters) {
            // only A-Z have a prime assigned; anything else (apostrophes,
            // accented letters, ...) would index outside the PRIMES table
            if (c < 'A' || c > 'Z') {
                return -1;
            }
            int pos = c - 'A';
            // note: for long words this product can overflow a long
            result *= PRIMES[pos];
        }
        return result;
    }

    private long time = 0L;

    // first call stores the start time, second call stores the elapsed ms
    private void measureTime() {
        long t = new Date().getTime();
        if (time == 0L) {
            time = t;
        } else {
            time = t - time;
        }
    }
}
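For comparison, the same prime-product signature can be sketched in a few lines of Python. This is an illustrative variant, not a port of the class above: the names PRIME_OF, prime_signature and find_anagrams are invented for the example, and the dictionary is passed in as a list rather than read from a file. Python integers have arbitrary precision, so the product cannot overflow the way a 64-bit long can for long words.

```python
from string import ascii_lowercase

# First 26 primes, one per letter: a=2, b=3, c=5, ... z=101.
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
          43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]
PRIME_OF = dict(zip(ascii_lowercase, PRIMES))

def prime_signature(word):
    """Product of the primes of the letters, or None for non a-z input."""
    product = 1
    for ch in word.lower():
        if ch not in PRIME_OF:
            return None
        product *= PRIME_OF[ch]
    return product

def find_anagrams(word, dictionary):
    """Return the dictionary words whose signature equals that of `word`."""
    target = prime_signature(word)
    if target is None:
        return []
    return [w for w in dictionary if prime_signature(w) == target]
```

By unique factorization, two words get the same product exactly when they contain the same letters with the same multiplicities.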

I used the following way of computing anagrams a couple of months ago:
Compute a "code" for each word in your dictionary: Create a lookup-table from letters in the alphabet to prime numbers, e.g. starting with ['a', 2] and ending with ['z', 101]. As a pre-processing step compute the code for each word in your dictionary by looking up the prime number for each letter it consists of in the lookup-table and multiply them together. For later lookup create a multimap of codes to words.
Compute the code of your input word as outlined above.
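Compute inputCode % codeInDictionary for each code in the multimap. If the result is 0, the dictionary word can be built from the letters of the input (if the two codes are equal, it is a full anagram), and you can look up the corresponding word. Dividing the input code by the word's code gives the code of the letters that are left, so this works for anagrams of two or more words as well.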
Hope that was helpful.
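A minimal Python sketch of the divisibility test, under the assumption that a word's letters fit inside the input exactly when the word's code divides the input's code. The names code and sub_words and the sample words are hypothetical, and only lowercase a-z is handled.

```python
from string import ascii_lowercase

# Lookup table from letters to primes: a=2, b=3, ... z=101.
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
          43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]
PRIME_OF = dict(zip(ascii_lowercase, PRIMES))

def code(word):
    """Product of the primes assigned to the letters of `word`."""
    result = 1
    for ch in word:
        result *= PRIME_OF[ch]
    return result

def sub_words(letters, dictionary):
    """Yield (word, code of the leftover letters) for every dictionary
    word that can be built from `letters`."""
    input_code = code(letters)
    for word in dictionary:
        word_code = code(word)
        if input_code % word_code == 0:
            yield word, input_code // word_code

pairs = dict(sub_words("enlist", ["let", "sin", "tin", "silent", "tone"]))
```

A leftover code of 1 means the word used every letter; any other leftover code can be looked up in the multimap of codes to find the remaining word or words.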

The book Programming Pearls by Jon Bentley covers this kind of stuff quite nicely. A must-read.

How I see it:
you'd want to build a table that maps unordered sets of letters to lists of words, i.e. go through the dictionary so you'd wind up with, say
lettermap[set(a,e,d,f)] = { "deaf", "fade" }
then from your starting word, you find the set of letters:
astronomers => (a,e,m,n,o,o,r,r,s,s,t)
then loop through all the partitions of that set ( this might be the most technical part, just generating all the possible partitions), and look up the words for that set of letters.
edit: hmmm, this is pretty much what Jason Cohen posted.
edit: furthermore, the comments on the question mention generating "good" anagrams, like the examples :). after you build your list of all possible anagrams, run them through WordNet and find ones that are semantically close to the original phrase :)

A while ago I wrote a blog post about how to quickly find two-word anagrams. It works really fast: finding all 44 two-word anagrams for a word with a text file of more than 300,000 words (4 megabytes) takes only 0.6 seconds in a Ruby program.
Two Word Anagram Finder Algorithm (in Ruby)
It is possible to make the application faster when it is allowed to preprocess the wordlist into a large hash mapping from words sorted by letters to a list of words using these letters. This preprocessed data can be serialized and used from then on.

Suppose the dictionary is a hash map, since every word is unique, with the key being a binary (or hex) representation of the word. Then, given a word, I can look it up with O(1) complexity.
Now, to generate all the valid anagrams, we need to verify whether each generated anagram is in the dictionary: if it is present, it is a valid one; otherwise we ignore it.
I will assume that a word has at most 100 characters (or more, but there is some limit).
We treat any word as a sequence of indexes, so a word like "hello" can be represented as
"12345".
Now the anagrams of "12345" are "12354", "12435", etc.
The only thing we need to do is store all such combinations of indexes for a given number of characters. This is a one-time task.
Words can then be generated from the combinations by picking the characters at each index. Hence we get the anagrams.
To verify whether the anagrams are valid, just index into the dictionary and validate.
The only thing that needs to be handled is duplicates. That can be done easily by comparing each candidate with the ones that have already been looked up in the dictionary.
The solution emphasizes performance.

Off the top of my head, the solution that makes the most sense would be to pick a letter out of the input string randomly and filter the dictionary based on words that start with that. Then pick another, filter on the second letter, etc. In addition, filter out words that can't be made with the remaining text. Then when you hit the end of a word, insert a space and start it over with the remaining letters. You might also restrict words based on word type (e.g. you wouldn't have two verbs next to each other, you wouldn't have two articles next to each other, etc).

As Jason suggested, prepare a dictionary making hashtable with key being word sorted alphabetically, and value word itself (you may have multiple values per key).
Remove whitespace and sort your query before looking it up.
After this, you'd need to do some sort of a recursive, exhaustive search. Pseudo code is very roughly:
function FindWords(solutionList, wordsSoFar, sortedQuery)
    // base case
    if sortedQuery is empty
        solutionList.Add(wordsSoFar)
        return
    // recursive case
    // InitialStrings("abc") is {"a","ab","abc"}
    foreach initialStr in InitialStrings(sortedQuery)
        // Remaining letters after initialStr
        sortedQueryRec := sortedQuery.Substring(initialStr.Length)
        words := words matching initialStr in the dictionary
        // Note that sometimes words list will be empty
        foreach word in words
            // Append should return a new list, not change wordsSoFar
            wordsSoFarRec := Append(wordsSoFar, word)
            FindWords(solutionList, wordsSoFarRec, sortedQueryRec)
In the end, you need to iterate through the solutionList, and print the words in each sublist with spaces between them. You might need to print all orderings for these cases (e.g. "I am Sam" and "Sam I am" are both solutions).
Of course, I didn't test this, and it's a brute force approach.

Related

Valid Permutations of a String

I was asked this question in a recent Amazon technical interview. It goes as follows:
Given a string, e.g. "where am i", and a dictionary of valid words, you have to list all valid distinct permutations of the string. A valid string consists of words which exist in the dictionary. For example, "we are him" and "whim aree" are valid strings, considering the words (whim, aree) are part of the dictionary. A further condition is that a mere rearrangement of words is not a valid string, i.e. "i am where" is not a valid combination.
The task is to find all possible such strings in the optimum way.
As you said, spaces don't count, so the input can be viewed as just a list of chars. The output is a permutation of words, so an obvious way to do it is to find all valid words and then permute them.
Now the problem becomes dividing a list of chars into subsets that each form a word. You can find some answers here; the following is my version for solving this sub-problem.
If the dictionary is not large, we can iterate over it to
find the min_len/max_len of words, to estimate how many words we may have, i.e. how deep we recurse;
convert each word into a map to accelerate the search;
filter out the words that have an impossible char (i.e. a char our input doesn't have);
if a word is a subset of our input, find the remaining words recursively.
The following is pseudocode:
int maxDepth = input.length / min_len;
void findWord(List<Map<Character, Integer>> filteredDict, Map<Character, Integer> input, List<String> subsets, int level) {
    if (level < maxDepth) {
        for (Map<Character, Integer> word : filteredDict) {
            if (subset(input, word)) {
                subsets.add(word);
                findWord(filteredDict, removeSubset(input, word), subsets, level + 1);
            }
        }
    }
}
And then you can easily permute the words in a recursive function.
Technically speaking, this solution can be O(n**d), where n is the dictionary size and d is the max depth. But if the input is not large and complex, we can still solve it in feasible time.

Given a dictionary and a list of letters find all valid words that can be built with the letters

The brute force way can solve the problem in O(n!), basically calculating all the permutations and checking the results in a dictionary. I am looking for ways to improve the complexity. I can think of building a tree out of the dictionary but still checking all letters permutations is O(n!). Are there better ways to solve this problem?
Letters can have duplicates.
The api for the function looks like this:
List<String> findValidWords(Dict dict, char letters[])
Assume that letters only contains letters from a to z.
Use an integer array to count the number of occurrence of a character in letters.
For each word in the dictionary, check if there is a specific character in the word that appears more than allowed, if not, add this word into result.
List<String> findValidWords(List<String> dict, char letters[]){
    int []avail = new int[26];
    for(char c : letters){
        int index = c - 'a';
        avail[index]++;
    }
    List<String> result = new ArrayList<>();
    for(String word: dict){
        int []count = new int[26];
        boolean ok = true;
        for(char c : word.toCharArray()){
            int index = c - 'a';
            count[index]++;
            if(count[index] > avail[index]){
                ok = false;
                break;
            }
        }
        if(ok){
            result.add(word);
        }
    }
    return result;
}
So we can see that the time complexity is O(m*k), where m is the number of words in the dictionary and k is the maximum number of characters in a word.
You can sort each word in your dictionary so that the letters appear in the same order as they do in the alphabet, and then build a trie out of your sorted words. (where each node contains a list of all words that can be made out of the letters). (linear time in total letter length of dictionary) Then, given a set of query letters, sort the letters the same way and proceed through the trie using depth first search in all possible directions that use a subset of your letters from left to right. Any time you reach a node in the trie that contains words, output those words. Each path you explore can be charged to at least one word in the dictionary, so the worst case complexity to find all nodes that contain words you can make is O(kn) where n is the number of words in the dictionary and k is the maximum number of letters in a word. However for somewhat restricted sets of query letters, the running time should be much faster per query.
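The sorted-letter trie from this answer might look roughly like the following Python sketch. The class and function names (TrieNode, build_trie, find_words) are assumptions made for the example. The search only walks edges whose letter is still available to the right of the current position in the sorted query, which is what keeps it from visiting trie nodes that cannot be built.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # letter -> TrieNode
        self.words = []     # words whose sorted letters end at this node

def build_trie(words):
    """Insert every word under the path spelled by its sorted letters."""
    root = TrieNode()
    for word in words:
        node = root
        for letter in sorted(word):
            node = node.children.setdefault(letter, TrieNode())
        node.words.append(word)
    return root

def find_words(root, letters):
    """Return all words that can be built from a subset of `letters`."""
    letters = sorted(letters)
    found = []

    def dfs(node, start):
        found.extend(node.words)
        previous = None
        for i in range(start, len(letters)):
            letter = letters[i]
            if letter == previous:
                continue  # same letter already tried at this depth
            previous = letter
            child = node.children.get(letter)
            if child is not None:
                dfs(child, i + 1)

    dfs(root, 0)
    return found
```

Because the query is sorted and duplicates are skipped per depth, each trie node is reached at most once per query.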
Here is the algorithm that will find all words that can be formed from a set of letters in O(1). We will represent words with their spectra and store them in a prefix tree (aka trie).
General Description
The spectrum of a word W is an array S of size N, such that S(i) is the number of occurrences (aka frequency) of an A(i) letter in the word W, where A(i) is the i-th letter of a chosen alphabet and N is its size.
For example, in the English alphabet, A(0) is A, A(1) is B, ... , A(25) is Z. A spectrum of the word aha is <2,0,0,0,0,0,0,1,0,...,0>.
We will store the dictionary in a prefix trie, using spectrum as a key. The first token of a key is the frequency of letter A, the second is the frequency of letter B and so on. (From here and below we will use the English alphabet as an example).
Once formed, our dictionary will be a tree with the height 26 and width that varies with each level, depending on a popularity of the letter. Basically, each layer will have a number of subtrees that is equal to the maximum word frequency of this letter in the provided dictionary.
Since our task is not only to decide whether we can build a word from the provided set of characters but also to find these words (a search problem), then we need to attach the words to their spectra (as spectral transformation is not invertible, consider spectra of words read and dear). We will attach a word to the end of each path that represents its spectrum.
To find whether we can build a word from a provided set we will build a spectrum of the set, and find all paths in the prefix trie with the frequencies bounded by the corresponding frequencies of the set's spectrum. (Note that we do not require all letters from the set to be used, so if a word uses fewer letters, we can still build it. Basically, our requirement is that for every letter in the word, its frequency is less than or equal to the frequency of the same letter in the provided set).
The complexity of the search procedure doesn't depend on the length of the dictionary or the length of the provided set. On average, it is equal to 26 times the average frequency of a letter. Given the English alphabet, it is a quite small constant factor. For other alphabets, it might not be the case.
Reference implementation
I will provide a reference implementation of an algorithm in OCaml.
The dictionary data type is recursive:
type t = {
  dict : t Int.Map.t;
  data : string list;
}
(Note: it is not the best representation, probably it is better to represent it as a sum type, e.g., type t = Dict of t Int.Map.t | Data of string list, but I found it easier to implement it with the above representation).
We can generalize the algorithm by a spectrum function, either using a functor, or by just storing the spectrum function in the dictionary, but for the simplicity, we will just hardcode the English alphabet in the ASCII representation,
let spectrum word =
  let index c = Char.(to_int (uppercase c) - to_int 'A') in
  let letters = Char.(to_int 'Z' - to_int 'A' + 1) in
  Array.init letters ~f:(fun i ->
      String.count word ~f:(fun c -> index c = i))
Next, we will define the add_word function of type dict -> string -> dict, that will add a new path to our dictionary, by decomposing a word to its spectrum, and adding each constituent. Each addition will require exactly 26 iterations, not including the spectrum computation. Note, the implementation is purely functional, and doesn't use any imperative features. Every time the function add_word returns a new data structure.
let add_word dict word =
  let count = spectrum word in
  let rec add {dict; data} i =
    if i < Array.length count then {
      data;
      dict = Map.update dict count.(i) ~f:(function
          | None -> add empty (i+1)
          | Some sub -> add sub (i+1))
    } else {empty with data = word :: data} in
  add dict 0
We are using the following definition of the empty value in the add function:
let empty = {dict = Int.Map.empty; data=[]}
Now let's define the is_buildable function of type dict -> string -> bool that will decide whether the given set of characters can be used to build any word in the dictionary. Although we could express it via the search, by checking the size of the found set, we still prefer a specialized implementation, as it is more efficient and easier to understand. The definition of the function closely follows the general description provided above. Basically, for every character in the alphabet, we check whether there is an entry in the dictionary with a frequency that is less than or equal to the frequency in the building set. Once we have checked all letters, we have proved that we can build at least one word with the given set.
let is_buildable dict set =
  let count = spectrum set in
  let rec find {dict} i =
    i >= Array.length count ||
    Sequence.range 0 count.(i) ~stop:`inclusive |>
    Sequence.exists ~f:(fun cnt -> match Map.find dict cnt with
        | None -> false
        | Some dict -> find dict (i+1)) in
  find dict 0
Now, let's actually find the set of all words, that are buildable from the provided set:
let build dict set =
  let count = spectrum set in
  let rec find {dict; data} i =
    if i < Array.length count then
      Sequence.range 0 count.(i) ~stop:`inclusive |>
      Sequence.concat_map ~f:(fun cnt -> match Map.find dict cnt with
          | None -> Sequence.empty
          | Some dict -> find dict (i+1))
    else Sequence.of_list data in
  find dict 0
We will basically follow the structure of the is_buildable function, except that instead of proving that such a frequency exists for each letter, we will collect all the proofs by reaching the end of the path and grabbing the set of words attached to it.
Testing and example
For the sake of completeness, we will test it by creating a small program that reads a dictionary, with each word on a separate line, and interacts with a user by asking for a set and printing the resulting set of words that can be built from it.
module Test = struct
  let run () =
    let dict =
      In_channel.(with_file Sys.argv.(1)
                    ~f:(fold_lines ~init:empty ~f:add_word)) in
    let prompt () =
      printf "Enter characters and hit enter (or Ctrl-D to stop): %!" in
    prompt ();
    In_channel.iter_lines stdin ~f:(fun set ->
        build dict set |> Sequence.iter ~f:print_endline;
        prompt ())
end
Here is an example of interaction that uses the /usr/share/dict/american-english dictionary available on my machine (Ubuntu Trusty).
./scrabble.native /usr/share/dict/american-english
Enter characters and hit enter (or Ctrl-D to stop): read
r
R
e
E
re
Re
Er
d
D
Rd
Dr
Ed
red
Red
a
A
Ra
Ar
era
ear
are
Rae
ad
read
dear
dare
Dare
Enter characters and hit enter (or Ctrl-D to stop):
(Yep, the dictionary contains words like r and d that are probably not true English words. In fact, the dictionary has a word for each letter, so we can basically build a word from each non-empty set of alphabet letters).
The full implementation along with the building instructions can be found on Gist
A better way to do this is to loop through all the words in the dictionary and see if the word can be built with the letters in the array.
"Sign" the letters available by sorting them in order; that's O(m log m), where m is the number of letters.
"Sign" each word in the dictionary by sorting the letters of the word in order; that's O(k log k), where k is the length of the word.
Compare the letter signature to each word signature; that's O(min(m, k) * n), where n is the number of words in the dictionary. Output any word that matches.
Assuming an English word list of approximately a quarter-million words, and no more than about half a dozen letters, that should be nearly instantaneous.
I was recently asked the same question in a BankBazaar interview. I was given the option (the interviewer hinted at it very subtly) to pre-process the dictionary in any way I wanted.
My first thought was to arrange the dictionary in a trie or ternary search tree and build all the words from the given letters. However optimized, that would take n! + (n-1)! + (n-2)! + (n-3)! + ... + n word checks in the worst case (n being the number of letters), which was not acceptable.
The other way would be to check every dictionary word to see whether it can be made from the given letters. However optimized, this would take noOfDictionaryWords (m) * average size of dictionary words (k) in the worst case, which was again not acceptable.
Now I have n! + (n-1)! + (n-2)! + ... + n words to check in the dictionary, and I don't want to check them all. So in which situations do I only have to check a subset of them, and how should they be grouped?
If I only have to check combinations and not permutations, the count drops to 2^n.
So I have to pre-process the dictionary words in such a way that if I pass in a combination, all its anagrams are printed.
A data structure something like this: http://1.bp.blogspot.com/-9Usl9unQJpY/Vg6IIO3gpsI/AAAAAAAAAbM/oTuhRDWelhQ/s1600/hashmapArrayForthElement.png
A hash value made from the letters (irrespective of their positions and permutation) points to a list containing all the words made from those letters, so we only need to check that hash value.
I answered that the hash value could be made by assigning a prime to each letter of the alphabet and multiplying the assigned values of a word's letters. This creates the problem of really big hash values, given that the 26th prime is 101, and of many null values in the map taking up space. We could optimize it a bit: rather than assigning lexicographically, a = 2, b = 3, c = 5, d = 7 ... z = 101, we find the most used letters, such as the vowels, 's', 't', etc., and assign them the small values.
The interviewer accepted it, but was not expecting the answer, so there is definitely another answer, for better or worse but there is.
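A rough Python sketch of the "combinations, not permutations" idea: pre-process the dictionary into a map from a letter signature to its words, then look up each distinct combination of the given letters. As a simplification, this example uses the sorted letters as the key instead of the prime-product hash, and the function names are invented for illustration.

```python
from collections import defaultdict
from itertools import combinations

def build_index(dictionary):
    """Pre-processing: group dictionary words by their sorted letters."""
    index = defaultdict(list)
    for word in dictionary:
        index[''.join(sorted(word))].append(word)
    return index

def find_valid_words(index, letters):
    """Look up every distinct combination of `letters` (at most 2^n keys)."""
    letters = sorted(letters)
    seen = set()
    result = []
    for size in range(1, len(letters) + 1):
        for combo in combinations(letters, size):
            key = ''.join(combo)  # combinations of a sorted input stay sorted
            if key in seen:
                continue          # repeated letters produce repeated combos
            seen.add(key)
            result.extend(index.get(key, []))
    return result
```

This pays off when the number of letters is small relative to the dictionary, since the work per query depends on 2^n rather than on the dictionary size.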
Swift 4
func findValidWords(in dictionary: [String], with letters: [Character]) -> [String] {
    var validWords = [String]()
    for word in dictionary {
        var temp = word
        for char in letters {
            temp = temp.filter({ $0 != char })
            if temp.isEmpty {
                validWords.append(word)
                break
            }
        }
    }
    return validWords
}
print(findValidWords(in: ["ape", "apples", "orange", "elapse", "lap", "soap", "bar", "sole"], with: ["a","p","l","e","s","o"]))
Output => ["ape", "apples", "elapse", "lap", "soap", "sole"]
My English is not good so try to understand.
My approach is using bit/bitwise to increase speed. Still bruteforce, though.
FIRST STEP
We only consider the distinct characters in each word and mark their existence. English has 26 letters, so we need 26 bits. An integer is 32 bits, which is enough.
Now encode each word in the dictionary as an integer.
abcdddffg -> 123444667 -> 123467 (only distinct characters) -> 1111011 (bits) -> 123 (decimal number)
So 2,000,000 words will be converted into 2,000,000 integer numbers.
Now let say you have this set of letters: a,b,c,d,e
abcde -> 12345 -> 1111100 (bits)
Do AND operation and we have:
1111100 (abcde)
&
1111011 (abcdddffg, no e)
=
1111000 (result) => result != abcdddffg => word cannot be created
Other example with a,b,c,d,e,f,g,h:
11111111 (abcdefgh)
&
11110110 (abcdddffg, no e and h)
=
11110110 (result) => result == abcdddffg => word can be created
SECOND STEP
While converting a word to a number, also store the letter counts. If we find a match in the first step, we continue by checking whether there are enough of each letter too.
Depending on the requirement, you might not need this second step.
COMPLEXITY
O(n) to convert word to number and store letters count. Only need to do this once.
O(n) for each search query.
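A minimal Python sketch of the two steps above, under a few assumptions: the alphabet is a-z only, the function names are made up for illustration, and bit 0 stands for 'a' (the reverse of the bit order drawn above, which does not change the AND test).

```python
from collections import Counter

def letter_mask(text):
    """Set bit i when the i-th letter of a-z occurs in text (duplicates ignored)."""
    mask = 0
    for ch in text.lower():
        if 'a' <= ch <= 'z':
            mask |= 1 << (ord(ch) - ord('a'))
    return mask

def build_index(dictionary):
    """Done once: store (word, mask, letter counts) for every dictionary word."""
    return [(word, letter_mask(word), Counter(word.lower())) for word in dictionary]

def find_words(index, letters, check_counts=True):
    """Return the dictionary words that can be built from `letters`."""
    available_mask = letter_mask(letters)
    available = Counter(letters.lower())
    result = []
    for word, mask, counts in index:
        # First step: the AND must leave the word's mask unchanged.
        if (mask & available_mask) != mask:
            continue
        # Second step (optional): each letter must be available often enough.
        if check_counts and any(available[c] < n for c, n in counts.items()):
            continue
        result.append(word)
    return result

index = build_index(["ape", "apples", "orange", "elapse", "lap", "soap", "bar", "sole"])
print(find_words(index, "apleso", check_counts=False))
print(find_words(index, "apleso"))
```

With check_counts=False, words such as "apples" pass even though only one "p" is available; the second step filters those out.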
The following is a more efficient way:
1. Use counting sort to count the letters appearing in a word in the dictionary.
2. Do a counting sort on the collection of letters that you are given.
3. Compare the counts; if they are the same, the word can be made.
4. Do this for all words in the dictionary.
This will be inefficient for multiple such queries, so you can do the following:
1. Make a tuple for each word using counting sort.
2. Put the tuple in a tree or hashmap with count entries.
3. When a query is given, do a counting sort and look up the tuple in the hashmap.
Time complexity:
The above method gives O(1) time complexity for a query and O(N) time complexity for hash table construction, where N is the number of words in the dictionary.
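As a rough sketch of the tuple-in-a-hashmap variant, assuming lowercase a-z and hypothetical helper names (count_key, build_table):

```python
from collections import defaultdict

def count_key(word):
    """Counting sort without the rebuild step: a tuple of 26 letter counts."""
    counts = [0] * 26
    for ch in word.lower():
        if 'a' <= ch <= 'z':
            counts[ord(ch) - ord('a')] += 1
    return tuple(counts)

def build_table(dictionary):
    """Map each count tuple to the list of words that produce it."""
    table = defaultdict(list)
    for word in dictionary:
        table[count_key(word)].append(word)
    return table

table = build_table(["tea", "eat", "ate", "abba", "aabb", "hello"])
# One hash lookup per query, after counting the query's letters.
print(table.get(count_key("tae"), []))
```

Note that this finds only the words that use exactly the given letters; words built from a subset of the letters would need one lookup per sub-multiset.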
(cf. anagram search, e.g. using primes, which looks cleaner for a signature-based approach: collect the signatures for all non-equivalent "substrings of letters")
Given the incentive, I'd (pre)order Dict by (set of characters that make up each word, increasing length) and loop over the subsets from letters checking validity of each word until too long.
Alternatively, finding the set of words from dict out of chars from letters can be considered a multi-dimensional range query: with "eeaspl" specifying letters, valid words have zero to two "e", one or none of a, s, p, l, and no other characters at all - bounds on word length (no longer than letters, lower bound to taste) blend in nicely.
Then again, data structures like k-d-trees do well with few, selective dimensions.
(Would-be comment: you do not mention alphabet cardinality, whether "valid" depends on capitalisation or diacritics, "complexity" includes programmer effort or preprocessing of dict - the latter may be difficult to amortise if dict is immutable.)
Swift 3
func findValidWords(wordsList: [String] , string: String) -> [String]{
let charCountsDictInTextPassed = getCharactersCountIn(string: string)
var wordsArrayResult: [String] = []
for word in wordsList {
var canBeProduced = true
let currentWordCharsCount = getCharactersCountIn(string: word)
for (char, count) in currentWordCharsCount {
if let charCountInTextPassed = charCountsDictInTextPassed[char], charCountInTextPassed >= count {
continue
}else{
canBeProduced = false
break
}
}// end for
if canBeProduced {
wordsArrayResult.append(word)
}//end if
}//end for
return wordsArrayResult
}
// Get the count of each character in the string
func getCharactersCountIn(string: String) -> [String: Int]{
var charDictCount:[String: Int] = [:]
for char in string.characters {
if let count = charDictCount[String(char)] {
charDictCount[String(char)] = count + 1
}else{
charDictCount[String(char)] = 1
}
}//end for
return charDictCount
}
If letters can be repeated, that means that a word can be infinitely long. You would obviously cap this at the length of the longest word in the dictionary, but there are still too many words to check. Like nmore suggested, you'd rather iterate over the dictionary to do this.
List<String> findAllValidWords(Set<String> dict, char[] letters) {
List<String> result = new LinkedList<>();
Set<Character> charSet = new HashSet<>();
for (char letter : letters) {
charSet.add(letter);
}
for (String word : dict) {
if (isPossible(word, charSet)) {
result.add(word);
}
}
return result;
}
boolean isPossible(String word, Set<Character> charSet) {
// A word is possible if all its letters are contained in the given letter set
for (int i = 0; i < word.length(); i++) {
if (!charSet.contains(word.charAt(i))) {
return false;
}
}
return true;
}

clustering words based on their char set

Say there is a word set and I would like to clustering them based on their char bag (multiset). For example
{tea, eat, abba, aabb, hello}
will be clustered into
{{tea, eat}, {abba, aabb}, {hello}}.
abba and aabb are clustered together because they have the same char bag, i.e. two a and two b.
To make it efficient, a naive way I can think of is to convert each word into a char-count series; for example, abba and aabb will both be converted to a2b2, and tea/eat will be converted to a1e1t1. Then I can build a dictionary and group words with the same key.
Two issues here: first I have to sort the chars to build the key; second, the string key looks awkward and performance is not as good as char/int keys.
Is there a more efficient way to solve the problem?
For detecting anagrams you can use a hashing scheme based on the product of prime numbers A->2, B->3, C->5 etc. will give "abba" == "aabb" == 36 (but a different letter to primenumber mapping will be better)
See my answer here.
Since you are going to sort words, I assume all characters ascii values are in the range 0-255. Then you can do a Counting Sort over the words.
The counting sort is going to take the same amount of time as the size of the input word. Reconstruction of the string obtained from counting sort will take O(wordlen). You cannot make this step less than O(wordLen) because you will have to iterate the string at least once ie O(wordLen). There is no predefined order. You cannot make any assumptions about the word without iterating though all the characters in that word. Traditional sorting implementations(ie comparison based ones) will give you O(n * lg n). But non comparison ones give you O(n).
Iterate over all the words of the list and sort them using our counting sort. Keep a map of sorted words to the list of known words they map to. Adding an element to a list takes constant time, so the overall complexity of the algorithm is O(n * avgWordLength).
Here is a sample implementation
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
public class ClusterGen {
static String sortWord(String w) {
int freq[] = new int[256];
for (char c : w.toCharArray()) {
freq[c]++;
}
StringBuilder sortedWord = new StringBuilder();
//It is at most O(n)
for (int i = 0; i < freq.length; ++i) {
for (int j = 0; j < freq[i]; ++j) {
sortedWord.append((char)i);
}
}
return sortedWord.toString();
}
static Map<String, List<String>> cluster(List<String> words) {
Map<String, List<String>> allClusters = new HashMap<String, List<String>>();
for (String word : words) {
String sortedWord = sortWord(word);
List<String> cluster = allClusters.get(sortedWord);
if (cluster == null) {
cluster = new ArrayList<String>();
}
cluster.add(word);
allClusters.put(sortedWord, cluster);
}
return allClusters;
}
public static void main(String[] args) {
System.out.println(cluster(Arrays.asList("tea", "eat", "abba", "aabb", "hello")));
System.out.println(cluster(Arrays.asList("moon", "bat", "meal", "tab", "male")));
}
}
Returns
{aabb=[abba, aabb], ehllo=[hello], aet=[tea, eat]}
{abt=[bat, tab], aelm=[meal, male], mnoo=[moon]}
Using an alphabet of x characters and a maximum word length of y, you can create hashes of (x + y) bits such that every anagram has a unique hash. A value of 1 for a bit means there is another of the current letter, a value of 0 means to move on to the next letter. Here's an example showing how this works:
Let's say we have a 7 letter alphabet(abcdefg) and a maximum word length of 4. Every word hash will be 11 bits. Let's hash the word "fade": 10001010100
The first bit is 1, indicating there is an a present. The second bit indicates that there are no more a's. The third bit indicates that there are no more b's, and so on. Another way to think about this is the number of ones in a row represents the number of that letter, and the total zeroes before that string of ones represents which letter it is.
Here is the hash for "dada": 11000110000
It's worth noting that because there is a one-to-one correspondence between possible hashes and possible anagrams, this is the smallest possible hash guaranteed to give unique hashes for any input, which eliminates the need to check everything in your buckets when you are done hashing.
I'm well aware that using large alphabets and long words will result in a large hash size. This solution is geared towards guaranteeing unique hashes in order to avoid comparing strings. If you can design an algorithm to compute this hash in constant time(given you know the values of x and y) then you'll be able to solve the entire grouping problem in O(n).
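To make the encoding concrete, here is a small Python sketch that builds the bit string for the 7-letter example alphabet. The function name and the zero padding for words shorter than the maximum length are assumptions of this sketch, and it runs in time proportional to the alphabet size times the word length, not in constant time.

```python
def bag_hash(word, alphabet="abcdefg", max_len=4):
    """One '1' per occurrence of a letter, then a '0' to move on to the next letter."""
    bits = []
    for letter in alphabet:
        bits.append("1" * word.count(letter))
        bits.append("0")
    # Pad shorter words with zeros so every hash has len(alphabet) + max_len bits.
    return "".join(bits).ljust(len(alphabet) + max_len, "0")

print(bag_hash("fade"))  # 10001010100
print(bag_hash("dada"))  # 11000110000
```

The bit string can be turned into an integer key with int(bag_hash(word), 2) if an integer hash is preferred.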
I would do this in two steps, first sort all your words according to their length and work on each subset separately(this is to avoid lots of overlaps later.)
The next step is harder and there are many ways to do it. One of the simplest would be to assign every letter a number(a = 1, b = 2, etc. for example) and add up all the values for each word, thereby assigning each word to an integer. Then you can sort the words according to this integer value which drastically cuts the number you have to compare.
Depending on your data set you may still have a lot of overlaps("bad" and "cac" would generate the same integer hash) so you may want to set a threshold where if you have too many words in one bucket you repeat the previous step with another hash(just assigning different numbers to the letters) Unless someone has looked at your code and designed a wordlist to mess you up, this should cut the overlaps to almost none.
Keep in mind that this approach will be efficient when you are expecting small numbers of words to be in the same char bag. If your data is a lot of long words that only go into a couple char bags, the number of comparisons you would do in the final step would be astronomical, and in this case you would be better off using an approach like the one you described - one that has no possible overlaps.
One thing I've done that's similar to this, but allows for collisions, is to sort the letters, then get rid of duplicates. So in your example, you'd have buckets for "aet", "ab", and "ehlo".
Now, as I say, this allows for collisions. So "rod" and "door" both end up in the same bucket, which may not be what you want. However, the collisions will be a small set that is easily and quickly searched.
So once you have the string for a bucket, you'll notice you can convert it into a 32-bit integer (at least for ASCII). Each letter in the string becomes a bit in a 32-bit integer. So "a" is the first bit, "b" is the second bit, etc. All (English) words make a bucket with a 26-bit identifier. You can then do very fast integer compares to find the bucket a new words goes into, or find the bucket an existing word is in.
Count the frequency of characters in each of the strings then build a hash table based on the frequency table. so for an example, for string aczda and aacdz we get 20110000000000000000000001. Using hash table we can partition all these strings in buckets in O(N).
26-bit integer as a hash function
If your alphabet isn't too large, for instance, just lower case English letters, you can define this particular hash function for each word: a 26 bit integer where each bit represents whether that English letter exists in the word. Note that two words with the same char set will have the same hash.
Then just add them to a hash table. It will automatically be clustered by hash collisions.
It will take O(max length of the word) to calculate a hash, and insertion into a hash table is constant time. So the overall complexity is O(max length of a word * number of words)

Finding anagrams for a given word

Two words are anagrams if one of them has exactly the same characters as the other word.
Example : Anagram & Nagaram are anagrams (case-insensitive).
Now there are many questions similar to this . A couple of approaches to find whether two strings are anagrams are :
1) Sort the strings and compare them.
2) Create a frequency map for these strings and check if they are the same or not.
But in this case , we are given with a word (for the sake of simplicity let us assume a single word only and it will have single word anagrams only) and we need to find anagrams for that.
Solution which I have in mind is that , we can generate all permutations for the word and check which of these words exist in the dictionary . But clearly , this is highly inefficient. Yes , the dictionary is available too.
So what alternatives do we have here ?
I also read in a similar thread that something can be done using Tries but the person didn't explain as to what the algorithm was and why did we use a Trie in first place , just an implementation was provided that too in Python or Ruby. So that wasn't really helpful which is why I have created this new thread. If someone wants to share their implementation (other than C,C++ or Java) then kindly explain it too.
Example algorithm:
Open dictionary
Create empty hashmap H
For each word in dictionary:
Create a key that is the word's letters sorted alphabetically (and forced to one case)
Add the word to the list of words accessed by the hash key in H
To check for all anagrams of a given word:
Create a key that is the letters of the word, sorted (and forced to one case)
Look up that key in H
You now have a list of all anagrams
Relatively fast to build, blazingly fast on look-up.
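The algorithm above might look like this in Python; this is only a sketch, with an in-memory word list standing in for the dictionary file and hypothetical function names:

```python
from collections import defaultdict

def sorted_key(word):
    """The word's letters sorted alphabetically and forced to lower case."""
    return "".join(sorted(word.lower()))

def build_anagram_map(words):
    """Hashmap H: sorted-letter key -> list of dictionary words with that key."""
    table = defaultdict(list)
    for word in words:
        table[sorted_key(word)].append(word)
    return table

H = build_anagram_map(["dog", "god", "tool", "loot", "rose", "sore"])
print(H.get(sorted_key("Tolo"), []))  # ['tool', 'loot']
```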
I came up with a new solution I guess. It uses the Fundamental Theorem of Arithmetic. So the idea is to use an array of the first 26 prime numbers. Then for each letter in the input word we get the corresponding prime number A = 2, B = 3, C = 5, D = 7 … and then we calculate the product of our input word. Next we do this for each word in the dictionary and if a word matches our input word, then we add it to the resulting list. All anagrams will have the same signature because
Any integer greater than 1 is either a prime number, or can be written
as a unique product of prime numbers (ignoring the order).
Here's the code. I convert the word to UPPERCASE and 65 is the position of A which corresponds to my first prime number:
private int[] PRIMES = new int[] { 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31,
37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103,
107, 109, 113 };
This is the method:
private long calculateProduct(char[] letters) {
long result = 1L;
for (char c : letters) {
if (c < 65) {
return -1;
}
int pos = c - 65;
result *= PRIMES[pos];
}
return result;
}
We know that if two words don't have the same length, they are not anagrams. So you can partition your dictionary in groups of words of the same length.
Now we focus on only one of these groups and basically all words have exactly the same length in this smaller universe.
If each letter position is a dimension, and the value in that dimension is based on the letter (say the ASCII code). Then you can calculate the length of the word vector.
For example, say 'A'=65, 'B'=66, then length("AB") = sqrt(65*65 + 66*66). Obviously, length("AB") = length("BA").
Clearly, if two word are anagrams, then their vectors have the same length. The next question is, if two word (of same number of letters) vectors have the same length, are they anagrams? Intuitively, I'd say no, since all vectors with that length forms a sphere, there are many. Not sure, since we're in the integer space in this case, how many there are actually.
But at the very least it allows you to partition your dictionary even further. For each word in your dictionary, calculate the vector's distance:
for(each letter c) { distance += c*c }; distance = sqrt(distance);
Then create a map for all words of length n, and key it with the distance and the value is a list of words of length n that yield that particular distance.
You'll create a map for each distance.
Then your lookup becomes the following algorithm:
Use the correct dictionary map based on the length of the word
Compute the length of your word's vector
Lookup the list of words that match that length
Go through the list and pick the anagrams using a naive algorithm, as the list of candidates is now greatly reduced
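A possible Python sketch of this lookup follows. It keys on the squared length rather than the square root, which is an assumption of this sketch: the grouping is the same and it avoids floating point. The function names are hypothetical.

```python
from collections import Counter, defaultdict

def squared_length(word):
    """Sum of squared character codes; identical for all anagrams of a word."""
    return sum(ord(c) ** 2 for c in word.upper())

def build_maps(dictionary):
    """One map per word length, keyed by the squared vector length."""
    maps = defaultdict(lambda: defaultdict(list))
    for word in dictionary:
        maps[len(word)][squared_length(word)].append(word)
    return maps

def find_anagrams(maps, word):
    candidates = maps.get(len(word), {}).get(squared_length(word), [])
    # Naive final check on the (now short) candidate list.
    target = Counter(word.upper())
    return [w for w in candidates if Counter(w.upper()) == target]

maps = build_maps(["salt", "last", "slat", "one", "eon", "plod"])
print(find_anagrams(maps, "Alts"))  # ['salt', 'last', 'slat']
```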
Reduce the words to - say - lower case (clojure.string/lower-case).
Classify them (group-by) by letter frequency-map (frequencies).
Drop the frequency maps,
... leaving the collections of anagrams.
(These) are the corresponding functions in the Lisp dialect Clojure.
The whole function can be expressed so:
(defn anagrams [dict]
(->> dict
(map clojure.string/lower-case)
(group-by frequencies)
vals))
For example,
(anagrams ["Salt" "last" "one" "eon" "plod"])
;(["salt" "last"] ["one" "eon"] ["plod"])
An indexing function that maps each thing to its collection is
(defn index [xss]
(into {} (for [xs xss, x xs] [x xs])))
So that, for example,
((comp index anagrams) ["Salt" "last" "one" "eon" "plod"])
;{"salt" ["salt" "last"], "last" ["salt" "last"], "one" ["one" "eon"], "eon" ["one" "eon"], "plod" ["plod"]}
... where comp is the functional composition operator.
Well Tries would make it easier to check if the word exists.
So if you put the whole dictionary in a trie:
http://en.wikipedia.org/wiki/Trie
then you can afterward take your word and do simple backtracking by taking a char and recursively checking if we can "walk" down the Trie with any combination of the rest of the chars (adding one char at a time). When all chars are used in a recursion branch and there was a valid path in the Trie, then the word exists.
The Trie helps because it's a nice stopping condition:
We can check if the part of a string, e.g. "Anag", is a valid path in the trie; if not, we can break that particular recursion branch. This means we don't have to check every single permutation of the characters.
In pseudo-code
checkAllChars(currentPositionInTrie, currentlyUsedChars, restOfWord)
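if (restOfWord.Length == 0)
{
// All chars are used; accept only if this trie position marks the end of a word
if (Trie.IsWord(currentPositionInTrie)) AddWord(currentlyUsedChars)
}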
else
{
foreach (char in restOfWord)
{
nextPositionInTrie = Trie.Walk(currentPositionInTrie, char)
if (nextPositionInTrie != Positions.NOT_POSSIBLE)
{
checkAllChars(nextPositionInTrie, currentlyUsedChars.With(char), restOfWord.Without(char))
}
}
}
Obviously you need a nice Trie datastructure which allows you to progressively "walk" down the tree and check at each node if there is a path with the given char to any next node...
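Here is one way the backtracking could be sketched in Python, using nested dicts as the trie. The END marker, the function names, and the choice to try each distinct remaining letter once (to avoid duplicate results for repeated letters) are assumptions of this sketch.

```python
from collections import Counter

END = "$"  # marker key meaning "a word ends at this node" (arbitrary choice)

def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word.lower():
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def anagrams(trie, word):
    remaining = Counter(word.lower())
    found = []

    def walk(node, used, left):
        if left == 0:
            # All chars are used; it is a word only if the node is marked as an end.
            if END in node:
                found.append("".join(used))
            return
        for ch in list(remaining):
            # Skip used-up letters and prune branches the trie cannot follow.
            if remaining[ch] == 0 or ch not in node:
                continue
            remaining[ch] -= 1
            used.append(ch)
            walk(node[ch], used, left - 1)
            used.pop()
            remaining[ch] += 1

    walk(trie, [], len(word))
    return sorted(found)

trie = build_trie(["anagram", "nagaram", "margana", "gram"])
print(anagrams(trie, "Anagram"))  # ['anagram', 'margana', 'nagaram']
```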
static void Main(string[] args)
{
string str1 = "Tom Marvolo Riddle";
string str2 = "I am Lord Voldemort";
str2= str2.Replace(" ", string.Empty);
str1 = str1.Replace(" ", string.Empty);
if (str1.Length != str2.Length)
Console.WriteLine("Strings are not anagram");
else
{
str1 = str1.ToUpper();
str2 = str2.ToUpper();
int countStr1 = 0;
int countStr2 = 0;
for (int i = 0; i < str1.Length; i++)
{
countStr1 += str1[i];
countStr2 += str2[i];
}
if(countStr2!=countStr1)
Console.WriteLine("Strings are not anagram");
else Console.WriteLine("Strings are anagram");
}
Console.Read();
}
Generating all permutations is easy, I guess you are worried that checking their existence in the dictionary is the "highly inefficient" part. But that actually depends on what data structure you use for the dictionary: of course, a list of words would be inefficient for your use case. Speaking of Tries, they would probably be an ideal representation, and quite efficient, too.
Another possibility would be to do some pre-processing on your dictionary, e.g. build a hashtable where the keys are the word's letters sorted, and the values are lists of words. You can even serialize this hashtable so you can write it to a file and reload quickly later. Then to look up anagrams, you simply sort your given word and look up the corresponding entry in the hashtable.
That depends on how you store your dictionary. If it is a simple array of words, no algorithm will be faster than linear.
If it is sorted, then here's an approach that may work. I've invented it just now, but I guess it's faster than the linear approach.
1. Denote your dictionary as D and the current prefix as S. Initially S is empty.
2. Create a frequency map for your word. Let's denote it by F.
3. Using binary search, find pointers to the start of each letter in the dictionary. Let's denote this array of pointers by P.
4. For each char c from A to Z, if F[c] == 0, skip it, else
S += c;
F[c] --;
P <- for every character i P[i] = pointer to first word beginning with S+i.
5. Recursively call step 4 till you find a match for your word or till you find that no such match exists.
This is how I would do it, anyway. There should be a more conventional approach, but this is faster than linear.
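A simplified Python sketch of this idea, assuming the dictionary is a sorted list of lowercase words. Instead of maintaining the pointer array P, it runs a fresh binary search (bisect) for every candidate prefix, so it illustrates the pruning rather than the exact bookkeeping described above; the function names are hypothetical.

```python
from bisect import bisect_left
from collections import Counter

def has_prefix(sorted_words, prefix):
    """True if some dictionary word starts with prefix (binary search)."""
    i = bisect_left(sorted_words, prefix)
    return i < len(sorted_words) and sorted_words[i].startswith(prefix)

def anagrams(sorted_words, word):
    freq = Counter(word.lower())  # frequency map F
    found = []

    def extend(prefix):  # prefix plays the role of S
        if len(prefix) == len(word):
            i = bisect_left(sorted_words, prefix)
            if i < len(sorted_words) and sorted_words[i] == prefix:
                found.append(prefix)
            return
        for c in sorted(freq):
            if freq[c] == 0:
                continue
            candidate = prefix + c
            # Prune: no dictionary word begins with this prefix.
            if not has_prefix(sorted_words, candidate):
                continue
            freq[c] -= 1
            extend(candidate)
            freq[c] += 1

    extend("")
    return found

words = sorted(["dog", "god", "good", "tool", "loot", "rose", "sore", "eros"])
print(anagrams(words, "rose"))  # ['eros', 'rose', 'sore']
```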
tried to implement the hashmap solution
import java.util.Arrays;
import java.util.HashMap;

public class Dictionary {
    // Returns the lower-cased letters of the word in sorted order, e.g. "dgo" for "dog"
    static String sortLetters(String word) {
        char[] chars = word.toLowerCase().toCharArray();
        Arrays.sort(chars);
        return new String(chars);
    }

    public static void main(String[] args) {
        String[] dictionary = new String[]{"dog", "god", "tool", "loot", "rose", "sore"};
        HashMap<String, String> h = new HashMap<String, String>();
        for (int i = 0; i < dictionary.length; i++) {
            String temp = sortLetters(dictionary[i]); // sorted word, e.g. dgo for dog
            if (!h.containsKey(temp)) {
                h.put(temp, dictionary[i]);
            } else {
                String s = h.get(temp);
                h.put(temp, s + " , " + dictionary[i]);
            }
        }
        String word = "tolo";
        String sortedWord = sortLetters(word); // lower-cased, so the lookup is case-insensitive
        if (h.containsKey(sortedWord)) {
            System.out.println("anagrams from Dictionary : " + h.get(sortedWord));
        }
    }
}
Compute the frequency count vector for each word in the dictionary, a vector of length of the alphabet list.
generate a random Gaussian vector of the length of the alphabet list
project each dictionary word's count vector in this random direction and store the value (insert such that the array of values is sorted).
Given a new test word, project it in the same random direction used for the dictionary words.
Do a binary search to find the list of words that map to the same value.
Verify if each word obtained as above is indeed a true anagram. If not, remove it from the list.
Return the remaining elements of the list.
PS: The above procedure is a generalization of the prime number procedure which may potentially lead to large numbers (and hence computational precision issues)
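The procedure could be sketched in Python roughly as follows. The fixed seed, the a-z alphabet, and the function names are assumptions of this sketch; exact float equality works here only because identical count vectors are projected with the same operations in the same order.

```python
import random
from bisect import bisect_left
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def count_vector(word):
    counts = Counter(word.lower())
    return [counts[c] for c in ALPHABET]

def project(word, direction):
    """Dot product of the word's count vector with the random direction."""
    return sum(n * d for n, d in zip(count_vector(word), direction))

def build_index(dictionary, seed=0):
    rng = random.Random(seed)
    direction = [rng.gauss(0.0, 1.0) for _ in ALPHABET]
    # Sorted list of (projection value, word) pairs.
    entries = sorted((project(w, direction), w) for w in dictionary)
    return direction, entries

def find_anagrams(index, word):
    direction, entries = index
    value = project(word, direction)
    i = bisect_left(entries, (value, ""))  # binary search for the value
    result = []
    while i < len(entries) and entries[i][0] == value:
        candidate = entries[i][1]
        # Verify: keep only true anagrams.
        if count_vector(candidate) == count_vector(word):
            result.append(candidate)
        i += 1
    return result

index = build_index(["salt", "last", "slat", "one", "eon", "plod"])
print(find_anagrams(index, "Alts"))  # ['last', 'salt', 'slat']
```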
# list of words
words = ["ROOPA","TABU","OOPAR","BUTA","BUAT" , "PAROO","Soudipta",
"Kheyali Park", "Tollygaunge", "AROOP","Love","AOORP",
"Protijayi","Paikpara","dipSouta","Shyambazaar",
"jayiProti", "North Calcutta", "Sovabazaar"]
#Method 1
A = [''.join(sorted(word)) for word in words]
dict ={}
for indexofsamewords,samewords in enumerate(A):
dict.setdefault(samewords, []).append(indexofsamewords)
print(dict)
#{'AOOPR': [0, 2, 5, 9, 11], 'ABTU': [1, 3, 4], 'Sadioptu': [6, 14], ' KPaaehiklry': [7], 'Taeggllnouy': [8], 'Leov': [10], 'Paiijorty': [12, 16], 'Paaaikpr': [13], 'Saaaabhmryz': [15], ' CNaachlortttu': [17], 'Saaaaborvz': [18]}
for index in dict.values():
print( [words[i] for i in index ] )
The Output :
['ROOPA', 'OOPAR', 'PAROO', 'AROOP', 'AOORP']
['TABU', 'BUTA', 'BUAT']
['Soudipta', 'dipSouta']
['Kheyali Park']
['Tollygaunge']
['Love']
['Protijayi', 'jayiProti']
['Paikpara']
['Shyambazaar']
['North Calcutta']
['Sovabazaar']
One solution is -
Map prime numbers to alphabet characters and multiply prime number
For ex -
a -> 2
b -> 3
......
.......
......
z -> 101
So
'ab' -> 6
'ba' -> 6
'bab' -> 18
'abba' -> 36
'baba' -> 36
Get MUL_number for Given word. return all the words from dictionary which have same MUL_number as given word
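A short Python sketch of the MUL_number idea, assuming lowercase a-z and hypothetical function names. Python integers do not overflow, which sidesteps the size problem a fixed-width integer type would have with long words.

```python
from collections import defaultdict

# The first 26 primes, one per letter: a -> 2, b -> 3, ..., z -> 101.
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
          43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]

def mul_number(word):
    """Product of the primes assigned to the word's letters."""
    product = 1
    for ch in word.lower():
        if 'a' <= ch <= 'z':
            product *= PRIMES[ord(ch) - ord('a')]
    return product

def build_index(dictionary):
    index = defaultdict(list)
    for word in dictionary:
        index[mul_number(word)].append(word)
    return index

index = build_index(["abba", "baba", "bab", "ab", "ba"])
print(mul_number("abba"))                   # 36
print(index.get(mul_number("aabb"), []))    # ['abba', 'baba']
```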
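First check if the lengths of the strings are the same.
Then check if the sums of the characters in both strings are the same (i.e. the ASCII code sums).
If either check fails, the words are not anagrams.
If both pass, the words may be anagrams, but this is not conclusive: "ad" and "bc" have the same length and the same sum, so confirm by comparing the letter counts.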

Another Permutation Word Conundrum... With Linq?

I have seen many examples of getting all permutations of a given set of letters. Recursion seems to work well to get all possible combinations of a set of letters (though it doesn't seem to take into account if 2 of the letters are the same).
What I would like to figure out is, could you use linq (or not) to get all possible combinations of letters down to 3 letter combinations.
For example, given the letters: P I G G Y
I want an array of all possible combinations of these letters so I can check against a word list (scrabble?) and eventually get a list of all possible words that you can make using those letters (from 3 letters up to the total, in this case 5 letters).
I would suggest that rather than generating all possible permutations (of each desired length), take slightly different approach that will reduce the overall amount of work that you have to do.
First, find some word lists (you say that you are going to check against a word list).
Here is a good source of word lists:
http://www.poslarchive.com/math/scrabble/lists/index.html
Next, for each word list (e.g. for 3 letter words, 4 letter words, etc), build a dictionary whose key is the letters of the word in alphabetical order, and whose value is the word. For example, given the following word list:
ACT
CAT
ART
RAT
BAT
TAB
Your dictionary would look something like this (conceptually) (You might want to make a dictionary of List):
ABT - BAT, TAB
ACT - ACT, CAT
ART - ART, RAT
You could probably put all words of all lengths in the same dictionary, it is really up to you.
Next, to find candidate words for a given set of N letters, generate all possible combinations of length K for the lengths that you are interested in. For scrabble, that would be all combinations (order is not important, so CAT == ACT, and all permutations are not required) of 2 (7 choose 2), 3 (7 choose 3), 4 (7 choose 4), 5 (7 choose 5), 6 (7 choose 6), and 7 letters (7 choose 7). This can be improved by first ordering the N letters alphabetically and then finding the combinations of length K.
For each combination of length K, check the dictionary to see if there are any words with this key. If so, they are candidates to be played.
So, for CAKE, order the letters:
ACEK
Get the 2, 3, and 4 letter combinations:
AC
AE
AK
CE
CK
EK
ACE
ACK
AEK
CEK
ACEK
Now, use these keys into the dictionary. You will find ACE and CAKE are candidates.
This approach allows you to be much more efficient than generating all permutations and then checking each to see if it is a word. Using the combination approach, you do not have to do separate lookups for groups of letters of the same length with the same letters.
For example, given:
TEA
There are 6 permutations (of length 3), but only 1 combination (of length 3). So, only one lookup is required, using the key AET.
Sorry for not putting in any code, but with these ideas, it should be relatively straightforward to achieve what you want.
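As a rough illustration of the combination-and-lookup idea described above, here is a Python sketch. The function names and the tiny word list are made up for the example; a real word list would be read from a file.

```python
from collections import defaultdict
from itertools import combinations

def word_key(word):
    """Letters of the word in alphabetical order, e.g. CAT -> ACT."""
    return "".join(sorted(word.upper()))

def build_lookup(word_list):
    """Dictionary of lists: key -> all words with that key."""
    lookup = defaultdict(list)
    for word in word_list:
        lookup[word_key(word)].append(word)
    return lookup

def playable_words(lookup, letters, min_len=2):
    ordered = word_key(letters)  # order the letters first
    found = []
    for k in range(min_len, len(ordered) + 1):
        # set() drops repeated keys caused by duplicate letters.
        for combo in sorted(set(combinations(ordered, k))):
            found.extend(lookup.get("".join(combo), []))
    return found

lookup = build_lookup(["ACT", "CAT", "ART", "RAT", "BAT", "TAB", "ACE", "CAKE"])
print(playable_words(lookup, "CAKE"))  # ['ACE', 'CAKE']
```

Because the letters are sorted before the combinations are generated, every combination is already a valid key and needs no further sorting.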
I wrote a program that does a lot of this back when I was first learning C# and .NET. I will try to post some snippets (improved based on what I have learned since then).
This string extension will return a new string that represents the input string's characters reassembled in alphabetical order:
public static string ToWordKey(this string s)
{
return new string(s.ToCharArray().OrderBy(x => x).ToArray());
}
Based on this answer by @Adam Hughes, here is an extension method that will return all combinations (n choose k, not all permutations) for all lengths (1 to string.Length) of the input string:
public static IEnumerable<string> Combinations(this String characters)
{
//Return all combinations of 1, 2, 3, etc length
for (int i = 1; i <= characters.Length; i++)
{
foreach (string s in CombinationsImpl(characters, i))
{
yield return s;
}
}
}
//Return all combinations (n choose k, not permutations) for a given length
private static IEnumerable<string> CombinationsImpl(String characters, int length)
{
for (int i = 0; i < characters.Length; i++)
{
if (length == 1)
{
yield return characters.Substring(i,1);
}
else
{
foreach (string next in CombinationsImpl(characters.Substring(i + 1, characters.Length - (i + 1)), length - 1))
yield return characters[i] + next;
}
}
}
Using the ToWordKey method, you can build a list of your input words (scrabble dictionary), indexed by their "key" (similar to a dictionary, but many words could have the same key).
public class WordEntry
{
public string Key { set; get; }
public string Word { set; get; }
public WordEntry(string w)
{
Word = w;
Key = Word.ToWordKey();
}
}
var wordList = ReadWordsFromFileIntoWordEntryClasses();
Given a list of WordEntry, you can query the list using linq to find all words that can be made from a given set of letters:
string lettersKey = letters.ToWordKey();
var words = from we in wordList where we.Key.Equals(lettersKey) select we.Word;
You could find all words that could be made from any combination (of any length) of a given set of letters like this:
string lettersKey = letters.ToWordKey();
var words = from we in wordList
from key in lettersKey.Combinations()
where we.Key.Equals(key)
select we.Word;
[EDIT]
Here is some more sample code:
Given a list of 2, 3, and 4 letter words from here: http://www.poslarchive.com/math/scrabble/lists/common-234.html
Here is some code that will read those words (I cut and pasted them into a txt file) and construct a list of WordEntry objects:
private IEnumerable<WordEntry> GetWords()
{
using (FileStream fs = new FileStream(@".\Words234.txt", FileMode.Open))
using (StreamReader sr = new StreamReader(fs))
{
var words = sr.ReadToEnd().Split(new char[] { ' ', '\n' }, StringSplitOptions.RemoveEmptyEntries);
var wordLookup = from w in words select new WordEntry(w); // the constructor computes the key via ToWordKey
return wordLookup;
}
}
I have renamed the InAlphabeticalOrder extension method to ToWordKey.
Nothing fancy here, just read the file, split it into words, and create a new WordEntry for each word. Possibly could be more efficient here by reading one line at a time. The list will also get pretty long when you consider 5, 6, and 7 letter words. That might be an issue and it might not. For a toy or a game, it is probably no big deal. If you wanted to get fancy, you might consider building a small database with the words and keys.
Given a set of letters, find all possible words of the same length as the key:
string key = "cat".ToWordKey();
var candidates = from we in wordEntries
where we.Key.Equals(key,StringComparison.OrdinalIgnoreCase)
select we.Word;
Given a set of letters, find all possible words from length 2 to length(letters)
string letters = "seat";
IEnumerable<string> allWords = Enumerable.Empty<string>();
//Get each combination so that the combination is in alphabetical order
foreach (string s in letters.ToWordKey().Combinations())
{
//For this combination, find all entries with the same key
var words = from we in wordEntries
where we.Key.Equals(s.ToWordKey(),StringComparison.OrdinalIgnoreCase)
select we.Word;
allWords = allWords.Concat(words.ToList());
}
This code could probably be better, but it gets the job done. One thing that it does not do is handle duplicate letters. If you have "egg", the two letter combinations will be "eg", "eg", and "gg". That can be fixed easily enough by adding a call to Distinct to the foreach loop:
//Get each combination so that the combination is in alphabetical order
//Don't be fooled by words with duplicate letters...
foreach (string s in letters.ToWordKey().Combinations().Distinct())
{
//For this combination, find all entries with the same key
var words = from we in wordEntries
where we.Key.Equals(s.ToWordKey(),StringComparison.OrdinalIgnoreCase)
select we.Word;
//I forced the evaluation here because without ToList I was only capturing the LAST
//(longest) combinations of letters.
allWords = allWords.Concat(words.ToList());
}
Is that the most efficient way to do it? Maybe, maybe not. Somebody has to do the work, why not LINQ?
I think that with this approach you probably don't need a Dictionary of Lists (Dictionary<string,List<string>>).
With this code and with a suitable set of words, you should be able to take any combination of letters and find all words that can be made from them. You can control the words by finding all words of a particular length, or all words of any length.
This should get you on your way.
[More Clarification]
In terms of your original question, you take as input "piggy" and you want to find all possible words that can be made from these letters. Using the Combinations extension method on "piggy", you will come up with a list like this:
p
i
g
g
y
pi
pg
pg
py
ig
ig
iy
gg
gy
gy
pig
pig
piy
etc. Note that there are repetitions. That is ok, the last bit of sample code that I posted showed how to find all unique Combinations by applying the Distinct operator.
So, we can get a list of all combinations of letters from a given set of letters. My algorithm depends on the list of WordEntry objects being searchable based on the Key property. The Key property is simply the letters of the word rearranged into alphabetical order. So, if you read a word file containing words like this:
ACT
CAT
DOG
GOD
FAST
PIGGY
The list of WordEntry objects will look like this:
Word   Key
ACT    ACT
CAT    ACT
DOG    DGO
GOD    DGO
FAST   AFST
PIGGY  GGIPY
So, it's easy enough to build up the list of words and keys that we want to test against (or dictionary of valid scrabble words).
For example, (assume the few words above form your entire dictionary), if you had the letters 'o' 'g' 'd' on your scrabble tray, you could form the words DOG and GOD, because both have the key DGO.
Given a set of letters, if we want to find all possible words that can be made from those letters, we must be able to generate all possible combinations of letters. We can test each of these against the "dictionary" (quotes because it is not REALLY a Dictionary in the .NET sense, it is a list (or sequence) of WordEntry objects). To make sure that the keys (from the sequence of letters that we have "drawn" in scrabble) are compatible with the Key field in the WordEntry object, we must first order the letters.
Say we have PIGGY on our scrabble tray. To use the algorithm that I suggested, we want to get all possible "Key" values from PIGGY. In our list of WordEntry objects, we created the Key field by ordering the Word's letters in alphabetic order. We must do the same with the letters on our tray.
So, PIGGY becomes GGIPY. (That is what ToWordKey does.) Now, given the letters from our tray in alphabetical order, we can use Combinations to generate all possible combinations (NOT permutations). Each combination we can look up in our list, based on Key. If a combination from GGIPY matches a Key value, then the corresponding Word (of the WordEntry class) can be constructed from our letters.
A better example than PIGGY:
SEAT
First use ToWordKey:
AEST
Now, make all Combinations of all lengths:
A
E
S
T
AE
AS
AT
ES
ET
ST
AES
AET
AST
EST
AEST
When we look in our list of WordEntry objects (made from reading in the list of 2, 3, 4 letter words), we will probably find that the following combinations are found:
AT
AS
AET
AST
EST
AEST
These Key values correspond to the following words:
Key   Word
AT    AT
AS    AS
AET   ATE
AET   EAT
AET   TEA
AST   SAT
EST   SET
AEST  EATS
AEST  SEAT
AEST  TEAS
The final code example above will take the letters ('s' 'e' 'a' 't'), convert them to Key format (ToWordKey), generate the combinations (Combinations), keep only the unique possible key values (Distinct; not an issue here since there are no repeated letters), and then query the list of all WordEntry objects for those WordEntry objects whose Key is the same as one of the combinations.
Essentially, what we have done is constructed a table with columns Word and Key (based on reading the file of words and computing the Key for each) and then we do a query joining Key with a sequence of Key values that we generated (from the letters on our tray).
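The join described here can be sketched in Python by indexing the word list by key once and then probing the index with each generated key. Note that this swaps the linear LINQ scan for a dictionary of lists, which the answer says is not strictly needed. The word list and all function names are illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations

def to_word_key(word):
    """Sort the letters of a word, ignoring case."""
    return "".join(sorted(word.upper()))

def build_index(words):
    """Map each key to the list of words that share it."""
    index = defaultdict(list)
    for word in words:
        index[to_word_key(word)].append(word.upper())
    return index

def words_from_letters(letters, index, min_len=2):
    """Look up every distinct sub-key of `letters` in the index."""
    key = to_word_key(letters)
    sub_keys = {"".join(c)
                for r in range(min_len, len(key) + 1)
                for c in combinations(key, r)}
    return {k: index[k] for k in sorted(sub_keys) if k in index}

index = build_index(["at", "as", "ate", "eat", "tea", "sat", "set",
                     "eats", "seat", "teas", "dog"])
for key, words in words_from_letters("seat", index).items():
    print(key, words)
```

Each sub-key costs one dictionary probe, so the work depends on the number of distinct combinations rather than on the size of the word list.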
Try using my code in steps.
First, use the Combinations extension method:
var combinations = "piggy".Combinations();
Print the result (p i g g y ... pi pg pg ... pig pig piy ... pigg pigy iggy ... etc)
Next, get all combinations after applying the ToWordKey extension method:
//
// "piggy".ToWordKey() yields "ggipy"
//
var combinations = "piggy".ToWordKey().Combinations();
Print the result (g g i p y gg gi gp gy gi gp gy ip iy py ... etc)
Eliminate duplicates with the Distinct() method:
var combinations = "piggy".ToWordKey().Combinations().Distinct();
Print the result (g i p y gg gi gp gy ip iy py ggi ggp ggy ... etc)
Use other sets of letters like "ate" and "seat".
Notice that you get significantly fewer candidates than if you use a permutation algorithm.
Now, imagine that the combinations that we just made are the key values that we will use to look in our list of WordEntry objects, comparing each combination to the Key of a WordEntry.
Use the GetWords function above and the link to the 2, 3, 4 letter words to build the list of WordEntry objects. Better yet, make a very stripped down word list with only a few words and print it out (or look at it in the debugger). See what it looks like. Look at each Word and each Key. Now, imagine if you wanted to find ALL words that you could make with "AET". It is easier to imagine using all letters, so start there. There are 6 permutations, but only 1 combination! That's right, you only have to make one search of the word list to find all 3 letter words that can be made with those letters! If you had 4 letters there would be 24 permutations, but again, only 1 combination.
That is the essence of the algorithm. The ToWordKey() function is essentially a hash function. All strings made of the same letters, each occurring the same number of times, will hash to the same value. So, build a list of Words and their hashes (Key - ToWordKey) and then, given a set of letters to use to make words, hash the letters (ToWordKey) and find all entries in the list with the same hash value. To extend to finding all words of any length (given a set of letters), you just have to hash the input (send the whole string through ToWordKey), then find all Combinations of ANY length. Since the combinations are being generated from the hashed set of letters AND since the Combinations extension method maintains the original ordering of the letters in each combination, each combination retains the property of having been hashed! That's pretty cool!
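Both claims can be checked in a few lines. This is a Python sketch, with `to_word_key` and `sub_keys` as hypothetical stand-ins for ToWordKey and Combinations().Distinct(). It relies on `itertools.combinations` preserving the order of its input.

```python
from itertools import combinations, permutations

def to_word_key(word):
    """Sort the letters so that all anagrams share one key."""
    return "".join(sorted(word.lower()))

def sub_keys(letters):
    """Distinct combinations of every length, taken from the sorted key."""
    key = to_word_key(letters)
    return {"".join(c)
            for r in range(1, len(key) + 1)
            for c in combinations(key, r)}

# All 24 orderings of four distinct letters collapse to a single key.
orderings = {"".join(p) for p in permutations("seat")}
print(len(orderings), {to_word_key(o) for o in orderings})  # 24 {'aest'}

# Each sub-key is already in key form, so it can be looked up as-is.
print(all(k == to_word_key(k) for k in sub_keys("piggy")))  # True
```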
Hope this helps.
This method seems to work. It's using both Linq and procedural code.
IEnumerable<string> GetWords(string letters, int minLength, int maxLength)
{
    if (maxLength > letters.Length)
        maxLength = letters.Length;
    // Associate an id with each letter to handle duplicate letters
    var uniqueLetters = letters.Select((c, i) => new { Letter = c, Index = i });
    // Init with 1 zero-length word
    var words = new [] { uniqueLetters.Take(0) };
    for (int i = 1; i <= maxLength; i++)
    {
        // Append one unused letter to each "word" already generated
        words = (from w in words
                 from lt in uniqueLetters
                 where !w.Contains(lt)
                 select w.Concat(new[] { lt })).ToArray();
        if (i >= minLength)
        {
            foreach (var word in words)
            {
                // Rebuild the actual string from the sequence of unique letters
                yield return String.Join(
                    string.Empty,
                    word.Select(lt => lt.Letter));
            }
        }
    }
}
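For comparison, a rough Python sketch of the same approach. It assumes `itertools.permutations` can replace the hand-rolled loop, since it also tells letters apart by position rather than by value. The output order may differ from the C# version, and `get_words` is just an illustrative name.

```python
from itertools import permutations

def get_words(letters, min_length, max_length):
    """Yield every arrangement of min_length..max_length letters."""
    max_length = min(max_length, len(letters))
    for length in range(max(min_length, 1), max_length + 1):
        # permutations() distinguishes items by position, not by value,
        # so a repeated letter is used at most as often as it appears.
        for arrangement in permutations(letters, length):
            yield "".join(arrangement)

print(list(get_words("egg", 2, 2)))  # ['eg', 'eg', 'ge', 'gg', 'ge', 'gg']
```

As the output shows, repeated letters produce repeated strings, so the results still need de-duplicating and checking against a word list.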
The problem with searching for all permutations of a word is the amount of work spent calculating absolute gibberish. Generating all permutations is O(n!), and most of that work is wasted. This is why I recommend wageoghe's answer.
Here is a recursive linq function that returns all permutations:
public static IEnumerable<string> AllPermutations(this IEnumerable<char> s) {
    return s.SelectMany(x => {
        var index = Array.IndexOf(s.ToArray(), x);
        return s.Where((y, i) => i != index).AllPermutations()
                .Select(y => new string(new [] {x}.Concat(y).ToArray()))
                .Union(new [] {new string(new [] {x})});
    }).Distinct();
}
You can find the words you want like so:
"piggy".AllPermutations().Where(x => x.Length > 2)
However:
WARNING: I am not fond of this very inefficient answer
Now LINQ's biggest benefit (to me) is how readable it is. Having said that, I do not think the intent of the above code is clear (and I wrote it!). So the biggest advantage of LINQ is missing here, and the code is also less efficient than a non-LINQ solution. I usually forgive LINQ's lack of execution efficiency because of what it saves in coding time, readability, and ease of maintenance, but I just don't think a LINQ solution is the best fit here... a square peg, round hole sort of thing, if you will.
Also, there's the matter of the complexity I mentioned above. Sure, it can find the 153 permutations of three or more letters of 'piggy' in 0.2 seconds, but give it a word like 'bookkeeper' and you will be waiting a solid 1 minute 39 seconds for it to find all 435,574 permutations of three or more letters. So why did I post such a terrible function? To make the point that wageoghe has the right approach. Generating all permutations just isn't an efficient enough approach to this problem.
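The gap between the two approaches can be estimated without enumerating anything. The Python sketch below (the function name is made up for illustration) counts, for a multiset of letters, how many distinct arrangements and how many distinct sorted keys exist at each length. It needs Python 3.8+ for `math.comb`.

```python
from collections import Counter
from math import comb

def count_by_length(letters, ordered):
    """ways[n] = number of distinct length-n selections from `letters`.

    ordered=True counts arrangements (permutations);
    ordered=False counts sorted keys (combinations).
    """
    total = len(letters)
    ways = [1] + [0] * total
    for c in Counter(letters).values():
        new = [0] * (total + 1)
        for n in range(total + 1):
            for k in range(min(c, n) + 1):
                # Use k copies of this letter; when order matters,
                # choose which k of the n positions they occupy.
                weight = comb(n, k) if ordered else 1
                new[n] += ways[n - k] * weight
        ways = new
    return ways

for word in ("piggy", "bookkeeper"):
    perms = sum(count_by_length(word, ordered=True)[3:])
    keys = sum(count_by_length(word, ordered=False)[3:])
    print(word, perms, keys)
```

For 'piggy' this reports 153 arrangements against 12 keys, and for 'bookkeeper' 435,574 arrangements against 263 keys, which is the difference between the permutation approach and the key-lookup approach.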
