clustering words based on their char set - algorithm

Say there is a word set and I would like to clustering them based on their char bag (multiset). For example
{tea, eat, abba, aabb, hello}
will be clustered into
{{tea, eat}, {abba, aabb}, {hello}}.
abba and aabb are clustered together because they have the same char bag, i.e. two a and two b.
To make it efficient, a naive way I can think of is to covert each word into a char-cnt series, for exmaple, abba and aabb will be both converted to a2b2, tea/eat will be converted to a1e1t1. So that I can build a dictionary and group words with same key.
Two issues here: first I have to sort the chars to build the key; second, the string key looks awkward and performance is not as good as char/int keys.
Is there a more efficient way to solve the problem?

For detecting anagrams you can use a hashing scheme based on the product of prime numbers A->2, B->3, C->5 etc. will give "abba" == "aabb" == 36 (but a different letter to primenumber mapping will be better)
See my answer here.

Since you are going to sort words, I assume all characters ascii values are in the range 0-255. Then you can do a Counting Sort over the words.
The counting sort is going to take the same amount of time as the size of the input word. Reconstruction of the string obtained from counting sort will take O(wordlen). You cannot make this step less than O(wordLen) because you will have to iterate the string at least once ie O(wordLen). There is no predefined order. You cannot make any assumptions about the word without iterating though all the characters in that word. Traditional sorting implementations(ie comparison based ones) will give you O(n * lg n). But non comparison ones give you O(n).
Iterate over all the words of the list and sort them using our counting sort. Keep a map of
sorted words to the list of known words they map. Addition of elements to a list takes constant time. So overall the complexity of the algorithm is O(n * avgWordLength).
Here is a sample implementation
import java.util.ArrayList;
public class ClusterGen {
static String sortWord(String w) {
int freq[] = new int[256];
for (char c : w.toCharArray()) {
freq[c]++;
}
StringBuilder sortedWord = new StringBuilder();
//It is at most O(n)
for (int i = 0; i < freq.length; ++i) {
for (int j = 0; j < freq[i]; ++j) {
sortedWord.append((char)i);
}
}
return sortedWord.toString();
}
static Map<String, List<String>> cluster(List<String> words) {
Map<String, List<String>> allClusters = new HashMap<String, List<String>>();
for (String word : words) {
String sortedWord = sortWord(word);
List<String> cluster = allClusters.get(sortedWord);
if (cluster == null) {
cluster = new ArrayList<String>();
}
cluster.add(word);
allClusters.put(sortedWord, cluster);
}
return allClusters;
}
public static void main(String[] args) {
System.out.println(cluster(Arrays.asList("tea", "eat", "abba", "aabb", "hello")));
System.out.println(cluster(Arrays.asList("moon", "bat", "meal", "tab", "male")));
}
}
Returns
{aabb=[abba, aabb], ehllo=[hello], aet=[tea, eat]}
{abt=[bat, tab], aelm=[meal, male], mnoo=[moon]}

Using an alphabet of x characters and a maximum word length of y, you can create hashes of (x + y) bits such that every anagram has a unique hash. A value of 1 for a bit means there is another of the current letter, a value of 0 means to move on to the next letter. Here's an example showing how this works:
Let's say we have a 7 letter alphabet(abcdefg) and a maximum word length of 4. Every word hash will be 11 bits. Let's hash the word "fade": 10001010100
The first bit is 1, indicating there is an a present. The second bit indicates that there are no more a's. The third bit indicates that there are no more b's, and so on. Another way to think about this is the number of ones in a row represents the number of that letter, and the total zeroes before that string of ones represents which letter it is.
Here is the hash for "dada": 11000110000
It's worth noting that because there is a one-to-one correspondence between possible hashes and possible anagrams, this is the smallest possible hash guaranteed to give unique hashes for any input, which eliminates the need to check everything in your buckets when you are done hashing.
I'm well aware that using large alphabets and long words will result in a large hash size. This solution is geared towards guaranteeing unique hashes in order to avoid comparing strings. If you can design an algorithm to compute this hash in constant time(given you know the values of x and y) then you'll be able to solve the entire grouping problem in O(n).

I would do this in two steps, first sort all your words according to their length and work on each subset separately(this is to avoid lots of overlaps later.)
The next step is harder and there are many ways to do it. One of the simplest would be to assign every letter a number(a = 1, b = 2, etc. for example) and add up all the values for each word, thereby assigning each word to an integer. Then you can sort the words according to this integer value which drastically cuts the number you have to compare.
Depending on your data set you may still have a lot of overlaps("bad" and "cac" would generate the same integer hash) so you may want to set a threshold where if you have too many words in one bucket you repeat the previous step with another hash(just assigning different numbers to the letters) Unless someone has looked at your code and designed a wordlist to mess you up, this should cut the overlaps to almost none.
Keep in mind that this approach will be efficient when you are expecting small numbers of words to be in the same char bag. If your data is a lot of long words that only go into a couple char bags, the number of comparisons you would do in the final step would be astronomical, and in this case you would be better off using an approach like the one you described - one that has no possible overlaps.

One thing I've done that's similar to this, but allows for collisions, is to sort the letters, then get rid of duplicates. So in your example, you'd have buckets for "aet", "ab", and "ehlo".
Now, as I say, this allows for collisions. So "rod" and "door" both end up in the same bucket, which may not be what you want. However, the collisions will be a small set that is easily and quickly searched.
So once you have the string for a bucket, you'll notice you can convert it into a 32-bit integer (at least for ASCII). Each letter in the string becomes a bit in a 32-bit integer. So "a" is the first bit, "b" is the second bit, etc. All (English) words make a bucket with a 26-bit identifier. You can then do very fast integer compares to find the bucket a new words goes into, or find the bucket an existing word is in.

Count the frequency of characters in each of the strings then build a hash table based on the frequency table. so for an example, for string aczda and aacdz we get 20110000000000000000000001. Using hash table we can partition all these strings in buckets in O(N).

26-bit integer as a hash function
If your alphabet isn't too large, for instance, just lower case English letters, you can define this particular hash function for each word: a 26 bit integer where each bit represents whether that English letter exists in the word. Note that two words with the same char set will have the same hash.
Then just add them to a hash table. It will automatically be clustered by hash collisions.
It will take O(max length of the word) to calculate a hash, and insertion into a hash table is constant time. So the overall complexity is O(max length of a word * number of words)

Related

How to instruct longest palindrome from a list of numbers?

I am trying to solve a question which says that we need to write a function in which given a list of numbers, we need to find the longest palindrome that we can from given only the numbers in the list.
For eg:
If the given list is : [3,47,6,6,5,6,15,22,1,6,15]
The longest palindrome that we can return is one of length 9, such as [6,15,6,3,47,3,6,15,6].
Additionally, we have the following constraints:
One can only use an array queue, array stack, and a chaining hashmap, and the list we are supposed to return, and the function must run in linear time. And we can use only constant additional space.
My approach was the following:
Since a palindrome can be formed if have an even number of certain characters, we can iterate over all the elements in the list, and store in a chaining hash map, the number of times each number appears in the list. This should take O(N) time since each lookup in the chaining hash map takes constant time, and iterating over the list takes linear time.
Then we can iterate over all the numbers in the chaining hash map, to see which numbers appear an even number of times, and accordingly, just make a palindrome. In the worst case, this will take a O(n) linear time.
Now there are two things I am wondering:
How should I make the actual palindrome? Like how do I use the data structures that I am being allowed to use in order to make a palindrome? I am thinking that since the queue is a LIFO data structure, for each number that occurs an even number of times, we add it once to the queue and once to the stack, and so on and so forth. And finally, we can just dequeue everything from the queue, and pop once from the stack, and then add it to the list!
It seems that with my approach, it is taking me two linear runs to solve the question. I am wondering if there is a faster way to do this.
Any and all help will be appreciated. Thanks!
It is not possible to get a better algorithm than one that is O(n), as every number in the input has to be inspected, as it might provide a possibility for a longer palindrome. If indeed the output must be a longest palindrome itself (and not only its length), then producing that output itself represents O(n).
You have also omitted one additional thing you have to do in your algorithm: there can be one value in the final palindrome that occurs only once (in the centre). So whenever you encounter a value that occurs an odd number of times, you may reserve one occurrence of that value for putting in the middle of an odd-length palindrome. The even remainder of the occurrences can be used as usual.
As to your questions:
How should I make the actual palindrome?
There are many ways to do it. But don't forget that if you have an even number of occurrences, you should use all those occurrences, not just two. So add half of them to the queue and half of them to the stack. When the frequency is odd, then still do the same (rounded down), and log the number also as a potential centre value.
When you have done this for all values, then dump the queue and stack together in the result list as you suggested, but don't forget to put the centre value in between the two, if you identified such a centre value (i.e. when not all occurrences were even).
It seems that with my approach, it is taking me two linear runs to solve the question.
You cannot do this better than with a linear time complexity. You can save a bit of time if you use the stack also for the result, and just dump the queue unto the stack (after potentially pushing the centre value).
I've got a solution when its palindrome only for the number and not the digit.
for the input: [51,15]
we will return [15] || [51] and not [51,15] =>(5,1,1,5);
feature more your example as a problem 3 doesn't appear twice(and appears in the answer)
or maybe I didn't understand the question.
public static int[] polidrom(int [] numbers){
HashMap<Integer/*the numbere*/,Integer/*how many time appeared*/> hash = new HashMap<>();
boolean middleFree= false;
int middleNumber = 0;
int space = 0;
Stack<Integer> stack = new Stack<>();
for (Integer num:numbers) {//count how mant times each digit appears
if(hash.containsKey(num)){hash.replace(num,1+hash.get(num));}
else{hash.put(num,1);}
}
for (Integer num:hash.keySet()) {//how many times i can use pairs
int original =hash.get(num);
int insert = (int)Math.floor(hash.get(num)/2);
if(hash.get(num) % 2 !=0){middleNumber = num;middleFree = true;}//as no copy
hash.replace(num,insert);
if(insert != 0){
for(int i =0; i < original;i++){
stack.push(num);
}
}
}
space = stack.size();
if(space == numbers.length){ space--;};//all the numbers are been used
int [] answer = new int[space+1];//the middle dont have to have an pair
int startPointer =0 , endPointer= space;
while (!stack.isEmpty()){
int num = stack.pop();
answer[startPointer] = num;
answer[endPointer] = num;
startPointer++;
endPointer--;
}
if (middleFree){answer[answer.length/2] = middleNumber;}
return answer;
}
space O(n) => {stack , hashMap , answer Array};
complexity: O(n)
You can skip the part where I used the stack and build the answer array in the same loop.
and I can't think of a way where you will not iterate at least twice;
Hope I've helped

Valid Permutations of a String

This question was asked to me in a recent amazon technical interview. It goes as follows:-
Given a string ex: "where am i" and a dictionary of valid words, you have to list all valid distinct permutations of the string. A valid string comprises of words which exists in the dictionary. For ex: "we are him","whim aree" are valid strings considering the words(whim, aree) are part of the dictionary. Also the condition is that a mere rearrangement of words is not a valid string, i.e "i am where" is not a valid combination.
The task is to find all possible such strings in the optimum way.
As you have said, space doesn't count, so input can be just viewed as a list of chars. The output is the permutation of words, so an obvious way to do it is find all valid words then permutate them.
Now problem becomes to divide a list of chars into subsets which each forms a word, which you can find some answers here and following is my version to solve this sub-problem.
If the dictionary is not large, we can iterate dictionary to
find min_len/max_len of words, to estimate how many words we may have, i.e. how deep we recur
convert word into map to accelerate search;
filter the words which have impossible char (i.e. the char our input doesn't have) out;
if this word is subset of our input, we can find word recursively.
The following is pseudocode:
int maxDepth = input.length / min_len;
void findWord(List<Map<Character, Integer>> filteredDict, Map<Character, Integer> input, List<String> subsets, int level) {
if (level < maxDepth) {
for (Map<Character, Integer> word : filteredDict) {
if (subset(input, word)) {
subsets.add(word);
findWord(filteredDict, removeSubset(input, word), subsets, level + 1);
}
}
}
}
And then you can permutate words in a recursive functions easily.
Technically speaking, this solution can be O(n**d) -- where n is dictionary size and d is max depth. But if the input is not large and complex, we can still solve it in feasible time.

How to create a unique hash that will match any strings permutations

Given a string abcd how can I create a unique hashing method that will hash those 4 characters to match bcad or any other permutation of the letters abcd?
Currently I have this code
long hashString(string a) {
long hashed = 0;
for(int i = 0; i < a.length(); i++) {
hashed += a[i] * 7; // Timed by a prime to make the hash more unique?
}
return hashed;
}
Now this will not work becasue ad will hash with bc.
I know you can make it more unique by multiplying the position of the letter by the letter itself hashed += a[i] * i but then the string will not hash to its permutations.
Is it possible to create a hash that achieves this?
Edit
Some have suggested sorting the strings before you hash them. Which is a valid answer but the sorting would take O(nlog) time and I am looking for a hash function that runs in O(n) time.
I am looking to do this in O(1) memory.
Create an array of 26 integers, corresponding to letters a-z. Initialize it to 0. Scan the string from beginning to end, and increment the array element corresponding to the current letter. Note that up to this point the algorithm has O(n) time complexity and O(1) space complexity (since the array size is a constant).
Finally, hash the contents of the array using your favorite hash function.
The basic thing you can do is sort the strings before applying the hash function. So, to compute the hash of "adbc" or "dcba" you instead compute the hash of "abcd".
If you want to make sure that there are no collisions in your hash function, then the only way is to have the hash result be a string. There are many more strings than there are 32-bit (or 64-bit) integers so collisions are innevitable (collisions are unlikely with a good hash function though).
Easiest way to understand: sort the letters in the string, and then hash the resulting string.
Some variations on your original idea also work, like:
long hashString(string a) {
long hashed = 0;
for(int i = 0; i < a.length(); i++) {
long t = a[i] * 16777619;
hashed += t^(t>>8);
}
return hashed;
}
I suppose you need a hash such that two anagrams will hash to the same value. I'd suggest you sort them first and use any of the common hash function such as md5. I write the following code using Scala:
import java.security.MessageDigest
def hash(s: String) = {
MessageDigest.getInstance("MD5").digest(s.sorted.getBytes)
}
Note in scala:
scala> "hello".sorted
res0: String = ehllo
scala> "cinema".sorted
res1: String = aceimn
Synopsis: store a histogram of the letters in the hash value.
Step 1: compute a histogram of the letters (since a histogram uniquely identifies the letters in the string without regard to the order of the letters).
int histogram[26];
for ( int i = 0; i < a.length(); i++ )
histogram[a[i] - 'a']++;
Step 2: pack the histogram into the hash value. You have several options here. Which option to choose depends on what sort of limitations you can put on the strings.
If you knew that each letter would appear no more than 3 times, then it takes 2 bits to represent the count, so you could create a 52-bit hash that's guaranteed to be unique.
If you're willing to use a 128-bit hash, then you've got 5 bits for 24 letters, and 4 bits for 2 letters (e.g. q and z). The 128-bit hash allows each letter to appear 31 times (15 times for q and z).
But if you want a fixed sized hash, say 16-bit, then you need to pack the histogram into those 16 bits in a way that reduces collisions. The easiest way to do that is to create a 26 byte message (one byte for each entry in the histogram, allowing each letter to appear up to 255 times). Then take the 16-bit CRC of the message, using your favorite CRC generator.

If a word is made up of two valid words

Given a dictionary find out if given word can be made by two words in dictionary. For eg. given "newspaper" you have to find if it can be made by two words. (news and paper in this case). Only thing i can think of is starting from beginning and checking if current string is a word. In this case checking n, ne, new, news..... check for the remaining part if current string is a valid word.
Also how do you generalize it for k(means if a word is made up of k words) ? Any thoughts?
Starting your split at the center may yield results faster. For example, for newspaper, you would first try splitting at 'news paper' or 'newsp aper'. As you can see, for this example, you would find your result on the first or second try. If you do not find a result, just search outwards. See the example for 'crossbow' below:
cros sbow
cro ssbow
cross bow
For the case with two words, the problem can be solved by just considering all possible ways of splitting the word into two, then checking each half to see if it's a valid word. If the input string has length n, then there are only O(n) different ways of splitting the string. If you store the strings in a structure supporting fast lookup (say, a trie, or hash table).
The more interesting case is when you have k > 2 words to split the word into. For this, we can use a really elegant recursive formulation:
A word can be split into k words if it can be split into a word followed by a word splittable into k - 1 words.
The recursive base case would be that a word can be split into zero words only if it's the empty string, which is trivially true.
To use this recursive insight, we'll modify the original algorithm by considering all possible splits of the word into two parts. Once we have that split, we can check if the first part of the split is a word and if the second part of the split can be broken apart into k - 1 words. As an optimization, we don't recurse on all possible splits, but rather just on those where we know the first word is valid. Here's some sample code written in Java:
public static boolean isSplittable(String word, int k, Set<String> dictionary) {
/* Base case: If the string is empty, we can only split into k words and vice-
* versa.
*/
if (word.isEmpty() || k == 0)
return word.isEmpty() && k == 0;
/* Generate all possible non-empty splits of the word into two parts, recursing on
* problems where the first word is known to be valid.
*
* This loop is structured so that we always try pulling off at least one letter
* from the input string so that we don't try splitting the word into k pieces
* of which some are empty.
*/
for (int i = 1; i <= word.length(); ++i) {
String first = word.substring(0, i), last = word.substring(i);
if (dictionary.contains(first) &&
isSplittable(last, k - 1, dictionary)
return true;
}
/* If we're here, then no possible split works in this case and we should signal
* that no solution exists.
*/
return false;
}
}
This code, in the worst case, runs in time O(nk) because it tries to generate all possible partitions of the string into k different pieces. Of course, it's unlikely to hit this worst-case behavior because most possible splits won't end up forming any words.
I'd first loop through the dictionary using a strpos(-like) function to check if it occurs at all. Then try if you can find a match with the results.
So it would do something like this:
Loop through the dictionary strpos-ing every word in the dictionary and saving results into an array, let's say it gives me the results 'new', 'paper', and 'news'.
Check if new+paper==newspaper, check if new+news==newspaper, etc, untill you get to paper+news==newspaper which returns.
Not sure if it is a good method though, but it seems more efficient than checking a word letter for letter (more iterations) and you didn't explain how you'd check when the second word started.
Don't know what you mean by 'how do you generalize it for k'.

Another Permutation Word Conundrum... With Linq?

I have seen many examples of getting all permutations of a given set of letters. Recursion seems to work well to get all possible combinations of a set of letters (though it doesn't seem to take into account if 2 of the letters are the same).
What I would like to figure out is, could you use linq (or not) to get all possible combinations of letters down to 3 letter combinations.
For example, given the letters: P I G G Y
I want an array of all possible cominations of these letters so I can check against a word list (scrabble?) and eventually get a list of all possible words that you can make using those letters (from 3 letters up to the total, in this case 5 letters).
I would suggest that rather than generating all possible permutations (of each desired length), take slightly different approach that will reduce the overall amount of work that you have to do.
First, find some word lists (you say that you are going to check against a word list).
Here is a good source of word lists:
http://www.poslarchive.com/math/scrabble/lists/index.html
Next, for each word list (e.g. for 3 letter words, 4 letter words, etc), build a dictionary whose key is the letters of the word in alphabetical order, and whose value is the word. For example, given the following word list:
ACT
CAT
ART
RAT
BAT
TAB
Your dictionary would look something like this (conceptually) (You might want to make a dictionary of List):
ABT - BAT, TAB
ACT - ACT, CAT
ART - ART, RAT, TAR
You could probably put all words of all lengths in the same dictionary, it is really up to you.
Next, to find candidate words for a given set of N letters, generate all possible combinations of length K for the lengths that you are interested in. For scrabble, that would all combinations (order is not important, so CAT == ACT, so all permutations is not required) of 2 (7 choose 2), 3 (7 choose 3), 4 (7 choose 4), 5 (7 choose 5), 6 (7 choose 6), 7 letters (7 choose 7). This can be improved by first ordering the N letters alphabetically and then finding the combinations of length K.
For each combination of length K, check the dictionary to see if there are any words with this key. If so, they are candidates to be played.
So, for CAKE, order the letters:
ACEK
Get the 2, 3, and 4 letter combinations:
AC
AE
AK
CE
CK
EK
ACE
CEK
ACEK
Now, use these keys into the dictionary. You will find ACE and CAKE are candidates.
This approach allows you to be much more efficient than generating all permutations and then checking each to see if it is a word. Using the combination approach, you do not have to do separate lookups for groups of letters of the same length with the same letters.
For example, given:
TEA
There are 6 permutations (of length 3), but only 1 combination (of length 3). So, only one lookup is required, using the key AET.
Sorry for not putting in any code, but with these ideas, it should be relatively straightforward to achieve what you want.
I wrote a program that does a lot of this back when I was first learning C# and .NET. I will try to post some snippets (improved based on what I have learned since then).
This string extension will return a new string that represents the input string's characters reassembled in alphabetical order:
public static string ToWordKey(this string s)
{
return new string(s.ToCharArray().OrderBy(x => x).ToArray());
}
Based on this answer by #Adam Hughes, here is an extension method that will return all combinations (n choose k, not all permutations) for all lengths (1 to string.Length) of the input string:
public static IEnumerable<string> Combinations(this String characters)
{
//Return all combinations of 1, 2, 3, etc length
for (int i = 1; i <= characters.Length; i++)
{
foreach (string s in CombinationsImpl(characters, i))
{
yield return s;
}
}
}
//Return all combinations (n choose k, not permutations) for a given length
private static IEnumerable<string> CombinationsImpl(String characters, int length)
{
for (int i = 0; i < characters.Length; i++)
{
if (length == 1)
{
yield return characters.Substring(i,1);
}
else
{
foreach (string next in CombinationsImpl(characters.Substring(i + 1, characters.Length - (i + 1)), length - 1))
yield return characters[i] + next;
}
}
}
Using the "InAlphabeticOrder" method, you can build a list of your input words (scrabble dictionary), indexed by their "key" (similar to dictionary, but many words could have the same key).
public class WordEntry
{
public string Key { set; get; }
public string Word { set; get; }
public WordEntry(string w)
{
Word = w;
Key = Word.ToWordKey();
}
}
var wordList = ReadWordsFromFileIntoWordEntryClasses();
Given a list of WordEntry, you can query the list using linq to find all words that can be made from a given set of letters:
string lettersKey = letters.ToWordKey();
var words = from we in wordList where we.Key.Equals(lettersKey) select we.Word;
You could find all words that could be made from any combination (of any length) of a given set of letters like this:
string lettersKey = letters.ToWordKey();
var words = from we in wordList
from key in lettersKey.Combinations()
where we.Key.Equals(key)
select we.Word;
[EDIT]
Here is some more sample code:
Given a list of 2, 3, and 4 letter words from here: http://www.poslarchive.com/math/scrabble/lists/common-234.html
Here is some code that will read those words (I cut and pasted them into a txt file) and construct a list of WordEntry objects:
private IEnumerable<WordEntry> GetWords()
{
using (FileStream fs = new FileStream(#".\Words234.txt", FileMode.Open))
using (StreamReader sr = new StreamReader(fs))
{
var words = sr.ReadToEnd().Split(new char[] { ' ', '\n' }, StringSplitOptions.RemoveEmptyEntries);
var wordLookup = from w in words select new WordEntry(w, w.ToWordKey());
return wordLookup;
}
}
I have renamed the InAlphateticalOrder extension method to ToWordKey.
Nothing fancy here, just read the file, split it into words, and create a new WordEntry for each word. Possibly could be more efficient here by reading one line at a time. The list will also get pretty long when you consider 5, 6, and 7 letter words. That might be an issue and it might not. For a toy or a game, it is probably no big deal. If you wanted to get fancy, you might consider building a small database with the words and keys.
Given a set of letters, find all possible words of the same length as the key:
string key = "cat".ToWordKey();
var candidates = from we in wordEntries
where we.Key.Equals(key,StringComparison.OrdinalIgnoreCase)
select we.Word;
Given a set of letters, find all possible words from length 2 to length(letters)
string letters = "seat";
IEnumerable<string> allWords = Enumerable.Empty<string>();
//Get each combination so that the combination is in alphabetical order
foreach (string s in letters.ToWordKey().Combinations())
{
//For this combination, find all entries with the same key
var words = from we in wordEntries
where we.Key.Equals(s.ToWordKey(),StringComparison.OrdinalIgnoreCase)
select we.Word;
allWords = allWords.Concat(words.ToList());
}
This code could probably be better, but it gets the job done. One thing that it does not do is handle duplicate letters. If you have "egg", the two letter combinations will be "eg", "eg", and "gg". That can be fixed easily enough by adding a call to Distinct to the foreach loop:
//Get each combination so that the combination is in alphabetical order
//Don't be fooled by words with duplicate letters...
foreach (string s in letters.ToWordKey().Combinations().Distinct())
{
//For this combination, find all entries with the same key
var words = from we in wordEntries
where we.Key.Equals(s.ToWordKey(),StringComparison.OrdinalIgnoreCase)
select we.Word;
//I forced the evaluation here because without ToList I was only capturing the LAST
//(longest) combinations of letters.
allWords = allWords.Concat(words.ToList());
}
Is that the most efficient way to do it? Maybe, maybe not. Somebody has to do the work, why not LINQ?
I think that with this approach you probably don't need a Dictionary of Lists (Dictionary<string,List<string>>).
With this code and with a suitable set of words, you should be able to take any combination of letters and find all words that can be made from them. You can
control the words by finding all words of a particular length, or all words of any length.
This should get you on your way.
[More Clarification]
In terms of your original question, you take as input "piggy" and you want to find all possible words that can be made from these letters. Using the Combinations extension method on "piggy", you will come up with a list like this:
p
i
g
g
y
pi
pg
pg
py
ig
ig
iy
gg
gy
gy
pig
pig
piy
etc. Note that there are repetitions. That is ok, the last bit of sample code that I posted showed how to find all unique Combinations by applying the Distinct operator.
So, we can get a list of all combinations of letters from a given set of letters. My algorithm depends on the list of WordEntry objects being searchable based on the Key property. The Key property is simply the letters of the word rearranged into alphabetical order. So, if you read a word file containing words like this:
ACT
CAT
DOG
GOD
FAST
PIGGY
The list of WordEntry objects will look like this:
Word Key
ACT ACT
CAT ACT
DOG DGO
GOD DGO
FAST AFST
PIGGY GGIPY
So, it's easy enough to build up the list of words and keys that we want to test against (or dictionary of valid scrabble words).
For example, (assume the few words above form your entire dictionary), if you had the letters 'o' 'g' 'd' on your scrabble tray, you could form the words DOG and GOD, because both have the key DGO.
Given a set of letters, if we want to find all possible words that can be made from those letters, we must be able to generate all possible combinations of letters. We can test each of these against the "dictionary" (quotes because it is not REALLY a Dictionary in the .NET sense, it is a list (or sequence) of WordEntry objects). To make sure that the keys (from the sequence of letters that we have "drawn" in scrabble) is compatible with the Key field in the WordEntry object, we must first order the letters.
Say we have PIGGY on our scrabble tray. To use the algorithm that I suggested, we want to get all possible "Key" values from PIGGY. In our list of WordEntry objects, we created the Key field by ordering the Word's letters in alphabetic order. We must do the same with the letters on our tray.
So, PIGGY becomes GGIPY. (That is what ToWordKey does). Now, given the letters from our tray in alphabetical order, we can use Combinations to generate all possible combinations (NOT permumations). Each combination we can look up in our list, based on Key. If a combination from GGIPY matches a Key value, then the corresponding Word (of the WordEntry class) can be constructed from our letters.
A better example than PIGGY
SEAT
First use ToWordKey:
AETS
Now, make all Combinations of all lengths:
A
E
T
S
AE
AT
AS
ET
ES
TS
AET
ATS
ETS
AETS
When we look in our list of WordEntry objects (made from reading in the list of 2, 3, 4 letter words), we will probably find that the following combinations are found:
AT
AS
AET
ATS
ETS
AETS
These Key values correspond to the following words:
Key Word
AT AT
AS AS
AET ATE
AET EAT
AET TEA
AST SAT
EST SET
AEST EATS
AEST SEAT
AEST TEAS
The final code example above will take the letters ('s' 'e' 'a' 't'), convert to Key format (ToWordKey) generate the combinations (Combinations), keep only the unique possible key values (Distict - not an issue here since no repeated letters), and then query the list of all WordEntry objects for those WordEntry objects whose Key is the same as one of the combinations.
Essentially, what we have done is constructed a table with columns Word and Key (based on reading the file of words and computing the Key for each) and then we do a query joining Key with a sequence of Key values that we generated (from the letters on our tray).
Try using my code in steps.
First, use the Combinations extension method:
var combinations = "piggy".Combinations();
Print the result (p i g g y ... pi pg pg ... pig pig piy ... pigg pigy iggy ... etc)
Next, get all combinations after applying the ToWordKey extension method:
//
// "piggy".ToWordKey() yields "iggpy"
//
var combinations = "piggy".ToWordKey().Combinations();
Print the result (i g g p y ig ig ip iy igg igp igy ... etc)
Eliminate duplicates with the Distinct() method:
var combinations = "piggy".ToWordKey().Combinations().Distinct();
Print the result (i g p y ig ip iy igg igp igy ... etc)
Use other sets of letters like "ate" and "seat".
Notice that you get significantly fewer candidates than if you use a permutation algorithm.
Now, imagine that the combinations that we just made are the key values that we will use to look in our list of WordEntry objects, comparing each combination to the Key of a WordEntry.
Use the GetWords function above and the link to the 2, 3, 4 letter words to build the list of WordEntry objects. Better yet, make a very stripped down word list with only a few words and print it out (or look at it in the debugger). See what it looks like. Look at each Word and each Key. Now, imagine if you wanted to find ALL words that you could make with "AET". It is easier to imagine using all letters, so start there. There are 6 permutations, but only 1 combination! That's right, you only have to make one search of the word list to find all 3 letter words that can be made with those letters! If you had 4 letters there would be 24 permutations, but again, only 1 combination.
That is the essence of the algorithm. The ToWordKey() function is essentially a hash function. All strings with the same number of letters and the same set of letters will hash to the same value. So, build a list of Words and their hashes (Key - ToWordKey) and then, given a set of letters to use to make words, hash the letters (ToWordKey) and find all entries in the list with the same hash value. To extend to finding all words of any length (given a set of letters), you just have to hash the input (send the whole string through ToWordKey), then find all Combinations of ANY length. Since the combinations are being generated from the hashed set of letters AND since the Combinations extension method maintains the original ordering of the letters in each combination, then each combination retains the property of having been hashed! That's pretty cool!
Hope this helps.
This method seems to work. It's using both Linq and procedural code.
IEnumerable<string> GetWords(string letters, int minLength, int maxLength)
{
if (maxLength > letters.Length)
maxLength = letters.Length;
// Associate an id with each letter to handle duplicate letters
var uniqueLetters = letters.Select((c, i) => new { Letter = c, Index = i });
// Init with 1 zero-length word
var words = new [] { uniqueLetters.Take(0) };
for (int i = 1; i <= maxLength; i++)
{
// Append one unused letter to each "word" already generated
words = (from w in words
from lt in uniqueLetters
where !w.Contains(lt)
select w.Concat(new[] { lt })).ToArray();
if (i >= minLength)
{
foreach (var word in words)
{
// Rebuild the actual string from the sequence of unique letters
yield return String.Join(
string.Empty,
word.Select(lt => lt.Letter));
}
}
}
}
The problem with searching for all permutations of a word is the amount of work that will be spent on calculating absolute gibberish. generating all permutations is O(n!) and sooo much of that will be absolutely wasted. This is why I recommend wageoghe's answer.
Here is a recursive linq function that returns all permutations:
public static IEnumerable<string> AllPermutations(this IEnumerable<char> s) {
return s.SelectMany(x => {
var index = Array.IndexOf(s.ToArray(),x);
return s.Where((y,i) => i != index).AllPermutations()
.Select(y => new string(new [] {x}.Concat(y).ToArray()))
.Union(new [] {new string(new [] {x})});
}).Distinct();
}
You can find the words you want like so:
"piggy".AllPermutations().Where(x => x.Length > 2)
However:
WARNING: I am not fond of this very inefficient answer
Now linq's biggest benefit (to me) is how readable it is. Having said that, however, I do not think the intent of the above code is clear (and I wrote it!). Thus the biggest advantage of linq (to me) is not present above, and it is not as efficient as a non-linq solution. I usually forgive linq's lack of execution efficiency because of the efficiency it adds for coding time, readability, and ease of maintenance, but I just don't think a linq solution is the best fit here...a square peg, round hole sort of thing if you will.
Also, there's the matter of the complexity I mentioned above. Sure it can find the 153 three letters or more permutations of 'piggy' in .2 seconds, but give it a word like 'bookkeeper' and you will be waiting a solid 1 minute 39 seconds for it to find all 435,574 three letters or more permutations. So why did I post such a terrible function? To make the point that wageoghe has the right approach. Generating all permutations just isn't an efficient enough approach to this problem.

Resources