Shortest word in a dictionary containing all letters - algorithm

I was asked this question in an interview.
Given an array of characters, find the shortest word in a dictionary that contains all the characters. Also, propose an implementation for the dictionary that would optimize this function call.
for e.g. char[] chars = { 'R' , 'C' }. The result should be the word "CAR".
I could not come up with anything that would run reasonably quickly. I thought of pre-processing the dictionary by building a hash table to retrieve all words of a particular length. Then I could only think of retrieving all words in the increasing order of length and checking if the required characters were present in any of those ( maybe by using a bitmask . )

This is a common software interview question, and its solution is this: sort the dictionary itself by length and sort each value alphabetically. When given the characters, sort them and find the needed letters.

First sort the dictionary in ascending order of length.
For each letter, construct a bit map of the locations in the dictionary of the words containing that letter. Each bit map will be long, but there will not be many.
For each search, take the intersection of the bitmaps for the letters in the array. The first one bit in the result will be at the index corresponding to the location in the dictionary of the shortest word containing all the letters.

The other answers are better, but I realized this is entirely precomputable.
For each word
sort the letters and remove duplicates
The sequence of letters can be viewed as a bitmask, A=0bit, B=1bit...Z=26bit. Set the bits of a mask A according to the letters in this word.
For each combination of set bits in the mask A, make a subset mask B
If there is already a word associated with this mask B
and this word is shorter, replace the associated word with this one
otherwise try next B
If there is no word associated with mask B
Associate this word with mask B.
This would take a huge amount of setup time, and the subsequent association storage would be in the vicinity of 1.7GB, but you'd be able to find the shortest word containing a superset of the letters in O(1) time guaranteed.

The obvious preprocessing is to sort all words in the dictionary by their length and alphabetical re-ordering: "word" under "dorw", for example. Then you can use general search algorithms (e.g., regex) to search for the letters you need. An efficient (DFA) search requires only one pass over the dictionary in the worst case, and much less if the first match is short.

Here is a solution in C#:
using System.Collections.Generic;
using System.Linq;
public class ShortestWordFinder
{
public ShortestWordFinder(IEnumerable<string> dictionary)
{
this.dictionary = dictionary;
}
public string ShortestWordContaining(IEnumerable<char> chars)
{
var wordsContaining = dictionary.Where(s =>
{
foreach (var c in chars)
{
if (!s.Contains(c))
{
return false;
}
s = s.Remove(s.IndexOf(c), 1);
}
return true;
}).ToList();
if (!wordsContaining.Any())
{
return null;
}
var minLength = wordsContaining.Min(word => word.Length);
return wordsContaining.First(word => word.Length == minLength);
}
private readonly IEnumerable<string> dictionary;
}
Simple test:
using System.Diagnostics;
using Xunit;
public class ShortestWordFinderTests
{
[Fact]
public void Works()
{
var words = new[] { "dog", "moose", "gargoyle" };
var finder = new ShortestWordFinder(words);
Trace.WriteLine(finder.ShortestWordContaining("o"));
Trace.WriteLine(finder.ShortestWordContaining("oo"));
Trace.WriteLine(finder.ShortestWordContaining("oy"));
Trace.WriteLine(finder.ShortestWordContaining("eyg"));
Trace.WriteLine(finder.ShortestWordContaining("go"));
Assert.Null(finder.ShortestWordContaining("ooo"));
}
}

Pre processing
a. Sort words into alphabetic char arrays. Retain mapping from sorted to original word
b. Split dictionary by word length as you suggest
c. Sort entries in each word length set alphabetically
On function call
Sort char array alphabetically
Start with group of same length as array
Loop through entries testing for characters until first letter of entry lexicographically greater than first in your char array then break. If match then return original word (see a above for mapping)
Back to 2 for next longest word group
Interesting extensions. Multiple words might map to same entry in a. Which one (s) should you return...

Related

Google search suggestion implementation

In a recent amazon interview I was asked to implement Google "suggestion" feature. When a user enters "Aeniffer Aninston", Google suggests "Did you mean Jeniffer Aninston". I tried to solve it by using hashing but could not cover the corner cases. Please let me know your thought on this.
There are 4 most common types of erros -
Omitted letter: "stck" instead of "stack"
One letter typo: "styck" instead of "stack"
Extra letter: "starck" instead of "stack"
Adjacent letters swapped: "satck" instead of "stack"
BTW, we can swap not adjacent letters but any letters but this is not common typo.
Initial state - typed word. Run BFS/DFS from initial vertex. Depth of search is your own choice. Remember that increasing depth of search leads to dramatically increasing number of "probable corrections". I think depth ~ 4-5 is a good start.
After generating "probable corrections" search each generated word-candidate in a dictionary - binary search in sorted dictionary or search in a trie which populated with your dictionary.
Trie is faster but binary search allows searching in Random Access File without loading dictionary to RAM. You have to load only precomputed integer array[]. Array[i] gives you number of bytes to skip for accesing i-th word. Words in Random Acces File should be written in a sorted order. If you have enough RAM to store dictionary use trie.
Before suggesting corrections check typed word - if it is in a dictionary, provide nothing.
UPDATE
Generate corrections should be done by BFS - when I tried DFS, entries like "Jeniffer" showed "edit distance = 3". DFS doesn't works, since it make a lot of changes which can be done in one step - for example, Jniffer->nJiffer->enJiffer->eJniffer->Jeniffer instead of Jniffer->Jeniffer.
Sample code for generating corrections by BFS
static class Pair
{
private String word;
private byte dist;
// dist is byte because dist<=128.
// Moreover, dist<=6 in real application
public Pair(String word,byte dist)
{
this.word = word;
this.dist = dist;
}
public String getWord()
{
return word;
}
public int getDist()
{
return dist;
}
}
public static void main(String[] args) throws Exception
{
HashSet<String> usedWords;
HashSet<String> dict;
ArrayList<String> corrections;
ArrayDeque<Pair> states;
usedWords = new HashSet<String>();
corrections = new ArrayList<String>();
dict = new HashSet<String>();
states = new ArrayDeque<Pair>();
// populate dictionary. In real usage should be populated from prepared file.
dict.add("Jeniffer");
dict.add("Jeniffert"); //depth 2 test
usedWords.add("Jniffer");
states.add(new Pair("Jniffer", (byte)0));
while(!states.isEmpty())
{
Pair head = states.pollFirst();
//System.out.println(head.getWord()+" "+head.getDist());
if(head.getDist()<=2)
{
// checking reached depth.
//4 is the first depth where we don't generate anything
// swap adjacent letters
for(int i=0;i<head.getWord().length()-1;i++)
{
// swap i-th and i+1-th letters
String newWord = head.getWord().substring(0,i)+head.getWord().charAt(i+1)+head.getWord().charAt(i)+head.getWord().substring(i+2);
// even if i==curWord.length()-2 and then i+2==curWord.length
//substring(i+2) doesn't throw exception and returns empty string
// the same for substring(0,i) when i==0
if(!usedWords.contains(newWord))
{
usedWords.add(newWord);
if(dict.contains(newWord))
{
corrections.add(newWord);
}
states.addLast(new Pair(newWord, (byte)(head.getDist()+1)));
}
}
// insert letters
for(int i=0;i<=head.getWord().length();i++)
for(char ch='a';ch<='z';ch++)
{
String newWord = head.getWord().substring(0,i)+ch+head.getWord().substring(i);
if(!usedWords.contains(newWord))
{
usedWords.add(newWord);
if(dict.contains(newWord))
{
corrections.add(newWord);
}
states.addLast(new Pair(newWord, (byte)(head.getDist()+1)));
}
}
}
}
for(String correction:corrections)
{
System.out.println("Did you mean "+correction+"?");
}
usedWords.clear();
corrections.clear();
// helper data structures must be cleared after each generateCorrections call - must be empty for the future usage.
}
Words in a dictionary - Jeniffer,Jeniffert. Jeniffert is just for testing)
Output:
Did you mean Jeniffer?
Did you mean Jeniffert?
Important!
I choose depth of generating = 2. In real application depth should be 4-6, but as number of combinations grows exponentially, I don't go so deep. There are some optomizations devoted to reduce number of branches in a searching tree but I don't think much about them. I wrote only main idea.
Also, I used HashSet for storing dictionary and for labeling used words. It seems HashSet's constant is too large when it containt million objects. May be you should use trie both for word in a dictionary checking and for is word labeled checking.
I didn't implement erase letters and change letters operations because I want to show only main idea.

Breaking a string apart into a sequence of words

I recently came across the following interview question:
Given an input string and a dictionary of words, implement a method that breaks up the input string into a space-separated string of dictionary words that a search engine might use for "Did you mean?" For example, an input of "applepie" should yield an output of "apple pie".
I can't seem to get an optimal solution as far as complexity is concerned. Does anyone have any suggestions on how to do this efficiently?
Looks like the question is exactly my interview problem, down to the example I used in the post at The Noisy Channel. Glad you liked the solution. Am quite sure you can't beat the O(n^2) dynamic programming / memoization solution I describe for worst-case performance.
You can do better in practice if your dictionary and input aren't pathological. For example, if you can identify in linear time the substrings of the input string are in the dictionary (e.g., with a trie) and if the number of such substrings is constant, then the overall time will be linear. Of course, that's a lot of assumptions, but real data is often much nicer than a pathological worst case.
There are also fun variations of the problem to make it harder, such as enumerating all valid segmentations, outputting a best segmentation based on some definition of best, handling a dictionary too large to fit in memory, and handling inexact segmentations (e.g., correcting spelling mistakes). Feel free to comment on my blog or otherwise contact me to follow up.
This link describes this problem as a perfect interview question and provides several methods to solve it. Essentially it involves recursive backtracking. At this level it would produce an O(2^n) complexity. An efficient solution using memoization could bring this problem down to O(n^2).
Using python, we can write two function, the first one segment returns the first segmentation of a piece of contiguous text into words given a dictionary or None if no such segmentation is found. Another function segment_all returns a list of all segmentations found. Worst case complexity is O(n**2) where n is the input string length in chars.
The solution presented here can be extended to include spelling corrections and bigram analysis to determine the most likely segmentation.
def memo(func):
'''
Applies simple memoization to a function
'''
cache = {}
def closure(*args):
if args in cache:
v = cache[args]
else:
v = func(*args)
cache[args] = v
return v
return closure
def segment(text, words):
'''
Return the first match that is the segmentation of 'text' into words
'''
#memo
def _segment(text):
if text in words: return text
for i in xrange(1, len(text)):
prefix, suffix = text[:i], text[i:]
segmented_suffix = _segment(suffix)
if prefix in words and segmented_suffix:
return '%s %s' % (prefix, segmented_suffix)
return None
return _segment(text)
def segment_all(text, words):
'''
Return a full list of matches that are the segmentation of 'text' into words
'''
#memo
def _segment(text):
matches = []
if text in words:
matches.append(text)
for i in xrange(1, len(text)):
prefix, suffix = text[:i], text[i:]
segmented_suffix_matches = _segment(suffix)
if prefix in words and len(segmented_suffix_matches):
for match in segmented_suffix_matches:
matches.append('%s %s' % (prefix, match))
return matches
return _segment(text)
if __name__ == "__main__":
string = 'cargocultscience'
words = set('car cargo go cult science'.split())
print segment(string, words)
# >>> car go cult science
print segment_all(string, words)
# >>> ['car go cult science', 'cargo cult science']
One option would be to store all valid English words in a trie. Once you've done this, you could start walking the trie from the root downward, following the letters in the string. Whenever you find a node that's marked as a word, you have two options:
Break the input at this point, or
Continue extending the word.
You can claim that you've found a match once you have broken the input up into a set of words that are all legal and have no remaining characters left. Since at each letter you either have one forced option (either you are building a word that isn't valid and should stop -or- you can keep extending the word) or two options (split or keep going), you could implement this function using exhaustive recursion:
PartitionWords(lettersLeft, wordSoFar, wordBreaks, trieNode):
// If you walked off the trie, this path fails.
if trieNode is null, return.
// If this trie node is a word, consider what happens if you split
// the word here.
if trieNode.isWord:
// If there is no input left, you're done and have a partition.
if lettersLeft is empty, output wordBreaks + wordSoFar and return
// Otherwise, try splitting here.
PartitinWords(lettersLeft, "", wordBreaks + wordSoFar, trie root)
// Otherwise, consume the next letter and continue:
PartitionWords(lettersLeft.substring(1), wordSoFar + lettersLeft[0],
wordBreaks, trieNode.child[lettersLeft[0])
In the pathologically worst case this will list all partitions of the string, which can t exponentially long. However, this only occurs if you can partition the string in a huge number of ways that all start with valid English words, and is unlikely to occur in practice. If the string has many partitions, we might spend a lot of time finding them, though. For example, consider the string "dotheredo." We can split this many ways:
do the redo
do the red o
doth ere do
dot here do
dot he red o
dot he redo
To avoid this, you might want to institute a limit of the number of answers you report, perhaps two or three.
Since we cut off the recursion when we walk off the trie, if we ever try a split that doesn't leave the remainder of the string valid, we will detect this pretty quickly.
Hope this helps!
import java.util.*;
class Position {
int indexTest,no;
Position(int indexTest,int no)
{
this.indexTest=indexTest;
this.no=no;
} } class RandomWordCombo {
static boolean isCombo(String[] dict,String test)
{
HashMap<String,ArrayList<String>> dic=new HashMap<String,ArrayList<String>>();
Stack<Position> pos=new Stack<Position>();
for(String each:dict)
{
if(dic.containsKey(""+each.charAt(0)))
{
//System.out.println("=========it is here");
ArrayList<String> temp=dic.get(""+each.charAt(0));
temp.add(each);
dic.put(""+each.charAt(0),temp);
}
else
{
ArrayList<String> temp=new ArrayList<String>();
temp.add(each);
dic.put(""+each.charAt(0),temp);
}
}
Iterator it = dic.entrySet().iterator();
while (it.hasNext()) {
Map.Entry pair = (Map.Entry)it.next();
System.out.println("key: "+pair.getKey());
for(String str:(ArrayList<String>)pair.getValue())
{
System.out.print(str);
}
}
pos.push(new Position(0,0));
while(!pos.isEmpty())
{
Position position=pos.pop();
System.out.println("position index: "+position.indexTest+" no: "+position.no);
if(dic.containsKey(""+test.charAt(position.indexTest)))
{
ArrayList<String> strings=dic.get(""+test.charAt(position.indexTest));
if(strings.size()>1&&position.no<strings.size()-1)
pos.push(new Position(position.indexTest,position.no+1));
String str=strings.get(position.no);
if(position.indexTest+str.length()==test.length())
return true;
pos.push(new Position(position.indexTest+str.length(),0));
}
}
return false;
}
public static void main(String[] st)
{
String[] dic={"world","hello","super","hell"};
System.out.println("is 'hellworld' a combo: "+isCombo(dic,"superman"));
} }
I have done similar problem. This solution gives true or false if given string is combination of dictionary words. It can be easily converted to get space-separated string. Its average complexity is O(n), where n: no of dictionary words in given string.

Efficient data structure/algorithm for transliteration based word lookup

I'm looking for a efficient data structure/algorithm for storing and searching transliteration based word lookup (like google do: http://www.google.com/transliterate/ but I'm not trying to use google transliteration API). Unfortunately, the natural language I'm trying to work on doesn't have any soundex implemented, so I'm on my own.
For an open source project currently I'm using plain arrays for storing word list and dynamically generating regular expression (based on user input) to match them. It works fine, but regular expression is too powerful or resource intensive than I need. For example, I'm afraid this solution will drain too much battery if I try to port it to handheld devices, as searching over thousands of words with regular expression is too much costly.
There must be a better way to accomplish this for complex languages, how does Pinyin input method work for example? Any suggestion on where to start?
Thanks in advance.
Edit: If I understand correctly, this is suggested by #Dialecticus-
I want to transliterate from Language1, which has 3 characters a,b,c to Language2, which has 6 characters p,q,r,x,y,z. As a result of difference in numbers of characters each language possess and their phones, it is not often possible to define one-to-one mapping.
Lets assume phonetically here is our associative arrays/transliteration table:
a -> p, q
b -> r
c -> x, y, z
We also have a valid word lists in plain arrays for Language2:
...
px
qy
...
If the user types ac, the possible combinations become px, py, pz, qx, qy, qz after transliteration step 1. In step 2 we have to do another search in valid word list and will have to eliminate everyone of them except px and qy.
What I'm doing currently is not that different from the above approach. Instead of making possible combinations using the transliteration table, I'm building a regular expression [pq][xyz] and matching that with my valid word list, which provides the output px and qy.
I'm eager to know if there is any better method than that.
From what I understand, you have an input string S in an alphabet (lets call it A1) and you want to convert it to the string S' which is its equivalent in another alphabet A2. Actually, if I understand correctly, you want to generate a list [S'1,S'2,...,S'n] of output strings which might potentially be equivalent to S.
One approach that comes to mind is for each word in the list of valid words in A2 generate a list of strings in A1 that matches the. Using the example in your edit, we have
px->ac
qy->ac
pr->ab
(I have added an extra valid word pr for clarity)
Now that we know what possible series of input symbols will always map to a valid word, we can use our table to build a Trie.
Each node will hold a pointer to a list of valid words in A2 that map to the sequence of symbols in A1 that form the path from the root of the Trie to the current node.
Thus for our example, the Trie would look something like this
Root (empty)
| a
|
V
+---Node (empty)---+
| b | c
| |
V V
Node (px,qy) Node (pr)
Starting at the root node, as symbols are consumed transitions are made from the current node to its child marked with the symbol consumed until we have read the entire string. If at any point no transition is defined for that symbol, the entered string does not exist in our trie and thus does not map to a valid word in our target language. Otherwise, at the end of the process, the list of words associated with the current node is the list of valid words the input string maps to.
Apart from the initial cost of building the trie (the trie can be shipped pre-built if we never want the list of valid words to change), this takes O(n) on the length of the input to find a list of mapping valid words.
Using a Trie also provide the advantage that you can also use it to find the list of all valid words that can be generated by adding more symbols to the end of the input - i.e. a prefix match. For example, if fed with the input symbol 'a', we can use the trie to find all valid words that can begin with 'a' ('px','qr','py'). But doing that is not as fast as finding the exact match.
Here's a quick hack at a solution (in Java):
import java.util.*;
class TrieNode{
// child nodes - size of array depends on your alphabet size,
// her we are only using the lowercase English characters 'a'-'z'
TrieNode[] next=new TrieNode[26];
List<String> words;
public TrieNode(){
words=new ArrayList<String>();
}
}
class Trie{
private TrieNode root=null;
public void addWord(String sourceLanguage, String targetLanguage){
root=add(root,sourceLanguage.toCharArray(),0,targetLanguage);
}
private static int convertToIndex(char c){ // you need to change this for your alphabet
return (c-'a');
}
private TrieNode add(TrieNode cur, char[] s, int pos, String targ){
if (cur==null){
cur=new TrieNode();
}
if (s.length==pos){
cur.words.add(targ);
}
else{
cur.next[convertToIndex(s[pos])]=add(cur.next[convertToIndex(s[pos])],s,pos+1,targ);
}
return cur;
}
public List<String> findMatches(String text){
return find(root,text.toCharArray(),0);
}
private List<String> find(TrieNode cur, char[] s, int pos){
if (cur==null) return new ArrayList<String>();
else if (pos==s.length){
return cur.words;
}
else{
return find(cur.next[convertToIndex(s[pos])],s,pos+1);
}
}
}
class MyMiniTransliiterator{
public static void main(String args[]){
Trie t=new Trie();
t.addWord("ac","px");
t.addWord("ac","qy");
t.addWord("ab","pr");
System.out.println(t.findMatches("ac")); // prints [px,qy]
System.out.println(t.findMatches("ab")); // prints [pr]
System.out.println(t.findMatches("ba")); // prints empty list since this does not match anything
}
}
This is a very simple trie, no compression or speedups and only works on lower case English characters for the input language. But it can be easily modified for other character sets.
I would build transliterated sentence one symbol at the time, instead of one word at the time. For most languages it is possible to transliterate every symbol independently of other symbols in the word. You can still have exceptions as whole words that have to be transliterated as complete words, but transliteration table of symbols and exceptions will surely be smaller than transliteration table of all existing words.
Best structure for transliteration table is some sort of associative array, probably utilizing hash tables. In C++ there's std::unordered_map, and in C# you would use Dictionary.

How to find all brotherhood strings?

I have a string, and another text file which contains a list of strings.
We call 2 strings "brotherhood strings" when they're exactly the same after sorting alphabetically.
For example, "abc" and "cba" will be sorted into "abc" and "abc", so the original two are brotherhood. But "abc" and "aaa" are not.
So, is there an efficient way to pick out all brotherhood strings from the text file, according to the one string provided?
For example, we have "abc" and a text file which writes like this:
abc
cba
acb
lalala
then "abc", "cba", "acb" are the answers.
Of course, "sort & compare" is a nice try, but by "efficient", i mean if there is a way, we can determine a candidate string is or not brotherhood of the original one after one pass processing.
This is the most efficient way, i think. After all, you can not tell out the answer without even reading candidate strings. For sorting, most of the time, we need to do more than 1 pass to the candidate string. So, hash table might be a good solution, but i've no idea what hash function to choose.
Most efficient algorithm I can think of:
Set up a hash table for the original string. Let each letter be the key, and the number of times the letter appears in the string be the value. Call this hash table inputStringTable
Parse the input string, and each time you see a character, increment the value of the hash entry by one
for each string in the file
create a new hash table. Call this one brotherStringTable.
for each character in the string, add one to a new hash table. If brotherStringTable[character] > inputStringTable[character], this string is not a brother (one character shows up too many times)
once string is parsed, compare each inputStringTable value with the corresponding brotherStringTable value. If one is different, then this string is not a brother string. If all match, then the string is a brother string.
This will be O(nk), where n is the length of the input string (any strings longer than the input string can be discarded immediately) and k is the number of strings in the file. Any sort based algorithm will be O(nk lg n), so in certain cases, this algorithm is faster than a sort based algorithm.
Sorting each string, then comparing it, works out to something like O(N*(k+log S)), where N is the number of strings, k is the search key length, and S is the average string length.
It seems like counting the occurrences of each character might be a possible way to go here (assuming the strings are of a reasonable length). That gives you O(k+N*S). Whether that's actually faster than the sort & compare is obviously going to depend on the values of k, N, and S.
I think that in practice, the cache-thrashing effect of re-writing all the strings in the sorting case will kill performance, compared to any algorithm that doesn't modify the strings...
iterate, sort, compare. that shouldn't be too hard, right?
Let's assume your alphabet is from 'a' to 'z' and you can index an array based on the characters. Then, for each element in a 26 element array, you store the number of times that letter appears in the input string.
Then you go through the set of strings you're searching, and iterate through the characters in each string. You can decrement the count associated with each letter in (a copy of) the array of counts from the key string.
If you finish your loop through the candidate string without having to stop, and you have seen the same number of characters as there were in the input string, it's a match.
This allows you to skip the sorts in favor of a constant-time array copy and a single iteration through each string.
EDIT: Upon further reflection, this is effectively sorting the characters of the first string using a bucket sort.
I think what will help you is the test if two strings are anagrams. Here is how you can do it. I am assuming the string can contain 256 ascii characters for now.
#define NUM_ALPHABETS 256
int alphabets[NUM_ALPHABETS];
bool isAnagram(char *src, char *dest) {
len1 = strlen(src);
len2 = strlen(dest);
if (len1 != len2)
return false;
memset(alphabets, 0, sizeof(alphabets));
for (i = 0; i < len1; i++)
alphabets[src[i]]++;
for (i = 0; i < len2; i++) {
alphabets[dest[i]]--;
if (alphabets[dest[i]] < 0)
return false;
}
return true;
}
This will run in O(mn) if you have 'm' strings in the file of average length 'n'
Sort your query string
Iterate through the Collection, doing the following:
Sort current string
Compare against query string
If it matches, this is a "brotherhood" match, save it/index/whatever you want
That's pretty much it. If you're doing lots of searching, presorting all of your collection will make the routine a lot faster (at the cost of extra memory). If you are doing this even more, you could pre-sort and save a dictionary (or some hashed collection) based off the first character, etc, to find matches much faster.
It's fairly obvious that each brotherhood string will have the same histogram of letters as the original. It is trivial to construct such a histogram, and fairly efficient to test whether the input string has the same histogram as the test string ( you have to increment or decrement counters for twice the length of the input string ).
The steps would be:
construct histogram of test string ( zero an array int histogram[128] and increment position for each character in test string )
for each input string
for each character in input string c, test whether histogram[c] is zero. If it is, it is a non-match and restore the histogram.
decrement histogram[c]
to restore the histogram, traverse the input string back to its start incrementing rather than decrementing
At most, it requires two increments/decrements of an array for each character in the input.
The most efficient answer will depend on the contents of the file. Any algorithm we come up with will have complexity proportional to N (number of words in file) and L (average length of the strings) and possibly V (variety in the length of strings)
If this were a real world situation, I would start with KISS and not try to overcomplicate it. Checking the length of the target string is simple but could help avoid lots of nlogn sort operations.
target = sort_characters("target string")
count = 0
foreach (word in inputfile){
if target.len == word.len && target == sort_characters(word){
count++
}
}
I would recommend:
for each string in text file :
compare size with "source string" (size of brotherhood strings should be equal)
compare hashes (CRC or default framework hash should be good)
in case of equity, do a finer compare with string sorted.
It's not the fastest algorithm but it will work for any alphabet/encoding.
Here's another method, which works if you have a relatively small set of possible "letters" in the strings, or good support for large integers. Basically consists of writing a position-independent hash function...
Assign a different prime number for each letter:
prime['a']=2;
prime['b']=3;
prime['c']=5;
Write a function that runs through a string, repeatedly multiplying the prime associated with each letter into a running product
long long key(char *string)
{
long long product=1;
while (*string++) {
product *= prime[*string];
}
return product;
}
This function will return a guaranteed-unique integer for any set of letters, independent of the order that they appear in the string. Once you've got the value for the "key", you can go through the list of strings to match, and perform the same operation.
Time complexity of this is O(N), of course. You can even re-generate the (sorted) search string by factoring the key. The disadvantage, of course, is that the keys do get large pretty quickly if you have a large alphabet.
Here's an implementation. It creates a dict of the letters of the master, and a string version of the same as string comparisons will be done at C++ speed. When creating a dict of the letters in a trial string, it checks against the master dict in order to fail at the first possible moment - if it finds a letter not in the original, or more of that letter than the original, it will fail. You could replace the strings with integer-based hashes (as per one answer regarding base 26) if that proves quicker. Currently the hash for comparison looks like a3c2b1 for abacca.
This should work out O(N log( min(M,K) )) for N strings of length M and a reference string of length K, and requires the minimum number of lookups of the trial string.
master = "abc"
wordset = "def cba accb aepojpaohge abd bac ajghe aegage abc".split()
def dictmaster(str):
charmap = {}
for char in str:
if char not in charmap:
charmap[char]=1
else:
charmap[char] += 1
return charmap
def dicttrial(str,mastermap):
trialmap = {}
for char in str:
if char in mastermap:
# check if this means there are more incidences
# than in the master
if char not in trialmap:
trialmap[char]=1
else:
trialmap[char] += 1
else:
return False
return trialmap
def dicttostring(hash):
if hash==False:
return False
str = ""
for char in hash:
str += char + `hash[char]`
return str
def testtrial(str,master,mastermap,masterhashstring):
if len(master) != len(str):
return False
trialhashstring=dicttostring(dicttrial(str,mastermap))
if (trialhashstring==False) or (trialhashstring != masterhashstring):
return False
else:
return True
mastermap = dictmaster(master)
masterhashstring = dicttostring(mastermap)
for word in wordset:
if testtrial(word,master,mastermap,masterhashstring):
print word+"\n"

Word-separating algorithm

What is the algorithm - seemingly in use on domain parking pages - that takes a spaceless bunch of words (eg "thecarrotofcuriosity") and more-or-less correctly breaks it down into the constituent words (eg "the carrot of curiosity") ?
Start with a basic Trie data structure representing your dictionary. As you iterate through the characters of the the string, search your way through the trie with a set of pointers rather than a single pointer - the set is seeded with the root of the trie. For each letter, the whole set is advanced at once via the pointer indicated by the letter, and if a set element cannot be advanced by the letter, it is removed from the set. Whenever you reach a possible end-of-word, add a new root-of-trie to the set (keeping track of the list of words seen associated with that set element). Finally, once all characters have been processed, return an arbitrary list of words which is at the root-of-trie. If there's more than one, that means the string could be broken up in multiple ways (such as "therapistforum" which can be parsed as ["therapist", "forum"] or ["the", "rapist", "forum"]) and it's undefined which we'll return.
Or, in a wacked up pseudocode (Java foreach, tuple indicated with parens, set indicated with braces, cons using head :: tail, [] is the empty list):
List<String> breakUp(String str, Trie root) {
Set<(List<String>, Trie)> set = {([], root)};
for (char c : str) {
Set<(List<String>, Trie)> newSet = {};
for (List<String> ls, Trie t : set) {
Trie tNext = t.follow(c);
if (tNext != null) {
newSet.add((ls, tNext));
if (tNext.isWord()) {
newSet.add((t.follow(c).getWord() :: ls, root));
}
}
}
set = newSet;
}
for (List<String> ls, Trie t : set) {
if (t == root) return ls;
}
return null;
}
Let me know if I need to clarify or I missed something...
I would imagine they take a dictionary word list like /usr/share/dict/words on your common or garden variety Unix system and try to find sets of word matches (starting from the left?) that result in the largest amount of original text being covered by a match. A simple breadth-first-search implementation would probably work fine, since it obviously doesn't have to run fast.
I'd imaging these sites do it similar to this:
Get a list of word for your target language
Remove "useless" words like "a", "the", ...
Run through the list and check which of the words are substrings of the domain name
Take the most common words of the remaining list (Or the ones with the highest adsense rating,...)
Of course that leads to nonsense for expertsexchange, but what else would you expect there...
(disclaimer: I did not try it myself, so take it merely as a food for experimentation. 4-grams are taken mostly out of the blue sky, just from my experience that 3-grams won't work all too well; 5-grams and more might work better, even though you will have to deal with a pretty large table). It's also simplistic in a sense that it does not take into the account the ending of the string - if it works for you otherwise, you'd probably need to think about fixing the endings.
This algorithm would run in a predictable time proportional to the length of the string that you are trying to split.
So, first: Take a lot of human-readable texts. for each of the text, supposing it is in a single string str, run the following algorithm (pseudocode-ish notation, assumes the [] is a hashtable-like indexing, and that nonexistent indexes return '0'):
for(i=0;i<length(s)-5;i++) {
// take 4-character substring starting at position i
subs2 = substring(str, i, 4);
if(has_space(subs2)) {
subs = substring(str, i, 5);
delete_space(subs);
yes_space[subs][position(space, subs2)]++;
} else {
subs = subs2;
no_space[subs]++;
}
}
This will build you the tables which will help to decide whether a given 4-gram would need to have a space in it inserted or not.
Then, take your string to split, I denote it as xstr, and do:
for(i=0;i<length(xstr)-5;i++) {
subs = substring(xstr, i, 4);
for(j=0;j<4;j++) {
do_insert_space_here[i+j] -= no_space[subs];
}
for(j=0;j<4;j++) {
do_insert_space_here[i+j] += yes_space[subs][j];
}
}
Then you can walk the "do_insert_space_here[]" array - if an element at a given position is bigger than 0, then you should insert a space in that position in the original string. If it's less than zero, then you shouldn't.
Please drop a note here if you try it (or something of this sort) and it works (or does not work) for you :-)

Resources