Look for a data structure to match words by letters - algorithm

Given a list of lowercase random words, all of the same length, and many patterns in which letters are specified at some positions while the other positions are unknown, find all words that match each pattern.
For example, words list is:
["ixlwnb","ivknmt","vvqnbl","qvhntl"]
And patterns are:
i-----
-v---l
-v-n-l
With a naive algorithm, one can do an O(NL) scan for each pattern, where N is the number of words and L is the word length.
But since many patterns may be run against the same word list, is there any good data structure to preprocess and store the word list so that matching all the patterns is efficient?

One simple idea is to use an inverted index. First, number your words -- you'll refer to them using these indices rather than the words themselves for speed and space efficiency. Probably the index fits in a 32-bit int.
Now your inverted index: for each letter in each position, construct a sorted list of IDs for words that have that letter in that location.
To do your search, you take the lists of IDs for each of the letters in the positions you're given, and take the intersection of the lists, using an algorithm like the "merge" in merge-sort. All IDs in the intersection match the input.
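A minimal sketch of that inverted index, assuming a fixed word length, lowercase letters only and '-' for blanks in the pattern (the layout and names are just illustrative, not a definitive implementation):

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <string>
#include <vector>

// index[pos][letter] holds the sorted IDs of all words with that letter at that position.
using PositionIndex = std::vector<std::vector<std::vector<uint32_t>>>;

PositionIndex build_index(const std::vector<std::string>& words, std::size_t len) {
    PositionIndex index(len, std::vector<std::vector<uint32_t>>(26));
    for (uint32_t id = 0; id < words.size(); ++id)
        for (std::size_t pos = 0; pos < len; ++pos)
            index[pos][words[id][pos] - 'a'].push_back(id);
    return index;  // IDs are appended in increasing order, so every list is already sorted
}

std::vector<uint32_t> match(const PositionIndex& index, const std::string& pattern) {
    std::vector<uint32_t> result;
    bool first = true;
    for (std::size_t pos = 0; pos < pattern.size(); ++pos) {
        if (pattern[pos] == '-') continue;               // unspecified position
        const auto& ids = index[pos][pattern[pos] - 'a'];
        if (first) { result = ids; first = false; continue; }
        std::vector<uint32_t> merged;                    // "merge"-style intersection of two sorted lists
        std::set_intersection(result.begin(), result.end(),
                              ids.begin(), ids.end(), std::back_inserter(merged));
        result.swap(merged);
    }
    return result;  // caveat: an all-blank pattern needs special handling (every word matches)
}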
Alternatively, if your words are short enough (12 characters or fewer), you could compress them into 64-bit words (using 5 bits per letter, with letter codes 1-26). Construct a bit-mask with binary 11111 in the positions where the pattern has a letter and 00000 in the positions where it has a blank, and a bit-test value from your input with the 5-bit code for each letter in each position, using 00000 for the blanks. For example, if your input is a-c then your bit-mask will be binary 111110000011111 and your bit-test binary 000010000000011. Go through your word list, take the bitwise AND of each word with the bit-mask and test whether the result equals the bit-test value. This is cache-friendly and the inner loop is tight, so it may be competitive with algorithms that look like they should be faster on paper.
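A rough sketch of this packed-word variant, assuming words of at most 12 lowercase letters, 5 bits per letter with codes 1-26 and 0 for a blank (the names are only for illustration):

#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Pack a word into a 64-bit integer, 5 bits per letter, 'a' -> 1 ... 'z' -> 26.
uint64_t pack(const std::string& w) {
    uint64_t v = 0;
    for (std::size_t i = 0; i < w.size(); ++i)
        v |= uint64_t(w[i] - 'a' + 1) << (5 * i);
    return v;
}

// Build the bit-mask (11111 where the pattern fixes a letter) and the bit-test value (the letter codes).
void make_mask(const std::string& pattern, uint64_t& mask, uint64_t& test) {
    mask = test = 0;
    for (std::size_t i = 0; i < pattern.size(); ++i) {
        if (pattern[i] == '-') continue;                 // blank: 00000 in both
        mask |= uint64_t(0x1F) << (5 * i);
        test |= uint64_t(pattern[i] - 'a' + 1) << (5 * i);
    }
}

// Tight, cache-friendly linear scan over the packed word list.
std::vector<std::size_t> match_packed(const std::vector<uint64_t>& packed,
                                      uint64_t mask, uint64_t test) {
    std::vector<std::size_t> hits;
    for (std::size_t i = 0; i < packed.size(); ++i)
        if ((packed[i] & mask) == test) hits.push_back(i);
    return hits;
}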

I'll preface this by saying it's more of a comment than an answer (I don't have enough reputation to comment, though). I can't think of any data structure that will satisfy the requirements out of the box. It was interesting to think about, and I figured I'd share one potential solution that popped into my head.
I keyed in on the "same length" part, and figured I could come up with something based on that.
In theory we could have N (N being the word length) maps of char -> set.
When a string is added, we go through each character and add the string to the corresponding set. Pseudocode:
firstCharMap[s[0]].insert(s);
secondCharMap[s[1]].insert(s);
thirdCharMap[s[2]].insert(s);
fourthCharMap[s[3]].insert(s);
fifthCharMap[s[4]].insert(s);
sixthCharMap[s[5]].insert(s);
Then to determine which strings match a pattern, we just take the intersection of the corresponding sets. For example, "-v-n-l" would be:
intersection of sets: secondCharMap[v], fourthCharMap[n], sixthCharMap[l]
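The intersection step could look something like this (assuming std::set<std::string> as in the pseudocode above; the helper name is made up):

#include <algorithm>
#include <iterator>
#include <set>
#include <string>

// std::set iterates in sorted order, so a merge-style set_intersection works directly.
std::set<std::string> intersect(const std::set<std::string>& a, const std::set<std::string>& b) {
    std::set<std::string> out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::inserter(out, out.begin()));
    return out;
}

// e.g. for "-v-n-l": intersect(intersect(secondCharMap['v'], fourthCharMap['n']), sixthCharMap['l'])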
One edge case that jumps out is a pattern with no letters specified at all, i.e. wanting to get all of the strings, so if that's a requirement we may also need an additional set of all of the strings.
This solution feels clunky, but I think it could work. Depending on the language, number of strings, etc--I wouldn't be surprised if it performed worse than just iterating over all strings and checking a predicate.

Related

Efficient algorithm to find most common phrases in a large volume of text

I am thinking about writing a program to collect for me the most common phrases in a large volume of text. Had the problem been reduced to just finding words, then it would be as simple as storing each new word in a hashmap and then increasing the count on each occurrence. But with phrases, storing each permutation of a sentence as a key seems infeasible.
Basically the problem is narrowed down to figuring out how to extract every possible phrase from a large enough text. Counting the phrases and then sorting by the number of occurrences becomes trivial.
I assume that you are searching for common patterns of consecutive words appearing in the same order (e.g. "top of the world" would not be counted as the same phrase as "top of a world" or "the world of top").
If so then I would recommend the following linear-time approach:
Split your text into words and remove things you don't consider significant (i.e. remove capitalisation, punctuation, word breaks, etc.)
Convert your text into an array of integers (one integer per unique word) (e.g. every instance of "cat" becomes 1, every "dog" becomes 2). This can be done in linear time by using a hash-based dictionary to store the conversions from words to numbers. If a word is not in the dictionary, assign it a new id.
Construct a suffix array for the array of integers (this is a sorted list of all the suffixes of your array and can be constructed in linear time - e.g. using the algorithm and C code here)
Construct the longest common prefix array for your suffix array. (This can also be done in linear-time, for example using this C code) This LCP array gives the number of common words at the start of each suffix between consecutive pairs in the suffix array.
You are now in a position to collect your common phrases.
It is not quite clear how you wish to determine the end of a phrase. One possibility is to simply collect all sequences of 4 words that repeat.
This can be done in linear time by working through your suffix array looking at places where the longest common prefix array is >= 4. Each run of indices x in the range [start+1...start+len] where the LCP[x] >= 4 (for all except the last value of x) corresponds to a phrase that is repeated len times. The phrase itself is given by the first 4 words of, for example, suffix start+1.
Note that this approach will potentially spot phrases that cross sentence ends. You may prefer to convert some punctuation such as full stops into unique integers to prevent this.
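A compact sketch of steps 2-5, using a simple comparison-based suffix sort instead of the linear-time constructions linked above (fine for moderate inputs), Kasai's algorithm for the LCP array, and 4-word phrases as the cutoff; the sample text and names are just illustrative:

#include <algorithm>
#include <iostream>
#include <map>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> words = {"top","of","the","world","is","the",
                                      "top","of","the","world","today"};
    // Step 2: map each distinct word to an integer id.
    std::map<std::string,int> ids;
    std::vector<int> a;
    for (const auto& w : words) {
        auto it = ids.find(w);
        if (it == ids.end()) it = ids.emplace(w, (int)ids.size() + 1).first;
        a.push_back(it->second);
    }
    int n = (int)a.size();

    // Step 3: suffix array by direct comparison (O(n^2 log n); swap in the linked
    // linear-time construction for large texts).
    std::vector<int> sa(n);
    for (int i = 0; i < n; ++i) sa[i] = i;
    std::sort(sa.begin(), sa.end(), [&](int i, int j) {
        return std::lexicographical_compare(a.begin() + i, a.end(), a.begin() + j, a.end());
    });

    // Step 4: LCP array via Kasai's algorithm; lcp[k] = words shared by suffixes sa[k-1] and sa[k].
    std::vector<int> rank(n), lcp(n, 0);
    for (int i = 0; i < n; ++i) rank[sa[i]] = i;
    for (int i = 0, h = 0; i < n; ++i) {
        if (rank[i] > 0) {
            int j = sa[rank[i] - 1];
            while (i + h < n && j + h < n && a[i + h] == a[j + h]) ++h;
            lcp[rank[i]] = h;
            if (h > 0) --h;
        } else h = 0;
    }

    // Step 5: report 4-word phrases shared by consecutive suffixes (a run of len-1
    // entries with lcp >= 4 means the phrase occurs len times; duplicates are not merged here).
    const int K = 4;
    for (int k = 1; k < n; ++k)
        if (lcp[k] >= K)
            for (int t = 0; t < K; ++t)
                std::cout << words[sa[k] + t] << (t + 1 < K ? " " : "\n");
    return 0;
}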

Finding Valid Words in a Matrix

Given a dictionary of words, two APIs
Is_word(string)
Is_prefix(string)
And an NxN matrix with each position consisting of a character. If from any position (i,j) you can move in any of the four directions, find all the valid words that can be formed in the matrix.
(Looping is not allowed, i.e. while forming a word, if you start from (i,j) and move to (i-1,j), then from this position you cannot go back to (i,j).)
What I tried: I can see an exponential solution where we go through all possibilities and keep track of already visited indexes. Can we have a better solution?
This is the best I can do:
Enumerate all the possible strings in half of the allowed directions. (You can reverse the strings to get the other directions.)
For each starting character position in each string, build up substrings until they are neither a word nor a prefix.
Edit: I may have misunderstood the question. I thought each individual word could only be made up of moves in the same direction.
I would be inclined to solve this by creating two parallel data structures. One is a list of words (with duplicates removed). The other is a matrix of prefixes, where each element is a list. Hopefully, such data structures can be used. The matrix of lists could really be a list of triplets, each holding the prefix and its coordinates in the original matrix.
In any case, go through the original matrix and call isword() on each element. If a word, then insert the word into the word list.
Then go through the original matrix and call isprefix() on each element. If it is a prefix, insert it in the prefix matrix.
Then, go through the prefix matrix and test each prefix extended by each of the four neighbouring letters. If the extension is a word, put it in the word list. If it is a prefix, add it to the prefix matrix, in the location of the last letter. Remember to do both, since a word can also be a prefix.
Also, in the prefix list, you need to keep not only the letters, but also the locations of the letters that have been used. This is used to prevent loops.
Then continue this process until there are no words and no prefixes.
It is hard to measure the complexity of this structure, since it seems to depend on the contents of the dictionary.
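A sketch of this prefix-pruned search written as a depth-first traversal, which is equivalent to iterating the prefix matrix until nothing new is produced; Is_word and Is_prefix are the two APIs given in the question, assumed to be supplied by the dictionary:

#include <set>
#include <string>
#include <vector>

bool Is_word(const std::string& s);    // supplied by the dictionary
bool Is_prefix(const std::string& s);  // supplied by the dictionary

void extend(const std::vector<std::string>& grid, int i, int j, std::string& cur,
            std::vector<std::vector<bool>>& used, std::set<std::string>& found) {
    int n = (int)grid.size();
    if (i < 0 || j < 0 || i >= n || j >= n || used[i][j]) return;  // off-grid or would form a loop
    cur.push_back(grid[i][j]);
    used[i][j] = true;
    if (Is_word(cur)) found.insert(cur);    // record it, but keep going: a word can also be a prefix
    if (Is_prefix(cur)) {                   // prune: stop when no dictionary word starts this way
        extend(grid, i - 1, j, cur, used, found);
        extend(grid, i + 1, j, cur, used, found);
        extend(grid, i, j - 1, cur, used, found);
        extend(grid, i, j + 1, cur, used, found);
    }
    used[i][j] = false;                     // backtrack
    cur.pop_back();
}

std::set<std::string> all_words(const std::vector<std::string>& grid) {
    int n = (int)grid.size();
    std::set<std::string> found;
    std::vector<std::vector<bool>> used(n, std::vector<bool>(n, false));
    std::string cur;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            extend(grid, i, j, cur, used, found);
    return found;
}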
I think you can't beat exponential here. Proof?
Think about the case where the matrix contains an arrangement of letters such that every combination is a valid dictionary word, i.e. starting from [i,j], combining it with any of [i-1,j], [i+1,j], [i,j+1] or [i,j-1] gives a 2-letter word, and that continues recursively, so that any combination of length n (according to your rules) is a word. In this case the number of valid words is itself exponential, so you can't do better than brute force.

Solve Hangman in AI way

I named it the "AI way" because I'm thinking of making an application that plays the hangman game without any human interaction.
The scenario is like this:
An available word list which contains hundreds of thousands of English words.
The application will pick a certain number of words, e.g. 20, from the list.
The application plays hangman against each word until it either wins or fails.
The restriction here is the maximum number of wrong guesses.
26 obviously does not make sense, so let's say 6 for the maximum number of wrong guesses.
I tried the strategy mentioned on the wiki page but it does not work well.
Basically the success rate is about 30%.
Any suggestions / comments regarding strategy, as well as which field I should dig into in order to find a fairly good strategy?
Thanks a lot.
-Simon
PS: A JavaScript implementation which looks fairly good.
(https://github.com/freizl/play-hangman-game)
Updated Idea
Download a dictionary of words and put it into some database or structure of your choice
When presented with a word, narrow your guesses to words of the same length and perform a letter frequency distribution (you can use a dictionary and/or list collection for fast distribution analysis and sorting)
Pick the most common letter from this list
If the letter is found, create a regex pattern based on the known letter(s) and the word length and repeat from step 2
You should be able to quickly narrow down a single word resulting from your pattern search
For posterity:
Take a look at this wiki page. It includes a table of frequencies of the first letters of words which may help you tune your algorithm.
You could also take into account the fact that if you find a vowel or two in a word, the likelihood of finding other vowels will decrease significantly, and you should then try more common consonants instead. The example from the wiki page you listed starts with E, then T, and then tries three vowels in a row: A, O and I. The first two letters are misses, but once the third letter is found (twice), the process should switch to common consonants and skip trying for more vowels, since there will likely be fewer.
Any useful strategies will certainly employ frequency distribution charts on letters and possibly words, e.g. some words are very common while others are rarely used, so performing a letter frequency distribution on a set of more common words might help... guessing that some words may appear more frequently than others, but that depends on your word selection algorithm, which might not take into account "common" usage.
You could also build specialized letter frequency tables, possibly even on the fly. For example, given the wikipedia h a ngm a n example: you find the letter A twice in a word, in two locations, 2nd and 6th. You know that the word has seven letters, and with a fairly simple regex you could isolate the words from a dictionary that match this pattern:
_ a _ _ _ a _
Then perform a letter frequency on that set of words that matches this pattern and use that set for your next guess. Rinse and repeat. I think doing some of those things I mentioned but especially the last will really increase your odds of success.
The strategies in the linked page seem to be "order guesses by letter frequency" and "guess the vowels, then order guesses by letter frequency"
A couple of observations about hangman:
1) Since guessing a letter that isn't in the word hurts us, we should guess letters by word frequency (percentage of words that contain letter X), not letter frequency (number of times that X appears in all words). This should minimise our chances of guessing a bad letter.
2) Once we've guessed some letters correctly, we know more about the word we're trying to guess.
Here are two strategies that should beat the letter frequency strategy. I'm going to assume we have a dictionary of words that might come up.
If we expect the word to be in our dictionary:
1) We know the length of the target word, n. Remove all words in the dictionary that aren't of length n
2) Calculate the word frequency of all letters in the dictionary
3) Guess the most frequent letter that we haven't already guessed.
4) If we guessed correctly, remove all words from the dictionary that don't match the revealed letters.
5) If we guessed incorrectly, remove all words that contain the incorrectly guessed letter
6) Go to step 2
For maximum effect, instead of calculating word frequencies of all letters in step 2, calculate the word frequencies of all letters in positions that are still blank in the target word.
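A sketch of one round of this in-dictionary strategy, assuming the candidate list has already been reduced to words of the target length; pick_letter and filter are made-up helper names, and the revealed pattern uses '.' for still-blank positions:

#include <algorithm>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Steps 2-3: score letters by word frequency over the remaining candidates and pick the best unguessed one.
char pick_letter(const std::vector<std::string>& candidates, const std::set<char>& guessed) {
    int count[26] = {0};
    for (const auto& w : candidates) {
        std::set<char> seen(w.begin(), w.end());          // word frequency, not letter frequency
        for (char c : seen)
            if (!guessed.count(c)) ++count[c - 'a'];
    }
    return char('a' + (std::max_element(count, count + 26) - count));
}

// Steps 4-5: after guessing 'last', keep only words consistent with the revealed pattern;
// on a miss, drop every word containing the bad letter.
void filter(std::vector<std::string>& candidates, const std::string& pattern, char last, bool hit) {
    candidates.erase(std::remove_if(candidates.begin(), candidates.end(),
        [&](const std::string& w) {
            if (!hit) return w.find(last) != std::string::npos;
            for (std::size_t i = 0; i < w.size(); ++i) {
                if (pattern[i] != '.' && pattern[i] != w[i]) return true;  // contradicts a revealed letter
                if (pattern[i] == '.' && w[i] == last) return true;        // that letter would have been revealed
            }
            return false;
        }), candidates.end());
}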
If we don't expect the word to be in our dictionary:
1) From the dictionary, build up a table of n-grams for some value of n (say 2). If you haven't come across n-grams before, they are groups of consecutive letters inside the word. For example, if the word is "word", the 2-grams are {^w,wo,or,rd,d$}, where ^ and $ mark the start and the end of the word. Count the word frequency of these 2-grams.
2) Start by guessing single letters by word frequency as above
3) Once we've had some hits, we can use the table of word frequency of n-grams to determine either letters to eliminate from our guesses, or letters that we're likely to be able to guess. There are a lot of ways you could achieve this:
For example, you could use 2-grams to determine that the blank in w_rd is probably not z. Or, you could determine that the character at the end of the word ___e_ might (say) be d or s.
Alternatively you could use the n-grams to generate the list of possible characters (though this might be expensive for long words). Remember that you can always cross off all n-grams that contain letters you've guessed that aren't in the target word.
Remember that at each step you're trying not to make a wrong guess, since that keeps us alive. If the n-grams tell you that one position is likely to only be (say) a,b or c, and your word frequency table tells you that a appears in 30% of words, but b and c only appear in 10%, then guess a.
For maximum benefit, you could combine the two strategies.
The strategy discussed is one suitable for humans to implement. Since you're writing an AI you can throw computing power at it to get a better result.
Take your word list, filter it down to only those words which match what information you have about the target word. (At the start that will only be the word length.) For each letter A through Z, note how many words contain it at least once (this is different from counting the occurrences of the letter). Pick the letter with the highest score.
You MIGHT even be able to run multiple cycles of this in computing a guess but that might prove too much even for modern CPUs.
Clarification: I'm saying that you might be able to run a look-ahead. If we choose "A" at this level what options does that present for the next level? This is an O(x^n) algorithm, obviously you can't go too far down that path.

Searching strings with . wildcard

I have an array with a lot of strings and want to search for a pattern in it.
The pattern can contain "." wildcards, each of which matches exactly one (arbitrary) character.
For example:
myset = {"bar", "foo", "cya", "test"}
find(myset, "f.o") -> returns true (matches with "foo")
find(myset, "foo.") -> returns false
find(myset, ".e.t") -> returns true (matches with "test")
find(myset, "cya") -> returns true (matches with "cya")
I tried to find a way to implement this algorithm fast because myset actually is a very big array, but none of my ideas has satisfactory complexity (for example O(size_of(myset) * length(pattern)))
Edit:
myset is a huge array; the words in it aren't long.
I can do slow preprocessing, but I'll have a lot of find() queries, so I want find() to be as fast as possible.
You could build a suffix tree of the corpus of all possible words in your set (see this link)
Using this data structure your complexity would include a one time cost of O(n) to build the tree, where n is the sum of the lengths of all your words.
Once the tree is built, finding whether a string matches should take just O(m), where m is the length of the string.
If the set is fixed, you could pre-calculate frequencies of a character c being at position p (for as many p values as you consider worth-while), then search through the array once, for each element testing characters at specific positions in an order such that you are most likely to exit early.
First, divide the corpus into sets per word length. Then your find algorithm can search over the appropriate set, since the input to find() always requires the match to have a specific length, and the algorithm can be designed to work well with all words of the same length.
Next (for each set), create a hash map from a hash of character x position to a list of matching words. It is quite OK to have a large number of hash collisions. You can use delta and run-length encoding to reduce the size of the list of matching words.
To search, pick the appropriate hash map for the find input length, and for each non . character, calculate the hash for that character x position, and AND together the lists of words, to get a much reduced list.
Brute force search through that much smaller list.
If you are sure that the lengths of the words in your set are not large, you could probably create a table which holds the following:
List of Words which have first Character 'a' , List of Words which have first Character 'b', ..
List of Words which have second Character 'a', List of words which have second Character 'b', ..
and so on.
When you are searching for a word, you can look at the list of words whose first character is the same as the search string's first character. With this refined list, look for the words whose second character is the same as the search string's second character, and so on. You can ignore '.' characters whenever you encounter them.
I understand that building the table may take a large amount of space but the time taken will come down significantly.
For example, if you have myset = {"bar", "foo", "cya", "test"} and you are searching for 'f.o'
The moment you check the list of words starting with f, you eliminate the rest of the set. Just an idea... hope it helps.
I had this same question, and I wasn't completely happy with most of the ideas/solutions I found on the internet. I think the "right" way to do this is to use a Directed Acyclic Word Graph. I didn't quite do that, but I added some additional logic to a Trie to get a similar effect.
See my isWord() implementation, analogous to your desired find() interface. It works by recursing down the Trie, branching on wildcard, and then collecting results back into a common set. (See findNodes().)
getMatchingWords() is similar in spirit, except that it returns the set of matching words, instead of just a boolean as to whether or not the query matches anything.
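For reference, a minimal trie with that kind of wildcard branching (a generic sketch rather than the linked implementation; the node layout and names are only illustrative):

#include <cstddef>
#include <memory>
#include <string>

struct TrieNode {
    std::unique_ptr<TrieNode> child[26];
    bool isWordEnd = false;
};

void insert(TrieNode& root, const std::string& word) {
    TrieNode* node = &root;
    for (char c : word) {
        auto& next = node->child[c - 'a'];
        if (!next) next = std::make_unique<TrieNode>();
        node = next.get();
    }
    node->isWordEnd = true;
}

// Recurse down the trie; a '.' branches into every existing child.
bool find(const TrieNode& node, const std::string& pattern, std::size_t i = 0) {
    if (i == pattern.size()) return node.isWordEnd;
    if (pattern[i] == '.') {
        for (int c = 0; c < 26; ++c)
            if (node.child[c] && find(*node.child[c], pattern, i + 1)) return true;
        return false;
    }
    const auto& next = node.child[pattern[i] - 'a'];
    return next && find(*next, pattern, i + 1);
}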

Efficient data structure for word lookup with wildcards

I need to match a series of user-entered words against a large dictionary of words (to ensure the entered value exists).
So if the user entered:
"orange" it should match an entry "orange' in the dictionary.
Now the catch is that the user can also enter a wildcard or series of wildcard characters like say
"or__ge" which would also match "orange"
The key requirements are:
* this should be as fast as possible.
* use the smallest amount of memory to achieve it.
If the size of the word list were small I could use a string containing all the words and use regular expressions.
However, given that the word list could potentially contain hundreds of thousands of entries, I'm assuming this wouldn't work.
So would some sort of 'tree' be the way to go for this...?
Any thoughts or suggestions on this would be totally appreciated!
Thanks in advance,
Matt
Put your word list in a DAWG (directed acyclic word graph) as described in Appel and Jacobsen's paper on the World's Fastest Scrabble Program (free copy at Columbia). For your search you will traverse this graph maintaining a set of pointers: on a letter, you make a deterministic transition to children with that letter; on a wildcard, you add all children to the set.
The efficiency will be roughly the same as Thompson's NFA interpretation for grep (they are the same algorithm). The DAWG structure is extremely space-efficient—far more so than just storing the words themselves. And it is easy to implement.
Worst-case cost will be the size of the alphabet (26?) raised to the power of the number of wildcards. But unless your query begins with N wildcards, a simple left-to-right search will work well in practice. I'd suggest forbidding a query to begin with too many wildcards, or else create multiple dawgs, e.g., dawg for mirror image, dawg for rotated left three characters, and so on.
Matching an arbitrary sequence of wildcards, e.g., ______ is always going to be expensive because there are combinatorially many solutions. The dawg will enumerate all solutions very quickly.
I would first test the regex solution and see whether it is fast enough - you might be surprised! :-)
However if that wasn't good enough I would probably use a prefix tree for this.
The basic structure is a tree where:
The nodes at the top level are all the possible first letters (i.e. probably 26 nodes from a-z assuming you are using a full dictionary...).
The next level down contains all the possible second letters for each given first letter
And so on until you reach an "end of word" marker for each word
Testing whether a given string with wildcards is contained in your dictionary is then just a simple recursive algorithm where you either have a direct match for each character position, or in the case of the wildcard you check each of the possible branches.
In the worst case (all wildcards but only one word with the right number of letters right at the end of the dictionary), you would traverse the entire tree but this is still only O(n) in the size of the dictionary so no worse than a full regex scan. In most cases it would take very few operations to either find a match or confirm that no such match exists since large branches of the search tree are "pruned" with each successive letter.
No matter which algorithm you choose, you have a tradeoff between speed and memory consumption.
If you can afford ~ O(N*L) memory (where N is the size of your dictionary and L is the average length of a word), you can try this very fast algorithm. For simplicity, we will assume a Latin alphabet with 26 letters and MAX_LEN as the maximum word length.
Create a 2D array of sets of integers, set<int> table[26][MAX_LEN].
For each word in you dictionary, add the word index to the sets in the positions corresponding to each of the letters of the word. For example, if "orange" is the 12345-th word in the dictionary, you add 12345 to the sets corresponding to [o][0], [r][1], [a][2], [n][3], [g][4], [e][5].
Then, to retrieve words corresponding to "or..ge", you find the intersection of the sets at [o][0], [r][1], [g][4], [e][5].
You can try a string-matrix:
0,1: A
1,5: APPLE
2,5: AXELS
3,5: EAGLE
4,5: HELLO
5,5: WORLD
6,6: ORANGE
7,8: LONGWORD
8,13:SUPERLONGWORD
Let's call this a ragged index-matrix, to spare some memory. Order it by length, and then alphabetically. To address a character I use the notation x,y:z: x is the index, y is the length of the entry, z is the position. The length of your search string is f and g is the number of entries in the dictionary.
Create list m, which contains potential match indexes x.
Iterate on z from 0 to f.
Is it a wildcard and not the last character of the search string?
Continue loop (all match).
Is m empty?
Search through all x from 0 to g for y that matches length. !!A!!
Does the character at z match the search string at that z? If so, save x in m.
Is m empty? Break loop (no match).
Is m not empty?
Search through all elements of m. !!B!!
Does an element not match the search string? Remove it from m.
Is m empty? Break loop (no match).
A wildcard will always pass the "match with search string?" test. And m is ordered the same way as the matrix.
!!A!!: Binary search on length of the search string. O(log n)
!!B!!: Binary search on alphabetical ordering. O(log n)
The reason for using a string-matrix is that you already store the length of each string (because it makes searching faster), but it also gives you the length of each entry (assuming the other fields are constant), so that you can easily find the next entry in the matrix for fast iterating. Ordering the matrix isn't a problem: it only has to be done when the dictionary updates, and not at search time.
If you are allowed to ignore case, which I assume, then make all the words in your dictionary and all the search terms the same case before anything else. Upper or lower case makes no difference. If you have some words that are case sensitive and others that are not, break the words into two groups and search each separately.
You are only matching words, so you can break the dictionary into an array of strings. Since you are only doing an exact match against a known length, break the word array into a separate array for each word length. So byLength[3] is the array of all words with length 3. Each word array should be sorted.
Now you have an array of words and a word with potential wildcards to find. Depending on whether and where the wildcards are, there are a few approaches.
If the search term has no wild cards, then do a binary search in your sorted array. You could do a hash at this point, which would be faster but not much. If the vast majority of your search terms have no wildcards, then consider a hash table or an associative array keyed by hash.
If the search term has wildcards after some literal characters, then do a binary search in the sorted array to find an upper and lower bound, then do a linear search within those bounds. If the wildcards are all trailing, then finding a non-empty range is sufficient.
If the search term starts with wild cards, then the sorted array is no help and you would need to do a linear search unless you keep a copy of the array sorted by backwards strings. If you make such an array, then choose it any time there are more trailing than leading literals. If you do not allow leading wildcards then there is no need.
If the search term both starts and ends with wildcards, then you are stuck with a linear search within the words with equal length.
So an array of arrays of strings. Each array of strings is sorted, and contains strings of equal length. Optionally duplicate the whole structure with the sorting based on backwards strings for the case of leading wildcards.
The overall space is one or two pointers per word, plus the words. You should be able to store all the words in a single buffer if your language permits. Of course, if your language does not permit, grep is probably faster anyway. For a million words, that is 4-16MB for the arrays and similar for the actual words.
For a search term with no wildcards, performance would be very good. With wildcards, there will occasionally be linear searches across large groups of words. With the breakdown by length and a single leading character, you should never need to search more than a few percent of the total dictionary even in the worst case. Comparing only whole words of known length will always be faster than generic string matching.
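A sketch of the "binary search on the literal prefix, then linear scan within the bounds" step; byLength and the '.'-wildcard convention are taken from the description above, and the function names are made up:

#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Does word match pattern of the same length, where '.' matches any character?
bool matches(const std::string& word, const std::string& pattern) {
    for (std::size_t i = 0; i < word.size(); ++i)
        if (pattern[i] != '.' && pattern[i] != word[i]) return false;
    return true;
}

// byLength[len] is a sorted array of all words of that length.
bool find(const std::vector<std::vector<std::string>>& byLength, const std::string& pattern) {
    if (pattern.size() >= byLength.size()) return false;
    const auto& words = byLength[pattern.size()];
    std::string prefix = pattern.substr(0, pattern.find('.'));   // literal characters before the first wildcard
    auto lo = std::lower_bound(words.begin(), words.end(), prefix);
    auto hi = prefix.empty() ? words.end()                       // leading wildcard: no useful bound
                             : std::upper_bound(words.begin(), words.end(), prefix + char('z' + 1));
    for (auto it = lo; it != hi; ++it)
        if (matches(*it, pattern)) return true;
    return false;
}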
Try building a generalized suffix tree if the dictionary will be matched by a sequence of queries. There is a linear-time algorithm that can be used to build such a tree (Ukkonen's suffix tree construction).
You can match each query in O(k), where k is the size of the query, by traversing from the root node, and use the wildcard character to match any character, as in typical pattern matching in a suffix tree.
