Efficiently compute permutations of a given set of "blocks" in a line - algorithm

I am working on an application where I have a number of blocks which should be positioned on a line. I.e. there are varying number of blocks, each with a different length which should be positioned on the line. There needs to be at least one empty element between blocks.
I would like to get all possible permutations of the blocks on the line efficiently.
For example I have a line of length 15 and would like to place blocks of 1, 6 and 1 size.
Order matters, i.e. in my example the 1-size blocks always should be left and right of the 6-size block.
Possible permutations are
X.XXXXXX.X.....
X..XXXXXX.X....
...
.....X.XXXXXX.X
How do I efficiently generate all possible permutations in a higher level language, e.g. Java?

One way to do this is to approach it recursively:
If the minimum total length required to store all the blocks with exactly one space in-between them exceeds the available space, there are no ways to place the blocks.
Otherwise, if you have no blocks to place, then the only way to place the blocks is to leave all squares unfilled.
Otherwise, there are two options. First, you could place the first block at the first position in the row, then recursively place the remaining blocks in the remaining space within the row after first leaving one extra blank space at the start of the row. Second, you could leave the first space in the row blank, then recursively place the same set of blocks in the remaining space in the row. Trying out both options and combining the results back together should give you the answer you're looking for.
Translating this recursive logic into actual Java should not be too difficult. The code below is designed for readability and can be optimized a bit:
public List<String> allBlockOrderings(int rowLength, List<Integer> blockSizes) {
/* Case 1: Not enough space left. */
if (spaceNeededFor(blockSizes) > rowLength)) return Collections.EMPTY_LIST;
List<String> result = new ArrayList<String>();
/* Case 2: Nothing to place. */
if (blockSizes.isEmpty()) {
result.add(stringOf('.', rowLength));
} else {
/* Case 3a: place the very first block at the beginning of the row. */
List<String> placeFirst = allBlockOrderings(rowLength - blockSizes.get(0) - 1,
blockSizes.subList(1, blockSizes.length()));
for (String rest: placeFirst) {
result.add(stringOf('X', blockSizes.get(0)) + rest);
}
/* Case 3b: leave the very first spot open. */
List<String> skipFirst = allBlockOrderings(rowLength - 1, blockSizes);
for (String rest: skipFirst) {
result.add('.' + rest);
}
}
return result;
}
You'll need to implement the spaceNeededFor method, which returns the length of the shortest row that could possibly hold a given list of blocks, and the stringOf method, which takes in a character and a number, then returns a string of that many copies of the given character.
Hope this helps!

To me it seems more easy to think about the problem in another way:
We have fixed blocks in a fixed order, separated by dots. We can create all permutations by distributing the remaining dots over the allowed positions.
The length of this fixed part of the line is:
fixed_len = length_of_all_blocks + number_of_blocks - 1
The number of remaining dots is
free_dots = length_of_line - fixed_len.
The number of open positions is
pos_count = number_of_blocks + 1
Now we have to find all permutations of how to put free_dots into pos_count.

It's quite hard to determine what an "efficient implementation" is since the output can be very large and therefore even a fast implementation won't be fast enough.
I'd use technics of dynamic programming and recursion for such task. The recursive fuoction should take two parameters - list of unused numbers and remaining length of the row. Inside it will be a simple loop. You should store the results you already know. I'm sure you can handle the details by yourself. Edit : Our friend has already done that for you :-).
By the way, what is the goal of such task? It remainds me about the pictures in a grid where you have such numbers for every row and column and you need to decode the picture. There are better ways to solve such problem.

Related

Edit Distance Data Structure

I am new to data structure want to know the flow of this diagram as mentioned ,it's for calculating minimum edit distance between two string ,in the graph i understood that String 1 is of three length and String 2 is also of three length , so tutorial shown graph from eD(3,3) then why the graph split again in eD(3,2),eD(2,3),eD(2,2) for the 2 level of recursion . What it signifies ? Please need detail explanation . Why we can't split level 2 ,like this eD(3,2),eD(2,3).
I am following this Url : https://www.geeksforgeeks.org/dynamic-programming-set-5-edit-distance/
enter image description here
Ok, so basically in case of Edit Distance, we are trying to either insert, update or delete an element. So, our basic approach is to try all three of these available operations at each point and check which case gives the best result.
Specific to the case that you are trying to understand, the following is the scenario:
eD(3,3) = eD(2,2) if str1[3] == str2[3] # Without incuring any cost, we find edit distance of the remaining strings.
else l + min(del, ins, rep)
where
del = eD(2,3), Deleted the last character of str1, and finding the edit distance of the remaining strings.
ins = eD(3,2), Inserted the last element of str2 in str1, thus now we are finding the edit distance between the remaining strings. E.g. 'adc' and 'axf', if we add 'f' to the first string, it will become 'adcf'. Thus now, the last characters of both strings is same, I have to eventually find the edit distance between 'adc' and 'ax', Thus it becomes eD(3,2).
rep = eD(2,2), Replaced the last element of str1 with the last element of str2, and finding the edit distance of the remaining strings. E.g. 'abc' and adf, if we replace the last character of first string with the last character of second string, we will get 'abf' and 'adf', as now the last characters of both the strings are same, eventually we are finding the edit distance between 'ab' and 'ad'. Thus, eD(2,2)

How to split a string into words. Ex: "stringintowords" -> "String Into Words"?

What is the right way to split a string into words ?
(string doesn't contain any spaces or punctuation marks)
For example: "stringintowords" -> "String Into Words"
Could you please advise what algorithm should be used here ?
! Update: For those who think this question is just for curiosity. This algorithm could be used to camеlcase domain names ("sportandfishing .com" -> "SportAndFishing .com") and this algo is currently used by aboutus dot org to do this conversion dynamically.
Let's assume that you have a function isWord(w), which checks if w is a word using a dictionary. Let's for simplicity also assume for now that you only want to know whether for some word w such a splitting is possible. This can be easily done with dynamic programming.
Let S[1..length(w)] be a table with Boolean entries. S[i] is true if the word w[1..i] can be split. Then set S[1] = isWord(w[1]) and for i=2 to length(w) calculate
S[i] = (isWord[w[1..i] or for any j in {2..i}: S[j-1] and isWord[j..i]).
This takes O(length(w)^2) time, if dictionary queries are constant time. To actually find the splitting, just store the winning split in each S[i] that is set to true. This can also be adapted to enumerate all solution by storing all such splits.
As mentioned by many people here, this is a standard, easy dynamic programming problem: the best solution is given by Falk Hüffner. Additional info though:
(a) you should consider implementing isWord with a trie, which will save you a lot of time if you use properly (that is by incrementally testing for words).
(b) typing "segmentation dynamic programming" yields a score of more detail answers, from university level lectures with pseudo-code algorithm, such as this lecture at Duke's (which even goes so far as to provide a simple probabilistic approach to deal with what to do when you have words that won't be contained in any dictionary).
There should be a fair bit in the academic literature on this. The key words you want to search for are word segmentation. This paper looks promising, for example.
In general, you'll probably want to learn about markov models and the viterbi algorithm. The latter is a dynamic programming algorithm that may allow you to find plausible segmentations for a string without exhaustively testing every possible segmentation. The essential insight here is that if you have n possible segmentations for the first m characters, and you only want to find the most likely segmentation, you don't need to evaluate every one of these against subsequent characters - you only need to continue evaluating the most likely one.
If you want to ensure that you get this right, you'll have to use a dictionary based approach and it'll be horrendously inefficient. You'll also have to expect to receive multiple results from your algorithm.
For example: windowsteamblog (of http://windowsteamblog.com/ fame)
windows team blog
window steam blog
Consider the sheer number of possible splittings for a given string. If you have n characters in the string, there are n-1 possible places to split. For example, for the string cat, you can split before the a and you can split before the t. This results in 4 possible splittings.
You could look at this problem as choosing where you need to split the string. You also need to choose how many splits there will be. So there are Sum(i = 0 to n - 1, n - 1 choose i) possible splittings. By the Binomial Coefficient Theorem, with x and y both being 1, this is equal to pow(2, n-1).
Granted, a lot of this computation rests on common subproblems, so Dynamic Programming might speed up your algorithm. Off the top of my head, computing a boolean matrix M such M[i,j] is true if and only if the substring of your given string from i to j is a word would help out quite a bit. You still have an exponential number of possible segmentations, but you would quickly be able to eliminate a segmentation if an early split did not form a word. A solution would then be a sequence of integers (i0, j0, i1, j1, ...) with the condition that j sub k = i sub (k + 1).
If your goal is correctly camel case URL's, I would sidestep the problem and go for something a little more direct: Get the homepage for the URL, remove any spaces and capitalization from the source HTML, and search for your string. If there is a match, find that section in the original HTML and return it. You'd need an array of NumSpaces that declares how much whitespace occurs in the original string like so:
Needle: isashort
Haystack: This is a short phrase
Preprocessed: thisisashortphrase
NumSpaces : 000011233333444444
And your answer would come from:
location = prepocessed.Search(Needle)
locationInOriginal = location + NumSpaces[location]
originalLength = Needle.length() + NumSpaces[location + needle.length()] - NumSpaces[location]
Haystack.substring(locationInOriginal, originalLength)
Of course, this would break if madduckets.com did not have "Mad Duckets" somewhere on the home page. Alas, that is the price you pay for avoiding an exponential problem.
This can be actually done (to a certain degree) without dictionary. Essentially, this is an unsupervised word segmentation problem. You need to collect a large list of domain names, apply an unsupervised segmentation learning algorithm (e.g. Morfessor) and apply the learned model for new domain names. I'm not sure how well it would work, though (but it would be interesting).
This is basically a variation of a knapsack problem, so what you need is a comprehensive list of words and any of the solutions covered in Wiki.
With fairly-sized dictionary this is going to be insanely resource-intensive and lengthy operation, and you cannot even be sure that this problem will be solved.
Create a list of possible words, sort it from long words to short words.
Check if each entry in the list against the first part of the string. If it equals, remove this and append it at your sentence with a space. Repeat this.
A simple Java solution which has O(n^2) running time.
public class Solution {
// should contain the list of all words, or you can use any other data structure (e.g. a Trie)
private HashSet<String> dictionary;
public String parse(String s) {
return parse(s, new HashMap<String, String>());
}
public String parse(String s, HashMap<String, String> map) {
if (map.containsKey(s)) {
return map.get(s);
}
if (dictionary.contains(s)) {
return s;
}
for (int left = 1; left < s.length(); left++) {
String leftSub = s.substring(0, left);
if (!dictionary.contains(leftSub)) {
continue;
}
String rightSub = s.substring(left);
String rightParsed = parse(rightSub, map);
if (rightParsed != null) {
String parsed = leftSub + " " + rightParsed;
map.put(s, parsed);
return parsed;
}
}
map.put(s, null);
return null;
}
}
I was looking at the problem and thought maybe I could share how I did it.
It's a little too hard to explain my algorithm in words so maybe I could share my optimized solution in pseudocode:
string mainword = "stringintowords";
array substrings = get_all_substrings(mainword);
/** this way, one does not check the dictionary to check for word validity
* on every substring; It would only be queried once and for all,
* eliminating multiple travels to the data storage
*/
string query = "select word from dictionary where word in " + substrings;
array validwords = execute(query).getArray();
validwords = validwords.sort(length, desc);
array segments = [];
while(mainword != ""){
for(x = 0; x < validwords.length; x++){
if(mainword.startswith(validwords[x])) {
segments.push(validwords[x]);
mainword = mainword.remove(v);
x = 0;
}
}
/**
* remove the first character if any of valid words do not match, then start again
* you may need to add the first character to the result if you want to
*/
mainword = mainword.substring(1);
}
string result = segments.join(" ");

How can I detect common substrings in a list of strings

Given a set of strings, for example:
EFgreen
EFgrey
EntireS1
EntireS2
J27RedP1
J27GreenP1
J27RedP2
J27GreenP2
JournalP1Black
JournalP1Blue
JournalP1Green
JournalP1Red
JournalP2Black
JournalP2Blue
JournalP2Green
I want to be able to detect that these are three sets of files:
EntireS[1,2]
J27[Red,Green]P[1,2]
JournalP[1,2][Red,Green,Blue]
Are there any known ways of approaching this problem - any published papers I can read on this?
The approach I am considering is for each string look at all other strings and find the common characters and where differing characters are, trying to find sets of strings that have the most in common, but I fear that this is not very efficient and may give false positives.
Note that this is not the same as 'How do I detect groups of common strings in filenames' because that assumes that a string will always have a series of digits following it.
I would start here: http://en.wikipedia.org/wiki/Longest_common_substring_problem
There are links to supplemental information in the external links, including Perl implementations of the two algorithms explained in the article.
Edited to add:
Based on the discussion, I still think Longest Common Substring could be at the heart of this problem. Even in the Journal example you reference in your comment, the defining characteristic of that set is the substring 'Journal'.
I would first consider what defines a set as separate from the other sets. That gives you your partition to divide up the data, and then the problem is in measuring how much commonality exists within a set. If the defining characteristic is a common substring, then Longest Common Substring would be a logical starting point.
To automate the process of set detection, in general, you will need a pairwise measure of commonality which you can use to measure the 'difference' between all possible pairs. Then you need an algorithm to compute the partition that results in the overall lowest total difference. If the difference measure is not Longest Common Substring, that's fine, but then you need to determine what it will be. Obviously it needs to be something concrete that you can measure.
Bear in mind also that the properties of your difference measurement will bear on the algorithms that can be used to make the partition. For example, assume diff(X,Y) gives the measure of difference between X and Y. Then it would probably be useful if your measure of distance was such that diff(A,C) <= diff(A,B) + diff(B,C). And obviously diff(A,C) should be the same as diff(C,A).
In thinking about this, I also begin to wonder whether we could conceive of the 'difference' as a distance between any two strings, and, with a rigorous definition of the distance, could we then attempt some kind of cluster analysis on the input strings. Just a thought.
Great question! The steps to a solution are:
tokenizing input
using tokens to build an appropriate data structure. a DAWG is ideal, but a Trie is simpler and a decent starting point.
optional post-processing of the data structure for simplification or clustering of subtrees into separate outputs
serialization of the data structure to a regular expression or similar syntax
I've implemented this approach in regroup.py. Here's an example:
$ cat | ./regroup.py --cluster-prefix-len=2
EFgreen
EFgrey
EntireS1
EntireS2
J27RedP1
J27GreenP1
J27RedP2
J27GreenP2
JournalP1Black
JournalP1Blue
JournalP1Green
JournalP1Red
JournalP2Black
JournalP2Blue
JournalP2Green
^D
EFgre(en|y)
EntireS[12]
J27(Green|Red)P[12]
JournalP[12](Bl(ack|ue)|(Green|Red))
Something like that might work.
Build a trie that represents all your strings.
In the example you gave, there would be two edges from the root: "E" and "J". The "J" branch would then split into "Jo" and "J2".
A single strand that forks, e.g. E-n-t-i-r-e-S-(forks to 1, 2) indicates a choice, so that would be EntireS[1,2]
If the strand is "too short" in relation to the fork, e.g. B-A-(forks to N-A-N-A and H-A-M-A-S), we list two words ("banana, bahamas") rather than a choice ("ba[nana,hamas]"). "Too short" might be as simple as "if the part after the fork is longer than the part before", or maybe weighted by the number of words that have a given prefix.
If two subtrees are "sufficiently similar" then they can be merged so that instead of a tree, you now have a general graph. For example if you have ABRed,ABBlue,ABGreen,CDRed,CDBlue,CDGreen, you may find that the subtree rooted at "AB" is the same as the subtree rooted at "CD", so you'd merge them. In your output this will look like this: [left branch, right branch][subtree], so: [AB,CD][Red,Blue,Green]. How to deal with subtrees that are close but not exactly the same? There's probably no absolute answer but someone here may have a good idea.
I'm marking this answer community wiki. Please feel free to extend it so that, together, we may have a reasonable answer to the question.
try "frak" . It creates regex expression from set of strings. Maybe some modification of it will help you.
https://github.com/noprompt/frak
Hope it helps.
There are many many approaches to string similarity. I would suggest taking a look at this open-source library that implements a lot of metrics like Levenshtein distance.
http://sourceforge.net/projects/simmetrics/
You should be able to achieve this with generalized suffix trees: look for long paths in the suffix tree which come from multiple source strings.
There are many solutions proposed that solve the general case of finding common substrings. However, the problem here is more specialized. You're looking for common prefixes, not just substrings. This makes it a little simpler.
A nice explanation for finding longest common prefix can be found at
http://www.geeksforgeeks.org/longest-common-prefix-set-1-word-by-word-matching/
So my proposed "pythonese" pseudo-code is something like this (refer to the link for an implementation of find_lcp:
def count_groups(items):
sorted_list = sorted(items)
prefix = sorted_list[0]
ctr = 0
groups = {}
saved_common_prefix = ""
for i in range(1, sorted_list):
common_prefix = find_lcp(prefix, sorted_list[i])
if len(common_prefix) > 0: #we are still in the same group of items
ctr += 1
saved_common_prefix = common_prefix
prefix = common_prefix
else: # we must have switched to a new group
groups[saved_common_prefix] = ctr
ctr = 0
saved_common_prefix = ""
prefix = sorted_list[i]
return groups
For this particular example of strings to keep it extremely simple consider using simple word/digit -separation.
A non-digit sequence apparently can begin with capital letter (Entire). After breaking all strings into groups of sequences, something like
[Entire][S][1]
[Entire][S][2]
[J][27][Red][P][1]
[J][27][Green][P][1]
[J][27][Red][P][2]
....
[Journal][P][1][Blue]
[Journal][P][1][Green]
Then start grouping by groups, you can fairly soon see that prefix "Entire" is a common for some group and that all subgroups have S as headgroup, so only variable for those is 1,2.
For J27 case you can see that J27 is only leaf, but that it then branches at Red and Green.
So somekind of List<Pair<list, string>> -structure (composite pattern if I recall correctly).
import java.util.*;
class StringProblem
{
public List<String> subString(String name)
{
List<String> list=new ArrayList<String>();
for(int i=0;i<=name.length();i++)
{
for(int j=i+1;j<=name.length();j++)
{
String s=name.substring(i,j);
list.add(s);
}
}
return list;
}
public String commonString(List<String> list1,List<String> list2,List<String> list3)
{
list2.retainAll(list1);
list3.retainAll(list2);
Iterator it=list3.iterator();
String s="";
int length=0;
System.out.println(list3); // 1 1 2 3 1 2 1
while(it.hasNext())
{
if((s=it.next().toString()).length()>length)
{
length=s.length();
}
}
return s;
}
public static void main(String args[])
{
Scanner sc=new Scanner(System.in);
System.out.println("Enter the String1:");
String name1=sc.nextLine();
System.out.println("Enter the String2:");
String name2=sc.nextLine();
System.out.println("Enter the String3:");
String name3=sc.nextLine();
// String name1="salman";
// String name2="manmohan";
// String name3="rahman";
StringProblem sp=new StringProblem();
List<String> list1=new ArrayList<String>();
list1=sp.subString(name1);
List<String> list2=new ArrayList<String>();
list2=sp.subString(name2);
List<String> list3=new ArrayList<String>();
list3=sp.subString(name3);
sp.commonString(list1,list2,list3);
System.out.println(" "+sp.commonString(list1,list2,list3));
}
}

Word-separating algorithm

What is the algorithm - seemingly in use on domain parking pages - that takes a spaceless bunch of words (eg "thecarrotofcuriosity") and more-or-less correctly breaks it down into the constituent words (eg "the carrot of curiosity") ?
Start with a basic Trie data structure representing your dictionary. As you iterate through the characters of the the string, search your way through the trie with a set of pointers rather than a single pointer - the set is seeded with the root of the trie. For each letter, the whole set is advanced at once via the pointer indicated by the letter, and if a set element cannot be advanced by the letter, it is removed from the set. Whenever you reach a possible end-of-word, add a new root-of-trie to the set (keeping track of the list of words seen associated with that set element). Finally, once all characters have been processed, return an arbitrary list of words which is at the root-of-trie. If there's more than one, that means the string could be broken up in multiple ways (such as "therapistforum" which can be parsed as ["therapist", "forum"] or ["the", "rapist", "forum"]) and it's undefined which we'll return.
Or, in a wacked up pseudocode (Java foreach, tuple indicated with parens, set indicated with braces, cons using head :: tail, [] is the empty list):
List<String> breakUp(String str, Trie root) {
Set<(List<String>, Trie)> set = {([], root)};
for (char c : str) {
Set<(List<String>, Trie)> newSet = {};
for (List<String> ls, Trie t : set) {
Trie tNext = t.follow(c);
if (tNext != null) {
newSet.add((ls, tNext));
if (tNext.isWord()) {
newSet.add((t.follow(c).getWord() :: ls, root));
}
}
}
set = newSet;
}
for (List<String> ls, Trie t : set) {
if (t == root) return ls;
}
return null;
}
Let me know if I need to clarify or I missed something...
I would imagine they take a dictionary word list like /usr/share/dict/words on your common or garden variety Unix system and try to find sets of word matches (starting from the left?) that result in the largest amount of original text being covered by a match. A simple breadth-first-search implementation would probably work fine, since it obviously doesn't have to run fast.
I'd imaging these sites do it similar to this:
Get a list of word for your target language
Remove "useless" words like "a", "the", ...
Run through the list and check which of the words are substrings of the domain name
Take the most common words of the remaining list (Or the ones with the highest adsense rating,...)
Of course that leads to nonsense for expertsexchange, but what else would you expect there...
(disclaimer: I did not try it myself, so take it merely as a food for experimentation. 4-grams are taken mostly out of the blue sky, just from my experience that 3-grams won't work all too well; 5-grams and more might work better, even though you will have to deal with a pretty large table). It's also simplistic in a sense that it does not take into the account the ending of the string - if it works for you otherwise, you'd probably need to think about fixing the endings.
This algorithm would run in a predictable time proportional to the length of the string that you are trying to split.
So, first: Take a lot of human-readable texts. for each of the text, supposing it is in a single string str, run the following algorithm (pseudocode-ish notation, assumes the [] is a hashtable-like indexing, and that nonexistent indexes return '0'):
for(i=0;i<length(s)-5;i++) {
// take 4-character substring starting at position i
subs2 = substring(str, i, 4);
if(has_space(subs2)) {
subs = substring(str, i, 5);
delete_space(subs);
yes_space[subs][position(space, subs2)]++;
} else {
subs = subs2;
no_space[subs]++;
}
}
This will build you the tables which will help to decide whether a given 4-gram would need to have a space in it inserted or not.
Then, take your string to split, I denote it as xstr, and do:
for(i=0;i<length(xstr)-5;i++) {
subs = substring(xstr, i, 4);
for(j=0;j<4;j++) {
do_insert_space_here[i+j] -= no_space[subs];
}
for(j=0;j<4;j++) {
do_insert_space_here[i+j] += yes_space[subs][j];
}
}
Then you can walk the "do_insert_space_here[]" array - if an element at a given position is bigger than 0, then you should insert a space in that position in the original string. If it's less than zero, then you shouldn't.
Please drop a note here if you try it (or something of this sort) and it works (or does not work) for you :-)

Good algorithm and data structure for looking up words with missing letters?

I need to write an efficient algorithm for looking up words with missing letters in a dictionary and I want the set of possible words.
For example, if I have th??e, I might get back "these", "those", "theme:, "there", etc.
There will be up to TWO question marks and when two question marks do occur, they will occur in sequence.
I was wondering if anyone can suggest some data structures or algorithm I should use.
A Trie is too space-inefficient and would make it too slow. Any other ideas modifications?
Currently I am using 3 hash tables for when it is an exact match, 1 question mark, and 2 question marks.
Given a dictionary I hash all the possible words. For example, if I have the word WORD. I hash WORD, ?ORD, W?RD, WO?D, WOR?, ??RD, W??D, and WO?? into the dictionary. Then I use a link list to link the collisions together. So say hash(W?RD) = hash(STR?NG) = 17. hashtab(17) will point to WORD and WORD points to STRING because it is a linked list.
The timing on average lookup of one word is about 2e-6s. I am looking to do better, preferably on the order of 1e-9. It took 0.5 seconds for 3m entries insertions and it took 4 seconds for 3m entries lookup.
I believe in this case it is best to just use a flat file where each word stands in one line. With this you can conveniently use the power of a regular expression search, which is highly optimized and will probably beat any data structure you can devise yourself for this problem.
Solution #1: Using Regex
This is working Ruby code for this problem:
def query(str, data)
r = Regexp.new("^#{str.gsub("?", ".")}$")
idx = 0
begin
idx = data.index(r, idx)
if idx
yield data[idx, str.size]
idx += str.size + 1
end
end while idx
end
start_time = Time.now
query("?r?te", File.read("wordlist.txt")) do |w|
puts w
end
puts Time.now - start_time
The file wordlist.txt contains 45425 words (downloadable here). The program's output for query ?r?te is:
brute
crate
Crete
grate
irate
prate
write
wrote
0.013689
So it takes just 37 milliseconds to both read the whole file and to find all matches in it. And it scales very well for all kinds of query patterns, even where a Trie is very slow:
query ????????????????e
counterproductive
indistinguishable
microarchitecture
microprogrammable
0.018681
query ?h?a?r?c?l?
theatricals
0.013608
This looks fast enough for me.
Solution #2: Regex with Prepared Data
If you want to go even faster, you can split the wordlist into strings that contain words of equal lengths and just search the correct one based on your query length. Replace the last 5 lines with this code:
def query_split(str, data)
query(str, data[str.length]) do |w|
yield w
end
end
# prepare data
data = Hash.new("")
File.read("wordlist.txt").each_line do |w|
data[w.length-1] += w
end
# use prepared data for query
start_time = Time.now
query_split("?r?te", data) do |w|
puts w
end
puts Time.now - start_time
Building the data structure takes now about 0.4 second, but all queries are about 10 times faster (depending on the number of words with that length):
?r?te 0.001112 sec
?h?a?r?c?l? 0.000852 sec
????????????????e 0.000169 sec
Solution #3: One Big Hashtable (Updated Requirements)
Since you have changed your requirements, you can easily expand on your idea to use just one big hashtable that contains all precalculated results. But instead of working around collisions yourself you could rely on the performance of a properly implemented hashtable.
Here I create one big hashtable, where each possible query maps to a list of its results:
def create_big_hash(data)
h = Hash.new do |h,k|
h[k] = Array.new
end
data.each_line do |l|
w = l.strip
# add all words with one ?
w.length.times do |i|
q = String.new(w)
q[i] = "?"
h[q].push w
end
# add all words with two ??
(w.length-1).times do |i|
q = String.new(w)
q[i, 2] = "??"
h[q].push w
end
end
h
end
# prepare data
t = Time.new
h = create_big_hash(File.read("wordlist.txt"))
puts "#{Time.new - t} sec preparing data\n#{h.size} entries in big hash"
# use prepared data for query
t = Time.new
h["?ood"].each do |w|
puts w
end
puts (Time.new - t)
Output is
4.960255 sec preparing data
616745 entries in big hash
food
good
hood
mood
wood
2.0e-05
The query performance is O(1), it is just a lookup in the hashtable. The time 2.0e-05 is probably below the timer's precision. When running it 1000 times, I get an average of 1.958e-6 seconds per query. To get it faster, I would switch to C++ and use the Google Sparse Hash which is extremely memory efficient, and fast.
Solution #4: Get Really Serious
All above solutions work and should be good enough for many use cases. If you really want to get serious and have lots of spare time on your hands, read some good papers:
Tries for Approximate String Matching - If well implemented, tries can have very compact memory requirements (50% less space than the dictionary itself), and are very fast.
Agrep - A Fast Approximate Pattern-Matching Tool - Agrep is based on a new efficient and flexible algorithm for approximate string matching.
Google Scholar search for approximate string matching - More than enough to read on this topic.
Given the current limitations:
There will be up to 2 question marks
When there are 2 question marks, they appear together
There are ~100,000 words in the dictionary, average word length is 6.
I have two viable solutions for you:
The fast solution: HASH
You can use a hash which keys are your words with up to two '?', and the values are a list of fitting words. This hash will have around 100,000 + 100,000*6 + 100,000*5 = 1,200,000 entries (if you have 2 question marks, you just need to find the place of the first one...). Each entry can save a list of words, or a list of pointers to the existing words. If you save a list of pointers, and we assume that there are on average less than 20 words matching each word with two '?', then the additional memory is less than 20 * 1,200,000 = 24,000,000.
If each pointer size is 4 bytes, then the memory requirement here is (24,000,000+1,200,000)*4 bytes = 100,800,000 bytes ~= 96 mega bytes.
To sum up this solution:
Memory Consumption: ~96 MB
Time for each search: calculating a hash function, and following a pointer. O(1)
Note: if you want to use a hash of a smaller size, you can, but then it is better to save a balanced search tree in each entry instead of a linked list, for better performance.
The space savvy, but still very fast solution: TRIE variation
This solution uses the following observation:
If the '?' signs were at the end of the word, trie would be an excellent solution.
The search in the trie would search at the length of the word, and for the last couple of letters, a DFS traversal would bring all of the endings.
Very fast, and very memory-savvy solution.
So lets use this observation, in order to build something to work exactly like this.
You can think about every word you have in the dictionary, as a word ending with # (or any other symbol that does not exist in your dictionary).
So the word 'space' would be 'space#'.
Now, if you rotate each of the words, with the '#' sign, you get the following:
space#, pace#s, ace#sp, *ce#spa*, e#spac
(no # as first letter).
If you insert all of these variations into a TRIE, you can easily find the word you are seeking at the length of the word, by 'rotating' your word.
Example:
You want to find all words that fit 's??ce' (one of them is space, another is slice).
You build the word: s??ce#, and rotate it so that the ? sign is in the end. i.e. 'ce#s??'
All of the rotation variations exist inside the trie, and specifically 'ce#spa' (marked with * above). After the beginning is found - you need to go over all of the continuations in the appropriate length, and save them. Then, you need to rotate them again so that the # is the last letter, and walla - you have all of the words you were looking for!
To sum up this solution:
Memory Consumption:
For each word, all of its rotations appear in the trie. On average, *6 of the memory size is saved in the trie. The trie size is around *3 (just guessing...) of the space saved inside it. So the total space necessary for this trie is 6*3*100,000 = 1,800,000 words ~= 6.8 mega bytes.
Time for each search:
rotating the word: O(word length)
seeking the beginning in the trie: O(word length)
going over all of the endings: O(number of matches)
rotating the endings: O(total length of answers)
To sum up, it is very very fast, and depends on the word length * small constant.
To sum up...
The second choice has a great time/space complexity, and would be the best option for you to use. There are a few problems with the second solution (in which case you might want to use the first solution):
More complex to implement. I'm not sure whether there are programming languages with tries built-in out of the box. If there isn't - it means that you'll need to implement it yourself...
Does not scale well. If tomorrow you decide that you need your question marks spread all over the word, and not necessarily joined together, you'll need to think hard of how to fit the second solution to it. In the case of the first solution - it is quite easy to generalize.
To me this problem sounds like a good fit for a Trie data structure. Enter the entire dictionary into your trie, and then look up the word. For a missing letter you would have to try all sub-tries, which should be relatively easy to do with a recursive approach.
EDIT: I wrote a simple implementation of this in Ruby just now: http://gist.github.com/262667.
Directed Acyclic Word Graph would be perfect data structure for this problem. It combines efficiency of a trie (trie can be seen as a special case of DAWG), but is much more space efficient. Typical DAWG will take fraction of size that plain text file with words would take.
Enumerating words that meet specific conditions is simple and the same as in trie - you have to traverse graph in depth-first fashion.
Anna's second solution is the inspiration for this one.
First, load all the words into memory and divide the dictionary into sections based on word length.
For each length, make n copies of an array of pointers to the words. Sort each array so that the strings appear in order when rotated by a certain number of letters. For example, suppose the original list of 5-letter words is [plane, apple, space, train, happy, stack, hacks]. Then your five arrays of pointers will be:
rotated by 0 letters: [apple, hacks, happy, plane, space, stack, train]
rotated by 1 letter: [hacks, happy, plane, space, apple, train, stack]
rotated by 2 letters: [space, stack, train, plane, hacks, apple, happy]
rotated by 3 letters: [space, stack, train, hacks, apple, plane, happy]
rotated by 4 letters: [apple, plane, space, stack, train, hacks, happy]
(Instead of pointers, you can use integers identifying the words, if that saves space on your platform.)
To search, just ask how much you would have to rotate the pattern so that the question marks appear at the end. Then you can binary search in the appropriate list.
If you need to find matches for ??ppy, you would have to rotate that by 2 to make ppy??. So look in the array that is in order when rotated by 2 letters. A quick binary search finds that "happy" is the only match.
If you need to find matches for th??g, you would have to rotate that by 4 to make gth??. So look in array 4, where a binary search finds that there are no matches.
This works no matter how many question marks there are, as long as they all appear together.
Space required in addition to the dictionary itself: For words of length N, this requires space for (N times the number of words of length N) pointers or integers.
Time per lookup: O(log n) where n is the number of words of the appropriate length.
Implementation in Python:
import bisect
class Matcher:
def __init__(self, words):
# Sort the words into bins by length.
bins = []
for w in words:
while len(bins) <= len(w):
bins.append([])
bins[len(w)].append(w)
# Make n copies of each list, sorted by rotations.
for n in range(len(bins)):
bins[n] = [sorted(bins[n], key=lambda w: w[i:]+w[:i]) for i in range(n)]
self.bins = bins
def find(self, pattern):
bins = self.bins
if len(pattern) >= len(bins):
return []
# Figure out which array to search.
r = (pattern.rindex('?') + 1) % len(pattern)
rpat = (pattern[r:] + pattern[:r]).rstrip('?')
if '?' in rpat:
raise ValueError("non-adjacent wildcards in pattern: " + repr(pattern))
a = bins[len(pattern)][r]
# Binary-search the array.
class RotatedArray:
def __len__(self):
return len(a)
def __getitem__(self, i):
word = a[i]
return word[r:] + word[:r]
ra = RotatedArray()
start = bisect.bisect(ra, rpat)
stop = bisect.bisect(ra, rpat[:-1] + chr(ord(rpat[-1]) + 1))
# Return the matches.
return a[start:stop]
words = open('/usr/share/dict/words', 'r').read().split()
print "Building matcher..."
m = Matcher(words) # takes 1-2 seconds, for me
print "Done."
print m.find("st??k")
print m.find("ov???low")
On my computer, the system dictionary is 909KB big and this program uses about 3.2MB of memory in addition to what it takes just to store the words (pointers are 4 bytes). For this dictionary, you could cut that in half by using 2-byte integers instead of pointers, because there are fewer than 216 words of each length.
Measurements: On my machine, m.find("st??k") runs in 0.000032 seconds, m.find("ov???low") in 0.000034 seconds, and m.find("????????????????e") in 0.000023 seconds.
By writing out the binary search instead of using class RotatedArray and the bisect library, I got those first two numbers down to 0.000016 seconds: twice as fast. Implementing this in C++ would make it faster still.
First we need a way to compare the query string with a given entry. Let's assume a function using regexes: matches(query,trialstr).
An O(n) algorithm would be to simply run through every list item (your dictionary would be represented as a list in the program), comparing each to your query string.
With a bit of pre-calculation, you could improve on this for large numbers of queries by building an additional list of words for each letter, so your dictionary might look like:
wordsbyletter = { 'a' : ['aardvark', 'abacus', ... ],
'b' : ['bat', 'bar', ...],
.... }
However, this would be of limited use, particularly if your query string starts with an unknown character. So we can do even better by noting where in a given word a particular letter lies, generating:
wordsmap = { 'a':{ 0:['aardvark', 'abacus'],
1:['bat','bar']
2:['abacus']},
'b':{ 0:['bat','bar'],
1:['abacus']},
....
}
As you can see, without using indices, you will end up hugely increasing the amount of required storage space - specifically a dictionary of n words and average length m will require nm2 of storage. However, you could very quickly now do your look up to get all the words from each set that can match.
The final optimisation (which you could use off the bat on the naive approach) is to also separate all the words of the same length into separate stores, since you always know how long the word is.
This version would be O(kx) where k is the number of known letters in the query word, and x=x(n) is the time to look up a single item in a dictionary of length n in your implementation (usually log(n).
So with a final dictionary like:
allmap = {
3 : {
'a' : {
1 : ['ant','all'],
2 : ['bar','pat']
}
'b' : {
1 : ['bar','boy'],
...
}
4 : {
'a' : {
1 : ['ante'],
....
Then our algorithm is just:
possiblewords = set()
firsttime = True
wordlen = len(query)
for idx,letter in enumerate(query):
if(letter is not '?'):
matchesthisletter = set(allmap[wordlen][letter][idx])
if firsttime:
possiblewords = matchesthisletter
else:
possiblewords &= matchesthisletter
At the end, the set possiblewords will contain all the matching letters.
If you generate all the possible words that match the pattern (arate, arbte, arcte ... zryte, zrzte) and then look them up in a binary tree representation of the dictionary, that will have the average performance characteristics of O(e^N1 * log(N2)) where N1 is the number of question marks and N2 is the size of the dictionary. Seems good enough for me but I'm sure it's possible to figure out a better algorithm.
EDIT: If you will have more than say, three question marks, have a look at Phil H's answer and his letter indexing approach.
Assume you have enough memory, you could build a giant hashmap to provide the answer in constant time. Here is a quick example in python:
from array import array
all_words = open("english-words").read().split()
big_map = {}
def populate_map(word):
for i in range(pow(2, len(word))):
bin = _bin(i, len(word))
candidate = array('c', word)
for j in range(len(word)):
if bin[j] == "1":
candidate[j] = "?"
if candidate.tostring() in big_map:
big_map[candidate.tostring()].add(word)
else:
big_map[candidate.tostring()] = set([word])
def _bin(x, width):
return ''.join(str((x>>i)&1) for i in xrange(width-1,-1,-1))
def run():
for word in all_words:
populate_map(word)
run()
>>> big_map["y??r"]
set(['your', 'year'])
>>> big_map["yo?r"]
set(['your'])
>>> big_map["?o?r"]
set(['four', 'poor', 'door', 'your', 'hour'])
You can take a look at how its done in aspell. It prompts suggestions of correct word for misspelled words.
Build a hash set of all the words. To find matches, replace the question marks in the pattern with each possible combination of letters. If there are two question marks, a query consists of 262 = 676 quick, constant-expected-time hash table lookups.
import itertools
words = set(open("/usr/share/dict/words").read().split())
def query(pattern):
i = pattern.index('?')
j = pattern.rindex('?') + 1
for combo in itertools.product('abcdefghijklmnopqrstuvwxyz', repeat=j-i):
attempt = pattern[:i] + ''.join(combo) + pattern[j:]
if attempt in words:
print attempt
This uses less memory than my other answer, but it gets exponentially slower as you add more question marks.
If 80-90% accuracy is acceptable, you could manage with Peter Norvig's spell checker. The implementation is small and elegant.
A regex-based solution will consider every possible value in your dictionary. If performance is your largest constraint, an index could be built to speed it up considerably.
You could start with an index on each word length containing an index of each index=character matching word sets. For length 5 words, for example, 2=r : {write, wrote, drate, arete, arite}, 3=o : {wrote, float, group}, etc. To get the possible matches for the original query, say '?ro??', the word sets would be intersected resulting in {wrote, group} in this case.
This is assuming that the only wildcard will be a single character and that the word length is known up front. If these are not valid assumptions, I can recommend n-gram based text matching, such as discussed in this paper.
The data structure you want is called a trie - see the wikipedia article for a short summary.
A trie is a tree structure where the paths through the tree form the set of all the words you wish to encode - each node can have up to 26 children, on for each possible letter at the next character position. See the diagram in the wikipedia article to see what I mean.
Have you considered using a Ternary Search Tree?
The lookup speed is comparable to a trie, but it is more space-efficient.
I have implemented this data structure several times, and it is a quite straightforward task in most languages.
My first post had an error that Jason found, it did not work well when ?? was in the beginning. I have now borrowed the cyclic shifts from Anna..
My solution:
Introduce an end-of-word character (#) and store all cyclic shifted words in sorted arrays!! Use one sorted array for each word length. When looking for "th??e#", shift the string to move the ?-marks to the end (obtaining e#th??) and pick the array containing words of length 5 and make a binary search for the first word occurring after string "e#th". All remaining words in the array match, i.e., we will find "e#thoo (thoose), e#thes (these), etc.
The solution has time complexity Log( N ), where N is the size of the dictionary, and it expands the size of the data by a factor of 6 or so ( the average word length)
Here's how I'd do it:
Concatenate the words of the dictionary into one long String separated by a non-word character.
Put all words into a TreeMap, where the key is the word and the value is the offset of the start of the word in the big String.
Find the base of the search string; i.e. the largest leading substring that doesn't include a '?'.
Use TreeMap.higherKey(base) and TreeMap.lowerKey(next(base)) to find the range within the String between which matches will be found. (The next method needs to calculate the next larger word to the base string with the same number or fewer characters; e.g. next("aa") is "ab", next("az") is "b".)
Create a regex for the search string and use Matcher.find() to search the substring corresponding to the range.
Steps 1 and 2 are done beforehand giving a data structure using O(NlogN) space where N is the number of words.
This approach degenerates to a brute-force regex search of the entire dictionary when the '?' appears in the first position, but the further to the right it is, the less matching needs to be done.
EDIT:
To improve the performance in the case where '?' is the first character, create a secondary lookup table that records the start/end offsets of runs of words whose second character is 'a', 'b', and so on. This can be used in the case where the first non-'?' is second character. You can us a similar approach for cases where the first non-'?' is the third character, fourth character and so on, but you end up with larger and larger numbers of smaller and smaller runs, and eventually this "optimization" becomes ineffective.
An alternative approach which requires significantly more space, but which is faster in most cases, is to prepare the dictionary data structure as above for all rotations of the words in the dictionary. For instance, the first rotation would consist of all words 2 characters or more with the first character of the word moved to the end of the word. The second rotation would be words of 3 characters or more with the first two characters moved to the end, and so on. Then to do the search, look for the longest sequence of non-'?' characters in the search string. If the index of the first character of this substring is N, use the Nth rotation to find the ranges, and search in the Nth rotation word list.
A lazy solution is to let SQLite or another DBMS do the job for you.
Just create an in-memory database, load your words and run a select using the LIKE operator.
Summary: Use two compact binary-searched indexes, one of the words, and one of the reversed words. The space cost is 2N pointers for the indexes; almost all lookups go very fast; the worst case, "??e", is still decent. If you make separate tables for each word length, that'd make even the worst case very fast.
Details: Stephen C. posted a good idea: search an ordered dictionary to find the range where the pattern can appear. This doesn't help, though, when the pattern starts with a wildcard. You might also index by word-length, but here's another idea: add an ordered index on the reversed dictionary words; then a pattern always yields a small range in either the forward index or the reversed-word index (since we're told there are no patterns like ?ABCD?). The words themselves need be stored only once, with the entries of both structures pointing to the same words, and the lookup procedure viewing them either forwards or in reverse; but to use Python's built-in binary-search function I've made two separate strings arrays instead, wasting some space. (I'm using a sorted array instead of a tree as others have suggested, as it saves space and goes at least as fast.)
Code:
import bisect, re
def forward(string): return string
def reverse(string): return string[::-1]
index_forward = sorted(line.rstrip('\n')
for line in open('/usr/share/dict/words'))
index_reverse = sorted(map(reverse, index_forward))
def lookup(pattern):
"Return a list of the dictionary words that match pattern."
if reverse(pattern).find('?') <= pattern.find('?'):
key, index, fixup = pattern, index_forward, forward
else:
key, index, fixup = reverse(pattern), index_reverse, reverse
assert all(c.isalpha() or c == '?' for c in pattern)
lo = bisect.bisect_left(index, key.replace('?', 'A'))
hi = bisect.bisect_right(index, key.replace('?', 'z'))
r = re.compile(pattern.replace('?', '.') + '$')
return filter(r.match, (fixup(index[i]) for i in range(lo, hi)))
Tests: (The code also works for patterns like ?AB?D?, though without the speed guarantee.)
>>> lookup('hello')
['hello']
>>> lookup('??llo')
['callo', 'cello', 'hello', 'uhllo', 'Rollo', 'hollo', 'nullo']
>>> lookup('hel??')
['helio', 'helix', 'hello', 'helly', 'heloe', 'helve']
>>> lookup('he?l')
['heal', 'heel', 'hell', 'heml', 'herl']
>>> lookup('hx?l')
[]
Efficiency: This needs 2N pointers plus the space needed to store the dictionary-word text (in the tuned version). The worst-case time comes on the pattern '??e' which looks at 44062 candidates in my 235k-word /usr/share/dict/words; but almost all queries are much faster, like 'h??lo' looking at 190, and indexing first on word-length would reduce '??e' almost to nothing if we need to. Each candidate-check goes faster than the hashtable lookups others have suggested.
This resembles the rotations-index solution, which avoids all false match candidates at the cost of needing about 10N pointers instead of 2N (supposing an average word-length of about 10, as in my /usr/share/dict/words).
You could do a single binary search per lookup, instead of two, using a custom search function that searches for both low-bound and high-bound together (so the shared part of the search isn't repeated).
If you only have ? wildcards, no * wildcards that match a variable number of characters, you could try this: For each character index, build a dictionary from characters to sets of words. i.e. if the words are write, wrote, drate, arete, arite, your dictionary structure would look like this:
Character Index 0:
'a' -> {"arete", "arite"}
'd' -> {"drate"}
'w' -> {"write", "wrote"}
Character Index 1:
'r' -> {"write", "wrote", "drate", "arete", "arite"}
Character Index 2:
'a' -> {"drate"}
'e' -> {"arete"}
'i' -> {"write", "arite"}
'o' -> {"wrote"}
...
If you want to look up a?i?? you would take the set that corresponds to character index 0 => 'a' {"arete", "arite"} and the set that corresponds to character index 2 = 'i' => {"write", "arite"} and take the set intersection.
If you seriously want something on the order of a billion searches per second (though i can't dream why anyone outside of someone making the next grand-master scrabble AI or something for a huge web service would want that fast), i recommend utilizing threading to spawn [number of cores on your machine] threads + a master thread that delegates work to all of those threads. Then apply the best solution you have found so far and hope you don't run out of memory.
An idea i had is that you can speed up some cases by preparing sliced down dictionaries by letter then if you know the first letter of the selection you can resort to looking in a much smaller haystack.
Another thought I had was that you were trying to brute-force something -- perhaps build a DB or list or something for scrabble?

Resources