Knuth-Morris-Pratt Fail table - algorithm

I am studying for an exam I have and I am looking over the Knuth-Morris-Pratt algorithm. What is going to be on the exam is the Fail table and DFA construction. I understand DFA construction, but I don't really understand how to make the fail table.
If I have an example of a pattern "abababc" how do I build a fail table from this? The solution is:
Fail table:
0 1 2 3 4 5 6 7
0 0 0 1 2 3 4 0
but how do I get that? No code just an explanation of how to get that is necessary.

The value of cell i in the fail table for string s is defined as follows: take the prefix of s of length i (the substring that ends at position i); the value in the cell is the length of the longest proper (not the whole string) suffix of this substring that is equal to its prefix of the same length.
Let's take your example and consider the value for 6. The prefix of s with length 6 is ababab. Its proper suffixes are babab, abab, bab, ab and b; its proper prefixes are ababa, abab, aba, ab and a. The suffixes that are equal to the prefix of the same length are abab and ab. The longer of these is abab, so the value in cell 6 is its length, 4.

Pattern P = {abababc}
P[0] = 'a'. P[1] = 'b'. P[2] = 'a'. P[3] = 'b'. P[4] = 'a'. P[5] = 'b'. P[6] = 'c'.
The purpose of the Fail Table is to identify the maximum possible shift (one that neither misses a possible match nor makes unnecessary comparisons) when the first i characters of the pattern match and the mismatch is found at the (i+1)-th character.
The number in the Fail Table indicates how many characters still match after the shift, given that the first i characters of the pattern matched the text.
Let FailTable be FT[].
FT[1] - 'a' matches with text. Break found at 'b'(P[1]). Do we have a proper suffix of 'a' which matches the proper prefix of 'a'? Ans is NO. So length of the String which still continues to match after the shift is 0. Hence FT[1] = 0.
FT[2] - 'ab' matches with text. Break found at 'a' (P[2]). Do we have a proper suffix of 'ab' which matches the proper prefix of 'ab'? Ans is NO. So length of the String which still continues to match after the shift is 0. Hence FT[2] = 0.
FT[3] - 'aba' matches with text. Break found at 'b' (P[3]). Do we have a proper suffix of 'aba' which matches the proper prefix of 'aba'? Ans is YES ('a'). So length of the String which still continues to match after the shift is 1. Hence FT[3] = 1.
FT[4] - 'abab' matches with text. Break found at 'a' (P[4]). Do we have a proper suffix of 'abab' which matches the proper prefix of 'abab'? Ans is YES('ab'). So length of the String which still continues to match after the shift is 2. Hence FT[4] = 2.
FT[5] - 'ababa' matches with text. Break found at 'b' (P[5]). Do we have a proper suffix of 'ababa' which matches the proper prefix of 'ababa'? Ans is YES('aba'). So length of the String which still continues to match after the shift is 3. Hence FT[5] = 3.
FT[6] - 'ababab' matches with text. Break found at 'c' (P[6]). Do we have a proper suffix of 'ababab' which matches the proper prefix of 'ababab'? Ans is YES('abab'). So length of the String which still continues to match after the shift is 4. Hence FT[6] = 4.
FT[7] - 'abababc' matches with text. No break found at all, Pattern matched with the text. Do we have a proper suffix of 'abababc' which matches the proper prefix of 'abababc'? Ans is NO. So length of the String which still continues to match after the shift is 0. Hence FT[7] = 0.
Hence FT[1..7] = [0,0,1,2,3,4,0]; together with FT[0] = 0 this is the table given in the question.
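For readers who want to check a table mechanically, here is a minimal Python sketch of the usual linear-time construction. The name fail_table and the indexing convention (entry i describes the first i characters of the pattern, so the list has len(pattern) + 1 entries and entry 0 is fixed at 0) are assumptions chosen to match the table in the question, not something the answers above prescribe.

```python
def fail_table(pattern):
    """Return ft where ft[i] is the length of the longest proper suffix
    of pattern[:i] that is also a prefix of pattern[:i]."""
    n = len(pattern)
    ft = [0] * (n + 1)
    k = 0  # length of the border currently being extended
    for i in range(1, n):
        # On a mismatch, fall back to the next shorter border.
        while k > 0 and pattern[i] != pattern[k]:
            k = ft[k]
        if pattern[i] == pattern[k]:
            k += 1
        ft[i + 1] = k
    return ft


print(fail_table("abababc"))
```

Each entry is found by extending the previous border by one character, or by falling back through shorter borders when that fails, which is the same suffix/prefix comparison done by hand above.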
Hope it helps!

Related

Number of substrings of a given string containing a specific character

What can be the most efficient algorithm to count the number of substrings of a given string that contain a given character.
e.g. for the string abb and the character b
sub-strings: a, b, b, ab, bb, abb.
Answer: substrings containing b at least once = 5.
PS. I solved this question by generating all the substrings and then checking, in O(n^2). I just want to know whether there is a better solution.
Suppose you need to count the substrings that contain character X.
Scan the string left to right, keeping the position of the last X seen: lastX, with starting value -1.
When you meet X at position i, add i+1 to the result and update lastX
(this is the number of substrings ending at the current position, and they all contain X).
When you meet any other character, add lastX + 1 to the result
(this is again the number of substrings ending at the current position and containing X),
because the rightmost possible start of such a substring is the position of the last X.
The algorithm is linear.
Example:
a X a a X a

idx  char  good substrings ending at idx    lastX  count  overall count
0    a     -                                -1     0      0
1    X     aX X                             1      2      2
2    a     aXa Xa                           1      2      4
3    a     aXaa Xaa                         1      2      6
4    X     aXaaX XaaX aaX aX X              4      5      11
5    a     aXaaXa XaaXa aaXa aXa Xa         4      5      16
Python code:
def subcnt(s, c):
    last = -1
    cnt = 0
    for i in range(len(s)):
        if s[i] == c:
            last = i
        cnt += last + 1
    return cnt
print(subcnt('abcdba', 'b'))
You could turn this around and scan your string for occurrences of your letter. Every time you find an occurrence in some position i, you know that it is contained by definition in all the substrings that contain it (i.e. all substrings which start before or at i and end at or after i), so you only need to store pairs of indices to define substrings instead of storing substrings explicitly.
That being said, you'll still need O(n²) with this approach because although you don't mind repeated substrings as your example shows, you don't want to count the same substring twice, so you still have to make sure that you don't select the same pair of indices twice.
Let's consider the string as abcdaefgabb and the given character as a.
Loop over the string char by char.
If a character matches a given character, let's say a at index 4, so number of substrings which will contain a is from abcda to aefgabb. So, we add (4-0 + 1) + (10 - 4) = 11. These represent substrings as abcda,bcda,cda,da,a,ae,aef,aefg,aefga,aefgab and aefgabb.
This applies to wherever you find a, like you find it at index 0 and also at index 8.
Final answer is the sum of above mentioned math operations.
Update: You will have to maintain 2 pointers between the last occurrence of a and the current a to avoid counting duplicate substrings which start and end at the same indices.
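One way to make this per-occurrence counting hit each substring exactly once, sketched in Python under an assumption the answer does not spell out: attribute every substring to the first occurrence of the character it contains. A substring attributed to the occurrence at index i must start after the previous occurrence and end at or after i, which gives (i - prev) * (n - i) substrings. The function name count_by_first_occurrence is made up for this illustration.

```python
def count_by_first_occurrence(s, c):
    """Count substrings of s that contain c, crediting each substring
    to the first occurrence of c inside it."""
    n = len(s)
    prev = -1  # index of the previous occurrence of c
    total = 0
    for i, ch in enumerate(s):
        if ch == c:
            # Start anywhere in (prev, i], end anywhere in [i, n).
            total += (i - prev) * (n - i)
            prev = i
    return total


print(count_by_first_occurrence('abb', 'b'))
```

This stays linear and needs no set of index pairs, because the two ranges for different occurrences never produce the same (start, end) pair.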
Think of a substring as selecting two elements from the gaps between the letters in your string and including everything between them (where there are gaps on the extreme ends of the string).
For a string of length n, there are choose(n+1,2) substrings.
Of those, for each run of k characters that doesn't include the target, there are choose(k+1,2) substrings that only include letters from that substring. All other substrings of the main string must include the target.
Answer: choose(n+1,2) - sum(choose(k_i+1,2)), where the k_i are the lengths of runs of letters that don't include the target.
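The formula above can be sketched in Python as follows. The function name count_with_char is an assumption for this example, and math.comb requires Python 3.8 or later.

```python
from math import comb


def count_with_char(s, c):
    """Substrings of s containing c: all substrings minus those that lie
    entirely inside a run of characters different from c."""
    total = comb(len(s) + 1, 2)
    run = 0  # length of the current run without c
    for ch in s:
        if ch == c:
            total -= comb(run + 1, 2)
            run = 0
        else:
            run += 1
    total -= comb(run + 1, 2)  # trailing run, possibly empty
    return total


print(count_with_char('abb', 'b'))
```

An empty run contributes comb(1, 2) = 0, so no special case is needed for adjacent occurrences of the target.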

Efficiently find a given subsequence in a string, maximizing the number of contiguous characters

Long problem description
Fuzzy string matcher utilities like fzf or CtrlP filter a list of strings for ones which have a given search string as a subsequence.
As an example, consider that a user wants to search for a specific photo in a list of files. To find the file
/home/user/photos/2016/pyongyang_photo1.png
it suffices to type ph2016png, because this search string is a subsequence of this file name. (Mind that this is not LCS. The whole search string must be a subsequence of the file name.)
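As a point of reference, the plain subsequence test can be sketched in a few lines of Python. The function name is_subsequence is an assumption for this example; it relies on the fact that the `in` operator consumes an iterator up to and including the first match.

```python
def is_subsequence(query, name):
    """True if the characters of query appear in name in the same order."""
    remaining = iter(name)
    # Each membership test resumes where the previous one stopped.
    return all(ch in remaining for ch in query)


print(is_subsequence("ph2016png", "/home/user/photos/2016/pyongyang_photo1.png"))
```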
It is trivial to check whether a given search string is a subsequence of another string, but I wonder how to efficiently obtain the best match: In the above example, there are multiple possible matches. One is
/home/user/photos/2016/pyongyang_photo1.png
but the one which the user probably had in mind is
/home/user/photos/2016/pyongyang_photo1.png
To formalize this, I'd define the "best" match as the one that is composed of the smallest number of substrings. This number is 5 for the first example match and 3 for the second.
I came up with this because it would be interesting to obtain the best match to assign a score to each result, for sorting. I'm not interested in approximate solutions though, my interest in this problem is primarily of academic nature.
tl;dr problem description
Given strings s and t, find among the subsequences of t that are equal to s one that maximizes the number of pairs of elements that are contiguous in t.
What I've tried so far
For discussion, let's call the search query s and the string to test t. The problem's solution is denoted fuzzy(s, t). I'll utilize Python's string slicing notation. The easiest approach is as follows:
Since any solution must use all characters from s in order, an algorithm for solving this problem can start by searching the first occurrence of s[0] in t (with index i) and then use the better of the two solutions
t[:i+1] + fuzzy(s[1:], t[i+1:])    # Use the character
t[:i] + fuzzy(s, t[i+1:])          # Skip it and use the next occurrence
                                   # of s[0] in t instead
This is obviously not the best solution to this problem. Au contraire, it's the obvious brute-force one. (I've played around with simultaneously searching for the last occurrence of s[-1] and using this information in an earlier version of this question, but it turned out that this approach does not work.)
→ My question is: What is the most efficient solution to this problem?
I would suggest creating a search tree, where each node represents a character position in the haystack that matches one of the needle characters.
The top nodes are siblings and represent the occurrences of the first needle character in the haystack.
The children of a parent node are those nodes that represent the occurrences of the next needle character in the haystack, but only those that are positioned after the position represented by that parent node.
This logically means that some children are shared by several parents, and so this structure is not really a tree, but a directed acyclic graph. Some sibling parents might even have exactly the same children. Other parents might not have children at all: they are a dead-end, unless they are at the bottom of the graph where the leaves represent positions of the last needle character.
Once this graph is set up, a depth-first search in it can easily derive the number of segments that are still needed from a certain node onwards, and then minimise that among alternatives.
I have added some comments in the Python code below. This code might still be improved, but it seems already quite efficient compared to your solution.
def fuzzy_trincot(haystack, needle, returnSegments = False):
    inf = float('inf')
    def getSolutionAt(node, depth, optimalCount = 2):
        if not depth: # reached end of needle
            node['count'] = 0
            return
        minCount = inf # infinity ensures also that incomplete branches are pruned
        child = node['child']
        i = node['i']+1
        # Optimisation: optimalCount gives the theoretical minimum number of
        # segments needed for any solution. If we find such case,
        # there is no need to continue the search.
        while child and minCount > optimalCount:
            # If this node was already evaluated, don't lose time recursing again.
            # It works without this condition, but that is less optimal.
            if 'count' not in child:
                getSolutionAt(child, depth-1, 1)
            count = child['count'] + (i < child['i'])
            if count < minCount:
                minCount = count
            child = child['sibling']
        # Store the results we found in this node, so if ever we come here again,
        # we don't need to recurse the same sub-tree again.
        node['count'] = minCount
    # Preprocessing: build tree
    # A node represents a needle character occurrence in the haystack.
    # A node can have these keys:
    #   i:       index in haystack where needle character occurs
    #   child:   node that represents a match, at the right of this index,
    #            for the next needle character
    #   sibling: node that represents the next match for this needle character
    #   count:   the least number of additional segments needed for matching the
    #            remaining needle characters (only; so not counting the segments
    #            already taken at the left)
    root = { 'i': -2, 'child': None, 'sibling': None }
    # Take a short-cut for when needle is a substring of haystack
    if haystack.find(needle) != -1:
        root['count'] = 1
    else:
        parent = root
        leftMostIndex = 0
        rightMostIndex = len(haystack)-len(needle)
        for j, c in enumerate(needle):
            sibling = None
            child = None
            # Use of leftMostIndex is an optimisation; it works without this argument
            i = haystack.find(c, leftMostIndex)
            # Use of rightMostIndex is an optimisation; it works without this test
            while 0 <= i <= rightMostIndex:
                node = { 'i': i, 'child': None, 'sibling': None }
                while parent and parent['i'] < i:
                    parent['child'] = node
                    parent = parent['sibling']
                if sibling: # not first child
                    sibling['sibling'] = node
                else: # first child
                    child = node
                    leftMostIndex = i+1
                sibling = node
                i = haystack.find(c, i+1)
            if not child: return False
            parent = child
            rightMostIndex += 1
        getSolutionAt(root, len(needle))
    count = root['count']
    if not returnSegments:
        return count
    # Use the `returnSegments` option when you need the character content
    # of the segments instead of only the count. It runs in linear time.
    if count == 1: # Deal with short-cut case
        return [needle]
    segments = []
    node = root['child']
    i = -2
    start = 0
    for end, c in enumerate(needle):
        i += 1
        # Find best child among siblings
        while (node['count'] > count - (i < node['i'])):
            node = node['sibling']
        if count > node['count']:
            count = node['count']
            if end:
                segments.append(needle[start:end])
                start = end
        i = node['i']
        node = node['child']
    segments.append(needle[start:])
    return segments
The function can be called with an optional third argument:
haystack = "/home/user/photos/2016/pyongyang_photo1.png"
needle = "ph2016png"
print (fuzzy_trincot(haystack, needle))
print (fuzzy_trincot(haystack, needle, True))
Outputs:
3
['ph', '2016', 'png']
As the function is optimised to return only the count, the second call will add a bit to the execution time.
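For cross-checking the count on small inputs, the same minimisation can also be written as a compact memoised recursion. This is only a sketch, not the graph-based method above: the name min_segments and the state (next needle index, next haystack index, whether the previous haystack character was used) are assumptions made for this illustration, and the recursion depth grows with the haystack length, so it suits file names rather than long texts.

```python
from functools import lru_cache


def min_segments(needle, haystack):
    """Fewest contiguous pieces of haystack that spell needle as a
    subsequence, or None if needle is not a subsequence at all."""
    inf = float('inf')

    @lru_cache(maxsize=None)
    def best(i, j, contiguous):
        # i: next needle index, j: next haystack index,
        # contiguous: haystack[j-1] was used for needle[i-1]
        if i == len(needle):
            return 0
        if j == len(haystack):
            return inf
        result = best(i, j + 1, False)            # skip haystack[j]
        if needle[i] == haystack[j]:              # or use it
            cost = 0 if contiguous else 1         # a new piece costs 1
            result = min(result, cost + best(i + 1, j + 1, True))
        return result

    answer = best(0, 0, False)
    return None if answer == inf else answer


print(min_segments("ph2016png", "/home/user/photos/2016/pyongyang_photo1.png"))
```

There are at most 2 * len(needle) * len(haystack) distinct states, so the work is proportional to the product of the two lengths.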
This is probably not the most efficient solution, but it is an efficient and easy to implement solution. To illustrate, I'll borrow your example. Let /home/user/photos/2016/pyongyang_photo1.png be the filename, and ph2016png, the input.
The first step (precalculation) is optional but might help speed up the next step (setup) quite a bit, especially if you are applying the algorithm to many filenames.
Precalculation
Create a table counting the occurrences of each character in the input. Since you are probably only dealing with ASCII characters, 256 entries are sufficient (maybe 128, or even less depending on the character set).
"ph2016png"
['p'] : 2
['h'] : 1
['2'] : 1
['0'] : 1
['b'] : 0
...
Setup
Slice the filename into substrings by throwing away characters that are not present in the input. At the same time, check if each character of the input is present the correct amount of times in the filename (if the precalculation is done). Finally, check that each character of the input appears in order in the substrings list. If you take the substrings list as a single string, for any given character of that string, every character that is found before it in the input must be found before it in that string. That can be done while creating the substrings.
"/home/user/photos/2016/pyongyang_photo1.png"
"h", "ph", "2016", "p", "ng", "ng", "ph", "1", "png"
'p' must come before "h", so throw this one away
"ph", "2016", "p", "ng", "ng", "ph", "1", "png"
Core
Match the longest substring with the input and keep track of the longest match. This match can keep the beginning of the substring (for instance, matching ababa (substring) with babaa (input) would result in aba, not baba) because it's easier to implement, although it doesn't have to. If you don't get a complete match, use the longest one to slice up the substring once more, and retry with the next longest substring.
Since there is no instance of incomplete match with your example,
let's take something else, made to illustrate the point.
Let's take "babaaababcb" as the filename, and "ababb" as input.
Substrings : "abaaabab", "b"
Longest substring : "abaaabab"
If you keep the beginning of matches
Longest match : "aba"
Slice "abaaabab" into "aba", "aabab"
-> "aba", "aabab", "b"
Retry with "aabab"
-> "aba", "a", "abab", "b"
Retry with "abab" (complete match)
Otherwise (harder to implement, not necessarily better performing, as shown in this example)
Longest match : "abab"
Slice "abaaabab" into "abaa", "abab"
-> "abaa", "abab", "b"
Retry with "abaa"
-> "aba", "a", "abab", "b"
Retry with "abab" (complete match)
If you do get a complete match, continue by slicing the input in two as well as the list of substrings, and repeat matching the longest substring.
With "ph2016png" as input
Longest substring : "2016"
Complete match
Match substrings "h", "ph" with input "ph"
Match substrings "p", "ng", "ng", "ph", "1", "png" with input "png"
You are guaranteed to find the sequence of substrings that contains the fewest substrings because you try the longest ones first. That will typically perform well if the input doesn't contain many short substrings from the filename.

Counting in Wonderland

The text of Alice in Wonderland contains the word 'Wonderland' 8 times. (Let's be case-insensitive for this question).
However it contains the word many more times if you count non-contiguous subsequences as well as substrings, eg.
Either the well was very deep, or she fell very slowly, for she had
plenty of time as she went down to look about her and to WONDER what was
going to happen next. First, she tried to Look down AND make out what
she was coming to, but it was too dark to see anything;
(A subsequence is a sequence that can be derived from another sequence by deleting some elements without changing the order of the remaining elements. —Wikipedia)
How many times does the book contain the word Wonderland as a subsequence? I expect this will be a big number—it's a long book with many w's and o's and n's and d's.
I tried brute force counting (recursion to make a loop 10 deep) but it was too slow, even for that example paragraph.
Let's say you didn't want to search for wonderland, but just for w. Then you'd simply count how many times w occurred in the story.
Now let's say you want wo. For each first character of the current pattern you find, you add to your count:
How many times the current pattern without its first character occurs in the rest of the story, after this character you're at: so you have reduced the problem (story[1..n], pattern[1..n]) to (story[2..n], pattern[2..n])
How many times the entire current pattern occurs in the rest of the story. So you have reduced the problem to (story[2..n], pattern[1..n])
Now you can just add the two. There is no overcounting if we talk in terms of subproblems. Consider the example wawo. Obviously, wo occurs 2 times. You might think the counting will go like:
For the first w, add 1 because o occurs once after it and another 1 because wo occurs once after it.
For the second w, add 1 because o occurs once after it.
Answer is 3, which is wrong.
But this is what actually happens:
(wawo, wo) -> (awo, o) -> (wo, o) -> (o, o) -> (-, -) -> 1
                                            -> (-, o) -> 0
           -> (awo, wo) -> (wo, wo) -> (o, wo) -> (-, wo) -> 0
                                    -> (o, o) -> (-, -) -> 1
                                              -> (-, o) -> 0
So you can see that the answer is 2.
If you don't find a w, then the count for this position is just how many times wo occurs after this current character.
This allows for dynamic programming with memoization:
count(story_index, pattern_index, dp):
    if dp[story_index, pattern_index] not computed:
        if pattern_index == len(pattern):
            return 1
        if story_index == len(story):
            return 0
        if story[story_index] == pattern[pattern_index]:
            dp[story_index, pattern_index] = count(story_index + 1, pattern_index + 1, dp) +
                                             count(story_index + 1, pattern_index, dp)
        else:
            dp[story_index, pattern_index] = count(story_index + 1, pattern_index, dp)
    return dp[story_index, pattern_index]
Call with count(0, 0, dp). Note that you can make the code cleaner (remove the duplicate function call).
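A runnable rendering of that memoised pseudocode, as a sketch: functools.lru_cache stands in for the dp table, and the wrapper name count_subsequences is an assumption for this example. Because the recursion goes one level deeper per story character, it is only suitable for short inputs such as the test strings here, not for a whole book.

```python
from functools import lru_cache


def count_subsequences(story, pattern):
    """Number of ways pattern occurs in story as a subsequence."""

    @lru_cache(maxsize=None)
    def count(story_index, pattern_index):
        if pattern_index == len(pattern):
            return 1  # whole pattern matched
        if story_index == len(story):
            return 0  # story exhausted first
        # Occurrences that skip the current story character.
        total = count(story_index + 1, pattern_index)
        if story[story_index] == pattern[pattern_index]:
            # Occurrences that use the current story character.
            total += count(story_index + 1, pattern_index + 1)
        return total

    return count(0, 0)


print(count_subsequences('wawo', 'wo'))
```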
Python code, with no memoization:
def count(story, pattern):
    if len(pattern) == 0:
        return 1
    if len(story) == 0:
        return 0
    s = count(story[1:], pattern)
    if story[0] == pattern[0]:
        s += count(story[1:], pattern[1:])
    return s
print(count('wonderlandwonderland', 'wonderland'))
Output:
17
This makes sense: for each i first characters in the first wonderland of the story, you can group it with remaining final characters in the second wonderland, giving you 10 solutions. Another 2 are the words themselves. The other five are:
wonderlandwonderland
*********    *
********    **
********    *      *
**      **    ******
***      *    ******
You're right that this will be a huge number. I suggest that you either use large integers or take the result modulo something.
The same program returns 9624 for your example paragraph.
The string "wonderland" occurs as a subsequence in Alice in Wonderland [1] 24100772180603281661684131458232 times.
The main idea is to scan the main text character by character, keeping a running count of how often each prefix of the target string (i.e.: in this case, "w", "wo", "won", ..., "wonderlan", and "wonderland") has occurred up to the current letter. These running counts are easy to compute and update. If the current letter does not occur in "wonderland", then the counts are left untouched. If the current letter is "a" then we increment the count of "wonderla"s seen by the number of "wonderl"s seen up to this point. If the current letter is "n" then we increment the count of "won"s by the count of "wo"s and the count of "wonderlan"s by the count of "wonderla"s. And so forth. When we reach end of the text, we will have the count of all prefixes of "wonderland" including the string "wonderland" itself, as desired.
The advantage of this approach is that it requires a single pass through the text and does not require O(n) recursive calls (which will likely exceed the maximum recursion depth unless you do something clever).
Code
import fileinput
target = 'wonderland'
prefixes = dict()
count = dict()
# Map each letter to the prefixes of the target that end with it.
for i in range(len(target)):
    letter = target[i]
    prefix = target[:i+1]
    if letter not in prefixes:
        prefixes[letter] = [prefix]
    else:
        prefixes[letter].append(prefix)
    count[prefix] = 0
# Single pass over the text, updating the running count of every prefix.
for line in fileinput.input():
    for letter in line.lower():
        if letter in prefixes:
            for prefix in prefixes[letter]:
                if len(prefix) > 1:
                    count[prefix] = count[prefix] + count[prefix[:len(prefix)-1]]
                else:
                    count[prefix] = count[prefix] + 1
print(count[target])
[1] Using this text from Project Gutenberg, starting with "CHAPTER I. Down the Rabbit-Hole" and ending with "THE END"
Following up on previous comments, if you are looking for an algorithm that would return 2 for the input wonderlandwonderland and 1 for wonderwonderland, then I think you could adapt the algorithm from this question:
How to find smallest substring which contains all characters from a given string?
Effectively, the change in your case would be that, once an instance of the word is found, you increment a counter and repeat all the procedure with the remaining part of the text.
Such an algorithm would be O(n) in time, where n is the length of the text, and O(m) in space, where m is the length of the searched string.
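A minimal sketch of that "find one instance, count it, continue with the rest of the text" idea in Python. It is a simple greedy scan, not the smallest-window algorithm from the linked question, and the name count_disjoint_occurrences is made up for this example; it assumes a non-empty word.

```python
def count_disjoint_occurrences(text, word):
    """Greedily count occurrences of word as a subsequence of text,
    restarting after each completed occurrence."""
    found = 0
    k = 0  # how many characters of word are matched so far
    for ch in text:
        if ch == word[k]:
            k += 1
            if k == len(word):
                found += 1
                k = 0  # start over on the remaining text
    return found


print(count_disjoint_occurrences('wonderlandwonderland', 'wonderland'))
print(count_disjoint_occurrences('wonderwonderland', 'wonderland'))
```

The scan touches each text character once and keeps only the counter and the match position.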

Find all substrings that don't contain the entire set of characters

This was asked to me in an interview.
I'm given a string whose characters come from the set {a,b,c} only. Find all substrings that don't contain all the characters from the set, e.g. substrings that contain only a's, only b's, only c's, or only a's and b's, only b's and c's, or only c's and a's. I gave him the naive O(n^2) solution by generating all substrings and testing them.
The interviewer wanted an O(n) solution.
Edit: My attempt was to have the last indexes of a,b,c and run a pointer from left to right, and anytime all 3 were counted, change the start of the substring to exclude the earliest one and start counting again. It doesn't seem exhaustive
So for e.g, if the string is abbcabccaa,
let i be the pointer that traverses the string. Let start be start of the substring.
1) i = 0, start = 0
2) i = 1, start = 0, last_index(a) = 0 --> 1 substring - a
3) i = 2, start = 0, last_index(a) = 0, last_index(b) = 1 -- > 1 substring ab
4) i = 3, start = 0, last_index(a) = 0, last_index(b) = 2 --> 1 substring abb
5) i = 4, start = 1, last_index(b) = 2, last_index(c) = 3 --> 1 substring bbc(removed a from the substring)
6) i = 5, start = 3, last_index(c) = 3, last_index(a) = 4 --> 1 substring ca(removed b from the substring)
but this isn't exhaustive
Given that the problem in its original definition can't be solved in less than O(N^2) time, as some comments point out, I suggest a linear algorithm for counting the number of substrings (not necessarily unique in their values, but unique in their positions within the original string).
The algorithm
count = 0
For every char C in {'a','b','c'} scan the input S and break it into longest sequences not including C. For each such section A, add |A|*(|A|+1)/2 to count. This addition stands for the number of legal sub-strings inside A.
Now we have the total number of legal strings including only {'a','b'}, only {'a','c'} and only {'b','c'}. The problem is that we counted substrings with a single repeated character twice. To fix this we iterate over S again, this time subtracting |A|*(|A|+1)/2 for every largest sequence A of a single character that we encounter.
Return count
Example
S='aacb'
breaking it using 'a' gives us only 'cb', so count = 3. For C='b' we have 'aac', which makes count = 3 + 6 = 9. With C='c' we get 'aa' and 'b', so count = 9 + 3 + 1 = 13. Now we have to do the subtraction: 'aa': -3, 'c': -1, 'b': -1. So we have count=8.
The 8 substrings are:
'a'
'a' (the second char this time)
'aa'
'ac'
'aac'
'cb'
'c'
'b'
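The counting steps above can be sketched in Python as follows. The function name count_incomplete_substrings is an assumption for this example, and the correction step is only valid for an alphabet of exactly three characters, as in the question.

```python
from itertools import groupby


def count_incomplete_substrings(s):
    """Count substrings of s (over the alphabet a, b, c) that are
    missing at least one of the three characters."""

    def pieces(k):
        # Number of non-empty substrings of a section of length k.
        return k * (k + 1) // 2

    count = 0
    for c in "abc":
        # Sections that do not contain c.
        for section in s.split(c):
            count += pieces(len(section))
    # Runs of a single repeated character were counted twice above.
    for _, run in groupby(s):
        count -= pieces(len(list(run)))
    return count


print(count_incomplete_substrings('aacb'))
```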
To get something better than O(n^2) we may need additional assumptions (maybe longest substrings with this property).
Consider a string of the form aaaaaaaaaabbbbbbbbbb of length n. There are at least O(n^2) possible substrings, so if we want to list them all we need O(n^2) time.
I came up with a linear solution for the longest substrings.
Take a set S of all substrings separated by a, all substrings separated by b and finally all substrings separated by c. Each of those steps can be done in O(n), so we have O(3n), thus O(n).
Example:
Take aaabcaaccbaa.
In this case set S contains:
substrings separated by a: bc, ccb
substrings separated by b: aaa, caacc
substrings separated by c: aaab, aa, baa.
By the set I mean a data structure with adding and finding element with a given key in O(1).

Find all words and phrases from one string

Due to subject area (writing on a wall) interesting condition is added - letters cannot change their order, so this is not a question about anagrams.
I saw a long word, written by paint on a wall, and now suddenly
I want all possible words and phrases I can get from this word by painting out any combination of letters. Wo r ds, randomly separated by whitespace are OK.
To broaden possible results let's make an assumption, that space is not necessary to separate words.
Edit: Obviously letter order should be maintained (thanks idz for pointing that out). Also, phrases may be meaningless. Here are some examples:
Source word: disestablishment
paint out: ^ ^^^ ^^^^ ^^
left: i tabl e -> i table
or paint out:^^^^^^^^^ ^ ^^
left: ish e -> i she (spacelessness is ok)
Visual example
Hard mode/bonus task: consider possible slight alterations to letters (D <-> B, C <-> O and so on)
Please suggest your variants of solving this problem.
Here's my general straightforward approach
It's clear that we'll need an English dictionary to find words.
Our goal is to get words to search for in dictionary.
We need to find all possible letters variations to match them against dictionary: each letter can be itself (1) or painted out (0).
Taking the 'space is not needed to separate words' condition in consideration, to distinguish words we must assume that there might be a space between any two letters (1 - there's a space, 0 - there isn't).
d i s e s t a b l i s h m e n t
^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ - possible whitespace
N = number of letters in source word
N-1 = number of 'might-be spaces'
Any of the N + N - 1 elements can be in two states, so let's treat them as booleans. The number of possible variations is 2^(N + N - 1). Yes, it counts useless variants like putting a space between two spaces, but I didn't come up with a more elegant formula.
Now we need an algorithm to get all possible variations of N+N-1 sequence of booleans (I haven't thought it out yet, but word recursion flows through my mind). Then substitute all 1s with corresponding letters (if index of boolean is odd) or whitespace (even)
and 0s with whitespace (odd) or nothing (even). Then trim leading and trailing whitespace, separate words and search them in dictionary.
I don't like this monstrous approach and hope you will help me find good alternatives.
1) Put your dictionary in a trie or prefix tree
2) For each position in the string find legal words by trie look up; store these
3) Print all combinations of non-overlapping words
This assumes that like the examples in the question you want to maintain the letter order (i.e. you are not interested in anagrams).
#!/usr/bin/python3
from itertools import *
from pprint import pprint as pp
Read in dictionary, remove all 1- and 2-letter words which we never use in the English language:
with open('/usr/share/dict/words') as f:
    english = f.read().splitlines()
english = map(str.lower, english)
english = [w for w in english if (len(w)>2 or w in ['i','a','as','at','in','on','im','it','if','is','am','an'])]
def isWord(word):
    return word in english
Your problem:
def splitwords(word):
    """
    splitwords('starts') -> (('st', 'ar', 'ts'), ('st', 'arts'), ('star', 'ts'), ('starts'))
    """
    if word=='':
        yield ()
    for i in range(1,len(word)+1):
        try:
            left,right = word[:i],word[i:]
            if left in english:
                for reading in list(splitwords(right)):
                    yield (left,) + tuple(reading)
            else:
                raise IndexError()
        except IndexError:
            pass
def splitwordsWithDeletions(word):
    masks = product(*[(0,1) for char in word])
    for mask in masks:
        candidate = ''.join(compress(word,mask))
        for reading in splitwords(candidate):
            yield reading
for reading in splitwordsWithDeletions('interesting'):
    print(reading)
Result (takes about 30 seconds):
()
('i',)
('in',)
('tin',)
('ting',)
('sin',)
('sing',)
('sting',)
('eng',)
('rig',)
('ring',)
('rein',)
('resin',)
('rest',)
('rest', 'i')
('rest', 'in')
...
('inters', 'tin')
('inter', 'sting')
('inters', 'ting')
('inter', 'eng')
('interest',)
('interest', 'i')
('interest', 'in')
('interesting',)
Speedup possible perhaps by precalculating which words can be read on each letter, into one bin per letter, and iterating with those pre-calculated to speed things up. I think someone else outlines a solution to that effect.
There are other places you can find anagram algorithms.
subwords(word):
if word is empty return
if word is real word:
print word
anagrams(word)
for each letter in word:
subwords(word minus letter)
Edit: shoot, you'll want to pass a starting point in for the for loop. Otherwise, you'll be redundantly creating a LOT of calls. Frank minus r minus n is the same as Frank minus n minus r. Putting a starting point can ensure that you get each subset once... Except for repeats due to double letters. Maybe just memoize the results to a hash table before printing? Argh...

Resources