Minimal cyclic substring in a bigger cyclic string - algorithm

I am trying to find an algorithm that could return the length of the shortest cyclic substring in a larger cyclic string.
A cyclic string is defined as a concatenation of two or more identical strings, e.g. "abababab" or "aaaa".
For example, in the string T = "abbcabbcabbcabbc" there is a cycle of the pattern "abbc", but the shortest cyclic substring is "bb".

If you're just looking for a substring that appears more than once:
Build a suffix tree from the string.
While creating the suffix tree, you can count the re-occurrences of every substring and store the number of occurrences in its node.
Then just do a BFS on the tree (which gives you a layered search, from shorter to longer strings) and find the first substring of length greater than 1 that occurred more than once.
Total complexity: O(n), where n is the length of the string.
Edit:
The paths from the root to the leaves have a one-to-one relationship with the suffixes of S.
You can implement the tree so that each node contains one letter; that gives you better granularity and lets you see all the substrings by length.
Here's a suffix tree of "banana" where every node contains one letter; you can see that you have all the substrings there.
If you look at the applications section of the suffix tree article, you'll see that it is used for exactly this kind of task: finding things out about substrings.
Looking at the image from the root, you can see ALL the substrings, starting from the root (BFS list):
b
a
n
ba
an
na
ban
ana
nan
bana
anan
nana
banan
anana
banana
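For illustration, here is a brute-force Python sketch of the same search (the suffix tree is what makes it O(n); this naive version just scans lengths from short to long, like the BFS, and returns the first substring of length >= 2 seen more than once):

def shortest_repeated_substring(s):
    n = len(s)
    for length in range(2, n):               # shortest first, like the BFS by depth
        seen = set()
        for i in range(n - length + 1):
            sub = s[i:i + length]
            if sub in seen:
                return sub                   # first repeat at this length
            seen.add(sub)
    return None

print(shortest_repeated_substring("banana"))   # an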

Let me call "abbc" the generator in your example, i.e. the string that you repeat in order to get the bigger string.
The very first observation is that the smallest cyclic substring must be made by repeating some substring exactly twice (if it were some block repeated three or more times, the block repeated just twice would be a shorter cyclic substring).
It's also clear that the smallest cyclic substring is no longer than the generator repeated twice, because 2*generator is itself cyclic.
Now note that you only need to consider the string obtained by taking the generator three times when searching for the smallest cyclic substring. Indeed, if the smallest one is not in 3*generator but is in 4*generator, then it must span at least two whole generators, but then it wouldn't be the smallest.
So now let's assume the bigger string is 3*generator (or 2*generator).
It's also clear that if the generator consists of pairwise distinct characters, then the answer is 2*generator. If not, you just need to find all pairs of identical characters in the bigger string, say at positions i and j, and check whether the string starting at i, which is 2*(j-i) long, is cyclic. If you try the pairs in order of increasing j-i, you can stop at the first success.
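A Python sketch of that search (assuming the input is already the 2x or 3x generator string; pair distances d = j-i are tried in increasing order, so the first hit is the shortest):

def shortest_cyclic_substring(big):
    n = len(big)
    for d in range(1, n // 2 + 1):
        for i in range(n - 2 * d + 1):
            # a block of length d repeated twice, starting at i?
            if big[i] == big[i + d] and big[i:i + d] == big[i + d:i + 2 * d]:
                return big[i:i + 2 * d]
    return None

print(shortest_cyclic_substring("abbcabbcabbc"))   # bb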

Related

How to Modify a Suffix Array to search multiple strings?

I've recently been updating my knowledge of algorithms and have been reading up on suffix arrays. Every text I've read has defined them as an array of suffixes over a single search string, but some articles have mentioned that it's 'trivial' to generalize to an entire list of search strings; I can't see how.
Assume I'm trying to implement a simple substring search over a word list and wish to return a list of words matching a given substring. The naive approach would appear to be to insert the lexicographic end character '$' between words in my list, concatenate them all together, and produce a suffix tree from the result. But this would seem to generate large numbers of irrelevant entries. If I create a source string of 'banana$muffin' then I'll end up generating suffixes for 'ana$muffin' which I'll never use.
I'd appreciate any hints as to how to do this right, or better yet, a pointer to some algorithm texts that handle this case.
With suffix arrays you usually don't use several strings, just one. That one string is the concatenated version of your several strings, each followed by an end token (a different one for every string). For the suffix array you use pointers (or array indexes) to reference each suffix (only the position of the first token/character is needed).
So the space required is the array plus one pointer per suffix. (That is just a pretty simple implementation; you should do more to get more performance.)
In that case you can optimize the sorting algorithm for the suffixes, since you only need to sort the suffixes the pointers reference up to their end tokens. Everything behind the end token does not need to be used in the sorting algorithm.
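For illustration, here is a naive Python sketch of that layout (the function name is mine; the sort compares string slices, so it is far from the optimized constructions, but the end tokens and the (string, offset) bookkeeping are the same):

def generalized_suffix_array(strings):
    text, owner, starts = [], [], []
    for sid, s in enumerate(strings):
        starts.append(len(text))
        for ch in s + chr(1 + sid):      # chr(1 + sid): a distinct low end token per string
            text.append(ch)
            owner.append(sid)
    text = "".join(text)
    # comparisons never get past an end token, since every end token is
    # unique and sorts below every ordinary character
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    return [(owner[i], i - starts[owner[i]]) for i in sa]

for sid, off in generalized_suffix_array(["apple", "maple"]):
    print(sid, off)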
After having now read through most of the book Algorithms on Strings, Trees and Sequences by Dan Gusfield, the answer seems clear.
If you start with a multi-string suffix tree, one of the standard conversion algorithms will still work. However, instead of getting an array of integers, you end up with an array of lists. Each list contains one or more pairs of a string identifier and a starting offset in that string.
The resulting structure is still useful, but not as efficient as a normal suffix array.
From Iowa State University, taken from Prefix.pdf:
Suffix trees and suffix arrays can be generalized to multiple strings. The generalized suffix tree of a set of strings S = {s1, s2, ..., sk}, denoted GST(S) or simply GST, is a compacted trie of all suffixes of each string in S. We assume that the unique termination character $ is appended to the end of each string. A leaf label now consists of a pair of integers (i, j), where i denotes that the suffix is from string si and j denotes the starting position of the suffix in si. Similarly, an edge label in a GST is a substring of one of the strings. An edge label is represented by a triplet of integers (i, j, l), where i denotes the string number, and j and l denote the starting and ending positions of the substring in si. For convenience of understanding, we will continue to show the actual edge labels. Note that two strings may have identical suffixes. This is compensated by allowing leaves in the tree to have multiple labels. If a leaf is multiply labelled, each suffix should come from a different string. If N is the total number of characters (including the $ in each string) of all strings in S, the GST has at most N leaf nodes and takes up O(N) space. The generalized suffix array of S, denoted GSA(S) or simply GSA, is a lexicographically sorted array of all suffixes of each string in S. Each suffix is represented by an integer pair (i, j) denoting the suffix starting from position j in si. If suffixes from different strings are identical, they occupy consecutive positions in the GSA. For convenience, we make an exception for the suffix $ by listing it only once, though it occurs in each string. The GST and GSA of strings apple and maple are shown in Figure 1.2.
Here you have an article about an algorithm to construct a GSA:
Generalized enhanced suffix array construction in external memory

Minimum number of char substitutions to get a palindrome

I would like to solve this problem from TopCoder, in which a string is given and in each step you have to replace all occurrences of a character (of your choice) with another character (of your choice), so that at the end, after all the steps, you get a palindrome. The problem is to find the minimum total number of replacements.
Ideas so far:
I can see that the string after every step is simply a node/vertex in a graph and that the cost of every edge is the number of replacements made in that step, but I don't see how to use a greedy approach on that (it is definitely not the Minimum Spanning Tree problem). I don't think it makes sense to enumerate all possible nodes and edge costs and convert the problem into the Shortest Path problem. On the other hand, I think in every step it makes sense to replace the character X that has the biggest number of conflicts with the character Y that is in conflict with X and occurs most often in the string.
Anyway, I can't prove that this works either, and I can't map it to any known problem. Any ideas?
You need to identify disjoint sets of characters. A disjoint set of characters is a set of characters that will all have to become the same character in order for the string to become a palindrome.
Example:
Let's say we have the string abcdefgfmdebac.
It has 3 disjoint sets: abc, de and fgm.
Algorithm:
Pick the first character and check all occurrences of it, picking up other characters into the set.
In the example string we start with a and pick up b and c (because they sit on the opposite sides of the two a's in our string). We repeat the process for b and c, but no new characters are added to the set. So abc is our first disjoint set.
Continue doing this with the remaining characters.
A disjoint set of n characters (counting all occurrences) needs n-m replacements, where m is the number of occurrences of the most frequent character.
So simply sum over the sets.
In our example it takes 4 + 2 + 2 = 8 replacements.
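In Python this can be done with a small union-find over the characters (a sketch; min_replacements is my name, and the sets are derived from the pairing of s[i] with s[n-1-i] as described above):

from collections import Counter

def min_replacements(s):
    parent = {c: c for c in set(s)}
    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]    # path compression
            c = parent[c]
        return c
    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    n = len(s)
    for i in range(n // 2):
        union(s[i], s[n - 1 - i])            # mirrored characters must end up identical

    counts = Counter(s)
    groups = {}
    for c, cnt in counts.items():
        groups.setdefault(find(c), []).append(cnt)

    # each set keeps its most frequent character and replaces the rest
    return sum(sum(g) - max(g) for g in groups.values())

print(min_replacements("abcdefgfmdebac"))    # 8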

Find strings which are prefixes of other strings

This is an interview question. Given a number of strings, find those strings which are prefixes of others. For example, given strings = {"a", "aa", "ab", "abb"} the result is {"a", "ab"}.
The simplest solution is just to sort the strings and check, for each pair of subsequent strings, whether the first one is a prefix of the second. The running time of the algorithm is the running time of the sorting.
I guess there is another solution, which uses a trie, and has complexity O(N), where N is the number of strings. Could you suggest such an algorithm?
I have the following idea using a trie, complexity O(N):
You start with an empty trie.
You take the words one by one and add each word to the trie.
After you add a word (let's call it Wi) to the trie, there are two cases to consider:
Wi is a prefix of some of the words you added before.
That is the case if you didn't add any nodes to the trie while adding Wi.
In that case, Wi is a prefix and part of our solution.
Some of the words added before are prefixes of Wi.
That is the case if you passed through a node that represents the end of some word added before (let's call that word Wj). In that case, Wj is a prefix of Wi and part of our solution.
In more detail (pseudocode):
for word in words
    add word to trie
    if size of trie did not change then           // first case
        add word to result
    if ending nodes were found while adding word  // second case
        add the words defined by those nodes to result
return result
Adding a new word to the trie:
node = trie.root()
for letter in word
    if node.hasChild(letter) == false then    // if the letter doesn't exist yet, add it
        node.addChild(letter)
    node = node.getChild(letter)              // move down
    if letter is last_letter_of_word then     // if it's the last letter, store that info
        node.setIsLastLetterOf(word)
While you are adding a new word, you can also check whether you passed through any nodes that represent the last letters of other words.
The complexity of the algorithm I described is O(N).
Another important thing is that this way you also know how many times a word Wi prefixes other words, which may be useful.
Example for {aab, aaba, aa}:
Green nodes are nodes detected as case 1.
Red nodes are nodes detected as case 2.
Each column (trie) is one step; at the beginning the trie is empty.
Black arrows show which nodes we visited (added) in that step.
Nodes that represent the last letter of some word have that word written in parentheses.
In step 1 we add word aab.
In step 2 we add word aaba, recognize one case 2 (word aab) and add word aab to result.
In step 3 we add word aa, recognize case 1 and add word aa to result.
At the end we have result = {aab, aa} which is correct.
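A compact runnable version of this idea in Python, using a dict-based trie ("$" is an assumed end-of-word marker key):

def find_prefix_words(words):
    END = "$"                            # marker key: a word ends at this node
    trie, result = {}, set()
    for word in words:
        node, grew = trie, False
        for i, ch in enumerate(word):
            if ch not in node:
                node[ch] = {}
                grew = True              # the trie grew while adding this word
            node = node[ch]
            if END in node and i < len(word) - 1:
                result.add(node[END])    # case 2: an earlier word ends here
        if not grew:
            result.add(word)             # case 1: word is a prefix of an earlier word
        node[END] = word                 # mark the end of this word
    return sorted(result)

print(find_prefix_words(["a", "aa", "ab", "abb"]))    # ['a', 'ab']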
The original answer is for: is string a a substring of b (I misread the question).
Using a trie, you can simply add all the strings to it in a first iteration, and in a second iteration start reading each word; call it w. If you finish reading a word without reaching the string terminator ($ usually), you end up at some node v in the trie.
By doing a DFS from v, you can get all the strings that w is a prefix of.
High-level pseudocode:
t <- new trie
for each word w:
    t.add(w)
for each word w:
    node <- t.getLastNode(w)
    if node.val != $:
        collection <- DFS(node)    (excluding w itself)
        w is a prefix of each word in collection
Note: in order to optimize it, you might need to do some extra work: if a is a prefix of b, and b is a prefix of c, then a is a prefix of c, so when you do the DFS, if you reach some node that was already searched, just append its strings to the current prefix.
Still, since there could be a quadratic number of possibilities ("a", "aa", "aaa", ...), getting all of them requires quadratic time.
Original answer: finding if a is a substring of b:
The suggested solution runs in quadratic complexity: you will need to check each pair of strings, giving you O(n * (n-1) * |S|).
You can build a suffix tree from the strings in the first iteration, and in the second iteration check whether each string is a non-trivial entry (i.e. not the string itself) of another string.
This solution is O(n*|S|).

Efficient data structure for word lookup with wildcards

I need to match a series of user inputed words against a large dictionary of words (to ensure the entered value exists).
So if the user entered:
"orange" it should match an entry "orange' in the dictionary.
Now the catch is that the user can also enter a wildcard or a series of wildcard characters, like say
"or__ge", which would also match "orange".
The key requirements are:
* this should be as fast as possible.
* use the smallest amount of memory to achieve it.
If the size of the word list were small, I could use a string containing all the words and use regular expressions.
However, given that the word list could potentially contain hundreds of thousands of entries, I'm assuming this wouldn't work.
So would some sort of 'tree' be the way to go for this...?
Any thoughts or suggestions on this would be totally appreciated!
Thanks in advance,
Matt
Put your word list in a DAWG (directed acyclic word graph) as described in Appel and Jacobsen's paper on the World's Fastest Scrabble Program (free copy at Columbia). For your search you will traverse this graph maintaining a set of pointers: on a letter, you make a deterministic transition to children with that letter; on a wildcard, you add all children to the set.
The efficiency will be roughly the same as Thompson's NFA interpretation for grep (they are the same algorithm). The DAWG structure is extremely space-efficient—far more so than just storing the words themselves. And it is easy to implement.
Worst-case cost will be the size of the alphabet (26?) raised to the power of the number of wildcards. But unless your query begins with N wildcards, a simple left-to-right search will work well in practice. I'd suggest forbidding a query to begin with too many wildcards, or else create multiple dawgs, e.g., dawg for mirror image, dawg for rotated left three characters, and so on.
Matching an arbitrary sequence of wildcards, e.g., ______ is always going to be expensive because there are combinatorially many solutions. The dawg will enumerate all solutions very quickly.
I would first test the regex solution and see whether it is fast enough - you might be surprised! :-)
However if that wasn't good enough I would probably use a prefix tree for this.
The basic structure is a tree where:
The nodes at the top level are all the possible first letters (i.e. probably 26 nodes from a-z assuming you are using a full dictionary...).
The next level down contains all the possible second letters for each given first letter
And so on until you reach an "end of word" marker for each word
Testing whether a given string with wildcards is contained in your dictionary is then just a simple recursive algorithm where you either have a direct match for each character position, or in the case of the wildcard you check each of the possible branches.
In the worst case (all wildcards but only one word with the right number of letters right at the end of the dictionary), you would traverse the entire tree but this is still only O(n) in the size of the dictionary so no worse than a full regex scan. In most cases it would take very few operations to either find a match or confirm that no such match exists since large branches of the search tree are "pruned" with each successive letter.
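A minimal Python sketch of this recursive lookup, using a dict-based trie with "$" as an assumed end-of-word marker and "_" as the wildcard:

def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = True                 # end-of-word marker
    return root

def matches(node, pattern, i=0):
    if i == len(pattern):
        return "$" in node               # pattern consumed: a word must end here
    ch = pattern[i]
    if ch == "_":                        # wildcard: recurse into every branch
        return any(matches(child, pattern, i + 1)
                   for key, child in node.items() if key != "$")
    return ch in node and matches(node[ch], pattern, i + 1)

trie = build_trie(["orange", "oracle", "grape"])
print(matches(trie, "or__ge"))    # True
print(matches(trie, "or_ge"))     # False (wrong length)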
No matter which algorithm you choose, you have a tradeoff between speed and memory consumption.
If you can afford ~O(N*L) memory (where N is the size of your dictionary and L is the average length of a word), you can try this very fast algorithm. For simplicity, we will assume a Latin alphabet with 26 letters and MAX_LEN as the maximum length of a word.
Create a 2D array of sets of integers, set<int> table[26][MAX_LEN].
For each word in your dictionary, add the word's index to the sets at the positions corresponding to each of the letters of the word. For example, if "orange" is the 12345-th word in the dictionary, you add 12345 to the sets corresponding to [o][0], [r][1], [a][2], [n][3], [g][4], [e][5].
Then, to retrieve words corresponding to "or..ge", you find the intersection of the sets at [o][0], [r][1], [g][4], [e][5].
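A Python sketch of this table (the extra by-length index is my addition, used only to seed the candidate set with words of the right length):

from collections import defaultdict

def build_table(words):
    table = defaultdict(set)             # (letter, position) -> indexes of words
    by_len = defaultdict(set)            # word length -> indexes of words
    for idx, w in enumerate(words):
        by_len[len(w)].add(idx)
        for pos, ch in enumerate(w):
            table[(ch, pos)].add(idx)
    return table, by_len

def lookup(pattern, words, table, by_len):
    candidates = by_len[len(pattern)]    # words of the right length
    for pos, ch in enumerate(pattern):
        if ch != "_":                    # wildcards constrain nothing
            candidates = candidates & table[(ch, pos)]
    return [words[i] for i in sorted(candidates)]

words = ["orange", "oracle", "grape"]
table, by_len = build_table(words)
print(lookup("or__ge", words, table, by_len))    # ['orange']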
You can try a string-matrix:
0,1: A
1,5: APPLE
2,5: AXELS
3,5: EAGLE
4,5: HELLO
5,5: WORLD
6,6: ORANGE
7,8: LONGWORD
8,13:SUPERLONGWORD
Let's call this a ragged index-matrix, to spare some memory. Order it on length, and then alphabetically. To address a character I use the notation x,y:z: x is the index, y is the length of the entry, z is the position. The length of your search string is f, and g is the number of entries in the dictionary.
Create a list m, which contains potential match indexes x.
Iterate on z from 0 to f:
    Is it a wildcard and not the last character of the search string?
        Continue the loop (everything matches).
    Is m empty?
        Search through all x from 0 to g for a y that matches the length. !!A!!
        Does the character at z match the search string at that z? Save x in m.
        Is m empty? Break the loop (no match).
    Is m not empty?
        Search through all the elements of m. !!B!!
        Does it not match the search? Remove it from m.
        Is m empty? Break the loop (no match).
A wildcard will always pass the "match with search string?" test. And m is ordered the same way as the matrix.
!!A!!: Binary search on length of the search string. O(log n)
!!B!!: Binary search on alphabetical ordering. O(log n)
The reason for using a string-matrix is that you already store the length of each string (because it makes searching faster), but it also gives you the offset of each entry (assuming otherwise constant fields), so that you can easily find the next entry in the matrix, for fast iterating. Ordering the matrix isn't a problem: it only has to be done when the dictionary updates, not at search time.
If you are allowed to ignore case, which I assume, then make all the words in your dictionary and all the search terms the same case before anything else. Upper or lower case makes no difference. If you have some words that are case sensitive and others that are not, break the words into two groups and search each separately.
You are only matching words, so you can break the dictionary into an array of strings. Since you are only doing an exact match against a known length, break the word array into a separate array for each word length. So byLength[3] is the array of all words with length 3. Each word array should be sorted.
Now you have an array of words and a word with potential wildcards to find. Depending on whether and where the wildcards are, there are a few approaches.
If the search term has no wild cards, then do a binary search in your sorted array. You could do a hash at this point, which would be faster but not much. If the vast majority of your search terms have no wildcards, then consider a hash table or an associative array keyed by hash.
If the search term has wildcards after some literal characters, then do a binary search in the sorted array to find an upper and lower bound, then do a linear search in that bound. If the wildcards are all trailing then finding a non empty range is sufficient.
If the search term starts with wild cards, then the sorted array is no help and you would need to do a linear search unless you keep a copy of the array sorted by backwards strings. If you make such an array, then choose it any time there are more trailing than leading literals. If you do not allow leading wildcards then there is no need.
If the search term both starts and ends with wildcards, then you are stuck with a linear search within the words with equal length.
So an array of arrays of strings. Each array of strings is sorted, and contains strings of equal length. Optionally duplicate the whole structure with the sorting based on backwards strings for the case of leading wildcards.
The overall space is one or two pointers per word, plus the words. You should be able to store all the words in a single buffer if your language permits. Of course, if your language does not permit, grep is probably faster anyway. For a million words, that is 4-16MB for the arrays and similar for the actual words.
For a search term with no wildcards, performance would be very good. With wildcards, there will occasionally be linear searches across large groups of words. With the breakdown by length and a single leading character, you should never need to search more than a few percent of the total dictionary even in the worst case. Comparing only whole words of known length will always be faster than generic string matching.
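A Python sketch of the "literal prefix, then bounded scan" case described above, using bisect over one of the sorted same-length arrays (function names are mine):

from bisect import bisect_left

def wildcard_search(sorted_words, pattern):
    # sorted_words holds all dictionary words of len(pattern), sorted
    prefix = pattern.split("_", 1)[0]    # literal characters before the first wildcard
    lo = bisect_left(sorted_words, prefix)
    hi = bisect_left(sorted_words, prefix + "\U0010FFFF")   # end of the prefix range
    def ok(word):
        return all(p == "_" or p == c for p, c in zip(pattern, word))
    return [w for w in sorted_words[lo:hi] if ok(w)]

words6 = sorted(["orange", "oracle", "orchid"])
print(wildcard_search(words6, "or__ge"))    # ['orange']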
Try building a generalized suffix tree if the dictionary will be matched by a sequence of queries. There is a linear-time algorithm that can be used to build such a tree (Ukkonen's suffix tree construction).
You can easily match each query in O(k), where k is the size of the query, by traversing from the root node and using the wildcard character to match any character, as in typical pattern finding in a suffix tree.

Finding dictionary words

I have a lot of compound strings that are a combination of two or three English words.
e.g. "Spicejet" is a combination of the words "spice" and "jet"
I need to separate these individual English words from such compound strings. My dictionary is going to consist of around 100000 words.
What would be the most efficient way to separate the individual English words from such compound strings?
I'm not sure how much time or frequency you have to do this (is it a one-time operation? daily? weekly?) but you're obviously going to want a quick, weighted dictionary lookup.
You'll also want to have a conflict resolution mechanism, perhaps a side-queue to manually resolve conflicts on tuples that have multiple possible meanings.
I would look into Tries. Using one you can efficiently find (and weight) your prefixes, which are precisely what you will be looking for.
You'll have to build the Tries yourself from a good dictionary source, and weight the nodes on full words to provide yourself a good quality mechanism for reference.
Just brainstorming here, but if you know your dataset consists primarily of duplets or triplets, you could probably get away with multiple trie lookups, for example looking up 'Spic' and then 'ejet', and then, finding that both results have a low score, backtracking to 'Spice' and 'Jet', where both tries would yield a good combined result between the two.
Also I would consider utilizing frequency analysis on the most common prefixes up to an arbitrary or dynamic limit, e.g. filtering 'the' or 'un' or 'in' and weighting those accordingly.
Sounds like a fun problem, good luck!
If the aim is to find the "the largest possible break up for the input" as you replied, then the algorithm could be fairly straightforward if you use some graph theory. You take the compound word and make a graph with a vertex before and after every letter. You'll have a vertex for each index in the string and one past the end. Next you find all legal words in your dictionary that are substrings of the compound word. Then, for each legal substring, add an edge with weight 1 to the graph connecting the vertex before the first letter in the substring with the vertex after the last letter in the substring. Finally, use a shortest path algorithm to find the path with fewest edges between the first and the last vertex.
The pseudocode is something like this:
parseWords(compoundWord)
    # Make the graph
    graph = makeGraph()
    N = compoundWord.length
    for index = 0 to N
        graph.addVertex(index)
    # Add the edges for each word
    for index = 0 to N - 1
        for length = 1 to min(N - index, MAX_WORD_LENGTH)
            potentialWord = compoundWord.substr(index, length)
            if dictionary.isElement(potentialWord)
                graph.addEdge(index, index + length, 1)
    # Now find a list of edges which define the shortest path
    edges = graph.shortestPath(0, N)
    # Change these edges back into words
    result = makeList()
    for e in edges
        result.add(compoundWord.substr(e.start, e.stop - e.start))
    return result
I, obviously, haven't tested this pseudo-code, and there may be some off-by-one indexing errors, and there isn't any bug-checking, but the basic idea is there. I did something similar to this in school and it worked pretty well. The edge creation loops are O(M * N), where N is the length of the compound word, and M is the maximum word length in your dictionary or N (whichever is smaller). The shortest path algorithm's runtime will depend on which algorithm you pick. Dijkstra's comes most readily to mind. I think its runtime is O(N^2 * log(N)), since the max edges possible is N^2.
You can use any shortest path algorithm. There are several shortest path algorithms which have their various strengths and weaknesses, but I'm guessing that for your case the difference will not be too significant. If, instead of trying to find the fewest possible words to break up the compound, you wanted to find the most possible, then you give the edges negative weights and try to find the shortest path with an algorithm that allows negative weights.
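Since every edge has weight 1, plain BFS already finds the fewest-words split; here is a runnable Python version of the sketch (parse_words and max_word_len are my names, not from the original):

from collections import deque

def parse_words(compound, dictionary, max_word_len=30):
    n = len(compound)
    prev = {0: None}                     # vertex -> (previous vertex, word read)
    queue = deque([0])
    while queue:
        u = queue.popleft()
        for length in range(1, min(n - u, max_word_len) + 1):
            v = u + length
            word = compound[u:v]
            if word in dictionary and v not in prev:
                prev[v] = (u, word)      # first discovery = fewest edges (BFS)
                queue.append(v)
    if n not in prev:
        return None                      # no split into dictionary words exists
    words, v = [], n
    while prev[v] is not None:           # walk the path back from the last vertex
        u, word = prev[v]
        words.append(word)
        v = u
    return words[::-1]

print(parse_words("spicejet", {"spice", "spic", "ice", "jet"}))    # ['spice', 'jet']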
And how will you decide how to divide things? Look around the web and you'll find examples of URLs that turned out to have other meanings.
Assuming you didn't have the capitals to go on, what would you do with these (ones that come to mind at present; I know there are more)?
PenIsland
KidsExchange
TherapistFinder
The last one is particularly problematic because the troublesome part is two words run together but is not a compound word, the meaning completely changes when you break it.
So, given a word, is it a compound word, composed of two other English words? You could have some sort of lookup table for all such compound words, but if you just examine the candidates and try to match against English words, you will get false positives.
Edit: it looks as if I am going to have to provide some examples. Words I was thinking of include:
accustomednesses != accustomed + nesses
adulthoods != adult + hoods
agreeabilities != agree + abilities
willingest != will + ingest
windlasses != wind + lasses
withstanding != with + standing
yourselves != yours + elves
zoomorphic != zoom + orphic
ambassadorships != ambassador + ships
allotropes != allot + ropes
Here is some python code to try out to make the point. Get yourself a dictionary on disk and have a go:
from __future__ import with_statement

def opendict(dictionary=r"g:\words\words(3).txt"):
    with open(dictionary, "r") as f:
        return set(line.strip() for line in f)

if __name__ == '__main__':
    s = opendict()
    for word in sorted(s):
        if len(word) >= 10:
            for i in range(4, len(word) - 4):
                left, right = word[:i], word[i:]
                if (left in s) and (right in s):
                    if right not in ('nesses', ):
                        print word, left, right
It sounds to me like you want to store your dictionary in a trie or a DAWG data structure.
A trie already stores words as compound words. So "spicejet" would be stored as "spice*jet*", where the * denotes the end of a word. All you'd have to do is look up the compound word in the dictionary and keep track of how many end-of-word terminators you hit. From there you would then have to try each substring (in this example, we don't yet know whether "jet" is a word, so we'd have to look that up).
It occurs to me that there are a relatively small number of substrings (minimum length 2) from any reasonable compound word. For example for "spicejet" I get:
'sp', 'pi', 'ic', 'ce', 'ej', 'je', 'et',
'spi', 'pic', 'ice', 'cej', 'eje', 'jet',
'spic', 'pice', 'icej', 'ceje', 'ejet',
'spice', 'picej', 'iceje', 'cejet',
'spicej', 'piceje', 'icejet',
'spiceje', 'picejet'
... 27 substrings in all.
So, find a function to generate all of those (slide across your string using strides of 2, 3, 4, ..., len(yourstring) - 1) and then simply check each of them in a set or hash table.
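For example, one such generator in Python (lengths 2 up to len - 1, matching the count above; the dictionary here is a made-up stand-in):

def proper_substrings(s):
    n = len(s)
    return [s[i:i + length]
            for length in range(2, n)
            for i in range(n - length + 1)]

subs = proper_substrings("spicejet")
print(len(subs))                                   # 27
dictionary = {"spice", "ice", "jet"}
print([w for w in subs if w in dictionary])        # ['ice', 'jet', 'spice']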
A similar question was asked recently: Word-separating algorithm. If you wanted to limit the number of splits, you would keep track of the number of splits in each of the tuples (so instead of a pair, a triple).
Word existence could be done with a trie, or more simply with a set (i.e. a hash table). Given a suitable function, you could do:
# python-ish pseudocode (here as runnable Python, assuming an is_word predicate)
def splitword(word):
    n = len(word)
    for i in range(1, n + 1):
        head = word[:i]    # first i characters
        tail = word[i:]    # everything else
        if is_word(head):
            if i == n:
                return [head]    # the whole word is a valid word; return it as a 1-element list
            rest = splitword(tail)
            if rest != []:       # check whether we successfully split the tail into words
                return [head] + rest
    return []    # no successful split found, and 'word' is not a word
Basically, just try the different break points to see if we can make words. The recursion means it will backtrack until a successful split is found.
Of course, this may not find the splits you want. You could modify this to return all possible splits (instead of merely the first found), then do some kind of weighted sum, perhaps, to prefer common words over uncommon words.
This can be a very difficult problem and there is no simple general solution (there may be heuristics that work for small subsets).
We face exactly this problem in chemistry where names are composed by concatenation of morphemes. An example is:
ethylmethylketone
where the morphemes are:
ethyl methyl and ketone
We tackle this through automata and maximum entropy and the code is available on Sourceforge
http://www.sf.net/projects/oscar3-chem
but be warned that it will take some work.
We sometimes encounter ambiguity and are still finding a good way of reporting it.
To distinguish between penIsland and penisLand would require domain-specific heuristics. The likely interpretation will depend on the corpus being used - no linguistic problem is independent from the domain or domains being analysed.
As another example the string
weeknight
can be parsed as
wee knight
or
week night
Both are "right" in that they obey the form "adj-noun" or "noun-noun". Both make "sense" and which is chosen will depend on the domain of usage. In a fantasy game the first is more probable and in commerce the latter. If you have problems of this sort then it will be useful to have a corpus of agreed usage which has been annotated by experts (technically a "Gold Standard" in Natural Language Processing).
I would use the following algorithm:
1. Start with the sorted list of words to split, and a sorted list of declined words (the dictionary).
2. Create a result list of objects, each of which stores: the remaining word and the list of matched words.
3. Fill the result list with the words to split as the remaining words.
4. Walk through the result array and the dictionary concurrently, always incrementing the lesser of the two, in a manner similar to the merge algorithm. In this way you can compare all the possible matching pairs in one pass.
5. Any time you find a match, i.e. a word to split that starts with a dictionary word, record the matching dictionary word and replace the remaining part accordingly in the result list. You have to take into account possible multiples.
6. Any time the remaining part is empty, you have found a final result.
7. Any time you don't find a match on the "left side", in other words every time you increment the result pointer because of no match, delete the corresponding result item. This word has no matches and can't be split.
8. Once you get to the bottom of the lists, you will have a list of partial results. Repeat the loop until this is empty: go to point 4.
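A condensed Python sketch of the overall loop (it replaces the concurrent merge walk with a simple startswith scan, so it shows the bookkeeping rather than the one-pass efficiency; assumes non-empty dictionary words):

def split_all(words, dictionary):
    dictionary = sorted(dictionary)
    finished = []
    partial = [(w, []) for w in sorted(words)]     # (remaining word, matched words)
    while partial:
        next_partial = []
        for remaining, matched in partial:
            for d in dictionary:
                if remaining.startswith(d):        # point 5: a match
                    rest = remaining[len(d):]
                    if rest:
                        next_partial.append((rest, matched + [d]))
                    else:
                        finished.append(matched + [d])   # point 6: final result
            # items with no match at all are simply dropped here (point 7)
        partial = next_partial                     # point 8: repeat
    return finished

print(split_all(["spicejet"], {"spice", "spic", "jet", "ice"}))
# [['spice', 'jet']]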
