Find all words and phrases from one string - algorithm

Because of the subject area (writing on a wall), an interesting condition is added - letters cannot change their order, so this is not a question about anagrams.
I saw a long word written in paint on a wall, and now suddenly
I want all possible words and phrases I can get from this word by painting out any combination of letters. Wo r ds, randomly separated by whitespace, are OK.
To broaden the possible results, let's make the assumption that a space is not necessary to separate words.
Edit: Obviously letter order should be maintained (thanks idz for pointing that out). Also, phrases may be meaningless. Here are some examples:
Source word: disestablishment
paint out: ^ ^^^ ^^^^ ^^
left: i tabl e -> i table
or paint out:^^^^^^^^^ ^ ^^
left: ish e -> i she (spacelessness is ok)
Visual example
Hard mode/bonus task: consider possible slight alterations to letters (D <-> B, C <-> O and so on)
Please suggest your variants of solving this problem.
Here's my general straightforward approach
It's clear that we'll need an English dictionary to find words.
Our goal is to get words to search for in the dictionary.
We need to find all possible letter variations to match against the dictionary: each letter can be itself (1) or painted out (0).
Taking the 'space is not needed to separate words' condition into consideration, to distinguish words we must assume that there might be a space between any two letters (1 - there's a space, 0 - there isn't).
d i s e s t a b l i s h m e n t
 ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ - possible whitespace
N = number of letters in source word
N-1 = number of 'might-be spaces'
Any of the N + N - 1 elements can be in two states, so let's treat them as booleans. The number of possible variations is 2^(N + N - 1). Yes, it counts useless variants like pasting a space between two spaces, but I didn't come up with a more elegant formula.
Now we need an algorithm to get all possible variations of the (N + N - 1)-long sequence of booleans (I haven't thought it out yet, but the word 'recursion' flows through my mind). Then substitute every 1 with the corresponding letter (if the boolean's index is odd) or whitespace (if even),
and every 0 with whitespace (odd) or nothing (even). Then trim leading and trailing whitespace, separate the words and search for them in the dictionary.
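For concreteness, here is a minimal Python sketch of that enumeration (purely illustrative and assuming a dictionary set of lowercase words is available; it really does walk all 2^(N + N - 1) variants, so it is only practical for short words):
from itertools import product

def brute_force_phrases(word, dictionary):
    # Enumerate phrases readable from `word` by painting out letters and
    # optionally inserting spaces between the remaining positions (a sketch).
    n = len(word)
    results = set()
    for letter_mask in product((0, 1), repeat=n):          # keep / paint out each letter
        for space_mask in product((0, 1), repeat=n - 1):   # space / no space in each gap
            pieces = []
            for i, ch in enumerate(word):
                if letter_mask[i]:
                    pieces.append(ch)
                if i < n - 1 and space_mask[i]:
                    pieces.append(' ')
            phrase = ' '.join(''.join(pieces).split())     # trim and collapse whitespace
            if phrase and all(w in dictionary for w in phrase.split()):
                results.add(phrase)
    return results

# Tiny usage example (hypothetical mini-dictionary):
# print(brute_force_phrases('table', {'tab', 'able', 'a', 'table'}))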
I don't like this monstrous approach and hope you will help me find good alternatives.

1) Put your dictionary in a trie or prefix tree
2) For each position in the string find legal words by trie look up; store these
3) Print all combinations of non-overlapping words
This assumes that like the examples in the question you want to maintain the letter order (i.e. you are not interested in anagrams).
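A rough illustrative sketch of those three steps in Python, using a plain dict-of-dicts trie. The helper names (build_trie, words_at, phrases) are made up for this sketch; a word is read as an order-preserving subsequence, so painted-out letters are simply skipped:
END = '$'

def build_trie(dictionary):
    root = {}
    for w in dictionary:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node[END] = w
    return root

def words_at(s, start, trie):
    # Yield (word, last_index) pairs for dictionary words whose first letter is s[start].
    def walk(node, i):
        if END in node:
            yield node[END], i
        for j in range(i + 1, len(s)):
            if s[j] in node:
                yield from walk(node[s[j]], j)
    if s[start] in trie:
        yield from walk(trie[s[start]], start)

def phrases(s, trie, start=0):
    # Yield tuples of non-overlapping words, kept in left-to-right order.
    yield ()
    for i in range(start, len(s)):
        for word, last in words_at(s, i, trie):
            for rest in phrases(s, trie, last + 1):
                yield (word,) + rest

# trie = build_trie(['i', 'table', 'she'])
# print(set(phrases('disestablishment', trie)))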

#!/usr/bin/python3
from itertools import *
from pprint import pprint as pp
Read in dictionary, remove all 1- and 2-letter words which we never use in the English language:
with open('/usr/share/dict/words') as f:
    english = f.read().splitlines()
english = map(str.lower, english)
english = [w for w in english if (len(w)>2 or w in ['i','a','as','at','in','on','im','it','if','is','am','an'])]
def isWord(word):
    return word in english
Your problem:
def splitwords(word):
    """
    splitwords('starts') -> (('st', 'ar', 'ts'), ('st', 'arts'), ('star', 'ts'), ('starts'))
    """
    if word=='':
        yield ()
    for i in range(1,len(word)+1):
        try:
            left,right = word[:i],word[i:]
            if left in english:
                for reading in list(splitwords(right)):
                    yield (left,) + tuple(reading)
            else:
                raise IndexError()
        except IndexError:
            pass
def splitwordsWithDeletions(word):
    masks = product(*[(0,1) for char in word])
    for mask in masks:
        candidate = ''.join(compress(word,mask))
        for reading in splitwords(candidate):
            yield reading
for reading in splitwordsWithDeletions('interesting'):
    print(reading)
Result (takes about 30 seconds):
()
('i',)
('in',)
('tin',)
('ting',)
('sin',)
('sing',)
('sting',)
('eng',)
('rig',)
('ring',)
('rein',)
('resin',)
('rest',)
('rest', 'i')
('rest', 'in')
...
('inters', 'tin')
('inter', 'sting')
('inters', 'ting')
('inter', 'eng')
('interest',)
('interest', 'i')
('interest', 'in')
('interesting',)
A speedup is perhaps possible by precalculating which words can be read starting at each letter (one bin per letter) and iterating over those precalculated bins. I think someone else outlines a solution to that effect.

There are other places you can find anagram algorithms.
subwords(word):
    if word is empty return
    if word is real word:
        print word
        anagrams(word)
    for each letter in word:
        subwords(word minus letter)
Edit: shoot, you'll want to pass a starting point in for the for loop. Otherwise, you'll be redundantly creating a LOT of calls. Frank minus r minus n is the same as Frank minus n minus r. Putting a starting point can ensure that you get each subset once... Except for repeats due to double letters. Maybe just memoize the results to a hash table before printing? Argh...
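A small Python sketch of the fix described in this edit, assuming dictionary is a set of valid words (the anagram step is left out). Deleting letters at non-decreasing indices means each deletion set is visited exactly once, and the printed set only suppresses duplicate output caused by repeated letters:
def subwords(word, dictionary, start=0, printed=None):
    if printed is None:
        printed = set()
    if word in dictionary and word not in printed:
        printed.add(word)
        print(word)
    for i in range(start, len(word)):
        # Only delete at positions >= start, so 'Frank minus r minus n' and
        # 'Frank minus n minus r' are not both explored.
        subwords(word[:i] + word[i+1:], dictionary, i, printed)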

Related

Splitting a sentence to minimize sentence lengths

I have come across the following problem statement:
You have a sentence written entirely in a single row. You would like to split it into several rows by replacing some of the spaces with "new row" indicators. Your goal is to minimize the width of the longest row in the resulting text ("new row" indicators do not count towards the width of a row). You may replace at most K spaces.
You will be given a sentence and a K. Split the sentence using the procedure described above and return the width of the longest row.
I am a little lost with where to start. To me, it seems I need to try to figure out every possible sentence length that satisfies the criteria of splitting the single sentence up into K lines.
I can see a couple of edge cases:
There are <= K words in the sentence, therefore return the longest word.
The sentence length is 0, return 0
If neither of those criteria is true, then we have to determine all possible combinations of splitting the sentence and then return the minimum of all those options. This is the part I don't know how to do (and it is obviously the heart of the problem).
You can solve it by inverting the problem. Let's say I fix the length of the longest split to L. Can you compute the minimum number of breaks you need to satisfy it?
Yes, you just break before the first word that would go over L and count them up (O(N)).
So now that we have that, we just have to find the minimum L that would require at most K breaks. You can do a binary search on the length of the input. Final complexity: O(N log N).
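A Python sketch of this approach (the function names are illustrative): a greedy feasibility check counts how many breaks a given width forces, and a binary search finds the smallest feasible width:
def breaks_needed(words, width):
    # Greedily pack words into rows of at most `width` characters and count breaks.
    breaks, line = 0, -1              # -1 so the first word adds no leading space
    for w in words:
        if len(w) > width:
            return float('inf')       # a single word longer than `width` never fits
        if line + 1 + len(w) <= width:
            line += 1 + len(w)
        else:
            breaks += 1
            line = len(w)
    return breaks

def min_longest_row(sentence, K):
    words = sentence.split()
    if not words:
        return 0
    lo, hi = max(len(w) for w in words), len(sentence)
    while lo < hi:
        mid = (lo + hi) // 2
        if breaks_needed(words, mid) <= K:
            hi = mid                  # width mid is achievable, try smaller
        else:
            lo = mid + 1
    return lo

# print(min_longest_row('you would like to split it into several rows', 2))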
First Answer
What you want to achieve is Minimum Raggedness. If you just want the algorithm, it is here as a PDF. If the research paper's link goes bad, please search for the famous paper named Breaking Paragraphs into Lines by Knuth.
However if you want to get your hands over some implementations of the same, in the question Balanced word wrap (Minimum raggedness) in PHP on SO, people have actually given implementation not only in PHP but in C, C++ and bash as well.
Second Answer
Though this is not exactly a correct approach, it is quick and dirty if you are looking for something like that. This method will not return the correct answer in every case. It is for those people for whom time to ship their product is more important.
Idea
You already know the length of your input string. Let's call it L;
When putting in K breaks, the best scenario would be to be able to break the string into parts of exactly L / (K + 1) size;
So break your string at the word which makes the resulting sentence part's length least far from L / (K + 1);
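A quick-and-dirty Python sketch of this heuristic (illustrative only, and, as said above, not guaranteed to give the correct answer):
def heuristic_split(sentence, K):
    words = sentence.split()
    target = len(sentence) / (K + 1)   # ideal chunk length
    rows, current = [], ''
    for w in words:
        extended = (current + ' ' + w).strip()
        # Break before this word if the current chunk is already at least as
        # close to the target as the extended chunk would be.
        if current and len(rows) < K and abs(len(current) - target) <= abs(len(extended) - target):
            rows.append(current)
            current = w
        else:
            current = extended
    rows.append(current)
    return rows

# rows = heuristic_split('you would like to split it into several rows', 2)
# print(rows, max(len(r) for r in rows))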
My recursive solution, which can be improved through memoization or dynamic programming.
def split(self, sentence, K):
    if not sentence: return 0
    if ' ' not in sentence or K == 0: return len(sentence)
    spaces = [i for i, s in enumerate(sentence) if s == ' ']
    res = 100000
    for space in spaces:
        res = min(res, max(space, self.split(sentence[space+1:], K-1)))
    return res
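For reference, a sketch of the memoization improvement mentioned above, with the method rewritten as a standalone function so functools.lru_cache can cache on the (sentence, K) pair:
from functools import lru_cache

@lru_cache(maxsize=None)
def split(sentence, K):
    if not sentence:
        return 0
    if ' ' not in sentence or K == 0:
        return len(sentence)
    spaces = [i for i, s in enumerate(sentence) if s == ' ']
    # Try every position for the first break and keep the best result.
    return min(max(space, split(sentence[space+1:], K - 1)) for space in spaces)

# print(split('you would like to split it into several rows', 2))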

Counting in Wonderland

The text of Alice in Wonderland contains the word 'Wonderland' 8 times. (Let's be case-insensitive for this question).
However, it contains the word many more times if you count non-contiguous subsequences as well as substrings, e.g.
Either the well was very deep, or she fell very slowly, for she had
plenty of time as she went down to look about her and to WONDER what was
going to happen next. First, she tried to Look down AND make out what
she was coming to, but it was too dark to see anything;
(A subsequence is a sequence that can be derived from another sequence by deleting some elements without changing the order of the remaining elements. —Wikipedia)
How many times does the book contain the word Wonderland as a subsequence? I expect this will be a big number—it's a long book with many w's and o's and n's and d's.
I tried brute force counting (recursion to make a loop 10 deep) but it was too slow, even for that example paragraph.
Let's say you didn't want to search for wonderland, but just for w. Then you'd simply count how many times w occurred in the story.
Now let's say you want wo. For each first character of the current pattern you find, you add to your count:
How many times the current pattern without its first character occurs in the rest of the story, after this character you're at: so you have reduced the problem (story[1..n], pattern[1..n]) to (story[2..n], pattern[2..n])
How many times the entire current pattern occurs in the rest of the story. So you have reduced the problem to (story[2..n], pattern[1..n])
Now you can just add the two. There is no overcounting if we talk in terms of subproblems. Consider the example wawo. Obviously, wo occurs 2 times. You might think the counting will go like:
For the first w, add 1 because o occurs once after it and another 1 because wo occurs once after it.
For the second w, add 1 because o occurs once after it.
Answer is 3, which is wrong.
But this is what actually happens:
(wawo, wo) -> (awo, o) -> (wo, o) -> (o, o) -> (-, -) -> 1
                                            -> (-, o) -> 0
           -> (awo, wo) -> (wo, wo) -> (o, wo) -> (-, wo) -> 0
                                    -> (o, o) -> (-, -) -> 1
                                              -> (-, o) -> 0
So you can see that the answer is 2.
If you don't find a w, then the count for this position is just how many times wo occurs after this current character.
This allows for dynamic programming with memoization:
count(story_index, pattern_index, dp):
    if dp[story_index, pattern_index] not computed:
        if pattern_index == len(pattern):
            return 1
        if story_index == len(story):
            return 0
        if story[story_index] == pattern[pattern_index]:
            dp[story_index, pattern_index] = count(story_index + 1, pattern_index + 1, dp) +
                                             count(story_index + 1, pattern_index, dp)
        else:
            dp[story_index, pattern_index] = count(story_index + 1, pattern_index, dp)
    return dp[story_index, pattern_index]
Call with count(0, 0, dp). Note that you can make the code cleaner (remove the duplicate function call).
Python code, with no memoization:
def count(story, pattern):
    if len(pattern) == 0:
        return 1
    if len(story) == 0:
        return 0
    s = count(story[1:], pattern)
    if story[0] == pattern[0]:
        s += count(story[1:], pattern[1:])
    return s

print(count('wonderlandwonderland', 'wonderland'))
Output:
17
This makes sense: for each choice of the first i characters taken from the first wonderland in the story, you can group them with the remaining final characters of the second wonderland, giving you 10 solutions. Another 2 are the words themselves. The other five are:
wonderlandwonderland
*********    *
********    **
********    *      *
**      **    ******
***      *    ******
You're right that this will be a huge number. I suggest that you either use large integers or take the result modulo something.
The same program returns 9624 for your example paragraph.
The string "wonderland" occurs as a subsequence in Alice in Wonderland [1] a total of 24100772180603281661684131458232 times.
The main idea is to scan the main text character by character, keeping a running count of how often each prefix of the target string (i.e.: in this case, "w", "wo", "won", ..., "wonderlan", and "wonderland") has occurred up to the current letter. These running counts are easy to compute and update. If the current letter does not occur in "wonderland", then the counts are left untouched. If the current letter is "a" then we increment the count of "wonderla"s seen by the number of "wonderl"s seen up to this point. If the current letter is "n" then we increment the count of "won"s by the count of "wo"s and the count of "wonderlan"s by the count of "wonderla"s. And so forth. When we reach end of the text, we will have the count of all prefixes of "wonderland" including the string "wonderland" itself, as desired.
The advantage of this approach is that it requires a single pass through the text and does not require O(n) recursive calls (which will likely exceed the maximum recursion depth unless you do something clever).
Code
import fileinput
import string
target = 'wonderland'
prefixes = dict()
count = dict()
for i in range(len(target)) :
letter = target[i]
prefix = target[:i+1]
if letter not in prefixes :
prefixes[letter] = [prefix]
else :
prefixes[letter].append(prefix)
count[prefix] = 0L
for line in fileinput.input() :
for letter in line.lower() :
if letter in prefixes :
for prefix in prefixes[letter] :
if len(prefix) > 1 :
count[prefix] = count[prefix] + count[prefix[:len(prefix)-1]]
else:
count[prefix] = count[prefix] + 1
print count[target]
[1] Using this text from Project Gutenberg, starting with "CHAPTER I. Down the Rabbit-Hole" and ending with "THE END".
Following up on previous comments, if you are looking for an algorithm that would return 2 for the input wonderlandwonderland and 1 for wonderwonderland, then I think you could adapt the algorithm from this question:
How to find smallest substring which contains all characters from a given string?
Effectively, the change in your case would be that, once an instance of the word is found, you increment a counter and repeat all the procedure with the remaining part of the text.
Such an algorithm would be O(n) in time, where n is the length of the text, and O(m) in space, where m is the length of the searched string.
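For illustration, a short Python sketch of that adaptation: scan the text once, advancing a pointer into the target word, and every time the whole word has been matched, count one occurrence and start over. This counts non-overlapping subsequence occurrences in O(n) time and O(1) extra space:
def count_nonoverlapping(text, word):
    count, j = 0, 0
    for ch in text.lower():
        if ch == word[j]:
            j += 1
            if j == len(word):     # matched the whole word; count it and restart
                count += 1
                j = 0
    return count

# count_nonoverlapping('wonderlandwonderland', 'wonderland')  -> 2
# count_nonoverlapping('wonderwonderland', 'wonderland')      -> 1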

Spell checker with fused spelling error correction algorithm

Recently I've looked through several spell checker algorithms, including simple ones (like Peter Norvig's) and much more complex ones (like Brill and Moore's). But there's a type of error which none of them can handle. If, for example, I type stackoverflow instead of stack overflow, these spell checkers will fail to correct the mistype (unless stack overflow is in the dictionary of terms). Storing all the pairs of words is too expensive (and it would not help if the error is 3 single words without spaces between them).
Is there an algorithm which can correct (despite usual mistypes) this type of errors?
Some examples of what I need:
spel checker -> spell checker
spellchecker -> spell checker
spelcheker -> spell checker
I hacked up Norvig's spell corrector to do this. I had to cheat a bit and add the word 'checker' to Norvig's data file because it never appears. Without that cheating, the problem is really hard.
expertsexchange expert exchange
spel checker spell checker
spellchecker spell checker
spelchecker she checker # can't win them all
baseball base all # baseball isn't in the dictionary either :(
hewent he went
Basically you need to change the code so that:
you add space to the alphabet to automatically explore the word breaks.
you first check that all of the words that make up a phrase are in the dictionary to consider the phrase valid, rather than just dictionary membership directly (the dict contains no phrases).
you need a way to score a phrase against plain words.
The latter is the trickiest, and I use a braindead independence assumption for phrase composition that the probability of two adjacent words is the product of their individual probabilities (here done with sum in log prob space), with a small penalty. I am sure that in practice, you'll want to keep some bigram stats to do that splitting well.
import re, collections, math

def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
    counts = collections.defaultdict(lambda: 1.0)
    for f in features:
        counts[f] += 1.0
    tot = float(sum(counts.values()))
    model = collections.defaultdict(lambda: math.log(.1 / tot))
    for f in counts:
        model[f] = math.log(counts[f] / tot)
    return model

NWORDS = train(words(file('big.txt').read()))
alphabet = 'abcdefghijklmnopqrstuvwxyz '

def valid(w):
    return all(s in NWORDS for s in w.split())

def score(w):
    return sum(NWORDS[s] for s in w.split()) - w.count(' ')

def edits1(word):
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
    replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if valid(e2))

def known(words): return set(w for w in words if valid(w))

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=score)

def t(w):
    print w, correct(w)

t('expertsexchange')
t('spel checker')
t('spellchecker')
t('spelchecker')
t('baseball')
t('hewent')
This problem is very similar to the problem of compound splitting as applied to German or Dutch, but also noisy English data. See Monz & De Rijke for a very simple algorithm (which can I think be implemented as a finite state transducer for efficiency) and Google for "compound splitting" and "decompounding".
I sometimes get such suggestions when spell-checking in Kate, so there certainly is an algorithm that can correct some such errors. I am sure one can do better, but one idea is to split the candidate at likely places and check whether close matches for the components exist. The hard part is to decide what the likely places are. In the languages I'm sort of familiar with, there are letter combinations that occur rarely in words. For example, the combinations dk or lh are, as far as I'm aware, rare in English words. Other combinations occur often at the start of words (e.g. un, ch), so those would be good guesses for splitting too. In the example spelcheker, the lc combination is not too widespread, and ch is a common start of words, so the split into spel and cheker is a prime candidate, and any decent algorithm would then find spell and checker (but it would probably also find spiel, so don't auto-correct, just give suggestions).
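A rough Python sketch of that idea, under heavy assumptions: NWORDS-style word frequencies and a correct() function for single words (as in the answer above) already exist, and "likely places" are approximated by ranking split points by how rare the straddling letter pair is across the dictionary. All names here are illustrative:
from collections import Counter

def bigram_counts(dictionary):
    counts = Counter()
    for w in dictionary:
        for a, b in zip(w, w[1:]):
            counts[a + b] += 1
    return counts

def suggest_split(word, dictionary, correct):
    bigrams = bigram_counts(dictionary)
    # Try the split points with the rarest straddling letter pair first.
    points = sorted(range(1, len(word)), key=lambda i: bigrams[word[i-1:i+1]])
    for i in points:
        left, right = correct(word[:i]), correct(word[i:])
        if left in dictionary and right in dictionary:
            return left + ' ' + right       # suggestion only; don't auto-correct
    return word

# suggest_split('spelcheker', dictionary, correct)  # ideally 'spell checker'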

Algorithm to find

The logic behind this was that (n-2)3^(n-3) has lots of repetitions, like (abc)***(abc) when abc is both at the start and at the end, and the repeated strings total 3^4; similarly as abc moves ahead, the number of sets of (abc) increases.
You can use dynamic programming to compute the number of forbidden strings.
The algorithms follow from the observation below:
"Legal string of size n is the legal string of size n - 1 extended with one letter, so that the last three letters of the resulting string are not all distinct."
So if we had all the legal strings of size n-1 we could try extending them to obtain the legal strings of size n.
To check whether the extended string is legal we just need to know the last two letters of the previous string (of size n-1).
In the algorithm we will compute two arrays, where
different[i] # number of legal strings of length i in which last two letters are different
same[i] # number of legal strings of length i in which last two letters are the same
It can be easily proved that:
different[i+1] = different[i] + 2*same[i]
same[i+1] = different[i] + same[i]
It is the consequence of the following facts:
Any 'same' string of size i+1 can be obtained either from 'same' string of size i (think BB -> BBB) or from 'different' string (think AB -> ABB) and these are the only options.
Any 'different' string of size i+1 can be obtained either from 'different' string of size i (think AB-> ABA ) or from the 'same' string in two ways (AA -> AAB or AA -> AAC)
Having observed all this it is easy to write an algorithm that computes the result in O(n) time.
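A direct Python translation of those recurrences over the alphabet {A, B, C} (a sketch; the forbidden count is obtained by subtracting the legal count from 3^n):
def count_forbidden(n):
    if n < 3:
        return 0
    # Length-2 strings: 6 end in two different letters, 3 end in a doubled letter.
    different, same = 6, 3
    for _ in range(n - 2):
        different, same = different + 2 * same, different + same
    legal = different + same
    return 3 ** n - legal

# count_forbidden(3) -> 6, the 3! strings whose last three letters are all distinct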
I suggest you use recursion, and look at two numbers:
F(n), the number of legal strings of length n whose last two symbols are the same.
G(n), the number of legal strings of length n whose last two symbols are different.
Is that enough to go on?
get the ASCII values of the last three letters and add the square values of these letters. If it gives a certain result, then it is forbidden. For A, B and C, it would be fine.
To do this:
1) find out how to get characters from your string.
2) find out how to get ASCII value of a character.
3) Multiply these ASCII values with themselves.
4) Do that for the three letters each time and add their values.

string of integers puzzle

I apologize for not having the math background to put this question in a more formal way.
I'm looking to create a string of 796 letters (or integers) with certain properties.
Basically, the string is a variation on a De Bruijn sequence B(12,4), except order and repetition within each n-length subsequence are disregarded.
i.e. ABBB BABA BBBA are each equivalent to {AB}.
In other words, the main property of the string involves looking at consecutive groups of 4 letters within the larger string
(i.e. the 1st through 4th letters, the 2nd through 5th letters, the 3rd through 6th letters, etc.),
and then producing the set of letters that comprise each group (repetitions and order disregarded).
For example, in the string of 9 letters:
A B B A C E B C D
the first 4-letter group is: ABBA, which is comprised of the set {AB}
the second group is: BBAC, which is comprised of the set {ABC}
the third group is: BACE, which is comprised of the set {ABCE}
etc.
The goal is for every combination of 1-4 letters from a set of N letters to be represented by the 1-4-letter resultant sets of the 4-element groups once and only once in the original string.
For example, if there is a set of 5 letters {A, B, C, D, E} being used
Then the possible 1-4 letter combinations are:
A, B, C, D, E,
AB, AC, AD, AE, BC, BD, BE, CD, CE, DE,
ABC, ABD, ABE, ACD, ACE, ADE, BCD, BCE, BDE, CDE,
ABCD, ABCE, ABDE, ACDE, BCDE
Here is a working example that uses a set of 5 letters {A, B, C, D, E}.
D D D D E C B B B B A E C C C C D A E E E E B D A A A A C B D D B
The 1st through 4th elements form the set: D
The 2nd through 5th elements form the set: DE
The 3rd through 6th elements form the set: CDE
The 4th through 7th elements form the set: BCDE
The 5th through 8th elements form the set: BCE
The 6th through 9th elements form the set: BC
The 7th through 10th elements form the set: B
etc.
I am hoping to find a working example of a string that uses 12 different letters (a total of 793 4-letter groups within a 796-letter string), starting (and if possible ending) with 4 of the same letter.
Here is a working solution for 7 letters:
AAAABCDBEAAACDECFAAADBFBACEAGAADEFBAGACDFBGCCCCDGEAFAGCBEEECGFFBFEGGGGFDEEEEFCBBBBGDCFFFFDAGBEGDDDDBE
Beware that in order to attempt an exhaustive search (the VB answer is trying a naive version of that) you'll first have to solve the problem of generating all possible expansions while maintaining lexicographical order. Just ABC expands to all perms of AABC, plus all perms of ABBC, plus all perms of ABCC, which is 3*4! instead of just AABC. If you just concatenate AABC and AABD it would cover just 4 out of the 4! perms of AABC, and even that by accident. Just this expansion will bring you exponential complexity - end of game. Plus you'll need to maintain the association between all expansions and the set (the set becomes a label).
Your best bet is to use one of the known efficient De Bruijn constructors and try to see if you can put your set-equivalence in there. Check out
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.14.674&rep=rep1&type=pdf
and
http://www.dim.uchile.cl/~emoreno/publicaciones/FINALES/copyrighted/IPL05-De_Bruijn_sequences_and_De_Bruijn_graphs_for_a_general_language.pdf
for a start.
If you know graphs, another viable option is to start with De Bruijn graph and formulate your set-equivalence as a graph rewriting. 2nd paper does De Bruijn graph partitioning.
BTW, try the VB answer just for A, B, AB (at least the expansion is small) - it will make AABBAB and construct ABBA or ABBAB (or throw, in a decent language), both of which are wrong. You can even prove that it will always miss with 1st lexical expansions (that's what AAB, AAAB etc. are) just by examining the first 2 passes (it will always miss the 2nd A for NxA because (N-1)xA+B is in the string - the 1st expansion of {AB}).
Oh, and if we could establish how many of each letter an optimal solution should have (don't look at B(5,2), it's too easy and regular :-) a random search would be feasible - you generate candidates with provable traits (like AAAA, BBBB ... are present and not touching, and it has n1 A-s, n2 B-s ...) and a random arrangement, and then test whether they are solutions (checking is much faster than exhaustive search in this case).
Cool problem. Just a draft/pseudo algo:
dim STR-A as string = getall(ABCDEFGHIJKL)
//custom function to generate concat list of all 793 4-char combos.
//should be listed side-by-side to form 3172 character-long string.
//different ordering may ultimately produce different results.
//brute-forcing all orders of combos is too much work (793! is a big #).
//need to determine how to find optimal ordering, for this particular
//approach below.
dim STR-B as string = ""       // to hold the string you're searching for
dim STR-C as string = ""       // to hold the sub-string you are searching in
dim STR-A-NEW as string = ""   // variable to hold your new string
dim MATCH as boolean = false   // variable to hold matching status

while len(STR-A) > 0
    //check each character in STR-A, which will be shortened by 1 char on each pass
    MATCH = false
    STR-B = left(STR-A, 4)
    STR-B = reduce(STR-B)
    //reduce(str) is a custom re-usable function to sort & remove duplicates
    for i as integer = 1 to len(STR-A) - 1
        STR-C = substr(STR-A, i, 4)
        //gives you the 4-character sequence beginning at position i
        STR-C = reduce(STR-C)
        IF STR-B = STR-C Then
            MATCH = true
            exit for
            //as long as there is even one match, you can throw away the first letter
        END IF
        i = i + 1
    next
    IF MATCH = false then
        //if you didn't find a match, then the first letter should be saved
        STR-A-NEW += LEFT(STR-B, 1)
    END IF
    MATCH = false                          //re-init MATCH
    STR-A = RIGHT(STR-A, LEN(STR-A) - 1)   //re-init STR-A
wend
Anyway -- there could be problems with this, and you'd need to write another function to parse your result string (STR-A-NEW) to prove that it's a viable answer...
I've been thinking about this one and I'm sketching out a solution.
Let's call a string of four symbols a word and we'll write S(w) to denote the set of symbols in word w.
Each word abcd has "follow-on" words bcde where a,...,e are all symbols.
Let succ(w) be the set of follow-on words v for w such that S(w) != S(v). succ(w) is the set of successor words that can follow on from the first symbol in w if w is in a solution.
For each non-empty set of symbols s of cardinality at most four, let words(s) be the set of words w such that S(w) = s. Any solution must contain exactly one word in words(s) for each such set s.
Now we can do a reasonable search. The basic idea is this: say we are exploring a search path ending with word w. The follow-on word must be a non-excluded word in succ(w). A word v is excluded if the search path contains some word u such that v is in words(S(u)).
You can be slightly more cunning: if we track the possible "predecessor" words to a set s (i.e., words w with a successor v such that v in words(s)) and reach a point where every predecessor of s is excluded, then we know we have reached a dead end, since we'll never be able to obtain s from any extension of the current search path.
Code to follow after the weekend, with a bit of luck...
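In the meantime, here is a very rough illustrative sketch of that search idea in Python, for small alphabets only. It is a plain depth-first search without the predecessor-based pruning described above, so it will not scale to 12 symbols as written; the function name is made up:
from itertools import combinations

def find_sequence(symbols, k=4):
    # Every non-empty subset of at most k symbols must appear exactly once
    # as the letter set of a k-letter window.
    targets = {frozenset(c) for r in range(1, k + 1)
               for c in combinations(symbols, r)}
    start = symbols[0] * k                      # begin with k identical symbols
    used = {frozenset(start)}

    def dfs(s, used):
        if len(used) == len(targets):
            return s                            # every set covered exactly once
        for c in symbols:
            window = frozenset(s[-(k - 1):] + c)
            if window not in used:
                result = dfs(s + c, used | {window})
                if result:
                    return result
        return None

    return dfs(start, used)

# print(find_sequence('ABCDE'))   # a sketch, not a tuned solver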
Here is my proposal. I'll admit upfront this is a performance and memory hog.
This may be overkill, but have a class; we'll call it UniqueCombination. It will contain a unique 1-4 char reduced combination of the input set (i.e. A, AB, ABC, ...) and a list of its possible expansions (AB -> {AABB, ABAB, BBAA, ...}). It will need a method that determines whether any possible expansion overlaps any possible expansion of another UniqueCombination by three characters, plus an override that takes a string as well.
Then we start with the string "AAAA" and find all of the UniqueCombinations that overlap this string. Then we find how many UniqueCombinations those possible matches overlap with (we could be smart at this point and store this number). Then we pick the one with the least number of overlaps greater than 0. Use up the ones with the fewest possible matches first.
Then we find a specific combination for the chosen UniqueCombination and add it to the final string. Remove this UniqueCombination from the list, then find overlaps for the current string; rinse and repeat. (We could be smart and, on subsequent runs while searching for overlaps, remove any of the unreduced combinations that are contained in the final string.)
Well, that's my plan; I will work on the code this weekend. Granted, this does not guarantee that the final 4 characters will be 4 of the same letter (it might actually try to avoid that, but I will look into that as well).
If there is a non-exponential solution at all, it may need to be formulated in terms of a recursive "growth" from a problem of a smaller size, i.e. to construct B(N,k) from B(N-1,k-1) or from B(N-1,k) or from B(N,k-1).
Systematic construction for B(5,2) - one step at a time :-) It's bound to get more complex later [card stands for cardinality, {AB} has card=2; I'll also call them 2-s, 3-s etc.] Note: 2-s and 3-s will be k-1 and k later (I hope).
Initial. Start with the k-1 result and inject symbols for singletons
    (unique expansion, empty intersection):
    ABCDE -> AABBCCDDEE
    mark used card=2 sets: AB, BC, CD, DE
Rewriting. Form card=3 sets to inject symbols into the marked card=2 sets.
    1st feasible lexicographic expansion fires (may have to backtrack for k>2)
    it's OK to use already marked 2-s since they'll all get replaced,
    but may have to do a verification pass for higher k
    AB->ACB, BC->BCD, CD->CED, DE->DAE ==> AACBBDCCEDDAEEB
    mark/verify used 2-s
    normally keep marking/unmarking during the construction, but also keep the old
    mark list
    marking/unmarking can get expensive if there's backtracking in #3
    Unused: AB, BE
    For higher k may need several recursive rewriting passes,
    possibly partitioning new sets into classes
Finalize: unused 2-s should overlap around the edge (that's why it's cyclic)
    ABE - B can go to the beginning or end: AACBBDCCEDDAEEB
Note: a step from B(N-1,k) to B(N,k) may need injection of pseudo-singletons, like doubling or tripling A
B(5,2) -> B(5,3) -> B(5,4)
Initial. Same: ABCDE -> AAACBBBDCCCEDDDAEEEB
    no use marking 3-sets since they are all going to be changed
Rewriting.
    choose systematic insertion positions
    AAA_CBBB_DCCC_EDDD_AEEE_B
    mark all 2-s released by this: AC, AD, BD, BE, CE
    use marked 2-s to decide inserted symbols - notice the total regularity:
    AxCB  D -> ADCB
    BxDC  E -> BEDC
    CxED  A -> CAED
    DxAE  B -> DBAE
    ExBA  C -> ECBA
Verify that 3-s are all used (inserted symbols marked just for fun):
    AAA[D]CBBB[E]DCCC[A]EDDD[B]AEEE[C]B
Note: the systematic choice of insertion point deterministically dictated the insertions (only AD can fit 1st; AC would create a duplicate 2-set (AAC, ACC))
Note: It's not going to be so nice for B(6,2) and B(6,3), since the number of 2-s will exceed 2x the number of 1-s. This is important since 2-s sit naturally on the sides of 1-s, like CBBBE, and the issue is how to place them when you run out of 1-s.
B(5,3) is so symmetrical that just repeating #1 produces B(5,4):
AAAADCBBBBEDCCCCAEDDDDBAEEEECB
