Need assistance understanding the Sardinas-Patterson algorithm (algorithm and example provided) - algorithm

I am having difficulty understanding the Sardinas-Patterson algorithm from the slide below:
How do we get C1 and C2?
I also got this information from the internet:
The algorithm is finite because all dangling suffixes added to the list are suffixes of a finite set of codewords, and a dangling suffix can be added at most once.
{ 0, 01, 11 }. The codeword 0 is a prefix of 01, so add the dangling suffix 1. { 0, 01, 11, 1 }. The codeword 0 is a prefix of 01, but the dangling suffix 1 is already in the list; the codeword 1 is a prefix of 11, but the dangling suffix 1 is already in the list. There are no other dangling suffixes, so conclude that the set is uniquely decodable.
{ 0, 01, 10 }. The codeword 0 is a prefix of 01, so add the dangling suffix 1 to the list. { 0, 01, 10, 1 }. The codeword 1 is a prefix of 10, but the dangling suffix 0 is a codeword. So, conclude that the code is not uniquely decodable.

The wiki article is a great explanation
The Ci in your slide correspond to the Si from the wiki article.
Here is a description from me:
The important operation is taking two strings from C and, if one of them is a prefix of the other, recording the suffix that is left when the prefix is removed.
This is how C1 is obtained.
The following sets C2, C3, etc. are built in a similar way.
You again look for words from C which are prefixes of words from Ci and take the remaining suffix, but you also look for words from Ci which are prefixes of words from C and take the remaining suffix there as well. C(i+1) is the union of those two sets.
As soon as some Ci contains a word from C you know the code is not uniquely decodable.
So:
C = 1, 011, 01110, 1110, 10011
C1 = 110 (because (1)110 is in C), 0011 (because (1)0011 is in C), 10 (because (011)10 is in C)
C2 = {10 (because (1)10 is in C1), 0 (because (1)0 is in C1)} union { 011, because (10)011 is in C }

C1 is found by seeing if any code word in C is a prefix of any other code word in C, if it is then the suffix is added to the set C1. e.g. 1 is a prefix of 1110 and hence you get the suffix 110 which is added to C1.
For C2, first check whether any code word in C is a prefix of a word in C1; if so, collect all the resulting "dangling suffixes". Then check whether any word in C1 is a prefix of a code word in C and again collect the "dangling suffixes". The union of those two sets is C2.
Hopefully that kinda made sense.

The sets C1 and C2 correspond to S1 and S2 in this Wikipedia article.
Specifically, C1 is the set of words that can remain after we take a single word from C and remove a prefix of it that is also (a different word) in C.
For C2 we have the words that can remain after we take a word from C and remove a prefix from C1, or after we take a word from C1 and remove a prefix from C.
If we wanted to compute C3, we would take the words that can remain after we take a word from C and remove a prefix of it that is in C2, and the words that can remain after we take a word from C2 and remove a prefix of it that is in C.
Thus, C3 would be: {[empty word], 0, 011, 10, 11, 1110}. It contains the empty word, so the algorithm halts and determines that C is not uniquely decodable.
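For anyone who wants to check the sets above mechanically, here is a minimal Python sketch of the procedure as the answers describe it. The function names (`first_set`, `next_set`, `is_uniquely_decodable`) are my own, and the stopping rule (stop when a set repeats or becomes empty) is an assumption based on the finiteness remark quoted in the question.

```python
def dangling_suffixes(prefixes, words):
    """Suffixes left when a word from `prefixes` is a prefix of a word from `words`."""
    return {w[len(p):] for p in prefixes for w in words if w.startswith(p)}


def first_set(code):
    """C1: suffixes left when one codeword is a prefix of a *different* codeword."""
    return {w[len(p):] for p in code for w in code if p != w and w.startswith(p)}


def next_set(code, current):
    """C(i+1) from Ci: compare C against Ci in both directions and take the union."""
    return dangling_suffixes(code, current) | dangling_suffixes(current, code)


def is_uniquely_decodable(code):
    code = set(code)
    current = first_set(code)
    seen = []                      # sets already visited, to detect a repeat
    while current and current not in seen:
        if current & code:         # some Ci contains a codeword
            return False
        seen.append(current)
        current = next_set(code, current)
    return True


C = {"1", "011", "01110", "1110", "10011"}
C1 = first_set(C)        # {'110', '0011', '10'}
C2 = next_set(C, C1)     # {'10', '0', '011'}
C3 = next_set(C, C2)     # {'', '0', '011', '10', '11', '1110'}
print(is_uniquely_decodable(C))   # False
```

Note that C2 already contains the codeword 011, which is why the sketch can stop there; the empty word in C3 is the same failure seen one step later.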

Related

Interview question: minimum number of swaps to make couples sit together

This is an interview question, and the problem description is as follows:
There are n couples sitting in a row with 2n seats. Find the minimum number of swaps to make everyone sit next to his/her partner. For example, 0 and 1 are a couple, and 2 and 3 are a couple. Originally they are sitting in a row in this order: [2, 0, 1, 3]. The minimum number of swaps is 1, for example swapping 2 with 1.
I know there is a greedy solution for this problem. You just need to scan the array from left to right. Every time you see an unmatched pair, you swap the first person of the pair to his/her correct position. For example, in the above example for pair [2, 0], you will directly swap 2 with 1. There is no need to try swapping 0 with 3.
But I don't really understand why this works. One of the proofs I saw was something like this:
Consider a simple example: 7 1 4 6 2 3 0 5. At first step we have two choices to match the first couple: swap 7 with 0, or swap 1 with 6. Then we get 0 1 4 6 2 3 7 5 or 7 6 4 1 2 3 0 5. Pay attention that the first couple doesn't count any more. For the later part it is composed of 4 X 2 3 Y 5 (X=6 Y=7 or X=1 Y=0). Since different couples are unrelated, we don't care X Y is 6 7 pair or 0 1 pair. They are equivalent! Thus it means our choice doesn't count.
I feel that this is very reasonable but not compelling enough. In my opinion we have to prove that X and Y are couple in all possible cases and don't know how. Can anyone give a hint? Thanks!
I've split the problem into 3 examples. A's are a pair and so are B's in all examples. Note that throughout the examples a match requires that the elements are adjacent and that the first element occupies an index that satisfies index%2 = 0. An array looking like this [X A1 A2 ...] does not satisfy this condition, however this does: [X Y A1 A2 ...]. The examples also do not look to the left at all, because looking to the left of A2 below is the same as looking to the right of A1.
First example
There's an even number of elements between two unmatched pairs:
A1 B1 ..2k.. A2 B2 .. for any number k in {0, 1, 2, ..}, meaning A1 B1 A2 B2 .. is just another case.
Both can be matched in one swap:
A1 A2 ..2k.. B1 B2 .. or B2 B1 ..2k.. A2 A1 ..
Order is not important, so it doesn't matter which pair is first. Once the pairs are matched, there will be no more swapping involving either pair. Finding A2 based on A1 will result in the same amount of swaps as finding B2 based on B1.
Second example
There's an odd number of elements between two pairs (2k + the element C):
A1 B1 ..2k.. C A2 B2 D .. (A1 B1 ..2k.. C B2 A2 D .. is identical)
Both cannot be matched in one swap, but like before it doesn't matter which pair is first nor if the matched pair is in the beginning or in the middle part of the array, so all these possible swaps are equally valid, and none of them creates more swaps later on:
A1 A2 ..2k .. C B1 B2 D .. or B2 B1 ..2k.. C A2 A1 D .. Note that the last pair is not matched
C B1 ..2k.. A1 A2 B2 D .. or A1 D ..2k.. C A2 B2 B1 .. Here we're not matching the first pair.
The important thing about this is that in each case, only one pair is matched and none of the elements of that pair will need to be swapped again. The result of the remaining non-matched pair are either one of:
..2k.. C B1 B2 D ..
..2k.. C A2 A1 D ..
C B1 ..2k.. B2 D ..
A1 D ..2k.. C A2 ..
They are clearly equivalent in terms of swaps needed to match the remaining A's or B's.
Third example
This is logically identical to the second. Both B1/A2 and A2/B2 can have any number of elements between them. No matter how elements are swapped, only one pair can be matched. m1 and m2 are arbitrary number of elements. Note that elements X and Y are just the elements surrounding B2, and they're only used to illustrate the example:
A1 B1 ..m1.. A2 ..m2.. X B2 Y .. (A1 B1 ..m1.. B2 ..m2.. X A2 Y .. is identical)
Again both pairs cannot be matched in one swap, but it's not important which pair is matched, or where the matched pair position is:
A1 A2 ..m1.. B1 ..m2.. X B2 Y .. or B2 B1 ..m1.. A2 ..m2.. X A1 Y .. Note that the last pair is not matched
A1 X ..m1.. A2 ..m2-1.. B1 B2 Y .. or A1 Y ..m1.. A2 ..m2.. X B2 B1.. depending on position of B2. Here we're not matching the first pair.
Matching the pair around A2 is equivalent, but omitted.
As in the second example, one swap can also be matching a pair in the beginning or in the middle of the array, but either choice doesn't change that only one pair is matched. Nor does it change the remaining amount of unmatched pairs.
A little analysis
Keeping in mind that matched pairs drop out of the list of unmatched/problem pairs, the list of unmatched pairs shrinks by either one or two pairs with each swap. Since it's not important which pair drops out of the problem, it might as well be the first. In that case we can assume that the pairs to the left of the cursor/current index are all matched, and that we only need to match the first pair, unless it's already matched by coincidence, in which case the cursor is simply moved on.
It becomes even more clear if the above examples are looked at with the cursor being at the second unmatched pair, instead of the first. It still doesn't matter which pairs are swapped for the amount of total swaps needed. So there's no need to try to match pairs in the middle. The resulting amount of swaps are the same.
The only time two pairs can be matched with only one swap are those in the first example. There is no way to match two pairs in one swap in any other setup. Looking at the result of the swap in the second and third examples, it also becomes clear that none of the results have any advantage to any of the others and that each result becomes a new problem that can be described as one of the three cases (two cases really, because second and third are equivalent in terms of match-able pairs).
Optimal swapping
There is no way to modify the array to prepare it for more optimal swapping later on. Either a swap will match one or two pairs, or it will count as a swap with no matches:
Looking at this: A1 B1 ..2k.. C B2 ... A2 ...
Swap to prepare for optimal swap:
A1 B1 ..2k.. A2 B2 ... C ... no matches
A1 A2 ..2k.. B1 B2 ... C ... two in one
Greedy swap:
B2 B1 ..2k.. C A1 ... A2 ... one
B2 B1 ..2k.. A2 A1 ... C ... one
Un-matching pairs
Pairs already matched will not become unmatched because that would require that:
For A1 B1 ..2k.. C A2 B2 D ..
C is identical to A1 or
D is identical to B1
either of which is impossible.
Likewise with A1 B1 ..m1.. (Z) A2 (V) ..m2.. X B2 Y ..
Or it would require that matched pairs are shifted by one (or any odd number of) index inside the array. That's also not possible, because we only ever swap, so the array elements aren't being shifted at all.
[Edited for clarity 4-Mar-2020.]
There is no point doing a swap which does not put (at least) one couple together. To do so would add 1 to the swap count and leave us with the same number of unpaired couples.
So, each time we do a swap, we put one couple together leaving at most n-1 couples. Repeating the process we end up with 1 pair, who must by then be a couple. So, the worst case must be n-1 swaps.
Clearly, we can ignore couples who are already together.
Clearly, where we have two pairs a:B b:A, one swap will create the two couples a:A b:B.
And if we have m pairs a:Q b:A c:B ... q:P -- where the m pairs are a "disjoint subset" (or cycle) of couples, m-1 swaps will put them into couples.
So: the minimum number of swaps is going to be n - s where s is the number of "disjoint subsets" (and s >= 1). [A subset may, of course, contain just one couple.]
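The n - s count can be sketched in Python as below. This is only an illustration: it assumes, as in the question's example, that people 2i and 2i+1 form couple i, and the name `min_swaps_by_components` and the small union-find helper are my own choices rather than anything given in this answer.

```python
def min_swaps_by_components(row):
    """Return n - s, where s is the number of disjoint subsets of couples."""
    n = len(row) // 2
    parent = list(range(n))        # one union-find node per couple

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    components = n
    for seat in range(0, len(row), 2):
        # The two people sharing a seat pair tie their couples together.
        a = find(row[seat] // 2)
        b = find(row[seat + 1] // 2)
        if a != b:
            parent[a] = b
            components -= 1
    return n - components


print(min_swaps_by_components([2, 0, 1, 3]))               # 1
print(min_swaps_by_components([7, 1, 4, 6, 2, 3, 0, 5]))   # 2
```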
Interestingly, there is nothing clever you can do to reduce the number of swaps. Provided every swap creates a couple you will do the minimum number.
If you wanted to arrange each couple in height order as well, things may or may not be more interesting.
FWIW: having shown that you cannot do better than n-1 swaps for each disjoint set of n couples, the trick then is to avoid the O(n^2) search for each swap. That can be done relatively straightforwardly by keeping a vector with one entry per person, giving where they are currently sat. Then in one scan you pick up each person and if you know where their partner is sat, swap down to make a pair, and update the location of the person swapped up.
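A rough Python sketch of that single scan, again assuming couples are (0, 1), (2, 3), ... so that a person's partner is `person ^ 1`. The names `min_swaps_greedy` and `pos` are hypothetical; `pos` plays the role of the "where they are currently sat" vector.

```python
def min_swaps_greedy(row):
    """Scan left to right, pulling each left-seat person's partner next to them."""
    row = list(row)                      # work on a copy
    pos = [0] * len(row)                 # pos[person] = current seat index
    for seat, person in enumerate(row):
        pos[person] = seat

    swaps = 0
    for i in range(0, len(row), 2):
        partner = row[i] ^ 1             # 2k <-> 2k+1
        if row[i + 1] != partner:
            j = pos[partner]             # where the partner currently sits
            displaced = row[i + 1]
            row[i + 1], row[j] = partner, displaced
            pos[partner], pos[displaced] = i + 1, j
            swaps += 1
    return swaps


print(min_swaps_greedy([2, 0, 1, 3]))               # 1
print(min_swaps_greedy([7, 1, 4, 6, 2, 3, 0, 5]))   # 2
```

Each lookup and update of `pos` is O(1), so the whole scan is O(n) instead of doing a linear search per swap.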
I will swap every even-positioned member
if he/she doesn't sit beside his/her partner.
Even-positioned means array indexes 1, 3, 5 and so on (the second seat of each pair).
The couples are [even, odd] pairs. For example [0, 1], [2, 3], [4, 5] and so on.
The loop will be like that:
for(i=1; i<n*2; i+=2) // when n = # of couples.
Now, we check the members at indexes i and i-1. If they are not a couple, we look for the partner of the (i-1)-th member and, once we have found it, swap it with the i-th member.
For example, say at i=1 we have 6. If the (i-1)-th element is 7, they form a couple and we don't need any swap (if the (i-1)-th element is 5, then [5, 6] is not a couple). Otherwise we look for the partner of the (i-1)-th element and swap it with the i-th element, so that the (i-1)-th and i-th elements form a couple.
This ensures that we only need to check half of the members, that is, n of them.
For any non-matched couple, we need a linear search from the i-th position through the rest of the array, which is O(2n), i.e. O(n).
So the overall complexity of this technique is O(n^2).
In the worst case, the minimum number of swaps is n-1 (this is the maximum as well).
Very straightforward. If you need help to code, let us know.

Minimum distance metric on encoded sequence

I'm looking for a minimum distance metric which preserves subsequence substitution. By this I mean that any subsequence of the second sequence can have a different representation but still count as the same as the corresponding subsequence of the first. The two sequences will always have the same length. I'm familiar with Hamming and Levenshtein distance, but they are probably useless in this case.
Consider these examples:
AABBAA
CCDDCC
has distance 0, because A = C and B = D (or AA = CC and BB = DD).
AABBBBBB
CCDDEEEE
has distance 2, because A = C and B = E (or AA = CC or BB = EE or BBBB = EEEE), but B =/= D (or BB =/= DD).
However, the function need not behave exactly like that. I just need to know how similar the unencoded sequence is, in terms of repetition, to the encoded one. You could assume that the second sequence is encoded with something like a Caesar cipher (although I'm not sure whether, e.g., the shift could vary over time).
Note:
I also thought about compressing the two sequences with LZW algorithm and compare their compression ratio. Any other idea?
You can number the elements of each sequence with consecutive integers in order of first appearance, and then use Levenshtein distance or something like that.
AACCAABB --> 11221133 (A->1, C->2, B->3)
CCXXCCYY --> 11221133 (C->1, X->2, Y->3)
d(AACCAABB, CCXXCCYY) = d(11221133, 11221133) = 0
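A small Python sketch of that relabelling. The helper names are mine, and the choice of a position-by-position (Hamming-style) comparison is an assumption made only because the question says both sequences have the same length; Levenshtein could be plugged in instead.

```python
def canonical(seq):
    """Replace each symbol by the order (1, 2, 3, ...) in which it first appears."""
    labels = {}
    return [labels.setdefault(symbol, len(labels) + 1) for symbol in seq]


def distance(a, b):
    """Count positions where the canonical forms of two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(canonical(a), canonical(b)))


print(canonical("AACCAABB"))               # [1, 1, 2, 2, 1, 1, 3, 3]
print(distance("AACCAABB", "CCXXCCYY"))    # 0
print(distance("AABBAA", "CCDDCC"))        # 0
print(distance("AABBBBBB", "CCDDEEEE"))    # 4
```

Note that for the question's second example this gives 4 rather than the desired 2, because the canonical forms are 11222222 and 11223333; so the relabelling alone matches the first example but not the second.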

Find all substrings that don't contain the entire set of characters

I was asked this in an interview.
I'm given a string whose characters come from the set {a,b,c} only. Find all substrings that don't contain all the characters from the set, e.g. substrings that contain only a's, only b's, only c's, only a's and b's, only b's and c's, or only c's and a's. I gave him the naive O(n^2) solution of generating all substrings and testing them.
The interviewer wanted an O(n) solution.
Edit: My attempt was to keep the last indexes of a, b and c and run a pointer from left to right; any time all 3 had been seen, I changed the start of the substring to exclude the earliest one and started counting again. It doesn't seem exhaustive.
So, for example, if the string is abbcabccaa,
let i be the pointer that traverses the string. Let start be start of the substring.
1) i = 0, start = 0
2) i = 1, start = 0, last_index(a) = 0 --> 1 substring - a
3) i = 2, start = 0, last_index(a) = 0, last_index(b) = 1 -- > 1 substring ab
4) i = 3, start = 0, last_index(a) = 0, last_index(b) = 2 --> 1 substring abb
5) i = 4, start = 1, last_index(b) = 2, last_index(c) = 3 --> 1 substring bbc(removed a from the substring)
6) i = 5, start = 3, last_index(c) = 3, last_index(a) = 4 --> 1 substring ca(removed b from the substring)
but this isn't exhaustive
Given that the problem in its original definition can't be solved in less than O(N^2) time, as some comments point out, I suggest a linear algorithm for counting the number of substrings (not necessarily unique in their values, but unique in their positions within the original string).
The algorithm
count = 0
For every char C in {'a','b','c'} scan the input S and break it into longest sequences not including C. For each such section A, add |A|*(|A|+1)/2 to count. This addition stands for the number of legal sub-strings inside A.
Now we have the total number of legal strings including only {'a','b'}, only {'a','c'} and only {'b','c'}. The problem is that we counted substrings with a single repeated character twice. To fix this we iterate over S again, this time subtracting |A|*(|A|+1)/2 for every largest sequence A of a single character that we encounter.
Return count
Example
S='aacb'
breaking it using 'a' gives us only 'cb', so count = 3. For C='b' we have 'aac', which makes count = 3 + 6 = 9. With C='c' we get 'aa' and 'b', so count = 9 + 3 + 1 = 13. Now we have to do the subtraction: 'aa': -3, 'c': -1, 'b': -1. So we have count=8.
The 8 substrings are:
'a'
'a' (the second char this time)
'aa'
'ac'
'aac'
'cb'
'c'
'b'
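The counting steps above can be sketched in Python like this. The function names are hypothetical, and the double-counting correction as written relies on the alphabet having exactly three letters, as in the question.

```python
def run_lengths(s, excluded):
    """Lengths of the maximal runs of s containing no character from `excluded`."""
    lengths, current = [], 0
    for ch in s:
        if ch in excluded:
            if current:
                lengths.append(current)
            current = 0
        else:
            current += 1
    if current:
        lengths.append(current)
    return lengths


def count_substrings_missing_a_letter(s, alphabet="abc"):
    def substrings_in_run(n):
        return n * (n + 1) // 2

    count = 0
    # Pass 1: for each letter C, count substrings that avoid C.
    for c in alphabet:
        count += sum(substrings_in_run(n) for n in run_lengths(s, {c}))
    # Pass 2: single-letter substrings were counted twice, so subtract them once.
    for c in alphabet:
        others = set(alphabet) - {c}
        count -= sum(substrings_in_run(n) for n in run_lengths(s, others))
    return count


print(count_substrings_missing_a_letter("aacb"))   # 8
```

Each pass is a single scan of the input, so the total work is a constant number of O(n) scans.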
To get something better than O(n^2) we may need additional assumptions (maybe only the longest substrings with this property are wanted).
Consider a string of the form aaaaaaaaaabbbbbbbbbb of length n. Every one of its n(n+1)/2 substrings qualifies, so if we want to list them all we need at least quadratic time.
I came up with a linear solution for the longest substrings.
Take a set S of all substrings separated by a, all substrings separated by b and finally all substrings separated by c. Each of those steps can be done in O(n), so we have O(3n), thus O(n).
Example:
Take aaabcaaccbaa.
In this case set S contains:
substrings separated by a: bc, ccb
substrings separated by b: aaa, caacc, aa
substrings separated by c: aaab, aa, baa.
By the set I mean a data structure with adding and finding element with a given key in O(1).
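As a rough Python illustration of building S (the function name is mine; Python's built-in `set` is used as the O(1) add/find structure, and empty pieces produced by adjacent separators are dropped):

```python
def separated_substrings(s, alphabet="abc"):
    """Collect the non-empty pieces obtained by splitting s on each letter in turn."""
    result = set()
    for c in alphabet:
        result.update(piece for piece in s.split(c) if piece)
    return result


print(sorted(separated_substrings("aaabcaaccbaa")))
# ['aa', 'aaa', 'aaab', 'baa', 'bc', 'caacc', 'ccb']
```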

How do I apply the CYK algorithm to this CFG?

Let CFG G be:
S −→ AB|BA|AC|BD|EE
A −→ a
B −→ b
C −→ EB
D −→ EA
E −→ AB|BA|AC|BD|EE
How do I use the CYK algorithm to determine if the string aabbab is part of the language?
This is the pseudo code I have in my notes:
for i in 1 .. n
    V[i,1] = { A | A -> x[i] }
for j in 2 .. n
    for i in 1 .. n-j+1
    {
        V[i,j] = phi
        for k in 1 .. j-1
            V[i,j] = V[i,j] union { A | A -> BC where B in V[i,k]
                                                  and C in V[i+k,j-k] }
    }
But I am not understanding how the answer got to be in an upside down triangular shape.
For example,
V[i,j]                      i
          1(b)   2(a)   3(a)   4(b)   5(a)
     1    B      A,C    A,C    B      A,C
     2    S,A    B      S,C    S,A
j    3    phi    B      B
     4    phi    S,A,C
     5    S,A,C
          ^
          |_ accept
The pseudocode[*] describes how to apply the algorithm to create the chart.
The [i, j] pair refers to a substring of the input that starts at the ith symbol and extends for j symbols. So [2, 3] refers to a 3-symbol substring, starting at symbol 2. If your input is baaba, then [2, 3] refers to the aab in the middle. (The indexes are 1-based, not 0-based.)
The chart forms a triangle because you can't have a substring that's longer than the input. If the input is 5 symbols long, then you can have a value in [1, 5], but you can't have [2, 5] because that wouldn't refer to a substring anymore. So each row is one box shorter than the row before it, forming the triangle.
V[i, j] refers to a box in the chart. Each box is the set of non-terminals that may have produced the substring described by [i, j].
The algorithm relies on the grammar being in Chomsky Normal Form. In CNF, the right side of each production is either one terminal symbol or two non-terminal symbols. (There's another algorithm that can transform a context-free grammar into CNF.)
Basically, you start with all the 1-symbol substrings of the input. The first loop in your pseudocode fills out the top row (j == 1) of your chart. It looks at all the productions in the grammar, and, if the right side of a production corresponds to that symbol, then the non-terminal on the left side of that production is added to the set V[i, 1]. (Your example seems to have some bogus entries in the first row. The {A, C} sets should be just {A}.)
The algorithm then proceeds through the rest of the rows, looking for all the possible productions that can produce the corresponding substring. For each possible way to split the current substring into two, it looks for a corresponding production. This involves combining pairs of non-terminals from certain boxes on previous rows and checking if there are any productions that produce that pair, thus building a set of non-terminals for that box.
If the box in the last row ends up with a set that contains the start symbol, then the input is valid according to the grammar. Intuitively, it says that the start symbol is a valid production for making the substring that starts at the first symbol and proceeds for the entire length.
[*] It looks like the pseudocode shown in the question contains some transcription errors. You'll want to consult an authoritative source to get the details right.
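To make the chart-filling concrete, here is a minimal Python sketch that follows the V[i, j] = (start, length) convention described above and applies it to the grammar G from the question. The rule encoding (`TERMINAL`, `BINARY`) and the function name `cyk` are my own; treat it as an illustration rather than a reference implementation.

```python
# Grammar G from the question, already in Chomsky Normal Form.
TERMINAL = [("A", "a"), ("B", "b")]
BODIES = [("A", "B"), ("B", "A"), ("A", "C"), ("B", "D"), ("E", "E")]
BINARY = [(lhs, body) for lhs in ("S", "E") for body in BODIES]
BINARY += [("C", ("E", "B")), ("D", ("E", "A"))]


def cyk(word, start="S"):
    n = len(word)
    if n == 0:
        return False
    V = {}
    # Top row: 1-symbol substrings.
    for i in range(1, n + 1):
        V[i, 1] = {lhs for lhs, terminal in TERMINAL if terminal == word[i - 1]}
    # Remaining rows: substrings of length j starting at symbol i (1-based).
    for j in range(2, n + 1):
        for i in range(1, n - j + 2):
            cell = set()
            for k in range(1, j):              # split into lengths k and j-k
                for lhs, (b, c) in BINARY:
                    if b in V[i, k] and c in V[i + k, j - k]:
                        cell.add(lhs)
            V[i, j] = cell
    return start in V[1, n]


print(cyk("aabbab"))   # True
print(cyk("aab"))      # False
```

For aabbab, one derivation the chart finds is S -> EE with the first E covering aabb (via A C, where C -> E B) and the second E covering ab.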

string of integers puzzle

I apologize for not having the math background to put this question in a more formal way.
I'm looking to create a string of 796 letters (or integers) with certain properties.
Basically, the string is a variation on a De Bruijn sequence B(12,4), except order and repetition within each n-length subsequence are disregarded.
i.e. ABBB BABA BBBA are each equivalent to {AB}.
In other words, the main property of the string involves looking at consecutive groups of 4 letters within the larger string
(i.e. the 1st through 4th letters, the 2nd through 5th letters, the 3rd through 6th letters, etc)
And then producing the set of letters that comprise each group (repetitions and order disregarded)
For example, in the string of 9 letters:
A B B A C E B C D
the first 4-letter group is: ABBA, which is comprised of the set {AB}
the second group is: BBAC, which is comprised of the set {ABC}
the third group is: BACE, which is comprised of the set {ABCE}
etc.
The goal is for every combination of 1-4 letters from a set of N letters to be represented by the 1-4-letter resultant sets of the 4-element groups once and only once in the original string.
For example, if there is a set of 5 letters {A, B, C, D, E} being used
Then the possible 1-4 letter combinations are:
A, B, C, D, E,
AB, AC, AD, AE, BC, BD, BE, CD, CE, DE,
ABC, ABD, ABE, ACD, ACE, ADE, BCD, BCE, BDE, CDE,
ABCD, ABCE, ABDE, ACDE, BCDE
Here is a working example that uses a set of 5 letters {A, B, C, D, E}.
D D D D E C B B B B A E C C C C D A E E E E B D A A A A C B D D B
The 1st through 4th elements form the set: D
The 2nd through 5th elements form the set: DE
The 3rd through 6th elements form the set: CDE
The 4th through 7th elements form the set: BCDE
The 5th through 8th elements form the set: BCE
The 6th through 9th elements form the set: BC
The 7th through 10th elements form the set: B
etc.
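For checking candidate strings, the property can be tested with a short Python sketch like the one below. The names `window_sets` and `is_solution` are hypothetical helpers, not part of the question.

```python
from itertools import combinations


def window_sets(s, k=4):
    """The set of letters in each run of k consecutive letters of s."""
    return [frozenset(s[i:i + k]) for i in range(len(s) - k + 1)]


def is_solution(s, alphabet, k=4):
    """True if every 1..k-letter combination of `alphabet` appears exactly once."""
    wanted = {frozenset(c) for r in range(1, k + 1) for c in combinations(alphabet, r)}
    sets = window_sets(s, k)
    return len(sets) == len(wanted) and set(sets) == wanted


example = "DDDDECBBBBAECCCCDAEEEEBDAAAACBDDB"
print(is_solution(example, "ABCDE"))              # True
print(["".join(sorted(w)) for w in window_sets(example)[:4]])
# ['D', 'DE', 'CDE', 'BCDE']
```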
* I am hoping to find a working example of a string that uses 12 different letters (a total of 793 4-letter groups within a 796-letter string) starting (and if possible ending) with 4 of the same letter. *
Here is a working solution for 7 letters:
AAAABCDBEAAACDECFAAADBFBACEAGAADEFBAGACDFBGCCCCDGEAFAGCBEEECGFFBFEGGGGFDEEEEFCBBBBGDCFFFFDAGBEGDDDDBE
Beware that in order to attempt an exhaustive search (the answer in VB is trying a naive version of that) you'll first have to solve the problem of generating all possible expansions while maintaining lexicographical order. Just ABC expands to all perms of AABC, plus all perms of ABBC, plus all perms of ABCC, which is 3 * 4!/2! = 36 words instead of just AABC. If you just concatenate AABC and AABD, it would cover just 4 of the 12 perms of AABC, and even that by accident. Just this expansion will bring you exponential complexity - end of game. Plus you'll need to maintain the association between all expansions and the set (the set becomes a label).
Your best bet is to use one of the known efficient De Bruijn constructors and try to see if you can put your set-equivalence in there. Check out
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.14.674&rep=rep1&type=pdf
and
http://www.dim.uchile.cl/~emoreno/publicaciones/FINALES/copyrighted/IPL05-De_Bruijn_sequences_and_De_Bruijn_graphs_for_a_general_language.pdf
for a start.
If you know graphs, another viable option is to start with a De Bruijn graph and formulate your set-equivalence as a graph rewriting. The 2nd paper does De Bruijn graph partitioning.
BTW, try the VB answer just for A, B, AB (at least the expansion is small) - it will make AABBAB and construct ABBA or ABBAB (or throw, in a decent language), both of which are wrong. You can even prove that it will always miss with 1st lexical expansions (that's what AAB, AAAB etc. are) just by examining the first 2 passes (it will always miss the 2nd A for NxA because (N-1)xA+B is in the string (1st expansion of {AB})).
Oh, and if we could establish how many of each letter an optimal solution should have (don't look at B(5,2), it's too easy and regular :-) a random search would be feasible - you generate candidates with provable traits (like AAAA, BBBB ... are present and not touching, and it has n1 A-s, n2 B-s ...) and a random arrangement, and then test whether they are solutions (checking is much faster than exhaustive search in this case).
Cool problem. Just a draft/pseudo algo:
dim STR-A as string = getall(ABCDEFGHIJKL)
//custom function to generate concat list of all 793 4-char combos.
//should be listed side-by-side to form 3172 character-long string.
//different ordering may ultimately produce different results.
//brute-forcing all orders of combos is too much work (793! is a big #).
//need to determine how to find optimal ordering, for this particular
//approach below.
dim STR-B as string = "" // to hold the string you're searching for
dim STR-C as string = "" // to hold the sub-string you are searching in
dim STR-A-NEW as string = "" //variable to hold your new string
dim MATCH as boolean = false //variable to hold matching status
while len(STR-A) > 0
    //check each character in STR-A, which will be shortened by 1 char on each
    //pass.
    MATCH = false
    STR-B = left(STR-A, 4)
    STR-B = reduce(STR-B)
    //reduce(str) is a custom re-usable function to sort & remove duplicates
    for i as integer = 2 to len(STR-A) - 1
        //start at 2: the group at position 1 is STR-B itself
        STR-C = substr(STR-A, i, 4)
        //gives you the 4-character sequence beginning at position i
        STR-C = reduce(STR-C)
        IF STR-B = STR-C Then
            MATCH = true
            exit for
            //as long as there is even one match, you can throw-away the first
            //letter
        END IF
    next
    IF MATCH = false then
        //if you didn't find a match, then the first letter should be saved
        //(taken from STR-A, because STR-B has been sorted by reduce)
        STR-A-NEW += LEFT(STR-A, 1)
    END IF
    MATCH = false //re-init MATCH
    STR-A = RIGHT(STR-A, LEN(STR-A) - 1) //re-init STR-A
wend
Anyway -- there could be problems at this, and you'd need to write another function to parse your result string (STR-A-NEW) to prove that it's a viable answer...
I've been thinking about this one and I'm sketching out a solution.
Let's call a string of four symbols a word and we'll write S(w) to denote the set of symbols in word w.
Each word abcd has "follow-on" words bcde where a,...,e are all symbols.
Let succ(w) be the set of follow-on words v for w such that S(w) != S(v). succ(w) is the set of successor words that can follow on from the first symbol in w if w is in a solution.
For each non-empty set of symbols s of cardinality at most four, let words(s) be the set of words w such that S(w) = s. Any solution must contain exactly one word in words(s) for each such set s.
Now we can do a reasonable search. The basic idea is this: say we are exploring a search path ending with word w. The follow-on word must be a non-excluded word in succ(w). A word v is excluded if the search path already contains some word u such that v is in words(S(u)).
You can be slightly more cunning: if we track the possible "predecessor" words to a set s (i.e., words w with a successor v such that v in words(s)) and reach a point where every predecessor of s is excluded, then we know we have reached a dead end, since we'll never be able to obtain s from any extension of the current search path.
Code to follow after the weekend, with a bit of luck...
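In the meantime, here is a minimal Python sketch of the basic search idea only (extend by one symbol, reject a follow-on word whose symbol set has already been used). It is my own reading of the outline above, not the author's code: the name `find_sequence` is hypothetical, the window length `k` is a parameter so that tiny cases can be tried, and the predecessor-based dead-end pruning is left out. Without that pruning it is only practical for very small alphabets.

```python
from itertools import combinations, product


def find_sequence(symbols, k):
    """Return a string whose k-letter windows yield every non-empty symbol set
    of size <= k exactly once, or None if the exhaustive search finds none."""
    needed = sum(1 for r in range(1, k + 1) for _ in combinations(symbols, r))

    def extend(seq, used):
        if len(used) == needed:
            return seq
        tail = seq[len(seq) - k + 1:]        # last k-1 symbols of the path
        for s in symbols:
            window = frozenset(tail + s)     # S(v) for the follow-on word v
            if window in used:               # v is excluded
                continue
            used.add(window)
            found = extend(seq + s, used)
            if found is not None:
                return found
            used.remove(window)              # backtrack
        return None

    for first in product(symbols, repeat=k):
        word = "".join(first)
        found = extend(word, {frozenset(word)})
        if found is not None:
            return found
    return None


print(find_sequence("ABC", 2))
```

With three symbols and windows of length 2 there are 6 sets to cover, so any result has 7 letters.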
Here is my proposal. I'll admit upfront this is a performance and memory hog.
This may be overkill, but have a class; we'll call it UniqueCombination. It will contain a unique 1-4 character reduced combination of the input set (i.e. A, AB, ABC, ...). It will also contain a list of possible combinations (AB: {AABB, ABAB, BBAA, ...}). It will need a method that determines whether any of its possible combinations overlaps any possible combination of another UniqueCombination by three characters. It also needs an overload that takes a string.
Then we start with the string "AAAA" and find all of the UniqueCombinations that overlap this string. Then we find how many UniqueCombinations those possible matches overlap with (we could be smart at this point and store this number). Then we pick the one with the least number of overlaps greater than 0, so that the ones with the fewest possible matches are used up first.
Then we find a specific combination for the chosen UniqueCombination and add it to the final string. We remove this UniqueCombination from the list, then find overlaps for the current string; rinse and repeat. (We could be smart and, on subsequent runs while searching for overlaps, remove any of the unreduced combinations that are contained in the final string.)
Well, that's my plan; I will work on the code this weekend. Granted, this does not guarantee that the final 4 characters will be 4 of the same letter (it might actually be trying to avoid that, but I will look into that as well).
If there is a non-exponential solution at all, it may need to be formulated in terms of a recursive "growth" from a problem of smaller size, i.e. to construct B(N,k) from B(N-1,k-1) or from B(N-1,k) or from B(N,k-1).
Systematic construction for B(5,2) - one step at a time :-) It's bound to get more complex later. [card stands for cardinality, {AB} has card=2; I'll also call them 2-s, 3-s etc.] Note, 2-s and 3-s will be k-1 and k later (I hope).
Initial. Start with the k-1 result and inject symbols for singletons (unique expansion, empty intersection):
    ABCDE -> AABBCCDDEE
    mark used card=2 sets: AB, BC, CD, DE
Rewriting. Form card=3 sets to inject symbols into the marked card=2 sets.
    The 1st feasible lexicographic expansion fires (may have to backtrack for k>2).
    It's OK to use already marked 2-s since they'll all get replaced, but a verification pass may be needed for higher k.
    AB->ACB, BC->BDC, CD->CED, DE->DAE ==> AACBBDCCEDDAEE
Mark/verify used 2-s.
    Normally keep marking/unmarking during the construction, but also keep the old mark list.
    Marking/unmarking can get expensive if there's backtracking in #3.
    Unused: AB, BE
    For higher k, several recursive rewriting passes may be needed, possibly partitioning new sets into classes.
Finalize. Unused 2-s should overlap around the edge (that's why it's cyclic):
    ABE - B can go to the beginning or the end: AACBBDCCEDDAEEB
Note: a step from B(N-1,k) to B(N,k) may need injection of pseudo-singletons, like doubling or tripling A.
B(5,2) -> B(5,3) -> B(5,4)
Initial. Same as before: ABCDE -> AAACBBBDCCCEDDDAEEEB
    No use marking 3-sets since they are all going to be changed.
Rewriting.
    Choose systematic insertion positions:
        AAA_CBBB_DCCC_EDDD_AEEE_B
    Mark all 2-s released by this: AC, AD, BD, BE, CE
    Use the marked 2-s to decide the inserted symbols - notice the total regularity:
        AxCB D -> ADCB
        BxDC E -> BEDC
        CxED A -> CAED
        DxAE B -> DBAE
        ExBA C -> ECBA
    Verify that the 3-s are all used (inserted symbols marked just for fun):
        AAA[D]CBBB[E]DCCC[A]EDDD[B]AEEE[C]B
Note: The systematic choice of insertion points deterministically dictated the insertions (only AD can fit 1st; AC would create a duplicate 2-set (AAC, ACC)).
Note: It's not going to be so nice for B(6,2) and B(6,3), since the number of 2-s will exceed 2x the number of 1-s. This is important since 2-s sit naturally on the sides of 1-s, like CBBBE, and the issue is how to place them when you run out of 1-s.
B(5,3) is so symmetrical that just repeating #1 produces B(5,4):
    AAAADCBBBBEDCCCCAEDDDDBAEEEECB
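Since these constructions are cyclic, a checker has to wrap the windows around the end of the string. Here is a small Python sketch of such a check; the helper names are mine, and the three strings are the B(5,2), B(5,3) and B(5,4) results above with the bracket marks removed.

```python
from itertools import combinations


def cyclic_window_sets(s, k):
    """Letter sets of all len(s) windows of length k, wrapping around the end."""
    wrapped = s + s[:k - 1]
    return [frozenset(wrapped[i:i + k]) for i in range(len(s))]


def is_cyclic_solution(s, alphabet, k):
    """True if every 1..k-letter combination of `alphabet` appears exactly once."""
    wanted = {frozenset(c) for r in range(1, k + 1) for c in combinations(alphabet, r)}
    sets = cyclic_window_sets(s, k)
    return len(sets) == len(wanted) and set(sets) == wanted


print(is_cyclic_solution("AACBBDCCEDDAEEB", "ABCDE", 2))                  # True
print(is_cyclic_solution("AAADCBBBEDCCCAEDDDBAEEECB", "ABCDE", 3))        # True
print(is_cyclic_solution("AAAADCBBBBEDCCCCAEDDDDBAEEEECB", "ABCDE", 4))   # True
```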

Resources