How to read this PDA? - computation-theory

This is an old HW question where I had to find out how many times the q_loop state was visited for each given string:
bb: 5
abab: 0
abba: 8
babbab: 11
I understand how the 1st string visits the state 5 times and the 2nd string is not accepted, but I don't know the process for the 3rd and 4th strings. I would really appreciate it if someone could just walk through the states visited for the 3rd or 4th string because I keep getting stuck

This PDA matches palindromes. How does it do it? It looks for P on the stack and replaces it with either aPa or bPb (or removes it via e,P->e once the middle of the input is reached), then matches a on the stack with a in the input. Likewise for b. It replaces P nondeterministically.
Let's walk through #3. We'll focus on Q_loop, which I'll just call L for simplicity. The top of the stack will be the rightmost character.
The first time we get to L, the input is abba and the stack is $P. We will nondeterministically follow the e,P->a transition.
The input is abba, the stack is $aPa. We will follow the a,a->e transition.
Input: bba, stack: $aP. Follow e,P->b.
Input: bba, stack: $abPb. Follow b,b->e.
Input: ba, stack: $abP. Follow e,P->e.
Input: ba, stack: $ab. Follow b,b->e.
Input: a, stack: $a. Follow a,a->e.
Input: e, stack: $. Follow e,$->e and accept in the next state.
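The walkthrough above can be replayed mechanically. The following is only a minimal sketch, not the PDA from the assignment: it assumes the rules implied by this answer (P is replaced by aPa or bPb, or dropped), follows only the single accepting run of an even-length palindrome, and returns 0 for every other input. The name `loop_visits` is made up for illustration.

```python
def loop_visits(word):
    """Count how often q_loop is entered on the accepting run (0 if rejected).

    Assumption: only even-length palindromes are accepted, as in the examples.
    """
    if len(word) % 2 != 0 or word != word[::-1]:
        return 0
    half = len(word) // 2
    stack = ["$", "P"]   # top of the stack is the rightmost element
    pos = 0              # next unread input position
    expanded = 0         # how many times P was replaced by xPx so far
    visits = 0
    while True:
        visits += 1
        top = stack[-1]
        if top == "$":
            break                        # e,$->e leaves q_loop and accepts
        stack.pop()
        if top == "P":
            if expanded < half:          # guess: still in the first half
                c = word[expanded]
                stack.extend([c, "P", c])
                expanded += 1
            # otherwise e,P->e: P is simply dropped
        else:
            assert top == word[pos]      # x,x->e consumes one input symbol
            pos += 1
    return visits


for w in ["bb", "abab", "abba", "babbab"]:
    print(w, loop_visits(w))
```

Under these assumptions the counts match the ones quoted in the question.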

Related

Counting in Wonderland

The text of Alice in Wonderland contains the word 'Wonderland' 8 times. (Let's be case-insensitive for this question).
However it contains the word many more times if you count non-contiguous subsequences as well as substrings, eg.
Either the well was very deep, or she fell very slowly, for she had
plenty of time as she went down to look about her and to WONDER what was
going to happen next. First, she tried to Look down AND make out what
she was coming to, but it was too dark to see anything;
(A subsequence is a sequence that can be derived from another sequence by deleting some elements without changing the order of the remaining elements. —Wikipedia)
How many times does the book contain the word Wonderland as a subsequence? I expect this will be a big number—it's a long book with many w's and o's and n's and d's.
I tried brute force counting (recursion to make a loop 10 deep) but it was too slow, even for that example paragraph.

Let's say you didn't want to search for wonderland, but just for w. Then you'd simply count how many times w occurred in the story.
Now let's say you want wo. For each first character of the current pattern you find, you add to your count:
How many times the current pattern without its first character occurs in the rest of the story, after this character you're at: so you have reduced the problem (story[1..n], pattern[1..n]) to (story[2..n], pattern[2..n])
How many times the entire current pattern occurs in the rest of the story. So you have reduced the problem to (story[2..n], pattern[1..n])
Now you can just add the two. There is no overcounting if we talk in terms of subproblems. Consider the example wawo. Obviously, wo occurs 2 times. You might think the counting will go like:
For the first w, add 1 because o occurs once after it and another 1 because wo occurs once after it.
For the second w, add 1 because o occurs once after it.
Answer is 3, which is wrong.
But this is what actually happens:
(wawo, wo) -> (awo, o) -> (wo, o) -> (o, o) -> (-, -) -> 1
                                            -> (-, o) -> 0
           -> (awo, wo) -> (wo, wo) -> (o, wo) -> (-, wo) -> 0
                                    -> (o, o) -> (-, -) -> 1
                                              -> (-, o) -> 0
So you can see that the answer is 2.
If you don't find a w, then the count for this position is just how many times wo occurs after this current character.
This allows for dynamic programming with memoization:
    def count(story_index, pattern_index, dp):
        # story and pattern are globals; dp is a dict used as the memo table
        if (story_index, pattern_index) not in dp:
            if pattern_index == len(pattern):
                return 1
            if story_index == len(story):
                return 0
            if story[story_index] == pattern[pattern_index]:
                dp[story_index, pattern_index] = (count(story_index + 1, pattern_index + 1, dp) +
                                                  count(story_index + 1, pattern_index, dp))
            else:
                dp[story_index, pattern_index] = count(story_index + 1, pattern_index, dp)
        return dp[story_index, pattern_index]

Call with count(0, 0, dp), where dp is an empty dict. Note that you can make the code cleaner (remove the duplicate function call).
Python code, with no memoization:
    def count(story, pattern):
        if len(pattern) == 0:
            return 1
        if len(story) == 0:
            return 0
        s = count(story[1:], pattern)
        if story[0] == pattern[0]:
            s += count(story[1:], pattern[1:])
        return s

    print(count('wonderlandwonderland', 'wonderland'))
Output:
17
This makes sense: for each i from 1 to 9, you can group the first i characters of the first wonderland with the remaining final characters of the second wonderland, giving you 9 solutions. Another 2 are the words themselves. The other six come from the repeated n and d:
wonderlandwonderland
*********    *
********    **
********    *      *
**      **    ******
***      *    ******
**      *    *******
You're right that this will be a huge number. I suggest that you either use large integers or take the result modulo something.
The same program returns 9624 for your example paragraph.
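The recurrence above can also be evaluated bottom-up, which avoids deep recursion on a long text. This is a sketch of that variant, not the memoized function itself; the name `count_subsequences` and the rolling `ways` array are choices made here for illustration. Python integers are arbitrary precision, so large counts need no special handling.

```python
def count_subsequences(story, pattern):
    """Count occurrences of pattern as a (possibly non-contiguous) subsequence."""
    m = len(pattern)
    # ways[j] = number of ways to match pattern[j:] inside the suffix of
    # story processed so far; the empty pattern matches exactly once.
    ways = [0] * m + [1]
    for ch in reversed(story):
        # Ascending j reads ways[j + 1] before this character updates it,
        # which corresponds to "the rest of the story" in the recurrence.
        for j in range(m):
            if pattern[j] == ch:
                ways[j] += ways[j + 1]
    return ways[0]


print(count_subsequences("wawo", "wo"))
print(count_subsequences("wonderlandwonderland", "wonderland"))
```

This runs in O(len(story) * len(pattern)) time and O(len(pattern)) extra space.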

The string "wonderland" occurs as a subsequence in Alice in Wonderland [1] 24100772180603281661684131458232 times.
The main idea is to scan the main text character by character, keeping a running count of how often each prefix of the target string (i.e.: in this case, "w", "wo", "won", ..., "wonderlan", and "wonderland") has occurred up to the current letter. These running counts are easy to compute and update. If the current letter does not occur in "wonderland", then the counts are left untouched. If the current letter is "a" then we increment the count of "wonderla"s seen by the number of "wonderl"s seen up to this point. If the current letter is "n" then we increment the count of "won"s by the count of "wo"s and the count of "wonderlan"s by the count of "wonderla"s. And so forth. When we reach end of the text, we will have the count of all prefixes of "wonderland" including the string "wonderland" itself, as desired.
The advantage of this approach is that it requires a single pass through the text and does not require O(n) recursive calls (which will likely exceed the maximum recursion depth unless you do something clever).
Code
    import fileinput

    target = 'wonderland'
    prefixes = dict()  # letter -> prefixes of target that end in that letter
    count = dict()     # prefix -> how often it has occurred as a subsequence so far
    for i in range(len(target)):
        letter = target[i]
        prefix = target[:i+1]
        if letter not in prefixes:
            prefixes[letter] = [prefix]
        else:
            prefixes[letter].append(prefix)
        count[prefix] = 0

    for line in fileinput.input():
        for letter in line.lower():
            if letter in prefixes:
                for prefix in prefixes[letter]:
                    if len(prefix) > 1:
                        count[prefix] = count[prefix] + count[prefix[:-1]]
                    else:
                        count[prefix] = count[prefix] + 1

    print(count[target])
[1] Using this text from Project Gutenberg, starting with "CHAPTER I. Down the Rabbit-Hole" and ending with "THE END".

Following up on previous comments, if you are looking for an algorithm that would return 2 for the input wonderlandwonderland and 1 for wonderwonderland, then I think you could adapt the algorithm from this question:
How to find smallest substring which contains all characters from a given string?
Effectively, the change in your case would be that, once an instance of the word is found, you increment a counter and repeat all the procedure with the remaining part of the text.
Such an algorithm would be O(n) in time, where n is the length of the text, and O(m) in space, where m is the length of the searched string.
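For illustration, here is a minimal sketch of that "find one instance, count it, continue with the rest of the text" idea. It uses a plain greedy left-to-right scan rather than the algorithm from the linked question, and the name `count_disjoint` is made up here.

```python
def count_disjoint(text, word):
    """Count non-overlapping occurrences of word as a subsequence of text."""
    if not word:
        return 0
    found = 0
    j = 0  # index of the next letter of word we are waiting for
    for ch in text:
        if ch == word[j]:
            j += 1
            if j == len(word):
                found += 1   # one full instance found; start over
                j = 0
    return found


print(count_disjoint("wonderlandwonderland", "wonderland"))
print(count_disjoint("wonderwonderland", "wonderland"))
```

The scan touches every character of the text once and keeps only a counter and an index into the word.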

Implementing Parallel Algorithm for Longest Common Subsequence

I am trying to implement the Parallel Algorithm for Longest Common Subsequence Problem described in http://www.iaeng.org/publication/WCE2010/WCE2010_pp499-504.pdf
But I am having a problem with the variable C in Equation 6 on page 4.
The paper introduces C at the end of page 3 as:
Let C[1 : l] be the finite alphabet
I am not sure what is meant by this. I guess that with the 2 strings ABCDEF and ABQXYEF it would be ABCDEFQXY. But what if my 2 strings are lists of objects (where my match test is, for example, obj1.Name = obj2.Name)? What would my C be here? Just a union of the 2 arrays?

Having read and studied the paper, I can say that C is supposed to be an array holding the alphabet of your strings, where the alphabet size (and, thus, the size of C) is l.
By the looks of your question, however, I feel the need to go deeper on this, because it looks like you didn't get the whole picture yet. What is P[i,j], and why do you need it? The answer is that you don't really need it, but it's an elegant optimization. In page 3, a little bit before Theorem 1, it is said that:
[...] This process ends when j-k = 0 at the k-th step, or a(i) =
b(j-k) at the k-th step. Assume that the process stops at the k-th
step, and k must be the minimum number that makes a(i) = b(j-k) or j-k
= 0. [...]
The recurrence relation in (3) is equivalent to (2), but the fundamental difference is that (2) expands recursively, whereas with (3) you never have recursive calls, provided that you know k. In other words, the magic behind (3) not expanding recursively is that you somehow know the spot where the recursion on (2) would stop, so you look at that cell immediately, rather than recursively approaching it.
Ok then, but how do you find out the value for k? Since k is the spot where (2) reaches a base case, it can be seen that k is the amount of columns that you have to "go back" on B until you are either off the limits (i.e., the first column that is filled with 0's) OR you find a match between a character in B and a character in A (which corresponds to the base case conditions in (2)). Remember that you will be matching the character a(i-1), where i is the current row.
So, what you really want is to find the last position in B before j where the character a(i-1) appears. If no such character ever appears in B before j, then that would be equivalent to reaching the case i = 0 or j-1 = 0 in (2); otherwise, it's the same as reaching a(i) = b(j-1) in (2).
Let's look at an example:
Consider that the algorithm is working on computing the values for i = 2 and j = 3 (the row and column are highlighted in gray). Imagine that the algorithm is working on the cell highlighted in black and is applying (2) to determine the value of S[2,2] (the position to the left of the black one). By applying (2), it would then start by looking at a(2) and b(2). a(2) is C, b(2) is G, so there's no match (this is the same procedure as the original, well-known algorithm). The algorithm now wants to find the value of S[2,2], because it is needed to compute S[2,3] (where we are). S[2,2] is not known yet, but the paper shows that it is possible to determine that value without referring to the row with i = 2. In (2), the 3rd case is chosen: S[2,2] = max(S[1, 2], S[2, 1]). Notice, if you will, that all this formula is doing is looking at the positions that would have been used to calculate S[2,2]. So, to rephrase that: we're computing S[2,3], we need S[2,2] for that, we don't know it yet, so we're going back on the table to see what's the value of S[2,2] in pretty much the same way we did in the original, non-parallel algorithm.
When will this stop? In this example, it will stop when we find the letter C (this is our a(i)) in TGTTCGACA before the second T (the letter on the current column) OR when we reach column 0. Because there is no C before T, we reach column 0. Another example:
Here, (2) would stop with j-1 = 5, because that is the last position in TGTTCGACA where C shows up. Thus, the recursion reaches the base case a(i) = b(j-1) when j-1 = 5.
With this in mind, we can see a shortcut here: if you could somehow know the amount k such that j-1-k is a base case in (2), then you wouldn't have to go through the score table to find the base case.
That's the whole idea behind P[i,j]. P is a table where you lay down the whole alphabet vertically (on the left side); the string B is, once again, placed horizontally in the upper side. This table is computed as part of a preprocessing step, and it will tell you exactly what you will need to know ahead of time: for each position j in B, it says, for each character C[i] in C (the alphabet), what is the last position in B before j where C[i] is found (note that i is used to index C, the alphabet, and not the string A. Maybe the authors should have used another index variable to avoid confusion).
So, you can think of the semantics for an entry P[i,j] as something along the lines of: The last position in B where I saw C[i] before position j. For example, if your alphabet is sigma = {A, E, I, O, U}, and B = "AOOIUEI", then P is:
Take the time to understand this table. Note the row for O. Remember: this row lists, for every position in B, where is the last known "O". Only when j = 3 will we have a value that is not zero (it's 2), because that's the position after the first O in AOOIUEI. This entry says that the last position in B where O was seen before is position 2 (and, indeed, B[2] is an O, the one that follows A). Notice, in that same row, that for j = 4, we have the value 3, because now the last position for O is the one that corresponds to the second O in B (and since no more O's exist, the rest of the row will be 3).
Recall that building P is a preprocessing step necessary if you want to easily find the value of k that makes the recursion from equation (2) stop. It should make sense by now that P[i,j] is the k you're looking for in (3). With P, you can determine that value in O(1) time.
Thus, the C[i] in (6) is a letter of the alphabet - the letter that we are currently considering. In the example above, C = [A,E,I,O,U], and C[1] = A, C[2] = E, etc. In equation (7), c is the position in C where a(i) (the current letter of string A being considered) lives. It makes sense: after all, when building the score table position S[i,j], we want to use P to find the value of k - we want to know where was the last time we saw an a(i) in B before j. We do that by reading P[index_of(a(i)), j].
Ok, now that you understand the use of P, let's see what's happening with your implementation.
About your specific case
In the paper, P is shown as a table that lists the whole alphabet. It is a good idea to iterate through the alphabet because the typical uses of this algorithm are in bioinformatics, where the alphabet is much, much smaller than the string A, making the iteration through the alphabet cheaper.
Because your strings are sequences of objects, your C would be the set of all possible objects, so you'd have to build a table P with the set of all possible object instances (nonsense, of course). This is definitely a case where the alphabet size is huge when compared to your string size. However, note that you will only be indexing P in those rows that correspond to letters from A: any row in P for a letter C[i] that is not in A is useless and will never be used. This makes your life easier, because it means you can build P with the string A instead of using the alphabet of every possible object.
Again, an example: if your alphabet is AEIOU, A is EEI and B is AOOIUEI, you will only be indexing P in the rows for E and I, so that's all you need in P:
This works and suffices, because in (7), P[c,j] is the entry in P for the character c, and c is the index of a(i). In other words: C[c] always belongs to A, so it makes perfect sense to build P for the characters of A instead of using the whole alphabet for the cases where the size of A is considerably smaller than the size of C.
All you have to do now is to apply the same principle to whatever your objects are.
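As a rough sketch of the table described in this answer (not code from the paper), the following builds one row per distinct symbol, where entry j holds the last 1-based position in B strictly before j at which that symbol occurs, or 0 if there is none. The function name `build_last_seen_table`, the dict-of-rows layout, the unused padding column 0 and the `key` parameter are all assumptions made for illustration.

```python
def build_last_seen_table(symbols, b, key=lambda item: item):
    """Map key(symbol) -> row, where row[j] is the last 1-based position in b
    strictly before j that matches the symbol (0 if none). row[0] is padding."""
    b_keys = [key(item) for item in b]
    table = {}
    for symbol in symbols:
        k = key(symbol)
        if k in table:
            continue  # repeated symbols (e.g. from A) share one row
        row = [0] * (len(b) + 1)
        for j in range(2, len(b) + 1):
            # b_keys[j - 2] is the element at 1-based position j - 1
            row[j] = j - 1 if b_keys[j - 2] == k else row[j - 1]
        table[k] = row
    return table


full = build_last_seen_table("AEIOU", "AOOIUEI")
print(full["O"][1:])          # the row for O, for j = 1..7

only_a = build_last_seen_table("EEI", "AOOIUEI")
print(sorted(only_a))         # only the rows that A = EEI can ever index
```

For sequences of objects, something like `build_last_seen_table(list_a, list_b, key=lambda o: o.Name)` would apply the same idea, assuming `Name` is the attribute used in the match test.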
I really don't know how to explain it any better. This may be a little dense at first. Make sure to re-read it until you really get it - and I mean every little detail. You have to master this before thinking about implementing it.
NOTE: You said you were looking for a credible and / or official source. I'm just another CS student, so I'm not an official source, but I think I can be considered "credible". I've studied this before and I know the subject. Happy coding!

Find all words and phrases from one string

Due to the subject area (writing on a wall), an interesting condition is added: letters cannot change their order, so this is not a question about anagrams.
I saw a long word, written by paint on a wall, and now suddenly
I want all possible words and phrases I can get from this word by painting out any combination of letters. Wo r ds, randomly separated by whitespace are OK.
To broaden possible results let's make an assumption, that space is not necessary to separate words.
Edit: Obviously letter order should be maintained (thanks idz for pointing that out). Also, phrases may be meaningless. Here are some examples:
Source word: disestablishment
paint out:   ^ ^^^    ^^^^ ^^
left:         i   tabl    e   -> i table
or paint out:^^^^^^^^^   ^ ^^
left:                 ish e   -> i she (spacelessness is ok)
Hard mode/bonus task: consider possible slight alterations to letters (D <-> B, C <-> O and so on)
Please suggest your variants of solving this problem.
Here's my general straightforward approach
It's clear that we'll need an English dictionary to find words.
Our goal is to get words to search for in dictionary.
We need to find all possible letters variations to match them against dictionary: each letter can be itself (1) or painted out (0).
Taking the 'space is not needed to separate words' condition in consideration, to distinguish words we must assume that there might be a space between any two letters (1 - there's a space, 0 - there isn't).
d i s e s t a b l i s h m e n t
 ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^  - possible whitespace
N = number of letters in source word
N-1 = number of 'might-be spaces'
Any of the N + N - 1 elements can be in two states, so let's treat them as booleans. The number of possible variations is 2^(N + N - 1). Yes, it counts useless variants like putting a space between two spaces, but I didn't come up with a more elegant formula.
Now we need an algorithm to get all possible variations of a sequence of N+N-1 booleans (I haven't thought it out yet, but the word recursion flows through my mind). Then substitute all 1s with the corresponding letters (if the index of the boolean is odd) or whitespace (even), and all 0s with whitespace (odd) or nothing (even). Then trim leading and trailing whitespace, separate the words and search for them in the dictionary.
I don't like this monstrous approach and hope you will help me find good alternatives.

1) Put your dictionary in a trie or prefix tree
2) For each position in the string find legal words by trie look up; store these
3) Print all combinations of non-overlapping words
This assumes that like the examples in the question you want to maintain the letter order (i.e. you are not interested in anagrams).
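The three steps could be sketched roughly as follows. This is one possible reading of the outline, not the answerer's code: the tiny word list is made up, the trie is a plain nested dict, and words are matched as subsequences (letters may be painted out) by always taking the earliest next occurrence of each letter. Empty words are assumed not to be in the dictionary.

```python
END = ""  # trie key that marks the end of a word (convention chosen here)


def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = word
    return root


def phrases(text, words):
    """Return every tuple of dictionary words that can be read from text,
    in order, after painting out any letters."""
    trie = build_trie(words)

    def words_from(pos):
        # Walk the trie, always using the earliest occurrence of the next
        # letter at or after pos; yields (word, index just past its last letter).
        stack = [(trie, pos)]
        while stack:
            node, i = stack.pop()
            for ch, child in node.items():
                if ch == END:
                    yield child, i
                else:
                    k = text.find(ch, i)
                    if k != -1:
                        stack.append((child, k + 1))

    def extend(pos):
        for word, end in words_from(pos):
            yield (word,)
            for rest in extend(end):   # later words start after this one ends
                yield (word,) + rest

    return set(extend(0))


for phrase in sorted(phrases("disestablishment", {"i", "table", "she", "tab"})):
    print(" ".join(phrase))
```

The number of phrases grows quickly with the dictionary size, so for a real word list the results would need to be streamed or filtered rather than collected in a set.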

    #!/usr/bin/python3
    from itertools import compress, product

Read in dictionary, remove all 1- and 2-letter words which we never use in the English language:

    with open('/usr/share/dict/words') as f:
        english = f.read().splitlines()
    english = map(str.lower, english)
    # keep words longer than 2 letters plus a whitelist of short words
    # (a set instead of a list would make the membership tests below much faster)
    english = [w for w in english if (len(w)>2 or w in ['i','a','as','at','in','on','im','it','if','is','am','an'])]

    def isWord(word):
        return word in english

Your problem:

    def splitwords(word):
        """
        splitwords('starts') -> (('st', 'ar', 'ts'), ('st', 'arts'), ('star', 'ts'), ('starts'))
        """
        if word == '':
            yield ()
        for i in range(1, len(word)+1):
            left, right = word[:i], word[i:]
            if left in english:
                # left is a word: combine it with every reading of the remainder
                for reading in splitwords(right):
                    yield (left,) + tuple(reading)

    def splitwordsWithDeletions(word):
        # try every subset of letters (1 = keep, 0 = painted out), order preserved
        masks = product(*[(0,1) for char in word])
        for mask in masks:
            candidate = ''.join(compress(word, mask))
            for reading in splitwords(candidate):
                yield reading

    for reading in splitwordsWithDeletions('interesting'):
        print(reading)
Result (takes about 30 seconds):
()
('i',)
('in',)
('tin',)
('ting',)
('sin',)
('sing',)
('sting',)
('eng',)
('rig',)
('ring',)
('rein',)
('resin',)
('rest',)
('rest', 'i')
('rest', 'in')
...
('inters', 'tin')
('inter', 'sting')
('inters', 'ting')
('inter', 'eng')
('interest',)
('interest', 'i')
('interest', 'in')
('interesting',)
Speedup possible perhaps by precalculating which words can be read on each letter, into one bin per letter, and iterating with those pre-calculated to speed things up. I think someone else outlines a solution to that effect.

There are other places you can find anagram algorithms.
    subwords(word):
        if word is empty return
        if word is real word:
            print word
            anagrams(word)
        for each letter in word:
            subwords(word minus letter)
Edit: shoot, you'll want to pass a starting point in for the for loop. Otherwise, you'll be redundantly creating a LOT of calls. Frank minus r minus n is the same as Frank minus n minus r. Putting a starting point can ensure that you get each subset once... Except for repeats due to double letters. Maybe just memoize the results to a hash table before printing? Argh...

Algorithm to find streets and same kind in a hand

This is actually a Mahjong-based question, but a Romme- or even Poker-based background will also easily suffice to understand.
In Mahjong 14 tiles (tiles are like cards in Poker) are arranged to 4 sets and a pair. A street ("123") always uses exactly 3 tiles, not more and not less. A set of the same kind ("111") consists of exactly 3 tiles, too. This leads to a sum of 3 * 4 + 2 = 14 tiles.
There are various exceptions like Kan or Thirteen Orphans that are not relevant here. Colors and value ranges (1-9) are also not important for the algorithm.
I'm trying to determine if a hand can be arranged in the way described above. For certain reasons it should not only be able to deal with 14 but any number of tiles. (The next step would be to find how many tiles need to be exchanged to be able to complete a hand.)
Examples:
11122233344455 - easy enough, 4 sets and a pair.
12345555678999 - 123, 456, 789, 555, 99
11223378888999 - 123, 123, 789, 888, 99
11223344556789 - not a valid hand
My current and not yet implemented idea is this: For each tile, try to make a) a street b) a set c) a pair. If none works (or there would be > 1 pair), go back to the previous iteration and try the next option, or, if this is the highest level, fail. Else, remove the used tiles from the list of remaining tiles and continue with the next iteration.
I believe this approach works and would also be reasonably fast (performance is a "nice bonus"), but I'm interested in your opinion on this. Can you think of alternate solutions? Does this or something similar already exist?
(Not homework, I'm learning to play Mahjong.)

The sum of the values in a street and in a set can be divided by 3:
n + n + n = 3n
(n-1) + n + (n + 1) = 3n
So, if you add together all the numbers in a solved hand, you would get a number of the form 3N + 2M where M is the value of the tile in the pair. The remainder of the division by three (total % 3) is, for each value of M :
total % 3 = 0 -> M = {3,6,9}
total % 3 = 1 -> M = {2,5,8}
total % 3 = 2 -> M = {1,4,7}
So, instead of having to test nine possible pairs, you only have to try three based on a simple addition. For each possible pair, remove two tiles with that value and move on to the next step of the algorithm to determine if it's possible.
Once you have this, start with the lowest value. If there are less than three tiles with that value, it means they're necessarily the first element of a street, so remove that street (if you can't because tiles n+1 or n+2 are missing, it means the hand is not valid) and move on to the next lowest value.
If there are at least three tiles with the lowest value, remove them as a set (if you ask "what if they were part of a street?" consider that if they were, then there are also three of tile n+1 and three of tile n+2, which can also be turned into sets) and continue.
If you reach an empty hand, the hand is valid.
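A minimal sketch of the procedure described above, assuming a single suit with tiles given as integers 1-9 (honor tiles and multiple suits are not handled). The function names `can_complete` and `only_melds` are made up for illustration.

```python
from collections import Counter


def only_melds(counts):
    """True if the tiles split completely into sets (nnn) and streets (n, n+1, n+2)."""
    counts = counts.copy()
    for n in range(1, 10):
        if counts[n] >= 3:
            counts[n] -= 3            # take a set of the lowest value
        k = counts[n]                 # leftovers must each start a street
        if k:
            if counts[n + 1] < k or counts[n + 2] < k:
                return False
            for v in (n, n + 1, n + 2):
                counts[v] -= k
    return True


def can_complete(tiles):
    """True if tiles form any number of sets/streets plus exactly one pair."""
    counts = Counter(tiles)
    total = sum(tiles)
    # total = 3N + 2M, so only pair values M with 2M % 3 == total % 3 are possible
    for m in range(1, 10):
        if (2 * m) % 3 == total % 3 and counts[m] >= 2:
            rest = counts.copy()
            rest[m] -= 2
            if only_melds(rest):
                return True
    return False


for hand in ["11122233344455", "12345555678999", "11223378888999", "11223344556789"]:
    print(hand, can_complete([int(c) for c in hand]))
```

Because the check only looks at counts per value, it works for any number of tiles, as the question asks.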
For example, for your invalid hand the total is 60, which means M = {3,6,9}:
Remove the 3: 112244556789
    - Start with 1: there are less than three, so remove a street
      -> impossible: 123 needs a 3
Remove the 6: impossible, there is only one
Remove the 9: impossible, there is only one
With your second example 12345555678999, the total is 78, which means M = {3,6,9}:
Remove the 3: impossible, there is only one
Remove the 6: impossible, there is only one
Remove the 9: 123455556789
    - Start with 1: there is only one, so remove a street
      -> 455556789
    - Start with 4: there is only one, so remove a street
      -> 555789
    - Start with 5: there are three, so remove a set
      -> 789
    - Start with 7: there is only one, so remove a street
      -> empty : hand is valid, removals were [99] [123] [456] [555] [789]
Your third example 11223378888999 also has a total of 78, which causes backtracking:
Remove the 3: 112278888999
    - Start with 1: there are less than three, so remove a street
      -> impossible: 123 needs a 3
Remove the 6: impossible, there are none
Remove the 9: 112233788889
    - Start with 1: there are less than three, so remove streets
      -> 788889
    - Start with 7: there is only one, so remove a street
      -> 888
    - Start with 8: there are three, so remove a set
      -> empty, hand is valid, removals were : [99] [123] [123] [789] [888]

There is a special case that you need to do some re-work to get it right. This happens when there is a run-of-three and a pair with the same value (but in different suit).
Let b denote bamboo, c denote character, and d denote dot. Try this hand:
b2,b3,b4,b5,b6,b7,c4,c4,c4,d4,d4,d6,d7,d8
d4,d4 should serve as the pair, and c4,c4,c4 should serve as the run-of-3 set.
But because the 3 c4 tiles appear before the 2 d4 tiles, the first 2 c4 tiles will be picked up as the pair, leaving an orphan c4 and 2 d4s, and these 3 tiles won't form a valid set.
In this case, you'll need to "return" the 2 c4 tiles to the hand (and keep the hand sorted), and search for the next tile that meets the criteria (value == 4). To do that you'll need to make the code "remember" that it has tried c4, so that in the next iteration it skips c4 and looks for other tiles with value == 4. The code will be a bit messy, but doable.

I would break it down into 2 steps.
1) Figure out possible combinations. I think exhaustive checking is feasible with these numbers. The result of this step is a list of combinations, where each combination has a type (set, street, or pair) and a pattern with the cards used (could be a bitmap).
2) With the previous information, determine possible collections of combinations. This is where a bitmap would come in handy. Using bitwise operators, you could see overlaps in usage of the same tile for different combinations.
You could also do a step 1.5 where you just check to see if enough of each type is available. This step and step 2 would be where you would be able to create a general algorithm. The first step would be the same for all numbers of tiles and possible combinations quickly.

string of integers puzzle

I apologize for not having the math background to put this question in a more formal way.
I'm looking to create a string of 796 letters (or integers) with certain properties.
Basically, the string is a variation on a De Bruijn sequence B(12,4), except order and repetition within each n-length subsequence are disregarded.
i.e. ABBB BABA BBBA are each equivalent to {AB}.
In other words, the main property of the string involves looking at consecutive groups of 4 letters within the larger string
(i.e. the 1st through 4th letters, the 2nd through 5th letters, the 3rd through 6th letters, etc)
And then producing the set of letters that comprise each group (repetitions and order disregarded)
For example, in the string of 9 letters:
A B B A C E B C D
the first 4-letter group is: ABBA, which is comprised of the set {AB}
the second group is: BBAC, which is comprised of the set {ABC}
the third group is: BACE, which is comprised of the set {ABCE}
etc.
The goal is for every combination of 1-4 letters from a set of N letters to be represented by the 1-4-letter resultant sets of the 4-element groups once and only once in the original string.
For example, if there is a set of 5 letters {A, B, C, D, E} being used
Then the possible 1-4 letter combinations are:
A, B, C, D, E,
AB, AC, AD, AE, BC, BD, BE, CD, CE, DE,
ABC, ABD, ABE, ACD, ACE, ADE, BCD, BCE, BDE, CDE,
ABCD, ABCE, ABDE, ACDE, BCDE
Here is a working example that uses a set of 5 letters {A, B, C, D, E}.
D D D D E C B B B B A E C C C C D A E E E E B D A A A A C B D D B
The 1st through 4th elements form the set: D
The 2nd through 5th elements form the set: DE
The 3rd through 6th elements form the set: CDE
The 4th through 7th elements form the set: BCDE
The 5th through 8th elements form the set: BCE
The 6th through 9th elements form the set: BC
The 7th through 10th elements form the set: B
etc.
I am hoping to find a working example of a string that uses 12 different letters (a total of 793 4-letter groups within a 796-letter string) starting (and if possible ending) with 4 of the same letter.
Here is a working solution for 7 letters:
AAAABCDBEAAACDECFAAADBFBACEAGAADEFBAGACDFBGCCCCDGEAFAGCBEEECGFFBFEGGGGFDEEEEFCBBBBGDCFFFFDAGBEGDDDDBE
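Whatever method produces a candidate, checking it is cheap. The following sketch tests the property stated above: every window of 4 consecutive letters yields a letter set that has not appeared before, and together the windows cover every combination of 1 to 4 letters. The function name `is_valid` and its parameters are chosen here for illustration.

```python
from itertools import combinations


def is_valid(sequence, alphabet, width=4):
    """True if the width-letter windows of sequence produce every non-empty
    subset of at most `width` letters of alphabet exactly once."""
    seen = set()
    for i in range(len(sequence) - width + 1):
        group = frozenset(sequence[i:i + width])
        if group in seen:
            return False              # the same set appeared twice
        seen.add(group)
    expected = {frozenset(c)
                for r in range(1, width + 1)
                for c in combinations(alphabet, r)}
    return seen == expected


example = "D D D D E C B B B B A E C C C C D A E E E E B D A A A A C B D D B".replace(" ", "")
print(is_valid(example, "ABCDE"))      # the 5-letter example from above
print(is_valid("ABBACEBCD", "ABCDE"))  # the short 9-letter illustration
```

The 9-letter string is rejected simply because it is too short to cover all 30 combinations.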

Beware that in order to attempt exhaustive search (answer in VB is trying a naive version of that) you'll first have to solve the problem of generating all possible expansions while maintaining lexicographical order. Just ABC expands to all perms of AABC, plus all perms of ABBC, plus all perms of ABCC, which is 3*4! instead of just AABC. If you just concatenate AABC and AABD it would cover just 4 out of 4! perms of AABC, and even that by accident. Just this expansion will bring you exponential complexity - end of game. Plus you'll need to maintain the association between all expansions and the set (the set becomes a label).
Your best bet is to use one of the known efficient De Bruijn constructors and try to see if you can put your set-equivalence in there. Check out
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.14.674&rep=rep1&type=pdf
and
http://www.dim.uchile.cl/~emoreno/publicaciones/FINALES/copyrighted/IPL05-De_Bruijn_sequences_and_De_Bruijn_graphs_for_a_general_language.pdf
for a start.
If you know graphs, another viable option is to start with De Bruijn graph and formulate your set-equivalence as a graph rewriting. 2nd paper does De Bruijn graph partitioning.
BTW, try VB answer just for A,B,AB (at least expansion is small) - it will make AABBAB and construct ABBA or ABBAB (or throw in a decent language) both of which are wrong. You can even prove that it will always miss with 1st lexical expansions (that's what AAB, AAAB etc. are) just by examining first 2 passes (it will always miss 2nd A for NxA because (N-1)xA+B is in the string (1st expansion of {AB}).
Oh and if we could establish how many of each letter an optimal solution should have (don't look at B(5,2), it's too easy and regular :-) a random search would be feasible - you generate candidates with provable traits (like AAAA, BBBB ... are present and not touching, and it has n1 A-s, n2 B-s ...) and random arrangement and then test whether they are solutions (checking is much faster than exhaustive search in this case).

Cool problem. Just a draft/pseudo algo:

    dim STR-A as string = getall(ABCDEFGHIJKL)
    //custom function to generate concat list of all 793 4-char combos.
    //should be listed side-by-side to form 3172 character-long string.
    //different ordering may ultimately produce different results.
    //brute-forcing all orders of combos is too much work (793! is a big #).
    //need to determine how to find optimal ordering, for this particular
    //approach below.
    dim STR-B as string = "" // to hold the string you're searching for
    dim STR-C as string = "" // to hold the sub-string you are searching in
    dim STR-A-NEW as string = "" //variable to hold your new string
    dim MATCH as boolean = false //variable to hold matching status
    while len(STR-A) > 0
        //check each character in STR-A, which will be shortened by 1 char on each
        //pass.
        MATCH = false
        STR-B = left(STR-A, 4)
        STR-B = reduce(STR-B)
        //reduce(str) is a custom re-usable function to sort & remove duplicates
        for i as integer = 1 to len(STR-A) - 1
            STR-C = substr(STR-A, i, 4)
            //gives you the 4-character sequence beginning at position i
            STR-C = reduce(STR-C)
            IF STR-B = STR-C Then
                MATCH = true
                exit for
                //as long as there is even one match, you can throw-away the first
                //letter
            END IF
            i = i+1
        next
        IF match = false then
            //if you didn't find a match, then the first letter should be saved
            STR-A-NEW += LEFT(STR-B, 1)
        END IF
        MATCH = false //re-init MATCH
        STR-A = RIGHT(STR-A, LEN(STR-A) - 1) //re-init STR_A
    wend
Anyway -- there could be problems with this, and you'd need to write another function to parse your result string (STR-A-NEW) to prove that it's a viable answer...
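
For anyone who wants to try this out, here is a minimal runnable sketch in Python of a greedy pass in the same spirit, together with the kind of checking function mentioned above. It is only one reading of the pseudo-code: the names window_sets, required_sets, is_solution and greedy_reduce are made up for this illustration, and the window size k is a parameter (an assumption) so that tiny alphabets can be tested first.

```python
from itertools import combinations


def window_sets(s, k):
    """Symbol sets of all full length-k windows of s (non-cyclic)."""
    return {frozenset(s[i:i + k]) for i in range(len(s) - k + 1)}


def required_sets(alphabet, k):
    """Every non-empty symbol set of size at most k."""
    return {frozenset(c)
            for r in range(1, k + 1)
            for c in combinations(alphabet, r)}


def is_solution(s, alphabet, k):
    """True if every required set shows up in some window of s."""
    return required_sets(alphabet, k) <= window_sets(s, k)


def greedy_reduce(s, k):
    """Assumed port of the greedy pass: drop the leading character whenever
    the leading window's symbol set also occurs in a later full window."""
    kept = []
    while s:
        head = frozenset(s[:k])
        later = {frozenset(s[i:i + k]) for i in range(1, len(s) - k + 1)}
        if head not in later:
            kept.append(s[0])  # no later match, so this character is saved
        s = s[1:]
    return "".join(kept)
```

With the 12-symbol alphabet, required_sets("ABCDEFGHIJKL", 4) has 793 members, matching the count above. On the small input greedy_reduce("AABBAB", 2) the sketch returns ABAB, and is_solution("ABAB", "AB", 2) is False because the windows for A alone and B alone are lost, so a check like this is worth running on any result.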

I've been thinking about this one and I'm sketching out a solution.
Let's call a string of four symbols a word and we'll write S(w) to denote the set of symbols in word w.
Each word abcd has "follow-on" words bcde where a,...,e are all symbols.
Let succ(w) be the set of follow-on words v for w such that S(w) != S(v). succ(w) is the set of successor words that can follow on from the first symbol in w if w is in a solution.
For each non-empty set of symbols s of cardinality at most four, let words(s) be the set of words w such that S(w) = s. Any solution must contain exactly one word in words(s) for each such set s.
Now we can do a reasonable search. The basic idea is this: say we are exploring a search path ending with word w. The follow-on word must be a non-excluded word in succ(w). A word v is excluded if the search path already contains some word u such that v in words(S(u)).
You can be slightly more cunning: if we track the possible "predecessor" words to a set s (i.e., words w with a successor v such that v in words(s)) and reach a point where every predecessor of s is excluded, then we know we have reached a dead end, since we'll never be able to obtain s from any extension of the current search path.
Code to follow after the weekend, with a bit of luck...
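
A minimal sketch of the basic search in Python, assuming we look for a non-cyclic string in which every window's symbol set occurs exactly once. The function name search and its parameters are hypothetical, the predecessor-based dead-end check is left out, and this is only practical for small alphabets and small word lengths.

```python
from itertools import combinations


def search(alphabet, k):
    """Depth-first search for a string whose length-k windows hit every
    symbol set of size <= k exactly once; returns None if none exists."""
    required = {frozenset(c)
                for r in range(1, k + 1)
                for c in combinations(alphabet, r)}
    # one window per required set, so the string has this exact length
    target_len = len(required) + k - 1

    def extend(path, used):
        if len(path) == target_len:
            return path
        for sym in alphabet:
            cand = path + sym
            if len(cand) < k:
                # no full window yet, nothing to exclude
                found = extend(cand, used)
            else:
                s = frozenset(cand[-k:])
                if s in used:
                    continue  # the follow-on word is excluded
                found = extend(cand, used | {s})
            if found:
                return found
        return None

    return extend("", frozenset())
```

For example, search("ABC", 2) yields a 7-character string covering A, B, C, AB, AC and BC once each, while the function returns None when no such string exists (as happens for search("ABCD", 2)).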

Here is my proposal. I'll admit upfront this is a performance and memory hog.
This may be overkill, but have a class; we'll call it UniqueCombination. This will contain a unique 1-4 char reduced combination of the input set (i.e. A, AB, ABC, ...). It will also contain a list of possible combinations (AB: {AABB, ABAB, BBAA, ...}). It will need a method that determines whether any of its possible combinations overlaps any possible combination of another UniqueCombination by three characters, plus an overload that takes a string instead.
Then we start with the string "AAAA" and find all of the UniqueCombinations that overlap this string. Then we find how many UniqueCombinations each of those possible matches overlaps with. (We could be smart at this point and store this number.) Then we pick the one with the least number of overlaps greater than 0, so that we use up the ones with the fewest possible matches first.
Then we find a specific combination for the chosen UniqueCombination and add it to the final string. Remove this UniqueCombination from the list, then find overlaps for the current string; rinse and repeat. (We could be smart and, on subsequent runs, while searching for overlaps remove any of the unreduced combinations that are contained in the final string.)
Well, that's my plan; I will work on the code this weekend. Granted, this does not guarantee that the final 4 characters will be 4 of the same letter (it might actually be trying to avoid that, but I will look into that as well).
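
As a rough illustration of the two building blocks this plan relies on, here is a small Python sketch. The helper names expansions, overlaps_by and candidates are invented for this example; they only enumerate the possible combinations of one reduced combination and test the three-character overlap, and do not implement the selection strategy itself.

```python
from itertools import product


def expansions(symbols, k=4):
    """All length-k strings that use exactly the given symbols,
    e.g. AB -> AAAB, AABA, AABB, ..."""
    target = set(symbols)
    return ["".join(p)
            for p in product(sorted(target), repeat=k)
            if set(p) == target]


def overlaps_by(current, word, n=3):
    """True if the last n characters of current equal the first n of word."""
    return current[-n:] == word[:n]


def candidates(current, symbols, k=4):
    """Expansions of one reduced combination that can follow current."""
    return [w for w in expansions(symbols, k) if overlaps_by(current, w, k - 1)]
```

For instance, expansions("AB") has 14 entries, and candidates("AAAA", "AB") leaves only AAAB, since that is the only one of them beginning with AAA.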

If there is a non-exponential solution at all, it may need to be formulated in terms of a recursive "growth" from a problem of a smaller size, i.e. to construct B(N,k) from B(N-1,k-1) or from B(N-1,k) or from B(N,k-1).
Systematic construction for B(5,2), one step at a time :-) It's bound to get more complex later. [card stands for cardinality; {AB} has card=2; I'll also call them 2-s, 3-s etc.] Note: 2-s and 3-s will be k-1 and k later (I hope).
Initial. Start with k-1 result and inject symbols for singletons
(unique expansion empty intersection):
ABCDE -> AABBCCDDEE
mark used card=2 sets: AB,BC,CD,DE
Rewriting. Form card=3 sets to inject symbols into marked card=2.
1st feasible lexicographic expansion fires (may have to backtrack for k>2)
it's OK to use already marked 2-s since they'll all get replaced
but may have to do a verification pass for higher k
AB->ACB, BC->BDC, CD->CED, DE->DAE ==> AACBBDCCEDDAEE
mark/verify used 2s
normally keep marking/unmarking during the construction but also keep the old
mark list
marking/unmarking can get expensive if there's backtracking in #3
Unused: AB, BE
For higher k may need several recursive rewriting passes
possibly partitioning new sets into classes
Finalize: unused 2-s should overlap around the edge (that's why it's cyclic)
ABE - B can go to the beginning or end: AACBBDCCEDDAEEB
Note: a step from B(N-1,k) to B(N,k) may need injection of pseudo-singletons, like doubling or tripling A
B(5,2) -> B(5,3) -> B(5,4)
Initial. same: - ABCDE -> AAACBBBDCCCEDDDAEEEB
no use of marking 3-sets since they are all going to be changed
Rewriting.
choose systematic insertion positions
AAA_CBBB_DCCC_EDDD_AEEE_B
mark all 2-s released by this: AC,AD,BD,BE,CE
use marked 2-s to decide inserted symbols - notice total regularity:
AxCB D -> ADCB
BxDC E -> BEDC
CxED A -> CAED
DxAE B -> DBAE
ExBA C -> ECBA
Verify that 3-s are all used (marked inserted symbols just for fun)
AAA[D]CBBB[E]DCCC[A]EDDD[B]AEEE[C]B
Note: The systematic choice of insertion points deterministically dictated the insertions (only AD can fit 1st; AC would create a duplicate 2-set (AAC, ACC)).
Note: It's not going to be so nice for B(6,2) and B(6,3), since the number of 2-s will exceed 2x the number of 1-s. This is important since 2-s sit naturally on the sides of 1-s, like CBBBE, and the issue is how to place them when you run out of 1-s.
B(5,3) is so symmetrical that just repeating #1 produces B(5,4):
AAAADCBBBBEDCCCCAEDDDDBAEEEECB
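
The three strings above can be checked mechanically. The following Python sketch assumes the strings are read cyclically and that the "Initial" step amounts to lengthening every longest run of a repeated symbol by one; the names grow_runs, cyclic_window_sets and is_perfect_cycle are hypothetical.

```python
from itertools import combinations, groupby


def grow_runs(s):
    """Lengthen each longest run of a repeated symbol by one symbol
    (assumes no run wraps around the end of the string)."""
    runs = ["".join(g) for _, g in groupby(s)]
    longest = max(len(r) for r in runs)
    return "".join(r + r[0] if len(r) == longest else r for r in runs)


def cyclic_window_sets(s, k):
    """Symbol sets of the len(s) cyclic windows of length k."""
    ext = s + s[:k - 1]
    return [frozenset(ext[i:i + k]) for i in range(len(s))]


def is_perfect_cycle(s, alphabet, k):
    """True if the cyclic windows hit every symbol set of size <= k
    exactly once."""
    seen = cyclic_window_sets(s, k)
    required = {frozenset(c)
                for r in range(1, k + 1)
                for c in combinations(alphabet, r)}
    return len(seen) == len(required) and set(seen) == required
```

With these helpers, is_perfect_cycle holds for AACBBDCCEDDAEEB with k=2, for AAADCBBBEDCCCAEDDDBAEEECB with k=3 and for AAAADCBBBBEDCCCCAEDDDDBAEEEECB with k=4, and grow_runs applied to the k=3 string gives exactly the k=4 string.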
