Clusters of words with Hamming distance of 1 - algorithm

You are given a set of words, e.g.
{ ded, aaa, dec, aab, cab, def }
You need to add each word into a group. A word can be added to a group if:
It is the only word in the group
There is at least one other word already in the group that is at most edit distance 1 away from the given word.
Your function must return the minimum possible number of such groups that can be formed by using all the given words.
For example, for the given input set, the groups would look like this:
{ aaa, aab, cab }, { ded, def, dec }
Explanation: distance(aaa, aab) is 1, so they belong in the same group. distance(aab, cab) is also 1, so they belong in the same group too. No word in the second group is within edit distance 1 of any word in the first group, but each is within edit distance 1 of at least one other word in its own group.
If we were given two more words in addition to the ones in the example, let's say cad, ced, then the answer would change to 1, because now distance(cab, cad) is 1, hence cad is in group 1 and distance(cad, ced) is 1, so ced is in group 1. Also, distance(ded, ced) is 1, so the second group will be "connected" with the first group, hence we will be left with only 1 group.
We're only interested in the number of groups, not the groups themselves.
Constraints: all words will have the same length, but that length is not fixed and could be large.
I could only come up with O(mn^2), where m is the length of any word and n is the number of words. I did this using a graph approach (each word is a node, and words at edit distance 1 are neighbouring nodes).
Expected solution is O(mn).

Found a solution which is an extension of the accepted solution here:
Efficiently build a graph of words with given Hamming distance
Basically, the idea is to store the strings in a set where lookup and delete are O(1) on average. Putting them in a set means duplicate strings (edit distance 0) collapse into a single entry, but we don't care about those anyway, as equal strings always end up in the same group.
1. Create an empty list N of "start nodes".
2. Add the next item from the set S to the list.
3. Remove this string from the set S and call step 4 for this string.
4. Generate all strings with Hamming distance 1 from the string passed in as parameter. For each such generated string, if it exists in the set, remove it from the set and call step 4 for this string.
5. While the set is not empty, repeat from step 2.
6. Return the size of the "start nodes" list.
Explanation of why this would work:
We traverse each node only once and remove it from the set. After we remove the string from the set, we also recursively remove any item in the set that was "adjacent" to it. But only the first node in the recursion is added to the start nodes list.
In our example, ded would get added to the node list and dec, def would get removed. Then aaa would get added to the node list and aab would be removed. While removing aab, recursively, cab would also be removed. The returned answer would be 2.
Time complexity:
O(mnC), where C is the size of the charset, m is the length of each string, and n is the number of strings.
For each string we try C substitutions at each of its m positions, and this is done exactly once per string in the set.
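A minimal Python sketch of the algorithm above (the names are mine, and a lowercase charset is assumed); it replaces the recursive step 4 with an explicit stack:

import string

def count_groups(words, charset=string.ascii_lowercase):
    remaining = set(words)   # deduplicates; lookup/delete are O(1) on average
    groups = 0
    while remaining:
        # Steps 2-3: pick a "start node" and flood outwards from it.
        stack = [remaining.pop()]
        groups += 1
        while stack:
            word = stack.pop()
            # Step 4: generate every string at Hamming distance 1.
            for i, c in enumerate(word):
                for r in charset:
                    if r != c:
                        candidate = word[:i] + r + word[i+1:]
                        if candidate in remaining:
                            remaining.remove(candidate)
                            stack.append(candidate)
    return groups

print(count_groups(["ded", "aaa", "dec", "aab", "cab", "def"]))  # -> 2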

Related

Best mapping between 2 sequences

I have two sequences of items:
S1 = [ A B C D E F ]
S2 = [ 1 2 3 4 5 6 7 8 ]
And I can determine "similarity" for each pair of items (s1, s2) as a number (for example on scale 0 to 10).
I want to find a mapping between S1/S2 items, such that ordering of each sequence is preserved and sum of "similarity" values between mapped items is maximum. It is not required that all S1/S2 items are part of mapping.
Example:
[ A B C D E F ]
[ 1 2 3 4 5 6 7 8 ]
In example above, mapping 'A on 3', 'D on 4' and 'F on 6' gives overall maximum "similarity".
Are there any existing problems (/algorithms) this could be turned into?
Looks like the Smith–Waterman algorithm, which is traditionally used for determining similar regions between two strings of nucleic acid sequences or protein sequences, should be perfect:
Smith–Waterman algorithm aligns two sequences by matches/mismatches (also known as substitutions), insertions, and deletions. Both insertions and deletions are the operations that introduce gaps, which are represented by dashes. The Smith–Waterman algorithm has several steps:
Determine the substitution matrix and the gap penalty scheme. A substitution matrix assigns each pair of items (s1, s2) a score for match or mismatch. Usually matches get positive scores, whereas mismatches get relatively lower scores. A gap penalty function determines the score cost for opening or extending gaps. It is suggested that users choose the appropriate scoring system based on the goals. In addition, it is also a good practice to try different combinations of substitution matrices and gap penalties.
Initialize the scoring matrix. The dimensions of the scoring matrix are 1+length of each sequence respectively. All the elements of the first row and the first column are set to 0. The extra first row and first column make it possible to align one sequence to another at any position, and setting them to 0 makes the terminal gap free from penalty.
Scoring. Score each element from left to right, top to bottom in the matrix, considering the outcomes of substitutions (diagonal scores) or adding gaps (horizontal and vertical scores). If none of the scores are positive, this element gets a 0. Otherwise the highest score is used and the source of that score is recorded.
Traceback. Starting at the element with the highest score, traceback based on the source of each score recursively, until 0 is encountered. The segments that have the highest similarity score based on the given scoring system are generated in this process. To obtain the second-best local alignment, apply the traceback process starting at the second highest score outside the trace of the best alignment.
Just choose the substitution matrix to match yours:
    "And I can determine "similarity" for each pair of items (s1, s2) as a number (for example on scale 0 to 10)."
and set the gap and no-match penalty to zero:
    "I want to find a mapping between S1/S2 items, such that ordering of each sequence is preserved and sum of "similarity" values between mapped items is maximum. It is not required that all S1/S2 items are part of mapping."
More information can be found at: https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm#Scoring_matrix
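For illustration only, here is a sketch of the scoring step (traceback omitted), with the gap penalty set to zero as suggested above; substitution stands in for your pairwise similarity function, and the function name is mine:

def smith_waterman_score(S1, S2, substitution, gap_penalty=0):
    # H[i][j] holds the best local-alignment score ending at S1[i-1], S2[j-1].
    n, m = len(S1), len(S2)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(
                0,                                             # restart the alignment
                H[i-1][j-1] + substitution(S1[i-1], S2[j-1]),  # match/mismatch
                H[i-1][j] - gap_penalty,                       # gap in S2
                H[i][j-1] - gap_penalty,                       # gap in S1
            )
            best = max(best, H[i][j])
    return best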
The problem you described looks like Longest Common Subsequence Problem variation.
Use this recurrence relation instead of the original:
ans[i][j] = max(
    ans[i-1][j],
    ans[i][j-1],
    ans[i-1][j-1] + similarity(S1[i], S2[j])
)
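A bottom-up sketch of this recurrence, assuming similarity is a user-supplied non-negative function (the indices are shifted by one for the DP border):

def best_mapping_score(S1, S2, similarity):
    n, m = len(S1), len(S2)
    # ans[i][j]: best total similarity using the first i items of S1
    # and the first j items of S2.
    ans = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            ans[i][j] = max(
                ans[i-1][j],                                   # leave S1[i-1] unmapped
                ans[i][j-1],                                   # leave S2[j-1] unmapped
                ans[i-1][j-1] + similarity(S1[i-1], S2[j-1]),  # map them to each other
            )
    return ans[n][m]

Because unmapped items cost nothing here, this directly maximises the mapped similarity sum while preserving the order of both sequences.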

Efficiently find a given subsequence in a string, maximizing the number of contiguous characters

Long problem description
Fuzzy string matcher utilities like fzf or CtrlP filter a list of strings for ones which have a given search string as a subsequence.
As an example, consider that a user wants to search for a specific photo in a list of files. To find the file
/home/user/photos/2016/pyongyang_photo1.png
it suffices to type ph2016png, because this search string is a subsequence of this file name. (Mind that this is not LCS. The whole search string must be a subsequence of the file name.)
It is trivial to check whether a given search string is a subsequence of another string, but I wonder how to efficiently obtain the best match: in the above example, there are multiple possible matches (matched characters bracketed). One is
/home/user/[ph]otos/[2016]/[p]yo[n][g]yang_photo1.png
but the one which the user probably had in mind is
/home/user/[ph]otos/[2016]/pyongyang_photo1.[png]
To formalize this, I'd define the "best" match as the one that is composed of the smallest number of substrings. This number is 5 for the first example match and 3 for the second.
I came up with this because it would be interesting to obtain the best match to assign a score to each result, for sorting. I'm not interested in approximate solutions though, my interest in this problem is primarily of academic nature.
tl;dr problem description
Given strings s and t, find among the subsequences of t that are equal to s one that maximizes the number of pairs of elements that are contiguous in t.
What I've tried so far
For discussion, let's call the search query s and the string to test t. The problem's solution is denoted fuzzy(s, t). I'll utilize Python's string slicing notation. The easiest approach is as follows:
Since any solution must use all characters from s in order, an algorithm for solving this problem can start by searching the first occurrence of s[0] in t (with index i) and then use the better of the two solutions
t[:i+1] + fuzzy(s[1:], t[i+1:])  # Use the character
t[:i]   + fuzzy(s, t[i+1:])      # Skip it and use the next occurrence
                                 # of s[0] in t instead
This is obviously not the best solution to this problem. Au contraire, it's the obvious brute-force one. (I've played around with simultaneously searching for the last occurrence of s[-1] and using this information in an earlier version of this question, but it turned out that this approach does not work.)
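For concreteness, here is a runnable take on this use-or-skip recursion that returns the segment count instead of the spliced string. The memoisation is my addition, not part of the brute force above; with it the recursion visits O(|s|·|t|) states:

from functools import lru_cache

def min_segments(s, t):
    # Minimum number of contiguous runs of t whose concatenation equals s,
    # or None if s is not a subsequence of t.
    inf = float('inf')

    @lru_cache(maxsize=None)
    def go(i, j, in_run):
        # i: characters of s already matched; j: current position in t;
        # in_run: s[i-1] was matched at t[j-1], so extending here is free.
        if i == len(s):
            return 0
        if j == len(t):
            return inf
        best = go(i, j + 1, False)                  # skip t[j]
        if t[j] == s[i]:                            # use t[j]
            best = min(best, go(i + 1, j + 1, True) + (0 if in_run else 1))
        return best

    res = go(0, 0, False)
    return None if res == inf else res

print(min_segments("ph2016png", "/home/user/photos/2016/pyongyang_photo1.png"))  # -> 3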
→ My question is: What is the most efficient solution to this problem?
I would suggest creating a search tree, where each node represents a character position in the haystack that matches one of the needle characters.
The top nodes are siblings and represent the occurrences of the first needle character in the haystack.
The children of a parent node are those nodes that represent the occurrences of the next needle character in the haystack, but only those that are positioned after the position represented by that parent node.
This logically means that some children are shared by several parents, and so this structure is not really a tree, but a directed acyclic graph. Some sibling parents might even have exactly the same children. Other parents might not have children at all: they are a dead-end, unless they are at the bottom of the graph where the leaves represent positions of the last needle character.
Once this graph is set up, a depth-first search in it can easily derive the number of segments that are still needed from a certain node onwards, and then minimise that among alternatives.
I have added some comments in the Python code below. This code might still be improved, but it seems already quite efficient compared to your solution.
def fuzzy_trincot(haystack, needle, returnSegments = False):
    inf = float('inf')

    def getSolutionAt(node, depth, optimalCount = 2):
        if not depth: # reached end of needle
            node['count'] = 0
            return
        minCount = inf # infinity ensures also that incomplete branches are pruned
        child = node['child']
        i = node['i']+1
        # Optimisation: optimalCount gives the theoretical minimum number of
        #   segments needed for any solution. If we find such case,
        #   there is no need to continue the search.
        while child and minCount > optimalCount:
            # If this node was already evaluated, don't lose time recursing again.
            # It works without this condition, but that is less optimal.
            if 'count' not in child:
                getSolutionAt(child, depth-1, 1)
            count = child['count'] + (i < child['i'])
            if count < minCount:
                minCount = count
            child = child['sibling']
        # Store the results we found in this node, so if ever we come here again,
        # we don't need to recurse the same sub-tree again.
        node['count'] = minCount

    # Preprocessing: build tree
    # A node represents a needle character occurrence in the haystack.
    # A node can have these keys:
    #   i: index in haystack where needle character occurs
    #   child: node that represents a match, at the right of this index,
    #          for the next needle character
    #   sibling: node that represents the next match for this needle character
    #   count: the least number of additional segments needed for matching the
    #          remaining needle characters (only; so not counting the segments
    #          already taken at the left)
    root = { 'i': -2, 'child': None, 'sibling': None }
    # Take a short-cut for when needle is a substring of haystack
    if haystack.find(needle) != -1:
        root['count'] = 1
    else:
        parent = root
        leftMostIndex = 0
        rightMostIndex = len(haystack)-len(needle)
        for j, c in enumerate(needle):
            sibling = None
            child = None
            # Use of leftMostIndex is an optimisation; it works without this argument
            i = haystack.find(c, leftMostIndex)
            # Use of rightMostIndex is an optimisation; it works without this test
            while 0 <= i <= rightMostIndex:
                node = { 'i': i, 'child': None, 'sibling': None }
                while parent and parent['i'] < i:
                    parent['child'] = node
                    parent = parent['sibling']
                if sibling: # not first child
                    sibling['sibling'] = node
                else: # first child
                    child = node
                    leftMostIndex = i+1
                sibling = node
                i = haystack.find(c, i+1)
            if not child: return False
            parent = child
            rightMostIndex += 1
        getSolutionAt(root, len(needle))

    count = root['count']
    if not returnSegments:
        return count
    # Use the `returnSegments` option when you need the character content
    # of the segments instead of only the count. It runs in linear time.
    if count == 1: # Deal with short-cut case
        return [needle]
    segments = []
    node = root['child']
    i = -2
    start = 0
    for end, c in enumerate(needle):
        i += 1
        # Find best child among siblings
        while (node['count'] > count - (i < node['i'])):
            node = node['sibling']
        if count > node['count']:
            count = node['count']
            if end:
                segments.append(needle[start:end])
                start = end
        i = node['i']
        node = node['child']
    segments.append(needle[start:])
    return segments
The function can be called with an optional third argument:
haystack = "/home/user/photos/2016/pyongyang_photo1.png"
needle = "ph2016png"
print (fuzzy_trincot(haystack, needle))
print (fuzzy_trincot(haystack, needle, True))
Outputs:
3
['ph', '2016', 'png']
As the function is optimised to return only the count, the second call will add a bit to the execution time.
This is probably not the most efficient solution, but it is an efficient and easy-to-implement one. To illustrate, I'll borrow your example. Let /home/user/photos/2016/pyongyang_photo1.png be the filename, and ph2016png the input.
The first step (precalculation) is optional but might help speed up the next step (setup) quite a bit, especially if you are applying the algorithm to many filenames.
Precalculation
Create a table counting the occurrences of each character in the input. Since you are probably only dealing with ASCII characters, 256 entries are sufficient (maybe 128, or even less depending on the character set).
"ph2016png"
['p'] : 2
['h'] : 1
['2'] : 1
['0'] : 1
['b'] : 0
...
Setup
Slice the filename into substrings by throwing away characters that are not present in the input. At the same time, check that each character of the input is present the correct number of times in the filename (if the precalculation is done). Finally, check that the characters of the input appear in order in the substrings list: if you take the substrings list as a single string, then for any given character of that string, every character found before it in the input must also be found before it in that string. This can be done while creating the substrings. (A sketch of the slicing step follows the example below.)
"/home/user/photos/2016/pyongyang_photo1.png"
"h", "ph", "2016", "p", "ng", "ng", "ph", "1", "png"
'p' must come before "h", so throw this one away
"ph", "2016", "p", "ng", "ng", "ph", "1", "png"
Core
Match the longest substring with the input and keep track of the longest match. This match can keep the beginning of the substring (for instance, matching ababa (substring) with babaa (input) would result in aba, not baba) because it's easier to implement, although it doesn't have to. If you don't get a complete match, use the longest one to slice up the substring once more, and retry with the next longest substring.
Since there is no instance of incomplete match with your example,
let's take something else, made to illustrate the point.
Let's take "babaaababcb" as the filename, and "ababb" as input.
Substrings : "abaaabab", "b"
Longest substring : "abaaabab"
If you keep the beginning of matches
Longest match : "aba"
Slice "abaaabab" into "aba", "aabab"
-> "aba", "aabab", "b"
Retry with "aabab"
-> "aba", "a", "abab", "b"
Retry with "abab" (complete match)
Otherwise (harder to implement, not necessarily better performing, as shown in this example)
Longest match : "abab"
Slice "abaaabab" into "abaa", "abab"
-> "abaa", "abab", "b"
Retry with "abaa"
-> "aba", "a", "abab", "b"
Retry with "abab" (complete match)
If you do get a complete match, continue by slicing the input in two as well as the list of substrings, and repeat matching the longest substring.
With "ph2016png" as input
Longest substring : "2016"
Complete match
Match substrings "h", "ph" with input "ph"
Match substrings "p", "ng", "ng", "ph", "1", "png" with input "png"
You are guaranteed to find the sequence of substrings that contains the fewest substrings because you try the longest ones first. That will typically perform well if the input doesn't contain many short substrings from the filename.

Using dynamic programming to count the number of permutations

I have a string A of length N. I have to find the number of strings B of length N that have M (M <= N) characters in common with A but satisfy the condition A[i] != B[i] for all i. Assume the characters that have to be the same, and the different ones, are also given. What will be the recurrence relation to find the number of such strings?
Example
123 is string A and M=1, and the character which is same is '1', and the new characters are '4' and '5'. The valid permutations are 451, 415, 514, 541. So it is a sort of derangement of 1 item of the given 3.
I am able to find the answer using inclusion-exclusion principle but wanted to know whether there is a recurrence relation to do the same?
Let us call g(M,N) the number of permutations satisfying your condition.
If M is 0, then the answer is N!
Otherwise, M>0 and consider placing the first character that is in string A.
There are M important positions corresponding to the places in the string where we are not allowed to place a certain character.
If we put our first character in one of these (M-1) important places (we cannot put it in position 1 due to the restriction), then we must take the place of one of the restricted characters, and so the number of restrictions reduces by 2 (1 for the character we place, and 1 for the character whose position we occupied).
If we put our first character in one of the N-M unimportant places, then we have only reduced the number of restrictions by 1.
Therefore the recurrence relation is:
g(M,N) = (M-1)·g(M-2,N-1) + (N-M)·g(M-1,N-1)   if M > 0
g(M,N) = N!                                    if M = 0
For your example, we wish to calculate g(1,3) (1 character matches, total of 3 characters placed)
g(1,3) = (3-1)·g(0,2)    (the first term vanishes since M-1 = 0)
       = (3-1)·2!
       = 4
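A direct memoised transcription of this recurrence in Python (the M < 0 guard covers the vanishing first term):

from functools import lru_cache
from math import factorial

@lru_cache(maxsize=None)
def g(M, N):
    # g(M, N): placements of N characters when M of them are each forbidden
    # from exactly one position.
    if M < 0:
        return 0
    if M == 0:
        return factorial(N)
    return (M - 1) * g(M - 2, N - 1) + (N - M) * g(M - 1, N - 1)

print(g(1, 3))  # -> 4: the permutations 415, 451, 514, 541 from the example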

On counting pairs of words that differ by one letter

Let us consider n words, each of length k. Those words consist of letters over an alphabet (whose cardinality is n) with defined order. The task is to derive an O(nk) algorithm to count the number of pairs of words that differ by one position (no matter which one exactly, as long as it's only a single position).
For instance, in the following set of words (n = 5, k = 4):
abcd, abdd, adcb, adcd, aecd
there are 5 such pairs: (abcd, abdd), (abcd, adcd), (abcd, aecd), (adcb, adcd), (adcd, aecd).
So far I've managed to find an algorithm that solves a slightly easier problem: counting the number of pairs of words that differ at one GIVEN position (the i-th). To do this, I swap the letter at the i-th position with the last letter within each word, perform a radix sort (ignoring the last position in each word - formerly the i-th position), linearly detect words whose letters at positions 1 to k-1 are all the same, and finally count the number of occurrences of each letter at the last (originally i-th) position within each set of duplicates and compute the desired pairs (the last part is simple).
However, the algorithm above doesn't seem to be applicable to the main problem (under the O(nk) constraint) - at least not without some modifications. Any idea how to solve this?
Assuming n and k aren't too large, so that this fits into memory:
Have a set with the first letter removed, one with the second letter removed, one with the third letter removed, etc. Technically these have to be maps from strings to counts.
Run through the list and add the current element to each of the maps (removing the applicable letter first); if the key already exists, add its count to totalPairs and increment the count by one.
Then totalPairs is the desired value.
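A short Python sketch of this approach (my transcription), assuming the n words are distinct and all of length k:

from collections import defaultdict

def count_pairs(words):
    total = 0
    k = len(words[0])
    for i in range(k):
        seen = defaultdict(int)    # word with i-th letter removed -> count
        for w in words:
            key = w[:i] + w[i+1:]
            total += seen[key]     # pairs with every earlier word sharing this key
            seen[key] += 1
    return total

print(count_pairs(["abcd", "abdd", "adcb", "adcd", "aecd"]))  # -> 5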
EDIT:
Complexity:
This should be O(n·k·log n).
You can use a map that uses hashing (e.g. HashMap in Java) instead of a sorted map, for a theoretical complexity of O(nk) (though I've generally found a hash map to be slower than a sorted tree-based map).
Improvement:
A small alteration on this is to have a map of the first 2 letters removed to 2 maps, one with first letter removed and one with second letter removed, and have the same for the 3rd and 4th letters, and so on.
Then put these into maps with 4 letters removed and those into maps with 8 letters removed and so on, up to half the letters removed.
The complexity of this is:
You do 2 lookups into 2 sorted sets containing maximum k elements (for each half).
For each of these you do 2 lookups into 2 sorted sets again (for each quarter).
So the number of lookups is 2 + 4 + 8 + ... + k/2 + k, which I believe is O(k).
I may be wrong here, but, worst case, the number of elements in any given map is n, but this will cause all other maps to only have 1 element, so still O(log n), but for each n (not each n·k).
So I think that's O(n·(log n + k)).
EDIT 2:
Example of my maps (without the improvement):
(x-1) means x maps to 1.
Let's say we have abcd, abdd, adcb, adcd, aecd.
The first map would be (bcd-1), (bdd-1), (dcb-1), (dcd-1), (ecd-1).
The second map would be (acd-3), (add-1), (acb-1) (for 4th and 5th, value already existed, so increment).
The third map : (abd-2), (adb-1), (add-1), (aed-1) (2nd already existed).
The fourth map : (abc-1), (abd-1), (adc-2), (aec-1) (4th already existed).
totalPairs = 0
For second map - acd, for the 4th, we add 1, for the 5th we add 2.
totalPairs = 3
For third map - abd, for the 2nd, we add 1.
totalPairs = 4
For fourth map - adc, for the 4th, we add 1.
totalPairs = 5.
Partial example of improved maps:
Same input as above.
Map of first 2 letters removed to maps of 1st and 2nd letter removed:
(cd-{ {(bcd-1)}, {(acd-1)} }),
(dd-{ {(bdd-1)}, {(add-1)} }),
(cb-{ {(dcb-1)}, {(acb-1)} }),
(cd-{ {(dcd-1)}, {(acd-1)} }),
(cd-{ {(ecd-1)}, {(acd-1)} })
The above is a map consisting of an element cd mapped to 2 maps, one containing one element (bcd-1) and the other containing (acd-1).
But for the 4th and 5th cd already existed, so, rather than generating the above, it will be added to that map instead, as follows:
(cd-{ {(bcd-1, dcd-1, ecd-1)}, {(acd-3)} }),
(dd-{ {(bdd-1)}, {(add-1)} }),
(cb-{ {(dcb-1)}, {(acb-1)} })
You can put each word into an array. Pop out elements from that array one by one. Then compare the resulting arrays. Finally, add back the popped element to get back the original arrays.
The popped elements from both arrays must not be the same.
Count the number of cases where this occurs and finally divide it by 2 to get the exact solution.
Think about how you would enumerate the language - you would likely use a recursive algorithm. Recursive algorithms map onto tree structures. If you construct such a tree, each divergence represents a difference of one letter, and each leaf will represent a word in the language.
It's been two months since I submitted the problem here. I have discussed it with my peers in the meantime and would like to share the outcome.
The main idea is similar to the one presented by Dukeling. For each word A and for each i-th position within that word, we are going to consider a tuple: (prefix, suffix, letter at the i-th position), i.e. (A[1..i-1], A[i+1..k], A[i]). If i is either 1 or k, then the applicable substring is considered empty (these are simple boundary cases).
Having these tuples in hand, we should be able to apply the reasoning I provided in my first post to count the number of pairs of different words. All we have to do is sort the tuples by the prefix and suffix values (separately for each i) - then, words with letters equal at all but the i-th position will be adjacent to each other.
Here though is the technical part I am lacking. So as to make the sorting procedure (RadixSort appears to be the way to go) meet the O(nk) constraint, we might want to assign labels to our prefixes and suffixes (we only need n labels for each i). I am not quite sure how to go about the labelling stuff. (Sure, we might do some hashing instead, but I am pretty confident the former solution is viable).
While this is not an entirely complete solution, I believe it casts some light on the possible way to tackle this problem and that is why I posted it here. If anyone comes up with an idea of how to do the labelling part, I will implement it in this post.
How's the following Python solution?
import string

def one_apart(words, word):
    res = set()
    for i, _ in enumerate(word):
        for c in string.ascii_lowercase:
            w = word[:i] + c + word[i+1:]
            if w != word and w in words:
                res.add(w)
    return res

words = {"abcd", "abdd", "adcb", "adcd", "aecd"}  # the question's example set
pairs = set()
for w in words:
    for other in one_apart(words, w):
        pairs.add(frozenset((w, other)))

for pair in pairs:
    print(pair)
Output:
frozenset({'abcd', 'adcd'})
frozenset({'aecd', 'adcd'})
frozenset({'adcb', 'adcd'})
frozenset({'abcd', 'aecd'})
frozenset({'abcd', 'abdd'})

string of integers puzzle

I apologize for not having the math background to put this question in a more formal way.
I'm looking to create a string of 796 letters (or integers) with certain properties.
Basically, the string is a variation on a De Bruijn sequence B(12,4), except order and repetition within each n-length subsequence are disregarded.
i.e. ABBB BABA BBBA are each equivalent to {AB}.
In other words, the main property of the string involves looking at consecutive groups of 4 letters within the larger string
(i.e. the 1st through 4th letters, the 2nd through 5th letters, the 3rd through 6th letters, etc)
And then producing the set of letters that comprise each group (repetitions and order disregarded)
For example, in the string of 9 letters:
A B B A C E B C D
the first 4-letter groups is: ABBA, which is comprised of the set {AB}
the second group is: BBAC, which is comprised of the set {ABC}
the third group is: BACE, which is comprised of the set {ABCE}
etc.
The goal is for every combination of 1-4 letters from a set of N letters to be represented by the 1-4-letter resultant sets of the 4-element groups once and only once in the original string.
For example, if there is a set of 5 letters {A, B, C, D, E} being used
Then the possible 1-4 letter combinations are:
A, B, C, D, E,
AB, AC, AD, AE, BC, BD, BE, CD, CE, DE,
ABC, ABD, ABE, ACD, ACE, ADE, BCD, BCE, BDE, CDE,
ABCD, ABCE, ABDE, ACDE, BCDE
Here is a working example that uses a set of 5 letters {A, B, C, D, E}.
D D D D E C B B B B A E C C C C D A E E E E B D A A A A C B D D B
The 1st through 4th elements form the set: D
The 2nd through 5th elements form the set: DE
The 3rd through 6th elements form the set: CDE
The 4th through 7th elements form the set: BCDE
The 5th through 8th elements form the set: BCE
The 6th through 9th elements form the set: BC
The 7th through 10th elements form the set: B
etc.
I am hoping to find a working example of a string that uses 12 different letters (a total of 793 4-letter groups within a 796-letter string), starting (and if possible ending) with 4 of the same letter.
Here is a working solution for 7 letters:
AAAABCDBEAAACDECFAAADBFBACEAGAADEFBAGACDFBGCCCCDGEAFAGCBEEECGFFBFEGGGGFDEEEEFCBBBBGDCFFFFDAGBEGDDDDBE
Beware that in order to attempt exhaustive search (the VB answer is trying a naive version of that) you'll first have to solve the problem of generating all possible expansions while maintaining lexicographical order. Just ABC expands to all perms of AABC, plus all perms of ABBC, plus all perms of ABCC, which is 3*4! instead of just AABC. If you just concatenate AABC and AABD it would cover just 4 out of 4! perms of AABC, and even that by accident. Just this expansion will bring you exponential complexity - end of game. Plus you'll need to maintain the association between all expansions and the set (the set becomes a label).
Your best bet is to use one of the known efficient De Bruijn constructors and try to see if you can put your set-equivalence in there. Check out
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.14.674&rep=rep1&type=pdf
and
http://www.dim.uchile.cl/~emoreno/publicaciones/FINALES/copyrighted/IPL05-De_Bruijn_sequences_and_De_Bruijn_graphs_for_a_general_language.pdf
for a start.
If you know graphs, another viable option is to start with De Bruijn graph and formulate your set-equivalence as a graph rewriting. 2nd paper does De Bruijn graph partitioning.
BTW, try the VB answer just for A, B, AB (at least the expansion is small) - it will make AABBAB and construct ABBA or ABBAB (or throw, in a decent language), both of which are wrong. You can even prove that it will always miss with 1st lexical expansions (that's what AAB, AAAB etc. are) just by examining the first 2 passes (it will always miss the 2nd A for N×A because (N-1)×A+B is in the string - the 1st expansion of {AB}).
Oh, and if we could establish how many of each letter an optimal solution should have (don't look at B(5,2), it's too easy and regular :-) a random search would be feasible - you generate candidates with provable traits (like AAAA, BBBB ... are present and not touching, and it has n1 A-s, n2 B-s ...) and random arrangement, and then test whether they are solutions (checking is much faster than exhaustive search in this case).
Cool problem. Just a draft/pseudo-algorithm:
dim STR-A as string = getall(ABCDEFGHIJKL)
//custom function to generate concat list of all 793 4-char combos.
//should be listed side-by-side to form 3172 character-long string.
//different ordering may ultimately produce different results.
//brute-forcing all orders of combos is too much work (793! is a big #).
//need to determine how to find optimal ordering, for this particular
//approach below.
dim STR-B as string = "" // to hold the string you're searching for
dim STR-C as string = "" // to hold the sub-string you are searching in
dim STR-A-NEW as string = "" //variable to hold your new string
dim MATCH as boolean = false //variable to hold matching status

while len(STR-A) > 0
    //check each character in STR-A, which will be shortened by 1 char on
    //each pass.
    MATCH = false
    STR-B = left(STR-A, 4)
    STR-B = reduce(STR-B)
    //reduce(str) is a custom re-usable function to sort & remove duplicates
    for i as integer = 1 to len(STR-A) - 1
        STR-C = substr(STR-A, i, 4)
        //gives you the 4-character sequence beginning at position i
        STR-C = reduce(STR-C)
        IF STR-B = STR-C Then
            MATCH = true
            exit for
            //as long as there is even one match, you can throw away the
            //first letter
        END IF
    next
    IF MATCH = false then
        //if you didn't find a match, then the first letter should be saved
        STR-A-NEW += LEFT(STR-B, 1)
    END IF
    MATCH = false //re-init MATCH
    STR-A = RIGHT(STR-A, LEN(STR-A) - 1) //re-init STR-A
wend
Anyway -- there could be problems with this, and you'd need to write another function to parse your result string (STR-A-NEW) to prove that it's a viable answer...
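On that note, here is a sketch of such a verification function for the property stated in the question (every combination of 1-4 letters must arise exactly once among the 4-letter windows); the question's 5-letter example passes it:

from collections import Counter
from itertools import combinations

def is_valid(s, letters):
    # Letter-set of every consecutive 4-letter window in s.
    windows = Counter(frozenset(s[i:i+4]) for i in range(len(s) - 3))
    # Every combination of 1 to 4 letters must be produced exactly once.
    required = {frozenset(c) for r in range(1, 5)
                for c in combinations(letters, r)}
    return set(windows) == required and all(n == 1 for n in windows.values())

print(is_valid("DDDDECBBBBAECCCCDAEEEEBDAAAACBDDB", "ABCDE"))  # -> True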
I've been thinking about this one and I'm sketching out a solution.
Let's call a string of four symbols a word and we'll write S(w) to denote the set of symbols in word w.
Each word abcd has "follow-on" words bcde where a,...,e are all symbols.
Let succ(w) be the set of follow-on words v for w such that S(w) != S(v). succ(w) is the set of successor words that can follow on from the first symbol in w if w is in a solution.
For each non-empty set of symbols s of cardinality at most four, let words(s) be the set of words w such that S(w) = s. Any solution must contain exactly one word in words(s) for each such set s.
Now we can do a reasonable search. The basic idea is this: say we are exploring a search path ending with word w. The follow-on word must be a non-excluded word in succ(w). A word v is excluded if the search path contains some word w such that v in words(S(w)).
You can be slightly more cunning: if we track the possible "predecessor" words to a set s (i.e., words w with a successor v such that v in words(s)) and reach a point where every predecessor of s is excluded, then we know we have reached a dead end, since we'll never be able to obtain s from any extension of the current search path.
Code to follow after the weekend, with a bit of luck...
Here is my proposal. I'll admit upfront this is a performance and memory hog.
This may be overkill, but have a class; we'll call it UniqueCombination. It will contain a unique 1-4 char reduced combination of the input set (i.e. A, AB, ABC, ...) and also a list of its possible unreduced combinations (AB -> {AABB, ABAB, BBAA, ...}). It will need a method that determines whether any of its possible combinations overlaps any possible combination of another UniqueCombination by three characters, plus an overload that takes a string as well.
Then we start with the string "AAAA" and find all of the UniqueCombinations that overlap this string. Then we find how many UniqueCombinations those possible matches overlap with (we could be smart at this point and store this number). Then we pick the one with the least number of overlaps greater than 0 - use up the ones with the fewest possible matches first.
Then we find a specific combination for the chosen UniqueCombination and add it to the final string. Remove this UniqueCombination from the list; then, as we find overlaps for the current string, rinse and repeat. (We could be smart and, on subsequent runs while searching for overlaps, remove any of the unreduced combinations that are already contained in the final string.)
Well, that's my plan; I will work on the code this weekend. Granted, this does not guarantee that the final 4 characters will be 4 of the same letter (it might actually be trying to avoid that, but I will look into that as well).
If there is a non-exponential solution at all, it may need to be formulated in terms of a recursive "growth" from a problem of a smaller size, i.e. to construct B(N,k) from B(N-1,k-1), or from B(N-1,k), or from B(N,k-1).
Systematic construction for B(5,2) - one step at a time :-) It's bound to get more complex later [card stands for cardinality; {AB} has card=2, and I'll also call them 2-s, 3-s etc.]. Note: 2-s and 3-s will be k-1 and k later (I hope).
Initial. Start with k-1 result and inject symbols for singletons
(unique expansion empty intersection):
ABCDE -> AABBCCDDEE
mark used card=2 sets: AB,BC,CD,DE
Rewriting. Form card=3 sets to inject symbols into marked card=2.
1st feasible lexicographic expansion fires (may have to backtrack for k>2)
it's OK to use already marked 2-s since they'll all get replaced
but may have to do a verification pass for higher k
AB->ACB, BC->BCD, CD->CED, DE->DAE ==> AACBBDCCEDDAEEB
mark/verify used 2s
normally keep marking/unmarking during the construction, but also keep the old
mark list
marking/unmarking can get expensive if there's backtracking in #3
Unused: AB, BE
For higher k may need several recursive rewriting passes
possibly partitioning new sets into classes
Finalize: unused 2-s should overlap around the edge (that's why it's cyclic)
ABE - B can go to the beginning or end: AACBBDCCEDDAEEB
Note: a step from B(N-1,k) to B(N,k) may need injection of pseudo-singletons, like doubling or tripling A
B(5,2) -> B(5,3) - B(5,4)
Initial. Same: ABCDE -> AAACBBBDCCCEDDDAEEEB
no use in marking 3-sets since they are all going to be changed
Rewriting.
choose systematic insertion positions
AAA_CBBB_DCCC_EDDD_AEEE_B
mark all 2-s released by this: AC,AD,BD,BE,CE
use marked 2-s to decide inserted symbols - notice the total regularity:
AxCB D -> ADCB
BxDC E -> BEDC
CxED A -> CAED
DxAE B -> DBAE
ExBA C -> ECBA
Verify that 3-s are all used (marked inserted symbols just for fun)
AAA[D]CBBB[E]DCCC[A]EDDD[B]AEEE[C]B
Note: systematic choice of insertion point deterministically dictated the insertions (only AD can fit first; AC would create a duplicate 2-set (AAC, ACC))
Note: it's not going to be so nice for B(6,2) and B(6,3) since the number of 2-s will exceed 2x the number of 1-s. This is important since 2-s sit naturally on the sides of 1-s, like CBBBE, and the issue is how to place them when you run out of 1-s.
B(5,3) is so symmetrical that just repeating #1 produces B(5,4):
AAAADCBBBBEDCCCCAEDDDDBAEEEECB
