Solve this matching with exponential running time in brute force manner - algorithm

I have a question about matching. It is stated as follows.
I have a sequence which consists of two characters, {l,h}.
The character l can be mapped to a number from {1,2,3}. (l stands for low)
The character h can be mapped to a number from {4,5,6}. (h stands for high)
For example, I have a sequence (I call it the original sequence) with length = 6. It is [h,l,h,l,h,l].
This sequence can be transformed to a detailed sequence by the above mapping rules. A detailed sequence can be [6,1,5,2,4,3]. And for a sequence with length = 6, there are 6^3 detailed sequences.
I obtain a difference sequence from a detailed sequence by computing the pairwise differences. For example, my detailed sequence is [6,1,5,2,4,3], then the corresponding difference sequence is [6-1,1-5,5-2,2-4,4-3] = [5,-4,3,-2,1]. Hence, the largest value of an entry of the difference sequence is 5 resulted from 6 minus 1 and the smallest value of it is -5 resulted from 1 minus 6.
Now, I have a database consists of m difference sequences with length = 5.
My query sequence is a original sequence with length = 6. I want to find that:
Among the m difference sequences, what are those the corresponding original sequence can be my query sequence. If they do not exist, the program will return null set. If they exist, the program will return the set consists of them.
For example, for a difference sequence [5,-4,3,-2,1], its corresponding original sequence can be [h,l,h,l,h,l]. Hence, ff my query sequence is [h,l,h,l,h,l], [5,-4,3,-2,1] will be in the returned set if [5,-4,3,-2,1] is in the database.
For my real problem, the query sequence has length = n. And the database consists of m difference sequences with length = n-1.
The brute force method can be as follows:
For the input original sequence, enumerates its 3^n detailed sequences and gets the 3^n difference sequences. For each of the difference sequences, check that does it exist in the database.
The brute force will take O(3^n). I know that this exponential running time is not good.
I want to have a faster algorithm. An approximation algorithm also looks good to me.
Thank you so much.

Each difference sequence can be created from at most four h-l sequences.
So pre-process your database to build an index from h-l sequences to the difference sequences you're interested in.
For example:
[1, -1, 0, 2, -1]
This must come from one of these sequences:
[3 2 3 3 1 2] -> l l l l l l
[4 3 4 4 2 3] -> h l h h l l
[5 4 5 5 3 4] -> h h h h l h
[6 5 6 6 4 5] -> h h h h h h
You can generate them all by starting the specific sequence from 1 to 6, and applying the differences. Sometimes the sequence will underflow or overflow the legal range of 1..6 and then you can discard that possibility.
Building this index is O(mn) time, and uses an additional O(mn) space. After you've built it, you can efficiently query a specific h-l sequence using the index.

Related

Substring search with max 1's in a binary sequence

Problem
The task is to find a substring from the given binary string with highest score. The substring should be at least of given min length.
score = number of 1s / substring length where score can range from 0 to 1.
Inputs:
1. min length of substring
2. binary sequence
Outputs:
1. index of first char of substring
2. index of last char of substring
Example 1:
input
-----
5
01010101111100
output
------
7
11
explanation
-----------
1. start with minimum window = 5
2. start_ind = 0, end_index = 4, score = 2/5 (0.4)
3. start_ind = 1, end_index = 5, score = 3/5 (0.6)
4. and so on...
5. start_ind = 7, end_index = 11, score = 5/5 (1) [max possible]
Example 2:
input
-----
5
10110011100
output
------
2
8
explanation
-----------
1. while calculating all scores for windows 5 to len(sequence)
2. max score occurs in the case: start_ind=2, end_ind=8, score=5/7 (0.7143) [max possible]
Example 3:
input
-----
4
00110011100
output
------
5
8
What I attempted
The only technique i could come up with was a brute force technique, with nested for loops
for window_size in (min to max)
for ind 0 to end
calculate score
save max score
Can someone suggest a better algorithm to solve this problem?
There's a few observations to make before we start talking about an algorithm- some of these observations already have been pointed out in the comments.
Maths
Take the minimum length to be M, the length of the entire string to be L, and a substring from the ith char to the jth char (inclusive-exclusive) to be S[i:j].
All optimal substrings will satisfy at least one of two conditions:
It is exactly M characters in length
It starts and ends with a 1 character
The reason for the latter being if it were longer than M characters and started/ended with a 0, we could just drop that 0 resulting in a higher ratio.
In the same spirit (again, for the 2nd case), there exists an optimal substring which is not preceded by a 1. Otherwise, if it were, we could include that 1, resulting in an equal or higher ratio. The same logic applies to the end of S and a following 1.
Building on the above- such a substring being preceded or followed by another 1 will NOT be optimal, unless the substring contains no 0s. In the case where it doesn't contain 0s, there will exist an optimal substring of length M as well anyways.
Again, that all only applies to the length greater than M case substrings.
Finally, there exists an optimal substring that has length at least M (by definition), and at most 2 * M - 1. If an optimal substring had length K, we could split it into two substrings of length floor(K/2) and ceil(K/2) - S[i:i+floor(K/2)] and S[i+floor(K/2):i+K]. If the substring has the score (ratio) R, and its halves R0 and R1, we would have one of two scenarios:
R = R0 = R1, meaning we could pick either half and get the same score as the combined substring, giving us a shorter substring.
If this substring has length less than 2 * M, we are done- we have an optimal substring of length [M, 2*M).
Otherwise, recurse on the new substring.
R0 != R1, so (without loss of generality) R0 < R < R1, meaning the combined substring would not be optimal in the first place.
Note that I say "there exists an optimal" as opposed to "the optimal". This is because there may be multiple optimal solutions, and the observations above may refer to different instances.
Algorithm
You could search every window size [M, 2*M) at every offset, which would already be better than a full search for small M. You can also try a two-phase approach:
search every M sized window, find the max score
search from the beginning of every run of 1s forward through a special list of ends of runs of 1s, implicitly skipping over 0s and irrelevant 1s, breaking when out of the [M, 2 * M) bound.
For random data, I only expect this to save a small factor- skipping 15/16 of the windows (ignoring the added overhead). For less-random data, you could potentially see huge benefits, particularly if there's LOTS of LARGE runs of 1s and 0s.
The biggest speedup you'll be able to do (besides limiting the window max to 2 * M) is computing a cumulative sum of the bit array. This lets you query "how many 1s were seen up to this point". You can then take the difference of two elements in this array to query "how many 1s occurred between these offsets" in constant time. This allows for very quick calculation of the score.
You can use 2 pointer method, starting from both left-most and right-most ends. then adjust them searching for highest score.
We can add some cache to optimize time.
Example: (Python)
binary="01010101111100"
length=5
def get_score(binary,left,right):
ones=0
for i in range(left,right+1):
if binary[i]=="1":
ones+=1
score= ones/(right-left+1)
return score
cache={}
def get_sub(binary,length,left,right):
if (left,right) in cache:
return cache[(left,right)]
table=[0,set()]
if right-left+1<length:
pass
else:
scores=[[get_score(binary,left,right),set([(left,right)])],
get_sub(binary,length,left+1,right),
get_sub(binary,length,left,right-1),
get_sub(binary,length,left+1,right-1)]
for s in scores:
if s[0]>table[0]:
table[0]=s[0]
table[1]=s[1]
elif s[0]==table[0]:
for e in s[1]:
table[1].add(e)
cache[(left,right)]=table
return table
result=get_sub(binary,length,0,len(binary)-1)
print("Score: %f"%result[0])
print("Index: %s"%result[1])
Output
Score: 1
Index: {(7, 11)}

Dynamic Programming Implementation of Unique subset generation

I came across this problem where we need to find the total number of ways in which you can form a subset of n(where n is the input by the user). The subset conditions are : the numbers in the subset should be distinct and the numbers in the subset should be in decreasing order.
Example: Given n =7 ,the output is 4 because the possible combinations are (6,1)(5,2)(4,3)(4,2,1). Note that though (4,1,1,1) also add upto 7,however it has repeating numbers. Hence it is not a valid subset.
I have solved this problem using the backtracking method which has exponential complexity. However, I am trying to figure out how can we solve this problem using Dynamic Prog ? The nearest I could come up with is the series : for n= 0,1,2,3,4,5,6,7,8,9,10 for which the output is 0,0,0,1,1,2,3,4,5,7,9 respectively. However, I am not able to come up with a general formula that I can use to calculate f(n) . While going through the internet I came across this kind of integer sequences in https://oeis.org/search?q=0%2C0%2C0%2C1%2C1%2C2%2C3%2C4%2C5%2C7%2C9%2C11%2C14&sort=&language=&go=Search . But I am unable to understand the formula they have provided here. Any help would be appreciated. Any other insights that improves exponential complexity is also appreciated.
What you call "unique subset generation" is better known as integer partitions with distinct parts. Mathworld has an entry on the Q function, which counts the number of partitions with distinct parts.
However, your function does not count trivial partitions (i.e. n -> (n)), so what you're looking for is actually Q(n) - 1, which generates the sequence that you linked in your question--the number of partitions into at least 2 distinct parts. This answer to a similar question on Mathematics contains an efficient algorithm in Java for computing the sequence up to n = 200, which easily can be adapted for larger values.
Here's a combinatorial explanation of the referenced algorithm:
Let's start with a table of all the subsets of {1, 2, 3}, grouped by their sums:
index (sum) partitions count
0 () 1
1 (1) 1
2 (2) 1
3 (1+2), (3) 2
4 (1+3) 1
5 (2+3) 1
6 (1+2+3) 1
Suppose we want to construct a new table of all of subsets of {1, 2, 3, 4}. Notice that every subset of {1, 2, 3} is also a subset of {1, 2, 3, 4}, so each subset above will appear in our new table. In fact, we can divide our new table into two equally sized categories: subsets which do not include 4, and those that do. What we can do is start from the table above, copy it over, and then "extend" it with 4. Here's the table for {1, 2, 3, 4}:
index (sum) partitions count
0 () 1
1 (1) 1
2 (2) 1
3 (1+2), (3) 2
4 (1+3), [4] 2
5 (2+3), [1+4] 2
6 (1+2+3),[2+4] 2
7 [1+2+4],[3+4] 2
8 [1+3+4] 1
9 [2+3+4] 1
10 [1+2+3+4] 1
All the subsets that include 4 are surrounded by square brackets, and they are formed by adding 4 to exactly one of the old subsets which are surrounded by parentheses. We can repeat this process and build the table for {1,2,..,5}, {1,2,..,6}, etc.
But we don't actually needed to store the actual subsets/partitions, we just need the counts for each index/sum. For example, if with have the table for {1, 2, 3} with only counts, we can build the count-table for {1, 2, 3, 4} by taking each (index, count) pair from the old table, and adding count to the the current count for index + 4. The idea is that, say, if there are two subsets of {1, 2, 3} that sum to 3, then adding 4 to each of those two subsets will make two new subsets for 7.
With this in mind, here's a Python implementation of this process:
def c(n):
counts = [1]
for k in range(1, n + 1):
new_counts = counts[:] + [0]*k
for index, count in enumerate(counts):
new_counts[index + k] += count
counts = new_counts
return counts
We start with the table for {}, which has just one subset--the empty set-- which sums to zero. Then we build the table for {1}, then for {1, 2}, ..., all the way up to {1,2,..,n}. Once we're done, we'll have counted every partition for n, since no partition of n can include integers larger than n.
Now, we can make 2 major optimizations to the code:
Limit the table to only include entries up to n, since that's all we're interested in. If index + k exceeds n, then we just ignore it. We can even preallocate space for the final table up front, rather than growing bigger tables on each iteration.
Instead of building a new table from scratch on each iteration, if we carefully iterate over the old table backwards, we can actually update it in-place without messing up any of the new values.
With these optimizations, you effectively have the same algorithm from the Mathematics answer that was referenced earlier, which was motivated by generating functions. It runs in O(n^2) time and takes only O(n) space. Here's what it looks like in Python:
def c(n):
counts = [1] + [0]*n
for k in range(1, n + 1):
for i in reversed(range(k, n + 1)):
counts[i] += counts[i - k]
return counts
def Q(n):
"-> the number of subsets of {1,..,n} that sum to n."
return c(n)[n]
def f(n):
"-> the number of subsets of {1,..,n} of size >= 2 that sum to n."
return Q(n) - 1

Number of ways to add up to a sum S with N numbers

Say S = 5 and N = 3 the solutions would look like - <0,0,5> <0,1,4> <0,2,3> <0,3,2> <5,0,0> <2,3,0> <3,2,0> <1,2,2> etc etc.
In the general case, N nested loops can be used to solve the problem. Run N nested loop, inside them check if the loop variables add upto S.
If we do not know N ahead of time, we can use a recursive solution. In each level, run a loop starting from 0 to N, and then call the function itself again. When we reach a depth of N, see if the numbers obtained add up to S.
Any other dynamic programming solution?
Try this recursive function:
f(s, n) = 1 if s = 0
= 0 if s != 0 and n = 0
= sum f(s - i, n - 1) over i in [0, s] otherwise
To use dynamic programming you can cache the value of f after evaluating it, and check if the value already exists in the cache before evaluating it.
There is a closed form formula : binomial(s + n - 1, s) or binomial(s+n-1,n-1)
Those numbers are the simplex numbers.
If you want to compute them, use the log gamma function or arbitrary precision arithmetic.
See https://math.stackexchange.com/questions/2455/geometric-proof-of-the-formula-for-simplex-numbers
I have my own formula for this. We, together with my friend Gio made an investigative report concerning this. The formula that we got is [2 raised to (n-1) - 1], where n is the number we are looking for how many addends it has.
Let's try.
If n is 1: its addends are o. There's no two or more numbers that we can add to get a sum of 1 (excluding 0). Let's try a higher number.
Let's try 4. 4 has addends: 1+1+1+1, 1+2+1, 1+1+2, 2+1+1, 1+3, 2+2, 3+1. Its total is 7.
Let's check with the formula. 2 raised to (4-1) - 1 = 2 raised to (3) - 1 = 8-1 =7.
Let's try 15. 2 raised to (15-1) - 1 = 2 raised to (14) - 1 = 16384 - 1 = 16383. Therefore, there are 16383 ways to add numbers that will equal to 15.
(Note: Addends are positive numbers only.)
(You can try other numbers, to check whether our formula is correct or not.)
This can be calculated in O(s+n) (or O(1) if you don't mind an approximation) in the following way:
Imagine we have a string with n-1 X's in it and s o's. So for your example of s=5, n=3, one example string would be
oXooXoo
Notice that the X's divide the o's into three distinct groupings: one of length 1, length 2, and length 2. This corresponds to your solution of <1,2,2>. Every possible string gives us a different solution, by counting the number of o's in a row (a 0 is possible: for example, XoooooX would correspond to <0,5,0>). So by counting the number of possible strings of this form, we get the answer to your question.
There are s+(n-1) positions to choose for s o's, so the answer is Choose(s+n-1, s).
There is a fixed formula to find the answer. If you want to find the number of ways to get N as the sum of R elements. The answer is always:
(N+R-1)!/((R-1)!*(N)!)
or in other words:
(N+R-1) C (R-1)
This actually looks a lot like a Towers of Hanoi problem, without the constraint of stacking disks only on larger disks. You have S disks that can be in any combination on N towers. So that's what got me thinking about it.
What I suspect is that there is a formula we can deduce that doesn't require the recursive programming. I'll need a bit more time though.

Algorithm to count the number of valid blocks in a permutation [duplicate]

This question already has answers here:
Closed 12 years ago.
Possible Duplicate:
Finding sorted sub-sequences in a permutation
Given an array A which holds a permutation of 1,2,...,n. A sub-block A[i..j]
of an array A is called a valid block if all the numbers appearing in A[i..j]
are consecutive numbers (may not be in order).
Given an array A= [ 7 3 4 1 2 6 5 8] the valid blocks are [3 4], [1,2], [6,5],
[3 4 1 2], [3 4 1 2 6 5], [7 3 4 1 2 6 5], [7 3 4 1 2 6 5 8]
So the count for above permutation is 7.
Give an O( n log n) algorithm to count the number of valid blocks.
Ok, I am down to 1 rep because I put 200 bounty on a related question: Finding sorted sub-sequences in a permutation
so I cannot leave comments for a while.
I have an idea:
1) Locate all permutation groups. They are: (78), (34), (12), (65). Unlike in group theory, their order and position, and whether they are adjacent matters. So, a group (78) can be represented as a structure (7, 8, false), while (34) would be (3,4,true). I am using Python's notation for tuples, but it is actually might be better to use a whole class for the group. Here true or false means contiguous or not. Two groups are "adjacent" if (max(gp1) == min(gp2) + 1 or max(gp2) == min(gp1) + 1) and contigous(gp1) and contiguos(gp2). This is not the only condition, for union(gp1, gp2) to be contiguous, because (14) and (23) combine into (14) nicely. This is a great question for algo class homework, but a terrible one for interview. I suspect this is homework.
Just some thoughts:
At first sight, this sounds impossible: a fully sorted array would have O(n2) valid sub-blocks.
So, you would need to count more than one valid sub-block at a time. Checking the validity of a sub-block is O(n). Checking whether a sub-block is fully sorted is O(n) as well. A fully sorted sub-block contains n·(n - 1)/2 valid sub-blocks, which you can count without further breaking this sub-block up.
Now, the entire array is obviously always valid. For a divide-and-conquer approach, you would need to break this up. There are two conceivable breaking points: the location of the highest element, and that of the lowest element. If you break the array into two at one of these points, including the extremum in the part that contains the second-to-extreme element, there cannot be a valid sub-block crossing this break-point.
By always choosing the extremum that produces a more even split, this should work quite well (average O(n log n)) for "random" arrays. However, I can see problems when your input is something like (1 5 2 6 3 7 4 8), which seems to produce O(n2) behaviour. (1 4 7 2 5 8 3 6 9) would be similar (I hope you see the pattern). I currently see no trick to catch this kind of worse case, but it seems that it requires other splitting techniques.
This question does involve a bit of a "math trick" but it's fairly straight forward once you get it. However, the rest of my solution won't fit the O(n log n) criteria.
The math portion:
For any two consecutive numbers their sum is 2k+1 where k is the smallest element. For three it is 3k+3, 4 : 4k+6 and for N such numbers it is Nk + sum(1,N-1). Hence, you need two steps which can be done simultaneously:
Create the sum of all the sub-arrays.
Determine the smallest element of a sub-array.
The dynamic programming portion
Build two tables using the results of the previous row's entries to build each successive row's entries. Unfortunately, I'm totally wrong as this would still necessitate n^2 sub-array checks. Ugh!
My proposition
STEP = 2 // amount of examed number
B [0,0,0,0,0,0,0,0]
B [1,1,0,0,0,0,0,0]
VALID(A,B) - if not valid move one
B [0,1,1,0,0,0,0,0]
VALID(A,B) - if valid move one and step
B [0,0,0,1,1,0,0,0]
VALID (A,B)
B [0,0,0,0,0,1,1,0]
STEP = 3
B [1,1,1,0,0,0,0,0] not ok
B [0,1,1,1,0,0,0,0] ok
B [0,0,0,0,1,1,1,0] not ok
STEP = 4
B [1,1,1,1,0,0,0,0] not ok
B [0,1,1,1,1,0,0,0] ok
.....
CON <- 0
STEP <- 2
i <- 0
j <- 0
WHILE(STEP <= LEN(A)) DO
j <- STEP
WHILE(STEP <= LEN(A) - j) DO
IF(VALID(A,i,j)) DO
CON <- CON + 1
i <- j + 1
j <- j + STEP
ELSE
i <- i + 1
j <- j + 1
END
END
STEP <- STEP + 1
END
The valid method check that all elements are consecutive
Never tested but, might be ok
The original array doesn't contain duplicates so must itself be a consecutive block. Lets call this block (1 ~ n). We can test to see whether block (2 ~ n) is consecutive by checking if the first element is 1 or n which is O(1). Likewise we can test block (1 ~ n-1) by checking whether the last element is 1 or n.
I can't quite mould this into a solution that works but maybe it will help someone along...
Like everybody else, I'm just throwing this out ... it works for the single example below, but YMMV!
The idea is to count the number of illegal sub-blocks, and subtract this from the total possible number. We count the illegal ones by examining each array element in turn and ruling out sub-blocks that include the element but not its predecessor or successor.
Foreach i in [1,N], compute B[A[i]] = i.
Let Count = the total number of sub-blocks with length>1, which is N-choose-2 (one for each possible combination of starting and ending index).
Foreach i, consider A[i]. Ignoring edge cases, let x=A[i]-1, and let y=A[i]+1. A[i] cannot participate in any sub-block that does not include x or y. Let iX=B[x] and iY=B[y]. There are several cases to be treated independently here. The general case is that iX<i<iY<i. In this case, we can eliminate the sub-block A[iX+1 .. iY-1] and all intervening blocks containing i. There are (i - iX + 1) * (iY - i + 1) such sub-blocks, so call this number Eliminated. (Other cases left as an exercise for the reader, as are those edge cases.) Set Count = Count - Eliminated.
Return Count.
The total cost appears to be N * (cost of step 2) = O(N).
WRINKLE: In step 2, we must be careful not to eliminate each sub-interval more than once. We can accomplish this by only eliminating sub-intervals that lie fully or partly to the right of position i.
Example:
A = [1, 3, 2, 4]
B = [1, 3, 2, 4]
Initial count = (4*3)/2 = 6
i=1: A[i]=1, so need sub-blocks with 2 in them. We can eliminate [1,3] from consideration. Eliminated = 1, Count -> 5.
i=2: A[i]=3, so need sub-blocks with 2 or 4 in them. This rules out [1,3] but we already accounted for it when looking right from i=1. Eliminated = 0.
i=3: A[i] = 2, so need sub-blocks with [1] or [3] in them. We can eliminate [2,4] from consideration. Eliminated = 1, Count -> 4.
i=4: A[i] = 4, so we need sub-blocks with [3] in them. This rules out [2,4] but we already accounted for it when looking right from i=3. Eliminated = 0.
Final Count = 4, corresponding to the sub-blocks [1,3,2,4], [1,3,2], [3,2,4] and [3,2].
(This is an attempt to do this N.log(N) worst case. Unfortunately it's wrong -- it sometimes undercounts. It incorrectly assumes you can find all the blocks by looking at only adjacent pairs of smaller valid blocks. In fact you have to look at triplets, quadruples, etc, to get all the larger blocks.)
You do it with a struct that represents a subblock and a queue for subblocks.
struct
c_subblock
{
int index ; /* index into original array, head of subblock */
int width ; /* width of subblock > 0 */
int lo_value;
c_subblock * p_above ; /* null or subblock above with same index */
};
Alloc an array of subblocks the same size as the original array, and init each subblock to have exactly one item in it. Add them to the queue as you go. If you start with array [ 7 3 4 1 2 6 5 8 ] you will end up with a queue like this:
queue: ( [7,7] [3,3] [4,4] [1,1] [2,2] [6,6] [5,5] [8,8] )
The { index, width, lo_value, p_above } values for subbblock [7,7] will be { 0, 1, 7, null }.
Now it's easy. Forgive the c-ish pseudo-code.
loop {
c_subblock * const p_left = Pop subblock from queue.
int const right_index = p_left.index + p_left.width;
if ( right_index < length original array ) {
// Find adjacent subblock on the right.
// To do this you'll need the original array of length-1 subblocks.
c_subblock const * p_right = array_basic_subblocks[ right_index ];
do {
Check the left/right subblocks to see if the two merged are also a subblock.
If they are add a new merged subblock to the end of the queue.
p_right = p_right.p_above;
}
while ( p_right );
}
}
This will find them all I think. It's usually O(N log(N)), but it'll be O(N^2) for a fully sorted or anti-sorted list. I think there's an answer to this though -- when you build the original array of subblocks you look for sorted and anti-sorted sequences and add them as the base-level subblocks. If you are keeping a count increment it by (width * (width + 1))/2 for the base-level. That'll give you the count INCLUDING all the 1-length subblocks.
After that just use the loop above, popping and pushing the queue. If you're counting you'll have to have a multiplier on both the left and right subblocks and multiply these together to calculate the increment. The multiplier is the width of the leftmost (for p_left) or rightmost (for p_right) base-level subblock.
Hope this is clear and not too buggy. I'm just banging it out, so it may even be wrong.
[Later note. This doesn't work after all. See note below.]

generate sequence with all permutations

How can I generate the shortest sequence with contains all possible permutations?
Example:
For length 2 the answer is 121, because this list contains 12 and 21, which are all possible permutations.
For length 3 the answer is 123121321, because this list contains all possible permutations:
123, 231, 312, 121 (invalid), 213, 132, 321.
Each number (within a given permutation) may only occur once.
This greedy algorithm produces fairly short minimal sequences.
UPDATE: Note that for n ≥ 6, this algorithm does not produce the shortest possible string!
Make a collection of all permutations.
Remove the first permutation from the collection.
Let a = the first permutation.
Find the sequence in the collection that has the greatest overlap with the end of a. If there is a tie, choose the sequence is first in lexicographic order. Remove the chosen sequence from the collection and add the non-overlapping part to the end of a. Repeat this step until the collection is empty.
The curious tie-breaking step is necessary for correctness; breaking the tie at random instead seems to result in longer strings.
I verified (by writing a much longer, slower program) that the answer this algorithm gives for length 4, 123412314231243121342132413214321, is indeed the shortest answer. However, for length 6 it produces an answer of length 873, which is longer than the shortest known solution.
The algorithm is O(n!2).
An implementation in Python:
import itertools
def costToAdd(a, b):
for i in range(1, len(b)):
if a.endswith(b[:-i]):
return i
return len(b)
def stringContainingAllPermutationsOf(s):
perms = set(''.join(tpl) for tpl in itertools.permutations(s))
perms.remove(s)
a = s
while perms:
cost, next = min((costToAdd(a, x), x) for x in perms)
perms.remove(next)
a += next[-cost:]
return a
The length of the strings generated by this function are 1, 3, 9, 33, 153, 873, 5913, ... which appears to be this integer sequence.
I have a hunch you can do better than O(n!2).
Create all permutations.
Let each
permutation represent a node in a
graph.
Now, for any two states add an
edge with a value 1 if they share
n-1 digits (for the source from the
end, and for the target from the
end), two if they share n-2 digits
and so on.
Now, you are left to find
the shortest path containing n
vertices.
Here is a fast algorithm that produces a short string containing all permutations. I am pretty sure it produces the shortest possible answer, but I don't have a complete proof in hand.
Explanation. Below is a tree of All Permutations. The picture is incomplete; imagine that the tree goes on forever to the right.
1 --+-- 12 --+-- 123 ...
| |
| +-- 231 ...
| |
| +-- 312 ...
|
+-- 21 --+-- 213 ...
|
+-- 132 ...
|
+-- 321 ...
The nodes at level k of this tree are all the permutations of length
k. Furthermore, the permutations are in a particular order with a lot
of overlap between each permutation and its neighbors above and below.
To be precise, each node's first child is found by simply adding the next
symbol to the end. For example, the first child of 213 would be 2134. The rest
of the children are found by rotating to the first child to left one symbol at
a time. Rotating 2134 would produce 1342, 3421, 4213.
Taking all the nodes at a given level and stringing them together, overlapping
as much as possible, produces the strings 1, 121, 123121321, etc.
The length of the nth string in that sequence is the sum for x=1 to n of x!. (You can prove this by observing how much non-overlap there is between neighboring permutations. Siblings overlap in all but 1 symbol; first-cousins overlap in all but 2 symbols; and so on.)
Sketch of proof. I haven't completely proved that this is the best solution, but here's a sketch of how the proof would proceed. First show that any string containing n distinct permutations has length ≥ 2n - 1. Then show that adding any string containing n+1 distinct permutations has length 2n + 1. That is, adding one more permutation will cost you two digits. Proceed by calculating the minimum length of strings containing nPr and nPr + 1 distinct permutations, up to n!. In short, this sequence is optimal because you can't make it worse somewhere in the hope of making it better someplace else. It's already locally optimal everywhere. All the moves are forced.
Algorithm. Given all this background, the algorithm is very simple. Walk this tree to the desired depth and string together all the nodes at that depth.
Fortunately we do not actually have to build the tree in memory.
def build(node, s):
"""String together all descendants of the given node at the target depth."""
d = len(node) # depth of this node. depth of "213" is 3.
n = len(s) # target depth
if d == n - 1:
return node + s[n - 1] + node # children of 213 join to make "2134213"
else:
c0 = node + s[d] # first child node
children = [c0[i:] + c0[:i] for i in range(d + 1)] # all child nodes
strings = [build(c, s) for c in children] # recurse to the desired depth
for j in range(1, d + 1):
strings[j] = strings[j][d:] # cut off overlap with previous sibling
return ''.join(strings) # join what's left
def stringContainingAllPermutationsOf(s):
return build(s[:1], s)
Performance. The above code is already much faster than my other solution, and it does a lot of cutting and pasting of large strings that you can optimize away. The algorithm can be made to run in time and memory proportional to the size of the output.
For n 3 length chain is 8
12312132
Seems to me we are working with cycled system - it's ring, saying in other words. But we are are working with ring as if it is chain. Chain is realy 123121321 = 9
But the ring is 12312132 = 8
We take last 1 for 321 from the beginning of the sequence 12312132[1].
These are called (minimal length) superpermutations (cf. Wikipedia).
Interest on this has re-sparked when an anonymous user has posted a new lower bound on 4chan. (See Wikipedia and many other web pages for history.)
AFAIK, as of today we just know:
Their length is A180632(n) ≤ A007489(n) = Sum_{k=1..n} k! but this bound is only sharp for n ≤ 5, i.e., we have equality for n ≤ 5 but strictly less for n > 5.
There's a very simple recursive algorithm, given below, producing a superpermutation of length A007489(n), which is always palindromic (but as said above this is not the minimal length for n > 5).
For n ≥ 7 we have the better upper bound n! + (n−1)! + (n−2)! + (n−3)! + n − 3.
For n ≤ 5 all minimal SP's are known; and for all n > 5 we don't know which is the minimal SP.
For n = 1, 2, 3, 4 the minimal SP's are unique (up to changing the symbols), given by (1, 121, 123121321, 123412314231243121342132413214321) of length A007489(1..4) = (1, 3, 9, 33).
For n = 5 there are 8 inequivalent ones of minimal length 153 = A007489(5); the palindromic one produced by the algorithm below is the 3rd in lexicographic order.
For n = 6 Houston produced thousands of the smallest known length 872 = A007489(6) - 1, but AFAIK we still don't know whether this is minimal.
For n = 7 Egan produced one of length 5906 (one less than the better upper bound given above) but again we don't know whether that's minimal.
I've written a very short PARI/GP program (you can paste to run it on the PARI/GP web site) which implements the standard algorithm producing a palindromic superpermutation of length A007489(n):
extend(S,n=vecmax(s))={ my(t); concat([
if(#Set(s)<n, [], /* discard if not a permutation */
s=concat([s, n+1, s]); /* Now merge with preceding segment: */
forstep(i=min(#s, #t)-1, 0, -1,
if(s[1..1+i]==t[#t-i..#t], s=s[2+i..-1]; break));
t=s /* store as previous for next */
)/*endif*/
| s <- [ S[i+1..i+n] | i <- [0..#S-n] ]])
}
SSP=vector(6, n, s=if(n>1, extend(s), [1])); // gives the first 6, the 6th being non-minimal
I think that easily translates to any other language. (For non-PARI speaking persons: "| x <-" means "for x in".)

Resources