Substring search with max 1's in a binary sequence - algorithm

Problem
The task is to find the substring of a given binary string with the highest score. The substring must be at least a given minimum length.
score = (number of 1s) / (substring length), so the score ranges from 0 to 1.
Inputs:
1. min length of substring
2. binary sequence
Outputs:
1. index of first char of substring
2. index of last char of substring
Example 1:
input
-----
5
01010101111100
output
------
7
11
explanation
-----------
1. start with minimum window = 5
2. start_ind = 0, end_ind = 4, score = 2/5 (0.4)
3. start_ind = 1, end_ind = 5, score = 3/5 (0.6)
4. and so on...
5. start_ind = 7, end_ind = 11, score = 5/5 (1) [max possible]
Example 2:
input
-----
5
10110011100
output
------
2
8
explanation
-----------
1. while calculating all scores for window sizes 5 to len(sequence)
2. max score occurs in the case: start_ind=2, end_ind=8, score=5/7 (0.7143) [max possible]
Example 3:
input
-----
4
00110011100
output
------
5
8
What I attempted
The only technique I could come up with was brute force, with nested for loops:
for window_size in (min to max)
    for ind in (0 to end)
        calculate score
        save max score
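A direct Python rendering of that brute force, for reference (a sketch; the names are mine):

import itertools  # not needed here, just plain loops

def brute_force(seq, min_len):
    """O(n^2) scan over every window of length >= min_len."""
    best = (-1.0, 0, 0)  # (score, start, end); ties broken by the tuple order
    for start in range(len(seq)):
        ones = 0
        for end in range(start, len(seq)):
            ones += seq[end] == '1'
            if end - start + 1 >= min_len:
                best = max(best, (ones / (end - start + 1), start, end))
    return best[1], best[2]

print(brute_force("01010101111100", 5))  # (7, 11)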
Can someone suggest a better algorithm to solve this problem?

There are a few observations to make before we start talking about an algorithm; some of these observations have already been pointed out in the comments.
Maths
Take the minimum length to be M, the length of the entire string to be L, and a substring from the ith char to the jth char (inclusive-exclusive) to be S[i:j].
All optimal substrings will satisfy at least one of two conditions:
It is exactly M characters in length
It starts and ends with a 1 character
The reason for the latter: if a substring longer than M characters started or ended with a 0, we could drop that 0, resulting in a higher ratio.
In the same spirit (again, for the second case), there exists an optimal substring which is not preceded by a 1. Otherwise, if it were, we could include that 1, resulting in an equal or higher ratio. The same logic applies to the end of S and a following 1.
Building on the above: such a substring being preceded or followed by another 1 will NOT be optimal, unless the substring contains no 0s. In the case where it contains no 0s, there will exist an optimal substring of length M anyway.
Again, all of that applies only to substrings longer than M.
Finally, there exists an optimal substring that has length at least M (by definition), and at most 2 * M - 1. If an optimal substring had length K >= 2 * M, we could split it into two substrings of length floor(K/2) and ceil(K/2) - S[i:i+floor(K/2)] and S[i+floor(K/2):i+K] - each still of length at least M. If the substring has the score (ratio) R, and its halves R0 and R1, we would have one of two scenarios:
R = R0 = R1, meaning we could pick either half and get the same score as the combined substring, giving us a shorter substring.
If this substring has length less than 2 * M, we are done: we have an optimal substring of length in [M, 2*M).
Otherwise, recurse on the new substring.
R0 != R1, so (without loss of generality) R0 < R < R1, meaning the combined substring would not be optimal in the first place.
Note that I say "there exists an optimal" as opposed to "the optimal". This is because there may be multiple optimal solutions, and the observations above may refer to different instances.
Algorithm
You could search every window size [M, 2*M) at every offset, which would already be better than a full search for small M. You can also try a two-phase approach:
search every M-sized window and find the max score
search forward from the beginning of every run of 1s through a precomputed list of the ends of runs of 1s, implicitly skipping over 0s and irrelevant 1s, breaking once outside the [M, 2 * M) bound.
For random data, I only expect this to save a small factor, e.g. skipping 15/16 of the windows (ignoring the added overhead). For less-random data, you could potentially see huge benefits, particularly if there are LOTS of LARGE runs of 1s and 0s.
The biggest speedup you'll be able to do (besides limiting the window max to 2 * M) is computing a cumulative sum of the bit array. This lets you query "how many 1s were seen up to this point". You can then take the difference of two elements in this array to query "how many 1s occurred between these offsets" in constant time. This allows for very quick calculation of the score.
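A minimal sketch combining the prefix-sum trick with the [M, 2*M) window bound (the function name is mine):

def best_window(seq, M):
    """Prefix sums + only window sizes in [M, 2*M), per the observations above."""
    n = len(seq)
    prefix = [0] * (n + 1)                  # prefix[i] = number of 1s in seq[:i]
    for i, c in enumerate(seq):
        prefix[i + 1] = prefix[i] + (c == '1')
    best = (-1.0, 0, 0)
    for w in range(M, min(2 * M, n + 1)):   # window sizes M .. 2M-1
        for start in range(n - w + 1):
            score = (prefix[start + w] - prefix[start]) / w   # O(1) per window
            if score > best[0]:
                best = (score, start, start + w - 1)
    return best[1], best[2]

print(best_window("01010101111100", 5))  # (7, 11)
print(best_window("10110011100", 5))     # (2, 8)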

You can use the two-pointer method, starting from the left-most and right-most ends, then adjusting them while searching for the highest score.
We can add a cache to optimize the running time.
Example: (Python)
binary="01010101111100"
length=5
def get_score(binary,left,right):
ones=0
for i in range(left,right+1):
if binary[i]=="1":
ones+=1
score= ones/(right-left+1)
return score
cache={}
def get_sub(binary,length,left,right):
if (left,right) in cache:
return cache[(left,right)]
table=[0,set()]
if right-left+1<length:
pass
else:
scores=[[get_score(binary,left,right),set([(left,right)])],
get_sub(binary,length,left+1,right),
get_sub(binary,length,left,right-1),
get_sub(binary,length,left+1,right-1)]
for s in scores:
if s[0]>table[0]:
table[0]=s[0]
table[1]=s[1]
elif s[0]==table[0]:
for e in s[1]:
table[1].add(e)
cache[(left,right)]=table
return table
result=get_sub(binary,length,0,len(binary)-1)
print("Score: %f"%result[0])
print("Index: %s"%result[1])
Output
Score: 1.000000
Index: {(7, 11)}

Longest odd Palindromic Substring with middle index i

Longest Odd Palindromes
Problem Description
Given a string S (consisting of only lowercase characters) and Q queries.
In each query you will be given an integer i, and your task is to find the length of the longest odd palindromic substring whose middle index is i. Note:
1.) Assume 1-based indexing.
2.) Longest odd palindrome: a palindromic substring whose length is odd.
Problem Constraints
1 <= |S|, Q <= 1e5
1 <= i <= |S|
Input Format
First argument A is string S.
Second argument B is an array of integers where B[i] denotes the query index of ith query.
Output Format
Return an array of integers where ith integer denotes the answer of ith query.
Is there any better way to solve this question other than brute force, i.e., generating all the palindromic substrings and checking each query against them?
There is Manacher's algorithm, which calculates the number of palindromes centered at the i-th index in linear time.
After the precalculation stage you can answer each query in O(1). I changed the result array to contain the lengths of the longest palindromes centered at every position.
Python code (the link contains a C++ version):
def manacher_odd(s):
    n = len(s)
    odds = []                      # odds[i] = palindrome radius centered at i
    l, r = 0, -1                   # bounds of the right-most palindrome found
    for i in range(n):
        k = min(odds[l + r - i], r - i + 1) if i <= r else 1
        while (i + k < n) and (i - k >= 0) and (s[i + k] == s[i - k]):
            k += 1
        odds.append(k)
        if i + k - 1 > r:
            l, r = i - k + 1, i + k - 1
    for i in range(n):
        odds[i] = 2 * odds[i] - 1  # convert radius to length
    return odds
print(manacher_odd("abaaaba"))
[1, 3, 1, 7, 1, 3, 1]
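The query stage is not shown above; assuming B holds 1-based centers as in the problem statement, it could be as simple as this sketch:

S = "abaaaba"
B = [1, 4, 6]                      # hypothetical sample queries (1-based)
odds = manacher_odd(S)
print([odds[i - 1] for i in B])    # [1, 7, 3] -- each query answered in O(1)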
There are 2 possible optimizations they might be looking for.
First, you can do an initial run over S first, cleverly building a lookup table, and then your query will just use that, which I think would be faster if B is long.
Alternatively, if not doing a look up, then while you're searching at index i, you'll potentially search neighboring indexes at the same time. As you check i, you can also be checking i+1, i-1, i+2, i-2, etc... as you go, and save that answer for later. This seems the less promising route to me, so I want to dive into the first idea more.
Finally, if B is quite short, then the best answer might be brute force, actually. It's good to know when to keep it simple.
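For reference, a minimal sketch of that per-query brute force (expanding around the given center; names are mine):

def longest_odd_pal(s, i):
    """Length of the longest odd palindrome centered at 1-based index i."""
    c = i - 1                                    # 0-based center
    k = 0
    while c - k - 1 >= 0 and c + k + 1 < len(s) and s[c - k - 1] == s[c + k + 1]:
        k += 1
    return 2 * k + 1

print([longest_odd_pal("abaaaba", i) for i in range(1, 8)])  # [1, 3, 1, 7, 1, 3, 1]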
Initial run method
One optimization that comes to mind for a pre-process run is as follows:
Search the next unknown index: brute-force it by looking forwards and back, while recording the frequency of each letter (including the middle one). If a palindrome of length 1 or 3 was found, move to the next index and repeat.
If a palindrome of length 5 or longer was found, calculate the midpoints of any letters that showed up more than twice which are to the right of the current index.
Any point between the current index and the index of the last palindrome letter that isn't in the midpoints list has an answer of 1 for its length.
This means you'll search only the midpoints found in (2). After that, you'll continue searching from the index of the last letter of the palindrome found in (2).
An example
Let's say S starts with: `a, b, c, d, a, f, g, f, g, f, a, d, c, b, a, ...` and you have checked from `i = 2` up to `i = 7` but found nothing except a run of 3 at `i = 7`. Now, you check index `i = 8`. You will find a palindrome extending out 7 letters in each direction, for a total of 15, but as you check, note any letters that show up more than twice. In this case, there are 3 `f`s and 4 `a`s. Find any midpoints these pairs have that are right of the current index (8). In this case, 2 `f`s have a midpoint of i=9, and the 2 right-most `a`s have a midpoint of i=13. Once you're done looking at i=8, you can skip any index not on your list, all the way up to the last letter you found from i=8. For example, we only have to check i=9 and i=13, and then start from i=15, checking every step. We've been able to skip checking i=10, 11, 12, and 14.

Generate first N combinations of length L from weighted set in order of weight

I have a set of letters with weights, which gives their probability of appearing in a string:
a - 0.7
b - 0.1
c - 0.3
...
z - 0.01
As such, the word aaaa has a probability of 0.7*0.7*0.7*0.7 = 0.24. The word aaac would have probability 0.7*0.7*0.7*0.3 = 0.10. All permutations of the same word have the same probability, so we only need to worry about combinations.
I would like to generate the first unique N strings of length L in order of probability (for example here, with 4 letters and length 4, that should be aaaa, aaac, aacc, aaab, accc, aabc, cccc, etc).
Assume that the brute-force approach of generating all combinations with their probabilities and sorting by weight is not possible here. The algorithm, should it exist, must work for any set size and any string length (e.g. all 256 bytes with weighted probabilities, strings of length 1024, generate the first trillion).
Below is some enumeration code that uses a heap. The implementation principle is slightly different from what user3386109 proposes in their comment.
Order the symbols by decreasing probability. There's a constructive one-to-one correspondence between length-L combinations of S symbols and binary strings of length S + L − 1 with S − 1 zeros (counting out each symbol in unary, separated by S − 1 delimiters). We can do bit-at-a-time enumeration of the possibilities for the latter.
The part that saves us from having to enumerate every single combination is that, for each binary prefix, the most probable word can be found by repeating the most probable letter still available. By storing the prefixes in a heap, we can open only the ones that appear in the top N.
Note that this uses memory proportional to the number of enumerated combinations. This may still be too much, in which case you probably want something like iterative deepening depth-first search.
import heapq
import math

symbol_probability_dict = {"a": 0.7, "b": 0.1, "c": 0.3, "z": 0.01}
L = 4

# work with -log(p) so that the most probable word has the smallest "loss"
loss_symbol_pairs = [(-math.log(p), c) for (c, p) in symbol_probability_dict.items()]
loss_symbol_pairs.sort()

heap = [(0, 0, "")]  # (minimum achievable loss, symbol index, prefix)
while heap:
    min_loss, i, s = heapq.heappop(heap)
    if len(s) < L:
        # extend the prefix with the current symbol...
        heapq.heappush(heap, (min_loss, i, s + loss_symbol_pairs[i][1]))
        # ...or switch the remaining positions to the next (less probable) symbol
        if i + 1 < len(loss_symbol_pairs):
            heapq.heappush(
                heap,
                (
                    min_loss
                    + (L - len(s))
                    * (loss_symbol_pairs[i + 1][0] - loss_symbol_pairs[i][0]),
                    i + 1,
                    s,
                ),
            )
    else:
        print(s)
Output:
aaaa
aaac
aacc
aaab
accc
aacb
cccc
accb
aabb
aaaz
cccb
acbb
aacz
ccbb
abbb
accz
aabz
cbbb
cccz
acbz
bbbb
ccbz
abbz
aazz
cbbz
aczz
bbbz
cczz
abzz
cbzz
bbzz
azzz
czzz
bzzz
zzzz
This answer provides an implementation of user3386109's comment:
I think the solution is a priority queue. The initial input to the queue is a string with the highest probability letter (aaaa). After reading a string from the queue, replace a letter with the next lower probability letter, and add that new string to the queue. So after reading aaaa, write aaac. After reading aaac, write aacc (changing an a to c) and aaab (changing a c to a b).
I wrote a generator function, following these exact steps:
Define a helper function prio(word), which returns the priority of a word (as a negative number, because Python heaps are min-heaps);
Define a helper dict: next_letter.get(letter) is the next letter in decreasing-probability order after letter, if any;
Initialise a heapq queue with the first word aaaa and its corresponding priority;
When popping words from the queue, avoid possible duplicates by comparing the current word with the previous word;
After popping a word from the queue, if it is not a duplicate, yield this word, and push the words obtained by replacing one letter with the next lower-probability letter;
Since it is a generator function, it is lazy: you can get the first N words without computing all words. However, some extra words will still be computed, since the whole idea of the priority queue is that we don't know the exact order in advance.
import heapq
from math import prod
from itertools import pairwise

def gen_combs(weights, r=4):
    # next_letter maps each letter to the next one in decreasing-probability order
    next_letter = dict(pairwise(sorted(weights, key=weights.get, reverse=True)))
    def prio(word, weights=weights):
        return -prod(map(weights.get, word))   # negative: heapq is a min-heap
    first_word = max(weights, key=weights.get) * r
    queue = [(prio(first_word), first_word)]
    prev_word = None
    while queue:
        w, word = heapq.heappop(queue)
        if word != prev_word:                  # skip duplicates
            yield word
            prev_word = word
            seen_letters = set()
            for i, letter in enumerate(word):
                if letter not in seen_letters:
                    new_letter = next_letter.get(letter)
                    if new_letter:
                        new_word = ''.join((word[:i], new_letter, word[i+1:]))
                        heapq.heappush(queue, (prio(new_word), new_word))
                    seen_letters.add(letter)

# TESTING
weights = {'a': 0.7, 'b': 0.1, 'c': 0.3, 'z': 0.01}

# print all words
print(list(gen_combs(weights)))
# ['aaaa', 'caaa', 'ccaa', 'baaa', 'ccca', 'bcaa', 'cccc',
#  'bcca', 'bbaa', 'zaaa', 'bccc', 'bbca', 'zcaa', 'bbcc',
#  'bbba', 'zcca', 'zbaa', 'bbbc', 'zccc', 'zbca', 'bbbb',
#  'zbcc', 'zbba', 'zzaa', 'zbbc', 'zzca', 'zbbb', 'zzcc',
#  'zzba', 'zzbc', 'zzbb', 'zzza', 'zzzc', 'zzzb', 'zzzz']

n = 10
# print first n words, one per line
for _, word in zip(range(n), gen_combs(weights)):
    print(word)

# print first n words
from itertools import islice
print(list(islice(gen_combs(weights), n)))
# ['aaaa', 'caaa', 'ccaa', 'baaa', 'ccca', 'bcaa', 'cccc', 'bcca', 'bbaa', 'zaaa']
First, pardon me for pointing out that your probabilities do not sum to 1, which instantly makes me feel something is off.
I would like to share my version (credit to Stef, David, and user3386109), which is easier to prove to be O(NL) in space and O(N * max(L, log N)) in time. Because you need O(NL) just to store/print the result, plus a log N factor for the priority queue, it is hard to improve further; I don't think the other answers here come with complexity proofs this tight. P.S. Making it a generator won't help, since the space is mostly consumed by the heap itself. P.S.2: for large L, the size of N can be astronomical.
Complexity analysis: the heap starts with 1 element, and each pop (which is one of the top N) pushes up to 2 elements back in. Therefore the total space is O(N * 2 * item size) = O(NL). The time cost is O(N * max(L, log N)), because each push/pop costs O(log N), while many steps in the inner loop cost O(L).
You can see it prints the same 35 results as Stef's. Now, the core part of this algorithm is how to choose the 2 elements to push after each pop. First consider the opposite question: how to find the parent of each child, so that (1) all possible combos form a partial order (the maximal element is 'aaaa', and no, we don't need Zorn's lemma here), (2) we won't miss anything larger than the child but smaller than the parent (not an entirely correct statement for 2 letters sharing the same probability, but in practice I found it does not matter), and (3) there are no repeats, so we don't have to cache the visited strings. My way of choosing the parent is to always lower the letters at the front of the string until it reaches 'a'. E.g., the parent of 'cbbz' will be 'abbz' instead of 'ccbz'. Conversely, we find the children of a given parent, and it is nice to see we have up to 2 children in this case: raise the first non-'a' letter until it equals the next letter ('acbb' => 'abbb', while 'accb' has no child in this direction), or raise the last 'a' ('acbb' => 'ccbb', 'accb' => 'cccb').
import heapq
import math

letters = [[-math.log(x[1]), x[0]] for x in zip('abcz', [0.7, 0.1, 0.3, 0.01])]
N = 35  # something big to force a full print in testing; here 35 is calculated by DP or Pascal's triangle
L = 4
letters.sort()
letters = [[i, x[0], x[1]] for i, x in enumerate(letters)]  # [rank, loss, symbol]
heap = [[letters[0][1] * L, [0] * L, L - 1]]  # [loss, letter ranks, child pointer]
while heap and N > 0:
    N -= 1
    p, nextLargest, childPtr = heapq.heappop(heap)
    print(''.join([letters[x][2] for x in nextLargest]))
    if childPtr < L - 1 and nextLargest[childPtr + 1] < len(letters) - 1:
        child = list(nextLargest)
        child[childPtr + 1] += 1
        if childPtr + 2 == L or child[childPtr + 1] <= child[childPtr + 2]:
            # this can be improved by using a delta, but won't change the overall complexity
            prob = sum([letters[x][1] for x in child])
            heapq.heappush(heap, [prob, child, childPtr])
    if childPtr >= 0:
        child = list(nextLargest)
        child[childPtr] += 1
        prob = sum([letters[x][1] for x in child])
        heapq.heappush(heap, [prob, child, childPtr - 1])

Maximum gap in unsorted array

I'm following algorithm from here:
http://cgm.cs.mcgill.ca/~godfried/teaching/dm-reading-assignments/Maximum-Gap-Problem.pdf
I don't understand steps 2 and 3:
Divide the interval [x_min, x_max] into (n−1) "buckets" of equal size delta = (x_max − x_min)/(n−1).
For each of the remaining (n−2) numbers determine in which bucket it falls using the floor function. The number x_i belongs to the k-th bucket B_k if, and only if, floor((x_i − x_min)/delta) = k − 1.
Let's say
a = [13, 4, 7, 2, 9, 17, 18]
Min: 2, Max: 18, n−1: 6.
So my number of buckets will be 6, and delta = (18−2)/6 = 2. That is, 6 buckets with 2 elements in each of them (12 elements total I can have).
Step 2 question:
If there are only 12 elements, where would my max 18 be?
Step 3:
For element 18:
as per the algorithm it should be in math.floor((17-2)/float(2)) = 7.
So 18 should be in the 8th bucket, BUT we have only (n−1) = 6 buckets.
A mystery to me!
EDIT1:
Sorry, wrong math in Step 3:
math.floor((17-2)/float(2)) = 5
I still need to figure out where the minimum and maximum go.
EDIT2:
As per the answer by Miljen Mikic:
He was right; my question is "what do we do with the maximum and minimum?"
And in step 6:
In L find the maximum distance between a pair of consecutive minimum and maximum (x_i-max, x_j-min), where j > i.
How come j > i? I.e., the max from one bucket paired with the min from a later bucket.
In the algorithm you cited, you don't put the minimum and maximum in the buckets. Pay attention to the note after Step 5:
Note: Since there are n-1 buckets and only n-2 numbers...
If you put the minimum and maximum in some buckets, then you would have had n numbers, not n-2. The real question now is: what to do with the minimum and maximum? Actually, step 6 of the algorithm should be clarified a little more. When you examine the list L, you should start with x_min and compare it with x_1-min, and you should end by comparing x_(n-1)-max and x_max, because the maximum gap can actually involve the minimum or the maximum, as it does e.g. in this example: [1,7,3,2]. Of course, these two additional comparisons still give you linear time complexity.
Note that you can change the algorithm slightly by putting the minimum and maximum in buckets as well (by the exact same formula!) and then you would have n numbers and n buckets. Why? The minimum always goes in the first bucket (see the formula), and the maximum needs to go in the n-th bucket, which didn't exist previously, so we have one extra bucket if we apply this change. This means that in this case you cannot always apply the pigeonhole principle; however, it still holds that the maximum distance between a pair of consecutive elements must be at least the length of the bucket. How come? If at least one bucket contains two elements, then there must be some empty bucket and this is clear. Otherwise, all buckets contain exactly one element; this means that the first bucket contains the minimum, and the second bucket contains an element whose value is at least x_min + δ, so the difference between this element and x_min is at least δ, the length of the bucket. Why does the element in the second bucket have to be at least x_min + δ? If it were smaller than that, e.g. if it were x_min + δ - k, where k > 0, then it would also belong to the first bucket, because [((x_min + δ - k) - x_min) / δ] = [(δ - k) / δ] = 0, i.e. not to the second as we assumed!
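Putting the whole bucket method together, a minimal Python sketch (my own illustration of the algorithm discussed above, with the min and max handled outside the buckets):

def maximum_gap(a):
    """n-1 buckets over [min, max]; min and max themselves stay outside."""
    n = len(a)
    lo, hi = min(a), max(a)
    if n < 2 or lo == hi:
        return 0
    delta = (hi - lo) / (n - 1)
    buckets = [[None, None] for _ in range(n - 1)]    # per-bucket [min, max]
    for x in a:
        if x == lo or x == hi:
            continue                                  # the n-2 remaining numbers
        k = min(int((x - lo) / delta), n - 2)         # floor((x - x_min)/delta), guard float rounding
        b = buckets[k]
        b[0] = x if b[0] is None else min(b[0], x)
        b[1] = x if b[1] is None else max(b[1], x)
    gap, prev = 0, lo                                 # start by comparing with x_min...
    for b in buckets:
        if b[0] is not None:
            gap = max(gap, b[0] - prev)
            prev = b[1]
    return max(gap, hi - prev)                        # ...and end with x_max

print(maximum_gap([13, 4, 7, 2, 9, 17, 18]))          # 4 (e.g. between 13 and 17)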

generate sequence with all permutations

How can I generate the shortest sequence which contains all possible permutations?
Example:
For length 2 the answer is 121, because this list contains 12 and 21, which are all possible permutations.
For length 3 the answer is 123121321, because this list contains all possible permutations:
123, 231, 312, 121 (invalid), 213, 132, 321.
Each number (within a given permutation) may only occur once.
This greedy algorithm produces fairly short minimal sequences.
UPDATE: Note that for n ≥ 6, this algorithm does not produce the shortest possible string!
Make a collection of all permutations.
Remove the first permutation from the collection.
Let a = the first permutation.
Find the sequence in the collection that has the greatest overlap with the end of a. If there is a tie, choose the sequence that is first in lexicographic order. Remove the chosen sequence from the collection and add the non-overlapping part to the end of a. Repeat this step until the collection is empty.
The curious tie-breaking step is necessary for correctness; breaking the tie at random instead seems to result in longer strings.
I verified (by writing a much longer, slower program) that the answer this algorithm gives for length 4, 123412314231243121342132413214321, is indeed the shortest answer. However, for length 6 it produces an answer of length 873, which is longer than the shortest known solution.
The algorithm is O((n!)^2).
An implementation in Python:
import itertools

def costToAdd(a, b):
    """Number of characters needed to append permutation b to the end of a."""
    for i in range(1, len(b)):
        if a.endswith(b[:-i]):
            return i
    return len(b)

def stringContainingAllPermutationsOf(s):
    perms = set(''.join(tpl) for tpl in itertools.permutations(s))
    perms.remove(s)
    a = s
    while perms:
        # min() on (cost, word) tuples breaks ties lexicographically, as required
        cost, next = min((costToAdd(a, x), x) for x in perms)
        perms.remove(next)
        a += next[-cost:]
    return a
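A quick way to reproduce the lengths quoted below:

print([len(stringContainingAllPermutationsOf("12345"[:n])) for n in range(1, 6)])
# [1, 3, 9, 33, 153]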
The length of the strings generated by this function are 1, 3, 9, 33, 153, 873, 5913, ... which appears to be this integer sequence.
I have a hunch you can do better than O((n!)^2).
Create all permutations.
Let each permutation represent a node in a graph.
Now, for any two states add an edge with a value 1 if they share n-1 digits (taken from the end of the source and from the start of the target), a value 2 if they share n-2 digits, and so on.
Now, you are left to find the shortest path visiting all the vertices; a sketch of the edge construction follows.
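A sketch of that graph construction (my illustration; note that actually finding the shortest path through all n! vertices is a travelling-salesman-style problem, which is why the greedy answer above only approximates it):

from itertools import permutations

def add_cost(a, b):
    """Edge weight: characters needed to append permutation b after a."""
    for k in range(len(b) - 1, 0, -1):        # try the largest overlap first
        if a.endswith(b[:k]):
            return len(b) - k
    return len(b)

nodes = [''.join(p) for p in permutations('123')]
edges = {(a, b): add_cost(a, b) for a in nodes for b in nodes if a != b}
print(edges[('123', '231')])  # 1 -- they share n-1 = 2 digits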
Here is a fast algorithm that produces a short string containing all permutations. I am pretty sure it produces the shortest possible answer, but I don't have a complete proof in hand.
Explanation. Below is a tree of All Permutations. The picture is incomplete; imagine that the tree goes on forever to the right.
1 --+-- 12 --+-- 123 ...
| |
| +-- 231 ...
| |
| +-- 312 ...
|
+-- 21 --+-- 213 ...
|
+-- 132 ...
|
+-- 321 ...
The nodes at level k of this tree are all the permutations of length
k. Furthermore, the permutations are in a particular order with a lot
of overlap between each permutation and its neighbors above and below.
To be precise, each node's first child is found by simply adding the next
symbol to the end. For example, the first child of 213 would be 2134. The rest
of the children are found by rotating to the first child to left one symbol at
a time. Rotating 2134 would produce 1342, 3421, 4213.
Taking all the nodes at a given level and stringing them together, overlapping
as much as possible, produces the strings 1, 121, 123121321, etc.
The length of the nth string in that sequence is the sum for x=1 to n of x!. (You can prove this by observing how much non-overlap there is between neighboring permutations. Siblings overlap in all but 1 symbol; first-cousins overlap in all but 2 symbols; and so on.)
Sketch of proof. I haven't completely proved that this is the best solution, but here's a sketch of how the proof would proceed. First show that any string containing n distinct permutations has length ≥ 2n - 1. Then show that any string containing n+1 distinct permutations has length ≥ 2n + 1; that is, adding one more permutation will cost you two digits. Proceed by calculating the minimum length of strings containing nPr and nPr + 1 distinct permutations, up to n!. In short, this sequence is optimal because you can't make it worse somewhere in the hope of making it better someplace else. It's already locally optimal everywhere. All the moves are forced.
Algorithm. Given all this background, the algorithm is very simple. Walk this tree to the desired depth and string together all the nodes at that depth.
Fortunately we do not actually have to build the tree in memory.
def build(node, s):
    """String together all descendants of the given node at the target depth."""
    d = len(node)  # depth of this node. depth of "213" is 3.
    n = len(s)     # target depth
    if d == n - 1:
        return node + s[n - 1] + node  # children of 213 join to make "2134213"
    else:
        c0 = node + s[d]  # first child node
        children = [c0[i:] + c0[:i] for i in range(d + 1)]  # all child nodes
        strings = [build(c, s) for c in children]  # recurse to the desired depth
        for j in range(1, d + 1):
            strings[j] = strings[j][d:]  # cut off overlap with previous sibling
        return ''.join(strings)  # join what's left

def stringContainingAllPermutationsOf(s):
    return build(s[:1], s)
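For example, checking the first couple of cases:

print(stringContainingAllPermutationsOf("123"))        # 123121321
print(len(stringContainingAllPermutationsOf("1234")))  # 33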
Performance. The above code is already much faster than my other solution, and it does a lot of cutting and pasting of large strings that you can optimize away. The algorithm can be made to run in time and memory proportional to the size of the output.
For n = 3 the length of the chain is 8:
12312132
It seems to me we are working with a cyclic system; it's a ring, in other words. But we are working with the ring as if it were a chain. The chain is really 123121321, of length 9,
but the ring is 12312132, of length 8.
We take the last 1 for 321 from the beginning of the sequence: 12312132[1].
These are called (minimal-length) superpermutations (cf. Wikipedia).
Interest in this re-sparked when an anonymous user posted a new lower bound on 4chan. (See Wikipedia and many other web pages for the history.)
AFAIK, as of today we just know:
Their length is A180632(n) ≤ A007489(n) = Sum_{k=1..n} k! but this bound is only sharp for n ≤ 5, i.e., we have equality for n ≤ 5 but strictly less for n > 5.
There's a very simple recursive algorithm, given below, producing a superpermutation of length A007489(n), which is always palindromic (but as said above this is not the minimal length for n > 5).
For n ≥ 7 we have the better upper bound n! + (n−1)! + (n−2)! + (n−3)! + n − 3.
For n ≤ 5 all minimal SP's are known; and for all n > 5 we don't know which is the minimal SP.
For n = 1, 2, 3, 4 the minimal SP's are unique (up to changing the symbols), given by (1, 121, 123121321, 123412314231243121342132413214321) of length A007489(1..4) = (1, 3, 9, 33).
For n = 5 there are 8 inequivalent ones of minimal length 153 = A007489(5); the palindromic one produced by the algorithm below is the 3rd in lexicographic order.
For n = 6, Houston produced thousands of superpermutations of the smallest known length, 872 = A007489(6) - 1, but AFAIK we still don't know whether this is minimal.
For n = 7 Egan produced one of length 5906 (one less than the better upper bound given above) but again we don't know whether that's minimal.
I've written a very short PARI/GP program (you can paste to run it on the PARI/GP web site) which implements the standard algorithm producing a palindromic superpermutation of length A007489(n):
extend(S,n=vecmax(s))={ my(t); concat([
if(#Set(s)<n, [], /* discard if not a permutation */
s=concat([s, n+1, s]); /* Now merge with preceding segment: */
forstep(i=min(#s, #t)-1, 0, -1,
if(s[1..1+i]==t[#t-i..#t], s=s[2+i..-1]; break));
t=s /* store as previous for next */
)/*endif*/
| s <- [ S[i+1..i+n] | i <- [0..#S-n] ]])
}
SSP=vector(6, n, s=if(n>1, extend(s), [1])); // gives the first 6, the 6th being non-minimal
I think that easily translates to any other language. (For non-PARI speaking persons: "| x <-" means "for x in".)
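For non-PARI readers, here is a rough Python transcription of the same recursive construction (my own sketch; the variable names are mine):

def extend(sp, n):
    """Turn a superpermutation of 1..n (a list of ints) into one of 1..n+1."""
    out, t = [], []
    for i in range(len(sp) - n + 1):
        w = sp[i:i + n]
        if len(set(w)) < n:
            continue                  # discard windows that are not permutations
        s = w + [n + 1] + w           # wrap the new symbol with the permutation
        # merge with the preceding segment: drop the longest prefix of s
        # matching the tail of the previous segment
        for j in range(min(len(s), len(t)) - 1, -1, -1):
            if s[:j + 1] == t[-(j + 1):]:
                s = s[j + 1:]
                break
        t = s
        out += s
    return out

sp = [1]
for n in range(1, 3):
    sp = extend(sp, n)
print(''.join(map(str, sp)))  # 123121321, the palindromic superpermutation for n = 3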

Number base conversion as a stream operation

Is there a way, in constant working space, to do arbitrary-size and arbitrary-base conversions? That is, to convert a sequence of n numbers in the range [1,m] to a sequence of ceiling(n*log(m)/log(p)) numbers in the range [1,p] using a 1-to-1 mapping that (preferably but not necessarily) preserves lexicographical order and gives sequential results?
I'm particularly interested in solutions that are viable as a pipe function, i.e. able to handle larger datasets than can be stored in RAM.
I have found a number of solutions that require "working space" proportional to the size of the input but none yet that can get away with constant "working space".
Does dropping the sequential constraint make any difference? That is: allow lexicographically sequential inputs to result in non lexicographically sequential outputs:
F(1,2,6,4,3,7,8) -> (5,6,3,2,1,3,5,2,4,3)
F(1,2,6,4,3,7,9) -> (5,6,3,2,1,3,5,2,4,5)
some thoughts:
might this work?
streamBasen -> convert(n, lcm(n,p)) -> convert(lcm(n,p), p) -> streamBasep
(where lcm is least common multiple)
I don't think it's possible in the general case. If m is a power of p (or vice versa), or if they're both powers of a common base, you can do it, since each group of log_m(p) digits is then independent. However, in the general case, suppose you're converting the number a1 a2 a3 ... an. The equivalent number in base p is
sum(a_i * m^(i-1) for i in 1..n)
If we've processed the first i digits, then we have the i-th partial sum. To compute the (i+1)-th partial sum, we need to add a_(i+1) * m^i. In the general case, this number will have non-zero digits in most places, so we'll need to modify all of the digits we've processed so far. In other words, we'll have to process all of the input digits before we know what the final output digits will be.
In the special case where m and p are both powers of a common base, or equivalently if log_m(p) is a rational number, m^i will only have a few non-zero digits in base p near the front, so we can safely output most of the digits we've computed so far.
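To make the general-case obstruction concrete (my example): converting decimal to binary, the last input digit of "102?" can flip every output bit, so almost nothing can be emitted early.

# reading decimal digits most-significant first: 1023 vs 1024
print(format(1023, '011b'))  # 01111111111
print(format(1024, '011b'))  # 10000000000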
I think there is a way of doing radix conversion in a stream-oriented fashion in lexicographic order. However, what I've come up with isn't sufficient for actually doing it, and it has a couple of assumptions:
The lengths of the positional numbers are already known.
The numbers described are integers. I've not considered what happens with the maths and negative indices.
We have a sequence of values a of length p, where each value is in the range [0,m-1]. We want a sequence of values b of length q in the range [0,n-1]. We can work out the k-th digit of our output sequence b from a as follows:
b_k = floor[ sum(a_i * m^i for i in 0..p-1) / n^k ] mod n
Let's rearrange that sum into two parts, splitting it at an arbitrary point z:
b_k = floor[ ( sum(a_i * m^i for i in z..p-1) + sum(a_i * m^i for i in 0..z-1) ) / n^k ] mod n
Suppose that we don't yet know the values of a between [0,z-1] and can't compute the second sum term. We're left with having to deal with ranges. But that still gives us information about b_k.
The minimum value b_k can be is:
b_k >= floor[ sum(a_i * m^i for i in z..p-1) / n^k ] mod n
and the maximum value b_k can be is:
b_k <= floor[ ( sum(a_i * m^i for i in z..p-1) + m^z - 1 ) / n^k ] mod n
We should be able to do a process like this:
Initialise z to be p. We will count down from p as we receive each character of a.
Initialise k to the index of the most significant value in b. If my brain is still working, ceil[ log_n(m^p) ].
Read a value of a. Decrement z.
Compute the min and max value for b_k.
If the min and max are the same, output b_k, and decrement k. Goto 4. (It may be possible that we already have enough values for several consecutive values of b_k.)
If z!=0 then we expect more values of a. Goto 3.
Hopefully, at this point we're done.
I've not considered how to efficiently compute the range values yet, but I'm reasonably confident that computing the sums from the incoming characters of a can be done much more cheaply than storing all of a. Without doing the maths, though, I won't make any hard claims!
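A rough Python sketch of this range-narrowing process (my own illustration; it uses Python big ints for the partial sum, so it is not truly constant-space, and it compares the two floors for equality before taking mod n, which is the safe variant):

import math

def stream_convert(digits, m, n, p):
    """Emit base-n digits of a base-m digit stream (most significant first)."""
    q = math.ceil(p * math.log(m) / math.log(n))  # output length
    H = 0        # value of the digits received so far, shifted into place
    z = p        # number of input digits still unknown
    k = q - 1    # next (most significant) output digit to decide
    out = []
    for a in digits:
        z -= 1
        H += a * m**z
        # the unknown suffix lies in [0, m**z - 1]; emit b_k once its
        # value is the same at both ends of that range
        while k >= 0 and H // n**k == (H + m**z - 1) // n**k:
            out.append(H // n**k % n)
            k -= 1
    return out

# 1,0,2,3 in base 10 -> base 2: the leading zeros come out early,
# but every remaining bit must wait for the final digit
print(stream_convert([1, 0, 2, 3], 10, 2, 4))
# [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]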
Yes, it is possible.
For every I character(s) you read in, you will write out O character(s),
based on Ceiling(Length * log(In) / log(Out)).
Allocate enough space
Set x to 1
Loop over digits from end to beginning  # Horner's method
    Set a to x * digit
    Set t to O - 1
    Loop while a > 0 and t >= 0
        Set a to a + out digit at position t
        Set out digit at position t to a mod to-base
        Set a to a / to-base
        Set t to t - 1
    Set x to x * from-base
Return converted digit(s)
Thus, for base 16 to 2 (which is easy), using "192FE" we read '1' and convert it, then repeat on '9', then '2' and so on giving us '0001', '1001', '0010', '1111', and '1110'.
Note that for bases that are not common powers, such as base 17 to base 2, this would mean reading 1 character and writing 5.
