generate sequence with all permutations - algorithm

How can I generate the shortest sequence that contains all possible permutations?
Example:
For length 2 the answer is 121, because this sequence contains 12 and 21, which are all possible permutations.
For length 3 the answer is 123121321, because this sequence contains all possible permutations:
123, 231, 312, 121 (invalid), 213, 132, 321.
Each number (within a given permutation) may only occur once.

This greedy algorithm produces fairly short sequences.
UPDATE: Note that for n ≥ 6, this algorithm does not produce the shortest possible string!
Make a collection of all permutations.
Remove the first permutation from the collection.
Let a = the first permutation.
Find the sequence in the collection that has the greatest overlap with the end of a. If there is a tie, choose the sequence that is first in lexicographic order. Remove the chosen sequence from the collection and add the non-overlapping part to the end of a. Repeat this step until the collection is empty.
The curious tie-breaking step is necessary for correctness; breaking the tie at random instead seems to result in longer strings.
I verified (by writing a much longer, slower program) that the answer this algorithm gives for length 4, 123412314231243121342132413214321, is indeed the shortest answer. However, for length 6 it produces an answer of length 873, which is longer than the shortest known solution.
The algorithm is O((n!)²).
An implementation in Python:
import itertools

def costToAdd(a, b):
    for i in range(1, len(b)):
        if a.endswith(b[:-i]):
            return i
    return len(b)

def stringContainingAllPermutationsOf(s):
    perms = set(''.join(tpl) for tpl in itertools.permutations(s))
    perms.remove(s)
    a = s
    while perms:
        cost, next = min((costToAdd(a, x), x) for x in perms)
        perms.remove(next)
        a += next[-cost:]
    return a
The lengths of the strings generated by this function are 1, 3, 9, 33, 153, 873, 5913, ..., which appears to be OEIS A007489.
I have a hunch you can do better than O((n!)²).

Create all permutations. Let each permutation represent a node in a graph.
Now, for any two nodes, add an edge with weight 1 if they share n-1 digits (taken from the end of the source and the beginning of the target), weight 2 if they share n-2 digits, and so on.
Now, you are left to find the shortest path visiting all n! vertices.

Here is a fast algorithm that produces a short string containing all permutations. I am pretty sure it produces the shortest possible answer, but I don't have a complete proof in hand.
Explanation. Below is a tree of All Permutations. The picture is incomplete; imagine that the tree goes on forever to the right.
1 --+-- 12 --+-- 123 ...
    |        |
    |        +-- 231 ...
    |        |
    |        +-- 312 ...
    |
    +-- 21 --+-- 213 ...
             |
             +-- 132 ...
             |
             +-- 321 ...
The nodes at level k of this tree are all the permutations of length
k. Furthermore, the permutations are in a particular order with a lot
of overlap between each permutation and its neighbors above and below.
To be precise, each node's first child is found by simply adding the next
symbol to the end. For example, the first child of 213 would be 2134. The rest
of the children are found by rotating the first child to the left one symbol at
a time. Rotating 2134 would produce 1342, 3421, 4213.
Taking all the nodes at a given level and stringing them together, overlapping
as much as possible, produces the strings 1, 121, 123121321, etc.
The length of the nth string in that sequence is the sum for x=1 to n of x!. (You can prove this by observing how much non-overlap there is between neighboring permutations. Siblings overlap in all but 1 symbol; first-cousins overlap in all but 2 symbols; and so on.)
Sketch of proof. I haven't completely proved that this is the best solution, but here's a sketch of how the proof would proceed. First show that any string containing n distinct permutations has length ≥ 2n - 1. Then show that any string containing n+1 distinct permutations has length ≥ 2n + 1. That is, adding one more permutation will cost you two digits. Proceed by calculating the minimum length of strings containing nPr and nPr + 1 distinct permutations, up to n!. In short, this sequence is optimal because you can't make it worse somewhere in the hope of making it better someplace else. It's already locally optimal everywhere. All the moves are forced.
Algorithm. Given all this background, the algorithm is very simple. Walk this tree to the desired depth and string together all the nodes at that depth.
Fortunately we do not actually have to build the tree in memory.
def build(node, s):
    """String together all descendants of the given node at the target depth."""
    d = len(node)  # depth of this node. depth of "213" is 3.
    n = len(s)     # target depth
    if d == n - 1:
        return node + s[n - 1] + node  # children of 213 join to make "2134213"
    else:
        c0 = node + s[d]  # first child node
        children = [c0[i:] + c0[:i] for i in range(d + 1)]  # all child nodes
        strings = [build(c, s) for c in children]  # recurse to the desired depth
        for j in range(1, d + 1):
            strings[j] = strings[j][d:]  # cut off overlap with previous sibling
        return ''.join(strings)  # join what's left

def stringContainingAllPermutationsOf(s):
    return build(s[:1], s)
Performance. The above code is already much faster than my other solution, and it does a lot of cutting and pasting of large strings that you can optimize away. The algorithm can be made to run in time and memory proportional to the size of the output.

For n = 3 the chain length is 8:
12312132
It seems to me we are working with a cyclic system; it's a ring, in other words. But we are working with the ring as if it were a chain. The chain really is 123121321 = 9,
but the ring is 12312132 = 8.
We take the last 1 for 321 from the beginning of the sequence: 12312132[1].

These are called (minimal length) superpermutations (cf. Wikipedia).
Interest in this re-sparked when an anonymous user posted a new lower bound on 4chan. (See Wikipedia and many other web pages for the history.)
AFAIK, as of today we just know:
Their length is A180632(n) ≤ A007489(n) = Sum_{k=1..n} k! but this bound is only sharp for n ≤ 5, i.e., we have equality for n ≤ 5 but strictly less for n > 5.
There's a very simple recursive algorithm, given below, producing a superpermutation of length A007489(n), which is always palindromic (but as said above this is not the minimal length for n > 5).
For n ≥ 7 we have the better upper bound n! + (n−1)! + (n−2)! + (n−3)! + n − 3.
For n ≤ 5 all minimal SP's are known; and for all n > 5 we don't know which is the minimal SP.
For n = 1, 2, 3, 4 the minimal SP's are unique (up to changing the symbols), given by (1, 121, 123121321, 123412314231243121342132413214321) of length A007489(1..4) = (1, 3, 9, 33).
For n = 5 there are 8 inequivalent ones of minimal length 153 = A007489(5); the palindromic one produced by the algorithm below is the 3rd in lexicographic order.
For n = 6 Houston produced thousands of the smallest known length 872 = A007489(6) - 1, but AFAIK we still don't know whether this is minimal.
For n = 7 Egan produced one of length 5906 (two less than the better upper bound given above) but again we don't know whether that's minimal.
I've written a very short PARI/GP program (you can paste to run it on the PARI/GP web site) which implements the standard algorithm producing a palindromic superpermutation of length A007489(n):
extend(S, n=vecmax(S))={ my(t); concat([
  if(#Set(s)<n, [],        /* discard if not a permutation */
    s=concat([s, n+1, s]); /* Now merge with preceding segment: */
    forstep(i=min(#s, #t)-1, 0, -1,
      if(s[1..1+i]==t[#t-i..#t], s=s[2+i..-1]; break));
    t=s                    /* store as previous for next */
  )/*endif*/
  | s <- [ S[i+1..i+n] | i <- [0..#S-n] ]])
}
SSP=vector(6, n, s=if(n>1, extend(s), [1])); \\ gives the first 6, the 6th being non-minimal
I think that easily translates to any other language. (For non-PARI speaking persons: "| x <-" means "for x in".)


Substring search with max 1's in a binary sequence

Problem
The task is to find the substring of the given binary string with the highest score. The substring must be at least the given minimum length.
score = number of 1s / substring length, so the score ranges from 0 to 1.
Inputs:
1. min length of substring
2. binary sequence
Outputs:
1. index of first char of substring
2. index of last char of substring
Example 1:
input
-----
5
01010101111100
output
------
7
11
explanation
-----------
1. start with minimum window = 5
2. start_ind = 0, end_index = 4, score = 2/5 (0.4)
3. start_ind = 1, end_index = 5, score = 3/5 (0.6)
4. and so on...
5. start_ind = 7, end_index = 11, score = 5/5 (1) [max possible]
Example 2:
input
-----
5
10110011100
output
------
2
8
explanation
-----------
1. while calculating all scores for windows 5 to len(sequence)
2. max score occurs in the case: start_ind=2, end_ind=8, score=5/7 (0.7143) [max possible]
Example 3:
input
-----
4
00110011100
output
------
5
8
What I attempted
The only technique I could come up with was brute force, with nested for loops:
for window_size in (min to max)
    for ind 0 to end
        calculate score
        save max score
Can someone suggest a better algorithm to solve this problem?
There are a few observations to make before we start talking about an algorithm; some of these observations have already been pointed out in the comments.
Maths
Take the minimum length to be M, the length of the entire string to be L, and a substring from the ith char to the jth char (inclusive-exclusive) to be S[i:j].
All optimal substrings will satisfy at least one of two conditions:
It is exactly M characters in length
It starts and ends with a 1 character
The reason for the latter is that if it were longer than M characters and started/ended with a 0, we could just drop that 0, resulting in a higher ratio.
In the same spirit (again, for the 2nd case), there exists an optimal substring which is not preceded by a 1. Otherwise, if it were, we could include that 1, resulting in an equal or higher ratio. The same logic applies to the end of S and a following 1.
Building on the above- such a substring being preceded or followed by another 1 will NOT be optimal, unless the substring contains no 0s. In the case where it doesn't contain 0s, there will exist an optimal substring of length M as well anyways.
Again, that all only applies to the length greater than M case substrings.
Finally, there exists an optimal substring that has length at least M (by definition), and at most 2 * M - 1. If an optimal substring had length K >= 2 * M, we could split it into two substrings of length floor(K/2) and ceil(K/2) - S[i:i+floor(K/2)] and S[i+floor(K/2):i+K] - each still at least M characters long. If the substring has the score (ratio) R, and its halves R0 and R1, we would have one of two scenarios:
R = R0 = R1, meaning we could pick either half and get the same score as the combined substring, giving us a shorter substring.
If this substring has length less than 2 * M, we are done- we have an optimal substring of length [M, 2*M).
Otherwise, recurse on the new substring.
R0 != R1, so (without loss of generality) R0 < R < R1, meaning the combined substring would not be optimal in the first place.
Note that I say "there exists an optimal" as opposed to "the optimal". This is because there may be multiple optimal solutions, and the observations above may refer to different instances.
Algorithm
You could search every window size [M, 2*M) at every offset, which would already be better than a full search for small M. You can also try a two-phase approach:
search every M sized window, find the max score
search from the beginning of every run of 1s forward through a special list of ends of runs of 1s, implicitly skipping over 0s and irrelevant 1s, breaking when out of the [M, 2 * M) bound.
For random data, I only expect this to save a small factor- skipping 15/16 of the windows (ignoring the added overhead). For less-random data, you could potentially see huge benefits, particularly if there's LOTS of LARGE runs of 1s and 0s.
The biggest speedup you'll be able to do (besides limiting the window max to 2 * M) is computing a cumulative sum of the bit array. This lets you query "how many 1s were seen up to this point". You can then take the difference of two elements in this array to query "how many 1s occurred between these offsets" in constant time. This allows for very quick calculation of the score.
You can use the two-pointer method, starting from the left-most and right-most ends, then adjusting the pointers while searching for the highest score.
We can add a cache (memoization) to optimize time.
Example: (Python)
binary="01010101111100"
length=5

def get_score(binary,left,right):
    ones=0
    for i in range(left,right+1):
        if binary[i]=="1":
            ones+=1
    score= ones/(right-left+1)
    return score

cache={}
def get_sub(binary,length,left,right):
    if (left,right) in cache:
        return cache[(left,right)]
    table=[0,set()]
    if right-left+1<length:
        pass
    else:
        scores=[[get_score(binary,left,right),set([(left,right)])],
                get_sub(binary,length,left+1,right),
                get_sub(binary,length,left,right-1),
                get_sub(binary,length,left+1,right-1)]
        for s in scores:
            if s[0]>table[0]:
                table[0]=s[0]
                table[1]=s[1]
            elif s[0]==table[0]:
                for e in s[1]:
                    table[1].add(e)
    cache[(left,right)]=table
    return table

result=get_sub(binary,length,0,len(binary)-1)
print("Score: %f"%result[0])
print("Index: %s"%result[1])
Output
Score: 1
Index: {(7, 11)}

Generate first N combinations of length L from weighted set in order of weight

I have a set of letters with weights, which gives their probability of appearing in a string:
a - 0.7
b - 0.1
c - 0.3
...
z - 0.01
As such, the word aaaa has a probability of 0.7*0.7*0.7*0.7 = 0.24. The word aaac would have probability 0.7*0.7*0.7*0.3 = 0.10. All permutations of the same word have the same probability, so we only need to worry about combinations.
I would like to generate the first unique N strings of length L in order of probability (for example here, with 4 letters and length 4, that should be aaaa, aaac, aacc, aaab, accc, aabc, cccc, etc).
Assume that the brute-force approach of generating all combinations with their probabilities and sorting by weight is not possible here. The algorithm, should it exist, must work for any set size and any string length (e.g. all 256 bytes with weighted probabilities, strings of length 1024, generate the first trillion).
Below is some enumeration code that uses a heap. The implementation principle is slightly different from what user3386109 proposes in their comment.
Order the symbols by decreasing probability. There’s a constructive one-to-one correspondence between length-L combinations of S symbols, and binary strings of length S + L − 1 with S − 1 zeros (counting out each symbol in unary, with S − 1 delimiters). We can do bit-at-a-time enumeration of the possibilities for the latter.
The part that saves us from having to enumerate every single combination is that, for each binary prefix, the most probable word can be found by repeating the most probable letter still available. By storing the prefixes in a heap, we can open only the ones that appear in the top N.
Note that this uses memory proportional to the number of enumerated combinations. This may still be too much, in which case you probably want something like iterative deepening depth-first search.
symbol_probability_dict = {"a": 0.7, "b": 0.1, "c": 0.3, "z": 0.01}
L = 4

import heapq
import math

loss_symbol_pairs = [(-math.log(p), c) for (c, p) in symbol_probability_dict.items()]
loss_symbol_pairs.sort()

heap = [(0, 0, "")]
while heap:
    min_loss, i, s = heapq.heappop(heap)
    if len(s) < L:
        heapq.heappush(heap, (min_loss, i, s + loss_symbol_pairs[i][1]))
        if i + 1 < len(loss_symbol_pairs):
            heapq.heappush(
                heap,
                (
                    min_loss
                    + (L - len(s))
                    * (loss_symbol_pairs[i + 1][0] - loss_symbol_pairs[i][0]),
                    i + 1,
                    s,
                ),
            )
    else:
        print(s)
Output:
aaaa
aaac
aacc
aaab
accc
aacb
cccc
accb
aabb
aaaz
cccb
acbb
aacz
ccbb
abbb
accz
aabz
cbbb
cccz
acbz
bbbb
ccbz
abbz
aazz
cbbz
aczz
bbbz
cczz
abzz
cbzz
bbzz
azzz
czzz
bzzz
zzzz
This answer provides an implementation of user3386109's comment:
I think the solution is a priority queue. The initial input to the queue is a string with the highest probability letter (aaaa). After reading a string from the queue, replace a letter with the next lower probability letter, and add that new string to the queue. So after reading aaaa, write aaac. After reading aaac, write aacc (changing an a to c) and aaab (changing a c to a b).
I wrote a generator function, following these exact steps:
Define a helper function: prio(word), which returns the priority of a word (as a negative number, because Python heaps are min-heaps);
Define a helper dict: next_letter.get(letter) is the next lower-probability letter after letter, if any;
Initialise a heapq queue with first word aaaa, and its corresponding priority;
When popping words from the queue, avoid possible duplicates by comparing the current word with the previous word;
After popping a word from the queue, if it is not a duplicate, then yield this word, and push the words obtained by replacing a letter by the next probability letter;
Since it is a generator function, it is lazy: you can get the first N words without computing all words. However, some extra words will still be computed, since the whole idea of the priority queue is that we don't know the exact order in advance.
import heapq
from math import prod
from itertools import pairwise

def gen_combs(weights, r=4):
    next_letter = dict(pairwise(sorted(weights, key=weights.get, reverse=True)))
    def prio(word, weights=weights):
        return -prod(map(weights.get, word))
    first_word = max(weights, key=weights.get) * r
    queue = [(prio(first_word), first_word)]
    prev_word = None
    while queue:
        w, word = heapq.heappop(queue)
        if word != prev_word:
            yield word
            prev_word = word
            seen_letters = set()
            for i, letter in enumerate(word):
                if letter not in seen_letters:
                    new_letter = next_letter.get(letter)
                    if new_letter:
                        new_word = ''.join((word[:i], new_letter, word[i+1:]))
                        heapq.heappush(queue, (prio(new_word), new_word))
                    seen_letters.add(letter)
# TESTING
weights = {'a': 0.7, 'b': 0.1, 'c': 0.3, 'z': 0.01}

# print all words
print(list(gen_combs(weights)))
# ['aaaa', 'caaa', 'ccaa', 'baaa', 'ccca', 'bcaa', 'cccc',
#  'bcca', 'bbaa', 'zaaa', 'bccc', 'bbca', 'zcaa', 'bbcc',
#  'bbba', 'zcca', 'zbaa', 'bbbc', 'zccc', 'zbca', 'bbbb',
#  'zbcc', 'zbba', 'zzaa', 'zbbc', 'zzca', 'zbbb', 'zzcc',
#  'zzba', 'zzbc', 'zzbb', 'zzza', 'zzzc', 'zzzb', 'zzzz']

n = 10

# print first n words, one per line
for _, word in zip(range(n), gen_combs(weights)):
    print(word)

# print first n words
from itertools import islice
print(list(islice(gen_combs(weights), n)))
# ['aaaa', 'caaa', 'ccaa', 'baaa', 'ccca', 'bcaa', 'cccc', 'bcca', 'bbaa', 'zaaa']
First, pardon me for pointing out that your probabilities do not sum to 1, which instantly makes me feel something is wrong.
I would like to share my version (credit to Stef, David, and user3386109), which is easier to prove to be O(N·L) in space and O(N·max(L, log N)) in time. Because you need O(N·L) just to store/print the result, plus a log N factor for the priority queue, it is hard to improve further; I don't think the other answers here have proved bounds this tight. P.S. Making it a generator won't help, since the space is mostly consumed by the heap itself. P.S.2: for large L, the size of N can be astronomical.
Complexity analysis: the heap starts with 1 element, and each pop, which yields one of the top N, pushes up to 2 elements back in. Therefore the total space is O(N · 2 · item size) = O(N·L). The time cost is N·max(L, log N), because each push/pop costs log N, while many steps in the inner loop cost O(L).
You can see it print the same 35 results as Stef's. Now the core part of this algorithm is how to choose the 2 elements to push after each pop. First consider the opposite question: how to find the parent of each child, so that (1) all possible combos form a partial order (the maximal element is 'aaaa', and no, we don't need Zorn's lemma here), (2) we won't miss anything larger than the child but smaller than the parent (not an entirely correct statement when 2 letters share the same probability, but in practice I found it does not matter), and (3) there are no repeats, so we don't have to cache the visited strings. My way of choosing the parent is to always lower the letters at the front of the string until it reaches 'a'. E.g., the parent of 'cbbz' is 'abbz', not 'ccbz'. Conversely, we find the children of a given parent, and it is nice to see we have up to 2 children in this case: raise the first non-'a' letter until it equals the next letter ('acbb' => 'abbb', while 'accb' has no child in this direction), or raise the last 'a' ('acbb' => 'ccbb', 'accb' => 'cccb').
import heapq, math

letters = [[-math.log(x[1]), x[0]] for x in zip('abcz', [0.7, 0.1, 0.3, 0.01])]
N = 35  # something big to force a full print in testing. Here 35 is calculated by DP or Pascal's triangle
L = 4
letters.sort()
letters = [[i, x[0], x[1]] for i, x in enumerate(letters)]
heap = [[letters[0][1]*L, [0]*L, L-1]]
while heap and N > 0:
    N -= 1
    p, nextLargest, childPtr = heapq.heappop(heap)
    print(''.join([letters[x][2] for x in nextLargest]))
    if childPtr < L-1 and nextLargest[childPtr+1] < len(letters)-1:
        child = list(nextLargest)
        child[childPtr+1] += 1
        if childPtr+2 == L or child[childPtr+1] <= child[childPtr+2]:
            prob = sum([letters[x][1] for x in child])  # this can be improved by using a delta, but won't change overall complexity
            heapq.heappush(heap, [prob, child, childPtr])
    if childPtr >= 0:
        child = list(nextLargest)
        child[childPtr] += 1
        prob = sum([letters[x][1] for x in child])
        heapq.heappush(heap, [prob, child, childPtr-1])

Find North-East path with most points [duplicate]

In Cracking the Coding Interview, Fourth Edition, there is such a problem:
A circus is designing a tower routine consisting of people standing atop one another's shoulders. For practical and aesthetic reasons, each person must be both shorter and lighter than the person below him or her. Given the heights and weights of each person in the circus, write a method to compute the largest possible number of people in such a tower.
EXAMPLE: Input (ht, wt): (65, 100) (70, 150) (56, 90) (75, 190) (60, 95) (68, 110)
Output: The longest tower is length 6 and includes from top to bottom: (56, 90) (60, 95) (65, 100) (68, 110) (70, 150) (75, 190)
Here is its solution in the book
Step 1: Sort all items by height first, and then by weight. This means that if all the heights are unique, the items will be sorted by their height. If heights are the same, items will be sorted by their weight.
Step 2: Find the longest sequence which contains increasing heights and increasing weights.
To do this, we:
a) Start at the beginning of the sequence. Currently, max_sequence is empty.
b) If, for the next item, the height and the weight are not both greater than those of the previous item, we mark this item as "unfit".
c) If the sequence found has more items than "max sequence", it becomes "max sequence".
d) After that the search is repeated from the "unfit item", until we reach the end of the original sequence.
I have some questions about its solutions.
Q1
I believe this solution is wrong.
For example
(3,2) (5,9) (6,7) (7,8)
Obviously, (6,7) is an unfit item, but how about (7,8)? According to the solution, it is NOT unfit, as its height and weight are both bigger than those of (6,7); however, it cannot be included in the sequence, because (7,8) does not fit after (5,9).
Am I right?
If I am right, what is the fix?
Q2
I believe that even if there is a fix for the above solution, this style of solution will take at least O(n^2), because it needs to iterate again and again, according to step 2-d.
So is it possible to have a O(nlogn) solution?
You can solve the problem with dynamic programming.
Sort the troupe by height. For simplicity, assume all the heights h_i and weights w_j are distinct. Thus h_i is an increasing sequence.
We compute a sequence T_i, where T_i is a tower with person i at the top of maximal size. T_1 is simply {1}. We can deduce subsequent T_k from the earlier T_j — find the largest tower T_j that can take k's weight (w_j < w_k) and stand k on it.
The largest possible tower from the troupe is then the largest of the T_i.
This algorithm takes O(n**2) time, where n is the cardinality of the troupe.
I tried solving this myself; I did not mean to give a 'ready made solution', but I'm posting it anyway, more to check my own understanding and whether my code (Python) is OK and would work for all test cases. I tried 3 cases and it seemed to give the correct answer.
#!/usr/bin/python
# This function takes a list of tuples. Tuple(n): (height, weight) of nth person
def htower_len(ht_wt):
    ht_sorted = sorted(ht_wt, reverse=True)
    wt_sorted = sorted(ht_wt, key=lambda ht_wt: ht_wt[1])
    max_len = 1
    len1 = len(ht_sorted)
    i = 0
    j = 0
    while i < (len1-1):
        if(ht_sorted[i+1][1] < ht_sorted[0][1]):
            max_len = max_len+1
        i = i+1
    print "maximum tower length :", max_len
###Called above function with below sample app code.
testcase =1
print "Result of Test case ",testcase
htower_len([(5,75),(6.7,83),(4,78),(5.2,90)])
testcase = testcase + 1
print "Result of Test case ",testcase
htower_len([(65, 100),(70, 150),(56, 90),(75, 190),(60, 95),(68, 110)])
testcase = testcase + 1
print "Result of Test case ",testcase
htower_len([(3,2),(5,9),(6,7),(7,8)])
For example
(3,2) (5,9) (6,7) (7,8)
Obviously, (6,7) is an unfit item, but how about (7,8)?
In answer to your Question - the algorithm first runs starting with 3,2 and gets the sequence (3,2) (5,9) marking (6,7) and (7,8) as unfit.
It then starts again on (6,7) (the first unfit) and gets (6,7) (7,8), and that makes the answer 2. Since there are no more "unfit" items, the sequence terminates with maximum length 2.
After first sorting the array by height and weight, my code checks what the largest tower would be if we grabbed any of the remaining tuples in the array (and possible subsequent tuples). In order to avoid re-computing sub-problems, solution_a is used to store the optimal max length from the tail of the input_array.
The beginning_index is the index from which we can consider grabbing elements from (the index from which we can consider people who could go below on the human stack), and beginning_tuple refers to the element/person higher up on the stack.
This solution runs in O(n log n) to do the sort. The space used is O(n) for the solution_a array and the copy of the input_array.
def determine_largest_tower(beginning_index, a, beginning_tuple, solution_a):
    # base case
    if beginning_index >= len(a):
        return 0
    if solution_a[beginning_index] != -1:  # already computed
        return solution_a[beginning_index]
    # recursive case
    max_len = 0
    for i in range(beginning_index, len(a)):
        # if we can grab that value, check what the max would be
        if a[i][0] >= beginning_tuple[0] and a[i][1] >= beginning_tuple[1]:
            max_len = max(1 + determine_largest_tower(i+1, a, a[i], solution_a), max_len)
    solution_a[beginning_index] = max_len
    return max_len

def algorithm_for_human_towering(input_array):
    a = sorted(input_array)
    return determine_largest_tower(0, a, (-1,-1), [-1] * len(a))

a = [(3,2),(5,9),(6,7),(7,8)]
print algorithm_for_human_towering(a)
Here is another way to approach the problem altogether with code;
Algorithm
Sorting first by height and then by width
Sorted array:
[(56, 90), (60, 95), (65, 100), (68, 110), (70, 150), (75, 190)]
Finding the length of the longest increasing subsequence of weights
Why the longest subsequence of weights is the answer?
The people are sorted by increasing height,
so when we are finding a subsequence of people with increasing weights too
these selected people would satisfy our requirement as they are both in increasing order of heights and weights and therefore can form a human tower.
For example:
[(56, 90) (60,95) (65,100) (68,110) (70,150) (75,190)]
Efficient Implementation
In the attached implementation we maintain a list of increasing numbers and use bisect_left, which is implemented under the hood using binary search, to find the proper index for insertion.
Please note: the sequence generated by the longest_increasing_sequence method might not be the actual longest subsequence; however, its length will surely equal the length of the longest increasing subsequence.
Kindly refer to Longest increasing subsequence Efficient algorithms for more details.
The overall time complexity is O(n log(n)) as desired.
Code
from bisect import bisect_left

def human_tower(height, weight):
    def longest_increasing_sequence(A, get_property):
        lis = []
        for i in range(len(A)):
            x = get_property(A[i])
            i = bisect_left(lis, x)
            if i == len(lis):
                lis.append(x)
            else:
                lis[i] = x
        return len(lis)
    # Edge case, no people
    if 0 == len(height):
        return 0
    # Creating array of heights and widths
    people = [(h, w) for h, w in zip(height, weight)]
    # Sorting array first by height and then by width
    people.sort()
    # Returning length of longest increasing sequence
    return longest_increasing_sequence(people, lambda t: t[1])

assert 6 == human_tower([65,70,56,75,60,68], [100,150,90,190,95,110])

Fast algorithm to optimize a sequence of arithmetic expression

EDIT: clarified description of problem
Is there a fast algorithm solving the following problem?
And is there also one for the extended version of this problem,
where the natural numbers are replaced by Z/(2^n Z)? (That problem was too complex to ask as one more question in the same place, IMO.)
Problem:
For a given set of natural numbers like {7, 20, 17, 100}, the required algorithm
returns the shortest sequence of additions, multiplications and powers that computes
all of the given numbers.
Each item of the sequence is a (correct) equation that matches the following pattern:
<number> = <number> <op> <number>
where <number> is a natural number, and <op> is one of {+, *, ^}.
In the sequence, each operand of <op> must be either
1, or
a number which has already appeared on the left-hand side of an equation.
Example:
Input: {7, 20, 17, 100}
Output:
2 = 1 + 1
3 = 1 + 2
6 = 2 * 3
7 = 1 + 6
10 = 3 + 7
17 = 7 + 10
20 = 2 * 10
100 = 10 ^ 2
I wrote a backtracking algorithm in Haskell.
It works for small inputs like the above, but my real query is
~30 randomly distributed numbers in [0,255].
For the real query, the following code takes 2~10 minutes on my PC.
(Actual code,
very simple test)
My current (pseudo)code:
-- generate the set of sets required to compute n.
-- the operator (+) on sets is set union.
requiredNumbers 0 = { {} }
requiredNumbers 1 = { {} }
requiredNumbers n =
    { {j, k} | j^k == n, j >= 2, k >= 2 }
  + { {j, k} | j*k == n, j >= 2, k >= 2 }
  + { {j, k} | j+k == n, j >= 1, k >= 1 }

-- remember the smallest set of "computed" numbers
bestSet := {i | 1 <= i <= largeNumber}

-- backtracking algorithm
-- from: input
-- to: accumulator of "already computed" numbers
closure from to =
    if (from is empty)
        if (|bestSet| > |to|)
            bestSet := to
        return
    else if (|from| + |to| >= |bestSet|)
        -- cut branch
        return
    else
        m := min(from)
        from' := deleteMin(from)
        foreach (req in (requiredNumbers m))
            closure (from' + (req - to)) (to + {m})

-- recoverEquation is a function that converts a set of numbers to a set of equations.
-- it can be done easily.
output = recoverEquation (closure input {})
Additional Note:
Answers like
There isn't a fast algorithm, because...
There is a heuristic algorithm, it is...
are also welcomed. Now I'm feeling that there is no fast and exact algorithm...
Answer #1 can be used as a heuristic, I think.
What if you worked backwards from the highest number in a sorted input, checking if/how to utilize the smaller numbers (and numbers that are being introduced) in its construction?
For example, although this may not guarantee the shortest sequence...
input: {7, 20, 17, 100}
(100) = (20) * 5 =>
(7) = 5 + 2 =>
(17) = 10 + (7) =>
(20) = 10 * 2 =>
10 = 5 * 2 =>
5 = 3 + 2 =>
3 = 2 + 1 =>
2 = 1 + 1
What I recommend is to transform it into some kind of graph shortest path algorithm.
For each number, you compute (and store) the shortest path of operations. Technically one step is enough: for each number you can store the operation and the two operands (left and right, because the power operation is not commutative), and also the weight ("nodes").
Initially you register 1 with the weight of zero
Every time you register a new number, you have to generate all calculations with that number (all additions, multiplications, powers) with all already-registered numbers. ("edges")
Filter the calculations: if the result of a calculation is already registered, you shouldn't store it, because there is an easier way to get to that number
Store only 1 operation for the commutative ones (1+2=2+1)
Prefilter the power operation because that may even cause overflow
You have to order this list by the shortest sum path (weight of the edge). Weight = (weight of operand 1) + (weight of operand 2) + (1, which is the weight of the operation)
You can exclude all resulting numbers which are greater than the maximum number that we have to find (e.g. if we found 100 already, anything greater than 20 can be excluded) - this can be refined so that you can check the members of the operations also.
If you hit one of your target numbers, then you found the shortest way of calculating one of your target numbers, you have to restart the generations:
Recalculate the maximum of the target numbers
Go back on the paths of the currently found number, set their weight to 0 (they will be given from now on, because their cost is already paid)
Recalculate the weight for the operations in the generation list, because the source operand weight may have been changed (this results reordering at the end) - here you can exclude those where either operand is greater than the new maximum
If all the numbers are hit, then the search is over
You can build your expression using the "backlinks" (operation, left and right operands) for each of your target numbers.
The main point is that we always keep our eye on the target function, which is that the total number of operation must be the minimum possible. In order to get this, we always calculate the shortest path to a certain number, then considering that number (and all the other numbers on the way) as given numbers, then extending our search to the remaining targets.
Theoretically, this algorithm processes (registers) each number only once. Applying the proper filters cuts the unnecessary branches, so nothing is calculated twice (except the weights of the in-queue elements).

In what order should you insert a set of known keys into a B-Tree to get minimal height?

Given a fixed number of keys or values(stored either in array or in some data structure) and order of b-tree, can we determine the sequence of inserting keys that would generate a space efficient b-tree.
To illustrate, consider b-tree of order 3. Let the keys be {1,2,3,4,5,6,7}. Inserting elements into tree in the following order
for(int i=1; i<8; ++i)
{
    tree.push(i);
}
would create a tree like this
       4
   2       6
 1   3   5   7
see http://en.wikipedia.org/wiki/B-tree
But inserting elements in this way
flag = true;
for(int i=1,j=7; i<8; ++i,--j)
{
    if(flag)
    {
        tree.push(i);
        flag = false;
    }
    else
    {
        tree.push(j);
        flag = true;
    }
}
creates a tree like this
    3   5
 1 2   4   6 7
where we can see there is decrease in level.
So is there a particular way to determine sequence of insertion which would reduce space consumption?
The following trick should work for most ordered search trees, assuming the data to insert are the integers 1..n.
Consider the binary representation of your integer keys - for 1..7 (with dots for zeros) that's...
Bit : 210
  1 : ..1
  2 : .1.
  3 : .11
  4 : 1..
  5 : 1.1
  6 : 11.
  7 : 111
Bit 2 changes least often, Bit 0 changes most often. That's the opposite of what we want, so what if we reverse the order of those bits, then sort our keys in order of this bit-reversed value...
Bit : 210    Rev
  4 : 1.. -> ..1 : 1
  ------------------
  2 : .1. -> .1. : 2
  6 : 11. -> .11 : 3
  ------------------
  1 : ..1 -> 1.. : 4
  5 : 1.1 -> 1.1 : 5
  3 : .11 -> 11. : 6
  7 : 111 -> 111 : 7
It's easiest to explain this in terms of an unbalanced binary search tree, growing by adding leaves. The first item is dead centre - it's exactly the item we want for the root. Then we add the keys for the next layer down. Finally, we add the leaf layer. At every step, the tree is as balanced as it can be, so even if you happen to be building an AVL or red-black balanced tree, the rebalancing logic should never be invoked.
[EDIT I just realised you don't need to sort the data based on those bit-reversed values in order to access the keys in that order. The trick is to notice that bit-reversing is its own inverse. As well as mapping keys to positions, it maps positions to keys. So if you loop through from 1..n, you can use this bit-reversed value to decide which item to insert next - for the first insert use the 4th item, for the second insert use the second item and so on. One complication - you have to round n upwards to one less than a power of two (7 is OK, but use 15 instead of 8) and you have to bounds-check the bit-reversed values. The reason is that bit-reversing can move some in-bounds positions out-of-bounds and vice versa.]
Actually, for a red-black tree some rebalancing logic will be invoked, but it should just be re-colouring nodes - not rearranging them. However, I haven't double checked, so don't rely on this claim.
For a B tree, the height of the tree grows by adding a new root. Proving this works is, therefore, a little awkward (and it may require a more careful node-splitting than a B tree normally requires) but the basic idea is the same. Although rebalancing occurs, it occurs in a balanced way because of the order of inserts.
This can be generalised for any set of known-in-advance keys because, once the keys are sorted, you can assign suitable indexes based on that sorted order.
WARNING - This isn't an efficient way to construct a perfectly balanced tree from known already-sorted data.
If you have your data already sorted, and know it's size, you can build a perfectly balanced tree in O(n) time. Here's some pseudocode...
if size is zero, return null
from the size, decide which index should be the (subtree) root
recurse for the left subtree, giving that index as the size (assuming 0 is a valid index)
take the next item to build the (subtree) root
recurse for the right subtree, giving (size - (index + 1)) as the size
add the left and right subtree results as the child pointers
return the new (subtree) root
Basically, this decides the structure of the tree based on the size and traverses that structure, building the actual nodes along the way. It shouldn't be too hard to adapt it for B Trees.
This is how I would add elements to b-tree.
Thanks to Steve314 for giving me the start with the binary representation.
Given are n elements to add, in order. We have to add them to an m-order b-tree. Take their indexes (1...n) and convert them to radix m. The main idea of this insertion is to insert the number with the highest m-radix bit first, and to keep it above the lesser m-radix numbers added to the tree despite the splitting of nodes.
1,2,3.. are indexes so you actually insert the numbers they point to.
For example, order-4 tree
    4        8        12         <- highest radix bit numbers
1,2,3    5,6,7    9,10,11    13,14,15
Now depending on the order, the median can be:
order is even -> number of keys is odd -> median is the middle key (mid median)
order is odd -> number of keys is even -> left median or right median
The choice of median (left/right) to be promoted will decide the order in which I should insert elements. This has to be fixed for the b-tree.
I add elements to trees in buckets. First I add bucket elements then on completion next bucket in order. Buckets can be easily created if median is known, bucket size is order m.
I take left median for promotion. Choosing bucket for insertion.
 |   4   |   8   |   12   |
1,2,|3    5,6,|7    9,10,|11    13,14,|15
    3          2           1     <- order to insert buckets
For left-median choice I insert buckets to the tree starting from right side, for right median choice I insert buckets from left side. Choosing left-median we insert median first, then elements to left of it first then rest of the numbers in the bucket.
Example
Bucket median first
12,
Add elements to left
11,12,
Then after all elements inserted it looks like,
    | 12 |
 |11    13,14,|
Then I choose the bucket left to it. And repeat the same process.
Median
      12
8,11     13,14,
Add elements to left first
        12
7,8,11     13,14,
Adding rest
     8  |  12
7    9,10,|11    13,14,
Similarly keep adding all the numbers,
    4  |  8  |  12
3    5,6,|7    9,10,|11    13,14,
At the end add numbers left out from buckets.
 |  4  |  8  |  12  |
1,2,|3    5,6,|7    9,10,|11    13,14,|15
For mid-median (even order b-trees) you simply insert the median and then all the numbers in the bucket.
For right-median I add buckets from the left. For elements within the bucket I first insert median then right elements and then left elements.
Here we are adding the highest m-radix numbers, and in the process I added numbers with immediate lesser m-radix bit, making sure the highest m-radix numbers stay at top. Here I have only two levels, for more levels I repeat the same process in descending order of radix bits.
The last case is when the remaining elements have the same radix-bit and there are no numbers with a lesser radix-bit; then simply insert them and finish the procedure.
I would give an example for 3 levels, but it is too long to show. So please try with other parameters and tell if it works.
Unfortunately, all trees exhibit their worst case scenario running times, and require rigid balancing techniques when data is entered in increasing order like that. Binary trees quickly turn into linked lists, etc.
For typical B-Tree use cases (databases, filesystems, etc), you can typically count on your data naturally being more distributed, producing a tree more like your second example.
Though if it is really a concern, you could hash each key, guaranteeing a wider distribution of values.
for( i=1; i<8; ++i )
    tree.push(hash(i));
To build a particular B-tree using Insert() as a black box, work backward. Given a nonempty B-tree, find a node with more than the minimum number of children that's as close to the leaves as possible. The root is considered to have minimum 0, so a node with the minimum number of children always exists. Delete a value from this node to be prepended to the list of Insert() calls. Work toward the leaves, merging subtrees.
For example, given the 2-3 tree
       8
   4       c
 2   6   a   e
1 3 5 7 9 b d f,
we choose 8 and do merges to obtain the predecessor
   4       c
 2   6   a   e
1 3 5 79 b d f.
Then we choose 9.
   4       c
 2   6   a   e
1 3 5 7 b d f
Then a.
   4     c
 2   6   e
1 3 5 7b d f
Then b.
   4     c
 2   6   e
1 3 5 7 d f
Then c.
   4
 2   6   e
1 3 5 7d f
Et cetera.
So is there a particular way to determine sequence of insertion which would reduce space consumption?
Edit note: since the question was quite interesting, I tried to improve my answer with a bit of Haskell.
Let k be the Knuth order of the B-Tree and list a list of keys.
The minimization of space consumption has a trivial solution:
-- won't use point free notation to ease haskell newbies
trivial k list = concat $ reverse $ chunksOf (k-1) $ sort list
Such an algorithm will efficiently produce a time-inefficient B-Tree, unbalanced on the left but with minimal space consumption.
A lot of nontrivial solutions exist that are less efficient to produce but show better lookup performance (lower height/depth). As you know, it's all about trade-offs!
A simple algorithm that minimizes both the B-Tree depth and the space consumption (but it doesn't minimize lookup performance!), is the following
-- Sort the list in increasing order and call sortByBTreeSpaceConsumption
-- with the result
smart k list = sortByBTreeSpaceConsumption k $ sort list

-- Sort list so that inserting in a B-Tree with Knuth order = k
-- will produce a B-Tree with minimal space consumption and minimal depth
-- (but not best performance)
sortByBTreeSpaceConsumption :: Ord a => Int -> [a] -> [a]
sortByBTreeSpaceConsumption _ [] = []
sortByBTreeSpaceConsumption k list
  | k - 1 >= numOfItems = list -- this will be a leaf
  | otherwise = heads ++ tails ++ sortByBTreeSpaceConsumption k remainder
  where requiredLayers = minNumberOfLayersToArrange k list
        numOfItems = length list
        capacityOfInnerLayers = capacityOfBTree k $ requiredLayers - 1
        blockSize = capacityOfInnerLayers + 1
        blocks = chunksOf blockSize balanced
        heads = map last blocks
        tails = concat $ map (sortByBTreeSpaceConsumption k . init) blocks
        balanced = take (numOfItems - (mod numOfItems blockSize)) list
        remainder = drop (numOfItems - (mod numOfItems blockSize)) list

-- Capacity of layer n in a B-Tree with Knuth order = k
layerCapacity k 0 = k - 1
layerCapacity k n = k * layerCapacity k (n - 1)

-- Infinite list of capacities of layers in a B-Tree with Knuth order = k
capacitiesOfLayers k = map (layerCapacity k) [0..]

-- Capacity of a B-Tree with Knuth order = k and l layers
capacityOfBTree k l = sum $ take l $ capacitiesOfLayers k

-- Infinite list of capacities of B-Trees with Knuth order = k
-- as the number of layers increases
capacitiesOfBTree k = map (capacityOfBTree k) [1..]

-- compute the minimum number of layers in a B-Tree of Knuth order k
-- required to store the items in list
minNumberOfLayersToArrange k list = 1 + f k
  where numOfItems = length list
        f = length . takeWhile (< numOfItems) . capacitiesOfBTree
With this smart function, given list = [21, 18, 16, 9, 12, 7, 6, 5, 1, 2] and a B-Tree with Knuth order = 3, we should obtain [18, 5, 9, 1, 2, 6, 7, 12, 16, 21], with a resulting B-Tree like
          [18, 21]
          /
     [5 , 9]
    /   |   \
[1,2] [6,7] [12, 16]
Obviously this is suboptimal from a performance point of view, but should be acceptable, since obtaining a better one (like the following) would be far more expensive (computationally and economically):
       [7 , 16]
      /    |    \
  [5,6] [9,12] [18, 21]
   /
[1,2]
If you want to run it, put the previous code in a Main.hs file and compile it with ghc after prepending
import Data.List (sort)
import Data.List.Split
import System.Environment (getArgs)

main = do
  args <- getArgs
  let knuthOrder = read $ head args
  let keys = (map read $ tail args) :: [Int]
  putStr "smart: "
  putStrLn $ show $ smart knuthOrder keys
  putStr "trivial: "
  putStrLn $ show $ trivial knuthOrder keys
