I'm searching for a data structure that can be sorted as fast as a plain list and that allows removing elements in the following way. Let's say we have a list like this:
That is, a list of tuples (this is Erlang syntax). Each tuple contains a number and a list holding the members of a list used to compute that number. What I want to do with the list is the following: first sort it, then take the head of the list, and finally clean the list. By clean I mean removing all the elements from the tail that contain elements that are in the head or, in other words, all the elements of the tail whose intersection with the head is not empty. For example, after sorting, the head is {18,[4,3]}. The next step is removing all the elements of the list that contain 4 or 3, so the resulting list should be this one:
The process continues by taking the new head and cleaning again until the whole list is consumed. Note that if the cleaning process preserves the order, there is no need to re-sort the list on each iteration.
The bottleneck here is the cleaning process. I need some structure that lets me do the cleaning faster than I do now.
Does anyone know of a structure that allows this to be done efficiently without losing the order, or that at least allows fast sorting?

Yes, you can get faster than this. Your problem is that you are representing the second tuple members as lists. Searching them is cumbersome and quite unnecessary. They are all contiguous substrings of 5..1. You could simply represent them as a tuple of indices!
And in fact you don't even need a list with these index tuples. Put them in a two-dimensional array right at the position given by the respective tuple, and you'll get a triangular array:
h\l |   1    2    3    4    5
  1 |   2
  2 |   6    2
  3 |  -4   -6  -10
  4 |  -2   -4   18    2
  5 |  -4  -10  -10    0   -2
Instead of storing the data in a two-dimensional array, you might want to store them in a simple array with some index magic to account for the triangular shape (if your programming language only allows for rectangular two-dimensional arrays), but that doesn't affect complexity.
This is all the structure you need to quickly filter the "list" by simply looking the things up.
Instead of sorting first and getting the head, we simply iterate once through the whole structure to find the maximum value and its indices:
max_val = 18
max = (4, 3) // the two indices
The filter is quite simple. If we use neither lists (not (any (substring `contains`) selection)) nor sets (isEmpty (intersect substring selection)) but tuples, then it's just sel.high < substring.low || sel.low > substring.high. And we don't even need to iterate over the whole triangular array; we can simply iterate over the higher and the lower triangles:
result = []
for (i from 1 until max[1])
    for (j from i until max[1])
        result.push({array[j][i], (j,i)})
for (i from max[0] until 5)
    for (j from i until 5)
        result.push({array[j+1][i+1], (j+1,i+1)})
And you've got the elements you need:
[{ 2, (1,1)},
{ 6, (2,1)},
{ 4, (2,2)},
{-2, (5,5)}]
Now you only need to sort that and you've got your result.
Actually, the overall complexity doesn't get better with the triangular array. You still have O(n) from building the list and finding the maximum. Whether you filter in O(n) by testing against every substring index tuple or filter in O(|result|) by smart selection doesn't matter any more, but you were specifically asking about a fast cleaning step. It might still be beneficial in practice if the data is large or when you need to do multiple cleanings.
The only thing affecting overall complexity is to sort only the result, not the whole input.
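A minimal Python sketch of the index-tuple filter described above. It assumes the triangular array is kept as a dictionary keyed by (high, low) pairs; the names `cells`, `disjoint` and `take_max_and_clean` are made up for illustration and do not come from the question.

```python
# Triangular array from the answer, keyed by (high, low) index pairs.
cells = {
    (1, 1): 2,
    (2, 1): 6, (2, 2): 2,
    (3, 1): -4, (3, 2): -6, (3, 3): -10,
    (4, 1): -2, (4, 2): -4, (4, 3): 18, (4, 4): 2,
    (5, 1): -4, (5, 2): -10, (5, 3): -10, (5, 4): 0, (5, 5): -2,
}


def disjoint(sel, sub):
    """True when the index ranges low..high of sel and sub do not overlap."""
    sel_high, sel_low = sel
    sub_high, sub_low = sub
    return sel_high < sub_low or sel_low > sub_high


def take_max_and_clean(cells):
    """Find the maximum entry, then keep only entries disjoint from it."""
    best = max(cells, key=cells.get)          # one O(n) pass, no sorting
    rest = {idx: v for idx, v in cells.items() if disjoint(best, idx)}
    return best, cells[best], rest


best, value, rest = take_max_and_clean(cells)
print(best, value, sorted(rest))
```

This version tests every entry, so the filter is O(n); iterating only the two remaining triangles, as in the pseudocode, would visit just the survivors.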

I wonder if your original data structure can be seen as an adjacency list for a directed graph? E.g;
means you have these nodes and edges;
node 2 => node 1
node 6 => node 2
node 6 => node 1
So your question can be rewritten as;
If I find a node that links to nodes 4 and 3, what happens to the graph if I delete nodes 4 and 3?
One approach would be to build an adjacency matrix; an NxN bit matrix where every edge is the 1-bit. Your problem now becomes;
set every bit in the 4-row, and every bit in the 4-column, to zero.
That is, nothing links in or out of this deleted node.
As an optimisation, keep a bit array of length N. The bit is set if the node hasn't been deleted. So if nodes 1, 2, 4, and 5 are 'live' and 3 and 6 are 'deleted', the array looks like
Now to delete '4', you just clear the bit;
When you're done deleting, go through the adjacency matrix, but ignore any edge that's encoded in a row or column with 0 set.
Full example. Lets say you have
[ {2, [1,3]},
{3, [1]},
{4, [2,3]} ]
That's the adjacency matrix
    1 2 3 4
1   0 0 0 0   # no entry for 1
2   1 0 1 0   # 2, [1,3]
3   1 0 0 0   # 3, [1]
4   0 1 1 0   # 4, [2,3]
and the mask
[1 1 1 1]
To delete node 2, you just alter the mask;
[1 0 1 1]
Now, to figure out the structure, pseudocode like:
rows = []
for r in 1..4:
    if mask[r] == false:
        # this row was deleted
        continue
    targets = []
    for c in 1..4:
        if mask[c] == true && matrix[r,c]:
            # this node wasn't deleted and was there before
            targets.add(c)
    if (!targets.empty):
        rows.add({ r, targets})
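The same reconstruction as a small runnable Python sketch, using the example matrix and mask from above. Node numbers are 1-based as in the example, while the Python lists are 0-based; `surviving_rows` is just an illustrative name.

```python
def surviving_rows(matrix, mask):
    """Rebuild the {node, targets} list, skipping deleted rows and columns."""
    n = len(mask)
    rows = []
    for r in range(n):
        if not mask[r]:
            continue  # this row was deleted
        # keep only targets that were present before and are still live
        targets = [c + 1 for c in range(n) if mask[c] and matrix[r][c]]
        if targets:
            rows.append((r + 1, targets))
    return rows


matrix = [
    [0, 0, 0, 0],  # no entry for 1
    [1, 0, 1, 0],  # 2, [1,3]
    [1, 0, 0, 0],  # 3, [1]
    [0, 1, 1, 0],  # 4, [2,3]
]
mask = [1, 0, 1, 1]  # node 2 has been deleted

print(surviving_rows(matrix, mask))
```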
Adjacency matrices can get large (it's NxN bits, after all), so this will only be better on small, dense matrices, not large, sparse ones.
If this isn't great, you might find that it's easier to google for graph algorithms than invent them yourself :)


Split array into four boxes such that sum of XOR's of the boxes is maximum

Given an array of integers which are needed to be split into four
boxes such that sum of XOR's of the boxes is maximum.
I/P -- [1,2,1,2,1,2]
O/P -- 9
Explanation: Box1--[1,2]
I've tried using recursion, but it failed for larger test cases because the
time complexity is exponential. I'm expecting a solution using dynamic programming.
def max_Xor(b1,b2,b3,b4,A,index,size):
if index == size:
return b1+b2+b3+b4
return m
def main():
Thanks in Advance!!
There are several things to speed up your algorithm:
Build in some start-up logic: it doesn't make sense to put anything into box 3 until boxes 1 & 2 are differentiated. In fact, you should generally have an order of precedence to keep you from repeating configurations in a different order.
Memoize your logic; this avoids repeating computations.
For large cases, take advantage of what value algebra exists.
This last item may turn out to be the biggest saving. For instance, if your longest numbers include several 5-bit and 4-bit numbers, it makes no sense to consider shorter numbers until you've placed those decently in the boxes, gaining maximum advantage for the leading bits. With only four boxes, you cannot have a sum from 3-bit numbers that dominates a single misplaced 5-bit number.
Your goal is to place an odd number of 5-bit numbers into 3 or all 4 boxes; against this, check only whether this "pessimizes" bit 4 of the remaining numbers. For instance, given six 5-digit numbers (range 16-31) and a handful of small ones (0-7), your first consideration is to handle only combinations that partition the 5-digit numbers by (3, 1, 1, 1), as this leaves that valuable 5-bit turned on in each set.
With a more even mixture of values in your input, you'll also need to consider how to distribute the 4-bits for a similar "keep it odd" heuristic. Note that, as you work from largest to smallest, you need worry only about keeping it odd, and watching the following bit.
These techniques should let you prune your recursion enough to finish in time.
We can use Dynamic programming here to break the problem into smaller sets then store their result in a table. Then use already stored result to calculate answer for bigger set.
For example:
Input -- [1,2,1,2,1,2]
We need to divide the array consecutively into 4 boxes such that the sum of the XORs of all boxes is maximised.
Let's take your test case, break the problem into smaller sets and start solving for the smaller sets.
box = 1, num = [1,2,1,2,1,2]
ans = 1 3 2 0 1 3
Since we have only one box, all numbers go into this box. We will store this answer in a table. Let's call the matrix DP.
DP[0] = [1 3 2 0 1 3]
DP[i][j] stores the answer for distributing numbers 0..j into i+1 boxes (rows are 0-indexed, so row 0 is the one-box case).
now lets take the case where we have two boxes and we will take numbers one by one.
num = [1] since we only have one number it will go into the first box.
DP[1][0] = 1
Lets add another number.
num = [1 2]
now there can be two ways to put this new number into the box.
case 1: 2 will go to the First box. Since we already have answer
for both numbers in one box. we will just use that.
answer = DP[0][1] + 0 (Second box is empty)
case 2: 2 will go to second box.
answer = DP[0][0] + 2 (only 2 is present in the second box)
Maximum of the two cases will be stored in DP[1][1].
DP[1][1] = max(3+0, 1+2) = 3.
Now for num = [1 2 1].
Again for new number we have three cases.
box1 = [1 2 1], box2 = [], DP[0][2] + 0
box1 = [1 2], box2 = [1], DP[0][1] + 1
box1 = [1 ], box2 = [2 1], DP[0][0] + (2 XOR 1)
Maximum of these three will be answer for DP[1][2].
Similarly we can find answer of num = [1 2 1 2 1 2] box = 4
1 3 2 0 1 3
1 3 4 6 5 3
1 3 4 6 7 9
1 3 4 6 7 9
Also note that a xor b xor a = b. You can use this property to get the XOR of a segment of the array in constant time, as suggested in the comments: with prefix XORs P, the XOR of elements l..r is P[r] xor P[l-1].
This way you can break the problem into smaller subsets and use the answers for the smaller sets to compute the bigger ones. Hope this helps. Once you understand the concept, you can go ahead and implement it with better than exponential running time.
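For reference, the table above can be reproduced with a short Python sketch. It assumes, as this answer does, that the boxes are consecutive segments of the array and that a box may stay empty (contributing 0); `xor_split_table` is a hypothetical name.

```python
def xor_split_table(nums, boxes=4):
    """Return the DP table: table[i][j] is the best score for nums[0..j]
    split into at most i+1 consecutive boxes."""
    n = len(nums)
    # prefix[t] = XOR of the first t numbers, so a segment XOR is O(1)
    prefix = [0] * (n + 1)
    for idx, v in enumerate(nums):
        prefix[idx + 1] = prefix[idx] ^ v

    row = [prefix[j + 1] for j in range(n)]   # one box: XOR of the prefix
    table = [row]
    for _ in range(boxes - 1):
        new_row = []
        for j in range(n):
            best = row[j]                     # the new box stays empty
            for k in range(j):
                # earlier boxes hold nums[0..k], the new box nums[k+1..j]
                best = max(best, row[k] + (prefix[j + 1] ^ prefix[k + 1]))
            new_row.append(best)
        row = new_row
        table.append(row)
    return table


table = xor_split_table([1, 2, 1, 2, 1, 2], 4)
for r in table:
    print(r)
```

The running time is O(boxes * n^2), which is polynomial rather than exponential.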
I would go bit by bit from the highest bit to the lowest bit. For every bit, try all combinations that distribute the still unused numbers that have that bit set so that an odd number of them is in each box, nothing else matters. Pick the best path overall. One issue that complicates this greedy method is that two boxes with a lower bit set can equal one box with the next higher bit set.
Alternatively, memoize the boxes state in your recursion as an ordered tuple.

minimum switch to sorted permutation

Suppose I have an array like this:
[5 4 1 2 3]
And I want to compute the minimum number of switches I have to make to sort the unsorted permutation.
Now the answer is 7 in this case. Just move 4 and 5 to the right, or move 1, 2, 3 to the left.
The irony, though, is that I used [4 5 1 2 3] in my notes, which gives 6, so I misled myself and made a fool of myself.
[5 1 4 2 3] // step 1
[1 5 4 2 3] // step 2
[1 5 2 4 3] // step 3
[1 2 5 4 3] // step 4
[1 2 5 3 4] // step 5
[1 2 3 5 4] // step 6
[1 2 3 4 5] // step 7
I've thought of things like having an array that keep the offset needed, and for each loop, just look for the switch that moves the whole thing closer to goal.
But that just seem too slow, any ideas?
from comment: are the members of the array guaranteed to completely belong to {1..N} set for an array of size N, without repeating numbers?
Nope. The elements are not guaranteed to be distinct or to lie in [1...n] for an array of size N.
There are two solutions to this particular problem: one is the slower but more straightforward bubblesort, the other is the faster but less straightforward mergesort.
With bubblesort, you basically count the number of switches while running the algorithm.
With mergesort, it's a bit trickier, but the counting happens when merging. When the array is already sorted, the count should yield 0, as no switches are needed to sort it. With bubblesort, you count the switches when you push the largest or the smallest number to the left or right. With mergesort, you count switches when merging. A bit of brute-forcing by hand will get you there.
What you're actually looking for is calculating the number of inversions in a sequence.
This can be done in O(n*logn) using mergesort, for example.
Here you have an article about this subject, looks quite understandable.
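A minimal Python sketch of the mergesort-based inversion count, for illustration only. It counts a pair as an inversion only when the left element is strictly greater, so repeated values (which the question allows) are not counted; `count_inversions` is an assumed name.

```python
def count_inversions(seq):
    """Number of pairs (i, j) with i < j and seq[i] > seq[j], in O(n log n)."""

    def sort(a):
        if len(a) <= 1:
            return a, 0
        mid = len(a) // 2
        left, inv_left = sort(a[:mid])
        right, inv_right = sort(a[mid:])
        merged, inv = [], inv_left + inv_right
        i = j = 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i])
                i += 1
            else:
                # right[j] jumps over every element still waiting in left
                merged.append(right[j])
                j += 1
                inv += len(left) - i
        merged.extend(left[i:])
        merged.extend(right[j:])
        return merged, inv

    return sort(list(seq))[1]


print(count_inversions([5, 4, 1, 2, 3]))
print(count_inversions([4, 5, 1, 2, 3]))
```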
This looks suspiciously similar to bubble sort, in which you need up to n^2 movements.
And the interesting fact is that, simple bubble sort actually achieves your goal to find the minimum number of switches! (proof below)
In that case there is no need to look for anything cleverer than the two nested loops of bubble sort (in C++):
int switches = 0;
for (int repeat = 0; repeat < n - 1; repeat++) {
    for (int j = 0; j < n - 1 - repeat; j++) {
        if (arr[j] > arr[j + 1]) {  // adjacent pair is out of order
            int tmp = arr[j];
            arr[j] = arr[j + 1];
            arr[j + 1] = tmp;
            switches = switches + 1;
        }
    }
}
switches is the result.
arr is the array containing the numbers.
n is the length of the array.
Proof that this produces the minimum number of switches:
First, we note that the bubble sort essentially moves the highest element into the rightmost position in the array at each iteration (outer loop)
Note that switching the highest element with any other element in the process does not change the relative order of other elements. And also any other switch operations done in between our attempt to move the highest element to its position will not change the number of switch required to move the highest element to place. And so we can interchange the switch operations such that the highest element is always switched first until it gets into position. Therefore switching the highest element into its position one at a time is optimum.

Disperse Duplicates in an Array

Source : Google Interview Question
Write a routine to ensure that identical elements in the input are maximally spread in the output.
Basically, we need to place identical elements in such a way that the TOTAL spreading is as large as possible.
Input: {1,1,2,3,2,3}
Possible Output: {1,2,3,1,2,3}
Total dispersion = Difference between position of 1's + 2's + 3's = 4-1 + 5-2 + 6-3 = 9 .
I am NOT AT ALL sure whether an optimal polynomial-time algorithm exists for this. Also, no detail is provided for the question other than this.
What I thought of is to calculate the frequency of each element in the input, then arrange them in the output, one distinct element at a time, until all the frequencies are exhausted.
I am not sure of my approach.
Any approaches or ideas, people?
I believe this simple algorithm would work:
count the number of occurrences of each distinct element.
make a new list
add one instance of all elements that occur more than once to the list (order within each group does not matter)
add one instance of all unique elements to the list
add one instance of all elements that occur more than once to the list
add one instance of all elements that occur more than twice to the list
add one instance of all elements that occur more than thrice to the list
Now, this will intuitively not give a good spread:
for {1, 1, 1, 1, 2, 3, 4} ==> {1, 2, 3, 4, 1, 1, 1}
for {1, 1, 1, 2, 2, 2, 3, 4} ==> {1, 2, 3, 4, 1, 2, 1, 2}
However, I think this is the best spread you can get given the scoring function provided.
Since the dispersion score counts the sum of the distances instead of the squared sum of the distances, you can have several duplicates close together, as long as you have a large gap somewhere else to compensate.
for a sum-of-squared-distances score, the problem becomes harder.
Perhaps the interview question hinged on the candidate recognizing this weakness in the scoring function?
In perl
then make a hash table of the counts of different numbers in the list, like a frequency table
map { $x{$_}++ } @a;
then repeatedly walk through all the keys found, with the keys in a known order and add the appropriate number of individual numbers to an output list until all the keys are exhausted
while( $g == 1 ) {
for my $n (sort keys %x)
if ($x{$n}>1) {
push @r, $n;
I'm sure that this could be adapted to any programming language that supports hash tables
Python code for the algorithm suggested by Vorsprung and HugoRune:
from collections import Counter, defaultdict

def max_spread(data):
    cnt = Counter(data)            # frequency table
    res, num = [], list(cnt)       # num fixes the order of the distinct values
    while len(cnt) > 0:
        for i in num:
            if cnt[i] > 0:         # a missing key counts as 0 in a Counter
                res.append(i)      # emit one instance per round
                cnt[i] -= 1
                if cnt[i] == 0: del cnt[i]
    return res

def calc_spread(data):
    d = defaultdict(list)          # value -> positions where it occurs
    for i, v in enumerate(data):
        d[v].append(i)
    return sum(max(x) - min(x) for x in d.values())
HugoRune's answer takes some advantage of the unusual scoring function but we can actually do even better: suppose there are d distinct non-unique values, then the only thing that is required for a solution to be optimal is that the first d values in the output must consist of these in any order, and likewise the last d values in the output must consist of these values in any (i.e. possibly a different) order. (This implies that all unique numbers appear between the first and last instance of every non-unique number.)
The relative order of the first copies of non-unique numbers doesn't matter, and likewise nor does the relative order of their last copies. Suppose the values 1 and 2 both appear multiple times in the input, and that we have built a candidate solution obeying the condition I gave in the first paragraph that has the first copy of 1 at position i and the first copy of 2 at position j > i. Now suppose we swap these two elements. Element 1 has been pushed j - i positions to the right, so its score contribution will drop by j - i. But element 2 has been pushed j - i positions to the left, so its score contribution will increase by j - i. These cancel out, leaving the total score unchanged.
Now, any permutation of elements can be achieved by swapping elements in the following way: swap the element in position 1 with the element that should be at position 1, then do the same for position 2, and so on. After the ith step, the first i elements of the permutation are correct. We know that every swap leaves the scoring function unchanged, and a permutation is just a sequence of swaps, so every permutation also leaves the scoring function unchanged! This is true for the d elements at both ends of the output array.
When 3 or more copies of a number exist, only the position of the first and last copy contribute to the distance for that number. It doesn't matter where the middle ones go. I'll call the elements between the 2 blocks of d elements at either end the "central" elements. They consist of the unique elements, as well as some number of copies of all those non-unique elements that appear at least 3 times. As before, it's easy to see that any permutation of these "central" elements corresponds to a sequence of swaps, and that any such swap will leave the overall score unchanged (in fact it's even simpler than before, since swapping two central elements does not even change the score contribution of either of these elements).
This leads to a simple O(n log n) algorithm (or O(n) if you use bucket sort for the first step) to generate a solution array Y from a length-n input array X:
Sort the input array X.
Use a single pass through X to count the number of distinct non-unique elements. Call this d.
Set i, j and k to 0.
While i < n:
    If i+1 < n and X[i+1] == X[i], we have a non-unique element:
        Set Y[j] = Y[n-j-1] = X[i].
        Increment i twice, and increment j once.
        While i < n and X[i] == X[i-1]:
            Set Y[d+k] = X[i].
            Increment i and k.
    Otherwise we have a unique element:
        Set Y[d+k] = X[i].
        Increment i and k.
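The steps above can be sketched in Python as follows. This is only an illustration: the bounds checks on i are added so the loops never read past the end of X, and `disperse` and `spread` are hypothetical names. The `spread` helper computes the score from the question using 0-based positions.

```python
from collections import Counter


def disperse(xs):
    """Place one copy of each repeated value at both ends, the rest centrally."""
    x = sorted(xs)
    n = len(x)
    # d = number of distinct values that occur at least twice
    d = sum(1 for c in Counter(x).values() if c > 1)
    y = [None] * n
    i = j = k = 0
    while i < n:
        if i + 1 < n and x[i + 1] == x[i]:
            # non-unique value: first copy to the left block, last to the right
            y[j] = y[n - j - 1] = x[i]
            i += 2
            j += 1
            while i < n and x[i] == x[i - 1]:
                y[d + k] = x[i]   # any further copies go in the centre
                i += 1
                k += 1
        else:
            y[d + k] = x[i]       # unique value goes in the centre
            i += 1
            k += 1
    return y


def spread(ys):
    """Sum over values of (last position - first position)."""
    first, last = {}, {}
    for pos, v in enumerate(ys):
        first.setdefault(v, pos)
        last[v] = pos
    return sum(last[v] - first[v] for v in first)


out = disperse([1, 1, 2, 3, 2, 3])
print(out, spread(out))
```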

In what order should you insert a set of known keys into a B-Tree to get minimal height?

Given a fixed number of keys or values(stored either in array or in some data structure) and order of b-tree, can we determine the sequence of inserting keys that would generate a space efficient b-tree.
To illustrate, consider b-tree of order 3. Let the keys be {1,2,3,4,5,6,7}. Inserting elements into tree in the following order
for(int i=1 ;i<8; ++i)
would create a tree like this
      4
  2       6
1   3   5   7
But inserting elements in this way
flag = true;
for(int i=1,j=7; i<8; ++i,--j)
flag = false;
flag = true;
creates a tree like this
    3   5
1 2   4   6 7
where we can see that the number of levels has decreased.
So is there a particular way to determine sequence of insertion which would reduce space consumption?
The following trick should work for most ordered search trees, assuming the data to insert are the integers 1..n.
Consider the binary representation of your integer keys - for 1..7 (with dots for zeros) that's...
Bit : 210
1 : ..1
2 : .1.
3 : .11
4 : 1..
5 : 1.1
6 : 11.
7 : 111
Bit 2 changes least often, Bit 0 changes most often. That's the opposite of what we want, so what if we reverse the order of those bits, then sort our keys in order of this bit-reversed value...
Bit : 210 Rev
4 : 1.. -> ..1 : 1
2 : .1. -> .1. : 2
6 : 11. -> .11 : 3
1 : ..1 -> 1.. : 4
5 : 1.1 -> 1.1 : 5
3 : .11 -> 11. : 6
7 : 111 -> 111 : 7
It's easiest to explain this in terms of an unbalanced binary search tree, growing by adding leaves. The first item is dead centre - it's exactly the item we want for the root. Then we add the keys for the next layer down. Finally, we add the leaf layer. At every step, the tree is as balanced as it can be, so even if you happen to be building an AVL or red-black balanced tree, the rebalancing logic should never be invoked.
[EDIT I just realised you don't need to sort the data based on those bit-reversed values in order to access the keys in that order. The trick is to notice that bit-reversing is its own inverse. As well as mapping keys to positions, it maps positions to keys. So if you loop through 1..n, you can use the bit-reversed value to decide which item to insert next: for the first insert use the 4th item, for the second insert use the second item, and so on. One complication: you have to round n upwards to one less than a power of two (7 is OK, but use 15 instead of 8), and you have to bounds-check the bit-reversed values. The reason is that bit-reversing can move some in-bounds positions out-of-bounds and vice versa.]
Actually, for a red-black tree some rebalancing logic will be invoked, but it should just be re-colouring nodes - not rearranging them. However, I haven't double checked, so don't rely on this claim.
For a B tree, the height of the tree grows by adding a new root. Proving this works is, therefore, a little awkward (and it may require a more careful node-splitting than a B tree normally requires) but the basic idea is the same. Although rebalancing occurs, it occurs in a balanced way because of the order of inserts.
This can be generalised for any set of known-in-advance keys because, once the keys are sorted, you can assign suitable indexes based on that sorted order.
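A small Python sketch of the bit-reversed insertion order, assuming the keys are sorted first so that positions 1..n are well defined. `bit_reversed_order` is a hypothetical helper; it only produces the order and leaves the actual tree insertion to whatever tree you use.

```python
def bit_reversed_order(keys):
    """Return the keys in the order given by bit-reversing positions 1..n."""
    keys = sorted(keys)
    n = len(keys)
    bits = n.bit_length()          # smallest width with 2**bits - 1 >= n
    order = []
    for pos in range(1, 2 ** bits):
        # reverse the fixed-width binary representation of the position
        rev = int(format(pos, "0{}b".format(bits))[::-1], 2)
        if rev <= n:               # bounds check: skip out-of-range positions
            order.append(keys[rev - 1])
    return order


print(bit_reversed_order(range(1, 8)))
```

For n = 8 the loop runs over 1..15 and simply skips the positions whose reversed value exceeds 8, which is the bounds check mentioned in the edit.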
WARNING - This isn't an efficient way to construct a perfectly balanced tree from known already-sorted data.
If your data is already sorted and you know its size, you can build a perfectly balanced tree in O(n) time. Here's some pseudocode...
if size is zero, return null
from the size, decide which index should be the (subtree) root
recurse for the left subtree, giving that index as the size (assuming 0 is a valid index)
take the next item to build the (subtree) root
recurse for the right subtree, giving (size - (index + 1)) as the size
add the left and right subtree results as the child pointers
return the new (subtree) root
Basically, this decides the structure of the tree based on the size and traverses that structure, building the actual nodes along the way. It shouldn't be too hard to adapt it for B Trees.
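A minimal Python sketch of that pseudocode for a plain binary tree, with nodes represented as (left, key, right) tuples. The choice of size // 2 as the subtree root index is an assumption; the pseudocode only says the index is decided from the size.

```python
def build_balanced(sorted_items):
    """Build a balanced binary tree from already-sorted items in O(n)."""
    it = iter(sorted_items)

    def build(size):
        if size == 0:
            return None
        root_index = size // 2              # assumed choice of subtree root
        left = build(root_index)            # consumes the first root_index items
        key = next(it)                      # the next item becomes the root
        right = build(size - (root_index + 1))
        return (left, key, right)

    return build(len(sorted_items))


def height(node):
    return 0 if node is None else 1 + max(height(node[0]), height(node[2]))


tree = build_balanced([1, 2, 3, 4, 5, 6, 7])
print(tree[1], height(tree))
```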
This is how I would add elements to b-tree.
Thanks to Steve314, for giving me the start with binary representation,
Given are n elements to add, in order. We have to add them to an m-order b-tree. Take their indexes (1...n) and convert them to radix m. The main idea of this insertion is to insert the numbers with the highest m-radix digit first and keep them above the lesser m-radix numbers added to the tree, despite the splitting of nodes.
1,2,3.. are indexes so you actually insert the numbers they point to.
For example, order-4 tree
4 8 12 highest radix bit numbers
1,2,3 5,6,7 9,10,11 13,14,15
Now depending on order median can be:
order is even -> number of keys are odd -> median is middle (mid median)
order is odd -> number of keys are even -> left median or right median
The choice of median (left/right) to be promoted will decide the order in which I should insert elements. This has to be fixed for the b-tree.
I add elements to trees in buckets. First I add bucket elements then on completion next bucket in order. Buckets can be easily created if median is known, bucket size is order m.
I take left median for promotion. Choosing bucket for insertion.
| 4 | 8 | 12 |
1,2,|3 5,6,|7 9,10,|11 13,14,|15
3 2 1 Order to insert buckets.
For left-median choice I insert buckets to the tree starting from right side, for right median choice I insert buckets from left side. Choosing left-median we insert median first, then elements to left of it first then rest of the numbers in the bucket.
Bucket median first
Add elements to left
Then after all elements inserted it looks like,
| 12 |
|11 13,14,|
Then I choose the bucket left to it. And repeat the same process.
8,11 13,14,
Add elements to left first
7,8,11 13,14,
Adding rest
8 | 12
7 9,10,|11 13,14,
Similarly keep adding all the numbers,
4 | 8 | 12
3 5,6,|7 9,10,|11 13,14,
At the end add numbers left out from buckets.
| 4 | 8 | 12 |
1,2,|3 5,6,|7 9,10,|11 13,14,|15
For mid-median (even order b-trees) you simply insert the median and then all the numbers in the bucket.
For right-median I add buckets from the left. For elements within the bucket I first insert median then right elements and then left elements.
Here we are adding the highest m-radix numbers, and in the process I added numbers with immediate lesser m-radix bit, making sure the highest m-radix numbers stay at top. Here I have only two levels, for more levels I repeat the same process in descending order of radix bits.
Last case is when remaining elements are of same radix-bit and there is no numbers with lesser radix-bit, then simply insert them and finish the procedure.
I would give an example for 3 levels, but it is too long to show. So please try with other parameters and tell if it works.
Unfortunately, all trees exhibit their worst case scenario running times, and require rigid balancing techniques when data is entered in increasing order like that. Binary trees quickly turn into linked lists, etc.
For typical B-Tree use cases (databases, filesystems, etc), you can typically count on your data naturally being more distributed, producing a tree more like your second example.
Though if it is really a concern, you could hash each key, guaranteeing a wider distribution of values.
for( i=1; i<8; ++i )
To build a particular B-tree using Insert() as a black box, work backward. Given a nonempty B-tree, find a node with more than the minimum number of children that's as close to the leaves as possible. The root is considered to have minimum 0, so a node with the minimum number of children always exists. Delete a value from this node to be prepended to the list of Insert() calls. Work toward the leaves, merging subtrees.
For example, given the 2-3 tree
       8
   4       c
 2   6   a   e
1 3 5 7 9 b d f,
we choose 8 and do merges to obtain the predecessor
4 c
2 6 a e
1 3 5 79 b d f.
Then we choose 9.
4 c
2 6 a e
1 3 5 7 b d f
Then a.
4 c
2 6 e
1 3 5 7b d f
Then b.
4 c
2 6 e
1 3 5 7 d f
Then c.
2 6 e
1 3 5 7d f
Et cetera.
So is there a particular way to determine sequence of insertion which would reduce space consumption?
Edit note: since the question was quite interesting, I try to improve my answer with a bit of Haskell.
Let k be the Knuth order of the B-Tree and list a list of keys
The minimization of space consumption has a trivial solution:
-- won't use point free notation to ease haskell newbies
trivial k list = concat $ reverse $ chunksOf (k-1) $ sort list
Such algorithm will efficiently produce a time-inefficient B-Tree, unbalanced on the left but with minimal space consumption.
A lot of non trivial solutions exist that are less efficient to produce but show better lookup performance (lower height/depth). As you know, it's all about trade-offs!
A simple algorithm that minimizes both the B-Tree depth and the space consumption (but does not give the best lookup performance!) is the following:
-- Sort the list in increasing order and call sortByBTreeSpaceConsumption
-- with the result
smart k list = sortByBTreeSpaceConsumption k $ sort list
-- Sort list so that inserting in a B-Tree with Knuth order = k
-- will produce a B-Tree with minimal space consumption and minimal depth
-- (but not best performance)
sortByBTreeSpaceConsumption :: Ord a => Int -> [a] -> [a]
sortByBTreeSpaceConsumption _ [] = []
sortByBTreeSpaceConsumption k list
    | k - 1 >= numOfItems = list -- this will be a leaf
    | otherwise = heads ++ tails ++ sortByBTreeSpaceConsumption k remainder
    where requiredLayers = minNumberOfLayersToArrange k list
          numOfItems = length list
          capacityOfInnerLayers = capacityOfBTree k $ requiredLayers - 1
          blockSize = capacityOfInnerLayers + 1
          blocks = chunksOf blockSize balanced
          heads = map last blocks
          tails = concat $ map (sortByBTreeSpaceConsumption k . init) blocks
          balanced = take (numOfItems - (mod numOfItems blockSize)) list
          remainder = drop (numOfItems - (mod numOfItems blockSize)) list
-- Capacity of a layer n in a B-Tree with Knuth order = k
layerCapacity k 0 = k - 1
layerCapacity k n = k * layerCapacity k (n - 1)
-- Infinite list of capacities of layers in a B-Tree with Knuth order = k
capacitiesOfLayers k = map (layerCapacity k) [0..]
-- Capacity of a B-Tree with Knuth order = k and l layers
capacityOfBTree k l = sum $ take l $ capacitiesOfLayers k
-- Infinite list of capacities of B-Trees with Knuth order = k
-- as the number of layers increases
capacitiesOfBTree k = map (capacityOfBTree k) [1..]
-- compute the minimum number of layers in a B-Tree of Knuth order k
-- required to store the items in list
minNumberOfLayersToArrange k list = 1 + f k
    where numOfItems = length list
          f = length . takeWhile (< numOfItems) . capacitiesOfBTree
With this smart function given a list = [21, 18, 16, 9, 12, 7, 6, 5, 1, 2] and a B-Tree with knuth order = 3 we should obtain [18, 5, 9, 1, 2, 6, 7, 12, 16, 21] with a resulting B-Tree like
[18, 21]
[5 , 9]
/ | \
[1,2] [6,7] [12, 16]
Obviously this is suboptimal from a performance point of view, but should be acceptable, since obtaining a better one (like the following) would be far more expensive (computationally and economically):
[7 , 16]
/ | \
[5,6] [9,12] [18, 21]
If you want to run it, save the previous code in a Main.hs file and compile it with ghc after prepending
import Data.List (sort)
import Data.List.Split
import System.Environment (getArgs)
main = do
    args <- getArgs
    let knuthOrder = read $ head args
    let keys = (map read $ tail args) :: [Int]
    putStr "smart: "
    putStrLn $ show $ smart knuthOrder keys
    putStr "trivial: "
    putStrLn $ show $ trivial knuthOrder keys
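As a cross-check of the capacity helpers in the Haskell code above, here is the same arithmetic as a short Python sketch. The function names are made up; the formulas mirror layerCapacity, capacityOfBTree and minNumberOfLayersToArrange, with layer i holding (k - 1) * k^i keys, so l layers hold k^l - 1 keys in total.

```python
def btree_capacity(k, layers):
    """Keys that fit in a B-Tree of Knuth order k with the given layer count."""
    # layer i has k**i nodes, each holding at most k - 1 keys
    return sum((k - 1) * k ** i for i in range(layers))


def min_layers(k, num_items):
    """Smallest number of layers whose capacity reaches num_items."""
    layers = 1
    while btree_capacity(k, layers) < num_items:
        layers += 1
    return layers


# The 10-key example with Knuth order 3 needs three layers,
# since two layers hold only 8 keys.
print(btree_capacity(3, 2), min_layers(3, 10))
```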

Algorithm to count the number of valid blocks in a permutation

Given an array A which holds a permutation of 1,2,...,n. A sub-block A[i..j]
of an array A is called a valid block if all the numbers appearing in A[i..j]
are consecutive numbers (may not be in order).
Given an array A= [ 7 3 4 1 2 6 5 8] the valid blocks are [3 4], [1,2], [6,5],
[3 4 1 2], [3 4 1 2 6 5], [7 3 4 1 2 6 5], [7 3 4 1 2 6 5 8]
So the count for above permutation is 7.
Give an O( n log n) algorithm to count the number of valid blocks.
Ok, I am down to 1 rep because I put 200 bounty on a related question: Finding sorted sub-sequences in a permutation
so I cannot leave comments for a while.
I have an idea:
1) Locate all permutation groups. They are: (78), (34), (12), (65). Unlike in group theory, their order and position, and whether they are adjacent, matter. So a group (78) can be represented as a structure (7, 8, false), while (34) would be (3,4,true). I am using Python's notation for tuples, but it might actually be better to use a whole class for the group. Here true or false means contiguous or not. Two groups are "adjacent" if (max(gp1) == min(gp2) + 1 or max(gp2) == min(gp1) + 1) and contiguous(gp1) and contiguous(gp2). This is not the only condition for union(gp1, gp2) to be contiguous, because (14) and (23) combine into (14) nicely. This is a great question for algo class homework, but a terrible one for an interview. I suspect this is homework.
Just some thoughts:
At first sight, this sounds impossible: a fully sorted array would have O(n^2) valid sub-blocks.
So, you would need to count more than one valid sub-block at a time. Checking the validity of a sub-block is O(n). Checking whether a sub-block is fully sorted is O(n) as well. A fully sorted sub-block contains n·(n - 1)/2 valid sub-blocks, which you can count without further breaking this sub-block up.
Now, the entire array is obviously always valid. For a divide-and-conquer approach, you would need to break this up. There are two conceivable breaking points: the location of the highest element, and that of the lowest element. If you break the array into two at one of these points, including the extremum in the part that contains the second-to-extreme element, there cannot be a valid sub-block crossing this break-point.
By always choosing the extremum that produces a more even split, this should work quite well (average O(n log n)) for "random" arrays. However, I can see problems when your input is something like (1 5 2 6 3 7 4 8), which seems to produce O(n^2) behaviour. (1 4 7 2 5 8 3 6 9) would be similar (I hope you see the pattern). I currently see no trick to catch this kind of worst case, but it seems that it requires other splitting techniques.
This question does involve a bit of a "math trick", but it's fairly straightforward once you get it. However, the rest of my solution won't fit the O(n log n) criteria.
The math portion:
For any two consecutive numbers, their sum is 2k+1, where k is the smallest element. For three it is 3k+3, for four it is 4k+6, and for N such numbers it is Nk + sum(1,N-1), i.e. Nk + N(N-1)/2. Hence, you need two steps, which can be done simultaneously:
Create the sum of all the sub-arrays.
Determine the smallest element of a sub-array.
The dynamic programming portion
Build two tables using the results of the previous row's entries to build each successive row's entries. Unfortunately, I'm totally wrong as this would still necessitate n^2 sub-array checks. Ugh!
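The sum test itself can be sketched as follows. This is only an illustration of the arithmetic under the assumption that all elements are distinct (true for a permutation); it keeps a running sum and minimum per starting index instead of tables, and it is still quadratic, as noted above. The name `count_by_sum_and_min` is hypothetical.

```python
def count_by_sum_and_min(a):
    """Count valid blocks of length >= 2 using the 'N*k + sum(1, N-1)' test."""
    n = len(a)
    count = 0
    for i in range(n):
        total = smallest = a[i]
        for j in range(i + 1, n):
            total += a[j]
            smallest = min(smallest, a[j])
            length = j - i + 1
            # N distinct integers whose minimum is k sum to
            # N*k + (1 + 2 + ... + (N-1)) only if they are k, k+1, ..., k+N-1.
            if total == length * smallest + length * (length - 1) // 2:
                count += 1
    return count


print(count_by_sum_and_min([7, 3, 4, 1, 2, 6, 5, 8]))  # prints 7
```

The test is sufficient because N(N-1)/2 over the minimum multiple is the smallest sum that N distinct integers, all at least k, can reach.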
My proposition
STEP = 2 // number of elements examined
B [0,0,0,0,0,0,0,0]
B [1,1,0,0,0,0,0,0]
VALID(A,B) - if not valid move one
B [0,1,1,0,0,0,0,0]
VALID(A,B) - if valid move one and step
B [0,0,0,1,1,0,0,0]
B [0,0,0,0,0,1,1,0]
STEP = 3
B [1,1,1,0,0,0,0,0] not ok
B [0,1,1,1,0,0,0,0] ok
B [0,0,0,0,1,1,1,0] not ok
STEP = 4
B [1,1,1,1,0,0,0,0] not ok
B [0,1,1,1,1,0,0,0] ok
CON <- 0
STEP <- 2
while STEP <= N
    i <- 0
    j <- STEP
    while j <= N
        if VALID(A, B)
            CON <- CON + 1
            i <- j + 1
            j <- j + STEP
        else
            i <- i + 1
            j <- j + 1
    STEP <- STEP + 1
The VALID method checks that all the selected elements are consecutive.
Never tested, but it might be OK.
The original array doesn't contain duplicates, so it must itself be a consecutive block. Let's call this block (1 ~ n). We can test whether block (2 ~ n) is consecutive by checking if the first element is 1 or n, which is O(1). Likewise, we can test block (1 ~ n-1) by checking whether the last element is 1 or n.
I can't quite mould this into a solution that works but maybe it will help someone along...
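The O(1) end checks described above might look like this. It is just a sketch of that one observation, not a full solution; `shrink_checks` is a made-up name and the input is assumed to be a permutation of 1..n.

```python
def shrink_checks(a):
    """For a permutation a of 1..n, report whether dropping the first
    or the last element leaves a block of consecutive numbers."""
    n = len(a)
    # Removing an end element keeps the values consecutive only if that
    # element is the current minimum (1) or the current maximum (n).
    drop_first_ok = a[0] in (1, n)
    drop_last_ok = a[-1] in (1, n)
    return drop_first_ok, drop_last_ok


print(shrink_checks([7, 3, 4, 1, 2, 6, 5, 8]))  # prints (False, True)
```

For the example permutation this agrees with the question: [7 3 4 1 2 6 5] is a valid block, while [3 4 1 2 6 5 8] is not.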
Like everybody else, I'm just throwing this out ... it works for the single example below, but YMMV!
The idea is to count the number of illegal sub-blocks, and subtract this from the total possible number. We count the illegal ones by examining each array element in turn and ruling out sub-blocks that include the element but not its predecessor or successor.
Foreach i in [1,N], compute B[A[i]] = i.
Let Count = the total number of sub-blocks with length>1, which is N-choose-2 (one for each possible combination of starting and ending index).
Foreach i, consider A[i]. Ignoring edge cases, let x=A[i]-1, and let y=A[i]+1. A[i] cannot participate in any sub-block that does not include x or y. Let iX=B[x] and iY=B[y]. There are several cases to be treated independently here. The general case is that iX < i < iY. In this case, we can eliminate the sub-block A[iX+1 .. iY-1] and all intervening blocks containing i. There are (i - iX + 1) * (iY - i + 1) such sub-blocks, so call this number Eliminated. (Other cases left as an exercise for the reader, as are those edge cases.) Set Count = Count - Eliminated.
Return Count.
The total cost appears to be N * (cost of step 2) = O(N).
WRINKLE: In step 2, we must be careful not to eliminate each sub-interval more than once. We can accomplish this by only eliminating sub-intervals that lie fully or partly to the right of position i.
A = [1, 3, 2, 4]
B = [1, 3, 2, 4]
Initial count = (4*3)/2 = 6
i=1: A[i]=1, so need sub-blocks with 2 in them. We can eliminate [1,3] from consideration. Eliminated = 1, Count -> 5.
i=2: A[i]=3, so need sub-blocks with 2 or 4 in them. This rules out [1,3] but we already accounted for it when looking right from i=1. Eliminated = 0.
i=3: A[i] = 2, so need sub-blocks with [1] or [3] in them. We can eliminate [2,4] from consideration. Eliminated = 1, Count -> 4.
i=4: A[i] = 4, so we need sub-blocks with [3] in them. This rules out [2,4] but we already accounted for it when looking right from i=3. Eliminated = 0.
Final Count = 4, corresponding to the sub-blocks [1,3,2,4], [1,3,2], [3,2,4] and [3,2].
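The set-up for this example (the position table B and the initial count) can be reproduced with a small sketch. It covers only those first two steps, since the elimination cases are left open above; `inverse_positions` is a hypothetical helper name, and positions are 1-based to match the text.

```python
def inverse_positions(a):
    """Return B with B[v] = 1-based position of value v in a (B[0] is unused)."""
    b = [0] * (len(a) + 1)
    for i, v in enumerate(a, start=1):
        b[v] = i
    return b


a = [1, 3, 2, 4]
b = inverse_positions(a)
# Number of sub-blocks with length > 1: one per (start, end) pair.
initial_count = len(a) * (len(a) - 1) // 2

print(b[1:])          # prints [1, 3, 2, 4]
print(initial_count)  # prints 6
```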
(This is an attempt to do this N.log(N) worst case. Unfortunately it's wrong -- it sometimes undercounts. It incorrectly assumes you can find all the blocks by looking at only adjacent pairs of smaller valid blocks. In fact you have to look at triplets, quadruples, etc, to get all the larger blocks.)
You do it with a struct that represents a subblock and a queue for subblocks.
struct c_subblock {
    int index ; /* index into original array, head of subblock */
    int width ; /* width of subblock > 0 */
    int lo_value; /* lowest value in the subblock */
    c_subblock * p_above ; /* null or subblock above with same index */
};
Alloc an array of subblocks the same size as the original array, and init each subblock to have exactly one item in it. Add them to the queue as you go. If you start with array [ 7 3 4 1 2 6 5 8 ] you will end up with a queue like this:
queue: ( [7,7] [3,3] [4,4] [1,1] [2,2] [6,6] [5,5] [8,8] )
The { index, width, lo_value, p_above } values for subblock [7,7] will be { 0, 1, 7, null }.
Now it's easy. Forgive the c-ish pseudo-code.
loop {
    c_subblock * const p_left = Pop subblock from queue.
    int const right_index = p_left.index + p_left.width;
    if ( right_index < length original array ) {
        // Find adjacent subblock on the right.
        // To do this you'll need the original array of length-1 subblocks.
        c_subblock const * p_right = array_basic_subblocks[ right_index ];
        do {
            Check the left/right subblocks to see if the two merged are also a subblock.
            If they are add a new merged subblock to the end of the queue.
            p_right = p_right.p_above;
        } while ( p_right );
    }
}
This will find them all, I think. It's usually O(N log(N)), but it'll be O(N^2) for a fully sorted or anti-sorted list. I think there's an answer to this though: when you build the original array of subblocks, you look for sorted and anti-sorted sequences and add them as the base-level subblocks. If you are keeping a count, increment it by (width * (width + 1))/2 for the base-level. That'll give you the count INCLUDING all the 1-length subblocks.
After that just use the loop above, popping and pushing the queue. If you're counting you'll have to have a multiplier on both the left and right subblocks and multiply these together to calculate the increment. The multiplier is the width of the leftmost (for p_left) or rightmost (for p_right) base-level subblock.
Hope this is clear and not too buggy. I'm just banging it out, so it may even be wrong.
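The "check whether the two merged are also a subblock" step can be sketched like this. It is an assumption-laden illustration of the merge test only: the `Subblock` class mirrors the struct fields (without p_above), the `merge` function is made up, and it does not address the queue logic or the counting problems of the overall approach.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Subblock:
    index: int     # position of the first element in the original array
    width: int     # number of elements, > 0
    lo_value: int  # smallest value in the block


def merge(left: Subblock, right: Subblock) -> Optional[Subblock]:
    """Merge two valid sub-blocks that sit side by side in the array.
    Returns None when they are not side by side or the union is not valid."""
    if left.index + left.width != right.index:
        return None
    # A valid block of width w holds the values lo_value .. lo_value + w - 1,
    # so the union is valid only when the two value ranges touch.
    left_below = left.lo_value + left.width == right.lo_value
    right_below = right.lo_value + right.width == left.lo_value
    if left_below or right_below:
        lo = min(left.lo_value, right.lo_value)
        return Subblock(left.index, left.width + right.width, lo)
    return None


# [3 4] followed by [1 2] in [7 3 4 1 2 6 5 8] merges into [3 4 1 2].
print(merge(Subblock(1, 2, 3), Subblock(3, 2, 1)))
```

Because both inputs are already valid, storing only lo_value and width is enough; the highest value never has to be kept separately.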
[Later note. This doesn't work after all. See note below.]
