Polyphase merge sort - what is the number of phases - algorithm

Suppose that we have to sort some big set of numbers externally. We want to examine 2 cases:
4 tapes: 2 input tapes, 2 output
3 tapes: 2 in, 1 out
Case 1:
We start with k runs and copy them to the 2 input tapes (on the left in the pic below). In each step we take one run from each input tape, merge them, and write the merged run alternately to the first output tape and then to the second one, as shown below. Then we switch the output tapes with the input ones and repeat the procedure. So if we have, let's say, n=10^6 elements and k=1000 runs, the run size will be 2000 after the first phase, 4000 after the second, and so on. So the total number of phases is ceil(log_2(k)).
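To make case 1 concrete, here is a tiny sketch of that bookkeeping (my own illustration, not from any particular implementation): each phase roughly halves the run count and doubles the typical run size, so the phase count is ceil(log2(k)).

    from math import ceil, log2

    def balanced_merge_phases(n, k):
        # n elements split into k initial runs of size n // k
        run_size, runs, phases = n // k, k, 0
        while runs > 1:
            runs = ceil(runs / 2)   # pairs of runs are merged onto the output tapes
            run_size *= 2           # typical size; an odd run out is just carried over
            phases += 1
        return phases

    print(balanced_merge_phases(10**6, 1000))   # 10 == ceil(log2(1000))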
Case 2:
In the best case, the number of phases is the position of the run count in the Fibonacci sequence minus two; i.e. if our initial number of runs is k=34, and 34 is the 9th Fibonacci number, then we will have 7 phases.
But… if our number of runs isn't a Fibonacci number, it is necessary to pad the tape with dummy runs in order to bring the number of runs up to a Fibonacci number.
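And a tiny simulation of case 2 (my own sketch, assuming the k runs are already split across the two input tapes as consecutive Fibonacci numbers): each phase merges min(a, b) runs onto the empty tape, and F(n) runs take n - 2 phases.

    def polyphase_3tape_phases(a, b):
        # a, b = run counts on the two input tapes (e.g. 21 and 13 for k = 34)
        phases = 0
        while a + b > 1:
            a, b = min(a, b), abs(a - b)   # min(a, b) merged runs land on the third tape
            phases += 1
        return phases

    print(polyphase_3tape_phases(21, 13))   # 7 phases for k = 34 = F(9)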
Finally, my question is:
What is the average-case number of phases needed in order to sort a sequence, when the number of runs isn’t a Fibonacci number?

What is the number of phases ... when number of runs isn’t a Fibonacci number?
If the run count is not an ideal number, then the sort will take one extra phase, similar to rounding the run count up to the next ideal number. Dummy runs don't need to occupy any space on the tapes, but the code has to handle reaching the end of data on more than one tape during a phase on non-ideal distributions.
Some notes about the information in the original question:
The 4 tape example shows a balanced 2-way merge sort. For polyphase merge sort, there's only one output tape per phase. With 4 tape drives, the initial setup distributes runs between the 3 other drives, so after the initial distribution, it is always 3 input tapes, 1 output tape.
The Fibonacci numbers only apply to a 3 tape scenario. For a 4 or more tape scenario, the sequence is easiest to generate by starting at the final phase and working backwards. For 31 runs on 4 tapes, the final run count is {1,0,0,0},
working backwards: {0,1,1,1}, {1,0,2,2}, {3,2,0,4}, {7,6,4,0}, {0,13,11,7}.
The run sizes increase as the result of merging prior runs of various sizes. Assume run size is 1 element, 31 runs, 4 tapes. After initial distribution, run count:run size is {0:0,13:1,11:1,7:1}. First phase: {7:3,6:1,4:1,0:0}. Second phase: {3:3,2:1,0:0,4:5}. Third phase {1:3,0:0,2:9,2:5}. Fourth phase: {0:0,1:17,1:9,1:5}. Fifth and final phase {1:31,0:0,0:0,0:0}.
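Here is a small sketch (mine, not from the answer) of that backward generation: starting from the final state, reverse one phase at a time by emptying the fullest tape onto the others until the total reaches the desired run count.

    def polyphase_distribution(tapes, min_runs):
        counts = [1] + [0] * (tapes - 1)     # final phase: one run on one tape
        history = [counts[:]]
        while sum(counts) < min_runs:
            # Reverse a phase: the tape that received the merged runs becomes
            # empty, and every other tape gains that many runs.
            moved = max(counts)
            src = counts.index(moved)
            counts = [c + moved for c in counts]
            counts[src] = 0
            history.append(counts[:])
        return history

    # For 31 runs on 4 tapes this reproduces the sequence above:
    # {1,0,0,0}, {0,1,1,1}, {1,0,2,2}, {3,2,0,4}, {7,6,4,0}, {0,13,11,7}
    for row in polyphase_distribution(4, 31):
        print(row)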
Keeping track of run sizes can get complex, so a simple solution for tapes is to use a single file mark to indicate the end of a run and a double file mark to indicate the end of data.
Wiki has an article on polyphase merge sort.
https://en.wikipedia.org/wiki/Polyphase_merge_sort
If the total run count is known in advance, the initial distribution can include initial merge operations to get the run count to an ideal number, but now the run sizes vary due to the initial merge operations, so each tape ends up with a mix of run sizes. Again, using file marks to indicate end of runs eliminates having to keep track of run sizes in memory.
Polyphase merge sort is the fastest way to do a sort using 3 stacks.

Related

Shuffle sequential numbers without a buffer

I am looking for a shuffle algorithm to shuffle a set of sequential numbers without buffering. Another way to state this is that I’m looking for a random sequence of unique numbers that have a given period.
Your typical Fisher–Yates shuffle needs to hold all of the elements it is going to shuffle, so that isn't going to work.
A Linear-Feedback Shift Register (LFSR) does what I want, but only works for periods that are powers-of-two less two. Here is an example of using a 4-bit LFSR to shuffle the numbers 1-14:
Input:   1  2  3  4  5  6  7  8  9 10 11 12 13 14
Output:  8 12 14  7  4 10  5 11  6  3  2  1  9 13
The first row is the input, and the second row the output. What's nice is that the state is very small: just the current index. You can start at any index and get a different set of numbers (starting at 1 yields 8, 12, 14, ...; starting at 9 yields 6, 3, 2, ...), although the sequence is always the same (5 is always followed by 11). If I want a different sequence, I can pick a different generator polynomial.
The limitations of the LFSR are that the period is always a power of two less two (the min and max are always the same, thus unshuffled) and that there are not enough generator polynomials to allow every possible random sequence.
A block cipher algorithm would work. Every key produces a uniquely shuffled set of numbers. However, all block ciphers (that I know about) have power-of-two block sizes, and usually a fixed or limited set of block sizes. A block cipher with an arbitrary non-binary block size would be perfect, if such a thing exists.
There are a couple of projects I have that could benefit from such an algorithm. One is for small embedded micros that need to produce a shuffled sequence of numbers with a period larger than the memory they have available (think Arduino Uno needing to shuffle 1 to 100,000).
Does such an algorithm exist? If not, what things might I search for to help me develop such an algorithm? Or is this simply not possible?
Edit 2022-01-30
I have received a lot of good feedback and I need to better explain what I am searching for.
In addition to the Arduino example, where memory is an issue, there is also the shuffle of a large number of records (billions to trillions). The desire is to have a shuffle applied to these records without needing a buffer to hold the shuffle order array, or the time needed to build that array.
I do not need an algorithm that could produce every possible permutation, but a large number of permutations. Something like a typical block cipher in counter mode where each key produces a unique sequence of values.
A Linear Congruential Generator using coefficients to produce the desired sequence period will only produce a single sequence. This is the same problem for a Linear Feedback Shift Register.
Format-Preserving Encryption (FPE), such as AES FFX, shows promise and is where I am currently focusing my attention. Additional feedback welcome.
It is certainly not possible to produce an algorithm which could potentially generate every possible sequence of length N with less than N(log2 N - 1.45) bits of state, because there are N! possible sequences and each state can generate exactly one sequence. If your hypothetical Arduino application could produce every possible sequence of 100,000 numbers, it would require at least 1,516,705 bits of state, a bit more than 185 KiB, which is probably more memory than you want to devote to the problem [Note 1].
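For what it's worth, that figure can be reproduced in a couple of lines (a sketch using Python's standard lgamma, as Note 1 describes):

    from math import lgamma, log, ceil

    N = 100_000
    bits = lgamma(N + 1) / log(2)   # log2(N!) = minimum state to reach every permutation
    print(ceil(bits))               # 1516705 bits, roughly 185 KiB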
That's also a lot more memory than you would need for the shuffle buffer; that's because the PRNG driving the shuffle algorithm also doesn't have enough state to come close to being able to generate every possible sequence. It can't generate more different sequences than the number of different possible states that it has.
So you have to make some compromise :-)
One simple approach is to start with some parametrisable generator which can produce non-repeating sequences for a large variety of block sizes. Then you choose a block size which is at least as large as your target range but not "too much larger"; say, less than twice as large. Then you select a subrange of the block and start generating numbers. If the generated number is inside the subrange, you return its offset; if not, you throw it away and generate another number. If the generator's range is less than twice the desired range, then you will throw away less than half of the generated values, and producing the next element in the sequence is amortised O(1). In theory it might take a long time to generate an individual value, but that's not very likely, and if you use a not-very-good PRNG like a linear congruential generator, you can make it very unlikely indeed by restricting the possible generator parameters.
For LCGs you have a couple of possibilities. You could use a power-of-two modulus, with an odd offset and a multiplier which is 5 mod 8 (and not too far from the square root of the block size), or you could use a prime modulus with almost arbitrary offset and multiplier. Using a prime modulus is computationally more expensive but the deficiencies of LCG are less apparent. Since you don't need to handle arbitrary primes, you can preselect a geometrically-spaced sample and compute the efficient division-by-multiplication algorithm for each one.
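A minimal sketch of that idea (my own illustrative code, not the answerer's; the constants are arbitrary): a full-period LCG over the next power of two at or above n, with out-of-range outputs rejected.

    def lcg_permutation(n, increment=12345, seed=0):
        m = 1
        while m < n:
            m <<= 1                          # power-of-two modulus >= n
        # multiplier near sqrt(m), forced to 5 mod 8; with an odd increment
        # this gives the LCG a full period of m (Hull-Dobell conditions)
        multiplier = (int(m ** 0.5) | 5) & ~2
        x = seed % m
        for _ in range(m):
            x = (multiplier * x + increment) % m
            if x < n:                        # rejection step: keep only 0..n-1
                yield x

    perm = list(lcg_permutation(10))
    assert sorted(perm) == list(range(10))   # each value appears exactly once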
Since you're free to use any subrange of the generator's range, you have an additional potential parameter: the offset of the start of the subrange. (Or even offsets, since the subrange doesn't need to be contiguous.) You can also increase the apparent randomness by doing any bijective transformation (XOR/rotates are good, if you're using a power-of-two block size.)
Depending on your application, there are known algorithms to produce block ciphers for subword bit lengths [Note 2], which gives you another possible way to increase randomness and/or add some more bits to the generator state.
Notes
Note 1: The approximation for the minimum number of states comes directly from Stirling's approximation for N!, but I computed the number of bits using the commonly available lgamma function.
Note 2: With about 30 seconds of googling, I found this paper on researchgate.net; I'm far from knowledgeable enough in crypto to offer an opinion, but it looks credible; also, there are references to other algorithms in its footnotes.

Optimal k-way merge pattern

I need to merge n sorted fixed record files of different sizes using k simultaneous consumers, where k<n. Because k is (possibly a lot) smaller than n, the merge will be done in a number of iterations/steps. The challenge is to pick at each step the right files to merge.
Because the files can differ wildly in size, a simple greedy approach of using all k consumers at each step can be very suboptimal.
A simple example makes this clear. Consider the case of 4 files with 1, 1, 10 and 10 records respectively and 3 consumers. We need two merge steps to merge all files. Start with 3 consumers in the first step. The merge sequence ((1,1,10),10) leads to 12 read/write operations in (inner) step 1 and 22 operations in (outer) step 2, making a total of 34 ops. The sequence (1,(1,10,10)) is even worse with 21+22=43 ops. By contrast, if we use only 2 consumers in the first step and 3 in the second step, the merge pattern ((1,1),10,10) takes only 2+22=24 ops. Here our restraint pays off handsomely.
My solution for picking the right number of consumers at each step is the following. All possible merge states can be ordered into a directed graph (which is a lattice I suppose) with the number of ops to move from one state to another attached to each edge as the cost. I can then use a shortest path algorithm to determine the optimal sequence.
The problem with this solution is that the number of nodes explodes, even with a modest number of files (say hundreds) and even after applying some sensible constraints (like sorting the files by size and allowing only merges of the top 2..k of that list). Moreover, I cannot shake the feeling that there might be an "analytical" solution to this problem, or at least a simple heuristic that comes very close to optimality.
Any thoughts would be appreciated.
May I present it another way:
The traditional merge sort complexity is O(n ln n), but in my case, with different sublist sizes, the worst case may be O(n²) if one file is big and all the others are small (that's the example you give), which is bad performance.
The question is: how to schedule the sub-sorts in an optimal way?
Precomputing the graph of all executions is really too big; in the worst case it can be as big as the data you sort.
My proposal is to compute it on the fly, accepting that it may not be optimal, but at least avoiding the worst case.
My first naive idea was simply to sort the files by size and begin with the smallest ones: this way you favour eliminating small files during the iterations.
I have K=2:
in your example 1 1 10 10 -> 2 20 -> 22: it is still (20 + 2) + 22 CC, so 44 CC.
CC: comparison or copy: these are the ops I count as complexity 1.
If I have K=1 and reinject the result into my sorted file array I get:
(1 1 10 10) -> 2 10 10 -> 12 10 -> (22): 2 CC + 12 + 22 = 36
For different values of K the cost varies slightly.
Computing the average-case complexity of this algorithm probabilistically would be very interesting, if you can accept some N² executions in bad cases.
PS:
The fact that k<n is another problem: it can be resolved simply by adding a worker per pair of files to a queue (n/2 workers at the beginning), and having the queue read by the k threads.
Firstly an alternative algorithm
read all record keys (N reads) with a fileid
sort them
read all files and place the records in the final position according to the sorted key (N R/W)
This might be a problem if your filesystem can't handle N+1 open files, or if random file access is slow for either read or write; i.e. you can arrange for the random access to fall on whichever of read or write is faster.
The advantage is only N*2 reads and N writes.
Back to your algorithm
Does it pay to merge the large files with small files at a random point in the merging? No
E.g. (1,1,10,10) -> ((1,10),(1,10)) [2*11 ops] -> (11,11) [22 ops] sum 44. ((1,1),10,10) is only 24.
Merging large and small files causes the content of the large files to be read/written an extra time.
Does it pay to merge the large files first? No.
E.g. (1,10,10,10) -> (1,10,(10,10)) 20+31=51 ops vs. ((1,10),10,10) 11+31=42 ops.
Again we get a penalty for doing the ops on the large file multiple times.
Does it ever pay to merge less than K files at the last merge? No.
E.g. (1,2,3,4,5,6) -> (((1,2),3,4),5,6) 3+10+21=34 vs ((1,2,3),(4,5,6)) 6+15+21=42.
Again, merging the largest files more times is a bad idea.
Does it pay to merge less than K files at merges other than the last? Yes.
E.g. #1 (1,2,3,4,5,6) -> (((1,2),3,4),5,6) 3+10+21=34 vs (((1,2,3),4),5,6) 6+10+21=37;
the size-3 file gets copied an extra time.
E.g. #2 (((1,1),10),100,100). Here we use k=2 in the first two steps, taking 2+12+212=226 ops. The alternative (((1,1),10,100),100), which uses k=3 in the second step, is 2+112+212=326 ops.
New heuristic
while #files is larger than 1
    sum the sizes of the smallest files until you have K of them or the next larger file is greater than the sum
    K-merge these (a sketch of this loop is given below)
ToDo: prove that the sum of merge costs in this scheme is smaller than for all other methods.
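Here is a rough sketch of that loop (my own reading of the heuristic; I also assume at least two files are always taken per merge so the loop makes progress). It reproduces the 24-op result for the (1,1,10,10) example:

    import heapq

    def heuristic_merge_cost(sizes, k):
        heap = list(sizes)
        heapq.heapify(heap)
        total = 0
        while len(heap) > 1:
            # always take at least two files, then keep adding the next
            # smallest file while it is no larger than the running sum
            group = [heapq.heappop(heap), heapq.heappop(heap)]
            while heap and len(group) < k and heap[0] <= sum(group):
                group.append(heapq.heappop(heap))
            merged = sum(group)
            total += merged            # cost of this merge = records written
            heapq.heappush(heap, merged)
        return total

    print(heuristic_merge_cost([1, 1, 10, 10], 3))        # 24
    print(heuristic_merge_cost([1, 1, 10, 100, 100], 3))  # 226, as in e.g. #2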

Tim Sort Merging Arrays Part

Suppose I have these integers: 6, 1, 4, 2, 1, 5, 9, 6, 3, 4, and the run size is 2, so we start by insertion sorting each run and get these subarrays:
1-6, 2-4, 1-5, 6-9, 3-4
My question is: how do I merge them to get the sorted array? Do I merge two arrays at a time, then the results, and so on?
Once you create the initial runs, you then merge the runs. Timsort uses a stack to keep track of run boundaries, and uses the top 3 entries on the stack to decide which runs to merge so that the merges stay "balanced" while maintaining "stability". A queue (FIFO) could be used instead of a stack (LIFO), although I'm not sure that would still technically be timsort. With 10 elements, a run size of 3 would take one less merge pass. Timsort normally uses a larger minimum run size, 32 to 64 (inclusive), using insertion sort to force the minimum run size if a natural run is smaller than its calculated ideal minimum run size. Link to the wiki article:
https://en.wikipedia.org/wiki/Timsort
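For the concrete runs above, a stripped-down sketch (a plain FIFO pairwise merge, not timsort's actual stack rules) looks like this:

    from collections import deque

    def merge(a, b):
        out, i, j = [], 0, 0
        while i < len(a) and j < len(b):
            if a[i] <= b[j]:               # <= keeps the merge stable
                out.append(a[i]); i += 1
            else:
                out.append(b[j]); j += 1
        out.extend(a[i:]); out.extend(b[j:])
        return out

    runs = deque([[1, 6], [2, 4], [1, 5], [6, 9], [3, 4]])
    while len(runs) > 1:
        runs.append(merge(runs.popleft(), runs.popleft()))
    print(runs[0])   # [1, 1, 2, 3, 4, 4, 5, 6, 6, 9]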

Finding the Nth largest value in a group of numbers as they are generated

I'm writing a program that needs to find the Nth largest value in a group of numbers. These numbers are generated by the program, but I don't have enough memory to store N numbers. Is there a better upper bound than N that can be achieved for storage? The upper bound for the size of the group of numbers (and for N) is approximately 100,000,000.
Note: The numbers are decimals and the list can include duplicates.
[Edit]: My memory limit is 16 MB.
This is a multipass algorithm (therefore, you must be able to generate the same list multiple times, or store the list off to secondary storage).
First pass:
Find the highest value and the lowest value. That's your initial range.
Passes after the first:
Divide the range up into 10 equally spaced bins. We don't need to store any numbers in the bins; we're just going to count membership in the bins. So we just have an array of integers (or bigints--whatever can accurately hold our counts). Note that 10 is an arbitrary choice for the number of bins; your sample size and distribution will determine the best choice.
Spin through each number in the data, incrementing the count of whichever bin holds the number you see.
Figure out which bin holds your answer, and add the count of numbers above that bin to your running count of numbers above the winning bin.
The winning bin's top and bottom range are your new range.
Loop through these steps again until you have enough memory to hold the numbers in the current bin.
Last pass:
You should know how many numbers are above the current bin by now.
You have enough storage to grab all the numbers within your range of the current bin, so you can spin through and grab the actual numbers. Just sort them and grab the correct number.
Example: if the range you see is 0.0 through 1000.0, your bins' ranges will be:
[0.0 - 100.0]
(100.0 - 200.0]
(200.0 - 300.0]
...
(900.0 - 1000.0]
If you find through the counts that your number is in the (100.0 - 200.0] bin, your next set of bins will be:
(100.0 - 110.0]
(110.0 - 120.0]
etc.
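A rough sketch of the whole loop (mine, with simplified boundary handling; generate is a placeholder for whatever replays the same stream of numbers):

    def nth_largest(generate, n, bins=10, memory_limit=10**6):
        lo, hi = min(generate()), max(generate())
        remaining = n                              # rank counted from the top
        while True:
            if sum(lo <= x <= hi for x in generate()) <= memory_limit:
                survivors = sorted((x for x in generate() if lo <= x <= hi),
                                   reverse=True)
                return survivors[remaining - 1]    # final pass: sort what fits
            width = (hi - lo) / bins
            counts = [0] * bins
            for x in generate():
                if lo <= x <= hi:
                    counts[min(int((x - lo) / width), bins - 1)] += 1
            # walk bins from the top until the running count reaches the rank
            for b in range(bins - 1, -1, -1):
                if counts[b] >= remaining:
                    lo, hi = lo + b * width, lo + (b + 1) * width
                    break
                remaining -= counts[b]

    data = [0.5, 3.2, 7.7, 1.1, 9.9, 7.7, 4.4]
    print(nth_largest(lambda: iter(data), 3, memory_limit=4))   # 7.7, duplicates count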
Another multipass idea:
Simply do a binary search. Choose the midpoint of the range as the first guess. Your passes just need to do an above/below count to determine the next estimate (which can be weighted by the count, or a simple average for code simplicity).
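A sketch of that variant (mine; it assumes the stream can be replayed and, to keep the termination condition trivial, that the values are integers):

    def nth_largest_by_bisection(generate, n):
        lo, hi = min(generate()), max(generate())
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if sum(x >= mid for x in generate()) >= n:
                lo = mid          # at least n values are >= mid, so the answer is >= mid
            else:
                hi = mid - 1
        return lo

    print(nth_largest_by_bisection(lambda: iter([5, 3, 9, 9, 1, 7]), 3))   # 7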
Are you able to regenerate the same group of numbers from start? If you are, you could make multiple passes over the output: start by finding the largest value, restart the generator, find the largest number smaller than that, restart the generator, and repeat this until you have your result.
It's going to be a real performance killer, because you have a lot of numbers and a lot of passes will be required - but memory-wise, you will only need to store 2 elements (the current maximum and a "limit", the number you found during the last pass) and a pass counter.
You could speed it up by using a priority queue to find the M largest elements (choosing some M that you are able to fit in memory), allowing you to reduce the number of passes required to about N/M.
If you need to find, say, the 10th largest element in a list of 15 numbers, you could save time by working the other way around. Since it is the 10th largest element, that means there are 15-10=5 elements smaller than this element - so you could look for the 6th smallest element instead.
This is similar to another question -- C Program to search n-th smallest element in array without sorting? -- where you may get some answers.
The logic will work for Nth largest/smallest search similarly.
Note: I am not saying this is a duplicate of that.
Since you have a lot of numbers (nearly a billion?), here is another way to optimise space.
Let's assume your numbers fit in 32-bit values, so about 1 billion of them would need close to 4 GB of space to store directly. Now, if you can afford about 128 MB of working memory, we can do this in one pass.
Imagine a 1-billion-bit vector stored as an array of 32-bit words
Let it be initialized to all zeros
Start running through your numbers and set the bit position corresponding to the value of each number
When you are done with the pass, scan the bit vector from the top down, counting set bits until you reach the Nth one (see the sketch below)
That bit's position gives you the value of your Nth largest number
You have actually sorted all the numbers in the process (however, the count of duplicates is not tracked)
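A small sketch of the same idea (mine; it assumes non-negative integers below some known limit and, as noted, ignores duplicates):

    def nth_largest_bitvector(numbers, n, limit):
        bits = bytearray(limit // 8 + 1)
        for x in numbers:
            bits[x >> 3] |= 1 << (x & 7)       # set bit x
        seen = 0
        for x in range(limit, -1, -1):         # scan from the top down
            if bits[x >> 3] & (1 << (x & 7)):
                seen += 1
                if seen == n:
                    return x

    print(nth_largest_bitvector([5, 3, 9, 1, 7], 2, limit=15))   # 7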
If I understood well, the upper bound on memory usage for your program is O(N) (possibly N+1). You can maintain a list of the generated values that are greater than the current X (the Nth largest value so far), ordered lowest first. As soon as a new greater value is generated, you can replace the current X with the first element of the list and insert the just-generated value at its corresponding position in the list.
sort -n | uniq -c and the Nth should be the Nth row

Sort numbers by sum algorithm

I have a language-agnostic question about an algorithm.
This comes from a (probably simple) programming challenge I read. The problem is, I'm too stupid to figure it out, and curious enough that it is bugging me.
The goal is to sort a list of integers to ascending order by swapping the positions of numbers in the list. Each time you swap two numbers, you have to add their sum to a running total. The challenge is to produce the sorted list with the smallest possible running total.
Examples:
3 2 1 - 4
1 8 9 7 6 - 41
8 4 5 3 2 7 - 34
Though you are free to just give the answer if you want, if you'd rather offer a "hint" in the right direction (if such a thing is possible), I would prefer that.
Only read the first two paragraphs if you just want a hint. There is an efficient solution to this (unless I made a mistake, of course). First sort the list. Now we can write the original list as a product of disjoint cycles.
For example 5,3,4,2,1 has two cycles, (5,1) and (3,4,2). The cycle (3,4,2) can be read as: 3 is in 2's spot, 4 is in 3's spot, and 2 is in 4's spot. The end goal is 1,2,3,4,5 or (1)(2)(3)(4)(5), five disjoint cycles.
If we switch two elements from different cycles, say 1 and 3, then we get 5,1,4,2,3, or in cycle notation (1,5,3,4,2). The two cycles are joined into one cycle; this is the opposite of what we want to do.
If we switch two elements from the same cycle, say 3 and 4, then we get 5,4,3,2,1, or in cycle notation (5,1)(2,4)(3). The one cycle is split into two smaller cycles. This gets us closer to the goal of all cycles having length 1. Notice that any switch of two elements in the same cycle splits the cycle into two cycles.
If we can figure out the optimal algorithm for resolving one cycle, we can apply it to all cycles and get an optimal algorithm for the entire sort. One algorithm is to take the minimum element in the cycle and switch it with the element whose position it occupies. So for (3,4,2) we would switch 2 with 4. This leaves us with a cycle of length 1 (the element just switched into its correct position) and a cycle of size one smaller than before. We can then apply the rule again. This algorithm switches the smallest element cycle length - 1 times and every other element once.
To transform a cycle of length n into cycles of length 1 takes n - 1 operations. Each element must be operated on at least once (think about each element to be sorted: it has to be moved to its correct position). The algorithm I proposed operates on each element once, which all algorithms must do; every other operation was done on the minimal element. No algorithm can do better.
This algorithm takes O(n log n) to sort, then O(n) to deal with the cycles. Solving one cycle takes O(cycle length); the total length of all cycles is n, so the cost of the cycle operations is O(n). The final run time is O(n log n).
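A compact sketch of this (my own code, not the answerer's). Besides rotating each cycle around its own minimum as described, it also considers borrowing the overall smallest element as a temporary placeholder, the trick used in the worked example and a later answer below; that option is what turns 42 into 41 for the second example.

    def min_swap_cost(nums):
        global_min = min(nums)
        pos = {v: i for i, v in enumerate(sorted(nums))}   # assumes distinct values
        visited = [False] * len(nums)
        total = 0
        for start in range(len(nums)):
            if visited[start] or pos[nums[start]] == start:
                visited[start] = True
                continue
            cycle, i = [], start                 # walk the cycle containing `start`
            while not visited[i]:
                visited[i] = True
                cycle.append(nums[i])
                i = pos[nums[i]]
            s, m, length = sum(cycle), min(cycle), len(cycle)
            # either rotate within the cycle using its own minimum, or borrow
            # the global minimum as a placeholder for the whole cycle
            total += s + min((length - 2) * m, m + (length + 1) * global_min)
        return total

    for xs in ([3, 2, 1], [1, 8, 9, 7, 6], [8, 4, 5, 3, 2, 7]):
        print(xs, min_swap_cost(xs))   # 4, 41, 34 as in the examples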
I'm assuming memory is free and you can simulate the sort before performing it on the real objects.
One approach (that is likely not the fastest) is to maintain a priority queue. Each node in the queue is keyed by the swap cost to get there and it contains the current item ordering and the sequence of steps to achieve that ordering. For example, initially it would contain a 0-cost node with the original data ordering and no steps.
Run a loop that dequeues the lowest-cost queue item, and enqueues all possible single-swap steps starting at that point. Keep running the loop until the head of the queue has a sorted list.
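A toy sketch of that search (mine): states are orderings, edges are single swaps weighted by the sum of the two swapped values, and a best-first (Dijkstra-style) loop pops the cheapest state until it is sorted. It is exponential in general, but fine for tiny lists.

    import heapq

    def cheapest_sort_cost(nums):
        start, goal = tuple(nums), tuple(sorted(nums))
        best = {start: 0}
        heap = [(0, start)]
        while heap:
            cost, state = heapq.heappop(heap)
            if state == goal:
                return cost
            if cost > best.get(state, float('inf')):
                continue                      # stale queue entry
            for i in range(len(state)):
                for j in range(i + 1, len(state)):
                    nxt = list(state)
                    nxt[i], nxt[j] = nxt[j], nxt[i]
                    nxt = tuple(nxt)
                    ncost = cost + state[i] + state[j]
                    if ncost < best.get(nxt, float('inf')):
                        best[nxt] = ncost
                        heapq.heappush(heap, (ncost, nxt))

    print(cheapest_sort_cost([1, 8, 9, 7, 6]))   # 41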
I did a few attempts at solving one of the examples by hand:
1 8 9 7 6
6 8 9 7 1 (+6+1=7)
6 8 1 7 9 (7+1+9=17)
6 8 7 1 9 (17+1+7=25)
6 1 7 8 9 (25+1+8=34)
1 6 7 8 9 (34+1+6=41)
Since you needed to displace the 1, it seems that you may have to do an exhaustive search to complete the problem - the details of which were already posted by another user. Note that you will encounter problems if the dataset is large when doing this method.
If the problem allows for "close" answers, you can simply make a greedy algorithm that puts the largest item into position - either doing so directly, or by swapping the smallest element into that slot first.
Comparisons and traversals apparently come for free, so you can pre-calculate the "distance" a number must travel (and effectively the final sort order). The puzzle is the swap algorithm.
Minimizing overall swaps is obviously important.
Minimizing swaps of larger numbers is also important.
I'm pretty sure an optimal swap process cannot be guaranteed by evaluating each ordering in a stateless fashion, although you might frequently come close (which is not quite what the challenge asks for).
I think there is no trivial solution to this problem, and my approach is likely no better than the priority queue approach.
Find the smallest number, N.
Any pairs of numbers that occupy each others' desired locations should be swapped, except for N.
Assemble (by brute force) a collection of every set of numbers that can be mutually swapped into their desired locations, such that the cost of sorting the set amongst itself is less than the cost of swapping every element of the set with N.
These sets will comprise a number of cycles. Swap within those cycles in such a way that the smallest number is swapped twice.
Swap all remaining numbers, which comprise a cycle including N, using N as a placeholder.
As a hint, this reeks of dynamic programming; that might not be precise enough a hint to help, but I'd rather start with too little!
You are charged by the values you swap, not by the number of comparisons. Nor did you mention being charged for keeping other records.
