Tim Sort Merging Arrays Part - sorting

Suppose I have these integers 6,1,4,2,1,5,9,6,3,4 and the run size is 2, so we start by insertion-sorting each run, and I get these sub-arrays:
1-6, 2-4, 1-5, 6-9, 3-4
My question is: how do I merge them to get the sorted array? Do I merge pairs of runs and then merge those results, and so on?

Once you create the initial runs, you then merge the runs. Timsort uses a stack to keep track of run boundaries, and uses the top 3 entries on the stack to decide which runs to merge so that the merges stay "balanced" while maintaining "stability". A queue (FIFO) could be used instead of a stack (LIFO), although I'm not sure that would still technically be timsort. With 10 elements, a run size of 3 would take one less merge pass. Timsort normally uses a larger minimum run size, 32 to 64 (inclusive), using insertion sort to force the minimum run size if a natural run is smaller than its calculated ideal minimum run size. Link to wiki article:
https://en.wikipedia.org/wiki/Timsort
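For the concrete runs in the question, here is a minimal sketch (plain Python, not timsort's actual stack-based merge selection) of the "merge pairs of runs, then merge those results" idea:

```python
def merge(left, right):
    """Standard stable merge of two sorted lists."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:          # <= keeps the merge stable
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    out.extend(left[i:]); out.extend(right[j:])
    return out

runs = [[1, 6], [2, 4], [1, 5], [6, 9], [3, 4]]   # the insertion-sorted runs
while len(runs) > 1:
    # merge adjacent pairs; a leftover odd run is carried to the next pass
    runs = [merge(runs[i], runs[i + 1]) if i + 1 < len(runs) else runs[i]
            for i in range(0, len(runs), 2)]
print(runs[0])   # [1, 1, 2, 3, 4, 4, 5, 6, 6, 9]
```

Real timsort would instead push run boundaries on a stack and merge whenever the top entries violate its balance invariants, but the end result is the same sorted array.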

Related

Sorting a small array into a large sorted array

What is the best algorithm for merging a large sorted array with a small unsorted array?
I'll give examples of what I mean from my particular use case, but don't feel bound by them: I'm mostly trying to give a feel for the problem.
8 MB sorted array with 92 kB unsorted array (in-cache sort)
2.5 GB sorted array with 3.9 MB unsorted array (in-memory sort)
34 GB sorted array with 21 MB unsorted array (out-of-memory sort)
You can implement a chunk-based algorithm to solve this problem efficiently (whatever the input sizes, as long as one array is much smaller than the other).
First of all, you need to sort the small array (possibly using a radix sort or a bitonic sort if you do not need a custom comparator).
Then the idea is to cut the big array into chunks that fully fit in the CPU cache (e.g. 256 KiB).
For each chunk, find the index of the last item in the small array that is <= the last item of the chunk, using a binary search.
This is relatively fast because the small array likely fits in the cache, and consecutive chunks tend to fetch the same items during the binary search when the big array is large.
This index tells you how many items need to be merged into the chunk before it is written.
For each value to be merged into the chunk, find its insertion index using a binary search in the chunk.
This is fast because the chunk fits in the cache.
Once you know the indices of the values to be inserted in the chunk, you can efficiently move the items by blocks within each chunk (possibly in place, working from the end towards the beginning).
This implementation is much faster than the traditional merge algorithm since the number of comparisons needed is much smaller, thanks to the binary searches and the small number of items inserted per chunk.
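Here is a minimal sketch of the chunked approach in plain Python (out-of-place for clarity; the in-place block moves and the cache-sized chunks are the optimizations described above, and `chunk_size` is an arbitrary placeholder):

```python
import bisect

def chunked_merge(big, small, chunk_size=4096):
    """Merge a small list into a big sorted list, chunk by chunk."""
    small = sorted(small)                    # step 1: sort the small array
    out, pos = [], 0                         # pos = items of `small` consumed so far
    for start in range(0, len(big), chunk_size):
        chunk = big[start:start + chunk_size]
        # binary search: how many small items belong at or before the end of this chunk
        end = bisect.bisect_right(small, chunk[-1], pos)
        for v in small[pos:end]:             # usually few items per chunk
            bisect.insort_right(chunk, v)    # binary search + insert inside the chunk
        pos = end
        out.extend(chunk)
    out.extend(small[pos:])                  # small items larger than everything in big
    return out
```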
For relatively big inputs, you can use a parallel implementation. The idea is to work on a group of multiple chunks at the same time (i.e. super-chunks).
Super-chunks are much bigger than classical ones (e.g. >= 2 MiB).
Each thread works on one super-chunk at a time. A binary search is performed on the small array to know how many values are inserted into each super-chunk.
This number is shared between threads so that each thread knows where it can safely write its output independently of the other threads (one could use a parallel-scan algorithm to do that on massively parallel architectures). Each super-chunk is then split into classical chunks, and the previous algorithm is used to solve the problem in each thread independently.
This method should be more efficient even sequentially when the small input array does not fit in the cache, since the number of binary-search operations over the whole small array will be significantly reduced.
The (amortized) time complexity of the algorithm is O(n (1 + log(m) / c) + m (1 + log(c))), with n the length of the big array, m the length of the small array and c the chunk size (super-chunks are ignored here for the sake of clarity, but they only change the complexity by a constant factor, much like the constant c does).
Alternative method / optimization: if your comparison operator is cheap and can be vectorized using SIMD instructions, then you can optimize the traditional merge algorithm. The traditional method is quite slow because of branches (which can hardly be predicted in the general case) and also because it cannot be easily/efficiently vectorized. However, because the big array is much bigger than the small array, the traditional algorithm will pick a lot of consecutive values from the big array in between values of the small array. This means that you can pick SIMD chunks of the big array and compare their values with the current one from the small array. If all SIMD items are smaller than the one picked from the small array, then you can write the whole SIMD chunk at once very efficiently. Otherwise, you need to write part of the SIMD chunk, then write the item of the small array and switch to the next one. This last operation is clearly less efficient, but it should happen rarely since the small array is much smaller than the big one. Note that the small array still needs to be sorted first.
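Below is a rough NumPy sketch of that block-skipping idea (no real SIMD intrinsics; whole-block copies stand in for SIMD stores, and the block size of 16 is an arbitrary stand-in for the vector width). Blocks of the big array that entirely precede the next small-array value are copied in one shot; only the crossing points fall back to one-element-at-a-time merging:

```python
import numpy as np

def merge_block_skip(big, small, block=16):
    # `big` and `small` are sorted 1-D arrays, with big much larger than small.
    out = np.empty(big.size + small.size, dtype=big.dtype)
    i = j = k = 0
    while i < big.size and j < small.size:
        blk = big[i:i + block]
        if blk[-1] <= small[j]:           # whole block precedes the next small value:
            out[k:k + blk.size] = blk     # copy it in one vectorized shot
            i += blk.size
            k += blk.size
        else:                             # crossing point inside this block:
            if big[i] <= small[j]:        # merge one element at a time
                out[k] = big[i]; i += 1
            else:
                out[k] = small[j]; j += 1
            k += 1
    out[k:k + (big.size - i)] = big[i:]   # copy whatever remains of either input
    k += big.size - i
    out[k:] = small[j:]
    return out
```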

Polyphase merge sort - what is the number of phases

Suppose that we have to sort some big set of numbers externally. We want to examine 2 cases:
4 tapes: 2 input tapes, 2 output
3 tapes: 2 in, 1 out
Case 1:
We start with k runs, then we copy those runs to the 2 input tapes. In each iteration we take one run from each input tape, merge them, and write the merged run to the first output tape, then in the next iteration to the second one, alternating. Then we switch the output tapes with the input ones and repeat the procedure. So if we have, let's say, n = 10^6 elements and k = 1000 runs, then after the first phase the run size will be 2000, after the second 4000, and so on. So the total number of phases is ceil(log_2(k)).
Case 2:
In the best case, the number of phases is the position of the run count in the Fibonacci sequence minus two, i.e. if our initial number of runs is k=34, and 34 is the 9th number in the Fibonacci sequence, then we will have 7 phases.
But… if our number of runs isn't a Fibonacci number, it is necessary to pad a tape with dummy runs in order to bring the number of runs up to a Fibonacci number.
Finally, my question is:
What is the average-case number of phases needed in order to sort a sequence, when the number of runs isn’t a Fibonacci number?
What is the number of phases ... when number of runs isn’t a Fibonacci number?
If the run count is not an ideal number, then the sort will take one extra phase, similar to rounding the run count up to the next ideal number. Dummy runs don't need to occupy any space on the tapes, but the code has to handle reaching the end of data on more than one tape during a phase on non-ideal distributions.
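A small sketch of that rule for the 3-tape case, assuming a non-Fibonacci run count is simply rounded up to the next Fibonacci number with dummy runs:

```python
def polyphase_phases_3_tapes(k):
    """Phases needed by a 3-tape polyphase sort started with k runs."""
    a, b = 1, 1              # consecutive Fibonacci numbers
    phases = 0
    while b < k:             # walk the sequence until it covers k runs
        a, b = b, a + b
        phases += 1
    return phases

print(polyphase_phases_3_tapes(34))   # 7, matching the example in the question
print(polyphase_phases_3_tapes(35))   # 8: one extra phase for a non-Fibonacci count
```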
Some notes about the information in the original question:
The 4 tape example shows a balanced 2-way merge sort. For polyphase merge sort, there's only one output tape per phase. With 4 tape drives, the initial setup distributes runs between the 3 other drives, so after the initial distribution, it is always 3 input tapes, 1 output tape.
The Fibonacci numbers only apply to a 3 tape scenario. For a 4 or more tape scenario, the sequence is easiest to generate by starting at the final phase and working backwards. For 31 runs on 4 tapes, the final run count is {1,0,0,0},
working backwards: {0,1,1,1}, {1,0,2,2}, {3,2,0,4}, {7,6,4,0}, {0,13,11,7}.
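A short sketch that reproduces that backwards sequence: start from the final state and repeatedly "undo" a phase by emptying the fullest tape back onto the others (this assumes the total run count is already one of the ideal numbers, as 31 is here):

```python
def polyphase_distributions(total_runs, tapes=4):
    """Run-count distributions per phase, generated backwards from the final state."""
    state = [1] + [0] * (tapes - 1)          # final phase leaves one run on one tape
    history = [tuple(state)]
    while sum(state) < total_runs:
        i = state.index(max(state))          # the tape that was written last
        moved = state[i]
        state = [count + moved for count in state]   # give its runs back to the inputs
        state[i] = 0                                 # ...and it becomes the empty output tape
        history.append(tuple(state))
    return history

print(polyphase_distributions(31))
# [(1,0,0,0), (0,1,1,1), (1,0,2,2), (3,2,0,4), (7,6,4,0), (0,13,11,7)]
```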
The run sizes increase as the result of merging prior runs of various sizes. Assume run size is 1 element, 31 runs, 4 tapes. After initial distribution, run count:run size is {0:0,13:1,11:1,7:1}. First phase: {7:3,6:1,4:1,0:0}. Second phase: {3:3,2:1,0:0,4:5}. Third phase {1:3,0:0,2:9,2:5}. Fourth phase: {0:0,1:17,1:9,1:5}. Fifth and final phase {1:31,0:0,0:0,0:0}.
Keeping track of run sizes can get complex, so a simple solution for tapes is to use a single file mark to indicate the end of a run and a double file mark to indicate the end of data.
Wiki has an article on polyphase merge sort.
https://en.wikipedia.org/wiki/Polyphase_merge_sort
If the total run count is known in advance, the initial distribution can include initial merge operations to get the run count to an ideal number, but now the run sizes vary due to the initial merge operations, so each tape ends up with a mix of run sizes. Again, using file marks to indicate end of runs eliminates having to keep track of run sizes in memory.
Polyphase merge sort is the fastest way to do a sort using 3 stacks.

Making radix sort in-place - trying to understand how

I'm going through all the known / typical sorting algorithms (insertion, bubble, selection, quick, merge sort..) and now I just read about radix sort.
I think I have understood its concept, but I still wonder how it can be done in-place. Let me explain how I understood it:
It's made up of 2 phases: partitioning and collecting. They are executed alternately. In the partitioning phase we split the data into buckets (let me call them that). In the collecting phase we collect the data again. Both phases are executed for each position of the keys to be sorted, so the number of passes depends on the key length (say, the number of digits, if for example we want to sort integers).
I don't want to explain the 2 phases in too much detail because it would take too long, and I hope you have read this far, because I don't know how to do this algorithm in-place.
Maybe you can explain it in words instead of code? I need to know it for my exam, but I couldn't find anything on the internet that explains it, at least not in an easy, understandable way.
If you want me to explain more, please tell me. I will do anything to understand it.
Wikipedia is (sometimes) your friend: https://en.wikipedia.org/wiki/Radix_sort#In-place_MSD_radix_sort_implementations.
I quote the article:
Binary MSD radix sort, also called binary quicksort, can be implemented in-place by splitting the input array into two bins - the 0s bin and the 1s bin. The 0s bin is grown from the beginning of the array, whereas the 1s bin is grown from the end of the array. [...] The most significant bit of the first array element is examined. If this bit is a 1, then the first element is swapped with the element in front of the 1s bin boundary (the last element of the array), and the 1s bin is grown by one element by decrementing the 1s boundary array index. If this bit is a 0, then the first element remains at its current location, and the 0s bin is grown by one element. [...] The 0s bin and the 1s bin are then sorted recursively based on the next bit of each array element. Recursive processing continues until the least significant bit has been used for sorting.
The main information is: it is a binary and recursive radix sort. In other words:
you have only two buckets, let's say 0 and 1, at each step. Since the algorithm is in-place, you swap elements (as in quicksort) to put each element in the right bucket (0 or 1), depending on the current bit of its key.
you process recursively: each bucket is split into two buckets, depending on the next bit.
It is very simple to understand for unsigned integers: you consider the bits from the most significant to the least significant. It may be more complex (and overkill) for other data types.
To summarize the differences with quicksort algorithm:
in quicksort, your choice of a pivot defines two "buckets": lower than pivot, greater than pivot.
in binary radix sort, the two buckets are defined by the radix (eg most significant bit).
In both cases, you swap elements to put each element in its "bucket" and process recursively.
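A minimal in-place sketch of this binary MSD radix sort for unsigned integers (starting at bit 31 assumes 32-bit values):

```python
def binary_msd_radix_sort(a, lo=0, hi=None, bit=31):
    """Sort a[lo:hi] in place by examining bits from most to least significant."""
    if hi is None:
        hi = len(a)
    if hi - lo <= 1 or bit < 0:
        return
    i, j = lo, hi                    # 0s bin grows from the front, 1s bin from the back
    mask = 1 << bit
    while i < j:
        if a[i] & mask:              # current bit is 1: swap it in front of the 1s bin
            j -= 1
            a[i], a[j] = a[j], a[i]  # do not advance i; the swapped-in element is unexamined
        else:                        # current bit is 0: leave it, the 0s bin grows
            i += 1
    binary_msd_radix_sort(a, lo, i, bit - 1)   # recurse on the 0s bin
    binary_msd_radix_sort(a, i, hi, bit - 1)   # recurse on the 1s bin

data = [170, 45, 75, 90, 2, 802, 24, 66]
binary_msd_radix_sort(data)
print(data)   # [2, 24, 45, 66, 75, 90, 170, 802]
```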

merge sort with large number of integers

I need to sort a large number of integers that cannot fit into memory. I'm wondering if merge sort is the right way? My solution is like this:
Use in-memory sorting for each 5% of the integers (which can fit into memory), using quicksort, which performs efficiently in memory;
After the 20 chunks are sorted, use a merge to combine the 20 lists. For the merge, I just need to load part of each file into memory, and load the next part of a list once its current part has been fully merged into the final results. Since each of the 20 lists is sorted, and I just need to read each chunk sequentially from head to tail, the memory use is affordable.
I am not sure if this is the right way to sort a large number of integers?
Since they are integers, and most of them are 1-100, all you need is Counting Sort.
It is very simple in implementation.
Create an array of 100 ints (or HashMap<int, int>) called intCounts (take 64-bit ints if you think 32-bit can overflow)
One by one read the integers that you have to sort
For every inputInteger to be sorted, just do intCounts[inputInteger]++
After you have read all integers, intCounts[i] tells how many times you saw integer i in your large set of integers
Just iterate over your intCounts from least index to highest index
Write back i a total of intCounts[i] times
You have written back a sorted list of all your input integers now.
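A minimal sketch of that counting-sort approach; `read_ints` and `write_int` are hypothetical streaming I/O callbacks, and a dict stands in for the 100-slot array / HashMap so it works for any integer range:

```python
def counting_sort_stream(read_ints, write_int):
    """Counting sort over a stream: only the per-value counts live in memory."""
    counts = {}                                  # value -> occurrences; a plain
    for x in read_ints():                        # 100-slot array works if values are 1..100
        counts[x] = counts.get(x, 0) + 1
    for value in sorted(counts):                 # ascending over the distinct values seen
        for _ in range(counts[value]):
            write_int(value)
```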
The GNU sort program (like its Unix predecessor) uses an in-memory sort followed by as many 16-way merges as needed. See the code here to read more:
http://git.savannah.gnu.org/cgit/coreutils.git/tree/src/sort.c#n306
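This is not GNU sort's code; it is just a small sketch of the same chunk-then-k-way-merge idea in Python, using heapq.merge for the multi-way merge (`read_ints` and `write_int` are hypothetical I/O callbacks):

```python
import heapq, os, tempfile

def external_sort(read_ints, write_int, chunk_size=1_000_000):
    """Sort chunks that fit in memory, write them to temp files, then k-way merge."""
    paths, chunk = [], []
    def flush():
        if chunk:
            chunk.sort()                              # in-memory sort of one chunk
            fd, path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w") as f:
                f.writelines(f"{x}\n" for x in chunk)
            paths.append(path)
            chunk.clear()
    for x in read_ints():
        chunk.append(x)
        if len(chunk) >= chunk_size:
            flush()
    flush()
    files = [open(p) for p in paths]
    streams = [(int(line) for line in f) for f in files]
    for x in heapq.merge(*streams):                   # k-way merge, sequential reads only
        write_int(x)
    for f, p in zip(files, paths):
        f.close(); os.remove(p)
```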

Generate sequence of integers in random order without constructing the whole list upfront [duplicate]

How can I generate the list of integers from 1 to N but in a random order, without ever constructing the whole list in memory?
(To be clear: Each number in the generated list must only appear once, so it must be the equivalent to creating the whole list in memory first, then shuffling.)
This has been determined to be a duplicate of this question.
A very simple scrambled sequence is 1 + ((power(r, x) - 1) mod p): for x from 1 to p-1 it produces each value from 1 to p-1 exactly once, in a pseudo-random-looking order, provided p is prime and r is a primitive root modulo p (r merely being a prime with r <> p is not enough on its own).
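A quick check of that formula in Python, under the assumption that r is a primitive root modulo the prime p (here p = 101, r = 2):

```python
def power_residue_sequence(p=101, r=2):
    """Yield 1 + ((r**x - 1) mod p) for x = 1..p-1; r must be a primitive root mod p."""
    for x in range(1, p):
        yield 1 + (pow(r, x, p) - 1) % p

values = list(power_residue_sequence())          # p = 101, r = 2 (2 is a primitive root of 101)
assert sorted(values) == list(range(1, 101))     # every value 1..100 appears exactly once
```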
Not the whole list technically, but you could use a bit mask to decide if a number has already been selected. This has a lot less storage than the number list itself.
Set all N bits to 0, then for each desired number:
use one of the normal linear congruent methods to select a number from 1 to N.
if that number has already been used, find the next highest unused (0 bit), with wrap.
set that number's bit to 1 and return it.
That way you're guaranteed only one use per number and relatively random results.
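A minimal sketch of that bitmask idea, using a bytearray (about N/8 bytes) and Python's built-in PRNG in place of a linear congruential generator:

```python
import random

def random_permutation_bitmask(n):
    """Yield 1..n in random order, remembering used numbers in an n-bit mask."""
    used = bytearray((n + 7) // 8)       # about n/8 bytes instead of an n-element list
    for _ in range(n):
        k = random.randrange(n)          # any PRNG works here (the answer suggests an LCG)
        while used[k >> 3] & (1 << (k & 7)):
            k = (k + 1) % n              # already used: take the next unused, with wrap
        used[k >> 3] |= 1 << (k & 7)
        yield k + 1

print(list(random_permutation_bitmask(10)))
```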
It might help to specify which language you are looking for a solution in.
You could use a dynamic list where you store your generated numbers, since you will need a record of which numbers you have already created. Every time you create a new number, check whether it is contained in the list; if it is, throw it away and try again.
The only way to avoid such a list would be to use a number size where generating a duplicate is unlikely, like a UUID, assuming the algorithm works correctly - but this doesn't guarantee that no duplicate is generated; it is just highly unlikely.
You will need at least half of the total list's memory, just to remember what you did already.
If you are in tough memory conditions, you may try this:
Keep the results generated so far in a tree: generate a random number and insert it into the tree. If you cannot insert it (it is already there), generate another number and try again, and so on, until the tree is half full.
When the tree is half full, you invert it: construct a tree holding the numbers you haven't used yet, then pick those in random order.
It has some overhead for keeping the tree structure, but it may help when your pointers are considerably smaller in size than your data is.

Resources