merge sort with large number of integers

I need to sort a large number of integers that cannot fit in memory, and I am wondering whether merge sort is the right approach. My solution looks like this:
Sort each 5% of the integers, which does fit in memory, using quick sort, which performs efficiently in memory.
After all 20 chunks are sorted, use merge sort to merge the 20 lists. For the merge, I only need to load part of each file into memory, and load the next part of a list once its current part has been fully written to the final result. Since each of the 20 lists is sorted and I only read each one sequentially from head to tail, the memory use is affordable.
Is this the right way to sort a large number of integers?

Since
they are integers, and most of them are 1-100,
all you need is Counting Sort.
It is very simple to implement.
Create an array of 101 ints (so that indexes 1 to 100 are valid), or a HashMap<int, int>, called intCounts (use 64-bit ints if you think 32-bit counters can overflow)
Read the integers that you have to sort one by one
For every inputInteger to be sorted, just do intCounts[inputInteger]++
After you have read all integers, intCounts[i] tells how many times you saw integer i in your large set of integers
Iterate over intCounts from the lowest index to the highest index
Write back i a total of intCounts[i] times
You have now written back a sorted list of all your input integers.
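The steps above can be sketched in Python as follows. This is a minimal illustration, assuming every input value lies in a known range; the function name `counting_sort` and the `lo`/`hi` parameters are made up for the example. It accepts any iterable, so the input can be streamed from a file instead of being held in memory.

```python
def counting_sort(values, lo=1, hi=100):
    """Yield the values in sorted order, assuming lo <= value <= hi."""
    counts = [0] * (hi - lo + 1)          # one counter per possible value
    for v in values:                      # a single pass over the input
        if not lo <= v <= hi:
            raise ValueError(f"value {v} is outside [{lo}, {hi}]")
        counts[v - lo] += 1
    for offset, count in enumerate(counts):
        for _ in range(count):            # write each value back count times
            yield offset + lo
```

For example, `list(counting_sort([3, 1, 100, 3, 2, 1]))` gives `[1, 1, 2, 3, 3, 100]`. Memory use depends only on the size of the range, not on the number of integers.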

The GNU sort program (like its Unix predecessor) uses an in-memory sort followed by as many 16-way merges as needed. See the code here to read more:
http://git.savannah.gnu.org/cgit/coreutils.git/tree/src/sort.c#n306
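As a rough illustration of the "in-memory sort followed by k-way merges" idea (this is not GNU sort's actual code), here is a small Python sketch. The names `external_sort`, `chunk_size` and `fan_in` are assumptions made for the example, and runs are stored as plain text files with one integer per line.

```python
import heapq
import os
import tempfile
from itertools import islice


def _write_run(values, directory):
    """Write an already sorted stream of ints to a new temporary file."""
    fd, path = tempfile.mkstemp(dir=directory, text=True)
    with os.fdopen(fd, "w") as f:
        f.writelines(f"{v}\n" for v in values)
    return path


def _read_run(path):
    """Lazily read the ints of one sorted run, one line at a time."""
    with open(path) as f:
        for line in f:
            yield int(line)


def external_sort(numbers, chunk_size=1000, fan_in=16):
    """Sort an iterable of ints using sorted runs on disk (fan_in >= 2)."""
    it = iter(numbers)
    with tempfile.TemporaryDirectory() as tmp:
        runs = []
        while True:
            chunk = sorted(islice(it, chunk_size))   # in-memory sort
            if not chunk:
                break
            runs.append(_write_run(chunk, tmp))
        while len(runs) > 1:                         # repeated k-way merges
            merged = []
            for i in range(0, len(runs), fan_in):
                group = runs[i:i + fan_in]
                stream = heapq.merge(*map(_read_run, group))
                merged.append(_write_run(stream, tmp))
                for path in group:
                    os.remove(path)
            runs = merged
        # Returning a list is only for demonstration; a real program
        # would keep the final run on disk.
        return list(_read_run(runs[0])) if runs else []
```

During a merge, only one pending value per run (plus file buffers) is held in memory, which is why the approach described in the question is affordable.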

Related

Sorting a small array into a large sorted array

What is the best algorithm for merging a large sorted array with a small unsorted array?
I'll give examples of what I mean from my particular use case, but don't feel bound by them: I'm mostly trying to give a feel for the problem.
8 MB sorted array with 92 kB unsorted array (in-cache sort)
2.5 GB sorted array with 3.9 MB unsorted array (in-memory sort)
34 GB sorted array with 21 MB unsorted array (out-of-memory sort)

You can implement a chunk-based algorithm to solve this problem efficiently (whatever the input size of the arrays as long as one is much smaller than the other).
First of all, you need to sort the small array (possibly using a radix sort or a bitonic sort if you do not need a custom comparator).
Then the idea is to cut the big array into chunks that fully fit in the CPU cache (eg. 256 KiB).
For each chunk, find the index of the last item in the small array that is <= the last item of the chunk using a binary search.
This is relatively fast because the small array likely fits in the cache and the same items of the binary search are fetched between consecutive chunks if the array is big.
This index tells you how many items need to be merged with the chunk before it is written.
For each value to be merged into the chunk, find the index of the value using a binary search in the chunk.
This is fast because the chunk fits in the cache.
Once you know the indices of the values to be inserted in the chunk, you can efficiently move the items by block in each chunk (possibly in-place from the end to the beginning).
This implementation is much faster than the traditional merge algorithm since the number of comparisons needed is much smaller thanks to the binary search and the small number of items to be inserted per chunk.
For relatively big inputs, you can use a parallel implementation. The idea is to work on a group of multiple chunks at the same time (ie. super-chunks).
Super-chunks are much bigger than classical ones (eg. >=2 MiB).
Each thread works on one super-chunk at a time. A binary search is performed on the small array to know how many values are inserted in each super-chunk.
This number is shared between threads so that each thread knows where it can safely write the output independently of the other threads (one could use a parallel-scan algorithm to do that on massively parallel architectures). Each super-chunk is then split into classical chunks and the previous algorithm is used to solve the problem in each thread independently.
This method should be more efficient even when run sequentially if the small input array does not fit in the cache, since the number of binary search operations over the whole small array is significantly reduced.
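The (amortized) time complexity of the algorithm is O(n (1 + log(m) / c) + m (1 + log(c))) with n the length of the big array, m the length of the small array and c the chunk size: there are n / c chunks, each costing one binary search of log(m) in the small array, and each of the m small values costs one binary search of log(c) in a chunk (super-chunks are ignored here for the sake of clarity, but they only change the complexity by a constant factor like the constant c does).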
Alternative method / Optimization: If your comparison operator is cheap and can be vectorized using SIMD instructions, then you can optimize the traditional merge algorithm. The traditional method is quite slow because of branches (that can hardly be predicted in the general case) and also because it cannot be easily/efficiently vectorized. However, because the big array is much bigger than the small array, the traditional algorithm will pick a lot of consecutive values from the big array in between the ones of the small array. This means that you can pick SIMD chunks of the big array and compare their values with one value of the small array. If all SIMD items are smaller than the one picked from the small array, then you can write the whole SIMD chunk at once very efficiently. Otherwise, you need to write a part of the SIMD chunk, then write the item of the small array and switch to the next one. This last operation is clearly less efficient but it should happen rarely since the small array is much smaller than the big one. Note that the small array still needs to be sorted first.
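To make the sequential chunk-based idea more concrete, here is a small Python sketch. It is only an illustration under simplifying assumptions: it writes into a new output list instead of moving blocks in place, it ignores cache sizes, super-chunks and SIMD, and the name `merge_small_into_big` and the `chunk_size` parameter are made up for the example.

```python
from bisect import bisect_right


def merge_small_into_big(big, small, chunk_size=4):
    """Merge an unsorted small list into a sorted big list, chunk by chunk."""
    small = sorted(small)          # the small array must be sorted first
    out = []
    j = 0                          # index of the next small item to place
    for start in range(0, len(big), chunk_size):
        chunk = big[start:start + chunk_size]
        # One binary search in the small array per chunk: every small
        # item in small[j:hi] is <= the last item of this chunk.
        hi = bisect_right(small, chunk[-1], j)
        pos = 0                    # how much of the chunk is already copied
        for value in small[j:hi]:
            # One binary search in the chunk per small item.
            idx = bisect_right(chunk, value, pos)
            out.extend(chunk[pos:idx])   # copy a whole block of big items
            out.append(value)
            pos = idx
        out.extend(chunk[pos:])    # rest of the chunk
        j = hi
    out.extend(small[j:])          # small items larger than everything in big
    return out
```

With `big = [1, 3, 5, 7, 9, 11]` and `small = [10, 2, 6, 6, 0, 20]`, the result equals `sorted(big + small)`. Most of the work is block copies, and comparisons happen only inside the binary searches.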

A large file containing 1 million integers, what would be the fastest way to find the most occurring?

The basic approach would be to use an array or a hashmap to create a histogram of the numbers and select the most frequent.
In this case let's assume that all the numbers from the file cannot be loaded into the main memory.
One way I can think of is to sort using external merge/quick sort and then chunk by chunk calculate the frequency. As they are sorted, we don't have to worry about the number appearing again after the sequence with a number finishes.
Is there a better and more efficient way to do this?

Well, a million isn't so much anymore, so let's assume we're talking about several billion integers.
In that case, I would suggest that you hash them and partition them into 2^N buckets (separate files or preallocated parts of the same file) using the top N bits of their hash values.
You would choose N so that the resulting buckets were highly likely to be small enough to process in memory.
You would then process each bucket by counting the occurrences of each unique value in a hash table or similar.
In the unlikely event that a bucket has too many unique values to fit in RAM, repartition using the next N bits of the hash and try again.
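A minimal Python sketch of this partitioning scheme is shown below. It makes several assumptions for the sake of the example: buckets are text files in a temporary directory, the hash is a simple 32-bit multiplicative hash chosen here for illustration, and the repartitioning step for oversized buckets is left out. The names `most_frequent` and `n_bits` are hypothetical.

```python
import os
import tempfile
from collections import Counter


def most_frequent(numbers, n_bits=4):
    """Return (value, count) for the most frequent int, or (None, 0)."""
    n_buckets = 1 << n_bits
    with tempfile.TemporaryDirectory() as tmp:
        paths = [os.path.join(tmp, f"bucket_{b}.txt") for b in range(n_buckets)]
        files = [open(p, "w") for p in paths]
        try:
            for x in numbers:
                # Illustrative 32-bit multiplicative hash; the top n_bits
                # of the hash select the bucket.
                h = (x * 2654435761) & 0xFFFFFFFF
                files[h >> (32 - n_bits)].write(f"{x}\n")
        finally:
            for f in files:
                f.close()

        best_value, best_count = None, 0
        for path in paths:
            # Equal values always land in the same bucket, so each bucket
            # can be counted independently in memory.
            with open(path) as f:
                counts = Counter(int(line) for line in f)
            if counts:
                value, count = counts.most_common(1)[0]
                if count > best_count:
                    best_value, best_count = value, count
        return best_value, best_count
```

Only one bucket's hash table is in memory at any time, which is the point of choosing N so that buckets are small enough.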

Which sorting algorithm should I use for a list with a lot of replications?

I want to sort an array of 1 million integers. What would be the best algorithm to use knowing that the universe of the array's integers are from 1 to 100? Note that this means that there are a lot of items replicated. Furthermore, the array is randomly distributed.

You create an array of 100 elements (with one for each possible value) and simply count how many there are of each. Running time: O(n), with each element of the original array accessed only once, so you're unlikely to find a faster one. :)
Or to give it its proper name, use a counting sort.

sorting a bivalued list

If I have a list of just binary values containing 0's and 1's like the following 000111010110
and I want to sort it to the following 000000111111 what would be the most efficient way to do this if you also know the list size? Right now I am thinking of keeping one counter that counts the number of 0's as I traverse the list from beginning to end. Then if I subtract numberOfZeros from listSize I get numberOfOnes. Then, instead of reordering the list in place, I would just create a new list starting with the zeros. Would you agree this is the most efficient method?

Your algorithm implements the most primitive version of the classic bucket sort algorithm (its counting sort implementation). It is the fastest possible way to sort numbers when their range is known, and is (relatively) small. Since zeros and ones is all you have, you do not need an array of counters that are present in the bucket sort: a single counter is sufficient.
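The single-counter version described in the question can be sketched as follows, assuming the list contains only 0s and 1s (the function name `sort_binary_list` is made up for the example):

```python
def sort_binary_list(bits):
    """Return a new sorted list, assuming bits contains only 0s and 1s."""
    number_of_zeros = sum(1 for b in bits if b == 0)   # one pass, one counter
    number_of_ones = len(bits) - number_of_zeros       # list size is known
    return [0] * number_of_zeros + [1] * number_of_ones
```

For the list 000111010110 from the question, this returns six 0s followed by six 1s.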

If you have numeric values, you can use a population-count instruction (POPCNT on x86) to count the number of set bits n; note that bit scan (BSF) only finds the position of the lowest set bit. To create the "sorted" value, set bit n (counting from zero, i.e. the (n+1)-th bit), then subtract one. This sets all n bits to the right of that bit.
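When the 0s and 1s are packed into an integer, the same result can be obtained by counting the set bits and building the value directly. Here is a sketch in Python, assuming a non-negative integer; the function name `sort_bits_in_word` is made up, and `bin(x).count("1")` stands in for a hardware population count.

```python
def sort_bits_in_word(x):
    """Move all set bits of a non-negative int to the low end."""
    n = bin(x).count("1")      # number of set bits (population count)
    return (1 << n) - 1        # set bit n, then subtract one: n low bits set
```

For example, `sort_bits_in_word(0b000111010110)` returns `0b000000111111`.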

Bucket sort is a sorting algorithm, as it seems.
I don't think there is a need for such operations. As we know, no comparison-based sorting algorithm is faster than N*log(N), so a general-purpose sort is the wrong tool here.
All you have to do is what you said at the very beginning. Just traverse the list and count the zeros or the ones, which gives you O(n) complexity. Then create a new array with the counted zeros at the beginning followed by the ones. That is a total of N+N operations, which gives you O(n) complexity.
This works only because you have just two values. Neither quick sort nor any other comparison sort can do this faster, since no comparison sort beats N*log(N).

Most efficient way to count occurrences?

I'm looking to calculate entropy and mutual information a huge number of times in performance-critical code. As an intermediate step, I need to count the number of occurrences of each value. For example:
uint[] myArray = [1,1,2,1,4,5,2];
uint[] occurrences = countOccurrences(myArray);
// Occurrences == [3, 2, 1, 1] or some permutation of that.
// 3 occurrences of 1, 2 occurrences of 2, one each of 4 and 5.
Of course the obvious ways to do this are either using an associative array or by sorting the input array using a "standard" sorting algorithm like quick sort. For small integers, like bytes, the code is currently specialized to use a plain old array.
Is there any clever algorithm to do this more efficiently than a hash table or a "standard" sorting algorithm will offer, such as an associative array implementation that heavily favors updates over insertions or a sorting algorithm that shines when your data has a lot of ties?
Note: Non-sparse integers are just one example of a possible data type. I'm looking to implement a reasonably generic solution here, though since integers and structs containing only integers are common cases, I'd be interested in solutions specific to these if they are extremely efficient.

Hashing is generally more scalable, as another answer indicates. However, for many possible distributions (and many real-life cases, where subarrays just happen to be often sorted, depending on how the overall array was put together), timsort is often "preternaturally good" (closer to O(N) than to O(N log N)) -- I hear it's probably going to become the standard/default sorting algorithm in Java at some reasonably close future date (it's been the standard sorting algorithm in Python for years now).
There's no really good way to address such problems except to benchmark on a selection of cases that are representative of the real-life workload you expect to be experiencing (with the obvious risk that you may choose a sample that actually happened to be biased/non-representative -- that's not a small risk if you're trying to build a library that will be used by many external users outside of your control).

Please tell more about your data.
How many items are there?
What is the expected ratio of unique items to total items?
What is the distribution of actual values of your integers? Are they usually small enough to use a simple counting array? Or are they clustered into reasonably narrow groups? Etc.
In any case, I suggest the following idea: a mergesort modified to count duplicates.
That is, you work in terms of not numbers but pairs (number, frequency) (you might use some clever memory-efficient representation for that, for example two arrays instead of an array of pairs etc.).
You start with [(x1,1), (x2,1), ...] and do a mergesort as usual, but when you merge two lists that start with the same value, you put the value into the output list with the sum of their occurrences. On your example:
[1:1,1:1,2:1,1:1,4:1,5:1,2:1]
Split into [1:1, 1:1, 2:1] and [1:1, 4:1, 5:1, 2:1]
Recursively process them; you get [1:2, 2:1] and [1:1, 2:1, 4:1, 5:1]
Merge them: (first / second / output)
[1:2, 2:1] / [1:1, 2:1, 4:1, 5:1] / [] - we add up 1:2 and 1:1 and get 1:3
[2:1] / [2:1, 4:1, 5:1] / [1:3] - we add up 2:1 and 2:1 and get 2:2
[] / [4:1, 5:1] / [1:3, 2:2]
[1:3, 2:2, 4:1, 5:1]
This might be improved greatly by using some clever tricks to do an initial reduction of the array (obtain an array of value:occurrence pairs that is much smaller than the original, but where the sum of 'occurrence' for each 'value' is equal to the number of occurrences of 'value' in the original array). For example, split the array into contiguous blocks where values differ by no more than 256 or 65536 and use a small array to count occurrences inside each block. Actually this trick can be applied at later merging phases, too.
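The duplicate-counting mergesort described above can be sketched in Python like this. It is a plain recursive version without the initial-reduction trick, and the function names are made up for the example.

```python
def _merge_counts(a, b):
    """Merge two sorted lists of (value, count) pairs, summing equal values."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i][0] == b[j][0]:
            out.append((a[i][0], a[i][1] + b[j][1]))   # add up the counts
            i += 1
            j += 1
        elif a[i][0] < b[j][0]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out


def count_occurrences(values):
    """Return sorted (value, count) pairs using a counting mergesort."""
    if not values:
        return []
    if len(values) == 1:
        return [(values[0], 1)]
    mid = len(values) // 2
    return _merge_counts(count_occurrences(values[:mid]),
                         count_occurrences(values[mid:]))
```

On the example input `[1, 1, 2, 1, 4, 5, 2]` this returns `[(1, 3), (2, 2), (4, 1), (5, 1)]`, matching the trace above. Because duplicates collapse at every merge, the lists shrink quickly when there are few unique values.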

With an array of integers like in the example, the most efficient way would be to have an array of ints and index it using your values (as you appear to be doing already).
If you can't do that, I can't think of a better alternative than a hashmap. You just need to have a fast hashing algorithm. You can't get better than O(n) performance if you want to use all your data. Is it an option to use only a portion of the data you have?
(Note that sorting and counting is asymptotically slower (O(n*log(n))) than using a hashmap based solution (O(n)).)
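For comparison, both approaches are short to write in Python. This is only a sketch (the function names are made up); which one is faster in practice depends on the data, so it is worth benchmarking on a representative workload.

```python
from collections import Counter
from itertools import groupby


def count_with_hash(values):
    """Hash-based counting: expected O(n)."""
    return dict(Counter(values))


def count_with_sort(values):
    """Sort, then count runs of equal values: O(n log n)."""
    return {key: sum(1 for _ in group) for key, group in groupby(sorted(values))}
```

Both return `{1: 3, 2: 2, 4: 1, 5: 1}` for the input `[1, 1, 2, 1, 4, 5, 2]`.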
