External sorting when indices can fit in RAM

I want to sort a multi-TB file full of 20 KB records. I only need to read a few bytes from each record in order to determine its order, so I can sort the indices in memory.
I cannot fit the records themselves in memory, however. Random access is slower than sequential access, and I don't want to do random-access writes to the output file either. Is there any known algorithm that will take advantage of the sorted indices to "strategize" the optimal way to rearrange the records as they are copied from the input file to the output file?

There are "reorder array according to sorted indices" algorithms, but they involve random access. Even on an SSD, where the random access itself is not an issue, reading or writing one record at a time has lower throughput than reading or writing many records at a time, which is what an external merge sort typically does.
For a typical external merge sort, the file is read in chunks small enough for an internal sort to handle; each chunk is sorted and written to external media. After this initial pass, a k-way merge is done on the chunks, multiplying the size of the merged chunks by k on each merge pass until a single sorted chunk is produced. The read/write operations can transfer multiple records at a time. Say you have 1 GB of RAM and use a 16-way merge: 16 input buffers and 1 output buffer are used, so each buffer could be about 63 MB (1 GiB / 17, rounded down a bit to leave room for variables), which allows 3150 of the 20 KB records to be read or written at a time, greatly reducing random access and command overhead. Assuming the initial pass creates sorted chunks of 0.5 GB, after 3 (16-way) merge passes the chunk size is 2 TB; after 4 passes it is 32 TB; and so on.
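As a sanity check on the chunk-growth arithmetic above, here is a tiny sketch (the function name and the decimal GB/TB constants are my own, chosen for illustration):

```python
def passes_needed(total_bytes, chunk_bytes, k):
    """Number of k-way merge passes until one chunk covers the whole file."""
    passes = 0
    while chunk_bytes < total_bytes:
        chunk_bytes *= k   # each pass merges k chunks into one
        passes += 1
    return passes

GB = 10**9
TB = 10**12
print(passes_needed(2 * TB, 0.5 * GB, 16))   # 3 passes for a 2 TB file
print(passes_needed(32 * TB, 0.5 * GB, 16))  # 4 passes for a 32 TB file
```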

Related

Check for duplicate input items in a data-intensive application

I have to build a server-side application that will receive a stream of data as input; it will actually receive a stream of integers of up to nine decimal digits, and it has to write each of them to a log file. The input data is totally random, and one of the requirements is that the application should not write duplicate items to the log file, and should periodically report the number of duplicate items found.
Taking into account that performance is a critical aspect of this application, as it should be able to handle high loads of work (and parallel work), I would like to find a proper solution to keep track of the duplicate entries, as checking the whole log (text) file on every write is certainly not suitable. I can think of a solution consisting of maintaining some sort of data structure in memory to keep track of the whole stream of data processed so far, but as the input data can be really large, I don't think that is the best way to do it either...
Any ideas?
Assuming the stream of random integers is uniformly distributed, the most efficient way to keep track of duplicates is to maintain a bitmap with one bit per possible value. For non-negative nine-digit integers that is 10^9 bits, about 120 MiB of RAM (double that if negative values are allowed). Since this data structure is large, memory accesses may be slow (limited by the latency of the memory hierarchy).
If the ordering does not matter, you can use multiple threads to mitigate the impact of the memory latency. Parallel accesses can be done safely using logical atomic operations.
To check if a value is already seen before, you can check the value of a bit in the bitmap then set it (atomically if done in parallel).
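A minimal sketch of that test-and-set bitmap in Python, assuming non-negative nine-digit integers (the names and the single-threaded `bytearray` are illustrative; a parallel version would need atomic bit operations):

```python
NUM_VALUES = 10**9                       # one bit per possible 9-digit value
bitmap = bytearray(NUM_VALUES // 8 + 1)  # ~120 MiB, zero-initialized

def seen_before(value):
    """Test the bit for `value`, then set it. Returns True on a duplicate."""
    byte, mask = value >> 3, 1 << (value & 7)
    already = bool(bitmap[byte] & mask)
    bitmap[byte] |= mask
    return already
```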
If you know that your stream contains fewer than about a million integers, or that the stream of random integers is not uniformly distributed, you can use a hash-set data structure instead, as it stores the data more compactly (in the sequential case).
Bloom filters can help speed up the filtering when the number of values in the stream is quite big and there are very few duplicates (this method has to be combined with another approach if you want deterministic results).
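A toy Bloom filter sketch to make the idea concrete; the sizes and the double-hashing scheme are illustrative, and since a "maybe" answer can be a false positive, a hit must be confirmed against an exact structure before reporting a duplicate:

```python
import hashlib

class BloomFilter:
    def __init__(self, m=8_000_000, k=7):
        self.m, self.k = m, k          # m bits, k probe positions per value
        self.bits = bytearray(m // 8 + 1)

    def _positions(self, value):
        # Derive k positions from one SHA-256 digest via double hashing.
        h = hashlib.sha256(str(value).encode()).digest()
        h1 = int.from_bytes(h[:8], "big")
        h2 = int.from_bytes(h[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, value):
        for p in self._positions(value):
            self.bits[p >> 3] |= 1 << (p & 7)

    def might_contain(self, value):
        # False means definitely new; True may be a false positive.
        return all(self.bits[p >> 3] & (1 << (p & 7))
                   for p in self._positions(value))
```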
Here is an example using a hash set in Python:
seen = set()                  # values seen so far
for value in inputStream:     # iterate over the stream
    if value not in seen:     # O(1) average-case lookup
        log.write(value)      # value is not a duplicate: log it
        seen.add(value)       # O(1) average-case insertion

Sort the contents of a large file with low RAM

You have a large file, around 50 GB, containing numbers. You need to read the file, sort the contents, and copy the sorted contents to another new file.
Condition: you have only 1 GB of RAM on the computer. However, disk space is not an issue.
When we sort items that all fit in memory, we call it internal sorting. When the items are too big to fit in memory, we call it external sorting.
The Art of Computer Programming, Vol. 3: Sorting and Searching discusses detailed algorithms for external sorting on page 248 (one of them is merge sort).
You also mention that the file contains 50 GB of numbers, so there may be a lot of duplicated numbers. You might as well use a counting sort if there are a lot of duplicates.
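A sketch of the counting-sort idea, assuming one integer per line; `counting_sort_file` is a hypothetical helper, and it only pays off when the number of distinct values is small enough for the counts to fit in RAM:

```python
from collections import Counter

def counting_sort_file(in_path, out_path):
    """Sort a file of integers (one per line) when distinct values are few.
    Only the per-value counts are held in memory, not the full data set."""
    counts = Counter()
    with open(in_path) as f:
        for line in f:
            counts[int(line)] += 1
    with open(out_path, "w") as f:
        for value in sorted(counts):           # one entry per distinct value
            f.write((str(value) + "\n") * counts[value])
```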

What is this algorithm called and what is the time complexity?

Let's call the amount of RAM available R.
We have an unsorted file of 10 gigs with one column of keys (duplicates allowed).
You split the file into k files, each of which has size R.
You sort each file and write it to disk.
You read (10 / R) gigs from each file into input buffers. You perform a k-way merge: read the first key from the first file and compare it to every other key in your input buffers to find the minimum. You add this to your output buffer, which should also hold (10 / R) gigs of data.
Once the output buffer is full, write it to disk to a final sorted file.
Repeat this process until all k files have been fully read. If an input buffer is empty, fill it with the next (10 / R) gigs of its corresponding file until the file has been entirely read. We can do this buffer refilling in parallel.
What is the official name for this algorithm? Is it a k-way merge sort?
The first part, where we split into k files, is O((n / k) log (n / k)).
The second part, where we merge, is O(nk)?
If I am wrong, can I have an explanation? If this is external merge sort, how do we optimize this further?
This is a textbook external merge sort; its time complexity is O(n log n). (The merge phase need not be O(nk): selecting the minimum of the k buffer heads with a min-heap costs O(log k) per element, making the merge O(n log k).)
Here's Wikipedia's entry on it:
One example of external sorting is the external merge sort algorithm, which sorts chunks that each fit in RAM, then merges the sorted chunks together. For example, for sorting 900 megabytes of data using only 100 megabytes of RAM:
1) Read 100 MB of the data in main memory and sort by some conventional method, like quicksort.
2) Write the sorted data to disk.
3) Repeat steps 1 and 2 until all of the data is in sorted 100 MB chunks (there are 900 MB / 100 MB = 9 chunks), which now need to be merged into one single output file.
4) Read the first 10 MB (= 100 MB / (9 chunks + 1)) of each sorted chunk into input buffers in main memory and allocate the remaining 10 MB for an output buffer. (In practice, it might provide better performance to make the output buffer larger and the input buffers slightly smaller.)
5) Perform a 9-way merge and store the result in the output buffer. Whenever the output buffer fills, write it to the final sorted file and empty it. Whenever any of the 9 input buffers empties, fill it with the next 10 MB of its associated 100 MB sorted chunk until no more data from the chunk is available. This is the key step that makes external merge sort work externally -- because the merge algorithm only makes one pass sequentially through each of the chunks, each chunk does not have to be loaded completely; rather, sequential parts of the chunk can be loaded as needed.
Historically, instead of a sort, sometimes a replacement-selection algorithm was used to perform the initial distribution, to produce on average half as many output chunks of double the length.
I'd say it's a merge algorithm; the exact file I/O is an implementation detail.
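The recipe quoted above can be sketched end to end in Python. This is a minimal illustration under my own assumptions (integers, one per line; chunk size in records rather than bytes): `heapq.merge` performs the k-way merge lazily, and Python's buffered file objects stand in for the explicit 10 MB input/output buffers.

```python
import heapq
import os
import tempfile

def external_sort(in_path, out_path, chunk_size=100_000):
    """Sort a file of integers (one per line) via sorted runs + k-way merge."""
    tmpdir = tempfile.mkdtemp()
    chunk_paths = []

    # Pass 1: split the input into sorted chunks that fit in memory.
    with open(in_path) as f:
        while True:
            chunk = [int(x) for _, x in zip(range(chunk_size), f)]
            if not chunk:
                break
            chunk.sort()
            path = os.path.join(tmpdir, "chunk%d.txt" % len(chunk_paths))
            with open(path, "w") as out:
                out.writelines("%d\n" % v for v in chunk)
            chunk_paths.append(path)

    # Pass 2: k-way merge; each chunk file is read sequentially and lazily.
    files = [open(p) for p in chunk_paths]
    with open(out_path, "w") as out:
        for value in heapq.merge(*((int(line) for line in f) for f in files)):
            out.write("%d\n" % value)
    for f in files:
        f.close()
```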

Sorting Data with Space Constraints

Now the problem is pretty simple. You have 1000 MB of data and you have to sort it. The catch is that you only have 100 MB of space to sort it in. (Let's say the 1000 MB is stored on disk and you have only 100 MB of RAM to sort the data; at any time you can have only 100 MB of data in RAM.)
Now I came up with this solution:
Divide the data into 10 parts - 100 MB each and sort it using Quick Sort.
Then write all the chunks of data into the Hard Drive.
Now pick the first 10 MB from each chunk and then merge. Now you have 100 MB. Keep this 100 MB separate.
Now do the same thing. Pick the next 10 MB from each chunk and merge.
Keep doing this and then concatenate the data.
Now the problem I'm facing is that since we're merging each 100 MB separately, when we concatenate we will be making mistakes. (These 100 MB outputs would also need to be merged with each other.)
How can I solve this problem?
External merge sorting differs from the internal version in the merge phase. In your question you are trying to directly apply the internal merge algorithm to an external merge.
Hence you end up with having to merge two chunks of size n / 2. As you correctly note, that won't work because you run out of memory.
Let's assume that you have enough memory to sort 1/k th of all elements. This leaves you with k sorted lists. Instead of merging two lists, you merge all k at once:
Pick an output buffer size; half of the available memory seems to be a good value. This leaves the other half for input buffering, or m = memory / (2k) per subchunk.
Read the first m elements from each subchunk. All memory is now in use, and you have the lowest m elements of each subchunk in memory.
From each of the k input buffers, choose the lowest value. Append this value into the output buffer. Repeat until one of your input buffers runs out.
If one of your input buffers runs out, read the next m elements from the subchunk on disk. If there are no more elements, you are done with the subchunk.
If the output buffer is full, append it to the output file and reset its position to the start.
Rinse and repeat from (3) until all subchunks run out.
The output file now is sorted.
You can think of the input buffers as a set of streaming buffers on sorted buckets. Check the streams and pick the best (i.e. lowest) element from all of them and save that one to the output list. From the outside it is a stream merge with smart prefetch and output buffering.
You need to repeat steps (3) and (4) until every subchunk has been fully consumed.
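The buffered k-way merge in steps (1)-(6) can be sketched as follows. This is a simplified in-memory model under my own assumptions: the subchunks are represented as iterators over already-sorted values (in a real external sort they would be readers over the sorted files on disk), and the result is returned instead of being flushed to an output file.

```python
import itertools

def merge_subchunks(subchunks, m):
    """k-way merge with a per-subchunk input buffer of m elements each."""
    # Step 2: fill each input buffer with the lowest m elements of its subchunk.
    buffers = [list(itertools.islice(sc, m)) for sc in subchunks]
    output = []
    while any(buffers):
        # Step 3: among the buffered heads, pick the subchunk with the lowest value.
        i = min((j for j, b in enumerate(buffers) if b),
                key=lambda j: buffers[j][0])
        output.append(buffers[i].pop(0))
        # Step 4: a buffer ran out -- refill it from its subchunk.
        if not buffers[i]:
            buffers[i] = list(itertools.islice(subchunks[i], m))
    # Step 5 would flush `output` to the output file whenever it fills.
    return output
```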

External Sorting with a heap?

I have a file with a large amount of data, and I want to sort it holding only a fraction of the data in memory at any given time.
I've noticed that merge sort is popular for external sorting, but I'm wondering if it can be done with a heap (min or max). Basically my goal is to get the top (using arbitrary numbers) 10 items in a 100 item list while never holding more than 10 items in memory.
I mostly understand heaps, and understand that heapifying the data would put it in the appropriate order, from which I could just take the last fraction as my solution, but I can't figure out how to do this without an I/O for every freakin' item.
Ideas?
Thanks! :D
Using heapsort requires lots of seek operations in the file, both for creating the heap initially and when removing the top element. For that reason, it's not a good idea.
However, you can use a variation of mergesort where every heap element is a sorted list. The size of the lists is determined by how much you want to keep in memory. You create these lists from the input file by loading chunks of data, sorting them, and then writing them to a temporary file. Then you treat every file as one list: read its first element, and create a heap from those. When removing the top element, you remove it from its list and restore the heap condition if necessary.
There is one aspect, though, that makes these facts about sorting irrelevant: you say you want to determine the top 10 elements. For that, you could indeed use an in-memory heap. Just take an element from the file, push it onto the heap, and if the size of the heap exceeds 10, remove the lowest element. To make it more efficient, only push an element onto the heap if the size is below 10 or the element is above the lowest one, which you then replace and re-heapify. Keeping the top ten in a heap allows you to scan through the file only once; everything else is done in memory. Using a binary tree instead of a heap would also work and probably be similarly fast; for a number as small as 10, you could even use an array and bubble-sort the elements in place.
Note: I'm assuming that 10 and 100 were just examples. If your numbers are really that low, any discussion about efficiency is probably moot, unless you're doing this operation several times per second.
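In Python, the in-memory top-k approach described above is available directly as `heapq.nlargest`, which maintains exactly such a bounded heap while scanning once (the sample data here is made up):

```python
import heapq

data = [27, 3, 88, 41, 5, 96, 12, 63, 70, 19, 50, 34]
top10 = heapq.nlargest(10, data)  # scans once, keeps at most 10 items heaped
```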
Yes, you can use a heap to find the top-k items in a large file, holding only the heap + an I/O buffer in memory.
The following obtains the min-k items by making use of a max-heap of length k. You could read the file sequentially, doing an I/O for every item, but it will generally be much faster to load the data in blocks into an auxiliary buffer of length b. The method runs in O(n log k) operations using O(k + b) space.
import heapq

def min_k(blocks, k):
    # blocks: the file's contents, read b items at a time
    heap = []                      # max-heap of the k smallest seen, via negation
    for block in blocks:
        for item in block:
            if len(heap) < k:
                heapq.heappush(heap, -item)
            elif item < -heap[0]:  # smaller than the largest of the current k
                heapq.heapreplace(heap, -item)
    return sorted(-x for x in heap)
Heaps require lots of nonsequential access. Mergesort is great for external sorting because it does a whole lot of sequential access.
Sequential access is a hell of a lot faster on the kinds of disks that spin, because the head doesn't need to move. Sequential access will probably also be a hell of a lot faster than heapsort's access pattern on solid-state disks, because SSDs do accesses in blocks that are probably considerably larger than a single item in your file.
By using merge sort and passing the two values by reference, you only have to hold the two comparison values in a buffer, and move through the array until it is sorted in place.
