Merge sorted files effectively? - algorithm

I have n files, 50 <= n <= 100 that contain sorted integers, all of them the same size, 250MB or 500MB.
e.g
1st file: 3, 67, 123, 134, 200, ...
2nd file: 1, 12, 33, 37, 94, ...
3rd file: 11, 18, 21, 22, 1000, ...
I am running this on a 4-core machine and the goal is to merge the files as soon as possible.
Since the total size can reach 50GB I can't read them into RAM.
So far I tried to do the following:
1) Read a number from every file, and store them in an array.
2) Find the lowest number.
3) Write that number to the output.
4) Read the next number from the file that the lowest number came from (if that file is not exhausted).
Repeat steps 2-4 till we have no numbers left.
Reading and writing is done using buffers of 4MB.
My algorithm above works correctly, but it is not performing as fast as I want. The biggest issue is that it performs much worse with 100 files x 250MB than with 50 files x 500MB.
What is the most efficient merge algorithm in my case?

Well, you can first significantly improve efficiency by doing step (2) of your algorithm smartly. Instead of doing a linear search over all the numbers, use a min-heap: inserting a value and deleting the minimal value are both done in logarithmic time, so this improves the speed for a large number of files. This changes the time complexity to O(n log k), down from the naive O(n*k) (where n is the total number of elements and k is the number of files).
In addition, you need to minimize the number of "random" reads from files, because a few big sequential reads are much faster than many small random reads. You can do that by increasing the buffer size, for example (the same goes for writing).
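As a minimal sketch of the heap-based step (2), assuming each input is exposed as a sorted Python iterable (the function name merge_sorted is made up for illustration; Python's standard library also ships heapq.merge, which does the same job):

```python
import heapq
from typing import Iterable, Iterator, List, Tuple


def merge_sorted(streams: List[Iterable[int]]) -> Iterator[int]:
    """Yield every value from k sorted streams in ascending order."""
    iterators = [iter(s) for s in streams]
    heap: List[Tuple[int, int]] = []
    # Seed the heap with the first value of every non-empty stream.
    for idx, it in enumerate(iterators):
        first = next(it, None)
        if first is not None:
            heap.append((first, idx))
    heapq.heapify(heap)

    while heap:
        value, idx = heap[0]              # current minimum over all streams
        yield value
        nxt = next(iterators[idx], None)
        if nxt is None:
            heapq.heappop(heap)           # that stream is exhausted
        else:
            # Replace the root in a single O(log k) operation.
            heapq.heapreplace(heap, (nxt, idx))


files = [
    [3, 67, 123, 134, 200],
    [1, 12, 33, 37, 94],
    [11, 18, 21, 22, 1000],
]
print(list(merge_sorted(files)))
```

Each output value costs one heap operation on k entries, which is where the O(n log k) bound comes from.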

(java) Use GZIPInputStream and GZIPOutputStream for .gz compression, with fast rather than high compression. Maybe that will let more of the data be kept in memory to some extent.
Then head movement on disk between the many files should be reduced, say by merging the files two at a time, so that both inputs are longer sequences.
For repetitions, maybe use run-length encoding: instead of repeating a value, add a repetition count: 11 12 13#7 15
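A small sketch of that run-length idea, assuming the values arrive already sorted so that equal values are adjacent. The function names and the `value#count` text format are illustrative, taken from the example above rather than from any standard:

```python
from itertools import groupby
from typing import Iterable, Iterator, Tuple


def rle_encode(sorted_values: Iterable[int]) -> Iterator[Tuple[int, int]]:
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    for value, run in groupby(sorted_values):
        yield value, sum(1 for _ in run)


def rle_decode(pairs: Iterable[Tuple[int, int]]) -> Iterator[int]:
    """Expand (value, count) pairs back into the original sequence."""
    for value, count in pairs:
        for _ in range(count):
            yield value


def format_rle(pairs: Iterable[Tuple[int, int]]) -> str:
    """Render pairs as text, writing a count only when it is above 1."""
    return " ".join(str(v) if c == 1 else f"{v}#{c}" for v, c in pairs)


values = [11, 12] + [13] * 7 + [15]
print(format_rle(rle_encode(values)))  # prints: 11 12 13#7 15
```

Whether this pays off depends entirely on how many repeated values the data contains.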

An effective way to utilize the multiple cores might be to perform input and output in distinct threads from the main comparison thread, in such a way that all the cores are kept busy and the main thread never unnecessarily blocks on input or output. One thread performing the core comparison, one writing the output, and NumCores-2 processing input (each from a subset of the input files) to keep the main thread fed.
The input and output threads could also perform stream-specific pre- and post-processing - for example, depending on the distribution of the input data a run length encoding scheme of the type alluded to by @Joop might provide significant speedup of the main thread by allowing it to efficiently order entire ranges of input.
Naturally all of this increases complexity and the possibility of error.
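One possible shape for the input side, as a rough sketch only: a background thread reads each source in blocks and hands them to the merging thread through a bounded queue. The names prefetch, block_size and depth are hypothetical, the sources are plain iterables standing in for file readers, and in standard CPython threads mainly help by overlapping blocking I/O rather than by parallelizing the comparisons.

```python
import heapq
import queue
import threading
from typing import Iterable, Iterator

_DONE = object()  # sentinel marking the end of a stream


def prefetch(source: Iterable[int], block_size: int = 4, depth: int = 2) -> Iterator[int]:
    """Read `source` in a background thread, block by block."""
    blocks: queue.Queue = queue.Queue(maxsize=depth)  # bounded, so the reader cannot run far ahead

    def reader() -> None:
        block = []
        for value in source:
            block.append(value)
            if len(block) == block_size:
                blocks.put(block)          # blocks while the queue is full
                block = []
        if block:
            blocks.put(block)
        blocks.put(_DONE)

    # Daemon thread: it will not keep the process alive if the consumer stops early.
    threading.Thread(target=reader, daemon=True).start()
    while True:
        block = blocks.get()               # waits only if the reader has fallen behind
        if block is _DONE:
            return
        yield from block


inputs = [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
merged = list(heapq.merge(*(prefetch(s) for s in inputs)))
print(merged)
```

This sketch does no error handling: an exception inside the reader thread would leave the consumer waiting, which is one example of the added complexity mentioned above.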

Related

External sorting when indices can fit in RAM

I want to sort a multi-TB file full of 20kb records. I only need to read a few bytes from each record in order to determine its order, so I can sort the indices in memory.
I cannot fit the records themselves in memory, however. Random access is slower than sequential access, and I don't want to random-access writes to the output file either. Is there any algorithm known that will take advantage of the sorted indices to "strategize" the optimal way to re-arrange the records as they are copied from the input file to the output file?
There are "reorder array according to sorted index" algorithms, but they involve random access. Even in the case of an SSD, although the random access itself is not an issue, reading or writing one record at a time due to random access has a lower throughput than reading or writing multiple records at a time, which is what is typically done by an external merge sort.
For a typical external merge sort, the file is read in "chunks" small enough for an internal sort to sort each "chunk", and the sorted "chunks" are written to external media. After this initial pass, a k-way merge is done on the "chunks", multiplying the size of the merged "chunks" by k on each merge pass, until a single sorted "chunk" is produced. The read/write operations can read multiple records at a time. Say you have 1GB of RAM and use a 16-way merge. For a 16-way merge, 16 "input" buffers and 1 "output" buffer are used, so the buffer size could be about 58MB (1GB/17, rounded down a bit for variable space), which would allow about 2900 records to be read or written at a time, greatly reducing random access and command overhead. Assuming the initial pass creates sorted chunks of size 0.5GB, after 3 (16-way) merge passes the chunk size is 2TB, after 4 passes it's 32TB, and so on.
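The buffer and pass arithmetic can be checked with a few lines of code. This is only a back-of-the-envelope helper (merge_plan is a made-up name) and it assumes decimal units, i.e. 1GB = 10^9 bytes and 20KB = 20,000 bytes:

```python
def merge_plan(ram_bytes: int, record_bytes: int, fan_in: int,
               initial_run_bytes: int, total_bytes: int):
    """Return (bytes per buffer, records per buffer, number of merge passes)."""
    # fan_in input buffers plus one output buffer share the RAM equally.
    buffer_bytes = ram_bytes // (fan_in + 1)
    records_per_buffer = buffer_bytes // record_bytes
    # Number of sorted runs after the initial pass (integer ceiling division).
    runs = -(-total_bytes // initial_run_bytes)
    passes = 0
    while runs > 1:
        runs = -(-runs // fan_in)   # each pass merges up to fan_in runs into one
        passes += 1
    return buffer_bytes, records_per_buffer, passes


GB = 10**9
TB = 10**12
print(merge_plan(1 * GB, 20_000, 16, GB // 2, 2 * TB))   # (58823529, 2941, 3)
print(merge_plan(1 * GB, 20_000, 16, GB // 2, 32 * TB))  # (58823529, 2941, 4)
```

Under these assumptions 1GB/17 comes to roughly 58.8MB, or 2941 records per buffer, and a 2TB file needs 3 merge passes while a 32TB file needs 4.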

What is this algorithm called and what is the time complexity?

Let's call the amount of RAM available R.
We have an unsorted file of 10 gigs with one column of keys (duplicates allowed).
You split the file into k files, each of which has size R.
You sort each file and write the file to disk.
You read (10 / R) gigs from each file into input buffers. You perform a k-way merge where you read the first key from the first file and compare to every other key in your input buffers to find the minimum. You add this to your output buffer which should also hold (10 / R) gigs of data.
Once the output buffer is full, write it to disk to a final sorted file.
Repeat this process until all k files have been fully read. If an input buffer is empty, fill it with the next (10 / R) gigs of its corresponding file until the file has been entirely read. We can do this buffer refilling in parallel.
What is the official name for this algorithm? Is it a K - Way Merge sort?
The first part, where we split into K files is O((n / k) log (n / k))
The second part, where we merge is O(nk)?
If I am wrong, can I have an explanation? If this is external merge sort, how do we optimize this further?
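This is a textbook external merge sort. Its time complexity is O(n log n): sorting the k chunks costs O(n log(n/k)) in total, and merging them with a min-heap costs O(n log k) (it is O(nk) only if you scan all k input buffers linearly for every output key).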
Here's Wikipedia's entry on it (linked above):
One example of external sorting is the external merge sort algorithm,
which sorts chunks that each fit in RAM, then merges the sorted chunks
together. For example, for sorting 900 megabytes of data using
only 100 megabytes of RAM: 1) Read 100 MB of the data in main memory and
sort by some conventional method, like quicksort. 2) Write the sorted
data to disk. 3) Repeat steps 1 and 2 until all of the data is in sorted
100 MB chunks (there are 900MB / 100MB = 9 chunks), which now need to
be merged into one single output file. 4) Read the first 10 MB (= 100MB /
(9 chunks + 1)) of each sorted chunk into input buffers in main memory
and allocate the remaining 10 MB for an output buffer. (In practice,
it might provide better performance to make the output buffer larger
and the input buffers slightly smaller.) 5) Perform a 9-way merge and
store the result in the output buffer. Whenever the output buffer
fills, write it to the final sorted file and empty it. Whenever any of
the 9 input buffers empties, fill it with the next 10 MB of its
associated 100 MB sorted chunk until no more data from the chunk is
available. This is the key step that makes external merge sort work
externally -- because the merge algorithm only makes one pass
sequentially through each of the chunks, each chunk does not have to
be loaded completely; rather, sequential parts of the chunk can be
loaded as needed. Historically, instead of a sort, sometimes a
replacement-selection algorithm was used to perform the initial
distribution, to produce on average half as many output chunks of
double the length.
I'd say it's a merge algorithm; the exact file I/O is an implementation detail.
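For illustration, here is a compact sketch of the whole procedure, assuming the input is a text file with one integer per line and no blank lines. The names external_sort and max_items_in_ram are hypothetical, and the merge phase leans on the standard library's heapq.merge rather than hand-managed buffers:

```python
import heapq
import os
import tempfile
from itertools import islice
from typing import Iterator, List


def _write_run(values: List[int], directory: str) -> str:
    """Write one sorted run to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(dir=directory, suffix=".run", text=True)
    with os.fdopen(fd, "w") as f:
        f.writelines(f"{v}\n" for v in values)
    return path


def _read_run(path: str) -> Iterator[int]:
    """Stream a run back sequentially; the file object does the buffering."""
    with open(path) as f:
        for line in f:
            yield int(line)


def external_sort(in_path: str, out_path: str, max_items_in_ram: int) -> None:
    with tempfile.TemporaryDirectory() as tmp:
        runs = []
        # Phase 1: read chunks that fit in RAM, sort each one, write it out as a run.
        with open(in_path) as src:
            while True:
                chunk = [int(line) for line in islice(src, max_items_in_ram)]
                if not chunk:
                    break
                chunk.sort()
                runs.append(_write_run(chunk, tmp))
        # Phase 2: a single k-way merge over all runs, written sequentially.
        with open(out_path, "w") as dst:
            merged = heapq.merge(*(_read_run(p) for p in runs))
            dst.writelines(f"{v}\n" for v in merged)
```

A single merge pass keeps one open file per run, so with very many runs a real implementation would merge in several passes.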

Given 10 billion URLs with an average length of 100 characters each, check for duplicates

Suppose I have 1GB memory available, how to find the duplicates among those urls?
I saw one solution in the book "Cracking the Coding Interview": it suggests using hashing to separate these URLs into 4000 files x.txt, x = hash(u)%4000, in the first scan. In the second scan, we can check for duplicates in each x.txt file separately.
But how can I guarantee that each file would store about 1GB of URL data? I think there's a chance that some files would store much more URL data than other files.
My solution to this problem is to apply the file separation trick iteratively until the files are small enough for the memory available to me.
Is there any other way to do it?
If you don't mind a solution which requires a bit more code, you can do the following:
Calculate only the hashcodes. Each hashcode is exactly 4 bytes, so you have perfect control of the amount of memory that will be occupied by each chunk of hashcodes. You can also fit a lot more hashcodes in memory than URLs, so you will have fewer chunks.
Find the duplicate hashcodes. Presumably, they are going to be much fewer than 10 billion. They might even all fit in memory.
Go through the URLs again, recomputing hashcodes, seeing if a URL has one of the duplicate hashcodes, and then comparing actual URLs to rule out false positives due to hashcode collisions. (With 10 billion urls, and with hashcodes only having 4 billion different values, there will be plenty of collisions.)
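The three steps above can be sketched as follows. This is a toy version: zlib.crc32 stands in for the 4-byte hashcode, read_urls is a hypothetical callable that returns a fresh pass over the data, and the in-memory Counter in pass 1 stands in for what would have to be a chunked or external computation on the real data set (10 billion 4-byte hashcodes are about 40GB).

```python
import zlib
from collections import Counter
from typing import Callable, Iterable, Set


def hash32(url: str) -> int:
    """Deterministic 32-bit hashcode of a URL."""
    return zlib.crc32(url.encode("utf-8"))


def find_duplicate_urls(read_urls: Callable[[], Iterable[str]]) -> Set[str]:
    # Pass 1: look only at hashcodes and keep those that occur more than once.
    hash_counts = Counter(hash32(u) for u in read_urls())
    suspects = {h for h, c in hash_counts.items() if c > 1}

    # Pass 2: re-read the URLs and compare the actual strings, but only for
    # suspect hashcodes, so that hash collisions are not reported as duplicates.
    url_counts: Counter = Counter()
    for u in read_urls():
        if hash32(u) in suspects:
            url_counts[u] += 1
    return {u for u, c in url_counts.items() if c > 1}


urls = ["http://a.example/", "http://b.example/", "http://a.example/", "http://c.example/"]
print(find_duplicate_urls(lambda: urls))  # {'http://a.example/'}
```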
This is a bit long for a comment.
The truth is, you cannot guarantee that a file is going to be smaller than 1 Gbyte. I'm not sure where the 4,000 comes from. The total data volume is about 1,000 Gbytes, so the average file size would be 250 Mbytes.
It is highly unlikely that you would ever be off by a factor of 4 in size. Of course, it is possible. In that case, just split the file again into a handful of other files. This adds a negligible amount to the complexity.
What this doesn't account for is a simple case. What if one of the URLs has a length of 100 and appears 10,000,000 times in the data? Ouch! In that case, you would need to read a file and "reduce" it by combining each value with a count.
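A sketch of the partition-then-reduce approach, under the assumptions that URLs contain no newlines and that zlib.crc32 is an acceptable stand-in for the book's hash function. The function names are made up, and the bucket count is kept tiny here; opening 4000 files at once may exceed the operating system's open-file limit, so a real run might need to write the buckets in batches.

```python
import os
import tempfile
import zlib
from collections import Counter
from typing import Dict, Iterable, List


def partition(urls: Iterable[str], directory: str, buckets: int) -> List[str]:
    """Scan 1: append every URL to the file x.txt with x = hash(u) % buckets."""
    paths = [os.path.join(directory, f"{i}.txt") for i in range(buckets)]
    files = [open(p, "w") for p in paths]
    try:
        for u in urls:
            files[zlib.crc32(u.encode("utf-8")) % buckets].write(u + "\n")
    finally:
        for f in files:
            f.close()
    return paths


def duplicates_in_bucket(path: str) -> Dict[str, int]:
    """Scan 2: reduce one bucket to URL -> count and keep the repeated ones."""
    with open(path) as f:
        # A URL repeated millions of times still occupies a single entry here.
        counts = Counter(line.rstrip("\n") for line in f)
    return {u: c for u, c in counts.items() if c > 1}


with tempfile.TemporaryDirectory() as d:
    urls = ["http://a.example/"] * 3 + ["http://b.example/", "http://c.example/"]
    found: Dict[str, int] = {}
    for p in partition(urls, d, buckets=4):
        found.update(duplicates_in_bucket(p))
print(found)  # {'http://a.example/': 3}
```

Equal URLs always hash to the same bucket, so checking each bucket on its own finds every duplicate.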

Sorting Data with Space Constraints

The problem is pretty simple. You have 1000 MB of data and you have to sort it. The catch is that you only have 100 MB of space to sort it in. (Let's say the 1000 MB is stored on disk and you only have 100 MB of RAM to sort the data; at any time you can have at most 100 MB of data in RAM.)
Now I came up with this solution:
Divide the data into 10 parts - 100 MB each and sort it using Quick Sort.
Then write all the chunks of data into the Hard Drive.
Now pick the first 10 MB from each chunk and then merge. Now you have 100 MB. Now keep this 100 MB separated.
Now do the same thing. Pick the next 10 MB from each chunk and merge.
Keep doing this and then concatenate the data.
The problem I'm facing is that, because we merge each 100 MB separately, the result will be wrong when we concatenate the pieces. (These 100 MB pieces should also be merged together.)
How can I solve this problem?
External merge sorting differs from the internal version in the merge phase. In your question you are trying to directly apply the internal merge algorithm to an external merge.
Hence you end up with having to merge two chunks of size n / 2. As you correctly note, that won't work because you run out of memory.
Let's assume that you have enough memory to sort 1/k th of all elements. This leaves you with k sorted lists. Instead of merging two lists, you merge all k at once:
1) Pick an output buffer size. A good value seems to be n / 2. This leaves n / 2 memory for input buffering or m = n / (2 x k) per subchunk.
2) Read the first m elements from each subchunk. All memory is now used and you have the lowest m elements from each subchunk in memory.
3) From each of the k input buffers, choose the lowest value. Append this value into the output buffer. Repeat until one of your input buffers runs out.
4) If one of your input buffers runs out, read the next m elements from the subchunk on disk. If there are no more elements, you are done with the subchunk.
5) If the output buffer is full, append it to the output file and reset its position to the start.
6) Rinse and repeat from (3) until all subchunks run out.
The output file now is sorted.
You can think of the input buffers as a set of streaming buffers on sorted buckets. Check the streams and pick the best (i.e. lowest) element from all of them and save that one to the output list. From the outside it is a stream merge with smart prefetch and output buffering.
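The buffer handling described in this answer can be sketched as follows. It is a simplified model: the sorted subchunks are Python sequences standing in for files on disk, m and out_capacity are counted in elements rather than bytes, flush is a hypothetical callback that appends a full output buffer to the output file, and the minimum is found by a linear scan over the k buffers as in the description (a heap could replace that scan).

```python
from typing import Callable, List, Sequence


def buffered_merge(chunks: Sequence[Sequence[int]], m: int, out_capacity: int,
                   flush: Callable[[List[int]], None]) -> None:
    k = len(chunks)
    offsets = [0] * k                        # next unread position in each subchunk "on disk"
    buffers: List[List[int]] = [[] for _ in range(k)]
    positions = [0] * k                      # read position inside each input buffer
    output: List[int] = []

    def refill(i: int) -> None:
        """Load the next m elements of subchunk i (empty once it is used up)."""
        buffers[i] = list(chunks[i][offsets[i]:offsets[i] + m])
        offsets[i] += len(buffers[i])
        positions[i] = 0

    for i in range(k):
        refill(i)

    while True:
        # Pick the input buffer whose current element is the lowest.
        best = -1
        for i in range(k):
            if buffers[i] and (best < 0 or
                               buffers[i][positions[i]] < buffers[best][positions[best]]):
                best = i
        if best < 0:
            break                            # every subchunk has run out
        output.append(buffers[best][positions[best]])
        positions[best] += 1
        if positions[best] == len(buffers[best]):
            refill(best)                     # input buffer ran out: read the next m elements
        if len(output) == out_capacity:
            flush(output)                    # output buffer full: append it to the output
            output = []
    if output:
        flush(output)


result: List[int] = []
buffered_merge([[1, 4, 9], [2, 3, 10, 11], [5]], m=2, out_capacity=3, flush=result.extend)
print(result)  # [1, 2, 3, 4, 5, 9, 10, 11]
```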
You need to repeat steps 3 and 4 N-1 times.

External Sorting with a heap?

I have a file with a large amount of data, and I want to sort it holding only a fraction of the data in memory at any given time.
I've noticed that merge sort is popular for external sorting, but I'm wondering if it can be done with a heap (min or max). Basically my goal is to get the top (using arbitrary numbers) 10 items in a 100 item list while never holding more than 10 items in memory.
I mostly understand heaps, and understand that heapifying the data would put it in the appropriate order, from which I could just take the last fraction of it as my solution, but I can't figure out how to do with without an I/O for every freakin' item.
Ideas?
Thanks! :D
Using a heapsort requires lots of seek operations in the file for creating the heap initially and also when removing the top element. For that reason, it's not a good idea.
However, you can use a variation of mergesort where every heap element is a sorted list. The size of the lists is determined by how much you want to keep in memory. You create these lists from the input file by loading chunks of data, sorting them and then writing them to a temporary file. Then, you treat every file as one list, read the first element and create a heap from it. When removing the top element, you remove it from the list and restore the heap conditions if necessary.
There is one aspect though that makes these facts about sorting irrelevant: You say you want to determine the top 10 elements. For that, you could indeed use an in-memory heap. Just take an element from the file, push it onto the heap and, if the size of the heap exceeds 10, remove the lowest element. To make it more efficient, only push it onto the heap if the size is below 10 or it is above the lowest element, which you then replace and re-heapify. Keeping the top ten in a heap allows you to scan through the file only once; everything else will be done in memory. Using a binary tree instead of a heap would also work and probably be similarly fast. For a small number like 10, you could even use an array and bubblesort the elements in place.
Note: I'm assuming that 10 and 100 were just examples. If your numbers are really that low, any discussion about efficiency is probably moot, unless you're doing this operation several times per second.
Yes, you can use a heap to find the top-k items in a large file, holding only the heap + an I/O buffer in memory.
The following will obtain the min-k items by making use of a max-heap of length k. You could read the file sequentially, doing an I/O for every item, but it will generally be much faster to load the data in blocks into an auxiliary buffer of length b. The method runs in O(n*log(k)) operations using O(k + b) space.
while (file not empty)
    read block from file
    for (i = all items in block)
        if (heap.count() < k)
            heap.push(item[i])
        else
            if (item[i] < heap.root())
                heap.pop_root()
                heap.push(item[i])
            endif
        endif
    endfor
endwhile
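A runnable rendering of this pseudocode, as a sketch: Python's heapq module only provides a min-heap, so values are stored negated to emulate the max-heap, and any iterable stands in for the file. The name smallest_k and the default block size are arbitrary choices.

```python
import heapq
from itertools import islice
from typing import Iterable, List


def smallest_k(items: Iterable[int], k: int, block_size: int = 1024) -> List[int]:
    """Return the k smallest items in ascending order, using O(k + block_size) space."""
    if k <= 0:
        return []
    heap: List[int] = []   # holds negated values, so -heap[0] is the largest item kept
    it = iter(items)
    while True:
        block = list(islice(it, block_size))   # "read block from file"
        if not block:
            break
        for x in block:
            if len(heap) < k:
                heapq.heappush(heap, -x)
            elif x < -heap[0]:
                # Smaller than the current worst candidate: pop the root and push in one step.
                heapq.heapreplace(heap, -x)
    return sorted(-v for v in heap)


print(smallest_k([5, 1, 9, 3, 7, 2, 8], k=3, block_size=2))  # [1, 2, 3]
```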
Heaps require lots of nonsequential access. Mergesort is great for external sorting because it does a whole lot of sequential access.
Sequential access is a hell of a lot faster on the kinds of disks that spin because the head doesn't need to move. Sequential access will probably also be a hell of a lot faster on solid-state disks than heapsort's access because they do accesses in blocks that are probably considerably larger than a single thing in your file.
By using Merge sort and passing the two values by reference you only have to hold the two comparison values in a buffer, and move throughout the array until it is sorted in place.
