Sorting Data with Space Constraints

The problem is pretty simple: you have 1000 MB of data to sort, but only 100 MB of space to sort it in. (Let's say the 1000 MB is stored on disk and you have only 100 MB of RAM, so at any time you can hold at most 100 MB of data in RAM.)
Now I came up with this solution:
Divide the data into 10 parts of 100 MB each and sort each part using quicksort.
Then write all the sorted chunks back to the hard drive.
Now pick the first 10 MB from each chunk and merge them. You now have 100 MB; keep it aside.
Do the same thing again: pick the next 10 MB from each chunk and merge.
Keep doing this and finally concatenate the pieces.
The problem I'm facing is that since we merge each 100 MB separately, the concatenated result will contain mistakes (these 100 MB pieces would also need to be merged with each other).
How can I solve this problem?

External merge sorting differs from the internal version in the merge phase. In your question you are trying to directly apply the internal merge algorithm to an external merge.
Hence you end up having to merge two chunks of size n / 2. As you correctly note, that won't work because you run out of memory.
Let's assume that you have enough memory to sort 1/k-th of all elements. This leaves you with k sorted lists. Instead of merging two lists, you merge all k at once:
1) Pick an output buffer size. A good value seems to be n / 2. This leaves n / 2 of memory for input buffering, i.e. m = n / (2k) per subchunk.
2) Read the first m elements from each subchunk. All memory is now in use and you have the lowest m elements of each subchunk in memory.
3) From each of the k input buffers, choose the lowest value. Append this value to the output buffer. Repeat until one of your input buffers runs out.
4) If an input buffer runs out, read the next m elements of that subchunk from disk. If there are no more elements, you are done with that subchunk.
5) If the output buffer is full, append it to the output file and reset its position to the start.
6) Rinse and repeat from (3) until all subchunks run out.
7) The output file is now sorted.
You can think of the input buffers as a set of streaming buffers on sorted buckets. Check the streams and pick the best (i.e. lowest) element from all of them and save that one to the output list. From the outside it is a stream merge with smart prefetch and output buffering.
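A minimal Python sketch of the buffered k-way merge described above (the one-integer-per-line chunk format and the block sizes m and out_cap are illustrative assumptions, not part of the original answer):

def kway_merge(chunk_paths, out_path, m=10_000, out_cap=100_000):
    # One input buffer of up to m integers per sorted chunk, plus one output buffer.
    chunks = [open(p) for p in chunk_paths]

    def refill(f):
        # Read the next m integers (one per line) from a chunk; [] means it is exhausted.
        buf = []
        for _ in range(m):
            line = f.readline()
            if not line:
                break
            buf.append(int(line))
        return buf

    buffers = [refill(f) for f in chunks]
    positions = [0] * len(chunks)
    out_buf = []

    with open(out_path, "w") as out:
        while True:
            # Step 3: among the heads of the non-exhausted input buffers, pick the lowest.
            best = None
            for i, buf in enumerate(buffers):
                if positions[i] < len(buf):
                    if best is None or buf[positions[i]] < buffers[best][positions[best]]:
                        best = i
            if best is None:
                break  # every subchunk has run out
            out_buf.append(buffers[best][positions[best]])
            positions[best] += 1

            # Step 4: if that input buffer ran out, read the next m elements from disk.
            if positions[best] == len(buffers[best]):
                buffers[best] = refill(chunks[best])
                positions[best] = 0

            # Step 5: if the output buffer is full, append it to the output file.
            if len(out_buf) >= out_cap:
                out.write("\n".join(map(str, out_buf)) + "\n")
                out_buf.clear()

        if out_buf:
            out.write("\n".join(map(str, out_buf)) + "\n")
    for f in chunks:
        f.close()

Replacing the linear scan over the k buffer heads with a min-heap would make each selection O(log k) instead of O(k), but the buffering scheme stays the same.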

You need to repeat steps 3 and 4 N-1 times.

Related

External sorting when indices can fit in RAM

I want to sort a multi-TB file full of 20kb records. I only need to read a few bytes from each record in order to determine its order, so I can sort the indices in memory.
I cannot fit the records themselves in memory, however. Random access is slower than sequential access, and I don't want random-access writes to the output file either. Is there any known algorithm that will take advantage of the sorted indices to "strategize" the optimal way to rearrange the records as they are copied from the input file to the output file?
There are "reorder an array according to a sorted index" algorithms, but they involve random access. Even on an SSD, where random access itself is not an issue, reading or writing one record at a time has lower throughput than reading or writing many records at a time, which is what an external merge sort typically does.
For a typical external merge sort, the file is read in "chunks" small enough for an internal sort to sort each "chunk", and the sorted "chunks" are written to external media. After this initial pass, a k-way merge is done on the "chunks", multiplying the size of the merged "chunks" by k on each merge pass, until a single sorted "chunk" is produced. The read/write operations can transfer multiple records at a time. Say you have 1 GB of RAM and use a 16-way merge. For a 16-way merge, 16 "input" buffers and 1 "output" buffer are used, so the buffer size could be 63 MB (1 GB / 17, rounded down a bit for variable space), which would allow about 3150 records to be read or written at a time, greatly reducing random access and command overhead. Assuming the initial pass creates sorted chunks of 0.5 GB, after 3 (16-way) merge passes the chunk size is 2 TB, after 4 passes it's 32 TB, and so on.
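The buffer arithmetic from that example, spelled out (a rough sketch; the 1 GB RAM, 20 kB record, and 16-way figures are the answer's assumptions):

ram = 1 * 1024**3                 # 1 GB of RAM
record = 20 * 1000                # 20 kB records
k = 16                            # 16-way merge: 16 input buffers + 1 output buffer

buf = ram // (k + 1)              # about 63 MB per buffer
print(buf // record)              # about 3150 records per sequential read or write

run_gb = 0.5                      # sorted runs produced by the initial in-memory pass
for p in range(1, 5):
    run_gb *= k                   # each 16-way merge pass multiplies the run size by 16
    print(f"after merge pass {p}: {run_gb:g} GB per run")
# after pass 3 the runs are 2 TB (2048 GB), after pass 4 they are 32 TB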

What is this algorithm called and what is the time complexity?

Let's call the amount of RAM available R.
We have an unsorted file of 10 gigs with one column of keys (duplicates allowed).
You split the file into k files, each of which has size R.
You sort each file and write the file to disk.
You read (10 / R) gigs from each file into input buffers. You perform a k-way merge where you read the first key from the first file and compare to every other key in your input buffers to find the minimum. You add this to your output buffer which should also hold (10 / R) gigs of data.
Once the output buffer is full, write it to disk to a final sorted file.
Repeat this process until all k files have been fully read. If an input buffer is empty, fill it with the next (10 / R) gigs of its corresponding file until the file has been entirely read. We can do this buffer refilling in parallel.
What is the official name for this algorithm? Is it a k-way merge sort?
The first part, where we split into k files and sort each, is O((n / k) log(n / k))
The second part, where we merge is O(nk)?
If I am wrong, can I have an explanation? If this is external merge sort, how do we optimize this further?
This is a textbook external merge sort. The time complexity is O(n log n).
Here's Wikipedia's entry on it:
One example of external sorting is the external merge sort algorithm,
which sorts chunks that each fit in RAM, then merges the sorted chunks
together. For example, for sorting 900 megabytes of data using
only 100 megabytes of RAM: 1) Read 100 MB of the data in main memory and
sort by some conventional method, like quicksort. 2) Write the sorted
data to disk. 3) Repeat steps 1 and 2 until all of the data is in sorted
100 MB chunks (there are 900MB / 100MB = 9 chunks), which now need to
be merged into one single output file. 4) Read the first 10 MB (= 100MB /
(9 chunks + 1)) of each sorted chunk into input buffers in main memory
and allocate the remaining 10 MB for an output buffer. (In practice,
it might provide better performance to make the output buffer larger
and the input buffers slightly smaller.) 5) Perform a 9-way merge and
store the result in the output buffer. Whenever the output buffer
fills, write it to the final sorted file and empty it. Whenever any of
the 9 input buffers empties, fill it with the next 10 MB of its
associated 100 MB sorted chunk until no more data from the chunk is
available. This is the key step that makes external merge sort work
externally -- because the merge algorithm only makes one pass
sequentially through each of the chunks, each chunk does not have to
be loaded completely; rather, sequential parts of the chunk can be
loaded as needed. Historically, instead of a sort, sometimes a
replacement-selection algorithm was used to perform the initial
distribution, to produce on average half as many output chunks of
double the length.
I'd say it's a merge algorithm; the exact file I/O is an implementation detail.
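As a concrete illustration, here is a compact Python sketch of both phases (run formation, then the k-way merge). The one-integer-per-line file format and the chunk size are assumptions for the example, not something the answer prescribes:

import heapq, itertools, os, tempfile

def external_sort(in_path, out_path, chunk_lines=1_000_000):
    # Phase 1: read chunks that fit in memory, sort each one, write sorted runs to disk.
    runs = []
    with open(in_path) as f:
        while True:
            chunk = list(itertools.islice(f, chunk_lines))
            if not chunk:
                break
            run = tempfile.NamedTemporaryFile("w", suffix=".run", delete=False)
            run.write("\n".join(map(str, sorted(map(int, chunk)))) + "\n")
            run.close()
            runs.append(run.name)

    # Phase 2: merge all runs at once. heapq.merge pulls lazily from each stream,
    # so only a small sliding window of every run is ever in memory.
    streams = [(int(line) for line in open(r)) for r in runs]
    with open(out_path, "w") as out:
        for value in heapq.merge(*streams):
            out.write(f"{value}\n")
    for r in runs:
        os.remove(r)

A production version would size chunks and explicit read/write buffers in bytes, as in the Wikipedia description, but the control flow is the same.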

Merge sorted files effectively?

I have n files, 50 <= n <= 100, that contain sorted integers, all of them the same size, 250 MB or 500 MB.
e.g.
1st file: 3, 67, 123, 134, 200, ...
2nd file: 1, 12, 33, 37, 94, ...
3rd file: 11, 18, 21, 22, 1000, ...
I am running this on a 4-core machine and the goal is to merge the files as soon as possible.
Since the total size can reach 50GB I can't read them into RAM.
So far I tried to do the following:
1) Read a number from every file, and store them in an array.
2) Find the lowest number.
3) Write that number to the output.
4) Read the next number from the file that the lowest number came from (if that file is not empty)
Repeat steps 2-4 till we have no numbers left.
Reading and writing is done using buffers of 4MB.
My algorithm above works correctly, but it's not performing as fast as I want. The biggest issue is that it performs much worse with 100 files x 250 MB than with 50 files x 500 MB.
What is the most efficient merge algorithm in my case?
Well, you can first significantly improve efficiency by doing step (2) of your algorithm smartly. Instead of doing a linear search over all the numbers, use a min-heap; insertion and deletion of the minimal value are done in logarithmic time, which helps for a large number of files. This changes the time complexity to O(n log k) from the naive O(n*k) (where n is the total number of elements and k is the number of files).
In addition, you need to minimize the number of "random" reads from files, because a few big sequential reads are much faster than many small random reads. You can do that by increasing the buffer size, for example (the same goes for writing).
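A sketch of that heap-based merge in Python (heapq standing in for the min-heap; the buffer size and the one-integer-per-line format are assumptions for illustration):

import heapq

def merge_files(paths, out_path, buf_bytes=32 * 1024 * 1024):
    # Big fixed-size buffers: few large sequential reads/writes instead of many small ones.
    files = [open(p, buffering=buf_bytes) for p in paths]
    out = open(out_path, "w", buffering=buf_bytes)

    # Min-heap of (current value, file index): finding the global minimum and
    # inserting that file's next value are both O(log k), giving O(n log k) overall.
    heap = []
    for i, f in enumerate(files):
        line = f.readline()
        if line:
            heapq.heappush(heap, (int(line), i))

    while heap:
        value, i = heapq.heappop(heap)
        out.write(f"{value}\n")
        line = files[i].readline()
        if line:
            heapq.heappush(heap, (int(line), i))

    out.close()
    for f in files:
        f.close()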
(Java) Use GZIPInputStream and GZIPOutputStream for .gz compression. That may help to some extent, since less data has to be stored and moved; use fast rather than high compression.
Movement on disk between the several files should then be reduced, say by merging files two at a time, both of them being larger sorted sequences.
For repetitions, maybe use run-length encoding: instead of repeating a value, store a repetition count, e.g. 11 12 13#7 15.
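A tiny sketch of that encoding (the "#count" notation simply follows the example above; it is not a standard format):

def rle_encode(values):
    # Collapse runs of equal values: 13 repeated 7 times becomes "13#7".
    out, i = [], 0
    while i < len(values):
        j = i
        while j < len(values) and values[j] == values[i]:
            j += 1
        out.append(str(values[i]) if j - i == 1 else f"{values[i]}#{j - i}")
        i = j
    return " ".join(out)

def rle_decode(text):
    # Expand "v#n" tokens back into n copies of v.
    for token in text.split():
        v, _, n = token.partition("#")
        yield from [int(v)] * (int(n) if n else 1)

print(rle_encode([11, 12] + [13] * 7 + [15]))   # 11 12 13#7 15
print(list(rle_decode("11 12 13#7 15"))[:4])    # [11, 12, 13, 13]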
An effective way to utilize the multiple cores might be to perform input and output in distinct threads from the main comparison thread, in such a way that all the cores are kept busy and the main thread never unnecessarily blocks on input or output. One thread performing the core comparison, one writing the output, and NumCores-2 processing input (each from a subset of the input files) to keep the main thread fed.
The input and output threads could also perform stream-specific pre- and post-processing; for example, depending on the distribution of the input data, a run-length encoding scheme of the type alluded to by @Joop might provide a significant speedup of the main thread by allowing it to efficiently order entire ranges of input.
Naturally all of this increases complexity and the possibility of error.

Scalable seq -> groupby -> count

I have a very large unordered sequence of int64s, about O(1B) entries. I need to generate the frequency histogram of the elements, i.e.:
inSeq
|> Seq.groupBy (fun x->x)
|> Seq.map (fun (x,l) -> (x,Seq.length l))
Let's assume I have only, say 1GB of RAM to work with. The full resulting map won't fit into RAM (nor can I construct it on the fly in RAM). So, of course we're going to have to generate the result on disk. What are some performant ways for generating the result?
One approach I have tried is partitioning the range of input values and computing the counts within each partition via multiple passes over the data. This works fine but I wonder if I could accomplish it faster in a single pass.
One last note is that the frequencies are power-law distributed, i.e. most of the items appear only once or twice, but a very small number of items might have counts over 100k or 1M. This suggests possibly maintaining some sort of LRU map where common items are held in RAM and uncommon items are dumped to disk.
F# is my preferred language but I'm ok working with something else to get the job done.
If you have enough disk space for a copy of the input data, then your multiple passes idea really requires only two. On the first pass, read an element x and append it to a temporary file hash(x) % k, where k is the number of shards (use just enough to make the second pass possible). On the second pass, for each temporary file, use main memory to compute the histogram of that file and append that histogram to the output. Relative to the size of your data, one gigabyte of main memory should be enough buffer space that the cost will be approximately the cost of reading and writing your data twice.
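A Python sketch of that two-pass scheme (the asker prefers F#, but the structure is the same in any language; the one-value-per-line input format, the shard count, and the paths are assumptions for illustration):

import collections, os

def histogram_two_pass(in_path, out_path, k=64, tmp_dir="shards"):
    os.makedirs(tmp_dir, exist_ok=True)
    shards = [open(os.path.join(tmp_dir, f"shard_{i}.txt"), "w") for i in range(k)]

    # Pass 1: route each value x to temporary file hash(x) % k. Equal values always
    # land in the same shard, so each shard's histogram can be built independently.
    with open(in_path) as f:
        for line in f:
            shards[hash(int(line)) % k].write(line)
    for s in shards:
        s.close()

    # Pass 2: each shard is small enough to count in RAM; append its counts to the output.
    with open(out_path, "w") as out:
        for i in range(k):
            counts = collections.Counter()
            with open(os.path.join(tmp_dir, f"shard_{i}.txt")) as s:
                for line in s:
                    counts[int(line)] += 1
            for value, n in counts.items():
                out.write(f"{value}\t{n}\n")

Choose k just large enough that the largest shard's distinct values fit in the available RAM; the total cost is roughly two sequential reads and writes of the data.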

Time Complexity/Cost of External Merge Sort

I got this from a link which talks about external merge sort.
From slide 6, example: with 5 buffer pages, sort a 108-page file:
Pass 0: ceil(108/5) = 22 sorted runs of 5 pages each (last run only 3 pages)
Pass 1: ceil(22/4) = 6 sorted runs of 20 pages each (last run only 8 pages)
Pass 2: ceil(6/4) = 2 sorted runs, 80 pages and 28 pages
Pass 3: ceil(2/4) = 1 sorted file of 108 pages
Question: My understanding is in external merge sort, in pass 0 you create chunks and then sort each chunk. In remaining passes you keep merging them.
So, applying that to the above example, since we have only 5 buffer pages, it's clear that in pass 0 we need 22 sorted runs of 5 pages each.
Now, why are we talking about sorted runs for the remaining passes instead of merging?
How come it says for pass 1 there are 6 sorted runs of 20 pages each when we have only 5 buffer pages?
Where exactly is the merge happening here? And how does N reduce in each pass, i.e. from 108 to 22 to 6 to 2?
External Merge Sort is necessary when you cannot store all the data into memory. The best you can do is break the data into sorted runs and merge the runs in subsequent passes. The length of a run is tied to your available buffer size.
Pass0: you are doing the operations IN PLACE. So you load 5 pages of data into the buffers and then sort it in place using an in place sorting algorithm.
These 5 pages will be stored together as a run.
Following passes: you can no longer do the operations in place, since you're merging runs of many pages. 4 buffers are loaded with input pages and the 5th is the write buffer. The merging is identical to the merge step of merge sort, but you are dividing and conquering by a factor of B-1 instead of 2. When the write buffer fills, it is written to disk and the next output page is started.
Complexity:
When analyzing the complexity of external merge sort, the number of I/Os is what is being counted. In each pass you must read and write every page. Let N be the number of pages; each pass then costs 2N I/Os: read each page, write each page.
Let B be the number of pages you can hold in buffer space and N be the number of pages.
There will be 1 + ceil(log_{B-1}(ceil(N/B))) passes (the initial run-formation pass plus the merge passes). Each pass has 2N I/Os, so the total cost is O(N log N).
In each pass, the page length of a run is increasing by a factor of B-1, and the number of sorted runs is decreasing by a factor of B-1.
Pass0: ceil(108 / 5) = 22, 5 pages per run
Pass1: ceil(22 / 4) = 6, 20 pages per run
Pass2: ceil(6 / 4) = 2, 80 pages per run (last run only 28 pages)
Pass3: ceil(2 / 4) = 1 - done, 1 sorted run of 108 pages
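The run counts above can be reproduced mechanically (a small sketch; only the N = 108, B = 5 figures come from the example):

import math

def runs_per_pass(n_pages, buffers):
    runs = math.ceil(n_pages / buffers)          # pass 0: sort B pages at a time
    yield runs
    while runs > 1:
        runs = math.ceil(runs / (buffers - 1))   # each merge pass is (B-1)-way
        yield runs

print(list(runs_per_pass(108, 5)))   # [22, 6, 2, 1] -> 4 passes in total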
A. Since it NEVER mentions merging, I'd assume (hope) that the later "sorting" passes are doing merges.
B. Again, assuming this is merging, you need one buffer to save the merged records, and use one of the remaining buffers for each file being merged: thus, 4 input files, each w/ 5 pages: 20 pages.
C. Think I've answered where merge is twice now :)
