CS302 --- External Sorting
Merge the resulting runs together into successively bigger runs, until the file is sorted.
As quoted above, how can we merge the resulting runs together? We don't have that much memory.
Imagine you have the numbers 1-9:
9 7 2 6 3 4 8 5 1
And let's suppose that only 3 fit in memory at a time.
So you'd break them into chunks of 3 and sort each, storing each result in a separate file:
2 7 9
3 4 6
1 5 8
Now you'd open each of the three files as streams and read the first value from each:
2 3 1
Output the lowest value, 1, and read the next value from that stream; now you have:
2 3 5
Output the next lowest value, 2, and continue until you've output the entire sorted list.
If you merge two runs A and B into some larger run C, you can do this line by line, generating progressively larger runs while still reading at most 2 lines at a time. Because the process is iterative and because you're working on streams of data rather than whole files, you don't need to worry about memory usage. On the other hand, disk access might make the whole process slow -- but it sure beats not being able to do the work in the first place.
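To make that concrete, here is a minimal Python sketch of the k-way merge over sorted run files; the one-number-per-line file layout, the function name merge_runs, and the use of heapq.merge are my own choices for the example, not anything prescribed above.

import heapq

def merge_runs(run_paths, out_path):
    # k-way merge of already-sorted run files into one sorted output.
    # Only the current head of each run is held in memory at any time.
    files = [open(p) for p in run_paths]
    # Each open file yields its values in sorted order, one line at a time.
    streams = [(int(line) for line in f) for f in files]
    with open(out_path, "w") as out:
        # heapq.merge lazily picks the smallest head value across all streams.
        for value in heapq.merge(*streams):
            out.write(f"{value}\n")
    for f in files:
        f.close()

# e.g. after writing the sorted chunks above to run files holding
# 2 7 9, 3 4 6 and 1 5 8 (one number per line):
# merge_runs(["run0.txt", "run1.txt", "run2.txt"], "sorted.txt")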
Let's call the amount of RAM available R.
We have an unsorted file of 10 gigs with one column of keys (duplicates allowed).
You split the file into k files, each of which has size R.
You sort each file and write the file to disk.
You read R / (k + 1) gigs from each file into input buffers. You perform a k-way merge where you look at the front key of each input buffer and compare across all of them to find the minimum. You add this to your output buffer, which also holds R / (k + 1) gigs of data.
Once the output buffer is full, write it to disk to a final sorted file.
Repeat this process until all k files have been fully read. If an input buffer is empty, fill it with the next R / (k + 1) gigs of its corresponding file until the file has been entirely read. We can do this buffer refilling in parallel.
What is the official name for this algorithm? Is it a k-way merge sort?
The first part, where we split into k files and sort each, is O((n / k) log(n / k)) per file.
The second part, where we merge, is O(nk)?
If I am wrong, can I have an explanation? If this is external merge sort, how do we optimize this further?
This is a textbook external merge sort. Time complexity: O(n log n).
Here's Wikipedia's entry on it (linked above):
One example of external sorting is the external merge sort algorithm, which sorts chunks that each fit in RAM, then merges the sorted chunks together. For example, for sorting 900 megabytes of data using only 100 megabytes of RAM:
1) Read 100 MB of the data in main memory and sort by some conventional method, like quicksort.
2) Write the sorted data to disk.
3) Repeat steps 1 and 2 until all of the data is in sorted 100 MB chunks (there are 900 MB / 100 MB = 9 chunks), which now need to be merged into one single output file.
4) Read the first 10 MB (= 100 MB / (9 chunks + 1)) of each sorted chunk into input buffers in main memory and allocate the remaining 10 MB for an output buffer. (In practice, it might provide better performance to make the output buffer larger and the input buffers slightly smaller.)
5) Perform a 9-way merge and store the result in the output buffer. Whenever the output buffer fills, write it to the final sorted file and empty it. Whenever any of the 9 input buffers empties, fill it with the next 10 MB of its associated 100 MB sorted chunk until no more data from the chunk is available.
This is the key step that makes external merge sort work externally -- because the merge algorithm only makes one pass sequentially through each of the chunks, each chunk does not have to be loaded completely; rather, sequential parts of the chunk can be loaded as needed. Historically, instead of a sort, sometimes a replacement-selection algorithm was used to perform the initial distribution, to produce on average half as many output chunks of double the length.
I'd say it's a merge algorithm; the exact file I/O is an implementation detail.
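For illustration, those steps can be sketched in Python for an input file with one integer per line; the function name external_sort, the line-count memory limit, the temp-file handling, and the use of heapq.merge are my own simplifications, not part of the quoted description.

import heapq
import os
import tempfile
from itertools import islice

def external_sort(in_path, out_path, max_lines_in_memory=1_000_000):
    # Sketch of external merge sort: sort chunks that fit in memory,
    # write each chunk as a sorted run file, then k-way merge the runs.
    run_paths = []
    # Steps 1-3: read a chunk that fits in memory, sort it, write it out.
    with open(in_path) as f:
        while True:
            chunk = [int(line) for line in islice(f, max_lines_in_memory)]
            if not chunk:
                break
            chunk.sort()
            fd, path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w") as run:
                run.writelines(f"{v}\n" for v in chunk)
            run_paths.append(path)
    # Steps 4-5: k-way merge of all runs; the buffered file objects play
    # the role of the 10 MB input buffers in the quoted example.
    runs = [open(p) for p in run_paths]
    with open(out_path, "w") as out:
        out.writelines(f"{v}\n" for v in heapq.merge(*[(int(line) for line in r) for r in runs]))
    for r in runs:
        r.close()
        os.remove(r.name)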
Now the problem is pretty simple: you have 1000 MB of data and you have to sort it, but you only have 100 MB of space to sort the data in. (Let's say the 1000 MB is stored on disk and you have only 100 MB of RAM to sort the data with, so at any time you can only have 100 MB of data in RAM.)
Now I came up with this solution:
Divide the data into 10 parts of 100 MB each and sort each part using quicksort.
Then write all the chunks of data into the Hard Drive.
Now pick the first 10 MB from each chunk and merge them. Now you have 100 MB; keep this 100 MB separate.
Now do the same thing. Pick the next 10 MB from each chunk and merge.
Keep doing this and then concatenate the data.
Now the problem I'm facing is that since we're merging 100 MB separately each time, we will be making mistakes when we concatenate. (These 100 MB pieces would also need to be merged with each other.)
How can I solve this problem?
External merge sorting differs from the internal version in the merge phase. In your question you are trying to directly apply the internal merge algorithm to an external merge.
Hence you end up having to merge two chunks of size n / 2. As you correctly note, that won't work because you run out of memory.
Let's assume that you have enough memory to sort 1/k th of all elements. This leaves you with k sorted lists. Instead of merging two lists, you merge all k at once:
1. Pick an output buffer size. A good value seems to be half of the available memory. This leaves the other half for input buffering, or m = memory / (2k) per subchunk.
2. Read the first m elements from each subchunk. All memory is now used and you have the lowest m elements of each subchunk in memory.
3. From each of the k input buffers, choose the lowest value. Append this value to the output buffer. Repeat until one of your input buffers runs out.
4. If one of your input buffers runs out, read the next m elements from its subchunk on disk. If there are no more elements, you are done with that subchunk.
5. If the output buffer is full, append it to the output file and reset its position to the start.
6. Rinse and repeat from (3) until all subchunks run out.
The output file now is sorted.
You can think of the input buffers as a set of streaming buffers on sorted buckets. Check the streams and pick the best (i.e. lowest) element from all of them and save that one to the output list. From the outside it is a stream merge with smart prefetch and output buffering.
You need to repeat steps 3 and 4 N-1 times.
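Here is a minimal Python sketch of those steps with explicit input and output buffers; the function name k_way_merge, the buffer sizes, and the one-integer-per-line file format are assumptions made for the example, standing in for memory / (2k) and memory / 2.

from itertools import islice

def k_way_merge(subchunk_paths, out_path, m=1024, out_buf_size=4096):
    # Merge k sorted subchunk files using explicit buffers:
    # up to m values per input buffer, out_buf_size values in the output buffer.
    files = [open(p) for p in subchunk_paths]
    buffers = [[] for _ in files]        # step 2: in-memory slice of each subchunk
    positions = [0] * len(files)         # read cursor within each input buffer
    out_buf = []

    def refill(i):
        # Step 4: pull the next m values of subchunk i from disk.
        buffers[i] = [int(line) for line in islice(files[i], m)]
        positions[i] = 0

    for i in range(len(files)):
        refill(i)

    with open(out_path, "w") as out:
        while True:
            # Step 3: find the input buffer whose current head value is smallest.
            best = None
            for i, buf in enumerate(buffers):
                if positions[i] < len(buf):
                    if best is None or buf[positions[i]] < buffers[best][positions[best]]:
                        best = i
            if best is None:             # step 6: every subchunk is exhausted
                break
            out_buf.append(buffers[best][positions[best]])
            positions[best] += 1
            if positions[best] == len(buffers[best]):
                refill(best)             # may come back empty: that subchunk is done
            if len(out_buf) >= out_buf_size:
                out.writelines(f"{v}\n" for v in out_buf)   # step 5: flush the output buffer
                out_buf.clear()
        out.writelines(f"{v}\n" for v in out_buf)           # write the final partial buffer

    for f in files:
        f.close()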
I have a problem understanding the file-merging process on the reduce side in Hadoop, as described in "Hadoop: The Definitive Guide" (Tom White). Quoting it:
When all the map outputs have been copied, the reduce task moves into
the sort phase (which should properly be called the merge phase, as
the sorting was carried out on the map side), which merges the map
outputs, maintaining their sort ordering. This is done in rounds. For
example, if there were 50 map outputs and the merge factor was 10 (the
default, controlled by the io.sort.factor property, just like in the
map’s merge), there would be five rounds. Each round would merge 10
files into one, so at the end there would be five intermediate files.
Rather than have a final round that merges these five files into a
single sorted file, the merge saves a trip to disk by directly feeding
the reduce function in what is the last phase: the reduce phase. This
final merge can come from a mixture of in-memory and on-disk segments.
The number of files merged in each round is actually more subtle than
this example suggests. The goal is to merge the minimum number of
files to get to the merge factor for the final round. So if there were
40 files, the merge would not merge 10 files in each of the four
rounds to get 4 files. Instead, the first round would merge only 4
files, and the subsequent three rounds would merge the full 10 files.
The 4 merged files and the 6 (as yet unmerged) files make a total of
10 files for the final round. The process is illustrated in Figure
6-7. Note that this does not change the number of rounds; it’s just an
optimization to minimize the amount of data that is written to disk,
since the final round always merges directly into the reduce.
In the second example (with 40 files) we really do get to the merge factor for the final round: in the 5th round, 10 files are not written to disk, they go directly to reduce. But in the first example there are really 6 rounds, not 5: in each of the first five rounds 10 files are merged and written to disk, and then in the 6th round we have 5 files (not 10!) that go directly to reduce. Why? If we adhere to "The goal is to merge the minimum number of files to get to the merge factor for the final round", then for these 50 files we should merge 5 files in the first round, then 10 files in each of the 4 subsequent rounds, and then we get to the merge factor of 10 for the final, 6th round.
Take into account that we can't merge more than 10 files in each round (as specified by io.sort.factor in both of these examples).
What do I understand wrongly in the first example, where 50 files are merged?
This is what I understood. If you read carefully, the important thing to remember is:
Note that this does not change the number of rounds; it’s just an optimization to minimize the amount of data that is
written to disk, since the final round always merges directly into the reduce.
With or without the optimization, the number of merge rounds remains the same (5 in the first case and 4 in the second case).
First case: 50 files are merged into a final 5, and these are then directly fed into the "reduce" phase (total rounds: 5 + 1 = 6).
Second case: 34 files are merged into a final 4, and the remaining 6 are read directly from memory and fed into the "reduce" phase (total rounds: 4 + 1 = 5).
In both cases, the number of merge rounds is determined by the configuration mapreduce.task.io.sort.factor, which is set to 10.
So the number of merge rounds does not change (whether or not the optimization is done). But the number of files merged in each round could change (because the Hadoop framework can introduce optimizations to reduce the number of merges and hence the number of spills to disk).
So, in the first case, without the optimization, the contents of 50 files (merged into a final 5 files) are spilled to disk and read back from disk during the "reduce" phase.
In the second case, with the optimization, the contents of 34 files (merged into a final 4 files) are spilled to disk and read back from disk, while the remaining 6 unmerged files are read directly from the in-memory buffer during the "reduce" phase.
The idea of the optimization is to minimize merging and spilling.
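For what it's worth, the scheduling rule quoted above ("merge the minimum number of files to get to the merge factor for the final round") can be reproduced with a few lines of Python. This is an illustrative reconstruction, not Hadoop's actual planner code, and merge_plan is a name I made up.

def merge_plan(num_files, factor=10):
    # Return the sizes of the intermediate merge rounds, assuming each
    # intermediate round merges up to `factor` files and the final round
    # feeds `factor` (or fewer) segments directly into the reduce.
    if num_files <= factor:
        return []                   # everything fits in the final round
    plan = []
    # The first round merges just enough files so that later rounds can merge
    # `factor` files each and exactly `factor` segments remain at the end.
    first = (num_files - 1) % (factor - 1) + 1
    if first > 1:
        plan.append(first)
        num_files -= first - 1      # `first` files become a single segment
    while num_files > factor:
        plan.append(factor)
        num_files -= factor - 1     # `factor` files become a single segment
    return plan

print(merge_plan(40))   # [4, 10, 10, 10] -> 4 merged + 6 unmerged = 10 segments for the final round
print(merge_plan(50))   # [5, 10, 10, 10, 10] -> the schedule the question expects, not the book's 5 x 10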
We developed a Spring Batch application in which we have two flows: 1. Forward, 2. Backward. We are only using file read/write; no DB is involved.
Forward scenario: The input file has records with 22 fields. The 22 fields are to be converted into 32 fields by doing some operations like sequence-number generation and adding a few filler fields. Based on country codes, the output is split into at most 3 files; each chunk has 250K records. (If there are millions of records, multiple files are generated for the same country.)
For 8 million records it is taking 36 minutes.
The 8 million records are in a single file.
We are using Spring Batch with 1000 threads.
Backward flow: The input file has 82 fields for each record. These 82 fields are to be converted into 86 fields. Two fields are added in between, taken from the forward flow's input file; the other fields are simply copied over. Error records also have to be written to an error file; the error records are just the actual input records that came in for the forward flow. To track them, we persist the sequence number and the actual record to a file; this is done in the forward flow itself. We take that persisted file in the backward flow and compare the sequence numbers; if anything is missing we write it to the error records as a key/value pair. This is done after the backward flow completes.
The maximum size of an input file is 250K records.
For 8 million records it is taking 1 hour 8 minutes, which is too slow.
There are 32 input files (each 250K records) in this flow.
No threads are used in the backward flow. I don't know how threading would behave here; I tried it, but the process hung.
Server Configurations:
12 CPU & 64 GB Linux Server.
Can you guys help improve the performance, given that we have 12 CPUs / 64 GB RAM?
You are already using 1000 threads, and that is a very high number. I have fine-tuned Spring Batch jobs, and this is what I have done:
1. Reduce network traffic - Try to reduce the number of calls to the database or file system in each process. Can you get all the info possible in one shot and save it in memory for the life of the thread? I have used org.apache.commons.collections.map.MultiKeyMap for storage and retrieval of data.
For example, in your case you need the sequence-number comparison, so get all sequence numbers into one map before you start the process. You can store the IDs (if not too many) in the step execution context.
2. Write less frequently - Accumulate the info you need to write for a while and then write it out at the end.
3. Set unused objects to null at the end of the process to expedite GC.
4. Check your GC frequency through VisualVM or JConsole. You should see frequent GC happening while your process is running, which means objects are being created and garbage-collected. If your memory graph keeps increasing, something is wrong.
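As a language-agnostic sketch of points 1 and 2 (load the sequence numbers once, compare in memory, and write the error records in a single pass), here is roughly the idea in Python; the file names, function names, and the comma-separated record layout are invented for the example and are not part of the batch job above.

def load_sequence_numbers(path):
    # Read the persisted forward-flow file once and index it by sequence number.
    index = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            seq, _, record = line.rstrip("\n").partition(",")
            index[seq] = record
    return index

def find_missing(forward_index, backward_path):
    # Compare the backward flow against the in-memory index; no per-record disk lookups.
    processed = set()
    with open(backward_path, encoding="utf-8") as f:
        for line in f:
            processed.add(line.split(",", 1)[0])
    # Error records are the forward-flow records whose sequence numbers never showed up.
    return {seq: rec for seq, rec in forward_index.items() if seq not in processed}

def write_errors_once(errors, error_path):
    # Buffer everything and write in a single pass instead of one write per record.
    with open(error_path, "w", encoding="utf-8") as out:
        out.writelines(f"{seq},{rec}\n" for seq, rec in sorted(errors.items()))

forward_index = load_sequence_numbers("forward_sequences.txt")
missing = find_missing(forward_index, "backward_output.txt")
write_errors_once(missing, "error_records.txt")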
I got this from a link which talks about external merge sort.
From slide 6 - Example: with 5 buffer pages, sort a 108-page file:
Pass 0: [108/5] = 22 sorted runs of 5 pages each (last run only 3 pages)
Pass 1: [22/4] = 6 sorted runs of 20 pages each (last run only 8 pages)
Pass 2: [6/3] = 2 sorted runs, 80 pages and 28 pages
Pass 3: [2/2] = 1 sorted file of 108 pages
Question: My understanding is that in external merge sort, in pass 0 you create chunks and sort each chunk, and in the remaining passes you keep merging them.
So, applying that to the above example, since we have only 5 buffer pages, it's clear that in pass 0 we need 22 sorted runs of 5 pages each.
Now, why are we producing sorted runs in the remaining passes instead of merging?
How come it says that pass 1 gives 6 sorted runs of 20 pages each when we have only 5 buffer pages?
Where exactly is the merge happening here? And how is N reducing in each pass, i.e. from 108 to 22 to 6 to 2?
External merge sort is necessary when you cannot store all the data in memory. The best you can do is break the data into sorted runs and merge the runs in subsequent passes. The length of a run is tied to your available buffer size.
Pass 0: you are doing the operations IN PLACE. So you load 5 pages of data into the buffers and then sort them in place using an in-place sorting algorithm.
These 5 pages will be stored together as a run.
Following passes: you can no longer do the operations in place, since you're merging runs of many pages. One page from each of 4 runs is loaded into the input buffers, and the 5th buffer is the write buffer. The merging is identical to the merge sort algorithm, but you will be dividing and conquering by a factor of B-1 instead of 2. When the write buffer is filled, it is written to disk and the next page is started.
Complexity:
When analyzing the complexity of external merge sort, the number of I/Os is what is being considered. Let N be the number of pages. In each pass, you must read and write every page, so each pass costs 2N I/Os.
Let B be the number of pages you can hold in buffer space and N be the number of pages in the file.
There will be 1 + ceil(log_{B-1}(ceil(N/B))) passes (the initial sort pass plus the merge passes). Each pass costs 2N I/Os, so the total is O(N log N).
In each merge pass, the length of a run (in pages) increases by a factor of B-1, and the number of sorted runs decreases by a factor of B-1.
Pass 0: ceil(108 / 5) = 22 runs, 5 pages per run
Pass 1: ceil(22 / 4) = 6 runs, 20 pages per run
Pass 2: ceil(6 / 4) = 2 runs, 80 pages per run
Pass 3: ceil(2 / 4) = 1 - done, 1 run of 108 pages
A. Since it NEVER mentions merging, I'd assume (hope) that the later "sorting" passes are doing merges.
B. Again, assuming this is merging, you need one buffer to save the merged records, and use one of the remaining buffers for each file being merged: thus, 4 input files, each w/ 5 pages: 20 pages.
C. Think I've answered where merge is twice now :)