Understanding merging on the reduce side in Hadoop - hadoop

I have a problem understanding the file-merging process on the reduce side in Hadoop as it is described in "Hadoop: The Definitive Guide" (Tom White). Citing it:
When all the map outputs have been copied, the reduce task moves into
the sort phase (which should properly be called the merge phase, as
the sorting was carried out on the map side), which merges the map
outputs, maintaining their sort ordering. This is done in rounds. For
example, if there were 50 map outputs and the merge factor was 10 (the
default, controlled by the io.sort.factor property, just like in the
map’s merge), there would be five rounds. Each round would merge 10
files into one, so at the end there would be five intermediate files.
Rather than have a final round that merges these five files into a
single sorted file, the merge saves a trip to disk by directly feeding
the reduce function in what is the last phase: the reduce phase. This
final merge can come from a mixture of in-memory and on-disk segments.
The number of files merged in each round is actually more subtle than
this example suggests. The goal is to merge the minimum number of
files to get to the merge factor for the final round. So if there were
40 files, the merge would not merge 10 files in each of the four
rounds to get 4 files. Instead, the first round would merge only 4
files, and the subsequent three rounds would merge the full 10 files.
The 4 merged files and the 6 (as yet unmerged) files make a total of
10 files for the final round. The process is illustrated in Figure
6-7. Note that this does not change the number of rounds; it’s just an
optimization to minimize the amount of data that is written to disk,
since the final round always merges directly into the reduce.
In the second example (with 40 files) we really do get to the merge factor for the final round: in the 5th round the 10 files are not written to disk, they go directly to reduce. But in the first example there are really 6 rounds, not 5: in each of the first five rounds 10 files are merged and written to disk, and then in the 6th round we have 5 files (not 10!) that go directly to reduce. Why? If we adhere to "The goal is to merge the minimum number of files to get to the merge factor for the final round", then for these 50 files we should merge 5 files in the first round, then 10 files in each of the subsequent 4 rounds, and then we would reach the merge factor of 10 for the final 6th round.
Take into account that we can't merge more than 10 files in each round (as specified by io.sort.factor for both of these examples).
What am I understanding wrongly in the first example with 50 files merged?

This is what I understood. If you read carefully, the important thing to remember is:
Note that this does not change the number of rounds; it’s just an optimization to minimize the amount of data that is
written to disk, since the final round always merges directly into the reduce.
With or without the optimization, the number of merge rounds remains the same (5 in the first case and 4 in the second).
First case: 50 files are merged into a final 5, and those 5 are fed directly into the "reduce" phase (total rounds: 5 + 1 = 6).
Second case: 34 files are merged into a final 4, and the remaining 6 are read directly from memory and fed into the "reduce" phase (total rounds: 4 + 1 = 5).
In both cases, the number of merge rounds is determined by the configuration property mapreduce.task.io.sort.factor, which is set to 10.
So the number of merge rounds does not change (whether the optimization is done or not). But the number of files merged in each round can change, because the Hadoop framework can introduce optimizations to reduce the amount of merging and hence the number of spills to disk.
So, in the first case, without the optimization, the contents of 50 files (merged into a final 5 files) are spilled to disk, and these files are read back from disk during the "reduce" phase.
In the second case, with the optimization, the contents of 34 files (merged into a final 4 files) are spilled to disk and read back from disk, while the remaining 6 unmerged files are read directly from the in-memory buffer during the "reduce" phase.
The idea of the optimization is to minimize merging and spilling.
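To make the round arithmetic concrete, here is a minimal Java sketch (illustrative only: the class and method names are made up, and this is not Hadoop's actual Merger code) of the schedule implied by the book's rule of merging the minimum number of files first so that the final round gets exactly the merge factor:

    // Illustrative sketch only (made-up names, not Hadoop's Merger code).
    // Derives the merge schedule implied by the book's rule: merge the minimum
    // number of files in the first round so that the final round gets exactly
    // `factor` inputs and can feed reduce directly.
    public class MergeRounds {
        static void schedule(int numFiles, int factor) {
            if (numFiles <= factor) {
                System.out.println(numFiles + " files: no intermediate rounds, final merge feeds reduce directly");
                return;
            }
            // each intermediate round removes at most (factor - 1) files from the pile
            int rounds = (int) Math.ceil((numFiles - factor) / (double) (factor - 1));
            // size of the first round so that exactly `factor` files remain for the final round
            int firstRound = (numFiles - factor) - (rounds - 1) * (factor - 1) + 1;
            System.out.println(numFiles + " files, factor " + factor + ": " + rounds
                    + " intermediate round(s); first round merges " + firstRound
                    + " files, the rest merge " + factor + " each");
        }

        public static void main(String[] args) {
            schedule(50, 10); // 5 intermediate rounds (the book's simple example merges 10 per round; same round count)
            schedule(40, 10); // 4 intermediate rounds; first round merges 4 files, as in the book's second example
        }
    }

Running it reports 5 intermediate rounds for 50 files and 4 for 40 files, matching the round counts above; the rule only changes how many files go into the first round, never the number of rounds.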

Related

Achieving interactive large-dataset map-reduce on AWS/GCE in the least lines of code / script

I have 1 billion rows of data (about 400GB uncompressed; about 40GB compressed) that I would like to process in map-reduce style, and I have two executables (binaries, not scripts) that can handle the "map" and "reduce" steps. The "map" step can process about 10,000 rows per second, per core, and its output is approximately 1MB in size, regardless of the size of its input. The "reduce" step can process about 50MB / second (excluding IO latency).
Assume that I can pre-process the data once, to do whatever I'd like such as compress it, break it into pieces, etc. For simplicity, assume input is plain text and each row terminates with a newline and each newline is a row terminator.
Once that one-time pre-processing is complete, the goal is to be able to execute a request within 30 seconds. So, if my only bottleneck is the map job (which I don't know will really be true -- it could very well be the IO), and assuming I can do all the reduce jobs in under 5 seconds, then I would need about 425 8-core computers, all processing different parts of the input data, to complete the run in time.
Assuming you have the data, and the two map/reduce executables, and you have unlimited access to AWS or GCE, what is a solution to this problem that I can implement with the fewest lines of code and/or script (and not ignoring potential IO or other non-CPU bottlenecks)?
(As an aside, it would be interesting to also know what would execute with the fewest nodes, if different from the solution with the fewest SLOC.)

Merge sorted files effectively?

I have n files, 50 <= n <= 100, that contain sorted integers, all of them the same size, 250MB or 500MB.
e.g.
1st file: 3, 67, 123, 134, 200, ...
2nd file: 1, 12, 33, 37, 94, ...
3rd file: 11, 18, 21, 22, 1000, ...
I am running this on a 4-core machine and the goal is to merge the files as soon as possible.
Since the total size can reach 50GB I can't read them into RAM.
So far I tried to do the following:
1) Read a number from every file, and store them in an array.
2) Find the lowest number.
3) Write that number to the output.
4) Read one number from the file from which the lowest number came (if that file is not empty)
Repeat steps 2-4 till we have no numbers left.
Reading and writing is done using buffers of 4MB.
My algorithm above works correctly, but it's not performing as fast as I want. The biggest issue is that it performs much worse with 100 files x 250MB than with 50 files x 500MB.
What is the most efficient merge algorithm in my case?
Well, you can first significantly improve efficiency by doing step (2) of your algorithm more smartly. Instead of doing a linear search over all the numbers, use a min-heap; insertion and deletion of the minimal value from the heap are done in logarithmic time, so it improves the speed for a large number of files. This changes the time complexity to O(n log k), over the naive O(n*k) (where n is the total number of elements and k is the number of files).
In addition, you need to minimize the number of "random" reads from files, because a few big sequential reads are much faster than many small random reads. You can do that by increasing the buffer size, for example (the same goes for writing).
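As a rough illustration, here is a minimal sketch of that heap-based k-way merge in Java (it assumes one integer per line in each input file; the file names and buffer sizes are just examples, and error handling is omitted):

    import java.io.*;
    import java.util.*;

    // Minimal sketch of the heap-based k-way merge (assumes one integer per line
    // in each input file; file names and buffer sizes are just examples).
    public class KWayMerge {
        public static void main(String[] args) throws IOException {
            List<String> inputs = Arrays.asList("part1.txt", "part2.txt", "part3.txt");
            List<BufferedReader> readers = new ArrayList<>();
            // heap entries: {value, index of the reader it came from}
            PriorityQueue<long[]> heap = new PriorityQueue<>(Comparator.comparingLong((long[] e) -> e[0]));

            for (int i = 0; i < inputs.size(); i++) {
                BufferedReader r = new BufferedReader(new FileReader(inputs.get(i)), 4 << 20); // 4MB buffer
                readers.add(r);
                String line = r.readLine();
                if (line != null) heap.add(new long[]{Long.parseLong(line.trim()), i});
            }

            try (BufferedWriter out = new BufferedWriter(new FileWriter("merged.txt"), 4 << 20)) {
                while (!heap.isEmpty()) {
                    long[] smallest = heap.poll();                      // O(log k) per element
                    out.write(Long.toString(smallest[0]));
                    out.newLine();
                    String next = readers.get((int) smallest[1]).readLine();
                    if (next != null) heap.add(new long[]{Long.parseLong(next.trim()), smallest[1]});
                }
            }
            for (BufferedReader r : readers) r.close();
        }
    }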
(Java) Use GZIPInputStream and GZIPOutputStream for the .gz compression; that may help to some extent. Use fast rather than high compression.
Disk head movement across the several files should then be reduced, for example by merging files two at a time into larger sequences.
For repeated values, maybe use run-length encoding: instead of repeating a value, store a repetition count, e.g. 11 12 13#7 15.
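A minimal sketch of the GZIP part (the class names are the real ones in java.util.zip, but the helper and file names are made up; picking the compression level needs a small subclass because GZIPOutputStream does not expose it directly):

    import java.io.*;
    import java.util.zip.*;

    // Sketch: open/create the runs as .gz streams with big buffers and fast compression.
    public class GzipRuns {
        static BufferedReader openRun(String path) throws IOException {
            return new BufferedReader(new InputStreamReader(
                    new GZIPInputStream(new FileInputStream(path), 1 << 16)));
        }

        static BufferedWriter createRun(String path) throws IOException {
            OutputStream gz = new GZIPOutputStream(new FileOutputStream(path), 1 << 16) {
                { def.setLevel(Deflater.BEST_SPEED); }   // fast rather than high compression
            };
            return new BufferedWriter(new OutputStreamWriter(gz));
        }
    }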
An effective way to utilize the multiple cores might be to perform input and output in distinct threads from the main comparison thread, in such a way that all the cores are kept busy and the main thread never unnecessarily blocks on input or output. One thread performing the core comparison, one writing the output, and NumCores-2 processing input (each from a subset of the input files) to keep the main thread fed.
The input and output threads could also perform stream-specific pre- and post-processing - for example, depending on the distribution of the input data, a run-length encoding scheme of the type alluded to by @Joop might provide a significant speedup of the main thread by allowing it to efficiently order entire ranges of input.
Naturally all of this increases complexity and the possibility of error.
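For what it's worth, the rough shape of that pipeline might look like the sketch below (names and the readNextFromFile/writeToOutputFile hooks are made up; the heap merge itself is elided). The point is only the thread/queue layout: bounded queues keep the comparison thread fed without unbounded memory growth.

    import java.util.*;
    import java.util.concurrent.*;

    // Sketch of the threading layout only (names and the readNextFromFile /
    // writeToOutputFile hooks are made up). Each input thread feeds a bounded
    // queue so the comparison thread never blocks on disk; a separate thread
    // drains the output queue to disk.
    public class PipelinedMerge {
        static final long EOF = Long.MIN_VALUE;          // sentinel: end of a stream

        public static void main(String[] args) throws InterruptedException {
            final int k = 4;                             // example: 4 input files
            List<BlockingQueue<Long>> inputs = new ArrayList<>();
            BlockingQueue<Long> output = new ArrayBlockingQueue<>(1 << 16);
            ExecutorService pool = Executors.newFixedThreadPool(k + 1);

            for (int i = 0; i < k; i++) {
                BlockingQueue<Long> q = new ArrayBlockingQueue<>(1 << 16);
                inputs.add(q);
                final int fileIndex = i;
                pool.submit(() -> {                      // input thread: parse its file, enqueue values
                    try {
                        long v;
                        while ((v = readNextFromFile(fileIndex)) != EOF) q.put(v);
                        q.put(EOF);
                    } catch (InterruptedException ignored) { }
                });
            }

            pool.submit(() -> {                          // output thread: drain merged values to disk
                try {
                    long v;
                    while ((v = output.take()) != EOF) writeToOutputFile(v);
                } catch (InterruptedException ignored) { }
            });

            // The main thread would run the heap-based k-way merge here,
            // taking from `inputs` and putting results on `output`,
            // finishing with output.put(EOF).
            output.put(EOF);
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
        }

        // placeholder hooks standing in for the real readers/writer
        static long readNextFromFile(int fileIndex) { return EOF; }
        static void writeToOutputFile(long v) { }
    }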

Sorting Data with Space Constraints

Now the problem is pretty simple. You have 1000 MB of data and you have to sort it, but you only have 100 MB of space to sort it in. (Let's say the 1000 MB is stored on disk and you have only 100 MB of RAM to sort the data with; at any time you can only have 100 MB of data in RAM.)
Now I came up with this solution:
Divide the data into 10 parts - 100 MB each and sort it using Quick Sort.
Then write all the chunks of data into the Hard Drive.
Now pick the first 10 MB from each chunk and merge them. Now you have 100 MB; set this 100 MB aside.
Now do the same thing. Pick the next 10 MB from each chunk and merge.
Keep doing this and then concatenate the data.
Now the problem I'm facing is that since we're separately merging 100 MB each time, when we concatenate we will be making mistakes. (These 100 MB chunks should also be merged together.)
How can I solve this problem?
External merge sorting differs from the internal version in the merge phase. In your question you are trying to directly apply the internal merge algorithm to an external merge.
Hence you end up with having to merge two chunks of size n / 2. As you correctly note, that won't work because you run out of memory.
Let's assume that you have enough memory to sort 1/k th of all elements. This leaves you with k sorted lists. Instead of merging two lists, you merge all k at once:
Pick an output buffer size; a good value seems to be half of your available memory. This leaves the other half for input buffering, i.e. m = memory / (2k) elements per subchunk.
Read the first m elements from each subchunk. All memory is now used and you have the lowest m elements from each subchunk in memory.
From each of the k input buffers, choose the lowest value. Append this value into the output buffer. Repeat until one of your input buffers runs out.
If one of your input buffers runs out, read the next m elements from the subchunk on disk. If there are no more elements, you are done with the subchunk.
If the output buffer is full, append it to the output file and reset its position to the start.
Rinse and repeat from (3) until all subchunks run out.
The output file now is sorted.
You can think of the input buffers as a set of streaming buffers on sorted buckets. Check the streams and pick the best (i.e. lowest) element from all of them and save that one to the output list. From the outside it is a stream merge with smart prefetch and output buffering.
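A minimal sketch of those steps (assuming the sorted subchunks are text files with one integer per line; the file names and buffer sizes are examples, not tuned values):

    import java.io.*;
    import java.util.*;

    // Sketch of the buffered k-way merge described above (assumes the sorted
    // subchunks are text files with one integer per line; names and buffer
    // sizes are examples, not tuned values).
    public class BufferedKWayMerge {
        static final int M = 1024;                  // elements per input buffer ("m")
        static final int OUT = 4 * 1024;            // elements in the output buffer

        public static void main(String[] args) throws IOException {
            String[] chunkFiles = {"chunk0.txt", "chunk1.txt", "chunk2.txt"};  // example names
            int k = chunkFiles.length;
            BufferedReader[] readers = new BufferedReader[k];
            List<ArrayDeque<Long>> inBuf = new ArrayList<>();      // step 2: one input buffer per subchunk
            for (int i = 0; i < k; i++) {
                readers[i] = new BufferedReader(new FileReader(chunkFiles[i]));
                inBuf.add(new ArrayDeque<>());
                refill(readers[i], inBuf.get(i));
            }

            long[] outBuf = new long[OUT];
            int outLen = 0;
            try (BufferedWriter out = new BufferedWriter(new FileWriter("sorted.txt"))) {
                while (true) {
                    int best = -1;                                 // step 3: pick the lowest head value
                    for (int i = 0; i < k; i++) {
                        if (inBuf.get(i).isEmpty()) continue;
                        if (best == -1 || inBuf.get(i).peek() < inBuf.get(best).peek()) best = i;
                    }
                    if (best == -1) break;                         // step 6: all subchunks exhausted
                    outBuf[outLen++] = inBuf.get(best).poll();
                    if (inBuf.get(best).isEmpty()) refill(readers[best], inBuf.get(best));  // step 4
                    if (outLen == OUT) {                           // step 5: flush a full output buffer
                        flush(out, outBuf, outLen);
                        outLen = 0;
                    }
                }
                flush(out, outBuf, outLen);                        // write whatever is left
            }
            for (BufferedReader r : readers) r.close();
        }

        static void refill(BufferedReader r, ArrayDeque<Long> buf) throws IOException {
            String line;
            while (buf.size() < M && (line = r.readLine()) != null) buf.add(Long.parseLong(line.trim()));
        }

        static void flush(BufferedWriter out, long[] buf, int len) throws IOException {
            for (int i = 0; i < len; i++) { out.write(Long.toString(buf[i])); out.newLine(); }
        }
    }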
You need to repeat steps 3 and 4 N-1 times.

Time Complexity/Cost of External Merge Sort

I got this from a link which talks about external merge sort.
From slide 6, Example: with 5 buffer pages, to sort a 108-page file:
Pass 0: [108/5] = 22 sorted runs of 5 pages each (last run only 3 pages)
Pass 1: [22/4] = 6 sorted runs of 20 pages each (last run only 8 pages)
Pass 2: [6/4] = 2 sorted runs, 80 pages and 28 pages
Pass 3: [2/4] = 1 sorted file of 108 pages
Question: My understanding is that in external merge sort, in pass 0 you create chunks and sort each chunk, and in the remaining passes you keep merging them.
So, applying that to the above example, since we have only 5 buffer pages, in pass 0 it's clear we need 22 sorted runs of 5 pages each.
Now, why are we doing sorted runs for the remaining passes instead of merging?
How come it says for pass 1 there are 6 sorted runs of 20 pages each, when we have only 5 buffer pages?
Where exactly is the merge happening here? And how is N reducing in each pass, i.e. from 108 to 22 to 6 to 2?
External Merge Sort is necessary when you cannot store all the data into memory. The best you can do is break the data into sorted runs and merge the runs in subsequent passes. The length of a run is tied to your available buffer size.
Pass 0: you are doing the operations IN PLACE. So you load 5 pages of data into the buffers and then sort them in place using an in-place sorting algorithm.
These 5 pages will be stored together as a run.
Following passes: you can no longer do the operations in place since you're merging runs of many pages. 4 pages are loaded into the buffers and the 5th is the write buffer. The merging is identical to the merge sort algorithm, but you will be dividing and conquering by a factor of B-1 instead of 2. When the write buffer is filled, it is written to disk and the next page is started.
Complexity:
When analyzing the complexity of external merge sort, the number of I/Os is what is being considered. In each pass, you must read and write every page. Let N be the number of pages; each pass then costs 2N I/Os (read each page, write each page).
Let B be the number of pages you can hold in buffer space and N be the number of pages.
There will be ceil(log_(B-1)(ceil(N/B))) merge passes plus the initial sort pass (pass 0). Each pass has 2N I/Os, so the total cost is O(N log N).
In each pass, the page length of a run increases by a factor of B-1, and the number of sorted runs decreases by a factor of B-1.
Pass 0: ceil(108/5) = 22 runs, 5 pages per run
Pass 1: ceil(22/4) = 6 runs, 20 pages per run
Pass 2: ceil(6/4) = 2 runs, 80 pages per run
Pass 3: ceil(2/4) = 1 run of 108 pages - done
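A tiny sketch (illustrative only) that works this example out from the formula above:

    // Illustrative only: pass count and I/O cost of external merge sort,
    // using the formula above (1 sort pass + ceil(log_(B-1)(ceil(N/B))) merge passes).
    public class ExternalSortCost {
        static int passes(int n, int b) {
            int runs = (int) Math.ceil(n / (double) b);                    // runs after pass 0
            int mergePasses = (int) Math.ceil(Math.log(runs) / Math.log(b - 1));
            return 1 + mergePasses;
        }

        public static void main(String[] args) {
            int n = 108, b = 5;
            int p = passes(n, b);                                          // 1 + ceil(log_4(22)) = 4
            System.out.println("passes = " + p + ", total I/Os = " + 2 * n * p);  // 4, 864
        }
    }

For N = 108 and B = 5 it prints 4 passes and 864 total I/Os.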
A. Since it NEVER mentions merging, I'd assume (hope) that the later "sorting" passes are doing merges.
B. Again, assuming this is merging, you need one buffer to save the merged records, and use one of the remaining buffers for each file being merged: thus, 4 input files, each w/ 5 pages: 20 pages.
C. Think I've answered where merge is twice now :)

external sorting

In this web page: CS302 --- External Sorting
Merge the resulting runs together into successively bigger runs, until the file is sorted.
As I quoted, how can we merge the resulting runs together? We don't have that much memory.
Imagine you have the numbers 1 - 9
9 7 2 6 3 4 8 5 1
And let's suppose that only 3 fit in memory at a time.
So you'd break them into chunks of 3 and sort each, storing each result in a separate file:
2 7 9
3 4 6
1 5 8
Now you'd open each of the three files as streams and read the first value from each:
2 3 1
Output the lowest value 1, and get the next value from that stream, now you have:
2 3 5
Output the next lowest value 2, and continue onwards until you've outputted the entire sorted list.
If you process two runs A and B into some larger run C, you can do this line by line, generating progressively larger runs while still only reading at most 2 lines at a time. Because the process is iterative, and because you're working on streams of data rather than full copies of the data, you don't need to worry about memory usage. On the other hand, disk access might make the whole process slow -- but it sure beats not being able to do the work in the first place.
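A minimal sketch of that two-run merge (assuming one integer per line; the file names are examples):

    import java.io.*;

    // Sketch: merge two sorted runs A and B into a larger run C, keeping only
    // one pending line from each run in memory (assumes one integer per line;
    // file names are examples).
    public class TwoRunMerge {
        public static void main(String[] args) throws IOException {
            try (BufferedReader a = new BufferedReader(new FileReader("runA.txt"));
                 BufferedReader b = new BufferedReader(new FileReader("runB.txt"));
                 BufferedWriter c = new BufferedWriter(new FileWriter("runC.txt"))) {
                String la = a.readLine(), lb = b.readLine();
                while (la != null && lb != null) {
                    if (Long.parseLong(la.trim()) <= Long.parseLong(lb.trim())) {
                        c.write(la.trim()); c.newLine(); la = a.readLine();
                    } else {
                        c.write(lb.trim()); c.newLine(); lb = b.readLine();
                    }
                }
                // one run is exhausted; copy the remainder of the other
                while (la != null) { c.write(la.trim()); c.newLine(); la = a.readLine(); }
                while (lb != null) { c.write(lb.trim()); c.newLine(); lb = b.readLine(); }
            }
        }
    }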
