How does Hadoop handle a very large individual split file?

Suppose you only have 1 GB of heap available for each mapper, but the block size is set to 10 GB and each split is 10 GB. How does the mapper read the large individual split?
Will the mapper buffer the input to disk and process the input split in a round-robin fashion?
Thanks!

The overall pattern of a mapper is quite simple:
while not end of split
    (key, value) = RecordReader.next()
    (keyOut, valueOut) = map(key, value)
    RecordWriter.write(keyOut, valueOut)
Usually the first two operations only care about the size of a single record. For example, when TextInputFormat is asked for the next line, it stores the bytes in a buffer until the next end of line is found, then the buffer is cleared, and so on.
The map implementation is up to you. If you don't store things in your mapper then you are fine. If you want it to be stateful, then you can be in trouble. Make sure that your memory consumption is bounded.
In the last step, the keys and values written by your mapper are stored in memory, where they are partitioned and sorted. If the in-memory buffer becomes full, its content is spilled to disk (it eventually will be anyway, because reducers need to be able to download the partition file even after the mapper has vanished).
So the answer to your question is: yes, it will be fine.
What could cause trouble is:
Large records (exponential buffer growth + memory copies => significant to insane memory overhead)
Storing data from the previous key/value in your mapper (see the sketch after this list)
Storing data from the previous key/value in your custom (Input|Output)Format implementation if you have one
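
To make the second point concrete, here is a minimal sketch (class and constant names are my own) of the classic in-mapper combining pattern with a hard cap on its state, so memory stays bounded no matter how large the split is:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class BoundedWordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final int MAX_ENTRIES = 100_000; // cap on in-mapper state
    private final Map<String, Integer> counts = new HashMap<>();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String word : line.toString().split("\\s+")) {
            if (word.isEmpty()) continue;
            counts.merge(word, 1, Integer::sum);
        }
        if (counts.size() > MAX_ENTRIES) {
            flush(context); // emit partial counts instead of growing forever
        }
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        flush(context); // emit whatever is left at the end of the split
    }

    private void flush(Context context)
            throws IOException, InterruptedException {
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
        }
        counts.clear();
    }
}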
If you want to learn more, here are a few entry points:
In Mapper.java you can see the while loop (reproduced after this list)
In LineRecordReader you can see how a line is read by a TextInputFormat
You most likely want to understand the spill mechanism because it impacts the performance of your jobs. See these Cloudera slides, for example. Then you will be able to decide on the best setting for your use case (large vs. small splits).
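
For reference, the while loop in the new-API Mapper.java is essentially the following (recent Hadoop versions additionally wrap the loop in a try/finally around cleanup):

public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
    cleanup(context);
}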

Related

MapReduce: does the reducer sort automatically?

There is something I am not clear about in the overall functioning of a MapReduce programming environment.
Consider having 1k random unsorted words in the form (word, 1) coming out of one (or more) mappers. Suppose that with the reducer I want to save them all into a single huge sorted file. How does it work? I mean, does the reducer itself sort all the words automatically? What should the reducer function do? What if I have just one reducer with limited RAM and disk?
By the time the reducer gets the data, the data has already been sorted on the map side.
The process is like this:
Map side:
1. Each InputSplit is processed by one map task, and the map output is temporarily placed in a circular in-memory buffer (this is part of the shuffle; the buffer is 100 MB by default, controlled by the io.sort.mb property). When the buffer is about to overflow (by default, at 80% of the buffer size), a spill file is created on the local file system.
2. Before writing to disk, the thread first divides the data into partitions, one per reduce task, so that each reduce task corresponds to the data of one partition; this avoids some reduce tasks being assigned large amounts of data while others get little or none. Within each partition, the data is sorted by key. If a Combiner is set, it is run on the sorted result.
3. When the map task outputs its last record, there may be many spill files, and these files need to be merged. Sorting and combining operations are performed continually during the merge, for two purposes: 1. to minimize the amount of data written to disk each time; 2. to minimize the amount of data transferred over the network during the next copy phase. The spills are finally merged into one partitioned and sorted file. To reduce the amount of data transmitted over the network, you can also compress the data here by setting mapred.compress.map.out to true (see the configuration sketch after this list).
4. The data in each partition is copied to the corresponding reduce task.
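
As a sketch of how these knobs are set programmatically (the answer uses the older property names; the mapreduce.* names below are their Hadoop 2.x equivalents, and the values shown are just the defaults):

import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
// sort buffer size, formerly io.sort.mb (100 MB by default)
conf.setInt("mapreduce.task.io.sort.mb", 100);
// start spilling when the buffer is 80% full
conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f);
// compress map output, formerly mapred.compress.map.out
conf.setBoolean("mapreduce.map.output.compress", true);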
Reduce side:
1. Reduce receives data from the different map tasks, and the data sent by each map is ordered. If the amount of data received on the reduce side is small enough, it is kept in memory; once it exceeds a certain proportion of the buffer size, it is merged and written to disk.
2. As the number of spilled files increases, a background thread merges them into larger, still-sorted files. In fact, on both the map side and the reduce side, MapReduce performs sorting and merging operations over and over.
3. The merge process generates many intermediate files (written to disk), but MapReduce keeps the data written to disk as small as possible, and the result of the last merge is not written to disk but fed directly into the reduce function.

When does a mapper store its output to its local hard disk?

I know that
The output of the Mapper (intermediate data) is stored on the local file system (not HDFS) of each individual mapper node. This is typically a temporary directory which the Hadoop administrator can set in the configuration. Once the Mapper job has completed, or the data has been transferred to the Reducer, this intermediate data is cleaned up and is no longer accessible.
But I wanted to know: when does a mapper store its output to its local hard disk? Is it because the data is too large to fit in memory, so only the data currently being processed remains in memory? If the data is small and the whole of it fits in memory, is there no disk involvement?
Can we not move the data directly from the mapper to the reducer, once it is processed, without involving the mapper machine's hard disk? I mean, as the data is processed in the mapper it is in memory; once a chunk is computed, it could be transferred directly to the reducer, and the mapper could pass on the next chunk in the same way, with no disk involvement.
In Spark, it is said there is in-memory computation. How is that different from the above? What makes Spark's in-memory computation better than MapReduce's? Also, would Spark have to involve the disk if the data is too huge?
Please explain.
Lots of questions here. I shall try to explain each one.
When does a mapper store its output to its local hard disk?
The mapper stores the data in the configured amount of memory. When the memory is 80% full (again configurable), it runs the combiner on the data present in memory to reduce it. But when the combined data also surpasses this memory limit, it is spilled to disk. These files are known as spill files. Multiple spill files are written over the course of the operation. While writing spill files, the mapper sorts and partitions the data according to the reducers. At the end of the map operation, these spill files need to be merged.
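
Note that a combiner does not run unless it is registered on the job; commonly the reducer class itself is reused, e.g. (a minimal sketch; MyMapper and MyReducer are placeholder class names):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "wordcount");
job.setMapperClass(MyMapper.class);
job.setCombinerClass(MyReducer.class); // run on map output while spilling
job.setReducerClass(MyReducer.class);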
Can we not move the data directly from the mapper to the reducer, once it is processed, without involving the mapper machine's hard disk?
The costliest operation in any processing is the data transfer between machines. The whole paradigm of MapReduce is to take the processing close to the data rather than moving the data around. So if it were done the way you suggest, there would be a lot of data movement. It is faster to write to the local disk than to write over the network, and this data can be shrunk by merging the spill files.
Sorting is done while spilling the files because it is easier (faster) to merge sorted data. Partitioning is done because you only need to merge the same partitions (the data going to the same reducer). In the process of merging, combiners are run again to reduce the data. This reduced data is then sent to the reducers.
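
The partitioning step is pluggable; for reference, Hadoop's default HashPartitioner is essentially just this:

import org.apache.hadoop.mapreduce.Partitioner;

public class HashPartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // mask off the sign bit so the partition index is non-negative
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}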
In Spark, it is said there is in-memory computation. How different is that from the above?
There is no difference between a Spark and a MapReduce program when you just read from a data set, perform one map function and one reduce function; Spark will do the same disk reads and writes as MapReduce code. The difference comes when you need to run several operations on the same data set. In MapReduce it will read from disk for each operation, but in Spark you have the option of keeping the data set in memory, in which case it is read from disk only once and later operations run on the in-memory data, which is obviously much faster.
The same applies when there is a chain of operations where the output of the first operation is the input to the second. In MapReduce, the output of the first operation is written to disk and read back from disk by the second, whereas in Spark you can persist the output of the first operation in memory so that the second operation reads from memory and is faster.
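
A minimal sketch of that second case using Spark's Java API (the input path and filter predicates are illustrative):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CacheDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("CacheDemo");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile("hdfs:///data/input"); // illustrative path
        JavaRDD<String> errors = lines.filter(l -> l.contains("ERROR"));
        errors.cache(); // keep the filtered set in memory after the first action

        long total = errors.count(); // first action: reads from disk, then caches
        long fatal = errors.filter(l -> l.contains("FATAL")).count(); // served from memory

        System.out.println(total + " errors, " + fatal + " fatal");
        sc.stop();
    }
}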

What is the exact MapReduce workflow?

A summary from the book "Hadoop: The Definitive Guide" by Tom White:
All the logic between the user's map function and the user's reduce function is called the shuffle. The shuffle therefore spans both map and reduce. After the user's map() function, the output sits in an in-memory circular buffer. When the buffer is 80% full, a background thread starts to run; it writes the buffer's content into a spill file. This spill file is partitioned by key, and within each partition the key-value pairs are sorted by key. After sorting, if a combiner function is enabled, the combiner function is called. All spill files are merged into one MapOutputFile, and every map task's MapOutputFile is collected over the network by the reduce task. The reduce task does another sort, and then the user's reduce function is called.
So the questions are:
1.) According to the above summary, this is the flow:
Mapper--Partitioner--Sort--Combiner--Shuffle--Sort--Reducer--Output
1a.) Is this the flow or is it something else?
1b.) Can you explain the above flow with an example, say word count (the ones I found online weren't very elaborate)?
2.) So the mapper phase's output is one big file (MapOutputFile)? And it is this one big file that is broken up, with the key-value pairs passed on to their respective reducers?
3.) Why does the sorting happen a second time, when the data is already sorted and combined when passed on to the respective reducers?
4.) Say mapper1 runs on datanode1; is it then necessary for reducer1 to run on datanode1, or can it run on any datanode?
Answering this question fully would be like rewriting the whole history. A lot of your doubts have to do with operating system concepts rather than MapReduce.
The mapper's data is written to the local file system. The data is partitioned based on the number of reducers, and in each partition there can be multiple files, based on the number of times spills have happened.
Each small file in a given partition is sorted, because an in-memory sort is done before the file is written.
Why does the data need to be sorted on the mapper side?
a. The data is sorted and merged on the mapper side to decrease the number of files.
b. The files are sorted because otherwise it would become impossible for the reducer to gather all the values for a given key.
After gathering data on the reducer, the number of files on the system first needs to be decreased (remember that ulimit puts a fixed cap on open files for every user, in this case hdfs).
The reducer then just maintains a file pointer into a small set of sorted files and merges them.
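
For question 1b, here is a small worked trace of word count (my own example; assume two reducers and that the hash partitioner happens to split the keys as shown):

map input (one split):  "to be or not to be"
map output:             (to,1) (be,1) (or,1) (not,1) (to,1) (be,1)
partition + sort:       partition 0: (be,1) (be,1) (not,1) | partition 1: (or,1) (to,1) (to,1)
combine (per spill):    partition 0: (be,2) (not,1)        | partition 1: (or,1) (to,2)
copy:                   reducer 0 fetches partition 0 from every mapper; reducer 1 fetches partition 1
reduce (merge + sum):   reducer 0 emits (be,2) (not,1); reducer 1 emits (or,1) (to,2)

With more mappers, each reducer merge-sorts the sorted partition files it fetched from all of them before summing, which is exactly why the map-side sort matters.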
For more interesting ideas, please refer to:
http://bytepadding.com/big-data/map-reduce/understanding-map-reduce-the-missing-guide/

Is it possible to perform any mapreduce task with a single reducer?

What if the output is so big that it does not fit into the reducer's RAM?
For example, a sorting task. In this case, the output is as big as the input. If you use a single reducer, then all the data does not fit into RAM. How does the sorting take place then?
I think I have got the answer.
Yes, it is possible to perform any MapReduce task with a single reducer, even if the data is bigger than the reducer's memory. In the shuffle phase, the reducer copies the data from the mappers into its memory and sorts it until it spills. Once it spills, that part of the data is stored on the reducer's local disk, and it starts receiving the new values. Once it spills again, the new data is merged with the previously stored file, and the merged file maintains the sorted order (essentially an external merge sort). Once the shuffle is done, the intermediate key-value pairs are stored in sorted order, and the reduce task is then performed on that data. Since the data is sorted, it is easy to do the aggregation by taking one chunk of the data into memory at a time.
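
To make the external merge concrete, here is a minimal sketch of the k-way merge idea in plain Java (names are mine; Hadoop's real implementation lives in its Merger class and works on serialized key-value records, not text lines):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.List;
import java.util.PriorityQueue;

public class SpillMerger {

    // One sorted spill file plus its smallest unread line.
    private static final class Head {
        String line;
        BufferedReader in;
    }

    // Merge already-sorted files by repeatedly emitting the smallest head.
    // Only one line per file is held in memory at a time, so the merge works
    // no matter how large the total data is.
    public static void merge(List<String> paths) throws IOException {
        PriorityQueue<Head> heads =
                new PriorityQueue<>((a, b) -> a.line.compareTo(b.line));
        for (String path : paths) {
            BufferedReader in = new BufferedReader(new FileReader(path));
            String first = in.readLine();
            if (first == null) {
                in.close(); // empty spill file
            } else {
                Head h = new Head();
                h.line = first;
                h.in = in;
                heads.add(h);
            }
        }
        while (!heads.isEmpty()) {
            Head h = heads.poll();
            System.out.println(h.line); // next record in global sorted order
            String next = h.in.readLine();
            if (next == null) {
                h.in.close(); // this file is exhausted
            } else {
                h.line = next;
                heads.add(h);
            }
        }
    }
}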

Hadoop combiner sort phase

When running a MapReduce job with a specified combiner, is the combiner run during the sort phase? I understand that the combiner is run on the mapper output for each spill, but it seems like it would also be beneficial to run it during the intermediate steps of the merge sort. I'm assuming here that in some stages of the sort, mapper output for some equivalent keys is held in memory at some point.
If this doesn't currently happen, is there a particular reason, or just something which hasn't been implemented?
Thanks in advance!
Combiners are there to save network bandwidth.
The map output gets sorted directly:
sorter.sort(MapOutputBuffer.this, kvstart, endPosition, reporter);
This happens right after the real mapping is done. While iterating through the buffer, it checks whether a combiner has been set; if yes, it combines the records. If not, it spills directly to disk.
The important parts are in MapTask, if you'd like to see it for yourself:
sorter.sort(MapOutputBuffer.this, kvstart, endPosition, reporter);
// some fields
for (int i = 0; i < partitions; ++i) {
    // check if a combiner is configured
    if (combinerRunner == null) {
        // spill directly
    } else {
        combinerRunner.combine(kvIter, combineCollector);
    }
}
This is the right stage to save disk space and network bandwidth, because it is very likely that the output will have to be transferred.
During the merge/shuffle/sort phase it is not beneficial, because you would then have to crunch larger amounts of data compared with running the combiner at map-finish time.
Note that the sort phase shown in the web interface is misleading. It is pure merging.
There are two opportunities for running the Combiner, both on the map side of processing. (A very good online reference is Tom White's "Hadoop: The Definitive Guide" - https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-6/shuffle-and-sort )
The first opportunity comes on the map side after completing the in-memory sort by key of each partition, and before writing those sorted data to disk. The motivation for running the Combiner at this point is to reduce the amount of data ultimately written to local storage. By running the Combiner here, we also reduce the amount of data that will need to be merged and sorted in the next step. So to the original question posted, yes, the Combiner is already being applied at this early step.
The second opportunity comes right after merging and sorting the spill files. In this case, the motivation for running the Combiner is to reduce the amount of data ultimately sent over the network to the reducers. This stage benefits from the earlier application of the Combiner, which may have already reduced the amount of data to be processed by this step.
The combiner only runs in the way you already understand it.
I suspect the reason the combiner only works this way is that it reduces the amount of data being sent to the reducers, which is a huge gain in many situations. Meanwhile, in the reducer, the data is already there, and whether you combine it during the sort/merge or in your reduce logic doesn't really matter computationally (it's done either now or later).
So, I guess my point is: you may get gains by combining during the merge as you say, but it's not going to be as much as the map-side combiner.
I haven't gone through the code, but in reference to Hadoop: The Definitive Guide by Tom White, 3rd edition, it does mention that if the combiner is specified it will run during the merge phase on the reduce side. The following is an excerpt from the text:
" The map outputs are copied to the reduce task JVM’s memory if they are small enough
(the buffer’s size is controlled by mapred.job.shuffle.input.buffer.percent, which
specifies the proportion of the heap to use for this purpose); otherwise, they are copied
to disk. When the in-memory buffer reaches a threshold size (controlled by
mapred.job.shuffle.merge.percent), or reaches a threshold number of map outputs
(mapred.inmem.merge.threshold), it is merged and spilled to disk. If a combiner is specified it will be run during the merge to reduce the amount of data written to disk.
"
