Is it possible to disable sorting in hadoop? - hadoop

My job doesn't require sorting, just aggregation of information per key. So I wonder whether it is possible to disable sorting altogether in order to improve performance.
Note: I can't set the reducer count to zero because I need to aggregate data across many mappers. I'm just not interested in a sorted result within one reducer.

One of the main purposes of sorting the map output is this: when the tuples reach the reducer, it has to build a (key, list of values) pair for each reduce() call. With sorted map output it can build that list with a single sequential scan (when it sees a different key, it just starts a new list). If the map output is not sorted, it has to scan the whole input to collect the values that share a key.

No. Sorting in MapReduce is essentially performed for internal purposes, not so that the end results are sorted.
Sorted input ensures good performance when building the list of values for each unique key, which is passed as the <Key, List<Values>> arguments to the reduce() function.
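The sequential-scan argument above can be illustrated with a minimal sketch in plain Python. This is not Hadoop code; `group_sorted` is a hypothetical helper that stands in for the way the framework builds each (key, list of values) pair from a key-sorted stream.

```python
from itertools import groupby
from operator import itemgetter

def group_sorted(pairs):
    """Yield (key, [values]) from pairs that are already sorted by key.

    One sequential pass: a new list starts whenever the key changes.
    """
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield key, [value for _, value in group]

# Sorted input: every key produces exactly one group.
sorted_pairs = [("a", 1), ("a", 3), ("b", 2), ("c", 5), ("c", 1)]
grouped = list(group_sorted(sorted_pairs))

# Unsorted input: the same single pass splits key "a" into two groups,
# which is why a plain scan is only enough when the input is sorted.
unsorted_pairs = [("a", 1), ("b", 2), ("a", 3)]
broken = list(group_sorted(unsorted_pairs))
```

With unsorted input the only alternatives are to buffer everything per key (a hash table) or to rescan, which is the cost the answers above refer to.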

Shuffling and sorting in Hadoop MapReduce are not performed at all if you specify zero reducers (setNumReduceTasks(0)).
and
The number of reducers can be set to 0 in the driver class with job.setNumReduceTasks(0). In that case there is no reduce phase, only a map phase. This is called a map-only job.

Related

Do we really need sorting in the MapReduce framework?

I am completely new to MapReduce and just can't get my mind around the need to sort the mapper output according to the keys in each partition. Eventually all we want is that a reducer is fed a partition which consists of several pairs of <key,List of Values> and that the key in each pair is unique not just for the corresponding partition but all the partitions which are fed to different reducers.
For doing that what is the need to do a sort at any stage whatsoever. Can't we use a hash table to group the values corresponding to the same key?
To break it down for each stage: at the mapper stage, for each output pair we simply hash the key to find the partition number and then append the pair to a linked list of all pairs belonging to the same partition. So at the end, the output of a single mapper would be a hash table in which each partition number maps to a linked list of <key,value> pairs with no key-based order whatsoever, i.e. no locality for similar key values.
Then the partitions from different mapper tasks are shuffled to a reducer. We now need to make sure that we first group all the values corresponding to the same key (a kind of merge) and then feed those merged pairs of <key,List of Values> to a separate reducer function. Here again we can use a hash table: we simply iterate through all the partitions, map each key to an index in the hash table, and append the corresponding value to the linked list at that index.
Wouldn't this method save time compared to the one in which we sort the output of each mapper?
I have already gone through the link (I currently can't comment on the thread, so I wrote a separate question). The top answer mentions that
Sorting saves time for the reducer, helping it easily distinguish when a new reduce task should start. It simply starts a new reduce task, when the next key in the sorted input data is different than the previous, to put it simply. Each reduce task takes a list of key-value pairs, but it has to call the reduce() method which takes a key-list(value) input, so it has to group values by key. It's easy to do so, if input data is pre-sorted (locally) in the map phase and simply merge-sorted in the reduce phase (since the reducers get data from many mappers)
But again we can do the same by using a hash table or can we not?
Well, yeah, you could use a hash table as long as everything fits in memory. But once the amount of data you're working with exceeds your computer's memory capacity, you have a problem.
The solution is to output data to a disk file and do an external sort.
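As a rough illustration of the external sort mentioned above, here is a minimal plain-Python sketch. It is an assumption-laden toy, not Hadoop's implementation: `external_sort` and `chunk_size` are hypothetical names, records are tab-separated strings, and `chunk_size` stands in for the memory limit. Only one chunk is held in memory at a time; sorted runs are spilled to temporary files and then combined with a k-way merge.

```python
import heapq
import os
import tempfile

def external_sort(records, chunk_size):
    """Yield (key, value) string pairs in sorted order using on-disk runs."""
    run_paths = []

    def spill(chunk):
        # Sort one memory-sized chunk and write it out as a sorted run.
        fd, path = tempfile.mkstemp(suffix=".run")
        with os.fdopen(fd, "w") as f:
            for key, value in sorted(chunk):
                f.write(f"{key}\t{value}\n")
        run_paths.append(path)

    def read_run(path):
        # Stream one sorted run back from disk, one record at a time.
        with open(path) as f:
            for line in f:
                key, value = line.rstrip("\n").split("\t")
                yield key, value

    chunk = []
    for record in records:
        chunk.append(record)
        if len(chunk) >= chunk_size:
            spill(chunk)
            chunk = []
    if chunk:
        spill(chunk)

    try:
        # k-way merge: needs only one record per run in memory.
        yield from heapq.merge(*(read_run(p) for p in run_paths))
    finally:
        for p in run_paths:
            os.remove(p)

records = [("b", "2"), ("a", "1"), ("c", "3"), ("a", "0"), ("b", "9")]
result = list(external_sort(records, chunk_size=2))
```

Because the merged stream is sorted by key, equal keys arrive next to each other, so grouping needs no hash table that must fit in memory.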

n-Records to reducer after Shuffle and Sort

I would like to move only the first 10 records of the output after sort/shuffle to the reducer. Is this possible?
The reason is this: I am to find the least 10 items with the largest count in a file. However, I know that the results of the mapping phase will arrive at the reducer already sorted. Hence, instead of sorting in the mappers, I'd like to pass only the first 10 lines after 'shuffle and sort' to the reducer. This would allow the reducer to sort only a subset of the original records.
Is there any way to do this?
You can achieve this by writing a custom Combiner for the job.
The different stages in the MapReduce job are:
Mapper -> Partitioner -> Sorting -> Combiner -> Reducer.
The Combiner logic then reads only the first 10 (n) records and discards all the others. The Reducer will receive only 10 records from each Mapper/Combiner.
Comment provided by #K246:
From Hadoop: The Definitive Guide (4th ed.): Before it writes to disk, the thread first divides the data into partitions corresponding to the reducers that they will ultimately be sent to. Within each partition, the background thread performs an in-memory sort by key, and if there is a combiner function, it is run on the output of the sort.
When you say least 10 in the file, is that for each mapper or for the entire input?
If it is for each mapper, then you have to aggregate again at the reducer across all mappers. Then, as #YoungHobbit pointed out, a Combiner will do the work.
If you need the least 10 from the entire input file, then I think you need to handle it with a single reducer and output accordingly.
Also, you said in the last line that the reducer will sort only a subset. Do you mean you are sorting again in the Reducer, or that some logic is performed in the reducer for only a subset of the input?
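The combiner-plus-single-reducer idea discussed above can be sketched in plain Python. This is only an illustration under stated assumptions: `combine_top_n` and `reduce_top_n` are hypothetical names, each record is a `(count, item)` pair, and every item's count is already final within one mapper (if the same item can appear in several mappers, the counts must be aggregated first and truncation per mapper is not safe).

```python
import heapq

def combine_top_n(mapper_output, n):
    """Combiner stand-in: keep only the n smallest (count, item) pairs."""
    return heapq.nsmallest(n, mapper_output)

def reduce_top_n(combined_outputs, n):
    """Single-reducer stand-in: pick the global n smallest from all mappers."""
    merged = [pair for output in combined_outputs for pair in output]
    return heapq.nsmallest(n, merged)

mapper_outputs = [
    [(5, "a"), (1, "b"), (9, "c")],
    [(2, "d"), (7, "e"), (3, "f")],
]
n = 2
local_tops = [combine_top_n(output, n) for output in mapper_outputs]
global_top = reduce_top_n(local_tops, n)
```

The reducer sees at most n records per mapper, so it handles only a small subset of the original records. Swapping `heapq.nsmallest` for `heapq.nlargest` gives the largest counts instead.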

Sorting in MapReduce Hadoop

I have few basic questions in Hadoop MapReduce.
1. Assume 100 mappers were executed and zero reducers. Will it generate 100 files? Is each individual file sorted? Is the output sorted across all mappers?
2. Input for a reducer is Key -> Values. For each key, are all values sorted?
3. Assume 50 reducers were executed. Will it generate 50 files? Is each individual file sorted? Is the output sorted across all reducers?
4. Is there any place where guaranteed sorting happens in MapReduce?
1.Assume if 100 mappers were executed and zero reducer. Will it generate 100 files?
Yes.
All individual are sorted?
No. If no reducers are used, then the output of mappers are not sorted. Sorting only takes place when there is a reduce phase.
Across all mapper output are sorted?
No, for the same reason, as above.
2.Input for reducer is Key -> Values. For each key, all values are sorted?
No. However, the keys are sorted. After the shuffling phase, in which the reducer gets the output of the mappers, it merge-sorts the sorted output keys of the mappers (since there IS a reduce phase) and when it starts reducing, the keys are sorted.
3.Assume if 50 reducers were executed. Will it generate 50 files?
Yes. (unless you use MultipleOutputs)
All individual files are sorted?
No. The sorted input does not guarantee a sorted output. The output depends on the algorithm that you use in the reduce method.
Across all reducer output are sorted?
No, for the same reason as above. However, if you use an Identity Reducer, i.e., you just write the input of the reducer as you get it, the reducer's output will be sorted PER REDUCER, not globally.
Is there any place where guaranteed sorting happens in MapReduce?
Sorting takes place when there is a reduce phase and it is applied in the output keys of each mapper and the input keys of each reducer. If you want to globally sort the input of the reducer, you can either use a single reducer, or a TotalOrderPartitioner, which is a bit tricky...
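The difference between per-reducer order and global order can be sketched in plain Python. This is a toy model, not Hadoop code: `hash_partition`, `range_partition`, and `run_identity_reducers` are hypothetical names, `zlib.crc32` is an assumed stand-in for a key hash, and the hand-picked `split_points` play the role that sampled split points play for a TotalOrderPartitioner.

```python
import bisect
import zlib

def hash_partition(key, num_reducers):
    """Hash-style partitioning: spreads keys evenly, ignores key order."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def range_partition(key, split_points):
    """Range partitioning: reducer i only gets keys below reducer i+1's keys."""
    return bisect.bisect_right(split_points, key)

def run_identity_reducers(keys, num_reducers, partition):
    """Route keys to reducers, then emit each reducer's keys in sorted order."""
    buckets = [[] for _ in range(num_reducers)]
    for key in keys:
        buckets[partition(key)].append(key)
    return [sorted(bucket) for bucket in buckets]

keys = ["pear", "apple", "zebra", "hat", "mango", "kiwi", "banana"]

hashed = run_identity_reducers(keys, 3, lambda k: hash_partition(k, 3))
ranged = run_identity_reducers(keys, 3, lambda k: range_partition(k, ["h", "p"]))

# Concatenating the range-partitioned outputs in reducer order is globally sorted.
concatenated = [key for bucket in ranged for key in bucket]
```

With hash partitioning each output file is sorted on its own, but concatenating the files gives no global order in general. With range partitioning the concatenation is globally sorted, at the cost of having to choose split points that balance the load.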

Secondary sort - leave it to Hadoop framework or do it yourself in reducer

I read about secondary sorting, where there is a need to sort not only by key, but also by part of the value, for each of the keys.
There are two ways to do this:
Cache values in reducer for each key and sort the values yourself
Leave the job to Hadoop framework, by specifying custom Comparator, Partitioner...all that you need to enable not only to sort by key, but also by value
My question is, when would you recommend first and when the second approach?
As I currently see it - if the framework already performs sorting, why not sort it by key and value at the same time...please correct me if there is some side-effect. For example, which should be faster?
I understand that the biggest problem of "in-Reducer sorting" is the number of records, but I would like to get the whole picture.
Your understanding of secondary sort is correct. Before addressing the two approaches, let me describe what happens before the Reducer's reduce() method is called.
Every reducer copies its relevant partitions from all mappers and stores them on the reducer's local disk as multiple spill files. A background thread merges all these spill files into a single sorted file.
The records in the final single sorted file are first sorted by the natural key. Then, if secondary sorting is configured, the records of every key are further ordered by the secondary part of the composite key.
So the choice between the two approaches depends on how many records there are in the single sorted file for a given key. The reduce() method reads all the values of a key one by one from the single sorted file through the Iterable passed to it.
If a sorted Java collection such as TreeSet/TreeMap can hold all the values of a key without exhausting the reducer's JVM heap, you may skip secondary sorting and use the collection itself to get the order you want.
If the JVM heap is not large enough for a sorted Java collection to hold all the values of a key, prefer MapReduce's custom secondary sorting. It sorts the keys (natural sort) and all values of a key (secondary sort) during the on-disk merge/sort that creates the single sorted file, and passes the values to reduce() in that order for every key.
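The two approaches can be compared with a small plain-Python sketch. It is a simulation under assumptions, not Hadoop code: `in_reducer_sort` and `framework_secondary_sort` are hypothetical names, and sorting tuples of `(natural key, value)` stands in for a composite key with a custom sort comparator, while grouping on the first element stands in for the grouping comparator.

```python
from itertools import groupby
from operator import itemgetter

records = [("sensor2", 30), ("sensor1", 20), ("sensor1", 5), ("sensor2", 10)]

def in_reducer_sort(records):
    """Approach 1: framework sorts by key only; reducer buffers and sorts values."""
    result = {}
    by_key = sorted(records, key=itemgetter(0))
    for key, group in groupby(by_key, key=itemgetter(0)):
        values = [value for _, value in group]  # all values of one key held in memory
        result[key] = sorted(values)
    return result

def framework_secondary_sort(records):
    """Approach 2: sort on the composite (key, value); reducer just streams values."""
    result = {}
    by_composite = sorted(records)  # stand-in for the shuffle's sort on a composite key
    for key, group in groupby(by_composite, key=itemgetter(0)):
        result[key] = [value for _, value in group]  # already in order, no extra sort
    return result

buffered = in_reducer_sort(records)
streamed = framework_secondary_sort(records)
```

Both give the same order. The difference is where the memory pressure lands: the first approach must hold every value of a key in the reducer at once, while the second lets the on-disk merge do the ordering.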

What is the purpose of shuffling and sorting phase in the reducer in Map Reduce Programming?

In Map Reduce programming the reduce phase has shuffling, sorting and reduce as its sub-parts. Sorting is a costly affair.
What is the purpose of shuffling and sorting phase in the reducer in Map Reduce Programming?
First of all, shuffling is the process of transferring data from the mappers to the reducers, so I think it is obvious that it is necessary for the reducers, since otherwise they wouldn't have any input (or input from every mapper). Shuffling can start even before the map phase has finished, to save some time. That's why you can see a reduce status greater than 0% (but less than 33%) when the map status is not yet 100%.
Sorting saves time for the reducer, helping it easily distinguish when a new reduce task should start. It simply starts a new reduce task, when the next key in the sorted input data is different than the previous, to put it simply. Each reduce task takes a list of key-value pairs, but it has to call the reduce() method which takes a key-list(value) input, so it has to group values by key. It's easy to do so, if input data is pre-sorted (locally) in the map phase and simply merge-sorted in the reduce phase (since the reducers get data from many mappers).
Partitioning, that you mentioned in one of the answers, is a different process. It determines in which reducer a (key, value) pair, output of the map phase, will be sent. The default Partitioner uses a hashing on the keys to distribute them to the reduce tasks, but you can override it and use your own custom Partitioner.
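A minimal plain-Python sketch of the partitioning step described above. Hadoop's default HashPartitioner computes `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`; the sketch mirrors that shape but uses `zlib.crc32` as an assumed stand-in for `hashCode()`, and `first_letter_partition` is a purely hypothetical custom partitioner.

```python
import zlib

def default_partition(key, num_reduce_tasks):
    """Hash-style partitioner: mask to a non-negative int, then take the modulo."""
    key_hash = zlib.crc32(key.encode("utf-8"))  # stand-in for key.hashCode()
    return (key_hash & 0x7FFFFFFF) % num_reduce_tasks

def first_letter_partition(key, num_reduce_tasks):
    """Custom partitioner example: route keys by their first letter."""
    return (ord(key[0].lower()) - ord("a")) % num_reduce_tasks

keys = ["apple", "banana", "Eagle", "apple"]
default_targets = [default_partition(k, 4) for k in keys]
custom_targets = [first_letter_partition(k, 4) for k in keys]
```

Whatever function is used, it must be deterministic: the same key has to land on the same reducer no matter which mapper emitted it, otherwise the values of one key would be split across reducers.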
A great source of information for these steps is this Yahoo tutorial (archived).
A nice graphical representation of this is the following (shuffle is called "copy" in this figure):
Note that shuffling and sorting are not performed at all if you specify zero reducers (setNumReduceTasks(0)). Then, the MapReduce job stops at the map phase, and the map phase does not include any kind of sorting (so even the map phase is faster).
UPDATE: Since you are looking for something more official, you can also read Tom White's book "Hadoop: The Definitive Guide". Here is the interesting part for your question.
Tom White has been an Apache Hadoop committer since February 2007, and is a member of the Apache Software Foundation, so I guess it is pretty credible and official...
Let's revisit the key phases of a MapReduce program.
The map phase is done by mappers. Mappers run on unsorted input key/value pairs. Each mapper emits zero, one, or multiple output key/value pairs for each input key/value pair.
The combine phase is done by combiners. The combiner should combine key/value pairs with the same key. Each combiner may run zero, once, or multiple times.
The shuffle and sort phase is done by the framework. Data from all mappers are grouped by the key, split among reducers and sorted by the key. Each reducer obtains all values associated with the same key. The programmer may supply custom compare functions for sorting and a partitioner for data split.
The partitioner decides which reducer will get a particular key value pair.
The reducer obtains sorted key/[values list] pairs, sorted by the key. The value list contains all values with the same key produced by mappers. Each reducer emits zero, one or multiple output key/value pairs for each input key/value pair.
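The phases listed above can be tied together in a toy word count written in plain Python. This is a single-process simulation, not Hadoop code: all function names are hypothetical, `zlib.crc32` is an assumed stand-in for the partitioner's hash, and the combiner is run exactly once per mapper although the framework may run it zero or several times.

```python
import zlib
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

def map_phase(line):
    """Mapper: emit (word, 1) for every word in one input line."""
    return [(word, 1) for word in line.split()]

def combine(pairs):
    """Combiner: pre-aggregate one mapper's output to shrink the shuffle."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())

def partition(key, num_reducers):
    """Partitioner: decide which reducer receives a key."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle_and_sort(mapper_outputs, num_reducers):
    """Framework step: route pairs to reducers, then sort each input by key."""
    reducer_inputs = [[] for _ in range(num_reducers)]
    for output in mapper_outputs:
        for key, value in output:
            reducer_inputs[partition(key, num_reducers)].append((key, value))
    return [sorted(pairs, key=itemgetter(0)) for pairs in reducer_inputs]

def reduce_phase(key, values):
    """Reducer: receives one key together with all of its values."""
    return (key, sum(values))

def run_job(lines, num_reducers):
    mapper_outputs = [combine(map_phase(line)) for line in lines]
    results = []
    for reducer_input in shuffle_and_sort(mapper_outputs, num_reducers):
        results.append([
            reduce_phase(key, [value for _, value in group])
            for key, group in groupby(reducer_input, key=itemgetter(0))
        ])
    return results

word_counts = run_job(["a b a", "b c", "a c c"], num_reducers=2)
```

`word_counts` holds one list per reducer. Each list is ordered by key because the reducer's input was sorted, which is the per-reducer ordering the answers above describe.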
Have a look at this javacodegeeks article by Maria Jurcovicova and mssqltips article by Datta for a better understanding
Below is the image from safaribooksonline article
I thought of just adding some points missing in above answers. This diagram taken from here clearly states the what's really going on.
If I state again the real purpose of
Split: Improves the parallel processing by distributing the processing load across different nodes (Mappers), which would save the overall processing time.
Combine: Shrinks the output of each Mapper. It would save the time spending for moving the data from one node to another.
Sort (Shuffle & Sort): Makes it easy for the run-time to start new reduce() calls: while going through the sorted item list, whenever the current key is different from the previous one, it can start a new reduce() call.
Some data processing requirements don't need sorting at all. Syncsort has made the sorting in Hadoop pluggable. Here is a nice blog from them on sorting. The process of moving the data from the mappers to the reducers is called shuffling; check this article for more information on the same.
I've always assumed this was necessary as the output from the mapper is the input for the reducer, so it was sorted based on the keyspace and then split into buckets for each reducer input. You want to ensure all the same values of a Key end up in the same bucket going to the reducer so they are reduced together. There is no point sending K1,V2 and K1,V4 to different reducers as they need to be together in order to be reduced.
Tried explaining it as simply as possible
Shuffling is the process by which intermediate data from mappers is transferred to 0, 1 or more reducers. Each reducer receives 1 or more keys and their associated values, depending on the number of reducers (for a balanced load). Further, the records each reducer receives are sorted by key (the values for a key are not sorted by default).
Because of its size, a distributed dataset is usually stored in partitions, with each partition holding a group of rows. This also improves parallelism for operations like a map or filter. A shuffle is any operation over a dataset that requires redistributing data across its partitions. Examples include sorting and grouping by key.
A common method for shuffling a large dataset is to split the execution into a map and a reduce phase. The data is then shuffled between the map and reduce tasks. For example, suppose we want to sort a dataset with 4 partitions, where each partition is a group of 4 blocks. The goal is to produce another dataset with 4 partitions, but this time sorted by key.
In a sort operation, for example, each square is a sorted subpartition with keys in a distinct range. Each reduce task then merge-sorts subpartitions of the same shade.
The above diagram shows this process. Initially, the unsorted dataset is grouped by color (blue, purple, green, orange). The goal of the shuffle is to regroup the blocks by shade (light to dark). This regrouping requires an all-to-all communication: each map task (a colored circle) produces one intermediate output (a square) for each shade, and these intermediate outputs are shuffled to their respective reduce task (a gray circle).
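The all-to-all regrouping described above can be sketched in plain Python. This is an illustrative model with assumed names (`map_task`, `reduce_task`, `shuffle_sort`) and hand-picked range `boundaries`: each map task sorts its partition and cuts it into one sorted sub-partition per key range, and each reduce task merge-sorts the sub-partitions of its range.

```python
import bisect
import heapq

def map_task(partition, boundaries):
    """Sort one input partition and split it into one sub-partition per range."""
    outputs = [[] for _ in range(len(boundaries) + 1)]
    for key in sorted(partition):
        outputs[bisect.bisect_right(boundaries, key)].append(key)
    return outputs

def reduce_task(subpartitions):
    """Merge the already-sorted sub-partitions that belong to one key range."""
    return list(heapq.merge(*subpartitions))

def shuffle_sort(partitions, boundaries):
    map_outputs = [map_task(p, boundaries) for p in partitions]
    num_reducers = len(boundaries) + 1
    # All-to-all: reducer r pulls sub-partition r from every map task.
    return [
        reduce_task([output[r] for output in map_outputs])
        for r in range(num_reducers)
    ]

partitions = [
    [15, 3, 8, 0],
    [7, 12, 1, 10],
    [4, 14, 9, 2],
    [11, 5, 13, 6],
]
sorted_partitions = shuffle_sort(partitions, boundaries=[4, 8, 12])
```

Four unsorted input partitions go in and four output partitions come out, each covering a distinct key range, so reading the outputs in order yields the fully sorted dataset.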
The text and image were largely taken from here.
There are only two things that MapReduce does NATIVELY: Sort and (implemented by sort) scalable GroupBy.
Most of applications and Design Patterns over MapReduce are built over these two operations, which are provided by shuffle and sort.
This is a good read. Hope it helps. As for the sorting you are concerned about, I think it serves the merge operation in the last step of the map. When the map operation is done and the result has to be written to local disk, a multi-way merge is performed on the spills generated from the buffer. For that merge, sorting each partition in advance is helpful.
Well,
In MapReduce there are two important phases, Mapper and Reducer. Both are important, but only the Mapper is mandatory; in some programs the reducer is optional. Now to your question.
Shuffling and sorting are two important operations in MapReduce. First, the Hadoop framework takes structured/unstructured data and separates it into keys and values.
The Mapper program then arranges the data into keys and values to be processed and generates the Key 2 / Value 2 pairs. These values must be processed and rearranged in the proper order to get the desired result. This shuffle and sort is done on the local system (the framework takes care of it), and after processing the framework cleans up the data on the local system.
OK.
Here we also use a combiner and a partitioner to optimize this shuffle and sort process. After proper arrangement, those key/value pairs are passed to the Reducer, which produces the client's desired output.
K1, V1 -> K2, V2 (we write the Mapper program) -> K2, V' (here the data is shuffled and sorted) -> K3, V3. Generate the output: K4, V4.
Please note that all these steps are logical operations only; they do not change the original data.
Your question: What is the purpose of shuffling and sorting phase in the reducer in Map Reduce Programming?
Short answer: to process the data to get the desired output. Shuffling aggregates the data; reduce produces the expected output.
