hadoop mapreduce two sorts - sorting

I'm new to Hadoop mapreduce. I went through some of the tutorials and noticed that the output of the mapper is sorted while in the reducer side, we also have a shuffle & sort phase.
So why do we have two sorts there? What are the purposes of them?
Thanks!

Mapper: it arranges the input data from a source into key/value pairs for further processing.
Reducer: the aggregation logic is written here.
The role of the shuffle is to shuffle and sort the map output and pass it from the mappers to the reducers. This is done internally by the MR framework, but we can plug in our own custom behaviour using the MR API and Java.
Refer to this example of WordCount:
http://kickstarthadoop.blogspot.in/2011/04/word-count-hadoop-map-reduce-example.html
Refer to this as well.
https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
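For reference, here is a minimal sketch of a WordCount mapper and reducer along the lines of the linked example (class names are illustrative; the org.apache.hadoop.mapreduce API is assumed). Note that reduce() is only called after the framework has already shuffled and sorted the map output, so all counts for one word arrive together.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: turns each input line into (word, 1) pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reducer: receives all values for one word together (already shuffled and sorted
// by the framework) and sums them.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));
    }
}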

Related

In hadoop what is meant by ability to preserve state across mapper reducer multiple inputs?

The heading of the question explains what my question is.
I have been reading through multiple texts, answers where I came across this line
Through use of the combiner and by taking advantage of the ability to
preserve state across multiple inputs, it is often possible to
substantially reduce both the number and size of key-value pairs that
need to be shuffled from the mappers to the reducers.
I am not able to understand this concept. An elaborate answer and explanation with an example would be really helpful. How to develop an intuition to understand such concepts?
If you already feel comfortable with the "reducer" concept, the combiner concept will be easy. A combiner can be seen as a mini-reducer in the map phase. What do I mean by that? Let's see an example: suppose you are doing the classic wordcount problem. You know that for every word a key-value pair is emitted by the mapper. Then the reducer takes these key-value pairs as input and summarizes them.
Suppose that a mapper collects some key-value pairs like:
<key1,1>,
<key2,1>,
<key1,1>,
<key3,1>,
<key1,1>
If you are not using a combiner, these five key-value pairs will be sent to the reducer. But using a combiner we could perform a pre-reduce in the mapper, so the output of the mapper will be:
<key1,3>,
<key2,1>,
<key3,1>
In this simple example by using a combiner, you reduced the total number of key-value pairs from 5 to 3, which will give you less network traffic and better performance in the shuffle phase.
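To make "preserve state across multiple inputs" concrete, here is a hedged sketch of the in-mapper combining pattern for wordcount (class and field names are made up): instead of emitting (word, 1) for every token, the mapper keeps a HashMap that survives across all its map() calls and emits the accumulated partial counts once in cleanup(), which shrinks the shuffle even further than a regular combiner.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// In-mapper combining: counts are accumulated in a map-side HashMap across all
// input records seen by this mapper task, and emitted once at the end.
public class InMapperCombiningMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    // State preserved across multiple inputs (all calls to map() in this task).
    private final Map<String, Integer> counts = new HashMap<>();

    @Override
    protected void map(LongWritable offset, Text line, Context context) {
        for (String word : line.toString().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Emit one (word, partialCount) pair per distinct word instead of one per occurrence.
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            context.write(new Text(entry.getKey()), new IntWritable(entry.getValue()));
        }
    }
}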

Why do we need the "map" part in MapReduce?

The programming model MapReduce consists of 2 procedures, map and reduce. Why do we need the map part, when we could simply do the mapping inside the reduce function?
Consider the following pseudocode:
result = my_list.map(my_mapper).reduce(my_reducer);
This could be shortened to
result = my_list.reduce(lambda x : my_reducer(my_mapper(x)));
How can the 1st approach be preferable to the 2nd one, when the 1st approach requires one more pass through the data? Is my code example an oversimplification?
Well, if you refer to Hadoop-style MapReduce it is actually map-shuffle-reduce, where the shuffle is the reason for map and reduce to be separated. At a slightly higher level you can think about data locality. Each key-value pair passed through map can generate zero or more key-value pairs. To be able to reduce these you have to ensure that all values for a given key are available on a single reducer, hence the shuffle. What is important is that pairs emitted from a single input pair can be processed by different reducers.
It is possible to use patterns like map-side aggregations or combiners but at the end of the day it is still (map)-reduce-shuffle-reduce.
Assuming data locality is not an issue, higher-order functions like map and reduce provide an elegant abstraction layer. Finally, it is a declarative API. A simple expression like xs.map(f1).reduce(f2) describes only the what, not the how. Depending on the language or context it can be eagerly or lazily evaluated, operations can be fused, and in more complex scenarios reordered and optimized in many different ways.
Regarding your code: even if the signatures were correct, it wouldn't really reduce the number of times you pass over the data. Moreover, if you push map into the aggregation, then the arguments passed to the aggregation function are no longer of the same type. That means either a sequential fold or much more complex merging logic.
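A small Java sketch of that type argument (names are illustrative, nothing Hadoop-specific): with map followed by reduce, the reducing function only ever sees values of one type, whereas pushing the mapping into the reduction turns it into a fold whose accumulator and element types differ, so parallel execution needs an extra combiner to merge partial results.

import java.util.List;

public class MapIntoReduce {
    public static void main(String[] args) {
        List<String> words = List.of("hadoop", "map", "reduce");

        // map then reduce: the reducer is (Integer, Integer) -> Integer.
        int total = words.stream()
                .map(String::length)
                .reduce(0, Integer::sum);

        // map pushed into the reduction: the accumulator is (Integer, String) -> Integer,
        // which is a fold, and a separate combiner is required to merge partial results
        // when the work is split across threads or machines.
        int totalFolded = words.stream()
                .reduce(0,
                        (acc, word) -> acc + word.length(),
                        Integer::sum);

        System.out.println(total + " " + totalFolded);  // both print 15
    }
}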
At a high level, MapReduce is all about processing in parallel. Even though the reducers work on map output, in practical terms each reducer gets only part of the data, and that is possible only in the first approach.
In your second approach, your reducer actually needs the entire output of the mapper, which defeats the idea of parallelism.

Is (key,value) pair in Hadoop always ('text',1)?

I am new to Hadoop.
Can you please explain the (key/value) pair? Is the value always one? Is the output of the reduce step always a (key/value) pair? If yes, how is that (key/value) data used further?
Please help me.
I guess you are asking about the 'one' value in the (key,value) pair because of the wordcount example in the Hadoop tutorials. So, the answer is no, it is not always 'one'.
The Hadoop implementation of MapReduce works by passing (key,value) pairs through the entire workflow, from the input to the output:
Map step: generally speaking (there are other particular cases, depending on the input format), the mappers process the data within the splits they are assigned to line by line; each line is passed to the map method as a (key,value) pair, where the key is the offset of the line within the split and the value is the line itself. The mapper then produces another (key,value) pair at its output, whose meaning depends on the mapping function you are implementing; sometimes it will be a variable key and a fixed value (e.g. in wordcount, the key is the word and the value is always 'one'); other times the value will be the length of the line, or the sum of all the words starting with a prefix... whatever you may imagine; the key may be a word, a fixed custom key...
Reduce step: typically the reducer receives lists of (key,value) pairs produced by the mappers whose key is the same (this depends on the combiner class you are using, of course, but generally speaking). It then produces another (key,value) pair at its output; again, this depends on the logic of your application. Typically, the reducer is used to aggregate all the values for the same key.
This is a very rough, quick and undetailed explanation; I encourage you to read some official documentation about it, or specialized literature such as this.
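As an illustration that the value does not have to be 'one', here is a hypothetical mapper (all names made up) that emits the first word of each line as the key and the length of the line as the value:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Key: the first word of the line; value: the length of the line in bytes.
public class LineLengthMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final Text firstWord = new Text();
    private final IntWritable lineLength = new IntWritable();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] tokens = line.toString().trim().split("\\s+");
        if (tokens.length == 0 || tokens[0].isEmpty()) {
            return;  // skip empty lines
        }
        firstWord.set(tokens[0]);
        lineLength.set(line.getLength());
        context.write(firstWord, lineLength);
    }
}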
Hope you have started learning mapreduce with the WordCount example.
The Key/Value pair is the record entity that mapreduce accepts for execution. The InputFormat classes that read records from the source and the OutputFormat classes that commit results operate only on records in Key/Value format.
Key/Value format is the best-suited representation of records to pass through the different stages of the map-partition-sort-combine-shuffle-merge-sort-reduce lifecycle of mapreduce. Please refer to:
http://www.thecloudavenue.com/2012/09/why-does-hadoop-uses-kv-keyvalue-pairs.html
The Key/Value data types can be anything. The Text/IntWritable key/value pair you used is the best fit for wordcount, but it can actually be anything, according to your requirement.
Kindly spend some time reading the Hadoop Definitive Guide or the Yahoo tutorials to get more understanding. Happy learning...

Hadoop map only job

My situation is like the following:
I have two MapReduce jobs.
The first one is a MapReduce job which produces output sorted by key.
The second, map-only job will then extract some part of the data and just collect it.
I have no reducer in the second job.
The problem is that I am not sure whether the output of the map-only job will be sorted, or whether it will come out of the map function unsorted.
First of all: if your second job only contains a filter to include/exclude specific records, then you are better off simply adding this filter to the end of the reducer of the first job.
A rather important fact of MapReduce is that the reducer will sort the records in "some way" that you do not control. When writing a job you should assume the records are output in a random order.
If you really need all records to be output in a specific order, then using the SecondarySort mechanism in combination with a single reducer is the "easy" solution, but it doesn't scale well.
The "hard" solution is what the "Tera sort" benchmark uses.
Read this SO question for more insight into how that works:
How does the MapReduce sort algorithm work?
No. As zsxwing said, there won't be any processing done unless you specify a reducer; only then will partitioning be performed on the map side, and sorting and grouping be done on the reduce side.
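For reference, a map-only job is normally configured by setting the number of reduce tasks to zero: with zero reducers the map output is written straight to the output path, so there is no partitioning, sorting or shuffling at all. A minimal driver sketch under that assumption (FilterMapper and its filter condition are hypothetical):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyJob {

    // Hypothetical mapper that keeps only the records of interest.
    public static class FilterMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            if (line.toString().contains("ERROR")) {  // illustrative filter condition
                context.write(line, NullWritable.get());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map only filter");
        job.setJarByClass(MapOnlyJob.class);
        job.setMapperClass(FilterMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        // Zero reduce tasks: map output goes directly to the output path,
        // with no partitioning, sorting or shuffling.
        job.setNumReduceTasks(0);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}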

Hadoop MapReduce with already sorted files

I'm working with Hadoop MapReduce. I've got data in HDFS and the data in each file is already sorted. Is it possible to force MapReduce not to re-sort the data after the map phase? I've tried changing map.sort.class to a no-op, but it didn't work (i.e. the data wasn't sorted as I'd expected). Has anyone tried doing something similar and managed to achieve it?
I think it depends on what kind of result you want: a sorted result or an unsorted result.
If you need the result to be sorted, I think Hadoop is not suitable for this work. There are two reasons:
The input data will be stored in different chunks (if big enough) and partitioned into multiple splits. Each split is mapped to one map task, and all the output of the map tasks is gathered (after being partitioned/sorted/combined/copied/merged) as the reducer's input. It is hard to keep keys in order across these stages.
Sorting happens not only after the map process in the map task; when the merge process runs during the reduce task, there is a sort step too.
If you do not need the result to be sorted, I think this patch may be what you want:
Support no sort dataflow in map output and reduce merge phrase : https://issues.apache.org/jira/browse/MAPREDUCE-3397
