How does MapReduce work? Did I get it right? - hadoop

I'm trying to understand how MapReduce actually works. Please read what I've written below and tell me if there are any missing or incorrect parts.
Thank you.
The data is first split into what are called input splits (a logical grouping whose size we define according to our record-processing needs).
Then, there is a Mapper for every input split, which takes that input split and sorts it by key and value.
Then, there is the shuffling process, which takes all of the data from the mappers (key-value pairs) and merges all identical keys with their values (the output is every key with its list of values). The shuffling process occurs in order to give the reducer, for each distinct key, a single key together with its values.
Then, the Reducer merges all the key-value pairs into one place (a file, maybe?), which is the final result of the MapReduce process.
We only have to define the code for the Map step (which always outputs key-value pairs) and the Reduce step (which takes the key-value input and produces the final result: a count, sum, average, etc.).

Your understanding is slightly wrong, especially regarding how the mapper works.
I have a very nice pictorial image to explain it in simple terms.
It is similar to the wordcount program, where:
Each bundle of chocolates is an InputSplit, which is handled by a mapper. So we have 3 bundles.
Each chocolate is a word. One or more words (making a sentence) form a record, and each record is the input to a single map call. So, within one input split there may be multiple records, and each record is input to a single mapper.
The mapper counts the occurrences of each word (chocolate) and spits out the count. Note that each mapper works on only one line (record) at a time. As soon as it is done, it picks the next record from the input split. (2nd phase in the image)
Once the map phase is finished, sorting and shuffling take place to make a bucket of counts for each kind of chocolate. (3rd phase in the image)
One reducer gets one bucket, with the key being the name of the chocolate (or the word) and a list of counts as the values. So, the reduce function is invoked once for every distinct word in the whole input file.
The reducer iterates through the counts and sums them up to produce the final count, which it emits against the word.
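In code, a minimal sketch of that wordcount mapper and reducer in the standard Hadoop Java API looks roughly like this (class and variable names here are illustrative, not the exact program behind the picture):

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCountSketch {

        public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                // One call per record (line); emit (word, 1) for every word in it.
                StringTokenizer it = new StringTokenizer(line.toString());
                while (it.hasMoreTokens()) {
                    word.set(it.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                    throws IOException, InterruptedException {
                // One call per distinct word, with all of its counts grouped together.
                int sum = 0;
                for (IntWritable c : counts) {
                    sum += c.get();
                }
                context.write(word, new IntWritable(sum));
            }
        }
    }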
The diagram below shows how a single input split of the wordcount program is processed:

Similar QA - Simple explanation of MapReduce?
Also, this post explains Hadoop - HDFS & MapReduce in a very simple way: https://content.pivotal.io/blog/demystifying-apache-hadoop-in-5-pictures

Related

Designing the "mapper" and "reducer" functions' functionality for Hadoop?

I am trying to design a mapper and reducer for Hadoop. I am new to Hadoop, and I'm a bit confused about how the mapper and reducer are supposed to work for my specific application.
The input to my mapper is the connectivity of a large directed graph. It is a 2-column input where each row is an individual edge. The first column is the start node id and the second column is the end node id of each edge. I'm trying to output the number of neighbors for each start node id into a 2-column text file, where the first column is sorted in order of increasing start node id.
My questions are:
(1) The input is already set up such that each line is a key-value pair, where the key is the start node id, and the value is the end node id. Would the mapper simply just read in each line and write it out? That seems redundant.
(2) Does the sorting take place in between the mapper and reducer or could the sorting actually be done with the reducer itself?
If my understanding is correct, you want to count how many distinct values a key will have.
Simply emitting the input key-value pairs in the mapper, and then counting the distinct values per key in the reducer (e.g., by adding them to a set and emitting the set size as the reducer's output value), is one way of doing it, but a bit redundant, as you say.
In general, you want to reduce the network traffic, so you may want to do some more computation before the shuffling (and yes, the shuffling is done by Hadoop).
Two easy ways to improve the efficiency are:
1) Use a combiner, which will output sets of values instead of single values. This way, you will send fewer key-value pairs to the reducers, and some values may be skipped, since they are already in the local value set of the same key.
2) Use map-side aggregation. Instead of emitting the input key-value pairs right away, store them locally in the mapper (in memory) in a data structure (e.g., a hashmap or multimap). The key can be the map input key and the value can be the set of values seen so far for this key. Each time you meet a new value for this key, you append it to this structure. At the end of each mapper, you emit this structure (or convert the values to an array) from the close() method (if I remember the name).
You can look up both methods using the keywords "combiner" and "map-side aggregation".
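As a rough illustration of option 2 for this neighbour-counting problem, here is a sketch using the newer mapreduce API, where cleanup() plays the role that close() plays in the old mapred API; the class and field names are made up:

    // Sketch of in-mapper (map-side) aggregation: buffer distinct (start, end)
    // pairs in memory and emit each one at most once per map task, so duplicate
    // edges never travel over the network.
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class EdgeDedupMapper extends Mapper<LongWritable, Text, Text, Text> {

        // start node id -> distinct end node ids seen so far in this map task
        private final Map<String, Set<String>> neighbours = new HashMap<>();

        @Override
        protected void map(LongWritable offset, Text line, Context context) {
            String[] cols = line.toString().trim().split("\\s+");
            if (cols.length == 2) {
                neighbours.computeIfAbsent(cols[0], k -> new HashSet<>()).add(cols[1]);
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            // Emit the buffered pairs once the whole split has been processed.
            for (Map.Entry<String, Set<String>> e : neighbours.entrySet()) {
                for (String end : e.getValue()) {
                    context.write(new Text(e.getKey()), new Text(end));
                }
            }
        }
    }

On the reduce side you would still collect the values for each key into a set (duplicates can survive across different map tasks) and emit its size.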
A global sort on the key is a bit trickier. Again, there are two basic options, but neither is really great:
1) you use a single reducer, but then you don't gain anything from parallelism,
2) you use a total order partitioner, which needs some extra coding.
Other than that, you may want to move to Spark for a more intuitive and efficient solution.

How does map reduce parallel processing really work in hadoop with respect to the word count example?

I am learning Hadoop MapReduce using the word count example; please see the diagram attached:
My questions are about how the parallel processing actually happens. My understanding/questions are below; please correct me if I am wrong:
Split step: This determines the number of mappers; here the two data sets go to two different processors [p1,p2], so two mappers? This splitting is done by the first processor P.
Mapping step: Each of these processors [p1,p2] now divides the data into key-value pairs by applying the required function f() to the keys, which produces the value v, giving [k1,v1],[k2,v2].
Merge step 1: Within each processor, values are grouped by key, giving [k1,[v1,v2,v3]].
Merge step 2: Now p1 and p2 return their output to P, which merges both sets of resulting key-value pairs. This happens in P.
Sorting step: Now P will sort all the results.
Reduce step: Here P will apply f() to each individual key [k1,[v1,v2,v3]] to give [k1,V].
Let me know if this understanding is right; I have a feeling I am completely off in many respects.
Let me explain each step in a little more detail so that it is clearer to you. I have tried to keep things as brief as possible, but I would recommend you go through the official docs (https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html) to get a good feel for the whole process.
Split Step: if you have written a program by now, you must have observed that we sometimes set the number of reducers but we never set the number of mappers, because the number of mappers depends on the number of input splits. In simple words, the number of mappers in any job is proportional to the number of input splits. So the question arises: how does the splitting take place? That actually depends on a number of factors, such as mapred.max.split.size, which sets the size of an input split; there are many other ways too, but the fact is that we can control the size of the input splits.
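For example, with the newer mapreduce API the split size can also be capped programmatically; a sketch (the 64 MB and 32 MB figures are just illustrative):

    // Capping the input split size caps the amount of data each map task sees,
    // and therefore raises the number of map tasks for a given input.
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizing {
        public static void configureSplits(Job job) {
            // With a 64 MB cap, a 1 GB input yields roughly 16 splits,
            // hence roughly 16 map tasks.
            FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
            FileInputFormat.setMinInputSplitSize(job, 32L * 1024 * 1024);
        }
    }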
Mapping Step: if by 2 processors you mean 2 JVMs (2 containers), 2 different nodes, or 2 mappers, then your intuition is wrong. The container, or for that matter the node, has nothing to do with splitting any input file; it is the job of HDFS to divide and distribute the files across different nodes, and it is then the responsibility of the resource manager to initiate the mapper task on the same node that holds the input split, if possible. Once the map task is initiated, you create key-value pairs according to the logic in your mapper. One thing to remember here is that one mapper can only work on one input split.
You have mixed things up a little in step 3, step 4 and step 5. I have tried to explain those steps by referring to the actual classes which handle them.
Partitioner class: This class divides the output of the mapper tasks according to the number of reducers. It is useful if you have more than 1 reducer; otherwise it does not affect your output. It contains a method called getPartition, which decides to which reducer your mapper output will go (if you have more than one reducer); this method is called for each key present in the mapper output. You can override this class, and subsequently this method, to customize it according to your requirements. In the case of your example, since there is one reducer, the output from both mappers is merged into a single file. If there had been more reducers, the same number of intermediate files would have been created.
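For illustration, an override of the Partitioner looks roughly like this; the sketch just reproduces the default hash behaviour, and the class name is made up:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class ColourPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            // Spread keys over the reduce tasks by hashing the key
            // (this is what the default HashPartitioner does).
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }

It is wired into the job with job.setPartitionerClass(ColourPartitioner.class).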
WritableComparator class: Sorting of your map output is done by this class, on the basis of the key. Like the Partitioner class, you can override it. In your example, if the key is a colour name, then the keys will be sorted like this (assuming you do not override the class, in which case the default sort order for Text, which is alphabetical, is used):
Black,1
Black,1
Black,1
Blue,1
Blue,1
.
.
and so on
Now this same class is also used for grouping your values according to your key, so that in the reducer you can iterate over them. In the case of your example ->
Black -> {1,1,1}
Blue -> {1,1,1,1,1,1}
Green -> {1,1,1,1,1}
.
.
and so on
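Both the sort order and the grouping can be customized by supplying your own comparator; a rough sketch with a hypothetical class name (a grouping comparator is registered the same way via job.setGroupingComparatorClass):

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    // Sorts Text keys in reverse alphabetical order instead of the default.
    public class ReverseTextComparator extends WritableComparator {
        public ReverseTextComparator() {
            super(Text.class, true);   // true -> instantiate keys for compare()
        }

        @Override
        @SuppressWarnings("rawtypes")
        public int compare(WritableComparable a, WritableComparable b) {
            return -((Text) a).compareTo((Text) b);   // flip the default order
        }
    }

It is registered with job.setSortComparatorClass(ReverseTextComparator.class).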
Reducer -> This step will simply reduce your map output according to the logic defined in your reducer class. Your intuition is appropriate for this class.
Now there are some other details which affect the intermediate step between mapper and reducer, and the step before the mapper as well, but those are not that relevant to what you want to know.
I hope this resolves your query.
Your diagram is not exactly showing basic word counting in MapReduce. Specifically, the stuff after 'Merging-step 1' is misleading in terms of understanding how MapReduce parallelizes the reduce phase. A better diagram, imo, can be found at https://dzone.com/articles/word-count-hello-word-program-in-mapreduce
In the latter diagram it is easy to see that, as soon as the mappers' output is sorted by output key and then shuffled across the reducer nodes based on this key, the reducers can easily run in parallel.

Filtering output in hadoop

I'm new with Hadoop and playing around with the WordCount example.
I ran into an issue that is confusing me. If I take the word count from a text file and I want to, for example, filter it in such a way that only words longer than 5 letters are in the output, do I have to run 2 jobs to do this?
The first job to do the word count and second job to filter the words shorter than 5 letters?
Or can I just write logic into the reducer that does not write the word into the result file if there are fewer than 5 occurrences? Would this result in invalid output if there are multiple instances of the reducer running?
The simple answer is that you don't need two jobs.
You can achieve this with a single job. The logic you describe at the end of your question is absolutely correct.
In the MapReduce framework, all the data (values) related to a single key are always passed to the same Reducer. So even if multiple reducers are running for your job, this will not affect the output.
PS:
only words longer than 5 letters are in the output
This is from the second paragraph of your question. I am assuming that you mean 5 occurrences of a word, not the length of the word.
But if you want only words longer than 5 letters, you can filter them in the Mapper itself. Then there will be less data for the sort-shuffle phase (data sorting and transfer over the network) and less data for the Reducer to process.
One MapReduce job should be enough.
Best practice says that you should filter and project data in the mapper if possible.
In your case, if the filter condition only depends on the input data (the characters in the input word), then you can filter the input on the mapper side and only send words with more than 5 letters to the reducer, improving the performance of your job. It doesn't make sense to send data to the reducer only to drop it there, although that would work too.
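A sketch that combines both ideas in a single job (the class names are illustrative): the mapper drops words of 5 letters or fewer, and the reducer drops words with fewer than 5 occurrences.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class FilteredWordCount {

        public static class FilterMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer it = new StringTokenizer(line.toString());
                while (it.hasMoreTokens()) {
                    String w = it.nextToken();
                    if (w.length() > 5) {           // filter on word length in the mapper
                        word.set(w);
                        context.write(word, ONE);
                    }
                }
            }
        }

        public static class FilterReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable c : counts) {
                    sum += c.get();
                }
                if (sum >= 5) {                     // filter on occurrence count in the reducer
                    context.write(word, new IntWritable(sum));
                }
            }
        }
    }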

Why does MapReduce bother mapping every value to 1 in the map step?

I'm trying to figure out MapReduce and so far I think I'm gaining an okay understanding.
However, one thing confuses me. In every example and explanation of MapReduce I can find, the map step maps all values to 1. For instance, in the most common example (counting occurrences of words in a string), the Map section splits up each word and then maps it to the value 1.
The Reduce section then combines/reduces like words, adding up the amount of times they occur so that they map to N instead of 1 (N being how many times the word appears).
What I don't understand is: why even bother mapping them to 1 in the first place? It seems like they will ALWAYS map to 1. Why not just split them apart, and then in the Reduce step, do the mapping there, and sum everything up at the same time?
I'm sure there must be a good reason that I just can't think of. Thanks!
(this question is about MapReduce as a concept in general, not necessarily about Hadoop or any other specific technology or implementation)
The output of the mapper is decided by the use case you want to implement. In word count, we want the mapper to separate the individual words and output the number of occurrences for each word. The mapper is called for every key-value pair in its input split; here, that is each line, where the key is the offset and the value is the entire sentence. Grouping is performed before the reducer is invoked, so all identical words are grouped together and each occurrence (1 here) is counted. It is not a hard rule to emit 1 as the mapper output. If you look at the data set example in Hadoop: The Definitive Guide, the year and temperature are emitted as the mapper output; the use case there is to group by year and find the max/min temperature. For a basic understanding, you can think of the mapper output key as the grouping parameter. Happy learning.
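As a rough sketch of a mapper whose output value is not 1, in the spirit of that example (the comma-separated record layout here is made up for illustration):

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Emits (year, temperature) instead of (word, 1); the reducer then
    // picks the maximum temperature per year.
    public class MaxTemperature {

        public static class YearTempMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                String[] fields = line.toString().split(",");   // e.g. "1950,22"
                if (fields.length == 2) {
                    context.write(new Text(fields[0]),
                                  new IntWritable(Integer.parseInt(fields[1].trim())));
                }
            }
        }

        public static class MaxReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text year, Iterable<IntWritable> temps, Context context)
                    throws IOException, InterruptedException {
                int max = Integer.MIN_VALUE;
                for (IntWritable t : temps) {
                    max = Math.max(max, t.get());
                }
                context.write(year, new IntWritable(max));
            }
        }
    }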

Is (key,value) pair in Hadoop always ('text',1)?

I am new to Hadoop.
Can you please tell me about the (key/value) pair? Is the value always one? Is the output of the reduce step always a (key/value) pair? If yes, how is that (key/value) data used further?
Please help me.
I guess you are asking about the 'one' value in the (key,value) pair because of the wordcount example in the Hadoop tutorials. So, the answer is no, it is not always 'one'.
Hadoop's implementation of MapReduce works by passing (key,value) pairs through the entire workflow, from the input to the output:
Map step: Generally speaking (there are other particular cases, depending on the input format), the mappers process, line by line, the data within the splits they are assigned; each line is passed to the map method as a (key,value) pair containing the offset of the line within the split (the key) and the line itself (the value). They then produce another (key,value) pair as output, whose meaning depends on the mapping function you are implementing: sometimes it will be a variable key and a fixed value (e.g. in wordcount, the key is the word and the value is always 'one'); other times the value will be the length of the line, or the sum of all the words starting with a prefix... whatever you may imagine; the key may be a word, a fixed custom key...
Reduce step: Typically the reducer receives lists of (key,value) pairs produced by the mappers whose key is the same (this depends on the combiner class you are using, of course, but generally speaking). It then produces another (key,value) pair as output; again, this depends on the logic of your application. Typically, the reducer is used to aggregate all the values for the same key.
This is a very rough, quick and undetailed explanation; I encourage you to read some official documentation about it, or specialized literature such as this.
I hope you have started learning MapReduce with the Wordcount example.
The Key/Value pair is the record entity that MapReduce accepts for execution. The InputFormat classes that read records from the source and the OutputFormat classes that commit results operate only on records in Key/Value format.
The Key/Value format is the representation of records best suited to passing through the different stages of the map-partition-sort-combine-shuffle-merge-sort-reduce lifecycle of MapReduce. Please refer to:
http://www.thecloudavenue.com/2012/09/why-does-hadoop-uses-kv-keyvalue-pairs.html
The Key/Value data types can be anything. The Text/IntWritable key/value pair you used is the best choice for wordcount, but it can actually be anything according to your requirements.
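For instance, wiring different key/value types into a job is just a matter of declaring them; a sketch (the class name, job name and type choices here are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;

    public class JobTypes {
        public static Job configure() throws Exception {
            Job job = Job.getInstance(new Configuration(), "average-per-key");
            // Mapper and reducer classes omitted; the point is only that the
            // key/value types can be whatever Writable types your logic needs.
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(DoubleWritable.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(DoubleWritable.class);
            return job;
        }
    }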
Kindly spend some time reading the Hadoop Definitive Guide / Yahoo tutorials to get more understanding. Happy learning...
