Mapper reducer calling in Hadoop - hadoop

In Hadoop, one Mapper object is created per input split, and it internally calls the map method once for each line in that split. Similarly, how many times does the Reducer get called? Is the reduce method called once for each unique key?

You have control over how many Reducers are used. In your driver you set the number using something like:
job.setNumReduceTasks(int tasks)
The default number is 1.
With the default HashPartitioner, keys are distributed to reducers based on the hash code of the key, so a single reducer can process multiple keys. Within a reducer, the reduce method is called once for each unique key it receives.
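As a rough illustration of how keys end up on reducers, here is a minimal plain-Java sketch that uses the same formula as the default HashPartitioner. It does not use the Hadoop API: the class name, the sample keys, and the choice of 2 reducers are assumptions for illustration, and real jobs usually hash Writable types such as Text rather than String.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class HashPartitionSketch {
    // Same formula as the default HashPartitioner: clear the sign bit so the
    // result is never negative, then take the remainder.
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 2; // hypothetical value; the default is 1
        Map<Integer, List<String>> keysPerReducer = new TreeMap<>();
        for (String key : List.of("a", "b", "c", "d")) {
            keysPerReducer
                .computeIfAbsent(partitionFor(key, numReduceTasks), p -> new ArrayList<>())
                .add(key);
        }
        // Each reducer receives several keys.
        System.out.println(keysPerReducer); // prints {0=[b, d], 1=[a, c]}
    }
}
```

With a single reducer every key maps to partition 0, which is why the default setting sends everything to one reducer.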

Related

How to map the value in Mapper

I have a file with data like
City,Quarter,Classification,Index
Bordeux,Q1,R,3
Krakow,Q1,U,2
Halifax,Q1,U,4
I need to find the highest Index in each Classification and write the results to two separate files. The output should be
Bordeux,Q1,R,3
Halifax,Q1,U,4
How do I load the data in the Mapper, given that it requires a key/value pair? It seems the programmer should not modify the data in the Mapper, so how do I put it into the Context object?
I think the data type of the key or value does not change in the Reducer. If so, and I put my logic for finding the top records there, how do I organize the result into a Context object?
I don't have a clue how to proceed.
Any pointers would help me move forward.
In your case, when the Mapper reads the file with the default TextInputFormat, the input key is the byte offset of the line and the value is the line itself. In other words, each line of the file is received in the Mapper as the value. The output (key, value) of the Mapper should be (Classification, Index). If you need the whole record in the final output, emit the full line as the value instead and compare on its Index field.
The output of the Mapper becomes the input (key, value) of the Reducer, so the Reducer receives (Classification, Iterable<Index>) as input. For each classification, you can iterate over the Index values to find the maximum, and the output of the Reducer will be (Classification, Max).
In this case, the output key and value types are the same for the Mapper and the Reducer.
However, regarding writing to separate files: separate files are generated only if every key is routed to a different reducer instance. So in your case, the total number of reducers should equal the number of unique classifications (although this is not a good use of resources), and you have to write a custom partitioner to make that happen.
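The map and reduce logic described in this answer can be sketched in plain Java, without the Hadoop API. The class and method names are hypothetical, the grouping step only stands in for the shuffle, and the sketch assumes the header line has already been skipped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MaxIndexSketch {
    // "Map" step: turn one CSV line into a (Classification, Index) pair.
    static Map.Entry<String, Integer> map(String line) {
        String[] fields = line.split(",");
        return Map.entry(fields[2], Integer.parseInt(fields[3]));
    }

    // "Reduce" step: the maximum of all Index values for one classification.
    static int reduce(String classification, Iterable<Integer> indexes) {
        int max = Integer.MIN_VALUE;
        for (int index : indexes) {
            max = Math.max(max, index);
        }
        return max;
    }

    static Map<String, Integer> run(List<String> lines) {
        // Group values by key, standing in for the shuffle and sort phase.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            Map.Entry<String, Integer> kv = map(line);
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        Map<String, Integer> result = new TreeMap<>();
        grouped.forEach((k, v) -> result.put(k, reduce(k, v)));
        return result;
    }

    public static void main(String[] args) {
        // Sample rows from the question, header line omitted.
        List<String> lines = List.of("Bordeux,Q1,R,3", "Krakow,Q1,U,2", "Halifax,Q1,U,4");
        System.out.println(run(lines)); // prints {R=3, U=4}
    }
}
```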

MapReduce filter before reduce

I have a Hadoop MapReduce job that splits documents of different kinds (Places, People, Organisations, Algorithms, etc.). Each document has a tag that identifies its type, plus links to other documents. However, I don't know the type of a linked document until the task reaches the page behind the link.
In the map phase I identify the links and the type of the current page, and then emit the link information together with the current document and its tag to a single reducer: the key is NullWritable and the value is "CurrentDoc::Type::Link".
In the reduce phase, all documents are grouped by type using the "CurrentDoc::Type" part of the values, and then a "Document::Link" relation is emitted only for documents that belong to certain types.
However, I have a memory issue because the whole final step is performed in a single reducer.
Is there a way to perform a grouping step after the map phase and before the reduce phase, to identify all the documents with their tags and then distribute them to different reducers?
I mean grouping every document/tag pair as "CurrentDoc::Type" in an ArrayWritable of Text, then emitting the "CurrentDoc::Link" tuple as the key and the ArrayWritable as the value, so that the filtering in the reduce phase can run in parallel.
Thanks for your help!
Unfortunately the system does not work the way you expect.
We can't change how the Mapper, Reducer, and Combiner work.
Hadoop allows the user to specify a combiner function to be run on the map output, and the combiner function's output forms the input to the reduce function. Because the combiner is an optimization, Hadoop gives no guarantee of how many times it will call it for a particular map output record, if at all. In other words, calling the combiner function zero, one, or many times should produce the same output from the reducer.
A Combiner can't combine data from multiple map tasks; leave that job to the Reducer.
For your problem:
1) Use a custom Partitioner to decide which reducer should process a specific key (CurrentDoc::Type).
2) The Combiner will combine the data within a single Mapper.
3) Output from the Mapper will be directed to a specific Reducer depending on the key's partition (shuffling).
4) The Reducer will combine the data for each key received from the respective Mappers.
Working code of Partitioner & Combiner
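As a rough sketch of step 1, the routing decision of a custom partitioner can be written in plain Java as below. This is not the Hadoop Partitioner API: the class name, the sample keys, and the choice of 4 reducers are assumptions for illustration, and keys are assumed to have the form "CurrentDoc::Type".

```java
public class TypePartitionSketch {
    // Extract the Type from a key shaped like "CurrentDoc::Type".
    static String typeOf(String key) {
        return key.split("::")[1];
    }

    // Route by Type only, so all documents of one type reach the same reducer.
    static int partitionFor(String key, int numReducers) {
        return (typeOf(key).hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        int numReducers = 4; // hypothetical value
        String[] keys = {"Paris::Places", "Rome::Places", "Turing::People"};
        for (String key : keys) {
            System.out.println(key + " -> reducer " + partitionFor(key, numReducers));
        }
    }
}
```

Hashing only the type keeps each type on one reducer while spreading different types across reducers; two types can still share a reducer when their hashes collide modulo the reducer count.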

What if we only have one reducer

As we know, Hadoop tends to launch reducers on the machines where the corresponding mappers run. What if we have 100 mappers and 1 reducer? We know that each mapper stores its data on local disk. Will all the mapped data be transferred to the single reducer?
Yes, if the reducer is only one, all the data will be transferred to that reducer.
Each mapper initially stores its output in an in-memory buffer (100 MB by default). When the buffer fills to the percentage defined by io.sort.spill.percent, the contents are spilled to disk under the directory defined by mapred.local.dir.
These files are copied to the reducer during the copy phase, in which the map outputs are fetched by mapred.reduce.parallel.copies parallel threads (default 5).
If you fix the number of reducers to one (with job.setNumReduceTasks(1) or -Dmapred.reduce.tasks=1), then all data from the mappers will be transferred to one reducer, which will process all keys.
If you have only 1 reducer, then all the data gets transferred to that reducer and all the output is stored in HDFS as a single file.
If you do not specify the number of reducers, the default number of reducers that run is one.
You can set the number of reducers using job.setNumReduceTasks(__), and if you are using ToolRunner you can set it from the command line:
-Dmapred.reduce.tasks=4

How does Hadoop decide to distribute among buckets/nodes?

I am new to Map/Reduce and Hadoop framework.
I am running a Hadoop program on single machine (for trying it out).
I have n input files and I want some summary of words from those files.
I know the map function returns key/value pairs, but how is map called?
Once for each file, or once for each line of every file? Can I configure it?
Is it correct to assume that "reduce" is called for each key?
A map task is created for one InputSplit (or split, for short), and it is the duty of the InputFormat used in your MR job to create these splits. A split could be one line, multiple lines, one whole file, and so on, depending on the logic inside your InputFormat. For example, the default InputFormat, TextInputFormat, creates splits that normally correspond to one HDFS block each and hands the split to the map function one line at a time.
Yes, you can configure it by changing the InputFormat you are using.
All the values corresponding to a particular key are grouped together, the keys are divided into partitions, and an entire partition goes to one reducer for further processing. So all the values corresponding to a particular key are processed by a single reducer, but a single reducer can receive multiple keys.
In the Hadoop MR framework, the job tracker creates a map task for each InputSplit, as determined by the InputFormat specified by your job. Each InputSplit assigned to a map task is further processed by a RecordReader to generate input key/value pairs for the map function. The map function is called once for each key/value pair generated by the RecordReader.
For the default InputFormat, TextInputFormat, the input split is normally a single HDFS block, which is processed by a single map task. The RecordReader processes one line at a time within the block and generates a key/value pair where the key is the byte offset of the start of the line in the file and the value is the contents of the line. This pair is passed to the map function.
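To make the (byte offset, line) pairs concrete, here is a minimal plain-Java sketch of what the text describes for line-based records. It is not the Hadoop RecordReader: the class name and sample input are assumptions for illustration, and it assumes UTF-8 text with "\n" line endings held in a single string.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class LineRecordSketch {
    // Produce (byte offset, line) pairs, one per line of the input.
    static Map<Long, String> records(String content) {
        Map<Long, String> out = new LinkedHashMap<>();
        long offset = 0;
        for (String line : content.split("\n")) {
            out.put(offset, line);
            // Advance past the line's bytes plus one byte for the '\n'.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        // The map function would be called once per entry, three times here.
        System.out.println(records("hello world\nfoo bar\nbaz"));
        // prints {0=hello world, 12=foo bar, 20=baz}
    }
}
```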
The number of reducers depends on the job configuration set by the user. All key/value pairs with the same key are grouped and sent to a single reducer, sorted by key, but a single reducer can also process multiple keys.
For more details on InputFormat and customizing it, refer to this YDN documentation:
http://developer.yahoo.com/hadoop/tutorial/module5.html#inputformat

What happens when identical keys are passed to the Mapper in Hadoop

What is the significance of data also being passed to the mapper as key/value pairs in the Hadoop MapReduce framework? I understand that key/value pairs matter when they are passed to the reducers, since they drive the partitioning of data coming from the mappers. Values belonging to the same key go as a list from the mapper to the reducer stage. But how are the keys used before the mapper stage itself? What happens to values belonging to the same key? If we don't define a custom input format, I presume Hadoop takes the record number from the input file as the key and the text line as the value in the mapper function. But if we decide to implement a custom input format, the keys are chosen by us, and there could be several values corresponding to the same key.
How does this get handled in the mapper stage? Does the mapper treat records with duplicate keys as separate records, or does it choose only one record per key?
An input split is a chunk of the input that is processed by a single map. Each map processes a single split. Each split is divided into records, and the map processes each record, a key-value pair, in turn.
So the mapper treats records with the same key as separate records.
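A minimal plain-Java sketch of that behavior, outside the Hadoop API: the map logic runs once per record, with no grouping or deduplication by key beforehand. The class name, the sample records, and the upper-casing "map" logic are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class DuplicateKeySketch {
    // Call the "map" logic once per input record; keys are never compared.
    static List<String> mapAll(List<Map.Entry<String, String>> records) {
        List<String> emitted = new ArrayList<>();
        for (Map.Entry<String, String> record : records) {
            emitted.add(record.getKey() + "=" + record.getValue().toUpperCase());
        }
        return emitted;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> records = List.of(
            Map.entry("k1", "a"),
            Map.entry("k1", "b"), // same key as the previous record
            Map.entry("k2", "c"));
        System.out.println(mapAll(records)); // prints [k1=A, k1=B, k2=C]
    }
}
```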
