MapReduce filter before reduce - hadoop

I have a Hadoop MapReduce job that processes documents of different kinds (Places, People, Organisations, Algorithms, etc.). Each document has a tag that identifies its type, plus links to other documents; however, I don't know what kind of document a link points to until that page is reached in the task.
In the map phase I identify the links and the kind of the current page, then emit the link information together with the current document and its tag to a single reducer: key NullWritable, value "CurrentDoc::Type::Link".
In the reduce phase all the documents are grouped by type using the "CurrentDoc::Type" part of the values, and a "Document::Link" relation is emitted only for documents that belong to certain types.
However, I have a memory issue because this whole final step is performed in a single reducer.
Is there a way to perform a grouping step after the map phase and before the reduce phase, to identify all the documents with their tags and then distribute them to different reducers?
I mean grouping all document/tag pairs as "CurrentDoc::Type" in an ArrayWritable of Text, then emitting to the reducers the "CurrentDoc::Link" tuple as key and the ArrayWritable as value, so the filtering can be done in the reduce phase in parallel.
Thanks for your help!

Unfortunately the system does not work the way you expect.
You can't change the Mapper, Reducer, and Combiner contract.
Hadoop allows the user to specify a combiner function to be run on the map output, and the combiner function's output forms the input to the reduce function. In other words, calling the combiner function zero, one, or many times should produce the same output from the reducer.
A combiner can't combine data from multiple map tasks; leave that job to the Reducer.
For your problem,
1) Use a custom Partitioner to decide which reducer should process a specific key (CurrentDoc::Type)
2) The Combiner will combine the data within a single Mapper
3) The output from each Mapper will be redirected to a specific Reducer depending on the key's partition (shuffling)
4) Each Reducer will combine the data for the keys it receives from the respective Mappers
Working code of Partitioner & Combiner
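The partition logic from step 1 could look like the sketch below. In a real job this would live in `getPartition` of a class extending `org.apache.hadoop.mapreduce.Partitioner`; it is shown here as plain Java so it stands alone. The "CurrentDoc::Type" key format comes from the question, and the class and document names are illustrative.

```java
// Sketch of a custom-partitioner routing rule: all keys with the same
// Type field go to the same reducer, regardless of the document part.
// (In Hadoop this logic would sit inside Partitioner.getPartition.)
public class TypePartitioner {
    // compositeKey is assumed to look like "CurrentDoc::Type"
    public static int getPartition(String compositeKey, int numPartitions) {
        String type = compositeKey.split("::")[1];           // extract the Type field
        // mask keeps the result non-negative even for negative hash codes
        return (type.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int p1 = getPartition("Doc1::Place", 4);
        int p2 = getPartition("Doc2::Place", 4);
        System.out.println(p1 == p2); // true: same Type -> same reducer
    }
}
```

Because the partition depends only on the Type field, every document of a given type lands on the same reducer while different types spread across reducers, which is what removes the single-reducer bottleneck.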

Related

How to map the value in Mapper

I have a file with data like
City,Quarter,Classification,Index
Bordeux,Q1,R,3
Krakow,Q1,U,2
Halifax,Q1,U,4
I need to find out the highest Index in each Classification and write them to two separate files. The output should be
Bordeux,Q1,R,3
Halifax,Q1,U,4
How do I load the data in the Mapper, given that it requires a key/value pair? It seems the programmer should not modify the data in the mapper, so how do I load it into the Context object?
I think the data types of the key and value are not changed in the Reducer. If so, I will add my logic there to find the top records, but how do I organize them into the Context object?
I don't have a clue how to proceed.
Any pointers would help me move forward.
In your case, when you read the file in the Mapper, the input key is the byte offset of the line and the value is the line itself. In other words, each line of the file is received in the Mapper as the value field. Now, the output (key, value) of the Mapper should be (Classification, Index).
The output of the Mapper becomes the input (key, value) of the Reducer, so the Reducer receives (Classification, Iterable) as input. For each classification you can iterate over the Index list to get the max, and the output of the Reducer will be (Classification, Max).
In this case, the output key and value types will be the same for the Mapper and the Reducer.
However, regarding writing the results to separate files: separate files are generated only if every key is routed to a different reducer instance. So in your case, the total number of reducers should equal the total number of unique classifications (not great in terms of resource utilization, though). To make that happen, you have to write a custom partitioner.
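The map/reduce logic described above can be sketched in plain Java (so it runs standalone without Hadoop classes). One assumption here, since the desired output is the full row rather than just the max Index: the mapper emits the whole line as the value keyed by Classification, and the reducer keeps the line whose Index field is largest. Class and method names are illustrative.

```java
import java.util.*;

// Simulates: map emits (Classification, line); reduce scans each
// classification's lines and keeps the one with the highest Index.
public class MaxPerClassification {
    // lines look like "City,Quarter,Classification,Index"
    public static Map<String, String> topRecords(List<String> lines) {
        Map<String, String> best = new HashMap<>();
        for (String line : lines) {
            String[] f = line.split(",");
            String cls = f[2];                       // Classification field
            int idx = Integer.parseInt(f[3]);        // Index field
            String cur = best.get(cls);
            if (cur == null || idx > Integer.parseInt(cur.split(",")[3])) {
                best.put(cls, line);                 // keep the full record
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, String> top = topRecords(Arrays.asList(
            "Bordeux,Q1,R,3", "Krakow,Q1,U,2", "Halifax,Q1,U,4"));
        System.out.println(top.get("R")); // Bordeux,Q1,R,3
        System.out.println(top.get("U")); // Halifax,Q1,U,4
    }
}
```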

Mapper reducer calling in Hadoop

In Hadoop, only one Mapper object is created per input split, and it internally calls the map method for each line in the split. Similarly, how many times does the Reducer get called? Is the reduce method called once for each unique key?
You have control over how many Reducers are used. In your driver you set the number using something like:
job.setNumReduceTasks(int tasks)
The default number is 1.
With the default HashPartitioner, keys are distributed to reducers based on the key's hash code, so a single reducer can process multiple keys.
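The default HashPartitioner effectively computes `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`; the sketch below reproduces that arithmetic in plain Java (class and method names here are illustrative) to show how several distinct keys can share one reducer.

```java
public class HashPartitionDemo {
    // Mirrors the arithmetic of Hadoop's default HashPartitioner:
    // the bitmask keeps the result non-negative for negative hash codes.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // With 3 reducers and many distinct keys, collisions are expected:
        // multiple keys map to the same reducer index.
        for (String k : new String[] {"apple", "banana", "cherry", "date"}) {
            System.out.println(k + " -> reducer " + partitionFor(k, 3));
        }
    }
}
```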

How does Hadoop decide to distribute among buckets/nodes?

I am new to Map/Reduce and Hadoop framework.
I am running a Hadoop program on a single machine (to try it out).
I have n input files and I want some summary of words from those files.
I know the map function returns a key/value pair, but how is map called?
Once per file, or once per line of every file? Can I configure it?
Is it correct to assume that reduce is called once for each key?
A map task is created for each InputSplit (or split, for short), and it is the duty of the InputFormat you use in your MR job to create these splits. A record could be one line, multiple lines, a whole file, and so on, based on the logic inside your InputFormat, and the map function is called once per record. For example, the default InputFormat, TextInputFormat, produces records that each consist of a single line.
Yes, you can configure it by changing the InputFormat you use.
All the values corresponding to a particular key are grouped together, the keys are partitioned, and an entire partition goes to one reducer for further processing. So all the values for a particular key are processed by a single reducer, but a single reducer can receive multiple keys.
In the Hadoop MR framework, the JobTracker creates a map task for each InputSplit, as determined by the InputFormat specified by your job. Each InputSplit assigned to a map task is further processed by a RecordReader to generate the input key/value pairs for the map function. The map function is called once for each key/value pair generated by the RecordReader.
For the default InputFormat, i.e. TextInputFormat, an input split is a single HDFS block processed by a single map task. The RecordReader processes one line at a time within the block and generates a key/value pair, where the key is the byte offset of the start of the line in the file and the value is the contents of the line; these are passed to the map function.
The number of reducers depends on the user's job configuration. All key/value pairs with the same key are grouped and sent, sorted by key, to a single reducer, but at the same time a single reducer can process multiple keys.
For more details on InputFormat and customizing it, refer to this YDN documentation:
http://developer.yahoo.com/hadoop/tutorial/module5.html#inputformat
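The byte-offset record generation described above can be sketched in plain Java. This simulates what the RecordReader behind TextInputFormat does with a split, assuming ASCII text (one byte per character) and `\n` line endings; the class and method names are illustrative.

```java
import java.util.LinkedHashMap;

// Simulates TextInputFormat's record reader: each record's key is the
// byte offset of the line's start within the split, and the value is
// the line itself.
public class OffsetRecords {
    public static LinkedHashMap<Long, String> records(String split) {
        LinkedHashMap<Long, String> recs = new LinkedHashMap<>();
        long offset = 0;
        for (String line : split.split("\n")) {
            recs.put(offset, line);          // key = byte offset, value = line
            offset += line.length() + 1;     // +1 for the newline separator
        }
        return recs;
    }

    public static void main(String[] args) {
        System.out.println(records("foo\nbar\nbaz")); // {0=foo, 4=bar, 8=baz}
    }
}
```

The map function would then be invoked once per entry of this map, which is why it runs once per line under the default InputFormat.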

What happens when identical keys are passed to the Mapper in Hadoop

What is the significance of data being passed as key/value pairs to the mapper in the Hadoop MapReduce framework? I understand that key/value pairs are significant when they are passed to the reducers, since they drive the partitioning of the data coming from the mappers: values belonging to the same key go as a list from the mapper to the reducer stage. But how are keys used before the mapper stage itself? What happens to values belonging to the same key? If we don't define a custom input format, I presume Hadoop takes the record number from the input file as the key and the text line as the value in the mapper function. But if we implement a custom input format with a custom choice of keys, there could be several values corresponding to the same key.
How does this get handled in the mapper stage? Does the mapper treat duplicate keys as separate records, or does it choose only one record per key?
An input split is a chunk of the input that is processed by a single map. Each map processes a single split. Each split is divided into records, and the map processes each record (a key/value pair) in turn.
So the mapper treats records with the same key as separate records.

Map Reduce Keep input ordering

I tried to implement an application using Hadoop which processes text files. The problem is that I cannot keep the ordering of the input text. Is there any way to choose the hash function? This problem could easily be solved by assigning a partition of the input to each mapper and then sending the partition to the reducers. Is this possible with Hadoop?
The base idea of MapReduce is that the order in which things are done is irrelevant.
So you cannot (and do not need to) control the order in which:
the input records go through the mappers.
the key and related values go through the reducers.
The only thing you can control is the order in which the values are placed in the iterator that is made available in the reducer.
This is done using a construct called "secondary sort".
A simple Google search for this term turns up several places where you can continue reading.
I like this blog post: link
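Secondary sort is typically implemented with a composite key, a sort comparator over (natural key, ordering field), and a grouping comparator over the natural key alone, so values reach the reducer's iterator in a chosen order. The sketch below simulates that effect in plain Java; the class and field names are illustrative, and the ordering field stands in for something like the original line number.

```java
import java.util.*;

// Simulates secondary sort: sort on the composite (key, order), then
// group on the natural key alone, so each key's values come out in order.
public class SecondarySortDemo {
    static class Rec {
        final String key;   // natural key
        final int order;    // ordering field (e.g. original line number)
        final String value;
        Rec(String key, int order, String value) {
            this.key = key; this.order = order; this.value = value;
        }
    }

    static Map<String, List<String>> group(List<Rec> recs) {
        // What the sort comparator does: order by (key, order).
        recs.sort(Comparator.comparing((Rec r) -> r.key)
                            .thenComparingInt(r -> r.order));
        // What the grouping comparator does: group by natural key only.
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (Rec r : recs) {
            out.computeIfAbsent(r.key, k -> new ArrayList<>()).add(r.value);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Rec> in = new ArrayList<>(Arrays.asList(
            new Rec("a", 2, "second"), new Rec("b", 1, "only"),
            new Rec("a", 1, "first")));
        System.out.println(group(in)); // {a=[first, second], b=[only]}
    }
}
```

Note that this only restores ordering of values within each key; the order in which different keys are processed across reducers remains outside your control, as the answer says.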
