What is the significance of data being passed as key/value pairs to the mapper in the Hadoop MapReduce framework? I understand that key/value pairs are significant when passed to the reducers, as they drive the partitioning of the data coming from the mappers: values belonging to the same key go as a list from the mapper to the reducer stage. But how are the keys used before the mapper stage itself? What happens to values belonging to the same key? If we don't define a custom input format, I presume Hadoop takes the record's position in the input file as the key and the text line as the value in the mapper function. But if we implement a custom input format, we choose the keys ourselves, and there could be multiple values corresponding to the same key.
How does this get handled in the mapper stage? Does the mapper treat records with duplicate keys as separate records, or does it choose only one record per key?
An input split is a chunk of the input that is processed by a single map. Each map processes a single split. Each split is divided into records, and the map processes each record—a key-value pair—in turn.
So the mapper treats records with the same key as separate records: the map function is simply called once for each record produced by the input format, with no deduplication by key.
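As a rough illustration, here is a minimal Python simulation of how a line-oriented record reader turns a split into (byte offset, line) records and calls the map function once per record. This is not Hadoop API code; all names are invented for the sketch.

```python
def records(split_text):
    # Simulated record generation: keys are byte offsets, as
    # TextInputFormat's line reader produces them, so even identical
    # lines get distinct keys.
    offset = 0
    for line in split_text.splitlines(keepends=True):
        yield offset, line.rstrip("\n")
        offset += len(line)

def run_map(split_text, map_fn):
    # The map function is invoked once per record; duplicates are
    # simply separate calls, never merged.
    out = []
    for key, value in records(split_text):
        out.extend(map_fn(key, value))
    return out

split = "apple\napple\nbanana\n"
emitted = run_map(split, lambda k, v: [(v, 1)])
# emitted == [("apple", 1), ("apple", 1), ("banana", 1)]
```

Note how the two identical "apple" lines still arrive as two independent records, because their keys (byte offsets 0 and 6) differ.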
Related
I have a file with data like
City,Quarter,Classification,Index
Bordeux,Q1,R,3
Krakow,Q1,U,2
Halifax,Q1,U,4
I need to find out the highest Index in each Classification and write them to two separate files. The output should be
Bordeux,Q1,R,3
Halifax,Q1,U,4
How do I load the data in the Mapper, since it requires a key/value pair? It seems the programmer should not modify the data in the mapper, so how do I load it into the Context object?
I think the data types of the key and value cannot be changed in the Reducer. If so, and I put my logic for finding the top records there, how do I write the results into the Context object?
I don't have a clue how to proceed. Any pointers will help me move forward.
In your case, when you read the file in the Mapper, the input key is the byte offset of the line and the value is the line itself. In other words, each line of the file is received in the Mapper as the value field. The Mapper's output (key, value) should then be (Classification, Index).
The output of the Mapper becomes the input (key, value) of the reducer, so the reducer receives (Classification, Iterable) as input. For each classification you can iterate over the Index list to get the maximum, and the reducer's output will be (Classification, Max).
In this case, the output key and value types are the same for the Mapper and the Reducer.
However, regarding writing the results to separate files: separate files are generated only if every key is routed to a different reducer instance. So in your case the total number of reducers should equal the total number of unique classifications (not great in terms of resource utilization, though), and you have to write a custom partitioner to make that happen.
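To make the data flow concrete, here is a minimal Python simulation of the map → shuffle → reduce pipeline described above. This is not Hadoop code; it only mimics the logic, using the CSV layout from the question.

```python
from collections import defaultdict

def mapper(offset, line):
    # value is the raw CSV line; emit (Classification, Index)
    city, quarter, classification, index = line.split(",")
    yield classification, int(index)

def shuffle(pairs):
    # The framework groups all values belonging to the same key
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reducer(classification, indexes):
    # For each classification, keep only the highest Index
    yield classification, max(indexes)

lines = ["Bordeux,Q1,R,3", "Krakow,Q1,U,2", "Halifax,Q1,U,4"]
mapped = [kv for i, line in enumerate(lines) for kv in mapper(i, line)]
result = dict(kv for k, vs in shuffle(mapped).items() for kv in reducer(k, vs))
# result == {"R": 3, "U": 4}
```

In real Hadoop the shuffle step is done for you; you would only write the mapper and reducer (plus the custom partitioner if you need one output file per classification).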
In Hadoop, only one Mapper object is created per input split, and it internally calls the map method for each line in the split. Similarly, how many times does the Reducer get called? Is the reduce method called once for each unique key? Is that right?
You have control over how many Reducers are used. In your driver you set the number using something like:
job.setNumReduceTasks(int tasks)
The default number is 1.
Using the default HashPartitioner, keys are distributed to reducers based on the hashcode of the key, so a single reducer can process multiple keys.
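For reference, the assignment HashPartitioner performs is essentially `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`. Here is a small Python sketch of that logic, re-implementing Java's `String.hashCode` so the result is deterministic; it is an illustration, not Hadoop code.

```python
def java_hash(s):
    # Java's String.hashCode(): h = 31*h + ch, with 32-bit signed overflow
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h

def partition(key, num_reduce_tasks):
    # Mirrors HashPartitioner:
    # (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
    return (java_hash(key) & 0x7FFFFFFF) % num_reduce_tasks

# With 2 reducers and 3 keys, one reducer necessarily handles 2 keys:
assignments = {k: partition(k, 2) for k in ["a", "b", "c"]}
# "a" (97) -> reducer 1, "b" (98) -> reducer 0, "c" (99) -> reducer 1
```

So with the default of 1 reducer every key maps to partition 0, which is why a single output file is produced unless you raise `setNumReduceTasks`.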
I have a Hadoop MapReduce job that processes documents of different kinds (Places, People, Organisations, Algorithms, etc.). For each document I have a tag identifying the type of document and links to other documents; however, I don't know the kind of a linked document until the linked page is reached in the task.
In the Map phase I identify the links and the kind of the current page, and then emit as values the information of the links and the current document with its tag to a single reducer: key NullWritable, value "CurrentDoc::Type::Link".
In the reduce phase all the documents are grouped by type using the "CurrentDoc::Type" part of the values, and a "Document::Link" relation is emitted only for the ones that belong to certain types.
However, I have a memory issue, because the entire final step is performed in only one reducer.
Is there a way to perform a grouping step after the map phase and before the reduce phase, to identify all the documents with their tags and then distribute them to different reducers?
I mean, group all document/tag pairs as "CurrentDoc::Type" in an ArrayWritable of Text, then emit to the reducers the "CurrentDoc::Link" tuple as the key and the ArrayWritable as the value, so that the filtering can be done in the reduce phase in parallel.
Thanks for your help!
Unfortunately the system does not work in the way you expect.
We can't change the Mapper, Reducer & Combiner contract.
Hadoop allows the user to specify a combiner function to be run on the map output, and the combiner function's output forms the input to the reduce function. In other words, calling the combiner function zero, one, or many times must produce the same output from the reducer.
A Combiner can't combine data from multiple maps; leave that job to the Reducer.
For your problem:
1) Use a custom Partitioner to decide which reducer should process a specific key (CurrentDoc::Type)
2) The Combiner will combine the data within a single Mapper
3) Output from a Mapper will be redirected to a specific Reducer depending on the key's partition (shuffling)
4) The Reducer will combine the data for each key received from the respective Mappers
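The steps above can be sketched in Python, with the combiner as local pre-aggregation and a hypothetical partitioner keyed on the Type part of a "Doc::Type" key. This is illustrative logic only, not Hadoop code, and the byte-sum hash is just a deterministic stand-in for a real hash function.

```python
from collections import defaultdict

def combine(map_output):
    # Combiner: pre-aggregate within a single mapper's output only.
    # Here we collapse duplicate (key, count) pairs locally, which is
    # safe because summing is associative and commutative.
    local = defaultdict(int)
    for key, count in map_output:
        local[key] += count
    return sorted(local.items())

def partition(key, num_reducers):
    # Custom partitioner: route on the Type part of a "Doc::Type" key,
    # so every record of one type lands on the same reducer.
    doc_type = key.split("::")[1]
    return sum(doc_type.encode()) % num_reducers

local = combine([("d1::Place", 1), ("d1::Place", 1), ("d2::Person", 1)])
# local == [("d1::Place", 2), ("d2::Person", 1)]
```

Because the partitioner looks only at the Type component, two different documents of the same type are guaranteed to reach the same reducer, which spreads the grouping work across reducers instead of funneling everything through one.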
Working code of Partitioner & Combiner
Use-case:
I have 2 datasets/filesets: Machine (parent) and Alerts (child).
Their data is stored in two Avro files, machine.avro and alert.avro.
The Alert schema has a machineId column of type int.
How can I filter data from Machine when there is a dependency on Alert too (one-to-many)?
E.g. get all machines where the alert time is between two timestamps.
Any example with source would be a great help...
Thanks in advance...
Got the answer in another thread:
Mapping through two data sets with Hadoop
Posting comments from that thread...
According to the documentation, the MapReduce framework includes the following steps:
Map
Sort/Partition
Combine (optional)
Reduce
You've described one way to perform your join: loading all of Set A into memory in each Mapper. You're correct that this is inefficient.
Instead, observe that a large join can be partitioned into arbitrarily many smaller joins if both sets are sorted and partitioned by key. MapReduce sorts the output of each Mapper by key in step (2) above. Sorted Map output is then partitioned by key, so that one partition is created per Reducer. For each unique key, the Reducer will receive all values from both Set A and Set B.
To finish your join, the Reducer needs only to output the key and either the updated value from Set B, if it exists; otherwise, output the key and the original value from Set A. To distinguish between values from Set A and Set B, try setting a flag on the output value from the Mapper.
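A minimal Python simulation of that tagged reduce-side join follows; the names and tag strings are invented for the sketch, and in real Hadoop the grouping step is performed by the framework's shuffle.

```python
from collections import defaultdict

# Tag each value with its origin so the reducer can tell the sets apart.
def map_set_a(key, value):
    yield key, ("A", value)

def map_set_b(key, value):
    yield key, ("B", value)

def reduce_join(key, tagged_values):
    a_val = b_val = None
    for tag, value in tagged_values:
        if tag == "A":
            a_val = value
        else:
            b_val = value
    # Prefer the updated value from Set B; fall back to Set A's original.
    yield key, b_val if b_val is not None else a_val

pairs = (list(map_set_a("k1", "old")) + list(map_set_b("k1", "new"))
         + list(map_set_a("k2", "only-a")))
groups = defaultdict(list)
for k, v in pairs:           # stands in for the framework's sort/partition
    groups[k].append(v)
joined = dict(kv for k, vs in groups.items() for kv in reduce_join(k, vs))
# joined == {"k1": "new", "k2": "only-a"}
```

The key point is that neither side of the join is ever loaded into memory in full: each reducer only ever sees the values for one key at a time.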
I am new to Map/Reduce and the Hadoop framework.
I am running a Hadoop program on a single machine (to try it out).
I have n input files and I want a summary of the words from those files.
I know the map function returns key/value pairs, but how is map called?
Once per file, or once per line of every file? Can I configure it?
Is it correct to assume that reduce is called once for each key?
A map task is created for one InputSplit (or split, in short), and it is the duty of the InputFormat you are using in your MR job to create these splits. A split could be one line, multiple lines, one whole file, and so on, based on the logic inside your InputFormat. For example, with the default InputFormat, i.e. TextInputFormat, each split corresponds to an HDFS block, and each record within a split is a single line.
Yes, you can configure this by changing the InputFormat your job uses.
All the values corresponding to a particular key are grouped together, the keys are divided into partitions, and an entire partition goes to one reducer for further processing. So all the values corresponding to a particular key are processed by a single reducer, but a single reducer can receive multiple keys.
In the Hadoop MR framework, the job tracker creates a map task for each InputSplit, as determined by the InputFormat specified by your job. Each InputSplit assigned to a map task is further processed by a RecordReader to generate the input key/value pairs for the map function. The map function is called once for each key/value pair generated by the RecordReader.
For the default InputFormat, i.e. TextInputFormat, the input split is a single HDFS block processed by a single map task. The RecordReader processes one line at a time within the block and generates a key/value pair, where the key is the byte offset of the start of the line in the file and the value is the contents of the line; this pair is passed to the map function.
The number of reducers depends on the job configuration set by the user. All key/value pairs with the same key are grouped together and sent to a single reducer, sorted by key, but at the same time a single reducer can process multiple keys too.
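This grouping behaviour can be sketched in a few lines of Python (not Hadoop code); the byte-sum partitioning below is just a deterministic stand-in for the real hashCode-based assignment.

```python
from collections import defaultdict

def shuffle_and_sort(map_output, num_reducers):
    # Group all values by key, then hand each whole key group to exactly
    # one reducer; within a reducer, keys arrive in sorted order.
    groups = defaultdict(list)
    for k, v in map_output:
        groups[k].append(v)
    reducers = defaultdict(list)
    for k in sorted(groups):
        r = sum(k.encode()) % num_reducers  # stand-in for hash partitioning
        reducers[r].append((k, groups[k]))
    return dict(reducers)

out = shuffle_and_sort([("b", 2), ("a", 1), ("a", 3)], 2)
# out == {1: [("a", [1, 3])], 0: [("b", [2])]}
```

Note that with `num_reducers=1` every key group lands on the same reducer, which is exactly what the default single-reducer configuration does.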
For more details on InputFormat and customizing it, refer this YDN documentation:
http://developer.yahoo.com/hadoop/tutorial/module5.html#inputformat