How does Hadoop decide to distribute among buckets/nodes? - hadoop

I am new to Map/Reduce and Hadoop framework.
I am running a Hadoop program on single machine (for trying it out).
I have n input files and I want some summary of words from those files.
I know the map function returns a key/value pair, but how is map called? Once for each file, or once for each line of every file? Can I configure this?
Is it correct to assume that "reduce" is called once for each key?

A map task is created for one InputSplit (or "split" for short), and it is the duty of the InputFormat you use in your MR job to create these splits. A record could be one line, multiple lines, one whole file and so on, based on the logic inside your InputFormat. For example, the default InputFormat, i.e. TextInputFormat, creates splits whose records each consist of a single line, so map is called once per line.
Yes, you can configure this by changing the InputFormat you use.
All the values corresponding to a particular key are grouped together, keys are divided among partitions, and an entire partition goes to one reducer for further processing. So all the values for a particular key are processed by a single reducer, but a single reducer can receive multiple keys.
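To illustrate that grouping and partitioning, here is a plain-Java sketch (no Hadoop dependencies; `ShuffleSketch` and its method names are made up for this example) of what the shuffle does conceptually:

```java
import java.util.*;

// Conceptual sketch of the shuffle: values are grouped by key, and whole
// keys are assigned to reducer partitions. Not real Hadoop code.
public class ShuffleSketch {
    // Mirrors the default HashPartitioner's assignment rule.
    static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Groups map-output pairs by key, then buckets each key into a partition.
    static Map<Integer, Map<String, List<Integer>>> shuffle(
            List<Map.Entry<String, Integer>> mapOutput, int numReducers) {
        Map<Integer, Map<String, List<Integer>>> partitions = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : mapOutput) {
            int p = partitionFor(kv.getKey(), numReducers);
            partitions.computeIfAbsent(p, x -> new TreeMap<>())
                      .computeIfAbsent(kv.getKey(), x -> new ArrayList<>())
                      .add(kv.getValue());
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput = List.of(
            Map.entry("apple", 1), Map.entry("banana", 1), Map.entry("apple", 1));
        // All "apple" values land in one partition; a partition can still
        // hold several distinct keys.
        System.out.println(shuffle(mapOutput, 2));
    }
}
```

Note that every value for "apple" ends up in exactly one partition, which is why one reducer sees the full value list for each of its keys.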

In the Hadoop MR framework, the JobTracker creates a map task for each InputSplit, as determined by the InputFormat specified by your job. Each InputSplit assigned to a map task is further processed by a RecordReader to generate input key/value pairs for the map function. The map function is called once for each key/value pair that the RecordReader generates.
For the default InputFormat, i.e. TextInputFormat, the input split is a single HDFS block processed by a single map task. The RecordReader processes one line at a time within the block and generates a key/value pair, where the key is the byte offset of the start of the line in the file and the value is the contents of the line; this pair is passed to the map function.
The number of reducers depends on the job configuration set by the user. All key/value pairs with the same key are grouped and sent, sorted by key, to a single reducer, but at the same time a single reducer can process multiple keys.
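To make the (byte offset, line) pairing concrete, here is a plain-Java sketch (not Hadoop code; `LineRecords` is a made-up name) of the records a TextInputFormat-style reader would hand to map. The offsets below are character offsets, which match byte offsets for single-byte encodings:

```java
import java.util.*;

// Sketch of what a line record reader produces from a split:
// one (offsetOfLineStart, lineContents) pair per line.
public class LineRecords {
    static LinkedHashMap<Long, String> records(String split) {
        LinkedHashMap<Long, String> out = new LinkedHashMap<>();
        long offset = 0;
        for (String line : split.split("\n")) {
            out.put(offset, line);
            offset += line.length() + 1; // +1 for the '\n' terminator
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(records("Hello\nWorld\nHadoop"));
        // {0=Hello, 6=World, 12=Hadoop}
    }
}
```

Each entry printed above is one call to the map function: key 0 with value "Hello", key 6 with value "World", and so on.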
For more details on InputFormat and customizing it, refer to this YDN documentation:
http://developer.yahoo.com/hadoop/tutorial/module5.html#inputformat

Related

How Blocks gets converted into Records and what exactly is the definition of Record in Hadoop

I am learning Hadoop, and to begin with started with HDFS and MapReduce. I understood the basics of HDFS and MapReduce.
There is one particular point where I am not able to understand, which I am explaining below:
Large data set --> Stored in HDFS as Blocks, say for example B1, B2, B3.
Now, when we run a MR Job, each mapper works on a single block (assuming 1 mapper processes a block of data for simplicity)
1 Mapper ==> processes 1 block
I also read that a block is divided into records, and for a given block, the same mapper is called for each record within that block.
But what exactly is a Record?
For a given block, since it has to be "broken" down into records, how does that block get broken into records, and what constitutes a record?
In most of the examples I have seen, a record is a full line delimited by a newline.
My doubt is about what decides the "conditions" on the basis of which something can be treated as a record.
I know there are many InputFormats in Hadoop, but my question is: what are the conditions that decide whether something is considered a record?
Can anyone help me understand this in simple words?
You need to understand the concept of RecordReader.
A block is a hard-bound number of bytes of data stored on disk. So a block of 256 MB means exactly a 256 MB piece of data on the disk.
The mapper gets one record from the block, processes it, and gets the next one; the onus of defining a record is on the RecordReader.
Now what is a record? If I offer an analogy where the block is a table, a record is a row in that table.
Now think about this: how can a mapper process the data in a block? After all, you cannot write logic against a random run of bytes. From the mapper's perspective, you can only write logic if the input data "makes some sense", i.e. has a structure or forms a logical chunk of data (from the mapper logic's perspective).
That logical chunk is called a record. By default, one line of data is the logical chunk in the default implementation. But sometimes it does not make sense for one line of data to be the logical unit. Sometimes there is no line at all (say it's MP4-type data and the mapper needs one song as input)!
Let's say you have a requirement in the mapper that needs to work on 5 consecutive lines together. In that case you need to override the RecordReader with an implementation where 5 lines form one record and are passed together to the mapper.
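The grouping such a custom RecordReader would perform can be sketched in plain Java (no Hadoop dependencies; `FiveLineRecords` and `toRecords` are made-up names, and in real Hadoop you would subclass RecordReader to do this while reading the split):

```java
import java.util.*;

// Sketch: pack every 5 consecutive lines into one record, so each map()
// call receives 5 lines together instead of one line.
public class FiveLineRecords {
    static List<String> toRecords(List<String> lines, int linesPerRecord) {
        List<String> records = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += linesPerRecord) {
            int end = Math.min(i + linesPerRecord, lines.size());
            records.add(String.join("\n", lines.subList(i, end)));
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        for (int i = 1; i <= 12; i++) lines.add("line" + i);
        // 12 lines at 5 lines per record -> 3 records (5, 5, and 2 lines)
        System.out.println(toRecords(lines, 5).size());
    }
}
```

The last record is shorter than 5 lines; a real RecordReader has to make the same choice about how to handle the tail of a split.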
EDIT 1
Your understanding is on the right path.
InputFormat: opens the data source and splits the data into chunks
RecordReader: actually parses the chunks into Key/Value pairs.
From the JavaDoc of InputFormat:
InputFormat describes the input-specification for a Map-Reduce job.
The Map-Reduce framework relies on the InputFormat of the job to:
Split-up the input file(s) into logical InputSplits, each of which is then assigned to an individual Mapper.
Provide the RecordReader implementation to be used to extract input records from the logical InputSplit for processing by the Mapper.
From the 1st point, one block is not exactly the input to a mapper; the input is rather an InputSplit. For example, think about a zip file (compressed with GZIP). A zip file is a collection of ZipEntry items (each a compressed file). A zip file is non-splittable from a processing perspective. This means the InputSplit for a zip file consists of several blocks (in fact, all the blocks used to store that particular zip file). This happens at the expense of data locality, i.e. even though the zip file is broken up and stored in HDFS on different nodes, the whole file is moved to the node running the mapper.
The ZipFileInputFormat provides the default record reader implementation ZipFileRecordReader, which has logic to read one ZipEntry (compressed file) as one key-value pair for the mapper.
You've already basically answered this for yourself, so hopefully my explanation can help.
A record is a MapReduce-specific term for a key-value pair. A single MapReduce job can have several different types of records - in the wordcount example then the mapper input record type is <Object, Text>, the mapper output/reducer input record type is <Text, IntWritable>, and the reducer output record type is also <Text, IntWritable>.
The InputFormat is responsible for defining how the block is split into individual records. As you identified, there are many InputFormats, and each is responsible for implementing code that manages how it splits the data into records.
The block itself has no concept of records, as the records aren't created until the data is read by the mapper. You could have two separate MapReduce jobs that read the same block but use different InputFormats. As far as HDFS is concerned, it's just storing a single big blob of data.
There's no "condition" for defining how the data is split - you can make your own InputFormat and split the data however you want.
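That last point can be demonstrated in plain Java (no Hadoop dependencies; `DelimiterChoice` is a made-up name): the same stored bytes yield different records depending purely on the delimiter the input format chooses.

```java
import java.util.*;
import java.util.regex.Pattern;

// Sketch: there is no inherent "record" in the stored data - the record
// boundary is whatever the reading code decides it is.
public class DelimiterChoice {
    static List<String> toRecords(String data, String delimiter) {
        return Arrays.asList(data.split(Pattern.quote(delimiter)));
    }

    public static void main(String[] args) {
        String data = "a,b\nc,d";
        // Newline-delimited view: two records, "a,b" and "c,d"
        System.out.println(toRecords(data, "\n").size());
        // Comma-delimited view of the same bytes: three records
        System.out.println(toRecords(data, ",").size());
    }
}
```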

How to map the value in Mapper

I have a file with data like
City,Quarter,Classification,Index
Bordeux,Q1,R,3
Krakow,Q1,U,2
Halifax,Q1,U,4
I need to find out the highest Index in each Classification and write them to two separate files. The output should be
Bordeux,Q1,R,3
Halifax,Q1,U,4
How do I load the data in the Mapper, given that it requires a key/value pair? It seems the programmer should not modify the data in the mapper. So how do I load it into the Context object?
I think the data types of the key and value are not changed in the Reducer. If so, and I'm going to add my logic there to find the top records, how do I organize them into a Context object?
I don't have a clue on how to proceed.
Any pointers will help me proceed further.
In your case, when the file is read in the Mapper, the input key is the byte offset of the line and the value is the line itself. In other words, each line of the file is received in the Mapper as the value field. Now, the output (key, value) of the Mapper should be (Classification, Index).
The output of the Mapper becomes the input (key, value) of the reducer, so the reducer receives (Classification, Iterable) as input. For each classification you can then iterate over the Index list to get the max, and the output of the reducer will be (Classification, Max).
In this case, the output key and value types will be the same for the Mapper and the Reducer.
However, regarding writing the results to separate files: separate files are generated only if every key is routed to a different reducer instance. So in your case, the total number of reducers should equal the total number of unique classifications (not great in terms of resource utilization, though), and you have to write a custom partitioner to make that happen.
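The map and reduce logic described above can be walked through in plain Java (no Hadoop dependencies; class and method names here are made up, and the "map"/"reduce" methods only simulate what the framework would drive):

```java
import java.util.*;

// Simulation of the job: "map" emits (Classification, full line),
// "reduce" keeps the line with the highest Index for each classification.
public class MaxIndexPerClassification {
    // Map stage: key = classification (3rd CSV field), value = whole line
    static Map<String, List<String>> map(List<String> lines) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String line : lines) {
            String classification = line.split(",")[2];
            grouped.computeIfAbsent(classification, k -> new ArrayList<>()).add(line);
        }
        return grouped;
    }

    // Reduce stage for one key: pick the record with the max Index (4th field)
    static String reduce(List<String> values) {
        return Collections.max(values,
            Comparator.comparingInt((String v) -> Integer.parseInt(v.split(",")[3])));
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "Bordeux,Q1,R,3", "Krakow,Q1,U,2", "Halifax,Q1,U,4");
        for (Map.Entry<String, List<String>> e : map(lines).entrySet())
            System.out.println(e.getKey() + " -> " + reduce(e.getValue()));
        // R -> Bordeux,Q1,R,3
        // U -> Halifax,Q1,U,4
    }
}
```

Note the simulation emits the whole line as the value rather than just the Index, so the reducer can write out the full winning record, matching the desired output in the question.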

Mapper reducer calling in Hadoop

In Hadoop, only one mapper object is created per input split, and it internally calls the map method for each line in that input split. Similarly, how many times does the Reducer get called? One reduce method call for each unique key, right?
You have control over how many Reducers are used. In your driver you set the number using something like:
job.setNumReduceTasks(int tasks)
The default number is 1.
Using the default HashPartitioner, keys are distributed to reducers based on the hash code of the key, so a single reducer can process multiple keys.

MapReduce filter before reduce

I have a Hadoop MapReduce job that processes documents of different kinds (Places, People, Organisations, Algorithms, etc.). For each document I have a tag that identifies the type of the document and links to other documents; however, I don't know what kind of document a link points to until the linked page is reached in the task.
In the map phase I identify the links and the kind of the current page, and then emit to a single reducer the information about the links and the current document with its tag: key NullWritable, value "CurrentDoc::Type::Link".
In the reduce phase all the documents are grouped by type using the "CurrentDoc::Type" part of the values, and then a relation between "Document::Link" is emitted only for those belonging to certain types.
However, I have a memory issue because this whole final step is performed in a single reducer.
Is there a way to perform a grouping step after the map process and before the reduce task, to identify all the documents with their tags and then distribute them to different reducers?
I mean: group all document/tag pairs as "CurrentDoc::Type" in an ArrayWritable of Text, then emit to the reducers the "CurrentDoc::Link" tuple as key and the ArrayWritable as value, to perform the filtering in the reduce phase in parallel.
Thanks for your help!
Unfortunately, the system does not work the way you expect.
We can't change the Mapper, Reducer, and Combiner contracts.
Hadoop allows the user to specify a combiner function to be run on the map output, and the combiner function's output forms the input to the reduce function. In other words, calling the combiner function zero, one, or many times must produce the same output from the reducer.
A combiner can't combine data from multiple maps; leave that job to the Reducer.
For your problem:
1) Use a custom Partitioner to decide which reducer should process a specific key (CurrentDoc::Type).
2) The Combiner will combine the data within a single Mapper.
3) Output from a Mapper will be redirected to a specific Reducer depending on the key's partition (shuffling).
4) Each Reducer will combine the data for the keys it receives from the respective Mappers.
Working code of Partitioner & Combiner
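Step 1 above can be sketched in plain Java (no Hadoop dependencies; the "CurrentDoc::Type" key layout comes from the question, and `TypePartitioner` is a made-up name — in real Hadoop you would subclass Partitioner and set it with job.setPartitionerClass):

```java
// Sketch: partition on the Type portion of a "CurrentDoc::Type" key so
// every document of one type lands on the same reducer, while different
// types are spread across reducers.
public class TypePartitioner {
    static int getPartition(String compositeKey, int numReducers) {
        String type = compositeKey.split("::")[1]; // partition on Type only
        return (type.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        // Two documents of the same type always share a reducer.
        System.out.println(getPartition("doc1::Place", 4));
        System.out.println(getPartition("doc2::Place", 4));
    }
}
```

Because only the Type part feeds the hash, the per-type filtering that currently overwhelms one reducer is spread across as many reducers as there are types.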

What happens when identical keys are passed to the Mapper in Hadoop

What is the significance of data being passed as key/value pairs to the mapper in the Hadoop MapReduce framework? I understand that key/value pairs are significant when they are passed to the reducers, as they drive the partitioning of data coming from the mappers: values belonging to the same key go as a list from the mapper to the reducer stage. But how are the keys used before the mapper stage itself? What happens to values belonging to the same key? If we don't define a custom input format, I presume Hadoop takes the record number from the input file as the key and the text line as the value in the mapper function. But if we implement a custom input format, the keys are chosen by us, and there could be a situation where we have multiple values corresponding to the same key.
How does this get handled at the mapper stage? Does the mapper treat duplicate-key records as separate records, or does it choose only one record per key?
An input split is a chunk of the input that is processed by a single map. Each map processes a single split. Each split is divided into records, and the map processes each record—a key-value pair—in turn.
So the mapper treats records with the same key as separate records.
