How to map the value in Mapper - hadoop

I have a file with data like
City,Quarter,Classification,Index
Bordeux,Q1,R,3
Krakow,Q1,U,2
Halifax,Q1,U,4
I need to find out the highest Index in each Classification and write them to two separate files. The output should be
Bordeux,Q1,R,3
Halifax,Q1,U,4
How do I load the data in the Mapper, given that it requires a key/value pair? It seems the programmer should not modify the data in the Mapper, so how do I write it to the Context object?
I assume the key and value types are not changed in the Reducer. If so, I plan to put my logic for finding the top records there, but then how do I organize the result into the context object?
I don't have a clue on how to proceed.
Any pointers will help me move forward.

In your case, when the Mapper reads the file, the input key is the byte offset of the line and the value is the line itself. In other words, each line of the file arrives in the Mapper as the value field. The output (key, value) of the Mapper should then be (Classification, Index).
The output of the Mapper becomes the input (key, value) of the Reducer, so the Reducer receives (Classification, Iterable<Index>). For each classification you can iterate over the Index values to find the maximum, and the output of the Reducer will be (Classification, Max).
In this case, the output key and value types are the same for the Mapper and the Reducer.
However, regarding writing the results to separate files: separate files are generated only if every key is routed to a different reducer instance. So in your case the total number of reducers should equal the total number of unique classifications (not great in terms of resource utilization, though), and you have to write a custom Partitioner to make that happen.
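As a rough sketch of the above (class names are made up here, the input is assumed to be the comma-separated format from the question, and the two public classes would live in separate files in a real project):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Emits (Classification, Index) for each line such as "Bordeux,Q1,R,3".
public class ClassificationMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final Text outKey = new Text();
    private final IntWritable outValue = new IntWritable();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        if (fields.length == 4 && !"City".equals(fields[0])) { // skip the header line
            outKey.set(fields[2]);                             // Classification
            outValue.set(Integer.parseInt(fields[3]));         // Index
            context.write(outKey, outValue);
        }
    }
}

// Receives (Classification, all Index values) and writes only the maximum.
public class MaxIndexReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable max = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int best = Integer.MIN_VALUE;
        for (IntWritable v : values) {
            best = Math.max(best, v.get());
        }
        max.set(best);
        context.write(key, max);
    }
}

To actually get one output file per classification, the driver would also set the number of reduce tasks to the number of classifications and register the custom Partitioner mentioned above.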

Related

How Blocks get converted into Records, and what exactly is the definition of a Record in Hadoop

I am learning Hadoop, and to begin with started with HDFS and MapReduce. I understood the basics of HDFS and MapReduce.
There is one particular point where I am not able to understand, which I am explaining below:
Large data set --> Stored in HDFS as Blocks, say for example B1, B2, B3.
Now, when we run a MR Job, each mapper works on a single block (assuming 1 mapper processes a block of data for simplicity)
1 Mapper ==> processes 1 block
I also read that a block is divided into Records, and that for a given block the same mapper is called for each record within that block.
But what exactly is a Record?
Since a given block has to be "broken" down into records, how does that block get broken into records, and what constitutes a record?
In most of the examples I have seen, a record is a full line delimited by a newline.
My doubt is: what decides the "conditions" on the basis of which something can be treated as a record?
I know there are many InputFormats in Hadoop, but my question is what conditions decide whether something is considered a record.
Can anyone help me understand this in simple words.
You need to understand the concept of RecordReader.
A block is a hard-bounded number of bytes of data stored on disk. So a block of 256 MB means exactly a 256 MB piece of data on the disk.
The mapper gets one record from the block, processes it, and then gets the next one; the onus of defining a record is on the RecordReader.
Now, what is a record? If the block is analogous to a table, a record is a row in that table.
Think about it this way: how would you process a block of data in a mapper? You cannot write logic against a random run of bytes. From the mapper's perspective you can only have logic if the input data "makes some sense", that is, if it has a structure or forms a logical chunk of data (from the mapper logic's point of view).
That logical chunk is called a record. In the default implementation, one line of data is the logical chunk. But sometimes it does not make sense for one line to be the logical unit, and sometimes there is no line at all (say the input is MP4 data and the mapper needs one song as input)!
Let's say your mapper needs to work on 5 consecutive lines together. In that case you override the RecordReader with an implementation where 5 lines are one record and are passed together to the mapper; a rough sketch follows.
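Purely as an illustration (the class name is invented, and handling of records that straddle split boundaries is glossed over), such a RecordReader could delegate to the standard LineRecordReader and concatenate five lines per record; a custom InputFormat would then return it from createRecordReader():

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Groups five consecutive lines into one record by delegating to LineRecordReader.
public class FiveLineRecordReader extends RecordReader<LongWritable, Text> {
    private final LineRecordReader lineReader = new LineRecordReader();
    private final LongWritable key = new LongWritable();
    private final Text value = new Text();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        lineReader.initialize(split, context);
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        StringBuilder block = new StringBuilder();
        int lines = 0;
        while (lines < 5 && lineReader.nextKeyValue()) {
            if (lines == 0) {
                key.set(lineReader.getCurrentKey().get()); // offset of the first line
            } else {
                block.append('\n');
            }
            block.append(lineReader.getCurrentValue().toString());
            lines++;
        }
        if (lines == 0) {
            return false; // nothing left in this split
        }
        value.set(block.toString());
        return true;
    }

    @Override
    public LongWritable getCurrentKey() { return key; }

    @Override
    public Text getCurrentValue() { return value; }

    @Override
    public float getProgress() throws IOException { return lineReader.getProgress(); }

    @Override
    public void close() throws IOException { lineReader.close(); }
}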
EDIT 1
Your understanding is on the right path:
InputFormat: opens the data source and splits the data into chunks
RecordReader: actually parses the chunks into Key/Value pairs.
From the JavaDoc of InputFormat:
InputFormat describes the input-specification for a Map-Reduce job.
The Map-Reduce framework relies on the InputFormat of the job to:
Split-up the input file(s) into logical InputSplits, each of which is then assigned to an individual Mapper.
Provide the RecordReader implementation to be used to extract input records from the logical InputSplit for processing by the Mapper.
From the first point: one block is not exactly the input to the mapper; the input is rather an InputSplit. For example, think of a ZIP file (a compressed archive). A ZIP file is a collection of ZipEntry instances (one per compressed file), and it is non-splittable from a processing perspective. That means the InputSplit for a ZIP file spans several blocks (in fact, all the blocks used to store that particular ZIP file). This comes at the expense of data locality: even though the ZIP file is broken up and stored in HDFS on different nodes, the whole file is moved to the node running the mapper.
The ZipFileInputFormat provides the default record reader implementation, ZipFileRecordReader, which contains the logic to read one ZipEntry (one compressed file) per mapper key-value pair.
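The non-splittable idea can also be seen in miniature by overriding isSplitable() on an existing InputFormat. A minimal sketch (the class name is invented; it reuses TextInputFormat's record reader) might look like this:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Refuses to split files, so every file is read start-to-finish by a single mapper,
// producing one InputSplit per file regardless of how many blocks it spans.
public class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}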
You've already basically answered this for yourself, so hopefully my explanation can help.
A record is a MapReduce-specific term for a key-value pair. A single MapReduce job can have several different record types: in the wordcount example the mapper input record type is <Object, Text>, the mapper output / reducer input record type is <Text, IntWritable>, and the reducer output record type is also <Text, IntWritable>.
The InputFormat is responsible for defining how the block is split into individual records. As you identified, there are many InputFormats, and each is responsible for implementing code that manages how it splits the data into records.
The block itself has no concept of records, as records aren't created until the data is read by the mapper. You could have two separate MapReduce jobs that read the same block but use different InputFormats. As far as HDFS is concerned, it's just storing one big blob of data.
There's no "condition" for defining how the data is split - you can make your own InputFormat and split the data however you want.
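To make that concrete, here is a sketch of two jobs reading the same input path but defining records differently (the path and job names are placeholders, and job submission is omitted). TextInputFormat produces (byte offset, line) records, while KeyValueTextInputFormat splits each line at the first tab into a (key, value) pair:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SameBlocksTwoFormats {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Job A: each record is (byte offset, whole line).
        Job jobA = Job.getInstance(conf, "records-as-lines");
        jobA.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(jobA, new Path("/data/shared-input"));

        // Job B: each record is (text before the first tab, text after it),
        // read from exactly the same HDFS blocks.
        Job jobB = Job.getInstance(conf, "records-as-key-value");
        jobB.setInputFormatClass(KeyValueTextInputFormat.class);
        FileInputFormat.addInputPath(jobB, new Path("/data/shared-input"));
    }
}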

Extracting rows containing specific value using mapReduce and hadoop

I'm new to developing map-reduce functions. Suppose I have a CSV file containing four columns of data.
For example:
101,87,65,67
102,43,45,40
103,23,56,34
104,65,55,40
105,87,96,40
Now, I want to extract, say,
40 102
40 104
40 105
as those rows contain 40 in the fourth column.
How do I write the map and reduce functions?
Basically, the WordCount example resembles very closely what you are trying to achieve. Instead of initializing a count for each word, you should have a condition that checks whether the tokenized string contains the required value, and only in that case write to the context. This works because the Mapper receives each line of the CSV separately.
The Reducer then receives the list of values already grouped per key. In the Reducer, instead of IntWritable as the output value type, you can use NullWritable, so your code will only output the keys. You also do not need the loop in the Reducer, since you only want to output the keys.
I am not providing any code in my answer, since you would learn nothing from that. Make your way from these recommendations.
EDIT: since you modified your question with a request for the Reducer, here are some tips on how you can achieve what you want.
One possibility for achieving the desired result: in the Mapper, after splitting (or tokenizing) the line, write column 3 to the context as the key and column 0 as the value. Your Reducer, since you do not need any kind of aggregation, can simply write out the keys and values produced by the Mappers (yes, your Reducer code will end up being essentially a single line). You can check one of my previous answers; the figure there explains quite well what the Map and Reduce phases are doing.
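Purely for reference, a minimal sketch of the shape of that solution (class names and the hard-coded filter value are illustrative; in practice the value could come from the job Configuration, and each public class would go in its own file):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Emits (column 3, column 0) only for rows whose fourth column matches the target value.
public class FilterMapper extends Mapper<LongWritable, Text, Text, Text> {
    private static final String TARGET = "40"; // illustrative; could be read from the Configuration

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] cols = value.toString().split(",");
        if (cols.length == 4 && TARGET.equals(cols[3])) {
            context.write(new Text(cols[3]), new Text(cols[0]));
        }
    }
}

// No aggregation needed: pass every (key, value) pair straight through.
public class PassThroughReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text v : values) {
            context.write(key, v); // e.g. "40    102"
        }
    }
}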

How do we count the number of times a map function is called in a mapreduce program?

I have to do certain operations on my input data and write it to HDFS using a MapReduce program.
My input data looks like
abc
some data
some data
some data
def
other data
other data
other data
and continues in the same way, where abc and def are headers and the "some data" lines are tab-separated records.
My task is to eliminate the headers and append each header to the records below it, like
some data abc
some data abc
some data abc
other data def
other data def
other data def
Each header will have 50 records.
I am using the default record reader, so it reads one line at a time.
Now my problem is: how do I know that the map function has been called for the nth time?
Do I have a counter to know that?
So that I can use that counter to append the header to the record, as in
if (counter % 50 == 0)
    *some code*
Or are static variables the only way?
You can use a member variable to keep the count of how many records have been processed so far. Member variables are instance variables and will not be reset each time the map function is called. You can initialize them in the Mapper's setup method.
Obviously, you could also use a static variable to keep the counter.
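A sketch of that member-variable approach (the class name and the exact header-detection logic are illustrative; it assumes each header is followed by exactly 50 records, as in the question):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class HeaderAppendMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    private long lineCount;       // survives across map() calls within this mapper instance
    private String currentHeader; // most recently seen header, e.g. "abc"

    @Override
    protected void setup(Context context) {
        lineCount = 0;
        currentHeader = "";
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // With 1 header followed by 50 records, every 51st line is a new header.
        if (lineCount % 51 == 0) {
            currentHeader = value.toString();
        } else {
            context.write(new Text(value.toString() + "\t" + currentHeader),
                          NullWritable.get());
        }
        lineCount++;
        // Caveat (see below): this breaks if a header group is split across input splits.
    }
}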
However, the data in HDFS is stored in blocks: how are you going to handle the case where a header and its records are split across two blocks?
To handle data split between two blocks, you may need Reducers. The property of reducers is that all the data (values) related to a particular key are always sent to the same (single) reducer. The input to a reducer is a key and the list of its values, which in your case is the list of data lines, so you can store them very easily as per your requirement.
Optimization: you can use the same Reducer code as a Combiner to optimize the data transfer.
Idea: the Mapper emits the key and value as they are. By the time the Reducer receives the data as (Key, List<value>), all of your values have already been grouped by the MapReduce framework; you just need to emit them again. This is the output you are looking for.
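Reusing the Reducer as the Combiner is then a single extra line in the driver. A stripped-down sketch (the mapper class, key/value classes and input/output paths are omitted here, and the pass-through reducer is only an example):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;

public class CombinerSketch {
    // Pass-through reducer: emits every value under its key unchanged. Because its
    // input and output types match, the same class is safe to use as a Combiner.
    public static class GroupReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text v : values) {
                context.write(key, v);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "append-header-to-records");
        job.setReducerClass(GroupReducer.class);
        job.setCombinerClass(GroupReducer.class); // same code reused as the Combiner
        // mapper, key/value classes and input/output paths omitted from this sketch
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}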

How does Hadoop decide to distribute among buckets/nodes?

I am new to Map/Reduce and Hadoop framework.
I am running a Hadoop program on single machine (for trying it out).
I have n input files and I want some summary of words from those files.
I know the map function returns a key/value pair, but how is map called?
Once for each file, or once for each line of every file? Can I configure this?
Is it correct to assume, "reduce" is called for each key?
A map task is created for one InputSplit (or split, for short), and it is the duty of the InputFormat you use in your MR job to create these splits. The record handed to each call of the map function could be one line, multiple lines, a whole file and so on, based on the logic inside your InputFormat and its RecordReader. For example, with the default InputFormat, i.e. TextInputFormat, each record is a single line.
Yes, you can configure it by altering the InputFormat you are using.
All the values corresponding to a particular key are clubbed together, and the keys are partitioned into partitions; an entire partition goes to one reducer for further processing. So all the values corresponding to a particular key are processed by a single reducer, but a single reducer can get multiple keys.
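The routing of keys to reducers is done by the Partitioner. The default behaviour is hash-based, roughly as in this sketch (which mirrors what Hadoop's HashPartitioner does):

import org.apache.hadoop.mapreduce.Partitioner;

// Every occurrence of the same key maps to the same partition number,
// and each partition is consumed by exactly one reducer.
public class HashLikePartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}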
In the Hadoop MR framework, the job tracker creates a map task for each InputSplit, as determined by the InputFormat specified by your job. Each InputSplit assigned to a map task is further processed by a RecordReader to generate input key/value pairs for the map function. The map function is called once for each key/value pair generated by the RecordReader.
For the default InputFormat, i.e. TextInputFormat, the input split is a single HDFS block processed by a single map task. The RecordReader processes one line at a time within the block and generates a key/value pair, where the key is the byte offset of the start of the line in the file and the value is the contents of the line; this pair is passed to the map function.
The number of reducers depends on the job configuration set by the user. All key/value pairs with the same key are grouped and sent to a single reducer, sorted by key, but at the same time a single reducer can process multiple keys.
For more details on InputFormat and customizing it, refer to this YDN documentation:
http://developer.yahoo.com/hadoop/tutorial/module5.html#inputformat

What happens when identical keys are passed to the Mapper in Hadoop

What is the significance of data being passed as key/value pairs to the mapper in the Hadoop MapReduce framework? I understand that key/value pairs are significant when they are passed to the reducers, since they drive the partitioning of the data coming from the mappers, and values belonging to the same key go as a list from the mapper to the reducer stage. But how are the keys used before the mapper stage itself? What happens to values belonging to the same key? If we don't define a custom input format, I presume Hadoop takes the record's position in the input file as the key and the text line as the value for the mapper function. But if we implement a custom input format, the keys are chosen by us, and there could be a case where several values correspond to the same key.
How does this get handled in the mapper stage? Does the mapper ignore the duplication and treat them as separate records, or does it choose only one record per key?
An input split is a chunk of the input that is processed by a single map. Each map processes a single split. Each split is divided into records, and the map processes each record—a key-value pair—in turn.
So the mapper treats records with the same key as separate records.
