Purpose Of NullWritable - hadoop

I want to count the number of students who have booked a movie ticket, and I want only one output after the reduce phase. I want the mapper to emit the count of students rather than individual keys.
Can I use NullWritable as the output key so that no key is emitted from the map side to the reduce side, as shown below?
context.write(NullWritable.get(), new IntWritable(1));
The data will be emitted to the reducer, and the reducer will perform further aggregation.
Please suggest a better alternative if you have one.
Thank you in advance!

Instead, you could emit the map output as
context.write(new Text("number of students"), new IntWritable(1));
with the number of reducers set to 1 in the driver. Then you can sum up the values on the reducer side.
If you only need the value in the output file and not the key, you can use NullWritable:
context.write(NullWritable.get(), value);
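The single-key counting flow above can be sketched in plain Java (this simulates what the lone reducer does when every map record arrives under the same key; the class and method names are illustrative, not part of the Hadoop API):

```java
import java.util.*;

// Plain-Java sketch of the shuffle/reduce behavior: every map record shares
// the single key "number of students", so one reduce call receives all the
// 1s emitted by the mappers and sums them.
public class StudentCount {
    public static int reduce(List<Integer> ones) {
        int sum = 0;
        for (int v : ones) sum += v;   // same aggregation the reducer performs
        return sum;
    }

    public static void main(String[] args) {
        // four booking records -> the mapper emits ("number of students", 1) four times
        List<Integer> mapOutput = Arrays.asList(1, 1, 1, 1);
        System.out.println(reduce(mapOutput)); // 4
    }
}
```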

Related

Comparing data from same file in reducer function of a map reduce program

In my map reduce program, the mapper will emit two key-value pairs:
1) (person1, age)
2) (person2, age)
(I have kept it to 2 pairs for simplicity; it would be nice if you could explain it for n lines.)
Now I want to write a reducer that compares both ages and outputs who is older.
What I cannot understand is that the mapper's output will be on different lines in the file, and since the reducer works line by line over a file, how will it compare them?
Thanks in advance.
See if either of the following approaches serves your purpose:
A.
Emit (age, person_name) from your map and have only 1 reducer.
You will get all (age, person) pairs in sorted order, so simply emitting them gives the youngest first and the oldest last.
If you don't want to print all values, keep two references in the reducer task - youngest and oldest - set them in the reduce method and emit whichever you want in the reducer's cleanup().
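Option A relies on the pairs arriving sorted by age; a plain-Java sketch of that idea (a TreeMap stands in for Hadoop's sort phase here, and the names are illustrative; note a TreeMap would collapse duplicate ages, which the real shuffle does not):

```java
import java.util.*;

// Sketch of option A: if the map emits (age, name) and a single reducer
// receives the pairs sorted by age, the first entry is the youngest and
// the last is the oldest.
public class OldestFinder {
    public static String youngest(Map<Integer, String> byAge) {
        // TreeMap keeps keys (ages) ascending, like the shuffle's sort
        return new TreeMap<>(byAge).firstEntry().getValue();
    }

    public static String oldest(Map<Integer, String> byAge) {
        return new TreeMap<>(byAge).lastEntry().getValue();
    }
}
```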
B.
Have the mapper emit (name, age) as you said. In the reducer task:
a. Use setup() to create a TreeMap.
b. In reduce(), add (age, person) to the TreeMap.
c. The map will be sorted by age, which you can use in cleanup() to do whatever you need.
Essentially you can store every key-value pair in internal object(s) during reduce(); in cleanup() you have access to all of these values and can apply any logic you want.
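Option B's reducer lifecycle can be sketched as a plain class whose method names mirror Hadoop's Reducer hooks (this is not Hadoop code, just the same setup/reduce/cleanup shape):

```java
import java.util.*;

// Sketch of option B: setup() creates the TreeMap, reduce() inserts
// (age, person) pairs, and cleanup() reads the result sorted by age.
public class AgeReducer {
    private TreeMap<Integer, String> ages;

    public void setup() {
        ages = new TreeMap<>();            // sorted by age (the key)
    }

    public void reduce(String person, int age) {
        ages.put(age, person);             // buffer every pair seen
    }

    public String cleanup() {
        // entries are sorted by age, so lastEntry() is the oldest person
        return ages.lastEntry().getValue();
    }
}
```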
I think your use case fits the secondary sorting technique straight away.
Secondary sorting is a technique for sorting the "values" emitted by the mapper; primary sorting is done on the "key" emitted by the mapper.
If you try to sort all values at the reducer level, you may run out of memory; secondary sorting should be done at the mapper level instead.
Have a look at this article.
In the above example, just replace "year" with "person" and "temperature" with "age".
Solution:
Create a custom partitioner to send all values for a particular key to a single reducer.
Sorting should be done on the key + value combination emitted by the mapper: create a composite key of key + value, and write a comparator that sorts first by key and then by value.
In the reduce method, all you get is a key and a list of values, so you can find the min or max within that list. If you need to compare across keys, however, you should consider a single reducer: collect all records from the mappers and handle the logic in your reducer class with a reference variable (rather than a local one), updating it with the min/max for each key.
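The composite-key idea above can be sketched in plain Java (illustrative names only; in real Hadoop the comparator and partitioner would be separate WritableComparator/Partitioner classes):

```java
import java.util.*;

// Sketch of a composite key for secondary sort: records sort first by the
// natural key (name) and then by the value part (age), while partitioning
// looks only at the natural key so all records for one person reach the
// same reducer.
public class CompositeKey implements Comparable<CompositeKey> {
    final String name;   // natural key: used for partitioning and grouping
    final int age;       // value part: used only for sorting

    CompositeKey(String name, int age) {
        this.name = name;
        this.age = age;
    }

    @Override
    public int compareTo(CompositeKey o) {
        int c = name.compareTo(o.name);                   // primary: key
        return c != 0 ? c : Integer.compare(age, o.age);  // secondary: value
    }

    // partitioner logic: only the natural key decides the target reducer
    int partition(int numReducers) {
        return Math.floorMod(name.hashCode(), numReducers);
    }
}
```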

Looking for good example to understand significance of different keys in Map phase

I have seen this figure/definition in most books / blogs for Map phase of MapReduce
What I don't understand is that in the map phase the input key is k and the output key is a different key k'. I googled around and found just one trivial example: http://java.dzone.com/articles/confused-about-mapreduce
I am looking for more (theoretical) examples and explanation of cases where the keys differ between the input and output of the map.
I would appreciate it if someone could provide one. Let me know if I need to explain my question further.
That's very straightforward, actually. The key that is the input to the map phase is the key the source data has, and the key coming out of the map is the key you want the end result ordered or grouped by.
It is important to note that the input key depends on the input format: if the input is HBase, the key would be the HBase row key; with a plain text/CSV file, the key would be the byte offset of the line.
For instance, if you have a sequence file where each line has an SSN as the key and a first name and last name as the value, and you want the end result ordered by last name, then k would be the SSN and you'd emit the last name concatenated with the first name as k' to order by it.
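The SSN example can be sketched as a plain-Java map step (the class name and value layout "firstName lastName" are assumptions for illustration):

```java
// Sketch of a map step whose output key k' differs from its input key k:
// the input key is the SSN, and the emitted key is last name + first name,
// so the final output sorts by last name.
public class NameKeyMapper {
    // value is "firstName lastName"; returns {newKey, newValue}
    public static String[] map(String ssn, String value) {
        String[] parts = value.split(" ");
        String newKey = parts[1] + " " + parts[0]; // lastname firstname
        return new String[] { newKey, ssn };       // SSN becomes the value
    }
}
```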

Input/Output flow in map reduce chaining

I need help with map reduce chaining. I have a chain like this:
map -> reduce -> map
I want the output of the reducer to be used in the last mapper.
For example, in my reducer I compute the max salary of an employee, and this value is supposed to be used in the next mapper to find the record with that max salary. So my last mapper needs both the reducer's output and the contents of the original file. Is that possible? How can I solve this? Any better solution?
I'm not sure I understood the problem, but I will try to help.
You have reduced some input containing employee salaries (let's call it input1) into output (let's call it output1) that looks like this:
Key: someEmployee Value: max_salary
and now you want another mapper to map the data from both input1 and output1?
If so, you have a few options; choose one according to your needs.
Manipulate the first reducer's output: instead of emitting only max_salary as the value in output1, emit
max_salary##salary_1,salary_2,salary_3...salary_n
then create a new job and set the new mapper's input to output1.
Try reading this issue explaining how to get multiple inputs into one mapper.
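The two-pass flow can be sketched in plain Java (the class and method names are assumptions; in a real chain the max from job 1 would be handed to job 2's mapper, e.g. via the job configuration or the distributed cache):

```java
import java.util.*;
import java.util.stream.*;

// Sketch of the map -> reduce -> map chain: the first job's reduce finds
// the max salary, then a second "map" pass over the original records
// keeps only the employee(s) with that salary.
public class MaxSalaryChain {
    public static List<String> recordsWithMax(Map<String, Integer> salaries) {
        int max = Collections.max(salaries.values());   // first job's reduce
        return salaries.entrySet().stream()             // second job's map: filter
                .filter(e -> e.getValue() == max)
                .map(Map.Entry::getKey)
                .sorted()
                .collect(Collectors.toList());
    }
}
```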

Sort reducer input iterator value before processing in Hadoop

I have some input data coming to the reducer with values of type Iterable.
How can I sort this list of values into ascending order?
I need them sorted before processing in the reducer, since they are time values.
To sort reducer input values using Hadoop's built-in machinery, you can do this:
1. Modify the map output key: append the corresponding value to it, and emit this composite key together with the value from the map. Since Hadoop by default sorts on the entire key, map output records will be sorted by (your old key + value).
2. Although sorting is done in step 1, you have changed the map output key in the process, and Hadoop does partitioning and grouping based on the key by default.
3. Since you have modified the original key, you need to supply a Partitioner and a grouping comparator that work on the old key, i.e. only the first part of your composite key.
Partitioner - decides which key-value pairs land in the same reducer instance.
GroupingComparator - decides which of the key-value pairs that landed in that reducer go to the same reduce() call.
4. Finally (and obviously) you need to extract the first part of the input key in the reducer to recover the old key.
If you need more (and better) detail, see Hadoop: The Definitive Guide, 3rd Edition -> chapter 8 -> Sorting -> Secondary Sort.
What you are asking for is called secondary sort. In a nutshell: you extend the key with a "value sort key" and make Hadoop group by only the "real key" but sort by both.
Here is a very good explanation about the secondary sort:
http://pkghosh.wordpress.com/2011/04/13/map-reduce-secondary-sort-does-it-all/
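The sort-by-composite-but-group-by-original-key behavior can be sketched in plain Java (names are illustrative; in Hadoop the sorting and grouping would be done by the framework via your comparators, not in your own code):

```java
import java.util.*;

// Sketch of secondary sort for time values: sort records by
// (key, numeric time), then group by the original key only, so each
// group's times arrive already in ascending order.
public class SecondarySort {
    // records are {key, time}; returns key -> ascending list of times
    public static Map<String, List<Long>> sortedGroups(List<String[]> records) {
        // sort by key, then numerically by time (the "composite key" sort)
        records.sort(Comparator
                .comparing((String[] r) -> r[0])
                .thenComparingLong(r -> Long.parseLong(r[1])));
        // group by the original key only, preserving the sorted order
        Map<String, List<Long>> groups = new LinkedHashMap<>();
        for (String[] r : records) {
            groups.computeIfAbsent(r[0], k -> new ArrayList<>())
                  .add(Long.parseLong(r[1]));
        }
        return groups;
    }
}
```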

retrieving unique results from a column in a text file with Hadoop MapReduce

I have the data set below. I want a unique list of the first column as the output: {9719, 382, ...}. There are also integers at the end of each line, so checking whether a token is a number won't identify the first column, and I couldn't think of a solution. Can you show me how to do it? I'd really appreciate a detailed answer (what to do in the map step and what to do in the reduce step).
id - - [date] "URL"
In your mapper, parse each line and write out the token you are interested in from the beginning of the line (e.g. 9719) as the key of a key-value pair (the value is irrelevant here). Since keys are sorted and grouped before being sent to the reducer, each reduce() call receives one unique key; simply output the key once.
The WordCount example app that is packaged with Hadoop is very close to what you need.
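A plain-Java sketch of the whole distinct-ids job (the TreeSet stands in for the shuffle's sort-and-group step; class and method names are illustrative):

```java
import java.util.*;

// Sketch of the distinct-ids job: the map step emits the leading id of each
// log line as the key; because the shuffle groups identical keys, the reduce
// step fires once per unique id and simply emits the key.
public class UniqueIds {
    public static Set<String> uniqueFirstColumn(List<String> lines) {
        Set<String> ids = new TreeSet<>();      // sorted + deduplicated, like the shuffle
        for (String line : lines) {
            String id = line.split(" ")[0];     // map step: take the first token
            ids.add(id);                        // reduce step: one output per key
        }
        return ids;
    }
}
```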
