retrieving unique results from a column in a text file with Hadoop MapReduce - hadoop

I have the data set below. I want to get a unique list of the values in the first column as the output, e.g. {9719, 382, ...}. There are also integers at the end of each line, so checking whether a token is numeric won't work, and I couldn't think of a solution. Can you show me how to do it? I'd really appreciate a detailed answer (what to do in the map step and what to do in the reduce step).
id - - [date] "URL"

In your mapper you should parse each line and write out the token you are interested in from the beginning of the line (e.g. 9719) as the key of a key-value pair (the value is irrelevant in this case). Keys are sorted and grouped before they reach the reducer, so each reduce() call receives one distinct key; all the reducer has to do is emit that key once.
The WordCount example app that is packaged with Hadoop is very close to what you need.
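The whole job can be sketched in a few lines of plain Java - this is only an illustration of the map/reduce logic (class and method names are made up, and a TreeMap stands in for Hadoop's sort-and-shuffle), not the actual Mapper/Reducer boilerplate:

```java
import java.util.*;

// Sketch of the unique-column job, with the shuffle simulated by a TreeMap.
// In a real Hadoop job the map and reduce bodies below would live in
// Mapper.map() and Reducer.reduce(); all names here are illustrative.
class UniqueFirstColumn {
    // map(): emit the first whitespace-delimited token as the key; the value is unused.
    static String map(String line) {
        return line.split("\\s+")[0];
    }

    // The framework sorts and groups keys between map and reduce; a TreeMap
    // stands in for that shuffle here, so duplicate keys collapse.
    static List<String> run(List<String> lines) {
        TreeMap<String, Boolean> shuffled = new TreeMap<>();
        for (String line : lines) {
            shuffled.put(map(line), Boolean.TRUE);
        }
        // reduce(): each distinct key arrives exactly once, so just emit it.
        return new ArrayList<>(shuffled.keySet());
    }
}
```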

Related

Comparing data from same file in reducer function of a map reduce program

In my map-reduce program, the mapper function will emit two key-value pairs:
1) (person1, age)
2) (person2, age)
(I have kept only 2 pairs for simplicity; it would be nice if you could explain it for n lines.)
Now I want to write a reducer which will compare the ages of both and tell who is older.
The thing I cannot understand is that the mapper's output pairs will be on different lines in the file, and since the reducer works on a line-by-line basis over a file, how will it compare them?
Thanks in advance.
See if any of the following logic serves your purpose:
A.
emit (age, person_name) from your map
have only 1 reducer -
you will get all (age, person) pairs in sorted order, so simply emitting them will give the youngest first and the oldest last.
If you don't want to print all values, just keep two references in the reducer task - youngest and oldest - set them in the reduce method and emit whichever you want in the reducer task's cleanup().
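Option A's youngest/oldest bookkeeping might look like this in plain Java - a sketch only, with made-up names, where reduce() and cleanup() mimic the Hadoop reducer lifecycle:

```java
// Sketch of option A's reducer-side state: with (age, name) pairs arriving
// in sorted order at a single reducer, track the first and last pairs in
// fields and emit them once all input is consumed (Hadoop's cleanup()).
// Class and method names are illustrative, not Hadoop API.
class YoungestOldest {
    static String youngest = null, oldest = null;   // reducer "reference variables"

    static void reduce(int age, String name) {      // called once per sorted pair
        if (youngest == null) youngest = name;      // first pair seen = smallest age
        oldest = name;                              // last pair seen = largest age
    }

    static String cleanup() {                       // runs after the last reduce() call
        return "youngest=" + youngest + " oldest=" + oldest;
    }
}
```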
B.
Have a mapper emitting (name, age) as you said.
In the reducer task:
a. use setup() to create a TreeMap
b. in reduce(), add (age, person) to the TreeMap
c. your map will be sorted by age, which you can use in cleanup() to do whatever you need with it.
Essentially you can store every key-value pair in internal object(s) in reduce(); in cleanup() you have access to all of those values and can apply any logic you want to them.
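Option B could be sketched like this - plain Java that mirrors the setup()/reduce()/cleanup() lifecycle without the Hadoop API (all names are illustrative):

```java
import java.util.*;

// Sketch of option B: the reducer buffers (age -> names) in a TreeMap created
// in setup(), fills it in reduce(), and reads it back sorted in cleanup().
// Method names mirror the Hadoop Reducer lifecycle; everything else is plain Java.
class AgeSortReducer {
    private TreeMap<Integer, List<String>> byAge;

    void setup() {                                   // called once per reducer task
        byAge = new TreeMap<>();
    }

    void reduce(String name, int age) {              // called once per (name, age) pair
        byAge.computeIfAbsent(age, k -> new ArrayList<>()).add(name);
    }

    List<String> cleanup() {                         // called after all reduce() calls
        List<String> out = new ArrayList<>();
        for (Map.Entry<Integer, List<String>> e : byAge.entrySet())
            for (String n : e.getValue())
                out.add(n + "\t" + e.getKey());      // names in ascending age order
        return out;
    }
}
```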
I think your use case fits the secondary sort technique straight away.
Secondary sorting is a technique for sorting the "values" emitted by the mapper; primary sorting is done on the "key" emitted by the mapper.
If you try to sort all the values at the reducer level, you may run out of memory; secondary sorting should be done at the mapper/shuffle level instead.
Have a look at this article
In above example, just replace "year" with "person" and "temperature" with "age"
Solution:
Create a custom Partitioner to send all values for a particular key to a single reducer.
Sorting should be done on the key + value combination emitted by the mapper => create a composite key of key + value to be used for sorting, and write a comparator which sorts first by key and then by value.
In the reduce method, all you get is a key and a list of values, so you can find the min or max within that list. If you need to compare across keys, however, you should probably use a single reducer, receive all the records from the mappers, and handle that logic in your reducer class using a reference (instance) variable rather than a local one, updating it with the min/max value of each key.

How to process large file with one record dependent on another in MapReduce

I have a scenario with a really large file where, say, the record on line 1 might depend on the data on line 1000, and lines 1 and 1000 can be part of separate splits. My understanding of the framework is that the record reader returns one key-value pair to the mapper and each pair is independent of the others. Moreover, since the file is divided into splits and I want to keep it that way (i.e. making it non-splittable is not an option), can I handle this somehow, maybe by writing my own record reader, mapper or reducer?
Dependency is like -
Row1: a,b,c,d,e,f
Row2: x,y,z,p,q,r
Now x in Row2 needs to be used with, say, d in Row1 to get my desired output.
Thanks.
I think what you need is a reduce-side join. Here you can see a better explanation of it: http://hadooped.blogspot.mx/2013/09/reduce-side-joins-in-java-map-reduce.html.
Both related values have to end up in the same reducer (determined by the key and the Partitioner), they should be grouped together (GroupingComparator), and you can optionally use a secondary sort to order the grouped values.
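A minimal sketch of the reduce-side join idea, in plain Java with the shuffle simulated by a TreeMap (the tags "R1"/"R2", the join key, and the field names are assumptions taken from the question's example):

```java
import java.util.*;

// Sketch of a reduce-side join: each mapper tags its record with its source
// ("R1"/"R2"), both sides emit the same join key, and the reducer combines
// the values that the shuffle grouped together. All names are illustrative.
class ReduceSideJoin {
    // Stand-in for the shuffle: group tagged (joinKey, taggedValue) pairs by key.
    static Map<String, List<String>> shuffle(List<String[]> tagged) {
        TreeMap<String, List<String>> groups = new TreeMap<>();
        for (String[] kv : tagged)
            groups.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        return groups;
    }

    // reduce(): both tagged values for a key arrive together; join them.
    static List<String> reduce(Map<String, List<String>> groups) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : groups.entrySet()) {
            String d = null, x = null;
            for (String v : e.getValue()) {
                if (v.startsWith("R1:")) d = v.substring(3);  // field d from Row1
                if (v.startsWith("R2:")) x = v.substring(3);  // field x from Row2
            }
            if (d != null && x != null)
                out.add(e.getKey() + "\t" + d + "," + x);
        }
        return out;
    }
}
```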

Looking for good example to understand significance of different keys in Map phase

I have seen this figure/definition in most books/blogs for the map phase of MapReduce.
What I don't understand is that in the map phase the input key is k while the output uses a different key k'. I googled around and found just one trivial example: http://java.dzone.com/articles/confused-about-mapreduce
I am looking for more (theoretical) examples and explanation of cases where the keys differ between the input and output of the map.
I would appreciate it if someone could provide some. Let me know if I need to explain my question further.
That's very straightforward, actually. The key that is the input to the map phase is the key the source data has, and the key coming out of the map is the key you want the end result ordered or grouped by.
It is important to note that the input key depends on the input format: e.g. if the input is HBase, the key would be the HBase row key; for a plain text/CSV file, the key would be the byte offset of the line.
For instance, if you have a sequence file where each record has an SSN as the key and a first name and last name as the value, and you want the end result ordered by last name, the incoming k would be the SSN and you'd emit the last name concatenated with the first name as the k' to order by.
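That key swap is a one-liner in the mapper; here is a plain-Java sketch (the class name and the "firstName lastName" value layout are assumptions for illustration):

```java
// Sketch of the SSN example: the map input key (the SSN) is dropped and a
// new key (last name + first name) is emitted, so the shuffle orders the
// output by last name. Assumes the value is laid out as "firstName lastName".
class KeyRemapMapper {
    // map(k, v): k is the SSN from the sequence file; the returned pair's
    // first element is k', the key the end result will be ordered by.
    static String[] map(String ssn, String value) {
        String[] names = value.split(" ");
        String newKey = names[1] + "," + names[0];   // "lastname,firstname" = k'
        return new String[]{newKey, value};
    }
}
```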

Map job which splits file into small ones and generates names of these files on reduce stage

Given big file A, I need to iterate over records of that file and for each record
extract value of certain field (status)
add this record to the file with name "status_" + value
emit that status value to reducer
so output would contain set of files with records, grouped by statuses, and some file with list of statuses
ideally, it should
place files with statuses under 'output_dir/statuses/status_nnn' (where nnn is actual status value),
'output_dir/status_list' would contain statuses one per line
Is that possible to do with Hadoop? I found out how to generate a filename per record from this example, but I am not sure how to separate the records and enumerate the statuses.
I don't know in advance which statuses could be in those records.
In the map phase you can do 2 emits per record: <status, record> and <list_statuses, status>. The 'list_statuses' must be a unique sentinel key you choose in advance. Then in the reduce phase your behaviour depends on the key: if it equals your special key, you emit the file with the statuses (this reduce function can store all statuses in a Set, for example); otherwise you generate the <status, record> file.
Does this make sense to you?
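A plain-Java sketch of that two-emit scheme (the sentinel key, class and method names are made up; in a real job the per-status files would be written with something like MultipleOutputs):

```java
import java.util.*;

// Sketch of the two-emit scheme: each record produces <status, record> and
// <LIST_KEY, status>. The reducer branches on the key: the sentinel key
// collects the distinct statuses, every other key produces a per-status
// output. A TreeMap simulates the shuffle; all names are illustrative.
class StatusSplitter {
    static final String LIST_KEY = "!statuses";     // sentinel; sorts before real statuses

    // map + shuffle: rec = {statusValue, wholeRecord}
    static Map<String, List<String>> mapAndShuffle(List<String[]> records) {
        TreeMap<String, List<String>> out = new TreeMap<>();
        for (String[] rec : records) {
            out.computeIfAbsent(rec[0], k -> new ArrayList<>()).add(rec[1]);
            out.computeIfAbsent(LIST_KEY, k -> new ArrayList<>()).add(rec[0]);
        }
        return out;
    }

    // reduce(): sentinel key -> unique status list; others -> "status_<v>" file.
    static Map<String, List<String>> reduce(Map<String, List<String>> groups) {
        Map<String, List<String>> files = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : groups.entrySet()) {
            if (e.getKey().equals(LIST_KEY))
                files.put("status_list", new ArrayList<>(new TreeSet<>(e.getValue())));
            else
                files.put("statuses/status_" + e.getKey(), e.getValue());
        }
        return files;
    }
}
```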

Sort reducer input iterator value before processing in Hadoop

I have some input data coming to the reducer with the value type Iterator.
How can I sort this list of values into ascending order?
I need to sort them, since they are time values, before processing them all in the reducer.
To achieve sorting of the reducer's input values using Hadoop's built-in features, you can do this:
1. Modify the map output key - append the corresponding value to it. Emit this composite key and the value from the map. Since Hadoop by default uses the entire key for sorting, the map output records will be sorted by (your old key + value).
2. Although the sorting is done in step 1, you have manipulated the map output key in the process, and Hadoop does partitioning and grouping based on the key by default.
3. Since you have modified the original key, you need to modify the Partitioner and the GroupingComparator to work with the old key, i.e. only the first part of your composite key.
Partitioner - decides which key-value pairs land in the same reducer instance.
GroupingComparator - decides which of the key-value pairs that landed in that reducer go to the same reduce method call.
4. Finally (and obviously) you need to extract the first part of the input key in the reducer to get the old key back.
If you need more (and a better) answer, turn to Hadoop: The Definitive Guide, 3rd edition -> chapter 8 -> Sorting -> Secondary Sort.
What you are asking for is called secondary sort. In a nutshell, you extend the key to add a "value sort key" to it and make Hadoop group by only the "real key" but sort by both.
Here is a very good explanation of the secondary sort:
http://pkghosh.wordpress.com/2011/04/13/map-reduce-secondary-sort-does-it-all/
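The composite-key idea from both answers can be sketched in plain Java like this - the comparator plays the role of the shuffle's sort, and the grouping step keeps only the natural key (all names are illustrative, not Hadoop API):

```java
import java.util.*;

// Sketch of secondary sort: a composite key (realKey, time) is sorted by
// both parts, while partitioning/grouping use only realKey, so each
// reduce() call sees one realKey with its time values already ascending.
class SecondarySort {
    // Composite key: the natural key plus the value we want sorted.
    static class Key {
        final String realKey;
        final long time;
        Key(String k, long t) { realKey = k; time = t; }
    }

    // Sort comparator: first by the real key, then by time (ascending).
    static final Comparator<Key> SORT =
        Comparator.comparing((Key k) -> k.realKey).thenComparingLong(k -> k.time);

    // Partitioner/GroupingComparator analogue: only the natural key matters,
    // so all times for one realKey reach the same reduce() call, pre-sorted.
    static List<Long> sortedTimesFor(String realKey, List<Key> emitted) {
        List<Key> all = new ArrayList<>(emitted);
        all.sort(SORT);                               // shuffle-phase sort
        List<Long> times = new ArrayList<>();
        for (Key k : all)
            if (k.realKey.equals(realKey))            // grouping on realKey only
                times.add(k.time);
        return times;
    }
}
```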
