Run a few lines after map & reduce - hadoop

I have a MapReduce program (in Java) that counts the words in a document and stores the output as:
word1 10
word2 20
...
I would like to know how to add a few lines to the end of the final output (something like the finally block of a try/catch); that is, I would like to append a few words and their scores to the final output.
So my question is: is there a way to add a piece of code that runs after the reducer finishes, so that I can do something once the whole map and reduce process completes?

One reducer: If you have a single reducer, you can use the context object in cleanup() to write the rank/score for each word. But to do this, you need access to the data that has already been written to the output file (the word counts), so I'd suggest adding a Map (or some other collection) in the reduce function to store the word counts. Use that Map in cleanup() to compute the rank/score and write the result through the context object.
Multiple reducers: If you have multiple reducers, you have to do the same thing in the main/run method instead. But in this case you'd have to read the output file data and do the calculation before appending to the file. I'd suggest using combiners, plus a reducer as described above, to calculate the rank/score.
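Setting the Hadoop boilerplate aside, the single-reducer pattern can be sketched in plain Java: buffer the totals as reduce() is called, then derive the extra lines once at the end. In a real job the second step would live in the reducer's overridden cleanup(Context) method and write through the context object; the rank-by-count logic and class name here are made-up stand-ins for whatever score you need.

```java
import java.util.*;

public class CleanupDemo {
    private final Map<String, Integer> counts = new LinkedHashMap<>();

    // Stands in for reduce(): buffer each word's total as it is emitted.
    public void reduce(String word, int total) {
        counts.put(word, total);
    }

    // Stands in for cleanup(Context): runs once, after every reduce() call,
    // so it can derive extra output lines from the buffered counts.
    public Map<String, Integer> cleanup() {
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(counts.entrySet());
        sorted.sort((a, b) -> b.getValue() - a.getValue()); // highest count first
        Map<String, Integer> extra = new LinkedHashMap<>();
        int rank = 1;
        for (Map.Entry<String, Integer> e : sorted) {
            extra.put(e.getKey() + "_rank", rank++);
        }
        return extra;
    }
}
```

In the actual reducer, the Map would be filled inside reduce() alongside the normal context.write() of each word count, and the extra pairs written with context.write() inside cleanup().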

Related

Mapreduce without reducer function

I have a file of data, and my task is to use MapReduce to create new data from each line of the file, because the file is huge.
For example, the file contains the expression (3-4*7-4), and I need to create a new expression randomly from it, such as (3+4/7*4). When implementing the task with MapReduce, I use the map phase to do the change and the reduce phase just to receive the data from the mapper and sort it. Is it correct to use just the map phase to do the main task?
If you do not need the map results sorted, set the number of reducers to zero by calling
job.setNumReduceTasks(0);
in your driver code; such a job is called map-only.
Your implementation is correct. Just make sure the keys output from the mapper are all unique if you don't want any expressions that happen to be identical being combined.
For example, since you said you have a huge data file, there may be a possibility that you get two expressions such as 3-4*7-4 and 3*4/7+4 and both new expressions turn out to be 3+4*7-4. If you use the expression as the key, the reducer will only get called once for both expressions. If you don't want this to happen, make sure you use a unique number for each key.
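To see why unique keys matter, the shuffle grouping can be simulated in plain Java (the class name and sample records below are made up for illustration):

```java
import java.util.*;

public class UniqueKeyDemo {
    // Simulates the shuffle phase: map outputs ({key, value} pairs) are
    // grouped by key, so the reducer is called once per distinct key.
    public static Map<String, List<String>> shuffle(List<String[]> mapOutput) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (String[] kv : mapOutput) {
            grouped.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }
        return grouped;
    }
}
```

If two inputs happen to produce the same expression and that expression is the key, they collapse into one group; keyed by a unique id (such as the input offset), both records survive.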

hadoop, word count in paragraph

Normally, Hadoop examples show how to do a word count for a file or for multiple files; the resulting word count is over the entire set.
I wish to do a word count for each paragraph and store the results in separate files, like paragraph(i)_wordcnt.txt. How can I do that? (The issue is that the mapper runs over the entire set and the reducer collects the output at the end. Can I do something like "when I reach a specific mark, write the results"?)
Say the file content is:
para1
...
para2
...
para3
...
Can I, on seeing para2, write out the word-count results of para1? Or, if instead each paragraph is written to a separate file, how do I get a sequence like this:
loop:
file(i)(parai)->Mapper->Reducer->multipleOutput(output-file(i))->writetofile(i);
i++;
goto loop;
You need to make the RecordReader read a paragraph at a time. See this question: Overriding RecordReader to read Paragraph at once instead of line
Here is the basic idea of how this can be done.
I think we have to run a chained mapper and reducer for this process.
In the first mapper you have to use a custom RecordReader and set its key to the whole paragraph. This way you will get as many keys as you have paragraphs. Then use an identity reducer, and feed the reducer's output to a new mapper that will get a paragraph as its key.
Now, since you have the paragraph in your new mapper, you can tweak the famous word-count code for your need (just replace KEYS with VALUES here, and all the rest stays the same).
Since you have a mapper chained after a reducer, getting the word count of each paragraph into separate files will be easy.
Please tell me if my method is not correct.
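Setting the job chaining aside, the core per-paragraph counting can be sketched in plain Java, assuming paragraphs are separated by blank lines (which is exactly what the custom RecordReader would enforce; the class name is made up):

```java
import java.util.*;

public class ParagraphWordCount {
    // Splits text into paragraphs on blank lines, then counts words per
    // paragraph -- one result map per paragraph, standing in for the
    // per-paragraph output files.
    public static List<Map<String, Integer>> countPerParagraph(String text) {
        List<Map<String, Integer>> results = new ArrayList<>();
        for (String para : text.split("\\n\\s*\\n")) {
            Map<String, Integer> counts = new TreeMap<>();
            for (String word : para.trim().split("\\s+")) {
                if (!word.isEmpty()) counts.merge(word, 1, Integer::sum);
            }
            results.add(counts);
        }
        return results;
    }
}
```

In the real job, each paragraph would arrive as one record from the RecordReader, and MultipleOutputs (or a file named per paragraph index) would take the place of the list of maps.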

"Map" and "Reduce" functions in Hadoop's MapReduce

I've been looking at this word count example from the Hadoop tutorial:
http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html#Source+Code
And I'm a little confused about the map function. In the map function shown, it takes in a key of type LongWritable, but this parameter is never used in the body of the function. What does the application programmer expect Hadoop to pass in for this key? Why does a map function require a key if it simply parses values from a line of text? Can someone give me an example where both a key and a value are required for input? I only see map as V1 -> (K2, V2).
Another question: in the real implementation of Hadoop, are there multiple reduction steps? If so, how does Hadoop apply the same reduce function multiple times if the function is (K2, V2) -> (K3, V3)? A further reduction would need to take input of type (K3, V3)...
Thank you!
There's a key there because the map() method is always passed a key and a value (and a context). It's up to you as to whether you actually use the key and/or value. In this case, the key represents a line number from the file being read. The word count logic doesn't need that. The map() method just uses the value, which in the case of a text file is a line of the file.
As to your second question (which really should be its own Stack Overflow question), you may have any number of map/reduce jobs in a Hadoop workflow. Some of those jobs read pre-existing files as input, and others read the output of other jobs. Each job will have one or more mappers and a single reducer.
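A plain-Java paraphrase of the tutorial's mapper may make the unused key concrete (no Hadoop types here; the long offset stands in for the LongWritable key, and it is simply ignored):

```java
import java.util.*;

public class LineMapper {
    // Mirrors the word-count map(): the key (byte offset of the line within
    // the input file) is accepted but unused; only the line's text matters.
    // Each word is emitted as a (word, "1") pair.
    public static List<String[]> map(long offset, String line) {
        List<String[]> out = new ArrayList<>();
        for (String token : line.split("\\s+")) {
            if (!token.isEmpty()) out.add(new String[]{token, "1"});
        }
        return out;
    }
}
```

A job where the key does matter is, for example, the reduce side of word count itself: there the key is the word and the values are its counts.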

Getting output files which contain the value of one key only?

I have a use case with Hadoop where I would like my output files to be split by key. At the moment I have the reducer simply output each value in the iterator. For example, here's some Python streaming code:
import sys

for line in sys.stdin:
    data = line.split("\t")
    print data[1]
This method works for a small dataset (around 4GB). Each output file of the job only contains the values for one key.
However, if I increase the size of the dataset (over 40GB) then each file contains a mixture of keys, in sorted order.
Is there an easier way to solve this? I know that the output will be in sorted order and I could simply do a sequential scan and add to files. But it seems that this shouldn't be necessary since Hadoop sorts and splits the keys for you.
Question may not be the clearest, so I'll clarify if anyone has any comments. Thanks
OK, then create a custom jar implementation of your MapReduce solution and use MultipleTextOutputFormat as the OutputFormat, as explained here. You just have to emit the filename (in your case the key) as the key in your reducer and the entire payload as the value, and your data will be written to a file named after your key.
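The routing itself, stripped of Hadoop types, comes down to appending each record to a buffer named after its key; in a real job MultipleTextOutputFormat derives the filename from the key for you. The class name and the -r-00000 suffix (mimicking reducer-output naming) are illustrative only:

```java
import java.util.*;

public class PerKeyFiles {
    // Simulates per-key output routing: every record is appended to a
    // buffer named after its key -- one buffer per output file.
    public static Map<String, StringBuilder> route(List<String[]> records) {
        Map<String, StringBuilder> files = new LinkedHashMap<>();
        for (String[] kv : records) {
            files.computeIfAbsent(kv[0] + "-r-00000", k -> new StringBuilder())
                 .append(kv[1]).append('\n');
        }
        return files;
    }
}
```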

two file comparison in hdfs

I want to write a MapReduce job to compare two large files in HDFS. Any thoughts on how to achieve that? Or is there any other way to do the comparison? Because the file sizes are very large, I thought MapReduce would be an ideal approach.
Thanks for your help.
You may do this in two steps.
1. First make the line number part of the text files. Say the initial file looks like:
I am awesome
He is my best friend
Now convert it to something like this:
1,I am awesome
2,He is my best friend
This may well be done by a MapReduce job itself, or by some other tool.
2. Now write a MapReduce step in which the mapper emits the line number as the key and the rest of the actual sentence as the value. Then in the reducer just compare the values, and whenever they don't match, emit the line number (the key) and the payloads, or whatever you want here. Also, if the count of values for a key is just 1, that is a mismatch as well.
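Step 2 can be sketched in plain Java, with a map simulating the shuffle's grouping by line number (the class name is made up; a real reducer would receive the grouped values through its iterator):

```java
import java.util.*;

public class LineNumberDiff {
    // Key each line by its line number, group, then compare the values that
    // arrive for each key. Returns the line numbers that differ or that
    // appear in only one of the two files.
    public static List<Integer> mismatches(List<String> fileA, List<String> fileB) {
        Map<Integer, List<String>> byLineNo = new TreeMap<>();
        for (int i = 0; i < fileA.size(); i++)
            byLineNo.computeIfAbsent(i + 1, k -> new ArrayList<>()).add(fileA.get(i));
        for (int i = 0; i < fileB.size(); i++)
            byLineNo.computeIfAbsent(i + 1, k -> new ArrayList<>()).add(fileB.get(i));
        List<Integer> bad = new ArrayList<>();
        for (Map.Entry<Integer, List<String>> e : byLineNo.entrySet()) {
            List<String> vals = e.getValue();
            // A single value, or two differing values, means a mismatch.
            if (vals.size() != 2 || !vals.get(0).equals(vals.get(1))) bad.add(e.getKey());
        }
        return bad;
    }
}
```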
EDIT: Better approach
Better still, what you can do is emit the complete line read in the mapper as the key, and make the value a number, say 1. Taking my example above, your mapper output would be as follows:
< I am awesome,1 >
< He is my best friend,1 >
And in the reducer, just check the count of values: if it isn't 2, you have a mismatch.
But there is one catch in this approach: if exactly the same line can occur at two different places, then instead of checking that a given key has exactly 2 values in the reducer, you should check that the count is a multiple of 2.
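The line-as-key idea, simulated without Hadoop (the multiple-of-2 check mirrors the caveat just above and, as noted, assumes duplicate lines pair up across the two files; the class name is made up):

```java
import java.util.*;

public class LineAsKeyDiff {
    // Emit each whole line as the key with value 1; after grouping, any line
    // whose total count is odd must be missing from, or extra in, one file.
    public static Set<String> mismatchedLines(List<String> fileA, List<String> fileB) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : fileA) counts.merge(line, 1, Integer::sum);
        for (String line : fileB) counts.merge(line, 1, Integer::sum);
        Set<String> bad = new TreeSet<>();
        for (Map.Entry<String, Integer> e : counts.entrySet())
            if (e.getValue() % 2 != 0) bad.add(e.getKey());
        return bad;
    }
}
```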
One possible solution could be to put the line number in as the count in the map job.
There are two files, like below:
File 1:
I am here -- Line 1
I am awesome -- Line 2
You are my best friend -- Line 3
File 2 is of a similar kind.
Now your map job output should look like <I am awesome, 2>, and so on.
Once the map job is done for both files, you have two records (key, value) with the same key to reduce.
At reduce time you can either compare the counters or generate the output accordingly; if the line also exists at a different location, the output can indicate that this line is a mismatch.
I have a solution for comparing files with keys. In your case, if you know that your IDs are unique, you can emit the IDs as keys in the map and the entire record as the value. Say your file has lines like ID,Line1: then emit the ID as the key and the whole line as the value from the mapper.
In the shuffle-and-sort phase the IDs will be sorted, and you will get an iterator with data from both files; that is, the records from both files with the same ID end up in the same iterator.
Then in the reducer, compare the two values from the iterator, and if they match, move on to the next record. If they do not match, write them to the output.
I have done this and it worked like a charm.
Scenario 1 - no matching key: if an ID has no match between the two files, its iterator will have only one value.
Scenario 2 - duplicate keys: if the files have duplicate keys, the iterator will have more than 2 values.
Note: you should compare the values only when the iterator has exactly 2 values.
**Tip:** the iterator will not always have the values in order. To identify which file a value came from, add a small indicator at the end of the line in the mapper, like Line1;file1 and Line1;file2. Then in the reducer you will be able to tell which value belongs to which file.
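Here is that join-style comparison simulated in plain Java, with the file tag carried as a map key instead of a ;file1 suffix (the class name is hypothetical, and it assumes IDs are unique within each file, per the answer's premise):

```java
import java.util.*;

public class KeyedFileDiff {
    // Records are "id,payload"; the mapper side tags each value with its
    // source file, and the reducer side compares the two values per id.
    // Returns the ids whose records differ or are missing on one side.
    public static Set<String> diff(List<String> file1, List<String> file2) {
        Map<String, Map<String, String>> byId = new HashMap<>();
        tag(byId, file1, "file1");
        tag(byId, file2, "file2");
        Set<String> bad = new TreeSet<>();
        for (Map.Entry<String, Map<String, String>> e : byId.entrySet()) {
            Map<String, String> vals = e.getValue();
            // Only compare when exactly one value came from each file.
            if (vals.size() != 2 || !vals.get("file1").equals(vals.get("file2")))
                bad.add(e.getKey());
        }
        return bad;
    }

    private static void tag(Map<String, Map<String, String>> byId,
                            List<String> lines, String source) {
        for (String line : lines) {
            String[] parts = line.split(",", 2); // id, payload
            byId.computeIfAbsent(parts[0], k -> new HashMap<>()).put(source, parts[1]);
        }
    }
}
```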
