Hadoop: word count per paragraph

Normally, Hadoop examples show how to do a word count over a file or a set of files, and the result covers the entire input.
I want to do a word count for each paragraph and store each result in a separate file, e.g. paragraph(i)_wordcnt.txt.
How can I do this? The issue is that the mapper runs over the entire input and the reducer only collects the output at the end.
Can I do something like "when I reach a specific marker, write out the results so far"?
Say the file content is:
para1
...
para2
...
para3
...
Can I, on seeing para2, write out the word-count results of para1? Or, if each paragraph is instead written to a separate file first, how could I run a sequence like this:
loop:
file(i)(parai)->Mapper->Reducer->multipleOutput(output-file(i))->writetofile(i);
i++;
goto loop;

You need to make the RecordReader read a paragraph at a time. See this question: Overriding RecordReader to read Paragraph at once instead of line
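If your Hadoop version's TextInputFormat honours the textinputformat.record.delimiter property (recent releases do), a quicker alternative to a custom RecordReader is to set that delimiter to a blank line so each mapper call receives one whole paragraph. A minimal sketch of the driver-side configuration, under that assumption:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Driver-side sketch: treat blank-line-separated paragraphs as single records,
// so each map() call receives one whole paragraph as its value.
Configuration conf = new Configuration();
conf.set("textinputformat.record.delimiter", "\n\n");
Job job = Job.getInstance(conf, "per-paragraph word count");
job.setInputFormatClass(TextInputFormat.class);
// ...mapper, reducer and input/output paths as in the standard word count.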

Here is the basic idea of how we can do it.
I think we have to run chained mappers and reducers for this process.
In the first mapper, use a RecordReader that sets the whole paragraph as the key; this way you get as many keys as there are paragraphs. Then use an identity reducer and feed its output into a new mapper, which receives each paragraph as a key.
Since the new mapper now has a whole paragraph, you can tweak the famous word-count code for your needs (just swap KEYS for VALUES here; everything else stays the same).
Because the second mapper runs on the first reducer's output, getting the word count of each paragraph into separate files becomes easy.
Please tell me if my method is not correct.
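To get the per-paragraph files in that last stage, MultipleOutputs (which the question's pseudocode also hints at) is one option. A rough sketch, assuming the second-stage keys carry a paragraph id alongside the word; that key layout and the class name are my own illustration, not a fixed recipe:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Hypothetical reducer: keys look like "<paragraphId>\t<word>", values are partial counts.
public class ParagraphWordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private MultipleOutputs<Text, IntWritable> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        String[] parts = key.toString().split("\t", 2);   // [paragraphId, word]
        // Each paragraph id becomes the base name of its own output file,
        // e.g. para1_wordcnt-r-00000.
        mos.write(new Text(parts[1]), new IntWritable(sum), parts[0] + "_wordcnt");
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}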

Related

Run a few lines after map & reduce

I have a MapReduce program (in Java) which counts the words in a document and stores the output as:
word1 10
word2 20
...
I would like to know how to add a few lines to the end of the final output (something like the finally block of a try/catch); that is, I would like to append a few words and their scores to the final output.
So my question is: is there a way to add a piece of code that runs after the reducer, so that I can do something once the whole Map & Reduce completes?
One reducer: If you have a single reducer, you can use the context object in cleanup() to write the rank/score for each word. To do this, though, you need the data that has already been written to the output file (the word counts), so I'd suggest storing the word counts in a Map or some other object inside the reduce function. Then use that Map in cleanup() to compute the rank/score and write the result through the context object.
Multiple reducers: If you have multiple reducers, you have to do the same thing in the main/run method instead. In that case you'd have to read the output file data and do the calculation before appending to the file. I'd suggest using combiners, and a reducer as suggested above, to calculate the rank/score.
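As a rough illustration of the single-reducer approach (the class name and the placeholder "top word" score below are my own; substitute your actual rank/score logic):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountWithSummaryReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    // Keep the totals around so cleanup() can append extra lines after every word is written.
    private final Map<String, Integer> counts = new HashMap<>();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        counts.put(key.toString(), sum);
        context.write(key, new IntWritable(sum));   // the normal word-count output
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Runs once after all keys have been reduced: append the extra lines here.
        // Placeholder "score": the word with the highest count.
        Map.Entry<String, Integer> top = null;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (top == null || e.getValue() > top.getValue()) {
                top = e;
            }
        }
        if (top != null) {
            context.write(new Text("TOP_WORD " + top.getKey()), new IntWritable(top.getValue()));
        }
    }
}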

How to just output value in context.write(k,v)

In my MapReduce job, I just want to output some lines.
But if I code it like this:
context.write(data, null);
the program throws java.lang.NullPointerException.
I don't want to code it like this:
context.write(data, new Text(""));
because then I have to trim the trailing blank space from every line in the output files.
Is there a good way to solve this?
Thanks in advance.
Sorry, it was my mistake. I checked the program carefully and found the reason: I had set the Reducer as the combiner.
If I do not use the combiner, the statement
context.write(data, null);
in the reducer works fine. In the output data file there is just the data line.
Sharing the NullWritable explanation from Hadoop: The Definitive Guide:
NullWritable is a special type of Writable, as it has a zero-length serialization. No bytes are written to, or read from, the stream. It is used as a placeholder; for example, in MapReduce, a key or a value can be declared as a NullWritable when you don’t need to use that position—it effectively stores a constant empty value. NullWritable can also be useful as a key in SequenceFile when you want to store a list of values, as opposed to key-value pairs. It is an immutable singleton: the instance can be retrieved by calling NullWritable.get().
You should use NullWritable for this purpose.
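A minimal sketch of that, with the value type set to NullWritable in both the reducer signature and the job setup (the class name is illustrative):

import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Emits only the data line; NullWritable contributes no bytes to the output.
public class DataOnlyReducer extends Reducer<Text, Text, Text, NullWritable> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        context.write(key, NullWritable.get());
    }
}

// In the driver:
// job.setOutputKeyClass(Text.class);
// job.setOutputValueClass(NullWritable.class);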

Getting output files which contain the value of one key only?

I have a use-case with Hadoop where I would like my output files to be split by key. At the moment I have the reducer simply outputting each value in the iterator. For example, here's some python streaming code:
for line in sys.stdin:
    data = line.split("\t")
    print data[1]
This method works for a small dataset (around 4GB). Each output file of the job only contains the values for one key.
However, if I increase the size of the dataset (over 40GB) then each file contains a mixture of keys, in sorted order.
Is there an easier way to solve this? I know that the output will be in sorted order and I could simply do a sequential scan and add to files. But it seems that this shouldn't be necessary since Hadoop sorts and splits the keys for you.
Question may not be the clearest, so I'll clarify if anyone has any comments. Thanks
OK, then create a custom-jar implementation of your MapReduce solution and use MultipleTextOutputFormat as the OutputFormat, as explained here. You just have to emit the file name (in your case, the key) as the key in your reducer and the entire payload as the value, and your data will be written to the file named after your key.
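A minimal sketch of such an OutputFormat using the old mapred API, where MultipleTextOutputFormat lives (the class name is my own):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

// Routes each record to an output file named after its key.
public class KeyBasedOutputFormat extends MultipleTextOutputFormat<Text, Text> {
    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        return key.toString();   // one output file per distinct key
    }
}

// In the (old-API) driver:
// JobConf conf = new JobConf(MyJob.class);
// conf.setOutputFormat(KeyBasedOutputFormat.class);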

two file comparison in hdfs

I want to write a MapReduce job to compare two large files in HDFS. Any thoughts on how to achieve that? Or is there another way to do the comparison? Because the files are very large, I thought MapReduce would be an ideal approach.
Thanks for your help.
You may do this in two steps.
1. Make the line number part of each text file.
Say the initial file looks like:
I am awesome
He is my best friend
Now convert it to something like this:
1,I am awesome
2,He is my best friend
This can be done by a MapReduce job itself or by some other tool.
2. Now write a MapReduce step in which the mapper emits the line number as the key and the rest of the sentence as the value. In the reducer, just compare the values; whenever they don't match, emit the line number (the key) and whatever payload you want. Also, if a key has only one value, that is a mismatch as well.
EDIT: Better approach
Better still, just emit each complete line read in the mapper as the key and make the value a number, say 1. Taking my example above, your mapper output would be:
< I am awesome,1 >
< He is my best friend,1 >
In the reducer, just check the count of values; if it isn't 2, you have a mismatch.
But there is one catch in this approach: if exactly the same line can occur at two different places, then instead of checking that a key has exactly two values in the reducer, you should check that the count is a multiple of 2.
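A hedged sketch of that line-as-key idea (class names are placeholders):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: the whole input line becomes the key, 1 the value.
public class LineKeyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        context.write(line, ONE);
    }
}

// Reducer: a line present once in each file arrives with exactly two values;
// anything else is flagged as a mismatch. If duplicate lines are possible,
// test (n % 2 != 0) instead, as described above.
public class LineCompareReducer extends Reducer<Text, IntWritable, Text, NullWritable> {
    @Override
    protected void reduce(Text line, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int n = 0;
        for (IntWritable c : counts) {
            n += c.get();
        }
        if (n != 2) {
            context.write(line, NullWritable.get());   // line differs between the files
        }
    }
}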
One possible solution could be to put the line number in as the value in the map job.
Say there are two files like this:
File 1:
I am here -- Line 1
I am awesome -- Line 2
You are my best friend -- Line 3
File 2 is of a similar kind.
Now your map job output would look like < I am here, 1 >, < I am awesome, 2 > and so on.
Once the map job is done for both files, you have two (key, value) records with the same key arriving at the reducer.
At reduce time you can either compare the counters or generate output accordingly. If the same line also exists at a different location, the output can indicate that the line is a mismatch.
I have a solution for comparing files that have keys. In your case, if you know that your IDs are unique, you could emit the IDs as keys in the map and the entire record as the value. Let's say your file has lines of the form ID,Line1; then emit the ID as the key and the line as the value from the mapper.
In the shuffle and sort phase the IDs are sorted, and you will get an iterator with data from both files; i.e. the records from both files with the same ID end up in the same iterator.
Then in the reducer, compare both values from the iterator; if they match, move on to the next record. If they do not match, write them to the output.
I have done this and it worked like a charm.
Scenario 1 - No matching key
If there is no matching ID between the two files, the iterator will have only one value.
Scenario 2 - Duplicate keys
If the files have duplicate keys, the iterator will have more than two values.
Note: You should compare the values only when the iterator has exactly two values.
Tip: The iterator will not always have the values in order. To identify which value came from which file, add a small indicator to the end of the line in the mapper, e.g.:
Line1;file1
Line1;file2
Then in the reducer you will be able to tell which value belongs to which file.
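A rough sketch of this ID-keyed comparison, assuming records of the form ID,rest-of-line and the ";file" tag described in the tip (class names are my own):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Mapper: key = record ID, value = "<rest of line>;<source file>".
public class TaggedRecordMapper extends Mapper<LongWritable, Text, Text, Text> {
    private String fileName;

    @Override
    protected void setup(Context context) {
        fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] parts = line.toString().split(",", 2);   // [ID, rest of line]
        context.write(new Text(parts[0]), new Text(parts[1] + ";" + fileName));
    }
}

// Reducer: exactly two values means the ID is in both files, so compare them;
// one value means it is missing from a file; more than two means duplicate keys.
public class CompareByIdReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text id, Iterable<Text> tagged, Context context)
            throws IOException, InterruptedException {
        List<String> records = new ArrayList<>();
        for (Text t : tagged) {
            records.add(t.toString());
        }
        if (records.size() != 2) {
            context.write(id, new Text("missing from one file, or duplicated"));
            return;
        }
        String a = records.get(0).split(";", 2)[0];   // strip the ";file" tag
        String b = records.get(1).split(";", 2)[0];
        if (!a.equals(b)) {
            context.write(id, new Text(a + " <> " + b));
        }
    }
}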

Hadoop searching words from one file in another file

I want to build a Hadoop application which can read words from one file and search for them in another file.
If the word exists, it should be written to one output file.
If the word doesn't exist, it should be written to another output file.
I tried a few examples in Hadoop. I have two questions:
The two files are approximately 200MB each. Checking every word against the other file might cause an out-of-memory error. Is there an alternative way of doing this?
How do I write data to different files? The output of the reduce phase in Hadoop goes to only one file. Is it possible to filter the reduce output so that it goes to different output files?
Thank you.
How I would do it:
split value in 'map' by words, emit (<word>, <source>) (*1)
you'll get in 'reduce': (<word>, <list of sources>)
check the source list (it may contain entries for both/all sources)
if NOT all sources are in the list, emit every time (<missingsource>, <word>)
job2: job.setNumReduceTasks(<numberofsources>)
job2: emit in 'map' (<missingsource>, <word>)
job2: emit for each <missingsource> in 'reduce' all (null, <word>)
You'll end up with as many reduce outputs as there are distinct <missingsource> values, each containing the missing words for that document. You could write out the <missingsource> ONCE at the beginning of 'reduce' to mark the files.
(*1) How to find out the source in map (Hadoop 0.20):
private String localname;
private Text outkey = new Text();
private Text outvalue = new Text();
...

public void setup(Context context) throws InterruptedException, IOException {
    super.setup(context);
    localname = ((FileSplit) context.getInputSplit()).getPath().toString();
}

public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    ...
    outkey.set(...);
    outvalue.set(localname);
    context.write(outkey, outvalue);
}
Are you using Hadoop/MapReduce for a specific reason to solve this problem? This sounds like something more suited to a Lucene based application than Hadoop.
If you have to use Hadoop I have a few suggestions:
Your 'documents' will need to be in a format that MapReduce can deal with. The easiest format to use would be a CSV-based file with each word of the document on its own line. Having PDF etc. will not work.
To take a set of words as input to your MapReduce job, to compare against the data the job processes, you could use the Distributed Cache so that each mapper can build the set of words you want to find in the input (see the sketch after these suggestions). However, if your list of words to find is large (you mention 200MB), I doubt this would work. This method is one of the main ways you can do a join in MapReduce, though.
The indexing method mentioned in another answer here also offers possibilities. Again, though, indexing a document just makes me think of Lucene rather than Hadoop. If you did use this method, you would need to make sure the key contains a document identifier as well as the word, so that you have the word counts for each document.
I don't think I've ever produced multiple output files from a MapReduce job. You would need to write some (very simple) code to process the indexed output into multiple files.
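As a rough sketch of the Distributed Cache suggestion above: the file name, the "#wordlist" symlink and the found/missing tagging are illustrative assumptions, and the sketch presumes the cached file is symlinked into the task working directory under the fragment name, which recent Hadoop releases do.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Driver side (sketch): ship the word list to every task, e.g.
// job.addCacheFile(new java.net.URI("/user/me/wordlist.txt#wordlist"));

// Mapper side: build the word set once in setup(), then tag each input token.
public class WordLookupMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Set<String> words = new HashSet<>();

    @Override
    protected void setup(Context context) throws IOException {
        // Read the localized copy via its "wordlist" symlink.
        try (BufferedReader reader = new BufferedReader(new FileReader("wordlist"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                words.add(line.trim());
            }
        }
    }

    @Override
    protected void map(LongWritable offset, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            // Tag each token so a reducer (or MultipleOutputs) can route
            // found and missing words to different outputs.
            context.write(new Text(token), new Text(words.contains(token) ? "found" : "missing"));
        }
    }
}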
You'll want to do this in two stages, in my opinion. Run the wordcount program (included in the Hadoop examples jar) against the two initial documents; this will give you two files, each containing a unique list (with counts) of the words in each document. From there, rather than using Hadoop, do a simple diff on the two files, which should answer your question.
