"Map" and "Reduce" functions in Hadoop's MapReduce - hadoop

I've been looking at this word count example by hadoop:
http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html#Source+Code
And I'm a little confused about the map function. In the map function shown, it takes a "key" of type LongWritable, but this parameter is never used in the body of the map function. What does the application programmer expect Hadoop to pass in for this key? Why does a map function require a key if it simply parses values from a line of text? Can someone give me an example where both a key and a value are required for input? I only see map as V1 -> (K2, V2).
Another question: In the real implementation of Hadoop, are there multiple reduction steps? If so, how does Hadoop apply the same reduce function multiple times if the function is (K2, V2) -> (K3, V3)? If another reduction is performed, it needs to take input of type (K3, V3)...
Thank you!

There's a key there because the map() method is always passed a key and a value (and a context). It's up to you whether you actually use the key and/or the value. In this case, with TextInputFormat, the key is the byte offset of the line within the file being read. The word-count logic doesn't need that, so the map() method uses only the value, which for a text file is one line of the file.
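For illustration, here's a minimal word-count mapper sketch using the newer org.apache.hadoop.mapreduce API (the class and field names are mine, not the tutorial's). The key arrives as the line's byte offset and is simply ignored:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // "offset" is the byte position of this line in the input split;
        // the word-count logic never consults it.
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

An input format that does produce meaningful keys is KeyValueTextInputFormat, which splits each line at a tab character into a key and a value; with it, both halves of the input pair matter to the mapper.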
As to your second question (which really should be its own Stack Overflow question): you may have any number of map/reduce jobs in a Hadoop workflow. Some of those jobs read pre-existing files as input, and others read the output of other jobs. Each job has one or more mappers and zero or more reducers.

Related

Run a few lines after map & reduce

I have a MapReduce program (in Java) which finds the count of words in a document and stores the output as:
word1 10
word2 20
...
I would like to know how to add a few lines to the end of the final output (something like the finally block of a try/catch); that is, I would like to append a few words and their scores to the final output.
So my question is: is there a way to add a piece of code which runs after the reducer finishes, so that I can do something after the whole map and reduce completes?
One reducer: If you have a single reducer, you can use the context object in cleanup() to write the rank/score for each word. To do this, though, you need the data that has already been written to the output file (the word counts). I'd suggest adding a Map or some other structure in the reduce function to store the word counts, then using that Map in cleanup() to compute the rank/score and write the result through the context object.
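A minimal sketch of that idea (the class name and the rank-by-count scoring are invented for illustration; note it holds every word count in the reducer's heap):

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class AppendingReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    // Keeps a copy of everything already written, for use in cleanup().
    private final Map<String, Integer> counts = new HashMap<String, Integer>();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        counts.put(key.toString(), sum);
        context.write(key, new IntWritable(sum));
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        // Runs once, after the last reduce() call, so anything written here
        // lands at the end of this reducer's output file.
        // Hypothetical "score": rank words by descending count.
        List<Map.Entry<String, Integer>> ranked =
                new ArrayList<Map.Entry<String, Integer>>(counts.entrySet());
        ranked.sort((a, b) -> b.getValue().compareTo(a.getValue()));
        int rank = 1;
        for (Map.Entry<String, Integer> e : ranked) {
            context.write(new Text("rank " + rank++ + " " + e.getKey()),
                          new IntWritable(e.getValue()));
        }
    }
}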
Multiple reducers: If you have multiple reducers, you have to do the same thing, but in the main/run method. In that case you'd have to read back the output file data and do the calculation before appending to the file. I'd suggest using combiners and a single reducer as described above to calculate the rank/score.

Mapreduce without reducer function

I have a file of data, and my task is to use MapReduce to create new data from each line of the file, because the file is huge.
For example: the file contains the expression (3-4*7-4), and I need to create a new expression randomly from it, such as (3+4/7*4). When I implement the task using MapReduce, I use map to do the change and reduce just to receive the data from the mapper and sort it. Is it correct to use just map to do the main task?
If you do not need the map results sorted, you can set the number of reducers to zero by calling
job.setNumReduceTasks(0);
in your driver code. The job is then called a map-only job.
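A minimal driver sketch (ExpressionDriver and ExpressionMapper are hypothetical names) showing where that call goes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ExpressionDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "rewrite expressions");
        job.setJarByClass(ExpressionDriver.class);
        job.setMapperClass(ExpressionMapper.class); // your mapper (hypothetical name)
        job.setNumReduceTasks(0);                   // map-only: no shuffle, no sort
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With zero reduce tasks, each mapper's output is written straight to HDFS, unsorted, as part-m-* files.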
Your implementation is correct. Just make sure the keys output from the mapper are all unique if you don't want any expressions that happen to be identical being combined.
For example, since you said you have a huge data file, there may be a possibility that you get two expressions such as 3-4*7-4 and 3*4/7+4 and both new expressions turn out to be 3+4*7-4. If you use the expression as the key, the reducer will only get called once for both expressions. If you don't want this to happen, make sure you use a unique number for each key.
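A sketch of that idea, reusing the byte offset that TextInputFormat already supplies as a key that is unique within each input file (the rewrite() body here is a placeholder, not your actual randomization logic):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ExpressionMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String rewritten = rewrite(line.toString());
        // The byte offset is unique within one input file, so two identical
        // rewritten expressions keep distinct keys and are never merged.
        // (With several input files, offsets can collide across files; mix
        // the file name into the key if that matters.)
        context.write(offset, new Text(rewritten));
    }

    private String rewrite(String expression) {
        // Placeholder for your random operator-swapping logic.
        return expression.replace('-', '+');
    }
}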

How to output just the value in context.write(k,v)

In my mapreduce job, I just want to output some lines.
But if I code like this:
context.write(data, null);
the program throws a java.lang.NullPointerException.
I don't want to write it like this:
context.write(data, new Text(""));
because then I have to trim the blank space from every line in the output files.
Is there a good way to solve this?
Thanks in advance.
Sorry, it was my mistake. I checked the program carefully and found that the cause was setting the Reducer as the combiner.
If I do not use the combiner, the statement
context.write(data, null);
in the reducer works fine, and the output data file contains just the data lines. (TextOutputFormat's record writer simply skips a null value, but a combiner's output has to be serialized and fed back into the framework, which fails on null.)
To share the NullWritable explanation from Hadoop: The Definitive Guide:
NullWritable is a special type of Writable, as it has a zero-length serialization. No bytes are written to, or read from, the stream. It is used as a placeholder; for example, in MapReduce, a key or a value can be declared as a NullWritable when you don't need to use that position—it effectively stores a constant empty value. NullWritable can also be useful as a key in SequenceFile when you want to store a list of values, as opposed to key-value pairs. It is an immutable singleton: the instance can be retrieved by calling NullWritable.get().
You should use NullWritable for this purpose.
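A minimal sketch of that (the class name is illustrative); remember to declare the matching type in the driver with job.setOutputValueClass(NullWritable.class):

import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class LineOnlyReducer extends Reducer<Text, Text, Text, NullWritable> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // NullWritable.get() returns the immutable singleton; it serializes
        // to zero bytes, so each output line contains only the key text.
        context.write(key, NullWritable.get());
    }
}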

I wanted to develop map reduce logic to find the sentence count from the input file

I am new to hadoop and have a basic idea of map reduce: the input to the map function will be a key-value pair. So how do I identify when my sentence is complete, and how can I count it? Can the default input format, TextInputFormat, be used, or can some other input format do it in an easier way?
I suppose you'd just check the line for periods, deciding whether an ellipsis (...) should be ignored, etc. Then, as each line is passed to the map() method, you'd write out to the context a key/value counting those legitimate periods. The definition of what it means to end a sentence is your call; the logic to do that should be straightforward.
You can make it so that entire sentences are passed, one at a time, to the map() method, but that's much harder to do: you'd basically take that same logic and put it in a new input format type and a corresponding RecordReader. If you have a choice, go with the logic in the map() method rather than the input format type and record reader.
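Here's a rough sketch of the line-based approach (names are illustrative, and the sentence-ending rule is deliberately naive):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SentenceCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final Text SENTENCES = new Text("sentences");

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        int count = 0;
        for (char c : line.toString().toCharArray()) {
            // Naive rule: '.', '!' and '?' end a sentence; an ellipsis would
            // be counted three times, so refine this if that matters to you.
            if (c == '.' || c == '!' || c == '?') {
                count++;
            }
        }
        if (count > 0) {
            context.write(SENTENCES, new IntWritable(count));
        }
    }
}

The stock IntSumReducer (org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer) can then sum the per-line counts into a single total.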

Getting output files which contain the value of one key only?

I have a use-case with Hadoop where I would like my output files to be split by key. At the moment I have the reducer simply outputting each value in the iterator. For example, here's some Python streaming code:
import sys

for line in sys.stdin:
    data = line.split("\t")
    print data[1]
This method works for a small dataset (around 4GB). Each output file of the job only contains the values for one key.
However, if I increase the size of the dataset (over 40GB) then each file contains a mixture of keys, in sorted order.
Is there an easier way to solve this? I know that the output will be in sorted order and I could simply do a sequential scan and add to files. But it seems that this shouldn't be necessary since Hadoop sorts and splits the keys for you.
Question may not be the clearest, so I'll clarify if anyone has any comments. Thanks
OK, then create a custom jar implementation of your MapReduce solution and use MultipleTextOutputFormat as the OutputFormat, as explained here. You just have to emit the filename (in your case, the key) as the key in your reducer and the entire payload as the value, and your data will be written to a file named after your key.
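A minimal sketch of that (the subclass name is mine; note that MultipleTextOutputFormat lives in the old org.apache.hadoop.mapred API, so the job must be driven through JobConf):

import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

public class KeyBasedOutputFormat<K, V> extends MultipleTextOutputFormat<K, V> {

    @Override
    protected String generateFileNameForKeyValue(K key, V value, String name) {
        // Route every record to an output file named after its key.
        return key.toString();
    }
}

Then, in the driver: conf.setOutputFormat(KeyBasedOutputFormat.class);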
