How to just output value in context.write(k,v) - hadoop

In my mapreduce job, I just want to output some lines.
But if I code like this:
context.write(data, null);
the program will throw java.lang.NullPointerException.
I don't want to code like below:
context.write(data, new Text(""));
because then I would have to trim the trailing blank from every line in the output files.
Is there any good way to solve this?
Thanks in advance.
Sorry, it was my mistake. I checked the program carefully and found the reason: I had set the Reducer as the combiner.
If I do not use the combiner, the statement
context.write(data, null);
in the reducer works fine. In the output data file, there is just the data line.
Share the NullWritable explanation from hadoop definitive guide:
NullWritable is a special type of Writable, as it has a zero-length serialization. No bytes are written to, or read from, the stream. It is used as a placeholder; for example, in MapReduce, a key or a value can be declared as a NullWritable when you don't need to use that position—it effectively stores a constant empty value. NullWritable can also be useful as a key in SequenceFile when you want to store a list of values, as opposed to key-value pairs. It is an immutable singleton: the instance can be retrieved by calling NullWritable.get().

You should use NullWritable for this purpose.
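A minimal sketch of a reducer that emits only the data line might look roughly like this (the class name and the Text input types are illustrative, not from the original question):

import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Declares NullWritable as the output value type and writes the singleton
// NullWritable.get() instead of null, so only the key appears in the output.
public class DataOnlyReducer extends Reducer<Text, Text, Text, NullWritable> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        context.write(key, NullWritable.get());
    }
}

Remember to declare the matching output value class in the driver (job.setOutputValueClass(NullWritable.class)). TextOutputFormat treats a NullWritable value specially and skips both the value and the separator, so each output line contains just the data.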

Related

Mapreduce without reducer function

I have a file of data and my task is to use MapReduce to create new data from each line of the file, because the file is huge.
For example, the file contains the expression (3 -4 *7-4) and I need to randomly create a new expression from it, such as (3+4/7*4). When I implement the task using MapReduce, I use the map to do the change and the reduce just to receive the data from the mapper and sort it. Is it correct to use just the map to do the main task?
If you do not need sorting of the map results, set the number of reducers to 0 (by calling
job.setNumReduceTasks(0);
in your driver code), and the job is called map-only. See the driver sketch below.
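A minimal driver for such a map-only job could look like this (ExpressionDriver, ExpressionMapper and the output types are placeholders for your own classes):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ExpressionDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "rewrite expressions");
        job.setJarByClass(ExpressionDriver.class);

        job.setMapperClass(ExpressionMapper.class); // your mapper that rewrites the expression
        job.setNumReduceTasks(0);                   // map-only: mapper output is written directly

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}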
Your implementation is correct. Just make sure the keys output from the mapper are all unique if you don't want any expressions that happen to be identical being combined.
For example, since you said you have a huge data file, there may be a possibility that you get two expressions such as 3-4*7-4 and 3*4/7+4 and both new expressions turn out to be 3+4*7-4. If you use the expression as the key, the reducer will only get called once for both expressions. If you don't want this to happen, make sure you use a unique number for each key.

Mapper input Key-Value pair in Hadoop

Normally, we write the mapper in the form :
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable>
Here the input key-value pair for the mapper is <LongWritable, Text> - as far as I know, when the mapper gets the input data it goes through it line by line - so the key for the mapper signifies the line number - please correct me if I am wrong.
My question is : If I give the input key-value pair for mapper as <Text, Text> then it is giving the error
java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.Text
Is it mandatory to give the input key-value pair of the mapper as <LongWritable, Text>? If yes, then why? If no, then what is the reason for the error? Can you please help me understand the proper reasoning behind the error?
Thanks in advance.
The input to the mapper depends on what InputFormat is used. The InputFormat is responsible for reading the incoming data and shaping it into whatever format the Mapper expects. The default InputFormat is TextInputFormat, which extends FileInputFormat<LongWritable, Text>.
If you do not change the InputFormat, using a Mapper with a different key-value type signature than <LongWritable, Text> will cause this error. If you expect <Text, Text> input, you will have to choose an appropriate InputFormat. You can set the InputFormat in the Job setup:
job.setInputFormatClass(MyInputFormat.class);
And like I said, by default this is set to TextInputFormat.
Now, let's say your input data is a bunch of newline-separated records delimited by a comma:
"A,value1"
"B,value2"
If you want the input key to the mapper to be ("A", "value1"), ("B", "value2") you will have to implement a custom InputFormat and RecordReader with the <Text, Text> signature. Fortunately, this is pretty easy. There is an example here and probably a few examples floating around StackOverflow as well.
In short, add a class which extends FileInputFormat<Text, Text> and a class which extends RecordReader<Text, Text>. Override the FileInputFormat#createRecordReader method (getRecordReader in the old mapred API), and have it return an instance of your custom RecordReader.
Then you will have to implement the required RecordReader logic. The simplest way to do this is to create an instance of LineRecordReader in your custom RecordReader and delegate all basic responsibilities to this instance. In the getCurrentKey and getCurrentValue methods you implement the logic for extracting the comma-delimited Text contents by calling LineRecordReader#getCurrentValue and splitting it on the comma.
Finally, set your new InputFormat as Job InputFormat as shown after the second paragraph above.
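A rough sketch of such a delegating implementation with the new mapreduce API (class names are illustrative, error handling omitted):

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class CommaInputFormat extends FileInputFormat<Text, Text> {

    @Override
    public RecordReader<Text, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
        return new CommaRecordReader();
    }

    // Delegates line reading to LineRecordReader and splits each line on the first comma.
    public static class CommaRecordReader extends RecordReader<Text, Text> {
        private final LineRecordReader lineReader = new LineRecordReader();
        private final Text key = new Text();
        private final Text value = new Text();

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException {
            lineReader.initialize(split, context);
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            if (!lineReader.nextKeyValue()) {
                return false;
            }
            String[] parts = lineReader.getCurrentValue().toString().split(",", 2);
            key.set(parts[0]);
            value.set(parts.length > 1 ? parts[1] : "");
            return true;
        }

        @Override
        public Text getCurrentKey() { return key; }

        @Override
        public Text getCurrentValue() { return value; }

        @Override
        public float getProgress() throws IOException, InterruptedException {
            return lineReader.getProgress();
        }

        @Override
        public void close() throws IOException {
            lineReader.close();
        }
    }
}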
In the book "Hadoop: The Definitive Guide" by Tom White, I think he has an appropriate answer to this (p. 197):
"TextInputFormat's keys, being simply the offset within the file, are not normally very useful. It is common for each line in a file to be a key-value pair, separated by a delimiter such as a tab character. For example, this is the output produced by TextOutputFormat, Hadoop's default OutputFormat. To interpret such files correctly, KeyValueTextInputFormat is appropriate. You can specify the separator via the key.value.separator.in.input.line property. It is a tab character by default."
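If your separator is a simple character like a comma, KeyValueTextInputFormat alone may be enough. A minimal sketch of the job setup, assuming the newer mapreduce.* name of the property quoted above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class KeyValueJobSetup {
    public static Job createJob() throws Exception {
        Configuration conf = new Configuration();
        // Use "," instead of the default tab as the key/value separator.
        // (Older releases use the key.value.separator.in.input.line name quoted above.)
        conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");
        Job job = Job.getInstance(conf, "key-value input example");
        // The mapper can now be declared as Mapper<Text, Text, ...>.
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        return job;
    }
}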
With the default TextInputFormat, the key for the mapper input is a LongWritable: it indicates the line's byte offset in the file, and the value is the whole line.
The record reader reads a single line per cycle. The output of the mapper can be whatever you want (it can be (Text, Text) or (Text, IntWritable) or anything else).

hadoop, word count in paragraph

Normally, Hadoop examples show how to do a word count for a file or for multiple files; the result of the word count is for the entire input set!
I wish to do the word count for each paragraph and store the results in separate files like paragraph(i)_wordcnt.txt.
How can I do it? (The issue is that the mapper runs over the entire set and the reducer collects the output only at the end!
Can I do something like writing the results when I reach a specific marker?)
Say the file content is:
para1
...
para2
...
para3
...
Can I, on seeing para2, write the results of the word count of para1? Or, if each paragraph is instead written to a separate file, how can I do it in a sequence like this:
loop:
file(i)(parai)->Mapper->Reducer->multipleOutput(output-file(i))->writetofile(i);
i++;
goto loop;
You need to make the RecordReader read a paragraph at a time. See this question: Overriding RecordReader to read Paragraph at once instead of line
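If your paragraphs are separated by blank lines, one way to get paragraph-at-a-time records without writing a custom RecordReader is to change TextInputFormat's record delimiter (honoured by reasonably recent Hadoop versions). A sketch:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ParagraphJobSetup {
    public static Job createJob() throws Exception {
        Configuration conf = new Configuration();
        // Treat a blank line as the end of a record, so each map() call sees one paragraph.
        conf.set("textinputformat.record.delimiter", "\n\n");
        Job job = Job.getInstance(conf, "per-paragraph word count");
        // ... set mapper, reducer, input/output paths as usual ...
        return job;
    }
}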
I am describing the basic idea of how we can do it.
I think we have to run chained mappers and reducers for this process.
In the first mapper you have to use a RecordReader and set its key to the whole paragraph. This way we will get as many keys as there are paragraphs. Then use an identity reducer and feed the output of that reducer into a new mapper, which will get a paragraph as its key.
Now, since you have the paragraph in your new mapper, you can tweak the famous word count code for your needs (just swap KEYS with VALUES here and all the rest stays the same).
Since the word count mapper is chained after a reducer, getting the word count of a paragraph into separate files will be easy.
Please tell me if my method is not correct.
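To land each paragraph's counts in its own file, MultipleOutputs can be used in the reducer. A hedged sketch, assuming the mapper emits keys of the form "paragraphId\tword" (that key layout and the file naming are assumptions, not something from the question):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Sums counts per (paragraph, word) and writes them to a file named after the
// paragraph id instead of the default part-r-* file.
public class ParagraphWordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private MultipleOutputs<Text, IntWritable> out;

    @Override
    protected void setup(Context context) {
        out = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        String[] parts = key.toString().split("\t", 2);
        String paragraphId = parts[0];
        String word = parts.length > 1 ? parts[1] : "";
        // Third argument is the base output path, e.g. "para1_wordcnt".
        out.write(new Text(word), new IntWritable(sum), paragraphId + "_wordcnt");
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        out.close();
    }
}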

"Map" and "Reduce" functions in Hadoop's MapReduce

I've been looking at this word count example by hadoop:
http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html#Source+Code
And I'm a little confused about the Map function. In the map function shown, it takes in a "key" of type LongWritable, but this parameter is never used in the body of the Map function. What does the application programmer expect Hadoop to pass in for this key? Why does a map function require a key if it simply parses values from a line of text or something. Can someone give me an example where both a key and a value is required for input? I only see map as V1 -> (K2, V2).
Another question: in the real implementation of Hadoop, are there multiple reduction steps? If so, how does Hadoop apply the same reduce function multiple times if the function is (K2, V2) -> (K3, V3)? If another reduction is performed, it needs to take in type (K3, V3)...
Thank you!
There's a key there because the map() method is always passed a key and a value (and a context). It's up to you whether you actually use the key and/or the value. In this case, the key represents the byte offset of the line being read from the file. The word count logic doesn't need that. The map() method just uses the value, which in the case of a text file is a line of the file.
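For example, in a word count mapper the key parameter is simply ignored (a sketch of the familiar example, not the exact tutorial code):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // 'key' is the byte offset of the line; word count never looks at it.
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}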
As to your second question (which really should be its own stack overflow question), you may have any number of map/reduce jobs in a hadoop workflow. Some of those jobs will read as input pre-existing files and others will read the output of other jobs. Each job will have one or more mappers and a single reducer.

Getting output files which contain the value of one key only?

I have a use-case with Hadoop where I would like my output files to be split by key. At the moment I have the reducer simply outputting each value in the iterator. For example, here's some python streaming code:
import sys

for line in sys.stdin:
    data = line.split("\t")
    print data[1]
This method works for a small dataset (around 4GB). Each output file of the job only contains the values for one key.
However, if I increase the size of the dataset (over 40GB) then each file contains a mixture of keys, in sorted order.
Is there an easier way to solve this? I know that the output will be in sorted order and I could simply do a sequential scan and add to files. But it seems that this shouldn't be necessary since Hadoop sorts and splits the keys for you.
Question may not be the clearest, so I'll clarify if anyone has any comments. Thanks
Ok, then create a custom jar implementation of your MapReduce solution and use MultipleTextOutputFormat as the OutputFormat, as explained here. You just have to emit the filename (in your case, the key) as the key in your reducer and the entire payload as the value, and your data will be written to the file named after your key.
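A minimal sketch of that approach using the old mapred API's MultipleTextOutputFormat (the Text/Text generic types are an assumption about your data):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

// Writes each record to a file named after its key instead of part-00000 etc.
public class KeyBasedOutputFormat extends MultipleTextOutputFormat<Text, Text> {

    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        // 'name' is the default leaf file name (e.g. "part-00000");
        // returning the key groups all values for that key into one file.
        return key.toString();
    }
}

In the driver (old JobConf API) you would then set conf.setOutputFormat(KeyBasedOutputFormat.class). Note that MultipleTextOutputFormat belongs to the old org.apache.hadoop.mapred API; with the new mapreduce API the rough equivalent is MultipleOutputs, as sketched in the per-paragraph word count answer above.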
