Mapper input Key-Value pair in Hadoop - hadoop

Normally, we write the mapper in the form :
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable>
Here the input key-value pair for the mapper is <LongWritable, Text> - as far as I know, when the mapper gets the input data it goes through it line by line - so the key for the mapper signifies the line number - please correct me if I am wrong.
My question is: if I give the input key-value pair for the mapper as <Text, Text>, then it gives the error
java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.Text
Is it mandatory to give the input key-value pair of the mapper as <LongWritable, Text>? If yes, then why? If no, then what is the reason for the error? Can you please help me understand the proper reasoning behind the error?
Thanks in advance.

The input to the mapper depends on what InputFormat is used. The InputFormat is responsible for reading the incoming data and shaping it into whatever format the Mapper expects. The default InputFormat is TextInputFormat, which extends FileInputFormat<LongWritable, Text>.
If you do not change the InputFormat, using a Mapper with a different key-value type signature than <LongWritable, Text> will cause this error. If you expect <Text, Text> input, you will have to choose an appropriate InputFormat. You can set the InputFormat in the Job setup:
job.setInputFormatClass(MyInputFormat.class);
And like I said, by default this is set to TextInputFormat.
Now, let's say your input data is a bunch of newline-separated records delimited by a comma:
"A,value1"
"B,value2"
If you want the input key to the mapper to be ("A", "value1"), ("B", "value2") you will have to implement a custom InputFormat and RecordReader with the <Text, Text> signature. Fortunately, this is pretty easy. There is an example here and probably a few examples floating around StackOverflow as well.
In short, add a class which extends FileInputFormat<Text, Text> and a class which extends RecordReader<Text, Text>. Override the FileInputFormat#createRecordReader method and have it return an instance of your custom RecordReader.
Then you will have to implement the required RecordReader logic. The simplest way to do this is to create an instance of LineRecordReader inside your custom RecordReader and delegate all basic responsibilities to that instance. In the getCurrentKey and getCurrentValue methods you implement the logic for extracting the comma-delimited Text contents by calling LineRecordReader#getCurrentValue and splitting it on the comma.
Finally, set your new InputFormat as the Job's InputFormat, as shown in the job setup snippet above.
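Below is a minimal sketch of such a pair of classes, assuming the newer mapreduce API. The names CommaInputFormat and CommaRecordReader are just illustrative, and the comma split is done once per record in nextKeyValue rather than in the getters, but the idea is the same:
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Emits <Text, Text> pairs by splitting each input line on the first comma.
public class CommaInputFormat extends FileInputFormat<Text, Text> {

    @Override
    public RecordReader<Text, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
        return new CommaRecordReader();
    }

    public static class CommaRecordReader extends RecordReader<Text, Text> {
        // Delegate splits, compression and line reading to the stock LineRecordReader.
        private final LineRecordReader lineReader = new LineRecordReader();
        private final Text key = new Text();
        private final Text value = new Text();

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException {
            lineReader.initialize(split, context);
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            if (!lineReader.nextKeyValue()) {
                return false;
            }
            // "A,value1" -> key "A", value "value1"
            String[] parts = lineReader.getCurrentValue().toString().split(",", 2);
            key.set(parts[0]);
            value.set(parts.length > 1 ? parts[1] : "");
            return true;
        }

        @Override
        public Text getCurrentKey() {
            return key;
        }

        @Override
        public Text getCurrentValue() {
            return value;
        }

        @Override
        public float getProgress() throws IOException {
            return lineReader.getProgress();
        }

        @Override
        public void close() throws IOException {
            lineReader.close();
        }
    }
}
You would then register it with job.setInputFormatClass(CommaInputFormat.class); as described above.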

In the book "Hadoop: The Definitive Guide" by Tom White, I think he has an appropriate answer to this (pg. 197):
"TextInputFormat’s
keys, being simply the offset within the file, are not normally very
useful. It is common for each line in a file to be a key-value pair, separated by a delimiter
such as a tab character. For example, this is the output produced by
TextOutputFormat, Hadoop’s default
OutputFormat. To interpret such files correctly,
KeyValueTextInputFormat
is appropriate.
You can specify the separator via the
key.value.separator.in.input.line
property. It
is a tab character by default."
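As a hedged sketch of what that looks like with the newer mapreduce API (note that the property was renamed there; the name below is the mapreduce-API one, while the book quotes the older key.value.separator.in.input.line):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class KeyValueDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Each line is split into key and value at the first separator (tab by default);
        // override it here, e.g. with a comma.
        conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");

        Job job = Job.getInstance(conf, "key value input example");
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        // The mapper can now be declared as Mapper<Text, Text, ...>.
        // ... set mapper, reducer, output types and paths as usual ...
    }
}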

With the default TextInputFormat, the key for the mapper input is the byte offset of the line within the file (a LongWritable), and the value is the whole line.
The record reader reads a single line per cycle. The output of the mapper can be whatever you want (it can be (Text, Text) or (Text, IntWritable) or anything else).

Related

how to design 1 mapper for 1 text file in Mapreduce

I am running MapReduce on Hadoop 2.9.0.
My problem:
I have a number of text files (about 10-100). Each file is very small in terms of size, but because of the logic of my problem, I need one mapper to handle one text file. The results of these mappers will be aggregated by my reducers.
I need to design the job so that the number of mappers always equals the number of files. How do I do that in Java code? What kind of class do I need to extend?
Thanks a lot.
I've had to do something very similar, and faced similar problems to you.
The way I achieved this was to feed in a text file containing the paths to each file, for example the text file would contain this kind of information:
/path/to/filea
/path/to/fileb
/a/different/path/to/filec
/a/different/path/to/another/called/filed
I'm not sure what exactly you want your mappers to do, but when creating your job, you want to do the following:
public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "My Map reduce application");
    job.setJarByClass(Main.class);
    job.setMapperClass(CustomMapper.class);
    job.setInputFormatClass(NLineInputFormat.class);
    ...
}
Your CustomMapper.class will want to extend Mapper like so:
public class CustomMapper extends Mapper<LongWritable, Text, <Reducer Key>, <Reducer Value>> {
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        Configuration configuration = context.getConfiguration();
        ObjectTool tool = new ObjectTool(configuration, new Path(value.toString()));
        context.write(<reducer key>, <reducer value>);
    }
}
Where ObjectTool is another class which deals with what you want to actually do with your files.
So let me explain broadly what this is doing. The magic here is job.setInputFormatClass(NLineInputFormat.class), but what is it doing exactly?
It essentially takes your input, splits the data line by line, and sends each line to its own mapper. By having a text file that lists each file on a new line, you create a 1:1 relationship between mappers and files. A great addition to this setup is that it allows you to create advanced tooling for the files you want to deal with.
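If it helps, a couple of extra lines in the driver above make that intent explicit (this assumes the mapreduce-API NLineInputFormat; the listing-file path is just a placeholder):
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

// One listing line per map task (1 is the default, set explicitly for clarity).
NLineInputFormat.setNumLinesPerSplit(job, 1);
// The job's input is the listing file, not the data files it points to.
FileInputFormat.addInputPath(job, new Path("/path/to/listing.txt"));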
I used this to create a compression tool in HDFS. When I was researching approaches, a lot of people were essentially reading the file to stdout and compressing it that way; however, when it came to doing a checksum on the original file versus the file after being compressed and decompressed, the results were different. This was due to the type of data in these files, and there was no easy way to implement BytesWritable. (Information on cat'ing files to stdout can be seen here.)
That link also quotes the following:
org.apache.hadoop.mapred.lib.NLineInputFormat is the magic here. It basically tells the job to feed one file per maptask
Hope this helps!

hadoop, word count in paragraph

Normally, Hadoop examples show how to do a word count for a file or multiple files; the resulting word count is over the entire input set!
I wish to do a word count for each paragraph and store them in separate files like paragraph(i)_wordcnt.txt.
How do I do that? (The issue is that the mapper runs over the entire input and the reducer collects the output at the end!
Can I do something like: when I reach a specific marker, write the results?)
Say the file content is:
para1
...
para2
...
para3
...
Can I, on seeing para2, write the results of the word count of para1? Or, the other way, if each paragraph is written to a separate file, how do I do something like this sequence:
loop:
file(i)(parai)->Mapper->Reducer->multipleOutput(output-file(i))->writetofile(i);
i++;
goto loop;
You need to make the RecordReader read a paragraph at a time. See this question: Overriding RecordReader to read Paragraph at once instead of line
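If your Hadoop version is recent enough, one shortcut worth knowing (an alternative to writing a RecordReader yourself, assuming paragraphs are separated by a blank line) is the textinputformat.record.delimiter property honoured by the default LineRecordReader:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
// Each record passed to map() is now everything up to a blank line, i.e. one paragraph.
conf.set("textinputformat.record.delimiter", "\n\n");
Job job = Job.getInstance(conf, "paragraph word count");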
Here is the basic idea of how we can do it.
I think we have to run a chained mapper and reducer for this process.
In the first mapper, use a RecordReader that sets its key to the whole paragraph. This way we get as many keys as there are paragraphs. Then use an identity reducer, and feed the output of that reducer to a new mapper which gets a paragraph as its key.
Now, since the new mapper has a paragraph as its key, you can tweak the famous word count code for your need (just swapping KEYS with VALUES here; all the rest stays the same).
Since you have chained a mapper after a reducer, getting the word count of each paragraph into separate files will be easy.
Please tell me if my method is not correct.
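Whichever mapper design you pick, the "separate file per paragraph" part can be handled with MultipleOutputs in the reducer. A sketch, assuming (purely for illustration) that the mapper emits a composite key of the form paragraphId + tab + word:
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class ParagraphCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private MultipleOutputs<Text, IntWritable> out;

    @Override
    protected void setup(Context context) {
        out = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        // Key is assumed to look like "para1<TAB>word"; the paragraph id names the output file.
        String[] parts = key.toString().split("\t", 2);
        String paragraphId = parts[0];
        String word = parts.length > 1 ? parts[1] : key.toString();
        out.write(new Text(word), new IntWritable(sum), paragraphId + "_wordcnt");
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        out.close();
    }
}
The files then come out named like para1_wordcnt-r-00000 inside the job's output directory.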

How to just output value in context.write(k,v)

In my mapreduce job, I just want to output some lines.
But if I code like this:
context.write(data, null);
the program will throw java.lang.NullPointerException.
I don't want to code like below:
context.write(data, new Text(""));
because then I would have to trim the trailing blank space from every line in the output files.
Is there any good ways to solve it?
Thanks in advance.
Sorry, it was my mistake. I checked the program carefully and found the reason: I had set the Reducer as the combiner.
If I do not use the combiner, the statement
context.write(data, null);
in the reducer works fine. In the output data file, there is just the data line.
To share the NullWritable explanation from the Hadoop definitive guide:
NullWritable is a special type of Writable, as it has a zero-length serialization. No bytes are written to, or read from, the stream. It is used as a placeholder; for example, in MapReduce, a key or a value can be declared as a NullWritable when you don’t need to use that position—it effectively stores a constant empty value. NullWritable can also be useful as a key in SequenceFile when you want to store a list of values, as opposed to key-value pairs. It is an immutable singleton: the instance can be retrieved by calling NullWritable.get().
You should use NullWritable for this purpose.
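A minimal sketch of that, assuming the data you want to keep is Text:
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;

// In the driver: declare the output value type as NullWritable.
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(NullWritable.class);

// In the reducer (or the mapper of a map-only job): write the singleton instead of null.
context.write(data, NullWritable.get());
With TextOutputFormat, a NullWritable value also suppresses the key-value separator, so each output line contains only the data.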

"Map" and "Reduce" functions in Hadoop's MapReduce

I've been looking at this word count example by hadoop:
http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html#Source+Code
And I'm a little confused about the Map function. In the map function shown, it takes in a "key" of type LongWritable, but this parameter is never used in the body of the Map function. What does the application programmer expect Hadoop to pass in for this key? Why does a map function require a key if it simply parses values from a line of text? Can someone give me an example where both a key and a value are required for input? I only see map as V1 -> (K2, V2).
Another question: in the real implementation of Hadoop, are there multiple reduction steps? If so, how does Hadoop apply the same reduction function multiple times if the function is (K2, V2) -> (K3, V3)? If another reduction is performed, it needs to take in type (K3, V3)...
Thank you!
There's a key there because the map() method is always passed a key and a value (and a context). It's up to you whether you actually use the key and/or value. In this case, the key is the byte offset of the line within the file being read. The word count logic doesn't need that. The map() method just uses the value, which in the case of a text file is a line of the file.
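For reference, here is roughly what that mapper looks like in the newer mapreduce API (the linked tutorial uses the older mapred API, so treat this as a paraphrase rather than the exact source); note that the key parameter is received but never read:
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // 'key' is the byte offset of this line in the file; word count never reads it.
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}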
As to your second question (which really should be its own Stack Overflow question), you may have any number of map/reduce jobs in a Hadoop workflow. Some of those jobs will read pre-existing files as input and others will read the output of other jobs. Each job has one map phase followed by at most one reduce phase (which may run as many parallel reducer tasks); another round of reduction means chaining another job.

Getting byte offset with MRJob

According to "The Definitive Guide to Hadoop", the input format TextInputFormat gives key value pairs (k, v) = (byte offset, line). However, in MRJob, the key in the mapper input is always None. It should be easy to get the byte offset as key, since that's what TextInputFormat does. How do I get this?
I know that you can use the environment variable 'map_input_start' and calculate byte offsets yourself, but this has caused problems and I would like to do it the much simpler way of just getting the offset as key.
The TextInputFormat is a Java class ... I do not see how that would work in the streaming world.
Doesn't defining the map method in your mapper class with the following (old-API) signature give you the byte offset as the key?
public void map(LongWritable key, Text value, OutputCollector<KOUT, VOUT> output, Reporter reporter)
