What determines how many times map() will get called? - hadoop

I have a text file and a parser that parses each line and stores it into my customSplitInput. I do the parsing in my custom FileInputFormat, so my splits are custom-built. Right now I have 2 splits, and each split contains a list of my data.
But my mapper function is getting called repeatedly on the same split. I thought the mapper function would only get called based on the number of splits you have?
I don't know if this applies, but my custom InputSplit returns a fixed number from getLength() and an empty string array from getLocations(). I am unsure of what to put in for these.
@Override
public RecordReader<LongWritable, ArrayWritable> createRecordReader(
        InputSplit input, TaskAttemptContext taskContext)
        throws IOException, InterruptedException {
    logger.info(">>> Creating Record Reader");
    CustomRecordReader recordReader = new CustomRecordReader(
            (EntryInputSplit) input);
    return recordReader;
}

map() is called once for every record produced by the RecordReader that your InputFormat creates, not once per split. For example, TextInputFormat causes map() to be called for every line of input, even though a split usually contains many lines.
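In other words, the symptom you describe usually means the custom RecordReader's nextKeyValue() never returns false. Below is a minimal sketch of a reader that terminates correctly; it assumes your EntryInputSplit exposes its pre-parsed records through a hypothetical getEntries() accessor, which is not part of your posted code.

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// Sketch only: EntryInputSplit comes from the question, getEntries() is assumed.
public class CustomRecordReader extends RecordReader<LongWritable, ArrayWritable> {

    private List<ArrayWritable> entries;
    private int pos = -1;

    public CustomRecordReader(EntryInputSplit split) {
        this.entries = split.getEntries();   // hypothetical accessor
    }

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) {
        // nothing to do; the constructor already received the split
    }

    @Override
    public boolean nextKeyValue() {
        pos++;
        return pos < entries.size();   // returning false here is what stops
                                       // the framework from calling map() again
    }

    @Override
    public LongWritable getCurrentKey() {
        return new LongWritable(pos);
    }

    @Override
    public ArrayWritable getCurrentValue() {
        return entries.get(pos);
    }

    @Override
    public float getProgress() {
        return entries.isEmpty() ? 1.0f : (float) (pos + 1) / entries.size();
    }

    @Override
    public void close() throws IOException {
        // no resources to release in this sketch
    }
}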

Related

Chaining jobs using user defined class

I have to implement a graph algorithm using MapReduce. For this I have to chain jobs:
MAP1 -> REDUCE1 -> MAP2 -> REDUCE2 -> ...
I read the adjacency matrix from a file in MAP1 and create a user-defined Java class, Node, that contains the data and the child information. I want to pass this information on to MAP2.
But in REDUCE1, when I write
context.write(node, NullWritable.get());
the node data gets saved to a file as text, using the toString() of the Node class.
When MAP2 tries to read this Node information with
public void map(LongWritable key, Node node, Context context) throws IOException, InterruptedException
it complains that it cannot convert the text in the file to a Node.
I am not sure what the right approach is for this kind of job chaining in MapReduce.
REDUCE1 writes the Node in this format:
Node [nodeId=1, adjacentNodes=[Node [nodeId=2, adjacentNodes=[]], Node [nodeId=2, adjacentNodes=[]]]]
Actual exception:
java.lang.Exception: java.lang.ClassCastException:
org.apache.hadoop.io.Text cannot be cast to custom.node.nauty.Node
Based on the comments, the suggested changes that will make your code work are the following:
You should use SequenceFileInputFormat in mapper 2's job and SequenceFileOutputFormat in reducer 1's job, not TextInputFormat and TextOutputFormat. TextInputFormat reads a LongWritable key and a Text value, which is why you get this error.
Accordingly, you should also change the declaration of mapper 2 to accept a Node key and a NullWritable value.
Make sure that the Node class implements the Writable interface (or WritableComparable if you use it as a key). Then set the outputKeyClass of the first job to Node.class instead of Text.class (see the driver sketch below).
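Putting those changes together, a minimal driver sketch could look like the following. Mapper1, Reducer1, Mapper2, Reducer2 and the paths are placeholders, not the poster's actual classes; only Node comes from the question.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// Hypothetical driver sketch for chaining the two jobs through a sequence file.
public class GraphDriver {
    public static void main(String[] args) throws Exception {
        Path input = new Path(args[0]);
        Path intermediate = new Path(args[1]);
        Path output = new Path(args[2]);

        // Job 1: text in, sequence file of <Node, NullWritable> out
        Job job1 = Job.getInstance();
        job1.setJarByClass(GraphDriver.class);
        job1.setMapperClass(Mapper1.class);
        job1.setReducerClass(Reducer1.class);
        job1.setOutputKeyClass(Node.class);            // Node must implement Writable
        job1.setOutputValueClass(NullWritable.class);
        job1.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileInputFormat.addInputPath(job1, input);
        FileOutputFormat.setOutputPath(job1, intermediate);
        job1.waitForCompletion(true);

        // Job 2: reads the <Node, NullWritable> records back as objects,
        // so mapper 2 is declared as map(Node key, NullWritable value, Context context)
        Job job2 = Job.getInstance();
        job2.setJarByClass(GraphDriver.class);
        job2.setMapperClass(Mapper2.class);
        job2.setReducerClass(Reducer2.class);
        job2.setInputFormatClass(SequenceFileInputFormat.class);
        FileInputFormat.addInputPath(job2, intermediate);
        FileOutputFormat.setOutputPath(job2, output);
        job2.waitForCompletion(true);
    }
}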

How to skip reading the file header in hadoop mapreduce

I am learning Hadoop MapReduce using Java. I have a sample file with data as below; how do I skip processing the header line in this file? When I look at the mapper input, it includes the header as well.
roll no|school name|name|age|Gender|class|subject|marks
1|xyz|pqr|abc|10|M|1|science|98
Because you already know what the header looks like, you can simply skip it when you see it. This approach makes the application slightly slower, since every record is compared against the header.
@Override
public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    String row = value.toString();
    if (row.equals("roll no|school name|name|age|Gender|class|subject|marks"))
        return;
    // now you are header free
    // do some operations depending on your needs..
}
If you are running with a single mapper, you can skip the first record with a simple counter in an if condition. If you are running more than one mapper, check for the header string in an if condition instead, since only one mapper will ever see the header.
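Another option, assuming the header is always the first line of a single input file read with TextInputFormat (which supplies the byte offset as the key), is to skip the record whose offset is 0. A sketch against the same old mapred API as the snippet above; the class name and output types are placeholders:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Sketch only: skips the line at byte offset 0, i.e. the header line.
// Note that with several input files, each file's first line is skipped.
public class SkipHeaderMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    @Override
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        if (key.get() == 0) {
            return;   // offset 0 => header line, skip it
        }
        String[] fields = value.toString().split("\\|");
        // ... process fields and call output.collect(...) as needed
    }
}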

Creating custom InputFormat and RecordReader for Binary Files in Hadoop MapReduce

I'm writing an M/R job that processes large time-series data files written in a binary format that looks something like this (newlines added here for readability; the actual data is continuous, obviously):
TIMESTAMP_1---------------------TIMESTAMP_1
TIMESTAMP_2**********TIMESTAMP_2
TIMESTAMP_3%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%TIMESTAMP_3
.. etc
where each timestamp is simply an 8-byte struct, identifiable as such by its first 2 bytes. The actual data is bounded between duplicate timestamps, as displayed above, and contains one or more predefined structs. I would like to write a custom InputFormat that emits these key/value pairs to the mappers:
< TIMESTAMP_1, --------------------- >
< TIMESTAMP_2, ********** >
< TIMESTAMP_3, %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% >
Logically, I'd like to keep track of the current TIMESTAMP and aggregate all the data until that TIMESTAMP is detected again, then send out my <TIMESTAMP, DATA> pair as a record. My problem is syncing between splits inside the RecordReader, because a certain reader might receive the following split:
# a split occurs inside my data
reader X: TIMESTAMP_1--------------
reader Y: -------TIMESTAMP_1 TIMESTAMP_2****..
# or inside the timestamp
or even: #######TIMES
TAMP_1-------------- ..
What's a good way to approach this? Is there an easy way to access the file offsets so that my CustomRecordReader can sync between splits and not lose data? I feel I have some conceptual gaps in how splits are handled, so perhaps an explanation of these may help. Thanks.
In general it is not simple to create an InputFormat that supports splits, since you have to be able to find where to move from the split boundary in order to get consistent records. XmlInputFormat is a good example of a format that does this.
I would suggest first considering whether you really need splittable input. You can define your input format as not splittable and avoid all of these issues.
If your files are generally not much larger than the block size, you lose nothing. If they are, you will lose some data locality.
You can subclass a concrete subclass of FileInputFormat, for example SequenceFileAsBinaryInputFormat, and override the isSplitable() method to return false:
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileAsBinaryInputFormat;

public class NonSplitableBinaryFile extends SequenceFileAsBinaryInputFormat {

    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
    }

    @Override
    public RecordReader getRecordReader(InputSplit split, JobConf job,
            Reporter reporter) throws IOException {
        // return your customized record reader here
        return null;
    }
}
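If you are on the newer org.apache.hadoop.mapreduce API instead, the equivalent hook is isSplitable(JobContext, Path) on FileInputFormat. A sketch follows; TimestampRecordReader is a hypothetical stand-in for whatever custom reader you write to scan for the duplicated timestamp markers.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Sketch for the new API: the key/value types here are just an example.
public class NonSplittableTimestampInputFormat
        extends FileInputFormat<BytesWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;   // the whole file goes to one mapper, so no split syncing is needed
    }

    @Override
    public RecordReader<BytesWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new TimestampRecordReader();   // hypothetical custom reader, not shown
    }
}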

Map Reduce - Not able to get the right key

Hi, I am writing MapReduce code to find the maximum temperature. The problem is that I am getting the maximum temperature but without the corresponding key.
public static class TemperatureReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    Text year = new Text();
    int maxTemperature = Integer.MIN_VALUE;

    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        for (IntWritable valTemp : values) {
            maxTemperature = Math.max(maxTemperature, valTemp.get());
        }
        //System.out.println("The maximum temperature is " + maxTemperature);
        context.write(year, new IntWritable(maxTemperature));
    }
}
The mapper output looks something like:
1955 52
1958 7
1985 22
1999 32
and so on.
It is overwriting the keys and printing all the data. I want only the maximum temperature and its year.
I see a couple of things wrong with your code sample:
Reset maxTemperature inside the reduce method (as its first statement); at the moment you have a bug in that it outputs the maximum temperature seen across all preceding keys/values.
Where are you setting the contents of year? In fact you don't need to: just call context.write(key, new IntWritable(maxTemperature)), as the input key is already the year.
You might want to create an IntWritable instance variable and re-use it rather than creating a new IntWritable when writing out the output value (this is an efficiency point rather than a potential cause of your problem). A corrected sketch follows.
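Putting those points together, the reducer could look roughly like this (a sketch, not necessarily the poster's final code):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of the fixes above: the max is reset per key, the incoming key
// (the year) is written out, and the output IntWritable is reused.
public class TemperatureReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int maxTemperature = Integer.MIN_VALUE;   // reset for every year
        for (IntWritable valTemp : values) {
            maxTemperature = Math.max(maxTemperature, valTemp.get());
        }
        result.set(maxTemperature);
        context.write(key, result);               // key already holds the year
    }
}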

How to call Partitioner in Hadoop v 0.21

In my application I want to create as many reduce tasks as possible based on the keys. Right now my implementation writes all the keys and values to a single (reducer) output file. To solve this I have used a partitioner, but I cannot get the class to be called. The partitioner should be called after the selection map task and before the selection reduce task, but it is not. The code of the partitioner is the following:
public class MultiWayJoinPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int nbPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % nbPartitions;
    }
}
Is this code correct for partitioning the output based on the keys, and will the output be transferred to the reducers automatically?
You don't show all of your code, but there is usually a driver class (often called the "Job" or "MR" class) that configures the mapper, reducer, partitioner, etc. and then actually submits the job to Hadoop. In this class you will have a job configuration object with many properties, one of which is the number of reducers. Set this property to whatever number your Hadoop configuration can handle.
Once the job is configured with a given number of reducers, that number will be passed into your partitioner (which looks correct, by the way). Your partitioner will then return the appropriate reducer/partition for each key/value pair. That's how you get as many reducers as possible.
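For reference, the relevant part of such a driver class might look like the sketch below. JoinMapper and JoinReducer are placeholder names and the reducer count is arbitrary; only MultiWayJoinPartitioner comes from the question. The old-style Job constructor is used since the question targets Hadoop 0.21.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver sketch showing where the partitioner and reducer count are set.
public class MultiWayJoinDriver {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration());
        job.setJarByClass(MultiWayJoinDriver.class);
        job.setMapperClass(JoinMapper.class);
        job.setReducerClass(JoinReducer.class);

        // Without these two lines everything lands in a single reduce output file.
        job.setPartitionerClass(MultiWayJoinPartitioner.class);
        job.setNumReduceTasks(8);   // pick a count your cluster can handle

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}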
