Chaining jobs using user defined class - hadoop

I have to implement a graph algorithm using MapReduce. For this I have to chain jobs.
MAP1 -> REDUCE1 -> MAP2 -> REDUCE2 -> ...
I will be reading the adjacency matrix from a file in MAP1 and creating a user-defined Java class Node that will contain the data and the child information. I want to pass this information to MAP2.
But in REDUCE1, when I write
context.write(node, NullWritable.get());
the node data gets saved in the output file in text format, using the toString() of the Node class.
When MAP2 tries to read this Node information,
public void map(LongWritable key, Node node, Context context) throws IOException, InterruptedException
it says that it cannot convert the text in the file to Node.
I am not sure what the right approach is for this kind of job chaining in MapReduce.
The REDUCE1 writes the Node in this format:
Node [nodeId=1, adjacentNodes=[Node [nodeId=2, adjacentNodes=[]], Node [nodeId=2, adjacentNodes=[]]]]
Actual exception:
java.lang.Exception: java.lang.ClassCastException:
org.apache.hadoop.io.Text cannot be cast to custom.node.nauty.Node

Based on the comments, the suggested changes that will make your code work are the following:
You should use SequenceFileOutputFormat for the output of reducer1 and SequenceFileInputFormat for the input of mapper2, not TextOutputFormat and TextInputFormat. TextInputFormat reads a LongWritable key and a Text value, which is why you get this error.
Accordingly, you should also change the declaration of the second mapper to accept a Node key and a NullWritable value.
Make sure that the Node class implements the Writable interface (or WritableComparable if you use it as a key). Then set the output key class of the first job to Node.class instead of Text.class.
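A minimal sketch of the driver wiring this implies is below. It assumes Node implements WritableComparable&lt;Node&gt; (with write(), readFields() and compareTo()); the driver, mapper and reducer class names and the paths are illustrative, not taken from the original code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class GraphDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Job job1 = new Job(conf, "graph-pass-1");   // Job.getInstance(conf, ...) in newer Hadoop versions
        job1.setJarByClass(GraphDriver.class);
        job1.setMapperClass(Map1.class);
        job1.setReducerClass(Reduce1.class);
        job1.setOutputKeyClass(Node.class);              // Node, not Text
        job1.setOutputValueClass(NullWritable.class);
        // Binary output keeps the serialized Node objects instead of their toString()
        job1.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileInputFormat.addInputPath(job1, new Path(args[0]));
        FileOutputFormat.setOutputPath(job1, new Path("pass1"));
        if (!job1.waitForCompletion(true)) System.exit(1);

        Job job2 = new Job(conf, "graph-pass-2");
        job2.setJarByClass(GraphDriver.class);
        // Map2 is now declared as map(Node key, NullWritable value, Context context)
        job2.setMapperClass(Map2.class);
        job2.setReducerClass(Reduce2.class);
        job2.setOutputKeyClass(Node.class);
        job2.setOutputValueClass(NullWritable.class);
        // Reads the Node keys back directly; no text parsing is involved
        job2.setInputFormatClass(SequenceFileInputFormat.class);
        FileInputFormat.addInputPath(job2, new Path("pass1"));
        FileOutputFormat.setOutputPath(job2, new Path(args[1]));
        System.exit(job2.waitForCompletion(true) ? 0 : 1);
    }
}

The key point is that SequenceFileOutputFormat stores the serialized Node objects, so the second job gets them back via readFields() instead of parsing toString() output.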

Related

What determines how many times map() will get called?

I have a text file and a parser that parses each line and stores it into my customSplitInput. I do the parsing in my custom FileInputFormat, so my splits are custom. Right now I have 2 splits, and each split contains a list of my data.
But right now my mapper function is getting called repeatedly on the same split. I thought the mapper function would only get called once per split?
I don't know if this applies, but my custom InputSplit returns a fixed number for getLength() and an empty string array for getLocations(). I am unsure what to put in for these.
@Override
public RecordReader<LongWritable, ArrayWritable> createRecordReader(
        InputSplit input, TaskAttemptContext taskContext)
        throws IOException, InterruptedException {
    logger.info(">>> Creating Record Reader");
    CustomRecordReader recordReader = new CustomRecordReader(
            (EntryInputSplit) input);
    return recordReader;
}
map() is called once for every record from the RecordReader in (or referenced by) your InputFormat. For example, TextInputFormat calls map() for every line in the input, even though there are usually many lines in a split.
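For reference, the mapreduce-API Mapper drives map() with a loop roughly like the following (a paraphrase of Mapper.run(), not the exact Hadoop source). One common cause of map() apparently being called over and over within a split is a RecordReader whose nextKeyValue() never returns false.

// Paraphrased sketch of org.apache.hadoop.mapreduce.Mapper#run:
// one map() call per record delivered by the RecordReader for this split.
public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
        while (context.nextKeyValue()) {   // pulls the next record from the RecordReader
            map(context.getCurrentKey(), context.getCurrentValue(), context);
        }
    } finally {
        cleanup(context);
    }
}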

How to sort values fed to a reducer in a specific order

In my map-reduce job, the mapper's output type is <Text, FileAlias> and the class FileAlias has two attributes as follows
public class FileAlias extends Configured implements WritableComparable<FileAlias> {
    public boolean isAlias;
    public String value;
    ...
}
For each output key (of type Text), only one of the output values (of type FileAlias) has its isAlias attribute set to true. I would like this output value to be the first item in the Iterable fed to the reducer. Is there any way to do that?
Take a look at the setGroupingComparatorClass method on the Job object. You should be able to implement a comparator that puts the FileAlias with isAlias == true first in the Iterable passed to the reduce task.
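Note that comparators operate on the map output key, so this only works if the isAlias flag travels in the key (the usual secondary-sort pattern, e.g. a composite of the Text key plus the flag). Under that assumption, a minimal sketch of such a comparator might look like this; the class name is illustrative:

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Orders FileAlias instances so that isAlias == true comes first.
public class FileAliasComparator extends WritableComparator {

    protected FileAliasComparator() {
        super(FileAlias.class, true);   // true => instantiate keys for compare()
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        FileAlias left = (FileAlias) a;
        FileAlias right = (FileAlias) b;
        if (left.isAlias != right.isAlias) {
            return left.isAlias ? -1 : 1;   // true sorts before false
        }
        return left.value.compareTo(right.value);
    }
}

It would then be registered on the Job via setSortComparatorClass (to control the order within a group) and/or setGroupingComparatorClass (to control which keys end up in the same reduce call).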

Creating custom InputFormat and RecordReader for Binary Files in Hadoop MapReduce

I'm writing an M/R job that processes large time-series data files written in a binary format that looks something like this (newlines added here for readability; the actual data is continuous, obviously):
TIMESTAMP_1---------------------TIMESTAMP_1
TIMESTAMP_2**********TIMESTAMP_2
TIMESTAMP_3%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%TIMESTAMP_3
.. etc
Each timestamp is simply an 8-byte struct, identifiable as such by its first 2 bytes. The actual data is bounded between duplicate timestamps, as displayed above, and contains one or more predefined structs. I would like to write a custom InputFormat that will emit these key/value pairs to the mappers:
< TIMESTAMP_1, --------------------- >
< TIMESTAMP_2, ********** >
< TIMESTAMP_3, %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% >
Logically, I'd like to keep track of the current TIMESTAMP, and aggregate all the data until that TIMESTAMP is detected again, then send out my <TIMESTAMP, DATA> pair as a record. My problem is syncing between splits inside the RecordReader, so if a certain reader receives the following split
# a split occurs inside my data
reader X: TIMESTAMP_1--------------
reader Y: -------TIMESTAMP_1 TIMESTAMP_2****..
# or inside the timestamp
or even: #######TIMES
TAMP_1-------------- ..
What's a good way to approach this? Is there an easy way to access the file offsets so that my CustomRecordReader can sync between splits and not lose data? I feel I have some conceptual gaps about how splits are handled, so perhaps an explanation of these may help. Thanks.
In general it is not simple to create an input format that supports splits, since you have to be able to find out where to move from the split boundary in order to get consistent records. XmlInputFormat is a good example of a format that does this.
I would suggest first considering whether you really need splittable input. You can define your input format as not splittable and avoid all these issues.
If your files are generally not much larger than the block size, you lose nothing. If they are, you will lose part of the data locality.
You can subclass a concrete subclass of FileInputFormat, for example SequenceFileAsBinaryInputFormat, and override the isSplitable() method to return false:
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileAsBinaryInputFormat;

public class NonSplitableBinaryFile extends SequenceFileAsBinaryInputFormat {

    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
    }

    @Override
    public RecordReader getRecordReader(InputSplit split, JobConf job,
            Reporter reporter) throws IOException {
        // return your customized record reader here
        return null; // placeholder
    }
}
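Wiring it in is then a single call in the old-API driver, something like the following (the driver class name is a placeholder):

// Inside the driver, using the old mapred API to match the format above:
JobConf conf = new JobConf(MyDriver.class);          // MyDriver is a placeholder
conf.setInputFormat(NonSplitableBinaryFile.class);   // each file now goes to a single mapper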

Multiple Custom Writable formats

I have multiple input sources and I have used Sqoop's codegen tool to generate custom classes for each input source
public class SQOOP_REC1 extends SqoopRecord implements DBWritable, Writable
public class SQOOP_REC2 extends SqoopRecord implements DBWritable, Writable
On the Map side, based on the input source, I create objects of the above 2 classes accordingly.
I have the key as type "Text" and since I have 2 different types of values, I kept the value output type as "Writable".
On the reduce side, I accept the value type as Writable.
public class SkeletonReduce extends Reducer<Text, Writable, Text, Text> {
    public void reduce(Text key, Iterable<Writable> values, Context context) throws IOException, InterruptedException {
    }
}
I also set
job.setMapOutputValueClass(Writable.class);
During execution, it does not enter the reduce function at all.
Could someone tell me if it is possible to do this? If so, what am I doing wrong?
You can't specify Writable as your output type; it has to be a concrete type. All records need to have the same (concrete) key and value types, in Mappers and Reducers. If you need different types you can create some kind of hybrid Writable that contains either an "A" or "B" inside. It's a little ugly but works and is done a lot in Mahout for example.
But I don't know why any of this would make the reducer not run; this is likely something quite separate and not answerable based on this info.
Look into extending GenericWritable for your value type. You need to define the set of classes that are allowed (SQOOP_REC1 and SQOOP_REC2 in your case). It is not as efficient, because it creates new object instances in the readFields method, but you can override this if you have a small set of classes: just keep instance variables of both types and a flag that denotes which one is valid.
http://hadoop.apache.org/common/docs/r0.20.1/api/org/apache/hadoop/io/GenericWritable.html
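A minimal sketch of such a GenericWritable subclass, using the two Sqoop-generated classes from the question (the wrapper class name is illustrative):

import org.apache.hadoop.io.GenericWritable;
import org.apache.hadoop.io.Writable;

// Wraps either a SQOOP_REC1 or a SQOOP_REC2 so both can travel as a single
// concrete map output value type.
public class SqoopRecordWrapper extends GenericWritable {

    @SuppressWarnings("unchecked")
    private static final Class<? extends Writable>[] CLASSES =
            new Class[] { SQOOP_REC1.class, SQOOP_REC2.class };

    public SqoopRecordWrapper() {               // required by Hadoop's reflection
    }

    public SqoopRecordWrapper(Writable instance) {
        set(instance);                          // remembers which concrete type it holds
    }

    @Override
    protected Class<? extends Writable>[] getTypes() {
        return CLASSES;
    }
}

The mapper would then emit new SqoopRecordWrapper(record), the job would set setMapOutputValueClass(SqoopRecordWrapper.class), and the reducer would call get() and check the concrete type with instanceof.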
OK, I think I figured out how to do this, based on a suggestion given by Doug Cutting himself:
http://grokbase.com/t/hadoop/common-user/083gzhd6zd/multiple-output-value-classes
I wrapped the object using ObjectWritable:
ObjectWritable obj = new ObjectWritable(SQOOP_REC2.class, sqoop_rec2);
Then, on the reduce side, I can get the name of the wrapped class and cast the value back to the original class:
if (val.getDeclaredClass().getName().equals("SQOOP_REC2")) {
    SQOOP_REC2 temp = (SQOOP_REC2) val.get();
}
And don't forget
job.setMapOutputValueClass(ObjectWritable.class);
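Put together, the reduce side could look roughly like this (a sketch reusing the SkeletonReduce name from the question; the Sqoop-generated class names also come from the question, and the handling bodies are left empty):

import java.io.IOException;

import org.apache.hadoop.io.ObjectWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SkeletonReduce extends Reducer<Text, ObjectWritable, Text, Text> {

    @Override
    public void reduce(Text key, Iterable<ObjectWritable> values, Context context)
            throws IOException, InterruptedException {
        for (ObjectWritable val : values) {
            Object record = val.get();
            if (record instanceof SQOOP_REC1) {
                SQOOP_REC1 rec1 = (SQOOP_REC1) record;
                // handle rec1
            } else if (record instanceof SQOOP_REC2) {
                SQOOP_REC2 rec2 = (SQOOP_REC2) record;
                // handle rec2
            }
        }
    }
}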

How to call Partitioner in Hadoop v 0.21

In my application I want to create as many reduce tasks as possible based on the keys. Right now my implementation writes all the keys and values into a single (reducer) output file. To solve this I have written a partitioner, but I cannot get the class to be called. The partitioner should be called after the selection map task and before the selection reduce task, but it is not. The code of the partitioner is the following:
public class MultiWayJoinPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int nbPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % nbPartitions;
    }
}
Is this code correct for partitioning the output based on the keys, and will the output be transferred to the reducers automatically?
You don't show all of your code, but there is usually a class (often called the "job" or driver class) that configures the mapper, reducer, partitioner, etc., and then actually submits the job to Hadoop. In this class you will have a job configuration object with many properties, one of which is the number of reducers. Set this property to whatever number your Hadoop configuration can handle.
Once the job is configured with a given number of reducers, that number will be passed into your partitioner (which looks correct, by the way). Your partitioner will start returning the appropriate reducer/partition for each key/value pair. That's how you get as many reducers as possible.
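A sketch of what that driver-side wiring could look like; apart from MultiWayJoinPartitioner, the class names, paths, and reducer count are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MultiWayJoinDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "multi-way join");
        job.setJarByClass(MultiWayJoinDriver.class);
        job.setMapperClass(SelectionMapper.class);     // placeholder mapper
        job.setReducerClass(SelectionReducer.class);   // placeholder reducer
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setPartitionerClass(MultiWayJoinPartitioner.class); // otherwise HashPartitioner is used
        job.setNumReduceTasks(4); // the partitioner only matters when there is more than one reducer
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}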
