How does the Mapper class identify a SequenceFile as its input file in Hadoop? - hadoop

In one of my MapReduce jobs, I override BytesWritable as KeyBytesWritable and ByteWritable as ValueBytesWritable. Then I output the result using SequenceFileOutputFormat.
My question is: when I start the next MapReduce job, I want to use this SequenceFile as the input file. How can I set up the job class, and how can the Mapper class identify the key and value types in the SequenceFile that I overrode before?
I understand that I could use SequenceFile.Reader to read the key and value:
Configuration config = new Configuration();
Path path = new Path(PATH_TO_YOUR_FILE);
SequenceFile.Reader reader = new SequenceFile.Reader(FileSystem.get(config), path, config);
WritableComparable key = (WritableComparable) reader.getKeyClass().newInstance();
Writable value = (Writable) reader.getValueClass().newInstance();
while (reader.next(key, value)) {
    // use key and value here
}
reader.close();
But I don't know how to use this Reader to pass the key and value into the Mapper class as parameters. How can I set conf.setInputFormat to SequenceFileInputFormat and then let the Mapper get the keys and values?
Thanks

You do not need to read the sequence file manually. Just set the input format class to SequenceFileInputFormat:
job.setInputFormatClass(SequenceFileInputFormat.class);
and set the input path to the directory containing your sequence files:
FileInputFormat.setInputPaths(job, new Path("<path to the dir containing your sequence files>"));
You will also need to make sure that the (key, value) type parameters of your Mapper class match the (key, value) types stored in your sequence file.
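For example, a minimal driver sketch using the new API; KeyBytesWritable and ValueBytesWritable are the custom classes from the question, while SecondJobDriver and SecondMapper are just illustrative names:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

public class SecondJobDriver {

    // The type parameters must match the (key, value) classes that the first
    // job wrote into the sequence file.
    public static class SecondMapper
            extends Mapper<KeyBytesWritable, ValueBytesWritable, Text, Text> {
        @Override
        protected void map(KeyBytesWritable key, ValueBytesWritable value, Context context)
                throws IOException, InterruptedException {
            // key and value arrive already deserialized by SequenceFileInputFormat
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "read sequence file");
        job.setJarByClass(SecondJobDriver.class);

        job.setInputFormatClass(SequenceFileInputFormat.class);
        FileInputFormat.setInputPaths(job, new Path(args[0])); // dir with the sequence files

        job.setMapperClass(SecondMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // ... set reducer, output format and output path as needed ...

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The important point is that the input format, not the mapper, decides how records are deserialized: once the input format is SequenceFileInputFormat, the key and value classes recorded in the file header are instantiated for you.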

Related

Chaining jobs using user defined class

I have to implement a graph algorithm using MapReduce. For this I have to chain jobs:
MAP1 -> REDUCE1 -> MAP2 -> REDUCE2 -> ...
I will be reading the adjacency matrix from a file in MAP1 and creating a user-defined Java class Node that will contain the data and the child information. I want to pass this information to MAP2.
But, in the REDUCE1 when I write
context.write(node, NullWritable.get());
the node data gets saved in a file as text, using the toString() method of the Node class.
When the MAP2 tries to read this Node information,
public void map(LongWritable key, Node node, Context context) throws IOException, InterruptedException
it says that it cannot convert the text in the file to Node.
I am not sure what is the right approach for this type of Chaining of jobs in Map reduce.
The REDUCE1 writes the Node in this format:
Node [nodeId=1, adjacentNodes=[Node [nodeId=2, adjacentNodes=[]], Node [nodeId=2, adjacentNodes=[]]]]
Actual exception:
java.lang.Exception: java.lang.ClassCastException:
org.apache.hadoop.io.Text cannot be cast to custom.node.nauty.Node
Based on the comments, the suggested changes that will make your code work are the following:
Use SequenceFileInputFormat in MAP2 and SequenceFileOutputFormat in REDUCE1, rather than TextInputFormat and TextOutputFormat. TextInputFormat reads a LongWritable key and a Text value, which is why you get this error.
Accordingly, change the declaration of the second mapper to accept a Node key and a NullWritable value.
Make sure that the Node class implements the Writable interface (or WritableComparable if you use it as a key). Then set the outputKeyClass of the first job to Node.class instead of Text.class.
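Putting that together, a hedged sketch of what it might look like; the adjacency structure is simplified, the path and job names are illustrative, and the mapper/reducer classes are left out:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// Node serializes itself as binary data, so no toString()/parsing is involved.
public class Node implements Writable {
    private int nodeId;
    private List<Node> adjacentNodes = new ArrayList<Node>();

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(nodeId);
        out.writeInt(adjacentNodes.size());
        for (Node n : adjacentNodes) {
            n.write(out);
        }
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        nodeId = in.readInt();
        int size = in.readInt();
        adjacentNodes = new ArrayList<Node>(size);
        for (int i = 0; i < size; i++) {
            Node n = new Node();
            n.readFields(in);
            adjacentNodes.add(n);
        }
    }

    // Driver wiring for the two chained jobs.
    public static void configureChain(Configuration conf, Path intermediate) throws IOException {
        Job job1 = new Job(conf, "job1");
        // ... setMapperClass / setReducerClass for MAP1 and REDUCE1 ...
        job1.setOutputFormatClass(SequenceFileOutputFormat.class);
        job1.setOutputKeyClass(Node.class);            // not Text
        job1.setOutputValueClass(NullWritable.class);
        SequenceFileOutputFormat.setOutputPath(job1, intermediate);

        Job job2 = new Job(conf, "job2");
        job2.setInputFormatClass(SequenceFileInputFormat.class);
        SequenceFileInputFormat.addInputPath(job2, intermediate);
        // MAP2 must then be declared as Mapper<Node, NullWritable, ?, ?> so that
        // map(Node key, NullWritable value, Context context) matches the stored records.
    }
}

If Node is ever used as a map output key (so it goes through the shuffle), it must implement WritableComparable, i.e. also provide compareTo.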

Which files are ignored as input by mapper?

I'm chaining multiple MapReduce jobs and want to pass along/store some meta information (e.g. the configuration or the name of the original input) with the results. At least the file "_SUCCESS" and anything in the "_logs" directory seems to be ignored.
Are there any filename patterns that the input reader ignores by default? Or is this just a fixed, limited list?
The FileInputFormat uses the following hiddenFileFilter by default:
private static final PathFilter hiddenFileFilter = new PathFilter() {
    public boolean accept(Path p) {
        String name = p.getName();
        return !name.startsWith("_") && !name.startsWith(".");
    }
};
So if you use any FileInputFormat (such as TextInputFormat, KeyValueTextInputFormat, or SequenceFileInputFormat), hidden files (file names starting with "_" or ".") will be ignored.
You can use FileInputFormat.setInputPathFilter to set your own custom PathFilter. Remember that the hiddenFileFilter is always active on top of it.
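For example, a small sketch of such a filter; the class name and the ".seq" convention are just illustrative:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Accept only files whose names end in ".seq". The built-in hiddenFileFilter
// still excludes names starting with "_" or "." in addition to this filter.
public class SequenceFilesOnlyFilter implements PathFilter {
    @Override
    public boolean accept(Path path) {
        return path.getName().endsWith(".seq");
    }
}

// In the driver (new API):
// FileInputFormat.setInputPathFilter(job, SequenceFilesOnlyFilter.class);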

Hadoop variable set in reducer and read in driver

How can I set a variable in a reducer which, after the reducer finishes, can be read by the driver once all tasks complete? Something like:
class Driver extends Configured implements Tool {
    public int run(String[] args) throws Exception {
        ...
        JobClient.runJob(conf); // reducer sets some variable
        String varValue = ...;  // variable value is read by driver
    }
}
WORKAROUND
I came up with this "ugly" workaround. The main idea is that you create a group of counters holding a single counter whose name is the value you wish to return (the actual counter value is ignored). The code looks like this:
// reducer || mapper
reporter.incrCounter("Group name", "counter name -> actual value", 0);
// driver
RunningJob runningJob = JobClient.runJob(conf);
String value = runningJob.getCounters().getGroup("Group name").iterator().next().getName();
The same will work for mappers as well. Though this solves my problem, I think this type of solution is "ugly". Thus I leave the question open.
You can't amend the configuration in a map/reduce task and expect that change to be persisted to the configurations of other tasks or of the job client that submitted the job (say different reducers write different values - which one 'wins' and gets persisted back?).
You can however write files to HDFS yourself, which can then be read back when your job returns. That's no less ugly really, but there isn't a way that doesn't involve another technology (ZooKeeper, HBase, or any other NoSQL/RDB store) holding the value between your task ending and you retrieving it after the job succeeds.
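If you go the HDFS route, a rough sketch of a helper could look like this; the path is illustrative, and if several reduce tasks run you would need one file per task:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative helper: the reducer calls writeResult() in its close()/cleanup(),
// and the driver calls readResult() after JobClient.runJob(conf) returns.
public class SideFileResult {
    private static final Path RESULT_PATH = new Path("/tmp/myjob/reducer-result.txt"); // illustrative

    public static void writeResult(Configuration conf, String value) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        FSDataOutputStream out = fs.create(RESULT_PATH, true); // overwrite if present
        out.writeUTF(value);
        out.close();
    }

    public static String readResult(Configuration conf) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        FSDataInputStream in = fs.open(RESULT_PATH);
        String value = in.readUTF();
        in.close();
        return value;
    }
}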

Why not mapper/reducer for hadoop TeraSort

I am planning to insert some code into the mapper of the TeraSort class in Hadoop 0.20.2. However, after reviewing the source code, I cannot locate the segment where the mapper is implemented.
Normally, we would see a call like job.setMapperClass() that indicates the mapper class. However, for TeraSort I can only see calls such as setInputFormat and setOutputFormat. I cannot find where the mapper and reducer classes are set.
Can anyone please give some hints about this? Thanks.
The source code is something like this:
public int run(String[] args) throws Exception {
    LOG.info("starting");
    JobConf job = (JobConf) getConf();
    Path inputDir = new Path(args[0]);
    inputDir = inputDir.makeQualified(inputDir.getFileSystem(job));
    Path partitionFile = new Path(inputDir, TeraInputFormat.PARTITION_FILENAME);
    URI partitionUri = new URI(partitionFile.toString() +
        "#" + TeraInputFormat.PARTITION_FILENAME);
    TeraInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setJobName("TeraSort");
    job.setJarByClass(TeraSort.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setInputFormat(TeraInputFormat.class);
    job.setOutputFormat(TeraOutputFormat.class);
    job.setPartitionerClass(TotalOrderPartitioner.class);
    TeraInputFormat.writePartitionFile(job, partitionFile);
    DistributedCache.addCacheFile(partitionUri, job);
    DistributedCache.createSymlink(job);
    job.setInt("dfs.replication", 1);
    // TeraOutputFormat.setFinalSync(job, true);
    job.setNumReduceTasks(0);
    JobClient.runJob(job);
    LOG.info("done");
    return 0;
}
For other classes, like TeraValidate, we can find code like:
job.setMapperClass(ValidateMapper.class);
job.setReducerClass(ValidateReducer.class);
I cannot see such methods for TeraSort.
Thanks,
Why should a sort need to set a Mapper and Reducer class at all?
The defaults are the standard Mapper (formerly the identity mapper) and the standard Reducer; these are the classes you usually inherit from.
You can basically say that TeraSort just emits everything from the input and lets Hadoop do its own sorting. So sorting works "by default".
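A quick way to check this against the old mapred API (which TeraSort uses in 0.20.2); the class name below is just for illustration:

import org.apache.hadoop.mapred.JobConf;

// If no mapper or reducer class is set, JobConf falls back to the identity implementations.
public class DefaultsCheck {
    public static void main(String[] args) {
        JobConf job = new JobConf();
        System.out.println(job.getMapperClass());  // org.apache.hadoop.mapred.lib.IdentityMapper
        System.out.println(job.getReducerClass()); // org.apache.hadoop.mapred.lib.IdentityReducer
    }
}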
Thomas' answer is right, i.e. the mapper and reducer are identities, since the shuffled data is sorted before your reduce function is applied. What's special about TeraSort is its custom partitioner (which is not the default hash partitioner). You should read more about it in Hadoop's implementation of TeraSort, which states:
"TeraSort is a standard map/reduce sort, except for a custom partitioner that uses a sorted list of N − 1 sampled keys that define the key range for each reduce. In particular, all keys such that sample[i − 1] <= key < sample[i] are sent to reduce i. This guarantees that the output of reduce i are all less than the output of reduce i+1."

How to call Partitioner in Hadoop v 0.21

In my application I want to create as many reducer tasks as possible based on the keys. At the moment my implementation writes all the keys and values into a single (reducer) output file. To solve this I used a partitioner, but I cannot get the class called: the partitioner should be invoked after the map task and before the reduce task, but it is not. The code of the partitioner is the following:
public class MultiWayJoinPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int nbPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % nbPartitions;
    }
}
Is this code correct to partition the output based on the keys, so that it is transferred to the reducers automatically?
You don't show all of your code, but there is usually a class (often called the "job" or driver class) that configures the mapper, reducer, partitioner, etc. and then actually submits the job to Hadoop. In this class you will have a job configuration object with many properties, one of which is the number of reducers. Set this property to whatever number your Hadoop configuration can handle.
Once the job is configured with a given number of reducers, that number will be passed into your partitioner (which looks correct, by the way). Your partitioner will then return the appropriate reducer/partition for each key/value pair. That's how you get as many reducers as possible.
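A minimal driver sketch tying this together (new API); JoinMapper and JoinReducer stand in for whatever your actual mapper and reducer classes are, and the reducer count is just an example:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MultiWayJoinDriver {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "multi-way join");
        job.setJarByClass(MultiWayJoinDriver.class);

        job.setMapperClass(JoinMapper.class);     // replace with your mapper class
        job.setReducerClass(JoinReducer.class);   // replace with your reducer class
        job.setPartitionerClass(MultiWayJoinPartitioner.class);

        // The partition count passed to getPartition() equals this value,
        // so each partition gets its own reducer output file.
        job.setNumReduceTasks(8);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}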
