Why not mapper/reducer for hadoop TeraSort - hadoop

I am planning to insert some code into the mapper of the TeraSort class in Hadoop 0.20.2. However, after reviewing the source code, I cannot locate where the mapper is implemented.
Normally there is a call to job.setMapperClass() that indicates the mapper class. For TeraSort, however, I can only see calls like setInputFormat and setOutputFormat. I cannot find where the mapper and reducer are set.
Can anyone please give some hints about this? Thanks.
The source code is something like this:
public int run(String[] args) throws Exception {
    LOG.info("starting");
    JobConf job = (JobConf) getConf();
    Path inputDir = new Path(args[0]);
    inputDir = inputDir.makeQualified(inputDir.getFileSystem(job));
    Path partitionFile = new Path(inputDir, TeraInputFormat.PARTITION_FILENAME);
    URI partitionUri = new URI(partitionFile.toString() +
                               "#" + TeraInputFormat.PARTITION_FILENAME);
    TeraInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setJobName("TeraSort");
    job.setJarByClass(TeraSort.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setInputFormat(TeraInputFormat.class);
    job.setOutputFormat(TeraOutputFormat.class);
    job.setPartitionerClass(TotalOrderPartitioner.class);
    TeraInputFormat.writePartitionFile(job, partitionFile);
    DistributedCache.addCacheFile(partitionUri, job);
    DistributedCache.createSymlink(job);
    job.setInt("dfs.replication", 1);
    // TeraOutputFormat.setFinalSync(job, true);
    job.setNumReduceTasks(0);
    JobClient.runJob(job);
    LOG.info("done");
    return 0;
}
For other classes, like TeraValidate, we can find code like:
job.setMapperClass(ValidateMapper.class);
job.setReducerClass(ValidateReducer.class);
I cannot see such methods for TeraSort.
Thanks,

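Why should a sort need to set a Mapper and Reducer class?
If you set neither, Hadoop falls back to the identity mapper and identity reducer, which pass every record through unchanged. In the old org.apache.hadoop.mapred API used here these are IdentityMapper and IdentityReducer; in the newer API the base Mapper and Reducer classes behave as identity.
Those base classes are the ones you usually inherit from.
In other words, you just emit everything from the input and let Hadoop do its own sorting, so sorting works by "default".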

Thomas's answer is right: the mapper and reducer are identity functions, since the shuffled data is sorted before your reduce function is applied. What is special about TeraSort is its custom partitioner (which is not the default hash function). You should read more about it here: Hadoop's implementation for Terasort. It states:
"TeraSort is a standard map/reduce sort, except for a custom partitioner that uses a sorted list of N − 1 sampled keys that define the key range for each reduce. In particular, all keys such that sample[i − 1] <= key < sample[i] are sent to reduce i. This guarantees that the output of reduce i are all less than the output of reduce i+1."

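The quoted description can be sketched in plain Java without any Hadoop dependency. This is only an illustration under assumptions: the class name TotalOrderSketch, the hard-coded split points and the in-memory lists are hypothetical stand-ins, and Collections.sort plays the role of the framework's shuffle sort. It shows why an identity map plus a range partitioner plus a per-partition sort yields globally sorted output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a total-order sort; not Hadoop's actual implementation.
class TotalOrderSketch {

    // splits holds N - 1 sorted, distinct sample keys for N partitions.
    // Keys below splits[0] go to partition 0; keys with
    // splits[i - 1] <= key < splits[i] go to partition i.
    static int getPartition(String key, String[] splits) {
        int pos = Arrays.binarySearch(splits, key);
        // On a miss, binarySearch returns -(insertionPoint) - 1.
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }

    static List<String> sort(List<String> input, String[] splits) {
        int numReduces = splits.length + 1;
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numReduces; i++) {
            partitions.add(new ArrayList<>());
        }
        // "Identity map": every key is emitted unchanged and routed by range.
        for (String key : input) {
            partitions.get(getPartition(key, splits)).add(key);
        }
        List<String> output = new ArrayList<>();
        for (List<String> partition : partitions) {
            Collections.sort(partition); // stands in for the shuffle sort
            output.addAll(partition);    // "identity reduce": emit as-is
        }
        return output;
    }

    public static void main(String[] args) {
        String[] splits = {"g", "p"}; // 2 split points, so 3 partitions
        List<String> input =
            Arrays.asList("pear", "apple", "kiwi", "zucchini", "grape", "fig");
        System.out.println(sort(input, splits));
    }
}
```

Because every key in partition i is smaller than every key in partition i+1, concatenating the sorted partitions in order gives a fully sorted result, with no work done in the map or reduce step itself.
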
Related

how to hold objects in array in hadoop

How can we have array of our own objects in hadoop? Does Hadoop have containers like List, Set, LinkedList etc. similar to java? Are the following lines good?
Text[] textArray = new Text[2];
textArray[0] = new Text(maxSalaryDeptEmployee.getEmployeeName());
textArray[1] = new Text(Integer.toString(maxSalaryDeptEmployee.getEmployeeSalary()));
ArrayWritable arrayWritable = new ArrayWritable(Text.class,textArray);
Your code snippet looks good. Hadoop by default supports only ArrayWritable, ArrayPrimitiveWritable, TwoDArrayWritable, MapWritable, SortedMapWritable and EnumSetWritable as "containers".
You can also implement custom Writables, a good reference is here.
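As a rough illustration of what a custom Writable involves, the sketch below mirrors the two methods of Hadoop's Writable interface, write(DataOutput) and readFields(DataInput), using only java.io so that it runs standalone. The class name EmployeeRecord, its fields and the roundTrip helper are hypothetical; in a real job the class would additionally declare "implements Writable".

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical record type shaped like a Hadoop Writable (interface omitted
// here so the example compiles without Hadoop on the classpath).
class EmployeeRecord {
    String name = "";
    int salary;

    // Writables are created reflectively, so a no-arg constructor is needed.
    EmployeeRecord() {}

    EmployeeRecord(String name, int salary) {
        this.name = name;
        this.salary = salary;
    }

    // Serialize the fields in a fixed order.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(name);
        out.writeInt(salary);
    }

    // Deserialize the fields in exactly the same order.
    public void readFields(DataInput in) throws IOException {
        name = in.readUTF();
        salary = in.readInt();
    }

    // Helper for the demo: serialize to bytes and read back into a new object.
    static EmployeeRecord roundTrip(EmployeeRecord record) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            record.write(new DataOutputStream(bytes));
            EmployeeRecord copy = new EmployeeRecord();
            copy.readFields(new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        EmployeeRecord copy = roundTrip(new EmployeeRecord("Alice", 5000));
        System.out.println(copy.name + " " + copy.salary);
    }
}
```

If the type is used as a key rather than a value, it would also need to be comparable (WritableComparable in Hadoop), which this sketch leaves out.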

How does Mapper class identify the SequenceFile as inputfile in hadoop?

In one of my MapReduce tasks, I extend BytesWritable as KeyBytesWritable and ByteWritable as ValueBytesWritable. Then I output the result using SequenceFileOutputFormat.
My question is: when I start the next MapReduce task, I want to use this SequenceFile as the input file. How should I configure the job, and how can the Mapper class identify the key and value types in the SequenceFile that I wrote earlier?
I understand that I could use SequenceFile.Reader to read the key and value.
Configuration config = new Configuration();
Path path = new Path(PATH_TO_YOUR_FILE);
SequenceFile.Reader reader = new SequenceFile.Reader(FileSystem.get(config), path, config);
WritableComparable key = (WritableComparable) reader.getKeyClass().newInstance();
Writable value = (Writable) reader.getValueClass().newInstance();
while (reader.next(key, value))
But I don't know how to use this Reader to pass the key and value into Mapper class as Parameters. How could I set conf.setInputFormat to SequenceFileInputFormat and then let Mapper get the key and values?
Thanks
You do not need to read the sequence file manually. Just set the input format class to SequenceFileInputFormat:
job.setInputFormatClass(SequenceFileInputFormat.class);
and set the input path to the directory containing your sequence files:
FileInputFormat.setInputPaths(job, <path to the dir containing your sequence files>);
The key and value type parameters of your Mapper class must match the (key, value) types stored in your sequence file.

Hadoop variable set in reducer and read in driver

How I can set a variable in a reducer, which after its execution can be read by the driver after all tasks finish their execution? Something like:
class Driver extends Configured implements Tool{
public int run(String[] args) throws Exception {
...
JobClient.runJob(conf); // reducer sets some variable
String varValue = ...; // variable value is read by driver
}
}
WORKAROUND
I came up with this "ugly" workaround. The main idea is to create a counter group that holds a single counter whose name is the value you wish to return (the actual counter value is ignored). The code looks like this:
// reducer || mapper
reporter.incrCounter("Group name", "counter name -> actual value", 0);
// driver
RunningJob runningJob = JobClient.runJob(conf);
String value = runningJob.getCounters().getGroup("Group name").iterator().next().getName();
The same will work for mappers as well. Though this solves my problem, I think this type of solution is "ugly". Thus I leave the question open.
You can't amend the configuration in a map/reduce task and expect that change to be persisted to the configurations of other tasks or of the job client that submitted the job (let's say you write different values in different reducers: which one 'wins' and is persisted back?).
You can, however, write files to HDFS yourself, which can then be read back when your job returns. That is no less ugly really, but there isn't a way that doesn't involve another technology (ZooKeeper, HBase or any other NoSQL / RDB) holding the value between your task ending and you being able to retrieve it upon job success.
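The write-then-read-back idea might look roughly like the sketch below. It is an assumption-laden stand-in: it uses java.nio on the local filesystem instead of HDFS so that it runs without Hadoop, and the class name SideFileSketch and the file name reducer-result.txt are hypothetical. On a real cluster the same two steps would go through Hadoop's FileSystem API (create in the task, open in the driver).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Local-filesystem stand-in for "reducer writes a side file, driver reads it".
class SideFileSketch {

    // Creates a scratch directory that plays the role of a job output dir.
    static Path newScratchDir() {
        try {
            return Files.createTempDirectory("job-side-output");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // What the reducer would do, for example in its cleanup/close step.
    static void writeResult(Path file, String value) {
        try {
            Files.write(file, value.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // What the driver would do after the job has completed successfully.
    static String readResult(Path file) {
        try {
            return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path file = newScratchDir().resolve("reducer-result.txt");
        writeResult(file, "42");
        System.out.println(readResult(file));
    }
}
```

With more than one reducer, each task would need its own file name (for instance one derived from its task ID), and the driver would have to decide how to combine the values, which is the 'which one wins' question raised above.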

Hadoop Custom Input format with the new API

I'm a newbie to Hadoop and I'm stuck with the following problem. What I'm trying to do is to map a shard of the database (please don't ask why I need to do that etc.) to a mapper, perform certain operations on this data, output the results to reducers, and then use that output for a second-phase map/reduce job on the same data using the same shard format.
Hadoop does not provide any input method to send a shard of the database. You can only send data line by line using LineInputFormat and LineRecordReader. NLineInputFormat does not help in this case either. I need to extend the FileInputFormat and RecordReader classes to write my own InputFormat. I have been advised to use LineRecordReader, since the underlying code already deals with the FileSplits and all the problems associated with splitting the files.
All I need to do now is to override the nextKeyValue() method, which I don't know exactly how to do.
for(int i=0;i<shard_size;i++){
if(lineRecordReader.nextKeyValue()){
lineValue.append(lineRecordReader.getCurrentValue().getBytes(),0,lineRecordReader.getCurrentValue().getLength());
}
}
The above code snippet is the one that I wrote, but somehow it doesn't work well.
I would suggest putting connection strings, and some other indication of where to find the shard, into your input files.
The mapper will take this information, connect to the database and do the job. I would not suggest converting result sets to Hadoop's Writable classes, as it will hinder performance.
The problem I see that needs to be addressed is having enough splits for this relatively small input.
You can simply create enough small files with a few shard references each, or you can tweak the input format to build small splits. The second way will be more flexible.
What I did is something like this: I wrote my own record reader to read n lines at a time and send them to the mappers as input.
public boolean nextKeyValue() throws IOException,
InterruptedException {
StringBuilder sb = new StringBuilder();
for (int i = 0; i < 5; i++) {
if (!lineRecordReader.nextKeyValue()) {
return false;
}
lineKey = lineRecordReader.getCurrentKey();
lineValue = lineRecordReader.getCurrentValue();
sb.append(lineValue.toString());
sb.append(eol);
}
lineValue.set(sb.toString());
//System.out.println(lineValue.toString());
return true;
// throw new UnsupportedOperationException("Not supported yet.");
}
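The n-lines-per-record idea can be tried out without Hadoop using the sketch below. The class LineGrouper and its methods are hypothetical and only imitate the nextKeyValue()/getCurrentValue() shape of a RecordReader over an in-memory iterator. One design choice worth noting: this version still emits a final, shorter record when the number of remaining lines is not a multiple of n, whereas returning false as soon as the underlying reader runs dry would discard those trailing lines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a RecordReader that groups n lines per record.
class LineGrouper {
    private final Iterator<String> lines;
    private final int linesPerRecord;
    private String currentValue;

    LineGrouper(Iterator<String> lines, int linesPerRecord) {
        this.lines = lines;
        this.linesPerRecord = linesPerRecord;
    }

    // Returns true if a record (possibly a short final one) was produced.
    boolean nextKeyValue() {
        StringBuilder sb = new StringBuilder();
        int read = 0;
        while (read < linesPerRecord && lines.hasNext()) {
            sb.append(lines.next()).append('\n');
            read++;
        }
        if (read == 0) {
            return false; // input exhausted and nothing buffered
        }
        currentValue = sb.toString();
        return true;
    }

    String getCurrentValue() {
        return currentValue;
    }

    // Convenience helper for the demo: collect every grouped record.
    static List<String> groupAll(List<String> input, int linesPerRecord) {
        LineGrouper grouper = new LineGrouper(input.iterator(), linesPerRecord);
        List<String> records = new ArrayList<>();
        while (grouper.nextKeyValue()) {
            records.add(grouper.getCurrentValue());
        }
        return records;
    }

    public static void main(String[] args) {
        // Five lines grouped two at a time: two full records and one short one.
        List<String> records = groupAll(Arrays.asList("a", "b", "c", "d", "e"), 2);
        System.out.println(records.size() + " records");
    }
}
```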
how do you thin

How to call Partitioner in Hadoop v 0.21

In my application I want to create as many reducer tasks as possible based on the keys. My current implementation writes all the keys and values to a single (reducer) output file. To solve this, I have written a partitioner, but I cannot get it called. The partitioner should be called after the selection map task and before the selection reduce task, but it is not. The code of the partitioner is the following:
public class MultiWayJoinPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int nbPartitions) {
        // Mask off the sign bit so the result is never negative.
        return (key.getFirst().hashCode() & Integer.MAX_VALUE) % nbPartitions;
    }
}
Is this code correct for partitioning the output based on the keys and values, and will the output be transferred to the reducers automatically?
You don't show all of your code, but there is usually a class (called the "Job" or "MR" class) that configures the mapper, reducer, partitioner, etc. and then actually submits the job to Hadoop. In this class you will have a job configuration object that has many properties, one of which is the number of reducers. Set this property to whatever number your Hadoop configuration can handle.
Once the job is configured with a given number of reducers, that number will be passed into your partitioner (which looks correct, by the way). Your partitioner will then return the appropriate reducer/partition for each key/value pair. That's how you get as many reducers as possible.
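To see how the number of reducers feeds into the partition function, here is a small Hadoop-free sketch. The class HashPartitionSketch and its sample keys are hypothetical; only the masking-and-modulo expression is taken from the partitioner in the question. In a real driver the partitioner would also have to be registered with job.setPartitionerClass(...) and the reducer count set with job.setNumReduceTasks(n), otherwise the framework keeps using its defaults.

```java
import java.util.Arrays;

// Hypothetical demo of hash partitioning; not tied to any Hadoop class.
class HashPartitionSketch {

    // Same expression as in the question: clear the sign bit, then take the
    // remainder, so the result is always in the range [0, numPartitions).
    static int getPartition(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Counts how many of the given keys land in each partition.
    static int[] countPerPartition(String[] keys, int numPartitions) {
        int[] counts = new int[numPartitions];
        for (String key : keys) {
            counts[getPartition(key, numPartitions)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] keys = {"alpha", "beta", "gamma", "delta", "epsilon"};
        // With a single reducer, every key falls into partition 0.
        System.out.println(Arrays.toString(countPerPartition(keys, 1)));
        // With more reducers, the same keys are spread over several partitions.
        System.out.println(Arrays.toString(countPerPartition(keys, 4)));
    }
}
```

The first line printed illustrates the symptom described in the question: when the job runs with one reducer, any partitioner can only ever return 0, so everything ends up in a single output file.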
