Hadoop: When does the setup method get invoked in the reducer?

As far as I understand, the reduce task has three phases: shuffle, sort and the actual reduce invocation.
So usually in a Hadoop job's output we see something like:
map 0% reduce 0%
map 20% reduce 0%
.
.
.
map 90% reduce 10%
.
.
.
So I assume that the reduce tasks start before all the maps are finished and this behavior is controlled by the slow start configuration.
Now I don't yet understand when the setup method of the reducer is actually called.
In my use case, I have a file to parse in the setup method. The file is about 60MB in size and is picked up from the distributed cache. While the file is being parsed, another set of data from the configuration can update the just-parsed records. After parsing and any possible updates, the records are stored in a HashMap for fast lookups. So I would like this method to be invoked as soon as possible, possibly while the mappers are still doing their thing.
Is it possible to do this? Or is that what already happens?
Thanks

setup() is called right before the reducer is able to read the first key/value pair from the stream, which is effectively after all mappers have run and all the merging for a given reducer partition is finished.
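For the use case in the question, a minimal sketch of such a setup() is shown below. It assumes the file was shipped with the distributed cache under the symlink name lookup.txt and contains tab-separated key/value lines; both of these, as well as the "override." configuration prefix, are illustrative assumptions rather than anything from the question:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class LookupReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

    private final Map<String, String> lookup = new HashMap<String, String>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // "lookup.txt" is the symlink for the cached file, e.g. added in the driver with
        // job.addCacheFile(new URI("/apps/cache/lookup.txt#lookup.txt"))
        BufferedReader reader = new BufferedReader(new FileReader("lookup.txt"));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) {
                    // a configuration entry may override the value that was just parsed
                    String override = context.getConfiguration().get("override." + parts[0]);
                    lookup.put(parts[0], override != null ? override : parts[1]);
                }
            }
        } finally {
            reader.close();
        }
    }

    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        String resolved = lookup.get(key.toString());   // fast lookup built once in setup()
        // ... rest of the reduce logic using "resolved" ...
    }
}
Note that, per the answer above, this still only runs once the shuffle and merge for the reducer's partition have finished.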

As explained in the Hadoop docs, the setup() method is called once at the start of the task. It should be used for instantiating resources/variables or reading configurable parameters, which in turn can be used in the reduce() method. Think of it like a constructor.
Here is an example reducer:
class ExampleReducer extends TableReducer<ImmutableBytesWritable, ImmutableBytesWritable, ImmutableBytesWritable> {

    private int runId;
    private ObjectMapper objectMapper;

    @Override
    protected void setup(Context context) throws IOException {
        Configuration conf = context.getConfiguration();
        this.runId = Integer.valueOf(conf.get("stackoverflow_run_id"));
        this.objectMapper = new ObjectMapper();
    }

    @Override
    protected void reduce(ImmutableBytesWritable keyFromMap, Iterable<ImmutableBytesWritable> valuesFromMap, Context context)
            throws IOException, InterruptedException {
        // your code
        String json = objectMapper.writeValueAsString(yourValueObject); // serialize whatever you built from the values
        // your code
        context.write(new ImmutableBytesWritable(somekey.getBytes()), put);
    }
}
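As a usage note (a sketch, not part of the original answer), the stackoverflow_run_id parameter read in setup() would be set on the job configuration in the driver before submission:
// Hypothetical driver snippet: set the parameter before the Job is created,
// so every task sees it in its Configuration.
Configuration conf = HBaseConfiguration.create();
conf.set("stackoverflow_run_id", "42");              // example value

Job job = Job.getInstance(conf, "example-reducer-job");
job.setJarByClass(ExampleReducer.class);
// ... input format, TableMapReduceUtil.initTableReducerJob(...), output table, etc. ...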

Related

How to force file content to be processed sequentially?

I have a requirement to process a file as-is, meaning the file content should be processed in the order in which it appears in the file.
For example: I have a file that is 700MB in size. How can we make sure the file is processed in the order it appears, since that depends on DataNode availability? In some cases one of the DataNodes may process its part of the file slowly (low configuration).
One way to fix this is adding a unique id/key in the file, but we don't want to add anything new to the file.
Any thoughts :)
You can guarantee that only one mapper processes the content of the file by writing your own FileInputFormat which sets isSplitable to false. E.g.:
public class WholeFileInputFormat extends FileInputFormat<Text, BytesWritable> {

    @Override
    protected boolean isSplitable(FileSystem fs, Path filename) {
        return false;
    }

    @Override
    public RecordReader<Text, BytesWritable> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new WholeFileRecordReader((FileSplit) split, job);
    }
}
For more examples of how to do this, I'd recommend a GitHub project. Depending on your Hadoop version, slight changes might be necessary.
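The WholeFileRecordReader referenced above is not shown in the answer; a minimal sketch of one (an assumption, written against the old mapred API to match the code above, with the file path as the Text key and the entire file as a single BytesWritable value) could look like this:
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;

public class WholeFileRecordReader implements RecordReader<Text, BytesWritable> {

    private final FileSplit split;
    private final JobConf job;
    private boolean processed = false;

    public WholeFileRecordReader(FileSplit split, JobConf job) {
        this.split = split;
        this.job = job;
    }

    @Override
    public boolean next(Text key, BytesWritable value) throws IOException {
        if (processed) {
            return false;                       // the single record has already been emitted
        }
        Path file = split.getPath();
        FileSystem fs = file.getFileSystem(job);
        byte[] contents = new byte[(int) split.getLength()];
        FSDataInputStream in = fs.open(file);
        try {
            IOUtils.readFully(in, contents, 0, contents.length);
        } finally {
            IOUtils.closeStream(in);
        }
        key.set(file.toString());
        value.set(contents, 0, contents.length);
        processed = true;
        return true;
    }

    @Override
    public Text createKey() {
        return new Text();
    }

    @Override
    public BytesWritable createValue() {
        return new BytesWritable();
    }

    @Override
    public long getPos() throws IOException {
        return processed ? split.getLength() : 0;
    }

    @Override
    public float getProgress() throws IOException {
        return processed ? 1.0f : 0.0f;
    }

    @Override
    public void close() throws IOException {
        // nothing to close; the stream is closed in next()
    }
}
Note that this reads the whole split into memory, so it is only practical when the file fits comfortably in the mapper's heap.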

How to manipulate reduce() output and store it in another file?

I have just started learning Hadoop. I would like to use the output of my reduce() and do some manipulations on it. I am working on the new API and have tried using JobControl, but it doesn't seem to work with the new API.
Any way out?
Not sure what you are trying to do. Do you want to send different kinds of output to different output formats? Check this. If you want to filter out or do manipulations on the values from the map, reduce is the best place to do this.
You can make use of ChainReducer to create a job of the form [MAP+ / REDUCE MAP*], i.e. several maps followed by a reducer and then another series of maps that work on the output of the reducer. The final output is the output of the last mapper in the series.
Alternatively, you can have multiple jobs that run sequentially, where the output of the previous job's reducer is the input to the next. But this causes unnecessary I/O in case you are not interested in the intermediate output.
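A minimal driver sketch of the sequential-jobs approach is shown below; the FirstMapper/FirstReducer and SecondMapper/SecondReducer classes and the three path arguments are assumptions, not something from the question (new API, Hadoop 2 style):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TwoPassDriver {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // First job: normal map/reduce pass.
        Job first = Job.getInstance(conf, "first pass");
        first.setJarByClass(TwoPassDriver.class);
        first.setMapperClass(FirstMapper.class);
        first.setReducerClass(FirstReducer.class);
        first.setOutputKeyClass(Text.class);
        first.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(first, new Path(args[0]));
        FileOutputFormat.setOutputPath(first, new Path(args[1]));   // intermediate output

        if (!first.waitForCompletion(true)) {
            System.exit(1);                                         // stop if the first pass failed
        }

        // Second job: manipulates the reducer output of the first job.
        Job second = Job.getInstance(conf, "second pass");
        second.setJarByClass(TwoPassDriver.class);
        second.setMapperClass(SecondMapper.class);
        second.setReducerClass(SecondReducer.class);
        second.setOutputKeyClass(Text.class);
        second.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(second, new Path(args[1]));    // first job's output is the input here
        FileOutputFormat.setOutputPath(second, new Path(args[2]));

        System.exit(second.waitForCompletion(true) ? 0 : 1);
    }
}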
Do whatever you want inside the reducer: create an FSDataOutputStream and write the output through it.
For example:
public static class TokenCounterReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        FileSystem fs = FileSystem.get(context.getConfiguration());
        FSDataOutputStream out = fs.create(new Path("/path/to/your/file"));
        // do the manipulation and write it down to the file
        out.write(......);
        out.close();

        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

Find which data split caused the job to fail in hadoop

I was wondering if I can get some help on how to debug this situation?
Basically, I am reading data from HDFS, performing some basic computation, and writing the result back to HDFS.
But in the JobTracker I see that one of the tasks is always stuck in the initializing phase:
Task Complete Phase ..... Counter
task_201312040108_0001_m_003006 0 Initializing 0
After a few attempts (3) this task failed, forcing the whole job to fail, while all the others succeeded.
How do I debug this situation?
I was wondering if I can take a look at what data split this mapper is getting? Oh, and this is a map-only job.
All my Java mappers extend a base mapper that has the following code:
// hook for subclasses
protected void doSetup( Context ctx ) throws IOException, InterruptedException {}

public final void setup( Context ctx ) throws IOException, InterruptedException {
    String strSplitMsg = "Input split: " + ctx.getInputSplit();
    LOG.info( strSplitMsg );
    ctx.setStatus( strSplitMsg );
    doSetup( ctx );
}
so that I never get bitten by that problem. However, your freeze might be happening before the call to setup(); perhaps you can look at the TaskTracker log on the host where the failures occurred, or at the task attempt log itself.
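For illustration, a hypothetical subclass of such a base mapper (the name BaseSplitLoggingMapper is assumed) only overrides the doSetup() hook:
public class MyMapper extends BaseSplitLoggingMapper {

    @Override
    protected void doSetup(Context ctx) throws IOException, InterruptedException {
        // subclass-specific initialization goes here; the input split has already been
        // logged and pushed into the task status by the base class's final setup()
    }

    // map() as usual ...
}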

Job wide custom cleanup after all the map tasks are completed

While running a map-reduce job that has only a mapper, I have a counter that counts the number of failed documents. After all the mappers are done, I want the job to fail if the total number of failed documents is above a fixed fraction. (I need this at the end because I don't know the total number of documents initially.) How can I achieve this without implementing a reducer just for this?
I know that there are task-level cleanup methods. But is there any job-level cleanup method that can be used to perform this after all the tasks are done?
This can be done very easily; that's the beauty of the latest MapReduce API.
The execution of a mapper can be controlled by overriding the run method in the Mapper class, and the same goes for the reducer. I do not know the final outcome that you are expecting, but I have prepared a small example for you.
In my mapper class I have overridden the run method; in the sample below it interrupts the execution if the key value is greater than 200.
public class ReversingMapper extends Mapper<LongWritable, Text, ReverseIntWritable, Text> {

    public final LongWritable border = new LongWritable(100);

    @Override
    public void run(Context context) throws IOException, InterruptedException {
        setup(context);
        while (context.nextKeyValue()) {
            /* extra code to the standard run method starts here */
            // if (context.getCounter(<ENUM>).getValue() > 200) { ... } -- you can place your counter check here
            if (context.getCurrentKey().get() > 200) {
                throw new InterruptedException();
            } else {
            /* extra code to the standard run method ends here */
                map(context.getCurrentKey(), context.getCurrentValue(), context);
            }
        }
        cleanup(context);
    }
}
and you need to handle it properly in the Driver as well:
} catch (InterruptedException e) {
    e.printStackTrace();
    System.exit(1);   // non-zero exit so the failure is visible to whoever launched the job
}
You can add a logger and log a proper message where required.
I hope this solves your problem. Let me know if you need any more help.
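As a complementary, job-level alternative (a sketch, not something from the answer above): counters are aggregated across all tasks, so the driver can read them after waitForCompletion() and decide whether the run should fail. The DocCounter enum and the 10% threshold below are assumptions:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;

public class FailedFractionDriver {

    // Custom counters incremented from the mapper, e.g. context.getCounter(DocCounter.FAILED_DOCS).increment(1)
    public enum DocCounter { TOTAL_DOCS, FAILED_DOCS }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "failed-fraction-check");
        // ... usual job setup (mapper, input/output paths, zero reducers) ...

        boolean completed = job.waitForCompletion(true);

        Counters counters = job.getCounters();
        long total = counters.findCounter(DocCounter.TOTAL_DOCS).getValue();
        long failed = counters.findCounter(DocCounter.FAILED_DOCS).getValue();

        // Fail the run if more than 10% of the documents could not be processed.
        boolean tooManyFailures = total > 0 && (double) failed / total > 0.10;
        System.exit(completed && !tooManyFailures ? 0 : 1);
    }
}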

Calling progress or increasing a counter in the configure method of a reducer

Is it possible to do so?
Context: My configure method for a reducer needs to read a set of files from the DistributedCache (total size is ~150MB). However, I don't know why, but it takes so long that Hadoop kills some reducers, despite the fact that other reducers have finished successfully.
I use the old API, where I can only access the JobConf conf variable in the configure method.
My idea is to make the Reporter variable a field so that I can call it in the configure method. But it seems configure is called before reduce is called.
Convert your code to use the new API!
Then in setup() you can access the context variable and call progress() as follows:
@Override
protected void setup(Context context) throws IOException, InterruptedException {
    context.progress();
}
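In practice you would call progress() periodically while parsing the cached files, so the task keeps reporting liveness during a long setup. A minimal sketch (the file name lookup.dat and the 10,000-line interval are assumptions; imports for BufferedReader/FileReader omitted):
@Override
protected void setup(Context context) throws IOException, InterruptedException {
    // "lookup.dat" is assumed to be a DistributedCache file symlinked into the task's working directory
    BufferedReader reader = new BufferedReader(new FileReader("lookup.dat"));
    try {
        String line;
        long lines = 0;
        while ((line = reader.readLine()) != null) {
            // ... parse the line into your in-memory structures ...
            if (++lines % 10000 == 0) {
                context.progress();   // report liveness so the framework does not kill the task
            }
        }
    } finally {
        reader.close();
    }
}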
