Find which data split caused the job to fail in Hadoop

I was wondering if I can get some help on how to debug this situation.
Basically, I read data from HDFS, perform some basic computation, and write the result back to HDFS.
But in the JobTracker I see that one of the tasks is always stuck in the initializing phase:
Task                                Complete    Phase          Counter
task_201312040108_0001_m_003006     0           Initializing   0
After a few attempts (3) this task failed, forcing the whole job to fail, while all the other tasks succeeded.
How do I debug this situation?
Is there a way to see which data split this mapper is getting? Note that this is a map-only job.

All my Java mappers extend a base mapper that has the following code:
// hook for subclasses
protected void doSetup( Context ctx ) throws IOException, InterruptedException {}

public final void setup( Context ctx ) throws IOException, InterruptedException {
    String strSplitMsg = "Input split: " + ctx.getInputSplit();
    LOG.info( strSplitMsg );
    ctx.setStatus( strSplitMsg );
    doSetup( ctx );
}
so that I never get bitten by that problem. However, your freeze might be happening before the call to setup(); perhaps you can look at the task tracker log on the host where the failures occurred or the task attempt log itself.
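As a usage illustration, a subclass would override only the doSetup() hook and leave the split-logging setup() alone. A minimal sketch, assuming the base class above is called BaseMapper and is parameterized like a normal Mapper (both the name and the generics are my assumptions, not from the original answer):
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

// Hypothetical subclass: it inherits the split-logging setup() from BaseMapper
// and only plugs its own initialization into the doSetup() hook.
public class TokenCountMapper extends BaseMapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void doSetup(Context ctx) throws IOException, InterruptedException {
        // per-task initialization goes here; the input split has already been
        // logged and pushed into the task status by the base class
    }

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            word.set(token);
            ctx.write(word, ONE);
        }
    }
}
With that in place, the JobTracker UI shows the input split string as the task status, so a hanging or failing attempt can be traced back to its split.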

Related

Hadoop: When does the setup method get invoked in the reducer?

As far as I understand, the reduce task has three phases:
shuffle, sort and the actual reduce invocation.
So usually in a Hadoop job's output we see something like:
map 0% reduce 0%
map 20% reduce 0%
.
.
.
map 90% reduce 10%
.
.
.
So I assume that the reduce tasks start before all the maps are finished and this behavior is controlled by the slow start configuration.
What I don't yet understand is when the setup method of the reducer is actually called.
In my use case, I have a file to parse in the setup method. The file is about 60MB in size and is picked up from the distributed cache. While the file is being parsed, there is another set of data from the configuration that can update the just-parsed record. After parsing and the possible update, the records are stored in a HashMap for fast lookups. So I would like this method to be invoked as soon as possible, possibly while the mappers are still doing their thing.
Is it possible to do this? Or is that what already happens?
Thanks
setup() is called right before the reducer is able to read the first key/value pair from the stream, which is effectively after all mappers have run and all the merging for the given reducer partition is finished.
As explained in the Hadoop docs, the setup() method is called once at the start of the task. It should be used for instantiating resources/variables or reading configurable parameters, which in turn can be used in the reduce() method. Think of it like a constructor.
Here is an example reducer:
class ExampleReducer extends TableReducer<ImmutableBytesWritable, ImmutableBytesWritable, ImmutableBytesWritable> {

    private int runId;
    private ObjectMapper objectMapper;

    @Override
    protected void setup(Context context) throws IOException {
        Configuration conf = context.getConfiguration();
        this.runId = Integer.valueOf(conf.get("stackoverflow_run_id"));
        this.objectMapper = new ObjectMapper();
    }

    @Override
    protected void reduce(ImmutableBytesWritable keyFromMap, Iterable<ImmutableBytesWritable> valuesFromMap, Context context) throws IOException, InterruptedException {
        // your code
        String json = objectMapper.writeValueAsString(someValue); // someValue is a placeholder for your object
        // your code
        context.write(new ImmutableBytesWritable(somekey.getBytes()), put); // somekey and put are placeholders
    }
}
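For the use case in the question (parsing a DistributedCache file into a HashMap inside setup()), a minimal sketch under the old DistributedCache API might look like the following; the reducer types, the tab-separated record format, and the class name are assumptions for illustration, not part of the original post:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class LookupReducer extends Reducer<Text, Text, Text, Text> {

    private final Map<String, String> lookup = new HashMap<String, String>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Runs once per reduce task, i.e. after the shuffle/merge for this partition.
        Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        if (cached == null || cached.length == 0) {
            return; // nothing was distributed to this task
        }
        BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                // assumed record format: lookupKey<TAB>lookupValue
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) {
                    lookup.put(parts[0], parts[1]);
                }
            }
        } finally {
            reader.close();
        }
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String extra = lookup.get(key.toString()); // fast in-memory lookup
        for (Text value : values) {
            context.write(key, new Text(value + "\t" + (extra == null ? "" : extra)));
        }
    }
}
Since setup() only runs once the reduce task starts, the parsing cannot overlap with the map phase; slow start only launches the reducer JVMs earlier so they can begin shuffling.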

Job-wide custom cleanup after all the map tasks are completed

While running a map-reduce job that has only mappers, I have a counter that counts the number of failed documents. After all the mappers are done, I want the job to fail if the total number of failed documents is above a fixed fraction. (I need to do it at the end because I don't know the total number of documents initially.) How can I achieve this without implementing a reducer just for this?
I know that there are task-level cleanup methods. But is there any job-level cleanup method that can be used to perform this after all the tasks are done?
This can be done very easily; that's the beauty of the latest MapReduce API.
The execution of a mapper can be controlled by overriding the run method in the Mapper class, and the same applies to the reducer. I do not know the final outcome that you are expecting, but I have prepared a small example for you.
In my mapper class I have overridden the run method; in this sample it interrupts the execution if the key value is greater than 200.
public class ReversingMapper extends Mapper<LongWritable, Text, ReverseIntWritable, Text> {

    public final LongWritable border = new LongWritable(100);

    @Override
    public void run(Context context) throws IOException, InterruptedException {
        setup(context);
        while (context.nextKeyValue()) {
            /* extra code to the standard run method starts here */
            // if (context.getCounter(<ENUM>).getValue() > 200) { ... } -- you can place your counter check here
            if (context.getCurrentKey().get() > 200) {
                throw new InterruptedException();
            } else {
            /* extra code to the standard run method ends here */
                map(context.getCurrentKey(), context.getCurrentValue(), context);
            }
        }
        cleanup(context); // keep the standard cleanup call from Mapper#run
    }
}
and you need to handle it properly in the driver as well:
} catch (InterruptedException e) {
    e.printStackTrace();
    System.exit(0);
}
You can also use a logger and log a proper message here if required.
I hope this solves your problem. Let me know if you need any more help.
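A different approach, not from the answer above, is to leave run() alone and do the job-wide check in the driver after the job finishes, since counters are aggregated across all tasks at that point. A minimal sketch, where the counter enum, the class name, and the threshold are my assumptions:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class FailedDocsDriver {

    // hypothetical counter enum, incremented in the mapper via context.getCounter(...)
    public enum DocCounters { TOTAL_DOCS, FAILED_DOCS }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "map-only job with job-wide failure check");
        // ... set jar, mapper class, input/output formats and paths here ...
        job.setNumReduceTasks(0); // map-only job

        boolean completed = job.waitForCompletion(true);

        // counters are aggregated across all map tasks once the job has finished
        long total = job.getCounters().findCounter(DocCounters.TOTAL_DOCS).getValue();
        long failed = job.getCounters().findCounter(DocCounters.FAILED_DOCS).getValue();

        double maxFailedFraction = 0.05; // assumed threshold
        if (!completed || (total > 0 && (double) failed / total > maxFailedFraction)) {
            System.err.println("Too many failed documents: " + failed + " of " + total);
            System.exit(1); // signal failure to the caller / workflow engine
        }
    }
}
This keeps the mappers untouched and moves the "after all tasks are done" decision to the only place that sees the aggregated counts.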

Calling progress or increasing a counter in the configure method of a reducer

Is it possible to do so?
Context: my configure method for a reducer needs to read a set of files from the DistributedCache (total size is ~150MB). However, I don't know why it takes so long that Hadoop kills some reducers, despite the fact that some reducers have finished successfully.
I use the old API, where I can only access the JobConf conf variable in the configure method.
My idea is to make the reporter variable a field so that I can call it in the configure method, but it seems configure is called before reduce is called.
Convert your code to use the new API!
Then in setup(), you can access the context variable and call progress() as follows:
@Override
protected void setup(Context context) throws IOException, InterruptedException {
    context.progress();
}
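Tying this back to the question, the point of calling progress() is to keep the task alive while setup() does its long DistributedCache read. A rough sketch of the pattern, where the class name, batch size, and parsing details are placeholders of my own:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class CacheLoadingReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        if (cached == null) {
            return;
        }
        long linesRead = 0;
        for (Path path : cached) {
            BufferedReader reader = new BufferedReader(new FileReader(path.toString()));
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    // ... parse the line into your in-memory structures ...
                    if (++linesRead % 100000 == 0) {
                        context.progress(); // heartbeat so the task is not killed for inactivity
                    }
                }
            } finally {
                reader.close();
            }
        }
    }
}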

Working of RecordReader in Hadoop

Can anyone explain how the RecordReader actually works? How do the methods nextKeyValue(), getCurrentKey() and getProgress() work after the program starts executing?
(new API): The default Mapper class has a run method which looks like this:
public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
    cleanup(context);
}
The Context.nextKeyValue(), Context.getCurrentKey() and Context.getCurrentValue() methods are wrappers for the RecordReader methods. See the source file src/mapred/org/apache/hadoop/mapreduce/MapContext.java.
So this loop executes and calls your Mapper implementation's map(K, V, Context) method.
Specifically, what else would you like to know?
The relevant code is in org.apache.hadoop.mapred.MapTask, in runNewMapper(). The important steps are:
creates the new mapper
gets the input split for the mapper
gets the RecordReader for the split
initializes the record reader
using the record reader, iterates through nextKeyValue() and passes the key/value to the mapper's map method
cleans up
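To make the contract concrete, here is a minimal sketch of a custom RecordReader that simply delegates to the built-in LineRecordReader; the class name is made up and this is not Hadoop source, just an illustration of which methods the mapper's run() loop ends up calling:
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class DelegatingLineRecordReader extends RecordReader<LongWritable, Text> {

    private final LineRecordReader delegate = new LineRecordReader();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        // called once with the split assigned to this map task
        delegate.initialize(split, context);
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        // advances to the next record; returning false ends the mapper's run() loop
        return delegate.nextKeyValue();
    }

    @Override
    public LongWritable getCurrentKey() throws IOException, InterruptedException {
        return delegate.getCurrentKey();
    }

    @Override
    public Text getCurrentValue() throws IOException, InterruptedException {
        return delegate.getCurrentValue();
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        return delegate.getProgress(); // fraction of the split consumed so far
    }

    @Override
    public void close() throws IOException {
        delegate.close();
    }
}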

Hadoop Global Property Conf.Set / Conf.Get in Cleanup()?

I am trying to use global variables in Hadoop via the conf.set() and context.getConfiguration().get() methods.
However, these don't seem to be working inside a cleanup method I'm using, though I am able to use the properties in the mapper and reducer. Is this strange or normal behaviour?
Is there any other way of propagating the value of a variable across MapReduce jobs, and inside the cleanup method of a Hadoop job?
The parameters set on the job's Configuration come through properly in the cleanup method.
The following is in the main method:
Configuration conf = new Configuration();
conf.set("test", "123");
Job job = new Job(conf);
The following is the Mapper#cleanup method
protected void cleanup(Context context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    String param = conf.get("test");
    System.out.println("clean p--> param = " + param);
}
The output of the above is:
clean p--> param = 123
Check your code again. BTW, I tested this against the 0.21 release.
