How to fake task reporting in hadoop job? - hadoop

I am using hadoop 1.0.3 to run some data crunching jobs. My reducer does not write to the HDFS, instead, I make my reducer write the result directly to mongoDB. Recently I have started to face a problem; my jobs some times "timeout" and restart and the message that I get from hadoop console is "Task attempt_201301241103_0003_m_000001_0 failed to report status for 601 seconds". So I think the problem lies with my approach, which is to write to mongodb instead of HDFS. I want to fake hadoop job status report. How can I do that ? Please help.
Also, I have observed that my reducer always remains 0% and only the Map phase shows constant increment in %. As soon as the job completes, the reducer shows 100% all of a sudden.
Thankyou,
Regards,
Mohsin

The message on the console you are seeing is from a map phase. Notice the "m" in it. To keep sending progress, you can do context.progress(); in the map method.
http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/StatusReporter.html

Related

how to check that mapreduce is running parallelly?

I submitted a mapreduce job and checked the log.
In the log l see that there are many mappers, each mapper processes one split, and the processing details of each mapper is logged in the log file sequentially in time.
However, I would like to check if my job is running parallelly and I want to see how many mappers are running concurrently.
I don't know where to find these informations.
Please help me, thx!
Use following Jobtracker Web UI and drill down to executing MapReduce job
http://<Jobtracker-HostName>:50030/

What is the difference between job.submit and job.waitForComplete in Apache Hadoop?

I have read the documentation so I know the difference.
My question however is that, is there any risk in using .submit instead of .waitForComplete if I want to run several Hadoop jobs on a cluster in parallel ?
I mostly use Elastic Map Reduce.
When I tried doing so, I noticed that only the first job being executed.
If your aim is to run jobs in parallel then there is certainly no risk in using job.submit(). The main reason job.waitForCompletion exists is that it's method call returns only when the job gets finished, and it returns with it's success or failure status which can be used to determine that further steps are to be run or not.
Now, getting back at you seeing only the first job being executed, this is because by default Hadoop schedules the jobs in FIFO order. You certainly can change this behaviour. Read more here.

Hadoop reuse Job object

I have a pool of Jobs from which I retrieve jobs and start them. The pattern is something like:
Job job = JobPool.getJob();
job.waitForCompletion();
JobPool.release(job);
I get a problem when I try to reuse a job object in the sense that it doesn't even run (most probably because it's status is : COMPLETED). So, in the following snippet the second waitForCompletion call prints the statistics/counters of the job and doesn't do anything else.
Job jobX = JobPool.getJob();
jobX.waitForCompletion();
JobPool.release(jobX);
//.......
Job jobX = JobPool.getJob();
jobX.waitForCompletion(); // <--- here the job should run, but it doesn't
Am I right when I say that the job doesn't actually run because hadoop sees its status as completed and it doesn't event try to run it ? If yes, do you know how to reset a job object so that I can reuse it ?
The Javadoc includes this hint that the jobs should only run once
The set methods only work until the job is submitted, afterwards they will throw an IllegalStateException.
I think there's some confusion about the job, and the view of the job. The latter is the thing that you have got, and it is designed to map to at most one job running in hadoop. The view of the job is fundamentally light weight, and if creating that object is expensive relative to actually running the job... well, I've got to believe that your jobs are simple enough that you don't need hadoop.
Using the view to submit a job is potentially expensive (copying jars into the cluster, initializing the job in the JobTracker, and so on); conceptually, the idea of telling the jobtracker to "rerun " or "copy ; run ", makes sense. As far as I can tell, there's no support for either of those ideas in practice. I suspect that hadoop isn't actually guaranteeing retention policies that would support either use case.

How to debug a hung hadoop map-reduce job

I run MR Job, Map Phase run successful, but Reduce Phase complied at 33% and hang (hanging about 1 hour) status: "reduce > sort"
How i can debug it?
It may be nothing to do with your case, but I had this happen when IPTABLES (~firewall) was mis-configured on one node. When that node was assigned a reducer role, the reduce phase would hang at 33%. Check the error logs to make sure the connections are working, especially if you have recently added new nodes and/or configured them manually.

A tool showing a breakdown of completion times and source machine names for each and every mapper and reducer?

I know job tasks page (in the JobTracker UI) is already showing start time and end time of every tasks in mapper and reducer but I would like to see something more like source machine names, number of spills and so on. I guess I can try to write such a tool using JobTracker class? But before embarking on that, I would like to see if there is such a tool already.
Does the hadoop job -history all output-dir command give you enough information to parse / process?
http://hadoop.apache.org/common/docs/r1.0.3/cluster_setup.html - Search for the above command

Resources