All map tasks reached 100%, but still in running state - hadoop

In my MR job, which does bulk loading using HFileOutputFormat, 87 map tasks are spawned and all of them reach 100% within around 20 minutes. Yet the individual task status is still 'Running' in the Hadoop admin page and none of them moves to the completed state. The reducer stays in the pending state and never starts. I just waited, but the job errored out after the 30-minute timeout.
My job has to load around 150+ columns. I tried running the same MR job with fewer columns and it completes easily. Any idea why the map tasks do not move to the completed state even after reaching 100%?

One probable cause is that the map output being emitted is huge; sorting it and writing it back to disk is a time-consuming thing to do. That said, this is typically not the case.
It would also be wise to check the logs and look for ways to improve your map-reduce code.
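If huge map output does turn out to be the bottleneck, one common mitigation is to compress the intermediate map output and enlarge the in-memory sort buffer. A minimal sketch, assuming Hadoop 2.x property names and that the Snappy codec is available on the cluster:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;

public class BulkLoadJobTuning {
    public static Job createJob() throws Exception {
        Configuration conf = new Configuration();
        // Compress intermediate map output so less data is spilled and shuffled.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                SnappyCodec.class, CompressionCodec.class);
        // Enlarge the in-memory sort buffer (in MB) to reduce the number of spills.
        conf.setInt("mapreduce.task.io.sort.mb", 512);
        return Job.getInstance(conf, "hfile-bulk-load");
    }
}
```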

Related

Java 8 parallelStream worker issue

I am running a weekly job using Java 8 and Spring Boot. I use a custom ForkJoinPool with 8 threads, and the job takes 3 hours to complete. When I check the logs, I see that the throughput is consistent until about 80% of the job, with 5 to 6 threads running fine. But after the job is around 80% complete, I see only one thread running and the throughput drops drastically.
From my initial analysis, I feel the threads are somehow lost after 80%, but I am not sure.
Question:
1) Any hints on what is going wrong?
2) What is the best way to debug and fix this issue, so that all threads keep running until the job completes?
I think the job should complete in less time than it does now, and I feel the threads might be the issue.
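For reference, the pattern being described (running a parallel stream inside a custom ForkJoinPool) is usually wired up along these lines; the pool size, record type, and process() method below are illustrative assumptions, not the asker's actual code:

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ForkJoinPool;

public class WeeklyJobRunner {
    public static void main(String[] args) throws Exception {
        List<Integer> records = Arrays.asList(1, 2, 3, 4, 5); // stand-in for the real work items
        ForkJoinPool customPool = new ForkJoinPool(8);         // custom pool instead of the common pool
        try {
            customPool.submit(() ->
                    records.parallelStream().forEach(WeeklyJobRunner::process)
            ).get(); // block until the whole stream pipeline finishes
        } finally {
            customPool.shutdown();
        }
    }

    private static void process(int record) {
        // stand-in for the per-record work done by the weekly job
    }
}
```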

How to resume failed execution in MapReduce?

I have one requirement saying that:
a. Let's say I have 100 GB of file/data.
b. I have written a MapReduce job to process this data with certain logic.
c. I fired the MapReduce job, but it failed after reading 50 GB.
So my question is:
Can I resume the MapReduce job from the 51st GB?
Please let me know if anybody has an idea of how to do it; I don't want to reprocess the data I already processed before the point of failure.
Thanks in advance
Brief answer: no.
And that's why working with large batch-processing systems such as Hadoop or MPI is hard. Restarts of large jobs are not only inefficient from a resource-consumption point of view, they are also psychologically depressing. That's why your primary goal is to reduce the running time of a single job to no more than a couple of hours. Maybe some day it will be possible to "pause" jobs and "hot fix" code, but to my knowledge this is currently not supported.
Solution #1. Split your job into an error-prone, parallelizable job and a final, error-free, non-parallelizable job. Consider the following example: you have hundreds of gigabytes of textual access logs from a web server, and you want to write a job that will print how popular different browsers are. If you combine parsing and aggregating (summing) into a single huge job, its running time will be on the order of days, and the chances that it will fail are very high, because textual logs are usually hard to parse due to ambiguity. A much better idea is to split this into two separate jobs:
The first job is solely responsible for parsing the log files. It prints only the browser string as its output and doesn't even need any reducers. This job is where 99% of all errors happen, because this is where the "wild" data is parsed. It is parallelizable in the sense that you can split your input into chunks and process each chunk separately, so that each chunk is processed in 10-30 minutes. If the job fails for some chunk, you fix it and restart; 30 minutes is not a big loss. (A sketch of such a map-only parse job follows the diagram below.)
The second job is the grand job that takes the outputs of the first-job instances and performs the aggregation. Because the aggregation code is very simple, this job is not likely to fail.
             chunk(20G) -> parse-job(20G) -> browsers(0.5G)
             chunk(20G) -> parse-job(20G) -> browsers(0.5G)
input(1T) -> chunk(20G) -> parse-job(20G) -> browsers(0.5G) -> aggregate-job -> output
             ...
             chunk(20G) -> parse-job(20G) -> browsers(0.5G)
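A rough sketch of what such a map-only parse job could look like; the extractBrowser() helper and the exact log format are illustrative assumptions:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only job: parse each raw access-log line and emit just the browser string.
public class ParseJob {
    public static class ParseMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable key, Text line, Context context)
                throws IOException, InterruptedException {
            String browser = extractBrowser(line.toString());
            if (browser != null) {
                context.write(new Text(browser), NullWritable.get());
            }
        }

        private String extractBrowser(String logLine) {
            // Hypothetical placeholder for the real (fragile) user-agent parsing.
            String[] fields = logLine.split("\"");
            return fields.length >= 6 ? fields[5] : null;
        }
    }

    public static void configure(Job job) {
        job.setMapperClass(ParseMapper.class);
        job.setNumReduceTasks(0); // no reducers: map output goes straight to HDFS
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
    }
}
```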
Solution #2. Sometimes you may be satisfied with the result even if parts of the input data are dropped. In this case you may set the options mapred.max.map.failures.percent and/or mapred.max.reduce.failures.percent to non-zero values.
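If you go this route, the old mapred API exposes setters that correspond to those properties; a minimal sketch (the 5% / 1% thresholds are just illustrative):

```java
import org.apache.hadoop.mapred.JobConf;

public class TolerantJobConfig {
    public static JobConf withFailureTolerance(JobConf conf) {
        // Corresponds to mapred.max.map.failures.percent: up to 5% of map
        // tasks may fail without failing the whole job.
        conf.setMaxMapTaskFailuresPercent(5);
        // Corresponds to mapred.max.reduce.failures.percent.
        conf.setMaxReduceTaskFailuresPercent(1);
        return conf;
    }
}
```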
If your entire job fails, the output gets cleared, so you lose whatever you have processed. However, Hadoop retries failed tasks of a job, so as long as your failure is recoverable within the preconfigured number of attempts, the job will not fail and you will not lose the output from already completed tasks. (A sketch of how to raise that number of attempts follows the list below.)
If your failure is not recoverable, then in most cases it is your fault, and you might need to do one or more of the following:
Fix your code; even a simple bug may cause all your tasks to fail consistently
Use fewer resources (e.g. take care of the available memory)
Partition the problem better (see whether some tasks are fed more data than others, or make sure task input is being split into smaller chunks)
Get bigger cluster capacity.
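The "preconfigured number of attempts" mentioned above is tunable. A minimal sketch, assuming the Hadoop 2.x property names (mapred.map.max.attempts / mapred.reduce.max.attempts on older releases):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class RetryConfig {
    public static Job withMoreAttempts() throws Exception {
        Configuration conf = new Configuration();
        // Give each task more chances before the whole job is declared failed.
        conf.setInt("mapreduce.map.maxattempts", 8);    // default: 4
        conf.setInt("mapreduce.reduce.maxattempts", 8); // default: 4
        return Job.getInstance(conf, "retry-tolerant-job");
    }
}
```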

Job unexpectedly cancelled due to time limit

There are several partitions on the cluster I work on. With sinfo I can see the time limit for each partition. I put my code to work on the mid1 partition, which has a time limit of 8-00:00:00, from which I understand that the time limit is 8 days. I had to wait for 1-15:23:41, which means nearly 1 day and 15 hours. However, my code ran for only 00:02:24, which means nearly 2.5 minutes (and the solution was converging). Also, I did not set a time limit in the file submitted with sbatch. The reason my code was stopped was given as:
JOB 3216125 CANCELLED AT 2015-12-19T04:22:04 DUE TO TIME LIMIT
So why was my code stopped if I did not exceed the time limit? I asked the people responsible for the cluster, but they did not respond.
Look at the value of DefaultTime in the output of scontrol show partitions. This is the maximum time allocated to your job in case you do not specify it yourself with --time.
Most probably this value is set to 2 minutes to force you to specify a sensible time limit (within the limits of the partition).

Actual processing time of hadoop job

My cluster is currently occupied by a job A that takes a long time and has VERY_LOW priority.
I started another job B yesterday while A was already running, and I think it should have run quite fast.
However, I saw it took 47 minutes at the job details.
I don't think this is the actual processing time.
I'm trying to find out when the job really started.
Where can I look?
I can't seem to find anything that states exactly what you're after, but you could look at the job in the JobTracker on port 50030 and check the individual mapper and reducer details. There you can see how long each individual mapper and reducer took to complete its task, from their start and end times.
If there weren't any mappers or reducers free when you started the second job, the second job wouldn't be able to make any progress until the first job released them, which might explain why it claimed to take so long, as the two jobs might not have actually been running simultaneously. The gap between the time the job was started and the time the first mapper actually started should give you an indication of whether it was just waiting around for resources, which means you can deduct that waiting period from the overall 47 minutes.
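If you would rather pull those timestamps programmatically than read them off the web UI, something along these lines could work; a sketch only, assuming a Hadoop 2.x-era client API, and the job ID below is illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Cluster;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobID;
import org.apache.hadoop.mapreduce.TaskReport;
import org.apache.hadoop.mapreduce.TaskType;

public class JobTimes {
    public static void main(String[] args) throws Exception {
        Cluster cluster = new Cluster(new Configuration());
        // Illustrative job ID; substitute the real one shown on the job page.
        Job job = cluster.getJob(JobID.forName("job_201512010000_0042"));

        long submitted = job.getStartTime();
        long firstMapStart = Long.MAX_VALUE;
        for (TaskReport report : job.getTaskReports(TaskType.MAP)) {
            firstMapStart = Math.min(firstMapStart, report.getStartTime());
        }
        // The gap between submission and the first map start is time spent
        // waiting for slots, not actual processing time.
        System.out.println("Waited for resources (ms): " + (firstMapStart - submitted));
        System.out.println("Ran (ms): " + (job.getFinishTime() - firstMapStart));
    }
}
```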

Performance of Resque jobs

My Resque job basically takes a params hash and stores it in the DB. In the process it does several reads and writes.
These R/Ws take approx. 5 ms in total on my local machine and a little bit more on Heroku (I guess because of the shared DB).
However, the rate at which the queue is processed is very low, about 2-3 jobs per second. What could be causing this?
Thank you.
A worker's loop is roughly: check for a new job, lock the job, do the job, mark it as completed, look for a new job.
You might find that the negotiation to get a new job, accessing Redis, etc. is causing a lot of overhead. If your task is only 5 ms long, it can probably live inside the request-response cycle. Background jobs are great when running a task would extend the response time considerably; very small jobs generally aren't worth the overhead involved.