Actual processing time of hadoop job - hadoop

My cluster is currently occupied by a job A that takes a long time and has VERY_LOW priority.
I started another job B yesterday while A was already running, and I think it should have run quite fast.
However, the job details show that it took 47 minutes.
I don't think this is the actual processing time.
I'm trying to find out when the job really started.
Where can I look?

I can't find anything that states exactly what you're after, but you could open the job in the JobTracker web UI on port 50030 and look at the individual mapper and reducer details. There you can see how long each mapper and reducer took to complete its task, from its start and end times.
If there weren't any mappers or reducers free when you started the second job, it wouldn't have been able to make any progress until the first job released some. That might explain why it claimed to take so long: the two jobs might not actually have been running simultaneously. Comparing the time the job was started with the time its first mapper actually started should tell you whether it was just waiting around for resources. If it was, you can deduct the period between the job's start time and the first mapper's start time from the overall 47 minutes.
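As a rough illustration of that subtraction, here is a minimal Python sketch. The function name `split_job_time` and all three timestamps are made up for the example (they are not taken from the job in the question); you would copy the real values from the job and task pages by hand.

```python
from datetime import datetime

# Assumed timestamp layout; adjust it to whatever your job tracker displays.
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def split_job_time(job_start, first_task_start, job_finish):
    """Split a job's wall-clock time into waiting time and processing time."""
    started = datetime.strptime(job_start, TIME_FORMAT)
    first_task = datetime.strptime(first_task_start, TIME_FORMAT)
    finished = datetime.strptime(job_finish, TIME_FORMAT)
    waiting = first_task - started      # job accepted, but no task running yet
    processing = finished - first_task  # from the first task start to job end
    return waiting, processing


# Hypothetical values for a job that reports 47 minutes in total.
waiting, processing = split_job_time(
    "2013-01-01 10:00:00",  # job start time
    "2013-01-01 10:40:00",  # start time of the first mapper
    "2013-01-01 10:47:00",  # job finish time
)
print("waiting for slots:", waiting)
print("actually processing:", processing)
```

With these invented numbers, most of the 47 minutes would be waiting time rather than processing time.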

Related

Is there any way to know which job will start next in qsub

On our institute's supercomputer (IISc Bangalore), we submit jobs using qsub. When a job starts running depends on the following:
(1) Its wall time (expected completion time)
(2) Its position in the respective queue (small, medium, large, etc.)
So it is very difficult to know which job will start once a currently running job finishes. But qsub probably has a list of its own, which it uses to start a new job immediately after another one finishes.
Is there any way to know which job will start next? Is there a command for this?
Thank you.
Unfortunately, there is no clear way to know which job will run next on a supercomputing system. When a job starts depends not only on its wall time or its position in the queue but also on many other factors determined by site-level policy, scheduling strategies, and priorities. The institute may also apply some internal job ranking (priorities) based on factors like power management, load balancing, etc.
On the other hand, there is a lot of research on predicting the waiting time before a job is allocated. TeraGrid systems provide an estimated waiting time. Also, see link1, link2 (by SERC) for more information about predicting the waiting time.

How to resume failed execution in MapReduce?

I have the following requirement:
a. Let's say I have a 100 GB file of data.
b. I have written a MapReduce job to process this data with certain logic.
c. I started the MapReduce job, but it failed after reading 50 GB.
So my question is:
Can I resume the MapReduce job from the 51st GB?
Please let me know if anybody has an idea how to do this. I don't want to reprocess the data that was already processed before the point of failure.
Thanks in advance
Brief answer: no.
And that's why working with large batch processing systems such as Hadoop or MPI is hard. Restarts of large jobs are not only inefficient in terms of resource consumption but also very demoralizing. That's why your primary goal should be to reduce the running time of a single job to no more than a couple of hours. Maybe some day it will be possible to "pause" jobs and "hot fix" code, but to my knowledge this is not currently supported.
Solution #1. Split your job into an error-prone parallelizable job and a final error-free non-parallelizable job. Consider the following example: you have hundreds of gigabytes of textual access logs from a web server and you want to write a job that prints how popular different browsers are. If you combine parsing and aggregating (summing) into a single huge job, its running time will be on the order of days, and the chances that it will fail are very high, because textual logs are usually hard to parse due to ambiguity. A much better idea is to split this job into two separate jobs:
The first job is solely responsible for parsing log files. It outputs only the browser string and doesn't even need any reducers. This job is where 99% of all errors occur, because this is where the "wild" data gets parsed. It is parallelizable in the sense that you can split your input into chunks and process each chunk separately, so that each chunk is processed in 10-30 minutes. If the job fails for some chunk, you fix it and restart; 30 minutes is not a big loss.
The second job is the grand job that takes the outputs from the instances of the first job and performs the aggregation. Because the aggregation code is very simple, this job is not likely to fail.
           chunk(20G)->parse-job(20G)->browsers(0.5G)
           chunk(20G)->parse-job(20G)->browsers(0.5G)
input(1T)->chunk(20G)->parse-job(20G)->browsers(0.5G)->aggregate-job->output
           ... .... ...
           chunk(20G)->parse-job(20G)->browsers(0.5G)
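To make the idea concrete outside of Hadoop, here is a small plain-Python sketch of the same two-stage shape. It is not MapReduce code: the tab-separated log format, the chunk names, and the helper names `parse_chunk`, `run_parse_stage`, and `aggregate` are all invented for illustration. The point is only that parsed chunks are kept, so a rerun touches just the chunks that failed.

```python
from collections import Counter


def parse_chunk(lines):
    """Parse-only stage: extract the browser string from each log line.

    Assumes a hypothetical tab-separated format whose last field is the browser.
    """
    browsers = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            raise ValueError(f"unparseable line: {line!r}")
        browsers.append(fields[-1])
    return browsers


def run_parse_stage(chunks, done=None):
    """Parse every chunk that has no saved output yet; report the ones that fail."""
    done = {} if done is None else done
    failed = []
    for chunk_id, lines in chunks.items():
        if chunk_id in done:
            continue  # already parsed in an earlier run, nothing to redo
        try:
            done[chunk_id] = parse_chunk(lines)
        except ValueError:
            failed.append(chunk_id)
    return done, failed


def aggregate(parsed_chunks):
    """Aggregation stage: simple counting that is unlikely to fail."""
    counts = Counter()
    for browsers in parsed_chunks:
        counts.update(browsers)
    return counts


chunks = {
    "chunk-0": ["10.0.0.1\tFirefox", "10.0.0.2\tChrome"],
    "chunk-1": ["garbage"],  # a "wild" line that breaks the parser
    "chunk-2": ["10.0.0.3\tFirefox"],
}
done, failed = run_parse_stage(chunks)  # only chunk-1 fails
first_run_failed = list(failed)

chunks["chunk-1"] = ["10.0.0.4\tChrome"]  # stand-in for "fix it and restart"
done, failed = run_parse_stage(chunks, done)  # reruns chunk-1 only
totals = aggregate(done.values())
print(totals)
```

In a real setup the `done` dictionary would be the per-chunk output directories left behind by each parse job.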
Solution #2. Sometimes you may be satisfied with the result even if parts of the input data are dropped. In this case you may set the options mapred.max.map.failures.percent and/or mapred.max.reduce.failures.percent to non-zero values.
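As a rough mental model of what such a threshold means, here is a tiny Python sketch. It is not Hadoop's implementation, and the function name `job_survives` is made up. It assumes the option is read as "the job is still considered successful as long as at most this percentage of its tasks have failed for good".

```python
def job_survives(total_tasks, failed_tasks, max_failures_percent):
    """Return True if the share of permanently failed tasks is within the limit.

    Simplified model of a failure-percentage threshold, not Hadoop's own code.
    """
    if total_tasks <= 0:
        raise ValueError("total_tasks must be positive")
    # Compare in integer arithmetic to avoid floating-point rounding surprises.
    return failed_tasks * 100 <= max_failures_percent * total_tasks


# With the default of 0, a single permanently failed task fails the job.
print(job_survives(total_tasks=200, failed_tasks=1, max_failures_percent=0))
# With 5 percent tolerated, up to 10 of 200 tasks may be dropped.
print(job_survives(total_tasks=200, failed_tasks=10, max_failures_percent=5))
print(job_survives(total_tasks=200, failed_tasks=11, max_failures_percent=5))
```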
If your entire job fails, the output gets cleared, so you lose whatever you processed. However, Hadoop retries failed tasks of a job. So as long as your failure is recoverable within the preconfigured number of attempts, the job will not fail and you are not going to lose the output from already completed tasks.
If your failure is not recoverable, then in most cases it is your fault, and you might need to do one or more of the following:
Fix your code; even a simple bug may cause all your tasks to fail consistently.
Use fewer resources (e.g. take care of available memory).
Partition the problem better (see if some tasks are fed more data than others, or make sure the task input is split into smaller chunks).
Get more cluster capacity.

How jobs are assigned to executors in Spark Streaming?

Let's say I've got 2 or more executors in a Spark Streaming application.
I've set a batch interval of 10 seconds, so a job is started every 10 seconds, reading input from my HDFS.
If every job lasts for more than 10 seconds, the new job that is started is assigned to a free executor, right?
Even if the previous one didn't finish?
I know it seems like an obvious answer, but I haven't found anything about job scheduling on the website or in the paper related to Spark Streaming.
If you know some links where all of those things are explained, I would really appreciate to see them.
Thank you.
Actually, in the current implementation of Spark Streaming and under the default configuration, only one job is active (i.e. under execution) at any point in time. So if one batch's processing takes longer than 10 seconds, then the next batch's jobs will stay queued.
This can be changed with an experimental Spark property, "spark.streaming.concurrentJobs", which is set to 1 by default. It's not currently documented (maybe I should add it).
The reason it is set to 1 is that concurrent jobs can potentially lead to weird sharing of resources, which can make it hard to debug whether there are sufficient resources in the system to process the ingested data fast enough. With only 1 job running at a time, it is easy to see that if batch processing time < batch interval, then the system will be stable. Granted, this may not be the most efficient use of resources under certain conditions. We definitely hope to improve this in the future.
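To see why that stability condition matters, here is a toy Python model of a single job slot. It is not Spark code; the function name `scheduling_delays` and the simplification of one job per batch with a fixed processing time are assumptions made for the illustration.

```python
def scheduling_delays(batch_interval, processing_times):
    """Return how long each batch waits in the queue when one job runs at a time.

    Toy model: batch i becomes ready at the end of its interval and can only
    start once the previous batch has finished.
    """
    delays = []
    free_at = 0  # time at which the single job slot becomes free
    for i, processing_time in enumerate(processing_times):
        ready_at = (i + 1) * batch_interval
        start_at = max(ready_at, free_at)
        delays.append(start_at - ready_at)
        free_at = start_at + processing_time
    return delays


# Processing time below the 10 s interval: no batch ever waits.
print(scheduling_delays(10, [8, 8, 8, 8]))
# Processing time above the interval: every batch waits longer than the last.
print(scheduling_delays(10, [12, 12, 12, 12]))
```

In the second case the queueing delay keeps growing by 2 seconds per batch, which is the unstable situation described above.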
There is a little bit of material regarding the internals of Spark Streaming in these meetup slides (sorry about the shameless self-advertising :) ). That may be useful to you.

How to estimate the number of instances in Amazon EMR?

I have a map-reduce job to be run on Amazon EMR. I would like to have up to 400 mappers and reducers, and I would like to use either Medium or Large instances. How can I estimate the number of instances I need?
Besides, if one job ends within 2 minutes, let's say, and I run another job which takes 4 minutes, will I be charged for 2 hours, or is that considered 1 hour?
I know if you use the CLI tool to create your Job Flow and add the steps, then you can run both of the steps one after another on the same job flow and they will be counted within the same hour.
I believe that if you use the GUI, you cannot reuse the job flow, so you may get charged one hour for each job. I haven't tried this, though, so I may be wrong there.
Check this article which is where I got the information:
https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+on+Elastic+MapReduce
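The difference between the two cases comes down to where the rounding happens. Here is a small Python sketch of that arithmetic; it assumes each job flow is billed per started hour per instance, which is the model this answer describes, and the function names are made up.

```python
import math


def billed_hours_shared_flow(job_minutes):
    """Hours billed per instance when all jobs run as steps of one job flow."""
    return math.ceil(sum(job_minutes) / 60)


def billed_hours_separate_flows(job_minutes):
    """Hours billed per instance when every job gets its own job flow."""
    return sum(math.ceil(minutes / 60) for minutes in job_minutes)


jobs = [2, 4]  # the 2-minute and 4-minute jobs from the question
print(billed_hours_shared_flow(jobs))
print(billed_hours_separate_flows(jobs))
```

Under that assumption the two jobs cost one instance-hour when run as steps of the same job flow and two when run as separate job flows.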

All map tasks reached 100%, but still in running state

In my MR job, which does bulk loading using HFileOutputFormat, 87 map tasks are spawned, and within around 20 minutes all of them reached 100%. Yet the individual task status is still 'Running' on the Hadoop admin page, and none of the tasks has moved to the completed state. The reducer is always in the pending state and never starts. I just waited, but the job errored out after the 30-minute timeout.
My job has to load around 150+ columns. I tried running the same MR job with fewer columns, and it completes easily. Any idea why the map tasks are not moved to the completed state even after reaching 100%?
One probable cause is that the emitted output data is huge. Sorting it and writing it back to disk would be time-consuming. This is typically not the case, though.
It would also be wise to check the logs and look for ways to improve your map-reduce code.
