I have a bulk load job that uses HFileOutputFormat to load an HBase table. The map phase completes within 2-3 minutes, and the reduce phase (PutSortReducer, invoked by HFileOutputFormat) reaches 92% in the next 2 minutes, but then takes around 9 minutes to finish the remaining 8%.
In total, 10 reduce tasks are spawned by the job. Of these, 8 or 9 always complete within 2-3 minutes, while the remaining one or two take around 9 minutes. These last one or two tasks are usually ones restarted in place of failed tasks, and the logs don't show any evident errors as the reason for the failures.
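For context, a minimal sketch of this kind of bulk-load driver, assuming the usual HFileOutputFormat.configureIncrementalLoad() wiring; the mapper, table name, column family and input layout below are placeholders, not taken from the job in question:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class BulkLoadDriver {

        // Placeholder mapper: turns one tab-separated line into a Put for a single column family.
        public static class PutMapper
                extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String[] fields = value.toString().split("\t");
                if (fields.length < 2) return;                        // skip malformed lines
                byte[] row = Bytes.toBytes(fields[0]);
                Put put = new Put(row);
                put.add(Bytes.toBytes("cf"), Bytes.toBytes("col1"), Bytes.toBytes(fields[1]));
                context.write(new ImmutableBytesWritable(row), put);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            Job job = Job.getInstance(conf, "hbase-bulk-load");
            job.setJarByClass(BulkLoadDriver.class);

            job.setMapperClass(PutMapper.class);
            job.setMapOutputKeyClass(ImmutableBytesWritable.class);
            job.setMapOutputValueClass(Put.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));   // HFiles land here

            // Plugs in HFileOutputFormat, PutSortReducer and a TotalOrderPartitioner,
            // and sets the number of reduce tasks to the target table's region count.
            HTable table = new HTable(conf, "my_table");              // placeholder table name
            HFileOutputFormat.configureIncrementalLoad(job, table);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Because configureIncrementalLoad() sets one reduce task per region and range-partitions the rows by region boundaries, the 10 reducers here would correspond to 10 regions of the target table.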
I am running a Spark Streaming (1.6.1) job on YARN, using the Kafka direct API to read events from a topic with 50 partitions and write them to HDFS. My batch interval is 60 seconds. I was receiving around 500K messages per batch, which were processed within 60 seconds.
Suddenly Spark started receiving 15-20 million messages, which take around 5-6 minutes to process against the 60-second batch interval. I have configured "spark.streaming.concurrentJobs=4".
So when a batch takes a long time to process, Spark runs 4 batch jobs concurrently to work through the backlog, but over time the backlog still grows because the batch interval is too short for this volume of data.
I have a few doubts around this.
When I receive 15-20 million messages and processing takes around 5-6 minutes against the 60-second batch interval, I still see new files in my HDFS directory every 60 seconds, each batch with 50 part files. This confuses me: if a batch takes 5-6 minutes to process, how are files being written to HDFS every minute, given that the 'saveAsTextFile' action is called only once per batch? The total record count across the 50 part files comes to around 3.3 million.
To handle the 15-20 million messages, I increased my batch interval to 8-10 minutes; Spark then started consuming around 35-40 million messages from Kafka per batch, and its processing time again started exceeding the batch interval.
I have configured 'spark.streaming.kafka.maxRatePerPartition=50' & 'spark.streaming.backpressure.enabled=true'.
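For reference, with the direct API spark.streaming.kafka.maxRatePerPartition caps ingestion at that many records per second per Kafka partition, so together with the batch interval it bounds the maximum batch size. A minimal sketch of how these settings fit together (the app name is a placeholder):

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class RateLimitSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                .setAppName("rate-limit-sketch")                        // placeholder app name
                .set("spark.streaming.kafka.maxRatePerPartition", "50") // records/sec per Kafka partition
                .set("spark.streaming.backpressure.enabled", "true");   // adapt the rate to processing speed

            // 60-second batches, as in the question.
            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(60));

            // Theoretical upper bound per batch with the direct API:
            //   maxRatePerPartition * #partitions * batchIntervalSeconds
            //   = 50 * 50 * 60 = 150,000 records per batch,
            // so if batches of 15-20 million records are observed, it is worth checking
            // whether these settings were actually in effect for that run.
        }
    }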
I think one thing that may have confused you is the relationship between the length of a job and its frequency.
From what you describe, with the resources available, each batch ends up taking about 5 minutes to complete, while your batch interval is 1 minute.
So every 1 minute you kick off a batch that takes about 5 minutes to complete.
As a result, you would expect HDFS to receive nothing for the first few minutes, and then to receive output every 1 minute, but with roughly a 5-minute 'delay' from when the data went in.
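A minimal sketch of the kind of pipeline being described, assuming the output is written with saveAsTextFiles on the underlying DStream (broker address, topic and output prefix are placeholders); each completed batch produces its own timestamped output directory, which is why new files keep appearing roughly once per batch interval even while processing lags behind:

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;

    import kafka.serializer.StringDecoder;
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaPairInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.kafka.KafkaUtils;

    public class KafkaToHdfsSketch {
        public static void main(String[] args) throws Exception {
            SparkConf conf = new SparkConf().setAppName("kafka-to-hdfs-sketch");
            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(60));

            Map<String, String> kafkaParams = new HashMap<>();
            kafkaParams.put("metadata.broker.list", "broker1:9092");   // placeholder broker list
            Set<String> topics = Collections.singleton("events");      // placeholder topic name

            // Direct API: one RDD partition per Kafka partition (50 here),
            // which is why each batch produces 50 part files.
            JavaPairInputDStream<String, String> stream = KafkaUtils.createDirectStream(
                jssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
                kafkaParams, topics);

            JavaDStream<String> values = stream.map(record -> record._2());

            // Writes one directory per batch, named <prefix>-<batchTimeMs>.txt;
            // output appears once per completed batch, delayed by however long that batch took.
            values.dstream().saveAsTextFiles("hdfs:///data/events/out", "txt");

            jssc.start();
            jssc.awaitTermination();
        }
    }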
There are several partitions on the cluster I work on, and with sinfo I can see the time limit for each of them. I submitted my code to the mid1 partition, which has a time limit of 8-00:00:00, which I understand to mean 8 days. I had to wait 1-15:23:41 in the queue, i.e. nearly 1 day and 15 hours, but my code ran for only 00:02:24, nearly 2.5 minutes (and the solution was converging). Also, I did not set a time limit in the file I submitted with sbatch. The reason my code stopped was given as:
JOB 3216125 CANCELLED AT 2015-12-19T04:22:04 DUE TO TIME LIMIT
So why was my job stopped if I did not exceed the time limit? I asked the people responsible for the cluster about this, but they have not replied.
Look at the value of DefaultTime in the output of scontrol show partitions. This is the maximum time allocated to your job when you do not specify a limit yourself with --time.
Most probably this value is set to 2 minutes to force you to specify a sensible time limit (within the limits of the partition).
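For example, a job script requesting an explicit limit would look something like this (the partition name, time value and executable are placeholders to adapt):

    #!/bin/bash
    #SBATCH --partition=mid1
    #SBATCH --time=02:00:00      # explicit hh:mm:ss wall-clock limit (placeholder value)
    srun ./my_solver             # placeholder executable

Submit it with sbatch as usual; you can confirm the partition's DefaultTime and MaxTime beforehand with scontrol show partition mid1.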
I would like to know the real meaning of these two counters: "Total time spent by all maps in occupied slots (ms)" and "Total time spent by all reduces in occupied slots (ms)". I just wrote an MR program similar to word count, and I got:

Total time spent by all maps in occupied slots (ms) = 15667400
Total time spent by all reduces in occupied slots (ms) = 158952
CPU time spent (ms) = 51930
real 7m38.886s

Why is it so? The first counter has a very high value, which is not even comparable with the other three. Kindly clarify this for me.
Thank you.
We would probably need some more context around your input data, but the first two counters show how much time was spent across all map tasks and all reduce tasks, summed over every task. This number is larger than everything else because you presumably have a multi-node Hadoop cluster and a large input dataset, meaning you have lots of map tasks running in parallel. Say you have 1000 map tasks running in parallel and each takes 10 seconds to complete: the total time across all mappers would then be 1000 * 10 = 10,000 seconds. In reality the map phase may only take 10-30 seconds of wall-clock time when run in parallel, but on a single-node, single-map-slot cluster those same tasks, run in serial, would indeed take 10,000 seconds to complete.
The CPU time spent counter refers to how much of that total time was pure CPU processing. It is smaller than the others because your job is mostly I/O bound (reading from and writing to disk, or transferring data across the network).
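To make the scale concrete, here is a rough back-of-the-envelope sketch using the counter values quoted in the question (pure arithmetic, no Hadoop APIs; the "implied parallelism" figure is only an approximation, since it assumes the map slots were busy for the whole run):

    public class SlotTimeCheck {
        public static void main(String[] args) {
            long mapSlotMs    = 15_667_400L; // Total time spent by all maps in occupied slots (ms)
            long reduceSlotMs = 158_952L;    // Total time spent by all reduces in occupied slots (ms)
            long cpuMs        = 51_930L;     // CPU time spent (ms)
            long wallClockMs  = 7 * 60_000L + 38_886L; // "real 7m38.886s" = 458,886 ms

            // If roughly N map slots are busy in parallel, aggregate slot time ~= N * wall-clock time.
            System.out.printf("Implied average map parallelism ~= %.1f slots%n",
                (double) mapSlotMs / wallClockMs);                      // ~34 slots

            // CPU time is a tiny fraction of slot time: most of it is I/O, shuffle and waiting.
            System.out.printf("CPU share of aggregate slot time ~= %.2f%%%n",
                100.0 * cpuMs / (mapSlotMs + reduceSlotMs));            // ~0.33%
        }
    }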
In my MR job, which does bulk loading using HFileOutputFormat, 87 map tasks are spawned, and within around 20 minutes all of them reach 100%. Yet the individual task status is still 'Running' in the Hadoop admin page, and none of them move to the completed state. The reducer stays in the pending state and never starts. I just waited, but the job errored out after the 30-minute timeout.
My job has to load 150+ columns. I tried running the same MR job with a smaller number of columns and it completes easily. Any idea why the map tasks are not moved to the completed state even after reaching 100%?
One probable cause would be that the output data emitted by the mappers is huge; sorting it and writing it back to disk is a time-consuming thing to do, though this is typically not the case.
It would also be wise to check the logs and look for ways to improve your map-reduce code.
Say I have 100 mappers running in parallel and 500 mappers in total.
The input size received by each mapper is almost the same, and the processing time each mapper takes should be more or less identical.
But the first 100 mappers finish in about 20 minutes, the next 100 take 25-30 minutes, and the batch after that takes around 40-50 minutes each. And then later we get a GC overhead limit exceeded error.
Why is this happening?
I have the following configurations already set:
<property>
  <!-- JVM options for each spawned map/reduce task (4 GB max heap) -->
  <name>mapred.child.java.opts</name>
  <value>-Xmx4096m</value>
</property>
<property>
  <!-- run at most one task per JVM, i.e. no JVM reuse between tasks -->
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>1</value>
</property>
What else can be done here?