I would like to measure the total resource usage of my MapReduce jobs and wonder how best to do this. Hadoop provides the CPU millis metric for jobs, but I am not sure whether this is a good measure.
As an alternative, I could add up the different phase times: Map time, Sort time, Merge time and Reduce time. However, this gives a number that differs from the total CPU time mentioned above. I do not know why this is the case; I suspect it is caused by "IO-wait" time, i.e., time spent waiting for resources (such as disk) to become available.
Finally, in papers, I often find references to the "execution time". Is this the total time (sum of all resource usage), or is this the elapsed time (end-start)?
This leads to the following two questions:
What is the definition of execution time? Is this "elapsed time", or total time?
How can I best represent resource usage in Hadoop? Is CPU millis adequate?
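For reference, a minimal sketch of how these counters can be read from a completed job (assuming the new org.apache.hadoop.mapreduce API and a Hadoop 2.x counter set; JobTimeReport is just an illustrative name):

    import org.apache.hadoop.mapreduce.Counters;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.JobCounter;
    import org.apache.hadoop.mapreduce.TaskCounter;

    public class JobTimeReport {
        // Prints the metrics discussed above for a completed Job handle.
        public static void report(Job job) throws Exception {
            Counters counters = job.getCounters();

            // Pure CPU time accumulated by all tasks ("CPU time spent (ms)").
            long cpuMillis = counters.findCounter(TaskCounter.CPU_MILLISECONDS).getValue();

            // Wall-clock time summed over all map / reduce tasks.
            long mapMillis = counters.findCounter(JobCounter.MILLIS_MAPS).getValue();
            long reduceMillis = counters.findCounter(JobCounter.MILLIS_REDUCES).getValue();

            System.out.printf("CPU: %d ms, all maps: %d ms, all reduces: %d ms%n",
                    cpuMillis, mapMillis, reduceMillis);
        }
    }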
I am seeing that, as the size of the input file increases, failed shuffles increase and job completion time increases non-linearly.
e.g.
75GB took 1h
86GB took 5h
I also see average shuffle time increase 10-fold.
e.g.
75GB 4min
85GB 41min
Can someone point me in a direction to debug this?
Once you are sure your algorithms are correct, automatic hard-disk volume partitioning or fragmentation problems may be occurring somewhere past that 75GB threshold, since you are probably using the same filesystem for caching the intermediate results.
Spark's Timeline contains:
Scheduler Delay
Task Deserialization Time
Shuffle Read Time
Executor Computing Time
Shuffle Write Time
Result Serialization Time
Getting Result Time
It seems that the time cost of reading data from sources such as HDFS is included in Executor Computing Time, but I am not sure.
If it is included in Executor Computing Time, how can I measure it without also including the time spent on computation?
Thanks.
It's hard to properly measure how long a read operation takes on its own, since processing is done on the data while it is being read.
A simple best bet is just to apply a trivial operation (say, count) that has very little overhead. If your file is sizable, the read will vastly dominate the trivial operation, especially if it's one like count that can be done without shuffling data between nodes (aside from the single-value result).
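For example, a minimal sketch of this count-based approximation (the path and session setup are placeholders, using the standard Spark Java API):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.SparkSession;

    public class ReadTimeEstimate {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("read-time-estimate")
                    .getOrCreate();

            long start = System.nanoTime();

            // count() forces a full read but adds almost no computation or shuffle,
            // so for a sizable file the elapsed time is dominated by the read itself.
            Dataset<String> lines = spark.read().textFile("hdfs:///path/to/input"); // placeholder path
            long n = lines.count();

            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println("Read ~" + n + " lines in " + elapsedMs + " ms");

            spark.stop();
        }
    }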
Let's say I have data with 25 blocks, and the replication factor is 1. The mapper requires about 5 minutes to read and process a single block of the data. How can I calculate the time for one worker node? And what about 15 nodes? Will the time change if we change the replication factor to 3?
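My naive back-of-the-envelope attempt (which I am not sure is right, and which ignores scheduling, shuffling and other overheads) would be something like:

    public class NaiveEstimate {
        public static void main(String[] args) {
            int blocks = 25;
            double minutesPerBlock = 5.0;

            // One node working through the blocks one after another (one map slot assumed).
            double oneNode = blocks * minutesPerBlock;                                  // 125 minutes

            // 15 nodes: blocks are processed in "waves" of up to 15 at a time.
            int nodes = 15;
            double fifteenNodes = Math.ceil((double) blocks / nodes) * minutesPerBlock; // 2 waves -> 10 minutes

            System.out.println("1 node:   ~" + oneNode + " minutes");
            System.out.println("15 nodes: ~" + fifteenNodes + " minutes");
        }
    }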
I really need some help.
First of all, I would advise reading some scientific papers on the issue (Google Scholar is a good starting point).
Now a bit of discussion. From my latest experiments I have concluded that processing time has a very strong relation to the amount of data you want to process (which makes sense). On our cluster, it takes on average around 7-8 seconds for a Mapper to read a block of 128 MB. There are several factors which you need to consider in order to predict the overall execution time:
How much data the Mapper produces, which more or less determines the time Hadoop requires for shuffling
What the Reducer is doing: does it do some iterative processing? (this might be slow!)
What the resource configuration is (how many Mappers and Reducers are allowed to run on the same machine)
Finally, whether there are other jobs running simultaneously (this might slow down your job significantly, since Reducer slots can sit occupied waiting for data instead of doing useful work)
So even for one machine you can already see the complexity of predicting job execution time. Basically, during my study I was able to conclude that on average one machine is capable of processing 20-50 MB/second (the rate is calculated with the formula: total input size / total job running time). The processing rate includes the staging time (for example, while your application is starting up and uploading the required files to the cluster). The processing rate differs between use cases and is greatly influenced by the input size and, more importantly, by the amount of data produced by the Mappers (once again, these values are for our infrastructure; on a different machine configuration you will see completely different execution times).
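Purely as an illustration of that formula (the throughput and input size below are placeholders, and the linear scaling assumption does not hold in practice, as noted next):

    public class JobTimeEstimate {
        public static void main(String[] args) {
            // Placeholder figures: measured throughput per node and the input to process.
            double mbPerSecondPerNode = 35.0;   // somewhere in the observed 20-50 MB/s range
            double inputSizeMb = 75 * 1024;     // e.g. a 75 GB input
            int nodes = 3;

            // Naive estimate assuming throughput scales linearly with nodes (in reality it does not).
            double estimatedSeconds = inputSizeMb / (mbPerSecondPerNode * nodes);
            System.out.printf("Rough estimate: %.0f seconds (~%.1f minutes)%n",
                    estimatedSeconds, estimatedSeconds / 60.0);
        }
    }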
When you start scaling your experiments, you will on average see improved performance, but once again, from my study I could conclude that the scaling is not linear and you would need to fit, for your own infrastructure, a model with the respective variables that approximates the job execution time.
Just to give you an idea, I will share some of the results. The rate when executing a particular use case on 1 node was ~46 MB/second, for 2 nodes it was ~73 MB/second and for 3 nodes it was ~85 MB/second (in my case the replication factor was equal to the number of nodes).
The problem is complex and requires time, patience and some analytical skill to solve. Have fun!
I am running a Spark job with an input file of size 6.6 GB (on HDFS) with the master set to local. My Spark job with 53 partitions completes more quickly when I assign local[6] than local[2]; however, each individual task takes more computation time when more cores are assigned. Say I assign 1 core (local[1]): each task takes 3 seconds, whereas the same task goes up to 12 seconds if I assign 6 cores (local[6]). Where is the time being wasted? The Spark UI shows an increase in computation time for each task in the local[6] case, and I cannot understand why the same code takes different computation time when more cores are assigned.
Update:
I can see more %iowait in the iostat output when I use local[6] than with local[1]. Please let me know whether this is the only reason or whether there are other possible reasons. I also wonder why this iowait is not reported in the Spark UI. The increase in computing time I see is larger than the iowait time.
I am assuming you are referring to spark.task.cpus and not spark.cores.max
With spark.task.cpus each task gets assigned more cores, but it doesn't necessarily have to use them. If your process is single-threaded it really can't use them. You wind up with additional overhead without additional benefit, and those cores are taken away from other single-threaded tasks that could use them.
With spark.cores.max it is simply an overhead issue from transferring data around at the same time.
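For reference, a minimal sketch of where these kinds of settings live (the values are placeholders; local[6] mirrors the question above, and spark.cores.max only applies to standalone/Mesos clusters, not local mode):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class CoreSettings {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                    .setAppName("core-settings-sketch")
                    // local[6]: run locally with 6 worker threads, as in the question above.
                    .setMaster("local[6]")
                    // Cores reserved per task; >1 only helps genuinely multi-threaded tasks.
                    .set("spark.task.cpus", "1");

            JavaSparkContext sc = new JavaSparkContext(conf);
            System.out.println("defaultParallelism = " + sc.defaultParallelism());
            sc.stop();
        }
    }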
I would like to know the real meaning of these two counters: Total time spent by all maps in occupied slots (ms) and Total time spent by all reduces in occupied slots (ms). I just wrote an MR program similar to word count, and I got:
Total time spent by all maps in occupied slots (ms)=15667400
Total time spent by all reduces in occupied slots (ms)=158952
CPU time spent (ms)=51930
real 7m38.886s
Why is this so? The first counter has a very high value, which is not even comparable with the other three. Kindly clear this up for me.
Thank you
With regards
We would probably need some more context around your input data, but the first two counters show how much time was spent across all map and reduce tasks. This number is larger than everything else because you probably have a multi-node Hadoop cluster and a large input dataset, meaning you have lots of map tasks running in parallel. Say you have 1000 map tasks running in parallel and each takes 10 seconds to complete: the total time across all mappers would be 1000 * 10 = 10,000 seconds. In reality the map phase may only take 10-30 seconds of wall-clock time when run in parallel, but if you were to run the tasks serially on a single-node, single-map-slot cluster they would take 10,000 seconds to complete.
The CPU time spent counter refers to how much of that total time was pure CPU processing; it is smaller than the others because your job is mostly IO-bound (reading from and writing to disk, or across the network).
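To make the relationship concrete, here is a small sketch using the counter values from the question (the "average parallelism" figure is just a derived ratio under this interpretation, not a Hadoop counter):

    public class CounterRatios {
        public static void main(String[] args) {
            // Values taken from the question above.
            long slotMillisMaps = 15_667_400L;     // Total time spent by all maps in occupied slots (ms)
            long slotMillisReduces = 158_952L;     // Total time spent by all reduces in occupied slots (ms)
            long cpuMillis = 51_930L;              // CPU time spent (ms)
            long wallClockMillis = 7 * 60_000L + 38_886L;  // real 7m38.886s

            // Rough average number of map slots busy at once over the life of the job.
            double avgMapParallelism = (double) slotMillisMaps / wallClockMillis;
            // Fraction of all occupied-slot time that was pure CPU work.
            double cpuFraction = (double) cpuMillis / (slotMillisMaps + slotMillisReduces);

            System.out.printf("~%.0f map slots busy on average; CPU was %.2f%% of occupied-slot time%n",
                    avgMapParallelism, cpuFraction * 100);
        }
    }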