Time to complete Map and Reduce Tasks in Hadoop

Time to complete Map and Reduce Tasks in Hadoop - hadoop

I would like
to know the real meaning of these two counters Total time spent by all
maps in occupied slots (ms) and Total time spent by all reduces in
occupied slots (ms). I just wrote MR program similar to word count
I got
**Total time spent by all maps in occupied slots (ms)=15667400
Total time spent by all reduces in occupied slots (ms)=158952
CPU time spent (ms)=51930
real 7m38.886s**
Why is it so?????? The first counter is having a very very high value
which is actually incomparable with the other three. Kindly clear this
to me.
Thank You
With Regards

Probably need some more context around your input data but the first two counters show how much time was spent across all map and reduce tasks. This number is larger than everything else as you probably have a multi-node hadoop cluster and a large input dataset - meaning you have lots of map tasks running in parallel. Say you have 1000 map tasks running in parallel and each takes 10 seconds to complete - in this case the total time across all mappers would be 1000*10, 10000 secs. In reality the map phase may only take 10-30 seconds to complete in parallel, but if you were to run them in serial they would take 10000 secs to complete with a single node, single map slot cluster.
The CPU time spent refers to the how much of the total time was pure CPU processing - this is smaller than the others as your job is mostly IO bound (reading from and writing to disk, or across the network).

Related

Round Robin Memory Scheduling with CPU & Memory Visualisations

For a Round Robin implementation, I have 5 processes with their arrival & duration times and the memory needed to be processed, as shown below.
5 Processes accessing the CPU
The total memory of the system is 512K and the time quantum used is 3. Based on the Round Robin theory I created the following Gantt graph.
Gantt Graph creation
I have to represent on the following table, the visualisation of the memory and the CPU until time interval t=10 by showcasing the processes that are on CPU queue (which I did on the graph), which parts of the memory have been occupied by the processes and which are free, by using i) a system with variable diviations without compaction and ii) with compaction.
Table of Results to be created
I suppose that I have to adjust the memory usage of each process accordingly with the quantum time given at 3. For example, for process P1 the duration is equal to the quantum time and thus, the whole 85K of it will be used. If I that assumption is correct, the system I am using runs without compaction? How can I proceed the next steps with compaction?
Thank you in advance

How long does it take to process the file If I have only one worker node?

Let's say I have a data with 25 blocks and the replication factor is 1. The mapper requires about 5 mins to read and process a single block of the data. Then how can I calculate the time for one worker node? The what about 15 nodes? Will the time be changed if we change the replication factor to 3?
I really need a help.

First of all I would advice reading some scientific papers regarding the issue (Google Scholar is a good starting point).
Now a bit of discussion. From my latest experiments I have concluded that processing time has very strong relation with amount of data you want to process (makes sense). On our cluster, on average it takes around 7-8 seconds for Mapper to read a block of 128MBytes. Now there are several factors which you need to consider in order to predict the overall execution time:
How much data the Mapper produces, which will determine moreless the time Hadoop requires to execute Shuffling
What Reducer is doing? Does it do some iterative processing? (might be slow!)
What is the configuration of the resources? (how many Mappers and Reducers are allowed to run on the same machine)
Finally are there other jobs running simultaneously? (this might be slowing down the jobs significantly, since your Reducer slots can be occupied waiting for data instead of doing useful things).
So already for one machine you are seeing the complexity of the task of predicting the time of job execution. Basically during my study I was able to conclude that in average one machine is capable of processing from 20-50 MBytes/second (the rate is calculated according to the following formula: total input size/total job running time). The processing rate includes the staging time (when your application is starting and uploading required files to the cluster for example). The processing rate is different for different use cases and greatly influenced by the input size and more importantly the amount of data produced by Mappers (once again this values are for our infrastructure and on different machine configuration you will be seeing completely different execution times).
When you start scaling your experiments, you would see in average improved performance, but once again from my study I could conclude that it is not linear and you would need to fit by yourself, for your own infrastructure the model with respective variables which would approximate the job execution time.
Just to give you an idea, I will share some part of the results. The rate when executing determine use case on 1 node was ~46MBytes/second, for 2 nodes it was ~73MBytes/second and for 3 nodes it was ~85MBytes/second (in my case the replication factor was equal to the number of nodes).
The problem is complex requires time, patience and some analytical skills to solve it. Have fun!

Individual Spark Task consume more time on computation if more cores are assigned

I am running a spark job with input file of size 6.6G (hdfs) with master as local. My Spark Job with 53 partitions completed quickly when I assign local[6] than local[2], however the individual task takes more computation time when number of cores are more. Say if I assign 1 core(local[1]) then each task takes 3 secs where the same goes up to 12 seconds if I assign 6 cores (local[6]). Where the time gets wasted? The spark UI shows increase in computation time for each task in local[6] case, I couldn't understand the reason why the same code takes different computation time when more cores are assigned.
Update:
I could see more %iowait in iostat output if I use local[6] than local[1]. Please let me know this is the only reason or any possible reasons. I wonder why this iowait is not reported in sparkUI. I see the increase in computing time than iowait time.

I am assuming you are referring to spark.task.cpus and not spark.cores.max
With spark.tasks.cpus each task get assigned more cores, but it doesn't necessarily have to use them. If you process is single threaded it really can't use them. You wind up with additional overhead without additional benefit and those cores are taken away from other single threaded tasks that can use them.
With spark.cores.max it is simply and overhead issue with transferring data around at the same time.

Measuring Execution Time in Hadoop

I would like to discuss the total resource usage of my MapReduce jobs and wonder how I could do this. Hadoop provides the CPU millis metric for jobs, but I wonder if this is a good idea.
As an alternative, I could add up different times: Map time, Sort time, Merge time and Reduce time, but this gives a number that is different from the total CPU time I mentioned above. I do not know why this is the case, I suspect that it is caused by "IO-wait" time, i.e., waiting for resources (such as disk) to become available.
Finally, in papers, I often find references to the "execution time". Is this the total time (sum of all resource usage), or is this the elapsed time (end-start)?
This leads to the following two questions:
what is the definition of execution time? is this "elapsed time", or is this total time?
how can I best represent the resource usage in Hadoop, is CPU millis adequate?

UDF optimization in Hadoop

I testing my UDF on Windows virtual machine with 8 cores and 8 GB RAM. I have created 5 files of 2 GB about and run the pig script after modifying "mapred.tasktracker.map.tasks.maximum".
The following runtime and statistics:
mapred.tasktracker.map.tasks.maximum = 2
duration = 20 min 54 sec
mapred.tasktracker.map.tasks.maximum = 4
duration = 13 min 38 sec and about 30 sec for task
35% better
mapred.tasktracker.map.tasks.maximum = 8
duration = 12 min 44 sec and about 1 min for task
only 7% better
Why such a small improvement when changing settings? any ideas? Job was divided into 145 tasks.
![4 slots][1]
![8 slots][2]

Couple of observations:
I imagine your windows machine only has a single disk backing this VM - so there is a limit to how much data you can read off disk at any one time (and write back for the spills). By increasing the task slots, your effectively driving up the read / write demands on your disk (and a more disk thrashing too potentially). If you have multiple disks backing your VM (and not virtual disks all on the same physical disk, i mean virtual disks backed by different physical disks), you would probably see a performance increase over what you've already seen.
By adding more map slots, you've reduced the amount of assignment waves that the Job Tracker needs to do - and each wave has a polling overhead (TT polling the jobs, JT polling the TTs and assigning new tasks to free slots). A 2 slot TT vs 8 slot TT will mean that you have 145/2=~73 assignment waves (if all tasks ran in equal time - obviously not realistic) vs 145/8=~19 waves - thats a ~3x increase in the amount of polling needed to be done (and it all adds up).

mapred.tasktracker.map.tasks.maximum configures the maximum number of map tasks that will be run simultaneously by a task tracker. There is a practical hardware limit to how many tasks a single node can run at a time. So there will be diminishing returns when you keep increasing this number.
For example, say the tasktracker node has 8 cores. Say 4 cores are being used by processes other than the tasktracker. That leaves 4 cores for the mapred tasks. So your task time will improve from mapred.tasktracker.map.tasks.maximum = 1 to 4, but after that, it would just remain static because the other tasks will just be waiting. In fact, if you increase it too much, the contention and context switching might make it slower. The recommended value for this parameter is the No. of CPU cores - 1

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio