Tuning Hadoop job execution on YARN - hadoop

A bit of intro - I'm learning about Hadoop. I have implemented a machine learning algorithm (clustering) on top of Hadoop and tested it only on a small example (30MB).
A couple of days ago I installed Ambari and created a small cluster of four machines (master and 3 workers). Master has Resource manager and NameNode.
Now I'm testing my algorithm by increasing the amount of data (300MB, 3GB). I'm looking for pointers on how to tune my mini-cluster. Concretely, I would like to know how to determine the MapReduce2 and YARN settings in Ambari.
How to determine min/max memory for container, reserved memory for container, Sort Allocation Memory, map memory and reduce memory?
The problem is that execution of my jobs is very slow on Hadoop (and clustering is an iterative algorithm, which makes things worse).
I have a feeling that my cluster setup is not good, because of the following reason:
I ran a job on a dataset of 30MB (I set the block size for this job to 8MB, since the data is small and the processing is intensive) - execution time 30 minutes
I ran the same job with the same dataset replicated 10 times - 300MB (same block size, 8MB) - execution time 2 hours
Now the same amount of data - 300MB, but block size 128MB - same execution time, maybe even a bit more than 2 hours
The block size on HDFS is 128MB, so I thought that this would give a speedup, but that is not the case. My suspicion is that the cluster setup (min/max RAM size, map and reduce RAM) is not good, hence performance cannot improve even though greater data locality is achieved.
Could this be the consequence of a bad setup, or am I wrong?

Please set the properties below in the YARN configuration to allocate 33% of the maximum YARN memory per job; the percentage can be altered based on your requirements.
If you need further info on this, please refer to the following link: https://analyticsanvil.wordpress.com/2015/08/16/managing-yarn-memory-with-multiple-hive-users/


Spark Executors hang after Out of Memory

I have a spark application running on EMR (16 nodes, 1 master, 15 core, r3.2xlarge instances). For spark executor configuration, we use dynamic Allocation.
While loading the data into the RDD, I see that sometimes, when there's a huge amount of data (700 GB), Spark runs out of memory but does not fail the app. Instead, the app just sits there hung. I'm not sure why this happens, but here is my theory:
We use dataframes which might be caching things.
The spark flag spark.dynamicAllocation.cachedExecutorIdleTimeout is set to infinity
My theory is that it might be caching things while creating dataframes but the cache is never relinquished and this leads to a Spark hang.
There are two solutions
Increase cluster size (worst case)
Figure out a way to add a timeout to the Spark app.
Programmatically kill the EMR step (could not find an API which does this)
Any leads on how to go about it?
There could be two other possibilities. Either the partitions are too big, or you have severe skew (the size of the partitions varies a lot).
Try to increase the number of partitions (and hence reduce their size) using repartition. This will randomly reshuffle the data across your executors (good for reducing skew, but slow). Ideally, I like my partitions to be around 64 MB, depending on your machines.
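
To make the repartition advice above a bit more concrete, here is a minimal sketch of how a partition count could be derived from a target partition size. The helper name partitions_for is hypothetical, and the sketch assumes the data is spread evenly, which is exactly what skewed data is not, so treat the result as a starting point only.

```python
GB = 1024 ** 3
MB = 1024 ** 2


def partitions_for(total_bytes: int, target_partition_bytes: int = 64 * MB) -> int:
    """Return how many partitions keep each one at or below the target size.

    Hypothetical helper; assumes the data is evenly distributed.
    """
    if total_bytes <= 0:
        return 1
    # Integer ceiling division avoids floating point rounding.
    return max(1, -(-total_bytes // target_partition_bytes))


# Illustration with the 700 GB figure from the question and a 64 MB target.
n = partitions_for(700 * GB)
print(n)
# In PySpark the result would be passed to something like df.repartition(n).
```

Since repartition triggers a full shuffle, it is probably worth doing once, right after loading, rather than repeatedly.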

Shuffle phase lasts too long Hadoop

I have an MR job in which the shuffle phase lasts too long.
At first I thought it was because I was emitting a lot of data from the Mapper (around 5GB). I then addressed that by adding a Combiner, thus emitting less data to the Reducer. After that, the shuffle period did not shorten as I thought it would.
My next idea was to eliminate the Combiner by combining in the Mapper itself. I got that idea from here, where it says that data needs to be serialized/deserialized to use a Combiner. Unfortunately, the shuffle phase is still the same.
My only thought is that it could be because I'm using a single Reducer. But this shouldn't be the case, since I'm not emitting a lot of data when using a Combiner or combining in the Mapper.
Here are my stats:
Here are all the counters for my Hadoop (YARN) job:
I should also add that this is run on a small cluster of 4 machines. Each has 8GB of RAM (2GB reserved) and number of virtual cores is 12 (2 reserved).
These are virtual machines. At first they were all on a single unit, but then I separated them 2-2 on two units. So they were sharing HDD at first, now there are two machines per disk. Between them is a gigabit network.
And here are more stats:
Whole memory is occupied
CPU is constantly under pressure while the job is run (the picture shows CPU for two consecutive runs of same job)
My question is - why is the shuffle time so long and how can I fix it? I also don't understand why there was no speedup even though I have dramatically reduced the amount of data emitted from the Mapper.
A few observations:
For a 30-minute job, the GC time is too high (try reusing objects rather than creating a new one on each call to the map()/reduce() method).
The average map time is far too high at 16 minutes. What are you doing in your map?
YARN memory is at 99%, which indicates you are running too many services on your HDP cluster and the RAM is not sufficient to support that many services.
Please increase the YARN container memory; give it at least 1 GB.
This looks like a GC + overscheduled cluster problem.
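
As a rough illustration of the "overscheduled" point, the sketch below estimates how many containers fit on one node. It uses the figures from the question (8GB RAM with 2GB reserved, 12 virtual cores with 2 reserved) and the suggested 1 GB container size. The function name is hypothetical, and whether vcores limit scheduling at all depends on how the YARN scheduler is configured, so the CPU bound is an assumption.

```python
def containers_per_node(total_mb, reserved_mb, container_mb,
                        total_vcores, reserved_vcores, vcores_per_container=1):
    """Estimate concurrent containers on one node (hypothetical helper).

    Returns (memory_bound, cpu_bound, effective), where effective is the
    smaller of the two bounds.
    """
    memory_bound = (total_mb - reserved_mb) // container_mb
    cpu_bound = (total_vcores - reserved_vcores) // vcores_per_container
    return memory_bound, cpu_bound, min(memory_bound, cpu_bound)


# Node from the question: 8 GB RAM (2 GB reserved), 12 vcores (2 reserved).
print(containers_per_node(8192, 2048, 1024, 12, 2))
# Smaller containers allow more of them, at the cost of less heap per task.
print(containers_per_node(8192, 2048, 512, 12, 2))
```

With 1 GB containers, memory is the limiting resource on such a node, which is consistent with YARN memory sitting close to full while the job runs.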

Map Reduce Slot Definition

I am on my way to becoming a Cloudera Hadoop administrator. Since I started, I have been hearing a lot about computing slots per machine in a Hadoop cluster, such as defining the number of map slots and reduce slots.
I have searched the internet for a long time for a noob-friendly definition of a MapReduce slot but didn't find any.
I am really pissed off from going through PDFs explaining the configuration of MapReduce.
Please explain what exactly a computing slot means on a machine in a cluster.
In MapReduce v1, mapreduce.tasktracker.map.tasks.maximum and mapreduce.tasktracker.reduce.tasks.maximum in mapred-site.xml configure the number of map slots and reduce slots, respectively.
Starting with MapReduce v2 (YARN), the more generic term "containers" is used instead of slots. The number of containers is the maximum number of tasks that can run in parallel on a node, regardless of whether each one is a map task, a reduce task, or an application master.
Generally it depends on CPU and memory.
In our cluster, we set 20 map slots and 15 reduce slots for a machine with 32 cores and 64 GB of memory.
1. Approximately one slot needs one CPU core.
2. The number of map slots should be a little higher than the number of reduce slots.
In MRv1, each machine had a fixed number of slots dedicated to maps and reduces.
In general, each machine is configured with a 4:1 ratio of map slots to reduce slots.
Logically, one would read a lot of data (maps) and crunch it down to a small set (reduce).
In MRv2 the concept of containers came in, and any container can run a map task, a reduce task, or a shell script.
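
The difference between fixed slots and generic containers can be sketched with a toy model. Both functions below are hypothetical and ignore memory sizing entirely; they only count how many tasks could run at once. The 20/15 slot split is taken from the answer above, and treating 35 equally sized containers as the equivalent YARN capacity is an assumption made for the comparison.

```python
def mrv1_running(pending_maps, pending_reduces, map_slots, reduce_slots):
    """Tasks running at once when map and reduce slots are fixed (MRv1 style)."""
    return min(pending_maps, map_slots) + min(pending_reduces, reduce_slots)


def yarn_running(pending_maps, pending_reduces, containers):
    """Tasks running at once when any container can host any task (YARN style)."""
    return min(pending_maps + pending_reduces, containers)


# Map-heavy phase: 100 maps waiting, no reduces yet.
print(mrv1_running(100, 0, map_slots=20, reduce_slots=15))  # reduce slots sit idle
print(yarn_running(100, 0, containers=35))                  # all capacity usable
```

In the toy model the fixed-slot node leaves its reduce slots unused during the map phase, which is the inefficiency that containers were meant to remove.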
A bit late though, I'll answer anyways.
Computing slot. Can you think of all the various computations in Hadoop that would require some resource, i.e. memory, CPUs, or disk space?
Resource = Memory or CPU-Core or Disk Size required
Allocating resource to start a Container, allocating resource to perform a map or a reduce task etc.
It is all about how you would want to manage the resources you have in hand. Now what would that be? RAM, Cores, Disks Size.
Goal is to ensure your processing is not constrained by any one of these cluster resources. You want your processing to be as dynamic as possible.
As an example, Hadoop YARN allows you to configure the minimum RAM required to start a YARN container, the minimum RAM required to start a map/reduce task, the JVM heap size (for map and reduce tasks), and the amount of virtual memory each task gets.
Unlike in Hadoop MR1, you do not pre-configure resources (RAM size, for example) before you even begin executing MapReduce tasks. The idea is that you want your resource allocation to be as elastic as possible, i.e. to dynamically increase RAM/CPU cores for either a map or a reduce task.
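
The settings mentioned above relate to each other roughly as in the following sketch. The property names are real YARN/MapReduce properties, but the helper itself is hypothetical, the 0.8 heap fraction is only a commonly used rule of thumb, and 2.1 is the usual default for yarn.nodemanager.vmem-pmem-ratio; check the values against your own distribution.

```python
def map_task_memory_settings(container_mb, heap_fraction=0.8, vmem_pmem_ratio=2.1):
    """Derive related memory settings for a map task (hypothetical helper).

    heap_fraction is an assumed rule of thumb: the JVM heap is kept below
    the container size to leave room for non-heap memory.
    """
    heap_mb = int(container_mb * heap_fraction)
    return {
        "mapreduce.map.memory.mb": container_mb,
        "mapreduce.map.java.opts": f"-Xmx{heap_mb}m",
        # Virtual memory ceiling implied by yarn.nodemanager.vmem-pmem-ratio.
        "virtual_memory_limit_mb": int(container_mb * vmem_pmem_ratio),
    }


for key, value in map_task_memory_settings(1024).items():
    print(key, "=", value)
```

The same calculation would apply to reduce tasks with mapreduce.reduce.memory.mb and mapreduce.reduce.java.opts.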

Benchmarking Hadoop on EC2 gives identical performances

I am trying to benchmark Hadoop on EC2. I am using a 1GB file with 1 master and 5 slaves. I varied dfs.blocksize (1m, 64m, 128m, 500m) and expected the best performance at 128m, since the file size is 1GB and there are 5 slaves. But to my surprise, irrespective of the block size, the time taken falls more or less within the same range. Why am I seeing this weird performance?
A couple of things to think about, most likely explanation first:
Check that you are correctly passing in the system variables that control the split size of the job. If you don't change this, you won't alter the number of mappers (which you can check in the jobtracker UI). If you get the same number of mappers each time, you're not actually changing anything. To change the split size, use the system props mapred.min.split.size and mapred.max.split.size
Make sure you are really hitting the cluster and not accidentally running locally with 1 process
Be aware that (unlike Spark) Hadoop has a horrifying job initialization time. IME it's around 20 seconds, so for only 1 GB of data you're not really seeing much time difference, as the majority of the job is spent in initialization.
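
The first point, about split size determining the number of mappers, can be sketched as follows. This mirrors the rule used by Hadoop's FileInputFormat as far as I know it (split size is the block size clamped between the min and max split sizes, with a 10% slop on the last split), but it is a simplified model for a single file, and the function names are my own.

```python
MB = 1024 ** 2
GB = 1024 ** 3


def split_size(block_size, min_split=1, max_split=2 ** 63 - 1):
    """Clamp the block size between the min and max split sizes."""
    return max(min_split, min(max_split, block_size))


def count_splits(file_size, size, slop=1.1):
    """Count the splits for one file, letting the last split be up to 10% larger."""
    splits, remaining = 0, file_size
    while remaining / size > slop:
        splits += 1
        remaining -= size
    if remaining > 0:
        splits += 1
    return splits


# 1 GB file with the block sizes tried in the question.
for block_mb in (1, 64, 128, 500):
    print(block_mb, "MB ->", count_splits(1 * GB, split_size(block_mb * MB)), "mappers")
```

If the number of mappers reported for the job does not change along these lines when the block size changes, the setting is probably not reaching the job.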

How to decrease number of map sweeps in a job (without changing data chunk size)?

The gist of my problem is: how does one decrease the number of map sweeps a job may need? The number of map tasks for a job is data_size/HDFS_BLOCK_SIZE. The number of sweeps it takes to complete them depends on how many map slots we have. Assuming I am running nothing else and just one job, I find that the per-node CPU utilization is low (implying I could actually run more map tasks per node). I played with the mapred.tasktracker.map.tasks.maximum parameter (for example, each of my nodes has 32 processors and I set it as high as 30), but I could never increase the number of map slots, and the overall CPU utilization is 60% or so. Are there any other parameters to play with? The data size I have is large enough (32GB, 8-node cluster, each node with 32 CPUs) and it does take two map sweeps (the first sweep does maps 1-130 and the second sweep completes the rest).
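
The relationship described above between data size, block size, map slots and sweeps can be written down as a small calculation. The function names are hypothetical, and the 128 MB block size is an assumption (the question does not state it); the 130 concurrent maps come from the first sweep mentioned above.

```python
MB = 1024 ** 2
GB = 1024 ** 3


def map_tasks(data_size, block_size):
    """Number of map tasks, one per block (ceiling division)."""
    return -(-data_size // block_size)


def map_sweeps(tasks, concurrent_slots):
    """Number of sweeps (waves) needed to run all map tasks."""
    return -(-tasks // concurrent_slots)


tasks = map_tasks(32 * GB, 128 * MB)   # block size is an assumption
print(tasks, "map tasks")
print(map_sweeps(tasks, 130), "sweeps with 130 concurrent maps")
print(map_sweeps(tasks, 8 * 32), "sweeps with 8 nodes x 32 slots")
```

Under these assumptions, the job drops to a single sweep only once the cluster-wide slot count reaches the number of map tasks.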
In case no one has told you yet:
MapReduce is mainly IO bound: it has to read a lot of data from disk, write it back, read it and write it again. In between the reads and writes it executes your map and reduce logic.
So, from what I have heard, what lifts the CPU usage is making the cluster no longer IO bound:
RAID-0 or RAID-10 your hard disks, and get the fastest hard disks out there. In the consumer market there are the Western Digital VelociRaptors with 10k RPM.
SSDs don't contribute too much, since Hadoop is mostly optimized for sequential reads.
Give it as much network bandwidth as possible.
Lots of RAM for disk caching.
Even then, you should expect <100% CPU utilization, but it will be much better and the performance will skyrocket.
However, CPU utilization is not a good metric for a Hadoop cluster, as you might conclude from the points above.
Hadoop is mainly about the reliable storage of data, with neat features to crunch it. It is not about giving you supercomputer performance; if you need that, get an MPI cluster and a Ph.D. to code your algorithms ;)
Sorry for the thrash - something must have gone wrong with my installation. I happened to reinstall Hadoop and it now works as expected. I guess some parameter must have been conflicting.
