Issue allocating resources (Number of Mappers) in EMR MapReduce2 YARN - hadoop

I have a very small new EMR cluster to play around with and I'm trying to limit the number of concurrent mappers per node to 2. I tried this by tweaking the default cpu-vcores down to 2.
Formula used:
min((yarn.nodemanager.resource.memory-mb / mapreduce.map.memory.mb),
(yarn.nodemanager.resource.cpu-vcores / mapreduce.map.cpu.vcores))
Cluster configuration:
AMI version: 3.3.1
Hadoop distribution: Amazon 2.4.0
Core: 4 m1.large
Job Configuration:
yarn.nodemanager.resource.memory-mb: 5120
mapreduce.map.memory.mb: 768
yarn.nodemanager.resource.cpu-vcores: 2
mapreduce.map.cpu.vcores: 1
As a result, I am currently seeing 22 mappers running at the same time. Besides being wrong according to the formula, this does not make sense at all given that I have 4 cores. Any thoughts?

I have never seen the second part of the formula (the one with vcores) take effect on the small dedicated clusters I have worked on (although it should have, according to the formula). I have also read that YARN does not take CPU cores into account when allocating resources (i.e. it allocates based on memory requirements only).
As for the memory calculation, yarn.nodemanager.resource.memory-mb is a per-node setting, but dashboards often give you cluster-wide numbers, so before you divide yarn.nodemanager.resource.memory-mb by mapreduce.map.memory.mb, multiply it by the number of nodes in your cluster, i.e.
(yarn.nodemanager.resource.memory-mb*number_of_nodes_in_cluster) / mapreduce.map.memory.mb
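As a concrete sanity check, here is that arithmetic with the settings from the question, assuming the stated 4 core nodes (a minimal Python sketch; the inputs are taken straight from the job configuration above):

node_memory_mb = 5120    # yarn.nodemanager.resource.memory-mb (per node)
map_memory_mb = 768      # mapreduce.map.memory.mb
node_vcores = 2          # yarn.nodemanager.resource.cpu-vcores (per node)
map_vcores = 1           # mapreduce.map.cpu.vcores
nodes = 4                # core nodes in the cluster

with_vcores = min(node_memory_mb // map_memory_mb, node_vcores // map_vcores) * nodes
memory_only = (node_memory_mb // map_memory_mb) * nodes
print(with_vcores)   # 8  -- what the formula predicts if vcores were enforced
print(memory_only)   # 24 -- what you get if only memory is considered

The observed 22 concurrent mappers is much closer to the memory-only figure (less a container or two for the ApplicationMaster), which fits the point above that the vcore term often never kicks in.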

Related

Spark streaming application configuration with YARN

I'm trying to squeeze every single bit from my cluster when configuring the Spark application, but it seems I'm not understanding everything completely right. I'm running the application on an AWS EMR cluster with 1 master and 2 core nodes of type m3.xlarge (15 GB RAM and 4 vCPUs for each node). This means that by default 11.25 GB are reserved on every node for applications scheduled by YARN. The master node is used only by the resource manager (YARN), which means the remaining 2 core nodes are used to schedule applications (so we have 22.5 GB for that purpose). So far so good. But here comes the part which I don't get. I'm starting the Spark application with the following parameters:
--driver-memory 4G --num-executors 4 --executor-cores 7 --executor-memory 4G
What this means, as I understand it (from the information I found), is that 4 GB will be allocated for the driver and 4 executors will be launched with 4 GB each. So a rough estimate makes it 5*4 = 20 GB (let's make it 21 GB with the expected memory reserves), which should be fine, as we have 22.5 GB for applications. Here's a screenshot from the Hadoop YARN UI after the launch:
What we can see is that 17.63 GB are used by the application, but this is a little bit less than the expected ~21 GB, and this triggers the first question: what happened here?
Then I go to the Spark UI's executors page. Here comes the bigger question:
There are 3 executors (not 4), and the memory allocated for them and the driver is 2.1 GB (not the specified 4 GB). So Hadoop YARN says 17.63 GB are used, but Spark says 8.4 GB are allocated. What is happening here? Is this related to the Capacity Scheduler (from the documentation I couldn't come to this conclusion)?
Can you check whether spark.dynamicAllocation.enabled is turned on? If that is the case, then your Spark application may give resources back to the cluster if they are no longer used. The minimum number of executors to be launched at startup is decided by spark.executor.instances.
If that is not the case, what is the source for your Spark application, and what partition size is set for it? Spark literally maps the number of partitions to the Spark cores: if your source has only 10 partitions and you try to allocate 15 cores, it will only use 10 cores, because that is what is needed. I guess this might be the cause of Spark launching 3 executors instead of 4. Regarding memory, I would recommend revisiting it, because you are asking for 4 executors and 1 driver with 4 GB each, which is roughly 5*4 GB + 5*384 MB ≈ 22 GB; you are trying to use up everything, and not much is left for your OS and NodeManager to run, which would not be the ideal way to do it.
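To make that arithmetic concrete, here is a back-of-the-envelope sketch of what YARN reserves per container. The flat 384 MB overhead follows the answer above, and the yarn.scheduler.minimum-allocation-mb value used for rounding is only an assumed example, not an EMR default:

import math

def container_mb(requested_mb, overhead_mb=384, min_alloc_mb=32):
    # Memory YARN reserves for one driver/executor container:
    # requested memory + overhead, rounded up to the minimum allocation.
    return math.ceil((requested_mb + overhead_mb) / min_alloc_mb) * min_alloc_mb

per_container = container_mb(4096)       # --driver-memory / --executor-memory 4G
print(per_container / 1024)              # ~4.4 GB reserved per container
print(5 * per_container / 1024)          # ~21.9 GB if the driver and all 4 executors ran
print(4 * per_container / 1024)          # ~17.5 GB for the driver plus the 3 executors
                                         # that actually started -- in the same ballpark
                                         # as the 17.63 GB shown in the YARN UI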

hadoop not creating enough containers when more nodes are used

So I'm trying to run some Hadoop jobs on AWS r3.4xlarge machines. They have 16 vcores and 122 GB of RAM available.
Each of my mappers requires about 8 GB of RAM and one thread, so these machines are very nearly perfect for the job.
I have mapreduce.map.memory.mb set to 8192,
and mapreduce.map.java.opts set to -Xmx6144m.
This should result in approximately 14 mappers (in practice nearer to 12) running on each machine.
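A rough check of that per-node estimate (the 112 GB figure below is an assumption for how much of the 122 GB ends up as yarn.nodemanager.resource.memory-mb; the question does not state the actual value):

yarn_node_memory_mb = 112 * 1024              # assumed yarn.nodemanager.resource.memory-mb
map_memory_mb = 8192                          # mapreduce.map.memory.mb
print(yarn_node_memory_mb // map_memory_mb)   # 14 mappers per node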
This is in fact the case for a 2-slave setup, where the scheduler shows 90 percent utilization of the cluster.
When scaling to, say, 4 slaves, however, it seems that Hadoop simply doesn't create more mappers. In fact, it creates FEWER.
On my 2-slave setup I had just under 30 mappers running at any one time; on four slaves I had about 20. The machines were sitting at just under 50 percent utilization.
The vcores are there, the physical memory is there. What the heck is missing? Why is Hadoop not creating more containers?
So it turns out that this is one of those Hadoop things that never makes sense, no matter how hard you try to figure it out.
There is a setting in yarn-default.xml called yarn.nodemanager.heartbeat.interval-ms.
It is set to 1000 by default. Apparently it controls the minimum period, in milliseconds, between container assignments.
This means only one new map task is created per second, so the number of running containers is limited to roughly (containers assigned per second) * (the time it takes a container to finish).
By setting this value to 50, or better yet 1, I was able to get the kind of scaling that is expected from a Hadoop cluster. Honestly, this should be documented better.
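A small illustration of that reasoning; the one-container-per-heartbeat behaviour is as described above, and the 30-second average task duration is just a guess:

def steady_state_containers(heartbeat_ms, avg_task_seconds):
    # One new container roughly every heartbeat interval caps concurrency at
    # (assignment rate) x (average task duration), no matter how many nodes you add.
    assignments_per_second = 1000.0 / heartbeat_ms
    return assignments_per_second * avg_task_seconds

print(steady_state_containers(1000, 30))   # ~30 -- about what the 2-slave setup showed
print(steady_state_containers(50, 30))     # ~600 -- the heartbeat is no longer the bottleneck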

Tuning Hadoop job execution on YARN

A bit of intro - I'm learning about Hadoop. I have implemented a machine learning algorithm (clustering) on top of Hadoop and tested it only on a small example (30MB).
A couple of days ago I installed Ambari and created a small cluster of four machines (a master and 3 workers). The master runs the ResourceManager and NameNode.
Now I'm testing my algorithm by increasing the amount of data (300MB, 3GB). I'm looking for pointers on how to tune my mini-cluster. Concretely, I would like to know how to determine the MapReduce2 and YARN settings in Ambari.
How do I determine the min/max memory for a container, the reserved memory for containers, the sort allocation memory, and the map and reduce memory?
The problem is that the execution of my jobs is very slow on Hadoop (and clustering is an iterative algorithm, which makes things worse).
I have a feeling that my cluster setup is not good, because of the following reason:
I run a job on a 30MB dataset (I set the block size for this job to 8MB, since the data is small and the processing is intensive) - execution time 30 minutes.
I run the same job, but with the same dataset multiplied 10 times - 300MB (same block size, 8MB) - execution time 2 hours.
Now the same amount of data - 300MB, but with a 128MB block size - the same execution time, maybe even a bit more than 2 hours.
The default block size on HDFS is 128MB, so I thought this would cause a speedup, but that is not the case. My suspicion is that the cluster setup (min/max RAM size, map and reduce RAM) is not good, and hence it cannot improve even though greater data locality is achieved.
Could this be the consequence of a bad setup, or am I wrong?
Please set the property below in the YARN configuration to allocate 33% of the maximum YARN memory per job; this can be altered based on your requirements.
yarn.scheduler.capacity.root.default.user-limit-factor=0.33 (the default is 1)
If you need further info on this, please refer to the following link: https://analyticsanvil.wordpress.com/2015/08/16/managing-yarn-memory-with-multiple-hive-users/

YARN: maximum parallel Map task count

The following is mentioned in Hadoop: The Definitive Guide:
"What qualifies as a small job? By default one that has less than 10 mappers, only one reducer, and the input size is less than the size of one HDFS block."
But how does it count the number of mappers in a job before executing it on YARN?
In MR1, the number of mappers depends on the number of input splits. Does the same apply for YARN as well?
In YARN, containers are flexible. So is there any way to compute the maximum number of map tasks that can run on a given cluster in parallel (some kind of tight upper bound, because it would give me a rough idea of how much data I can process in parallel)?
But how does it count the number of mappers in a job before executing it on YARN? In MR1, the number of mappers depends on the number of input splits. Does the same apply for YARN as well?
Yes, in YARN as well, if you are using MapReduce-based frameworks, the number of mappers depends on the input splits.
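For a plain FileInputFormat job the split size defaults to roughly the HDFS block size, so the mapper count can be estimated up front. A minimal sketch with made-up file sizes:

import math

def num_map_tasks(file_sizes_bytes, split_size_bytes=128 * 1024 * 1024):
    # One map task per input split; each file contributes at least one split.
    return sum(max(1, math.ceil(size / split_size_bytes)) for size in file_sizes_bytes)

# Three hypothetical input files: 1 GB, 200 MB and 1 MB
print(num_map_tasks([1 << 30, 200 << 20, 1 << 20]))   # 8 + 2 + 1 = 11 map tasks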
In YARN, containers are flexible. So is there any way to compute the maximum number of map tasks that can run on a given cluster in parallel (some kind of tight upper bound, because it would give me a rough idea of how much data I can process in parallel)?
The number of map tasks that can run in parallel on the YARN cluster depends on how many containers can be launched and run in parallel on the cluster. This ultimately depends on how you configure MapReduce in the cluster, which is clearly explained in this guide from Cloudera.
mapreduce.job.maps = MIN(yarn.nodemanager.resource.memory-mb / mapreduce.map.memory.mb, yarn.nodemanager.resource.cpu-vcores / mapreduce.map.cpu.vcores, number of physical drives x workload factor) x number of worker nodes
mapreduce.job.reduces = MIN(yarn.nodemanager.resource.memory-mb / mapreduce.reduce.memory.mb, yarn.nodemanager.resource.cpu-vcores / mapreduce.reduce.cpu.vcores, number of physical drives x workload factor) x number of worker nodes
The workload factor can be set to 2.0 for most workloads. Consider a higher setting for CPU-bound workloads.
yarn.nodemanager.resource.memory-mb (the memory available on a node for containers) = total system memory – reserved memory (e.g. 10–20% of memory for Linux and its daemon services) – resources for task buffers (such as the HDFS sort I/O buffer) – memory allocated to the DataNode (default 1024 MB), NodeManager, RegionServer, etc.
Hadoop is a disk I/O-centric platform by design. The number of independent physical drives (“spindles”) dedicated to DataNode use limits how much concurrent processing a node can sustain. As a result, the number of vcores allocated to the NodeManager should be the lesser of either:
[(total vcores) – (number of vcores reserved for non-YARN use)] or [ 2 x (number of physical disks used for DataNode storage)]
So
yarn.nodemanager.resource.cpu-vcores = min{ ((total vcores) – (number of vcores reserved for non-YARN use)), (2 x (number of physical disks used for DataNode storage))}
Available vcores on a node for containers = total number of vcores – vcores for the operating system (for calculating vcore demand, consider the number of concurrent processes or tasks each service runs as an initial guide; for the OS we take 2) – YARN NodeManager (default is 1) – HDFS DataNode (default is 1).
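Putting the formulas above together for one hypothetical worker profile (all hardware numbers below are invented inputs, not recommendations):

total_ram_mb = 64 * 1024
reserved_mb = 8 * 1024           # OS, daemon services and task buffers (the 10-20% reservation)
datanode_mb = 1024               # HDFS DataNode
nodemanager_mb = 1024            # YARN NodeManager
yarn_nm_memory_mb = total_ram_mb - reserved_mb - datanode_mb - nodemanager_mb

total_vcores = 16
reserved_vcores = 4              # OS (2) + NodeManager (1) + DataNode (1)
datanode_disks = 6
yarn_nm_vcores = min(total_vcores - reserved_vcores, 2 * datanode_disks)

map_memory_mb, map_vcores = 2048, 1
workload_factor = 2.0
worker_nodes = 4
job_maps = min(yarn_nm_memory_mb // map_memory_mb,
               yarn_nm_vcores // map_vcores,
               int(datanode_disks * workload_factor)) * worker_nodes

print(yarn_nm_memory_mb, yarn_nm_vcores, job_maps)
# 55296 MB and 12 vcores per node -> 12 parallel map tasks per node, 48 across 4 workers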
Note: mapreduce.map.memory.mb is the combination of mapreduce.map.java.opts.max.heap plus some headroom (a safety margin).
The settings for mapreduce.[map | reduce].java.opts.max.heap specify the default memory allotted for mapper and reducer heap size, respectively.
The mapreduce.[map | reduce].memory.mb settings specify the memory allotted to their containers, and the value assigned should allow overhead beyond the task heap size. Cloudera recommends applying a factor of 1.2 to the mapreduce.[map | reduce].java.opts.max.heap setting. The optimal value depends on the actual tasks. Cloudera also recommends setting mapreduce.map.memory.mb to 1–2 GB and setting mapreduce.reduce.memory.mb to twice the mapper value. The ApplicationMaster heap size is 1 GB by default, and can be increased if your jobs contain many concurrent tasks.
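Expressed as code, the sizing rule quoted above; the 1280 MB example heap is arbitrary:

def task_memory_settings(map_heap_mb):
    # Container memory ~= 1.2 x task heap; the reducer container is twice the mapper's.
    map_container_mb = int(round(map_heap_mb * 1.2))
    reduce_container_mb = 2 * map_container_mb
    reduce_heap_mb = int(round(reduce_container_mb / 1.2))
    return {
        "mapreduce.map.java.opts.max.heap": map_heap_mb,
        "mapreduce.map.memory.mb": map_container_mb,
        "mapreduce.reduce.java.opts.max.heap": reduce_heap_mb,
        "mapreduce.reduce.memory.mb": reduce_container_mb,
    }

print(task_memory_settings(1280))
# {'mapreduce.map.java.opts.max.heap': 1280, 'mapreduce.map.memory.mb': 1536,
#  'mapreduce.reduce.java.opts.max.heap': 2560, 'mapreduce.reduce.memory.mb': 3072}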
Reference –
http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_ig_yarn_tuning.html
http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.6.0/bk_installing_manually_book/content/rpm-chap1-11.html

Why does more memory for Hadoop map tasks make a MapReduce job slower?

I want to ask: why does configuring mapreduce.map/reduce.memory.mb and mapreduce.map/reduce.java.opts in mapred-site.xml to values bigger than the defaults make my job slower?
But if I configure them too low, then I get task failures. And I think, under that condition, my memory configuration on Hadoop is not necessary...
Can you give me an explanation?
What might be happening in your environment is that when you increase the values of the mapreduce.map/reduce.memory.mb and mapreduce.map/reduce.java.opts configurations toward their upper bound, it actually reduces the number of containers allowed to execute Map/Reduce tasks on every node, which eventually causes slowness in the overall job time.
If you have 2 nodes, each with 25 GB of free RAM, and say you configure mapreduce.map/reduce.memory.mb as 4 GB, then you might get at least 6 containers on every node, 12 in total, so you would get the chance to run 12 mapper/reducer tasks in parallel.
If instead you configure mapreduce.map/reduce.memory.mb as 10 GB, then you might get only 2 containers on every node, 4 in total, to execute your mapper/reducer tasks in parallel. The mapper/reducer tasks would then mostly run in sequence due to the lack of free containers, causing a delay in the overall job completion time.
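The same trade-off, using the answer's own numbers:

def parallel_tasks(nodes, free_ram_gb_per_node, container_gb):
    # Containers per node = free RAM // container size; parallelism scales with it.
    return nodes * (free_ram_gb_per_node // container_gb)

print(parallel_tasks(2, 25, 4))    # 12 tasks in parallel with 4 GB containers
print(parallel_tasks(2, 25, 10))   # 4 tasks in parallel with 10 GB containers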
You should arrive at an appropriate value for the configuration by considering the resources available and the amount of resources required by the Map/Reduce containers in your environment. Hope this makes sense.
You can allocate memory for map/reduce containers based on two factors:
the available memory on each DataNode, and
the total number of cores (vcores) you have.
Try to create a number of containers equal to the number of cores you have in each data node (including hyper-threading).
For example, if you have 10 physical cores (20 cores including hyper-threading),
the total number of containers you can plan for is 19 (leaving 1 core for other processes).
Assume that you have X GB of RAM in each data node. Then:
leave some memory (say Y GB) for the heaps of other processes, like the DataNode, NodeManager, RegionServer, etc.;
the memory available for YARN is then X - Y = Z;
memory for a map container = Z / number of containers per node;
memory for a reduce container = Z / (2 * number of containers per node).
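A quick sketch of this recipe with invented numbers (20 hardware threads and 64 GB of RAM per data node, with 8 GB set aside for the other daemons):

threads = 20                       # including hyper-threading
containers_per_node = threads - 1  # leave 1 core for other processes
x_gb = 64                          # total RAM in the data node
y_gb = 8                           # reserved for DataNode, NodeManager, RegionServer, etc.
z_gb = x_gb - y_gb                 # memory available for YARN

map_container_gb = z_gb / containers_per_node
reduce_container_gb = z_gb / (2 * containers_per_node)
print(containers_per_node, round(map_container_gb, 2), round(reduce_container_gb, 2))
# 19 containers, ~2.95 GB per map container, ~1.47 GB per reduce container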
