yarn is not honouring yarn.nodemanager.resource.cpu-vcores - hadoop

I am using Hadoop-2.4.0 and my system configs are 24 cores, 96 GB RAM.
I am using following configs
mapreduce.map.cpu.vcores=1
yarn.nodemanager.resource.cpu-vcores=10
yarn.scheduler.minimum-allocation-vcores=1
yarn.scheduler.maximum-allocation-vcores=4
yarn.app.mapreduce.am.resource.cpu-vcores=1
yarn.nodemanager.resource.memory-mb=88064
mapreduce.map.memory.mb=3072
mapreduce.map.java.opts=-Xmx2048m
Capacity Scheduler configs
queue.default.capacity=50
queue.default.maximum_capacity=100
yarn.scheduler.capacity.root.default.user-limit-factor=2
With above configs, I expect yarn won't launch more than 10 mappers per node, but It is launching 28 mappers per node.
Am I doing something wrong??

YARN is running more containers than allocated cores because by default DefaultResourceCalculator is used. It considers only memory.
public int computeAvailableContainers(Resource available, Resource required) {
// Only consider memory
return available.getMemory() / required.getMemory();
}
Use DominantResourceCalculator, It uses both cpu and memory.
Set below config in capacity-scheduler.xml
yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator
More about DominantResourceCalculator

Related

Spark job just hangs with large data

I am trying to query from s3 (15 days of data). I tried querying them separately (each day) it works fine. It works fine for 14 days as well. But when I query 15 days the job keeps running forever (hangs) and the task # is not updating.
My settings :
I am using 51 node cluster r3.4x large with dynamic allocation and maximum resource turned on.
All I am doing is =
val startTime="2017-11-21T08:00:00Z"
val endTime="2017-12-05T08:00:00Z"
val start = DateUtils.getLocalTimeStamp( startTime )
val end = DateUtils.getLocalTimeStamp( endTime )
val days: Int = Days.daysBetween( start, end ).getDays
val files: Seq[String] = (0 to days)
.map( start.plusDays )
.map( d => s"$input_path${DateTimeFormat.forPattern( "yyyy/MM/dd" ).print( d )}/*/*" )
sqlSession.sparkContext.textFile( files.mkString( "," ) ).count
When I run the same with 14 days, I got 197337380 (count) and I ran the 15th day separately and got 27676788. But when I query 15 days total the job hangs
Update :
The job works fine with :
var df = sqlSession.createDataFrame(sc.emptyRDD[Row], schema)
for(n <- files ){
val tempDF = sqlSession.read.schema( schema ).json(n)
df = df(tempDF)
}
df.count
But can some one explain why it works now but not before ?
UPDATE : After setting mapreduce.input.fileinputformat.split.minsize to 256 GB it works fine now.
Dynamic allocation and maximize resource allocation are both different settings, one would be disabled when other is active. With Maximize resource allocation in EMR, 1 executor per node is launched, and it allocates all the cores and memory to that executor.
I would recommend taking a different route. You seem to have a pretty big cluster with 51 nodes, not sure if it is even required. However, follow this rule of thumb to begin with, and you will get a hang of how to tune these configurations.
Cluster memory - minimum of 2X the data you are dealing with.
Now assuming 51 nodes is what you require, try below:
r3.4x has 16 CPUs - so you can put all of them to use by leaving one for the OS and other processes.
Set your number of executors to 150 - this will allocate 3 executors per node.
Set number of cores per executor to 5 (3 executors per node)
Set your executor memory to roughly total host memory/3 = 35G
You got to control the parallelism (default partitions), set this to number of total cores you have ~ 800
Adjust shuffle partitions - make this twice of number of cores - 1600
Above configurations have been working like a charm for me. You can monitor the resource utilization on Spark UI.
Also, in your yarn config /etc/hadoop/conf/capacity-scheduler.xml file, set yarn.scheduler.capacity.resource-calculator to org.apache.hadoop.yarn.util.resource.DominantResourceCalculator - which will allow Spark to really go full throttle with those CPUs. Restart yarn service after change.
You should be increasing the executor memory and # executors, If the data is huge try increasing the Driver memory.
My suggestion is to not use the dynamic resource allocation and let it run and see if it still hangs or not (Please note that spark job can consume entire cluster resources and make other applications starve for resources try this approach when no jobs are running). if it doesn't hang that means you should play with the resource allocation, then start hardcoding the resources and keep increasing resources so that you can find the best resource allocation you can possibly use.
Below links can help you understand the resource allocation and optimization of resources.
http://site.clairvoyantsoft.com/understanding-resource-allocation-configurations-spark-application/
https://community.hortonworks.com/articles/42803/spark-on-yarn-executor-resource-allocation-optimiz.html

How to make Hadoop/EMR use more containers per node

I'm in the process of moving our application from Hadoop 1.0.3 to 2.7, on EMR v5.1.0. I got it running, but I'm still having problems getting my head around the resource-allocation system in Yarn. With the default settings provided by EMR, Hadoop only allocates one container per node, even if I select a larger instance type for the nodes. This is a problem, since we'll now be using twice as many nodes to do the same amount of work.
I want to squeeze more containers into one node, and ensure that we're using all the available resources. I assume that I shouldn't touch yarn.nodemanager.resource.memory-mb or yarn.nodemanager.resource.cpu-vcores, since those are set by EMR to reflect the actual available resources. Which settings do I have to change?
Your container sizes are defined by setting the memory (default criteria for a container) and vcores. The following can be configured:
yarn-scheduler.minimum-allocation-mb
yarn-scheduler.maximum-allocation-mb
yarn-scheduler.increment-allocation-mb
yarn-scheduler.minimum-allocation-vcores
yarn-scheduler.maximum-allocation-vcores
yarn-scheduler.increment-allocation-vcores
All the following criteria must be satified (they are per container, except for yarn.nodemanager.resource.cpu-vcores and yarn.nodemanager.resource.memory-mb which are per NodeManager hence per DataNode):
1 <= yarn-scheduler.minimum-allocation-vcores <= yarn-scheduler.maximum-allocation-vcores
yarn-scheduler.maximum-allocation-vcores <= yarn.nodemanager.resource.cpu-vcores
yarn-scheduler.increment-allocation-vcores = 1
1024 <= yarn-scheduler.minimum-allocation-mb <= yarn-scheduler.maximum-allocation-mb
yarn-scheduler.maximum-allocation-mb <= yarn.nodemanager.resource.memory-mb
yarn-scheduler.increment-allocation-mb = 512
You can also see this helpful link https://www.cloudera.com/documentation/enterprise/5-4-x/topics/cdh_ig_yarn_tuning.html

Not all Spark Workers are starting: SPARK_WORKER_INSTANCES

I have my spark-defaults.conf configuration like this.
my node has 32Gb RAM. 8 cores.
I am planning to use 16gb and 4 workers with each using 1 core.
SPARK_WORKER_MEMORY=16g
SPARK_PUBLIC_DNS=vodip-dt-a4d.ula.comcast.net
SPARK_WORKER_CORES=4
SPARK_WORKER_INSTANCES=4
SPARK_DAEMON_MEMORY=1g
When i try to start the master and workes like this, only 1 worker is being started where an i am expecting 4 workers.
start-master.sh --properties-file /app/spark/spark-1.6.1-bin-hadoop2.6/conf/ha.conf
start-slaves.sh
these commands started the master and showed that they were starting 4 workes where as only 1 worker was started.
The one worker that started using all 4 cores.
Please let me know why my other 3 worked are not starting.
Memory and core properties are for every executor. So when you say SPARK_WORKER_CORES=4 this is every executor with 4 cores.
Also you cant use all memory in your server for executors. If you want 4 executors with total 16gb memory, your properties should be like this
SPARK_WORKER_MEMORY=4g
SPARK_PUBLIC_DNS=vodip-dt-a4d.ula.comcast.net
SPARK_WORKER_CORES=1
SPARK_WORKER_INSTANCES=4

Problems with memory kill limits for YARN

I have problem with understanding YARN configuration.
I have such lines in yarn/mapreduce configs:
<name>mapreduce.map.memory.mb</name>
<value>2048</value>
<name>mapreduce.reduce.memory.mb</name>
<value>1024</value>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>2.1</value>
Here is written:
By default ("yarn.nodemanager.vmem-pmem-ratio") is set to 2.1. This means that a map or reduce container can allocate up to 2.1 times the ("mapreduce.reduce.memory.mb") or ("mapreduce.map.memory.mb") of virtual memory before the NM will kill the container.
When NodeManager will kill my container?
When a whole container reaches 2048MB*2.1=4300,8MB? Or 1024MB*2.1=2150,4MB
Can i get some better explanation?
Each Mapper and Reducer runs in its own separate container (containers are not shared between Mappers and Reducers, unless it is a Uber job. Check about Uber mode here: What is the purpose of "uber mode" in hadoop?).
Typically, memory requirements for a Mapper and a Reducer differ.
Hence, there are different configuration parameters for Mapper (mapreduce.map.memory.mb) and Reducer (mapreduce.reduce.memory.mb).
So, as per the settings in your yarn-site.xml, virtual memory limits for Mapper and Redcuer are:
Mapper limit: 2048 * 2.1 = 4300.8 MB
Reducer limit: 1024 * 2.1 = 2150.4 MB
In short, Mappers and Reducers have different memory settings and limits.

Hadoop performance modeling

I am working on Hadoop performance modeling. Hadoop has 200+ parameters so setting them manually is not possible. So often we run our hadoop jobs with default parameter value(like using default value io.sort.mb, io.sort.record.percent, mapred.output.compress etc). But using default parameter value gives us sub optimal performance. There is some work done in this area by Herodotos Herodotou (http://www.cs.duke.edu/starfish/files/vldb11-job-optimization.pdf) to improve performance. But i have following doubt in their work --
They are fixing the value of parameters at the job start time( according to proportionality assumption of data) for all the phases( read, map, collect etc.) of MapReduce job. Can we set different value of these parameters for each phase at run time according to run time environment( like cluster configuration, underling file system etc.), by changing Hadoop configuration log files of a particular node to get optimal performance from a node ?
They are using white box model for Hadoop core are they still applicable for
current Hadoop ( http://arxiv.org/pdf/1106.0940.pdf) ?
No, you couldn't dynamically change MapReduce parameters per job per node.
Configuring set of nodes
Rather what you could do is change the configuration parameters per node statically in the configuration files (generally located in /etc/hadoop/conf), so that you could take the most out of your cluster with different h/w configurations.
Example: Assume you have 20 worker nodes with different hardware configurations like:
10 with configuration of 128GB RAM, 24 Cores
10 with configuration of 64GB RAM, 12 Cores
In that case you would want to configure each of identical servers to take most out of the hardware for example, you would want to run more child tasks (mappers & reducers) on worker nodes with more RAM and Cores, for example:
Nodes with 128GB RAM, 24 Cores => 36 worker tasks (mappers + reducers), JVM heap for each worker task would be around 3GB.
Nodes with 64GB RAM, 12 Cores => 18 worker tasks (mappers + reducers), JVM heap for each worker task would be around 3GB.
So, you would want to configure the set of nodes respectively with appropriate parameters.
Using ToolRunner to pass configuration parameters dynamically to a Job:
Also, you could dynamically change the MapReduce job parameters per job but these parameters would be applied to the entire cluster not just to a set of nodes. Provided your MapReduce job driver extends ToolRunner.
ToolRunner allows you to parse generic hadoop command line arguments. You'll be able to pass MapReduce configuration parameters using -D property.name=property.value.
You can pretty much pass almost all hadoop parameters dynamically to a job. But most commonly passed MapReduce configuration parameters dynamically to a job are:
mapreduce.task.io.sort.mb
mapreduce.map.speculative
mapreduce.job.reduces
mapreduce.task.io.sort.factor
mapreduce.map.output.compress
mapreduce.map.outout.compress.codec
mapreduce.reduce.memory.mb
mapreduce.map.memory.mb
Here is an example terasort job passing lots of parameters dynamically per job:
hadoop jar hadoop-mapreduce-examples.jar tearsort \
-Ddfs.replication=1 -Dmapreduce.task.io.sort.mb=500 \
-Dmapreduce.map.sort.splill.percent=0.9 \
-Dmapreduce.reduce.shuffle.parallelcopies=10 \
-Dmapreduce.reduce.shuffle.memory.limit.percent=0.1 \
-Dmapreduce.reduce.shuffle.input.buffer.percent=0.95 \
-Dmapreduce.reduce.input.buffer.percent=0.95 \
-Dmapreduce.reduce.shuffle.merge.percent=0.95 \
-Dmapreduce.reduce.merge.inmem.threshold=0 \
-Dmapreduce.job.speculative.speculativecap=0.05 \
-Dmapreduce.map.speculative=false \
-Dmapreduce.map.reduce.speculative=false \

 -Dmapreduce.job.jvm.numtasks=-1 \
-Dmapreduce.job.reduces=84 \

 -Dmapreduce.task.io.sort.factor=100 \
-Dmapreduce.map.output.compress=true \

 -Dmapreduce.map.outout.compress.codec=\
org.apache.hadoop.io.compress.SnappyCodec \
-Dmapreduce.job.reduce.slowstart.completedmaps=0.4 \
-Dmapreduce.reduce.merge.memtomem.enabled=fasle \
-Dmapreduce.reduce.memory.totalbytes=12348030976 \
-Dmapreduce.reduce.memory.mb=12288 \

 -Dmapreduce.reduce.java.opts=“-Xms11776m -Xmx11776m \
-XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode \
-XX:+CMSIncrementalPacing -XX:ParallelGCThreads=4” \

 -Dmapreduce.map.memory.mb=4096 \

 -Dmapreduce.map.java.opts=“-Xmx1356m” \
/terasort-input /terasort-output

Resources