Maximize use of vcores by YARN when running Spark on Hadoop

I have a virtualized cluster with Hadoop 2.9 on 4 nodes.
Each node has 16 CPUs and 126 GB of RAM.
No matter how I set yarn.scheduler.minimum-allocation-vcores to something other than 1, when I run spark-submit with YARN as the master it still uses only 1 vcore for each container.
Is there a way to override that?
Thanks!

Use spark.executor.cores. From the docs:
The number of cores to use on each executor. In standalone and Mesos coarse-grained modes, setting this parameter allows an application to run multiple executors on the same worker, provided that there are enough cores on that worker. Otherwise, only one executor per application will run on each worker.
which by default is:
1 in YARN mode
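For example, a submit command along these lines should give each executor more than one core; the core/memory counts and the application jar name here are just placeholders to adapt to your workload:
spark-submit --master yarn --num-executors 8 --executor-cores 4 --executor-memory 4g your-app.jar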

Related

Calculating yarn.nodemanager.resource.cpu-vcores for a yarn cluster with multiple spark clients

If I have 3 spark applications all using the same yarn cluster, how should I set
yarn.nodemanager.resource.cpu-vcores
in each of the 3 yarn-site.xml?
(each Spark application is required to have its own yarn-site.xml on the classpath)
Does this value even matter in the client yarn-site.xml's ?
If it does:
Let's say the cluster has 16 cores.
Should the value in each yarn-site.xml be 5 (for a total of 15, leaving 1 core for system processes)? Or should I set each one to 15?
(Note: Cloudera indicates one core should be left for system processes here: http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/ However, they do not go into detail about using multiple clients against the same cluster.)
Assume Spark is running with YARN as the master, and running in cluster mode.
Are you talking about the server-side configuration for each YARN Node Manager? If so, it would typically be configured to be a little less than the number of CPU cores (or virtual cores if you have hyperthreading) on each node in the cluster. So if you have 4 nodes with 4 cores each, you could dedicate for example 3 per node to the YARN node manager and your cluster would have a total of 12 virtual CPUs.
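For example, on a 4-core node you might hand 3 vcores to YARN in that node's yarn-site.xml (the value here is only illustrative):
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>3</value>
</property>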
Then you request the desired resources when submitting the Spark job (see http://spark.apache.org/docs/latest/submitting-applications.html for example) to the cluster and YARN will attempt to fulfill that request. If it can't be fulfilled, your Spark job (or application) will be queued up or there will eventually be a timeout.
You can configure different resource pools in YARN to guarantee a specific amount of memory/CPU resources to such a pool, but that's a little bit more advanced.
If you submit your Spark application in cluster mode, you have to consider that the Spark driver will run on a cluster node and not on your local machine (the one that submitted it). It will therefore require at least 1 additional vcore.
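As a sketch, a cluster-mode submission that accounts for both driver and executor resources could look roughly like the following, with all the numbers and the jar name being placeholders:
spark-submit --master yarn --deploy-mode cluster --driver-cores 1 --driver-memory 2g --num-executors 3 --executor-cores 4 --executor-memory 6g your-app.jar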
Hope that clarifies things a little for you.

Incorrect memory allocation for Yarn/Spark after automatic setup of Dataproc Cluster

I'm trying to run Spark jobs on a Dataproc cluster, but Spark will not start due to Yarn being misconfigured.
I receive the following error when running "spark-shell" from the shell (locally on the master), as well as when uploading a job through the web-GUI and the gcloud command line utility from my local machine:
15/11/08 21:27:16 ERROR org.apache.spark.SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: Required executor memory (38281+2679 MB) is above the max threshold (20480 MB) of this cluster! Please increase the value of 'yarn.scheduler.maximum-allocation-mb'.
I tried modifying the value in /etc/hadoop/conf/yarn-site.xml but it didn't change anything. I don't think it pulls the configuration from that file.
I've tried multiple cluster combinations, at multiple sites (mainly Europe), and I only got this to work with the low-memory version (4 cores, 15 GB memory).
That is, this is only a problem on nodes configured with more memory than the YARN default allows.
Sorry about these issues you're running into! It looks like this is part of a known issue where certain memory settings end up computed based on the master machine's size rather than the worker machines' size, and we're hoping to fix this in an upcoming release soon.
There are two current workarounds:
Use a master machine type with memory equal to or smaller than the worker machine types.
Explicitly set spark.executor.memory and spark.executor.cores either using the --conf flag if running from an SSH connection like:
spark-shell --conf spark.executor.memory=4g --conf spark.executor.cores=2
or if running gcloud beta dataproc, use --properties:
gcloud beta dataproc jobs submit spark --properties spark.executor.memory=4g,spark.executor.cores=2
You can adjust the number of cores and amount of memory per executor as necessary; it's fine to err on the side of smaller executors and let YARN pack lots of executors onto each worker, though you can save some per-executor overhead by setting spark.executor.memory to the full size available in each YARN container and spark.executor.cores to all the cores in each worker.
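For instance, assuming purely for illustration that each worker gives YARN about 12 GB and 4 vcores, a one-big-executor-per-worker setup might look like the following, leaving some headroom for the YARN memory overhead:
spark-shell --conf spark.executor.memory=10g --conf spark.executor.cores=4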
EDIT: As of January 27th, new Dataproc clusters will now be configured correctly for any combination of master/worker machine types, as mentioned in the release notes.

Mismatch in no of Executors(Spark in YARN Pseudo distributed mode)

I am running Spark using YARN (Hadoop 2.6) as the cluster manager. YARN is running in pseudo-distributed mode. I started the Spark shell with 6 executors and was expecting the same:
spark-shell --master yarn --num-executors 6
But in the Spark web UI, I see only 4 executors.
Any reason for this?
PS: I ran the nproc command on my Ubuntu (14.04) machine and the result is given below. I believe this means my system has 8 cores:
mountain#mountain:~$ nproc
8
Did you take into account spark.yarn.executor.memoryOverhead?
It may create a hidden memory requirement, so that in the end YARN cannot provide all of the requested resources.
Also note that YARN rounds container sizes up to a multiple of yarn.scheduler.increment-allocation-mb.
All the details are here:
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
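As a rough illustration (the exact overhead fraction depends on the Spark version, and the increment below is only an assumed value): a 4096 MB executor request gets roughly max(384, 7-10% of 4096) ≈ 410 MB of overhead added, giving about 4506 MB, which YARN then rounds up to the next multiple of the allocation increment, e.g. 4608 MB with a 512 MB increment. A few such containers can exhaust a node before the requested number of executors is reached.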
This happens when there are not enough resources on your cluster to start more executors. Following things are taken into account
A Spark executor runs inside a YARN container. The container size is determined by the value of yarn.scheduler.minimum-allocation-mb in yarn-site.xml, so check this property. If your existing containers consume all available memory, then no more memory is available for new containers, and no new executors will be started.
The Storage Memory column in the UI displays the amount of memory used for execution and RDD storage. By default, this equals (HEAP_SPACE - 300 MB) * 75%. The rest of the memory is used for internal metadata, user data structures and other things. Ref: (Spark on YARN: Less executor memory than set via spark-submit)
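To make that concrete with made-up numbers: if yarn.nodemanager.resource.memory-mb is 8192 and each executor container ends up at 2048 MB, YARN can only fit 4 such containers on the node, so a request for 6 executors leaves 2 of them unscheduled. For the Storage Memory column, a 1 GB heap would show roughly (1024 - 300) * 0.75 ≈ 543 MB.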
I hope this helps.

Is it allowed in YARN to have multiple containers on same DataNode?

Is it allowed in YARN to have "multiple containers" of the "same application" running on one DataNode?
Yes.
Example: multiple mappers of a job running on the same DataNode.
Yes, any data node can have multiple containers running in parallel.
The number of parallel containers is calculated by the YARN Resource Manager by considering the amount of RAM and the number of CPU cores available on the data node.
You can expect to see multiple containers running on the same data node whenever the Resource Manager decides to run several mappers/reducers in containers on that node.
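As a made-up example: if a data node advertises yarn.nodemanager.resource.memory-mb = 16384 and yarn.nodemanager.resource.cpu-vcores = 8, and each map task requests a 2048 MB / 1 vcore container, the Resource Manager can schedule up to 8 of that job's containers on that single node at the same time.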

Hadoop Yarn container configuration (CPU, Memory...)

I've just set up a new Hadoop cluster with Hadoop 2.2.0 and am running MapReduce jobs on HBase on the YARN framework.
I have a problem with the configuration of containers. In total we have 8 nodes, half of which are old machines with 8-core CPUs and half of which are new machines with 24-core CPUs. I wonder if it's possible to configure them separately, with more containers on the new machines and fewer on the old machines. With the current settings, the number of containers is limited to 8, which means at least 1 core per container. Even though I have resources left on the new machines, they are not allocated to more containers there. We use the fair scheduler.
Thanks
In the configuration file yarn-site.xml, there is a property named yarn.nodemanager.resource.cpu-vcores which defines the number of CPU cores the node offers to YARN. Once I set this value differently for the old and the new machines, more containers ran on the new nodes.
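For example (the values are illustrative, not from the original setup), the yarn-site.xml on an old 8-core machine might contain:
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>7</value>
</property>
while on a new 24-core machine the same property might be set to 23, leaving one core on each node for the OS and other daemons.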
Once again I'm answering my own question :)
