Spark on EC2 cannot utilize all cores available - amazon-ec2

I am running Spark on an EC2 cluster set up via the spark-ec2.sh script. The 5 slave instances I launched have 40 cores in total, but each instance cannot utilize all of its cores.
From the slave logs, it looks like the slaves execute tasks one by one, and when I run top on a slave instance the CPU usage sits around 100% instead of 800%.
I have turned on spark.mesos.coarse mode, and the data is split into 40 chunks. Spark does use all 8 cores when I run it in standalone mode on my local machine.
Is there anything I can do to make the Spark slaves utilize all the available cores?

Try setting spark.cores.max, say to 8, before creating the SparkContext.
In Spark 0.9:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("...")
  .set("spark.cores.max", "8")
val sc = new SparkContext(conf)

Related

Maximize use of Vcores by Yarn when running spark

I have a virtualized cluster running Hadoop 2.9 on 4 nodes.
Each node has 16 CPUs and 126 GB of RAM.
No matter how I set yarn.scheduler.minimum-allocation-vcores to something other than 1, when I run spark-submit with YARN as the master it uses only 1 vcore for each container.
Is there a way to override that?
Thanks!
Use spark.executor.cores. From docs:
The number of cores to use on each executor. In standalone and Mesos coarse-grained modes, setting this parameter allows an application to run multiple executors on the same worker, provided that there are enough cores on that worker. Otherwise, only one executor per application will run on each worker.
which by default is 1 in YARN mode.
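For example, a minimal sketch in the same style as the first answer above (the master URL, 4 cores and 16g are only placeholders; pick values that fit your 16-CPU / 126 GB nodes):
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("yarn")
  .set("spark.executor.cores", "4")    // vcores requested for each executor container
  .set("spark.executor.memory", "16g") // size memory together with cores, or memory becomes the limit
val sc = new SparkContext(conf)
The same value can also be passed on the command line as --executor-cores when you run spark-submit.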

Calculating yarn.nodemanager.resource.cpu-vcores for a yarn cluster with multiple spark clients

If I have 3 spark applications all using the same yarn cluster, how should I set
yarn.nodemanager.resource.cpu-vcores
in each of the 3 yarn-site.xml?
(each Spark application is required to have its own yarn-site.xml on the classpath)
Does this value even matter in the client yarn-site.xml files?
If it does:
Let's say the cluster has 16 cores.
Should the value in each yarn-site.xml be 5 (for a total of 15, leaving 1 core for system processes)? Or should I set each one to 15?
(Note: Cloudera indicates one core should be left for system processes here: http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/; however, they do not go into the details of using multiple clients against the same cluster.)
Assume Spark is running with YARN as the master, in cluster mode.
Are you talking about the server-side configuration for each YARN Node Manager? If so, it would typically be configured to be a little less than the number of CPU cores (or virtual cores if you have hyperthreading) on each node in the cluster. So if you have 4 nodes with 4 cores each, you could dedicate for example 3 per node to the YARN node manager and your cluster would have a total of 12 virtual CPUs.
Then you request the desired resources when submitting the Spark job to the cluster (see http://spark.apache.org/docs/latest/submitting-applications.html for example), and YARN will attempt to fulfill that request. If it cannot be fulfilled, your Spark job (or application) will be queued up or will eventually time out.
You can configure different resource pools in YARN to guarantee a specific amount of memory/CPU resources to such a pool, but that's a little bit more advanced.
If you submit your Spark application in cluster mode, you have to consider that the Spark driver will run on a cluster node and not your local machine (the one that submitted it). Therefore it will require at least 1 additional virtual CPU.
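As a hedged sketch of such a request (the numbers are purely illustrative and have to fit within what the Node Managers advertise, e.g. the 12 virtual CPUs from the example above):
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("yarn")                    // client vs. cluster deploy mode is chosen at submit time
  .set("spark.executor.instances", "3") // 3 executors...
  .set("spark.executor.cores", "3")     // ...with 3 vcores each = 9 of the 12 vcores
  .set("spark.executor.memory", "4g")
  .set("spark.driver.cores", "1")       // in cluster mode the driver needs its own container as well
  .set("spark.driver.memory", "2g")
val sc = new SparkContext(conf)
If YARN cannot satisfy the request, the application is queued until resources free up or it eventually times out, exactly as described above.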
Hope that clarifies things a little for you.

Mismatch in no of Executors(Spark in YARN Pseudo distributed mode)

I am running Spark using YARN (Hadoop 2.6) as the cluster manager. YARN is running in pseudo-distributed mode. I started the Spark shell with 6 executors and was expecting the same:
spark-shell --master yarn --num-executors 6
But in the Spark Web UI, I see only 4 executors.
Any reason for this?
PS: I ran the nproc command on my Ubuntu (14.04) machine and the result is given below. I believe this means my system has 8 cores.
mountain#mountain:~$ nproc
8
Did you take into account spark.yarn.executor.memoryOverhead?
It possibly creates a hidden memory requirement, and in the end YARN cannot provide all of the requested resources.
Also note that YARN rounds the container size up to a multiple of yarn.scheduler.increment-allocation-mb.
All the details are here:
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
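As a rough sketch of the arithmetic (all numbers are hypothetical; the overhead default is on the order of max(384 MB, 7-10% of the executor memory), depending on the Spark version):
// hypothetical request: --executor-memory 4g
val executorMemoryMb = 4096
val overheadMb       = math.max(384, (0.10 * executorMemoryMb).toInt) // ~409 MB hidden on top of the heap
val requestedMb      = executorMemoryMb + overheadMb                  // ~4505 MB asked from YARN per executor
val incrementMb      = 512                                            // yarn.scheduler.increment-allocation-mb
val containerMb      = math.ceil(requestedMb.toDouble / incrementMb).toInt * incrementMb // 4608 MB reserved
Several containers of that size, plus one for the Application Master, can easily exceed what a single pseudo-distributed node offers, so YARN simply starts fewer executors than requested.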
This happens when there are not enough resources on your cluster to start more executors. The following things are taken into account:
A Spark executor runs inside a YARN container. The size of this container is determined by the value of yarn.scheduler.minimum-allocation-mb in yarn-site.xml, so check this property. If your existing containers consume all of the available memory, then no memory is left for new containers and no new executors will be started.
The storage memory column in the UI displays the amount of memory used for execution and RDD storage. By default, this equals (HEAP_SPACE - 300 MB) * 75%. The rest of the memory is used for internal metadata, user data structures and other things (see Spark on YARN: Less executor memory than set via spark-submit).
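As a quick sketch of that formula (the 75% fraction is the default in the Spark versions this answer refers to; newer versions use a different split):
val heapMb    = 4096                          // an executor started with 4 GB of heap
val storageMb = ((heapMb - 300) * 0.75).toInt // ≈ 2847 MB shown as "Storage Memory" in the UI
So an executor launched with 4 GB shows up with roughly 2.8 GB in that column, which is expected and not a sign of lost memory.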
I hope this helps.

Understanding Spark alongside Hadoop

In my setup, both Hadoop and Spark are running on the same network but on different nodes. Spark can run alongside an existing Hadoop cluster simply by being launched as a separate service. Will this show any improvement in performance?
I have thousands of files around 10 GB loaded into HDFS.
I have 8 nodes for Hadoop, 1 master and 5 workers for Spark
As long as the Spark workers are on the same nodes as the data, you have the advantage of data locality. You can launch the Spark service alongside Hadoop as well.

Hadoop Yarn container configuration (CPU, Memory...)

I've just set up a new Hadoop 2.2.0 cluster and I am running MapReduce jobs on HBase on the YARN framework.
I have a problem with the configuration of containers. In general, we have 8 nodes, half of which are old machines with 8-core CPUs and half of which are new machines with 24-core CPUs. I wonder if it is possible to configure them separately, with more containers on the new machines and fewer on the old machines. With the current settings, the number of containers is limited to 8, which means at least 1 core per container. Even though I have resources left on the new machines, they are not allocated to additional containers. We use the fair scheduler.
Thanks
In the configuration file yarn-site.xml, there is a property named yarn.nodemanager.resource.cpu-vcores which defines the number of CPU cores of the node. Since I set this value differently for the old and the new machines, more containers are now running on the new nodes.
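For illustration only (the values are examples; leave a core or two per node for the OS and the Hadoop daemons), the yarn-site.xml on the 24-core machines could contain something like the following, with a smaller value such as 6 on the 8-core machines:
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>20</value>
</property>
Note that the Node Managers typically have to be restarted for this change to take effect.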
Once again I answered my own question :)
