How to utilize all vCores in an 8-node r3.2xlarge cluster - performance

My Spark job is only using 32 vCores out of 127 in total. Please refer to the image below.
My spark-submit command is:
spark-submit --executor-memory 12G --num-executors 32 --executor-cores 3 --conf spark.executor.memoryOverhead=1.5g
How do I tweak the spark-submit params to utilize all the resources available in the cluster?
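As a rough, hedged sketch only (assuming the 127-vCore total corresponds to about 16 vCores per node, and that memory per node allows it; the exact figures are not given in the question), one could raise --executor-cores and adjust the executor count so the requested cores add up to roughly the cluster total, for example:
spark-submit --executor-memory 12G --num-executors 31 --executor-cores 4 --conf spark.executor.memoryOverhead=1.5g
(with the rest of the command as in the original). Note also that with YARN's default DefaultResourceCalculator, scheduling accounts only for memory, so the ResourceManager UI typically reports one vCore per container regardless of --executor-cores; in a CapacityScheduler setup, switching yarn.scheduler.capacity.resource-calculator to org.apache.hadoop.yarn.util.resource.DominantResourceCalculator makes the reported vCores reflect what executors actually request.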

Related

How to read all the properties files with given prefix in spark, spring application?

I am developing an application in Spark and Scala and using Spring to read configuration files.
My environment-specific files are available in a per-environment directory like this:
src/main/resource/DEV
mms_kafka.properties
mms_app.properties
pps_kafka.properties
pps_app.properties
And the common files are under src/main/resource like below:
src/main/resource
mmsmappings.properties
ppsmappings.properties
Currently, I am doing it like below and it is working fine:
@PropertySource(value = Array("classpath:${ENV}/mms_app.properties","classpath:${ENV}/mms_kafka.properties","classpath:mmsmappings.properties"), ignoreResourceNotFound=false)
Spark submit command: spark2-submit --master yarn --deploy-mode client --class job.Driver --conf 'spark.driver.extraJavaOptions=-DENV=DEV' --driver-memory 4g --executor-memory 16g --num-executors 4 --executor-cores 4 temp-0.0.1-shaded.jar
But I want to read all the files for a particular prefix (mms/pps) like below. I tried it, but it complains that the ENV and APP placeholders are not resolved:
@PropertySource(value = Array("classpath:${ENV}/${APP}_app.properties","classpath:${ENV}/${APP}_kafka.properties","classpath:${APP}mappings.properties"), ignoreResourceNotFound=false)
Spark submit command: spark2-submit --master yarn --deploy-mode client --class job.Driver --conf 'spark.driver.extraJavaOptions=-DENV=DEV' --conf 'spark.driver.extraJavaOptions=-DAPP=mms' --driver-memory 4g --executor-memory 16g --num-executors 4 --executor-cores 4 temp-0.0.1-shaded.jar
How should I fix this?
I solved this by passing both parameters in a single spark.driver.extraJavaOptions entry, like below (when --conf spark.driver.extraJavaOptions is given twice, the second value overrides the first, so only one placeholder was actually being set):
spark2-submit --master yarn --deploy-mode client --class com.job.Driver --conf 'spark.driver.extraJavaOptions=-DENV=DEV -DAPP=mms' --driver-memory 4g --executor-memory 16g --num-executors 4 --executor-cores 4 test.jar
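Switching to the other prefix should then only require changing the -DAPP value; as a sketch, assuming the pps files follow the same naming convention listed above:
spark2-submit --master yarn --deploy-mode client --class com.job.Driver --conf 'spark.driver.extraJavaOptions=-DENV=DEV -DAPP=pps' --driver-memory 4g --executor-memory 16g --num-executors 4 --executor-cores 4 test.jar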

Spark Program running very slow on cluster

I am trying to run my PySpark job on a cluster with 2 nodes and 1 master (all have 16 GB RAM). I ran my Spark job with the command below:
spark-submit --master yarn --deploy-mode cluster --name "Pyspark"
--num-executors 40 --executor-memory 2g CD.py
However, my code runs very slowly; it takes almost 1 hour to parse 8.2 GB of data.
Then I tried to change the configuration in YARN. I changed the following properties:
yarn.scheduler.increment-allocation-mb = 2 GiB
yarn.scheduler.minimum-allocation-mb = 2 GiB
yarn.scheduler.maximum-allocation-mb = 2 GiB
After doing these changes, my Spark job is still running very slowly and taking more than 1 hour to parse the 8.2 GB of files.
Could you please try with the below configuration (a matching spark-submit command is sketched after the list):
spark.executor.memory 5g
spark.executor.cores 5
spark.executor.instances 3
spark.driver.cores 2
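As a hedged sketch, passing those same values on the command line for the CD.py job above might look like this:
spark-submit --master yarn --deploy-mode cluster --name "Pyspark" --executor-memory 5g --executor-cores 5 --num-executors 3 --driver-cores 2 CD.py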

Spark on YARN: execute driver without worker

Running Spark on YARN, cluster mode.
3 data nodes with YARN
YARN => 32 vCores, 32 GB RAM
I am submitting Spark program like this:
spark-submit \
--class com.blablacar.insights.etl.SparkETL \
--name ${JOB_NAME} \
--master yarn \
--num-executors 1 \
--deploy-mode cluster \
--driver-memory 512m \
--driver-cores 1 \
--executor-memory 2g \
--executor-cores 20 \
toto.jar json
I can see 2 jobs running fine on 2 nodes. But I can also see 2 other jobs with just a driver container!
Is it possible to not run the driver if there are no resources left for workers?
Actually, there is a setting to limit the resources allocated to the "Application Master" (in the case of Spark, this is the driver):
yarn.scheduler.capacity.maximum-am-resource-percent
From http://maprdocs.mapr.com/home/AdministratorGuide/Hadoop2.xCapacityScheduler-RunningPendingApps.html:
Maximum percent of resources in the cluster that can be used to run application masters - controls the number of concurrent active applications.
This way, YARN will not take the full cluster resources for Spark drivers, and will keep resources for the workers. Youpi!
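For illustration, the limit is set in capacity-scheduler.xml, shown here in the same property style used above (0.1 is the property's default; the right value depends on how many concurrent drivers you want to allow):
yarn.scheduler.capacity.maximum-am-resource-percent = 0.1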

run Spark-Submit on YARN but Imbalance (only 1 node is working)

I am trying to run a Spark app on YARN-CLUSTER (2 nodes), but it seems those 2 nodes are imbalanced because only 1 node is working and the other one is not.
My script:
spark-submit --class org.apache.spark.examples.SparkPi
--master yarn-cluster --deploy-mode cluster --num-executors 2
--driver-memory 1G
--executor-memory 1G
--executor-cores 2 spark-examples-1.6.1-hadoop2.6.0.jar 1000
I can see that one of my nodes is working but the other is not, so this is imbalanced:
Note: on the left is the namenode, and the datanode is on the right...
Any idea?
The complete dataset could be local to one of the nodes, hence it might be trying to honour data locality.
You can try the following config while launching spark-submit
--conf "spark.locality.wait.node=0"
The same worked for me.
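For illustration, the original SparkPi command with that setting added (using the equivalent --master yarn --deploy-mode cluster form) would look roughly like:
spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --num-executors 2 --driver-memory 1G --executor-memory 1G --executor-cores 2 --conf "spark.locality.wait.node=0" spark-examples-1.6.1-hadoop2.6.0.jar 1000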
You are running the job in yarn-cluster mode; in cluster mode the Spark driver runs in the ApplicationMaster on a cluster host.
Try running it in yarn-client mode; in client mode the Spark driver runs on the host where the job is submitted, so you will be able to see the output on the console:
spark-submit --verbose --class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode client \
--num-executors 2 \
--driver-memory 1G \
--executor-memory 1G \
--executor-cores 2 spark-examples-1.6.1-hadoop2.6.0.jar 10
You can check on which nodes the executors were launched from the Spark UI.
The Spark UI gives the details of the nodes where the executors were launched; Executors is the tab to look at.

Hadoop Capacity Scheduler and Spark

If I define CapacityScheduler queues in YARN as explained here:
http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html
how do I make Spark use this?
I want to run Spark jobs, but they should not take up the whole cluster; instead they should execute in a CapacityScheduler queue which has a fixed set of resources allocated to it.
Is that possible, specifically on the Cloudera platform (given that Spark on Cloudera runs on YARN)?
You should configure the CapacityScheduler as needed by editing capacity-scheduler.xml. You also need to set yarn.resourcemanager.scheduler.class in yarn-site.xml to org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler, which is also the default option for current Hadoop versions.
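In yarn-site.xml, in the property style used elsewhere in this post, that is:
yarn.resourcemanager.scheduler.class = org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler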
Then submit the Spark job to a designated queue, e.g.:
$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
--driver-memory 4g \
--executor-memory 2g \
--executor-cores 1 \
--queue thequeue \
lib/spark-examples*.jar \
10
The --queue option indicates the queue you will submit to, which should conform to your CapacityScheduler configuration.
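As a hedged sketch, a matching fixed-capacity queue definition in capacity-scheduler.xml might look like the following (the queue names and percentages are illustrative assumptions, not taken from the question):
yarn.scheduler.capacity.root.queues = default,thequeue
yarn.scheduler.capacity.root.default.capacity = 70
yarn.scheduler.capacity.root.thequeue.capacity = 30
yarn.scheduler.capacity.root.thequeue.maximum-capacity = 30
Setting maximum-capacity equal to capacity keeps jobs submitted with --queue thequeue within their fixed share of the cluster instead of borrowing idle resources.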
