Hadoop Capacity Scheduler and Spark

If I define CapacityScheduler queues in YARN as explained here
http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html
how do I make Spark use them?
I want to run Spark jobs, but they should not take up the whole cluster; instead they should execute in a CapacityScheduler queue that has a fixed set of resources allocated to it.
Is that possible, specifically on the Cloudera platform (given that Spark on Cloudera runs on YARN)?

You should configure the CapacityScheduler to your needs by editing capacity-scheduler.xml. You also need to set yarn.resourcemanager.scheduler.class in yarn-site.xml to org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler, which is also the default option in current Hadoop versions.
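For example, a minimal sketch of a dedicated queue in capacity-scheduler.xml (the queue name matches the example below; the 30/70 capacity split is illustrative, not from the original answer):
<!-- illustrative capacities: thequeue gets a fixed 30% of the cluster -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>default,thequeue</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.capacity</name>
  <value>70</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.thequeue.capacity</name>
  <value>30</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.thequeue.maximum-capacity</name>
  <value>30</value>
</property>
and in yarn-site.xml:
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>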
Then submit the Spark job to the designated queue, e.g.:
$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
--driver-memory 4g \
--executor-memory 2g \
--executor-cores 1 \
--queue thequeue \
lib/spark-examples*.jar \
10
The --queue option indicates the queue you will submit to, which should match your CapacityScheduler configuration.
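If you prefer not to pass --queue on every submission, the same queue can be set through the spark.yarn.queue property instead, either on the command line or in spark-defaults.conf (thequeue is the illustrative queue name from the example above):
--conf spark.yarn.queue=thequeue
# or equivalently in spark-defaults.conf:
spark.yarn.queue thequeue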

Related

How to read all the properties files with given prefix in spark, spring application?

I am developing an application in Spark and Scala and using Spring to read configuration files.
My environment-specific files are available in directories like this:
src/main/resource/DEV
mms_kafka.properties
mms_app.properties
pps_kafka.properties
pps_app.properties
And common files under src/main/resource like below:
src/main/resource
mmsmappings.properties
ppsmappings.properties
Currently, I am doing it like below and it is working fine:
#PropertySource(value = Array("classpath:${ENV}/mms_app.properties","classpath:${ENV}/mms_kafka.properties","classpath:$mmsmapping.properties"), ignoreResourceNotFound=false)
Spark submit command: spark2-submit --master yarn --deploy-mode client --class job.Driver --conf 'spark.driver.extraJavaOptions=-DENV=DEV' --driver-memory 4g --executor-memory 16g --num-executors 4 --executor-cores 4 temp-0.0.1-shaded.jar
But I want to read all the files for a particular prefix (mms/pps), like below. I tried it, but it says the ENV and APP placeholders cannot be resolved.
#PropertySource(value = Array("classpath:${ENV}/${APP}_app.properties","classpath:${ENV}/${APP}_kafka.properties","classpath:${APP}mapping.properties"), ignoreResourceNotFound=false)
Spark submit command: spark2-submit --master yarn --deploy-mode client --class job.Driver --conf 'spark.driver.extraJavaOptions=-DENV=DEV' --conf 'spark.driver.extraJavaOptions=-DAPP=mms' --driver-memory 4g --executor-memory 16g --num-executors 4 --executor-cores 4 temp-0.0.1-shaded.jar
How should I fix this?
I solved this by passing both parameters in a single spark.driver.extraJavaOptions (specifying --conf spark.driver.extraJavaOptions twice, as above, makes the second value replace the first), like below:
spark2-submit --master yarn --deploy-mode client --class com.job.Driver --conf 'spark.driver.extraJavaOptions=-DENV=DEV -DAPP=mms' --driver-memory 4g --executor-memory 16g --num-executors 4 --executor-cores 4 test.jar

Spark on YARN: execute driver without worker

Running Spark on YARN, cluster mode.
3 data nodes with YARN
YARN => 32 vCores, 32 GB RAM
I am submitting the Spark program like this:
spark-submit \
--class com.blablacar.insights.etl.SparkETL \
--name ${JOB_NAME} \
--master yarn \
--num-executors 1 \
--deploy-mode cluster \
--driver-memory 512m \
--driver-cores 1 \
--executor-memory 2g \
--executor-cores 20 \
toto.jar json
I can see 2 jobs running fine on 2 nodes. But I can also see 2 other jobs with just a driver container!
Is it possible to not run the driver if there are no resources for the workers?
Actually, there is a setting to limit the resources given to Application Masters (in the case of Spark, this is the driver):
yarn.scheduler.capacity.maximum-am-resource-percent
From http://maprdocs.mapr.com/home/AdministratorGuide/Hadoop2.xCapacityScheduler-RunningPendingApps.html:
Maximum percent of resources in the cluster that can be used to run
application masters - controls the number of concurrent active
applications.
This way, YARN will not take all of the cluster's resources for Spark drivers, and will keep resources for workers. Youpi!
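As a rough sketch, the property goes in capacity-scheduler.xml and can also be set per queue via yarn.scheduler.capacity.<queue-path>.maximum-am-resource-percent (the 0.1 value below is illustrative, not from the original answer):
<property>
  <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
  <!-- illustrative: at most 10% of cluster resources may go to ApplicationMasters (Spark drivers) -->
  <value>0.1</value>
</property>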

run Spark-Submit on YARN but Imbalance (only 1 node is working)

I am trying to run a Spark app on a YARN cluster (2 nodes), but those 2 nodes seem imbalanced because only one node is working and the other is not.
My script:
spark-submit --class org.apache.spark.examples.SparkPi
--master yarn-cluster --deploy-mode cluster --num-executors 2
--driver-memory 1G
--executor-memory 1G
--executor-cores 2 spark-examples-1.6.1-hadoop2.6.0.jar 1000
I see one of my nodes is working but the other is not, so this is imbalanced:
Note: the namenode is on the left, and the datanode is on the right...
Any idea?
The complete dataset could be local to one of the nodes, hence it might be trying to honour data locality.
You can try the following config while launching spark-submit
--conf "spark.locality.wait.node=0"
The same worked for me.
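For reference, a sketch of where that flag fits in the submit command from the question (only the extra --conf is added; the rest is as posted):
# same command as in the question, with spark.locality.wait.node=0 added before the application jar
spark-submit --class org.apache.spark.examples.SparkPi \
  --master yarn-cluster --deploy-mode cluster --num-executors 2 \
  --driver-memory 1G \
  --executor-memory 1G \
  --conf "spark.locality.wait.node=0" \
  --executor-cores 2 spark-examples-1.6.1-hadoop2.6.0.jar 1000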
You are running the job in yarn-cluster mode; in cluster mode the Spark driver runs in the ApplicationMaster on a cluster host.
Try running it in yarn-client mode; in client mode the Spark driver runs on the host where the job is submitted, so you will be able to see the output on the console:
spark-submit --verbose --class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode client \
--num-executors 2 \
--driver-memory 1G \
--executor-memory 1G \
--executor-cores 2 spark-examples-1.6.1-hadoop2.6.0.jar 10
You can check which nodes the executors are launched on from the Spark UI; the Executors tab gives the details of the nodes where execution takes place.

Can't get pyspark job to run on all nodes of hadoop cluster

Summary: I can't get my python-spark job to run on all nodes of my hadoop cluster.
I've installed the Spark build for Hadoop, 'spark-1.5.2-bin-hadoop2.6'. When launching a Java Spark job, the load gets distributed over all nodes; when launching a Python Spark job, only one node takes the load.
Setup:
hdfs and yarn configured for 4 nodes: nk01 (namenode), nk02, nk03, nk04, running on xen virtual servers
versions: jdk1.8.0_66, hadoop-2.7.1, spark-1.5.2-bin-hadoop2.6
hadoop installed on all 4 nodes
spark only installed on nk01
I copied a bunch of Gutenberg files (thank you, Johannes!) onto HDFS, and tried doing a wordcount using Java and Python on a subset of the files (the files that start with an 'e'):
Python:
Using a homebrew python script for doing wordcount:
/opt/spark/bin/spark-submit wordcount.py --master yarn-cluster \
--num-executors 4 --executor-cores 1
The Python code assigns 4 partitions:
tt=sc.textFile('/user/me/gutenberg/text/e*.txt',4)
Load on the 4 nodes during 60 seconds:
Java:
Using the JavaWordCount found in the spark distribution:
/opt/spark/bin/spark-submit --class JavaWordCount --master yarn-cluster \
--num-executors 4 jwc.jar '/user/me/gutenberg/text/e*.txt'
Conclusion: the Java version distributes its load across the cluster; the Python version just runs on 1 node.
Question: how do I get the Python version to also distribute the load across all nodes?
The python-program name was indeed in the wrong position, as suggested by Shawn Guo. It should have been run this way:
/opt/spark/bin/spark-submit --master yarn-cluster --num-executors 4
--executor-cores 1 wordcount.py
That gives this load on the nodes:
Spark-submit
./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]
Here is a difference from the Scala/Java submit in parameter position:
For Python applications, simply pass a .py file in the place of
application-jar instead of a JAR, and add Python .zip, .egg or .py files to the search path with --py-files.
You should use the command below instead:
/opt/spark/bin/spark-submit --master yarn-cluster wordcount.py
--num-executors 4 --executor-cores 1

Spark not able to run in yarn cluster mode

I am trying to execute my code on a YARN cluster.
The command I am using is:
$SPARK_HOME/bin/spark-submit \
--class "MyApp" \
target/scala-2.10/my-application_2.10-1.0.jar \
--master yarn-cluster \
--num-executors 3 \
--driver-memory 6g \
--executor-memory 7g \
<outputPath>
But I can see that this program is running only on localhost.
It is able to read the file from HDFS.
I have tried this in standalone mode and it works fine.
Please suggest where it is going wrong.
I am using Hadoop 2.4 with Spark 1.1.0. I was able to get it running in cluster mode.
To solve it, we simply removed all the configuration files from all the slave nodes. Earlier we were running in standalone mode, and that led to duplicated configuration on all the slaves. Once that was removed, it ran as expected in cluster mode, although performance is not up to the standalone mode.
Thanks.
