Spark on YARN: execute driver without worker

Running Spark on YARN, cluster mode.
3 data nodes with YARN
YARN => 32 vCores, 32 GB RAM
I am submitting my Spark program like this:
spark-submit \
--class com.blablacar.insights.etl.SparkETL \
--name ${JOB_NAME} \
--master yarn \
--num-executors 1 \
--deploy-mode cluster \
--driver-memory 512m \
--driver-cores 1 \
--executor-memory 2g \
--executor-cores 20 \
toto.jar json
I can see 2 jobs running fine on 2 nodes, but I can also see 2 other jobs with just a driver container!
Is it possible not to run the driver if there are no resources left for the workers?

Actually, there is a setting that limits the resources available to the "Application Master" (in the case of Spark, this is the driver):
yarn.scheduler.capacity.maximum-am-resource-percent
From http://maprdocs.mapr.com/home/AdministratorGuide/Hadoop2.xCapacityScheduler-RunningPendingApps.html:
Maximum percent of resources in the cluster that can be used to run
application masters - controls the number of concurrent active
applications.
This way, YARN will not spend all of its resources on Spark drivers, and will keep resources for workers. Hooray!
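For reference, a minimal capacity-scheduler.xml fragment setting this property (the value 0.5 is illustrative; the YARN default is 0.1):
<!-- capacity-scheduler.xml: cap the share of cluster resources that
     Application Masters (Spark drivers in cluster mode) may use -->
<property>
  <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
  <value>0.5</value>
</property>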

Related

run Spark-Submit on YARN but Imbalance (only 1 node is working)

I am trying to run Spark apps on YARN-CLUSTER (2 nodes), but the 2 nodes seem imbalanced: only 1 node is working while the other is not.
My Script :
spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn-cluster --deploy-mode cluster --num-executors 2 \
--driver-memory 1G \
--executor-memory 1G \
--executor-cores 2 spark-examples-1.6.1-hadoop2.6.0.jar 1000
I see one of my nodes is working but the other is not, so the load is imbalanced:
[Screenshot: the namenode is on the left, the datanode on the right]
Any ideas?
The complete dataset could be local to one of the nodes, so Spark might be trying to honour data locality.
You can try the following config when launching spark-submit:
--conf "spark.locality.wait.node=0"
The same worked for me.
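For example, applied to the script above (a sketch; everything except the added --conf is taken from the question):
spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn-cluster \
--num-executors 2 \
--driver-memory 1G \
--executor-memory 1G \
--executor-cores 2 \
--conf "spark.locality.wait.node=0" \
spark-examples-1.6.1-hadoop2.6.0.jar 1000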
You are running the job in yarn-cluster mode; in cluster mode the Spark driver runs in the ApplicationMaster on a cluster host.
Try running it in yarn-client mode; in client mode the Spark driver runs on the host where the job is submitted, so you will be able to see the output on the console:
spark-submit --verbose --class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode client \
--num-executors 2 \
--driver-memory 1G \
--executor-memory 1G \
--executor-cores 2 spark-examples-1.6.1-hadoop2.6.0.jar 10
You can check on which nodes the executors were launched from the Spark UI: the Executors tab lists the nodes where execution takes place.

Spark Job Keep on Running

I've submitted my Spark job on the ambari-server using the following command:
./spark-submit --class customer.core.classname --master yarn --num-executors 2 --driver-memory 2g --executor-memory 2g --executor-cores 1 /home/hdfs/Test/classname-0.0.1-SNAPSHOT-SNAPSHOT.jar newdata host:6667
and it is working fine...
But how can I make it keep running? Even if I close the command prompt or the session is terminated, the job must keep on running.
Any help is appreciated.
You can achieve this in a couple of ways:
1) Run the spark-submit driver process in the background using nohup, e.g.:
nohup ./spark-submit --class customer.core.classname \
--master yarn --num-executors 2 \
--driver-memory 2g --executor-memory 2g --executor-cores 1 \
/home/hdfs/Test/classname-0.0.1-SNAPSHOT-SNAPSHOT.jar \
newdata host:6667 &
2) Run in cluster deploy mode so that the driver process runs on a different node, as in the sketch below.
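For example, a sketch of option 2 reusing the command from the question:
./spark-submit --class customer.core.classname \
--master yarn --deploy-mode cluster \
--num-executors 2 --driver-memory 2g --executor-memory 2g --executor-cores 1 \
/home/hdfs/Test/classname-0.0.1-SNAPSHOT-SNAPSHOT.jar \
newdata host:6667
In cluster mode the driver runs inside the YARN ApplicationMaster on a cluster node, so it no longer depends on your local shell session.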
I think this question is more about the shell than Spark. To keep an application running even after closing the shell, you should add & at the end of your command. So your spark-submit command will be (just add the & to the end):
./spark-submit --class customer.core.classname --master yarn --num-executors 2 --driver-memory 2g --executor-memory 2g --executor-cores 1 /home/hdfs/Test/classname-0.0.1-SNAPSHOT-SNAPSHOT.jar newdata host:6667 &
[1] 28299
You will still get the logs and output messages unless you redirect them.
I hope I understand the question. In general, if you want a process to keep running, you can run it as a background process. In your case the job will continue running until you specifically kill it using yarn application -kill, so even if you kill the spark-submit process it will continue to run, since YARN manages it after submission.
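For reference, the corresponding YARN CLI calls (the application ID shown is illustrative):
# list running applications to find the application ID
yarn application -list
# kill a specific application
yarn application -kill application_1234567890123_0001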
Warning: I didn't test this. But a better way to do what you describe is probably to use the following settings:
--deploy-mode cluster \
--conf spark.yarn.submit.waitAppCompletion=false
Found here:
How to exit spark-submit after the submission
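Put together (untested, as the answer warns; a sketch reusing the command from the question), the submission would look like:
./spark-submit --class customer.core.classname \
--master yarn --deploy-mode cluster \
--conf spark.yarn.submit.waitAppCompletion=false \
--num-executors 2 --driver-memory 2g --executor-memory 2g --executor-cores 1 \
/home/hdfs/Test/classname-0.0.1-SNAPSHOT-SNAPSHOT.jar newdata host:6667
With waitAppCompletion=false, spark-submit exits as soon as the application is accepted by YARN instead of waiting for it to finish, so closing the shell afterwards does not affect the job.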

Hadoop Capacity Scheduler and Spark

If I define CapacityScheduler queues in YARN as explained here:
http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html
how do I make Spark use them?
I want to run Spark jobs, but they should not take up the whole cluster; instead they should execute in a CapacityScheduler queue that has a fixed set of resources allocated to it.
Is that possible, specifically on the Cloudera platform (given that Spark on Cloudera runs on YARN)?
First, configure the CapacityScheduler to your needs by editing capacity-scheduler.xml. You also need to set yarn.resourcemanager.scheduler.class in yarn-site.xml to org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler (which is already the default in current Hadoop versions); a sketch follows below.
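For illustration, a minimal capacity-scheduler.xml sketch defining a queue named thequeue (the queue name and capacity values are assumptions for this example):
<!-- declare the queues under root -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>default,thequeue</value>
</property>
<!-- give thequeue a fixed 30% share of cluster resources -->
<property>
  <name>yarn.scheduler.capacity.root.thequeue.capacity</name>
  <value>30</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.capacity</name>
  <value>70</value>
</property>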
Then submit the Spark job to the designated queue, e.g.:
$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
--driver-memory 4g \
--executor-memory 2g \
--executor-cores 1 \
--queue thequeue \
lib/spark-examples*.jar \
10
The --queue option indicates the queue you are submitting to, and it must match your CapacityScheduler configuration.

Can't get pyspark job to run on all nodes of hadoop cluster

Summary: I can't get my Python Spark job to run on all nodes of my Hadoop cluster.
I've installed Spark for Hadoop ('spark-1.5.2-bin-hadoop2.6'). When launching a Java Spark job, the load gets distributed over all nodes; when launching a Python Spark job, only one node takes the load.
Setup:
hdfs and yarn configured for 4 nodes: nk01 (namenode), nk02, nk03, nk04, running on xen virtual servers
versions: jdk1.8.0_66, hadoop-2.7.1, spark-1.5.2-bin-hadoop2.6
hadoop installed on all 4 nodes
spark installed only on nk01
I copied a bunch of Gutenberg files (thank you, Johannes!) onto HDFS, and tried doing a wordcount using Java and Python on a subset of the files (the files that start with an 'e'):
Python:
Using a homebrew python script for doing wordcount:
/opt/spark/bin/spark-submit wordcount.py --master yarn-cluster \
--num-executors 4 --executor-cores 1
The Python code requests 4 partitions:
tt=sc.textFile('/user/me/gutenberg/text/e*.txt',4)
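For context, a minimal wordcount.py along those lines (a sketch; only the textFile line is from the question, the rest is an assumed standard wordcount):
from pyspark import SparkContext

sc = SparkContext(appName='wordcount')
# request 4 partitions, as in the question
tt = sc.textFile('/user/me/gutenberg/text/e*.txt', 4)
counts = (tt.flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))
print(counts.count())  # force evaluation of the job
sc.stop()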
Load on the 4 nodes during 60 seconds: [load graph]
Java:
Using the JavaWordCount found in the spark distribution:
/opt/spark/bin/spark-submit --class JavaWordCount --master yarn-cluster \
--num-executors 4 jwc.jar '/user/me/gutenberg/text/e*.txt'
Conclusion: the Java version distributes its load across the cluster; the Python version runs on just 1 node.
Question: how do I get the Python version to distribute the load across all nodes as well?
The Python program name was indeed in the wrong position, as suggested by Shawn Guo. It should have been run this way:
/opt/spark/bin/spark-submit --master yarn-cluster --num-executors 4 \
--executor-cores 1 wordcount.py
That gives this load on the nodes: [load graph]
Spark-submit
./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]
There is a difference from Scala/Java submission in the parameter positions. From the Spark docs:
For Python applications, simply pass a .py file in the place of
application-jar instead of a JAR, and add Python .zip, .egg or .py files to the search path with --py-files.
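For instance, if the job needed extra Python modules, a sketch using --py-files (helpers.zip is a hypothetical archive, not from the question):
/opt/spark/bin/spark-submit --master yarn-cluster \
--num-executors 4 --executor-cores 1 \
--py-files helpers.zip \
wordcount.py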
You should use the command below instead, with the options placed before the .py file:
/opt/spark/bin/spark-submit --master yarn-cluster \
--num-executors 4 --executor-cores 1 wordcount.py

Spark not able to run in yarn cluster mode

I am trying to execute my code on a YARN cluster.
The command I am using is:
$SPARK_HOME/bin/spark-submit \
--class "MyApp" \
target/scala-2.10/my-application_2.10-1.0.jar \
--master yarn-cluster \
--num-executors 3 \
--driver-memory 6g \
--executor-memory 7g \
<outputPath>
But I can see that this program is running only on localhost.
It is able to read the file from HDFS.
I have tried this in standalone mode and it works fine.
Please suggest where it is going wrong.
I am using Hadoop 2.4 with Spark 1.1.0, and I was able to get it running in cluster mode.
To solve it, we simply removed all the configuration files from all the slave nodes. Earlier we were running in standalone mode, and that led to duplicated configuration on all the slaves. Once that was done, it ran as expected in cluster mode, although performance is not up to standalone mode.
Thanks.
