Spark not able to run in yarn cluster mode - hadoop

I am trying to execute my code on a YARN cluster. The command I am using is:
$SPARK_HOME/bin/spark-submit \
--class "MyApp" \
target/scala-2.10/my-application_2.10-1.0.jar \
--master yarn-cluster \
--num-executors 3 \
--driver-memory 6g \
--executor-memory 7g \
<outputPath>
But I can see that this program is running only on localhost.
It is able to read the file from HDFS.
I have tried this in standalone mode and it works fine.
Please suggest where it is going wrong.
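One thing worth checking: spark-submit treats everything that appears after the application JAR as arguments to the application itself, so in the command above --master, --num-executors and the memory flags are being passed to MyApp rather than to Spark, which would explain why it runs locally. Moving the JAR after the options (paths as in the original command) should submit it to YARN as intended:

```shell
$SPARK_HOME/bin/spark-submit \
  --class "MyApp" \
  --master yarn-cluster \
  --num-executors 3 \
  --driver-memory 6g \
  --executor-memory 7g \
  target/scala-2.10/my-application_2.10-1.0.jar \
  <outputPath>
```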

I am using Hadoop 2.4 with Spark 1.1.0. I was able to get it running in cluster mode.
To solve it, we simply removed all the configuration files from the slave nodes. Earlier we were running in standalone mode, which led to duplicated configuration on all the slaves. Once that was done, it ran as expected in cluster mode, although performance is not on par with standalone mode.
Thanks.

Related

run Spark-Submit on YARN but Imbalance (only 1 node is working)

I try to run Spark apps on a YARN cluster (2 nodes), but the two nodes seem imbalanced: only one node is working while the other is not.
My script:
spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn-cluster --deploy-mode cluster --num-executors 2 \
--driver-memory 1G \
--executor-memory 1G \
--executor-cores 2 spark-examples-1.6.1-hadoop2.6.0.jar 1000
I can see that one of my nodes is working but the other is not, so this is imbalanced:
Note: on the left is the namenode, and the datanode is on the right...
Any idea?
The complete dataset could be local to one of the nodes, so Spark might be trying to honour data locality.
You can try the following configuration when launching spark-submit:
--conf "spark.locality.wait.node=0"
The same worked for me.
You are running the job in yarn-cluster mode; in cluster mode the Spark driver runs in the ApplicationMaster on a cluster host.
Try running it in yarn-client mode; in client mode the Spark driver runs on the host where the job is submitted, so you will be able to see the output on the console:
spark-submit --verbose --class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode client \
--num-executors 2 \
--driver-memory 1G \
--executor-memory 1G \
--executor-cores 2 spark-examples-1.6.1-hadoop2.6.0.jar 10
You can check which nodes the executors were launched on from the Spark UI; the Executors tab lists the nodes where execution takes place.
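Independently of the Spark UI, you can also ask YARN itself how containers are spread across the cluster (assuming the yarn CLI is available on the submitting host):

```shell
# list all NodeManagers with their running-container counts and resource usage
yarn node -list -all

# list running applications to find the one whose distribution you want to inspect
yarn application -list
```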

Spark Job Keep on Running

I've submitted my Spark job on the ambari-server using the following command:
./spark-submit --class customer.core.classname --master yarn --num-executors 2 --driver-memory 2g --executor-memory 2g --executor-cores 1 /home/hdfs/Test/classname-0.0.1-SNAPSHOT-SNAPSHOT.jar newdata host:6667
and it is working fine.
But how can I make it keep running, so that even if we close the command prompt or the session ends, the job keeps running?
Any help is appreciated.
You can achieve this in a couple of ways:
1) Run the spark-submit driver process in the background using nohup, e.g.:
nohup ./spark-submit --class customer.core.classname \
--master yarn --num-executors 2 \
--driver-memory 2g --executor-memory 2g --executor-cores 1 \
/home/hdfs/Test/classname-0.0.1-SNAPSHOT-SNAPSHOT.jar \
newdata host:6667 &
2) Run with deploy mode cluster, so that the driver process runs on a different node.
I think this question is more about the shell than about Spark.
To keep an application running after closing the shell, you should add & at the end of your command (combine it with nohup if your shell sends SIGHUP to background jobs on exit). So your spark-submit command becomes:
./spark-submit --class customer.core.classname --master yarn --num-executors 2 --driver-memory 2g --executor-memory 2g --executor-cores 1 /home/hdfs/Test/classname-0.0.1-SNAPSHOT-SNAPSHOT.jar newdata host:6667 &
[1] 28299
You still get the logs and output messages, unless you redirect them.
I hope I understand the question. In general, if you want a process to keep running, you can run it in the background. In your case the job will continue running until you explicitly kill it with yarn application -kill; even if you kill the spark-submit process, the job continues to run, since YARN manages it after submission.
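Since YARN owns the application after submission, stopping it later goes through YARN rather than through the spark-submit process; the application ID below is a made-up example:

```shell
# find the application ID of the running job
yarn application -list

# stop it explicitly (example ID; substitute your own)
yarn application -kill application_1463093076870_0001
```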
Warning: I didn't test this, but the better way to do what you describe is probably to use the following settings:
--deploy-mode cluster \
--conf spark.yarn.submit.waitAppCompletion=false
Found here:
How to exit spark-submit after the submission

Hadoop Capacity Scheduler and Spark

If I define CapacityScheduler queues in YARN as explained here:
http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html
how do I make Spark use them?
I want to run Spark jobs, but they should not take up the whole cluster; instead, they should execute on a CapacityScheduler queue that has a fixed set of resources allocated to it.
Is that possible, specifically on the Cloudera platform (given that Spark on Cloudera runs on YARN)?
You should configure the CapacityScheduler as needed by editing capacity-scheduler.xml. You also need to set yarn.resourcemanager.scheduler.class in yarn-site.xml to org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler, which is also the default option for current Hadoop versions.
Then submit your Spark job to a designated queue, e.g.:
$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
--driver-memory 4g \
--executor-memory 2g \
--executor-cores 1 \
--queue thequeue \
lib/spark-examples*.jar \
10
The --queue option indicates the queue you are submitting to, which must match a queue in your CapacityScheduler configuration.
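As a sketch of the scheduler side, a fixed-capacity queue (here a hypothetical thequeue with 30% of cluster resources) would be defined in capacity-scheduler.xml roughly like this:

```xml
<configuration>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>default,thequeue</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.thequeue.capacity</name>
    <value>30</value>
  </property>
  <property>
    <!-- cap the queue so it cannot grow beyond its share even when the cluster is idle -->
    <name>yarn.scheduler.capacity.root.thequeue.maximum-capacity</name>
    <value>30</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.capacity</name>
    <value>70</value>
  </property>
</configuration>
```

After editing the file, the ResourceManager can pick up queue changes without a restart via yarn rmadmin -refreshQueues.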

Can't get pyspark job to run on all nodes of hadoop cluster

Summary: I can't get my Python Spark job to run on all nodes of my Hadoop cluster.
I've installed Spark for Hadoop ('spark-1.5.2-bin-hadoop2.6'). When launching a Java Spark job, the load gets distributed over all nodes; when launching a Python Spark job, only one node takes the load.
Setup:
hdfs and yarn configured for 4 nodes: nk01 (namenode), nk02, nk03, nk04, running on xen virtual servers
versions: jdk1.8.0_66, hadoop-2.7.1, spark-1.5.2-bin-hadoop2.6
hadoop installed on all 4 nodes
spark installed only on nk01
I copied a bunch of Gutenberg files (thank you, Johannes!) onto HDFS, and tried doing a wordcount using Java and Python on a subset of the files (those that start with an 'e'):
Python:
Using a homebrew python script for doing wordcount:
/opt/spark/bin/spark-submit wordcount.py --master yarn-cluster \
--num-executors 4 --executor-cores 1
The Python code assigns 4 partitions:
tt=sc.textFile('/user/me/gutenberg/text/e*.txt',4)
Load on the 4 nodes during 60 seconds:
Java:
Using the JavaWordCount found in the spark distribution:
/opt/spark/bin/spark-submit --class JavaWordCount --master yarn-cluster \
--num-executors 4 jwc.jar '/user/me/gutenberg/text/e*.txt'
Conclusion: the Java version distributes its load across the cluster; the Python version runs on just one node.
Question: how do I get the Python version to also distribute the load across all nodes?
The python-program name was indeed in the wrong position, as suggested by Shawn Guo. It should have been run this way:
/opt/spark/bin/spark-submit --master yarn-cluster --num-executors 4 \
--executor-cores 1 wordcount.py
That gives this load on the nodes:
Spark-submit
./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]
There is a difference from Scala/Java submission in the position of the parameters.
For Python applications, simply pass a .py file in the place of application-jar instead of a JAR, and add Python .zip, .egg or .py files to the search path with --py-files.
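Put together, a Python submission with extra modules would look roughly like this (deps.zip is a hypothetical archive of helper modules, not from the original question):

```shell
/opt/spark/bin/spark-submit --master yarn-cluster \
  --num-executors 4 --executor-cores 1 \
  --py-files deps.zip \
  wordcount.py
```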
You should use the command below instead, with the options placed before the script (anything after wordcount.py is passed to the script itself):
/opt/spark/bin/spark-submit --master yarn-cluster \
--num-executors 4 --executor-cores 1 wordcount.py

Spark-submit not working when application jar is in hdfs

I'm trying to run a Spark application using bin/spark-submit. When I reference my application JAR on the local filesystem, it works. However, when I copy my application JAR to a directory in HDFS, I get the following exception:
Warning: Skip remote jar hdfs://localhost:9000/user/hdfs/jars/simple-project-1.0-SNAPSHOT.jar.
java.lang.ClassNotFoundException: com.example.SimpleApp
Here's the command:
$ ./bin/spark-submit --class com.example.SimpleApp --master local hdfs://localhost:9000/user/hdfs/jars/simple-project-1.0-SNAPSHOT.jar
I'm using hadoop version 2.6.0, spark version 1.2.1
The only way it worked for me was by using
--master yarn-cluster
To make a JAR stored on HDFS accessible to spark-submit, you have to run the job in cluster mode.
$SPARK_HOME/bin/spark-submit \
--deploy-mode cluster \
--class <main_class> \
--master yarn-cluster \
hdfs://myhost:8020/user/root/myjar.jar
Also, there is a Spark JIRA open for client mode, which is not supported yet:
SPARK-10643: Support HDFS application download in client mode spark submit
There is a workaround: you can mount the HDFS directory (the one containing your application JAR) as a local directory.
I did the same (with Azure Blob Storage, but it should be similar for HDFS).
Example command for Azure wasb:
sudo mount -t cifs //{storageAccountName}.file.core.windows.net/{directoryName} {local directory path} -o vers=3.0,username={storageAccountName},password={storageAccountKey},dir_mode=0777,file_mode=0777
Now, in your spark-submit command, you provide the path from the command above:
$ ./bin/spark-submit --class com.example.SimpleApp --master local {local directory path}/simple-project-1.0-SNAPSHOT.jar
For me it's working with Hadoop 3.3.1 & Spark 3.2.1; I am able to read the file from HDFS:
spark-submit --master spark://kssr-virtual-machine:7077 --deploy-mode client --executor-memory 1g hdfs://localhost:9000/user/wordcount.py
Yes, it has to be a local file. I think that's simply the answer.
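If cluster mode is not an option, another simple workaround is to copy the JAR out of HDFS onto the local filesystem first and submit the local copy (paths as in the original question):

```shell
# pull the JAR down from HDFS, then point spark-submit at the local copy
hdfs dfs -get hdfs://localhost:9000/user/hdfs/jars/simple-project-1.0-SNAPSHOT.jar /tmp/
./bin/spark-submit --class com.example.SimpleApp --master local /tmp/simple-project-1.0-SNAPSHOT.jar
```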