Can't see Yarn Job when doing Spark-Submit on Yarn Cluster - hadoop

I am using spark-submit for my job with the command below:
spark-submit script_test.py --master yarn --deploy-mode cluster
spark-submit script_test.py --master yarn-cluster --deploy-mode cluster
The job works fine. I can see it in the Spark History Server UI. However, I cannot see it in the ResourceManager UI (YARN).
I have the feeling that my job is not being sent to the cluster but is running on only one node. However, I see nothing wrong with the way I use the spark-submit command.
Am I wrong? How can I check this, or send the job to the YARN cluster?

Using --master yarn means that somewhere you have configured yarn-site.xml with hosts, ports, and so on.
Maybe the machine from which you run spark-submit doesn't know where the YARN ResourceManager is.
Check your Hadoop/YARN/Spark config files, especially yarn-site.xml, to verify whether the ResourceManager host is correct.
Those files live in different folders depending on which Hadoop distribution you are using. In HDP I think they are in /etc/hadoop/conf.
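For reference, the relevant entries in yarn-site.xml look roughly like this (the property names are the standard YARN ones; the hostname is a placeholder you would replace with your own ResourceManager):
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>your-resourcemanager-host</value> <!-- placeholder -->
</property>
<property>
  <name>yarn.resourcemanager.address</name>
  <value>your-resourcemanager-host:8032</value> <!-- placeholder; 8032 is the default RM port -->
</property>
If the host there is missing or wrong, the submission will not reach the ResourceManager you expect, and the job won't show up in its UI.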
Hope it helps.

Related

how to switch between cluster types in Apache Spark

I'm trying to switch the cluster manager from standalone to YARN in the Apache Spark installation I set up for learning.
I read the following thread to understand which cluster type should be chosen.
However, I'd like to know the steps/syntax to change the cluster type.
Ex: from Standalone to YARN or from YARN to Standalone.
In Spark, the --master option lets you run your script on a YARN cluster, on a standalone cluster, or locally.
To run the application in local mode, use this with the spark-submit command:
--master local[*]
or, for a standalone cluster:
--master spark://192.168.10.01:7077 \
--deploy-mode cluster
To run on a YARN cluster:
--master yarn \
--deploy-mode cluster
For more information kindly visit this link.
https://spark.apache.org/docs/latest/submitting-applications.html
If you are not submitting through the command line, you can set the master directly on the SparkConf object:
sparkConf.setMaster("spark://host:port") for a standalone cluster
or
sparkConf.setMaster("local[*]") for local mode
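As a minimal PySpark sketch (the app name is just illustrative; use whichever master value from above fits your setup):
from pyspark import SparkConf, SparkContext

# Placeholder master: "local[*]", "spark://<host>:7077", or "yarn"
conf = SparkConf().setAppName("my_app").setMaster("local[*]")
sc = SparkContext(conf=conf)

print(sc.parallelize(range(10)).sum())  # quick check that the context works
sc.stop()
Note that a master set in code takes precedence over the --master flag passed to spark-submit, so set it in code only if you really want to fix it there.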

Spark submit to remote yarn

I have two Cloudera Hadoop clusters (prod and dev) and one client machine. This client machine is configured as a gateway node to the prod cluster.
From this machine I am able to submit a Spark job to my prod cluster using
spark-submit --master yarn job_script.py
Now I would like to submit the same job to my dev cluster from this client machine.
I tried using
spark-submit --master yarn://<dev_resource_manager_ip>:8032 job_script.py
But this doesn't seem to work, and my job still gets submitted to the prod cluster. How can I tell spark-submit to submit the job to the dev cluster's ResourceManager instead of the prod cluster's?
Create a directory with all the Hadoop XML config files for the dev cluster and override the HADOOP_CONF_DIR environment variable before running spark-submit, as sketched below.
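A rough sketch of what that looks like (/etc/hadoop/conf.dev is an assumed location for the dev cluster's copies of core-site.xml, hdfs-site.xml and yarn-site.xml):
export HADOOP_CONF_DIR=/etc/hadoop/conf.dev   # assumed path to the dev cluster's config files
spark-submit --master yarn job_script.py
Spark picks up the ResourceManager address from the yarn-site.xml in that directory, so whichever cluster's configs HADOOP_CONF_DIR points to is the cluster the job is submitted to.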

which mode should we use when running spark on yarn?

I know there are two modes for running Spark applications on a YARN cluster.
In yarn-cluster mode, the driver runs in the Application Master (inside the YARN cluster). In yarn-client mode, it runs on the client node from which the job is submitted.
I wanted to know what the advantages of using one mode over the other are, and which mode should be used under which circumstances.
There are two deploy modes that can be used to launch Spark applications on YARN.
Yarn-cluster: the Spark driver runs within the Hadoop cluster as a YARN Application Master and spins up Spark executors within YARN containers. This allows Spark applications to run within the Hadoop cluster and be completely decoupled from the workbench, which is used only for job submission. An example:
[terminal~]:cd $SPARK_HOME
[terminal~]:./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn \
--deploy-mode cluster --num-executors 3 --driver-memory 1g --executor-memory 2g \
--executor-cores 1 --queue thequeue $SPARK_HOME/examples/target/spark-examples_*-1.2.1.jar
Note that in the example above, the --queue option is used to specify the Hadoop queue to which the application is submitted.
Yarn-client: the Spark driver runs on the workbench itself, with the Application Master operating in a reduced role: it only requests resources from YARN so that the Spark executors run inside the Hadoop cluster within YARN containers. This provides an interactive environment with distributed operations. Here’s an example of invoking Spark in this mode while ensuring it picks up the Hadoop LZO codec:
[terminal~]:cd $SPARK_HOME
[terminal~]:bin/spark-shell --master yarn --deploy-mode client --queue research \
--driver-memory 512M --driver-class-path /opt/hadoop/share/hadoop/mapreduce/lib/hadoop-lzo-0.4.18-201409171947.jar
So when you want an interactive environment for your job, use client mode; yarn-client mode accepts commands from the spark-shell.
When you want to decouple your job from the Spark workbench, use yarn-cluster mode.

Running a Spark job with spark-submit across the whole cluster

I have recently set up a Spark cluster on Amazon EMR with 1 master and 2 slaves.
I can run pyspark, and submit jobs with spark-submit.
However, when I create a standalone job, like job.py, I create a SparkContext, like so:
sc=SparkContext("local", "App Name")
This doesn't seem right, but I'm not sure what to put there.
When I submit the job, I am sure it is not utilizing the whole cluster.
If I want to run a job against my entire cluster, say 4 processes per slave, what do I have to
a.) pass as arguments to spark-submit
b.) pass as arguments to SparkContext() in the script itself.
You can create the Spark context using
conf = SparkConf().setAppName(appName)
sc = SparkContext(conf=conf)
and you have to submit the program with spark-submit, using the following command for a Spark standalone cluster:
./bin/spark-submit --master spark://<sparkMasterIP>:7077 code.py
For a Mesos cluster:
./bin/spark-submit --master mesos://207.184.161.138:7077 code.py
For a YARN cluster:
./bin/spark-submit --master yarn --deploy-mode cluster code.py
When the master is yarn, the configuration is read from HADOOP_CONF_DIR (Spark picks up the ResourceManager address from the Hadoop config files there).
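Putting that together, a minimal job.py sketch for the YARN case might look like this (the app name and the parallelism are just illustrative); the key point is not to hard-code "local" in the script, so the --master value passed to spark-submit takes effect:
from pyspark import SparkConf, SparkContext

# Do not call setMaster() here; let spark-submit's --master decide where the job runs.
conf = SparkConf().setAppName("App Name")
sc = SparkContext(conf=conf)

rdd = sc.parallelize(range(1000), numSlices=8)  # illustrative number of partitions
print(rdd.map(lambda x: x * x).sum())
sc.stop()
On YARN you can then size the job from the command line, for example with spark-submit's --num-executors and --executor-cores options, rather than inside the script.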

Running spark-submit with --master yarn-cluster: issue with spark-assembly

I am running Spark 1.1.0 on HDP 2.1, on a kerberized cluster. I can successfully run spark-submit using --master yarn-client and the results are properly written to HDFS; however, the job doesn't show up on the Hadoop All Applications page. I want to run spark-submit using --master yarn-cluster, but I keep getting this error:
appDiagnostics: Application application_1417686359838_0012 failed 2 times due to AM Container
for appattempt_1417686359838_0012_000002 exited with exitCode: -1000 due to: File does not
exist: hdfs://<HOST>/user/<username>/.sparkStaging/application_<numbers>_<more numbers>/spark-assembly-1.1.0-hadoop2.4.0.jar
.Failing this attempt.. Failing the application.
I've provisioned my account with access to the cluster. I've configured yarn-site.xml. I've cleared .sparkStaging. I've tried including --jars [path to my spark assembly in spark/lib]. I've found this question that is very similar, yet unanswered. I can't tell if this is an HDP 2.1 issue, a Spark 1.1.0 issue, the kerberized cluster, the configuration, or something else. Any help would be much appreciated.
This is probably because you left sparkConf.setMaster("local[n]") in the code. Properties set directly on the SparkConf in code take precedence over the flags passed to spark-submit, so a hard-coded local master can keep the application from actually running in yarn-cluster mode.
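As a sketch of the fix, assuming a PySpark application that builds its configuration roughly like this, drop the hard-coded master and let spark-submit supply it:
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("my_app")
# conf.setMaster("local[2]")  # remove any line like this; the --master yarn-cluster flag should take effect instead
sc = SparkContext(conf=conf)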
