Spark submit to remote yarn - hadoop

I have two Cloudera Hadoop clusters (prod and dev) and one client machine. This client machine is configured as a gateway node to the prod cluster.
From this I am able to submit a spark job to my prod cluster using
spark-submit --master yarn job_script.py
Now I would like to submit the same job to my dev cluster from this client machine.
I tried using
spark-submit --master yarn://<dev_resource_manager_ip>:8032 job_script.py
But this doesn't seem to work, and my job is still submitted to the prod cluster. How can I tell spark-submit to send the job to the dev cluster's ResourceManager instead of the prod cluster's?

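Create a directory containing the dev cluster's Hadoop client configuration XML files (core-site.xml, hdfs-site.xml, yarn-site.xml and so on), and point the HADOOP_CONF_DIR environment variable at that directory before running spark-submit. With --master yarn, the ResourceManager address is read from that configuration, not from the master URL.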
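A minimal sketch of that approach, assuming the dev cluster's client configuration has been copied to a hypothetical directory /etc/hadoop/conf.dev on the gateway machine. The helper name build_dev_submit is made up for illustration; it only builds the command and the environment, and the final comment shows how they could be run.

```python
import os


def build_dev_submit(conf_dir, script, base_env=None):
    """Return (command, environment) for a spark-submit against the cluster
    described by the Hadoop XML files in conf_dir."""
    env = dict(os.environ if base_env is None else base_env)
    # With --master yarn, the ResourceManager address is read from these files.
    env["HADOOP_CONF_DIR"] = conf_dir
    # Precaution (an assumption about the gateway setup): drop an inherited
    # YARN_CONF_DIR so that a prod value cannot be picked up as well.
    env.pop("YARN_CONF_DIR", None)
    cmd = ["spark-submit", "--master", "yarn", script]
    return cmd, env


# /etc/hadoop/conf.dev is a hypothetical path holding the dev cluster's XMLs.
cmd, env = build_dev_submit("/etc/hadoop/conf.dev", "job_script.py")
# To actually submit: import subprocess; subprocess.run(cmd, env=env, check=True)
```

Passing the environment explicitly keeps the prod configuration as the default for every other command run from the same shell.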

Related

how to switch between cluster types in Apache Spark

I'm trying to switch cluster manager from standalone to 'YARN' in Apache Spark that I've installed for learning.
I read the following thread to understand which cluster type should be chosen.
However, I'd like to know the steps/syntax to change the cluster type.
For example: from standalone to YARN, or from YARN to standalone.
spark-submit has an option named --master that selects where your script runs: locally, on a standalone cluster, or on YARN.
To run the application in local mode or on a standalone cluster, pass one of the following to spark-submit:
--master local[*]
or
--master spark://192.168.10.01:7077 \
--deploy-mode cluster
To run on a YARN cluster:
--master yarn \
--deploy-mode cluster
For more information, see:
https://spark.apache.org/docs/latest/submitting-applications.html
If you are not launching from the command line, you can set the master directly on the SparkConf object. The master URL is passed as a string:
sparkConf.setMaster("spark://<master-host>:<port>") for a standalone cluster
or
sparkConf.setMaster("local[*]") for local mode
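The master URL forms above can be collected in one small helper. This is only an illustrative sketch: the function name master_url and the mode labels are made up here, and the returned string is what would be passed to --master or to SparkConf.setMaster.

```python
def master_url(mode, host=None, port=7077):
    """Return the Spark master URL string for a given cluster type."""
    if mode == "local":
        return "local[*]"  # run in-process, one worker thread per core
    if mode == "standalone":
        if host is None:
            raise ValueError("standalone mode needs the master host")
        return "spark://{}:{}".format(host, port)
    if mode == "yarn":
        return "yarn"  # the cluster location comes from HADOOP_CONF_DIR
    raise ValueError("unknown mode: {}".format(mode))


print(master_url("standalone", host="192.168.10.01"))
# With PySpark installed: SparkConf().setMaster(master_url("yarn"))
```

Switching cluster type then amounts to changing one argument rather than editing the application code.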

Can't see Yarn Job when doing Spark-Submit on Yarn Cluster

I am using spark-submit for my job with the command below:
spark-submit script_test.py --master yarn --deploy-mode cluster
spark-submit script_test.py --master yarn-cluster --deploy-mode cluster
The job works fine, and I can see it in the Spark History Server UI. However, I cannot see it in the ResourceManager UI (YARN).
I have the feeling that my job is not sent to the cluster but runs on a single node only. However, I see nothing wrong with the way I use the spark-submit command.
Am I wrong? How can I check it? Or how can I send the job to the YARN cluster?
Using --master yarn assumes that the YARN hosts, ports, and so on have been configured somewhere, typically in yarn-site.xml.
Maybe the machine where you run spark-submit doesn't know where the YARN ResourceManager is.
Check your Hadoop/YARN/Spark config files, especially yarn-site.xml, to see whether the ResourceManager host is correct.
Note also that spark-submit expects its options before the application file; anything after script_test.py is passed to the script as an application argument, so --master and --deploy-mode should come first.
The location of those files depends on which Hadoop distribution you are using. In HDP I guess they are in /etc/hadoop/conf
Hope it helps.
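One way to carry out that check is to list the ResourceManager settings found in yarn-site.xml. The sketch below assumes the usual Hadoop configuration layout (a configuration root containing property elements with name and value children); the function name resource_manager_settings is made up for illustration, and the path in the usage comment is only the HDP guess mentioned above.

```python
import xml.etree.ElementTree as ET


def resource_manager_settings(yarn_site_path):
    """Return the yarn.resourcemanager.* properties found in a yarn-site.xml."""
    root = ET.parse(yarn_site_path).getroot()
    settings = {}
    for prop in root.findall("property"):
        name = prop.findtext("name", default="").strip()
        value = prop.findtext("value", default="").strip()
        if name.startswith("yarn.resourcemanager."):
            settings[name] = value
    return settings


# Usage (the path is an assumption and depends on the distribution):
#   for name, value in sorted(resource_manager_settings(
#           "/etc/hadoop/conf/yarn-site.xml").items()):
#       print(name, "=", value)
```

If the result is empty or names a host other than the expected ResourceManager, the client is not pointing at the intended cluster.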

Running a Spark job with spark-submit across the whole cluster

I have recently set up a Spark cluster on Amazon EMR with 1 master and 2 slaves.
I can run pyspark, and submit jobs with spark-submit.
However, when I create a standalone job, like job.py, I create a SparkContext, like so:
sc=SparkContext("local", "App Name")
This doesn't seem right, but I'm not sure what to put there.
When I submit the job, I am sure it is not utilizing the whole cluster.
If I want to run a job against my entire cluster, say 4 processes per slave, what do I have to
a.) pass as arguments to spark-submit
b.) pass as arguments to SparkContext() in the script itself.
You can create the SparkContext without hard-coding a master, so that the master is taken from spark-submit:
from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName("App Name")
sc = SparkContext(conf=conf)
Then submit the program with spark-submit. For a Spark standalone cluster:
./bin/spark-submit --master spark://<sparkMasterIP>:7077 code.py
For Mesos cluster
./bin/spark-submit --master mesos://207.184.161.138:7077 code.py
For YARN cluster
./bin/spark-submit --master yarn --deploy-mode cluster code.py
With a YARN master, the cluster configuration is read from the directory that HADOOP_CONF_DIR points to.
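For the sizing part of the question, a hedged sketch of a YARN submission follows. It reads "4 processes per slave" as one executor per slave node with four cores each; that reading and the helper name yarn_submit_command are assumptions, not something the answer above states. The --num-executors and --executor-cores options are regular spark-submit options.

```python
def yarn_submit_command(script, num_executors, executor_cores):
    """Build a spark-submit command line for a YARN cluster-mode run."""
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--num-executors", str(num_executors),    # executors requested from YARN
        "--executor-cores", str(executor_cores),  # concurrent tasks per executor
        script,                                   # options must precede the script
    ]


# Assumed sizing: two slave nodes, one executor each, four cores per executor.
print(" ".join(yarn_submit_command("code.py", num_executors=2, executor_cores=4)))
```

The script itself then needs no master or sizing arguments in SparkContext(); everything cluster-specific stays on the command line.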

What is the master URL in EC2 spark cluster

I have a spark cluster launched using spark-ec2 script.
(EDIT: after logging in to the master), I can run Spark jobs locally on the master node as:
spark-submit --class myApp --master local myApp.jar
But I can't seem to run the job in the cluster mode:
../spark/bin/spark-submit --class myApp --master spark://54.111.111.111:7077 --deploy-mode cluster myApp.jar
The ip address of the master is obtained from the AWS console.
I get the following errors:
WARN RestSubmissionClient: Unable to connect to server
Warning: Master endpoint spark://54.111.111.111:7077 was not a REST server. Falling back to legacy submission gateway instead.
Error connecting to master (akka.tcp://sparkMaster@54.111.111.111:7077).
Cause was: akka.remote.InvalidAssociation: Invalid address: akka.tcp://sparkMaster@54.177.156.236:7077
No master is available, exiting.
How do I submit to an EC2 Spark cluster?
When you run with --master local you are also not connecting to the master. You are executing Spark operations in the same JVM as the application. (See docs.)
Your application code may be wrong too. So first just try to run spark-shell on the master node. /root/spark/bin/spark-shell is configured to connect to the EC2 Spark master when started without flags. If that works, you can try spark-shell --master spark://ec2-xxx-xxx-xxx-xxx.compute-1.amazonaws.com:7077 on your laptop. Be sure to use the external IP or hostname of the master machine.
If that works too, try running your application in client mode (without --deploy-mode cluster). Hopefully in the course of trying all these, you will figure out what was wrong with your original approach. Good luck!
This has nothing to do with EC2; I had a similar error on my own server. I was able to resolve it by setting SPARK_MASTER_IP in spark-env.sh.

Spark on yarn concept understanding

I am trying to understand how spark runs on YARN cluster/client. I have the following question in my mind.
Is it necessary that Spark is installed on all the nodes in the YARN cluster? I think it should be, because the worker nodes in the cluster execute tasks and should be able to interpret the code (Spark APIs) in the Spark application sent to the cluster by the driver.
It says in the documentation "Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client side) configuration files for the Hadoop cluster". Why does the client node have to install Hadoop when it is only sending the job to the cluster?
Adding to other answers.
Is it necessary that spark is installed on all the nodes in the yarn cluster?
No, not if the Spark job is scheduled on YARN (in either client or cluster mode). Spark needs to be installed on every node only in standalone mode.
These are visualizations of the Spark application deployment modes.
[Figure: Spark standalone cluster]
In cluster mode the driver runs on one of the Spark worker nodes, whereas in client mode it runs on the machine that launched the job.
[Figure: YARN cluster mode]
[Figure: YARN client mode]
[Table: concise list of differences between these modes]
It says in the documentation "Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client-side) configuration files for the Hadoop cluster". Why does the client node have to install Hadoop when it is sending the job to cluster?
A Hadoop installation is not mandatory, but the configuration files (not all of them) are. Such client machines can be called gateway nodes. The configuration is needed for two main reasons:
The configuration contained in HADOOP_CONF_DIR directory will be distributed to the YARN cluster so that all containers used by the application use the same configuration.
In YARN mode the ResourceManager's address is picked up from the Hadoop configuration (yarn-site.xml). This is why the --master parameter is simply yarn.
Update (2017-01-04):
Spark 2.0+ no longer requires a fat assembly jar for production deployment.
We are running Spark jobs on YARN (we use HDP 2.2).
We don't have Spark installed on the cluster. We only added the Spark assembly jar to HDFS.
For example, to run the Pi example:
./bin/spark-submit \
--verbose \
--class org.apache.spark.examples.SparkPi \
--master yarn-cluster \
--conf spark.yarn.jar=hdfs://master:8020/spark/spark-assembly-1.3.1-hadoop2.6.0.jar \
--num-executors 2 \
--driver-memory 512m \
--executor-memory 512m \
--executor-cores 4 \
hdfs://master:8020/spark/spark-examples-1.3.1-hadoop2.6.0.jar 100
--conf spark.yarn.jar=hdfs://master:8020/spark/spark-assembly-1.3.1-hadoop2.6.0.jar - This setting tells YARN where to take the Spark assembly from. If you don't set it, the jar is uploaded from the machine where you run spark-submit.
About your second question: the client node does not need Hadoop installed. It only needs the configuration files. You can copy the configuration directory from your cluster to your client.
1 - Spark follows a master/slave architecture. So on your cluster, you have to install a Spark master and N Spark slaves. You can run Spark in standalone mode, but using the YARN architecture will give you some benefits.
There is a very good explanation of it here : http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/
2 - It is necessary if you want to use YARN or HDFS, for example, but as I said before you can run it in standalone mode.
Let me cut to the chase and keep it short for the impatient.
6 components: 1. client, 2. driver, 3. executors, 4. application master, 5. workers, and 6. resource manager; 2 deploy modes; and 2 resource (cluster) managers.
Here's the relation:
Client
Nothing special; it is the one submitting the Spark app.
Worker, executors
Nothing special; one worker holds one or more executors.
Master, & resource (cluster) manager
(no matter whether client or cluster mode)
in YARN, the resource manager and the master sit on two different nodes;
in standalone, resource manager == master, the same process on the same node.
Driver
in client mode, sits with the client
in YARN cluster mode, sits with the master (in this case, the client process exits after submitting the app)
in standalone cluster mode, sits with one worker
Voilà!

Resources