yarn - spark parallel job - hadoop

I made yarn-cluster which has only 1 work node, and it seems to work fine when I submit my spark application job. When I submit job more than one, jobs are on hadoop queue and process submitted application one by one. I want to process my applications parallelly, not one by one. Is there any configuration for this? or unable to do this on yarn?

Yarn submits jobs one by one by default.
For submit multiple jobs you can change amount of your executor cores:
spark-submit class /jar --executor-memory 2g --num-executors 15 --executor-cores 3 --master yarn --deploy-mode cluster
You also can change this properties in your yarn-site.xml

Related

Can't see Yarn Job when doing Spark-Submit on Yarn Cluster

I am using spark-submit for my job with the command below:
spark-submit script_test.py --master yarn --deploy-mode cluster
spark-submit script_test.py --master yarn-cluster --deploy-mode cluster
The job is working fine. I can see it under the Spark History Server UI. However, I cannot see it under the RessourceManager UI ( YARN).
I have the feeling that my job is not sent to the cluster but it is running only in one node. However, I see nothing wrong on the way I use the Spark-submit command.
Am-i wrong? How can I check it? Or send the job to yarn cluster?
When you are using --master yarn means that in some place you have configured the yarn-site with hosts, ports, and so on.
Maybe the machine where you are using the spark-submit doesn't know where is the Yarn master.
You could check your hadoop/yarn/spark config files, specially the yarn-site.xml to check if the host of the Resource Manager is correct or not.
Those files are in different folders depending on which distribution of Hadoop you are using. In HDP I guess they are in /etc/hadoop/conf
Hope it helps.

spark-submit on yarn - multiple jobs

I would like to submit multiple spark-submit jobs with yarn. When I run
spark-submit --class myclass --master yarn --deploy-mode cluster blah blah
as it is now, I have to wait for the job to complete for me to submit more jobs. I see the heartbeat:
16/09/19 16:12:41 INFO yarn.Client: Application report for application_1474313490816_0015 (state: RUNNING)
16/09/19 16:12:42 INFO yarn.Client: Application report for application_1474313490816_0015 (state: RUNNING)
How can I tell yarn to pick up another job all from the same terminal. Ultimately I want to be able to run from a script where I cand send hundreds of jobs in one go.
Thank you.
Every user has a fixed capacity as specified in the yarn configuration. If you are allocated N executors (usually, you will be allocated some fixed number of vcores), and you want to run 100 jobs, you will need to specify the allocation to each of the jobs:
spark-submit --num-executors N/100 --executor-cores 5
Otherwise, the jobs will loop in accepted.
You can launch multiple jobs in parallel using & at the last of every invocation.
for i inseq 20; do spark-submit --master yarn --num-executors N/100 --executor-cores 5 blah blah &; done
Check dynamic allocation in spark
Check what scheduler is in use with Yarn, if FIFO change it to FAIR
How are you planning to allocate resources to N number of jobs on yarn?

which mode we should use when running spark on yarn?

I know there are two modes while running spark applications on yarn cluster.
In yarn-cluster mode, the driver runs in the Application Master (inside a YARN cluster). In yarn-client mode, it runs in the client node where the job is submitted
I wanted to know what are the advantages of using one mode over the other ? Which mode we should use under what circumstances.
There are two deploy modes that can be used to launch Spark applications on YARN.
Yarn-cluster: the Spark driver runs within the Hadoop cluster as a YARN Application Master and spins up Spark executors within YARN containers. This allows Spark applications to run within the Hadoop cluster and be completely decoupled from the workbench, which is used only for job submission. An example:
[terminal~]:cd $SPARK_HOME
[terminal~]:./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn
–deploy-mode cluster --num-executors 3 --driver-memory 1g --executor-memory
2g --executor-cores 1 --queue thequeue $SPARK_HOME/examples/target/spark-examples_*-1.2.1.jar
Note that in the example above, the –queue option is used to specify the Hadoop queue to which the application is submitted.
Yarn-client: The Spark driver runs on the workbench itself with the Application Master operating in a reduced role. It only requests resources from YARN to ensure the Spark workers reside in the Hadoop cluster within YARN containers. This provides an interactive environment with distributed operations. Here’s an example of invoking Spark in this mode while ensuring it picks up the Hadoop LZO codec:
[terminal~]:cd $SPARK_HOME
[terminal~]:bin/spark-shell --master yarn --deploy-mode client --queue research
--driver-memory 512M --driver-class-path /opt/hadoop/share/hadoop/mapreduce/lib/hadoop-lzo-0.4.18-201409171947.jar
So when you want interactive environment for your job, you should use client mode. The yarn-client mode accepts commands from the spark-shell.
When you want to decouple your job from Spark workbench, use Yarn cluster mode.

Running a Spark job with spark-submit across the whole cluster

I have recently set up a Spark cluster on Amazon EMR with 1 master and 2 slaves.
I can run pyspark, and submit jobs with spark-submit.
However, when I create a standalone job, like job.py, I create a SparkContext, like so:
sc=SparkContext("local", "App Name")
This doesn't seem right, but I'm not sure what to put there.
When I submit the job, I am sure it is not utilizing the whole cluster.
If I want to run a job against my entire cluster, say 4 processes per slave, what do I have to
a.) pass as arguments to spark-submit
b.) pass as arguments to SparkContext() in the script itself.
You can create spark context using
conf = SparkConf().setAppName(appName)
sc = SparkContext(conf=conf)
and you have to submit the program to spark-submit using the following command for spark standalone cluster
./bin/spark-submit --master spark://<sparkMasterIP>:7077 code.py
For Mesos cluster
./bin/spark-submit --master mesos://207.184.161.138:7077 code.py
For YARN cluster
./bin/spark-submit --master yarn --deploy-mode cluster code.py
For YARN master, the configuration would be read from HADOOP_CONF_DIR.

Spark over Yarn - Incorrect Application Master selection

I'm trying to fire some jobs with Spark over Yarn with the following command (this is just an example, actually i'm using different amount of memory and core) :
./bin/spark-submit --class org.mypack.myapp \
--master yarn-cluster \
--num-executors 3 \
--driver-memory 4g \
--executor-memory 2g \
--executor-cores 1 \
lib/myapp.jar \
When I look at the Web UI to see what's is really happening under the hood, I notice that YARN is picking as Application Master a node that is not the Spark Master. This is a problem because the real Spark Master node is forcefully involved into the distributed computation leading to unnecessary network transfers of data (because, of course, the Spark master has no data to start with).
For what I saw during my tests, Yarn is picking the AM in a totally random fashion and I can't find a way to force him picking the Spark Master as AM.
My cluster is made of 4 nodes (3 Spark slaves, 1 Spark Master) with 64GB of total RAM and 32 cores, built upon HDP 2.4 with HortonWorks. The Spark Master is only hosting the namenode, the three slaves are datanodes.
You want to be able to specify a node, which does not have any DataNodes, to run the Spark Master. This, as far as I know, is not possible out of the box.
What you could do is run the master in yarn-client mode on the node which is running the NameNode, but this is probably not what you are looking for.
Another way would be to create your own Spark Client (where you specify using YARN API to prefer certain nodes over others for your Spark Master).

Resources