Yarn - executor for spark job - hadoop

Process spark = new SparkLauncher()
        .setAppResource("myApp.jar")
        .setMainClass("com.aa.bb.app")
        .setMaster("yarn")
        .setDeployMode("cluster")
        .addAppArgs(data)
        .launch();
This is how I execute my Spark jar on a YARN cluster. Here are my questions:
Does this run with a single executor? (Is it 1 spark-submit per 1 YARN executor?)
How should I execute multiple Spark jobs concurrently? (Where should I set dynamic allocation, i.e. spark.dynamicAllocation.enabled?)
Where should I set the number of executors? In the Java code? In the YARN XML files?
If I set the number of executors to 2 and run a single job, will one of the executors do nothing?

You don't need to do anything for this; executors are allocated automatically.
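That said, if you do want to set these values in the Java code rather than rely on the defaults, SparkLauncher.setConf accepts any Spark property. A minimal sketch, where the property values and the choice between a fixed executor count and dynamic allocation are illustrative assumptions:

import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

public class LaunchWithExecutorConf {
    public static void main(String[] args) throws Exception {
        SparkAppHandle handle = new SparkLauncher()
                .setAppResource("myApp.jar")
                .setMainClass("com.aa.bb.app")
                .setMaster("yarn")
                .setDeployMode("cluster")
                // Fixed allocation: 2 executors with 2 cores each (illustrative values).
                .setConf("spark.executor.instances", "2")
                .setConf(SparkLauncher.EXECUTOR_CORES, "2")
                // Alternatively, drop the fixed instance count and enable dynamic
                // allocation instead (it needs the external shuffle service on YARN):
                //   .setConf("spark.dynamicAllocation.enabled", "true")
                //   .setConf("spark.shuffle.service.enabled", "true")
                .startApplication();   // returns a handle instead of a raw Process

        // Block until the application reaches a terminal state.
        while (!handle.getState().isFinal()) {
            Thread.sleep(1000);
        }
    }
}

The same keys can also go in spark-defaults.conf or on the spark-submit command line with --conf; setting them on the launcher just keeps everything in one place.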

Related

How to set yarn.app.mapreduce.am.command-opts for spark yarn cluster job

I am getting a "Container... is running beyond virtual memory limits" error while running a Spark job in yarn-cluster mode.
I cannot ignore this error or increase the vmem-to-pmem ratio.
The job is submitted through spark-submit with "--conf spark.driver.memory=2800m".
I think this is because the default value of yarn.app.mapreduce.am.command-opts is 1G, so YARN kills my driver/AM as soon as it uses more than 1G of memory.
So I would like to pass "yarn.app.mapreduce.am.command-opts" to spark-submit in a bash script. Passing it via "spark.driver.extraJavaOptions" errors out with "Not allowed to specify max heap(Xmx) memory settings through java options".
So how do I pass it?
EDIT: I cannot edit conf files as that will make the change for all MR and spark jobs.

spark-submit on yarn - multiple jobs

I would like to submit multiple spark-submit jobs with yarn. When I run
spark-submit --class myclass --master yarn --deploy-mode cluster blah blah
as it is now, I have to wait for the job to complete before I can submit more jobs. I see the heartbeat:
16/09/19 16:12:41 INFO yarn.Client: Application report for application_1474313490816_0015 (state: RUNNING)
16/09/19 16:12:42 INFO yarn.Client: Application report for application_1474313490816_0015 (state: RUNNING)
How can I tell YARN to pick up another job, all from the same terminal? Ultimately I want to be able to run from a script where I can send hundreds of jobs in one go.
Thank you.
Every user has a fixed capacity, as specified in the YARN configuration. If you are allocated N executors (usually you are allocated some fixed number of vcores) and you want to run 100 jobs, you need to specify the allocation for each job:
spark-submit --num-executors N/100 --executor-cores 5
Otherwise, the jobs will stay stuck in the ACCEPTED state.
You can launch multiple jobs in parallel by putting & at the end of each invocation:
for i in $(seq 20); do spark-submit --master yarn --num-executors N/100 --executor-cores 5 blah blah & done
Check dynamic allocation in Spark.
Check which scheduler YARN is using; if it is FIFO, change it to FAIR.
How are you planning to allocate resources to the N jobs on YARN?
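For what it's worth, the same fan-out can be driven from Java with SparkLauncher instead of a shell loop; a rough sketch, where the jar, main class, and resource numbers are placeholder assumptions:

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

public class SubmitManyJobs {
    public static void main(String[] args) throws Exception {
        List<SparkAppHandle> handles = new ArrayList<>();

        // Submit 20 applications without waiting for any of them to finish.
        for (int i = 0; i < 20; i++) {
            handles.add(new SparkLauncher()
                    .setAppResource("myApp.jar")          // placeholder jar
                    .setMainClass("com.example.MyJob")    // placeholder main class
                    .setMaster("yarn")
                    .setDeployMode("cluster")
                    // Cap each job so all of them fit inside the queue's capacity.
                    .setConf("spark.executor.instances", "2")
                    .setConf(SparkLauncher.EXECUTOR_CORES, "5")
                    .startApplication());                 // non-blocking submit
        }

        // Wait until every application reaches a terminal state.
        for (SparkAppHandle handle : handles) {
            while (!handle.getState().isFinal()) {
                Thread.sleep(5000);
            }
        }
    }
}

Just like the shell version, the jobs only run concurrently if the queue actually has capacity for more than one application; otherwise they sit in ACCEPTED.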

Running a Spark job with spark-submit across the whole cluster

I have recently set up a Spark cluster on Amazon EMR with 1 master and 2 slaves.
I can run pyspark, and submit jobs with spark-submit.
However, when I create a standalone job, like job.py, I create a SparkContext, like so:
sc=SparkContext("local", "App Name")
This doesn't seem right, but I'm not sure what to put there.
When I submit the job, I am sure it is not utilizing the whole cluster.
If I want to run a job against my entire cluster, say 4 processes per slave, what do I have to
a.) pass as arguments to spark-submit
b.) pass as arguments to SparkContext() in the script itself.
You can create the Spark context using:
conf = SparkConf().setAppName(appName)
sc = SparkContext(conf=conf)
and you have to submit the program with spark-submit. For a Spark standalone cluster:
./bin/spark-submit --master spark://<sparkMasterIP>:7077 code.py
For a Mesos cluster:
./bin/spark-submit --master mesos://207.184.161.138:7077 code.py
For a YARN cluster:
./bin/spark-submit --master yarn --deploy-mode cluster code.py
For the YARN master, the configuration is read from HADOOP_CONF_DIR.

Spark setAppName doesn't appear in Hadoop running applications UI

I am running a Spark Streaming job, and when I set the app name (a more readable string) for it, the name doesn't appear in the Hadoop running applications UI. I always see the class name as the application name in the Hadoop UI:
val sparkConf = new SparkConf().setAppName("BetterName")
How do I set the job name in Spark so that it appears in this Hadoop UI?
The Hadoop URL for running applications is http://localhost:8088/cluster/apps/RUNNING
[update]
It looks like this is an issue only with Spark Streaming jobs; I couldn't find a solution for how to fix it, though.
When submitting a job via spark-submit, the SparkContext that gets created cannot set the app name, because YARN has already been configured for the job before Spark starts. For the app name to appear in the Hadoop running applications UI, you have to set it on the spark-submit command line with "--name BetterName". I kick off my job with a shell script that calls spark-submit, so I added the name to the command in my shell script.
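If the job is launched programmatically rather than through a shell script (as in the first question above), SparkLauncher has a setAppName method that should play the same role as --name; a small sketch with placeholder jar and class names:

import org.apache.spark.launcher.SparkLauncher;

public class LaunchWithName {
    public static void main(String[] args) throws Exception {
        Process app = new SparkLauncher()
                .setAppResource("myStreamingApp.jar")       // placeholder jar
                .setMainClass("com.example.StreamingJob")   // placeholder main class
                .setMaster("yarn")
                .setDeployMode("cluster")
                .setAppName("BetterName")   // set before YARN registers the application
                .launch();
        app.waitFor();
    }
}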

Parallel Map Reduce Jobs in Hadoop

I have to run many (maybe 12) jobs in Hadoop 1.0.4. I want the first five to run in parallel, then, when they all finish, run 4 other jobs in parallel, and finally run the last 3 in parallel. How can I set this up in Hadoop 1.0.4? As it is, all the jobs run one after another and not in parallel.
The JobControl API can be used for MR job dependencies. For complex workflows, Oozie or Azkaban is recommended. Here is Oozie vs Azkaban.
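As for the JobControl route, here is a rough sketch of the three waves described in the question, using the old mapred API that ships with Hadoop 1.0.4; createJobConf stands in for whatever per-job configuration you already have:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;

public class ThreeWaveWorkflow {

    // Placeholder: build a fully configured JobConf (mapper, reducer, paths, ...)
    // for the i-th job of the named wave.
    private static JobConf createJobConf(String wave, int i) {
        JobConf conf = new JobConf(ThreeWaveWorkflow.class);
        conf.setJobName(wave + "-" + i);
        return conf;
    }

    // Add `count` jobs to the controller; each one depends on every job in
    // `dependencies`, so the whole wave starts only after the previous wave succeeds.
    private static List<Job> addWave(JobControl control, String wave, int count,
                                     List<Job> dependencies) throws IOException {
        List<Job> jobs = new ArrayList<Job>();
        for (int i = 0; i < count; i++) {
            Job job = new Job(createJobConf(wave, i));
            if (dependencies != null) {
                for (Job dep : dependencies) {
                    job.addDependingJob(dep);
                }
            }
            control.addJob(job);
            jobs.add(job);
        }
        return jobs;
    }

    public static void main(String[] args) throws Exception {
        JobControl control = new JobControl("three-wave-workflow");

        List<Job> wave1 = addWave(control, "wave1", 5, null);   // run in parallel
        List<Job> wave2 = addWave(control, "wave2", 4, wave1);  // after all of wave1
        addWave(control, "wave3", 3, wave2);                    // after all of wave2

        // JobControl.run() blocks, so drive it from its own thread and poll.
        Thread runner = new Thread(control);
        runner.start();
        while (!control.allFinished()) {
            Thread.sleep(5000);
        }
        control.stop();

        System.out.println("Failed jobs: " + control.getFailedJobs().size());
    }
}

JobControl only encodes the dependencies: jobs whose dependencies are satisfied are submitted together, and how many of them actually execute at the same time is still up to the cluster scheduler and the available slots.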
