How to set yarn.app.mapreduce.am.command-opts for spark yarn cluster job - hadoop

I am getting a "Container... is running beyond virtual memory limits" error while running a Spark job in yarn-cluster mode.
Ignoring this error or increasing the vmem-to-pmem ratio is not an option for me.
The job is submitted through spark-submit with "--conf spark.driver.memory=2800m".
I think this is because the default value of yarn.app.mapreduce.am.command-opts is 1G, so YARN kills my driver/AM as soon as it uses more than 1G of memory.
So I would like to pass "yarn.app.mapreduce.am.command-opts" to spark-submit in a bash script. Passing it via "spark.driver.extraJavaOptions" errors out with "Not allowed to specify max heap(Xmx) memory settings through java options".
So how do I pass it?
EDIT: I cannot edit the conf files, as that would change the setting for all MR and Spark jobs.
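For reference, a minimal sketch of the submission described above; the application class, jar path, and the exact -Xmx value are placeholders, not taken from the original post:

spark-submit --master yarn --deploy-mode cluster \
    --conf spark.driver.memory=2800m \
    --class com.example.MyJob /path/to/my-job.jar

# the variant described above that errors out with the Xmx restriction:
spark-submit --master yarn --deploy-mode cluster \
    --conf spark.driver.memory=2800m \
    --conf "spark.driver.extraJavaOptions=-Xmx2800m" \
    --class com.example.MyJob /path/to/my-job.jar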

Related

set spark vm options

I'm trying to build a Spark application that uses ZooKeeper and Kafka. Maven is being used for the build. The project I'm trying to build is here. On executing:
mvn clean package exec:java -Dexec.mainClass="com.iot.video.app.spark.processor.VideoStreamProcessor"
It shows
ERROR SparkContext:91 - Error initializing SparkContext.
java.lang.IllegalArgumentException: System memory 253427712 must be at least 471859200. Please increase heap size using the --driver-memory option or spark.driver.memory in Spark configuration.
I tried adding spark.driver.memory 4g to spark-defaults.conf but I still get the error. How can I fix it?
You can pass extra JVM options to your executors by using dedicated spark-submit arguments:
spark-submit --conf 'spark.executor.memory=1g' \
    --conf 'spark.executor.extraJavaOptions=-Xms1024m -Xmx4096m'
Similarly, you can set the option for your driver (useful if your application is submitted in cluster mode, or launched by spark-submit):
--conf 'spark.driver.extraJavaOptions=-Xms512m -Xmx2048m'
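Putting the pieces together, a complete invocation might look like the sketch below. The SparkPi class, the local master, and the examples-jar path are assumptions rather than something from the question; also note that recent Spark versions reject -Xmx inside extraJavaOptions, so heap sizes are most safely set through --driver-memory / spark.executor.memory (the executor setting only matters when running on a real cluster manager):

spark-submit --class org.apache.spark.examples.SparkPi \
    --master 'local[*]' \
    --driver-memory 2g \
    --conf 'spark.executor.memory=1g' \
    $SPARK_HOME/examples/jars/spark-examples_*.jar 100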

Spark on Hadoop YARN - executor missing

I have a cluster of 3 macOS machines running Hadoop and Spark 1.5.2 (though the same problem exists with Spark 2.0.0). With 'yarn' as the Spark master URL, I am running into a strange issue where tasks are only allocated to 2 of the 3 machines.
Based on the Hadoop dashboard (port 8088 on the master) it is clear that all 3 nodes are part of the cluster. However, any Spark job I run only uses 2 executors.
For example here is the "Executors" tab on a lengthy run of the JavaWordCount example:
"batservers" is the master. There should be an additional slave, "batservers2", but it's just not there.
Why might this be?
Note that none of my YARN or Spark (or, for that matter, HDFS) configurations are unusual, except provisions for giving the YARN resource- and node-managers extra memory.
Remarkably, all it took was a detailed look at the spark-submit help message to discover the answer:
YARN-only:
...
--num-executors NUM Number of executors to launch (Default: 2).
If I specify --num-executors 3 in my spark-submit command, the 3rd node is used.
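For completeness, here is a sketch of such a submission; the examples-jar path and the HDFS input path are placeholders and will differ per installation:

spark-submit --class org.apache.spark.examples.JavaWordCount \
    --master yarn --deploy-mode client \
    --num-executors 3 \
    $SPARK_HOME/lib/spark-examples-*.jar \
    hdfs:///user/me/input.txt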

Which mode should we use when running Spark on YARN?

I know there are two deploy modes for running Spark applications on a YARN cluster.
In yarn-cluster mode, the driver runs in the Application Master (inside the YARN cluster). In yarn-client mode, it runs on the client node from which the job is submitted.
I would like to know the advantages of using one mode over the other, and which mode should be used under which circumstances.
There are two deploy modes that can be used to launch Spark applications on YARN.
Yarn-cluster: the Spark driver runs within the Hadoop cluster as a YARN Application Master and spins up Spark executors within YARN containers. This allows Spark applications to run within the Hadoop cluster and be completely decoupled from the workbench, which is used only for job submission. An example:
[terminal~]:cd $SPARK_HOME
[terminal~]:./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn \
    --deploy-mode cluster --num-executors 3 --driver-memory 1g --executor-memory 2g \
    --executor-cores 1 --queue thequeue $SPARK_HOME/examples/target/spark-examples_*-1.2.1.jar
Note that in the example above, the --queue option is used to specify the Hadoop queue to which the application is submitted.
Yarn-client: the Spark driver runs on the workbench itself, with the Application Master operating in a reduced role: it only requests resources from YARN so that the Spark executors reside in the Hadoop cluster within YARN containers. This provides an interactive environment with distributed operations. Here’s an example of invoking Spark in this mode while ensuring it picks up the Hadoop LZO codec:
[terminal~]:cd $SPARK_HOME
[terminal~]:bin/spark-shell --master yarn --deploy-mode client --queue research \
    --driver-memory 512M \
    --driver-class-path /opt/hadoop/share/hadoop/mapreduce/lib/hadoop-lzo-0.4.18-201409171947.jar
So when you want an interactive environment for your job, use client mode; yarn-client mode accepts commands from the spark-shell.
When you want to decouple your job from the Spark workbench, use yarn-cluster mode.
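One caveat that follows from this: the interactive shells can only run in client mode, because the shell itself is the driver; spark-submit refuses to launch a shell with --deploy-mode cluster. A quick illustration:

# works: the shell's driver runs locally, executors run in YARN containers
bin/spark-shell --master yarn --deploy-mode client

# rejected: a shell's driver cannot be shipped into the cluster
bin/spark-shell --master yarn --deploy-mode cluster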

spark-submit, spark-shell or pyspark not taking default environment values

I have installed Hadoop and Spark on Google Cloud using Click To Deploy. I am trying to run spark-shell and spark-submit to test the installation, but when I try
spark-shell --master yarn-client
I get the error: Caused by: org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested memory < 0, or requested memory > max configured, requestedMemory=6383, maxMemory=5999
The problem is that when I don't provide --executor-memory 1G or some value, it doesn't pick up the default 1G value and, for a reason I don't know, YARN allocates the maximum memory to the executor.
Commands with and without arguments:
pyspark --master yarn-client --executor-memory 1G --num-executors 1 --verbose
Parsed arguments:
  master          yarn-client
  deployMode      null
  executorMemory  1G
  numExecutors    1

pyspark --master yarn --verbose
Parsed arguments:
  master          yarn
  deployMode      null
  executorMemory  null
  numExecutors    null
Is this a Spark bug or a Google Cloud configuration issue? Is there any way to set the default values?
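In case it helps, one way to get predictable defaults instead of relying on whatever the deployment ships with is to put them in conf/spark-defaults.conf. The values below are assumptions on my part, sized to stay under the 5999 MB maximum reported in the error above:

# conf/spark-defaults.conf
spark.executor.memory    1g
spark.executor.instances 1
spark.driver.memory      1g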

How to increase the memory for the job history server

The Job History Server is running out of memory when trying to load the task status after a job completes. We are trying to increase the memory for the Job History Server. Any idea how we can increase the Xmx for the Job History Server? I am on Apache Hadoop 2.4.0, running YARN.
Restart the history server with the following settings:
HADOOP_JOB_HISTORYSERVER_HEAPSIZE=5 bash -x bin/mapred --config /Users/ajaymysore/gitHome/proton/resources/hadoop/conf historyserver
Note that 5 is the amount of memory you want to allocate, in MB.
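If you want the change to survive restarts rather than live on the command line, the same variable can usually be exported from mapred-env.sh; the 2048 MB figure below is only an illustrative value, not a recommendation from the original answer:

# etc/hadoop/mapred-env.sh
export HADOOP_JOB_HISTORYSERVER_HEAPSIZE=2048

# then restart the history server
sbin/mr-jobhistory-daemon.sh stop historyserver
sbin/mr-jobhistory-daemon.sh start historyserver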
