spark-submit, spark-shell, or pyspark not taking default environment values - hadoop

I have installed Hadoop and Spark on Google Cloud using Click To Deploy. I am trying to run spark-shell and spark-submit to test the installation. But when I try
spark-shell --master yarn-client
I get the error:
Caused by: org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested memory < 0, or requested memory > max configured, requestedMemory=6383, maxMemory=5999
The problem is that when I don't pass --executor-memory 1G (or some other value), Spark doesn't pick up the default 1G value; instead, for reasons I don't understand, YARN is asked for more memory than the configured maximum.
Commands with and without arguments:
pyspark --master yarn-client --executor-memory 1G --num-executors 1 --verbose
Parsed arguments:
master yarn-client
deployMode null
executorMemory 1G
numExecutors 1
pyspark --master yarn --verbose
Parsed arguments:
master yarn
deployMode null
executorMemory null
numExecutors null
Is this a Spark bug or a Google Cloud configuration issue? Is there any way to set the default values?
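For reference, spark-submit reads its default properties from conf/spark-defaults.conf under SPARK_HOME, so one way to pin the defaults explicitly is an entry like the sketch below (the values are illustrative, not tuned for this cluster):
spark.executor.memory    1g
spark.driver.memory      1g
spark.executor.instances 1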

Related

how to switch between cluster types in Apache Spark

I'm trying to switch the cluster manager from standalone to YARN in Apache Spark, which I've installed for learning.
I read the following thread to understand which cluster type should be chosen.
However, I'd like to know the steps/syntax to change the cluster type,
e.g. from Standalone to YARN or from YARN to Standalone.
Spark has an option named --master that lets you execute your script on a YARN cluster, a standalone cluster, or locally.
To run the application in local mode, use this with the spark-submit command:
--master local[*]
or, for a standalone cluster:
--master spark://192.168.10.01:7077 \
--deploy-mode cluster \
To run on a YARN cluster:
--master yarn
--deploy-mode cluster
For more information, see:
https://spark.apache.org/docs/latest/submitting-applications.html
If you are not launching through the command line, you can set the master directly on the SparkConf object:
sparkConf.setMaster("spark://host:port") for a cluster master URL
or
sparkConf.setMaster("local[*]") for client/local mode
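A minimal self-contained Scala sketch of the programmatic route (the app name is a placeholder, and the master URL shown is just one option; swap in "yarn" or your own spark://host:port URL to switch cluster types):
import org.apache.spark.{SparkConf, SparkContext}

// The master URL decides the cluster type: "local[*]", "spark://host:7077", or "yarn".
val conf = new SparkConf()
  .setAppName("MasterSwitchExample") // hypothetical app name
  .setMaster("local[*]")             // change this string to switch cluster types

val sc = new SparkContext(conf)
println(sc.parallelize(1 to 100).sum()) // trivial job to confirm the context works
sc.stop()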

spark-submit in local mode - configuration

I am doing a spark-submit using --master local on my laptop (Spark 1.6.1) to load data into Hive tables. The laptop has 8 GB RAM and 4 cores. I have not set any properties manually - just using the defaults.
When I load 50k records, the job finishes successfully. But when I try to load 200k records, I get a "GC overhead limit exceeded" error.
In --master local mode, are there properties for job memory or heap memory that could be set manually?
Try increasing --driver-memory and --executor-memory; the default value is 1g for both. Note that in --master local mode everything runs in a single JVM, so --driver-memory is the setting that actually controls the heap.
The command should look like this:
spark-submit --master local --driver-memory 2g --executor-memory 2g --class classpath jarfile
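A concrete sketch with hypothetical names filled in (com.example.HiveLoader and the jar path are placeholders for your own class and jar):
spark-submit --master local[4] --driver-memory 4g --class com.example.HiveLoader /path/to/hive-loader.jar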

How to set yarn.app.mapreduce.am.command-opts for spark yarn cluster job

I am getting "Container... is running beyond virtual memory limits" error while running spark job in yarn cluster mode.
It is not possible to ignore this error or increase Vmem Pmem ratio.
Job is submitted through spark-submit with " --conf spark.driver.memory=2800m".
I think it is because default value of yarn.app.mapreduce.am.command-opts is 1G, so yarn kills my driver/AM as soon as my driver/AM uses more than 1G memory.
So I would like to pass "yarn.app.mapreduce.am.command-opts" to spark-submit in bash script. Passing it with "spark.driver.extraJavaOptions" errors out with "Not allowed to specify max heap(Xmx) memory settings through java options"
So how do I pass it ?
EDIT: I cannot edit conf files as that will make the change for all MR and spark jobs.
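One caveat worth stating as an assumption rather than a confirmed diagnosis: yarn.app.mapreduce.am.command-opts only governs MapReduce ApplicationMasters, and a Spark-on-YARN job does not read it. In cluster mode the Spark AM heap comes from spark.driver.memory, and "beyond virtual memory limits" kills are usually addressed by raising the YARN container overhead instead, along these lines (values illustrative; the class and jar are placeholders):
spark-submit --master yarn --deploy-mode cluster \
  --conf spark.driver.memory=2800m \
  --conf spark.yarn.driver.memoryOverhead=1024 \
  --class com.example.MyJob /path/to/myjob.jar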

run spark-submit on YARN but imbalanced (only 1 node is working)

I'm trying to run a Spark app on YARN-CLUSTER (2 nodes), but the two nodes seem imbalanced: only one node is working while the other is not.
My Script :
spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn-cluster --deploy-mode cluster --num-executors 2 \
--driver-memory 1G \
--executor-memory 1G \
--executor-cores 2 spark-examples-1.6.1-hadoop2.6.0.jar 1000
I can see that one of my nodes is working while the other is not, so this is imbalanced:
Note: in the screenshot, the namenode is on the left and the datanode is on the right...
Any idea?
The complete dataset could be local to one of the nodes, so Spark might be trying to honour data locality.
You can try the following config when launching spark-submit:
--conf "spark.locality.wait.node=0"
The same worked for me.
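Applied to the script from the question, that would look something like this (same jar and arguments as above):
spark-submit --class org.apache.spark.examples.SparkPi \
  --master yarn --deploy-mode cluster \
  --num-executors 2 --driver-memory 1G --executor-memory 1G --executor-cores 2 \
  --conf "spark.locality.wait.node=0" \
  spark-examples-1.6.1-hadoop2.6.0.jar 1000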
You are running the job in yarn-cluster mode; in cluster mode the Spark driver runs in the ApplicationMaster on a cluster host.
Try running it in yarn-client mode: in client mode the Spark driver runs on the host where the job is submitted, so you will be able to see the output on the console.
spark-submit --verbose --class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode client \
--num-executors 2 \
--driver-memory 1G \
--executor-memory 1G \
--executor-cores 2 spark-examples-1.6.1-hadoop2.6.0.jar 10
You can check which nodes the executors were launched on from the Spark UI.
The Spark UI gives the details of the nodes where execution was launched; Executors is the tab to look at.
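If you would rather check from spark-shell than the UI, SparkContext.getExecutorMemoryStatus keys are host:port strings, one per executor (plus the driver); a small Scala sketch:
// Print the host:port of every block manager the driver knows about.
sc.getExecutorMemoryStatus.keys.foreach(println)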

which mode should we use when running spark on yarn?

I know there are two modes for running Spark applications on a YARN cluster.
In yarn-cluster mode, the driver runs in the Application Master (inside the YARN cluster). In yarn-client mode, it runs on the client node where the job is submitted.
I wanted to know the advantages of using one mode over the other, and which mode to use under what circumstances.
There are two deploy modes that can be used to launch Spark applications on YARN.
Yarn-cluster: the Spark driver runs within the Hadoop cluster as a YARN Application Master and spins up Spark executors within YARN containers. This allows Spark applications to run within the Hadoop cluster and be completely decoupled from the workbench, which is used only for job submission. An example:
[terminal~]:cd $SPARK_HOME
[terminal~]:./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn \
--deploy-mode cluster --num-executors 3 --driver-memory 1g \
--executor-memory 2g --executor-cores 1 --queue thequeue \
$SPARK_HOME/examples/target/spark-examples_*-1.2.1.jar
Note that in the example above, the --queue option is used to specify the Hadoop queue to which the application is submitted.
Yarn-client: The Spark driver runs on the workbench itself with the Application Master operating in a reduced role. It only requests resources from YARN to ensure the Spark workers reside in the Hadoop cluster within YARN containers. This provides an interactive environment with distributed operations. Here’s an example of invoking Spark in this mode while ensuring it picks up the Hadoop LZO codec:
[terminal~]:cd $SPARK_HOME
[terminal~]:bin/spark-shell --master yarn --deploy-mode client --queue research \
--driver-memory 512M \
--driver-class-path /opt/hadoop/share/hadoop/mapreduce/lib/hadoop-lzo-0.4.18-201409171947.jar
So when you want an interactive environment for your job, use client mode; yarn-client mode accepts commands from spark-shell.
When you want to decouple your job from the Spark workbench, use yarn-cluster mode.
