Spark exception: java.lang.Exception: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment

I am new to Apache Spark. I have tested some applications in Spark standalone mode, but now I want to run an application in YARN mode. I am running Apache Spark 2.1.0 on Windows. Here is my command:
c:\spark>spark-submit2 --master yarn --deploy-mode client --executor-cores 4 --jars C:\DependencyJars\spark-streaming-eventhubs_2.11-2.0.3.jar,C:\DependencyJars\scalaj-http_2.11-2.3.0.jar,C:\DependencyJars\config-1.3.1.jar,C:\DependencyJars\commons-lang3-3.3.2.jar --conf spark.driver.userClasspathFirst=true --conf spark.executor.extraClassPath=C:\DependencyJars\commons-lang3-3.3.2.jar --conf spark.executor.userClasspathFirst=true --class "GeoLogConsumerRT" C:\sbtazure\target\scala-2.11\azuregeologproject_2.11-1.0.jar
Exception: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
After searching the web, I created a folder named Hadoop_CONF_DIR, placed hive-site.xml in it, and set it as an environment variable. I then ran spark-submit again and got
a connection refused exception.
I think I have not set up YARN mode properly. Could anyone help me solve this issue? Do I need to install Hadoop and YARN separately? I want to run my application in pseudo-distributed mode. Kindly help me configure YARN mode on Windows. Thanks.

You need to export HADOOP_CONF_DIR or YARN_CONF_DIR (setting both, as below, is fine) so that Spark can find your Hadoop and YARN configuration files. Add the lines below to your .bashrc file if you are using Linux.
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
On Windows, set the same variables as environment variables.
Hope this helps!

If you are running Spark on YARN, you should add this to spark-env.sh:
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
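A quick way to confirm that the variables are doing their job is to check that at least one of them points at a directory that actually contains yarn-site.xml. A folder holding only hive-site.xml, as in the question, satisfies Spark's startup check but gives the YARN client no ResourceManager address, which is a likely cause of the connection refused error. The Python sketch below is only an illustration; the function name find_yarn_conf_dir is made up for this example and is not part of Spark or Hadoop.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def find_yarn_conf_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the first of HADOOP_CONF_DIR / YARN_CONF_DIR that holds yarn-site.xml.

    Hypothetical helper for illustration; raises RuntimeError with the
    reason for each variable that was rejected.
    """
    env = os.environ if env is None else env
    problems = []
    for name in ("HADOOP_CONF_DIR", "YARN_CONF_DIR"):
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
            continue
        conf_dir = Path(value)
        if (conf_dir / "yarn-site.xml").is_file():
            return conf_dir
        problems.append(f"{name}={value} has no yarn-site.xml")
    raise RuntimeError("; ".join(problems))


if __name__ == "__main__":
    try:
        print("Using configuration from", find_yarn_conf_dir())
    except RuntimeError as err:
        print("YARN configuration not usable:", err)
```

Running it in the same shell (or command prompt) that you use for spark-submit shows what Spark will see in its environment.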

Related

How to switch between cluster types in Apache Spark

I'm trying to switch cluster manager from standalone to 'YARN' in Apache Spark that I've installed for learning.
I read the following thread to understand which cluster type should be chosen.
However, I'd like to know the steps/syntax to change the cluster type.
Ex: from Standalone to YARN or from YARN to Standalone.
spark-submit has an option, --master, that selects whether your application runs locally, on a standalone cluster, or on YARN.
To run the application in local mode or on a standalone cluster, pass one of these to the spark-submit command:
--master local[*]
or
--master spark://192.168.10.01:7077
--deploy-mode cluster
To run on a YARN cluster:
--master yarn
--deploy-mode cluster
For more information kindly visit this link.
https://spark.apache.org/docs/latest/submitting-applications.html
If you are not launching through the command line, you can set the master directly on the SparkConf object:
sparkConf.setMaster("spark://[master-host]:[port]") for a standalone cluster
or
sparkConf.setMaster("local[*]") for local mode
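To make the switch concrete, here is a small Python sketch that assembles the spark-submit command line for each cluster type. The helper name spark_submit_command and the application name my_app.py are hypothetical; the master URLs are the ones from the answer above. Note that the options go before the application file, because spark-submit treats everything after the application file as arguments to the application itself.

```python
from typing import List, Optional, Sequence

# Master URLs taken from the answer above; adjust them to your own cluster.
MASTERS = {
    "local": "local[*]",
    "standalone": "spark://192.168.10.01:7077",
    "yarn": "yarn",
}


def spark_submit_command(
    app: str,
    master: str,
    deploy_mode: Optional[str] = None,
    app_args: Sequence[str] = (),
) -> List[str]:
    """Build a spark-submit argument list with options before the application."""
    cmd = ["spark-submit", "--master", master]
    if deploy_mode is not None:
        cmd += ["--deploy-mode", deploy_mode]
    cmd.append(app)        # the application file comes after all options
    cmd.extend(app_args)   # anything after it is passed to the application
    return cmd


# prints: spark-submit --master yarn --deploy-mode cluster my_app.py
print(" ".join(spark_submit_command("my_app.py", MASTERS["yarn"], "cluster")))
```

Switching from YARN back to standalone is then just a matter of passing MASTERS["standalone"] instead; the application code itself does not need to change as long as it does not hard-code a master.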

Can't see Yarn Job when doing Spark-Submit on Yarn Cluster

I am using spark-submit for my job with the commands below:
spark-submit script_test.py --master yarn --deploy-mode cluster
spark-submit script_test.py --master yarn-cluster --deploy-mode cluster
The job is working fine. I can see it under the Spark History Server UI. However, I cannot see it under the ResourceManager UI (YARN).
I have the feeling that my job is not sent to the cluster and is running on only one node. However, I see nothing wrong in the way I use the spark-submit command.
Am I wrong? How can I check it? Or how can I send the job to the YARN cluster?
When you use --master yarn, Spark relies on a yarn-site.xml that has been configured somewhere with hosts, ports, and so on.
Maybe the machine where you run spark-submit doesn't know where the YARN master (the ResourceManager) is.
You could check your Hadoop/YARN/Spark config files, especially yarn-site.xml, to see whether the host of the ResourceManager is correct.
Those files are in different folders depending on which distribution of Hadoop you are using. In HDP I guess they are in /etc/hadoop/conf
Hope it helps.
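If you want to see which ResourceManager the client machine would talk to, you can read it straight out of yarn-site.xml. The Python sketch below uses only the standard library; the function names and the host rm.example.com are made up for illustration. It assumes a non-HA setup and falls back to the defaults from yarn-default.xml (hostname 0.0.0.0, port 8032) when a property is missing.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def read_hadoop_properties(xml_text: str) -> Dict[str, str]:
    """Parse a Hadoop-style <configuration> document into a name -> value dict."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None and value is not None:
            props[name.strip()] = value.strip()
    return props


def resource_manager_address(props: Dict[str, str]) -> str:
    """Return host:port for the ResourceManager (non-HA setups only)."""
    address = props.get("yarn.resourcemanager.address")
    if address:
        return address
    # yarn-default.xml: address defaults to ${yarn.resourcemanager.hostname}:8032
    host = props.get("yarn.resourcemanager.hostname", "0.0.0.0")
    return f"{host}:8032"


# Hypothetical yarn-site.xml content, for illustration only.
SAMPLE = """
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>rm.example.com</value>
  </property>
</configuration>
"""

# prints: rm.example.com:8032
print(resource_manager_address(read_hadoop_properties(SAMPLE)))
```

If this reports 0.0.0.0:8032 for the file on the submitting machine, the client has no real ResourceManager address configured there.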

HDP: How to change HADOOP_CLASSPATH value

I need to add a value to the HADOOP_CLASSPATH environment variable, according to this troubleshoot article: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_installing_manually_book/content/troubleshooting-phoenix.html
When I type echo $HADOOP_CLASSPATH in the console I get an empty result back. I think I need to set these values in a config.xml file...
Where or how can I set this new value to the environment variable?
Can I set it in spark-submit?
You can set the HADOOP_CONF_DIR environment variable in spark-env.sh, so that it is picked up automatically whenever you run spark-submit. Its value is the path to your Hadoop configuration directory, which points Spark towards the Hadoop configuration files:
export HADOOP_CONF_DIR=[your-hadoop-conf-dir]
The error can be avoided by adding the jar path to the spark-submit call via the --driver-class-path parameter:
spark-submit --class sparkhbase.PhoenixTest --master yarn --deploy-mode client --driver-class-path "/usr/hdp/current/hbase-client/lib/hbase-protocol-1.1.2.2.4.2.0-258.jar" /home/test/app.jar
It also worked by setting the --conf parameter like this:
spark-submit --class sparkhbase.PhoenixTest --master yarn --deploy-mode client --conf "spark.driver.extraClassPath=/usr/hdp/current/hbase-client/lib/hbase-protocol-1.1.2.2.4.2.0-258.jar" /home/test/app.jar
Setting ONE of them should do it!
Also add --conf "spark.executor.extraClassPath=/usr/hdp/current/hbase-client/lib/hbase-protocol-1.1.2.2.4.2.0-258.jar" to your spark-submit if you still get an exception (this can happen when the code runs on the executors rather than on the driver).
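As a rough illustration of the two settings above, this Python sketch builds the --conf arguments for both the driver and the executors from a single jar path. The helper name classpath_conf_args is hypothetical; the property names and the jar path are the ones used in the answer.

```python
from typing import List


def classpath_conf_args(jar_path: str, include_executors: bool = True) -> List[str]:
    """Build --conf arguments that put jar_path on the driver (and executor) classpath."""
    args = ["--conf", f"spark.driver.extraClassPath={jar_path}"]
    if include_executors:
        # Needed when the classes are also used by code running on the executors.
        args += ["--conf", f"spark.executor.extraClassPath={jar_path}"]
    return args


HBASE_PROTOCOL_JAR = (
    "/usr/hdp/current/hbase-client/lib/hbase-protocol-1.1.2.2.4.2.0-258.jar"
)

command = (
    ["spark-submit", "--class", "sparkhbase.PhoenixTest",
     "--master", "yarn", "--deploy-mode", "client"]
    + classpath_conf_args(HBASE_PROTOCOL_JAR)
    + ["/home/test/app.jar"]
)
print(" ".join(command))
```

Keeping the jar path in one place avoids the driver and executor settings drifting apart when the HDP version in the file name changes.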

Install Spark on a YARN cluster

I am looking for a guide on how to install Spark on an existing virtual YARN cluster.
I have a YARN cluster consisting of two nodes. I ran a MapReduce job, which worked perfectly; I checked the results in the logs and everything is working fine.
Now I need to add the Spark installation commands and configuration files to my Vagrantfile. I can't find a good guide; could someone give me a good link?
I used this guide for the yarn cluster
http://www.alexjf.net/blog/distributed-systems/hadoop-yarn-installation-definitive-guide/#single-node-installation
Thanks in advance!
I don't know about vagrant, but I have installed Spark on top of hadoop 2.6 (in the guide referred to as post-YARN) and I hope this helps.
Installing Spark on an existing Hadoop cluster is really easy; you only need to install it on one machine. Download the package pre-built for your Hadoop version from its official website (I guess you can also use the "without Hadoop" version, but then you need to point it to the location of the Hadoop binaries on your system). Then decompress it:
tar -xvf spark-2.0.0-bin-hadoop2.x.tgz -C /opt
Now you only need to set some environment variables. First in your ~/.bashrc (or ~/.zshrc) you can set SPARK_HOME and add it to your PATH if you want:
export SPARK_HOME=/opt/spark-2.0.0-bin-hadoop2.x
export PATH=$PATH:$SPARK_HOME/bin
For these changes to take effect, you can run:
source ~/.bashrc
Second, you need to point Spark to your Hadoop configuration directories. To do this, set these two environment variables in $SPARK_HOME/conf/spark-env.sh:
export HADOOP_CONF_DIR=[your-hadoop-conf-dir usually $HADOOP_PREFIX/etc/hadoop]
export YARN_CONF_DIR=[your-yarn-conf-dir usually the same as the last variable]
If this file doesn't exist, you can copy the contents of $SPARK_HOME/conf/spark-env.sh.template and start from there.
Now to start the shell in yarn mode you can run:
spark-shell --master yarn --deploy-mode client
(You can't run the shell in cluster deploy-mode)
----------- Update
I forgot to mention that you can also submit cluster jobs with this configuration like this (thanks #JulianCienfuegos):
spark-submit --master yarn --deploy-mode cluster project-spark.py
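This way the driver runs inside the cluster, so you can't see the application's output in the terminal. Depending on spark.yarn.submit.waitAppCompletion, spark-submit either keeps reporting the application's state until it finishes (the default) or exits as soon as the job is submitted (not completed).
You can also use --deploy-mode client to see the output right there in your terminal, but only do this for testing, since the job gets canceled if the command is interrupted (e.g. you press Ctrl+C, or your session ends).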
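Since the question was about scripting the installation, the spark-env.sh step above could be automated roughly as follows. This is a Python sketch under stated assumptions, not a tested provisioning script: the function name ensure_spark_env is made up, and it assumes both variables should point at the same directory, as the answer suggests is usual.

```python
import shutil
from pathlib import Path


def ensure_spark_env(spark_home: Path, hadoop_conf_dir: str) -> Path:
    """Create conf/spark-env.sh (from the template if present) and add the exports.

    Hypothetical helper: safe to call repeatedly, it only appends exports
    that are not already present as active lines.
    """
    conf = Path(spark_home) / "conf"
    conf.mkdir(parents=True, exist_ok=True)
    env_file = conf / "spark-env.sh"
    template = conf / "spark-env.sh.template"

    # Start from the shipped template when spark-env.sh does not exist yet.
    if not env_file.exists() and template.exists():
        shutil.copy(template, env_file)

    existing = env_file.read_text() if env_file.exists() else ""
    active = [line.strip() for line in existing.splitlines()]

    missing = [
        f"export {name}={hadoop_conf_dir}"
        for name in ("HADOOP_CONF_DIR", "YARN_CONF_DIR")
        if not any(line.startswith(f"export {name}=") for line in active)
    ]
    if missing:
        with env_file.open("a") as fh:
            if existing and not existing.endswith("\n"):
                fh.write("\n")
            fh.write("\n".join(missing) + "\n")
    return env_file
```

It could be called from a provisioning step as ensure_spark_env(Path("/opt/spark-2.0.0-bin-hadoop2.x"), "/etc/hadoop/conf"), where both paths are examples to be replaced with your own.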

Spark Submit Issue

I am trying to run a fat jar on a Spark cluster using Spark submit.
I created the cluster on AWS using the "spark-ec2" executable in the Spark bundle.
The command I am using to run the jar file is
bin/spark-submit --class edu.gatech.cse8803.main.Main --master yarn-cluster ../src1/big-data-hw2-assembly-1.0.jar
In the beginning it gave me the error that at least one of the HADOOP_CONF_DIR or YARN_CONF_DIR environment variables must be set.
I didn't know what to set them to, so I used the following command
export HADOOP_CONF_DIR=/mapreduce/conf
Now the error has changed to
Could not load YARN classes. This copy of Spark may not have been compiled with YARN support.
Run with --help for usage help or --verbose for debug output
The home directory structure is as follows
ephemeral-hdfs hadoop-native mapreduce persistent-hdfs scala spark spark-ec2 src1 tachyon
I even set the YARN_CONF_DIR variable to the same value as HADOOP_CONF_DIR, but the error message did not change. I am unable to find any documentation that addresses this issue; most of it just mentions these two variables and gives no further details.
You need to compile Spark with YARN support to use it.
Follow the steps explained here: https://spark.apache.org/docs/latest/building-spark.html
Maven:
build/mvn -Pyarn -Phadoop-2.x -Dhadoop.version=2.x.x -DskipTests clean package
SBT:
build/sbt -Pyarn -Phadoop-2.x assembly
You can also download a pre-compiled version here: http://spark.apache.org/downloads.html (choose a "pre-built for Hadoop")
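Before rebuilding, you may want to check whether the Spark copy you already have contains the YARN classes at all. The Python sketch below scans the jars under a Spark directory for the YARN client class. It rests on an assumption: that the class Spark looks for is org.apache.spark.deploy.yarn.Client, which is how it is named in the Spark source as far as I know. The function name jars_with_yarn_support is made up for this example.

```python
import zipfile
from pathlib import Path
from typing import List

# Assumed name of Spark's YARN client class, as a path inside a jar.
YARN_CLIENT_CLASS = "org/apache/spark/deploy/yarn/Client.class"


def jars_with_yarn_support(spark_home: str) -> List[Path]:
    """Return the jars under spark_home that contain the YARN client class."""
    found = []
    for jar in sorted(Path(spark_home).rglob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as archive:
                if YARN_CLIENT_CLASS in archive.namelist():
                    found.append(jar)
        except zipfile.BadZipFile:
            # Not a readable jar; skip it rather than abort the scan.
            continue
    return found
```

An empty result for your Spark directory would be consistent with the "Could not load YARN classes" message, in which case building with -Pyarn or downloading a pre-built package is the way to go.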
Download a prebuilt Spark that supports Hadoop 2.x versions from https://spark.apache.org/downloads.html
The --master argument should be --master spark://hostname:7077, where hostname is the name of your Spark master server. You can also specify this value as spark.master in the spark-defaults.conf file and leave out the --master argument when using spark-submit from the command line. Including the --master argument overrides the value set in spark-defaults.conf, if any.
Reference: http://spark.apache.org/docs/1.3.0/configuration.html
