Spark Submit Issue - hadoop

I am trying to run a fat jar on a Spark cluster using Spark submit.
I created the cluster using the "spark-ec2" script bundled with Spark, on AWS.
The command I am using to run the jar file is
bin/spark-submit --class edu.gatech.cse8803.main.Main --master yarn-cluster ../src1/big-data-hw2-assembly-1.0.jar
In the beginning it gave me the error that at least one of the HADOOP_CONF_DIR or YARN_CONF_DIR environment variables must be set.
I didn't know what to set them to, so I used the following command:
export HADOOP_CONF_DIR=/mapreduce/conf
Now the error has changed to
Could not load YARN classes. This copy of Spark may not have been compiled with YARN support.
Run with --help for usage help or --verbose for debug output
The home directory structure is as follows
ephemeral-hdfs hadoop-native mapreduce persistent-hdfs scala spark spark-ec2 src1 tachyon
I even set the YARN_CONF_DIR variable to the same value as HADOOP_CONF_DIR, but the error message is not changing. I am unable to find any documentation that highlights this issue; most sources just mention these two variables and give no further details.

You need to compile Spark with YARN support to use it.
Follow the steps explained here: https://spark.apache.org/docs/latest/building-spark.html
Maven:
build/mvn -Pyarn -Phadoop-2.x -Dhadoop.version=2.x.x -DskipTests clean package
SBT:
build/sbt -Pyarn -Phadoop-2.x assembly
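For example, assuming your cluster runs Hadoop 2.4.0 (substitute your actual Hadoop version for the placeholders above), the Maven invocation would look like:
build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package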
You can also download a pre-compiled version here: http://spark.apache.org/downloads.html (choose one of the "pre-built for Hadoop" packages).

Download a prebuilt Spark that supports Hadoop 2.x versions from https://spark.apache.org/downloads.html

The --master argument should be --master spark://hostname:7077, where hostname is the name of your Spark master server. You can also specify this value as spark.master in the spark-defaults.conf file and leave out the --master argument when using spark-submit from the command line. Including the --master argument will override any value set in the spark-defaults.conf file.
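For example, a minimal spark-defaults.conf entry might look like this (the hostname is a placeholder for your actual master):
spark.master    spark://your-master-hostname:7077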
Reference: http://spark.apache.org/docs/1.3.0/configuration.html

Related

Exception: java.lang.Exception: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment. in spark

I am new to apache-spark. I have tested some applications in spark standalone mode, but I want to run my application in yarn mode. I am running apache-spark 2.1.0 on Windows. Here is my code:
c:\spark>spark-submit2 --master yarn --deploy-mode client --executor-cores 4 --jars C:\DependencyJars\spark-streaming-eventhubs_2.11-2.0.3.jar,C:\DependencyJars\scalaj-http_2.11-2.3.0.jar,C:\DependencyJars\config-1.3.1.jar,C:\DependencyJars\commons-lang3-3.3.2.jar --conf spark.driver.userClasspathFirst=true --conf spark.executor.extraClassPath=C:\DependencyJars\commons-lang3-3.3.2.jar --conf spark.executor.userClasspathFirst=true --class "GeoLogConsumerRT" C:\sbtazure\target\scala-2.11\azuregeologproject_2.11-1.0.jar
EXCEPTION: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment. in spark
So, from searching the web, I created a folder named Hadoop_CONF_DIR, placed hive-site.xml in it, and pointed an environment variable at it. After that I ran spark-submit and got a
connection refused exception
I think I have not configured the yarn mode setup properly. Could anyone help me solve this issue? Do I need to install Hadoop and yarn separately? I want to run my application in pseudo-distributed mode. Kindly help me configure yarn mode on Windows, thanks.
You need to export the two variables HADOOP_CONF_DIR and YARN_CONF_DIR to make your configuration files visible to yarn. If you are using Linux, add the lines below to your .bashrc file:
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
On Windows you need to set these as environment variables instead.
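For example, from a command prompt (the path below is an assumption; point it at wherever your Hadoop configuration files actually live):
setx HADOOP_CONF_DIR "C:\hadoop\etc\hadoop"
setx YARN_CONF_DIR "C:\hadoop\etc\hadoop"
Note that setx only affects newly opened command prompts, so reopen the terminal before running spark-submit again.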
Hope this helps!
If you are running Spark with Yarn, then you need to add this to spark-env.sh:
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
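As a quick sanity check that the variable is being picked up, you can submit the bundled SparkPi example (the examples jar name varies by Spark version, so adjust the path):
spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode client $SPARK_HOME/examples/jars/spark-examples_2.11-2.1.0.jar 10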

Spark submit with master as yarn-client (windows) gives Error "Could not find or load main class"

I have installed Hadoop 2.7.1 with Spark 1.4.1 on Windows 8.1.
When I execute the below commands:
cd spark
spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode client lib/spark-examples*.jar 10
I get the below error in the JobHistoryServer log:
Error: Could not find or load main class '-Dspark.externalBlockStore.folderName=spark-262c4697-ef0c-4042-af0c-8106b08574fb'
I did further debugging (along with searching the net) and got hold of the container cmd script, where the sections below (other lines are omitted) appear:
...
#set CLASSPATH=C:/tmp/hadoop-xyz/nm-local-dir/usercache/xyz/appcache/application_1487502025818_0003/container_1487502025818_0003_02_000001/classpath-3207656532274684591.jar
...
#call %JAVA_HOME%/bin/java -server -Xmx512m -Djava.io.tmpdir=%PWD%/tmp '-Dspark.fileserver.uri=http://192.168.1.2:34814' '-Dspark.app.name=Spark shell' '-Dspark.driver.port=34810' '-Dspark.repl.class.uri=http://192.168.1.2:34785' '-Dspark.driver.host=192.168.1.2' '-Dspark.externalBlockStore.folderName=spark-dd9f3f84-6cf4-4ff8-b0f6-7ff84daf74bc' '-Dspark.master=yarn-client' '-Dspark.driver.appUIAddress=http://192.168.1.2:4040' '-Dspark.jars=' '-Dspark.executor.id=driver' -Dspark.yarn.app.container.log.dir=/dep/logs/userlogs/application_1487502025818_0003/container_1487502025818_0003_02_000001 org.apache.spark.deploy.yarn.ExecutorLauncher --arg '192.168.1.2:34810' --executor-memory 1024m --executor-cores 1 --num-executors 2 1> /dep/logs/userlogs/application_1487502025818_0003/container_1487502025818_0003_02_000001/stdout 2> /dep/logs/userlogs/application_1487502025818_0003/container_1487502025818_0003_02_000001/stderr
I checked the relevant files for CLASSPATH, and they look OK. The main class org.apache.spark.deploy.yarn.ExecutorLauncher is available in the spark assembly jar, which is part of the container classpath.
So, what could be the issue here? I searched the net and found many discussions, but they are for Unix variants; there are not many for Windows. I am wondering whether spark-submit really works on Windows (yarn-client mode only; standalone cluster mode works) without any special setup!
BTW, if I run the above java command from a cmd.exe command prompt, I get the same error, since all the command line arguments are quoted with single quotes instead of double quotes (which would make them work!). So is this a bug?
Note: spark-shell also fails (in yarn mode), but the yarn jar ... command works.
Looks like it was a defect in the earlier version. With the latest Hadoop 2.7.3 and Spark 2.1.0, it is working correctly! I could not find any reference for it though.

Install spark on yarn cluster

I am looking for a guide regarding how to install spark on an existing virtual yarn cluster.
I have a yarn cluster consisting of two nodes; I ran a map-reduce job on it which worked perfectly. I looked for the results in the logs and everything is working fine.
Now I need to add the spark installation commands and configuration files to my Vagrantfile. I can't find a good guide; could someone give me a good link?
I used this guide for the yarn cluster
http://www.alexjf.net/blog/distributed-systems/hadoop-yarn-installation-definitive-guide/#single-node-installation
Thanks in advance!
I don't know about vagrant, but I have installed Spark on top of hadoop 2.6 (referred to in the guide as post-YARN), and I hope this helps.
Installing Spark on an existing hadoop setup is really easy; you only need to install it on one machine. For that, you have to download the version pre-built for your hadoop version from its official website (I guess you can use the "without hadoop" version, but you need to point it to the location of the hadoop binaries on your system). Then decompress it:
tar -xvf spark-2.0.0-bin-hadoop2.x.tgz -C /opt
Now you only need to set some environment variables. First, in your ~/.bashrc (or ~/.zshrc), you can set SPARK_HOME and add it to your PATH if you want:
export SPARK_HOME=/opt/spark-2.0.0-bin-hadoop-2.x
export PATH=$PATH:$SPARK_HOME/bin
For these changes to take effect, you can run:
source ~/.bashrc
Second, you need to point Spark to your Hadoop configuration directories. To do this, set these two environment variables in $SPARK_HOME/conf/spark-env.sh:
export HADOOP_CONF_DIR=[your-hadoop-conf-dir usually $HADOOP_PREFIX/etc/hadoop]
export YARN_CONF_DIR=[your-yarn-conf-dir usually the same as the last variable]
If this file doesn't exist, you can copy the contents of $SPARK_HOME/conf/spark-env.sh.template and start from there.
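For example, if your Hadoop installation lives under /opt (the exact path is an assumption; use your own install location), spark-env.sh would contain:
export HADOOP_CONF_DIR=/opt/hadoop-2.6.0/etc/hadoop
export YARN_CONF_DIR=/opt/hadoop-2.6.0/etc/hadoop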
Now to start the shell in yarn mode you can run:
spark-shell --master yarn --deploy-mode client
(You can't run the shell in cluster deploy-mode)
----------- Update
I forgot to mention that you can also submit cluster jobs with this configuration, like this (thanks @JulianCienfuegos):
spark-submit --master yarn --deploy-mode cluster project-spark.py
This way you can't see the output in the terminal, and the command exits as soon as the job is submitted (not completed).
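You can still retrieve the driver output afterwards from YARN's aggregated logs, using the application ID that spark-submit prints when the job is accepted (log aggregation must be enabled in your YARN configuration):
yarn logs -applicationId <application id>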
You can also use --deploy-mode client to see the output right there in your terminal, but only do this for testing, since the job gets canceled if the command is interrupted (e.g. you press Ctrl+C, or your session ends).

Can't seem to build hive for spark

I have been trying to run this code in pyspark.
sqlContext = HiveContext(sc)
datumDF = sqlContext.createDataFrame(datumX, schema)
But I have been receiving this error:
Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly", Py4JJavaError(u'An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o44))
I log in to AWS and spin up clusters with this code: /User/Downloads/spark-1.5.2-bin-hadoop2.6/ec2/spark-ec2 -k name -i /User/Desktop/pemfile.pem login clustername
However, all the docs I've found involve the commands below, which exist in the /users/downloads/spark-1.5.2/ directory. I've run them anyway, and tried logging in to AWS using the ec2 command in that folder afterwards. Still, I just got the same error.
I ran export SPARK_HIVE=TRUE before running these commands on my local machine, but I've seen messages saying it's deprecated and will be ignored anyway.
Build Spark with Hive using Maven:
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -DskipTests clean package
Build Spark with Hive using SBT:
build/sbt -Pyarn -Phadoop-2.3 assembly
And another one I found:
./sbt/sbt -Phive assembly
I also took the hive-site.xml file and put it in both the /Users/Downloads/spark-1.5.2-bin-hadoop2.6/conf folder and the /Users/Downloads/spark-1.5.2/conf folder.
Still no luck.
I can't seem to run the hive commands no matter what I build it with or how I log in. Is there anything obvious I'm missing?
I too had the same error when using a HiveContext on an EC2 cluster built with the ec2 scripts that come with the Spark package (v1.5.2 in my case). Through much trial and error, I found that building an EC2 cluster with the following options got the right version of Hadoop, with Hive properly built, so that I can use a HiveContext in my PySpark jobs:
spark-ec2 -k <your key pair name> -i /path/to/identity-file.pem -r us-west-2 -s 2 --instance-type m3.medium --spark-version 1.5.2 --hadoop-major-version yarn launch <your cluster name>
The key parameters here are that you set --spark-version to 1.5.2 and --hadoop-major-version to yarn (even though you aren't going to use Yarn to submit jobs), as this forces the hadoop build to be 2.4. Of course, adjust the other parameters as appropriate for your desired cluster.
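As a quick check that Hive support is actually present on the resulting cluster, a minimal sketch (using the 1.5.x API, as in the question) is to open the PySpark shell and create a HiveContext by hand:
./bin/pyspark
>>> from pyspark.sql import HiveContext
>>> sqlContext = HiveContext(sc)
>>> sqlContext.sql("SHOW TABLES").show()
If Spark was built without Hive, the HiveContext line raises the same exception as in the question.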

Using different hadoop-mapreduce-client-core.jar to run hadoop cluster

I'm working on a hadoop cluster with CDH4.2.0 installed and ran into this error. It's been fixed in later versions of hadoop, but I don't have access to update the cluster. Is there a way to tell hadoop to use this jar when running my job, through command line arguments like
hadoop jar MyJob.jar -D hadoop.mapreduce.client=hadoop-mapreduce-client-core-2.0.0-cdh4.2.0.jar
where the new mapreduce-client-core.jar file is the patched jar from the ticket. Or must hadoop be completely recompiled with this new jar? I'm new to hadoop so I don't know all the command line options that are possible.
I'm not sure how that would work, since when you execute the hadoop command you're actually executing code in the client jar.
Can you not use MR1? The ticket says the issue only occurs when you're using MR2, so unless you really need Yarn you're probably better off using the MR1 library to run your map/reduce.
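That said, if you want to experiment with overriding the jar anyway, Hadoop 2 has configuration knobs for preferring user-supplied classes on the task classpath; whether they behave correctly on CDH4.2.0 is not guaranteed, so treat this as a sketch:
hadoop jar MyJob.jar -libjars /path/to/patched-hadoop-mapreduce-client-core.jar -Dmapreduce.job.user.classpath.first=true <job args>
This only affects the task JVMs, though; anything that runs in the client JVM (as noted above) still uses the installed client jar.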
