HDP: How to change HADOOP_CLASSPATH value - hadoop

I need to add a value to the HADOOP_CLASSPATH environment variable, according to this troubleshooting article: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_installing_manually_book/content/troubleshooting-phoenix.html
When I type echo $HADOOP_CLASSPATH in the console, I get an empty result back. I think I need to set this value in a configuration file...
Where or how can I set this new value for the environment variable?
Can I set it in spark-submit?

You can set the HADOOP_CONF_DIR environment variable in spark-env.sh, so that whenever you run spark-submit it is picked up automatically. The value of this variable is the path of your Hadoop configuration directory, for example:
export HADOOP_CONF_DIR=/etc/hadoop/conf   # points Spark at the Hadoop configuration files
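The HADOOP_CLASSPATH variable from the question can be set the same way. A minimal sketch, assuming the jar from the HDP troubleshooting article and that hadoop-env.sh is the right place on your cluster:

```shell
# In hadoop-env.sh (or spark-env.sh), append to HADOOP_CLASSPATH rather than overwrite it.
# The jar path below follows the HDP article; adjust it to your HDP version.
export HADOOP_CLASSPATH="/usr/hdp/current/hbase-client/lib/hbase-protocol-1.1.2.2.4.2.0-258.jar:$HADOOP_CLASSPATH"
echo "$HADOOP_CLASSPATH"   # should no longer print an empty line
```

Note the trailing :$HADOOP_CLASSPATH, which preserves anything already on the classpath.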

The error can be avoided by adding the jar path to the spark-submit call via the --driver-class-path parameter:
spark-submit --class sparkhbase.PhoenixTest --master yarn --deploy-mode client --driver-class-path "/usr/hdp/current/hbase-client/lib/hbase-protocol-1.1.2.2.4.2.0-258.jar" /home/test/app.jar
It also worked by setting the --conf parameter like this:
spark-submit --class sparkhbase.PhoenixTest --master yarn --deploy-mode client --conf "spark.driver.extraClassPath=/usr/hdp/current/hbase-client/lib/hbase-protocol-1.1.2.2.4.2.0-258.jar" /home/test/app.jar
Setting ONE of them should do it!
Also add --conf "spark.executor.extraClassPath=/usr/hdp/current/hbase-client/lib/hbase-protocol-1.1.2.2.4.2.0-258.jar" to your spark-submit if you still get an exception (this can happen when the code runs on the executors rather than on the driver).
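Putting both settings together, a single spark-submit covering driver and executors would look like this (class name, jar path, and app jar are taken from the examples above):

```shell
spark-submit \
  --class sparkhbase.PhoenixTest \
  --master yarn --deploy-mode client \
  --conf "spark.driver.extraClassPath=/usr/hdp/current/hbase-client/lib/hbase-protocol-1.1.2.2.4.2.0-258.jar" \
  --conf "spark.executor.extraClassPath=/usr/hdp/current/hbase-client/lib/hbase-protocol-1.1.2.2.4.2.0-258.jar" \
  /home/test/app.jar
```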

Related

Exception: java.lang.Exception: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment. in spark

I am new to apache-spark. I have tested some applications in Spark standalone mode, but now I want to run an application in YARN mode. I am running apache-spark 2.1.0 on Windows. Here is my command:
c:\spark>spark-submit2 --master yarn --deploy-mode client --executor-cores 4 --jars C:\DependencyJars\spark-streaming-eventhubs_2.11-2.0.3.jar,C:\DependencyJars\scalaj-http_2.11-2.3.0.jar,C:\DependencyJars\config-1.3.1.jar,C:\DependencyJars\commons-lang3-3.3.2.jar --conf spark.driver.userClasspathFirst=true --conf spark.executor.extraClassPath=C:\DependencyJars\commons-lang3-3.3.2.jar --conf spark.executor.userClasspathFirst=true --class "GeoLogConsumerRT" C:\sbtazure\target\scala-2.11\azuregeologproject_2.11-1.0.jar
EXCEPTION: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment. in spark
From searching the web, I created a folder named HADOOP_CONF_DIR, placed hive-site.xml in it, and pointed the environment variable at it. After that, when I ran spark-submit, I got a
connection refused exception
I think I could not configure the YARN setup properly. Could anyone help me solve this issue? Do I need to install Hadoop and YARN separately? I want to run my application in pseudo-distributed mode. Kindly help me configure YARN mode on Windows. Thanks.
You need to export the two variables HADOOP_CONF_DIR and YARN_CONF_DIR to make your configuration files visible to YARN. Put the lines below in your .bashrc file if you are using Linux:
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
On Windows, you need to set them as system environment variables.
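On the Windows side, a sketch of the same thing from a command prompt (the Hadoop install path is an assumed placeholder; setx persists the variables, but only for newly opened prompts, so reopen the prompt before running spark-submit):

```shell
setx HADOOP_CONF_DIR "C:\hadoop\etc\hadoop"
setx YARN_CONF_DIR "C:\hadoop\etc\hadoop"
```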
Hope this helps!
If you are running Spark on YARN, you need to add this to spark-env.sh:
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
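A quick sanity check before running spark-submit, sketched under the assumption that your config lives under $HADOOP_HOME/etc/hadoop: the directory must contain core-site.xml and yarn-site.xml so Spark can locate the ResourceManager, otherwise you get errors like the "connection refused" above.

```shell
export HADOOP_CONF_DIR="$HADOOP_HOME/etc/hadoop"
export YARN_CONF_DIR="$HADOOP_HOME/etc/hadoop"
# Report any missing config file that YARN mode needs.
for f in core-site.xml yarn-site.xml; do
  [ -f "$HADOOP_CONF_DIR/$f" ] || echo "missing: $HADOOP_CONF_DIR/$f"
done
```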

Spark-shell --conf option for Kryo/Java serializer

I need to launch the spark shell with custom classes registered via the registerKryoClasses method, as mentioned in the Spark help page.
Now, as mentioned on that page, I cannot recreate the sc variable after launching the spark shell, and hence need to provide the --conf option while launching the spark-shell command.
What should be the option value with --conf so that it is equivalent to the following update:
conf.registerKryoClasses(Array(classOf[ArrayBuffer[String]], classOf[ListBuffer[String]]))
The option to use the JavaSerializer instead of the Kryo serializer worked for me:
spark-shell --conf 'spark.serializer=org.apache.spark.serializer.JavaSerializer'
Edit: just figured out how to use the options. We can do the following:
--conf 'spark.kryo.classesToRegister=scala.collection.mutable.ArrayBuffer,scala.collection.mutable.ListBuffer'
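Putting the two options together: to keep Kryo (instead of falling back to the JavaSerializer) and still register the classes, both settings can be passed at launch. A sketch equivalent to the registerKryoClasses call from the question:

```shell
spark-shell \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --conf 'spark.kryo.classesToRegister=scala.collection.mutable.ArrayBuffer,scala.collection.mutable.ListBuffer'
```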

Missing java system properties when running spark-streaming on Mesos cluster

I submit a Spark app to a Mesos cluster (running in cluster mode) and pass Java system properties through --driver-java-options="-Dkey=value -Dkey=value"; however, these system properties are not available at runtime, so it seems they are not set. --conf "spark.driver.extraJavaOptions=-Dkey=value" doesn't work either.
More details:
the command is
bin/spark-submit --master mesos://10.3.101.119:7077 --deploy-mode cluster --class ${classname} --driver-java-options "-Dconfiguration.http=http://10.3.101.119:9090/application.conf" --conf "spark.executor.extraJavaOptions=-Dconfiguration.http=http://10.3.101.119:9090/application.conf" ${jar file}
I have a two-node mesos cluster, one node both runs master and slave, and the other runs slave only. I submit the spark application on master node.
Internally, the application expects to read a configuration file from the Java system property "configuration.http"; if the property is not available, the application loads a default file from the root of the classpath. When I submitted the application, I saw from the logs that the default configuration file was loaded.
And the actual command to run the application is
"sh -c '/home/ubuntu/spark-1.6.0/bin/spark-submit --name ${appName} --master mesos://zk://10.3.101.184:2181/mesos/grant --driver-cores 1.0 --driver-memory 1024M --class ${classname} ./${jar file} '"
From here you can see that the system property is lost.
You might have a look at this blog post which recommends using an external properties file for this purpose:
$ vi app.properties
spark.driver.extraJavaOptions -Dconfiguration.http=http://10.3.101.119:9090/application.conf
spark.executor.extraJavaOptions -Dconfiguration.http=http://10.3.101.119:9090/application.conf
Then try to run this via
bin/spark-submit --master mesos://10.3.101.119:7077 --deploy-mode cluster --class ${classname} --properties-file app.properties ${jar file}
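The properties file can be created in one step; note that both -D flags must use a plain ASCII hyphen (the en dash and em dash that word processors substitute will silently break the options). Paths are taken from the question above:

```shell
cat > app.properties <<'EOF'
spark.driver.extraJavaOptions   -Dconfiguration.http=http://10.3.101.119:9090/application.conf
spark.executor.extraJavaOptions -Dconfiguration.http=http://10.3.101.119:9090/application.conf
EOF
```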
See
How to pass -D parameter or environment variable to Spark job?
Separate logs from Apache spark

Setting S3 output file grantees for spark output files

I'm running Spark on AWS EMR and I'm having some issues getting the correct permissions on the output files (rdd.saveAsTextFile('<file_dir_name>')). In hive, I would add a line in the beginning with set fs.s3.canned.acl=BucketOwnerFullControl and that would set the correct permissions. For Spark, I tried running:
hadoop jar /mnt/var/lib/hadoop/steps/s-3HIRLHJJXV3SJ/script-runner.jar \
/home/hadoop/spark/bin/spark-submit --deploy-mode cluster --master yarn-cluster \
--conf "spark.driver.extraJavaOptions -Dfs.s3.canned.acl=BucketOwnerFullControl" \
hdfs:///user/hadoop/spark.py
But the permissions do not get set properly on the output files. What is the proper way to pass in the 'fs.s3.canned.acl=BucketOwnerFullControl' or any of the S3 canned permissions to the spark job?
Thanks in advance
I found the solution. In the job, you have to access the JavaSparkContext, get the Hadoop configuration from it, and set the parameter there. For example, in PySpark:
sc._jsc.hadoopConfiguration().set('fs.s3.canned.acl','BucketOwnerFullControl')
The proper way to pass Hadoop config keys in Spark is to use --conf with keys prefixed with spark.hadoop.. Your command would look like:
hadoop jar /mnt/var/lib/hadoop/steps/s-3HIRLHJJXV3SJ/script-runner.jar \
/home/hadoop/spark/bin/spark-submit --deploy-mode cluster --master yarn-cluster \
--conf "spark.hadoop.fs.s3.canned.acl=BucketOwnerFullControl" \
hdfs:///user/hadoop/spark.py
Unfortunately, I cannot find any reference to this in the official Spark documentation.

Spark Submit Issue

I am trying to run a fat jar on a Spark cluster using Spark submit.
I made the cluster using "spark-ec2" executable in Spark bundle on AWS.
The command I am using to run the jar file is
bin/spark-submit --class edu.gatech.cse8803.main.Main --master yarn-cluster ../src1/big-data-hw2-assembly-1.0.jar
In the beginning, it gave me the error that at least one of the HADOOP_CONF_DIR or YARN_CONF_DIR environment variables must be set.
I didn't know what to set them to, so I used the following command
export HADOOP_CONF_DIR=/mapreduce/conf
Now the error has changed to
Could not load YARN classes. This copy of Spark may not have been compiled with YARN support.
Run with --help for usage help or --verbose for debug output
The home directory structure is as follows
ephemeral-hdfs hadoop-native mapreduce persistent-hdfs scala spark spark-ec2 src1 tachyon
I even set the YARN_CONF_DIR variable to the same value as HADOOP_CONF_DIR, but the error message did not change. I am unable to find any documentation that highlights this issue; most documents just mention these two variables without further details.
You need to compile Spark with YARN support to use it.
Follow the steps explained here: https://spark.apache.org/docs/latest/building-spark.html
Maven:
build/mvn -Pyarn -Phadoop-2.x -Dhadoop.version=2.x.x -DskipTests clean package
SBT:
build/sbt -Pyarn -Phadoop-2.x assembly
You can also download a pre-compiled version here: http://spark.apache.org/downloads.html (choose a package that is "pre-built for Hadoop").
Download prebuilt spark which supports hadoop 2.X versions from https://spark.apache.org/downloads.html
The --master argument should be --master spark://hostname:7077, where hostname is the name of your Spark master server. You can also specify this value as spark.master in the spark-defaults.conf file and leave out the --master argument when using spark-submit from the command line. Including the --master argument will override the value set (if it exists) in the spark-defaults.conf file.
Reference: http://spark.apache.org/docs/1.3.0/configuration.html
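For reference, the spark-defaults.conf entry mentioned above is just the same key-value pair in a file (hostname is a placeholder; the file lives in $SPARK_HOME/conf):

```properties
# $SPARK_HOME/conf/spark-defaults.conf
spark.master    spark://hostname:7077
```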
