Adding Spark and Hadoop configuration files to JAR? - hadoop

I have a Spark application which I would like to configure using configuration files, such as Spark's spark-defaults.conf, HBase's hbase-site.xml and log4j's log4j.properties. I also want to avoid having to add the files programmatically.
I tried adding the files to my JAR (under both / and /conf paths) but when I ran spark-submit the configuration files did not seem to have any effect.
To check this further, I ran spark-shell with the same JARs and inspected the contents of the files, and I discovered that they were overridden by files from other locations: /spark-defaults.conf and /log4j.properties were completely different, and /conf/hbase-site.xml, while remaining intact, (probably) had its properties overridden by another JAR's hbase-default.xml.
I use CDH 5.4.0.
The files log4j.properties and spark-defaults.conf were loaded from /etc/spark/ and hbase-default.xml was loaded from /opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/jars/hbase-common-1.0.0-cdh5.4.0.jar.
Is there a way to specify some sort of priority on my configuration files over the others? Should I just configure the files in /etc/spark (and perhaps add my hbase-site.xml too)? Is there a way to add a custom directory path to the classpath that could take priority over the others?

I don't think it is possible to include spark-defaults.conf in the JAR. The only ways I know of are to edit the file on the server or to add the config settings programmatically.
But for hbase-site.xml and the other Hadoop *-site.xml configs it should work.
You can put every *-site.xml in the root of your resources directory and it should be loaded, unless some other *-site.xml on Spark's classpath gets loaded first.
That is, if you add the hadoop classpath or hbase classpath to the Spark env on the server, those entries are loaded first by the classloader, unless you use the setting spark.files.userClassPathFirst.
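For illustration, here is a minimal spark-submit sketch of the workaround described above; the class name, JAR name and file paths are placeholders, and --properties-file, --files and --conf are standard spark-submit options:

# Pass the configuration files at submit time instead of baking them into the JAR.
# --properties-file replaces the cluster's spark-defaults.conf for this submission,
# --files ships hbase-site.xml and log4j.properties to the executors' working directories,
# and spark.files.userClassPathFirst (mentioned above) gives user-added jars precedence
# over the cluster's when loading classes in the executors.
spark-submit \
  --class com.example.MyApp \
  --properties-file ./spark-defaults.conf \
  --files ./hbase-site.xml,./log4j.properties \
  --conf spark.files.userClassPathFirst=true \
  my-app.jar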

Related

Hadoop confs for client application

I have a client application that uses the Hadoop conf files (hadoop-site.xml and hadoop-core.xml).
I don't want to check them into the resources folder, so I tried to add them via IDEA.
The problem is that the Hadoop Configuration ignores my HADOOP_CONF_DIR and loads the default confs from the Hadoop package. Any idea?
I'm using Gradle.
I ended up solving it by putting the configuration files in the test resources folder, so when the JAR gets built it does not include them.
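For illustration, a sketch of the resulting layout under standard Gradle conventions (directory names are the Gradle defaults; the conf file names follow the question):

src/
  main/
    resources/          # left empty so the built JAR does not bundle the confs
  test/
    resources/
      hadoop-site.xml   # picked up from the test classpath when running locally
      hadoop-core.xml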

Shouldn't Oozie/Sqoop jar location be configured during package installation?

I'm using HDP 2.4 in CentOS 6.7.
I have created the cluster with Ambari, so Oozie was installed and configured by Ambari.
I got two errors while running Oozie/Sqoop related to jar file location. The first concerned postgresql-jdbc.jar, since the Sqoop job is incrementally importing from Postgres. I added the postgresql-jdbc.jar file to HDFS and pointed to it in workflow.xml:
<file>/user/hdfs/sqoop/postgresql-jdbc.jar</file>
It solved the problem. But the second error seems to concern kite-data-mapreduce.jar. However, doing the same for this file:
<file>/user/hdfs/sqoop/kite-data-mapreduce.jar</file>
does not seem to solve the problem:
Failing Oozie Launcher, Main class
[org.apache.oozie.action.hadoop.SqoopMain], main() threw exception,
org/kitesdk/data/DatasetNotFoundException
java.lang.NoClassDefFoundError:
org/kitesdk/data/DatasetNotFoundException
It seems strange that this is not automatically configured by Ambari and that we have to copy jar files into HDFS as we start getting errors.
Is this the correct methodology or did I miss some configuration step?
This is happening due to missing jars on the classpath. I would suggest you set the property oozie.use.system.libpath=true in the job.properties file; all the Sqoop-related jars will then be added to the classpath automatically from /user/oozie/share/lib/lib_<timestamp>/sqoop/*.jar. After that, add only the custom jars you need to the lib directory of the workflow application path.
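As a hedged illustration, a minimal job.properties along those lines could look as follows (host names and the application path are placeholders; oozie.use.system.libpath and oozie.wf.application.path are standard Oozie job properties):

# job.properties sketch
nameNode=hdfs://namenode:8020
jobTracker=resourcemanager:8050
# pull the Sqoop (and other sharelib) jars onto the classpath automatically
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/user/hdfs/sqoop-workflow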

Set HADOOP_CLASSPATH while using Spring jar-tasklet

I am using hadoop jar-tasklet:
<hdp:jar-tasklet id="testjob" jar="bhs_abhishek.jar">
</hdp:jar-tasklet>
This jar currently needs some config files on the classpath, which I was previously setting through the HADOOP_CLASSPATH variable when invoking it with the hadoop jar command. But I could not find a way of setting HADOOP_CLASSPATH using the Spring XML. Please provide any suggestions on how this can be achieved, or a better way of doing this. I am OK with making changes in the jar.
You can try adding your config files to the xd/config directory, which should be on the classpath.
There is also an xd/config/hadoop-site.xml file where you could add Hadoop config properties. One more alternative is to modify xd/config/servers.yml and add Hadoop config properties under spring:hadoop:config:, as we do for io.file.buffer.size in this example:
---
# Hadoop properties
spring:
  hadoop:
    fsUri: hdfs://hadoop.example.com:8020
    resourceManagerHost: hadoop.example.com
    resourceManagerPort: 8032
    jobHistoryAddress: hadoop.example.com:10020
    config:
      io.file.buffer.size: 4096
---

How does zookeeper determine the 'java.library.path' for a hadoop job?

I am running hadoop jobs on a distributed cluster using oozie. I give a setting 'oozie.libpath' for the oozie jobs.
Recently, I deleted a few of my older-version jar files from the library path Oozie uses and replaced them with newer versions. However, when I run my Hadoop job, both the older and the newer versions of the jar files get loaded, and the MapReduce job is still using the older version.
I am not sure where the jar files are being loaded from. Are there any default locations they are loaded from? There is only one library path in my HDFS and it does not contain those jar files.
I have found what is going wrong. The wrong jar was shipped with my job file. I should have checked there first.
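For anyone debugging something similar, a quick sketch of how you could check what actually ends up on the job's classpath (paths are illustrative; hdfs dfs -ls is a standard command):

# list what is shipped with the workflow application itself
hdfs dfs -ls /user/hdfs/oozie/app/lib
# list the directory given via oozie.libpath
hdfs dfs -ls /user/hdfs/oozie/libpath
# list the system sharelib, which may also be on the classpath
hdfs dfs -ls /user/oozie/share/lib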

How to specify multiple jar files in oozie

I need a solution for the following problem:
My project has two jars in which
One jar contains all the bean classes (Employee, etc.) and the other jar contains the MR jobs, which use the bean classes from the first jar. When I try to run an MR job as a simple Java program, I get a class-not-found error (com.abc.Employee is not found, as it is in the other jar). Can anyone provide a solution? In real projects there may be many jars, not just one or two, so how do I specify all of them? Please reply as soon as possible.
You should have a lib folder in the HDFS directory where you are storing your Oozie workflow. You can place both jar files in this folder and Oozie will ensure both are on the classpath when your MR job executes:
hdfs://namenode:8020/path/to/oozie/app/workflow.xml
hdfs://namenode:8020/path/to/oozie/app/lib/first.jar
hdfs://namenode:8020/path/to/oozie/app/lib/second.jar
See Workflow Application Deployment for more details.
If you often use the same jars in a number of Oozie workflows, you can place these common jars (HBase jars, for example) in a directory in HDFS, and then set an Oozie property to include that folder's jars. See HDFS Share Libraries for more details.
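As a rough sketch, the shared-library variant could be wired up in job.properties like this (the HDFS path is illustrative; oozie.libpath and oozie.use.system.libpath are standard Oozie properties):

# job.properties sketch
nameNode=hdfs://namenode:8020
# extra directory of common jars to put on every action's classpath
oozie.libpath=${nameNode}/user/hdfs/share/common-jars
# also include the jars from the Oozie system sharelib
oozie.use.system.libpath=true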
