Hadoop confs for client application

I have a client application that uses the Hadoop conf files (hadoop-site.xml and hadoop-core.xml).
I don't want to check them in to the resources folder, so I tried to add them via IDEA.
The problem is that the Hadoop Configuration ignores my HADOOP_CONF_DIR and loads the default confs from the hadoop package. Any idea?
I'm using Gradle.

I ended up solving it by putting the configuration files in the test resources folder, so they are not included when the jar gets built.
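If relying on HADOOP_CONF_DIR is not an option, another approach is to add the files to the Configuration explicitly. A minimal sketch, with placeholder paths used only for illustration:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// new Configuration() loads the defaults found on the classpath;
// addResource then layers the client-side files on top, overriding non-final properties.
val conf = new Configuration()
conf.addResource(new Path("/path/to/client-conf/hadoop-core.xml"))
conf.addResource(new Path("/path/to/client-conf/hadoop-site.xml"))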

Related

Running Spark from a local IDE

I've been spending some time banging my head against trying to run a complex Spark application locally in order to test more quickly (without having to package and deploy to a cluster).
Some context:
This spark application interfaces with Datastax Enterprise version of Cassandra and their distributed file system, so it needs some explicit jars to be provided (not available in Maven)
These jars are available on my local machine, and to "cheese" this, I tried placing them in SPARK_HOME/jars so they would be automatically added to the classpath
I tried to do something similar with the required configuration settings by putting them in spark-defaults.conf under SPARK_HOME/conf
When building this application, we do not build an uber jar, but rather do a spark-submit on the server using --jars
The problem I'm facing is that when I run the Spark application through my IDE, it doesn't seem to pick up any of these additional items from the SPARK_HOME directory (config or jars). I spent a few hours trying to get the config items to work and ended up setting them as System.property values in my test case before starting the Spark session, so the configuration settings can be set aside for now.
However, I do not know how to reproduce this for the vendor-specific jar files. Is there an easy way I can emulate the --jars behavior of spark-submit and somehow set up my Spark session with these jars? Note: I am using the following command in my code to start a Spark session:
SparkSession.builder().config(conf).getOrCreate()
Additional information, in case it helps:
The Spark version I have locally in SPARK_HOME is the same version my code compiles against using Maven.
I asked another question similar to this related to configs: Loading Spark Config for testing Spark Applications
When I print the SPARK_HOME environment variable in my application, I get the correct SPARK_HOME value, so I'm not sure why neither the configs nor the jar files are being picked up from there. Is it possible that when running the application from my IDE, it isn't picking up the SPARK_HOME environment variable and is using all defaults?
You can make use of .config(key, value) while building the SparkSession, passing "spark.jars" as the key and a comma-separated list of jar paths as the value, like so:
SparkSession.builder().config("spark.jars", "/path/jar1.jar, /path/jar2.jar").config(conf).getOrCreate()
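For example, a fuller sketch of what the builder might look like when running locally from an IDE (the master setting, app name, and jar paths are only placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf().setAppName("local-test")  // your existing application config

// "spark.jars" ships the listed jars to the driver and executors, much like --jars does.
val spark = SparkSession.builder()
  .master("local[*]")
  .config(conf)
  .config("spark.jars", "/path/to/dse-jar-1.jar,/path/to/dse-jar-2.jar")
  .getOrCreate()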

Run Spark job with properties files

As a beginner with the Hadoop stack, I would like to run my Spark job with spark-submit via Oozie. I have a jar containing the compiled project sources, and I also have a set of properties files (about 20). I want my Spark job, when it runs, to load these properties files from a folder next to the folder containing the compiled jar. I've tried:
In my Oozie job.properties, I added:
oozie.libpath=[path to the folder including all of my properties files]
and oozie.use.system.libpath=true.
On the spark-submit command, I added --files or --properties-file, but it's not working (neither accepts a folder).
Thanks for any suggestions, and feel free to ask for more details if my question is not clear.
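One thing worth noting: --files takes a comma-separated list of individual files rather than a directory. A minimal sketch of reading one of the shipped files inside the job, assuming the properties files are listed individually (file names are placeholders):

import java.io.FileInputStream
import java.util.Properties
import org.apache.spark.SparkFiles

// spark-submit ... --files conf/app1.properties,conf/app2.properties
// Files passed with --files are copied to each executor's working directory
// and can be located with SparkFiles.get.
val props = new Properties()
props.load(new FileInputStream(SparkFiles.get("app1.properties")))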

Shouldn't Oozie/Sqoop jar location be configured during package installation?

I'm using HDP 2.4 in CentOS 6.7.
I have created the cluster with Ambari, so Oozie was installed and configured by Ambari.
I got two errors related to jar file locations while running an Oozie/Sqoop job. The first concerned postgresql-jdbc.jar, since the Sqoop job imports incrementally from Postgres. I added the postgresql-jdbc.jar file to HDFS and pointed to it in workflow.xml:
<file>/user/hdfs/sqoop/postgresql-jdbc.jar</file>
That solved the problem. The second error seems to concern kite-data-mapreduce.jar; however, doing the same for this file:
<file>/user/hdfs/sqoop/kite-data-mapreduce.jar</file>
does not seem to solve the problem:
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SqoopMain], main() threw exception, org/kitesdk/data/DatasetNotFoundException
java.lang.NoClassDefFoundError: org/kitesdk/data/DatasetNotFoundException
It seems strange that this is not configured automatically by Ambari and that we have to copy jar files into HDFS only as errors appear.
Is this the correct methodology, or did I miss some configuration step?
This is happening due to missing jars in the classpath. I would suggest setting the property oozie.use.system.libpath=true in the job.properties file; all the Sqoop-related jars will then be added to the classpath automatically from /user/oozie/share/lib/lib_<timestamp>/sqoop/*.jar. After that, add only the custom jars you need to the lib directory of the workflow application path.
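For example, the job.properties entries and workflow layout might look roughly like this (the paths are only illustrative):

oozie.wf.application.path=hdfs:///user/hdfs/sqoop-wf
oozie.use.system.libpath=true

# custom jars live under the workflow application's lib/ directory, e.g.
# hdfs:///user/hdfs/sqoop-wf/lib/postgresql-jdbc.jar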

Adding Spark and Hadoop configuration files to JAR?

I have a Spark application which I would like to configure using configuration files, such as Spark's spark-defaults.conf, HBase's hbase-site.xml and log4j's log4j.properties. I also want to avoid having to add the files programmatically.
I tried adding the files to my JAR (under both / and /conf paths) but when I ran spark-submit the configuration files did not seem to have any effect.
To further check my claim, I tried running spark-shell with the same JARs and inspecting the contents of the files, and I discovered that they were overridden by files from other locations: /spark-defaults.conf and /log4j.properties were completely different, and /conf/hbase-site.xml, while intact, (probably) had its properties overridden by another JAR's hbase-default.xml.
I use CDH 5.4.0.
The files log4j.properties and spark-defaults.conf were loaded from /etc/spark/ and hbase-default.xml was loaded from /opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/jars/hbase-common-1.0.0-cdh5.4.0.jar.
Is there a way to specify some sort of priority on my configuration files over the others? Should I just configure the files in /etc/spark (and perhaps add my hbase-site.xml too)? Is there a way to add a custom directory path to the classpath that could take priority over the others?
I don't think it is possible to include spark-defaults.conf in the jar. The only ways I know of are to edit the file on the server or to add the config settings programmatically.
But for hbase-site.xml and the other Hadoop *-site.xml configs it should work.
You can put every *-site.xml in the root of your resources directory and it should be loaded, unless there is some other *-site.xml on Spark's classpath that gets loaded first.
That is, if you are adding the hadoop classpath or hbase classpath to the Spark env on the server, those are loaded first by the classloader, unless you use the setting spark.files.userClassPathFirst.
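For the spark-defaults.conf style settings, a minimal sketch of the programmatic alternative (the keys and values are only examples; on Spark 1.3+ the class-path-precedence settings are named spark.driver.userClassPathFirst and spark.executor.userClassPathFirst):

import org.apache.spark.{SparkConf, SparkContext}

// Settings that would otherwise live in spark-defaults.conf, set in code instead.
val conf = new SparkConf()
  .setAppName("my-app")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.executor.userClassPathFirst", "true")  // prefer jars shipped with the application

val sc = new SparkContext(conf)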

Missing of hadoop-mapreduce-client-core-[0-9.]*.jar in hadoop1.2.1

I have installed Hadoop 1.2.1 on a three-node cluster. While installing Oozie, when I try to generate a war file for the web console, I get this error:
hadoop-mapreduce-client-core-[0-9.]*.jar' not found in '/home/hduser/hadoop'
I believe the version of Hadoop I am using doesn't have this jar file (I don't know where to find it). So can anyone please tell me how to create the war file and enable the web console? Any help is appreciated.
You are correct. You have two options:
1. Download the individual jars, put them inside your hadoop-1.2.1 directory, and generate the war file.
2. Download Hadoop 2.x and use it while creating the war file; once the war has been built, continue using your hadoop-1.2.1.
For example, from the oozie-3.3.2 directory:
bin/oozie-setup.sh prepare-war -hadoop hadoop-1.1.2 ~/hadoop-eco/hadoop-2.2.0 -extjs ~/hadoop-eco/oozie/oozie-3.3.2/webapp/src/main/webapp/ext-2.2
Here I have built Oozie 3.3.2 to use with hadoop-1.1.2, but the war was created using hadoop-2.2.0.
HTH
