How does the Oozie example find its lib directory? - hadoop

Following the Oozie documentation, I am trying to run the map-reduce example on Oozie. As everyone knows, 'workflow.xml' (and 'coordinator.xml') must be in HDFS.
Then I run the command: oozie job -oozie http://localhost:11000/oozie -config examples/apps/map-reduce/job.properties -run. I also know that 'job.properties' should be on the local file system.
But two things confuse me:
1. Why does the jar or class referenced in workflow.xml come from the lib directory in HDFS?
2. There is a picture showing the contents of oozie-examples-4.3.1.jar. This jar is in HDFS, so how can it import the lib?
Forgive my poor English.

The highlighted red box is part of the default Hadoop and Java classpath. Any Java code that runs within YARN as part of MapReduce has access to the packages listed when you run the hadoop classpath command. By the way, the mapred.* classes of Hadoop are almost all deprecated.
That has nothing to do with Oozie per se, but Oozie extends the Hadoop classpath with the Oozie ShareLib, which must be explicitly enabled with a property in the job configuration:
oozie.use.system.libpath=true
In addition to that classpath, Oozie will ship the ${wf.application.path}/lib directory to all running jobs.
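Putting those pieces together, a minimal job.properties enabling the ShareLib might look like the following sketch (the nameNode, jobTracker, and application path values are placeholders, not taken from the question):

```properties
nameNode=hdfs://localhost:8020
jobTracker=localhost:8032
# pull in the Oozie ShareLib (e.g. MapReduce, Pig, Sqoop jars)
oozie.use.system.libpath=true
# anything under ${oozie.wf.application.path}/lib is also shipped to the job
oozie.wf.application.path=${nameNode}/user/${user.name}/examples/apps/map-reduce
```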

Related

Run Spark job with properties files

As a beginner with the Hadoop stack, I would like to run my Spark job with spark-submit via Oozie. I have a jar containing my compiled project sources, and I also have a set of properties files (about 20). When running my Spark job, I want to load these properties files from a folder alongside the one containing my compiled Spark job jar. I've tried:
In my job.properties of oozie, I added:
oozie.libpath=[path to the folder including all of my properties files]
and oozie.use.system.libpath=true.
on the spark-submit command, I added --files or --properties-file, but it is not working (neither accepts a folder)
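For reference, --files accepts a comma-separated list of individual files rather than a directory, so the attempted invocation would need to look something like this sketch (class name and paths are hypothetical):

```shell
# --files takes comma-separated files, not a folder; all paths are placeholders
spark-submit \
  --class com.example.MyJob \
  --files /path/to/conf/app1.properties,/path/to/conf/app2.properties \
  my-spark-job.jar
```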
Thanks for any suggestions, and feel free to ask for more details if my question is not clear.

Shouldn't Oozie/Sqoop jar location be configured during package installation?

I'm using HDP 2.4 in CentOS 6.7.
I have created the cluster with Ambari, so Oozie was installed and configured by Ambari.
I got two errors while running Oozie/Sqoop, both related to jar file locations. The first concerned postgresql-jdbc.jar, since the Sqoop job incrementally imports from Postgres. I added the postgresql-jdbc.jar file to HDFS and pointed to it in workflow.xml:
<file>/user/hdfs/sqoop/postgresql-jdbc.jar</file>
It solved the problem. But the second error seems to concern kite-data-mapreduce.jar. However, doing the same for this file:
<file>/user/hdfs/sqoop/kite-data-mapreduce.jar</file>
does not seem to solve the problem:
Failing Oozie Launcher, Main class
[org.apache.oozie.action.hadoop.SqoopMain], main() threw exception,
org/kitesdk/data/DatasetNotFoundException
java.lang.NoClassDefFoundError:
org/kitesdk/data/DatasetNotFoundException
It seems strange that this is not automatically configured by Ambari and that we have to copy jar files into HDFS as we start getting errors.
Is this the correct methodology or did I miss some configuration step?
This is happening due to missing jars in the classpath. I would suggest setting the property oozie.use.system.libpath=true in the job.properties file; all the Sqoop-related jars (including the Kite jars) will then be added to the classpath automatically from /user/oozie/share/lib/lib_<timestamp>/sqoop/*.jar. After that, add only the custom jars you need to the lib directory of the workflow application path.
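A sketch of how that could look for this case, assuming a hypothetical workflow application path (only the ShareLib property and the JDBC jar are taken from the question):

```properties
# Sqoop ShareLib jars, including kite-data-*, are added automatically
# from /user/oozie/share/lib/lib_<timestamp>/sqoop/
oozie.use.system.libpath=true
# hypothetical application path; put only custom jars such as
# postgresql-jdbc.jar under its lib/ subdirectory
oozie.wf.application.path=${nameNode}/user/hdfs/sqoop-wf
```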

How do I specify multiple libpath in oozie job?

My oozie job uses two jars, x.jar and y.jar, and the following is my job.properties file.
oozie.libpath=/lib
oozie.use.system.libpath=true
This works perfectly when both jars are present at the same location on HDFS, at /lib/x.jar and /lib/y.jar.
Now I have 2 jars placed at different locations /lib/1/x.jar and /lib/2/y.jar.
How can I re-write my code such that both the jars are used while running the map reduce job?
Note: I have already referenced the answer How to specify multiple jar files in oozie, but it does not solve my problem.
I found the answer at
http://blog.cloudera.com/blog/2014/05/how-to-use-the-sharelib-in-apache-oozie-cdh-5/
It turns out that I can specify multiple paths, separated by commas, in the job.properties file:
oozie.libpath=/path/to/jars,another/path/to/jars
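Applied to the layout in the question, the job.properties would become:

```properties
oozie.libpath=/lib/1,/lib/2
oozie.use.system.libpath=true
```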

How to execute map reduce program(ex. wordcount) from HDFS and see the output?

I am new to Hadoop. I have a simple wordcount program in Eclipse which takes input files and then shows the output. But I need to execute the same program from HDFS. I have already created a JAR file for the wordcount program.
Can anyone please let me know how to proceed?
You need to have a cluster set up, even if it is a single-node cluster. Then you can run your .jar from the hadoop command line:
jar
Runs a jar file. Users can bundle their Map Reduce code in a jar
file and execute it using this command.
Usage: hadoop jar <jar> [mainClass] args...
Streaming jobs are also run via this command; see the Streaming examples.
The word count example is likewise run using the jar command; see the Wordcount example.
Initially you need to set up a Hadoop cluster, as discussed by Remus.
Single Node Setup and Multi Node Setup are two good ways to start.
Once you have the setup done, start the Hadoop daemons and copy the input files into an HDFS directory.
Prepare the jar of your program.
Run the jar from the terminal using hadoop jar <your jar name> <your main class> <input path> <output directory path>
(The jar arguments depend on your program.)
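Putting the steps above together, a typical session might look like this sketch (paths, file names, and the main class name are placeholders):

```shell
# copy local input into HDFS
hadoop fs -mkdir -p /user/me/wordcount/input
hadoop fs -put input.txt /user/me/wordcount/input

# run the jar; the output directory must not already exist
hadoop jar wordcount.jar WordCount /user/me/wordcount/input /user/me/wordcount/output

# inspect the result
hadoop fs -cat /user/me/wordcount/output/part-r-00000
```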

Hadoop Mapreduce with two jars (one of the jars is needed on namenode only)

The MapReduce task is a very simple 'wordcount' implemented in Java (please see http://wiki.apache.org/hadoop/WordCount).
After the last line, "job.waitForCompletion(true);", I add some code implemented in Jython.
This means the Jython libraries are only needed on the namenode.
However, I added all the Jython libraries to a single jar and then executed it:
hadoop jar wordcount.jar in out
The wordcount completes without any problem.
The problem I want to solve is that the heavy Jython libraries are not needed on the slave nodes (mappers and reducers). The jar is almost 15 MB, of which more than 14 MB is Jython.
Can I split them and get the same results?
Nobody answered this question, so I've solved the problem myself as follows, even if it's not the best approach:
Simply copy jython.jar to /usr/local/hadoop (or wherever Hadoop is installed), which is on the default classpath of Hadoop, and build your jar without jython.jar.
If you need very big libraries for a MapReduce task, then:
upload jython.jar to HDFS:
hadoop fs -put jython.jar Lib/jython.jar
add the following line to your main code (the deprecated DistributedCache API takes a Path and the job Configuration):
DistributedCache.addFileToClassPath(new Path("Lib/jython.jar"), conf);
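In context, the driver from the answer above might look like this sketch, using the mapred-era DistributedCache API (deprecated in Hadoop 2; the class name and omitted mapper/reducer setup are assumptions):

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // make Lib/jython.jar (already uploaded to HDFS) available on
        // the classpath of every map and reduce task
        DistributedCache.addFileToClassPath(new Path("Lib/jython.jar"), conf);

        Job job = Job.getInstance(conf, "wordcount");
        job.setJarByClass(WordCountDriver.class);
        // ... set mapper, reducer, input and output paths as in the
        // WordCount example referenced above ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

This keeps the submitted jar small; only the namenode-side Jython code stays in it, while the cached jar is distributed by the framework.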
