How do I specify multiple libpaths in an Oozie job?

My Oozie job uses two jars, x.jar and y.jar, and the following is my job.properties file:
oozie.libpath=/lib
oozie.use.system.libpath=true
This works perfectly when both jars are at the same HDFS location, /lib/x.jar and /lib/y.jar.
Now the two jars are placed at different locations, /lib/1/x.jar and /lib/2/y.jar.
How can I rewrite my job.properties so that both jars are used when running the MapReduce job?
Note: I have already referenced the answer to How to specify multiple jar files in oozie, but it does not solve my problem.

Found the answer at
http://blog.cloudera.com/blog/2014/05/how-to-use-the-sharelib-in-apache-oozie-cdh-5/
It turns out that I can specify multiple paths, separated by commas, in the job.properties file:
oozie.libpath=/path/to/jars,another/path/to/jars
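Applied to the layout in the question, the job.properties becomes a sketch like this (the libpath entries point at the directories, not at the jars themselves):
oozie.libpath=/lib/1,/lib/2
oozie.use.system.libpath=true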

Related

How does the example find the lib in Oozie?

Following the Oozie documentation, I am trying to run a map-reduce example on Oozie. As everyone knows, workflow.xml (and coordinator.xml) should be in HDFS.
Then I run the command: oozie job -oozie http://localhost:11000/oozie -config examples/apps/map-reduce/job.properties -run. I also know that job.properties should be on the local file system.
But two things confuse me:
1. Why do the jar and class referenced in workflow.xml come from the lib directory in HDFS?
2. There is a picture showing the contents of oozie-examples-4.3.1.jar. This jar is in HDFS, so how can it import lib?
Forgive my poor English.
The highlighted red box is part of the Hadoop and Java default classpath. Any Java code that is run within YARN as part of MapReduce has access to the packages that appear when you run the hadoop classpath command. By the way, the mapred.* classes of Hadoop are almost all deprecated.
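For reference, you can print that default classpath yourself from a shell on any node:
hadoop classpath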
That has nothing to do with Oozie per se, but Oozie extends the Hadoop classpath with the Oozie ShareLib, which must be explicitly enabled with a property file argument:
oozie.use.system.libpath=true
And in addition to that classpath, Oozie will ship the ${wf.application.path}/lib directory to all running jobs.
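As an illustrative sketch (the deployment path is made up), anything under the workflow application's lib directory ends up on the job's classpath automatically:
/user/alice/examples/apps/map-reduce/workflow.xml
/user/alice/examples/apps/map-reduce/lib/oozie-examples-4.3.1.jar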

Run Spark job with properties files

As a beginner with the Hadoop stack, I would like to run my Spark job with spark-submit via Oozie. I have a jar containing my compiled project sources, and I also have a set of properties files (about 20). I want my Spark job, when it runs, to load these properties files from a separate folder next to the folder containing the compiled jar. I've tried:
In my Oozie job.properties, I added:
oozie.libpath=[path to the folder including all of my properties files]
and oozie.use.system.libpath=true.
On the spark-submit command, I added --files or --properties-file, but it's not working (it doesn't accept the folder).
Thanks for any suggestions, and feel free to ask for more details if my question is not clear.
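One hedged observation: spark-submit's --files flag accepts a comma-separated list of files rather than a directory, so a sketch along these lines (jar, class, and file names are illustrative) may work where passing the folder does not:
spark-submit --master yarn \
  --files /path/to/conf/app1.properties,/path/to/conf/app2.properties \
  --class com.example.MySparkJob my-spark-job.jar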

Running a hadoop job

This is the first time I'm running a job on Hadoop, and I started from the WordCount example. To run my job, I'm using this command:
hduser#ubuntu:/usr/local/hadoop$ bin/hadoop jar hadoop*examples*.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output
I think we should copy the jar file to /usr/local/hadoop. My first question is: what is the meaning of hadoop*examples*? And if we want to put our jar file in another location, for example /home/user/WordCountJar, what should I do? Thanks for your help in advance.
I think we should copy the jar file to /usr/local/hadoop
It is not mandatory. But if you have your jar at some other location, you need to specify the complete path while running your job.
My first question is that what is the meaning of hadoop*examples*?
hadoop*examples* is the name of the jar package that contains your MR job along with other dependencies. Here, * signifies that it can be any version, not specifically 0.19.2 or something else. That said, I feel it should be hadoop-examples-*.jar and not hadoop*examples*.jar.
And if we want to put our jar file in another location, for example /home/user/WordCountJar, what should I do?
If your jar is present in a directory other than the directory from where you are executing the command, you need to specify the complete path to your jar. Say,
bin/hadoop jar /home/user/WordCountJar/hadoop-*-examples.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output
The asterisks around examples are just wildcard expansion to account for different version numbers in the file name. For example: hadoop-0.19.2-examples.jar.
You can use the full path to your jar like so:
bin/hadoop jar /home/user/hadoop-0.19.2-examples.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output

How to specify multiple jar files in oozie

I need a solution for the following problem:
My project has two jars: one contains all the bean classes (Employee, etc.), and the other contains the MR jobs that use the bean classes from the first jar. When I try to run an MR job as a simple Java program, I get a class-not-found error (com.abc.Employee is not found, since it lives in the other jar). Can anyone suggest how to solve this? In practice there may be many jars, not just one or two, so how do I specify all of them?
You should have a lib folder in the HDFS directory where you are storing your Oozie workflow. You can place both jar files in this folder, and Oozie will ensure both are on the classpath when your MR job executes:
hdfs://namenode:8020/path/to/oozie/app/workflow.xml
hdfs://namenode:8020/path/to/oozie/app/lib/first.jar
hdfs://namenode:8020/path/to/oozie/app/lib/second.jar
See Workflow Application Deployment for more details.
If you often use the same jars across a number of Oozie workflows, you can place these common jars (HBase jars, for example) in a directory in HDFS and then set an Oozie property to include that folder's jars. See HDFS Share Libraries for more details.
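A sketch of a job.properties combining the two approaches (paths are illustrative):
# per-workflow jars are picked up from the application's lib/ folder automatically
oozie.wf.application.path=hdfs://namenode:8020/path/to/oozie/app
# common jars shared across workflows
oozie.libpath=hdfs://namenode:8020/common/jars
oozie.use.system.libpath=true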

Hadoop Mapreduce with two jars (one of the jars is needed on namenode only)

The MapReduce task is a very simple 'wordcount' implemented in Java (please see http://wiki.apache.org/hadoop/WordCount).
After the last line, job.waitForCompletion(true), I add some code implemented in Jython.
This means the Jython libraries are only needed on the namenode.
However, I added all the Jython libraries to a single jar and then executed it:
hadoop jar wordcount.jar in out
The wordcount completes without any problem.
The problem I want to solve is that I have heavy Jython libraries that are not needed on the slave nodes (the mappers and reducers): the jar is almost 15 MB, of which more than 14 MB is Jython.
Can I split them, and get the same results?
Nobody has answered this question, so I solved the problem myself as follows, even if it's not the best approach.
Simply copy jython.jar to /usr/local/hadoop (or wherever Hadoop is installed), which is on Hadoop's default classpath, and build your jar without jython.jar.
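As a sketch of that first approach (adjust the install path to your Hadoop home):
cp jython.jar /usr/local/hadoop/
# then rebuild wordcount.jar without bundling jython.jar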
If you need very big libraries for a MapReduce task, then:
upload jython.jar to HDFS:
hadoop fs -put jython.jar Lib/jython.jar
add the following line to your driver code (conf is the job's Configuration; DistributedCache.addFileToClassPath takes a Path, not a URI):
DistributedCache.addFileToClassPath(new Path("Lib/jython.jar"), conf);
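For context, here is a minimal driver sketch showing where that call sits, assuming the Hadoop 1.x-era API this answer uses; the class name is illustrative, and the mapper/reducer wiring from the WordCount example is omitted:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Ship jython.jar from HDFS onto every task's classpath at runtime,
        // so it no longer has to be bundled into wordcount.jar.
        DistributedCache.addFileToClassPath(new Path("Lib/jython.jar"), conf);

        Job job = new Job(conf, "wordcount");
        job.setJarByClass(WordCountDriver.class);
        // Set the mapper, reducer, and key/value types as in the WordCount example.
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. "in"
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. "out"
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}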
