Third-party JARs in a MapReduce job - Hadoop

I have a situation where my MapReduce job depends on third-party libraries such as hive-hcatalog-xxx.jar. I run all my jobs through Oozie, and the MapReduce jobs are launched via a Java action. What is the best way to include third-party libraries in my job? I have two options in hand:

1. Bundle all the dependent jars into the main jar and create a fat jar.
2. Keep all the dependent jars in an HDFS location and add them via the -libjars option.

Which one should I choose? Please advise.

Since my MapReduce job is invoked through an Oozie Java action, the libraries available in the Oozie lib folder are not added to the classpath of the mapper/reducer. If I change this Java action to a MapReduce action, will the jars be available?

Thanks in advance.

1. Bundle all the dependent jars into the main jar and create a fat jar.
OR
2. Keep all the dependent jars in an HDFS location and add them via the -libjars option.

Which one should I choose?
Although both approaches are in practice, I'd suggest an uber jar,
i.e. your first approach.
Uber jar: a jar that has a lib/ folder inside which carries further dependent jars (a structure known as an 'uber' jar). When you submit the job via a regular 'hadoop jar' command, these lib/*.jar files get picked up by the framework because the supplied jar is specified explicitly via conf.setJarByClass or conf.setJar. That is, if this user uber jar goes to the JT as the mapred...jar, then it is handled by the framework properly and the lib/*.jar files are all considered and placed on the classpath.
Why
The advantage is that you can distribute your uber-jar and not care at all whether or not dependencies are installed at the destination, as your uber-jar actually has no dependencies.
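As a rough sketch, such an uber jar can be assembled by hand with the JDK's jar tool; all paths and jar names below are illustrative, not from the original post:

```shell
# Sketch: build an uber jar whose dependencies sit in a lib/ folder.
mkdir -p build/lib
cp hive-hcatalog-core.jar build/lib/     # third-party dependency goes under lib/
cp -r classes/com build/                 # your compiled job classes at the top level
jar cf myjob-uber.jar -C build .         # package everything into one jar
jar tf myjob-uber.jar                    # verify the lib/ entries are present

# Submit as usual; the framework places lib/*.jar on the task classpath:
hadoop jar myjob-uber.jar com.example.MyDriver input output
```

In real projects this layout is normally produced by a build tool (e.g. the Maven assembly or shade plugin) rather than by hand.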
As my MapReduce job is invoked through an Oozie Java action, the
libraries available in the Oozie lib folder are not added to the
classpath of the mapper/reducer. If I change this Java action to a
MapReduce action, will the jars be available?
For the above question, since the answer is broad,
I have sharelib links from CDH4.x and CDH5.x, and
a guide on how to configure a MapReduce action with the Oozie share lib, for you.

You can certainly adopt the approaches you suggested, but Oozie has a sharelib prepared for HCatalog. You can use it out of the box with the oozie.action.sharelib.for.#actiontype# property in your job.properties. For the Java action you can specify:
oozie.action.sharelib.for.java=hcatalog
This will load the libraries from the Oozie share lib hcatalog into your launcher job, which should do the job.
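For instance, a minimal job.properties might look like this (host names and the application path are illustrative; oozie.use.system.libpath=true is also needed for the share lib to be picked up):

```
nameNode=hdfs://namenode:8020
jobTracker=jobtracker:8032
oozie.wf.application.path=${nameNode}/user/me/myapp
oozie.use.system.libpath=true
oozie.action.sharelib.for.java=hcatalog
```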
You can checkout the content of the hcatalog here:
hdfs dfs -ls /user/oozie/share/lib/lib_*/hcatalog

Related

how to change flink fat jar to thin jar

Can I move the dependency jars to HDFS, so I can run a thin jar without the dependency jars?
The operations and maintenance engineers do not allow me to move jars into the Flink lib folder.
Not sure what problem you are trying to solve, but you might want to consider an application mode deployment if you are using yarn:
./bin/flink run-application -t yarn-application \
-Dyarn.provided.lib.dirs="hdfs://myhdfs/remote-flink-dist-dir" \
"hdfs://myhdfs/jars/MyApplication.jar"
In this example, MyApplication.jar isn't a thin jar, but the job submission is very lightweight as the needed Flink jars and the application jar are picked up from HDFS rather than being shipped to the cluster by the client. Moreover, the application’s main() method is executed on the JobManager.
Application mode was introduced in Flink 1.11, and is described in detail in this blog post: Application Deployment in Flink: Current State and the new Application Mode.

How to figure out what JARs are in the Hadoop classpath in the HDP 2.5 sandbox?

How do I figure out what JARs are in the Hadoop classpath?
I'm using the Hortonworks 2.5 sandbox and want to run my custom application using the Hadoop JARs already present in the sandbox.
There is a command, hadoop classpath, that does exactly what you need.
Please refer here for more details:
https://community.hortonworks.com/questions/27780/where-exactly-classpaths-for-hadoop-are-present-in.html
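For example (the output varies by distribution; the --glob flag, available in Hadoop 2.6+, expands wildcard entries into concrete jar paths):

```shell
# Print the classpath Hadoop itself uses (entries may contain wildcards like share/hadoop/common/*)
hadoop classpath

# Expand wildcard entries into the actual jar files
hadoop classpath --glob
```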

How to include jars in Hive (Amazon Hadoop env)

I need to include newer protobuf jar (newer than 2.5.0) in Hive. Somehow no matter where I put the jar - it's being pushed to the end of the classpath. How can I make sure that the jar is in the beginning of the classpath of Hive?
To add your own jar to the Hive classpath so that it's included at the beginning of the classpath and not overridden by some Hadoop jar, you need to set the following environment variable:
export HADOOP_USER_CLASSPATH_FIRST=true
This indicates that HADOOP_CLASSPATH will gain priority over the general Hadoop jars.
On Amazon EMR instances you can add this to /home/hadoop/conf/hadoop-env.sh, and modify the classpath in that file as well.
This is useful when you want to override jars like protobuf that come with the general Hadoop classpath.
The other thing you might consider is including the protobuf classes in your own jar. You would need to build your jar with the assembly plugin, which will include those classes. It's an option.
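Putting the two settings together, the hadoop-env.sh fragment would look roughly like this (the protobuf jar path and version are illustrative):

```
# hadoop-env.sh: prepend our newer protobuf jar and give user entries priority
export HADOOP_USER_CLASSPATH_FIRST=true
export HADOOP_CLASSPATH=/home/hadoop/lib/protobuf-java-2.6.1.jar:$HADOOP_CLASSPATH
```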

How do I add thirdparty jar to classpath on HDP sandbox?

I have a third-party jar that I am using for MapReduce, and the container that runs the MapReduce tasks needs my jar. I've tried adding it in yarn-site.xml, in the YARN_USER_CLASSPATH variable, and in a bunch of lib folders in the Hadoop directory, but no luck. HortonWorks did not have much on their site about classpaths, so I am trying here.
You need to set
YARN_USER_CLASSPATH_FIRST
so YARN will search your custom classpath first. I found this in the yarn command script:
https://github.com/apache/hadoop/blob/release-2.6.0/hadoop-yarn-project/hadoop-yarn/bin/yarn#L27
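In practice that means exporting both variables before launching your application (the jar path is illustrative); per the yarn script linked above, YARN_USER_CLASSPATH holds the extra entries and YARN_USER_CLASSPATH_FIRST moves them to the front:

```
export YARN_USER_CLASSPATH=/opt/myapp/lib/thirdparty.jar
export YARN_USER_CLASSPATH_FIRST=1
```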

Duplicate jars while deploying oozie job

I have an Oozie job which uses the jets3t v0.9 jar.
By default, Oozie loads the jets3t v0.6 jar from the Hadoop lib directory.
Because of this, both jars get loaded and I am getting a java.lang.VerifyError.
Is there any way to stop Oozie from loading certain libraries?
Or any other way to solve this issue?
Where are you keeping your jar file? Keep it in the lib folder inside the workflow application folder; it will take priority over the default Hadoop version.
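The expected HDFS layout for the workflow application would then be along these lines (paths and the jar file name are illustrative):

```
/user/me/myapp/workflow.xml
/user/me/myapp/lib/jets3t-0.9.0.jar
```

If the older jar still wins, setting oozie.launcher.mapreduce.job.user.classpath.first=true in the action's configuration is another commonly used knob to prefer user-supplied jars over the cluster's.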