I need to include a newer protobuf jar (newer than 2.5.0) in Hive. Somehow, no matter where I put the jar, it ends up at the end of the classpath. How can I make sure the jar is at the beginning of Hive's classpath?
To add your own jar to the Hive classpath so that it is placed at the beginning of the classpath and not shadowed by some Hadoop jar, you need to set the following environment variable:
export HADOOP_USER_CLASSPATH_FIRST=true
This tells Hadoop that HADOOP_CLASSPATH takes priority over the general Hadoop jars.
On Amazon EMR instances you can add this to /home/hadoop/conf/hadoop-env.sh, and modify the classpath in that file as well.
This is useful when you want to override jars like protobuf that come with the general Hadoop classpath.
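For example, the additions to hadoop-env.sh might look roughly like this (the protobuf path and version below are only an illustration, adjust them to wherever you actually placed the jar):

# force user-supplied jars to the front of the classpath
export HADOOP_USER_CLASSPATH_FIRST=true
# prepend the newer protobuf jar (hypothetical path and version)
export HADOOP_CLASSPATH=/home/hadoop/lib/protobuf-java-2.6.1.jar:$HADOOP_CLASSPATH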
The other thing you might consider doing is including the protobuf classes in your own jar. You would need to build your jar with the assembly plugin, which will bundle those classes. It's an option.
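Whichever route you take, a quick sanity check (assuming the standard hadoop CLI is on your PATH) is to print the effective classpath and confirm which protobuf jar comes first:

# list classpath entries one per line and show the protobuf jars in order
hadoop classpath | tr ':' '\n' | grep -i protobuf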
Related
What is the purpose of spark-submit? From what I can see, it is just adding properties and jars to the classpath.
If I am using Spring Boot, can I avoid using spark-submit and just package a fat jar with all the properties I want (spark.master, etc.)?
Can people see any downside to doing this?
Recently I hit the same case and also tried to stick with a Spring Boot executable jar, which unfortunately failed in the end, though I was close. The state when I gave up was: a Spring Boot jar built without the Spark/Hadoop libs included, which I ran on the cluster with -Dloader.path='spark/hadoop libs list extracted from SPARK_HOME and HADOOP_HOME on cluster'. I ended up using the second option: building a fat jar with the shade plugin and running it as a usual jar via spark-submit, which seems a somewhat strange solution but still works fine.
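For reference, a rough sketch of the two launch styles described above (the jar names, class name and lib directories are placeholders, and the first variant assumes the Boot jar was built to use Spring Boot's PropertiesLauncher so that -Dloader.path is honoured):

# option 1: Spring Boot executable jar, Spark/Hadoop libs supplied at runtime
java -Dloader.path=/opt/spark/jars,/opt/hadoop/share/hadoop/common \
     -jar myapp-boot.jar

# option 2: shaded fat jar submitted the usual way
spark-submit --master yarn --deploy-mode cluster \
     --class com.example.MyApp myapp-shaded.jar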
Using Spark over HBase and Hadoop with YARN, an assembly library, among other libraries, is provided server side (named something like spark-looongVersion-hadoop-looongVersion.jar); it includes numerous libraries.
When the Spark jar is sent as a job to the server for execution, conflicts may arise between the libraries included in the job and the server libraries (the assembly jar and possibly other libraries).
I need to include this assembly jar as a "provided" Maven dependency to avoid conflicts between client dependencies and the server classpath.
How can I deploy and use this assembly jar as a provided dependency?
How can I deploy and use this assembly jar as a provided dependency?
An assembly jar is a regular jar file, and so, like any other jar file, it can be a library dependency as long as it's available in an artifact repository to download it from, e.g. Nexus, Artifactory or similar.
The quickest way to do it is to "install" it in your Maven local repository (see Maven's Guide to installing 3rd party JARs). That however binds you to what you have locally available and so will quickly get out of sync with what other teams are using.
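A minimal sketch of that local install (the Maven coordinates here are made up, pick whatever groupId/artifactId/version your team agrees on):

mvn install:install-file \
  -Dfile=spark-looongVersion-hadoop-looongVersion.jar \
  -DgroupId=com.example.cluster \
  -DartifactId=server-assembly \
  -Dversion=1.0 \
  -Dpackaging=jar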
The recommended way is to deploy the dependency using Apache Maven Deploy Plugin.
Once it's deployed, declaring it as a dependency is not different from declaring other dependencies.
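Deploying to a shared repository is the same idea, just pointed at the remote repo (the URL and repositoryId below are placeholders and must match a server entry in your settings.xml); once the artifact is there, declare it in your pom with <scope>provided</scope> like any other dependency:

mvn deploy:deploy-file \
  -Dfile=spark-looongVersion-hadoop-looongVersion.jar \
  -DgroupId=com.example.cluster \
  -DartifactId=server-assembly \
  -Dversion=1.0 \
  -Dpackaging=jar \
  -Durl=https://nexus.example.com/repository/thirdparty/ \
  -DrepositoryId=thirdparty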
Provided dependencies scope
Spark dependencies must be excluded from the assembled JAR. If not, you should expect weird errors from the Java classloader during application startup. An additional benefit of an assembly without Spark dependencies is faster deployment. Please remember that the application assembly must be copied over the network to a location accessible by all cluster nodes (e.g. HDFS or S3).
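A sketch of that last step, assuming a YARN cluster and an HDFS location every node can read (paths and class name are illustrative):

# put the application assembly somewhere all nodes can reach
hdfs dfs -put myapp-assembly-1.0.jar /apps/myapp/

# submit in cluster mode, referencing the copy on HDFS
spark-submit --master yarn --deploy-mode cluster \
  --class com.example.Main \
  hdfs:///apps/myapp/myapp-assembly-1.0.jar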
I have a situation where my MapReduce job depends on third-party libraries like hive-hcatalog-xxx.jar. I am running all my jobs through Oozie. The MapReduce jobs are run via a java action. What is the best way to include third-party libraries in my job? I have two options in hand:
Bundle all the dependent jars into the main jar and create a fat jar.
Keep all the dependent jars in an HDFS location and add them via the -libjars option
Which one should I choose? Please advise.
Since my MapReduce job is invoked through a java action in Oozie, the libraries available in the Oozie lib folder are not added to the classpath of the mapper/reducer. If I change this java action to a map-reduce action, will the jars be available?
Thanks in advance.
1. Bundle all the dependent jars into the main jar and create a fat jar.
OR
2. Keep all the dependent jars in an HDFS location and add them via the -libjars option.
Which one should I choose?
Although both approaches are in practice, I'd suggest an uber jar, i.e. your first approach.
Uber jar: a jar that has a lib/ folder inside which carries more dependent jars (a structure known as an 'uber' jar). You submit the job via a regular 'hadoop jar' command, and these lib/*.jars get picked up by the framework because the supplied jar is specified explicitly via conf.setJarByClass or conf.setJar. That is, if this user uber jar goes to the JT as the mapred...jar, then it is handled by the framework properly and the lib/*.jars are all considered and placed on the classpath.
Why
The advantage is that you can distribute your uber-jar and not care at all whether or not dependencies are installed at the destination, as your uber-jar actually has no dependencies.
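As a rough illustration of what that looks like in practice (the jar and class names are hypothetical), you can confirm the nested lib/ layout and then submit with a plain hadoop jar command:

# the dependent jars live under lib/ inside the uber jar
jar tf myjob-uber.jar | grep '^lib/'

# submit as usual; per the explanation above, the lib/*.jars end up on the task classpath
hadoop jar myjob-uber.jar com.example.MyDriver /input /output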
Since my MapReduce job is invoked through a java action in Oozie, the libraries available in the Oozie lib folder are not added to the classpath of the mapper/reducer. If I change this java action to a map-reduce action, will the jars be available?
For the above question, since the answer is broad, I have share lib links for CDH4.xx, CDH5.xx, and "How to configure a MapReduce action with the Oozie share lib" for you.
You can obviously adopt the approaches you suggested, but Oozie has a share lib prepared for HCatalog. You can use it out of the box with the oozie.action.sharelib.for.<action-type> property in your job.properties. For the java action you can specify:
oozie.action.sharelib.for.java=hcatalog
This will load the libraries from the Oozie share lib's hcatalog directory into your launcher job. This should do the job.
You can check out the contents of the hcatalog share lib here:
hdfs dfs -ls /user/oozie/share/lib/lib_*/hcatalog
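Putting it together, the relevant job.properties entries would be along these lines (a sketch, not the asker's actual config), followed by the usual submit command (the Oozie host below is a placeholder using the default port):

# enable the system share lib and pick the hcatalog libs for the java action
oozie.use.system.libpath=true
oozie.action.sharelib.for.java=hcatalog

oozie job -oozie http://oozie-host:11000/oozie -config job.properties -run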
I wrote a UDF that uses some external libraries such as jackson-databind, etc. How can I specify where Pig should look for these external libraries?
Thanks
What if you compile all your dependencies into a single fat jar?
You can specify the additional jars using the following syntax:
pig -Dpig.additional.jars="xxx.jar:yyy.jar" -f script.pig
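If you have a whole directory of such jars, a small shell helper can build the colon-separated list for you (the directory path is just an example):

# collect every jar in a local lib directory into pig.additional.jars
ADDL_JARS=$(ls /home/me/udf-libs/*.jar | paste -sd: -)
pig -Dpig.additional.jars="$ADDL_JARS" -f script.pig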
Having a jar with dependencies might cause problems in case the packaged dependencies and the cluster-installed dependencies are not compatible. This will also make your program future-proof, I would assume.
I have a third-party jar that I am using for MapReduce, and the container that runs the MapReduce tasks needs my jar. I've tried adding it in yarn-site.xml, the YARN_USER_CLASSPATH variable, and a bunch of lib folders in the Hadoop directory, but no luck. Hortonworks did not have much on their site about classpaths, so I am trying here.
You need to set
YARN_USER_CLASSPATH_FIRST
so that YARN will search your custom classpath first. I found this in the yarn command script:
https://github.com/apache/hadoop/blob/release-2.6.0/hadoop-yarn-project/hadoop-yarn/bin/yarn#L27
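Based on that script, something along these lines in your environment (the jar path is a placeholder) before invoking the yarn command should put your jar ahead of the bundled defaults:

# extra entries for the yarn launcher classpath
export YARN_USER_CLASSPATH=/opt/myapp/lib/my-third-party.jar
# make those entries win over the bundled hadoop/yarn jars
export YARN_USER_CLASSPATH_FIRST=1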