I wrote a UDF that uses some external libraries such as jackson-databind. How can I specify where Pig should look for these external libraries?
Thanks
What if you package all your dependencies into a single fat jar?
You can specify the additional Jars using the syntax -
pig -Dpig.additional.jars="xxx.jar:yyy.jar" -f script.pig
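For example (the jar paths below are placeholders, not from the question), the Jackson jars your UDF needs could be passed like this:
pig -Dpig.additional.jars="/path/to/jackson-databind.jar:/path/to/jackson-core.jar" -f script.pig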
Having a jar with bundled dependencies might cause problems in case the packaged dependencies and the cluster-installed dependencies are not compatible. Passing the jars via pig.additional.jars instead will also make your program more future proof, I would assume.
I am using Spark over HBase and Hadoop with YARN. An assembly library, among other libraries, is provided server side (named something like spark-looongVersion-hadoop-looongVersion.jar); it includes numerous libraries.
When the Spark jar is sent as a job to the server for execution, conflicts may arise between the libraries included in the job and the server libraries (the assembly jar and possibly other libraries).
I need to include this assembly jar as a "provided" Maven dependency to avoid conflicts between the client dependencies and the server classpath.
How can I deploy and use this assembly jar as a provided dependency?
An assembly jar is a regular jar file, and like any other jar file it can be a library dependency as long as it's available in an artifact repository to download it from, e.g. Nexus, Artifactory or similar.
The quickest way to do it is to "install" it in your Maven local repository (see Maven's Guide to installing 3rd party JARs). That however binds you to what you have locally available and so will quickly get out of sync with what other teams are using.
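As a rough sketch (the file name and Maven coordinates below are made up for illustration), a local install would look something like:
mvn install:install-file -Dfile=spark-assembly-1.6.1-hadoop2.6.0.jar \
    -DgroupId=org.example -DartifactId=spark-assembly -Dversion=1.6.1 -Dpackaging=jar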
The recommended way is to deploy the dependency using the Apache Maven Deploy Plugin.
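A sketch of deploying it to a shared repository (the URL, repository id, and coordinates are placeholders for whatever your team uses):
mvn deploy:deploy-file -Dfile=spark-assembly-1.6.1-hadoop2.6.0.jar \
    -DgroupId=org.example -DartifactId=spark-assembly -Dversion=1.6.1 -Dpackaging=jar \
    -Durl=https://nexus.example.com/repository/thirdparty/ -DrepositoryId=thirdparty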
Once it's deployed, declaring it as a dependency is no different from declaring other dependencies.
Provided dependencies scope
Spark dependencies must be excluded from the assembled JAR. If not, you should expect weird errors from the Java classloader during application startup. An additional benefit of an assembly without Spark dependencies is faster deployment. Please remember that the application assembly must be copied over the network to a location accessible by all cluster nodes (e.g. HDFS or S3).
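As a sketch of that last point (paths, jar name and main class are placeholders, not from the question): copy the assembly to HDFS once, then point spark-submit at it:
# make the assembly reachable from every node, then submit the job
hdfs dfs -mkdir -p /apps/myapp
hdfs dfs -put target/myapp-assembly-1.0.jar /apps/myapp/
spark-submit --master yarn --deploy-mode cluster --class com.example.Main hdfs:///apps/myapp/myapp-assembly-1.0.jar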
I need to include a newer protobuf jar (newer than 2.5.0) in Hive. Somehow, no matter where I put the jar, it's being pushed to the end of the classpath. How can I make sure that the jar is at the beginning of Hive's classpath?
To add your own jar to the Hive classpath so that it's included at the beginning of the classpath and not shadowed by some Hadoop jar, you need to set the following environment variable -
export HADOOP_USER_CLASSPATH_FIRST=true
This indicates that the HADOOP_CLASSPATH will take priority over the general Hadoop jars.
On Amazon EMR instances you can add this to /home/hadoop/conf/hadoop-env.sh, and modify the classpath in that file as well.
This is useful when you want to override jars like protobuf that come with the general Hadoop classpath.
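A minimal sketch of what the relevant lines in hadoop-env.sh could look like (the protobuf jar path and version are placeholders):
export HADOOP_USER_CLASSPATH_FIRST=true
# prepend the newer protobuf jar so it wins over the bundled 2.5.0
export HADOOP_CLASSPATH="/home/hadoop/lib/protobuf-java-2.6.1.jar:$HADOOP_CLASSPATH"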
The other thing you might consider doing is including the protobuf classes in your own jar. You would need to build your jar with the assembly plugin, which will bundle those classes. It's an option.
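If you go that route, a sketch of the build step (this assumes the jar-with-dependencies descriptor is already configured for the assembly plugin in your pom):
mvn clean package assembly:single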
I'm running Spark's example called JavaPageRank, but it's a copy that I compiled separately using Maven into a new jar. I keep getting this error:
ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-0,5,main]
java.lang.NoClassDefFoundError: com/google/common/collect/Iterables
This happens despite the fact that Guava is listed as one of Spark's dependencies. I'm running Spark 1.6, which I downloaded pre-compiled from the Apache website.
Thanks!
The error means that the jar containing the com.google.common.collect.Iterables class is not on the classpath, so your application cannot find the required class at runtime.
If you are using Maven/Gradle, try to clean, build, and refresh the project. Then check your classes folder and make sure the Guava jar is in the lib folder.
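One quick way to check, assuming a Maven build, is to confirm that Guava shows up in the resolved dependency tree:
mvn dependency:tree -Dincludes=com.google.guava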
Hope this will help.
Good luck!
Can we pack jars inside a jar?
If so, how can we execute this in Unix?
I am able to pack everything in a zip, but not able to execute this without unpacking.
Is there any way I can avoid unpacking the zip ?
Thanks in advance.
Jars inside a jar are (part of) what a WAR (web archive) file can contain, because a WAR file can bundle everything needed to run a web application. There is no difference in the file format though: both use the ZIP format, just with different file-naming conventions. For more information:
Java war vs. jar - what is the difference?
Overview of WAR Files in Java EE Software Development
How do I create a war file using the jar command? (tutorial)
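As a quick illustration that both formats are plain ZIP archives (the file names here are placeholders), you can inspect either one with standard ZIP tools:
unzip -l myapp.war
unzip -l mylib.jar
# or, using the JDK's own tool:
jar tf myapp.war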
Yay, my thesis is done! Now that the pressure is off and I've had my fill of playing Skyrim, I'm converting the code I wrote for my thesis from a chaotic directory built with ant to a nice maven project.
I originally had a bin directory with about 20 bash scripts that ran the various java and ruby programs used in my thesis, including the final jruby/sinatra-based web server. I am planning on moving my scripts to src/main/scripts, but I need to figure out how to handle the classpath.
I had previously just hardcoded paths in my scripts to the manually-downloaded dependencies. However, now that maven is downloading and storing all the jars I need, what's the best way to reference them from my scripts:
Should I just get the scripts to reference the full paths of various jars in the local repository like before?
Should I make the local repository directory a configuration option for my scripts and use relative paths to this directory?
Should I build a big hairy jar with all the dependencies using the maven assembly plugin and access this via the script-relative path ../../../target/*-jar-with-dependencies.jar?
Is there some better option I haven't thought of?
In your script, use the exec:java plugin to run Java classes. It will sort out the classpath based on the defined dependencies. Then you don't need to worry about it.
Take another look at all the scripts that you have. You could potentially achieve the functionality of some of them using the Maven exec plugin.
Besides the assembly and shade plugins, you may want to look at the functionality provided by the Maven dependency plugin as well.
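For instance (a sketch only; the output file, jar name, and main class are placeholders), the dependency plugin can write out the full runtime classpath for a script to pick up:
mvn dependency:build-classpath -Dmdep.outputFile=target/classpath.txt
java -cp "$(cat target/classpath.txt):target/myproject-1.0.jar" com.example.Main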
In my project (Soluvas fb-tools/fbcli), because I use Java 6 and later (which supports wildcard classpaths), here's what I do:
#!/bin/bash
# Must run first: mvn package dependency:copy-dependencies
java -cp 'target/dependency/*:target/fbcli-1.0.0-SNAPSHOT.jar' org.jboss.weld.environment.se.StartMain "$@"
No need for manual generation of classpaths. :)
There are quite a few plugins doing things similar to what you mentioned. The assembly plugin you mentioned is doubtless one of them (and the way you suggested is also a neat, working solution).
You may want to take a look at AppAssembler and Shade. They both provide mechanisms to bundle the dependencies and produce a directly executable package.
Here is a CLI example using the Maven plugin exec:java, as mentioned by @artbristol in another answer:
mvn exec:java -Dexec.mainClass="mypackage.MyClassWithMain" -Dexec.args="arg1 arg2"
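If the project isn't compiled yet, it's common to chain the compile phase into the same invocation, e.g.:
mvn compile exec:java -Dexec.mainClass="mypackage.MyClassWithMain" -Dexec.args="arg1 arg2"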