How to use an external Jar file in a Hadoop program

I have a Hadoop program in which I use a couple of external jar files. When I submit my program's jar file to the Hadoop cluster, it gives me the following error.
Exception in thread "main" java.lang.NoClassDefFoundError: edu/uci/ics/jung/graph/Graph
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:247)
at org.apache.hadoop.util.RunJar.main(RunJar.java:201)
I understand what the problem is but don't know how to solve it. How can I add the jar files to my program?

I think you can also modify the environment of the job's running task attempts explicitly by specifying the JAVA_LIBRARY_PATH or LD_LIBRARY_PATH variables:
hadoop jar [your jar] [main class]
-D mapred.child.env="LD_LIBRARY_PATH=/path/to/your/libs" ...
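For reference, the same property can also be set from driver code. This is only a minimal sketch with an illustrative library path; mapred.child.env is the older MRv1-style property name used above (newer releases use mapreduce.map.env / mapreduce.reduce.env):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class EnvExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Same effect as the -D option above; the path is a placeholder.
        conf.set("mapred.child.env", "LD_LIBRARY_PATH=/path/to/your/libs");
        Job job = Job.getInstance(conf, "env-example");
        // ... set jar, mapper, reducer and input/output paths as usual ...
    }
}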

You can use the -libjars option when submitting jobs, like this:
export LIBJARS=/path/jar1,/path/jar2
hadoop jar my-example.jar com.example.MyTool -libjars ${LIBJARS} -mytoolopt value
I would recommend reading this article, which describes precisely what you're looking for in detail:
http://grepalex.com/2013/02/25/hadoop-libjars/
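One detail worth noting: -libjars is handled by Hadoop's GenericOptionsParser, so it only takes effect if your main class runs through ToolRunner (i.e. implements Tool). Below is a minimal sketch of such a driver, assuming a class named com.example.MyTool as in the command above; the mapper, reducer and paths are left as placeholders:
package com.example;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyTool extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // getConf() already reflects the generic options (-libjars, -D, ...)
        // that GenericOptionsParser stripped from the command line; the
        // remaining arguments (e.g. -mytoolopt value) arrive here in args.
        Job job = Job.getInstance(getConf(), "my-tool");
        job.setJarByClass(MyTool.class);
        // ... set mapper, reducer and input/output paths from args ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner invokes GenericOptionsParser, which processes -libjars.
        System.exit(ToolRunner.run(new Configuration(), new MyTool(), args));
    }
}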

Add the external jar files to the hadoop/lib folder to get rid of this error.

Related

Is maven JAR runnable on hadoop?

To produce a jar from a Hadoop MapReduce program (the MapReduce WordCount example), I used Maven.
Maven 'clean' and 'install' completed successfully.
The program also ran successfully as a Java application with the input and output arguments, and it produced the expected result.
Now the problem is that it does not run on Hadoop.
It gives the following error:
Exception in thread "main" java.lang.ClassNotFoundException: WordCount
Is maven JAR runnable on hadoop?
Maven is a build tool which creates a Java artifact. Any JAR that contains the Hadoop dependencies and names the class with the main() method in its manifest file should work with Hadoop.
Try running your JAR using the below command
hadoop jar your-jar.jar wordcount input output
where "wordcount" is the name of the class with the main method, and "input" and "output" are the arguments.
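For reference, here is a minimal sketch of what such a driver class can look like. The package name, mapper and reducer are illustrative placeholders; note that with a package declaration the name you pass on the command line becomes the fully qualified one (e.g. com.test.examples.WordCount), which is the point the next answer makes:
package com.test.examples;  // illustrative package; use your own

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);      // tells Hadoop which jar to ship
        // job.setMapperClass(YourMapper.class);    <- plug in your own classes here
        // job.setReducerClass(YourReducer.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}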
Two things:
1) I think you are missing the package details before the class name. Copy your package name, put it before the class name, and it should work.
hadoop jar /home/user/examples.jar com.test.examples.WordCount /home/user/inputfolder /home/user/outputfolder
PS: If you are using a jar for which the source code is not available to you, you can run
jar -tvf /home/user/examples.jar
and it will print all the classes with their folder names. Replace the "/" with "." (dot) and you get the package name (see the short illustration after this answer). But this needs the JDK (not the JRE) in the PATH.
2) You are trying to run a MapReduce program from a Windows prompt. Are you sure you have Hadoop installed on your Windows?
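To illustrate point 1 (the names here are hypothetical): if jar -tvf lists an entry such as
com/test/examples/WordCount.class
then replacing "/" with "." gives the package com.test.examples, and the name to pass to hadoop jar is com.test.examples.WordCount.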

Shouldn't Oozie/Sqoop jar location be configured during package installation?

I'm using HDP 2.4 in CentOS 6.7.
I have created the cluster with Ambari, so Oozie was installed and configured by Ambari.
I got two errors while running Oozie/Sqoop related to jar file location. The first concerned postgresql-jdbc.jar, since the Sqoop job is incrementally importing from Postgres. I added the postgresql-jdbc.jar file to HDFS and pointed to it in workflow.xml:
<file>/user/hdfs/sqoop/postgresql-jdbc.jar</file>
It solved the problem. But the second error seems to concern kite-data-mapreduce.jar. However, doing the same for this file:
<file>/user/hdfs/sqoop/kite-data-mapreduce.jar</file>
does not seem to solve the problem:
Failing Oozie Launcher, Main class
[org.apache.oozie.action.hadoop.SqoopMain], main() threw exception,
org/kitesdk/data/DatasetNotFoundException
java.lang.NoClassDefFoundError:
org/kitesdk/data/DatasetNotFoundException
It seems strange that this is not automatically configured by Ambari and that we have to copy jar files into HDFS as we start getting errors.
Is this the correct methodology or did I miss some configuration step?
This is happening because of jars missing from the classpath. I would suggest you use the property oozie.use.system.libpath=true in the job.properties file; all the Sqoop-related jars will then be added to the classpath automatically from /user/oozie/share/lib/lib_<timestamp>/sqoop/*.jar. Then add only the custom jars you need to the lib directory of the workflow application path.
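For reference, a minimal job.properties sketch; apart from oozie.use.system.libpath, every value below is an illustrative placeholder for your own cluster:
nameNode=hdfs://your-namenode:8020
jobTracker=your-resourcemanager:8050
queueName=default
# pulls the shared Sqoop (and other) jars from /user/oozie/share/lib automatically
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/user/hdfs/sqoop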

How to add the Mahout XML classifier jar to a Hadoop cluster, as I don't want to add that library to the Hadoop classpath

I am parsing an XML file using XMLInputFormat.class, which is present in the mahout-examples jar, but while running the jar file of my MapReduce job I am getting the below error
Error: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.mahout.classifier.bayes.XmlInputFormat not found
Please let me know how I can make these jars available while running on a multi-node Hadoop cluster.
Include all the mahout-examples JARs in the "-libjars" command-line option of the hadoop jar ... command. The jars will be placed in the distributed cache and will be made available to all of the job's task attempts. More specifically, you will find the JAR in one of the ${mapred.local.dir}/taskTracker/archive/${user.name}/distcache/… subdirectories on the local nodes.
Please refer to this link for more details.
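If you would rather not pass -libjars on every invocation, an alternative is to add the jar to the distributed-cache classpath from the driver. A sketch, assuming the mahout-examples jar has already been uploaded to HDFS (the path and version are illustrative):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class XmlJobDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "xml-parse");
        job.setJarByClass(XmlJobDriver.class);
        // Ships the jar with the job and puts it on every task's classpath.
        job.addFileToClassPath(new Path("/user/hadoop/lib/mahout-examples-0.9-job.jar"));
        // job.setInputFormatClass(XmlInputFormat.class);  <- the class from that jar
        // ... rest of the job configuration ...
    }
}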

How to find jar dependencies when running Apache Pig script?

I am having some difficulties running a simple Pig script to import data into HBase using HBaseStorage.
The error I have encountered is given by:
Caused by: <file demo.pig, line 14, column 0> pig script failed to validate: java.lang.RuntimeException: could not instantiate 'org.apache.pig.backend.hadoop.hbase.HBaseStorage' with arguments '[rdf:predicate rdf:object]'
Caused by: java.lang.NoSuchMethodError: org.apache.hadoop.hbase.client.Scan.setCacheBlocks(Z)V
at org.apache.pig.backend.hadoop.hbase.HBaseStorage.initScan(HBaseStorage.java:427)
at org.apache.pig.backend.hadoop.hbase.HBaseStorage.<init>(HBaseStorage.java:368)
at org.apache.pig.backend.hadoop.hbase.HBaseStorage.<init>(HBaseStorage.java:239)
... 29 more
According to other questions and threads, the main response/answer to this issue would be to register the appropriate jars required for the HBaseStorage references. What I am stumped by is how I am supposed to identify the required JAR for a given Pig function.
I even tried to open the various jar files under the hbase and pig folders to ensure the appropriate classes are registered in the pig script.
For example, since java.lang.NoSuchMethodError was caused by org.apache.hadoop.hbase.client.Scan.setCacheBlocks(Z)V
I imported specifically the jar that contains org.apache.hadoop.hbase.client.Scan, to no avail.
Pig's documentation does not provide any obvious links or help that I can refer to.
I am using Hadoop 2.7.0, HBase 1.0.1.1, and Pig 0.15.0.
If you need any other clarification, feel free to ask me again. Would really appreciate it if someone could help me out with this issue.
Also, is it better to install Hadoop and the related software from scratch, or is it better to get one of the available Hadoop bundles directly?
There is something wrong with the released jar: hbase-client-1.0.1.1.jar
You can test it with this code; the error will show up:
Scan scan = new Scan();
scan.setCacheBlocks(true);
I've tried other setter functions, like setCaching; they throw the same error. When I checked the source code, those functions do exist. Maybe just compile hbase-client-1.0.1.1.jar manually; I'm still looking for a better solution...
============
Update for the above: the root cause is that hbase-client-1.0.1.1.jar is incompatible with older versions.
https://issues.apache.org/jira/browse/HBASE-10841
https://issues.apache.org/jira/browse/HBASE-10460
There is a change in the return value of the setter functions, so jars compiled against the old version won't work with the current one.
For your question: you can modify the Pig launcher script $PIG_HOME/bin/pig and set debug=true; then it will just print the running info.
Did you register the required jars?
The most important jars are hbase, zookeeper, and guava.
I solved a similar kind of issue by registering the zookeeper jar in my Pig script, for example like this:
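For illustration, the registration usually sits at the top of the Pig script and looks like the sketch below; the paths and version numbers are placeholders for wherever the jars live on your machine:
-- adjust paths and versions to your installation
REGISTER /usr/local/hbase/lib/hbase-client-1.0.1.1.jar;
REGISTER /usr/local/hbase/lib/hbase-common-1.0.1.1.jar;
REGISTER /usr/local/hbase/lib/zookeeper-3.4.6.jar;
REGISTER /usr/local/hbase/lib/guava-12.0.1.jar;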

Making Sqoop1 work with Hadoop2

I have had a hard time making sqoop1 work on hadoop2. I always run into a Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.util.Tool error, which suggests that sqoop1 is trying to use hadoop1. But I had downloaded the sqoop1 jar for the hadoop 2.0.4-alpha release from http://www.us.apache.org/dist/sqoop/1.4.5/.
Then why does it not work with hadoop2?
PS: I have tried hard to make sqoop2 work, but I faced a lot of problems in the setup.
Also, this post http://mmicky.blog.163.com/blog/static/1502901542013118115417262/ suggests that it should work, but I keep running into this ClassNotFoundException.
I figured out the problem. Whatever classpath I was setting was probably being overridden by the hadoop executable, so I had to modify the hadoop executable at the place where it calls the java command and add a -cp flag with the classpath of my hadoop jars, like below:
exec "$JAVA" -cp "$CLASSPATH:/usr/local/Cellar/hadoop/2.4.1/libexec/share/hadoop/common/:/usr/local/Cellar/hadoop/2.4.1/libexec/share/hadoop/common/lib/:/usr/local/Cellar/hadoop/2.4.1/libexec/share/hadoop/hdfs/:/usr/local/Cellar/hadoop/2.4.1/libexec/share/hadoop/hdfs/lib/:/usr/local/Cellar/hadoop/2.4.1/libexec/share/hadoop/mapreduce/:/usr/local/Cellar/hadoop/2.4.1/libexec/share/hadoop/mapreduce/lib/:/usr/local/Cellar/hadoop/2.4.1/libexec/share/hadoop/tools/:/usr/local/Cellar/hadoop/2.4.1/libexec/share/hadoop/tools/lib/:/usr/local/Cellar/hadoop/2.4.1/libexec/share/hadoop/yarn/:/usr/local/Cellar/hadoop/2.4.1/libexec/share/hadoop/yarn/lib/" $JAVA_HEAP_MAX $HADOOP_OPTS $CLASS "$@"
