Include Third Party Jars in Hadoop

I am new to Hadoop. I have added the Gson API to my MapReduce program. When I run the program I get:
Error: java.lang.ClassNotFoundException: com.google.gson.Gson
Can anybody suggest how to add third-party libraries to Hadoop?
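For reference, the mapper uses Gson roughly like this (a sketch; Record is a stand-in for my real type):

import java.io.IOException;
import com.google.gson.Gson;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Gson is loaded lazily inside the task JVM on the worker node,
// which is where the ClassNotFoundException surfaces if the jar
// was not shipped with the job.
public class JsonParseMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Gson gson = new Gson();

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
        // Assumes each input line is a small JSON object.
        Record r = gson.fromJson(value.toString(), Record.class);
        ctx.write(new Text(r.id), new Text(r.payload));
    }

    static class Record { String id; String payload; }
}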

Be sure to add any dependencies to both the HADOOP_CLASSPATH and -libjars when submitting a job, as in the following examples:
Use the following to add all the jar dependencies from the current and lib directories:
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:`echo *.jar`:`echo lib/*.jar | sed 's/ /:/g'`
Bear in mind that when starting a job through hadoop jar you'll also need to pass it the jars of any dependencies through -libjars. I like to use:
hadoop jar <jar> <class> -libjars `echo ./lib/*.jar | sed 's/ /,/g'` [args...]
NOTE: the two sed commands use different delimiter characters because HADOOP_CLASSPATH is :-separated while the -libjars list must be ,-separated.
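Also note that -libjars is only honored when the driver lets Hadoop's GenericOptionsParser see the command line, which is what ToolRunner does. A minimal driver sketch (Hadoop 2 API; the class and job names are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// ToolRunner strips generic options such as -libjars and -D
// before run() sees the remaining args.
public class MyDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "my-job");
        job.setJarByClass(MyDriver.class);
        // ...set mapper/reducer/output types here...
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new MyDriver(), args));
    }
}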

Add the Jar in HADOOP_CLASSPATH
vi $HADOOP_HOME/etc/hadoop/hadoop-env.sh
Add this as the last line:
export HADOOP_CLASSPATH=/root/hadoop/extrajars/java-json.jar:$HADOOP_CLASSPATH
"/root/hadoop/extrajars/java-json.jar" is path on linux box itself and not on HDFS
Restart Hadoop.
The command:
hadoop classpath
should now show the jar in the classpath.
Now run the MR job as usual:
hadoop jar <MR-program jar> <MR Program class> <input dir> <output dir>
It will pick up the jar from the classpath as expected.
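To confirm that the client-side classpath picked up the jar, one quick check is a tiny class run through the hadoop launcher (the class name is hypothetical; compile it and put its directory on HADOOP_CLASSPATH as well):

// Run with: hadoop CheckGson
// A ClassNotFoundException here means the client classpath is still
// missing the jar. Note this does not prove the jar also reaches the
// task JVMs; use -libjars or the distributed cache for that.
public class CheckGson {
    public static void main(String[] args) throws Exception {
        System.out.println(Class.forName("com.google.gson.Gson"));
    }
}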

Related

Change tmp directory while running yarn jar command

I am running an MR job using the yarn jar command, and it creates a temporary jar in the /tmp folder which fills up the entire disk. I want to redirect this jar to some other folder where I have more disk space. From a link I found, I learned that the path can be changed by setting the property mapred.local.dir for Hadoop 1.x. I am using the following command to run the jar:
yarn jar myjar.jar MyClass myyml.yml arg1 -D mapred.local.dir="/grid/1/uie/facts"
The mapred.local.dir setting above doesn't change the path; the jar is still created in the /tmp folder.
I found a workaround to avoid writing the unjarred files to the /tmp folder. This behaviour is apparently not configurable, but you can bypass 'hadoop jar' / 'yarn jar' (the RunJar utility) entirely by invoking your main class directly with the generated classpath:
java -cp $(hadoop classpath):my-fat-jar-with-all-dependencies.jar your.app.mainClass
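The main class launched this way is an ordinary Java entry point; because $(hadoop classpath) includes the configuration directory, new Configuration() still picks up the cluster settings. A sketch (all names illustrative):

import org.apache.hadoop.conf.Configuration;

public class MainEntry {
    public static void main(String[] args) throws Exception {
        // The conf dir on the classpath supplies core-site.xml etc.,
        // so this sees the real cluster settings, not just defaults.
        Configuration conf = new Configuration();
        System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS"));
        // ...build and submit the job from here as usual...
    }
}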

hadoop/bin/hadoop doesn't have the example jar

I installed Hadoop 2.2.0 and tried to run the sample wordcount program. First I imported data into HDFS using:
bin/hadoop fs -copyFromLocal /home/prassanna/Desktop/input /input
After that, I tried to run the wordcount jar file using:
root@prassanna-Studio-1558:/usr/local/hadoop# bin/hadoop jar hadoop*examples*.jar wordcount /input -output
but it showed: Not a valid JAR: /usr/local/hadoop/hadoop*examples*.jar
I checked in the /usr/local/hadoop directory and there is no Hadoop examples jar there.
The jar file you are looking for is here:
hadoop_root/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar
and should be run with a command like this:
$ yarn jar hadoop_root/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount /input /output
The Hadoop examples jar is no longer present at:
usr/local/hadoop/bin/hadoop
From Hadoop version 2.x onward, as SAM has rightly indicated in his answer, the jar you are looking for is:
$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar
You can run it like this:
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount /input /output
Make sure the /input folder contains an input file to be counted in HDFS. Also note that /output must not already exist; it is created by the Hadoop framework.
Also refer to this document for the Hadoop 2.2.0 shell commands; it is always good practice to avoid the deprecated versions.
http://hadoop.apache.org/docs/r2.2.0/hadoop-project-dist/hadoop-common/FileSystemShell.html
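Since /output must not pre-exist, a rerun fails because the output directory already exists. In your own drivers you can clear it programmatically before submitting; a sketch (only do this when the old results are disposable):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CleanOutput {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path out = new Path(args.length > 0 ? args[0] : "/output");
        if (fs.exists(out)) {
            fs.delete(out, true); // recursive delete of the old output
        }
    }
}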
In my case I am working with Hadoop 2.4.1, so the command is:
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.1.jar wordcount /input /output
You have to compile WordCount.java and then jar it as shown below. I had to dig around for the lib paths, but in the end I was able to compile the example class with this:
[apesa@localhost ~]$ javac -classpath $HADOOP_HOME/apesa/hadoop/common/hadoop-common-2.2.0.jar:$HADOOP_HOME/apesa/hadoop/mapreduce/hadoop-mapreduce-client-core-2.2.0.jar:$HADOOP_HOME/apesa/hadoop/common/lib/commons-cli-1.2.jar -d wordcount_classes WordCount.java
Then jar it as follows:
[apesa@localhost ~]$ jar -cvf wordcount.jar -C wordcount_classes/ .
I have not run this in a while, so if you get an error you may have to verify that the lib files are still in the same places.
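For reference, the WordCount.java being compiled is essentially the version from the Hadoop MapReduce tutorial; a self-contained copy using the new API:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Emits (word, 1) for every token in the input line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Sums the counts for each word; also used as the combiner.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}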

Running a hadoop job

It is the first time I'm running a job on Hadoop, and I started from the WordCount example. To run my job, I'm using this command:
hduser#ubuntu:/usr/local/hadoop$ bin/hadoop jar hadoop*examples*.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output
and I think we should copy the jar file into /usr/local/hadoop. My first question is: what does hadoop*examples* mean? And if we want to put our jar file in another location, for example /home/user/WordCountJar, what should I do? Thanks in advance for your help.
I think we should copy the jar file into /usr/local/hadoop
It is not mandatory. But if you have your jar at some other location, you need to specify the complete path while running your job.
My first question is: what does hadoop*examples* mean?
hadoop*examples* is the name of the jar package that contains your MR job along with other dependencies. Here, * is a shell wildcard that matches any version, not specifically 0.19.2 or anything else. That said, I feel it should be hadoop-examples-*.jar and not hadoop*examples*.jar.
and if we want to put our jar file in another location, for example /home/user/WordCountJar, what should I do?
If your jar is present in a directory other than the directory from where you are executing the command, you need to specify the complete path to your jar. Say,
bin/hadoop jar /home/user/WordCountJar/hadoop-*-examples.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output
The asterisks around examples are just wildcard expansion to account for different version numbers in the file name, for example hadoop-0.19.2-examples.jar.
You can use the full path to your jar like so:
bin/hadoop jar /home/user/hadoop-0.19.2-examples.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output

Issue adding third-party jars to hadoop job

I am trying to add third-party jars to a Hadoop job. I am adding each jar using the DistributedCache.addFileToClassPath method, and I can see that mapred.job.classpath.files is properly populated in the job XML file.
-libjars does not work for me either (most likely because we are not using ToolRunner; see the ToolRunner driver sketch earlier on this page).
Any suggestions, what could be wrong?
The same fix as in the first question above applies here: add the jar to HADOOP_CLASSPATH in $HADOOP_HOME/etc/hadoop/hadoop-env.sh (a local path, not HDFS), restart Hadoop, confirm with the hadoop classpath command, and then run the MR job as usual.
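If you prefer to stay with the DistributedCache route instead of switching the driver to ToolRunner, note that addFileToClassPath expects a path on the default FileSystem (HDFS), not a local path. A minimal sketch (the HDFS path is hypothetical; the API is deprecated in Hadoop 2 in favour of Job.addFileToClassPath):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;

public class AddJarToClasspath {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The jar must already have been uploaded to HDFS,
        // e.g. with: hdfs dfs -put java-json.jar /libs/
        DistributedCache.addFileToClassPath(new Path("/libs/java-json.jar"), conf);
        // ...configure and submit the job with this conf...
    }
}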

what's wrong with my hadoop configuration?

I've done all the setup that Hadoop requires, but something still seems wrong. For example:
I have a class Hello.class. When I use the command "java Hello" it works correctly, but when I try "hadoop Hello" it reports "cannot load or find the main class". However, when I package Hello.class into Hello.jar with the "jar" command and run "hadoop jar Hello.jar Hello", it works correctly, just as with "java Hello".
What is wrong with my configuration?
In /etc/profile the following has been added:
export JAVA_HOME=/usr/jdk1.7.0_04
export HADOOP_INSTALL=/usr/hadoop-1.0.1
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_INSTALL/bin
export CLASS_PATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
I've added "export JAVA_HOME=/usr/jdk1.7.0_04" to the file hadoop-env.sh.
I've changed core-site.xml, hdfs-site.xml and mapred-site.xml accordingly.
Has anyone run into the same problem?
The hadoop Hello command runs Hadoop and looks for a class named Hello on the current classpath, which doesn't contain your class.
Bundling your class into a jar and running hadoop jar myjar.jar Hello tells Hadoop to add the jar file myjar.jar to the classpath and then run the class named Hello (which is now on the classpath).
If you want to add a class to the classpath without bundling it into a jar, configure the HADOOP_CLASSPATH environment variable.
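To make the distinction concrete, the Hello class above is just a plain main class, and either launch style then works (a sketch; paths are illustrative):

// Hello.java
public class Hello {
    public static void main(String[] args) {
        System.out.println("Hello from the Hadoop launcher");
    }
}
// After compiling with: javac Hello.java
// either put the class directory on the classpath:
//   export HADOOP_CLASSPATH=/path/to/classes
//   hadoop Hello
// or bundle it into a jar:
//   jar cf Hello.jar Hello.class
//   hadoop jar Hello.jar Hello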
