hadoop/bin/hadoop doesn't have the example jar

I installed Hadoop 2.2.0 and tried to run the sample wordcount program. For that, I first imported data into HDFS using:
bin/hadoop fs -copyFromLocal /home/prassanna/Desktop/input /input
After that, I tried to run the wordcount jar file using:
root@prassanna-Studio-1558:/usr/local/hadoop# bin/hadoop jar hadoop*examples*.jar wordcount /input -output
but it showed: Not a valid JAR: /usr/local/hadoop/hadoop*examples*.jar
I checked the /usr/local/hadoop/bin directory, and there is no Hadoop examples jar there.

The jar file you are looking for is in this directory:
hadoop_root/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar
and should be run with a command like this:
$ yarn jar hadoop_root/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount /input /output
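If you want to double-check that the jar is actually present before running it, list it first (hadoop_root is the same placeholder for your installation directory as above):
ls hadoop_root/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar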

The Hadoop examples jar is no longer present in
/usr/local/hadoop/bin
From Hadoop version 2.x, as SAM has rightly indicated in his answer, the jar you are looking for is
$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar
You can run it like this:
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount /input /output
Make sure the /input folder in HDFS contains an input file to be counted. Also note that /output should not already exist; it is for the Hadoop framework to create.
Also, please refer to this document for the Hadoop 2.2.0 shell commands. It is always good practice not to use the deprecated commands:
http://hadoop.apache.org/docs/r2.2.0/hadoop-project-dist/hadoop-common/FileSystemShell.html
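For reference, a minimal end-to-end run might look like this (the local file words.txt is just an example; point -put at your own data):
hdfs dfs -mkdir -p /input
hdfs dfs -put /home/user/words.txt /input        # stage some local text into HDFS
hdfs dfs -rm -r -f /output                       # the job fails if /output already exists
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount /input /output
hdfs dfs -cat /output/part-r-00000               # the counts land in part-r-* files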

In my case I am working with Hadoop 2.4.1, so the command is:
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.1.jar wordcount /input /output
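If you would rather not edit the version number after every upgrade, a shell wildcard works too, assuming only one examples jar sits in that directory:
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /input /output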

You have to compile WordCount.java and then JAR it as follows. I had to dig around for the lib paths, but in the end I was able to use this to compile the example class:
[apesa@localhost ~]$ javac -classpath $HADOOP_HOME/apesa/hadoop/common/hadoop-common-2.2.0.jar:$HADOOP_HOME/apesa/hadoop/mapreduce/hadoop-mapreduce-client-core-2.2.0.jar:$HADOOP_HOME/apesa/hadoop/common/lib/commons-cli-1.2.jar -d wordcount_classes WordCount.java
Then JAR it as follows
[apesa@localhost ~]$ jar -cvf wordcount.jar -C wordcount_classes/ .
I have not run this in a while, but you may have to verify that the lib files are in the same place if you get an error.
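On Hadoop 2.x you can usually avoid digging for the individual lib jars by compiling against the full classpath that the hadoop classpath command prints. A sketch, assuming WordCount.java has no package declaration:
mkdir -p wordcount_classes
javac -classpath "$(hadoop classpath)" -d wordcount_classes WordCount.java
jar -cvf wordcount.jar -C wordcount_classes/ .
hadoop jar wordcount.jar WordCount /input /output    # use the fully qualified name if your class is in a package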

Related

Example Jar in Hadoop release

I am learning Hadoop with the book 'Hadoop in Action' by Chuck Lam. In the first chapter, the book says that the Hadoop installation will have an examples jar and that running 'hadoop jar hadoop-*-examples.jar' will show all the examples. But when I run the command, it throws the error 'Could not find or load main class org.apache.hadoop.util.RunJar'. My guess is that the installed Hadoop doesn't have the examples jar. I have installed 'hadoop-2.1.0-beta.tar.gz' on Cygwin on a Windows 7 laptop. Please suggest how to get the examples jar.
Run the following command:
hadoop jar PathToYourJarFile wordcount inputPath outputPath
You can find the examples jar file in your Hadoop installation directory.
What I can suggest here is that you manually go to the Hadoop installation directory and look for a jar with a name similar to hadoop-examples.jar yourself; different distributions can use different names for the jar.
If you are on Cygwin, you can also run ls *examples*.jar while in the Hadoop installation directory, narrowing the file listing down to any jar file containing examples as a string.
You can then use the jar file name directly:
hadoop jar <exampleJarYourFound.jar>
Hope this takes you to a solution.
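If the jar is not at the top level, a recursive search will turn it up (assuming $HADOOP_HOME points at your installation; on 2.x layouts this typically finds share/hadoop/mapreduce/hadoop-mapreduce-examples-<version>.jar):
find $HADOOP_HOME -name '*examples*.jar'
Running the examples jar with no arguments then prints the list of available example programs, which is a handy way to confirm you found the right file.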

Running a hadoop job

This is the first time I'm running a job on Hadoop, and I started from the WordCount example. To run my job, I'm using this command:
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar hadoop*examples*.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output
I think we should copy the jar file into /usr/local/hadoop. My first question is: what is the meaning of hadoop*examples*? And if we want to put our jar file in another location, for example /home/user/WordCountJar, what should I do? Thanks for your help in advance.
I think we should copy the jar file into /usr/local/hadoop
It is not mandatory. But if you have your jar at some other location, you need to specify the complete path while running your job.
My first question is: what is the meaning of hadoop*examples*?
hadoop*examples* is the name of the jar package that contains your MR job along with other dependencies. Here, * signifies that it can match any version, not specifically 0.19.2 or anything else. That said, I feel it should be hadoop-examples-*.jar, not hadoop*examples*.jar.
and if we want to put our jar file in another location, for example
/home/user/WordCountJar, what should I do?
If your jar is present in a directory other than the directory from where you are executing the command, you need to specify the complete path to your jar. Say,
bin/hadoop jar /home/user/WordCountJar/hadoop-*-examples.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output
The *examples* is just wildcard expansion to account for different version numbers in the file name. For example: hadoop-0.19.2-examples.jar
You can use the full path to your jar like so:
bin/hadoop jar /home/user/hadoop-0.19.2-examples.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output

Include Third Party Jars in Hadoop

I am new to Hadoop. I have added the Gson API to my MapReduce program. When I run the program, I get:
Error: java.lang.ClassNotFoundException: com.google.gson.Gson
Can anybody suggest how to add third-party libraries to Hadoop?
Be sure to add any dependencies to both HADOOP_CLASSPATH and -libjars when submitting a job, as in the following examples:
Use the following to add all the jar dependencies from the current and lib directories:
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:`echo *.jar`:`echo lib/*.jar | sed 's/ /:/g'`
Bear in mind that when starting a job through hadoop jar, you'll also need to pass it the jars of any dependencies through -libjars. I like to use:
hadoop jar <jar> <class> -libjars `echo ./lib/*.jar | sed 's/ /,/g'` [args...]
NOTE: The two sed commands use different delimiter characters; HADOOP_CLASSPATH is :-separated, while the -libjars list must be ,-separated.
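Putting the two together for a job whose dependencies sit in ./lib (myjob.jar and com.example.MyJob are placeholder names, and the driver is assumed to parse generic options, e.g. via ToolRunner, which is what makes -libjars take effect):
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:`echo lib/*.jar | sed 's/ /:/g'`
hadoop jar myjob.jar com.example.MyJob -libjars `echo lib/*.jar | sed 's/ /,/g'` /input /output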

Issue adding third-party jars to hadoop job

I am trying to add third-party jars to a Hadoop job. I am adding each jar using the DistributedCache.addFileToClassPath method, and I can see that mapred.job.classpath.files is properly populated in the job XML file.
-libjars does not work for me either (most likely because we are not using ToolRunner).
Any suggestions, what could be wrong?
Add the jar to HADOOP_CLASSPATH:
vi $HADOOP_HOME/etc/hadoop/hadoop-env.sh
Add this as the last line:
export HADOOP_CLASSPATH=/root/hadoop/extrajars/java-json.jar:$HADOOP_CLASSPATH
Note that /root/hadoop/extrajars/java-json.jar is a path on the Linux box itself, not on HDFS.
Restart Hadoop. The command
hadoop classpath
should now show the jar in the classpath.
Now run your MR job as usual:
hadoop jar <MR-program jar> <MR Program class> <input dir> <output dir>
It will pick up the jar as expected.
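A quick way to confirm the jar really made it onto the classpath (the grep pattern matches the jar name from the export above):
hadoop classpath | tr ':' '\n' | grep java-json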

Problem with Hadoop Streaming -file option for Java class files

I am struggling with a very basic issue in Hadoop streaming: the "-file" option.
First I tried the very basic example in streaming:
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-streaming-0.20.203.0.jar -mapper org.apache.hadoop.mapred.lib.IdentityMapper -reducer /bin/wc -inputformat KeyValueTextInputFormat -input gutenberg/* -output gutenberg-outputtstchk22
which worked absolutely fine.
Then I copied the IdentityMapper.java source code and compiled it. I placed this class file in the /home/hadoop folder and executed the following in the terminal:
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-streaming-0.20.203.0.jar -file ~/IdentityMapper.class -mapper IdentityMapper.class -reducer /bin/wc -inputformat KeyValueTextInputFormat -input gutenberg/* -output gutenberg-outputtstch6
The execution failed with the following error in the stderr file:
java.io.IOException: Cannot run program "IdentityMapper.class": java.io.IOException: error=2, No such file or directory
Then I tried again, this time copying the IdentityMapper.class file into the Hadoop installation, and executed the following:
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-streaming-0.20.203.0.jar -file IdentityMapper.class -mapper IdentityMapper.class -reducer /bin/wc -inputformat KeyValueTextInputFormat -input gutenberg/* -output gutenberg-outputtstch5
But unfortunately I got the same error again.
It would be great if you could help me with this, as I cannot move any further without overcoming it. Thank you in advance.
Why do you want to compile the class? It is already compiled into the Hadoop jars. You just pass the class name (org.apache.hadoop.mapred.lib.IdentityMapper), because Hadoop uses reflection to instantiate a new instance of the mapper class.
You have to make sure that the class is on the classpath, e.g. inside a jar you pass with the job.
Same answer as for your other question: you can't really use -file to send over jars, as Hadoop doesn't support multiple jar files (beyond those already on the CLASSPATH). Check the streaming docs:
At least as late as version 0.14, Hadoop does not support multiple jar files. So, when specifying your own custom classes you will have to pack them along with the streaming jar and use the custom jar instead of the default hadoop streaming jar.
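A sketch of that repacking approach, based on the paths from the commands above (IdentityMapper is assumed to have been compiled without a package declaration):
mkdir streaming-custom && cd streaming-custom
jar -xf /usr/local/hadoop/contrib/streaming/hadoop-streaming-0.20.203.0.jar    # unpack the stock streaming jar
cp ~/IdentityMapper.class .                                                    # add your compiled class
jar -cfm ../hadoop-streaming-custom.jar META-INF/MANIFEST.MF .                 # repack, keeping the original manifest (and its Main-Class)
cd ..
bin/hadoop jar hadoop-streaming-custom.jar -mapper IdentityMapper -reducer /bin/wc -inputformat KeyValueTextInputFormat -input gutenberg/* -output gutenberg-outputcustom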
I met a similar problem, and adding the jar file to HADOOP_CLASSPATH fixed the issue.
For more info, please refer to this: http://blog.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/
