Problem with Hadoop Streaming -file option for Java class files - hadoop

I am struggling with a very basic issue with the "-file" option in Hadoop streaming.
First I tried the very basic example in streaming:
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-streaming-0.20.203.0.jar \
  -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
  -reducer /bin/wc \
  -inputformat KeyValueTextInputFormat \
  -input gutenberg/* \
  -output gutenberg-outputtstchk22
which worked absolutely fine.
Then I copied the IdentityMapper.java source code and compiled it.
Then I placed this class file in the /home/hadoop folder and executed the
following in the terminal.
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-streaming-0.20.203.0.jar \
  -file ~/IdentityMapper.class \
  -mapper IdentityMapper.class \
  -reducer /bin/wc \
  -inputformat KeyValueTextInputFormat \
  -input gutenberg/* \
  -output gutenberg-outputtstch6
The execution failed with the following error in the stderr file:
java.io.IOException: Cannot run program "IdentityMapper.class":
java.io.IOException: error=2, No such file or directory
Then I tried again by copying the IdentityMapper.class file into the Hadoop installation directory and executing the following:
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-streaming-0.20.203.0.jar \
  -file IdentityMapper.class \
  -mapper IdentityMapper.class \
  -reducer /bin/wc \
  -inputformat KeyValueTextInputFormat \
  -input gutenberg/* \
  -output gutenberg-outputtstch5
But unfortunately I got the same error again.
It would be great if you could help me with this, as I cannot move any further without resolving it.
Thanking you in anticipation.

Why do you want to compile the class? It is already compiled into the Hadoop jars. You just pass the class name (org.apache.hadoop.mapred.lib.IdentityMapper), because Hadoop uses reflection to instantiate a new instance of the mapper class.
You have to make sure that this class is on the classpath, e.g. within a jar you pass to the job.
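In other words, -mapper expects either an executable or a fully qualified class name that the job can already load, not a bare .class file shipped with -file. A minimal contrast (my own illustration, not from the original answer):
# wrong: ships a lone .class file and asks the framework to "run" it
-file ~/IdentityMapper.class -mapper IdentityMapper.class
# right: names a class that is already on the job's classpath
-mapper org.apache.hadoop.mapred.lib.IdentityMapper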

Same answer as for your other question: you can't really use -file to send over jars, as Hadoop doesn't support multiple jars (that were not already on the CLASSPATH). Check the streaming docs:
At least as late as version 0.14, Hadoop does not support multiple jar files. So, when specifying your own custom classes you will have to pack them along with the streaming jar and use the custom jar instead of the default hadoop streaming jar.
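A rough sketch of that packing step, assuming a custom mapper compiled into a classes/ directory under a hypothetical package com.example (the jar and class names are illustrative only):
cp contrib/streaming/hadoop-streaming-0.20.203.0.jar my-streaming.jar
jar uf my-streaming.jar -C classes com/example/MyMapper.class
bin/hadoop jar my-streaming.jar \
  -mapper com.example.MyMapper \
  -reducer /bin/wc \
  -inputformat KeyValueTextInputFormat \
  -input gutenberg/* \
  -output gutenberg-output-packed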

I met a similar problem, and adding the jar file to HADOOP_CLASSPATH fixed the issue.
For more info, please refer to this: http://blog.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/
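For example, something along these lines before submitting the job (the jar path is purely illustrative):
export HADOOP_CLASSPATH=/path/to/my-custom-classes.jar:$HADOOP_CLASSPATH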

Related

Find path of jar file in GCP

Find the path of the hadoop-streaming-1.2.1.jar file on Google Cloud Platform.
https://github.com/devangpatel01/TF-IDF-implementation-using-map-reduce-Hadoop-python-
I am trying to run this MapReduce job on GCP using Hadoop, but I'm not able to find the path of hadoop-streaming-1.2.1.jar. I tried to download the jar file manually, upload it to Hadoop, and then run mapper1.py, but I'm getting an error saying the path is wrong. The above program was run on a local machine. How do I edit the command to run it on GCP?
hadoop jar /home/kirthyodackal/hadoop-streaming-1.2.1.jar -input hdfs://cluster-29-m/input_prgs/input_prgs/input1/000000_0 -output hdfs://cluster-29-m/input_prgs/input_prgs/output1 -mapper hdfs://cluster-29-m/input_prgs/input_prgs/mapper1.py -reducer hdfs://cluster-29-m/input_prgs/input_prgs/reducer1.py
I used a different mapper-reducer program and was able to run the MapReduce job. I used the code from https://github.com/SatishUC15/TFIDF-HadoopMapReduce#tfidf-hadoop and ran the following commands on my GCP cluster.
> hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -file /home/kirthyodackal/MapperPhaseOne.py /home/kirthyodackal/ReducerPhaseOne.py -mapper "python MapperPhaseOne.py" -reducer "python ReducerPhaseOne.py" -input hdfs://cluster-3299-m/mapinput/inputfile -output hdfs://cluster-3299-m/mappred1
> hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -file /home/kirthyodackal/MapperPhaseTwo.py /home/kirthyodackal/ReducerPhaseTwo.py -mapper "python MapperPhaseTwo.py" -reducer "python ReducerPhaseTwo.py" -input hdfs://cluster-3299-m/mappred1/part-00000 hdfs://cluster-3299-m/mappred1/part-00001 hdfs://cluster-3299-m/mappred1/part-00002 hdfs://cluster-3299-m/mappred1/part-00003 hdfs://cluster-3299-m/mappred1/part-00004 -output hdfs://cluster-3299-m/mappred2
> hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -file /home/kirthyodackal/MapperPhaseThree.py /home/kirthyodackal/ReducerPhaseThree.py -mapper "python MapperPhaseThree.py" -reducer "python ReducerPhaseThree.py" -input hdfs://cluster-3299-m/mappred2/part-00000 hdfs://cluster-3299-m/mappred2/part-00001 hdfs://cluster-3299-m/mappred2/part-00002 hdfs://cluster-3299-m/mappred2/part-00003 hdfs://cluster-3299-m/mappred2/part-00004 -output hdfs://cluster-3299-m/mappredf
The following link outlines how I went about MapReduce on GCP.
https://github.com/kirthy21/Data-Analysis-Stack-Exchange-Hadoop-Pig-Hive-MapReduce-TFIDF
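As an aside, if you only need to locate the streaming jar on a GCP/Dataproc cluster node, a quick search like the following (a sketch; the exact path can vary by image version) usually finds it:
ls /usr/lib/hadoop-mapreduce/ | grep streaming
find / -name "hadoop-streaming*.jar" 2>/dev/null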

hadoop/bin/hadoop doesn't have the example jar

I installed Hadoop 2.2.0 and tried to run the sample wordcount program. For that, I first imported data into HDFS using:
bin/hadoop fs -copyFromLocal /home/prassanna/Desktop/input /input
After that, I tried to run the wordcount jar file using:
root@prassanna-Studio-1558:/usr/local/hadoop# bin/hadoop jar hadoop*examples*.jar wordcount /input -output
but it showed: Not a valid JAR: /usr/local/hadoop/hadoop*examples*.jar
I checked the usr/local/hadoop/bin/hadoop directory and there is no Hadoop examples jar.
The Jar file you are looking for is in this directory:
hadoop_root/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar
and should be run with a command like this:
$ yarn jar hadoop_root/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount /input /output
The Hadoop examples jar is no longer present at usr/local/hadoop/bin/hadoop.
From Hadoop version 2.x onwards, as SAM has rightly indicated in his answer, the jar you are looking for is
$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar
You can run it like this:
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount /input /output
Make sure the /input folder contains an input file in HDFS to be counted. Also note that /output should not already exist; the Hadoop framework creates it.
Also, please refer to this document for the Hadoop 2.2.0 shell commands. It is always good practice not to use deprecated commands.
http://hadoop.apache.org/docs/r2.2.0/hadoop-project-dist/hadoop-common/FileSystemShell.html
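For instance, checking the input and clearing a leftover output directory with the FileSystem shell looks roughly like this (paths match the example above):
hdfs dfs -ls /input
hdfs dfs -rm -r /output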
In my case I am working with Hadoop 2.4.1, so the command is
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.1.jar wordcount /input /output
You have to compile WordCount.java and then jar it as shown below. I had to dig around for the lib paths, but in the end I was able to use this to compile the example class:
[apesa@localhost ~]$ javac -classpath $HADOOP_HOME/apesa/hadoop/common/hadoop-common-2.2.0.jar:$HADOOP_HOME/apesa/hadoop/mapreduce/hadoop-mapreduce-client-core-2.2.0.jar:$HADOOP_HOME/apesa/hadoop/common/lib/commons-cli-1.2.jar -d wordcount_classes WordCount.java
Then jar it as follows:
[apesa@localhost ~]$ jar -cvf wordcount.jar -C wordcount_classes/ .
I have not run this in a while, but if you get an error you may have to verify that the lib files are still in the same place.
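Assuming the example keeps WordCount in the default package (adjust the class name if the source declares a package), you would then run the freshly built jar along these lines:
hadoop jar wordcount.jar WordCount /input /output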

Include Third Party Jars in Hadoop

I am new to Hadoop. I have added the Gson API to my MapReduce program. When I run the program I get:
Error: java.lang.ClassNotFoundException: com.google.gson.Gson
Can anybody suggest how to add third-party libraries to Hadoop?
Be sure to add any dependencies to both the HADOOP_CLASSPATH and -libjars when submitting a job, as in the following examples:
Use the following to add all the jar dependencies from current and lib directories:
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:`echo *.jar`:`echo lib/*.jar | sed 's/ /:/g'`
Bear in mind that when starting a job through hadoop jar you'll need to also pass it the jars of any dependencies through use of -libjars. I like to use:
hadoop jar <jar> <class> -libjars `echo ./lib/*.jar | sed 's/ /,/g'` [args...]
NOTE: The two sed commands use different delimiters; HADOOP_CLASSPATH is colon-separated, while the -libjars list must be comma-separated.
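Putting this together for the Gson case from the question, a submission could look roughly like this (the gson jar location under ./lib/ and the driver class name are assumptions for illustration; -libjars is only honoured when the driver parses generic options, e.g. via ToolRunner):
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:./lib/gson-2.2.4.jar
hadoop jar myjob.jar com.example.MyDriver -libjars ./lib/gson-2.2.4.jar /input /output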
Add the jar to HADOOP_CLASSPATH:
vi $HADOOP_HOME/etc/hadoop/hadoop-env.sh
Add this as the last line:
export HADOOP_CLASSPATH=/root/hadoop/extrajars/java-json.jar:$HADOOP_CLASSPATH
"/root/hadoop/extrajars/java-json.jar" is a path on the Linux box itself, not on HDFS.
Restart Hadoop. The command
hadoop classpath
should now show the jar on the classpath.
Now run the MR job as usual:
hadoop jar <MR-program jar> <MR Program class> <input dir> <output dir>
It will pick up the jar from HADOOP_CLASSPATH as expected.

Using other files along with EMR streaming step?

I currently have a hadoop command that I would like to reproduce using the AWS SDK.
The command I'm currently using:
hadoop jar /home/hadoop/contrib/streaming/hadoop-streaming.jar -input /no_dups -output /sorted -mapper mapper.py -reducer reducer.py -file mapper.py reducer.py other_file1.py other_file2.py
As far as I can see, the StreamingStep class doesn't provide a way to let Hadoop know that other files will be needed, along with the mapper and reducer.
Is this functionality available?
I solved this by passing the -file option to HadoopJarStepConfig with a list of the files I needed.
See this question

Issue adding third-party jars to hadoop job

I am trying to add third-party jars to a hadoop job. I am adding each jar using the DistributedCache.addFileToClassPath method. I can see that mapred.job.classpath.files is properly populated in the job XML file.
-libjars does not work for me either (most likely because we are not using ToolRunner).
Any suggestions, what could be wrong?
Add the jar to HADOOP_CLASSPATH:
vi $HADOOP_HOME/etc/hadoop/hadoop-env.sh
Add this as the last line:
export HADOOP_CLASSPATH=/root/hadoop/extrajars/java-json.jar:$HADOOP_CLASSPATH
"/root/hadoop/extrajars/java-json.jar" is a path on the Linux box itself, not on HDFS.
Restart Hadoop. The command
hadoop classpath
should now show the jar on the classpath.
Now run the MR job as usual:
hadoop jar <MR-program jar> <MR Program class> <input dir> <output dir>
It will pick up the jar from HADOOP_CLASSPATH as expected.
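If you would rather make -libjars itself work, the driver has to parse Hadoop's generic options, typically by going through ToolRunner; with that in place a submission looks something like this (the driver class and jar path are purely illustrative):
export HADOOP_CLASSPATH=/path/to/thirdparty.jar:$HADOOP_CLASSPATH
hadoop jar myjob.jar com.example.MyDriver -libjars /path/to/thirdparty.jar /input /output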
