Issue adding third-party jars to a Hadoop job

I am trying to add third-party jars to a Hadoop job. I add each jar using the DistributedCache.addFileToClassPath method, and I can see that mapred.job.classpath.files is properly populated in the job XML file.
-libjars does not work for me either (most likely because we are not using ToolRunner).
Any suggestions as to what could be wrong?

Add the jar to HADOOP_CLASSPATH. Open hadoop-env.sh:
vi $HADOOP_HOME/etc/hadoop/hadoop-env.sh
Add this as the last line:
export HADOOP_CLASSPATH=/root/hadoop/extrajars/java-json.jar:$HADOOP_CLASSPATH
Note that "/root/hadoop/extrajars/java-json.jar" is a path on the Linux box itself, not on HDFS.
Restart Hadoop. The command
hadoop classpath
should now show the jar in the classpath.
Now run the MR job as usual:
hadoop jar <MR-program jar> <MR Program class> <input dir> <output dir>
It will use the jar as expected.
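To check the result of `hadoop classpath` without reading the long output by eye, a small helper can test for an exact entry. This is only a sketch: the function name `classpath_has_entry` is made up, and the sample classpath below (including the `/opt/hadoop` paths) is a placeholder for what `hadoop classpath` would print on a real box.

```shell
#!/bin/sh
# Return 0 if the colon-separated classpath in $1 contains the exact entry $2.
# Wrapping both sides in ':' avoids partial matches such as "java-json.jar.bak".
classpath_has_entry() {
    case ":$1:" in
        *":$2:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Placeholder for the output of `hadoop classpath` (hypothetical paths).
sample_cp="/root/hadoop/extrajars/java-json.jar:/opt/hadoop/etc/hadoop:/opt/hadoop/share/hadoop/common/*"

if classpath_has_entry "$sample_cp" "/root/hadoop/extrajars/java-json.jar"; then
    echo "jar is on the classpath"
else
    echo "jar is missing from the classpath"
fi
```

On a machine with Hadoop installed, the first argument would be "$(hadoop classpath)" instead of the sample string.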

Related

Change tmp directory while running yarn jar command

I am running an MR job using the yarn jar command, and it creates a temporary jar in the /tmp folder, which fills up the entire disk. I want to redirect this jar to some other folder where I have more disk space. From a linked post, I learned that the path can be changed by setting the property mapred.local.dir for Hadoop version 1.x. I am using the following command to run the jar:
yarn jar myjar.jar MyClass myyml.yml arg1 -D mapred.local.dir="/grid/1/uie/facts"
The mapred.local.dir argument above doesn't change the path; the jar is still created in the /tmp folder.
I found a workaround that avoids writing the unjarred files to /tmp. Apparently this behaviour is not configurable, so we can avoid 'hadoop jar' or 'yarn jar' (the RunJar utility) and instead invoke the main class directly with the generated classpath:
java -cp $(hadoop classpath):my-fat-jar-with-all-dependencies.jar your.app.mainClass
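A slightly more defensive way to assemble that classpath is sketched below. The helper name `build_classpath` is invented for illustration, and the `/opt/hadoop` value stands in for the real output of `hadoop classpath`. That output typically contains wildcard entries such as `lib/*`, so quoting it keeps the shell from touching them.

```shell
#!/bin/sh
# Join the cluster classpath ($1) and the application's fat jar ($2)
# into one classpath string for a direct `java -cp` launch.
build_classpath() {
    printf '%s:%s\n' "$1" "$2"
}

# Placeholder for "$(hadoop classpath)" (hypothetical paths).
cluster_cp='/opt/hadoop/etc/hadoop:/opt/hadoop/share/hadoop/common/lib/*'

direct_cp=$(build_classpath "$cluster_cp" "my-fat-jar-with-all-dependencies.jar")

# Print the command that would be run; quoting keeps the '*' entries literal
# so that the JVM, not the shell, expands them.
echo "java -cp \"$direct_cp\" your.app.mainClass"
```

With Hadoop installed, `cluster_cp` would be set from `hadoop classpath` and the printed command executed directly.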

Run Spark job with properties files

As a beginner with the Hadoop stack, I would like to run my Spark job with spark-submit via Oozie. I have a jar containing my compiled project, plus a set of about 20 properties files. When running my Spark job, I want to load these properties files from a folder separate from the one containing the compiled job jar. I've tried:
In my Oozie job.properties, I added:
oozie.libpath=[path to the folder including all of my properties files]
and oozie.use.system.libpath=true.
On the spark-submit command, I added --files or --properties-file, but it doesn't work (it doesn't accept a folder).
Thanks for any suggestions, and feel free to ask for more detail if my question is not clear.

Running a hadoop job

This is the first time I'm running a job on Hadoop, and I started with the WordCount example. To run my job, I'm using this command:
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar hadoop*examples*.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output
and I think we should copy the jar file to /usr/local/hadoop. My first question is: what is the meaning of hadoop*examples*? And if we want to keep our jar file in another location, for example /home/user/WordCountJar, what should I do? Thanks in advance for your help.
"I think we should copy the jar file in /usr/local/hadoop"
It is not mandatory. But if your jar is at some other location, you need to specify the complete path when running your job.
"My first question is that what is the meaning of hadoop*examples*?"
hadoop*examples*.jar is a pattern that matches the jar package containing your MR job along with its dependencies. Here, * means the name can contain any version, not specifically 0.19.2 or something else. That said, I feel it should be hadoop-examples-*.jar rather than hadoop*examples*.jar.
"and if we want to locate our jar file in another location for example /home/user/WordCountJar, what I should do?"
If your jar is in a directory other than the one you are executing the command from, you need to specify the complete path to your jar. For example:
bin/hadoop jar /home/user/WordCountJar/hadoop-*-examples.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output
The asterisks in hadoop*examples*.jar are just shell wildcards that account for different version numbers in the file name, for example hadoop-0.19.2-examples.jar.
You can use the full path to your jar like so:
bin/hadoop jar /home/user/hadoop-0.19.2-examples.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output
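The wildcard behaviour is easy to see without Hadoop at all, because the shell expands the pattern before the hadoop command ever runs. The sketch below uses a throwaway temp directory and two empty files; the second file name (the "core" jar) is only an assumed example of a jar that should not match.

```shell
#!/bin/sh
# Create a scratch directory with two dummy jars.
demo_dir=$(mktemp -d)
touch "$demo_dir/hadoop-0.19.2-examples.jar" "$demo_dir/hadoop-0.19.2-core.jar"

# The shell replaces the pattern with the matching file name(s);
# only the examples jar matches hadoop*examples*.jar.
matched=$(cd "$demo_dir" && echo hadoop*examples*.jar)
echo "pattern expanded to: $matched"

rm -rf "$demo_dir"
```

If more than one file matched, the pattern would expand to several names, and hadoop jar would treat the second one as the main class argument, so it is worth checking the expansion with echo first.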

How to execute a MapReduce program (e.g. wordcount) from HDFS and see the output?

I am new to Hadoop. I have a simple wordcount program in Eclipse which takes input files and then shows the output. But I need to execute the same program from HDFS. I have already created a JAR file for the wordcount program.
Can anyone please let me know how to proceed?
You need to have a cluster set up, even if it is a single-node cluster. Then you can run your .jar from the hadoop command line:
jar
Runs a jar file. Users can bundle their Map Reduce code in a jar file and execute it using this command.
Usage: hadoop jar <jar> [mainClass] args...
The streaming jobs are run via this command. Examples can be referred from Streaming examples.
Word count example is also run using jar command. It can be referred from Wordcount example.
First you need to set up a Hadoop cluster, as discussed by Remus.
Single Node Setup and Multi Node Setup are two good ways to start.
Once the setup is done, start the Hadoop daemons and copy the input files into any HDFS directory.
Prepare the jar of your program.
Run the jar on the terminal using hadoop jar <your jar name> <your main class> <input path> <output directory path>
(The jar arguments depend on your program.)
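The steps above can be sketched as a small script. Everything named here is an assumption for illustration: the jar name, main class, local input file, and HDFS paths are placeholders, and the `run` wrapper only prints each command unless DRY_RUN=0 is set, so the script can be read and tried safely before pointing it at a real cluster.

```shell
#!/bin/sh
# Print each command; execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

# Hypothetical names; adjust to your program and cluster.
JAR=wordcount.jar
MAIN_CLASS=WordCount
LOCAL_INPUT=./input.txt
HDFS_IN=/user/hduser/wordcount/input
HDFS_OUT=/user/hduser/wordcount/output   # must not exist before the job runs

# 1. Copy the input into HDFS.
run hadoop fs -mkdir -p "$HDFS_IN"
run hadoop fs -put "$LOCAL_INPUT" "$HDFS_IN/"

# 2. Run the job.
run hadoop jar "$JAR" "$MAIN_CLASS" "$HDFS_IN" "$HDFS_OUT"

# 3. See the output. The pattern is quoted so that HDFS, not the local
#    shell, expands it to the reducer output files.
run hadoop fs -cat "$HDFS_OUT/part-*"
```

Running it as `DRY_RUN=0 sh run-wordcount.sh` (the script file name is also just an example) would execute the commands instead of printing them.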

Include Third Party Jars in Hadoop

I am new to Hadoop. I have added the Gson API to my MapReduce program. When I run the program, I get:
Error: java.lang.ClassNotFoundException: com.google.gson.Gson
Can anybody suggest how to add third-party libraries to Hadoop?
Be sure to add any dependencies to both HADOOP_CLASSPATH and -libjars when submitting a job, as in the following examples.
Use the following to add all the jar dependencies from the current and lib directories:
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:`echo *.jar lib/*.jar | sed 's/ /:/g'`
Bear in mind that when starting a job through hadoop jar you also need to pass it the jars of any dependencies with -libjars. Note that -libjars is only honored when the driver parses the generic options, for example by running through ToolRunner. I like to use:
hadoop jar <jar> <class> -libjars `echo ./lib/*.jar | sed 's/ /,/g'` [args...]
NOTE: The two sed commands use different delimiter characters: HADOOP_CLASSPATH is colon-separated, while -libjars needs to be comma-separated.
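If the echo-and-sed approach feels fragile (it breaks on file names containing spaces, for instance), the same two lists can be built with a small function. This is only a sketch: `join_jars` is a made-up helper, and the two jar names in the demo are dummy files created in a temp directory.

```shell
#!/bin/sh
# Join the existing files among the arguments with the separator given in $1.
join_jars() {
    sep=$1
    shift
    out=
    for f in "$@"; do
        [ -e "$f" ] || continue          # skip patterns that matched nothing
        out="${out:+$out$sep}$f"
    done
    printf '%s\n' "$out"
}

# Demo with two dummy jars in a scratch lib directory.
demo=$(mktemp -d)
mkdir "$demo/lib"
touch "$demo/lib/gson.jar" "$demo/lib/java-json.jar"

cp_list=$(cd "$demo" && join_jars : lib/*.jar)   # for HADOOP_CLASSPATH
libjars=$(cd "$demo" && join_jars , lib/*.jar)   # for -libjars

echo "HADOOP_CLASSPATH entries: $cp_list"
echo "-libjars value:           $libjars"

rm -rf "$demo"
```

In a real project directory this would be used as `export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:$(join_jars : lib/*.jar)"` and `-libjars "$(join_jars , lib/*.jar)"`.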
