Running a hadoop job - hadoop

It is the first time I'm running a job on hadoop and started from WordCount example. To run my job, I', using this command
hduser#ubuntu:/usr/local/hadoop$ bin/hadoop jar hadoop*examples*.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output
and I think we should copy the jar file in /usr/local/hadoop . My first question is that what is the meaning of hadoop*examples*? and if we want to locate our jar file in another location for example /home/user/WordCountJar, what I should do? Thanks for your help in advance.

I think we should copy the jar file in /usr/local/hadoop
It is not mandatory. But if you have your jar at some other location, you need to specify the complete path while running your job.
My first question is that what is the meaning of hadoop*examples*?
hadoop*examples* is the name of your jar package that contains your MR job along with other dependencies. Here, * signifies that it can be any version. Not specifically 0.19.2 or something else. But, I feel it should be hadoop-examples-*.jar and not hadoop*examples*.jar
and if we want to locate our jar file in another location for example
/home/user/WordCountJar, what I should do?
If your jar is present in a directory other than the directory from where you are executing the command, you need to specify the complete path to your jar. Say,
bin/hadoop jar /home/user/WordCountJar/hadoop-*-examples.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output

The examples is just wildcard expansion to account for different version numbers in the file name. For example: hadoop-0.19.2-examples.jar
You can use the full path to your jar like so:
bin/hadoop jar /home/user/hadoop-0.19.2-examples.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output
Edit: the asterisks surrounding the word examples got removed from my post at time of submission.

Related

Change tmp directory while running yarn jar command

I am running an MR job using yarn jar command and it creates a temporary jar in /tmp folder which fills up the entire disk space. I want to redirect the path of this jar to some other folder where I have more disk space. On this link, I came to know that we can change the path by setting the property mapred.local.dir for hadoop version 1.x. I am using the following command to run the jar
yarn jar myjar.jar MyClass myyml.yml arg1 -D mapred.local.dir="/grid/1/uie/facts"
The above argument mapred.local.dir doesn't change the path and it is still creating the jar in tmp folder.
Found the hack to not write the unjar file to /tmp folder. Apparently, it is not a configurable behaviour, so we can avoid the use of 'hadoop jar' or 'yarn jar'(RunJar utility) by invoking instead with the generated classpath:
java -cp $(hadoop classpath):my-fat-jar-with-all-dependencies.jar
your.app.mainClass
1. Reference link

Implementing hadoop example, jar error produced

I wanted to execute a hadoop example and it says jar not found. The code was run using hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+
The error produced was:
Not a valid JAR : /home/rahul/hadoop/hadoop-examples-*.jar
Try to check the jar specified path ,i think it is the cause of the issue

How do I specify multiple libpath in oozie job?

My oozie job uses 2 jars x.jar and y.jar and following is my job.properties file.
oozie.libpath=/lib
oozie.use.system.libpath=true
This works perfectly when both the jars are present at same location on HDFS at /lib/x.jar and /lib/y.jar
Now I have 2 jars placed at different locations /lib/1/x.jar and /lib/2/y.jar.
How can I re-write my code such that both the jars are used while running the map reduce job?
Note: I have already refernced the answer How to specify multiple jar files in oozie but, this does not solve my problem
Found the answer at
http://blog.cloudera.com/blog/2014/05/how-to-use-the-sharelib-in-apache-oozie-cdh-5/
Turns out that I can specify multiple paths separated by comma in the job.properties file:
oozie.libpath=/path/to/jars,another/path/to/jars

Example Jar in Hadoop release

I am learning Hadoop with book 'Hadoop in Action' by Chuck Lam. In first chapter the books says that Hadoop installation will have example jar and by running 'hadoop jar hadoop-*-examples.jar' will show all the examples. But when I run the command then it throw error 'Could not find or load main class org.apache.hadoop.util.RunJar'. My guess is that installed Hadoop doesn't have example jar. I have installed 'hadoop-2.1.0-beta.tar.gz' on cygwin on Win 7 laptop. Please suggest how to get example jar.
run following command
hadoop jar PathToYourJarFile wordcount inputPath OutputPath
you can get examples jar file at your hadoop installation directory
What I can suggest here is you should manually go to the Hadoop installation directory and look for a jar name similar to hadoop-examples.jar yourself. Different distribution can have different names for the jar.
If you are in Cygwin, while in the Hadoop Installation directory you can also do a ls *examples*.jar to find the same, narrowing down the file listing to any jar file containing examples as a string.
You can then directly use the jar file name like --
hadoop jar <exampleJarYourFound.jar>
Hope this takes you to a solution.

How to execute map reduce program(ex. wordcount) from HDFS and see the output?

I am new to Hadoop. I have a simple wordcount program in eclipse which takes input files and then shows the output. But I need to execute the same program from HDFS. I have already created a JAR file for the wordcount program.
Can any one pls let me know how to proceed?
You need to have a cluster set up, even if is a single node cluster. Then you can run your .jar from the hadoop command line:
jar
Runs a jar file. Users can bundle their Map Reduce code in a jar
file and execute it using this command.
Usage: hadoop jar <jar> [mainClass] args...
The streaming jobs are run via this command. Examples can be referred
from Streaming examples
Word count example is also run using jar command. It can be referred
from Wordcount example
Initially you need to set up a hadoop cluster as discussed by Remus.
Single Node SetUp and Multi Node SetUp are two good way to start with.
Once you have the set up done, start hadoop daemons and copy the input files into any hdfs directory.
Prepare the jar of your program.
Run the jar on the terminal using hadoop jar <you jar name> <your main class> <input path><output directory path>
(The jar arguments depend on your program)

Resources