Change tmp directory while running yarn jar command - hadoop

I am running an MR job using the yarn jar command, and it creates a temporary jar in the /tmp folder which fills up the entire disk. I want to redirect this jar to another folder where I have more disk space. From a link I found, I learned that for Hadoop version 1.x the path can be changed by setting the property mapred.local.dir. I am using the following command to run the jar:
yarn jar myjar.jar MyClass myyml.yml arg1 -D mapred.local.dir="/grid/1/uie/facts"
The mapred.local.dir argument above doesn't change the path; the jar is still created in the /tmp folder.
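Worth noting: generic -D options are parsed only if MyClass routes its arguments through ToolRunner, and they must appear before the application arguments, so the ordering would have to be as below. Even then, mapred.local.dir governs where the cluster nodes keep intermediate MapReduce data (on Hadoop 1.x), not where the client-side RunJar unpacks the jar, which is why it cannot fix the /tmp problem.
yarn jar myjar.jar MyClass -D mapred.local.dir="/grid/1/uie/facts" myyml.yml arg1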

Found a workaround to stop the unjarred files from being written to the /tmp folder. Apparently this behaviour is not configurable, so the fix is to avoid 'hadoop jar' or 'yarn jar' (the RunJar utility) and instead invoke the main class directly with the generated classpath:
java -cp $(hadoop classpath):my-fat-jar-with-all-dependencies.jar your.app.mainClass
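Alternatively, on some Hadoop 2.x versions RunJar takes its extraction directory from the JVM's java.io.tmpdir property, so redirecting that property through the client options may let you keep 'yarn jar' (worth verifying against your version's RunJar source; the yarn launcher may read YARN_CLIENT_OPTS instead of HADOOP_CLIENT_OPTS on Hadoop 2):
# point the client JVM's temp dir at the larger volume;
# RunJar's hadoop-unjar* directories should then land there
export HADOOP_CLIENT_OPTS="-Djava.io.tmpdir=/grid/1/uie/facts"
yarn jar myjar.jar MyClass myyml.yml arg1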

Related

Example Jar in Hadoop release

I am learning Hadoop with the book 'Hadoop in Action' by Chuck Lam. In the first chapter, the book says that the Hadoop installation includes an examples jar, and that running 'hadoop jar hadoop-*-examples.jar' will list all the examples. But when I run the command, it throws the error 'Could not find or load main class org.apache.hadoop.util.RunJar'. My guess is that the installed Hadoop doesn't have the examples jar. I have installed 'hadoop-2.1.0-beta.tar.gz' on Cygwin on a Windows 7 laptop. Please suggest how to get the examples jar.
Run the following command:
hadoop jar PathToYourJarFile wordcount inputPath outputPath
You can find the examples jar file in your Hadoop installation directory.
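For the hadoop-2.1.0-beta tarball from the question, the examples jar sits under share/hadoop/mapreduce rather than the installation root, so the call would look roughly like this (jar name assumed from the version string; input and output paths are illustrative):
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.1.0-beta.jar wordcount /user/input /user/output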
What I can suggest here is that you manually go to the Hadoop installation directory and look for a jar with a name similar to hadoop-examples.jar yourself. Different distributions can use different names for the jar.
Since you are in Cygwin, while in the Hadoop installation directory you can also run ls *examples*.jar to find it, narrowing the file listing down to any jar file containing examples as a substring.
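If the jar isn't at the top level, a recursive search from the installation root will also locate it (assuming HADOOP_HOME points at the extracted tarball):
find "$HADOOP_HOME" -name '*examples*.jar'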
You can then directly use the jar file name like --
hadoop jar <exampleJarYourFound.jar>
Hope this takes you to a solution.

Running a hadoop job

It is the first time I'm running a job on Hadoop, and I started from the WordCount example. To run my job, I'm using this command:
hduser#ubuntu:/usr/local/hadoop$ bin/hadoop jar hadoop*examples*.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output
I think we have to copy the jar file to /usr/local/hadoop for this. My first question is: what is the meaning of hadoop*examples*? And if we want to keep our jar file in another location, for example /home/user/WordCountJar, what should I do? Thanks for your help in advance.
I think we have to copy the jar file to /usr/local/hadoop
It is not mandatory. But if you have your jar at some other location, you need to specify the complete path while running your job.
My first question is: what is the meaning of hadoop*examples*?
hadoop*examples* is the name of the jar package that contains your MR job along with its dependencies. Here, * is a shell wildcard signifying that it can be any version, not specifically 0.19.2 or anything else. But I feel it should be hadoop-examples-*.jar and not hadoop*examples*.jar.
and if we want to keep our jar file in another location, for example /home/user/WordCountJar, what should I do?
If your jar is present in a directory other than the directory from where you are executing the command, you need to specify the complete path to your jar. Say,
bin/hadoop jar /home/user/WordCountJar/hadoop-*-examples.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output
The asterisks around examples are just shell wildcard expansion, to account for different version numbers in the file name, for example hadoop-0.19.2-examples.jar.
You can use the full path to your jar like so:
bin/hadoop jar /home/user/hadoop-0.19.2-examples.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output
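If in doubt about what the wildcard will match, you can let the shell show you before running the job (if the pattern is echoed back unchanged, nothing matched):
echo /usr/local/hadoop/hadoop-*-examples.jar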

Unable to execute Map/Reduce job

I've been trying to figure out how to execute my Map/Reduce job for almost 2 days now. I keep getting a ClassNotFoundException.
I've installed a Hadoop cluster on Ubuntu using Cloudera CDH4.3.0. The .java file (DemoJob.java, which is not inside any package) is in a folder called inputs, and all required jar files are in inputs/lib.
I followed http://www.cloudera.com/content/cloudera-content/cloudera-docs/HadoopTutorial/CDH4/Hadoop-Tutorial/ht_topic_5_2.html for reference.
I compile the .java file using:
javac -cp "inputs/lib/hadoop-common.jar:inputs/lib/hadoop-map-reduce-core.jar" -d Demo inputs/DemoJob.java
(In the link, it says -cp should be "/usr/lib/hadoop/:/usr/lib/hadoop/client-0.20/". But I don't have those folders in my system at all)
Create the jar file using:
jar cvf Demo.jar Demo
Move the 2 input files to HDFS.
(Now this is where I'm confused. Do I need to move the jar file to HDFS as well? It doesn't say so in the link. But if it is not in HDFS, then how does the hadoop jar .. command work? I mean how does it combine the jar file which is in Linux system and the input files which are in HDFS?)
I run my code using:
hadoop jar Demo.jar DemoJob /Inputs/Text1.txt /Inputs/Text2.txt /Outputs
I keep getting ClassNotFoundException : DemoJob.
Somebody please help.
A ClassNotFoundException only means that some class wasn't found when the class DemoJob was loaded. The missing class could be one referenced (imported, for example) by DemoJob. I think the problem is that you don't have the /usr/lib/hadoop/:/usr/lib/hadoop/client-0.20/ folders (and their classes) on your classpath. The classes that should be there but aren't are probably what is triggering the exception.
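One way to avoid guessing at distribution-specific paths when compiling is to ask the installed client for its own classpath (assuming the hadoop command is on the PATH):
javac -cp "$(hadoop classpath)" -d Demo inputs/DemoJob.java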
Finally figured out what the problem was. Instead of creating the jar file from a folder, I created it directly from the .class files using jar -cvf Demo.jar *.class
This resolved the ClassNotFound error. But I don't understand why it was not working earlier. Even when I created the jar file from a folder, I did mention the folder name when executing the class, as: hadoop jar Demo.jar Demo.DemoJob /Inputs/Text1.txt /Inputs/Text2.txt /Outputs
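A likely explanation, to fill the gap: the directory layout inside a jar has to match the package declaration of the classes. Since DemoJob.java declares no package, DemoJob.class must sit at the jar root. You can inspect the entries with jar tf:
jar tf Demo.jar
# jar cvf Demo.jar Demo     -> entry Demo/DemoJob.class; the classloader reads
#                              this path as class Demo.DemoJob, but the class file
#                              declares no package, so neither name resolves
# jar cvf Demo.jar *.class  -> entry DemoJob.class; matches the default package,
#                              so DemoJob loads fine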

Issue adding third-party jars to hadoop job

I am trying to add third-party jars to a hadoop job. I am adding each jar using the DistributedCache.addFileToClassPath method. I can see that the mapred.job.classpath.files is properly populated in the job xml file.
-libjars does not work for me either (most likely because we are not using ToolRunner).
Any suggestions, what could be wrong?
Add the jar to HADOOP_CLASSPATH:
vi $HADOOP_HOME/etc/hadoop/hadoop-env.sh
Add this as the last line:
export HADOOP_CLASSPATH=/root/hadoop/extrajars/java-json.jar:$HADOOP_CLASSPATH
"/root/hadoop/extrajars/java-json.jar" is path on linux box itself and not on HDFS
Restart the hadoop
Command
hadoop classpath
Should show the jar in classpath
Now run MR job as usual
hadoop jar <MR-program jar> <MR Program class> <input dir> <output dir>
It will pick up the jar from the classpath as expected.
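Note that HADOOP_CLASSPATH only affects the client-side JVM; for a jar to be visible inside the map and reduce tasks you still need DistributedCache (as the question already does) or -libjars. And -libjars is handled by GenericOptionsParser, which only runs when the driver goes through ToolRunner, so the question's guess about ToolRunner is likely right. A minimal sketch of such a driver (class and job names are illustrative):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Routing main() through ToolRunner makes generic options such as
// -libjars and -D get parsed before run() sees the remaining args.
public class MyDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "my-job");
        job.setJarByClass(MyDriver.class);
        // ... configure mapper, reducer, and input/output paths here ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new MyDriver(), args));
    }
}
With that in place, the third-party jar can be passed on the command line:
hadoop jar <MR-program jar> MyDriver -libjars /root/hadoop/extrajars/java-json.jar <input dir> <output dir>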

What's wrong with my hadoop configuration?

I've done all the setup that Hadoop requires, but something still seems wrong. For example:
I have a class Hello.class. When I use the command "java Hello" it works correctly, but when I try "hadoop Hello" it reports "cannot load or find the main class". However, when I use the jar command to package Hello.class into Hello.jar and then run "hadoop jar Hello.jar Hello", it works correctly, just as with "java Hello".
What is wrong with my configuration?
In file etc/profile the following has been added:
export JAVA_HOME=/usr/jdk1.7.0_04
export HADOOP_INSTALL=/usr/hadoop-1.0.1
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_INSTALL/bin
export CLASS_PATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
I've added "export JAVA_HOME=/usr/jdk1.7.0_04" to the file "hadoop-env.sh"
I've changed core-site.xml, hdfs-site.xml, mapred-site.xml accordingly
Is there anyone having the same problem?
The hadoop Hello command runs Hadoop and looks for a class named Hello on the current classpath, which doesn't contain your class.
Bundling your class into a jar and running hadoop jar myjar.jar Hello tells Hadoop to add the jar file myjar.jar to the classpath and then run the class named Hello (which is now on the classpath).
If you want to add a class to the classpath directly, configure the HADOOP_CLASSPATH environment variable.
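For example (the class directory here is illustrative):
export HADOOP_CLASSPATH=/home/user/classes:$HADOOP_CLASSPATH
hadoop Hello
The hadoop command runs an arbitrary class given as its first argument, so with the directory containing Hello.class on HADOOP_CLASSPATH, Hello now resolves.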
