Send executable jar to hadoop cluster and run as "hadoop jar" - hadoop

I commonly build an executable jar package with a main method and run it from the command line with "hadoop jar Some.jar ClassWithMain input output".
In this main method, the Job and Configuration are set up, and there are setters to specify the mapper and reducer classes, e.g. conf.setMapperClass(Mapper.class).
However, when submitting the job remotely through the Hadoop client API, I have to set the jar, the mapper, and other classes explicitly:
job.setJarByClass(HasMainMethod.class);
job.setMapperClass(Mapper_Class.class);
job.setReducerClass(Reducer_Class.class);
I want to programmatically transfer the jar from the client to the remote Hadoop cluster and execute it the way the "hadoop jar" command does, so that the main method itself specifies the mapper and reducer.
How can I do this?

hadoop is only a shell script. Eventually, hadoop jar invokes org.apache.hadoop.util.RunJar; all that hadoop jar does beyond that is set up the CLASSPATH for you. So you can call RunJar directly.
For example,
String input = "...";
String output = "...";
org.apache.hadoop.util.RunJar.main(
new String[]{"Some.jar", "ClassWithMain", input, output});
However, you need to set the CLASSPATH correctly before you use it. A convenient way to get the correct CLASSPATH is the hadoop classpath command; run it and it prints the full CLASSPATH.
Then set up the CLASSPATH before you run your java application. For example,
export CLASSPATH=$(hadoop classpath):$CLASSPATH
java -jar YourJar.jar
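Putting this together, a minimal sketch of what the main class inside YourJar.jar could look like (the jar name, main class, and HDFS paths are placeholders), compiled and launched with the CLASSPATH exported as above:
import org.apache.hadoop.util.RunJar;

public class RemoteJarRunner {
    // RunJar.main declares "throws Throwable", so we simply propagate it here.
    public static void main(String[] args) throws Throwable {
        String input = "hdfs:///user/me/input";    // placeholder input path
        String output = "hdfs:///user/me/output";  // placeholder output path
        // Equivalent to: hadoop jar Some.jar ClassWithMain <input> <output>
        RunJar.main(new String[]{"Some.jar", "ClassWithMain", input, output});
    }
}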

Related

How does the example find the lib in Oozie best case?

Following the Oozie documentation, I am trying to run a map-reduce example on Oozie. As everyone knows, 'workflow.xml' (and 'coordinator.xml') should be in HDFS.
Then I run the command: oozie job -oozie http://localhost:11000/oozie -config examples/apps/map-reduce/job.properties -run. I also know that 'job.properties' should be on the local file system.
But two things confuse me:
1. Why does the jar or class referenced in workflow.xml come from the lib directory in HDFS?
2. There is a picture showing the content of oozie-examples-4.3.1.jar. This jar is in HDFS, so how can it import lib?
Forgive my poor English.
The highlighted red box in your screenshot is part of the Hadoop and Java default classpath. Any Java code that runs within YARN as part of MapReduce has access to the packages that appear when you run the hadoop classpath command. By the way, the mapred.* classes of Hadoop are almost all deprecated.
That has nothing to do with Oozie per se, but Oozie extends the Hadoop classpath with the Oozie ShareLib, which must be explicitly enabled with a property in the properties file passed via -config:
oozie.use.system.libpath=true
In addition to that classpath, Oozie ships the ${wf.application.path}/lib directory to all running jobs.
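For reference, a sketch of what the job.properties passed with -config typically contains (host names and paths here are placeholders; the variable names follow the Oozie examples):
nameNode=hdfs://localhost:8020
jobTracker=localhost:8032
queueName=default
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/user/${user.name}/examples/apps/map-reduce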

Is maven JAR runnable on hadoop?

To produce a jar from a Hadoop MapReduce program (the MapReduce wordcount example) I used Maven.
The 'clean' and 'install' goals completed successfully.
The build also succeeds, and running it as a Java application with arguments (input and output)
gives the expected result.
The problem is that it does not run on Hadoop.
It gives the following error:
Exception in thread "main" java.lang.ClassNotFoundException: WordCount
Is maven JAR runnable on hadoop?
Maven is a build tool which creates a Java artifact. Any JAR that contains the Hadoop dependencies and names the class with the main() method in its manifest should work with Hadoop.
Try running your JAR using the below command
hadoop jar your-jar.jar wordcount input output
where "wordcount" is the name of the class with main method,"input" and "output" are the arguments.
Two things
1) I think you are missing the package details before the class name. Copy your package name and put it before the class name and it should work.
hadoop jar /home/user/examples.jar com.test.examples.WordCount /home/user/inputfolder /home/user/outputfolder
PS: If you are using a jar for which you don't have the source code, you can run
jar -tvf /home/user/examples.jar
and it will print all the classes with their folder names. Replace the "/" with "." (dot) and you get the package name. But this needs the JDK (not just the JRE) on the PATH.
2) You are trying to run a MapReduce program from a Windows prompt. Are you sure you have Hadoop installed on your Windows machine?

Can I run JAR file which includes another JAR file under lib folder in HDInsight?

Is it possible to run a JAR file in HDInsight which includes another JAR file under the lib folder?
JAR file
├── folder1/subfolder1/myApp/…
│       └── .class file
│
└── lib/dependency.jar   // library (jar file)
Thank you!
On HDInsight, we should be able to run a Java MapReduce JAR which has a dependency on another JAR. There are a few ways to do this, but typically not by copying the second JAR under the lib folder on the headnode.
The reasons are: depending on where the dependency is needed, you may have to copy the JAR under the lib folder of all worker nodes and headnodes, which becomes a tedious task. Also, the change will be erased when the node gets re-imaged by Azure, so it is not a supported approach.
Now, there are two types of dependencies:
1. The MapReduce driver class has a dependency on another external JAR.
2. The map or reduce task has a dependency on another JAR, where the map or reduce function calls an API on the external JAR.
Scenario #1 (MapReduce driver class depends on another JAR):
We can use one of the following options:
a. Copy your dependency JAR to a local folder (like d:\test on Windows HDI) on the headnode and then use RDP to append this path to the HADOOP_CLASSPATH environment variable on the headnode. This is suitable for dev/test to run jobs directly from the headnode, but won't work with remote job submissions, so it is not suitable for production scenarios.
b. Use a 'fat' or 'uber' jar to include all the dependent jars inside your JAR; you can use the Maven 'Shade' plugin, example here.
Scenario #2 (map or reduce function calls an API on an external JAR):
Basically, use the -libjars option.
If you want to run the MapReduce JAR from the Hadoop command line:
a. Copy the MapReduce JAR to a local path (like d:\test).
b. Copy the dependent JAR to WASB.
Example of running a MapReduce JAR with a dependency:
hadoop jar D:\Test\BlobCount-0.0.1-SNAPSHOT.jar css.ms.BlobCount.BlobCounter -libjars wasb://mycontainername@azimwasb.blob.core.windows.net/mrdata/jars/microsoft-windowsazure-storage-sdk-0.6.0.jar -DStorageAccount=%StorageAccount% -DStorageKey=%StorageKey% -DContainer=%Container% /mcdpoc/mrinput /mcdpoc/mroutput
The example is using HDInsight Windows; you can use a similar approach on HDInsight Linux as well.
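Note that -libjars (like the -D generic options above) is only honored when the driver lets Hadoop's GenericOptionsParser process the arguments, which is usually done by implementing Tool and launching through ToolRunner. A minimal sketch, with placeholder class and job names:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // By the time run() is called, ToolRunner has already stripped
        // -libjars and -D options out of args and applied them to getConf().
        Job job = Job.getInstance(getConf(), "my job");
        job.setJarByClass(MyDriver.class);
        // set mapper, reducer, and output key/value classes here as usual
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new MyDriver(), args));
    }
}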
Using PowerShell or the .NET SDK (remote job submission): with PowerShell, you can use the -LibJars parameter to refer to dependent jars.
You can review the following documentation, which has various examples of using PowerShell, SSH, etc.:
https://azure.microsoft.com/en-us/documentation/articles/hdinsight-use-mapreduce/
I hope it helps!
Thanks,
Azim

How to execute map reduce program(ex. wordcount) from HDFS and see the output?

I am new to Hadoop. I have a simple wordcount program in Eclipse which takes input files and then shows the output. But I need to execute the same program from HDFS. I have already created a JAR file for the wordcount program.
Can anyone please let me know how to proceed?
You need to have a cluster set up, even if it is a single-node cluster. Then you can run your .jar from the hadoop command line:
jar
Runs a jar file. Users can bundle their Map Reduce code in a jar
file and execute it using this command.
Usage: hadoop jar <jar> [mainClass] args...
The streaming jobs are run via this command. Examples can be referred
from Streaming examples
Word count example is also run using jar command. It can be referred
from Wordcount example
Initially you need to set up a Hadoop cluster as discussed by Remus.
Single Node Setup and Multi Node Setup are two good guides to start with.
Once the setup is done, start the Hadoop daemons and copy the input files into an HDFS directory.
Prepare the jar of your program.
Run the jar from the terminal using hadoop jar <your jar name> <your main class> <input path> <output directory path>
(the jar arguments depend on your program).

Issue adding third-party jars to hadoop job

I am trying to add third-party jars to a Hadoop job. I am adding each jar using the DistributedCache.addFileToClassPath method. I can see that mapred.job.classpath.files is properly populated in the job XML file.
-libjars does not work for me either (most likely because we are not using ToolRunner).
Any suggestions, what could be wrong?
Add the jar to HADOOP_CLASSPATH:
vi $HADOOP_HOME/etc/hadoop/hadoop-env.sh
Add this as the last line:
export HADOOP_CLASSPATH=/root/hadoop/extrajars/java-json.jar:$HADOOP_CLASSPATH
Note that "/root/hadoop/extrajars/java-json.jar" is a path on the Linux box itself, not on HDFS.
Restart Hadoop.
The command
hadoop classpath
should now show the jar in the classpath.
Now run the MR job as usual:
hadoop jar <MR-program jar> <MR Program class> <input dir> <output dir>
It will use the jar as expected.
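Separately, if you want to keep the DistributedCache.addFileToClassPath approach from the question, one thing worth double-checking is that the jar has actually been uploaded to HDFS first, since the distributed cache serves files out of the shared file system rather than the local disk of the submitting machine. A rough sketch, with a placeholder HDFS path:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;

// e.g. first upload the jar: hadoop fs -put /root/hadoop/extrajars/java-json.jar /libs/
Configuration conf = job.getConfiguration();   // "job" is the Job being set up in the driver
DistributedCache.addFileToClassPath(new Path("/libs/java-json.jar"), conf);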
