Running hadoop jar command in edge Node - hadoop

I am new to hadoop and have below question(s) on running hadoop jar command from edgeNode(http://www.dummies.com/programming/big-data/hadoop/edge-nodes-in-hadoop-clusters/).hadoop jar ${JAR_FILE} {CLASS_NAMEWithPackage} . Have below Question(s)
After running above command why the jar is extracted in
Djava.io.tmpdir dir in edgeNode ? Every time I run this command I get a
directory something like hadoop-unjar7637059002474165348 in temp
dir,that has extracted jar.Is this expected? I was thinking hadoop
jar submits whole jar to yarn but I could not understand why it is
extracted in temp folder ?
After extracting the jar in edge Node,does the program expected to
remove the extracted jar directory.In this case
hadoop-unjar7637059002474165348 ?
Thanks!

You can probably look at here and this question for why your jars are extracted in the edge node (client node) when you run the hadoop jar command. It's to support the 'jar-within-jar' idea while running your jar from the client node. Pushing jars to HDFS, yarn and all those happens after that but, before these happens, your jar has to be executed to begin with, right? In your case, you might have jar-within-jar or you might not, but the concept is supported.
About the auto delete, probably it's not auto deleted.

Related

Replicate updated jar file to each slave node on Spark

I have an Apache Spark cluster consisting of a master and multiple slave nodes. In the jars folder of each node I require the jar file for a program I run on Spark.
There are regular updates to this jar so I find myself constantly copying the updated jar file.
Is there a quick and easy way that an updated jar file can be replicated from master to all slave nodes or any other way of distributing this each time the jar is updated?
When you running your Spark job with spark-submit use --jars option. Using this option you can write path to jar file that you need.
Also, jars in --jars option will be automatically transferred to the cluster, so you need this jar only on the master node.
Read about how to use this option here.

Can I run JAR file which includes another JAR file under lib folder in HDInsight?

Is it possible to run a JAR file in HDInsight which includes another JAR file under the lib folder?
JAR file
├/folder1/subfolder1/myApp/…
│    └.class file
|
|
└ lib/dependency.jar // library (jar file)
Thank you!
On HDInsight, we should be able to run a Java MapReduce JAR, which has a dependency on another JAR. There are a few ways to do this, but typically not by copying the second JAR under lib folder on headnode.
Reasons are – Depending on where the dependency is, you may need to copy the JAR under the lib folder of all worker nodes and headnodes – becomes a tedious task. Also, this change will be erased when the node gets re-imaged by Azure, and hence not a supported way.
Now, there are two types of dependencies –
1. MapReduce driver class has dependency on another external JAR
2. Map or reduce task has dependency on another JAR, where Map or Reduce functions calls an API on the external JAR.
Scenario #1 (MapReduce driver class depends on another JAR):
we can use one of the following options –
a. Copy your dependency JAR to a local folder (like d:\test on windows HDI) on the headnode and then use RDP to append this path to HADOOP_CLASSPATH environment variable on head node– this is suitable for dev/test to run jobs directly from headnode, but won’t work with remote job submissions. So this is not suitable for production scenarios.
b. Using a ‘fat or uber jar’ to include all the dependent jars inside your JAR – you can use Maven ‘Shade’ plugin , example here
Scenario #2 ( Map or Reduce function calls API on external JAR) -
Basically use –libjars option.
If you want to run the mapreduce JAR from Hadoop command line -
a. Copy the Mapreduce JAR to a local path (like d:\test )
b. Copy the dependent JAR on WASB
Example of running a mapreduce JAR with dependency-
hadoop jar D:\Test\BlobCount-0.0.1-SNAPSHOT.jar css.ms.BlobCount.BlobCounter -libjars wasb://mycontainername#azimwasb.blob.core.windows.net/mrdata/jars/microsoft-windowsazure-storage-sdk-0.6.0.jar -DStorageAccount=%StorageAccount% -DStorageKey=%StorageKey% -DContainer=%Container% /mcdpoc/mrinput /mcdpoc/mroutput
The example is using HDInsight windows – you can use similar approach on HDInsight Linux as well.
Using PowerShell or .Net SDK (remote job submission) –With PowerShell, you can use the –LibJars parameter to refer to dependent jars.
you can review the following documentations, these have various examples of using powerShell, SSH etc.
https://azure.microsoft.com/en-us/documentation/articles/hdinsight-use-mapreduce/
https://azure.microsoft.com/en-us/documentation/articles/hdinsight-use-mapreduce/
I hope it helps!
Thanks,
Azim

Running a hadoop job

It is the first time I'm running a job on hadoop and started from WordCount example. To run my job, I', using this command
hduser#ubuntu:/usr/local/hadoop$ bin/hadoop jar hadoop*examples*.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output
and I think we should copy the jar file in /usr/local/hadoop . My first question is that what is the meaning of hadoop*examples*? and if we want to locate our jar file in another location for example /home/user/WordCountJar, what I should do? Thanks for your help in advance.
I think we should copy the jar file in /usr/local/hadoop
It is not mandatory. But if you have your jar at some other location, you need to specify the complete path while running your job.
My first question is that what is the meaning of hadoop*examples*?
hadoop*examples* is the name of your jar package that contains your MR job along with other dependencies. Here, * signifies that it can be any version. Not specifically 0.19.2 or something else. But, I feel it should be hadoop-examples-*.jar and not hadoop*examples*.jar
and if we want to locate our jar file in another location for example
/home/user/WordCountJar, what I should do?
If your jar is present in a directory other than the directory from where you are executing the command, you need to specify the complete path to your jar. Say,
bin/hadoop jar /home/user/WordCountJar/hadoop-*-examples.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output
The examples is just wildcard expansion to account for different version numbers in the file name. For example: hadoop-0.19.2-examples.jar
You can use the full path to your jar like so:
bin/hadoop jar /home/user/hadoop-0.19.2-examples.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output
Edit: the asterisks surrounding the word examples got removed from my post at time of submission.

How to execute map reduce program(ex. wordcount) from HDFS and see the output?

I am new to Hadoop. I have a simple wordcount program in eclipse which takes input files and then shows the output. But I need to execute the same program from HDFS. I have already created a JAR file for the wordcount program.
Can any one pls let me know how to proceed?
You need to have a cluster set up, even if is a single node cluster. Then you can run your .jar from the hadoop command line:
jar
Runs a jar file. Users can bundle their Map Reduce code in a jar
file and execute it using this command.
Usage: hadoop jar <jar> [mainClass] args...
The streaming jobs are run via this command. Examples can be referred
from Streaming examples
Word count example is also run using jar command. It can be referred
from Wordcount example
Initially you need to set up a hadoop cluster as discussed by Remus.
Single Node SetUp and Multi Node SetUp are two good way to start with.
Once you have the set up done, start hadoop daemons and copy the input files into any hdfs directory.
Prepare the jar of your program.
Run the jar on the terminal using hadoop jar <you jar name> <your main class> <input path><output directory path>
(The jar arguments depend on your program)

Hadoop Mapreduce with two jars (one of the jars is needed on namenode only)

The mapred task is a very simple 'wordcount' implemented by Java (plz, see http://wiki.apache.org/hadoop/WordCount ).
after the last line, "job.waitForCompletion(true);"
I add some code implemented by Jython.
It means the libraries for Jythoon is only needed on namenode.
However, I added all libraries for Jython to a single jar, and then
executed it
hadoop jar wordcount.jar in out
The wordcount is done without any problem.
The problem I want to solve is I have to heavy libraries for Jython that is not needed for the slave nodes(mappers and reducers). the jar is almost 15M (upper than 14M is for Jython).
Can I split them, and get the same results?
Nobody knows this question.
I've solved this problem as follows: even if it's not the best.
Simply, copy jython.jar to /usr/local/hadoop (or path of hadoop installed) which is the default classpath of hadoop, and make a jar without jython.jar
If you need very big libraries to mapreduce task, then
upload jython.jar to hdfs
hadoop fs -put jython.jar Lib/jython.jar
add the follow line to your main code
DistributedCache.addFileToClassPath(new URI("Lib/jython.jar"));

Resources