Confused between -libjars and -archives to distribute side data to task nodes - hadoop

I have a MapReduce job that uses a third-party jar. I know there are two ways to pass a jar file to the task nodes: hadoop jar -archives /custom.jar or hadoop jar -libjars /custom.jar, provided my job uses GenericOptionsParser.
My question is: which option is the better choice, given that jar files can be passed with both -archives and -libjars?

-libjars is the option meant for shipping jars, as the documentation says. -archives is general purpose: it unpacks the archives on the task node, which you do not need for a jar (you will never want a library jar to be unzipped). Use -archives for shipping other kinds of files and making them available on the task node.
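To make the -libjars form concrete, here is a minimal sketch in plain Java (no Hadoop dependency) that assembles such a command line. The class name LibJarsCommand, the jar names, and com.example.MyJob are hypothetical. It assumes the driver parses its arguments through GenericOptionsParser (for example via ToolRunner) and that the main class is named on the command line rather than in the jar manifest.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Minimal sketch: assembles the argument list for a "hadoop jar" invocation
// that ships extra jars with -libjars. All jar and class names are hypothetical.
class LibJarsCommand {

    // Generic options such as -libjars go after the main class and before the
    // job's own arguments, so that GenericOptionsParser picks them up.
    static List<String> build(String jobJar, String mainClass,
                              List<String> libJars, List<String> jobArgs) {
        List<String> cmd = new ArrayList<>();
        cmd.add("hadoop");
        cmd.add("jar");
        cmd.add(jobJar);
        cmd.add(mainClass);
        if (!libJars.isEmpty()) {
            cmd.add("-libjars");
            // -libjars takes one comma-separated list, without spaces.
            cmd.add(String.join(",", libJars));
        }
        cmd.addAll(jobArgs);
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = build("myjob.jar", "com.example.MyJob",
                Arrays.asList("/custom.jar", "/other.jar"),
                Arrays.asList("/input", "/output"));
        System.out.println(String.join(" ", cmd));
    }
}
```

Running it prints: hadoop jar myjob.jar com.example.MyJob -libjars /custom.jar,/other.jar /input /output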

Related

Replicate updated jar file to each slave node on Spark

I have an Apache Spark cluster consisting of a master and multiple slave nodes. In the jars folder of each node I require the jar file for a program I run on Spark.
There are regular updates to this jar so I find myself constantly copying the updated jar file.
Is there a quick and easy way that an updated jar file can be replicated from master to all slave nodes or any other way of distributing this each time the jar is updated?
When you run your Spark job with spark-submit, use the --jars option to give the path to the jar file you need.
Jars listed with --jars are transferred to the cluster automatically, so you only need the jar on the master node.
See the spark-submit documentation for how to use this option.
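As a rough illustration, the sketch below builds a spark-submit argument list in plain Java. The class name SparkSubmitCommand, the master URL, the main class, and the jar paths are all hypothetical. It relies only on the documented layout of spark-submit, where options come before the application jar and --jars takes a comma-separated list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Minimal sketch: assembles a spark-submit argument list that ships extra jars
// with --jars. The master URL, class name, and jar paths are hypothetical.
class SparkSubmitCommand {

    static List<String> build(String master, String mainClass,
                              List<String> extraJars, String appJar) {
        List<String> cmd = new ArrayList<>();
        cmd.add("spark-submit");
        cmd.add("--master");
        cmd.add(master);
        cmd.add("--class");
        cmd.add(mainClass);
        if (!extraJars.isEmpty()) {
            cmd.add("--jars");
            // --jars takes one comma-separated list of jar paths.
            cmd.add(String.join(",", extraJars));
        }
        // Options come before the application jar; anything placed after the
        // application jar is passed to the application itself.
        cmd.add(appJar);
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = build("spark://master:7077", "com.example.MyApp",
                Arrays.asList("/opt/libs/updated-dependency.jar"),
                "/opt/apps/my-app.jar");
        System.out.println(String.join(" ", cmd));
    }
}
```

With this layout only the machine that runs spark-submit needs the current copy of the jar, so an updated jar does not have to be copied to every node by hand.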

Can I run JAR file which includes another JAR file under lib folder in HDInsight?

Is it possible to run a JAR file in HDInsight which includes another JAR file under the lib folder?
JAR file
├ /folder1/subfolder1/myApp/…
│    └ .class file
│
└ lib/dependency.jar // library (jar file)
Thank you!
On HDInsight you can run a Java MapReduce JAR that depends on another JAR. There are a few ways to do this, but copying the second JAR into the lib folder on the headnode is typically not one of them.
The reasons: depending on where the dependency is used, you may need to copy the JAR into the lib folder of all worker nodes and headnodes, which becomes tedious. The change is also erased when a node is re-imaged by Azure, so this is not a supported approach.
Now, there are two types of dependencies:
1. The MapReduce driver class depends on another external JAR.
2. A map or reduce task depends on another JAR, where the map or reduce function calls an API in the external JAR.
Scenario #1 (MapReduce driver class depends on another JAR):
You can use one of the following options:
a. Copy your dependency JAR to a local folder (like d:\test on Windows HDI) on the headnode, then use RDP to append this path to the HADOOP_CLASSPATH environment variable on the headnode. This is suitable for dev/test, where you run jobs directly from the headnode, but it won't work with remote job submissions, so it is not suitable for production scenarios.
b. Build a 'fat' or 'uber' jar that includes all the dependent jars inside your JAR. You can use the Maven Shade plugin for this.
Scenario #2 (map or reduce function calls an API in an external JAR):
Use the -libjars option.
If you want to run the MapReduce JAR from the Hadoop command line:
a. Copy the MapReduce JAR to a local path (like d:\test).
b. Copy the dependent JAR to WASB.
Example of running a MapReduce JAR with a dependency:
hadoop jar D:\Test\BlobCount-0.0.1-SNAPSHOT.jar css.ms.BlobCount.BlobCounter -libjars wasb://mycontainername@azimwasb.blob.core.windows.net/mrdata/jars/microsoft-windowsazure-storage-sdk-0.6.0.jar -DStorageAccount=%StorageAccount% -DStorageKey=%StorageKey% -DContainer=%Container% /mcdpoc/mrinput /mcdpoc/mroutput
The example uses HDInsight on Windows; you can use a similar approach on HDInsight Linux as well.
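The dependent JAR is addressed by a WASB URI of the form wasb://<container>@<account>.blob.core.windows.net/<path>. A minimal sketch of building such a URI in plain Java follows. The class name WasbJarUri and the container, account, and blob path values are hypothetical placeholders, not values from the example above.

```java
import java.net.URI;

// Minimal sketch: builds the wasb:// URI for a jar stored in Azure Blob storage.
// The container, account, and blob path used in main are hypothetical.
class WasbJarUri {

    // Format: wasb://<container>@<account>.blob.core.windows.net/<path>
    static URI forBlob(String container, String account, String blobPath) {
        // Drop a leading slash so the result has exactly one after the host.
        String path = blobPath.startsWith("/") ? blobPath.substring(1) : blobPath;
        return URI.create(String.format(
                "wasb://%s@%s.blob.core.windows.net/%s", container, account, path));
    }

    public static void main(String[] args) {
        URI jar = forBlob("mycontainer", "myaccount", "mrdata/jars/dependency.jar");
        // The resulting string is what would be passed to -libjars.
        System.out.println("-libjars " + jar);
    }
}
```

Running it prints: -libjars wasb://mycontainer@myaccount.blob.core.windows.net/mrdata/jars/dependency.jar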
Using PowerShell or the .NET SDK (remote job submission): with PowerShell, you can use the -LibJars parameter to refer to dependent jars.
You can review the following documentation, which has various examples using PowerShell, SSH, etc.
https://azure.microsoft.com/en-us/documentation/articles/hdinsight-use-mapreduce/
I hope it helps!
Thanks,
Azim

Location of Hadoop Libjars

I am running a Hadoop job on a cluster and passing some jars with the -libjars option. I am not sure where I can find these jars on the cluster. Also, are these jars copied from the local machine to the cluster, and if so, where on the cluster do they end up?
According to Hadoop: The Definitive Guide:
"Copies the specified JAR files from the local filesystem (or any filesystem if a scheme is specified) to the shared filesystem used by the jobtracker (usually HDFS), and adds them to the MapReduce task's classpath. This option is a useful way of shipping JAR files that a job is dependent on."
So the specified files are copied from the local file system to HDFS, and from there to the mapper/reducer nodes, where they are added to the classpath. The files are stored with a replication factor of mapreduce.client.submit.file.replication, which defaults to 10. The replication is higher than the usual 3 because the files have to be distributed to all the nodes that need them.

How to specify multiple jar files in oozie

I need a solution for the following problem:
My project has two jars. One jar contains all the bean classes, like Employee, and the other contains the MR jobs, which use the bean classes from the first jar. When I try to run the MR job as a simple Java program, I get a class-not-found error (com.abc.Employee is not found because it is in the other jar). Can anyone tell me how to solve this? In a real project there may be many jars, not just one or two, so how do I specify all of them?
You should have a lib folder in the HDFS directory where you store your Oozie workflow. Place both jar files in this folder and Oozie will ensure both are on the classpath when your MR job executes:
hdfs://namenode:8020/path/to/oozie/app/workflow.xml
hdfs://namenode:8020/path/to/oozie/app/lib/first.jar
hdfs://namenode:8020/path/to/oozie/app/lib/second.jar
See Workflow Application Deployment for more details.
If you often use the same jars in a number of Oozie workflows, you can place these common jars (HBase jars, for example) in a directory in HDFS and then set an Oozie property so that this folder's jars are included. See HDFS Share Libraries for more details.

Hadoop Mapreduce with two jars (one of the jars is needed on namenode only)

The MapReduce task is a very simple 'wordcount' implemented in Java (see http://wiki.apache.org/hadoop/WordCount).
After the last line, "job.waitForCompletion(true);", I added some code implemented in Jython.
This means the Jython libraries are only needed on the namenode.
However, I added all the Jython libraries to a single jar and then executed it:
hadoop jar wordcount.jar in out
The wordcount runs without any problem.
The problem I want to solve is that the heavy Jython libraries are not needed on the slave nodes (mappers and reducers). The jar is almost 15 MB, of which more than 14 MB is Jython.
Can I split them and get the same result?
Nobody seems to know the answer to this question.
I've solved the problem as follows, even if it's not the best way.
Simply copy jython.jar to /usr/local/hadoop (or wherever Hadoop is installed), which is on Hadoop's default classpath, and build the job jar without jython.jar.
If you need very big libraries in the MapReduce tasks themselves, then
upload jython.jar to HDFS:
hadoop fs -put jython.jar Lib/jython.jar
and add the following line to your main code:
DistributedCache.addFileToClassPath(new Path("Lib/jython.jar"), conf); // conf is the job's Configuration
