We are able to install the jar file using the UI method on a particular cluster, but our requirement is to install it on all the on-demand clusters in the workspace.
We are using the shell script below to download the jar file to DBFS. We are not sure how we can reference/install this jar on all clusters using a global init script:
curl https://repo1.maven.org/maven2/com/databricks/spark-xml_2.12/0.12.0/spark-xml_2.12-0.12.0.jar >/dbfs/FileStore/jars/maven/com/databricks/spark_xml_2_12_0_12_0.jar
Any help would be really appreciated!!
There is an alternative solution for adding the jar library to the job cluster that is called from Azure Data Factory while running our job.
In ADF, while calling the notebook, we have the option to include the jar path in DBFS, or we can provide the Maven coordinates.
[Screenshot: ADF library settings]
In the global init script you can just download this file into the /databricks/jars/ directory; it will then be picked up by the cluster.
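For example, a minimal global init script sketch (assuming the same spark-xml Maven URL as in the question; the target file name is illustrative):

#!/bin/bash
# Global init script sketch: download spark-xml into /databricks/jars/ so the
# cluster picks it up at startup. The URL is the one from the question above.
set -e
mkdir -p /databricks/jars
curl -fsSL https://repo1.maven.org/maven2/com/databricks/spark-xml_2.12/0.12.0/spark-xml_2.12-0.12.0.jar \
  -o /databricks/jars/spark-xml_2.12-0.12.0.jar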
Related
I have a client application that uses the Hadoop conf files (hadoop-site.xml and hadoop-core.xml).
I don't want to check them in under the resources folder, so I tried to add them via IDEA.
The problem is that the Hadoop Configuration ignores my HADOOP_CONF_DIR and loads the default confs from the Hadoop package. Any idea?
I'm using Gradle.
I ended up solving it by putting the configuration files in the test resources folder, so when the jar gets built it does not include them.
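For illustration, a hedged sketch of that layout using Gradle's default conventions (file names are taken from the question; the exact project structure is an assumption):

# Keep the Hadoop conf files under the standard Gradle test resources folder;
# files in src/test/resources are on the test classpath but are not packaged
# into the main jar.
mkdir -p src/test/resources
mv hadoop-site.xml hadoop-core.xml src/test/resources/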
Is it possible to run a JAR file in HDInsight which includes another JAR file under the lib folder?
JAR file
├── folder1/subfolder1/myApp/…
│   └── .class file
└── lib/dependency.jar   // library (jar file)
Thank you!
On HDInsight, we should be able to run a Java MapReduce JAR that has a dependency on another JAR. There are a few ways to do this, but typically not by copying the second JAR under the lib folder on the headnode.
The reasons are: depending on where the dependency is needed, you may have to copy the JAR under the lib folder of all worker nodes and headnodes, which becomes a tedious task. Also, this change will be erased whenever a node gets re-imaged by Azure, and hence it is not a supported approach.
Now, there are two types of dependencies:
1. The MapReduce driver class has a dependency on another external JAR.
2. The Map or Reduce task has a dependency on another JAR, where the Map or Reduce function calls an API on the external JAR.
Scenario #1 (the MapReduce driver class depends on another JAR):
We can use one of the following options:
a. Copy your dependency JAR to a local folder (like d:\test on Windows HDI) on the headnode, and then use RDP to append this path to the HADOOP_CLASSPATH environment variable on the headnode. This is suitable for dev/test when running jobs directly from the headnode, but it won't work with remote job submission, so it is not suitable for production scenarios (see the sketch after this list).
b. Use a 'fat' or 'uber' jar to include all the dependent jars inside your JAR; you can use the Maven Shade plugin, example here.
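As a hedged illustration of option (a), run from a Command Prompt on the Windows HDI headnode (the jar name is hypothetical; the d:\test path comes from the text above):

rem Append the copied dependency jar to HADOOP_CLASSPATH for jobs launched
rem from this headnode session (dev/test only; not for remote submission).
set HADOOP_CLASSPATH=%HADOOP_CLASSPATH%;d:\test\dependency.jar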
Scenario #2 (the Map or Reduce function calls an API on an external JAR):
Basically, use the -libjars option.
If you want to run the MapReduce JAR from the Hadoop command line:
a. Copy the MapReduce JAR to a local path (like d:\test).
b. Copy the dependent JAR to WASB.
Example of running a MapReduce JAR with a dependency:
hadoop jar D:\Test\BlobCount-0.0.1-SNAPSHOT.jar css.ms.BlobCount.BlobCounter -libjars wasb://mycontainername@azimwasb.blob.core.windows.net/mrdata/jars/microsoft-windowsazure-storage-sdk-0.6.0.jar -DStorageAccount=%StorageAccount% -DStorageKey=%StorageKey% -DContainer=%Container% /mcdpoc/mrinput /mcdpoc/mroutput
The example uses HDInsight on Windows; you can use a similar approach on HDInsight Linux as well.
Using PowerShell or the .NET SDK (remote job submission): with PowerShell, you can use the -LibJars parameter to refer to dependent jars.
You can review the following documentation, which has various examples of using PowerShell, SSH, etc.:
https://azure.microsoft.com/en-us/documentation/articles/hdinsight-use-mapreduce/
I hope it helps!
Thanks,
Azim
I have created a new Java project in eclipse-jee-kepler-SR2-win32-x86_64.
I have included the jars in flink-0.8.1\lib.
I have created the standard WordCount and it works.
I have modified my WordCount to take input from text files and CSV files, and it works.
All the imports work perfectly.
Then I tried to import org.apache.flink.api.java.io.jdbc.JDBCInputFormat.
Eclipse doesn't find it.
Why does Eclipse not find the import?
Because inside the jar flink-java-0.8.1.jar there is no io/jdbc directory.
I tried the same thing with flink-0.9.0-bin-hadoop27, and in the jar flink-dist-0.9.0.jar there is no org/apache/flink/api/java/io/jdbc directory either. I uncompressed the jar and searched for the string "jdbcinputformat" with 0 results. I searched for the string "jdbc" and it is only mentioned in org/apache/log4j, org/eclipse/jetty, and in other places that are not org.apache.flink.api.java.io.
So my question is: where do I find the class JDBCInputFormat?
What can I do to access SQL Server 2012 from Flink (apart from accessing it outside Flink, creating CSV files, and then reading them in Flink, which sounds horrible to me, since there should be a class specifically for that)?
The corresponding module is not included. In order to use it, you need to build Flink from source. Run the following commands:
git clone https://github.com/apache/flink.git
cd flink
mvn -DskipTests clean install
This builds the latest snapshot, flink-0.10-SNAPSHOT. If you want to use the stable version 0.9, run a different git clone command:
git clone -b release-0.9 https://github.com/apache/flink.git
In your current project, you need to change the Flink version used in your pom file accordingly, e.g., to 0.10-SNAPSHOT or 0.9-SNAPSHOT.
I am trying to run some simple jobs on EMR (AMI 3.6) with Hadoop 2.4 and Spark 1.3.1. I have installed Spark manually, without a bootstrap script. Currently I am trying to read and process data from S3, but it seems like I am missing an endless number of jars on my classpath.
I am running commands on spark-shell, starting the shell with:
spark-shell --jars jar1.jar,jar2.jar...
Commands run on the shell:
val lines = sc.textFile("s3://folder/file.gz")
lines.collect()
The errors always look something like "Class xyz not found". After I find the needed jar and add it to the classpath, I get the same error again, but with a different class name in the message.
Is there a set of jars that is needed for working with (compressed and uncompressed) S3 files?
I was able to figure out the jars needed for my classpath by following the logic in the AWS GitHub repo https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark.
The install-spark and install-spark-script.py files contain the logic for copying jars into a new 'classpath' directory used by the SPARK_CLASSPATH variable (spark-env.sh).
The jars I was personally missing were located in /usr/share/aws/emr/emrfs/lib/ and /usr/share/aws/emr/lib/.
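For instance, a hedged sketch of passing those directories to the shell with --jars (the paths come from the answer above; the exact jar set may vary by AMI version):

# Build a comma-separated --jars list from the EMR directories mentioned above
# and start spark-shell with it.
EMR_JARS=$(ls /usr/share/aws/emr/emrfs/lib/*.jar /usr/share/aws/emr/lib/*.jar | tr '\n' ',')
spark-shell --jars "${EMR_JARS%,}"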
It seems that you have not imported the proper libraries from within the spark-shell.
To do so:
import path.to.Class
or, more likely, if you want to import the RDD class, say:
import org.apache.spark.rdd.RDD
I have installed Hadoop 1.2.1 on a three-node cluster. While installing Oozie, when I try to generate a war file for the web console, I get this error:
hadoop-mapreduce-client-core-[0-9.]*.jar' not found in '/home/hduser/hadoop'
I believe the version of Hadoop that I am using doesn't have this jar file (I don't know where to find it). So can anyone please tell me how to create the war file and enable the web console? Any help is appreciated.
You are correct. You have two options:
1. Download the individual jars and put them inside your hadoop-1.2.1 directory, then generate the war file.
2. Download Hadoop 2.x and use it while creating the war file; once it has been built, continue using your hadoop-1.2.1.
For example, for oozie-3.3.2:
bin/oozie-setup.sh prepare-war -hadoop hadoop-1.1.2 ~/hadoop-eco/hadoop-2.2.0 -extjs ~/hadoop-eco/oozie/oozie-3.3.2/webapp/src/main/webapp/ext-2.2
Here I have built Oozie 3.3.2 to use it with hadoop-1.1.2, but the war was created using the jars from hadoop-2.2.0.
HTH