How do I programmatically install Maven libraries to a cluster using init scripts? - maven

I've been trying for a while now and I'm sure the solution is simple enough; I'm just struggling to find it. I'm pretty new, so go easy on me!
It's a requirement to do this using a pre-made init script, which is then selected in the UI when configuring the cluster.
I am trying to install com.microsoft.azure:azure-eventhubs-spark_2.12:2.3.18 on a cluster on Azure Databricks. Following the documentation's example (which installs a PostgreSQL driver), they produce an init script using the following command:
dbutils.fs.put("/databricks/scripts/postgresql-install.sh","""
#!/bin/bash
wget --quiet -O /mnt/driver-daemon/jars/postgresql-42.2.2.jar https://repo1.maven.org/maven2/org/postgresql/postgresql/42.2.2/postgresql-42.2.2.jar""", True)
My question is: what is the /mnt/driver-daemon/jars/postgresql-42.2.2.jar part of this code, and what would I have to change to make this work for my situation?
Many thanks in advance.

/mnt/driver-daemon/jars/postgresql-42.2.2.jar here is the output path where the downloaded jar file will be written. But it makes little sense, because a jar in that location won't be added to the classpath and won't be found by Spark. Jars need to be put into the /databricks/jars/ directory, where Spark picks them up automatically.
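For example, a minimal sketch of the documentation's script adjusted to use that directory (same PostgreSQL jar as in the question; not tested) would be:
dbutils.fs.put("/databricks/scripts/postgresql-install.sh","""
#!/bin/bash
# Download the jar into /databricks/jars/ so Spark picks it up from its classpath
wget --quiet -O /databricks/jars/postgresql-42.2.2.jar https://repo1.maven.org/maven2/org/postgresql/postgresql/42.2.2/postgresql-42.2.2.jar""", True)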
However, this approach of downloading jars directly works only for jars without dependencies, and for libraries like the EventHubs connector that is not the case - they won't work unless their dependencies are downloaded as well. Instead it's better to use the Cluster UI or the Libraries API (or the Jobs API for jobs) - with these methods, all transitive dependencies are fetched as well.
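As a sketch, installing the connector through the Libraries API could look like this (the workspace URL, token, and cluster ID are placeholders; the endpoint shown assumes the 2.0 Libraries API):
curl -X POST https://<databricks-instance>/api/2.0/libraries/install \
  -H "Authorization: Bearer <personal-access-token>" \
  -H "Content-Type: application/json" \
  -d '{
    "cluster_id": "<cluster-id>",
    "libraries": [
      { "maven": { "coordinates": "com.microsoft.azure:azure-eventhubs-spark_2.12:2.3.18" } }
    ]
  }'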
P.S. But really, instead of using the EventHubs connector, it's better to use the Kafka protocol, which EventHubs supports as well. There are several reasons for that:
It's better from a performance standpoint
It's better from a stability standpoint
The Kafka connector is included in DBR, so you don't need to install anything extra
You can read how to use Spark + EventHubs + the Kafka connector in the EventHubs documentation.
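As a rough sketch of that approach (the namespace, event hub name, and connection string are placeholders, and the options follow the usual Kafka-over-EventHubs setup, so double-check them against that documentation):
# Reading from Event Hubs through its Kafka endpoint with Spark's built-in Kafka source.
connection_string = "Endpoint=sb://<NAMESPACE>.servicebus.windows.net/;<rest-of-connection-string>"
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "<NAMESPACE>.servicebus.windows.net:9093")
      .option("subscribe", "<EVENT_HUB_NAME>")
      .option("kafka.security.protocol", "SASL_SSL")
      .option("kafka.sasl.mechanism", "PLAIN")
      .option("kafka.sasl.jaas.config",
              'org.apache.kafka.common.security.plain.PlainLoginModule required '
              'username="$ConnectionString" password="{}";'.format(connection_string))
      .load())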

Related

Apache Spark and gRPC

I'm writing an application that uses Apache Spark. For communicating with a client, I would like to use gRPC.
In my Gradle build file, I use
dependencies {
    compile('org.apache.spark:spark-core_2.11:1.5.2')
    compile 'org.apache.spark:spark-sql_2.11:1.5.2'
    compile 'io.grpc:grpc-all:0.13.1'
    ...
}
When leaving out gRPC, everything works fine. However, when gRPC is used, I can create the build but not execute it, because the packages pull in different versions of Netty. Spark seems to use netty-all, which contains the same methods (but with potentially different signatures) as the ones gRPC uses.
I tried shading (using com.github.johnrengelman.shadow), but somehow it still does not work. How can I approach this problem?
The general solution to this sort of thing is shading with relocation. See the answer to a similar problem with protobuf dependencies: https://groups.google.com/forum/#!topic/grpc-io/ABwMhW9bU34
I think the problem is that Spark uses Netty 4.0.x while gRPC uses Netty 4.1.0.
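A sketch of what relocation could look like with the shadow plugin already applied in the question (the com.example.shaded prefix is just an illustrative name):
shadowJar {
    // Move the Netty 4.1.x classes bundled for gRPC into a private namespace
    // so they cannot clash with the netty-all version Spark puts on the classpath
    relocate 'io.netty', 'com.example.shaded.io.netty'
}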

Flink for embedded stream processing in OSGi

I would like to use Apache Flink to process event inside an application.
My tests on a standalone JVM worked reasonably well, though Flink is a really big dependency.
I also tried to get it running in OSGi but gave up for now because of the many dependencies.
So my question is:
How small can I make Flink? I currently tried the Maven dependency on flink-streaming-java (see the sketch after the list below).
Unfortunately this depends on or embeds (listing only the questionable ones):
flink-shaded-hadoop2
kryo
zookeeper
netty
jetty
apache http client
apache http core
scala
akka
jackson
It also looks like several jars embed the same libs again and again, like some Google libs and ASM.
So is there some way to get a slimmer version of Flink for local usage that does not depend on so many libs?
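For reference, the dependency mentioned above would be declared roughly like this (the version is a placeholder, and depending on the Flink release the artifactId may carry a Scala suffix such as _2.11):
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-java</artifactId>
    <version><!-- your Flink version --></version>
</dependency>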
Many of the dependencies are required for Apache Flink's primary use cases, namely distributed stream and batch processing:
ZooKeeper for high availability in case of (process) failures
Netty for network data transfer
Jetty for monitoring via the REST API and web dashboard
Akka (and transitively Scala) for coordination of distributed processes
Most of these libraries are tightly coupled with the system and cannot easily be switched off or excluded.
I am sorry, there is no stripped-down version for local stream processing.

How can I use an Elasticsearch plugin in a JVM local node?

I'm in the process of adding support for Unicode normalization in ES with the help of the ICU analysis plugin. Installing this in a dedicated cluster is relatively easy, but I also need the plugin to be available during testing, where we use a JVM local node. Since it's a JVM local node, I can't simply run the commands described in the plugin documentation. How can I get my plugin to work for this local node?
After digging through the Elasticsearch source code I figured out the answer, and it is stupidly simple: just make sure the plugins are on your classpath and ES will pick them up automatically. In my case, adding the plugin to my pom.xml was enough.
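As an illustration, for the ICU plugin that could be a test-scoped dependency along these lines (the exact coordinates and version depend on the Elasticsearch release, so treat them as placeholders):
<dependency>
    <groupId>org.elasticsearch</groupId>
    <artifactId>elasticsearch-analysis-icu</artifactId>
    <version><!-- version matching your ES release --></version>
    <scope>test</scope>
</dependency>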

Deployment of artifacts to Hadoop cluster

Is there any pattern for how to deploy applications (jar files) to a Hadoop cluster? I am not talking about MapReduce jobs but about deploying applications for Spark, Flume, etc.
Within the Hadoop ecosystem, deployment alone is not sufficient. You need to restart services, deploy configurations (e.g. via Ambari), and so forth.
I have not found any specific tools. Is my assumption correct that you use standard automation tools like Maven/Jenkins and do the missing parts yourself?
Just wondering if I have overlooked something. I just do not want to reinvent the wheel ;)
If you are managing the Hadoop ecosystem, you can use Ambari or Cloudera Manager, but you will need to stop and restart their services for configuration and library changes. If the ecosystem is managed outside of these, then you have the option of managing the jars with outside tools like Puppet and Salt. Currently, we use Salt because of its push/pull abilities.
If you are talking about applications, like jobs running on Spark, you just provide the HDFS URL in the file path. For example:
spark-submit --class my.dev.org.SparkDriver --properties-file mySparkProps.conf wordcount-shaded.jar hdfs://servername/input/file/sample.txt hdfs://servername/output/sparkresults
For applications that have dependencies on third-party jar files, you have the option of shading the job's jar file to prevent other applications' libraries from interfering with each other. The downside is that the application jar file gets big. I use Maven, so I added the maven-shade-plugin and use the default scope (compile) for the dependencies.
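A minimal sketch of that plugin configuration, bound to the package phase with no relocations (the version shown is only an example):
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <version>2.4.3</version>
    <executions>
        <execution>
            <!-- Build the shaded (fat) jar when the project is packaged -->
            <phase>package</phase>
            <goals>
                <goal>shade</goal>
            </goals>
        </execution>
    </executions>
</plugin>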

Bare minimum of dependencies to work with HDFS

I need to put some files into HDFS from my client application. I am not planning to submit a job to Hadoop; I just need to drop something into HDFS.
A Maven dependency on hadoop-core brings in a lot of stuff like jersey-core etc., which I don't need at all.
Is there any simple client library to work with HDFS without getting a full stack of hadoop dependencies? What is the minimal set of maven dependencies I can use?
Is webhdfs the only option?
They introduced hadoop-client, which is a much better client library than hadoop-core.
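A minimal sketch of that dependency (the version is a placeholder; match it to your cluster's Hadoop version):
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version><!-- your Hadoop version --></version>
</dependency>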
