How do you handle multiple versions of the same jar? - maven

I use Apache Gora 0.2.1 and Apache Nutch 2.1.
Nutch depends on Gora.
Gora has the modules gora-core and gora-hbase.
gora-hbase depends on gora-core.
All of Gora's modules use avro-1.3.3.jar. I want to use avro-1.3.3.jar for gora-core and avro-1.5.3.jar for gora-hbase.
I successfully compiled Gora via Maven and Nutch via Ant and Ivy.
There then seem to be two avro versions on the Nutch classpath (avro-1.3.3.jar and avro-1.5.3.jar). If I exclude avro-1.5.3.jar via ivy.xml, gora-hbase doesn't get avro 1.5.3.
How can I solve this problem?

You should avoid having the same jar on the classpath in different versions.
To solve your problem you need to find versions of Apache Gora and Apache Nutch that use the same version of avro.
Try using Apache Nutch 1.6, since Apache Gora 0.2.1 is already the latest Gora version.
Then exclude the lower avro version and your problem is solved.
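For illustration, a hedged sketch of what that exclusion could look like in Nutch's ivy.xml - the coordinates below are assumptions (older avro releases were published under different coordinates than newer ones) and are not taken from the actual Nutch or Gora build files:

<dependencies>
    <!-- gora-hbase pulled in as before (illustrative coordinates) -->
    <dependency org="org.apache.gora" name="gora-hbase" rev="0.2.1" conf="*->default"/>
    <!-- declare explicitly the avro version you want to keep -->
    <dependency org="org.apache.avro" name="avro" rev="1.5.3" conf="*->default"/>
    <!-- globally exclude the lower avro that arrives transitively (org/module are assumptions) -->
    <exclude org="org.apache.hadoop" module="avro"/>
</dependencies>

After the exclude, only one avro jar should remain on the classpath, which is the point of the advice above.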

Another possibility is to downgrade Nutch to 2.0, because that works with avro 1.3.3.
If I am not wrong, gora-hbase does not work with avro 1.5.3, only with 1.3.3.
At the same time, given that gora-hbase only uses avro to serialize values... why do you need it to use avro 1.5.3?

Related

How do I programmatically install Maven libraries to a cluster using init scripts?

I have been trying for a while now and I'm sure the solution is simple enough; I'm just struggling to find it. I'm pretty new, so go easy on me!
It's a requirement to do this using a premade init script, which is then selected in the UI when configuring the cluster.
I am trying to install com.microsoft.azure:azure-eventhubs-spark_2.12:2.3.18 on a cluster in Azure Databricks. Following the documentation's example (which installs a PostgreSQL driver), they produce an init script using the following command:
dbutils.fs.put("/databricks/scripts/postgresql-install.sh","""
#!/bin/bash
wget --quiet -O /mnt/driver-daemon/jars/postgresql-42.2.2.jar https://repo1.maven.org/maven2/org/postgresql/postgresql/42.2.2/postgresql-42.2.2.jar""", True)
My question is, what is the /mnt/driver-daemon/jars/postgresql-42.2.2.jar section of this code? And what would I have to do to make this work for my situation?
Many thanks in advance.
/mnt/driver-daemon/jars/postgresql-42.2.2.jar here is the output path where the jar file will be put. But it makes no sense, as that jar won't be put on the CLASSPATH and won't be found by Spark. Jars need to be put into the /databricks/jars/ directory, where they will be picked up by Spark automatically.
But this method of downloading jars works only for jars without dependencies, and for libraries like the EventHubs connector that is not the case - they won't work if their dependencies aren't downloaded as well. Instead it's better to use the Cluster UI or the Libraries API (or the Jobs API for jobs) - with these methods, all dependencies will be fetched as well.
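For instance, a hedged sketch of installing the coordinates from the question through the Libraries API (workspace URL, cluster id and token are placeholders; this assumes a personal access token is available):

curl -X POST https://<databricks-instance>/api/2.0/libraries/install \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -d '{
        "cluster_id": "<cluster-id>",
        "libraries": [
          { "maven": { "coordinates": "com.microsoft.azure:azure-eventhubs-spark_2.12:2.3.18" } }
        ]
      }'

Because the library is given as Maven coordinates rather than a single jar URL, its transitive dependencies get resolved as well.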
P.S. But really, instead of using the EventHubs connector, it's better to use the Kafka protocol, which is supported by EventHubs as well. There are several reasons for that:
It's better from a performance standpoint.
It's better from a stability standpoint.
The Kafka connector is included in DBR, so you don't need to install anything extra.
You can read how to use Spark + EventHubs + the Kafka connector in the EventHubs documentation.
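As a rough sketch of that approach (namespace, topic and connection string are placeholders; the kafkashaded prefix in the JAAS config is what Databricks runtimes typically expect, so treat it as an assumption to verify against the documentation):

df = (spark.readStream
      .format("kafka")
      # Event Hubs exposes a Kafka endpoint on port 9093 of the namespace
      .option("kafka.bootstrap.servers", "<namespace>.servicebus.windows.net:9093")
      .option("subscribe", "<eventhub-name>")
      .option("kafka.security.protocol", "SASL_SSL")
      .option("kafka.sasl.mechanism", "PLAIN")
      # authenticate with the Event Hubs connection string via SASL PLAIN
      .option("kafka.sasl.jaas.config",
              'kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required '
              'username="$ConnectionString" password="<event-hubs-connection-string>";')
      .load())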

How to write from Apache Flink to Elasticsearch

I am trying to connect Flink to Elasticsearch, and when I run the Maven project I get this error:
As another way to do it, I am using this example: https://github.com/keiraqz/KafkaFlinkElastic
The example you linked depends on various Flink modules with different versions, which is highly discouraged. Try setting them all to one version and see if that fixes the issue.
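A hedged sketch of how a single version can be enforced in the pom (the artifact ids and the version number are illustrative, not taken from the linked project):

<properties>
    <!-- one property so every Flink module stays on the same version -->
    <flink.version>1.3.2</flink.version>
</properties>

<dependencies>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-java_2.11</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-connector-elasticsearch5_2.11</artifactId>
        <version>${flink.version}</version>
    </dependency>
</dependencies>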

Apache Spark and gRPC

I'm writing an application that uses Apache Spark. For communicating with a client, I would like to use gRPC.
In my Gradle build file, I use
dependencies {
compile('org.apache.spark:spark-core_2.11:1.5.2')
compile 'org.apache.spark:spark-sql_2.11:1.5.2'
compile 'io.grpc:grpc-all:0.13.1'
...
}
When leaving out gRPC, everything works fine. However, when gRPC is used, I can create the build but not execute it, since the packages use different versions of netty. Spark seems to use netty-all, which contains the same methods (but with potentially different signatures) as the netty that gRPC uses.
I tried shading (using com.github.johnrengelman.shadow), but somehow it still does not work. How can I approach this problem?
The general solution to this sort of thing is shading with relocation. See the answer to a similar problem with protobuf dependencies: https://groups.google.com/forum/#!topic/grpc-io/ABwMhW9bU34
I think the problem is that Spark uses netty 4.0.x while gRPC uses netty 4.1.0.
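A minimal sketch of the shading-with-relocation suggested above, assuming the Shadow plugin is already applied in build.gradle (the target package name is arbitrary, and whether relocating netty alone is enough depends on what ends up in the fat jar):

shadowJar {
    // rewrite gRPC's references to netty so they cannot clash with Spark's netty-all
    relocate 'io.netty', 'myapp.shaded.io.netty'
}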

hbase-client or hbase-common for HBase 0.94.6-cdh4.5.0

We are using Cloudera CDH 4.5.0 for HBase, and Storm 0.9.3 uses hbase-client. Unfortunately, it seems Cloudera did not provide an hbase-client Maven artifact, and I cannot figure out how to satisfy the dependency for org.apache.hadoop.hbase.security.UserProvider. According to the Maven search site, it can be provided by either hbase-client or hbase-common. Can someone tell me if there is a comparable version of either of these that I can use with CDH 4.5.0?
Are you using cdh4.x or cdh5.x? The hbase-client/hbase-common jars are only in cdh5 (HBase 0.96+). The cdh4 release has only one big hbase jar containing everything. Also, UserProvider doesn't seem to be present in 4.5.0, but it is present from 4.6.x.
hbase-client depends on hbase-common, so in general you need both if you want to use the client.
(If you are looking only for the UserProvider class, that is in hbase-common.)
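If you stay on cdh4, a hedged sketch of depending on the single monolithic jar - the Cloudera repository URL and the exact coordinates are my assumption of the usual CDH layout, not verified against 4.5.0:

<repositories>
    <repository>
        <id>cloudera</id>
        <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
</repositories>

<dependencies>
    <!-- cdh4 ships one hbase artifact instead of hbase-client/hbase-common -->
    <dependency>
        <groupId>org.apache.hbase</groupId>
        <artifactId>hbase</artifactId>
        <version>0.94.6-cdh4.5.0</version>
    </dependency>
</dependencies>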

Bare minimum of dependencies to work with HDFS

I need to put some files into HDFS from my client application. I am not planning to submit a job to Hadoop; I just need to drop something into HDFS.
A Maven dependency on hadoop-core brings in a lot of stuff like jersey-core etc., which I don't need at all.
Is there any simple client library to work with HDFS without getting a full stack of hadoop dependencies? What is the minimal set of maven dependencies I can use?
Is webhdfs the only option?
They introduced hadoop-client, which is much better than hadoop-core as a client library.
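For example, a minimal sketch of the pom entry (the version is illustrative; use the one matching your cluster):

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.7.3</version>
</dependency>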
