How to write from Apache Flink to Elasticsearch

I am trying to connect Flink to Elasticsearch, and when I run the Maven project I get this error:
Or, as another way to do it, I am using this example: https://github.com/keiraqz/KafkaFlinkElastic

The example you linked depends on various Flink modules with different versions, which is highly discouraged. Try setting them all to the same version and see if that fixes the issue.
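For example, a pom.xml can pin every Flink artifact (streaming API, Kafka connector, Elasticsearch connector, and so on) to one version through a single property; the artifact IDs and version number below are illustrative, not taken from the linked example:

```
<properties>
  <!-- Single source of truth for every Flink module version -->
  <flink.version>1.3.2</flink.version>
</properties>

<dependencies>
  <dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-java_2.11</artifactId>
    <version>${flink.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-elasticsearch2_2.11</artifactId>
    <version>${flink.version}</version>
  </dependency>
</dependencies>
```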

Related

How do I programmatically install Maven libraries to a cluster using init scripts?

I have been trying for a while now and I'm sure the solution is simple enough, I'm just struggling to find it. I'm pretty new, so go easy on me!
It's a requirement to do this using a pre-made init script, which is then selected in the UI when configuring the cluster.
I am trying to install com.microsoft.azure:azure-eventhubs-spark_2.12:2.3.18 on a cluster in Azure Databricks. Following the documentation's example (which installs a PostgreSQL driver), an init script is produced using the following command:
```
dbutils.fs.put("/databricks/scripts/postgresql-install.sh","""
#!/bin/bash
wget --quiet -O /mnt/driver-daemon/jars/postgresql-42.2.2.jar https://repo1.maven.org/maven2/org/postgresql/postgresql/42.2.2/postgresql-42.2.2.jar""", True)
```
My question is, what is the /mnt/driver-daemon/jars/postgresql-42.2.2.jar section of this code? And what would I have to do to make this work for my situation?
Many thanks in advance.
/mnt/driver-daemon/jars/postgresql-42.2.2.jar here is the output path where the jar file will be put. But it makes little sense, as a jar in that location won't be on the classpath and won't be found by Spark. Jars need to be put into the /databricks/jars/ directory, where they will be picked up by Spark automatically.
This download-the-jar method also works only for jars without dependencies, and for libraries like the EventHubs connector that is not the case: they won't work if their dependencies aren't downloaded as well. Instead it's better to use the Cluster UI or the Libraries API (or the Jobs API for jobs); with these methods, all dependencies are fetched as well (see the sketch below).
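For instance, a minimal sketch of installing the connector through the Libraries API, assuming a workspace URL, a personal access token, and the target cluster ID (all placeholders below are hypothetical):

```
import requests

# Placeholders: replace with your workspace URL, token, and cluster ID
host = "https://<databricks-instance>"
token = "<personal-access-token>"

resp = requests.post(
    f"{host}/api/2.0/libraries/install",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "cluster_id": "<cluster-id>",
        # Maven coordinates: transitive dependencies are resolved automatically
        "libraries": [
            {"maven": {"coordinates": "com.microsoft.azure:azure-eventhubs-spark_2.12:2.3.18"}}
        ],
    },
)
resp.raise_for_status()
```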
P.S. But really, instead of using the EventHubs connector, it's better to use the Kafka protocol, which EventHubs also supports. There are several reasons for that:
It's better from a performance standpoint
It's better from a stability standpoint
The Kafka connector is included in DBR, so you don't need to install anything extra
You can read how to use Spark + EventHubs with the Kafka connector in the EventHubs documentation; a rough sketch is shown below.
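A minimal sketch of reading from Event Hubs over its Kafka endpoint with Structured Streaming; the namespace, hub name, and connection string are placeholders, and the shaded JAAS module class assumes Databricks Runtime:

```
df = (spark.readStream
    .format("kafka")
    # Event Hubs exposes a Kafka endpoint on port 9093 of the namespace
    .option("kafka.bootstrap.servers", "<NAMESPACE>.servicebus.windows.net:9093")
    .option("subscribe", "<EVENT_HUB_NAME>")
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "PLAIN")
    # On Databricks the Kafka client classes are shaded, hence the kafkashaded prefix
    .option("kafka.sasl.jaas.config",
            'kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required '
            'username="$ConnectionString" password="<EVENT_HUBS_CONNECTION_STRING>";')
    .load())
```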

Couchbase plugin for ElasticSearch deprecated?

I was reading https://www.elastic.co/blog/deprecating-rivers, which states that ES rivers (plugins) are being deprecated, i.e. any plugin directly integrated with the Elasticsearch server will no longer work from ES 3.x onwards.
The Couchbase plugin is one of that kind.
I searched all the Couchbase plugin documentation at http://developer.couchbase.com/documentation/server/4.5/connectors/elasticsearch-2.1/elastic-intro.html but could not find whether it uses the deprecated approach or not.
Does anyone know? Should we keep using the Couchbase plugin, or should we start planning to write data directly to ES from our application?
We have Couchbase data replicated to ES using the Couchbase plugin and XDCR.
I'm the maintainer of the Couchbase ES transport plugin. As Roi mentions in his answer, the plugin doesn't use rivers, so it won't be deprecated. It currently supports any version of ES from 1.3 to 2.x, and I'm working on adding support for 5.x. It's taking a bit longer, because ES 5.x broke some configuration sharing features in unexpected ways.
I'd suggest always looking at our github repo for the latest plugin releases:
https://github.com/couchbaselabs/elasticsearch-transport-couchbase
The Couchbase plugin does not use rivers; there is a separate river-based plugin, which is no longer valid.
Take a look here: https://github.com/couchbaselabs/elasticsearch-transport-couchbase

How can I use an Elasticsearch plugin in a JVM local node?

I'm in the process of adding support for unicode normalization in ES with the help of the ICU analysis plugin. Installing this in a dedicated cluster is relatively easy, but I also need this plugin to be available during testing, where we use a JVM local node. Since it's a JVM local node I can't simply call the commands as explained in the plugin documentation. How can I get my plugin to work for this local node?
After digging through the Elasticsearch source code I figured out the answer, and it is stupidly simple: just make sure the plugins are on your classpath and ES will pick them up automatically. In my case, adding the plugin to my pom.xml was enough (see the sketch below).
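For reference, a minimal sketch of what that pom.xml entry might look like; the coordinates and version below are illustrative and must match the Elasticsearch version used by the local node:

```
<dependency>
  <groupId>org.elasticsearch</groupId>
  <artifactId>elasticsearch-analysis-icu</artifactId>
  <!-- Use the plugin version documented as compatible with your ES version -->
  <version>2.7.0</version>
  <scope>test</scope>
</dependency>
```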

Nutch 2.2.1 and Elasticsearch 0.90.11 NoSuchFieldError: STOP_WORDS_SET

I am trying to integrate Apache Nutch 2.2.1 with Elasticsearch 0.90.11.
I have followed all tutorials available (although there are not so many) and even changed bin/crawl.sh to use elasticsearch to index instead of solr.
It seems that everything works when I run the script, until Elasticsearch tries to index the crawled data.
I checked hadoop.log inside the logs folder under Nutch and found the following errors:
Error injecting constructor, java.lang.NoSuchFieldError: STOP_WORDS_SET
Error injecting constructor, java.lang.NoClassDefFoundError: Could not initialize class org.apache.lucene.analysis.en.EnglishAnalyzer$DefaultSetHolder
If you managed to get it working I would very much appreciate the help.
Thanks,
Andrei.
Having never used Apache Nutch, but after briefly reading about it, I suspect that your inclusion of Elasticsearch is causing a classpath collision with a different version of Lucene that is also on the classpath. Based on its Maven POM, which does not specify Lucene directly, I would suggest only including the Lucene bundled with Elasticsearch, which should be Apache Lucene 4.6.1 for your version.
Duplicated code (differing versions of the same jar) tends to be the cause of NoClassDefFoundError when you are certain that you have the necessary code. Given that you switched from Solr to Elasticsearch, it would make sense that you left the Solr jars on your classpath, which would cause the collision at hand. The current release of Solr is 4.7.0, which matches its Lucene version, and that would collide with 4.6.1 (see the sketch below).
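If Nutch's Ivy build is still pulling in Solr, a hypothetical ivy.xml change along these lines would leave only the Lucene shipped with Elasticsearch on the classpath (module names and revisions are illustrative, not copied from Nutch's actual ivy.xml):

```
<!-- Dropped now that indexing goes through Elasticsearch, so the Solr/Lucene
     jars no longer reach the classpath:
<dependency org="org.apache.solr" name="solr-solrj" rev="4.7.0" conf="*->default"/>
-->
<!-- Elasticsearch brings its own bundled Lucene (4.6.1 for 0.90.11) -->
<dependency org="org.elasticsearch" name="elasticsearch" rev="0.90.11" conf="*->default"/>
```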

How do you handle multiple versions of the same jar?

I use Apache Gora 0.2.1 and Apache Nutch 2.1.
Nutch depends on Gora.
Gora has the modules gora-core and gora-hbase.
gora-hbase depends on gora-core.
All the Gora modules use avro-1.3.3.jar. I want to use avro-1.3.3.jar for gora-core and avro-1.5.3.jar for gora-hbase.
I successfully compiled Gora via Maven and I successfully compiled Nutch via Ant and Ivy.
There then seem to be two versions on the Nutch classpath (avro-1.3.3.jar and avro-1.5.3.jar). If I exclude avro-1.5.3.jar via ivy.xml, gora-hbase doesn't use avro 1.5.3.
How can I solve this problem?
You should avoid a situation where the same jar is on the classpath in different versions.
To solve your problem you need to find versions of Apache Gora and Apache Nutch that use the same version of Avro.
Try using Apache Nutch 1.6, since Apache Gora 0.2.1 is the latest version.
Then exclude the lower version and your problem is solved (a sketch of the Ivy exclusion follows below).
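For illustration only, an Ivy exclusion in Nutch's ivy.xml might look like the following; the revision and conf mapping are placeholders, and which Avro version survives depends on what the rest of the build declares:

```
<!-- Keep gora-hbase but stop it from dragging in its own Avro, so that a
     single Avro version (declared elsewhere in the build) wins on the classpath -->
<dependency org="org.apache.gora" name="gora-hbase" rev="0.2.1" conf="*->default">
  <exclude org="org.apache.avro" module="avro"/>
</dependency>
```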
Another possibility is to downgrade Nutch to 2.0, because it works with Avro 1.3.3.
If I am not wrong, gora-hbase does not work with Avro 1.5.3, but with 1.3.3.
At the same time, given that gora-hbase only uses Avro to serialize values... why do you need it to use Avro 1.5.3?
