Elasticsearch plugin for PySpark 3.1.1

I used Elasticsearch Spark 7.12.0 with PySpark 2.4.5 successfully; both reads and writes worked perfectly. Now that I'm testing the upgrade to Spark 3.1.1, the integration no longer works. There were no changes to my PySpark code between 2.4.5 and 3.1.1.
Is there a compatible plugin? Has anyone got this to work with PySpark 3.1.1?
The error:
java.lang.NoClassDefFoundError: scala/Product$class

Try to use package org.elasticsearch:elasticsearch-spark-30_2.12:7.13.1
The error you're seeing (java.lang.NoClassDefFoundError: scala/Product$class) usually indicates that you are trying to use a package built for an incompatible version of Scala.
If you are using the most recent zip package from Elasticsearch, as of the date of your question it is still built for Scala 2.11, as per the conversation here:
https://github.com/elastic/elasticsearch-hadoop/pull/1589
You can confirm the version of Scala used to build your PySpark by running
spark-submit --version
from the command line. After the Spark logo it will say something like
Using Scala version 2.12.10
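Assuming it reports Scala 2.12, a quick way to try that package (a sketch; your_script.py is a placeholder for your own application) is to let spark-submit fetch it at launch time:
spark-submit --packages org.elasticsearch:elasticsearch-spark-30_2.12:7.13.1 your_script.py
The _2.12 suffix in the artifact name must match the Scala version reported above.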
Take a look at this page:
https://www.elastic.co/guide/en/elasticsearch/hadoop/current/install.html
It contains the compatibility matrix.
Elastic gives you some info on "installation" for Hadoop here: https://www.elastic.co/guide/en/elasticsearch/hadoop/current/install.html
For Spark, it provides this:
<dependency>
    <groupId>org.elasticsearch</groupId>
    <artifactId>elasticsearch-spark-30_2.12</artifactId>
    <version>7.14.0</version>
</dependency>
Now, if you're using PySpark you may be unfamiliar with Maven, so I can appreciate that it's not that helpful to be given the Maven dependency.
Here's a minimal way to get Maven to fetch the jar for you, without having to get into the weeds of an unfamiliar tool.
Install Maven (apt install maven)
Create a new directory
In that directory, create a file called pom.xml
<project>
    <modelVersion>4.0.0</modelVersion>
    <groupId>spark-es</groupId>
    <artifactId>spark-esj</artifactId>
    <version>1</version>
    <dependencies>
        <dependency>
            <groupId>org.elasticsearch</groupId>
            <artifactId>elasticsearch-spark-30_2.12</artifactId>
            <version>7.14.0</version>
        </dependency>
    </dependencies>
</project>
Save that file and create an additional directory called "targetdir" (it could be called anything)
Then
mvn dependency:copy-dependencies -DoutputDirectory=targetdir
You'll find your jar in targetdir.
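You can then hand that jar straight to PySpark (assuming the 7.14.0 version from the pom above; your_script.py is again a placeholder for your own application):
spark-submit --jars targetdir/elasticsearch-spark-30_2.12-7.14.0.jar your_script.py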

Related

Where should I get the maven dependencies when migrating a mapreduce project from hdp to bigtop?

I am migrating a MapReduce Java project (built using Maven) from Hortonworks to Bigtop.
I am trying to figure out the best way to ensure that the dependency versions in my Java project match the jar files deployed on the cluster by Bigtop.
We are currently targeting Bigtop 3.2.0.
I am inspecting their BOM file and using those versions in my pom file.
For example, when we were using HDP I had something like
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.3.2.3.1.4.0-315</version>
</dependency>
According to the Bigtop BOM file, the Spark version is 3.2.3 and the Scala library version is 2.12.13. Does that mean that the new Maven dependency in our project's POM file should be
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>3.2.3</version>
</dependency>
Is there a place where the exact maven dependencies are listed? Is this the correct way to migrate our project's POM file?
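For illustration, one way to keep those versions aligned (a sketch using the Bigtop 3.2.0 values quoted above, not an official Bigtop recipe) is to lift them into POM properties so that every Spark artifact shares the same Scala suffix and version:

<properties>
    <!-- versions copied from the Bigtop 3.2.0 BOM -->
    <spark.version>3.2.3</spark.version>
    <scala.binary.version>2.12</scala.binary.version>
</properties>
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
        <!-- provided: the cluster already ships these jars -->
        <scope>provided</scope>
    </dependency>
</dependencies>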

Can't resolve maven dependency with beam-runners-google-cloud-dataflow-java and bigtable-client-core

I am trying to run Java code from a Maven project that uses both beam-runners-google-cloud-dataflow-java and bigtable-client-core, and I cannot get it to properly reconcile dependencies amongst these two. When I run and attempt to create a BigtableDataClient, I get the following error:
java.lang.NoSuchFieldError: TE_HEADER
at io.grpc.netty.shaded.io.grpc.netty.Utils.<clinit> (Utils.java:74)
at io.grpc.netty.shaded.io.grpc.netty.NettyChannelBuilder.<clinit> (NettyChannelBuilder.java:72)
at io.grpc.netty.shaded.io.grpc.netty.NettyChannelProvider.builderForAddress (NettyChannelProvider.java:37)
at io.grpc.netty.shaded.io.grpc.netty.NettyChannelProvider.builderForAddress (NettyChannelProvider.java:23)
at io.grpc.ManagedChannelBuilder.forAddress (ManagedChannelBuilder.java:39)
at com.google.api.gax.grpc.InstantiatingGrpcChannelProvider.createSingleChannel (InstantiatingGrpcChannelProvider.java:242)
at com.google.api.gax.grpc.InstantiatingGrpcChannelProvider.createChannel (InstantiatingGrpcChannelProvider.java:198)
at com.google.api.gax.grpc.InstantiatingGrpcChannelProvider.getTransportChannel (InstantiatingGrpcChannelProvider.java:185)
at com.google.api.gax.rpc.ClientContext.create (ClientContext.java:160)
at com.google.cloud.bigtable.data.v2.stub.EnhancedBigtableStub.create (EnhancedBigtableStub.java:151)
at com.google.cloud.bigtable.data.v2.BigtableDataClient.create (BigtableDataClient.java:138)
at com.google.cloud.bigtable.data.v2.BigtableDataClient.create (BigtableDataClient.java:130)
...
I can only conclude this is due to a version conflict between the relevant libraries (either grpc-netty or grpc-netty-shaded); I'm using 1.17 for grpc-netty and 1.23 for grpc-netty-shaded. I've tried using dependencyManagement to force version 1.23.0 for both grpc-netty and grpc-netty-shaded, and then 1.17 for both, but this doesn't help. I've also tried earlier versions of both the Beam runners and bigtable-client-core, and that doesn't help either.
The relevant Maven dependencies are:
<dependency>
    <groupId>org.apache.beam</groupId>
    <artifactId>beam-runners-google-cloud-dataflow-java</artifactId>
    <version>2.15.0</version>
</dependency>
<dependency>
    <groupId>com.google.cloud.bigtable</groupId>
    <artifactId>bigtable-client-core</artifactId>
    <version>1.12.1</version>
</dependency>
I looked at the code for Utils.java (https://github.com/grpc/grpc-java/blame/master/netty/src/main/java/io/grpc/netty/Utils.java), and I don't see any evidence that I'd be using an earlier version that might not have this constant (it's been there since version 1.7).
I'm completely baffled what the issue is here. How do I identify the dependency conflict? Is there another way I can find what version of the class Maven is actually looking at here?
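A standard Maven diagnostic for exactly this question (a general suggestion, not something from the original thread) is the dependency:tree goal, filtered to the group in question:
mvn dependency:tree -Dincludes=io.grpc
This prints every dependency path that pulls in an io.grpc artifact, so you can see which versions survive Maven's nearest-wins conflict resolution and which transitive routes brought them in.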

Spark-submit is not using the protobuf version of my project

In my work project, I use spark-submit to launch my application into a YARN cluster. I am quite new to Maven projects and pom.xml use, but the problem I seem to be having is that Hadoop is using an older version of Google protobuf (2.5.0) than the internal dependencies I'm importing at work (2.6.1).
The error is here:
java.lang.NoSuchMethodError:
com/google/protobuf/LazyStringList.getUnmodifiableView()Lcom/google/protobuf/LazyStringList;
(loaded from file:/usr/hdp/2.6.4.0-91/spark2/jars/protobuf-java-2.5.0.jar
by sun.misc.Launcher$AppClassLoader#8b6f2bf7)
called from class protobuf.com.mycompany.group.otherproject.api.JobProto$Query
Since I'm not quite sure how to approach dependency issues like this, and I can't change the code of the internal dependency that uses 2.6.1, I added the required protobuf version as a dependency to my project, as well:
<dependency>
    <groupId>com.google.protobuf</groupId>
    <artifactId>protobuf-java</artifactId>
    <version>2.6.1</version>
</dependency>
Unfortunately, this hasn't resolved the issue. When the internal dependency (which does import 2.6.1 on its own) tries to use its proto, the conflict occurs.
Any suggestions on how I could force the usage of the newer, correct version would be greatly appreciated.
Ultimately I found the Maven Shade Plugin to be the answer. I shaded my company's version of protobufs, deployed our service as an uber jar, and the conflict was resolved.
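For reference, the relocation that shading performs looks roughly like this in the POM (a sketch of the standard maven-shade-plugin setup, not the poster's exact configuration; the shadedPattern prefix is arbitrary):

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <version>3.2.4</version>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>shade</goal>
            </goals>
            <configuration>
                <relocations>
                    <!-- Rewrite com.google.protobuf.* into a private namespace so the
                         uber jar's protobuf 2.6.1 cannot clash with Hadoop's 2.5.0 -->
                    <relocation>
                        <pattern>com.google.protobuf</pattern>
                        <shadedPattern>shaded.com.google.protobuf</shadedPattern>
                    </relocation>
                </relocations>
            </configuration>
        </execution>
    </executions>
</plugin>

Building with mvn package then produces the uber jar with the relocated classes, which is what gets handed to spark-submit.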

Apache Storm - package backtype.storm.tuple does not exist

I'm trying the Storm analysis presented here
CallLogCounterBolt.java:4: error: package backtype.storm.tuple does not exist
import backtype.storm.tuple.Fields;
I ran into similar problems with another old Apache Storm tutorial. It turned out to be because the tutorial used deprecated classes from an older version (0.9.6) while I was using a newer one (1.1.0). My suggestion is therefore to either look through the newer libraries for the corresponding resources and change your import statements accordingly, or to check that the dependencies you are using are not masked by similarly named libraries.
The issue is with your Java classpath... which entirely depends on how you have set up your project. Rather than try to fix what you have, I'll give you a suggestion.
If you're using Java, then the "normal" way to create Storm topologies is using Maven, which should work with whatever IDE you're using (Eclipse, IntelliJ, etc.).
Once you have a skeleton Maven project set up, all you need to do is add the Storm dependencies. For example:
<dependency>
    <groupId>org.apache.storm</groupId>
    <artifactId>storm-core</artifactId>
    <version>${storm.version}</version>
    <scope>provided</scope>
</dependency>
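Here ${storm.version} is a Maven property you define yourself, e.g. (1.1.0 being the release mentioned in the answer above):

<properties>
    <storm.version>1.1.0</storm.version>
</properties>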
Here is an example POM file.
Since backtype is deprecated, you should use the newer libraries. Go through the Apache Storm javadocs.

Storm-crawler and Elasticsearch version

I'm working on getting the latest version of ES (5.x) working with StormCrawler.
I did what was mentioned here: I cloned the repo, ran mvn clean install to build it, then ran all the mvn commands mentioned there, and it all worked.
The thing I'm confused about is when it comes to the pom.xml file, for the version number:
<dependency>
    <groupId>com.digitalpebble.stormcrawler</groupId>
    <artifactId>storm-crawler-elasticsearch</artifactId>
    <version>1.4</version>
</dependency>
Do I enter 1.5 there or keep it as 1.4? I'm still trying to get better with Maven and the Java build process and all.
If you are building the project locally after cloning the repo, try
mvn archetype:generate -DarchetypeGroupId=com.digitalpebble.stormcrawler -DarchetypeArtifactId=storm-crawler-archetype -DarchetypeVersion=1.5-SNAPSHOT
and then you can edit the pom.xml and add the dependency for the Elasticsearch module as:
<dependency>
    <groupId>com.digitalpebble.stormcrawler</groupId>
    <artifactId>storm-crawler-elasticsearch</artifactId>
    <version>1.5-SNAPSHOT</version>
</dependency>
StormCrawler 1.5 should be released soon, and as suggested by @nullpointer you need to change the version to 1.5-SNAPSHOT; the tutorial was based on SC 1.4, which uses ES 2.x.
See the blog for potential issues when upgrading to ES 5.
You have to keep it as 1.4, because that is the latest released version of the storm-crawler-elasticsearch plugin.
