Spark Streaming 1.6.1 is not working with Kinesis ASL 1.6.1 and ASL 2.0.0-preview

I am trying to run a Spark Streaming job on EMR with Kinesis, using Spark 1.6.1 with Kinesis ASL 1.6.1. The job is a plain sample word-count example.
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kinesis-asl_2.10</artifactId>
<version>1.6.1</version>
</dependency>
<dependency>
<groupId>com.amazonaws</groupId>
<artifactId>amazon-kinesis-client</artifactId>
<version>1.6.3</version>
</dependency>
<dependency>
<groupId>com.amazonaws</groupId>
<artifactId>amazon-kinesis-producer</artifactId>
<version>0.10.2</version>
</dependency>
This throws the following exception:
java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.NoClassDefFoundError: com/google/protobuf/ProtocolStringList
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShardConsumer.checkAndSubmitNextTask(ShardConsumer.java:157)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShardConsumer.consumeShard(ShardConsumer.java:126)
Upgrading to 2.0.0-preview:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kinesis-asl_2.10</artifactId>
<version>2.0.0-preview</version>
</dependency>
gives the following exception:
java.lang.NoClassDefFoundError: org/apache/spark/internal/Logging
at org.apache.spark.streaming.kinesis.KinesisUtils$$anonfun$createStream$1.apply(KinesisUtils.scala:74)

This is caused by a protobuf-java dependency conflict.
Run mvn dependency:tree to find the version of protobuf-java that the KCL and KPL depend on. Then look in the Spark lib directory, where you will find a different version.
Use the maven-shade-plugin to relocate the conflicting classes:
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>2.3</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<outputFile>
${project.build.directory}/${project.artifactId}-${project.version}-selfcontained.jar
</outputFile>
<relocations>
<relocation>
<pattern>com.google.protobuf</pattern>
<shadedPattern>shade.com.google.protobuf</shadedPattern>
</relocation>
<relocation>
<pattern>com.amazonaws</pattern>
<shadedPattern>shade.com.amazonaws</shadedPattern>
</relocation>
</relocations>
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
<transformers>
<transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer" />
</transformers>
</configuration>
</execution>
</executions>
</plugin>
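The second error in the question (NoClassDefFoundError: org/apache/spark/internal/Logging) looks like a version mismatch rather than a protobuf problem: org.apache.spark.internal.Logging exists only in Spark 2.x, so a 2.0.0-preview Kinesis ASL artifact cannot load on a Spark 1.6.1 cluster. A minimal sketch of keeping the two versions in step follows. The spark.version property name and the provided scope are illustrative assumptions, not part of the original post:

```xml
<properties>
  <!-- Assumed property: set this to the Spark version installed on the cluster -->
  <spark.version>1.6.1</spark.version>
</properties>

<dependencies>
  <!-- Supplied by the cluster at runtime, so it is not bundled into the job jar -->
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.10</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
  </dependency>
  <!-- Kinesis ASL is not shipped with the Spark distribution: bundle it, at the same version -->
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kinesis-asl_2.10</artifactId>
    <version>${spark.version}</version>
  </dependency>
</dependencies>
```

With a single property driving both artifacts, upgrading the connector without upgrading Spark itself becomes harder to do by accident.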

Related

spark "delta" source not found

While using the Kafka and delta-core dependencies in a Spark project, I'm receiving the following warning:
[WARNING] delta-core_2.12-0.7.0.jar, spark-sql-kafka-0-10_2.12-3.1.1.jar define 1 overlapping resources:
[WARNING] - META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
As a result, the delta source cannot be found. How can I include both Delta and Kafka? Thanks.
Here is my maven config:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql-kafka-0-10_${scala.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>io.delta</groupId>
<artifactId>delta-core_${scala.version}</artifactId>
<version>0.7.0</version>
</dependency>
...
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>3.2.2</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<transformers>
<transformer
implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
<resource>
META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
</resource>
</transformer>
</transformers>
<finalName>${project.artifactId}-${project.version}</finalName>
<artifactSet>
<includes>
<include>org.scalactic:*</include>
<include>io.delta:*</include>
<include>org.apache.spark:*</include>
</includes>
</artifactSet>
<filters>
<filter>
<artifact>*:*</artifact>
</filter>
</filters>
</configuration>
</execution>
</executions>
</plugin>
I solved it. My problem was that I was using both maven-shade and maven-assembly plugins. Removing maven-assembly plugin worked!
To extend B. Bal's answer: if you are using sbt instead of Maven, the problem can be fixed by changing the assembly merge strategy:
assembly / assemblyMergeStrategy := {
case PathList("META-INF", "services", xs @ _*) => MergeStrategy.concat
case PathList("META-INF", xs @ _*) => MergeStrategy.discard
case x => MergeStrategy.first
}
With this merge strategy, the files under META-INF/services are concatenated rather than overwritten, so the Delta source and any other registered source remain available from your fat jar.
More details may be found in this thread
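For Maven builds, the equivalent of the concat strategy is a transformer that merges service files. The following is a minimal sketch, assuming maven-shade-plugin is the only plugin assembling the fat jar. It uses ServicesResourceTransformer, which merges every file under META-INF/services, in place of the AppendingTransformer from the question, which handles one named resource. The plugin version is the one from the question:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.2.2</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <transformers>
          <!-- Concatenates same-named META-INF/services files from all bundled jars,
               so the Delta and Kafka DataSourceRegister entries are both kept -->
          <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer" />
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>
```

To check the result, open META-INF/services/org.apache.spark.sql.sources.DataSourceRegister in the built jar and confirm it lists the providers from both dependencies.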

Scala version error with Spark 2 & ElasticSearch 5.4.2

I'm using Spark 2.2 (built with Scala 2.11.8) to index my data into Elasticsearch 5.4.2.
ElasticSearch :
My Spark project uses this pom.xml:
<dependency>
<groupId>org.elasticsearch</groupId>
<artifactId>elasticsearch-hadoop</artifactId>
<version>5.4.2</version>
<exclusions>
<exclusion>
<groupId>org.slf4j</groupId>
<artifactId>log4j-over-slf4j</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.elasticsearch</groupId>
<artifactId>elasticsearch-spark-20_2.11</artifactId>
<version>5.4.2</version>
</dependency>
When I run my job, I get this exception:
Caused by: java.lang.NoSuchMethodError: scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)Lscala/reflect/api/JavaMirrors$JavaMirror;
at org.elasticsearch.spark.serialization.ReflectionUtils$.org$elasticsearch$spark$serialization$ReflectionUtils$$checkCaseClass(ReflectionUtils.scala:42)
at org.elasticsearch.spark.serialization.ReflectionUtils$$anonfun$checkCaseClassCache$1.apply(ReflectionUtils.scala:84)
at org.elasticsearch.spark.serialization.ReflectionUtils$$anonfun$checkCaseClassCache$1.apply(ReflectionUtils.scala:83)
I know my problem is the Scala version (build vs. runtime)...
Thanks for your help.
EDIT: the build section of the POM:
<build>
<plugins>
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.3.1</version>
<executions>
<execution>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
<configuration>
<args>
<argLine>-J-Xms128m</argLine>
<argLine>-J-Xmx512m</argLine>
<argLine>-J-XX:MaxPermSize=300m</argLine>
<argLine>-Djava.net.preferIPv4Stack=true</argLine>
<arg>-dependencyfile</arg>
<arg>${project.build.directory}/.scala_dependencies</arg>
</args>
</configuration>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>2.4.3</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<shadedArtifactAttached>true</shadedArtifactAttached>
<shadedClassifierName>UBER</shadedClassifierName>
<artifactSet>
<includes>
<include>com.databricks:spark-csv_${scala.compact.version}</include>
<include>org.apache.commons:commons-csv</include>
<include>org.elasticsearch:elasticsearch-hadoop</include>
</includes>
</artifactSet>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
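The method descriptor in the stack trace returns scala.reflect.api.JavaMirrors$JavaMirror, which is the Scala 2.10 signature of runtimeMirror. That suggests the ReflectionUtils class being loaded was compiled against Scala 2.10 while the job runs on 2.11. One plausible source is the elasticsearch-hadoop artifact, which is declared alongside the Scala 2.11 connector and is the one bundled by the shade plugin. A sketch of one thing to try, assuming nothing else in the project needs elasticsearch-hadoop, is to depend only on elasticsearch-spark-20_2.11 and bundle that instead:

```xml
<!-- Only the Spark 2.x / Scala 2.11 connector; elasticsearch-hadoop is removed
     (assumption: no other code in the project relies on it) -->
<dependency>
  <groupId>org.elasticsearch</groupId>
  <artifactId>elasticsearch-spark-20_2.11</artifactId>
  <version>5.4.2</version>
</dependency>

<!-- In the shade plugin's configuration, bundle the connector rather than the uber artifact -->
<artifactSet>
  <includes>
    <include>com.databricks:spark-csv_${scala.compact.version}</include>
    <include>org.apache.commons:commons-csv</include>
    <include>org.elasticsearch:elasticsearch-spark-20_2.11</include>
  </includes>
</artifactSet>
```

Running mvn dependency:tree afterwards should show whether any remaining dependency still pulls in a Scala 2.10 artifact.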

How to reasonably rename a package with a Maven plugin

My project has the following modules:
client
rest
Both modules depend on com.google.protobuf, and rest depends on client (the rest module gets the protobuf jar through client).
To avoid a conflict, I renamed com.google.protobuf to my.com.google.protobuf in the client module with the shade plugin.
The problem is that the rest module no longer compiles and reports the following error:
error: incompatible types: my.com.google.protobuf.Descriptors.FileDescriptor cannot be converted to com.google.protobuf.Descriptors.FileDescriptor
Client pom.xml
<parent>
<groupId>my-project</groupId>
<artifactId>my-project</artifactId>
<version>1.0-SNAPSHOT</version>
</parent>
<artifactId>client</artifactId>
<name>Client</name>
<dependencies>
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<version>${guava.version}</version>
</dependency>
<dependency>
<groupId>com.google.protobuf</groupId>
<artifactId>protobuf-java</artifactId>
<version>${protobuf.version}</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>2.3</version>
<configuration>
<createDependencyReducedPom>true</createDependencyReducedPom>
<filters>
<filter>
<artifact>client</artifact>
<includes>
<include>**/*.class</include>
</includes>
</filter>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
</configuration>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<transformers>
<transformer
implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
<mainClass>io.my.client.Client</mainClass>
</transformer>
</transformers>
<relocations>
<relocation>
<pattern>com.google.protobuf</pattern>
<shadedPattern>my.com.google.protobuf</shadedPattern>
</relocation>
</relocations>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
rest pom.xml
<parent>
<groupId>my-project</groupId>
<artifactId>my-project</artifactId>
<version>1.0-SNAPSHOT</version>
</parent>
<artifactId>rest</artifactId>
<dependencies>
<dependency>
<groupId>my-project</groupId>
<artifactId>client</artifactId>
<version>1.0-SNAPSHOT</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>2.3</version>
<configuration>
<createDependencyReducedPom>true</createDependencyReducedPom>
<filters>
<filter>
<artifact>rest</artifact>
<includes>
<include>**/*.class</include>
</includes>
</filter>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
</configuration>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<transformers>
<transformer
implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
<mainClass>io.my.rest.WebServer</mainClass>
</transformer>
</transformers>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
A better workaround seems to be: Right-click on project "client" -> pom.xml in the project view in IntelliJ, choose "Maven" -> "Ignore Projects". Then do a "Maven" -> "Reimport" on the top-level pom.xml.

Using maven-shade-plugin, but dependency classes are not in the final jar

In my project's pom.xml I have the following dependency:
<dependency>
<groupId>com.my.library</groupId>
<artifactId>MyLib</artifactId>
<version>1.0</version>
<type>jar</type>
</dependency>
<dependency>
...
</dependency>
I would like my project's final jar to include the classes of the com.my.library:MyLib dependency above, so I used maven-shade-plugin in the following way:
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>2.3</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<filters>
<filter>
<artifact>com.my.library:MyLib</artifact>
<includes>
<include>com/my/library/**</include>
</includes>
</filter>
</filters>
</configuration>
</execution>
</executions>
</plugin>
Then I ran mvn clean install, and my project built successfully.
But when I check the content of MyProject.jar under the target/ directory, it doesn't contain the classes from the com.my.library:MyLib dependency. Why? What am I doing wrong with maven-shade-plugin?
Define an <artifactSet>:
<artifactSet>
<includes>
<include>com.my.library:MyLib</include>
</includes>
</artifactSet>
And try removing the <artifact/> from the <filters/>. This should do it.
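Put together, the suggestion above might look like the following. This is a sketch of one reading of the answer, with the filters block dropped entirely so that no class-level filtering is applied to MyLib. The coordinates and plugin version are the ones from the question:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>2.3</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <!-- artifactSet selects which dependencies are bundled into the jar;
             filters only narrow down files within artifacts already selected -->
        <artifactSet>
          <includes>
            <include>com.my.library:MyLib</include>
          </includes>
        </artifactSet>
      </configuration>
    </execution>
  </executions>
</plugin>
```

After mvn clean package, listing the jar contents (for example with jar tf) shows whether the com/my/library classes were included.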
Change the pattern to:
<includes>
<include>com/my/library/**.class</include>
</includes>

gmaven-plugin works for groovy 1.7.5 but not for 2.1.0

I have a working Maven 2 setup that compiles JUnit tests written in Groovy. Both the Java and the Groovy tests are located in /src/test/java.
Here is a snippet of the pom.xml:
<plugin>
<groupId>org.codehaus.gmaven</groupId>
<artifactId>gmaven-plugin</artifactId>
<version>1.3</version>
<executions>
<execution>
<id>testCompile</id>
<goals>
<goal>testCompile</goal>
</goals>
<configuration>
<sources>
<fileset>
<directory>${pom.basedir}/src/test/java</directory>
<includes>
<include>**/*.groovy</include>
</includes>
</fileset>
</sources>
</configuration>
</execution>
</executions>
</plugin>
<dependency>
<groupId>org.codehaus.groovy</groupId>
<artifactId>groovy</artifactId>
<version>1.7.5</version>
<scope>test</scope>
</dependency>
When I upgrade to plugin version 1.5 and Groovy 2.1.0, the **/*.groovy files are ignored. Has anybody run into this problem?
I found this page https://confluence.atlassian.com/display/CLOVER/Compiling+Groovy+with+GMaven+plugin
Note that you must put your Groovy Classes and Tests under src/main/groovy and src/test/groovy respectively.
Following configuration based on that page seems to work:
<!-- Groovy and Maven https://confluence.atlassian.com/display/CLOVER/Compiling+Groovy+with+GMaven+plugin -->
<plugin>
<groupId>org.codehaus.gmaven</groupId>
<artifactId>gmaven-plugin</artifactId>
<version>${gmaven.version}</version>
<configuration>
<providerSelection>2.0</providerSelection>
</configuration>
<dependencies>
<dependency>
<groupId>org.codehaus.gmaven.runtime</groupId>
<artifactId>gmaven-runtime-2.0</artifactId>
<version>${gmaven.version}</version>
</dependency>
<dependency>
<groupId>org.codehaus.groovy</groupId>
<artifactId>groovy-all</artifactId>
<version>${groovy.version}</version>
</dependency>
</dependencies>
<executions>
<execution>
<goals>
<goal>generateStubs</goal>
<goal>compile</goal>
<goal>generateTestStubs</goal>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
</plugin>
And in the dependencies, of course:
<dependency>
<groupId>org.codehaus.groovy</groupId>
<artifactId>groovy-all</artifactId>
<version>${groovy.version}</version>
</dependency>
And in the properties:
<properties>
<gmaven.version>1.5</gmaven.version>
<groovy.version>2.1.8</groovy.version>
</properties>
OK, this configuration works for Maven 2:
<plugin>
<groupId>org.codehaus.gmaven</groupId>
<artifactId>gmaven-plugin</artifactId>
<version>1.4</version>
<configuration>
<providerSelection>2.0</providerSelection>
<sourceEncoding>UTF-8</sourceEncoding>
</configuration>
<executions>
<execution>
<goals>
<goal>testCompile</goal>
</goals>
<configuration>
<sources>
<fileset>
<directory>${pom.basedir}/src/test/java</directory>
<includes>
<include>**/*.groovy</include>
</includes>
</fileset>
</sources>
</configuration>
</execution>
</executions>
</plugin>
<dependency>
<groupId>org.codehaus.groovy</groupId>
<artifactId>groovy</artifactId>
<version>2.0.0</version>
<scope>test</scope>
</dependency>
I experienced the same problem, and downgrading to GMaven 1.4 solved it (using groovy-all 2.3.2).
First, each GMaven provider compiles against a particular version of Groovy, so there can be issues if Groovy breaks something with a point release. Second, GMaven is no longer maintained (that's why you don't see any providers for newer Groovy versions). I recommend switching to GMavenPlus or the Groovy-Eclipse compiler plugin for Maven.
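As a rough illustration of the GMavenPlus route, a test-only setup might look like the sketch below. It assumes the Groovy tests are moved to src/test/groovy, the directory GMavenPlus looks in by default. The gmavenplus.version property is a placeholder introduced here, not a value from this thread, and groovy.version is the property defined in the earlier answer. The goal names are those of recent GMavenPlus releases (older releases named the test compilation goal testCompile):

```xml
<plugin>
  <groupId>org.codehaus.gmavenplus</groupId>
  <artifactId>gmavenplus-plugin</artifactId>
  <!-- Assumed property: set it to a GMavenPlus release that supports your Groovy version -->
  <version>${gmavenplus.version}</version>
  <executions>
    <execution>
      <goals>
        <!-- Registers src/test/groovy as a test source root -->
        <goal>addTestSources</goal>
        <!-- Compiles the Groovy test classes -->
        <goal>compileTests</goal>
      </goals>
    </execution>
  </executions>
</plugin>

<dependency>
  <groupId>org.codehaus.groovy</groupId>
  <artifactId>groovy-all</artifactId>
  <version>${groovy.version}</version>
  <scope>test</scope>
</dependency>
```

Unlike GMaven, there is no providerSelection to set: the plugin compiles with whichever Groovy version is on the project's classpath.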
