Apache Spark dependency issue - maven

I'm trying to run my Spark application on a Hadoop cluster.
The Spark version running on the cluster is 1.3.1. I get the error posted below when packaging and running my Spark application on the cluster. I looked at other posts as well; it seems I'm messing up the library dependencies somehow, but I couldn't figure out what.
Here is some other information that might help you help me out:
hadoop -version:
Hadoop 2.7.1.2.3.0.0-2557
Subversion git#github.com:hortonworks/hadoop.git -r 9f17d40a0f2046d217b2bff90ad6e2fc7e41f5e1
Compiled by jenkins on 2015-07-14T13:08Z
Compiled with protoc 2.5.0
From source with checksum 54f9bbb4492f92975e84e390599b881d
This command was run using /usr/hdp/2.3.0.0-2557/hadoop/lib/hadoop-common-2.7.1.2.3.0.0-2557.jar
The error stack:
java.lang.NoSuchMethodError: org.apache.spark.sql.hive.HiveContext: method <init>(Lorg/apache/spark/api/java/JavaSparkContext;)V not found
at com.cyber.app.cyberspark_app.main.Main.main(Main.java:163)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:577)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:174)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:197)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
My pom.xml looks like this:
<build>
  <plugins>
    <plugin>
      <artifactId>maven-assembly-plugin</artifactId>
      <configuration>
        <archive>
          <manifest>
            <mainClass>path.to.my.main.Main</mainClass>
          </manifest>
        </archive>
        <descriptorRefs>
          <descriptorRef>jar-with-dependencies</descriptorRef>
        </descriptorRefs>
      </configuration>
      <executions>
        <execution>
          <id>make-assembly</id> <!-- this is used for inheritance merges -->
          <phase>package</phase> <!-- bind to the packaging phase -->
          <goals>
            <goal>single</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
<dependencies>
  <dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>3.8.1</version>
    <scope>test</scope>
  </dependency>
  <dependency> <!-- Spark dependency -->
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>1.3.1</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>1.6.1</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_2.11</artifactId>
    <version>1.6.1</version>
    <scope>provided</scope>
  </dependency>
</dependencies>
I'm using "mvn package" to package my jar.
EDIT:
I tried changing all the versions to 1.3.1, but if I do that I need to change my application, since I'm using features that only became available after 1.3.1.
But if I set everything to 1.6.1 compiled against Scala 2.10, I get the same error.
Please let me know if I need to provide any additional information. Any help will be greatly appreciated.
Thank you.

This is likely a binary compatibility issue.
First, make sure all your Spark dependencies are on Spark 1.3.1; I see you have Spark SQL on 1.6.1.
Second, you are using Spark compiled against Scala 2.11. The typical distribution of Spark is compiled only against 2.10; if you want a 2.11 build, you typically need to compile Spark yourself.
If you are not sure which Scala version the Spark running on your cluster was compiled with, I would change all the dependencies to use "2.10" instead of "2.11" and try again.
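For illustration, here is a minimal sketch of what an aligned dependency block could look like. The _2.10 suffix and the 1.3.1 version are assumptions based on the cluster described above; match them to whatever your cluster actually runs:
<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.3.1</version>
    <!-- provided: the cluster supplies Spark at runtime via spark-submit -->
    <scope>provided</scope>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.10</artifactId>
    <version>1.3.1</version>
    <scope>provided</scope>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_2.10</artifactId>
    <version>1.3.1</version>
    <scope>provided</scope>
  </dependency>
</dependencies>
Keeping one version and one Scala suffix across all Spark artifacts is the key point; marking them provided (as the question already does for spark-hive) also keeps the cluster's own jars authoritative at runtime.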

Related

What is meant by "Unresolved requirement: Import-Package: com.google.common.collect_ [Sanitized]" in Liferay 7.2

I am creating a hook in Liferay 7.2, but unfortunately when I deploy it I come across this error. I tried increasing the version of the "com.google.collections" dependency and also tried adding Guava as a dependency, but nothing seems to resolve the error.
My dependencies in pom.xml are as follows:
<dependencies>
  <dependency>
    <groupId>com.liferay.portal</groupId>
    <artifactId>com.liferay.portal.kernel</artifactId>
    <scope>provided</scope>
  </dependency>
  <dependency>
    <groupId>org.osgi</groupId>
    <artifactId>org.osgi.service.component.annotations</artifactId>
    <scope>provided</scope>
  </dependency>
  <dependency>
    <groupId>com.google.collections</groupId>
    <artifactId>google-collections</artifactId>
    <version>1.0-rc2</version>
  </dependency>
  <dependency>
    <groupId>org.osgi</groupId>
    <artifactId>osgi.cmpn</artifactId>
    <version>6.0.0</version>
    <scope>provided</scope>
  </dependency>
</dependencies>
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-compiler-plugin</artifactId>
      <version>3.1</version>
      <configuration>
        <source>1.8</source>
        <target>1.8</target>
      </configuration>
    </plugin>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-jar-plugin</artifactId>
      <version>3.1.2</version>
      <configuration>
        <archive>
          <manifestFile>${project.build.outputDirectory}/META-INF/MANIFEST.MF</manifestFile>
        </archive>
      </configuration>
    </plugin>
    <plugin>
      <groupId>biz.aQute.bnd</groupId>
      <artifactId>bnd-maven-plugin</artifactId>
      <version>4.3.0</version>
      <executions>
        <execution>
          <goals>
            <goal>bnd-process</goal>
          </goals>
        </execution>
      </executions>
      <dependencies>
        <dependency>
          <groupId>biz.aQute.bnd</groupId>
          <artifactId>biz.aQute.bndlib</artifactId>
          <version>4.3.0</version>
        </dependency>
        <dependency>
          <groupId>com.liferay</groupId>
          <artifactId>com.liferay.ant.bnd</artifactId>
          <version>3.2.6</version>
        </dependency>
      </dependencies>
    </plugin>
  </plugins>
</build>
Error:
org.osgi.framework.BundleException: Could not resolve module: com.allen.portal.hook [1272]_ Unresolved requirement: Import-Package: com.google.common.collect_ [Sanitized]
at org.eclipse.osgi.container.Module.start(Module.java:444)
at org.eclipse.osgi.internal.framework.EquinoxBundle.start(EquinoxBundle.java:428)
at com.liferay.portal.file.install.internal.DirectoryWatcher._startBundle(DirectoryWatcher.java:1106)
at com.liferay.portal.file.install.internal.DirectoryWatcher._startBundles(DirectoryWatcher.java:1139)
at com.liferay.portal.file.install.internal.DirectoryWatcher._process(DirectoryWatcher.java:1001)
at com.liferay.portal.file.install.internal.DirectoryWatcher.run(DirectoryWatcher.java:313)
If you have any way to resolve this error, please help me out.
Unrelated: you're using an rc2 version released in October 2009, when the final release was made in December 2009? Seriously?
It looks like you're building an OSGi module, which compiles fine (because you provide the dependency). However, that does not mean the Google Collections code ends up in your jar as well. The runtime expects to find it, though - and as Google Collections is not an OSGi bundle itself, you have several choices:
repackage it as an OSGi bundle (and deploy it to the runtime) (or find someone who has already done it)
repackage it within your own bundle
use a different implementation. Chances are that collections utility code from 2009 has found its way into more current implementations and is no longer necessary.
In short: In one way or another, you'll need to make your dependencies available at runtime. Either by fattening your own bundle (but be careful: You can't pass those collections around to other bundles if they bring their own implementation) or by relying on the implementation being available to the runtime.
The third alternative is to switch to an implementation where it's easier to make it available at runtime, preferably as an OSGi bundle.
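If you go the embed-it-within-your-own-bundle route, one common approach is the Felix maven-bundle-plugin's Embed-Dependency instruction. A sketch follows, with the assumptions spelled out: it replaces the bnd-maven-plugin/maven-jar-plugin setup from the question with maven-bundle-plugin, and uses the coordinates already in the pom (with plain bnd you would reach for -includeresource and Bundle-ClassPath instructions instead):
<plugin>
  <groupId>org.apache.felix</groupId>
  <artifactId>maven-bundle-plugin</artifactId>
  <extensions>true</extensions>
  <configuration>
    <instructions>
      <!-- copy the google-collections jar into the bundle and add it to the Bundle-ClassPath -->
      <Embed-Dependency>google-collections;scope=compile</Embed-Dependency>
    </instructions>
  </configuration>
</plugin>
Remember the caveat above: classes embedded this way are private to your bundle and can't safely be passed to other bundles that carry their own copy.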

java.lang.NoSuchMethodError: org.apache.hadoop.conf.Configuration.reloadExistingConfigurations()V

It looks like I am again stuck on running a packaged Spark app jar using spark-submit. Following is my pom file:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <parent>
    <artifactId>oneview-forecaster</artifactId>
    <groupId>com.dataxu.oneview.forecast</groupId>
    <version>1.0.0-SNAPSHOT</version>
  </parent>
  <modelVersion>4.0.0</modelVersion>
  <artifactId>forecaster</artifactId>
  <dependencies>
    <dependency>
      <groupId>com.fasterxml.jackson.core</groupId>
      <artifactId>jackson-databind</artifactId>
    </dependency>
    <dependency>
      <groupId>com.fasterxml.jackson.module</groupId>
      <artifactId>jackson-module-scala_${scala.binary.version}</artifactId>
    </dependency>
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_${scala.binary.version}</artifactId>
      <version>${spark.version}</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_${scala.binary.version}</artifactId>
      <version>${spark.version}</version>
      <!--<scope>provided</scope>-->
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-hive -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-hive_2.11</artifactId>
      <version>2.2.0</version>
      <!--<scope>provided</scope>-->
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-aws</artifactId>
      <version>2.8.3</version>
      <!--<scope>provided</scope>-->
    </dependency>
    <dependency>
      <groupId>com.amazonaws</groupId>
      <artifactId>aws-java-sdk</artifactId>
      <version>1.10.60</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/joda-time/joda-time -->
    <dependency>
      <groupId>joda-time</groupId>
      <artifactId>joda-time</artifactId>
      <version>2.9.9</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>2.8.0</version>
      <!--<scope>provided</scope>-->
    </dependency>
  </dependencies>
  <build>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <testSourceDirectory>src/test/scala</testSourceDirectory>
    <plugins>
      <plugin>
        <groupId>net.alchim31.maven</groupId>
        <artifactId>scala-maven-plugin</artifactId>
        <version>${scala-maven-plugin.version}</version>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
              <goal>testCompile</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
      <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <configuration>
          <archive>
            <manifest>
              <mainClass>com.dataxu.oneview.forecaster.App</mainClass>
            </manifest>
          </archive>
          <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
          </descriptorRefs>
        </configuration>
        <executions>
          <execution>
            <id>make-assembly</id>
            <phase>package</phase>
            <goals>
              <goal>single</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>
Following is a simple snippet of code which fetches data from an s3 location and prints it:
// Imports assumed by this snippet (not shown in the original post):
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

def getS3Data(path: String): Map[String, Any] = {
  println("spark session start.........")
  val spark = getSparkSession() // the app's own SparkSession helper (not shown)
  // read all part files under the path and concatenate them into one string
  val configTxt = spark.sparkContext.textFile(path)
    .collect().reduce(_ + _)
  val mapper = new ObjectMapper
  mapper.registerModule(DefaultScalaModule)
  mapper.readValue(configTxt, classOf[Map[String, String]])
}
When I run it from IntelliJ, everything works fine and the log looks clean. However, when I package it using mvn package and try to run it using spark-submit, I end up with the following error at the .collect().reduce(_ + _) call:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.conf.Configuration.reloadExistingConfigurations()V
at org.apache.hadoop.fs.s3a.S3AFileSystem.addDeprecatedKeys(S3AFileSystem.java:181)
at org.apache.hadoop.fs.s3a.S3AFileSystem.<clinit>(S3AFileSystem.java:185)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
...
I don't understand which dependency was not packaged or what the issue might be, since I did set the versions correctly and expected hadoop-aws to pull all of them in.
Any help will be appreciated.
The dependencies between Hadoop and the AWS SDK are very sensitive, and you should stick to the versions your Hadoop dependency was built with.
The first problem you need to solve is to pick one version of Hadoop: I see you're mixing versions 2.8.3 and 2.8.0.
When I look at the dependency tree for org.apache.hadoop:hadoop-aws:2.8.0, I see that it is built against version 1.10.6 of the AWS SDK (same for hadoop-aws:2.8.3).
This is probably what's causing the mismatches (you're mixing incompatible versions). So:
Choose the version of Hadoop you want to use
Include hadoop-aws with the version compatible with your Hadoop
Remove the other dependencies, or only include them with versions matching your chosen Hadoop version.
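For instance, a hedged sketch of an aligned set, assuming you settle on 2.8.3 (adjust the version to your cluster):
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <version>2.8.3</version>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-aws</artifactId>
  <version>2.8.3</version>
</dependency>
<!-- hadoop-aws 2.8.x already pulls in the AWS SDK version it was built
     against (1.10.6), so the explicit aws-java-sdk 1.10.60 dependency
     can usually be dropped -->
You can check what actually ends up on the classpath with mvn dependency:tree -Dincludes=com.amazonaws.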
In case anybody else is still stumbling on this error... it took me a while to find out, but check whether your project has a dependency (direct or transitive) on the package org.apache.avro/avro-tools.
It was brought into my code by a transitive dependency.
Its problem is that it ships with a copy of org.apache.hadoop.conf.Configuration that is much older than all current versions of Hadoop, so it may end up being the one picked up on the classpath.
In my Scala (sbt) project, I just had to exclude it with
ExclusionRule("org.apache.avro", "avro-tools")
and the error (finally!) disappeared.
I am sure the avro-tools authors had some good reason to include a copy of a file that belongs to another package (hadoop-common), but I was really surprised to find it there, and it made me waste an entire day.
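Since this thread is Maven-based: the equivalent in a pom is an <exclusions> block on whichever dependency drags avro-tools in transitively. The parent coordinates below are placeholders; find the real offender with mvn dependency:tree:
<dependency>
  <groupId>some.group</groupId>                           <!-- placeholder -->
  <artifactId>artifact-pulling-in-avro-tools</artifactId> <!-- placeholder -->
  <version>1.0</version>
  <exclusions>
    <exclusion>
      <groupId>org.apache.avro</groupId>
      <artifactId>avro-tools</artifactId>
    </exclusion>
  </exclusions>
</dependency>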
In my case, I was running a local Spark installation on a Cloudera edge node and was hitting this conflict (even though I made sure to download Spark with the correct precompiled Hadoop binaries). I just went into my Spark home and moved the hadoop-common jar aside so it wouldn't be loaded:
mv ~/spark-2.4.4-bin-hadoop2.6/jars/hadoop-common-2.6.5.jar ~/spark-2.4.4-bin-hadoop2.6/jars/hadoop-common-2.6.5.jar.XXXXXX
After that, it ran... in local mode anyway.

What is stable version for jasperreports-maven-plugin?

In my project I am using Maven 3.0.4 and JasperReports 5.1.0. To compile the JRXML files, I am using the jasperreports-maven-plugin, version 1.0-beta-2. Since that is a beta version, what is the stable version of jasperreports-maven-plugin available to use?
Below is the plugin configuration from my pom.xml file:
<properties>
  <jasperreports.version>5.1.0</jasperreports.version>
</properties>
<build>
  <plugins>
    <plugin>
      <groupId>org.codehaus.mojo</groupId>
      <artifactId>jasperreports-maven-plugin</artifactId>
      <version>1.0-beta-2</version>
      <configuration>
        <sourceDirectory>src/main/resources/reports</sourceDirectory>
        <outputDirectory>${project.build.directory}/classes/reports</outputDirectory>
      </configuration>
      <executions>
        <execution>
          <!-- Need to bind to the compile phase because the reports use classes under target/classes. The default is the generate-resources phase. -->
          <phase>compile</phase>
          <goals>
            <goal>compile-reports</goal>
          </goals>
        </execution>
      </executions>
      <dependencies>
        <dependency>
          <groupId>net.sf.jasperreports</groupId>
          <artifactId>jasperreports</artifactId>
          <version>${jasperreports.version}</version>
        </dependency>
        <dependency>
          <groupId>org.codehaus.groovy</groupId>
          <artifactId>groovy-all</artifactId>
          <version>2.0.1</version>
          <scope>compile</scope>
          <optional>true</optional>
        </dependency>
      </dependencies>
    </plugin>
  </plugins>
</build>
Forget about the official Maven plugin. I've been using alexnederlof's Jasper report maven plugin for a long time and it works like a charm.
You can find more info at github:
The original jasperreports-maven-plugin from org.codehaus.mojo was a bit slow. This plugin is 10x faster. I tested it with 52 reports which took 48 seconds with the original plugin and only 4.7 seconds with this plugin.
and in his blog:
The original plug-in is created in Java 4, works single-threaded, and the last time anyone committed to the repo was (at the time of writing) the 31st of August, 2009. Not really an active project, it seems.
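For reference, wiring the alternative plugin in looks roughly like the sketch below. The coordinates and the jasper goal are as I recall them from the GitHub project; the version number is an assumption, so check the project's releases:
<plugin>
  <groupId>com.alexnederlof</groupId>
  <artifactId>jasperreports-plugin</artifactId>
  <version>2.8</version> <!-- assumption: check GitHub for the latest release -->
  <executions>
    <execution>
      <phase>process-sources</phase>
      <goals>
        <goal>jasper</goal>
      </goals>
    </execution>
  </executions>
  <configuration>
    <!-- same source/output layout as the original plugin config above -->
    <sourceDirectory>src/main/resources/reports</sourceDirectory>
    <outputDirectory>${project.build.directory}/classes/reports</outputDirectory>
  </configuration>
</plugin>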

Maven dependency:get does not download Stanford NLP model files

The core component of the Stanford Natural Language Processing Toolkit has Java code in a stanford-corenlp-1.3.4.jar file, and has (very large) model files in a separate stanford-corenlp-1.3.4-models.jar file. Maven does not download the model files automatically, but only if you add a <classifier>models</classifier> line to the POM. Here is a POM snippet that fetches both the code and the models.
<dependency>
  <groupId>edu.stanford.nlp</groupId>
  <artifactId>stanford-corenlp</artifactId>
  <version>1.3.4</version>
  <classifier>models</classifier>
</dependency>
I'm trying to figure out how to do the same thing from the command line. It seems like the Maven dependency:get plugin goal is the way to do this, and the following command line seems like it would be correct:
mvn dependency:get \
-DgroupId=edu.stanford.nlp \
-DartifactId=stanford-corenlp \
-Dversion=LATEST \
-Dclassifier=models \
-DrepoUrl=repo1.maven.org
However, it only downloads the code jar file, not the models jar file.
Any idea why this is the case? I'm not sure if this is just an issue with the Stanford NLP package or a more general issue with the classifier option of dependency:get.
First, thanks for your question - it answered my own question about how to include both the data and the lib. I'll share what I'm doing with Maven, though I'm not sure it satisfies your question:
<dependency>
  <groupId>edu.stanford.nlp</groupId>
  <artifactId>stanford-corenlp</artifactId>
  <version>1.3.4</version>
  <classifier>models</classifier>
</dependency>
<dependency>
  <groupId>edu.stanford.nlp</groupId>
  <artifactId>stanford-corenlp</artifactId>
  <version>1.3.4</version>
</dependency>
<dependency>
  <groupId>edu.stanford.nlp</groupId>
  <artifactId>stanford-parser</artifactId>
  <version>2.0.4</version>
</dependency>
Also, I make sure my jar includes the libs I use:
<build>
  <plugins>
    <plugin>
      <artifactId>maven-assembly-plugin</artifactId>
      <configuration>
        <archive>
          <manifest>
            <mainClass>org.example.nlpservice.NLP</mainClass>
          </manifest>
        </archive>
        <descriptorRefs>
          <descriptorRef>jar-with-dependencies</descriptorRef>
        </descriptorRefs>
      </configuration>
      <executions>
        <execution>
          <id>make-assembly</id> <!-- this is used for inheritance merges -->
          <phase>package</phase> <!-- bind to the packaging phase -->
          <goals>
            <goal>single</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
Finally, have you tried mvn deploy or mvn install yet? You could copy from your local mvn cache/repo into a /lib directory.
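On the command-line side, one hedged suggestion (untested here): newer versions of the maven-dependency-plugin let you pass the whole coordinate, including the classifier, in a single -Dartifact parameter, which may behave differently from the separate -D flags:
mvn dependency:get -Dartifact=edu.stanford.nlp:stanford-corenlp:1.3.4:jar:models
The coordinate format is groupId:artifactId:version:packaging:classifier. Once the jar is in your local repository, dependency:copy can place it into a /lib directory for you.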

using QueryDSL in osgi

I have been trying to use QueryDSL in a project which is an OSGi bundle.
My pom.xml has the following dependencies:
<dependency>
  <groupId>com.mysema.querydsl</groupId>
  <artifactId>querydsl-apt</artifactId>
  <version>2.5.0</version>
</dependency>
<dependency>
  <groupId>com.mysema.querydsl</groupId>
  <artifactId>querydsl-jpa</artifactId>
  <version>2.5.0</version>
</dependency>
As well as these plugins:
<plugin>
  <groupId>com.mysema.maven</groupId>
  <artifactId>maven-apt-plugin</artifactId>
  <version>0.3.2</version>
  <executions>
    <execution>
      <goals>
        <goal>process</goal>
      </goals>
      <configuration>
        <outputDirectory>target/generated-sources/java</outputDirectory>
        <processor>com.mysema.query.apt.jpa.JPAAnnotationProcessor</processor>
      </configuration>
    </execution>
  </executions>
</plugin>
<plugin>
  <groupId>org.ops4j</groupId>
  <artifactId>maven-pax-plugin</artifactId>
</plugin>
<plugin>
  <groupId>org.apache.felix</groupId>
  <artifactId>maven-bundle-plugin</artifactId>
  <extensions>true</extensions>
  <!-- the following instructions build a simple set of public/private classes into an OSGi bundle -->
  <configuration>
    <instructions>
      <Import-Package>com.mysema.query.jpa,*</Import-Package>
      <Export-Package>com.mypackage.package.*;version="${project.version}"</Export-Package>
    </instructions>
  </configuration>
</plugin>
Still when I try to start the bundle I get:
Error executing command: Unresolved constraint in bundle com.mypackage.package [163]: Unable to resolve 163.0: missing requirement [163.0] package; (&(package=com.mysema.query.jpa)(version>=2.5.0)(!(version>=3.0.0)))
I was using an older version of QueryDSL, but apparently they fixed some OSGi-related issues recently, so I upgraded. The problem persists.
What am I missing for QueryDSL to work inside OSGi?
Installing each dependency by hand will be a pain, but AFAIK there's nothing that will take a Maven artifact and chain back all of its dependencies - and a naive tool would fail anyway, because where would it stop?
You could end up with every version of every logging framework (even if you had pax-logging installed), or the wrong implementation.
Alas, in Maven's case there's currently no way of applying semantic versioning or higher-level requirements and capabilities (though bnd - maven-bundle-plugin, bndtools - makes some sensible assumptions at the code level).
Karaf features (see the PDF manual in the distribution's ${KARAF_HOME}) can do a lot to alleviate this, but it can take some work to set up. There are at least a couple of ways to generate features files:
Use the features-maven-plugin
Use the build-helper-maven-plugin to publish an XML file that you handcraft (laborious, but you can maintain versions using resource filtering); a sketch of such a file follows.
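Assuming illustrative coordinates (verify which QueryDSL artifacts are actual OSGi bundles; plain jars need the wrap: protocol, which requires pax-url-wrap), a minimal handcrafted features file could look like:
<features name="my-features">
  <feature name="querydsl" version="2.5.0">
    <!-- illustrative: check which QueryDSL artifacts are valid OSGi bundles -->
    <bundle>mvn:com.mysema.querydsl/querydsl-core/2.5.0</bundle>
    <bundle>mvn:com.mysema.querydsl/querydsl-jpa/2.5.0</bundle>
    <!-- wrap: converts a plain jar into a bundle on the fly -->
    <bundle>wrap:mvn:com.mysema.commons/mysema-commons-lang/0.2.4</bundle>
  </feature>
</features>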
