spark maven dependency understanding - maven

I am trying to understand how Spark works with Maven, and I have the following question: do I need to have Spark installed on my machine to build a Spark application (in Scala) with Maven?
Or should I just add the Spark dependency to the pom.xml of my Maven project?
Best regards

The short answer is no. At build time all your dependencies will be collected by Maven or sbt; there is no need for an additional Spark installation.
Also at runtime (and this may include the execution of unit tests during the build) you do not necessarily need a Spark installation. If SPARK_HOME does not point to a valid Spark installation, default values will be used for the runtime configuration of Spark.
However, as soon as you want to submit Spark jobs to a remote cluster (using spark-submit) you will need a Spark installation.
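As a concrete sketch, adding the dependency to your pom.xml is all that is required to compile against Spark (the version and Scala suffix below are assumptions; use whatever matches your target cluster):
<dependency>
    <groupId>org.apache.spark</groupId>
    <!-- artifact suffix and version are assumptions; match your cluster -->
    <artifactId>spark-core_2.11</artifactId>
    <version>2.4.8</version>
</dependency>
With just this in place the project compiles and can run locally (for example with master local[*]); nothing else has to be installed until you submit to a cluster.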

Related

How to change a Flink fat jar to a thin jar

Can I move the dependency jars to HDFS, so I can run a thin jar without the dependency jars?
The operations and maintenance engineers do not allow me to move jars into the Flink lib folder.
Not sure what problem you are trying to solve, but you might want to consider an application mode deployment if you are using YARN:
./bin/flink run-application -t yarn-application \
-Dyarn.provided.lib.dirs="hdfs://myhdfs/remote-flink-dist-dir" \
"hdfs://myhdfs/jars/MyApplication.jar"
In this example, MyApplication.jar isn't a thin jar, but the job submission is very lightweight as the needed Flink jars and the application jar are picked up from HDFS rather than being shipped to the cluster by the client. Moreover, the application’s main() method is executed on the JobManager.
Application mode was introduced in Flink 1.11, and is described in detail in this blog post: Application Deployment in Flink: Current State and the new Application Mode.

Toggling between provided and included for spark binaries in local run mode

A common pattern for building Spark applications, either in Maven or sbt, is to mark the Spark binaries as provided. This approach reduces the uber jar size substantially and also avoids version mismatches if the binaries were built for, say, Spark 2.0.0 but deployed on 2.0.1.
The downside of this approach is: how do we run the programs in local mode? In that case there is no Spark server to provide the binaries for us.
This is not about running tests: those live in the test directory. The intention is to run precisely the same workflow locally as will happen on a deployment cluster, including sourcing out of the main directory and using the same build file. The preferred answer would be one that differs only by an sbt or Maven command-line switch.
So, for example, in sbt (notice the "provided" scope, which will omit the binaries):
"org.apache.spark" %% "spark-core" % Versions.spark % "provided"
We want to include the Spark binaries:
sbt package <some switch to include the spark binaries>
In the Maven pom.xml:
<dependency>
..
<scope>provided</scope>
</dependency>
We want to include the Spark binaries somehow:
mvn package <some switch to include the spark binaries>
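One way to get such a switch in Maven (a sketch only; the properties spark.scope and spark.version and the profile id local-run are invented names here) is to route the scope through a property and override it from a profile:
<properties>
    <!-- default: keep the Spark binaries out of the packaged artifact -->
    <spark.scope>provided</spark.scope>
</properties>

<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>${spark.version}</version>
        <scope>${spark.scope}</scope>
    </dependency>
</dependencies>

<profiles>
    <profile>
        <!-- mvn package -Plocal-run flips the scope to compile -->
        <id>local-run</id>
        <properties>
            <spark.scope>compile</spark.scope>
        </properties>
    </profile>
</profiles>
With that, mvn package builds with Spark provided, while mvn package -Plocal-run bundles the binaries. The same idea carries over to sbt by computing the scope string from a system property in build.sbt.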

Running Spark Unit Tests in IntelliJ

I have a Spark job I'm developing in IntelliJ. It builds via Maven, the tests pass, and I can run the job locally. However, if I try to run the tests via IntelliJ, I get:
Error:scalac: bad symbolic reference. A signature in
SparkContext.class refers to term akka in package root which is not
available. It may be completely missing from the current classpath, or
the version on the classpath might be incompatible with the version
used when compiling SparkContext.class.
I ended up just deleting and rebuilding the IML files in IntelliJ, and that fixed the issue.

Dependency issues with app while deploying in Tomcat server

I am using HBase 0.94.7, Hadoop 1.0.4, and Tomcat 7.
I wrote a small REST-based application which performs CRUD operations on HBase.
Earlier I used to run the app using the Maven Tomcat plugin.
Now I am trying to deploy the WAR in the Tomcat server.
Since the Hadoop and HBase jars already contain older versions of the org.mortbay.jetty, jsp-api, and servlet-api jars,
I am getting AbstractMethodErrors.
Here's the exception log.
So I added an exclusion of org.mortbay.jetty to both the Hadoop and HBase dependencies in pom.xml, but it started showing more and more issues of this kind, such as with Jasper.
So then I added scope "provided" to the Hadoop and HBase dependencies.
Now Tomcat is unable to find the Hadoop and HBase jars.
Can someone help me fix these dependency issues?
Thanks.
Do one thing:
- Right-click on the project
- Go to Properties
- Type "Java Build Path"
- Go to the third tab, Libraries
- Remove the lib and Maven dependency entries
- Clean and build your project
This might solve your problem.
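On the POM side, the usual direction is the exclusions you already tried, just applied broadly enough. The artifact ids below are only illustrative (run mvn dependency:tree against your versions to see what is actually pulled in), and the same exclusions would go on the HBase dependency:
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-core</artifactId>
    <version>1.0.4</version>
    <exclusions>
        <!-- let Tomcat supply the servlet/JSP implementation;
             add whatever else dependency:tree reports from this group -->
        <exclusion>
            <groupId>org.mortbay.jetty</groupId>
            <artifactId>jetty</artifactId>
        </exclusion>
        <exclusion>
            <groupId>org.mortbay.jetty</groupId>
            <artifactId>jetty-util</artifactId>
        </exclusion>
        <exclusion>
            <groupId>org.mortbay.jetty</groupId>
            <artifactId>servlet-api</artifactId>
        </exclusion>
    </exclusions>
</dependency>
Marking Hadoop and HBase themselves as provided does not work here, because Tomcat, unlike a Hadoop cluster, does not supply those jars at runtime.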

Why can't I run the example from storm-starter using this command?

I have had no experience using Storm or Maven before, and I'm working on my starter project. When I compile the starter project uploaded on the Git website using the command given there, i.e. this:
mvn compile exec:java -Dexec.classpathScope=compile -Dexec.mainClass=storm.starter.ExclamationTopology
I can run the ExclamationTopology class, but when I use this command:
java -cp ./target/storm-starter-0.0.1-SNAPSHOT-jar-with-dependencies.jar storm.starter.ExclamationTopology
I can't run it.
By the way, I got the second command from the Maven tutorial on Apache's site.
Could someone point out what I am doing wrong here?
PS: This is the error http://pastebin.com/A1PQbB3r
You are hitting java.lang.NoClassDefFoundError because the Storm jars are not on your classpath. For your second command, put the Storm jar and the jars under storm/lib on your classpath and it should work as expected.
Your pom probably has the scope for the Storm dependency set to "provided", which means it is available on the compile classpath but is not packaged into the jar-with-dependencies. Try changing the scope to "compile".
The scope for the Storm dependency should differ depending on whether you are running in local mode or on a cluster.
For local mode you need to set the scope to "compile", or leave the tag out, since the scope defaults to "compile".
To submit your topology to a cluster you need to set the scope to "provided"; otherwise the Storm jar will be packaged inside your topology jar, and when deploying to the cluster there will be two Storm jars on the classpath: the one inside your topology and the one in the Storm installation directory.
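For reference, a sketch of what that looks like in the POM (the coordinates are an assumption and depend on your Storm release; older pre-Apache releases were published as storm:storm rather than org.apache.storm:storm-core):
<dependency>
    <groupId>org.apache.storm</groupId>
    <artifactId>storm-core</artifactId>
    <version>${storm.version}</version>
    <!-- "compile" (or no scope tag) for local mode;
         "provided" when submitting to a cluster with the storm jar command -->
    <scope>provided</scope>
</dependency>
A Maven profile that overrides the scope, as in the earlier question about the Spark binaries, lets you switch between the two without editing the POM.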
