Setting spark classpaths on EC2: spark.driver.extraClassPath and spark.executor.extraClassPath - hadoop

Reducing the size of the application jar by providing the Spark classpath for the Maven dependencies:
My cluster has 3 EC2 instances on which Hadoop and Spark are running. If I build the jar with all Maven dependencies bundled in, it becomes too large (around 100 MB), which I want to avoid because the jar gets replicated to all nodes each time I run the job.
To avoid that, I build the application jar with "mvn package". For dependency resolution, I downloaded all the Maven dependencies on each node and then provided only the jar paths shown below:
I added the classpaths on each node in "spark-defaults.conf" as:
spark.driver.extraClassPath /home/spark/.m2/repository/com/google/code/gson/gson/2.3.1/gson-2.3.1.jar:/home/spark/.m2/repository/com/datastax/cassandra/cassandra-driver-core/2.1.5/cassandra-driver-core-2.1.5.jar:/home/spark/.m2/repository/com/google/guava/guava/16.0.1/guava-16.0.1.jar:/home/spark/.m2/repository/com/google/collections/google-collections/1.0/google-collections-1.0.jar:/home/spark/.m2/repository/com/datastax/spark/spark-cassandra-connector-java_2.10/1.2.0-rc1/spark-cassandra-connector-java_2.10-1.2.0-rc1.jar:/home/spark/.m2/repository/com/datastax/spark/spark-cassandra-connector_2.10/1.2.0-rc1/spark-cassandra-connector_2.10-1.2.0-rc1.jar:/home/spark/.m2/repository/org/apache/cassandra/cassandra-thrift/2.1.3/cassandra-thrift-2.1.3.jar:/home/spark/.m2/repository/org/joda/joda-convert/1.2/joda-convert-1.2.jar
This worked locally on a single node.
But on the cluster I am still getting this error. Any help will be appreciated.

Finally, I was able to solve the problem. I created the application jar using "mvn package" instead of "mvn clean compile assembly:single", so that the Maven dependencies are not bundled into the jar when it is built (they need to be provided at runtime instead). This results in a small jar, since it only references its dependencies.
Then I added the following two parameters to spark-defaults.conf on each node:
spark.driver.extraClassPath /home/spark/.m2/repository/com/datastax/cassandra/cassandra-driver-core/2.1.7/cassandra-driver-core-2.1.7.jar:/home/spark/.m2/repository/com/googlecode/json-simple/json-simple/1.1/json-simple-1.1.jar:/home/spark/.m2/repository/com/google/code/gson/gson/2.3.1/gson-2.3.1.jar:/home/spark/.m2/repository/com/google/guava/guava/16.0.1/guava-16.0.1.jar
spark.executor.extraClassPath /home/spark/.m2/repository/com/datastax/cassandra/cassandra-driver-core/2.1.7/cassandra-driver-core-2.1.7.jar:/home/spark/.m2/repository/com/googlecode/json-simple/json-simple/1.1/json-simple-1.1.jar:/home/spark/.m2/repository/com/google/code/gson/gson/2.3.1/gson-2.3.1.jar:/home/spark/.m2/repository/com/google/guava/guava/16.0.1/guava-16.0.1.jar
So the question arises: how does the application jar get the Maven dependencies (the required jars) at runtime?
For that, I downloaded all the required dependencies on each node in advance by running mvn clean compile assembly:single there.
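The same two settings could also be passed per job through spark-submit instead of being baked into spark-defaults.conf on every node. A minimal sketch, assuming a hypothetical application class com.example.MyApp and jar my-app-0.0.1.jar, with only two of the dependency paths from above shown for brevity:

  spark-submit \
    --class com.example.MyApp \
    --conf spark.driver.extraClassPath=/home/spark/.m2/repository/com/google/code/gson/gson/2.3.1/gson-2.3.1.jar:/home/spark/.m2/repository/com/google/guava/guava/16.0.1/guava-16.0.1.jar \
    --conf spark.executor.extraClassPath=/home/spark/.m2/repository/com/google/code/gson/gson/2.3.1/gson-2.3.1.jar:/home/spark/.m2/repository/com/google/guava/guava/16.0.1/guava-16.0.1.jar \
    my-app-0.0.1.jar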

You don't need to put all the jar files on the classpath. Just put your application jar file.
If you get an error again, then add the jar files that are actually needed.
You have to pass the jar files via the setJars() method.
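A minimal Java sketch of what that could look like; the class name, app name, and jar paths below are illustrative placeholders rather than values from the question:

  import org.apache.spark.SparkConf;
  import org.apache.spark.api.java.JavaSparkContext;

  public class MyApp {
      public static void main(String[] args) {
          // setJars() lists the jars Spark ships to the executors for this job
          SparkConf conf = new SparkConf()
                  .setAppName("MyApp")
                  .setJars(new String[] {
                          "target/my-app-0.0.1.jar",
                          "/home/spark/.m2/repository/com/google/code/gson/gson/2.3.1/gson-2.3.1.jar"
                  });
          JavaSparkContext sc = new JavaSparkContext(conf);
          // ... job logic goes here ...
          sc.stop();
      }
  }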

Related

How to get all dependency jars for deployment

I am using an Apache open source project, and its pre-built binary contains all the target jars and the corresponding dependency jars for deployment.
But when I build from source with mvn clean install, how can I also get the necessary dependency jars for deployment?
I suggest two options:
Build a fat jar: the Maven output jar will contain all the necessary classes taken from its dependencies. To accomplish this, you can use the maven-assembly-plugin. You can read a good tutorial here.
Configure Maven to copy all the needed jars into a specific folder. To accomplish this, you can use the maven-dependency-plugin. You will find a good tutorial here. Configuration sketches for both options are shown below.
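Hedged sketches of both plugin configurations (plugin versions are omitted, and the output directory for option 2 is just a common choice, not a requirement).

For option 1, the maven-assembly-plugin builds a jar-with-dependencies during mvn package:

  <plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-assembly-plugin</artifactId>
    <configuration>
      <descriptorRefs>
        <descriptorRef>jar-with-dependencies</descriptorRef>
      </descriptorRefs>
    </configuration>
    <executions>
      <execution>
        <id>make-assembly</id>
        <phase>package</phase>
        <goals>
          <goal>single</goal>
        </goals>
      </execution>
    </executions>
  </plugin>

For option 2, the maven-dependency-plugin copies every dependency into target/lib during mvn package:

  <plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-dependency-plugin</artifactId>
    <executions>
      <execution>
        <id>copy-dependencies</id>
        <phase>package</phase>
        <goals>
          <goal>copy-dependencies</goal>
        </goals>
        <configuration>
          <outputDirectory>${project.build.directory}/lib</outputDirectory>
        </configuration>
      </execution>
    </executions>
  </plugin>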

Setup maven pom to work with dependencies across environments

I have two Java projects: a-1.0.jar, which depends on ojdbc.jar, and b.jar, which depends on a-1.0.jar and ojdbc.jar. I am trying to make the build work on my machine, on a new user's machine, and on a Bamboo server.
Desired behavior:
On a local machine: git clone <git_url>, mvn clean install, java -jar b.jar should run the project.
On Bamboo: a plan can check out the project and run it. The build should track which version of b.jar was built and which version of a.jar was used.
So far I have seen these options:
<systemPath>${project.basedir}/lib/a-1.0.jar</systemPath>: Maven warns that it will fail to resolve dependencies
A Perl script that runs mvn install for each dependent jar before building the project
(1) defeats the purpose of DevOps automation
(2) makes it unclear which version of a jar was used
(3) installs the jar, but java -jar b.jar fails because a.jar is missing
I can overcome this with another Perl script that adds the dependent jars to the classpath.
These are basic tasks, and as a build tool Maven should be able to do them.
How do I tell Maven to do the three things below?
(1) For each unresolved import, get a jar from the lib folder
(2) Build a single set of dependent jars, i.e. don't pull in ojdbc twice
(3) Package a self-sufficient jar that runs instead of failing with "stuff is missing"
It seems like you need to create an executable jar, and for this you can use various approaches.
One of them is to add the maven-shade-plugin, which puts all dependencies into a single jar while taking care of potential resource collisions.
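A minimal sketch of that configuration; the plugin version and main class below are assumptions, not values from the question:

  <plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <version>3.5.1</version>
    <executions>
      <execution>
        <phase>package</phase>
        <goals>
          <goal>shade</goal>
        </goals>
        <configuration>
          <transformers>
            <!-- set Main-Class in the manifest so java -jar b.jar works -->
            <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
              <mainClass>com.example.b.Main</mainClass>
            </transformer>
          </transformers>
        </configuration>
      </execution>
    </executions>
  </plugin>

After mvn package, the shaded jar replaces the normal artifact, so java -jar b.jar no longer depends on a.jar or ojdbc.jar being on the classpath.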
Try the non-maven-jar-maven plugin. It adds jars that are not in Maven Central.

Why can't I run the example from storm-starter using this command?

I have had no experience using Storm or Maven before, and I'm working on my starter project. When I compile the starter project from the Git repository using the command given there, i.e. this:
mvn compile exec:java -Dexec.classpathScope=compile -Dexec.mainClass=storm.starter.ExclamationTopology
I can run the ExclamationTopology class, but when I use this command:
java -cp ./target/storm-starter-0.0.1-SNAPSHOT-jar-with-dependencies.jar storm.starter.ExclamationTopology
I can't run it.
By the way, I got the second command from the Maven tutorial on Apache's site.
Could someone point out what I am doing wrong here?
PS: This is the error http://pastebin.com/A1PQbB3r
You are hitting java.lang.NoClassDefFoundError because the Storm jars are not on your classpath. For your second command, put the Storm jar and storm/lib on your classpath and it should work as expected.
Your POM probably has the scope for the Storm dependency set to "provided", which means it is expected on the runtime classpath but is not included in the jar-with-dependencies. Try changing the scope to "compile".
The scope for the Storm dependency should be different depending on whether you are running in local mode or on a cluster.
For local mode you need to set the scope to "compile", or leave the tag out, since scope defaults to "compile".
In order to submit your topology to a cluster you need to set the scope to "provided"; otherwise the Storm jar will be packaged inside your topology jar, and when deploying to the cluster there will be two Storm jars on the classpath: the one inside your topology and the one inside the Storm installation directory.
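For illustration, the dependency entry in question looks roughly like this; the exact coordinates depend on the Storm release (older storm-starter builds used groupId storm, artifactId storm), and the version shown is only an example:

  <dependency>
    <groupId>org.apache.storm</groupId>
    <artifactId>storm-core</artifactId>
    <version>0.9.6</version>
    <!-- "provided" when submitting to a cluster; "compile" (or omit the tag)
         for local mode and for a runnable jar-with-dependencies -->
    <scope>provided</scope>
  </dependency>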

What is wrong with my neo4j test setup? EmbeddedNeo4j.java, neo4j, maven

I started a project with Maven using the "quickstart" archetype. I then changed my POM to include Neo4j:
https://github.com/ENCE688R/msrcs/blob/master/pom.xml
I added:
https://github.com/neo4j/neo4j/blob/master/community/embedded-examples/src/main/java/org/neo4j/examples/EmbeddedNeo4j.java
and ran
mvn package
This works with no errors, but
java -cp target/msrcs-1.0-SNAPSHOT.jar org.neo4j.examples.EmbeddedNeo4j
returns the error:
Exception in thread "main" java.lang.NoClassDefFoundError: org/neo4j/graphdb/RelationshipType
What am I missing? At this point I simply need to test that I can include and use Neo4j.
use
mvn exec:java -Dexec.mainClass=org.neo4j.examples.EmbeddedNeo4j
There is also mvn dependency:copy-dependencies, which copies all the dependencies to target/dependency.
And there is the appassembler Maven plugin, which lets you generate startup shell scripts that put all your dependencies on the classpath.
And last but not least there is the Maven assembly plugin (mvn assembly:single), which generates a single jar file that you can run with java -jar my-jar-file.jar.
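A hedged example combining the jar from the question with the dependency plugin (target/dependency is that plugin's default output directory):

  mvn package dependency:copy-dependencies
  java -cp "target/msrcs-1.0-SNAPSHOT.jar:target/dependency/*" org.neo4j.examples.EmbeddedNeo4j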
You need to add the Neo4j dependencies to your classpath as well. At the moment you're only adding the jar you built from source. If you look at this POM you'll see that the Neo4j examples require many other dependencies.
Find the lib directory where the dependencies have been downloaded (this may be your local .m2 Maven repo) and add those jars to your classpath. You do not need to add each jar one by one; you can simply add a directory with a wildcard, e.g.:
Windows:
java -cp "target/msrcs-1.0-SNAPSHOT.jar;lib/*" org.neo4j.examples.EmbeddedNeo4j
Mac/Unix:
java -cp "target/msrcs-1.0-SNAPSHOT.jar:lib/*" org.neo4j.examples.EmbeddedNeo4j
I've started to work on some Maven archetypes which could be a good starting point as well.
For Java Neo4j projects, use neo4j-archetype-quickstart.
For Spring Data Neo4j projects, use sdn-archetype-quickstart.

How to build Mahout /usr/lib resource folders after build with Maven

I am new to this stuff, so I hope someone can help.
I want to build my own Apache Mahout installation from source code. I have Maven 2.2.1. Following the instructions on the Mahout wiki, I was able to check out the code (Mahout-0.6-SNAPSHOT) and build Mahout with Maven. At least, that is what I think happened after running "mvn install" from the root of the folder containing the checked-out source code. Tests were run, which took a while.
So I now have all these jars (called artifacts, if I'm not mistaken) in the Maven repository at ~/.m2/repository.
So my first question is: how do I get from here to an 'installed' package like I am used to when I install an RPM on Red Hat? By that I mean a new folder under /usr/lib/ with /lib, /bin, etc. folders underneath.
My second question is about dependency jars. I can see in the repository that Mahout was built with hadoop-core-0.20.204.0.jar, but that is not the jar I want, because I run a Hadoop cluster with a different hadoop-core jar from Cloudera. How would I go about rebuilding Mahout with the right hadoop-core jar? Or would it just be a matter of swapping one hadoop-core jar for another in the /lib folder that gets created (once my first question is answered)?
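Concretely, the kind of POM override in question would look roughly like the sketch below: point Maven at Cloudera's repository and pin hadoop-core to the CDH build. The CDH version shown is purely an illustrative assumption.

  <repositories>
    <repository>
      <id>cloudera</id>
      <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
  </repositories>

  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-core</artifactId>
    <version>0.20.2-cdh3u3</version>
  </dependency>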
Thanks
