Submit Spark application in a JAR file separate from the uber JAR containing all dependencies - maven

I am building a Spark application that has several heavy dependencies (e.g. Stanford NLP with language models), so the uber JAR containing the application code plus dependencies is ~500 MB. Uploading this fat JAR to my test cluster takes a lot of time, so I decided to build my app and its dependencies into separate JAR files.
I've created two modules in my parent pom.xml and build the app JAR and the uber JAR separately, with mvn package and mvn assembly:assembly respectively.
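For context, the uber JAR module uses a maven-assembly-plugin setup roughly like the following sketch (the jar-with-dependencies descriptor here is only illustrative, not my exact configuration):
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-assembly-plugin</artifactId>
      <configuration>
        <!-- bundle every dependency of this module into a single jar-with-dependencies artifact -->
        <descriptorRefs>
          <descriptorRef>jar-with-dependencies</descriptorRef>
        </descriptorRefs>
      </configuration>
    </plugin>
  </plugins>
</build>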
However, after I upload these separate JARs to my YARN cluster, the application fails with the following error:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.net.unix.DomainSocketWatcher.<init>(I)V
    at org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager.<init>(DfsClientShmManager.java:415)
    at org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.<init>(ShortCircuitCache.java:379)
    at org.apache.hadoop.hdfs.ClientContext.<init>(ClientContext.java:100)
    at org.apache.hadoop.hdfs.ClientContext.get(ClientContext.java:151)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:690)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:601)
    at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:148)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2653)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
Running the application on Spark also fails with a similar error.
The JAR with dependencies is included on the YARN classpath:
<property>
<name>yarn.application.classpath</name>
<value>
$HADOOP_CONF_DIR,
$HADOOP_COMMON_HOME/*,
$HADOOP_COMMON_HOME/lib/*,
$HADOOP_HDFS_HOME/*,
$HADOOP_HDFS_HOME/lib/*,
$HADOOP_MAPRED_HOME/*,
$HADOOP_MAPRED_HOME/lib/*,
$YARN_HOME/*,
$YARN_HOME/lib/*,
/usr/local/myApp/org.myCompany.myApp-dependencies.jar
</value>
</property>
Is it actually possible to run a Spark application this way? Or do I have to put all the dependencies on the YARN (or Spark) classpath as individual JAR files?

I encountered the same issue with my Spark job. This is a dependency issue for sure: you have to make sure the correct versions are picked up at runtime. The best way to do this was adding the correct version of hadoop-common (2.6) to my application JAR. I also upgraded the hadoop-hdfs version in my application JAR. This resolved my issue.
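For illustration, the version pinning might look like this in the application pom.xml (2.6.0 is only an example version; match it to whatever your cluster actually runs):
<!-- pin the Hadoop client libraries to the cluster's Hadoop version -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <version>2.6.0</version>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-hdfs</artifactId>
  <version>2.6.0</version>
</dependency>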

Related

How to change a Flink fat JAR to a thin JAR

Can I move the dependency JARs to HDFS, so I can run a thin JAR without the dependency JARs?
The operations and maintenance engineers do not allow me to move JARs into the Flink lib folder.
Not sure what problem you are trying to solve, but you might want to consider an application mode deployment if you are using YARN:
./bin/flink run-application -t yarn-application \
-Dyarn.provided.lib.dirs="hdfs://myhdfs/remote-flink-dist-dir" \
"hdfs://myhdfs/jars/MyApplication.jar"
In this example, MyApplication.jar isn't a thin jar, but the job submission is very lightweight as the needed Flink jars and the application jar are picked up from HDFS rather than being shipped to the cluster by the client. Moreover, the application’s main() method is executed on the JobManager.
Application mode was introduced in Flink 1.11, and is described in detail in this blog post: Application Deployment in Flink: Current State and the new Application Mode.

How to deploy an assembly JAR and use it as a provided dependency?

Using Spark over HBase and Hadoop on YARN, an assembly library, among other libraries, is provided server side (named something like spark-looongVersion-hadoop-looongVersion.jar); it includes numerous libraries.
When the Spark JAR is sent as a job to the server for execution, conflicts may arise between the libraries included in the job and the server libraries (the assembly JAR and possibly other libraries).
I need to include this assembly JAR as a "provided" Maven dependency to avoid conflicts between client dependencies and the server classpath.
How can I deploy and use this assembly JAR as a provided dependency?
An assembly JAR is a regular JAR file, and like any other JAR file it can be a library dependency if it's available in an artifact repository to download from, e.g. Nexus, Artifactory or similar.
The quickest way to do this is to "install" it in your local Maven repository (see Maven's Guide to installing 3rd party JARs). That, however, binds you to what you have available locally and so will quickly get out of sync with what other teams are using.
The recommended way is to deploy the dependency using Apache Maven Deploy Plugin.
Once it's deployed, declaring it as a dependency is no different from declaring any other dependency.
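For example, installing the server assembly into the local repository could look like this (the groupId, artifactId and version are made up purely for illustration):
mvn install:install-file \
  -Dfile=spark-looongVersion-hadoop-looongVersion.jar \
  -DgroupId=com.mycompany.cluster \
  -DartifactId=spark-server-assembly \
  -Dversion=1.0 \
  -Dpackaging=jar
deploy:deploy-file takes the same coordinates plus -Durl (and usually -DrepositoryId for credentials) pointing at your Nexus/Artifactory instance.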
Provided dependencies scope
Spark dependencies must be excluded from the assembled JAR. If they aren't, you should expect weird errors from the Java classloader during application startup. An additional benefit of assembling without Spark dependencies is faster deployment. Please remember that the application assembly must be copied over the network to a location accessible by all cluster nodes (e.g. HDFS or S3).
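A sketch of the resulting declarations, with both the deployed assembly (using the illustrative coordinates from above) and Spark itself marked as provided so neither ends up in your own assembly (the Spark artifact and version shown are only examples):
<dependency>
  <groupId>com.mycompany.cluster</groupId>
  <artifactId>spark-server-assembly</artifactId>
  <version>1.0</version>
  <!-- already on the server classpath, so don't bundle it -->
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.11</artifactId>
  <version>2.1.0</version>
  <scope>provided</scope>
</dependency>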

Spark can't find Guava Classes

I'm running Spark's example called JavaPageRank, but it's a copy that I compiled separately with Maven into a new JAR. I keep getting this error:
ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-0,5,main]
java.lang.NoClassDefFoundError: com/google/common/collect/Iterables
This happens despite the fact that Guava is listed as one of Spark's dependencies. I'm running Spark 1.6, downloaded pre-compiled from the Apache website.
Thanks!
The error means that the JAR containing the com.google.common.collect.Iterables class is not on the classpath, so your application is not able to find the required class at runtime.
If you are using Maven/Gradle, try to clean, build and refresh the project. Then check your classes folder and make sure the Guava JAR is in the lib folder.
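One way to work around it, assuming a Guava JAR is available on the machine you submit from (the paths, file names and Guava version below are illustrative), is to ship it explicitly with the job:
spark-submit \
  --class org.apache.spark.examples.JavaPageRank \
  --jars /path/to/guava-14.0.1.jar \
  my-pagerank.jar data/pagerank_urls.txt 10
--jars adds the listed JAR to both the driver and executor classpaths, so the executors can resolve com.google.common.collect.Iterables at runtime.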
Hope this will help.
Good luck!

Setting spark classpaths on EC2: spark.driver.extraClassPath and spark.executor.extraClassPath

Reducing the size of the application JAR by providing a Spark classpath for the Maven dependencies:
My cluster has 3 EC2 instances on which Hadoop and Spark are running. If I build the JAR with all Maven dependencies, it becomes too large (around 100 MB), which I want to avoid because the JAR gets replicated to all nodes each time I run the job.
To avoid that, I built the package with "mvn package". For dependency resolution I downloaded all the Maven dependencies on each node and then only provided the JAR paths below.
I have added the classpaths on each node in spark-defaults.conf as
spark.driver.extraClassPath /home/spark/.m2/repository/com/google/code/gson/gson/2.3.1/gson-2.3.1.jar:/home/spark/.m2/repository/com/datastax/cassandra/cassandra-driver-core/2.1.5/cassandra-driver-core-2.1.5.jar:/home/spark/.m2/repository/com/google/guava/guava/16.0.1/guava-16.0.1.jar:/home/spark/.m2/repository/com/google/collections/google-collections/1.0/google-collections-1.0.jar:/home/spark/.m2/repository/com/datastax/spark/spark-cassandra-connector-java_2.10/1.2.0-rc1/spark-cassandra-connector-java_2.10-1.2.0-rc1.jar:/home/spark/.m2/repository/com/datastax/spark/spark-cassandra-connector_2.10/1.2.0-rc1/spark-cassandra-connector_2.10-1.2.0-rc1.jar:/home/spark/.m2/repository/org/apache/cassandra/cassandra-thrift/2.1.3/cassandra-thrift-2.1.3.jar:/home/spark/.m2/repository/org/joda/joda-convert/1.2/joda-convert-1.2.jar
It has worked locally on a single node.
Still, I am getting this error. Any help will be appreciated.
Finally, I was able to solve the problem. I created the application JAR using "mvn package" instead of "mvn clean compile assembly:single", so that it does not bundle the Maven dependencies while creating the JAR (but they need to be provided at runtime), which results in a small JAR (as it only references the dependencies).
Then I added the below two parameters in spark-defaults.conf on each node:
spark.driver.extraClassPath /home/spark/.m2/repository/com/datastax/cassandra/cassandra-driver-core/2.1.7/cassandra-driver-core-2.1.7.jar:/home/spark/.m2/repository/com/googlecode/json-simple/json-simple/1.1/json-simple-1.1.jar:/home/spark/.m2/repository/com/google/code/gson/gson/2.3.1/gson-2.3.1.jar:/home/spark/.m2/repository/com/google/guava/guava/16.0.1/guava-16.0.1.jar
spark.executor.extraClassPath /home/spark/.m2/repository/com/datastax/cassandra/cassandra-driver-core/2.1.7/cassandra-driver-core-2.1.7.jar:/home/spark/.m2/repository/com/googlecode/json-simple/json-simple/1.1/json-simple-1.1.jar:/home/spark/.m2/repository/com/google/code/gson/gson/2.3.1/gson-2.3.1.jar:/home/spark/.m2/repository/com/google/guava/guava/16.0.1/guava-16.0.1.jar
So the question arises: how will the application JAR get the Maven dependencies (required JARs) at runtime?
For that, I downloaded all the required dependencies onto each node in advance using mvn clean compile assembly:single.
You don't need to put all the JAR files there, just your application JAR file.
If you get the error again, then add the JAR files that are needed.
You have to add the JAR files with the setJars() method.
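A minimal sketch of what that looks like in the driver code, reusing a couple of the repository paths from the question (trim the list to whatever the job actually needs):
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class MyApp {
    public static void main(String[] args) {
        // List the dependency JARs that Spark should distribute to the executors at runtime
        SparkConf conf = new SparkConf()
                .setAppName("MyApp")
                .setJars(new String[] {
                        "/home/spark/.m2/repository/com/google/code/gson/gson/2.3.1/gson-2.3.1.jar",
                        "/home/spark/.m2/repository/com/google/guava/guava/16.0.1/guava-16.0.1.jar"
                });
        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... job logic goes here ...
        sc.stop();
    }
}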

Dependency issues with app while deploying in Tomcat server

I am using HBase 0.94.7, Hadoop 1.0.4 and Tomcat 7.
I wrote a small REST-based application which performs CRUD operations on HBase.
Earlier I used to run the app using the Maven Tomcat plugin.
Now I am trying to deploy the WAR in the Tomcat server.
Since the Hadoop and HBase JARs already contain org.mortbay.jetty, jsp-api and servlet-api JARs of older versions, I am getting AbstractMethodError exceptions.
Here's the exception log.
So then I excluded org.mortbay.jetty from both the Hadoop and HBase dependencies in pom.xml, but it started showing more and more issues of this kind, for example with Jasper.
So then I added scope provided to the Hadoop and HBase dependencies.
Now Tomcat is unable to find the Hadoop and HBase JARs.
Can someone help me fix these dependency issues?
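For reference, the exclusions I tried look roughly like this in pom.xml (a sketch, not my exact pom; mvn dependency:tree shows exactly which transitive artifacts are involved), with the same block applied to the Hadoop dependency:
<dependency>
  <groupId>org.apache.hbase</groupId>
  <artifactId>hbase</artifactId>
  <version>0.94.7</version>
  <exclusions>
    <!-- keep the old Jetty JARs pulled in transitively off the webapp classpath -->
    <exclusion>
      <groupId>org.mortbay.jetty</groupId>
      <artifactId>jetty</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.mortbay.jetty</groupId>
      <artifactId>jetty-util</artifactId>
    </exclusion>
  </exclusions>
</dependency>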
Thanks.
Do one thing:
- Right-click on the project
- Go to Properties
- Type "Java Build Path"
- Go to the third tab, Libraries
- Remove the lib and Maven dependencies
- Clean and build your project
This might solve your problem.
