Running Spark from a local IDE - Maven

I've been spending some time banging my head against trying to run a complex Spark application locally in order to test more quickly (without having to package and deploy to a cluster).
Some context:
This Spark application interfaces with the DataStax Enterprise version of Cassandra and its distributed file system, so it needs some explicit jars to be provided (not available in Maven).
These jars are available on my local machine, and to "cheese" this, I tried placing them in SPARK_HOME/jars so they would be automatically added to the classpath.
I tried to do something similar with the required configuration settings by putting them in spark-defaults.conf under SPARK_HOME/conf.
When building this application, we do not build an uber jar; instead, we run spark-submit on the server with --jars.
The problem I'm facing is that when I run the Spark application through my IDE, it doesn't seem to pick up any of these additional items from the SPARK_HOME directory (config or jars). I spent a few hours trying to get the config items to work and ended up setting them as System properties in my test case before starting the Spark session so that Spark would pick them up, so the configuration settings can be set aside for this question.
However, I do not know how to reproduce this for the vendor-specific jar files. Is there an easy way I can emulate the --jars behavior of spark-submit and somehow set up my Spark session with these jars? Note: I am using the following call in my code to start a Spark session:
SparkSession.builder().config(conf).getOrCreate()
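For reference, the System-property workaround I mentioned looks roughly like this (a simplified sketch; the property names below are placeholders, not my actual DSE settings):
// Spark picks up "spark.*" System properties when a SparkConf is created,
// so setting them before building the session works around the ignored
// spark-defaults.conf. The property names here are only examples.
System.setProperty("spark.cassandra.connection.host", "127.0.0.1")
System.setProperty("spark.hadoop.some.vendor.setting", "some-value")

val conf = new org.apache.spark.SparkConf()
  .setMaster("local[*]")
  .setAppName("LocalTest")

val spark = org.apache.spark.sql.SparkSession.builder().config(conf).getOrCreate()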
Additional information, in case it helps:
The Spark version I have locally in SPARK_HOME is the same version that my code compiles against via Maven.
I asked another question similar to this related to configs: Loading Spark Config for testing Spark Applications
When I print the SPARK_HOME environment variable in my application, I am getting the correct SPARK_HOME value, so I'm not sure why neither the configs nor the jar files are being picked up from there. Is it possible that when running the application from my IDE, it's not picking up the SPARK_HOME environment variable and is using all defaults?

You can make use of .config(key, value) while building the SparkSession, passing "spark.jars" as the key and a comma-separated list of jar paths as the value, like so:
SparkSession.builder().config("spark.jars", "/path/jar1.jar,/path/jar2.jar").config(conf).getOrCreate()

Related

Hadoop confs for client application

I have a client application that uses the Hadoop conf files (hadoop-site.xml and hadoop-core.xml).
I don't want to check them in under the resources folder, so I tried to add them via IDEA.
The problem is that the Hadoop Configuration ignores my HADOOP_CONF_DIR and loads the default confs from the Hadoop package. Any idea?
I'm using Gradle.
I ended up solving it by putting the configuration files in the test resources folder, so they are not included when the jar gets built.
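If you'd rather not rely on the resources folder at all, another option is to add the files to the Hadoop Configuration explicitly; a minimal sketch, assuming HADOOP_CONF_DIR points at the directory containing the XML files (adjust the file names to match yours):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Explicitly load the conf files from HADOOP_CONF_DIR instead of relying on
// whatever defaults happen to be bundled with the Hadoop jars.
val confDir = sys.env("HADOOP_CONF_DIR")
val hadoopConf = new Configuration()
hadoopConf.addResource(new Path(s"$confDir/core-site.xml"))
hadoopConf.addResource(new Path(s"$confDir/hdfs-site.xml"))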

Where is Spark loading its jars from?

When specifying a jar via "spark.jars" and running on standalone Spark without spark-submit, where is the jar loaded from?
I have a Spring application that performs some spark operations on a Spark standalone running in Docker.
My application relies on various libraries such as the MySQL JDBC driver, Elasticsearch, etc., and thus fails on the cluster, which doesn't have them.
I assembled my jar with all its dependencies and moved it to the /jars directory in Docker. But still no luck.
13:28:42.577 [Executor task launch worker-0] INFO org.apache.spark.executor.Executor - Fetching spark://192.168.99.1:58290/jars/xdf-1.0.jar with timestamp 1499088505128
13:28:42.614 [dispatcher-event-loop-0] INFO org.apache.spark.executor.Executor - Executor is trying to kill task 0.3 in stage 1.0 (TID 7)
13:28:42.698 [Executor task launch worker-0] DEBUG org.apache.spark.network.client.TransportClient - Sending stream request for /jars/xdf-1.0.jar to /192.168.99.1:58290
13:28:42.741 [shuffle-client-7-1] DEBUG org.apache.spark.rpc.netty.NettyRpcEnv - Error downloading stream /jars/xdf-1.0.jar.
java.lang.RuntimeException: Stream '/jars/xdf-1.0.jar' was not found.
Now I noticed that it's looking for the jar on the driver host, but I don't understand where it's trying to deploy it from. Does anyone have an idea where it's looking for that jar?
I figured it out: the jars are loaded from the driver node, so I didn't need to move my jar to the Spark nodes. I just had to set the correct path to the dependency jar.
This solved it:
spark.jars=./target/scala-2.1.1/xdf.jar
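The same setting can also be applied programmatically on the driver; a sketch in Scala, where the master URL and jar path are only examples:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// The jar path is resolved on the driver host; executors then fetch it from
// the driver's file server, so it only has to exist where the driver runs.
val conf = new SparkConf()
  .setMaster("spark://spark-master:7077") // example master URL
  .setAppName("xdf")
  .set("spark.jars", "./target/scala-2.11/xdf.jar") // example path

val spark = SparkSession.builder().config(conf).getOrCreate()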
If you're essentially running a standalone application in local mode, you will need to provide all jars on your own, as opposed to having spark-submit stage the Spark runtime for you. Assuming that you're using a build system such as Maven or Gradle, you will need to package all transitive dependencies with your application and remove any "provided" scope declarations.
The easiest approach in this case is to use the assembly or maven-shade plugin to package a fat jar and then run that.
If you're running in cluster mode, you can programmatically submit your application using SparkLauncher; here's an example in Scala:
import org.apache.spark.launcher.SparkLauncher
object Launcher extends App {
  val spark = new SparkLauncher()
    .setSparkHome("/home/user/spark-1.4.0-bin-hadoop2.6")
    .setAppResource("/home/user/example-assembly-1.0.jar")
    .setMainClass("MySparkApp")
    .setMaster("local[*]")
    .launch()
  spark.waitFor()
}
Keep in mind that in YARN mode, you will also have to provide the path to your Hadoop configuration.
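One way to do that with SparkLauncher is to pass HADOOP_CONF_DIR through the environment map its constructor accepts; a sketch with example paths:
import java.util.{HashMap => JHashMap}
import org.apache.spark.launcher.SparkLauncher

object YarnLauncher extends App {
  // Environment for the launched spark-submit process; the path is an example.
  val env = new JHashMap[String, String]()
  env.put("HADOOP_CONF_DIR", "/etc/hadoop/conf")

  val process = new SparkLauncher(env)
    .setSparkHome("/home/user/spark-1.4.0-bin-hadoop2.6")
    .setAppResource("/home/user/example-assembly-1.0.jar")
    .setMainClass("MySparkApp")
    .setMaster("yarn")
    .setDeployMode("cluster")
    .launch()

  process.waitFor()
}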

Apache Ignite: What are the dependencies of IgniteHadoopIgfsSecondaryFileSystem?

I am trying to setup IGFS with Hadoop as the secondary storage. I have set my configuration as shown here but I keep getting NoClassDefFoundErrors. I have downloaded both binary distributions of Ignite and have tried building from source also but the dependencies are not included. hadoop-common-2.6.0.jar and ignite-hadoop-1.4.0.jar provided some of the dependencies but now I am getting a NoClassDefFoundError for org/apache/hadoop/mapred/JobConf which by my understanding is a deprecated class...
I have been following the instructions on the Apache Ignite website but this is as far as I've gotten.
What dependencies do I need for IgniteHadoopIgfsSecondaryFileSystem as the secondary storage?
It looks like the problem is that the Ignite node does not have the Hadoop libraries on its classpath. To fix that, please try the following:
1) Use the "Hadoop Accelerator" edition of the Ignite distribution (use -Dignite.edition=hadoop if you're building the distribution yourself).
2) Set the HADOOP_HOME environment variable for the Ignite process if you're using the Apache Hadoop distribution; if you use another distribution (HDP, Cloudera, BigTop, etc.), make sure the /etc/default/hadoop file exists and has appropriate contents.
Alternatively, you can manually add the necessary Hadoop dependencies to the Ignite node classpath: these are the dependencies with groupId "org.apache.hadoop" listed in modules/hadoop/pom.xml. Currently they are:
hadoop-annotations
hadoop-auth
hadoop-common
hadoop-hdfs
hadoop-mapreduce-client-common
hadoop-mapreduce-client-core
If you don't want to deal with dependency management yourself - which is really hard to do manually - I'd suggest you look at projects that provide orchestration and deployment services for software stacks. Check Apache Bigtop (bigtop.apache.org), which provides pre-cut Linux packages for Apache Ignite, Hadoop, HDFS and pretty much anything else in this space. You can grab the latest nightly packages from our CI at http://ci.bigtop.apache.org/view/Packages/job/Bigtop-trunk-packages

Setting Spark Classpath on Amazon EMR

I am trying to run some simple jobs on EMR (AMI 3.6) with Hadoop 2.4 and Spark 1.3.1. I have installed Spark manually without a bootstrap script. Currently I am trying to read and process data from S3 but it seems like I am missing an endless number of jars on my classpath.
I'm running commands in spark-shell, starting the shell with:
spark-shell --jars jar1.jar,jar2.jar...
Commands run on the shell:
val lines = sc.textFile("s3://folder/file.gz")
lines.collect()
The errors always look something like "Class xyz not found". After I find the needed jar and add it to the classpath, I get the error again, but with a different class name in the message.
Is there a set of jars that are needed for working with (compressed and uncompressed) S3 files?
I was able to figure out the jars needed for my classpath by following the logic in the AWS GitHub repo https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark.
The install-spark and install-spark-script.py files contain logic for copying jars into a new 'classpath' directory used by the SPARK_CLASSPATH variable (spark-env.sh).
The jars I was personally missing were located in /usr/share/aws/emr/emrfs/lib/ and /usr/share/aws/emr/lib/
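For example, something along these lines should add them all at once (assuming those directories exist on the node where you start the shell):
spark-shell --jars $(echo /usr/share/aws/emr/emrfs/lib/*.jar /usr/share/aws/emr/lib/*.jar | tr ' ' ',')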
It seems that you have not imported the proper libraries from within the spark-shell.
To do so :
import path.to.Class
or, more likely, if you want to import the RDD class, say:
import org.apache.spark.rdd.RDD

What Jars does Spark standalone have access to and when is it necessary to provide Jars through the SparkContext constructor?

I am using Spark Streaming to connect to the Twitter sample API and retrieve the text of the tweets I get. The SparkContext is running standalone on my local machine.
JavaStreamingContext ssc = new JavaStreamingContext(
    "spark://[my network IP address]:7077", "StreamingTest",
    new Duration(1000), System.getenv("SPARK_PREFIX"), new String[]{...});
I have all of the jars I need to compile and run the code locally, but when I call .forEachRDD(new Function2<T, Time, Void>(){...}) on a JavaReceiverInputDStream that is derived from my streaming context, I get a
java.lang.ClassNotFoundException: [my package].[my class]$1$1
which refers to the anonymous class provided to .forEachRDD.
I get around this issue by packaging the project into a jar and passing that as an argument to the JavaStreamingContext constructor, but this seems odd for a few reasons:
Spark does not complain about other jars that I import into the project such as Twitter4J (added as a Maven dependency) so it must have inherent access to some jars,
Packaging the project so that it can pass itself to Spark seems like a bit of a workaround - there must be a more elegant solution,
A new copy of my jar is created in the Spark directory each time I run my code.
How can I determine what jars the Spark cluster has access to and when is it necessary/good practice to provide jars directly to the SparkContext constructor?
Thanks.
If your project depends on third-party jars, you need to include them as a comma-separated list when you submit the program to the cluster. You also need to bundle your own code into a jar file before submitting it.
Assume your project structure is as below.
simpleapp
  - src/main/java
    - org.apache.spark.examples
      - SimpleApp.java
  - lib
    - dependent.jars (you can put all dependent jars inside lib directory)
  - target
    - simpleapp.jar (after compiling your source)
So you can use the command below.
spark-submit --jars $(echo lib/*.jar | tr ' ' ',' ) --class org.apache.spark.examples.SimpleApp --master local[2] target/simpleapp.jar
Furthermore, you can see the jar distribution in the Spark web console: go to your application -> Environment.
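As for providing jars programmatically rather than on the command line, the equivalent is to set them on the SparkConf before creating the streaming context; a minimal sketch in Scala (the Java API is analogous; the master URL and paths are only examples):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Duration, StreamingContext}

// Jars listed here are shipped to the executors much like --jars does;
// the paths are examples and must exist on the driver machine.
val conf = new SparkConf()
  .setMaster("spark://192.168.1.10:7077") // example master URL
  .setAppName("StreamingTest")
  .setJars(Seq("target/simpleapp.jar", "lib/dependent.jar"))

val ssc = new StreamingContext(conf, Duration(1000))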
