Where is Spark loading its jars from? - spring

When a jar is specified in "spark.jars" and the application runs against a standalone Spark cluster without spark-submit, where is the jar loaded from?
I have a Spring application that performs some spark operations on a Spark standalone running in Docker.
My application relies on various libraries such as MySQL JDBC, ElasticSearch, etc., so it fails when running on the cluster, which doesn't have them.
I assembled my jar with all its dependencies and moved it to the /jars directory in Docker. But still no luck.
13:28:42.577 [Executor task launch worker-0] INFO org.apache.spark.executor.Executor - Fetching spark:// with timestamp 1499088505128
13:28:42.614 [dispatcher-event-loop-0] INFO org.apache.spark.executor.Executor - Executor is trying to kill task 0.3 in stage 1.0 (TID 7)
13:28:42.698 [Executor task launch worker-0] DEBUG org.apache.spark.network.client.TransportClient - Sending stream request for /jars/xdf-1.0.jar to /
13:28:42.741 [shuffle-client-7-1] DEBUG org.apache.spark.rpc.netty.NettyRpcEnv - Error downloading stream /jars/xdf-1.0.jar.
java.lang.RuntimeException: Stream '/jars/xdf-1.0.jar' was not found.
Now I noticed that it's looking for the jar on the driver host but I don't understand where it's trying to deploy it from.
Does anyone have an idea where it's looking for that jar?

I figured it out. The jars are loaded from the driver node.
So I didn't need to move my jar to the Spark nodes; I just had to set the correct path to the dependency jar.
So this solved it:

If you're essentially running a standalone application in local mode, you will need to provide all jars on your own, as opposed to having spark-submit stage the Spark runtime for you. Assuming that you're using a build system such as Maven or Gradle, you will need to package all transitive dependencies with your application and remove any "provided" scope declarations.
The easiest approach in this case is to use the assembly or maven-shade plugin to package a fat jar and then run that.
If you're running in cluster mode, you can programmatically submit your application using SparkLauncher. Here's an example in Scala:
import org.apache.spark.launcher.SparkLauncher

object Launcher extends App {
  // Configure the launcher (application jar, main class, master) before calling launch().
  val spark = new SparkLauncher()
}
Keep in mind that in YARN mode, you will also have to provide the path to your Hadoop configuration.


Running Spark from a local IDE

I've been spending some time banging my head over trying to run a complex spark application locally in order to test quicker (without having to package and deploy to a cluster).
Some context:
- This Spark application interfaces with the DataStax Enterprise version of Cassandra and their distributed file system, so it needs some explicit jars to be provided (not available in Maven).
- These jars are available on my local machine, and to "cheese" this, I tried placing them in SPARK_HOME/jars so they would be automatically added to the classpath.
- I tried to do something similar with the required configuration settings by putting them in spark-defaults.conf under SPARK_HOME/conf.
- When building this application, we do not build an uber jar, but rather do a spark-submit on the server using --jars.
The problem I'm facing is that when I run the Spark application through my IDE, it doesn't seem to pick up any of these additional items from the SPARK_HOME directory (config or jars). I spent a few hours trying to get the config items to work and ended up setting them as System.property values in my test case before starting the Spark session so that Spark would pick them up, so the configuration settings can be ignored.
However, I do not know how to reproduce this for the vendor-specific jar files. Is there an easy way to emulate the --jars behavior of spark-submit and somehow set up my Spark session with this jar value? Note: I am using the following command in my code to start a Spark session:
Additional information, in case it helps:
The Spark version I have locally in SPARK_HOME is the same version that my code is compiling with using Maven.
I asked another question similar to this related to configs: Loading Spark Config for testing Spark Applications
When I print the SPARK_HOME environment variable in my application, I get the correct SPARK_HOME value, so I'm not sure why neither the configs nor the jar files are being picked up from there. Is it possible that when running the application from my IDE, it's not picking up the SPARK_HOME environment variable and is using all defaults?
You can use .config(key, value) while building the SparkSession, passing "spark.jars" as the key and a comma-separated list of jar paths (with no spaces between entries) as the value, like so:
SparkSession.builder().config("spark.jars", "/path/jar1.jar,/path/jar2.jar").config(conf).getOrCreate()
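When there are many vendor jars, the comma-separated value can be assembled programmatically rather than typed by hand. The following is a minimal sketch using only the Java standard library. The class name SparkJarsValue and its methods are hypothetical helpers, not part of Spark. It joins entries without surrounding whitespace, on the assumption that some Spark versions may not trim spaces around list entries.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper (not part of Spark): builds a value for "spark.jars".
final class SparkJarsValue {

    // Joins jar paths with commas and no surrounding whitespace,
    // skipping blank entries and anything that does not end in ".jar".
    static String join(List<String> paths) {
        return paths.stream()
                .map(String::trim)
                .filter(p -> p.endsWith(".jar"))
                .collect(Collectors.joining(","));
    }

    // Collects the regular files directly under dir (sorted for a stable
    // order) and joins the ones that are jars.
    static String fromDirectory(Path dir) {
        try (Stream<Path> entries = Files.list(dir)) {
            List<String> files = entries
                    .filter(Files::isRegularFile)
                    .map(p -> p.toAbsolutePath().toString())
                    .sorted()
                    .collect(Collectors.toList());
            return join(files);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The result of SparkJarsValue.fromDirectory(...) could then be passed as the value for "spark.jars" in the builder call above. The directory to scan is up to you, for example wherever the vendor jars live on the machine running the IDE.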

Can I run a JAR file that includes another JAR file under the lib folder in HDInsight?

Is it possible to run a JAR file in HDInsight which includes another JAR file under the lib folder?
JAR file
├ .class file
└ lib/dependency.jar // library (jar file)
Thank you!
On HDInsight, you should be able to run a Java MapReduce JAR that depends on another JAR. There are a few ways to do this, but copying the second JAR into the lib folder on the headnode is typically not one of them.
The reasons: depending on where the dependency is used, you may need to copy the JAR into the lib folder of every worker node and headnode, which becomes tedious. The change is also erased when a node gets re-imaged by Azure, so it is not a supported approach.
Now, there are two types of dependencies –
1. MapReduce driver class has dependency on another external JAR
2. Map or reduce task has a dependency on another JAR, where the Map or Reduce function calls an API on the external JAR.
Scenario #1 (MapReduce driver class depends on another JAR):
You can use one of the following options:
a. Copy your dependency JAR to a local folder (like d:\test on Windows HDI) on the headnode, then use RDP to append this path to the HADOOP_CLASSPATH environment variable on the headnode. This is suitable for dev/test, where jobs run directly from the headnode, but it won’t work with remote job submissions, so it is not suitable for production scenarios.
b. Use a ‘fat’ or ‘uber’ JAR that includes all the dependent JARs inside your JAR. You can build one with the Maven Shade plugin.
Scenario #2 (Map or Reduce function calls an API on an external JAR):
Use the -libjars option.
If you want to run the MapReduce JAR from the Hadoop command line:
a. Copy the MapReduce JAR to a local path (like d:\test).
b. Copy the dependent JAR to WASB.
Example of running a MapReduce JAR with a dependency:
hadoop jar D:\Test\BlobCount-0.0.1-SNAPSHOT.jar css.ms.BlobCount.BlobCounter -libjars wasb://mycontainername@azimwasb.blob.core.windows.net/mrdata/jars/microsoft-windowsazure-storage-sdk-0.6.0.jar -DStorageAccount=%StorageAccount% -DStorageKey=%StorageKey% -DContainer=%Container% /mcdpoc/mrinput /mcdpoc/mroutput
The example uses HDInsight on Windows; you can use a similar approach on HDInsight on Linux as well.
Using PowerShell or the .NET SDK (remote job submission): with PowerShell, you can use the -LibJars parameter to refer to dependent JARs.
You can also review the HDInsight documentation, which has various examples of using PowerShell, SSH, etc.
I hope it helps!

Setting Spark Classpath on Amazon EMR

I am trying to run some simple jobs on EMR (AMI 3.6) with Hadoop 2.4 and Spark 1.3.1. I have installed Spark manually without a bootstrap script. Currently I am trying to read and process data from S3 but it seems like I am missing an endless number of jars on my classpath.
Running commands on spark-shell. Starting shell using:
spark-shell --jars jar1.jar,jar2.jar...
Commands run on the shell:
val lines = sc.textFile("s3://folder/file.gz")
The errors always look something like: "Class xyz not found". After I find the needed jar and add it to the classpath, I will get this error again but with a different class name in the error message.
Is there a set of jars that are needed for working with (compressed and uncompressed) S3 files?
I was able to figure out the jars needed for my classpath by following the logic in the AWS GitHub repo https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark.
The install-spark and install-spark-script.py files contain logic for copying jars into a new 'classpath' directory used by the SPARK_CLASSPATH variable (spark-env.sh).
The jars I was personally missing were located in /usr/share/aws/emr/emrfs/lib/ and /usr/share/aws/emr/lib/
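When chasing "Class xyz not found" errors one jar at a time, it can help to check whether a given class is visible on the current classpath and which jar it was loaded from. Below is a minimal sketch using only the Java standard library. The class name ClassLocator and the marker strings it returns are hypothetical, not part of Spark or EMR. It only inspects the JVM it runs in (for example the driver), not the executors.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper (not part of Spark or EMR).
final class ClassLocator {

    // Returns the location a class was loaded from, or a marker string
    // when the class is missing or has no code source (core JDK classes).
    static String locate(String className) {
        try {
            // initialize=false: look the class up without running its static initializers.
            Class<?> cls = Class.forName(className, false, ClassLocator.class.getClassLoader());
            CodeSource source = cls.getProtectionDomain().getCodeSource();
            return source == null ? "<JDK/bootstrap>" : String.valueOf(source.getLocation());
        } catch (ClassNotFoundException | LinkageError e) {
            return "<not found>";
        }
    }
}
```

The same two calls (Class.forName and getProtectionDomain().getCodeSource()) can also be typed directly into spark-shell to see whether a jar passed via --jars actually provides the class named in the error.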
It seems that you have not imported the proper libraries from within the spark-shell.
To do so :
import path.to.Class
or, more likely, if you want to import the RDD class, for example:
import org.apache.spark.rdd.RDD

What Jars does Spark standalone have access to and when is it necessary to provide Jars through the SparkContext constructor?

I am using Spark Streaming to connect to the Twitter sample api and I am retrieving the text of what tweets I get. The SparkContext is running standalone on my local machine.
JavaStreamingContext ssc = new JavaStreamingContext(
"spark://[my network IP address]:7077", "StreamingTest",
new Duration(1000), System.getenv("SPARK_PREFIX"), new String[]{...});
I have all of the jars I need to compile and run the code locally, but when I call .foreachRDD(new Function2<T, Time, Void>(){...}) on a JavaReceiverInputDStream that is derived from my streaming context, I get a
java.lang.ClassNotFoundException: [my package].[my class]$1$1
which refers to the anonymous class provided to .foreachRDD.
I get around this issue by packaging the project in a jar and giving that as an argument to the JavaStreamingContext constructor, but this seems odd for a few reasons:
- Spark does not complain about other jars that I import into the project, such as Twitter4J (added as a Maven dependency), so it must have inherent access to some jars.
- Packaging the project so that it can pass itself to Spark seems like a workaround; there must be a more elegant solution.
- A new copy of my jar is created in the Spark directory each time I run my code.
How can I determine what jars the Spark cluster has access to and when is it necessary/good practice to provide jars directly to the SparkContext constructor?
If your project depends on third-party jars, you need to include them as a comma-separated list when you submit the program to the cluster. You must also bundle your source into a jar file before submitting it to a cluster.
Assume your project structure is as follows:
- src/main/java
  - org.apache.spark.examples
- lib
  - dependent.jars (you can put all dependent jars inside the lib directory)
- target
  - simpleapp.jar (after compiling your source)
Then you can use the command below.
spark-submit --jars $(echo lib/*.jar | tr ' ' ',' ) --class org.apache.spark.examples.SimpleApp --master local[2] target/simpleapp.jar
You can also see which jars were distributed in the Spark web console: go to your program -> Environment.

Configuring hadoop 2.5 in eclipse

I'm trying to configure MapReduce in Eclipse Indigo with Hadoop version 2.5. I downloaded the Hadoop 2.5 source and added all the libraries to the Eclipse project.
When I try to run the project, it shows the following error:
The Java path and classpath were set properly. Please help me!
Is configuring Cygwin SSH mandatory to use MapReduce in Eclipse?
I am not sure what you are trying to do here. If you are running the application in eclipse as a regular traditional java program the following may help.
Hadoop MapReduce programs are normally run with the hadoop jar command, usually after copying the .jar file to the cluster over SFTP (e.g. FileZilla) and connecting to the cluster over SSH (e.g. PuTTY).
Usage: hadoop jar <jar> [mainClass] args…
If you want to debug the application use java.util.logging.Logger.
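As a minimal sketch of that suggestion, the class below logs from a stand-in for per-record logic using java.util.logging.Logger. The class name WordLengthStep and its process method are hypothetical and only illustrate where log calls might go; they are not Hadoop APIs.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical example class; the names are illustrative only.
final class WordLengthStep {
    private static final Logger LOG = Logger.getLogger(WordLengthStep.class.getName());

    // Stand-in for per-record logic such as the body of a map() call.
    static int process(String word) {
        // FINE is below the default INFO threshold, so this stays hidden
        // until the logging level is lowered.
        LOG.log(Level.FINE, "processing word: {0}", word);
        int length = word.trim().length();
        if (length == 0) {
            LOG.warning("empty word encountered");
        }
        return length;
    }
}
```

With the default java.util.logging configuration, only the warning is printed to the console. Where task logs end up on a cluster depends on how logging is configured there.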
