Setting Spark Classpath on Amazon EMR

I am trying to run some simple jobs on EMR (AMI 3.6) with Hadoop 2.4 and Spark 1.3.1. I have installed Spark manually without a bootstrap script. Currently I am trying to read and process data from S3 but it seems like I am missing an endless number of jars on my classpath.
I am running commands in spark-shell, which I start with:
spark-shell --jars jar1.jar,jar2.jar...
Commands run on the shell:
val lines = sc.textFile("s3://folder/file.gz")
lines.collect()
The errors always look something like "Class xyz not found". After I find the needed jar and add it to the classpath, I get the same error again, just with a different class name.
Is there a set of jars that are needed for working with (compressed and uncompressed) S3 files?

I was able to figure out the jars needed for my classpath by following the logic in the AWS GitHub repo https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark.
The install-spark and install-spark-script.py files contain the logic for copying jars into a new 'classpath' directory that the SPARK_CLASSPATH variable (set in spark-env.sh) points to.
The jars I was personally missing were located in /usr/share/aws/emr/emrfs/lib/ and /usr/share/aws/emr/lib/.
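For example, one way to pick these up (a sketch; the directories come from above, but exact jar names and contents vary by AMI and EMR release) is to extend SPARK_CLASSPATH in spark-env.sh, or to pass the jars to the shell directly:
# in $SPARK_HOME/conf/spark-env.sh (sketch)
export SPARK_CLASSPATH="$SPARK_CLASSPATH:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*"
# or pass the jars explicitly when starting the shell
spark-shell --jars $(echo /usr/share/aws/emr/emrfs/lib/*.jar /usr/share/aws/emr/lib/*.jar | tr ' ' ',')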

It seems that you have not imported the proper libraries from within the spark-shell.
To do so:
import path.to.Class
or, more likely, if you want to import the RDD class, for example:
import org.apache.spark.rdd.RDD

Related

How to install a jar in databricks using ADF

We are able to install the jar file using the UI method on a particular cluster, but our requirement is to install it on all the on-demand clusters in the workspace.
We are using the below shell script to download the jar file to DBFS. We are not sure how we can reference/install this jar on all clusters using a global init script:
curl https://repo1.maven.org/maven2/com/databricks/spark-xml_2.12/0.12.0/spark-xml_2.12-0.12.0.jar >/dbfs/FileStore/jars/maven/com/databricks/spark_xml_2_12_0_12_0.jar
Any help would be really appreciated!!
There is an alternative solution for adding a jar library to the job cluster that is called from Azure Data Factory while running the job.
In ADF, while calling the notebook, we have the option to include the jar directory in DBFS, or we can give the Maven coordinates.
[Screenshot: ADF settings]
In the global init script you can simply download this file into the /databricks/jars/ directory; it will then be picked up by the cluster.
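For example, a minimal global init script along those lines might look like this (a sketch; the jar URL is taken from the question and the target filename is arbitrary):
#!/bin/bash
# sketch of a global init script: place the jar on every node's local classpath
curl -s https://repo1.maven.org/maven2/com/databricks/spark-xml_2.12/0.12.0/spark-xml_2.12-0.12.0.jar \
  -o /databricks/jars/spark-xml_2.12-0.12.0.jar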

Running Spark from a local IDE

I've been spending some time banging my head over trying to run a complex spark application locally in order to test quicker (without having to package and deploy to a cluster).
Some context:
This Spark application interfaces with the DataStax Enterprise version of Cassandra and its distributed file system, so it needs some explicit jars to be provided (not available in Maven)
These jars are available on my local machine, and to "cheese" this, I tried placing them in SPARK_HOME/jars so they would be automatically added to the classpath
I tried to do something similar with the required configuration settings by putting them in spark-defaults.conf under SPARK_HOME/conf
When building this application, we do not build an uber jar, but rather do a spark-submit on the server using --jars
The problem I'm facing is that when I run the Spark application through my IDE, it doesn't seem to pick up any of these additional items from the SPARK_HOME directory (config or jars). I spent a few hours trying to get the config items to work and ended up setting them as System property values in my test case before starting the Spark session so that Spark would pick them up, so the configuration settings can be set aside here.
However, I do not know how to reproduce this for the vendor-specific jar files. Is there an easy way I can emulate the --jars behavior that spark-submit provides and somehow set up my Spark session with these jars? Note: I am using the following command in my code to start a Spark session:
SparkSession.builder().config(conf).getOrCreate()
Additional information, in case it helps:
The Spark version I have locally in SPARK_HOME is the same version my code compiles against using Maven.
I asked another question similar to this related to configs: Loading Spark Config for testing Spark Applications
When I print the SPARK_HOME environment variable in my application, I get the correct SPARK_HOME value, so I'm not sure why neither the configs nor the jar files are being picked up from there. Is it possible that when running the application from my IDE, it's not picking up the SPARK_HOME environment variable and is using all defaults?
You can make use of .config(key, value) while building the SparkSession by passing "spark.jars" as the key and a comma-separated list of paths to the jars, like so:
SparkSession.builder().config("spark.jars", "/path/jar1.jar,/path/jar2.jar").config(conf).getOrCreate()
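For comparison, this mirrors what the server-side spark-submit --jars invocation mentioned in the question does (the class name and jar paths below are hypothetical):
spark-submit --class com.example.MyApp --jars /path/to/dse-driver.jar,/path/to/dsefs-client.jar target/my-app.jar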

Shouldn't Oozie/Sqoop jar location be configured during package installation?

I'm using HDP 2.4 in CentOS 6.7.
I have created the cluster with Ambari, so Oozie was installed and configured by Ambari.
I got two errors while running Oozie/Sqoop related to jar file location. The first concerned postgresql-jdbc.jar, since the Sqoop job is incrementally importing from Postgres. I added the postgresql-jdbc.jar file to HDFS and pointed to it in workflow.xml:
<file>/user/hdfs/sqoop/postgresql-jdbc.jar</file>
It solved the problem. But the second error seems to concern kite-data-mapreduce.jar. However, doing the same for this file:
<file>/user/hdfs/sqoop/kite-data-mapreduce.jar</file>
does not seem to solve the problem:
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SqoopMain], main() threw exception, org/kitesdk/data/DatasetNotFoundException
java.lang.NoClassDefFoundError: org/kitesdk/data/DatasetNotFoundException
It seems strange that this is not automatically configured by Ambari and that we have to copy jar files into HDFS as we start getting errors.
Is this the correct methodology or did I miss some configuration step?
This is happening due to missing jars in the classpath. I would suggest setting the property oozie.use.system.libpath=true in the job.properties file; all the Sqoop-related jars will then be added to the classpath automatically from /user/oozie/share/lib/lib_<timestamp>/sqoop/*.jar. After that, add only the custom jars you need to the lib directory of the workflow application path.
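In other words, job.properties would contain something like the following (the NameNode host and workflow path are hypothetical):
# job.properties (sketch)
nameNode=hdfs://namenode-host:8020
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/user/hdfs/sqoop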

Hadoop job DocumentDB dependency jar file

I have a Hadoop job which gets its input from Azure DocumentDB. I have put the DocumentDB jar dependency files under a directory called 'lib'. However, when I run the job, it gives me a ClassNotFoundException for one of the classes in the jar file. I also tried adding the jar files using the -libjars option, but that didn't work either. Does anyone have any idea what could be wrong?
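For reference, the -libjars usage being described normally looks like this (the jar, class, and path names are hypothetical); note that -libjars is only honored when the driver parses generic options via ToolRunner/GenericOptionsParser:
hadoop jar my-job.jar com.example.DocumentDBJobDriver -libjars lib/azure-documentdb.jar,lib/azure-documentdb-hadoop.jar /input /output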

What Jars does Spark standalone have access to and when is it necessary to provide Jars through the SparkContext constructor?

I am using Spark Streaming to connect to the Twitter sample API and I am retrieving the text of the tweets I receive. The SparkContext is running standalone on my local machine:
JavaStreamingContext ssc = new JavaStreamingContext(
        "spark://[my network IP address]:7077", "StreamingTest",
        new Duration(1000), System.getenv("SPARK_PREFIX"), new String[]{...});
I have all of the jars I need to compile and run the code locally, but when I call .forEachRDD(new Function2<T, Time, Void>(){...}) on a JavaReceiverInputDStream derived from my streaming context, I get a
java.lang.ClassNotFoundException: [my package].[my class]$1$1
which refers to the anonymous class provided to .forEachRDD.
I get around this issue by packaging the project in a jar and passing that as an argument to the JavaStreamingContext constructor, but this seems odd for a few reasons:
Spark does not complain about other jars that I import into the project, such as Twitter4J (added as a Maven dependency), so it must have inherent access to some jars,
To package the project so that it can pass itself to Spark seems like a bit of a workaround; there must be a more elegant solution,
A new copy of my jar is created in the Spark directory each time I run my code.
How can I determine what jars the Spark cluster has access to and when is it necessary/good practice to provide jars directly to the SparkContext constructor?
Thanks.
If your project has a dependency on third-party jars, you need to include them as a comma-separated list when you submit the program to the cluster. You also need to bundle your source into a jar file before submitting to the cluster.
Assume your project structure is as below.
simpleapp
- src/main/java
  - org.apache.spark.examples
    - SimpleApp.java
- lib
  - dependent jars (you can put all dependent jars inside the lib directory)
- target
  - simpleapp.jar (after compiling your source)
So you can use the command below.
spark-submit --jars $(echo lib/*.jar | tr ' ' ',' ) --class org.apache.spark.examples.SimpleApp --master local[2] target/simpleapp.jar
Further, you can see the jar distribution using the Spark web console: go to your application -> Environment.
