Adding JDBC driver to Spark on EMR - jdbc

I'm trying to add a JDBC driver to a Spark cluster that is executing on top of Amazon EMR, but I keep getting the
java.sql.SQLException: No suitable driver found
exception. I tried the following things:
Using addJar to add the driver JAR explicitly from the code.
Using the spark.executor.extraClassPath and spark.driver.extraClassPath parameters.
Using spark.driver.userClassPathFirst=true; with this option I got a different error because of a mix of dependencies with Spark. In any case, this option seems too aggressive if I just want to add a single JAR.
Could you please help me with that? How can I introduce the driver to the Spark cluster easily?
Thanks,
David
Source code of the application
val properties = new Properties()
properties.put("ssl", "***")
properties.put("user", "***")
properties.put("password", "***")
properties.put("account", "***")
properties.put("db", "***")
properties.put("schema", "***")
properties.put("driver", "***")
val conf = new SparkConf().setAppName("***")
.setMaster("yarn-cluster")
.setJars(JavaSparkContext.jarOfClass(this.getClass()))
val sc = new SparkContext(conf)
sc.addJar(args(0))
val sqlContext = new SQLContext(sc)
var df = sqlContext.read.jdbc(connectStr, "***", properties = properties)
df = df.select( Constants.***,
Constants.***,
Constants.***,
Constants.***,
Constants.***,
Constants.***,
Constants.***,
Constants.***,
Constants.***)
// Additional actions on df

I had the same problem. What ended up working for me was the --driver-class-path parameter used with spark-submit.
The main thing is to include the entire existing Spark class path in --driver-class-path.
Here are my steps:
I got the default driver class path by reading the value of the "spark.driver.extraClassPath" property from the Spark History Server under "Environment".
Copied the MySQL JAR file to each node in the EMR cluster.
Put the MySQL JAR path at the front of the --driver-class-path argument to the spark-submit command and appended the value of "spark.driver.extraClassPath" to it.
My driver class path ended up looking like this:
--driver-class-path /home/hadoop/jars/mysql-connector-java-5.1.35.jar:/etc/hadoop/conf:/usr/lib/hadoop/:/usr/lib/hadoop-hdfs/:/usr/lib/hadoop-mapreduce/:/usr/lib/hadoop-yarn/:/usr/lib/hadoop-lzo/lib/:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/:/usr/share/aws/emr/emrfs/auxlib/*
This worked with EMR 4.1 using Java with Spark 1.5.0.
I had already added the MySQL JAR as a dependency in the Maven pom.xml
You may also want to look at this answer as it seems like a cleaner solution. I haven't tried it myself.

With EMR 5.2 I add any new jars to the original driver classpath with:
export MY_DRIVER_CLASS_PATH=my_jdbc_jar.jar:some_other_jar.jar:$(grep spark.driver.extraClassPath /etc/spark/conf/spark-defaults.conf | awk '{print $2}')
and after that
spark-submit --driver-class-path $MY_DRIVER_CLASS_PATH
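For reference, a complete submission in this style might look like the sketch below; the application JAR, main class, and deploy mode are hypothetical placeholders:
spark-submit \
  --deploy-mode cluster \
  --driver-class-path "$MY_DRIVER_CLASS_PATH" \
  --class com.example.MyJob \
  s3://my-bucket/jars/my-job.jar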

Following a similar pattern to this answer quoted above, this is how I automated installing a JDBC driver on EMR clusters. (Full automation is useful for transient clusters started and terminated per job.)
Use a bootstrap action to install the JDBC driver on all EMR cluster nodes. Your bootstrap action will be a one-line shell script, stored in S3, that looks like
aws s3 cp s3://.../your-jdbc-driver.jar /home/hadoop
Add a step to your EMR cluster, before running your actual Spark job, to modify /etc/spark/conf/spark-defaults.conf.
This will be another one-line shell script, stored in S3:
sudo sed -e 's,\(^spark.driver.extraClassPath.*$\),\1:/home/hadoop/your-jdbc-driver.jar,' -i /etc/spark/conf/spark-defaults.conf
The step itself will look like
{
"name": "add JDBC driver to classpath",
"jar": "s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar",
"args": ["s3://...bucket.../set-spark-driver-classpath.sh"]
}
This will add your JDBC driver to spark.driver.extraClassPath
Explanation
You can't do both as bootstrap actions, because Spark won't be installed yet, so there is no config file to update.
You can't install the JDBC driver as a step, because you need the JDBC driver installed on the same path on all cluster nodes. In YARN cluster mode, the driver process does not necessarily run on the master node.
The configuration only needs to be updated on the master node, though, as the config is packed up and shipped to whatever node ends up running the driver.
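Putting the pieces together, a cluster launch that wires in both the bootstrap action and the step might look roughly like this; bucket names, script names, and instance settings are hypothetical:
aws emr create-cluster \
  --name "transient-spark-job" \
  --release-label emr-5.2.0 \
  --applications Name=Spark \
  --instance-type m4.large \
  --instance-count 3 \
  --use-default-roles \
  --bootstrap-actions Path=s3://my-bucket/copy-jdbc-driver.sh \
  --steps Type=CUSTOM_JAR,Name=AddJdbcDriver,Jar=s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[s3://my-bucket/set-spark-driver-classpath.sh]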

If you're using Python on your EMR cluster, there's no need to specify the JAR while creating the cluster. You can add the package while creating your SparkSession.
# Note: setting the same config key twice overwrites it, so list all packages comma-separated in one value
spark = SparkSession \
    .builder \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.0,mysql:mysql-connector-java:8.0.17") \
    .getOrCreate()
And then when you make your query mention the driver like this:
form_df = spark.read.format("jdbc"). \
    option("url", "jdbc:mysql://yourdatabase"). \
    option("driver", "com.mysql.jdbc.Driver"). \
    option("dbtable", "yourtable"). \
    load()
This way the package is included in the SparkSession, as it is pulled from a Maven repository. I hope this helps someone in the same situation I once was.

Related

How to use JDBC to read datasets from Oracle?

What is really executed, and where, when using JDBC drivers to connect to e.g. Oracle?
I have started a Spark master as
spark-class.cmd org.apache.spark.deploy.master.Master
and a worker like so
spark-class.cmd org.apache.spark.deploy.worker.Worker spark://myip:7077
and spark shell as
spark-shell --master spark://myip:7077
in spark-defaults.conf I have
spark.driver.extraClassPath = C:/jdbcDrivers/ojdbc8.jar
spark.executor.extraClassPath = C:/jdbcDrivers/ojdbc8.jar
and in spark-env.sh I have
SPARK_CLASSPATH=C:/jdbcDrivers/ojdbc8.jar
I can now run queries against Oracle in the spark-shell:
val jdbcDF = spark.read.format("jdbc").option("url","jdbc:oracle:thin:#...
This works fine without separately adding the JDBC driver jar in the Scala shell.
When I start the master and worker in the same way, but create a Scala project in Eclipse and connect to the master as follows:
val sparkSession = SparkSession.builder.
master("spark://myip:7077")
.appName("SparkTestApp")
.config("spark.jars", "C:\\pathToJdbc\\ojdbc8.jar")
.getOrCreate()
then it fails if I don't explicitly add the JDBC jar in the Scala code.
How is the execution different? Why do I need to specify the JDBC jar in the code? What is the purpose of connecting to the master if it doesn't rely on the master and workers that were started?
If I use multiple workers with jdbc will they use only one connection or will they simultaneously read in parallel over several connections?
You are certainly using too many moving parts for such a small sample, and that is what got you confused.
The two lines, spark-class.cmd org.apache.spark.deploy.master.Master and spark-class.cmd org.apache.spark.deploy.worker.Worker spark://myip:7077, started a Spark Standalone cluster with one master and one worker. See Spark Standalone Mode.
In addition to running on the Mesos or YARN cluster managers, Spark also provides a simple standalone deploy mode. You can launch a standalone cluster either manually, by starting a master and workers by hand, or use our provided launch scripts. It is also possible to run these daemons on a single machine for testing.
You chose to start the Spark Standalone cluster manually (as described in Starting a Cluster Manually).
I doubt that spark-defaults.conf is used by the cluster at all. The file is there to configure your Spark applications that are spark-submitted to a cluster (as described in Dynamically Loading Spark Properties):
bin/spark-submit will also read configuration options from conf/spark-defaults.conf, in which each line consists of a key and a value separated by whitespace.
With that said, I think we can safely put Spark Standalone aside. It does not add much to the discussion (and does confuse a bit).
"Installing" JDBC Driver for Spark Application
In order to use a JDBC driver in your Spark application, you should spark-submit it with the --driver-class-path command-line option (or the spark.driver.extraClassPath property, as described in Runtime Environment):
spark.driver.extraClassPath Extra classpath entries to prepend to the classpath of the driver.
Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, please set this through the --driver-class-path command line option or in your default properties file.
I strongly recommend using spark-submit --driver-class-path.
$ ./bin/spark-submit --help
...
--driver-class-path Extra class path entries to pass to the driver. Note that
jars added with --jars are automatically included in the
classpath.
You can read my notes on how to use a JDBC driver with PostgreSQL in Working with Datasets from JDBC Data Sources (and PostgreSQL).
PROTIP Use SPARK_PRINT_LAUNCH_COMMAND=1 to check out the command line of spark-submit.
All above applies to spark-shell too (as it uses spark-submit under the covers).
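Purely as an illustration (the application class and JAR path are placeholders), a submission for the ojdbc8.jar setup above could look like this with spark-submit (spark-submit.cmd on Windows):
spark-submit \
  --master spark://myip:7077 \
  --driver-class-path C:/jdbcDrivers/ojdbc8.jar \
  --conf spark.executor.extraClassPath=C:/jdbcDrivers/ojdbc8.jar \
  --class SparkTestApp \
  C:/path/to/sparkTestApp.jar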

Spark with yarn cluster in docker

I have been working on my Spark project in standalone mode (in the Eclipse IDE). Here is some code I used (it works well).
val conf = new SparkConf()
  .setAppName("My Application")
  .setMaster("local[*]")
  .set("spark.executor.memory", "4g")
  .set("spark.driver.memory", "4g")

val spark = SparkSession.builder
  .config(conf)
  .appName("spark app")
  .config("spark.sql.warehouse.dir", "file:///.")
  .getOrCreate()
Until now, I have wrapped this application as a JAR and used it in another Java project.
Now I'm going to change it to YARN cluster mode for multiple users, so I installed a Hadoop cluster in Docker with this image (link).
I think the YARN configuration is already set up. Do I need to do anything more to run Spark on the YARN cluster?
How should I set SparkConf? How do I pass the Hadoop IP, my application JAR and the other information needed?
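As a rough sketch of what YARN cluster mode usually needs (paths, class name, and JAR name are hypothetical): point Spark at the cluster's Hadoop/YARN configuration via HADOOP_CONF_DIR and let spark-submit supply the master instead of hard-coding setMaster in the code:
# directory containing core-site.xml and yarn-site.xml copied from the cluster
export HADOOP_CONF_DIR=/path/to/hadoop/conf
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 4g \
  --driver-memory 4g \
  --class com.example.MyApp \
  /path/to/my-application.jar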

How to add a typesafe config file which is located on HDFS to spark-submit (cluster-mode)?

I have a Spark (Spark 1.5.2) application that streams data from Kafka to HDFS. My application contains two Typesafe config files to configure certain things like Kafka topic etc.
Now I want to run my application with spark-submit (cluster mode) in a cluster.
The jar file with all dependencies of my project is stored on HDFS.
As long as my config files are included in the jar file everything works fine. But this is impractical for testing purposes because I always have to rebuild the jar.
Therefore I excluded the config files from my project and added them via "driver-class-path". This worked in client mode, but if I move the config files to HDFS and run my application in cluster mode, it can't find the settings. Below you can find my spark-submit command:
/usr/local/spark/bin/spark-submit \
--total-executor-cores 10 \
--executor-memory 15g \
--verbose \
--deploy-mode cluster \
--class com.hdp.speedlayer.SpeedLayerApp \
--driver-class-path hdfs://iot-master:8020/user/spark/config \
--master spark://spark-master:6066 \
hdfs://iot-master:8020/user/spark/speed-layer-CONFIG.jar
I already tried it with the --files parameter, but that didn't work either. Does anybody know how I can fix this?
Update:
I did some further research and figured out that it could be related to the HDFS path. I changed the HDFS path to "hdfs:///iot-master:8020//user//spark//config", but unfortunately that didn't work either. Maybe this helps you.
Below you can also see the error I get when I run the driver program in cluster mode:
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: java.lang.ExceptionInInitializerError
at com.speedlayer.SpeedLayerApp.main(SpeedLayerApp.scala)
... 6 more
Caused by: com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'application'
at com.typesafe.config.impl.SimpleConfig.findKey(SimpleConfig.java:124)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:145)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:159)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:164)
...
Trying to achieve the same result I found out the following:
--files: only works with local files on the machine running the spark-submit command and converts to conf.addFile(), so HDFS files won't work unless you are able to run hdfs dfs -get <....> beforehand to retrieve the file. In my case I want to run it from Oozie, so I don't know which machine it's going to run on, and I don't want to add a copy-file action to my workflow.
The quote @Yuval_Itzchakov used refers to --jars, which only handles jars, since it converts to conf.addJar().
So as far as I know there is no straightforward way to load a configuration file from HDFS.
My approach was to pass the path to my app, read the configuration file, and merge it with the reference file:
import java.io.File
import java.net.URI
import com.typesafe.config.{Config, ConfigFactory}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

private val HDFS_IMPL_KEY = "fs.hdfs.impl"

// Copy the config file from HDFS to a local temp file, then parse it and
// merge it with the application's reference configuration.
def loadConf(pathToConf: String): Config = {
  val path = new Path(pathToConf)
  val confFile = File.createTempFile(path.getName, "tmp")
  confFile.deleteOnExit()
  getFileSystemByUri(path.toUri).copyToLocalFile(path, new Path(confFile.getAbsolutePath))
  ConfigFactory.load(ConfigFactory.parseFile(confFile))
}

def getFileSystemByUri(uri: URI): FileSystem = {
  val hdfsConf = new Configuration()
  hdfsConf.set(HDFS_IMPL_KEY, classOf[org.apache.hadoop.hdfs.DistributedFileSystem].getName)
  FileSystem.get(uri, hdfsConf)
}
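A hypothetical usage, assuming the HDFS path is passed as the first program argument and the file contains an 'application' block (the key names are illustrative, not from the original code):
// e.g. args(0) = "hdfs://iot-master:8020/user/spark/config/application.conf"
val config = loadConf(args(0))
val kafkaTopic = config.getString("application.kafka.topic")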
P.S. The error only means that ConfigFactory didn't find any configuration file, so it couldn't find the property you are looking for.
One option is to use the --files flag with the HDFS location and make sure you add it to your executor classpath using the spark.executor.extraClassPath flag with -Dconfig.file:
Spark uses the following URL scheme to allow different strategies for disseminating jars:
file: - Absolute paths and file:/ URIs are served by the driver's HTTP file server, and every executor pulls the file from the driver HTTP server.
hdfs:, http:, https:, ftp: - these pull down files and JARs from the URI as expected.
local: - a URI starting with local:/ is expected to exist as a local file on each worker node. This means that no network IO will be incurred, and works well for large files/JARs that are pushed to each worker, or shared via NFS, GlusterFS, etc.
Also, you can see it when looking at the help documentation for spark-submit:
--files FILES Comma-separated list of files to be placed in the working
directory of each executor.
Running with spark-submit:
/usr/local/spark/bin/spark-submit \
--total-executor-cores 10 \
--executor-memory 15g \
--conf "spark.executor.extraClassPath=-Dconfig.file=application.conf"
--verbose \
--deploy-mode cluster \
--class com.hdp.speedlayer.SpeedLayerApp \
--driver-class-path hdfs://iot-master:8020/user/spark/config \
--files hdfs:/path/to/conf \
--master spark://spark-master:6066 \
hdfs://iot-master:8020/user/spark/speed-layer-CONFIG.jar

How to access s3a:// files from Apache Spark?

Hadoop 2.6 doesn't support s3a out of the box, so I've tried a series of solutions and fixes, including:
deploy with hadoop-aws and aws-java-sdk => cannot read environment variable for credentials
add hadoop-aws into maven => various transitive dependency conflicts
Has anyone successfully made both work?
Having experienced first hand the difference between s3a and s3n (transferring 7.9GB of data over s3a took around 7 minutes, while the same 7.9GB over s3n took 73 minutes; us-east-1 to us-west-1 unfortunately in both cases, with Redshift and Lambda being us-east-1 at the time), this is a very important piece of the stack to get right, and it's worth the frustration.
Here are the key parts, as of December 2015:
Your Spark cluster will need a Hadoop version 2.x or greater. If you use the Spark EC2 setup scripts and maybe missed it, the switch for using something other than 1.0 is to specify --hadoop-major-version 2 (which uses CDH 4.2 as of this writing).
You'll need to include what may at first seem to be an out of date AWS SDK library (built in 2014 as version 1.7.4) for versions of Hadoop as late as 2.7.1 (stable): aws-java-sdk 1.7.4. As far as I can tell using this along with the specific AWS SDK JARs for 1.10.8 hasn't broken anything.
You'll also need the hadoop-aws 2.7.1 JAR on the classpath. This JAR contains the class org.apache.hadoop.fs.s3a.S3AFileSystem.
In spark.properties you probably want some settings that look like this:
spark.hadoop.fs.s3a.access.key=ACCESSKEY
spark.hadoop.fs.s3a.secret.key=SECRETKEY
If you are using Hadoop 2.7 with Spark, the AWS client uses V2 as the default auth signature, and all the newer AWS regions support only the V4 protocol. To use V4, pass these confs to spark-submit; the endpoint (format: s3.<region>.amazonaws.com) must also be specified.
--conf "spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true"
--conf "spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true"
I've detailed this list in more depth in a post I wrote as I worked my way through this process. In addition, I've covered all the exception cases I hit along the way, what I believe to be the cause of each, and how to fix them.
I'm writing this answer to access files with S3A from Spark 2.0.1 on Hadoop 2.7.3
Copy the AWS jars (hadoop-aws-2.7.3.jar and aws-java-sdk-1.7.4.jar), which ship with Hadoop by default,
Hint: if you're unsure of the jar locations, running the find command as a privileged user can be helpful:
find / -name hadoop-aws*.jar
find / -name aws-java-sdk*.jar
into the Spark classpath, which holds all the Spark jars.
Hint: we cannot point to the location directly (it must be in a property file), as I want to keep this answer generic across distributions and Linux flavors. The Spark classpath can be identified with the find command below:
find / -name spark-core*.jar
In spark-defaults.conf (usually located at /etc/spark/conf/spark-defaults.conf), set:
#make sure jars are added to CLASSPATH
spark.yarn.jars=file://{spark/home/dir}/jars/*.jar,file://{hadoop/install/dir}/share/hadoop/tools/lib/*.jar
spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access.key={s3a.access.key}
spark.hadoop.fs.s3a.secret.key={s3a.secret.key}
#you can also set the above 3 properties at the Hadoop level in core-site.xml by removing the spark.hadoop. prefix
In spark-submit, include the jars (aws-java-sdk and hadoop-aws) in --driver-class-path if needed; note that repeating --driver-class-path overrides the earlier value, so join the jars with a colon:
spark-submit --master yarn \
--driver-class-path {spark/jars/home/dir}/aws-java-sdk-1.7.4.jar:{spark/jars/home/dir}/hadoop-aws-2.7.3.jar \
other options
Note:
Make sure the Linux user has read privileges before running the find command, to prevent a Permission denied error.
I got it working using the Spark 1.4.1 prebuilt binary with hadoop 2.6
Make sure you set both spark.driver.extraClassPath and spark.executor.extraClassPath pointing to the two jars (hadoop-aws and aws-java-sdk)
If you run on a cluster, make sure your executors have access to the jar files on the cluster.
We're using Spark 1.6.1 with Mesos, and we were getting lots of issues writing to S3 from Spark. I give credit to cfeduke for the answer. The slight change I made was adding Maven coordinates to the spark.jars.packages setting in the spark-defaults.conf file. I tried with hadoop-aws:2.7.2 but was still getting lots of errors, so we went back to 2.7.1. Below are the changes in spark-defaults.conf that are working for us:
spark.jars.packages net.java.dev.jets3t:jets3t:0.9.0,com.google.guava:guava:16.0.1,com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.1
spark.hadoop.fs.s3a.access.key <MY ACCESS KEY>
spark.hadoop.fs.s3a.secret.key <MY SECRET KEY>
spark.hadoop.fs.s3a.fast.upload true
Thank you cfeduke for taking the time to write up your post. It was very helpful.
Here are the details as of October 2016, as presented at Spark Summit EU: Apache Spark and Object Stores.
Key points
The direct output committer is gone from Spark 2.0 due to risk/experience of data corruption.
There are some settings on the FileOutputCommitter to reduce renames, but not eliminate them
I'm working with some colleagues to do an O(1) committer, relying on Apache Dynamo to give us that consistency we need.
To use S3a, get your classpath right.
And be on Hadoop 2.7.z; 2.6.x had some problems that were addressed by HADOOP-11571.
There's a PR under SPARK-7481 to pull everything into a Spark distro you build yourself. Otherwise, ask whoever supplies your binaries to do the work.
Hadoop 2.8 is going to add major perf improvements HADOOP-11694.
Product placement: the read-performance side of HADOOP-11694 is included in HDP 2.5; the Spark and S3 documentation there might be of interest, especially the tuning options.
As you said, Hadoop 2.6 doesn't support s3a, and the latest Spark release, 1.6.1, doesn't support Hadoop 2.7, but Spark 2.0 definitely has no problem with Hadoop 2.7 and s3a.
For Spark 1.6.x, we made a dirty hack using the s3 driver from EMR... you can take a look at this doc: https://github.com/zalando/spark-appliance#emrfs-support
If you still want to try to use s3a in Spark 1.6.x, refer to the answer here: https://stackoverflow.com/a/37487407/5630352
Using Spark 1.4.1 pre-built with Hadoop 2.6, I am able to get s3a:// to work when deploying to a Spark Standalone cluster by adding the hadoop-aws and aws-java-sdk jar files from the Hadoop 2.7.1 distro (found under $HADOOP_HOME/share/hadoop/tools/lib of Hadoop 2.7.1) to my SPARK_CLASSPATH environment variable in my $SPARK_HOME/conf/spark-env.sh file.
You can also add the S3A dependencies to the classpath using spark-defaults.conf.
Example:
spark.driver.extraClassPath /usr/local/spark/jars/hadoop-aws-2.7.5.jar:/usr/local/spark/jars/aws-java-sdk-1.7.4.jar
spark.executor.extraClassPath /usr/local/spark/jars/hadoop-aws-2.7.5.jar:/usr/local/spark/jars/aws-java-sdk-1.7.4.jar
Or just:
spark.jars /usr/local/spark/jars/hadoop-aws-2.7.5.jar,/usr/local/spark/jars/aws-java-sdk-1.7.4.jar
Just make sure to match your AWS SDK version to the version of Hadoop. For more information about this, look at this answer: Unable to access S3 data using Spark 2.2
Here's a solution for pyspark (possibly with proxy):
import os

def _configure_s3_protocol(spark, proxy=props["proxy"]["host"], port=props["proxy"]["port"], endpoint=props["s3endpoint"]["irland"]):
"""
Configure access to the protocol s3
https://sparkour.urizone.net/recipes/using-s3/
AWS Regions and Endpoints
https://docs.aws.amazon.com/general/latest/gr/rande.html
"""
sc = spark.sparkContext
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", os.environ.get("AWS_ACCESS_KEY_ID"))
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", os.environ.get("AWS_SECRET_ACCESS_KEY"))
sc._jsc.hadoopConfiguration().set("fs.s3a.proxy.host", proxy)
sc._jsc.hadoopConfiguration().set("fs.s3a.proxy.port", port)
sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", endpoint)
return spark
Here is a Scala version that works fine with Spark 3.2.1 (pre-built) and Hadoop 3.3.1, accessing an S3 bucket from a non-AWS machine [typically a local setup on a developer machine].
sbt
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "3.2.1" % "provided",
"org.apache.spark" %% "spark-streaming" % "3.2.1" % "provided",
"org.apache.spark" %% "spark-sql" % "3.2.1" % "provided",
"org.apache.hadoop" % "hadoop-aws" % "3.3.1",
"org.apache.hadoop" % "hadoop-common" % "3.3.1" % "provided"
)
spark program
val spark = SparkSession
.builder()
.master("local")
.appName("Process parquet file")
.config("spark.hadoop.fs.s3a.path.style.access", true)
.config("spark.hadoop.fs.s3a.access.key", ACCESS_KEY)
.config("spark.hadoop.fs.s3a.secret.key", SECRET_KEY)
.config("spark.hadoop.fs.s3a.endpoint", ENDPOINT)
.config(
"spark.hadoop.fs.s3a.impl",
"org.apache.hadoop.fs.s3a.S3AFileSystem"
)
// The enable V4 does not seem necessary for the eu-west-3 region
// see #stevel comment below
// .config("com.amazonaws.services.s3.enableV4", true)
// .config(
// "spark.driver.extraJavaOptions",
// "-Dcom.amazonaws.services.s3.enableV4=true"
// )
.config("spark.executor.instances", "4")
.getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
val df = spark.read.parquet("s3a://[BUCKET NAME]/.../???.parquet")
df.show()
note: the endpoint is of the form s3.[REGION].amazonaws.com, e.g. s3.eu-west-3.amazonaws.com
s3 configuration
To make the bucket available from outside of AWS, add a Bucket Policy of the form:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "Statement1",
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::[ACCOUNT ID]:user/[IAM USERNAME]"
},
"Action": [
"s3:Delete*",
"s3:Get*",
"s3:List*",
"s3:PutObject"
],
"Resource": "arn:aws:s3:::[BUCKET NAME]/*"
}
]
}
The ACCESS_KEY and SECRET_KEY supplied to the Spark configuration must be those of the IAM user configured on the bucket.
I am using spark version 2.3, and when I save a dataset using spark like:
dataset.write().format("hive").option("fileFormat", "orc").mode(SaveMode.Overwrite)
.option("path", "s3://reporting/default/temp/job_application")
.saveAsTable("job_application");
It works perfectly and saves my data into s3.

Spark Unable to find JDBC Driver

So I've been using sbt with assembly to package all my dependencies into a single jar for my Spark jobs. I've got several jobs where I was using c3p0 to set up connection pool information, broadcast that out, and then use foreachPartition on the RDD to grab a connection and insert the data into the database. In my sbt build script, I include
"mysql" % "mysql-connector-java" % "5.1.33"
This makes sure the JDBC connector is packaged up with the job. Everything works great.
So recently I started playing around with SparkSQL and realized it's much easier to simply take a DataFrame and save it to a JDBC source with the new features in 1.3.0.
I'm getting the following exception:
java.sql.SQLException: No suitable driver found for
jdbc:mysql://some.domain.com/myschema?user=user&password=password at
java.sql.DriverManager.getConnection(DriverManager.java:596) at
java.sql.DriverManager.getConnection(DriverManager.java:233)
When I was running this locally I got around it by setting
SPARK_CLASSPATH=/path/where/mysql-connector-is.jar
Ultimately what I want to know is: why is the job not capable of finding the driver when it should be packaged up with it? My other jobs never had this problem. From what I can tell, both c3p0 and the DataFrame code make use of java.sql.DriverManager (which handles importing everything for you, as far as I can tell), so it should work just fine. If there is something that prevents the assembly method from working, what do I need to do to make it work?
This person was having similar issue: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-DataFrame-with-MySQL-td22178.html
Have you updated your connector drivers to the most recent version? Also did you specify the driver class when you called load()?
Map<String, String> options = new HashMap<String, String>();
options.put("url", "jdbc:mysql://localhost:3306/video_rcmd?user=root&password=123456");
options.put("dbtable", "video");
options.put("driver", "com.mysql.cj.jdbc.Driver"); //here
DataFrame jdbcDF = sqlContext.load("jdbc", options);
In spark/conf/spark-defaults.conf, you can also set spark.driver.extraClassPath and spark.executor.extraClassPath to the path of your MySql driver .jar
These options are clearly mentioned in spark docs: --driver-class-path postgresql-9.4.1207.jar --jars postgresql-9.4.1207.jar
The mistake I was making was specifying these options after my application's jar.
However the correct way is to specify these options immediately after spark-submit:
spark-submit --driver-class-path /somepath/project/mysql-connector-java-5.1.30-bin.jar --jars /somepath/project/mysql-connector-java-5.1.30-bin.jar --class com.package.MyClass target/scala-2.11/project_2.11-1.0.jar
Both the Spark driver and the executors need the MySQL driver on the class path, so specify:
spark.driver.extraClassPath = <path>/mysql-connector-java-5.1.36.jar
spark.executor.extraClassPath = <path>/mysql-connector-java-5.1.36.jar
With Spark 2.2.0, the problem was fixed for me by adding extra class path information for the SparkSession in my Python script:
spark = SparkSession \
.builder \
.appName("Python Spark SQL basic example") \
.config("spark.driver.extraClassPath", "/path/to/jdbc/driver/postgresql-42.1.4.jar") \
.getOrCreate()
See official documentation https://spark.apache.org/docs/latest/configuration.html
In my case, Spark is not launched from the CLI but from the Django framework: https://www.djangoproject.com/
spark.driver.extraClassPath does not work in client-mode:
Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, please set this through the --driver-class-path command line option or in your default properties file.
Env variable SPARK_CLASSPATH has been deprecated in Spark 1.0+.
You should first copy the JDBC driver jars into each executor under the same local filesystem path and then use the following options in your spark-submit:
--driver-class-path "driver_local_file_system_jdbc_driver1.jar:driver_local_file_system_jdbc_driver2.jar"
--conf "spark.executor.extraClassPath=executors_local_file_system_jdbc_driver1.jar:executors_local_file_system_jdbc_driver2.jar"
For example, in the case of Teradata you need both terajdbc4.jar and tdgssconfig.jar.
Alternatively, modify compute_classpath.sh on all worker nodes; the Spark documentation says:
The JDBC driver class must be visible to the primordial class loader on the client session and on all executors. This is because Java’s DriverManager class does a security check that results in it ignoring all drivers not visible to the primordial class loader when one goes to open a connection. One convenient way to do this is to modify compute_classpath.sh on all worker nodes to include your driver JARs.
There is a simple Java trick to solve your problem: call Class.forName() explicitly. For example:
val customers: RDD[(Int, String)] = new JdbcRDD(sc, () => {
Class.forName("com.mysql.jdbc.Driver")
DriverManager.getConnection(jdbcUrl)
},
"SELECT id, name from customer WHERE ? < id and id <= ?" ,
0, range, partitions, r => (r.getInt(1), r.getString(2)))
Check the docs
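For the DataFrame save path the question describes, a hedged sketch (assuming Spark 1.4+, where df.write.jdbc is available; the table name and credentials are placeholders) is to name the driver class in the connection properties:
import java.util.Properties

val props = new Properties()
props.put("user", "user")
props.put("password", "password")
// naming the driver class explicitly sidesteps the DriverManager lookup problem
props.put("driver", "com.mysql.jdbc.Driver")

// df is the DataFrame you want to save
df.write.mode("append").jdbc("jdbc:mysql://some.domain.com/myschema", "mytable", props)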
A simple, easy way is to copy "mysql-connector-java-5.1.47.jar" into the "spark-2.4.3\jars\" directory.
I had the same problem running jobs over a Mesos cluster in cluster mode.
To use a JDBC driver it is necessary to add the dependency to the system classpath, not to the framework classpath. The only way I found of doing this was by adding the dependency to the spark-defaults.conf file on every instance of the cluster.
The properties to add are spark.driver.extraClassPath and spark.executor.extraClassPath, and the path must be in the local file system.
I added the jar file to SPARK_CLASSPATH in spark-env.sh, and it works:
export SPARK_CLASSPATH=$SPARK_CLASSPATH:/local/spark-1.6.3-bin-hadoop2.6/lib/mysql-connector-java-5.1.40-bin.jar
I was facing the same issue when trying to run the spark-shell command from my Windows machine. The path that you pass for the driver location, as well as for the jar that you are using, should be in double quotes; otherwise it gets misinterpreted and you will not get the output you want.
You would also have to install the JDBC driver for SQL Server from the link: JDBC Driver.
I used the command below, and it works fine for me on my Windows machine:
spark-shell --driver-class-path "C:\Program Files\Microsoft JDBC Driver 6.0 for SQL Server\sqljdbc_6.0\enu\jre8\sqljdbc42.jar" --jars "C:\Program Files\Microsoft JDBC Driver 6.0 for SQL Server\sqljdbc_6.0\enu\jre8\sqljdbc42.jar"
