So I've been using sbt with assembly to package all my dependencies into a single jar for my Spark jobs. I've got several jobs where I was using c3p0 to set up connection pool information, broadcast that out, and then use foreachPartition on the RDD to grab a connection and insert the data into the database. In my sbt build script, I include
"mysql" % "mysql-connector-java" % "5.1.33"
This makes sure the JDBC connector is packaged up with the job. Everything works great.
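For context, the working jobs follow roughly this pattern (a minimal sketch: the case class, table, and connection settings are placeholders, and the real jobs borrow connections from a c3p0 pool instead of calling DriverManager directly):
import java.sql.DriverManager

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Placeholder for the pool settings that get broadcast to the executors.
case class DbConfig(url: String, user: String, password: String)

def saveToDb(sc: SparkContext, rows: RDD[(Int, String)]): Unit = {
  val dbConf = sc.broadcast(DbConfig("jdbc:mysql://some.domain.com/myschema", "user", "password"))
  rows.foreachPartition { partition =>
    // Each partition opens its own connection on the executor
    // (the real jobs grab one from a c3p0 ComboPooledDataSource here).
    val conn = DriverManager.getConnection(dbConf.value.url, dbConf.value.user, dbConf.value.password)
    val stmt = conn.prepareStatement("INSERT INTO mytable (id, name) VALUES (?, ?)")
    try {
      partition.foreach { case (id, name) =>
        stmt.setInt(1, id)
        stmt.setString(2, name)
        stmt.executeUpdate()
      }
    } finally {
      stmt.close()
      conn.close()
    }
  }
}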
So recently I started playing around with Spark SQL and realized it's much easier to simply take a DataFrame and save it to a JDBC source with the new features in 1.3.0.
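The save itself is roughly this (a minimal sketch; the DataFrame and table name are placeholders, and I'm assuming the 1.3.0 DataFrame API where insertIntoJDBC does the write):
val url = "jdbc:mysql://some.domain.com/myschema?user=user&password=password"
val df = sqlContext.sql("SELECT id, name FROM some_temp_table") // placeholder DataFrame
// 1.3.0-era API; later releases replace this with df.write.jdbc(...)
df.insertIntoJDBC(url, "mytable", false)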
I'm getting the following exception:
java.sql.SQLException: No suitable driver found for jdbc:mysql://some.domain.com/myschema?user=user&password=password
    at java.sql.DriverManager.getConnection(DriverManager.java:596)
    at java.sql.DriverManager.getConnection(DriverManager.java:233)
When I was running this locally I got around it by setting
SPARK_CLASSPATH=/path/where/mysql-connector-is.jar
Ultimately what I want to know is: why is the job not capable of finding the driver when it should be packaged up with it? My other jobs never had this problem. From what I can tell, both c3p0 and the DataFrame code make use of java.sql.DriverManager (which handles importing everything for you, as far as I can tell), so it should work just fine. If there is something that prevents the assembly method from working, what do I need to do to make this work?
This person was having a similar issue: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-DataFrame-with-MySQL-td22178.html
Have you updated your connector drivers to the most recent version? Also did you specify the driver class when you called load()?
Map<String, String> options = new HashMap<String, String>();
options.put("url", "jdbc:mysql://localhost:3306/video_rcmd?user=root&password=123456");
options.put("dbtable", "video");
options.put("driver", "com.mysql.cj.jdbc.Driver"); //here
DataFrame jdbcDF = sqlContext.load("jdbc", options);
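If you're writing Scala rather than Java, the same idea (assuming the Spark 1.3 sqlContext.load overload that takes a source name and an options map) would look like:
val options = Map(
  "url" -> "jdbc:mysql://localhost:3306/video_rcmd?user=root&password=123456",
  "dbtable" -> "video",
  "driver" -> "com.mysql.cj.jdbc.Driver" // naming the driver class explicitly is the key part
)
val jdbcDF = sqlContext.load("jdbc", options)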
In spark/conf/spark-defaults.conf, you can also set spark.driver.extraClassPath and spark.executor.extraClassPath to the path of your MySQL driver .jar.
These options are clearly mentioned in the Spark docs: --driver-class-path postgresql-9.4.1207.jar --jars postgresql-9.4.1207.jar
The mistake I was making was specifying these options after my application's jar.
However, the correct way is to specify these options immediately after spark-submit:
spark-submit --driver-class-path /somepath/project/mysql-connector-java-5.1.30-bin.jar --jars /somepath/project/mysql-connector-java-5.1.30-bin.jar --class com.package.MyClass target/scala-2.11/project_2.11-1.0.jar
Both the Spark driver and the executors need the MySQL driver on the classpath, so specify:
spark.driver.extraClassPath = <path>/mysql-connector-java-5.1.36.jar
spark.executor.extraClassPath = <path>/mysql-connector-java-5.1.36.jar
With Spark 2.2.0, the problem was corrected for me by adding the extra classpath information for the SparkSession in the Python script:
spark = SparkSession \
.builder \
.appName("Python Spark SQL basic example") \
.config("spark.driver.extraClassPath", "/path/to/jdbc/driver/postgresql-42.1.4.jar") \
.getOrCreate()
See official documentation https://spark.apache.org/docs/latest/configuration.html
In my case, Spark is not launched from a CLI command, but from the Django framework: https://www.djangoproject.com/
spark.driver.extraClassPath does not work in client mode:
Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, please set this through the --driver-class-path command line option or in your default properties file.
The environment variable SPARK_CLASSPATH has been deprecated since Spark 1.0.
You should first copy the JDBC driver jars to each executor under the same local filesystem path and then use the following options in your spark-submit:
--driver-class-path "driver_local_file_system_jdbc_driver1.jar:driver_local_file_system_jdbc_driver2.jar"
--class "spark.executor.extraClassPath=executors_local_file_system_jdbc_driver1.jar:executors_local_file_system_jdbc_driver2.jar"
For example, in the case of Teradata you need both terajdbc4.jar and tdgssconfig.jar.
Alternatively, modify compute_classpath.sh on all worker nodes; the Spark documentation says:
The JDBC driver class must be visible to the primordial class loader on the client session and on all executors. This is because Java’s DriverManager class does a security check that results in it ignoring all drivers not visible to the primordial class loader when one goes to open a connection. One convenient way to do this is to modify compute_classpath.sh on all worker nodes to include your driver JARs.
There is a simple Java trick to solve your problem: register the driver with Class.forName(). For example:
import java.sql.DriverManager
import org.apache.spark.rdd.{JdbcRDD, RDD}

val customers: RDD[(Int, String)] = new JdbcRDD(sc, () => {
    Class.forName("com.mysql.jdbc.Driver")
    DriverManager.getConnection(jdbcUrl)
  },
  "SELECT id, name from customer WHERE ? < id and id <= ?",
  0, range, partitions, r => (r.getInt(1), r.getString(2)))
Check the docs
A simple, easy way is to copy "mysql-connector-java-5.1.47.jar" into the "spark-2.4.3\jars\" directory.
I had the same problem running jobs over a Mesos cluster in cluster mode.
To use a JDBC driver, it is necessary to add the dependency to the system classpath, not to the framework classpath. The only way I found to do this was by adding the dependency to the spark-defaults.conf file on every instance of the cluster.
The properties to add are spark.driver.extraClassPath and spark.executor.extraClassPath, and the path must be on the local file system.
I added the jar file to SPARK_CLASSPATH in spark-env.sh, and it works:
export SPARK_CLASSPATH=$SPARK_CLASSPATH:/local/spark-1.6.3-bin-hadoop2.6/lib/mysql-connector-java-5.1.40-bin.jar
I was facing the same issue when I was trying to run the spark-shell command from my Windows machine. The path that you pass for the driver location, as well as for the jar that you are using, should be in double quotes; otherwise it gets misinterpreted and you will not get the output you want.
You would also have to install the JDBC driver for SQL Server from the link: JDBC Driver
I used the command below, which worked fine on my Windows machine:
spark-shell --driver-class-path "C:\Program Files\Microsoft JDBC Driver 6.0 for SQL Server\sqljdbc_6.0\enu\jre8\sqljdbc42.jar" --jars "C:\Program Files\Microsoft JDBC Driver 6.0 for SQL Server\sqljdbc_6.0\enu\jre8\sqljdbc42.jar"
I have HDP 2.6.1.0-129.
I have an external jar, example.jar, for serialized Flume data files.
I added a new parameter in the Custom hive-site section:
name = hive.aux.jars.path
value = hdfs:///user/libs/
I saved the new configuration, restarted the Hadoop components, and later restarted the whole Hadoop cluster.
Afterwards, in the Hive client, I tried to run a select:
select * from example_serealized_table
and Hive returned an error:
FAILED: RuntimeException MetaException(message:org.apache.hadoop.hive.serde2.SerDeException java.lang.ClassNotFoundException: Class com.my.bigtable.example.model.gen.TSerializedRecord not found)
How do I solve this problem?
p.s.
If I try adding it in the current session with
add jar hdfs:///user/libs/example-spark-SerializedRecord.jar;
or putting the *.jar in a local folder, the problem is the same.
I did not mention that the library was written by my colleague. It turned out that it redefines the variables that affect the logging level. After excluding the overridden variables in the library, the problem stopped reproducing.
What is really executed, and where, when using JDBC drivers to connect to e.g. Oracle?
1: I have started a spark master as
spark-class.cmd org.apache.spark.deploy.master.Master
and a worker like so
spark-class.cmd org.apache.spark.deploy.worker.Worker spark://myip:7077
and spark shell as
spark-shell --master spark://myip:7077
in spark-defaults.conf I have
spark.driver.extraClassPath = C:/jdbcDrivers/ojdbc8.jar
spark.executor.extraClassPath = C:/jdbcDrivers/ojdbc8.jar
and in spark-env.sh I have
SPARK_CLASSPATH=C:/jdbcDrivers/ojdbc8.jar
I can now run queries against Oracle in the spark-shell:
val jdbcDF = spark.read.format("jdbc").option("url","jdbc:oracle:thin:#...
This works fine without separately adding the JDBC driver jar in the Scala shell.
When I start the master and worker in the same way, but create a Scala project in Eclipse and connect to the master as follows:
val sparkSession = SparkSession.builder
  .master("spark://myip:7077")
  .appName("SparkTestApp")
  .config("spark.jars", "C:\\pathToJdbc\\ojdbc8.jar")
  .getOrCreate()
then it fails if I don't explicitly add the JDBC jar in the Scala code.
How is the execution different? Why do I need to specify the JDBC jar in the code? What is the purpose of connecting to the master if it doesn't rely on the master and workers started?
If I use multiple workers with JDBC, will they use only one connection, or will they read in parallel over several connections simultaneously?
You are certainly using too many pieces at once for this example, and that is what got you confused.
The two lines, spark-class.cmd org.apache.spark.deploy.master.Master and spark-class.cmd org.apache.spark.deploy.worker.Worker spark://myip:7077, started a Spark Standalone cluster with one master and one worker. See Spark Standalone Mode.
In addition to running on the Mesos or YARN cluster managers, Spark also provides a simple standalone deploy mode. You can launch a standalone cluster either manually, by starting a master and workers by hand, or use our provided launch scripts. It is also possible to run these daemons on a single machine for testing.
You chose to start the Spark Standalone cluster manually (as described in Starting a Cluster Manually).
I doubt that spark-defaults.conf is used by the cluster at all. The file is for configuring Spark applications that are spark-submit'ed to a cluster (as described in Dynamically Loading Spark Properties):
bin/spark-submit will also read configuration options from conf/spark-defaults.conf, in which each line consists of a key and a value separated by whitespace.
With that said, I think we can safely put Spark Standalone aside. It does not add much to the discussion (and does confuse a bit).
"Installing" JDBC Driver for Spark Application
In order to use a JDBC driver in your Spark application, you should spark-submit with --driver-class-path command-line option (or spark.driver.extraClassPath property as described in Runtime Environment):
spark.driver.extraClassPath Extra classpath entries to prepend to the classpath of the driver.
Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, please set this through the --driver-class-path command line option or in your default properties file.
I strongly recommend using spark-submit --driver-class-path.
$ ./bin/spark-submit --help
...
--driver-class-path Extra class path entries to pass to the driver. Note that
jars added with --jars are automatically included in the
classpath.
You can read my notes on how to use a JDBC driver with PostgreSQL in Working with Datasets from JDBC Data Sources (and PostgreSQL).
PROTIP Use SPARK_PRINT_LAUNCH_COMMAND=1 to check out the command line of spark-submit.
All above applies to spark-shell too (as it uses spark-submit under the covers).
I'm trying to add a JDBC driver to a Spark cluster that is executing on top of Amazon EMR, but I keep getting the java.sql.SQLException: No suitable driver found exception.
I tried the following things:
Using addJar to add the driver jar explicitly from the code.
Using the spark.executor.extraClassPath and spark.driver.extraClassPath parameters.
Using spark.driver.userClassPathFirst=true; when I used this option I got a different error because of a mix of dependencies with Spark. Anyway, this option seems too aggressive if I just want to add a single JAR.
Could you please help me with that? How can I introduce the driver to the Spark cluster easily?
Thanks,
David
Source code of the application
val properties = new Properties()
properties.put("ssl", "***")
properties.put("user", "***")
properties.put("password", "***")
properties.put("account", "***")
properties.put("db", "***")
properties.put("schema", "***")
properties.put("driver", "***")
val conf = new SparkConf().setAppName("***")
.setMaster("yarn-cluster")
.setJars(JavaSparkContext.jarOfClass(this.getClass()))
val sc = new SparkContext(conf)
sc.addJar(args(0))
val sqlContext = new SQLContext(sc)
var df = sqlContext.read.jdbc(connectStr, "***", properties = properties)
df = df.select( Constants.***,
Constants.***,
Constants.***,
Constants.***,
Constants.***,
Constants.***,
Constants.***,
Constants.***,
Constants.***)
// Additional actions on df
I had the same problem. What ended up working for me was to use the --driver-class-path parameter with spark-submit.
The main thing is to add the entire Spark classpath to --driver-class-path.
Here are my steps:
I got the default driver classpath by getting the value of the "spark.driver.extraClassPath" property from the Spark History Server under "Environment".
Copied the MySQL JAR file to each node in the EMR cluster.
Put the MySQL jar path at the front of the --driver-class-path argument to the spark-submit command and appended the value of "spark.driver.extraClassPath" to it.
My driver class path ended up looking like this:
--driver-class-path /home/hadoop/jars/mysql-connector-java-5.1.35.jar:/etc/hadoop/conf:/usr/lib/hadoop/:/usr/lib/hadoop-hdfs/:/usr/lib/hadoop-mapreduce/:/usr/lib/hadoop-yarn/:/usr/lib/hadoop-lzo/lib/:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/:/usr/share/aws/emr/emrfs/auxlib/*
This worked with EMR 4.1 using Java with Spark 1.5.0.
I had already added the MySQL JAR as a dependency in the Maven pom.xml
You may also want to look at this answer as it seems like a cleaner solution. I haven't tried it myself.
With EMR 5.2 I add any new jars to the original driver classpath with:
export MY_DRIVER_CLASS_PATH=my_jdbc_jar.jar:some_other_jar.jar:$(grep spark.driver.extraClassPath /etc/spark/conf/spark-defaults.conf | awk '{print $2}')
and after that
spark-submit --driver-class-path $MY_DRIVER_CLASS_PATH
Following a similar pattern to this answer quoted above, this is how I automated installing a JDBC driver on EMR clusters. (Full automation is useful for transient clusters started and terminated per job.)
Use a bootstrap action to install the JDBC driver on all EMR cluster nodes. Your bootstrap action will be a one-line shell script, stored in S3, that looks like:
aws s3 cp s3://.../your-jdbc-driver.jar /home/hadoop
Add a step to your EMR cluster, before running your actual Spark job, to modify /etc/spark/conf/spark-defaults.conf.
This will be another one-line shell script, stored in S3:
sudo sed -e 's,\(^spark.driver.extraClassPath.*$\),\1:/home/hadoop/your-jdbc-driver.jar,' -i /etc/spark/conf/spark-defaults.conf
The step itself will look like
{
"name": "add JDBC driver to classpath",
"jar": "s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar",
"args": ["s3://...bucket.../set-spark-driver-classpath.sh"]
}
This will add your JDBC driver to spark.driver.extraClassPath
Explanation
You can't do both as bootstrap actions, because Spark won't be installed yet, so there is no config file to update.
You can't install the JDBC driver as a step, because you need the JDBC driver installed on the same path on all cluster nodes. In YARN cluster mode, the driver process does not necessarily run on the master node.
The configuration only needs to be updated on the master node, though, as the config is packed up and shipped to whatever node ends up running the driver.
In case you're using Python on your EMR cluster, there's no need to specify the jar while creating the cluster. You can add the jar package while creating your SparkSession.
# Both packages go in a single comma-separated value, because a second
# .config("spark.jars.packages", ...) call would overwrite the first one.
spark = SparkSession \
    .builder \
    .config("spark.jars.packages",
            "org.apache.hadoop:hadoop-aws:2.7.0,mysql:mysql-connector-java:8.0.17") \
    .getOrCreate()
And then when you make your query, specify the driver like this (the dbtable value below is a placeholder):
form_df = spark.read.format("jdbc"). \
    option("url", "jdbc:mysql://yourdatabase"). \
    option("driver", "com.mysql.jdbc.Driver"). \
    option("dbtable", "yourtable"). \
    load()
This way the package is included in the SparkSession, as it is pulled from a Maven repository. I hope it helps someone who is in the same situation I once was in.
Reference to the previously asked question: Oozie + Sqoop: JDBC Driver Jar Location
but I am not able to find the jar in the HDFS /user/oozie/share/lib/sqoop location.
I have also tried putting the driver jars in my workflow app's lib directory. The driver-not-found error still occurs.
You need to add all lib files, such as JDBC drivers, to the Oozie share lib folder inside the sqoop folder.
This should resolve your issue.
To check the library files invoked/used by the job, go to the job tracker for the corresponding job; in the syslogs you will see which jars have been used.
The exact problem was the single quotes ('). Because of the single quotes, Oozie took it as a single string. But it was working fine when I was using it in the Sqoop command.
................. --driver com.microsoft.sqlserver.jdbc.SQLServer...................
instead of
.................. --driver 'com.microsoft.sqlserver.jdbc.SQLServer'................
I have a 6-node Cloudera-based Hadoop cluster, and I'm trying to connect to an Oracle database from a Sqoop action in Oozie.
I have copied my ojdbc6.jar into the sqoop lib location (which for me happens to be at: /opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/sqoop/lib/ ) on all the nodes and have verified that I can run a simple 'sqoop eval' from all the 6 nodes.
Now when I run the same command using Oozie's sqoop action, I get "Could not load db driver class: oracle.jdbc.OracleDriver"
I have read this article about using shared libs, and it makes sense to me when we're talking about my task/action/workflow-specific dependencies. But I see a JDBC driver installation as an extension to Sqoop, and so I think it belongs in the Sqoop installation lib.
Now the question is: while Sqoop sees this ojdbc6 jar I have put into its lib folder, how come my Oozie workflow doesn't see it?
Is this something expected or am I missing something?
As an aside, what do you guys think is the appropriate location for a JDBC driver jar?
Thanks in advance!
The JDBC driver jar (and any jars it depends on) should go in your Oozie sharelib folder on HDFS. I'm running Hortonworks Data Platform 1.2 instead of Cloudera 4.2, so the details may vary, but my JDBC driver is located in /user/oozie/share/lib/sqoop. This should allow you to run Sqoop with the JDBC driver via Oozie.
It is not necessary to put the JDBC driver jar in the sqoop lib on the data nodes. In my setup I can't run a simple sqoop eval from the command line on my data nodes. I understand the logic for why you thought this would work. The reason the JDBC driver jar needs to be on HDFS is so that all the data nodes have access to it. Your solution should accomplish the same goal. I'm not familiar enough with the inner workings of Oozie to say why using the sharelib works but your solution does not.
In CDH 5, you should put the jar into '/user/oozie/share/lib/lib_${timestamp}/sqoop', and after that you must update the sharelib or restart Oozie.
update sharelib:
oozie admin -oozie http://localhost:11000/oozie -sharelibupdate
If you are using CDH 5, the JDBC driver jar (and any jars it depends on) should go in the '/user/oozie/share/lib/lib_timestamp/sqoop' folder on HDFS.
I was facing the same issue: it was not able to find the MySQL jar. I am using Cloudera 4.4, and in this version even the oozie admin -oozie http://localhost:11000/oozie -sharelibupdate command will not work.
To resolve the issue, I followed the steps below:
Create a user in Hue with hdfs and give it admin privileges.
Using the Hue UI, upload the jar into the /user/oozie/share/lib/sqoop HDFS path,
or you can use the command below:
hadoop fs -put /var/lib/sqoop2/mysql-connector-java.jar /user/oozie/share/lib/sqoop
Once the jar is placed, run the Oozie command.