Oozie can't find JDBC drivers in Sqoop - hadoop

With reference to the previously asked question Oozie + Sqoop: JDBC Driver Jar Location,
I am not able to find the jar in the HDFS /user/oozie/share/lib/sqoop location.
I have also tried putting the driver jars in my workflow app's lib directory, but the "driver not found" error still occurs.

You need to add all library files, such as JDBC drivers, to the sqoop folder inside the Oozie share lib folder.
This should resolve your issue.
To check which library files a job actually invoked/used, go to the job tracker for the corresponding job; in the syslogs you will see all the jars that were used.
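As a minimal sketch of that (the jar name, sharelib path, and Oozie URL below are assumptions; adjust them for your cluster and driver):
# Copy the JDBC driver jar into the Sqoop sharelib directory on HDFS
# (sqljdbc4.jar is a placeholder for whatever driver jar you actually use).
hdfs dfs -put /tmp/sqljdbc4.jar /user/oozie/share/lib/sqoop/
# Then refresh the sharelib (on older Oozie versions that lack this command, restart the Oozie server instead).
oozie admin -oozie http://oozie-host:11000/oozie -sharelibupdate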

The exact problem was the single quotes ('). Because of the single quotes, Oozie treated the value as a literal string, although the same command worked fine when I ran it directly with Sqoop. It has to be
... --driver com.microsoft.sqlserver.jdbc.SQLServer...
instead of
... --driver 'com.microsoft.sqlserver.jdbc.SQLServer'...

Related

Why does Hive search for its configuration in HADOOP_CONF_DIR first?

Today I found that if I copy hive-site.xml into $HADOOP_HOME/etc/hadoop/, Hive will use the hive-site.xml in the $HADOOP_HOME/etc/hadoop/ instead of the one in $HIVE_HOME/conf, and it will also search for the hive-log4j.properties in $HADOOP_HOME/etc/hadoop/.
If not found, Hive will just use the default one in /lib/hive-common-1.1.0-cdh5.7.6.jar!/hive-log4j.properties instead of the customized one in $HIVE_HOME/conf, but why?
I searched for the keywords "copy hive-site.xml to HADOOP_HOME" in the official Apache Hive manual but failed to find any explanation...
My Hive version is hive-1.1.0-cdh5.7.6, Hadoop version hadoop-2.6.0-cdh5.7.6, JDK 1.7.
So, you've mentioned Sqoop, therefore I'll point out the proper process for picking up the Hive XML configuration.
1) If the file isn't found, there's a classpath problem. Copying the file is one solution, but a poor one; a symlink is preferred (see the sketch after this list).
Every time I've used Sqoop, I never had to manage any XML files - it just worked. Therefore, both HDP and CDH must have the proper classpath and/or symlinks set up.
2) The documentation states where configurations are loaded from
Sqoop will fall back to $HADOOP_HOME. If it is not set either, Sqoop will use the default installation locations for Apache Bigtop, /usr/lib/hadoop and /usr/lib/hadoop-mapreduce, respectively.
The active Hadoop configuration is loaded from $HADOOP_HOME/conf/, unless the $HADOOP_CONF_DIR environment variable is set
This classpath controls where configurations are loaded from
3) You can also supply extra files at runtime:
-files <comma separated list of files> specify comma separated files to be copied to the map reduce cluster
sqoop import -files $HIVE_HOME/conf/hive-site.xml ...
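A minimal sketch of the symlink approach from point 1 (the layout is an assumption; substitute your actual $HIVE_HOME and Hadoop configuration directory):
# Link the customized Hive configuration into the Hadoop configuration directory
# so anything that only has $HADOOP_CONF_DIR on its classpath still finds it.
ln -s $HIVE_HOME/conf/hive-site.xml $HADOOP_CONF_DIR/hive-site.xml
ln -s $HIVE_HOME/conf/hive-log4j.properties $HADOOP_CONF_DIR/hive-log4j.properties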

How to add jar files for Hue in Cloudera?

I'm running an SQL query on a JSON serde table. It's working in the Hive CLI, but it's failing in Hue with the error:
Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
I guess it's due to the missing jar file; any idea how to add the jar file hive-hcatalog-core-1.2.1.jar for Hue?
Place your jar in HDFS and add that path by using ADD JAR hdfs:///user/hive/lib/hive-hcatalog-core-1.2.1.jar;
Run ADD JAR hive-hcatalog-core-1.2.1.jar in Hue before your query; it will stay available for as long as your current session persists.
For the benefit of others who might face the same issue, either with this particular jar "hive-hcatalog-core-1.2.1.jar" or with any UDF jar:
In the Hue Query Editor, run the following command:
add jar hdfs:/hive-hcatalog-core-1.2.1.jar;
Please note that single quotes are not required, as is also the case with the Hive CLI.
The exact command Cloudera gives is ADD JAR {{lib_dir}}/hive/lib/hive-contrib.jar;
1) I am unable to find the hive/lib directory on CDH 5.
The {{lib_dir}} on CDH-installed environments for Hive is either /usr/lib/hive/ or /opt/cloudera/parcels/CDH/lib/hive/ (depending on whether packages or parcels are in use).
This is the way to add a jar in Cloudera:
for this you have to switch to the super user with the command
sudo su
which will switch you to the super user.
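Pulling the above together, a rough sketch (the local jar path and HDFS target directory are assumptions; substitute wherever the jar actually lives on your edge node):
# Upload the jar to HDFS so every node (and Hue) can resolve it.
sudo -u hdfs hdfs dfs -mkdir -p /user/hive/lib
sudo -u hdfs hdfs dfs -put /path/to/hive-hcatalog-core-1.2.1.jar /user/hive/lib/
# Then, in the Hue query editor, run before your query:
#   add jar hdfs:///user/hive/lib/hive-hcatalog-core-1.2.1.jar;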

Can SQOOP work with a custom libpath?

I am trying to get some table data imported from PostgreSQL to HDFS using Sqoop. Now due to licensing constraints, Sqoop does not come packaged with JDBC drivers for all JDBC compliant databases. PostgreSQL is one of them. In order to interact with this database, Sqoop needs the relevant JDBC driver to be installed into a preset classpath (typically $SQOOP_HOME/lib).
In my case, the Hadoop administrator does not provide me write access to this predefined classpath. Is there any alternate way to instruct Sqoop client to look into some path (say, my home directory) instead of or in addition to the preset location?
I looked into the official Apache documentation and searched the internet, but could not fetch any answer. Could anyone please help?
Thanks !
I got this working yesterday. Below are the steps to follow.
Download the appropriate JDBC driver from here
Put the jar file under a directory of your choice. I chose
the Hadoop cluster user's home directory, i.e. /home/myuser
export HADOOP_CLASSPATH="/home/myuser/postgresql-9.4.1209.jar"
(replace /home/myuser/postgresql-9.4.1209.jar with your path and jar file name)
To perform a Sqoop import, you may use the command below.
sqoop import \
--connect 'jdbc:postgresql://<postgres_server_url>:<postgres_port>/<db_name>' \
--username <db_user_name> \
--password <db_user_password> \
--table <db_table_name> \
--warehouse-dir <existing_empty_hdfs_directory>
To perform a Sqoop export, you may use the command below.
sqoop export \
--connect 'jdbc:postgresql://<postgres_server_url>:<postgres_port>/<db_name>' \
--username <db_user_name> \
--password <db_user_password> \
--table <db_table_name> \
--export-dir <existing_hdfs_path_containing_export_data>
As per Sqoop docs,
-libjars <comma separated list of jars> - specify comma separated jar files to include in the classpath.
Make sure you use -libjars as the first argument in the command.
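A minimal sketch of that ordering (connection details are the same placeholders as above): the generic Hadoop arguments such as -libjars go right after the tool name, before any Sqoop-specific options.
sqoop import -libjars /home/myuser/postgresql-9.4.1209.jar \
--connect 'jdbc:postgresql://<postgres_server_url>:<postgres_port>/<db_name>' \
--username <db_user_name> \
--password <db_user_password> \
--table <db_table_name> \
--warehouse-dir <existing_empty_hdfs_directory>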
EDIT:
According to docs,
The -files, -libjars, and -archives arguments are not typically used with Sqoop, but they are included as part of Hadoop’s internal argument-parsing system.
So, JDBC client jars need to be put at $SQOOP_HOME/lib.
I recently experienced an issue with this -libjars option. It doesn't work perfectly; the problem is probably inherited from the hadoop jar command-line option. A possible alternative is to specify your extra jars using the HADOOP_CLASSPATH environment variable.
You have to export the path to your driver jar file:
export HADOOP_CLASSPATH=<path_to_driver_jar>.jar
After this, the jar file you specified is picked up correctly; the -libjars option does not pick it up. I noticed this in Sqoop version 1.4.6.

Spark Unable to find JDBC Driver

So I've been using sbt with assembly to package all my dependencies into a single jar for my spark jobs. I've got several jobs where I was using c3p0 to setup connection pool information, broadcast that out, and then use foreachPartition on the RDD to then grab a connection, and insert the data into the database. In my sbt build script, I include
"mysql" % "mysql-connector-java" % "5.1.33"
This makes sure the JDBC connector is packaged up with the job. Everything works great.
So recently I started playing around with Spark SQL and realized it's much easier to simply take a dataframe and save it to a JDBC source with the new features in 1.3.0.
However, I'm getting the following exception:
java.sql.SQLException: No suitable driver found for jdbc:mysql://some.domain.com/myschema?user=user&password=password
at java.sql.DriverManager.getConnection(DriverManager.java:596)
at java.sql.DriverManager.getConnection(DriverManager.java:233)
When I was running this locally I got around it by setting
SPARK_CLASSPATH=/path/where/mysql-connector-is.jar
Ultimately what I'm wanting to know is, why is the job not capable of finding the driver when it should be packaged up with it? My other jobs never had this problem. From what I can tell both c3p0 and the dataframe code both make use of the java.sql.DriverManager (which handles importing everything for you from what I can tell) so it should work just fine?? If there is something that prevents the assembly method from working, what do I need to do to make this work?
This person was having a similar issue: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-DataFrame-with-MySQL-td22178.html
Have you updated your connector drivers to the most recent version? Also did you specify the driver class when you called load()?
Map<String, String> options = new HashMap<String, String>();
options.put("url", "jdbc:mysql://localhost:3306/video_rcmd?user=root&password=123456");
options.put("dbtable", "video");
options.put("driver", "com.mysql.cj.jdbc.Driver"); // specify the driver class here (for Connector/J 5.x it is com.mysql.jdbc.Driver)
DataFrame jdbcDF = sqlContext.load("jdbc", options);
In spark/conf/spark-defaults.conf, you can also set spark.driver.extraClassPath and spark.executor.extraClassPath to the path of your MySQL driver .jar.
These options are clearly mentioned in the Spark docs: --driver-class-path postgresql-9.4.1207.jar --jars postgresql-9.4.1207.jar
The mistake I was making was putting these options after my application's jar.
The correct way is to specify these options immediately after spark-submit:
spark-submit --driver-class-path /somepath/project/mysql-connector-java-5.1.30-bin.jar --jars /somepath/project/mysql-connector-java-5.1.30-bin.jar --class com.package.MyClass target/scala-2.11/project_2.11-1.0.jar
Both the Spark driver and the executors need the MySQL driver on the classpath, so specify
spark.driver.extraClassPath = <path>/mysql-connector-java-5.1.36.jar
spark.executor.extraClassPath = <path>/mysql-connector-java-5.1.36.jar
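Equivalently, as a sketch (the class name and jar paths reuse the placeholders from the answers above), the same two properties can be passed at submit time with --conf instead of editing spark-defaults.conf:
spark-submit \
--conf spark.driver.extraClassPath=<path>/mysql-connector-java-5.1.36.jar \
--conf spark.executor.extraClassPath=<path>/mysql-connector-java-5.1.36.jar \
--class com.package.MyClass target/scala-2.11/project_2.11-1.0.jar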
With Spark 2.2.0, the problem was resolved for me by adding extra classpath information for the SparkSession in the Python script:
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.driver.extraClassPath", "/path/to/jdbc/driver/postgresql-42.1.4.jar") \
    .getOrCreate()
See official documentation https://spark.apache.org/docs/latest/configuration.html
In my case, Spark is not launched from a CLI command but from the Django framework https://www.djangoproject.com/
spark.driver.extraClassPath does not work in client-mode:
Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, please set this through the --driver-class-path command line option or in your default properties file.
Env variable SPARK_CLASSPATH has been deprecated in Spark 1.0+.
You should first copy the JDBC driver jars to each executor under the same local filesystem path and then use the following options in your spark-submit:
--driver-class-path "driver_local_file_system_jdbc_driver1.jar:driver_local_file_system_jdbc_driver2.jar"
--conf "spark.executor.extraClassPath=executors_local_file_system_jdbc_driver1.jar:executors_local_file_system_jdbc_driver2.jar"
For example, in the case of Teradata you need both terajdbc4.jar and tdgssconfig.jar.
Alternatively modify compute_classpath.sh on all worker nodes, Spark documentation says:
The JDBC driver class must be visible to the primordial class loader on the client session and on all executors. This is because Java’s DriverManager class does a security check that results in it ignoring all drivers not visible to the primordial class loader when one goes to open a connection. One convenient way to do this is to modify compute_classpath.sh on all worker nodes to include your driver JARs.
There is a simple Java trick to solve your problem: register the driver explicitly with Class.forName(). For example:
import java.sql.DriverManager
import org.apache.spark.rdd.{JdbcRDD, RDD}

val customers: RDD[(Int, String)] = new JdbcRDD(sc, () => {
    Class.forName("com.mysql.jdbc.Driver") // load the driver so DriverManager can find it
    DriverManager.getConnection(jdbcUrl)
  },
  "SELECT id, name from customer WHERE ? < id and id <= ?",
  0, range, partitions, r => (r.getInt(1), r.getString(2)))
Check the docs
A simple, easy way is to copy "mysql-connector-java-5.1.47.jar" into the "spark-2.4.3\jars\" directory.
I had the same problem running jobs over a Mesos cluster in cluster mode.
To use a JDBC driver, it is necessary to add the dependency to the system classpath, not the framework classpath. I only found a way to do this by adding the dependency to the spark-defaults.conf file on every instance of the cluster.
The properties to add are spark.driver.extraClassPath and spark.executor.extraClassPath, and the paths must be on the local file system.
I added the jar file to SPARK_CLASSPATH in spark-env.sh, and it works:
export SPARK_CLASSPATH=$SPARK_CLASSPATH:/local/spark-1.6.3-bin-hadoop2.6/lib/mysql-connector-java-5.1.40-bin.jar
I was facing the same issue when trying to run the spark-shell command from my Windows machine. The path that you pass for the driver location, as well as for the jar that you are using, should be in double quotes; otherwise it gets misinterpreted and you will not get the output you want.
You also have to install the JDBC driver for SQL Server from the link: JDBC Driver
I used the command below, which works fine for me on my Windows machine:
spark-shell --driver-class-path "C:\Program Files\Microsoft JDBC Driver 6.0 for SQL Server\sqljdbc_6.0\enu\jre8\sqljdbc42.jar" --jars "C:\Program Files\Microsoft JDBC Driver 6.0 for SQL Server\sqljdbc_6.0\enu\jre8\sqljdbc42.jar"

Oozie + Sqoop: JDBC Driver Jar Location

I have a 6 node cloudera based hadoop cluster and I'm trying to connect to an oracle database from a sqoop action in oozie.
I have copied my ojdbc6.jar into the sqoop lib location (which for me happens to be at: /opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/sqoop/lib/ ) on all the nodes and have verified that I can run a simple 'sqoop eval' from all the 6 nodes.
Now when I run the same command using Oozie's sqoop action, I get "Could not load db driver class: oracle.jdbc.OracleDriver"
I have read this article about using shared libs, and it makes sense to me when we're talking about my task/action/workflow specific dependencies. But I see a JDBC driver installation as an extension to Sqoop, and so I think it belongs in the Sqoop installation lib.
Now the question is: while Sqoop sees this ojdbc6 jar I have put into its lib folder, how come my Oozie workflow doesn't see it?
Is this something expected or am I missing something?
As an aside, what do you guys think is the appropriate location for a JDBC driver jar?
Thanks in advance!
The JDBC driver jar (and any jars it depends on) should go in your Oozie sharelib folder on HDFS. I'm running Hortonworks Data Platform 1.2 instead of Cloudera 4.2, so the details may vary, but my JDBC driver is located in /user/oozie/share/lib/sqoop. This should allow you to run Sqoop with the JDBC driver via Oozie.
It is not necessary to put the JDBC driver jar in the sqoop lib on the data nodes. In my setup I can't run a simple sqoop eval from the command line on my data nodes. I understand the logic for why you thought this would work. The reason the JDBC driver jar needs to be on HDFS is so that all the data nodes have access to it. Your solution should accomplish the same goal. I'm not familiar enough with the inner workings of Oozie to say why using the sharelib works but your solution does not.
In CDH 5, you should put the jar into '/user/oozie/share/lib/lib_${timestamp}/sqoop', and after that you must update the sharelib or restart Oozie.
To update the sharelib:
oozie admin -oozie http://localhost:11000/oozie -sharelibupdate
If you are using CDH 5, the JDBC driver jar (and any jars it depends on) should go in the '/user/oozie/share/lib/lib_${timestamp}/sqoop' folder on HDFS.
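As a sketch of those steps on CDH 5 (the local jar path and Oozie URL are assumptions; lib_<timestamp> stands for whichever directory your install actually has):
# Find the current lib_<timestamp> directory.
hdfs dfs -ls /user/oozie/share/lib/
# Put the driver into its sqoop subfolder...
hdfs dfs -put /path/to/ojdbc6.jar /user/oozie/share/lib/lib_<timestamp>/sqoop/
# ...then refresh the sharelib (or restart Oozie).
oozie admin -oozie http://localhost:11000/oozie -sharelibupdate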
I was facing the same issue; it was not able to find the mysql jar. I am using Cloudera 4.4, where even the oozie admin -oozie http://localhost:11000/oozie -sharelibupdate command does not work.
To resolve the issue I followed the steps below:
create a user in Hue with hdfs and grant it admin privileges
using the Hue UI, upload the jar into the /user/oozie/share/lib/sqoop HDFS path
or you can use the command below:
hadoop fs -put /var/lib/sqoop2/mysql-connector-java.jar /user/oozie/share/lib/sqoop
Once the jar is in place, run the Oozie command.
