I am trying to get some table data imported from PostgreSQL to HDFS using Sqoop. Due to licensing constraints, Sqoop does not come packaged with JDBC drivers for every JDBC-compliant database, and PostgreSQL is one of them. To interact with this database, Sqoop needs the relevant JDBC driver installed in a preset classpath (typically $SQOOP_HOME/lib).
In my case, the Hadoop administrator has not given me write access to this predefined classpath. Is there any alternative way to instruct the Sqoop client to look in some other path (say, my home directory) instead of, or in addition to, the preset location?
I looked into the official Apache documentation and searched the internet, but could not find an answer. Could anyone please help?
Thanks!
I got this working yesterday. Below are the steps to follow.
Download the appropriate PostgreSQL JDBC driver jar from the PostgreSQL JDBC download page.
Put the jar file in a directory of your choice. I chose
the Hadoop cluster user's home directory, i.e. /home/myuser
export HADOOP_CLASSPATH="/home/myuser/postgresql-9.4.1209.jar"
(replace /home/myuser/postgresql-9.4.1209.jar with your path and jar file name)
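If HADOOP_CLASSPATH is already set in your environment, you may prefer to append to it rather than overwrite it; a minimal sketch, reusing the example jar path from above:
export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:/home/myuser/postgresql-9.4.1209.jar"   # append instead of overwrite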
To perform a Sqoop import you can use the command below.
sqoop import \
  --connect 'jdbc:postgresql://<postgres_server_url>:<postgres_port>/<db_name>' \
  --username <db_user_name> \
  --password <db_user_password> \
  --table <db_table_name> \
  --warehouse-dir <existing_empty_hdfs_directory>
To perform a Sqoop export you can use the command below.
sqoop export \
  --connect 'jdbc:postgresql://<postgres_server_url>:<postgres_port>/<db_name>' \
  --username <db_user_name> \
  --password <db_user_password> \
  --table <db_table_name> \
  --export-dir <existing_hdfs_path_containing_export_data>
As per the Sqoop docs:
-libjars <comma separated list of jars>: specify comma-separated jar files to include in the classpath.
Make sure you pass -libjars as the first argument in the command; generic Hadoop arguments must come before the tool-specific ones.
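For example, a hedged sketch of an import that passes the driver jar via -libjars as the first argument (the jar path and connection details are placeholders, as above):
sqoop import -libjars /home/myuser/postgresql-9.4.1209.jar \
  --connect 'jdbc:postgresql://<postgres_server_url>:<postgres_port>/<db_name>' \
  --username <db_user_name> \
  --password <db_user_password> \
  --table <db_table_name> \
  --warehouse-dir <existing_empty_hdfs_directory>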
EDIT:
According to docs,
The -files, -libjars, and -archives arguments are not typically used with Sqoop, but they are included as part of Hadoop’s internal argument-parsing system.
So, JDBC client jars need to be placed in $SQOOP_HOME/lib.
I recently ran into an issue with this -libjars option: it does not work reliably. The problem is probably inherited from the hadoop jar command-line option handling. A workable alternative is to specify your extra jars using the HADOOP_CLASSPATH environment variable.
You have to export the path to your driver jar file:
export HADOOP_CLASSPATH=<path_to_driver_jar>.jar
After this, Sqoop correctly picks up the jar file you specified; the -libjars option does not. I noticed this with Sqoop 1.4.6.
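As a quick sanity check (a sketch; the jar path and connection details are placeholders), you can export the variable and run a lightweight command such as list-tables to confirm the driver is picked up:
export HADOOP_CLASSPATH=/home/myuser/postgresql-9.4.1209.jar   # path to your driver jar
sqoop list-tables --connect 'jdbc:postgresql://<host>:<port>/<db_name>' --username <db_user_name> -P   # -P prompts for the password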
Related
Today I found that if I copy hive-site.xml into $HADOOP_HOME/etc/hadoop/, Hive will use that copy instead of the one in $HIVE_HOME/conf, and it will also look for hive-log4j.properties in $HADOOP_HOME/etc/hadoop/.
If it is not found there, Hive will just use the default one bundled in /lib/hive-common-1.1.0-cdh5.7.6.jar!/hive-log4j.properties instead of the customized one in $HIVE_HOME/conf, but why?
I searched for the keywords "copy hive-site.xml to HADOOP_HOME" in the official Hive manual on apache.org but failed to find any explanation...
My Hive version is hive-1.1.0-cdh5.7.6, Hadoop version hadoop-2.6.0-cdh5.7.6, JDK 1.7.
Since you've mentioned Sqoop, I'll point out the proper process for picking up the Hive XML configuration.
1) There's a classpath problem if the file isn't found. Copying the file is one solution, but a poor one; a symlink is preferred, as shown below.
Every time I've used Sqoop, I never had to mess around with any XML files; it just worked. So both HDP and CDH must have the proper classpath and/or symlinks set up.
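A hedged sketch of that symlink (exact paths depend on your installation):
ln -s $HIVE_HOME/conf/hive-site.xml $HADOOP_HOME/etc/hadoop/hive-site.xml   # link the Hive config into the Hadoop conf dir instead of copying it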
2) The documentation states where configurations are loaded from
Sqoop will fall back to $HADOOP_HOME. If it is not set either, Sqoop will use the default installation locations for Apache Bigtop, /usr/lib/hadoop and /usr/lib/hadoop-mapreduce, respectively.
The active Hadoop configuration is loaded from $HADOOP_HOME/conf/, unless the $HADOOP_CONF_DIR environment variable is set
This classpath controls where configurations are loaded from
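For instance, a sketch of pointing at a specific configuration directory (the /etc/hadoop/conf path is only an example; use your cluster's actual conf directory):
export HADOOP_CONF_DIR=/etc/hadoop/conf   # directory holding core-site.xml, hdfs-site.xml, etc.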
3) You can also, at runtime, give extra files
-files <comma separated list of files> specify comma separated files to be copied to the map reduce cluster
sqoop import -files $HIVE_HOME/conf/hive-site.xml ...
Does sqoop import/export create Java classes? If so, where can I see these generated classes? What is the location of these class files?
Does sqoop import/export create java classes?
Yes
If so, where can I see these generated classes? What is the location of these class files?
It automatically generates a Java file named after the table in the current working directory on the local filesystem.
You can use --outdir to provide your own path.
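For example, a sketch of an import that redirects the generated sources (the connection details and output directory are placeholders):
sqoop import --connect jdbc:mysql://<host>/<db_name> --username <user> --password <password> --table <table> --outdir /home/myuser/sqoop-gen   # .java files land in /home/myuser/sqoop-gen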
Updated as per comment
You can use codegen command for this:
sqoop codegen \
  --connect jdbc:mysql://localhost/databasename \
  --username username \
  --password password \
  --table tablename
After the command executes successfully, the path where the generated Java files can be found is printed at the end.
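If you want to control where codegen writes its output, you can add --outdir (for the .java sources) and --bindir (for the compiled classes and jar); a hedged sketch with example directories:
sqoop codegen \
  --connect jdbc:mysql://localhost/databasename \
  --username username \
  --password password \
  --table tablename \
  --outdir /home/myuser/sqoop-src \
  --bindir /home/myuser/sqoop-bin   # example output directories; adjust to taste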
This is the complete flow of a Sqoop command:
User---> SQOOP CLI cmd ----> Sqoop Code GEN -----> Sqoop JAR Writer
----> JAR submission ---> ResourceManager ----> MR operation (5phases) ----> HDFS ----> Ack to Sqoop by MR program
Sqoop internally uses MapReduce v1 or v2 for its execution (getting the data from the DB and storing it in HDFS as comma-delimited values). It first creates a .java source file for the MapReduce program, packages it in a jar, and then submits it.
The .java file is created in the current local directory with the name of the table.
sqoop import --connect jdbc:mysql://localhost/hadoop --table employee -m 1
In this case an "employee.java" file is created.
Referring to the previously asked question Oozie + Sqoop: JDBC Driver Jar Location,
I am not able to find the jar in the HDFS /user/oozie/share/lib/sqoop location.
I have also tried putting the driver jars in my workflow application's lib directory, but the "drivers not found" error still occurs.
You need to add all the lib files, such as JDBC drivers, to the sqoop folder inside the Oozie sharelib folder.
This should resolve your issue.
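For example, a hedged sketch of copying a driver jar into the Sqoop sharelib directory on HDFS (the local jar path is a placeholder, and the sharelib path may differ in your setup):
hadoop fs -put /path/to/<jdbc-driver>.jar /user/oozie/share/lib/sqoop/   # upload the driver so all nodes can reach it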
To check which library files a job actually invoked/used, go to the job tracker for the corresponding job; in the syslogs you will see which jars were loaded.
The exact problem was the single quotes ('). Because of the single quotes, Oozie treated the value as a literal string, although it worked fine when I used it directly in the Sqoop command. Use
................. --driver com.microsoft.sqlserver.jdbc.SQLServer...................
instead of
.................. --driver 'com.microsoft.sqlserver.jdbc.SQLServer'................
I have a 6 node cloudera based hadoop cluster and I'm trying to connect to an oracle database from a sqoop action in oozie.
I have copied my ojdbc6.jar into the sqoop lib location (which for me happens to be at: /opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/sqoop/lib/ ) on all the nodes and have verified that I can run a simple 'sqoop eval' from all the 6 nodes.
Now when I run the same command using Oozie's sqoop action, I get "Could not load db driver class: oracle.jdbc.OracleDriver"
I have read this article about using shared libs, and it makes sense to me when we're talking about my task/action/workflow-specific dependencies. But I see a JDBC driver installation as an extension to Sqoop, so I think it belongs in the Sqoop installation lib.
Now the question is: while Sqoop sees this ojdbc6 jar I have put into its lib folder, how come my Oozie workflow doesn't see it?
Is this something expected or am I missing something?
As an aside, what do you guys think is the appropriate location for a JDBC driver jar?
Thanks in advance!
The JDBC driver jar (and any jars it depends on) should go in your Oozie sharelib folder on HDFS. I'm running Hortonworks Data Platform 1.2 instead of Cloudera 4.2, so the details may vary, but my JDBC driver is located in /user/oozie/share/lib/sqoop. This should allow you to run Sqoop with the JDBC driver via Oozie.
It is not necessary to put the JDBC driver jar in the Sqoop lib on the data nodes. In my setup I can't run a simple sqoop eval from the command line on my data nodes. I understand the logic for why you thought this would work. The reason the JDBC driver jar needs to be on HDFS is so that all the data nodes have access to it. Your solution should accomplish the same goal. I'm not familiar enough with the inner workings of Oozie to say why using the sharelib works but your solution does not.
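As a sketch of placing and verifying the driver jar in the sharelib (the jar name follows the setup described above, and the path is from my HDP 1.2 install; adjust for your distribution):
hadoop fs -put ojdbc6.jar /user/oozie/share/lib/sqoop/   # assumes ojdbc6.jar is in the current local directory
hadoop fs -ls /user/oozie/share/lib/sqoop/               # confirm the jar is now listed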
In CDH 5, you should put the jar into '/user/oozie/share/lib/lib_${timestamp}/sqoop', and after that you must update the sharelib or restart Oozie.
To update the sharelib:
oozie admin -oozie http://localhost:11000/oozie -sharelibupdate
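To confirm the jar is visible after the update (a sketch; adjust the Oozie URL for your environment), you can list the sharelib contents:
oozie admin -oozie http://localhost:11000/oozie -shareliblist sqoop   # lists the jars in the sqoop sharelib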
If you are using CDH 5, the JDBC driver jar (and any jars it depends on) should go in the '/user/oozie/share/lib/lib_${timestamp}/sqoop' folder on HDFS.
I was facing the same issue: it was not able to find the MySQL jar. I am using Cloudera 4.4; in this version even the oozie admin -oozie http://localhost:11000/oozie -sharelibupdate command will not work.
To resolve the issue I had followed the below steps:
Create a user in Hue with HDFS access and give it admin privileges.
Using the Hue UI, upload the jar into the /user/oozie/share/lib/sqoop HDFS path,
or you can use the command below:
hadoop fs -put /var/lib/sqoop2/mysql-connector-java.jar /user/oozie/share/lib/sqoop
Once the jar is in place, run the Oozie command.
When I pass the command:
$sqoop create-hive-table --connect 'jdbc:sqlserver://10.100.0.18:1433;username=cloud;password=cloud123;database=hadoop' --table cluster
Some errors and warnings appear, and at the end it says:
Failed to start database '/var/lib/hive/metastore/metastore_db', see the next exception for details [again a list of import errors is displayed]
Finally it says Hive exited with status 9.
What is the problem here? I am new to Sqoop and Hive. Could anyone please help me?
The correct syntax would be
sqoop import --connect 'jdbc:sqlserver://10.100.0.18:1433/hadoop' --username cloud --password cloud123 --table cluster --hive-import
I think you might want to check whether you have write permissions to the specified directory and whether a directory named metastore_db is being created.
This message is usually shown when you're running Sqoop with the default Hive configuration. Hive will by default use the Derby datastore, which is usable only in very basic test use cases. I would recommend reconfiguring your Hive instance to use another relational database as the datastore back end (MySQL, PostgreSQL, Oracle).
Your syntax is all wrong. The syntax is $sqoop tool-name [tool-arguments]
$sqoop import --create-hive-table --connect 'jdbc:sqlserver://10.100.0.18:1433/hadoop' --username cloud --password cloud123 --table cluster
Here is a sample call of a Hive import using Sqoop; this might help you correct your syntax further. Remember that, at a minimum, you need the command below to make it work.
sqoop import --connect jdbc:mysql://localhost/RAWDATA --table geolocation --username root --password hadoop --hive-import --create-hive-table --driver com.mysql.jdbc.Driver --m 1 --delete-target-dir
--connect: the part that reads /RAWDATA is the name of the database on your MySQL instance that contains the geolocation table. You can execute the 'show databases' and 'show tables' commands in MySQL to check your databases and tables.
--delete-target-dir is used for safety. It ensures Sqoop deletes the temporary directory it creates to write the file before moving it into Hive. This avoids unnecessary "directory already exists" errors in case you retry the command.
--create-hive-table is required only if you did not already create the target table in Hive. If a previous run of the Sqoop command created the table, you can ignore this option completely. Check your Hive database for the existence of the target Hive table.
--driver specifies the JDBC driver class to use for the database connection. Make sure you either have the right driver library available and the correct class name, or search for the right options. You can try the one pasted above first to see if it does the trick, and revert to this forum for help.
Remember that we did not specify which Hive database the table should be created in, so it will go into Hive's default database. I am not including that option since you are just getting started with Sqoop.