Import data from Oracle(Windows) to HDFS (CDH3) machine using sqoop - hadoop

Hi, I am taking a training in Hadoop. I have a task in which I have to import a table's data from Oracle (Windows, 11g XE) to HDFS using Sqoop. I am reading the article linked below. My question is: how exactly do I import data from Windows to HDFS? Normally I use WinSCP to transfer files from Windows to the HDFS machine. I have imported data from MySQL, which was installed on the HDFS (CDH3) machine itself, but I don't know how to import data from Oracle on Windows to HDFS. Please help.
Link that I am following

Following is the step-wise process:
1. Connect to the Oracle SQL command line and log in with your credentials,
e.g. username: system, password: system
(make sure that this user has all administrative privileges, or connect as sysdba in Oracle and make a new user with all privileges).
Create a user with all privileges in Oracle.
Create tables under that user, insert some values, and commit.
2. Now we need a connector for transferring our data from Oracle to HDFS.
So, we need to download the Oracle JDBC driver jar file (ojdbc6.jar) and place it in the following path on CDH3 (use sudo in your commands while copying to this path, as it requires admin access in Linux):
/usr/lib/sqoop/bin
http://www.oracle.com/technetwork/database/enterprise-edition/jdbc-112010-090769.html -- download link for ojdbc6.jar
Use WinSCP to transfer the downloaded jar from Windows to CDH3, then move it to the above-mentioned path on CDH3.
3. Command:
sudo bin/sqoop import --connect jdbc:oracle:thin:system/system@192.168.XX.XX:1521:xe --username system -P --table system.emp --columns "ID" --target-dir /sqoopoutput1 -m 1
/sqoopoutput1 is the output directory in HDFS where you will get your data; you can change this as per your needs.
-m 1 : this tells the number of mappers for this Sqoop job; here it is 1.
192.168.XX.XX:1521 -- the IP address and listener port of the Windows machine where Oracle is running.
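To verify the import, you can list the target directory in HDFS and inspect the data file. A quick check, assuming the /sqoopoutput1 directory and single mapper from step 3 (with -m 1 the data usually lands in one part file):
# list the files Sqoop wrote to the target directory
hadoop fs -ls /sqoopoutput1
# print the imported rows (comma-delimited)
hadoop fs -cat /sqoopoutput1/part-m-00000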

You don't need to export the data from Oracle to your local machine, copy it to the HDFS machine, and then load it into HDFS.
Sqoop is here to import your RDBMS tables into an HDFS directory directly.
Use command:
sqoop import --connect 'jdbc:oracle:thin:@192.xx.xx.xx:1521:ORCL' --username testuser --password testpassword --table testtable --target-dir /tmp/testdata
Go to the machine on which Sqoop is running and open a terminal (I believe it's Linux). Just fire the above-mentioned command and check --target-dir (I mentioned /tmp/testdata in the example command) in HDFS. You will find files corresponding to your Oracle table there.
Check sqoop docs for more details.
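If you would rather not put the password in plain text on the command line, Sqoop can prompt for it with -P, or (in recent Sqoop versions) read it from a file with --password-file. A sketch of both variants, reusing the placeholder connection above (the password-file path is also a placeholder):
# prompt interactively for the password
sqoop import --connect 'jdbc:oracle:thin:@192.xx.xx.xx:1521:ORCL' --username testuser -P --table testtable --target-dir /tmp/testdata
# or keep the password in a file with restricted permissions and point Sqoop at it
sqoop import --connect 'jdbc:oracle:thin:@192.xx.xx.xx:1521:ORCL' --username testuser --password-file /user/testuser/oracle.password --table testtable --target-dir /tmp/testdata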

Related

Does sqoop import/export create java classes? If it does so, what is the location of these classes?

Does sqoop import/export create java classes? If it does so, where can I see these generated classes? What is the location of these class files?
Does sqoop import/export create java classes?
Yes
If it does so, where can I see these generated classes? What is the location of these class files?
It automatically generates a Java file named after the table in the current working directory on the local system.
You can use --outdir to provide your own path.
Updated as per comment
You can use codegen command for this:
sqoop codegen \
--connect jdbc:mysql://localhost/databasename \
--username username \
--password password \
--table tablename
After the command executes successfully, a path is printed at the end where you can see the generated Java files.
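If you want to control where the generated files go rather than relying on the path printed at the end, codegen also accepts --outdir and --bindir. A sketch with placeholder paths:
# --outdir: where the generated .java sources are written
# --bindir: where the compiled .class and .jar files are written (both paths below are placeholders)
sqoop codegen \
--connect jdbc:mysql://localhost/databasename \
--username username \
--password password \
--table tablename \
--outdir /home/myuser/sqoop-src \
--bindir /home/myuser/sqoop-bin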
This is the complete flow of sqoop commands
User---> SQOOP CLI cmd ----> Sqoop Code GEN -----> Sqoop JAR Writer
----> JAR submission ---> ResourceManager ----> MR operation (5phases) ----> HDFS ----> Ack to Sqoop by MR program
Sqoop internally uses MapReduce (v1 or v2) for its execution (getting data from the DB and storing it in HDFS as comma-delimited values). It first creates a .java source file for the map-reduce program, packages it into a jar, and then submits it.
The .java file is created in the current local directory with the name of the table.
sqoop import --connect jdbc:mysql://localhost/hadoop --table employee -m 1
In this case an "employee.java" is created.

Can SQOOP work with a custom libpath?

I am trying to get some table data imported from PostgreSQL to HDFS using Sqoop. Now due to licensing constraints, Sqoop does not come packaged with JDBC drivers for all JDBC compliant databases. PostgreSQL is one of them. In order to interact with this database, Sqoop needs the relevant JDBC driver to be installed into a preset classpath (typically $SQOOP_HOME/lib).
In my case, the Hadoop administrator does not provide me write access to this predefined classpath. Is there any alternate way to instruct Sqoop client to look into some path (say, my home directory) instead of or in addition to the preset location?
I looked into the official Apache documentation and searched the internet, but could not find an answer. Could anyone please help?
Thanks !
I got this working yesterday. Below are the steps to follow.
Download the appropriate JDBC driver from here
Put the jar file under a directory of your choice. I chose the hadoop cluster user's home directory, i.e. /home/myuser
export HADOOP_CLASSPATH="/home/myuser/postgresql-9.4.1209.jar"
(replace /home/myuser/postgresql-9.4.1209.jar with your path and jar file name)
To perform Sqoop import you may use the below command.
sqoop import \
--connect 'jdbc:postgresql://<postgres_server_url>:<postgres_port>/<db_name>' \
--username <db_user_name> \
--password <db_user_password> \
--table <db_table_name> \
--warehouse-dir <existing_empty_hdfs_directory>
To perform Sqoop export you may use the below command.
sqoop export \
--connect 'jdbc:postgresql://<postgres_server_url>:<postgres_port>/<db_name>' \
--username <db_user_name> \
--password <db_user_password> \
--table <db_table_name> \
--export-dir <existing_hdfs_path_containing_export_data>
As per Sqoop docs,
-libjars <comma-separated list of jars>: specify comma-separated jar files to include in the classpath.
Make sure you use -libjars as the first argument in the command (i.e. immediately after the tool name).
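For example, a sketch of the same import with the driver jar passed via -libjars right after the tool name (the jar path is the one from the earlier step; the angle-bracket values are still placeholders). As the edit and the next answer note, this option may not work reliably, in which case HADOOP_CLASSPATH is the fallback:
sqoop import \
-libjars /home/myuser/postgresql-9.4.1209.jar \
--connect 'jdbc:postgresql://<postgres_server_url>:<postgres_port>/<db_name>' \
--username <db_user_name> \
--password <db_user_password> \
--table <db_table_name> \
--warehouse-dir <existing_empty_hdfs_directory>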
EDIT :
According to docs,
The -files, -libjars, and -archives arguments are not typically used with Sqoop, but they are included as part of Hadoop’s internal argument-parsing system.
So, JDBC client jars need to be put at $SQOOP_HOME/lib.
I recently experienced an issue with this -libjars option: it doesn't work perfectly. The issue is probably inherited from the Hadoop jar command-line option. A possible alternative is to specify your extra jars using the HADOOP_CLASSPATH environment variable.
You have to export path to your driver jar file.
export HADOOP_CLASSPATH=<path_to_driver_jar>.jar
After this, it can correctly pick up the jar file you specified. The -libjars option doesn't correctly pick up the file; I noticed this in Sqoop version 1.4.6.
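Putting the workaround together in one shell session, a minimal sketch (the driver path and the angle-bracket values are placeholders):
# make the driver jar visible to Sqoop for this shell session
export HADOOP_CLASSPATH=/home/myuser/postgresql-9.4.1209.jar
sqoop import \
--connect 'jdbc:postgresql://<postgres_server_url>:<postgres_port>/<db_name>' \
--username <db_user_name> \
--password <db_user_password> \
--table <db_table_name> \
--warehouse-dir <existing_empty_hdfs_directory>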

Issue with sqoop import from mysql to hbase

I am trying to import data from mysql to hbase using sqoop:
sqoop import --connect jdbc:mysql://<hostname>:3306/test --username USERNAME -P --table testtable --direct --hbase-table testtable --column-family info --hbase-row-key id --hbase-create-table
The process runs smoothly, without any error, but the data goes to hdfs and not to hbase.
Here is my setup:
HBase and Hadoop are installed in distributed mode on my three-server cluster, with the NameNode and HBase Master on one server and the DataNodes and Region Servers on the two other servers. Sqoop is installed on the NameNode server only.
I am using Hadoop version 0.20.2-cdh3u3, HBase version 0.90.6-cdh3u4 and Sqoop version 1.3.0-cdh3u3.
Any suggestions on where I am going wrong?
Sqoop's direct connectors usually do not support HBase, and this is definitely the case for the MySQL direct connector. You should drop the --direct option if you need to import data into HBase.
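For example, your command from the question with --direct dropped and everything else unchanged:
sqoop import --connect jdbc:mysql://<hostname>:3306/test --username USERNAME -P --table testtable --hbase-table testtable --column-family info --hbase-row-key id --hbase-create-table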
Here is an example of importing data from MySQL to HBase:
http://souravgulati.webs.com/apps/forums/topics/show/8680714-sqoop-import-data-from-mysql-to-hbase

How to specify Hive database name in command line while importing data from RDBMS into Hive using Sqoop ?

I need to import data from an RDBMS table into a remote Hive machine. How can I achieve this using Sqoop?
In a nutshell, how do I specify the Hive database name and the Hive machine IP in the import command?
Please help me with the appropriate sqoop command.
You should run the sqoop command on the machine where you have Hive installed, because sqoop will look for $HIVE_HOME/bin/hive to execute the CREATE TABLE ... and other statements.
Alternatively, you could use sqoop with the --hive-home command line option to specify where your Hive is installed (just overrides $HIVE_HOME)
To connect to your remote RDBMS:
sqoop import --connect jdbc:mysql://remote-server/mytable --username xxx --password yyy
To import into Hive:
sqoop import --hive-import
You can get a more comprehensive list of commands by looking at this link: http://archive.cloudera.com/cdh/3/sqoop/SqoopUserGuide.html#_literal_sqoop_import_literal
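To address the database-name part of the question directly: depending on your Sqoop version, you can either qualify the table name given to --hive-table with the database, or use the --hive-database option where available. A sketch with placeholder names (mydb, mytable and hivedb are illustrative):
# import mytable from the remote MySQL database mydb into the Hive database hivedb
sqoop import \
--connect jdbc:mysql://remote-server/mydb \
--username xxx --password yyy \
--table mytable \
--hive-import \
--hive-table hivedb.mytable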

I couldn't import the tables from my SQL Server to Hive through Sqoop

When I pass the command:
$sqoop create-hive-table --connect 'jdbc:sqlserver://10.100.0.18:1433;username=cloud;password=cloud123;database=hadoop' --table cluster
Some errors and warnings appear, and at the end it says:
Failed to start database '/var/lib/hive/metastore/metastore_db', see the next exception for details [again a list of import errors is displayed]
Finally it says hive exited with status 9.
What is the problem here? I am new to Sqoop and Hive. Please, can anyone help me?
The correct syntax would be
sqoop import --connect 'jdbc:sqlserver://10.100.0.18:1433;databaseName=hadoop' --username cloud --password cloud123 --table cluster --hive-import
I think you might want to check if you have write permissions to the specified directory and if a directory named metastore_db is being created
This message is usually shown when you're running Sqoop with the default Hive configuration. Hive will by default use the Derby datastore, which is usable only in very basic test use cases. I would recommend reconfiguring your Hive instance to use some other relational database as the datastore back end (MySQL, PostgreSQL, Oracle).
Your syntax is all wrong. Syntax is $sqoop tool-name [tool-arguments]
$sqoop import --create-hive-table --connect 'jdbc:sqlserver://10.100.0.18:1433;databaseName=hadoop' --username cloud --password cloud123 --table cluster
Pasting a sample call of Hive import using Sqoop. This might help you correct your syntax further. Remember that, essentially, you need to give at minimum the below command to make it work.
sqoop import --connect jdbc:mysql://localhost/RAWDATA --table geolocation --username root --password hadoop --hive-import --create-hive-table --driver com.mysql.jdbc.Driver --m 1 --delete-target-dir
--connect: the part which reads /RAWDATA is the database name from your MySQL instance which contains the geolocation table. You can execute the 'show databases' and 'show tables' commands in MySQL to check your databases and tables.
--delete-target-dir is used for safety. It ensures Sqoop deletes the temporary directory it creates to write the file before moving the data into Hive. This avoids unnecessary 'directory already exists' errors in case you retry the command.
--create-hive-table is required only if you did not create the target table in hive already. If your previous runs of sqoop command created the table already, then you can ignore this option completely. Check your hive database for existence of target hive table.
--driver is a mandatory part of the command to perform any database connection. Make sure you either find the right path to the driver library or try googling for options. You can try first the one pasted above to see if it does the trick. You can revert to this forum for help.
Remember that we did not specify which Hive database the table will be created in, so it will go to Hive's default database. I am not giving that option since you are just starting out with Sqoop, but a sketch of how to target a specific database follows below.
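If you do want the table created in a specific Hive database rather than default, one common approach is to qualify --hive-table with the database name. The database mydb below is a placeholder and must already exist in Hive (e.g. created with CREATE DATABASE mydb in the hive shell):
sqoop import --connect jdbc:mysql://localhost/RAWDATA --table geolocation --username root --password hadoop --hive-import --create-hive-table --hive-table mydb.geolocation --driver com.mysql.jdbc.Driver --m 1 --delete-target-dir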
