I have successfully installed Sqoop. Now the problem is how to use it with an RDBMS, and how to load data from the RDBMS into HDFS using Sqoop.
Using Sqoop, you can load data directly into Hive tables or store it in a target directory in HDFS.
If you need to copy data from an RDBMS into a directory:
sqoop import \
  --connect ConnectionString \
  --username username \
  --password Your_Database_Password \
  --table tableName \
  --columns "col1,col2" \
  --target-dir '/tmp/myfolder' \
  --boundary-query 'SELECT MIN(split_column), MAX(split_column) FROM tableName' \
  -m 5 \
  --fields-terminated-by ','
--password: if the database has no password, do not specify it.
--columns: only needed when you want specific columns.
-m 5: sets the number of mappers to 5.
--fields-terminated-by: the field delimiter you want in the target files.
Boundary query: this is optional. If you do not specify it, Sqoop by default runs the min/max lookup as an inner query, which adds up to a complex query.
If you specify it explicitly, it runs as a normal query, which can improve performance.
You may also want to restrict the number of rows, say based on the column ID. Suppose you need the data from ID 1 to 1000: using a boundary query together with split-by, you can restrict the imported data.
--boundary-query "select 0,1000 from employee"
--split-by ID
Split-by: use --split-by on a Sqoop import to specify the column on which the data is split among the mappers. By default, if you do not specify it, Sqoop picks the table's primary key as the split-by column.
Sqoop splits the table's rows on this column and writes them to separate files, one per mapper. The default number of mappers is 4.
Specifying the column yourself may seem unnecessary, but if the table has a composite primary key or no primary key at all, Sqoop cannot pick a split column and may error out.
Note: you should not face this issue if you set the number of mappers to 1. In that case no split-by condition is used, since there is only one mapper, so the query runs fine. This can be done using
-m 1
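To make the splitting behaviour concrete, here is a simplified illustration of how a numeric range can be divided among mappers. It is only a sketch: the function name split_conditions is made up for this example, and the even integer split is an assumption chosen for clarity, not Sqoop's actual splitter code.

```shell
#!/bin/sh
# Illustration only: divide the range [min, max] of a split column into
# one WHERE-style range condition per mapper.
split_conditions() {
  local col="$1" min="$2" max="$3" mappers="$4"
  local span size lo hi
  span=$(( max - min + 1 ))
  size=$(( (span + mappers - 1) / mappers ))   # ceiling division
  lo=$min
  while [ "$lo" -le "$max" ]; do
    hi=$(( lo + size ))
    if [ "$hi" -gt "$max" ]; then
      # Last range: include the upper boundary itself.
      echo "$col >= $lo AND $col <= $max"
    else
      echo "$col >= $lo AND $col < $hi"
    fi
    lo=$hi
  done
}

# Example: boundaries 1..1000 on column id with 4 mappers.
split_conditions id 1 1000 4
```

With a single mapper the function yields one condition covering the whole range, which mirrors why -m 1 needs no split column at all.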
If you need to copy data from an RDBMS into a Hive table:
sqoop import \
  --connect ConnectionString \
  --username username \
  --password Your_Database_Password \
  --table tableName \
  --boundary-query 'SELECT MIN(split_column), MAX(split_column) FROM tableName' \
  -m 5 \
  --hive-import \
  --hive-table serviceorderdb.productinfo
As before, do not specify --password if the database has no password. -m 5 sets the number of mappers to 5; pass -m 1 instead if you want to avoid splitting altogether (give the option only once).
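Running a query instead of importing an entire table:
sqoop import \
  --connect ConnectionString \
  --username username \
  --password Your_Database_Password \
  --query "select name from employees where name like '%s' and \$CONDITIONS" \
  --split-by name \
  -m 5 \
  --target-dir '/tmp/myfolder' \
  --fields-terminated-by ','
The query is wrapped in double quotes so that the single quotes around the LIKE pattern survive, which is why $CONDITIONS is escaped as \$CONDITIONS. -m 5 sets the number of mappers to 5; with more than one mapper, a free-form query also needs --split-by. --fields-terminated-by sets the field delimiter in the target files.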
You may notice the extra token $CONDITIONS in the query. It is there because this time you specified a query explicitly rather than a table. With a table, Sqoop can look up the primary key and build the boundary query itself; with a free-form query it has neither a table nor a primary key to work from. $CONDITIONS is a placeholder that Sqoop replaces at run time with the range condition for each mapper, derived from the boundaries of the query result.
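As a rough illustration of that substitution, the sketch below fills the placeholder in a query with one mapper's range condition. The helper name fill_conditions and the sample query are made up for this example; the real replacement happens inside Sqoop, so treat this only as a way to picture the SQL each mapper ends up running.

```shell
#!/bin/sh
# Illustration only: replace the first literal $CONDITIONS token in a query
# with a parenthesised range condition.
fill_conditions() {
  local query="$1" condition="$2"
  local before after
  case "$query" in
    *'$CONDITIONS'*) ;;
    *) echo "query must contain \$CONDITIONS" >&2; return 1 ;;
  esac
  before=${query%%\$CONDITIONS*}   # text before the token
  after=${query#*\$CONDITIONS}     # text after the token
  printf '%s(%s)%s\n' "$before" "$condition" "$after"
}

# Single quotes keep the shell from expanding $CONDITIONS itself.
fill_conditions 'SELECT id, name FROM employees WHERE $CONDITIONS' 'id >= 1 AND id < 251'
```

Run once per range condition, this gives one concrete query per mapper.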
Checking that your connection is set up properly: for this you can just call list-databases; if you see your databases listed, your connection is fine.
$ sqoop list-databases \
  --connect jdbc:mysql://localhost/ \
  --username root \
  --password pwd
Connection strings for different databases:
MySQL: jdbc:mysql://<hostname>:<port>/<dbname>
jdbc:mysql://127.0.0.1:3306/test_database
Oracle: jdbc:oracle:thin:@//<host_name>:<port_number>/<service_name>
jdbc:oracle:thin:scott/tiger@//myhost:1521/myservicename
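If you script several imports, a small helper can assemble these connection strings from their parts. This is a minimal sketch: the function name jdbc_url is hypothetical, and the PostgreSQL and SQL Server forms follow the sample commands elsewhere in this thread rather than anything Sqoop itself provides.

```shell
#!/bin/sh
# Sketch: build a JDBC connection string for the databases discussed here.
jdbc_url() {
  local kind="$1" host="$2" port="$3" db="$4"
  case "$kind" in
    mysql)      echo "jdbc:mysql://${host}:${port}/${db}" ;;
    postgresql) echo "jdbc:postgresql://${host}:${port}/${db}" ;;
    oracle)     echo "jdbc:oracle:thin:@//${host}:${port}/${db}" ;;
    sqlserver)  echo "jdbc:sqlserver://${host}:${port};databaseName=${db}" ;;
    *)          echo "unsupported database kind: $kind" >&2; return 1 ;;
  esac
}

jdbc_url mysql 127.0.0.1 3306 test_database
jdbc_url oracle myhost 1521 myservicename
```

The result can then be passed to --connect, quoted so that the semicolon in the SQL Server form is not interpreted by the shell.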
You may learn more about sqoop imports from : https://sqoop.apache.org/docs/1.4.1-incubating/SqoopUserGuide.html
Using the sqoop import command, you can import data from an RDBMS into HDFS, Hive and HBase.
sqoop import --connect jdbc:mysql://localhost:portnumber/DBName --username root --table emp --password root -m 1
With this command, the data is stored in HDFS.
Sample commands to run sqoop import (load data from RDBMS to HDFS):
Postgres
sqoop import --connect jdbc:postgresql://postgresHost/databaseName \
  --username username --password 123 --table tableName
MySQL
sqoop import --connect jdbc:mysql://mysqlHost/databaseName --username username --password 123 --table tableName
Oracle*
sqoop import --connect jdbc:oracle:thin:@oracleHost:1521/databaseName --username USERNAME --password 123 --table TABLENAME
SQL Server
sqoop import --connect 'jdbc:sqlserver://sqlserverhost:1433;database=dbname;username=<username>;password=<password>' --table tableName
*Sqoop won't find any columns from a table if you don't specify both the username and the table in correct case. Usually, specifying both in uppercase will resolve the issue.
Read the Sqoop User's Guide: https://sqoop.apache.org/docs/1.4.5/SqoopUserGuide.html
I also recommend the Apache Sqoop Cookbook. You will learn how to use import and export tools, do incremental import jobs, save jobs, solve problems with jdbc drivers and much more. http://shop.oreilly.com/product/0636920029519.do
Can you please help me with the points below?
I have an Oracle database with a huge number of records today, say 5 TB of data, so we can use the Sqoop validation framework: it will validate the data and import it into HDFS.
Then, suppose tomorrow I receive new records on top of that data. How can I import only those new records into the existing directory and validate them using the Sqoop validation framework?
In short, my requirement is to use the Sqoop validation framework when new records arrive and are imported into HDFS.
Please help. Thanks.
Sipra
My understanding is that you need to check the Oracle database for new records before you start your delta process. I don't think you can validate based on the size of the records, but an offset or timestamp column would be helpful for validation.
How do I know whether there are new records in Oracle since the last run/job/check?
You can do this with two Sqoop import approaches; examples and explanations for both follow.
sqoop incremental
The following is an example of a Sqoop incremental import:
sqoop import --connect jdbc:mysql://localhost:3306/ydb --table yloc --username root -P --check-column rDate --incremental lastmodified --last-value 2014-01-25 --target-dir yloc/loc
This link explained it : https://www.tutorialspoint.com/sqoop/sqoop_import.html
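One way to avoid tracking --last-value by hand is a saved Sqoop job, which records the last imported value after each incremental run. The sketch below only prints the commands for review; the job name yloc_incremental is hypothetical, and the import arguments are copied from the example above.

```shell
#!/bin/sh
# Sketch: wrap the incremental import in a saved job (sqoop job --create),
# then re-run it with sqoop job --exec. Nothing is executed here.
job_name=yloc_incremental   # hypothetical job name

create_job_cmd() {
  echo "sqoop job --create $job_name -- import" \
       "--connect jdbc:mysql://localhost:3306/ydb --table yloc --username root -P" \
       "--check-column rDate --incremental lastmodified --last-value 2014-01-25" \
       "--target-dir yloc/loc"
}

exec_job_cmd() {
  echo "sqoop job --exec $job_name"
}

# Dry run: print both commands so they can be checked before use.
create_job_cmd
exec_job_cmd
```

The create command is run once; the exec command is the one you would schedule for each delta load.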
sqoop import using query option
Here you basically use the where condition in the query and pull the data which is greater than the last received date or offset column.
Here is the syntax for it:
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba --password cloudera \
--query 'select * from sample_data where $CONDITIONS AND salary > 1000' \
--split-by salary \
--target-dir hdfs://quickstart.cloudera/user/cloudera/sqoop_new
Isolating the validation and import jobs
If you want to run the validation and the import job independently, Sqoop has another utility, sqoop eval. With it you can run a query on the RDBMS and direct the output to a file or to a variable in your code, and use that for validation as you see fit.
Syntax:
$ sqoop eval \
--connect jdbc:mysql://localhost/db \
--username root \
--query "SELECT * FROM employee LIMIT 3"
Explained here : https://www.tutorialspoint.com/sqoop/sqoop_eval.htm
validation parameter in sqoop
You can use this parameter to validate the counts between what’s imported/exported between RDBMS and HDFS
--validate
More on that : https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#validation
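As far as I read the validation section of the user guide, --validate only covers plain single-table imports into HDFS and is listed as not supporting free-form queries, --where, incremental imports, or imports into Hive/HBase. Assuming that list holds for your Sqoop version, a small pre-check like the one below can flag an incompatible argument list before a long job starts. The function name can_validate and the target directory are made up for this sketch.

```shell
#!/bin/sh
# Sketch: refuse --validate when the import uses an option the user guide
# lists as unsupported for validation (assumption: the list applies to your version).
can_validate() {
  local arg
  for arg in "$@"; do
    case "$arg" in
      --query|--where|--incremental|--hive-import|--hbase-table)
        echo "--validate is not expected to work with $arg" >&2
        return 1 ;;
    esac
  done
  return 0
}

# Dry run: print the command only if the argument list passes the check.
if can_validate import --connect jdbc:mysql://localhost/db --table employee --target-dir /tmp/employee; then
  echo "sqoop import --connect jdbc:mysql://localhost/db --table employee --target-dir /tmp/employee --validate"
fi
```

For incremental loads, that leaves the sqoop eval approach above (comparing counts yourself) as the fallback.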
I have been using Sqoop to import data from MySQL to Hive. The command I used is below:
sqoop import --connect jdbc:mysql://localhost:3306/datasync \
--username root --password 654321 \
--query 'SELECT id,name FROM test WHERE $CONDITIONS' --split-by id \
--hive-import --hive-database default --hive-table a \
--target-dir /tmp/yfr --as-parquetfile
The Hive table is created and the data is inserted, however I can not find the parquet file.
Does anyone know?
Best regards,
Feiran
A Sqoop import to Hive works in 2 steps:
Fetching the data from the RDBMS into HDFS
Creating the Hive table if it does not exist and loading the data into it
In your case,
the data is first stored at --target-dir, i.e. /tmp/yfr.
Then it is loaded into the Hive table a using a
LOAD DATA INPATH ... INTO TABLE ...
command.
As mentioned in the comments, the data is moved to the Hive warehouse directory; that is why there is no data left in --target-dir.
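To find where the files ended up, you can work out the table's directory under the Hive warehouse. The helper below is a sketch that assumes the default warehouse root /user/hive/warehouse and a managed table without a custom LOCATION; the function name hive_table_dir is made up for this example.

```shell
#!/bin/sh
# Sketch: compute the HDFS directory of a managed Hive table.
# Tables in the default database sit directly under the warehouse root;
# other databases get a "<database>.db" subdirectory.
hive_table_dir() {
  local warehouse="$1" database="$2" table="$3"
  if [ "$database" = "default" ]; then
    echo "${warehouse}/${table}"
  else
    echo "${warehouse}/${database}.db/${table}"
  fi
}

hive_table_dir /user/hive/warehouse default a   # prints /user/hive/warehouse/a
```

Listing that directory with hadoop fs -ls should then show the imported files.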
I have an accounts table in a MySQL database.
It has around 19654 records. I used Sqoop to import the table data into HDFS. It created four files in HDFS with the data evenly distributed.
Then I executed the SQL statement below on the database:
update accounts set modified = now() where acct_num in (1,2,3,4) ;
Then I executed the Sqoop command below:
sqoop import --table accounts --connect jdbc:mysql://localhost/loudacre \
  --username training --password training \
  --incremental lastmodified \
  --check-column modified --last-value '2014-03-18 13:29:47.0' \
  --merge-key acct_num --target-dir /accounts/
After it completed, it had created only one file with only around 10 entries, and these do not even include the new timestamp value.
I was just trying to update the rows that have a new timestamp. Can anyone help?
I am trying to append data to an already existing table in Hive. Using the following command, I first imported the table from MS SQL Server into Hive.
Sqoop Command:
sqoop import --connect "jdbc:sqlserver://XXX.XX.XX.XX;databaseName=mydatabase" --table "my_table" --where "Batch_Id > 100" --username myuser --password mypassword --hive-import
Now I want to append the data where "Batch_Id < 100" to the same existing table in Hive.
I am using the following Command:
sqoop import --connect "jdbc:sqlserver://XXX.XX.XX.XX;databaseName=mydatabase" --table "my_table" --where "Batch_Id < 100" --username myuser --password mypassword --append --hive-table my_table
This command runs successfully and also updates the HDFS data, but when you connect to the Hive shell and query the table, the appended records are not visible.
Sqoop updated the data on HDFS at "/user/hduser/my_table", but the data at "/user/hive/warehouse/batch_dim" is not updated.
How can I resolve this issue?
Regards,
Bhagwant Bhobe
Try using
sqoop import --connect "jdbc:sqlserver://XXX.XX.XX.XX;databaseName=mydatabase" \
  --table "my_table" --where "Batch_Id < 100" \
  --username myuser --password mypassword \
  --hive-import --hive-table my_table
When you are using --hive-import, do NOT use the --append parameter.
The Sqoop command you're using (import without --hive-import) only ingests records into HDFS. You need to add the --hive-import flag to import records into Hive.
See http://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_importing_data_into_hive for more details and for additional import configuration options (you may want to change the document reference to your version of Sqoop, of course).
After installing Hadoop and Hive (CDH version), I execute
./sqoop import -connect jdbc:mysql://10.164.11.204/server -username root -password password -table user -hive-import --hive-home /opt/hive/
All goes fine, but when I enter the hive command line and execute show tables, there is nothing.
When I use ./hadoop fs -ls, I can see that /user/(username)/user exists.
Any help is appreciated.
---EDIT-----------
./sqoop import -connect jdbc:mysql://10.164.11.204/server -username root -password password -table user -hive-import --target-dir /user/hive/warehouse
The import fails with:
11/07/02 00:40:00 INFO hive.HiveImport: FAILED: Error in semantic analysis: line 2:17 Invalid Path 'hdfs://hadoop1:9000/user/ubuntu/user': No files matching path hdfs://hadoop1:9000/user/ubuntu/user
11/07/02 00:40:00 ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException: Hive exited with status 10
at com.cloudera.sqoop.hive.HiveImport.executeExternalHiveScript(HiveImport.java:326)
at com.cloudera.sqoop.hive.HiveImport.executeScript(HiveImport.java:276)
at com.cloudera.sqoop.hive.HiveImport.importTable(HiveImport.java:218)
at com.cloudera.sqoop.tool.ImportTool.importTable(ImportTool.java:362)
at com.cloudera.sqoop.tool.ImportTool.run(ImportTool.java:423)
at com.cloudera.sqoop.Sqoop.run(Sqoop.java:144)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at com.cloudera.sqoop.Sqoop.runSqoop(Sqoop.java:180)
at com.cloudera.sqoop.Sqoop.runTool(Sqoop.java:218)
at com.cloudera.sqoop.Sqoop.main(Sqoop.java:228)
Check your hive-site.xml for the value of the property javax.jdo.option.ConnectionURL. If you do not define it explicitly, the default value (jdbc:derby:;databaseName=metastore_db;create=true) uses a relative path for creating the Hive metastore, so the metastore location differs depending on where you launch the process from. This would explain why you cannot see the table via show tables.
Define this property in your hive-site.xml using an absolute path.
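For illustration, the sketch below prints such a property block for a given absolute directory and rejects relative paths. The function name metastore_property and the directory /var/lib/hive are assumptions made for this example; pick whatever location suits your installation.

```shell
#!/bin/sh
# Sketch: emit a hive-site.xml property that pins the embedded Derby
# metastore to an absolute location.
metastore_property() {
  local dir="$1"
  case "$dir" in
    /*) ;;
    *) echo "metastore directory must be an absolute path: $dir" >&2; return 1 ;;
  esac
  cat <<EOF
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=${dir}/metastore_db;create=true</value>
</property>
EOF
}

metastore_property /var/lib/hive   # hypothetical directory
```

The printed block goes inside the configuration element of hive-site.xml, so both Sqoop and the hive command line use the same metastore regardless of the working directory.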
There is no need to create the table in Hive first. Refer to the command below:
sqoop import --connect jdbc:mysql://xxxx.com/DatabaseName --username root --password admin --table tablename --direct -m 1 --hive-import --create-hive-table --hive-table tablename --target-dir '/user/hive/warehouse/tablename' --fields-terminated-by '\t'
Here --table is the MySQL table, and --hive-table (along with the last component of --target-dir) is the table you want to create in Hive.
In my case Hive stores its data in the /user/hive/warehouse directory in HDFS. This is where Sqoop should put it.
So I guess you have to add:
--target-dir /user/hive/warehouse
which is the default location for Hive tables (it might be different in your case).
You might also want to create this table in Hive:
sqoop create-hive-table --connect jdbc:mysql://host/database --table tableName --username user --password password
In my case it creates the table in Hive's default database. You can give it a try.
sqoop import --connect jdbc:mysql://xxxx.com/Database name --username root --password admin --table NAME --hive-import --warehouse-dir DIR --create-hive-table --hive-table NAME -m 1
Hive tables will be created by the Sqoop import process. Please make sure that /user/hive/warehouse has been created in your HDFS. You can browse HDFS at http://localhost:50070/dfshealth.jsp (the "Browse the File System" option).
Also include the HDFS location in --target-dir, i.e. hdfs://:9000/user/hive/warehouse, in the sqoop import command.
First of all, create the table definition in Hive with exactly the same field names and types as in MySQL.
Then perform the import operation.
For Hive import:
sqoop import --verbose --fields-terminated-by ',' --connect jdbc:mysql://localhost/test --table tablename --hive-import --warehouse-dir /user/hive/warehouse --split-by id --hive-table tablename
'id' can be the primary key of the existing table
'localhost' can be your local IP
'test' is the database
the 'warehouse' directory is in HDFS
I think all you need is to specify the Hive table where the data should go.
Add "--hive-table database.tablename" to the sqoop command and remove the --hive-home /opt/hive/. I think that should resolve the problem.