Import all tables from RDBMS using Sqoop - Hadoop

I am trying to import data from a test MySQL database into Hadoop using Sqoop. Some tables have a primary key and some do not.
$sqoop import-all-tables --connect jdbc:mysql://192.168.0.101/mysql -username test -P --warehouse-dir /home/user_all_tables
17/08/01 22:46:54 ERROR tool.ImportAllTablesTool: Error during import:
No primary key could be found for table general_log. Please specify
one with --split-by or perform a sequential import with '-m 1'.
Kindly suggest how to use --split-by on the sqoop command line.

For the import-all-tables tool to be useful, the following conditions must be met:
Each table must have a single-column primary key.
You must intend to import all columns of each table.
You must not intend to use non-default splitting column, nor impose any conditions via a WHERE clause.
The default options do not work for tables without a primary key, which is why the import fails. Here I suggest using the -m 1 option to restrict the import to a single mapper.
Sqoop command:
sqoop import-all-tables --connect jdbc:mysql://192.168.0.101/mysql --username test \
-P --warehouse-dir /home/user_all_tables -m 1
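If you do want parallel imports for a specific table that has no primary key, import that table on its own and name a split column explicitly. A sketch, assuming general_log has an evenly distributed event_time column to split on:
sqoop import --connect jdbc:mysql://192.168.0.101/mysql --username test -P \
--table general_log --split-by event_time \
--warehouse-dir /home/user_all_tables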

Related

How to use sqoop validation?

Can you please help me with the below points.
I have an Oracle database with a huge number of records - suppose 5 TB of data today - so we can use the Sqoop validation framework: it will validate and import into HDFS.
Then suppose tomorrow I receive new records on top of that data. How can I import only those new records into the existing directory, and validate them, using the Sqoop validation framework?
In short, the requirement is: how do I use the Sqoop validator when new records arrive and need to be imported into HDFS?
Please help me, team. Thanks.
Thank You,
Sipra
My understanding is that you need to check the Oracle database for new records before you start your delta process. I don't think you can validate based on the size of the records, but if you have an offset or a timestamp (TS) column, that will be helpful for validation.
How do I know if there are new records in Oracle since the last run/job/check?
You can do this with two sqoop import approaches; following are examples and explanations for both.
sqoop incremental
Following is an example of a sqoop incremental import:
sqoop import --connect jdbc:mysql://localhost:3306/ydb --table yloc --username root -P \
--check-column rDate --incremental lastmodified --last-value 2014-01-25 \
--target-dir yloc/loc
This link explains it: https://www.tutorialspoint.com/sqoop/sqoop_import.html
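If your table has a monotonically increasing integer column instead of a timestamp, append mode works the same way. A sketch, assuming a numeric id column and that rows with id up to 1000 were already imported:
sqoop import --connect jdbc:mysql://localhost:3306/ydb --table yloc --username root -P \
--check-column id --incremental append --last-value 1000 \
--target-dir yloc/loc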
sqoop import using query option
Here you basically put a WHERE condition in the query and pull only the data that is newer than the last received date or offset value.
Here is the syntax for it:
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba --password cloudera \
--query 'select * from sample_data where $CONDITIONS AND salary > 1000' \
--split-by salary \
--target-dir hdfs://quickstart.cloudera/user/cloudera/sqoop_new
Isolate the validation and import job
If you want to run the validation and import jobs independently, there is another utility in Sqoop, sqoop eval. With it you can run a query against the RDBMS and direct the output to a file or to a variable in your code, and use that for validation however you want.
Syntax:
$ sqoop eval \
--connect jdbc:mysql://localhost/db \
--username root \
--query "SELECT * FROM employee LIMIT 3"
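For example, to capture a source row count for your own validation step, you can redirect the output of sqoop eval to a file and parse the small ASCII table it prints. A sketch, reusing the employee table from above (the output path is illustrative):
sqoop eval \
--connect jdbc:mysql://localhost/db \
--username root \
--query "SELECT COUNT(*) FROM employee" > /tmp/source_count.txt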
Explained here: https://www.tutorialspoint.com/sqoop/sqoop_eval.htm
validation parameter in sqoop
You can use this parameter to compare the row counts between the RDBMS and HDFS for what was imported or exported.
--validate
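For instance, appended to a plain single-table import; by default Sqoop compares the source and target row counts and fails the job on a mismatch. A sketch, reusing the employee table from the eval example (the target directory is illustrative):
sqoop import --connect jdbc:mysql://localhost/db --username root -P \
--table employee --target-dir /user/hadoop/employee \
--validate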
More on that: https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#validation

What is the relevance of -m 1?

I am executing the below sqoop command:
sqoop import --connect 'jdbc:sqlserver://10.xxx.xxx.xx:1435;database=RRAM_Temp' \
--username DRRM_DATALOADER --password ****** \
--table T_VND --hive-import --hive-table amitesh_db.amit_hive_test \
--as-textfile --target-dir amitesh_test_hive -m 1
I have two queries:
1) What is the relevance of -m 1? As far as I know, it's the number of mappers that I am assigning to the sqoop job. If that is true, then the moment I assign -m 2, the execution starts throwing the error below:
ERROR tool.ImportTool: Error during import: No primary key could be found for table xxx. Please specify one with --split-by or perform a sequential import with '-m 1'
Now I am forced to revise my understanding: I see it has something to do with the database primary key. Can somebody explain the logic behind this?
2) I have instructed the above sqoop command to save the file in text format. But when I go to the location suggested by the execution, I find tbl_name.jar. Why? If --as-textfile is the wrong syntax, what is the right one? Or is there another location where I can find the file?
1) To set -m or --num-mappers to a value greater than 1, the table must either have a PRIMARY KEY or the sqoop command must be given a --split-by column: Sqoop needs a column whose value range it can partition into one slice per mapper. The "Controlling Parallelism" section of the Sqoop User Guide explains the logic behind this.
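For example, to run the same import with two mappers you could name a split column explicitly. A sketch, where VND_ID stands in for a hypothetical evenly distributed numeric column in T_VND:
sqoop import --connect 'jdbc:sqlserver://10.xxx.xxx.xx:1435;database=RRAM_Temp' \
--username DRRM_DATALOADER --password ****** \
--table T_VND --split-by VND_ID -m 2 \
--hive-import --hive-table amitesh_db.amit_hive_test --as-textfile --target-dir amitesh_test_hive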
2) The file format of the data imported into the Hive table amit_hive_test will be plain text (--as-textfile). Since this is a --hive-import, the data is first imported into the --target-dir and then loaded (LOAD DATA INPATH) into the Hive table. The resulting data ends up under the table's LOCATION, not in --target-dir.
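To see where that is, you can ask Hive for the table's location, for example:
hive -e "DESCRIBE FORMATTED amitesh_db.amit_hive_test" | grep -i location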

Sqoop import converting TINYINT to BOOLEAN

I am attempting to import a MySQL table of NFL play results into HDFS using Sqoop. I issued the following command to achieve this:
sqoop import \
--connect jdbc:mysql://127.0.0.1:3306/nfl \
--username <username> -P \
--table play
Unfortunately, there are columns of type TINYINT, which are being converted to booleans on import. For instance, there is a 'quarter' column indicating which quarter of the game the play occurred in. The value in this column is converted to 'true' if the play occurred in the first quarter and 'false' otherwise.
In fact, I did a sqoop import-all-tables, importing the entire NFL database I have, and it behaves like this uniformly.
Is there a way around this, or perhaps some argument for import or import-all-tables that prevents this from happening?
Add tinyInt1isBit=false to your JDBC connection URL. Something like:
jdbc:mysql://127.0.0.1:3306/nfl?tinyInt1isBit=false
Another solution would be to explicitly override the column mapping for the datatype TINYINT(1) column. For example, if the column name is foo, then pass the following option to Sqoop during import: --map-column-hive foo=tinyint. In the case of non-Hive imports to HDFS, use --map-column-java foo=integer.
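Putting the first option into the original command, the import would look like this sketch:
sqoop import \
--connect "jdbc:mysql://127.0.0.1:3306/nfl?tinyInt1isBit=false" \
--username <username> -P \
--table play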

Importing an external table using multiple conditions in Sqoop

I would like to import some selected rows from an external table into an HDFS directory using sqoop.
Below are the rows in the MySQL table. The columns are name, bank, salary, company:
Surender,HDFC,60000,CTS
Raja,AXIS,80000,TCS
Raj,HDFC,70000,TCS
Kumar,AXIS,70000,CTS
All I need is to have multiple WHERE conditions in the sqoop command. How can I specify multiple conditions?
sqoop import --connect jdbc:mysql://192.891.289.1/testing --username root -P
--query 'select * from records where salary>30000 and bank='HDFC' $CONDITIONS'
--target-dir '/user/cloudera/surender' -m 1
The above query returns an error: "Unknown column 'HDFC' in where clause".
The reason is that you need to put AND before $CONDITIONS, and the shell quoting is broken: the single quotes around 'HDFC' end the shell's single-quoted string, so MySQL sees HDFC as a bare column name, hence "Unknown column". Instead of:
where salary>30000 and bank='HDFC' $CONDITIONS
wrap the whole query in double quotes and use:
where salary>30000 and bank='HDFC' and \$CONDITIONS
(The backslash keeps the shell from expanding $CONDITIONS inside double quotes; Sqoop substitutes it at run time.)
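Putting it together, the full command would be (a sketch):
sqoop import --connect jdbc:mysql://192.891.289.1/testing --username root -P \
--query "select * from records where salary>30000 and bank='HDFC' and \$CONDITIONS" \
--target-dir '/user/cloudera/surender' -m 1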

Configuring Sqoop with MySQL?

I have successfully installed Sqoop. Now the problem is how to connect it to an RDBMS and how to load data from the RDBMS to HDFS using Sqoop.
Using Sqoop, you can load data directly into Hive tables or store it in a target directory in HDFS.
If you need to copy data from an RDBMS into a directory in HDFS:
sqoop import \
--connect ConnectionString \
--username username \
--password Your_Database_Password {in case there is no password, do not specify it} \
--table tableName \
--columns "col1,col2" {in case you need only specific columns} \
--target-dir '/tmp/myfolder' \
--boundary-query 'select min(id), max(id) from tableName' \
-m 5 {set number of mappers to 5} \
--fields-terminated-by ',' {how you want your data to look in the target file}
Boundary Query: This is something you can specify. If you do not specify it, Sqoop by default runs its boundary query (min and max of the split column) as a subquery inside a more complex query, which is slower. If you specify it explicitly, it runs as a plain query, and hence performance improves.
You may also want to restrict the rows imported, say based on a column ID, when you need data only for IDs 1 to 1000. Using a boundary query together with split-by, you can restrict your import to that range:
--boundary-query "select 0,1000 from employee" \
--split-by ID
Split-By: You use --split-by on a Sqoop import to specify the column on which the split is performed. By default, if you do not specify it, Sqoop picks the table's primary key as the split column.
Split-by partitions the table's rows across the mappers, and each mapper writes its own part file under the target folder. By default the number of mappers is 4.
This matters because if the table has a composite primary key, or no primary key at all, Sqoop cannot choose a split column on its own and errors out.
Note: You may not face any issue if you set the number of mappers to 1. In that case no split-by condition is used, since there is only one mapper, so the query runs fine. This can be done using
-m 1
If you need to copy data from an RDBMS into a Hive table:
sqoop import \
--connect ConnectionString \
--username username \
--password Your_Database_Password {in case there is no password, do not specify it} \
--table tableName \
--boundary-query 'select min(id), max(id) from tableName' \
-m 5 {set number of mappers to 5} \
--hive-import \
--hive-table serviceorderdb.productinfo
Running a query instead of importing the entire table:
sqoop import \
--connect ConnectionString \
--username username \
--password Your_Database_Password \
--query "select name from employees where name like '%s' and \$CONDITIONS" \
--split-by employee_id {a split column is required with --query and more than one mapper; employee_id is a hypothetical column} \
-m 5 {set number of mappers to 5} \
--target-dir '/tmp/myfolder' \
--fields-terminated-by ',' {how you want your data to look in the target file}
You may notice the extra token $CONDITIONS. This appears because this time you specified no table, only an explicit query. When Sqoop runs, it has no table or primary key from which to derive per-mapper boundary conditions, so it substitutes its own split predicate for $CONDITIONS in each mapper's copy of the query. That is why the token must appear in the WHERE clause, and why a --split-by column is required when more than one mapper is used.
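For illustration, with --split-by employee_id and two mappers, Sqoop first finds the minimum and maximum of employee_id and then issues one query per mapper, with hypothetical boundary values roughly like:
select name from employees where name like '%s' and ( employee_id >= 1 AND employee_id < 501 )
select name from employees where name like '%s' and ( employee_id >= 501 AND employee_id <= 1000 )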
Checking if your connection is set up properly: for this you can just run list-databases, and if you see your databases listed, then your connection is fine.
$ sqoop list-databases \
--connect jdbc:mysql://localhost/ \
--username root \
--password pwd
Connection strings for different databases:
MySQL: jdbc:mysql://<hostname>:<port>/<dbname>
jdbc:mysql://127.0.0.1:3306/test_database
Oracle: jdbc:oracle:thin:@//host_name:port_number/service_name
jdbc:oracle:thin:scott/tiger@//myhost:1521/myservicename
You may learn more about sqoop imports from: https://sqoop.apache.org/docs/1.4.1-incubating/SqoopUserGuide.html
By using the sqoop import command you can import data from an RDBMS into HDFS, Hive, and HBase.
sqoop import --connect jdbc:mysql://localhost:portnumber/DBName --username root --table emp --password root -m 1
By using this command data will be stored in HDFS.
Sample commands to run sqoop import (load data from RDBMS to HDFS):
Postgres
sqoop import --connect jdbc:postgresql://postgresHost/databaseName
--username username --password 123 --table tableName
MySQL
sqoop import --connect jdbc:mysql://mysqlHost/databaseName --username username --password 123 --table tableName
Oracle*
sqoop import --connect jdbc:oracle:thin:@oracleHost:1521/databaseName --username USERNAME --password 123 --table TABLENAME
SQL Server
sqoop import --connect 'jdbc:sqlserver://sqlserverhost:1433;database=dbname;username=<username>;password=<password>' --table tableName
*Sqoop won't find any columns in a table if you don't specify both the username and the table name in the correct case. Usually, specifying both in uppercase resolves the issue, because Oracle stores unquoted identifiers in uppercase.
Read the Sqoop User's Guide: https://sqoop.apache.org/docs/1.4.5/SqoopUserGuide.html
I also recommend the Apache Sqoop Cookbook. You will learn how to use the import and export tools, run incremental import jobs, save jobs, solve problems with JDBC drivers, and much more. http://shop.oreilly.com/product/0636920029519.do
