Incremental import - to avoid duplication of rows - Sqoop

Consider a table departments with the below data:
ID: 1, 2, 3, 8000
Name: A, B, C, D
I imported the data into HDFS using Sqoop.
Added 2 new rows with IDs 4 and 5 into MySQL.
Performed an incremental import with last value 3 and mode append.
The imported data has two rows for ID 8000, as the condition used is department_id > 3.
How can I tweak the below command to make sure duplicate rows are not created?
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username=retail_dba \
--password=cloudera \
--table departments \
--target-dir /user/cloudera/dep1 \
--append \
--check-column "department_id" \
--incremental append \
--last-value 3

You cannot tweak this command.
--incremental append is for appending new data where --check-column > --last-value.
For your use case you should use --incremental lastmodified.
With lastmodified, --check-column should be of a date, time, datetime or timestamp data type.
If records were added or updated after --last-value, it will fetch all of those records (new or updated).
Sample command:
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username=retail_dba \
--password=cloudera \
--table departments \
--target-dir /user/cloudera/dep1 \
--incremental lastmodified \
--check-column last_update_date \
--last-value "2015-10-20 06:00:01"
All the records added or updated after "2015-10-20 06:00:01" will be imported.
Check the Sqoop documentation for more details.
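If re-imports should overwrite earlier copies of updated rows instead of appending them, lastmodified mode can also be combined with --merge-key. A minimal sketch, assuming department_id is the table's primary key and last_update_date is a timestamp column maintained on every update:
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username=retail_dba \
--password=cloudera \
--table departments \
--target-dir /user/cloudera/dep1 \
--incremental lastmodified \
--check-column last_update_date \
--merge-key department_id \
--last-value "2015-10-20 06:00:01"
With --merge-key, Sqoop runs a merge step after the import so that each department_id appears only once in the target directory.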

Related

Sqoop import split by column and different database to split that data

I have a requirement.
I need to get data from Teradata database a into Hadoop.
I have access to only a view in that database and need to pull data from that view. As I don't have read/write access to that database, I cannot use the --split-by column option in my Sqoop import query.
So is there an option where I can tell Sqoop to use database b for storing the split data and then move the data into Hadoop?
Query:
sqoop import \
--connect "jdbc:sqlserver://xx.aa.dd.aa;databaseName=a" \
--connection-manager org.apache.sqoop.manager.SQLServerManager \
--username XXXX \
--password XXXX \
--num-mappers 20 \
--query "select * from (select ID,name,x,y,z from TABLE1 where DT between '2018/01/01' and '2018/01/31') as temp_table where updt_date <'2018/01/31' AND \$CONDITIONS" \
--split-by id \
--target-dir /user/XXXX/sqoop_import/XYZ/2018/TABLE1

Sqoop job incremental lastmodified wrong timestamp value

I am trying to create a Sqoop job using incremental lastmodified:
sqoop job --create job_import_test8_by_query_update -- import \
--bindir ./ --connect 'jdbc:mysql://localhost/db?serverTimezone=UTC&useSSL=false' \
--username user \
--password pass \
--table test8 -m 2 \
--incremental lastmodified \
--check-column "timestamp_field" \
--last-value 0 \
--split-by "id" \
--merge-key "id" \
--verbose \
--target-dir /usr/local/sqlImport/1
In this example I am having a problem with the last-value.
Running it the first time, when last-value is "0", works fine. Then the last-value is automatically set to current_local_time + 4 hours, so I am losing some records.
It seems that the last-value takes the server timezone value instead of the last record value from the database.
Thanks for any help!
Try adding the useTimezone option to your connection string
--connect 'jdbc:mysql://localhost/db?useTimezone=true&serverTimezone=UTC'
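Applied to the job above, the create command would look roughly like this; it is a sketch in which only the connection string changes and the remaining arguments are copied from the original command:
sqoop job --create job_import_test8_by_query_update -- import \
--bindir ./ --connect 'jdbc:mysql://localhost/db?useTimezone=true&serverTimezone=UTC&useSSL=false' \
--username user \
--password pass \
--table test8 -m 2 \
--incremental lastmodified \
--check-column "timestamp_field" \
--last-value 0 \
--split-by "id" \
--merge-key "id" \
--verbose \
--target-dir /usr/local/sqlImport/1
With the timezone handled consistently on the JDBC connection, the saved last-value should line up with the latest timestamp_field value instead of the shifted local time.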

impala incremental last modified

I have a Sqoop import to bring in data from Oracle with a join on two tables. I need to do an --incremental lastmodified import based on a column which is common to both tables:
--query "SELECT customer_info.customer_id, customer_info.customer_name,
customer.date_created, sales_info.last_update_date as sales_last_update_date
from customer_info
inner join
sales_info ON customer_info.customer_id = sales_info.customer_id
AND \$CONDITIONS" \
--split-by "customer_id" \
--fields-terminated-by '\t' \
--target-dir (name_of_dir) \
--incremental lastmodified \
--check-column sales_last_update_date \
The last_update_date column is common to both tables.
But I get the error:
ORA-00904: "sales_last_update_date": invalid identifier

Unable to import data from MySql using Sqoop with different delimiter

As a beginner in the Hadoop field, I was trying my hands on the Sqoop tool (version: Sqoop 1.4.6-cdh5.8.0).
Though I referred to various sites and forums, I could not get a workable solution in which I could import data using any delimiter other than ,.
Below is the code that I have used:
--- Connecting to MySQL, creating a table and records with , in the strings.
mysql> create database GRHadoop;
Query OK, 1 row affected (0.00 sec)
mysql> use GRHadoop;
Database changed
mysql> Create table sitecustomer(Customerid int(10), Customername varchar(100),Productid int(4),Salary int(20));
Query OK, 0 rows affected (0.22 sec)
mysql> Insert into sitecustomer values(1,'Sohail',100,50000),(2,'Reshma',200,80000),(3,'Tom',200,60000);
Query OK, 3 rows affected (0.06 sec)
Records: 3 Duplicates: 0 Warnings: 0
mysql> Insert into sitecustomer values(4,'Su,kama',300,50000),(5,'Ram,bha',100,80000),(6,'Suz',200,60000);
Query OK, 3 rows affected (0.03 sec)
Records: 3 Duplicates: 0 Warnings: 0
Sqoop Command :
sqoop import \
--connect jdbc:mysql://127.0.0.1:3306/GRHadoop \
--username root \
--password cloudera \
--table sitecustomer \
--input-fields-terminated-by '|' \
--lines-terminated-by "\n" \
--target-dir /user/cloudera/GR/Sqoop/sitecustomer_data \
--m 1;
Expected Output :
1|Sohail|100|50000
2|Reshma|200|80000
3|Tom|200|60000
4|Su,kama|300|50000
5|Ram,bha|100|80000
6|Suz|200|60000
Actual output :
1,Sohail,100,50000
2,Reshma,200,80000
3,Tom,200,60000
4,Su,kama,300,50000
5,Ram,bha,100,80000
6,Suz,200,60000
Please guide me on where I am getting it wrong.
The --input-fields-terminated-by argument tells Sqoop how to parse input files during an export. You should be using --fields-terminated-by instead; that argument controls how the imported output is formatted.
sqoop import \
--connect jdbc:mysql://127.0.0.1:3306/GRHadoop \
--username root \
--password cloudera \
--table sitecustomer \
--fields-terminated-by '|' \
--lines-terminated-by "\n" \
--target-dir /user/cloudera/GR/Sqoop/sitecustomer_data \
--m 1;
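To confirm the delimiter took effect, you can inspect the generated file in the target directory (with a single mapper the output lands in part-m-00000):
hadoop fs -cat /user/cloudera/GR/Sqoop/sitecustomer_data/part-m-00000
The rows should now appear pipe-delimited, matching the expected output above.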

Incremental sqoop from oracle to hdfs with condition

I am doing an incremental sqoop from Oracle to HDFS, giving a where condition like
(LST_UPD_TMST >TO_TIMESTAMP('2016-05-31T18:55Z', 'YYYY-MM-DD"T"HH24:MI"Z"')
AND LST_UPD_TMST <= TO_TIMESTAMP('2016-09-13T08:51Z', 'YYYY-MM-DD"T"HH24:MI"Z"'))
But it is not using the index. How can I force an index so that Sqoop can be faster by considering only the filtered records?
What is the best option to do an incremental sqoop? The table size in Oracle is in TBs.
The table has billions of rows, and after the where condition it is down to some millions.
You can use --where or --query with a where condition in the select to filter the import results.
I am not sure about your full Sqoop command, but give it a try this way:
sqoop import \
--connect jdbc:oracle:thin:@//db.example.com/dbname \
--username dbusername \
--password dbpassword \
--table tablename \
--columns "column,names,to,select,in,comma,separated" \
--where "(LST_UPD_TMST >TO_TIMESTAMP('2016-05-31T18:55Z', 'YYYY-MM-DD\"T\"HH24:MI\"Z\"') AND LST_UPD_TMST <= TO_TIMESTAMP('2016-09-13T08:51Z', 'YYYY-MM-DD\"T\"HH24:MI\"Z\"'))" \
--target-dir {hdfs/location/to/save/data/from/oracle} \
--incremental lastmodified \
--check-column LST_UPD_TMST \
--last-value {from Date/Timestamp to Sqoop in incremental}
See the Sqoop documentation for more details about incremental loads.
Update
For incremental imports, a Sqoop saved job is recommended so that --last-value is maintained automatically.
sqoop job --create {incremental job name} \
-- import \
--connect jdbc:oracle:thin:@//db.example.com/dbname \
--username dbusername \
--password dbpassword \
--table tablename \
--columns "column,names,to,select,in,comma,separated" \
--incremental lastmodified \
--check-column LST_UPD_TMST \
--last-value 0
Here --last-value 0 imports everything from the start on the first run; the latest value will then be passed automatically on the next invocation by the Sqoop job.
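The saved job is then run, and its stored state inspected, with the standard sqoop job commands, for example:
sqoop job --exec {incremental job name}
sqoop job --show {incremental job name}
--exec runs the import and updates the stored last value; --show prints the job definition, including the currently stored last value.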
