Sqoop Incremental Import - sqoop

Need advice on Sqoop Incremental Imports.
Say I have a Customer with Policy 1 on Day 1 and I imported those records in HDFS on Day 1 and I see them in Part Files.
On Day 2, the same customer adds Policy 2. After the incremental-import sqoop run, will we get only the new records in the part files?
In that case, how do I get both the old and the incremental (appended/last-modified) records using Sqoop?

Consider a table with 3 records which you have already imported to HDFS using sqoop:
+------+------------+----------+------+------------+
| sid | city | state | rank | rDate |
+------+------------+----------+------+------------+
| 101 | Chicago | Illinois | 1 | 2014-01-25 |
| 101 | Schaumburg | Illinois | 3 | 2014-01-25 |
| 101 | Columbus | Ohio | 7 | 2014-01-25 |
+------+------------+----------+------+------------+
sqoop import --connect jdbc:mysql://localhost:3306/ydb --table yloc --username root -P
Now you have additional records in the table but no updates on existing records
+------+------------+----------+------+------------+
| sid | city | state | rank | rDate |
+------+------------+----------+------+------------+
| 101 | Chicago | Illinois | 1 | 2014-01-25 |
| 101 | Schaumburg | Illinois | 3 | 2014-01-25 |
| 101 | Columbus | Ohio | 7 | 2014-01-25 |
| 103 | Charlotte | NC | 9 | 2013-04-22 |
| 103 | Greenville | SC | 9 | 2013-05-12 |
| 103 | Atlanta | GA | 11 | 2013-08-21 |
+------+------------+----------+------+------------+
Here you should use --incremental append with --check-column, which specifies the column to be examined when determining which rows to import.
sqoop import --connect jdbc:mysql://localhost:3306/ydb --table yloc --username root -P --check-column rank --incremental append --last-value 7
The above code will insert all the new rows based on the last value.
Now consider the second case, where there are updates to existing rows:
+------+------------+----------+------+------------+
| sid | city | state | rank | rDate |
+------+------------+----------+------+------------+
| 101 | Chicago | Illinois | 1 | 2015-01-01 |
| 101 | Schaumburg | Illinois | 3 | 2014-01-25 |
| 101 | Columbus | Ohio | 7 | 2014-01-25 |
| 103 | Charlotte | NC | 9 | 2013-04-22 |
| 103 | Greenville | SC | 9 | 2013-05-12 |
| 103 | Atlanta | GA | 11 | 2013-08-21 |
| 104 | Dallas | Texas | 4 | 2015-02-02 |
| 105 | Phoenix | Arizona | 17 | 2015-02-24 |
+------+------------+----------+------+------------+
Here we use --incremental lastmodified, which fetches all new and updated rows based on the date column.
sqoop import --connect jdbc:mysql://localhost:3306/ydb --table yloc --username root -P --check-column rDate --incremental lastmodified --last-value 2014-01-25 --target-dir yloc/loc

In answer to your first question, it depends on how you run the import statement. If you use the --incremental append option, you specify your --check-column and --last-value arguments. These dictate exactly which records are pulled, and they are simply appended to your destination.
For example: you could specify a DATE-type column for --check-column and a very early date (like '1900-01-01', or Day 1 in your case) for --last-value, and this would just keep appending everything in the source table to your destination, creating duplicate rows. In that case, the new part files created would hold both new and old records. You could also use an increasing ID column and keep supplying a small starting ID, which would have the same effect. However, if --last-value is Day 2, the additional part files will contain only the new records. In case you were wondering whether you would lose the old records: you won't.
The lastmodified mode for --incremental is only useful if, in the future, you go back and update some attributes of an existing row. In that case, it pulls the updated version of the row that is now in your source table (along with any new rows), replacing the old data. Hope this helps!
Oh, all of this is based on the Sqoop User Guide, Section 7.2.7: https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports
and Chapter 3 of the Apache Sqoop Cookbook (that chapter is actually fantastic!)
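To make the lastmodified replacement described above concrete: a lastmodified import can be pointed at the existing target directory together with --merge-key, so Sqoop folds updated rows into the previously imported data instead of leaving two copies. A minimal sketch based on the ydb/yloc example, assuming the table has a unique key column (called id here; the sample yloc data above does not actually have one):
sqoop import --connect jdbc:mysql://localhost:3306/ydb --table yloc --username root -P \
--check-column rDate --incremental lastmodified --last-value '2014-01-25' \
--merge-key id --target-dir yloc/loc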

Step 1: The entire table is imported. This will be available as a part-m file in your specified HDFS location (say /user/abc/def/part-m-00000).
Step 2: Only the incremental records are imported. These will be available in another location (say /user/abc/def1/part-m-00000).
Now that both sets of data are available, you can use the sqoop merge option to consolidate them based on the key column.
Refer to the doc below for more details:
https://sqoop.apache.org/docs/1.4.3/SqoopUserGuide.html#_literal_sqoop_merge_literal
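In practice that merge step might look like the sketch below, following the example in the Sqoop User Guide; the jar and class name come from the codegen output of the earlier import, and the merge key (id) is a placeholder for your table's key column:
sqoop merge --new-data /user/abc/def1 --onto /user/abc/def \
--target-dir /user/abc/merged \
--jar-file datatypes.jar --class-name Foo \
--merge-key id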

Let's take an example: you have a customer table with two columns, custid and policy, where custid is the primary key, and you only want to import data from custid 100 onward.
Scenario 1: append new data based on the custid field
Phase 1:
The 3 records below were recently inserted into the customer table and we want to import them into HDFS:
| custid | Policy |
| 101 | 1 |
| 102 | 2 |
| 103 | 3 |
Here is the sqoop command for that:
sqoop import \
--connect jdbc:mysql://localhost:3306/db \
--username root -P \
--table customer \
--target-dir /user/hive/warehouse/<your db>/<table> \
--append \
--check-column custid \
--incremental append \
--last-value 100
Phase 2:
The 4 records below were recently inserted into the customer table and we want to import them into HDFS:
| custid | Policy |
| 104 | 4 |
| 105 | 5 |
| 106 | 6 |
| 107 | 7 |
Here is the sqoop command for that:
sqoop import \
--connect jdbc:mysql://localhost:3306/db \
--username root -P \
--table customer \
--target-dir /user/hive/warehouse/<your db>/<table> \
--append \
--check-column custid \
--incremental append \
--last-value 103
So these four options must be considered for inserting new records:
--append \
--check-column <primary key> \
--incremental append \
--last-value <Last Value of primary key which sqoop job has inserted in last run>
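If you don't want to track --last-value by hand between runs, a Sqoop saved job stores the incremental state and substitutes the new last value automatically on each execution. A minimal sketch (the job name customer_incr is just a placeholder):
sqoop job --create customer_incr -- import \
--connect jdbc:mysql://localhost:3306/db \
--username root -P \
--table customer \
--target-dir /user/hive/warehouse/<your db>/<table> \
--append \
--check-column custid \
--incremental append \
--last-value 100
sqoop job --exec customer_incr    # each run picks up where the previous one stopped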
Scenario 2: append new data + update existing data based on the custid field
Recently, 1 new record with custid 108 was inserted and custid 101 and 102 were updated in the customer table; we want to import these into HDFS:
| custid | Policy |
| 108 | 8 |
| 101 | 11 |
| 102 | 12 |
sqoop import \
--connect jdbc:mysql://localhost:3306/db \
--username root -P \
--table customer \
--target-dir /user/hive/warehouse/<your db>/<table> \
--append \
--check-column custid \
--incremental lastmodified \
--last-value 107
So these four options must be considered for inserting/updating records in the same command:
--append \
--check-column <primary key> \
--incremental lastmodified \
--last-value <Last Value of primary key which sqoop job has inserted in last run>
I specifically mention the primary key because, if the table does not have one, a few more options need to be considered:
Multiple mappers perform the sqoop job by default, so the data needs to be split on the basis of some key. So
either we have to explicitly pass the -m 1 option to say that only one mapper will perform this operation,
or we have to specify another key (using the sqoop option --split-by) through which the data can be uniquely identified; for example:
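Here is a sketch of the same incremental import for a table without a primary key; it assumes custid is still reasonably unique so it can act as the split column (otherwise drop --split-by and use -m 1 instead):
sqoop import \
--connect jdbc:mysql://localhost:3306/db \
--username root -P \
--table customer \
--target-dir /user/hive/warehouse/<your db>/<table> \
--append \
--check-column custid \
--incremental append \
--last-value 100 \
--split-by custid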

You could also try a free-form query which is altered based on a specific condition. You could write Java code using the Sqoop client to do the same:
How to use Sqoop in Java Program?

In such use cases, always look for fields that are genuinely incremental in nature for incremental append.
For lastmodified, the best-suited field is modified_date or some similar field that marks rows changed since you last sqoop-ed them. Only those rows will be updated; adding newer rows to your HDFS location requires incremental append.

There are already great responses here. Along with these, you could also try the Sqoop query approach: you can customize your query based on a condition to retrieve the updated records.
STEP 1: Importing New Records from the Database Table:
Example 1:
$ sqoop import \
--query 'SELECT a.*, b.* FROM a JOIN b on (a.id == b.id) WHERE $CONDITIONS' \
--split-by a.id --target-dir /tmp/MyNewloc
Example 2:
sqoop import --connect "jdbc:jtds:sqlserver://MYPD22:1333;databaseName=myDb" --target-dir /tmp/MyNewloc --fields-terminated-by \| --username xxx --password='xxx' --query "select * from Policy_Table where Policy_ID > 1 AND \$CONDITIONS" -m1
Don't forget to supply $CONDITIONS in the Where Clause.
Please refer to Sqoop Free Form Import.
STEP 2: Merging part-m files of both base table (original data) & New Table (New Records)
You could do this using 2 methods.
Method 1 - Using Sqoop Merge
Method 2 - Copying newly generated part-m files into original table target directory. (Copy part-m files from /tmp/MyNewloc to /tmp/MyOriginalLoc/)
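A sketch of Method 2 with the example paths above; the new part file is renamed on copy so it does not clash with the part-m files already in the original directory (note this simply appends the new records, it does not reconcile updated rows the way sqoop merge does):
# copy the new part file next to the originals, renaming to avoid clobbering existing part-m files
hadoop fs -cp /tmp/MyNewloc/part-m-00000 /tmp/MyOriginalLoc/part-m-00000-new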
STEP 3: CREATING HIVE TABLE
1) Now create a Hive table using the original table's target directory as the Location, which contains both the original part-m files and the new-records part-m files.
CREATE EXTERNAL TABLE IF NOT EXISTS Policy_Table(
Policy_ID string,
Customer_Name string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/tmp/MyOriginalLoc/';

Here's a step-by-step guide for Sqoop incremental imports.
As an overview: use append mode only when the rows in your source table do not get updated, or when you don't care about the updates; use lastmodified when you also want to capture updates to already-imported data.

Related

Does Sqoop incremental lastmodified option include the date, mentioned in last value, in result?

My SQL table is
mysql> select * from Orders;
+--------+-----------+------------+-------------+-------------+
| ord_no | purch_amt | ord_date | customer_id | salesman_id |
+--------+-----------+------------+-------------+-------------+
| 70001 | 150.50 | 2012-10-05 | 3005 | 5002 |
| 70009 | 270.65 | 2012-09-10 | 3001 | 5005 |
| 70002 | 65.26 | 2012-10-05 | 3002 | 5001 |
| 70004 | 110.50 | 2012-08-17 | 3009 | 5003 |
| 70007 | 948.50 | 2012-09-10 | 3005 | 5002 |
| 70005 | 999.99 | 2012-07-27 | 3007 | 5001 |
| 70008 | 999.99 | 2012-09-10 | 3002 | 5001 |
| 70010 | 999.99 | 2012-10-10 | 3004 | 5006 |
| 70003 | 999.99 | 2012-10-10 | 3009 | 5003 |
| 70012 | 250.45 | 2012-06-27 | 3008 | 5002 |
| 70011 | 75.29 | 2012-08-17 | 3003 | 5007 |
| 70013 | 999.99 | 2012-04-25 | 3002 | 5001 |
+--------+-----------+------------+-------------+-------------+
I ran a Sqoop import:
sqoop import --connect jdbc:mysql://ip-172-31-20-247:3306/sqoopex \
--username sqoopuser --password <hidden> --table Orders \
--target-dir SqoopImp2 --split-by ord_no --check-column ord_date \
--incremental lastmodified --last-value '2012-09-10'
As per the Sqoop 1.4.6 manual, quoted below:
An alternate table update strategy supported by Sqoop is called lastmodified
mode. You should use this when rows of the source table may be updated, and each such update will set the value of a last-modified column to the current timestamp. Rows where the check column holds a timestamp more recent than the timestamp specified with --last-value are imported
I was not expecting rows with date '2012-09-10' in the output. However, my output, as shown below,
[manojpurohit17834325@ip-172-31-38-146 ~]$ hadoop fs -cat SqoopImp2/*
70001,150.50,2012-10-05,3005,5002
70002,65.26,2012-10-05,3002,5001
70003,999.99,2012-10-10,3009,5003
70007,948.50,2012-09-10,3005,5002
70009,270.65,2012-09-10,3001,5005
70008,999.99,2012-09-10,3002,5001
70010,999.99,2012-10-10,3004,5006
contains rows with date 2012-09-10. Note: the output directory was not present earlier; it was created by this sqoop execution.
From this execution, I see that the date given in --last-value is included in the output, which is contrary to what is mentioned in the manual. Please help me understand this discrepancy and correct me if I am missing something here.
Yes, the --last-value row is also included in the results.
In --incremental import, 2 modes are available: i) append ii) lastmodified
append: in this mode it just checks --last-value and imports from the last value onwards.
--> it does not import any previous rows, even if they were updated
lastmodified: it is similar to append mode, but it imports new rows and also re-imports previously imported rows if they were updated.
Note: lastmodified works only on date or timestamp type columns.

sqoop import is not moving entire table in hdfs

I had created a small database with a few tables in MySQL. Now I am transferring a table to HDFS using Sqoop.
Below is the sqoop command:
sqoop import --connect jdbc:mysql://localhost/sqooptest --username root -P --table emp --m 1 --driver com.mysql.jdbc.Driver
I am not getting the last 2 columns, salary and dept.
Output of the above command
1201gopalmanager
1202manishaProof reader
1203khalilphp dev
1204prasanthphp dev
1205kranthiadmin
The MySQL table is:
+------+----------+--------------+--------+------+
| id | name | deg | salary | dept |
+------+----------+--------------+--------+------+
| 1201 | gopal | manager | 50000 | TP |
| 1202 | manisha | Proof reader | 50000 | TP |
| 1203 | khalil | php dev | 30000 | AC |
| 1204 | prasanth | php dev | 30000 | AC |
| 1205 | kranthi | admin | 20000 | TP |
+------+----------+--------------+--------+------+
I tried using "--fields-terminated-by ," and "--input-fields-terminated-by ," but both failed.
Also, when I use a higher mapper count (e.g. --m 3), I still get only a single file in HDFS.
I am using Apache Sqoop on an Ubuntu machine.
Thanks in advance for finding a solution. :)
Your command seems correct. Below are some steps you can follow once again to see if it works:
1) Create the table and populate it (MySQL)
mysql> create database sqooptest;
mysql> use sqooptest;
mysql> create table emp (id int, name varchar(100), deg varchar(50), salary int, dept varchar(10));
mysql> insert into emp values(1201, 'gopal','manager',50000,'TP');
mysql> insert into emp values(1202, 'manisha','Proof reader',50000,'TP');
mysql> insert into emp values(1203, 'khalil','php dev',30000,'AC');
mysql> insert into emp values(1204, 'prasanth','php dev',30000,'AC');
mysql> insert into emp values(1205, 'kranthi','admin',20000,'TP');
mysql> select * from emp;
+------+----------+--------------+--------+------+
| id | name | deg | salary | dept |
+------+----------+--------------+--------+------+
| 1201 | gopal | manager | 50000 | TP |
| 1202 | manisha | Proof reader | 50000 | TP |
| 1203 | khalil | php dev | 30000 | AC |
| 1204 | prasanth | php dev | 30000 | AC |
| 1205 | kranthi | admin | 20000 | TP |
+------+----------+--------------+--------+------+
2) Run the import
$ sqoop import --connect jdbc:mysql://localhost/sqooptest --username root -P --table emp --m 1 --driver com.mysql.jdbc.Driver --target-dir /tmp/sqoopout
3) Check result
$ hadoop fs -cat /tmp/sqoopout/*
1201,gopal,manager,50000,TP
1202,manisha,Proof reader,50000,TP
1203,khalil,php dev,30000,AC
1204,prasanth,php dev,30000,AC
1205,kranthi,admin,20000,TP
HDFS has only one file (part-m-00000):
$ hadoop fs -ls /tmp/sqoopout
Found 2 items
/tmp/sqoopout/_SUCCESS
/tmp/sqoopout/part-m-00000
This is because the data size is small, and one mapper is sufficient to process it. You can verify this by looking at the sqoop logs, which show:
Job Counters
Launched map tasks=1
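If you specifically want several part files, drop --m 1, raise the mapper count, and give Sqoop a column to split on; with a table this small, some of the resulting files may hold only a row or two. A sketch (/tmp/sqoopout3 is a placeholder for a new, non-existent target directory):
sqoop import --connect jdbc:mysql://localhost/sqooptest --username root -P --table emp --split-by id -m 3 --driver com.mysql.jdbc.Driver --target-dir /tmp/sqoopout3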

sqoop: import-all-tables not importing all tables

I used:
sqoop import-all-tables --m 1 --connect jdbc:mysql://quickstart.cloudera:3306/retail_db --username retail_dba --password cloudera --hive-import --create-hive-table --hive-overwrite --hive-database default --warehouse-dir /user/hive/warehouse
I see that only the categories table is imported. We have 6 tables in MySQL.
After importing this table, I see the categories dir, and the command does not exit.
When I log in to Hive, I don't see any tables under the default database.
I am using the default settings that come with CDH 5.12.
I have not changed any configurations. Please advise.
There is no issue with your command. Check whether the default schema has any tables before you run the command.
Or create a new DB and execute the command.
hive> create database retaildb;
OK
Time taken: 0.38 seconds
hive> use retaildb;
OK
Time taken: 0.023 seconds
hive> show tables;
OK
mysql> show tables;
+---------------------+
| Tables_in_retail_db |
+---------------------+
| avrotable |
| categories |
| customers |
| departments |
| departments_new |
| order_items |
| orders |
| products |
| products_replica |
| tablewithboolean |
| test |
+---------------------+
Execute the sqoop command:
sqoop import-all-tables --m 1 --connect jdbc:mysql://quickstart.cloudera:3306/retail_db --username retail_dba --password cloudera \
--hive-import --create-hive-table --hive-overwrite --hive-database retaildb --warehouse-dir /user/hive/warehouse/retail_db
hive> show tables;
OK
avrotable
categories
customers
departments
departments_new
order_items
orders
products
products_replica
tablewithboolean
test
Time taken: 0.24 seconds, Fetched: 11 row(s)

List columns with sqoop

I have found the following commands very useful to see what my source database looks like:
sqoop-list-databases
sqoop-list-tables
However, there does not appear to be a command to list the columns in a table, which would be a logical step.
My question is now:
How can I get the list of columns from a table via Sqoop?
Unfortunately, there is no command like sqoop-list-columns; however, with some creativity there is a workaround:
Run an import that imports the field names.
Here is an example, for how this can be done when connecting to a SQL Server database:
sqoop import --m 1 --connect 'jdbc:sqlserver://nameofmyserver; database=nameofmydatabase; username=dennisjaheruddin; password=mypassword' --query "SELECT column_name, DATA_TYPE FROM INFORMATION_SCHEMA.Columns WHERE table_name='mytableofinterest' AND \$CONDITIONS" --target-dir 'mytableofinterest_column_name'
This will retrieve the column names and write them to a file, which you can then inspect manually.
Of course this can be expanded to get other metadata (e.g. field types).
Note that you will need a slightly different SELECT statement if you are connecting to a different database type, but that should be easy to find.
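For instance, against MySQL the same workaround might look like this (a sketch; the connection details and table name are placeholders):
sqoop import --m 1 --connect 'jdbc:mysql://localhost:3306/mydatabase' --username root -P --query "SELECT column_name, data_type FROM information_schema.columns WHERE table_name = 'mytableofinterest' AND \$CONDITIONS" --target-dir 'mytableofinterest_columns'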
Use "Describe DB_name.table_name" in sqoop eval.
sqoop eval --connect Connection_string --username username --password password --query "describe DB_Name.Table_Name"
If you just want to know the column names for some source table, below is the easiest way to do it:
SQOOP EVAL
Below is an example of pulling the column names from an Oracle database; it is the same for any RDBMS.
sqoop eval -libjars /var/lib/sqoop/ojdbc6.jar --connect jdbc:oracle:thin:@hostname:portnumber/servicename --username user -password password --query "select * from schemaname.tablename where rownum <= 10"
This will print the schema on the terminal. In the query section you can run any query that you want to run on the RDBMS.
Let's say you want to store the output; just append it to a file as below.
sqoop eval -libjars /var/lib/sqoop/ojdbc6.jar --connect jdbc:oracle:thin:@hostname:portnumber/servicename --username user -password password --query "select * from schemaname.tablename where rownum <= 10" >> sqoop_results.txt
Use the below command in a sqoop eval query:
"describe database_name.table_name"
This command worked for me:
sqoop eval --connect jdbc:mysql://localhost/database_name --username root -P --query "describe database_name.table_name"
The command output will be as below :
[cloudera@localhost ~]$ sqoop eval --connect jdbc:mysql://localhost/mytestdb --username root -P --query "describe mytestdb.CustomersNew3"
17/01/26 22:13:08 INFO sqoop.Sqoop: Running Sqoop version: 1.4.3-cdh4.7.0
Enter password:
17/01/26 22:13:12 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
| COLUMN_NAME | COLUMN_TYPE | IS_NULLABLE | COLUMN_KEY | COLUMN_DEFAULT | EXTRA |
| ID | int(11) | NO | | (null) | |
| NAME | varchar(20) | NO | | (null) | |
| AGE | int(11) | NO | | (null) | |
| ADDRESS | char(25) | YES | | (null) | |
| SALARY | decimal(18,2) | YES | | (null) | |
You can leverage the sqoop eval command for this purpose.
Example in Netezza:
sqoop eval --connect 'jdbc:netezza://host:port/db' --username 'your_user' --password 'your_pass' --query "SELECT column_name, DATA_TYPE FROM INFORMATION_SCHEMA.Columns WHERE table_name='your_table'"
This will output result to console:
-----------------------------------------------
| column_name | data_type |
-----------------------------------------------
| col1 | bigint |
| col2 | bigint |
| col3 | bigint |
| col4 | integer |

Sqoop export using update key

I have to export an HDFS file into MySQL.
Let's say my HDFS file is:
1,abcd,23
2,efgh,24
3,ijkl,25
4,mnop,26
5,qrst,27
and say my MySQL table schema is:
+-----+-----+-------------+
| ID | AGE | NAME |
+-----+-----+-------------+
| | | |
+-----+-----+-------------+
When I'm inserting using the following Sqoop command:
sqoop export \
--connect jdbc:mysql://localhost/DBNAME \
--username root \
--password root \
--export-dir /input/abc \
--table test \
--fields-terminated-by "," \
--columns "id,name,age"
It's working fine and inserting into the database.
But when I need to update already existing records, I have to use --update-key and --columns.
Now, when I try to update the table using the following command:
sqoop export \
--connect jdbc:mysql://localhost/DBNAME \
--username root \
--password root \
--export-dir /input/abc \
--table test \
--fields-terminated-by "," \
--columns "id,name,age" \
--update-key id
I'm facing an issue where the data is not being updated in the columns specified in --columns.
Am I doing anything wrong?
Can't we update the database this way? Does the HDFS file have to match the MySQL schema exactly in order to update?
Is there any other way to achieve this?
4b. Update data from HDFS into a table in a relational database
Create the emp table in the MySQL test db:
create table emp
(
id int not null primary key,
name varchar(50)
);
vi emp --> create file with below contents
1,Thiru
2,Vikram
3,Brij
4,Sugesh
Move the file to hdfs
hadoop fs -put emp <dir>
Execute the below sqoop job to export the data to MySQL:
sqoop export --connect <jdbc connection> \
--username sqoop \
--password sqoop \
--table emp \
--export-dir <dir> \
--input-fields-terminated-by ',';
Verify the data in the mysql table
mysql> select * from emp;
+----+--------+
| id | name |
+----+--------+
| 1 | Thiru |
| 2 | Vikram |
| 3 | Brij |
| 4 | Sugesh |
+----+--------+
Update the emp file and move the updated file into HDFS.
Contents of the updated file:
1,Thiru
2,Vikram
3,Sugesh
4,Brij
5,Sagar
Sqoop export for upsert - Update if the key matches else insert.
sqoop export --connect <jdbc connection> \
--username sqoop \
--password sqoop \
--table emp \
--update-mode allowinsert \
--update-key id \
--export-dir <dir> \
--input-fields-terminated-by ',';
Note: with --update-mode <mode> we can pass two arguments. "updateonly" updates the records: it will update a record only if the update key matches.
If you want to do an upsert (if exists UPDATE, else INSERT), then use the "allowinsert" mode.
example:
--update-mode updateonly \ --> for updates
--update-mode allowinsert \ --> for upsert
Verify the results:
mysql> select * from emp;
+----+--------+
| id | name |
+----+--------+
| 1 | Thiru |
| 2 | Vikram |
| 3 | Sugesh |--> Previous value "Brij"
| 4 | Brij |--> Previous value "Sugesh"
| 5 | Sagar |--> new value inserted
+----+--------+
Just try with --update-key primary_key
sqoop export --connect jdbc:mysql://localhost/DBNAME -username root -password root --export-dir /input/abc --table test --fields-terminated-by "," --update-key id
It worked for me. It updates all records matching the primary key. (It may not insert new data.)
Make use of --update-mode updateonly/allowinsert wisely
You may want to try with --input-fields-terminated-by.
Currently you are using fields-terminated-by, which is meant for imports.
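A sketch of the question's export command adjusted accordingly (with --update-key kept from the question; --update-mode allowinsert is optional and only needed if rows missing from MySQL should also be inserted):
sqoop export \
--connect jdbc:mysql://localhost/DBNAME \
--username root \
--password root \
--export-dir /input/abc \
--table test \
--input-fields-terminated-by "," \
--columns "id,name,age" \
--update-key id \
--update-mode allowinsert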
I actually tried this with Sqoop in multiple ways. --update-key can only update rows already present in the table and can't insert new ones unless you also set --update-mode to allowinsert (which is not supported by all databases). If you try an update using --update-key, it will update the rows matching the key mentioned in --update-key.

Resources