How to import only new data by using Sqoop? - hadoop

Let me give an example: I exported 1 TB of data yesterday. Today, the database got another 1 GB of data. If I try to import the data again today, Sqoop will import the full 1 TB + 1 GB, and then I have to merge it, which is a headache. I want to import only the new data and append it to the old data. That way, on a daily basis, I'll pull the RDBMS data into HDFS.

You can use sqoop Incremental Imports:
Sqoop provides an incremental import mode which can be used to retrieve only rows newer than some previously-imported set of rows.
Incremental import arguments:
--check-column (col) Specifies the column to be examined when determining which rows to import.
--incremental (mode) Specifies how Sqoop determines which rows are new. Legal values for mode include append and lastmodified.
--last-value (value) Specifies the maximum value of the check column from the previous import.
Reference: https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports
For Incremental Import: You would need to specify a value in a check column against a reference value for the most recent import. For example, if the --incremental append argument was specified, along with --check-column id and --last-value 100, all rows with id > 100 will be imported. If an incremental import is run from the command line, the value which should be specified as --last-value in a subsequent incremental import will be printed to the screen for your reference. If an incremental import is run from a saved job, this value will be retained in the saved job. Subsequent runs of sqoop job --exec someIncrementalJob will continue to import only rows newer than those previously imported.
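For instance, a minimal sketch of an append-mode import and the equivalent saved job (the connection string, table name, id column and paths are made-up values):

sqoop import \
  --connect jdbc:mysql://dbhost/mydb \
  --username myuser --password-file /user/myuser/db.password \
  --table orders \
  --incremental append --check-column id --last-value 100 \
  --target-dir /data/orders

# As a saved job, so Sqoop's metastore tracks --last-value between runs itself
sqoop job --create orders_incr -- import \
  --connect jdbc:mysql://dbhost/mydb \
  --username myuser --password-file /user/myuser/db.password \
  --table orders \
  --incremental append --check-column id --last-value 0 \
  --target-dir /data/orders
sqoop job --exec orders_incr

The saved-job form is usually the more convenient one for a daily pull, since you do not have to feed the new last value back in by hand.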
For importing all the tables in one go, you would need to use the sqoop-import-all-tables command, but this command must satisfy the below criteria to work:
Each table must have a single-column primary key.
You must intend to import all columns of each table.
You must not intend to use non-default splitting column, nor impose any conditions via a WHERE clause.
Reference: https://hortonworks.com/community/forums/topic/sqoop-incremental-import/
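A hedged example of the all-tables import (the connection details and warehouse directory are assumptions):

# Imports every table in mydb into /data/mydb/<table_name>
sqoop import-all-tables \
  --connect jdbc:mysql://dbhost/mydb \
  --username myuser --password-file /user/myuser/db.password \
  --warehouse-dir /data/mydb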

Related

After performing a Sqoop import from an RDBMS, how to check whether the data was imported properly into Hive

Are there any tools available?
Normally I check manually, using count(*), min, max, and select ... where queries on both the RDBMS table and the Hive table. Is there any other way?
Please use --validate in sqoop import or export to compare the row counts between the source and the destination.
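For example (the connection details and table name here are hypothetical):

# --validate compares the source and HDFS row counts after the import finishes
sqoop import \
  --connect jdbc:mysql://dbhost/mydb \
  --username myuser --password-file /user/myuser/db.password \
  --table customers \
  --target-dir /data/customers \
  --validate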
Update: column-level checking.
There is no built-in parameter in Sqoop to achieve this, but you can do it as follows:
1. Store the imported data in a temp table.
Use a shell script for the steps below:
2. Get the data from the source table and compare it with the temp table using shell variables.
3. If it matches, copy the data from the temp table to the original table.
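A minimal sketch of steps 2 and 3 as a shell script (the mysql and hive commands, table names, and the count(*) check are assumptions; you could add min/max or per-column checks the same way):

#!/bin/bash
# Compare row counts between the source table and the Hive temp table
SRC_COUNT=$(mysql -N -e "SELECT COUNT(*) FROM customers" mydb)
TMP_COUNT=$(hive -S -e "SELECT COUNT(*) FROM customers_tmp")

if [ "$SRC_COUNT" -eq "$TMP_COUNT" ]; then
  # Counts match: promote the temp data into the real Hive table
  hive -S -e "INSERT INTO TABLE customers SELECT * FROM customers_tmp"
else
  echo "Row counts differ: source=$SRC_COUNT temp=$TMP_COUNT" >&2
  exit 1
fi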

IMPDP: How to import only table data

I tried to import (TABLES, PROCEDURE, FUNCTION, etc.) from a dump file. I made a mistake by
executing KILL -9 <PROCESS_ID> while the import was still going on.
So, I started the import again. Then I made another mistake by NOT specifying
TABLE_EXISTS_ACTION=TRUNCATE. So, tables have been imported with duplicate records.
I want to get rid of the duplicate data. There are more than 500 tables involved.
I am planning to import again by first truncating the tables and then importing the data only.
Below is the import command I have come up with. Will this command import ONLY table data (records) by first
truncating the table and then inserting only the data?
impdp DIRECTORY=MY_DIRECTORY dumpfile=EXP_MY_DUMP.dmp INCLUDE=TABLE_DATA TABLE_EXISTS_ACTION=TRUNCATE
I could try executing it myself and find out whether it works, but I have already tried twice and failed.
Also, I don't want to import INDEX, SEQUENCES, etc. again. Just table records.
Remove INCLUDE=TABLE_DATA. With TABLE_EXISTS_ACTION=TRUNCATE the import will not execute CREATE TABLE for existing tables; that should work.
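A hedged sketch of the adjusted command, following the suggestion above (DIRECTORY and DUMPFILE are taken from the question; the LOGFILE name is an assumption):

# Existing tables are truncated and reloaded from the dump instead of being recreated
impdp DIRECTORY=MY_DIRECTORY DUMPFILE=EXP_MY_DUMP.dmp TABLE_EXISTS_ACTION=TRUNCATE LOGFILE=reimport.log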

Can a single sqoop job be used for multiple tables and be run at the same time?

I just started hands-on with Sqoop. I have a question: let's say I have 300 tables in a database and I want to perform an incremental load on those tables. I understand I can do incremental imports with either append mode or lastmodified.
But do I have to create 300 jobs if the only things that vary per job are the table name, the CDC column, and the last value/updated value?
Has anyone tried using the same job and passing the above as parameters, read from a text file in a loop, to execute the same job for all the tables in parallel?
What is the industry standard and what are the recommendations?
Also, is there a way to truncate and reload the Hadoop tables that are very small, instead of performing CDC and merging the tables later?
There is import-all-tables ("Import tables from a database to HDFS").
However, it does not provide a way to change the CDC column for each table.
Also see sqoop import multiple tables.
There is no truncate option, but the same can be achieved with the following:
--delete-target-dir ("Delete the import target directory if it exists")
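As a sketch of the loop idea from the question (the control file tables.txt, its comma-separated format, and the connection details are all assumptions):

# tables.txt, one line per table: table_name,check_column,last_value
while IFS=',' read -r TABLE COL LASTVAL; do
  sqoop import \
    --connect jdbc:mysql://dbhost/mydb \
    --username myuser --password-file /user/myuser/db.password \
    --table "$TABLE" \
    --incremental append --check-column "$COL" --last-value "$LASTVAL" \
    --target-dir "/data/$TABLE" &   # & runs the imports in parallel
done < tables.txt
wait   # block until every backgrounded import has finished

In practice you would also capture the new last value printed by each run (or use Sqoop saved jobs, which store it for you) before the next day's execution.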

How does Sqoop treat updated rows during import?

Say there is a table in Oracle (or any RDBMS) which contains data that is flushed out every day.
example:
1234,Raj,Kolkata,1000,09092015
Suppose I import this row today using a standard Sqoop import and store it in HDFS as a flat file. The next day, the row is deleted from the source table. But the same record is updated (say the sal field 1000 is updated to 2000) after 7 days.
If I run a Sqoop import again, how will it treat the data and how will it be stored?
Will there be two entries of the same record, or will the newer value be updated?
will this record
<1234, Raj, Kolkata, 1000, 09092015>
be replaced by this one?
<1234, Raj, Kolkata, 2000, 09092015>
If you perform incremental imports in Sqoop, you can control what happens when one of the rows is updated as well as what happens when new rows are inserted by means of using the argument --incremental. You have two options:
append (sqoop import (...) --incremental append): this option is used when new rows are continually added to your database and you want to import them. In this case, you need to tell Sqoop the column it has to check (in order to detect these new rows) by means of the --check-column parameter.
lastmodified (sqoop import (...) --incremental lastmodified): this option is what you want in your example; it lets you tell Sqoop that you want to check for updated rows in the table (that you already imported) and set them to the new values. Bear in mind that you have to specify, by means of the --check-column parameter, the column name which Sqoop will use to detect the updated rows, and that this column is required to hold a date value (for instance, date, datetime, time or timestamp). In your example you would need an extra column holding a date value, and you should update that value every time you change any of the other columns, in order for that row to be imported.
Of course, if you update a row but do not update the field specified by --check-column, it will not be updated in your destination table.
I hope this helps.
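A hedged sketch of a lastmodified import for this case, assuming a hypothetical LAST_UPD_DT date column has been added to the source table; --merge-key tells Sqoop which key to use when reconciling updated rows with the files it imported earlier (the connection details are made up):

sqoop import \
  --connect jdbc:oracle:thin:@dbhost:1521:ORCL \
  --username myuser --password-file /user/myuser/db.password \
  --table EMP \
  --incremental lastmodified --check-column LAST_UPD_DT \
  --last-value "2015-09-09 00:00:00" \
  --merge-key ID \
  --target-dir /data/emp

With --merge-key, the updated row replaces the previously imported one in the output; with --append instead, the new version would simply be written alongside the old record.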

Oracle Import option

I have a simple question: how exactly do you use the Oracle import tool to import a database with an option that automatically resizes the column length, so that it fits the data before importing?
To give an example: I have a table TABLE1 with a column called "comment" whose length is 250. I'm importing TABLE1 from the source (which is in a Western character set) into the target database (which uses the AL-UTF32 character set), so some of the records' data will grow; e.g. one record's comment field will grow from 250 to 260 because of the character set conversion.
My question is: how do I import TABLE1 so that the target database automatically widens the "comment" field from 250 to the maximum data length of this field (after the character set conversion grows the data), so I can import TABLE1 with no errors?
What is the import option or command line? Is there a way to know which columns cause data issues?
Thank you
Ideally, you would build your target table beforehand, with the column widths you need defined at that point. You would then tailor a sqlldr (SQL*Loader) control file to your input format.
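A minimal sketch of that pre-created target table (the table and column names here are hypothetical); declaring the column with character rather than byte length semantics is one way to absorb the growth from the character-set conversion:

# Pre-create the target table; run once against the target database before loading
sqlplus myuser/mypassword@targetdb <<'EOF'
CREATE TABLE table1 (
  id           NUMBER PRIMARY KEY,
  comment_text VARCHAR2(250 CHAR)  -- 250 characters, regardless of bytes per character in the target charset
);
EXIT;
EOF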
