How does Sqoop treat updated rows during import? - hadoop

Suppose there is a table in Oracle (or any RDBMS) which contains data that is flushed out every day.
example:
1234,Raj,Kolkata,1000,09092015
Suppose I import this row today using a standard Sqoop import and store it in HDFS as a flat file. The next day, the row is deleted from the source table. But if the same record is updated after 7 days (say the sal field 1000 is updated to 2000) and I run a Sqoop import again, how will it treat the data and how will it store it?
Will there be two entries of the same record, or will the old value be replaced with the newer one?
Will this record
<1234, Raj, Kolkata, 1000, 09092015>
be replaced by this one?
<1234, Raj, Kolkata, 2000, 09092015>

If you perform incremental imports in Sqoop, you can control what happens when existing rows are updated, as well as what happens when new rows are inserted, by means of the --incremental argument. You have two options:
append (sqoop import (...) --incremental append). This option is used when new rows are continually added to your database and you want to import them. In this case, you need to tell Sqoop which column to check in order to detect these new rows, by means of the --check-column parameter.
lastmodified (sqoop import (...) --incremental lastmodified). This option is what you want in your example: it tells Sqoop to check for updated rows in the table you already imported and bring in their new values. Bear in mind that you have to specify, by means of the --check-column parameter, the column Sqoop will use to detect the updated rows, and that this column must hold a date value (for instance date, datetime, time, or timestamp). In your example you would need an extra column holding a date value, and you would have to update that value every time any of the other columns changes, so that the row gets imported again.
Of course, if you update a row but don't update the field specified by --check-column for that row, it will not be updated in your destination.
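As a minimal sketch (the connection string, table, and column names such as EMP, EMPNO, and LAST_UPD_TS are only placeholders), a lastmodified import could look like this:
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username myuser --password-file /user/myuser/.pw \
  --table EMP \
  --incremental lastmodified \
  --check-column LAST_UPD_TS \
  --last-value "2015-09-09 00:00:00" \
  --merge-key EMPNO \
  --target-dir /data/emp
The --merge-key option tells Sqoop which key to use when reconciling updated rows with the data already in the target directory, so an updated record replaces the old entry instead of being appended as a duplicate.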
I hope this helps.

Related

Deduplication in Oracle

Situation:-
Table 'A' receives data from an Oracle GoldenGate feed. Each incoming record is flagged as New, Updated, or Duplicate (N/U/D) and either creates a new record or rewrites the old one based on that characteristic. Every entry in the table has an UpdatedTimeStamp column containing the insertion timestamp.
Scope:-
To write a stored procedure in Oracle that pulls the data for a time period based on the UpdatedTimeStamp column and publishes XML using DBMS_XMLGEN.
How can I ensure that a duplicate entered in the table is not processed again?
FYI: I am currently filtering via a new table that I created, named 'A-stg', which holds the old data inserted incrementally.
As far as I understood the question, there are a few ways to avoid duplicates.
The most obvious is to use DISTINCT, e.g.
select distinct data_column from your_table
Another one is to use the timestamp column and get only the last (or the first?) value, e.g.
select data_column, max(timestamp_column)
from your_table
group by data_column
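If you need the whole row rather than just the key, a sketch of the same idea (the column names are only placeholders) is to join the table back to that aggregate:
select t.*
from your_table t
join (
  select data_column, max(timestamp_column) as max_ts
  from your_table
  group by data_column
) latest
  on t.data_column = latest.data_column
 and t.timestamp_column = latest.max_ts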

incremental import in sqoop on a table with jumbled data and no modified time column

Suppose I have a table Customer :
CustomerID CustomerName CustomerBill
7 John 100
2 Bill 500
4 Mark 200
Here CustomerID is the primary key but the records are in no particular order. There is no modified time column in the corresponding table in the database. The previous entries can change as well. How do I do incremental imports on the data?
The database I am using is Sybase and importing it to Hive.
Records are in no particular order, so append mode cannot be used.
There is no modified time column in the corresponding table in the database, so lastmodified mode cannot be used.
Sqoop does not do anything special here. It needs an incrementing ID or an updated timestamp to build a SQL query that fetches only the inserted/updated records.
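For context, a sketch of what the two incremental modes boil down to (last_updated is a hypothetical column this table does not have):
-- append mode: fetch only rows whose key is greater than the last imported value
SELECT * FROM Customer WHERE CustomerID > 7
-- lastmodified mode: fetch only rows touched since the last import
SELECT * FROM Customer WHERE last_updated >= '2015-09-09 00:00:00'
Neither predicate can be written for this table: updates do not change CustomerID, and there is no last_updated column.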

PLSQL Daily record of changes on table, then select from day

Oracle PL/SQL question: one table should be archived day by day. The table holds about 50,000 records, but only a few records change during a day. A second table (the destination/history table) has one additional field, import_date. Two days = 100,000 records, but it should be 50,000 plus the few records with information about the changes during a day.
I need a simple solution to copy data from the source table to the destination like a "LOG": only changes are copied/registered, but I should still have the possibility to check the source table's data set from a given day.
Is there a mechanism like MERGE or something similar?
Normally you'd have a day_table and a master_table. All records are loaded from the day_table into the master, and only the master is manipulated; the day table is used to store the raw data.
You could add a new column to the master, such as date_modified, and have the app update this field when a record changes, or use a flag to indicate that the record has changed.
Another way to do this is to have an active/latest flag. Instead of changing the record, it is duplicated, with a flag set to indicate which is the newer and which is the old record. This might make comparisons easier,
e.g. select * from master_table where record = 'abcd'
This would show 2 rows - the original loaded at 1pm and the modified active one changed at 2pm.
There's no need to have another table; you could then base a view on this flag, e.g.
create view CHANGED_RECORDS_VIEW as select * from master_table where flag = 'Y'
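A minimal sketch of the duplicate-and-flag idea (the record, some_value, flag, and modified_date columns are only placeholders):
-- mark the currently active row as superseded
update master_table set flag = 'N' where record = 'abcd' and flag = 'Y'
-- insert the changed version as the new active row
insert into master_table (record, some_value, flag, modified_date)
values ('abcd', 'new value', 'Y', sysdate)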
I once faced a similar issue; please find the solution below.
Tables we had:
A master table that always has records in it and keeps growing.
A backup table to store all the master records on a daily basis.
Solution:
From morning to evening, records are inserted into and updated in the master table. New records were identified via a timestamp: whenever a record is inserted or updated, the corresponding timestamp is set and kept.
At night, a job schedule was created to run a procedure (CREATE_JOB; please check the Oracle documentation for further details) at exactly 10:00 pm, which bulk collects all the records available in the master table for today's date and inserts them into the backup table.
This scenario should help you; please also check out the concept of job scheduling. Thank you.
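A rough sketch of that nightly archive, under assumed table and column names (master_table, backup_table, id, data_column, updated_ts, import_date) and using a plain INSERT ... SELECT in place of the BULK COLLECT loop:
create or replace procedure archive_master_daily as
begin
  -- copy the rows touched today into the history table, stamped with the run date
  insert into backup_table (id, data_column, updated_ts, import_date)
  select id, data_column, updated_ts, trunc(sysdate)
    from master_table
   where updated_ts >= trunc(sysdate);
  commit;
end;
/
begin
  -- schedule the procedure to run every night at 10:00 pm
  dbms_scheduler.create_job(
    job_name        => 'ARCHIVE_MASTER_NIGHTLY',
    job_type        => 'PLSQL_BLOCK',
    job_action      => 'begin archive_master_daily; end;',
    start_date      => systimestamp,
    repeat_interval => 'FREQ=DAILY; BYHOUR=22; BYMINUTE=0; BYSECOND=0',
    enabled         => true);
end;
/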

How to import only new data by using Sqoop?

Let me give an example: I imported 1 TB of data yesterday. Today, the database got another 1 GB of data. If I import again today, Sqoop will import 1 TB + 1 GB of data, and then I have to merge it, which is a headache. I want to import only the new data and append it to the old data. That way, on a daily basis, I'll pull the RDBMS data into HDFS.
You can use Sqoop incremental imports:
Sqoop provides an incremental import mode which can be used to retrieve only rows newer than some previously-imported set of rows.
Incremental import arguments:
--check-column (col) Specifies the column to be examined when determining which rows to import.
--incremental (mode) Specifies how Sqoop determines which rows are new. Legal values for mode include append and lastmodified.
--last-value (value) Specifies the maximum value of the check column from the previous import.
Reference: https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports
For incremental import, you need to specify a value in a check column against a reference value from the most recent import. For example, if the --incremental append argument is specified, along with --check-column id and --last-value 100, all rows with id > 100 will be imported. If an incremental import is run from the command line, the value which should be specified as --last-value in a subsequent incremental import is printed to the screen for your reference. If an incremental import is run from a saved job, this value is retained in the saved job, and subsequent runs of sqoop job --exec someIncrementalJob will continue to import only rows newer than those previously imported.
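A rough sketch of that saved-job workflow (the connection string, table, and job name are only placeholders):
sqoop job --create dailyCustomerImport -- import \
  --connect jdbc:mysql://dbhost/sales \
  --username myuser --password-file /user/myuser/.pw \
  --table customers \
  --incremental append \
  --check-column id \
  --last-value 0 \
  --target-dir /data/customers
sqoop job --exec dailyCustomerImport
The second command can then be run daily; the saved job remembers the last imported id between runs.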
For importing all the tables in one go, you would need to use the sqoop-import-all-tables command, but the following criteria must be satisfied for it to work (see the sketch after the list):
Each table must have a single-column primary key.
You must intend to import all columns of each table.
You must not intend to use a non-default splitting column, nor impose any conditions via a WHERE clause.
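A minimal sketch (connection details and directories are only placeholders):
sqoop import-all-tables \
  --connect jdbc:mysql://dbhost/sales \
  --username myuser --password-file /user/myuser/.pw \
  --warehouse-dir /data/sales
Each table then lands in its own subdirectory under /data/sales.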
Reference: https://hortonworks.com/community/forums/topic/sqoop-incremental-import/

Import most recent data from CSV to SQL Server with SSIS

Here's the deal; the issue isn't with getting the CSV into SQL Server, it's getting it to work how I want it... which I guess is always the issue :)
I have a CSV file with columns like DATE, TIME, BARCODE, etc. I use a derived column transformation to concatenate the DATE and TIME into a DATETIME for my import into SQL Server, and I import all the data into the database. The issue is that we only get a new .CSV file every 12 hours, and for example's sake we will say the .CSV is updated four times in a minute.
With the logic that we will run the job every 15 minutes, we will get a ton of overlapping data. I imagine I will use a variable, say LastCollectedTime, which can be pulled from my SQL database using MAX(READTIME). My problem is that I only want to collect rows with a ReadTime more recent than that variable.
Destination table structure:
ID, ReadTime, SubID, ...datacolumns..., LastModifiedTime where LastModifiedTime has a default value of GETDATE() on the last insert.
Any ideas? Remember, our ReadTime is a derived column; not sure if that matters or not.
Here is one approach that you can make use of:
Let's assume that your destination table in SQL Server is named BarcodeData.
Create a staging table (say BarcodeStaging) in your database that has the same column structure as your destination table BarcodeData; the CSV data will be imported into this staging table.
In the SSIS package, add an Execute SQL Task before the Data Flow Task to truncate the staging table BarcodeStaging.
Import the CSV data into the staging table BarcodeStaging and not into the actual destination table.
Use the MERGE statement (I assume that you are using SQL Server 2008 or a higher version) to compare the staging table BarcodeStaging with the actual destination table BarcodeData, using the DateTime column as the join key. If there are unmatched rows, copy them from the staging table and insert them into the destination table.
Technet link to MERGE statement: http://technet.microsoft.com/en-us/library/bb510625.aspx
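A rough sketch of that MERGE, assuming ReadTime is the join key and Barcode/SubID stand in for the remaining columns:
MERGE dbo.BarcodeData AS target
USING dbo.BarcodeStaging AS source
   ON target.ReadTime = source.ReadTime
WHEN NOT MATCHED BY TARGET THEN
  INSERT (ReadTime, Barcode, SubID)
  VALUES (source.ReadTime, source.Barcode, source.SubID);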
Hope that helps.
