In ETL, How to handle new data inserted in Source DB with past Timestamp? - etl

We have a DWH that is connected to several source databases. We recently faced an issue where one of the sources inserted a new set of records with a timestamp in the past (not the actual time of insertion into their DB). We use the timestamp to extract delta records, so these new records are not picked up by our delta extraction. I believe using rowversion would be an ideal solution, but we do not have control over this source and we can't guarantee this won't happen again. What would be a good solution to handle such cases? We use DataStage.
Thanks!
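To make the failure mode concrete, here is a minimal sketch in Python with an in-memory SQLite database standing in for the source. The `orders` table, its columns, and both helper functions are hypothetical, not part of the actual setup. It shows a watermark-based delta missing a backdated row, and a key-based comparison (one possible fallback, assuming the source exposes a stable primary key) picking it up. Note that a key comparison only catches new rows; backdated updates to existing rows would need something like a row hash comparison.

```python
import sqlite3

# Hypothetical source table; 'ts' is the timestamp supplied by the source system.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, ts TEXT, amount REAL)")
src.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "2024-01-01 10:00:00", 10.0), (2, "2024-01-02 10:00:00", 20.0)],
)


def extract_delta(conn, watermark):
    """Timestamp-based delta: rows strictly newer than the last watermark."""
    return conn.execute(
        "SELECT id, ts, amount FROM orders WHERE ts > ? ORDER BY id", (watermark,)
    ).fetchall()


def extract_missing_keys(conn, loaded_ids):
    """Key-based reconciliation: rows whose key the warehouse has never loaded."""
    rows = conn.execute("SELECT id, ts, amount FROM orders ORDER BY id").fetchall()
    return [row for row in rows if row[0] not in loaded_ids]


# First run: load everything and remember the highest timestamp as the watermark.
first_load = extract_delta(src, "")
loaded_ids = {row[0] for row in first_load}
watermark = max(row[1] for row in first_load)

# Later the source inserts a row stamped in the past.
src.execute("INSERT INTO orders VALUES (3, '2023-12-31 09:00:00', 30.0)")

delta = extract_delta(src, watermark)            # misses the backdated row
missed = extract_missing_keys(src, loaded_ids)   # finds it by key
```

Comparing full key sets can be expensive on large tables, so in practice it might run as a periodic reconciliation job next to the regular timestamp delta.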

Related

Best approaches to UPDATE the data in tables - Teradata

I am new to Teradata and fortunately got a chance to work on both DDL and DML statements.
One thing I observed is that Teradata is very slow when it comes to updating a table with a large number of records.
The simplest way I found on Google to perform this update is to write an INSERT-SELECT statement with a CASE on the column holding the values to be updated with new values.
But what about a data warehouse environment, where we need to update multiple columns in a table holding millions of rows?
Which would be the best approach to follow:
INSERT-SELECT only, MERGE-UPDATE, or MLOAD?
I am not sure whether any of the above approaches is unsuitable for this UPDATE operation.
Thank you in advance!
At enterprise level, we expect volumes to be huge and updates are often part of some scheduled jobs/scripts.
With huge volumes of data, updates become a costly operation that involves the risk of blocking the table for some time if the update fails (due to fallback journal). Although scripts are tested well and failures seldom happen in production environments, it's always better to load the data that needs to be updated into a temporary table in the required form, delete the matching records from the target table, and then insert the temporary rows back into it. This maintains SCD-1 (where we don't maintain history).
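The delete-then-insert pattern described above can be sketched as follows. This is only an illustration: SQLite (via Python's `sqlite3`) stands in for Teradata, and the `customer` and `customer_stg` tables are hypothetical. On Teradata the same two statements would typically run from a scheduled script.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer     (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE customer_stg (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
INSERT INTO customer     VALUES (1, 'Ann', 'Oslo'), (2, 'Bob', 'Rome'), (3, 'Cy', 'Lima');
-- Staging holds the rows in their final, already-updated form.
INSERT INTO customer_stg VALUES (2, 'Bob', 'Paris'), (3, 'Cyrus', 'Quito');
""")


def apply_scd1(conn):
    """Replace matching target rows with the staged versions (no history kept)."""
    with conn:  # one transaction: both statements commit or roll back together
        conn.execute("DELETE FROM customer WHERE id IN (SELECT id FROM customer_stg)")
        conn.execute("INSERT INTO customer SELECT id, name, city FROM customer_stg")


apply_scd1(conn)
result = conn.execute("SELECT id, name, city FROM customer ORDER BY id").fetchall()
```

Row 1 is untouched because it has no staged counterpart; rows 2 and 3 are replaced wholesale rather than updated column by column.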

Importing data incrementally from RDBMS to hive/hadoop using sqoop

I have an Oracle database and need to import data into a Hive table. The daily import size would be around 1 GB. What would be the better approach?
If I import each day's data as a partition, how can updated values be handled?
For example, suppose I imported today's data as a partition, and the next day some fields are updated with new values.
Using --lastmodified we can get the updated values, but where should we send them: to the new partition or to the old (already existing) partition?
If I send them to the new partition, then the data is duplicated.
If I want to send them to the already existing partition, how can that be achieved?
Your only option is to overwrite the entire existing partition with 'INSERT OVERWRITE TABLE...'.
Question is - how far back are you going to be constantly updating the data?
I can think of 3 approaches you could consider:
1. Decide on a threshold for 'fresh' data, for example '14 days back' or '1 month back'.
   Then each day you run the job, you overwrite partitions (only the ones which have updated values) going back as far as the threshold.
   With ~1 GB a day it should be feasible.
   Data older than the threshold is not guaranteed to be 100% correct.
   This scenario could be relevant if you know the fields can only be changed within a certain time window after they were initially set.
2. Make your Hive table compatible with ACID transactions, thus allowing updates on the table.
   Split your daily job into 2 tasks: the new data being written for the run day, and the updated data that you need to apply to earlier days. Sqoop will be responsible for the new data; take care of the updated data 'manually' (some script that generates the update statements).
3. Don't use partitions based on time. Maybe dynamic partitioning is more suitable for your use case. It depends on the nature of the data being handled.
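As a rough illustration of the threshold approach, the sketch below picks the partitions that both contain changed rows and fall inside the lookback window, then renders one overwrite statement per partition. Everything here is an assumption made for the example: the `sales` and `sales_staging` table names, the `dt` partition column, the column list, and the shape of `changed_rows`.

```python
from datetime import date, timedelta


def partitions_to_overwrite(changed_rows, run_date, lookback_days=14):
    """Partition dates that have changed rows and lie within the lookback window."""
    oldest = run_date - timedelta(days=lookback_days)
    return sorted({
        row["partition_date"]
        for row in changed_rows
        if oldest <= row["partition_date"] <= run_date
    })


def overwrite_statement(table, staging, columns, partition_date):
    """Render an INSERT OVERWRITE for one partition (hypothetical table layout)."""
    dt = partition_date.isoformat()
    return (
        f"INSERT OVERWRITE TABLE {table} PARTITION (dt='{dt}') "
        f"SELECT {', '.join(columns)} FROM {staging} WHERE dt='{dt}'"
    )


run_date = date(2024, 1, 20)
changed_rows = [
    {"id": 1, "partition_date": date(2024, 1, 19)},
    {"id": 2, "partition_date": date(2024, 1, 10)},
    {"id": 3, "partition_date": date(2024, 1, 10)},
    {"id": 4, "partition_date": date(2023, 12, 1)},  # older than the threshold: skipped
]

targets = partitions_to_overwrite(changed_rows, run_date)
statements = [
    overwrite_statement("sales", "sales_staging", ["id", "amount"], d) for d in targets
]
```

The staging table is assumed to already hold the complete, merged content of each affected partition, since an overwrite replaces the partition as a whole.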

Historical Data Comparison in realtime - faster in SQL or code?

I have a requirement in the project I am currently working on to compare the most recent version of a record with the previous historical record to detect changes.
I am using the Azure Offline data sync framework to transfer data from a client device to the server which causes records in the synced table to update based on user changes. I then have a trigger copying each update into a history table and a SQL query which runs when building a list of changes to compare the current record vs the most recent historical by doing column comparisons - mainly string but some integer and date values.
Is this the most efficient way of achieving this? Would it be quicker to load the data into memory and perform a code based comparison with rules?
Also, if I continually store all the historical data in a SQL table, will this affect performance over time, and would I be better off storing this data in something like Azure Table Storage? I am also thinking about cost, as SQL usage is much more expensive than Table Storage, but obviously I then cannot use a trigger and would need to insert each synced row into Table Storage manually.
You could avoid querying and comparing the historical data altogether, because the most recent version is already in the main table (and if it's not, it will certainly be new/changed data).
Consider a main table with 50,000 records and 1,000,000 records of historical data (and growing every day).
Instead of updating the main table directly and then querying the 1,000,000 records (and extracting the most recent record), you could query the smaller main table for that one record (probably by ID), compare the fields, and only if there is a change (or no data yet) update those fields and add the record to the historical data (or use a trigger / stored procedure for that).
That way you don't even need a database (probably containing multiple indexes) for the historical data. You could even store it in a flat file if you wanted, depending on what you want to do with that data.
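A minimal sketch of this compare-before-write idea, using plain Python dictionaries in place of the main table and a list in place of the history store. The function names and record shape are made up for illustration.

```python
def diff_fields(current, incoming):
    """Return {field: (old, new)} for every field whose value differs."""
    return {
        field: (current.get(field), new)
        for field, new in incoming.items()
        if current.get(field) != new
    }


def apply_update(main, history, record_id, incoming):
    """Update the main record only on a real change and log the result to history."""
    current = main.get(record_id)
    if current is None:
        # No data yet: treat every field as new.
        main[record_id] = dict(incoming)
        history.append((record_id, dict(incoming)))
        return {field: (None, new) for field, new in incoming.items()}
    changes = diff_fields(current, incoming)
    if changes:
        current.update({field: new for field, (_, new) in changes.items()})
        history.append((record_id, dict(current)))
    return changes


main = {1: {"name": "Ann", "qty": 3}}
history = []

first = apply_update(main, history, 1, {"name": "Ann", "qty": 5})   # qty changed
second = apply_update(main, history, 1, {"name": "Ann", "qty": 5})  # nothing changed
third = apply_update(main, history, 2, {"name": "Bob", "qty": 1})   # new record
```

The returned dictionary doubles as the list of what changed, so the history store never has to be queried to produce it.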
The sync framework I am using deals with the actual data changes, so I only get new history records when there is an actual change. Given a batch of updates to a number of records, I need to compare all the changes with their previous state and produce an output list of what's changed.

How to implement an ETL Process

I would like to implement a synchronization between a source SQL-based database and a target triplestore.
However, for the sake of simplicity, let's say simply 2 databases. I wonder what approaches to use to have every change in the source database replicated in the target database. More specifically, I would like each row change in the source database to be visible to a process that reads the changes and populates the target database accordingly, applying some transformation in the middle.
I have seen suggestions around the notification mechanisms that may be available in the database, or building tables such that changes can be tracked (meaning doing it manually) and having the process poll them at different intervals, or the usage of logs (change data capture, etc.).
I'm seriously puzzled about all of this. I wonder if anyone could give some guidance and explanation about the different approaches with respect to my objective. Meaning: names of methods and where to look.
My organization mostly uses Postgres and Oracle databases.
I have to take relational data and transform it into RDF so as to store it in a triplestore, and keep that triplestore constantly synchronized with the data in the SQL store.
Many thanks.
PS:
A clarification of the difference between ETL and replication techniques such as change data capture, with respect to my overall objective, would be appreciated.
Again, I need to make sense of the subject and know what the methods are, so I can start digging further for myself. So far I have understood that CDC is the new way to go.
Assuming you can't use replication and you need to use some kind of ETL process to actually extract, transform and load all changes to the destination database, you could use insert, update and delete triggers to fill a (manually created) audit table. Give it the columns GeneratedId, TableName, RowId, Action (insert, update, delete) and a boolean value that records whether your ETL process has already processed this change. Use that table to get all the changed rows in your database and transport them to the destination database. Then delete the processed rows from the audit table so that it doesn't grow too big. How often you have to run the ETL process depends on the number of changes occurring in the source database.
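The audit-table approach above might look roughly like this. It is a sketch only: SQLite (through Python's `sqlite3`) stands in for Postgres or Oracle, whose trigger syntax differs, and the `person` table, trigger names, and `run_etl` helper are hypothetical. The `load` callback is where the transformation and the write to the target store would go.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE audit (
    generated_id INTEGER PRIMARY KEY AUTOINCREMENT,
    table_name   TEXT,
    row_id       INTEGER,
    action       TEXT,
    processed    INTEGER DEFAULT 0
);
CREATE TRIGGER person_ai AFTER INSERT ON person BEGIN
    INSERT INTO audit (table_name, row_id, action) VALUES ('person', NEW.id, 'insert');
END;
CREATE TRIGGER person_au AFTER UPDATE ON person BEGIN
    INSERT INTO audit (table_name, row_id, action) VALUES ('person', NEW.id, 'update');
END;
CREATE TRIGGER person_ad AFTER DELETE ON person BEGIN
    INSERT INTO audit (table_name, row_id, action) VALUES ('person', OLD.id, 'delete');
END;
""")


def run_etl(conn, load):
    """Ship pending changes in order, then delete them from the audit table."""
    pending = conn.execute(
        "SELECT generated_id, table_name, row_id, action "
        "FROM audit WHERE processed = 0 ORDER BY generated_id"
    ).fetchall()
    for _, table_name, row_id, action in pending:
        load(table_name, row_id, action)  # transform and write to the target here
    with conn:
        conn.executemany(
            "DELETE FROM audit WHERE generated_id = ?", [(row[0],) for row in pending]
        )
    return len(pending)


conn.execute("INSERT INTO person VALUES (1, 'Ann')")
conn.execute("UPDATE person SET name = 'Anna' WHERE id = 1")
conn.execute("DELETE FROM person WHERE id = 1")

shipped = []
count = run_etl(conn, lambda table, row_id, action: shipped.append((table, row_id, action)))
remaining = conn.execute("SELECT COUNT(*) FROM audit").fetchone()[0]
```

The audit rows carry only the key and the action, so for inserts and updates the ETL step would still read the current row from the source table before transforming it.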

Oracle overwrites my Date?

I have a problem with Oracle dates. It seems that predefined dates in a Java application are different after inserting into an Oracle DB.
Insert via JPA entity:
entity.setDateOfCreation(new Date(System.currentTimeMillis()));
// 1350565985000
After commit and retrieve:
entity.getDateOfCreation() // 1350565985047
Why is this different?
I assumed Oracle would just insert my specific Date object with these exact milliseconds into the database, but obviously it doesn't. Because of the minimal delay it seems to "overwrite" the given Date with its own date in milliseconds (even though I do NOT use @GeneratedValue).
Does the table you are working with have a trigger which populates that column? I would hope it does. I have experienced lots of problems in the past with time differences between the app server and the database. It is much better to have a single source of time, which ensures consistent timings across the state.
