Change data capture on multiple tables for incremental load - ETL

I am building a staging area that gets data from Informatica CDC. For example, let's say I am replicating two tables for incremental load. I have to delete the processed data from the staging tables after each load. I join these two tables to populate my target dimension. The problem is that a change can happen on only one source and not the other in a particular load.
Example:
Employee
---------
ID NAME
1 PETER
EmployeeSal
------------
EMPID SAL
1 2000
If the above is replicated in my first load, I join the two tables and load them; that's fine.
Now let's say the salary of Peter is updated from 2000 to 3000. As I delete my staging tables after each load, I will have the following for the current load.
Employee
---------
ID NAME
EmployeeSal
-----------
EMPID SAL
1 3000
Here is my problem, as I have to populate the whole row of the dimension, which is Type 2.
I have to join back to the source to get the other attributes of the employee table (this is just a lame example; in reality it might be 10 tables and hundreds of thousands of changes). Is it recommended to go back to the source?
I can join the target table into this mix and populate the missing attributes.
Is this even recommended, given that I have to do a lot of CASE statements, null handling, etc. if a particular staging table has no change for a dimension record? My question is: is it even common for the target table to be joined in an ETL transformation?

Going back to the source system invalidates the purpose of creating the staging area in the first place. It is usually not recommended.
However, querying the target table is quite common to get previous information. But it is true that you have to do a lot of checks.
Another option is to maintain an SCD Type 1 copy of each source table in your staging area. Maintain insert and update timestamps in staging, which you can use to get only the changed records while loading the dimension.
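A minimal sketch of that staging pattern, with made-up names (stg_employee as the persistent staging copy, cdc_employee_delta as the rows delivered by the CDC load, etl_control for load bookkeeping):
-- persistent staging copy kept as SCD Type 1, with audit timestamps
CREATE TABLE stg_employee (
  emp_id    NUMBER PRIMARY KEY,
  emp_name  VARCHAR2(100),
  emp_sal   NUMBER,
  insert_ts TIMESTAMP DEFAULT SYSTIMESTAMP,
  update_ts TIMESTAMP DEFAULT SYSTIMESTAMP
);
-- upsert each CDC delta into the persistent staging copy (Type 1 overwrite)
MERGE INTO stg_employee s
USING cdc_employee_delta d
ON (s.emp_id = d.emp_id)
WHEN MATCHED THEN
  UPDATE SET s.emp_name  = NVL(d.emp_name, s.emp_name),
             s.emp_sal   = NVL(d.emp_sal, s.emp_sal),
             s.update_ts = SYSTIMESTAMP
WHEN NOT MATCHED THEN
  INSERT (emp_id, emp_name, emp_sal)
  VALUES (d.emp_id, d.emp_name, d.emp_sal);
-- feed the Type 2 dimension load only from rows touched since the last run
SELECT *
FROM   stg_employee
WHERE  update_ts > (SELECT last_load_ts
                    FROM   etl_control
                    WHERE  job_name = 'DIM_EMPLOYEE');
Because the staging copy is never purged, the full employee row is always available to the Type 2 dimension load even when only the salary changed in a given run.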

I have encountered the same problem with Order Header and Details: if a Detail changes, I need to be able to inner join to the Header to update my flattened Order Fact table.
I resolved it this way: after I have staged all the changed records, I look up any missing Headers for the changed Details (and vice versa) using an SQL Task that loads an object with an array of Order IDs, and I use a Foreach Loop to load the missing Headers into staging.
My inner joins from staging to the data warehouse will now work as expected.
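A set-based sketch of the same lookup, assuming hypothetical staging and source tables (stg_order_header, stg_order_detail, src_order_header) joined on an order_id key:
-- pull any Order Headers that are missing for Details already staged,
-- so the inner join from staging to the warehouse finds a match
INSERT INTO stg_order_header (order_id, customer_id, order_date)
SELECT h.order_id, h.customer_id, h.order_date
FROM   src_order_header h
WHERE  h.order_id IN (SELECT d.order_id FROM stg_order_detail d)
AND    NOT EXISTS (SELECT 1
                   FROM   stg_order_header s
                   WHERE  s.order_id = h.order_id);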

Related

Oracle 12c - refreshing the data in my tables based on the data from warehouse tables

I need to update some tables in my application from warehouse tables which are updated weekly or biweekly, and I should update my tables based on those. My tables have foreign keys referenced by other tables, so I cannot just truncate them and reinsert the whole data every time. I have to take the delta and update accordingly, based on a few primary key columns which don't change. I need some input on how to implement this approach.
My approach:
Check the last updated time of those tables/views.
If it is more recent, compare each row between my table and the warehouse table based on the primary key.
Update each column if it is different.
Do nothing if there is no change in the columns.
Insert if there is a new record.
My Question:
How do I implement this? Is writing PL/SQL code a good and efficient way to do it, given that the expected number of records is around 800K?
Please provide any sample code or links.
I would go for PL/SQL with the BULK COLLECT / FORALL method. You can use MINUS in your cursor in order to reduce the data size by calculating the difference.
You can check this site for more information about BULK COLLECT, FORALL and the engines: http://www.oracle.com/technetwork/issue-archive/2012/12-sep/o52plsql-1709862.html
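A minimal sketch of that approach, using placeholder names (my_table as the application table, wh_table as the warehouse table, pk_id as the unchanging key and col1 as a data column):
DECLARE
  CURSOR c_delta IS
    SELECT pk_id, col1 FROM wh_table
    MINUS
    SELECT pk_id, col1 FROM my_table;  -- only rows that are new or have changed
  TYPE t_id_tab  IS TABLE OF wh_table.pk_id%TYPE;
  TYPE t_col_tab IS TABLE OF wh_table.col1%TYPE;
  l_ids  t_id_tab;
  l_cols t_col_tab;
BEGIN
  OPEN c_delta;
  LOOP
    FETCH c_delta BULK COLLECT INTO l_ids, l_cols LIMIT 10000;
    EXIT WHEN l_ids.COUNT = 0;
    -- upsert each batch with a single bulk-bound statement
    FORALL i IN 1 .. l_ids.COUNT
      MERGE INTO my_table m
      USING dual
      ON (m.pk_id = l_ids(i))
      WHEN MATCHED THEN
        UPDATE SET m.col1 = l_cols(i)
      WHEN NOT MATCHED THEN
        INSERT (pk_id, col1) VALUES (l_ids(i), l_cols(i));
  END LOOP;
  CLOSE c_delta;
  COMMIT;
END;
/
With more data columns you would add a collection per column, or switch to a single MERGE ... USING (SELECT ... MINUS ...) statement if the delta fits comfortably in one pass.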
There are many parts to your question above and I will answer as best I can:
While it is possible to disable the referencing foreign keys, truncate the table, repopulate it with the updated data and then re-enable the foreign keys, given the requirements described above I don't believe truncating the table each time is optimal.
Yes, in principle PL/SQL is a good way to achieve what you want to achieve, as this is too complex to deal with in native SQL and PL/SQL is an efficient alternative.
Conceptually, the approach I would take is something like as follows:
Initial set up:
- Create a sequence called activity_seq.
- Add an "activity_id" column of type NUMBER, with a unique constraint, to your source tables.
- Add a trigger to the source table(s) setting activity_id = activity_seq.nextval for each insert / update of a table row.
- Create some kind of master table to hold the "last processed activity id" value.
Then bi/weekly:
- Retrieve the value of "last processed activity id" from the master table.
- Select all rows in the source table(s) having an activity_id value greater than the "last processed activity id" value.
- Iterate through the selected source rows and update the target if a match is found based on whatever your match criterion is, or insert a new row into the target if no match is found (I assume there is no delete as you do not mention it).
- On completion, update the master table's "last processed activity id" to the greatest activity_id of the source rows processed in the previous step.
(please note that, depending on your environment and the number of rows processed, the above process may need to be split and repeated over a number of transactions)
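A rough sketch of that set-up and the incremental load, with invented object and column names (source_table, target_table, etl_master, pk_id, col1):
-- one-off set-up: sequence, activity column, trigger and bookkeeping table
CREATE SEQUENCE activity_seq;
ALTER TABLE source_table ADD (activity_id NUMBER UNIQUE);
CREATE OR REPLACE TRIGGER trg_source_table_activity
BEFORE INSERT OR UPDATE ON source_table
FOR EACH ROW
BEGIN
  :NEW.activity_id := activity_seq.NEXTVAL;  -- 11g+; older versions need SELECT ... INTO
END;
/
CREATE TABLE etl_master (
  table_name                 VARCHAR2(30) PRIMARY KEY,
  last_processed_activity_id NUMBER DEFAULT 0
);
-- bi-weekly load: only rows changed since the last processed activity id
MERGE INTO target_table t
USING (SELECT s.pk_id, s.col1, s.activity_id
       FROM   source_table s
       WHERE  s.activity_id > (SELECT last_processed_activity_id
                               FROM   etl_master
                               WHERE  table_name = 'SOURCE_TABLE')) src
ON (t.pk_id = src.pk_id)
WHEN MATCHED THEN
  UPDATE SET t.col1 = src.col1
WHEN NOT MATCHED THEN
  INSERT (pk_id, col1) VALUES (src.pk_id, src.col1);
-- record the high-water mark (ideally captured from the rows actually processed)
UPDATE etl_master
SET    last_processed_activity_id = (SELECT NVL(MAX(activity_id), 0) FROM source_table)
WHERE  table_name = 'SOURCE_TABLE';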
I hope this proves helpful

Which is the better way of handling data: GROUP BY or DISTINCT?

I have a big SQL query that scans through multiple tables with millions of records. After the query completes, I get 250K records. The result set will be saved in a staging table before being written to files. There is a possibility that the result set will have duplicates.
The question is: which of the following options is better and gives a better result?
1. Doing a GROUP BY or DISTINCT before inserting the result set into the staging table.
2. Inserting the duplicate records into the staging table and using DISTINCT/GROUP BY while selecting records from the staging table.
3. There is not much difference between 1 and 2.
If you filter the duplicates before inserting, then you reduce the number of writes that you need to make into the staging table and, since those duplicate rows will not be in the staging table, you also reduce the number of reads from the staging table when you come to write it out to a file. So, logically, option 1 should give better performance.
However, if you are that concerned about the difference between the two, then the answer has to be "profile both methods on your system and see which is best on your hardware/tables/indexes/etc."
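In SQL terms, the two options boil down to something like the following sketch (staging_table, the columns, and v_big_query standing in for the multi-table query are all illustrative):
-- option 1: de-duplicate on the way in, so duplicates never reach the staging table
INSERT /*+ APPEND */ INTO staging_table (col1, col2, col3)
SELECT DISTINCT col1, col2, col3
FROM   v_big_query;
-- option 2: load everything, de-duplicate on the way out to the file
SELECT DISTINCT col1, col2, col3
FROM   staging_table;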

SQL Loader, Trigger saturation? [closed]

I have a situation I can't find an explanation for; here it is. (I'll use hypothetical info since the originals are really big.)
I have a table, let's say:
table_a
-------------
name
last name
dept
status
notes
And this table has a trigger on insert, which does a lot of validation on the data to change the status field of the new record according to the results of the validation. Some of the validations are:
- check for the name existing in a dictionary
- check for the last name existing in a dictionary
- check that fields (name,last name,dept) aren't already inserted in table_b
- ... and so on
The thing is, if I do an insert on the table via query, like
insert into table_a
(name,last_name,dept,status,notes)
values
('john','smith',1,0,'new');
it takes only 173 ms to do all the validation, update the status field and insert the record into the table (the validation process does all the searches via indexes).
But if I try this via SQL*Loader, reading a file with 5000 records, it takes about 40 minutes to validate and insert 149 records (of course I killed it...).
So I tried loading the data with the trigger disabled (to check speed), and it loads all the records in less than 10 seconds.
So my question is: what can I do to improve this process? My only theory is that I could be saturating the database because it loads so fast and launches many instances of the trigger, but I really don't know.
My objective is to load around 60 files with info and validate them through the process in the trigger (I'm willing to try other options, though).
I would really appreciate any help you can provide!
COMPLEMENT:
Thanks for the answer; I'll now read all about this, and I hope you can help me with this part. Let me explain some of the functionality I need (I used a trigger because I couldn't think of anything else).
So the table data comes with these (important) fields:
pid name lastname birthdate dept cicle notes
The data comes like this:
name lastname birthdate dept
Now, the trigger does this to the data:
- Calls a function to calculate the pid (it is calculated from the name, lastname and birthdate with an algorithm).
- Calls a function to check the names against the dictionary (that's because my dictionary holds single names, meaning if a person is named john aaron smith jones the function splits john aaron in two and searches for john and aaron in the dictionary in separate queries; that's why I didn't use a foreign key, to avoid having lots of combinations: john aaron, john alan, john pierce, etc.). I'm kind of stuck on how to implement this one with keys without changing the dictionary... maybe with a CHECK? The lastname foreign key would be a good idea.
- Gets the cicle from another table according to the dept and the current date (because a person can appear twice in the table in the same dept but in a different cicle). How could I get this cicle value in a more efficient way to do the correct search?
And finally, after all this validation is done, I need to know exactly which validation wasn't met (hence the notes field), so the trigger concatenates all the strings of failed validations, like this:
lastname not in dictionary, cannot calculate pid (invalid date), name not in dictionary
I know that if a constraint check isn't met, all I could do is insert the record into another table with the constraint-failed error message, but that only leaves me with one validation, am I right? I need to validate all of them and send the report to the other department so they can review the data and make all the necessary adjustments to it.
Anyway, this is my situation right now. I'll explore possibilities and hope you can shed some light on the overall process. Thank you very much for your time.
You're halfway to the solution already:
"So I tried loading the data disabling the trigger (to check speed) ... it loads like all the records in less than 10 seconds."
This is not a surprise. Your current implementation executes a lot of single row SELECT statements for each row you insert into table B. That will inevitably give you a poor performance profile. SQL is a set-based language and performs better with multi-row operations.
So, what you need to do is find a way to replace all the SELECT statements which more efficient alternatives. Then you'll be able to drop the triggers permanently. For instance, replace the look-ups on the dictionary with foreign keys between the table A columns and the reference table. Relational integrity constraints, being internal Oracle code, perform much better than any code we can write (and work in multi-user environments too).
The rule about not inserting into table A if a combination of columns already exists in table B is more problematic. Not because it's hard to do but because it sounds like poor relational design. If you don't want to load records in table A when they already exits in table B why aren't you loading into table B directly? Or perhaps you have a sub-set of columns which should be extracted from table A and table B and formed into table C (which would have foreign key relationships with A and B)?
Anyway, leaving that to one side, you can do this with set-based SQL by replacing SQL*Loader with an external table. An external table allows us to present a CSV file to the database as if it were a regular table. This means we can use it in normal SQL statements. Find out more.
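A hedged sketch of what such an external table might look like for this file (directory path, file name and field definitions are all assumptions):
CREATE OR REPLACE DIRECTORY load_dir AS '/data/loads';
CREATE TABLE ext_table_a (
  name      VARCHAR2(100),
  last_name VARCHAR2(100),
  dept      NUMBER,
  status    NUMBER,
  notes     VARCHAR2(4000)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY load_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    FIELDS TERMINATED BY ','
    MISSING FIELD VALUES ARE NULL
  )
  LOCATION ('table_a.csv')
)
REJECT LIMIT UNLIMITED;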
So, with foreign key constraints on the dictionary and an external table, you can replace the SQL*Loader code with this statement (subject to whatever other rules are subsumed into "... and so on"):
insert into table_a
select ext.*
from external_table ext
left outer join table_b b
on (ext.name = b.name and ext.last_name = b.last_name and ext.dept=b.dept)
where b.name is null
log errors into err_table_a ('load_fail') reject limit unlimited;
This employs the DML error logging syntax to capture constraint errors for all rows in a set-based fashion. It won't raise exceptions for rows which already exist in table B. You could either use a multi-table INSERT ALL to route rows into an overflow table, or use a MINUS set operation after the event to find rows in the external table which aren't in table A. It depends on your end goal and how you need to report things.
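One practical note: the err_table_a referenced above has to exist before the insert runs. It is normally created with the standard DBMS_ERRLOG package (table names here follow the example):
BEGIN
  DBMS_ERRLOG.CREATE_ERROR_LOG(
    dml_table_name     => 'TABLE_A',
    err_log_table_name => 'ERR_TABLE_A');
END;
/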
Perhaps this is a more complex answer than you were expecting. Oracle SQL is a very extensive SQL implementation, with a lot of functionality for improving the efficiency of bulk operations. It really pays to read the Concepts Guide and the SQL Reference to find out just how much we can do with Oracle.

Oracle and creating history

I am working on a system to track a project's history. There are 3 main tables: projects, tasks, and clients, plus 3 history tables, one for each. I have the following trigger on the projects table.
CREATE OR REPLACE TRIGGER mySchema.trg_projectHistory
BEFORE UPDATE OR DELETE
ON mySchema.projects REFERENCING NEW AS New OLD AS Old
FOR EACH ROW
declare tmpVersion number;
BEGIN
select myPackage.GETPROJECTVERSION( :OLD.project_ID ) into tmpVersion from dual;
INSERT INTO mySchema.projectHistory
( project_ID, ..., version )
VALUES
( :OLD.project_ID,
...
tmpVersion
);
EXCEPTION
WHEN OTHERS THEN
-- Consider logging the error and then re-raise
RAISE;
END ;
/
I have three triggers, one for each of my tables (projects, tasks, clients).
Here is the challenge: not everything changes at the same time. For example, somebody could just update a certain task's cost. In this case, only one trigger fires and I get one insert. I'd like to insert one record into each of the 3 history tables at once, even if nothing changed in the projects and clients tables.
Also, what if somebody changes a project's end_date and the cost, and, say, picks another client? Now I have three triggers firing at the same time. Only in this case will I have one record inserted into each of my three history tables (which is what I want).
If I modify the triggers to insert into all 3 tables for the first example, then I will have 9 inserts when the second example happens.
Not quite sure how to tackle this. Any help?
To me it sounds as if you want a transaction-level snapshot of the three tables created whenever you make a change to any of those tables.
Have a row-level trigger on each of the three tables that calls a single packaged procedure with the project id and, optionally, the client / task id.
The packaged procedure inserts into all three history tables the relevant project, client and task rows where there isn't already a history record for that key and transaction (i.e. you don't want duplicates). You have a couple of choices when it comes to the latter: you can use a unique constraint and either a bulk SELECT and INSERT with FORALL ... SAVE EXCEPTIONS, DML error logging (LOG ERRORS INTO), or an INSERT ... SELECT ... WHERE NOT EXISTS.
You do need to keep track of your transactions. I'm guessing this is what you were doing with myPackage.GETPROJECTVERSION. The trick here is to only increment versions when you have a new transaction. If, when you get a new version number, you hold it in a package-level variable, you can easily tell whether your session has already got a version number or not.
If your session is going to run multiple transactions, you'll need to clear out the session-level version number if it was part of a previous transaction. If you get DBMS_TRANSACTION.LOCAL_TRANSACTION_ID and store that at the package/session level as well, you can determine whether you are in a new transaction or part of the same transaction.
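A very rough sketch of that packaged procedure, assuming a version_seq sequence and a simplified column list (all names below are invented, not the poster's actual schema):
CREATE OR REPLACE PACKAGE history_pkg AS
  PROCEDURE snapshot_project(p_project_id IN NUMBER);
END history_pkg;
/
CREATE OR REPLACE PACKAGE BODY history_pkg AS
  g_txn_id  VARCHAR2(200);  -- transaction id last seen by this session
  g_version NUMBER;         -- version number allocated for that transaction
  FUNCTION current_version RETURN NUMBER IS
    l_txn_id VARCHAR2(200) := DBMS_TRANSACTION.local_transaction_id;
  BEGIN
    IF g_txn_id IS NULL OR g_txn_id <> l_txn_id THEN
      g_txn_id  := l_txn_id;             -- new transaction:
      g_version := version_seq.NEXTVAL;  -- allocate a new version number
    END IF;
    RETURN g_version;
  END current_version;
  PROCEDURE snapshot_project(p_project_id IN NUMBER) IS
    l_version NUMBER := current_version;
  BEGIN
    -- insert a history row only if this transaction has not snapshotted it yet
    INSERT INTO projectHistory (project_id, project_name, version)
    SELECT p.project_id, p.project_name, l_version
    FROM   projects p
    WHERE  p.project_id = p_project_id
    AND    NOT EXISTS (SELECT 1
                       FROM   projectHistory h
                       WHERE  h.project_id = p_project_id
                       AND    h.version    = l_version);
    -- ...repeat the same pattern for clientHistory and taskHistory
  END snapshot_project;
END history_pkg;
/
The row-level triggers on projects, tasks and clients would then just call history_pkg.snapshot_project(:OLD.project_id) (passing the client/task ids where relevant) instead of doing their own inserts.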
From your description, it looks like you would be capturing the effective and end dates for each of the history rows once any of the original rows change.
E.g. the project_hist table would have eff_date and exp_date columns holding the start and end dates for a given project version, while the project table would just have an effective date (as it is the active project).
I don't see why you would want to insert rows into all three history tables when only one of the tables' values is updated. You can pretty much get the details as you need them (as of a given date) using your current logic (inserting the old row into the history table only for the table that has been updated).
Alternative answer.
Have a look at Total Recall / Flashback Archive
You can set the retention to 10 years, and use a simple AS OF TIMESTAMP to get the data as of any particular timestamp.
I'm not sure about performance, though. It may be easier to have a daily or weekly retention and then a separate scheduled job that picks out the older versions using the VERSIONS BETWEEN syntax and stores them in your history table.
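For reference, a hedged sketch of that route (the archive name, tablespace and quota are invented):
CREATE FLASHBACK ARCHIVE project_fba
  TABLESPACE users
  QUOTA 1G
  RETENTION 10 YEAR;
ALTER TABLE projects FLASHBACK ARCHIVE project_fba;
ALTER TABLE tasks    FLASHBACK ARCHIVE project_fba;
ALTER TABLE clients  FLASHBACK ARCHIVE project_fba;
-- query the data as of a point in time...
SELECT * FROM projects AS OF TIMESTAMP (SYSTIMESTAMP - INTERVAL '7' DAY);
-- ...or pull row versions over a window for archiving into your own history table
SELECT project_id, versions_starttime, versions_endtime
FROM   projects VERSIONS BETWEEN TIMESTAMP (SYSTIMESTAMP - INTERVAL '7' DAY) AND SYSTIMESTAMP;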

How do I prevent the loading of duplicate rows in to an Oracle table?

I have some large tables (millions of rows). I constantly receive files containing new rows to add to those tables - up to 50 million rows per day. Around 0.1% of the rows I receive are duplicates of rows I have already loaded (or are duplicates within the files). I would like to prevent those rows from being loaded into the table.
I currently use SQL*Loader in order to have sufficient performance to cope with my large data volume. If I take the obvious step and add a unique index on the columns which govern whether or not a row is a duplicate, SQL*Loader will start to fail the entire file which contains the duplicate row - whereas I only want to prevent the duplicate row itself from being loaded.
I know that in SQL Server and Sybase I can create a unique index with the 'Ignore Duplicates' property and that if I then use BCP the duplicate rows (as defined by that index) will simply not be loaded.
Is there some way to achieve the same effect in Oracle?
I do not want to remove the duplicate rows once they have been loaded - it's important to me that they should never be loaded in the first place.
What do you mean by "duplicate"? If you have a column which defines a unique row, you should set up a unique constraint against that column. One typically creates a unique index on this column, which will automatically set up the constraint.
EDIT:
Yes, as commented below, you should set up a "bad" file for SQL*Loader to capture invalid rows. But I think that establishing the unique index is probably a good idea from a data-integrity standpoint.
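A minimal sketch of that suggestion (the table, column and constraint names are assumptions); with a conventional-path load and a high enough ERRORS= setting, rows that violate the constraint end up in the BAD file instead of failing the whole load:
ALTER TABLE big_table
  ADD CONSTRAINT big_table_uq UNIQUE (key_col1, key_col2);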
Use the Oracle MERGE statement.
You didn't say which release of Oracle you have; have a look at the MERGE documentation for your version.
Basically it works like this:
-- merge all the rows from the staging table temp_emp_rec into hr.employees
MERGE INTO hr.employees e
USING temp_emp_rec t
ON (e.emp_id = t.emp_id)
WHEN MATCHED THEN
  -- update the existing row
  UPDATE SET first_name = t.first_name,
             last_name  = t.last_name
WHEN NOT MATCHED THEN
  -- insert a new row into the table
  INSERT (emp_id, first_name, last_name)
  VALUES (t.emp_id, t.first_name, t.last_name);
I would use integrity constraints defined on the appropriate table columns.
The constraints chapter of the Oracle Concepts manual gives an overview; if you scroll down you will see what types of constraints are available.
Use the option below; only once you hit 9,999,999 errors will your sqlldr run terminate.
OPTIONS (ERRORS=9999999, DIRECT=FALSE)
LOAD DATA
You will get the duplicate records in the bad file.
sqlldr user/password@schema CONTROL=file.ctl LOG=file.log BAD=file.bad
