SSIS duplicate data Insert Update approach - insert

New to SSIS.
I have data from a set of flat files. It is possible to have the same person on the same flat file or different flat files with person different information. I process them all at the same time.
But I need to know either to insert the record or update it (if the same person). I'm using a lookup to determine if the person exist on the table. I have already set the Old DB Destination FastLoadMaxInsertCommitSize to 1 and using Ole DB Command for updates.
But still it cant determine an update if the same person is encountered.
I also tried merge on control flow but failed.
What could be the solution for this?

After Insert/Update, look for duplicate data and delete IDs(if using identity keys) lower than the max(ID).

Related

Informatica 9.5.1, huge table (scd1)

I have a table(in oracle) size about 860 million records (850gb) on top we are getting about 2 -3 million records as source (flatfile).
we are doing a lookup on target if record already exist it will update if it is a new record it will insert(scd1).
The transformations we using are unconnectedlookup, sorter, filter and router, update strategy transformations, it was fine all this time, but as the table is huge and growing huge, it is taking for ever to insert and update, last night it took 19 hrs to 2.4 million records (2.1 millions were new so inserted and the rest are updates).
Today I got about 1.9 millions to go through i am not sure how long it will take any suggestions or help how can we handle this ?
1) Use just a connected lookup to oracle table, after SQ matching on primary key and filter out nulls (records missing in Oracle table) or not null (updates). Dont check for other columns for update. Skip sorter and filter. Just use update strategy.
2) Or use joiner and make flat file pipeline as master. Then check for nulls to find insert or updates.
3) Check if your target table dont have any trigger etc on it. If yes then check its logic and implement it in ETL.
Since you are dealing with 850mil data, you have two major bottlenecks - target lookup and writing into target.
You can think of this strategy -
Mapping 1 - Create a new mapping to load flat file data into a temp table TMP1.
Mapping 2 - Modify existing mapping. Just modify lookup query and join TMP1 and target (860mil)table in SQL Override. This will reduce time, I/O, lookup cache.
Also, please make sure you have an index on key columns in target. And you drop-create all other index while loading. Skipping sorter will help but adding joiner will not help much.
Regards,
Koushik
How many inserts vs updates do you have?
With just a few updates, try using Update else Insert target
property.
If there are many updates and few inserts, perform update
just if a key is found, without checking if anything has changed
If there are many source rows matching what you already have (i.e. an update that doesn't change anything) try to eliminate them. But don't compare all columns - use a hash instead. Just create an additional computed column that will contain a MD5 calculated on all columns. Then all you need to do is compare one column instead of all to detect a change.
1) Try using a merge statement if source and targets are in same database.
2) We can also use sql loader connection to improve the performance.
Clearly the bottleneck is in the target lookup and target load (update to be specific).
Try the following to tune the existing code:
1) Try to remove any unwanted lookup ports if you have in the lookup transformation. Keep only the fields that are used in the lookup condition as you are using it just to check if the record exists.
2) Try adding an index to the target table for the fields you are using for the update
3) Increase the commit interval of the session to a higher value.
4) Partial Pushdown optimization:
You can pushdown some of the processing to database which might be faster instead of doing it in Informatica
Create a staging table to hold the incoming data for that load.
Create a mapping to load the incoming file to the staging table. Truncate it before the start of the load to clear the records of the previous run.
In the SQL override of the existing mapping do a left join between the staging table and target table to find insert/updates. This will be faster than the Informatica lookup and eliminates the time taken to build the Informatica lookup cache.
5) Using MD5 to eliminate unwanted updates
For using MD5 you need to add a new field in the target table and do a mapping to update the existing records one time.
Then in your existing mapping add a step to compute MD5 for the incoming column.
If the record is identified for update then check if the MD5 computed for the incoming column is same as that of the target column. If the checksum also matches then don't update the record. Only if the check sum is different update the record. By this way you will filter out the unwanted updates. If there is no lookup match then insert the record.
Advantages: You are reducing the unwanted updates.
Disadvantages: You have to do an one time process to populate MD5 values for the existing records in the table.
If none of this works check with your database administrator to see if there is any issue in the database side that might slow down the load.

Cassandra Best Practice on edits: delete & re-insert vs. update?

I am new to Cassandra. I am looking at many examples online. Here is one from JHipster Cassandra examples on GitHub:
https://gist.github.com/jdubois/c3d3bedb869466731316
The repository save(user) method does a read (to look for existence) then a delete and re-insert of the existing user across all the denormalized tables whenever the user data changed.
Is this best practice?
Is this only because of how the data model for this sample is designed?
Is this sample's design a result of twisting a POJO framework into a NoSQL database design?
When would I want to just do a update in Cassandra? It supports updates at the field-level, so it seems like that would be preferred.
First of all, the delete operations should be part of the batch for more robust error handling. But it looks like there are also some concurrency issues with the code. It will update the user based on the current user value read before. It's not save to assume this will still be the latest value while save() is actually executed. It will also just overwrite any keys in the lookup table that might be in use for a different user at that point. E.g. the login could already exist for another user while executing insertByLoginStmt.
It is not necessary to delete a row before inserting a new one.
But if you are replacing rows and new columns are different from existing columns then you need to delete all existing columns and insert new columns. Or insert new and delete old, does not matter if happens in batch.

ADOX Rearrange Or Insert Columns Rather than Append them in Access Vb6, VB.Net or CSharp

I need to insert a field in the middle of current fields of a database table. I'm currently doing this in VB6 but may get the green light to do this in .net. Anyway I'm wondering since Access gives you the ability to "insert" fields in the table is there a way to do this in ADOX? If I had to I could step back and use DAO, but not sure how to do it there either.
If yor're wondering why I want to do this applications database has changed over time and I'm being asked to create Upgrade program for some of the installations with older versions.
Any help would be great.
This should not be necessary. Use the correct list of fields in your queries to retrieve them in the required order.
BUT, if you really need to do that, the only way i know is to create a new table with the fields in the required order, read the data from the old table into the new one, delete the old table and rename the new table as the old one.
I hear you: in Access the order of the fields is important.
If you need a comprehensive way to work with ADOX, your go to place is Allen Browne's website. I have used it to from my novice to pro in handling Access database changes. Here it is: www.AllenBrowne.com. Go to Access Tips then scroll down to ADOX Code.
That is also where I normally refer people with doubts about capabilities of Access as a database :)
In your case, you will juggle through creating a new table with the new field in the right position, copying data to the new table, applying properties to the fields, deleting original table, renaming the new table to the required (original) name.
That is the correct order. Do not apply field properties before copying the data. Some indexes and key properties may not be applied when the fields already have data.
Over time, I have automated this so I just run an application to do detect and implement the required changes for me. But that took A LOT of work-weeks.

Oracle copy data between databases honoring relationships

I have a question regarding the oracle copy command:
Is it possible to copy data between databases (were the structure is the same) and honor relationships in one go without(!) writing procedures?
To be more precise:
Table B refers (by B.FK) to table A (A.PK) by a foreign key (B.FK -> A.PK; no relationship information is stored in the db itself). The keys are generated by a sequence, which is used to create the PK for all tables.
So how to copy table A and B while keeping the relationship intact and use the target DBs sequence to generate new primary keys for the copied data (i cannot use the "original" PK values as they might already be used in the same table for a different dataset)?
I doubt that the copy command is capable to handle this situation but what is the way to achieve the desired behavior?
Thanks
Matthias
Oracle has several different ways of moving data from one database to another, of which the SQL*Plus copy command is the most basic and the least satisfactory. Writing your own replication routine (as #OldProgrammer suggests) isn't much better.
You're using 11g, so move into the 21st century by using the built-in Streams functionality.
There is no way to synchronize sequences across databases. There is a workaround, which is explained by the inestimable Tom Kyte.
I generally prefer db links and then use sql insert statements to copy over data.
In your scenario , first insert data of Table A using DB link and then table. If you try otherway round, you will error.
For info on DB link , you canc heck this link: http://docs.oracle.com/cd/B28359_01/server.111/b28310/ds_concepts002.htm

Referencing object's identity before submitting changes in LINQ

is there a way of knowing ID of identity column of record inserted via InsertOnSubmit beforehand, e.g. before calling datasource's SubmitChanges?
Imagine I'm populating some kind of hierarchy in the database, but I wouldn't want to submit changes on each recursive call of each child node (e.g. if I had Directories table and Files table and am recreating my filesystem structure in the database).
I'd like to do it that way, so I create a Directory object, set its name and attributes,
then InsertOnSubmit it into DataContext.Directories collection, then reference Directory.ID in its child Files. Currently I need to call InsertOnSubmit to insert the 'directory' into the database and the database mapping fills its ID column. But this creates a lot of transactions and accesses to database and I imagine that if I did this inserting in a batch, the performance would be better.
What I'd like to do is to somehow use Directory.ID before commiting changes, create all my File and Directory objects in advance and then do a big submit that puts all stuff into database. I'm also open to solving this problem via a stored procedure, I assume the performance would be even better if all operations would be done directly in the database.
One way to get around this is to not use an identity column. Instead build an IdService that you can use in the code to get a new Id each time a Directory object is created.
You can implement the IdService by having a table that stores the last id used. When the service starts up have it grab that number. The service can then increment away while Directory objects are created and then update the table with the new last id used at the end of the run.
Alternatively, and a bit safer, when the service starts up have it grab the last id used and then update the last id used in the table by adding 1000 (for example). Then let it increment away. If it uses 1000 ids then have it grab the next 1000 and update the last id used table. Worst case is you waste some ids, but if you use a bigint you aren't ever going to care.
Since the Directory id is now controlled in code you can use it with child objects like Files prior to writing to the database.
Simply putting a lock around id acquisition makes this safe to use across multiple threads. I've been using this in a situation like yours. We're generating a ton of objects in memory across multiple threads and saving them in batches.
This blog post will give you a good start on saving batches in Linq to SQL.
Not sure off the top if there is a way to run a straight SQL query in LINQ, but this query will return the current identity value of the specified table.
USE [database];
GO
DBCC CHECKIDENT ("schema.table", NORESEED);
GO

Resources