How do I prevent the loading of duplicate rows in to an Oracle table? - oracle

I have some large tables (millions of rows). I constantly receive files containing new rows to add in to those tables - up to 50 million rows per day. Around 0.1% of the rows I receive are duplicates of rows I have already loaded (or are duplicates within the files). I would like to prevent those rows being loaded in to the table.
I currently use SQLLoader in order to have sufficient performance to cope with my large data volume. If I take the obvious step and add a unique index on the columns which goven whether or not a row is a duplicate, SQLLoader will start to fail the entire file which contains the duplicate row - whereas I only want to prevent the duplicate row itself being loaded.
I know that in SQL Server and Sybase I can create a unique index with the 'Ignore Duplicates' property and that if I then use BCP the duplicate rows (as defined by that index) will simply not be loaded.
Is there some way to achieve the same effect in Oracle?
I do not want to remove the duplicate rows once they have been loaded - it's important to me that they should never be loaded in the first place.

What do you mean by "duplicate"? If you have a column which defines a unique row you should setup a unique constraint against that column. One typically creates a unique index on this column, which will automatically setup the constraint.
EDIT:
Yes, as commented below you should setup a "bad" file for SQL*Loader to capture invalid rows. But I think that establishing the unique index is probably a good idea from a data-integrity standpoint.

Use Oracle MERGE statement. Some explanations here.

You dint inform about what release of Oracle you have. Have a look at there for merge command.
Basically like this
---- Loop through all the rows from a record temp_emp_rec
MERGE INTO hr.employees e
USING temp_emp_rec t
ON (e.emp_ID = t.emp_ID)
WHEN MATCHED THEN
--- _You can update_
UPDATE
SET first_name = t.first_name,
last_name = t.last_name
--- _Insert into the table_
WHEN NOT MATCHED THEN
INSERT (emp_id, first_name, last_name)
VALUES (t.emp_id, t.first_name, t.last_name);

I would use integrity constraints defined on the appropriate table columns.
This page from the Oracle concepts manual gives an overview, if you also scroll down you will see what types of constraints are available.

use below option, if you will get this much error 9999999 after that your sqlldr will terminate.
OPTIONS (ERRORS=9999999, DIRECT=FALSE )
LOAD DATA
you will get duplicate records in bad file.
sqlldr user/password#schema CONTROL=file.ctl, LOG=file.log, BAD=file.bad

Related

Oracle 12c - refreshing the data in my tables based on the data from warehouse tables

I need to update the some tables in my application from some other warehouse tables which would be updating weekly or biweekly. I should update my tables based on those. And these are having foreign keys in another tables. So I cannot just truncate the table and reinsert the whole data every time. So I have to take the delta and update accordingly based on few primary key columns which doesn't change. Need some inputs on how to implement this approach.
My approach:
Check the last updated time of those tables, views.
If it is most recent then compare each row based on the primary key in my table and warehouse table.
update each column if it is different.
Do nothing if there is no change in columns.
insert if there is a new record.
My Question:
How do I implement this? Writing a PL/SQL code is it a good and efficient way? as the expected number of records are around 800K.
Please provide any sample code or links.
I would go for Pl/Sql and bulk collect forall method. You can use minus in your cursor in order to reduce data size and calculating difference.
You can check this site for more information about bulk collect, forall and engines: http://www.oracle.com/technetwork/issue-archive/2012/12-sep/o52plsql-1709862.html
There are many parts to your question above and I will answer as best I can:
While it is possible to disable referencing foreign keys, truncate the table, repopulate the table with the updated data then reenable the foreign keys, given your requirements described above I don't believe truncating the table each time to be optimal
Yes, in principle PL/SQL is a good way to achieve what you are wanting to
achieve as this is too complex to deal with in native SQL and PL/SQL is an efficient alternative
Conceptually, the approach I would take is something like as follows:
Initial set up:
create a sequence called activity_seq
Add an "activity_id" column of type number to your source tables with a unique constraint
Add a trigger to the source table/s setting activity_id = activity_seq.nextval for each insert / update of a table row
create some kind of master table to hold the "last processed activity id" value
Then bi/weekly:
retrieve the value of "last processed activity id" from the master
table
select all rows in the source table/s having activity_id value > "last processed activity id" value
iterate through the selected source rows and update the target if a match is found based on whatever your match criterion is, or if
no match is found then insert a new row into the target (I assume
there is no delete as you do not mention it)
on completion, update the master table "last processed activity id" to the greatest value of activity_id for the source rows
processed in step 3 above.
(please note that, depending on your environment and the number of rows processed, the above process may need to be split and repeated over a number of transactions)
I hope this proves helpful

Is there any use to create index on all the table columns in oracle?

In our one of production database, we have 4 column table and there are no PK,UK constraints on it. only one notnull constraint on one column. The inserts are slow on this table and when I checked the indexes , there is one index which is built on all columns.
It is a normal table and not IOT. I really don't see a need of all column index, but wondering why the developers has created it?
Appreciate your thoughts?
It might be usefull, i.e. if you (mainly) query all columns oracle doesn't have to access the table at all, but can get all the data from the index. Though inserts take longer because a larger index has to be maintained by the dbms everytime.
One case where it could be useful is,
Say for example, you are trying to check the existence of records in this table and for that you have to have joins on all four columns. So in such a case if you have written a correlated query like below,
SELECT <something>
FROM table_1 t1
WHERE EXISTS
(SELECT 1 FROM table_t2 t2 where t1.c1=t2.c1 and t1.c2=t2.c2 and t1.c3=t2.c3 and t1.c4=t2.c4)
Apart from above case, it looks an error to me from developer's side.
Indexes are good to better query optimization but causes slow updates/inserts because the indexes needs to be updated at each modification.
If these tables first use is querying and inserts happens only in a specific periods like a batch at the beginning or the end of the day only, then you can remove the indexes before updating tables and then restore them.
In addition, all the queries all these tables need to be analysed to see which indexes are useful and which are not?
Anyway, You need to ask developers before removing these indexes.

Sybase ASE remote row insert locking

Im working on an application which access a Sybase ASE 15.0.2 ,where the current code access a remote database
(CIS) to insert a row using a proxy table definition (the destination table is a DOL - DRL table - The PK
row is defined as identity ,and is always growing). The current code performs a select to check if the row
already exists to avoid duplicate data to be inserted.
Since the remote table also have a PK definition on the table, i do understand that the PK verification will
be done again prior to commiting the row.
Im planning to remove the select check since its being effectively done again by the PK verification,
but im concerned about if when receiving a file with many duplicates, the table may suffer
some unecessary contention when the data is tried to be commited.
Its not clear to me if Sybase ASE tries to hold the last row and writes the data prior to check for the
duplicate. Also, if the table is very big, im concerned also about the time it will spend looking the
whole index to find duplicates.
I've found some documentation for SQL anywhere, but not ASE in the following link
http://dcx.sybase.com/1200/en/dbusage/insert-how-transact.html
The best i could find is the following explanation
https://groups.google.com/forum/?fromgroups#!topic/comp.databases.sybase/tHnOqptD7X8
But it doesn't enlighten in details how the row is locked (and if there is any kind of
optimization to write it ahead or at the same time of PK checking)
, and also if it will waste a full PK look if im positively inserting a row which the PK
positively greater than the last row commited
Thanks
Alex
Unlike SqlAnywhere there is no option for ASE to set wait_for_commit. The primary key constraint is checked during the insert and not at the commit time. The problem as I understand from your post I see is if you have a mass insert from a file that may contain duplicates is to load into a temp table , check for duplicates, remove the duplicates and then insert the unique ones. Mass insert are lot faster though it still checks for primary key violations. However there is no cost associated as there is no rolling back. The insert statement is always all or nothing. Even if one row is duplicate the entire insert statement will fail. Check before insert in more of error free approach as opposed to use of constraint to the verification because it is going to fail and rollback is going to again be costly.
Thanks Mike
The link does have a very quick explanation about the insert from the CIS perspective. Its a variable to keep an eye on given that CIS may become a representative time consumer
if its performing data and syntax checking if it will be done again when CIS forwards the insert statement to the target server. I was afraid that CIS could have some influence beyond the network traffic/time over the locking/PK checking
Raju
I do agree that avoiding the PK duplication by checking if the row already exists by running a select and doing in a batch, but im currently looking for a stop gap solution, and that may be to perform the insert command in batches of about 50 rows and leave the
duplicate key check for the PK.
Hopefully the PK check will be done over a join of the 50 newly inserted rows, and thus
avoid to traverse the index for each single row...
Ill try to test this and comment back
Alex

Deleting duplicate rows in oracle

Shouldn't the following query work fine for deleting duplicate rows in oracle
SQL> delete from sessions o
where 1<=(select count(*)
from sessions i
where i.id!=o.id
and o.data=i.data);
It seems to delete all the duplication rows!! (I wish to keep 1 tough)
Your statement doesn't work because your table has at least one row where two different ID's share the same values for DATA.
Although your intent may be to look for differing values of DATA ID by ID, what your SQL is saying is in fact set-based: "Look at my table as a whole. If there are any rows in the table such that the DATA is the same but the ID's are different (i.e., that inner COUNT(*) is anything greater than 0), then DELETE every row in the table."
You may be attempting specific, row-based logic, but your statement is big-picture (set-based). There's nothing in it to single out duplicate rows, as there is in the solution Ollie has linked to, for example.

create index before adding columns vs. create index after adding columns - does it matter?

In Oracle 10g, does it matter what order create index and alter table comes in?
Say i have a query Q with a where clause on column C in table T. Now i perform one of the following scenarios:
I create index I(C) and then add columns X,Y,Z.
Add columns X,Y,Z then create index I(C).
Q is 'select * from T where C = whatever'
Between 1 and 2 will there be a significant difference in performance of Q on table T when T contains a very large number of rows?
I personally make it a practice to do #2 but others seem to have a different opinion.
thanks
It makes no difference if you add columns to a table before or after creating an index. The optimizer should pick the same plan for the query and the execution time should be unchanged.
Depending on the physical storage parameters of the table, it is possible that adding the additional columns and populating them with data may force quite a bit of row migration to take place. That row migration will generate changes to the indexes on the table. If the index exists when you are populating the three new columns with data, it is possible that populating the data in X, Y, and Z will take a bit longer because of the additional index maintenance.
If you add columns without populating them, then it is pretty quick as it is just a metadata change. Adding an index does require the table to be read (or potentially another index) so that can be very time consuming and of much greater impact than the simple metadata change of recording the new index details.
If the new columns are going to be populated as part of the ALTER TABLE, it is a different matter.
The database may undergo an unplanned shutdown during the course of adding that data to every row of the table data
The server memory may not have room to record every row changed in that table
Therefore those row changes may be written to datafiles before commit, and are therefore written as dirty blocks
The next read of those blocks, after the ALTER table has successfully completed will do a delayed block cleanout (ie record the fact that the change has been committed)
If you add the columns (with data) first, then the create index will (probably) read the table and do the added work of the delayed block cleanout.
If you create the index first then add the columns, the create index may be faster but the delayed block cleanout won't happen and that housekeeping will be picked up by the application later (potentially by the select * from T where C = whatever)

Resources