Inserting data in a column avoiding duplicates - oracle

Let's say I have a query that fetches col1 after joining multiple tables. I want to insert the values of col1 into a table on a remote db, i.e. I would be using a dblink to do that.
Now, col1 would be fetched from 4-5 different dbs. There is a chance that a value fetched from db1 would also be in db2. How can I avoid duplicates?
In my remote db, I have made col1 the primary key, so when inserting, an error would be thrown on a duplicate key, and the rest of the process would fail, which I don't want. I was thinking about 2 approaches:
1) Write a PL/SQL script that, for each value, checks whether the value already exists and inserts it only if it doesn't.
2) Write a PL/SQL script that inserts and catches the duplicate key exception. The exception would be ignored and it would keep inserting (this doesn't sound that good).
Which approach would you prefer? Is there anything else I can do?

I would use the MERGE statement with WHEN NOT MATCHED THEN INSERT.
The same MERGE could also update, but it doesn't have to; just leave the update part out.
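As a minimal sketch of that MERGE, assuming a hypothetical target table target_tab reached through a database link named remote_db and a local source called source_view (none of these names come from the question):

```sql
-- Insert only the col1 values that are not yet present in the remote table.
-- target_tab, remote_db and source_view are placeholder names.
MERGE INTO target_tab@remote_db t
USING (SELECT DISTINCT col1 FROM source_view) s  -- DISTINCT removes duplicates within this source
ON (t.col1 = s.col1)
WHEN NOT MATCHED THEN
  INSERT (col1) VALUES (s.col1);                 -- no WHEN MATCHED clause: existing rows are left alone
```

Running one such statement per source database, one after the other, would let each run skip whatever the earlier runs already inserted.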

The different databases can have duplicate primary keys, but that doesn't mean the records are duplicates. The actual data may be different in each case. Or the records may represent the same real-world thing but at different statuses. I don't know; you haven't provided enough explanation.
The point is, you need much more analysis of why duplicate records can exist, and probably a more sophisticated approach to handling collisions. Do you need to take all records (in which case you need a synthetic key)? Or do you take only one instance (so how do you decide precedence)? Other scenarios may exist.
In any case, MERGE or PL/SQL loops are likely to be too crude a solution.

First off, I would suggest that your target database drive all of these inserts because inserting/updating across a database link can create some locking issues and further complicate things especially with multiple databases attempting to access and perform DML on the same table. However if that isn't possible the solutions below will work.
I would fix your primary key problem by including a table look-up on the target table for each row.
INSERT INTO customer@dblink.oracle.com
  (emp_name,
   emp_id)
SELECT
  st.employee_name,
  st.employee_id --primary key
FROM
  source_table st
WHERE NOT EXISTS
  (SELECT 1
   FROM customer@dblink.oracle.com cust
   WHERE cust.emp_id = st.employee_id);
Again, I would not recommend DML transactions across database links unless absolutely necessary as you can sometimes have weird locking behavior.
A PL/SQL procedure or anonymous PL/SQL block could be used to create a bulk processing solution as follows:
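CREATE OR REPLACE PROCEDURE send_unique_data
AS
  -- Assumes the remote table has exactly the two columns selected below, in this order.
  TYPE tab_cust IS TABLE OF customer@dblink.oracle.com%ROWTYPE
    INDEX BY PLS_INTEGER;
  t_records   tab_cust;
  bulk_errors EXCEPTION;
  PRAGMA EXCEPTION_INIT(bulk_errors, -24381); -- raised after FORALL ... SAVE EXCEPTIONS
BEGIN
  SELECT
    st.employee_name,
    st.employee_id --primary key
  BULK COLLECT
  INTO t_records
  FROM source_table st;
  -- Caveat: some Oracle versions reject FORALL DML on a remote table (PLS-00739);
  -- in that case, compile this on the target database and read source_table over the link.
  FORALL i IN 1 .. t_records.COUNT SAVE EXCEPTIONS
    INSERT INTO customer@dblink.oracle.com
    VALUES t_records(i);
EXCEPTION
  WHEN bulk_errors THEN
    NULL; -- failed rows (e.g. duplicate keys) are skipped; details are in SQL%BULK_EXCEPTIONS
END send_unique_data;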
You can also read the SQL%BULK_EXCEPTIONS collection if you want to do anything with the records that threw exceptions (such as unique constraint violations). Be warned that this solution will cause performance problems on the target table if a lot of duplicate data is being inserted.

Related

Oracle Insert All vs Insert

In Oracle I came across two types of insert statement:
1) Insert All: multiple entries can be inserted using a single SQL statement
2) Insert: one entry will be inserted per insert.
Now I want to insert around 100,000 records at a time. (The table has 10 fields, including a primary key.) I am not concerned about any return value.
I am using Oracle 11g.
Can you please help me with respect to performance: which is better, "Insert" or "Insert All"?

I know this is kind of a necro, but it's pretty high in the Google search results, so I think this is a point worth making.
Insert All can give dramatic performance benefits if you are building a web application, because it is a single SQL statement that requires only one round trip to your database. In most cases, although far from all, the majority of the cost of a query is actually latency. Depending on what framework you are using, this syntax can help you avoid unnecessary round trips.
This might seem incredibly obvious, but I have seen many, many production web applications in large companies that have forgotten this simple fact.
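For illustration, a small sketch of the single-statement form, assuming a hypothetical table some_table with columns id and name (not taken from the question):

```sql
-- Three rows sent to the database in one statement, hence one round trip.
-- some_table, id and name are placeholder names.
INSERT ALL
  INTO some_table (id, name) VALUES (1, 'first')
  INTO some_table (id, name) VALUES (2, 'second')
  INTO some_table (id, name) VALUES (3, 'third')
SELECT * FROM dual;  -- INSERT ALL needs a subquery; a one-row SELECT from dual fires each INTO once
```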

The insert statement and the insert all statement are practically the same conventional insert statement. insert all, which was introduced in version 9i, simply allows you to insert into multiple tables using one statement. Another type of insert that you could use to speed up the process is a direct-path insert: you use the /*+ append */ or /*+ append_values */ (Oracle 11g) hint
insert /*+ append */ into some_table(<<columns>>)
select <<columns or literals>>
from <<somewhere>>
or (Oracle 11g)
insert /*+ append_values */ into some_table(<<columns>>)
values(<<values>>)
to tell Oracle that you want to perform a direct-path insert. But 100K rows is not that many, and a conventional insert statement will do just fine. You won't get a significant performance advantage from a direct-path insert with that amount of data. Moreover, a direct-path insert won't reuse free space; it adds new data above the HWM (high water mark) and hence requires more space. You also won't be able to run a select or another DML statement against that table in the same session until you have committed.

To use FORALL you would need PL/SQL tables.
This process is quite fast.
You can also set the table to NOLOGGING, which can speed up direct-path inserts.

Sybase ASE remote row insert locking

I'm working on an application that accesses Sybase ASE 15.0.2, where the current code accesses a remote database (CIS) to insert a row using a proxy table definition (the destination table is a DOL - DRL table; the PK column is defined as identity and is always growing). The current code performs a select to check whether the row already exists, to avoid inserting duplicate data.
Since the remote table also has a PK defined, I understand that the PK verification will be done again before the row is committed.
I'm planning to remove the select check, since it is effectively done again by the PK verification, but I'm concerned that, when receiving a file with many duplicates, the table may suffer some unnecessary contention when the data is committed.
It's not clear to me whether Sybase ASE tries to hold the last row and write the data before checking for the duplicate. Also, if the table is very big, I'm concerned about the time it will spend searching the whole index to find duplicates.
I've found some documentation for SQL Anywhere, but not for ASE, at the following link:
http://dcx.sybase.com/1200/en/dbusage/insert-how-transact.html
The best I could find is the following explanation:
https://groups.google.com/forum/?fromgroups#!topic/comp.databases.sybase/tHnOqptD7X8
But it doesn't explain in detail how the row is locked (and whether there is any kind of optimization to write it ahead of, or at the same time as, the PK check), nor whether it will waste a full PK lookup when I'm inserting a row whose PK is definitely greater than that of the last row committed.
Thanks
Alex

Unlike SQL Anywhere, ASE has no wait_for_commit option. The primary key constraint is checked during the insert, not at commit time. As I understand your post, you have a mass insert from a file that may contain duplicates. The approach I would suggest is to load the file into a temp table, check for duplicates, remove them, and then insert the unique rows. Mass inserts are a lot faster, though they still check for primary key violations; however, with the duplicates already removed, there is no rollback cost. The insert statement is always all or nothing: if even one row is a duplicate, the entire insert statement fails. Checking before the insert is the more error-free approach, as opposed to relying on the constraint for verification, because then the insert is going to fail and the rollback is going to be costly.

Thanks Mike
The link does have a very quick explanation of the insert from the CIS perspective. It's a variable to keep an eye on, given that CIS may become a significant time consumer if it performs data and syntax checking that will be done again when CIS forwards the insert statement to the target server. I was afraid that CIS could have some influence, beyond the network traffic/time, on the locking/PK checking.
Raju
I do agree that the PK duplication is better avoided by checking whether the row already exists with a select and doing it in a batch, but I'm currently looking for a stopgap solution, and that may be to perform the insert in batches of about 50 rows and leave the duplicate key check to the PK.
Hopefully the PK check will be done over a join of the 50 newly inserted rows, and thus avoid traversing the index for each single row...
I'll try to test this and comment back
Alex

Deleting duplicate rows in oracle

Shouldn't the following query work fine for deleting duplicate rows in Oracle?
SQL> delete from sessions o
where 1<=(select count(*)
from sessions i
where i.id!=o.id
and o.data=i.data);
It seems to delete all the duplicated rows!! (I wish to keep 1, though.)

Your statement doesn't do what you want because its condition is true for every copy of a duplicated value, not just for the extra ones.
Although your intent is to remove only the surplus rows, what your SQL says is: "For each row, if there is any other row with the same DATA but a different ID (i.e., the inner COUNT(*) is greater than 0), DELETE it." Each copy of a duplicated DATA value has such a partner, so all the copies are deleted.
There's nothing in the statement to single out which copy to keep, as there is in the solution Ollie has linked to, for example.
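A sketch of one way to single out the extra copies, assuming id is unique and that the row with the lowest id per data value is the one to keep (that choice is an assumption, not something stated in the question):

```sql
-- Delete a row only if an older copy (smaller id) with the same data exists.
-- The row with the smallest id for each data value has no such partner, so it survives.
DELETE FROM sessions o
WHERE EXISTS (SELECT 1
              FROM sessions i
              WHERE i.data = o.data
                AND i.id < o.id);
```

The only change from the original statement is replacing i.id != o.id with i.id < o.id, which breaks the symmetry between the copies.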

Oracle - moving data between two identical databases

I have two databases with identical table layouts. There are a dozen or so tables of interest, with a number of FKs between them.
I have been asked to write a stored procedure to copy data from database A to database B based on the PK of the parent table at the top of the hierarchy. I may receive just one value, or a list of values. I'm supposed to select all records from database A that match the value(s) and insert/update them into database B. This includes all the records in the child tables too.
My question is: what's the best (most efficient / best practice) way to do this?
Should I write a dozen select from... insert into... statements?
Should I join the tables together and try to insert into all the tables at the same time?
Thanks!
Additional info:
The record should be inserted if it is not already there (based on the PK of the respective table). Otherwise it should be updated.
Obviously I need to traverse down to all child tables, so there would only be one record to copy in the parent table, but the child table might have 10, and the child's child table might have 500. I would of course need to update the record if it already existed and insert it if not, for the child tables too...
UPDATE:
I think it would make the solution simpler if I just deleted all records related to the top-level key and then inserted all the new records, rather than trying to do updates.
So I guess the question is: is it best to just do a dozen:
delete from ... where ... in ...
select from ... where ... in ...
insert into...
or is it better to do some kind of fancy joins to do all the inserts in one SQL statement?

I would do this by disabling all the foreign key constraints, then running a set of MERGE statements to deal with the updates and inserts, then re-enabling all the constraints.
Think about logging. How much redo do you want to generate?
You might find that it's quicker and better to truncate all the target tables and then insert everything with NOLOGGING. That could be simpler than the merges.
One major alternative would be to drop all the target tables and use export and import. That might be a lot faster.
A second alternative would be to use materialized views, particularly if you don't need to do updates on the target tables. That way, Oracle does all the heavy lifting for you. You can force integrity by choosing refresh groups carefully.
There are several ways to deal with this business problem. A PL/SQL program may not be the best.
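To make the MERGE option from the start of this answer concrete, here is a rough sketch for a single table, assuming it runs on database B, that database A is reachable through a hypothetical link db_a, and that the table is called parent_tab with columns id (the PK) and name. All of these names and the key values are placeholders:

```sql
-- Upsert the selected parent rows from database A into database B.
MERGE INTO parent_tab t
USING (SELECT id, name
       FROM parent_tab@db_a
       WHERE id IN (101, 102)) s       -- the PK value(s) passed to the procedure
ON (t.id = s.id)
WHEN MATCHED THEN
  UPDATE SET t.name = s.name           -- row already exists in B: refresh it
WHEN NOT MATCHED THEN
  INSERT (id, name) VALUES (s.id, s.name);
```

One such statement per table would be needed, with the child tables filtered on their foreign key back to the parent rather than on their own PK.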

What is the fastest way to insert data into an Oracle table?

I am writing a data conversion in PL/SQL that processes data and loads it into a table. According to the PL/SQL Profiler, one of the slowest parts of the conversion is the actual insert into the target table. The table has a single index.
To prepare the data for load, I populate a variable using the rowtype of the table, then insert it into the table like this:
insert into mytable values r_myRow;
It seems that I could gain performance by doing the following:
Turn logging off during the insert
Insert multiple records at once
Are these methods advisable? If so, what is the syntax?

It's much better to insert a few hundred rows at a time, using PL/SQL tables and FORALL to bind them into the insert statement. For details on this see here.
Also be careful with how you construct the PL/SQL tables. If at all possible, prefer to instead do all your transforms directly in SQL using "INSERT INTO t1 SELECT ..." as doing row-by-row operations in PL/SQL will still be slower than SQL.
In either case, you can also use direct-path inserts by using INSERT /*+APPEND*/, which basically bypasses the DB cache and directly allocates and writes new blocks to data files. This can also reduce the amount of logging, depending on how you use it. This also has some implications, so please read the fine manual first.
Finally, if you are truncating and rebuilding the table it may be worthwhile to first drop (or mark unusable) and later rebuild indexes.
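A minimal sketch of the batched FORALL approach, assuming the prepared rows can be read from a hypothetical staging_table whose columns match mytable exactly (the question builds its rows in PL/SQL instead, so treat the cursor as a stand-in):

```sql
DECLARE
  TYPE t_rows IS TABLE OF mytable%ROWTYPE INDEX BY PLS_INTEGER;
  l_rows t_rows;
  -- staging_table is a placeholder source with the same column list as mytable
  CURSOR c_src IS SELECT * FROM staging_table;
BEGIN
  OPEN c_src;
  LOOP
    FETCH c_src BULK COLLECT INTO l_rows LIMIT 500;  -- a few hundred rows per batch
    EXIT WHEN l_rows.COUNT = 0;
    FORALL i IN 1 .. l_rows.COUNT                    -- one bulk-bound insert per batch
      INSERT INTO mytable VALUES l_rows(i);
  END LOOP;
  CLOSE c_src;
  COMMIT;
END;
/
```

If the transformation can be written in SQL, a single INSERT INTO mytable SELECT ... FROM staging_table avoids the PL/SQL loop altogether, as suggested above.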

Regular insert statements are the slowest way to get data in a table and not meant for bulk inserts. The following article references a lot of different techniques for improving performance: http://www.dba-oracle.com/oracle_tips_data_load.htm

Drop the index, then insert the rows, then re-create the index.

If dropping the index doesn't speed things up enough, you need the Oracle SQL*Loader:
http://www.oracle.com/technology/products/database/utilities/htdocs/sql_loader_overview.html

Suppose your columns are eid, ename, sal and job. First create a table:
SQL>create table tablename(eid number, ename varchar2(20),sal number,job char(10));
Now insert data:-
SQL>insert into tablename values(&eid,'&ename',&sal,'&job');

Check this link:
http://www.dba-oracle.com/t_optimize_insert_sql_performance.htm
The main points to consider for your case are:
Use the APPEND hint, as this will append directly into the table instead of using the freelist. If you can afford to turn off logging, then use APPEND with NOLOGGING.
Use a bulk insert instead of iterating in PL/SQL.
Use SQL*Loader to load the data directly into the table if you are getting the data from a file feed.

Here are my recommendations on fast insert.
Trigger - Disable any triggers associated with a table. Enable after Inserts are complete.
Index - Drop Index and re-create it after your Inserts are complete.
Stale stats - Re-analyze table and index stats.
Index de-fragmentation - Rebuild Index if needed
Use No Logging - Insert using INSERT APPEND with NOLOGGING (Oracle only). This is a very risky approach: minimal redo is generated, so the inserted data cannot be recovered from the redo logs after a failure. Make a backup of the table before you start and don't try it on live tables. Check whether your db has a similar option.
Parallel Insert: Running parallel insert will get the job faster.
Use Bulk Insert
Constraints - Not much overhead during inserts, but still a good idea to check if it is still slow even after step 1.
You can learn more on http://www.dbarepublic.com/2014/04/slow-insert.html

Maybe one of your best options is actually to avoid Oracle as much as possible.
I've been baffled by this myself, but very often a Java process can outperform many of Oracle's utilities, which either use OCI (read: SQL*Plus) or take up so much of your time to get right (read: SQL*Loader).
This doesn't prevent you from using specific hints either (like /*+ APPEND */).
I've been pleasantly surprised each time I've turned to that kind of solution.
Cheers,
Rollo
