Deleting duplicate rows in oracle - oracle

Shouldn't the following query work fine for deleting duplicate rows in oracle
SQL> delete from sessions o
where 1<=(select count(*)
from sessions i
where i.id!=o.id
and o.data=i.data);
It seems to delete all the duplication rows!! (I wish to keep 1 tough)

Your statement doesn't work because your table has at least one row where two different ID's share the same values for DATA.
Although your intent may be to look for differing values of DATA ID by ID, what your SQL is saying is in fact set-based: "Look at my table as a whole. If there are any rows in the table such that the DATA is the same but the ID's are different (i.e., that inner COUNT(*) is anything greater than 0), then DELETE every row in the table."
You may be attempting specific, row-based logic, but your statement is big-picture (set-based). There's nothing in it to single out duplicate rows, as there is in the solution Ollie has linked to, for example.

Related

MERGE INTO Performance as table grows

This is a general question about the Oracle MERGE INTO statement with a particular scenario, on Oracle RDBMS 12c.
Daily data will be loaded to StagingTableA - about 10m rows.
This will be MERGEd INTO TableA.
TableA will vary between 0 to 10m rows (matcing StagingTableA).
There may be times when TableA will be pruned/emptied and left with 0 rows.
Clearly, when TableA is empty, a straight INSERT will do the job, but the procedure has been written to use a MERGE INTO method to handle all scenarios.
The MERGE .. MATCH is on a indexed column.
My question is an uncertainty about how the MERGE handles the MATCH in circumstances where TableA will start empty, and then grow hugely during the MERGE execution. The MATCH on indexed columns will use a FTS as the stats will show the table has 0 rows.
At some point during the MERGE transaction, this will become inefficient.
Is the MERGE statement clever enough to detect this and change the execution plan, and start using the index instead of the FTS?
If this was done the old way with CURSOR, UPDATE and INSERT then we could potentially introduce a ANALYZE at a appropriate point (say after 50,000 processed) on the TableA to switch to a optimal plan.
I haven't been able to find any documentation dealing with this specific question.
Hopefully you've got a UNIQUE index on that table, which is based on the incoming data. If I was you, rather than using a simple MERGE I'd:
Mark all indexes on the table as UNUSABLE, except for the unique index.
INSERT all records
Catch the DUPLICATE VALUE ON INDEX exception at the time of INSERT and issue the appropriate UPDATE.
DELETE processed rows from the input record.
Commit every N records (1000? 10000? 100000? Your choice...), calling DBMS_STATS.GATHER_TABLE_STATS for the table you've inserted into after each COMMIT.
Best of luck.

Optimizing a delete... where query with rownum

I'm working with an application that has a large amount of outdated data clogging up a table in my databank. Ideally, I'd want to delete all entries in the table whose reference date is too old:
delete outdatedTable where referenceDate < :deletionCutoffDate
If this statement were to be run, it would take ages to complete, so I'd rather break it up into chunks with the following:
delete outdatedTable where referenceData < :deletionCutoffDate and rownum <= 10000
In testing, this works suprisingly slowly. The following query, however, runs dramatically faster:
delete outdatedTable where rownum <= 10000
I've been reading through multiple blogs and similar questions on StackOverflow, but I haven't yet found a straightforward description of how/whether using rownum affects the Oracle optimizer when there are other Where clauses in the query. In my case, it seems to me as if Oracle checks
referenceData < :deletionCutoffDate
on every single row, executes a massive Select on all matching rows, and only then filters out the top 10000 rows to return. Is this in fact the case? If so, is there any clever way to make Oracle stop checking the Where clause as soon as it's found enough matching rows?
How about a different approach without so much DML on the table. As a permanent solution for future you could go for table partitioning.
Create a new table with required partition(s).
Move ONLY the required rows from your existing table to the new partitioned table.
Once the new table is populated, add the required constraints and indexes.
Drop the old table.
In future, you would just need to DROP the old partitions.
CTAS(create table as select) is another way, however, if you want to have a new table with partition, you would have to go for exchange partition concept.
First of all, you should read about SQL statement's execution plan and learn how to explain in. It will help you to find answers on such questions.
Generally, one single delete is more effective than several chunked. It's main disadvantage is extremal using of undo tablespace.
If you wish to delete most rows of table, much faster way usially a trick:
create table new_table as select * from old_table where date >= :date_limit;
drop table old_table;
rename table new_table to old_table;
... recreate indexes and other stuff ...
If you wish to do it more than once, partitioning is a much better way. If table partitioned by date, you can select actual date quickly and you can drop partion with outdated data in milliseconds.
At last, paritioning if a way to dismiss 'deleting outdated records' at all. Sometimes we need old data, and it's sad if we delete it by own hands. With paritioning you can archive outdated partitions outside of the database, but connects them when you need to access old data.
This is an old request, but I'd like to show another approach (also using partitions).
Depending on what you consider old, you could create corresponding partitions (optimally exactly two; one current, one old; but you could just as well make more), e.g.:
PARTITION BY LIST ( mod(referenceDate,2) )
(
PARTITION year_odd VALUES (1),
PARTITION year_even VALUES (0)
);
This could as well be months (Jan, Feb, ... Dec), decades (XX0X, XX1X, ... XX9X), half years (first_half, second_half), etc. Anything circular.
Then whenever you want to get rid of old data, truncate:
ALTER TABLE mytable TRUNCATE PARTITION year_even;
delete from your_table
where PK not in
(select PK from your_table where rounum<=...) -- these records you want to leave

Is there any use to create index on all the table columns in oracle?

In our one of production database, we have 4 column table and there are no PK,UK constraints on it. only one notnull constraint on one column. The inserts are slow on this table and when I checked the indexes , there is one index which is built on all columns.
It is a normal table and not IOT. I really don't see a need of all column index, but wondering why the developers has created it?
Appreciate your thoughts?
It might be usefull, i.e. if you (mainly) query all columns oracle doesn't have to access the table at all, but can get all the data from the index. Though inserts take longer because a larger index has to be maintained by the dbms everytime.
One case where it could be useful is,
Say for example, you are trying to check the existence of records in this table and for that you have to have joins on all four columns. So in such a case if you have written a correlated query like below,
SELECT <something>
FROM table_1 t1
WHERE EXISTS
(SELECT 1 FROM table_t2 t2 where t1.c1=t2.c1 and t1.c2=t2.c2 and t1.c3=t2.c3 and t1.c4=t2.c4)
Apart from above case, it looks an error to me from developer's side.
Indexes are good to better query optimization but causes slow updates/inserts because the indexes needs to be updated at each modification.
If these tables first use is querying and inserts happens only in a specific periods like a batch at the beginning or the end of the day only, then you can remove the indexes before updating tables and then restore them.
In addition, all the queries all these tables need to be analysed to see which indexes are useful and which are not?
Anyway, You need to ask developers before removing these indexes.

Inserting data in a column avoiding duplicates

Lets say i have a query which is fetching col1 after joining multiple tables. I want to insert values of that col1 in a table which is on remote db i.e. i would be using dblink to do that.
Now that col1 would be fetched from 4-5 different db's. There is chances that a value1 fetch from db1 would b in db2 as well. How can i avoid duplicates ?
In my remote db, I have created col1 a primary key. so when inserting, an error would be thrown if there is a duplicate key, end result failing rest of the process. Which i don't want to. I was thiking about 2 approaches
Write a PLSQL script, For each value, determine if value already exists or not. If it doesn't then insert.
Write a PLSQL script and insert and catch the duplicate key exception. The exception would be ignore and it will keep inserting (it doesn't sound that good).
Which approach would you prefer? Is there anything else i can do ?
I would use the MERGE statement and WHEN NOT MATCHED THEN INSERT.
The same merger could also update but it doesn't have to, just leave the update part out.
The different databases can have duplicate primary keys but that doesn't mean the records are duplicates. The actual data may be different in each case. Or the records may represent the same real world thing but at different statuses, Don't know, you haven't provided enough explanation.
The point is, you need much more analysis of why duplicate records can exist and probably a more sophisticated approach to handling collisions. Do you need to take all records (in which case you need a synthetic key)? Or do you take only one instance (so how do you decide precedence)? Other scenarios may exist.
In any case, MERGE or PL/SQL loops are likely to be too crude a solution.
First off, I would suggest that your target database drive all of these inserts because inserting/updating across a database link can create some locking issues and further complicate things especially with multiple databases attempting to access and perform DML on the same table. However if that isn't possible the solutions below will work.
I would fix your primary key problem by including a table look-up on the target table for each row.
INSERT INTO customer#dblink.oracle.com cust
(emp_name,
emp_id)
VALUES
(SELECT
cust.employee_name,
cust.employee_id --primary_key
FROM
source_table st
WHERE NOT EXISTS
(SELECT 1
FROM customer#dblink.oracle.com cust
WHERE cust.employee_id = st.emp_id));
Again, I would not recommend DML transactions across database links unless absolutely necessary as you can sometimes have weird locking behavior.
A PL/SQL procedure or anonymous PL/SQL block could be used to create a bulk processing solution as follows:
CREATE OR REPLACE PROCEDURE send_unique_data
AS
TYPE tab_cust IS TABLE OF customer#dblink.oracle.com%ROWTYPE
INDEX BY PLS_INTEGER;
t_records tab_cust;
BEGIN
SELECT
cust.employee_name,
cust.employee_id --primary_key
BULK COLLECT
INTO t_records
FROM source_table;
FORALL i IN t_records.FIRST...t_records.LAST SAVE EXCEPTIONS
INSERT INTO customer#dblink.oracle.com
VALUES t_records(i);
END send_unique_data;
You can also call the system SQL%BULKEXCEPTIONS collection in case you want to do anything with the records that threw exceptions (such as unique_constraint violations). Be warned that this solution will cause the target table to suffers from performance issues if there are lots of duplicate data attempting to be inserted.

Query does cartesian join unless

I've got a query that's supposed to return 2 rows. However, it returns 48 rows. It's acting like one of the tables that's being joined isn't there. But if I add a column from that table to the select clause, with no changes to the from or where parts of the query, it returns 2 rows.
Here's what "Explain plan" says without the "m.*" in the select:
Here it is again after adding m.* in the select:
Can anybody explain why it should behave this way?
Update: We only had this problem on one system and not another. The DBA verified that the one with the problem is running optimizer_features_enable set to 10.2.0.5, and the one where it doesn't happen is running optimizer_features_enable set to 10.2.0.4. Unfortunately the customer site is running 10.2.0.5.
It's about a join elimination that was introduced in 10gR2:
Table elimination (alternately called
"join elimination") removes redundant
tables from a query. A table is
redundant if its columns are only
referenced to in join predicates, and
it is guaranteed that those joins
neither filter nor expand the
resulting rows. There are several
cases where Oracle will eliminate a
redundant table.
Maybe that's kind of related bug or so. Have a look at this article.
Looks like a bug. What are the constraints ?
Logically, if all rows in MASTERSOURCE_FUNCTION had the function NON-OSDA then that wouldn't exclude any rows (or if none had that value, then all rows would be excluded).
Going one step further, if every row in MASTERSOURCE had one or zero NON-OSDA rows in MASTERSOURCE_FUNCTION, then it should be a candidate for exclusion. But there would also need to be a one-to-one between the MASTERSOURCE ID and NAME.
I'd pull the ROWIDs from ACCOUNTSOURCE for the 48 rows, then track the MASTERSOURCE ID and NAME and see on what grounds those rows are being duplicated or not excluded. That is, are there 12 duplicate names in MASTERSOURCE where it is expected to be unique through a NOVALIDATE constraint.

Resources