Oracle : Identifying duplicates in a table without index

Oracle : Identifying duplicates in a table without index - oracle

When I try to create a unique index on a large table, I get a unique contraint error. The unique index in this case is a composite key of 4 columns.
Is there an efficient way to identify the duplicates other than :
select col1, col2, col3, col4, count(*)
from Table1
group by col1, col2, col3, col4
having count(*) > 1
The explain plan above shows full table scan with extremely high cost, and just want to find if there is another way.
Thanks !

Try creating a non-unique index on these four columns first. That will take O(n log n) time, but will also reduce the time needed to perform the select to O(n log n).
You're in a bit of a bind here -- any way you slice it, the entire table has to be read in at least once. The naïve algorithm runs in O(n2) time, unless the query optimizer is clever enough to build a temporary index/table.

You can use the EXCEPTIONS INTO clause to trap the duplicated rows.
If you don't already have an EXCEPTIONS table create one using the provided script:
SQL> #$ORACLE_HOME/rdbms/admin/ultexcpt.sql
Now you can attempt to create a unique constraint like this
alter table Table1
add constraint tab1_uq UNIQUE (col1, col2, col3, col4)
exceptions into exceptions
/
This will fail but now your EXCEPTIONS table contains a list of all the rows whose keys contain duplicates, identified by ROWID. That gives you a basis for deciding what to do with the duplicates (delete, renumber, whatever).
edit
As others have noted you have to pay the cost of scanning the table once. This approach gives you a permanent set of the duplicated rows, and ROWID is the fastest way of accessing any given row.

Since there is no index on those columns, that query would have to do a full table scan - no other way to do it really, unless one or more of those columns is already indexed.
You could create the index as a non-unique index, then run the query to identify the duplicate rows (which should be very fast once the index is created). But I doubt if the combined time of creating the non-unique index then running the query would be any less than just running the query without the index.

In fact, you need to look for a duplicate of every single row in a table. No way to do this effectively without an index.

I don't think there is a quicker way unfortunately.

Related

MERGE INTO Performance as table grows

This is a general question about the Oracle MERGE INTO statement with a particular scenario, on Oracle RDBMS 12c.
Daily data will be loaded to StagingTableA - about 10m rows.
This will be MERGEd INTO TableA.
TableA will vary between 0 to 10m rows (matcing StagingTableA).
There may be times when TableA will be pruned/emptied and left with 0 rows.
Clearly, when TableA is empty, a straight INSERT will do the job, but the procedure has been written to use a MERGE INTO method to handle all scenarios.
The MERGE .. MATCH is on a indexed column.
My question is an uncertainty about how the MERGE handles the MATCH in circumstances where TableA will start empty, and then grow hugely during the MERGE execution. The MATCH on indexed columns will use a FTS as the stats will show the table has 0 rows.
At some point during the MERGE transaction, this will become inefficient.
Is the MERGE statement clever enough to detect this and change the execution plan, and start using the index instead of the FTS?
If this was done the old way with CURSOR, UPDATE and INSERT then we could potentially introduce a ANALYZE at a appropriate point (say after 50,000 processed) on the TableA to switch to a optimal plan.
I haven't been able to find any documentation dealing with this specific question.

Hopefully you've got a UNIQUE index on that table, which is based on the incoming data. If I was you, rather than using a simple MERGE I'd:
Mark all indexes on the table as UNUSABLE, except for the unique index.
INSERT all records
Catch the DUPLICATE VALUE ON INDEX exception at the time of INSERT and issue the appropriate UPDATE.
DELETE processed rows from the input record.
Commit every N records (1000? 10000? 100000? Your choice...), calling DBMS_STATS.GATHER_TABLE_STATS for the table you've inserted into after each COMMIT.
Best of luck.

Is insert statement without column approve performance in Oracle

I'm using Oracle 12.
When you define insert statement there is option not to state column list
If you omit the column list altogether, then the values_clause or query must specify values for all columns in the table.
Also it's describe in Ask TOM when suggesting a best performance solution for bulk:
insert into insert_into_table values ( data(i) );
My question, is not stating columns really produce a better or at least equal performance than stating column in statement as
insert table A (col1, col2, col3) values (?, ?, ?);

From my experience there is no gain in omitting column names - also it's a bad practice to do so, since if column order changes (sometimes people do that, for clarity, but they really don't need to) and their definition allows to insert the data, you will get wrong data in wrong columns.
As a rule of thumb it's not worth the trouble. Always specify column list that you're putting values into. Database has to check that anyways.
Related: SQL INSERT performance omitting field names?

Best practice is, that you ALWAYS define the columns in insert statements, NOT for the performance sake(there is no difference), but for situation like this:
You create table test1 with columns col1, col2;
You insert data in that table in your procedures/etc, without naming the columns;
You add new columns, col3, col4;
The current logic will fail with insert, errors will be raised.
So, to avoid the failure, always name the columns, then your code doesn't brake,when you modify the table structure.

Fastest way to SELECT a row from a table in a database (Microsoft SQL server)

I have a huge table with one int PRIMARY KEY IDENTITY column.
I guess making the SELECT query using that primary key is the fastest way for the database to find the row in the table isn't it?
If that is true i still have a question.
Is that query as fast as a call to a dictionary by key or the database still has to read all the rows from the beginning (the Primary Key column) till it finds the row itself?
Thanks in advance ^^

Using primary key is obviously the fastest way to access a particular row.
If you want to understand how it works, you have to understand how index works.
In general it works like that :
Let's say you have a table t1(col1,col2...col10) and you have an index on col1.
Index on col1 means that you have some data structure which contains pairs (col1, rec_id)
and rec_id allows direct access to row with appropriate col1.
The data structure is ordered by col1 and therefore allows efficient searching by col1.

I think searching in dictionary works per dictionary search algorithm which should be more like binary search kind.
When you declare a column as Primary key in table, then that column is indexed, hence it should be working based on hashing principle, so searching is definitely NOT row by row as you mentioned.
Finally, yes it is the common and fast way, but you should be selective about the number of columns and rows you need in your sql query. Avoid fetching large number of rows per select call.

Wrong index is chosen by Oracle

I have a problem in indexing in Oracle. Will try to explain my problem with an instance as follows.
I have a table TABLE1 with columns A,B,C,D
another table TABLE2 with columns A,B,C,E,F,H
I have created Indexes for TABLE1
IX_1 A
IX_2 A,B
IX_3 A,C
IX_4 A,B,C
I have created Indexes for TABLE1
IY_1 A,B,C
IY_2 A
when i gave query similar to this
SELECT * FROM TABLE1 T1,TABLE2 T2
WHERE T1.A=T2.A
When i give Explain Plan i got its not getting IX_1 nor IY_2
Its taking IX_4 nor IY_1
why this is not picking right index?
EDITED:
Can anyone help me to know difference between INDEX RANGE SCAN,INDEX UNIQUE SCAN, INDEX SKIP SCAN
I guess SKIP SCAN means when a column is skipped in Composite Index by Oracle
what about others i dont have idea!

The best benefit of indexes is that you can select a few rows from a table without scanning the entire table.
If you ask for too many rows(let's say 30% - depends of many things) the engine will prefer to scan the entire table for those rows.
That's because reading a row using an index is gets an overhead : reading some index blocks, and after that reading table blocks.
In your case, in order to join tables T1 and T2, Oracle needs all the rows from those table. Reading(full) the index will be an unsefull operation, adding unnecesary cost.
UPDATE: A step forward: if you run:
SELECT T1.B, T2.B FROM TABLE1 T1,TABLE2 T2
WHERE T1.A=T2.A
Oracle probably will use the indexes(IX2, IY2), because it does not need to read anything from table, because the values T1.B, T2.B, are in indexes.

Oracle 10g - optimize WHERE IS NOT NULL

We have Oracle 10g and we need to query 1 table (no joins) and filter out rows where 1 of the columns is null. When we do this - WHERE OurColumn IS NOT NULL - we get a full table scan on a very large table - BAD BAD BAD. The column has an index on it but it gets ignored in this instance. Are there any solutions to this?
Thanks

The optimizer thinks that the full table scan will be better.
If there are just a few NULL rows, the optimizer is right.
If you are absolutely sure that the index access will be faster (that is, you have more than 75% rows with col1 IS NULL), then hint your query:
SELECT /*+ INDEX (t index_name_on_col1) */
*
FROM mytable t
WHERE col1 IS NOT NULL
Why 75%?
Because using INDEX SCAN to retrieve values not covered by the index implies a hidden join on ROWID, which costs about 4 times as much as table scan.
If the index range includes more than 25% of rows, the table scan is usually faster.
As mentioned by Tony Andrews, clustering factor is more accurate method to measure this value, but 25% is still a good rule of thumb.

The optimiser will make its decision based on the relative cost of the full table scan and using the index. This mainly comes down to how many blocks will have to be read to satisfy the query. The 25%/75% rule of thumb mentioned in another answer is simplistic: in some cases a full table scan will make sense even to get 1% of the rows - i.e. if those rows happen to be spread around many blocks.
For example, consider this table:
SQL> create table t1 as select object_id, object_name from all_objects;
Table created.
SQL> alter table t1 modify object_id null;
Table altered.
SQL> update t1 set object_id = null
2 where mod(object_id,100) != 0
3 /
84558 rows updated.
SQL> analyze table t1 compute statistics;
Table analyzed.
SQL> select count(*) from t1 where object_id is not null;
COUNT(*)
----------
861
As you can see, only approximately 1% of the rows in T1 have a non-null object_id. But due to the way I built the table, these 861 rows will be spread more or less evenly around the table. Therefore, the query:
select * from t1 where object_id is not null;
is likely to visit almost every block in T1 to get data, even if the optimiser used the index. It makes sense then to dispense with the index and go for a full table scan!
A key statistic to help identify this situation is the index clustering factor:
SQL> select clustering_factor from user_indexes where index_name='T1_IDX';
CLUSTERING_FACTOR
-----------------
460
This value 460 is quite high (compared to the 861 rows in the index), and suggests that a full table scan will be used. See this DBAZine article on clustering factors.

If you are doing a select *, then it would make sense to do a table scan rather than using the index. If you know which columns you are interested in, you could create a covered index with those colums plus the one you are applying the IS NOT NULL condition.

It can depend on the type of index you have on the table.
Most B-tree indexes do not store null entries. Bitmap indexes do store null entries.
So, if you have:
select * from mytable
where mycolumn is null
and you have a standard B-tree index on mycolumn, then the query can't use the index as the "null" isn't in the index.
(If the index is against multiple columns, and one of the indexed columns is not null then there will be an entry in the index.)

Create an index on that column.
To make sure the index is used, it should be on the index and other columns in the where.
ocdecio answered:
If you are doing a select *, then it would make sense to do a table scan rather than using the index.
That's not strictly true; an index will be used if there is an index that fits your where clause, and the query optimizer decides using that index would be faster than doing a table scan. If there is no index, or no suitable index, only then must a table scan be done.

It's also worth checking whether Oracle's statistics on the table are up to date. It may not know that a full table scan will be slower.

Oracle database don't index null values at all in regular (b-tree) indexes, so it can't use it nor you can't force oracle database to use it.
BR

Using hints should be done only as a work around rather than a solution.
As mentioned in other answers, the null value is not available in B-TREE indexes.
Since you know that you have mostly null values in this column, would you be able to replace the null value by a range for instance.
That really depends on your column and the nature of your data but typically, if your column is a date type for instance:
where mydatecolumn is not null
Can be translated in a rule saying: I want all rows which have a date.
Then you can most definitely do this:
where mydatecolumn <=sysdate (in oracle)
This will return all rows with a date and ommit null values while taking advantage of the index on that column without using any hints.

See http://www.oracloid.com/2006/05/using-index-for-is-null/
If your index is on one single field, it will NOT be used. Try to add a dummy field or a constant in the index:
create index tind on t(field_to_index, 1);

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio