Oracle Anti-Join Execution plan question - oracle

We have two tables like so:
Event
id
type
... a bunch of other columns
ProcessedEvent
event_id
process
There are indexes defined for
Event(id) (PK)
ProcessedEvent (event_id, process)
The first represents events in an application.
The second represents the fact that a certain event got processes by a certain process. There are many processes that need to process a certain event, so there are multiple entries in the second table for each entry in the first.
In order to find all the events that need processing we execute the following query:
select * // of course we do name the columns in the production code
from Event
where type in ( 'typeA', 'typeB', 'typeC')
and id not in (
select event_id
from ProcessedEvent
where process = :1
)
Statistics are up to date
Since most events are processed, I think the best execution plan should look something like this
full index scan on the ProcessedEvent Index
full index scan on the Event Index
anti join between the two
table access with the rest
filter
Instead Oracle does the following
full index scan on the ProcessedEvent Index
full table scan on the Event table
filter the Event table
anti join between the two sets
With an index hint I get Oracle to do the following:
full index scan on the ProcessedEvent Index
full index scan on the Event Index
table acces on the Event table
filter the Event table
anti join between the two sets
which is really stupid IMHO.
So my question is: what might be the reason for oracle to insist on the early table access?
Addition:
The performance is bad. We are fixing the performance problem by selecting just the Event.IDs and then fetching the needed rows 'manually'. But of course that is just a work around.

your FULL INDEX SCAN will probably be faster than a FULL TABLE SCAN since the index is likely "thinner" than the table. Still, the FULL INDEX SCAN is a complete segment reading and it will be about the same cost as the FULL TABLE SCAN.
However, you're also adding a TABLE ACCESS BY ROWID step. It is an expensive step: one logical IO per row for the ROWID access whereas you get one logical IO per multi blocks (depending upon your db_file_multiblock_read_count parameter) for the FULL TABLE SCAN.
In conclusion, the optimizer computes that:
cost(FULL TABLE SCAN) < cost(FULL INDEX SCAN) + cost(TABLE ACCESS BY ROWID)
Update: The FULL TABLE SCAN also enables the filter on type sooner than in the FULL INDEX SCAN path (since the INDEX doesn't know what type an event is), therefore reducing the size of the set that will be anti-joined (yet another advantage of the FULL TABLE SCAN).

The optimizer does many things which do not make sense at first, but it has it's reasons. They may not always be right, but they're understandable.
The Event table may be easier to full-scan rather than by rowid access because of its size. It could be that there are significantly fewer IO operations involved to read the entire table sequentially than to read bits and pieces.
Is the performance bad, or are you just asking why the optimizer did that?

I can't explain the behavior of the optimizer, but my experience has been to avoid "NOT IN" at all costs, replacing it instead with MINUS, like so:
select * from Event
where id in (
select id from Event where type in ( 'typeA', 'typeB', 'typeC')
minus
select id from ProcessedEvent
)
I've seen orders of magnitude in query performance with similar transformations.

Something like:
WITH
PROCEEDED AS
(
SELECT
event_id
FROM
ProcessedEvent
WHERE
PROCESS = :1
)
SELECT
* // of course we do name the columns in the production code
FROM
EVENT
LEFT JOIN PROCEEDED P
ON
p.event_id = EVENT.event_id
WHERE
type IN ( 'typeA', 'typeB', 'typeC')
AND p.event_id IS NULL; -- exclude already proceeded
could work fast enough (at least much faster than NOT IN).

Related

Trying to figure out max length of Rowid in Oracle

As per my design I want to fetch rowid as in
select rowid r from table_name;
into a C variable. I was wondering what is the max size / length in characters of the rowid.
Currently in one of the biggest tables in my DB we have the max length as 18 and its 18 throughout the table for rowid.
Thanks in advance.
Edit:
Currently the below block of code is iterated and used for multiple tables hence in-order to make the code flexible without introducing the need of defining every table's PK in the query we use ROWID.
select rowid from table_name ... where ....;
delete from table_name where rowid = selectedrowid;
I think as the rowid is picked and used then and there without storing it for future, it is safe to use in this particular scenario.
Please refer to below answer:
Is it safe to use ROWID to locate a Row/Record in Oracle?
I'd say no. This could be safe if for instance the application stores ROWID temporarily(say generating a list of select-able items, each identified with ROWID, but the list is routinely regenerated and not stored). But if ROWID is used in any persistent way it's not safe.
A physical ROWID has a fixed size in a given Oracle version, it does not depend on the number of rows in a table. It consists of the number of the datafile, the number of the block within this file, and the number of the row within this block. Therefore it is unique in the whole database and allows direct access to the block and row without any further lookup.
As things in the IT world continue to grow, it is safe to assume that the format will change in future.
Besides volume there are also structural changes, like the advent of transportable tablespaces, which made it necessary to store the object number (= internal number of the table/partition/subpartion) inside the ROWID.
Or the advent of Index organized tables (mentioned by #ibre5041), which look like a table, but are in reality just an index without such a physical address (because things are moving constantly in an index). This made it necessary to introduce UROWIDs which can store physical and index-based ROWIDs.
Please be aware that a ROWID can change, for instance if the row moves from one table partition to another one, or if the table is defragmented to fill the holes left by many DELETEs.
According documentation ROWID has a length of 10 Byte:
Rowids of Row Pieces
A rowid is effectively a 10-byte physical address of a row.
Every row in a heap-organized table has a rowid unique to this table
that corresponds to the physical address of a row piece. For table
clusters, rows in different tables that are in the same data block can
have the same rowid.
Oracle also documents the (current) format see, Rowid Format
In general you could use the ROWID in your application, provided the affected rows are locked!
Thus your statement may look like this:
CURSOR ... IS
select rowid from table_name ... where .... FOR UPDATE;
delete from table_name where rowid = selectedrowid;
see SELECT FOR UPDATE and FOR UPDATE Cursors
Oracle even provides a shortcut. Instead of where rowid = selectedrowid you can use WHERE CURRENT OF ...

Best way to identify a handful of records expected to have a flag set to TRUE

I have a table that I expect to get 7 million records a month on a pretty wide table. A small portion of these records are expected to be flagged as "problem" records.
What is the best way to implement the table to locate these records in an efficient way?
I'm new to Oracle, but is a materialized view an valid option? Are there such things in Oracle such as indexed views or is this potentially really the same thing?
Most of the reporting is by month, so partitioning by month seems like an option, but a "problem" record may be lingering for several months theorectically. Otherwise, the reporting shuold be mostly for the current month. Would you expect that querying across all month partitions to locate any problem record would cause significant performance issues compared to usinga single table?
Your general thoughts of where to start would be appreciated. I realize I need to read up and I'll do that but I wanted to get the community thought first to make sure I read the right stuff.
One more thought: The primary key is a GUID varchar2(36). In order of magnitude, how much of a performance hit would you expect this to be relative to using a NUMBER data type PK? This worries me but it is out of my control.
It depends what you mean by "flagged", but it sounds to me like you would benefit from a simple index, function based index, or an indexed virtual column.
In all cases you should be careful to ensure that all the index columns are NULL for rows that do not need to be flagged. This way your index will contain only the rows that are flagged (Oracle does not - by default - index rows in B-Tree indexes where all index column values are NULL).
Your primary key being a VARCHAR2 GUID should make no difference, at least with regards to the specific flagging of rows in this question, indexes will point to rows via Oracle internal ROWIDs.
Indexes support partitioning, so if your data is already partitioned, your index could be set to match.
Simple column index method
If you can dictate how the flagging works, or the column already exists, then I would simply add an index to it like so:
CREATE INDEX my_table_problems_idx ON my_table (problem_flag)
/
Function-based index method
If the data model is fixed / there is no flag column, then you can create a function-based index assuming that you have all the information you need in the target table. For example:
CREATE INDEX my_table_problems_fnidx ON my_table (
CASE
WHEN amount > 100 THEN 'Y'
ELSE NULL
END
)
/
Now if you use the same logic in your SELECT statement, you should find that it uses the index to efficiently match rows.
SELECT *
FROM my_table
WHERE CASE
WHEN amount > 100 THEN 'Y'
ELSE NULL
END IS NOT NULL
/
This is a bit clunky though, and it requires you to use the same logic in queries as the index definition. Not great. You could use a view to mask this, but you're still duplicating logic in at least two places.
Indexed virtual column
In my opinion, this is the best way to do it if you are computing the value dynamically (available from 11g onwards):
ALTER TABLE my_table
ADD virtual_problem_flag VARCHAR2(1) AS (
CASE
WHEN amount > 100 THEN 'Y'
ELSE NULL
END
)
/
CREATE INDEX my_table_problems_idx ON my_table (virtual_problem_flag)
/
Now you can just query the virtual column as if it were a real column, i.e.
SELECT *
FROM my_table
WHERE virtual_problem_flag = 'Y'
/
This will use the index and puts the function-based logic into a single place.
Create a new table with just the pks of the problem rows.

T-SQL Performance dilemma when looping through a Big table (details inside)

Let's say I have a Big and a Bigger table.
I need to cycle through the Big table, that is indexed but not sequential (since it is a filter of a sequentially indexed Bigger table).
For this example, let's say I needed to cycle through about 20000 rows.
Should I do 20000 of these
set #currentID = (select min(ID) from myData where ID > #currentID)
or
Creating a (big) temporary sequentially indexed table (copy of the Big table) and do 20000 of
#Row = #Row + 1
?
I imagine that doing 20000 filters of the Bigger table just to fetch the next ID is heavy, but so must be filling a big (Big sized) temporary table just to add a dummy identity column.
Is the solution somewhere else?
For example, if I could loop through the results of the select statement (the filter of the Bigger table that originates "table" (actually a resultset) Big) without needing to create temporary tables, it would be ideal, but I seem to be unable to add something like an IDENTITY(1,1) dummy column to the results.
Thanks!
You may want to consider finding out how to do your work set based instead of RBAR. With that said, for very big tables, you may want to not make a temp table so that you are sure that you have live data if you suspect that the proc may run for a while in production. If your proc fails, you'll be able to pick up where you left off. If you use a temp table then if your proc crashes, then you could lose data that hasn't been completed yet.
You need to provide more information on what your end result is, It is only very rarely necessary to do row-by-row processing (and almost always the worst possible choice from a performance perspective). This article will get you started on how to do many tasks in a set-based manner:
http://wiki.lessthandot.com/index.php/Cursors_and_How_to_Avoid_Them
If you just want a temp table with an identity, here are two methods:
create table #temp ( test varchar (10) , id int identity)
insert #temp (test)
select test from mytable
select test, identity(int) as id into #temp from mytable
I think a join will serve your purposes better.
SELECT BIG.*, BIGGER.*, -- Add additional calcs here involving BIG and BIGGER.
FROM TableBig BIG (NOLOCK)
JOIN TableBigger BIGGER (NOLOCK)
ON BIG.ID = BIGGER.ID
This will limit the set you are working with to. But again it comes down to the specifics of your solution.
Remember too, you can do bulk inserts and bulk updates in this manner too.

Oracle ROWID as function/procedure parameter

I just would like to hear different opinions about ROWID type usage as input parameter of any function or procedure.
I have normally used and seen primary keys used as input parameters but is there some kind of disadvantages to use ROWID as input parameter? I think it's kind a simple and selects are pretty quick if used in WHERE clause.
For example:
FUNCTION get_row(p_rowid IN ROWID) RETURN TABLE%ROWTYPE IS...
From the concept guide:
Physical rowids provide the fastest possible access to a row of a given table. They contain the physical address of a row (down to the specific block) and allow you to retrieve the row in a single block access. Oracle guarantees that as long as the row exists, its rowid does not change.
The main drawback of a ROWID is that while it is normally stable, it can change under some circumstances:
The table is rebuilt (ALTER TABLE MOVE...)
Export / Import obviously
Partition table with row movement enable
A primary key identifies a row logically, you will always find the correct row, even after a delete+insert. A ROWID identifies the row physically and is not as persistent as a primary key.
You can safely use ROWID in a single SQL statement since Oracle will guarantee the result is coherent, for example to remove duplicates in a table. To be on the safe side, I would suggest you only use the ROWID accross statements when you have a lock on the row (SELECT ... FOR UPDATE).
From a performance point of view, the Primary key access is a bit more expensive but you will normally notice this only if you do a lot of single row access. If performance is critical though, you usually can get greater benefit in that case from using set processing than single row processing with rowid. In particular, if there are a lot of roundtrips between the DB and the application, the cost of the row access will probably be negligible compared to the roundtrips cost.

Is an Index Organized Table appropriate here?

I recently was reading about Oracle Index Organized Tables (IOTs) but am not sure I quite understand WHEN to use them. So I have a small table:
create table categories
(
id VARCHAR2(36),
group VARCHAR2(100),
category VARCHAR2(100
)
create unique index (group, category, id) COMPRESS 2;
The id column is a foreign key from another table entries and my common query is:
select e.id, e.time, e.title from entries e, categories c where e.id=c.id AND e.group=? AND c.category=? ORDER by e.time
The entries table is indexed properly.
Both of these tables have millions (16M currently) of rows and currently this query really stinks (note: I have it wrapped in a pagination query also so I only get back the first 20, but for simplicity I omitted that).
Since I am basically indexing the entire table, does it make sense to create this table as an IOT?
EDIT by popular demand:
create table entries
(
id VARCHAR2(36),
time TIMESTAMP,
group VARCHAR2(100),
title VARCHAR2(500),
....
)
create index (group, time) compress 1;
My real question I dont think depends on this though. Basically if you have a table with few columns (3 in this example) and you are planning on putting a composite index on all three rows is there any reason not to use an IOT?
IOTs are great for a number of purposes, including this case where you're gonna have an index on all (or most) of the columns anyway - but the benefit only materialises if you don't have the extra index - the idea is that the table itself is an index, so put the columns in the order that you want the index to be in. In your case, you're accessing category by id, so it makes sense for that to be the first column. So effectively you've got an index on (id, group, category). I don't know why you'd want an additional index on (group, category, id).
Your query:
SELECT e.id, e.time, e.title
FROM entries e, categories c
WHERE e.id=c.id AND e.group=? AND c.category=?
ORDER by e.time
You're joining the tables by ID, but you have no index on entries.id - so the query is probably doing a hash or sort merge join. I wouldn't mind seeing a plan for what your system is doing now to confirm.
If you're doing a pagination query (i.e. only interested in a small number of rows) you want to get the first rows back as quick as possible; for this to happen you'll probably want a nested loop on entries, e.g.:
NESTED LOOPS
ACCESS TABLE BY ROWID - ENTRIES
INDEX RANGE SCAN - (index on ENTRIES.group,time)
ACCESS TABLE BY ROWID - CATEGORIES
INDEX RANGE SCAN - (index on CATEGORIES.ID)
Since the join to CATEGORIES is on ID, you'll want an index on ID; if you make it an IOT, and make ID the leading column, that might be sufficient.
The performance of the plan I've shown above will be dependent on how many rows match the given "group" - i.e. how selective an average "group" is.
Have you looked at dba-oracle.com, asktom.com, IOUG, another asktom.com?
There are penalties to pay for IOTs - e.g., poorer insert performance
Can you prototype it and compare performance?
Also, perhaps you might want to consider a hash cluster.
IOT's are a trade off. You are getting access performance for decreased insert/update performance. We typically use them for reference data that is batch loaded daily and not updated during the day. This is not to say it's the only way to use them, just how we use them.
Few things here:
You mention pagination - have you considered the first_rows hint?
Is that the order your index is in, with group as the first field? If so I'd consider moving ID to be the first column since that index will not be used.
foreign keys should have an index on the column. Consider addind an index on the foreign key (id column).
Are you sure it's not the ORDER BY causing slowness?
What version of Oracle are you using?
I ASSUME there is a primary key on table entries for field id, correct?
Why the WHERE condition does not include "c.group = e.group" ?
Try to:
Remove the order by condition
Change the index definition from "create unique index (group,
category, id)" to "create unique index (id, group, category)"
Reorganise table categories as an IOT on (group, category, id)
Reorganise table categories as an IOT on (id, group, category)
In each of the above case use EXPLAIN PLAN to review the cost

Resources