Oracle explain plan estimates incorrect cardinality for an index range scan - oracle

I have an Oracle 10.2.0.3 database, and a query like this:
select count(a.id)
from LARGE_PARTITIONED_TABLE a
join SMALL_NONPARTITIONED_TABLE b on a.key1 = b.key1 and a.key2 = b.key2
where b.id = 1000
Table LARGE_PARTITIONED_TABLE (a) has about 5 million rows, and is partitioned by a column not present in the query. Table SMALL_NONPARTITIONED_TABLE (b) is not partitioned, and holds about 10000 rows.
Statistics are up-to-date, and there are height balanced histograms in columns key1 and key2 of table a.
Table a has a primary key and a global, nonpartitioned unique index on columns key1, key2, key3, key4, and key5.
Explain plan for the query displays the following results:
---------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 31 | 4 (0)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | 31 | | |
| 2 | NESTED LOOPS | | 406 | 12586 | 4 (0)| 00:00:01 |
|* 3 | INDEX RANGE SCAN| INDEX_ON_TABLE_B | 1 | 19 | 2 (0)| 00:00:01 |
|* 4 | INDEX RANGE SCAN| PRIMARY_KEY_INDEX_OF_TABLE_A | 406 | 4872 | 2 (0)| 00:00:01 |
---------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - access("b"."id"=1000)
4 - access("a"."key1"="b"."key1" and
"a"."key2"="b"."key2")
Thus the rows (cardinality) estimated for step 4 is 406.
Now, a tkprof trace reveals the following:
Rows Row Source Operation
------- ---------------------------------------------------
1 SORT AGGREGATE (cr=51 pr=9 pw=0 time=74674 us)
7366 NESTED LOOPS (cr=51 pr=9 pw=0 time=824941 us)
1 INDEX RANGE SCAN INDEX_ON_TABLE_B (cr=2 pr=0 pw=0 time=36 us)(object id 111111)
7366 INDEX RANGE SCAN PRIMARY_KEY_INDEX_OF_TABLE_A (cr=49 pr=9 pw=0 time=810173 us)(object id 222222)
So the cardinality in reality was 7366, not 406!
My question is this: From where does Oracle get the estimated cardinality of 406 in this case, and how can I improve its accuracy, so that the estimate is more in line of what really happens during query execution?
Update: Here is a snippet of a 10053 trace I ran on the query.
NL Join
Outer table: Card: 1.00 Cost: 2.00 Resp: 2.00 Degree: 1 Bytes: 19
Inner table: LARGE_PARTITIONED_TABLE Alias: a
...
Access Path: index (IndexOnly)
Index: PRIMARY_KEY_INDEX_OF_TABLE_A
resc_io: 2.00 resc_cpu: 27093
ix_sel: 1.3263e-005 ix_sel_with_filters: 1.3263e-005
NL Join (ordered): Cost: 4.00 Resp: 4.00 Degree: 1
Cost_io: 4.00 Cost_cpu: 41536
Resp_io: 4.00 Resp_cpu: 41536
****** trying bitmap/domain indexes ******
Best NL cost: 4.00
resc: 4.00 resc_io: 4.00 resc_cpu: 41536
resp: 4.00 resp_io: 4.00 resp_cpu: 41536
Using concatenated index cardinality for table SMALL_NONPARTITIONED_TABLE
Revised join sel: 8.2891-e005 = 8.4475e-005 * (1/12064.00) * (1/8.4475e-005)
Join Card: 405.95 = outer (1.00) * inner (4897354.00) * sel (8.2891-e005)
Join Card - Rounded: 406 Computed: 405.95
So that is where the value 406 is coming from. Like Adam answered, join cardinality is join selectivity * filter cardinality (a) * filter cardinality (b), as can be seen on the second to last line of above trace quote.
What I do not understand is the Revised join sel line. 1/12064 is the selectivity of the index used to find the row from table b (12064 rows on table, and select based on unique id). But so what?
Cardinality appears to be calculated by
multiplying the filter cardinality
of table b (4897354) with
selectivity of table a (1/12064).
Why? What
does the selectivity on
table a have to do with how much
rows is expected to be found from
table b, when a --> b join is not based on
a.id?
Where does the number
8.4475e-005 come from (it does not appear anywhere else in the whole
trace)? Not that it affects the
output, but I'd still like to know.
I understand that the optimizer has likely chosen the correct path here. But still the cardinality is miscalculated - and that can have a major effect on the execution path that is chosen from that point onwards (as in the case I'm having IRL - this example is a simplification of that).

Generating a 10053 trace file will help show exactly what choices the optimizer's making regarding its estimation of cardinality and selectivity. Jonathan Lewis' excellect Cost Based Oracle Fundamentals is an excellent resource to understanding how the optimizer works, and the printing I have spans 8i to 10.1.
From that work:
Join Selectivity = ((num_rows(t1) - num_nulls(t1.c1)) / num_rows(t1))
* ((num_rows(t2) - num_nulls(t2.c2)) / num_rows(t2))
/ greater (num_distinct(t1.c1), num_distinct(t2.c2))
Join Cardinality = Join Selectivity
* filtered_cardinality (t1)
* filtered_cardinality (t2)
However, because we have a multi-column join, Join Selectivity isn't at the table level, it's the product (intersection) of the join selectivities on each column. Assuming there's no nulls in play:
Join Selectivity = Join Selectivity (key1) * Join Selectivity (key2)
Join Selectivity (key1) = ((5,000,000 - 0) / 5,000,000)
* ((10,000 - 0)) / 10,000)
/ max (116, ?) -- distinct key1 values in B
= 1 / max(116, distinct_key1_values_in_B)
Join Selectivity (key2) = ((5,000,000 - 0) / 5,000,000)
* ((10,000 - 0)) / 10,000)
/ max (650, ?) -- distinct key2 values in B
= 1 / max(650, distinct_key2_values in B)
Join Cardinality = JS(key1) * JS(key2)
* Filter_cardinality(a) * Filter_cardinality(b)
We know that there are no filters on A, so that's tables filter cardinality is the number of rows. We're selecting the key value from B, so that table's filter cardinality is 1.
So the best case for estimated estimated join cardinality is now
Join Cardinality = 1/116 * 1/650 * 5,000,000 * 1
=~ 67
It might be easier to work backward. Your estimated cardinality of 406, given what we know, leads to a join selectivty of 406/5,000,000, or approximately 1/12315. That happens to be really, really close to 1 / (116^2), which is a sanity check within the optimizer to prevent it from finding too aggressive a cardinality on multi-column joins.
For the TL;DR crowd:
Get Jonathan Lewis' Cost Based Oracle Fundamentals.
Get a 10053 trace of the query whose behavior your can't understand.

The cardinality estimate would be based on the product of the selectivity of a.key1 and a.key2, which (at least in 10g) would each be based on the number of distinct values for those two columns as recorded in the column statistics.
For a table of 5M rows, a cardinality estimate of 406 is not significantly different to 7366. The question you have to ask yourself is, is the "inaccurate" estimate here causing a problem?
You can check what plan Oracle would choose if it were able to generate a perfectly accurate estimate by getting an explain plan for this:
select /*+CARDINALITY(a 7366)*/ count(a.id)
from LARGE_PARTITIONED_TABLE a
join SMALL_NONPARTITIONED_TABLE b on a.key1 = b.key1 and a.key2 = b.key2
where b.id = 1000;
If this comes up with the same plan, then the estimate that Oracle is calculating is already adequate.

You might be interested to read this excellent paper by Wolfgang Breitling which has a lot of info on CBO calculations: http://www.centrexcc.com/A%20Look%20under%20the%20Hood%20of%20CBO%20-%20the%2010053%20Event.pdf.
As explained there, because you have histograms, the filter-factor calculation for these columns does not use number of distinct values (NDV) but density, which is derived from the histogram in some way.
What are the DENSITY values in USER_TAB_COLUMNS for a.key1 and a.key2?
Generally, the problem in cases like this is that Oracle does not gather statistics on pairs of columns, and assumes that their combined filter factor will be the product of their individual factors. This will produce low estimates if there is any correlation between values of the two columns.
If this is causing a serious performance issue, I suppose you could create a function-based index on a function of those columns, and use that to do the lookup. Then Oracle would gather statistics on that index and probably produce better estimates.

Related

Confusion regarding to_char and to_number

First of all, I am aware about basics.
select to_number('A231') from dual; --this will not work but
select to_char('123') from dual;-- this will work
select to_number('123') from dual;-- this will also work
Actually in my package, we have 2 tables A(X number) and B(Y varchar) There are many columns but we are worried about only X and Y. X contains values only numeric like 123,456 etc but Y contains some string and some number for eg '123','HR123','Hello'. We have to join these 2 tables. its legacy application so we are not able to change tables and columns.
Till this time below condition was working properly
to_char(A.x)=B.y;
But since there is index on Y, performance team suggested us to do
A.x=to_number(B.y); it is running in dev env.
My question is, in any circumstances will this query give error? if it picks '123' definitely it will give 123. but if it picks 'AB123' then it will fail. can it fail? can it pick 'AB123' even when it is getting joined with other table.
can it fail?
Yes. It must put every row through TO_NUMBER before it can check whether or not it meets the filter condition. Therefore, if you have any one row where it will fail then it will always fail.
From Oracle 12.2 (since you tagged Oracle 12) you can use:
SELECT *
FROM A
INNER JOIN B
ON (A.x = TO_NUMBER(B.y DEFAULT NULL ON CONVERSION ERROR))
Alternatively, put an index on TO_CHAR(A.x) and use your original query:
SELECT *
FROM A
INNER JOIN B
ON (TO_CHAR(A.x) = B.y)
Also note: Having an index on B.y does not mean that the index will be used. If you are filtering on TO_NUMBER(B.y) (with or without the default on conversion error) then you would need a function-based index on the function TO_NUMBER(B.Y) that you are using. You should profile the queries and check the explain plans to see whether there is any improvement or change in use of indexes.
Never convert a VARCHAR2 column that can contain non-mumeric strings to_number.
This can partially work, but will eventuelly definitively fail.
Small Example
create table a as
select rownum X from dual connect by level <= 10;
create table b as
select to_char(rownum) Y from dual connect by level <= 10
union all
select 'Hello' from dual;
This could work (as you limit the rows, so that the conversion works; if you are lucky and Oracle chooses the right execution plan; which is probable, but not guarantied;)
select *
from a
join b on A.x=to_number(B.y)
where B.y = '1';
But this will fail
select *
from a
join b on A.x=to_number(B.y)
ORA-01722: invalid number
Performance
But since there is index on Y, performance team suggested us to do A.x=to_number(B.y);
You should chalange the team, as if you use a function on a column (to_number(B.y)) index can't be used.
On the contrary, your original query can perfectly use the following indexes:
create index b_y on b(y);
create index a_x on a(x);
Query
select *
from a
join b on to_char(A.x)=B.y
where A.x = 1;
Execution Plan
--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 5 | 1 (0)| 00:00:01 |
| 1 | NESTED LOOPS | | 1 | 5 | 1 (0)| 00:00:01 |
|* 2 | INDEX RANGE SCAN| A_X | 1 | 3 | 1 (0)| 00:00:01 |
|* 3 | INDEX RANGE SCAN| B_Y | 1 | 2 | 0 (0)| 00:00:01 |
--------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("A"."X"=1)
3 - access("B"."Y"=TO_CHAR("A"."X"))

Oracle partitioned table query cost vs non-partitioned table query cost

I have a table PO_HEADER with ~20 million records. Considering our future load on the table we have decided to partitioned the table to increase the performance of the sql queries. Below are the queries used to create the new partitioned tables.
CREATE TABLE PO_HEADER_LP
PARTITION BY LIST (BUYER_IDENTIFIER)
(PARTITION GC66287246AA VALUES ('GC66287246AA') TABLESPACE MITRIX_TABLES,
PARTITION GC43837235JK VALUES ('GC43837235JK') TABLESPACE MITRIX_TABLES,
PARTITION GC84338293AA VALUES ('GC84338293AA') TABLESPACE MITRIX_TABLES,
PARTITION DEFAULTBUID VALUES (DEFAULT) TABLESPACE MITRIX_TABLES)
AS SELECT *
FROM PO_HEADER;
create index PO_HEADER_LP_SI_IDX on PO_HEADER_LP("SUPPLIER_IDENTIFIER") TABLESPACE MITRIX_INDEXES LOCAL;
Old Table PO_HEADER has two indexes on "BUYER_IDENTIFIER" and "SUPPLIER_IDENTIFIER" columns as follows:
create index PO_HEADER_BI_IDX on PO_HEADER("BUYER_IDENTIFIER") TABLESPACE MITRIX_INDEXES;
create index PO_HEADER_SI_IDX on PO_HEADER("SUPPLIER_IDENTIFIER") TABLESPACE MITRIX_INDEXES;
To test the performance of the query, I executed below query on both the tables. But, to my wonder I saw the cost of the 2nd query is almost double than the 1st one. Can any body know, why is the query cost is high of the partitioned table compared to normal table. Thanks in Advance.
select * from po_header where buyer_identifier='GC84338293AA' and supplier_identifier='GC75987723HT'; --cost: 56,941
select * from po_header_lp where buyer_identifier= 'GC84338293AA' and supplier_identifier='GC75987723HT'; --cost: 93,309
PO_HEADER with Global Index on buyer_identifier & supplier_identifier column
PO_HEADER_LP with Global Index on supplier_identifier column
PO_HEADER_LP with Local Index on supplier_identifier column
From your DDL I assume, you have three big buyers (say 5M records each) and a bunch of smaller ones. In other word this would be the correct setup for you list partitioning schema.
You may verify, whether it works testing access on buyer only:
EXPLAIN PLAN SET STATEMENT_ID = 'jara1' into plan_table FOR
select * from tab_lp where BUYER_ID = 1;
;
SELECT * FROM table(DBMS_XPLAN.DISPLAY('plan_table', 'jara1','ALL'));
------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop |
------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 6662K| 82M| 4445 (2)| 00:00:01 | | |
| 1 | PARTITION LIST SINGLE| | 6662K| 82M| 4445 (2)| 00:00:01 | KEY | KEY |
| 2 | TABLE ACCESS FULL | TAB_LP | 6662K| 82M| 4445 (2)| 00:00:01 | 2 | 2 |
------------------------------------------------------------------------------------------------
The same query for the non-partitioned table should produce much higher cost. Why?
In the partitioned table the selected buyer (in your case GC84338293AA, I'm using surrogate keys) has it own partition.
So full scan of this partition is the best access.
select * from tab where BUYER_ID = 1;
--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 6596K| 81M| 14025 (1)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| TAB | 6596K| 81M| 14025 (1)| 00:00:01 |
--------------------------------------------------------------------------
1 - filter("BUYER_ID"=1)
For the non-partitioned table (to get approximately one fourth of the data) the FULL TABLE SCAN is OK as well,
but of course has higher cost as all data must be scanned.
Note - if you see here lower cost, unrealistically low Rows count and/or INDEX ACCESS,
than this is the cause of the problem of the underestimating of the cost. So don't worry the old cost are too low, not the new one too high!
The next step is the access on both buyer and supplier. To get the answer you must provide
additional information.
How selective is the supplier filter?
I.e. if the predicate buyer_identifier='GC84338293AA' returns say 5M records, how may records return the predicate with both columns?
buyer_identifier='GC84338293AA' and supplier_identifier='GC75987723HT'
Is it 4M or 100 records?
If the complete predicate returns only few records than the local index on supplier is OK.
If it returns large number of rows (say the quarter of the partition) - you should stay on FULL PARTITION SCAN and not use it.
This is similar to my comment on the non partitioned table.
Estimation of the supplier cardinality
In case that the column SUPPLIER contains a skewed data (which may fool the CBO to calulate improper cost) you may define explicitely histogram in this column.
I used this statement statement, that calculates the histogram on full data (100% is important for highly skewed data) and for the table and partition.
exec dbms_stats.gather_table_stats(ownname=>user,tabname=>'TAB_LP',granularity=>'all',estimate_percent => 100,METHOD_OPT => 'for columns SUPPLIER_ID size 254');
This worked for my test data, i.e. for supplier with low cardinality an index access was opened (on local no-prefixed index) and for huge suppliers a full partition scan was used.
You can create a Local partitioned index using this script.
CREATE INDEX PO_HEADER_LOCAL_IDX ON PO_HEADER_LP
(BUYER_IDENTIFIER, SUPPLIER_IDENTIFIER)
LOCAL (
PARTITION GC66287246AA,
PARTITION GC43837235JK,
PARTITION GC84338293AA,
PARTITION DEFAULTBUID
);
Also it is recommended to gather statistics of the newly created partition table using this script:
EXEC DBMS_STATS.GATHER_TABLE_STATS('SCHEMA Name','PO_HEADER_LP');
Now you can generate the execution plan again of the following SQL:
select * from po_header_lp where buyer_identifier= 'GC84338293AA' and supplier_identifier='GC75987723HT';
Hope this will help you.

Oracle CBO when using types [duplicate]

I'm trying to optimize a set of stored procs which are going against many tables including this view. The view is as such:
We have TBL_A (id, hist_date, hist_type, other_columns) with two types of rows: hist_type 'O' vs. hist_type 'N'. The view self joins table A to itself and transposes the N rows against the corresponding O rows. If no N row exists for the O row, the O row values are repeated. Like so:
CREATE OR REPLACE FORCE VIEW V_A (id, hist_date, hist_type, other_columns_o, other_columns_n)
select
o.id, o.hist_date, o.hist_type,
o.other_columns as other_columns_o,
case when n.id is not null then n.other_columns else o.other_columns end as other_columns_n
from
TBL_A o left outer join TBL_A n
on o.id=n.id and o.hist_date=n.hist_date and n.hist_type = 'N'
where o.hist_type = 'O';
TBL_A has a unique index on: (id, hist_date, hist_type). It also has a unique index on: (hist_date, id, hist_type) and this is the primary key.
The following query is at issue (in a stored proc, with x declared as TYPE_TABLE_OF_NUMBER):
select b.id BULK COLLECT into x from TBL_B b where b.parent_id = input_id;
select v.id from v_a v
where v.id in (select column_value from table(x))
and v.hist_date = input_date
and v.status_new = 'CLOSED';
This query ignores the index on id column when accessing TBL_A and instead does a range scan using the date to pick up all the rows for the date. Then it filters that set using the values from the array. However if I simply give the list of ids as a list of numbers the optimizer uses the index just fine:
select v.id from v_a v
where v.id in (123, 234, 345, 456, 567, 678, 789)
and v.hist_date = input_date
and v.status_new = 'CLOSED';
The problem also doesn't exist when going against TBL_A directly (and I have a workaround that does that, but it's not ideal.).Is there a way to get the optimizer to first retrieve the array values and use them as predicates when accessing the table? Or a good way to restructure the view to achieve this?
Oracle does not use the index because it assumes select column_value from table(x) returns 8168 rows.
Indexes are faster for retrieving small amounts of data. At some point it's faster to scan the whole table than repeatedly walk the index tree.
Estimating the cardinality of a regular SQL statement is difficult enough. Creating an accurate estimate for procedural code is almost impossible. But I don't know where they came up with 8168. Table functions are normally used with pipelined functions in data warehouses, a sorta-large number makes sense.
Dynamic sampling can generate a more accurate estimate and likely generate a plan that will use the index.
Here's an example of a bad cardinality estimate:
create or replace type type_table_of_number as table of number;
explain plan for
select * from table(type_table_of_number(1,2,3,4,5,6,7));
select * from table(dbms_xplan.display(format => '-cost -bytes'));
Plan hash value: 1748000095
-------------------------------------------------------------------------
| Id | Operation | Name | Rows | Time |
-------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 8168 | 00:00:01 |
| 1 | COLLECTION ITERATOR CONSTRUCTOR FETCH| | 8168 | 00:00:01 |
-------------------------------------------------------------------------
Here's how to fix it:
explain plan for select /*+ dynamic_sampling(2) */ *
from table(type_table_of_number(1,2,3,4,5,6,7));
select * from table(dbms_xplan.display(format => '-cost -bytes'));
Plan hash value: 1748000095
-------------------------------------------------------------------------
| Id | Operation | Name | Rows | Time |
-------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 7 | 00:00:01 |
| 1 | COLLECTION ITERATOR CONSTRUCTOR FETCH| | 7 | 00:00:01 |
-------------------------------------------------------------------------
Note
-----
- dynamic statistics used: dynamic sampling (level=2)

Function-based Index using Substr and Instr

I have created a query doing this in ORACLE:
SELECT SUBSTR(title,1,INSTR(title,' ',1,1)) AS first_word, COUNT(*) AS word_count
FROM FILM
GROUP BY SUBSTR(title,1,INSTR(title,' ',1,1))
HAVING COUNT(*) >= 20;
Results after running:
539 rows selected. Elapsed: 00:00:00.22
I need to improve the performance of this and created a function-based index as so:
CREATE INDEX INDX_FIRSTWRD ON FILM(SUBSTR(title,1,INSTR(title,' ',1,1)));
After running the same query at the top of this post, I still get the same performance:
539 rows selected. Elapsed: 00:00:00.22
Is the index not being applied or overwritten or am I doing something wrong?
Thanks for any help you could provide. :)
EDIT:
Execution Plan:
----------------------------------------------------------
Plan hash value: 2033354507
----------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 20000 | 2968K| 138 (2)| 00:00:02 |
|* 1 | FILTER | | | | | |
| 2 | HASH GROUP BY | | 20000 | 2968K| 138 (2)| 00:00:02 |
| 3 | TABLE ACCESS FULL| FILM | 20000 | 2968K| 136 (0)| 00:00:02 |
----------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(COUNT(*)>=20)
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
471 consistent gets
0 physical reads
0 redo size
14030 bytes sent via SQL*Net to client
908 bytes received via SQL*Net from client
37 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
539 rows processed
The problem is that the value you're using for the index may be null - if there is no space in the title (i.e. it's a one-word title like "Jaws") then your substr evaluates to null. That probably isn't what you want, incidentally - you probably want the end position to be conditional on whether there is a space at all, but that's beyond the scope of the question. (And even if you correct that logic, Oracle may still not be able to trust that the result can't be null, even if the underlying column is not nullable). Edit: see below for more on using nvl to handle single-word titles.
Since nulls aren't included in indexes, the single-title rows won't be indexed. But you're asking for all rows, and Oracle knows the index doesn't hold all rows, so it can't use the index to fulfil the query - even if you add a hint telling it to, it has to ignore that hint.
The only time the index will be used is if you include a filter that references the indexed value too, and explicitly or implicitly exclude nulls, e.g.:
SELECT SUBSTR(title,1,INSTR(title,' ',1,1)) AS first_word, COUNT(*) AS word_count
FROM FILM
WHERE SUBSTR(title,1,INSTR(title,' ',1,1)) IS NOT NULL
GROUP BY SUBSTR(title,1,INSTR(title,' ',1,1))
HAVING COUNT(*) >= 20;
(which also probably isn't what you actually want).
SQL Fiddle for queries with and without a filter, and with and without an index hint. (Click the 'execution plan' link against each result section to see whether it's doing a full table scan or a full index scan).
And another Fiddle showing that the index can't be used even with the filter if the filter still allows null values, again since they are not in the index.
Since SylvainLeroux brought it up, Oracle isn't quite clever enough to know the computed value can't be null if you coalesce it, even if the underlying column is not-null (as a function-based index or as a virtual column). Possibly because there could be a lot of branches to evaluate. But it is clever enough if you use the simpler and proprietary nvl instead:
CREATE INDEX INDX_FIRSTWRD
ON FILM(NVL(SUBSTR(title,1,INSTR(title,' ',1,1)),title));
SELECT NVL(SUBSTR(title,1,INSTR(title,' ',1,1)),title) AS first_word,
COUNT(*) AS word_count
FROM FILM
GROUP BY NVL(SUBSTR(title,1,INSTR(title,' ',1,1)),title)
HAVING COUNT(*) >= 20;
But only if title is defined as not-null. And coalesce does work if the virtual column is also declared not-null (thanks Sylvain).
SQL Fiddle with a function-based index and another with a virtual column.
539 rows selected. Elapsed: 00:00:00.22
Do you really think you need to tune the query which returns 539 rows in less than a second? 220 milliseconds, precicely! Think about it.
In your case, I think CBO does the best possible thing. And that is the reason it doesn't use the index. Because, to read every row from the table, using the index is an overhead. It needs to read the index and then do a table access by rowid. Probably, in your small table, it could read the entire table with less IO to fetch the data.
If the table is small enough to be in a single block, then, it just requires a one IO to fetch required data from single block with full table scan.
You can try to check the explain plan by hinting the query to use the index and see if anything really improves. Remember, you are trying unnecessarily to improve the performance of a query which executes in less than a second!

is there a tricky way to optimize this query

I'm working on a table that has 3008698 rows
exam_date is a DATE field.
But queries I run want to match only the month part. So what I do is:
select * from my_big_table where to_number(to_char(exam_date, 'MM')) = 5;
which I believe takes long because of function on the column. Is there a way to avoid this and make it faster? other than making changes to the table? exam_date in the table have different date values. like 01-OCT-10 or 12-OCT-10...and so on
I don't know Oracle, but what about doing
WHERE exam_date BETWEEN first_of_month AND last_of_month
where the two dates are constant expressions.
select * from my_big_table where MONTH(exam_date) = 5
oops.. Oracle huh?..
select * from my_big_table where EXTRACT(MONTH from exam_date) = 5
Bear in mind that since you want approximately 1/12th of all the data, it may well be more efficient for Oracle to perform a full table scan anyway. This may explain why performance was worse when you followed harpo's advice.
Why? Suppose your data is such that 20 rows fit on each database block (on average), so that you have a total of 3,000,000/20 = 150,000 blocks. That means a full table scan will require 150,000 block reads. Now about 1/12th of the 3,000,000 rows will be for month 05. 3,000,000/12 is 250,000. So that's 250,000 table reads if you use the index - and that's ignoring the index reads that will also be required. So in this example the full table scan does a lot less work than the indexed search.
Bear in miond that there are only twelve distinct values for MONTH. So unless you have a strongly clustered set of records (say if you use partitioining) it is possible that using an index is not necessarily the most efficient way of querying in this fashion.
I didn't find that using EXTRACT() lead the optimizer to use a regular index on my date column but YMMV:
SQL> create index big_d_idx on big_table(col3) compute statistics
2 /
Index created.
SQL> set autotrace traceonly explain
SQL> select * from big_table
2 where extract(MONTH from col3) = 'MAY'
3 /
Execution Plan
----------------------------------------------------------
Plan hash value: 3993303771
-------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 23403 | 1028K| 4351 (3)| 00:00:53 |
|* 1 | TABLE ACCESS FULL| BIG_TABLE | 23403 | 1028K| 4351 (3)| 00:00:53 |
-------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(EXTRACT(MONTH FROM INTERNAL_FUNCTION("COL3"))=TO_NUMBER('M
AY'))
SQL>
What definitely can persuade the optimizer to use an index in these scenarios is building a function-based index:
SQL> create index big_mon_fbidx on big_table(extract(month from col3))
2 /
Index created.
SQL> select * from big_table
2 where extract(MONTH from col3) = 'MAY'
3 /
Execution Plan
----------------------------------------------------------
Plan hash value: 225326446
-------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)|Time |
-------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 23403 | 1028K| 475 (0)|00:00:06|
| 1 | TABLE ACCESS BY INDEX ROWID| BIG_TABLE | 23403 | 1028K| 475 (0)|00:00:06|
|* 2 | INDEX RANGE SCAN | BIG_MON_FBIDX | 9361 | | 382 (0)|00:00:05|
-------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access(EXTRACT(MONTH FROM INTERNAL_FUNCTION("COL3"))=TO_NUMBER('MAY'))
SQL>
The function call means that Oracle won't be able to use any index that might be defined on the column.
Either remove the function call (as in harpo's answer) or use a function based index.

Resources