I have a very large table in Oracle 11g that has a very simple index on a char field (that is normally Y or N).
If I just execute the query as below it takes around 10s to return:
select QueueId, QueueSiteId, QueueData from queue where QueueProcessed = 'N'
However, if I force it to use the index I created, it takes 80ms:
select /*+ INDEX(avaqueue QUEUEPROCESSED_IDX) */ QueueId, QueueSiteId, QueueData
from queue where QueueProcessed = 'N'
Also, if I run both statements under explain plan for, as below:
explain plan for select QueueId, QueueSiteId, QueueData
from queue where QueueProcessed = 'N'
and
explain plan for select /*+ INDEX(avaqueue QUEUEPROCESSED_IDX) */
QueueId, QueueSiteId, QueueData
from queue where QueueProcessed = 'N'
For the first plan I got:
------------------------------------------------------------------------------
Plan hash value: 803924726
------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 691K| 128M| 12643 (1)| 00:02:32 |
|* 1 | TABLE ACCESS FULL| AVAQUEUE | 691K| 128M| 12643 (1)| 00:02:32 |
------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("QUEUEPROCESSED"='N')
For the second plan I got:
Plan hash value: 2012309891
--------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 691K| 128M| 24386 (1)| 00:04:53 |
| 1 | TABLE ACCESS BY INDEX ROWID| AVAQUEUE | 691K| 128M| 24386 (1)| 00:04:53 |
|* 2 | INDEX RANGE SCAN | QUEUEPROCESSED_IDX | 691K| | 1297 (1)| 00:00:16 |
--------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("QUEUEPROCESSED"='N')
------------------------------------------------------------------------------
This proves that if I don't explicitly tell Oracle to use the index, it does not use it. My question is: why is Oracle not using this index? Oracle is normally smart enough to make decisions 10 times better than me. This is the first time I have actually had to force Oracle to use an index, and I am not very comfortable with it.
Does anyone have a good explanation for Oracle's decision not to use the index in this very explicit case?
The QueueProcessed column is probably missing a histogram so Oracle does not know the data is skewed.
If Oracle does not know the data is skewed it will assume the equality predicate, QueueProcessed = 'N', returns DBA_TABLES.NUM_ROWS / DBA_TAB_COLUMNS.NUM_DISTINCT rows. With only two distinct values, the optimizer thinks the query returns half the rows in the table. Based on the 80ms return time the real number of rows returned is small.
Index range scans generally only work well when they select a small percentage of the rows. Index range scans read from a data structure one block at a time. And if the data is randomly distributed, it may need to read every block of data from the table anyway. For those reasons, if the query accesses a large portion of the table, it is more efficient to use a multi-block full table scan.
The bad cardinality estimate from the skewed data causes Oracle to think a full table scan is better. Creating a histogram will fix the issue.
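As a rough check of that estimate, the arithmetic can be reproduced from the dictionary views. This is only a sketch: it assumes the table is called QUEUE and lives in your own schema (hence the USER_ views instead of the DBA_ views), and it ignores nulls and the other adjustments the optimizer makes.

```sql
-- Approximate the no-histogram estimate for an equality predicate:
-- NUM_ROWS / NUM_DISTINCT. Table and column names are assumptions
-- taken from the question.
select t.num_rows,
       c.num_distinct,
       round(t.num_rows / nullif(c.num_distinct, 0)) as estimated_rows,
       c.histogram  -- NONE means no histogram exists on the column
from user_tables t
join user_tab_col_statistics c
  on c.table_name = t.table_name
where t.table_name = 'QUEUE'
  and c.column_name = 'QUEUEPROCESSED';
```

If estimated_rows is close to the Rows value in the plan and HISTOGRAM shows NONE, that would be consistent with the explanation above.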
Sample schema
Create a table, fill it with skewed data, and gather statistics the first time.
drop table queue;
create table queue(
queueid number,
queuesiteid number,
queuedata varchar2(4000),
queueprocessed varchar2(1)
);
create index QUEUEPROCESSED_IDX on queue(queueprocessed);
--Skewed data - only 100 of the 100000 rows are set to N.
insert into queue
select level, level, level, decode(mod(level, 1000), 0, 'N', 'Y')
from dual connect by level <= 100000;
begin
dbms_stats.gather_table_stats(user, 'QUEUE');
end;
/
The first execution will have the problem.
In this case the default statistics settings do not gather histograms the first time. The plan shows a full table scan and estimates Rows=50000, exactly half.
explain plan for
select QueueId, QueueSiteId, QueueData
from queue where QueueProcessed = 'N';
select * from table(dbms_xplan.display);
Plan hash value: 1157425618
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 50000 | 878K| 103 (1)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| QUEUE | 50000 | 878K| 103 (1)| 00:00:01 |
---------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("QUEUEPROCESSED"='N')
Create a histogram
The default statistics settings are usually sufficient. Histograms may not be collected for several reasons. They may be manually disabled - check for tasks, jobs, or preferences set by the DBA.
Also, histograms are only automatically collected on columns that are both skewed and used. Gathering histograms can take time, so there's no need to create a histogram on a column that is never used in a relevant predicate. Oracle tracks when a column is used and could benefit from a histogram, although that data is lost if the table is dropped.
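To see what Oracle currently knows, something like the following may help. It is a sketch, not a definitive diagnostic: it assumes the table is QUEUE in your own schema, and dbms_stats.report_col_usage is only available in later 11.2 releases and above, so it may not exist on an older database.

```sql
-- SQL*Plus setting so the CLOB report is not truncated.
set long 100000

-- The METHOD_OPT preference in effect for this table.
-- The default is 'FOR ALL COLUMNS SIZE AUTO'.
select dbms_stats.get_prefs('METHOD_OPT', user, 'QUEUE') as method_opt
from dual;

-- The column usage Oracle has recorded for the table
-- (for example, which columns appeared in equality predicates).
select dbms_stats.report_col_usage(user, 'QUEUE') as col_usage
from dual;
```

Recent usage may take a while to show up in the report, because the tracking data is written out periodically rather than immediately.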
Running a sample query and re-gathering statistics will make the histogram appear:
select QueueId, QueueSiteId, QueueData
from queue where QueueProcessed = 'N';
begin
dbms_stats.gather_table_stats(user, 'QUEUE');
end;
/
Now Rows=100 and the index is used.
explain plan for
select QueueId, QueueSiteId, QueueData
from queue where QueueProcessed = 'N';
select * from table(dbms_xplan.display);
Plan hash value: 2630796144
----------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 100 | 1800 | 2 (0)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID BATCHED| QUEUE | 100 | 1800 | 2 (0)| 00:00:01 |
|* 2 | INDEX RANGE SCAN | QUEUEPROCESSED_IDX | 100 | | 1 (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("QUEUEPROCESSED"='N')
Here's the histogram:
select column_name, histogram
from dba_tab_columns
where table_name = 'QUEUE'
order by column_name;
COLUMN_NAME HISTOGRAM
----------- ---------
QUEUEDATA NONE
QUEUEID NONE
QUEUEPROCESSED FREQUENCY
QUEUESITEID NONE
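The buckets themselves can be inspected as well. This is a sketch against the sample QUEUE table in your own schema. For a frequency histogram ENDPOINT_NUMBER is a running total, so the difference between consecutive rows approximates the number of rows per value (in the sample, if statistics were gathered on a sample of the data).

```sql
-- One row per bucket of the histogram on QUEUEPROCESSED.
-- ENDPOINT_ACTUAL_VALUE may be empty depending on the Oracle version.
select endpoint_number,
       endpoint_number
         - lag(endpoint_number, 1, 0) over (order by endpoint_number)
         as rows_in_bucket,
       endpoint_value,
       endpoint_actual_value
from user_tab_histograms
where table_name = 'QUEUE'
  and column_name = 'QUEUEPROCESSED'
order by endpoint_number;
```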
Create the histogram
Try to determine why the histogram was missing. Check that statistics are gathered with the defaults, that there are no weird column or table preferences, and that the table is not constantly dropped and re-loaded.
If you cannot rely on the default statistics job for your process you can manually gather histograms with the method_opt parameter like this:
begin
dbms_stats.gather_table_stats(user, 'QUEUE', method_opt=>'for columns size 254 queueprocessed');
end;
/
The answer - at least the first one that will just lead to more questions - is right there in the plans. The first plan has an estimated cost and estimated execution time about half that of the second plan. In the absence of the hint, Oracle is choosing the plan that it thinks will run faster.
So of course the next question is why is its estimate so far off in this case. Not only are the estimated times wrong relative to each other, both are much greater than what you actually experience when running the query.
The first thing I would look at is the estimated number of rows returned. The optimizer is guessing, in both cases, that there are about 691,000 rows in the table matching your predicate. Is this close to the truth, or very far off? If it's far off, then refreshing statistics may be the right solution. Although if the column only has two possible values, I'd be kind of surprised if the existing stats are so off base.
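A quick way to answer that is to count the real rows per value and compare them with the Rows column of the plan. This assumes the table and column names from the question.

```sql
-- Actual distribution of the flag column. Compare the count for 'N'
-- with the 691K estimate shown in both plans.
select QueueProcessed, count(*) as actual_rows
from queue
group by QueueProcessed
order by QueueProcessed;
```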
Related
I am trying to come up with an example showing that indexes can have a dramatic (orders of magnitude) effect on query execution time. After hours of trial and error I am still at square one. Namely, the speed-up is not large even when the execution plan shows using the index.
Since I realized that I better have a large table for the index to make a difference, I wrote the following script (using Oracle 11g Express):
CREATE TABLE many_students (
student_id NUMBER(11),
city VARCHAR(20)
);
DECLARE
nStudents NUMBER := 1000000;
nCities NUMBER := 10000;
curCity VARCHAR(20);
BEGIN
FOR i IN 1 .. nStudents LOOP
curCity := ROUND(DBMS_RANDOM.VALUE()*nCities, 0) || ' City';
INSERT INTO many_students
VALUES (i, curCity);
END LOOP;
COMMIT;
END;
I then tried quite a few queries, such as:
select count(*)
from many_students M
where M.city = '5467 City';
and
select count(*)
from many_students M1
join many_students M2 using(city);
and a few other ones.
I have seen this post and think that my queries satisfy the requirements stated in the replies there. However, none of the queries I tried showed dramatic improvement after building an index: create index myindex on many_students(city);
Am I missing some characteristic that distinguishes a query for which an index makes a dramatic difference? What is it?
The test case is a good start but it needs a few more things to get a noticeable performance difference:
Realistic data sizes. One million rows of two small values is a small table. With a table that small the performance difference between a good and a bad execution plan may not matter much.
The script below doubles the table size until it reaches 64 million rows. It takes about 20 minutes on my machine. (To make it go quicker for larger sizes, you could make the table nologging and add an /*+ append */ hint to the insert.)
--Increase the table to 64 million rows. This took 20 minutes on my machine.
insert into many_students select * from many_students;
insert into many_students select * from many_students;
insert into many_students select * from many_students;
insert into many_students select * from many_students;
insert into many_students select * from many_students;
insert into many_students select * from many_students;
commit;
--The table has about 1.375GB of data. The actual size will vary.
select bytes/1024/1024/1024 gb from dba_segments where segment_name = 'MANY_STUDENTS';
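For reference, the nologging and append variant mentioned above might look roughly like this. Treat it as a sketch: NOLOGGING has no effect if the database runs in FORCE LOGGING mode, and it affects recoverability, so it is really only for throwaway test tables.

```sql
-- Reduce redo generation for direct-path loads into the test table.
alter table many_students nologging;

-- Direct-path insert. The session must commit before it can read or
-- modify the table again (otherwise ORA-12838), so commit after
-- every doubling step.
insert /*+ append */ into many_students select * from many_students;
commit;

insert /*+ append */ into many_students select * from many_students;
commit;
```

Repeat the insert and commit pair until the table reaches the size you want.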
Gather statistics. Always gather statistics after large table changes. The optimizer cannot do its job well unless it has table, column, and index statistics.
begin
dbms_stats.gather_table_stats(user, 'MANY_STUDENTS');
end;
/
Use hints to force a good and bad plan. Optimizer hints should usually be avoided. But to quickly compare different plans they can be helpful to fix a bad plan.
For example, this will force a full table scan:
select /*+ full(M) */ count(*) from many_students M where M.city = '5467 City';
But you'll also want to verify the execution plan:
explain plan for select /*+ full(M) */ count(*) from many_students M where M.city = '5467 City';
select * from table(dbms_xplan.display);
Flush the cache. Caching is probably the main culprit behind the index and full table scan queries taking the same amount of time. If the table fits entirely in memory then the time to read all the rows may be almost too small to measure. The number could be dwarfed by the time to parse the query or to send a simple result across the network.
This command will force Oracle to remove almost everything from the buffer cache. This will help you test a "cold" system. (You probably do not want to run this statement on a production system.)
alter system flush buffer_cache;
However, that won't flush the operating system or SAN cache. And maybe the table really would fit in memory on production. If you need to test a fast query it may be necessary to put it in a PL/SQL loop.
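A PL/SQL timing loop could look something like this. It is a minimal sketch: the iteration count of 1000 and the variable names are arbitrary choices of mine, and dbms_utility.get_time returns hundredths of a second, so the resolution is coarse.

```sql
set serveroutput on

declare
  v_start number;
  v_count number;
begin
  -- Time in hundredths of a second since an arbitrary starting point.
  v_start := dbms_utility.get_time;

  -- Run the fast query many times so the total is large enough to measure.
  for i in 1 .. 1000 loop
    select count(*) into v_count
    from many_students M
    where M.city = '5467 City';
  end loop;

  dbms_output.put_line(
    'Elapsed seconds: ' || (dbms_utility.get_time - v_start) / 100);
end;
/
```

After the first iteration everything is cached, so this measures the "warm" cost of the query rather than the cold one.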
Multiple, alternating runs. There are many things happening in the background, like caching and other processes. It's easy to get bad results because something unrelated changed on the system.
Maybe the first run takes extra long to put things in a cache. Or maybe some huge job was started between queries. To avoid those issues, alternate running the two queries. Run them five times, throw out the highs and lows, and compare the averages.
For example, copy and paste the statements below five times and run them. (If using SQL*Plus, run set timing on first.) I already did that and posted the times I got in a comment before each line.
--Seconds: 0.02, 0.02, 0.03, 0.234, 0.02
alter system flush buffer_cache;
select count(*) from many_students M where M.city = '5467 City';
--Seconds: 4.07, 4.21, 4.35, 3.629, 3.54
alter system flush buffer_cache;
select /*+ full(M) */ count(*) from many_students M where M.city = '5467 City';
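The "throw out the highs and lows" step can be done by hand, but here is one way to sketch it in SQL using the times posted above. The labels and column names are made up for illustration.

```sql
-- Trimmed average: drop the fastest and slowest run of each query,
-- then average what is left.
with timings as (
  select 'no hint' as query_label, column_value as seconds
  from table(sys.odcinumberlist(0.02, 0.02, 0.03, 0.234, 0.02))
  union all
  select 'full hint', column_value
  from table(sys.odcinumberlist(4.07, 4.21, 4.35, 3.629, 3.54))
),
ranked as (
  select query_label,
         seconds,
         row_number() over (partition by query_label order by seconds) as rn,
         count(*) over (partition by query_label) as cnt
  from timings
)
select query_label, round(avg(seconds), 3) as trimmed_avg_seconds
from ranked
where rn > 1 and rn < cnt
group by query_label;
```

With these numbers the unhinted query averages about 0.02 seconds and the full table scan about 4 seconds, and the 0.234 outlier no longer distorts the comparison.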
Testing is hard. Putting together decent performance tests is difficult. The above rules are only a start.
This might seem like overkill at first. But it's a complex topic. And I've seen so many people, including myself, waste a lot of time "tuning" something based on a bad test. Better to spend the extra time now and get the right answer.
An index really shines when the database doesn't need to go to every row in a table to get your results. So COUNT(*) isn't the best example. Take this for example:
alter session set statistics_level = 'ALL';
create table mytable as select * from all_objects;
select * from mytable where owner = 'SYS' and object_name = 'DUAL';
---------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers |
---------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 300 |00:00:00.01 | 12 |
| 1 | TABLE ACCESS FULL| MYTABLE | 1 | 19721 | 300 |00:00:00.01 | 12 |
---------------------------------------------------------------------------------------
So, here, the database does a full table scan (TABLE ACCESS FULL), which means it has to visit every row in the table, which means it has to load every block of the table from disk. Lots of I/O. The optimizer guessed that it was going to find 15000 rows, but I know there's only one.
Compare that with this:
create index myindex on mytable( owner, object_name );
select * from mytable where owner = 'SYS' and object_name = 'JOB$';
select * from table( dbms_xplan.display_cursor( null, null, 'ALLSTATS LAST' ));
----------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | Reads |
----------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |00:00:00.01 | 3 | 2 |
| 1 | TABLE ACCESS BY INDEX ROWID| MYTABLE | 1 | 2 | 1 |00:00:00.01 | 3 | 2 |
|* 2 | INDEX RANGE SCAN | MYINDEX | 1 | 1 | 1 |00:00:00.01 | 2 | 2 |
----------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("OWNER"='SYS' AND "OBJECT_NAME"='JOB$')
Here, because there's an index, it does an INDEX RANGE SCAN to find the rowids for the table that match our criteria. Then, it goes to the table itself (TABLE ACCESS BY INDEX ROWID) and looks up only the rows we need and can do so efficiently because it has a rowid.
And even better, if you happen to be looking for something that is entirely in the index, the scan doesn't even have to go back to the base table. The index is enough:
select count(*) from mytable where owner = 'SYS';
select * from table( dbms_xplan.display_cursor( null, null, 'ALLSTATS LAST' ));
------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | Reads |
------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |00:00:00.01 | 46 | 46 |
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:00.01 | 46 | 46 |
|* 2 | INDEX RANGE SCAN| MYINDEX | 1 | 8666 | 9294 |00:00:00.01 | 46 | 46 |
------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("OWNER"='SYS')
Because my query involved the owner column and that's contained in the index, it never needs to go back to the base table to look anything up there. So the index scan is enough, then it does an aggregation to count the rows. This scenario is a little less than perfect, because the index is on (owner, object_name) and not just owner, but it's definitely better than doing a full table scan on the main table.
I have a table PO_HEADER with ~20 million records. Considering our future load on the table we have decided to partition the table to improve the performance of the SQL queries. Below are the statements used to create the new partitioned table.
CREATE TABLE PO_HEADER_LP
PARTITION BY LIST (BUYER_IDENTIFIER)
(PARTITION GC66287246AA VALUES ('GC66287246AA') TABLESPACE MITRIX_TABLES,
PARTITION GC43837235JK VALUES ('GC43837235JK') TABLESPACE MITRIX_TABLES,
PARTITION GC84338293AA VALUES ('GC84338293AA') TABLESPACE MITRIX_TABLES,
PARTITION DEFAULTBUID VALUES (DEFAULT) TABLESPACE MITRIX_TABLES)
AS SELECT *
FROM PO_HEADER;
create index PO_HEADER_LP_SI_IDX on PO_HEADER_LP("SUPPLIER_IDENTIFIER") TABLESPACE MITRIX_INDEXES LOCAL;
Old Table PO_HEADER has two indexes on "BUYER_IDENTIFIER" and "SUPPLIER_IDENTIFIER" columns as follows:
create index PO_HEADER_BI_IDX on PO_HEADER("BUYER_IDENTIFIER") TABLESPACE MITRIX_INDEXES;
create index PO_HEADER_SI_IDX on PO_HEADER("SUPPLIER_IDENTIFIER") TABLESPACE MITRIX_INDEXES;
To test the performance, I executed the query below on both tables. To my surprise, the cost of the 2nd query is almost double that of the 1st. Does anybody know why the query cost is higher for the partitioned table than for the normal table? Thanks in advance.
select * from po_header where buyer_identifier='GC84338293AA' and supplier_identifier='GC75987723HT'; --cost: 56,941
select * from po_header_lp where buyer_identifier= 'GC84338293AA' and supplier_identifier='GC75987723HT'; --cost: 93,309
PO_HEADER with Global Index on buyer_identifier & supplier_identifier column
PO_HEADER_LP with Global Index on supplier_identifier column
PO_HEADER_LP with Local Index on supplier_identifier column
From your DDL I assume you have three big buyers (say 5M records each) and a bunch of smaller ones. In other words, this would be the correct setup for your list partitioning scheme.
You can verify whether it works by testing access on buyer only:
EXPLAIN PLAN SET STATEMENT_ID = 'jara1' into plan_table FOR
select * from tab_lp where BUYER_ID = 1;
SELECT * FROM table(DBMS_XPLAN.DISPLAY('plan_table', 'jara1','ALL'));
------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop |
------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 6662K| 82M| 4445 (2)| 00:00:01 | | |
| 1 | PARTITION LIST SINGLE| | 6662K| 82M| 4445 (2)| 00:00:01 | KEY | KEY |
| 2 | TABLE ACCESS FULL | TAB_LP | 6662K| 82M| 4445 (2)| 00:00:01 | 2 | 2 |
------------------------------------------------------------------------------------------------
The same query on the non-partitioned table should produce a much higher cost. Why?
In the partitioned table the selected buyer (in your case GC84338293AA; I'm using surrogate keys) has its own partition,
so a full scan of this partition is the best access.
select * from tab where BUYER_ID = 1;
--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 6596K| 81M| 14025 (1)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| TAB | 6596K| 81M| 14025 (1)| 00:00:01 |
--------------------------------------------------------------------------
1 - filter("BUYER_ID"=1)
For the non-partitioned table (to get approximately one fourth of the data) the FULL TABLE SCAN is OK as well,
but of course it has a higher cost as all data must be scanned.
Note - if you see a lower cost here, an unrealistically low Rows count and/or INDEX ACCESS,
then the cause of the problem is that the cost was underestimated. So don't worry: the old cost is too low, not the new one too high!
The next step is the access on both buyer and supplier. To get the answer you must provide
additional information.
How selective is the supplier filter?
I.e. if the predicate buyer_identifier='GC84338293AA' returns say 5M records, how many records does the predicate on both columns return?
buyer_identifier='GC84338293AA' and supplier_identifier='GC75987723HT'
Is it 4M or 100 records?
If the complete predicate returns only a few records, then the local index on supplier is OK.
If it returns a large number of rows (say a quarter of the partition), you should stay with the FULL PARTITION SCAN and not use the index.
This is similar to my comment on the non partitioned table.
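Both numbers can be obtained in one pass over the partition. This is a sketch using the table, column names, and literal values from the question.

```sql
-- How many rows does the buyer predicate return on its own, and how
-- many remain once the supplier predicate is added?
select count(*) as buyer_rows,
       count(case when supplier_identifier = 'GC75987723HT' then 1 end)
         as buyer_and_supplier_rows
from po_header_lp
where buyer_identifier = 'GC84338293AA';
```

If the second number is a tiny fraction of the first, the local index on the supplier column should pay off. If the two are of similar size, the full partition scan is the better access path.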
Estimation of the supplier cardinality
In case the column SUPPLIER contains skewed data (which may fool the CBO into calculating an improper cost) you may explicitly define a histogram on this column.
I used this statement, which calculates the histogram on the full data (100% is important for highly skewed data) for both the table and the partitions.
exec dbms_stats.gather_table_stats(ownname=>user,tabname=>'TAB_LP',granularity=>'all',estimate_percent => 100,METHOD_OPT => 'for columns SUPPLIER_ID size 254');
This worked for my test data, i.e. for a supplier with low cardinality an index access was chosen (on the local non-prefixed index) and for huge suppliers a full partition scan was used.
You can create a Local partitioned index using this script.
CREATE INDEX PO_HEADER_LOCAL_IDX ON PO_HEADER_LP
(BUYER_IDENTIFIER, SUPPLIER_IDENTIFIER)
LOCAL (
PARTITION GC66287246AA,
PARTITION GC43837235JK,
PARTITION GC84338293AA,
PARTITION DEFAULTBUID
);
It is also recommended to gather statistics on the newly created partitioned table using this script:
EXEC DBMS_STATS.GATHER_TABLE_STATS('SCHEMA Name','PO_HEADER_LP');
Now you can generate the execution plan again of the following SQL:
select * from po_header_lp where buyer_identifier= 'GC84338293AA' and supplier_identifier='GC75987723HT';
Hope this will help you.
My schema (simplified):
CREATE TABLE LOC
(
LOC_ID NUMBER(15,0) NOT NULL,
LOC_REF_NO VARCHAR2(100 CHAR) NOT NULL
)
/
CREATE INDEX LOC_REF_NO_IDX ON LOC
(
NLSSORT("LOC_REF_NO",'nls_sort=''BINARY_AI''') ASC
)
/
My query (in SQL*Plus):
ALTER SESSION SET NLS_COMP=LINGUISTIC NLS_SORT=BINARY_AI
/
VAR LOC_REF_NO VARCHAR2(50)
BEGIN
:LOC_REF_NO := 'SPDJ1501270';
END;
/
-- Causes full table scan (i.e, does not use LOC_REF_NO_IDX)
SELECT * FROM LOC WHERE LOC_REF_NO LIKE :LOC_REF_NO||'%';
-- Causes index scan (i.e. uses LOC_REF_NO_IDX)
SELECT * FROM LOC WHERE LOC_REF_NO LIKE 'SPDJ1501270%';
That the index is not used has been confirmed by doing an AUTOTRACE (EXPLAIN PLAN), and the SQL just runs slower. I tried a number of things without success. Has anyone got any idea what is going on? I am using Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit.
Update 1:
Note that the index is used when I use an equals with a parameter:
SELECT * FROM LOC WHERE LOC_REF_NO = :LOC_REF_NO;
Explain Plan:
----------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 93 | 5 (0)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID| LOC | 1 | 93 | 5 (0)| 00:00:01 |
|* 2 | INDEX RANGE SCAN | LOC_REF_NO_IDX | 1 | | 3 (0)| 00:00:01 |
----------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access(NLSSORT("LOC_REF_NO",'nls_sort=''BINARY_AI''')=NLSSORT(:LOC_REF_NO,'nls_
sort=''BINARY_AI'''))
Whereas
SELECT * FROM LOC WHERE LOC_REF_NO LIKE :LOC_REF_NO||'%';
Explain Plan:
--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 50068 | 3471K| 5724 (1)| 00:01:09 |
|* 1 | TABLE ACCESS FULL| LOC | 50068 | 3471K| 5724 (1)| 00:01:09 |
--------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("LOC_REF_NO" LIKE :LOC_REF_NO||'%')
Dumbfounded!
Update 2:
The reason we are using NLSSORT on an index is to make Oracle queries case insensitive, and this was the general recommendation. Previously we used function-based indexes with NLS_UPPER. The strange thing is that with those the index is always used, parameter or not, as shown below.
So if the table is as above, with the LOC_REF_NO_IDX index removed and this one added:
CREATE INDEX LOC_REF_NO_CI_IDX ON LOC
(
NLS_UPPER(LOC_REF_NO) ASC
)
/
Then all of the following use the index:
ALTER SESSION SET NLS_COMP=BINARY NLS_SORT=BINARY;
SELECT * FROM LOC WHERE NLS_UPPER(LOC_REF_NO) LIKE :LOC_REF_NO||'%';
-------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 50068 | 5329K| 5700 (1)| 00:01:09 |
| 1 | TABLE ACCESS BY INDEX ROWID| LOC | 50068 | 5329K| 5700 (1)| 00:01:09 |
|* 2 | INDEX RANGE SCAN | LOC_REF_NO_CI_IDX | 9012 | | 43 (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access(NLS_UPPER("LOC_REF_NO") LIKE :LOC_REF_NO||'%')
filter(NLS_UPPER("LOC_REF_NO") LIKE :LOC_REF_NO||'%')
So for some reason when using LIKE with a parameter on a linguistic index, the Oracle optimizer is deciding not to use the index.
According to Oracle support note 1451804.1 this is a known limitation of using LIKE with NLSSORT-based indexes.
If you look at the execution plan for your fixed-value query you see something like:
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access(NLSSORT("LOC_REF_NO",'nls_sort=''BINARY_AI''')>=HEXTORAW('7370646A313530
3132373000') AND NLSSORT("LOC_REF_NO",'nls_sort=''BINARY_AI''')<HEXTORAW('7370646A313
5303132373100') )
Those raw values evaluate to spdj1501270 and spdj1501271; those are derived from your constant string, and any values matching your like condition will be in that range. That parse-time transformation has to be based on a constant value, and doesn't work with a bind variable or an expression, presumably because it's evaluated too late.
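You can decode those boundaries yourself. The sketch below drops the trailing 00 byte from each RAW value before casting, and assumes a database character set in which these bytes are plain ASCII.

```sql
-- Turn the RAW range boundaries from the plan back into text.
select utl_raw.cast_to_varchar2(hextoraw('7370646A31353031323730'))
         as lower_bound,
       utl_raw.cast_to_varchar2(hextoraw('7370646A31353031323731'))
         as upper_bound
from dual;

-- The sort key Oracle builds for the literal itself. I would expect it
-- to match the lower boundary in the plan, trailing 00 included.
select nlssort('SPDJ1501270', 'nls_sort=BINARY_AI') as sort_key
from dual;
```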
See the note for more information, but there doesn't seem to be a workaround unfortunately. You might have to go back to your NLS_UPPER approach.
Previous explanation applies generally but not in this specific case, but kept for reference...
In general, with the fixed value the optimiser can estimate how selective your query is when it parses it, because it can know roughly what proportion of index values match that value. It may or may not use the index, depending on the actual value you use.
With the bind variable it comes up with a plan via bind variable peeking:
In bind variable peeking (also known as bind peeking), the optimizer looks at the value in a bind variable when the database performs a hard parse of a statement.
When a query uses literals, the optimizer can use the literal values to find the best plan. However, when a query uses bind variables, the optimizer must select the best plan without the presence of literals in the SQL text. This task can be extremely difficult. By peeking at bind values the optimizer can determine the selectivity of a WHERE clause condition as if literals had been used, thereby improving the plan.
It uses the statistics it has gathered to decide if any particular value is more likely than others. That probably isn't going to be the case here, especially with the like. It's falling back to a full table scan because it can't determine when it does the hard parse that the index will be more selective most of the time. Imagine, for example, that the parse decided to use the index, but then you supplied a bind value of just S, or even null - using the index would then do much more work than a full table scan.
Also worth noting:
When choosing a plan, the optimizer only peeks at the bind value during the hard parse. This plan may not be optimal for all possible values.
Adaptive cursor sharing can mitigate this, but this query may not qualify:
The criteria used by the optimizer to decide whether a cursor is bind-sensitive include the following:
The optimizer has peeked at the bind values to generate selectivity estimates.
A histogram exists on the column containing the bind value.
When I mocked this up with a small-ish amount of limited data, v$sql reported both is_bind_sensitive and is_bind_aware as 'N'.
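For anyone who wants to repeat that check, a query along these lines should show the flags. It is a sketch: it needs select access to v$sql, and the LIKE pattern assumes the statement was submitted with exactly the text and case used in the question.

```sql
-- One row per child cursor of the bind-variable query.
select sql_id,
       child_number,
       executions,
       is_bind_sensitive,
       is_bind_aware
from v$sql
where sql_text like 'SELECT * FROM LOC WHERE LOC_REF_NO LIKE :LOC_REF_NO%';
```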
I've a table Film:
CREATE TABLE film (
film_id NUMBER(5) NOT NULL,
title varchar2(255));
And I wanted to make the query, which counts how many titles start with the same word and only displays ones with more than 20, faster using a function based index. The query:
SELECT FW_SEPARATOR.FIRST_WORD AS "First Word", COUNT(FW_SEPARATOR.FIRST_WORD) AS "Count"
FROM (SELECT regexp_replace(FILM.TITLE, '(\w+).*$','\1') AS FIRST_WORD FROM FILM) FW_SEPARATOR
GROUP BY FW_SEPARATOR.FIRST_WORD
HAVING COUNT(FW_SEPARATOR.FIRST_WORD) >= 20;
The thing is, I created this function based index:
CREATE INDEX FIRST_WORD_INDEX ON FILM(regexp_replace(TITLE, '(\w+).*$','\1'));
But it didn't speed anything up...
I was wondering if anyone could help me with this :)
Add a redundant predicate to the query to convince Oracle that the expression will not return null values and an index can be used:
select regexp_replace(film.title, '(\w+).*$','\1') first_word
from film
where regexp_replace(film.title, '(\w+).*$','\1') is not null;
Oracle can use an index like a skinny version of a table. Many queries only contain a small subset of the columns in a table. If all the columns in that set are part of the same index, Oracle can use that index instead of the table. This will be either an INDEX FAST FULL SCAN or an INDEX FULL SCAN. The data may be read similar to the way a regular table scan works. But since the index is much smaller than the table, that access method can be much faster.
But function-based indexes do not store NULLs. Oracle cannot use an index scan if it thinks there is a NULL that is not stored in the index. In this case, if the base column was defined as NOT NULL, the regular expression would always return a non-null value. But unsurprisingly, Oracle has not built code to determine whether or not a regular expression could return NULL. That sounds like an impossible task, similar to the halting problem.
There are several ways to convince Oracle that the expression is not null. The simplest may be to repeat the predicate and add an IS NOT NULL condition.
Sample Schema
create table film (
film_id number(5) not null,
title varchar2(255) not null);
insert into film select rownumber, column_value
from
(
select rownum rownumber, column_value from table(sys.odcivarchar2list(
q'<The Shawshank Redemption>',
q'<The Godfather>',
q'<The Godfather: Part II>',
q'<The Dark Knight>',
q'<Pulp Fiction>',
q'<The Good, the Bad and the Ugly>',
q'<Schindler's List>',
q'<12 Angry Men>',
q'<The Lord of the Rings: The Return of the King>',
q'<Fight Club>'))
);
create index film_idx1 on film(regexp_replace(title, '(\w+).*$','\1'));
begin
dbms_stats.gather_table_stats(user, 'FILM');
end;
/
Query that does not use index
Even with an index hint, the normal query will not use an index. Remember that hints are directives, and this query would use the index if it was possible.
explain plan for
select /*+ index_ffs(film) */ regexp_replace(title, '(\w+).*$','\1') first_word
from film;
select * from table(dbms_xplan.display);
Plan hash value: 1232367652
--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 10 | 50 | 3 (0)| 00:00:01 |
| 1 | TABLE ACCESS FULL| FILM | 10 | 50 | 3 (0)| 00:00:01 |
--------------------------------------------------------------------------
Query that uses index
Now add the extra condition and the query will use the index. I'm not sure why it uses an INDEX FULL SCAN instead of an INDEX FAST FULL SCAN. With such small sample data it doesn't matter. The important point is that an index is used.
explain plan for
select regexp_replace(film.title, '(\w+).*$','\1') first_word
from film
where regexp_replace(film.title, '(\w+).*$','\1') is not null;
select * from table(dbms_xplan.display);
Plan hash value: 1151375616
------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 10 | 50 | 1 (0)| 00:00:01 |
|* 1 | INDEX FULL SCAN | FILM_IDX1 | 10 | 50 | 1 (0)| 00:00:01 |
------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter( REGEXP_REPLACE ("TITLE",'(\w+).*$','\1') IS NOT NULL)
I'm working on a table that has 3008698 rows
exam_date is a DATE field.
But queries I run want to match only the month part. So what I do is:
select * from my_big_table where to_number(to_char(exam_date, 'MM')) = 5;
which I believe takes a long time because of the function applied to the column. Is there a way to avoid this and make it faster, other than making changes to the table? The exam_date column holds many different date values, like 01-OCT-10 or 12-OCT-10, and so on.
I don't know Oracle, but what about doing
WHERE exam_date BETWEEN first_of_month AND last_of_month
where the two dates are constant expressions.
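As a minimal sketch of how those two constants could be produced, the Python below builds a range predicate for one month of one year. The helper names and the year 2010 are assumptions for illustration. It uses a half-open range (>= first day, < first day of the next month) rather than BETWEEN, so rows with a time component on the last day are not missed. Note that a single range covers a single year only; matching May across all years would need one range per year.

```python
from datetime import date


def month_bounds(year, month):
    # First day of the requested month.
    start = date(year, month, 1)
    # First day of the following month; December rolls over to next January.
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return start, end


def range_predicate(column, year, month):
    # Builds a half-open range using ANSI date literals (DATE 'YYYY-MM-DD').
    start, end = month_bounds(year, month)
    return (
        f"{column} >= DATE '{start.isoformat()}' "
        f"AND {column} < DATE '{end.isoformat()}'"
    )


print(range_predicate("exam_date", 2010, 5))
```

Because the column itself is left untouched, a plain index on exam_date remains usable for this predicate.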
select * from my_big_table where MONTH(exam_date) = 5
oops.. Oracle huh?..
select * from my_big_table where EXTRACT(MONTH from exam_date) = 5
Bear in mind that since you want approximately 1/12th of all the data, it may well be more efficient for Oracle to perform a full table scan anyway. This may explain why performance was worse when you followed harpo's advice.
Why? Suppose your data is such that 20 rows fit on each database block (on average), so that you have a total of 3,000,000/20 = 150,000 blocks. That means a full table scan will require 150,000 block reads. Now about 1/12th of the 3,000,000 rows will be for month 05. 3,000,000/12 is 250,000. So that's 250,000 table reads if you use the index - and that's ignoring the index reads that will also be required. So in this example the full table scan does a lot less work than the indexed search.
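The arithmetic above can be reproduced with a short Python sketch. The function names are hypothetical, and the model is deliberately crude: it assumes the worst case of one table block read per matching row and ignores index block reads, caching and multiblock reads.

```python
def full_scan_block_reads(total_rows, rows_per_block):
    # Every block of the table is read once (rounded up).
    return -(-total_rows // rows_per_block)


def indexed_table_reads(total_rows, distinct_values):
    # Worst case: each matching row costs one table block read,
    # assuming rows are spread evenly over the distinct values.
    return total_rows // distinct_values


total_rows = 3_000_000
full_scan = full_scan_block_reads(total_rows, rows_per_block=20)
via_index = indexed_table_reads(total_rows, distinct_values=12)

print("full table scan reads:", full_scan)
print("table reads via index:", via_index)
```

With these inputs the full scan needs 150,000 block reads against 250,000 table reads through the index, which matches the figures in the paragraph above.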
Bear in mind that there are only twelve distinct values for MONTH. So unless you have a strongly clustered set of records (say, if you use partitioning), using an index is not necessarily the most efficient way of querying in this fashion.
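I didn't find that using EXTRACT() led the optimizer to use a regular index on my date column, but YMMV. (The plans here come from autotrace traceonly explain, so the statement is never actually executed. EXTRACT(MONTH ...) returns a number, so a real query should compare against 5 rather than 'MAY', which would fail in TO_NUMBER.)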
SQL> create index big_d_idx on big_table(col3) compute statistics
2 /
Index created.
SQL> set autotrace traceonly explain
SQL> select * from big_table
2 where extract(MONTH from col3) = 'MAY'
3 /
Execution Plan
----------------------------------------------------------
Plan hash value: 3993303771
-------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 23403 | 1028K| 4351 (3)| 00:00:53 |
|* 1 | TABLE ACCESS FULL| BIG_TABLE | 23403 | 1028K| 4351 (3)| 00:00:53 |
-------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(EXTRACT(MONTH FROM INTERNAL_FUNCTION("COL3"))=TO_NUMBER('MAY'))
SQL>
What definitely can persuade the optimizer to use an index in these scenarios is building a function-based index:
SQL> create index big_mon_fbidx on big_table(extract(month from col3))
2 /
Index created.
SQL> select * from big_table
2 where extract(MONTH from col3) = 'MAY'
3 /
Execution Plan
----------------------------------------------------------
Plan hash value: 225326446
-------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)|Time |
-------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 23403 | 1028K| 475 (0)|00:00:06|
| 1 | TABLE ACCESS BY INDEX ROWID| BIG_TABLE | 23403 | 1028K| 475 (0)|00:00:06|
|* 2 | INDEX RANGE SCAN | BIG_MON_FBIDX | 9361 | | 382 (0)|00:00:05|
-------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access(EXTRACT(MONTH FROM INTERNAL_FUNCTION("COL3"))=TO_NUMBER('MAY'))
SQL>
The function call means that Oracle won't be able to use any index that might be defined on the column.
Either remove the function call (as in harpo's answer) or use a function based index.