Using a function-based index (Oracle) to speed up count(X)

I have a table Film:
CREATE TABLE Film (
film_id NUMBER(5) NOT NULL,
title varchar2(255));
I want to speed up a query that counts how many titles start with the same word, and displays only the words shared by more than 20 titles, using a function-based index. The query:
The thing is, I created this function based index:
CREATE INDEX FIRST_WORD_INDEX ON FILM(regexp_replace(TITLE, '(\w+).*$','\1'));
But it didn't speed anything up...
I was wondering if anyone could help me with this :)

Add a redundant predicate to the query to convince Oracle that the expression will not return null values and an index can be used:
select regexp_replace(film.title, '(\w+).*$','\1') first_word
from film
where regexp_replace(film.title, '(\w+).*$','\1') is not null;
Oracle can use an index like a skinny version of a table. Many queries only contain a small subset of the columns in a table. If all the columns in that set are part of the same index, Oracle can use that index instead of the table. This will be either an INDEX FAST FULL SCAN or an INDEX FULL SCAN. The data may be read similar to the way a regular table scan works. But since the index is much smaller than the table, that access method can be much faster.
But function-based indexes do not store NULLs. Oracle cannot use an index scan if it thinks there is a NULL that is not stored in the index. In this case, if the base column was defined as NOT NULL, the regular expression would always return a non-null value. But unsurprisingly, Oracle has not built code to determine whether or not a regular expression could return NULL. That sounds like an impossible task, similar to the halting problem.
There are several ways to convince Oracle that the expression is not null. The simplest may be to repeat the predicate and add an IS NOT NULL condition.
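To make the indexed expression concrete, here is a rough Python sketch of what regexp_replace(title, '(\w+).*$','\1') extracts, and of the per-first-word count the question is after. It is only an illustration: Python's re module stands in for Oracle's regex engine (the two may differ on edge cases such as non-ASCII word characters), and the helper name first_word and the shortened title list are made up for this example.

```python
import re
from collections import Counter

# Python stand-in for Oracle's regexp_replace(title, '(\w+).*$', '\1').
# Assumption: both engines treat \w the same way for these ASCII titles.
FIRST_WORD = re.compile(r"(\w+).*$")


def first_word(title):
    """Keep the first run of word characters and drop the rest of the title."""
    return FIRST_WORD.sub(r"\1", title)


# A few titles borrowed from the sample schema in this answer.
titles = [
    "The Shawshank Redemption",
    "The Godfather",
    "The Godfather: Part II",
    "Pulp Fiction",
    "Schindler's List",
    "12 Angry Men",
]

# Count how many titles share each first word, as the original query does.
counts = Counter(first_word(t) for t in titles)
for word, n in counts.most_common():
    print(word, n)
```

None of these titles reduces to an empty string, which matches the point above: with a NOT NULL title the expression is never NULL in practice, but Oracle cannot prove that by itself.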
Sample Schema
create table film (
film_id number(5) not null,
title varchar2(255) not null);
insert into film select rownumber, column_value from
(
select rownum rownumber, column_value from table(sys.odcivarchar2list(
q'<The Shawshank Redemption>',
q'<The Godfather>',
q'<The Godfather: Part II>',
q'<The Dark Knight>',
q'<Pulp Fiction>',
q'<The Good, the Bad and the Ugly>',
q'<Schindler's List>',
q'<12 Angry Men>',
q'<The Lord of the Rings: The Return of the King>',
q'<Fight Club>'))
);
create index film_idx1 on film(regexp_replace(title, '(\w+).*$','\1'));
begin
dbms_stats.gather_table_stats(user, 'FILM');
end;
/
Query that does not use index
Even with an index hint, the normal query will not use an index. Remember that hints are directives, and this query would use the index if it was possible.
explain plan for
select /*+ index_ffs(film) */ regexp_replace(title, '(\w+).*$','\1') first_word
from film;
select * from table(dbms_xplan.display);
Plan hash value: 1232367652
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
| 0 | SELECT STATEMENT | | 10 | 50 | 3 (0)| 00:00:01 |
| 1 | TABLE ACCESS FULL| FILM | 10 | 50 | 3 (0)| 00:00:01 |
Query that uses index
Now add the extra condition and the query will use the index. I'm not sure why it uses an INDEX FULL SCAN instead of an INDEX FAST FULL SCAN. With such small sample data it doesn't matter. The important point is that an index is used.
explain plan for
select regexp_replace(film.title, '(\w+).*$','\1') first_word
from film
where regexp_replace(film.title, '(\w+).*$','\1') is not null;
select * from table(dbms_xplan.display);
Plan hash value: 1151375616
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
| 0 | SELECT STATEMENT | | 10 | 50 | 1 (0)| 00:00:01 |
|* 1 | INDEX FULL SCAN | FILM_IDX1 | 10 | 50 | 1 (0)| 00:00:01 |
Predicate Information (identified by operation id):
1 - filter( REGEXP_REPLACE ("TITLE",'(\w+).*$','\1') IS NOT NULL)


How should I index a FULLNAME field in Oracle when I need to query by first and last name?

I have a rather large table (34 GB, 77M rows) which contains payment information. The table is partitioned by payment date because users usually care about small ranges of dates so the partition pruning really helps queries to return quickly.
The problem is that I have a user who wants to find out all payments that have ever been made to certain people.
Names are stored in columns NAME1 and NAME2, which are both VARCHAR2(40 Byte) and hold free-form full name data. For example, John Q Public could appear in either column as:
John Q Public
John Public
Public, John Q
or even embedded in the middle of the field, like "Estate of John Public"
Right now, the way the query is set up is to look for
NAME1||NAME2 LIKE '%JOHN%PUBLIC%' OR NAME1||NAME2 LIKE '%PUBLIC%JOHN%' and as you can imagine, the performance sucks.
Is this a job for Oracle Text? How else could I better index the atomic bits of the columns so that the user can search by first/last name?
Database Version: Oracle 12c (
Create a multi-column index on both names and modify your query to use an INDEX FAST FULL SCAN operation.
Traversing a b-tree index is a great way to quickly find a small amount of data. Unfortunately the leading wildcards ruin that access path for your query. However, Oracle has multiple ways of reading data from an index. The INDEX FAST FULL SCAN operation simply reads all of the index blocks in no particular order, as if the index was a skinny table. Since the average row length of your table is 442 bytes, and the two columns use at most 80 bytes, reading all the names in the index may be much faster than scanning the entire table.
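As a back-of-the-envelope check of that claim, here is a small Python sketch using only the figures from the question (77M rows, 442-byte average rows, two 40-byte name columns). It is an upper-bound estimate, not a measurement: it assumes every name uses its full 40 bytes and ignores the rowid and block overhead that a real index also stores.

```python
ROWS = 77_000_000          # row count from the question
AVG_ROW_BYTES = 442        # average row length quoted above
MAX_NAME_BYTES = 40 + 40   # NAME1 + NAME2, both VARCHAR2(40 BYTE)

# Bytes a full table scan has to read, roughly.
table_bytes = ROWS * AVG_ROW_BYTES

# Upper bound on the name data in the index.
# Assumption: rowids and block overhead are ignored here.
index_bytes_upper = ROWS * MAX_NAME_BYTES

print(f"table: ~{table_bytes / 1e9:.1f} GB")
print(f"index: at most ~{index_bytes_upper / 1e9:.1f} GB of name data")
print(f"ratio: ~{table_bytes / index_bytes_upper:.1f}x less to read")
```

Real names are usually much shorter than 40 bytes, so the actual index is likely smaller than this bound even after adding the overhead the sketch leaves out.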
But the index alone probably isn't enough. You need to change the concatenation into multiple OR expressions.
Sample schema:
--Create payment table and index on name columns.
create table payment
(
id number,
paydate date,
other_data varchar2(400),
name1 varchar2(40),
name2 varchar2(40)
);
create index payment_idx on payment(name1, name2);
--Insert 100K sample rows.
insert into payment
select level, sysdate + level, lpad('A', 400, 'A'), level, level
from dual
connect by level <= 100000;
--Insert two rows with relevant values.
insert into payment values(0, sysdate, 'other data', 'B JOHN B PUBLIC B', 'asdf');
insert into payment values(0, sysdate, 'other data', 'asdf', 'C JOHN C PUBLIC C');
--Gather stats to help optimizer pick the right plan.
begin
dbms_stats.gather_table_stats(user, 'payment');
end;
/
Original expression uses a full table scan:
explain plan for
select name1, name2
from payment
where name1||name2 like '%JOHN%PUBLIC%' or name1||name2 like '%PUBLIC%JOHN%';
select * from table(dbms_xplan.display);
Plan hash value: 684176532
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
| 0 | SELECT STATEMENT | | 9750 | 4056K| 1714 (1)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| PAYMENT | 9750 | 4056K| 1714 (1)| 00:00:01 |
Predicate Information (identified by operation id):
1 - filter("NAME1"||"NAME2" LIKE '%JOHN%PUBLIC%' OR "NAME1"||"NAME2" LIKE '%PUBLIC%JOHN%')
New expression uses a faster INDEX FAST FULL SCAN operation:
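explain plan for
select name1, name2
from payment
where name1 like '%JOHN%PUBLIC%'
or name1 like '%PUBLIC%JOHN%'
or name2 like '%JOHN%PUBLIC%'
or name2 like '%PUBLIC%JOHN%';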
select * from table(dbms_xplan.display);
Plan hash value: 1655289165
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
| 0 | SELECT STATEMENT | | 18550 | 217K| 152 (3)| 00:00:01 |
|* 1 | INDEX FAST FULL SCAN| PAYMENT_IDX | 18550 | 217K| 152 (3)| 00:00:01 |
Predicate Information (identified by operation id):
This solution should definitely be faster than a full table scan. How much faster depends on the average name size and the name being searched. And depending on the query you may want to add additional columns to keep all the relevant data in the index.
Oracle Text is also a good option, but that feature feels a little "weird" in my opinion. If you're not already using text indexes you might want to stick with normal indexes to simplify administrative tasks.

Oracle CBO when using types [duplicate]

I'm trying to optimize a set of stored procs which are going against many tables including this view. The view is as such:
We have TBL_A (id, hist_date, hist_type, other_columns) with two types of rows: hist_type 'O' vs. hist_type 'N'. The view self joins table A to itself and transposes the N rows against the corresponding O rows. If no N row exists for the O row, the O row values are repeated. Like so:
CREATE OR REPLACE FORCE VIEW V_A (id, hist_date, hist_type, other_columns_o, other_columns_n) AS
select o.id, o.hist_date, o.hist_type,
o.other_columns as other_columns_o,
case when n.id is not null then n.other_columns else o.other_columns end as other_columns_n
from TBL_A o left outer join TBL_A n
on o.id = n.id and o.hist_date=n.hist_date and n.hist_type = 'N'
where o.hist_type = 'O';
TBL_A has a unique index on: (id, hist_date, hist_type). It also has a unique index on: (hist_date, id, hist_type) and this is the primary key.
The following query is at issue (in a stored proc, with x declared as TYPE_TABLE_OF_NUMBER):
select b.id BULK COLLECT into x from TBL_B b where b.parent_id = input_id;
select v.id from v_a v
where v.id in (select column_value from table(x))
and v.hist_date = input_date
and v.status_new = 'CLOSED';
This query ignores the index on id column when accessing TBL_A and instead does a range scan using the date to pick up all the rows for the date. Then it filters that set using the values from the array. However if I simply give the list of ids as a list of numbers the optimizer uses the index just fine:
select v.id from v_a v
where v.id in (123, 234, 345, 456, 567, 678, 789)
and v.hist_date = input_date
and v.status_new = 'CLOSED';
The problem also doesn't exist when going against TBL_A directly (and I have a workaround that does that, but it's not ideal). Is there a way to get the optimizer to first retrieve the array values and use them as predicates when accessing the table? Or a good way to restructure the view to achieve this?
Oracle does not use the index because it assumes select column_value from table(x) returns 8168 rows.
Indexes are faster for retrieving small amounts of data. At some point it's faster to scan the whole table than repeatedly walk the index tree.
Estimating the cardinality of a regular SQL statement is difficult enough. Creating an accurate estimate for procedural code is almost impossible. But I don't know where they came up with 8168. Table functions are normally used with pipelined functions in data warehouses, so a sorta-large number makes sense.
Dynamic sampling can generate a more accurate estimate and likely generate a plan that will use the index.
Here's an example of a bad cardinality estimate:
create or replace type type_table_of_number as table of number;
explain plan for
select * from table(type_table_of_number(1,2,3,4,5,6,7));
select * from table(dbms_xplan.display(format => '-cost -bytes'));
Plan hash value: 1748000095
| Id | Operation | Name | Rows | Time |
| 0 | SELECT STATEMENT | | 8168 | 00:00:01 |
Here's how to fix it:
explain plan for select /*+ dynamic_sampling(2) */ *
from table(type_table_of_number(1,2,3,4,5,6,7));
select * from table(dbms_xplan.display(format => '-cost -bytes'));
Plan hash value: 1748000095
| Id | Operation | Name | Rows | Time |
| 0 | SELECT STATEMENT | | 7 | 00:00:01 |
- dynamic statistics used: dynamic sampling (level=2)

Oracle is not using the Indexes

I have a very large table in Oracle 11g that has a very simple index on a char field (that is normally Y or N).
If I just execute the query as below, it takes around 10s to return:
select QueueId, QueueSiteId, QueueData from queue where QueueProcessed = 'N'
However, if I force it to use the index I created, it takes 80ms:
select /*+ INDEX(avaqueue QUEUEPROCESSED_IDX) */ QueueId, QueueSiteId, QueueData
from queue where QueueProcessed = 'N'
Also, if I run both under explain plan as below:
explain plan for select QueueId, QueueSiteId, QueueData
from queue where QueueProcessed = 'N'
explain plan for select /*+ INDEX(avaqueue QUEUEPROCESSED_IDX) */
QueueId, QueueSiteId, QueueData
from queue where QueueProcessed = 'N'
For the first plan I got:
Plan hash value: 803924726
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
| 0 | SELECT STATEMENT | | 691K| 128M| 12643 (1)| 00:02:32 |
|* 1 | TABLE ACCESS FULL| AVAQUEUE | 691K| 128M| 12643 (1)| 00:02:32 |
Predicate Information (identified by operation id):
1 - filter("QUEUEPROCESSED"='N')
For the second plan I got:
Plan hash value: 2012309891
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
| 0 | SELECT STATEMENT | | 691K| 128M| 24386 (1)| 00:04:53 |
| 1 | TABLE ACCESS BY INDEX ROWID| AVAQUEUE | 691K| 128M| 24386 (1)| 00:04:53 |
|* 2 | INDEX RANGE SCAN | QUEUEPROCESSED_IDX | 691K| | 1297 (1)| 00:00:16 |
Predicate Information (identified by operation id):
2 - access("QUEUEPROCESSED"='N')
This shows that unless I explicitly tell Oracle to use the index, it does not use it. My question is: why is Oracle not using this index? Oracle is normally smart enough to make decisions 10 times better than me. This is the first time I have actually had to force Oracle to use an index, and I am not very comfortable with it.
Does anyone have a good explanation for Oracle's decision not to use the index in this very explicit case?
The QueueProcessed column is probably missing a histogram so Oracle does not know the data is skewed.
If Oracle does not know the data is skewed it will assume the equality predicate, QueueProcessed = 'N', returns DBA_TABLES.NUM_ROWS / DBA_TAB_COLUMNS.NUM_DISTINCT. The optimizer thinks the query returns half the rows in the table. Based on the 80ms return time the real number of rows returned is small.
Index range scans generally only work well when they select a small percentage of the rows. Index range scans read from a data structure one block at a time. And if the data is randomly distributed, it may need to read every block of data from the table anyway. For those reasons, if the query accesses a large portion of the table, it is more efficient to use a multi-block full table scan.
The bad cardinality estimate from the skewed data causes Oracle to think a full table scan is better. Creating a histogram will fix the issue.
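The no-histogram guess described above is just a division, so it is easy to work through by hand. Here is a minimal Python sketch of that arithmetic, using the numbers from the sample schema that follows (100,000 rows, two distinct values, 100 real matches). The function name estimated_rows is made up for illustration, and this only models the simple uniform-distribution formula quoted above, not everything the optimizer does.

```python
def estimated_rows(num_rows, num_distinct):
    """Uniform-distribution guess for `col = value` when no histogram exists."""
    return num_rows / num_distinct


num_rows = 100_000      # DBA_TABLES.NUM_ROWS for the sample QUEUE table
num_distinct = 2        # DBA_TAB_COLUMNS.NUM_DISTINCT: 'Y' and 'N'
actual_n_rows = 100     # rows that really have QueueProcessed = 'N'

guess = estimated_rows(num_rows, num_distinct)
print("estimated rows:", guess)
print("overestimate factor:", guess / actual_n_rows)
```

With an estimate that far above the real row count, a full table scan looks cheaper to the optimizer than an index range scan, which is the behavior the question describes.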
Sample schema
Create a table, fill it with skewed data, and gather statistics the first time.
drop table queue;
create table queue(
queueid number,
queuesiteid number,
queuedata varchar2(4000),
queueprocessed varchar2(1)
);
create index QUEUEPROCESSED_IDX on queue(queueprocessed);
--Skewed data - only 100 of the 100000 rows are set to N.
insert into queue
select level, level, level, decode(mod(level, 1000), 0, 'N', 'Y')
from dual connect by level <= 100000;
begin
dbms_stats.gather_table_stats(user, 'QUEUE');
end;
/
The first execution will have the problem.
In this case the default statistics settings do not gather histograms the first time. The plan shows a full table scan and estimates Rows=50000, exactly half.
explain plan for
select QueueId, QueueSiteId, QueueData
from queue where QueueProcessed = 'N';
select * from table(dbms_xplan.display);
Plan hash value: 1157425618
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
| 0 | SELECT STATEMENT | | 50000 | 878K| 103 (1)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| QUEUE | 50000 | 878K| 103 (1)| 00:00:01 |
Predicate Information (identified by operation id):
1 - filter("QUEUEPROCESSED"='N')
Create a histogram
The default statistics settings are usually sufficient. Histograms may not be collected for several reasons. They may be manually disabled, so check for tasks, jobs, or preferences set by the DBA.
Also, histograms are only automatically collected on columns that are both skewed and used. Gathering histograms can take time, so there's no need to create a histogram on a column that is never used in a relevant predicate. Oracle tracks when a column is used and could benefit from a histogram, although that data is lost if the table is dropped.
Running a sample query and re-gathering statistics will make the histogram appear:
select QueueId, QueueSiteId, QueueData
from queue where QueueProcessed = 'N';
begin
dbms_stats.gather_table_stats(user, 'QUEUE');
end;
/
Now the Rows=100 and the Index is used.
explain plan for
select QueueId, QueueSiteId, QueueData
from queue where QueueProcessed = 'N';
select * from table(dbms_xplan.display);
Plan hash value: 2630796144
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
| 0 | SELECT STATEMENT | | 100 | 1800 | 2 (0)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID BATCHED| QUEUE | 100 | 1800 | 2 (0)| 00:00:01 |
|* 2 | INDEX RANGE SCAN | QUEUEPROCESSED_IDX | 100 | | 1 (0)| 00:00:01 |
Predicate Information (identified by operation id):
2 - access("QUEUEPROCESSED"='N')
Here's the histogram:
select column_name, histogram
from dba_tab_columns
where table_name = 'QUEUE'
order by column_name;
----------- ---------
Create the histogram
Try to determine why the histogram was missing. Check that statistics are gathered with the defaults, that there are no weird column or table preferences, and that the table is not constantly dropped and re-loaded.
If you cannot rely on the default statistics job for your process you can manually gather histograms with the method_opt parameter like this:
begin
dbms_stats.gather_table_stats(user, 'QUEUE', method_opt=>'for columns size 254 queueprocessed');
end;
/
The answer - at least the first one that will just lead to more questions - is right there in the plans. The first plan has an estimated cost and estimated execution time about half that of the second plan. In the absence of the hint, Oracle is choosing the plan that it thinks will run faster.
So of course the next question is why is its estimate so far off in this case. Not only are the estimated times wrong relative to each other, both are much greater than what you actually experience when running the query.
The first thing I would look at is the estimated number of rows returned. The optimizer is guessing, in both cases, that there are about 691,000 rows in the table matching your predicate. Is this close to the truth, or very far off? If it's far off, then refreshing statistics may be the right solution. Although if the column only has two possible values, I'd be kind of surprised if the existing stats are so off base.

why does Oracle take another execution path on a view

We are using a view on an Oracle 10g database to provide data to a .NET application. The nice part of this is that we need a number(12) in the view so the .NET application sees this as an integer. So in the select there is a cast(field as NUMBER(12)). So far so good: if we use a where clause on some fields, the cost is 0.9k. But now the funny part: if we make a view of this and query the view with a where clause, the cost goes from 0.9k to 18k.
In the explain plan suddenly all indexes are skipped and this results in lots of full table scans. Why does this happen when we use a view?
The simplified version of the problem:
SELECT CAST (a.numbers AS NUMBER (12)) numbers
FROM tablea a
WHERE a.numbers = 201813754;
explain plan:
SELECT STATEMENT ALL_ROWS Cost: 1 Bytes: 7 Cardinality: 1
1 INDEX UNIQUE SCAN INDEX (UNIQUE) TAB1_IDX Cost: 1 Bytes: 7 Cardinality: 1
No problem index hit
If we put the above query in a view and execute the same query:
SELECT a.numbers
FROM index_test a
WHERE a.numbers = 201813754;
No index is used.
Explain plan:
SELECT STATEMENT ALL_ROWS Cost: 210 Bytes: 2,429 Cardinality: 347
1 TABLE ACCESS FULL TABLE TABLEA Object Instance: 2 Cost: 210 Bytes: 2,429 Cardinality: 347
The issue is you're applying a function to the column (cast in this case). Oracle can't use the index you have as your query stands. To fix this you either need to remove the cast function from your view, or create a function based index:
create table tablea (numbers integer);
insert into tablea
select rownum from dual connect by level <= 1000;
create index ix on tablea (numbers);
-- query on base table uses index
explain plan for
SELECT * FROM tablea
where numbers = 1;
SELECT * FROM table(dbms_xplan.display(null,null, 'BASIC +PREDICATE'));
| Id | Operation | Name |
Predicate Information (identified by operation id):
1 - access("NUMBERS"=1)
create view v as
SELECT cast(numbers as number(12)) numbers FROM tablea;
-- the cast function in the view means we can't use the index
-- note the filter below the plan
explain plan for
SELECT * FROM v
where numbers = 1;
SELECT * FROM table(dbms_xplan.display(null,null, 'BASIC +PREDICATE'));
| Id | Operation | Name |
Predicate Information (identified by operation id):
1 - filter(CAST("NUMBERS" AS number(12))=1)
-- create the function based index and we're back to an index range scan
create index iv on tablea (cast(numbers as number(12)));
explain plan for
SELECT * FROM v
where numbers = 1;
SELECT * FROM table(dbms_xplan.display(null,null, 'BASIC +PREDICATE'));
| Id | Operation | Name |
Predicate Information (identified by operation id):
1 - access(CAST("NUMBERS" AS number(12))=1)

is there a tricky way to optimize this query

I'm working on a table that has 3008698 rows
exam_date is a DATE field.
But queries I run want to match only the month part. So what I do is:
select * from my_big_table where to_number(to_char(exam_date, 'MM')) = 5;
which I believe takes long because of the function on the column. Is there a way to avoid this and make it faster, other than making changes to the table? exam_date in the table has different date values, like 01-OCT-10 or 12-OCT-10, and so on.
I don't know Oracle, but what about doing
WHERE exam_date BETWEEN first_of_month AND last_of_month
where the two dates are constant expressions.
select * from my_big_table where MONTH(exam_date) = 5
oops.. Oracle huh?..
select * from my_big_table where EXTRACT(MONTH from exam_date) = 5
Bear in mind that since you want approximately 1/12th of all the data, it may well be more efficient for Oracle to perform a full table scan anyway. This may explain why performance was worse when you followed harpo's advice.
Why? Suppose your data is such that 20 rows fit on each database block (on average), so that you have a total of 3,000,000/20 = 150,000 blocks. That means a full table scan will require 150,000 block reads. Now about 1/12th of the 3,000,000 rows will be for month 05. 3,000,000/12 is 250,000. So that's 250,000 table reads if you use the index - and that's ignoring the index reads that will also be required. So in this example the full table scan does a lot less work than the indexed search.
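The same arithmetic as a small Python sketch, in case you want to plug in your own numbers. The 20 rows per block figure is the assumption from the paragraph above, and the index-path figure is a worst case that assumes each matching row lives in a different table block (no clustering by month) and ignores the index block reads themselves.

```python
TOTAL_ROWS = 3_000_000
ROWS_PER_BLOCK = 20     # assumed average, as in the paragraph above
MONTHS = 12

# Full table scan: every block is read once.
full_scan_reads = TOTAL_ROWS // ROWS_PER_BLOCK

# Index access: roughly one table block visit per matching row.
# Assumption: rows for a given month are scattered across the table.
indexed_table_reads = TOTAL_ROWS // MONTHS

print("full scan block reads:", full_scan_reads)
print("table reads via index:", indexed_table_reads)
print("index path does more work:", indexed_table_reads > full_scan_reads)
```

If the rows for one month are physically close together, the index path needs far fewer block visits than this worst case, so treat the comparison as a rough guide only.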
Bear in mind that there are only twelve distinct values for MONTH. So unless you have a strongly clustered set of records (say, if you use partitioning), an index is not necessarily the most efficient way to run this kind of query.
I didn't find that using EXTRACT() led the optimizer to use a regular index on my date column, but YMMV:
SQL> create index big_d_idx on big_table(col3) compute statistics
2 /
Index created.
SQL> set autotrace traceonly explain
SQL> select * from big_table
2 where extract(MONTH from col3) = 5
3 /
Execution Plan
Plan hash value: 3993303771
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
| 0 | SELECT STATEMENT | | 23403 | 1028K| 4351 (3)| 00:00:53 |
|* 1 | TABLE ACCESS FULL| BIG_TABLE | 23403 | 1028K| 4351 (3)| 00:00:53 |
Predicate Information (identified by operation id):
What definitely can persuade the optimizer to use an index in these scenarios is building a function-based index:
SQL> create index big_mon_fbidx on big_table(extract(month from col3))
2 /
Index created.
SQL> select * from big_table
2 where extract(MONTH from col3) = 5
3 /
Execution Plan
Plan hash value: 225326446
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)|Time |
| 0 | SELECT STATEMENT | | 23403 | 1028K| 475 (0)|00:00:06|
| 1 | TABLE ACCESS BY INDEX ROWID| BIG_TABLE | 23403 | 1028K| 475 (0)|00:00:06|
|* 2 | INDEX RANGE SCAN | BIG_MON_FBIDX | 9361 | | 382 (0)|00:00:05|
Predicate Information (identified by operation id):
The function call means that Oracle won't be able to use any index that might be defined on the column.
Either remove the function call (as in harpo's answer) or use a function based index.
