Function-based Index using Substr and Instr - oracle

I have created a query doing this in ORACLE:
SELECT SUBSTR(title,1,INSTR(title,' ',1,1)) AS first_word, COUNT(*) AS word_count
FROM FILM
GROUP BY SUBSTR(title,1,INSTR(title,' ',1,1))
HAVING COUNT(*) >= 20;
Results after running:
539 rows selected. Elapsed: 00:00:00.22
I need to improve the performance of this and created a function-based index as so:
CREATE INDEX INDX_FIRSTWRD ON FILM(SUBSTR(title,1,INSTR(title,' ',1,1)));
After running the same query at the top of this post, I still get the same performance:
539 rows selected. Elapsed: 00:00:00.22
Is the index not being applied or overwritten or am I doing something wrong?
Thanks for any help you could provide. :)
EDIT:
Execution Plan:
----------------------------------------------------------
Plan hash value: 2033354507
----------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 20000 | 2968K| 138 (2)| 00:00:02 |
|* 1 | FILTER | | | | | |
| 2 | HASH GROUP BY | | 20000 | 2968K| 138 (2)| 00:00:02 |
| 3 | TABLE ACCESS FULL| FILM | 20000 | 2968K| 136 (0)| 00:00:02 |
----------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(COUNT(*)>=20)
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
471 consistent gets
0 physical reads
0 redo size
14030 bytes sent via SQL*Net to client
908 bytes received via SQL*Net from client
37 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
539 rows processed

The problem is that the value you're using for the index may be null - if there is no space in the title (i.e. it's a one-word title like "Jaws") then your substr evaluates to null. That probably isn't what you want, incidentally - you probably want the end position to be conditional on whether there is a space at all, but that's beyond the scope of the question. (And even if you correct that logic, Oracle may still not be able to trust that the result can't be null, even if the underlying column is not nullable). Edit: see below for more on using nvl to handle single-word titles.
Since nulls aren't included in indexes, the single-title rows won't be indexed. But you're asking for all rows, and Oracle knows the index doesn't hold all rows, so it can't use the index to fulfil the query - even if you add a hint telling it to, it has to ignore that hint.
The only time the index will be used is if you include a filter that references the indexed value too, and explicitly or implicitly exclude nulls, e.g.:
SELECT SUBSTR(title,1,INSTR(title,' ',1,1)) AS first_word, COUNT(*) AS word_count
FROM FILM
WHERE SUBSTR(title,1,INSTR(title,' ',1,1)) IS NOT NULL
GROUP BY SUBSTR(title,1,INSTR(title,' ',1,1))
HAVING COUNT(*) >= 20;
(which also probably isn't what you actually want).
SQL Fiddle for queries with and without a filter, and with and without an index hint. (Click the 'execution plan' link against each result section to see whether it's doing a full table scan or a full index scan).
And another Fiddle showing that the index can't be used even with the filter if the filter still allows null values, again since they are not in the index.
Since SylvainLeroux brought it up, Oracle isn't quite clever enough to know the computed value can't be null if you coalesce it, even if the underlying column is not-null (as a function-based index or as a virtual column). Possibly because there could be a lot of branches to evaluate. But it is clever enough if you use the simpler and proprietary nvl instead:
CREATE INDEX INDX_FIRSTWRD
ON FILM(NVL(SUBSTR(title,1,INSTR(title,' ',1,1)),title));
SELECT NVL(SUBSTR(title,1,INSTR(title,' ',1,1)),title) AS first_word,
COUNT(*) AS word_count
FROM FILM
GROUP BY NVL(SUBSTR(title,1,INSTR(title,' ',1,1)),title)
HAVING COUNT(*) >= 20;
But only if title is defined as not-null. And coalesce does work if the virtual column is also declared not-null (thanks Sylvain).
SQL Fiddle with a function-based index and another with a virtual column.

539 rows selected. Elapsed: 00:00:00.22
Do you really think you need to tune the query which returns 539 rows in less than a second? 220 milliseconds, precicely! Think about it.
In your case, I think CBO does the best possible thing. And that is the reason it doesn't use the index. Because, to read every row from the table, using the index is an overhead. It needs to read the index and then do a table access by rowid. Probably, in your small table, it could read the entire table with less IO to fetch the data.
If the table is small enough to be in a single block, then, it just requires a one IO to fetch required data from single block with full table scan.
You can try to check the explain plan by hinting the query to use the index and see if anything really improves. Remember, you are trying unnecessarily to improve the performance of a query which executes in less than a second!

Related

With clause execution

I've always thought that With clause was working as a one-time execute statement, which behaves as a normal table - you can do all SQL operations on it as you would on a regular table.
But it turned out that in several databases(Oracle, Netezza, Sybase, Teradata) the with clause is executed each time it is used.
With Test as(
select random() --pseudo code
)
select '1st select', * from Test
union
select '2nd select', * form Test
Instead of 2 identical numbers, the query above returns 2 different numbers, so it is executed for each of the selects.
If I have a very complex query within a With clause and I use it 5 times in the rest of the query, it would execute 5 times which seems very ineffective to me.
So can someone give me a good logical reason is it working this way?
At least in Teradata it's working as expected, the random value is calculated only once:
With Test as(
select random(1,1000000) as x --pseudo code
)
select '1st select', x from Test
union
select '2nd select', x from Test
;
*** Query completed. 2 rows found. 2 columns returned.
*** Total elapsed time was 1 second.
'1st select' x
------------ -----------
1st select 422654
2nd select 422654
In the oracle world, as explained here, The WITH query_name clause lets you assign a name to a subquery block. You can then reference the subquery block multiple places in the query by specifying the query name. Oracle optimizes the query by treating the query name as either an inline view or as a temporary table.
You can specify this clause in any top-level SELECT statement and in most types of subqueries. The query name is visible to the main query and to all subsequent subqueries except the subquery that defines the query name itself.
A WITH clause is most valuable when the result of the WITH query is required more than one time in the body of the main query such as where one averaged value needs to be compared against two or three times. The point is to minimize the number of accesses to a table joined multiple times into a single query.
Restrictions on Subquery Factoring:
You cannot nest this clause. That is, you cannot specify the subquery_factoring_clause within the subquery of another subquery_factoring_clause. However, a query_name defined in one subquery_factoring_clause can be used in the subquery of any subsequent subquery_factoring_clause.
In a query with set operators, the set operator subquery cannot contain the subquery_factoring_clause, but the FROM subquery can contain the subquery_factoring_clause.
In your case, you have used a random function which will be treated differently by the optimizer, which will treat it as an inline view rather than a materialized one. As #ibre5041 suggested use a EXPLAIN PLAN for different cases.
Consider the case of a recursive CTE, it is used internally everytime.
WITH generator ( value ) AS (
SELECT 1 FROM DUAL
UNION ALL
SELECT value + 1
FROM generator
WHERE value < 10
)
SELECT value
FROM generator;
Plan hash value: 1492144221
--------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 2 | 26 | 4 (0)| 00:00:01 |
| 1 | VIEW | | 2 | 26 | 4 (0)| 00:00:01 |
| 2 | UNION ALL (RECURSIVE WITH) BREADTH FIRST| | | | | |
| 3 | FAST DUAL | | 1 | | 2 (0)| 00:00:01 |
|* 4 | RECURSIVE WITH PUMP | | | | | |
--------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
4 - filter("VALUE"<10)
For Oracle:
CTE aka Subquery factoring clause(in Oracle's terminology) can either be INLINEd or MATERIALIZEd, which ever is feasible. CBO should decide what will be more effective. Your example with random() function is a corner case.
Also it does not matter whether WITH clause is executed once or many times.
Each potential re-execution must give the same result. So decision whether subquery is INLINEd or MATERIALIZEd influences only the performance, but never actual query result.
Try to use hints MATERIALIZE or INLINE to see how exec. plan changes. Your test does not express anything about behavior of real SQL query evaluation.

Oracle partitioned table query cost vs non-partitioned table query cost

I have a table PO_HEADER with ~20 million records. Considering our future load on the table we have decided to partitioned the table to increase the performance of the sql queries. Below are the queries used to create the new partitioned tables.
CREATE TABLE PO_HEADER_LP
PARTITION BY LIST (BUYER_IDENTIFIER)
(PARTITION GC66287246AA VALUES ('GC66287246AA') TABLESPACE MITRIX_TABLES,
PARTITION GC43837235JK VALUES ('GC43837235JK') TABLESPACE MITRIX_TABLES,
PARTITION GC84338293AA VALUES ('GC84338293AA') TABLESPACE MITRIX_TABLES,
PARTITION DEFAULTBUID VALUES (DEFAULT) TABLESPACE MITRIX_TABLES)
AS SELECT *
FROM PO_HEADER;
create index PO_HEADER_LP_SI_IDX on PO_HEADER_LP("SUPPLIER_IDENTIFIER") TABLESPACE MITRIX_INDEXES LOCAL;
Old Table PO_HEADER has two indexes on "BUYER_IDENTIFIER" and "SUPPLIER_IDENTIFIER" columns as follows:
create index PO_HEADER_BI_IDX on PO_HEADER("BUYER_IDENTIFIER") TABLESPACE MITRIX_INDEXES;
create index PO_HEADER_SI_IDX on PO_HEADER("SUPPLIER_IDENTIFIER") TABLESPACE MITRIX_INDEXES;
To test the performance of the query, I executed below query on both the tables. But, to my wonder I saw the cost of the 2nd query is almost double than the 1st one. Can any body know, why is the query cost is high of the partitioned table compared to normal table. Thanks in Advance.
select * from po_header where buyer_identifier='GC84338293AA' and supplier_identifier='GC75987723HT'; --cost: 56,941
select * from po_header_lp where buyer_identifier= 'GC84338293AA' and supplier_identifier='GC75987723HT'; --cost: 93,309
PO_HEADER with Global Index on buyer_identifier & supplier_identifier column
PO_HEADER_LP with Global Index on supplier_identifier column
PO_HEADER_LP with Local Index on supplier_identifier column
From your DDL I assume, you have three big buyers (say 5M records each) and a bunch of smaller ones. In other word this would be the correct setup for you list partitioning schema.
You may verify, whether it works testing access on buyer only:
EXPLAIN PLAN SET STATEMENT_ID = 'jara1' into plan_table FOR
select * from tab_lp where BUYER_ID = 1;
;
SELECT * FROM table(DBMS_XPLAN.DISPLAY('plan_table', 'jara1','ALL'));
------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop |
------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 6662K| 82M| 4445 (2)| 00:00:01 | | |
| 1 | PARTITION LIST SINGLE| | 6662K| 82M| 4445 (2)| 00:00:01 | KEY | KEY |
| 2 | TABLE ACCESS FULL | TAB_LP | 6662K| 82M| 4445 (2)| 00:00:01 | 2 | 2 |
------------------------------------------------------------------------------------------------
The same query for the non-partitioned table should produce much higher cost. Why?
In the partitioned table the selected buyer (in your case GC84338293AA, I'm using surrogate keys) has it own partition.
So full scan of this partition is the best access.
select * from tab where BUYER_ID = 1;
--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 6596K| 81M| 14025 (1)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| TAB | 6596K| 81M| 14025 (1)| 00:00:01 |
--------------------------------------------------------------------------
1 - filter("BUYER_ID"=1)
For the non-partitioned table (to get approximately one fourth of the data) the FULL TABLE SCAN is OK as well,
but of course has higher cost as all data must be scanned.
Note - if you see here lower cost, unrealistically low Rows count and/or INDEX ACCESS,
than this is the cause of the problem of the underestimating of the cost. So don't worry the old cost are too low, not the new one too high!
The next step is the access on both buyer and supplier. To get the answer you must provide
additional information.
How selective is the supplier filter?
I.e. if the predicate buyer_identifier='GC84338293AA' returns say 5M records, how may records return the predicate with both columns?
buyer_identifier='GC84338293AA' and supplier_identifier='GC75987723HT'
Is it 4M or 100 records?
If the complete predicate returns only few records than the local index on supplier is OK.
If it returns large number of rows (say the quarter of the partition) - you should stay on FULL PARTITION SCAN and not use it.
This is similar to my comment on the non partitioned table.
Estimation of the supplier cardinality
In case that the column SUPPLIER contains a skewed data (which may fool the CBO to calulate improper cost) you may define explicitely histogram in this column.
I used this statement statement, that calculates the histogram on full data (100% is important for highly skewed data) and for the table and partition.
exec dbms_stats.gather_table_stats(ownname=>user,tabname=>'TAB_LP',granularity=>'all',estimate_percent => 100,METHOD_OPT => 'for columns SUPPLIER_ID size 254');
This worked for my test data, i.e. for supplier with low cardinality an index access was opened (on local no-prefixed index) and for huge suppliers a full partition scan was used.
You can create a Local partitioned index using this script.
CREATE INDEX PO_HEADER_LOCAL_IDX ON PO_HEADER_LP
(BUYER_IDENTIFIER, SUPPLIER_IDENTIFIER)
LOCAL (
PARTITION GC66287246AA,
PARTITION GC43837235JK,
PARTITION GC84338293AA,
PARTITION DEFAULTBUID
);
Also it is recommended to gather statistics of the newly created partition table using this script:
EXEC DBMS_STATS.GATHER_TABLE_STATS('SCHEMA Name','PO_HEADER_LP');
Now you can generate the execution plan again of the following SQL:
select * from po_header_lp where buyer_identifier= 'GC84338293AA' and supplier_identifier='GC75987723HT';
Hope this will help you.

Oracle Domain index and sorting

The following query performs very poorly, due to the "order by". My goal is to get only a small subset of the resultset (using ROWNUM, for example). However, when I add "order by" it goes through the entire resultset performing an index lookup for each record, which makes it extremely slow. Without sorting the query is about 100 times faster when I limit the resultset to, for example, 1000 records.
QUERY:
SELECT text_field
from mytable where
contains(text_field,'ABC', 1)>0
order by another_field;
THIS IS HOW I CREATED THE INDEX:
CREATE INDEX myindex ON mytable (text_field) INDEXTYPE IS ctxsys.context FILTER BY another_field
EXECUTION PLAN:
---------------------------------------------------------------
| Id | Operation | Name |
---------------------------------------------------------------
| 0 | SELECT STATEMENT | |
| 1 | SORT ORDER BY | |
| 2 | TABLE ACCESS BY INDEX ROWID| MYTABLE |
|* 3 | DOMAIN INDEX | MYINDEX |
---------------------------------------------------------------
I also used CTXCAT instead of CONTEXT, and no improvement. I think the problem is, when I want the results sorted (only top 1000), it performs an index lookup for each record in the "entire" resultset. Is there a way to avoid that?
Thank you.
To have the ordering applied before the rownum filter, you need to use an in-line view:
SELECT text_file
from (
SELECT text_field
from mytable where
contains(text_field,'ABC', 1)>0
order by another_field
)
where rownum <= 1000;
With your index in place Oracle should optimise this to do as little work as possible. You should see 'sort order by stopkey' and 'count stopkey' steps in the plan, which is Oracle being clever and knowing it only needs to get 1000 values from the index.
If you don't use the in-line view but just add the rownum to your original query it will still optimise it but as you state it will order the first 1000 random (or indeterminate, anyway) rows it finds, because of the sequence of operations it performs.

Will this type of pagination scale?

I need to paginate on a set of models that can/will become large. The results have to be sorted so that the latest entries are the ones that appear on the first page (and then, we can go all the way to the start using 'next' links).
The query to retrieve the first page is the following, 4 is the number of entries I need per page:
SELECT "relationships".* FROM "relationships" WHERE ("relationships".followed_id = 1) ORDER BY created_at DESC LIMIT 4 OFFSET 0;
Since this needs to be sorted and since the number of entries is likely to become large, am I going to run into serious performance issues?
What are my options to make it faster?
My understanding is that an index on 'followed_id' will simply help the where clause. My concern is on the 'order by'
Create an index that contains these two fields in this order (followed_id, created_at)
Now, how large is the large we are talking about here? If it will be of the order of millions.. How about something like the one that follows..
Create an index on keys followed_id, created_at, id (This might change depending upon the fields in select, where and order by clause. I have tailor-made this to your question)
SELECT relationships.*
FROM relationships
JOIN (SELECT id
FROM relationships
WHERE followed_id = 1
ORDER BY created_at
LIMIT 10 OFFSET 10) itable
ON relationships.id = itable.id
ORDER BY relationships.created_at
An explain would yield this:
+----+-------------+---------------+------+---------------+-------------+---------+------+------+-----------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------+------+---------------+-------------+---------+------+------+-----------------------------------------------------+
| 1 | PRIMARY | NULL | NULL | NULL | NULL | NULL | NULL | NULL | Impossible WHERE noticed after reading const tables |
| 2 | DERIVED | relationships | ref | sample_rel2 | sample_rel2 | 5 | | 1 | Using where; Using index |
+----+-------------+---------------+------+---------------+-------------+---------+------+------+-----------------------------------------------------+
If you examine carefully, the sub-query containing the order, limit and offset clauses will operate on the index directly instead of the table and finally join with the table to fetch the 10 records.
It makes a difference when at one point your query makes a call like limit 10 offset 10000. It will retrieve all the 10000 records from the table and fetch the first 10. This trick should restrict the traversal to just the index.
An important note: I tested this in MySQL. Other database might have subtle differences in behavior, but the concept holds good no matter what.
you can index these fields. but it depends:
you can assume (mostly) that the created_at is already ordered. So that might by unnecessary. But that more depends on you app.
anyway you should index followed_id (unless its the primary key)

is there a tricky way to optimize this query

I'm working on a table that has 3008698 rows
exam_date is a DATE field.
But queries I run want to match only the month part. So what I do is:
select * from my_big_table where to_number(to_char(exam_date, 'MM')) = 5;
which I believe takes long because of function on the column. Is there a way to avoid this and make it faster? other than making changes to the table? exam_date in the table have different date values. like 01-OCT-10 or 12-OCT-10...and so on
I don't know Oracle, but what about doing
WHERE exam_date BETWEEN first_of_month AND last_of_month
where the two dates are constant expressions.
select * from my_big_table where MONTH(exam_date) = 5
oops.. Oracle huh?..
select * from my_big_table where EXTRACT(MONTH from exam_date) = 5
Bear in mind that since you want approximately 1/12th of all the data, it may well be more efficient for Oracle to perform a full table scan anyway. This may explain why performance was worse when you followed harpo's advice.
Why? Suppose your data is such that 20 rows fit on each database block (on average), so that you have a total of 3,000,000/20 = 150,000 blocks. That means a full table scan will require 150,000 block reads. Now about 1/12th of the 3,000,000 rows will be for month 05. 3,000,000/12 is 250,000. So that's 250,000 table reads if you use the index - and that's ignoring the index reads that will also be required. So in this example the full table scan does a lot less work than the indexed search.
Bear in miond that there are only twelve distinct values for MONTH. So unless you have a strongly clustered set of records (say if you use partitioining) it is possible that using an index is not necessarily the most efficient way of querying in this fashion.
I didn't find that using EXTRACT() lead the optimizer to use a regular index on my date column but YMMV:
SQL> create index big_d_idx on big_table(col3) compute statistics
2 /
Index created.
SQL> set autotrace traceonly explain
SQL> select * from big_table
2 where extract(MONTH from col3) = 'MAY'
3 /
Execution Plan
----------------------------------------------------------
Plan hash value: 3993303771
-------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 23403 | 1028K| 4351 (3)| 00:00:53 |
|* 1 | TABLE ACCESS FULL| BIG_TABLE | 23403 | 1028K| 4351 (3)| 00:00:53 |
-------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(EXTRACT(MONTH FROM INTERNAL_FUNCTION("COL3"))=TO_NUMBER('M
AY'))
SQL>
What definitely can persuade the optimizer to use an index in these scenarios is building a function-based index:
SQL> create index big_mon_fbidx on big_table(extract(month from col3))
2 /
Index created.
SQL> select * from big_table
2 where extract(MONTH from col3) = 'MAY'
3 /
Execution Plan
----------------------------------------------------------
Plan hash value: 225326446
-------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)|Time |
-------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 23403 | 1028K| 475 (0)|00:00:06|
| 1 | TABLE ACCESS BY INDEX ROWID| BIG_TABLE | 23403 | 1028K| 475 (0)|00:00:06|
|* 2 | INDEX RANGE SCAN | BIG_MON_FBIDX | 9361 | | 382 (0)|00:00:05|
-------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access(EXTRACT(MONTH FROM INTERNAL_FUNCTION("COL3"))=TO_NUMBER('MAY'))
SQL>
The function call means that Oracle won't be able to use any index that might be defined on the column.
Either remove the function call (as in harpo's answer) or use a function based index.

Resources