I want to check each two-character position of a table column against many different values, selecting any row in which any of those two-character positions matches any of the specified values.
This is the only way I know of to do what I need, but I bet there is a cleaner, shorter way:
select *
from table
where 1=1
and (
Substr(columnA,1,2) IN ('CE','44','45','87','89','UT','AZ','XX','YY','S1','S2','S3','S4','ES','PM')
or Substr(columnA,3,2) IN ('CE','44','45','87','89','UT','AZ','XX','YY','S1','S2','S3','S4','ES','PM')
or Substr(columnA,5,2) IN ('CE','44','45','87','89','UT','AZ','XX','YY','S1','S2','S3','S4','ES','PM')
or Substr(columnA,7,2) IN ('CE','44','45','87','89','UT','AZ','XX','YY','S1','S2','S3','S4','ES','PM')
or Substr(columnA,9,2) IN ('CE','44','45','87','89','UT','AZ','XX','YY','S1','S2','S3','S4','ES','PM')
or Substr(columnA,11,2) IN ('CE','44','45','87','89','UT','AZ','XX','YY','S1','S2','S3','S4','ES','PM')
or Substr(columnA,12,2) IN ('CE','44','45','87','89','UT','AZ','XX','YY','S1','S2','S3','S4','ES','PM')
or Substr(columnA,14,2) IN ('CE','44','45','87','89','UT','AZ','XX','YY','S1','S2','S3','S4','ES','PM')
or Substr(columnA,16,2) IN ('CE','44','45','87','89','UT','AZ','XX','YY','S1','S2','S3','S4','ES','PM') )
;
Note: if the column's value is ABCDEF and we are checking for 'BC', this should not find a match; only 'AB', 'CD' and 'EF' should match.
I want to be able to list everything I am searching for just once. Even better would be to only list columnA once.
I did find an INSTR function that may be useful, but I'm not sure how to apply it here.
This works:
Instr(columnA,'XX') IN (1,3,5,7,9,11,14,16)
But is there a better way than to do this for every value I am searching for?
Could I use COALESCE somehow?
You could use REGEXP_LIKE to match a regular expression. Something like this:
'^(..)*((CE)|(44)|(45)|(87)|(89)|(UT)|(AZ)|(XX)|(YY)|(S1)|(S2)|(S3)|(S4)|(ES)|(PM))'
^ anchors the regex to the start of the line
(..)* consumes zero or more pairs of characters
((CE)|...) matches one of your digraphs
See http://sqlfiddle.com/#!4/d41d8/38453/0 for a live example.
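As a rough sketch of how that pattern might be used in a full query: the sample values below are made up for illustration, and the inline `samples` view only stands in for the real table. The inner parentheses around each digraph are dropped here, which should not change what is matched.

```sql
-- Hypothetical sample data; replace "samples" with the real table.
WITH samples AS (
  SELECT 'AAXXBB' AS columnA FROM dual UNION ALL  -- pairs AA, XX, BB: XX is listed
  SELECT 'AXXBBB' AS columnA FROM dual UNION ALL  -- pairs AX, XB, BB: XX is misaligned
  SELECT 'ABCDEF' AS columnA FROM dual            -- pairs AB, CD, EF: none listed
)
SELECT columnA
FROM samples
WHERE REGEXP_LIKE(columnA,
  '^(..)*(CE|44|45|87|89|UT|AZ|XX|YY|S1|S2|S3|S4|ES|PM)');
-- Only 'AAXXBB' qualifies, because XX starts at an odd position (3).
```

The list of values and the column name each appear only once, which is what the question asked for.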
This answer assumes that you intend to search every 2 characters. I am assuming the even numbered starting numbers are incorrect: technically you are searching overlapping strings in the original question.
Since this is a regular pattern, you can generate it, which can help you simplify the query.
WITH search_pattern AS
(SELECT LEVEL * 2 - 1 search_start
FROM DUAL
CONNECT BY LEVEL <= 9)
SELECT DISTINCT t.*
FROM table1 t CROSS JOIN search_pattern sp
WHERE SUBSTR (t.columna, search_start, 2) IN
('CE','44','45','87','89','UT','AZ','XX','YY','S1','S2','S3','S4','ES','PM')
In this case, we have one row for each position, which is functionally equivalent to OR. The DISTINCT keyword is necessary to prevent rows that qualify more than once from being returned more than once.
While this solution is functional, @SylvainLeroux's answer using regex will likely perform better.
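If the DISTINCT is a concern (it would also collapse rows that are genuine duplicates in the table), one possible variation is to test the generated positions with EXISTS instead of a cross join. This is only a sketch of an alternative, not part of the answer above; it reuses the same `table1` and `columna` placeholder names.

```sql
WITH search_pattern AS
  -- Generates the odd start positions 1, 3, 5, ..., 17
  (SELECT LEVEL * 2 - 1 AS search_start
   FROM dual
   CONNECT BY LEVEL <= 9)
SELECT t.*
FROM table1 t
WHERE EXISTS
  -- A row qualifies as soon as one position matches, so it is returned once
  (SELECT 1
   FROM search_pattern sp
   WHERE SUBSTR(t.columna, sp.search_start, 2) IN
     ('CE','44','45','87','89','UT','AZ','XX','YY','S1','S2','S3','S4','ES','PM'));
```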
I have four tables; two for current data, two for archive data. One of the archive tables has tens of millions of rows. All tables have a couple narrow indexes and are very similar.
Given the following queries:
SELECT (SELECT COUNT(*) FROM A)
UNION SELECT (SELECT COUNT(*) FROM B)
UNION SELECT (SELECT COUNT(*) FROM C_LargeTable)
UNION SELECT (SELECT COUNT(*) FROM D);
A, B and D perform index scans. C_LargeTable uses a seq scan and the query takes about 20 seconds to execute. Table D has millions of rows as well, but is only about 10% of the size of C_LargeTable.
If I then modify my query to use the following logic, which sufficiently narrows the counts, I still get the same results, the index is used, and the query takes about 5 seconds, or a quarter of the time:
...
SELECT (SELECT COUNT(*) FROM C_LargeTable WHERE idx_col < 'G')
+ (SELECT COUNT(*) FROM C_LargeTable WHERE idx_col BETWEEN 'G' AND 'Q')
+ (SELECT COUNT(*) FROM C_LargeTable WHERE idx_col > 'Q')
...
It does not make sense to me to have the I/O overhead of a full table scan for a count when perfectly good indexes exist and there is a covering primary key which would ensure uniqueness. My understanding of Postgres is that a PRIMARY KEY isn't like a SQL Server clustered index in that it doesn't determine a sort order, but it implicitly creates a btree index to ensure uniqueness, which I assume should require significantly less I/O than a full table scan.
Is this potentially an indication of an optimization that I may need to perform to organize data within C_LargeTable?
There isn't a covering index on the primary key because PostgreSQL doesn't support them (true up to and including 9.4 anyway).
The heap scan is required because of MVCC visibility. The index doesn't contain visibility information. Pg can do an index scan, but it still has to check visibility info from the heap, and with an index scan that'd be random I/O to read the whole table, so a seqscan will be much faster.
Make sure you run 9.2 or newer, and that autovacuum is configured to run frequently on the table. You should then be able to do an index-only scan where the visibility map is used. This only works under limited circumstances as Horse notes; see the wiki page on count and on index-only scans. If you aren't letting autovacuum run regularly enough the visibility map will be outdated and Pg won't be able to do an index-only scan.
In future, make sure you post explain or preferably explain analyze output with any queries.
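A minimal sketch of how one might check this on PostgreSQL 9.2 or newer. It assumes the table was created with an unquoted name, so it is stored as `c_largetable`; adjust the name if it was quoted at creation time.

```sql
-- Refresh the visibility map and the planner statistics for the large table.
VACUUM (ANALYZE) c_largetable;

-- relallvisible close to relpages suggests most pages are marked all-visible,
-- which is what makes an index-only scan worthwhile.
SELECT relname, relpages, relallvisible
FROM pg_class
WHERE relname = 'c_largetable';

-- Shows the plan actually chosen, with timings and buffer usage.
EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(*) FROM c_largetable;
```

If the plan still shows a seq scan after the vacuum, the EXPLAIN output is the thing to post alongside the question.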
I've written a scalar function (DYNAMIC_DATE) that converts a text value to a date/time. For example, DYNAMIC_DATE('T-1') (T-1 = today minus 1 = 'yesterday') returns 08-AUG-2012 00:00:00. It also accepts date strings: DYNAMIC_DATE('10/10/1974').
The function makes use of CASE statements to parse the sole parameter and calculate a date relative to sysdate.
While it doesn't make use of any table in its schema, it does make use of TABLE type to store date-format strings:
TYPE VARCHAR_TABLE IS TABLE OF VARCHAR2(10);
formats VARCHAR_TABLE := VARCHAR_TABLE ('mm/dd/rrrr','mm-dd-rrrr','rrrr/mm/dd','rrrr-mm-dd');
When I use the function in the SELECT clause, the query returns in < 1 second:
SELECT DYNAMIC_DATE('MB-1') START_DATE, DYNAMIC_DATE('ME-1') END_DATE
FROM DUAL
If I use it against our date dimension table (91311 total records), the query completes in < 1 second:
SELECT count(1)
from date_dimension
where calendar_dt between DYNAMIC_DATE('MB-1') and DYNAMIC_DATE('ME-1')
Others, however, are having problems with the function if it is used against a larger table (26,301,317 records):
/*
cost: 148,840
records: 151,885
time: ~20 minutes
*/
SELECT count(1)
FROM ORDERS ord
WHERE trunc(ord.ordering_date) between DYNAMIC_DATE('mb-1') and DYNAMIC_DATE('me-1')
However, the same query, using 'hard coded' dates, returns fairly rapidly:
/*
cost: 144,257
records: 151,885
time: 62 seconds
*/
SELECT count(1)
FROM ORDERS ord
WHERE trunc(ord.ordering_date) between to_date('01-JUL-2012','dd-mon-yyyy') AND to_date('31-JUL-2012','dd-mon-yyyy')
The vendor's vanilla installation doesn't include an index on the ORDERING_DATE field.
The explain plans for both queries are similar:
with function:
with hard-coded dates:
Is the DYNAMIC_DATE function being called repeatedly in the WHERE clause?
What else might explain the disparity?
** edit **
A NONUNIQUE index was added to ORDERS table. Both queries execute in < 1 second. Both plans are the same (approach), but the one with the function is lower cost.
I removed the DETERMINISTIC keyword from the function; the query executed in < 1 second.
Is the issue really with the function or was it related to the table?
3 years from now, when this table is even larger, and if I don't include the DETERMINISTIC keyword, will query performance suffer?
Will the DETERMINISTIC keyword have any effect on the function's results? If I run DYNAMIC_DATE('T-1') tomorrow, will I get the same results as if I ran it today (08/09/2012)? If so, this approach won't work.
If the steps of the plan are identical, then the total amount of work being done should be identical. If you trace the session (something simple like set autotrace on in SQL*Plus or something more sophisticated like an event 10046 trace), or if you look at DBA_HIST_SQLSTAT assuming you have licensed access to the AWR tables, are you seeing (roughly) the same amount of logical I/O and CPU consumption for the two queries? Is it possible that the difference in runtime you are seeing is the result of the data being cached when you run the second query?
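A rough sketch of the simple autotrace comparison in SQL*Plus, using the two queries from the question. It assumes the session has the privileges autotrace needs (typically the PLUSTRACE role).

```sql
-- SQL*Plus: report execution statistics without printing the result rows
SET AUTOTRACE TRACEONLY STATISTICS

-- Version using the function
SELECT count(1)
FROM ORDERS ord
WHERE trunc(ord.ordering_date) between DYNAMIC_DATE('mb-1') and DYNAMIC_DATE('me-1');

-- Version using hard-coded dates
SELECT count(1)
FROM ORDERS ord
WHERE trunc(ord.ordering_date) between to_date('01-JUL-2012','dd-mon-yyyy') AND to_date('31-JUL-2012','dd-mon-yyyy');

SET AUTOTRACE OFF
```

Comparing "consistent gets" and "physical reads" between the two runs should show whether the second query simply benefited from cached data. Running each query twice helps separate caching effects from the cost of the function calls.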
I am guessing that the problem isn't with your function. Try creating a function-based index on trunc(ordering_date) and compare the explain plans. Note that the index has to be created on the table itself (ORDERS), not on the alias used in the query:
CREATE INDEX ord_date_index ON orders (TRUNC(ordering_date));
I am writing a query to this effect:
select *
from players
where player_name like '%K%'
and player_rank<10
and check_if_player_is_eligible(player_name) > 1;
Now, the function check_if_player_is_eligible() is heavy and, therefore, I want the query to filter the search results sufficiently and then only run this function on the filtered results.
How can I ensure that all the filtering happens before the function is executed, so that it runs the minimum number of times?
Here are two methods you can use to trick Oracle into not evaluating your function until all the other WHERE clauses have been evaluated:
Using rownum
Using the pseudo-column rownum in a subquery will force Oracle to "materialize" the subquery. See for example this askTom thread for examples.
SELECT *
FROM (SELECT *
FROM players
WHERE player_name LIKE '%K%'
AND player_rank < 10
AND ROWNUM >= 1)
WHERE check_if_player_is_eligible(player_name) > 1
Here's the documentation reference "Unnesting of Nested Subqueries":
The optimizer can unnest most subqueries, with some exceptions. Those exceptions include hierarchical subqueries and subqueries that contain a ROWNUM pseudocolumn, one of the set operators, a nested aggregate function, or a correlated reference to a query block that is not the immediate outer query block of the subquery.
Using CASE
Using CASE you can force Oracle to evaluate your function only when the other conditions evaluate to TRUE. Unfortunately, it involves duplicating code if you also want the other clauses to be able to use indexes, as in:
SELECT *
FROM players
WHERE player_name LIKE '%K%'
AND player_rank < 10
AND CASE
WHEN player_name LIKE '%K%'
AND player_rank < 10
THEN check_if_player_is_eligible(player_name)
END > 1
There is the NO_PUSH_PRED hint to do it without involving rownum evaluation (that is a good trick anyway) in the process!
SELECT /*+NO_PUSH_PRED(v)*/*
FROM (
SELECT *
FROM players
WHERE player_name LIKE '%K%'
AND player_rank < 10
) v
WHERE check_if_player_is_eligible(player_name) > 1
You usually want to avoid forcing a specific order of execution. If the data or the query changes, your hints and tricks may backfire. It's usually better to provide useful metadata to Oracle so it can make the correct decisions for you.
In this case, you can provide better optimizer statistics about the function with ASSOCIATE STATISTICS.
For example, if your function is very slow because it has to read 50 blocks each time it is called:
associate statistics with functions
check_if_player_is_eligible default cost(1000 /*cpu*/, 50 /*IO*/, 0 /*network*/);
By default Oracle assumes that a function will select a row 1/20th of the time. Oracle wants to eliminate as many rows as soon as possible; changing the selectivity should make the function less likely to be executed first:
associate statistics with functions
check_if_player_is_eligible default selectivity 90;
But this raises some other issues. You have to pick one selectivity for ALL possible conditions, and 90% certainly won't always be accurate. The IO cost is the number of blocks fetched, but the CPU cost is "machine instructions used". What exactly does that mean?
There are more advanced ways to customize statistics, for example using the Oracle Data Cartridge Extensible Optimizer. But data cartridge is probably one of the most difficult Oracle features.
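If you experiment with these settings, it may help to check what is currently associated with the function and to remove the association again when a guess turns out badly. A small sketch, assuming the function is owned by the current schema:

```sql
-- Show the default cost and selectivity currently associated with the function
SELECT object_name, def_selectivity, def_cpu_cost, def_io_cost, def_net_cost
FROM user_associations
WHERE object_name = 'CHECK_IF_PLAYER_IS_ELIGIBLE';

-- Remove the association so the optimizer falls back to its defaults
DISASSOCIATE STATISTICS FROM FUNCTIONS check_if_player_is_eligible;
```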
You didn't specify whether players.player_name is unique or not. One could assume that it is, in which case the database has to call the function at least once per result record.
But, if player.player_name is not unique, you would want to minimize the calls down to count(distinct player.player_name) times. As (Ask)Tom shows in Oracle Magazine, the scalar subquery cache is an efficient way to do this.
You would have to wrap your function call into a subselect in order to make use of the scalar subquery cache:
SELECT players.*
FROM players
WHERE player_name LIKE '%K%'
AND player_rank < 10
-- Wrapping the call in a scalar subquery lets Oracle cache the result per distinct player_name
AND (SELECT check_if_player_is_eligible(players.player_name) FROM dual) > 1
Put the original query in a derived table, then place the additional predicate in the WHERE clause of the outer query, outside the derived table.
select *
from (
select *
from players
where player_name like '%K%'
and player_rank<10
) derived_tab1
Where check_if_player_is_eligible(player_name) > 1;