Improve BigQuery case insensitive search performance

The BigQuery team strikes again: this question is no longer relevant, as results with LOWER() are now as fast as with regular expressions.
Processing ~5GB of data with BigQuery should be super fast. For example, the following query performs a case insensitive search in 18 seconds:
#standardSQL
SELECT COUNT(*) c
FROM `bigquery-public-data.hacker_news.full`
WHERE
LOWER(text) LIKE '%bigquery%' # 18s
Usually BigQuery is faster than this, but the real problem is that adding new search terms makes this query considerably slower (almost a minute with 3 search terms):
#standardSQL
SELECT COUNT(*) c
FROM `bigquery-public-data.hacker_news.full`
WHERE
LOWER(text) LIKE '%bigquery%' OR LOWER(text) LIKE '%big query%' # 34s
#standardSQL
SELECT COUNT(*) c
FROM `bigquery-public-data.hacker_news.full`
WHERE
LOWER(text) LIKE '%bigquery%' OR LOWER(text) LIKE '%big query%'
OR LOWER(text) LIKE '%google cloud%' # 52s
How can I improve my query performance?

Note from the team: Stay tuned! Very soon BigQuery will make this
advice irrelevant.
BigQuery performance tip: Avoid using LOWER() and UPPER()
LOWER() and UPPER() operations have a hard time when dealing with Unicode text: each character needs to be mapped individually, and characters can also be multi-byte.
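As a quick illustration (a sketch only; exact results depend on the engine's Unicode tables), the Turkish dotted capital İ shows why case mapping has to look at whole characters rather than bytes:

```sql
#standardSQL
-- İ (U+0130) is a multi-byte character whose lowercase form is a
-- different multi-byte sequence, so LOWER() cannot simply flip one
-- bit per byte the way ASCII-only case mapping can.
SELECT
  LOWER('İSTANBUL') AS lowered,
  BYTE_LENGTH('İSTANBUL') AS bytes_upper,
  BYTE_LENGTH(LOWER('İSTANBUL')) AS bytes_lower
```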
Solution 1: Case insensitive regex
A faster alternative: use REGEXP_CONTAINS() and add the case insensitive (?i) modifier to your regular expression:
#standardSQL
SELECT COUNT(*) c
FROM `bigquery-public-data.hacker_news.full`
WHERE
REGEXP_CONTAINS(text, '(?i)bigquery') # 7s
# REGEXP_CONTAINS(text, '(?i)bigquery')
# OR REGEXP_CONTAINS(text, '(?i)big query') # 9s
# REGEXP_CONTAINS(text, '(?i)bigquery')
# OR REGEXP_CONTAINS(text, '(?i)big query')
# OR REGEXP_CONTAINS(text, '(?i)google cloud') # 11s
Performance is much better this way:
1 search term: 18s down to 7s
2 search terms: 34s down to 9s
3 search terms: 52s down to 11s
Solution 2: Combine regexes
Why do 3 searches when a regular expression can combine many into 1?
#standardSQL
SELECT COUNT(*) c
FROM `bigquery-public-data.hacker_news.full`
WHERE
REGEXP_CONTAINS(text, '(?i)(bigquery|big query|google cloud)') # 7s
3 terms in 7s - nice.
Solution 3: Transform to bytes
This is uglier, but it shows that UPPER() and LOWER() perform way better when dealing with individual bytes, with equivalent results for these searches:
#standardSQL
SELECT COUNT(*) c
FROM `bigquery-public-data.hacker_news.full`
WHERE
LOWER(CAST(text AS BYTES)) LIKE b'%bigquery%'
OR LOWER(CAST(text AS BYTES)) LIKE b'%big query%'
OR LOWER(CAST(text AS BYTES)) LIKE b'%google cloud%' # 7s
LOWER() is slower. Use the regex (?i) modifier instead.
If this worked for you, please feel free to comment with your performance improvements.

Related

Improve performance of Query

I recently came across a case in which the overall SELECT statement was very fast (0.06 sec) but this part took 20 seconds:
COALESCE(T1.trees, T1.trees, T1.flowers) ASC
while used in: ORDER BY COALESCE(T2.trees, T2.trees, T2.flowers) ASC, T1.Shirts
T1 and T2 are aliases of the same table, with T2 used in a LEFT OUTER JOIN.
Are there any alternatives to the COALESCE(T2.trees, T2.trees, T2.flowers) ASC part that would perform better?
(As I read on other posts, this is probably due to the fact that COALESCE(x, y) doesn't make full use of indexes. So assume that the trees column is indexed.)
Cheers,
udifel

easier way to do multiple substr instr OR statements oracle pl sql

I want to check every 2 positions of a table column for many different values (selecting any row that matches any specified value at any of the 2-character positions).
This is the only way I know of to do what I need, but I bet there is a cleaner, shorter way:
select *
from table
where 1=1
and (
Substr(columnA,1,2) IN ('CE','44','45','87','89','UT','AZ','XX','YY','S1','S2','S3','S4','ES','PM')
or Substr(columnA,3,2) IN ('CE','44','45','87','89','UT','AZ','XX','YY','S1','S2','S3','S4','ES','PM')
or Substr(columnA,5,2) IN ('CE','44','45','87','89','UT','AZ','XX','YY','S1','S2','S3','S4','ES','PM')
or Substr(columnA,7,2) IN ('CE','44','45','87','89','UT','AZ','XX','YY','S1','S2','S3','S4','ES','PM')
or Substr(columnA,9,2) IN ('CE','44','45','87','89','UT','AZ','XX','YY','S1','S2','S3','S4','ES','PM')
or Substr(columnA,11,2) IN ('CE','44','45','87','89','UT','AZ','XX','YY','S1','S2','S3','S4','ES','PM')
or Substr(columnA,12,2) IN ('CE','44','45','87','89','UT','AZ','XX','YY','S1','S2','S3','S4','ES','PM')
or Substr(columnA,14,2) IN ('CE','44','45','87','89','UT','AZ','XX','YY','S1','S2','S3','S4','ES','PM')
or Substr(columnA,16,2) IN ('CE','44','45','87','89','UT','AZ','XX','YY','S1','S2','S3','S4','ES','PM') )
;
note: if the column's value is ABCDEF, and we are checking for 'BC', this should not find a match, only 'AB', 'CD', 'EF' should match.
I want to be able to list everything I am searching for just once. Even better would be to only list columnA once.
I did find an INSTR function that may be useful, but I'm not sure how to apply it here.
this works:
Instr(columnA,'XX') IN (1,3,5,7,9,11,14,16)
But is there a better way than to do this for every value I am searching for?
Could I use COALESCE some how?
Using REGEXP_LIKE to match a regular expression? Something like this:
'^(..)*((CE)|(44)|(45)|(87)|(89)|(UT)|(AZ)|(XX)|(YY)|(S1)|(S2)|(S3)|(S4)|(ES)|(PM))'
^ anchors the regex at the start of the string
(..)* consumes zero or more pairs of characters
((CE)|...) matches one of your digraphs.
See http://sqlfiddle.com/#!4/d41d8/38453/0 for a live example.
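Putting the pattern to work, the whole stack of Substr tests collapses into one predicate (a sketch; the table name table1 is assumed, the question used a placeholder):

```sql
SELECT *
FROM table1
-- One anchored regex replaces nine Substr(...) IN (...) branches:
WHERE REGEXP_LIKE(
        columnA,
        '^(..)*((CE)|(44)|(45)|(87)|(89)|(UT)|(AZ)|(XX)|(YY)|(S1)|(S2)|(S3)|(S4)|(ES)|(PM))');
```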
This answer assumes that you intend to search every 2 characters. I am assuming the even-numbered starting position (12) is a typo: technically, it makes you search overlapping strings in the original question.
Since this is a regular pattern, you can generate it, which can help you simplify the query.
WITH search_pattern AS
(SELECT LEVEL * 2 - 1 search_start
FROM DUAL
CONNECT BY LEVEL <= 9)
SELECT DISTINCT t.*
FROM table1 t CROSS JOIN search_pattern sp
WHERE SUBSTR (t.columna, search_start, 2) IN
('CE','44','45','87','89','UT','AZ','XX','YY','S1','S2','S3','S4','ES','PM')
In this case, we have one row for each position, which is functionally equivalent to OR. The DISTINCT keyword is necessary to prevent rows that qualify more than once from being returned multiple times.
While this solution is functional, @SylvainLeroux's answer using a regex will likely perform better.

postgres not using index on SELECT COUNT(*) for a large table

I have four tables; two for current data, two for archive data. One of the archive tables has tens of millions of rows. All tables have a couple of narrow indexes and are very similar.
Given the following queries:
SELECT (SELECT COUNT(*) FROM A)
UNION SELECT (SELECT COUNT(*) FROM B)
UNION SELECT (SELECT COUNT(*) FROM C_LargeTable)
UNION SELECT (SELECT COUNT(*) FROM D);
A, B and D perform index scans. C_LargeTable uses a seq scan, and the query takes about 20 seconds to execute. Table D has millions of rows as well, but is only about 10% of the size of C_LargeTable.
If I then modify my query to use the following logic, which sufficiently narrows the counts, I still get the same results, the index is used, and the query takes about 5 seconds, about a quarter of the time:
...
SELECT (SELECT COUNT(*) FROM C_LargeTable WHERE idx_col < 'G')
+ (SELECT COUNT(*) FROM C_LargeTable WHERE idx_col BETWEEN 'G' AND 'Q')
+ (SELECT COUNT(*) FROM C_LargeTable WHERE idx_col > 'Q')
...
It does not make sense to me to have the I/O overhead of a full table scan for a count when perfectly good indexes exist and there is a covering primary key that would ensure uniqueness. My understanding of Postgres is that a PRIMARY KEY isn't like a SQL Server clustered index in that it determines a sort, but it implicitly creates a btree index to ensure uniqueness, which I assume should require significantly less I/O than a full table scan.
Is this potentially an indication of an optimization that I may need to perform to organize data within C_LargeTable?
There isn't a covering index on the primary key because PostgreSQL doesn't support them (true up to and including 9.4 anyway).
The heap scan is required because of MVCC visibility. The index doesn't contain visibility information. Pg can do an index scan, but it still has to check visibility info from the heap, and with an index scan that'd be random I/O to read the whole table, so a seqscan will be much faster.
Make sure you run 9.2 or newer, and that autovacuum is configured to run frequently on the table. You should then be able to do an index-only scan where the visibility map is used. This only works under limited circumstances as Horse notes; see the wiki page on count and on index-only scans. If you aren't letting autovacuum run regularly enough the visibility map will be outdated and Pg won't be able to do an index-only scan.
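A minimal sketch of how to check this (the table name is taken from the question):

```sql
-- Refresh table statistics and the visibility map (PostgreSQL 9.2+),
-- then look at the plan for the count:
VACUUM ANALYZE c_largetable;
EXPLAIN (ANALYZE, BUFFERS) SELECT COUNT(*) FROM c_largetable;
-- An "Index Only Scan" node with a low "Heap Fetches" figure means the
-- visibility map is current; a high figure means it is stale and the
-- planner may prefer a sequential scan instead.
```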
In future, make sure you post EXPLAIN or, preferably, EXPLAIN ANALYZE output with any queries.

Oracle scalar function in WHERE clause leads to poor performance

I've written a scalar function (DYNAMIC_DATE) that converts a text value to a date/time. For example, DYNAMIC_DATE('T-1') (T-1 = today minus 1 = 'yesterday') returns 08-AUG-2012 00:00:00. It also accepts date strings: DYNAMIC_DATE('10/10/1974').
The function makes use of CASE statements to parse the sole parameter and calculate a date relative to sysdate.
While it doesn't make use of any table in its schema, it does use a TABLE type to store date-format strings:
TYPE VARCHAR_TABLE IS TABLE OF VARCHAR2(10);
formats VARCHAR_TABLE := VARCHAR_TABLE ('mm/dd/rrrr','mm-dd-rrrr','rrrr/mm/dd','rrrr-mm-dd');
When I use the function in the SELECT clause, the query returns in < 1 second:
SELECT DYNAMIC_DATE('MB-1') START_DATE, DYNAMIC_DATE('ME-1') END_DATE
FROM DUAL
If I use it against our date dimension table (91311 total records), the query completes in < 1 second:
SELECT count(1)
from date_dimension
where calendar_dt between DYNAMIC_DATE('MB-1') and DYNAMIC_DATE('ME-1')
Others, however, are having problems with the function if it is used against a larger table (26,301,317 records):
/*
cost: 148,840
records: 151,885
time: ~20 minutes
*/
SELECT count(1)
FROM ORDERS ord
WHERE trunc(ord.ordering_date) between DYNAMIC_DATE('mb-1') and DYNAMIC_DATE('me-1')
However, the same query, using 'hard coded' dates, returns fairly rapidly:
/*
cost: 144,257
records: 151,885
time: 62 seconds
*/
SELECT count(1)
FROM ORDERS ord
WHERE trunc(ord.ordering_date) between to_date('01-JUL-2012','dd-mon-yyyy') AND to_date('31-JUL-2012','dd-mon-yyyy')
The vendor's vanilla installation doesn't include an index on the ORDERING_DATE field.
The explain plans for both queries are similar (both plans, one for the function version and one for the hard-coded dates, were posted as screenshots).
Is the DYNAMIC_DATE function being called repeatedly in the WHERE clause?
What else might explain the disparity?
** edit **
A NONUNIQUE index was added to the ORDERS table. Both queries now execute in < 1 second. Both plans take the same approach, but the one with the function has a lower cost.
I removed the DETERMINISTIC keyword from the function; the query executed in < 1 second.
Is the issue really with the function or was it related to the table?
3 years from now, when this table is even larger, and if I don't include the DETERMINISTIC keyword, will query performance suffer?
Will the DETERMINISTIC keyword have any effect on the function's results? If I run DYNAMIC_DATE('T-1') tomorrow, will I get the same result as if I ran it today (08/09/2012)? If so, this approach won't work.
If the steps of the plan are identical, then the total amount of work being done should be identical. If you trace the session (something simple like set autotrace on in SQL*Plus or something more sophisticated like an event 10046 trace), or if you look at DBA_HIST_SQLSTAT assuming you have licensed access to the AWR tables, are you seeing (roughly) the same amount of logical I/O and CPU consumption for the two queries? Is it possible that the difference in runtime you are seeing is the result of the data being cached when you run the second query?
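For example, a minimal way to compare the work done in SQL*Plus (the DYNAMIC_DATE call is the function from the question):

```sql
-- Suppress result rows, show only execution statistics:
SET AUTOTRACE TRACEONLY STATISTICS

SELECT COUNT(1)
FROM orders ord
WHERE TRUNC(ord.ordering_date)
      BETWEEN DYNAMIC_DATE('mb-1') AND DYNAMIC_DATE('me-1');

-- Run the hard-coded-dates version next and compare the
-- "consistent gets" and "physical reads" figures between the runs.
```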
I am guessing that the problem isn't with your function. Try creating a function-based index on trunc(ord.ordering_date) and compare the explain plans:
CREATE INDEX ord_date_index ON orders (TRUNC(ordering_date));

oracle : how to ensure that a function in the where clause will be called only after all the remaining where clauses have filtered the result?

I am writing a query to this effect:
select *
from players
where player_name like '%K%'
and player_rank<10
and check_if_player_is_eligible(player_name) > 1;
Now, the function check_if_player_is_eligible() is heavy and, therefore, I want the query to filter the search results sufficiently and then only run this function on the filtered results.
How can I ensure that all of the filtering happens before the function is executed, so that it runs the minimum number of times?
Here are two methods to trick Oracle into not evaluating your function before all the other WHERE clauses have been evaluated:
Using rownum
Using the pseudo-column ROWNUM in a subquery forces Oracle to "materialize" the subquery. See this askTom thread for examples.
SELECT *
FROM (SELECT *
FROM players
WHERE player_name LIKE '%K%'
AND player_rank < 10
AND ROWNUM >= 1)
WHERE check_if_player_is_eligible(player_name) > 1
Here's the documentation reference "Unnesting of Nested Subqueries":
The optimizer can unnest most subqueries, with some exceptions. Those exceptions include hierarchical subqueries and subqueries that contain a ROWNUM pseudocolumn, one of the set operators, a nested aggregate function, or a correlated reference to a query block that is not the immediate outer query block of the subquery.
Using CASE
Using CASE you can force Oracle to evaluate your function only when the other conditions evaluate to TRUE. Unfortunately, it involves duplicating code if you also want the other clauses to use indexes, as in:
SELECT *
FROM players
WHERE player_name LIKE '%K%'
AND player_rank < 10
AND CASE
WHEN player_name LIKE '%K%'
AND player_rank < 10
THEN check_if_player_is_eligible(player_name)
END > 1
There is also the NO_PUSH_PRED hint, which does the job without involving ROWNUM evaluation (a good trick nonetheless) in the process:
SELECT /*+NO_PUSH_PRED(v)*/*
FROM (
SELECT *
FROM players
WHERE player_name LIKE '%K%'
AND player_rank < 10
) v
WHERE check_if_player_is_eligible(player_name) > 1
You usually want to avoid forcing a specific order of execution. If the data or the query changes, your hints and tricks may backfire. It's usually better to provide useful metadata to Oracle so it can make the correct decisions for you.
In this case, you can provide better optimizer statistics about the function with ASSOCIATE STATISTICS.
For example, if your function is very slow because it has to read 50 blocks each time it is called:
associate statistics with functions
check_if_player_is_eligible default cost(1000 /*cpu*/, 50 /*IO*/, 0 /*network*/);
By default Oracle assumes that a function will select a row 1/20th of the time. Oracle wants to eliminate as many rows as early
as possible, so changing the selectivity should make the function less likely to be executed first:
associate statistics with functions
check_if_player_is_eligible default selectivity 90;
But this raises some other issues. You have to pick a selectivity for ALL possible conditions, and 90% certainly won't always be accurate. The IO cost is the number of blocks fetched, but the CPU cost is "machine instructions used"; what exactly does that mean?
There are more advanced ways to customize statistics, for example using the Oracle Data Cartridge Extensible Optimizer. But Data Cartridge is probably one of the most difficult Oracle features.
You didn't specify whether players.player_name is unique or not. If it is, the database has to call the function at least once per result record.
But if players.player_name is not unique, you would want to minimize the calls down to COUNT(DISTINCT player_name) times. As (Ask)Tom shows in Oracle Magazine, the scalar subquery cache is an efficient way to do this.
You would have to wrap your function call in a scalar subquery in order to make use of the scalar subquery cache:
SELECT *
FROM players
WHERE player_name LIKE '%K%'
AND player_rank < 10
AND (SELECT check_if_player_is_eligible(player_name) FROM dual) > 1
Put the original query in a derived table, then apply the additional predicate in the outer query's WHERE clause.
select *
from (
select *
from players
where player_name like '%K%'
and player_rank<10
) derived_tab1
WHERE check_if_player_is_eligible(player_name) > 1;
