Is there a way to tag a Snowflake view as "safe" for result reuse? - user-defined-functions

Reading the Snowflake documentation for Using Persisted Query Results, one of the conditions that must be met for result reuse is the following:
The query does not include user-defined functions (UDFs) or external functions.
After some experiments with this trivial UDF and view:
create function trivial_function(x number)
returns number
as
$$
x
$$
;
create view redo as select trivial_function($1) as col1, $2 as col2
from values (1,2), (3,4);
I've verified that this condition also applies to queries over views that use UDFs in their definitions.
The thing is, I have a complex view that would benefit greatly from result reuse, but it employs a lot of UDFs for the sake of clarity. Some UDFs could be inlined, but that would make the view much more difficult to read. And some UDFs (those written in JavaScript) are impossible to inline.
And yet, I know that all the UDFs are "pure" in the functional programming sense: for the same inputs, they always return the same outputs. They don't check the current timestamp, generate random values, or reference some other table that might change between invocations.
Is there some way to "convince" the query planner that a view is safe for result reuse, despite the presence of UDFs?

There is a parameter called IMMUTABLE:
CREATE FUNCTION
VOLATILE | IMMUTABLE
Specifies the behavior of the UDF when returning results:
VOLATILE: UDF might return different values for different rows, even for the same input (e.g. due to non-determinism and statefulness).
IMMUTABLE: UDF assumes that the function, when called with the same inputs, will always return the same result. This guarantee is not checked. Specifying IMMUTABLE for a UDF that returns different values for the same input will result in undefined behavior.
Default: VOLATILE
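So, if the UDFs really are pure, recreating them with the IMMUTABLE keyword is the way to declare that to Snowflake. A minimal sketch using the trivial UDF from the question (whether this alone is enough to make the view eligible for persisted query results is worth verifying in your account):
create or replace function trivial_function(x number)
returns number
immutable
as
$$
x
$$
;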

Related

User-defined function not returning correct results in Azure Cosmos DB

We are trying to use a simple user-defined function (UDF) in the where clause of a query in Azure Cosmos DB, but it's not working correctly. The end goal is to query all results where the timestamp _ts is greater than or equal to yesterday, but our first step is to get a UDF working.
The abbreviated data looks like this:
[
  {
    "_ts": 1500000007
  },
  {
    "_ts": 1500000005
  }
]
Using the Azure Portal's Cosmos DB Data Explorer, a simple query like the following will correctly return one result:
SELECT * FROM c WHERE c._ts >= 1500000006
A simple query using our UDF incorrectly returns zero results. This query is below:
SELECT * FROM c WHERE c._ts >= udf.getHardcodedTime()
The definition of the function is below:
function getHardcodedTime(){
return 1500000006;
}
As you can see, the only difference between the two queries is that one query uses a hard-coded value while the other query uses a UDF to get a hard-coded value. The problem is that the query using the UDF returns zero results instead of returning one result.
Are we using the UDF correctly?
Update 1
When the UDF is updated to return the number 1, we get a different count of results each time.
New function:
function getHardcodedTime(){
return 1;
}
New query: SELECT count(1) FROM c WHERE c._ts >= udf.getHardcodedTime()
Results vary with 7240, 7236, 7233, 7264, etc. (This set is the actual order of responses from Cosmos DB.)
By your description, the most likely cause is that the UDF version is slow and returns a partial result with a continuation token instead of the final result.
The concept of continuation is explained here as:
If a query's results cannot fit within a single page of results, then the REST API returns a continuation token through the x-ms-continuation-token response header. Clients can paginate results by including the header in subsequent requests.
The observed count variation could be caused by the query stopping for next page at slightly different times. Check the x-ms-continuation header to know if that's the case.
Why is the UDF slow?
A UDF returning a constant is not the same as a constant in the query. Correct me if you know better, but from the Cosmos DB side, it does not know that this specific UDF is actually deterministic; it assumes it could evaluate to any value and hence has to be executed for each document to check for a match. That means it cannot use an index and has to do a slow full scan.
What can you do?
Option 1: Follow continuations
If you don't care about the performance, you could keep using the UDF and just follow the continuations until all rows are processed.
Some DocumentDB clients can do this for you (e.g. the .NET API), so this may be the fastest fix if you are in a hurry. But beware: this does not scale (it keeps getting slower and costing more and more RUs), and you should not consider it a long-term solution.
Option 2: Drop UDF, use parameters
You could pass the hardcodedTime as a parameter instead. This way the query execution engine would know the value, could use a matching index, and would give you correct results without the hassle of continuations.
I don't know which API you use, but related reading just in case: Parameterized SQL queries.
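For example, a parameterized version of the query might look like this (a sketch; @cutoff is an illustrative parameter name, and the value is supplied through your SDK's parameter collection or the REST request body rather than concatenated into the SQL):
SELECT * FROM c WHERE c._ts >= @cutoff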
Option 3: Wrap in stored proc
If you really-really must control the hardcodedTime in a UDF, then you could implement a server-side stored procedure which would:
query the hardcodedTime from the UDF
query the documents, passing the hardcodedTime as a parameter
return the results as the SP output.
It would use the UDF AND the index, but it brings a hefty overhead in the amount of code required. Do the math on whether keeping the UDF is worth the extra effort in development and maintenance.
Related documentation about SPs: Azure Cosmos DB server-side programming: Stored procedures, database triggers, and UDFs

JSONata query ... feedback requested

I am investigating JSONata as part of my quest to find a declarative syntax for describing transformations from hierarchical JSON data to traditional, normalized, relational tables.
This example uses the try.jsonata.org Invoice data and shows transformations to three RDBMS tables ... AccountTable, AccountOrderTable, AccountOrderProductTable.
Observe that I am iterating through the $.Account.Order array twice ... once to generate records for the AccountOrderTable and again to generate records for the AccountOrderProductTable. The closure captures $order so that the product entries have access to the parent $order.OrderID.
While I'm OK with iterating over the $.Account.Order array twice, I am wondering if there is a clean way to construct the JSONata query so that the $map operation only gets applied once to each array.
Q: How could I construct this query (while generating the same results) so that I only $map the $.Account.Order array one time?
I think this produces the same result, but uses a temporary binding to eliminate the double loop round the Order array. Of course, you could argue that being a declarative language, the implementation should be responsible for optimising the evaluation. But I suspect the implementor might not get round to this for a while.

pl/sql query optimization with function call in where clause

I am trying to optimize a query where I am using a function() call in the where clause.
The function() simply changes the timezone of the date.
When I call the function as part of the SELECT, it executes extremely fast (< 0.09 sec against a table of many hundreds of thousands of rows):
select
id,
fn_change_timezone (date_time, 'UTC', 'US/Central') AS tz_date_time,
value
from a_table_view
where id = 'keyvalue'
and date_time = to_date('01-10-2014','mm-dd-yyyy')
However, this version runs "forever" [meaning I stop it after umpteen minutes]
select id, date_time, value
from a_table_view
where id = 'keyvalue'
and fn_change_timezone (date_time, 'UTC', 'US/Central') = to_date('01-10-2014','mm-dd-yyyy')
(I know I'd have to change the date being compared, its just for example)
So my question is two-fold:
If the function is so fast outside of the WHERE clause, why is it so much slower than, say, TRUNC() or other functions? (Obviously TRUNC() doesn't do a table lookup like my function, but still, the function is very fast outside the WHERE clause.)
What are alternate ways of accomplishing this outside of the WHERE clause?
I tried this as an alternative, which did not seem any better, it still ran until I stopped the query:
select
tz.date_time,
v.id,
v.value
from
(select
fn_change_timezone(to_date('01/10/2014-00:00:00', 'mm/dd/yyyy-hh24:mi:ss'), 'UTC', 'US/Central') as date_time
from dual
) tz
inner join
(
select
id,
fn_change_timezone (date_time, 'UTC', 'US/Central') AS v_date_time,
value
from a_table_view
where id = 'keyvalue'
) v ON
v.v_date_time = tz.date_time
Hopefully I am explaining the issue well.
There are at least four potential issues with using functions in the WHERE clause:
Functions may prevent indexes. A function-based index can solve this issue.
Functions may prevent partition pruning. Hard-coding values or maybe virtual column partitioning are possible solutions, although neither is likely helpful in this case.
Functions may run slowly. Even if the function itself is cheap, it is often very expensive to switch between SQL and PL/SQL. Some possible solutions are DETERMINISTIC, PARALLEL_ENABLE, function result caching, defining the logic purely in SQL, or, with 12c, defining the function in the SQL statement itself (WITH FUNCTION).
Functions may cause bad cardinality estimates. It's hard enough for the optimizer to guess the result of normal conditions; adding procedural code makes it even more difficult. With ASSOCIATE STATISTICS it is possible to provide the optimizer some information about the cost and cardinality of the function.
Without more information, such as an explain plan, it is difficult to know what the specific issue is with this query.
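That said, here is a hedged sketch of two of the options above, DETERMINISTIC plus a function-based index. The function body and the underlying table name a_table are assumptions for illustration; the real fn_change_timezone may be implemented differently:
-- Declare the function deterministic: its output depends only on its inputs,
-- which also makes it legal to use in a function-based index (assumption:
-- the real function has no session- or data-dependent behaviour).
CREATE OR REPLACE FUNCTION fn_change_timezone (
    p_date DATE,
    p_from VARCHAR2,
    p_to   VARCHAR2
) RETURN DATE DETERMINISTIC
IS
BEGIN
    RETURN CAST(FROM_TZ(CAST(p_date AS TIMESTAMP), p_from) AT TIME ZONE p_to AS DATE);
END;
/

-- Function-based index on the base table behind a_table_view (table name assumed),
-- so the converted date can be matched without calling the function for every row.
CREATE INDEX a_table_tz_idx
    ON a_table (fn_change_timezone(date_time, 'UTC', 'US/Central'));
With that in place, a predicate of the form fn_change_timezone(date_time, 'UTC', 'US/Central') = to_date(...) can be resolved with an index range scan instead of a full scan.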
Function calls in the WHERE clause are a Bad Thing. The problem is that the function may be called for every row in the table, which may be many more than the selected set. This can be a real performance killer (don't ask me how I know :-). In the first version with the function call in the SELECT list the function will only be called when a row has been chosen and is being added to the result set - in the second version the function may well be called for every row in the table. Also, depending on the version of Oracle you're using there may be significant overhead to calling a user function from SQL, but I think this penalty has been largely eliminated in versions since 10g.
Best of luck.
Share and enjoy.

Why is PostgreSQL calling my STABLE/IMMUTABLE function multiple times?

I'm trying to optimise a complex query in PostgreSQL 9.1.2, which calls some functions. These functions are marked STABLE or IMMUTABLE and are called several times with the same arguments in the query. I assumed PostgreSQL would be smart enough to only call them once for each set of inputs - after all, that's the point of STABLE and IMMUTABLE, isn't it? But it appears that the functions are being called multiple times. I wrote a simple function to test this, which confirms it:
CREATE OR REPLACE FUNCTION test_multi_calls1(one integer)
RETURNS integer
AS $BODY$
BEGIN
RAISE NOTICE 'Called with %', one;
RETURN one;
END;
$BODY$ LANGUAGE plpgsql IMMUTABLE;
WITH data AS
(
SELECT 10 AS num
UNION ALL SELECT 10
UNION ALL SELECT 20
)
SELECT test_multi_calls1(num)
FROM data;
Output:
NOTICE: Called with 10
NOTICE: Called with 10
NOTICE: Called with 20
Why is this happening and how can I get it to only execute the function once?
The following extension of your test code is informative:
CREATE OR REPLACE FUNCTION test_multi_calls1(one integer)
RETURNS integer
AS $BODY$
BEGIN
RAISE NOTICE 'Immutable called with %', one;
RETURN one;
END;
$BODY$ LANGUAGE plpgsql IMMUTABLE;
CREATE OR REPLACE FUNCTION test_multi_calls2(one integer)
RETURNS integer
AS $BODY$
BEGIN
RAISE NOTICE 'Volatile called with %', one;
RETURN one;
END;
$BODY$ LANGUAGE plpgsql VOLATILE;
WITH data AS
(
SELECT 10 AS num
UNION ALL SELECT 10
UNION ALL SELECT 20
)
SELECT test_multi_calls1(num)
FROM data
where test_multi_calls2(40) = 40
and test_multi_calls1(30) = 30
OUTPUT:
NOTICE: Immutable called with 30
NOTICE: Volatile called with 40
NOTICE: Immutable called with 10
NOTICE: Volatile called with 40
NOTICE: Immutable called with 10
NOTICE: Volatile called with 40
NOTICE: Immutable called with 20
Here we can see that while in the select-list the immutable function was called multiple times, in the where clause it was called once, while the volatile was called thrice.
The important thing isn't that PostgreSQL will only call a STABLE or IMMUTABLE function once with the same data - your example clearly shows that this is not the case - it's that it may call it only once. Or perhaps it will call it twice when it would have to call a volatile version 50 times, and so on.
There are different ways in which stability and immutability can be taken advantage of, with different costs and benefits. To provide the sort of saving you are suggesting it should make with select-lists, it would have to cache the results, and then look up each argument (or list of arguments) in this cache before either returning the cached result or calling the function on a cache miss. This would be more expensive than calling your function, even in the case where there was a high percentage of cache hits (there could be 0% cache hits, meaning this "optimisation" did extra work for absolutely no gain). It could store maybe just the last parameter and result, but again that could be completely useless.
This is especially so considering that stable and immutable functions are often the lightest functions.
With the where clause however, the immutability of test_multi_calls1 allows PostgreSQL to actually restructure the query from the plain meaning of the SQL given:
For every row calculate test_multi_calls1(30) and if the result is
equal to 30 continue processing the row in question
To a different query plan entirely:
Calculate test_multi_calls1(30) and if it is equal to 30 then
continue with the query otherwise return a zero row result-set without
any further calculation
This is the sort of use that PostgreSQL makes of STABLE and IMMUTABLE - not the caching of results, but the rewriting of queries into different queries which are more efficient but give the same results.
Note also that test_multi_calls1(30) is called before test_multi_calls2(40) no matter what order they appear in the where clause. This means that if the first call results in no rows being returned (replace = 30 with = 31 to test) then the volatile function won't be called at all - again regardless to which is on which side of the and.
This particular sort of rewriting depends upon immutability or stability. With where test_multi_calls1(30) != num, query re-writing will happen for immutable but not for merely stable functions. With where test_multi_calls1(num) != 30, it won't happen at all (there will be multiple calls), though there are other optimisations possible:
Expressions containing only STABLE and IMMUTABLE functions can be used with index scans. Expressions containing VOLATILE functions cannot. The number of calls may or may not decrease, but much more importantly the results of the calls will then be used in a much more efficient way in the rest of the query (only really matters on large tables, but then it can make a massive difference).
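A small sketch of that last point (table and function names are illustrative, not from the question):
CREATE TABLE events (created_at timestamptz, payload text);
CREATE INDEX events_created_at_idx ON events (created_at);

-- STABLE: wraps now(), which is fixed for the duration of a statement
CREATE FUNCTION recent_cutoff() RETURNS timestamptz AS
$$ SELECT now() - interval '1 day' $$ LANGUAGE sql STABLE;

-- The planner may evaluate recent_cutoff() once and use the index on created_at;
-- the same predicate with a VOLATILE function would force a sequential scan.
SELECT * FROM events WHERE created_at >= recent_cutoff();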
In all, don't think of volatility categories in terms of memoisation, but rather in terms of giving PostgreSQL's query planner opportunities to restructure entire queries in ways that are logically equivalent (same results) but much more efficient.
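If you literally need the function to run only once per distinct argument, PostgreSQL won't do that memoisation for you, but you can do it yourself. A sketch based on the test query above (it relies on CTEs being materialised, which is always the case in 9.1):
WITH data AS
(
    SELECT 10 AS num
    UNION ALL SELECT 10
    UNION ALL SELECT 20
),
args AS
(
    -- one row per distinct argument
    SELECT DISTINCT num FROM data
),
results AS
(
    -- the function is now called once per distinct value: once for 10, once for 20
    SELECT num, test_multi_calls1(num) AS result
    FROM args
)
SELECT r.result
FROM data d
JOIN results r USING (num);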
According to the documentation, IMMUTABLE functions will return the same value given the same arguments. Since you are feeding it dynamic arguments (and not even the same ones), the optimizer has no reason to believe that it will get the same results and hence calls the function. A better question is: why is your query invoking the function multiple times if it doesn't need to?

How to inline a variable in PL/SQL?

The Situation
I have some trouble with my query execution plan for a medium-sized query over a large amount of data in Oracle 11.2.0.2.0. In order to speed things up, I introduced a range filter that does roughly something like this:
PROCEDURE DO_STUFF(
org_from VARCHAR2 := NULL,
org_to VARCHAR2 := NULL)
-- [...]
JOIN organisations org
ON (cust.org_id = org.id
AND ((org_from IS NULL) OR (org_from <= org.no))
AND ((org_to IS NULL) OR (org_to >= org.no)))
-- [...]
As you can see, I want to restrict the JOIN of organisations using an optional range of organisation numbers. Client code can call DO_STUFF with (supposed to be fast) or without (very slow) the restriction.
The Trouble
The trouble is, PL/SQL will create bind variables for the above org_from and org_to parameters, which is what I would expect in most cases:
-- [...]
JOIN organisations org
ON (cust.org_id = org.id
AND ((:B1 IS NULL) OR (:B1 <= org.no))
AND ((:B2 IS NULL) OR (:B2 >= org.no)))
-- [...]
The Workaround
Only in this case, I measured the query execution plan to be a lot better when I just inline the values, i.e. when the query executed by Oracle is actually something like
-- [...]
JOIN organisations org
ON (cust.org_id = org.id
AND ((10 IS NULL) OR (10 <= org.no))
AND ((20 IS NULL) OR (20 >= org.no)))
-- [...]
By "a lot", I mean 5-10x faster. Note that the query is executed very rarely, i.e. once a month. So I don't need to cache the execution plan.
My questions
How can I inline values in PL/SQL? I know about EXECUTE IMMEDIATE, but I would prefer to have PL/SQL compile my query, and not do string concatenation.
Did I just measure something that happened by coincidence, or can I assume that inlining variables is indeed better (in this case)? The reason I ask is that I think bind variables force Oracle to devise a general execution plan, whereas inlined values would allow for analysing very specific column and index statistics. So I can imagine that this is not just a coincidence.
Am I missing something? Maybe there is an entirely other way to achieve query execution plan improvement, other than variable inlining (note I have tried quite a few hints as well but I'm not an expert on that field)?
In one of your comments you said:
"Also I checked various bind values.
With bind variables I get some FULL
TABLE SCANS, whereas with hard-coded
values, the plan looks a lot better."
There are two paths. If you pass in NULL for the parameters then you are selecting all records. Under those circumstances a Full Table Scan is the most efficient way of retrieving data. If you pass in values then indexed reads may be more efficient, because you're only selecting a small subset of the information.
When you formulate the query using bind variables the optimizer has to take a decision: should it presume that most of the time you'll pass in values or that you'll pass in nulls? Difficult. So look at it another way: is it more inefficient to do a full table scan when you only need to select a sub-set of records, or to do indexed reads when you need to select all records?
It seems as though the optimizer has plumped for full table scans as being the least inefficient operation to cover all eventualities.
Whereas when you hard-code the values, the Optimizer knows immediately that 10 IS NULL evaluates to FALSE, and so it can weigh the merits of using indexed reads to find the desired sub-set of records.
So, what to do? As you say this query is only run once a month, I think it would only require a small change to business processes to have separate queries: one for all organisations and one for a sub-set of organisations.
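Roughly, the procedure could branch into two statically compiled statements. A skeleton of that idea (the select list and the customers table name are placeholders; only the branching matters):
PROCEDURE DO_STUFF(
    org_from VARCHAR2 := NULL,
    org_to   VARCHAR2 := NULL)
IS
    result_cur SYS_REFCURSOR;
BEGIN
    IF org_from IS NULL AND org_to IS NULL THEN
        -- "all organisations" variant: a full table scan is a reasonable plan here
        OPEN result_cur FOR
            SELECT cust.id, org.no
            FROM   customers cust
            JOIN   organisations org ON cust.org_id = org.id;
    ELSE
        -- restricted variant: a plain range predicate that can use an index on org.no
        OPEN result_cur FOR
            SELECT cust.id, org.no
            FROM   customers cust
            JOIN   organisations org ON cust.org_id = org.id
            WHERE  org.no BETWEEN org_from AND org_to;
    END IF;
    -- [...] consume or return result_cur
END DO_STUFF;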
"Btw, removing the :R1 IS NULL clause
doesn't change the execution plan
much, which leaves me with the other
side of the OR condition, :R1 <=
org.no where NULL wouldn't make sense
anyway, as org.no is NOT NULL"
Okay, so the thing is you have a pair of bind variables which specify a range. Depending on the distribution of values, different ranges might suit different execution plans. That is, this range would (probably) suit an indexed range scan...
WHERE org.id BETWEEN 10 AND 11
...whereas this is likely to be more fitted to a full table scan...
WHERE org.id BETWEEN 10 AND 1199999
That is where Bind Variable Peeking comes into play (depending on the distribution of values, of course).
Since the query plans are actually consistently different, that implies that the optimizer's cardinality estimates are off for some reason. Can you confirm from the query plans that the optimizer expects the conditions to be insufficiently selective when bind variables are used? Since you're using 11.2, Oracle should be using adaptive cursor sharing, so it shouldn't be a bind variable peeking issue (assuming you are calling the version with bind variables many times with different NO values in your testing).
Are the cardinality estimates on the good plan actually correct? I know you said that the statistics on the NO column are accurate but I would be suspicious of a stray histogram that may not be updated by your regular statistics gathering process, for example.
You could always use a hint in the query to force a particular index to be used (though using a stored outline or optimizer plan stability would be preferable from a long-term maintenance perspective). Any of those options would be preferable to resorting to dynamic SQL.
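For example, a hedged sketch of the hint approach (org_no_idx is a hypothetical index on organisations.no):
SELECT /*+ INDEX(org org_no_idx) */ <<something>>
FROM   <<some other table>> cust
JOIN   organisations org
  ON  (cust.org_id = org.id
       AND ((org_from IS NULL) OR (org_from <= org.no))
       AND ((org_to   IS NULL) OR (org_to   >= org.no)))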
One additional test to try, however, would be to replace the SQL 99 join syntax with Oracle's old syntax, i.e.
SELECT <<something>>
FROM <<some other table>> cust,
organisations org
WHERE cust.org_id = org.id
AND ( ((org_from IS NULL) OR (org_from <= org.no))
AND ((org_to IS NULL) OR (org_to >= org.no)))
That obviously shouldn't change anything, but there have been parser issues with the SQL 99 syntax so that's something to check.
It smells like Bind Peeking, but I am only on Oracle 10, so I can't claim the same issue exists in 11.
This looks a lot like a need for Adaptive Cursor Sharing, combined with SQL plan stability.
I think what is happening is that the optimizer_capture_sql_plan_baselines parameter is true, and the same for optimizer_use_sql_plan_baselines. If this is true, the following is happening:
The first time a query is parsed, it gets a new plan.
The second time, this plan is stored in the sql_plan_baselines as an accepted plan.
All following runs of this query use this plan, regardless of what the bind variables are.
If Adaptive Cursor Sharing is already active, the optimizer will generate a new/better plan and store it in the sql_plan_baselines, but it is not able to use it until someone accepts this newer plan as an acceptable alternative plan. Check dba_sql_plan_baselines and see if your query has entries with accepted = 'NO' and verified = null.
You can use dbms_spm.evolve to evolve the new plan and have it automatically accepted if the performance of the plan is at least 1.5 times better than without the new plan.
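A sketch of what that check and evolution might look like (the sql_handle value is a placeholder to be copied from the first query's output):
-- Find plans that were captured but not yet accepted
SELECT sql_handle, plan_name, enabled, accepted
FROM   dba_sql_plan_baselines
WHERE  accepted = 'NO';

-- Verify the new plan and accept it if it performs sufficiently better
SELECT dbms_spm.evolve_sql_plan_baseline(sql_handle => '<sql_handle from above>')
FROM   dual;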
I hope this helps.
I added this as a comment, but will offer up here as well. Hope this isn't overly simplistic, and looking at the detailed responses I may be misunderstanding the exact problem, but anyway...
It seems your organisations table has a column no (org.no) that is defined as a NUMBER. In your hardcoded example, you use numbers to do the compares.
JOIN organisations org
ON (cust.org_id = org.id
AND ((10 IS NULL) OR (10 <= org.no))
AND ((20 IS NULL) OR (20 >= org.no)))
In your procedure, you are passing in varchar2:
PROCEDURE DO_STUFF(
org_from VARCHAR2 := NULL,
org_to VARCHAR2 := NULL)
So to compare a VARCHAR2 to a NUMBER, Oracle will have to do implicit conversions, and this may cause the full scans.
Solution: change the proc to pass in numbers, for example:
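A sketch of both variants (pick one):
PROCEDURE DO_STUFF(
    org_from NUMBER := NULL,
    org_to   NUMBER := NULL)
-- [...]
or, if the signature cannot change, make the conversion explicit:
-- [...]
AND ((org_from IS NULL) OR (TO_NUMBER(org_from) <= org.no))
AND ((org_to   IS NULL) OR (TO_NUMBER(org_to)   >= org.no))
-- [...]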
