Imagine I have a table with several fields, one of which is an array of dates, as below:
col1  col2  alldate                        Max_date
1     2     ["2021-02-12","2021-02-13"]    "2021-02-13"
2     3     ["2021-01-12","2021-02-13"]    "2021-02-13"
4     4     ["2021-01-12"]                 "2021-01-12"
5     3     ["2021-01-11","2021-02-12"]    "2021-02-12"
6     7     ["2021-02-13"]                 "2021-02-13"
I need to write a query that selects only the rows whose array contains the overall maximum date (there is also a Max_date column holding each row's own maximum).
The select statement should return:
col1  col2  alldate                        Max_date
1     2     ["2021-02-12","2021-02-13"]    "2021-02-13"
2     3     ["2021-01-12","2021-02-13"]    "2021-02-13"
6     7     ["2021-02-13"]                 "2021-02-13"
The table is huge, so an optimized query is needed.
So far I have been thinking of:
select col1, col2, max_date
from t1
where array_contains((select max(max_date) from t1)::variant, alldate);
But to me, running another select statement inside the query seems like a bad idea.
Any suggestions?
If you want pure speed, using LATERAL FLATTEN is about 10% faster than the ARRAY_CONTAINS approach over 500,000,000 records on an XS warehouse. You can copy-paste the code below straight into Snowflake to test for yourself.
Why is the LATERAL FLATTEN approach faster?
If you look at the query plans, the optimizer filters at the first step (immediately culling records), whereas the ARRAY_CONTAINS version waits until the fourth step before doing the same. The filter is the QUALIFY on max(max_date) ...
Create Random Dataset:
create or replace table stack_overflow_68132958 as
SELECT
seq4() col_1,
UNIFORM (1, 500, random()) col_2,
DATEADD(day, UNIFORM (-40, -0, random()), current_date()) random_date_1,
DATEADD(day, UNIFORM (-40, -0, random()), current_date()) random_date_2,
DATEADD(day, UNIFORM (-40, -0, random()), current_date()) random_date_3,
ARRAY_CONSTRUCT(random_date_1, random_date_2, random_date_3) date_array,
greatest(random_date_1, random_date_2, random_date_3) max_date,
to_array(greatest(random_date_1, random_date_2, random_date_3)) max_date_array
FROM
TABLE (GENERATOR (ROWCOUNT => 500000000)) ;
Test Felipe/Mike's approach -> 51 secs
select
distinct
col_1
,col_2
from
stack_overflow_68132958
qualify
array_contains(max(max_date) over () :: variant, date_array);
Test Adrian's approach -> 47 secs
select
distinct
col_1
, col_2
from
stack_overflow_68132958
, lateral flatten(input => date_array) g
qualify
max(max_date) over () = g.value;
I would likely use a CTE for this, like:
WITH x AS (
SELECT max(max_date) as max_max_date
FROM t1
)
select col1, col2, max_date
from t1
cross join x
where array_contains(x.max_max_date::variant,alldate);
I have not tested the syntax exactly, and the data types might vary things a bit, but the concept here is that the CTE will be VERY fast and return a single record with a single value. A MAX() function leverages metadata in Snowflake, so it won't even use a warehouse to get it.
That said, the Snowflake profiler is pretty smart, so your query might actually create the exact same query profile as this statement. Test them both and see what the Profile looks like to see if it truly makes a difference.
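One cheap way to compare them is Snowflake's EXPLAIN, which prints the plan without executing the query. A sketch using the CTE version above:

EXPLAIN
WITH x AS (
    SELECT max(max_date) as max_max_date
    FROM t1
)
select col1, col2, max_date
from t1
cross join x
where array_contains(x.max_max_date::variant, alldate);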
To build on Mike's answer, we can do everything in the QUALIFY, without the need for a CTE:
with t1 as (
select 'a' col1, 'b' col2, '2020-01-01'::date maxdate, array_construct('2020-01-01'::date, '2018-01-01', '2017-01-01') alldate
)
select col1, col2, alldate, maxdate
from t1
qualify array_contains((max(maxdate) over())::variant, alldate)
;
Note that you should be careful with types. Both of these are true:
select array_contains('2020-01-01'::date::string::variant, array_construct('2020-01-01', '2019-01-01'));
select array_contains('2020-01-01'::date::variant, array_construct('2020-01-01'::date, '2019-01-01'));
But this is false:
select array_contains('2020-01-01'::date::variant, array_construct('2020-01-01', '2019-01-01'));
You have some great answers, which I only saw after I wrote mine up.
If your data types match, you should be good to go. Copy-paste this directly into Snowflake and it should work.
create or replace schema abc;
use schema abc;
create or replace table myarraytable(col1 number, col2 number, alldates variant, max_date timestamp_ltz);
insert into myarraytable
select 1,2,array_construct('2021-02-12'::timestamp_ltz,'2021-02-13'::timestamp_ltz), '2021-02-13'
union
select 2,3,array_construct('2021-01-12'::timestamp_ltz,'2021-02-13'::timestamp_ltz),'2021-02-13'
union
select 4,4,array_construct('2021-01-12'::timestamp_ltz) , '2021-01-12'
union
select 5,3,array_construct('2021-01-11'::timestamp_ltz,'2021-02-12'::timestamp_ltz) , '2021-02-12'
union
select 6,7,array_construct('2021-02-13'::timestamp_ltz) , '2021-02-13';
select * from myarraytable
order by 1 ;
WITH cte_max AS (
SELECT max(max_date) as max_date
FROM myarraytable
)
select myarraytable.*
from myarraytable, cte_max
where array_contains(cte_max.max_date::variant, alldates)
order by 1 ;
A query like this:
SELECT SUM(col1 * col3) AS total, col2
FROM table1
GROUP BY col2
works as expected when run individually.
For reference:
table1.col1 -- float
table1.col2 -- varchar2
table1.col3 -- float
When this query is moved to a subquery, I get an ORA-01722 error, with reference to the "col2" position in the select clause. The larger query looks like this:
SELECT col3, subquery1.total
FROM table3
LEFT JOIN (
SELECT SUM(table1.col1 * table1.col3) AS total, table1.col2
FROM table1
GROUP BY table1.col2
) subquery1 ON table3.col3 = subquery1.col2
For reference:
table3.col3 -- varchar2
It may also be worth noting that I have another table, table2, with the same structure as table1. If I build the subquery from table2, it works. It never works when using table1.
There is no concatenation, the data types match, and the query works by itself... I'm at a loss here. What else should I be looking for? What painfully obvious problem is staring me in the face?
(I didn't choose or make the table structures and can't change them, so answers to that end will unfortunately not be helpful.)
Try using a proper cast of float to char:
SELECT col3, subquery1.total
FROM table3
LEFT JOIN (
SELECT SUM(table1.col1 * table1.col3) AS total, table1.col2
FROM table1
GROUP BY table1.col2
) subquery1 ON to_char(table3.col3) = subquery1.col2
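For context, ORA-01722 usually means Oracle applied an implicit TO_NUMBER to a string that is not numeric: when a string is compared to a NUMBER, the string side is converted, and a single non-numeric value in the column is enough to raise the error. A minimal illustration:

SELECT * FROM dual WHERE '123' = 123;  -- works: '123' converts cleanly
SELECT * FROM dual WHERE 'abc' = 123;  -- raises ORA-01722: invalid number

The explicit TO_CHAR above forces the comparison into the string domain and avoids the implicit conversion entirely.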
In our environment one procedure is taking a long time to execute. I have checked the procedure, and below is a summary.
The procedure contains only select blocks (around 24). Before each select we check whether data exists: if yes, we select the data, else we do something else. For example:
-- Select block 1 --
IF EXISTS (SELECT 1 FROM table1 t1
INNER JOIN table2 t2
ON t1.col1=t2.col1
WHERE t1.col2='someValue' AND t2.col2='someValue'
)
BEGIN
SELECT t1.col1,t2.col2,t2.col3 FROM table1 t1
INNER JOIN table2 t2
ON t1.col1=t2.col1
WHERE t1.col2='someValue' AND t2.col2='someValue'
END
ELSE
BEGIN
SELECT 'DEFAULT1', 'DEFAULT2', 'DEFAULT3'
END
-- Select block 2 --
IF EXISTS (SELECT 1 FROM table1 t1
INNER JOIN table2 t2
ON t1.col1=t2.col1
WHERE t1.col5='someValue' AND t2.col5='someValue'
)
BEGIN
SELECT t1.col5,t2.col6,t2.col7 FROM table1 t1
INNER JOIN table2 t2
ON t1.col1=t2.col1
WHERE t1.col5='someValue' AND t2.col5='someValue'
END
ELSE
BEGIN
SELECT 'DEFAULT1', 'DEFAULT2', 'DEFAULT3'
END
I have come to the conclusion that if we could somehow combine the queries used within the IF EXISTS blocks into one query, and set some variables to identify which WHERE conditions return true, that could reduce the execution time and improve performance.
Is my thinking correct? Is there any way to do that? Can you suggest any other options?
We are using Microsoft SQL Server 2005.
[Edited: Added] - The select statements do not all return the same column types; they are different. And all the select statements are required: if there are 24 IF blocks, the procedure should return 24 result sets.
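To make the combining idea concrete, something like the sketch below is what I have in mind (untested; column names taken from the blocks above):

DECLARE @exists1 int, @exists2 int;

SELECT
    @exists1 = MAX(CASE WHEN t1.col2 = 'someValue' AND t2.col2 = 'someValue' THEN 1 ELSE 0 END),
    @exists2 = MAX(CASE WHEN t1.col5 = 'someValue' AND t2.col5 = 'someValue' THEN 1 ELSE 0 END)
FROM table1 t1
INNER JOIN table2 t2
    ON t1.col1 = t2.col1;

-- each block could then test IF @exists1 = 1 ... instead of re-running its IF EXISTS
-- (note: the variables stay NULL if the join returns no rows at all)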
[Added]
I would like to ask one more thing: which one of the below runs faster?
SELECT 1 FROM table1 t1
INNER JOIN table2 t2
ON t1.col1=t2.col1
WHERE t1.col2='someValue' AND t2.col2='someValue'
SELECT COUNT(1) FROM table1 t1
INNER JOIN table2 t2
ON t1.col1=t2.col1
WHERE t1.col2='someValue' AND t2.col2='someValue'
SELECT TOP 1 1 FROM table1 t1
INNER JOIN table2 t2
ON t1.col1=t2.col1
WHERE t1.col2='someValue' AND t2.col2='someValue'
Thanks.
Kartic
To enhance the performance of the select queries, create indexes on the columns you are using in the WHERE clauses.
For example, you are using
WHERE t1.col2='someValue' AND t2.col2='someValue'
WHERE t1.col5='someValue' AND t2.col5='someValue'
so create indexes on col2 and col5.
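For example (index names are placeholders; adapt them to your schema, and consider the join column too if it is not already the clustered key):

CREATE INDEX ix_table1_col2 ON table1 (col2);
CREATE INDEX ix_table1_col5 ON table1 (col5);
CREATE INDEX ix_table2_col2 ON table2 (col2);
CREATE INDEX ix_table2_col5 ON table2 (col5);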
Temp table
You can use a temp table to store the result. Since you are running essentially the same join 24 times, first store the result of the query below in a temp table (correct the syntax as required):
insert into temp_table (t1_col2, t2_col2, t1_col5, t2_col5)
SELECT t1.col2, t2.col2, t1.col5, t2.col5 FROM table1 t1
INNER JOIN table2 t2
ON t1.col1=t2.col1
Now use the temp table for checking
-- Select block 1 --
IF EXISTS (SELECT 1 FROM temp_table
WHERE t1_col2='someValue' AND t2_col2='someValue'
)
BEGIN
SELECT t1.col1,t2.col2,t2.col3 FROM table1 t1
INNER JOIN table2 t2
ON t1.col1=t2.col1
WHERE t1.col2='someValue' AND t2.col2='someValue'
END
-- Select block 2 --
IF EXISTS (SELECT 1 FROM temp_table
WHERE t1_col5='someValue' AND t2_col5='someValue'
)
BEGIN
SELECT t1.col5,t2.col6,t2.col7 FROM table1 t1
INNER JOIN table2 t2
ON t1.col1=t2.col1
WHERE t1.col5='someValue' AND t2.col5='someValue'
END
The current structure is not very efficient: you effectively have to execute each IF statement (which will be expensive), and then repeat the same WHERE clause (the expensive bit) if the IF returns true. And you do this 24 times. Worst case (all the queries return data), you're doubling the time for the procedure.
You say you've checked for indexing - given that each query appears to be subtly different, it would be worth double checking this.
The obvious thing is to refactor the application to execute the 24 select statements, and deal with the fact that sometimes, they don't return any data. That's a fairly large refactoring, and I assume you've considered that...
If you can't do that, consider a less ambitious (though nastier) refactoring. Instead of checking whether data exists, and either returning it or an equivalent default result set, write it as a union:
SELECT t1.col1,t2.col2,t2.col3 FROM table1 t1
INNER JOIN table2 t2
ON t1.col1=t2.col1
WHERE t1.col2='someValue' AND t2.col2='someValue'
UNION
SELECT 'DEFAULT1', 'DEFAULT2', 'DEFAULT3'
This reduces the number of times you're hitting the where clause, but means your client application must filter out the "default" data.
To answer your final question, I'd run it through the query optimizer and look at the execution plan - but I'd imagine that the first version is fastest - the query can complete as soon as it finds the first record that matches the where criteria. The second version must find all records that match and count them; the final version must find all records and select the first one.
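If you would rather measure than guess, SQL Server 2005 can print per-query CPU time and I/O counts. A sketch for one of the three variants (repeat for the others and compare the output in the Messages tab):

SET STATISTICS TIME ON;
SET STATISTICS IO ON;

SELECT TOP 1 1 FROM table1 t1
INNER JOIN table2 t2
ON t1.col1 = t2.col1
WHERE t1.col2 = 'someValue' AND t2.col2 = 'someValue';

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;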
You could outer-join the results of a query to a row of default values, then fall back to the defaults when the query's results are empty:
SELECT
col1 = COALESCE(query.col1, defaults.col1),
col2 = COALESCE(query.col2, defaults.col2),
col3 = COALESCE(query.col3, defaults.col3)
FROM
(SELECT 'DEFAULT1', 'DEFAULT2', 'DEFAULT3') AS defaults (col1, col2, col3)
LEFT JOIN
(
SELECT t1.col1, t2.col2, t2.col3
FROM table1 t1
INNER JOIN table2 t2
ON t1.col1=t2.col1
WHERE t1.col2='someValue' AND t2.col2='someValue'
) query
ON 1=1 -- i.e. join all the rows unconditionally
;
The method may not suit you in exactly that form if the subquery can actually return NULLs that must not be replaced with default values. In that case, have the subqueries return a flag column (any value will do). If that column evaluates to NULL in the final query, it can only mean the subquery returned no rows. You can use that fact in a CASE expression like this:
SELECT
col1 = CASE WHEN query.HasRows IS NULL THEN defaults.col1 ELSE query.col1 END,
col2 = CASE WHEN query.HasRows IS NULL THEN defaults.col2 ELSE query.col2 END,
col3 = CASE WHEN query.HasRows IS NULL THEN defaults.col3 ELSE query.col3 END
FROM
(SELECT 'DEFAULT1', 'DEFAULT2', 'DEFAULT3') AS defaults (col1, col2, col3)
LEFT JOIN
(
SELECT HasRows = 1, t1.col1, t2.col2, t2.col3
FROM table1 t1
INNER JOIN table2 t2
ON t1.col1=t2.col1
WHERE t1.col2='someValue' AND t2.col2='someValue'
) query
ON 1=1
;
Hey I have the following tables and SQL:
T1: ID, col2,col3 - PK(ID) - 23mil rows
T2: ID, col2,col3 - PK(ID) - 23mil rows
T3: ID, name, value - PK(ID,name) - 66mil rows
1) The SQL below returns the 10k-row result set very fast, no problems.
select top 10000 T1.col2, T2.col2, T3.name, T3.value
from T1, T2, T3
where T1.ID = T2.ID and T1.ID *= T3.ID and T3.name in ('ABC','XYZ')
and T2.col1 = 'SOMEVALUE'
2) The SQL below took FOREVER.
select top 10000 T1.col2, T2.col2,
ABC = min(case when T3.name='ABC' then T3.value end),
XYZ = min(case when T3.name='XYZ' then T3.value end)
from T1, T2, T3
where T1.ID = T2.ID and T1.ID *= T3.ID and T3.name in ('ABC','XYZ')
and T2.col1 = 'SOMEVALUE'
group by T1.col2, T2.col2
The only difference in the showplan between those two queries is the section below, for query 2). I don't understand it 100%: is it selecting the ENTIRE result set, WITHOUT the top 10000, into the worktable and then doing a group by on it? Is that why it's slow?
STEP 1
The type of query is SELECT (into Worktable1).
GROUP BY
Evaluate Grouped MINIMUM AGGREGATE.
FROM TABLE ...etc..
TO TABLE
Worktable1.
STEP 2
The type of query is SELECT.
FROM TABLE
Worktable1.
Nested iteration.
Table Scan.
Forward scan.
Positioning at start of table.
Using I/O Size 16 Kbytes for data pages.
With MRU Buffer Replacement Strategy for data pages.
My questions are:
1) Why is query 2) so slow?
2) How do I fix it while keeping the query logic the same, preferably in just one SELECT as before?
thank you
Although possibly a generic answer, I'd say to put an index on the columns you're grouping by.
Edit / Revise: Here's my theory after re-looking at the issue. The SELECT list of a query is effectively evaluated last, after the dataset specified below it has been built. In your query, the whole dataset (millions of records) is evaluated for the MIN expressions you specified, and two separate aggregate passes are implied since you have two MIN columns in the select list. Only after the dataset has been grouped and the MIN columns determined are the top 10000 rows selected.
In a nutshell, you're running two aggregate functions over millions of records. That will take a significant amount of time, especially with no indexes.
The solution for you would be to use a derived table. I haven't compiled the code below, but it's close to what you would use. It will only take the MIN values over the 10,000 records rather than the whole dataset.
I.e.
Select my_derived_table.t1col2, my_derived_table.t2col2,
ABC = min(case when my_derived_table.t3name ='ABC ' then my_derived_table.t3value end),
XYZ = min(case when my_derived_table.t3name='XYZ ' then my_derived_table.t3value end)
FROM
(Select top 10000 T1.col2 as t1col2,
T2.col2 as t2col2,
t3.name as t3name,
t3.value as t3value
from T1, T2, T3
where T1.ID = T2.ID
and T1.ID *= T3.ID
and T3.name in ('ABC','XYZ')
and T2.col1 = 'SOMEVALUE') my_derived_table
group by my_derived_table.t1col2, my_derived_table.t2col2
How do I go about replacing the following self join using analytics:
SELECT
t1.col1 col1,
t1.col2 col2,
SUM((extract(hour FROM (t1.times_stamp - t2.times_stamp)) * 3600 + extract(minute FROM ( t1.times_stamp - t2.times_stamp)) * 60 + extract(second FROM ( t1.times_stamp - t2.times_stamp)) ) ) div,
COUNT(*) tot_count
FROM tab1 t1,
tab1 t2
WHERE t2.col1 = t1.col1
AND t2.col2 = t1.col2
AND t2.col3 = t1.sequence_num
AND t2.times_stamp < t1.times_stamp
AND t2.col4 = 3
AND t1.col4 = 4
AND t2.col5 NOT IN(103,123)
AND t1.col5 != 549
GROUP BY t1.col1, t1.col2
I'm pretty sure you won't be able to replace the self-join with analytics, because you are using inter-row operations (t1.times_stamp - t2.times_stamp). Analytic functions can only access the values of the current row and the values of aggregate functions over a subset of rows (the windowing clause).
See this article from Tom Kyte and this paper for further analysis of the limitations of analytics.
It almost looks like you could eliminate the self-join on t2 and replace
t1.times_stamp - t2.times_stamp
with something like
t1.times_stamp - lag(t1.times_stamp) over (partition by col1, col2 order by times_stamp)
The different filters on t1 and t2 on col4 and col5 are what prevents you from doing this.
Analytic functions are applied after the where / group by on the main query, so you'd need to have a single filter on t1 in order to use lag/lead to specify following or preceding rows in a sequence.
Also, you'd need to push the sum/group by to an outer query to aggregate after the analytic function:
select col1, col2, sum(timestamp_diff)
from (
select col1, col2,
times_stamp - lag(times_stamp) over (partition by col1, col2 order by times_stamp) as timestamp_diff
from tab1
where ....
) group by col1, col2