Oracle: NOT IN query takes longer than IN query

I have two tables, each with about 230,000 records. When I run this query:
select count(*)
from table1
where field1 in (select field2 from table2);
it takes about 0.2 seconds.
If I run the same query with IN changed to NOT IN:
select count(*)
from table1
where field1 not in (select field2 from table2);
it never finishes.
Why?

It's the difference between a scan and a seek.
When you ask for IN, you ask for specifically these values. This means the database engine can use indexes to seek to the correct data pages.
When you ask for NOT IN, you ask for all values except these values. This means the database engine has to scan the entire table or index to find all the values.
The other factor is the amount of data: the IN query likely involves much less data, and therefore much less I/O, than the NOT IN.
Compare it to a phone book. If you want only people named Smith, you can just pick the Smith section and return that; you don't have to read any pages before or after it.
If you ask for everyone who is not named Smith, you have to read all the pages before Smith and all the pages after.
This illustrates both the seek/scan aspect and the data-volume aspect.

It's better to use NOT EXISTS, as NOT IN uses a row-by-row search, which takes too long.

In the worst case, both queries can be resolved using two full table scans plus a hash join (semi or anti). We're talking a few seconds for 230,000 rows unless something exceptional is going on in your case.
My guess is that either field1 or field2 is nullable. When you use a NOT IN construct, Oracle has to perform an expensive FILTER operation, which basically executes the inner query once for each row in the outer table. That is 230,000 full table scans...
You can verify this by looking at the execution plan. It would look something like:
SELECT
  FILTER (NOT EXISTS SELECT 0...)
    TABLE ACCESS FULL ...
    TABLE ACCESS FULL ...
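To check this on your own system, a minimal sketch using EXPLAIN PLAN and DBMS_XPLAN (the output format varies a little by version):
explain plan for
select count(*)
from table1
where field1 not in (select field2 from table2);

select * from table(dbms_xplan.display);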
If there are no NULL values in either column (field1, field2), you can give Oracle this piece of information so that another, more efficient execution strategy can be used:
select count(*)
from table1
where field1 is not null
and field1 not in (select field2 from table2 where field2 is not null)
This will generate a plan that looks something like:
SELECT
  HASH JOIN ANTI
    TABLE ACCESS FULL ...
    TABLE ACCESS FULL ...
...or you can change the construct to NOT EXISTS (will generate the same plan as above):
select count(*)
from table1
where not exists(
  select 'x'
  from table2
  where table2.field2 = table1.field1
);
Please note that changing from NOT IN to NOT EXISTS may change the result of the query: with NOT IN, a single NULL in table2.field2 makes every comparison evaluate to UNKNOWN, so no rows are returned at all, whereas NOT EXISTS simply checks for the absence of matching rows. Have a look at the following example and try the two different where-clauses to see the difference:
with table1 as(
  select 1 as field1 from dual union all
  select null as field1 from dual union all
  select 2 as field1 from dual
)
,table2 as(
  select 1 as field2 from dual union all
  select null as field2 from dual union all
  select 3 as field2 from dual
)
select *
from table1
--where field1 not in(select field2 from table2)
where not exists(select 'x' from table2 where field1 = field2)

Try:
SELECT count(*)
FROM table1 t1
LEFT JOIN table2 t2 ON t1.field1 = t2.field2
WHERE t2.primary_key IS NULL

Related

Snowflake: select max date from date array

Imagine I have a table with several fields, one of which is an array of dates, as below:
col1  col2  alldate                        Max_date
1     2     ["2021-02-12","2021-02-13"]    "2021-02-13"
2     3     ["2021-01-12","2021-02-13"]    "2021-02-13"
4     4     ["2021-01-12"]                 "2021-01-12"
5     3     ["2021-01-11","2021-02-12"]    "2021-02-12"
6     7     ["2021-02-13"]                 "2021-02-13"
I need to write a query that selects only the rows whose array contains the overall maximum date. There is also a column holding each row's max date.
The select statement should show:
col1  col2  alldate                        Max_date
1     2     ["2021-02-12","2021-02-13"]    "2021-02-13"
2     3     ["2021-01-12","2021-02-13"]    "2021-02-13"
6     7     ["2021-02-13"]                 "2021-02-13"
The table is huge, so an optimized query is needed.
So far I was thinking of:
select col1, col2, max_date
from t1
where array_contains((select max(max_date) from t1)::variant, alldate);
But running a subquery like that per row seems like a bad idea to me.
Any suggestions?
If you want pure speed, using LATERAL FLATTEN is 10% faster than the ARRAY_CONTAINS approach over 500,000,000 records on an XS warehouse. You can copy-paste the code below straight into Snowflake to test it for yourself.
Why is the lateral flatten approach faster?
If you look at the query plans, the optimizer filters at the first step (immediately culling records), whereas array_contains waits until the 4th step before doing the same. The filter is the QUALIFY on max(max_date) ...
Create Random Dataset:
create or replace table stack_overflow_68132958 as
SELECT
  seq4() col_1,
  UNIFORM (1, 500, random()) col_2,
  DATEADD(day, UNIFORM (-40, -0, random()), current_date()) random_date_1,
  DATEADD(day, UNIFORM (-40, -0, random()), current_date()) random_date_2,
  DATEADD(day, UNIFORM (-40, -0, random()), current_date()) random_date_3,
  ARRAY_CONSTRUCT(random_date_1, random_date_2, random_date_3) date_array,
  greatest(random_date_1, random_date_2, random_date_3) max_date,
  to_array(greatest(random_date_1, random_date_2, random_date_3)) max_date_array
FROM
  TABLE (GENERATOR (ROWCOUNT => 500000000));
Test the Felipe/Mike approach -> 51 secs:
select distinct
  col_1,
  col_2
from stack_overflow_68132958
qualify
  array_contains(max(max_date) over () :: variant, date_array);
Test the Adrian approach -> 47 secs:
select distinct
  col_1,
  col_2
from stack_overflow_68132958,
  lateral flatten(input => date_array) g
qualify
  max(max_date) over () = g.value;
I would likely use a CTE for this, like:
WITH x AS (
  SELECT max(max_date) as max_max_date
  FROM t1
)
select col1, col2, maxdate
from t1
cross join x
where array_contains(x.max_max_date::variant, alldate);
I have not tested the syntax exactly, and the data types might vary things a bit, but the concept here is that the CTE will be VERY fast and return a single record with a single value. A MAX() function leverages metadata in Snowflake, so it won't even use a warehouse to get it.
That said, the Snowflake profiler is pretty smart, so your query might actually create the exact same query profile as this statement. Test them both and see what the Profile looks like to see if it truly makes a difference.
To build on Mike's answer, we can do everything in the QUALIFY, without the need for a CTE:
with t1 as (
  select 'a' col1, 'b' col2, '2020-01-01'::date maxdate,
         array_construct('2020-01-01'::date, '2018-01-01', '2017-01-01') alldate
)
select col1, col2, alldate, maxdate
from t1
qualify array_contains((max(maxdate) over())::variant, alldate);
Note that you should be careful with types. Both of these are true:
select array_contains('2020-01-01'::date::string::variant, array_construct('2020-01-01', '2019-01-01'));
select array_contains('2020-01-01'::date::variant, array_construct('2020-01-01'::date, '2019-01-01'));
But this is false:
select array_contains('2020-01-01'::date::variant, array_construct('2020-01-01', '2019-01-01'));
There are some great answers here already, which I only saw after I wrote mine up.
If your data types match, you should be good to go; copy-paste this directly into Snowflake and it should work.
create or replace schema abc;
use schema abc;
create or replace table myarraytable(col1 number, col2 number, alldates variant, max_date timestamp_ltz);
insert into myarraytable
select 1, 2, array_construct('2021-02-12'::timestamp_ltz, '2021-02-13'::timestamp_ltz), '2021-02-13'
union
select 2, 3, array_construct('2021-01-12'::timestamp_ltz, '2021-02-13'::timestamp_ltz), '2021-02-13'
union
select 4, 4, array_construct('2021-01-12'::timestamp_ltz), '2021-01-12'
union
select 5, 3, array_construct('2021-01-11'::timestamp_ltz, '2021-02-12'::timestamp_ltz), '2021-02-12'
union
select 6, 7, array_construct('2021-02-13'::timestamp_ltz), '2021-02-13';

select * from myarraytable
order by 1;

WITH cte_max AS (
  SELECT max(max_date) as max_date
  FROM myarraytable
)
select myarraytable.*
from myarraytable, cte_max
where array_contains(cte_max.max_date::variant, alldates)
order by 1;

BULK COLLECT a UNION query into a table of objects

How can I collect into a table of objects the values produced by a query that has a UNION in it, as shown below?
Select customer_name
from customer
where customer_id = 'xxx'
BULK COLLECT INTO customer_obj
UNION
Select customer_name
from customer
where customer_name like '%adam%'
The constraints above are completely made up.
The BULK COLLECT clause comes right after the (first) SELECT clause, before the (first) FROM clause. You have it in the wrong place.
It is not clear why you are using UNION (although that by itself will not result in an error). Perhaps as an unintended consequence, you will get a list of distinct names, because that is what UNION does (as opposed to UNION ALL).
Other than that, as has been pointed out in a comment already, you don't need UNION - you need an OR in the WHERE clause. But even if you modify your query that way, you still must move BULK COLLECT to its proper place.
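A minimal sketch of that rewrite (the collection type and enclosing block are made up for illustration; adjust them to your actual customer_obj type):
DECLARE
  TYPE name_tab IS TABLE OF customer.customer_name%TYPE; -- hypothetical collection type
  customer_obj name_tab;
BEGIN
  SELECT customer_name
  BULK COLLECT INTO customer_obj  -- right after the SELECT list, before FROM
  FROM customer
  WHERE customer_id = 'xxx'
     OR customer_name LIKE '%adam%';
END;
/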
Another option would be to put your UNION into an inline view. For example,
SELECT cust.customer_name
BULK COLLECT
INTO customer_obj
FROM (
SELECT customer_name
FROM customer
WHERE customer_id = 'xxx'
UNION
SELECT customer_name
FROM customer
WHERE customer_name LIKE '%adam%'
) cust

ORA-01722: invalid number but only when query used as subquery

A query like this:
SELECT SUM(col1 * col3) AS total, col2
FROM table1
GROUP BY col2
works as expected when run individually.
For reference:
table1.col1 -- float
table1.col2 -- varchar2
table1.col3 -- float
When this query is moved into a subquery, I get an ORA-01722 error referring to the col2 position in the select clause. The larger query looks like this:
SELECT col3, subquery1.total
FROM table3
LEFT JOIN (
SELECT SUM(table1.col1 * table1.col3) AS total, table1.col2
FROM table1
GROUP BY table1.col2
) subquery1 ON table3.col3 = subquery1.col2
For reference:
table3.col3 -- varchar2
It may also be worth noting that I have another query, on table2, that has the same structure as table1's. If I use the subquery on table2, it works; it never works when using table1.
There is no concatenation, the data types match, the query works by itself... I'm at a loss here. What else should I be looking for? What painfully obvious problem is staring me in the face?
(I didn't choose or make the table structures and can't change them, so answers to that end will unfortunately not be helpful.)
Try using a proper cast from float to char:
SELECT col3, subquery1.total
FROM table3
LEFT JOIN (
SELECT SUM(table1.col1 * table1.col3) AS total, table1.col2
FROM table1
GROUP BY table1.col2
) subquery1 ON to_char(table3.col3) = subquery1.col2
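If that doesn't surface the problem, ORA-01722 generally means an implicit TO_NUMBER is being applied to character data somewhere after Oracle merges the views. A quick way to hunt for the offending rows (a sketch; adjust the pattern to whatever number format you expect):
-- rows whose col2 would blow up an implicit TO_NUMBER(col2)
select col2
from table1
where not regexp_like(trim(col2), '^-?[0-9]+(\.[0-9]+)?$');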

Reduce overload on pl/sql

I have a requirement to match a few attributes one by one, and I'm looking to avoid multiple SELECT statements. Below is an example.
Table1
Col1|Price|Brand|size
-----------------------
A|10$|BRAND1|SIZE1
B|10$|BRAND1|SIZE1
C|30$|BRAND2|SIZE2
D|40$|BRAND2|SIZE4
Table2
Col1|Col2|Col3
--------------
B|XYZ|PQR
C|ZZZ|YYY
Table3
Col1|COL2|COL3|LIKECOL1|Price|brand|size
-----------------------------------------
B|XYZ|PQR|A|10$|BRAND1|SIZE1
C|ZZZ|YYY|D|NULL|BRAND2|NULL
In table3, I need to insert data from table2 after checking these conditions:
Find a match for the table2 record where Brand, Size and Price all match.
If no match is found, try just Brand and Size.
If there is still no match, try Brand only.
In the example above, the first record in table2 found a match on all three attributes, so it was inserted into table3 with all of them; for the second record, record 'D' matches, but only on Brand.
All I can think of is writing three different INSERT statements in an Oracle PL/SQL block, like below:
insert into table3
select from tab2
where all 3 attributes match;

insert into table3
select from tab2
where brand and size match
and not exists in table3 (the not exists is to avoid
inserting the same record that was already
inserted when all 3 attributes matched);

insert into table3
select from tab2
where brand matches and not exists in table3;
Can anyone please suggest a better way to achieve this, avoiding selecting from table2 multiple times?
This is a case for OUTER APPLY.
OUTER APPLY is a type of lateral join that allows you to join to dynamic views that refer to tables appearing earlier in your FROM clause. With that ability, you can define a dynamic view that finds all the matches, sorts them by the pecking order you've specified, and then uses FETCH FIRST 1 ROW ONLY to include only the first one in the results.
Using OUTER APPLY means that if there is no match, you will still get the table2 record -- just with all the match columns null. If you don't want that, you can change OUTER APPLY to CROSS APPLY.
Here is a working example (with step-by-step comments), shamelessly stealing the table creation scripts from Michael Piankov's answer:
create table Table1 (Col1, Price, Brand, size1)
as select 'A','10','BRAND1','SIZE1' from dual union all
   select 'B','10','BRAND1','SIZE1' from dual union all
   select 'C','30','BRAND2','SIZE2' from dual union all
   select 'D','40','BRAND2','SIZE4' from dual;

create table Table2 (Col1, Col2, Col3)
as select 'B','XYZ','PQR' from dual union all
   select 'C','ZZZ','YYY' from dual;
-- INSERT INTO table3
SELECT t2.col1, t2.col2, t2.col3,
       t1.col1 likecol1,
       decode(t1.price, t1_template.price, t1_template.price, null) price,
       decode(t1.brand, t1_template.brand, t1_template.brand, null) brand,
       decode(t1.size1, t1_template.size1, t1_template.size1, null) size1
FROM
  -- Start with table2
  table2 t2
  -- Get the row from table1 matching on col1... this is our search template
  inner join table1 t1_template
          on t1_template.col1 = t2.col1
  -- Get the best match from table1 for our search template,
  -- excluding the search template itself
  outer apply (
    SELECT *
    FROM table1 t1
    WHERE 1=1
      -- Exclude the search template itself
      and t1.col1 != t2.col1
      -- All matches include BRAND
      and t1.brand = t1_template.brand
    -- Order by match strength based on price and size
    order by case when t1.price = t1_template.price and t1.size1 = t1_template.size1 THEN 1
                  when t1.size1 = t1_template.size1 THEN 2
                  else 3 END
    -- Only get the best match for each row in T2
    FETCH FIRST 1 ROW ONLY
  ) t1;
Unfortunately, it is not clear what you mean by a match. What do you expect if there is more than one match?
Should only the first match be used, or should all available pairs be generated?
Regarding your question of how to avoid multiple inserts, there is more than one way:
You could use a multitable insert (INSERT FIRST) with conditions, as sketched below.
You could join table1 to itself to get all pairs and filter the results in the WHERE clause.
You could use analytic functions.
I suppose there are other ways as well. But why would you want to avoid three simple inserts? They are easy to read and maintain.
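For the first option, here is a rough sketch of the multitable-insert idea (untested, using the match rules from the question; ROW_NUMBER keeps only the best match per table2 row):
insert first
  when match_level = 3 then
    into table3 (col1, col2, col3, likecol1, price, brand, size1)
    values (col1, col2, col3, likecol1, price, brand, size1)
  when match_level = 2 then
    into table3 (col1, col2, col3, likecol1, brand, size1)
    values (col1, col2, col3, likecol1, brand, size1)
  else
    into table3 (col1, col2, col3, likecol1, brand)
    values (col1, col2, col3, likecol1, brand)
select col1, col2, col3, likecol1, price, brand, size1, match_level
from (
  select t2.col1, t2.col2, t2.col3,
         m.col1 as likecol1, m.price, m.brand, m.size1,
         case when m.price = t.price and m.size1 = t.size1 then 3
              when m.size1 = t.size1 then 2
              else 1
         end as match_level,
         row_number() over (partition by t2.col1
                            order by case when m.price = t.price and m.size1 = t.size1 then 1
                                          when m.size1 = t.size1 then 2
                                          else 3
                                     end) as rn
  from table2 t2
  join table1 t on t.col1 = t2.col1   -- the search template
  join table1 m on m.brand = t.brand  -- candidates must at least match brand
               and m.col1 <> t.col1
)
where rn = 1;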
Here is an example using analytic functions:
create table Table1 (Col1, Price, Brand, size1)
as select 'A','10','BRAND1','SIZE1' from dual union all
   select 'B','10','BRAND1','SIZE1' from dual union all
   select 'C','30','BRAND2','SIZE2' from dual union all
   select 'D','40','BRAND2','SIZE4' from dual;

create table Table2 (Col1, Col2, Col3)
as select 'B','XYZ','PQR' from dual union all
   select 'C','ZZZ','YYY' from dual;
with s as (
  select Col1, Price, Brand, size1,
         count(*) over(partition by Price, Brand, size1) as match3,
         count(*) over(partition by Price, Brand) as match2,
         count(*) over(partition by Brand) as match1,
         lead(Col1) over(partition by Price, Brand, size1 order by Col1) as like3,
         lead(Col1) over(partition by Price, Brand order by Col1) as like2,
         lead(Col1) over(partition by Brand order by Col1) as like1,
         lag(Col1) over(partition by Price, Brand, size1 order by Col1) as like_desc3,
         lag(Col1) over(partition by Price, Brand order by Col1) as like_desc2,
         lag(Col1) over(partition by Brand order by Col1) as like_desc1
  from Table1 t
)
select t.Col1, t.Col2, t.Col3,
       coalesce(s.like3, s.like_desc3, s.like2, s.like_desc2, s.like1, s.like_desc1) as like_col,
       case when match3 > 1 then size1 end as size1,
       case when match1 > 1 then Brand end as Brand,
       case when match2 > 1 then Price end as Price
from table2 t
left join s on s.Col1 = t.Col1;
COL1  COL2  COL3  LIKE_COL  SIZE1  BRAND   PRICE
B     XYZ   PQR   A         SIZE1  BRAND1  10
C     ZZZ   YYY   D         -      BRAND2  -

Multiple-column update from another table's columns in a single query - performance improvement

I have a table MAIN_TABLE with 6 million records. The query below takes 4 hours. Can someone suggest a performance improvement?
MAIN_TABLE structure:
FIELD1 (NUMBER), FIELD2 (VARCHAR), FIELD3............
UPDATE MAIN_TABLE MT
SET
  FIELD3 = (SELECT FLDA FROM TABLE1 WHERE FLDX = MT.FIELD2 AND FLDZ = 'XYZ'),
  FIELD4 = (SELECT FLDB FROM TABLE1 WHERE FLDX = MT.FIELD2 AND FLDZ = 'PQR'),
  FIELD6 = (SELECT FLDD FROM TABLE1 WHERE FLDX = MT.FIELD2 AND FLDW = 'FGH');
Will looping help? Will dividing the population into multiple threads help?
Will any DB hint work?
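One common rewrite for this pattern is a single MERGE that aggregates TABLE1 once and joins it to MAIN_TABLE, instead of running three correlated subqueries per row. A sketch (untested; it assumes each FLDX has at most one row per condition, and note that unlike the original UPDATE it only touches matched rows, whereas the UPDATE also sets the fields of non-matching rows to NULL):
merge into main_table mt
using (
  select fldx,
         max(case when fldz = 'XYZ' then flda end) as flda,
         max(case when fldz = 'PQR' then fldb end) as fldb,
         max(case when fldw = 'FGH' then fldd end) as fldd
  from table1
  group by fldx
) src
on (mt.field2 = src.fldx)
when matched then update set
  mt.field3 = src.flda,
  mt.field4 = src.fldb,
  mt.field6 = src.fldd;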
