PostgreSQL query plan strange behaviour - performance

I need to select the rows in one table (table_A, fairly small, < 10K rows) that have no corresponding rows in another table (table_B, over 500K rows) based on column_X (table_B has a btree index on that column).
If I use the following query:
select a.column1,
a.column2,
a.column3,
a.column_X,
b.column_X
from table_A a
left outer join table_B b on a.column_X = b.column_X
where a.column_X <> 0
and b.column_X is null
the query (168 resulting rows) is executed in about 600ms.
If, on the other hand, I try a different query:
select column1,
column2,
column3,
column_X
from table_A
where column_X not in (
select column_X
from table_B
where column_X is not null
)
and column_X <> 0
it takes around 8 minutes to retrieve the same 168 rows. column_X is of type bigint and casting seems to make no difference (in the second query the index is never used).
Any idea?

The NOT IN subselect is optimized much worse than the alternatives. Due to its different semantics around NULLs, PostgreSQL cannot use an anti-join for it. If you can, avoid this pattern; use NOT EXISTS instead, or an outer join.
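For example, the NOT EXISTS form of the same query (a sketch using the column names from the question):
select a.column1,
a.column2,
a.column3,
a.column_X
from table_A a
where a.column_X <> 0
and not exists (
select 1
from table_B b
where b.column_X = a.column_X
);
Unlike NOT IN, this form is NULL-safe, so the planner can execute it as an anti-join, just like the outer-join version.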

Related

ORA-01722: invalid number but only when query used as subquery

A query, like so:
SELECT SUM(col1 * col3) AS total, col2
FROM table1
GROUP BY col2
works as expected when run individually.
For reference:
table1.col1 -- float
table1.col2 -- varchar2
table1.col3 -- float
When this query is moved to a subquery, I get an ORA-01722 error, with reference to the "col2" position in the select clause. The larger query looks like this:
SELECT col3, subquery1.total
FROM table3
LEFT JOIN (
SELECT SUM(table1.col1 * table1.col3) AS total, table1.col2
FROM table1
GROUP BY table1.col2
) subquery1 ON table3.col3 = subquery1.col2
For reference:
table3.col3 -- varchar2
It may also be worth noting that I have another query, from table2 that has the same structure as table1. If I use the subquery from table2, it works. It never works when using table1.
There is no concatenation, the data types match, the query works by itself... I'm at a loss here. What else should I be looking for? What painfully obvious problem is staring me in the face?
(I didn't choose or make the table structures and can't change them, so answers to that end will unfortunately not be helpful.)
Try using a proper cast of float to char:
SELECT col3, subquery1.total
FROM table3
LEFT JOIN (
SELECT SUM(table1.col1 * table1.col3) AS total, table1.col2
FROM table1
GROUP BY table1.col2
) subquery1 ON to_char(table3.col3) = subquery1.col2
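Since ORA-01722 means an implicit TO_NUMBER failed on a non-numeric string, it may also be worth checking whether table1.col2 actually contains non-numeric values (a diagnostic sketch, not part of the original answer; the pattern assumes plain integers or decimals):
-- list col2 values that would fail an implicit TO_NUMBER
SELECT DISTINCT col2
FROM table1
WHERE NOT REGEXP_LIKE(TRIM(col2), '^-?[0-9]+(\.[0-9]+)?$');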

How to use Oracle hints or other optimization to fix function in where clause performance issue?

This is slow:
select col_x from table_a where col_y in (select col_y from table_b) and fn(col_x)=0;
But I know that this will return 4 rows fast, and that I can run fn() on 4 values fast.
So I do some testing, and I see that this is fast:
select fn(col_x) from table_a where col_y in (select col_y from table_b);
When using the fn() in the where clause, Oracle is running it on every row in table_a. How can I make it so Oracle first uses the col_y filter, and only runs the function on the matched rows?
For example, conceptually, I thought this would work:
with taba as (
select fn(col_x) x from table_a where col_y in (select col_y from table_b)
)
select * from taba where x=0;
because I thought Oracle would run the with clause first, but Oracle is "optimizing" this query and making this run exactly the same as the first query above where fn(col_x)=0 is in the where clause.
I would like this to run as a plain query and not in a PL/SQL block. It seems like there should be a way to give Oracle a hint, or do some other trick, but I can't figure it out. BTW, the table is indexed on col_y and that index is being used as an access predicate. Stats are up to date.
There are two ways you could go about it:
1) Add 'AND rownum >= 0' in the subquery to force materialization.
OR
2) Use a CASE statement inside the query to force the evaluation order (maybe); sketches of both options follow.
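Sketches of both, using the tables from the question (untested, and the optimizer may still rewrite them):
-- 1) rownum in the inline view prevents the predicate from being pushed in,
-- so fn() only runs on the rows that survive the col_y filter
select x
from (
select fn(col_x) x
from table_a
where col_y in (select col_y from table_b)
and rownum >= 0
)
where x = 0;
-- 2) CASE may change the evaluation order (hence the 'maybe' above)
select col_x
from table_a
where col_y in (select col_y from table_b)
and case when fn(col_x) = 0 then 1 else 0 end = 1;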
This works, but if anyone has a better answer, please share:
select col_x
from table_a
where col_y in (select col_y from table_b)
and (select 1 from dual where fn(col_x) = 0) = 1;
Kind of kludgy, but works. Takes a query running in 60+ seconds down to 0.1 seconds.
You could try the HAVING clause in your query. This clause is not evaluated until the base query is completed, and then the HAVING clause is run on the resulting rows. It's typically used with aggregate functions, but could be useful in your case.
select col_x
from table_a
where col_y in (select col_y from table_b)
having fn(col_x)=0;
A HAVING clause restricts the results of a GROUP BY in a SelectExpression. The HAVING clause is applied to each group of the grouped table, much as a WHERE clause is applied to a select list. If there is no GROUP BY clause, the HAVING clause is applied to the entire result as a single group. The SELECT clause cannot refer directly to any column that does not have a GROUP BY clause. It can, however, refer to constants, aggregates, and special registers.
http://docs.oracle.com/javadb/10.8.3.0/ref/rrefsqlj14854.html
1) Why don't you try joining table_a and table_b on col_y?
select a.col_x
from table_a a, table_b b
where a.col_y = b.col_y
and fn(a.col_x) = 0;
2) The NO_PUSH_PRED hint:
select /*+ NO_PUSH_PRED(v) */ col_x from (
select col_x from table_a where col_y in (select col_y from table_b)
) v
where fn(col_x) = 0;
3) EXISTS and the PUSH_SUBQ hint:
select col_x from table_a a
where exists ( select /*+ PUSH_SUBQ */ 1 from table_b b where a.col_y = b.col_y )
and fn(col_x) = 0;

sqlplus: Retrieve results in chunks (update ROWNUM in loop)

I'm really new to using sqlplus, so I'm sorry if this is a stupid question.
I have a really long running query of the form:
SELECT columnA
from tableA
where fieldA in (
select unique columnB
from tableB
where fieldB in
(select columnC
from tableC
where fieldC not in
(select columnD
from tableD
where x=y
and a=b
and columnX in
(select columnE
from tableE
where p=q)))
and columnInTableB = <some value>
and anotherColumnInTableB = <some other value>
and thirdColumnInTableB IN (<set of values>)
and fourthColumnInTableB like <some string>);
Each of the tables has about 15-30 columns and a varying number of rows. TableB is the largest, with about 5 million rows in all. Tables A-E have between 500,000 and 1 million rows each.
I've tried a couple of approaches:
1) Run this query as is:
This query runs for really long and I get the error -
ORA-01555: snapshot too old: rollback segment number <> with name <>
I did some research and found things like:
ORA-1555: snapshot too old: rollback segment number
However, I don't have privileges to change the undo segment.
2) I re-wrote the query using with...as, but then I get the error:
unable to extend temp segment by <> in tablespace TEMP
Again, I found explanations as to how to fix this error, but I don't have privileges to extend the temp segment.
The query that takes longest to run is:
(select unique columnB
from tableB ...
The fourthColumnInTableB like <some string> condition matches about three million entries in tableB in the worst case.
Someone suggested to me to 'run the query in smaller chunks'.
An approach I thought of was to retrieve the data for the long-running subquery
(select unique columnB
from tableB ...)
in chunks (using ROWNUM, as suggested here).
My question is this:
I don't know exactly how many potential matches there are for this subquery. Can I dynamically set ROWNUM to retrieve data in chunks?
If yes, could you please give me an example of what the while loop must look like, and how I can determine when the result set has been exhausted?
An option I found for this was to check while @@ROWCOUNT > 0 or use:
while exists (query)
However, I'm still not sure how to write the loop and how to use a variable (?) to dynamically set ROWNUM.
Basically, I'm wondering if I can do:
SELECT columnA
from tableA
where fieldA in (
while all results have not been fetched:
select *
from
(select a.*, rownum rnum
from
(select unique columnB
from tableB
where fieldB in
(select columnC
from tableC
where fieldC not in
(select columnD
from tableD
where x=y
and a=b
and columnX in
(select columnE
from tableE
where p=q)))
and columnInTableB = <some value>
and anotherColumnInTableB = <some other value>
and thirdColumnInTableB IN (<set of values>)
and fourthColumnInTableB like <some string>) a
where rownum <= i) and rnum >= i);
update value of i here (?) so that the loop can repeat
How can I update the value of 'i' for rownum/rnum above dynamically in some loop to retrieve results in chunks (of, say, 10000) until the result set has been exhausted?
Also, what should the while loop look like?
I also have no idea how to rewrite this using joins (my knowledge of sql is very limited), so if someone can help me rewrite this more efficiently using joins or any other method, that would work too.
I'd really appreciate any help on this. I've been stuck on this for a few days now and I'm unable to determine a proper solution.
Thank you!
1) Try removing the UNIQUE/DISTINCT clause. It should reduce the in-memory sort and the temp segment usage.
2) Try applying a 'rownum < x_rows' filter at the end, to restrict the number of rows returned to the client and reduce I/O.
SELECT columnA
FROM tableA
WHERE fieldA IN
(SELECT columnB -- UNIQUE removed, per point 1
FROM tableB
WHERE fieldB IN
(SELECT columnC
FROM tableC
WHERE fieldC NOT IN
(SELECT columnD
FROM tableD
WHERE x = y
AND a = b
AND columnX IN
(SELECT columnE FROM tableE WHERE p = q)
)
)
AND columnInTableB = <SOME value>
AND anotherColumnInTableB = <SOME other value>
AND thirdColumnInTableB IN (<SET OF VALUES>)
AND fourthColumnInTableB LIKE <SOME string>
)
AND ROWNUM < 50; -- added per point 2
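On the chunking question itself: the ROWNUM pagination pattern referenced in the question needs a lower and an upper bound plus a deterministic ORDER BY, not a single i. A sketch, with :lo and :hi as bind variables (the bind names and the ORDER BY column are assumptions):
SELECT columnB
FROM (
SELECT a.columnB, ROWNUM rnum
FROM (
SELECT DISTINCT columnB
FROM tableB
WHERE fieldB IN (<same nested subqueries as above>)
AND fourthColumnInTableB LIKE <some string>
ORDER BY columnB -- a stable order is what makes the chunks non-overlapping
) a
WHERE ROWNUM <= :hi
)
WHERE rnum >= :lo;
Run it repeatedly, advancing :lo and :hi by the chunk size (say 10,000) each time, and stop when a run returns no rows.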

What is the correct Index column ordering based on a SQL Query?

Consider the generic query:
SELECT * FROM (
SELECT COL_1, COL_2, COL_3 FROM TABLE_1
WHERE COL_1 IN ('item1', 'item2')
AND COL_2 = 100
ORDER BY COL_3
) subquery1
INNER JOIN
(
SELECT COL_A, MAX(COL_B) FROM TABLE_2
GROUP BY COL_A
HAVING COUNT(COL_B) > 2
) subquery2
ON subquery1.COL_1 = subquery2.COL_A
Assume that the query itself is optimized. If I wanted to create an 'optimized' index around a query like this, what indexes should I be creating? In particular, in what order should the indexes' columns be?
From my understanding, the first columns should be the columns used in the WHERE clause, then the ORDER BY clause, and lastly the SELECT clause. Is this true? What about the others, such as GROUP BYs, HAVINGs, and JOINs - when should they be considered?
Also if necessary, assume that this is an Oracle database. (But I imagine column ordering would be the same for other platforms.)
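For concreteness, the heuristic described above would give something like the following (a sketch only; the index names are made up, and the real choice should be checked against the execution plan):
CREATE INDEX table_1_ix ON TABLE_1 (COL_2, COL_1, COL_3);
-- equality predicate (COL_2) first, then the IN-list column (COL_1),
-- then the ORDER BY column (COL_3)
CREATE INDEX table_2_ix ON TABLE_2 (COL_A, COL_B);
-- GROUP BY key first, then the aggregated/counted column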

Convert rownum from Oracle in Postgres, in "having" clause

I need to convert a query from Oracle SQL to Postgres.
select count(*) from table1 group by column1 having max(rownum) = 4
If I replace "rownum" with "row_number() over()", I get the error message: "window functions are not allowed in HAVING".
Could you help me to get the same result in Postgres, as in Oracle?
The query below will do what your Oracle query is doing (using 6 rather than the question's 4, to match the sample data further down).
select count(*) from
(select column1, row_number() over () as x from table1) as t
group by column1 having max(t.x) = 6;
However
Neither Oracle nor Postgres will guarantee the order in which records are read unless you specify an ORDER BY clause. So running the query multiple times is going to be inconsistent, depending on how the database decides to process the query. Certainly in Postgres any updates will change the underlying row order.
In the example below I've got an extra column of seq which is used to provide a consistent sort.
CREATE TABLE table1 (column1 int, seq int);
insert into table1 values (0,1),(0,2),(0,3),(1,4),(0,5),(1,6);
And a revised query which forces the order to be consistent:
select count(*) from
(select column1, row_number() over (order by seq) as x from table1) as t
group by column1 having max(t.x) = 6;
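With the sample data above, row_number() over (order by seq) assigns x = 1 through 6; the group column1 = 1 contains x = 4 and x = 6, so its max(t.x) is 6 and the query returns a single row with a count of 2.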
