Worse query plan with a JOIN after ANALYZE - performance

I see that running ANALYZE results in significantly poor performance on a particular JOIN I'm making between two tables.
Suppose the following schema:
CREATE TABLE a ( id INTEGER PRIMARY KEY, name TEXT );
CREATE TABLE b ( a NOT NULL REFERENCES a, value INTEGER, PRIMARY KEY(a, b) );
CREATE VIEW ab AS SELECT a.name, b.text, MAX(b.value)
FROM a
JOIN b ON b.a = a.id;
GROUP BY a.id
ORDER BY a.name
Table a is approximately 10K rows, table b is approximately 48K rows (~5 rows per row in table a).
Before ANALYZE
Now when I run the following query:
SELECT * FROM ab;
The query plan looks as follows:
1|0|0|SCAN TABLE b
1|1|1|SEARCH TABLE a USING INTEGER PRIMARY KEY (rowid=?)
This is a good plan, b is larger and I want it to be in the outer loop, making use of the index in table a. It finishes well within a second.
After ANALYZE
When I execute the same query again, the query plan results in two table scans:
1|0|1|SCAN TABLE a
1|1|0|SCAN TABLE b
This is far for optimal. For some reason the query planner thinks that an outer loop of 10K rows and an inner loop of 48K rows is a better fit. This takes about 1.5 minute to complete.
Should I adapt the index in table b to make it work after ANALYZE? Anything else to change to the indexing/schema?
I just try to understand the problem here. I worked around it using a CROSS JOIN, but that feels dirty and I don't really understand why the planner would go with a plan that is orders of magnitude slower than the un-analyzed plan. It seems to be related to GROUP BY, since the query planner puts table b in the outer loop without it (but that renders the query useless for what I want).

Accidentally found the answer by adjusting the GROUP BY clause in the view definition. Instead of joining on a.id, I group on b.a instead, although they have the same values.
CREATE VIEW ab AS SELECT a.name, b.text, MAX(b.value)
FROM a
JOIN b ON b.a = a.id;
GROUP BY b.a -- <== changed this from a.id to b.a
ORDER BY a.name
I'm still not entirely sure what the difference is, since it groups the same data.

Related

Consecutive JOIN and aliases: order of execution

I am trying to use FULLTEXT search as a preliminary filter before fetching data from another table. Consecutive JOINs follow to further refine the query and to mix-and-match rows (in reality there are up to 6 JOINs of the main table).
The first "filter" returns the IDs of the rows that are useful, so after joining I have a subset to continue with. My issue is performance, however, and my lack of understanding of how the SQL query is executed in SQLite.
SELECT *
FROM mytbl AS t1
JOIN
(SELECT someid
FROM myftstbl
WHERE
myftstbl MATCH 'MATCHME') AS prior
ON
t1.someid = prior.someid
AND t1.othercol = 'somevalue'
JOIN mytbl AS t2
ON
t2.someid = prior.someid
/* Or is this faster? t2.someid = t1.someid */
My thought process for the query above is that first, we retrieve the matched IDs from the myftstbl table and use those to JOIN on the main table t1 to get a sub-selection. Then we again JOIN a duplicate of the main table as t2. The part that I am unsure of is which approach would be faster: using the IDs from the matches, or from t2?
In other words: when I refer to t1.someid inside the second JOIN, does that contain only the someids after the first JOIN (so only those at the intersection of prior and those for which t1.othercol = 'somevalue) OR does it contain all the original someids of the whole original table?
You can assume that all columns are indexed. In fact, when I use one or the other approach, I find with EXPLAIN QUERY PLAN that different indices are being used for each query. So there must be a difference between the two.
The query should be simplified to
SELECT *
FROM mytbl AS t1
JOIN myftstbl USING (someid) -- or ON t1.someid = myftstbl.someid
JOIN mytbl AS t2 USING (someid) -- or ON t1.someid = t2.someid
WHERE myftstbl.{???} MATCH 'MATCHME' -- replace {???} with correct column name
AND t1.othercol = 'somevalue'
PS. The query logic is not clear for me, so it is saved as-is.

Using query hints to use a index in an inner table

I have a query which uses the view a as follows and the query is extremely slow.
select *
from a
where a.id = 1 and a.name = 'Ann';
The view a is made up another four views b,c,d,e.
select b.id, c.name, c.age, e.town
from b,c,d,e
where c.name = b.name AND c.id = d.id AND d.name = e.name;
I have created an index on the table of c named c_test and I need to use it when executing the first query.
Is this possible?
Are you really using this deprecated 1980s join syntax? You shouldn't. Use proper explicit joins (INNER JOIN in your case).
You are joining the two tables C and D on their IDs. That should mean they are 1:1 related. If not, "ID" is a misnomer, because an ID is supposed to identify a row.
Now let's look at the access route: You have the ID from table B and the name from tables B and C. We can tell from the column name that b.id is unique and Oracle guarantees this with a unique index, if the database is set up properly.
This means the DBMS will look for the B row with ID 1, find it instantly in the index, find the row instantly in the table, see the name and see whether it matches 'Ann'.
The only thing that can be slow hence is joining C, D, and E. Joining on unique IDs is extremely fast. Joining on (non-unigue?) names is only fast, if you provide indexes on the names. I'd recommend the following indexes accordingly:
create index idx_c on c (name);
create index idx_e on e (name);
To get this faster still, use covering indexes instead:
create index idx_b on b (id, name);
create index idx_c on c (name, id, age);
create index idx_d on d (id, name);
create index idx_e on e (name, town);

Oracle Performance issues on using subquery in an "In" orperator

I have two query that looks close to the same but Oracle have very different performance.
Query A
Create Table T1 as Select * from FinalView1 where CustomerID in ('A0000001','A000002')
Query B
Create Table T1 as Select * from FinalView1 where CustomerID in (select distinct CustomerID from CriteriaTable)
The CriteriaTable have 800 rows but all belongs to Customer ID 'A0000001' and 'A000002'.
This means the subquery: "select distinct CustomerID from CriteriaTable" also only returns the same two elements('A0000001','A000002') as manually entered in query A
Following is the query under the FinalView1
create or replace view FinalView1_20200716 as
select
Customer_ID,
<Some columns>
from
Table1_20200716 T1
INNER join Table2_20200716 T2 on
T1.Invoice_number = T2.Invoice_number
and
T1.line_id = T2.line_id
left join Table3_20200716 T3 on
T3.id = T1.Customer_ID
left join Table4_20200716 T4 on
T4.Shipping_ID = T1.Shipping_ID
left join Table5_20200716 Table5 on
Table5.Invoice_ID = T1.Invoice_ID
left join Table6_20200716 T6 on
T6.Shipping_ID = T4.Shipping_ID
left join First_Order first on
first.Shipping_ID = T1.Shipping_ID
;
Table1_20200716,Table2_20200716,Table3_20200716,Table4_20200716,Table5_20200716,Table6_20200716 are views to the corresponding table with temporal validity feature. For example
The query under Table1_20200716
Create or replace view Table1_20200716 as
select
*
from Table1 as for period of to_date('20200716,'yyyymmdd')
However table "First_Order" is just a normal table as
Following is the performance for both queries (According to explain plan):
Query A:
Cardinality: 102
Cost : 204
Total Runtime: 5 secs max
Query B:
Cardinality:27921981
Cost: 14846
Total Runtime:20 mins until user cancelled
All tables are indexed using those columns that used to join against other tables in the FinalView1. According to the explain plan, they have all been used except for the FirstOrder table.
Query A used uniquue index on the FirstOrder Table while Query B performed a full scan.
For query B, I was expecting the Oracle will firstly query the sub-query get the result into the in operator, before executing the main query and therefore should only have minor impact to the performance.
Thanks in advance!
As mentioned from my comment 2 days ago. Someone have actually posted the solution and then have it removed while the answer actually work. After waiting for 2 days the So I designed to post that solution.
That solution suggested that the performance was slow down by the "in" operator. and suggested me to replace it with an inner join
Create Table T1 as
Select
FV.*
from
FinalView1 FV
inner join (
select distinct
CustomerID
from
CriteriaTable
) CT on CT.customerid = FV.customerID;
Result from explain plan was worse then before:
Cardinality:28364465 (from 27921981)
Cost: 15060 (from 14846)
However, it only takes 17 secs. Which is very good!

Hash Join with Partition restriction from third table

my current problem is in 11g, but I am also interested in how this might be solved smarter in later versions.
I want to join two tables. Table A has 10 million rows, Table B is huge and has a billion of records across about a thousand partitions. One partition has around 10 million records. I am not joining on the partition key. For most rows of Table A, one or more rows in Table B will be found.
Example:
select * from table_a a
inner join table_b b on a.ref = b.ref
The above will return about 50 million rows, whereas the results come from about 30 partitions of table b. I am assuming a hash join is the correct join here, hashing table a and FTSing/index-scanning table b.
So, 970 partitions were scanned for no reason. And, I have a third query that could tell oracle which 30 partitions to check for the join.
Example of third query:
select partition_id from table_c
This query gives exactly the 30 partitions for the query above.
To my question:
In PL/SQL one can solve this by
select the 30 partition_ids into a variable (be it just a select listagg(partition_id,',') ... into v_partitions from table_c
Execute my query like so:
execute immediate 'select * from table_a a
inner join table_b b on a.ref = b.ref
where b.partition_id in ('||v_partitions||')' into ...
Let's say this completes in 10 minutes.
Now, how can I do this in the same amount of time with pure SQL?
Just simply writing
select * from table_a a
inner join table_b b on a.ref = b.ref
where b.partition_id in (select partition_id from table_c)
does not do the trick it seems, or I might be aiming at the wrong plan.
The plan I think I want is
hash join
table a
nested loop
table c
partition pruning here
table b
But, this does not come back in 10 minutes.
So, how to do this in SQL and what execution plan to aim at? One variation I have not tried yet that might be the solution is
nested loop
table c
hash join
table a
partition pruning here (pushed predicate from the join to c)
table b
Another feeling I have is that the solution might lie in joining table a to table c (not sure on what though) and then joining this result to table b.
I am not asking you to type everything out for me. Just a general concept of how to do this (getting partition restriction from a query) in SQL - what plan should I aim at?
thank you very much! Peter
I'm not an expert at this, but I think Oracle generally does the joins first, then applies the where conditions. So you might get the plan you want by moving the partition pruning up into a join condition:
select * from table_a a
inner join table_b b on a.ref = b.ref
and b.partition_id in (select partition_id from table_c);
I've also seen people try to do this sort of thing with an inline view:
select * from table_a a
inner join (select * from table_b
where partition_id in (select partition_id from table_c)) b
on a.ref = b.ref;
thank you all for your discussions with me on this one. In my case this was solved (not by me) by adding a join-path between table_c and table_a and by overloading the join conditions as below. In my case this was possible by adding column partition_id to table_a:
select * from
table_c c
JOIN table_a a ON (a.partition_id = c.partition_id)
JOIN table_b b ON (b.partition_id = c.partition_id and b.partition_id = a.partition_id and b.ref = a.ref)
And this is the plan you want:
leading(c,b,a) use_nl(c,b) swap_join_inputs(a) use_hash(a)
So you get:
hash join
table a
nested loop
table c
partition list iterator
table b

How to effeciently select data from two tables?

I have two tables: A, B.
A has prisoner_id and prisoner_name columns.
B has all other info about prisoners included prisoner_name column.
First I select all of the data that I need from B:
WITH prisoner_datas AS
(SELECT prisoner_name, ... FROM B WHERE ...)
Then I want to know all of the id of my prisoner_datas. To do this I need to combine information by prisoner_name column, because it's common for both tables
I did the following
SELECT A.prisoner_id, prisoner_datas.prisoner_name, prisoner_datas. ...,
FROM A, prisoner_datas
WHERE A.prisoner_name = prisoner_datas.prisoner_name
But it works very slow. How can I improve performance?
Add an index on the prisoner_name join column in the B table. Then the following join should have some performance improvement:
SELECT
A.prisoner_id,
B.prisoner_name,
B.prisoner_datas.id -- and other columns if needed
FROM A
INNER JOIN B
ON A.prisoner_name = B.prisoner_name
Note here that I used an explicit join syntax here. It isn't required, and the query plan might not change, but it makes the query easier to read. I don't think the CTE will change much, but the lack of an index on the join column should be important here.

Resources