`INTERSECT` vs `INNER JOIN` in PDO SQLite - performance

I wonder which of these two queries is faster:
SELECT Id FROM T1
INTERSECT
SELECT Id FROM T2
or
SELECT T1.Id
FROM T1
INNER JOIN T2 ON T1.Id=T2.Id

At the moment, SQLite implements INTERSECT by copying the results of the two queries into two temporary sorted tables, and then looking up each Id value of the first table in the second table.
An INNER JOIN is implemented as a nested loop join, i.e., each Id value of one table is looked up in the other table. (SQLite chooses the other table as the one with an index on Id; if neither table has such an index, it creates a temporary index.)
So the practical difference is that INTERSECT always creates temporary tables, while JOIN can work directly on the actual tables.
(If T1 and T2 were complicated subqueries, JOIN would also need temporary tables, and there would be no difference.)
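One way to check this on your own data is to ask SQLite for its query plan. The sketch below uses Python's built-in sqlite3 module rather than PDO, and the table contents are made up for illustration; the plan text differs between SQLite versions, so it is printed rather than assumed.

```python
import sqlite3

# In-memory database with two small, hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE T1 (Id INTEGER PRIMARY KEY);
    CREATE TABLE T2 (Id INTEGER PRIMARY KEY);
""")
conn.executemany("INSERT INTO T1 (Id) VALUES (?)", [(i,) for i in range(1, 11)])
conn.executemany("INSERT INTO T2 (Id) VALUES (?)", [(i,) for i in range(6, 16)])

INTERSECT_SQL = "SELECT Id FROM T1 INTERSECT SELECT Id FROM T2"
JOIN_SQL = "SELECT T1.Id FROM T1 INNER JOIN T2 ON T1.Id = T2.Id"


def plan(sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


intersect_ids = sorted(row[0] for row in conn.execute(INTERSECT_SQL))
join_ids = sorted(row[0] for row in conn.execute(JOIN_SQL))
intersect_plan = plan(INTERSECT_SQL)
join_plan = plan(JOIN_SQL)

for label, steps in (("INTERSECT", intersect_plan), ("INNER JOIN", join_plan)):
    print(label)
    for step in steps:
        print("   ", step)
```

Both queries return the same Ids here only because Id is unique in each table. INTERSECT always returns distinct rows, while the join returns one row per matching pair, so with duplicate Ids the two results differ.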

Related

Chained CTEs in Redshift - How do I know which DIST KEY the CTE will inherit?

I have a view in Redshift which consists of lots of CTEs that are joined (chained) to each other. Inside these CTEs there are joins between multiple tables. If I then join to a CTE that itself contains a join of multiple tables, where do the SORT KEY and DIST KEY for that join come from? How does Redshift decide which table inside the CTE the CTE should inherit its DIST KEY and SORT KEY from, if it inherits them at all?
For example, tbl1 has a DIST KEY on tbl_key, tbl2 has a DIST KEY on tbl_id, tbl3 has DIST KEY on tbl_key.
First, I create a CTE which is the join of tbl1 and tbl2.
With cte1 as (
Select tbl1.col1, tbl2.col2
From tbl1
Join tbl2 on tbl1.job_no = tbl2.job_id )
Second, I create a CTE that joins to the first CTE
With cte2 as (
Select cte1.*, tbl3.col3
From cte1
Join tbl3 using (tbl_key))
Now my question is, does CTE1 have a DIST KEY on tbl1's DIST KEY of tbl_key or tbl2's DIST KEY of tbl_id? or both? or neither?
In Redshift, CTEs are just used to make SQL easier to read. They are processed exactly like subqueries, i.e. they are not materialized and therefore do not have their own dist/sort keys.
You could rewrite your code as
Select cte1.*, tbl3.col3
From (Select tbl1.col1, tbl2.col2
From tbl1
Join tbl2 on tbl1.job_no = tbl2.job_id
) as cte1
Join tbl3 using (tbl_key)
which can be simplified further as
Select tbl1.col1, tbl2.col2, tbl3.col3
from tbl1
join tbl2 on tbl1.job_no = tbl2.job_id
join tbl3 using (tbl_key)
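The logical equivalence of the three forms above can be checked on any SQL engine. The sketch below uses Python's built-in sqlite3 module as a stand-in, so it says nothing about Redshift's distribution behaviour. The sample rows are hypothetical, and tbl_key is added to the CTE's select list on the assumption that it has to be there for the USING (tbl_key) join to resolve.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl1 (tbl_key INTEGER, job_no INTEGER, col1 TEXT);
    CREATE TABLE tbl2 (tbl_id INTEGER, job_id INTEGER, col2 TEXT);
    CREATE TABLE tbl3 (tbl_key INTEGER, col3 TEXT);
    INSERT INTO tbl1 VALUES (1, 10, 'a'), (2, 20, 'b'), (3, 30, 'c');
    INSERT INTO tbl2 VALUES (7, 10, 'x'), (8, 20, 'y');
    INSERT INTO tbl3 VALUES (1, 'p'), (2, 'q'), (3, 'r');
""")

# Form 1: the join wrapped in a CTE.
CTE_SQL = """
    WITH cte1 AS (
        SELECT tbl1.tbl_key, tbl1.col1, tbl2.col2
        FROM tbl1
        JOIN tbl2 ON tbl1.job_no = tbl2.job_id
    )
    SELECT cte1.col1, cte1.col2, tbl3.col3
    FROM cte1
    JOIN tbl3 USING (tbl_key)
"""

# Form 2: the same join written as an inline subquery.
SUBQUERY_SQL = """
    SELECT cte1.col1, cte1.col2, tbl3.col3
    FROM (SELECT tbl1.tbl_key, tbl1.col1, tbl2.col2
          FROM tbl1
          JOIN tbl2 ON tbl1.job_no = tbl2.job_id
         ) AS cte1
    JOIN tbl3 USING (tbl_key)
"""

# Form 3: everything flattened into a single three-table join.
FLAT_SQL = """
    SELECT tbl1.col1, tbl2.col2, tbl3.col3
    FROM tbl1
    JOIN tbl2 ON tbl1.job_no = tbl2.job_id
    JOIN tbl3 ON tbl1.tbl_key = tbl3.tbl_key
"""

cte_rows = sorted(conn.execute(CTE_SQL).fetchall())
subquery_rows = sorted(conn.execute(SUBQUERY_SQL).fetchall())
flat_rows = sorted(conn.execute(FLAT_SQL).fetchall())

print(cte_rows == subquery_rows == flat_rows)
```

Because the three forms are interchangeable, the optimizer is free to treat the CTE as if you had written the flat join, which is why the CTE has no keys of its own.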
If you are able to choose your dist/sort keys then you should consider which tables are the biggest and prioritise those accordingly.
For example, if tbl1 and tbl2 are large, then it may make sense to have them distributed as you described.
However, if tbl2 and tbl3 are both large, it may make sense to distribute both on tbl_key.
When you issue a query Redshift will compile and optimize that query as it sees fit to achieve the best performance and be logically equivalent. Your CTEs look like subqueries to the compile / optimization process and the order in which the joins are performed may have no relation to how you wrote the query.
Redshift makes these optimization choices based on the table metadata that is created / updated by ANALYZE. If you want Redshift to make smart choices on how to join your tables together you will want your table metadata to be up to date. The query plan (including join order and data distribution) is set at query compile, it is not dynamically determined during execution.
One of the choices Redshift makes is how the intermediate data of the query is distributed (your question), but remember that these intermediate results can be for a modified join order. To see the order in which Redshift plans to join your tables, look at the EXPLAIN plan for the query. The more tables you are joining and the more complex your query, the more choices Redshift has and the less likely it is that the EXPLAIN plan will join in the order you specified. I've worked on clients' queries with dozens of joins and many nested levels of subquery, and the EXPLAIN plan is often very different from the original query as written.
So Redshift is trying to make smart choices about the join order and intermediate result distribution. For example, it will usually join small tables to large tables first and keep the distribution of the large table. But here "large" and "small" are based on post-WHERE-clause filtering and the guesses Redshift can make based on metadata. The further a join is from the source table metadata (deep into the join tree), the more uncertain Redshift is about what the incoming and outgoing data of the join will look like.
Here the EXPLAIN plan can give you hints about what Redshift is "thinking". If you see a DIST INNER join (DS_DIST_INNER in the plan output), Redshift is moving the data of one table (or intermediate result set) to match the distribution of the other. If you see DIST BOTH (DS_DIST_BOTH), Redshift is redistributing both sets of data to some new distribution (usually on one of the join columns). It does this to avoid having only 1 slice with data and all others with nothing to do, as this would be very inefficient.
To sum up: to see what Redshift is planning to do with your joins, look at the EXPLAIN plan. You can also infer some information about intermediate result distribution from the EXPLAIN plan, but it doesn't provide a complete map of what Redshift plans to do.

inserting records from two different tables into a single table in oracle

I want to insert data from two different tables (say table A and table B ) into a third table (table C) in oracle.
I have written two different cursors for fetching data from table A and B separately, and populated two collections based on these two tables.
Now I want to insert the data in those two collections into the third table (table C). How can I get this done?
There are two common columns that are present in both tables, for example ID and YEARMONTH; these two columns exist in all three tables (A, B and C).
I have tried doing a merge based on these two fields,
but I am looking for a more efficient and more convenient way to do this.
You didn't provide code you wrote, so I'll guess: cursors mean PL/SQL. If you're doing it in a loop, row-by-row, it'll be slow-by-slow.
As there are common columns in both tables (A and B), I'd suggest doing it in pure SQL: join those two tables and insert the result into C. Something like
insert into c (id, yearmonth, ...)
select a.id, a.yearmonth, ...
from a join b on a.id = b.id and a.yearmonth = b.yearmonth;
Make sure that indexes exist on columns you use to join tables. Or, even better, compare explain plans in both cases (with and without indexes) and choose an option which seems to be the best.
insert into tableC
select * from tableA where ...
union
select * from tableB where ...
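To make the set-based approach concrete, here is a minimal sketch that runs a single INSERT ... SELECT instead of cursors and collections. It uses Python's built-in sqlite3 module as a stand-in for Oracle, and the amount_a and amount_b columns and all sample rows are hypothetical, since the real table definitions were not posted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER, yearmonth TEXT, amount_a INTEGER);
    CREATE TABLE b (id INTEGER, yearmonth TEXT, amount_b INTEGER);
    CREATE TABLE c (id INTEGER, yearmonth TEXT, amount_a INTEGER, amount_b INTEGER);
    INSERT INTO a VALUES (1, '202401', 100), (2, '202401', 200), (3, '202402', 300);
    INSERT INTO b VALUES (1, '202401', 11), (2, '202401', 22), (4, '202402', 44);
""")

# One set-based statement: join A and B on both common columns and
# insert the combined rows into C, with no row-by-row loop.
conn.execute("""
    INSERT INTO c (id, yearmonth, amount_a, amount_b)
    SELECT a.id, a.yearmonth, a.amount_a, b.amount_b
    FROM a
    JOIN b ON a.id = b.id AND a.yearmonth = b.yearmonth
""")
conn.commit()

rows_in_c = conn.execute(
    "SELECT id, yearmonth, amount_a, amount_b FROM c ORDER BY id"
).fetchall()
print(rows_in_c)
```

The join variant combines columns from A and B into one row per matching key, so keys present in only one table are left out. The UNION variant stacks rows from both tables instead, and UNION removes duplicate rows where UNION ALL would keep them.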

Partition pruning issue

I’m joining 2 tables. Pruning is happening on table 1 but not on table 2 even though there is an outer join.
Example:
select *
from table1 t1, table2 t2
where t1.sk in (select sk from filter_table)
and t2.sk(+) = t1.sk
When I checked the plan, I noticed that table T1 has a KEY partition scan, but T2 is scanning all the partitions (~4500), so the query is taking more than 4 hours just to pull 50 records.
Is there any way to force the pruning on table 2 as well?
I am using Oracle 11g.
Without more data it is hard to say for sure what the problem can be. I have rewritten the query for clarity and with a simple test schema I get pruning for both tables with Oracle 12c (I don't have 11g handy). The first with key and the second with Bloom Filter (:BF0000 in the plan).
select t1.*, t2.*
from filter_table ft
join table1 t1 on t1.sk = ft.sk
left outer join table2 t2 on t2.sk = ft.sk;
Be sure to gather statistics for all three tables! Often when the optimizer seems to be stupid it is because the statistics are missing or not up to date.

Wrong index is chosen by Oracle

I have a problem with indexing in Oracle. I will try to explain it with the following example.
I have a table TABLE1 with columns A,B,C,D
another table TABLE2 with columns A,B,C,E,F,H
I have created Indexes for TABLE1
IX_1 A
IX_2 A,B
IX_3 A,C
IX_4 A,B,C
I have created Indexes for TABLE2
IY_1 A,B,C
IY_2 A
When I run a query similar to this:
SELECT * FROM TABLE1 T1,TABLE2 T2
WHERE T1.A=T2.A
When I run EXPLAIN PLAN, I see that it uses neither IX_1 nor IY_2.
It uses IX_4 and IY_1 instead.
Why is it not picking the right index?
EDITED:
Can anyone help me understand the difference between INDEX RANGE SCAN, INDEX UNIQUE SCAN and INDEX SKIP SCAN?
I guess SKIP SCAN means that Oracle skips a column in a composite index.
As for the others, I have no idea!
The main benefit of indexes is that you can select a few rows from a table without scanning the entire table.
If you ask for too many rows (say 30%, though the threshold depends on many things), the engine will prefer to scan the entire table for those rows.
That's because reading a row through an index carries an overhead: the engine first reads some index blocks and only then reads the table blocks.
In your case, in order to join tables T1 and T2, Oracle needs all the rows from both tables. Reading the full index as well would be a useless operation that only adds unnecessary cost.
UPDATE: A step forward: if you run:
SELECT T1.B, T2.B FROM TABLE1 T1,TABLE2 T2
WHERE T1.A=T2.A
Oracle will probably use the indexes (IX_2 and IY_1), because it does not need to read anything from the tables: the values of A and B for both T1 and T2 are already in those indexes.
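The index-only effect can be observed in a small experiment. The sketch below uses Python's built-in sqlite3 module, so it shows SQLite's planner and not Oracle's; it only illustrates the general idea that a query touching nothing but indexed columns can be answered from the index alone. Only the two composite indexes are created, to keep the plan easy to read.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TABLE1 (A INTEGER, B INTEGER, C INTEGER, D TEXT);
    CREATE TABLE TABLE2 (A INTEGER, B INTEGER, C INTEGER, E TEXT, F TEXT, H TEXT);
    CREATE INDEX IX_2 ON TABLE1 (A, B);
    CREATE INDEX IY_1 ON TABLE2 (A, B, C);
""")


def plan(sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


# SELECT * needs columns that are not in any index, so table rows must be read.
wide_plan = plan(
    "SELECT * FROM TABLE1 T1, TABLE2 T2 WHERE T1.A = T2.A"
)

# Only A and B are needed, and both indexes contain them.
narrow_plan = plan(
    "SELECT T1.B, T2.B FROM TABLE1 T1, TABLE2 T2 WHERE T1.A = T2.A"
)

narrow_uses_covering = any("COVERING INDEX" in step for step in narrow_plan)

print("SELECT *      :", wide_plan)
print("SELECT B only :", narrow_plan)
print("index-only access in narrow query:", narrow_uses_covering)
```

In Oracle the corresponding check is to compare the EXPLAIN PLAN output of the two queries and see whether the table access steps disappear for the narrow one.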

ssis - merge join alternative

I have a table T1 in database D1 and table T2 in database D2. From T2 I need only those records whose primary keys are listed in T1.
The only way that I know so far is to use Merge Join (Inner Join). Since T2 contains many more records than T1, Merge Join would eliminate all records from T2 that don't exist in T1. Since this method is very slow, is there any other way to do this task?
Thanks,
Ilija
Is there a reason the Lookup Transformation won't work?
Are D1 and D2 both on the same SQL Server instance? If so, the query is trivially easy to write:
SELECT t2.*
FROM D2.schema2.T2 t2
JOIN D1.schema1.T1 t1 ON t1.id = t2.id
(Obviously, you'd have to substitute the real names of the primary key column(s) in the join, as well as the schemas that T1 and T2 live under.)
Or you could make your data flow source be a query with the join rather than be a table.
