Chained CTEs in Redshift - How do I know which DIST KEY the CTE will inherit? - view

I have a view in Redshift which consists of lots of CTEs that are joined (chained) to each other. Inside these CTEs there are joins between multiple tables. If I then join to a CTE that itself contains a join of multiple tables, where do the SORT KEY and DIST KEY for that join come from? How does Redshift decide which table inside the CTE the CTE should inherit its DIST KEY and SORT KEY from, if at all?
For example, tbl1 has a DIST KEY on tbl_key, tbl2 has a DIST KEY on tbl_id, tbl3 has DIST KEY on tbl_key.
First, I create a CTE which is the join of tbl1 and tbl2.
With cte1 as (
Select tbl1.col1, tbl2.col2
From tbl1
Join tbl2 on tbl1.job_no = tbl2.job_id )
Second, I create a CTE that joins to the first CTE
With cte2 as (
Select cte1.*, tbl3.col3
From cte1
Join tbl3 using (tbl_key))
Now my question is, does CTE1 have a DIST KEY on tbl1's DIST KEY of tbl_key or tbl2's DIST KEY of tbl_id? or both? or neither?

In Redshift, CTEs are only a way to make SQL easier to read. They are processed exactly like subqueries: they are not materialized as physical tables, so they do not have a dist or sort key of their own.
You could rewrite your code as
Select cte1.*, tbl3.col3
From (Select tbl1.col1, tbl2.col2
From tbl1
Join tbl2 on tbl1.job_no = tbl2.job_id
) as cte1
Join tbl3 using (tbl_key)
which can be simplified further as
Select tbl1.col1, tbl2.col2, tbl3.col3
from tbl1
join tbl2 on tbl1.job_no = tbl2.job_id
join tbl3 using (tbl_key)
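For comparison, here is a sketch of the original chained-CTE form written as a single statement. It assumes tbl_key is a column of tbl1 (the question only says tbl1 is distributed on it), and it adds tbl_key to the select list of cte1, because the CTE as posted does not expose that column to the later join. Running EXPLAIN on this and on the flattened query is one way to check whether Redshift plans them the same way.

```sql
-- Assumption: tbl_key is a column of tbl1. It has to be selected in cte1
-- so that cte2 can join on it.
EXPLAIN
WITH cte1 AS (
    SELECT tbl1.tbl_key, tbl1.col1, tbl2.col2
    FROM tbl1
    JOIN tbl2 ON tbl1.job_no = tbl2.job_id
),
cte2 AS (
    SELECT cte1.col1, cte1.col2, tbl3.col3
    FROM cte1
    JOIN tbl3 USING (tbl_key)
)
SELECT col1, col2, col3
FROM cte2;
```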
If you are able to choose your dist/sort keys, consider which tables are the biggest and prioritise those accordingly.
For example, if tbl1 and tbl2 are large, it may make sense to have them distributed as you described.
However, if tbl2 and tbl3 are both large, it may make sense to distribute both on tbl_key.
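If you do decide to change a distribution key, a minimal sketch could look like the following. It assumes tbl2 actually has a tbl_key column, which the question does not state, so treat the column name as a placeholder.

```sql
-- Assumption: tbl2 has a tbl_key column that can serve as its DISTKEY.
ALTER TABLE tbl2 ALTER DISTKEY tbl_key;

-- Confirm the distribution style and first sort key of both tables.
SELECT "table", diststyle, sortkey1
FROM svv_table_info
WHERE "table" IN ('tbl2', 'tbl3');
```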

When you issue a query Redshift will compile and optimize that query as it sees fit to achieve the best performance and be logically equivalent. Your CTEs look like subqueries to the compile / optimization process and the order in which the joins are performed may have no relation to how you wrote the query.
Redshift makes these optimization choices based on the table metadata that is created and updated by ANALYZE. If you want Redshift to make smart choices about how to join your tables, keep that metadata up to date. The query plan (including join order and data distribution) is fixed when the query is compiled; it is not determined dynamically during execution.
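As a small sketch of keeping that metadata fresh, the statements below re-analyze the three tables from the question and then check how stale the statistics are. The table names are the ones from the question; the stats_off column of SVV_TABLE_INFO ranges from 0 (current) to 100 (out of date).

```sql
-- Refresh planner statistics for the tables involved in the join.
ANALYZE tbl1;
ANALYZE tbl2;
ANALYZE tbl3;

-- stats_off: 0 means statistics are current, 100 means they are out of date.
SELECT "table", tbl_rows, stats_off
FROM svv_table_info
WHERE "table" IN ('tbl1', 'tbl2', 'tbl3');
```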
One of the choices Redshift makes is how the intermediate data of the query is distributed (your question), but remember that these intermediate results may belong to a modified join order. To see the order in which Redshift plans to join your tables, look at the EXPLAIN plan for the query. The more tables you are joining and the more complex your query, the more choices Redshift has and the less likely it is that the EXPLAIN plan will join in the order you specified. I've worked on clients' queries with dozens of joins and many nested levels of subquery, and the EXPLAIN plan is often very different from the original query as written.
So Redshift is trying to make smart choices about the join order and intermediate result distribution. For example, it will usually join small tables to large tables first and keep the distribution of the large table. But here large and small are based on post-WHERE-clause filtering and the guesses Redshift can make based on metadata. The further a join is from the source table metadata (deep in the join tree), the less certain Redshift is about what the incoming and outgoing data of the join will look like.
Here the EXPLAIN plan can give you hints about what Redshift is "thinking". If you see a DIST INNER join (DS_DIST_INNER in the plan output), Redshift is moving the data of one table (or intermediate result set) to match the other. If you see DIST BOTH (DS_DIST_BOTH), Redshift is redistributing both sets of data to some new distribution (usually one of the join columns). It does this to avoid leaving one slice with all the data and the others with nothing to do, which would be very inefficient.
To sum up: to see what Redshift is planning to do with your joins, look at the EXPLAIN plan. You can also infer some information about intermediate result distribution from it, but it doesn't provide a complete map of what Redshift plans to do.
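A minimal sketch of that check, using the flattened query from the question. It assumes tbl_key exists on both tbl1 and tbl3; the join condition is written out with ON to avoid any ambiguity. The comments list the redistribution labels to look for on each join step.

```sql
-- Assumption: tbl1 and tbl3 both have a tbl_key column.
EXPLAIN
SELECT tbl1.col1, tbl2.col2, tbl3.col3
FROM tbl1
JOIN tbl2 ON tbl1.job_no = tbl2.job_id
JOIN tbl3 ON tbl1.tbl_key = tbl3.tbl_key;

-- Labels that can appear on the join steps of the plan:
--   DS_DIST_NONE   no redistribution needed, the rows are already collocated
--   DS_DIST_INNER  the inner table is redistributed
--   DS_DIST_BOTH   both tables are redistributed
--   DS_BCAST_INNER a copy of the inner table is broadcast to all compute nodes
```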

Related

Why Access and Filter Predicates are the same here?

When I get the autotrace output of the query above using Oracle SQL Developer, I see that the join condition is used for both the access and the filter predicates. My question is: does it read all the department_ids from DEPT_ID_PK and then use these IDs to access and filter the employees table? If so, why does the employees table have a full table scan? Why does it read the employees table again using the department_ids of the departments table? Could anyone please walk through this execution plan step by step, and explain why the access and filter predicates are used here?
Best Regards
It is a merge join (a bit like a hash join). A merge join is used when the projections of the joined tables are sorted on the join columns. Merge joins can be faster and use less memory than hash joins.
So Oracle does a full table scan of the outer table (EMPLOYEES) and then reads the inner table in an ordered manner.
The filter predicate is the column on which the projection will be done.
more details: https://datacadamia.com/db/oracle/merge_join
It uses the primary key to avoid sorting, otherwise the plan would be like this
The distinction between "Access predicates" and "Filter predicates" is not particularly consistent, so take them with a healthy amount of skepticism. For example, if you remove the USE_MERGE hint, there would be no Filter Predicates in the plan any more, and the Access Predicates node would be relocated under the HASH_JOIN node (where it makes more sense for MERGE_JOIN as well):

Oracle SQL - Does the JOIN order in a FROM clause impact performance optimization?

A long while ago I was once told during a SQL course that the JOIN order in a FROM clause of a query can impact the performance of the query. So for example if I had the following
SELECT * FROM
TABLE_1 INNER JOIN --5000 rows
TABLE_2 ON TABLE_1.COL1=TABLE_2.COL1 INNER JOIN --200 rows
TABLE_3 ON TABLE_2.COL1=TABLE_3.COL1--50 rows
.....
This should be reordered to the following
SELECT * FROM
TABLE_3 INNER JOIN --50 rows
TABLE_2 ON TABLE_2.COL1=TABLE_3.COL1 INNER JOIN --200 rows
TABLE_1 ON TABLE_1.COL1=TABLE_2.COL1 --5000 rows
.....
So the leading/driving table is the one with the fewest rows (hypothetically). I have read, though, that unless a HINT is used to force the order, the cost-based optimizer within Oracle will just rearrange the JOINs as it sees fit.
Just curious, does the JOIN order without using HINTS matter in a SQL statement?
Exactly. A long time ago there was an impact, when the RBO (rule-based optimizer) was used.
In modern Oracle releases, CBO (Cost based optimizer) chooses the best execution plan and does that dirty job for you so - no, you don't have to reorder tables any more.
does the JOIN order without using HINTS matter in a SQL statement?
No. That's basic optimisation for the database; the optimizer will decide what is the best strategy to join the tables, regardless of the order in which they appear in the from clause.
The Oracle optimizer's "Query Transformer" component rewrites your query automatically, using the available statistics, when it finds that the query should be transformed.
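One way to see this for yourself is to compare the plan the optimizer chooses on its own with a plan where the order is forced. The sketch below uses the table names from the question; the LEADING hint is included only for comparison and is not a recommendation.

```sql
-- Let the cost-based optimizer choose the join order.
EXPLAIN PLAN FOR
SELECT *
FROM TABLE_1
INNER JOIN TABLE_2 ON TABLE_1.COL1 = TABLE_2.COL1
INNER JOIN TABLE_3 ON TABLE_2.COL1 = TABLE_3.COL1;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

-- Force a specific join order with a hint, for comparison only.
EXPLAIN PLAN FOR
SELECT /*+ LEADING(TABLE_3 TABLE_2 TABLE_1) */ *
FROM TABLE_1
INNER JOIN TABLE_2 ON TABLE_1.COL1 = TABLE_2.COL1
INNER JOIN TABLE_3 ON TABLE_2.COL1 = TABLE_3.COL1;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```

Swapping the order of the tables in the FROM clause of the unhinted query and comparing the two plans shows whether the written order made any difference.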

indexed view vs temp table to improve performance of a seldom executed query

I have a slow query whose structure is:
select
fields
from
table
join
manytables
join (select fields from tables) as V1 on V1.field = ....
join (select fields1 from othertables) as V2
join (select fields2 from moretables) as V3
The subqueries in the last 3 joins are relatively simple, but joins against them take time. If they were physical tables it would be much better.
So I found out that I could turn the subqueries into indexed views or temp tables.
By temp table I do not mean a table that is written hourly, like explained here,
but a temp table that is created before the query execution.
Now my doubt comes from the fact that indexed views are OK in data warehouses since they impact performance. This DB is not a data warehouse but the production DB of a non-data-intensive application.
But in my case the above query is not executed often, even though the underlying tables (the tables whose data would become part of the indexed view) are used more often.
In this case, is it OK to use indexed views? Or should I favor a temp table?
A table variable with the PRIMARY KEY keyword is also an alternative.
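To make the temp-table option concrete, here is a minimal sketch, assuming SQL Server (suggested by the mention of indexed views and table variables). All table and column names (dbo.Orders, dbo.Customers, CustomerId, Amount, Name) are hypothetical stand-ins for the real tables behind the V1 subquery.

```sql
-- Hypothetical names throughout; replace with the tables behind V1.
-- 1. Materialize the subquery once, just before the main query runs.
SELECT o.CustomerId, SUM(o.Amount) AS TotalAmount
INTO #V1
FROM dbo.Orders AS o
GROUP BY o.CustomerId;

-- 2. Index the temp table on the join column.
CREATE CLUSTERED INDEX IX_V1_CustomerId ON #V1 (CustomerId);

-- 3. Join against the temp table instead of the inline subquery.
SELECT c.CustomerId, c.Name, v1.TotalAmount
FROM dbo.Customers AS c
JOIN #V1 AS v1 ON v1.CustomerId = c.CustomerId;

-- 4. Clean up.
DROP TABLE #V1;
```

Unlike an indexed view, this adds no maintenance cost to writes on the underlying tables; the cost is paid only when the seldom-executed query runs.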

how to join two tables in oracle on blob column?

How do I join two tables in Oracle on a BLOB column?
When this query is executed, a "SQL command not properly ended" error message appears:
select name,photo
from tbl1 join tbl2 on tbl1.photo = tbl2.photo
First, it seems very very odd to have a design where you are storing the same blob in two different tables and very odd that you would want to join on an image. That doesn't seem like a sensible design.
You've tagged this for Oracle 8i. That is an ancient version of Oracle that didn't support the SQL 99 join syntax. You would need to do the join in the where clause instead. You can't directly test for equality between two blob values. But you can use dbms_lob.compare
select name, tbl1.photo
from tbl1,
tbl2
where dbms_lob.compare(tbl1.photo, tbl2.photo) = 0
This will be rather hideous from a performance perspective. You'll have to compare every photo from tbl1 against every photo from tbl2 and comparing two lobs isn't particularly quick. If you are really intent on comparing images, you are probably better off computing a hash, storing that in a separate column that is indexed, and then comparing the hashes rather than comparing the images directly.
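A rough sketch of the hash approach follows. It assumes Oracle 10g or later (DBMS_CRYPTO is not available in 8i) and EXECUTE privilege on DBMS_CRYPTO; the photo_hash column and the index names are made up for illustration. The final dbms_lob.compare guards against hash collisions.

```sql
-- Assumption: Oracle 10g+ with EXECUTE on DBMS_CRYPTO.
-- photo_hash and the index names are hypothetical.
ALTER TABLE tbl1 ADD (photo_hash RAW(20));
ALTER TABLE tbl2 ADD (photo_hash RAW(20));

DECLARE
  -- SHA-1 produces a 20-byte digest, matching RAW(20) above.
  l_algo PLS_INTEGER := DBMS_CRYPTO.HASH_SH1;
BEGIN
  UPDATE tbl1 SET photo_hash = DBMS_CRYPTO.HASH(photo, l_algo)
  WHERE photo IS NOT NULL;
  UPDATE tbl2 SET photo_hash = DBMS_CRYPTO.HASH(photo, l_algo)
  WHERE photo IS NOT NULL;
  COMMIT;
END;
/

CREATE INDEX tbl1_photo_hash_ix ON tbl1 (photo_hash);
CREATE INDEX tbl2_photo_hash_ix ON tbl2 (photo_hash);

-- Join on the indexed hash first, then confirm with a full LOB comparison.
SELECT tbl1.photo
FROM tbl1,
     tbl2
WHERE tbl1.photo_hash = tbl2.photo_hash
  AND dbms_lob.compare(tbl1.photo, tbl2.photo) = 0;
```

The hash columns have to be kept in sync when photos change, for example from the code that writes them.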
The code:
SELECT
name, photo
FROM
tbl1 T1
INNER JOIN
tbl2 T2
ON
T1.photo = T2.photo
If that does not run, you would have to make a few changes to your table structure:
1. Add a new table named IMAGES with columns (image_id, image_blob).
2. Change tbl1's blob and tbl2's blob columns to image_id.
3. Perform the join on the image_id column.
NOTE: You cannot perform GROUP BY, JOIN (of any kind), or CONCAT operations on the BLOB datatype.
SUGGESTION: Save the paths to the images in the database and store the images themselves in a directory on the server, as saving images as BLOBs in the database is not good practice.

How to use Oracle Materialized View in a Dimensional Model

I have a dimensional model with a large fact table (millions of rows) which is range partitioned by date, and smaller dimension tables that are not partitioned. I came across materialized views, which are often used in these scenarios to improve query performance.
Now, I want to know which of the following two ways of using these materialized views is better for getting aggregated reports.
A. Create one by joining the whole fact table with each of the dimension tables required.
create materialized view my_mview build immediate enable query rewrite as
select
fact.col1, dim1.col2, dim2.col3, sum(fact.col4) as sum_col4
from
my_fact fact
inner join
my_dim1 dim1
on fact.dim1_key = dim1.dim1_key
inner join
my_dim2 dim2
on fact.dim2_key = dim2.dim2_key group by fact.col1, dim1.col2, dim2.col3
This seems like the most basic way of using them. But it seems rather limiting, and I would require a new materialized view for each variation of the query I want to create.
B. Create it over the aggregation of the fact table and utilize the query rewrite when doing a dimensional join back.
create materialized view my_mview build immediate enable query rewrite as
select
col1, dim1.dim2_key, dim2.dim_key, sum(fact.col4)
from
my_fact fact
And do the join as above in case A, which will use this aggregated materialized view for the join and not the whole fact table.
Can anyone tell me when I would use each case or the other?
Your first example works exactly as you described.
For the second example the query should be:
create materialized view my_mview build immediate enable query rewrite as
select
col1, fact.dim2_key, fact.dim_key, sum(fact.col4) as sum_col4
from
my_fact fact
group by
col1, fact.dim2_key, fact.dim_key
This will automatically speed up aggregates such as:
select sum(fact.col4)
from my_fact fact;
select fact.dim_key, sum(fact.col4)
from my_fact fact
group by fact.dim_key;
select fact.dim2_key, sum(fact.col4)
from my_fact fact
group by fact.dim2_key;
I don't think Oracle will rewrite your first type of query to this MV automatically because in the MV the join columns are already grouped by (They also should be grouped in your second example). It never happened for us. This however may also depend on if there are relationships defined between dim and fact table and the value of QUERY_REWRITE_INTEGRITY parameter, so there is still some room for testing here.
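One way to test whether a particular query is rewritten is to look at its execution plan. The sketch below assumes a materialized view named my_mview with query rewrite enabled already exists over my_fact, as in the second example; when the rewrite happens, the plan shows a MAT_VIEW REWRITE ACCESS operation on my_mview instead of a scan of my_fact.

```sql
-- Assumption: my_mview exists and was created with ENABLE QUERY REWRITE.
ALTER SESSION SET query_rewrite_enabled = TRUE;

EXPLAIN PLAN FOR
SELECT fact.dim2_key, SUM(fact.col4)
FROM my_fact fact
GROUP BY fact.dim2_key;

-- Look for "MAT_VIEW REWRITE ACCESS" on MY_MVIEW in the output.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```

Repeating the same check with the dimension joins added shows whether the join-back query is rewritten in your environment.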
You may still get a performance gain by writing the query in a specific way:
WITH preaggr as (
select
col1, fact.dim2_key, fact.dim_key, sum(fact.col4) as col4
from
my_fact fact
group by
col1, fact.dim2_key, fact.dim_key
)
select
dim2.col1,
sum(preaggr.col4)
from
preaggr
join
my_dim2 dim2
on
preaggr.dim2_key = dim2.dim2_key
group by
dim2.col1
