Can someone explain to me how the Cartesian product works in relational algebra?

Here it says:
Selection and cross product
Cross product is the costliest operator to evaluate. If the input relations have N and M rows, the result will contain N*M rows. Therefore it is very important to do our best to decrease the size of both operands before applying the cross product operator.
Suppose that we have 2 relations.
The first relation is called student and has 3 attributes:
student
|a |b |c |
------------
|__|___|___|
|__|___|___|
|__|___|___|
The second relation is called university, again with 3 attributes:
university
|e |f |g |
------------
|__|___|___|
|__|___|___|
|__|___|___|
We have 3 rows in each relation, so after applying the cross product operation we will get a relation which has 3*3 = 9 rows.
Now, I don't understand: why 9 and not 3?
Won't the final relation be
final relation
|a |b |c |e |f |g |
------------------------
|__|___|___|__|___|__|
|__|___|___|__|___|__|
|__|___|___|__|___|__|
Doesn't this have 3 rows again?
Thanks

If the rows in Student are row1, row2 and row3, and the rows in University are row4, row5 and row6, then the Cartesian product will contain
(row1, row4), (row1, row5), (row1, row6), (row2, row4), (row2, row5), (row2, row6), (row3, row4), (row3, row5), (row3, row6)
Each possible combination of rows. That's how it is defined. Nothing more to it.
One caveat about the remark you quoted, "Therefore it is very important to do our best to decrease the size of both operands before applying the cross product operator": it is important to realise that there exist optimizers which are able to "rewrite" certain algebra operations. It is certainly not the case that the onus is always on the query writer to determine the "most appropriate way of combining restrictions with other operations". In fact, "moving restrictions to the inside as far as possible" is one of the things industrial optimizers are actually very good at.
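As an illustrative sketch of such a rewrite (reusing the student and university tables from the question, with a hypothetical filter on student.a), pushing the restriction inside the product keeps the intermediate result small; a typical optimizer performs this transformation automatically:
-- Restriction applied after the full 3 x 3 product is formed:
SELECT *
FROM student CROSS JOIN university
WHERE student.a = 'x';
-- Logically equivalent, but the restriction is applied first,
-- so the product is built from fewer student rows:
SELECT *
FROM (SELECT * FROM student WHERE a = 'x') s
CROSS JOIN university;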

Just imagine that you have two tables, one with the students and one with the universities. When you run a Cartesian query against a relational database, you will get a row for every student joined to every university.
Select *
From students,
universities;
OR
SELECT * FROM students CROSS JOIN universities
I know this has little to do with algebra, but since you're on Stack Overflow :D

There is no common attribute to link student and university, so each row in student is matched to each row in university: 3 * 3 = 9. Writing a, b, c for the three rows of student and e, f, g for the three rows of university, the pairings are:

|a|e|
|a|f|
|a|g|
|b|e|
|b|f|
|b|g|
|c|e|
|c|f|
|c|g|
Therefore 9
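A quick way to convince yourself in SQL (assuming the two three-row tables above actually exist):
SELECT COUNT(*) FROM student CROSS JOIN university; -- returns 3 * 3 = 9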

Related

Is this natural join operation used correctly? (Relational Algebra)

I have the following task given by the professor (based on an E-R model):
Assume the companies may be located in several cities. Find all companies located in every city in which “Small Bank Corporation” is
located.
Now the professor's solution is the following:
s ← Π city (σ company_name=’Small Bank Corporation’ (company))
temp1 ← Π comp_id, company_name (company)
temp2 ← Π comp_id, company_name ((temp1 × s) − company)
result ← Π company_name (temp1 − temp2)
I myself found a completely different solution with a natural join operation, which seems much simpler:
What I tried to do was use the natural join operation, which we defined as follows: relations r and s are joined on their common attributes. So I tried to get all city names by using a projection on a selection of all companies with the company_name "Small Bank Corporation". After that I joined the table with the city names with the company table, so that I get all company entries which have the city names in them.
company ⋈ Π city (σ company_name=”Small Bank Corporation” (company))
My question now is whether my solution is also valid, since it seems a little bit too trivial?
Yours isn't the same.
My answer here explains how to query relationally, using a version of the relational algebra where headings are sets of attribute names. Another answer of mine summarizes it:
Every query expression has an associated (characteristic) predicate--a statement template parameterized by attributes. The tuples that make the predicate into a true proposition--a statement--are in the relation.
We are given the predicates for expressions that are relation names.
Let query expression E have predicate e. Then:
R ⨝ S has predicate r and s
R ∪ S has predicate r or s
R - S has predicate r and not s
σ p (R) has predicate r and p
π A (R) has predicate exists non-A attributes of R [r]
When we want the tuples satisfying a certain predicate, we find a way to express that predicate in terms of relational operator transformations of given relation predicates. The corresponding query returns/calculates the tuples.
Your solution
company ⋈ Π city (σ company_name=”Small Bank Corporation” (company))
is rows where
company company_id named company_name is in city
AND FOR SOME company_id & company_name [
company company_id named company_name is in city
AND company_name=”Small Bank Corporation”]
ie
company company_id named company_name is in city
AND FOR SOME company_id [
company company_id named ”Small Bank Corporation” is in city]
ie
company company_id named company_name is in city
AND some company named ”Small Bank Corporation” is in city
You are returning rows that have more columns than just company_name. But your companies are not the requested companies.
Projecting your rows on company_name gives rows where
some company named company_name is in some city
AND some company named ”Small Bank Corporation” is in that city
After that I joined the table with the city names with the company table, so that I get all company entries which have the city names in them.
That isn't clear about what you get. However the companies in your rows are those in at least one of the SBC cities. The request was for those in all of the SBC cities:
companies located in every city in which “Small Bank Corporation” is located
The links I gave tell you how to compose queries, and also how to convert between query result specifications & relational algebra expressions returning a result.
When you see a query for rows matching "every" or "all" of some other rows you can expect that that part of your query involves relational-division or some related idiom. The exact algebra depends on what is intended by the--frequently poorly/ambiguously expressed--requirements. Eg whether "companies located in every city in which" is supposed to be no companies (division) or all companies (related idiom) when there are no such cities. (The normal mathematical interpretation of your assignment is the latter.) Eg whether they want companies in exactly all such cities or at least all such cities.
(It helps to avoid "all" & "every" after "find" & "return", where it is redundant anyway.)
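For concreteness, here is a common SQL rendering of the division reading, assuming a schema company(comp_id, company_name, city); this is a sketch of the idiom, not the professor's exact formulation:
SELECT DISTINCT c.company_name
FROM company c
WHERE NOT EXISTS
  (SELECT s.city
   FROM company s
   WHERE s.company_name = 'Small Bank Corporation'
   AND NOT EXISTS
     (SELECT 1
      FROM company c2
      WHERE c2.company_name = c.company_name
      AND c2.city = s.city));
It reads: return companies for which there is no SBC city that the company is missing from. Note that under this form every company qualifies vacuously when SBC is located in no city at all, matching the "related idiom" reading above.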
Database Relational Algebra: How to find actors who have played in ALL movies produced by “Universal Studios”?
How to understand u=r÷s, the division operator, in relational algebra?
How to find all pizzerias that serve every pizza eaten by people over 30?

NOT IN subquery slow and runs out of memory (ClickHouse)

I have a single table holding DNA variants for different people. I want to show the variants that are unique to a person:
Table DNA (engine ordered by variant):
person | variant
p1 | v1
p1 | v2
p1 | v3
p2 | v2
p2 | v3
p3 | v2
p3 | v3
p4 | v2
p4 | v3
So a simple query:
select variant from DNA where person = 'p1' and variant
not in (select variant from DNA where person in ('p2', 'p3'))
will return all variants unique to p1 vs. p2 and p3 (p4 is not considered for this query). However, it is slow and runs out of memory.
Should I be doing this a different way?
I suspect that the reason it is running out of memory is that the select variant from DNA where person in ('p2', 'p3') subquery will result in v2, v3, v2, v3. Especially at scale, this is exceedingly inefficient because of the repetition. Adding distinct to the subquery may help, but in general this seems like an inefficient method of achieving your results if you have a lot of people (you'd have to manually type a lot of people in where person in (.........)).
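For example, a minimal variant of the original query with the deduplication applied (still comparing p1 against p2 and p3 only):
select variant from DNA where person = 'p1'
and variant not in (select distinct variant from DNA where person in ('p2', 'p3'));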
An alternative to this is to do a self join and basically limit the results to those where the only match is itself. Something like:
SELECT person, variant
FROM
(
    -- one row per pair of rows that share a variant (each row matches at least itself)
    SELECT a.person AS person, a.variant AS variant
    FROM DNA AS a
    ALL LEFT JOIN DNA AS b USING (variant)
)
GROUP BY person, variant
HAVING count() = 1; -- the only match is the row itself
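The count() = 1 test works because every (person, variant) row joins at least to itself, so a count of exactly one means no other person shares that variant. This still scans the whole table, so depending on your data you may want to pre-filter the set of people you care about.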

Column that sums values once per unique ID, while filtering on type (Oracle Fusion Transportation Intelligence)

I realize that this has been discussed before, but I haven't seen a solution using a simple CASE expression for adding a column in Oracle FTI, which is as far as my experience goes at the moment, unfortunately. My end goal is to have a total Weight for each Category, counting only the null-Type entries and only one Weight per ID (I don't know why null was chosen as the default Type). I need to break the data apart by Type for a total Cost column, which is working fine, so I didn't include that in the example data below; but because I have to break the data up by Type, I am having trouble eliminating redundant values in my Total Weight results.
My original column which included redundant weights was as follows:
SUM(CASE Type
WHEN null
THEN 'Weight'
ELSE null
END)
Some additional info:
Each ID can have multiple Types (additionally each ID may not always have A or B but should always have null)
Each ID can only have one Weight (But when broken apart by type the value just repeats and messes up results)
Each ID can only have one Category (This doesn't really matter since I already separate this out with a Category column in the results)
Example Data:
ID |Categ. |Type | Weight
1 | Old | A | 1600
1 | Old | B | 1600
1 | Old |(null) | 1600
2 | Old | B | 400
2 | Old |(null) | 400
2 | Old |(null) | 400
3 | New | A | 500
3 | New | B | 500
3 | New |(null) | 500
4 | New | A | 500
4 | New |(null) | 500
4 | New |(null) | 500
Desired Results:
Categ. | Total Weight
Old | 2000
New | 1000
I was trying to figure out how to include a DISTINCT based on ID in the column, but when I put DISTINCT in front of CASE it just eliminates redundant weights so I would just get 500 for Total Weight New.
Additionally, I thought it would be possible to divide the weight by the count of weights before aggregating them, but this didn't seem to work either:
SUM(CASE Type
WHEN null
THEN 'Weight'/COUNT(CASE Type
WHEN null
THEN 'Weight'
ELSE null
END)
ELSE null
END)
I am very appreciative of any help that can be offered, please let me know if there is a simple way to create a column that achieves the desired results. As it may be apparent, I am pretty new to Oracle, so please let me know if there is any additional information that is needed.
Thanks,
Will
You don't need a CASE expression here. You were on the right track with DISTINCT, but you also need to use an inline view (a subquery in the FROM clause).
The subquery in the FROM clause selects all distinct combinations of (id, categ, weight); you then select only categ and the sum of weight from that result set, grouping by categ. The subquery in the FROM clause has no repeated weights for a given id (unlike the table itself, which is why this is needed).
This would have to be done a little differently if an id could ever have more than one category, but you noted that an id only ever has one category.
select categ,
       sum(weight)
from (select distinct id,
             categ,
             weight
      from tbl)
group by categ;
Fiddle: http://sqlfiddle.com/#!4/11a56/1/0
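If an ID could ever carry a different weight on its non-null Type rows, a defensive variant (hypothetical, same tbl) would restrict the inline view to the null-Type entries first:
select categ,
       sum(weight)
from (select distinct id,
             categ,
             weight
      from tbl
      where type is null)
group by categ;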

Need a query that will satisfy two conditions from two tables

I have table a and table b: table a has two fields, field 1 and field 2, and table b has two fields, field 3 and field 4.
where
tablea.field1 >= 4 and tableb.field3 = 'male'
Is something like the above query possible? I've tried something like this in my database; although there are no errors and I get results, it checks whether both conditions are true separately.
I'm going to try to be a bit clearer, though I can't give out the query outright as much as I would like to (university reasons). So I'll explain: table 1 has several columns of information, one of which is number of kids; table 2 has more information on said kids, like gender.
So I'm having trouble creating a query where it first checks that a parent has 2 kids, but specifically 2 male kids, thus creating a relationship between the parent table and the kids table.
CREATE TABLE parent
(pID NUMBER,
numberkids INTEGER);
CREATE TABLE kids
(kID NUMBER,
father NUMBER,
mother NUMBER,
gender VARCHAR(7));
select
p.pid
from
kids k
inner join parent pm on pm.pid = k.mother
inner join parent pf on pf.pid = k.father,
parent p
where
p.numberkids >= 2 and k.gender = 'male'
/
This query checks that the parent has 2 kids or more and that a kid's gender is male, but I need it to check whether the parent has 2 kids and, of those kids, whether 2 or more are male (in short, whether the parent has 2 or more male kids).
Sorry for the long-winded explanation; I modified the tables and the query from the ones I'm actually going to use (so some mistakes might be there, but the original query works, just not how I want, as explained above). Any help would be greatly appreciated.
The best thing to do would be to take the numberKids column out of the parent table ... you'll find it very difficult to maintain.
Anyway, something like this might do the trick:
SELECT p.pID
FROM parent p INNER JOIN kids k
ON p.pID IN (k.father, k.mother)
WHERE k.gender = 'male'
GROUP BY p.pID
HAVING COUNT(*) >= 2;
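If you also need the denormalized numberkids column to agree with the count (as in the original attempt), a hypothetical variant simply adds that condition:
SELECT p.pID
FROM parent p INNER JOIN kids k
ON p.pID IN (k.father, k.mother)
WHERE k.gender = 'male'
AND p.numberkids >= 2
GROUP BY p.pID
HAVING COUNT(*) >= 2;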

Oracle explain plan estimates incorrect cardinality for an index range scan

I have an Oracle 10.2.0.3 database, and a query like this:
select count(a.id)
from LARGE_PARTITIONED_TABLE a
join SMALL_NONPARTITIONED_TABLE b on a.key1 = b.key1 and a.key2 = b.key2
where b.id = 1000
Table LARGE_PARTITIONED_TABLE (a) has about 5 million rows, and is partitioned by a column not present in the query. Table SMALL_NONPARTITIONED_TABLE (b) is not partitioned, and holds about 10000 rows.
Statistics are up-to-date, and there are height balanced histograms in columns key1 and key2 of table a.
Table a has a primary key and a global, nonpartitioned unique index on columns key1, key2, key3, key4, and key5.
Explain plan for the query displays the following results:
---------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 31 | 4 (0)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | 31 | | |
| 2 | NESTED LOOPS | | 406 | 12586 | 4 (0)| 00:00:01 |
|* 3 | INDEX RANGE SCAN| INDEX_ON_TABLE_B | 1 | 19 | 2 (0)| 00:00:01 |
|* 4 | INDEX RANGE SCAN| PRIMARY_KEY_INDEX_OF_TABLE_A | 406 | 4872 | 2 (0)| 00:00:01 |
---------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - access("b"."id"=1000)
4 - access("a"."key1"="b"."key1" and
"a"."key2"="b"."key2")
Thus the estimated cardinality (Rows) for step 4 is 406.
Now, a tkprof trace reveals the following:
Rows Row Source Operation
------- ---------------------------------------------------
1 SORT AGGREGATE (cr=51 pr=9 pw=0 time=74674 us)
7366 NESTED LOOPS (cr=51 pr=9 pw=0 time=824941 us)
1 INDEX RANGE SCAN INDEX_ON_TABLE_B (cr=2 pr=0 pw=0 time=36 us)(object id 111111)
7366 INDEX RANGE SCAN PRIMARY_KEY_INDEX_OF_TABLE_A (cr=49 pr=9 pw=0 time=810173 us)(object id 222222)
So the cardinality in reality was 7366, not 406!
My question is this: From where does Oracle get the estimated cardinality of 406 in this case, and how can I improve its accuracy, so that the estimate is more in line of what really happens during query execution?
Update: Here is a snippet of a 10053 trace I ran on the query.
NL Join
Outer table: Card: 1.00 Cost: 2.00 Resp: 2.00 Degree: 1 Bytes: 19
Inner table: LARGE_PARTITIONED_TABLE Alias: a
...
Access Path: index (IndexOnly)
Index: PRIMARY_KEY_INDEX_OF_TABLE_A
resc_io: 2.00 resc_cpu: 27093
ix_sel: 1.3263e-005 ix_sel_with_filters: 1.3263e-005
NL Join (ordered): Cost: 4.00 Resp: 4.00 Degree: 1
Cost_io: 4.00 Cost_cpu: 41536
Resp_io: 4.00 Resp_cpu: 41536
****** trying bitmap/domain indexes ******
Best NL cost: 4.00
resc: 4.00 resc_io: 4.00 resc_cpu: 41536
resp: 4.00 resp_io: 4.00 resp_cpu: 41536
Using concatenated index cardinality for table SMALL_NONPARTITIONED_TABLE
Revised join sel: 8.2891e-005 = 8.4475e-005 * (1/12064.00) * (1/8.4475e-005)
Join Card: 405.95 = outer (1.00) * inner (4897354.00) * sel (8.2891e-005)
Join Card - Rounded: 406 Computed: 405.95
So that is where the value 406 is coming from. Like Adam answered, join cardinality is join selectivity * filter cardinality (a) * filter cardinality (b), as can be seen on the second to last line of above trace quote.
What I do not understand is the Revised join sel line. 1/12064 is the selectivity of the index used to find the row from table b (12064 rows on table, and select based on unique id). But so what?
Cardinality appears to be calculated by multiplying the filter cardinality of table a (4897354) with the selectivity of table b (1/12064). Why? What does the selectivity on table b have to do with how many rows are expected to be found from table a, when the a --> b join is not based on id?
Where does the number 8.4475e-005 come from (it does not appear anywhere else in the whole trace)? Not that it affects the output, but I'd still like to know.
I understand that the optimizer has likely chosen the correct path here. But still the cardinality is miscalculated - and that can have a major effect on the execution path that is chosen from that point onwards (as in the case I'm having IRL - this example is a simplification of that).
Generating a 10053 trace file will show exactly what choices the optimizer is making regarding its estimation of cardinality and selectivity. Jonathan Lewis' Cost Based Oracle Fundamentals is an excellent resource for understanding how the optimizer works, and the printing I have spans 8i to 10.1.
From that work:
Join Selectivity = ((num_rows(t1) - num_nulls(t1.c1)) / num_rows(t1))
                 * ((num_rows(t2) - num_nulls(t2.c2)) / num_rows(t2))
                 / greater(num_distinct(t1.c1), num_distinct(t2.c2))
Join Cardinality = Join Selectivity
                 * filtered_cardinality(t1)
                 * filtered_cardinality(t2)
However, because we have a multi-column join, Join Selectivity isn't at the table level; it's the product (intersection) of the join selectivities on each column. Assuming there are no nulls in play:
Join Selectivity = Join Selectivity (key1) * Join Selectivity (key2)
Join Selectivity (key1) = ((5,000,000 - 0) / 5,000,000)
                        * ((10,000 - 0) / 10,000)
                        / max(116, ?) -- distinct key1 values in B
                        = 1 / max(116, distinct_key1_values_in_B)
Join Selectivity (key2) = ((5,000,000 - 0) / 5,000,000)
                        * ((10,000 - 0) / 10,000)
                        / max(650, ?) -- distinct key2 values in B
                        = 1 / max(650, distinct_key2_values_in_B)
Join Cardinality = JS(key1) * JS(key2)
* Filter_cardinality(a) * Filter_cardinality(b)
We know that there are no filters on A, so that table's filter cardinality is the number of rows. We're selecting the key value from B, so that table's filter cardinality is 1.
So the best case for the estimated join cardinality is now
Join Cardinality = 1/116 * 1/650 * 5,000,000 * 1
=~ 67
It might be easier to work backward. Your estimated cardinality of 406, given what we know, leads to a join selectivity of 406/5,000,000, or approximately 1/12315. That happens to be really, really close to 1 / (116^2), which is a sanity check within the optimizer to prevent it from finding too aggressive a cardinality on multi-column joins.
For the TL;DR crowd:
Get Jonathan Lewis' Cost Based Oracle Fundamentals.
Get a 10053 trace of the query whose behavior you can't understand.
The cardinality estimate would be based on the product of the selectivity of a.key1 and a.key2, which (at least in 10g) would each be based on the number of distinct values for those two columns as recorded in the column statistics.
For a table of 5M rows, a cardinality estimate of 406 is not significantly different to 7366. The question you have to ask yourself is, is the "inaccurate" estimate here causing a problem?
You can check what plan Oracle would choose if it were able to generate a perfectly accurate estimate by getting an explain plan for this:
select /*+CARDINALITY(a 7366)*/ count(a.id)
from LARGE_PARTITIONED_TABLE a
join SMALL_NONPARTITIONED_TABLE b on a.key1 = b.key1 and a.key2 = b.key2
where b.id = 1000;
If this comes up with the same plan, then the estimate that Oracle is calculating is already adequate.
You might be interested to read this excellent paper by Wolfgang Breitling which has a lot of info on CBO calculations: http://www.centrexcc.com/A%20Look%20under%20the%20Hood%20of%20CBO%20-%20the%2010053%20Event.pdf.
As explained there, because you have histograms, the filter-factor calculation for these columns does not use number of distinct values (NDV) but density, which is derived from the histogram in some way.
What are the DENSITY values in USER_TAB_COLUMNS for a.key1 and a.key2?
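You can look those up directly; a quick sketch (dictionary values are typically stored with uppercase names):
select column_name, num_distinct, density
from user_tab_columns
where table_name = 'LARGE_PARTITIONED_TABLE'
and column_name in ('KEY1', 'KEY2');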
Generally, the problem in cases like this is that Oracle does not gather statistics on pairs of columns, and assumes that their combined filter factor will be the product of their individual factors. This will produce low estimates if there is any correlation between values of the two columns.
If this is causing a serious performance issue, I suppose you could create a function-based index on a function of those columns, and use that to do the lookup. Then Oracle would gather statistics on that index and probably produce better estimates.
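A minimal sketch of that idea (hypothetical index name and separator; on 10g there are no extended statistics, so a function-based index is the usual workaround, and the separator must be chosen so that distinct key pairs cannot collide):
create index a_key1_key2_fbi
    on LARGE_PARTITIONED_TABLE (key1 || '#' || key2);
-- gather statistics so the hidden column behind the index gets its own
-- num_distinct/density, then phrase the join with the same expression:
select count(a.id)
from LARGE_PARTITIONED_TABLE a
join SMALL_NONPARTITIONED_TABLE b
  on a.key1 || '#' || a.key2 = b.key1 || '#' || b.key2
where b.id = 1000;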
