Find the Cardinality of Natural Join - relational-algebra

|X| represents the number of tuples in X; bold letters represent keys in the relation.
Consider the relations R(A, B) and S(A, C), and suppose R has a foreign key on A that references S. |R ✶ S| (where ✶ denotes natural join) is:
The options are:
1. |R|
2. |S|
3. |R| · |S|
4. max(|R|, |S|)
5. min(|R|, |S|)
What I understand about the cardinality of natural join is that if there is no common attribute between the two relations, the natural join acts like a cross product and the cardinality is |R| · |S|. But I don't understand how key constraints play a role in determining the cardinality.
Can someone please explain?

Presuming the bold A in each schema means it is a key; and presuming the Foreign Key constraint holds -- that is, the A value for every row in R does correspond to an A value in S:
Every row in R naturally joins to a row in S on A.
There might be rows in S that don't join to R (because there's no Foreign Key constraint to enforce that).
So the cardinality of the joined relations is the cardinality of R, answer 1.
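A small sketch of that argument on hypothetical toy data (not from the original question):
# A is a key in both relations and every R.A value appears in S.A,
# so each row of R joins to exactly one row of S.
R = [(1, "b1"), (2, "b2"), (3, "b3")]             # R(A, B), keyed on A
S = [(1, "c1"), (2, "c2"), (3, "c3"), (4, "c4")]  # S(A, C), keyed on A

s_by_a = {a: c for a, c in S}                     # A is unique in S
natural_join = [(a, b, s_by_a[a]) for a, b in R]  # the FK guarantees every lookup succeeds

assert len(natural_join) == len(R)                # cardinality of the join is |R|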
Is there a real-life use for a schema like this? Consider: S holds Customer Name in C, keyed by Customer number in A. R holds date of birth in B, also keyed by Customer number in A. Every Customer must have a name; it's true every Customer (person) must have a d.o.b., but we don't need to record it unless/until they purchase age-restricted items.

There is absolutely not enough information to answer this question. The "natural" join can return almost any value between 0 and |R| · |S|. The following are examples.
This example returns 12:
create table s1 (id int primary key);
create table r1 (s1_id int references s1(id));
insert into s1 (id) values (1), (2), (3);
insert into r1 (s1_id) values (1), (2), (2), (3);
This example returns 0:
create table s2 (id int primary key, x int default 2);
create table r2 (s2_id int references s2(id), x int default 1);
insert into s2 (id) values (1), (2), (3);
insert into r2 (s2_id) values (1), (2), (2), (3);
This example returns 4:
create table s3 (id int primary key, y int default 2);
create table r3 (id int references s3(id), x int default 1);
insert into s3 (id) values (1), (2), (3);
insert into r3 (id) values (1), (2), (2), (3);
In all of these, r has a foreign key relationship to s, and a "natural" join is being used.
Even if you assume that the "A"s are the primary keys AND that there are no other columns, the number of rows still varies:
-- returns 4
create table s5 (id int primary key);
create table r5 (id int references s5(id));
insert into s5 (id) values (1), (2), (3);
insert into r5 (id) values (1), (1), (2), (2);
Versus:
-- returns 0
create table s4 (id int primary key);
create table r4 (id int references s4(id));
insert into s4 (id) values (1), (2), (3);
insert into r4 (id) values (NULL), (NULL), (NULL), (NULL);

Related

Measure in fact table based on columns in a dim

I need to find a way to build a measure in a fact table based on columns in one of the dimensions.
Depending on what's selected in the fact, I need to calculate the aggregate (sum, count, avg, ...) of one column in the dimension table, based on distinct values of another column in the same dimension table, choosing the row with the highest value in a third column of that table.
Let's say we want to calculate the sum of DimColumn2 based on distinct values of Dimcolumn1, and we choose the line with the biggest Dimcolumn5 value:
If we select K1, K2 and K3 from the fact, the measure should be: X + Y
If we select all the fact, the measure should be: X + Y + Z + X
All of that should be done in the fact table so that we can slice by other dimensions (like Dim2Key).

Alternative to using ungroup in kdb?

I have two tables in KDB.
One is a timeseries with datetime and sym columns (spanning multiple dates; it could be 1mm or 2mm rows). Each timepoint has the same set of syms, plus a few other standard columns such as price.
Let's call this t1:
`date`datetime`sym`price
The other table is of this structure:
`date`sym`factors`weights
where factors is a list and weights is a list of equal length for each sym.
Let's call this t2.
I'm doing a left join on these two tables and then an ungroup.
The length of factors and weights is not the same for every sym; it varies from sym to sym.
I'm doing the following:
select sum (weights*price) by date, factors from ungroup t1 lj `date`sym xkey t2
However, this is very slow: it can take 5-6 seconds if t1 has a million rows or more.
Calling all kdb experts for some advice!
EDIT:
here's a full example:
(apologies for the roundabout way of defining t1 and t2)
interval: `long$`time$00:01:00;
hops: til 1+ `int$((`long$(et:`time$17:00)-st:`time$07:00))%interval;
times: st + `long$interval*hops;
dates: .z.D - til .z.D-.z.D-10;
timepoints: ([] date: dates) cross ([] time:times);
syms: ([] sym: 300?`5);
universe: timepoints cross syms;
t1: update datetime: date+time, price:count[universe]?100.0 from universe;
t2: ([] date:dates) cross syms;
/ note here my real life t2, doesn't have a count of 10 weights/factors for each sym, it can vary by sym.
t2: `date`sym xkey update factors: count[t2]#enlist 10?`5, weights: count[t2]#enlist 10?10 from t2;
/ what is slow is the ungroup
select sum weights*price by date, datetime, factors from ungroup t1 lj t2
One approach to avoid the ungroup is to work with matrices (aka lists of lists) and take advantage of the optimised matrix-multiply $ seen here: https://code.kx.com/q/ref/mmu/
In my approach below, instead of joining t2 to t1 and ungrouping, I group t1 and join it to t2 (thus keeping everything as lists of lists) and then use some matrix manipulation (with a final ungroup at the end on a much smaller set).
q)\ts res:select sum weights*price by date, factors from ungroup t1 lj t2
4100 3035628112
q)\ts resT:ungroup exec first factors,sum each flip["f"$weights]$price by date:date from t2 lj select price by date,sym from t1;
76 83892800
q)(0!res)~`date`factors xasc `date`factors`weights xcol resT
1b
As you can see, it's much quicker (at least on my machine) and the result is identical save for ordering and column names.
You may still need to modify this solution somewhat to work in your actual use case (with variable-length weights etc.; in that case perhaps enforce a uniform number of weights across syms, padding with zeros if necessary).
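To illustrate the idea outside of q: here is a rough NumPy sketch (hypothetical shapes and random data, a single date, and a uniform factor count per sym) showing why one matrix multiply on grouped data beats summing over the row-exploded form:
import numpy as np

rng = np.random.default_rng(0)
n_syms, n_times, n_factors = 300, 601, 10
prices  = rng.random((n_syms, n_times)) * 100             # price per (sym, timepoint)
weights = rng.integers(0, 10, (n_syms, n_factors)) * 1.0  # weight per (sym, factor)

# Row-exploded equivalent of "ungroup, then sum weights*price by factor":
exploded = (weights[:, None, :] * prices[:, :, None]).sum(axis=(0, 1))

# Grouped form: collapse the time axis per sym, then one matrix multiply.
grouped = weights.T @ prices.sum(axis=1)

assert np.allclose(exploded, grouped)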

Relational Algebra and Cardinality?

I'm very confused when it comes to the topic of cardinality in Relational Algebra. I understand that cardinality essentially refers to the uniqueness of a table or data set. So I'll walk through a problem I attempted to solve and maybe someone can help me out, or give me better resources than the ones I've found.
I've got a table R2, with attributes D, E, and F, where D is a Primary Key, and E and F are Foreign Keys relating to the Primary Keys of the following table: R3, with attributes G, H, and I, where G and H are PKs. R2 has cardinality N2 = 100, R3 has cardinality N3 = 200. So what would the min and max cardinality be of a table created by joining R2 to R3 with the condition that E = G and F = H?
My answer is that the minimum is 1, and max is 200, or N3. My thought process is that E and F are FKs, so they can have many repeating values so long as they come from G and H, but since G and H are PKs, at least one value for E and F would be unique, and D is a PK as well, so at least one value is unique there too. So I assume those unique values mean the cardinality must be at least 1, and at most, it can have the same cardinality as R3, which is 200. But honestly, my own reasoning doesn't even make sense to me...
The whole idea seems really abstract to me. Attribute I is the only non FK/PK in the problem, so how does that affect the cardinality? Sorry for the long winded question, I'm just very confused by the whole idea of this and would love any help in general regarding the subject.
You are not equijoining FK-to-CK (foreign key to candidate key). You are equijoining on EF subtuples matching GH subtuples. Although every E has a G & every F has an H, there does not have to be a single EF-GH match. G & H are unique so GH is unique so each EF can match at most one, so there could be 0 to 100 rows in the result.
(If you want to make sound analyses you need to find the minimum & maximum results for various cases of kinds of joins on column sets referencing (having to appear elsewhere as) others. You can handle more cases by dealing with superkeys (unique column sets) rather than CKs (candidate keys, i.e. superkeys containing no smaller superkeys). You mean CK when you say "PK" (primary key) -- there can be at most one PK per table. For no duplicates or nulls, SQL UNIQUE is superkey & FOREIGN KEY is foreign superkey.)
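A scaled-down sketch of that 0-to-|R2| range on hypothetical toy data (3 rows in R2 standing in for 100, 4 rows in R3 standing in for 200):
# R3(G, H, I): G and H are each unique, so a (G, H) pair identifies at most one row.
R3 = [(1, 10, "i1"), (2, 20, "i2"), (3, 30, "i3"), (4, 40, "i4")]

def join(R2, R3):
    # Equijoin on E = G and F = H.
    return [(d, e, f, g, h, i) for (d, e, f) in R2
                               for (g, h, i) in R3 if e == g and f == h]

# Maximum: every (E, F) pair of R2 appears as a (G, H) row, giving |R2| rows.
R2_max = [(100, 1, 10), (101, 2, 20), (102, 3, 30)]
assert len(join(R2_max, R3)) == len(R2_max)

# Minimum: both foreign keys hold (every E is in G, every F is in H),
# yet no (E, F) pair matches any (G, H) row, giving 0 rows.
R2_min = [(100, 1, 20), (101, 2, 30), (102, 3, 40)]
assert len(join(R2_min, R3)) == 0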

How does Excel Solver compare 2 [n*n] matrices in a constraint

Suppose I have 2 tables t1, t2 (or matrices) that are both square (e.g. both are 3x3).
In Solver I add the following constraint :
t1 >= t2
Then how does Solver compare these values?
- Value at 1x1 in t1 >= value at 1x1 in t2, value at 1x2 in t1 >= value at 1x2 in t2, ...
- Any value in t1 must be >= the largest value in t2
- ...
If it is not the first, how can I obtain that behaviour? Do I have to enter every value comparison by hand (since that would take quite some time)?
It makes the comparison element-wise. You can confirm by getting the "Answer Report".
Here are the matrices in C5:G9 and J5:N9
You add the constraint stating C5:G9 <= (or >=) J5:N9
As you can see from the formula column, it makes the comparisons element-wise (C5<J5, D5<K5, ..., G9<N9).
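For intuition, here is a small NumPy sketch (hypothetical values, not the actual worksheet ranges) contrasting the element-wise reading Solver uses with the "compare against the largest value of t2" reading:
import numpy as np

t1 = np.array([[2, 3, 4],
               [5, 6, 7],
               [8, 9, 10]])
t2 = np.array([[1, 2, 3],
               [4, 5, 6],
               [7, 8, 9]])

elementwise = bool(np.all(t1 >= t2))        # cell-by-cell, as Solver does: True
against_max = bool(np.all(t1 >= t2.max()))  # every t1 cell vs the max of t2: False

print(elementwise, against_max)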

Leads to find tables correlation

Have these two tables:
TableA
ID Opt1 Opt2 Type
1 A Z 10
2 B Y 20
3 C Z 30
4 C K 40
and
TableB
ID Opt1 Type
1 Z 57
2 Z 99
3 X 3000
4 Z 3000
What would be a good algorithm to find arbitrary relations between these two tables? In this example, I'd like it to find the apparent relation between records containing Opt1 = C in TableA and Type = 3000 in TableB.
I could think of using Apriori in some way, but it doesn't seem too practical. What do you guys say?
Thanks.
It sounds like a relational data mining problem. I would suggest trying Ross Quinlan's FOIL: http://www.rulequest.com/Personal/
In Python, a naive implementation might look like:
def find_correlations(table1, table2, error_margin=1):
    # Compare every column of table1 against every column of table2.
    for name1, c1 in table1.items():
        for name2, c2 in table2.items():
            if approximately_isomorphic(c1, c2, error_margin):
                yield (name1, name2)

def approximately_isomorphic(c1, c2, error_margin):
    # Hash map from each c1 value to the set of c2 values it appears with.
    hmap = {}
    for a, b in zip(c1, c2):
        hmap.setdefault(a, set()).add(b)
    # Count the distinct linkings; a perfect isomorphism has exactly one
    # linking per unique c1 value, and a few extra is a near-isomorphism.
    links = sum(len(v) for v in hmap.values())
    return links - len(hmap) < error_margin
The idea is this: do a pairwise comparison of the elements of each column with each other column. For each pair of columns, construct a hash map linking corresponding elements of the two columns. If the hash map contains the same number of linkings as unique elements of the first column, then you have a perfect isomorphism; if you have a few more, you have a near isomorphism; if you have many more, up to the number of elements in the first column, you have what probably doesn't represent any correlation.
Example on your input:
ID & anything : perfect isomorphism, since all values of ID are unique
Opt1 & ID : 4 mappings and 3 unique values; not a perfect isomorphism, but not too far away
Opt1 & Opt1 : ditto above
Opt1 & Type : 3 mappings & 3 unique values, perfect isomorphism
Opt2 & ID : 4 mappings & 3 unique values; not a perfect isomorphism, but not too far away
Opt2 & Opt1 : ditto above
Opt2 & Type : ditto above
Type & anything : perfect isomorphism, since all values of Type are unique
For best results, you might do this procedure both ways - that is, comparing table1 to table2 and then comparing table2 to table1 - to look for bijective mappings. Otherwise, you can be thrown off by trivial cases... all values in the first are different (perfect isomorphism) or all values in the second are the same (perfect isomorphism). Note also that this technique provides a way of ranking, or measuring, how similar or dissimilar columns are.
Is this going in the right direction? By the way, this is O(ijk) where table1 has i columns, table 2 has j columns and each column has k elements. In theory, the best you could do for a method would be O(ik + jk), if you can find correlations without doing pairwise comparisons.
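For instance, running the Python sketch above on the question's TableA and TableB (with the default error_margin = 1, i.e. demanding an exact isomorphism) reports, besides the trivially-unique ID and Type columns of TableA, exactly the Opt1/Type pairing the question is after:
table_a = {
    "ID":   [1, 2, 3, 4],
    "Opt1": ["A", "B", "C", "C"],
    "Opt2": ["Z", "Y", "Z", "K"],
    "Type": [10, 20, 30, 40],
}
table_b = {
    "ID":   [1, 2, 3, 4],
    "Opt1": ["Z", "Z", "X", "Z"],
    "Type": [57, 99, 3000, 3000],
}

print(list(find_correlations(table_a, table_b)))
# [('ID', 'ID'), ('ID', 'Opt1'), ('ID', 'Type'),
#  ('Opt1', 'Type'),
#  ('Type', 'ID'), ('Type', 'Opt1'), ('Type', 'Type')]
# ('Opt1', 'Type') is the Opt1 = C <-> Type = 3000 relation from the question.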
