Database Relational Algebra: How to find actors who have played in ALL movies produced by "Universal Studios"? - relational-algebra

Given the following relational schemas, where the primary keys are in bold:
movie(movieName, whenMade);
actor(actorName, age);
studio(studioName, location, movieName);
actsIn(actorName, movieName);
How do you find the list of actors who have played in EVERY movie produced by "Universal Studios"?
My attempt:
π actorName ∩ (σ studioName=“Universal Studios” studio) |><| actsIn, where |><| is the natural join
Are you supposed to use cartesian product and/or division? :\

Here are the two steps that you should follow:
Write an expression to find the names of the movies produced by "Universal Studios" (the result is a relation with a single attribute, movieName): π movieName (σ studioName="Universal Studios" (studio)).
Divide the relation actsIn by the relation obtained in the first step: actsIn ÷ π movieName (σ studioName="Universal Studios" (studio)).
This should give you the expected result (i.e. a relation with the names of the actors who have played in every movie produced by "Universal Studios").
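To make the division step concrete, here is a minimal Python sketch that models relations as sets of tuples. The sample rows and the helper name `divide` are invented for illustration; only the schema comes from the question.

```python
# Toy relations as sets of tuples; all rows are made up for illustration.
studio = {  # (studioName, location, movieName)
    ("Universal Studios", "LA", "Jaws"),
    ("Universal Studios", "LA", "E.T."),
    ("Other Studio", "NY", "Alien"),
}
acts_in = {  # (actorName, movieName)
    ("Alice", "Jaws"), ("Alice", "E.T."), ("Alice", "Alien"),
    ("Bob", "Jaws"),
    ("Carol", "Alien"),
}

def divide(dividend, divisor):
    """Relational division of a binary relation (a, b) by a unary relation (b).

    Returns every a that is paired in dividend with ALL values b of divisor.
    """
    candidates = {a for a, _ in dividend}
    return {a for a in candidates
            if all((a, b) in dividend for b in divisor)}

# Step 1: project movieName from the studio rows selected by studioName.
universal_movies = {movie for name, _, movie in studio
                    if name == "Universal Studios"}

# Step 2: divide actsIn by the result of step 1.
actors = divide(acts_in, universal_movies)
print(actors)  # {'Alice'}
```

Bob is excluded because he is missing one Universal movie, which is exactly what a natural join alone cannot express.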

Create a Dynamic Array formula (Excel) to combine multiple results columns into one column that is filtered & sorted using multiple criteria?

The sample data in the image below is collected from a round robin tournament.
There is a Round column,Home team & Away team columns listing who is playing who. A team could be either Home or Away.
For each match in a round (including any "Bye" match) the number of games won for the Home and Away team are recorded in separate columns respectively.
"Ff" = forfeit and has a value of 0. "Bye" result is left blank (at this stage).
Output columns are "Won, Lost, Round".
Required output (shown in the image) is, for any selected team, the top n most-games-won matches (from both Home & Away) sorted in descending order and then the corresponding games lost but sorted in ascending order where the games won are equal. Finally show the rounds where those scores occurred.
These are the challenges I've faced in going from data to output in one step using dynamic array formula:
Collating/combining the Win results into one column. Likewise the Losses.
Getting the array to ignore blanks or convert "Ff" to 0 without getting #NUM or #VALUE errors.
Ensuring that if I used separate single column arrays the corresponding Loss and Round matched the Win result
Although "Round, Won, Lost" would be acceptable. But I wasn't able to get the Dynamic Array capability to give the required output with this order.
SUMPRODUCT, INDEX(MATCH), SORT(FILTER) functions all hint at a possible one step formula solution.
There are numerous solutions for sorting and filtering when the values are already in one column. One solution that dealt with two columns of values was somewhat useful: "How to get the highest values from 2 columns in excel" (Stack Overflow, 2013).
Many other responses are around the use of concatenation, combining/merging array sets, aggregation etc.
My work around solution is to use a Helper Sheet to combine the Wins from the separate results columns and convert blanks & "Ff" to -1. Likewise for Losses. Using the formula for each line
=IF($C5=L$2,IF($F5="",-1,IF($F5="Ff",0,$F5)),IF($D5=L$2,IF($G5="",-1,IF($G5="Ff",0,$G5)),-1))
Example Helper Sheet
To get the final output the Dynamic Array formula was used on the Helper Sheet data
=SORT(FILTER(L$26:N$40,L$26:L$40>=LARGE(L$26:L$40,$J$3),""),{1,2},{-1,1},FALSE)
I'm trying to avoid using pivottable, VBA solutions. Powerquery possible but not preferred.
Apologies for the screenshots but I couldn't work out how to attach the sample spreadsheet file. (Unfortunately Stackoverflow Help didn't help me to/not to do this.)
Based on the comments I changed my answer with a different approach:
=LET(data,A5:F19,
round,INDEX(data,,1),
ha,CHOOSECOLS(data,3,4),
HAwonR,CHOOSECOLS(data,5,6,1),
w,BYROW(ha,LAMBDA(h,IFERROR(XMATCH(L2,h),0))),
clm,CHOOSE(w,{1,2},{2,1}),
srtwon,DROP(REDUCE(0,SEQUENCE(ROWS(data)),LAMBDA(y,z,VSTACK(y,INDEX(HAwonR,z,HSTACK(INDEX(clm,z,),3))))),1),
res,FILTER(srtwon,w),
TAKE(SORT(res,{1,2},{-1,1}),J3))
Old answer:
=LET(data,A5:F19,
round,INDEX(data,,1),
home,INDEX(data,,3),
away,INDEX(data,,4),
HAwonR,CHOOSECOLS(data,5,6,1),
w,MAP(home,away,LAMBDA(h,a,OR(h=L2,a=L2))),
won,FILTER(HAwonR,w),
TAKE(SORT(won,{1,2},{-1,1}),J3))
In your example you selected round 3 for the third result, but that wasn't won, so I guess that was by mistake.
As you can see, making use of LET avoids helpers. LET allows you to create names (helpers) that are stored, and because you can name them, complex formulas become more readable.
Basically, it filters the columns Home, Away and Round (in that order) for rows where either Home or Away equals the team in cell L2. The result is sorted by column 1 descending and column 2 ascending. Then the number of rows given in cell J3 is taken from that sorted array.
Here is my solution, based on the excellent contribution by @P.b. Thank you, much appreciated.
The wins (likewise losses) required mapping the presence, of the team in question, as hT (home team) to the games it won (hG) and adding to that a 2nd mapping of the games it won (aG) when it was the away team (aT). Essentially what was being done on the Helper Sheet. Result was a 1 column array for game wins and a 1 column array for game losses.
In the process I was able to convert the "Ff" text to 0. I attempted without the conversion and it threw an error.
Instead of CHOOSECOLS, I used HSTACK to create the new array (wins, losses and round) for FILTER, SORT and TAKE to work on.
If it could be made conciser(?) that is the next challenge. Overall (not just my solution), this exercise has provided greater flexibility and solved the problems stated. I'm happy!
=LET(data,A5:G19,
round,INDEX(data,,1),
hT,INDEX(data,,3),
aT,INDEX(data,,4),
hG,INDEX(data,,6),
aG,INDEX(data,,7),
wins,MAP(hG,
MAP(hT,LAMBDA(h,h=L2)),
LAMBDA(w,t,IF(w="Ff",0,w)*IF(t=TRUE,1,0))) +
MAP(aG,
MAP(aT,LAMBDA(a,a=L2)),
LAMBDA(w,t,IF(w="Ff",0,w)*IF(t=TRUE,1,0))),
losses,MAP(aG,
MAP(hT,LAMBDA(h,h=L2)),
LAMBDA(w,t,IF(w="Ff",0,w)*IF(t=TRUE,1,0))) +
MAP(hG,
MAP(aT,LAMBDA(a,a=L2)),
LAMBDA(w,t,IF(w="Ff",0,w)*IF(t=TRUE,1,0))),
HAwonR,HSTACK(wins,losses,round),
w,MAP(hT,aT,LAMBDA(h,a,OR(h=L2,a=L2))),
won,FILTER(HAwonR,w),
TAKE(SORT(won,{1,2},{-1,1}),J3))
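For readers who want to check the logic outside Excel, the same steps (collect wins and losses from the team's point of view, treat "Ff" as 0, sort, take the top n) can be sketched in Python. The row layout, the team names and the function name `top_results` are assumptions made for this sketch, and unlike the formula it simply skips Bye rows.

```python
def top_results(rows, team, n):
    """Return the top n (won, lost, round) tuples for one team.

    rows: iterable of (round, home, away, home_games, away_games), where a
    games value is an int, "Ff" (forfeit, counted as 0) or "" (bye).
    """
    def games(value):
        return 0 if value == "Ff" else value

    results = []
    for rnd, home, away, home_games, away_games in rows:
        if home_games == "" or away_games == "":
            continue  # assumption: a bye has no score and is ignored
        if team == home:
            results.append((games(home_games), games(away_games), rnd))
        elif team == away:
            results.append((games(away_games), games(home_games), rnd))

    # Most games won first; for equal wins, fewest games lost first.
    results.sort(key=lambda r: (-r[0], r[1]))
    return results[:n]

# Invented sample data.
rows = [
    (1, "Red", "Blue", 6, 3),
    (2, "Green", "Red", 4, 6),
    (3, "Red", "Bye", "", ""),
    (4, "Blue", "Red", "Ff", 6),
    (5, "Red", "Green", 2, 5),
]
print(top_results(rows, "Red", 3))  # [(6, 0, 4), (6, 3, 1), (6, 4, 2)]
```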

Finding combination of trainstops through Europe

I am currently working on a school project. The objective is to find a route through Europe from train station to train station and from country to country, where the names of consecutive stations start with consecutive letters of the alphabet and each country can be used only once. An example route: (Amsterdam Central, Netherlands) --> (Berlin, Germany) --> (Carcassonne, France), etc. Consecutive countries also need to be neighbouring countries. We have received a dataset that lists countries and some of their stations. Some countries have only a few stations, and therefore only a few possible first letters, so it matters which letter is assigned to such a country. Can someone give me some guidance on how to tackle this problem? I am currently coding in Python.
cheers!
- Sort countries in order of increasing number of letter choices
- Loop C over countries in that order
  - Place C in the position of a random letter that it provides
  - IF neighbouring positions have been populated, but they are NOT neighbours geographically
    - THEN choose a different letter from those available in C
I would use a constraint solver for this. (I'm familiar with CP-SAT because I use it at work, but you have options.)
For each letter from A to Z, create a variable whose domain is the set of stations whose name starts with that letter. Create 26 corresponding country variables and, for each corresponding station variable and country variable, add a table constraint to ensure that they describe the same thing (so the table contains entries like ("Amsterdam Central, Netherlands", "Netherlands"), though you have to translate into integer indices). Add an all-different constraint on the country variables. For each pair of consecutive country variables, add a table constraint that they be neighbors.
The solver contains a powerful deduction engine that will surely pick through the possibilities much faster than brute force or simple heuristics.
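The same three constraints (one station per letter, all countries different, consecutive countries neighbouring) can also be tried out on a small dataset with a plain depth-first backtracking search. This is not CP-SAT and will not scale as well; it is only a sketch, and the station list, the neighbour map and the function name `find_route` are invented for illustration.

```python
import string

def find_route(stations, neighbours, letters=string.ascii_uppercase):
    """Backtracking search for one station per letter.

    stations: list of (station_name, country)
    neighbours: dict mapping a country to the set of countries bordering it
    Returns a list of (station_name, country) or None if no route exists.
    """
    by_letter = {}
    for name, country in stations:
        by_letter.setdefault(name[0].upper(), []).append((name, country))

    route, used = [], set()

    def extend(i):
        if i == len(letters):
            return True
        for name, country in by_letter.get(letters[i], []):
            if country in used:
                continue  # all-different constraint on countries
            if route and country not in neighbours.get(route[-1][1], set()):
                continue  # consecutive countries must be neighbours
            route.append((name, country))
            used.add(country)
            if extend(i + 1):
                return True
            route.pop()          # undo and try the next candidate
            used.remove(country)
        return False

    return route if extend(0) else None

# Invented toy data, restricted to the letters A to C.
stations = [
    ("Amsterdam Centraal", "Netherlands"), ("Antwerp", "Belgium"),
    ("Berlin", "Germany"), ("Brussels", "Belgium"),
    ("Cologne", "Germany"), ("Carcassonne", "France"),
]
neighbours = {
    "Netherlands": {"Germany", "Belgium"},
    "Belgium": {"Netherlands", "Germany", "France"},
    "Germany": {"Netherlands", "Belgium", "France"},
    "France": {"Belgium", "Germany"},
}
route = find_route(stations, neighbours, letters="ABC")
print(route)
```

Trying the candidates for letters with the fewest options first, as the heuristic above suggests, usually prunes the search much earlier than going strictly from A to Z.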

Why am I getting all the pizzas(relational algebra) and my joins messing up?

This is the database I am using for my queries
https://class.stanford.edu/c4x/DB/RA/asset/pizzadata.html
the syntax for writing out relational algebra queries is based off http://www.cs.duke.edu/~junyang/ra/ .
My query is to "Find all pizzas eaten by at least one female over the age of 20."
this is what I have so far
\project_{name,pizza}(
Person \join_{gender='female' and age>20} Eats
)
I think I have the right logic here.("\join_{cond} is the relational theta-join operator.") I also showed the name column for debugging purposes. I am joining two relations and only keeping the rows where gender is female and age is > 20.
The result of my query(against the correct query). I don't think this is a syntax issue. In the Eats relation, Fay only eats mushroom. I don't understand why she is paired with every pizza combination
A theta join is a Cartesian product followed by a selection: it pairs every row of one relation with every row of the other and keeps the pairs that satisfy the condition. Your condition only mentions attributes of Person, so every row of Person where gender='female' and age>20 is paired with every row of Eats, regardless of name. You probably want:
Person \join_{gender='female' and age>20 and name=eater} \rename_{eater, pizza} Eats
Note that theta joins typically increase the number of rows; you typically reduce the number of rows returned by using selections (sigma). A more idiomatic way of writing your query is a selection followed by a natural join, which matches rows on the shared name attribute:
(\select_{gender='female' and age>20} Person) \join Eats
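The difference is easy to reproduce with a few lines of Python. The sketch below implements a theta join as a filtered Cartesian product; the rows are a small invented sample rather than the full pizza dataset, and `theta_join` is a hypothetical helper.

```python
# Small invented sample, not the full dataset from the question.
person = [
    {"name": "Amy", "age": 16, "gender": "female"},
    {"name": "Fay", "age": 21, "gender": "female"},
    {"name": "Ben", "age": 21, "gender": "male"},
]
eats = [
    {"name": "Amy", "pizza": "pepperoni"},
    {"name": "Fay", "pizza": "mushroom"},
    {"name": "Ben", "pizza": "cheese"},
]

def theta_join(left, right, cond):
    """Cartesian product of left and right, keeping pairs that satisfy cond."""
    return [(l, r) for l in left for r in right if cond(l, r)]

# Condition that only looks at Person: Fay is paired with every Eats row.
wrong = theta_join(
    person, eats,
    lambda p, e: p["gender"] == "female" and p["age"] > 20)
wrong_pizzas = {e["pizza"] for _, e in wrong}

# Adding the name equality ties each person to their own Eats rows.
right = theta_join(
    person, eats,
    lambda p, e: p["gender"] == "female" and p["age"] > 20
    and p["name"] == e["name"])
right_pizzas = {e["pizza"] for _, e in right}

print(sorted(wrong_pizzas))  # ['cheese', 'mushroom', 'pepperoni']
print(sorted(right_pizzas))  # ['mushroom']
```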

Efficient point-in-time query of group membership

We have a scenario like this:
Millions of records (Record 1, Record 2, Record 3...)
Partitioned into millions of small non-intersecting groups (Group A, Group B, Group C...)
Membership gradually changes over time, i.e. a record may be reassigned to another group.
We are redesigning the data schema, and one use case we need to support is given a particular record, find all other records that belonged to the same group at a given point in time. Alternatively, this can be thought of as two separate queries, e.g.:
To which group did Record 15544 belong, three years ago? (Call this Group g).
What records belonged to Group g, three years ago?
Supposing we use a relational database, the association between records and groups is easily modelled using a two-column table of record id and group id. A common approach for allowing historical queries is to add a timestamp column. This allows us to answer the question above as follows:
Find the row for Record 15544 with the most recent timestamp prior to the given date. This tells us Group g.
Find all records that have at any time belonged to Group g.
For each of these records, find the row with the most recent timestamp prior to the given date. If this indicates that the record was in Group g at that time, then add it to the result set.
This is not too bad (assuming the table is separately indexed by both record id and group id), and may even be the optimal algorithm for the naive table structure just described, but it does cost an index lookup for every record found in step 2. Is there an alternative data structure that would answer the query more efficiently?
ETA: This is only one of several use cases for the system, so we don't want to speed up this query at the expense of making queries about current groupings slower, nor do we want to pay a huge price in space consumption, etc.
How about creating two tables:
(recordID, time -> groupID): the key is (recordID, time), sorted by
recordID and secondarily by time (let that be map1)
(groupID, time -> list of records): the key is (groupID, time), sorted by
groupID and secondarily by time (let that be map2)
At each record change:
Retrieve the current groupID of the record you are changing
set t <- current time
Create a new entry in map2 for the old group: (oldGroupID, t, list'), where list' is the old group's current list without the record you just moved out of it.
Add a new entry to map2 for the new group: (newGroupId, t, list''), where list'' is the new group's current list with the moved record added to it.
Add a new entry (recordId,t,newGroupId) to map1
During query:
You need to find the entry in map1 whose key is closest to, but not greater than,
(recordId, desired_time). This is a classic O(log N) operation in a
sorted data structure.
This will give you the group g the element belonged to at the desired time.
Now look in map2 in the same way for the entry whose key is closest to, but not greater than, (g, desired_time). Its value is the list of all records that were in the group at the desired time.
This requires quite a bit more space (at a constant factor, though), but every operation is O(log N), where N is the number of record changes.
An efficient sorted data structure for entries that are mostly stored on disk is a B+ tree, which is also what many relational database implementations use for their indexes.
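As an in-memory illustration of the two maps, here is a minimal Python sketch that uses sorted lists and `bisect` in place of a B+ tree. The class and method names are hypothetical, and the sketch assumes changes are recorded in non-decreasing time order.

```python
import bisect
from collections import defaultdict

class MembershipHistory:
    """Two time-indexed maps: record -> group and group -> member snapshot."""

    def __init__(self):
        self._record_times = defaultdict(list)   # map1 keys, per record
        self._record_groups = defaultdict(list)  # map1 values
        self._group_times = defaultdict(list)    # map2 keys, per group
        self._group_members = defaultdict(list)  # map2 values (frozensets)

    def _snapshot(self, group, t, members):
        self._group_times[group].append(t)
        self._group_members[group].append(frozenset(members))

    def assign(self, record, group, t):
        """Move record into group at time t (t must not go backwards)."""
        history = self._record_groups[record]
        if history:  # record leaves its old group: store list'
            old = history[-1]
            self._snapshot(old, t, self._group_members[old][-1] - {record})
        snapshots = self._group_members[group]
        current = snapshots[-1] if snapshots else frozenset()
        self._snapshot(group, t, current | {record})  # store list''
        self._record_times[record].append(t)
        self._record_groups[record].append(group)

    @staticmethod
    def _latest(times, values, t):
        # Index of the last entry with time <= t, found in O(log N).
        i = bisect.bisect_right(times, t)
        return values[i - 1] if i else None

    def group_at(self, record, t):
        return self._latest(self._record_times[record],
                            self._record_groups[record], t)

    def members_at(self, record, t):
        """All records that shared a group with record at time t."""
        g = self.group_at(record, t)
        if g is None:
            return frozenset()
        return self._latest(self._group_times[g], self._group_members[g], t)

h = MembershipHistory()
h.assign(1, "A", 10)
h.assign(2, "A", 10)
h.assign(3, "B", 10)
h.assign(2, "B", 20)  # record 2 moves from A to B at time 20
print(sorted(h.members_at(1, 15)))  # [1, 2]
print(sorted(h.members_at(1, 25)))  # [1]
```

Each change stores a full copy of the affected groups' member lists, so the extra space grows with group size; with the small groups described in the question that cost stays modest.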

Multi Attribute Matching of Profiles

I am trying to solve a problem of a dating site. Here is the problem
Each user of app will have some attributes - like the books he reads, movies he watches, music, TV show etc. These are defined top level attribute categories. Each of these categories can have any number of values. e.g. in books : Fountain Head, Love Story ...
Now, I need to match users based on profile attributes. Here is what I am planning to do :
Store the data in a reverse (inverted) index, i.e. each value such as Fountain Head or Love Story is an index key that maps to the set of users with that attribute.
When a new user joins, take that user's attributes, look up the corresponding index keys, collect all the users stored under those keys, and sort the merged list (with a bucket sort, radix sort or similar) by how many times each user appears in it.
Is this good, bad, worse? Any other suggestions?
Thanks
Ajay
The algorithm you described is not bad, although it uses a very simple notion of similarity between people.
Let us make it more adjustable, without creating a complicated matching criteria. Let's say people who like the same book are more similar than people who listen to the same music. The same goes with every interest. That is, similarity in different fields has different weights.
Like you said, you can keep a list for each interest (like a book, a song etc) to the people who have that in their profile. Then, say you want to find matches of guy g:
for each interest i in g's interests:
    for each person p in list of i
        if p and g have mismatching sexual preferences
            continue
        if p is already in g's match list
            g->match_list[p].score += i->match_weight
        else
            add p to g->match_list with score i->match_weight
sort g->match_list based on score
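The pseudocode above could look roughly like this in Python. The profile data, the weights (books count three times as much as music) and the `compatible` callback are placeholders chosen for the sketch, not recommended values.

```python
from collections import defaultdict

def rank_matches(user, interests_of, users_with, weight, compatible):
    """Score every other user by the summed weights of shared interests."""
    scores = defaultdict(float)
    for interest in interests_of[user]:
        for other in users_with[interest]:
            if other == user or not compatible(user, other):
                continue
            scores[other] += weight(interest)
    # Highest score first; ties broken by name for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Invented profiles; interests are written as "category:value".
interests_of = {
    "g": {"book:Fountain Head", "music:Jazz"},
    "p1": {"book:Fountain Head"},
    "p2": {"music:Jazz"},
    "p3": {"book:Fountain Head", "music:Jazz"},
}

# Inverted index: interest -> set of users who list it.
users_with = defaultdict(set)
for person, interests in interests_of.items():
    for interest in interests:
        users_with[interest].add(person)

CATEGORY_WEIGHT = {"book": 3.0, "music": 1.0}  # placeholder weights

def weight(interest):
    return CATEGORY_WEIGHT[interest.split(":", 1)[0]]

matches = rank_matches("g", interests_of, users_with, weight,
                       compatible=lambda a, b: True)
print(matches)  # [('p3', 4.0), ('p1', 3.0), ('p2', 1.0)]
```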
The choice of weights is not a simple task though. You would need a lot of psychology to get that right. Using your common sense however, you could get values that are not that far off.
In general, matching people is much more complicated than summing some scores. For example, a certain set of matching interests may have more (or in some cases less) effect than the sum of the individual interests. Also, one person's interest may lead to outright rejection by the other, no matter what other matching interests exist (take two very similar people, one of whom loves Twilight while the other hates it).
