Can we apply selection over projection in relational algebra?

Here is the schema for the works table:
(personName, age, companyName, rating)
Now I want all the tuples where the name is megha and the person works in the company mtbank.
I know I can do it like this:
π personName, age (σ companyName="M&T Bank" and personName="megha"(works))
But can we do it like this:
π personName, age (σ personName="megha" (π personName, age (σ companyName="M&T Bank" (works))))
Can we apply selection over projection like this?

The only time you can't nest expressions is when the input is illegal for the operator: e.g. division by zero in arithmetic; relational union, intersection or difference of two relations with different headings; or projection on missing attributes, or restriction/selection on a condition mentioning a missing attribute.
The particulars depend on what version of relational algebra (relations & operators) you are talking about and the range of values your relation names can stand for.
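Concretely, the second expression above is legal because the outer selection only mentions personName, which survives the inner projection. A minimal Python/pandas sketch of the equivalence (the table contents are invented for illustration, and projection is modelled with drop_duplicates to mimic set semantics):

import pandas as pd

# Hypothetical rows for works(personName, age, companyName, rating).
works = pd.DataFrame([
    ("megha", 30, "M&T Bank", 4),
    ("megha", 30, "Acme", 5),
    ("ravi",  41, "M&T Bank", 3),
], columns=["personName", "age", "companyName", "rating"])

# Form 1: one combined selection, then the projection.
form1 = (works[(works.companyName == "M&T Bank") & (works.personName == "megha")]
         [["personName", "age"]].drop_duplicates())

# Form 2: selection applied on top of a projection -- allowed because the
# condition only uses personName, an attribute kept by the inner projection.
inner = works[works.companyName == "M&T Bank"][["personName", "age"]].drop_duplicates()
form2 = inner[inner.personName == "megha"]

assert form1.reset_index(drop=True).equals(form2.reset_index(drop=True))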

Related

How can I use SPSS syntax to count instances of variable combinations?

I have categorical variables like "Household income", "Urban/rural", "Gender", "Age". I want to find out how many people are all of (for example): Low household income, urban, male, and in age category two.
That is, I don't want to calculate the frequencies of each of these variables separately, but rather, I want to see how many people lie at the intersections of them.
Basically: How do I use SPSS to find out how many people in my dataset are low income urban males between ages 25-33?
Aggregating by the categorical variables can give you all the possible intersections:
dataset declare agg.
aggregate /out = agg /break = Household_income Urban_rural Gender AgeGroup /Nintersect = n .
dataset activate agg.
The new dataset now has the count of cases in each intersection of all the categorical variables used as break in the aggregate command.
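For comparison outside SPSS, the same idea can be sketched in Python/pandas; the column names and values below are assumptions standing in for the survey variables:

import pandas as pd

# Hypothetical survey data.
df = pd.DataFrame({
    "Household_income": ["low", "low", "high", "low"],
    "Urban_rural":      ["urban", "urban", "rural", "urban"],
    "Gender":           ["male", "male", "female", "male"],
    "AgeGroup":         [2, 2, 3, 1],
})

# One row per combination of the break variables, with the number of cases in each.
counts = (df.groupby(["Household_income", "Urban_rural", "Gender", "AgeGroup"])
            .size().reset_index(name="Nintersect"))
print(counts)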
You could use the crosstab command, from the Analyze/Descriptive Statistics menu. This works best when you want to intersect 2 variables. You may also use the Layer option if you have a 3rd variable.
For more than 3 variables, use an if recode:
if age=2 and gender=1 ... recodevar=1.
fre recodevar.
Mind the overlapping scenarios, so you don't overwrite your recodevar.

Database Relational Algebra: How to find actors who have played in ALL movies produced by "Universal Studios"?

Given the following relational schemas, where the primary keys are in bold:
movie(movieName, whenMade);
actor(actorName, age);
studio(studioName, location, movieName);
actsIn(actorName, movieName);
How do you find the list of actors who have played in EVERY movie produced by "Universal Studios"?
My attempt:
π actorName ∩ (σ studioName=“Universal Studios” studio) |><| actsIn, where |><| is the natural join
Are you supposed to use cartesian product and/or division? :\
Here are the two steps that you should follow:
Write an expression to find the names of the movies produced by “Universal Studios” (the result is a relation with a single attribute).
Divide the relation actsIn by the relation obtained in the first step.
This should give you the expected result (i.e. a relation with the names of the actors who have played in every movie of “Universal Studios”).
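As a cross-check, here is a minimal Python/pandas sketch of those two steps (the sample rows are invented; the division is implemented by testing, per actor, whether their set of movies contains every Universal movie):

import pandas as pd

# Hypothetical rows for studio(studioName, location, movieName) and actsIn(actorName, movieName).
studio = pd.DataFrame([
    ("Universal Studios", "LA", "Jaws"),
    ("Universal Studios", "LA", "E.T."),
    ("Paramount",         "LA", "Titanic"),
], columns=["studioName", "location", "movieName"])

actsIn = pd.DataFrame([
    ("Alice", "Jaws"), ("Alice", "E.T."), ("Alice", "Titanic"),
    ("Bob",   "Jaws"),
], columns=["actorName", "movieName"])

# Step 1: single-attribute relation of movies produced by "Universal Studios".
universal = set(studio.loc[studio.studioName == "Universal Studios", "movieName"])

# Step 2: division -- keep actors whose movies include every Universal movie.
plays_all = (actsIn.groupby("actorName")["movieName"]
                   .apply(lambda movies: universal.issubset(set(movies))))
actors = plays_all[plays_all].index.tolist()   # ['Alice']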

Association rules algorithm

I am new to data mining. I want to mine multi-dimensional and ordinal association rules from my data set e.g.
if (income >= 100) ^ (priority >= 1) ^ (skill = technician) then (approve)
What I have learned is that
categorical = for skills, e.g. technician, plumber, or any textual data
quantitative = numeric, e.g. date, balance
So the main question is: which association rule algorithm should be used? Most algorithms are either quantitative or categorical; is there any combined one?
I think you are misunderstanding the concept of association rule mining on its own.
Your quantitative data cannot be used as such in association rule mining (as I understood your question). At least, you cannot 'tune' the quantity to fit your needs, because everything in association rule mining is either an item (quantitative or qualitative) or a transaction, so that you can define the rules that relate the items to each other. Therefore, the quantities become 'fixed' items.
Note what association rule mining is: given a set of binary attributes (items) and a set of transactions, each containing a subset of items, you define a set of rules, which are implications: X -> Y (with X and Y being subsets of the set of items, and disjoint as well).
You can interpret (or model) the implication of a rule as an if, but that is just syntactic sugar. There is no quantitative or qualitative data as we know it in association rule mining: just items that belong to a set and the relationships (implications/rules) that we define between them.
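One way to make this concrete is to discretize the quantitative attributes into fixed items first and then mine rules over the resulting transactions. A rough Python sketch (the field names and bucket boundaries are assumptions taken from the example rule above):

# Hypothetical records mixing quantitative and categorical attributes.
records = [
    {"income": 120, "priority": 1, "skill": "technician", "approve": True},
    {"income": 80,  "priority": 2, "skill": "plumber",    "approve": False},
]

def to_items(rec):
    items = set()
    # Quantitative attributes become fixed items via discretization.
    items.add("income>=100" if rec["income"] >= 100 else "income<100")
    items.add("priority=" + str(rec["priority"]))
    # Categorical attributes are already item-like.
    items.add("skill=" + rec["skill"])
    if rec["approve"]:
        items.add("approve")
    return items

transactions = [to_items(r) for r in records]
# transactions can now be fed to any standard itemset/rule miner (e.g. Apriori).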

Which algorithm/implementation for weighted similarity between users by their selected, distanced attributes?

Data Structure:
User has many Profiles
(Limit - no more than one of each profile type per user, no duplicates)
Profiles has many Attribute Values
(A user can have as many or few attribute values as they like)
Attributes belong to a category
(No overlap. This controls which attribute values a profile can have)
Example/Context:
I believe with stack exchange you can have many profiles for one user, as they differ per exchange site? In this problem:
Profile: Video, so Video profile only contains Attributes of Video category
Attributes, so an Attribute in the Video category may be Genre
Attribute Values, e.g. Comedy, Action, Thriller are all Attribute Values
Profiles and Attributes are just ways of grouping Attribute Values on two levels.
Without grouping (which is needed for weighting in 2. onwards), the relationship is just User hasMany Attribute Values.
Problem:
Give each user a similarity rating against each other user.
Similarity based on All Attribute Values associated with the user.
Flat/one level
Unequal number of attribute values between two users
Attribute value can only be selected once per user, so no duplicates
Therefore, binary string/boolean array with Cosine Similarity?
1 + Weight Profiles
Give each profile a weight (totaling 1?)
Work out profile similarity, then multiply by weight, and sum?
2 + Weight Attribute Categories and Profiles
As an attribute belongs to a category, categories can be weighted
Similarity per category, weighted sum, then same by profile?
Or merge profile and category weights
3 + Distance between every attribute value
Table of similarity distance for every possible value vs value
Rather than similarity by value === value
'Close' attributes contribute to overall similarity.
No idea how to do this one
Fancy code and useful functions are great, but I'm really looking to fully understand how to achieve these tasks, so I think generic pseudocode is best.
Thanks!
First of all, you should remember that everything should be made as simple as possible, but not simpler. This rule applies to many areas, but in things like semantics, similarity and machine learning it is essential. Using several layers of abstraction (attributes -> categories -> profiles -> users) makes your model harder to understand and to reason about, so I would try to avoid it as much as possible. This means that it's highly preferable to keep a direct relation between users and attributes. So, basically, your users should be represented as vectors, where each variable (vector element) represents a single attribute.
If you choose such representation, make sure all attributes make sense and have appropriate type in this context. For example, you can represent 5 video genres as 5 distinct variables, but not as numbers from 1 to 5, since cosine similarity (and most other algos) will treat them incorrectly (e.g. multiply thriller, represented as 2, with comedy, represented as 5, which makes no sense actually).
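For instance, a sketch of that representation (the genre list is an assumption):

# Represent genre as distinct binary variables rather than a single 1-5 code.
genres = ["comedy", "action", "thriller", "drama", "horror"]

def encode(user_genres):
    return [1.0 if g in user_genres else 0.0 for g in genres]

encode({"comedy", "thriller"})   # [1.0, 0.0, 1.0, 0.0, 0.0]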
It's ok to use distance between attributes when applicable, though I can hardly come up with an example in your setting.
At this point you should stop reading and try it out: simple representation of users as vector of attributes and cosine similarity. If it works well, leave it as is - overcomplicating a model is never good.
And if the model performs badly, try to understand why. Do you have enough relevant attributes? Or are there too many noisy variables that only make it worse? Or should some attributes really have larger importance than others? Depending on these questions, you may want to:
Run feature selection to avoid noisy variables.
Transform your variables, representing them in some other "coordinate system". For example, instead of using N variables for N video genres, you may use M other variables to represent closeness to specific social group. Say, 1 for "comedy" variable becomes 0.8 for "children" variable, 0.6 for "housewife" and 0.9 for "old_people". Or anything else. Any kind of translation that seems more "correct" is ok.
Use weights. Not weights for categories or profiles, but weights for distinct attributes. But don't set these weights yourself, instead run linear regression to find them out.
Let me describe the last point in a bit more detail. Instead of plain cosine similarity, whose core is the dot product:
cos(x, y) = x[0]*y[0] + x[1]*y[1] + ... + x[n]*y[n]
you may use weighted version:
cos(x, y) = w[0]*x[0]*y[0] + w[1]*x[1]*y[1] + ... + w[n]*x[n]*y[n]
The standard way to find such weights is to use some kind of regression (linear regression is the most popular). Normally, you collect a dataset (X, y) where X is a matrix with your data vectors on the rows (e.g. details of houses being sold) and y is some kind of "correct answer" (e.g. the actual price each house was sold for). However, in your case there is no correct answer for a single user vector; in fact, you can only define a correct answer for the similarity between two users. So why not use that? Make each row of X a combination of 2 user vectors, and the corresponding element of y the similarity between them (which you assign yourself for a training dataset). E.g.:
X[k] = [ user_i[0]*user_j[0], user_i[1]*user_j[1], ..., user_i[n]*user_j[n] ]
y[k] = .75 // or whatever you assign to it
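A minimal numpy sketch of that idea (the vectors and hand-assigned similarities below are made up; in practice you would label many more pairs):

import numpy as np

# Two hypothetical users as binary attribute vectors.
user_a = np.array([1, 0, 1, 0], dtype=float)
user_b = np.array([1, 1, 0, 0], dtype=float)

# Training data: each row is the element-wise product of a pair of user vectors,
# each target is the similarity you assigned to that pair by hand.
X = np.array([user_a * user_b,
              user_a * user_a])
y = np.array([0.40, 1.00])

# Least-squares fit of the weights w in sim(x, y) = sum_i w[i]*x[i]*y[i].
w, *_ = np.linalg.lstsq(X, y, rcond=None)

def weighted_similarity(x, y):
    return float(np.dot(w, x * y))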
HTH

What is the difference between a Confusion Matrix and Contingency Table?

I'm writing a piece of code to evaluate my clustering algorithm, and I find that every kind of evaluation method needs the basic data from an m*n matrix A = {a_ij}, where a_ij is the number of data points that are members of class c_i and elements of cluster k_j.
But there appear to be two matrices of this type in Introduction to Data Mining (Pang-Ning Tan et al.): one is the Confusion Matrix, the other is the Contingency Table. I do not fully understand the difference between the two. Which best describes the matrix I want to use?
Wikipedia's definition:
In the field of artificial intelligence, a confusion matrix is a visualization tool typically used in supervised learning (in unsupervised learning it is typically called a matching matrix). Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class.
The confusion matrix should be clear: it basically tells how many actual results match the predicted results. For example, see this confusion matrix:
             predicted c1   predicted c2
actual c1         15              3
actual c2          0              2
It tells that:
Column 1, row 1 means that the classifier has predicted 15 items as belonging to class c1, and those 15 items actually belong to class c1 (which is a correct prediction)
Column 2, row 1 tells that the classifier has predicted that 3 items belong to class c2, but they actually belong to class c1 (which is a wrong prediction)
Column 1, row 2 means that none of the items that actually belong to class c2 have been predicted to belong to class c1 (which would be a wrong prediction)
Column 2, row 2 tells that 2 items that belong to class c2 have been predicted to belong to class c2 (which is a correct prediction)
Now see the formulas for Accuracy and Error Rate in your book (Chapter 4, 4.2), and you should be able to clearly understand what a confusion matrix is. It is used to test the accuracy of a classifier using data with known results. The K-Fold method, also mentioned in the book, is one of the methods to estimate the accuracy of a classifier.
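As a quick illustration (assuming scikit-learn is available; the label lists below just reconstruct the 2x2 example above):

from sklearn.metrics import confusion_matrix, accuracy_score

# 15 c1 predicted as c1, 3 c1 predicted as c2, 0 c2 predicted as c1, 2 c2 predicted as c2.
y_true = ["c1"] * 18 + ["c2"] * 2
y_pred = ["c1"] * 15 + ["c2"] * 3 + ["c2"] * 2

print(confusion_matrix(y_true, y_pred, labels=["c1", "c2"]))   # [[15 3] [0 2]]
print(accuracy_score(y_true, y_pred))                          # 17 correct out of 20 -> 0.85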
Now, for Contingency table:
Wikipedia's definition:
In statistics, a contingency table (also referred to as cross tabulation or cross tab) is a type of table in a matrix format that displays the (multivariate) frequency distribution of the variables. It is often used to record and analyze the relation between two or more categorical variables.
In data mining, contingency tables are used to show which items appeared together, e.g. in a transaction or in the shopping cart of a sales analysis. For example (this is the example from the book you have mentioned):
        Coffee   !Coffee   Total
tea        150        50     200
!tea       650       150     800
Total      800       200    1000
It tells that, out of 1000 survey responses (about whether people like coffee, tea, both, or neither):
150 people like both tea and coffee
50 people like tea but do not like coffee
650 people do not like tea but like coffee
150 people like neither tea nor coffee
Contingency tables are used to find the Support and Confidence of association rules, basically to evaluate association rules (read Chapter 6, 6.7.1).
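For example, for the rule {tea} → {coffee} in the table above, the usual definitions give support = 150/1000 = 0.15 and confidence = 150/200 = 0.75.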
So the difference is: a confusion matrix is used to evaluate the performance of a classifier, i.e. it tells how accurate the classifier's predictions are, while a contingency table is used to evaluate association rules.
Now after reading the answer, google a bit (always use google while you are reading your book), read what is in the book, see a few examples, and don't forget to solve a few exercises given in the book, and you should have a clear concept about both of them, and also what to use in a certain situation and why.
Hope this helps.
In short, a contingency table is used to describe data, and a confusion matrix is, as others have pointed out, often used when comparing two hypotheses. One can think of predicted vs. actual classification/categorization as two hypotheses, with the ground truth being the null and the model output being the alternative.

Resources