Difference between natural join and simple join on common attribute in algebra - relational-algebra

I am confused about the following.
Suppose there are two relations R and S with a common attribute A.
Is
(R natural join S) = (R join S where the join condition is A=A)?
A natural join returns a single common column A.
Does a simple join return two columns with the same name A, or one common column A, given that relational algebra is defined in terms of set theory?

There's an example of a Natural Join here. As @Renzo says, there are many variants. And SQL is different again. So I'll keep to what Wikipedia shows.
Most important: the join condition applies to all attributes in common between the two arguments. So you need to say "two relations with A being their only common attribute". The only common attribute is DeptName in that Wikipedia example. There can be many common attributes, in general.
Yes, joining means forming tuples in the result by pairing tuples from the arguments that have the same values in the corresponding common attributes. So you have the same value with the same attribute name. It would be pointless repeating both attributes in the result, because you'd be repeating the values. The example shows there's a single attribute DeptName in the result.
Beware that different dialects of Relational Algebra use different symbols and notations. So the bare bowtie (⋈) for Natural Join can be suffixed with a boolean condition, making a theta-join (θ-join) or equi-join -- see that example. The boolean condition is between differently-named attributes, and might use any comparison operator. So both attribute names and their values appear in the result.
Set theory operations apply because each tuple is a set of name-value pairs. The result tuples are the union of a tuple from each argument -- providing that union is a valid tuple. That is, providing same-named n-v pairs have same value.
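The "union of name-value pairs" view can be sketched in Python. This is only an illustration, not any particular algebra's definition: relations are modelled as lists of dicts, and the sample rows and the helper name natural_join are made up for the example.

```python
def natural_join(r, s):
    """Natural join of two relations given as lists of dicts (attribute -> value)."""
    result = []
    for t in r:
        for u in s:
            common = t.keys() & u.keys()
            # The union of the two tuples is a valid tuple only if every
            # same-named attribute carries the same value in both.
            if all(t[a] == u[a] for a in common):
                merged = {**t, **u}
                if merged not in result:  # a relation is a set: no duplicate tuples
                    result.append(merged)
    return result


# Hypothetical sample data; DeptName is the only common attribute.
employee = [
    {"Name": "Harry", "DeptName": "Finance"},
    {"Name": "Sally", "DeptName": "Sales"},
]
dept = [
    {"DeptName": "Finance", "Manager": "George"},
    {"DeptName": "Sales", "Manager": "Harriet"},
]

joined = natural_join(employee, dept)
for row in joined:
    print(row)
```

Each result row has exactly one DeptName entry, because a dict (like a tuple in this model) cannot hold the same attribute name twice.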

Related

How to subtract values from two columns in RelaX (online relational algebra calculator)

Is there any way to subtract values from two different columns using RelaX (a relational algebra online calculator)? I have tried using projection, group by, as well as a few examples I saw here on SO. I am trying to subtract the average wage from the value of each employee's wage.
The RelaX projection operator takes a list of expressions giving the column values of each row returned. Those expressions can be just column names but they don't have to be. (As with an SQL select clause.)
From the help link:
projection
Expressions can be used to create more complex statements using one or more columns of a single row.
pi c.id, lower(username)->user, concat(firstname, concat(' ', lastname))->fullname (
ρ c ( Customer )
)
Value expressions
With most operators you can use a value-expression which connects one or more columns of a single row to calculate a new value. This is possible for:
the projection creating a new column (make sure to give the column a name)
the selection any expression evaluating to boolean can be used
for the joins any expression evaluating to boolean can be used; note that the rownum() expression always represents the index of the lefthand relation
PS RelaX is properly a query language, not an algebra. Its "value expressions" are not evaluated to a value before the call. That begs the question of how we would implement a language using an algebra.
From Is multiplication allowed in relational algebra?:
Some so-called "algebras" are really languages because the expressions don't only represent the results of operators being called on values. Although it is possible for an algebra to have operand values that represent expressions and/or relation values that contain names for themselves.
The projection that takes attribute expressions begs the question of its implementation given an algebra with projection only on a relation value and attribute names. This is important in an academic setting because a question may be wanting you to actually figure out how to do that, or because the difficulty of a question is dependent on the operators available. So find out what algebra you are supposed to use.
We can introduce an operator on attribute values when we only have basic relation operators taking attribute names and relation values. Each such operator can be associated with a relation value that has an attribute for each operand and an attribute for the result. The relation holds the tuples where the result value is equal to the result of the operator called on the operand values. (The result is functionally dependent on the operands.)
From Relational Algebra rule for column transformation:
Suppose you supply the division operator on values of a column in the form of a constant base relation called DIVIDE holding tuples where dividend/divisor=quotient. I'll use the simplest algebra, with headings that are sets of attribute names. Assume we have input relation R with column c & average A. We want the relation like R but with each column c value set to its original value divided by A.
/* rows where
EXISTS dividend [R(dividend) & DIVIDE(dividend, A, c)]
*/
PROJECT c (
RENAME c\dividend (R)
NATURAL JOIN
RENAME quotient\c (
PROJECT dividend, quotient (SELECT divisor=A (DIVIDE))))
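To see the expression above at work, here is a rough Python sketch. Everything in it is an assumption made for illustration: relations are lists of dicts, the helpers rename, project, select and natural_join are hand-written stand-ins for the algebra's operators, the rows of R are invented, and DIVIDE is cut down to a small finite range (the real constant relation would be infinite).

```python
from fractions import Fraction


def rename(rel, old, new):
    return [{(new if k == old else k): v for k, v in t.items()} for t in rel]


def project(rel, attrs):
    out = []
    for t in rel:
        p = {a: t[a] for a in attrs}
        if p not in out:  # projection may create duplicates; drop them
            out.append(p)
    return out


def select(rel, pred):
    return [t for t in rel if pred(t)]


def natural_join(r, s):
    out = []
    for t in r:
        for u in s:
            if all(t[a] == u[a] for a in t.keys() & u.keys()):
                m = {**t, **u}
                if m not in out:
                    out.append(m)
    return out


R = [{"c": 10}, {"c": 20}, {"c": 30}]  # hypothetical input relation
A = 20                                  # the average of column c

# Finite stand-in for DIVIDE: tuples where dividend/divisor = quotient.
DIVIDE = [
    {"dividend": x, "divisor": y, "quotient": Fraction(x, y)}
    for x in range(0, 41)
    for y in range(1, 41)
]

result = project(
    natural_join(
        rename(R, "c", "dividend"),
        rename(
            project(select(DIVIDE, lambda t: t["divisor"] == A),
                    ["dividend", "quotient"]),
            "quotient", "c",
        ),
    ),
    ["c"],
)
print(result)
```

A SUBTRACT relation with attributes for minuend, subtrahend and difference would be handled the same way.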
From Relational algebra - recode column values:
To introduce specific values into a relational algebra expression you have to have a way to write table literals. Usually the necessary operators are not made explicit, but on the other hand algebra exercises frequently use some kind of notation for example values.

Can a set have duplicate elements?

I have been asked a question that is a little ambiguous for my coursework.
The array of strings is regarded as a set, i.e. unordered.
I'm not sure whether I need to remove duplicates from this array?
I've tried googling but one place will tell me something different to the next. Any help would be appreciated.
From Wikipedia in Set (Mathematics)
A set is a collection of well defined and distinct objects.
Perhaps the confusion derives from the fact that a set does not depend on the way its elements are displayed. A set remains the same if its elements are listed more than once or rearranged.
As such, the programming languages I know would not put an element into a set if the element already belongs to it, or they would replace it if it already exists, but would never allow a duplication.
Programming Language Examples
Let me offer a few examples in different programming languages.
In Python
A set in Python is defined as "an unordered collection of unique elements". And if you declare a set like a = {1,2,2,3,4} it will only add 2 once to the set.
If you do print(a) the output will be {1, 2, 3, 4}.
Haskell
In Haskell the insert operation of sets is defined as: "[...] if the set already contains an element equal to the given value, it is replaced with the new value."
As such, if you do this: let a = fromList [1,2,2,3,4] (with fromList imported from Data.Set), and you print a to the main output, it would render fromList [1,2,3,4].
Java
In Java sets are defined as: "a collection that contains no duplicate elements.". Its add operation is defined as: "adds the specified element to this set if it is not already present [...] If this set already contains the element, the call leaves the set unchanged".
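Set<Integer> myInts = new HashSet<>(Arrays.asList(1, 2, 2, 3, 4)); // needs java.util.Arrays, java.util.HashSet, java.util.Set
System.out.println(myInts);
This code, as in the other examples, would output [1, 2, 3, 4]. (HashSet does not guarantee iteration order, but the duplicate 2 appears only once.)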
A set cannot have duplicate elements by its mere definition. The correct structure to allow duplicate elements is Multiset or Bag:
In mathematics, a multiset (or bag) is a generalization of the concept of a set that, unlike a set, allows multiple instances of the multiset's elements. For example, {a, a, b} and {a, b} are different multisets although they are the same set. However, order does not matter, so {a, a, b} and {a, b, a} are the same multiset.
A very common and useful example of a Multiset in programming is the collection of values of an object:
values({a: 1, b: 1}) //=> Multiset(1,1)
The values here are unordered, yet they cannot be reduced to Set(1): that would, for example, break iteration over the object's values.
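In Python, the set/multiset distinction can be tried out with the standard library's collections.Counter, which is one common way to model a multiset (a sketch, not the only option):

```python
from collections import Counter

# As sets, repeated elements collapse, so these two are equal.
assert {"a", "a", "b"} == {"a", "b"}

# As multisets they differ, but order still does not matter.
assert Counter(["a", "a", "b"]) != Counter(["a", "b"])
assert Counter(["a", "a", "b"]) == Counter(["a", "b", "a"])

# The values of a mapping form a multiset: the value 1 occurs twice.
values = Counter({"a": 1, "b": 1}.values())
print(values)       # Counter({1: 2})
print(set(values))  # {1}  (the multiplicity is lost)
```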
Further, quoting from the linked Wikipedia article (see there for the references):
Multisets have become an important tool in databases.[18][19][20] For instance, multisets are often used to implement relations in database systems. Multisets also play an important role in computer science.
Let A={1,2,2,3,4,5,6,7,...} and B={1,2,3,4,5,6,7,...} then any element in A is in B and any element in B is in A ==> A contains B and B contains A ==> A=B. So of course sets can have duplicate elements, it's just that the one with duplicate elements would end up being exactly the same as the one without duplicate elements.
"Sets are Iterables that contain no duplicate elements."
https://docs.scala-lang.org/overviews/collections/sets.html

How to determine correspondence between two lists of names?

I have:
1 million university student names and
3 million bank customer names
I manage to convert strings into numerical values based on hashing (similar strings have similar hash values). I would like to know how can I determine correlation between these two sets to see if values are pairing up at least 60%?
Can I achieve this using ICC? How does ICC 2-way random work?
Please kindly answer ASAP as I need this urgently.
This kind of entity resolution is normally easy, but I am surprised by the hashing approach here. Hashing loses information that is critical to entity resolution. So, if possible, you shouldn't compare hashes but rather the original strings.
Assuming using original strings is an option, then you would want to do something like this:
List A (1M), List B (3M)
// First, match the entities that match very well, and REMOVE them.
for a in List A
    for b in List B
        if compare(a,b) >= MATCH_THRESHOLD // This may be 90% etc
            add (a,b) to matchedList
            remove a from List A
            remove b from List B
// Now, match the entities that match well, and run bipartite matching
// Bipartite matching is required because each entity can match "acceptably well"
// with more than one entity on the other side
for a in List A
    for b in List B
        compute compare(a,b)
        set edge(a,b) = compare(a,b)
        if compare(a,b) < THRESHOLD // This seems to be 60%
            set edge(a,b) = 0
// Now, run bipartite matcher and take results
The time complexity of this algorithm is O(n1 * n2), which is not very good. There are ways to avoid this cost, but they depend upon your specific entity resolution function. For example, if the last name has to match (to make the 60% cut), then you can simply create sublists in A and B that are partitioned by the first couple of characters of the last name, and just run this algorithm between corresponding lists. But it may very well be that last name "Nuth" is supposed to match "Knuth", etc. So, some local knowledge of what your name comparison function is can help you divide and conquer this problem better.
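The partitioning idea can be sketched in Python as follows. The comparison function here is an assumption: difflib.SequenceMatcher from the standard library stands in for whatever compare(a,b) you actually use, the blocking rule (first two letters of the last word) is hypothetical, names are assumed to be non-empty, and the sample names are invented.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def similarity(a, b):
    # Stand-in for compare(a, b); returns a ratio between 0.0 and 1.0.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def block_key(name):
    # Hypothetical blocking rule: first two letters of the last word.
    return name.split()[-1][:2].lower()


def candidate_matches(list_a, list_b, threshold=0.6):
    # Group list B by block so each name in A is compared only within its block.
    blocks = defaultdict(list)
    for b in list_b:
        blocks[block_key(b)].append(b)

    matches = []
    for a in list_a:
        for b in blocks.get(block_key(a), []):
            score = similarity(a, b)
            if score >= threshold:
                matches.append((a, b, score))
    return matches


students = ["Donald Knuth", "Ada Lovelace"]
customers = ["Donald E. Knuth", "D. Nuth", "Alan Turing", "Ada Lovelace"]
matches = candidate_matches(students, customers)
for a, b, score in matches:
    print(a, "<->", b, round(score, 2))
```

Note that "D. Nuth" lands in a different block from "Donald Knuth" and is never compared, which is exactly the risk described above: the blocking rule has to agree with what the comparison function is meant to accept.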

what is natural ordering when we talk about sorting?

What is meant by natural ordering? Suppose I have an Employee object with name, age and date of joining: what would sorting by natural ordering mean?
Natural ordering is a kind of alphanumerical sort which seems natural to humans.
In a classical alphanumerical sort we will have something like:
1 10 11 12 2 20 21 3 4 5 6 7
If you're using natural ordering, it will be:
1 2 3 4 5 6 7 10 11 12 20 21
Depending on the language, natural ordering sometimes ignores capital letters and accents (i.e. all accented letters are treated like their unaccented counterparts).
Many languages have a function to order strings naturally. However, an Employee is too "high level" for the language: you must decide what it means for you to order them naturally and write the corresponding function.
In my view, ordering Employees would start by ordering them by name using a natural sort, then by age and finally by date of joining.
According to statistics there are two types of categorical variables. Variables having categories without a numerical ordering (nominal) and those which do have ordered categories (ordinal). The example of an Employee's name, age and date of joining is actually considered a nominal variable so there can be no sorting by natural ordering. Natural ordering could exist for example in age had you categorized it in levels of child, teenager, adult, in which one can observe an ascending type of sorting.
For strings containing numbers it means 1,2,3,4,5,6,7,8,9,10,11 instead of 1,10,11,2,3,4,5,6,7,8,9
Quite an old question, but very simply put, the Natural Order is an ascending order of the enumerable collection of the comparable elements:
For the numbers: 1, 2, 3...
For the characters: A, B, C...
If someone, like me, found themselves reading the following article:
https://www.copterlabs.com/natural-sorting-in-mysql/
(which by the way is really useful), beware: it describes a different method of sorting.
A correct natural sorting algorithm states that you order alphabetically but when you encounter a digit you will order that digit and all the subsequent digits as a single character.
Natural sorting has nothing to do with sorting by string length first, and then alphabetically when two strings have the same length. Though the article I linked is interesting, don't make the mistake I made and think that that's the correct way to sort naturally.
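The rule "treat a run of digits as a single unit" can be sketched in Python with a sort key. This is one possible implementation, not a standard library feature; the function name natural_key is made up, and lowercasing the text parts is an assumption about how case should be handled.

```python
import re


def natural_key(text):
    """Split text into digit and non-digit runs so digit runs compare as numbers."""
    key = []
    for chunk in re.split(r"([0-9]+)", text):
        if not chunk:
            continue  # re.split yields empty strings at the edges
        if chunk.isdecimal():
            key.append((0, int(chunk), ""))      # numeric run: compare by value
        else:
            key.append((1, 0, chunk.lower()))    # text run: compare alphabetically
    return key


items = ["1", "10", "11", "12", "2", "20", "21", "3", "4", "5", "6", "7"]
print(sorted(items))                   # plain alphanumerical order
print(sorted(items, key=natural_key))  # natural order
```

The tagged tuples keep numbers and text from ever being compared directly, so mixed inputs such as "file2.txt" and "file10.txt" sort without a type error.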
In Java, the ordering provided by the Comparable interface is called the natural ordering, so the Comparator interface provides, so to speak, an unnatural ordering.

union/intersection of 2 sets, where each set is defined by its subsets

Every set is defined as a union of other sets.
For example
A = B union {1,2}
B = C union D
C = {5,6}
D = {5,7}
E = {4}
then A = {1,2,5,6,7}
A union E = {1,2,4,5,6,7}
Are there any efficient algorithms to do that? Suppose the hierarchy of unions can be really deep, and the subsets can change fairly often (though not constantly).
I think there should be ways to reduce the number of unions one has to compute.
So you have an unchanging hierarchy of unions of changing sets? And you are, like in your example, only interested in the value of one set?
Then flatten the hierarchy. That is, in your example you would once walk through the hierarchy to find the set of changing sets your set is the union of, and store this set.
To dispense with recomputing unions whenever a leaf set changes, you could track for each element how many leaf sets currently contain it. This count can be updated quickly when a leaf set changes, and doing so does not require looking at any unchanged leaf sets. The elements with a count > 0 are then exactly those currently in the union.
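A minimal Python sketch of this counting idea, using the sets from the question after flattening A into its leaves ({1,2}, C and D). The class name UnionOfLeaves and the leaf label "literal" are invented for the example.

```python
from collections import Counter


class UnionOfLeaves:
    """Maintains the union of several changing leaf sets via per-element counts."""

    def __init__(self, leaves):
        self.leaves = {name: set(items) for name, items in leaves.items()}
        self.counts = Counter()
        for items in self.leaves.values():
            self.counts.update(items)

    def add(self, leaf, item):
        if item not in self.leaves[leaf]:
            self.leaves[leaf].add(item)
            self.counts[item] += 1

    def remove(self, leaf, item):
        if item in self.leaves[leaf]:
            self.leaves[leaf].remove(item)
            self.counts[item] -= 1
            if self.counts[item] == 0:
                del self.counts[item]  # no leaf contains it any more

    def members(self):
        return set(self.counts)


# A = B union {1,2}, B = C union D, flattened to the leaves of A.
a = UnionOfLeaves({"literal": {1, 2}, "C": {5, 6}, "D": {5, 7}})
print(sorted(a.members()))  # [1, 2, 5, 6, 7]

a.remove("D", 5)
print(5 in a.members())     # True, because C still contains 5
```

Each add or remove touches only the changed leaf and one counter entry, regardless of how many other leaves there are.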
Perhaps you're looking for some sort of disjoint set data structure?
Several questions about this case.
First question: For how long does this "script" / "program" have to run? If it's not too long, it could well be a good option to simply store previous unions and check that cache before performing a union. Memory isn't that expensive nowadays :).
Second question you should ask: are the elements in a certain order before the union? If they aren't, and a list is accessed more than once, it can be very useful to sort the list first (then you can make a decision when you're only halfway through a list, for example). The merge step of mergesort is an efficient way to combine two ordered lists.
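As a sketch of that merge step, here is a union of two already-sorted lists in Python. It assumes both inputs are sorted ascending; the function name sorted_union is made up for the example.

```python
def sorted_union(xs, ys):
    """Union of two ascending lists in one pass, without duplicates."""
    out = []
    i = j = 0
    while i < len(xs) and j < len(ys):
        if xs[i] < ys[j]:
            nxt = xs[i]
            i += 1
        elif ys[j] < xs[i]:
            nxt = ys[j]
            j += 1
        else:  # equal: take one copy and advance both lists
            nxt = xs[i]
            i += 1
            j += 1
        if not out or out[-1] != nxt:
            out.append(nxt)
    # At most one of the two lists still has elements left.
    for rest in (xs[i:], ys[j:]):
        for v in rest:
            if not out or out[-1] != v:
                out.append(v)
    return out


print(sorted_union([1, 2, 5, 6, 7], [4]))  # [1, 2, 4, 5, 6, 7]
```

The pass is linear in the combined length of the two lists, which is where sorting first pays off if the same lists are combined repeatedly.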
