How to subtract values from two columns in RelaX (online relational algebra calculator) - relational-algebra

Is there any way to subtract values from two different columns using RelaX (a relational algebra online calculator)? I have tried using projection, group by, as well as a few examples I saw here on SO. I am trying to subtract the average wage from the wage of each employee.

The RelaX projection operator takes a list of expressions giving the column values of each row returned. Those expressions can be just column names but they don't have to be. (As with an SQL select clause.)
From the help link:
projection
Expressions can be used to create more complex statements using one or more columns of a single row.
pi c.id, lower(username)->user, concat(firstname, concat(' ', lastname))->fullname (
ρ c ( Customer )
)
Value expressions
With most operators you can use a value-expression which connects one or more columns of a single row to calculate a new value. This is possible for:
the projection: creating a new column (make sure to give the column a name)
the selection: any expression evaluating to boolean can be used
the joins: any expression evaluating to boolean can be used; note that the rownum() expression always represents the index of the lefthand relation
PS RelaX is properly a query language, not an algebra. Its "value expressions" are not evaluated to a value before the call. That begs the question of how we would implement a language using an algebra.
From Is multiplication allowed in relational algebra?:
Some so-called "algebras" are really languages because the expressions don't only represent the results of operators being called on values. Although it is possible for an algebra to have operand values that represent expressions and/or relation values that contain names for themselves.
The projection that takes attribute expressions begs the question of its implementation given an algebra with projection only on a relation value and attribute names. This is important in an academic setting because a question may be wanting you to actually figure out how to do that, or because the difficulty of a question is dependent on the operators available. So find out what algebra you are supposed to use.
We can introduce an operator on attribute values when we only have basic relation operators taking attribute names and relation values. Each such operator can be associated with a relation value that has an attribute for each operand and an attribute for the result. The relation holds the tuples where the result value is equal to the result of the operator called on the operand values. (The result is functionally dependent on the operands.)
From Relational Algebra rule for column transformation:
Suppose you supply the division operator on values of a column in the form of a constant base relation called DIVIDE holding tuples where dividend/divisor=quotient. I'll use the simplest algebra, with headings that are sets of attribute names. Assume we have input relation R with column c & average A. We want the relation like R but with each column c value set to its original value divided by A.
/* rows where
    EXISTS dividend [R(dividend) & DIVIDE(dividend, A, c)]
*/
PROJECT c (
    RENAME c\dividend (R)
    NATURAL JOIN
    RENAME quotient\c (
        PROJECT dividend, quotient (SELECT divisor=A (DIVIDE))))
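The same construction can be tried outside any particular algebra. What follows is a minimal sketch in Python, assuming relations are modelled as lists of dicts and that DIVIDE is materialised only for the dividends that actually occur in R (a true DIVIDE relation would be infinite). The helper names natural_join, project and rename are made up for this illustration.

```python
from fractions import Fraction

def natural_join(r, s):
    """Merge every pair of tuples that agree on all common attributes."""
    out = []
    for t in r:
        for u in s:
            if all(t[k] == u[k] for k in t.keys() & u.keys()):
                merged = {**t, **u}
                if merged not in out:
                    out.append(merged)
    return out

def project(r, attrs):
    """Keep only the listed attributes, dropping duplicate tuples."""
    out = []
    for t in r:
        row = {a: t[a] for a in attrs}
        if row not in out:
            out.append(row)
    return out

def rename(r, old, new):
    """Rename attribute old to new in every tuple."""
    return [{(new if k == old else k): v for k, v in t.items()} for t in r]

R = [{"c": 10}, {"c": 20}, {"c": 30}]
A = Fraction(sum(t["c"] for t in R), len(R))  # average of column c

# Finite stand-in for the constant relation DIVIDE (assumption of this sketch).
DIVIDE = [{"dividend": t["c"], "divisor": A, "quotient": t["c"] / A} for t in R]

selected = [t for t in DIVIDE if t["divisor"] == A]  # SELECT divisor=A
right = rename(project(selected, ["dividend", "quotient"]), "quotient", "c")
left = rename(R, "c", "dividend")
result = project(natural_join(left, right), ["c"])
```

Swapping DIVIDE for a relation holding tuples where minuend - subtrahend = difference would give the "wage minus average wage" column in the same way.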
From Relational algebra - recode column values:
To introduce specific values into a relational algebra expression you have to have a way to write table literals. Usually the necessary operators are not made explicit, but on the other hand algebra exercises frequently use some kind of notation for example values.

Related

How can I vectorize a list of words?

I am working on SMS data where one column of my dataframe holds a list of words.
I want to train a classifier to predict its type and subtype.
How would I convert the words into a numerical format, given that they are in a list?
The idea is to use as the vocabulary all the words found in this column across instances, except that the least frequent words should be removed (to avoid overfitting). Then for every instance the column is represented as a vector of boolean features, where the nth value represents the nth word in the vocabulary: 1 if it is in the list for this instance, 0 if not.
In Python you can use scikit-learn's CountVectorizer, considering every list in the column as a sentence.
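To make the encoding concrete, here is a small hand-rolled sketch of the idea using only the standard library. It is not CountVectorizer itself; the function names, the sample rows and the min_count cut-off are assumptions chosen for illustration.

```python
from collections import Counter

def build_vocabulary(rows, min_count=2):
    """Keep words that occur in at least min_count rows, in sorted order."""
    counts = Counter(word for words in rows for word in set(words))
    return sorted(w for w, c in counts.items() if c >= min_count)

def vectorize(words, vocabulary):
    """Return 1/0 per vocabulary word depending on presence in this row."""
    present = set(words)
    return [1 if w in present else 0 for w in vocabulary]

# Made-up sample data standing in for the list-of-words column.
rows = [
    ["free", "prize", "call"],
    ["call", "me", "later"],
    ["free", "call", "now"],
]
vocabulary = build_vocabulary(rows, min_count=2)
vectors = [vectorize(words, vocabulary) for words in rows]
```

Words seen in only one row ("prize", "me", "later", "now") are dropped, so each row becomes a two-element vector over the vocabulary ["call", "free"].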

Difference between natural join and simple join on common attribute in algebra

I am confused.
Suppose there are two relations with a common attribute A.
Now, is
(R natural join S) = (R join S with join condition A=A)?
Natural join returns a single common column A.
Does a simple join return two columns with the same name (A, A), or one common column A, given that relational algebra is defined in terms of set theory?
There's an example of a Natural Join here. As @Renzo says, there are many variants. And SQL is different again. So I'll keep to what Wikipedia shows.
Most important: the join condition applies to all attributes in common between the two arguments. So you need to say "two relations with A being their only common attribute". The only common attribute is DeptName in that Wikipedia example. There can be many common attributes, in general.
Yes, joining means forming tuples in the result by pairing tuples from the arguments that have the same values in the corresponding common attributes. So you have the same value with the same attribute name. It would be pointless repeating both attributes in the result, because you'd be repeating the values. The example shows there's a single attribute DeptName in the result.
Beware that different dialects of Relational Algebra use different symbols and notations. So the bare bowtie (⋈) for Natural Join can be suffixed with a boolean condition, making a theta-join (θ-join) or equi-join -- see that example. The boolean condition is between differently-named attributes, and might use any comparison operator. So both attribute names and their values appear in the result.
Set theory operations apply because each tuple is a set of name-value pairs. The result tuples are the union of a tuple from each argument, provided that union is a valid tuple. That is, provided same-named name-value pairs have the same value.
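The "union of tuples" view can be sketched directly. The following is a minimal illustration in Python, assuming each tuple is modelled as a dict of name-value pairs; the function name and the sample rows are made up for this example.

```python
def natural_join(r, s):
    """Union each pair of tuples whose same-named attributes agree."""
    result = []
    for t in r:
        for u in s:
            # The union is a valid tuple only if common attributes match.
            if all(t[name] == u[name] for name in t.keys() & u.keys()):
                joined = {**t, **u}
                if joined not in result:
                    result.append(joined)
    return result

# Illustrative data with DeptName as the only common attribute.
employee = [
    {"Name": "Harry", "DeptName": "Finance"},
    {"Name": "Sally", "DeptName": "Sales"},
]
dept = [
    {"DeptName": "Finance", "Manager": "George"},
    {"DeptName": "Sales", "Manager": "Harriet"},
]
joined = natural_join(employee, dept)
```

Because a dict cannot hold the same key twice, the common attribute appears once in each result tuple, which mirrors the single DeptName column described above.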

Notation from some text about Universal Hashing

Here is a quote from some lecture on the topic. I do not understand this part h : {1,...,M} -> {0,...,m-1} (the notation). Could someone please explain what it means? E.g. "a hash function h selected from M hash function, which returns values between 1 and m-1"??
Thanks.
Hashing
We assume that all the basics about hash tables have been covered in 61B.
We will make the simplifying assumption that the keys that we want to hash have been
encoded as integers, and that such integers are in the range {1,...,M}. We also assume that
collisions are handled using linked lists.
Suppose that we are using a table of size m, that we have selected a hash function
h : {1,...,M} -> {0,...,m-1} and that, at some point, the keys Y1,...,Yn have been
inserted in the data structure, and that we want to find, or insert, or delete, the key x.
The running time of such operation will be a big-Oh of the number of elements Yi such that
h(Yi) = h(x).
...........
...........
Source: www.cs.berkeley.edu/~luca/cs170/notes/lecture9.pdf
It says: h is a function from the input set {1,...,M} to the target set {0,...,m-1}. So M bounds the keys (it is not a number of hash functions), and the returned values lie between 0 and m-1.
Note that it doesn't say how the function is formed.
It simply says that the function exists and that it deals with one range of inputs and another range of outputs.
EDIT: it's a function, not a relation.
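As a concrete instance of the notation, here is one possible h written in Python. The lecture quote does not say which function is used; taking the key modulo m is just an assumed example, and the values of M and m are arbitrary.

```python
M = 1000  # keys are integers in {1, ..., M}
m = 10    # the table has m buckets, indexed 0 .. m-1

def h(key):
    """Map a key from {1, ..., M} to a bucket index in {0, ..., m-1}."""
    if not 1 <= key <= M:
        raise ValueError("key outside the domain {1, ..., M}")
    return key % m

# Every output over the whole domain is a valid bucket index.
buckets = {h(k) for k in range(1, M + 1)}
```

Here h(10) and h(20) both land in bucket 0, which is exactly the kind of collision the lecture handles with linked lists.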

How to determine correspondence between two lists of names?

I have:
1 million university student names and
3 million bank customer names
I managed to convert the strings into numerical values by hashing (similar strings have similar hash values). I would like to know how I can determine the correlation between these two sets, to see whether at least 60% of the values pair up.
Can I achieve this using ICC? How does ICC 2-way random work?
Please kindly answer ASAP as I need this urgently.
This kind of entity resolution is normally straightforward, but I am surprised by the hashing approach here. Hashing loses information that is critical to entity resolution. So, if possible, you should use the original strings rather than the hashes.
Assuming that using the original strings is an option, you would want to do something like this:
List A (1M), List B (3M)
// First, match the entities that match very well, and REMOVE them.
for a in List A
    for b in List B
        if compare(a,b) >= MATCH_THRESHOLD // This may be 90% etc.
            add (a,b) to matchedList
            remove a from List A
            remove b from List B
            break // a is matched; move on to the next a
// Now, match the entities that match well, and run bipartite matching
// Bipartite matching is required because each entity can match "acceptably well"
// with more than one entity on the other side
for a in List A
    for b in List B
        compute compare(a,b)
        set edge(a,b) = compare(a,b)
        if compare(a,b) < THRESHOLD // This seems to be 60%
            set edge(a,b) = 0
// Now, run bipartite matcher and take results
The time complexity of this algorithm is O(n1 * n2), which is not very good. There are ways to avoid this cost, but they depend upon your specific entity resolution function. For example, if the last name has to match (to make the 60% cut), then you can simply create sublists in A and B that are partitioned by the first couple of characters of the last name, and just run this algorithm between corresponding sublists. But it may very well be that last name "Nuth" is supposed to match "Knuth", etc. So, some local knowledge of what your name comparison function is can help you divide and conquer this problem better.
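The partitioning idea can be sketched as follows. This is only an illustration under stated assumptions: difflib.SequenceMatcher stands in for the unspecified compare function, the blocking key (first two letters of the last word) is a hypothetical choice, and the sample names are made up. It produces the candidate edges only; the bipartite matching step is left out.

```python
from collections import defaultdict
from difflib import SequenceMatcher

def similarity(a, b):
    """Stand-in for compare(a, b): a ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def block_key(name):
    """Hypothetical blocking rule: first two letters of the last word."""
    return name.split()[-1][:2].lower()

def candidate_pairs(list_a, list_b, threshold=0.6):
    """Compare only names that share a block, keeping pairs above threshold."""
    blocks = defaultdict(list)
    for b in list_b:
        blocks[block_key(b)].append(b)
    pairs = []
    for a in list_a:
        for b in blocks.get(block_key(a), []):
            score = similarity(a, b)
            if score >= threshold:
                pairs.append((a, b, score))
    return pairs

students = ["Donald Knuth", "Ada Lovelace"]
customers = ["Donald E. Knuth", "Alan Turing", "Ada Lovelace"]
pairs = candidate_pairs(students, customers)
```

With this blocking rule "Nuth" and "Knuth" would fall into different blocks and never be compared, which is the trade-off the paragraph above warns about.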

Puzzle: Need an example of a "complicated" equivalence relation / partitioning that disallows sorting and/or hashing

From the question "Is partitioning easier than sorting?":
Suppose I have a list of items and an equivalence relation on them, and comparing two items takes constant time. I want to return a partition of the items, e.g. a list of linked lists, each containing all equivalent items.
One way of doing this is to extend the equivalence to an ordering on the items and order them (with a sorting algorithm); then all equivalent items will be adjacent.
(Keep in mind the distinction between equality and equivalence.)
Clearly the equivalence relation must be considered when designing the ordering algorithm. For example, if the equivalence relation is "people born in the same year are equivalent", then sorting based on the person's name is not appropriate.
Can you suggest a datatype and equivalence relation such that it is not possible to create an ordering?
How about a datatype and equivalence relation where it is possible to create such an ordering, but it is not possible to define a hash function on the datatype that will map equivalent items to the same hash value?
(Note: it is OK if nonequivalent items map to the same hash value (collide) -- I'm not asking to solve the collision problem -- but on the other hand, hashFunc(item) { return 1; } is cheating.)
My suspicion is that for any datatype/equivalence pair where it is possible to define an ordering, it will also be possible to define a suitable hash function, and they will have similar algorithmic complexity. A counterexample to that conjecture would be enlightening!
The answer to questions 1 and 2 is no, in the following sense: given a computable equivalence relation ≡ on strings {0, 1}*, there exists a computable function f such that x ≡ y if and only if f(x) = f(y), which leads to an order/hash function. One definition of f(x) is simple, and very slow to compute: enumerate {0, 1}* in lexicographic order (ε, 0, 1, 00, 01, 10, 11, 000, …) and return the first string equivalent to x. We are guaranteed to terminate when we reach x, so this algorithm always halts.
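The slow definition of f can be written down almost literally. Below is a minimal Python sketch, assuming the equivalence is supplied as a two-argument predicate; the predicate same_weight (equal number of 1 bits) is only an illustrative choice.

```python
from itertools import count, product

def bit_strings():
    """Yield '', '0', '1', '00', '01', ...: shortest first, then lexicographic."""
    for length in count(0):
        for bits in product("01", repeat=length):
            yield "".join(bits)

def canonical(x, equivalent):
    """Return the first enumerated string equivalent to x.

    The loop always stops, at the latest when the enumeration reaches x.
    """
    for candidate in bit_strings():
        if equivalent(candidate, x):
            return candidate

# Illustrative equivalence: two strings are equivalent if they
# contain the same number of '1' bits.
def same_weight(a, b):
    return a.count("1") == b.count("1")
```

Here canonical("0110", same_weight) and canonical("1001", same_weight) both return "11", so comparing or hashing the returned strings respects the equivalence.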
Creating a hash function and an ordering may be expensive but will usually be possible. One trick is to represent an equivalence class by a pre-arranged member of that class, for instance, the member whose serialised representation is smallest, when considered as a bit string. When somebody hands you a member of an equivalence class, map it to this canonicalised member of that class, and then hash or compare the bit string representation of that member. See e.g. http://en.wikipedia.org/wiki/Canonical#Mathematics
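As an illustration of the canonical-member trick, here is a sketch in Python for one specific equivalence: pairs (n, d) with nonzero d, where (a, b) is equivalent to (c, d) when a*d == b*c. The chosen representative is the pair in lowest terms with a positive denominator, which is an assumption of this example rather than the smallest-bit-string rule mentioned above.

```python
from collections import defaultdict
from math import gcd

def canonical(pair):
    """Reduce (n, d) to lowest terms with a positive denominator (d != 0)."""
    n, d = pair
    g = gcd(n, d)
    if d < 0:
        g = -g
    return (n // g, d // g)

def partition(items):
    """Group items by their canonical member, using it as a dict (hash) key."""
    classes = defaultdict(list)
    for item in items:
        classes[canonical(item)].append(item)
    return list(classes.values())

items = [(1, 2), (2, 4), (3, 9), (-1, -2), (1, 3)]
parts = partition(items)
```

Since the canonical member is an ordinary tuple, it can be hashed or sorted directly, and the partition is built in a single pass.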
Examples where this is not possible or convenient include when somebody gives you a pointer to an object that implements equals() but nothing else useful, and you do not get to break the type system to look inside the object, and when you get the results of a survey that only asks people to judge equality between objects. Also, Kruskal's algorithm uses union-find internally to process equivalence relations, so presumably for this particular application nothing more cost-effective has been found.
One example that seems to fit your request is an IEEE floating point type. In particular, a NaN doesn't compare as equivalent to anything else (nor even to itself) unless you take special steps to detect that it's a NaN, and always call that equivalent.
Likewise for hashing. In IEEE 754, +0.0 and -0.0 have different bit patterns (they differ only in the sign bit) yet compare equal, so the same value can be stored as more than one bit pattern. Unless your hash function takes this into account, it will produce different hash values for numbers that really compare precisely equal.
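Both effects are easy to observe. The following Python sketch assumes IEEE 754 doubles (which is what Python floats are on common platforms) and uses the signed zeros +0.0 and -0.0 as one case of distinct bit patterns that compare equal. The functions equivalent and float_hash are hypothetical helpers showing the "special steps" a bit-based scheme would need.

```python
import math
import struct

nan = float("nan")
nan_equals_itself = (nan == nan)  # False: NaN is not equal to itself

def bits(x):
    """Raw 8-byte big-endian encoding of a double."""
    return struct.pack(">d", x)

zeros_compare_equal = (0.0 == -0.0)           # True
zeros_share_bits = (bits(0.0) == bits(-0.0))  # False: the sign bit differs

def equivalent(a, b):
    """Equality that additionally treats all NaNs as one class."""
    if math.isnan(a) and math.isnan(b):
        return True
    return a == b

def float_hash(x):
    """Bit-pattern hash that special-cases NaN and the two zeros."""
    if math.isnan(x):
        return 0
    if x == 0.0:
        return hash(bits(0.0))
    return hash(bits(x))
```

Without the two special cases, hashing the raw bytes would put 0.0 and -0.0 in different buckets even though they compare equal.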
As you probably know, comparison-based sorting takes at least O(n log n) time (more formally you would say it is Omega(n log n)). If you know that there are fewer than log2(n) equivalence classes, then partitioning is faster, since you only need to check equivalence with a single member of each equivalence class to determine which part in the partition you should assign a given element to.
I.e. your algorithm could be like this:
For each x in our input set X:
    For each equivalence class Y seen so far:
        Choose any member y of Y.
        If x is equivalent to y:
            Add x to Y.
            Resume the outer loop with the next x in X.
    If we get to here then x is not in any of the equiv. classes seen so far.
    Create a new equivalence class with x as its sole member.
If there are m equivalence classes, the inner loop runs at most m times, taking O(nm) time overall. As ShreetvatsaR observes in a comment, there can be at most n equivalence classes, so this is O(n^2). Note this works even if there is not a total ordering on X.
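The loop described above translates to Python fairly directly. This is a minimal sketch, assuming the equivalence is passed in as a two-argument predicate; the same_parity relation is only an example.

```python
def partition(items, equivalent):
    """Partition items using only pairwise equivalence tests: O(n*m) tests."""
    classes = []
    for x in items:
        for cls in classes:
            # Comparing against one representative per class is enough.
            if equivalent(x, cls[0]):
                cls.append(x)
                break
        else:
            # x matched no existing class, so it starts a new one.
            classes.append([x])
    return classes

def same_parity(a, b):
    return a % 2 == b % 2

parts = partition([3, 8, 5, 2, 7], same_parity)
```

No ordering or hash function is needed here, only the equivalence test itself.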
Theoretically, it is always possible (for questions 1 and 2), because of the Well Ordering Theorem, even when you have an uncountable number of partitions.
Even if you restrict to computable functions, throwawayaccount's answer answers that.
You need to define your question more precisely :-)
In any case, practically speaking, consider the following:
Your data type is the set of unsigned integer arrays. The ordering is lexicographic comparison.
You could consider hash(x) = x, but I suppose that is cheating too :-)
I would say (but haven't thought more about getting a hash function, so might well be wrong) that partitioning by ordering is much more practical than partitioning by hashing, as hashing itself could become impractical. (A hashing function exists, no doubt).
I believe that...
1- Can you suggest a datatype and equivalence relation such that it is not possible to create an ordering?
...it's possible only for infinite (possibly only for non-countable) sets.
2- How about a datatype and equivalence relation where it is possible to create such an ordering, but it is not possible to define a hash function on the datatype that will map equivalent items to the same hash value.
...same as above.
EDIT: This answer is wrong
I am not going to delete it just because some of the comments below are enlightening
Not every equivalence relationship implies an order
As your equivalence relationship should not induce an order, let's take an un-ordered distance function as relation.
If we take the set of functions f: R -> R as our datatype, and define an equivalence relation as:
f is equivalent to g if f(g(x)) = g(f(x)) (commuting operators)
Then you can't sort on that order (no injective function exists with the Real numbers). You just can't find a function which maps your datatype to numbers due to the cardinality of the function's space.
Suppose that F(X) is a function which maps an element of some data type T to another of the same type, such that for any Y of type T, there is exactly one X of type T such that F(X)=Y. Suppose further that the function is chosen so that there is generally no practical way of finding the X in the above equation for a given Y.
Define F{0}(X)=X, F{1}(X)=F(X), F{2}(X)=F(F(X)), etc. so F{n}(X) = F(F{n-1}(X)).
Now define a data type Q containing a positive integer K and an object X of type T. Define an equivalence relation thus:
Q(a,X) vs Q(b,Y):
If a > b, the items are equal iff F{a-b}(Y)==X
If a < b, the items are equal iff F{b-a}(X)==Y
If a=b, the items are equal iff X==Y
For any given object Q(a,X) there exists exactly one Z such that F{a}(Z)==X. Two objects are equivalent iff they would have the same Z. One could define an ordering or hash function based upon Z. On the other hand, if F is chosen such that its inverse cannot be practically computed, the only practical way to compare elements may be to use the equivalence function above. I know of no way to define an ordering or hash function without either knowing the largest possible "a" value an item could have, or having a means to invert function F.
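The comparison rule can be sketched in Python as follows. The function F used here is only a stand-in: an affine map modulo a small prime, which is a bijection but is trivially invertible, whereas the construction above assumes an F whose inverse is impractical to compute. The names P, iterate and equivalent are made up for this illustration.

```python
P = 101  # small prime; the type T is modelled as {0, ..., P-1}

def F(x):
    """Stand-in bijection on T (easy to invert, unlike the intended F)."""
    return (7 * x + 3) % P

def iterate(x, n):
    """Compute F{n}(x), i.e. F applied n times."""
    for _ in range(n):
        x = F(x)
    return x

def equivalent(q1, q2):
    """Compare Q(a, X) with Q(b, Y) using only forward applications of F."""
    (a, x), (b, y) = q1, q2
    if a > b:
        return iterate(y, a - b) == x
    if a < b:
        return iterate(x, b - a) == y
    return x == y

z = 42
q1 = (1, iterate(z, 1))      # built from z with one application
q2 = (3, iterate(z, 3))      # built from the same z with three applications
q3 = (2, iterate(z + 1, 2))  # built from a different starting value
```

q1 and q2 share the same underlying Z and test as equivalent, while q3 does not, and at no point does the comparison need to invert F.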
