Can you cube between multiple relations in PIG? - hadoop

I want to find the combination given another variable:
Example:
name, group, points
jim, T, 12
steven, T, 10
ting, T, 15
matt, F, 16
aamir, F, 12
I want to be able to get all combinations between members of T and F and do some multiplication to the points column for that. I first thought to break this into two relations, i.e. a T and an F relation and do some combination between them using CUBE but i don't think you can use CUBE between relations? Any suggestions?
Results:
jim, matt, 12*16
jim, aamir, 12*12
steven, matt, 16*16
...
...
ting, aamir, 15*12

Can you try this?
input.txt
jim,T,12
steven,T,10
ting,T,15
matt,F,16
aamir,F,12
PigScript:
A = LOAD 'input.txt' USING PigStorage(',') AS (name:chararray, group:chararray, points:int);
B = FILTER A BY group=='T';
C = FILTER A BY group=='F';
D = CROSS B,C;
E = FOREACH D GENERATE B::name,C::name,B::points*C::points;
DUMP E;
Output:
(jim,matt,192)
(jim,aamir,144)
(steven,matt,160)
(steven,aamir,120)
(ting,matt,240)
(ting,aamir,180)

Related

Relational Algebra: Natural Join having the same result as Cartesian product

I am trying to understand what will be the result of performing a natural join
between two relations R and S, where they have no common attributes.
By following the below definition, I thought the answer might be an empty set:
Natural Join definition.
My line of thought was because the condition in the 'Select' symbol is not met, the projection of all of the attributes won't take place.
When I asked my lecturer about this, he said that the output will be the same as doing a cartezian product between R and S.
I can't seem to understand why, would appreciate any help )
Natural join combines a cross product and a selection into one
operation. It performs a selection forcing equality on those
attributes that appear in both relation schemes. Duplicates are
removed as in all relation operations.
There are two special cases:
• If the two relations have no attributes in common, then their
natural join is simply their cross product.
• If the two relations have more than one attribute in common,
then the natural join selects only the rows where all pairs of
matching attributes match.
Notation: r s
Let r and s be relation instances on schema R and S
respectively.
The result is a relation on schema R ∪ S which is
obtained by considering each pair of tuples tr from r and ts from s.
If tr and ts have the same value on each of the attributes in R ∩ S, a
tuple t is added to the result, where
– t has the same value as tr on r
– t has the same value as ts on s
Example:
R = (A, B, C, D)
S = (E, B, D)
Result schema = (A, B, C, D, E)
r s is defined as:
πr.A, r.B, r.C, r.D, s.E (σr.B = s.B r.D = s.D (r x s))
The definition of the natural join you linked is:
It can be broken as:
1.First take the cartezian product.
2.Then select only those row so that attributes of the same name have the same value
3.Now apply projection so that all attributes have distinct names.
If the two tables have no attributes with same name, we will jump to step 3 and therefore the result will indeed be cartezian product.

How to understand `u=r÷s`, the division operator, in relational algebra?

let be a database having the following relational-schemes: R(A,B,D) and S(A,B) with the attributes of same name in the same domain and with the instances r and s respectively.
An instance of r
An instance of s
What is the scheme and what are the tuples of u=r÷s? How to define them in English with r and s?
My attempt
I know that
u=r÷s=
Which leads me to think that it would only be an array of one column A, but I'm not sure enough to know what will be ther result within the array.
Can you help me understand u=r÷s?
An intuitive property of the division operator of the relational algebra is simply that it is the inverse of the cartesian product. For example, if you have two relations R and S, then, if U is a relation defined as the cartesian product of them:
U = R x S
the division is the operator such that:
U ÷ R = S
and:
U ÷ S = R
So, you can think of the result of U ÷ R as: “the projection of U that, multiplied by R, produces U”, and of the operation ÷, as the operation that finds all the “parts” of U that are combined with all the tuples of R.
However, in order to be useful, we want that this operation can be applied to any couple of relations, that is, we want to divide a relation which is not the result of a cartesian product. For this, the formal definition is more complex.
So, supposing that we have two relations R and S with attributes respectively A and B, their division can be defined as:
R ÷ S = πA-B(R) - πA-B((πA-B(R) x S) - R)
that can be read in this way:
πA-B(R) x S: project R over the attributes of R which are not in S, and multiply (cartesian product) this relation with S. This produces a relation with the attributes A of R and with rows all the possible combinations of rows of S and the projection of R;
From the previous result subtract all the tuples originally in R, that is, perform (πA-B(R) x S) - R. In this way we obtain the “extra” tuples, that is the tuples in the cartesian product that were not present in the original relation.
Finally, subtract from the original relation those extra tuples (but, again, perform this operation only on the attributes of R which are not present in S). So, the final operation is: πA-B(R) - πA-B(the result of step 2).
So, coming to your example, the projection of r on D is equal to:
(D)
d1
d2
d3
d4
and the cartesian product with s is:
(A, B, D)
a1 b1 d1
a1 b1 d2
a1 b1 d3
a1 b1 d4
Now we can remove from this set the tuples that were also in the original relation r, i.e. the first two tuples and the last one, so that we obtain the following result:
(A, B, D)
a1 b1 d3
And finally, we can remove the previous tuples (projected on D), from the original relation (again projected on D), that is, we remove:
(D)
d3
from:
(D)
d1
d2
d3
d4
and we obtain the following result, which is the final result of the division:
(D)
d1
d2
d4
Finally, we could double check the result by multiplying it with the original relation s (which is composed only by the tuple (a1, b1)):
(A B D)
a1 b1 d1
a1 b1 d2
a1 b1 d4
And looking at the rows of the original relation r, you can see this fact, that should give you an important insight on the meaning of the division operator:
the only values of the column D in r that are present together with (a1, b1) (the only tuple of s), are d1, d2 and d4.
You can also see another example in Wikipedia, and for a detailed explanation of the division, together with its transformation is SQL, you could look at these slides.

SWI Prolog usage of agregation

I created a simple database on SWI Prolog. My task is to count how long each of departments will work depending on production plan. I am almost finished, but I don't know how to sum my results. As for now I am getting something like this
department amount
b 20
a 5
c 50
c 30
how I can transform it to this?
b 20
a 5
c 80
My code https://gist.github.com/senioroman4uk/d19fe00848889a84434b
The code provided won't interpret the count predicate on account of a bad format. You should rewrite it as count:- instead of count():-. As far as I know, all zero-ary predicates need to be defined like this.
Second, your count predicate does not collect the results in a list upon which you could operate. Here's how you can change it to collect all department-amount pairs in a list with findall:
count_sum(DepAmounts):-
findall((Department,Sum),
( productionPlan(FinishedProduct, Amount),
resultOf(FinishedProduct, Operation),
executedIn(Operation, Department, Time),
Sum is Amount * Time
),
DepAmounts
).
Then, over that list, you can use something like SWI-Prolog's aggregate:
?- count_sum(L), aggregate(sum(A),L,member((D,A),L),X).
Which will yield, by backtracing, departments in D and the sum of the amounts in X:
D = a,
X = 15 ;
D = b,
X = 20 ;
D = c,
X = 80.
BTW, if I were you I'd replace all double-quoted strings for department names and operations and etc. for atoms, for simplicity.
you should consider library(aggregate): for instance, calling data/2 the interesting DB subset, you get
?- aggregate(set(K-A),aggregate(sum(V),data(K,V),A),L).
L = [a-5, b-20, c-80]

loading data based on conditions in APACHE PIG

Problem statement-
I want to check if value of column in relation xyz is even then load first 10 fields(1-10) of a file abc and if not then load another 10(11-20).
Relation XYZ
123
Relation ABC
a b c d e f g h i j k l m n o p q r s t
if 123 is even then
relation PQR should have a-j
other wise k-t
Could somebody help.
You should write a storage function to do that.
See the implementation of CSVExcelStorage http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java for example.

Generate unique cross in Pig

I have a problem. I don't understand how can I generate unique "cross" for the input.
Here is my input:
A, B, C
I would like to get:
A,B
A,C
B,C
What UDF (data-fu, piggybank) can I use to solve this problem?
If your input is like
A
B
C
and your want to output:
A,B
A,C
B,C
You can use cross join to get the results. For example:
input1 = load 'your_path' as (key: chararray);
input2 = load 'your_path' as (key: chararray);
cross_results = cross input1, input2;
final_results = filter cross_results by input1::key < input2::key;
If "A,B,C" are only a bag in one record, you can use flatten. For example,
-- Assume your input x is something like {A, B, C} in one row
y = foreach x generate flatten($0) as f1, flatten($0) as f2;
final_results = filter y by f1 < f2;
As your description is not very exhaustive, I can only provide the above solution. You may need to adapt it.

Resources