Pig - How to use a nested for loop in pig to get the list of elements inside a tuple? - hadoop

I have an intermediate pig structure like
(A, B, (n number of Cs))
example:
(a1,b1, (c11,c12))
(a2,b2, (c21))
(a3,b3, (c31,c32, c33))
Now, I want the data in format
(a1, b1, c11)
(a1, b1, c12)
(a2, b2, c21) etc.
How do I go about doing it?
Essentially I want the size of the tuples, and then use this size for running a nested for loop.

Can you try the below approach?
input
a1 b1 (c11,c12)
a2 b2 (c21)
a3 b3 (c31,c32,c33)
PigScript:
A = LOAD 'input' AS(f1,f2,T:(f3:chararray));
B = FOREACH A GENERATE f1,f2,FLATTEN(T);
C = FOREACH B GENERATE f1,f2,FLATTEN(TOKENIZE(T::f3));
DUMP C;
Output:
(a1,b1,c11)
(a1,b1,c12)
(a2,b2,c21)
(a3,b3,c31)
(a3,b3,c32)
(a3,b3,c33)

Related

Numerical integration with parameters in Mathematica

I want to do numerical integration in Mathematica over a large dataset like: {{x1,y1(c)}, {x2,y2(c)}, {x3,y3(c)}, {x4,y4(c)}..} where y1=(a1+c)/b1, y2= (a2+c)/b2, y3=(a3+c)/b3, y4=(a4+c)/b4.... and a1, a2, a3, a4, b1, b2, b3, b4..... are just numbers and c is a constant. After the integration, I want to plot the resulting function as a function of c. How can I do that?

Data structure traversal

Let's say I have package A version 1 and package A version 2; I'll call them A1 and A2 respectively.
If I have a pool of packages: A1, A2, B1, B2, C1, C2, D1, D2
A1 depends on B1, which I will represent as (A1, (B1)).
A1 also depends on any version of package C (either C1 or C2 satisfies A1), which I will represent as (A1, (C1, C2)).
Combining A1's dependencies, the A1 data structure becomes: (A1, (B1), (C1, C2))
Also B1 depends on D1: (B1, (D1))
A1 structure becomes: (A1, ((B1, (D1))), (C1, C2))
similarly A2 structure is (A2, ((B2, (D2))), (C1, C2))
My question is: how can I select the best candidate of package A based on a condition (for example, that the package does not conflict with currently installed packages)?
Combining A1 and A2 gives: ((A1, ((B1, (D1))), (C1, C2)), (A2, ((B2, (D2))), (C1, C2)))
How can I traverse this data structure?
Start with A1; if it doesn't conflict, check B1; if that doesn't conflict, check D1; if that doesn't conflict, check (C1, C2) and take only one of C1 or C2.
With this I end up selecting (A1, B1, D1, C1).
If A1 or any of its dependencies does not meet the condition (for example, if B1 conflicts with installed packages), drop A1 entirely and move on to check A2, ending up with (A2, B2, D2, C1).
What kind of traversal would that be?
I have been reading about in-order, pre-order, post-order traversal, and wondering if I need to do something similar here.
Assuming you are asking about traversal for a more general problem rather than this specific instance, I don't think such a traversal exists.
Note that in-order traversal applies only to binary trees; other kinds of trees have no in-order traversal. If your general problem has B1, B2, B3, the structure would not be a binary tree.
One property of a traversal is that the tree itself contains all the information needed. When you traverse a tree you never need "external information". In your case the tree is incomplete: you depend on external information to tell whether there is a conflict. For example, the fact that B1 is installed is never in the tree.
You can use adjacency list to represent the data:
Suppose the packages are A1, A2, B1, B2, C1, C2.
And A1 depends on B1 and C2, A2 depends on B1 and C1 and C2.
The above data can be represented as
[A1] -> [B1, C2]
[A2] -> [B1, C1, C2]
Use Topological Sorting to get the order of dependencies
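As a rough illustration of the adjacency-list idea above, here is a minimal Python sketch that uses the standard library's graphlib.TopologicalSorter (Python 3.9+). The names dependencies and install_order are made up for this example, and the sketch treats every listed dependency as required, so it does not model the "either C1 or C2" choice or the conflict check from the question.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Adjacency list from the answer: package -> packages it depends on.
dependencies = {
    "A1": ["B1", "C2"],
    "A2": ["B1", "C1", "C2"],
}


def install_order(deps):
    """Return all packages ordered so that each dependency precedes its dependents.

    TopologicalSorter expects a mapping of node -> predecessors, which is
    exactly the "package -> dependencies" shape used here.
    """
    return list(TopologicalSorter(deps).static_order())


order = install_order(dependencies)
print(order)  # one valid order; several orders can satisfy the constraints
```

If the dependency graph contained a cycle, static_order() would raise graphlib.CycleError instead of returning an order.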

How to understand `u=r÷s`, the division operator, in relational algebra?

Let there be a database with the following relation schemes: R(A,B,D) and S(A,B), where attributes with the same name share the same domain, and with instances r and s respectively.
An instance of r
An instance of s
What is the scheme and what are the tuples of u=r÷s? How to define them in English with r and s?
My attempt
I know that
u=r÷s=
Which leads me to think that it would only be an array of one column A, but I'm not sure what the result within the array will be.
Can you help me understand u=r÷s?
An intuitive property of the division operator of the relational algebra is simply that it is the inverse of the cartesian product. For example, if you have two relations R and S, then, if U is a relation defined as the cartesian product of them:
U = R x S
the division is the operator such that:
U ÷ R = S
and:
U ÷ S = R
So, you can think of the result of U ÷ R as: “the projection of U that, multiplied by R, produces U”, and of the operation ÷, as the operation that finds all the “parts” of U that are combined with all the tuples of R.
However, in order to be useful, we want that this operation can be applied to any couple of relations, that is, we want to divide a relation which is not the result of a cartesian product. For this, the formal definition is more complex.
So, supposing that we have two relations R and S with attributes respectively A and B, their division can be defined as:
R ÷ S = πA-B(R) - πA-B((πA-B(R) x S) - R)
that can be read in this way:
1. πA-B(R) x S: project R over the attributes of R which are not in S, and multiply (cartesian product) this relation with S. This produces a relation with the attributes A of R and with rows all the possible combinations of rows of S and the projection of R.
2. From the previous result subtract all the tuples originally in R, that is, perform (πA-B(R) x S) - R. In this way we obtain the “extra” tuples, that is, the tuples in the cartesian product that were not present in the original relation.
3. Finally, subtract from the original relation those extra tuples (but, again, perform this operation only on the attributes of R which are not present in S). So, the final operation is: πA-B(R) - πA-B(the result of step 2).
So, coming to your example, the projection of r on D is equal to:
(D)
d1
d2
d3
d4
and the cartesian product with s is:
(A, B, D)
a1 b1 d1
a1 b1 d2
a1 b1 d3
a1 b1 d4
Now we can remove from this set the tuples that were also in the original relation r, i.e. the first two tuples and the last one, so that we obtain the following result:
(A, B, D)
a1 b1 d3
And finally, we can remove the previous tuples (projected on D), from the original relation (again projected on D), that is, we remove:
(D)
d3
from:
(D)
d1
d2
d3
d4
and we obtain the following result, which is the final result of the division:
(D)
d1
d2
d4
Finally, we could double-check the result by multiplying it with the original relation s (which consists only of the tuple (a1, b1)):
(A B D)
a1 b1 d1
a1 b1 d2
a1 b1 d4
And looking at the rows of the original relation r, you can see this fact, that should give you an important insight on the meaning of the division operator:
the only values of the column D in r that are present together with (a1, b1) (the only tuple of s), are d1, d2 and d4.
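The three-step definition of the division can be sketched in Python with sets of tuples. The instance of r did not survive in the question, so the r below is an assumed instance chosen only to be consistent with the walkthrough (in particular the row (a2, b2, d3) is invented); s is the single tuple (a1, b1) mentioned in the answer. The function name divide and its parameters are hypothetical.

```python
def divide(r, s, r_attrs, s_attrs):
    """Relational division r ÷ s, computed as
    πA-B(R) - πA-B((πA-B(R) x S) - R)."""
    # Attributes of r that are not in s (the "A-B" in the formula).
    keep = [a for a in r_attrs if a not in s_attrs]

    def project(rows, onto):
        # Project rows that follow r's schema onto the given attributes.
        idx = [r_attrs.index(a) for a in onto]
        return {tuple(row[i] for i in idx) for row in rows}

    proj_r = project(r, keep)

    # Step 1: πA-B(R) x S, with columns arranged in r's attribute order.
    product = set()
    for p in proj_r:
        for t in s:
            row = {**dict(zip(keep, p)), **dict(zip(s_attrs, t))}
            product.add(tuple(row[a] for a in r_attrs))

    # Step 2: the "extra" tuples that are in the product but not in r.
    extra = product - set(r)

    # Step 3: remove the projection of the extra tuples from πA-B(R).
    return proj_r - project(extra, keep)


# Assumed instance of r (the original was not preserved); s is from the answer.
r = {("a1", "b1", "d1"), ("a1", "b1", "d2"),
     ("a2", "b2", "d3"), ("a1", "b1", "d4")}
s = {("a1", "b1")}

u = divide(r, s, ["A", "B", "D"], ["A", "B"])
print(sorted(u))
```

With this assumed r, the result contains exactly the D values d1, d2 and d4, matching the walkthrough.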
You can also see another example in Wikipedia, and for a detailed explanation of the division, together with its transformation into SQL, you could look at these slides.

Calculation based on two tables

I have the following tables:
A has columns A1 (text) and A2 (number).
B has columns B1 (text), B2 (text), and B3 (number).
Let's say the user fills in B1 with b1 and B2 with b2. I want the value in B3 (call it b3) to be automatically calculated as follows:
1. Search A1 to find b1.
2. Get the A2 value corresponding to b1. Call this c1.
3. Search A1 to find b2.
4. Get the A2 value corresponding to b2. Call this c2.
5. b3 is min[c1, c2].
Can I do this by making B3 a calculated field, or by using a query?
In Access 2010 and later you could use a Before Change data macro on [TableB] to derive the [B3] value like this:
For more information, see
Create a data macro
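Independently of Access, the lookup logic from the question can be sketched in a few lines of Python. This is only an illustration of the calculation, not the data macro itself; the table contents and the names table_a and compute_b3 are invented for the example, and table A is modelled as a dictionary from A1 to A2.

```python
# Hypothetical contents of table A: A1 (text) -> A2 (number).
table_a = {"apple": 5, "pear": 3, "plum": 8}


def compute_b3(a, b1, b2):
    """Look up the A2 values for b1 and b2 and return the smaller one."""
    c1 = a[b1]  # A2 value whose A1 equals b1
    c2 = a[b2]  # A2 value whose A1 equals b2
    return min(c1, c2)


print(compute_b3(table_a, "apple", "pear"))
```

A value of b1 or b2 that is missing from A1 raises KeyError in this sketch; how that case should be handled is not stated in the question.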

Optimized algorithm to synchronize two arrays

I am looking for an efficient algorithm to synchronize two arrays. Let's say a1 and a2 are two arrays given as input.
a1 - C , C++ , Java , C# , Perl
a2 - C++ , Python , Java , Cw , Haskell
Output 2 arrays:
Output A1: C , C++ , Java
Output A2: Cw , Haskell , Python
Output A1:
1) items common to both arrays
2) items only in A1 and not in A2
Output A2:
items only in a2
Thanks in advance.
Raj
Sort both arrays with an efficient sorting algorithm; this costs O(n log n).
Start with both output arrays empty.
Compare the current element a1 of sorted A1 (initially the first) with the current element a2 of sorted A2.
If they are equal, the item is in both arrays: put a1 into OutputA1 and advance to the next element in both sorted arrays.
If a1 < a2, a1 is only in A1: put a1 into OutputA1, and a1 becomes the next element in sorted A1.
Otherwise a2 < a1, so a2 is only in A2: put a2 into OutputA2, and a2 becomes the next element in sorted A2.
Repeat until all elements of both sorted arrays are processed; when one array runs out, the remaining elements of the other go to its own output. This pass is O(n).
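The sort-and-merge steps above can be sketched in Python as follows. The function name synchronize is made up for this example, and the sketch assumes that neither input array contains duplicates. It follows the stated rules (common items and A1-only items go to the first output), so C# and Perl also land in the first output, unlike the sample output in the question.

```python
def synchronize(a1, a2):
    """Return (out1, out2): out1 holds items common to both arrays plus
    items only in a1; out2 holds items only in a2."""
    s1, s2 = sorted(a1), sorted(a2)  # O(n log n)
    out1, out2 = [], []
    i = j = 0
    # Single merge pass over both sorted arrays, O(n).
    while i < len(s1) and j < len(s2):
        if s1[i] == s2[j]:
            # In both arrays: keep it once and advance both positions.
            out1.append(s1[i])
            i += 1
            j += 1
        elif s1[i] < s2[j]:
            # Only in a1.
            out1.append(s1[i])
            i += 1
        else:
            # Only in a2.
            out2.append(s2[j])
            j += 1
    # Whatever is left in one array cannot appear in the other.
    out1.extend(s1[i:])
    out2.extend(s2[j:])
    return out1, out2


a1 = ["C", "C++", "Java", "C#", "Perl"]
a2 = ["C++", "Python", "Java", "Cw", "Haskell"]
out1, out2 = synchronize(a1, a2)
print(out1)
print(out2)
```

If sorting is not required for other reasons, building a set from one array and testing membership gives the same split in expected O(n) time, at the cost of extra memory.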
