Generate unique cross in Pig - hadoop

I have a problem: I don't understand how to generate a unique "cross" of my input.
Here is my input:
A, B, C
I would like to get:
A,B
A,C
B,C
What UDF (data-fu, piggybank) can I use to solve this problem?

If your input is like
A
B
C
and you want the output:
A,B
A,C
B,C
You can use cross join to get the results. For example:
input1 = load 'your_path' as (key: chararray);
input2 = load 'your_path' as (key: chararray);
cross_results = cross input1, input2;
final_results = filter cross_results by input1::key < input2::key;
If "A,B,C" is instead a single bag in one record, you can use flatten. For example:
-- Assume your input x is something like {A, B, C} in one row
y = foreach x generate flatten($0) as f1, flatten($0) as f2;
final_results = filter y by f1 < f2;
Your description is not very detailed, so this is the best I can offer. You may need to adapt it.
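To check the expected result outside Pig, here is a minimal Python sketch of the same cross-then-filter idea. It assumes the input values are distinct, comparable strings; the function name unique_pairs is made up for illustration and is not part of Pig, data-fu or piggybank.

```python
from itertools import combinations, product

def unique_pairs(values):
    """Cross the values with themselves, then keep only pairs with left < right."""
    return sorted((a, b) for a, b in product(values, repeat=2) if a < b)

pairs = unique_pairs(["A", "B", "C"])
for a, b in pairs:
    print(f"{a},{b}")

# For distinct values, the filter yields the same pairs as combinations() on sorted input.
assert pairs == list(combinations(sorted(["A", "B", "C"]), 2))
```

The a < b filter is what removes both the self-pairs (A,A) and the mirrored duplicates (B,A), which is the role of the filter step in the Pig scripts above.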

Related

Can you cube between multiple relations in PIG?

I want to find the combination given another variable:
Example:
name, group, points
jim, T, 12
steven, T, 10
ting, T, 15
matt, F, 16
aamir, F, 12
I want to get all combinations between members of T and F and multiply their points columns. I first thought of splitting the data into two relations (a T relation and an F relation) and combining them with CUBE, but I don't think CUBE works across relations. Any suggestions?
Results:
jim, matt, 12*16
jim, aamir, 12*12
steven, matt, 10*16
...
...
ting, aamir, 15*12
Can you try this?
input.txt
jim,T,12
steven,T,10
ting,T,15
matt,F,16
aamir,F,12
PigScript:
A = LOAD 'input.txt' USING PigStorage(',') AS (name:chararray, group:chararray, points:int);
B = FILTER A BY group=='T';
C = FILTER A BY group=='F';
D = CROSS B,C;
E = FOREACH D GENERATE B::name,C::name,B::points*C::points;
DUMP E;
Output:
(jim,matt,192)
(jim,aamir,144)
(steven,matt,160)
(steven,aamir,120)
(ting,matt,240)
(ting,aamir,180)
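As a cross-check of the numbers, the same filter, cross and multiply steps can be sketched in plain Python. This only illustrates what the script computes and is not Pig code; the function name cross_groups is an assumption made for the example.

```python
from itertools import product

rows = [
    ("jim", "T", 12),
    ("steven", "T", 10),
    ("ting", "T", 15),
    ("matt", "F", 16),
    ("aamir", "F", 12),
]

def cross_groups(rows, left="T", right="F"):
    """Pair every member of `left` with every member of `right` and multiply points."""
    b = [r for r in rows if r[1] == left]    # like FILTER A BY group == 'T'
    c = [r for r in rows if r[1] == right]   # like FILTER A BY group == 'F'
    # like CROSS B, C followed by FOREACH ... GENERATE
    return [(x[0], y[0], x[2] * y[2]) for x, y in product(b, c)]

result = cross_groups(rows)
for line in result:
    print(line)
```

The sketch prints the tuples in a fixed order, whereas Pig does not promise any particular row order for CROSS.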

Combine row value to columns

I have data like this:
DUMP A;
(2013-11, a)
(2013-11, b)
(2013-11, c)
(2013-11, d)
(2013-12, e)
and I would like to merge rows with the same key, like this (my desired output):
(2013-11, a, b, c, d)
(2013-12, e)
How can I achieve this using Pig Latin alone?
What you are looking for is the GROUP operator. You can use it like:
-- A is your sample.
B = GROUP A BY $0 ;
DUMP B ;
-- (2013-11, {(a), (b), (c), (d)})
-- (2013-12, {(e)})
Note, there is no guarantee the bag will have the values in alphabetical (or any) order.
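To make the shape of the result concrete, here is a small Python sketch of the grouping step that also spreads each group into one flat row, as in the desired output. It is an illustration, not Pig; group_rows is a hypothetical name, and the sorting is added only so the result is deterministic.

```python
from collections import defaultdict

rows = [("2013-11", "a"), ("2013-11", "b"), ("2013-11", "c"),
        ("2013-11", "d"), ("2013-12", "e")]

def group_rows(rows):
    """Collect the second field of each row under its key, like GROUP A BY $0."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    # Sorted here only for a deterministic result; Pig makes no ordering promise.
    return [(key, *sorted(values)) for key, values in sorted(groups.items())]

merged = group_rows(rows)
for row in merged:
    print(row)
```

Note that the rows end up with different widths (five fields versus two), which is why the Pig answer keeps the values in a bag instead.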

How to filter a (a,b) relation by (b,a)?

I have a generic relation A like this:
DUMP A;
(a, b)
(a, c)
(a, d)
(b, a)
(d, a)
(d, b)
Note that (a,b) has a matching reversed pair (b,a), but (d,b) has no matching (b,d), and likewise (a,c) has no (c,a).
I want to filter those "unpaired" tuples out.
The final result should be something like:
DUMP R;
(a, b)
(a, d)
(b, a)
(d, a)
How can I write this in Pig?
I was able to solve it with the following code, but the CROSS operation is too expensive:
A_cp = FOREACH A GENERATE first, second;
X = CROSS A, A_cp;
F = FILTER X BY ($0 == $3 AND $1 == $2);
R = FOREACH F GENERATE $0, $1;
This is the output of my DESCRIBE A ; DUMP A ;:
A: {first: chararray,second: chararray}
(a,b)
(a,c)
(a,d)
(b,a)
(d,a)
(d,b)
This is one way you could solve this:
A = LOAD 'foo.in' AS (first:chararray, second:chararray) ;
-- Pig can't join a relation with itself, so we have to duplicate A.
A2 = FOREACH A GENERATE * ;
-- Join A with its copy so that rows look like (b,a,a,c), etc.
B = JOIN A BY second, A2 BY first ;
-- Keep only rows where the first field equals the last field,
-- then project out just one side of the pair.
C = FOREACH (FILTER B BY A::first == A2::second)
    GENERATE A::first AS first, A::second AS second ;
Output:
C: {first: chararray,second: chararray}
(b,a)
(d,a)
(a,b)
(a,d)
Update: As WinnieNicklaus points out, this can be shortened to:
B = FOREACH (JOIN A BY (first, second), A2 BY (second, first))
GENERATE A::first AS first, A::second AS second ;
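The reason the join is cheaper than the CROSS is that each row only has to look up its reversed pair instead of being compared with every other row. A minimal Python sketch of that lookup follows; it assumes A contains no duplicate rows, and keep_reciprocal is a name invented for the example.

```python
def keep_reciprocal(pairs):
    """Keep (x, y) only when (y, x) is also present in the input."""
    seen = set(pairs)  # hash lookup instead of comparing every row with every row
    return [(x, y) for x, y in pairs if (y, x) in seen]

A = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "a"), ("d", "a"), ("d", "b")]
R = keep_reciprocal(A)
for row in R:
    print(row)
```

This mirrors the shortened join on (first, second) against (second, first): a row survives only if its mirror image exists.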

How to use a relation to filter a GROUP?

Let's say I have relation A
DUMP A;
(a)
(d)
(g)
And now I want to use A's values to filter a group G:
DUMP G;
(a, {(a,b), (a,c)})
(c, {(c,d), (c,x)})
(d, {(d,b), (d,e)})
...So that the result would be
(a, {(a,b), (a,c)})
(d, {(d,b), (d,e)})
And then I want to extract the groups to generate:
(a,b)
(a,c)
(d,b)
(d,e)
I tried the following for the filtering part, but it didn't work:
J = JOIN G BY group, A BY a1;
R = FOREACH (FILTER J BY J::group == A::a1)
GENERATE FLATTEN(J.group);
If I'm understanding your question correctly, the output of J should already be what you want. By default JOIN is an inner join, so since c does not appear in A it will not be included in the output of J. If you dump J you should see:
(a, {(a,b), (a,c)}, a)
(d, {(d,b), (d,e)}, d)
(Or something similar with the location of the variables switched.)
To FLATTEN out the bag you'll need to do something like:
R = FOREACH J GENERATE FLATTEN(G::FOO) ;
In this case FOO is the name of the relation you did the GROUP on. You can verify its name with DESCRIBE G ;.
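The two steps (the inner join drops groups whose key is not in A, then FLATTEN unpacks the bags) can be sketched in Python as follows. This is only an illustration of the data flow; filter_and_flatten is a hypothetical name.

```python
G = [
    ("a", [("a", "b"), ("a", "c")]),
    ("c", [("c", "d"), ("c", "x")]),
    ("d", [("d", "b"), ("d", "e")]),
]
A = ["a", "d", "g"]

def filter_and_flatten(groups, keys):
    """Keep groups whose key is in `keys` (inner join), then emit each bag row (FLATTEN)."""
    wanted = set(keys)
    return [row for key, bag in groups if key in wanted for row in bag]

R = filter_and_flatten(G, A)
for row in R:
    print(row)
```

The key g from A matches no group and c from G matches no key, so neither shows up in the result, which is the inner-join behaviour described above.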

String pattern matching: can a suffix array solve this, or is there a better solution?

I have strings that are randomly generated from a special set of characters (B, C, D, F, X, Z). For example, the generator might produce the following list of strings:
B D Z Z Z C D C Z
B D C
B Z Z Z D X
D B Z F
Z B D C C Z
B D C F Z
..........
I also have a list of patterns. I want to match each generated string against this list, return the best pattern, and extract some substring from the string.
String patterns:
B D C [D must appear before C, i.e. DC]
B C F
B D C F
B X [if the string contains X, this pattern must be matched]
.......
For example:
B D Z Z Z C D C Z contains B and DC, so it is matched by B D C
D B Z C F contains B, C and F, so it is matched by B C F
D B Z D F contains B and F, so it is matched by B F
.......
So far I have only thought about using a suffix array:
1. Convert the string to a suffix array object.
2. Loop over the patterns and find which ones the suffix array matches.
3. Compare all matched patterns and pick the best one.
var suffix_array = (convert the string to a suffix array);
var list = new List();
for (int i = 0; i < patterns.length; i++) {
    if (suffix_array.match(patterns[i]))
        list.Add(patterns[i]);
}
var max = list[0];
for (int i = 1; i < list.length; i++) {
    if (list[i] > max)
        max = list[i];
}
Write(max);
I think this method is too complex: it needs to build a tree for each pattern and match it against the suffix array. Does anyone have a better idea?
==================== Update
I have a better solution now. I created a new class with one array-valued property per character (B, C, D, X, ...). Each property stores the positions at which that character appears in the string.
If B does not appear in the string, we can stop processing immediately.
We can also take all the C and D positions and check whether they can appear in sequence (DC, DCC, CCC, ...).
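The position-index idea from the update could be sketched in Python roughly as below. Several things here are assumptions rather than facts from the question: symbols are separated by spaces, a pattern matches when its symbols occur in the string in the same order (not necessarily adjacent), which agrees with the three examples given, and "best" means the longest matching pattern. The special priority rule for X is not modeled, and all function names are invented for the example.

```python
from bisect import bisect_right
from collections import defaultdict

def index_positions(text):
    """Map each symbol to the sorted list of positions where it occurs."""
    positions = defaultdict(list)
    for i, symbol in enumerate(text.split()):
        positions[symbol].append(i)
    return positions

def matches(positions, pattern):
    """True if the pattern's symbols occur in order (assumed matching rule)."""
    last = -1
    for symbol in pattern.split():
        occurrences = positions.get(symbol, [])
        k = bisect_right(occurrences, last)  # index of first occurrence after `last`
        if k == len(occurrences):
            return False  # symbol missing, or only present too early
        last = occurrences[k]
    return True

def best_pattern(text, patterns):
    """Return the longest matching pattern, or None (assumption: longest is best)."""
    positions = index_positions(text)
    hits = [p for p in patterns if matches(positions, p)]
    return max(hits, key=lambda p: len(p.split()), default=None)

patterns = ["B D C", "B C F", "B D C F", "B X"]
print(best_pattern("B D Z Z Z C D C Z", patterns))
```

Because a pattern fails as soon as one symbol has no remaining occurrence, a string without B is rejected on the first lookup for every pattern, which is the early exit described in the update.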
I'm not sure what programming language you are using; have you checked its support for regular expressions? If you are not familiar with them, they are worth looking up.
var suffix_array = (convert the string to a suffix array);
var best = (worst value - presumably zero - pattern);
for (int i = 0; i < (pattern list array length); i++) {
    if (suffix_array.match(pattern[i])) {
        if (pattern[i] > best) {
            best = pattern[i];
        }
        // add pattern[i] to a list here if you still want a list of all matches
    }
}
write best;
That's roughly it. If I understand what you're looking for, this is a slight improvement, though there may well be a better solution.
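The regular-expression suggestion could look something like this in Python. It is a sketch under assumptions: every symbol is a single character, a pattern such as "B D C" is read as "B, then D, then C, with anything in between", and the longest matching pattern counts as the best. The X priority rule from the question is not handled, and the function names are hypothetical.

```python
import re

def pattern_to_regex(pattern):
    """Turn 'B D C' into the regex B.*D.*C (symbols in order, gaps allowed)."""
    return re.compile(".*".join(re.escape(s) for s in pattern.split()))

def best_match(text, patterns):
    """Return the longest pattern whose regex is found in the text, or None."""
    compact = text.replace(" ", "")  # 'B D C' -> 'BDC'
    hits = [p for p in patterns if pattern_to_regex(p).search(compact)]
    return max(hits, key=len, default=None)

patterns = ["B D C", "B C F", "B D C F", "B X"]
print(best_match("B D C F Z", patterns))
```

This avoids building a suffix array at all; whether it is fast enough depends on how many patterns and strings there are.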
