Combine row value to columns - hadoop

I have data like this:
DUMP A;
(2013-11, a)
(2013-11, b)
(2013-11, c)
(2013-11, d)
(2013-12, e)
and I would like to merge the rows that share a key, like this (my desired output):
(2013-11, a, b, c, d)
(2013-12, e)
How can I achieve this using Pig Latin alone?

What you are looking for is the GROUP operator. You can use it like:
-- A is your sample.
B = GROUP A BY $0 ;
DUMP B ;
-- (2013-11, {(a), (b), (c), (d)})
-- (2013-12, {(e)})
Note, there is no guarantee the bag will have the values in alphabetical (or any) order.
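GROUP leaves the values inside a bag, whereas the desired output in the question has them as top-level fields. Here is a minimal sketch of one way to get there. It assumes a Pig version that ships the BagToTuple builtin and addresses A's fields positionally, since the question gives no schema. The aliases C and sorted_vals are made up for illustration.

```pig
-- Assumption: A has two fields, the key ($0) and the value ($1).
B = GROUP A BY $0;
C = FOREACH B {
    -- Sort the values inside each group so the output order is predictable.
    sorted_vals = ORDER A BY $1;
    -- Turn the bag of values into one tuple, then flatten it into top-level fields.
    GENERATE group, FLATTEN(BagToTuple(sorted_vals.$1));
};
DUMP C;
```

The nested ORDER is only there to address the ordering caveat above; it can be dropped if the order of the values does not matter. Rows will have a varying number of fields, one per value in the group.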

Related

Is it possible to read a collection of strings and return a Regexp?

I have a collection of files from multiple sources.
Each file contains strings like:
File 1: A) B) C) D) E)
File 2: a) b) c) d) e)
File 3: a. b. c. d. e.
File 4: a- b- c- d- e-
(...)
I know I could code all possible patterns beforehand, but I'd rather have them derived automatically.
Is it possible to write a program that reads the file and figures out the pattern?
Ex:
File 1: A) B) C) D) E) # => [ABCDE]\)
File 2: a) b) c) d) e) # => [abcde]\)
File 3: a. b. c. d. e. # => [abcde]\.
File 4: a- b- c- d- e- # => [abcde]-
Regexp.union is smart enough to escape the parens, but that is it.
str = "A) B) C) D) E)"
p re = Regexp.union(*str.split) # => /A\)|B\)|C\)|D\)|E\)/
Perl's Regexp::Assemble might be able to do this; AFAIK there is no Ruby equivalent.

How to filter a (a,b) relation by (b,a)?

I have a generic relation A like this:
DUMP A;
(a, b)
(a, c)
(a, d)
(b, a)
(d, a)
(d, b)
Note that (a,b) has the reversed pair (b,a), but there is no reversed pair for (d,b).
I want to filter those "unpaired" tuples out.
The final result should be something like:
DUMP R;
(a, b)
(a, d)
(b, a)
(d, a)
How can I write this in Pig?
I was able to solve it with the following code, but the CROSS operation is too expensive:
A_cp = FOREACH A GENERATE first, second;
X = CROSS A, A_cp;
F = FILTER X BY ($0 == $3 AND $1 == $2);
R = FOREACH F GENERATE $0, $1;
This is the output of my DESCRIBE A ; DUMP A ;:
A: {first: chararray,second: chararray}
(a,b)
(a,c)
(a,d)
(b,a)
(d,a)
(d,b)
This is one way you could solve this:
A = LOAD 'foo.in' AS (first:chararray, second:chararray) ;
-- Pig can't join a relation with itself, so we have to duplicate A.
A2 = FOREACH A GENERATE * ;
-- Join A to its copy so that each row looks like (b,a,a,c), etc.
B = JOIN A BY second, A2 BY first ;
-- We only want rows where the first field equals the last field.
C = FOREACH (FILTER B BY A::first == A2::second)
-- Now we project out just one side of the pair.
    GENERATE A::first AS first, A::second AS second ;
Output:
C: {first: chararray,second: chararray}
(b,a)
(d,a)
(a,b)
(a,d)
Update: As WinnieNicklaus points out, this can be shortened to:
B = FOREACH (JOIN A BY (first, second), A2 BY (second, first))
GENERATE A::first AS first, A::second AS second ;

How to use a relation to filter a GROUP?

Let's say I have relation A
DUMP A;
(a)
(d)
(g)
And now I want to use A's values to filter a group G:
DUMP G;
(a, {(a,b), (a,c)})
(c, {(c,d), (c,x)})
(d, {(d,b), (d,e)})
...So that the result would be
(a, {(a,b), (a,c)})
(d, {(d,b), (d,e)})
And then I want to extract the groups to generate:
(a,b)
(a,c)
(d,b)
(d,e)
I tried the following for the filtering part, but it didn't work:
J = JOIN G BY group, A BY a1;
R = FOREACH (FILTER J BY J::group == A::a1)
GENERATE FLATTEN(J.group);
If I'm understanding your question correctly, the output of J should already be what you want. By default JOIN is an inner join, so since c does not appear in A it will not be included in the output of J. If you dump J you should see:
(a, {(a,b), (a,c)}, a)
(d, {(d,b), (d,e)}, d)
(Or something similar with the location of the variables switched.)
To FLATTEN out the bag you'll need to do something like:
R = FOREACH J GENERATE FLATTEN(G::FOO) ;
In this case FOO is the name of the relation you did the GROUP on. You can verify its name with DESCRIBE G ;.
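If the grouped form is not needed for anything else, the same tuples can probably be obtained by joining before grouping. This is a sketch, assuming `data` (a hypothetical name) is the two-field relation that G was built from and that A's single field is named a1 as in the question.

```pig
-- 'data' is a hypothetical name for the ungrouped relation behind G.
-- The inner join keeps only the rows whose first field appears in A.
J2 = JOIN data BY $0, A BY a1;
-- Keep data's two fields; the third column of J2 is A's a1.
R2 = FOREACH J2 GENERATE $0, $1;
DUMP R2;
```

This skips the GROUP and FLATTEN round trip, at the cost of not having the grouped relation available afterwards.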

Generate unique cross in Pig

I have a problem: I don't understand how to generate a unique "cross" of the input.
Here is my input:
A, B, C
I would like to get:
A,B
A,C
B,C
What UDF (data-fu, piggybank) can I use to solve this problem?
If your input is like
A
B
C
and you want to output:
A,B
A,C
B,C
You can use cross join to get the results. For example:
input1 = load 'your_path' as (key: chararray);
input2 = load 'your_path' as (key: chararray);
cross_results = cross input1, input2;
final_results = filter cross_results by input1::key < input2::key;
If A, B and C are instead the items of a bag in a single record, you can use FLATTEN. For example:
-- Assume your input x is something like {A, B, C} in one row
y = foreach x generate flatten($0) as f1, flatten($0) as f2;
final_results = filter y by f1 < f2;
As your description is not very exhaustive, I can only provide the above solution. You may need to adapt it.
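If the input is literally one text line such as `A, B, C`, the bag for the second variant has to be built first. Here is a sketch under that assumption, using the builtin TextLoader and TOKENIZE (whose default delimiters include commas and spaces). The aliases `lines`, `bags` and `pairs` are made up for illustration.

```pig
-- Assumption: each input line holds one comma-separated list, e.g. "A, B, C".
lines = LOAD 'your_path' USING TextLoader() AS (line:chararray);
-- TOKENIZE splits the line into a bag of single-field tuples: {(A),(B),(C)}.
bags = FOREACH lines GENERATE TOKENIZE(line) AS items;
-- Flattening the same bag twice yields every ordered pair within a line.
pairs = FOREACH bags GENERATE FLATTEN(items) AS f1, FLATTEN(items) AS f2;
-- Keep each unordered pair once and drop self-pairs.
final_results = FILTER pairs BY f1 < f2;
DUMP final_results;
```

Pairs are generated per input line, so items from different lines are never combined with each other.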

Hadoop Pig comparing two values and sort them

I'm currently learning the Hadoop framework and the Pig Latin language.
Now I have a problem.
I've got a data set with the following format:
"long a, long b, char c, char d"
Now I want to read this data set with Pig. That's no problem with LOAD and the PigStorage function:
bla = load 'data/examples/test' as (a:long, b:long, c:chararray, d:chararray);
My next step is to compare a with b on each line. If a is greater than b, that's fine. If b is greater than a, I want to swap a and b, so that the higher value is always the first value in each row.
Is this possible? In Java I can do this with a simple "compareTo"...
sorry for my bad english :-)
blb = FOREACH bla GENERATE ((a < b) ? b : a), ((a < b) ? a : b), c, d;
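This operator in Pig is called bincond. The first expression says: if a is less than b, output b; otherwise output a. The second says: if a is less than b, output a; otherwise output b. So when a is already greater than or equal to b, the row keeps its original order, and when a is less than b, the two values are swapped.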
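The generated columns lose their original names, so it can help to name them explicitly. This is a small sketch of the same statement with aliases; the names `hi` and `lo` are made up for illustration and do not come from the question.

```pig
-- Same bincond logic as above, with explicit (hypothetical) names for the results.
blb = FOREACH bla GENERATE
    ((a < b) ? b : a) AS hi:long,  -- the larger of a and b
    ((a < b) ? a : b) AS lo:long,  -- the smaller of a and b
    c, d;
DESCRIBE blb;
```

One thing to check on real data: if a or b can be null, the comparison is not true or false, so the outputs for those rows should be verified rather than assumed.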
