I have data like this:
DUMP A;
(2013-11, a)
(2013-11, b)
(2013-11, c)
(2013-11, d)
(2013-12, e)
and I would like to merge rows with the same key, like this (my desired output):
(2013-11, a, b, c, d)
(2013-12, e)
How can I achieve this using Pig Latin alone?
What you are looking for is the GROUP operator. You can use it like:
-- A is your sample.
B = GROUP A BY $0 ;
DUMP B ;
-- (2013-11, {(a), (b), (c), (d)})
-- (2013-12, {(e)})
Note, there is no guarantee the bag will have the values in alphabetical (or any) order.
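The grouped bag is not yet the flat row the question asks for. The following is a minimal sketch of one way to get there, assuming Pig 0.11 or later (where the built-in BagToTuple is available) and that the value sits at position $1 of A. The alias names C and sorted are illustrative only.

```pig
-- A is the sample relation; B groups it by the first field.
B = GROUP A BY $0;
C = FOREACH B {
    -- Sort inside each group so the order of values is deterministic.
    sorted = ORDER A BY $1;
    -- BagToTuple turns the bag of values into a single tuple;
    -- FLATTEN then spreads that tuple's fields into the output row.
    GENERATE group, FLATTEN(BagToTuple(sorted.$1));
};
DUMP C;
```

For the sample data this should produce rows shaped like (2013-11,a,b,c,d). Note that the rows end up with different numbers of fields, so downstream steps cannot rely on a fixed schema.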
I have a collection of files from multiple sources.
Each file contains strings like:
File 1: A) B) C) D) E)
File 2: a) b) c) d) e)
File 3: a. b. c. d. e.
File 4: a- b- c- d- e-
(...)
I know I could code all possible patterns beforehand, but I'd rather have them detected automatically.
Is it possible to write a program that reads the file and figures out the pattern?
Ex:
File 1: A) B) C) D) E) # => [ABCDE]\)
File 2: a) b) c) d) e) # => [abcde]\)
File 3: a. b. c. d. e. # => [abcde]\.
File 4: a- b- c- d- e- # => [abcde]-
Regexp.union is smart enough to escape the parens, but that is it.
str = "A) B) C) D) E)"
p re = Regexp.union(*str.split) # => /A\)|B\)|C\)|D\)|E\)/
Perl's Regexp::Assemble might be able to do this, but AFAIK there is no Ruby equivalent.
I have a generic relation A like this:
DUMP A;
(a, b)
(a, c)
(a, d)
(b, a)
(d, a)
(d, b)
Note that both (a,b) and (b,a) are present, but (d,b) has no matching (b,d).
I want to filter those "unpaired" tuples out.
The final result should be something like:
DUMP R;
(a, b)
(a, d)
(b, a)
(d, a)
How can I write this in Pig?
I was able to solve it with the following code, but the CROSS operation is too expensive:
A_cp = FOREACH A GENERATE first, second;
X = CROSS A, A_cp;
F = FILTER X BY ($0 == $3 AND $1 == $2);
R = FOREACH F GENERATE $0, $1;
This is the output of my DESCRIBE A ; DUMP A ;:
A: {first: chararray,second: chararray}
(a,b)
(a,c)
(a,d)
(b,a)
(d,a)
(d,b)
This is one way you could solve this:
A = LOAD 'foo.in' AS (first:chararray, second:chararray) ;
-- A relation can't be joined with itself, so we have to duplicate A.
A2 = FOREACH A GENERATE * ;
-- Join the two copies so that the rows look like (b,a,a,c), etc.
B = JOIN A BY second, A2 BY first ;
-- We only want rows where the first field is equal to the last field.
C = FOREACH (FILTER B BY A::first == A2::second)
-- Now we project out just one side of the pair.
GENERATE A::first AS first, A::second AS second ;
Output:
C: {first: chararray,second: chararray}
(b,a)
(d,a)
(a,b)
(a,d)
Update: As WinnieNicklaus points out, this can be shortened to:
B = FOREACH (JOIN A BY (first, second), A2 BY (second, first))
GENERATE A::first AS first, A::second AS second ;
Let's say I have relation A
DUMP A;
(a)
(d)
(g)
And now I want to use A's values to filter a group G:
DUMP G;
(a, {(a,b), (a,c)})
(c, {(c,d), (c,x)})
(d, {(d,b), (d,e)})
...So that the result would be
(a, {(a,b), (a,c)})
(d, {(d,b), (d,e)})
And then I want to extract the groups to generate:
(a,b)
(a,c)
(d,b)
(d,e)
I tried the following for the filtering part, but it didn't work:
J = JOIN G BY group, A BY a1;
R = FOREACH (FILTER J BY J::group == A::a1)
GENERATE FLATTEN(J.group);
If I'm understanding your question correctly, the output of J should already be what you want. By default JOIN is an inner join, so since c does not appear in A it will not be included in the output of J. If you dump J you should see:
(a, {(a,b), (a,c)}, a)
(d, {(d,b), (d,e)}, d)
(Or something similar with the location of the variables switched.)
To FLATTEN out the bag you'll need to do something like:
R = FOREACH J GENERATE FLATTEN(G::FOO) ;
In this case FOO is the name of the relation you did the GROUP on. You can verify its name with DESCRIBE G ;.
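Putting the pieces together, the whole pipeline might look like the sketch below. The file paths and the field names k and v are placeholders invented for illustration; FOO and a1 follow the names used above.

```pig
-- FOO is the relation that was grouped; paths and the k/v names are hypothetical.
FOO = LOAD 'foo.in' AS (k:chararray, v:chararray);
A = LOAD 'keys.in' AS (a1:chararray);
G = GROUP FOO BY k;
-- The inner join drops every group whose key does not appear in A.
J = JOIN G BY group, A BY a1;
-- Flatten the bag back into individual (k, v) rows.
R = FOREACH J GENERATE FLATTEN(G::FOO);
DUMP R;
```

No extra FILTER is needed after the JOIN, because the join condition already enforces the equality the FILTER was checking.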
I have a problem. I don't understand how I can generate a unique "cross" for the input.
Here is my input:
A, B, C
I would like to get:
A,B
A,C
B,C
What UDF (data-fu, piggybank) can I use to solve this problem?
If your input is like
A
B
C
and you want to output:
A,B
A,C
B,C
You can use CROSS to get the result, and then keep only the pairs where the first key sorts before the second. For example:
input1 = load 'your_path' as (key: chararray);
input2 = load 'your_path' as (key: chararray);
cross_results = cross input1, input2;
final_results = filter cross_results by input1::key < input2::key;
If "A,B,C" is a single bag in one record, you can use FLATTEN. For example,
-- Assume your input x is something like {A, B, C} in one row
y = foreach x generate flatten($0) as f1, flatten($0) as f2;
final_results = filter y by f1 < f2;
As your description is not very detailed, I can only provide the above solution. You may need to adapt it.
I'm currently learning the Hadoop framework and the Pig Latin language.
Now I've a problem.
I've got a data set with the following format:
"long a, long b, char c, char d"
Now I want to read this data set with Pig. That's no problem with LOAD and the PigStorage function:
bla = load 'data/examples/test' as (a:long, b:long, c:chararray, d:chararray);
My next step is to compare a with b on each line. If a is greater than b, that's fine. If b is greater than a, I want to swap a and b, so that the higher value is always the first value in my data set.
Is this possible? In Java I could do this with a simple "compareTo".
sorry for my bad english :-)
blb = FOREACH bla GENERATE ((a < b) ? b : a), ((a < b) ? a : b), c, d;
This operator in Pig is called bincond. The first expression says: if a is less than b, output b, otherwise output a. The second says: if a is less than b, output a, otherwise output b. So when a is already greater than or equal to b, the values keep their original order, and the larger value always ends up first.
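As a slightly fuller sketch, the generated fields can be given names so that later steps do not have to refer to them by position. The names hi and lo are illustrative, not part of the original question.

```pig
bla = LOAD 'data/examples/test' AS (a:long, b:long, c:chararray, d:chararray);
blb = FOREACH bla GENERATE
          ((a < b) ? b : a) AS hi,   -- the larger of a and b
          ((a < b) ? a : b) AS lo,   -- the smaller of a and b
          c, d;
-- If a or b is null, the comparison is null, and both hi and lo come out null.
DESCRIBE blb;
```

If null values are possible in a or b, it may be worth filtering them out or handling them explicitly before this step.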