Pig: Create new column based off of two other columns - hadoop

I'm wondering if it's possible to do something like this in Pig.
There are two columns:
A "type1","type2","type3"
B 101 , 159 , 74
I want to define a column C as follows:
If A == "type1" then C = B; else C = 0
Is this possible in pig?

Yes, this is possible. You would write it as below:
data = LOAD '$dataSource' using AvroStorage();
-- data = {A, B}
data2 = FOREACH data
    GENERATE
        A,
        B,
        (A == 'type1' ? B : 0) AS C;
dump data2;
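To make the bincond easier to check by hand, here is a minimal plain-Python sketch of what `(A == 'type1' ? B : 0)` computes per row. It is only a model of the Pig expression, not Pig itself; the function name `add_column_c` is hypothetical, and the rows reuse the sample values from the question.

```python
# Plain-Python model of the Pig bincond (A == 'type1' ? B : 0).
# add_column_c is a hypothetical helper name, not part of Pig.
def add_column_c(rows):
    """Return (A, B, C) tuples where C is B for 'type1' rows and 0 otherwise."""
    return [(a, b, b if a == 'type1' else 0) for a, b in rows]


# Sample values taken from the question.
rows = [('type1', 101), ('type2', 159), ('type3', 74)]
for row in add_column_c(rows):
    print(row)
```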

Related

NOT IN , MATCHES in pig

I have two relations in Pig, A and B:
DUMP A;
Sandeep Rohan Mohan
DUMP B;
MOHAN
I need to get the output A - B.
Relation C should give me
Sandeep,Rohan
since they are not present in B.
Try this:
-- The LOAD arguments below stand in for the files that hold relations A and B.
A1 = LOAD 'Sandeep Rohan Mohan' USING PigStorage() AS (line:chararray);
B1 = LOAD 'MOHAN' USING PigStorage() AS (line:chararray);
A = FOREACH A1 GENERATE UPPER(line) AS line;
B = FOREACH B1 GENERATE UPPER(line) AS line;
C = COGROUP A BY line, B BY line;
D = FILTER C BY IsEmpty(B);
E = FOREACH D GENERATE group AS name;
DUMP E;
(ROHAN)
(SANDEEP)
Also refer to set operations in Apache Pig.
I achieved it with a left outer join, keeping only the tuples that had nulls in $1.
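The left-outer-join variant can be modelled in plain Python as a sketch: join each name in A against B case-insensitively, record None where there is no match, and keep only the unmatched rows. The function name `left_outer_anti_join` is hypothetical, and the model assumes single-column relations as in the question.

```python
# Plain-Python model of "left outer join, then keep rows whose right side is null".
# left_outer_anti_join is a hypothetical helper name, not a Pig built-in.
def left_outer_anti_join(a_rows, b_rows):
    """Return upper-cased names from a_rows that have no match in b_rows."""
    b_keys = {name.upper() for name in b_rows}
    # Each joined row is (left, right); right is None when B has no match.
    joined = [
        (name.upper(), name.upper() if name.upper() in b_keys else None)
        for name in a_rows
    ]
    return [left for left, right in joined if right is None]


print(left_outer_anti_join(['Sandeep', 'Rohan', 'Mohan'], ['MOHAN']))
```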

In pig I want to reduce groups to have 1 element with specific types having precedence

In Pig, I have columns A, B, C, id, and id_type. The possible id_types are "zip", "city", "county", "state", and "country".
I want exactly one row for each existing (A, B, C) combination, giving precedence to the row with id_type "zip"; if there is no "zip" row, then "city"; if there is no "city" row, then "county"; and so on.
So, if I have the following two rows
(a, b, c, 555, city)
(a, b, c, 123, state)
I want to remove the second one. I can group by A, B, C to get
((a, b, c), {(a, b, c, 555, city), (a, b, c, 123, state)})
But I do not know how to remove all of the unwanted elements from $1.
#inquistive_mind: I ran your code with the following input, and it does not return what the OP asked for.
Input :
(aa,bb,cc,1,zip)
(aa,bb,cc,2,street)
(mmm,nnn,cc,3,county)
(mmm,nnn,cc,4,zip)
(mmm,nnn,cc,5,state)
(lll,ccc,ddd,6,city)
(lll,ccc,xxx,7,country)
Output after running your code :
((aa,bb,cc),{(2,country),(1,zip)})
((lll,ccc,ddd),{(6,city)})
((lll,ccc,xxx),{(7,country)})
((mmm,nnn,cc),{(5,state),(4,zip),(3,county)})
As you can see, it does not reduce each group to a single entry chosen by id_type priority.
I solved this using a Python UDF. If there is a better way, please let me know.
Python code, saved as priority.py:
# id_types in priority order: earlier entries win.
PRIORITY = ['zip', 'city', 'county', 'state', 'country']

def unique_list(bag):
    # bag holds the grouped (A, B, C, ID, ID_TYPE) rows; ID_TYPE is field 4.
    # Walk the types in priority order so that 'zip' beats 'city', and so on,
    # regardless of the order of the rows inside the bag.
    for id_type in PRIORITY:
        for row in bag:
            if row[4] == id_type:
                return list(row)
    # No row carries a known id_type.
    return None
Now the Pig code:
REGISTER 'priority.py' using jython as callme;
A = LOAD 'addr.dat' USING PigStorage(',') AS (A : chararray, B :chararray , C: chararray , ID : chararray, ID_TYPE : chararray);
B = DISTINCT A;
Z= GROUP B BY (A,B,C);
O = FOREACH Z GENERATE callme.unique_list($1) as record :{(A : chararray, B :chararray , C: chararray , ID : chararray, ID_TYPE : chararray)} ;
DUMP O;
Please run this against your input and check if it works
One way is to write a UDF. Another way is to map each id_type to a numeric rank and keep the best-ranked row of each group:
ABC = load 'testdata.csv' using PigStorage(',') as (a: chararray, b: chararray, c: chararray, id: int, id_type: chararray);
MappedABC = foreach ABC generate a, b, c, id, id_type, (id_type == 'zip' ? 1 : (id_type == 'city' ? 2 : (id_type == 'county' ? 3 : (id_type == 'state' ? 4 : 5)))) as idorder;
FinalABC = foreach (group MappedABC by (a,b,c)) {
    OrderedABC = order MappedABC by idorder;
    LimitedABC = limit OrderedABC 1;
    generate flatten(LimitedABC);
};
store FinalABC into 'out' using PigStorage(';');
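The rank, order, and limit steps can be sketched in plain Python to check the expected result on a small sample. This is a model of the logic only; `ID_ORDER` and `keep_best` are hypothetical names, and any id_type outside the first four gets rank 5, mirroring the final else branch of the nested bincond.

```python
# Plain-Python model of: map id_type to a rank, group by (a, b, c),
# order by rank, and keep the first row of each group.
# ID_ORDER and keep_best are hypothetical names used only for illustration.
ID_ORDER = {'zip': 1, 'city': 2, 'county': 3, 'state': 4}


def keep_best(rows):
    """Return one row per (a, b, c) key, preferring the lowest rank."""
    best = {}
    for row in rows:
        key = row[:3]
        rank = ID_ORDER.get(row[4], 5)  # anything else falls through to 5
        if key not in best or rank < best[key][0]:
            best[key] = (rank, row)
    return [row for _, row in best.values()]


sample = [
    ('aa', 'bb', 'cc', 1, 'zip'),
    ('aa', 'bb', 'cc', 2, 'street'),
    ('mmm', 'nnn', 'cc', 3, 'county'),
    ('mmm', 'nnn', 'cc', 4, 'zip'),
    ('mmm', 'nnn', 'cc', 5, 'state'),
    ('lll', 'ccc', 'ddd', 6, 'city'),
    ('lll', 'ccc', 'xxx', 7, 'country'),
]
for row in keep_best(sample):
    print(row)
```

When two rows in a group share the best rank, this sketch keeps the one seen first; Pig's `limit` after `order` makes no such promise about ties.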

Column to row transformation using Pig Latin

A = load 'input.txt';
dump A;
"0,1, 2,3,4
5, 6,7, 8,9
B = foreach A generate FLATTEN(TOBAG(*));
dump B;
("0)
(1)
( 2)
(3)
(4)
(5)
( 6)
(7)
( 8)
(9)
I want to perform some replace and trim operations on each field above. How do I transform the result back to the original format afterwards?
Expected output
0,1,2,3,4
5,6,7,8,9
Yes, this is really an experimental question: it needs a rows-to-columns conversion followed by a columns-to-rows conversion.
With a little help from the RANK operator, I think we can achieve this.
I tried the code below on the following input.
Input :
0,1,2,3,4
5,6,7,8,9
The Pig script below contains two DUMP statements:
numbers = LOAD '/home/inputfiles/col_to_row.txt' USING PigStorage() As(line:chararray);
numbers_rank = RANK numbers;
numbers_each = FOREACH numbers_rank GENERATE $0 as rank_key,FLATTEN(TOKENIZE(line)) as each_number;
rows_to_columns = FOREACH numbers_each GENERATE each_number;
dump rows_to_columns; -- gives each number in a separate row
numbers_grp = GROUP numbers_each BY rank_key;
columns_to_rows = FOREACH numbers_grp GENERATE FLATTEN(BagToTuple(numbers_each.each_number));
dump columns_to_rows; -- gives rows matching the original input data set
Output :
dump rows_to_columns;
(0)
(1)
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)
dump columns_to_rows;
(0,1,2,3,4)
(5,6,7,8,9)
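The idea behind the RANK approach is that each token carries its original line number, so the tokens can be regrouped later. A minimal plain-Python sketch of that round trip follows; the function names are hypothetical, and the model splits on commas only, whereas Pig's TOKENIZE also splits on other delimiters such as spaces.

```python
# Plain-Python model of RANK + FLATTEN(TOKENIZE(...)) and the later GROUP.
# rows_to_columns and columns_to_rows are hypothetical helper names.
def rows_to_columns(lines):
    """Emit one (rank, token) pair per value; rank is the 1-based line number."""
    return [
        (rank, token.strip())
        for rank, line in enumerate(lines, start=1)
        for token in line.split(',')
    ]


def columns_to_rows(pairs):
    """Regroup (rank, token) pairs into one tuple per original line."""
    grouped = {}
    for rank, token in pairs:
        grouped.setdefault(rank, []).append(token)
    return [tuple(grouped[rank]) for rank in sorted(grouped)]


pairs = rows_to_columns(['0,1,2,3,4', '5,6,7,8,9'])
print(columns_to_rows(pairs))
```

The sketch preserves token order within a line because Python lists are ordered; a Pig bag produced by GROUP does not guarantee any order, so the Pig output order is worth checking on real data.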
You can do a simple replace with a regex. Since the REPLACE function invokes Java's String.replaceAll(), you can use any Java-compatible regex. Here is a demo:
grunt> A = load 'input.txt' as (f1:chararray);
grunt> DUMP A;
("0,1, 2,3,4 )
(5, 6,7, 8,9)
grunt> B = foreach A generate FLATTEN(TOBAG(*));
grunt> DUMP B;
("0,1, 2,3,4 )
(5, 6,7, 8,9)
grunt> X = FOREACH B GENERATE REPLACE($0, '[^0-9,]', '');
grunt> DUMP X;
(0,1,2,3,4)
(5,6,7,8,9)
grunt> Y = FOREACH X GENERATE FLATTEN(STRSPLIT($0, ','));
grunt> DUMP Y;
(0,1,2,3,4)
(5,6,7,8,9)
grunt> Z = FOREACH Y GENERATE $0;
grunt> DUMP Z;
(0)
(5)
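The regex itself can be tried outside Pig. The sketch below uses Python's `re.sub` as a stand-in for Java's `String.replaceAll()`; for this simple character class the two behave the same. The function names `clean_line` and `split_fields` are hypothetical.

```python
import re


# Stand-in for REPLACE($0, '[^0-9,]', ''): drop everything except digits and commas.
# clean_line and split_fields are hypothetical helper names.
def clean_line(line):
    """Remove quotes, spaces, and any other non-digit, non-comma characters."""
    return re.sub(r'[^0-9,]', '', line)


def split_fields(line):
    """Clean the line, then split it into its comma-separated fields."""
    return clean_line(line).split(',')


print(clean_line('"0,1, 2,3,4 '))
print(split_fields('5, 6,7, 8,9'))
```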

what is the purpose of FLATTEN operator in PIG Latin

A = load 'data' as (x, y);
B = load 'data' as (x, z);
C = cogroup A by x, B by x;
D = foreach C generate flatten(A), flatten(B);
E = group D by A::x;
What exactly do the statements above do, and where would FLATTEN be used in a real-world scenario?
A = load 'input1' USING PigStorage(',') as (x, y);
-- (x,y): (1,2) (1,3) (2,3)
B = load 'input2' USING PigStorage(',') as (x, z);
-- (x,z): (1,4) (1,2) (3,2)
C = cogroup A by x, B by x;
result:
(1,{(1,2),(1,3)},{(1,4),(1,2)})
(2,{(2,3)},{})
(3,{},{(3,2)})
D = foreach C generate group, flatten(A), flatten(B);
When both bags are flattened, the cross product of their tuples is returned for each group. Groups 2 and 3 each have one empty bag, so they produce no rows.
result:
(1,1,2,1,4)
(1,1,2,1,2)
(1,1,3,1,4)
(1,1,3,1,2)
E = group D by A::x;
Here you are grouping by the x column of relation A. Every row of D has A::x = 1, so the result is a single group:
(1,{(1,1,2,1,4),(1,1,2,1,2),(1,1,3,1,4),(1,1,3,1,2)})
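The COGROUP and double-FLATTEN steps can be reproduced in plain Python to see where the cross product comes from. This is a sketch of the semantics only; `cogroup` and `flatten_both` are hypothetical names, and the rows are the sample values used in the answer.

```python
from itertools import product


# Plain-Python model of COGROUP A BY x, B BY x followed by flattening both bags.
# cogroup and flatten_both are hypothetical helper names.
def cogroup(a_rows, b_rows):
    """Return (key, bag_of_A, bag_of_B) for every key seen in either input."""
    keys = sorted({row[0] for row in a_rows} | {row[0] for row in b_rows})
    return [
        (key,
         [row for row in a_rows if row[0] == key],
         [row for row in b_rows if row[0] == key])
        for key in keys
    ]


def flatten_both(cogrouped):
    """Cross the two bags of each group; an empty bag yields no rows."""
    return [
        (key,) + a_row + b_row
        for key, a_bag, b_bag in cogrouped
        for a_row, b_row in product(a_bag, b_bag)
    ]


A = [(1, 2), (1, 3), (2, 3)]
B = [(1, 4), (1, 2), (3, 2)]
for row in flatten_both(cogroup(A, B)):
    print(row)
```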

Select one relations all fields & one or two from other relation on PIG JOIN, how?

A = load '$input1' using PigStorage() AS (a,b,c,d,e);
B = load '$input2' using PigStorage() AS (a,b1,c1,d1,e1);
C = JOIN A by a, B by a;
D = do something;
'D' should be of format (a,b,c,d,e,b1)
How to achieve this?
D = FOREACH C GENERATE A::a .. A::e, B::b1 AS b1;
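As a rough check of the expected shape (a,b,c,d,e,b1), the join and projection can be modelled in plain Python. This is a sketch under the question's schema; `join_and_project` is a hypothetical name and the sample rows are made up for illustration.

```python
# Plain-Python model of JOIN A BY a, B BY a followed by projecting
# all of A's fields plus b1 from B.
# join_and_project and the sample rows are hypothetical.
def join_and_project(a_rows, b_rows):
    """Inner-join on the first field; keep A's fields plus B's second field."""
    return [
        a_row + (b_row[1],)
        for a_row in a_rows
        for b_row in b_rows
        if a_row[0] == b_row[0]
    ]


a_rows = [(1, 'b', 'c', 'd', 'e'), (2, 'bb', 'cc', 'dd', 'ee')]
b_rows = [(1, 'B1', 'C1', 'D1', 'E1'), (3, 'x', 'y', 'z', 'w')]
print(join_and_project(a_rows, b_rows))
```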
