Column to row transformation using Pig Latin - hadoop

A = load 'input.txt';
dump A;
"0,1, 2,3,4
5, 6,7, 8,9
B = foreach A generate FLATTEN(TOBAG(*));
dump B;
("0)
(1)
( 2)
(3)
(4)
(5)
( 6)
(7)
( 8)
(9)
I want to apply some replace and trim operations to each field above. How do I transform the result back to the original row format afterwards?
Expected output:
0,1,2,3,4
5,6,7,8,9

Yes, this is really an experimental question: a rows-to-columns conversion followed by a columns-to-rows conversion.
I think we can achieve this with a little help from the RANK operator.
I tried the code below on the following input.
Input:
0,1,2,3,4
5,6,7,8,9
The Pig script below contains two dump statements:
numbers = LOAD '/home/inputfiles/col_to_row.txt' USING PigStorage() AS (line:chararray);
numbers_rank = RANK numbers;
numbers_each = FOREACH numbers_rank GENERATE $0 AS rank_key, FLATTEN(TOKENIZE(line)) AS each_number;
rows_to_columns = FOREACH numbers_each GENERATE each_number;
dump rows_to_columns; -- gives each number in a separate row
numbers_grp = GROUP numbers_each BY rank_key;
columns_to_rows = FOREACH numbers_grp GENERATE FLATTEN(BagToTuple(numbers_each.each_number));
dump columns_to_rows; -- gives the rows as in the original input data set
Output:
dump rows_to_columns;
(0)
(1)
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)
dump columns_to_rows;
(0,1,2,3,4)
(5,6,7,8,9)
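The question also asks for a replace and trim on each field before the rows are rebuilt. Here is a minimal sketch of where that step could go. It assumes the input path 'input.txt', the relation names shown, and a Pig version whose TOKENIZE accepts a delimiter argument and that ships BagToString; none of these come from the answer above.

```pig
-- Load each line as a single chararray ('input.txt' is an assumed path).
lines = LOAD 'input.txt' USING TextLoader() AS (line:chararray);
-- RANK prepends a row number, which serves as the key for regrouping.
ranked = RANK lines;
-- One output row per comma-separated field, tagged with its row number.
fields = FOREACH ranked GENERATE $0 AS row_id, FLATTEN(TOKENIZE(line, ',')) AS field;
-- Per-field cleanup: drop double quotes, then trim surrounding whitespace.
cleaned = FOREACH fields GENERATE row_id, TRIM(REPLACE(field, '"', '')) AS field;
-- Collect the cleaned fields of each original row and join them with commas.
grouped = GROUP cleaned BY row_id;
rebuilt = FOREACH grouped GENERATE BagToString(cleaned.field, ',');
DUMP rebuilt;
```

Bags in Pig are unordered collections, so the field order inside each rebuilt row is worth verifying on real data if it matters.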

You can do a simple replace with a regex. Because the REPLACE function invokes Java's String.replaceAll(), you can use any Java-compatible regex. Here is a demo:
grunt> A = load 'input.txt' as (f1:chararray);
grunt> DUMP A;
("0,1, 2,3,4 )
(5, 6,7, 8,9)
grunt> B = foreach A generate FLATTEN(TOBAG(*));
grunt> DUMP B;
("0,1, 2,3,4 )
(5, 6,7, 8,9)
grunt> X = FOREACH B GENERATE REPLACE($0, '[^0-9,]', '');
grunt> DUMP X;
(0,1,2,3,4)
(5,6,7,8,9)
grunt> Y = FOREACH X GENERATE FLATTEN(STRSPLIT($0, ','));
grunt> DUMP Y;
(0,1,2,3,4)
(5,6,7,8,9)
grunt> Z = FOREACH Y GENERATE $0;
grunt> DUMP Z;
(0)
(5)

Related

Not getting calculated value with SUM() in pig

My commands are as follows:
Z = LOAD '/..file_path' USING PigStorage(',') AS (name:CHARARRAY,gpa:int,salary:int);
y = GROUP Z BY gpa;
R = FOREACH y GENERATE SUM(Z.salary);
The output of DUMP R; is:
{all,()};
Please guide me.
TIA.
To get a single SUM over the whole relation, use GROUP ALL instead of GROUP BY.
Z = LOAD '/..file_path' USING PigStorage(',') AS (name:CHARARRAY,gpa:int,salary:int);
y = GROUP Z ALL;
R = FOREACH y GENERATE SUM(Z.salary);
DUMP R;
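For comparison, GROUP BY is still the right tool when one total per gpa value is wanted rather than a single grand total. A small sketch reusing the relation names and the (elided) path from the question; the output field names gpa and total_salary are assumptions:

```pig
Z = LOAD '/..file_path' USING PigStorage(',') AS (name:chararray, gpa:int, salary:int);
-- One group per distinct gpa value.
y = GROUP Z BY gpa;
-- 'group' holds the gpa key; SUM runs over the salaries in each group's bag.
R = FOREACH y GENERATE group AS gpa, SUM(Z.salary) AS total_salary;
DUMP R;
```

With GROUP ALL the key is the literal string all, so the result is a single row instead of one row per gpa.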

NOT IN, MATCHES in Pig

I have two relations in Pig, A and B:
DUMP A;
Sandeep Rohan Mohan
DUMP B;
MOHAN
I need the output to be A - B.
Relation C should give me
Sandeep,Rohan
since they are not present in B.
Try this (the LOAD arguments stand in for the paths of the files that hold each relation's data):
A1 = LOAD 'Sandeep Rohan Mohan' USING PigStorage() AS (line:chararray);
B1 = LOAD 'MOHAN' USING PigStorage() AS (line:chararray);
A = FOREACH A1 GENERATE UPPER(line) AS line;
B = FOREACH B1 GENERATE UPPER(line) AS line;
C = COGROUP A BY line, B BY line;
D = FILTER C BY IsEmpty(B);
E = FOREACH D GENERATE group AS name;
DUMP E;
(ROHAN)
(SANDEEP)
Also see set operations in Apache Pig.
I achieved it with a left outer join, keeping only those tuples that had nulls in $1.
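The left outer join approach mentioned above could look roughly like this. It is a sketch only: the file names 'a.txt' and 'b.txt' and the relation names are assumptions, and each file is assumed to hold one name per line.

```pig
-- Assumed inputs: one name per line in each file.
A1 = LOAD 'a.txt' AS (name:chararray);
B1 = LOAD 'b.txt' AS (name:chararray);
-- Normalize case so that 'Mohan' matches 'MOHAN'.
A = FOREACH A1 GENERATE UPPER(name) AS name;
B = FOREACH B1 GENERATE UPPER(name) AS name;
-- Every row of A survives; B's column is null where no match exists.
J = JOIN A BY name LEFT OUTER, B BY name;
-- Keep only the rows of A that found no partner in B.
C = FILTER J BY B::name IS NULL;
D = FOREACH C GENERATE A::name AS name;
DUMP D;
```

Filtering on B::name is the named equivalent of checking $1 for nulls.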

Sum up values in Pig Tuple

I have the following output of Pig tuple:
dump g:
()
(97)
(245)
(870)
(480)
describe g:
g: {long}
I want to sum the numbers above, so I tried this:
h = foreach g generate SUM($0);
I received this error:
Please use an explicit cast.
I then tried casting the value to (int), and it still did not work.
The output I'm looking for is like this:
1692
Here is the code leading up to g:
a = LOAD 'tellers' using TextLoader() AS line;
-- convert a to chararray
b = foreach a generate (chararray)line;
-- run through my UDF to create tuples
c = foreach b generate myudfs.TellerParser5(line); -- ({(20),(5),(5),(10),(1),(1),(1),(1),(1),(5),(10),(10),(10)})....
d = foreach c generate flatten(number);
e = group d by number; -- {group: chararray,d: {(number: chararray)}}
f = foreach e generate group, COUNT(d); -- f: {group: chararray,long}
g = foreach f generate (long)$0 * $1;
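SUM operates on a bag, so first collect all the tuples of g into a single bag with GROUP ALL. Note that aliases are case-sensitive, so the relation must be referenced as g:
H = GROUP g ALL;
I = FOREACH H GENERATE SUM(g.$0);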

Pig: Create new column based off of two other columns

I'm wondering if it's possible to do something like this in Pig:
There are three columns:
A "type1","type2","type3"
B 101 , 159 , 74
I want to define column C as follows:
If A == "type1" then C = B; else C = 0
Is this possible in pig?
Yes, this is possible. You would write it as below:
data = LOAD '$dataSource' using AvroStorage();
-- data = {A, B}
data2 = FOREACH data
    GENERATE
        A,
        B,
        (A == 'type1' ? B : 0) AS C;
dump data2;
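Pig releases from 0.12 onward also accept a CASE expression, which can read more clearly once there are several branches. A sketch of the same rule, assuming such a Pig version and the same {A, B} schema as above:

```pig
data = LOAD '$dataSource' USING AvroStorage();
-- Same rule as the bincond version: C is B for 'type1' rows, otherwise 0.
data2 = FOREACH data GENERATE
    A,
    B,
    (CASE WHEN A == 'type1' THEN B ELSE 0 END) AS C;
DUMP data2;
```

On older Pig versions the bincond form (cond ? x : y) is the only option.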

How to select all fields of one relation and one or two from the other in a Pig JOIN?

A = LOAD '$input1' USING PigStorage() AS (a,b,c,d,e);
B = LOAD '$input2' USING PigStorage() AS (a,b1,c1,d1,e1);
C = JOIN A by a, B by a;
D = do something;
'D' should be of format (a,b,c,d,e,b1)
How to achieve this?
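After the JOIN, fields are disambiguated with the relation name and ::. The second relation's field is named b1, so project it as B::b1:
D = FOREACH C GENERATE A::a .. A::e, B::b1 AS b1;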
