unable to typecast in pig - hadoop

I am getting errors from the AVG function. Can anyone please help with the following script? (Do I need to use a tuple or a bag while loading?) Thanks.
mydata = LOAD 'bigdata.txt' USING PigStorage(',') AS (stn , wban, yearmoda, temp, a , dewp :double, b , slp :double, c, stp :double, d, visib :double, e, wdsp :double, f, mxspd :double, gust :double, max :double, min :double, prcp :double, sndp :double, frshtt);
clean1 = FOREACH mydata GENERATE stn , wban, yearmoda, temp, a , dewp, b , slp, c, stp, d, visib, e, wdsp, f, mxspd, gust, max , min, prcp ,sndp , frshtt;
--clean2 = FILTER clean1 BY (temp == 9999.9);
tmpdata = FOREACH clean1 GENERATE stn, SUBSTRING(yearmoda, 0, 5) as year, temp;
C = GROUP tmpdata BY (year, temp);
avgtemp = FOREACH C GENERATE group, AVG(temp);

You did not assign temp a type when you LOADed your data, so it defaults to bytearray. When Pig calls AVG, it has to pick which version of the function to use (it must behave differently for an int than for a double, for example), and without a declared type it cannot tell how to proceed. Give temp a type in your LOAD statement (temp:double fits decimal readings such as 9999.9) and it should work.
You have also not passed the field to AVG correctly. AVG needs a bag to evaluate, and you construct that bag by projecting the temp field out of the bag of records in C. The schema of C is {group:(year:chararray,temp), tmpdata:{(stn,year:chararray,temp)}}, so you need to compute avgtemp like this:
avgtemp = FOREACH C GENERATE group, AVG(tmpdata.temp);
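To see why AVG needs tmpdata.temp rather than temp, it can help to model what GROUP produces. The following is only a rough Python sketch, not how Pig is implemented: the sample rows and the helper names group_by and avg are made up for illustration. It also groups by year alone, which is an assumption about the intent; grouping by (year, temp) as in the question would make each average equal its own group's temp.

```python
from collections import defaultdict

# Hypothetical sample rows shaped like tmpdata: (stn, year, temp)
tmpdata = [
    ("010010", "2010", 30.0),
    ("010010", "2010", 34.0),
    ("010020", "2011", 20.0),
]

def group_by(rows, key):
    """Mimic GROUP: map each key to the bag (list) of full rows sharing it."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    return dict(groups)

def avg(bag_of_values):
    """Mimic AVG: it consumes a bag of values, not a single field."""
    return sum(bag_of_values) / len(bag_of_values)

# Group by year only (assumption), so each bag can hold several temperatures.
C = group_by(tmpdata, key=lambda row: row[1])

# Projecting row[2] out of each bag plays the role of tmpdata.temp.
avgtemp = {year: avg([row[2] for row in bag]) for year, bag in C.items()}
print(avgtemp)
```

Each value in C is a list of whole rows, which is why the temp column has to be projected out of the bag before it can be averaged.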

How to save intermediate outputs into images on a multi-stage pipeline?

Say I have a computation something like this:
Image resultA, resultB;
Func A, B, C, D, E;
Var x, y;
A(x,y) = C(x,y) * D(x,y);
B(x,y) = C(x,y) - D(x,y);
E(x,y) = abs(A(x,y)/B(x,y));
resultA(x,y) = sqrt(E(x,y));
resultB(x,y) = 2.f * E(x,y) + C(x,y);
How do I define an AOT schedule so that I can save both resultA and resultB?
E(x,y) is common to the computation of resultA and resultB.
Thank you in advance.

If the results are the same size in all dimensions, you can return a Tuple:
result(x, y) = Tuple(resultA(x, y), resultB(x, y));
If they are not the same size, they can be added to a Pipeline and the Pipeline can be compiled to a filter that returns multiple Funcs.
See:
https://github.com/halide/Halide/blob/master/test/correctness/multiple_outputs.cpp

In pig I want to reduce groups to have 1 element with specific types having precedence

In pig, I have columns A, B, C, id, id_type. The possible id_types are "zip," "city," "county," "state," and "country."
I want to keep only one row for each existing combination of A, B, C, giving precedence to the row with id_type "zip"; if there is no "zip" row, then "city"; if there is no "city" row, then "county"; and so on.
So, if I have the following two rows
(a, b, c, 555, city)
(a, b, c, 123, state)
I want to remove the second one. I can group by A, B, C to get
((a, b, c), {(a, b, c, 555, city), (a, b, c, 123, state)})
But I do not know how to remove all of the unwanted elements from $1.

@inquistive_mind: I ran your code with the following input and it does NOT return what the OP asked for.
Input :
(aa,bb,cc,1,zip)
(aa,bb,cc,2,street)
(mmm,nnn,cc,3,county)
(mmm,nnn,cc,4,zip)
(mmm,nnn,cc,5,state)
(lll,ccc,ddd,6,city)
(lll,ccc,xxx,7,country)
Output after running your code :
((aa,bb,cc),{(2,country),(1,zip)})
((lll,ccc,ddd),{(6,city)})
((lll,ccc,xxx),{(7,country)})
((mmm,nnn,cc),{(5,state),(4,zip),(3,county)})
As you can see, it does not keep only one entry per group according to the priority of id_type.
I solved this using a Python UDF. If there is a better way, please let me know.
Python code, saved as priority.py:
# Lower rank means higher precedence.
PRIORITY = {'zip': 0, 'city': 1, 'county': 2, 'state': 3, 'country': 4}

def unique_list(bag):
    # bag holds the (A, B, C, ID, ID_TYPE) tuples of one group.
    best = None
    for row in bag:
        rank = PRIORITY.get(row[4])
        if rank is None:
            continue  # skip rows whose id_type is not recognised
        if best is None or rank < PRIORITY[best[4]]:
            best = row
    # Return a one-tuple bag to match the schema declared in the Pig script
    # (an empty bag if no row had a recognised id_type).
    if best is None:
        return []
    return [tuple(best)]
Now the Pig code:
REGISTER 'priority.py' using jython as callme;
A = LOAD 'addr.dat' USING PigStorage(',') AS (A : chararray, B :chararray , C: chararray , ID : chararray, ID_TYPE : chararray);
B = DISTINCT A;
Z = GROUP B BY (A,B,C);
O = FOREACH Z GENERATE callme.unique_list($1) as record :{(A : chararray, B :chararray , C: chararray , ID : chararray, ID_TYPE : chararray)} ;
DUMP O;
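Please run this against your input and check whether it works.

One way is to write a UDF. Another way is to map each id_type to a numeric rank, order each group by that rank, and keep only the first row: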
ABC = load 'testdata.csv' using PigStorage(',') as (a: chararray, b: chararray, c: chararray, id: int, id_type: chararray);
MappedABC = foreach ABC generate a, b, c, id, id_type, (id_type == 'zip' ? 1 : (id_type == 'city' ? 2 : (id_type == 'county' ? 3 : (id_type == 'state' ? 4 : 5)))) as idorder;
FinalABC = foreach (group MappedABC by (a,b,c)) {
    OrderedABC = order MappedABC by idorder;
    LimitedABC = limit OrderedABC 1;
    generate flatten(LimitedABC);
};
store FinalABC into 'out' using PigStorage(';');
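The rank-then-pick-first idea is independent of Pig and can be sketched in plain Python. This is an illustrative sketch under assumptions: RANK, keep_best, and the sample rows are hypothetical names and data, and unknown id_types fall back to rank 5 to mirror the final else branch of the nested ternary.

```python
from collections import defaultdict

# Rank of each id_type; lower means higher precedence (same idea as idorder).
RANK = {"zip": 1, "city": 2, "county": 3, "state": 4, "country": 5}

def keep_best(rows):
    """Keep one row per (a, b, c): the one whose id_type ranks first."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[:3]].append(row)
    # Unknown id_types get rank 5, like the last branch of the ternary chain.
    return [min(bag, key=lambda r: RANK.get(r[4], 5)) for bag in groups.values()]

# Hypothetical sample rows: (a, b, c, id, id_type)
rows = [
    ("a", "b", "c", 555, "city"),
    ("a", "b", "c", 123, "state"),
    ("m", "n", "c", 3, "county"),
    ("m", "n", "c", 4, "zip"),
]
result = keep_best(rows)
print(result)
```

Taking the minimum by rank avoids sorting the whole group, which is roughly what ORDER followed by LIMIT 1 achieves in the Pig script.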

What is the purpose of the FLATTEN operator in Pig Latin?

A = load 'data' as (x, y);
B = load 'data' as (x, z);
C = cogroup A by x, B by x;
D = foreach C generate flatten(A), flatten(B);
E = group D by A::x;
What exactly is done in the above statements, and where would we use FLATTEN in a real-world scenario?

A = load 'input1' USING PigStorage(',') as (x, y);
(x,y) --> (1,2)(1,3)(2,3)
B = load 'input2' USING PigStorage(',') as (x, z);
(x,z) --> (1,4)(1,2)(3,2)
C = cogroup A by x, B by x;
result:
(1,{(1,2),(1,3)},{(1,4),(1,2)})
(2,{(2,3)},{})
(3,{},{(3,2)})
D = foreach C generate group, flatten(A), flatten(B);
When both bags are flattened, the cross product of their tuples is returned. Groups 2 and 3 each contain one empty bag, so they produce no rows.
result:
(1,1,2,1,4)
(1,1,2,1,2)
(1,1,3,1,4)
(1,1,3,1,2)
E = group D by A::x;
Here you are grouping D by the x column that came from relation A. All four rows have A::x = 1, so they end up in a single group.
result:
(1,{(1,1,2,1,4),(1,1,2,1,2),(1,1,3,1,4),(1,1,3,1,2)})
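The cross-product behaviour of flattening two bags can be reproduced in plain Python with itertools.product. This is only a sketch of the semantics, not Pig's implementation; flatten_both is a hypothetical helper name, and the cogroup result is typed in by hand from the example.

```python
from itertools import product

# Cogroup result from the example: (group, bag A, bag B)
C = [
    (1, [(1, 2), (1, 3)], [(1, 4), (1, 2)]),
    (2, [(2, 3)], []),
    (3, [], [(3, 2)]),
]

def flatten_both(cogrouped):
    """Mimic `generate group, flatten(A), flatten(B)` on cogrouped records."""
    rows = []
    for group, bag_a, bag_b in cogrouped:
        # product() yields nothing when either bag is empty,
        # which is why groups 2 and 3 contribute no rows.
        for a, b in product(bag_a, bag_b):
            rows.append((group,) + a + b)
    return rows

D = flatten_both(C)
for row in D:
    print(row)
```

A typical use of FLATTEN is to turn grouped data back into flat rows after a GROUP or COGROUP, for example to obtain join-like output.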

Pig: Create new column based off of two other columns

I'm wondering if it's possible to do something like this in Pig.
There are two existing columns:
A: "type1", "type2", "type3"
B: 101, 159, 74
I want to define a third column C as follows:
If A == "type1" then C = B; else C = 0
Is this possible in Pig?

Yes, this is possible. You would write it as below:
data = LOAD '$dataSource' using AvroStorage();
-- data = {A, B}
data2 = FOREACH data
    GENERATE
        A,
        B,
        (A == 'type1' ? B : 0) AS C;
dump data2;
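For comparison, the same conditional column can be expressed in Python. This is an illustrative sketch: add_c is a hypothetical helper name, and the rows are taken from the example values in the question.

```python
def add_c(rows):
    """Append C to each (A, B) row: C = B when A == 'type1', else 0."""
    return [(a, b, b if a == "type1" else 0) for a, b in rows]

data = [("type1", 101), ("type2", 159), ("type3", 74)]
data2 = add_c(data)
print(data2)
```

The conditional expression `b if a == "type1" else 0` plays the same role as Pig's bincond operator `? :`.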

How to select all fields of one relation and one or two fields of the other relation in a Pig JOIN?

A = load '$input1' using PigStorage() AS (a,b,c,d,e);
B = load '$input2' using PigStorage() AS (a,b1,c1,d1,e1);
C = JOIN A by a, B by a;
D = do something;
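D should have the format (a,b,c,d,e,b1).
How can I achieve this?

After the JOIN, each field is addressed with its relation prefix, so project A's fields as a range and add b1 from B:
D = FOREACH C GENERATE A::a .. A::e, B::b1 AS b1;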
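The shape of the desired output can be checked with a small Python model of the join. This is a sketch under assumptions: join_and_project is a hypothetical helper, the sample tuples are invented, and a nested loop stands in for Pig's join strategy.

```python
def join_and_project(rel_a, rel_b):
    """Inner join on the first field; keep all of A's fields plus B's b1."""
    out = []
    for ra in rel_a:
        for rb in rel_b:
            if ra[0] == rb[0]:
                # ra is (a, b, c, d, e); rb[1] is b1 from the second relation
                out.append(ra + (rb[1],))
    return out

# Hypothetical sample data: A is (a, b, c, d, e), B is (a, b1, c1, d1, e1)
A = [(1, "b", "c", "d", "e"), (2, "p", "q", "r", "s")]
B = [(1, "B1", "C1", "D1", "E1")]
D = join_and_project(A, B)
print(D)
```

Only the row with a matching key survives, and it carries A's five fields followed by b1.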
