My data is something like this:
(201601030637,2,64.001213)
(201601030756,3,63.5869656667)
(201601040220,2,62.758471)
which the first column is year (2016) month (01) day (03) hour (06) and minutes (37) connected to each other.
I want to sum the values of third column based on the week. How can I group them to have 52 different groups for entire year? Can anyone help?
Thanks!
Use ToDate to convert your datestring to datetime type. Then use GetWeek to get weeknumber. Finally use GROUP to group by weeknum and SUM.
A = LOAD '/path_to_data/data' USING PigStorage(',') as (c1: chararray, c2: int, c3: float);
B = FOREACH A GENERATE GetWeek(ToDate(c1,'yyyyMMddHHmm')) as weeknum, c1, c2, c3;
C = FOREACH (GROUP B BY weeknum) GENERATE group as weeknum, SUM(B.c2) as c2_sum;
DUMP C;
Use GetWeek and create a new column from the first column.Then group by the new column and use SUM.Assuming you have loaded the data to a relation A.
B = FOREACH A GENERATE A.$0,A.$1,A.$2,GetWeek(A.$0) as week_of_year;
C = GROUP B BY (B.$4);
D = FOREACH C GENERATE group,SUM(B.$2);
DUMP D;
Related
I am loading two datasets A, B
A= LOAD [datapath]
B= LOAD [datapath]
I want to JOIN all fields of both A and B by id field.Both A and B have common field id and other fields. When I perform JOIN by id:
AB= JOIN A by id, B by id;
The resulted dataset AB includes two similar columns for the field id, However, it only must show only one column for the id field. What am I doing wrong here?
That's the expected behaviour, when joining two datasets, all columns are included (even those ones which you are joining by)
You can check it here
If you want to drop a column you can do it with the generate statement. But first you ned to know the position of the undesired column.
If that column is, for instance, in the 3th position
C = FOREACH AB GENERATE $1,$2, $4, $5...;
Edit from the comments
You can also use a generate statement without knowing position. Example:
C = FOREACH AB GENERATE A::id AS id, A::foo AS foo, B::bar AS bar;
* hwo to combine these two tables and check the id's which are greater than 1600 for NDAKOTA regions*
1 alaska robert
2 boston lilly
3 NDakota Michael
4 NDakota Will
5 NDakota Mark
1A 1 09/09/2012 1200
2A 2 8/9/2016 3400
3B 3 4/5/2016 2300
customers = LOAD '/home/vis/Documents/customers' using PigStorage(' ') AS(cust_id:int,region:chararray,name:chararray);
sales = LOAD '/home/vis/Documents/sales' using PigStorage(' ')
AS(sales_id:int,cust_id:int,date:datetime,amount:int);
salesNA = FILTER customers BY region =='NDakota';
joined = JOIN sales BY cust_id,salesNA BY cust_id;
grouped = GROUP joined BY cust_id;
summed= FOREACH grouped GENERATE GROUP,SUM(sales.amount);
bigSpenders= FILTER summed BY 1$>1600;
DUMP sorted;
recieving errors as
From Apache Pig Docs
Use the disambiguate operator ( :: ) to identify field names after
JOIN, COGROUP, CROSS, or FLATTEN operators.
Below code snippet should be suffice to achieve the objective, let me know if you see any issues.
customers = LOAD 'customers.txt' using PigStorage(' ') AS(cust_id:int,region:chararray,name:chararray);
sales = LOAD 'sales.txt' using PigStorage(' ') AS(sales_id:chararray,cust_id:int,date:chararray,amount:int);
custNA = FILTER customers BY region =='NDakota';
joined = JOIN sales BY cust_id,custNA BY cust_id;
req_data = FILTER joined BY amount > 1600;
DUMP req_data;
I need to group below data on fname and lastname.
(fname,lname,id)
abc,xyz,I
abc,xyz,N
ppp,xxx,I
ppp,XXX,I
in id field i am expecting only 2 values i.e N or I so if I get both N and I for same fname,lname combination then I should use id as N else need to use value for id field as it is given in the group.
I am expecting below results:
abc,xyz,N
ppp,xxx,I
I have tried below code and its working fine
in =load '/testing/name.txt' USING PigStorage(',') as (fname:chararray,lname:chararray,id:chararray);
grp = group in by (fname,lname);
z = foreach grp generate FLATTEN(group) AS (fname,lname),(COUNT(in.id) >1 ? ('N') :BagToTuple(in.id))as id;
However now I need to check the values of id field instead of counts:
z = foreach grp generate FLATTEN(group) AS (fname,lname),((in.id == 'N' or in.id == 'I') ? ('N') :BagToTuple(in.id))as id;
however its giving below error:
(Name: Equal Type: null Uid: null)incompatible types in Equal Operator left hand side:bag :tuple(id:chararray) right hand side:chararray
however its giving below error:
Two inputs of BinCond must have compatible schemas. left hand side: #31:tuple(#32:chararray) right hand side: org.apache.pig.builtin.bagtotuple_3#35:tuple(id#36:int)
Please guide
You are loading a field that contains chars i.e. N,I into int column? Change the load statement where id column type is chararray.
in =load '/testing/name.txt' USING PigStorage(',') as (fname:chararray,lname:chararray,id:chararray);
grp = group in by (fname,lname);
z = foreach grp generate FLATTEN(group) AS (fname,lname),(COUNT(in.id) > 1 && in.id matches 'N') ? ('N') : in.id;
I have two files in hdfs containing data as follows, File1:
id,name,age
1,x1,15
2,x2,14
3,x3,16
File2:
id,name,grades
1,x1,A
2,x2,B
4,y1,A
5,y2,C
I want to produce the following output :
id,name,age,grades
1,x1,15,A
2,x2,14,B
3,x3,16,
4,y1,,A
5,y2,,C
I am using Apache pig to perform the operation, is it possible to get the above output in pig. This is kind of Union and Join both.
As you can do unions and joins in pig this is of course possible.
Without digging into the exact syntax, I can tell you this should work (have used similar solutions in the past).
Suppose we have A and B.
Take the first two columns of A and B to be A2 and B2
Union A2 and B2 into M2
Distinct M2
Now you have your 'index' matrix, and we just need to add the extra columns.
Left join M2 with A and B
Generate the relevant columns
Thats it!
A = load 'pdemo/File1' using PigStorage(',') as(id:int,name:chararray,age:chararray);
B = load 'pdemo/File2' using PigStorage(',') as(id:int,name:chararray,grades:chararray);
lj = join A by id left outer,B by id;
rj = join A by id right outer,B by id;
lj1 = foreach lj generate A::id as id,A::name as name,A::age as age,B::grades as grades;
rj1 = foreach rj generate B::id as id,B::name as name,A::age as age,B::grades as grades;
res = union lj1,rj1;
FinalResult = distinct res;
2nd approach is better according to performance
A1 = foreach A generate id,name;
B1 = foreach B generate id,name;
M2 = union A1,B1;
M2 = distinct M2;
M2A = JOIN M2 by id left outer,A by id;
M2AB = JOIN M2A by M2::id left outer, B by id;
Res = foreach M2AB generate M2A::M2::id as id,M2A::M2::name as name,M2A::A::age as age,B::grades as grades;
Hope this will help!!
u1 = load 'PigDir/u1' using PigStorage(',') as (id:int,name:chararray,age:int);
u2 = load 'PigDir/u2' using PigStorage(',') as (id:int, name:chararray,grades:chararray);
uj = join u2 by id full outer,u1 by id;
uif = foreach uj generate ($0 is null ?$3:$0) as id,($1 is null ? $4 : $1) as name,$5 as age,$2 as grades;
I have two tables:
A: (feature:chararray, value:float)
B:(multiplier:charray, value:float)
where A is a table with thousands of rows and B has only one row.
What I wanna do is take all the rows in A and multiply A.value by B.value.
e.g.
A:[('f1', 1.5) , ('f2', 2.3)]
B:[('mul', 2)]
I'd like to product a table C
C: [('f1', 3), ('f2', 4.6)]
Is there an easy way to do so?
You can do a CROSS and a FOREACH ... GENERATE.
X = A CROSS B;
Y = FOREACH X GENERATE A::feature, A::value * B::value;
The above code has not been tested.
If You are very sure that the 2nd table has only one row then take the first column
of 2nd table and hardcode the same value as last column in 1st table and then
do the inner join and the you can easily multiply
Let say first file as plain.txt
(f1,1.5)
(f2,2)
here is the second file as multi.txt
(mul,2)
A = load '/user/cloudera/inputfiles/plain.txt' USING PigStorage(',') AS(feature:chararray,value:double);
B = load '/user/cloudera/inputfiles/multi.txt' USING PigStorage(',') AS(operation:chararray,no:int);
C = foreach A generate feature,value,'mul' as ope;
D = join C by ope, B by operation;
E = foreach D generate feature,(value*no) as multiplied_value;