Pretty new to Pig , I have a dataset which consists of Olympics data
for 4-5 years. I am trying to generate highest and lowest medal
winning countries split by every year. Hers's a sample with header.
ATHLETE,COUNTRY,YEAR, SPORT,GOLD,SILVER,BRONZE,TOTAL
Yang Yilin,China,2008,Gymnastics,1,0,2,3
Leisel Jones,Australia,2000,Swimming,0,2,0,2
Go Gi-Hyeon,South Korea,2002,Short-Track Speed Skating,1,1,0,2
Chen Ruolin,China,2008,Diving,2,0,0,2
Katie Ledecky,United States,2012,Swimming,1,0,0,1
Ruta Meilutyte,Lithuania,2012,Swimming,1,0,0,1
Dániel Gyurta,Hungary,2004,Swimming,0,1,0,1
Arianna Fontana,Italy,2006,Short-Track Speed Skating,0,0,1,1
Olga Glatskikh,Russia,2004,Rhythmic Gymnastics,1,0,0,1
Kharikleia Pantazi,Greece,2000,Rhythmic Gymnastics,0,0,1,1
I tried my options as per my knowledge to get this , but with little
sucess.
This is what i have now. Any help on solving this will be
appreciated !
DEFINE MYOVER org.apache.pig.piggybank.evaluation.Over;
DEFINE MYSTITCH org.apache.pig.piggybank.evaluation.Stitch;
A = LOAD 'MortDataSite/MyPigExercise/OlympicMedals.csv' using PigStorage(',') as (ATHLETE:CHARARRAY,COUNTRY:CHARARRAY,YEAR:INT,SPORT:CHARARRAY,GOLD:INT,SILVER:INT,BRONZE:INT,TOTAL:INT);
B = FOREACH A GENERATE YEAR,COUNTRY,TOTAL;
C = GROUP B BY (YEAR,COUNTRY);
D = FOREACH C GENERATE FLATTEN(group) as (YEAR,COUNTRY) ,SUM(B.TOTAL);
E = GROUP D BY (YEAR,COUNTRY);
F = FOREACH E {
E1 = ORDER D BY TOT DESC;
GENERATE FLATTEN(MYSTITCH(E1, MYOVER(E1,'dense_rank',0,1,1)));
};
G = FOREACH F GENERATE stitched::YEAR,stitched::COUNTRY ,stitched::TOT,$3;
MyOutput : ( Considering there are many nations with same TOTAL Medals
, I expect more than one country may share one RANK )
(2000,Cuba,65,1)
(2000,Iran,4,1)
(2000,Chile,17,1)
(2000,China,79,1)
(2000,India,7,1)
(2000,Italy,65,1)
(2000,Japan,42,1)
(2000,Kenya,7,1)
(2000,Qatar,1,1)
(2000,Spain,42,1)
(2000,Brazil,48,1)
Expected Ouput : 1
YEAR COUNTRY MAX(TOTAL)
2001 India 50
2003 UK 90
2006 Japan 56
&
Expected Ouput : 2
YEAR COUNTRY MIN(TOTAL)
2001 India 5
2003 UK 10
2006 Japan 6
********* Updated Query ( Working Well as expected ) ****
Here's the updated query which gave me my desired result.
DEFINE MYOVER org.apache.pig.piggybank.evaluation.Over;
DEFINE MYSTITCH org.apache.pig.piggybank.evaluation.Stitch;
A = LOAD 'MortDataSite/MyPigExercise/OlympicMedals.csv' using PigStorage(',') as (ATHLETE:CHARARRAY,COUNTRY:CHARARRAY,YEAR:INT,SPORT:CHARARRAY,GOLD:INT,SILVER:INT,BRONZE:INT,TOTAL:INT);
B = FOREACH A GENERATE YEAR,COUNTRY,TOTAL;
C = GROUP B BY (YEAR,COUNTRY);
D = FOREACH C GENERATE FLATTEN(group) as (YEAR,COUNTRY) ,SUM(B.TOTAL);
E = GROUP D BY (YEAR,COUNTRY);
F = FOREACH E GENERATE FLATTEN(group) as (YEAR,COUNTRY) ,MAX(D.TOT) as MTOT;
G = GROUP F BY YEAR;
H = FOREACH G {
G1 = ORDER F BY MTOT DESC;
GENERATE FLATTEN(MYSTITCH(G1, MYOVER(G1,'dense_rank',0,1,1)));
};
J = FOREACH H GENERATE stitched::YEAR,stitched::COUNTRY ,stitched::MTOT,$3;
**Ouput : **
YEAR COUNTRY MAX(TOTAL).RANKING
(2000,United States,242,1)
(2000,Russia,187,2)
(2000,Australia,182,3)
(2002,United States,84,1)
(2002,Canada,74,2)
(2002,Germany,61,3)
(2004,United States,265,1)
(2004,Russia,190,2)
(2004,Australia,156,3)
If you would like to get the MAX and MIN total medals by country by year,just use MAX and MIN.
B = FOREACH A GENERATE YEAR,COUNTRY,TOTAL;
C = GROUP B BY (YEAR,COUNTRY);
D = FOREACH C GENERATE FLATTEN(group) as (YEAR,COUNTRY) ,SUM(B.TOTAL) as TOTAL;
E = GROUP D BY (YEAR,COUNTRY);
F = FOREACH E GENERATE group as (YEAR,COUNTRY),MAX(D.TOTAL);
G = FOREACH E GENERATE group as (YEAR,COUNTRY),MIN(D.TOTAL);
DUMP F;
DUMP G;
Can you please suggest on below file matching logic and removing duplicate entries using Pig -
1) Removing duplicate entries based on key RoleId-
InputFile1
--------------
RoleId Name
1 A
2 B
3 C
2 D
5 E
5 F
7 G
OutpufFile1 (Only unique records)
RoleId Name
1 A
3 C
7 G
OutpufFile2 (Capture duplicate records)
RoleId Name
2 B
2 D
5 E
5 F
2) File Matching key is RoleId -
InputFile1 InputFile2
----------- ----------
RoleId Name RoleId Age
1 A 1 20
2 B 2 21
3 C 1 22
4 D 2 23
5 E 3 24
7 25
OutpufFile1 (Matching records) OutputFile2 (Un-matching from 1st)
-------------------- -----------
RoleId Name Age RoleId Name
1 A 20, 22 4 D
2 B 21, 23 5 E
3 C 24
Thanks,
Can you try the below approach?
Problem1 Solution:
input
1 A
2 B
3 C
2 D
5 E
5 F
7 G
PigScript:
A = LOAD 'in.txt' USING PigStorage() AS(RoleId:int,Name:chararray);
B = GROUP A BY RoleId;
C = FOREACH B GENERATE FLATTEN($1) AS(RoleId,Name),COUNT(A) AS cnt;
SPLIT C INTO Distval IF (cnt==1), NonDistVal IF (cnt>=2);
D = FOREACH Distval GENERATE RoleId,Name;
STORE D INTO 'DistFile' USING PigStorage();
E = FOREACH NonDistVal GENERATE RoleId,Name;
STORE E INTO 'NonDistFile' USING PigStorage();
Output:
cat DistFile/part-r-00000
1 A
3 C
7 G
cat NonDistFile/part-r-00000
2 B
2 D
5 E
5 F
Problem2 Solution:
InputFile1
1 A
2 B
3 C
4 D
5 E
InputFile2
1 20
2 21
1 22
2 23
3 24
7 25
PigScript:
A = LOAD 'InputFile1' USING PigStorage() AS(RoleId:long, Name:chararray);
B = LOAD 'InputFile2' USING PigStorage() AS(RoleId:long, Age:int);
C = COGROUP A BY RoleId ,B BY RoleId;
D = FILTER C BY NOT IsEmpty(A);
SPLIT D INTO RoleMatch IF NOT IsEmpty(B),NoRoleMatch IF IsEmpty(B);
E = FOREACH RoleMatch GENERATE FLATTEN($1),BagToTuple(B.Age);
STORE E INTO 'RoleMatchFile' USING PigStorage();
F = FOREACH NoRoleMatch GENERATE FLATTEN($1);
STORE F INTO 'NoRoleMatchFile' USING PigStorage();
Output:
cat RoleMatchFile/part-r-00000
1 A (20,22)
2 B (21,23)
3 C (24)
cat NoRoleMatchFile/part-r-00000
4 D
5 E
I have the following dataset in which I need to merge multiple rows into one if they have the same key. At the same time, I need to pick among the multiple tuples which gets grouped.
1 N1 1 10
1 N1 2 15
2 N1 1 10
3 N1 1 10
3 N1 2 15
4 N2 1 10
5 N3 1 10
5 N3 2 20
For example
A = LOAD 'data.txt' AS (f1:int, f2:chararray, f3:int, f4:int);
G = GROUP A BY (f1, f2);
DUMP G;
((1,N1),{(1,N1,1,10),(1,N1,2,15)})
((2,N1),{(2,N1,1,10)})
((3,N1),{(3,N1,1,10),(3,N1,2,15)})
((4,N2),{(4,N2,1,10)})
((5,N3),{(5,N3,1,10),(5,N3,2,20)})
Now, I want to pick if there are multiple tuples in collected bag, I want to filter only those which have f3==2. Here is the final data which I want:
((1,N1),{(1,N1,2,15)}) -- f3==2, f3==1 is removed from this set
((2,N1),{(2,N1,1,10)})
((3,N1),{(3,N1,2,15)}) -- f3==2, f3==1 is removed from this bag
((4,N2),{(4,N2,1,10)})
((5,N3),{(5,N3,2,10)})
Any idea how to achieve this?
I did with my way as specified in the comment above. Here is how I did it.
A = LOAD 'group.txt' USING PigStorage(',') AS (f1:int, f2:chararray, f3:int, f4:int);
G = GROUP A BY (f1, f2);
CNT = FOREACH G GENERATE group, COUNT($1) AS cnt, $1;
SPLIT CNT INTO
CNT1 IF (cnt > 1),
CNT2 IF (cnt == 1);
M1 = FOREACH CNT1 {
row = FILTER $2 BY (f3 == 2);
GENERATE FLATTEN(row);
};
M2 = FOREACH CNT2 GENERATE FLATTEN($2);
O = UNION M1, M2;
DUMP O;
(2,N1,1,10)
(4,N2,1,10)
(1,N1,2,15)
(3,N1,2,15)
(5,N3,2,20)
I have 2 DataTables that look like this:
DataTable 1:
cheie_primara cheie_secundara judet localitate
1 11 A
2 22 B
3 33 C
4 44 D
5 55 A
6 66 B
7 77 C
8 88 D
9 99 A
DataTable 2:
ID_CP BAN JUDET LOCALITATE ADRESA
1 11 A aa random
2 22 B ss random
3 33 C ee random
4 44 D xx random
5 55 A rr random
6 66 B aa random
7 77 C ss random
8 88 D ee random
9 99 A xx random
and I want to update DataTable 1 with the field["LOCALITATE"] using the maching key DataTable1["cheie_primara"] and DataTable2["ID_CP"].
Like this:
cheie_primara cheie_secundara judet localitate
1 11 A aa
2 22 B ss
3 33 C ee
4 44 D xx
5 55 A rr
6 66 B aa
7 77 C ss
8 88 D ee
9 99 A xx
Is there a LINQ methode to update DataTable1 ?
Thanks!
This is working:
DataTable1.AsEnumerable()
.Join( DataTable2.AsEnumerable(),
dt1_Row => dt1_Row.ItemArray[0],
dt2_Row => dt2_Row.ItemArray[0],
(dt1_Row, dt2_Row) => new { dt1_Row, dt2_Row })
.ToList()
.ForEach(o =>
o.dt1_Row.SetField(3, o.dt2_Row.ItemArray[3]));
If you want to use Linq, here's how I'd go about it;
var a = (from d1 in DataTable1
join d2 in DataTable2 on d1.cheie_primara equals d2.ID_CP
select new {d1, d2.LOCALITATE}).ToList();
a.ForEach(b => b.d1.localitate = b.LOCALITATE);