Access schema value in pig - hadoop

Dataset - Contains PostId and userID
20 1
21 2
45 3
85 1
48 1
98 1
74 1
96 2
63 2
33 3
44 3
55 3
66 3
77 3
I want to find the userID with the maximum number of posts.
PIG code
A = load '/home/cloudera/Desktop/post.txt' as (postid:chararray, userid:chararray);
B = load '/home/cloudera/Desktop/user.txt' as (name:chararray, id:chararray);
C = group A by userid;
D = foreach C generate group,COUNT(A.postid) as count;
E = order D by count DESC;
F = limit E 1;
It gives output -
(3,6)
Now, what Pig statement would get the user name from user.txt whose id matches the userid left in F after these statements run?

Add another statement to get the first column from relation F
G = FOREACH F GENERATE $0;
DUMP G;
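To go one step further and look up the name in user.txt, one option is to join F with B on the id. This is only a sketch, assuming F holds the single (userid, count) row shown in the question and B is loaded as above; the aliases J and K are illustrative names, not from the original post.

```pig
-- F holds one row: (userid, count); B holds (name, id) from user.txt
J = JOIN F BY $0, B BY id;       -- $0 is the grouped userid in F
K = FOREACH J GENERATE B::name;  -- keep only the matching user's name
DUMP K;
```

The join key is referenced by position ($0) so the sketch does not depend on how the grouped column is named in F.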

Use the SQL statements below to get the desired output:
declare @var int = (select max(cn) from (select count(post) cn from temp group by userid) c)
select * from (select userid,count(post) as pso from temp group by userid ) as c where pso = @var

Related

PIG - Get Highest & Lowest Medal Winning Nations , GROUPed by Year

I'm pretty new to Pig. I have a dataset of Olympics data covering
4-5 years, and I am trying to find the highest and lowest medal-winning
countries for each year. Here's a sample with the header.
ATHLETE,COUNTRY,YEAR, SPORT,GOLD,SILVER,BRONZE,TOTAL
Yang Yilin,China,2008,Gymnastics,1,0,2,3
Leisel Jones,Australia,2000,Swimming,0,2,0,2
Go Gi-Hyeon,South Korea,2002,Short-Track Speed Skating,1,1,0,2
Chen Ruolin,China,2008,Diving,2,0,0,2
Katie Ledecky,United States,2012,Swimming,1,0,0,1
Ruta Meilutyte,Lithuania,2012,Swimming,1,0,0,1
Dániel Gyurta,Hungary,2004,Swimming,0,1,0,1
Arianna Fontana,Italy,2006,Short-Track Speed Skating,0,0,1,1
Olga Glatskikh,Russia,2004,Rhythmic Gymnastics,1,0,0,1
Kharikleia Pantazi,Greece,2000,Rhythmic Gymnastics,0,0,1,1
I tried the options I know of to get this, but with little
success.
This is what I have now. Any help on solving this will be
appreciated!
DEFINE MYOVER org.apache.pig.piggybank.evaluation.Over;
DEFINE MYSTITCH org.apache.pig.piggybank.evaluation.Stitch;
A = LOAD 'MortDataSite/MyPigExercise/OlympicMedals.csv' using PigStorage(',') as (ATHLETE:CHARARRAY,COUNTRY:CHARARRAY,YEAR:INT,SPORT:CHARARRAY,GOLD:INT,SILVER:INT,BRONZE:INT,TOTAL:INT);
B = FOREACH A GENERATE YEAR,COUNTRY,TOTAL;
C = GROUP B BY (YEAR,COUNTRY);
D = FOREACH C GENERATE FLATTEN(group) as (YEAR,COUNTRY) ,SUM(B.TOTAL) as TOT;
E = GROUP D BY (YEAR,COUNTRY);
F = FOREACH E {
E1 = ORDER D BY TOT DESC;
GENERATE FLATTEN(MYSTITCH(E1, MYOVER(E1,'dense_rank',0,1,1)));
};
G = FOREACH F GENERATE stitched::YEAR,stitched::COUNTRY ,stitched::TOT,$3;
My output (since many nations have the same TOTAL medals,
I expect more than one country may share a RANK):
(2000,Cuba,65,1)
(2000,Iran,4,1)
(2000,Chile,17,1)
(2000,China,79,1)
(2000,India,7,1)
(2000,Italy,65,1)
(2000,Japan,42,1)
(2000,Kenya,7,1)
(2000,Qatar,1,1)
(2000,Spain,42,1)
(2000,Brazil,48,1)
Expected Output 1:
YEAR COUNTRY MAX(TOTAL)
2001 India 50
2003 UK 90
2006 Japan 56
and
Expected Output 2:
YEAR COUNTRY MIN(TOTAL)
2001 India 5
2003 UK 10
2006 Japan 6
********* Updated Query ( Working Well as expected ) ****
Here's the updated query which gave me my desired result.
DEFINE MYOVER org.apache.pig.piggybank.evaluation.Over;
DEFINE MYSTITCH org.apache.pig.piggybank.evaluation.Stitch;
A = LOAD 'MortDataSite/MyPigExercise/OlympicMedals.csv' using PigStorage(',') as (ATHLETE:CHARARRAY,COUNTRY:CHARARRAY,YEAR:INT,SPORT:CHARARRAY,GOLD:INT,SILVER:INT,BRONZE:INT,TOTAL:INT);
B = FOREACH A GENERATE YEAR,COUNTRY,TOTAL;
C = GROUP B BY (YEAR,COUNTRY);
D = FOREACH C GENERATE FLATTEN(group) as (YEAR,COUNTRY) ,SUM(B.TOTAL) as TOT;
E = GROUP D BY (YEAR,COUNTRY);
F = FOREACH E GENERATE FLATTEN(group) as (YEAR,COUNTRY) ,MAX(D.TOT) as MTOT;
G = GROUP F BY YEAR;
H = FOREACH G {
G1 = ORDER F BY MTOT DESC;
GENERATE FLATTEN(MYSTITCH(G1, MYOVER(G1,'dense_rank',0,1,1)));
};
J = FOREACH H GENERATE stitched::YEAR,stitched::COUNTRY ,stitched::MTOT,$3;
Output:
YEAR COUNTRY MAX(TOTAL) RANKING
(2000,United States,242,1)
(2000,Russia,187,2)
(2000,Australia,182,3)
(2002,United States,84,1)
(2002,Canada,74,2)
(2002,Germany,61,3)
(2004,United States,265,1)
(2004,Russia,190,2)
(2004,Australia,156,3)
If you would like to get the MAX and MIN total medals by country by year, just use MAX and MIN.
B = FOREACH A GENERATE YEAR,COUNTRY,TOTAL;
C = GROUP B BY (YEAR,COUNTRY);
D = FOREACH C GENERATE FLATTEN(group) as (YEAR,COUNTRY) ,SUM(B.TOTAL) as TOTAL;
E = GROUP D BY (YEAR,COUNTRY);
F = FOREACH E GENERATE group as (YEAR,COUNTRY),MAX(D.TOTAL);
G = FOREACH E GENERATE group as (YEAR,COUNTRY),MIN(D.TOTAL);
DUMP F;
DUMP G;
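If the goal is a single highest (or lowest) country per year, as in the expected output of the question, a nested ORDER and LIMIT over a per-year group is one way to sketch it. This assumes D holds (YEAR, COUNTRY, TOTAL) as produced by the statements above; the aliases Y, S, T and HIGHEST are illustrative names only.

```pig
-- D holds (YEAR, COUNTRY, TOTAL), one row per country per year
Y = GROUP D BY YEAR;
HIGHEST = FOREACH Y {
    S = ORDER D BY TOTAL DESC;  -- use ASC instead to get the lowest total
    T = LIMIT S 1;              -- keeps a single row even when totals tie
    GENERATE FLATTEN(T);
};
DUMP HIGHEST;
```

Because LIMIT keeps only one row, countries tied for the top total in a year are not all returned; the dense_rank approach in the updated query handles ties.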

Create file with matched and non-matched records using Pig script

Can you please suggest how to implement the file matching logic below, and how to remove duplicate entries, using Pig?
1) Removing duplicate entries based on key RoleId -
InputFile1
--------------
RoleId Name
1 A
2 B
3 C
2 D
5 E
5 F
7 G
OutputFile1 (Only unique records)
RoleId Name
1 A
3 C
7 G
OutputFile2 (Capture duplicate records)
RoleId Name
2 B
2 D
5 E
5 F
2) File matching on key RoleId -
InputFile1
-----------
RoleId Name
1 A
2 B
3 C
4 D
5 E
InputFile2
----------
RoleId Age
1 20
2 21
1 22
2 23
3 24
7 25
OutputFile1 (Matching records)
--------------------
RoleId Name Age
1 A 20, 22
2 B 21, 23
3 C 24
OutputFile2 (Un-matching from 1st)
-----------
RoleId Name
4 D
5 E
Thanks,
Can you try the below approach?
Problem1 Solution:
input
1 A
2 B
3 C
2 D
5 E
5 F
7 G
PigScript:
A = LOAD 'in.txt' USING PigStorage() AS(RoleId:int,Name:chararray);
B = GROUP A BY RoleId;
C = FOREACH B GENERATE FLATTEN($1) AS(RoleId,Name),COUNT(A) AS cnt;
SPLIT C INTO Distval IF (cnt==1), NonDistVal IF (cnt>=2);
D = FOREACH Distval GENERATE RoleId,Name;
STORE D INTO 'DistFile' USING PigStorage();
E = FOREACH NonDistVal GENERATE RoleId,Name;
STORE E INTO 'NonDistFile' USING PigStorage();
Output:
cat DistFile/part-r-00000
1 A
3 C
7 G
cat NonDistFile/part-r-00000
2 B
2 D
5 E
5 F
Problem2 Solution:
InputFile1
1 A
2 B
3 C
4 D
5 E
InputFile2
1 20
2 21
1 22
2 23
3 24
7 25
PigScript:
A = LOAD 'InputFile1' USING PigStorage() AS(RoleId:long, Name:chararray);
B = LOAD 'InputFile2' USING PigStorage() AS(RoleId:long, Age:int);
C = COGROUP A BY RoleId ,B BY RoleId;
D = FILTER C BY NOT IsEmpty(A);
SPLIT D INTO RoleMatch IF NOT IsEmpty(B),NoRoleMatch IF IsEmpty(B);
E = FOREACH RoleMatch GENERATE FLATTEN($1),BagToTuple(B.Age);
STORE E INTO 'RoleMatchFile' USING PigStorage();
F = FOREACH NoRoleMatch GENERATE FLATTEN($1);
STORE F INTO 'NoRoleMatchFile' USING PigStorage();
Output:
cat RoleMatchFile/part-r-00000
1 A (20,22)
2 B (21,23)
3 C (24)
cat NoRoleMatchFile/part-r-00000
4 D
5 E

Get previous rows and next rows of linq result for special row

I have a list of profiles, and each profile has its own score. How can I show a list containing a given profile together with the profiles (rows) just before and just after it?
For example:
a 20
b 30
c 40
d 50
e 60
f 70
v 80
t 90
My target is d, but I also want to show the two previous rows and the two next rows in the list:
b 30
c 40
d 50
e 60
f 70
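In my case the target ProfileId will be a parameter. This is what I have so far.
// here is my code.
int t = 0, r = 0; // t counts rows; r is the 1-based position of the target row
var pp = from p in db.Profiles.OrderByDescending(u => u.score) select p;
foreach(var x in pp)
{
t++;
if (x.ProfileId == 58)
{
r = t;
}
}
// skip the rows before the two preceding ones, then take 5 rows (2 before, target, 2 after)
var zz = from d in pp.Skip(Math.Max(0, r - 3)).Take(5) select d;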

Conditional Filter in GROUP BY in Pig

I have the following dataset in which I need to merge multiple rows into one if they have the same key. At the same time, I need to choose which of the grouped tuples to keep.
1 N1 1 10
1 N1 2 15
2 N1 1 10
3 N1 1 10
3 N1 2 15
4 N2 1 10
5 N3 1 10
5 N3 2 20
For example
A = LOAD 'data.txt' AS (f1:int, f2:chararray, f3:int, f4:int);
G = GROUP A BY (f1, f2);
DUMP G;
((1,N1),{(1,N1,1,10),(1,N1,2,15)})
((2,N1),{(2,N1,1,10)})
((3,N1),{(3,N1,1,10),(3,N1,2,15)})
((4,N2),{(4,N2,1,10)})
((5,N3),{(5,N3,1,10),(5,N3,2,20)})
Now, when the collected bag contains multiple tuples, I want to keep only those with f3==2. Here is the final data I want:
((1,N1),{(1,N1,2,15)}) -- f3==2, f3==1 is removed from this set
((2,N1),{(2,N1,1,10)})
((3,N1),{(3,N1,2,15)}) -- f3==2, f3==1 is removed from this bag
((4,N2),{(4,N2,1,10)})
((5,N3),{(5,N3,2,20)})
Any idea how to achieve this?
I solved it the way I described in the comment above. Here is how I did it.
A = LOAD 'group.txt' USING PigStorage(',') AS (f1:int, f2:chararray, f3:int, f4:int);
G = GROUP A BY (f1, f2);
CNT = FOREACH G GENERATE group, COUNT($1) AS cnt, $1;
SPLIT CNT INTO
CNT1 IF (cnt > 1),
CNT2 IF (cnt == 1);
M1 = FOREACH CNT1 {
row = FILTER $2 BY (f3 == 2);
GENERATE FLATTEN(row);
};
M2 = FOREACH CNT2 GENERATE FLATTEN($2);
O = UNION M1, M2;
DUMP O;
(2,N1,1,10)
(4,N2,1,10)
(1,N1,2,15)
(3,N1,2,15)
(5,N3,2,20)

Update DataTable from another table with LINQ

I have 2 DataTables that look like this:
DataTable 1:
cheie_primara cheie_secundara judet localitate
1 11 A
2 22 B
3 33 C
4 44 D
5 55 A
6 66 B
7 77 C
8 88 D
9 99 A
DataTable 2:
ID_CP BAN JUDET LOCALITATE ADRESA
1 11 A aa random
2 22 B ss random
3 33 C ee random
4 44 D xx random
5 55 A rr random
6 66 B aa random
7 77 C ss random
8 88 D ee random
9 99 A xx random
and I want to update DataTable 1 with the field["LOCALITATE"], using DataTable1["cheie_primara"] and DataTable2["ID_CP"] as the matching key.
Like this:
cheie_primara cheie_secundara judet localitate
1 11 A aa
2 22 B ss
3 33 C ee
4 44 D xx
5 55 A rr
6 66 B aa
7 77 C ss
8 88 D ee
9 99 A xx
Is there a LINQ method to update DataTable1?
Thanks!
This is working:
DataTable1.AsEnumerable()
.Join( DataTable2.AsEnumerable(),
dt1_Row => dt1_Row.ItemArray[0],
dt2_Row => dt2_Row.ItemArray[0],
(dt1_Row, dt2_Row) => new { dt1_Row, dt2_Row })
.ToList()
.ForEach(o =>
o.dt1_Row.SetField(3, o.dt2_Row.ItemArray[3]));
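If you want to use LINQ query syntax, here's how I'd go about it:
// AsEnumerable() is needed to query a DataTable; columns are read by name
var a = (from d1 in DataTable1.AsEnumerable()
join d2 in DataTable2.AsEnumerable() on d1["cheie_primara"] equals d2["ID_CP"]
select new { d1, LOCALITATE = d2["LOCALITATE"] }).ToList();
a.ForEach(b => b.d1["localitate"] = b.LOCALITATE);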

Resources