How to split a cell into separate rows and find the minimal summary value - hadoop

I have the following dataset:
Movies : moviename, genre1, genre2, genre3 ..... genre19
(All the genres above have values 0 or 1, 1 indicates that the movie is of that genre)
Now I want to find which movie(s) have the fewest genres.
I tried the Pig script below:
items = load 'path' using PigStorage('|') as (mName:chararray,g1:int,g2:int,g3:int,g4:int,g5:int,g6:int,g7:int,g8:int,g9:int,g10:int,g11:int,g12:int,g13:int,g14:int,g15:int,g16:int,g17:int,g18:int,g19:int);
sumGenre = foreach items generate mName, g1+g2+g3+g4+g5+g6+g7+g8+g9+g10+g11+g12+g13+g14+g15+g16+g17+g18+g19 as sumOfGenres;
groupAll = group sumGenre All;
In the next step, MIN(sumGenre.sumOfGenres) gives me the minimum genre count, but what I am looking for is the name of the movie that has the fewest genres, alongside that movie's genre count.
Can someone please help?
1. Is there an easier way to get the sum g1+g2+...+g19?
2. How do I output the movie(s) that have the fewest genres?

After the groupAll step, compute the minimum:
r1 = foreach groupAll generate MIN(sumGenre.sumOfGenres) as minG;
Then do a left outer join of r1 by minG with sumGenre by sumOfGenres
to get the list of movies that have the fewest genres.
Hope this helps.
To sum a variable number of row fields, you can use a UDF like this:
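import java.io.IOException;
import java.util.List;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Sums every field of the input tuple except the first (the movie name).
public class DynRowSum extends EvalFunc<Integer>
{
    @Override
    public Integer exec(Tuple v) throws IOException
    {
        if (v == null) {
            return null;
        }
        List<Object> olist = v.getAll();
        int sum = 0;
        int cnt = 0;
        for (Object o : olist) {
            cnt++;
            // Skip field 1 (mName) and any null genre value.
            if (cnt != 1 && o != null) {
                sum += (Integer) o;
            }
        }
        return sum;
    }
}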
In Pig, register the jar that contains the UDF and update the script like this:
grunt> sumGenre = foreach items generate mName, DynRowSum(*) as sumOfGenres;
The advantage is that the code stays the same if the number of genre columns increases or decreases.
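To make the logic of the two steps above easier to check, here is a minimal plain-Python sketch of the same idea: sum every field after the movie name, find the minimum, and keep every movie whose sum equals it (so ties are all returned). This is only a model of the Pig pipeline, not Pig code; the function name least_genre_movies and the sample rows are made up for illustration.

```python
# Plain-Python model of the Pig steps: sum the genre flags, find the minimum,
# then keep every movie whose sum equals it. The sample rows are made up.

def least_genre_movies(rows):
    """rows: iterable of (movie_name, flag1, flag2, ...) with 0/1 flags."""
    # Sum every field except the first, whatever the number of genre columns.
    sums = [(row[0], sum(row[1:])) for row in rows]
    if not sums:
        return []
    min_genres = min(total for _, total in sums)
    # Equivalent of joining the minimum back to sumGenre: all ties are kept.
    return [(name, total) for name, total in sums if total == min_genres]


sample = [
    ("Movie A", 1, 0, 1),
    ("Movie B", 0, 1, 0),
    ("Movie C", 0, 0, 1),
]
print(least_genre_movies(sample))
```

With this sample, both Movie B and Movie C are returned, each with a genre count of 1.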

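-- Alternative: split each line, pair the movie name with every field, then sum per movie.
a = LOAD 'path';
b = FOREACH a GENERATE FLATTEN(STRSPLIT($0, '\\|'));
c = FOREACH b GENERATE $0 AS movie, FLATTEN(TOBAG(*)) AS genre;
-- Drop the row that pairs the movie name with itself.
d = FILTER c BY movie != genre;
e = GROUP d BY movie;
-- SUM takes a bag of values, so project the genre field
-- (it may need an explicit (int) cast, depending on how Pig types the split fields).
f = FOREACH e GENERATE group, SUM(d.genre);
i = ORDER f BY $1;
-- LIMIT 1 returns a single movie even when several tie for the minimum.
j = LIMIT i 1;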

Related

Equivalent of Union_map in pig

I have been trying to find the equivalent of union_map() in Pig. I know that the TOMAP function produces the MAP datatype,
but the requirement is to combine all the maps for a given ID, as shown below.
select I1,UNION_MAP(MAP(Key,Val)) as new_val group by I1;
Sample input and result are provided below.
Input
ID,Key,Val
ID1,K1,V1
ID2,K1,V2
ID2,K3,V3
ID1,K2,V4
ID1,K1,V7
select ID,UNION_MAP(TO_MAP(Key,VAL)) from table group by ID;
Result
ID1,(K1#V7,K2#V4)
ID2,(K1#V2,K3#V3)
I would like to get similar output in Pig.
Download piggybank.jar from http://www.java2s.com/Code/Jar/p/Downloadpiggybankjar.htm, add it to your classpath, and try the approach below.
input
ID1,K1,V1
ID2,K1,V2
ID2,K3,V3
ID1,K2,V4
ID1,K1,V7
PigScript:
REGISTER '/tmp/piggybank.jar';
A = LOAD 'input' USING PigStorage(',') AS (ID:chararray,Key:chararray,Val:chararray);
B = RANK A;
C = GROUP B BY (ID,Key);
D = FOREACH C {
    sortByRank = ORDER B BY rank_A DESC;
    top1 = LIMIT sortByRank 1;
    GENERATE FLATTEN(top1);
}
E = GROUP D BY top1::ID;
F = FOREACH E {
    ToMap = FOREACH D GENERATE TOMAP(top1::Key,top1::Val);
    GENERATE group,BagToTuple(ToMap) AS myMap;
}
DUMP F;
Output:
(ID1,([K1#V7],[K2#V4]))
(ID2,([K1#V2],[K3#V3]))
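The script above keeps, for each (ID, Key) pair, the row with the highest rank (that is, the last one in the input) and then collects the surviving key/value pairs per ID. A minimal plain-Python sketch of that "last value wins" merge, using the sample input from the question, could look like this. The function name union_maps is made up for illustration and this is a model of the logic, not Pig code.

```python
# Plain-Python model of the merge: one dict per ID, where a later row
# for the same key overwrites the earlier value (like RANK DESC + LIMIT 1).

def union_maps(rows):
    """rows: iterable of (id, key, value) in input order."""
    result = {}
    for row_id, key, val in rows:
        result.setdefault(row_id, {})[key] = val
    return result


rows = [
    ("ID1", "K1", "V1"),
    ("ID2", "K1", "V2"),
    ("ID2", "K3", "V3"),
    ("ID1", "K2", "V4"),
    ("ID1", "K1", "V7"),
]
merged = union_maps(rows)
print(merged)
```

For ID1 the value V7 replaces V1 under K1, which matches the expected result in the question.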

To find the names with maximum occurrence in a list of tuples in Pig

I have a file as:
1,Mary,5
1,Tom,5
2,Bill,5
2,Sue,4
2,Theo,5
3,Mary,5
3,Cindy,5
4,Andrew,4
4,Katie,4
4,Scott,5
5,Jeff,3
5,Sara,4
5,Ryan,5
6,Bob,5
6,Autumn,4
7,Betty,5
7,Janet,5
7,Scott,5
8,Andrew,4
8,Katie,4
8,Scott,5
9,Mary,5
9,Tom,5
10,Bill,5
10,Sue,4
10,Theo,5
11,Mary,5
11,Cindy,5
12,Andrew,4
12,Katie,4
12,Scott,5
13,Jeff,3
13,Sara,4
13,Ryan,5
14,Bob,5
14,Autumn,4
15,Betty,5
15,Janet,5
15,Scott,5
16,Andrew,4
16,Katie,4
16,Scott,5
I want the name that appears most often, i.e. the one with the maximum count:
(Scott,6)
Your question is somewhat ambiguous.
What exactly do you want?
Do you want a list of user counts in descending order?
OR
Do you want just (Scott,6), i.e. only the one user with the maximum count?
I have solved both cases on the sample data you gave.
For the first case:
a = load '/file.txt' using PigStorage(',') as (id:int,name:chararray,number:int);
g = group a by name;
g1 = foreach g {
    generate group as g, COUNT(a) as cnt;
};
toptemp = group g1 all;
final = foreach toptemp {
    sorted = order g1 by cnt desc;
    GENERATE flatten(sorted);
};
This gives you the users in descending order of count:
(Scott,6)
(Katie,4)
(Andrew,4)
(Mary,4)
(Bob,2)
(Sue,2)
(Tom,2)
(Bill,2)
(Jeff,2)
(Ryan,2)
(Sara,2)
(Theo,2)
(Betty,2)
(Cindy,2)
(Janet,2)
(Autumn,2)
For the second case:
a = load '/file.txt' using PigStorage(',') as (id:int,name:chararray,number:int);
g = group a by name;
g1 = foreach g {
    generate group as g, COUNT(a) as cnt;
};
toptemp = group g1 all;
final = foreach toptemp {
    sorted = order g1 by cnt desc;
    top = limit sorted 1;
    GENERATE flatten(top);
};
This gives only one result:
(Scott,6)
Thanks. I hope it helps.
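For comparison, the same count-then-rank logic can be sketched in plain Python with collections.Counter. This is only a model of what the two Pig scripts compute, not Pig code; the function name name_counts is made up, and the lines list is a small excerpt in the same id,name,number format rather than the full file.

```python
from collections import Counter

# Plain-Python model of "group by name, count, order by count desc".

def name_counts(lines):
    """lines: iterable of 'id,name,number' strings."""
    names = (line.split(",")[1] for line in lines if line.strip())
    return Counter(names)


# Small excerpt in the same format as the question's file.
lines = ["4,Andrew,4", "4,Katie,4", "4,Scott,5", "7,Betty,5", "7,Scott,5"]
counts = name_counts(lines)

ranked = counts.most_common()   # first case: every name, highest count first
top = counts.most_common(1)     # second case: only the most frequent name
print(ranked)
print(top)
```

Like LIMIT 1 in the Pig version, most_common(1) returns a single name even when several names tie for the maximum.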

Computing SUM within FOREACH

Let's say I have the following
DATA = foreach INPUT {
//..
generate group, count(name) as total;
}
I'll end up with a relation where the key is grouped by name
('mike', 'someprop', 10)
('mike', 'otherprop', 3)
('doug', 'xprop', 5)
...
And I want to get the sum of the top 10 for each name:
ALIAS = group DATA by name;
RESULT = foreach ALIAS {
SORTED = ORDER DATA by total desc;
TOP10 = LIMIT SORTED 10;
//doesn't work! can't have GROUP inside FOREACH
AGG = group TOP10 ALL;
TOPTOTAL = foreach AGG generate SUM(AGG.total);
generate group, TOPTOTAL;
}
How can I compute a value (SUM,COUNT,ETC) for a relation inside a foreach? Currently there's no way to apply a GROUP ALL inside the foreach.
SUM is just a function that takes a bag as its argument, and you can create this bag by projecting from TOP10:
ALIAS = group DATA by name;
RESULT = foreach ALIAS {
    SORTED = ORDER DATA by total desc;
    TOP10 = LIMIT SORTED 10;
    generate group, SUM(TOP10.total);
}
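The point of the answer is that no second GROUP is needed: once the top rows of each group are in a bag, an aggregate can be applied to a projection of that bag. A minimal plain-Python sketch of the same "sort, take the top N, sum" step per name follows. It is a model of the logic, not Pig code; the function name top_n_sum is made up, and the records reuse the sample tuples from the question.

```python
from collections import defaultdict

# Plain-Python model of: group by name, order by total desc, limit N, sum.

def top_n_sum(records, n=10):
    """records: iterable of (name, prop, total) tuples."""
    by_name = defaultdict(list)
    for name, _prop, total in records:
        by_name[name].append(total)
    # For each name, keep the n largest totals and add them up.
    return {
        name: sum(sorted(totals, reverse=True)[:n])
        for name, totals in by_name.items()
    }


records = [
    ("mike", "someprop", 10),
    ("mike", "otherprop", 3),
    ("doug", "xprop", 5),
]
print(top_n_sum(records))        # all rows fit within the top 10
print(top_n_sum(records, n=1))   # only the largest total per name
```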

select count distinct using pig latin

I need help with this Pig script. I am getting just a single record. I am selecting 2 columns and doing a count(distinct) on another, while also using a WHERE-like clause to match a particular description (desc).
Here is the SQL I am trying to express in Pig, followed by my script.
/*
For example in sql:
select domain, count(distinct(segment)) as segment_cnt
from table
where desc='ABC123'
group by domain
order by segment_cnt desc;
*/
A = LOAD 'myoutputfile' USING PigStorage('\u0005')
AS (
domain:chararray,
segment:chararray,
desc:chararray
);
B = filter A by (desc=='ABC123');
C = foreach B generate domain, segment;
D = DISTINCT C;
E = group D all;
F = foreach E generate group, COUNT(D) as segment_cnt;
G = order F by segment_cnt DESC;
You could GROUP on each domain and then count the number of distinct elements in each group with a nested FOREACH syntax:
D = group C by domain;
E = foreach D {
    unique_segments = DISTINCT C.segment;
    generate group, COUNT(unique_segments) as segment_cnt;
};
It is cleaner to define this as a macro:
DEFINE DISTINCT_COUNT(A, c) RETURNS dist {
    temp = FOREACH $A GENERATE $c;
    dist = DISTINCT temp;
    groupAll = GROUP dist ALL;
    $dist = FOREACH groupAll GENERATE COUNT(dist);
};
Usage:
X = LOAD 'data' AS (x: int);
Y = DISTINCT_COUNT(X, x);
If you need to use it in a FOREACH instead then the easiest way is something like:
...GENERATE COUNT(Distinct(x))...
Tested on Pig 12.
If you don't need to group by anything, use this:
G = FOREACH (GROUP A ALL) {
    unique = DISTINCT A.field;
    GENERATE COUNT(unique) AS ct;
};
This will just give you a single number.
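To see why GROUP ... ALL in the question yields a single record while grouping by domain yields one record per domain, here is a minimal plain-Python sketch of the intended SQL: filter on desc, collect the distinct segments per domain, and order by the count. It is a model of the logic, not Pig code; the function name distinct_segment_counts and the sample rows are made up for illustration.

```python
from collections import defaultdict

# Plain-Python model of:
#   select domain, count(distinct segment) ... where desc = X group by domain

def distinct_segment_counts(rows, wanted_desc):
    """rows: iterable of (domain, segment, desc) tuples."""
    segments = defaultdict(set)
    for domain, segment, desc in rows:
        if desc == wanted_desc:
            # A set keeps each segment once per domain (the DISTINCT step).
            segments[domain].add(segment)
    counts = [(domain, len(segs)) for domain, segs in segments.items()]
    # Highest distinct count first, like ORDER ... BY segment_cnt DESC.
    return sorted(counts, key=lambda pair: pair[1], reverse=True)


# Made-up sample rows.
rows = [
    ("a.com", "s1", "ABC123"),
    ("a.com", "s2", "ABC123"),
    ("a.com", "s1", "ABC123"),
    ("b.com", "s1", "ABC123"),
    ("b.com", "s9", "OTHER"),
]
print(distinct_segment_counts(rows, "ABC123"))
```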

Randomize database table result with LINQ

from f in db.Table1
orderby Guid.NewGuid()
select f
This doesn't seem to work. How can I randomize the results?
How about
SELECT TOP 1 column FROM table ORDER BY NEWID()
and skip the LINQ :)
Or try this:
// random() is not built in; it has to be a method you define yourself and map to the database.
var t = (from row in db.Table1 orderby table1.random()
select row).FirstOrDefault();
Maybe something like this works (not tested):
(from f in db.Table1 select new { f, r = Guid.NewGuid()}).OrderBy(x => x.r)
Randomize the whole list:
db.Table1.OrderBy(x => Guid.NewGuid())
Get a single random row:
db.Table1.OrderBy(x => Guid.NewGuid()).FirstOrDefault();
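I like to write an extension method for this.
// Must be declared inside a static class; needs System, System.Linq and System.Collections.Generic.
public static IEnumerable<T> Randomize<T>(this IEnumerable<T> list)
{
    T[] result = list.ToArray();
    Random random = new Random();
    // Fisher-Yates shuffle: swap each slot with a randomly chosen slot at or before it.
    for (int i = result.Length - 1; i > 0; i--)
    {
        int j = random.Next(i + 1); // 0 <= j <= i
        T tmp = result[i];
        result[i] = result[j];
        result[j] = tmp;
    }
    return result;
}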
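The in-memory approach above is a Fisher-Yates shuffle over a copy of the results. A minimal plain-Python sketch of that same technique is shown here so the loop bounds are easy to verify; the function name randomize and the optional rng parameter are made up for illustration, and in practice Python's random.shuffle does the same job.

```python
import random

# Fisher-Yates shuffle on a copy, so the caller's sequence is left untouched.

def randomize(items, rng=None):
    if rng is None:
        rng = random.Random()
    result = list(items)
    # Walk from the last slot down to slot 1, swapping each slot with a
    # randomly chosen slot at or before it.
    for i in range(len(result) - 1, 0, -1):
        j = rng.randint(0, i)  # inclusive on both ends: 0 <= j <= i
        result[i], result[j] = result[j], result[i]
    return result


original = list(range(10))
shuffled = randomize(original, random.Random(0))
print(shuffled)
```

The result always contains the same elements as the input, only in a different order.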
