I have been trying to find the union_map() equivalent in pig. I know for sure that TOMAP function brings in MAP datatype.
But the requirement is to bring all the MAPs for a given id as shown below.
select I1,UNION_MAP(MAP(Key,Val)) as new_val group by I1;
Sample Input and result is provided below.
Input
ID,Key,Val
ID1,K1,V1
ID2,K1,V2
ID2,K3,V3
ID1,K2,V4
ID1,K1,V7
select ID,UNION_MAP(TO_MAP(Key,VAL)) from table group by ID;
Result
ID1,(K1#V7,K2#V4)
ID2,(K1#V2,K3#V3)
I would like to get the similar output in pig.
Download the piggybank.jar from this link http://www.java2s.com/Code/Jar/p/Downloadpiggybankjar.htm and set it in your classpath and try the below approach.
input
ID1,K1,V1
ID2,K1,V2
ID2,K3,V3
ID1,K2,V4
ID1,K1,V7
PigScript:
REGISTER '/tmp/piggybank.jar';
A = LOAD 'input' USING PigStorage(',') AS (ID:chararray,Key:chararray,Val:chararray);
B = RANK A;
C = GROUP B BY (ID,Key);
D = FOREACH C {
sortByRank = ORDER B BY rank_A DESC;
top1 = LIMIT sortByRank 1;
GENERATE FLATTEN(top1);
}
E = GROUP D BY top1::ID;
F = FOREACH E {
ToMap = FOREACH D GENERATE TOMAP(top1::Key,top1::Val);
GENERATE group,BagToTuple(ToMap) AS myMap;
}
DUMP F;
Output:
(ID1,([K1#V7],[K2#V4]))
(ID2,([K1#V2],[K3#V3]))
I have a file as:
1,Mary,5
1,Tom,5
2,Bill,5
2,Sue,4
2,Theo,5
3,Mary,5
3,Cindy,5
4,Andrew,4
4,Katie,4
4,Scott,5
5,Jeff,3
5,Sara,4
5,Ryan,5
6,Bob,5
6,Autumn,4
7,Betty,5
7,Janet,5
7,Scott,5
8,Andrew,4
8,Katie,4
8,Scott,5
9,Mary,5
9,Tom,5
10,Bill,5
10,Sue,4
10,Theo,5
11,Mary,5
11,Cindy,5
12,Andrew,4
12,Katie,4
12,Scott,5
13,Jeff,3
13,Sara,4
13,Ryan,5
14,Bob,5
14,Autumn,4
15,Betty,5
15,Janet,5
15,Scott,5
16,Andrew,4
16,Katie,4
16,Scott,5
I want the answer with names most appeared i.e max
(Scott,6)
There's some ambiguity in your question.
What exactly do you want.
Do you want a list of user count in descending order?
OR
Do you want just (scott,6) i.e. only one user with maximum count?
I have successfully solved both the things,on the sample data which you gave.
If the question is of first type then,
a = load '/file.txt' using PigStorage(',') as (id:int,name:chararray,number:int);
g = group a by name;
g1 = foreach g{
generate group as g , COUNT(a) as cnt;
};
toptemp = group g1 all;
final = foreach toptemp{
sorted = order g1 by cnt desc;
GENERATE flatten(sorted);
};
This will give you a list of users in descending order as,
(Scott,6)
(Katie,4)
(Andrew,4)
(Mary,4)
(Bob,2)
(Sue,2)
(Tom,2)
(Bill,2)
(Jeff,2)
(Ryan,2)
(Sara,2)
(Theo,2)
(Betty,2)
(Cindy,2)
(Janet,2)
(Autumn,2)
If the question is of second type then,
a = load '/file.txt' using PigStorage(',') as (id:int,name:chararray,number:int);
g = group a by name;
g1 = foreach g{
generate group as g , COUNT(a) as cnt;
};
toptemp = group g1 all;
final = foreach toptemp{
sorted = order g1 by cnt desc;
top = limit sorted 1;
GENERATE flatten(top);
};
This gives us only one result ,
(Scott,6)
Thanks.I Hope it helps.
Let's say I have the following
DATA = foreach INPUT {
//..
generate group, count(name) as total;
}
I'll end up with a relation where the key is grouped by name
('mike', 'someprop', 10)
('mike', 'otherprop', 3)
('doug', 'xprop', 5)
...
And I want to get the sum of the top 10 for each name:
ALIAS = group DATA by name;
RESULT = foreach ALIAS {
SORTED = ORDER DATA by total desc;
TOP10 = LIMIT SORTED 10;
//doesn't work! can't have GROUP inside FOREACH
AGG = group TOP10 ALL;
TOPTOTAL = foreach AGG generate SUM(AGG.total);
generate group, TOPTOTAL;
}
How can I compute a value (SUM,COUNT,ETC) for a relation inside a foreach? Currently there's no way to apply a GROUP ALL inside the foreach.
SUM is just a function that takes a bag as its argument, and you can create this bag by projecting from TOP10:
ALIAS = group DATA by name;
RESULT = foreach ALIAS {
SORTED = ORDER DATA by total desc;
TOP10 = LIMIT SORTED 10;
generate group, SUM(TOP10.total);
}
I need help with this pig script. I am just getting a single record. I am selecting 2 columns and doing a count(distinct) on another while also using a where like clause to find a particular description (desc).
Here's my sql with pig I am trying to code.
/*
For example in sql:
select domain, count(distinct(segment)) as segment_cnt
from table
where desc='ABC123'
group by domain
order by segment_count desc;
*/
A = LOAD 'myoutputfile' USING PigStorage('\u0005')
AS (
domain:chararray,
segment:chararray,
desc:chararray
);
B = filter A by (desc=='ABC123');
C = foreach B generate domain, segment;
D = DISTINCT C;
E = group D all;
F = foreach E generate group, COUNT(D) as segment_cnt;
G = order F by segment_cnt DESC;
You could GROUP on each domain and then count the number of distinct elements in each group with a nested FOREACH syntax:
D = group C by domain;
E = foreach D {
unique_segments = DISTINCT C.segment;
generate group, COUNT(unique_segments) as segment_cnt;
};
You can better define this as a macro:
DEFINE DISTINCT_COUNT(A, c) RETURNS dist {
temp = FOREACH $A GENERATE $c;
dist = DISTINCT temp;
groupAll = GROUP dist ALL;
$dist = FOREACH groupAll GENERATE COUNT(dist);
}
Usage:
X = LOAD 'data' AS (x: int);
Y = DISTINCT_COUNT(X, x);
If you need to use it in a FOREACH instead then the easiest way is something like:
...GENERATE COUNT(Distinct(x))...
Tested on Pig 12.
If you don't want to count on any group, you use this:
G = FOREACH (GROUP A ALL){
unique = DISTINCT A.field;
GENERATE COUNT(unique) AS ct;
};
This will just give you a number.
from f in db.Table1
orderby Guid.NewGuid()
select f
this doesn't seem to work. how can i randomize results?
How about
SELECT TOP 1 column FROM table ORDER BY NEWID and skip the linq :)
Or try this:
var t = (from row in db.Table1 order by table1.random()
select row).FirstOrDefault();
Maybe something like this works (not tested):
(from f in db.Table1 select new { f, r = Guid.NewGuid()}).OrderBy(x => x.r)
Randomize whole list
db.Table1.OrderBy(x => Guid.NewGuid())
Get single Random
db.Table1.OrderBy(x => Guid.NewGuid()).FirstOrDefault();
I like to write an extension method for this.
IEnumerable<T> Randomize(this IEnumerable<T> list)
{
T[] result = list.ToArray();
Random random = new Random();
for(int i = result.Length; i > 0; i--)
{
result[i] = random.Next(i);
}
return (result);
}