Calculate count of distinct values of a field using pig script - hadoop

For a file of the form
A B user1
C D user2
A D user3
A D user1
I want to calculate the count of distinct values of field 3, i.e. count(distinct(user1, user2, user3, user1)) = 3.
I am doing this using the following pig script
A = load 'myTestData' using PigStorage('\t') as (a1,a2,a3);
user_list = foreach A GENERATE $2;
unique_users = DISTINCT user_list;
unique_users_group = GROUP unique_users ALL;
uu_count = FOREACH unique_users_group GENERATE COUNT(unique_users);
store uu_count into 'output';
Is there a better way to get count of distinct values of a field?

A more up-to-date way to do this:
user_data = LOAD 'myTestData' USING PigStorage('\t') AS (a1,a2,a3);
users = FOREACH user_data GENERATE a3;
uniq_users = DISTINCT users;
grouped_users = GROUP uniq_users ALL;
uniq_user_count = FOREACH grouped_users GENERATE COUNT(uniq_users);
DUMP uniq_user_count;
This will leave the value (3) in your log.

Here is one that is a little more concise. You might want to check which one runs faster (see the EXPLAIN sketch after the script).
A = LOAD 'myTestData' USING PigStorage('\t') AS (a1,a2,a3);
unique_users_group = GROUP A ALL;
uu_count = FOREACH unique_users_group {user = A.a3; uniq = DISTINCT user; GENERATE COUNT(uniq);};
STORE uu_count INTO 'output';
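One way to compare the two approaches without timing full runs is Pig's EXPLAIN operator, which prints the logical, physical, and MapReduce plans an alias compiles to. A quick sketch, assuming you run each script separately (both end with the alias uu_count):
-- Print the plans for the final alias; comparing the number of MapReduce
-- jobs (and whether a combiner appears in the plan) is a reasonable first
-- check before benchmarking both scripts on real data.
EXPLAIN uu_count;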

Related

Removing duplicates using PigLatin and retaining the last element

I am using Pig Latin, and I want to remove the duplicates from the bags while retaining the last element for a particular key.
Input:
User1 7 LA
User1 8 NYC
User1 9 NYC
User2 3 NYC
User2 4 DC
Output:
User1 9 NYC
User2 4 DC
Here the first field is the key, and I want the last record of that particular key to be retained in the output.
I know how to retain the first element, as shown below, but I am not able to retain the last element.
inpt = load '......' ......;
user_grp = GROUP inpt BY $0;
filtered = FOREACH user_grp {
top_rec = LIMIT inpt 1;
GENERATE FLATTEN(top_rec);
};
Can anybody help me on this? Thanks in advance!
@Anil: If you order by one of the fields in descending order, you will be able to get the last record. In the code below, I have ordered by the second field of the input (field name: no in the script).
Input :
User1,7,LA
User1,8,NYC
User1,9,NYC
User2,3,NYC
User2,4,DC
Pig snippet :
user_details = LOAD 'user_details.csv' USING PigStorage(',') AS (user_name:chararray,no:long,city:chararray);
user_details_grp_user = GROUP user_details BY user_name;
required_user_details = FOREACH user_details_grp_user {
user_details_sorted_by_no = ORDER user_details BY no DESC;
top_record = LIMIT user_details_sorted_by_no 1;
GENERATE FLATTEN(top_record);
};
Output of DUMP required_user_details:
(User1,9,NYC)
(User2,4,DC)
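If you are on Pig 0.8 or later, the built-in TOP function gives a slightly shorter variant of the same idea. A sketch, assuming the same user_details schema as above:
user_details = LOAD 'user_details.csv' USING PigStorage(',') AS (user_name:chararray,no:long,city:chararray);
user_details_grp_user = GROUP user_details BY user_name;
-- TOP(1, 1, bag) keeps the single tuple with the largest value in column 1 (the no field)
required_user_details = FOREACH user_details_grp_user GENERATE FLATTEN(TOP(1, 1, user_details));
DUMP required_user_details;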
Alternatively, you can use the RANK operator. Hope the code below helps.
rec = LOAD '/user/cloudera/inputfiles/sample.txt' USING PigStorage(',') AS(user:chararray,no:int,loc:chararray);
rec_rank = rank rec;
rec_rank_each = FOREACH rec_rank GENERATE $0 as rank_key, user, no, loc;
rec_rank_grp = GROUP rec_rank_each by user;
rec_rank_max = FOREACH rec_rank_grp GENERATE group as temp_user, MAX(rec_rank_each.rank_key) as max_rank;
rec_join = JOIN rec_rank_each BY (user, rank_key), rec_rank_max BY (temp_user, max_rank);
rec_output = FOREACH rec_join GENERATE user,no,loc;
dump rec_output;
Make sure you run this on Pig 0.11 or later, as the RANK operator was introduced in Pig 0.11.
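If you prefer to avoid the extra JOIN, the same last-record-per-user can be picked with a nested ORDER and LIMIT on the rank. A sketch reusing rec_rank_each and rec_rank_grp from the script above (the aliases by_rank, top1, and rec_last are just illustrative):
rec_last = FOREACH rec_rank_grp {
    -- sort each user's records by load order (rank_key) and keep the newest one
    by_rank = ORDER rec_rank_each BY rank_key DESC;
    top1 = LIMIT by_rank 1;
    GENERATE FLATTEN(top1.(user, no, loc));
};
dump rec_last;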

Equivalent of Union_map in pig

I have been trying to find the union_map() equivalent in Pig. I know that the TOMAP function produces the MAP datatype.
But the requirement is to bring together all the MAPs for a given ID, as shown below.
select I1, UNION_MAP(MAP(Key,Val)) as new_val from table group by I1;
Sample Input and result is provided below.
Input
ID,Key,Val
ID1,K1,V1
ID2,K1,V2
ID2,K3,V3
ID1,K2,V4
ID1,K1,V7
select ID,UNION_MAP(TO_MAP(Key,VAL)) from table group by ID;
Result
ID1,(K1#V7,K2#V4)
ID2,(K1#V2,K3#V3)
I would like to get the similar output in pig.
Download piggybank.jar from http://www.java2s.com/Code/Jar/p/Downloadpiggybankjar.htm, add it to your classpath, and try the approach below.
input
ID1,K1,V1
ID2,K1,V2
ID2,K3,V3
ID1,K2,V4
ID1,K1,V7
PigScript:
REGISTER '/tmp/piggybank.jar';
A = LOAD 'input' USING PigStorage(',') AS (ID:chararray,Key:chararray,Val:chararray);
B = RANK A;
C = GROUP B BY (ID,Key);
D = FOREACH C {
sortByRank = ORDER B BY rank_A DESC;
top1 = LIMIT sortByRank 1;
GENERATE FLATTEN(top1);
};
E = GROUP D BY top1::ID;
F = FOREACH E {
ToMap = FOREACH D GENERATE TOMAP(top1::Key,top1::Val);
GENERATE group,BagToTuple(ToMap) AS myMap;
};
DUMP F;
Output:
(ID1,([K1#V7],[K2#V4]))
(ID2,([K1#V2],[K3#V3]))

How to count the number of unique users with PIG

The following piece of code doesn't return what I am trying to compute: the number of unique users. Any idea?
data = LOAD 'input_initial' AS (user_id,item_id,rating,timestamp);
data = FOREACH data GENERATE user_id,item_id;
STORE data INTO 'input_final';
data_users = FOREACH data GENERATE user_id;
group_users = GROUP data_users BY user_id;
count_users = FOREACH group_users GENERATE COUNT(data_users);
STORE count_users INTO 'count_users';
You need to amend the final GROUP operation to act on 'all' rather than an individual field:
group_users = GROUP data_users BY user_id;
grp_all = GROUP group_users ALL;
count_users = FOREACH grp_all GENERATE COUNT(group_users);
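An equivalent formulation, sketched on the same aliases, is to take the DISTINCT user_ids first and then count them after a GROUP ALL, mirroring the pattern from the question at the top of this page:
data_users = FOREACH data GENERATE user_id;
uniq_users = DISTINCT data_users;
grp_all = GROUP uniq_users ALL;
count_users = FOREACH grp_all GENERATE COUNT(uniq_users);
STORE count_users INTO 'count_users';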

Hadoop Pig GROUP by id, get owner_id?

In Hadoop I have many records that look like this:
(item_id, owner_id, counter) - there could be duplicates, but an item_id ALWAYS has the same owner_id!
I want to get the SUM of the counter for each item_id, so I have the following script:
alldata = LOAD '/path/to/data/*' USING D; -- D describes the structure
known_items = FILTER alldata BY owner_id > 0L;
group_by_item = GROUP known_items BY (item_id);
data = FOREACH group_by_item GENERATE group AS item_id, OWNER_ID_COLUMN_SOMEHOW, SUM(known_items.counter) AS items_count;
The problem is that in the FOREACH, if I take known_items.owner_id, I get a bag containing the owner_id of every grouped tuple. What would be the most efficient way to get just the first of the owners?
The simplest solution gives you the right answer if your assumption that each item_id has the same owner_id is correct, and it will let you know if it is not: include the owner_id as part of the group.
alldata = LOAD '/path/to/data/*' USING D; -- D describes the structure
known_items = FILTER alldata BY owner_id > 0L;
group_by_item = GROUP known_items BY (item_id, owner_id);
data = FOREACH group_by_item GENERATE FLATTEN(group), SUM(known_items.counter) AS items_count;
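If you want an explicit check that the assumption holds, a sketch along these lines (reusing the known_items alias; the aliases by_item, owners_per_item, and suspect_items are just illustrative) lists any item_id that ends up with more than one distinct owner_id:
by_item = GROUP known_items BY item_id;
owners_per_item = FOREACH by_item {
    -- count how many distinct owner_ids each item_id has
    uniq_owners = DISTINCT known_items.owner_id;
    GENERATE group AS item_id, COUNT(uniq_owners) AS owner_cnt;
};
suspect_items = FILTER owners_per_item BY owner_cnt > 1;
DUMP suspect_items; -- empty output means every item_id really has a single owner_id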

select count distinct using pig latin

I need help with this Pig script. I am only getting a single record back. I am selecting 2 columns and doing a count(distinct) on another, while also using a WHERE-like clause to find a particular description (desc).
Here is the SQL I am trying to express in Pig:
/*
For example in sql:
select domain, count(distinct(segment)) as segment_cnt
from table
where desc='ABC123'
group by domain
order by segment_cnt desc;
*/
A = LOAD 'myoutputfile' USING PigStorage('\u0005')
AS (
domain:chararray,
segment:chararray,
desc:chararray
);
B = filter A by (desc=='ABC123');
C = foreach B generate domain, segment;
D = DISTINCT C;
E = group D all;
F = foreach E generate group, COUNT(D) as segment_cnt;
G = order F by segment_cnt DESC;
You could GROUP by domain and then count the number of distinct elements in each group with the nested FOREACH syntax:
D = group C by domain;
E = foreach D {
unique_segments = DISTINCT C.segment;
generate group, COUNT(unique_segments) as segment_cnt;
};
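To reproduce the ORDER BY from the SQL, you can then sort the per-domain counts; a sketch continuing from the aliases above:
F = ORDER E BY segment_cnt DESC;
DUMP F;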
Better yet, you can define this as a macro:
DEFINE DISTINCT_COUNT(A, c) RETURNS dist {
temp = FOREACH $A GENERATE $c;
dist = DISTINCT temp;
groupAll = GROUP dist ALL;
$dist = FOREACH groupAll GENERATE COUNT(dist);
}
Usage:
X = LOAD 'data' AS (x: int);
Y = DISTINCT_COUNT(X, x);
If you need to use it in a FOREACH instead, then the easiest way is something like:
...GENERATE COUNT(Distinct(x))...
Tested on Pig 0.12.
If you don't want to count within any particular group, you can use this:
G = FOREACH (GROUP A ALL){
unique = DISTINCT A.field;
GENERATE COUNT(unique) AS ct;
};
This will just give you a number.
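Applied to this question's schema, the same pattern counts the distinct segments across all domains; a sketch reusing the filtered relation B from the question (the alias seg_count_all is just illustrative):
seg_count_all = FOREACH (GROUP B ALL) {
    unique_segments = DISTINCT B.segment;
    GENERATE COUNT(unique_segments) AS segment_cnt;
};
DUMP seg_count_all;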
