Foreach inside Foreach in pig - hadoop

I have a record structure like this:
Read_PeopleAll: {PID: bytearray,Read_PropertyTax: {(PropertyID: bytearray,ReadPropertyDS: (PersonID: bytearray,PropertyID: bytearray))}}
I am trying to access the PropertyID but am unable to do so.
a = foreach Read_PeopleAll {
b = foreach Read_PropertyTax{
c = filter ReadPropertyDS by PersonID is not null;
generate $0,c;
};
GENERATE $0,b;
};
dump a;
But I am getting an error like this:
mismatched input '{' expecting GENERATE
Can I use a foreach inside another foreach?
As an alternative, I am able to access it this way:
a = FOREACH Read_PeopleAll generate Read_PropertyTax.ReadPropertyDS;
IsValidProperty = FILTER a BY PropertyID==1;
Any suggestions?

From the docs:
Note: FOREACH statements can be nested to two levels only. FOREACH statements that are nested to three or more levels will result in a grammar error.
You can have a FOREACH nested in a FOREACH, but you cannot have another nested operation in it.
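A minimal sketch of how the query could be restructured to stay within two levels, assuming the schema shown in the question. The aliases filtered, flat and valid are hypothetical names, and neither variant has been run against the asker's data. The first variant replaces the inner FOREACH block with a nested FILTER that dereferences the ReadPropertyDS tuple directly; the second flattens the bag first so that no nesting is needed.

```pig
-- Variant 1: one nested block; FILTER is an allowed nested operation.
a = FOREACH Read_PeopleAll {
    -- keep only the bag tuples whose inner tuple has a PersonID
    filtered = FILTER Read_PropertyTax BY ReadPropertyDS.PersonID IS NOT NULL;
    GENERATE PID, filtered;
};
DUMP a;

-- Variant 2: flatten the bag into rows, then filter at the top level.
flat = FOREACH Read_PeopleAll GENERATE PID, FLATTEN(Read_PropertyTax);
valid = FILTER flat BY ReadPropertyDS.PersonID IS NOT NULL;
DUMP valid;
```

Variant 2 yields one row per property, which makes later filters on PropertyID ordinary top-level FILTER statements.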

Related

Pig: is it possible to write a loop over variables in a list?

I have to loop over 30 variables in a list
[var1,var2, ... , var30]
and for each variable I use some PIG group by statement such as
grouped = GROUP data by var1;
data_var1 = FOREACH grouped{
GENERATE group as mygroup,
COUNT(data) as count;
};
Is there a way to loop over the list of variables, or am I forced to repeat the code above manually 30 times?
Thanks!
I think what you're looking for is a Pig macro.
Create a relation for your 30 variables, iterate over them with a foreach, and call a macro that takes two parameters: your data relation and the variable you want to group by.
Check the linked example; the macro there is very similar to what you'd like to do.
UPDATE & code
So here's the macro you can use:
DEFINE my_cnt(data, group_field) RETURNS C {
$C = FOREACH (GROUP $data by $group_field) GENERATE
group AS mygroup,
COUNT($data) AS count;
};
Use the macro:
IMPORT 'cnt.macro';
data = LOAD 'data.txt' USING PigStorage(',') AS (field:chararray, value:chararray);
DESCRIBE data;
e = my_cnt(data,'the_field_you_group_by');
DESCRIBE e;
DUMP e;
I'm still thinking about how you can iterate over the fields you'd like to group by. My original suggestion, to foreach through a relation that contains the field names, is not correct. (Creating a UDF for this always works.) Let me think about it.
But this macro works as is if you call it once for each field name you want to group by.
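As a sketch of what calling the macro once per field might look like, assuming the macro above is saved as cnt.macro and that the data has columns named var1, var2 and var3 as in the question's list (the schema here is hypothetical):

```pig
IMPORT 'cnt.macro';

-- Hypothetical schema: only three of the 30 variables are shown.
data = LOAD 'data.txt' USING PigStorage(',') AS (var1:chararray, var2:chararray, var3:chararray);

-- One macro call per grouping field; each call expands to a GROUP plus COUNT.
data_var1 = my_cnt(data, 'var1');
data_var2 = my_cnt(data, 'var2');
data_var3 = my_cnt(data, 'var3');

DUMP data_var1;
```

This still means 30 calls, but each is a single line instead of a repeated GROUP and FOREACH block.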

apache pig unable to perform grouping and counting

I am new to Pig scripting. Please help me out with this issue.
I have no clue where I am going wrong.
My data:
(catA,myid_1,2014,store1,appl)
(catA,myid_2,2014,store1,milk)
(catA,myid_3,2014,store1,appl)
(catA,myid_4,2014,store1,milk)
(catA,myid_5,2015,store1,milk)
(catB,myid_6,2014,store2,milk)
(catB,myid_7,2014,store2,appl)
Below is the expected result:
(catA,2014,milk,2)
(catA,2014,apple,2)
(catA,2015,milk,1)
(catB,2014,milk,1)
(catB,2014,apple,1)
I need to count the number of food items per category and year.
Below is my Pig script:
list = LOAD 'shop' USING PigStorage(',') AS (category:chararray,id:chararray,mdate:chararray,my_store:chararray,item:chararray);
list_of = FOREACH list GENERATE category,SUBSTRING(mdate,0,4) as my_date,my_store,item;
StoreG = GROUP list_of BY (category,my_date,my_store);
result = FOREACH StoreG
{
food_list = FOREACH list_of GENERATE item;
food_count = DISTINCT food_list;
GENERATE FLATTEN(group) AS (category,my_date,my_store),COUNT(food_count);
}
DUMP result;
My output for the above script is below:
(catA,2014,store1,2)
(catA,2015,store1,1)
(catB,2014,store2,2)
Could anyone please let me know where I am going wrong in my script?
Thanks
One way to do it. Not the most elegant, but a working example:
list = LOAD 'shop' USING PigStorage(',') AS (category:chararray,id:chararray,mdate:chararray,my_store:chararray,item:chararray);
list_of = FOREACH list GENERATE category,SUBSTRING(mdate,0,4) AS my_date,my_store,item;
StoreG = GROUP list_of BY (category,my_date,my_store,item);
result = FOREACH StoreG GENERATE
group.category AS category,
group.my_date AS my_date,
group.my_store AS my_store,
group.item AS item,
COUNT(list_of.item) AS nb_items;
DUMP result;
Adding the item alias to the GROUP BY key is basically the same as finding the distinct items and then counting them (as you already did in the nested block).
If you still want to use your own code, simply add food_list.item to the GENERATE statement, as below:
result = FOREACH StoreG
{
food_list = FOREACH list_of GENERATE item;
food_count = DISTINCT food_list;
GENERATE FLATTEN(group) AS (category,my_date,my_store),food_list.item,COUNT(food_count);
}
StoreG = GROUP list_of BY (category,my_date,my_store);
should be
StoreG = GROUP list_of BY (category,my_date,item);
since your expected results group by item, not store.

How do I get the matching values inside a for loop using FILTER in PIG?

Consider this as my input:
Input (File1):
12345;11
34567;12
.
.
Input (File2):
11;(1,2,3,4,5,6,7,8,9)
12;(9,8,7,6,5,4,3,2,1)
.
.
I would like to get the output as follows:
Output:
(1,2,3,4,5,6,7,8,9)
(9,8,7,6,5,4,3,2,1)
Here's the sample code I have tried using FILTER, but I get errors with it. Please suggest some other options.
data1 = load '/File1' using PigStorage(';') as (id,number);
data2 = load '/File2' using PigStorage(';') as (numberInfo, collection);
out = foreach data1{
Data_filter = FILTER data2 by (numberInfo matches CONCAT(number,''));
generate Data_filter;
}
Is it possible to do this inside a for loop? Please let me know. Thanks in advance!
There are no for loops in Apache Pig. If you need to iterate through each row of the data for some specific purpose, you need to implement your own UDF. The foreach keyword is not used to create a loop; it is used to transform your data based on your columns, applying UDFs to it. You can also use a nested foreach, where you perform operations over each group in your relation.
However, your syntax is wrong. You are trying to use a nested foreach without grouping your data first. A nested foreach performs the operations you define in the block of code over a grouped relation. Therefore, the only way your code could work is by grouping the data first:
data1 = load '/File1' using PigStorage(';') as (id,number);
data2 = load '/File2' using PigStorage(';') as (numberInfo, collection);
data1 = group data1 by id;
out = foreach data1{
Data_filter = FILTER data2 by (numberInfo matches CONCAT(number,''));
generate Data_filter;
}
However, this won't work because inside a nested foreach you cannot refer to a different relation such as data2.
What you really want is a JOIN operation over both relations, using number for data1 and numberInfo for data2. This will give you:
joined_data = join data1 by number, data2 by numberInfo;
dump joined_data;
(12345,11,11,(1,2,3,4,5,6,7,8,9))
(34567,12,12,(9,8,7,6,5,4,3,2,1))
In your question you said you only wanted the last column as output, so now you can use a foreach to generate the column you want:
final_data = foreach joined_data generate data2::collection;
dump final_data;
((1,2,3,4,5,6,7,8,9))
((9,8,7,6,5,4,3,2,1))

Finding Unique visitors to a webpage

I want to write a Pig script that finds the number of unique userids that visit a particular webpage.
Table definition: a = (userid:chararray, otherid:chararray, webpage:chararray)
This is what I wrote, but it doesn't work:
a = (userid:chararray, otherid:chararray, webpage:chararray)
group_by_page = GROUP a by webpage ;
count_d = FOREACH group_by_page GENERATE group, count(distinct(a.userid));
You need to use the DISTINCT inside a nested foreach; it's not a UDF. This should get you where you need to go:
a = LOAD 'input' AS (userid:chararray, otherid:chararray, webpage:chararray);
group_by_page = GROUP a by webpage;
count_d = FOREACH group_by_page {
uniq = DISTINCT a.userid;
GENERATE group, COUNT(uniq);
};
See the Pig documentation to learn more about nested foreach.

Can I use "filter by" with Map structure in hadoop - PIG?

Provided that there's a map like this:
map.text
[key1#v1]
[key2#v2]
[key3#v3]
then, if I try to find the value of 'key2':
A = load 'map.text' as (M:map[]);
B = foreach A generate M#'key2';
C = filter B by $0!=''; -- to get rid of empty values like (), (), ().
dump C;
Is there any other way to find key2, using 'filter by' only?
Thanks.
There is no need to GENERATE a field and then use it in a FILTER; you can include it in the FILTER statement to begin with:
A = load 'map.text' as (M:map[]);
B = filter A by M#'key2' != '';
dump B;
On your data, this returns one record:
([key2#v2])
As a side note, if empty strings are ever valid values, you may prefer the criterion BY M#'key2' IS NOT NULL.
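A short sketch of that null-based variant, reusing the map.text file and the relation names from the answer above:

```pig
A = LOAD 'map.text' AS (M:map[]);
-- A map lookup returns null when the key is absent, so this keeps
-- only the records that actually contain key2, whatever its value.
B = FILTER A BY M#'key2' IS NOT NULL;
DUMP B;
```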
