Max/Min for whole sets of records in PIG - hadoop

I have a set of records that I am loading from a file, and the first thing I need to do is get the max and min of a column.
In SQL I would do this with a subquery like this:
select c.state, c.population,
(select max(c.population) from state_info c) as max_pop,
(select min(c.population) from state_info c) as min_pop
from state_info c
I assume there must be an easy way to do this in Pig as well, but I'm having trouble finding it. Pig has MAX and MIN functions, but when I tried the following it didn't work:
records=LOAD '/Users/Winter/School/st_incm.txt' AS (state:chararray, population:int);
with_max = FOREACH records GENERATE state, population, MAX(population);
I had better luck adding an extra column with the same value to each row, grouping on that column, and then taking the max of that group. This seems like a convoluted way of getting what I want, so I thought I'd ask if anyone knows a simpler way.
Thanks in advance for the help.

As you said, you need to group all the data together, but no extra column is required if you use GROUP ALL.
Pig
records = LOAD 'states.txt' AS (state:chararray, population:int);
records_group = GROUP records ALL;
with_max = FOREACH records_group
GENERATE
FLATTEN(records.(state, population)), MAX(records.population);
Input
CA 10
VA 5
WI 2
Output
(CA,10,10)
(VA,5,10)
(WI,2,10)
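If you need the minimum as well, mirroring the min_pop/max_pop subqueries in the question, both aggregates can be computed in the same FOREACH. A minimal sketch (the with_stats alias and the AS names are illustrative, not from the original answer):
records = LOAD 'states.txt' AS (state:chararray, population:int);
-- GROUP ALL collects every record into a single group,
-- so MAX/MIN run over the whole relation
records_group = GROUP records ALL;
with_stats = FOREACH records_group
GENERATE
FLATTEN(records.(state, population)),
MAX(records.population) AS max_pop,
MIN(records.population) AS min_pop;
With the input above, this should give (CA,10,10,2), (VA,5,10,2), (WI,2,10,2).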

Related

How to filter clickhouse table by array column contents?

I have a clickhouse table that has one Array(UInt16) column. I want to be able to filter results from this table to only get rows where the values in the array column are above a threshold value. I've been trying to achieve this using some of the array functions (arrayFilter and arrayExists) but I'm not familiar enough with the SQL/Clickhouse query syntax to get this working.
I've created the table using:
CREATE TABLE IF NOT EXISTS ArrayTest (
date Date,
sessionSecond UInt16,
distance Array(UInt16)
) Engine = MergeTree(date, (date, sessionSecond), 8192);
Where the distance values will be distances from a certain point at a certain number of seconds (sessionSecond) after the date. I've added some sample values to the table.
Now I want to get all rows which contain distances greater than 7. I found the array operators documentation here and tried the arrayExists function, but it's not working how I'd expect. The documentation says that this function "Returns 1 if there is at least one element in 'arr' for which 'func' returns something other than 0. Otherwise, it returns 0". But when I run the query below I get three zeros returned, where I should get a 0 and two ones:
SELECT arrayExists(
val -> val > 7,
arrayEnumerate(distance))
FROM ArrayTest;
Eventually I want to perform this select and then join it with the table contents to only return rows that have exists = 1, but I need this first step to work before that. Am I using arrayExists wrong? What I found more confusing is that when I change the comparison value to 2, I get all 1s back. Can this kind of filtering be achieved using the array functions?
Thanks
You can use arrayExists in the WHERE clause.
SELECT *
FROM ArrayTest
WHERE arrayExists(x -> x > 7, distance) = 1;
Another way is to use ARRAY JOIN, if you need to know which values are greater than 7:
SELECT d, distance, sessionSecond
FROM ArrayTest
ARRAY JOIN distance as d
WHERE d > 7
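A third option, since the question also mentions arrayFilter: keep one row per record and project only the matching values. A small sketch against the same table (the far_distances alias is just illustrative):
SELECT date, sessionSecond,
       arrayFilter(x -> x > 7, distance) AS far_distances -- keep only values above the threshold
FROM ArrayTest
WHERE arrayExists(x -> x > 7, distance);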
I think the reason why you get 3 zeros is that arrayEnumerate enumerates over the array indexes, not the array values; since none of your rows have more than 7 elements, no index is greater than 7, so arrayExists returns 0 for all the rows.
To make this version work, index into the array with the enumerated values:
SELECT arrayExists(
val -> distance[val] > 7,
arrayEnumerate(distance))
FROM ArrayTest;

Pig latin join by field

I have a Pig Latin related problem:
I have this data below (in one row):
A = LOAD 'records' AS (f1:chararray, f2:chararray,f3:chararray, f4:chararray,f5:chararray, f6:chararray);
DUMP A;
(FITKA,FINVA,FINVU,FEEVA,FETKA,FINVA)
Now I have another dataset:
B = LOAD 'values' AS (f1:chararray, f2:chararray);
Dump B;
(FINVA,0.454535)
(FITKA,0.124411)
(FEEVA,0.123133)
And I would like to join those two datasets, taking the corresponding value from dataset B and placing it beside the matching value from dataset A. The expected output is below:
FITKA 0.123133, FINVA 0.454535 and so on ..
(They can also be like: FITKA, 0.123133, FINVA, 0.454535 and so on .. )
And then I would be able to multiply the values (0.123133 x 0.454535, and so on) because they are on the same row now, which is what I want.
Of course I can join column by column, but then the values end up appended at the end of the row and I have to clean up with another FOREACH ... GENERATE. I want a simpler solution without too many joins, which may cause performance issues.
Dataset A is text (a sentence, in a way).
So what are my options to achieve this?
Any help would be nice.
A sentence can be represented as a tuple that contains a bag of (word, count) tuples.
Therefore, I suggest you change the way you store your data to the following format:
sentence:tuple(words:bag{wordcount:tuple(word, count)})
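With the data as currently loaded, a rough sketch of the join itself (TOBAG and FLATTEN turn the row into one tuple per word; the relation names are illustrative and this assumes word order doesn't need to be preserved):
-- one row per word: (FITKA), (FINVA), ...
A_words = FOREACH A GENERATE FLATTEN(TOBAG(f1, f2, f3, f4, f5, f6)) AS word:chararray;
-- look up each word's value in B
joined = JOIN A_words BY word, B BY f1;
result = FOREACH joined GENERATE A_words::word AS word, B::f2 AS value;
Duplicate words (FINVA appears twice in the sentence) each get their own row, which is what you want if you intend to multiply all the values afterwards.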

pig - need tips after performance tuning gone wrong

I have a Pig script that took around 10 minutes to finish and I thought that there was still room for some performance improvement.
So, I started by putting the JOINs and GROUPs in a nested FOREACH and also putting the previous FILTERs inside the same FOREACH.
I also added using 'replicated'.
The problem now is that instead of taking 10 minutes, it's taking over 30 minutes.
Is there a place that has best practices and performance improvement tips besides PIG's documentation?
So that you can get a better picture, here's some code:
--before
previous_join = JOIN A by id, B by id; --for simplification
filtering = FILTER previous_join BY ((year_min > 1995 ? year_min - 1 : year_min) <= list_year and (year_max > 2015 ? year_max - 1 : year_max) >= list_year);
final_filtered = FOREACH filtering GENERATE user_id as user_id, list_year;
--after
final_filtered = FOREACH (JOIN A by id, B by id) {
tmp = FILTER group BY ((A::year_min > 1995 ? A::year_min - 1 : A::year_min) <= B::list_year and (A::year_max > 2015 ? A::year_max - 1 : A::year_max) >= B::list_year and A::premium == 'true');
GENERATE A::user_id AS user_id, B::list_year AS list_year;
};
Am I doing something wrong or is this the wrong approach?
Thanks.
In the prior case [before], you are performing the filter and projection after the join is performed.
It would be helpful to log the time taken by each operation and identify the bottleneck.
Can you also try splitting your filter statement into multiple relations rather than just one, and check the difference in filter timing?
filter_by_min_year = FILTER previous_join BY ((A::year_min > 1995 ? A::year_min - 1 : A::year_min) <= B::list_year);
filter_by_max_year = FILTER filter_by_min_year BY ((A::year_max > 2015 ? A::year_max - 1 : A::year_max) >= B::list_year);
Overall you want to find ids (plus some more columns) with A::year_min <= B::list_year and A::year_max >= B::list_year.
Instead of performing the join on raw A and B, you can try projecting both of them down to only the columns needed for the join and later operations.
A_projected = foreach A generate id, year_min, year_max;
B_projected = foreach B generate id, list_year;
C = join A_projected by id, B_projected by id USING 'replicated';
If either A_projected or B_projected is a small set that can be loaded into memory, use a replicated join; I am assuming B_projected is the smaller of the two.
If this doesn't apply to your case, please skip this option.
Also, you can try setting the number of reducers to be used for this join with the PARALLEL keyword.
After applying the filter you will get a list of the required ids, which you can use to fetch other information from A or B.
Also consider tweaking MapReduce properties like io.sort.mb, mapred.job.shuffle.input.buffer.percent, etc.
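For reference, those last two suggestions look roughly like this in a script; the reducer count and property values are placeholders, not recommendations (and note that PARALLEL only affects a reduce-side join, so it doesn't combine with 'replicated'):
SET io.sort.mb 512;
SET mapred.job.shuffle.input.buffer.percent 0.70;
-- reduce-side join with an explicit number of reducers
C = join A_projected by id, B_projected by id PARALLEL 10;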
Hope this helps.

Pig - how to select only some values from the list (not just simple distinct)?

Let's say I have input_file.txt (user_id, event_code, event_date):
1,a,1
1,b,2
2,a,3
2,b,4
2,b,5
2,b,6
2,c,7
2,b,8
as you can see, user_id = 2 has events like this: abbbcb
I'd like to have a result like this:
1,{(a,1),(b,2)}
2,{(a,2),(b,6),(c,7),(b,8)}
So when there are several consecutive events with the same code, I'd like to take only the last one.
Can you please share any hints?
Regards
Pawel
The main thing you are describing is what GROUP BY does.
In this case:
B = GROUP A BY user_id;
Gets your records together by user_id; the bag holds each user's complete tuples. Your data will now look like this:
(1,{(1,a,1),(1,b,2)})
(2,{(2,a,3),(2,b,4),(2,b,5),(2,b,6),(2,c,7),(2,b,8)})
You say you only want the last one (I assume you mean the one with the greatest event_date). To do this, you can do a nested FOREACH with an ORDER BY to sort by date, and then take the first one with LIMIT. Note that this has arbitrary behavior when there are ties.
C = FOREACH B {
DA = ORDER A BY event_date DESC;
DB = LIMIT DA 1;
GENERATE FLATTEN(group), FLATTEN(DB.event_code), FLATTEN(DB.event_date);
}
Your data should now look like this:
1,b,2
2,b,8
Another option would be to use a UDF to write some custom behavior on the groups given by GROUP BY:
B = GROUP A BY user_id;
C = FOREACH B GENERATE YourUDFThatYouBuilt(group, A);
In that UDF you'd write whatever custom behavior you want (in this case, return the tuple with the greatest date).
It seems like you could use the DistinctBy UDF from Apache DataFu to achieve this. This UDF, given a bag, returns the first instance found for a given field. In your case the field you care about is event_code. But we have to reverse the order, as you actually want the last instance.
One clarification though. Correct me if I'm wrong, but I think the intended output is:
1,{(a,1),(b,2)}
2,{(a,3),(b,6),(c,7),(b,8)}
That is, the a event for member 2 occurs at event_date 3, not 2, so the intended output should contain (a,3).
Here's how you can do it:
-- pass in 1 because we want distinct by event code (position 1)
define DistinctBy datafu.pig.bags.DistinctBy('1');
C = FOREACH (GROUP A BY user_id) {
-- reverse so we can take the last event code occurrence
A_reversed = ORDER A BY event_date DESC;
-- use DistinctBy to get the first tuple having an occurrence of a field value
A_distinct_by_code = DistinctBy(A_reversed);
-- put back in order again
A_ordered = ORDER A_distinct_by_code BY event_date ASC;
GENERATE group as user_id, A_ordered.(event_code,event_date);
}

how to create set of values, after group function in Pig (Hadoop)

Let's say I have a set of values in file.txt:
a,b,c
a,b,d
k,l,m
k,l,n
k,l,o
And my code is:
file = LOAD 'file.txt' using PigStorage(',');
events = foreach file generate session_id, user_id, code, type;
gr = group events by (session_id, user_id);
and I have this set of values:
((a,b),{(a,b,c),(a,b,d)})
((k,l),{(k,l,m),(k,l,n),(k,l,o)})
And I'd like to have:
(a,b,(c,d))
(k,l,(m,n,o))
Have you got any idea how to do it?
Regards
Pawel
Note: you are inconsistent in your question. You say session_id, user_id, code, type in the FOREACH line, but your LOAD with PigStorage doesn't provide a schema for those names. Also, that FOREACH has 4 values, while your sample data only has 3. I'll assume that type doesn't exist in order to answer your question.
After your gr relation, you are left with the group-by key (in this case (session_id, user_id)) in an automatically generated tuple called group.
So, first step: gr2 = FOREACH gr GENERATE FLATTEN(group);
This will give you the tuples (a,b) and (k,l). You need to use FLATTEN because group is a tuple and you are asking for session_id and user_id to be individual columns. FLATTEN does that for you.
Ok, so now modify the gr2 line to also use a projection to tease out the third value:
gr2 = FOREACH gr GENERATE FLATTEN(group), events.code;
events.code creates a bag out of all the code values. events is the name of the bag of grouped tuples (it's named after the original relation).
This should give you:
(a, b, {(c),(d)})
(k, l, {(m),(n),(o)})
It's very important to note that the values in the list are in a bag, not a tuple as you asked for. Keeping them in a bag is the right idea, because a bag is a variable-length list while a tuple is not.
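If a later step really does need a tuple, one option is the BagToTuple UDF from Apache DataFu (assumed available here; it is not a Pig built-in):
define BagToTuple datafu.pig.bags.BagToTuple();
-- turns the bag {(c),(d)} into the tuple (c,d)
gr3 = FOREACH gr GENERATE FLATTEN(group), BagToTuple(events.code);
That should produce (a,b,(c,d)) and (k,l,(m,n,o)), which is the exact shape you asked for.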
Additional advice: Understanding how GROUP BY outputs data is something I see a lot of people struggle with when first using Pig. If you think my answer doesn't make much sense, I'd recommend spending some time to really get to understand GROUP BY. Understanding versus thinking it is magic will pay off in the long run.
