Pig Heaping on approx 185 gigs (202260828 records) - hadoop

I am still pretty new to PIG but I understand the basic idea of map/reduce jobs. I am trying to figure out some statistics for a user based on some simple logs. We have a utility that parses out fields from the log and I am using DataFu to figure out the variance and quartiles.
My script is as follows:
log = LOAD '$data' USING SieveLoader('node', 'uid', 'long_timestamp');
log_map = FILTER log BY $0 IS NOT NULL AND $0#'uid' IS NOT NULL;
--Find all users
SPLIT log_map INTO cloud IF $0#'node' MATCHES '*.mis01*', dev OTHERWISE;
--For the real cloud
cloud = FOREACH cloud GENERATE $0#'uid' AS uid, $0#'long_timestamp' AS long_timestamp:long, 'dev' AS domain, '192.168.0.231' AS ldap_server;
dev = FOREACH dev GENERATE $0#'uid' AS uid, $0#'long_timestamp' AS long_timestamp:long, 'dev' AS domain, '10.0.0.231' AS ldap_server;
modified_logs = UNION dev, cloud;
--Calculate user times
user_times = FOREACH modified_logs GENERATE *, ToDate((long)long_timestamp) as date;
--Based on weekday/weekend
aliased_user_times = FOREACH user_times GENERATE *, GetYear(date) AS year:int, GetMonth(date) AS month:int, GetDay(date) AS day:int, GetWeekOrWeekend(date) AS day_of_week, long_timestamp % (24*60*60*1000) AS miliseconds_into_day;
--Based on actual day of week
--aliased_user_times = FOREACH user_times GENERATE *, GetYear(date) AS year:int, GetMonth(date) AS month:int, GetDay(date) AS day:int, GetDayOfWeek(date) AS day_of_week, long_timestamp % (24*60*60*1000) AS miliseconds_into_day;
user_days = GROUP aliased_user_times BY (uid, ldap_server,domain, year, month, day, day_of_week);
some_times_by_day = FOREACH user_days GENERATE FLATTEN(group) AS (uid, ldap_server, domain, year, month, day, day_of_week), MAX(aliased_user_times.miliseconds_into_day) AS max, MIN(aliased_user_times.miliseconds_into_day) AS min;
times_by_day = FOREACH some_times_by_day GENERATE *, max-min AS time_on;
times_by_day_of_week = GROUP times_by_day BY (uid, ldap_server, domain, day_of_week);
STORE times_by_day_of_week INTO '/data/times_by_day_of_week';
--New calculation, mean, var, std_d, (min, 25th quartile, 50th (aka median), 75th quartile, max)
averages = FOREACH times_by_day_of_week GENERATE FLATTEN(group) AS (uid, ldap_server, domain, day_of_week), 'USER' as type, AVG(times_by_day.min) AS start_avg, VAR(times_by_day.min) AS start_var, SQRT(VAR(times_by_day.min)) AS start_std, Quartile(times_by_day.min) AS start_quartiles;
--AVG(times_by_day.max) AS end_avg, VAR(times_by_day.max) AS end_var, SQRT(VAR(times_by_day.max)) AS end_std, Quartile(times_by_day.max) AS end_quartiles, AVG(times_by_day.time_on) AS hours_avg, VAR(times_by_day.time_on) AS hours_var, SQRT(VAR(times_by_day.time_on)) AS hours_std, Quartile(times_by_day.time_on) AS hours_quartiles ;
STORE averages INTO '/data/averages';
I've seen that other people have had problems with DataFu calculating multiple quantiles at once so I am only trying to calculate one at a time. The custom loader loads one line at a time, passes it through a utility which converts it into a map and there is a small UDF that checks to see if a date is a weekday or a weekend (originally we wanted to get statistics based on day of week, but loading enough data to get interesting quartiles was killing the map/reduce tasks.
Using Pig 0.11

It looks like my specific problem was due to trying to calculate the min and the max in one PigLatin line. Splitting the work into two different commands and then joining them seems to have fixed my memory problem

Related

Wrong sorting while using Query function

I've been trying to do a report about the quantity of breakdonws of products in our company. The problem is that the QUERY function is operating as normal, but the sorting order is well - a bit strange.
The data I'm trying to sort are as follows (quantities are blacked out since I cannot share those informations):
Raw data
First column - name of the product, second, it's EAN code, third, breakdown rate for last year, last column - average breakdown rate. "b/d" means "brak danych" or no data.
What I want to achieve is to get the end table with values sorted by average breakdown rate.
My query is as follows:
=query(Robocze!A2:D;"select A where A is not null and NOT D contains 'b/d' order by D desc")
Final result
As You can see, we have descending order, but there are strange artifacts - like the 33.33% after 4,00% and before 3,92%.
Why is that!?
try:
=INDEX(LAMBDA(x; SORT(x; INDEX(x;; 4)*1; 0))
(QUERY(Robocze!A2:D; "where A is not null and NOT D contains 'b/d'"; 0));; 4)

nested for loops in stata

I am having trouble to understand why a for loop construction does not work. I am not really used to for loops so I apologize if I am missing something basic. Anyhow, I appreciate any piece of advice you might have.
I am using a party level dataset from the parlgov project. I am trying to create a variable which captures how many times a party has been in government before the current observation. Time is important, the counter should be zero if a party has not been in government before, even if after the observation period it entered government multiple times. Parties are nested in countries and in cabinet dates.
The code is as follows:
use "http://eborbath.github.io/stackoverflow/loop.dta", clear //to get the data
if this does not work, I also uploaded in a csv format, try:
import delimited "http://eborbath.github.io/stackoverflow/loop.csv", bindquote(strict) encoding(UTF-8) clear
The loop should go through each country-specific cabinet date, identify the previous observation and check if the party has already been in government. This is how far I have got:
gen date2=cab_date
gen gov_counter=0
levelsof country, local(countries) // to get to the unique values in countries
foreach c of local countries{
preserve // I think I need this to "re-map" the unique cabinet dates in each country
keep if country==`c'
levelsof cab_date, local(dates) // to get to the unique cabinet dates in individual countries
restore
foreach i of local dates {
egen min_date=min(date2) // this is to identify the previous cabinet date
sort country party_id date2
bysort country party_id: replace gov_counter=gov_counter+1 if date2==min_date & cabinet_party[_n-1]==1 // this should be the counter
bysort country: replace date2=. if date2==min_date // this is to drop the observation which was counted
drop min_date //before I restart the nested loop, so that it again gets to the minimum value in `dates'
}
}
The code works without an error, but it does not do the job. Evidently there's a mistake somewhere, I am just not sure where.
BTW, it's a specific application of a problem I super often encounter: how do you count frequencies of distinct values in a multilevel data structure? This is slightly more specific, to the extent that "time matters", and it should not just sum all encounters. Let me know if you have an easier solution for this.
Thanks!
The problem with your loop is that it does not keep the replaced gov_counter after the loop. However, there is a much easier solution I'd recommend:
sort country party_id cab_date
by country party_id: gen gov_counter=sum(cabinet_party[_n-1])
This sorts the data into groups and then creates a sum by group, always up to (but not including) the current observation.
I would start here. I have stripped the comments so that we can look at the code. I have made some tiny cosmetic alterations.
foreach i of local dates {
egen min_date = min(date2)
sort country party_id date2
bysort country party_id: replace gov_counter=gov_counter+1 ///
if date2 == min_date & cabinet_party[_n-1] == 1
bysort country: replace date2 = . if date2 == min_date
drop min_date
}
This loop includes no reference to the loop index i defined in the foreach statement. So, the code is the same and completely unaffected by the loop index. The variable min_date is just a constant for the dataset and the same each time around the loop. What does depend on how many times the loop is executed is how many times the counter is incremented.
The fallacy here appears to be a false analogy with constructs in other software, in which a loop automatically spawns separate calculations for different values of a loop index.
It's not illegal for loop contents never to refer to the loop index, as is easy to see
forval j = 1/3 {
di "Hurray"
}
produces
Hurray
Hurray
Hurray
But if you want different calculations for different values of the loop index, that has to be explicit.

Epoch time difference in Pig

I have 3 columns which contains start_time , end_time and tags. Times are represented in epoch time format as shown in example below. I want to find the the rows which have 1 hour time difference between them.
Example:
Start_time End_Time Tags
1235000081 1235000501 "Answered"
1235000081 1235000551 "Answered"
I need to fetch the tags column if the time diff is less than an hour.
I want do it in PIG - can anyone kindly help?
input.txt
1235000081 1235000501 Answered
1235000081 1235000551 Answered
pig script
A = Load '/home/kishore/input.txt' as (col1:long, col2:long, col3:chararray);
B = Foreach A generate ToDate(col1) as startdate,ToDate(col2) as enddate,col3;
C = Filter B by GetHour(enddate)-GetHour(startdate) == 1;
Dump C;
you can filter the row based on your condition like >,< ,==
In case if you want to keep date fields as timestamps the solution is following:
data = LOAD '/path/to/your/input' as (Start_Time:long, End_Time:long, Tags:chararray);
data_proc = FOREACH data GENERATE *, ToDate(Start_Time*1000) as Start_Time,ToDate(End_Time*1000) as End_Time;
output = FILTER data_proc BY GetHour(End_Time)-GetHour(Start_Time) == 1;
Dump #;
The one crucial thing is that Pig ToDate UDF needs a timestamp up to milliseconds precision thus you will have simply multiply your date fields by 1000 before using this UDF.

Pig case statement for finding no. Of events in a specific period of time

Pig case statement for finding no. Of events in a specific period of time.
There is a dataset which is like a movie data base bearing movies, rating, duration of movie, year of release.
The question is that how do u find the no. Of movies released during 10 years of span?
The dataset is comma separated.
Movie = load '/home/movie/movies.txt' using PigStorage(',') as (movieid:int, moviename:chararray, yearrelease:int, ratingofmovie:float, moviedurationinsec:float);
movies_released_between_2000_2010 = filter Movie by yearofrelease >2000 and yearofrelease < 2010;
result = foreach movies_released_between_2000_2010 generate moviename,yearofrelease;
dump result;
year_count = FOREACH movie GENERATE (case when year>2000 and year<2010 then 1 else 0 end) as year_flag,movie_name;
year_grp = GROUP year_count BY year_flag;
movie_count_out = FOREACH year_grp GENERATE group,COUNT(year_flag);
The above example can help you give an understanding of the solution, there might be some syntax errors tough. If you need to group on the basis of decade then you can use a substring function on top of year and get the specific range.

Subtract One row's value from another row in Pig

I'm trying to develop a sample program using Pig to analyse some log files. I want to analyze the running time of different jobs. When I read in the log file of the job, I get the start time and the end time of the job, like this:
(Wed,03/20/13,01:03:37,EDT)
(Wed,03/20/13,01:05:00,EDT)
Now, to calculate the elapsed time, I need to subtract these 2 timestamps, but since both timestamps are in the same bag, I'm not sure how to compare them. So I'm looking for an idea on how to do this. thanks!
Is there a unique ID for the job that is in both log lines? Also is there something to indicate which event is start, and which is end?
If so, you could read the dataset twice, once for start events, once for end-events, and join the two together. Then you'll have one record with both events in it.
so:
A = FOREACH logline GENERATE id, type, timestamp;
START = FILTER A BY (type == 'start');
END = FILTER A BY (type == 'end');
JOINED = JOIN START by ID, END by ID;
DIFF = FOREACH JOINED GENERATE (START.timestamp - END.timestamp); // or whatever;

Resources