Subtract One row's value from another row in Pig - hadoop

I'm trying to develop a sample program using Pig to analyse some log files. I want to analyze the running time of different jobs. When I read in the log file of the job, I get the start time and the end time of the job, like this:
(Wed,03/20/13,01:03:37,EDT)
(Wed,03/20/13,01:05:00,EDT)
Now, to calculate the elapsed time, I need to subtract these 2 timestamps, but since both timestamps are in the same bag, I'm not sure how to compare them. So I'm looking for an idea on how to do this. Thanks!

Is there a unique ID for the job that appears in both log lines? Also, is there something to indicate which event is the start and which is the end?
If so, you could read the dataset twice, once for start events and once for end events, and join the two together. Then you'll have one record with both events in it.
so:
A = FOREACH logline GENERATE id, type, timestamp;
START = FILTER A BY (type == 'start');
END = FILTER A BY (type == 'end');
JOINED = JOIN START BY id, END BY id;
DIFF = FOREACH JOINED GENERATE END::timestamp - START::timestamp; -- or whatever
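If the timestamps really are chararrays like the sample above, a minimal sketch of the subtraction itself (not tested; the load schema, field names, and file path are assumptions) could convert them to datetimes with Pig's built-in ToDate and take the difference with MilliSecondsBetween:
-- sketch only: the schema and field names are assumed, adjust to the real log layout
A = LOAD 'job_log' USING PigStorage(',') AS (id:chararray, type:chararray, day:chararray, d:chararray, t:chararray, tz:chararray);
EVENTS = FOREACH A GENERATE id, type, ToDate(CONCAT(d, t), 'MM/dd/yyHH:mm:ss') AS ts;
START_EV = FILTER EVENTS BY type == 'start';
END_EV = FILTER EVENTS BY type == 'end';
JOINED = JOIN START_EV BY id, END_EV BY id;
ELAPSED = FOREACH JOINED GENERATE START_EV::id AS id, MilliSecondsBetween(END_EV::ts, START_EV::ts) AS elapsed_ms;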

Related

nested for loops in stata

I am having trouble understanding why a for loop construction does not work. I am not really used to for loops, so I apologize if I am missing something basic. Anyhow, I appreciate any advice you might have.
I am using a party-level dataset from the parlgov project. I am trying to create a variable which captures how many times a party has been in government before the current observation. Time is important: the counter should be zero if a party has not been in government before, even if it entered government multiple times after the observation period. Parties are nested in countries and in cabinet dates.
The code is as follows:
use "http://eborbath.github.io/stackoverflow/loop.dta", clear //to get the data
If this does not work, I have also uploaded it in CSV format; try:
import delimited "http://eborbath.github.io/stackoverflow/loop.csv", bindquote(strict) encoding(UTF-8) clear
The loop should go through each country-specific cabinet date, identify the previous observation and check if the party has already been in government. This is how far I have got:
gen date2=cab_date
gen gov_counter=0
levelsof country, local(countries) // to get to the unique values in countries
foreach c of local countries {
    preserve // I think I need this to "re-map" the unique cabinet dates in each country
    keep if country==`c'
    levelsof cab_date, local(dates) // to get to the unique cabinet dates in individual countries
    restore
    foreach i of local dates {
        egen min_date=min(date2) // this is to identify the previous cabinet date
        sort country party_id date2
        bysort country party_id: replace gov_counter=gov_counter+1 if date2==min_date & cabinet_party[_n-1]==1 // this should be the counter
        bysort country: replace date2=. if date2==min_date // this is to drop the observation which was counted
        drop min_date // before I restart the nested loop, so that it again gets to the minimum value in `dates'
    }
}
The code works without an error, but it does not do the job. Evidently there's a mistake somewhere; I am just not sure where.
BTW, this is a specific application of a problem I encounter very often: how do you count frequencies of distinct values in a multilevel data structure? This is slightly more specific, to the extent that "time matters", and it should not just sum all encounters. Let me know if you have an easier solution for this.
Thanks!
The problem with your loop is that it does not keep the replaced gov_counter after the loop. However, there is a much easier solution I'd recommend:
sort country party_id cab_date
by country party_id: gen gov_counter=sum(cabinet_party[_n-1])
This sorts the data into groups and then creates a sum by group, always up to (but not including) the current observation.
I would start here. I have stripped the comments so that we can look at the code. I have made some tiny cosmetic alterations.
foreach i of local dates {
    egen min_date = min(date2)
    sort country party_id date2
    bysort country party_id: replace gov_counter=gov_counter+1 ///
        if date2 == min_date & cabinet_party[_n-1] == 1
    bysort country: replace date2 = . if date2 == min_date
    drop min_date
}
This loop includes no reference to the loop index i defined in the foreach statement. So, the code is the same and completely unaffected by the loop index. The variable min_date is just a constant for the dataset and the same each time around the loop. What does depend on how many times the loop is executed is how many times the counter is incremented.
The fallacy here appears to be a false analogy with constructs in other software, in which a loop automatically spawns separate calculations for different values of a loop index.
It's not illegal for loop contents never to refer to the loop index, as is easy to see:
forval j = 1/3 {
    di "Hurray"
}
produces
Hurray
Hurray
Hurray
But if you want different calculations for different values of the loop index, that has to be explicit.

How to insert previous month of data into database Nifi?

I have data in which I need to compare the month: if it is the previous month, then the row should be inserted; otherwise not.
Example:
23.12.2016 12:02:23,Koji,24
22.01.2016 01:21:22,Mahi,24
Now I need to take the first column of the data (23.12.2016 12:02:23) and get the month (12) from it.
Then compare that with the month before the current month, like this:
If the current month is 'JAN_2017', then the month before 'JAN_2017' is 'Dec_2016'.
For the first row,
compare this 'Dec_2016' (the month before) with the month of the data, 'Dec_2016' (from 23.12.2016).
It matches, so insert into the database.
EDIT 1:
I have already tried your suggestion:
"UpdateAttribute to add a new attribute with the previous month value, and then RouteOnAttribute to determine if the flowfile should be inserted "
I have used the below Expression Language in RouteOnAttribute:
${literal('Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec'):getDelimitedField(${csv.1:toDate('dd.MM.yyyy hh:mm:ss'):format('MM')}):equals(${literal('Dec,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov'):getDelimitedField(${now():toDate(' Z MM dd HH:mm:ss.SSS yyyy'):format('MM'):toNumber()})})}
It fails on the data below:
23.12.2015,Andy,21
23.12.2017,Present,32
My data may contain some past years and future years.
These match my expression, so they also get inserted.
I need to check the month together with the year in the data.
How can I check it?
The easiest answer is to use the ExecuteScript processor with simple date logic (this will allow you to use the Groovy/Java date framework to correctly handle things like leap years, time zones, etc.).
If you really don't want to do that, you could probably use a regex and Expression Language in UpdateAttribute to add a new attribute with the previous month value, and then RouteOnAttribute to determine if the flowfile should be inserted into the database.
Here's a simple Groovy test demonstrating the logic. You'll need to add the code to process the session, flowfile, etc.
@Test
public void testScriptShouldFindPreviousMonth() throws Exception {
    // Arrange
    def input = ["23.12.2016 12:02:23,Koji,24", "22.01.2016 01:21:22,Mahi,24"]
    def EXPECTED = ["NOV_2016", "DEC_2015"]

    // Act
    input.eachWithIndex { String data, int i ->
        // Parse the date portion of the line and step back one month
        Calendar calendar = Date.parse("dd.MM.yyyy", data.tokenize(" ")[0]).toCalendar()
        calendar.add(Calendar.MONTH, -1)
        String result = calendar.format("MMM_yyyy").toUpperCase()

        // Assert
        assert result == EXPECTED[i]
    }
}

Pig case statement for finding no. of events in a specific period of time

There is a dataset which is like a movie database containing the movie name, rating, duration of the movie, and year of release.
The question is: how do you find the number of movies released during a 10-year span?
The dataset is comma-separated.
Movie = load '/home/movie/movies.txt' using PigStorage(',') as (movieid:int, moviename:chararray, yearofrelease:int, ratingofmovie:float, moviedurationinsec:float);
movies_released_between_2000_2010 = filter Movie by yearofrelease >2000 and yearofrelease < 2010;
result = foreach movies_released_between_2000_2010 generate moviename,yearofrelease;
dump result;
year_count = FOREACH Movie GENERATE (CASE WHEN yearofrelease > 2000 AND yearofrelease < 2010 THEN 1 ELSE 0 END) AS year_flag, moviename;
year_grp = GROUP year_count BY year_flag;
movie_count_out = FOREACH year_grp GENERATE group, COUNT(year_count);
The above example can help you get an understanding of the solution; there might be some syntax errors, though. If you need to group on the basis of decade, then you can use a substring function on top of the year and get the specific range, as sketched below.
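Alternatively, a rough sketch (not tested, reusing the Movie relation and yearofrelease field from above) that buckets each movie into its decade with integer arithmetic instead of a substring:
-- sketch only: integer division truncates, so (1987 / 10) * 10 = 1980
by_decade = FOREACH Movie GENERATE moviename, (yearofrelease / 10) * 10 AS decade;
decade_grp = GROUP by_decade BY decade;
decade_count = FOREACH decade_grp GENERATE group AS decade, COUNT(by_decade) AS num_movies;
DUMP decade_count;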

Pig - how to select only some values from the list (not just simple distinct)?

Let's say I have input_file.txt (user_id, event_code, event_date):
1,a,1
1,b,2
2,a,3
2,b,4
2,b,5
2,b,6
2,c,7
2,b,8
As you can see, user_id = 2 has events like this: abbbcb.
I'd like to have a result like this:
1,{(a,1),(b,2)}
2,{(a,2),(b,6),(c,7),(b,8)}
So when we have a few events with the same code, I'd like to take only the last one.
Can you please share any hints?
Regards
Pawel
The main thing you are describing is what GROUP BY does.
In this case:
B = GROUP A BY user_id;
This gets your records together by user_id. Your data will now look like this:
1,{(a,1),(b,2)}
2,{(a,2),(b,6),(c,7),(b,8)}
You say you only want the last one (I assume you mean the one with the greatest event_date). To do this, you can do a nested FOREACH with an ORDER BY to sort by date, and then take the first one with LIMIT. Note that this has arbitrary behavior when there are ties.
C = FOREACH B {
    DA = ORDER A BY event_date DESC;
    DB = LIMIT DA 1;
    GENERATE FLATTEN(group), FLATTEN(DB.event_code), FLATTEN(DB.event_date);
};
Your data should now look like this:
1,b,2
2,b,8
Another option would be to use a UDF to write some custom behavior on the groups given by GROUP BY:
B = GROUP A BY user_id;
C = FOREACH B GENERATE YourUDFThatYouBuilt(group, A);
In that UDF you'd write whatever custom behavior you want (in this case, return the tuple with the greatest date).
It seems like you could use the DistinctBy UDF from Apache DataFu to achieve this. This UDF, given a bag, returns the first instance found for a given field. In your case the field you care about is event_code. But we have to reverse the order, as you actually want the last instance.
One clarification though. Correct me if I'm wrong, but I think the intended output is:
1,{(a,1),(b,2)}
2,{(a,3),(b,6),(c,7),(b,8)}
That is, the (a,3) event occurs for member 2. The (a,2) event occurs for member 1.
Here's how you can do it:
-- pass in 1 because we want distinct by event code (position 1)
define DistinctBy datafu.pig.bags.DistinctBy('1');
C = FOREACH (GROUP A BY user_id) {
    -- reverse so we can take the last event code occurrence
    A_reversed = ORDER A BY event_date DESC;
    -- use DistinctBy to get the first tuple having an occurrence of a field value
    A_distinct_by_code = DistinctBy(A_reversed);
    -- put back in order again
    A_ordered = ORDER A_distinct_by_code BY event_date ASC;
    GENERATE group AS user_id, A_ordered.(event_code, event_date);
};

Pig Heaping on approx 185 gigs (202260828 records)

I am still pretty new to Pig, but I understand the basic idea of map/reduce jobs. I am trying to figure out some statistics for a user based on some simple logs. We have a utility that parses out fields from the log, and I am using DataFu to figure out the variance and quartiles.
My script is as follows:
log = LOAD '$data' USING SieveLoader('node', 'uid', 'long_timestamp');
log_map = FILTER log BY $0 IS NOT NULL AND $0#'uid' IS NOT NULL;
--Find all users
SPLIT log_map INTO cloud IF $0#'node' MATCHES '.*mis01.*', dev OTHERWISE;
--For the real cloud
cloud = FOREACH cloud GENERATE $0#'uid' AS uid, $0#'long_timestamp' AS long_timestamp:long, 'dev' AS domain, '192.168.0.231' AS ldap_server;
dev = FOREACH dev GENERATE $0#'uid' AS uid, $0#'long_timestamp' AS long_timestamp:long, 'dev' AS domain, '10.0.0.231' AS ldap_server;
modified_logs = UNION dev, cloud;
--Calculate user times
user_times = FOREACH modified_logs GENERATE *, ToDate((long)long_timestamp) as date;
--Based on weekday/weekend
aliased_user_times = FOREACH user_times GENERATE *, GetYear(date) AS year:int, GetMonth(date) AS month:int, GetDay(date) AS day:int, GetWeekOrWeekend(date) AS day_of_week, long_timestamp % (24*60*60*1000) AS miliseconds_into_day;
--Based on actual day of week
--aliased_user_times = FOREACH user_times GENERATE *, GetYear(date) AS year:int, GetMonth(date) AS month:int, GetDay(date) AS day:int, GetDayOfWeek(date) AS day_of_week, long_timestamp % (24*60*60*1000) AS miliseconds_into_day;
user_days = GROUP aliased_user_times BY (uid, ldap_server,domain, year, month, day, day_of_week);
some_times_by_day = FOREACH user_days GENERATE FLATTEN(group) AS (uid, ldap_server, domain, year, month, day, day_of_week), MAX(aliased_user_times.miliseconds_into_day) AS max, MIN(aliased_user_times.miliseconds_into_day) AS min;
times_by_day = FOREACH some_times_by_day GENERATE *, max-min AS time_on;
times_by_day_of_week = GROUP times_by_day BY (uid, ldap_server, domain, day_of_week);
STORE times_by_day_of_week INTO '/data/times_by_day_of_week';
--New calculation, mean, var, std_d, (min, 25th quartile, 50th (aka median), 75th quartile, max)
averages = FOREACH times_by_day_of_week GENERATE FLATTEN(group) AS (uid, ldap_server, domain, day_of_week), 'USER' as type, AVG(times_by_day.min) AS start_avg, VAR(times_by_day.min) AS start_var, SQRT(VAR(times_by_day.min)) AS start_std, Quartile(times_by_day.min) AS start_quartiles;
--AVG(times_by_day.max) AS end_avg, VAR(times_by_day.max) AS end_var, SQRT(VAR(times_by_day.max)) AS end_std, Quartile(times_by_day.max) AS end_quartiles, AVG(times_by_day.time_on) AS hours_avg, VAR(times_by_day.time_on) AS hours_var, SQRT(VAR(times_by_day.time_on)) AS hours_std, Quartile(times_by_day.time_on) AS hours_quartiles ;
STORE averages INTO '/data/averages';
I've seen that other people have had problems with DataFu calculating multiple quantiles at once, so I am only trying to calculate one at a time. The custom loader loads one line at a time and passes it through a utility which converts it into a map, and there is a small UDF that checks whether a date is a weekday or a weekend (originally we wanted to get statistics based on day of week, but loading enough data to get interesting quartiles was killing the map/reduce tasks).
Using Pig 0.11
It looks like my specific problem was due to trying to calculate the min and the max in one Pig Latin statement. Splitting the work into two different commands and then joining them seems to have fixed my memory problem.
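The reworked script is not shown above; a minimal sketch of the idea (not tested, reusing the relation and field names from the original script) would compute MIN and MAX in separate statements and JOIN the results back on the grouping keys:
-- sketch only: aggregate MIN and MAX separately, then join the two results back together
user_days = GROUP aliased_user_times BY (uid, ldap_server, domain, year, month, day, day_of_week);
mins_by_day = FOREACH user_days GENERATE FLATTEN(group) AS (uid, ldap_server, domain, year, month, day, day_of_week), MIN(aliased_user_times.miliseconds_into_day) AS min;
maxs_by_day = FOREACH user_days GENERATE FLATTEN(group) AS (uid, ldap_server, domain, year, month, day, day_of_week), MAX(aliased_user_times.miliseconds_into_day) AS max;
joined_by_day = JOIN mins_by_day BY (uid, ldap_server, domain, year, month, day, day_of_week), maxs_by_day BY (uid, ldap_server, domain, year, month, day, day_of_week);
times_by_day = FOREACH joined_by_day GENERATE mins_by_day::uid AS uid, mins_by_day::day_of_week AS day_of_week, mins_by_day::min AS min, maxs_by_day::max AS max, maxs_by_day::max - mins_by_day::min AS time_on;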
