The image above shows the output of the GENERATE statement below, along with the DESCRIBE output:
D = FOREACH C GENERATE $0 AS time, $1 AS perf_temp_count;
DUMP D;
DESCRIBE D;
My question: currently the output above is grouped by month and hour (military time), and I am trying to find the maximum count per month, 1 through 12. Right now I am just showing the month, hour, and count.
My expected output is
(1, 4) 9
....
remaining months
....
(12, 3) 10
where this again describes (month, hour), max count.
B = GROUP A BY (month, hour);
C = FOREACH B GENERATE group AS time, COUNT(A.temp) AS cnt;
X = GROUP C BY time;
Y = FOREACH X GENERATE group, MAX(C.cnt) AS mcount;
I have no idea why, but aggregating (MAX) right after another aggregate (COUNT) is a problem, or I am not referencing the names correctly.
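To pin down the goal, here is the intended computation as a minimal Python sketch (my addition; the counts are made-up stand-ins for relation C): for each month, keep the (month, hour) pair with the largest count.

# Hypothetical (month, hour) -> count pairs standing in for relation C.
counts = {(1, 0): 4, (1, 4): 9, (1, 17): 2,
          (12, 3): 10, (12, 8): 7}

# For each month, keep the (month, hour) key whose count is largest.
best = {}
for (month, hour), cnt in counts.items():
    if month not in best or cnt > best[month][1]:
        best[month] = ((month, hour), cnt)

for month in sorted(best):
    time, mcount = best[month]
    print(time, mcount)  # e.g. (1, 4) 9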
Say I have this data:
sysuse auto2, clear
keep if _n<=4
describe
local N = r(N)
gen a1 = price
gen a2 = mpg
gen a3 = headroom
gen a4 = trunk
gen a5 = weight
gen a6 = length
input yearA yearB
1 4
1 5
2 5
1 6
end
keep a1-a6 yearA yearB
I'd like to do a row-specific operation based on the value of other variables. As an example, I'd like to add up all a columns corresponding to some row-specific rule, in this case starting a year after yearA and a year before yearB. So, if yearA==1 and yearB==5, the starting year is 2 and the end year is 4, so we would add a2, a3, and a4 together to get that row's total. Each row has its own rule corresponding to (a function of) its values of yearA and yearB.
I came up with the following solution, which works, but it is clunky and slow:
gen total = .
forvalues i = 1/`N' {
    local start = yearA[`i'] + 1
    local end = yearB[`i'] - 1
    display "`start' `end'"
    * annoyingly, you can't replace with egen, so create a new variable and delete it
    egen total`i' = rowtotal(a`start'-a`end')
    replace total = total`i' if _n == `i'
    drop total`i'
}
As noted in the comment in the loop, I resorted to creating a new variable for each row and deleting it after using its value. Why? Because it doesn't seem like one can use replace with egen.
The actual application creates multiple variables and there are millions of observations, so it takes many hours or even days to run. What is a faster way to accomplish my goal? I am in no way tied to doing things row by row if there is a better way.
A much faster approach is to loop over the six columns instead of the millions of rows:
gen wanted = 0
forval j = 1/6 {
    replace wanted = wanted + a`j' if inrange(`j', yearA + 1, yearB - 1)
}
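For comparison only (my addition, not part of the original answer), the same column-wise loop as a pandas sketch with made-up data:

import pandas as pd

# Hypothetical frame mirroring the Stata example: six a* columns plus the year bounds.
df = pd.DataFrame({
    "a1": [1, 2, 3, 4],    "a2": [10, 20, 30, 40], "a3": [100, 200, 300, 400],
    "a4": [5, 6, 7, 8],    "a5": [50, 60, 70, 80], "a6": [500, 600, 700, 800],
    "yearA": [1, 1, 2, 1], "yearB": [4, 5, 5, 6],
})

df["total"] = 0
for j in range(1, 7):
    # Add column a<j> only to rows whose rule covers year j.
    mask = (df["yearA"] + 1 <= j) & (j <= df["yearB"] - 1)
    df.loc[mask, "total"] += df.loc[mask, f"a{j}"]

print(df["total"])  # row 0: a2 + a3 = 110, and so on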
I need to sort an n-thousand-size array of random unique positive integers into groups of consecutive integers, each of size k or larger, and then further group those by their quotient when divided by some arbitrary positive integer j.
In other words, let's say I work at Chuck E. Cheese and we sometimes give away free tickets. I have a couple hundred thousand tickets on the floor and want to find out which employee handed out which tickets, but only for groupings of consecutive ticket numbers larger than 500. Each employee has a random number from 0 to 100 assigned to them. That number corresponds to which "batch" of tickets was handed out, i.e. tickets #000000 to #001499 were handed out by employee 1, tickets #001500 to #002999 were handed out by employee 2, and so on. A large number of tickets are lost or missing. I only care about groups of consecutive ticket numbers larger than 500.
What is the fastest way for me to sort through this pile?
Edit:
As requested by @trincot, here is a worked-out example:
I have 150,000 unique tickets on the floor ranging from ticket #000000 to #200000 (i.e. missing 50,001 random tickets from the pile)
Step 1: sort each ticket from smallest to largest using an introsort algorithm.
Step 2: go through the list of tickets one by one and gather only tickets with "consecutiveness" greater than 500, i.e. I keep a tally of how many consecutive values I have found and only keep runs with tallies of 500 or higher. If I have tickets #409 through #909 but not #408 or #910, then I would keep that group; but if that group had been missing any ticket between #409 and #909, I would have thrown out the group and moved on.
Step 3: combine all my newly sorted groups together, each of which is of size 500 or larger.
Step 4: figure out which tickets belong to whom by going through the final numbers one by one again, dividing each by 1500, rounding down to the nearest whole number, and putting them in their respective piles, where each pile represents an employee.
The end result is a set of piles telling me which employees gave out more than 500 tickets at a time, how many times they did so, and what tickets they did so with.
Sample with numbers:
where k == 3 and j == 1500; k is the minimum consecutive-integer grouping size, j is the final ticket-interval grouping size, i.e. 5, 6, and 7 fall into the 0th interval of size 1500, and 5996, 5997, 5998, 5999 fall into the third interval of size 1500.
Input: [5, 5996, 8111, 1000, 1001, 5999, 8110, 7, 5998, 2500, 1250, 6, 8109, 5997]
Output: [ 0: [5, 6, 7], 3: [5996, 5997, 5998, 5999], 5: [8109, 8110, 8111] ]
Here is how you could do it in Python:
from collections import defaultdict

def partition(data, k, j):
    data = sorted(data)
    start = data[0]  # assuming data is not an empty list
    count = 0
    output = defaultdict(list)  # to automatically create a partition when referenced
    for value in data:
        bucket = value // j  # integer division
        if value % j == start % j + count:  # in same partition & consecutive?
            count += 1
            if count == k:
                # Add the k entries that we skipped so far:
                output[bucket].extend(range(start, start + count))
            elif count > k:
                output[bucket].append(value)
        else:
            start = value
            count = 1
    return dict(output)

# The example given in the question:
data = [5, 5996, 8111, 1000, 1001, 5999, 8110, 7, 5998, 2500, 1250, 6, 8109, 5997]
print(partition(data, k=3, j=1500))
# outputs {0: [5, 6, 7], 3: [5996, 5997, 5998, 5999], 5: [8109, 8110, 8111]}
Here is untested Python for the fastest approach that I can think of. It will return just pairs of first/last ticket for each range of interest found.
def grouped_tickets(tickets, min_group_size, partition_size):
    tickets = sorted(tickets)
    answer = {}
    min_ticket = -1
    max_ticket = -1
    next_partition = 0
    for ticket in tickets:
        if next_partition <= ticket or max_ticket + 1 < ticket:
            if min_group_size <= max_ticket - min_ticket + 1:
                partition = min_ticket // partition_size
                if partition in answer:
                    answer[partition].append((min_ticket, max_ticket))
                else:
                    answer[partition] = [(min_ticket, max_ticket)]
            # Find where the next partition is.
            next_partition = (ticket // partition_size) * partition_size + partition_size
            min_ticket = ticket
            max_ticket = ticket
        else:
            max_ticket = ticket
    # And don't lose the last group!
    if min_group_size <= max_ticket - min_ticket + 1:
        partition = min_ticket // partition_size
        if partition in answer:
            answer[partition].append((min_ticket, max_ticket))
        else:
            answer[partition] = [(min_ticket, max_ticket)]
    return answer
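As a quick usage sketch (my addition, untested like the function itself), running it on the question's sample with min_group_size=3 and partition_size=1500 should give first/last pairs per partition:

data = [5, 5996, 8111, 1000, 1001, 5999, 8110, 7, 5998, 2500, 1250, 6, 8109, 5997]
print(grouped_tickets(data, min_group_size=3, partition_size=1500))
# expected: {0: [(5, 7)], 3: [(5996, 5999)], 5: [(8109, 8111)]}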
I want to use Power Query to extract rows by field (the field is [Project]) and get the top 3 scoring rows from the master table for each project; but if there are more than 3 rows with a score over 15, they should all be included. A minimum of 3 rows must be extracted every time.
Essentially I'm trying to combine the Keep Rows function with my formula of =if(score>=15,1,0).
Setting the query to keep records with a score of 15 or greater doesn't work for projects where the highest scores are, for example, 1, 7, and 15. That would only return 1 row, but we need 3 as a minimum.
Setting it to the top 3 scores only would omit rows in a table where the highest scores are 18, 19, 20.
Is there a way to combine the two functions to say "Choose the top 3 rows, but choose the top n rows if there are n rows with score >= 15"?
As far as I understand, you are trying to do the following (Alexis Olson proposed the very same):
let
    Source = Excel.CurrentWorkbook(){[Name="Table"]}[Content],
    group = Table.Group(Source, {"Project"}, {"temp", each Table.SelectRows(Table.AddIndexColumn(Table.Sort(_, {"Score", 1}), "i", 1, 1), each [i] <= 3 or [Score] >= 15)}),
    expand = Table.ExpandTableColumn(group, "temp", {"Score"})
in
    expand
Or:
let
    Source = Excel.CurrentWorkbook(){[Name="Table"]}[Content],
    group = Table.Group(Source, {"Project"}, {"temp", each [a = Table.Sort(_, {"Score", 1}), b = Table.FirstN(a, 3) & Table.SelectRows(Table.Skip(a, 3), each [Score] >= 15)][b]}),
    expand = Table.ExpandTableColumn(group, "temp", {"Score"})
in
    expand
Or:
let
    Source = Excel.CurrentWorkbook(){[Name="Table"]}[Content],
    group = Table.Group(Source, {"Project"}, {"Score", each [a = List.Sort([Score], 1), b = List.FirstN(a, 3) & List.Select(List.Skip(a, 3), each _ >= 15)][b]}),
    expand = Table.ExpandListColumn(group, "Score")
in
    expand
Note: if there are more columns in the table you want to keep, for the first and second variants you may just add those columns to the last step. The last variant doesn't offer that option, and the code would have to be modified.
Sort by the Score column in descending order and then add an Index column (go to Add Column > Index Column > From 1).
Then filter on the Index column choosing to keep values less than or equal to 3. This should produce a step with this M code:
= Table.SelectRows(#"Added Index", each [Index] <= 3)
Now you just need to make a small adjustment to also include any score 15 or greater:
= Table.SelectRows(#"Added Index", each [Index] <= 3 or [Score] >= 15)
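Not Power Query, but to make the selection rule concrete, here is the same logic as a small pandas sketch with made-up data (my addition):

import pandas as pd

# Made-up data covering both edge cases from the question.
df = pd.DataFrame({
    "Project": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "Score":   [18, 19, 20, 16, 1, 7, 15, 3],
})

def top_rows(g):
    # Top 3 rows by Score, plus any further rows scoring 15 or more.
    g = g.sort_values("Score", ascending=False).reset_index(drop=True)
    return g[(g.index < 3) | (g["Score"] >= 15)]

print(df.groupby("Project", group_keys=False).apply(top_rows))
# Project A keeps 4 rows (all >= 15); project B keeps its top 3.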
I have grouped data in the format (GroupID, count), like the following. I would like to compute the difference between consecutive counts while preserving the GroupID, so it becomes (1, 288), (2, 2), (3, 66), ...
I tried to use the SUBTRACT function, but I am not sure how to subtract the previous record from the current one. The second image shows the count part. The subtraction part failed.
This is a little tricky to achieve, but it can be done using a JOIN. Generate another relation starting with the second row but with its IDs shifted down by 1, i.e. ($0 - 1). Join the two relations and generate the difference. For the ID, add 1 to get back the original IDs. Finally, UNION the first row with the rows containing the differences.
A = foreach win_grouped generate $0 as id, COUNT($1) as c; -- (1,228),(2,230)... so on
A1 = filter A by ($0 > 1); -- (2,230),(3,296)... so on
B = foreach A1 generate ($0 - 1) as id, $1 as c; -- (1,230),(2,296)... so on
AB = join A by id, B by id; -- (1,228,1,230),(2,230,2,296)... so on
C = foreach AB generate (A::id + 1), (B::c - A::c); -- (2,2),(3,66)... so on
D = limit A 1; -- (1,288)
E = UNION D, C; -- (1,288),(2,2),(3,66)... so on
DUMP E;
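To see the shift-and-join trick outside Pig, here is the same idea as a tiny Python sketch (my addition; the counts are hypothetical):

# Hypothetical (id, count) rows standing in for relation A above.
a = [(1, 228), (2, 230), (3, 296)]

# Shift ids down by one so row i lines up with row i+1, then join on id.
b = [(i - 1, c) for (i, c) in a if i > 1]           # like A1/B above
joined = {i: (c1, c2) for (i, c1) in a for (j, c2) in b if i == j}

diffs = [(i + 1, c2 - c1) for i, (c1, c2) in sorted(joined.items())]
result = a[:1] + diffs                              # like UNION D, C
print(result)  # [(1, 228), (2, 2), (3, 66)]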
Let's say we have vectors time_a and time_b with roughly 6 million and 12 million elements respectively (different lengths), in ascending order (units of picoseconds).
For example:
time_a=[ 72196880 112521880 118581820 122398052 142394088 144797508........6 million more....]
time_b=[81656628 151885536 169269680 424456200 652427880 760435300........12 million more....]
In the most time-consuming way, we could loop through each element of time_a, subtract each element of time_b, and use an if statement to check whether the difference lies between particular tmin and tmax values. If it does, we bin it and add it to a histogram c. c is divided into bins tmin:binsize:tmax, as you will see, so once we find that a difference is within our range, we add one to the appropriate bin of c.
Below is the code I have so far. I think there is a more clever way to do this. Keep in mind, the full vectors are too large to use bsxfun(@minus, time_a, time_b'); that would create a matrix with far too many rows and columns. Any clever ideas?
function [c, dt, dtEdges] = coincidence4(time_a, time_b, tmin, tmax, binsize)
% Round tmin, tmax to an integer multiple of binsize:
if mod(tmin, binsize) ~= 0
    tmin = tmin - mod(tmin, binsize) + binsize;
end
if mod(tmax, binsize) ~= 0
    tmax = tmax - mod(tmax, binsize);
end
dt = tmin:binsize:tmax;
dtEdges = [dt(1) - binsize/2, dt + binsize/2];
c = zeros(1, length(dt));
Na = length(time_a);
Nb = length(time_b);
tic1 = tic;
bbMin = 1;
for aa = 1:Na
    ta = time_a(aa);
    bb = bbMin;
    while (bb <= Nb)
        tb = time_b(bb);
        d = tb - ta;
        if d < tmin
            bbMin = bb;
            bb = bb + 1;
        elseif d > tmax
            bb = Nb + 1;
        else
            index = floor((d - dtEdges(1)) / (dtEdges(end) - dtEdges(1)) * (length(dtEdges) - 1) + 1);
            c(index) = c(index) + 1;
            bb = bb + 1;
        end
    end
end
toc(tic1)
end
You could loop over just one of the vectors. I mean something along these lines:
c = zeros(size(dtEdges));
for aa = 1:Na
    d = time_b - time_a(aa);
    c = c + histc(d, dtEdges);
end