Need to calculate a cumulative sum with time-point conditions in R

I am new to R and I need to generate a cumulative sum with time-point conditions.
The data sample looks like this:
item_id <- c(29, 643, 46, 3, 45, 352)
group_id <- c(2,2,2,2,2,2)
start_year <- c(1993,1997,1997,1999,2006,2010)
end_year <- c(2008,2009,2017,2014,2019,2019)
item_size <- c(12,30,57,123,45,7)
mydf <- data.frame(item_id, group_id, start_year, end_year, item_size)
I would like to add another column at the end; the new data should look like this:
item_id <- c(29, 643, 46, 3, 45, 352)
group_id <- c(2,2,2,2,2,2)
start_year <- c(1993,1997,1997,1999,2006,2010)
end_year <- c(2008,2009,2017,2014,2019,2019)
item_size <- c(12,30,57,123,45,7)
group_size <- c(12,99,99,222,267,232)
mydf <- data.frame(item_id, group_id, start_year, end_year, item_size, group_size)
So, my aim is to calculate the group_size for each item_id while that item is still active (i.e. in the group). For instance, in 1993 there is only one item (29), so the group size is 12. In 1997 there are two more items (643, 46), so the group size is the sum of these two items plus the size of item 29, and so on.
But in 2010 the group size should be the sum of items 46, 3, 45 and 352, not including items 29 and 643, because those two items finished before 2010 and are no longer active, so they are not counted in the group. The group size of item_id 352 in 2010 should therefore be 57+123+45+7=232.
Therefore, the group_size should be the total size of all active items at that point.
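In pseudo-dplyr terms, what I want for each row is the sum of item_size over all rows in the same group whose [start_year, end_year] interval covers that row's start_year; a rough, untested sketch of that rule:
library(dplyr)

mydf %>%
  group_by(group_id) %>%
  mutate(group_size = sapply(start_year, function(y)
    # active in year y: items that started on or before y and ended on or after y
    sum(item_size[start_year <= y & end_year >= y]))) %>%
  ungroup()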
Can anyone help me with this?
Thanks


Optimizing the algorithm for checking available reservations for the store

I would like to ask about an algorithm for checking whether a customer can book a table at a store.
I will describe my problem with the following example:
Restaurant:
User M has a restaurant R. R is open from 08:00 to 17:00.
Restaurant R has 3 tables (T1, T2, T3), each with 6 seats.
R offers the food F1, which can be eaten within 2 hours.
Booking:
A customer C of R has booked table T1 for 5 people with the F1 food; call this booking B[0].
B[0] has a start time of 9 AM (so, with F1 taking 2 hours, it occupies T1 until 11 AM).
M is the manager of the store, and M wants to know whether a given date YYYY-MM-DD can still accept a booking or not.
My current algorithm is:
I create an array with one element per minute of the day, each initialised to 0:
24 * 60 = 1440
=> I have: arr[1440] = [0, 0, 0, ...]
Next I get all the bookings for the day YYYY-MM-DD. The result is an array B[].
Then I loop over B[]:
for b in B[]
For each booking b I loop from its start_time to its end_time in steps of 1 minute:
for time = start_time; time <= end_time; time++
and on each iteration I set arr at the index corresponding to that minute of the day to 1.
(It is quite similar to the Sieve of Eratosthenes.)
Finally I iterate over arr one more time; if there is at least one 0 left in the array, the YYYY-MM-DD date is still bookable.
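To illustrate, this is roughly the check I have in mind, written in R only as an illustration (my real code is Laravel/PHP, and the start_min/end_min names are made up):
is_day_bookable <- function(bookings) {
  arr <- rep(0L, 24 * 60)                       # arr[1440] = [0, 0, 0, ...]
  for (b in seq_len(nrow(bookings))) {
    # +1 because R vectors are 1-based; mark every booked minute with 1
    arr[(bookings$start_min[b] + 1):bookings$end_min[b]] <- 1L
  }
  any(arr == 0L)                                # at least one 0 => still bookable
}

# B[0]: starts at 9 AM and, with F1 taking 2 hours, ends at 11 AM
bookings <- data.frame(start_min = 9 * 60, end_min = 11 * 60)
is_day_bookable(bookings)                       # TRUE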
But my algorithm will not scale well if the number of tables in the store increases, or if many days need to be checked (for example 2022-01-01 -> 2022-02-01), ...
Thank you very much.
P.S.: Regarding the technology stack, I am currently using Laravel 9.

rollapply + specnumber = species richness over sampling intervals that vary in length?

I have a community matrix (samples x species of animals). I sampled the animals weekly over many years (in this example, three years). I want to figure out how sampling timing (start week and duration a.k.a. number of weeks) affects species richness. Here is an example data set:
Data <- data.frame(
  Year = rep(c('1996', '1997', '1998'), each = 5),
  Week = rep(c('1', '2', '3', '4', '5'), 3),
  Species1 = sample(0:5, 15, replace = TRUE),
  Species2 = sample(0:5, 15, replace = TRUE),
  Species3 = sample(0:5, 15, replace = TRUE)
)
The outcome that I want is something along the lines of:
Year StartWeek Duration(weeks) SpeciesRichness
1996 1 1 2
1996 1 2 3
1996 1 3 1
...
1998 5 1 1
I had tried doing this via a combination of rollapply and vegan's specnumber, but got a sample x species matrix instead of a vector of Species Richness. Weird.
For example, I thought that this should give me species richness for sampling windows of two weeks:
test <- rollapply(Data[3:5], width = 2, specnumber, align = "right")
Thank you for your help!
I figured it out by breaking up the task into two parts:
1. Summing up species abundances using rollapply, as implemented in a dplyr mutate_each call.
2. Calculating species richness using vegan.
I did this for each sampling duration window separately.
Here is the bare bones version (I just did this successively for each sampling duration that I wanted by changing the width argument):
# rolling 2-week sums of abundances, per species column
weeksum2 <- function(x) {rollapply(x, width = 2, align = 'left', sum, fill = NA)}

sum2weeks <- Data %>%
  arrange(Year, Week) %>%
  group_by(Year) %>%
  mutate_each(funs(weeksum2), -Year, -Week)

# species richness of each summed row (one row per Week x Year combination)
weeklyspecnumber2 <- specnumber(sum2weeks[, 3:ncol(sum2weeks)],
                                groups = interaction(sum2weeks$Week, sum2weeks$Year))
weeklyspecnumber2 <- unlist(weeklyspecnumber2)
weeklyspecnumber2 <- as.data.frame(weeklyspecnumber2)
weeklyspecnumber2$WeekYear <- as.factor(rownames(weeklyspecnumber2))
weeklyspecnumber2 <- tidyr::separate(weeklyspecnumber2, WeekYear,
                                     into = c('Week', 'Year'), sep = '[.]')
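If you would rather not repeat this by hand for each width, the same rolling-sum-then-richness idea can be wrapped in a loop over durations. A rough, untested sketch, assuming the species counts sit in columns 3 onwards as in the example Data:
library(zoo)    # rollapply
library(vegan)  # specnumber

richness <- do.call(rbind, lapply(1:5, function(w) {
  do.call(rbind, lapply(split(Data, Data$Year), function(d) {
    d <- d[order(as.numeric(d$Week)), ]
    # rolling sums of abundances over w consecutive weeks, per species column
    sums <- rollapply(as.matrix(d[, 3:ncol(d)]), width = w, sum,
                      align = "left", fill = NA, by.column = TRUE)
    # windows that run past the last sampled week come out as NA
    data.frame(Year = d$Year, StartWeek = d$Week, Duration = w,
               SpeciesRichness = specnumber(sums))
  }))
}))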

How to find input that gives a specific max heap structure

I understand how heaps work, but there is a problem I have no idea how to solve.
Let's say you're given a max heap (not a BST),
[149 , 130 , 129 , 107 , 122 , 124 , 103 , 66 , 77 , 91 , 98 , 10 , 55 , 35 , 72]
Find a list of inputs that would give you the same heap structure, such that each successive value is the largest it can possibly be. That list would be:
[66 , 91 , 10 , 107 , 122 , 35 , 55 , 77 , 130 , 98 , 149 , 124 , 129 , 72 , 103]
So in other words, if you were going to insert 66 first then 91 then 10 then 107 and so on into an empty max heap, you would end up with the given heap structure after all of the bubbling up and so forth. How would you even find this input in the first place?
Can anyone suggest any ideas?
Thanks
Consider this max-heap (which I'll draw as a tree, but which represents [7, 6, 5, 4, 3, 1, 2]).
        7
    6       5
  4   3   1   2
What's the last element that can be inserted? The last slot filled in the heap must be the bottom-right of the tree, and the bubbling-up procedure can only have touched elements along the route from that node to the top. So the previous element inserted must be 7, 5 or 2. Not all of these are possible. If it was 7, then the tree must have looked like this before insertion (with _ representing the slot where we're going to insert before bubbling up):
        5
    6       2
  4   3   1   _
which violates the heap constraint. If 5 were the last element to be inserted, then the heap would have looked like this:
        7
    6       2
  4   3   1   _
This works, so 5 could have been the last thing inserted. Similarly, 2 could also have been the last thing inserted.
In general, an element along the path to the bottom-right-most node could have been the last thing inserted if all the nodes below it along the path are at least as large as the other child of its parent. In our example: 7 can't be the last thing inserted because 5 < 6. 5 can be the last thing inserted because 2 > 1. 2 can be the last thing inserted because it doesn't have any children.
With this observation, one can generate all input sequences (in reverse order) that could have resulted in the heap by recursion.
Here's some code that runs on the example you gave, and verifies that each input sequence it generates actually does produce the given heap. There are 226,696 different inputs, but the program only takes a few seconds to run.
# children returns the two children of i. The first
# is along the path to n.
# For example: children(1, 4) == 4, 3
def children(i, n):
    i += 1
    n += 1
    b = 0
    while n > i:
        b = n & 1
        n //= 2
    return 2 * i + b - 1, 2 * i - b

# try_remove tries to remove the element i from the heap, on
# the assumption that it was the last thing inserted.
# It returns a new heap without that element if possible,
# and otherwise None.
def try_remove(h, i):
    h2 = h[:-1]
    n = len(h) - 1
    while i < n:
        c1, c2 = children(i, n)
        h2[i] = h[c1]
        if c2 < len(h) and h[c1] < h[c2]:
            return None
        i = c1
    return h2

# inputs generates all possible input sequences that could have
# generated the given heap.
def inputs(h):
    if len(h) <= 1:
        yield h
        return
    n = len(h) - 1
    while True:
        h2 = try_remove(h, n)
        if h2 is not None:
            for ins in inputs(h2):
                yield ins + [h[n]]
        if n == 0: break
        n = (n - 1) // 2

import heapq

# assert_inputs_give_heap builds a max-heap from the
# given inputs, and asserts it's equal to cs.
def assert_inputs_give_heap(ins, cs):
    # Build a heap from the inputs.
    # Python heaps are min-heaps, so we negate the items to emulate a max heap.
    h = []
    for i in ins:
        heapq.heappush(h, -i)
    h = [-x for x in h]
    if h != cs:
        raise AssertionError('%s != %s' % (h, cs))

cs = [149, 130, 129, 107, 122, 124, 103, 66, 77, 91, 98, 10, 55, 35, 72]
for ins in inputs(cs):
    assert_inputs_give_heap(ins, cs)
    print(ins)

R for loop to create a class variable taking forever

My question comprises two parts. I have a matrix with IDs and several columns (representing time) of values from 0-180. I'd like to summarize these with sub groups, then compare across the columns. For example, how many IDs switch from 0-10 in column 5, to 11+ in column 6?
Now, my first thought was a SAS-style format command. This would let me group integers into different blocks (0-10, 11-20, 21-30, etc.). But it seems that this doesn't exist in R.
My solution has been to loop through all values of this matrix (dual for loops), check whether each value falls within certain ranges (a string of if statements), and then enter the result into a new matrix that keeps track of only the classes. Example:
# search through columns
for (j in 2:(dim(Tab2)[2])) {
  # search through lines
  for (i in 1:dim(Tab2)[1]) {
    if (is.na(Tab2[i, j])) {
      tempGliss[i, j] <- "NA"
    } else if (Tab2[i, j] == 0) {
      tempGliss[i, j] <- "Zero"
    } else if (Tab2[i, j] > 0 & Tab2[i, j] <= 7) {
      tempGliss[i, j] <- "1-7"
    } else if (Tab2[i, j] >= 7 & Tab2[i, j] <= 14) {
      tempGliss[i, j] <- "7-14"
    } else if (Tab2[i, j] >= 15 & Tab2[i, j] <= 30) {
      tempGliss[i, j] <- "15-30"
    } else if (Tab2[i, j] >= 31 & Tab2[i, j] <= 60) {
      tempGliss[i, j] <- "31-60"
    } else if (Tab2[i, j] >= 61 & Tab2[i, j] <= 90) {
      tempGliss[i, j] <- "61-90"
    } else if (Tab2[i, j] >= 91 & Tab2[i, j] <= 120) {
      tempGliss[i, j] <- "91-120"
    } else if (Tab2[i, j] >= 121 & Tab2[i, j] <= 150) {
      tempGliss[i, j] <- "121-150"
    } else if (Tab2[i, j] >= 151 & Tab2[i, j] <= 180) {
      tempGliss[i, j] <- "151-180"
    } else if (Tab2[i, j] > 180) {
      tempGliss[i, j] <- ">180"
    }
  }
}
Here Tab2 is my original matrix, and tempGliss is what I'm creating as a class. This takes a VERY LONG TIME! It doesn't help that my file is quite large. Is there any way I can speed this up? Alternatives to the for loops or the if statements?
Maybe you can use cut:
Tab2 <- data.frame(a = 1:9,
                   b = c(0, 7, 14, 30, 60, 90, 120, 150, 155),
                   c = c(0, 1, 7, 15, 31, 61, 91, 121, 155))
repla <- c("Zero", "1-7", "7-14", "15-30", "31-60", "61-90", "91-120", "121-150", "151-180", ">180")

for (j in 2:(dim(Tab2)[2])) {
  dum <- cut(Tab2[, j], c(-Inf, 0, 7, 14, 30, 60, 90, 120, 150, 180, Inf))
  levels(dum) <- repla
  Tab2[, j] <- dum
}
> Tab2
a b c
1 1 Zero Zero
2 2 1-7 1-7
3 3 7-14 1-7
4 4 15-30 15-30
5 5 31-60 31-60
6 6 61-90 61-90
7 7 91-120 91-120
8 8 121-150 121-150
9 9 151-180 151-180
I haven't looked at it too closely, but you may need to adjust the bands slightly.
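A small side note: cut can also take the labels directly, which avoids the separate levels() assignment; a minimal sketch, assuming the original numeric Tab2:
for (j in 2:ncol(Tab2)) {
  Tab2[, j] <- cut(Tab2[, j],
                   breaks = c(-Inf, 0, 7, 14, 30, 60, 90, 120, 150, 180, Inf),
                   labels = repla)
}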

Reordering items with multiple order criteria

Scenario:
list of photos
every photo has the following properties
id
sequence_number
main_photo_bit
the first photo has the main_photo_bit set to 1 (all others are 0)
photos are ordered by sequence_number (which is arbitrary)
the main photo does not necessarily have the lowest sequence_number (before sorting)
See the following table:
id, sequence_number, main_photo_bit
1 10 1
2 5 0
3 20 0
Now you want to change the order by changing the sequence number and main photo bit.
Requirements after sorting:
the sequence_number of the first photo is not changed
the sequence_number of the first photo is the lowest
as few changes as possible
Examples:
Example #1 (second photo goes to the first position):
id, sequence_number, main_photo_bit
2 10 1
1 15 0
3 20 0
This is what happened:
id 1: new sequence_number and main_photo_bit set to 0
id 2: takes the old first photo's (id 1) sequence_number, and main_photo_bit set to 1
id 3: nothing happens
Example #2 (third photo to first position):
id, sequence_number, main_photo_bit
3 10 1
1 20 0
2 30 0
This is what happened:
id 1: new sequence_number, bigger than the first photo's, and main_photo_bit set to 0
id 2: new sequence_number, bigger than the newly generated second sequence_number
id 3: takes the old first photo's sequence_number, and main_photo_bit set to 1
What is the best approach to calculate the steps needed to save the new order?
Edit:
The reason I want as few updates as possible is that I sync the data to an external service, which is quite a costly operation.
I already got a working prototype of the algorithm, but it fails in some edge cases. So instead of patching it up (which might work -- but it will become even more complex than it is already), I want to know if there are other (better) ways to do it.
In my version (in short) it orders the photos (changing sequence_numbers) and swaps the main_photo_bit, but it isn't sufficient to solve every scenario.
From what I understood, a good solution would not only minimize changes (since updating is the costly operation), but also try to minimize future changes, as more and more photos are reordered. I'd start by adding a temporary field dirty, to indicate if the row must change or not:
id, sequence_number, main_photo_bit, dirty
1 10 1 false
2 5 0 false
3 20 0 false
4 30 0 false
5 31 0 false
6 33 0 false
If there are rows whose sequence_number is smaller than the first photo's, they will surely have to change (either to get a higher number, or to become the first). Let's mark them as dirty:
id, sequence_number, main_photo_bit, dirty
2 5 0 true
(skip this step if it's not really important that the first has the lowest sequence_number)
Now let's see the list of photos, as they should be in the result (as per the question, only one photo changed places, from anywhere to anywhere). Dirty ones in bold:
[1, 2, 3, 4, 5, 6] # Original ordering
[2, 1, 3, 4, 5, 6] # Example 1: 2nd to 1st place
[3, 1, 2, 4, 5, 6] # Example 2: 3rd to 1st place
[1, 2, 4, 3, 5, 6] # Example 3: 3rd to 4th place
[1, 3, 2, 4, 5, 6] # Example 4: 3rd to 2nd place
The first thing to do is ensure the first element has the lowest sequence_number. If it hasn't changed places, it already has the lowest by definition; otherwise, the old first should be marked as dirty and have its main_photo_bit cleared, and the new first should receive those values itself.
At this point, the first element has a fixed sequence_number, and every dirty element can have its value changed at will (it has to change anyway, so it might as well change to a useful value). Before proceeding, we must determine whether the problem can be solved by changing only the dirty rows, or whether more rows will have to be dirtied as well. This is simply a matter of checking that the interval between every pair of clean rows is big enough to fit the number of dirty rows between them:
[10, D, 20, 30, 31, 33] # Original ordering (the first is dirty, but fixed)
[10, D, 20, 30, 31, 33] # Example 1: 2nd to 1st place (ok: 10 < ? < 20)
[10, D, D, 30, 31, 33] # Example 2: 3rd to 1st place (ok: 10 < ? < ? < 30)
[10, D, 30, D, 31, 33] # Example 3: 3rd to 4th place (NOT OK: 30 < ? < 31)
[10, D, 30, D, D, 33] # must mark 5th as dirty too (ok: 30 < ? < ? < 33)
[10, D, D, 30, 31, 33] # Example 4: 3rd to 2nd place (ok)
Now it's just a matter of assigning new sequence_numbers to the dirty rows. A naïve solution would be to just increment the previous one, but a better approach is to set them as equally spaced as possible (see the small sketch after the examples below). This way, there are better odds that a future reorder will require fewer changes; in other words, it avoids problems like Example 3, where more rows than necessary had to be updated because some sequence_numbers were too close to each other:
[10, 15, 20, 30, 31, 33] # Example 1: 2nd to 1st place
[10, 16, 23, 30, 31, 33] # Example 2: 3rd to 1st place
[10, 20, 30, 31, 32, 33] # Example 3: 3rd to 4th place
[10, 16, 23, 30, 31, 33] # Example 4: 3rd to 2nd place
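To make the "equally spaced" step concrete, a tiny sketch in R (the helper name is made up): given the clean neighbours lo and hi with k dirty rows between them, it picks k evenly spaced sequence_numbers:
# k evenly spaced integer values strictly between lo and hi
spaced <- function(lo, hi, k) floor(seq(lo, hi, length.out = k + 2))[2:(k + 1)]

spaced(10, 20, 1)   # Example 1: 15
spaced(10, 30, 2)   # Example 2: 16 23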
Bonus: if you really want to push the solution to its limits, do the computation twice, once moving the photo and once keeping it fixed while moving the surrounding photos, and see which one results in fewer changes. Take example 3A, where instead of "3rd to 4th place" we treat it as "4th to 3rd place" (same sorting result, but different changes):
[1, 2, 4, 3, 5, 6] # Example 3A: 4th to 3rd place
[10, D, D, 20, 31, 33] # (ok: 10 < ? < ? < 20)
[10, 13, 16, 20, 31, 33] # One less change
In most cases this can be done (e.g. 2nd to 4th position == 3rd/4th to 2nd/3rd position); whether the added complexity is worth the small gain is up to you to decide.
Use a linked list instead of sequence numbers. Then you can remove a picture from anywhere in the list and reinsert it anywhere else, and you only need to update three rows in your database. The main photo bit becomes unnecessary, since the first photo is implicitly defined by having no pointer to it.
id next
1 3
2 1
3
the order is: 2, 1, 3
user moves picture 3 to position 1:
id next
1
2 1
3 2
new order is: 3, 2, 1
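To make the pointer idea concrete, a small sketch in R that recovers the display order from the next pointers in the first table above (NA standing in for the empty next field):
photos <- data.frame(id = c(1, 2, 3), nxt = c(3, 1, NA))

ord <- integer(0)
cur <- setdiff(photos$id, photos$nxt)     # first photo: nothing points to it
while (length(cur) == 1 && !is.na(cur)) {
  ord <- c(ord, cur)
  cur <- photos$nxt[photos$id == cur]
}
ord   # 2 1 3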
