Lubridate: intervals that overlap any other by group

Hello, and thank you so much!
I'm trying to identify which intervals overlap any other interval in the same group.
For instance, if we had the following data:
library(dplyr)
library(lubridate)

id <- rep(1:3, each = 3)
hospitalization <- seq(ymd_hms("2017-11-28 00:00:01"), by = "day", length.out = length(id))
dat <- data.frame(id, hospitalization)
dat[3, 2] <- dat[3, 2] + dhours(12)

dat %>%
  mutate(
    discharge = hospitalization + dhours(35),
    interval = hospitalization %--% discharge
  ) -> dat
dat
> dat
id hospitalization discharge interval
1 1 2017-11-28 00:00:01 2017-11-29 11:00:01 2017-11-28 00:00:01 UTC--2017-11-29 11:00:01 UTC
2 1 2017-11-29 00:00:01 2017-11-30 11:00:01 2017-11-29 00:00:01 UTC--2017-11-30 11:00:01 UTC
3 1 2017-11-30 12:00:01 2017-12-01 23:00:01 2017-11-30 12:00:01 UTC--2017-12-01 23:00:01 UTC
4 2 2017-12-01 00:00:01 2017-12-02 11:00:01 2017-12-01 00:00:01 UTC--2017-12-02 11:00:01 UTC
5 2 2017-12-02 00:00:01 2017-12-03 11:00:01 2017-12-02 00:00:01 UTC--2017-12-03 11:00:01 UTC
6 2 2017-12-03 00:00:01 2017-12-04 11:00:01 2017-12-03 00:00:01 UTC--2017-12-04 11:00:01 UTC
7 3 2017-12-04 00:00:01 2017-12-05 11:00:01 2017-12-04 00:00:01 UTC--2017-12-05 11:00:01 UTC
8 3 2017-12-05 00:00:01 2017-12-06 11:00:01 2017-12-05 00:00:01 UTC--2017-12-06 11:00:01 UTC
9 3 2017-12-06 00:00:01 2017-12-07 11:00:01 2017-12-06 00:00:01 UTC--2017-12-07 11:00:01 UTC
dat[1,4]
dat[2,4]
dat[3,4]
int_overlaps(dat[1,4],dat[2,4])
int_overlaps(dat[2,4],dat[3,4])
int_overlaps(dat[1,4],dat[3,4])
I would like to add a Boolean column (overlap_any) indicating whether an interval overlaps at least one (not necessarily all) of the other intervals in the same group.
When grouping by id, for id == 1 the first and second intervals overlap each other, but neither overlaps the third. So for that id, overlap_any should be (TRUE, TRUE, FALSE).
I was thinking of something like:
dat %>%
  group_by(id) %>%
  mutate(
    overlap_any = some_function(interval)
  )
But I don't know what to put there: inside mutate() I get the whole vector of intervals for the group, not the single current row that I want to test against the rest, and int_overlaps() only takes two arguments.
I appreciate the help!

I did:
# count how many of the other intervals each interval overlaps;
# overlaps > 0 gives the Boolean overlap_any asked for
overlaps_others <- function(y) sapply(y, function(x) sum(int_overlaps(x, y))) - 1

dat %>%
  split(.$id) %>%
  lapply(function(z) {
    z %>%
      mutate(
        overlaps = overlaps_others(interval)
      ) %>%
      select(-interval)
  }) %>%
  bind_rows()
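If you would rather stay inside a single grouped mutate(), here is a minimal sketch of the same idea using purrr::map_lgl(); it assumes dplyr/lubridate versions that can carry an Interval column through a grouped mutate:
library(purrr)

dat %>%
  group_by(id) %>%
  mutate(
    # test each row's interval against every other interval in its group
    overlap_any = map_lgl(row_number(), function(i) any(int_overlaps(interval[i], interval[-i])))
  ) %>%
  ungroup()
For id == 1 this should give TRUE, TRUE, FALSE, matching the expected result described above.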

Related

Rank using many partition values in LINQ C#

My table data are:
Id  FileName  CreatedDate              Month  Year  Date
1   ff1       2022-07-24 12:19:59.740  7      2022  24
2   ff1       2022-07-24 12:19:59.740  7      2022  24
3   ff1       2021-07-24 12:19:59.740  7      2021  24
4   ff1       2022-05-24 12:19:59.740  5      2022  24
5   ff1       2021-03-24 12:19:59.740  3      2021  24
6   ff1       2021-03-24 11:19:59.740  3      2021  24
7   ff1       2021-03-24 08:19:59.740  3      2021  24
My SQL query is as follows
select
filename,
createddate,
month,
dense_rank() over(partition by year, month order by createddate desc) rank
from filedetails;
My output is:
FileName  CreatedDate              Month  Date  Rank
ff1       2021-03-25 08:19:59.740  3      25    1
ff1       2021-03-24 12:19:59.740  3      24    2
ff1       2021-03-24 11:19:59.740  3      24    3
ff1       2021-03-24 08:19:59.740  3      24    4
ff1       2021-07-28 12:19:59.740  7      28    1
ff1       2021-07-24 12:19:59.740  7      24    2
ff1       2022-05-28 01:29:59.740  5      28    1
ff1       2022-05-24 12:19:59.740  5      24    2
ff1       2022-07-24 12:19:59.740  7      24    1
ff1       2022-07-24 12:19:59.740  7      24    1
How can I achieve the same ranking using C# LINQ?
Thank you in advance.
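LINQ has no built-in DENSE_RANK, but you can emulate it by grouping on the partition key and ranking the distinct dates within each group. A minimal sketch - the fileDetails list and its property names are assumptions based on the table above, not part of the question:
var ranked = fileDetails
    .GroupBy(f => new { f.Year, f.Month })              // partition by year, month
    .SelectMany(g =>
    {
        // distinct created dates, newest first, define the dense ranks
        var rankByDate = g.Select(x => x.CreatedDate)
                          .Distinct()
                          .OrderByDescending(d => d)
                          .Select((d, i) => new { Date = d, Rank = i + 1 })
                          .ToDictionary(x => x.Date, x => x.Rank);

        return g.Select(f => new
        {
            f.FileName,
            f.CreatedDate,
            f.Month,
            Day = f.CreatedDate.Day,
            Rank = rankByDate[f.CreatedDate]
        });
    })
    .ToList();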

Question about the difference between 2 dates, which is the right approach?

I've recently been asked to do a simple exercise: calculate the difference between 2 dates.
The first date is the birth date; the second is the current date (or any date entered as the end date), let's call it the "end date".
Format: Year, Month, Day (we don't care about hours for now)
Here is the deal:
The birth date: 2021, 02, 11
The end date: 2022, 02, 10
The difference between these 2 dates is 11 months and 30 days. That seems like the logical answer, but let's dig a little deeper:
11/02 -> 11/03 = 1 month : 17 + 11 = 28 }
11/03 -> 11/04 = 1 month : 20 + 11 = 31 |
11/04 -> 11/05 = 1 month : 19 + 11 = 30 |
11/05 -> 11/06 = 1 month : 20 + 11 = 31 | From February
11/06 -> 11/07 = 1 month : 19 + 11 = 30 | To January
11/07 -> 11/08 = 1 month : 20 + 11 = 31 | 11 months
11/08 -> 11/09 = 1 month : 20 + 11 = 31 | => 334 days
11/09 -> 11/10 = 1 month : 19 + 11 = 30 |
11/10 -> 11/11 = 1 month : 20 + 11 = 31 |
11/11 -> 11/12 = 1 month : 19 + 11 = 30 |
11/12 -> 11/01 = 1 month : 20 + 11 = 31 }
11/01 -> 10/02 ------------> 20 + 10 = 30
This is the reasoning behind the 11 months and 30 days, but there is another approach (correct me if I'm wrong), which is:
the starting day stays the same, which is 11/02, so the rest is:
02/2021: 17/28
03/2021: 31/31 }
04/2021: 30/30 |
05/2021: 31/31 |
06/2021: 30/30 |
07/2021: 31/31 |
08/2021: 31/31 | 11 months
09/2021: 30/30 | => 337 days
10/2021: 31/31 |
11/2021: 30/30 |
12/2021: 31/31 |
01/2022: 31/31 }
02/2022: 10/28
There is a difference of 3 days between the two approaches (337 vs. 334 days for the 11 months), so with the 2nd approach the result would be
11 months and 27 days.
Which approach do you think is the right one?
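For what it's worth, here is a quick cross-check with R's lubridate (assuming an R example is acceptable here): its calendar-aware period arithmetic anchors each month to the start day, i.e. the first approach.
library(lubridate)

birth <- ymd("2021-02-11")
end   <- ymd("2022-02-10")

as.period(interval(birth, end))   # should report 11 months and 30 days
as.numeric(end - birth)           # 364 elapsed days, whichever way you count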

Power BI - Get last month from last year

I have this table:
Year  Month  Agency  Value
2019  9      1       233
2019  9      4       132
2019  8      3       342
2020  3      2       321
2020  3      4       34
2020  5      2       56
2020  5      4       221
2020  5      1       117
2018  12     2       112
2018  12     2       411
2020  4      3       241
2020  4      2       155
I'd like to add a new measure/column that is 1 for the last month of the last year, and 0 in all other cases:
Year  Month  Agency  Value  Filter
2019  9      1       233    0
2019  9      4       132    0
2019  8      3       342    0
2020  3      2       321    0
2020  3      4       34     0
2020  5      2       56     1
2020  5      4       221    1
2020  5      1       117    1
2018  12     2       112    0
2018  12     2       411    0
2020  4      3       241    0
2020  4      2       155    0
I've been able to "copy" a new table with the values from Month = 5 and Year = 2020 ("the latest of the latest"):
TableData - Last Charge =
var table = FILTER(
TableData,
AND(
MAX('TableData '[Year])='TableData '[Year],
MAX('TableData '[Month])='TableData '[Month]
)
)
return SUMMARIZE(table , TableData [Year], TableData [Month], TableData [Agency], TableData [Value])
However, my intention is not to create new tables but to use measures/columns as a filter when I create a chart.
Thanks a lot, and sorry for my poor English.
I solved it with this measure:
Measure =
VAR a =
MAX ( 'Table'[Year] )
VAR b =
MAX ( 'Table'[Months] )
VAR c =
MAXX ( ALL ( 'Table' ), [Year] )
VAR d =
MAXX ( FILTER ( ALL ( 'Table' ), [Year] = c ), [Months] )
RETURN
IF ( a * 100 + b = c * 100 + d, 1, 0 )
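The IF works because Year * 100 + Month packs the pair into a single sortable number (the month never exceeds 12, so it cannot spill into the year part). If a calculated column is preferred instead of a measure, a minimal sketch of the same idea (assuming the table is named 'Table' with numeric Year and Month columns) would be:
Filter =
VAR latest =
    MAXX ( ALL ( 'Table' ), 'Table'[Year] * 100 + 'Table'[Month] )
RETURN
    IF ( 'Table'[Year] * 100 + 'Table'[Month] = latest, 1, 0 )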

Writing a For Loop in R for survival analysis

I am having issues extracting survival data for specific times (years 1, 5 and 10). I tried summary(fit, times = c(1, 5, 10)), but this doesn't extract the right survival estimates.
I have written the following code to censor the data to include only the cohort for year 1 and extract survival for year 1:
TIME <- 1
tmp <- data1[data1$tstart < TIME*365.25,]
tmp <- tmp[!duplicated(tmp$id,fromLast = T),]
tmp$status[tmp$time >TIME*365.25] <- 0
tmp$time[tmp$time > TIME*365.25] <- TIME*365.25
fit <- survfit(Surv(time/365.25, status) ~ drug_dosage, data=tmp)
fit_year <- summary(fit, times = TIME)
My question is: how can I create a loop over time that also covers years 5 and 10? Thank you in advance.
This is a sample of what my data looks like.
id time status tstart
1 2131 2311 0 0
2 2131 2311 0 17
3 2131 2311 0 50
4 2131 2311 0 105
5 2131 2311 0 133
6 2131 2311 0 153
7 2131 2311 0 209
8 2131 2311 0 238
9 2131 2311 0 276
10 2131 2311 0 317
I think this is what you are looking for. It would be great if the sample data you provide corresponded to the code chunk you provide, in order to ensure reproducibility.
for (i in c(1, 5, 10)) {
  TIME <- i
  tmp <- data1[data1$tstart < TIME * 365.25, ]
  tmp <- tmp[!duplicated(tmp$id, fromLast = TRUE), ]   # tmp$id, not data$id
  tmp$status[tmp$time > TIME * 365.25] <- 0
  tmp$time[tmp$time > TIME * 365.25] <- TIME * 365.25
  fit <- survfit(Surv(time / 365.25, status) ~ drug_dosage, data = tmp)
  fit_year <- summary(fit, times = TIME)                # the argument is 'times', not 'time'
}
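Note that fit_year is overwritten on every pass, so the loop above only keeps the last (year-10) summary. A small sketch of one way to keep all three, still assuming the data1 and drug_dosage names from the question:
library(survival)

# build the censored fit for one horizon and return its summary
fit_at <- function(TIME, data1) {
  tmp <- data1[data1$tstart < TIME * 365.25, ]
  tmp <- tmp[!duplicated(tmp$id, fromLast = TRUE), ]
  tmp$status[tmp$time > TIME * 365.25] <- 0
  tmp$time[tmp$time > TIME * 365.25] <- TIME * 365.25
  fit <- survfit(Surv(time / 365.25, status) ~ drug_dosage, data = tmp)
  summary(fit, times = TIME)
}

fit_by_year <- lapply(c(1, 5, 10), fit_at, data1 = data1)
names(fit_by_year) <- c("year1", "year5", "year10")
fit_by_year$year5   # survival estimates at year 5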

Oracle: group date rows by continuous range

I need to group the rows for one day and sum the hours according to each worker's continuous date ranges.
Table attendance definition:
row_no NUMBER (*,0) NOT NULL, -- row number - generated from a sequence
worker_id NUMBER NOT NULL, -- Attendance worker id
date1 DATE DEFAULT SYSDATE NOT NULL, -- Attendance Date/time
type1 NUMBER(3,0) NOT NULL, -- Attendance type: 0-Enter, 1-Exit
worker_id date1 type1
2 13/06/2016-09:00 0
3 13/06/2016-12:10 0
2 13/06/2016-13:20 1
2 13/06/2016-15:00 0
2 13/06/2016-17:00 1
3 13/06/2016-18:45 1
2 13/06/2016-19:00 0
Result if the report is run at 22:00:
worker_id date1 fr_hour to_hour hours
2 13/06/2016 09:00 13:20 4:20
2 13/06/2016 15:00 17:00 2:00
2 13/06/2016 19:00 22:00 3:00
3 13/06/2016 12:10 18:45 6:35
In the inner query we get, for every row, the date1 and type1 from the next row (LEAD 1) for the same worker, and then we filter only what we need:
SELECT worker_id,
TRUNC (date1) AS date1,
TO_CHAR (date1, 'HH24:MI') fr_hour,
TO_CHAR (date2, 'HH24:MI') to_hour,
TRUNC ( (date2 - date1) * 24) || ':' ||
TO_CHAR (TRUNC ( (date2 - date1) * 24 * 60) - TRUNC ( (date2 - date1) * 24) * 60, '00') hours
FROM (SELECT a.*,
LEAD (a.date1, 1) OVER (PARTITION BY worker_id ORDER BY date1) date2,
LEAD (a.type1, 1) OVER (PARTITION BY worker_id ORDER BY date1) type2
FROM testtemp a)
WHERE type1 = 0
AND type2 = 1
AND TRUNC (date1) = TRUNC (date2)
Taking continuous periods that start on an earlier day complicates it a bit. You can either calculate all the ranges for all dates, back to the start of time - assuming you don't archive off old records and the very first entry for any worker in the data isn't a check-out - and then, after doing all that work, filter on the date you're interested in. Or you can look only at that date's data and work out whether a worker's records start with a check-in or a check-out.
I've added records for a fourth worker:
WORKER_ID DATE1 TYPE1
---------- ---------------- ----------
4 2016-06-12 19:00 0
4 2016-06-13 03:00 1
2 2016-06-13 09:00 0
3 2016-06-13 12:10 0
4 2016-06-13 13:00 0
2 2016-06-13 13:20 1
4 2016-06-13 14:30 1
2 2016-06-13 15:00 0
2 2016-06-13 17:00 1
3 2016-06-13 18:45 1
4 2016-06-13 19:00 0
2 2016-06-13 19:00 0
You can use analytic functions to work out a row number for each entry, and also find the first type1 value for each worker that day; this also effectively pivots to get the time in and out as separate columns:
select worker_id, trunc(date1) as date1, type1,
case when type1 = 0 then date1 end as time_in,
case when type1 = 1 then date1 end as time_out,
row_number() over (partition by worker_id, trunc(date1), type1 order by date1) as rn,
min(type1) keep (dense_rank first order by date1) over (partition by worker_id, trunc(date1)) as open_start,
max(type1) keep (dense_rank last order by date1) over (partition by worker_id, trunc(date1)) as open_end,
row_number() over (partition by worker_id, trunc(date1), type1 order by date1)
- case when type1 = 1 then min(type1) keep (dense_rank first order by date1)
over (partition by worker_id, trunc(date1)) else 0 end as grp
from attendance
where date1 >= date '2016-06-13' and date1 < date '2016-06-14'
order by worker_id, attendance.date1;
WORKER_ID DATE1 TYPE1 TIME_IN TIME_OUT RN OPEN_START OPEN_END GRP
---------- ---------------- ---------- ---------------- ---------------- ---------- ---------- ---------- ----------
2 2016-06-13 00:00 0 2016-06-13 09:00 1 0 0 1
2 2016-06-13 00:00 1 2016-06-13 13:20 1 0 0 1
2 2016-06-13 00:00 0 2016-06-13 15:00 2 0 0 2
2 2016-06-13 00:00 1 2016-06-13 17:00 2 0 0 2
2 2016-06-13 00:00 0 2016-06-13 19:00 3 0 0 3
3 2016-06-13 00:00 0 2016-06-13 12:10 1 0 1 1
3 2016-06-13 00:00 1 2016-06-13 18:45 1 0 1 1
4 2016-06-13 00:00 1 2016-06-13 03:00 1 1 0 0
4 2016-06-13 00:00 0 2016-06-13 13:00 1 1 0 1
4 2016-06-13 00:00 1 2016-06-13 14:30 2 1 0 1
4 2016-06-13 00:00 0 2016-06-13 19:00 2 1 0 2
The rn column is a raw (naive) attempt to group in and out records together, but for worker 4 it goes out of step. The open_start column works out whether the first record of the day was a check-out. The value it gets - either zero or one - can then be subtracted from rn to get a more useful grouping flag, which I've called grp.
You can then use that as an inline view or CTE and aggregate the time in/out records for each group, adding nvl() or coalesce() to put in missing midnight-start or 10pm-end values:
select worker_id,
date1 as date1,
nvl(min(time_in), date1) as fr_hour,
nvl(max(time_out), date1 + 22/24) as to_hour,
date1 + (nvl(max(time_out), date1 + 22/24) - date1)
- (nvl(min(time_in), date1) - date1) as hours
from (
select worker_id,
trunc(date1) as date1,
case when type1 = 0 then date1 end as time_in,
case when type1 = 1 then date1 end as time_out,
row_number() over (partition by worker_id, trunc(date1), type1 order by date1)
- case when type1 = 1 then min(type1) keep (dense_rank first order by date1)
over (partition by worker_id, trunc(date1)) else 0 end as grp
from attendance
where date1 >= date '2016-06-13' and date1 < date '2016-06-14'
)
group by worker_id, date1, grp
order by worker_id, date1, grp;
WORKER_ID DATE1 FR_HOUR TO_HOUR HOURS
---------- ---------------- ---------------- ---------------- ----------------
2 2016-06-13 00:00 2016-06-13 09:00 2016-06-13 13:20 2016-06-13 04:20
2 2016-06-13 00:00 2016-06-13 15:00 2016-06-13 17:00 2016-06-13 02:00
2 2016-06-13 00:00 2016-06-13 19:00 2016-06-13 22:00 2016-06-13 03:00
3 2016-06-13 00:00 2016-06-13 12:10 2016-06-13 18:45 2016-06-13 06:35
4 2016-06-13 00:00 2016-06-13 00:00 2016-06-13 03:00 2016-06-13 03:00
4 2016-06-13 00:00 2016-06-13 13:00 2016-06-13 14:30 2016-06-13 01:30
4 2016-06-13 00:00 2016-06-13 19:00 2016-06-13 22:00 2016-06-13 03:00
The 'hours' value manipulates the date and the time-in/out values to come up with what looks like another time; but it's actually the elapsed time.
Finally you can format the columns to remove the bits you aren't interested in:
select worker_id,
to_char(date1, 'DD/MM/YYYY') as date1,
to_char(nvl(min(time_in), date1), 'HH24:MI') as fr_hour,
to_char(nvl(max(time_out), date1 + 22/24), 'HH24:MI') as to_hour,
to_char(date1 + (nvl(max(time_out),date1 + 22/24) - date1)
- (nvl(min(time_in), date1) - date1), 'HH24:MI') as hours
from (
select worker_id,
trunc(date1) as date1,
case when type1 = 0 then date1 end as time_in,
case when type1 = 1 then date1 end as time_out,
row_number() over (partition by worker_id, trunc(date1), type1 order by date1)
- case when type1 = 1 then min(type1) keep (dense_rank first order by date1)
over (partition by worker_id, trunc(date1)) else 0 end as grp
from attendance
where date1 >= date '2016-06-13' and date1 < date '2016-06-14'
)
group by worker_id, date1, grp
order by worker_id, date1, grp;
WORKER_ID DATE1 FR_HO TO_HO HOURS
---------- ---------- ----- ----- -----
2 13/06/2016 09:00 13:20 04:20
2 13/06/2016 15:00 17:00 02:00
2 13/06/2016 19:00 22:00 03:00
3 13/06/2016 12:10 18:45 06:35
4 13/06/2016 00:00 03:00 03:00
4 13/06/2016 13:00 14:30 01:30
4 13/06/2016 19:00 22:00 03:00
