I am having issues extracting survival data at specific times (years 1, 5, and 10). I tried summary(fit, times = c(1, 5, 10)), but this doesn't extract the right survival estimates.
I have written the following code to censor the data to include only the cohort for year 1 and extract survival for year 1:
TIME <- 1
tmp <- data1[data1$tstart < TIME*365.25, ]
tmp <- tmp[!duplicated(tmp$id, fromLast = TRUE), ]
tmp$status[tmp$time > TIME*365.25] <- 0
tmp$time[tmp$time > TIME*365.25] <- TIME*365.25
fit <- survfit(Surv(time/365.25, status) ~ drug_dosage, data = tmp)
fit_year <- summary(fit, times = TIME)
My question is: how can I turn this into a loop that also covers years 5 and 10? Thank you in advance.
This is a sample of what my data looks like.
id time status tstart
1 2131 2311 0 0
2 2131 2311 0 17
3 2131 2311 0 50
4 2131 2311 0 105
5 2131 2311 0 133
6 2131 2311 0 153
7 2131 2311 0 209
8 2131 2311 0 238
9 2131 2311 0 276
10 2131 2311 0 317
I think this is what you are looking for. It would be great if the sample data you provide corresponded to the code chunk you provide, in order to ensure reproducibility.
for (i in c(1, 5, 10)) {
  TIME <- i
  tmp <- data1[data1$tstart < TIME*365.25, ]
  tmp <- tmp[!duplicated(tmp$id, fromLast = TRUE), ]   # was duplicated(data$id, ...); 'data' is not defined here
  tmp$status[tmp$time > TIME*365.25] <- 0
  tmp$time[tmp$time > TIME*365.25] <- TIME*365.25
  fit <- survfit(Surv(time/365.25, status) ~ drug_dosage, data = tmp)
  fit_year <- summary(fit, times = TIME)               # the argument is 'times', not 'time'
}
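Note that fit_year is overwritten on every pass, so only the year-10 summary survives the loop. A minimal sketch of how I would keep all three summaries, assuming data1 as in the question: collect them in a list keyed by year.

library(survival)

results <- list()
for (i in c(1, 5, 10)) {
  tmp <- data1[data1$tstart < i*365.25, ]
  tmp <- tmp[!duplicated(tmp$id, fromLast = TRUE), ]
  tmp$status[tmp$time > i*365.25] <- 0          # censor events past the cutoff
  tmp$time[tmp$time > i*365.25] <- i*365.25     # truncate follow-up at the cutoff
  fit <- survfit(Surv(time/365.25, status) ~ drug_dosage, data = tmp)
  results[[as.character(i)]] <- summary(fit, times = i)
}
results[["5"]]   # the survival estimates at year 5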
We've run an Interrupted Time Series analysis on some aggregate count data using a Poisson regression. The code is shown below, where Subject Total is the count, Quarter is time, int2 is the dummy variable for the intervention (0 pre, 1 post), and time_since_intervention2 is the dummy variable for time since the intervention (0 pre, 1:N post).
fit1a <- glm(`Subject Total` ~ Quarter + int2 + time_since_intervention2 , df, family = "poisson")
Quarter Subject Total int2 time_since_intervention2 subjectfit subcounter
1 1 34 0 0 34.20968 34.20968
2 2 32 0 0 33.39850 33.39850
3 3 36 0 0 32.60656 32.60656
4 4 34 0 0 31.83339 31.83339
5 5 23 0 0 31.07856 31.07856
6 6 34 0 0 30.34163 30.34163
7 7 33 0 0 29.62217 29.62217
8 8 24 0 0 28.91977 28.91977
9 9 31 0 0 28.23402 28.23402
10 10 32 0 0 27.56454 27.56454
11 11 21 0 0 26.91093 26.91093
12 12 26 0 0 26.27282 26.27282
13 13 22 0 0 25.64984 25.64984
14 14 28 0 0 25.04163 25.04163
15 15 28 0 0 24.44784 24.44784
16 16 22 0 0 23.86814 23.86814
17 17 14 1 1 17.88365 23.30218
18 18 16 1 2 17.01622 22.74964
19 19 20 1 3 16.19087 22.21020
20 20 19 1 4 15.40556 21.68355
21 21 13 1 5 14.65833 21.16939
22 22 15 1 6 13.94735 20.66743
23 23 16 1 7 13.27085 20.17736
24 24 8 1 8 12.62717 19.69892
Due to the need to exponentiate the outcome, the summary is currently being derived using the margins package.
> summary(margins(fit1a))
factor AME SE z p lower upper
int2 -5.7843 5.1734 -1.1181 0.2635 -15.9241 4.3555
Quarter -0.5809 0.2469 -2.3526 0.0186 -1.0649 -0.0970
time_since_intervention2 -0.6227 0.9955 -0.6255 0.5316 -2.5738 1.3285
If I am reading the output correctly, it suggests that the level change between the final quarter of the pre-intervention period and the first quarter of the post-intervention period is -5.7843.
I've tried inputting coefficient values into my model (initial intercept = 35.0405575), but they don't appear to correspond at all to the subjectfit data, which I believed they would. Should the level change reported by the margins package replicate the difference in the fitted data?
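For what it's worth, here is a minimal sketch of the first check I would run, assuming fit1a and df as defined above: the coefficients live on the log scale, so the linear predictor has to be exponentiated before it can be compared with the subjectfit column. Also, the AME reported by margins is an average over all observations, so it need not equal the single pre/post level change.

cf  <- coef(fit1a)
eta <- cf["(Intercept)"] +
       cf["Quarter"]                  * df$Quarter +
       cf["int2"]                     * df$int2 +
       cf["time_since_intervention2"] * df$time_since_intervention2
manual_fit <- exp(eta)                                  # back-transform from the log scale
all.equal(unname(manual_fit), unname(fitted(fit1a)))    # should be TRUE
head(cbind(manual_fit, df$subjectfit))                  # compare against subjectfit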
I can see a very high percentage of stolen time on an EC2 web server (t2.micro) without any load (one concurrent user), together with a high page load time. Is there a correlation between high load time and high stolen time? I have the same symptoms with another server of class t2.medium.
Do you have an explanation?
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 0 0 79824 7428 479172 0 0 0 0 52 49 18 0 0 0 82
1 0 0 79792 7436 479172 0 0 0 6 54 49 18 0 0 0 82
1 0 0 79824 7444 479172 0 0 0 5 54 51 18 0 0 0 82
Given an image I and two matrices m_1, m_2 (the same size as I), the function f is defined as:

f = (I - m_1)^2 - (I - m_2)^2

Because my design only needs the sign of f, the function can be rewritten as follows (with ties counted as +1):

sign(f) = sign(|I - m_1| - |I - m_2|)
I think the second formula is faster than the first because:

it avoids the square terms;

it computes the sign directly, instead of the two steps in the first equation (compute f, then check its sign).

Do you agree? Do you have an even faster formula for f?
I =[16 23 11 42 10
11 21 22 24 30
16 22 154 155 156
25 28 145 151 156
11 38 147 144 153];
m1 =[0 0 0 0 0
0 0 22 11 0
0 23 34 56 0
0 56 0 0 0
0 11 0 0 0];
m2 =[0 0 0 0 0
0 0 12 11 0
0 22 111 156 0
0 32 0 0 0
0 12 0 0 0];
The output f is
f =[1 1 1 1 1
1 1 -1 1 1
1 1 1 1 1
1 1 1 1 1
1 1 1 1 1]
I implemented the first way, but I did not finish the second way in MATLAB. Could you help me with the second way and compare the two?
UPDATE: I have added the code from chepyle and Divakar to make the question clearer. Note that both give the same result as f above.
function compare()
I =[16 23 11 42 10
11 21 22 24 30
16 22 154 155 156
25 28 145 151 156
11 38 147 144 153];
m1 =[0 0 0 0 0
0 0 22 11 0
0 23 34 56 0
0 56 0 0 0
0 11 0 0 0];
m2 =[0 0 0 0 0
0 0 12 11 0
0 22 111 156 0
0 32 0 0 0
0 12 0 0 0];
function f=first_way()
f=sign((I-m1).^2-(I-m2).^2);
f(f==0)=1;
end
function f= second_way()
f = double(abs(I-m1) >= abs(I-m2));
f(f==0) = -1;
end
function f= third_way()
v1=abs(I-m1);
v2=abs(I-m2);
f= int8(v1>v2) + -1*int8(v1<v2); % need to convert to int from logical
f(f==0) = 1;
end
disp(['First way : ' num2str(timeit(@first_way))])
disp(['Second way: ' num2str(timeit(@second_way))])
disp(['Third way : ' num2str(timeit(@third_way))])
end
First way : 1.2897e-05
Second way: 1.9381e-05
Third way : 2.0077e-05
This seems to be comparable and might be a wee bit faster at times than the original approach -
f = sign(abs(I-m1) - abs(I-m2)) + sign(abs(m1-m2)) + ...
    sign(abs(2*I-m1-m2)) - 1 - sign(abs(2*I-m1-m2) + abs(m1-m2))
Benchmarking Code
%// Create random inputs
N = 5000;
I = randi(1000,N,N);
m1 = randi(1000,N,N);
m2 = randi(1000,N,N);
num_iter = 20; %// Number of iterations for all approaches
%// Warm up tic/toc.
for k = 1:100000
tic(); elapsed = toc();
end
disp('------------------------- With Original Approach')
tic
for iter = 1:num_iter
out1 = sign((I-m1).^2-(I-m2).^2);
out1(out1==0)=-1;
end
toc, clear out1
disp('------------------------- With Proposed Approach')
tic
for iter = 1:num_iter
out2 = sign(abs(I-m1) - abs(I-m2)) + sign(abs(m1-m2)) + ...
sign(abs(2*I-m1-m2)) - 1 -sign(abs(2*I-m1-m2) + abs(m1-m2));
end
toc
Results
------------------------- With Original Approach
Elapsed time is 1.751966 seconds.
------------------------- With Proposed Approach
Elapsed time is 1.681263 seconds.
There is a problem with the accuracy of the second formula, but for the sake of comparison, here's how I would implement it in MATLAB, along with a third approach that avoids both the squaring and the sign() function, in line with your intent. Note that MATLAB's matrix and sign() operations are pretty well optimized; the second and third approaches are both slower.
function compare()
I =[16 23 11 42 10
11 21 22 24 30
16 22 154 155 156
25 28 145 151 156
11 38 147 144 153];
m1 =[0 0 0 0 0
0 0 22 11 0
0 23 34 56 0
0 56 0 0 0
0 11 0 0 0];
m2 =[0 0 0 0 0
0 0 12 11 0
0 22 111 156 0
0 32 0 0 0
0 12 0 0 0];
function f=first_way()
f=sign((I-m1).^2-(I-m2).^2);
end
function f= second_way()
v1=(I-m1);
v2=(I-m2);
f= int8(v1<=0 & v2>0) + -1* int8(v1>0 & v2<=0);
end
function f= third_way()
v1=abs(I-m1);
v2=abs(I-m2);
f= int8(v1>v2) + -1*int8(v1<v2); % need to convert to int from logical
end
disp(['First way : ' num2str(timeit(@first_way))])
disp(['Second way: ' num2str(timeit(@second_way))])
disp(['Third way : ' num2str(timeit(@third_way))])
end
The output:
First way : 9.4226e-06
Second way: 1.2247e-05
Third way : 1.1546e-05
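To make the accuracy caveat concrete, a small made-up scalar example (not taken from the question's matrices): whenever I-m1 and I-m2 share the same sign, the sign-test logic in second_way() returns 0 even though the squared difference is nonzero.

% I-m1 = 4 and I-m2 = 3 are both positive, so the sign test sees no difference
I = 5; m1 = 1; m2 = 2;
sign((I-m1)^2 - (I-m2)^2)                                  % first formula: returns 1
int8((I-m1)<=0 & (I-m2)>0) - int8((I-m1)>0 & (I-m2)<=0)    % second_way logic: returns 0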
I have this data
y x1 x2 pre
1 16 1 1 14
2 15 1 1 13
3 14 1 2 14
4 13 1 2 13
5 12 2 1 12
6 11 2 1 12
7 11 2 2 13
8 13 2 2 13
9 10 3 1 10
10 11 3 1 11
11 11 3 2 11
12 9 3 2 10
And I fitted the following model (with pre as a covariate and x1, x2 as factors, which is what the design matrix below encodes):
lm(y ~ pre + factor(x1) * factor(x2))
My design matrix is
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 1 14 1 0 1 1 0
[2,] 1 13 1 0 1 1 0
[3,] 1 14 1 0 0 0 0
[4,] 1 13 1 0 0 0 0
[5,] 1 12 0 1 1 0 1
[6,] 1 12 0 1 1 0 1
[7,] 1 13 0 1 0 0 0
[8,] 1 13 0 1 0 0 0
[9,] 1 10 0 0 1 0 0
[10,] 1 11 0 0 1 0 0
[11,] 1 11 0 0 0 0 0
[12,] 1 10 0 0 0 0 0
I'm trying to use this design to reproduce the following table:
Source DF Squares Mean Square F Value Pr > F
Model 6 44.79166667 7.46527778 12.98 0.0064
Error 5 2.87500000 0.57500000
Corrected Total 11 47.66666667
Source DF Type III SS Mean Square F Value Pr > F
pre 1 3.12500000 3.12500000 5.43 0.0671
x1 2 4.58064516 2.29032258 3.98 0.0923
x2 1 3.01785714 3.01785714 5.25 0.0706
x1*x2 2 1.25000000 0.62500000 1.09 0.4055
The first part is fine
XtX <- t(x) %*% x
XtXinv <- solve(XtX)
betahat <- XtXinv %*% t(x) %*% y          # OLS estimates
H <- x %*% XtXinv %*% t(x)                # hat matrix
IH <- (diag(1, 12) - H)                   # residual-maker matrix
yhat <- H %*% y                           # fitted values
e <- IH %*% y                             # residuals
ybar <- mean(y)
MSS <- t(betahat) %*% t(x) %*% y - length(y)*(ybar^2)   # model (corrected) SS
ESS <- t(e) %*% e                                       # error SS
TSS <- MSS + ESS                                        # corrected total SS
dfM <- sum(diag(H)) - 1                   # model df = rank(X) - 1
dfE <- sum(diag(IH))                      # error df = n - rank(X)
dfT <- dfM + dfE
MSM <- MSS/dfM
MSE <- ESS/dfE
Ftest <- MSM / MSE
pr <- 1 - pf(Ftest, dfM, dfE)
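As a sanity check on the hand-rolled quantities above, the same tables can be reproduced directly with lm(). A sketch, rebuilding the data frame from the listing at the top (note that SAS-style Type III tests need sum-to-zero contrasts in R):

dat <- data.frame(
  y   = c(16, 15, 14, 13, 12, 11, 11, 13, 10, 11, 11, 9),
  x1  = factor(rep(1:3, each = 4)),
  x2  = factor(rep(c(1, 1, 2, 2), times = 3)),
  pre = c(14, 13, 14, 13, 12, 12, 13, 13, 10, 11, 11, 10))

fit <- lm(y ~ pre + x1 * x2, data = dat)
summary(fit)$fstatistic          # overall F: should match Ftest (about 12.98 on 6 and 5 df)

old <- options(contrasts = c("contr.sum", "contr.poly"))
fit3 <- lm(y ~ pre + x1 * x2, data = dat)
drop1(fit3, ~ ., test = "F")     # should reproduce the Type III table above
options(old)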
The contrast coefficient matrix for 'pre' seems correct.
L <- matrix(c(0,1,0,0,0,0,0), 1, 7, byrow=T)
Lb <- L %*% betahat
LXtXinvLt <- round(L %*% XtXinv %*% t(L), digits=4)
SSpre <- t(Lb) %*% solve(LXtXinvLt) %*% (Lb)
MSpre <- SSpre / 1
Fpre <- MSpre / MSE
PRpre <- 1 - pf(Fpre, 1, 12-7)
But I can't understand how to define the contrast coefficient matrices for x1, x2, and x1*x2. What's the problem with the rest of my code? Below is an example of how I think I should calculate it for x1 (see the sketch below for what I would try instead):
L <- matrix(c(0,0,1,1,0,0,0), 1, 7, byrow=T)
Lb <- L %*% betahat
LXtXinvLt <- round(L %*% XtXinv %*% t(L), digits=4)
SSX1 <- t(Lb) %*% solve(LXtXinvLt) %*% (Lb)
MSX1 <- SSX1 / 1
FX1 <- MSX1 / MSE
PRX1 <- 1 - pf(FX1, 1, 12-7)
Thanks!
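A hedged sketch of what I believe is missing, based on standard Type III conventions (checked here only against the degrees of freedom in the SAS table): x1 has 2 degrees of freedom, so its contrast matrix needs 2 rows, one per non-baseline level, each averaged over the levels of x2. The sum of squares is then divided by 2, and the F test uses 2 numerator df. With the column order intercept, pre, x1=1, x1=2, x2=1, x1=1:x2=1, x1=2:x2=1:

L <- matrix(c(0, 0, 1, 0, 0, 0.5, 0,     # x1 = 1 vs x1 = 3, averaged over x2
              0, 0, 0, 1, 0, 0,   0.5),  # x1 = 2 vs x1 = 3, averaged over x2
            2, 7, byrow = TRUE)
Lb   <- L %*% betahat
SSX1 <- t(Lb) %*% solve(L %*% XtXinv %*% t(L)) %*% Lb
MSX1 <- SSX1 / 2                 # 2 df for x1
FX1  <- MSX1 / MSE
PRX1 <- 1 - pf(FX1, 2, dfE)

By the same logic, x2 is the single row c(0, 0, 0, 0, 1, 1/3, 1/3) (averaged over the three x1 levels), and the x1*x2 interaction is the two rows that pick out the interaction columns directly.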
I'm working with Oracle, and its GROUP BY clause seems to behave very differently from what I'd expect.
When using this query:
SELECT stats.gds_id,
stats.stat_date,
SUM(stats.A_BOOKINGS_NBR) as "Bookings",
SUM(stats.RESPONSES_LESS_1_NBR) as "<1",
SUM(stats.RESPONSES_LESS_2_NBR) AS "<2",
SUM(STATS.RESPONSES_LESS_3_NBR) AS "<3",
SUM(stats.RESPONSES_LESS_4_NBR) AS "<4",
SUM(stats.RESPONSES_LESS_5_NBR) AS "<5",
SUM(stats.RESPONSES_LESS_6_NBR + stats.RESPONSES_LESS_7_NBR + stats.RESPONSES_GREATER_7_NBR) AS ">5",
SUM(stats.RESPONSES_LESS_6_NBR) AS "<6",
SUM(stats.RESPONSES_LESS_7_NBR) AS "<7",
SUM(stats.RESPONSES_GREATER_7_NBR) AS ">7",
SUM(stats.RESPONSES_LESS_1_NBR + stats.RESPONSES_LESS_2_NBR + stats.RESPONSES_LESS_3_NBR + stats.RESPONSES_LESS_4_NBR + stats.RESPONSES_LESS_5_NBR + stats.RESPONSES_LESS_6_NBR + stats.RESPONSES_LESS_7_NBR + stats.RESPONSES_GREATER_7_NBR) as "Total"
FROM gwydb.statistics stats
WHERE stats.stat_date >= '01-JUN-2011'
GROUP BY stats.gds_id, stats.stat_date
I get results like this:
GDS_ID STAT_DATE Bookings <1 <2 <3 <4 <5 >5 <6 <7 >7 Total
02 12-JUN-11 0 1 0 0 0 0 0 0 0 0 1
1A 01-JUN-11 15 831 52 6 2 2 4 1 1 2 897
1A 01-JUN-11 15 758 59 8 1 1 5 2 1 2 832
1A 01-JUN-11 10 593 40 2 2 1 2 1 0 1 640
1A 01-JUN-11 12 678 40 10 5 2 3 1 0 2 738
1A 01-JUN-11 24 612 56 6 1 3 4 0 0 4 682
1A 01-JUN-11 23 552 37 7 1 1 2 0 1 1 600
1A 01-JUN-11 35 1147 132 13 6 0 8 0 2 6 1306
1A 01-JUN-11 91 2331 114 14 5 1 14 3 1 10 2479
As you can see, I have multiple duplicate STAT_DATEs per GDS_ID. Why is that, and how can I make it group by both of them, i.e. sum the values for each GDS_ID per STAT_DATE?
Probably because STAT_DATE has a time component, which is being taken into account in the GROUP BY but not being displayed in the results due to the default format mask. To ignore the time, do this:
SELECT stats.gds_id,
TRUNC(stats.stat_date) stat_date,
SUM(stats.A_BOOKINGS_NBR) as "Bookings",
SUM(stats.RESPONSES_LESS_1_NBR) as "<1",
SUM(stats.RESPONSES_LESS_2_NBR) AS "<2",
SUM(STATS.RESPONSES_LESS_3_NBR) AS "<3",
SUM(stats.RESPONSES_LESS_4_NBR) AS "<4",
SUM(stats.RESPONSES_LESS_5_NBR) AS "<5",
SUM(stats.RESPONSES_LESS_6_NBR + stats.RESPONSES_LESS_7_NBR + stats.RESPONSES_GREATER_7_NBR) AS ">5",
SUM(stats.RESPONSES_LESS_6_NBR) AS "<6",
SUM(stats.RESPONSES_LESS_7_NBR) AS "<7",
SUM(stats.RESPONSES_GREATER_7_NBR) AS ">7",
SUM(stats.RESPONSES_LESS_1_NBR + stats.RESPONSES_LESS_2_NBR + stats.RESPONSES_LESS_3_NBR + stats.RESPONSES_LESS_4_NBR + stats.RESPONSES_LESS_5_NBR + stats.RESPONSES_LESS_6_NBR + stats.RESPONSES_LESS_7_NBR + stats.RESPONSES_GREATER_7_NBR) as "Total"
FROM gwydb.statistics stats
WHERE stats.stat_date >= '01-JUN-2011'
GROUP BY stats.gds_id, TRUNC(stats.stat_date)
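As an aside, the filter stats.stat_date >= '01-JUN-2011' relies on an implicit string-to-date conversion governed by the session's NLS settings; an explicit TO_DATE('01-JUN-2011', 'DD-MON-YYYY') is safer and makes the query behave the same in every session.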