How to speed up Pandas multilevel dataframe shift by group?

I am trying to shift a Pandas dataframe's column data by group, where the groups are defined by the first index level. Here is the demo code:
In [8]: df = mul_df(5,4,3)
In [9]: df
Out[9]:
                 COL000  COL001  COL002
STK_ID RPT_Date
A0000  B000     -0.5505  0.7445 -0.3645
       B001      0.9129 -1.0473 -0.5478
       B002      0.8016  0.0292  0.9002
       B003      2.0744 -0.2942 -0.7117
A0001  B000      0.7064  0.9636  0.2805
       B001      0.4763  0.2741 -1.2437
       B002      1.1563  0.0525 -0.7603
       B003     -0.4334  0.2510 -0.0105
A0002  B000     -0.6443  0.1723  0.2657
       B001      1.0719  0.0538 -0.0641
       B002      0.6787 -0.3386  0.6757
       B003     -0.3940 -1.2927  0.3892
A0003  B000     -0.5862 -0.6320  0.6196
       B001     -0.1129 -0.9774  0.7112
       B002      0.6303 -1.2849 -0.4777
       B003      0.5046 -0.4717 -0.2133
A0004  B000      1.6420 -0.9441  1.7167
       B001      0.1487  0.1239  0.6848
       B002      0.6139 -1.9085 -1.9508
       B003      0.3408 -1.3891  0.6739
In [10]: grp = df.groupby(level=df.index.names[0])
In [11]: grp.shift(1)
Out[11]:
                 COL000  COL001  COL002
STK_ID RPT_Date
A0000  B000         NaN     NaN     NaN
       B001     -0.5505  0.7445 -0.3645
       B002      0.9129 -1.0473 -0.5478
       B003      0.8016  0.0292  0.9002
A0001  B000         NaN     NaN     NaN
       B001      0.7064  0.9636  0.2805
       B002      0.4763  0.2741 -1.2437
       B003      1.1563  0.0525 -0.7603
A0002  B000         NaN     NaN     NaN
       B001     -0.6443  0.1723  0.2657
       B002      1.0719  0.0538 -0.0641
       B003      0.6787 -0.3386  0.6757
A0003  B000         NaN     NaN     NaN
       B001     -0.5862 -0.6320  0.6196
       B002     -0.1129 -0.9774  0.7112
       B003      0.6303 -1.2849 -0.4777
A0004  B000         NaN     NaN     NaN
       B001      1.6420 -0.9441  1.7167
       B002      0.1487  0.1239  0.6848
       B003      0.6139 -1.9085 -1.9508
The mul_df() code is attached here: How to speed up Pandas multilevel dataframe sum?
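For readers who do not want to follow the link, a hypothetical stand-in along these lines reproduces the shape of the data (the real mul_df is defined in the linked question; names and layout here are assumptions):

import numpy as np
import pandas as pd

def mul_df(level1, level2, col, seed=0):
    # Hypothetical substitute: a (STK_ID, RPT_Date) MultiIndex frame
    # of random floats with `col` columns, like the demo output above.
    rng = np.random.RandomState(seed)
    index = pd.MultiIndex.from_product(
        [['A%04d' % i for i in range(level1)],
         ['B%03d' % j for j in range(level2)]],
        names=['STK_ID', 'RPT_Date'])
    return pd.DataFrame(rng.randn(level1 * level2, col),
                        index=index,
                        columns=['COL%03d' % k for k in range(col)])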
Now I want to run grp.shift(1) on a big dataframe.
In [1]: df = mul_df(5000,30,400)
In [2]: grp = df.groupby(level=df.index.names[0])
In [3]: timeit grp.shift(1)
1 loops, best of 3: 5.23 s per loop
5.23s is too slow. How can I speed it up?
(My computer configuration is: Pentium Dual-Core T4200 @ 2.00GHz, 3.00GB RAM, Windows XP, Python 2.7.4, NumPy 1.7.1, Pandas 0.11.0, numexpr 2.0.1, Anaconda 1.5.0 (32-bit))

How about shifting the whole DataFrame object and then setting the first row of every group to NaN?
import numpy as np

dfs = df.shift(1)
dfs.iloc[df.groupby(level=0).size().cumsum()[:-1]] = np.nan
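As a quick sanity check (a sketch, assuming the hypothetical mul_df stand-in above), the result matches the groupby shift, NaNs included:

import numpy as np

df = mul_df(5, 4, 3)
grp = df.groupby(level=0)
dfs = df.shift(1)
dfs.iloc[df.groupby(level=0).size().cumsum()[:-1]] = np.nan
# Every value, including the NaN positions, should agree with grp.shift(1)
assert dfs.equals(grp.shift(1))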

The problem is that the shift operation is not Cython-optimized, so it involves a callback to Python. Compare this with:
In [84]: %timeit grp.shift(1)
1 loops, best of 3: 1.77 s per loop
In [85]: %timeit grp.sum()
1 loops, best of 3: 202 ms per loop
I added an issue for this: https://github.com/pydata/pandas/issues/4095

A similar question, where I added an answer that works for shifts in either direction and of any magnitude: pandas: setting last N rows of multi-index to Nan for speeding up groupby with shift
The code (including test setup) is:
import numpy as np
import pandas as pd

#
# the function to use in apply
#
def replace_shift_overlap(grp, col, N, value):
    if N > 0:
        grp[col][:N] = value
    else:
        grp[col][N:] = value
    return grp
length = 5
groups = 3
rng1 = pd.date_range('1/1/1990', periods=length, freq='D')
frames = []
for x in xrange(0, groups):
    tmpdf = pd.DataFrame({'date': rng1,
                          'category': int(10000000 * abs(np.random.randn())),
                          'colA': np.random.randn(length),
                          'colB': np.random.randn(length)})
    frames.append(tmpdf)
df = pd.concat(frames)
df.sort(columns=['category','date'],inplace=True)
df.set_index(['category','date'],inplace=True,drop=True)
shiftBy=-1
df['tmpShift'] = df['colB'].shift(shiftBy)
#
# the apply
#
df = df.groupby(level=0).apply(replace_shift_overlap,'tmpShift',shiftBy,np.nan)
# Yay this is so much faster.
df['newColumn'] = df['tmpShift'] / df['colA']
df.drop('tmpShift',1,inplace=True)
EDIT: Note that the initial sort really eats into the effectiveness of this. So in some cases the original answer is more effective.
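For reference, the first answer's full-frame-shift trick can be generalized to either direction without apply. This is only a hedged sketch (the helper name is mine, and it assumes each level-0 group occupies a contiguous block of rows in a sorted frame):

import numpy as np

def fast_group_shift(df, n):
    # Shift by n rows within each level-0 group by shifting the whole
    # frame once and blanking the rows that crossed a group boundary.
    out = df.shift(n)
    sizes = df.groupby(level=0).size()
    ends = sizes.cumsum()              # one past each group's last row
    starts = ends - sizes              # each group's first row
    bad = []
    for s, e in zip(starts, ends):
        if n > 0:
            bad.extend(range(s, min(s + n, e)))   # first n rows of the group
        else:
            bad.extend(range(max(e + n, s), e))   # last |n| rows of the group
    out.iloc[bad] = np.nan
    return out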

try this:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [10, 20, 15, 30, 45, 43, 67, 22, 12, 14, 54],
                   'B': [13, 23, 18, 33, 48, 1, 7, 56, 66, 45, 32],
                   'C': [17, 27, 22, 37, 52, 77, 34, 21, 22, 90, 8],
                   'D': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'c']})
df
#> A B C D
#> 0 10 13 17 a
#> 1 20 23 27 a
#> 2 15 18 22 a
#> 3 30 33 37 a
#> 4 45 48 52 b
#> 5 43 1 77 b
#> 6 67 7 34 b
#> 7 22 56 21 c
#> 8 12 66 22 c
#> 9 14 45 90 c
#> 10 54 32 8 c
def groupby_shift(df, col, groupcol, shift_n, fill_na=np.nan):
    '''df: dataframe
    col: column to be shifted
    groupcol: grouping column
    shift_n: how far to shift (assumes a positive lag)
    fill_na: value used to fill the overlap rows, default np.nan
    '''
    # positional row numbers where each group ends
    rowno = list(df.groupby(groupcol).size().cumsum())
    lagged_col = df[col].shift(shift_n)
    # the first shift_n rows of the frame, plus the first shift_n rows of
    # every later group, must not carry over the previous group's values
    na_rows = [i for i in range(shift_n)]
    for i in rowno:
        if i == rowno[len(rowno) - 1]:
            continue  # past the last group there is nothing to blank
        else:
            new = [i + j for j in range(shift_n)]
            na_rows.extend(new)
    na_rows = list(set(na_rows))
    na_rows = [i for i in na_rows if i <= len(lagged_col) - 1]
    lagged_col.iloc[na_rows] = fill_na
    return lagged_col
df['A_lag_1'] = groupby_shift(df, 'A', 'D', 1)
df
#> A B C D A_lag_1
#> 0 10 13 17 a NaN
#> 1 20 23 27 a 10.0
#> 2 15 18 22 a 20.0
#> 3 30 33 37 a 15.0
#> 4 45 48 52 b NaN
#> 5 43 1 77 b 45.0
#> 6 67 7 34 b 43.0
#> 7 22 56 21 c NaN
#> 8 12 66 22 c 22.0
#> 9 14 45 90 c 12.0
#> 10 54 32 8 c 14.0
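To use this helper on the question's multi-index frame, one way (a sketch, again assuming the hypothetical mul_df stand-in above) is to move STK_ID out of the index so it can serve as the group column:

# reset_index gives a flat RangeIndex, which groupby_shift's .iloc relies on
flat = mul_df(5, 4, 3).reset_index()
flat['COL000_lag_1'] = groupby_shift(flat, 'COL000', 'STK_ID', 1)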

Related

Mutate new column from random value in existing columns

I'm looking to mutate my data and create a new column which randomly selects a value from the existing data. My data looks something like:
individual  age_2010  age_2011  age_2012  age_2013
a           20        21        NA        21
b           33        34        35        36
c           76        NA        78        79
d           46        46        48        49
And I want it to look like:
individual  age_2010  age_2011  age_2012  age_2013  Random Sample
a           20        21        22        NA        21
b           33        34        35        36        36
c           76        NA        78        79        78
d           46        46        48        49        48
Is there any way to add a new column which includes a random figure from any of the previous age columns, and preferably keeping the data in wide form?
I think this is an easier approach:
d[, RandomSample:=sample(na.omit(t(.SD)),1),individual]
If you want to handle the edge cases discussed above while following this approach, we could do this:
f <- function(df) {
  s = na.omit(t(df))
  ifelse(length(s) > 0, sample(s, 1), NA_real_)
}
d[, RandomSample:=f(.SD),individual]
Or, we could just wrap the original approach in tryCatch:
d[, RandomSample:=tryCatch(sample(na.omit(t(.SD)),1),error=\(e) NA),individual]
You can reshape longer, then do grouped sampling:
library(data.table)
# Sample data
d <- structure(list(individual = c("a", "b", "c", "d"), age_2010 = c(20, 33, 76, 46), age_2011 = c(21, 34, NA, 46), age_2012 = c(NA, 35, 78, 48), age_2013 = c(21, 36, 79, 49)), row.names = c(NA, -4L), spec = structure(list(cols = list(individual = structure(list(), class = c("collector_character", "collector")), age_2010 = structure(list(), class = c("collector_double", "collector")), age_2011 = structure(list(), class = c("collector_double", "collector")), age_2012 = structure(list(), class = c("collector_double", "collector")), age_2013 = structure(list(), class = c("collector_double", "collector"))), default = structure(list(), class = c("collector_guess", "collector")), skip = 2L), class = "col_spec"), class = c("data.table", "data.frame"))
d
#> individual age_2010 age_2011 age_2012 age_2013
#> 1: a 20 21 NA 21
#> 2: b 33 34 35 36
#> 3: c 76 NA 78 79
#> 4: d 46 46 48 49
# Solution
d[, "Random Sample"] <- d |>
melt("individual") |> # go long
(`[`)(!is.na(value), # drop NAs
.(x = sample(value, 1)), # sampling
keyby = .(individual)) |> # Grouping variable
(`[[`)(2) # extract vector from frame
d
#> individual age_2010 age_2011 age_2012 age_2013 Random Sample
#> 1: a 20 21 NA 21 21
#> 2: b 33 34 35 36 33
#> 3: c 76 NA 78 79 76
#> 4: d 46 46 48 49 49
Alternatively, you can also use apply(), which is less verbose but much slower:
d[, "Random Sample"] <- apply(d[, -1], 1, \(x) x |> na.omit() |> sample(1))
See the benchmark here for speed comparison. On just 40k observations, apply() needs 59 times longer and 8 times the memory.
# Make large sample data set
d_large <- d |>
list() |>
rep(1e4) |>
rbindlist()
bench::mark(
base = apply(d_large[, -1], 1, \(x) x |> na.omit() |> sample(1)),
dt = d_large |>
melt("individual") |>
(`[`)(!is.na(value),
.(x = sample(value, 1)),
keyby = .(individual)) |>
(`[[`)(2),
check = F
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 base 617.86ms 617.9ms 1.62 103.3MB 12.9
#> 2 dt 6.96ms 10.5ms 80.9 13.1MB 47.3
Created on 2022-07-27 by the reprex package (v2.0.1)
Edit:
Here are versions that work with the edge case where all years are NA. In the first case I went for a join with the original table, which is a bit more expensive than the other version.
# Solution with Data Table
d <- d |>
melt("individual") |> # go long
(`[`)(!is.na(value), # drop NAs
.(`Random Sample` = sample(value, 1)), # sampling
keyby = .(individual)) |> # Grouping variable
(`[`)(d) # right join with original frame
Here I simply used purrr::possibly() to return NA when sampling a zero length vector.
# Solution with apply
d[, "Random Sample"] <- apply(d[, -1], 1,
\(x) x |> na.omit() |> purrr::possibly(sample, NA)(1))

Storage problem in R: alternative to a nested loop for creating an array of matrices and then multiple plots

With the following pieces of information, I can easily create an array of matrices
b0=data.frame(b0_1=c(11.41,11.36),b0_2=c(8.767,6.950))
b1=data.frame(b1_1=c(0.8539,0.9565),b1_2=c(-0.03179,0.06752))
b2=data.frame(b2_1=c(-0.013020 ,-0.016540),b2_2=c(-0.0002822,-0.0026720))
T.val=data.frame(T1=c(1,1),T2=c(1,2),T3=c(2,1))
dt_data=cbind(b0,b1,b2,T.val)
fu.time=seq(0,50,by=0.8)
pat=ncol(T.val) #number of T's
nit=2 #no of rows
pt.array1=array(NA, dim=c(nit,length(fu.time),pat))
for (it.er in 1:nit) {
  for (ti in 1:length(fu.time)) {
    for (pt in 1:pat) {
      pt.array1[it.er, ti, pt] = b0[it.er, T.val[it.er, pt]] + b1[it.er, T.val[it.er, pt]] * fu.time[ti] + b2[it.er, T.val[it.er, pt]] * fu.time[ti]^2
    }
  }
}
pt.array_mean=apply(pt.array1, c(3,2), mean)
pt.array_LCL=apply(pt.array1, c(3,2), quantile, prob=0.25)
pt.array_UCL=apply(pt.array1, c(3,2), quantile, prob=0.975)
Now, with these additional data, I can create three plots as follows:
mydata
pt.ID time IPSS
1 1 0.000000 10
2 1 1.117808 8
3 1 4.504110 5
4 1 6.410959 14
5 1 13.808220 10
6 1 19.890410 4
7 1 28.865750 15
8 1 35.112330 7
9 2 0.000000 6
10 2 1.117808 7
11 2 4.109589 8
12 2 10.093151 7
13 2 16.273973 11
14 2 18.345205 18
15 2 21.567120 14
16 2 25.808220 12
17 2 56.087670 5
18 3 0.000000 8
19 3 1.413699 3
20 3 4.405479 3
21 3 10.389041 8
pdf("plots.pdf")
par(mfrow=c(3,2))
for (pt.no in 1:pat) {
  plot(IPSS[ID == pt.no] ~ time[ID == pt.no], xlim = c(0, 57), ylim = c(0, 35), type = "l", col = "black",
       xlab = "f/u time", ylab = "", main = paste("patient", pt.no), data = mydata)
  points(IPSS[ID == pt.no] ~ time[ID == pt.no], data = mydata)
  lines(pt.array_mean[pt.no, ] ~ fu.time, col = "blue")
  lines(pt.array_LCL[pt.no, ] ~ fu.time, col = "green")
  lines(pt.array_UCL[pt.no, ] ~ fu.time, col = "green")
}
dev.off()
The problem arises when the number of rows in each matrix is much bigger, say 10000. It takes too much computation time to create pt.array1 for a large number of rows in b0, b1 and b2.
Is there any alternative way I can do it quickly using a built-in function?
Can I avoid the storage allocation for pt.array1, as I am not using it further? I just need pt.array_mean, pt.array_UCL and pt.array_LCL for my plot.
Any help is appreciated.
There are a couple of other approaches you can employ.
First, you largely have a model of b0 + b1*fu + b2*fu^2. Therefore, you could build a matrix of the coefficients and apply fu.time after the fact:
ind <- expand.grid(nits = seq_len(nit), pats = seq_len(pat))
mat_ind <- cbind(ind[, 'nits'], T.val[as.matrix(ind)])
b_mat <- matrix(c(b0[mat_ind], b1[mat_ind], b2[mat_ind]), ncol = 3)
b_mat
[,1] [,2] [,3]
[1,] 11.410 0.85390 -0.0130200
[2,] 11.360 0.95650 -0.0165400
[3,] 11.410 0.85390 -0.0130200
[4,] 6.950 0.06752 -0.0026720
[5,] 8.767 -0.03179 -0.0002822
[6,] 11.360 0.95650 -0.0165400
Now if we apply the model to each row, we will get all of your raw results. The only problem is that we don't match your original output - each column slice of your array is the equivalent of a row slice of my matrix output.
pt_array <- apply(b_mat, 1, function(x) x[1] + x[2] * fu.time + x[3] * fu.time^2)
pt_array[1,]
[1] 11.410 11.360 11.410 6.950 8.767 11.360
pt.array1[, 1, ]
[,1] [,2] [,3]
[1,] 11.41 11.41 8.767
[2,] 11.36 6.95 11.360
That's OK because we can fix the shape as we compute the summary statistics - we just need to take the colMeans2 and colQuantiles of each row converted to a 2 x 3 matrix:
library(matrixStats)
pt_summary = array(
  t(apply(pt_array, 1, function(row) {
    M <- matrix(row, ncol = pat)
    c(colMeans2(M), colQuantiles(M, probs = c(0.25, 0.975)))
  })),
  dim = c(length(fu.time), pat, 3),
  dimnames = list(NULL, paste0('pat', seq_len(pat)), c('mean', 'LCL', 'UCL'))
)
pt_summary[1, ,] #slice at time = 1
mean LCL UCL
pat1 11.3850 11.37250 11.40875
pat2 9.1800 8.06500 11.29850
pat3 10.0635 9.41525 11.29518
# rm(pt.array1)
Then to do your final graphing, I simplified it - the data argument can be a subset(mydata, pt.ID == pt.no). Additionally, since the summary statistics are now in an array format, matlines allows everything to be done at once:
par(mfrow = c(3, 2))
for (pt.no in 1:pat) {
  plot(IPSS ~ time, data = subset(mydata, pt.ID == pt.no),
       xlim = c(0, 57), ylim = c(0, 35),
       type = "l", col = "black", xlab = "f/u time", ylab = "",
       main = paste("patient", pt.no))
  points(IPSS ~ time, data = subset(mydata, pt.ID == pt.no))
  matlines(y = pt_summary[, pt.no, ], x = fu.time, col = c("blue", "green", "green"))
}

How to define contrast coefficient matrix?

I have this data
y x1 x2 pre
1 16 1 1 14
2 15 1 1 13
3 14 1 2 14
4 13 1 2 13
5 12 2 1 12
6 11 2 1 12
7 11 2 2 13
8 13 2 2 13
9 10 3 1 10
10 11 3 1 11
11 11 3 2 11
12 9 3 2 10
And I fitted the following model
lm(y ~ x1 + x2 + x1*x2)
My design matrix is
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 1 14 1 0 1 1 0
[2,] 1 13 1 0 1 1 0
[3,] 1 14 1 0 0 0 0
[4,] 1 13 1 0 0 0 0
[5,] 1 12 0 1 1 0 1
[6,] 1 12 0 1 1 0 1
[7,] 1 13 0 1 0 0 0
[8,] 1 13 0 1 0 0 0
[9,] 1 10 0 0 1 0 0
[10,] 1 11 0 0 1 0 0
[11,] 1 11 0 0 0 0 0
[12,] 1 10 0 0 0 0 0
I'm trying to use this design to reproduce the following table:
Source DF Squares Mean Square F Value Pr > F
Model 6 44.79166667 7.46527778 12.98 0.0064
Error 5 2.87500000 0.57500000
Corrected Total 11 47.66666667
Source DF Type III SS Mean Square F Value Pr > F
pre 1 3.12500000 3.12500000 5.43 0.0671
x1 2 4.58064516 2.29032258 3.98 0.0923
x2 1 3.01785714 3.01785714 5.25 0.0706
x1*x2 2 1.25000000 0.62500000 1.09 0.4055
The first part is fine
XtX <- t(x) %*% x
XtXinv <- solve(XtX)
betahat <- XtXinv %*% t(x) %*% y
H <- x %*% XtXinv %*% t(x)
IH <- (diag(1,12) - H)
yhat <- H %*% y
e <- IH %*% y
ybar <- mean(y)
MSS <- t(betahat) %*% t(x) %*% y - length(y)*(ybar^2)
ESS <- t(e) %*% e
TSS <- MSS + ESS
dfM <- sum(diag(H)) - 1
dfE <- sum(diag(IH))
dfT <- dfM + dfE
MSM <- MSS/dfM
MSE <- ESS/dfE
Ftest <- MSM / MSE
pr <- 1 - pf(Ftest, dfM, dfE)
The contrast coefficient matrix for 'pre' seems correct.
L <- matrix(c(0,1,0,0,0,0,0), 1, 7, byrow=T)
Lb <- L %*% betahat
LXtXinvLt <- round(L %*% XtXinv %*% t(L), digits=4)
SSpre <- t(Lb) %*% solve(LXtXinvLt) %*% (Lb)
MSpre <- SSpre / 1
Fpre <- MSpre / MSE
PRpre <- 1 - pf(Fpre, 1, 12-7)
But I can't understand how to define the contrast coefficient matrix for x1, x2, and x1*x2. What's the problem with the rest of my code? Below is an example of how I think I should calculate it for x1:
L <- matrix(c(0,0,1,1,0,0,0), 1, 7, byrow=T)
Lb <- L %*% betahat
LXtXinvLt <- round(L %*% XtXinv %*% t(L), digits=4)
SSX1 <- t(Lb) %*% solve(LXtXinvLt) %*% (Lb)
MSX1 <- SSX1 / 1
FX1 <- MSX1 / MSE
PRX1 <- 1 - pf(FX1, 1, 12-7)
Thanks!
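A hedged note on the open part of this question: the pre test above works because pre has a single degree of freedom, so a 1 x 7 L suffices. An effect with q degrees of freedom (q = 2 for x1 and for x1*x2) needs a q x 7 L, one row per tested parameter, and the general linear hypothesis statistic already used for SSpre generalizes to

SS_L = (L\hat{\beta})^\top \left[ L (X^\top X)^{-1} L^\top \right]^{-1} (L\hat{\beta}),
\qquad F = \frac{SS_L / q}{MSE},

which under H_0 : L\beta = 0 follows an F_{q,\,n-p} distribution. For example, stacking the two selector rows for the x1 dummy columns (assuming they are columns 3 and 4 of the design),

L_{x1} = \begin{pmatrix} 0 & 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 \end{pmatrix}

tests H_0 : \beta_3 = \beta_4 = 0. Be aware that reproducing SAS's Type III sums of squares exactly is subtler: with an interaction in the model, SAS's Type III estimable functions for a main effect also place fractional coefficients on the interaction columns, so a pure selector L will generally not match the table above.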

Update DataTable from another table with LINQ

I have 2 DataTables that look like this:
DataTable 1:
cheie_primara cheie_secundara judet localitate
1 11 A
2 22 B
3 33 C
4 44 D
5 55 A
6 66 B
7 77 C
8 88 D
9 99 A
DataTable 2:
ID_CP BAN JUDET LOCALITATE ADRESA
1 11 A aa random
2 22 B ss random
3 33 C ee random
4 44 D xx random
5 55 A rr random
6 66 B aa random
7 77 C ss random
8 88 D ee random
9 99 A xx random
and I want to update DataTable 1's ["LOCALITATE"] field using the matching keys DataTable1["cheie_primara"] and DataTable2["ID_CP"].
Like this:
cheie_primara cheie_secundara judet localitate
1 11 A aa
2 22 B ss
3 33 C ee
4 44 D xx
5 55 A rr
6 66 B aa
7 77 C ss
8 88 D ee
9 99 A xx
Is there a LINQ method to update DataTable1?
Thanks!
This is working:
DataTable1.AsEnumerable()
    .Join(DataTable2.AsEnumerable(),
          dt1_Row => dt1_Row.ItemArray[0],
          dt2_Row => dt2_Row.ItemArray[0],
          (dt1_Row, dt2_Row) => new { dt1_Row, dt2_Row })
    .ToList()
    .ForEach(o => o.dt1_Row.SetField(3, o.dt2_Row.ItemArray[3]));
If you want to use LINQ, here's how I'd go about it:
// The Field<T> types below are assumptions; adjust them to the actual column types.
var a = (from d1 in DataTable1.AsEnumerable()
         join d2 in DataTable2.AsEnumerable()
             on d1.Field<int>("cheie_primara") equals d2.Field<int>("ID_CP")
         select new { d1, LOCALITATE = d2.Field<string>("LOCALITATE") }).ToList();
a.ForEach(b => b.d1.SetField("localitate", b.LOCALITATE));

Change a column_vector to a matrix in MATLAB

I have a column vector that needs to be changed into a matrix. The size of matrix is specified and can change. Please suggest a vectorized solution.
rows = 3 ; cols = 4 ; %matrix elements for this case = 12
colvector = [ 2;4;5;8;10;14;16;18;20;21;28;30] ;
desired_mat = [ ...
2 4 5 8
10 14 16 18
20 21 28 30 ] ;
Thanks!
The reshape function does that, but note that it fills column-wise. To get the row-wise layout you asked for, reshape to the transposed size and then transpose:
>> colvector = [2;4;5;8;10;14;16;18;20;21;28;30];
>> A = reshape(colvector, cols, rows).'
A =
     2     4     5     8
    10    14    16    18
    20    21    28    30
A plain reshape(colvector, rows, cols) would instead put 2, 4, 5 down the first column, which does not match your desired matrix.
