What is model.cov_params() in statsmodels?

I am unable to understand what the [cov_params][1] of a fitted statsmodels model represents. I thought it would be the covariance matrix of the data, but that does not seem to be the case. It is not even scale * covariance_matrix_of_the_data.
I have used the following code snippet to try to understand:
A random dataset preparation
import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame({'num1': np.random.randn(30,),
                   'num2': np.random.randn(30,),
                   'labels': np.random.choice([1, 0], 30)})
Computing covariance matrix on it:
df.drop('labels',axis=1).cov()
output:
          num1      num2
num1  0.810012  0.082823
num2  0.082823  0.866951
computing the cov_params() from the model:
Fitting the model:
import statsmodels.api as sm
mod = sm.formula.glm(formula="labels ~ num1 + num2",
                     data=df,
                     family=sm.families.Binomial()).fit()
scale is 1.0:
mod.scale
output
1.0
Getting cov_params:
mod.cov_params()
output:
           Intercept      num1      num2
Intercept   0.162491  0.006924  0.006894
num1        0.006924  0.234236  0.004327
num2        0.006894  0.004327  0.198648
As you can see, the covariance values between num1 and num2 are not the same in the two covariance matrices. They are not even scaled versions of each other by the mod.scale parameter, since mod.scale is 1.0.
Can you help me understand what mod.cov_params() represents?
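For reference, a comparison that could be run here (a sketch only, assuming cov_params() is related to the inverse Fisher information X' diag(mu*(1-mu)) X of the Binomial GLM, with mu the fitted probabilities, rather than to the data covariance):
X = sm.add_constant(df[['num1', 'num2']]).values   # Intercept, num1, num2 (same column order as the fit)
mu = np.asarray(mod.fittedvalues)                  # fitted probabilities
W = np.diag(mu * (1 - mu))                         # IRLS weights for the Binomial family
print(np.linalg.inv(X.T @ W @ X))                  # compare against mod.cov_params()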
[1]: https://www.statsmodels.org/0.8.0/generated/statsmodels.genmod.generalized_linear_model.GLMResults.cov_params.html

Related

Runge-Kutta curve fitting extremely slow

I am currently trying to do a regression of a function calculated via an RK4 method applied to a non-linear Volterra integral equation of the second kind. The problem I found is that the code is extremely slow: one call of the curve_fit model function (fitt) takes about 30-40 minutes to generate data, and overall there will be a lot of calls to fitt before the parameters are determined, so it takes more than 6 hours to run. Is there any way to optimize this code? Thanks in advance!
from scipy.special import gamma
from ml_internal import LTInversion
from scipy.optimize import curve_fit , fsolve
from scipy.misc import derivative
from sklearn.metrics import r2_score
from math import comb , factorial
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
# Gets the data
df = pd.read_excel('D:\\CoMat\\Fractional_fit\\optimized\\data_optimized.xlsx')
skipTime = 1
skipIndex = df[df['Time']== skipTime].index.values[0]
xls = pd.read_excel('D:\\CoMat\\Fractional_fit\\optimized\\data_optimized.xlsx',skiprows=np.arange(1,skipIndex+1,1))
timeDF = xls['Time']
tempDF = xls['Temp']
taDF = xls['Ta']
timeDF = timeDF - timeDF[0]
tempDF = tempDF + 273.15
t0 = tempDF[0]
ta = sum(taDF)/len(taDF)
ta = ta + 273.15
###########################################
#Spliting into intervals
h = 0.05
a = 0
b = timeDF[len(timeDF)-1]
N = int(np.round((b-a)/h))
#Each xi
def xidx(index):
    return a + h*index
# Functions from the image are written here.
def gx(t,lamda,alpha):
    return t0 * ml(lamda*(t**alpha),alpha)
gx = np.vectorize(gx)

def kernel(t,s,rad,lamda,alpha,beta):
    if t == s:
        return 0
    return (t-s)**(alpha-1) * ml_(lamda*((t-s)**alpha),alpha,alpha,1) * (beta*(rad**4) - beta*(ta**4) - lamda*ta)
kernel = np.vectorize(kernel)
############################
# The problem is here!!!!!!
def fx(x,n,lamda,alpha,beta):
    ans = gx(x,lamda,alpha)
    for j in range(n):
        ans += (h/6)*(kernel(x,xidx(j),f0[j],lamda,alpha,beta) + 2*kernel(x,xidx(j+1/2),f1[j],lamda,alpha,beta) + 2*kernel(x,xidx(j+1/2),f2[j],lamda,alpha,beta) + kernel(x,xidx(j+1),f3[j],lamda,alpha,beta))
    return ans
#########################
f0 = np.zeros(N+1)
f0[0] = t0
f1 = np.zeros(N+1)
f2 = np.zeros(N+1)
f3 = np.zeros(N+1)
F = np.zeros((3,N+1))
def fitt(xvalue,lamda,alpha,beta):
    global f0,f1,f2,f3,F
    n = int(np.round(xvalue/h))
    f1[n] = fx(xidx(n) + 1/2,n,lamda,alpha,beta) + (h/2)*kernel(xidx(n + 1/2),xidx(n),f0[n],lamda,alpha,beta)
    f2[n] = fx(xidx(n + 1/2),n,lamda,alpha,beta)
    f3[n] = fx(xidx(n+1),n,lamda,alpha,beta) + h*kernel(xidx(n+1),xidx(n+1/2),f2[n],lamda,alpha,beta)
    if n+1 <= N:
        f0[n+1] = fx(xidx(n+1),n,lamda,alpha,beta) + (h/6)*(kernel(xidx(n+1),xidx(n),f0[n],lamda,alpha,beta) + 2*kernel(xidx(n+1),xidx(n+1/2),f1[n],lamda,alpha,beta) + 2*kernel(xidx(n+1),xidx(n+1/2),f2[n],lamda,alpha,beta) + kernel(xidx(n+1),xidx(n+1),f3[n],lamda,alpha,beta))
    if xvalue == timeDF[len(timeDF) - 1]:
        print(f0[n],n)
        returnValue = f0[n]
        f0 = np.zeros(N+1)
        f0[0] = t0
        f1 = np.zeros(N+1)
        f2 = np.zeros(N+1)
        f3 = np.zeros(N+1)
        return returnValue
    print(f0[n],n)
    return f0[n]
fitt = np.vectorize(fitt)
#Fitting, plotting and giving (Adj) R-squared
popt , pcov = curve_fit(fitt,timeDF,tempDF,p0=(-0.1317,0.95,-1e-11),bounds=((-np.inf,0,-np.inf),(0,1,0)))
print(popt)
y_fit = np.array(fitt(timeDF,popt[0],popt[1],popt[2]))
plt.scatter(timeDF,tempDF,color='ORANGE',marker='.',s=0.5)
plt.fill_between(timeDF,tempDF-0.5,tempDF+0.5,color='ORANGE', alpha=0.2)
plt.plot(timeDF,y_fit,color='RED',linewidth=1)
plt.legend(["Experimental data", "Caputo fit"], loc ="upper right")
plt.xlabel("Time (min)")
plt.ylabel("Temperature (Kelvin)")
plt.show()
plt.close()
r2 = r2_score(tempDF,y_fit)
print(r2)
adjr2 = 1 - (1 - r2)*((len(xls)-1)/(len(xls)-3-1))
print(adjr2)
I already tried computing the values f0, f1, f2, f3 all at once, but the thing consuming the most time is Fn(x), which I haven't figured out how to compute all at once. If that were possible, I think the program would run much faster. PS: ml and ml_ are functions from https://github.com/khinsen/mittag-leffler.
This is the necessary function; Fn is the only one I haven't figured out yet.
There are two typing errors in the cited image. The combination of x_n and 1/2 is always meant to be the midpoint x_{n+1/2} = x_n + h/2. The second error is a duplication of x_{n+1/2} in the third term of the formula for f^{(4)}_n. The first error probably produces deviations large enough to make convergence difficult and any limit wrong for the intended problem.
In the Simpson/RK4 step, the 4 fx computations can be reduced to 2.
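As a rough sketch of that reduction (reusing fx, kernel and xidx from the question, and assuming the midpoint typo noted above is fixed so that xidx(n) + 1/2 becomes xidx(n + 1/2)), the midpoint evaluation can be shared by f1[n] and f2[n], and the evaluation at x_{n+1} by f3[n] and f0[n+1]:
def fitt(xvalue, lamda, alpha, beta):
    global f0, f1, f2, f3
    n = int(np.round(xvalue/h))
    fx_mid = fx(xidx(n + 1/2), n, lamda, alpha, beta)   # shared by f1[n] and f2[n]
    fx_next = fx(xidx(n + 1), n, lamda, alpha, beta)    # shared by f3[n] and f0[n+1]
    f1[n] = fx_mid + (h/2)*kernel(xidx(n + 1/2), xidx(n), f0[n], lamda, alpha, beta)
    f2[n] = fx_mid
    f3[n] = fx_next + h*kernel(xidx(n+1), xidx(n+1/2), f2[n], lamda, alpha, beta)
    if n+1 <= N:
        f0[n+1] = fx_next + (h/6)*(kernel(xidx(n+1), xidx(n), f0[n], lamda, alpha, beta) + 2*kernel(xidx(n+1), xidx(n+1/2), f1[n], lamda, alpha, beta) + 2*kernel(xidx(n+1), xidx(n+1/2), f2[n], lamda, alpha, beta) + kernel(xidx(n+1), xidx(n+1), f3[n], lamda, alpha, beta))
    # the end-of-series reset and the returned value stay the same as in the original fitt()
    return f0[n]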
The F_n implement the left-hand side of the integral equation
F(x) = g(x) + int_{s=0}^{x} K(x, s, f(s)) ds
where the integral is approximated with the sample sequences f0, ..., f3. Due to the structure of the problem and the algorithm, F_n(x_n) = f^{(0)}_n = f^{(4)}_{n-1}.
Note that K(x,s,f) should be set to zero for s >= x. In the exact version of the equation these values "above the diagonal" are not used.
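Concretely, that guard could go into the question's kernel itself (a sketch; only the condition changes, from t == s to s >= t):
def kernel(t, s, rad, lamda, alpha, beta):
    # contributions "above the diagonal" are never used by the exact equation
    if s >= t:
        return 0
    return (t-s)**(alpha-1) * ml_(lamda*((t-s)**alpha), alpha, alpha, 1) * (beta*(rad**4) - beta*(ta**4) - lamda*ta)
kernel = np.vectorize(kernel)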
If an increase in accuracy is needed, for instance to avoid divergence where there is none in the exact solution, you can decrease the step size by a factor of 10 and then sub-sample the f^{(0)}_n sequence to produce the numerical guess for the given data. Other factors than 10 are of course also possible.

Regression with constraints on contribution from variables

I'm trying to develop a regression model with constraints on the contribution from the independent variables. My model equation is y = a0 + a1*x1 + a2*x2 with 200 data points. What I want to achieve is that sum(a1*x1) over the 200 data points should fall in a certain range, i.e. lb1 < sum(a1*x1) < ub1. I am using Gekko for the optimization part and got stuck while applying this condition.
I am using the following code where ubdict is the dictionary for the boundaries:
m = gk.GEKKO(remote=False)
m.options.IMODE = 2  # Regression mode
y = np.array(df['y'])  # dependent var for optimization
x = np.array(df[X])    # array of independent vars for optimization
n = x.shape[1]         # number of variables
c = m.Array(m.FV, n+1) # array of parameters and intercept
for ci in c:
    ci.STATUS = 1      # calculate fixed parameter
xp = [None]*n
# load data
xd = m.Array(m.Param, n)
yd = m.Param(value=y)
for i in range(n):
    xd[i].value = x[:, i]
    xp[i] = m.Var()
    if ubdict[i] >= 0:
        xp[i] = m.Var(lb=0, ub=ubdict[i])
    elif ubdict[i] < 0:
        xp[i] = m.Var(lb=ubdict[i], ub=0)
    m.Equation(xp[i] == c[i]*xd[i])
yp = m.Var()
m.Equation(yp == m.sum([xp[i] for i in range(n)] + [c[n]]))
# Minimize difference between actual and predicted y
m.Minimize((yd - yp)**2)
# APOPT solver
m.options.SOLVER = 1
# Solve
m.solve(disp=True)
# Retrieve parameter values
a = [i.value[0] for i in c]
print(a)
But this is applying the constraint row-wise. What I want is something like
xp[i] = m.Var(lb=0, ub=ubdict[i])
m.Equation(xp[i]==sum(c[i]*xd[i]) over observations)
Any suggestion would be of great help!
Below is a similar problem with sample data.
Regression Mode with IMODE=2
Use the m.vsum() object in Gekko with IMODE=2. Gekko lets you write the equations once and then applies the data to each equation. This is more efficient for large-scale data sets.
import numpy as np
from gekko import GEKKO
# load data
x1 = np.array([1,2,5,3,2,5,2])
x2 = np.array([5,6,7,2,1,3,2])
ym = np.array([3,2,3,5,6,7,8])
# model
m = GEKKO()
c = m.Array(m.FV,3)
for ci in c:
    ci.STATUS = 1
x1 = m.Param(value=x1)
x2 = m.Param(value=x2)
ymeas = m.Param(value=ym)
ypred = m.Var()
m.Equation(ypred == c[0] + c[1]*x1 + c[2]*x2)
# add constraint on sum(c[1]*x1) with vsum
v1 = m.Var(); m.Equation(v1==c[1]*x1)
con = m.Var(lb=0,ub=10); m.Equation(con==m.vsum(v1))
m.Minimize((ypred-ymeas)**2)
m.options.IMODE = 2
m.solve()
print('Final SSE Objective: ' + str(m.options.objfcnval))
print('Solution')
for i, ci in enumerate(c):
    print(i, ci.value[0])
# plot solution
import matplotlib.pyplot as plt
plt.figure(figsize=(8,4))
plt.plot(ymeas,ypred,'ro')
plt.plot([0,10],[0,10],'k-')
plt.xlabel('Meas')
plt.ylabel('Pred')
plt.savefig('results.png',dpi=300)
plt.show()
Optimization Mode (IMODE=3)
The optimization mode 3 allows you to write each equation and objective term individually. Both give the same solution.
import numpy as np
from gekko import GEKKO
# load data
x1 = np.array([1,2,5,3,2,5,2])
x2 = np.array([5,6,7,2,1,3,2])
ym = np.array([3,2,3,5,6,7,8])
n = len(ym)
# model
m = GEKKO()
c = m.Array(m.FV,3)
for ci in c:
    ci.STATUS = 1
yp = m.Array(m.Var,n)
for i in range(n):
    m.Equation(yp[i] == c[0] + c[1]*x1[i] + c[2]*x2[i])
    m.Minimize((yp[i] - ym[i])**2)
# add constraint on sum(c[1]*x1)
s = m.Var(lb=0,ub=10); m.Equation(s==c[1]*sum(x1))
m.options.IMODE = 3
m.solve()
print('Final SSE Objective: ' + str(m.options.objfcnval))
print('Solution')
for i, ci in enumerate(c):
    print(i, ci.value[0])
# plot solution
import matplotlib.pyplot as plt
plt.figure(figsize=(8,4))
ypv = [yp[i].value[0] for i in range(n)]
plt.plot(ym,ypv,'ro')
plt.plot([0,10],[0,10],'k-')
plt.xlabel('Meas')
plt.ylabel('Pred')
plt.savefig('results.png',dpi=300)
plt.show()
For future questions, please create a simple and complete example that demonstrates the issue.

What is the formula being used in the in-sample prediction of statsmodels?

I would like to know what formula is being used in the statsmodels ARIMA predict/forecast. For a simple AR(1) model I thought it would be y_t = a1 * y_{t-1}. However, I am not able to recreate the results produced by forecast or predict.
Here's what I am trying to do:
from statsmodels.tsa.arima.model import ARIMA
import numpy as np

def ar_series(n):
    # generate the series y_t = a1 * y_{t-1} + eps
    np.random.seed(1)
    y0 = np.random.rand()
    y = [y0]
    a1 = 0.7  # the AR coefficient
    for i in range(1, n):
        y.append(a1 * y[i - 1] + 0.3 * np.random.rand())
    return np.array(y)

series = ar_series(10)
model = ARIMA(series, order=(1, 0, 0))
fit = model.fit()
# print(fit.summary())
# const = 0.3441; ar.L1 = 0.6518
print(fit.predict())
y_pred = [0.3441]
for i in range(1, 10):
    y_pred.append(0.6518 * series[i-1])
y_pred = np.array(y_pred)
print(y_pred)
The two series don't match, and I can't work out how the in-sample predictions are being calculated.
Found the answer here: https://faculty.washington.edu/ezivot/econ584/notes/forecast.pdf. I think what I was trying to do is valid only if the process mean is zero.
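For completeness, a small sketch of the mean-adjusted one-step prediction the linked notes describe, assuming the fitted const (0.3441 here) is the process mean mu and ar.L1 (0.6518) is phi, so that y_hat_t = mu + phi*(y_{t-1} - mu):
mu, phi = fit.params[0], fit.params[1]   # const and ar.L1 from the fitted model
y_pred = [mu]                            # with no earlier observation, the prediction is the mean
for i in range(1, 10):
    y_pred.append(mu + phi * (series[i - 1] - mu))
print(np.array(y_pred))
print(fit.predict())                     # should agree up to rounding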

Numpy version of rolling MAD (mean absolute deviation)

How to make a rolling version of the following MAD function
from numpy import mean, absolute

def mad(data, axis=None):
    return mean(absolute(data - mean(data, axis)), axis)
This code is an answer to this question
At the moment I convert the numpy array to pandas, apply this function, and then convert the result back to numpy
pandasDataFrame.rolling(window=90).apply(mad)
but this is inefficient on larger dataframes. How can I get a rolling window for the same function in numpy, without looping, that gives the same result?
Here's a vectorized NumPy approach -
import numpy as np

# From this post : http://stackoverflow.com/a/40085052/3293881
def strided_app(a, L, S):  # Window len = L, Stride len/stepsize = S
    nrows = ((a.size - L)//S) + 1
    n = a.strides[0]
    return np.lib.stride_tricks.as_strided(a, shape=(nrows, L), strides=(S*n, n))

# From this post : http://stackoverflow.com/a/14314054/3293881 by @Jaime
def moving_average(a, n=3):
    ret = np.cumsum(a, dtype=float)
    ret[n:] = ret[n:] - ret[:-n]
    return ret[n - 1:] / n

def mad_numpy(a, W):
    a2D = strided_app(a, W, 1)
    return np.absolute(a2D - moving_average(a, W)[:, None]).mean(1)
Runtime test -
In [617]: data = np.random.randint(0,9,(10000))
...: df = pd.DataFrame(data)
...:
In [618]: pandas_out = pd.rolling_apply(df,90,mad).values.ravel()
In [619]: numpy_out = mad_numpy(data,90)
In [620]: np.allclose(pandas_out[89:], numpy_out) # Nans part clipped
Out[620]: True
In [621]: %timeit pd.rolling_apply(df,90,mad)
10 loops, best of 3: 111 ms per loop
In [622]: %timeit mad_numpy(data,90)
100 loops, best of 3: 3.4 ms per loop
In [623]: 111/3.4
Out[623]: 32.64705882352941
Huge 32x+ speedup there over the loopy pandas solution!
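(Side note: pd.rolling_apply has since been removed from pandas; on a current pandas version the same check would look roughly like this, assuming a release that supports the raw=True option:)
pandas_out = df[0].rolling(90).apply(mad, raw=True).values
numpy_out = mad_numpy(data, 90)
print(np.allclose(pandas_out[89:], numpy_out))  # NaN head clipped, as above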

pymc3 improving theano compile time before sampling

I'm working with this hierarchical Bayesian model:
import pymc3 as pm
import pandas as pd
import theano.tensor as T
categories = pd.Categorical(df.cat)
n_categories = len(set(categories.codes))
cat_idx = categories.codes
with pm.Model():
    mu_a = pm.Normal('mu_a', 0, sd=100**2)
    sig_a = pm.Uniform('sig_a', lower=0, upper=100)
    alpha = pm.Normal('alpha', mu=mu_a, sd=sig_a, shape=n_categories)
    betas = []
    for f in FEATURE_LIST:
        mu_b = pm.Normal('mu_b_%s' % f, 0, sd=100**2)
        sig_b = pm.Uniform('sig_b_%s' % f, lower=0, upper=100)
        betas.append(pm.Normal('beta_%s' % f, mu=mu_b, sd=sig_b, shape=n_categories))
    logit = 1.0 / (1.0 + T.exp(-(
        sum([betas[i][cat_idx] * X_train[f].values for i, f in enumerate(FEATURE_LIST)])
        + alpha[cat_idx]
    )))
    y_est = pm.Bernoulli('y_est', logit, observed=df.y)
    start = pm.find_MAP()
    trace = pm.sample(2000, pm.NUTS(), start=start, random_seed=42, njobs=40)
I would imagine that replacing my Python list of priors and the individual additions and multiplications with proper Theano code (perhaps using T.dot?) would improve the performance of the call to sample. How do I set this up in Theano correctly? I imagine that I need to do something like shape=(n_features, n_categories) for my priors, but I'm not sure how to do the category index in the dot product.
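What I have in mind is roughly the following (an untested sketch; the stacked hyperprior shapes and the cat_idx indexing in the linear predictor are exactly the parts I'm unsure about):
import pymc3 as pm
import theano.tensor as T

n_features = len(FEATURE_LIST)
X = X_train[FEATURE_LIST].values   # shape (n_obs, n_features)

with pm.Model():
    mu_a = pm.Normal('mu_a', 0, sd=100**2)
    sig_a = pm.Uniform('sig_a', lower=0, upper=100)
    alpha = pm.Normal('alpha', mu=mu_a, sd=sig_a, shape=n_categories)
    # one hyperprior per feature, broadcast down the category axis
    mu_b = pm.Normal('mu_b', 0, sd=100**2, shape=(n_features, 1))
    sig_b = pm.Uniform('sig_b', lower=0, upper=100, shape=(n_features, 1))
    betas = pm.Normal('beta', mu=mu_b, sd=sig_b, shape=(n_features, n_categories))
    # betas[:, cat_idx] has shape (n_features, n_obs); multiply by X.T elementwise
    # and sum over the feature axis to get one linear term per observation
    linear = (betas[:, cat_idx] * X.T).sum(axis=0) + alpha[cat_idx]
    logit = 1.0 / (1.0 + T.exp(-linear))
    y_est = pm.Bernoulli('y_est', logit, observed=df.y)
    start = pm.find_MAP()
    trace = pm.sample(2000, pm.NUTS(), start=start, random_seed=42, njobs=40)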
