GARCH model in pymc3: how to loop over random variables?

I'm attempting to implement a GARCH model in pymc3, along the lines of this example. For this I wrote a GARCH(1, 1) distribution as follows:
import pymc3 as pm
import theano.tensor as tt
from pymc3.distributions import Continuous, Normal

class GARCH(Continuous):
    def __init__(self, alpha_0=None, alpha_1=None, beta_1=None,
                 sigma_0=None, *args, **kwargs):
        super(GARCH, self).__init__(*args, **kwargs)
        self.alpha_0 = alpha_0
        self.alpha_1 = alpha_1
        self.beta_1 = beta_1
        self.sigma_0 = sigma_0
        self.mean = 0

    def logp(self, values):
        sigma = self.sigma_0
        alpha_0 = self.alpha_0
        alpha_1 = self.alpha_1
        beta_1 = self.beta_1
        x_prev = values[0]
        _logp = Normal.dist(0., sd=sigma).logp(x_prev)
        for x in values[1:]:
            # update the volatility, then score the current return
            sigma = tt.sqrt(alpha_0 + alpha_1 * (x_prev / sigma)**2
                            + beta_1 * sigma**2)
            _logp = _logp + Normal.dist(0., sd=sigma).logp(x)
            x_prev = x
        return _logp
To clarify, this is the log-likelihood of the GARCH(1, 1) model. The volatility process is a time series where the volatility at time t depends on the residual at time t-1; but to determine the residual at time t-1, we require the volatility at time t-1.
Anyway, that's not really important for my question. What matters is that the likelihood cannot be computed by vectorizing the for loop (which is how it is done in the link at the top of the post). So you need an explicit loop which at each step first updates the volatility and then determines the likelihood of the observed return.
But the code above doesn't work. If I try to build a model like
import numpy as np

returns = np.genfromtxt("SP500.csv")[-200:]

garchmodel = pm.Model()
with garchmodel:
    alpha_0 = pm.Exponential('alpha_0', 30., testval=.02)
    alpha_1 = pm.Uniform('alpha_1', lower=0, upper=1, testval=.9)
    upper = pm.Deterministic('upper', 1 - alpha_1)
    beta_1 = pm.Uniform('beta_1', lower=0, upper=upper, testval=.05)
    sigma_0 = pm.Exponential('sigma_0', 30., testval=.02)
    garch = GARCH('garch', alpha_0=alpha_0, alpha_1=alpha_1,
                  beta_1=beta_1, sigma_0=sigma_0, observed=returns)
The "SP500.csv" file can be found on e.g. github
This code generate the error:
ValueError: length not known
I'm pretty certain this is because the Python for loop conflicts with Theano's symbolic tensors. How do I deal with this?
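One possible workaround, sketched here as an untested assumption rather than a verified fix, is to express the recursion with theano.scan, so the loop lives inside the computation graph instead of in Python (newer pymc3 releases also ship a built-in GARCH11 distribution in pymc3.distributions.timeseries, which may avoid the custom class entirely). A minimal sketch of what logp could look like:

import theano
import theano.tensor as tt
from pymc3.distributions import Normal

# Sketch only: replace the Python loop in GARCH.logp with theano.scan.
def logp(self, values):
    def step(x_prev, x, sigma_prev):
        # volatility update, then the Normal log-density of the current value
        sigma = tt.sqrt(self.alpha_0
                        + self.alpha_1 * (x_prev / sigma_prev)**2
                        + self.beta_1 * sigma_prev**2)
        return sigma, Normal.dist(0., sd=sigma).logp(x)

    (_, logps), _ = theano.scan(
        fn=step,
        # taps=[-1, 0] hands each step (values[t-1], values[t]), starting at t=1
        sequences=[dict(input=values, taps=[-1, 0])],
        outputs_info=[tt.as_tensor_variable(self.sigma_0), None])

    # add the t=0 term that the loop version scored before iterating
    return Normal.dist(0., sd=self.sigma_0).logp(values[0]) + tt.sum(logps)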

Related

Runge-Kutta curve fitting extremely slow

I am currently trying to do a regression of a function calculated via an RK4 method applied to a non-linear Volterra integral equation of the second kind. The problem I found is that the code is extremely slow: for one call of the curve_fit function (fitt), it takes about 30-40 minutes to generate a result. Overall, there will be a lot of calls to fitt before the parameters are determined, so the whole run takes more than 6 hours. Is there any way to optimize this code? Thanks in advance!
from scipy.special import gamma
from ml_internal import LTInversion
from scipy.optimize import curve_fit, fsolve
from scipy.misc import derivative
from sklearn.metrics import r2_score
from math import comb, factorial
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()

# Get the data
df = pd.read_excel('D:\\CoMat\\Fractional_fit\\optimized\\data_optimized.xlsx')
skipTime = 1
skipIndex = df[df['Time'] == skipTime].index.values[0]
xls = pd.read_excel('D:\\CoMat\\Fractional_fit\\optimized\\data_optimized.xlsx',
                    skiprows=np.arange(1, skipIndex + 1, 1))
timeDF = xls['Time']
tempDF = xls['Temp']
taDF = xls['Ta']
timeDF = timeDF - timeDF[0]
tempDF = tempDF + 273.15
t0 = tempDF[0]
ta = sum(taDF) / len(taDF)
ta = ta + 273.15
###########################################
# Splitting into intervals
h = 0.05
a = 0
b = timeDF[len(timeDF) - 1]
N = int(np.round((b - a) / h))

# Each xi
def xidx(index):
    return a + h * index

# The functions in the image are written here.
# (ml and ml_ are the Mittag-Leffler routines from the repository linked in the PS below.)
def gx(t, lamda, alpha):
    return t0 * ml(lamda * (t**alpha), alpha)
gx = np.vectorize(gx)

def kernel(t, s, rad, lamda, alpha, beta):
    if t == s:
        return 0
    return ((t - s)**(alpha - 1) * ml_(lamda * ((t - s)**alpha), alpha, alpha, 1)
            * (beta * (rad**4) - beta * (ta**4) - lamda * ta))
kernel = np.vectorize(kernel)
############################
# The problem is here!!!!!!
def fx(x, n, lamda, alpha, beta):
    ans = gx(x, lamda, alpha)
    for j in range(n):
        ans += (h / 6) * (kernel(x, xidx(j), f0[j], lamda, alpha, beta)
                          + 2 * kernel(x, xidx(j + 1/2), f1[j], lamda, alpha, beta)
                          + 2 * kernel(x, xidx(j + 1/2), f2[j], lamda, alpha, beta)
                          + kernel(x, xidx(j + 1), f3[j], lamda, alpha, beta))
    return ans
#########################
f0 = np.zeros(N + 1)
f0[0] = t0
f1 = np.zeros(N + 1)
f2 = np.zeros(N + 1)
f3 = np.zeros(N + 1)
F = np.zeros((3, N + 1))

def fitt(xvalue, lamda, alpha, beta):
    global f0, f1, f2, f3, F
    n = int(np.round(xvalue / h))
    f1[n] = (fx(xidx(n) + 1/2, n, lamda, alpha, beta)
             + (h / 2) * kernel(xidx(n + 1/2), xidx(n), f0[n], lamda, alpha, beta))
    f2[n] = fx(xidx(n + 1/2), n, lamda, alpha, beta)
    f3[n] = (fx(xidx(n + 1), n, lamda, alpha, beta)
             + h * kernel(xidx(n + 1), xidx(n + 1/2), f2[n], lamda, alpha, beta))
    if n + 1 <= N:
        f0[n + 1] = (fx(xidx(n + 1), n, lamda, alpha, beta)
                     + (h / 6) * (kernel(xidx(n + 1), xidx(n), f0[n], lamda, alpha, beta)
                                  + 2 * kernel(xidx(n + 1), xidx(n + 1/2), f1[n], lamda, alpha, beta)
                                  + 2 * kernel(xidx(n + 1), xidx(n + 1/2), f2[n], lamda, alpha, beta)
                                  + kernel(xidx(n + 1), xidx(n + 1), f3[n], lamda, alpha, beta)))
    if xvalue == timeDF[len(timeDF) - 1]:
        print(f0[n], n)
        returnValue = f0[n]
        f0 = np.zeros(N + 1)
        f0[0] = t0
        f1 = np.zeros(N + 1)
        f2 = np.zeros(N + 1)
        f3 = np.zeros(N + 1)
        return returnValue
    print(f0[n], n)
    return f0[n]
fitt = np.vectorize(fitt)
# Fitting, plotting and computing the (adjusted) R-squared
popt, pcov = curve_fit(fitt, timeDF, tempDF, p0=(-0.1317, 0.95, -1e-11),
                       bounds=((-np.inf, 0, -np.inf), (0, 1, 0)))
print(popt)
y_fit = np.array(fitt(timeDF, popt[0], popt[1], popt[2]))
plt.scatter(timeDF, tempDF, color='ORANGE', marker='.', s=0.5)
plt.fill_between(timeDF, tempDF - 0.5, tempDF + 0.5, color='ORANGE', alpha=0.2)
plt.plot(timeDF, y_fit, color='RED', linewidth=1)
plt.legend(["Experimental data", "Caputo fit"], loc="upper right")
plt.xlabel("Time (min)")
plt.ylabel("Temperature (Kelvin)")
plt.show()
plt.close()
r2 = r2_score(tempDF, y_fit)
print(r2)
adjr2 = 1 - (1 - r2) * ((len(xls) - 1) / (len(xls) - 3 - 1))
print(adjr2)
I already tried computing the values f0, f1, f2, f3 all at once, but the thing consuming the most time is Fn(x), which I haven't figured out how to compute all at once. If that were possible, I think the program would run much faster. PS: ml and ml_ are functions from https://github.com/khinsen/mittag-leffler.
This is the necessary function. Fn is the only one I haven't figured out yet.
There are two typing errors in the cited image. The combination of x_n and 1/2 is always meant to be the midpoint x_{n+1/2} = x_n + h/2. The second error is a duplication of x_{n+1/2} in the formula for f^{(4)}_n, in its third term. The first error is probably producing errors large enough to make convergence difficult and any limit wrong for the intended problem.
In the Simpson/RK4 step, the 4 fx computations can be reduced to 2 (see the sketch at the end of this answer).
The F_n implement the left side of the integral equation

F(x) = g(x) + \int_0^x K(x, s, f(s)) \, ds

where the integral is approximated with the sample sequences f0, ..., f3. Due to the structure of the problem and the algorithm, F_n(x_n) = f^{(0)}_n = f^{(4)}_{n-1}.
Note that K(x,s,f) should be set to zero for s >= x. In the exact version of the equation these values "above the diagonal" are not used.
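In code terms this is a one-line change to the question's kernel, sketched here:

def kernel(t, s, rad, lamda, alpha, beta):
    # zero out the kernel on and above the diagonal (s >= t), per the note above
    if s >= t:
        return 0
    return ((t - s)**(alpha - 1) * ml_(lamda * ((t - s)**alpha), alpha, alpha, 1)
            * (beta * (rad**4) - beta * (ta**4) - lamda * ta))
kernel = np.vectorize(kernel)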
If an increase in accuracy is needed, for instance to avoid divergence where there is none in the exact solution, you can decrease the step size by a factor of 10 and then sub-sample the f^{(0)}_n sequence to produce the numerical guess for the given data. Other factors than 10 are of course also possible.
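To make the reduction from 4 fx evaluations to 2 concrete, here is a hedged sketch of the body of fitt once the midpoint typo is fixed: f1 and f2 then share one fx evaluation at x_{n+1/2}, and f3 and f0[n+1] share one at x_{n+1}. The kernel formulas are copied from the question, including the duplicated-midpoint term noted above.

def step(n, lamda, alpha, beta):
    x_mid = xidx(n) + h / 2   # the midpoint x_{n+1/2}, fixing xidx(n) + 1/2
    x_next = xidx(n + 1)
    fx_mid = fx(x_mid, n, lamda, alpha, beta)    # shared by f1 and f2
    fx_next = fx(x_next, n, lamda, alpha, beta)  # shared by f3 and f0[n+1]
    f1[n] = fx_mid + (h / 2) * kernel(x_mid, xidx(n), f0[n], lamda, alpha, beta)
    f2[n] = fx_mid
    f3[n] = fx_next + h * kernel(x_next, x_mid, f2[n], lamda, alpha, beta)
    if n + 1 <= N:
        f0[n + 1] = fx_next + (h / 6) * (
            kernel(x_next, xidx(n), f0[n], lamda, alpha, beta)
            + 2 * kernel(x_next, x_mid, f1[n], lamda, alpha, beta)
            + 2 * kernel(x_next, x_mid, f2[n], lamda, alpha, beta)
            + kernel(x_next, x_next, f3[n], lamda, alpha, beta))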

What is the formula being used in the in-sample prediction of statsmodels?

I would like to know what formula is being used in statsmodels' ARIMA predict/forecast. For a simple AR(1) model I thought it would be y_t = a1 * y_{t-1}. However, I am not able to recreate the results produced by forecast or predict.
Here's what I am trying to do:
from statsmodels.tsa.arima.model import ARIMA
import numpy as np

def ar_series(n):
    # generate the series y_t = a1 * y_{t-1} + eps
    np.random.seed(1)
    y0 = np.random.rand()
    y = [y0]
    a1 = 0.7  # the AR coefficient
    for i in range(1, n):
        y.append(a1 * y[i - 1] + 0.3 * np.random.rand())
    return np.array(y)

series = ar_series(10)
model = ARIMA(series, order=(1, 0, 0))
fit = model.fit()
# print(fit.summary())
# const = 0.3441; ar.L1 = 0.6518
print(fit.predict())

y_pred = [0.3441]
for i in range(1, 10):
    y_pred.append(0.6518 * series[i - 1])
y_pred = np.array(y_pred)
print(y_pred)
The two series don't match, and I have no idea how the in-sample predictions are being calculated.
Found the answer here. I think what I was trying to do is valid only if the process mean is zero; with a non-zero mean the prediction has to be mean-adjusted.
https://faculty.washington.edu/ezivot/econ584/notes/forecast.pdf
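For completeness, a small sketch of the mean-adjusted AR(1) recursion y_hat_t = mu + phi * (y_{t-1} - mu), under the assumption that the const reported by the SARIMAX-based ARIMA is the process mean mu:

import numpy as np

mu, phi = 0.3441, 0.6518   # const and ar.L1 from fit.summary()
y_pred = [mu]              # the t=0 prediction is the unconditional mean
for i in range(1, 10):
    y_pred.append(mu + phi * (series[i - 1] - mu))
print(np.array(y_pred))    # should now track fit.predict()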

Tensorflow/Keras: volatile validation loss

I've been training a U-Net for single-class small lesion segmentation, and have been getting consistently volatile validation loss. I have about 20k images split 70/30 between training and validation sets, so I don't think the issue is too little data. I've tried shuffling and resplitting the sets a few times with no change in volatility, so I don't think the validation set is unrepresentative. I have tried lowering the learning rate with no effect on volatility. And I have tried a few loss functions (dice coefficient, focal Tversky, weighted binary cross-entropy). I'm using a decent amount of augmentation so as to avoid overfitting. I've also run through all my data (512x512 float64s with corresponding 512x512 int64 masks, both stored as numpy arrays) to double-check that the value ranges, dtypes, etc. aren't screwy, and I even removed any ROIs in the masks under 35 pixels in area, which I thought might be artifacts messing with the loss.
I'm using keras ImageDataGenerator.flow_from_directory. I was initially using zca_whitening and brightness_range augmentation, but I think this causes issues with flow_from_directory and the link between mask and image being lost, so I skipped it.
I've tried validation generators with and without shuffle=True. Batch size is 8.
Here's some of my code, happy to include more if it would help:
# loss
from keras.losses import binary_crossentropy
import keras.backend as K
import tensorflow as tf

epsilon = 1e-5
smooth = 1

def dsc(y_true, y_pred):
    smooth = 1.
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    intersection = K.sum(y_true_f * y_pred_f)
    score = (2. * intersection + smooth) / (K.sum(y_true_f) + K.sum(y_pred_f) + smooth)
    return score

def dice_loss(y_true, y_pred):
    loss = 1 - dsc(y_true, y_pred)
    return loss

def bce_dice_loss(y_true, y_pred):
    loss = binary_crossentropy(y_true, y_pred) + dice_loss(y_true, y_pred)
    return loss

def confusion(y_true, y_pred):
    smooth = 1
    y_pred_pos = K.clip(y_pred, 0, 1)
    y_pred_neg = 1 - y_pred_pos
    y_pos = K.clip(y_true, 0, 1)
    y_neg = 1 - y_pos
    tp = K.sum(y_pos * y_pred_pos)
    fp = K.sum(y_neg * y_pred_pos)
    fn = K.sum(y_pos * y_pred_neg)
    prec = (tp + smooth) / (tp + fp + smooth)
    recall = (tp + smooth) / (tp + fn + smooth)
    return prec, recall

def tp(y_true, y_pred):
    smooth = 1
    y_pred_pos = K.round(K.clip(y_pred, 0, 1))
    y_pos = K.round(K.clip(y_true, 0, 1))
    tp = (K.sum(y_pos * y_pred_pos) + smooth) / (K.sum(y_pos) + smooth)
    return tp

def tn(y_true, y_pred):
    smooth = 1
    y_pred_pos = K.round(K.clip(y_pred, 0, 1))
    y_pred_neg = 1 - y_pred_pos
    y_pos = K.round(K.clip(y_true, 0, 1))
    y_neg = 1 - y_pos
    tn = (K.sum(y_neg * y_pred_neg) + smooth) / (K.sum(y_neg) + smooth)
    return tn

def tversky(y_true, y_pred):
    y_true_pos = K.flatten(y_true)
    y_pred_pos = K.flatten(y_pred)
    true_pos = K.sum(y_true_pos * y_pred_pos)
    false_neg = K.sum(y_true_pos * (1 - y_pred_pos))
    false_pos = K.sum((1 - y_true_pos) * y_pred_pos)
    alpha = 0.7
    return (true_pos + smooth) / (true_pos + alpha * false_neg + (1 - alpha) * false_pos + smooth)

def tversky_loss(y_true, y_pred):
    return 1 - tversky(y_true, y_pred)

def focal_tversky(y_true, y_pred):
    pt_1 = tversky(y_true, y_pred)
    gamma = 0.75
    return K.pow((1 - pt_1), gamma)
model = BlockModel((len(os.listdir(os.path.join(imageroot, 'train_ct', 'train'))), 512, 512, 1),
                   filt_num=16, numBlocks=4)
# model.compile(optimizer=Adam(learning_rate=0.001), loss=weighted_cross_entropy)
# model.compile(optimizer=Adam(learning_rate=0.001), loss=dice_coef_loss)
model.compile(optimizer=Adam(learning_rate=0.001), loss=focal_tversky)

train_mask = os.path.join(imageroot, 'train_masks')
val_mask = os.path.join(imageroot, 'val_masks')

model.load_weights(model_weights_path)  # I'm initializing with some pre-trained weights from a similar model

data_gen_args_mask = dict(
    rotation_range=10,
    shear_range=20,
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=[0.8, 1.2],
    horizontal_flip=True,
    # vertical_flip=True,
    fill_mode='nearest',
    data_format='channels_last'
)
data_gen_args = dict(
    **data_gen_args_mask
)

image_datagen_train = ImageDataGenerator(**data_gen_args)
mask_datagen_train = ImageDataGenerator(**data_gen_args)  # _mask)
image_datagen_val = ImageDataGenerator()
mask_datagen_val = ImageDataGenerator()

seed = 1
BS = 8
steps = int(np.floor((len(os.listdir(os.path.join(train_ct, 'train')))) / BS))
print(steps)
val_steps = int(np.floor((len(os.listdir(os.path.join(val_ct, 'val')))) / BS))
print(val_steps)

train_image_generator = image_datagen_train.flow_from_directory(
    train_ct,
    target_size=(512, 512),
    color_mode="grayscale",
    classes=None,
    class_mode=None,
    seed=seed,
    shuffle=True,
    batch_size=BS)
train_mask_generator = mask_datagen_train.flow_from_directory(
    train_mask,
    target_size=(512, 512),
    color_mode="grayscale",
    classes=None,
    class_mode=None,
    seed=seed,
    shuffle=True,
    batch_size=BS)
val_image_generator = image_datagen_val.flow_from_directory(
    val_ct,
    target_size=(512, 512),
    color_mode="grayscale",
    classes=None,
    class_mode=None,
    seed=seed,
    shuffle=True,
    batch_size=BS)
val_mask_generator = mask_datagen_val.flow_from_directory(
    val_mask,
    target_size=(512, 512),
    color_mode="grayscale",
    classes=None,
    class_mode=None,
    seed=seed,
    shuffle=True,
    batch_size=BS)

train_generator = zip(train_image_generator, train_mask_generator)
val_generator = zip(val_image_generator, val_mask_generator)

# make callback for checkpointing
plot_losses = PlotLossesCallback(skip_first=0, plot_extrema=False)
%matplotlib inline
filepath = os.path.join(versionPath, model_version + "_saved-model-{epoch:02d}-{val_loss:.2f}.hdf5")

if reduce:
    cb_check = [ModelCheckpoint(filepath, monitor='val_loss',
                                verbose=1, save_best_only=False,
                                save_weights_only=True, mode='auto', period=1),
                reduce_lr,
                plot_losses]
else:
    cb_check = [ModelCheckpoint(filepath, monitor='val_loss',
                                verbose=1, save_best_only=False,
                                save_weights_only=True, mode='auto', period=1),
                plot_losses]

# train model
history = model.fit_generator(train_generator, epochs=numEp,
                              steps_per_epoch=steps,
                              validation_data=val_generator,
                              validation_steps=val_steps,
                              verbose=1,
                              callbacks=cb_check,
                              use_multiprocessing=False)
And here's how my loss looks:
Another potentially relevant thing: I tweaked the flow_from_directory code a bit (added npy to the whitelist). But training loss looks fine, so I'm assuming the issue isn't here.
Two suggestions:
Switch to the classic validation-data format (i.e. a numpy array) instead of using a generator -- this ensures you always use exactly the same validation data every time. If the validation curve then looks different, there was something "random" in the validation generator giving you different data at different epochs.
Use a fixed set of samples (100 or 1000 should be enough without any data augmentation) for both training and validation. If everything goes well, you should see your network quickly overfit to this dataset, and your training and validation curves should look very similar. If not, debug your network.
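A minimal sketch of the first suggestion, reusing the generators and names from the question (the lockstep draw below relies on the image and mask generators sharing the same seed, as they do above):

import numpy as np

# Draw the validation set once and reuse the same arrays every epoch.
val_imgs, val_masks = [], []
for _ in range(val_steps):           # val_steps batches, as computed above
    val_imgs.append(next(val_image_generator))
    val_masks.append(next(val_mask_generator))
val_x = np.concatenate(val_imgs)
val_y = np.concatenate(val_masks)

history = model.fit_generator(train_generator, epochs=numEp,
                              steps_per_epoch=steps,
                              validation_data=(val_x, val_y),
                              verbose=1, callbacks=cb_check)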

Why does pymc.MAP not always return the same value

I am running pymc2 to fit a straight line through my data. The code is shown below (modified from examples I found online). When I call the MAP function multiple times, I get different answers, even though I start with the exact same model. I thought the optimization method, fmin_powell, starts at the supplied value for each parameter, and as far as I know fmin_powell has no random component, so it should always end at the same optimum; yet it doesn't. Why do I keep getting different results?
import numpy as np
import pymc

# observed data
n = 21
a = 6
b = 2
sigma = 2
x = np.linspace(0, 1, n)
np.random.seed(1)
y_obs = a * x + b + np.random.normal(0, sigma, n)

def model():
    # define priors
    a = pymc.Normal('a', mu=0, tau=1 / 10 ** 2, value=5)
    b = pymc.Normal('b', mu=0, tau=1 / 10 ** 2, value=1)
    tau = pymc.Gamma('tau', alpha=0.1, beta=0.1, value=1)

    # define likelihood
    @pymc.deterministic
    def mu(a=a, b=b, x=x):
        return a * x + b

    y = pymc.Normal('y', mu=mu, tau=tau, value=y_obs, observed=True)
    return locals()

ml = model()            # dictionary of all locals
mcmc = pymc.Model(ml)   # MCMC object
mapmcmc = pymc.MAP(mcmc)
mapmcmc.fit(method='fmin_powell')
print(mcmc.a.value, mcmc.b.value, mcmc.tau.value)

ml = model()
mcmc = pymc.Model(ml)
mapmcmc = pymc.MAP(mcmc)
mapmcmc.fit(method='fmin_powell')
print(mcmc.a.value, mcmc.b.value, mcmc.tau.value)

ml = model()
mcmc = pymc.Model(ml)
mapmcmc = pymc.MAP(mcmc)
mapmcmc.fit(method='fmin_powell')
print(mcmc.a.value, mcmc.b.value, mcmc.tau.value)

Fitting a capped Poisson process with a variable rate

I'm trying to estimate the rate of a Poisson process where the rate varies over time, using the maximum a posteriori estimate. Here's a simplified example with a rate varying linearly (λ(t) = a·t + b):
import numpy as np
import pymc

# Observation
a_actual = 1.3
b_actual = 2.0
t = np.arange(10)
obs = np.random.poisson(a_actual * t + b_actual)

# Model
a = pymc.Uniform(name='a', value=1., lower=0, upper=10)
b = pymc.Uniform(name='b', value=1., lower=0, upper=10)

@pymc.deterministic
def linear(a=a, b=b):
    return a * t + b

r = pymc.Poisson(mu=linear, name='r', value=obs, observed=True)

model = pymc.Model([a, b, r])
map = pymc.MAP(model)
map.fit()
map.revert_to_max()

print "a :", a._value
print "b :", b._value
This works fine. But my actual Poisson process is capped by a deterministic value. As I can't associate my observed values with a Deterministic function, I'm adding a Normal stochastic with a small variance for my observations:
import numpy as np
import pymc

# Observation
a_actual = 1.3
b_actual = 2.0
t = np.arange(10)
obs = np.random.poisson(a_actual * t + b_actual).clip(0, 10)

# Model
a = pymc.Uniform(name='a', value=1., lower=0, upper=10)
b = pymc.Uniform(name='b', value=1., lower=0, upper=10)

@pymc.deterministic
def linear(a=a, b=b):
    return a * t + b

r = pymc.Poisson(mu=linear, name='r')

@pymc.deterministic
def clip(r=r):
    return r.clip(0, 10)

rc = pymc.Normal(mu=r, tau=0.001, name='rc', value=obs, observed=True)

model = pymc.Model([a, b, r, rc])
map = pymc.MAP(model)
map.fit()
map.revert_to_max()

print "a :", a._value
print "b :", b._value
This code produces the following error:
Traceback (most recent call last):
  File "pymc-bug-2.py", line 59, in <module>
    map.revert_to_max()
  File "pymc/NormalApproximation.py", line 486, in revert_to_max
    self._set_stochastics([self.mu[s] for s in self.stochastics])
  File "pymc/NormalApproximation.py", line 58, in __getitem__
    tot_len += self.owner.stochastic_len[p]
KeyError: 0
Any idea what I'm doing wrong?
By "capped" do you mean that it is a truncated Poisson? It appears that's what you are saying. If it were a left truncation (which is more common), you could use the TruncatedPoisson distribution, but since you are doing a right truncation, you cannot (we should have made this more general!). What you are trying will not work -- the Poisson object has no clip() method. What you can do instead is use a factor potential. It would look like this:
@pymc.potential
def clip(r=r):
    if np.any(r > 10):
        return -np.inf
    return 0
This will constrain the values of r to be at most 10. Refer to the pymc docs for information on the Potential class.
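As a sketch of how this plugs into the capped model above (the potential replaces the clip Deterministic; everything else is unchanged from the question):

@pymc.potential
def clip(r=r):
    # contributes -inf to the joint log-probability whenever r exceeds the cap
    return -np.inf if np.any(r > 10) else 0.0

rc = pymc.Normal(mu=r, tau=0.001, name='rc', value=obs, observed=True)
model = pymc.Model([a, b, r, clip, rc])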
