How does the custom loss function gets calculated in binary classification problem? - h2o

I created a custom loss function for a binary (0/1) classification problem in h2o via Python as shown below. The idea is to minimize total cost based on true positive, true negative, false positive, and false negative. Here are the questions that I hope to get an answer on:
What is used to calculate custom loss function? I used the output of confusion matrix of training data and validation data and manually calculate the value of the loss function but it doesn't match the output. Note that the default metrics used to generate confusion matrix is using f1 threshold
On h2o documentation, custom_metric_func can be used in GLM, DRF, and GBM. However, it doesn't work in GLM (the loss function value is default to 0) although it works perfectly in GBM. Any idea on why that's the case?
Custom loss function:
class CustomLossFunc:
def map(self, predicted, actual, weight, offset, model):
import math
cost_tp = -9
cost_tn = 0
cost_fp = 1
cost_fn = 10
y = actual[0]
y_pred = predicted[0] # [class, p0, p1]
if (y == 0) and (y_pred == 0):
total_cost = cost_tn
elif (y == 0) and (y_pred == 1):
total_cost = cost_fp
elif (y == 1) and (y_pred == 1):
total_cost = cost_tp
else:
total_cost = cost_fn
return [total_cost, 1]
def reduce(self, left, right):
return [left[0] + right[0], left[1] + right[1]]
def metric(self, last):
return last[0]
The loss function is uploaded using h2o.upload_custom_metric() then I run GLM and GBM for comparison:
# GLM
glm_fit_cost = H2OGeneralizedLinearEstimator(family='binomial',
model_id='glm_fit_cost',
#standardize=True,
custom_metric_func= cost_loss_func)
glm_fit_cost.train(x=x_co,
y=y_co,
training_frame = train_co_h2o,
validation_frame = valid_co_h2o)
# GBM
gbm_mod = H2OGradientBoostingEstimator(model_id = "gbm_mod",
custom_metric_func = cost_loss_func)
gbm_mod.train(y=y_co,
x=x_co,
training_frame=train_co_h2o,
validation_frame = valid_co_h2o)
I have tried the following:
Debug via manual calculation using confusion matrix of the training set and validation set and compare to the output of the loss function
Reference for the example used to create my own loss function:
Resource1
Resource2

Related

Problem by dictionaries to use numba njit parallelization to accelerate the code

I have written a code and try to use numba for accelerating the code. The main goal of the code is to group some values based on a condition. In this regard, iter_ is used for converging the code to satisfy the condition. I prepared a small case below to reproduce the sample code:
import numpy as np
import numba as nb
rng = np.random.default_rng(85)
# --------------------------------------- small data volume ---------------------------------------
# values_ = {'R0': np.array([0.01090976, 0.01069902, 0.00724112, 0.0068463 , 0.01135723, 0.00990762,
# 0.01090976, 0.01069902, 0.00724112, 0.0068463 , 0.01135723]),
# 'R1': np.array([0.01836379, 0.01900166, 0.01864162, 0.0182823 , 0.01840322, 0.01653088,
# 0.01900166, 0.01864162, 0.0182823 , 0.01840322, 0.01653088]),
# 'R2': np.array([0.02430913, 0.02239156, 0.02225379, 0.02093393, 0.02408692, 0.02110411,
# 0.02239156, 0.02225379, 0.02093393, 0.02408692, 0.02110411])}
#
# params = {'R0': [3, 0.9490579204466154, 1825, 7.070272000000002e-05],
# 'R1': [0, 0.9729203826820172, 167 , 7.070272000000002e-05],
# 'R2': [1, 0.6031363088057902, 1316, 8.007296000000003e-05]}
#
# Sno, dec_, upd_ = 2, 100, 200
# -------------------------------------------------------------------------------------------------
# ----------------------------- UPDATED (medium and large data volumes) ---------------------------
# values_ = np.load("values_med.npy", allow_pickle=True)[()]
# params = np.load("params_med.npy", allow_pickle=True)[()]
values_ = np.load("values_large.npy", allow_pickle=True)[()]
params = np.load("params_large.npy", allow_pickle=True)[()]
Sno, dec_, upd_ = 2000, 1000, 200
# -------------------------------------------------------------------------------------------------
# values_ = [*values_.values()]
# params = [*params.values()]
# #nb.jit(forceobj=True)
# def test(values_, params, Sno, dec_, upd_):
final_dict = {}
for i, j in enumerate(values_.keys()):
Rand_vals = []
goal_sum = params[j][1] * params[j][3]
tel = goal_sum / dec_ * 10
if params[j][0] != 0:
for k in range(Sno):
final_sum = 0.0
iter_ = 0
t = 1
while not np.allclose(goal_sum, final_sum, atol=tel):
iter_ += 1
vals_group = rng.choice(values_[j], size=params[j][0], replace=False)
# final_sum = 0.0016 * np.sum(vals_group) # -----> For small data volume
final_sum = np.sum(vals_group ** 3) # -----> UPDATED For med or large data volume
if iter_ == upd_:
t += 1
tel = t * tel
values_[j] = np.delete(values_[j], np.where(np.in1d(values_[j], vals_group)))
Rand_vals.append(vals_group)
else:
Rand_vals = [np.array([])] * Sno
final_dict["R" + str(i)] = Rand_vals
# return final_dict
# test(values_, params, Sno, dec_, upd_)
At first, for applying numba on this code #nb.jit was used (forceobj=True is used for avoiding warnings and …), which will have adverse effect on the performance. nopython is checked, too, with #nb.njit which get the following error due to not supporting (as mentioned in 1, 2) dictionary type of the inputs:
cannot determine Numba type of <class 'dict'>
I don't know if (how) it could be handled by Dict from numba.typed (by converting created python dictionaries to numba Dict) or if converting the dictionaries to lists of arrays have any advantage. I think, parallelization may be possible if some code lines e.g. Rand_vals.append(vals_group) or else section or … be taken or be modified out of the function to get the same results as before, but I don't have any idea how to do so.
I will be grateful for helping utilize numba on this code. numba parallelization will be the most desired (probably the best applicable method in terms of performance) solution if it could.
Data:
medium data volume: values_med, params_med
large data volume: values_large, params_large
This code can be converted to Numba but it is not straightforward.
First of all, the dictionary and list type must be defined since Numba njit functions cannot directly operate on reflected lists (aka. pure-python lists). This is a bit tedious to do in Numba and the resulting code is a bit verbose:
String = nb.types.unicode_type
ValueArray = nb.float64[::1]
ValueDict = nb.types.DictType(String, ValueArray)
ParamDictValue = nb.types.Tuple([nb.int_, nb.float64, nb.int_, nb.float64])
ParamDict = nb.types.DictType(String, ParamDictValue)
FinalDictValue = nb.types.ListType(ValueArray)
FinalDict = nb.types.DictType(String, FinalDictValue)
Then you need to convert the input dictionaries:
nbValues = nb.typed.typeddict.Dict.empty(String, ValueArray)
for key,value in values_.items():
nbValues[key] = value.copy()
nbParams = nb.typed.typeddict.Dict.empty(String, ParamDictValue)
for key,value in params.items():
nbParams[key] = (nb.int_(value[0]), nb.float64(value[1]), nb.int_(value[2]), nb.float64(value[3]))
Then, you need to write the core function. np.allclose and np.isin are not implemented in Numba so they should be reimplemented manually. But the main point is that Numba does not support the rng Numpy object. I think it will certainly not support it any time soon. Note that Numba has an random numbers implementation that try to mimic the behavior of Numpy but the management of the seed is a bit different. Note also that results should be the same with the np.random.xxx Numpy functions if the seed is set to the same value (Numpy and Numba have different seed variables that are not synchronized).
#nb.njit(FinalDict(ValueDict, ParamDict, nb.int_, nb.int_, nb.int_))
def nbTest(values_, params, Sno, dec_, upd_):
final_dict = nb.typed.Dict.empty(String, FinalDictValue)
for i, j in enumerate(values_.keys()):
Rand_vals = nb.typed.List.empty_list(ValueArray)
goal_sum = params[j][1] * params[j][3]
tel = goal_sum / dec_ * 10
if params[j][0] != 0:
for k in range(Sno):
final_sum = 0.0
iter_ = 0
t = 1
vals_group = np.empty(0, dtype=nb.float64)
while np.abs(goal_sum - final_sum) > (1e-05 * np.abs(final_sum) + tel):
iter_ += 1
vals_group = np.random.choice(values_[j], size=params[j][0], replace=False)
final_sum = 0.0016 * np.sum(vals_group)
# final_sum = 0.0016 * np.sum(vals_group) # (for small data volume)
final_sum = np.sum(vals_group ** 3) # (for med or large data volume)
if iter_ == upd_:
t += 1
tel = t * tel
# Perform an in-place deletion
vals, gr = values_[j], vals_group
cur = 0
for l in range(vals.size):
found = False
for m in range(gr.size):
found |= vals[l] == gr[m]
if not found:
# Keep the value (delete it otherwise)
vals[cur] = vals[l]
cur += 1
values_[j] = vals[:cur]
Rand_vals.append(vals_group)
else:
for k in range(Sno):
Rand_vals.append(np.empty(0, dtype=nb.float64))
final_dict["R" + str(i)] = Rand_vals
return final_dict
Note that the replacement implementation of np.isin is quite naive but it works pretty well in practice on your input example.
The function can be called using the following way:
nbFinalDict = nbTest(nbValues, nbParams, Sno, dec_, upd_)
Finally, the dictionary should be converted back to basic Python objects:
finalDict = dict()
for key,value in nbFinalDict.items():
finalDict[key] = list(value)
This implementation is fast for small inputs but not large ones since np.random.choice takes almost all the time (>96%). The thing is this function is clearly not optimal when the number of requested item is small (which is your case). Indeed, it surprisingly runs in linear time of the input array and not in linear time of the number of requested items.
Further Optimizations
The algorithm can be completely rewritten to extract only 12 random items and discard them from the main currant array in a much more efficient way. The idea is to swap n items (small target sample) at the end of the array with other items at random locations, then check the sum, repeat this process until a condition is fulfilled, and finally extract the view to the last n items before resizing the view so to discard the last items. All of this can be done in O(n) time rather than O(m) time where m is the size of the main current array with n << m (eg. 12 VS 20_000). It can also be compute without any expensive allocation. Here is the resulting code:
#nb.njit(nb.void(ValueArray, nb.int_, nb.int_))
def swap(arr, i, j):
arr[i], arr[j] = arr[j], arr[i]
#nb.njit(FinalDict(ValueDict, ParamDict, nb.int_, nb.int_, nb.int_))
def nbTest(values_, params, Sno, dec_, upd_):
final_dict = nb.typed.Dict.empty(String, FinalDictValue)
for i, j in enumerate(values_.keys()):
Rand_vals = nb.typed.List.empty_list(ValueArray)
goal_sum = params[j][1] * params[j][3]
tel = goal_sum / dec_ * 10
values = values_[j]
n = params[j][0]
if n != 0:
for k in range(Sno):
final_sum = 0.0
iter_ = 0
t = 1
m = values.size
assert n <= m
group = values[-n:]
while np.abs(goal_sum - final_sum) > (1e-05 * np.abs(final_sum) + tel):
iter_ += 1
# Swap the group view with other random items
for pos in range(m - n, m):
swap(values, pos, np.random.randint(0, m))
# For small data volume:
# final_sum = 0.0016 * np.sum(group)
# For med/large data volume
final_sum = 0.0
for v in group:
final_sum += v ** 3
if iter_ == upd_:
t += 1
tel *= t
assert iter_ > 0
values = values[:m-n]
Rand_vals.append(group)
else:
for k in range(Sno):
Rand_vals.append(np.empty(0, dtype=nb.float64))
final_dict["R" + str(i)] = Rand_vals
return final_dict
In addition to being faster, this implementation as the benefit of being also simpler. Results looks quite similar to the previous implementation despite the randomness make the check of the results tricky (especially since this function does not use the same method to choose the random sample). Note that this implementation does not remove items in values that are in group as opposed to the previous one (this is probably not wanted though).
Benchmark
Here are the results of the last implementation on my machine (compilation and conversion timings excluded):
Provided small input (embedded in the question):
- Initial code: 42.71 ms
- Numba code: 0.11 ms
Medium input:
- Initial code: 3481 ms
- Numba code: 11 ms
Large input:
- Initial code: 6728 ms
- Numba code: 20 ms
Note that the conversion time takes about the same time than the computation.
This last implementation is 316~388 times faster than the initial code on small inputs.
Notes
Note that the compilation time takes few seconds due to the dict and lists types.
Note that while it may be possible to parallelise the implementation, only the most encompassing loop can be parallelised. The thing is there is only few items to compute and the time is already quite small (not the best case for multi-threading). <-- Additionally, the creation of many temporary arrays (created by rng.choice) will certainly cause the parallel loop not to scale well anyway. --> Additionally, the list/dict cannot be written from multiple threads safely so one need to use Numpy arrays in the whole function to be able to do that (or add additional conversion that are already expensive). Moreover, Numba parallelism tends to increase significantly the compilation time which is already significant. Finally, the result will be less deterministic since each Numba thread has its own random number generator seed and the items computed by the threads cannot be predicted with prange (dependent of the parallel runtime chosen on the target platform). Note that in Numpy there is one global seed by default used by usual random functions (deprecated way) and RNG objects have their own seed (new preferred way).

What does it mean when train and validation loss diverge from epoch 1?

I was recently working on a deep learning model in Keras and it gave me very perplexing results. The model is capable of mastering the training data over time, but it consistently gets worse results on the validation data.
I know that if the validation accuracy goes up for a while and then starts to decrease that you are over-fitting to the training data, but in this case, the validation accuracy only ever decreases. I am really confused why this happens. Does anyone have any intuition as to what could cause this to happen? Or any suggestions on things to test to potentially fix it?
Edit to add more info and code
Ok. So I am making a model that is trying to do some basic stock predictions. By looking at the open, high, low, close, and volume of the last 40 days, the model tries to predict whether or not the price will go up two average true ranges without going down one average true range. As input, I took CSVs from Yahoo Finance that include this information for the last 30 years for all of the stocks in the Dow Jones Industrial Average. The model trains on 70% of the stocks and validates on the other 20%. This leads to about 150,000 training samples. I am currently using a 1d Convolutional Neural Network, but I have also tried other smaller models (logistic regression and small Feed Forward NN) and I always get the same either diverging train and validation loss or nothing learned at all because the model is too simple.
Here is the code:
import numpy as np
from sklearn import preprocessing
from sklearn.metrics import auc, roc_curve, roc_auc_score
from keras.layers import Input, Dense, Flatten, Conv1D, Activation, MaxPooling1D, Dropout, Concatenate
from keras.models import Model
from keras.callbacks import ModelCheckpoint, EarlyStopping, Callback
from keras import backend as K
import matplotlib.pyplot as plt
from random import seed, shuffle
from os import listdir
class roc_auc(Callback):
def on_train_begin(self, logs={}):
self.aucs = []
def on_train_end(self, logs={}):
return
def on_epoch_begin(self, epoch, logs={}):
return
def on_epoch_end(self, epoch, logs={}):
y_pred = self.model.predict(self.validation_data[0])
self.aucs.append(roc_auc_score(self.validation_data[1], y_pred))
if max(self.aucs) == self.aucs[-1]:
model.save_weights("weights.roc_auc.hdf5")
print(" - auc: %0.4f" % self.aucs[-1])
return
def on_batch_begin(self, batch, logs={}):
return
def on_batch_end(self, batch, logs={}):
return
rrr = 2
epochs = 200
batch_size = 64
days_input = 40
seed(42)
X_train = []
X_test = []
y_train = []
y_test = []
files = listdir("Stocks")
total_stocks = len(files)
shuffle(files)
for x, file in enumerate(files):
test = False
if (x+1.0)/total_stocks > 0.7:
test = True
if test:
print("Test -> Stocks/%s" % file)
else:
print("Train -> Stocks/%s" % file)
stock = np.loadtxt(open("Stocks/"+file, "r"), delimiter=",", skiprows=1, usecols = (1,2,3,5,6))
atr = []
last = None
for day in stock:
if last is None:
tr = abs(day[1] - day[2])
atr.append(tr)
else:
tr = max(day[1] - day[2], abs(last[3] - day[1]), abs(last[3] - day[2]))
atr.append((13*atr[-1]+tr)/14)
last = day.copy()
stock = np.insert(stock, 5, atr, axis=1)
for i in range(days_input,stock.shape[0]-1):
input = stock[i-days_input:i, 0:5].copy()
for j, day in enumerate(input):
input[j][1] = (day[1]-day[0])/day[0]
input[j][2] = (day[2]-day[0])/day[0]
input[j][3] = (day[3]-day[0])/day[0]
input[:,0] = input[:,0] / np.linalg.norm(input[:,0])
input[:,1] = input[:,1] / np.linalg.norm(input[:,1])
input[:,2] = input[:,2] / np.linalg.norm(input[:,2])
input[:,3] = input[:,3] / np.linalg.norm(input[:,3])
input[:,4] = input[:,4] / np.linalg.norm(input[:,4])
preprocessing.scale(input, copy=False)
output = -1
buy = stock[i][1]
stoploss = buy - stock[i][5]
target = buy + rrr*stock[i][5]
for j in range(i+1, stock.shape[0]):
if stock[j][0] < stoploss or stock[j][2] < stoploss:
output = 0
break
elif stock[j][1] > target:
output = 1
break
if output != -1:
if test:
X_test.append(input)
y_test.append(output)
else:
X_train.append(input)
y_train.append(output)
shape = list(X_train[0].shape)
shape[:0] = [len(X_train)]
X_train = np.concatenate(X_train).reshape(shape)
y_train = np.array(y_train)
shape = list(X_test[0].shape)
shape[:0] = [len(X_test)]
X_test = np.concatenate(X_test).reshape(shape)
y_test = np.array(y_test)
print("Train class split is %0.2f" % (100*np.average(y_train)))
print("Test class split is %0.2f" % (100*np.average(y_test)))
inputs = Input(shape=(days_input,5))
x = Conv1D(32, 5, padding='same')(inputs)
x = Activation('relu')(x)
x = MaxPooling1D()(x)
x = Conv1D(64, 5, padding='same')(x)
x = Activation('relu')(x)
x = MaxPooling1D()(x)
x = Conv1D(128, 5, padding='same')(x)
x = Activation('relu')(x)
x = MaxPooling1D()(x)
x = Flatten()(x)
x = Dense(128, activation="relu")(x)
x = Dense(64, activation="relu")(x)
output = Dense(1, activation="sigmoid")(x)
model = Model(inputs=inputs,outputs=output)
model.compile(optimizer='adam',
loss='binary_crossentropy',
metrics=['accuracy'])
filepath="weights.best.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=0, save_best_only=True, mode='max')
auc_hist = roc_auc()
callbacks_list = [checkpoint, auc_hist]
history = model.fit(X_train, y_train, validation_data=(X_test,y_test) , epochs=epochs, callbacks=callbacks_list, batch_size=batch_size, class_weight ='balanced').history
model_json = model.to_json()
with open("model.json", "w") as json_file:
json_file.write(model_json)
model.save_weights("weights.latest.hdf5")
model.load_weights("weights.roc_auc.hdf5")
plt.plot(history['acc'])
plt.plot(history['val_acc'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
plt.plot(history['loss'])
plt.plot(history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
plt.plot(auc_hist.aucs)
plt.title('model ROC AUC')
plt.ylabel('AUC')
plt.xlabel('epoch')
plt.show()
y_pred = model.predict(X_train)
fpr, tpr, _ = roc_curve(y_train, y_pred)
roc_auc = auc(fpr, tpr)
plt.subplot(1, 2, 1)
plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy',linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Train ROC')
plt.legend(loc="lower right")
y_pred = model.predict(X_test)
fpr, tpr, thresholds = roc_curve(y_test, y_pred)
roc_auc = auc(fpr, tpr)
plt.subplot(1, 2, 2)
plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy',linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Test ROC')
plt.legend(loc="lower right")
plt.show()
with open('roc.csv','w+') as file:
for i in range(len(thresholds)):
file.write("%f,%f,%f\n" % (fpr[i], tpr[i], thresholds[i]))
Results by 100 batches instead of by epoch
I listened to suggestions and made a few updates. The classes are now balanced 50% to 50% instead of 25% to 75%. Also, the validation data is randomly selected now instead of being a specific set of stocks. By graphing the loss and accuracy at a finer resolution(100 batches vs 1 epoch), the over-fitting can clearly be seen. The model does actually start to learn at the very beginning before it starts to diverge. I am surprised at how fast it starts to over-fit, but now that I can see the issue hopefully I can debug it.
Possible explanations
Coding error
Overfitting due to differences in the training / validation data
Skewed classes (and differences in the training / validation data)
Things I would try
Swapping the training and the validation set. Does the problem still occur?
Plot the curves in more detail for the first ~10 epochs (e.g. directly after initialization; each few training iterations, not only per epoch). Do you still start at > 75%? Then your classes might be skewed and you might also want to check if your training-validation split is stratified.
Code
This is useless: np.concatenate(X_train)
Make your code as readable as possible when you post it here. This includes removing lines which are commented out.
This looks suspicious for a coding error to me:
if test:
X_test.append(input)
y_test.append(output)
else:
#if((output == 0 and np.average(y_train) > 0.5) or output == 1):
X_train.append(input)
y_train.append(output)
use sklearn.model_selection.train_test_split instead. Do all transformations to the data before, then make the split with this method.
Looks like the batch size is much too small for the number of training samples you have. Try batching 20% and see if that makes a difference.

Implicit recommender Tuning hyper parameters Pyspark

computeMAPK function takes the model, Actual data and Validation data (user,product) to generate ratings. Then sort the predicted ratings for every user and take top K to compare with the actual data to calculate Mean Average Precision at K
I am using this function to tune the hyper parameters i.e. fit multiple models and select the best Lambda, Alpha, Ranks with highest MAPK. This works for small data sets but when the the matrix becomes 10 Million users * 200 products. It breaks especially with reduceByKey step and joins. Any better way to Tune the hyper parameters for ALS implicit ? and I am using Spark 1.3.
Actual RDD is of the form (user,product)
Valid RDD is of the form (user,product)
def apk(act_pred):
predicted = act_pred[0]
actual = act_pred[1]
k = act_pred[2]
if len(predicted)>k:
predicted = predicted[:k]
score =0.0
num_hits = 0.0
for i,p in enumerate(predicted):
if p in actual and p not in predicted[:i]:
num_hits += 1.0
score += num_hits / (i+1.0)
if not actual:
return 1.0
#return num_hits
return (score/min(len(actual),k))
def computeMAPKR(model,actual,valid,k):
pred = model.predictAll(valid).map(lambda x:(x[0],[(x[1],x[2])])).cache()
gp = pred.reduceByKey(lambda x,y:x+y)
#gp = pred.groupByKey().map(lambda x : (x[0], list(x[1])))
# for every user, sort the items by predicted ratings and get user, item pairs
def f(x):
s = sorted(x,key=lambda x:x[1],reverse=True)
sm = map(lambda x:x[0],s)
return sm
sp = gp.mapValues(f)
# actual data
ac = actual.map(lambda x:(x[0],[(x[1])]))
#gac = ac.reduceByKey(lambda x,y:(x,y)).map(lambda x : (x[0], list(x[1])))
gac = ac.reduceByKey(lambda x,y:x+y)
ap = sp.join(gac)
apk_result = ap.map(lambda x:(x[0],(x[1][0],x[1][1],k))).mapValues(apk)
mapk = apk_result.map(lambda x :x[1]).reduce(add) / ap.count()
#print(apk_result.collect())
return mapk

PyMC: sampling step by step?

I would like to know why the sampler is incredibly slow when sampling step by step.
For example, if I run:
mcmc = MCMC(model)
mcmc.sample(1000)
the sampling is fast. However, if I run:
mcmc = MCMC(model)
for i in arange(1000):
mcmc.sample(1)
the sampling is slower (and the more it samples, the slower it is).
If you are wondering why I am asking this.. well, I need a step by step sampling because I want to perform some operations on the values of the variables after each step of the sampler.
Is there a way to speed it up?
Thank you in advance!
------------------ EDIT -------------------------------------------------------------
Here I present the specific problem in more details:
I have two models in competition and they are part of a bigger model that has a categorical variable functioning as a 'switch' between the two.
In this toy example, I have the observed vector 'Y', that could be explained by a Poisson or a Geometric distribution. The Categorical variable 'switch_model' selects the Geometric model when = 0 and the Poisson model when =1.
After each sample, if switch_model selects the Geometric model, I want the variables of the Poisson model NOT to be updated, because they are not influencing the likelihood and therefore they are just drifting away. The opposite is true if the switch_model selects the Poisson model.
Basically what I do at each step is to 'change' the value of the non-selected model by bringing it manually one step back.
I hope that my explanation and the commented code will be clear enough. Let me know if you need further details.
import numpy as np
import pymc as pm
import pandas as pd
import matplotlib.pyplot as plt
# OBSERVED VALUES
Y = np.array([0, 1, 2, 3, 8])
# PRIOR ON THE MODELS
pi = (0.5, 0.5)
switch_model = pm.Categorical("switch_model", p = pi)
# switch_model = 0 for Geometric, switch_model = 1 for Poisson
p = pm.Uniform('p', lower = 0, upper = 1) # Prior of the parameter of the geometric distribution
mu = pm.Uniform('mu', lower = 0, upper = 10) # Prior of the parameter of the Poisson distribution
# LIKELIHOOD
#pm.observed
def Ylike(value = Y, mu = mu, p = p, M = switch_model):
if M == 0:
out = pm.geometric_like(value+1, p)
elif M == 1:
out = pm.poisson_like(value, mu)
return out
model = pm.Model([Ylike, p, mu, switch_model])
mcmc = pm.MCMC(model)
n_samples = 5000
traces = {}
for var in mcmc.stochastics:
traces[str(var)] = np.zeros(n_samples)
bar = pm.progressbar.progress_bar(n_samples)
bar.update(0)
mcmc.sample(1, progress_bar=False)
for var in mcmc.stochastics:
traces[str(var)][0] = mcmc.trace(var)[-1]
for i in np.arange(1,n_samples):
mcmc.sample(1, progress_bar=False)
bar.update(i)
for var in mcmc.stochastics:
traces[str(var)][i] = mcmc.trace(var)[-1]
if mcmc.trace('switch_model')[-1] == 0: # Gemetric wins
traces['mu'][i] = traces['mu'][i-1] # One step back for the sampler of the Poisson parameter
mu.value = traces['mu'][i-1]
elif mcmc.trace('switch_model')[-1] == 1: # Poisson wins
traces['p'][i] = traces['p'][i-1] # One step back for the sampler of the Geometric parameter
p.value = traces['p'][i-1]
print '\n\n'
traces=pd.DataFrame(traces)
traces['mu'][traces['switch_model'] == 0] = np.nan
traces['p'][traces['switch_model'] == 1] = np.nan
print traces.describe()
traces.plot()
plt.show()
The reason this is so slow is that Python's for loops are pretty slow, especially when they are compared to FORTRAN loops (Which is what PyMC is written in basically.) If you could show more detailed code, it might be easier to see what you are trying to do and to provide faster alternative algorithms.
Actually I found a 'crazy' solution, and I have the suspect to know why it works. I would still like to get an expert opinion on my trick.
Basically if I modify the for loop in the following way, adding a 'reset of the mcmc' every 1000 loops, the sampling fires up again:
for i in np.arange(1,n_samples):
mcmc.sample(1, progress_bar=False)
bar.update(i)
for var in mcmc.stochastics:
traces[str(var)][i] = mcmc.trace(var)[-1]
if mcmc.trace('switch_model')[-1] == 0: # Gemetric wins
traces['mu'][i] = traces['mu'][i-1] # One step back for the sampler of the Poisson parameter
mu.value = traces['mu'][i-1]
elif mcmc.trace('switch_model')[-1] == 1: # Poisson wins
traces['p'][i] = traces['p'][i-1] # One step back for the sampler of the Geometric parameter
p.value = traces['p'][i-1]
if i%1000 == 0:
mcmc = pm.MCMC(model)
In practice this trick erases the traces and the database of the sampler every 1000 steps. It looks like the sampler does not like having a long database, although I do not really understand why. (of course 1000 steps is arbitrary, too short it adds too much overhead, too long it will cause the traces and database to be too long).
I find this hack a bit crazy and definitely not elegant.. does any of the experts or developers have a comment on it? Thank you!

PyMC for Model Averaging

I am interested in applying PyMC to model averaging. My goal is to estimate many linear models and average estimates across them, weighting by their posterior model probabilities. I am currently using the Bayesian Information Criterion (BIC) to approximate the likelihood of my data (therefore, my analysis is not fully Bayesian). I have successfully simulated a Markov Chain of models using one of my own scripts but I want to use PyMC because it seems like a great tool.
In my attempts thus far, I have not been forming the Markov Chain correctly. I am not visiting models with higher posterior weights more often than others. I will include the example code below. Please also see the IPython notebook here! on github for the math markup and code together.
import numpy as np
from pymc import stochastic, DiscreteMetropolis, MCMC
import statsmodels.api as sm
import pandas as pd
import random
def pack(alist, rank):
binary = [str(1) if i in alist else str(0) for i in xrange(0,rank)]
string = '0b1'+''.join(binary)
return int(string, 2)
def unpack(integer):
string = bin(integer)[3:]
return [int(i) for i in xrange(len(string)) if string[i]=='1']
def make_bma():
# Simulating Data
size = 100
rank = 20
X = 10*np.random.randn(size, rank)
error = 30*np.random.randn(size,1)
coefficients = np.array([10, 2, 2, 2, 2, 2]).reshape((6,1))
y = np.dot(sm.add_constant(X[:,:5], prepend=True), coefficients) + error
# Number of allowable regressors
predictors = [3,4,5,6,7]
#stochastic(dtype=int)
def regression_model():
def logp(value):
columns = unpack(value)
x = sm.add_constant(X[:,columns], prepend=True)
corr = np.corrcoef(x[:,1:], rowvar=0)
prior = np.linalg.det(corr)
ols = sm.OLS(y,x).fit()
posterior = np.exp(-0.5*ols.bic)*prior
return np.log(posterior)
def random():
k = np.random.choice(predictors)
columns = sorted(np.random.choice(xrange(0,rank), size=k, replace=False))
return pack(columns, rank)
class ModelMetropolis(DiscreteMetropolis):
def __init__(self, stochastic):
DiscreteMetropolis.__init__(self, stochastic)
def propose(self):
'''considers a neighborhood around the previous model,
defined as having one regressor removed or added, provided
the total number of regressors coincides with predictors
'''
# Building set of neighboring models
last = unpack(self.stochastic.value)
last_indicator = np.zeros(rank)
last_indicator[last] = 1
last_indicator = last_indicator.reshape((-1,1))
neighbors = abs(np.diag(np.ones(rank)) - last_indicator)
neighbors = neighbors[:,np.any([neighbors.sum(axis=0) == i \
for i in predictors], axis=0)]
neighbors = pd.DataFrame(neighbors)
# Drawing one model at random from the neighborhood
draw = random.choice(xrange(neighbors.shape[1]))
self.stochastic.value = pack(list(neighbors[draw][neighbors[draw]==1].index), rank)
# def step(self):
#
# logp_p = self.stochastic.logp
#
# self.propose()
#
# logp = self.stochastic.logp
#
# if np.log(random.random()) > logp_p - logp:
#
# self.reject()
return locals()
if __name__ == '__main__':
model = make_bma()
M = MCMC(model)
M.use_step_method(model['ModelMetropolis'], model['regression_model'])
M.sample(iter=5000, burn=1000, thin=1)
model_chain = M.trace("regression_model")[:]
from collections import Counter
counts = Counter(model_chain).items()
counts.sort(reverse=True, key=lambda x: x[1])
for f in counts[:10]:
columns = unpack(f[0])
print('Visits:', f[1])
print(np.array([1. if i in columns else 0 for i in range(0,M.rank)]))
print(M.coefficients.flatten())
X = sm.add_constant(M.X[:, columns], prepend=True)
corr = np.corrcoef(X[:,1:], rowvar=0)
prior = np.linalg.det(corr)
fit = sm.OLS(model['y'],X).fit()
posterior = np.exp(-0.5*fit.bic)*prior
print(fit.params)
print('R-squared:', fit.rsquared)
print('BIC', fit.bic)
print('Prior', prior)
print('Posterior', posterior)
print(" ")
It sounds like you are trying to do something akin to reversible jump MCMC, where you are sampling from the model space in addition to the parameter space(s). PyMC does not currently do rjMCMC, though it probably ought to. The trick is to account for the change in dimension when moving among models. If you do have a modest number of models, you can use an indicator function to select from the models, all of which are fit simultaneously.

Resources