Implicit recommender: tuning hyperparameters in PySpark - performance

The computeMAPKR function takes the model, the actual data, and the validation data (user, product pairs) and generates ratings. It then sorts the predicted ratings for every user, takes the top K, and compares them with the actual data to calculate Mean Average Precision at K (MAPK).
I am using this function to tune the hyperparameters, i.e., fit multiple models and select the lambda, alpha, and rank with the highest MAPK. This works for small data sets, but when the matrix grows to 10 million users * 200 products it breaks, especially at the reduceByKey step and the joins. Is there a better way to tune the hyperparameters for implicit ALS? I am using Spark 1.3.
The actual RDD is of the form (user, product)
The valid RDD is of the form (user, product)
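For concreteness, inputs of this shape can be mocked like so (a hypothetical sketch, not from the original post; sc is an existing SparkContext):
actual = sc.parallelize([(1, 101), (1, 102), (2, 103)])
valid = sc.parallelize([(1, 101), (1, 105), (2, 103)])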
from operator import add

def apk(act_pred):
    predicted = act_pred[0]
    actual = act_pred[1]
    k = act_pred[2]
    if len(predicted) > k:
        predicted = predicted[:k]
    score = 0.0
    num_hits = 0.0
    for i, p in enumerate(predicted):
        if p in actual and p not in predicted[:i]:
            num_hits += 1.0
            score += num_hits / (i + 1.0)
    if not actual:
        return 1.0
    return score / min(len(actual), k)
def computeMAPKR(model, actual, valid, k):
    pred = model.predictAll(valid).map(lambda x: (x[0], [(x[1], x[2])])).cache()
    gp = pred.reduceByKey(lambda x, y: x + y)

    # for every user, sort the items by predicted ratings and get the item ids
    def f(x):
        s = sorted(x, key=lambda y: y[1], reverse=True)
        return [p for p, r in s]
    sp = gp.mapValues(f)

    # actual data
    ac = actual.map(lambda x: (x[0], [x[1]]))
    gac = ac.reduceByKey(lambda x, y: x + y)

    ap = sp.join(gac)
    apk_result = ap.map(lambda x: (x[0], (x[1][0], x[1][1], k))).mapValues(apk)
    mapk = apk_result.map(lambda x: x[1]).reduce(add) / ap.count()
    return mapk
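For reference, one way to bound the per-user state in the shuffle (a sketch of an alternative, not from the original post) is to keep only the top K predictions per user with aggregateByKey and a small heap, instead of materializing the full per-user prediction list:
import heapq

K = 10

def seq_op(heap, item):
    # item is a (product, rating) pair; keep only the K highest-rated
    heapq.heappush(heap, (item[1], item[0]))
    if len(heap) > K:
        heapq.heappop(heap)
    return heap

def comb_op(h1, h2):
    for v in h2:
        heapq.heappush(h1, v)
        if len(h1) > K:
            heapq.heappop(h1)
    return h1

topk = (model.predictAll(valid)
             .map(lambda x: (x[0], (x[1], x[2])))
             .aggregateByKey([], seq_op, comb_op)
             .mapValues(lambda h: [p for r, p in sorted(h, reverse=True)]))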

Related

Problem with dictionaries when using numba njit parallelization to accelerate the code

I have written some code and am trying to use numba to accelerate it. The main goal of the code is to group some values based on a condition; iter_ is used for converging the code to satisfy the condition. I prepared a small case below to reproduce the sample code:
import numpy as np
import numba as nb

rng = np.random.default_rng(85)

# --------------------------------------- small data volume ---------------------------------------
# values_ = {'R0': np.array([0.01090976, 0.01069902, 0.00724112, 0.0068463 , 0.01135723, 0.00990762,
#                            0.01090976, 0.01069902, 0.00724112, 0.0068463 , 0.01135723]),
#            'R1': np.array([0.01836379, 0.01900166, 0.01864162, 0.0182823 , 0.01840322, 0.01653088,
#                            0.01900166, 0.01864162, 0.0182823 , 0.01840322, 0.01653088]),
#            'R2': np.array([0.02430913, 0.02239156, 0.02225379, 0.02093393, 0.02408692, 0.02110411,
#                            0.02239156, 0.02225379, 0.02093393, 0.02408692, 0.02110411])}
#
# params = {'R0': [3, 0.9490579204466154, 1825, 7.070272000000002e-05],
#           'R1': [0, 0.9729203826820172, 167 , 7.070272000000002e-05],
#           'R2': [1, 0.6031363088057902, 1316, 8.007296000000003e-05]}
#
# Sno, dec_, upd_ = 2, 100, 200
# -------------------------------------------------------------------------------------------------
# ----------------------------- UPDATED (medium and large data volumes) ---------------------------
# values_ = np.load("values_med.npy", allow_pickle=True)[()]
# params = np.load("params_med.npy", allow_pickle=True)[()]
values_ = np.load("values_large.npy", allow_pickle=True)[()]
params = np.load("params_large.npy", allow_pickle=True)[()]
Sno, dec_, upd_ = 2000, 1000, 200
# -------------------------------------------------------------------------------------------------

# values_ = [*values_.values()]
# params = [*params.values()]

# @nb.jit(forceobj=True)
# def test(values_, params, Sno, dec_, upd_):
final_dict = {}
for i, j in enumerate(values_.keys()):
    Rand_vals = []
    goal_sum = params[j][1] * params[j][3]
    tel = goal_sum / dec_ * 10
    if params[j][0] != 0:
        for k in range(Sno):
            final_sum = 0.0
            iter_ = 0
            t = 1
            while not np.allclose(goal_sum, final_sum, atol=tel):
                iter_ += 1
                vals_group = rng.choice(values_[j], size=params[j][0], replace=False)
                # final_sum = 0.0016 * np.sum(vals_group)  # -----> for small data volume
                final_sum = np.sum(vals_group ** 3)        # -----> UPDATED for med or large data volume
                if iter_ == upd_:
                    t += 1
                    tel = t * tel
            values_[j] = np.delete(values_[j], np.where(np.in1d(values_[j], vals_group)))
            Rand_vals.append(vals_group)
    else:
        Rand_vals = [np.array([])] * Sno
    final_dict["R" + str(i)] = Rand_vals
# return final_dict
# test(values_, params, Sno, dec_, upd_)
At first, @nb.jit was applied to this code (forceobj=True is used to avoid warnings and …), which has an adverse effect on performance. nopython mode was checked too, with @nb.njit, which raises the following error because the dictionary type of the inputs is not supported (as mentioned in 1, 2):
cannot determine Numba type of <class 'dict'>
I don't know if (or how) this could be handled with Dict from numba.typed (by converting the created Python dictionaries to numba Dict), or whether converting the dictionaries to lists of arrays has any advantage. I think parallelization may be possible if some code lines, e.g. Rand_vals.append(vals_group) or the else section or …, were moved out of the function or modified, while getting the same results as before, but I don't have any idea how to do so.
I will be grateful for any help utilizing numba on this code. numba parallelization would be the most desired solution (probably the best applicable method in terms of performance) if it is possible.
Data:
medium data volume: values_med, params_med
large data volume: values_large, params_large
This code can be converted to Numba, but it is not straightforward.
First of all, the dictionary and list types must be defined, since Numba njit functions cannot directly operate on reflected lists (aka pure-Python lists). This is a bit tedious to do in Numba, and the resulting code is a bit verbose:
String = nb.types.unicode_type
ValueArray = nb.float64[::1]
ValueDict = nb.types.DictType(String, ValueArray)
ParamDictValue = nb.types.Tuple([nb.int_, nb.float64, nb.int_, nb.float64])
ParamDict = nb.types.DictType(String, ParamDictValue)
FinalDictValue = nb.types.ListType(ValueArray)
FinalDict = nb.types.DictType(String, FinalDictValue)
Then you need to convert the input dictionaries:
nbValues = nb.typed.typeddict.Dict.empty(String, ValueArray)
for key, value in values_.items():
    nbValues[key] = value.copy()

nbParams = nb.typed.typeddict.Dict.empty(String, ParamDictValue)
for key, value in params.items():
    nbParams[key] = (nb.int_(value[0]), nb.float64(value[1]), nb.int_(value[2]), nb.float64(value[3]))
Then, you need to write the core function. np.allclose and np.isin are not implemented in Numba, so they must be reimplemented manually. But the main point is that Numba does not support the rng Numpy object, and I think it will not support it any time soon. Note that Numba has a random number implementation that tries to mimic the behavior of Numpy, but the management of the seed is a bit different. Note also that results should be the same as with the np.random.xxx Numpy functions if the seed is set to the same value (Numpy and Numba have different seed variables that are not synchronized).
@nb.njit(FinalDict(ValueDict, ParamDict, nb.int_, nb.int_, nb.int_))
def nbTest(values_, params, Sno, dec_, upd_):
    final_dict = nb.typed.Dict.empty(String, FinalDictValue)

    for i, j in enumerate(values_.keys()):
        Rand_vals = nb.typed.List.empty_list(ValueArray)
        goal_sum = params[j][1] * params[j][3]
        tel = goal_sum / dec_ * 10
        if params[j][0] != 0:
            for k in range(Sno):
                final_sum = 0.0
                iter_ = 0
                t = 1
                vals_group = np.empty(0, dtype=nb.float64)

                while np.abs(goal_sum - final_sum) > (1e-05 * np.abs(final_sum) + tel):
                    iter_ += 1
                    vals_group = np.random.choice(values_[j], size=params[j][0], replace=False)
                    # final_sum = 0.0016 * np.sum(vals_group)  # (for small data volume)
                    final_sum = np.sum(vals_group ** 3)        # (for med or large data volume)
                    if iter_ == upd_:
                        t += 1
                        tel = t * tel

                # Perform an in-place deletion
                vals, gr = values_[j], vals_group
                cur = 0
                for l in range(vals.size):
                    found = False
                    for m in range(gr.size):
                        found |= vals[l] == gr[m]
                    if not found:
                        # Keep the value (delete it otherwise)
                        vals[cur] = vals[l]
                        cur += 1
                values_[j] = vals[:cur]

                Rand_vals.append(vals_group)
        else:
            for k in range(Sno):
                Rand_vals.append(np.empty(0, dtype=nb.float64))
        final_dict["R" + str(i)] = Rand_vals

    return final_dict
Note that the replacement implementation of np.isin is quite naive but it works pretty well in practice on your input example.
The function can be called in the following way:
nbFinalDict = nbTest(nbValues, nbParams, Sno, dec_, upd_)
Finally, the dictionary should be converted back to basic Python objects:
finalDict = dict()
for key, value in nbFinalDict.items():
    finalDict[key] = list(value)
This implementation is fast for small inputs but not for large ones, since np.random.choice takes almost all the time (>96%). The thing is that this function is clearly not optimal when the number of requested items is small (which is your case). Indeed, it surprisingly runs in time linear in the size of the input array rather than linear in the number of requested items.
Further Optimizations
The algorithm can be completely rewritten to extract only 12 random items and discard them from the main current array in a much more efficient way. The idea is to swap n items (the small target sample) at the end of the array with other items at random locations, then check the sum, repeat this process until a condition is fulfilled, and finally extract the view to the last n items before resizing the view so as to discard them. All of this can be done in O(n) time rather than O(m) time, where m is the size of the main current array and n << m (e.g. 12 vs 20_000). It can also be computed without any expensive allocation. Here is the resulting code:
@nb.njit(nb.void(ValueArray, nb.int_, nb.int_))
def swap(arr, i, j):
    arr[i], arr[j] = arr[j], arr[i]

@nb.njit(FinalDict(ValueDict, ParamDict, nb.int_, nb.int_, nb.int_))
def nbTest(values_, params, Sno, dec_, upd_):
    final_dict = nb.typed.Dict.empty(String, FinalDictValue)

    for i, j in enumerate(values_.keys()):
        Rand_vals = nb.typed.List.empty_list(ValueArray)
        goal_sum = params[j][1] * params[j][3]
        tel = goal_sum / dec_ * 10
        values = values_[j]
        n = params[j][0]

        if n != 0:
            for k in range(Sno):
                final_sum = 0.0
                iter_ = 0
                t = 1

                m = values.size
                assert n <= m
                group = values[-n:]

                while np.abs(goal_sum - final_sum) > (1e-05 * np.abs(final_sum) + tel):
                    iter_ += 1

                    # Swap the group view with other random items
                    for pos in range(m - n, m):
                        swap(values, pos, np.random.randint(0, m))

                    # final_sum = 0.0016 * np.sum(group)  # (for small data volume)
                    final_sum = 0.0                       # (for med/large data volume)
                    for v in group:
                        final_sum += v ** 3

                    if iter_ == upd_:
                        t += 1
                        tel *= t

                assert iter_ > 0
                values = values[:m-n]
                Rand_vals.append(group)
        else:
            for k in range(Sno):
                Rand_vals.append(np.empty(0, dtype=nb.float64))

        final_dict["R" + str(i)] = Rand_vals

    return final_dict
In addition to being faster, this implementation has the benefit of being simpler. Results look quite similar to those of the previous implementation, although the randomness makes checking the results tricky (especially since this function does not use the same method to choose the random sample). Note that, as opposed to the previous one, this implementation does not remove the items of values that are in group (this is probably not wanted, though).
Benchmark
Here are the results of the last implementation on my machine (compilation and conversion timings excluded):
Provided small input (embedded in the question):
- Initial code: 42.71 ms
- Numba code: 0.11 ms
Medium input:
- Initial code: 3481 ms
- Numba code: 11 ms
Large input:
- Initial code: 6728 ms
- Numba code: 20 ms
Note that the conversion takes about the same time as the computation itself.
This last implementation is 316~388 times faster than the initial code on these inputs.
Notes
Note that the compilation takes a few seconds due to the dict and list types.
Note that while it may be possible to parallelize the implementation, only the most encompassing loop can be parallelized. The thing is that there are only a few items to compute and the time is already quite small (not the best case for multi-threading). Additionally, lists/dicts cannot be written from multiple threads safely, so one needs to use Numpy arrays throughout the function to be able to do that (or add additional conversions that are already expensive). Moreover, Numba parallelism tends to increase the compilation time significantly, which is already substantial. Finally, the result will be less deterministic, since each Numba thread has its own random number generator seed and the items computed by each thread cannot be predicted with prange (it depends on the parallel runtime chosen on the target platform). Note that in Numpy there is one global seed used by the usual random functions by default (the deprecated way), while RNG objects have their own seed (the new preferred way).
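To illustrate the separate-seed point (a minimal sketch; the value 85 mirrors the question's seed and is otherwise arbitrary): calling np.random.seed inside an njit function seeds Numba's internal generator, which is independent of Numpy's global one.
@nb.njit
def nb_seed(s):
    np.random.seed(s)  # seeds Numba's internal generator (used inside njit code)

np.random.seed(85)     # seeds Numpy's global generator (used outside njit code)
nb_seed(85)            # both generators now start from the same seed value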

Tune a learner with the searchspace parameter setting

I am trying to tune a ranger learner with the searchspace parameter setting. The purpose is to find the optimal K (the number of input indicators; I used a filter pipe, setting importance.filter.nfeat) and D (the depth of each tree, i.e., classif.ranger.max.depth) by grid search. D's value should not be greater than the number of input indicators K. The values searched for D are then set proportionally to the input K: D ∈ {10%, 25%, 50%, 100%} ∗ K. Values of D ≤ 0 were rejected.
However, I am unfamiliar with writing function code within searchspace, so I cannot achieve this (D ends up greater than K).
My question is:
How to set a parameter that is based on another one in the searchspace? (I think this is different from the depends mechanism mentioned in the mlr3 book.)
Here is my code:
ranger = lrn("classif.ranger", importance = "impurity", predict_type = "prob", id = "ranger")
graph = po("filter", flt("importance"), filter.nfeat = 3) %>>% ranger %>>% po("threshold")
plot(graph)
graph_learner = GraphLearner$new(graph)
searchspace = ps(
  importance.filter.nfeat = p_int(1, length(task$feature_names)),
  classif.ranger.max.depth = p_int(1, length(task$feature_names)),
  .extra_trafo = function(x, param_set) {
    x = graph_learner$param_set$importance.filter.nfeat * c(.1, .25, .50, 1)
  }
)
inst1 = TuningInstanceMultiCrit$new(
  task,
  learner = graph_learner,
  resampling = rsmp("cv"),
  measures = msrs(c("classif.ce", "classif.bacc", "classif.mcc")),
  terminator = trm("evals", n_evals = 50),
  search_space = searchspace
)
tuner = tnr("grid_search")
# reduce logging output
lgr::get_logger("bbotk")$set_threshold("warn")
# The tuning procedure may take some time:
set.seed(1234)
tuner$optimize(inst1)
# Returns a list with optimal configurations and estimated performance.
inst1$result
# We can plot the performance against the number of features.
# If we do so, we see the possible trade-off between sparsity and predictive performance:
arx = as.data.table(inst1$archive)
ggplot(arx, aes(x = importance.filter.nfeat, y = classif.ce)) + geom_line()
How to know which indicators are used in the tuned model? We only see the trade-off between sparsity and predictive performance; are the indicators chosen based on the importance rank?
I have also tried feature selection. In FS, I could get the optimal feature set. So what is the relationship between tuning nfeat and feature selection? Which one is preferred in real practice?
# https://mlr3gallery.mlr-org.com/posts/2020-09-14-mlr3fselect-basic/
resampling = rsmp("cv")
measure = msr("classif.mcc")
terminator = trm("none")
ranger_lrn = lrn("classif.ranger", importance = "impurity", predict_type = "prob")
#
instance = FSelectInstanceSingleCrit$new(
  task = task,
  learner = ranger_lrn,
  resampling = resampling,
  measure = measure,
  terminator = terminator,
  store_models = TRUE)
#
fselector = fs("rfe", recursive = FALSE)
set.seed(1234)
fselector$optimize(instance)
#
as.data.table(instance$archive)
instance$result
instance$result_feature_set
instance$result_y
# set new feature_set
# task$select(instance$result_feature_set)
Does this answer question 1?
How to set specific values in `paradox`?
Seems that you could simply set up your own data table as shown there, except remove rows where D>K, then use the design_points tuner.

How does the custom loss function get calculated in a binary classification problem?

I created a custom loss function for a binary (0/1) classification problem in h2o via Python, as shown below. The idea is to minimize total cost based on true positives, true negatives, false positives, and false negatives. Here are the questions that I hope to get an answer on:
What is used to calculate the custom loss function? I used the output of the confusion matrix of the training and validation data and manually calculated the value of the loss function, but it doesn't match the output. Note that by default the confusion matrix is generated at the max F1 threshold.
Per the h2o documentation, custom_metric_func can be used in GLM, DRF, and GBM. However, it doesn't work in GLM (the loss function value defaults to 0), although it works perfectly in GBM. Any idea why that's the case?
Custom loss function:
class CustomLossFunc:
    def map(self, predicted, actual, weight, offset, model):
        import math
        cost_tp = -9
        cost_tn = 0
        cost_fp = 1
        cost_fn = 10
        y = actual[0]
        y_pred = predicted[0]  # [class, p0, p1]
        if (y == 0) and (y_pred == 0):
            total_cost = cost_tn
        elif (y == 0) and (y_pred == 1):
            total_cost = cost_fp
        elif (y == 1) and (y_pred == 1):
            total_cost = cost_tp
        else:
            total_cost = cost_fn
        return [total_cost, 1]

    def reduce(self, left, right):
        return [left[0] + right[0], left[1] + right[1]]

    def metric(self, last):
        return last[0]
The loss function is uploaded using h2o.upload_custom_metric(); then I run GLM and GBM for comparison.
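That upload step might look like this (a sketch; the func_name and func_file values are illustrative, not from the original post):
import h2o

# Upload the metric class to the cluster; the returned reference is what
# gets passed as custom_metric_func to the estimators below.
cost_loss_func = h2o.upload_custom_metric(CustomLossFunc,
                                          func_name="total_cost",
                                          func_file="custom_cost.py")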
# GLM
glm_fit_cost = H2OGeneralizedLinearEstimator(family='binomial',
                                             model_id='glm_fit_cost',
                                             # standardize=True,
                                             custom_metric_func=cost_loss_func)
glm_fit_cost.train(x=x_co,
                   y=y_co,
                   training_frame=train_co_h2o,
                   validation_frame=valid_co_h2o)

# GBM
gbm_mod = H2OGradientBoostingEstimator(model_id="gbm_mod",
                                       custom_metric_func=cost_loss_func)
gbm_mod.train(y=y_co,
              x=x_co,
              training_frame=train_co_h2o,
              validation_frame=valid_co_h2o)
I have tried the following:
Debug via manual calculation using the confusion matrix of the training set and the validation set, and compare to the output of the loss function (see the check after the references below)
Reference for the example used to create my own loss function:
Resource1
Resource2
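For that manual check, the expected metric value from confusion-matrix counts reduces to a weighted sum (a sketch; it assumes tp, tn, fp, fn are counts taken at the same threshold at which the metric was evaluated, since metric() above returns the unnormalized sum of per-row costs):
def expected_cost(tp, tn, fp, fn):
    # same per-observation costs as in CustomLossFunc.map
    return -9 * tp + 0 * tn + 1 * fp + 10 * fn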

Shuffling rows of large pandas DataFrame and correlation with a series

I need to independently shuffle each row of large pandas DataFrames several times (the typical shape is (10000, 1000)) and then estimate the correlation of each row with a given series.
The most efficient (i.e., quickest) way I found to do it while staying within pandas is the following:
for i in range(N):  # the larger N is, the better
    df_sh = df.apply(numpy.random.permutation, axis=1)
    # df is my large dataframe, with 10K rows and 1K columns
    corr = df_sh.corrwith(s, axis=1)
    # s is the provided series (shape of s = (1000,))
The two tasks take approximately the same amount of time (namely 30 s each). I tried converting my dataframe into a numpy.array, performing a for loop over the array and, for each line, first performing the permutation and then measuring the correlation with scipy.stats.pearsonr. Unfortunately, I managed to speed up my two tasks only by a factor of 2.
Are there other viable options to speed up the tasks even more? (NB: I am already parallelizing the execution of my code with Joblib, up to the maximum factor allowed by the machine I am using.)
Correlation between a 2D matrix/array and a 1D array/vector:
We can adapt corr2_coeff_rowwise for correlation between a 2D array/matrix and a 1D array/vector, like so -
def corr2_coeff_2d_1d(A, B):
    # Rowwise mean of input arrays & subtract from input arrays themselves
    A_mA = A - A.mean(1, keepdims=True)
    B_mB = B - B.mean()

    # Sum of squares across rows
    ssA = np.einsum('ij,ij->i', A_mA, A_mA)
    ssB = B_mB.dot(B_mB)

    # Finally get corr coeff
    return A_mA.dot(B_mB) / np.sqrt(ssA * ssB)
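As a quick sanity check (not part of the original answer), the row-wise result can be compared against np.corrcoef on small random data:
import numpy as np

A = np.random.rand(5, 8)
b = np.random.rand(8)
# Per-row Pearson correlation computed the slow, obvious way
expected = np.array([np.corrcoef(row, b)[0, 1] for row in A])
assert np.allclose(corr2_coeff_2d_1d(A, b), expected)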
To shuffle each row and do this for all rows, we could make use of np.random.shuffle. However, that shuffle function works along the first axis, so we would need to feed in the transposed version; also, the shuffling would be done in-place, so if the original dataframe is needed elsewhere, a copy should be made before processing. Instead, we can shuffle per-row indices and apply them with advanced indexing. Hence, let's use this to solve our case -
# Extract underlying array data for faster NumPy processing in the loop later on
a = df.values
s_ar = s.values

# Setup array for row-indexing with NumPy's advanced indexing later on
r = np.arange(a.shape[0])[:,None]

for i in range(N):
    # Get shuffled indices per row with the `rand+argsort/argpartition` trick from -
    # https://stackoverflow.com/a/45438143/
    idx = np.random.rand(*a.shape).argsort(1)

    # Shuffle array data with NumPy's advanced indexing
    shuffled_a = a[r, idx]

    # Compute correlation
    corr = corr2_coeff_2d_1d(shuffled_a, s_ar)
Optimized version #1
Now, we can pre-compute the parts involving the series, which stay the same between iterations. Hence, a further optimized version would look like this -
a = df.values
s_ar = s.values
r = np.arange(a.shape[0])[:,None]

B = s_ar
B_mB = B - B.mean()
ssB = B_mB.dot(B_mB)

A = a
A_mean = A.mean(1, keepdims=True)

for i in range(N):
    # Get shuffled indices per row with the `rand+argsort/argpartition` trick from -
    # https://stackoverflow.com/a/45438143/
    idx = np.random.rand(*a.shape).argsort(1)

    # Shuffle array data with NumPy's advanced indexing
    shuffled_a = a[r, idx]

    # Compute correlation
    A = shuffled_a
    A_mA = A - A_mean
    ssA = np.einsum('ij,ij->i', A_mA, A_mA)
    corr = A_mA.dot(B_mB) / np.sqrt(ssA * ssB)
Benchmarking
Setup inputs with actual use-case shapes/sizes
In [302]: df = pd.DataFrame(np.random.rand(10000,1000))
In [303]: s = pd.Series(df.iloc[0])
1. Original method
In [304]: %%timeit
...: df_sh = df.apply(np.random.permutation, axis=1)
...: corr = df_sh.corrwith(s, axis = 1)
1 loop, best of 3: 1.99 s per loop
2. Proposed method
The pre-processing part (done only once before starting the loop, so not included in the timings) -
In [305]: a = df.values
...: s_ar = s.values
...: r = np.arange(a.shape[0])[:,None]
...:
...: B = s_ar
...: B_mB = B - B.mean()
...: ssB = B_mB.dot(B_mB)
...:
...: A = a
...: A_mean = A.mean(1,keepdims=1)
Part of the proposed solution that runs in the loop -
In [306]: %%timeit
...: idx = np.random.rand(*a.shape).argsort(1)
...: shuffled_a = a[r, idx]
...:
...: A = shuffled_a
...: A_mA = A - A_mean
...: ssA = np.einsum('ij,ij->i',A_mA,A_mA)
...: corr = A_mA.dot(B_mB)/np.sqrt(ssA*ssB)
1 loop, best of 3: 675 ms per loop
Thus, we are seeing a speedup of around 3x here!

PyMC for Model Averaging

I am interested in applying PyMC to model averaging. My goal is to estimate many linear models and average estimates across them, weighting by their posterior model probabilities. I am currently using the Bayesian Information Criterion (BIC) to approximate the likelihood of my data (therefore, my analysis is not fully Bayesian). I have successfully simulated a Markov chain of models using one of my own scripts, but I want to use PyMC because it seems like a great tool.
In my attempts thus far, I have not been forming the Markov chain correctly: I am not visiting models with higher posterior weights more often than others. I will include the example code below. Please also see the IPython notebook on GitHub for the math markup and code together.
import numpy as np
from pymc import stochastic, DiscreteMetropolis, MCMC
import statsmodels.api as sm
import pandas as pd
import random

def pack(alist, rank):
    binary = [str(1) if i in alist else str(0) for i in xrange(0, rank)]
    string = '0b1' + ''.join(binary)
    return int(string, 2)

def unpack(integer):
    string = bin(integer)[3:]
    return [int(i) for i in xrange(len(string)) if string[i] == '1']

def make_bma():
    # Simulating Data
    size = 100
    rank = 20
    X = 10 * np.random.randn(size, rank)
    error = 30 * np.random.randn(size, 1)
    coefficients = np.array([10, 2, 2, 2, 2, 2]).reshape((6, 1))
    y = np.dot(sm.add_constant(X[:, :5], prepend=True), coefficients) + error

    # Number of allowable regressors
    predictors = [3, 4, 5, 6, 7]

    @stochastic(dtype=int)
    def regression_model():
        def logp(value):
            columns = unpack(value)
            x = sm.add_constant(X[:, columns], prepend=True)
            corr = np.corrcoef(x[:, 1:], rowvar=0)
            prior = np.linalg.det(corr)
            ols = sm.OLS(y, x).fit()
            posterior = np.exp(-0.5 * ols.bic) * prior
            return np.log(posterior)

        def random():
            k = np.random.choice(predictors)
            columns = sorted(np.random.choice(xrange(0, rank), size=k, replace=False))
            return pack(columns, rank)

    class ModelMetropolis(DiscreteMetropolis):
        def __init__(self, stochastic):
            DiscreteMetropolis.__init__(self, stochastic)

        def propose(self):
            '''considers a neighborhood around the previous model,
            defined as having one regressor removed or added, provided
            the total number of regressors coincides with predictors
            '''
            # Building set of neighboring models
            last = unpack(self.stochastic.value)
            last_indicator = np.zeros(rank)
            last_indicator[last] = 1
            last_indicator = last_indicator.reshape((-1, 1))
            neighbors = abs(np.diag(np.ones(rank)) - last_indicator)
            neighbors = neighbors[:, np.any([neighbors.sum(axis=0) == i
                                             for i in predictors], axis=0)]
            neighbors = pd.DataFrame(neighbors)

            # Drawing one model at random from the neighborhood
            draw = random.choice(xrange(neighbors.shape[1]))
            self.stochastic.value = pack(list(neighbors[draw][neighbors[draw] == 1].index), rank)

        # def step(self):
        #     logp_p = self.stochastic.logp
        #     self.propose()
        #     logp = self.stochastic.logp
        #     if np.log(random.random()) > logp_p - logp:
        #         self.reject()

    return locals()

if __name__ == '__main__':
    model = make_bma()
    M = MCMC(model)
    M.use_step_method(model['ModelMetropolis'], model['regression_model'])
    M.sample(iter=5000, burn=1000, thin=1)

    model_chain = M.trace("regression_model")[:]

    from collections import Counter
    counts = Counter(model_chain).items()
    counts.sort(reverse=True, key=lambda x: x[1])

    for f in counts[:10]:
        columns = unpack(f[0])
        print('Visits:', f[1])
        print(np.array([1. if i in columns else 0 for i in range(0, M.rank)]))
        print(M.coefficients.flatten())
        X = sm.add_constant(M.X[:, columns], prepend=True)
        corr = np.corrcoef(X[:, 1:], rowvar=0)
        prior = np.linalg.det(corr)
        fit = sm.OLS(model['y'], X).fit()
        posterior = np.exp(-0.5 * fit.bic) * prior
        print(fit.params)
        print('R-squared:', fit.rsquared)
        print('BIC', fit.bic)
        print('Prior', prior)
        print('Posterior', posterior)
        print(" ")
It sounds like you are trying to do something akin to reversible jump MCMC, where you are sampling from the model space in addition to the parameter space(s). PyMC does not currently do rjMCMC, though it probably ought to. The trick is to account for the change in dimension when moving among models. If you do have a modest number of models, you can use an indicator function to select from the models, all of which are fit simultaneously.
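For the indicator-function idea, a minimal PyMC 2 sketch (a hypothetical example, not the question's model: a Bernoulli indicator switches between two candidate mean structures that are both fit simultaneously) could look like:
import numpy as np
import pymc

x = np.random.randn(100)
y_obs = 2.0 * x + np.random.randn(100)

ind = pymc.Bernoulli('ind', p=0.5)       # model indicator
b = pymc.Normal('b', mu=0.0, tau=1e-3)   # slope used by model 1

@pymc.deterministic
def mu(ind=ind, b=b):
    # Model 0: zero mean; Model 1: linear in x
    return b * x if ind else np.zeros_like(x)

y = pymc.Normal('y', mu=mu, tau=1.0, value=y_obs, observed=True)

M = pymc.MCMC([ind, b, mu, y])
M.sample(iter=5000, burn=1000)
# The posterior mean of ind estimates the probability of model 1.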
