PyMC: summary statistics for a subset of variables

I'm using PyMC 2.3.4 and am trying to get summary statistics for a subset of my model variables, but can't seem to do so using the method in the docs.
Model-building code:
import pymc
a = pymc.Normal('a',0,1)
b = pymc.Normal('b',0,1)
myModel = pymc.Model((a,b))
M = pymc.MCMC(myModel)
M.sample(1000)
Per the documentation at https://pymc-devs.github.io/pymc/database.html, I should be able to run
M.a.summary() -> summary statistics for a
but instead, I get
AttributeError: 'MCMC' object has no attribute 'a'
However, M.summary() gives summary statistics for all variables.

There are perhaps too many ways to create a model in PyMC2. The one you used, passing an iterable of pymc.Node instances, does not record the names, so the model doesn't have an M.a, even though M.nodes contains a stochastic named 'a'.
If you prefer to create your model this way, you can get a summary from a directly, with
a.summary()
For me, this prints
a:
Mean SD MC Error 95% HPD interval
------------------------------------------------------------------
[[-0.016]] [[ 0.992]] [[ 0.031]] [-1.986 1.939]
Posterior quantiles:
2.5 25 50 75 97.5
|---------------|===============|===============|---------------|
[[-2.047]] [[-0.665]] [[-0.058]] [[ 0.672]] [[ 1.937]]
I find it convenient to have the attribute M.a available sometimes, and you can get it by using a dictionary instead of a list when constructing the model:
M2 = pymc.MCMC({'a':a, 'b':b})
M2.sample(1000)
M2.a.summary()
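If you've already built the sampler from a plain iterable, you can also look a node up by its name; PyMC2 stochastics carry their name in __name__ (a minimal sketch, untested):
for node in M.stochastics:
    if node.__name__ == 'a':
        node.summary()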

Related

multiprocessing a geopandas.overlay() throws no error but seemingly never completes

I'm trying to run geopandas.overlay() under multiprocessing to speed it up.
I have used custom functions and functools.partial to pre-fill the fixed function inputs and then map the iterative component over the pool, producing a series of dataframes that I then concat into one.
import functools
from multiprocessing import Pool, cpu_count
import pandas as pd

def taska(id, points, crs):
    # make_break_points is a custom helper defined elsewhere in the script
    return make_break_points(points[points.ID == id].reset_index(drop=True), crs)

points_gdf = ...  # GeoDataFrame of points with an ID field
grid_gdf = ...    # GeoDataFrame polygon grid

partialA = functools.partial(taska, points=points_gdf, crs=grid_gdf.crs)
partialA_results = []
with Pool(cpu_count() - 4) as pool:
    for results in pool.map(partialA, list(points_gdf.ID.unique())):
        partialA_results.append(results)
bpts_gdf = pd.concat(partialA_results)
In the example above, I use the list of unique values to subset the df, pass each subset to a processor to run the function, and return the results. In the end, all the results are combined using pd.concat.
When I apply the same approach to a list of dataframes created using numpy.array_split(), the process starts with a number of processors; then they all close and everything hangs, with no indication that work is being done or that it will ever exit.
import numpy as np
import geopandas as gpd

def taskc(tracks, grid):
    return gpd.overlay(tracks, grid, how='union').explode().reset_index(drop=True)

tracks_gdf = ...  # GeoDataFrame of points with an ID field
dfs = np.array_split(tracks_gdf, cpu_count() - 4)
grid_gdf = ...    # GeoDataFrame polygon grid

partialC_results = []
partialC = functools.partial(taskc, grid=grid_gdf)
with Pool(cpu_count() - 4) as pool:
    for results in pool.map(partialC, dfs):
        partialC_results.append(results)
results_df = pd.concat(partialC_results)
I tried using with get_context('spawn').Pool(cpu_count() - 4) as pool:, based on the information at https://pythonspeed.com/articles/python-multiprocessing/, with no change in behavior.
Additionally, if I simply run geopandas.overlay(tracks_gdf, grid_gdf) the process is successful and the script carries on to the end with expected results.
Why does the partial function approach work on a list of items but not a list of dataframes?
Is the numpy.array_split() not an iterable object like a list?
How can I pass a single df into geopandas.overlay() in chunks to utilize multiprocessing capabilities and get back a single dataframe or a series of dataframes to concat?
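For what it's worth, a quick standalone check suggests numpy.array_split() does return a plain list of DataFrames, so the iterable type itself may not be the problem (a minimal sketch):
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': range(10)})
chunks = np.array_split(df, 3)
print(type(chunks))     # <class 'list'>
print(type(chunks[0]))  # <class 'pandas.core.frame.DataFrame'>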
This is my workaround, but I am also interested in whether there is a better way to perform this and similar tasks. Essentially, I modified the function so the df split happens inside the worker, and I pass a list of index values created from range() as the iterable.
from multiprocessing import get_context

def taskc(num, tracks, grid):
    # split inside the worker and pick the chunk for this index
    return gpd.overlay(np.array_split(tracks, cpu_count() - 4)[num], grid,
                       how='union').explode().reset_index(drop=True)

partialC = functools.partial(taskc, tracks=tracks_gdf, grid=grid_gdf)
dfrange = list(range(0, cpu_count() - 4))
partialC_results = []
with get_context('spawn').Pool(cpu_count() - 4) as pool:
    for results in pool.map(partialC, dfrange):
        partialC_results.append(results)
results_gdf = pd.concat(partialC_results)

Error: requires numeric/complex matrix/vector arguments for %*%; cross validating glmmTMB model

I am adapting some k-fold cross-validation code written for glmer/merMod models to a glmmTMB model framework. All seems well until I try to use the output from the model(s) fit with training data to predict and exponentiate values into a matrix (to then break into quantiles/number of bins to assess predictive performance). I can get this line to work using glmer models, but when I run the same model using glmmTMB I get: Error in model.matrix: requires numeric/complex matrix/vector arguments
There are many other posts out there discussing this error code, and I have tried converting the data frame into matrix form and changing the class of the covariates, with no luck. Separately running the parts before and after the %*% works, but when combined I get the error. For context, this code is intended to be run with use/availability data, so the example variables may not make sense, but the problem gets shown well enough. Any suggestions as to what is going on?
library(lme4)
library(glmmTMB)
# Example with mtcars dataset
data(mtcars)
# Model both with glmmTMB and lme4
m1 <- glmmTMB(am ~ mpg + wt + (1|carb), family = poisson, data=mtcars)
m2 <- glmer(am ~ mpg + wt + (1|carb), family = poisson, data=mtcars)
#--- K-fold code (hashed out sections are original glmer version of code where different)---
# define variables
k <- 5
mod <- m1 #m2
dt <- model.frame(mod) #data used
reg.list <- list() # initialize object to store all models used for cross validation
# finds the name of the response variable in the model dataframe
resp <- as.character(attr(terms(mod), "variables"))[attr(terms(mod), "response") + 1]
# define column called sets and populates it with character "train"
dt$sets <- "train"
# randomly selects a proportion of the "used"/am records (i.e. am = 1) for testing data
dt$sets[sample(which(dt[, resp] == 1), sum(dt[, resp] == 1)/k)] <- "test"
# updates the original model using only the subset of "trained" data
reg <- glmmTMB(formula(mod), data = subset(dt, sets == "train"), family=poisson,
control = glmmTMBControl(optimizer = optim, optArgs=list(method="BFGS")))
#reg <- glmer(formula(mod), data = subset(dt, sets == "train"), family=poisson,
# control = glmerControl(optimizer = "bobyqa", optCtrl=list(maxfun=2e5)))
reg.list[[i]] <- reg # store models (in the full k-fold code this presumably sits inside a loop over i in 1:k)
# uses new model created with training data (i.e. reg) to predict and exponentiate values
predall <- exp(as.numeric(model.matrix(terms(reg), dt) %*% glmmTMB::fixef(reg)))
#predall <- exp(as.numeric(model.matrix(terms(reg), dt) %*% lme4::fixef(reg)))
Without looking at the code too carefully: glmmTMB::fixef(reg) returns a list (with elements cond (conditional model parameters), zi (zero-inflation parameters), and disp (dispersion parameters)) rather than a vector.
If you replace this bit with glmmTMB::fixef(reg)[["cond"]] it will probably work.
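Applied to the prediction line above, that would be (an untested sketch; only the conditional fixed effects go into the matrix multiplication):
predall <- exp(as.numeric(model.matrix(terms(reg), dt) %*% glmmTMB::fixef(reg)[["cond"]]))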

Is there a way to infer topic distributions on unseen document from gensim LDA pre-trained model using matrix multiplication?

Is there a way to get the topic distribution of an unseen document using a pretrained LDA model without using the LDA_Model[unseenDoc] syntax? I am trying to use my LDA model in a web application, and if there were a way to use matrix multiplication to get a similar result, then I could use the model in JavaScript.
For example, I tried the following:
import numpy as np
import gensim
from gensim.corpora import Dictionary
from gensim import models
import nltk
from nltk.stem import WordNetLemmatizer, SnowballStemmer
nltk.download('wordnet')
def Preprocesser(text_list):
    smallestWordSize = 3
    processedList = []
    for token in gensim.utils.simple_preprocess(text_list):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > smallestWordSize:
            # StemmAndLemmatize is a custom helper defined elsewhere
            processedList.append(StemmAndLemmatize(token))
    return processedList
lda_model = models.LdaModel.load('LDAModel\GoldModel') #Load pretrained LDA model
dictionary = Dictionary.load("ModelTrain\ManDict") #Load dictionary model was trained on
#Sample Unseen Doc to Analyze
doc = "I am going to write a string about how I can't get my task executor \
to travel properly. I am trying to use the \
AGV navigator, but it doesn't seem to be working network. I have been trying\
to use the AGV Process flow but that isn't working either speed\
trailer offset I am now going to change this so I can see how fast it runs"
termTopicMatrix = lda_model.get_topics() #Get Term-topic Matrix from pretrained LDA model
cleanDoc = Preprocesser(doc) #Tokenize, lemmatize, clean and stem words
bowDoc = dictionary.doc2bow(cleanDoc) #Create bow using dictionary
dictSize = len(termTopicMatrix[0]) #Get length of terms in dictionary
fullDict = np.zeros(dictSize) #Initialize array which is length of dictionary size
First = [first[0] for first in bowDoc] #Get index of terms in bag of words
Second = [second[1] for second in bowDoc] #Get frequency of term in bag of words
fullDict[First] = Second #Add word frequency to full dictionary
print('Matrix Multiplication: \n', np.dot(termTopicMatrix,fullDict))
print('Conventional Syntax: \n', lda_model[bowDoc])
Output:
Matrix Multiplication:
[0.0283254 0.01574513 0.03669142 0.01671816 0.03742738 0.01989461
0.01558603 0.0370233 0.04648389 0.02887623 0.00776652 0.02147539
0.10045133 0.01084273 0.01229849 0.00743788 0.03747379 0.00345913
0.03086953 0.00628912 0.29406082 0.10656977 0.00618827 0.00406316
0.08775404 0.00785408 0.02722744 0.09957815 0.01669402 0.00744392
0.31177135 0.03063149 0.07211428 0.01192056 0.03228589]
Conventional Syntax:
[(0, 0.070313625), (2, 0.056414187), (18, 0.2016589), (20, 0.46500313), (24, 0.1589748)]
In the pretrained model there are 35 topics and 1155 words.
In the "Conventional Syntax" output, the first element of each tuple is the index of the topic and the second element is the probability of the topic. In the "Matrix Multiplication" version, the probability is the index and the value is the probability. Clearly the two don't match up.
For example, the lda_model[unseenDoc] shows that topic 0 has a 0.07 probability, but the matrix multiplication method says that topic has a 0.028 probability. Am I missing a step here?
You can review the full source code used by LDAModel's get_document_topics() method in your installation, or online at:
https://github.com/RaRe-Technologies/gensim/blob/e75f6c8e8d1dee0786b1b2cd5ef60da2e290f489/gensim/models/ldamodel.py#L1283
(It also makes use of the inference() method in the same file.)
It's doing a lot more scaling/normalization/clipping than your code, which is likely the cause of the discrepancy. But you should be able to examine, line by line, where your process and gensim's differ, and adjust your steps to match.
It also shouldn't be hard to use the gensim code's steps as guidance for creating parallel Javascript code that, given the right parts of the model's state, can reproduce its results.
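For instance, the bracket syntax ultimately runs iterative variational inference rather than a single matrix product. A rough sketch of the final step, reusing bowDoc from the question (inference() returns gamma, the unnormalized topic mixture):
gamma, _ = lda_model.inference([bowDoc])  # iterative variational updates, not one matmul
topic_dist = gamma[0] / gamma[0].sum()    # normalize into a probability distribution
print(topic_dist)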

Why does a Gensim Doc2vec object return empty doctags?

My question is: how should I interpret this situation?
I trained a Doc2Vec model following this tutorial https://blog.griddynamics.com/customer2vec-representation-learning-and-automl-for-customer-analytics-and-personalization/.
For some reason, doc_model.docvecs.doctags returns {}, but doc_model.docvecs.vectors_docs seems to return a proper value.
Why does the Doc2Vec object return empty doctags but populated vectors_docs?
Thank you for any comments and answers in advance.
This is the code I used to train a Doc2Vec model.
from gensim.models.doc2vec import LabeledSentence, TaggedDocument, Doc2Vec
import timeit
import gensim
embeddings_dim = 200 # dimensionality of user representation
filename = f'models/customer2vec.{embeddings_dim}d.model'
if TRAIN_USER_MODEL:
class TaggedDocumentIterator(object):
def __init__(self, df):
self.df = df
def __iter__(self):
for row in self.df.itertuples():
yield TaggedDocument(words=dict(row._asdict())['all_orders'].split(),tags=[dict(row._asdict())['user_id']])
it = TaggedDocumentIterator(combined_orders_by_user_id)
doc_model = gensim.models.Doc2Vec(vector_size=embeddings_dim,
window=5,
min_count=10,
workers=mp.cpu_count()-1,
alpha=0.055,
min_alpha=0.055,
epochs=20) # use fixed learning rate
train_corpus = list(it)
doc_model.build_vocab(train_corpus)
for epoch in tqdm(range(10)):
doc_model.alpha -= 0.005 # decrease the learning rate
doc_model.min_alpha = doc_model.alpha # fix the learning rate, no decay
doc_model.train(train_corpus, total_examples=doc_model.corpus_count, epochs=doc_model.iter)
print('Iteration:', epoch)
doc_model.save(filename)
print(f'Model saved to [{filename}]')
else:
doc_model = Doc2Vec.load(filename)
print(f'Model loaded from [{filename}]')
doc_model.docvecs.vectors_docs, on the other hand, returns a populated array of vectors.
If all of the tags you supply are plain Python ints, those ints are used as the direct-indexes into the vectors-array.
This saves the overhead of maintaining a mapping from arbitrary tags to indexes.
But, it may also cause an over-allocation of the vectors array, to be large enough for the largest int tag you provided, even if other lower ints are never used. (That is: if you provided a single document, with a tags=[1000000], it will allocate an array sufficient for tags 0 to 1000000, even if most of those never appear in your training data.)
If you want model.docvecs.doctags to collect a list of all your tags, use string tags rather than plain ints.
Separately: don't call train() multiple times in your own loop, or manage the alpha learning rate in your own code, unless you have an overwhelmingly good reason to do so. It's inefficient and error-prone. (Your code, for example, is actually performing 200 training epochs: 10 loop iterations times epochs=20 per train() call. And if you were to increase the loop count without carefully adjusting your alpha increment, you could wind up with nonsensical negative alpha values, a very common error in code following this bad practice.) Call .train() once with your desired number of epochs. Set alpha and min_alpha to reasonable starting and nearly-zero values (probably just the defaults, unless you're sure your change is helping) and then leave them alone.
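Put together, a minimal sketch of the recommended pattern, reusing the names from the question (string tags plus a single train() call; untested):
class TaggedDocumentIterator(object):
    def __init__(self, df):
        self.df = df
    def __iter__(self):
        for row in self.df.itertuples():
            # string tags, so docvecs.doctags gets populated
            yield TaggedDocument(words=row.all_orders.split(),
                                 tags=[str(row.user_id)])

train_corpus = list(TaggedDocumentIterator(combined_orders_by_user_id))
doc_model = Doc2Vec(vector_size=200, window=5, min_count=10, epochs=20)
doc_model.build_vocab(train_corpus)
# one train() call; gensim manages the alpha decay internally
doc_model.train(train_corpus, total_examples=doc_model.corpus_count,
                epochs=doc_model.epochs)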

using ols from statsmodels.formula.api with only a constant term?

I'd like to show students what happens when only a constant is used in a regression model. I specified one model as price ~ age for an OLS model of the price of used cars as a function of age plus a constant. Now I'd like to drop the age variable and just have the constant. How do I do this?
The formula fitting in statsmodels uses Patsy, which tries to mimic R-style model specifications.
Since you didn't specify a data source, I've taken a dataset from the
statsmodels OLS guide to provide a worked example - can wealth explain lottery spending:
import statsmodels.api as sm
import statsmodels.formula.api as smf
# load example and trim to a few features
df = sm.datasets.get_rdataset("Guerry", "HistData").data
df = df[['Lottery', 'Literacy', 'Wealth', 'Region']].dropna()
# fit with y=mx + c
model1 = smf.ols(formula='Lottery ~ Wealth', data=df).fit()
print(model1.summary())
# fit with y=c (only an intercept)
model2 = smf.ols(formula='Lottery ~ 1', data=df).fit()
print(model2.summary())
For your question, a model with only the intercept is nothing more than the mean, but presumably you are interested in techniques for comparing different models, so let's do a quick comparison to see whether the simpler model gives a better fit - one option is the f-test:
f_val, p_val, _ = model1.compare_f_test(model2)
print(f_val, p_val, p_val<0.01)
The p value is below the 1% significance level, so we interpret that the more complex model is "more correct" in this case.
For completeness, to specify a model without an intercept (useful e.g. if we already mean-centered the data), we can exclude with -1 in the formula:
# y = mx
model3 = smf.ols(formula='Lottery ~ Wealth -1', data=df).fit()
print(model3.summary())
f_val, p_val, _ = model1.compare_f_test(model3)
print(f_val, p_val, p_val<0.01)
Again, p_val is below the 1% significance level, so including both intercept and slope improves the model fit. (No multiple-test correction here, but the p values are << 1%.)
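As a final sanity check on the intercept-only model being "nothing more than the mean", the fitted intercept should match the sample mean of the response:
# using model2 and df from above
print(model2.params['Intercept'])
print(df['Lottery'].mean())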
