Two StatsModels modules have totally different 'end-runs' - statsmodels

I'm running StatsModels to estimate the parameters of a multiple regression model, using county-level data for 3085 counties. When I use statsmodels.formula.api and drop a few rows from the data, I get the desired results. All seems well enough.
import pandas as pd
import numpy as np
import statsmodels.formula.api as sm
%matplotlib inline
from statsmodels.compat import lzip
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid")
eg=pd.read_csv(r'C:/Users/user/anaconda3/une_edu_pipc_06.csv')
pd.options.display.precision = 3
plt.rc("figure", figsize=(16,8))
plt.rc("font", size=14)
sm_col = eg["lt_hsd_17"] + eg["hsd_17"]
eg["ut_hsd_17"] = sm_col
sm_col2 = eg["sm_col_17"] + eg["col_17"]
eg["bnd_hsd_17"] = sm_col2
eg["d_09"]= eg["Rate_09"]-eg["Rate_06"]
eg["d_10"]= eg["Rate_10"]-eg["Rate_06"]
inc_2=eg["p_c_inc_18"]*eg["p_c_inc_18"]
res = sm.ols(formula = "Rate_18 ~ p_c_inc_18 + ut_hsd_17 + d_10 + inc_2",
data=eg, missing='drop').fit()
print(res.summary())
(BTW, eg["p_c_inc_18"] is per-capita income, and inc_2 is p_c_inc_18 squared.)
But when I instead use import statsmodels.api as sm as the module, with everything else staying pretty much the same, and run the following code after all the appropriate preliminaries,
inc_2=eg["p_c_inc_18"]*eg["p_c_inc_18"]
X = eg[["p_c_inc_18","ut_hsd_17","d_10","inc_2"]]
y = eg["Rate_18"]
X = sm.add_constant(X)
mod = sm.OLS(y, X)
res = mod.fit()
print(res.summary())
then things fall apart, and the Python interpreter throws an error, as follows:
[......]
KeyError: "['inc_2'] not in index"
BTW, the only difference between the two 'runs' is that 15 rows are dropped during the first, successful, model run, while I don't as yet know how to drop missing rows from the second model formulation. Could that difference be responsible for why the second run fails? (I chose to omit large parts of the error message, to reduce clutter.)

You need to assign inc_2 in your DataFrame.
inc_2=eg["p_c_inc_18"]*eg["p_c_inc_18"]
should be
eg["inc_2"] = eg["p_c_inc_18"]*eg["p_c_inc_18"]

Related

GEKKO: MHE load data of previous cycle

I am developing a model predictive controller (MPC) with a moving horizon estimation (MHE) plugin for a dynamic simulation program.
My problem is that the simulation program executes the Python script at each timestep, so a new GEKKO model is produced every timestep. Is there a possibility to reload the model and the data files? For example, could I give GEKKO the path to the data?
Best regards,
Moritz
Try using a Pickle file to store the Gekko model. If the Gekko model archive exists then it is read back into Python.
from os.path import exists
import pickle
import numpy as np
from gekko import GEKKO
import matplotlib.pyplot as plt
if exists('m.pkl'):
    # load model from subsequent call
    m = pickle.load(open('m.pkl','rb'))
    m.solve()
else:
    # define model the first time
    m = GEKKO()
    m.time = np.linspace(0,20,41)
    m.p = m.MV(value=0, lb=0, ub=1)
    m.v = m.CV(value=0)
    m.Equation(5*m.v.dt() == -m.v + 10*m.p)
    m.options.IMODE = 6
    m.p.STATUS = 1; m.p.DCOST = 1e-3
    m.v.STATUS = 1; m.v.SP = 40; m.v.TAU = 5
    m.options.CV_TYPE = 2
    m.solve()
pickle.dump(m,open('m.pkl','wb'))
plt.figure()
plt.subplot(2,1,1)
plt.plot(m.time,m.p.value,'b-',lw=2)
plt.ylabel('gas')
plt.subplot(2,1,2)
plt.plot(m.time,m.v.value,'r--',lw=2)
plt.ylabel('velocity')
plt.xlabel('time')
plt.show()
Each cycle of the controller, the plot updates with the automatic time-shift of the initial condition.
This is similar to what happens in a loop with a combined MHE and MPC. As long as you include everything in the Pickle file, it should reload on the next cycle.
Here is the example code for MHE and MPC.
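If the simulation program also hands over a new measurement every timestep, a minimal sketch of the reload branch (assuming the pickled model above; meas is a hypothetical value supplied by the simulator) is to enable feedback on the CV and insert the measurement before solving:
from os.path import exists
import pickle
if exists('m.pkl'):
    m = pickle.load(open('m.pkl','rb'))
    m.v.FSTATUS = 1   # use measured values for this CV
    m.v.MEAS = meas   # 'meas' comes from the simulation program (hypothetical)
    m.solve()
    pickle.dump(m, open('m.pkl','wb'))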

sklearn - sample from GaussianMixture without fitting

I would like to use a GaussianMixture for generating random data.
The parameters should not be learnt from data but supplied.
GaussianMixture allows supplying initial values for weights, means, and precisions, but calling "sample" is still not possible.
Example:
import numpy as np
from sklearn.mixture import GaussianMixture
d = 10
k = 2
_weights = np.random.gamma(shape=1, scale=1, size=k)
data_gmm = GaussianMixture(n_components=k,
                           weights_init=_weights / _weights.sum(),
                           means_init=np.random.random((k, d)) * 10,
                           precisions_init=[np.diag(np.random.random(d)) for _ in range(k)])
data_gmm.sample(100)
This throws:
NotFittedError: This GaussianMixture instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.
I've tried:
Calling _initialize_parameters() - this requires also supplying a data matrix, and does not initialize a covariances variable needed for sampling.
Calling set_params() - this does not allow supplying values for the attributes used by sampling.
Any help would be appreciated.
You can set all the attributes manually so you don't have to fit the GaussianMixture.
You need to set weights_, means_, and covariances_ as follows:
import numpy as np
from sklearn.mixture import GaussianMixture
d = 10
k = 2
_weights = np.random.gamma(shape=1, scale=1, size=k)
data_gmm = GaussianMixture(n_components=k)
data_gmm.weights_ = _weights / _weights.sum()
data_gmm.means_ = np.random.random((k, d)) * 10
data_gmm.covariances_ = [np.diag(np.random.random(d)) for _ in range(k)]
data_gmm.sample(100)
NOTE: You might need to modify these parameter values according to your use case.
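A small usage sketch of the same idea, assuming the default covariance_type='full': scikit-learn then expects covariances_ as an array of shape (k, d, d), so the per-component matrices are stacked explicitly, and sample() returns both the draws and their component labels. (On some older scikit-learn releases the fitted-check may additionally require precisions_cholesky_ to be set.)
import numpy as np
from sklearn.mixture import GaussianMixture
d, k = 10, 2
_weights = np.random.gamma(shape=1, scale=1, size=k)
gmm = GaussianMixture(n_components=k, covariance_type='full')
gmm.weights_ = _weights / _weights.sum()
gmm.means_ = np.random.random((k, d)) * 10
gmm.covariances_ = np.stack([np.diag(np.random.random(d)) for _ in range(k)])  # shape (k, d, d)
X, labels = gmm.sample(100)   # X has shape (100, d); labels gives the component per draw
print(X.shape, labels.shape)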

I can't load my nn model that I've trained and saved

I used transfer learning to train the model; the base model was EfficientNet.
You can read more about it here
from tensorflow import keras
from keras.models import Sequential,Model
from keras.layers import Dense, Dropout, Conv2D, MaxPooling2D, Flatten, BatchNormalization, Activation
from keras.optimizers import RMSprop, Adam, SGD
from keras.backend import sigmoid
Activation function
class SwishActivation(Activation):
    def __init__(self, activation, **kwargs):
        super(SwishActivation, self).__init__(activation, **kwargs)
        self.__name__ = 'swish_act'

def swish_act(x, beta=1):
    return (x * sigmoid(beta * x))
from keras.utils.generic_utils import get_custom_objects
from keras.layers import Activation
get_custom_objects().update({'swish_act': SwishActivation(swish_act)})
Model Definition
model = enet.EfficientNetB0(include_top=False, input_shape=(150,50,3), pooling='avg', weights='imagenet')
Adding 2 fully-connected layers to B0.
x = model.output
x = BatchNormalization()(x)
x = Dropout(0.7)(x)
x = Dense(512)(x)
x = BatchNormalization()(x)
x = Activation(swish_act)(x)
x = Dropout(0.5)(x)
x = Dense(128)(x)
x = BatchNormalization()(x)
x = Activation(swish_act)(x)
x = Dense(64)(x)
x = Dense(32)(x)
x = Dense(16)(x)
# Output layer
predictions = Dense(1, activation="sigmoid")(x)
model_final = Model(inputs = model.input, outputs = predictions)
model_final.summary()
I saved it using:
model.save('model.h5')
I get the following error trying to load it:
model=tf.keras.models.load_model('model.h5')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-12-e3bef1680e4f> in <module>()
1 # Recreate the exact same model, including its weights and the optimizer
----> 2 model = tf.keras.models.load_model('PhoneDetection-CNN_29_July.h5')
3
4 # Show the model architecture
5 model.summary()
10 frames
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/utils/generic_utils.py in class_and_config_for_serialized_keras_object(config, module_objects, custom_objects, printable_module_name)
319 cls = get_registered_object(class_name, custom_objects, module_objects)
320 if cls is None:
--> 321 raise ValueError('Unknown ' + printable_module_name + ': ' + class_name)
322
323 cls_config = config['config']
ValueError: Unknown layer: FixedDropout
I was getting the same error while trying to run inference by loading my saved model.
Then I just imported the efficientnet library in my inference notebook as well, and the error was gone.
My import command looked like:
import efficientnet.keras as efn
(Note that if you haven't installed efficientnet already (which is unlikely), you can do so with the !pip install efficientnet command.)
I had this same issue with a recent model. Searching the efficientnet source code, you can find the FixedDropout class. I added it to my inference code along with the backend and layers imports. The rate should also match the rate from your EfficientNet model; for EfficientNetB0 the rate is 0.2 (others are different).
from tensorflow import keras
from tensorflow.keras import backend, layers

class FixedDropout(layers.Dropout):
    def _get_noise_shape(self, inputs):
        if self.noise_shape is None:
            return self.noise_shape
        symbolic_shape = backend.shape(inputs)
        noise_shape = [symbolic_shape[axis] if shape is None else shape
                       for axis, shape in enumerate(self.noise_shape)]
        return tuple(noise_shape)

model = keras.models.load_model('model.h5',
                                custom_objects={'FixedDropout': FixedDropout(rate=0.2)})
I was getting the same error. Then I added the imports below, and it started working properly:
import cv2
import matplotlib.pyplot as plt
import tensorflow as tf
from sklearn.metrics import confusion_matrix
import itertools
import os, glob
from tqdm import tqdm
from efficientnet.tfkeras import EfficientNetB4
If you don't have efficientnet installed, you can install it with !pip install efficientnet. If you have any problems, post them here.
In my case, I had two files train.py and test.py.
I was saving my .h5 model inside train.py and was attempting to load it inside test.py and got the same error. To fix it, you need to add the import statements for your efficientnet models inside the file that is attempting to load it as well (in my case, test.py).
from efficientnet.tfkeras import EfficientNetB0
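Putting the pieces together for the model in this question: since it also registers a custom swish activation, loading may need both custom objects. A minimal sketch, assuming the swish_act function and FixedDropout class defined earlier in this thread (or the efficientnet package import, which registers its custom layers) are available:
import tensorflow as tf
import efficientnet.tfkeras   # importing this registers the EfficientNet custom layers
model = tf.keras.models.load_model(
    'model.h5',
    custom_objects={'swish_act': swish_act, 'FixedDropout': FixedDropout})
model.summary()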

Use Gensim or other python LDA packages to use trained LDA model from Mallet

I have an LDA model trained through Mallet in Java. Three files are generated from the Mallet LDA model, which allow me to run the model from files and infer the topic distribution of a new text.
Now I would like to implement a Python tool which is able to infer a topic distribution for a new text, based on the trained LDA model. I do not want to re-train the LDA model in Python. Therefore, I wonder if it is possible to load the trained Mallet LDA model into Gensim or any other Python LDA package. If so, how can I do it?
Thanks for any answers or comments.
In short, yes you can! That is the nice thing about using Mallet: once it has been run, you don't have to go through and relabel topics. I'm doing something very similar, so I'll post my code below with a few helpful links. Once your model is trained, save the notebook widget state and you'll be free to run your model on new and different data sets with the same topic allocation. This code includes a test and validation set. Make sure you've downloaded Mallet and Java, then try this:
# future bridges python 2 and 3
from __future__ import print_function
# pandas works with data structures, data manipulation, and analysis specifically for numerical tables, and series like
# the csv we are using here today
import pandas as pd
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
# Gensim unsupervised topic modeling, natural language processing, statistical machine learning
import gensim
# convert a document to a list of tokens
from gensim.utils import simple_preprocess
# remove stopwords - words that are not telling: "it", "I", "the", "and", etc.
from gensim.parsing.preprocessing import STOPWORDS
# corpus iterator
from gensim import corpora, models
# nltk - Natural Language Toolkit
# lemmatized - words in third person are changed to first person and verbs in past and future tenses are changed
# into the present tense.
# stemmed - words are reduced to their root form.
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
# NumPy - multidimensional arrays, matrices, and high-level mathematical formulas
import numpy as np
np.random.seed(2018)
import os
from gensim.models.wrappers import LdaMallet
from pathlib import Path
import codecs
import logging
import re
import numpy as np
import pandas as pd
from pprint import pprint
# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
# spacy for lemmatization
import spacy
# Plotting tools
import pyLDAvis
import pyLDAvis.gensim # don't skip this
import matplotlib.pyplot as plt
%matplotlib inline
# Enable logging for gensim - optional
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)
import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)
logging.basicConfig(format="%(asctime)s : %(levelname)s : %(message)s", level=logging.INFO)
data = pd.read_csv('YourData.csv', encoding = "ISO-8859-1");
data_text = data[['Preprocessed Document or your comments column title']]
data_text['index'] = data_text.index
documents = data_text
# Create functions to lemmatize, stem, and preprocess
# turn beautiful, beautifully, beautified into the stem beauti
def lemmatize_stemming(text):
    stemmer = PorterStemmer()
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

# parse docs into individual words, ignoring words that are less than 3 letters long
# and stopwords: him, her, them, for, there, etc., since "their" is not a topic.
# then append the tokens into a list
def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        newStopWords = ['yourStopWord1', 'yourStopWord2']
        if token not in gensim.parsing.preprocessing.STOPWORDS and token not in newStopWords and len(token) > 3:
            nltk.bigrams(token)
            result.append(lemmatize_stemming(token))
    return result
# gensim.parsing.preprocessing.STOPWORDS
# look at a random row 4310 and see if things worked out
# note that the document created was already preprocessed
doc_sample = documents[documents['index'] == 4310].values[0][0]
print('original document: ')
words = []
for word in doc_sample.split(' '):
    words.append(word)
print(words)
print('\n\n tokenized and lemmatized document: ')
print(preprocess(doc_sample))
# let’s look at ten rows passed through the lemmatize stemming and preprocess
documents = documents.dropna(subset=['Preprocessed Document'])
processed_docs = documents['Preprocessed Document'].map(preprocess)
processed_docs[:10]
# we create a dictionary of all the words in the csv by iterating through
# contains the number of times a word appears in the training set.
dictionary_valid = gensim.corpora.Dictionary(processed_docs[20000:])
count = 0
for k, v in dictionary_valid.iteritems():
    print(k, v)
    count += 1
    if count > 30:
        break
# we create a dictionary of all the words in the csv by iterating through
# contains the number of times a word appears in the training set.
dictionary_test = gensim.corpora.Dictionary(processed_docs[:20000])
count = 0
for k, v in dictionary_test.iteritems():
    print(k, v)
    count += 1
    if count > 30:
        break
# we want to throw out words that are so frequent that they tell us little about the topic,
# as well as words that are too infrequent (appearing in fewer than 15 documents), then keep just the top 100,000 words
dictionary_valid.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
# we want to throw out words that are so frequent that they tell us little about the topic,
# as well as words that are too infrequent (appearing in fewer than 15 documents), then keep just the top 100,000 words
dictionary_test.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
# the words become numbers and are then counted for frequency
# consider a random row 4310 - it has 8 words; the word indexed 2 shows up once
# preview the bag of words
bow_corpus_valid = [dictionary_valid.doc2bow(doc) for doc in processed_docs]
bow_corpus_valid[4310]
# the words become numbers and are then counted for frequency
# consider a random row 4310 - it has 8 words; the word indexed 2 shows up once
# preview the bag of words
bow_corpus_test = [dictionary_test.doc2bow(doc) for doc in processed_docs]
bow_corpus_test[4310]
# same thing in more words
bow_doc_4310 = bow_corpus_test[4310]
for i in range(len(bow_doc_4310)):
    print("Word {} (\"{}\") appears {} time.".format(bow_doc_4310[i][0],
                                                     dictionary_test[bow_doc_4310[i][0]],
                                                     bow_doc_4310[i][1]))
mallet_path = 'C:/mallet/mallet-2.0.8/bin/mallet.bat'
ldamallet_test = gensim.models.wrappers.LdaMallet(mallet_path, corpus=bow_corpus_test, num_topics=20, id2word=dictionary_test)
result = (ldamallet_test.show_topics(num_topics=20, num_words=10,formatted=False))
for each in result:
    print(each)
mallet_path = 'C:/mallet/mallet-2.0.8/bin/mallet.bat'
ldamallet_valid = gensim.models.wrappers.LdaMallet(mallet_path, corpus=bow_corpus_valid, num_topics=20, id2word=dictionary_valid)
result = (ldamallet_valid.show_topics(num_topics=20, num_words=10,formatted=False))
for each in result:
    print(each)
# Show Topics
for idx, topic in ldamallet_test.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))
# Show Topics
for idx, topic in ldamallet_valid.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))
# check out the topics - 30 words - 20 topics
ldamallet_valid.print_topics(num_topics=20, num_words=30)
# check out the topics - 30 words - 20 topics
ldamallet_test.print_topics(num_topics=20, num_words=30)
# Compute Coherence Score
coherence_model_ldamallet_valid = CoherenceModel(model=ldamallet_valid, texts=processed_docs, dictionary=dictionary_valid, coherence='c_v')
coherence_ldamallet_valid = coherence_model_ldamallet_valid.get_coherence()
print('\nCoherence Score: ', coherence_ldamallet_valid)
# Compute Coherence Score
coherence_model_ldamallet_test = CoherenceModel(model=ldamallet_test, texts=processed_docs, dictionary=dictionary_test, coherence='c_v')
coherence_ldamallet_test = coherence_model_ldamallet_test.get_coherence()
print('\nCoherence Score: ', coherence_ldamallet_test)
Look at step 16 of https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/
This helped: https://rare-technologies.com/tutorial-on-mallet-in-python/
and this: https://radimrehurek.com/gensim/models/wrappers/ldamallet.html
I hope this helps and good luck :)
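If the goal is specifically to reuse an already-trained Mallet model for inference on new text without re-training, one option worth noting is gensim's converter. A sketch, assuming gensim < 4.0 (where the Mallet wrapper still exists) and the ldamallet_test, dictionary_test, and preprocess objects from the code above:
from gensim.models.wrappers.ldamallet import malletmodel2ldamodel
# convert the trained Mallet wrapper into a native gensim LdaModel
lda = malletmodel2ldamodel(ldamallet_test)
# infer the topic distribution of a new, unseen text
new_text = "some new document text"              # hypothetical example input
bow = dictionary_test.doc2bow(preprocess(new_text))
print(lda.get_document_topics(bow))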

how to sample multiple chains in PyMC3

I'm trying to sample multiple chains in PyMC3. In PyMC2 I would do something like this:
for i in range(N):
model.sample(iter=iter, burn=burn, thin = thin)
How should I do the same thing in PyMC3? I saw there is a 'njobs' argument in the 'sample' method, but it throws an error when I set a value for it. I want to use sampled chains to get 'pymc.gelman_rubin' output.
The better option is to use njobs to run the chains in parallel:
#!/usr/bin/env python3
import pymc3 as pm
import numpy as np
from pymc3.backends.base import merge_traces
xobs = 4 + np.random.randn(20)
model = pm.Model()
with model:
    mu = pm.Normal('mu', mu=0, sd=20)
    x = pm.Normal('x', mu=mu, sd=1., observed=xobs)
    step = pm.NUTS()
with model:
    trace = pm.sample(1000, step, njobs=2)
To run them serially, you can use a similar approach to your PyMC 2
example. The main difference is that each call to sample returns a
multi-chain trace instance (containing just a single chain in this
case). merge_traces will take a list of multi-chain instances and
create a single instance with all the chains.
#!/usr/bin/env python3
import pymc3 as pm
import numpy as np
from pymc3.backends.base import merge_traces
xobs = 4 + np.random.randn(20)
model = pm.Model()
with model:
    mu = pm.Normal('mu', mu=0, sd=20)
    x = pm.Normal('x', mu=mu, sd=1., observed=xobs)
    step = pm.NUTS()
with model:
    trace = merge_traces([pm.sample(1000, step, chain=i)
                          for i in range(2)])
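To get the diagnostic mentioned in the question, a short sketch applied to the multi-chain trace from either approach (assuming an older PyMC3 release that still exposes pm.gelman_rubin; recent versions report the same statistic via arviz.rhat):
# R-hat (Gelman-Rubin) per variable; values close to 1 indicate convergence
print(pm.gelman_rubin(trace))
# on newer versions:
# import arviz as az
# print(az.rhat(trace))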
