Why does the number of observations change the prediction of a sarimax model with fixed coefficients? - statsmodels

After training a SARIMAX model, I had hoped to be able to perform forecasts in the future using it with new observations without having to retrain it. However, I noticed that the number of observations I use in the newly applied forecast changes the predictions.
From my understanding, provided that enough observations are given to allow the autoregression and moving average to be calculated correctly, the model would not even use the earlier historic observations to inform itself, as the coefficients are not being retrained. In a (3,0,1) example I would have thought it would need at least 3 observations to apply its trained coefficients. However, this does not seem to be the case, and I am questioning whether I have understood the model correctly.
As an example and test, I have applied a trained SARIMAX to the exact same data with the initial few observations removed, to test the effect of the number of rows on the prediction, with the following code:
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX, SARIMAXResults
y = [348, 363, 435, 491, 505, 404, 359, 310, 337, 360, 342, 406, 396, 420, 472, 548, 559, 463, 407, 362, 405, 417, 391, 419, 461, 472, 535, 622, 606, 508, 461, 390, 432]
ynew = y[10:]
model = SARIMAX(endog=y, order=(3,0,1))
model = model.fit()
pred1 = model.predict(start=len(y), end = len(y)+7)
model2 = model.apply(ynew)
pred2 = model2.predict(start=len(ynew), end = len(ynew)+7)
print(pd.DataFrame({'pred1': pred1, 'pred2':pred2}))
The results are as follows:
pred1 pred2
0 472.246996 472.711770
1 494.753955 495.745968
2 498.092585 499.427285
3 489.428531 490.862153
4 477.678527 479.035869
5 469.023243 470.239459
6 465.576002 466.673790
7 466.338141 467.378903
Based on this, it means that if I were to produce a forecast from a trained model with new observations, the change in the number of observations itself would impact the integrity of the forecast.
What is the explanation for this? What is the standard practice for applying a trained model on new observations given the change in the number of them?
If I wanted to update the model but could not control whether I had all of the original observations from the very start of my training set, this test would indicate that my forecast might as well be random numbers.

Main issue
The main problem here is that you are not using your new results object (model2) for your second set of predictions. You have:
pred2 = model.predict(start=len(ynew), end = len(ynew)+7)
but you should have:
pred2 = model2.predict(start=len(ynew), end = len(ynew)+7)
If you fix this, you get very similar predictions:
pred1 pred2
0 472.246996 472.711770
1 494.753955 495.745968
2 498.092585 499.427285
3 489.428531 490.862153
4 477.678527 479.035869
5 469.023243 470.239459
6 465.576002 466.673790
7 466.338141 467.378903
To understand why they're not identical, there is a second issue (which is not a problem in your code, but just a statistical feature of your data/model).
Secondary issue
Your estimated parameters imply an extremely persistent model:
ar.L1 2.134401
ar.L2 -1.683946
ar.L3 0.549369
ma.L1 -0.874801
sigma2 1807.187815
which is associated with a near-unit-root process (largest eigenvalue = 0.99957719).
What this means is that it takes a very long time for the effects of a particular datapoint on the forecast to die out. In your case, this just means that there are still small effects on the forecasts from the first 10 periods.
This isn't a problem, it's just the way this particular estimated model works.
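As a rough check of the persistence claim, the sketch below builds the companion matrix of the AR part from the coefficients quoted above and looks at its eigenvalues. It assumes that the "largest eigenvalue" refers to this AR companion matrix; the half-life figure is only an order-of-magnitude illustration based on the dominant eigenvalue, not something statsmodels reports.

```python
import numpy as np

# AR coefficients copied from the estimated parameters quoted above.
ar = np.array([2.134401, -1.683946, 0.549369])

# Companion matrix of the AR(3) recursion
# y_t = ar[0]*y_{t-1} + ar[1]*y_{t-2} + ar[2]*y_{t-3} + noise
companion = np.zeros((3, 3))
companion[0, :] = ar
companion[1, 0] = 1.0
companion[2, 1] = 1.0

eigenvalues = np.linalg.eigvals(companion)
max_modulus = np.abs(eigenvalues).max()
print("largest eigenvalue modulus:", max_modulus)

# Rough number of periods for the dominant component to shrink by half.
half_life = np.log(0.5) / np.log(max_modulus)
print("approximate half-life in periods:", half_life)
```

A modulus this close to 1 means the dominant component decays extremely slowly, which is consistent with the first 10 observations still nudging the forecasts.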


Do huggingface translation models support separate vocabulary for source and target?

Every example I've looked at so far seems to use a shared vocabulary between source and target languages, and I'm wondering if that is a hard-coded constraint of the Huggingface models, or my misunderstanding, or I've just not looked in the right place yet?
To take a random example, when I look at the files here, https://huggingface.co/Helsinki-NLP/opus-mt-en-zls/tree/main, I see separate "spm" (SentencePiece model) files for source and target languages, and they are of different sizes (792kb vs. 850kb). But there is only a single "vocab.json" file, and the config.json file only mentions a single "vocab_size": 57680.
I've also been experimenting, e.g. tokenizer(inputs, text_target=inputs, return_tensors="pt"). If source and target used different vocabulary I would expect the returned input_ids and labels to use different numbers. But every model I've tried so far the numbers are identical (NO, my mistake - see update below).
Can a Huggingface tokenizer even support two vocabularies? If not then a model would need two tokenizers, which seems to clash with the way AutoTokenizer works.
Here is a test script to show the above model is actually using two spm vocabs with AutoTokenizer.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
model_name = 'Helsinki-NLP/opus-mt-en-zls'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
inputs = ['Filter all items from same host']
targets = ['Filtriraj sve stavke s istog hosta']
x=tokenizer(inputs, text_target=targets, return_tensors="pt")
print(x)
print("\nGiving inputs on both sides")
x=tokenizer(inputs, text_target=inputs, return_tensors="pt")
print(x) ## Expecting to see different numbers if they use different vocabs
print("\nGiving targets on both sides")
x=tokenizer(targets, text_target=targets, return_tensors="pt")
print(x) ## Expecting to see different numbers if they use different vocabs
The output is:
{'input_ids': tensor([[10373, 90, 8255, 98, 605, 6276, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]]), 'labels': tensor([[11638, 1392, 7636, 386, 35861, 95, 2130, 218, 6276, 27,
▁Filter all▁items from same host</s>
Filtriraj sve stavke s istog hosta</s>
Giving inputs on both sides
{'input_ids': tensor([[10373, 90, 8255, 98, 605, 6276, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]]), 'labels': tensor([[11638, 911, 90, 3188, 7, 98, 605, 6276, 0]])}
▁Filter all▁items from same host</s>
Filter all items from same host</s>
Giving targets on both sides
{'input_ids': tensor([[11638, 1392, 7636, 95, 120, 914, 465, 478, 95, 29,
25, 897, 6276, 27, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'labels': tensor([[11638, 1392, 7636, 386, 35861, 95, 2130, 218, 6276, 27,
Filtriraj sve stavke s istog hosta</s>
Filtriraj sve stavke s istog hosta</s>
When I choose identical strings in English or Croatian it gives slightly different numbers, showing that different tokenizers are involved. You can then see that the different ids sometimes map back to an identical string, sometimes not.
But when I print out the model we see it is actually a shared vocabulary, which makes the two spm models a bit pointless.
(encoder): MarianEncoder(
(embed_tokens): Embedding(57680, 512, padding_idx=57679)
(decoder): MarianDecoder(
(embed_tokens): Embedding(57680, 512, padding_idx=57679)
(lm_head): Linear(in_features=512, out_features=57680, bias=False)
I haven't got as far as finding out if a non-shared vocabulary is possible, but still yet to see evidence of one.
For Marian-based models, HuggingFace now supports separate vocabularies for source and target, but some models may not, especially older models.
(As you know, OPUS-MT models are based on MarianMT. The MarianMT framework supports it.)
Before https://github.com/huggingface/transformers/pull/15831, HuggingFace used a shared vocabulary file for Marian.
This PR updates the Marian model:
To allow not sharing embeddings between encoder and decoder.
Allow tying only decoder embeddings with lm_head.
Separate two vocabs in tokenizer for src and tgt language
It also adds a config parameter, share_encoder_decoder_embeddings, to indicate whether the embeddings should be shared or not.
So models trained with earlier versions of the framework, or with that parameter set to true, only have one shared vocabulary file for source and target.
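To see how this shows up on the configuration side, here is a minimal sketch using MarianConfig directly, with no model download. The vocabulary sizes and pad_token_id are hypothetical values chosen only for illustration; they do not come from any published checkpoint.

```python
from transformers import MarianConfig

# Hypothetical sizes, chosen only for illustration.
separate = MarianConfig(
    vocab_size=32000,            # source (encoder) vocabulary size
    decoder_vocab_size=24000,    # target (decoder) vocabulary size
    pad_token_id=23999,
    share_encoder_decoder_embeddings=False,
)

# Leaving the defaults gives the shared-vocabulary setup.
shared = MarianConfig(vocab_size=32000, pad_token_id=31999)

print(separate.share_encoder_decoder_embeddings,
      separate.vocab_size, separate.decoder_vocab_size)
print(shared.share_encoder_decoder_embeddings,
      shared.vocab_size, shared.decoder_vocab_size)
```

For an existing checkpoint, the same two attributes can be read from its loaded config to tell which setup it uses.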

Good training and validation accuracy but poor confusion matrix

I have trained a model to classify chest X-rays as normal vs. pneumonia. My dataset is set up as follows:
train_batch= ImageDataGenerator(preprocessing_function=tf.keras.applications.vgg16.preprocess_input)\
.flow_from_directory(directory=train_path, target_size=(224,224), classes=['NORMAL', 'PNEUMONIA'],
val_batch= ImageDataGenerator(preprocessing_function=tf.keras.applications.vgg16.preprocess_input) \
.flow_from_directory(directory=val_path, target_size=(224,224), classes=['NORMAL', 'PNEUMONIA'], batch_size=32, class_mode='categorical')
test_batch= ImageDataGenerator(preprocessing_function=tf.keras.applications.vgg16.preprocess_input) \
.flow_from_directory(directory=test_path, target_size=(224,224), classes=['NORMAL', 'PNEUMONIA'], batch_size=16,class_mode='categorical', shuffle=False)
Found 3616 images belonging to 2 classes. #training
Found 1616 images belonging to 2 classes. #validation
Found 624 images belonging to 2 classes. #test
My model consists of 5 CNN layers with input images of shape (224, 224, 3), with 16 feature maps in the first layer and then 32, 64, 128, and 256. Batch normalization, max pooling, and dropout are added to every CNN layer, and the last dense layer is as follows:
model.add(Dense(units=2 , activation='softmax'))
optim = Adam( lr=0.001 )
model.compile(optimizer=optim , loss= 'categorical_crossentropy' , metrics= ['accuracy'])
steps_per_epoch= 113, #3616/32=113
epochs = 25,
validation_data = val_batch,
validation_steps = 51 #1616/32=51
#callbacks=callbacks #remove to chk
As can be seen in the graph, my training and validation accuracy and loss are good, but when I plot the confusion matrix it does not look good. Why?
prediction = model.predict_generator(test_batch,steps= stepss) #, verbose=0)
prediction1 = np.argmax(prediction, axis=1)
cm = confusion_matrix (test_batch.classes, prediction1)
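The value of stepss is not shown above. Assuming it is meant to cover the whole test set exactly once (so that the number of predictions matches the length of test_batch.classes), one way to derive it from the figures already given is sketched here; the variable names are illustrative.

```python
import math

num_test_images = 624   # "Found 624 images belonging to 2 classes." (test)
test_batch_size = 16    # batch_size passed when creating test_batch

# Number of batches needed to visit every test image exactly once.
steps = math.ceil(num_test_images / test_batch_size)
print(steps)
```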
This is my confusion matrix:
And this is my graph:
After that I fine-tuned VGG16, replacing its last dense layer with my own dense layer with two outputs. Here are the graph and confusion matrix:
I do not understand why testing does not go well even with the VGG16 model, as you can see from the results. Please give me your suggestions. Thanks.

How to calculate shap values for ADABoost model?

I am running 3 different models (Random Forest, Gradient Boosting, AdaBoost) and a model ensemble based on these 3 models.
I managed to use SHAP for GB and RF but not for AdaBoost, which fails with the following error:
Exception Traceback (most recent call last)
in engine
----> 1 explainer = shap.TreeExplainer(model,data = explain_data.head(1000), model_output= 'probability')
/home/cdsw/.local/lib/python3.6/site-packages/shap/explainers/tree.py in __init__(self, model, data, model_output, feature_perturbation, **deprecated_options)
110 self.feature_perturbation = feature_perturbation
111 self.expected_value = None
--> 112 self.model = TreeEnsemble(model, self.data, self.data_missing)
114 if feature_perturbation not in feature_perturbation_codes:
/home/cdsw/.local/lib/python3.6/site-packages/shap/explainers/tree.py in __init__(self, model, data, data_missing)
752 self.tree_output = "probability"
753 else:
--> 754 raise Exception("Model type not yet supported by TreeExplainer: " + str(type(model)))
756 # build a dense numpy version of all the tree objects
Exception: Model type not yet supported by TreeExplainer: <class 'sklearn.ensemble._weight_boosting.AdaBoostClassifier'>
I found this comment on GitHub that states:
TreeExplainer creates a TreeEnsemble object from whatever model type we are trying to explain, and then works with that downstream. So all you would need to do is and add another if statement in the TreeEnsemble constructor similar to the one for gradient boosting
But I really don't know how to implement it since I am quite new to this.
I had the same problem, and what I did was modify the file mentioned in the GitHub comment you refer to.
In my case I use Windows, so the file is in C:\Users\my_user\AppData\Local\Continuum\anaconda3\Lib\site-packages\shap\explainers, but you can double-click on the error message and the file will be opened.
The next step is to add another elif, as the GitHub comment says. In my case I did it from line 404, as follows:
1) Modify the source code.
self.objective = objective_name_map.get(model.criterion, None)
self.tree_output = "probability"
elif str(type(model)).endswith("sklearn.ensemble.weight_boosting.AdaBoostClassifier'>"): #From this line I have modified the code
scaling = 1.0 / len(model.estimators_) # output is average of trees
self.trees = [Tree(e.tree_, normalize=True, scaling=scaling) for e in model.estimators_]
self.objective = objective_name_map.get(model.base_estimator_.criterion, None) #This line is done to get the decision criteria, for example gini.
self.tree_output = "probability" #This is the last line I added
elif str(type(model)).endswith("sklearn.ensemble.forest.ExtraTreesClassifier'>"): # TODO: add unit test for this case
scaling = 1.0 / len(model.estimators_) # output is average of trees
self.trees = [Tree(e.tree_, normalize=True, scaling=scaling) for e in model.estimators_]
Note that for the other models, the shap code needs the attribute 'criterion', which the AdaBoost classifier doesn't have directly. So in this case the attribute is obtained from the "weak" classifiers with which the AdaBoost model has been trained; that's why I added model.base_estimator_.criterion.
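One thing worth checking before applying the patch: the elif matches on the string form of the model's type, so the module path in the pattern has to agree with your installed scikit-learn. The sketch below uses only the class string from the traceback in the question and the pattern from the patch above; the variable names are illustrative.

```python
# Class string exactly as it appears in the traceback in the question.
type_str = "<class 'sklearn.ensemble._weight_boosting.AdaBoostClassifier'>"

# Pattern used in the patch above (older module path, no leading underscore).
old_pattern = "sklearn.ensemble.weight_boosting.AdaBoostClassifier'>"
# Pattern matching the module path shown in the traceback.
new_pattern = "sklearn.ensemble._weight_boosting.AdaBoostClassifier'>"

matches_old = type_str.endswith(old_pattern)
matches_new = type_str.endswith(new_pattern)
print(matches_old, matches_new)
```

If the first check is False for your own model, the added branch is never reached and the original exception is still raised, so the pattern needs adjusting to your version.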
Finally, you have to import the library again, train your model, and get the shap values. Here is an example:
2) Import again the library and try:
from sklearn import datasets
from sklearn.ensemble import AdaBoostClassifier
import shap
# import some data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target
ADABoost_model = AdaBoostClassifier()
ADABoost_model.fit(X, y)
shap_values = shap.TreeExplainer(ADABoost_model).shap_values(X)
shap.summary_plot(shap_values, X, plot_type="bar")
Which generates the following:
3) Get your new results:
It seems that the shap package has been updated and still does not support AdaBoostClassifier. Based on the previous answer, I've adapted the change to work with the shap/explainers/tree.py file, at lines 598-610:
### Added AdaBoostClassifier based on the outdated StackOverflow response and Github issue here
### https://stackoverflow.com/questions/60433389/how-to-calculate-shap-values-for-adaboost-model/61108156#61108156
### https://github.com/slundberg/shap/issues/335
elif safe_isinstance(model, ["sklearn.ensemble.AdaBoostClassifier", "sklearn.ensemble._weight_boosting.AdaBoostClassifier"]):
assert hasattr(model, "estimators_"), "Model has no `estimators_`! Have you called `model.fit`?"
self.internal_dtype = model.estimators_[0].tree_.value.dtype.type
self.input_dtype = np.float32
scaling = 1.0 / len(model.estimators_) # output is average of trees
self.trees = [Tree(e.tree_, normalize=True, scaling=scaling) for e in model.estimators_]
self.objective = objective_name_map.get(model.base_estimator_.criterion, None) #This line is done to get the decision criteria, for example gini.
self.tree_output = "probability" #This is the last line added
Also working on testing to add this to the package :)

Keras Inception-v3 fine-tuning workaround

I am trying to fine-tune Inception-v3, but no matter which layer I choose to freeze I get random predictions. I found that other people are having the same problem: https://github.com/keras-team/keras/issues/9214 . It seems that the problem comes from setting the BN layer to not trainable.
Now I am trying to get the output of the last layer I want to freeze and use it as an input to the following layers, which I will then train:
train_generator = train_datagen.flow_from_directory(
os.path.join(directory, "train_data"),
classes=["a", "b", "c","d"],
shuffle=False)
base_model = InceptionV3(weights='imagenet', include_top=True, input_shape=(299, 299, 3))
model_features = Model(inputs=base_model.input, outputs=base_model.get_layer(
#I want to use this as input
values_train = model_features.predict_generator(train_generator, verbose=1)
However, I get a memory error like this, although I have 12 GB, which is more than I need:
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 3268864 totalling 3.12MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 3489024 totalling 3.33MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 4211968 totalling 4.02MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 5129472 totalling 4.89MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:700] Sum Total of in-use chunks: 3.62GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats:
Limit: 68719476736
InUse: 3886957312
MaxInUse: 3889054464
NumAllocs: 3709
MaxAllocSize: 8388608
Any suggestion how to fix that or another workaround to fine-tune Inception will be very helpful.
I can't tell if you're preprocessing your input properly from what you've provided. However, Keras provides functions for preprocessing that are specific to the pre-trained net, in this case Inception V3.
from keras.applications.inception_v3 import preprocess_input
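Try passing this as the preprocessing function when you create your data generator (preprocessing_function is an argument of ImageDataGenerator, not of flow_from_directory), like so...
train_datagen = ImageDataGenerator(preprocessing_function=preprocess_input) # <---
train_generator = train_datagen.flow_from_directory(
os.path.join(directory, "train_data"),
classes=["a", "b", "c","d"],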
You should then be able to unfreeze all of the layers, or the select few that you want to train.
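For intuition about what that preprocessing step does: as far as I know, the Inception V3 preprocess_input rescales pixel values from [0, 255] to [-1, 1]. The NumPy sketch below mimics that scaling; inception_style_preprocess is a hypothetical stand-in written for illustration, not the Keras function itself.

```python
import numpy as np

def inception_style_preprocess(pixels):
    """Scale pixel values from [0, 255] to [-1, 1] (illustrative stand-in)."""
    return pixels.astype("float32") / 127.5 - 1.0

raw = np.array([0, 127.5, 255])
scaled = inception_style_preprocess(raw)
print(scaled)
```

If the images are fed to the network without this scaling, the pretrained weights see inputs on a very different range from the one they were trained on.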
Hope that helps!

Dirichlet vs Binomial in pymc3

I am having trouble sampling from a Dirichlet/Multinomial distribution with pymc3.
I tried to create a simple test-case to recreate a Beta/Binomial using Dirichlet/Multinomial with n=2, but I can't get it to work.
Below I have some code that works for Binomial but fails for Multinomial.
One of the obvious differences is that the Multinomial model is more constrained:
i.e. to start, rating is set to 10 in the Binomial model, and [10,10] in the Multinomial.
The pymc3 Dirichlet code does say "Only the first k-1 elements of x are expected" but only arrays of shape 2 seem to work in my code.
The output shows that num_friends and rating are being sampled in the Binomial case, but not in the Multinomial case. friends_ratings is being sampled in both. Thanks!
Oh, also Dirichlet('d', np.array([1,1])) crashes with "Floating point error 8". It only appears to fail when two integers of value 1 are passed in. np.array([1.,1.]) works.
import pymc as pm
import numpy as np
with pm.Model() as model:
friends_ratings = pm.Beta('friends_ratings', alpha=1, beta=2)
num_friends = pm.DiscreteUniform('num_friends', lower=0, upper=100)
rating = pm.Binomial('rating', n=num_friends, p=friends_ratings)
step = pm.Metropolis([num_friends, friends_ratings, rating])
start = {"friends_ratings":.5, "num_friends":20, 'rating':10}
tr = pm.sample(5, step, start=start, progressbar=False)
print "friends", [tr[i]['num_friends'] for i in range(len(tr))]
print "friends_ratings", [tr[i]['friends_ratings'] for i in range(len(tr))]
print "rating", [tr[i]['rating'] for i in range(len(tr))]
with pm.Model() as model:
friends_ratings = pm.Dirichlet('friends_ratings', np.array([1.,1.]), shape=2)
num_friends = pm.DiscreteUniform('num_friends', lower=0, upper=100)
rating = pm.Multinomial('rating', n=num_friends, p=friends_ratings, shape=2)
step = pm.Metropolis([num_friends, friends_ratings, rating])
start = {'friends_ratings': np.array([0.5,0.5]), 'num_friends': 20, 'rating': [10,10]}
tr = pm.sample(5, step, start=start, progressbar=False)
print "friends", [tr[i]['num_friends'] for i in range(len(tr))]
print "friends_ratings", [tr[i]['friends_ratings'] for i in range(len(tr))]
print "rating", [tr[i]['rating'] for i in range(len(tr))]
friends [22.0, 24.0, 24.0, 23.0, 23.0]
friends_ratings [0.5, 0.5, 0.41, 0.41, 0.41]
rating [10.0, 11.0, 11.0, 11.0, 11.0]
friends [20.0, 20.0, 20.0, 20.0, 20.0]
friends_ratings [array([ 0.51369621, 1.490608 ]), ... ]
rating [array([ 10., 10.]), array([ 10., 10.]), ... ]
PyMC3 does not automatically normalize the Dirichlet. So far you have to do this explicitly using simplextransform. See here for an example.
There is an issue of making this transform automatic though: https://github.com/pymc-devs/pymc3/issues/315
EDIT (9/14/2015): PyMC3 now automatically transforms the dirichlet distribution (as any other distribution). So you don't need to specify that manually anymore.
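To make the normalization point concrete, here is a small NumPy check (not PyMC3 code) of what "on the simplex" means: valid Dirichlet draws are non-negative and sum to 1, whereas the first friends_ratings value reported in the Multinomial run above does not.

```python
import numpy as np

rng = np.random.default_rng(0)

# Valid Dirichlet([1, 1]) draws: each row is non-negative and sums to 1.
draws = rng.dirichlet([1.0, 1.0], size=5)
on_simplex = bool(np.allclose(draws.sum(axis=1), 1.0))

# First friends_ratings value reported in the question's Multinomial output.
reported = np.array([0.51369621, 1.490608])
reported_on_simplex = bool(np.isclose(reported.sum(), 1.0))

print(on_simplex, reported_on_simplex)
```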
