statsmodels ARIMA predict is giving me predictions of the differenced signal instead of predictions of the actual signal. What mistake am I making?

The signal looks like this:
[figure: original signal]
The differenced signal, obtained with plot(output.diff()), looks like this:
[figure: differenced signal]
Next, the parameters of the ARIMA model were obtained by analyzing the ACF and PACF. The model was fit as follows:
model = ARIMA(output.values, order=(2,1,1))
model_fit = model.fit(disp=0)
When I use
model_fit.plot_predict(dynamic=False)
plt.show()
the result is perfect:
[figure: result using plot_predict]
But when I use plt.plot(model_fit.predict(dynamic=False)), it gives predictions of the differenced signal instead:
[figure: result using predict of ARIMA]

If you are using the model sm.tsa.ARIMA, then you can use the following:
plt.plot(model_fit.predict(dynamic=False, typ='levels'))
However, sm.tsa.ARIMA is deprecated and will be removed in a future Statsmodels release. To stay compatible with future versions, you can use the new ARIMA model:
from statsmodels.tsa.arima.model import ARIMA
or
import statsmodels.api as sm
model = sm.tsa.arima.ARIMA(output.values, order=(2,1,1))
This newer model will automatically produce forecasts and predictions of the actual signal, so you do not need to use typ='levels' in this case.
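For reference, a minimal end-to-end sketch with the newer API might look like this (it assumes output and plt are the same series and matplotlib import as in the question):
from statsmodels.tsa.arima.model import ARIMA

model = ARIMA(output.values, order=(2, 1, 1))
model_fit = model.fit()  # the new API does not take disp
# predict() returns in-sample predictions on the original (level) scale
predictions = model_fit.predict()
plt.plot(output.values)
plt.plot(predictions)
plt.show()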

Related

How do I train an encoder-decoder model for a translation task using Hugging Face transformers?

I would like to train an encoder-decoder model, configured as below, for a translation task. Could someone guide me on how to set up a training pipeline for such a model? Any links or code snippets that would help me understand would be appreciated.
from transformers import BertConfig, EncoderDecoderConfig, EncoderDecoderModel
# Initializing a BERT bert-base-uncased style configuration
config_encoder = BertConfig()
config_decoder = BertConfig()
config = EncoderDecoderConfig.from_encoder_decoder_configs(config_encoder, config_decoder)
# Initializing a Bert2Bert model from the bert-base-uncased style configurations
model = EncoderDecoderModel(config=config)
Encoder-decoder models are used in the same way as any other model in Transformers. The model accepts batches of tokenized text as vocabulary indices (i.e., you need a tokenizer that is suitable for your sequence-to-sequence task). When you feed the model the input (input_ids) and the desired output (decoder_input_ids and labels), you get the loss value that you can optimize during training. Note that if the sentences in the batch have different lengths, you need to do masking too. This is a minimal example from the EncoderDecoderModel documentation:
from transformers import EncoderDecoderModel, BertTokenizer
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Initialize both encoder and decoder from pretrained BERT checkpoints
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    'bert-base-uncased', 'bert-base-uncased')

input_ids = torch.tensor(
    tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)).unsqueeze(0)
# Forward pass with labels returns the cross-entropy loss
outputs = model(
    input_ids=input_ids, decoder_input_ids=input_ids, labels=input_ids,
    return_dict=True)
loss = outputs.loss
If you do not want to write the training loop yourself, you can use dataset processing (DataCollatorForSeq2Seq) and training (Seq2SeqTrainer) utilities from Transformers. You can follow the Seq2Seq example on GitHub.
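As a rough sketch of that route (assuming you already have a tokenized train_dataset whose examples contain input_ids, attention_mask, and labels; the dataset, output directory, and hyperparameters below are placeholders):
from transformers import (DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

# Pads input_ids and labels dynamically per batch, so masking is handled for you
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

training_args = Seq2SeqTrainingArguments(
    output_dir="bert2bert-translation",  # placeholder path
    per_device_train_batch_size=8,
    num_train_epochs=3,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # your tokenized dataset (assumed to exist)
    data_collator=data_collator,
    tokenizer=tokenizer,
)
trainer.train()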

In Gensim Word2vec, how to reduce the vocab size of an existing model?

Using Gensim's Word2Vec API, I trained a model initialized with max_final_vocab = 100000 and saved it using model.save().
(This gives me one .model file, one .model.trainables.syn1neg.npy and one .model.wv.vectors.npy file).
I do not need to train the model any further, so I'm fine with using just
model = gensim.models.Word2Vec.load("train.fr.model")
kv = model.wv
del model
the kv variable shown here. I now want to use only the top N (N = 40000 in my case) vocabulary items instead of the entire vocabulary. The only way I could find to even attempt cutting down the vocabulary was
import numpy as np
emb_matrix = np.load("train.fr.model.wv.vectors.npy")
emb_matrix.shape
# (100000, 300)
new_emb_matrix = emb_matrix[:40000]
np.save("train.fr.model.wv.vectors.npy", new_emb_matrix)
If I load this model again though, the vocabulary still has length 100000.
I want to reduce the vocabulary of the model or model.wv while retaining a working model. Retraining is not an option.
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format('train.fr.model', limit=1000)
Use the optional limit parameter to reduce the number of vectors that will be loaded from the Word2Vec model file.
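Note that limit applies to files in the word2vec text/binary format, so if the model was saved with model.save(), one way to get there is to export the vectors first; a sketch, with placeholder file names:
from gensim.models import Word2Vec, KeyedVectors

# Load the full model saved with model.save()
model = Word2Vec.load("train.fr.model")
# Export only the vectors to the plain word2vec format
model.wv.save_word2vec_format("train.fr.vectors.txt")

# Reload just the first 40000 vectors; gensim stores the vocabulary in
# descending frequency order, so this keeps the most frequent words
kv = KeyedVectors.load_word2vec_format("train.fr.vectors.txt", limit=40000)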

Reusing h2o model mojo or pojo file from python

As H2O models are only reusable with the same major version of H2O they were saved with, an alternative is to save the model in MOJO/POJO format. Is there a way these saved models can be reused/loaded from Python code? Or is there any way to keep the model for further development when upgrading the H2O version?
If you want to use your model for scoring via Python, you could use either h2o.mojo_predict_pandas or h2o.mojo_predict_csv. Otherwise, if you want to load a binary model that you previously saved, you will need compatible versions.
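For example, a minimal sketch of scoring a MOJO from a pandas DataFrame (the file paths below are placeholders):
import h2o
import pandas as pd

# Test data to score (placeholder path)
df = pd.read_csv("test_data.csv")

# Score against the exported MOJO; genmodel_jar_path points at the
# h2o-genmodel.jar that ships alongside the MOJO
predictions = h2o.mojo_predict_pandas(
    dataframe=df,
    mojo_zip_path="model.zip",
    genmodel_jar_path="h2o-genmodel.jar",
)
print(predictions.head())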
Outside of H2O-3 you can look into pyjnius as Tom recommended: https://github.com/kivy/pyjnius
Another alternative is to use pysparkling, if you only need it for scoring:
from pysparkling.ml import H2OMOJOModel
# Load test data to predict
df = spark.read.parquet(test_data_path)
# Load mojo model
mojo = H2OMOJOModel.createFromMojo(mojo_path)
# Make predictions
predictions = mojo.transform(df)
# Show predictions with ground truth (y_true and y_pred)
predictions.select('your_target_column', 'prediction').show()

Is it possible to train XGBoost4J-Spark with tree_method='exact' ?

I intend to use a trained xgboost model with tree_method='exact' in a SparkML pipeline, so I need to use XGBoost4J-Spark; however, the documentation says "Distributed and external memory version only support approximate algorithm." (https://xgboost.readthedocs.io/en/latest//parameter.html). Is there any way to work around this?
Alternatively, I could train the model with the C-based xgboost and somehow convert the trained model to an XGBoostEstimator, which is a SparkML estimator and integrates seamlessly into a SparkML pipeline. Has anyone come across such a converter?
I don't mind running on a single node instead of a cluster, as I can afford to wait.
Any insights are appreciated.
So there is this way:
import ml.dmlc.xgboost4j.scala.XGBoost
val xgb1 = XGBoost.loadModel("xgb1")
import ml.dmlc.xgboost4j.scala.spark.XGBoostRegressionModel
val xgbSpark = new XGBoostRegressionModel(xgb1)
where xgb1 is the model trained with the C-based xgboost. There is a problem, however: their predictions don't match. I have reported the issue on the GitHub repo: https://github.com/dmlc/xgboost/issues/3190
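For context, a sketch of how a model like xgb1 above could be produced on a single node with the Python xgboost package (the training data and hyperparameters here are placeholders):
import numpy as np
import xgboost as xgb

# Placeholder training data
X = np.random.rand(1000, 10)
y = np.random.rand(1000)

dtrain = xgb.DMatrix(X, label=y)
params = {
    "objective": "reg:squarederror",
    "tree_method": "exact",  # exact greedy algorithm, single-node only
}
booster = xgb.train(params, dtrain, num_boost_round=100)

# Save in the native format so it can be loaded with XGBoost.loadModel on the JVM side
booster.save_model("xgb1")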

How to find the accuracy of a model from an HDF5 file generated by Keras?

I have trained a CNN model using Keras and saved its weights in an HDF5 file.
Now I want to see the accuracy of my model. How can I find it?
And how can I predict unlabeled data? Is it possible using the .h5 file?
Thanks
Create your model and compile it using model.compile. Then you can load the weights with:
model.load_weights('my_model_weights.h5')
Then you can use model.evaluate on your testing set. Find more info on model.evaluate here: Model API
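Putting that together, a minimal sketch might look like this (build_model, the weight file name, and the test arrays are placeholders for your own architecture and data):
# Rebuild the same architecture the weights were saved from (placeholder)
model = build_model()
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Load the trained weights from the HDF5 file
model.load_weights('my_model_weights.h5')

# Accuracy on a labeled test set
loss, accuracy = model.evaluate(x_test, y_test)
print('Test accuracy:', accuracy)

# Predictions for unlabeled data
predictions = model.predict(x_new)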
