Predicting multiple forecast values using a time series ARIMA model - arima

-> We are trying to build a time series model using ARIMA.
-> The goal is to predict multiple forecast values: on the basis of historical data I want to predict the next 3 values.
-> In my use case I am using pmdarima to automatically find the p,d,q values.
code:
import pmdarima as pm
model = pm.auto_arima(df[data_column])       # search for the best (p,d,q)
hyperparameter = model.fit(df[data_column])  # refit on the series
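For reference, the chosen order can be read back directly from the fitted model (a small sketch reusing df and data_column from above; auto_arima already returns a fitted model, so the extra fit() call is optional):
import pmdarima as pm
model = pm.auto_arima(df[data_column])
p, d, q = model.order  # the (p, d, q) tuple selected by auto_arima
print('selected order:', (p, d, q))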
-> I have one dataset of 4000 rows. In the first case I consider only the first 100 rows of the dataset, and in the second case all 4000 rows.
-> ARIMA on the 100-row dataset
Here I use the first 97 values as training data and try to predict the last 3 values.
The p,d,q values from pmdarima are (1,0,0).
In this case we are able to successfully predict three different values.
The code is as follows, considering the first 100 values of the dataset and getting p,d,q as (1,0,0) from pmdarima:
import pmdarima as pm
model = pm.auto_arima(df[col_name])
hyperparameter = model.fit(df[col_name])
hyperparameter
Output: ARIMA(order=(1, 0, 0), scoring_args={}, suppress_warnings=True)
p=1
d=0
q=0
from statsmodels.tsa.arima.model import ARIMA

X = df[col_name].values  # convert the data to an ndarray
number_of_moving_data_for_model = 97
train, test = X[:number_of_moving_data_for_model], X[number_of_moving_data_for_model:]
# fit model
model = ARIMA(train, order=(p, d, q))
model_fit = model.fit()
# multi-step out-of-sample forecast
forecast = model_fit.forecast(steps=3)
print('forecast =', forecast)
Output: forecast = [55135502.82729979 55133836.2767516 55132494.00564519]
-> ARIMA on the 4000-row dataset
Here I use the first 3997 values as training data and try to predict the last 3 values.
The p,d,q values from pmdarima are (0,1,0).
In this case we are NOT able to predict three different values.
The code is as follows, considering the first 4000 values of the dataset and getting p,d,q as (0,1,0) from pmdarima:
import pmdarima as pm
model = pm.auto_arima(df[col_name])
hyperparameter = model.fit(df[col_name])
hyperparameter
Output: ARIMA(order=(0, 1, 0), scoring_args={}, suppress_warnings=True)
p=0
d=1
q=0
X = df[col_name].values  # convert the data to an ndarray
number_of_moving_data_for_model = 3997
train, test = X[:number_of_moving_data_for_model], X[number_of_moving_data_for_model:]
# fit model
model = ARIMA(train, order=(p, d, q))
model_fit = model.fit()
# multi-step out-of-sample forecast
forecast = model_fit.forecast(steps=3)
print('forecast =', forecast)
Output: forecast = [57531824. 57531824. 57531824.]
-> But when we change the p,d,q values to (1,0,0), it predicts three different future values.
The code is as follows, considering the first 4000 values of the dataset but manually overriding the (0,1,0) suggested by pmdarima with (1,0,0):
import pmdarima as pm
model = pm.auto_arima(df[col_name])
hyperparameter = model.fit(df[col_name])
hyperparameter
Output: ARIMA(order=(0, 1, 0), scoring_args={}, suppress_warnings=True)
p=1
d=0
q=0
X = df[col_name].values  # convert the data to an ndarray
number_of_moving_data_for_model = 3997
train, test = X[:number_of_moving_data_for_model], X[number_of_moving_data_for_model:]
# fit model
model = ARIMA(train, order=(p, d, q))
model_fit = model.fit()
# multi-step out-of-sample forecast
forecast = model_fit.forecast(steps=3)
print('forecast =', forecast)
Output: forecast = [57531509.23896821 57531194.56251951 57530879.97063115]
-> Why does changing the p,d,q values from (0,1,0) to (1,0,0) make ARIMA predict three distinct values?
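A minimal sketch of a likely explanation (my own illustration, not from the original post): ARIMA(0,1,0) with no drift is a random walk, and the multi-step forecast of a random walk is simply the last observed value repeated, so the flat forecast above is expected behaviour rather than a bug. An AR term, as in (1,0,0), lets successive forecasts differ.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# synthetic random-walk series standing in for the real data
rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(size=200)) + 100

# ARIMA(0,1,0) without drift: the forecast for every horizon
# is just the last observed value
fit = ARIMA(y, order=(0, 1, 0)).fit()
print(fit.forecast(steps=3))  # three identical values, all equal to y[-1]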

Related

Using RMSE function with 5-fold cross validation to choose the best model out of 3

I have defined three different models obtained from the diabetes dataset in the lars library. The first model (M1) is the one that minimizes the BIC value out of all possible regression models obtained by combining the explanatory variables (p=10, so 2^10 possible models). The other two are obtained through glmnet and are Lasso regressions with lambda.min (M2) and lambda.1se (M3) respectively, where lambda.min and lambda.1se are obtained through cv.glmnet. Now I should perform 5-fold cross-validation using the RMSE (Root Mean Square Error) to check which of the three models M1, M2 and M3 has the best predictive performance. In order to find the errors of the models obtained from the Lasso I have to use the ordinary least squares estimates.
This is my code so far:
library(lars)
library(glmnet)
data(diabetes)
y<-diabetes$y
x<-diabetes$x
x2<-diabetes$x2
X = as.data.frame(cbind(x))
Y = as.data.frame(y)
p=10
n=442
best_score = Inf
M1 = NA
for (i in 1:(2^p - 1)) {
  model = lm(y ~ ., data = subset(X, select = c(which(as.integer(intToBits(i)) == 1))))
  if (BIC(model) < best_score) {
    M1 = model
    best_score = BIC(model)
  }
}
W<-as.matrix(X)
Y<-as.matrix(Y)
lasso<-glmnet(W, Y)
x11()
plot(lasso, label=T)
x11()
plot(lasso, xvar = 'lambda', label=T)
lasso$df
lasso$lambda
cvfit<-cv.glmnet(W,Y)
cvfit
coef(cvfit, s="lambda.min")
coef(cvfit, s="lambda.1se")
M2<-glmnet(W,Y,lambda = cvfit$lambda.min)
M3<-glmnet(W,Y,lambda = cvfit$lambda.1se)
I really don't know where to go from here. Should I first split the original dataset into 5 folds and then recompute the models on the different train and test sets? How do I compute the final RMSE for each model? And what does it mean that I should use ordinary least squares estimates for the models obtained through the Lasso?

tf.data.Dataset.zip(a, b) changes order of elements if a was shuffled

I am preparing a dataset and then training a model before storing the outputs (for the purpose of knowledge distillation).
In order to store them in the tfrecords format I need to use the .zip() function.
I reproduced the bug/mistake with the following code.
My actual training files are hundreds of lines, so I didn't include them here.
I use TensorFlow 2.1 and Python 3.7 on Ubuntu 18.04.
The problem I can't solve is:
The data is shuffled (which is okay), but after zipping, the paired elements no longer match each other (which is not okay).
import tensorflow as tf
ds = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4, 5])
# prepare dataset for training
batch_size = 2
ds = ds.cache().repeat().shuffle(buffer_size=5, reshuffle_each_iteration=True).batch(batch_size)
# create model. here: map identity function
model = tf.keras.models.Sequential([tf.keras.layers.Lambda(lambda x: x, input_shape=(1,))])
# train with model.fit()
# make predictions
pred = model.predict(ds, steps=5 // batch_size)
# prepare for saving to tfrecords
ds = ds.unbatch()
ds = ds.take(5)
pred = tf.data.Dataset.from_tensor_slices(pred)
combined = tf.data.Dataset.zip((ds, pred))
# show unwanted behaviour
for a, c in combined:
    print(a, c)
The output of the code snippet shows that the elements per line don't match (e.g. in line 1, 3 should be mapped to 3., not 4.):
tf.Tensor(3, shape=(), dtype=int32) tf.Tensor([4.], shape=(1,), dtype=float32)
tf.Tensor(1, shape=(), dtype=int32) tf.Tensor([1.], shape=(1,), dtype=float32)
tf.Tensor(4, shape=(), dtype=int32) tf.Tensor([1.], shape=(1,), dtype=float32)
tf.Tensor(3, shape=(), dtype=int32) tf.Tensor([2.], shape=(1,), dtype=float32)
TensorFlow applies the shuffle on each iteration through the dataset. Zipping triggers one of those iterations, which is why the order seen by model.predict does not match the order seen by zip (a reshuffle happens both times).
In any case, for predict you do not really need to shuffle the dataset: the predictions should not depend on what the model saw in a previous prediction.
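A possible workaround (my own sketch, not part of the original answer): shuffle once with reshuffle_each_iteration=False, so that every pass over the dataset, including the one made by zip, sees the same order:
import tensorflow as tf

ds = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4, 5])
# shuffle once and keep that order on every subsequent pass
ds_fixed = ds.shuffle(buffer_size=5, reshuffle_each_iteration=False)

# two separate iterations now yield the same order, so zipping the
# dataset with values derived from an earlier pass keeps pairs aligned
first_pass = [int(x) for x in ds_fixed]
combined = tf.data.Dataset.zip(
    (ds_fixed, tf.data.Dataset.from_tensor_slices(first_pass)))
for a, b in combined:
    print(a.numpy(), b.numpy())  # a == b on every line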
TensorFlow shuffles along the first axis, so if your tensor shape is (x,) this will change the order of your elements; here is a test:
a = tf.data.Dataset.from_tensor_slices(tf.constant([[x] for x in range(10)]))
b = tf.data.Dataset.from_tensor_slices(tf.constant([[x] for x in range(10)]))
c = tf.data.Dataset.zip((a, b)).shuffle(10)
for i, j in c.batch(1):
    print(i.numpy(), j.numpy())
and the output is
[[3]] [[3]]
[[6]] [[6]]
[[5]] [[5]]
[[8]] [[8]]
[[7]] [[7]]
[[1]] [[1]]
[[2]] [[2]]
[[0]] [[0]]
[[9]] [[9]]
[[4]] [[4]]
as you can see, the pairing between the two tensors has been preserved, but the items along the first axis of each tensor have been shuffled.

statsmodels SARIMAX predictions

I'm trying to understand how to verify an ARIMAX model for > 1 step ahead using statsmodels.
My understanding is that the results.get_prediction(start=, dynamic=) API does this, but I'm having trouble getting my head around how it works. My training data is indexed by a localised DateTimeIndex (tz='Australia/Sydney') at 15T freq. I want to predict a full day for '2019-02-04 00:00:00+11:00', using one-step-ahead prediction up to '2019-02-04 06:00:00+11:00' and then the previously predicted endogenous values for the rest of the day.
Is the code below correct? It seems statsmodels converts start to a Timestamp and treats dynamic as a multiple of the freq, so this should run one-step-ahead prediction until 06:00 and then use the previously predicted endogenous values. The results don't look great, so I want to confirm it's a model issue rather than an incorrect diagnosis on my part.
dt = '2019-02-04'
predict = res.get_prediction(start='2019-02-04 00:00:00+11:00')
predict_dy = res.get_prediction(start='2019-02-04 00:00:00+11:00', dynamic=4*6)
fig = plt.figure(figsize=(10, 10))
ax = fig.gca()
y_train[dt].plot(ax=ax, style='o', label='Observed')
predict.predicted_mean[dt].plot(ax=ax, style='r--', label='One-step-ahead forecast')
predict_dy.predicted_mean[dt].plot(ax=ax, style='g', label='Dynamic forecast')
It seems statsmodels converts start to a Timestamp
Yes, if you give it a string value, then it will attempt to map it to an index in your dataset (like a timestamp).
and treats dynamic as a multiple of the freq
But this is not correct: dynamic is an integer offset relative to start. If dynamic=0, dynamic prediction begins at start itself, whereas if dynamic=1, dynamic prediction begins at start+1.
It's not quite clear to me what's going on in your example (or what you think is not great about the predictions you generated), so here is a worked example that may help explain how dynamic works. A couple of key points for this exercise:
I set all elements of endog to be equal to 1
This is an AR(1) model with parameter 0.5. That means that if we know y_t, then the prediction of y_t+1 is equal to 0.5 * y_t.
Now, the example code is:
import numpy as np
import pandas as pd
import statsmodels.api as sm

ix = pd.date_range(start='2018-12-01', end='2019-01-31', freq='D')
endog = pd.Series(np.ones(len(ix)), index=ix)
mod = sm.tsa.SARIMAX(endog, order=(1, 0, 0), concentrate_scale=True)
res = mod.smooth([0.5])
p1 = res.predict(start='January 1, 2019', end='January 5, 2019').rename('d=False')
p2 = res.predict(start='January 1, 2019', end='January 5, 2019', dynamic=0).rename('d=0')
p3 = res.predict(start='January 1, 2019', end='January 5, 2019', dynamic=1).rename('d=1')
print(pd.concat([p1, p2, p3], axis=1))
this gives:
            d=False      d=0     d=1
2019-01-01      0.5  0.50000  0.5000
2019-01-02      0.5  0.25000  0.5000
2019-01-03      0.5  0.12500  0.2500
2019-01-04      0.5  0.06250  0.1250
2019-01-05      0.5  0.03125  0.0625
The first column (d=False) is the default case, where dynamic=False. Here, all predictions are one-step-ahead predictions. Since I set every element of endog to 1 and we have an AR(1) model with parameter 0.5, all one-step-ahead predictions will be equal to 0.5 * 1 = 0.5.
In the second column (d=0), we specify that dynamic=0 so that dynamic prediction begins at the first prediction. This means that we do not use any endog data past start - 1 in forming our predictions, which in this case means we do not use any data past December 31, 2018 in making predictions. The first prediction will be equal to 0.5 times the observation on December 31, 2018, i.e. 0.5 * 1 = 0.5. Each subsequent prediction will be equal to 0.5 * the previous prediction, so the second prediction is 0.5 * 0.5 = 0.25, etc.
The third column (d=1) is like the second column, except that here dynamic=1 so that dynamic prediction begins at the second prediction. This means we do not use any endog data past start (i.e. past January 1, 2019).
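Applied back to the question's setup (a sketch, assuming res is the fitted results object on the 15-minute data): with dynamic=4*6, dynamic prediction begins 24 steps of 15 minutes, i.e. 6 hours, after start, so the original call does start the dynamic portion at 06:00:
# one-step-ahead predictions from midnight until 06:00, then dynamic
# predictions (using previously predicted endog values) for the rest of the day
predict_dy = res.get_prediction(start='2019-02-04 00:00:00+11:00',
                                dynamic=4 * 6)  # 24 x 15 min = 6 hours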

python testing data in 2d array

I have a 2d array with depth information, 640x480.
I want to add (row, col) values to a list where the value is in the range 800 to 2800 (true for about 5% of the values in my example data).
I have this code (Python 2.7, Windows 10, new laptop from 2017):
import numpy as np

depth = np.load("depth.npy")  # depth.shape = (640, 480), ndarray
obstacleList = []
for row in range(480):
    for col in range(640):
        dist = depth[col, row]
        if dist > 800 and dist < 2800:
            obstacleList.append((col, dist))
My time measurement shows that it takes almost 10 seconds to build the list.
For further processing of the data I only need the col with the lowest dist value, but I thought computing that would only add more processing time.
What is wrong with my code?
I found numpy.nanmin, which finds the min value for each column in practically no time. To get rid of some values (like 0) I needed to convert my array to float (as NaN is a float) and replace the unwanted values with NaN:
b = depth.astype(float)
b[b < 800] = np.nan
obstacles = np.nanmin(b, axis=1)
This gave me an array with the lowest non-NaN value for each column in no time (0.05 seconds on my machine versus the 8 seconds using plain iteration)!
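For completeness, a vectorized sketch of the original task (collecting the indices of every in-range value) that avoids the Python double loop entirely; both bounds go into one boolean mask, and np.argwhere does the scan in compiled code:
import numpy as np

depth = np.load("depth.npy")   # shape (640, 480)

# boolean mask of all entries strictly between 800 and 2800
mask = (depth > 800) & (depth < 2800)
obstacles = np.argwhere(mask)  # one (axis-0, axis-1) index pair per match

# lowest in-range value per column, mirroring the nanmin approach above
b = depth.astype(float)
b[~mask] = np.nan
col_min = np.nanmin(b, axis=1)  # all-NaN columns yield NaN (with a RuntimeWarning)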

What is stratified bootstrap?

I have learned bootstrap and stratification. But what is stratified bootstrap? And how does it work?
Let's say we have a dataset of n instances (observations), and m is the number of classes. How should I divide the dataset, and what's the percentage for training and testing?
You split your dataset per class. Afterwards, you sample from each sub-population independently. The number of instances you sample from each sub-population should be proportional to its share of the whole dataset.
d(i) <- { x in data | class(x) = i }    # partition the data per class
for each class i:
    for j = 0..samplesize * (size(d(i)) / size(data)):
        sample(i) <- draw element from d(i)
sample <- U sample(i)                   # union of the per-class samples
If you sample four elements from a dataset with classes {'a', 'a', 'a', 'a', 'a', 'a', 'b', 'b'}, this procedure makes sure that at least one element of class b is contained in the stratified sample.
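As an alternative sketch (my own addition, not part of the original answer), scikit-learn's resample utility accepts a stratify argument and can draw such a sample in one call:
from sklearn.utils import resample

data = ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'b']

# draw 4 elements with replacement while preserving the 3:1 class ratio,
# so the sample always contains at least one 'b'
sample = resample(data, replace=True, n_samples=4, stratify=data)
print(sample)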
I just had to implement this in Python, so I will post my current approach here in case it is of interest to others.
A function to create an index into the original DataFrame for a stratified bootstrapped sample:
I chose to iterate over all relevant strata in the original DataFrame, retrieve the index of the relevant rows in each stratum, and randomly (with replacement) draw the same number of samples from the stratum as the stratum itself contains.
The randomly drawn indices can then be combined into one list (which in the end should have the same length as the original DataFrame).
import pandas as pd
from random import choices

def provide_stratified_bootstrap_sample_indices(bs_sample):
    strata = bs_sample.loc[:, "STRATIFICATION_VARIABLE"].value_counts()
    bs_index_list_stratified = []
    # .items() replaces the removed Series.iteritems(); the value_counts index
    # entries are the stratum values themselves, so compare against them directly
    for idx_stratum_var, n_stratum_var in strata.items():
        data_index_stratum = list(bs_sample[bs_sample["STRATIFICATION_VARIABLE"] == idx_stratum_var].index)
        bs_index_list_stratified.extend(choices(data_index_stratum, k=len(data_index_stratum)))
    return bs_index_list_stratified
And then the actual bootstrapping loop (say 10,000 times):
k = 10000
for i in range(k):
    bs_sample = DATA_original.copy()
    bs_index_list_stratified = provide_stratified_bootstrap_sample_indices(bs_sample)
    bs_sample = bs_sample.loc[bs_index_list_stratified, :]
    # process the data with some statistical operation as required
    # and save the results for each iteration
    RESULTS = FUNCTION_X(bs_sample)
