fable package: Could not find an appropriate ARIMA model - arima

I am trying to fit an ARIMA model to hourly data. First, I tried the fable package, but its ARIMA function could not find an appropriate model. Second, I used the forecast package with the auto.arima function, which worked without problems. One example series is available here: https://gist.github.com/mizhozan/800fec80682822969e7d35ebba395, and the code and results for it are below:
data.arima <- read.csv('test.csv', header = TRUE)[,-1]
## fable package
data.arima$Date <- lubridate::ymd_hms(data.arima$Date, truncated = 2)
library(tidyverse)
library(tsibble) # provides as_tsibble(); not attached by library(fable)
library(fable)
result.arima <- data.arima %>%
  as_tsibble(., index = Date) %>%
  model(ARIMA(value ~ PDQ() + pdq() +
                fourier(period = "day", K = 3) +
                fourier(period = "week", K = 2), seasonal.test = "ocsb")) %>%
  forecast(h = 24)
Warning message:
1 error encountered for ARIMA(value ~ PDQ() + pdq() + fourier(period = "day", K = 3) +
fourier(period = "week", K = 2), seasonal.test = "ocsb")
[1] Could not find an appropriate ARIMA model.
This is likely because automatic selection does not select models with characteristic roots that may be numerically unstable.
For more details, refer to https://otexts.com/fpp3/arima-r.html#plotting-the-characteristic-roots
## forecast package
library(forecast)
series.arima <- msts(data.arima$value, seasonal.periods = c(24, 24*7))
model.arima <- auto.arima(series.arima, seasonal.test = "ocsb", xreg=fourier(series.arima,K=c(3,2)))
Series: series.arima
Regression with ARIMA(4,0,1) errors
Coefficients:
ar1 ar2 ar3 ar4 ma1 intercept S1-24 C1-24 S2-24 C2-24 S3-24 C3-24 S1-168 C1-168 S2-168 C2-168
1.9064 -1.4934 0.8292 -0.3056 -0.8728 664263.21 -310891.13 -349744.23 -133862.32 -20587.2 69313.88 51963.803 43880.66 1524.578 -3823.166 5642.26
s.e. 0.0755 0.1192 0.1085 0.0521 0.0605 7781.72 20778.06 20591.69 11662.66 11606.0 8792.99 8768.856 11342.32 11669.244 12819.074 13091.08
sigma^2 estimated as 5.122e+09: log likelihood=-4225.19
AIC=8484.38 AICc=8486.31 BIC=8549.28
result.arima.2 <- forecast(model.arima, xreg=fourier(series.arima, K = c(3,2), h = 24))
I would appreciate it if someone could explain the problem here.

Related

R estimating one independent variable more than once

I am trying to estimate a multinomial logit model to predict systemic banking crises with panel data. My code is below. I have run this code before and it worked fine. However, after I renamed the independent variables and reran the model on the new data, R started estimating multiple iterations of the x1 variable. When I drop x1, the model estimates fine again. I have attached screenshots of the results: Faulty_result1, Faulty_result_2 and Result_with_x1_dropped. I can't figure out what the issue is. Any help will be much appreciated.
#Remove all items from memory (if any)
rm(list=ls(all=TRUE))
#Set working directory to load files
setwd("D:/PhD/Codes")
#Load necessary libraries
library(readr)
library(nnet)
library(plm)
#Load data
my_data <- read_csv("D:/PhD/Data/xx_Final Data_4.csv",
                    col_types = cols(`Time Period` = col_date(format = "%d/%m/%Y"),
                                     y = col_factor(levels = c("0", "1",
                                                               "2")), x2 = col_double(), x5 = col_double(),
                                     x9 = col_double(), x11 = col_double(),
                                     x13 = col_double(), x24 = col_double()),
                    na = "NA")
#Change levels from numeric to character
levels(my_data$y) <- c("Tranquil", "Pre-crisis", "Crisis")
str(my_data$y)
#Create Panel Data
p_data=pdata.frame(my_data)
#Export dataset
write_csv(p_data,"D:/PhD/Data/Clean_Final Data_4.csv")
#Drop unnecessary columns
p <- subset(p_data, select = c(3:27))
#Set reference level
p$y <- relevel(p$y, ref="Tranquil")
#Create Model
model <- multinom(y~ ., data = p)
summary(model)
stargazer::stargazer(model, type = "text")

Error in (function (classes, fdef, mtable) unable to find an inherited method for function ‘krige’ for signature ‘"formula", "tbl_df"’

I am getting a strange error and don't know how to solve it, even after checking other posts. Everything runs until the kriging step, where I receive: Error in (function (classes, fdef, mtable) unable to find an inherited method for function ‘krige’ for signature ‘"formula", "tbl_df"’
The strange thing is that everything worked a few days ago; I did not change anything in the code, and now it no longer runs. Some other posts attribute the problem to the Raster, but I could not find any discrepancies. Could recent package updates be the cause? I use the sp package, for example.
Unfortunately I cannot provide the data I use, hopefully it can be solved without.
How can I solve the issue? Thank you in advance for the help.
homeDir = "D:/Folder/DataXYyear/"
y = 1992
Source = paste("Year", y, ".csv")
File = file.path(homeDir,Source)
GWMeas <- read_csv(File)
GWMeasX <- na.omit(GWMeas)
ggplot(
  data = GWMeasX,
  mapping = aes(x = X, y = Y, color = level)
) +
  geom_point(size = 3) +
  scale_color_viridis(option = "B") +
  theme_classic()
GWMX_sf <- st_as_sf(GWMeasX, coords = c("X", "Y"), crs = 25832) %>%
  cbind(st_coordinates(.))
v_emp_OK <- gstat::variogram(
  level~1,
  as(GWMX_sf, "Spatial") # switch from {sf} to {sp}
)
v_mod_OK <- automap::autofitVariogram(level~1, as(GWMX_sf, "Spatial"), model = "Sph")$var_model
GWMeasX %>% as.data.frame %>% glimpse
GW.vgm <- variogram(level~1, locations = ~X+Y, data = GWMeasX) # calculates sample variogram values
GW.fit <- fit.variogram(GW.vgm, model=vgm(model = "Gau")) # fit model
sf_GWlevel <- st_as_sf(GWMeasX, coords = c("X", "Y"), crs = 25833)
grd_sf <- sf_GWlevel %>%
  st_bbox() %>%
  st_as_sfc() %>%
  st_make_grid(
    cellsize = c(5000, 5000), # 5000m pixel size
    what = "centers"
  ) %>%
  st_as_sf() %>%
  cbind(., st_coordinates(.))
grid <- as(grd_sf, "Spatial")
gridded(grid) <- TRUE
grid <- as(grid, "SpatialPixels")
createGrid <- function(XY.Spacing)
crs(grid) <- crs(GWMX_sf)
OK3 <- krige(formula = level~1,  # variable to interpolate
             data = GWMX_sf,     # gauge data
             newdata = grid,     # grid to interpolate on
             model = v_mod_OK,   # variogram model to use
             nmin = 4,           # minimum number of points to use for the interpolation
             nmax = 20,          # maximum number of points to use for the interpolation
             maxdist = 120e3     # maximum distance of points to use for the interpolation
)

Confused about the use of validation set here

In main.py of the px2graph project, the training and validation part is shown below:
splits = [s for s in ['train', 'valid'] if opt.iters[s] > 0]
start_round = opt.last_round - opt.num_rounds
# Main training loop
for round_idx in range(start_round, opt.last_round):
    for split in splits:
        print("Round %d: %s" % (round_idx, split))
        loader.start_epoch(sess, split, train_flag, opt.iters[split] * opt.batchsize)
        flag_val = split == 'train'
        for step in tqdm(range(opt.iters[split]), ascii=True):
            global_step = step + round_idx * opt.iters[split]
            to_run = [sample_idx, summaries[split], loss, accuracy]
            if split == 'train': to_run += [optim]
            # Do image summaries at the end of each round
            do_image_summary = step == opt.iters[split] - 1
            if do_image_summary: to_run[1] = image_summaries[split]
            # Start with lower learning rate to prevent early divergence
            t = 1/(1+np.exp(-(global_step-5000)/1000))
            lr_start = opt.learning_rate / 15
            lr_end = opt.learning_rate
            tmp_lr = (1-t) * lr_start + t * lr_end
            # Run computation graph
            result = sess.run(to_run, feed_dict={train_flag:flag_val, lr:tmp_lr})
            out_loss = result[2]
            out_accuracy = result[3]
            if sum(out_loss) > 1e5:
                print("Loss diverging...exiting before code freezes due to NaN values.")
                print("If this continues you may need to try a lower learning rate, a")
                print("different optimizer, or a larger batch size.")
                return
            time_str = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
            print("{}: step {}, loss {:g}, acc {:g}".format(time_str, global_step, out_loss, out_accuracy))
            # Log data
            if split == 'valid' or (split == 'train' and step % 20 == 0) or do_image_summary:
                writer.add_summary(result[1], global_step)
                writer.flush()
    # Save training snapshot
    saver.save(sess, 'exp/' + opt.exp_id + '/snapshot')
    with open('exp/' + opt.exp_id + '/last_round', 'w') as f:
        f.write('%d\n' % round_idx)
It seems that the author only gets the result for each batch of the validation set. If I want to see whether the model is improving or has reached its best performance, should I use the result on the whole validation set?
If the validation set is small enough, we can calculate the loss and accuracy on the whole validation set during training to monitor performance. However, if the validation set is too large, it is better to calculate batch-wise validation results over multiple steps.
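One way to turn batch-wise validation results into a single number per round is a sample-weighted average. The sketch below is only an illustration: the function name `average_validation_metrics` and the `(loss, accuracy, batch_size)` layout are assumptions, not part of px2graph.

```python
def average_validation_metrics(batch_results):
    """Combine per-batch (loss, accuracy, batch_size) triples into
    sample-weighted averages over all validation batches seen."""
    total_loss = 0.0
    total_acc = 0.0
    total_samples = 0
    for loss, acc, n in batch_results:
        # Weight each batch by its size so a smaller last batch does not skew the mean.
        total_loss += loss * n
        total_acc += acc * n
        total_samples += n
    if total_samples == 0:
        raise ValueError("no validation batches were provided")
    return total_loss / total_samples, total_acc / total_samples


# Hypothetical per-batch results from three validation steps.
batches = [(0.9, 0.50, 4), (0.7, 0.75, 4), (0.5, 1.00, 2)]
mean_loss, mean_acc = average_validation_metrics(batches)
print("validation loss %.3f, accuracy %.3f" % (mean_loss, mean_acc))
```

Tracking these averaged values from round to round gives a steadier signal than any single validation batch.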

How to do a customized "average" for pandas multilevel dataframe?

I have a pandas multilevel dataframe df that contains quarterly financial report data for 2000+ stocks from 2006 to 2012, and I am trying to find a way to quickly calculate the 'average' value for each data point.
demo_data() is a function that generates demo data (df = demo_data(stk_qty=2000, col_num=200) can be used to simulate the financial report data):
def demo_data(stk_qty, col_num):
    ''' generate demo data, return multilevel dataframe '''
    import random
    import numpy as np  # needed for np.random.randn below
    import pandas as pd
    rpt_date_template = [(yr+qt) for yr in map(str, range(2006, 2013)) for qt in ['0331','0630','0930','1231']]
    stk_id_list = ['STK'+str(x).zfill(3) for x in range(0, stk_qty)]
    stk_id_column, rpt_date_column = [], []
    for i in range(stk_qty):
        stk_rpt_date_list = rpt_date_template[random.randint(0,8):] # rpt dates with random start
        stk_id_column.extend([stk_id_list[i]] * len(stk_rpt_date_list))
        rpt_date_column.extend(stk_rpt_date_list)
    index_name = ['STK_ID', 'RPT_Date']
    col_name = ['COL'+str(x).zfill(3) for x in range(col_num)]
    first_level_dt = stk_id_column
    second_level_dt = rpt_date_column
    dt = pd.DataFrame(np.random.randn(len(stk_id_column), col_num), columns=col_name)
    dt[index_name[0]] = first_level_dt
    dt[index_name[1]] = second_level_dt
    multilevel_df = dt.set_index(index_name, drop=True, inplace=False)
    return multilevel_df
Here is some sample data. (Note: sw() is a method that displays the four corners of a big dataframe; its source code is at: How to preview a part of a large pandas DataFrame?)
>>> df = demo_data(5,3)
>>> df.sw()
COL000 COL001 COL002
STK_ID RPT_Date
STK000 20060630 1.8196 0.9519 -1.0526
20060930 -0.4074 -0.9025 1.3562
20061231 -1.1750 0.4190 -1.2976
20070331 -0.5609 1.5190 0.4893
20070630 0.4580 -0.3804 0.3705
20070930 -0.4711 -1.1953 -0.0609
20071231 0.3363 1.1949 1.2802
20080331 1.6359 0.8355 -0.2763
20080630 0.2697 -0.8236 -1.7095
20080930 0.6178 -0.3742 -1.1646
.......................................
STK004 20111231 -0.3198 1.6972 -1.3281
20120331 -1.1905 -0.4597 0.3695
20120630 -0.8253 -0.0502 -0.2862
20120930 0.0059 -1.8535 -1.2107
20121231 0.5762 -0.2872 0.0993
Index : ['STK_ID', 'RPT_Date']
Column: COL000,COL001,COL002
row: 117 col: 3
The customized average function I want is named my_avg() and is defined by the following rules:
1. Q1's average value is (Q4_of_previous_yr + Q1)/2
2. Q2's average value is (Q4_of_previous_yr + Q1 + Q2)/3
3. Q3's average value is (Q4_of_previous_yr + Q1 + Q2 + Q3)/4
4. Q4's average value is (Q4_of_previous_yr + Q1 + Q2 + Q3 + Q4)/5
5. If some of the data points are not provided, just calculate the normal average of the available data points.
So my_avg(df) will produce the following output for each STK_ID:
STK_ID RPT_Date COL000 COL001 COL002
STK000 20060630 1.819619705 0.951918984 -1.052639309
20060930 0.706112476 0.024688028 0.151757352
20061231 0.079077767 0.156125083 -0.331359614
20070331 -0.867930112 0.969000466 -0.404129827
20070630 -0.425943376 0.519205768 -0.145929753
20070930 -0.437234418 0.090579744 -0.124681449
20071231 -0.282524858 0.3114374 0.156297097
20080331 0.986121631 1.015202552 0.501971496
.......................................
STK004 20111231 xxxxx xxxxxxx xxxxxxx
How to write the code for my_avg() ?
Reference:
I tried to write a temp_solution_avg() function, but it has three issues:
1. The average calculation does not include the 'Q4_of_previous_yr' data point, so the result is not what I want.
2. The data's 'RPT_Date' must start with Q1 ('xxxx0331'); otherwise the first year's data is wrong.
3. The calculation is very slow.
In [3]: df = demo_data(500,100)
In [4]: timeit temp_solution_avg(df)
1 loops, best of 3: 66.3 s per loop
def temp_solution_avg(df):
    ''' return the average , Q1: not change, Q2 : (df.Q1 + df.Q2)/2 ,
    Q3: (df.Q1 + df.Q2 + df.Q3)/3, Q4 : (df.Q1 + df.Q2 + df.Q3 + df.Q4)/4
    data's 'RPT_Date' must start with Q1 ('xxxx0331'), otherwise first yr's
    data is wrong .
    '''
    dt = df.reset_index()
    dt['yr'] = dt['RPT_Date'].str[0:4]
    dt['temp_stk_id'] = dt['STK_ID']
    dt = dt.set_index(['STK_ID','RPT_Date'], drop=True, inplace=False)
    rst = dt.groupby(['temp_stk_id','yr']).transform(pd.expanding_mean)
    return rst
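One possible vectorized sketch of my_avg() following the five rules above. It is not benchmarked against the 66.3 s figure. It assumes that RPT_Date values are 'YYYYMMDD' strings, that each stock has at most one row per quarter, and that a missing data point is either an absent row or a NaN cell. The helper names (`ordered`, `flat`, `run_sum`, `run_cnt`, `prev`) are illustrative only.

```python
import pandas as pd


def my_avg(df):
    """Running average within each (stock, year), seeded with the previous
    year's Q4 value when that row exists; missing values are skipped."""
    ordered = df.sort_index()            # sort by STK_ID, then RPT_Date
    flat = ordered.reset_index()
    cols = list(df.columns)
    flat['yr'] = flat['RPT_Date'].str[:4].astype(int)
    keys = [flat['STK_ID'], flat['yr']]

    # Running sum and running count of available values per (stock, year).
    vals = flat[cols]
    run_sum = vals.fillna(0).groupby(keys).cumsum()
    run_cnt = vals.notna().astype(int).groupby(keys).cumsum()

    # Attach the previous year's Q4 values: take the Q4 rows, relabel them
    # with the following year, and left-join them onto every row.
    q4 = flat.loc[flat['RPT_Date'].str[4:] == '1231', ['STK_ID', 'yr'] + cols].copy()
    q4['yr'] = q4['yr'] + 1
    prev = flat[['STK_ID', 'yr']].merge(q4, on=['STK_ID', 'yr'], how='left')[cols]

    # Average = (previous Q4 + quarters so far) / number of available points.
    out = (run_sum + prev.fillna(0)) / (run_cnt + prev.notna().astype(int))
    out.index = ordered.index
    return out


# Small hand-checkable example (values are made up for illustration).
demo = pd.DataFrame(
    {'COL000': [1.0, 3.0, 5.0, 7.0, 9.0, 2.0, 4.0]},
    index=pd.MultiIndex.from_tuples(
        [('STK000', '20060630'), ('STK000', '20060930'), ('STK000', '20061231'),
         ('STK000', '20070331'), ('STK000', '20070630'),
         ('STK001', '20070331'), ('STK001', '20070630')],
        names=['STK_ID', 'RPT_Date']))
result = my_avg(demo)
print(result)
```

In the example, the 20070331 row of STK000 becomes (5 + 7) / 2 because the 2006 Q4 value is available, while STK001 has no earlier Q4 and falls back to a plain running average.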

Cross validation in R

I have a problem cross validating a dataset in R.
mypredict.rpart <- function(object, newdata){
  predict(object, newdata, type = "class")
}
res <- errorest(win~., data=df, model = rpart, predict = mypredict.rpart)
I get this error.
Error in predict.rpart(object, newdata, type = "class") :
Invalid prediction for rpart object
My dataset consists of 16 numerical attributes, and win has two values, 0 and 1.
You can download the dataset at the link.
If you're doing classification, win should be a factor.
df$win = factor(df$win)
Then your code works for me:
> res
Call:
errorest.data.frame(formula = win ~ ., data = df, model = rpart,
predict = mypredict.rpart)
10-fold cross-validation estimator of misclassification error
Misclassification error: 0.4844
