How to implement the Breusch-Godfrey test for a regression with ARIMA errors in R

I’m fitting a regression with ARIMA errors using the fable package, and as mentioned in my previous question the Breusch-Godfrey test is not available there.
The regression part of the model has two pairs of Fourier terms to account for yearly seasonality and several exogenous regressors. The residuals are modeled with a seasonal ARIMA(2,0,0)(1,0,0)[7] model. My goal is to check for autocorrelation in residuals.
I could use the Ljung-Box test, but according to this thread and the textbook sources cited there, it is not valid in the presence of lags of the dependent variable.
I’m also afraid I will lose my model specification if I switch packages/libraries. An alternative might be to use Arima() from the forecast package to retain the model specification, and then use bgtest() from the lmtest package, but I can’t figure out how to do this.
According to this R forum post, the Breusch-Godfrey test for an ARIMA model can be done by fitting a simple regression of the residuals from the fitted model on a constant and then running bgtest on that regression. However, that example only concerns a simple AR(1) model with no exogenous regressors.
Is this the right way to do it? My concern is that the BG test requires an auxiliary regression of the residuals on the original regressors and the lagged residuals up to order p. How would bgtest know about the X variables in this case, since they are not stored in the residuals object, which is just a simple vector?
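For what it's worth, the procedure described above can also be sketched outside R. Below is a minimal Python/statsmodels analogue (not the fable/forecast workflow itself): fit a regression with seasonal ARIMA(2,0,0)(1,0,0)[7] errors, two pairs of Fourier terms and an exogenous regressor on made-up placeholder data, regress the residuals on a constant, and run the Breusch-Godfrey test on that auxiliary regression.

```python
# Sketch (placeholder data, not the asker's R workflow): Breusch-Godfrey check on the
# residuals of a regression with seasonal ARIMA(2,0,0)(1,0,0)[7] errors, via statsmodels.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.diagnostic import acorr_breusch_godfrey
from statsmodels.tsa.deterministic import DeterministicProcess, Fourier
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(0)
n = 365
index = pd.date_range("2023-01-01", periods=n, freq="D")
y = pd.Series(rng.normal(size=n), index=index)                  # placeholder response
xreg = pd.DataFrame({"x1": rng.normal(size=n)}, index=index)    # placeholder exogenous regressor

# Two pairs of Fourier terms for yearly seasonality, as in the question
fourier = Fourier(period=365.25, order=2)
dp = DeterministicProcess(index, constant=True, additional_terms=[fourier])
X = pd.concat([dp.in_sample(), xreg], axis=1)

# Regression with seasonal ARIMA(2,0,0)(1,0,0)[7] errors
fit = SARIMAX(y, exog=X, order=(2, 0, 0), seasonal_order=(1, 0, 0, 7)).fit(disp=False)

# The procedure from the linked thread: regress the residuals on a constant,
# then run the Breusch-Godfrey test on that auxiliary regression.
resid = fit.resid
aux = sm.OLS(resid, np.ones(len(resid))).fit()
lm_stat, lm_pvalue, f_stat, f_pvalue = acorr_breusch_godfrey(aux, nlags=14)
print(lm_stat, lm_pvalue)
```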

Related

Machine learning algorithm for correlation between indicators

I have a dataset with several indicators related to some geographical entities. I want to study the factors that influence one indicator A (among the other indicators), i.e. determine which indicators affect it the most (correlation).
Which ML algorithm should I use?
I also want a kind of scoring function for indicator A to allow its prediction.
What you are looking for are correlation coefficients. You have multiple choices; the most common are:
Pearson's coefficient, which only measures the linear relationship between two variables (see SciPy's implementation);
Spearman's coefficient, which can capture monotonic non-linear relationships (see SciPy's implementation).
You can also normalize your data using z-normalization and then fit a simple linear regression. The regression coefficients can give you an idea of the influence of each variable on the outcome. However, this method is highly sensitive to multicollinearity, which might be present, especially if your variables are geographical.
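A minimal sketch of the two coefficients (the indicator names and data below are made up): Spearman picks up a monotonic but non-linear relationship that Pearson understates.

```python
# Sketch: Pearson vs Spearman correlation between a made-up indicator and indicator A.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
indicator_b = rng.normal(size=200)
indicator_a = np.exp(indicator_b) + rng.normal(scale=0.1, size=200)  # monotonic, non-linear link

pearson_r, pearson_p = stats.pearsonr(indicator_b, indicator_a)
spearman_r, spearman_p = stats.spearmanr(indicator_b, indicator_a)
print(f"Pearson:  r = {pearson_r:.2f} (p = {pearson_p:.3f})")
print(f"Spearman: rho = {spearman_r:.2f} (p = {spearman_p:.3f})")
```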
Could you provide an example of the dataset? Discrete or continuous variables? Which software are you using?
Anyway, an easy way to test correlation (without going into ML algorithms in the strict sense) is simply to compute Pearson's or Spearman's correlation coefficient on selected features, or on the whole dataset by building a correlation matrix of the data. You can do that in Python with NumPy (see this) or in R (see this).
You can also use simple linear regression or logistic/multinomial logistic regression (depending on the nature of your data) to quantify the influence of the other features on your target variable. Just keep in mind that "correlation is not causation". Look here to see some models.
Then it depends on the goal of your analysis whether to aggregate the features across all the geographical points or to create covariance matrices for each "subset" of observations related to a geographical point.
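A whole-dataset correlation matrix, as suggested above, is a one-liner in pandas or NumPy; a sketch with made-up indicator columns:

```python
# Sketch: correlation matrix over a whole (hypothetical) indicator dataset.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(100, 4)),
                  columns=["indicator_a", "indicator_b", "indicator_c", "indicator_d"])

print(df.corr(method="pearson"))     # pairwise Pearson correlations
print(df.corr(method="spearman"))    # pairwise Spearman correlations
# NumPy equivalent (Pearson only): np.corrcoef(df.values, rowvar=False)
```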

Clarifying statsmodels AutoReg(), ARMA() and SARIMAX() for time-series forecasting

I am building my first time-series prediction model with scikit-learn's LinearRegression(). I also came across statsmodels' AutoReg(), ARMA() and SARIMAX(). Unfortunately, from the literature I could not figure out how to place them. Are they alternatives to LinearRegression()? Are they ML? Are they fundamentally different?
I'd appreciate a hint on where to look further. Thanks.
All three fit variants of Seasonal Autoregressive Integrated Moving Average with eXogenous Variables (SARIMAX) models.
AutoReg
AutoReg is limited to Autoregressive models and so does not include Seasonal or Moving Average components. It does support exogenous regressors, and it also supports complex deterministic processes, such as Fourier series, to model multiple seasonalities. Parameters are estimated using OLS, which is equivalent to conditional maximum likelihood. Since parameters are estimated using OLS, estimation is very fast and completely deterministic.
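A minimal AutoReg sketch on placeholder data, combining lags, an exogenous regressor and a Fourier deterministic term (series, regressor and period are made up):

```python
# Sketch: AutoReg with an exogenous regressor and a deterministic Fourier term, fit by OLS.
import numpy as np
import pandas as pd
from statsmodels.tsa.ar_model import AutoReg
from statsmodels.tsa.deterministic import DeterministicProcess, Fourier

rng = np.random.default_rng(0)
index = pd.date_range("2022-01-01", periods=365, freq="D")
y = pd.Series(rng.normal(size=365), index=index)                 # placeholder series
exog = pd.DataFrame({"x1": rng.normal(size=365)}, index=index)   # placeholder regressor

dp = DeterministicProcess(index, constant=True,
                          additional_terms=[Fourier(period=365.25, order=2)])
res = AutoReg(y, lags=7, exog=exog, deterministic=dp, trend="n").fit()  # fast, deterministic OLS fit
print(res.params)
```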
ARIMA
ARIMA is a restricted version of SARIMAX that does not include Seasonal components or Exogenous regressors. Because it excludes these two types of terms, it can offer additional fitting options that are not available when fitting a full SARIMAX model. These estimators have different statistical properties from the Maximum Likelihood method, which is the only method available in SARIMAX (ARIMA also supports Maximum Likelihood). Many of these alternative parameter estimation methods are also faster than ML.
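A sketch of that difference on placeholder data; the alternative estimator name below ("hannan_rissanen") is taken from the statsmodels documentation as I recall it, so check the current API:

```python
# Sketch: ARIMA fit by the default state-space MLE and, assuming the current statsmodels
# API, by an alternative (faster, non-ML) estimator selected via the `method` argument.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
y = rng.normal(size=500).cumsum()   # placeholder random-walk-like series

mle_fit = ARIMA(y, order=(1, 1, 1)).fit()                          # maximum likelihood (default)
hr_fit = ARIMA(y, order=(1, 1, 1)).fit(method="hannan_rissanen")   # alternative estimator
print(mle_fit.params, hr_fit.params)
```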
SARIMAX
SARIMAX supports all features of ARIMA plus the two additional components. It can only be estimated using Maximum Likelihood. ML uses numerical methods to maximize the likelihood function, so estimation for some series/models may have difficulty converging.
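A minimal SARIMAX sketch with both a seasonal component and an exogenous regressor (placeholder data and orders):

```python
# Sketch: SARIMAX with a seasonal component and an exogenous regressor,
# estimated by maximum likelihood (numerical optimization).
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(0)
y = rng.normal(size=365)                 # placeholder series
exog = rng.normal(size=(365, 1))         # placeholder regressor

res = SARIMAX(y, exog=exog, order=(1, 0, 0), seasonal_order=(1, 0, 0, 7)).fit(disp=False)
print(res.summary())
```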
The examples page is the best place to look to see the detailed use of these models. Many of the notebooks include both code examples and LaTeX markup that explains the underlying math.

Which model to pick from K fold Cross Validation

I was reading about cross-validation and how it is used to select the best model and estimate parameters, but I did not really understand what that means.
Suppose I build a linear regression model and go for 10-fold cross-validation. I think each of the 10 fits will have different coefficient values; from these 10 different fits, which should I pick as my final model or parameter estimates?
Or do we use cross-validation only for the purpose of finding an average error (the average over the 10 models in our case) and comparing it against another model?
If you build a linear regression model and go for 10-fold cross-validation, then indeed each of the 10 fits will have different coefficient values. The reason you use cross-validation is to get a robust idea of the error of your linear model, rather than evaluating it on one train/test split only, which could be unlucky or too lucky. CV is more robust because ten splits cannot all be lucky or all be unlucky.
Your final model is then trained on the whole training set; this is where your final coefficients come from.
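A minimal scikit-learn sketch of that workflow on synthetic data: estimate the error with 10-fold CV, then refit the final model on all of the training data to obtain the final coefficients.

```python
# Sketch: 10-fold CV to estimate generalization error, then a final fit on the full
# training set to get the final coefficients.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)

scores = cross_val_score(LinearRegression(), X, y, cv=10, scoring="neg_mean_squared_error")
print("CV MSE: %.1f +/- %.1f" % (-scores.mean(), scores.std()))

final_model = LinearRegression().fit(X, y)   # final coefficients come from the full fit
print(final_model.coef_)
```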
Cross-validation is used to see how good your model's predictions are. It is a smart way of running multiple tests on the same data by splitting it, as you probably know (i.e., it is useful if you don't have much training data).
For example, it might be used to make sure you aren't overfitting. So basically you evaluate your finished model with cross-validation, and if you see that the error grows a lot somewhere, you go back to tweaking the parameters.
Edit:
Read the Wikipedia article for a deeper understanding of how it works: https://en.wikipedia.org/wiki/Cross-validation_%28statistics%29
You are basically confusing grid search with cross-validation. The idea behind cross-validation is to check how well a model will perform in, say, a real-world application. So we repeatedly split the data into different train/validation partitions and validate the model's performance on each. It should be noted that the parameters of the model remain the same throughout the cross-validation process.
In grid search we try to find the hyperparameter values that give the best results over a specific split of the data (say 70% train and 30% test). So in this case, for different configurations of the same model, the dataset remains constant.
Read more about cross-validation here.
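One common way to run a grid search in scikit-learn scores each candidate setting with cross-validation; a small sketch (the model and hyperparameter grid are arbitrary):

```python
# Sketch: grid search chooses hyperparameters; cross-validation scores each candidate.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

grid = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}, cv=5)
grid.fit(X, y)   # refits on all the data with the best parameters by default
print(grid.best_params_, grid.best_score_)
```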
Cross-validation is mainly used for the comparison of different models.
For each model, you get the average generalization error over the k validation sets. You can then choose the model with the lowest average generalization error as your optimal model.
Cross-Validation or CV allows us to compare different machine learning methods and get a sense of how well they will work in practice.
Scenario-1 (Directly related to the question)
Yes, CV can be used to find out which method (SVM, random forest, etc.) will perform best, and we can pick that method to work with further.
(For each method, different models are generated and evaluated on the folds, an average metric is calculated per method, and the best average metric helps in selecting the method.)
After identifying the best method (or best parameters), we can retrain our model on the full training dataset.
Such parameters can be determined by grid-search techniques; see grid search.
Scenario-2:
Suppose you have a small amount of data and you want to perform training, validation, and testing. Dividing such a small dataset into three sets reduces the number of training samples drastically, and the result will depend on the particular choice of training and validation sets.
CV comes to the rescue here. In this case, we don't need a separate validation set, but we still need to hold out the test data.
A model is trained on k-1 folds of the training data and the remaining fold is used for validation. A mean and standard deviation of the metric are computed to see how well the model is likely to perform in practice.
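A sketch of the two scenarios together (dataset and models chosen arbitrarily): compare two methods with 5-fold CV on the training portion, report mean ± standard deviation, and keep a held-out test set for the final check.

```python
# Sketch: compare two methods with 5-fold CV (mean +/- std), keeping a held-out test set.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

for name, model in [("SVM", SVC()), ("Random Forest", RandomForestClassifier(random_state=0))]:
    scores = cross_val_score(model, X_train, y_train, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")

# The winning method is then retrained on all training data and evaluated once on the test set.
```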

Will non-linear regression algorithms perform better if trained with normally distributed target values?

After finding out about the many transformations that can be applied to the target values (y column) of a dataset, such as Box-Cox transformations, I learned that linear regression models need to be trained with normally distributed target values in order to be efficient (https://stats.stackexchange.com/questions/298/in-linear-regression-when-is-it-appropriate-to-use-the-log-of-an-independent-va).
I'd like to know whether the same applies to non-linear regression algorithms. So far I've seen people on Kaggle use a log transformation to mitigate heteroskedasticity when using xgboost, but they never mention whether it is also done to obtain normally distributed target values.
I've tried to do some research and found in Andrew Ng's lecture notes (http://cs229.stanford.edu/notes/cs229-notes1.pdf), on page 11, that the least-squares cost function, used by many algorithms both linear and non-linear, is derived by assuming a normal distribution of the error. I believe that if the error should be normally distributed, then the target values should be as well.
If this is true, then all regression algorithms using the least-squares cost function should work better with normally distributed target values.
Since xgboost uses a least-squares cost function for node splitting (http://cilvr.cs.nyu.edu/diglib/lsml/lecture03-trees-boosting.pdf - slide 13), maybe this algorithm would work better if I transform the target values using a Box-Cox transformation for training and then apply the inverse Box-Cox transformation to the output in order to get the predicted values.
Theoretically speaking, will this give better results?
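The pipeline described above (transform the target for training, invert the transform on the predictions) can be sketched as follows; the gradient-boosting model stands in for xgboost, the data are synthetic, and whether it helps is exactly the empirical question being asked:

```python
# Sketch of the described pipeline: Box-Cox-transform the target for training, then
# invert the transform on the predictions. Nothing guarantees this improves accuracy.
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = np.exp(X[:, 0] + 0.5 * rng.normal(size=500))   # skewed, strictly positive target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Box-Cox requires strictly positive targets; PowerTransformer also standardizes by default.
model = TransformedTargetRegressor(
    regressor=GradientBoostingRegressor(random_state=0),
    transformer=PowerTransformer(method="box-cox"),
)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))   # predictions are returned on the original scale
```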
Your conjecture "I believe if the error should be normally distributed then the target values should be as well." is totally wrong. So your question does not have any answer at all since it is not a valid question.
There are no assumptions on the target variable to be Normal at all.
Getting the target variable transformed does not mean the errors are normally distributed. In fact, that may ruin normality.
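For reference, the derivation in the CS229 notes cited in the question places the normality assumption on the error term, not on the marginal distribution of y; a sketch of that step:

```latex
% Sketch: Gaussian errors imply the least-squares cost.
% Assume y_i = \theta^\top x_i + \varepsilon_i with \varepsilon_i \sim \mathcal{N}(0, \sigma^2) i.i.d.
\log L(\theta)
  = \sum_{i=1}^{n} \log\!\left[ \frac{1}{\sqrt{2\pi}\,\sigma}
      \exp\!\left( -\frac{(y_i - \theta^\top x_i)^2}{2\sigma^2} \right) \right]
  = -\frac{n}{2}\log(2\pi\sigma^2)
    - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left( y_i - \theta^\top x_i \right)^2 .
% Maximizing \log L over \theta is the same as minimizing \sum_i (y_i - \theta^\top x_i)^2,
% the least-squares cost. The assumption is on \varepsilon, not on the distribution of y itself.
```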
I have no idea what this is supposed to mean: "linear regression models need to be trained with normally distributed target values in order to be efficient." Efficient in what way?
Linear regression models are global models. They simply fit a surface to the overall data. The operations are matrix operations, so the time to "train" the model depends only on the size of data. The distribution of the target has nothing to do with model building performance. And, it has nothing to do with model scoring performance either.
Because targets are generally not normally distributed, I would certainly hope that such a distribution is not required for a machine learning algorithm to work effectively.
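To illustrate the "matrix operations" point: the OLS coefficients come from a single linear-algebra solve whose cost depends only on the dimensions of the data, not on how the target is distributed (a sketch with a deliberately non-normal target):

```python
# Sketch: OLS as one linear-algebra solve; the target's distribution does not enter
# the cost of the computation, only the dimensions of X do.
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(1000), rng.normal(size=(1000, 3))])  # design matrix with intercept
y = rng.exponential(size=1000)                                    # deliberately non-normal target

beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # solves the least-squares problem directly
print(beta)
```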

PyMC: Hidden Markov Models

How suitable is PyMC in its currently available versions for modelling continuous emission HMMs?
I am interested in having a framework in which I can easily explore model variations, without having to rewrite the E- and M-steps and the dynamic-programming recursions for every change I make to the model.
More specific questions are:
When modelling an HMM in PyMC, can I solve the 'typical' tasks one would like to solve, i.e., besides parameter estimation, also infer the most likely state sequence (as usually done with the Viterbi algorithm) or solve a smoothing problem?
Compared to an implementation with Expectation Maximization, I would expect a sampling-based approach to be slower. If that buys me more flexibility on the model-building side, that is fine; I would imagine using PyMC for prototyping models. I am wondering, though, whether I can expect PyMC inference for models with > 10k observations to finish in a reasonable amount of time.
Would you recommend starting out with PyMC2 or PyMC3 for model building? I know that the inference engine changed between the versions, so I especially wonder which type of sampler might be better suited.
If you think PyMC is not a good choice for my use case, that definitely helps as an answer as well.
