How to make a forecast using the fpp2 package?

This is the first time I am using the fpp2 package to make a linear forecast. I have successfully installed the package; however, I am getting errors when running the commands.
I have already converted the data to a time series using the ts() command.
library(SPEI)
library(fpp2)
m <- read.delim("D:/PHD_UOM/PHD_Dissertation/PhD/PhD_R/mydata/mruspi.txt")
head(m)
y <- spi(ts(m$mru, freq = 12, start = c(1971, 1)), end = c(2019, 12), scale = 12)
y
forecast(y, 12)
naive(y, 12)
forecast(y, 12)
Error in is.constant(y) :
  'list' object cannot be coerced to type 'double'
naive(y, 12)
Error in x[, (1 + cs[i]):cs[i + 1]] <- xx :
  incorrect number of subscripts on matrix

The problem seems to be with the output of spi(): according to the manual, it returns an object of class spi, which is not a suitable input for the forecast() function. You might need to use the fitted component of the spi object instead, i.e. y$fitted.
See the documentation of the SPEI package (specifically pp. 6-7): SPEI Documentation
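For example (a sketch, assuming y$fitted is the univariate ts of SPI values that the SPEI docs describe):
library(fpp2)
f <- na.omit(y$fitted)     # fitted SPI series; drop the leading NAs the 12-month scale creates
fc <- forecast(f, h = 12)  # 12-month-ahead forecast
autoplot(fc)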

There is a newer version of fpp, called fpp3. I recommend installing fpp3 for starters:
install.packages("fpp3")
There is an excellent book that demonstrates how to use fpp3, called Forecasting: Principles and Practice. It can be purchased via Amazon or read for free online at https://otexts.com/fpp3/. I am working my way through the book on my own (not in a class); it is very clear and extremely well written, and I very strongly recommend using it to learn forecasting.
I am unable to load the library spei; R returns this error:
"package ‘spei’ is not available for this version of R"
(Note that R package names are case-sensitive: the package is SPEI, not spei.)
If you are able to update R and fpp, then an example of making a linear forecast would be:
library(fpp3)
library(tidyverse)
us_change %>%
  model(TSLM(Unemployment ~ Consumption + Production + Savings + season() + trend())) %>%
  report()
You can learn more about linear regression using fpp3 here: https://otexts.com/fpp3/regression-intro.html
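Note that the TSLM call above only fits and reports the regression; to produce actual forecasts you need future values of the regressors, or a model built from trend and season terms alone. A minimal sketch of the latter (same us_change data, trend/season only so no future regressors are needed):
library(fpp3)
us_change %>%
  model(TSLM(Unemployment ~ trend() + season())) %>%
  forecast(h = "2 years") %>%
  autoplot(us_change)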

Related

SortedEffects R Package

I study the association between air pollution levels and the weekly number of cases of a respiratory disease controlling for various socioeconomic characteristics at the census block level in a state. I do not attempt to examine the causal effect of pollution - just the association between air pollution and the respiratory disease.
I want to explore the heterogeneity of the effect of pollution on the respiratory disease incidence across census blocks. I know that I can implement the partial (sorted) effects method and classification analysis using the SortedEffects R package. However, my main variable of interest, the level of pollution, is continuous and not binary. In this case, does it still make sense for me to use the package (its spe command, etc.)?
If I set var_type = "continuous", the spe command gives me the following error: "Error in FUN(left, right) : non-numeric argument to binary operator".
If I set var_type = "binary", which is not the case in 'real life', the command starts working, but then it gives new errors: "Error in quantile.default(t_hat, 1 - alpha) : missing values and NaN's not allowed if 'na.rm' is FALSE. In addition: Warning message: In predict.lm(model_fit, newdata = d1) : prediction from a rank-deficient fit may be misleading".
I do not know what I am doing wrong.
I am quite new to R, so sorry in advance.
Thank you.

Gensim's FastText KeyedVector out of vocab

I want to use the read-only version of Gensim's FastText Embedding to save some RAM compared to the full model.
After loading the KeyedVectors version, I get the following error when fetching a vector:
IndexError: index 878080 is out of bounds for axis 0 with size 761210
The error occurs when using words that should be out of vocabulary, e.g. "lawyerxy" instead of "lawyer". The full model returns a vector for both.
from gensim.models import KeyedVectors

model = KeyedVectors.load("model.kv")
model.wv.__getitem__("lawyerxy")
So, my assumption is that the KeyedVectors do not offer FastText's out-of-vocabulary function, a key feature for my use case. This limitation is not mentioned in the documentation:
https://radimrehurek.com/gensim/models/word2vec.html
Can anyone confirm that assumption and/or name a fix to allow vectors for "lawyerxy" etc.?
The KeyedVectors name is (as of gensim-3.8.0) just an alias for class Word2VecKeyedVectors, which only maintains a simple word (as key) to vector (as value) mapping.
You shouldn't expect FastText's advanced ability to synthesize vectors for out-of-vocabulary words to appear in any model/representation that doesn't explicitly claim to offer that ability.
(I would expect a lookup of an out-of-vocabulary word to give a clearer KeyError rather than the IndexError you've reported. But, you'd need to show exactly what code created the file you're loading, and triggered the error, and the full error stack, to further guess what's going wrong in your case.)
Depending on how your model.kv file was saved, you might be able to load it, with retained OOV-vector functionality, by using the class FastTextKeyedVectors instead of plain KeyedVectors.
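For example, a minimal sketch of that alternative (assuming model.kv was saved from a FastText model's vectors, so the char-n-gram buckets are present in the file):
from gensim.models.keyedvectors import FastTextKeyedVectors

kv = FastTextKeyedVectors.load("model.kv")  # FastText-aware load of the same file
vec = kv["lawyerxy"]  # OOV word: the vector is synthesized from its char n-grams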

gensim/models/ldaseqmodel.py:217: RuntimeWarning: divide by zero encountered in double_scalars

/Users/Barry/anaconda/lib/python2.7/site-packages/gensim/models/ldaseqmodel.py:217: RuntimeWarning: divide by zero encountered in double_scalars
convergence = np.fabs((bound - old_bound) / old_bound)
# dynamic topic model (Python 2.7)
from collections import Counter
from gensim import corpora
from gensim.models import ldaseqmodel

def run_dtm(num_topics=18):
    docs, years, titles = preprocessing(datasetType=2)  # own helper, defined elsewhere
    # re-sort documents by year
    Z = zip(years, docs)
    Z = sorted(Z, reverse=False)
    years_new, docs_new = zip(*Z)
    # generate time slices
    time_slice = Counter(years_new).values()
    for year in Counter(years_new):
        print year, ' --- ', Counter(years_new)[year]
    print '********* data set loaded ********'
    dictionary = corpora.Dictionary(docs_new)
    corpus = [dictionary.doc2bow(text) for text in docs_new]
    print '********* train lda seq model ********'
    ldaseq = ldaseqmodel.LdaSeqModel(corpus=corpus, id2word=dictionary,
                                     time_slice=time_slice, num_topics=num_topics)
    print '********* lda seq model done ********'
    ldaseq.print_topics(time=1)
Hey guys, I'm using the dynamic topic models in the gensim package for topic analysis, following this tutorial: https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/ldaseqmodel.ipynb. However, I always get the same unexpected error. Can anyone give me some guidance? I'm really puzzled, even though I have tried several different datasets for generating the corpus and dictionary.
The error is like this:
/Users/Barry/anaconda/lib/python2.7/site-packages/gensim/models/ldaseqmodel.py:217: RuntimeWarning: divide by zero encountered in double_scalars
convergence = np.fabs((bound - old_bound) / old_bound)
The np.fabs in the traceback means the warning is coming from NumPy. Which NumPy and gensim versions are you using?
NumPy no longer supports Python 2.7, and LdaSeqModel was added to gensim in 2016, so you might simply not have a compatible version available. If you are recoding a Python 3+ tutorial into a 2.7 variant, you obviously understand a little bit about the version differences; try running it in, say, a 3.6.8 environment (you will have to upgrade sometime anyway: 2020 is the end of Python 2.7 support). That alone might already help; I went through the tutorial and did not encounter this with my own data.
That being said, I have encountered the same error before when running LdaMulticore, and it was caused by an empty corpus.
Instead of running your code fully inside a function, can you try going through it line by line (or looking at your DEBUG-level log) and checking whether your output has the expected properties: for example, that your corpus is not empty (and contains no empty documents)?
If that happens, fix the preprocessing steps and try again; that at least helped me, and it fixed the same ldamodel error on the mailing list.
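For example, a quick check along those lines (a sketch; corpus and docs_new are the variables from the question's code):
# find documents that ended up empty after preprocessing
empty_ids = [i for i, bow in enumerate(corpus) if len(bow) == 0]
print('empty documents: %s' % empty_ids)
# if any show up, drop them from docs_new, rebuild corpus and dictionary,
# and regenerate time_slice so the per-year counts still line up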
PS: not commenting because I lack the reputation, feel free to edit this.
This is an issue with the source code of ldaseqmodel.py itself.
With the latest gensim package (version 3.8.3) I am getting the same error at line 293:
ldaseqmodel.py:293: RuntimeWarning: divide by zero encountered in double_scalars
convergence = np.fabs((bound - old_bound) / old_bound)
Now, if you go through the code you will see this:
[screenshot of the ldaseqmodel.py source around line 293]
You can see that they divide the difference between bound and old_bound by old_bound (which is also visible in the warning).
If you analyze further, you will see that at line 263 old_bound is initialized to zero, and this is the main reason you are getting the divide-by-zero warning:
[screenshot of the old_bound initialization at line 263]
For further information, I put a print statement at line 294:
print('bound = {}, old_bound = {}'.format(bound, old_bound))
The output I received is:
[screenshot of the printed bound and old_bound values]
So, in short, you are getting this warning because of the source code of ldaseqmodel.py itself, not because of any empty document. (If you do not remove empty documents from your corpus, though, you will receive a different warning.) My suggestion: if there are any empty documents in your corpus, remove them, and simply ignore this divide-by-zero warning.
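If you want to silence just this warning while training, NumPy's errstate context manager can do that (a sketch; the LdaSeqModel call is the one from the question above):
import numpy as np

# suppress only divide-by-zero floating-point warnings for this call
with np.errstate(divide='ignore'):
    ldaseq = ldaseqmodel.LdaSeqModel(corpus=corpus, id2word=dictionary,
                                     time_slice=time_slice, num_topics=num_topics)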

Parameters for dlib::find_min_bobyqa

I'm working on the C++ version of Matt Zucker's Page dewarping. So far everything works fine, but I have a problem with optimization. At line 748 of the GitHub repo, Matt uses the optimize function from SciPy. My C++ equivalent is find_min_bobyqa from dlib.net. The code is:
auto f = [&](const column_vector& ppts) { return objective(dstpoints, ppts, keypoint_index); };
dlib::find_min_bobyqa(f,
    params,
    2 * params.nr() + 1,  // npt - number of interpolation points: x.size() + 2 <= npt && npt <= (x.size()+1)*(x.size()+2)/2
    dlib::uniform_matrix<double>(params.nr(), 1, -2),  // lower bound constraint
    dlib::uniform_matrix<double>(params.nr(), 1, 2),   // upper bound constraint
    1,     // initial trust region radius
    1e-5,  // stopping trust region radius
    4000   // max number of objective function evaluations
);
In my concrete example, params is a dlib::column_vector with double values and length 189. Every element of params is less than 2.0 and greater than -2.0. The function objective() returns a double value, and on its own it works properly, because I get the same value as in the Python version. But after running the find_min_bobyqa function I usually get the message:
terminate called after throwing an instance of 'dlib::bobyqa_failure', return from BOBYQA because the objective function has been called max_f_evals times.
I set max_f_evals to quite a big value to see if it optimizes at all, but it doesn't. I did some tweaking with the parameters, but without good results. How should I set the parameters of find_min_bobyqa to get the right solution?
I am very interested in this issue as well. Zucker's work, with very minor tweaks, is ideal for straightening sheet music images, and I was looking for ways to implement it on a mobile platform when I came across your question.
My research so far suggests that BOBYQA is not the equivalent of Powell's method in scipy. BOBYQA is constrained, and the one in scipy is not.
See these links for more information, and a possible way to compile the right supporting library - I would try UOBYQA or NEWUOA.
https://github.com/jacobwilliams/PowellOpt
https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html#rdd2e1855725e-3
(See the Notes section)
EDIT: see C version here:
https://github.com/emmt/Algorithms/tree/master/newuoa
I wanted to post this as a comment, but I don't have enough points for that.
I am very interested in your progress. If you're willing, please keep me posted.
I finally solved this problem. I used the PRAXIS library, because it doesn't need derivative information and is fast.
I modified the code a little to fit my needs, and now it is around a few seconds faster than the original version written in Python.

xgboost implementation in h2o offset

I am accustomed to using the base_margin parameter in standard xgboost to allow for an offset, i.e. a starting (transformed) prediction (see this SO question: SO xgboost exposure question). I wonder whether it is possible to do the same in the h2o implementation of xgboost. In particular, I see an offset_column parameter, but I wonder whether it has really been implemented.
Good question -- this is not documented in the parameter description (we use a common definition of offset_column among all algos, and there's no note that it does not work in XGBoost). It is not functional, and you should get an error if you try to supply it.
R example:
library(h2o)
h2o.init()
fit <- h2o.xgboost(x = 1:3, y = "Species", offset_column = "Petal.Width",
                   training_frame = as.h2o(iris))
Gives error:
Error: water.exceptions.H2OModelBuilderIllegalArgumentException: Illegal argument(s) for XGBoost model: XGBoost_model_R_1520909592004_2. Details: ERRR on field: _offset_column: Offset is not supported for XGBoost.
