I am accustomed to using the base_margin parameter in standard XGBoost to allow for an offset, i.e. a starting (transformed) prediction (see this SO question: SO xgboost exposure question). I wonder whether it is possible to do the same in the H2O implementation of XGBoost. In particular, I see an offset_column parameter, but I wonder whether it has really been implemented.
Good question -- this is not documented in the parameter description (we use a common definition of offset_column among all algos, and there's no note that it does not work in XGBoost). It is not functional, and you should get an error if you try to supply it.
R example:
library(h2o)
h2o.init()
fit <- h2o.xgboost(x = 1:3, y = "Species", offset_column = "Petal.Width",
                   training_frame = as.h2o(iris))
Gives error:
Error: water.exceptions.H2OModelBuilderIllegalArgumentException: Illegal argument(s) for XGBoost model: XGBoost_model_R_1520909592004_2. Details: ERRR on field: _offset_column: Offset is not supported for XGBoost.
I study the association between air pollution levels and the weekly number of cases of a respiratory disease controlling for various socioeconomic characteristics at the census block level in a state. I do not attempt to examine the causal effect of pollution - just the association between air pollution and the respiratory disease.
I want to explore the heterogeneity of the effect of pollution on the respiratory disease incidence across census blocks. I know that I can implement the partial (sorted) effects method and classification analysis using the SortedEffects R package. However, my main variable of interest, the level of pollution, is continuous and not binary. In this case, does it still make sense for me to use the package (its spe command, etc.)?
If I set var_type = "continuous", the spe command gives me the following error: "Error in FUN(left, right) : non-numeric argument to binary operator".
If I set var_type = "binary", which is not the case in 'real life', the command starts working, but then it gives new errors: "Error in quantile.default(t_hat, 1 - alpha) : missing values and NaN's not allowed if 'na.rm' is FALSE. In addition: Warning message: In predict.lm(model_fit, newdata = d1) : prediction from a rank-deficient fit may be misleading"
I do not know what I am doing wrong.
I am quite new to R, so sorry in advance.
Thank you.
This is the first time I am using the fpp2 package to make a linear forecast. I have successfully installed the package; however, I am getting an error when running the commands.
I have already converted the data to a time series using the ts command.
library(SPEI)
library(fpp2)
m <- read.delim("D:/PHD_UOM/PHD_Dissertation/PhD/PhD_R/mydata/mruspi.txt")
head(m)
y <- spi(ts(m$mru, freq = 12, start = c(1971, 1)), end = c(2019, 12), scale = 12)
y
forecast(y, 12)
naive(y, 12)
forecast(y,12)
Error in is.constant(y) :
'list' object cannot be coerced to type 'double'
naive(y,12)
Error in x[, (1 + cs[i]):cs[i + 1]] <- xx :
incorrect number of subscripts on matrix
The problem seems to be with the output of spi: according to the manual, it returns an object of class spi. This object is probably not a suitable input for the forecast function. You might need to use the fitted component of the spi object instead: y$fitted.
For the documentation of the SPEI package (specifically pp. 6-7), see: SPEI Documentation
There is a newer version of fpp, called fpp3. I recommend installing fpp3 for starters:
install.packages("fpp3")
There is an excellent book that demonstrates how to use fpp3, called Forecasting: Principles and Practice. It can be purchased via Amazon, or viewed for free from the author online: https://otexts.com/fpp3/. I am working my way through the book on my own (not in a class); it is very clear and extremely well written, and I strongly recommend using it to learn forecasting.
I am unable to load the library spei; R returns this error:
"package ‘spei’ is not available for this version of R"
(Note that R package names are case-sensitive, so it has to be loaded as library(SPEI).)
If you are able to update R and fpp, then an example of making a linear forecast would be:
library(fpp3)
library(tidyverse)
us_change %>%
  model(TSLM(Unemployment ~ Consumption + Production + Savings + season() + trend())) %>%
  report()
You can learn more about linear regression using fpp3 here: https://otexts.com/fpp3/regression-intro.html
# dynamic topic model
from collections import Counter
from gensim import corpora
from gensim.models import ldaseqmodel

def run_dtm(num_topics=18):
    # preprocessing() is my own data-loading function
    docs, years, titles = preprocessing(datasetType=2)
    # re-sort documents by year
    Z = zip(years, docs)
    Z = sorted(Z, reverse=False)
    years_new, docs_new = zip(*Z)
    # generate time slices
    time_slice = Counter(years_new).values()
    for year in Counter(years_new):
        print year, ' --- ', Counter(years_new)[year]
    print '********* data set loaded ********'
    dictionary = corpora.Dictionary(docs_new)
    corpus = [dictionary.doc2bow(text) for text in docs_new]
    print '********* train lda seq model ********'
    ldaseq = ldaseqmodel.LdaSeqModel(corpus=corpus, id2word=dictionary,
                                     time_slice=time_slice, num_topics=num_topics)
    print '********* lda seq model done ********'
    ldaseq.print_topics(time=1)
Hey guys, I'm using the dynamic topic model in the gensim package for topic analysis, following this tutorial: https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/ldaseqmodel.ipynb. However, I always get the same unexpected error. Can anyone give me some guidance? I'm really puzzled, even though I have tried several different datasets for generating the corpus and dictionary.
The error is like this:
/Users/Barry/anaconda/lib/python2.7/site-packages/gensim/models/ldaseqmodel.py:217: RuntimeWarning: divide by zero encountered in double_scalars
convergence = np.fabs((bound - old_bound) / old_bound)
The np.fabs error means NumPy is hitting a problem. What NumPy and gensim versions are you using?
NumPy no longer supports Python 2.7, and LdaSeqModel was added to gensim in 2016, so you might simply not have a compatible version available. If you are recoding a Python 3+ tutorial into a 2.7 variant, you clearly understand a bit about the version differences - try running it in, say, a 3.6.8 environment (you will have to upgrade sometime anyway; 2020 is the end of Python's own support for 2.7). That alone might help; I've gone through the tutorial with my own data and did not encounter this.
That being said, I have encountered the same error before when running LdaMulticore, and it was caused by an empty corpus.
Instead of running your code entirely inside a function, can you try going through it line by line (or look at your DEBUG-level log) and check whether your output has the expected properties: for example, that your corpus is not empty (and contains no empty documents)?
If that happens, fix the preprocessing steps and try again - that at least helped me and helped with the same ldamodel error in the mailing list.
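As a concrete illustration of that check, here is a minimal sketch in plain Python. The document list and variable names are hypothetical stand-ins for the output of the question's own preprocessing() function; since doc2bow turns an empty document into an empty list, the check reduces to looking for empty lists:

```python
# Hypothetical tokenized documents standing in for preprocessing() output.
docs_new = [["air", "pollution"], [], ["topic", "model", "topic"]]

# Drop documents that ended up empty after preprocessing; an empty
# document yields an empty bag-of-words vector and can break training.
non_empty = [doc for doc in docs_new if len(doc) > 0]

print(len(docs_new), len(non_empty))  # 3 2
```

Run the dictionary/corpus construction on the filtered list instead of the raw one.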
PS: not commenting because I lack the reputation, feel free to edit this.
This is an issue with the source code of ldaseqmodel.py itself.
With the latest gensim package (version 3.8.3) I am getting the same error, at line 293:
ldaseqmodel.py:293: RuntimeWarning: divide by zero encountered in double_scalars
convergence = np.fabs((bound - old_bound) / old_bound)
Now, if you go through the code, you will see that the difference between bound and old_bound is divided by old_bound (which is also visible in the warning).
If you analyze further, you will see that at line 263 old_bound is initialized to zero, and this is the main reason you are getting the divide-by-zero warning.
For further information, I put a print statement at line 294:
print('bound = {}, old_bound = {}'.format(bound, old_bound))
The output I received confirmed that old_bound is zero on the first iteration.
So, in short: you are getting this warning because of the source code of ldaseqmodel.py itself, not because of any empty document. That said, if you do not remove empty documents from your corpus, you will receive a different warning. So, if there are any empty documents in your corpus, remove them, and simply ignore the division-by-zero warning above.
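The mechanism is easy to reproduce in isolation with NumPy alone; the bound value below is made up, and only old_bound being zero matters:

```python
import numpy as np

old_bound = np.float64(0.0)   # as initialized in ldaseqmodel.py
bound = np.float64(-1500.0)   # hypothetical bound from the first sweep

with np.errstate(divide="ignore"):  # silence the RuntimeWarning for the demo
    convergence = np.fabs((bound - old_bound) / old_bound)

print(convergence)  # inf
```

Without the errstate guard, this emits exactly the "divide by zero encountered" RuntimeWarning from the question.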
I'm working on the C++ version of Matt Zucker's Page dewarping. So far everything works fine, but I have a problem with optimization. In line 748 of Github repo Matt uses optimize function from Scipy. My C++ equivalent is find_min_bobyqa from dlib.net. The code is:
auto f = [&](const column_vector& ppts) { return objective(dstpoints, ppts, keypoint_index); };
dlib::find_min_bobyqa(f,
                      params,
                      2 * params.nr() + 1,  // npt - number of interpolation points: x.size() + 2 <= npt && npt <= (x.size()+1)*(x.size()+2)/2
                      dlib::uniform_matrix<double>(params.nr(), 1, -2),  // lower bound constraint
                      dlib::uniform_matrix<double>(params.nr(), 1, 2),   // upper bound constraint
                      1,     // initial trust region radius
                      1e-5,  // stopping trust region radius
                      4000   // max number of objective function evaluations
);
In my concrete example, params is a dlib::column_vector of double values with length 189. Every element of params is less than 2.0 and greater than -2.0. The function objective() returns a double, and on its own it works properly: I get the same value as in the Python version. But after running find_min_bobyqa I usually get the message:
terminate called after throwing an instance of 'dlib::bobyqa_failure', return from BOBYQA because the objective function has been called max_f_evals times.
I set max_f_evals to quite a big value to see if it optimizes at all, but it doesn't. I did some tweaking of the parameters, but without good results. How should I set the parameters of find_min_bobyqa to get the right solution?
I am very interested in this issue as well. Zucker's work, with very minor tweaks, is ideal for straightening sheet music images, and I was looking for ways to implement it in a mobile platform when I came across your question.
My research so far suggests that BOBYQA is not the equivalent of Powell's method in scipy: BOBYQA is bound-constrained, while the scipy one is not.
See these links for more information, and a possible way to compile the right supporting library - I would try UOBYQA or NEWUOA.
https://github.com/jacobwilliams/PowellOpt
https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html#rdd2e1855725e-3
(See the Notes section)
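For reference, this is roughly what the Python side is doing; a toy sketch of calling scipy's derivative-free Powell method on a hypothetical stand-in objective (the real dewarping objective has 189 parameters):

```python
import numpy as np
from scipy.optimize import minimize

# Stand-in for the dewarping objective: any smooth scalar function
# of the parameter vector works for the demo.
def objective(p):
    return np.sum((p - 0.5) ** 2)

x0 = np.zeros(4)  # initial guess
res = minimize(objective, x0, method="Powell")  # no bounds, no derivatives
print(res.x)  # each component converges to ~0.5
```

Note how no bound constraints are passed here, unlike the find_min_bobyqa call above.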
EDIT: see C version here:
https://github.com/emmt/Algorithms/tree/master/newuoa
I wanted to post this as a comment, but I don't have enough points for that.
I am very interested in your progress. If you're willing, please keep me posted.
I finally solved this problem. I used the PRAXIS library, because it doesn't need derivative information and is fast.
I modified the code a little to my needs, and now it is a few seconds faster than the original version written in Python.
I had a use-case that I thought was really simple but couldn't find a way to do it with h2o. I thought you might know.
I want to train my model once, and then evaluate its ROC on a few different test sets (e.g. a validation set and a test set, though in reality I have more than 2) without having to retrain the model. The way I know to do it now requires retraining the model each time:
train, valid, test = fr.split_frame([0.2, 0.25], seed=1234)
rf_v1 = H2ORandomForestEstimator( ... )
rf_v1.train(features, var_y, training_frame=train, validation_frame=valid)
roc = rf_v1.roc(valid=1)
rf_v1.train(features, var_y, training_frame=train, validation_frame=test) # training again with the same training set - can I avoid this?
roc2 = rf_v1.roc(valid=1)
I can also use model_performance(), which gives me some metrics on an arbitrary test set without retraining, but not the ROC. Is there a way to get the ROC out of the H2OModelMetrics object?
Thanks!
You can use H2O Flow to inspect model performance. Simply go to http://localhost:54321/flow/index.html (if you changed the default port, change it in the link), type getModel "rf_v1" in a cell, and it will show you all the measurements of the model across multiple cells in the Flow. It's quite handy.
If you are using Python, you can find the performance in your IDE like this:
rf_perf1 = rf_v1.model_performance(test)
and then print the ROC like this:
print(rf_perf1.auc())
Yes, indirectly. Get the TPRs and FPRs from the H2OModelMetrics object:
out = rf_v1.model_performance(test)
fprs = out.fprs
tprs = out.tprs
roc = zip(fprs, tprs)
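If you also want an AUC-style summary directly from those pairs, a minimal sketch using the trapezoidal rule; the pairs below are hypothetical, and in practice they would come from out.fprs and out.tprs as above:

```python
# Hypothetical (fpr, tpr) pairs, sorted by increasing FPR.
pairs = sorted([(0.0, 0.0), (0.2, 0.6), (0.5, 0.9), (1.0, 1.0)])

auc = 0.0
for (x0, y0), (x1, y1) in zip(pairs, pairs[1:]):
    auc += (x1 - x0) * (y0 + y1) / 2.0  # trapezoid between adjacent points

print(round(auc, 3))  # 0.76
```

For a real model you would compare this against the value reported by auc() as a sanity check.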
(By the way, my H2ORandomForestEstimator object does not seem to have an roc() method at all, so I'm not 100% sure that this output is in the exact same format. I'm using h2o version 3.10.4.7.)