I'm trying to use cross validation in Stanford NER. The feature factory lists 3 properties:
numFolds (int, default 1): The number of folds to use for cross-validation.
startFold (int, default 1): The starting fold to run.
numFoldsToRun (int, default 1): The number of folds to run.
which I think should be used for cross-validation. But they don't seem to actually work: setting numFolds to 1 or to 10 doesn't change the training time at all, and, strangely, setting numFoldsToRun gives the following warning:
Unknown property: |numFoldsToRun|
You're right. These options haven't been implemented. If you want to run cross-validation experiments, you'll have to do it completely manually by preparing the data sets yourself. (Sorry!)
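Since the folds have to be prepared by hand, here is a minimal sketch of doing that in Python, under the assumption that the training data is a CoNLL-style file in which sentences are blocks of token lines separated by blank lines (the function names are mine, not part of Stanford NER):

```python
# Minimal sketch of preparing k folds by hand from a CoNLL-style NER file,
# where sentences are blocks of token lines separated by blank lines.
def split_sentences(text):
    """Split file contents into sentence blocks (separated by blank lines)."""
    return [blk for blk in text.strip().split("\n\n") if blk.strip()]

def make_folds(sentences, k):
    """Assign sentences round-robin to k folds."""
    folds = [[] for _ in range(k)]
    for i, sent in enumerate(sentences):
        folds[i % k].append(sent)
    return folds

def train_test_split_for_fold(folds, i):
    """Fold i becomes the held-out set; the rest are the training set."""
    test = folds[i]
    train = [s for j, fold in enumerate(folds) if j != i for s in fold]
    return train, test
```

You would then write each train/test pair back out with "\n\n".join(...) plus a trailing newline, and train one model per fold with your usual props file, averaging the scores at the end.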
I'm using caret to compare models for a classification problem with nested CV: V-fold in the outer loop and the bootstrap (500 replicates) in the inner loop. I get this error after training knn:
Warning: There were missing values in resampled performance measures.
Which I believe comes from the fact that some resamples have zero items of the class of interest in the holdout sample, yielding NA for Sensitivity and ROC. My question is: is there any way to ensure that items from this class are present in every bootstrap resample? Kind of like what the createDataPartition function does (I believe this is also called a stratified bootstrap?).
If not, how should we proceed with this? (In terms of comparing model performance on the same resamples)
Thanks!
So I couldn't find a way to do this within caret, but here is a workaround using the rsample package. The idea is to compute the resamples beforehand and feed that information to the trainControl function via the index and indexOut arguments, after first converting it to caret's format.
library(caret)
library(rsample)

indices <- bootstraps(train, times = 50, strata = "class_of_interest")
indices <- rsample2caret(indices)
train_control <- trainControl(method = "boot", number = 50,
                              index = indices$index, indexOut = indices$indexOut)
Hope this helps.
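The stratified-bootstrap idea behind that workaround is simple, and language-agnostic: resample with replacement within each class separately, so every resample is guaranteed to contain items of the minority class. A concept sketch in plain Python (not caret/rsample code; the function name is mine):

```python
import random

# Concept sketch of a stratified bootstrap: sample with replacement
# within each class separately, so every class present in the data is
# also present in every resample.
def stratified_bootstrap(labels, seed=0):
    """Return indices of one bootstrap resample, stratified by class label."""
    rng = random.Random(seed)
    by_class = {}
    for i, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(i)
    resample = []
    for idx in by_class.values():
        resample.extend(rng.choices(idx, k=len(idx)))
    return sorted(resample)

labels = ["pos"] * 2 + ["neg"] * 8
idx = stratified_bootstrap(labels)
# Both classes are always represented in the resample:
assert {labels[i] for i in idx} == {"pos", "neg"}
```

An ordinary (unstratified) bootstrap of the same data can easily miss both "pos" items, which is exactly what produces the NA Sensitivity/ROC values in the question.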
I study the association between air pollution levels and the weekly number of cases of a respiratory disease controlling for various socioeconomic characteristics at the census block level in a state. I do not attempt to examine the causal effect of pollution - just the association between air pollution and the respiratory disease.
I want to explore the heterogeneity of the effect of pollution on the respiratory disease incidence across census blocks. I know that I can implement the partial (sorted) effects method and classification analysis using the SortedEffects R package. However, my main variable of interest, the level of pollution, is continuous and not binary. In this case, does it still make sense for me to use the package (its spe command, etc.)?
If I set var_type = "continuous", the spe command gives me the following error: "Error in FUN(left, right) : non-numeric argument to binary operator".
If I set var_type = "binary", which is not the case in 'real life', the command starts working, but then it gives new errors: "Error in quantile.default(t_hat, 1 - alpha) : missing values and NaN's not allowed if 'na.rm' is FALSE. In addition: Warning message: In predict.lm(model_fit, newdata = d1) : prediction from a rank-deficient fit may be misleading"
I do not know what I am doing wrong.
I am quite new to R, so sorry in advance.
Thank you.
This is the first time I am using the fpp2 package to make a linear forecast. I have successfully installed the package, but I am getting errors when running the commands. I have already converted the data to a time series using the ts command.
library(SPEI)
library(fpp2)

m <- read.delim("D:/PHD_UOM/PHD_Dissertation/PhD/PhD_R/mydata/mruspi.txt")
head(m)

y <- spi(ts(m$mru, freq = 12, start = c(1971, 1)), end = c(2019, 12), scale = 12)
y

forecast(y, 12)
naive(y, 12)
forecast(y,12)
Error in is.constant(y) :
'list' object cannot be coerced to type 'double'
naive(y,12)
Error in x[, (1 + cs[i]):cs[i + 1]] <- xx :
incorrect number of subscripts on matrix
The problem seems to be with the output of spi: according to the manual, it returns an object of class spi, which is not a suitable input for the forecast function. You might need to use the fitted component of the spi object instead: y$fitted.
For the documentation of the SPEI package (specifically pp. 6-7), see: SPEI Documentation
There is a newer version of fpp, called fpp3. I recommend installing fpp3 for starters:
install.packages("fpp3")
There is an excellent book that demonstrates how to use fpp3, called Forecasting: Principles and Practice. It can be purchased via Amazon, or read for free online at https://otexts.com/fpp3/. I am working my way through the book on my own (not in a class); it is very clear and extremely well written, and I strongly recommend using it to learn forecasting.
I am unable to load the spei library; R returns this error:
"package 'spei' is not available for this version of R"
If you are able to update R and fpp, then an example of making a linear forecast would be:
library(fpp3)
library(tidyverse)

us_change %>%
  model(TSLM(Unemployment ~ Consumption + Production + Savings + season() + trend())) %>%
  report()
You can learn more about linear regression using fpp3 here: https://otexts.com/fpp3/regression-intro.html
I may have found a bug in the POS tagger. The tagging results change depending on whether the "-tokenizerOptions" flag is set to "normalizeParentheses=true" or "normalizeParentheses=false". I'm accessing the tagger from Python using a server set up via:
pos_args = ['java', '-mx400m', '-cp', homedir + '/models/stanfordpostagger.jar',
            'edu.stanford.nlp.tagger.maxent.MaxentTaggerServer',
            '-model', 'english-bidirectional-distsim.tagger',
            '-port', '2021',
            '-loadClassifier', 'english.all.3class.distsim.crf.ser.gz',
            '-tokenizerOptions', 'normalizeParentheses=true']
POS = Popen(pos_args)
and I use the SNER package to actually do the tagging.
If I tag the sentence "(Bob is nice)" with normalizeParentheses=true, I get:
[(u'-LRB-', u'-LRB-'),
(u'Bob', u'NNP'),
(u'is', u'VBZ'),
(u'nice', u'JJ'),
(u'-RRB-', u'-RRB-')]
But if I use normalizeParentheses=false, I get:
[(u'(', u'NNP'),
(u'Bob', u'NNP'),
(u'is', u'VBZ'),
(u'nice', u'JJ'),
(u')', u'NN')]
and this version of the tagger also marks many words as foreign ('FW') when they aren't.
I've tried experimenting with many other options, and only this one and normalizeOtherBrackets=false seem to cause this behavior. It is as if these two options cause a totally different tagger method to be used. I'm curious whether this is indeed a bug, or whether there is a clever workaround?
You need to normalize the parentheses when using the POS tagger. It was trained on data that has normalized parentheses.
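If for some reason you need the tokenizer's normalization switched off, one workaround (my sketch, not part of the original answer) is to apply the Penn Treebank bracket escapes yourself before handing tokens to the tagger; the mapping below is the standard PTB one that the normalization options produce:

```python
# Minimal sketch: manually apply Penn Treebank bracket normalization to a
# token list before tagging, mirroring what normalizeParentheses /
# normalizeOtherBrackets do in the Stanford tokenizer.
PTB_BRACKETS = {
    "(": "-LRB-", ")": "-RRB-",
    "[": "-LSB-", "]": "-RSB-",
    "{": "-LCB-", "}": "-RCB-",
}

def normalize_brackets(tokens):
    """Replace raw bracket tokens with their PTB escape sequences."""
    return [PTB_BRACKETS.get(tok, tok) for tok in tokens]

print(normalize_brackets(["(", "Bob", "is", "nice", ")"]))
# ['-LRB-', 'Bob', 'is', 'nice', '-RRB-']
```

Since the model was trained on normalized text, feeding it the escaped forms keeps its input consistent with its training data, which is exactly why the unnormalized run mis-tags the raw parentheses.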
I had a use-case that I thought was really simple but couldn't find a way to do it with h2o. I thought you might know.
I want to train my model once, and then evaluate its ROC on a few different test sets (e.g. a validation set and a test set, though in reality I have more than 2) without having to retrain the model. The way I know to do it now requires retraining the model each time:
train, valid, test = fr.split_frame([0.2, 0.25], seed=1234)
rf_v1 = H2ORandomForestEstimator( ... )
rf_v1.train(features, var_y, training_frame=train, validation_frame=valid)
roc = rf_v1.roc(valid=1)
rf_v1.train(features, var_y, training_frame=train, validation_frame=test) # training again with the same training set - can I avoid this?
roc2 = rf_v1.roc(valid=1)
I can also use model_performance(), which gives me some metrics on an arbitrary test set without retraining, but not the ROC. Is there a way to get the ROC out of the H2OModelMetrics object?
Thanks!
You can use H2O Flow to inspect the model performance. Simply go to http://localhost:54321/flow/index.html (if you changed the default port, change it in the link), type getModel "rf_v1" in a cell, and it will show you all the metrics of the model in multiple cells in the flow. It's quite handy.
If you are using Python, you can find the performance in your IDE like this:
rf_perf1 = rf_v1.model_performance(test)
and then print the AUC like this:
print(rf_perf1.auc())
Yes, indirectly. Get the TPRs and FPRs from the H2OModelMetrics object:
out = rf_v1.model_performance(test)
fprs = out.fprs
tprs = out.tprs
roc = zip(fprs, tprs)
(By the way, my H2ORandomForestEstimator object does not seem to have an roc() method at all, so I'm not 100% sure that this output is in the exact same format. I'm using h2o version 3.10.4.7.)
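If you want to plot the curve or sanity-check the AUC that h2o reports, you can also work directly from those (FPR, TPR) pairs. A small plain-Python sketch using the trapezoidal rule, assuming fprs and tprs are ordinary lists like the ones extracted above:

```python
# Approximate the area under an ROC curve by the trapezoidal rule,
# given parallel lists of false-positive and true-positive rates.
def auc_from_roc(fprs, tprs):
    """Return the trapezoidal-rule area under the (FPR, TPR) curve."""
    pts = sorted(zip(fprs, tprs))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# A perfect classifier's ROC passes through (0, 1), so its AUC is 1.0:
print(auc_from_roc([0.0, 0.0, 1.0], [0.0, 1.0, 1.0]))
# 1.0
```

The same pairs can be fed straight into a plotting library of your choice to draw the curve itself.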