How to jointly use makeFeatSelWrapper and resample function in mlr - cross-validation

I'm fitting classification models for binary issues using MLR package in R. For each model, I perform a cross-validation with embedded feature selection using "selectFeatures" function. In output, I retrieve mean AUCs over test sets and predictions. To do so, after having get some advices (Get predictions on test sets in MLR), I use "makeFeatSelWrapper" function in combination with "resample" function. The goal seems to be achieved but results are strange. With a logistic regression as classifier, I get an AUC of 0.5 which means no variable selected. This result is unexpected as I get an AUC of 0.9824432 with this classifier using the method mentioned in the linked question. With a neural network as classifier, I get an error message
Error in sum(x) : invalid 'type' (list) of argument
What is wrong?
Here is the sample code:
# 1. Find a synthetic dataset for supervised learning (two classes)
###################################################################
install.packages("mlbench")
library(mlbench)
data(BreastCancer)
# generate 1000 rows, 21 quantitative candidate predictors and 1 target variable
p<-mlbench.waveform(1000)
# convert list into dataframe
dataset<-as.data.frame(p)
# drop thrid class to get 2 classes
dataset2 = subset(dataset, classes != 3)
# 2. Perform cross validation with embedded feature selection using logistic regression
#######################################################################################
library(BBmisc)
library(nnet)
library(mlr)
# Choice of data
mCT <- makeClassifTask(data =dataset2, target = "classes")
# Choice of algorithm i.e. neural network
mL <- makeLearner("classif.logreg", predict.type = "prob")
# Choice of cross-validations for folds
outer = makeResampleDesc("CV", iters = 10,stratify = TRUE)
# Choice of feature selection method
ctrl = makeFeatSelControlSequential(method = "sffs", maxit = NA,alpha = 0.001)
# Choice of hold-out sampling between training and test within the fold
inner = makeResampleDesc("Holdout",stratify = TRUE)
lrn = makeFeatSelWrapper(mL, resampling = inner, control = ctrl)
r = resample(lrn, mCT, outer, extract = getFeatSelResult,measures = list(mlr::auc,mlr::acc,mlr::brier),models=TRUE)
# 3. Perform cross validation with embedded feature selection using neural network
##################################################################################
library(BBmisc)
library(nnet)
library(mlr)
# Choice of data
mCT <- makeClassifTask(data =dataset2, target = "classes")
# Choice of algorithm i.e. neural network
mL <- makeLearner("classif.nnet", predict.type = "prob")
# Choice of cross-validations for folds
outer = makeResampleDesc("CV", iters = 10,stratify = TRUE)
# Choice of feature selection method
ctrl = makeFeatSelControlSequential(method = "sffs", maxit = NA,alpha = 0.001)
# Choice of sampling between training and test within the fold
inner = makeResampleDesc("Holdout",stratify = TRUE)
lrn = makeFeatSelWrapper(mL, resampling = inner, control = ctrl)
r = resample(lrn, mCT, outer, extract = getFeatSelResult,measures = list(mlr::auc,mlr::acc,mlr::brier),models=TRUE)

If you run your logistic regression part of the code a couple of times, you should also get the Error in sum(x) : invalid 'type' (list) of argument error. However, I find it strange that fixing a particular seed (e.g., set.seed(1)) before resampling does not ensure that the error does or does not appear.
The error occurs in internal mlr code for printing the output of feature selection to the console. A very simple workaround is to simply avoid printing such output with show.info = FALSE in makeFeatSelWrapper (see code below). While this removes the error, it is possible that what caused it may have other consequences, although I it is possible the error only affects the printing code.
When running your code, I only get AUC above 0.90. Please find below a your code for logistic regression, slightly re-organized and with the workaround. I have added a droplevels() to the dataset2 to remove the missing level 3 from the factor, though this is not related with the workaround.
library(mlbench)
library(mlr)
data(BreastCancer)
p<-mlbench.waveform(1000)
dataset<-as.data.frame(p)
dataset2 = subset(dataset, classes != 3)
dataset2 <- droplevels(dataset2 )
mCT <- makeClassifTask(data =dataset2, target = "classes")
ctrl = makeFeatSelControlSequential(method = "sffs", maxit = NA,alpha = 0.001)
mL <- makeLearner("classif.logreg", predict.type = "prob")
inner = makeResampleDesc("Holdout",stratify = TRUE)
lrn = makeFeatSelWrapper(mL, resampling = inner, control = ctrl, show.info = FALSE)
# uncomment this for the error to appear again. Might need to run the code a couple of times to see the error
# lrn = makeFeatSelWrapper(mL, resampling = inner, control = ctrl)
outer = makeResampleDesc("CV", iters = 10,stratify = TRUE)
r = resample(lrn, mCT, outer, extract = getFeatSelResult,measures = list(mlr::auc,mlr::acc,mlr::brier),models=TRUE)
Edit: I've reported an issue and created a pull request with a fix.

Related

Error: requires numeric/complex matrix/vector arguments for %*%; cross validating glmmTMB model

I am adapting some k-fold cross validation code written for glmer/merMod models to a glmmTMB model framework. All seems well until I try and use the output from the model(s) fit with training data to predict and exponentiate values into a matrix (to then break into quantiles/number of bins to assess predictive performance). I can get get this line to work using glmer models, but it seems when I run the same model using glmmTMB I get Error in model.matrix: requires numeric/complex matrix/vector arguments There are many other posts out there discussing this error code and I have tried converting the data frame into matrix form and changing the class of the covariates with no luck. Separately running the parts before and after the %*% works but when combined I get the error. For context, this code is intended to be run with use/availability data so the example variables may not make sense, but the problem gets shown well enough. Any suggestions as to what is going on?
library(lme4)
library(glmmTMB)
# Example with mtcars dataset
data(mtcars)
# Model both with glmmTMB and lme4
m1 <- glmmTMB(am ~ mpg + wt + (1|carb), family = poisson, data=mtcars)
m2 <- glmer(am ~ mpg + wt + (1|carb), family = poisson, data=mtcars)
#--- K-fold code (hashed out sections are original glmer version of code where different)---
# define variables
k <- 5
mod <- m1 #m2
dt <- model.frame(mod) #data used
reg.list <- list() # initialize object to store all models used for cross validation
# finds the name of the response variable in the model dataframe
resp <- as.character(attr(terms(mod), "variables"))[attr(terms(mod), "response") + 1]
# define column called sets and populates it with character "train"
dt$sets <- "train"
# randomly selects a proportion of the "used"/am records (i.e. am = 1) for testing data
dt$sets[sample(which(dt[, resp] == 1), sum(dt[, resp] == 1)/k)] <- "test"
# updates the original model using only the subset of "trained" data
reg <- glmmTMB(formula(mod), data = subset(dt, sets == "train"), family=poisson,
control = glmmTMBControl(optimizer = optim, optArgs=list(method="BFGS")))
#reg <- glmer(formula(mod), data = subset(dt, sets == "train"), family=poisson,
# control = glmerControl(optimizer = "bobyqa", optCtrl=list(maxfun=2e5)))
reg.list[[i]] <- reg # store models
# uses new model created with training data (i.e. reg) to predict and exponentiate values
predall <- exp(as.numeric(model.matrix(terms(reg), dt) %*% glmmTMB::fixef(reg)))
#predall <- exp(as.numeric(model.matrix(terms(reg), dt) %*% lme4::fixef(reg)))
Without looking at the code too carefully: glmmTMB::fixef(reg) returns a list (with elements cond (conditional model parameters), zi (zero-inflation parameters), disp (dispersion parameters) rather than a vector.
If you replace this bit with glmmTMB::fixef(reg)[["cond"]] it will probably work.

word2vec recommendation system KeyError: "word '21883' not in vocabulary"

The code works absolutely fine for the data set containing 500000+ instances but whenever I reduce the data set to 5000/10000/15000 it throws a key error : word "***" not in vocabulary.Not for every data point but for most them it throws the error.The data set is in excel format. [1]: https://i.stack.imgur.com/YCBiQ.png
I don't know how to fix this problem since i have very little knowledge about it,,I am still learning.Please help me fix this problem!
purchases_train = []
for i in tqdm(customers_train):
temp = train_df[train_df["CustomerID"] == i]["StockCode"].tolist()
purchases_train.append(temp)
purchases_val = []
for i in tqdm(validation_df['CustomerID'].unique()):
temp = validation_df[validation_df["CustomerID"] == i]["StockCode"].tolist()
purchases_val.append(temp)
model = Word2Vec(window = 10, sg = 1, hs = 0,
negative = 10, # for negative sampling
alpha=0.03, min_alpha=0.0007,
seed = 14)
model.build_vocab(purchases_train, progress_per=200)
model.train(purchases_train, total_examples = model.corpus_count,
epochs=10, report_delay=1)
model.save("word2vec_2.model")
model.init_sims(replace=True)
# extract all vectors
X = model[model.wv.vocab]
X.shape
products = train_df[["StockCode", "Description"]]
products.drop_duplicates(inplace=True, subset='StockCode', keep="last")
products_dict=products.groupby('StockCode'['Description'].apply(list).to_dict()
def similar_products(v, n = 6):
ms = model.similar_by_vector(v, topn= n+1)[1:]
new_ms = []
for j in ms:
pair = (products_dict[j[0]][0], j[1])
new_ms.append(pair)
return new_ms
similar_products(model['21883'])
If you get a KeyError saying a word is not in the vocabulary, that's a reliable indicator that the word you're looking-up was not in the training data fed to Word2Vec, or did not appear enough (default min_count=5) times.
So, your error indicates the word-token '21883' did not appear at least 5 times in the texts (purchases_train) supplied to Word2Vec. You should do either or both of:
Ensure all words you're going to look-up appear enough times, either with more training data or a lower min_count. (However, words with only one or a few occurrences tend not to get good vectors & instead just drag the quaality of surrounding-words' vectors down - so keeping this value above 1, or even raising it above the default of 5 to discard more rare words, is a better path whenever you have sufficient data.)
If your later code will be looking up words that might not be present, either check for their presence first (word in model.wv.vocab) or set up a try: ... except: ... to catch & handle the case where they're not present.

What is pmodels in drc package

Sorry if it's a dumb question, but I am having trouble figuring out how to use pmodels in the drc package. I've searched everywhere online and all I can find is the definition, which is: "a data frame with a many columns as there are parameters in the non-linear function. Or a list containing a formula for each parameter in the nonlinear function." There are examples online, but I have no what it represents. For example, for the commands:
sel.m2 <- drm(dead/total~conc, type, weights=total, data=selenium, fct=LL.2(),
type="binomial", pmodels=list(~1, ~factor(type)-1))
met.as.m1<-drm(gain ~ dose, product, data = methionine, fct = AR.3(),
pmodels = list(~1, ~factor(product), ~factor(product)))
plot(met.as.m1, log = "", ylim = c(1450, 1800))
auxins.m1 <- boxcox(drm(y ~ dose, h, pmodels = data.frame(h, h, 1, h), fct = LL.4(), data = auxins), method = "anova")
I see pmodels as a list and data frame, but what does the "-1"vs "~1" mean or what does it mean to list a factor, what's the significance of the order within the parenthesis?
I agree that it's not well explained for new people. Unfortunately, I can only answer you in part.
A late response but for anyone else:
Two resources are available for reference with drc:
a) The writers published about drc. See main text and supplementary (S3 in this example) DOI:10.1371/journal.pone.0146021
b) See the drc.pdf and ctrl+f for pmodel to inspect the various uses.
data.frame vs. list depends on the grouping level I believe.
After playing around with my data (subsets), I found that
pmodels() = parameter/pooled models
aka how you set those parameters to equal (i.e., global/shared or not).
With your last example using the auxins df
library(drc)
auxins.m1 <- boxcox(drm(y ~ dose, h, pmodels = data.frame(h, h, 1, h),
fct = LL.4(), data = auxins), method = "anova")
## changed names to familiar terms by a non-statistician
auxins.m1 <- boxcox(drm(y ~ dose, h, pmodels = data.frame(h, h, 1, h),
fct = LL.4(names=c("hill.slope","bot","top","ed50"), data = auxins), method = "anova")
Shows that the top is set to 1.
The order is the same as the LL.4(names...)
So if you set
pmodels = data.frame(h, 1, 1, h) ## ("hill.slope","bot","top","ed50")
as they do in the drc.pdf on pg.10, you'll see that it's to set a common/shared bottom and top.
Check out pg.9 of their supplementary article, it shows that for LL.2, the two-parameter logistic fit has pre-set top = 1 and bottom = 0. The output of
selenium.LL.2.2 <- drm(dead/total~conc, type, weights = total,
data = selenium, fct = LL.2(), type="binomial",
pmodels = list(~factor(type)-1, ~1)) ## ("hill-slope", "ed50")
Shows that ed50 is assumed constant.
Alternatively from pg.91 of the drc.pdf:
## Fitting the model with freely varying ED50 values
mecter.free <- drm(rgr ~ dose, pct, data = mecter,
fct = LL.4(), pmodels = list(~1, ~1, ~1, ~factor(pct) - 1))
Unfortunately, it's really not clear what the object-1 means vs. just the object.
A better approach might be to use the base drm() without the special case of LL.#()
Check
getMeanFunctions()
to see all available functions
if you're trying to fix a value at a certain value you can
fct = LL.4(fixed = c(NA,0,1,NA))
## effectively becomes the standard LL.2()
## or
fct = LL.4(fixed = c(1,0,NA,NA))
## common hill slope = 1; assumes baseline correction hence = 0
Related in part; see a lot of drm functions laid out:
https://stackoverflow.com/a/39257095

How to modify lightgbm for my custom loss functions?

Which files should be change while adding my own custom loss function? I know I can add my objective and gradient/hessian computation in ObjectiveFunction, just wondering if there is anything else I need to do or if there are other alternatives for custom loss functions.
According to demo files in lightGBM early stopping example,
Setting objective function as:
# User define objective function, given prediction, return gradient and second order gradient
# This is loglikelihood loss
logregobj <- function(preds, dtrain) {
labels <- getinfo(dtrain, "label")
preds <- 1 / (1 + exp(-preds))
grad <- preds - labels
hess <- preds * (1 - preds)
return(list(grad = grad, hess = hess))
}
Setting error function as:
# User defined evaluation function, return a pair metric_name, result, higher_better
# NOTE: when you do customized loss function, the default prediction value is margin
# This may make buildin evalution metric not function properly
# For example, we are doing logistic loss, the prediction is score before logistic transformation
# The buildin evaluation error assumes input is after logistic transformation
# Take this in mind when you use the customization, and maybe you need write customized evaluation function
evalerror <- function(preds, dtrain) {
labels <- getinfo(dtrain, "label")
err <- as.numeric(sum(labels != (preds > 0.5))) / length(labels)
return(list(name = "error", value = err, higher_better = FALSE))
}
And then you can run lightgbm as:
bst <- lgb.train(param,
dtrain,
num_round,
valids,
objective = logregobj,
eval = evalerror,
early_stopping_round = 3)

Neural network - solve a net with time arrays and different sample rate

I have 3 measurements for a machine. Each measurement is trigged every time its value changes by a certain delta.
I have these 3 data sets, represented as Matlab objects: T1, T2 and O. Each of them has a obj.t containing the timestamp values and obj.y containing the measurement values.
I will measure T1 and T2 for a long time, but O only for a short period. The task is to reconstruct O_future from T1 and T2, using the existing values for O for training and validation.
Note that T1.t, T2.t and O.t are not equal, not even their frequency (I might call it 'variable sample rate', but not sure if this name applies).
Is it possible to solve this problem using Matlab or other software? Do I need to resample all data to a common time vector?
Concerning the common time. Below some basic code which does this. (I guess you might know how to do it but just in case). However, the second option might bring you further...
% creating test signals
t1 = 1:2:100;
t2 = 1:3:200;
to = [5 6 100 140];
s1 = round (unifrnd(0,1,size(t1)));
s2 = round (unifrnd(0,1,size(t2)));
o = ones(size(to));
maxt = max([t1 t2 to]);
mint = min([t1 t2 to]);
% determining minimum frequency
frequ = min([t1(2:length(t1)) - t1(1:length(t1)-1) t2(2:length(t2)) - t2(1:length(t2)-1) to(2:length(to)) - to(1:length(to)-1)] );
% create a time vector with highest resolution
tinterp = linspace(mint,maxt,(maxt-mint)/frequ+1);
s1_interp = zeros(size(tinterp));
s2_interp = zeros(size(tinterp));
o_interp = zeros(size(tinterp));
for i = 1: length(t1)
s1_interp(ceil(t1(i))==floor(tinterp)) =s1(i);
end
for i = 1: length(t2)
s2_interp(ceil(t2(i))==floor(tinterp)) =s2(i);
end
for i = 1: length(to)
o_interp(ceil(to(i))==floor(tinterp)) = o(i);
end
figure,
subplot 311
hold on, plot(t1,s1,'ro'), plot(tinterp,s1_interp,'k-')
legend('observation','interpolation')
title ('signal 1')
subplot 312
hold on, plot(t2,s2,'ro'), plot(tinterp,s2_interp,'k-')
legend('observation','interpolation')
title ('signal 2')
subplot 313
hold on, plot(to,o,'ro'), plot(tinterp,o_interp,'k-')
legend('observation','interpolation')
title ('O')
Its not ideal as for large vectors this might become ineffective as soon as you have small sampling frequencies in one of the signals which will determine the lowest resolution.
Another option would be to define a coarser time vector and look at the number of events that happend in a certain period which might have some predictive power as well (not sure about your setup).
The structure would be something like
coarse_t = 1:5:100;
s1_coarse = zeros(size(coarse_t));
s2_coarse = zeros(size(coarse_t));
o_coarse = zeros(size(coarse_t));
for i = 2:length(coarse_t)
s1_coarse(i) = sum(nonzeros(s1(t1<coarse_t(i) & t1>coarse_t(i-1))));
s2_coarse(i) = sum(nonzeros(s2(t2<coarse_t(i) & t2>coarse_t(i-1))));
o_coarse(i) = sum(nonzeros(o(to<coarse_t(i) & to>coarse_t(i-1))));
end

Resources