How to modify lightgbm for my custom loss functions? - lightgbm

Which files should be change while adding my own custom loss function? I know I can add my objective and gradient/hessian computation in ObjectiveFunction, just wondering if there is anything else I need to do or if there are other alternatives for custom loss functions.

According to demo files in lightGBM early stopping example,
Setting objective function as:
# User define objective function, given prediction, return gradient and second order gradient
# This is loglikelihood loss
logregobj <- function(preds, dtrain) {
labels <- getinfo(dtrain, "label")
preds <- 1 / (1 + exp(-preds))
grad <- preds - labels
hess <- preds * (1 - preds)
return(list(grad = grad, hess = hess))
}
Setting error function as:
# User defined evaluation function, return a pair metric_name, result, higher_better
# NOTE: when you do customized loss function, the default prediction value is margin
# This may make buildin evalution metric not function properly
# For example, we are doing logistic loss, the prediction is score before logistic transformation
# The buildin evaluation error assumes input is after logistic transformation
# Take this in mind when you use the customization, and maybe you need write customized evaluation function
evalerror <- function(preds, dtrain) {
labels <- getinfo(dtrain, "label")
err <- as.numeric(sum(labels != (preds > 0.5))) / length(labels)
return(list(name = "error", value = err, higher_better = FALSE))
}
And then you can run lightgbm as:
bst <- lgb.train(param,
dtrain,
num_round,
valids,
objective = logregobj,
eval = evalerror,
early_stopping_round = 3)

Related

Error: requires numeric/complex matrix/vector arguments for %*%; cross validating glmmTMB model

I am adapting some k-fold cross validation code written for glmer/merMod models to a glmmTMB model framework. All seems well until I try and use the output from the model(s) fit with training data to predict and exponentiate values into a matrix (to then break into quantiles/number of bins to assess predictive performance). I can get get this line to work using glmer models, but it seems when I run the same model using glmmTMB I get Error in model.matrix: requires numeric/complex matrix/vector arguments There are many other posts out there discussing this error code and I have tried converting the data frame into matrix form and changing the class of the covariates with no luck. Separately running the parts before and after the %*% works but when combined I get the error. For context, this code is intended to be run with use/availability data so the example variables may not make sense, but the problem gets shown well enough. Any suggestions as to what is going on?
library(lme4)
library(glmmTMB)
# Example with mtcars dataset
data(mtcars)
# Model both with glmmTMB and lme4
m1 <- glmmTMB(am ~ mpg + wt + (1|carb), family = poisson, data=mtcars)
m2 <- glmer(am ~ mpg + wt + (1|carb), family = poisson, data=mtcars)
#--- K-fold code (hashed out sections are original glmer version of code where different)---
# define variables
k <- 5
mod <- m1 #m2
dt <- model.frame(mod) #data used
reg.list <- list() # initialize object to store all models used for cross validation
# finds the name of the response variable in the model dataframe
resp <- as.character(attr(terms(mod), "variables"))[attr(terms(mod), "response") + 1]
# define column called sets and populates it with character "train"
dt$sets <- "train"
# randomly selects a proportion of the "used"/am records (i.e. am = 1) for testing data
dt$sets[sample(which(dt[, resp] == 1), sum(dt[, resp] == 1)/k)] <- "test"
# updates the original model using only the subset of "trained" data
reg <- glmmTMB(formula(mod), data = subset(dt, sets == "train"), family=poisson,
control = glmmTMBControl(optimizer = optim, optArgs=list(method="BFGS")))
#reg <- glmer(formula(mod), data = subset(dt, sets == "train"), family=poisson,
# control = glmerControl(optimizer = "bobyqa", optCtrl=list(maxfun=2e5)))
reg.list[[i]] <- reg # store models
# uses new model created with training data (i.e. reg) to predict and exponentiate values
predall <- exp(as.numeric(model.matrix(terms(reg), dt) %*% glmmTMB::fixef(reg)))
#predall <- exp(as.numeric(model.matrix(terms(reg), dt) %*% lme4::fixef(reg)))
Without looking at the code too carefully: glmmTMB::fixef(reg) returns a list (with elements cond (conditional model parameters), zi (zero-inflation parameters), disp (dispersion parameters) rather than a vector.
If you replace this bit with glmmTMB::fixef(reg)[["cond"]] it will probably work.

Input f into play3d() and movie3d() in the rgl package in R

I don't understand the input f expected by play3d and movie3d in the rgl package.
library(rgl)
nobs<-10
x<-runif(nobs)
y<-runif(nobs)
z<-runif(nobs)
n<-rep(1:nobs)
df<-as.data.frame(cbind(x,y,z,n))
listofobs<-split(df,n)
plot3d(df[,1],df[,2],df[,3], type = "n", radius = .2 )
myplotfunction<-function(x) {
rgl.spheres(x=x$x,y=x$y,z=x$z, type="s", r=0.025)
}
When executing the 2 lines below, the animation does play but both lines (play3d() and movie3d()) trigger the error displayed below:
play3d(f=lapply(listofobs,myplotfunction), fps=1 )
movie3d(f=lapply(listofobs,myplotfunction), fps=1 , duration=20)
I am hoping someone can correct my code and help me understand the f input to play3d and movie3d.
Question 1: Why is the play3d line above correct enough that the animation does display correctly?
Question 2: Why is the play3d line above incorrect enough that it triggers the error?
Question 3: What is wrong with the movie3d line that it does not produce a video output?
As the docs say, f is "A function returning a list that may be passed to par3d". It's not a list, which is what your usage passes.
To answer the questions:
R evaluates the lapply call which does the animation, then play3d looks at the result and dies because it's not a function.
f needs to be a function, as described in the help page.
It dies when it looks at f, because it's not a function.
This looks like it will do what you want:
library(rgl)
nobs<-10
x<-runif(nobs)
y<-runif(nobs)
z<-runif(nobs)
df<-data.frame(x,y,z)
plot3d(df, type = "n" )
id <- NA
myplotfunction<-function(time) {
index <- round(time)
# For a 3x faster display, use index <- round(3*time)
# To cycle through the points several times, use
# index <- round(3*time) %% nobs + 1
if (!is.na(id))
pop3d(id = id) # Delete previous item
id <<- spheres3d(df[index,], r=0.025)
list()
}
play3d(myplotfunction, startTime = 1, duration = nobs - 1)
movie3d(myplotfunction, startTime = 1, duration = nobs - 1, fps = 1)
This will leave a GIF in file.path(tempdir(), "movie.gif").
Some other notes:
don't call rgl.spheres. It will cause you immense pain later. Use spheres3d, or never call any *3d function, and never upgrade rgl: you're living in the past using the rgl.* functions. The *3d functions and the rgl.* functions don't play nicely together.
to construct a dataframe, just use the data.frame() function, don't convert
a matrix.
you don't need all those contortions to extract points from the dataframe.
Most rgl functions can handle a dataframe with x, y, and z columns.
You might notice the plot3d frame move a little: spheres are bigger than points, so it will adjust to accommodate them. You could use xlim, ylim and zlim to set the original frame a little bigger if you don't like this.

What is pmodels in drc package

Sorry if it's a dumb question, but I am having trouble figuring out how to use pmodels in the drc package. I've searched everywhere online and all I can find is the definition, which is: "a data frame with a many columns as there are parameters in the non-linear function. Or a list containing a formula for each parameter in the nonlinear function." There are examples online, but I have no what it represents. For example, for the commands:
sel.m2 <- drm(dead/total~conc, type, weights=total, data=selenium, fct=LL.2(),
type="binomial", pmodels=list(~1, ~factor(type)-1))
met.as.m1<-drm(gain ~ dose, product, data = methionine, fct = AR.3(),
pmodels = list(~1, ~factor(product), ~factor(product)))
plot(met.as.m1, log = "", ylim = c(1450, 1800))
auxins.m1 <- boxcox(drm(y ~ dose, h, pmodels = data.frame(h, h, 1, h), fct = LL.4(), data = auxins), method = "anova")
I see pmodels as a list and data frame, but what does the "-1"vs "~1" mean or what does it mean to list a factor, what's the significance of the order within the parenthesis?
I agree that it's not well explained for new people. Unfortunately, I can only answer you in part.
A late response but for anyone else:
Two resources are available for reference with drc:
a) The writers published about drc. See main text and supplementary (S3 in this example) DOI:10.1371/journal.pone.0146021
b) See the drc.pdf and ctrl+f for pmodel to inspect the various uses.
data.frame vs. list depends on the grouping level I believe.
After playing around with my data (subsets), I found that
pmodels() = parameter/pooled models
aka how you set those parameters to equal (i.e., global/shared or not).
With your last example using the auxins df
library(drc)
auxins.m1 <- boxcox(drm(y ~ dose, h, pmodels = data.frame(h, h, 1, h),
fct = LL.4(), data = auxins), method = "anova")
## changed names to familiar terms by a non-statistician
auxins.m1 <- boxcox(drm(y ~ dose, h, pmodels = data.frame(h, h, 1, h),
fct = LL.4(names=c("hill.slope","bot","top","ed50"), data = auxins), method = "anova")
Shows that the top is set to 1.
The order is the same as the LL.4(names...)
So if you set
pmodels = data.frame(h, 1, 1, h) ## ("hill.slope","bot","top","ed50")
as they do in the drc.pdf on pg.10, you'll see that it's to set a common/shared bottom and top.
Check out pg.9 of their supplementary article, it shows that for LL.2, the two-parameter logistic fit has pre-set top = 1 and bottom = 0. The output of
selenium.LL.2.2 <- drm(dead/total~conc, type, weights = total,
data = selenium, fct = LL.2(), type="binomial",
pmodels = list(~factor(type)-1, ~1)) ## ("hill-slope", "ed50")
Shows that ed50 is assumed constant.
Alternatively from pg.91 of the drc.pdf:
## Fitting the model with freely varying ED50 values
mecter.free <- drm(rgr ~ dose, pct, data = mecter,
fct = LL.4(), pmodels = list(~1, ~1, ~1, ~factor(pct) - 1))
Unfortunately, it's really not clear what the object-1 means vs. just the object.
A better approach might be to use the base drm() without the special case of LL.#()
Check
getMeanFunctions()
to see all available functions
if you're trying to fix a value at a certain value you can
fct = LL.4(fixed = c(NA,0,1,NA))
## effectively becomes the standard LL.2()
## or
fct = LL.4(fixed = c(1,0,NA,NA))
## common hill slope = 1; assumes baseline correction hence = 0
Related in part; see a lot of drm functions laid out:
https://stackoverflow.com/a/39257095

How to jointly use makeFeatSelWrapper and resample function in mlr

I'm fitting classification models for binary issues using MLR package in R. For each model, I perform a cross-validation with embedded feature selection using "selectFeatures" function. In output, I retrieve mean AUCs over test sets and predictions. To do so, after having get some advices (Get predictions on test sets in MLR), I use "makeFeatSelWrapper" function in combination with "resample" function. The goal seems to be achieved but results are strange. With a logistic regression as classifier, I get an AUC of 0.5 which means no variable selected. This result is unexpected as I get an AUC of 0.9824432 with this classifier using the method mentioned in the linked question. With a neural network as classifier, I get an error message
Error in sum(x) : invalid 'type' (list) of argument
What is wrong?
Here is the sample code:
# 1. Find a synthetic dataset for supervised learning (two classes)
###################################################################
install.packages("mlbench")
library(mlbench)
data(BreastCancer)
# generate 1000 rows, 21 quantitative candidate predictors and 1 target variable
p<-mlbench.waveform(1000)
# convert list into dataframe
dataset<-as.data.frame(p)
# drop thrid class to get 2 classes
dataset2 = subset(dataset, classes != 3)
# 2. Perform cross validation with embedded feature selection using logistic regression
#######################################################################################
library(BBmisc)
library(nnet)
library(mlr)
# Choice of data
mCT <- makeClassifTask(data =dataset2, target = "classes")
# Choice of algorithm i.e. neural network
mL <- makeLearner("classif.logreg", predict.type = "prob")
# Choice of cross-validations for folds
outer = makeResampleDesc("CV", iters = 10,stratify = TRUE)
# Choice of feature selection method
ctrl = makeFeatSelControlSequential(method = "sffs", maxit = NA,alpha = 0.001)
# Choice of hold-out sampling between training and test within the fold
inner = makeResampleDesc("Holdout",stratify = TRUE)
lrn = makeFeatSelWrapper(mL, resampling = inner, control = ctrl)
r = resample(lrn, mCT, outer, extract = getFeatSelResult,measures = list(mlr::auc,mlr::acc,mlr::brier),models=TRUE)
# 3. Perform cross validation with embedded feature selection using neural network
##################################################################################
library(BBmisc)
library(nnet)
library(mlr)
# Choice of data
mCT <- makeClassifTask(data =dataset2, target = "classes")
# Choice of algorithm i.e. neural network
mL <- makeLearner("classif.nnet", predict.type = "prob")
# Choice of cross-validations for folds
outer = makeResampleDesc("CV", iters = 10,stratify = TRUE)
# Choice of feature selection method
ctrl = makeFeatSelControlSequential(method = "sffs", maxit = NA,alpha = 0.001)
# Choice of sampling between training and test within the fold
inner = makeResampleDesc("Holdout",stratify = TRUE)
lrn = makeFeatSelWrapper(mL, resampling = inner, control = ctrl)
r = resample(lrn, mCT, outer, extract = getFeatSelResult,measures = list(mlr::auc,mlr::acc,mlr::brier),models=TRUE)
If you run your logistic regression part of the code a couple of times, you should also get the Error in sum(x) : invalid 'type' (list) of argument error. However, I find it strange that fixing a particular seed (e.g., set.seed(1)) before resampling does not ensure that the error does or does not appear.
The error occurs in internal mlr code for printing the output of feature selection to the console. A very simple workaround is to simply avoid printing such output with show.info = FALSE in makeFeatSelWrapper (see code below). While this removes the error, it is possible that what caused it may have other consequences, although I it is possible the error only affects the printing code.
When running your code, I only get AUC above 0.90. Please find below a your code for logistic regression, slightly re-organized and with the workaround. I have added a droplevels() to the dataset2 to remove the missing level 3 from the factor, though this is not related with the workaround.
library(mlbench)
library(mlr)
data(BreastCancer)
p<-mlbench.waveform(1000)
dataset<-as.data.frame(p)
dataset2 = subset(dataset, classes != 3)
dataset2 <- droplevels(dataset2 )
mCT <- makeClassifTask(data =dataset2, target = "classes")
ctrl = makeFeatSelControlSequential(method = "sffs", maxit = NA,alpha = 0.001)
mL <- makeLearner("classif.logreg", predict.type = "prob")
inner = makeResampleDesc("Holdout",stratify = TRUE)
lrn = makeFeatSelWrapper(mL, resampling = inner, control = ctrl, show.info = FALSE)
# uncomment this for the error to appear again. Might need to run the code a couple of times to see the error
# lrn = makeFeatSelWrapper(mL, resampling = inner, control = ctrl)
outer = makeResampleDesc("CV", iters = 10,stratify = TRUE)
r = resample(lrn, mCT, outer, extract = getFeatSelResult,measures = list(mlr::auc,mlr::acc,mlr::brier),models=TRUE)
Edit: I've reported an issue and created a pull request with a fix.

How does keras pass y_pred to loss object/function via model.compile

If I have a function that defines the triple loss (which expects a y_true & y_pred as input parameters), and I "reference or call it" via the following:
model.compile(optimizer="rmsprop", loss=triplet_loss, metrics=[accuracy])
Hows does the y_pred get passed to the triplet_loss function?
For example the triplet_loss function may be:
def triplet_loss(y_true, y_pred, alpha = 0.2):
"""
Implementation of the triplet loss function
Arguments:
y_true -- true labels, required when you define a loss in Keras,
y_pred -- python list containing three objects:
"""
anchor, positive, negative = y_pred[0], y_pred[1], y_pred[2]
# distance between the anchor and the positive
pos_dist = tf.reduce_sum(tf.square(tf.subtract(anchor,positive)))
# distance between the anchor and the negative
neg_dist = tf.reduce_sum(tf.square(tf.subtract(anchor,negative)))
# compute loss
basic_loss = pos_dist-neg_dist+alpha
loss = tf.maximum(basic_loss,0.0)
return loss
Thanks Jon
I did a little bit of poking through the keras source code. In the Model() class:
First they modify the function a bit to take into account weights:
self.loss_functions = loss_functions
weighted_losses = [_weighted_masked_objective(fn) for fn in loss_functions]
A bit later during training they map their outputs (predictions) to their targets (labels) and call the loss function to get the output_loss. Here y_true and y_pred are passed into your function.
y_true = self.targets[i]
y_pred = self.outputs[i]
weighted_loss = weighted_losses[i]
sample_weight = sample_weights[i]
mask = masks[i]
loss_weight = loss_weights_list[i]
with K.name_scope(self.output_names[i] + '_loss'):
output_loss = weighted_loss(y_true, y_pred,
sample_weight, mask)

Resources