How to add bootstrap CI into this function in R

I want to calculate a 90% (bootstrap) confidence interval for these two estimates from the Harrell-Davis distribution-free quantile estimator. I think this function is a bootstrapped version of the nonparametric quantile estimator that returns the estimates and their standard deviations. How can I calculate the 90% CI?
library(Hmisc)
x <- runif(100)
hdquantile(x, probs = seq(0.025, 0.975, 0.95), se = TRUE, names = TRUE, weights = FALSE)

The hdquantile function computes the estimator and also returns the standard errors if you set se = TRUE. To get a confidence interval, you can bootstrap it with the boot function and obtain confidence intervals via boot.ci.
Code
library(Hmisc)
x <- runif(100)
hdquantile(x, probs = seq(0.025, 0.975, 0.95), se = TRUE, names = TRUE,
           weights = FALSE)
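As a side note, when se = TRUE the estimates come back with their standard errors attached as an attribute, so a quick normal-approximation 90% interval (estimate ± 1.645 × se; a sketch, not the bootstrap CI asked for) can be read off directly:
h  <- hdquantile(x, probs = seq(0.025, 0.975, 0.95), se = TRUE)
se <- attr(h, "se")
# normal-approximation 90% CI for each quantile estimate
cbind(lower = h - qnorm(0.95) * se, upper = h + qnorm(0.95) * se)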
For bootstrapping, create a statistic function:
library(boot)
# statistic: recompute the Harrell-Davis quantiles on each bootstrap resample
hq <- function(x, i) {
  hdquantile(x[i], probs = seq(0.025, 0.975, 0.95), se = TRUE, names = TRUE,
             weights = FALSE)
}
bootx <- boot(x, hq, R = 1000)
boot.ci(bootx, conf = 0.90)
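Note that boot.ci reports an interval only for the first statistic by default; since hq returns one value per requested quantile, ask for each one via index (the percentile type is shown here just as one safe choice):
# one interval per quantile estimate returned by hq()
boot.ci(bootx, conf = 0.90, type = "perc", index = 1)  # 0.025 quantile
boot.ci(bootx, conf = 0.90, type = "perc", index = 2)  # 0.975 quantile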

Related

Line search fails when training linear SVM with caret

I am trying to train a linear SVM while tuning the parameters with 10-fold CV for binary text classification.
None of the solutions provided in other threads work for me: I already removed all NA, NaN, and Inf values and balanced my dataset by downsampling, but the model still returns NAs and fails in the line search. Therefore I need the help of the community, as I am kind of stuck.
The data has 2099 observations of 926 variables and consists mostly of 0s, 1s, 2s, and 3s.
This is my code:
library(caret)

set.seed(335)
trainIndex <- createDataPartition(dat_SentimentAnalysis$Usefulness, p = .75,
                                  list = FALSE,
                                  times = 1)
train <- dat_SentimentAnalysis[ trainIndex, ]
test  <- dat_SentimentAnalysis[-trainIndex, ]
# check for distribution of class
table(train$Usefulness)
# downsample training set
train <- downSample(train, as.factor(train$Usefulness))
# check again for distribution
table(train$Usefulness)
train <- na.omit(train) # no NA values detected
# separate features and predictors
x_train <- train[2:926]
y_train <- as.factor(train$Usefulness)
x_test  <- test[2:926]
y_test  <- as.factor(test$Usefulness)
sum(is.na(x_train))
sum(is.na(y_train))
# tune hyperparameters for the SVM
fitControl <- trainControl(method = "repeatedcv",
                           number = 10,
                           repeats = 3,
                           search = "grid",
                           classProbs = TRUE,
                           savePredictions = TRUE)
model <- caret::train(x = x_train,
                      y = y_train,
                      method = "svmLinear",
                      trControl = fitControl,
                      tunegrid = data.frame(C = c(0.25, 0.5, 1, 5, 8, 12, 100)))
Does anybody have an idea what could be wrong? When I do not perform tuning I get a very poorly performing SVM with around 52% accuracy, but at least I get one. So maybe something in the tuning setup is wrong?
Thank you very much for your help!
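One detail worth checking (an observation on the posted code, not a confirmed fix): caret::train() expects the tuning grid argument to be spelled tuneGrid; a lowercase tunegrid is not matched and is simply passed through ..., so caret falls back to its default grid. A minimal sketch of the same call with that spelling:
# same call as above, but using caret's expected argument name `tuneGrid`
model <- caret::train(x = x_train,
                      y = y_train,
                      method = "svmLinear",
                      trControl = fitControl,
                      tuneGrid = data.frame(C = c(0.25, 0.5, 1, 5, 8, 12, 100)))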

How to set my inequality constraint into my R function?

I am working on a project that analyses different portfolio constructions in a universe of various assets. I work with 22 assets and recalibrate my portfolio every 90 days; this is why a weight-penalty constraint (see code) is applied, as the allocation changes every period.
I am currently implementing a construction based on independent components. My objective is to minimize the modified value at risk based on its components. (See code below).
My function runs correctly and everything seems to be OK: my functions "MVaR.IC.port" and "MVaR.cm" work well. However, I can only implement this model in the case where short selling is allowed. I would now like to operate long only, i.e. my weight vector w should only contain elements >= 0. Concretely, I want the expression "w <- t(w.IC)%*%a$A" in my code to be >= 0.
Do you know how to help me? Thank you in advance.
Here are the results (w.out.MVaR.IC.22) that must be positive. I also constrain the sum of the weights to be equal to 1 (the investor allocates 100% of his wealth).
Thomas
PS: train and test represent my rolling windows. In fact, I calibrate my models on 'train' (in sample) and apply them on 'test' (out of sample) in order to analyse their performance.
########################################
######### MVaR on IC with CM ##########
########################################
lower <- rep(-5, k)
upper <- rep(5, k)
# Set up objective function and constraints
MVaR.IC.cm.port <- function(S, weights, alpha, MixingMatrix)
{
  obj <- MVaR(S, weights, alpha)
  w.ICA <- t(weights) %*% MixingMatrix
  weight.penalty      <- abs(1000 * (1 - sum(w.ICA)))
  down.weight.penalty <- 1000 * sum(w.ICA[w.ICA > 1])
  up.weight.penalty   <- 1000 * abs(sum(w.ICA[w.ICA < -1]))
  return(obj + weight.penalty + down.weight.penalty + up.weight.penalty)
}
# Out-of-sample portfolio return computation
ret.out.MVaR.IC.cm.22 <- c()
w.out.MVaR.IC.cm.22 <- matrix(ncol = n, nrow = 10)
for (i in 0:9) {
  train <- as.matrix(portfolioReturns.new[((1 + i*90):(8*90 + i*90)), ])
  test  <- as.matrix(portfolioReturns.new[(1 + 8*90 + i*90):(9*90 + i*90), ])
  a <- myfastICA(train, k, alg.typ = "parallel", fun = "logcosh", alpha = 1,
                 method = "R", row.norm = FALSE, maxit = 2000,
                 tol = 0.0000000001, verbose = TRUE)
  x <- DEoptim(MVaR.IC.cm.port, lower, upper,
               control = list(NP = (10*k), F = 0.8, CR = 0.9, trace = 50),
               S = a$S, alpha = alpha, MixingMatrix = a$A)
  w.IC <- matrix(x$optim$bestmem, ncol = 1)
  w <- t(w.IC) %*% a$A
  for (j in 1:ncol(train)) {
    w.out.MVaR.IC.cm.22[(i + 1), j] <- w[j]
  }
  ret.out.MVaR.IC.cm.22 <- rbind(ret.out.MVaR.IC.cm.22, test %*% t(w))
}
w.out.MVaR.IC.cm.22
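Since the constraints above are already enforced through soft penalties inside DEoptim, one option (a sketch extending the same penalty idea, not a tested solution) is to add a further penalty whenever any element of the implied asset weights goes negative, pushing the optimizer towards long-only solutions:
# Sketch: same objective as MVaR.IC.cm.port, plus a penalty on negative weights
# so that the implied asset weights w = t(weights) %*% MixingMatrix stay >= 0
MVaR.IC.cm.port.long <- function(S, weights, alpha, MixingMatrix)
{
  obj <- MVaR(S, weights, alpha)
  w.ICA <- t(weights) %*% MixingMatrix
  weight.penalty    <- abs(1000 * (1 - sum(w.ICA)))       # weights sum to 1
  long.only.penalty <- 1000 * sum(abs(w.ICA[w.ICA < 0]))  # no short positions
  return(obj + weight.penalty + long.only.penalty)
}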

Calculate certainty of Monte Carlo simulation

Let's say that we use the Monte Carlo method to estimate the area of an object, in the exact same way you'd use it to estimate the value of π.
Now, let's say we want to calculate the certainty of our simulation result. We've cast n samples, m of which landed inside the object, so the area of the object is approximately m/n of the total sampled area. We would like to make a statement such as:
"We are 99% certain that the area of the object is between a1 and a2."
How can we calculate a1 and a2 above (given n, m, total area, and the desired certainty)?
Here is a program which attempts to estimate this bound numerically. Here the samples are points in [0,1), and the object is the segment [0.25,0.75). It prints a1 and a2 for 50%, 90%, and 99%, for a range of sample counts:
import std.algorithm;
import std.random;
import std.range;
import std.stdio;

void main()
{
    foreach (numSamples; iota(0, 1000 + 1, 100).filter!(n => n > 0))
    {
        auto samples = new double[numSamples];
        enum objectStart = 0.25;
        enum objectEnd = 0.75;
        enum numTotalSamples = 10_000_000;
        auto numSizes = numTotalSamples / numSamples;
        auto sizes = new double[numSizes];
        foreach (ref size; sizes)
        {
            size_t numHits;
            foreach (i; 0 .. numSamples)
            {
                auto sample = uniform01!double;
                if (sample >= objectStart && sample < objectEnd)
                    numHits++;
            }
            size = 1.0 / numSamples * numHits;
        }
        sizes.sort;
        writef("%d samples:", numSamples);
        foreach (certainty; [50, 90, 99])
        {
            auto centerDist = numSizes * certainty / 100 / 2;
            auto startPos = numSizes / 2 - centerDist;
            auto endPos = numSizes / 2 + centerDist;
            writef("\t%.5f..%.5f", sizes[startPos], sizes[endPos]);
        }
        writeln;
    }
}
It outputs:
// 50% 90% 99%
100 samples: 0.47000..0.53000 0.42000..0.58000 0.37000..0.63000
200 samples: 0.47500..0.52500 0.44500..0.56000 0.41000..0.59000
300 samples: 0.48000..0.52000 0.45333..0.54667 0.42667..0.57333
400 samples: 0.48250..0.51750 0.46000..0.54250 0.43500..0.56500
500 samples: 0.48600..0.51600 0.46400..0.53800 0.44200..0.55800
600 samples: 0.48667..0.51333 0.46667..0.53333 0.44833..0.55167
700 samples: 0.48714..0.51286 0.46857..0.53143 0.45000..0.54857
800 samples: 0.48750..0.51250 0.47125..0.53000 0.45375..0.54625
900 samples: 0.48889..0.51111 0.47222..0.52667 0.45778..0.54111
1000 samples: 0.48900..0.51000 0.47400..0.52500 0.45800..0.53900
Is it possible to precisely calculate these numbers instead?
(Context: I'd like to add something like "±X.Y GB with 99% certainty" to btdu)
OK, since the question is language-agnostic, here is an illustration of how to do error estimation with Monte Carlo.
Suppose you want to compute the integral
I = ∫₀¹ f(x) dx
where f(x) is a simple polynomial function
f(x) = xⁿ
For that you have to compute not only the mean value of the sample but its standard deviation as well. Then, knowing that the Monte Carlo error goes down as the inverse square root of the number of samples, computing the confidence interval is simple.
Code, Python 3.7, Windows 10 x64
import numpy as np

rng = np.random.default_rng()

N = 100000
n = 2

def f(x):
    return np.power(x, n)

sample = f(rng.random(N))   # N samples of the function
m = np.mean(sample)         # mean value of the sample, approaching the integral value as N -> ∞
s = np.std(sample, ddof=1)  # standard deviation with Bessel correction
e = s / np.sqrt(N)          # Monte Carlo error decreases as inverse square root

t = 2.576  # for a 99% confidence interval, take 2.576 sigma, per the Gaussian distribution
# t = 3.00 # for a 99.7% confidence interval, take 3 sigma, per the Gaussian distribution

print(f'True integral value is {1.0/(1.0+n)}')
print(f'Computed integral value is in the range [{m-t*e}...{m+t*e}] with 99% confidence')
will print something like
True integral value is 0.3333333333333333
Computed integral value is in the range
[0.33141772204489295...0.3362795491124624] with 99% confidence
You could use a Z-score table to print whatever table you want, and vary N to get the desired dependency on N:
zscore = {'50%': 0.674, '80%': 1.282, '90%': 1.645, '95%': 1.960,
          '98%': 2.326, '99%': 2.576, '99.7%': 3.0}
for c, z in zscore.items():
    print(f'Computed integral value is in the range [{m-z*e}...{m+z*e}] with {c} confidence')
Based on Severin's answer, here is the code to calculate the values as stated in the question:
def calculate_error(n, m, z):
    p = m / n                       # estimated fraction of hits
    std_dev = (p * (1 - p)) ** 0.5  # standard deviation of a Bernoulli variable
    error = std_dev / n ** 0.5      # Monte Carlo error decreases as inverse square root
    return (p - z * error, p + z * error)

n = 1000
z = 2.576  # for a 99% confidence interval, take 2.576 sigma, per the Gaussian distribution
print(calculate_error(n, n * 0.5, z))
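Since the hit count m is binomially distributed, the bounds can also be computed without the normal approximation, e.g. with an exact (Clopper-Pearson) binomial interval scaled by the total sampled area. A minimal sketch in R (the total area of 1, the 99% level, and the example counts are only for illustration):
# exact (Clopper-Pearson) confidence interval for the hit fraction m/n,
# scaled by the total sampled area to bound the object's area
area_ci <- function(m, n, total_area = 1, conf = 0.99) {
  total_area * binom.test(m, n, conf.level = conf)$conf.int
}
area_ci(m = 500, n = 1000)  # e.g. 500 hits out of 1000 samples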

Tune a learner with the searchspace parameter setting

I am trying to tune a ranger learner with the search_space parameter setting. The purpose is to find the optimal K (the number of input indicators, set via a filter pipeop's importance.filter.nfeat) and D (the depth of each tree, i.e. classif.ranger.max.depth) by grid search. D's value should not be greater than the number of input indicators K. The values searched for D are set proportionally to K: D ∈ {10%, 25%, 50%, 100%} ∗ K, and values of D ≤ 0 are rejected.
However, I am unfamiliar with writing function code within the search space, so I cannot achieve this (D ends up greater than K).
My question is:
How to set a parameter that is based on another one in the search space? (I think this is different from the depends mechanism mentioned in the mlr3 book.)
Here is my code:
library(mlr3verse)
library(ggplot2)

ranger = lrn("classif.ranger", importance = "impurity", predict_type = "prob", id = "ranger")
graph = po("filter", flt("importance"), filter.nfeat = 3) %>>% ranger %>>% po("threshold")
plot(graph)
graph_learner = GraphLearner$new(graph)
searchspace = ps(
  importance.filter.nfeat = p_int(1, length(task$feature_names)),
  classif.ranger.max.depth = p_int(1, length(task$feature_names)),
  .extra_trafo = function(x, param_set) {
    x = graph_learner$param_set$importance.filter.nfeat * c(.1, .25, .50, 1)
  }
)
inst1 = TuningInstanceMultiCrit$new(
  task,
  learner = graph_learner,
  resampling = rsmp("cv"),
  measures = msrs(c("classif.ce", "classif.bacc", "classif.mcc")),
  terminator = trm("evals", n_evals = 50),
  search_space = searchspace
)
tuner = tnr("grid_search")
# reduce logging output
lgr::get_logger("bbotk")$set_threshold("warn")
# The tuning procedure may take some time:
set.seed(1234)
tuner$optimize(inst1)
# Returns a list with optimal configurations and estimated performance.
inst1$result
# We can plot the performance against the number of features.
# If we do so, we see the possible trade-off between sparsity and predictive performance:
arx = as.data.table(inst1$archive)
ggplot(arx, aes(x = importance.filter.nfeat, y = classif.ce)) + geom_line()
How can I know which indicators are used in the tuned model? We only see the trade-off between sparsity and predictive performance; are the selected features based on the importance rank?
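Regarding question 2: the importance filter simply keeps the top nfeat features ranked by the learner's importance scores, so recomputing the filter on the task shows which indicators a given nfeat would select. A minimal sketch (assuming the same impurity-based ranger importance used above):
library(mlr3filters)
# rank the features by ranger impurity importance, as the filter pipeop does
flt_imp = flt("importance", learner = lrn("classif.ranger", importance = "impurity"))
flt_imp$calculate(task)
# the top rows are the indicators kept for a given importance.filter.nfeat
head(as.data.table(flt_imp), n = 3)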
I have also tried feature selection. With FS I can get the optimal feature set, so what is the relationship between tuning nfeat and feature selection? Which one is preferred in real practice?
# https://mlr3gallery.mlr-org.com/posts/2020-09-14-mlr3fselect-basic/
resampling = rsmp("cv")
measure = msr("classif.mcc")
terminator = trm("none")
ranger_lrn = lrn("classif.ranger", importance = "impurity", predict_type = "prob")

instance = FSelectInstanceSingleCrit$new(
  task = task,
  learner = ranger_lrn,
  resampling = resampling,
  measure = measure,
  terminator = terminator,
  store_models = TRUE)

fselector = fs("rfe", recursive = FALSE)
set.seed(1234)
fselector$optimize(instance)

as.data.table(instance$archive)
instance$result
instance$result_feature_set
instance$result_y
# set new feature_set
# task$select(instance$result_feature_set)
Does this answer question 1?
How to set specific values in `paradox`?
It seems you could simply set up your own design data table as shown there, except remove the rows where D > K, and then use the design_points tuner.
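A minimal sketch of that suggestion (assuming the search space contains plain p_int ranges for both parameters with no trafo, and that K runs over 1:length(task$feature_names)):
library(mlr3tuning)
library(data.table)
# enumerate all (K, D) pairs by hand: D = {10%, 25%, 50%, 100%} * K,
# rounded down, dropping anything with D < 1 or D > K
K = seq_len(length(task$feature_names))
design = rbindlist(lapply(K, function(k) {
  d = unique(pmax(1L, as.integer(floor(k * c(.10, .25, .50, 1)))))
  data.table(importance.filter.nfeat = k, classif.ranger.max.depth = d[d <= k])
}))
tuner = tnr("design_points", design = design)
tuner$optimize(inst1)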

Rcpp instead of mapply in model validation for many subsets

Let's say that we have specified N train datasets (an 80:20 split) and we want to retrieve, for each train dataset, a two-element list with the p-values and coefficients from a glm model. The code reproducing this example is as follows:
library(parallel)
library(caret)

# prepare dataset
data(iris)
iris <- iris[!iris$Species == "setosa", ]

# create validation folds
set.seed(12345)
folds <- createDataPartition(y = iris$Species, times = 100, p = 0.8, list = FALSE)

# glm model expression
model.expr.tr <- expression(glm(formula = Species ~ Sepal.Length,
                                data = dtr,
                                family = binomial(link = "logit")))

# glm elements that will be validated
val_part <- list(coefs = expression(summary(eval(model.expr.tr))$coefficients[, 1]),
                 pvals = expression(summary(eval(model.expr.tr))$coefficients[, 4]))

# lapply with mapply for validation results
val_results <- lapply(val_part, function(x) {
  mapply(function(i) {
    trindex <- rownames(iris) %in% folds[, i]
    dtr <- iris[trindex, ]
    eval(x)
  },
  i = 1:100)
})
As you are aware, the slowest part is fitting the model and taking its summary for every train dataset, especially if we choose more than 100 of them. In your opinion, is there any way to speed up this process? Of course I am aware of the parLapply / mcmapply options, but what about some kind of Rcpp speed-up in this case? Any suggestions?
Thanks.
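Before reaching for Rcpp, one cheap speed-up worth trying (a sketch based on the code above, not a drop-in replacement) is to fit each fold's glm only once and pull both the coefficients and the p-values from the same summary(), since the lapply/mapply layout above re-fits the model separately for each extracted element:
# fit once per fold and extract both elements from a single summary()
fit_fold <- function(i) {
  trindex <- rownames(iris) %in% folds[, i]
  dtr <- iris[trindex, ]
  sm <- summary(glm(Species ~ Sepal.Length, data = dtr,
                    family = binomial(link = "logit")))
  list(coefs = sm$coefficients[, 1], pvals = sm$coefficients[, 4])
}
val_results_once <- lapply(seq_len(ncol(folds)), fit_fold)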
