I am working on a project analysing different portfolio constructions over a universe of assets. I work with 22 assets and recalibrate my portfolio every 90 days; this is why a weight-penalty constraint (see code) is applied, as the allocation changes every period.
I am currently implementing a construction based on independent components. My objective is to minimize the modified Value at Risk (MVaR) computed from those components (see code below).
My code runs correctly and everything seems to be OK; my functions "MVaR.IC.port" and "MVaR.cm" work well. However, I can only implement this model in the case where short selling is allowed. I would now like to operate long only, i.e. my weight vector w should only contain elements >= 0. Concretely, I want the expression "w <- t(w.IC)%*%a$A" in my code to be >= 0 element-wise.
Do you know how I could achieve this? Thank you in advance.
The weights printed at the end of the code (w.out.MVaR.IC.cm.22) are the results that must be positive. I also impose that the sum of the weights equals 1 (the investor allocates 100% of his wealth).
Thomas
PS: train and test represent my rolling windows. I calibrate my models on 'train' (in sample) and apply them to 'test' (out of sample) in order to analyse their performance.
########################################
#########  MVaR on IC with CM  #########
########################################
library(DEoptim)  # differential evolution optimiser used below

# Box bounds on the k IC weights searched by DEoptim
lower <- rep(-5, k)
upper <- rep(5, k)

# Set up objective function and constraints:
# modified VaR of the IC portfolio plus penalty terms
MVaR.IC.cm.port <- function(S, weights, alpha, MixingMatrix)
{
  obj <- MVaR(S, weights, alpha)
  # Map the IC weights back to asset weights
  w.ICA <- t(weights) %*% MixingMatrix
  # Penalty: asset weights must sum to 1
  weight.penalty <- abs(1000 * (1 - sum(w.ICA)))
  # Penalties: keep individual asset weights inside [-1, 1]
  down.weight.penalty <- 1000 * sum(w.ICA[w.ICA > 1])
  up.weight.penalty   <- 1000 * abs(sum(w.ICA[w.ICA < -1]))
  return(obj + weight.penalty + down.weight.penalty + up.weight.penalty)
}
# Out-of-sample portfolio return computation, recalibrating every 90 days
ret.out.MVaR.IC.cm.22 <- c()
w.out.MVaR.IC.cm.22 <- matrix(ncol = n, nrow = 10)
for (i in 0:9) {
  # Rolling windows: 8 x 90 days in sample, the next 90 days out of sample
  train <- as.matrix(portfolioReturns.new[(1 + i * 90):(8 * 90 + i * 90), ])
  test  <- as.matrix(portfolioReturns.new[(1 + 8 * 90 + i * 90):(9 * 90 + i * 90), ])
  # Estimate the independent components on the in-sample window
  a <- myfastICA(train, k, alg.typ = "parallel", fun = "logcosh", alpha = 1,
                 method = "R", row.norm = FALSE, maxit = 2000,
                 tol = 1e-10, verbose = TRUE)
  # Minimise the penalised modified VaR over the IC weights
  x <- DEoptim(MVaR.IC.cm.port, lower, upper,
               control = list(NP = 10 * k, F = 0.8, CR = 0.9, trace = 50),
               S = a$S, alpha = alpha, MixingMatrix = a$A)
  w.IC <- matrix(x$optim$bestmem, ncol = 1)
  # Convert IC weights into asset weights
  w <- t(w.IC) %*% a$A
  for (j in 1:ncol(train)) {
    w.out.MVaR.IC.cm.22[(i + 1), j] <- w[j]
  }
  ret.out.MVaR.IC.cm.22 <- rbind(ret.out.MVaR.IC.cm.22, test %*% t(w))
}
w.out.MVaR.IC.cm.22
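One direction I have considered, but not validated, is to handle the long-only restriction the same way as the other constraints, i.e. with an additional penalty on negative asset weights inside the objective (a sketch only; the function name MVaR.IC.cm.port.LO is just a placeholder):

# Sketch: same objective as above, plus a penalty on short positions so that
# DEoptim is pushed towards w.ICA >= 0 element-wise (untested in my setup)
MVaR.IC.cm.port.LO <- function(S, weights, alpha, MixingMatrix)
{
  obj <- MVaR(S, weights, alpha)
  w.ICA <- t(weights) %*% MixingMatrix
  # Sum-to-one penalty, as before
  weight.penalty <- abs(1000 * (1 - sum(w.ICA)))
  # Long-only penalty: any negative asset weight is penalised
  long.only.penalty <- 1000 * abs(sum(w.ICA[w.ICA < 0]))
  # Upper-bound penalty: weights above 1 are penalised, as before
  up.weight.penalty <- 1000 * sum(w.ICA[w.ICA > 1])
  return(obj + weight.penalty + long.only.penalty + up.weight.penalty)
}

I am not sure whether such a penalty is enough, or whether the resulting weights should additionally be clipped and renormalised after the optimisation, so any advice on the proper way to enforce this is welcome.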
I am trying to rewrite a fairness ranking algorithm (source: https://arxiv.org/abs/1802.07281) from Python to Rust. The objective is to find a document-ranking probability matrix that is doubly stochastic and, using a utility vector (the document relevance in this case), gives fair exposure to all document types.
The objective is thus to maximise the expected utility under the following constraints:
sum of probabilities for each position equals 1;
sum of probabilities for each document equals 1;
every probability is valid (i.e. 0 <= P[i,j] <= 1);
P is fair (disparate treatment constraints).
In Python we have done this using CVXPY:
u = documents[['rel']].iloc[:n].values.ravel() # utility vector
v = np.array([1.0 / (np.log(2 + i)) for i in range(n)]) # position discount vector
P = cp.Variable((n, n)) # linear maximization problem uᵀPv s.t. P is doubly stochastic and fair.
# Construct f in fᵀPv such that for P every group's exposure divided by mean utility is
# equal (i.e. enforcing DTC). Do this for every pair of groups:
# example: calculated f for three groups {a, b, c}
# resulting constraints: [a - b == 0, a - c == 0, b - c == 0]
groups = {k: group.index.values for k, group in documents.iloc[:n].groupby('document_type')}
fairness_constraints = []
for k0, k1 in combinations(groups, 2):
g0, g1 = groups[k0], groups[k1]
f_i = np.zeros(n)
f_i[g0] = 1 / u[g0].sum()
f_i[g1] = -1 / u[g1].sum()
fairness_constraints.append(f_i)
# Create convex problem to solve for finding the probabilities that
# a document is at a certain position/rank, matching the fairness criteria
objective = cp.Maximize(cp.matmul(cp.matmul(u, P), v))
constraints = ([cp.matmul(np.ones((1, n)), P) == np.ones((1, n)), # ┤
cp.matmul(P, np.ones((n,))) == np.ones((n,)), # ┤
0.0 <= P, P <= 1] + # └┤ doubly stochastic matrix constraints
[cp.matmul(cp.matmul(c, P), v) == 0 for c in fairness_constraints]) # DTC
prob = cp.Problem(objective, constraints)
prob.solve(solver=cp.CBC)
This works great for multiple solvers, including SCS, ECOS and CBC.
Now, trying to implement the algorithm above in Rust, I have resorted to crates like good_lp and lp_modeler. Both should be able to solve linear problems using CBC, as also demonstrated in the Python example above. However, I am struggling to find examples of how to define the needed constraints on my matrix variable P.
The code below is my work in progress for rewriting the Python code in Rust, using the lp_modeler crate as an example. It compiles but panics when run. Furthermore, I don't know how to add the disparate treatment constraints in a way Rust likes, as neither crate seems to accept an equality constraint between two vector expressions.
let u: Array<f32, Ix1> = array![...]; // utility vector filled with dummy data
let n = cmp::min(u.len(), 25);
// position discount vector
let v: Array<f32, Ix1> = (0..n)
.map(|i| 1.0 / ((2 + i) as f32).ln())
.collect();
let P: Array<f32, Ix2> = Array::ones((n, n));
// dummy data for document indices and their types
let groups = vec![
vec![23], // type A
vec![8, 10, 16, 19], // type B
vec![0, 1, 2, 3, 4, 5, 6, 7, 9, 11, 12, 13, 15, 21, 24], // type C
vec![14, 17, 18, 20, 22] // type D
];
let mut fairness_constraints: Vec<Vec<f32>> = Vec::new();
for combo in groups.iter().combinations(2).unique() {
let mut f_i: Vec<f32> = vec![0f32; n];
{ // f_i[g0] = 1 / u[g0].sum()
let usum_g0: f32 = combo[0].iter()
.map(|&i| u[i])
.sum();
for &i in combo[0].iter() {
f_i[i] = 1f32 / usum_g0;
}
}
{ // f_i[g1] = -1 / u[g1].sum()
let usum_g1: f32 = combo[1].iter()
.map(|&i| u[i])
.sum();
for &i in combo[1].iter() {
f_i[i] = -1.0 / usum_g1;
}
}
fairness_constraints.push(f_i);
}
let mut problem = LpProblem::new("Fairness", LpObjective::Maximize);
problem += u.dot(&P).dot(&v); // Expected utility objective
// Doubly stochastic constraints
for col in P.columns() { // Sum of probabilities for each position
problem += sum(&col.to_vec(), |&el| el).equal(1);
}
for row in P.rows() { // Sum of probabilities for each document
problem += sum(&row.to_vec(), |&el| el).equal(1);
}
// Valid probability constraints
for el in P.iter() {
problem += lp_sum(&vec![el]).ge(0);
problem += lp_sum(&vec![el]).le(1);
}
// TODO: implement DTC fairness constraints
let solver = CbcSolver::new();
let result = solver.run(&problem);
Can anybody give me a nudge in the right direction on this specific problem? Thanks in advance!
I am trying to train a linear SVM while tuning its parameters with 10-fold CV for binary text classification.
None of the solutions provided in other threads work for me: I have already removed all NAs, NaNs and Infs and balanced my dataset by downsampling, but the model still returns NAs and fails in the line search. Therefore I need the help of the community, as I am kind of stuck.
The data has 2099 observations of 926 variables and consists mostly of 0s, 1s, 2s and 3s.
This is my code:
library(caret)

set.seed(335)
trainIndex <- createDataPartition(dat_SentimentAnalysis$Usefulness, p = .75,
                                  list = FALSE,
                                  times = 1)
train <- dat_SentimentAnalysis[ trainIndex, ]
test  <- dat_SentimentAnalysis[-trainIndex, ]
# check the class distribution
table(train$Usefulness)
# downsample the training set
train <- downSample(train, as.factor(train$Usefulness))
# check the distribution again
table(train$Usefulness)
train <- na.omit(train)  # no NA values detected
# separate features and outcome
x_train <- train[2:926]
y_train <- as.factor(train$Usefulness)
x_test <- test[2:926]
y_test <- as.factor(test$Usefulness)
sum(is.na(x_train))
sum(is.na(y_train))
# tune hyperparameters for the SVM
fitControl <- trainControl(method = "repeatedcv",
                           number = 10,
                           repeats = 3,
                           search = "grid",
                           classProbs = TRUE,
                           savePredictions = TRUE)
model <- caret::train(x = x_train,
                      y = y_train,
                      method = "svmLinear",
                      trControl = fitControl,
                      tunegrid = data.frame(C = c(0.25, 0.5, 1, 5, 8, 12, 100)))
Does anybody have an idea what could be wrong? When I do not perform tuning, I get a very poorly performing SVM with around 52% accuracy, but at least I get one. So maybe something in the tuning setup is wrong?
Thank you very much for your help!
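PS: for reference, the untuned baseline that does run looks roughly like this (reconstructed from memory, so the details may differ slightly from my actual call):

# Baseline without a tuning grid: trains, but only reaches about 52 % accuracy
model_baseline <- caret::train(x = x_train,
                               y = y_train,
                               method = "svmLinear")
confusionMatrix(predict(model_baseline, newdata = x_test), y_test)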
Let's say that we have specified N train datasets (80:20 split) and we want to retrieve, for each train dataset, a two-element list with the p-values and coefficients from a glm model. The code reproducing this example is as follows:
library(parallel)
library(caret)
# prepare dataset
data(iris)
iris <- iris[!iris$Species == "setosa", ]
# create validation folds
set.seed(12345)
folds <- createDataPartition(y = iris$Species, times = 100, p = 0.8, list = FALSE)
# glm model expression
model.expr.tr <- expression(glm(formula = Species ~ Sepal.Length,
                                data = dtr,
                                family = binomial(link = "logit")))
# glm elements that will be validated
val_part <- list(coefs = expression(summary(eval(model.expr.tr))$coefficients[, 1]),
                 pvals = expression(summary(eval(model.expr.tr))$coefficients[, 4]))
# lapply with mapply for validation results
val_results <- lapply(val_part, function(x) {
  mapply(function(i) {
    trindex <- rownames(iris) %in% folds[, i]
    dtr <- iris[trindex, ]
    eval(x)
  },
  i = 1:100)
})
As you are aware, the slowest part is fitting the model and running its summary over all of the train datasets, especially if we choose more than 100 of them. In your opinion, is there any way to speed up this process? Of course I am aware of the parLapply / mcmapply options, but what about some kind of Rcpp speed-up in this case? Any suggestions?
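For reference, the parLapply variant I have in mind looks roughly like this (a sketch, not benchmarked; it also fits each glm only once per fold instead of once per extracted element):

library(parallel)

cl <- makeCluster(detectCores() - 1)
clusterExport(cl, c("iris", "folds"))

val_results_par <- parLapply(cl, 1:100, function(i) {
  trindex <- rownames(iris) %in% folds[, i]
  dtr <- iris[trindex, ]
  cf <- summary(glm(Species ~ Sepal.Length,
                    data = dtr,
                    family = binomial(link = "logit")))$coefficients
  list(coefs = cf[, 1], pvals = cf[, 4])
})

stopCluster(cl)

The output here is organised per fold rather than per element, but it contains the same coefficients and p-values.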
Thanks.
I would like to speed up this function as much as possible in Matlab.
This is part of a bigger simulation project, and as it is one of the most called functions within the simulation, this is crucial.
For now, I tried generating a MEX file, but the speed was not better.
Vectorizing seems difficult (but would be beneficial due to the nested loops), given the non-linear operations.
function y = mixing(T,dis,rr,n)
%% ===================================================================
% input:   temperature array of cells T, distance array dis,
%          mixing ratio rr, number of cells n
%
% output:  new temperature array
%
% purpose: calculates the temperature array of the next timestep
% ===================================================================
for j = 1:n
    i = 1;
    r = rr;
    while i < dis(j)+1 && j+i <= n
        if (dis(j) < i)
            % scale the mixing ratio by the fractional part of dis(j)
            r = r*(dis(j)-floor(dis(j)));
        end
        d = T(j+i-1);
        T(j+i-1) = r*T(j+i) + (1-r)*T(j+i-1);
        T(j+i)   = r*d + (1-r)*T(j+i);
        i = i + 1;
    end
end
y = T;
end
Any ideas on how to speed-up this Matlab function?
Inputs: T is a 10x1 double, dis is a 10x1 double, rr is a 1x1 double, and n is a 1x1 integer value.
Example values: T = random('unif',55,65,10,1); dis = repmat(0.1,10,1); rr = rand; n = 10;
What I'm trying to compute with this is the degree of temperature mixing between water layers, given by the update equations for T(j+i-1) and T(j+i) in the loop.
This must be calculated for all layers j = 1, ..., n, at every timestep (n is the total number of water layers).
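Written out, the two assignments in the inner loop correspond to (my transcription of the code; the second line uses the old value of T(j+i-1), which the code keeps in d):

T_new(j+i-1) = r * T_old(j+i)   + (1-r) * T_old(j+i-1)
T_new(j+i)   = r * T_old(j+i-1) + (1-r) * T_old(j+i)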
I'm searching for John Tukey's algorithm that computes a "resistant line" or "median-median line" for my linear regression in R.
A student on a mailing list explained this algorithm in these terms:
"The way it's calculated is to divide the data into three groups, find the x-median and y-median values (called the summary point) for each group, and then use those three summary points to determine the line. The outer two summary points determine the slope, and an average of all of them determines the intercept."
An article about John Tukey's median-median idea, for the curious: http://www.johndcook.com/blog/2009/06/23/tukey-median-ninther/
Do you have an idea of where I could find this algorithm or an R function for it? In which package?
Thanks a lot!
There's a description of how to calculate the median-median line here. An R implementation of that is
median_median_line <- function(x, y, data)
{
  if(!missing(data))
  {
    x <- eval(substitute(x), data)
    y <- eval(substitute(y), data)
  }
  stopifnot(length(x) == length(y))
  # Step 1
  one_third_length <- floor(length(x) / 3)
  groups <- rep(1:3, times = switch((length(x) %% 3) + 1,
    one_third_length,
    c(one_third_length, one_third_length + 1, one_third_length),
    c(one_third_length + 1, one_third_length, one_third_length + 1)
  ))
  # Step 2
  x <- sort(x)
  y <- sort(y)
  # Step 3
  median_x <- tapply(x, groups, median)
  median_y <- tapply(y, groups, median)
  # Step 4
  slope <- (median_y[3] - median_y[1]) / (median_x[3] - median_x[1])
  intercept <- median_y[1] - slope * median_x[1]
  # Step 5
  middle_prediction <- intercept + slope * median_x[2]
  intercept <- intercept + (median_y[2] - middle_prediction) / 3
  c(intercept = unname(intercept), slope = unname(slope))
}
To test it, here's an example:
dfr <- data.frame(
  time = c(.16, .24, .25, .30, .30, .32, .36, .36, .50, .50, .57, .61, .61, .68, .72, .72, .83, .88, .89),
  distance = c(12.1, 29.8, 32.7, 42.8, 44.2, 55.8, 63.5, 65.1, 124.6, 129.7, 150.2, 182.2, 189.4, 220.4, 250.4, 261.0, 334.5, 375.5, 399.1))
median_median_line(time, distance, dfr)
# intercept     slope
#    -113.6     520.0
Note the slightly odd way of specifying the groups. The instructions are quite picky about how you define group sizes, so the more obvious method of cut(x, quantile(x, seq.int(0, 1, 1/3))) doesn't work.
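To make that concrete with the 19-point dfr example above (my own check, not part of the original instructions):

length(dfr$time)              # 19 data points
floor(length(dfr$time) / 3)   # 6
length(dfr$time) %% 3         # 1, so the switch picks group sizes c(6, 7, 6)
table(rep(1:3, times = c(6, 7, 6)))
# 1 2 3
# 6 7 6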
I'm a little late to the party, but have you tried line() from the stats package?
From the helpfile:
Value
An object of class "tukeyline".
References
Tukey, J. W. (1977). Exploratory Data Analysis, Reading Massachusetts: Addison-Wesley.
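For example, with the dfr data from the answer above:

fit <- line(dfr$time, dfr$distance)  # Tukey's resistant line from the stats package
coef(fit)                            # intercept and slope of the fitted line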
As a member of the R Core team, I have now dug into the source code, and also studied its history.
Conclusion: the C source code, added in 1996/1997 when R was still called alpha (around version 0.14alpha), already computed the quantiles not quite correctly for some sample sizes.
More about this on the R mailing lists (not posted yet).