Add multiple parallel planes to 3D plot with plotly package (datacamp exercise) - matrix

I am trying to visualize two parallel planes in a 3D plot with the plotly package (for those who have access to DataCamp, it is this exercise: https://campus.datacamp.com/courses/multiple-and-logistic-regression/multiple-regression?ex=9).
At some point in the construction of this graph, you need to create a matrix of fitted values for each plane, one per level of the categorical variable (plane_new for condition = new, plane_used for condition = used). The problem is that the two datasets I used as input (one per level of condition) have different numbers of observations. I can't figure out how to make these matrices comparable so that the planes are drawn correctly as geometric objects in the figure.
I hope an R-wizard can help me out ;). Here's my code:
# libraries
library(openintro) #exemplary datasets
library(modelr) #multivariate methods
library(broom) #tidy
library(ggplot2) #visualizing data
library(plotly) #visualizing models in 3D
## Fit the model
lm_ext <- lm(totalPr ~ duration + startPr + cond, # Interpretation: With every 1 unit increase of auction duration (unit = day), the price of the game decreases with .51 units in the response variable (total price), when keeping startPr constant. The eventual value of the predicted value also depends on condition (categorical), for which the y-intercept is different
data = marioKart)
## Visualize the model (including predictions)
marioKart_ss_new <- subset(marioKart, cond == "new") # To visualize planes in a 3D graph in plotly, the dataframe needs to be split in the number of levels of the categorical variable
marioKart_ss_used <- subset(marioKart, cond == "used")
duration_new <- as.vector(marioKart_ss_new$duration) # These vectors represent the linear model for condition = new
startPr_new <- as.vector(marioKart_ss_new$startPr)
duration_used <- as.vector(marioKart_ss_used$duration) # These vectors represent the linear model for condition = used
startPr_used <- as.vector(marioKart_ss_used$startPr)
lm_new <- lm(totalPr ~ duration + startPr, # Create two linear models
data = marioKart_ss_new)
lm_used <- lm(totalPr ~ duration + startPr,
data = marioKart_ss_used)
grid_new <- marioKart_ss_new %>% # Make two grids with all combinations of the levels of the two numerical explanatory variables
data_grid(duration =
seq_range(duration, by = 1),
startPr =
seq_range(startPr, by = 1))
grid_used <- marioKart_ss_used %>%
data_grid(duration =
seq_range(duration, by = 1),
startPr =
seq_range(startPr, by = 1))
lm_new <- lm(totalPr ~ duration + startPr, # Make two separate models based on the two levels of the categorical explanatory variable
data = marioKart_ss_new)
lm_used <- lm(totalPr ~ duration + startPr,
data = marioKart_ss_used)
pred_new <- augment(lm_new, newdata = grid_new) # Predictions
pred_used <- augment(lm_used, newdata = grid_used)
plane_new <- matrix(pred_new$.fitted, # Matrix of predictions as input for planes
nrow = 70,
ncol = 70)
plane_used <- matrix(pred_used$.fitted,
nrow = 55,
ncol = 55)
plot <- plot_ly(data = marioKart, # 3D plot of datapoints
z = ~totalPr,
x = ~duration,
y = ~startPr,
opacity = 0.6) %>%
add_markers(color = ~cond)
plot %>% # Add planes
add_surface(x = ~duration_new, ### NOT WORKING, WAIT FOR DATACAMP
y = ~startPr_new,
z = ~plane_new,
showscale = FALSE) %>%
add_surface(x = ~duration_used,
y = ~duration_used,
z = ~plane_used,
showscale = FALSE)

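No code wizard here, but I asked for the same thing. Here mod is presumably the model fitted on the full data (lm_ext in your code) and mario_kart is the dataset: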
library(tidyverse)
library(modelr)
grid <- mario_kart %>%
modelr::data_grid(
duration = seq_range(duration, n = 70),
startPr = seq_range(startPr, n = 70),
cond
)
library(broom)
tidy_planes <- mod %>%
augment(newdata = grid)
x <- unique(grid$duration)
y <- unique(grid$startPr)
plane0 <- tidy_planes %>%
filter(cond == "new") %>%
pull(.fitted) %>%
matrix(nrow = length(x), byrow = TRUE)
plane1 <- tidy_planes %>%
filter(cond == "used") %>%
pull(.fitted) %>%
matrix(nrow = length(x), byrow = TRUE)
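The key point in the code above is that both planes are predicted on one shared grid, so the two matrices have the same shape no matter how many observations each condition has. A minimal sketch of that idea in plain Python, for illustration only: the coefficients in `coef`, the grid ranges, and the helper names `seq_range` and `plane` are made up here, not taken from the fitted model.

```python
def seq_range(lo, hi, n):
    """Return n evenly spaced values from lo to hi (inclusive)."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]

# Hypothetical coefficients of totalPr ~ duration + startPr + cond (not real estimates).
coef = {"intercept": 50.0, "duration": -0.5, "startPr": 0.2, "cond_used": -5.0}

def plane(x, y, offset):
    """Fitted values on the grid: one row per y value, one column per x value."""
    return [[coef["intercept"] + coef["duration"] * xj + coef["startPr"] * yi + offset
             for xj in x] for yi in y]

# One grid over the full data range (ranges here are assumed), shared by both levels.
x = seq_range(1.0, 10.0, 70)    # duration
y = seq_range(0.0, 70.0, 70)    # startPr

plane_new = plane(x, y, 0.0)                   # condition = new
plane_used = plane(x, y, coef["cond_used"])    # condition = used: same slopes, shifted

print(len(plane_new), len(plane_new[0]))
print(len(plane_used), len(plane_used[0]))
```

Which axis runs along the rows depends on the plotting library, so if a surface looks rotated, transposing the matrix is the first thing to try.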

Related

Emotion detection using facial landmarks

I plan to use a scikit-learn SVM for class prediction.
I have been following these steps:
Get images from a webcam
Detect Facial Landmarks
Train a machine learning algorithm (we will use a linear SVM)
Predict emotions
I have a problem in this line: clf.fit(npar_train, training_labels)
The traceback also points to site-packages\sklearn\svm\_base.py and site-packages\sklearn\utils\validation.py.
How can I fix this error?
Thank you in advance.
python script
emotions = ['neutral', 'sad', 'happy', 'anger']
data={}
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor('shape_predictor_68_face_landmarks.dat')
clf = SVC(kernel='linear', probability=True, tol=1e-3)

def get_files(emotion):
    files = glob.glob('img\\datasets\\%s\\*' %emotion)
    random.shuffle(files)
    training = files[:int(len(files)*0.8)]
    prediction = files[-int(len(files)*0.2)]
    return training, prediction

def get_landmarks(image):
    detections = detector(image, 1)
    for k, d in enumerate(detections): # For all detected face instances individually
        shape = predictor(image, d) # Draw Facial Landmarks with the predictor class
        xlist = []
        ylist = []
        for i in range(1, 68): # Store X and Y coordinates in two lists
            xlist.append(float(shape.part(i).x))
            ylist.append(float(shape.part(i).y))
        xmean = np.mean(xlist)
        ymean = np.mean(ylist)
        xcentral = [(x - xmean) for x in xlist]
        ycentral = [(y - ymean) for y in ylist]
        landmarks_vectorised = []
        for x, y, w, z in zip(xcentral, ycentral, xlist, ylist):
            landmarks_vectorised.append(w)
            landmarks_vectorised.append(z)
            meannp = np.asarray((ymean, xmean))
            coornp = np.asarray((z, w))
            dist = np.linalg.norm(coornp - meannp)
            landmarks_vectorised.append(dist)
            landmarks_vectorised.append((math.atan2(y, x) * 360) / (2 * math.pi))
        data['landmarks_vectorised'] = landmarks_vectorised
    if len(detections) < 1:
        data['landmarks_vestorised'] = "error"

def make_sets():
    training_data = []
    training_labels = []
    prediction_data = []
    prediction_labels = []
    for emotion in emotions:
        print("Working on %s emotion" %emotion)
        training, prediction = get_files(emotion)
        for item in training:
            image = cv2.imread(item)
            try:
                image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
            except:
                print()
            clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
            clahe_image = clahe.apply(image)
            get_landmarks(clahe_image)
            if data['landmarks_vectorised'] == "error":
                print("no face detected on this one")
            else:
                training_data.append(data['landmarks_vectorised']) # append image array to training data list
                training_labels.append(emotions.index(emotion))
        for item in prediction:
            image = cv2.imread(item)
            try:
                image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
            except:
                print()
            clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
            clahe_image = clahe.apply(image)
            get_landmarks(clahe_image)
            if data['landmarks_vectorised'] == "error":
                print("no face detected on this one")
            else:
                prediction_data.append(data['landmarks_vectorised'])
                prediction_labels.append(emotions.index(emotion))
    return training_data, training_labels, prediction_data, prediction_labels

accur_lin = []
for i in range(0,10):
    print("Making sets %s" % i) # Make sets by random sampling 80/20%
    training_data, training_labels, prediction_data, prediction_labels = make_sets()
    npar_train = np.array(training_data)
    npar_trainlabs = np.array(training_labels)
    print("training SVM linear %s" % i) # train SVM
    clf.fit(npar_train, training_labels)
    print("getting accuracies %s" % i)
    npar_pred = np.array(prediction_data)
    pred_lin = clf.score(npar_pred, prediction_labels)
print("Mean value lin svm: %s" % np.mean(accur_lin))

PortfolioAnalytics R - I find optimal portfolio with DEOptim, but when plotting it looks like portfolio is not optimal, why?

I create an optimal portfolio using optimize.portfolio with the DEoptim solver (a maxDrawdown risk objective with target -0.1 and a maximum-return objective). But when I plot the optimized portfolio object, the portfolio does not look optimal: there are portfolios with the same drawdown and better returns. Could someone explain why?
library('PortfolioAnalytics')
library('PerformanceAnalytics')
library('DEoptim')
data(edhec)
ret <- edhec[, 1:10]
init.portf <- portfolio.spec(assets=colnames(ret))
init.portf <- add.constraint(portfolio=init.portf, type="full_investment")
init.portf <- add.constraint(portfolio=init.portf, type="long_only")
group_list <- list(group1=c(3),
group2=c(1, 2),
group3=c(5, 7, 8))
init.portf <- add.constraint(portfolio= init.portf,
type="group",
groups=group_list,
group_min=c(0.03, 0, 0),
group_max=c(0.032, 0.2, 0.3))
ret.obj.portf <- add.objective(portfolio=init.portf, type="return",
name="mean")
ret.obj.portf <- add.objective(portfolio = ret.obj.portf,
type = 'risk',
name = 'maxDrawdown',
arguments = list(inverse=TRUE),
target = -0.1)
ret.obj.portf$constraints[[1]]$min_sum <- 0.99
ret.obj.portf$constraints[[1]]$max_sum <- 1.01
ret.obj.portf
set.seed(123)
opt.obj.no1.1 <- optimize.portfolio(R=ret, portfolio=ret.obj.portf,
optimize_method="DEoptim", search_size=2000, trace=TRUE)
opt.obj.no1.1
chart.RiskReward(opt.obj.no1.1,
main = 'Optimized Portfolio: max return and 10% maxDD, all investments allowed',
return.col = "mean", risk.col = 'maxDrawdown')

Error: Model is large in H2o autoencoder training

I have a table of size 5360 x 51200: 5360 instances and 51200 features. I need to reduce the number of features. I tried to do this with a stacked autoencoder in H2O, but training fails with an error along these lines:
Model is a large and large number of parameters
Here is the code:
library(h2o)
h2o.init(nthreads = -1)
check.deeplearning_stacked_autoencoder <- function() {
# this function builds a vector of autoencoder models, one per layer
#library(h2o)
#h2o.init()
get_stacked_ae_array <- function(training_data, layers, args) {
vector <- c()
index = 0
for (i in 1:length(layers)) {
index = index + 1
ae_model <- do.call(h2o.deeplearning,
modifyList(
list(
x = names(training_data),
training_frame = training_data,
autoencoder = T,
hidden = layers[i]
),
args
))
training_data = h2o.deepfeatures(ae_model, training_data, layer =
3)
names(training_data) <-
gsub("DF", paste0("L", index, sep = ""), names(training_data))
vector <- c(vector, ae_model)
}
cat(
length(vector))
}
# this function returns final encoded contents
apply_stacked_ae_array <- function(data, ae) {
index = 0
for (i in 1:length(ae)) {
index = index + 1
data = h2o.deepfeatures(ae[[i]], data, layer = 3)
names(data) <-
gsub("DF", paste0("L", index, sep = ""), names(data))
}
data
}
TRAIN <-
"E:/Chiranjibi file/Geometric features/Lu/Train/d_features.csv"
TEST <-
"E:/Chiranjibi file/Geometric features/Lu/Test/d_features.csv"
response <- 51201
# set to T for RUnit
# set to F for stand-alone demo
if (T) {
train_hex <- h2o.importFile((TRAIN))
test_hex <- h2o.importFile((TEST))
} else
{
library(h2o)
h2o.init()
homedir <-
paste0(path.expand("~"), "/h2o-dev/") #modify if needed
train_hex <-
h2o.importFile(path = paste0(homedir, TRAIN),
header = F,
sep = ',')
test_hex <-
h2o.importFile(path = paste0(homedir, TEST),
header = F,
sep = ',')
}
train <- train_hex[, -response]
test <- test_hex [, -response]
train_hex[, response] <- as.factor(train_hex[, response])
test_hex [, response] <- as.factor(test_hex [, response])
## Build reference model on full dataset and evaluate it on the test set
model_ref <-
h2o.deeplearning(
training_frame = train_hex,
x = 1:(ncol(train_hex) - 1),
y = response,
hidden = c(67),
epochs = 50
)
p_ref <- h2o.performance(model_ref, test_hex)
h2o.logloss(p_ref)
## Now build a stacked autoencoder model with three stacked layer AE models
## First AE model will compress the 717 non-const predictors into 200
## Second AE model will compress 200 into 100
## Third AE model will compress 100 into 50
layers <- c(50000,20000,10000,5000,2000, 1000, 500)
args <- list(activation = "Tanh",
epochs = 1,
l1 = 1e-5)
ae <- get_stacked_ae_array(train, layers, args)
## Now compress the training/testing data with this 3-stage set of AE models
train_compressed <- apply_stacked_ae_array(train, ae)
test_compressed <- apply_stacked_ae_array(test, ae)
## Build a simple model using these new features (compressed training data) and evaluate it on the compressed test set.
train_w_resp <- h2o.cbind(train_compressed, train_hex[, response])
test_w_resp <- h2o.cbind(test_compressed, test_hex[, response])
model_on_compressed_data <-
h2o.deeplearning(
training_frame = train_w_resp,
x = 1:(ncol(train_w_resp) - 1),
y = ncol(train_w_resp),
hidden = c(67),
epochs = 1
)
p <- h2o.performance(model_on_compressed_data, test_w_resp)
h2o.logloss(p)
}
#h2o.describe(train)
#doTest("Deep Learning Stacked Autoencoder", check.deeplearning_stacked_autoencoder)
As Tom says, your autoencoder first layer is too big.
51,200 is a lot of features. How much correlation is there between them? The more correlation you have, the smaller the first layer of your autoencoder can happily be.
Try h2o.prcomp() and see how many dimensions cover 99% of the variance; that is often a good guide to how big your first layer can or should be.
Or, if you prefer a more experimental approach:
1. Start with, e.g., 200 neurons in one layer.
2. Look at the MSE it gets to, after enough epochs to stop improving.
3. Double the number of neurons in that layer.
4. See if the MSE gets any better. If not, stop there.
5. If it did, double again, and repeat.
You could then try moving to multiple layers, but there is not much point in using a bigger first layer than the best you can get from a single layer.
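The doubling procedure can be sketched as a small loop. This is plain Python for illustration, not H2O code: `train_and_score` is a hypothetical callback that would train a one-layer autoencoder of the given width and return its final MSE, and the `min_gain` threshold for "any better" is an assumption of mine.

```python
def grow_hidden_layer(train_and_score, start=200, min_gain=0.01, max_units=51200):
    """Double the layer width until the relative MSE improvement falls below min_gain."""
    best_units = start
    best_mse = train_and_score(start)
    units = start * 2
    while units <= max_units:
        mse = train_and_score(units)
        if best_mse - mse < min_gain * best_mse:
            break                      # no meaningful improvement: stop here
        best_units, best_mse = units, mse
        units *= 2
    return best_units, best_mse

# Toy stand-in for real training: the error stops improving beyond 800 units.
def toy_mse(units):
    return 1.0 / min(units, 800)

result = grow_hidden_layer(toy_mse)
print(result)
```

With real models, each call to the callback is a full training run, so it is worth caching the scores.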
Since your dataset has 51,200 features, and your layers array has 50,000 as the first value, there are 51200 * 50000 == 2.56e9 weights in that first set of network connections.
That is too many; try smaller numbers.
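To put that number in context, here is a rough sketch in plain Python that counts the weights in a single-hidden-layer autoencoder. The function name is hypothetical, biases are ignored, and the 4 bytes per weight used for the memory estimate is an assumption, not something taken from H2O's documentation.

```python
def autoencoder_weights(n_features, n_hidden):
    """Weights in input->hidden plus hidden->output (biases ignored)."""
    encoder = n_features * n_hidden
    decoder = n_hidden * n_features
    return encoder, decoder

for hidden in (50000, 200):
    enc, dec = autoencoder_weights(51200, hidden)
    gib = (enc + dec) * 4 / 2**30   # assumed 4 bytes per weight
    print(hidden, enc, dec, round(gib, 2))
```

Comparing the two rows shows how quickly the size drops when the first layer is a few hundred units rather than tens of thousands.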

JAGS - pow function does not work properly in mixture model with label switching

I am fitting a mixture model to estimate the average of a trait in each of 3 populations.
I have a label switching issue and I am trying to compute the distance between the observed and expected numbers of individuals of each genotype in each population to relabel population clusters. Below is a reproducible example.
For some reason, JAGS does not seem to compute the squared values for the distance properly. The corresponding line in the code below is: dist[k,j] <- pow(DistNumPerClust[k,j], 2)
Hence, the output matrix results$mean$dist is different from the matrix, results$mean$DistNumPerClust^2, computed a posteriori.
Would anyone know a way to solve this?
library(R2jags)
library(runjags)
library(dirmult)
set.seed(123)
############################
## Simulation of the data ##
############################
npop=3
ngeno=2
freqbalance=1
nsamplesizeperpop <- 100
freqMLG <- t(rdirichlet(n=npop, alpha=rep(freqbalance, ngeno)))
samplesizegenoperpop <- sweep(freqMLG, 1, nsamplesizeperpop, "*")
## Compute membership (probability that a genotype comes from pop 1, 2 or 3)
## Genotype as rows and populations as columns
membership <- sweep(freqMLG, 1, rowSums(freqMLG), "/")
# Parameters for simulations
nind=90
N = npop*nind # nb of observations
clust <- rep(1:npop, each=N/npop)
geno <- c()
for (i in 1:N){
geno <- c(geno, sum(rmultinom(n=1, size=1, prob=freqMLG[, clust[i]])*1:ngeno))
}
numgeno <- as.numeric(table(geno))
## Multiply membership probabilities by sample size for each genotype
ExpNumPerClust <- sweep(membership, 1, numgeno, "*")
muOfClustsim <- c(1, 20, 50) # vector of population means
sigma <- 1.5 # residual sd
(tausim <- 1/(sigma*sigma)) # precision
# parameters are treated as data for the simulation step
data <- list(N=N, npop=npop, ngeno=ngeno, geno=geno, muOfClustsim=muOfClustsim, tausim=tausim, samplesizegenoperpop=samplesizegenoperpop)
## JAG model
txtstring <- "
data{
# Likelihood:
for (i in 1:N){
ysim[i] ~ dnorm(eta[i], tausim) # tau is precision (1 / variance)
eta[i] <- muOfClustsim[clust[i]]
clust[i] ~ dcat( pClust[geno[i], 1:npop] )
}
for (k in 1:ngeno){
pClust[k, 1:npop] ~ ddirch( samplesizegenoperpop[k,] )
}
}
model{
fake <- 0
}
"
# Simulate with jags
out <- run.jags(txtstring, data = data, monitor=c("ysim"), sample=1, n.chains=1, summarise=FALSE)
# reformat the outputs
ysim <- coda::as.mcmc(out)[1:N]
## Estimation model
bayes.mod <- function(){
# Likelihood:
for (i in 1:N){
ysim[i] ~ dnorm(eta[i], tau) # tau is precision (1 / variance)
eta[i] <- beta[clust[i]]
clust[i] ~ dcat( pClust[geno[i], 1:npop] )
}
for (k in 1:ngeno){
## pClust membership estimates
pClust[k, 1:npop] ~ ddirch( samplesizegenoperpop[k,] )
}
for (k in 1:ngeno){
for (j in 1:npop){
# problem of label switching: try to compute the distance between ObsNumPerClust and ExpNumPerClust (i.e. between observed and expected number of individuals of each genotype in each population)
ObsNumPerClust[k,j] <- pClust[k, j] * numgeno[k]
DistNumPerClust[k,j] <- ObsNumPerClust[k,j] - ExpNumPerClust[k,j]
dist[k,j] <- pow(DistNumPerClust[k,j], 2)
}
}
# Priors
beta ~ dmnorm(mu, sigma.inv)
mu ~ dmnorm(m, V)
sigma.inv ~ dwish(R, K)
tau ~ dgamma(0.01, 0.01)
# parameters transformations
sig <- sqrt(1/ tau)
}
m = rep(1, npop)
V = diag(rep(0.01, npop))
R = diag(rep(0.1, npop))
K = npop
## Input variables
sim.dat.jags<-list("ysim","N","npop", "ngeno", "geno","m","V","R", "K", "samplesizegenoperpop","numgeno","ExpNumPerClust")
## Variables to monitor
bayes.mod.params <- c("beta","tau","sig","DistNumPerClust","dist")
## Starting values
init1 <- list(beta = c(0, 100, 1000), tau = 1)
bayes.mod.inits <- list(init1)
## Run model
bayes.mod.fit<-jags(data = sim.dat.jags, inits = bayes.mod.inits, parameters.to.save = bayes.mod.params, n.chains=1, n.iter=101000, n.burnin=1000, n.thin=200, model.file = bayes.mod)
results <- print(bayes.mod.fit)
results$mean$dist
results$mean$DistNumPerClust^2
It seems that you expect that the mean of a transformed set of values will give the same result as transforming the mean of the same set of values. But this is not the case - for example:
values <- c(1,2,3,6,8,20)
mean(values)^2
mean(values^2)
Are not the same thing.
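In fact the gap between the two numbers is exactly the (population) variance of the values, so they only agree when all values are identical. A small check of that identity on the same values, written in plain Python for illustration (the variable names are mine):

```python
values = [1, 2, 3, 6, 8, 20]
n = len(values)

mean = sum(values) / n
square_of_mean = mean ** 2
mean_of_squares = sum(v * v for v in values) / n

# Population variance: average squared deviation from the mean.
variance = sum((v - mean) ** 2 for v in values) / n

print(square_of_mean, mean_of_squares, variance)
print(abs(mean_of_squares - (square_of_mean + variance)) < 1e-9)
```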
The equivalent is happening in your model - you calculate dist[k,j] as the square of DistNumPerClust[k,j] and then summarise to a mean of dist, and expect this to be the same as the square of the mean of DistNumPerClust[k,j]. Or in a simpler example:
library('runjags')
X <- 1:100
Y <- rnorm(length(X), 2*X + 10, 1)
model <- "model {
for(i in 1 : N){
Y[i] ~ dnorm(true.y[i], precision);
true.y[i] <- (m * X[i]) + c
}
m ~ dunif(-1000,1000)
c ~ dunif(-1000,1000)
precision ~ dexp(1)
p2 <- precision^2
}"
data <- list(X=X, Y=Y, N=length(X))
results <- run.jags(model=model, monitor=c("m", "c", "precision", "p2"),
data=data, n.chains=2)
results
More specifically, these should not be expected to be the same:
summary(results)['p2','Mean']
summary(results)['precision','Mean']^2
If you want to calculate the same thing you can extract the full chain of values as an MCMC object and do your transformation on these:
p <- combine.mcmc(results,vars='precision')
p2 <- combine.mcmc(results,vars='p2')
mean(p^2)
mean(p2)
mean(p)
mean(sqrt(p2))
Now everything is equivalent.
Matt

Applying nlminb to subsets of data (by index or label) and store what the program returns as a new data frame

I was wondering if anyone could kindly help me with this seemingly easy task. I'm using nlminb to conduct optimization and compute some statistics by index. Here's an example from nlminb help.
> x <- rnbinom(100, mu = 10, size = 10)
> hdev <- function(par) {
+ -sum(dnbinom(x, mu = par[1], size = par[2], log = TRUE))
+ }
> nlminb(c(9, 12), hdev)
$par
[1] 9.730000 5.954936
$objective
[1] 297.2074
$convergence
[1] 0
$message
[1] "relative convergence (4)"
$iterations
[1] 10
$evaluations
function gradient
12 27
Suppose I generate random variables x, y, and z where z acts as an index (from 1 to 3).
> x <- rnbinom(100, mu = 10, size = 10)
> y <- rnbinom(100, mu = 10, size = 10)
> z <- rep(1:3, length=100)
> A <- cbind(x,y,z)
> hdev <- function(par) {
+ -sum(dnbinom(x+y, mu = par[1], size = par[2], log = TRUE))}
How can I apply nlminb(c(9, 12), hdev) to the data set by index z? In other words, I would like to compute nlminb(c(9, 12), hdev) for z=1, z=2, and z=3 separately. I tried by(A, z, function(A) nlminb(c(9,12), hdev)) and sparseby(A, z, function(A) nlminb(c(9,12), hdev)), but they return exactly the same values for each value of z.
I would like to turn each output into a new data frame so that it will become a 3X2 matrix.
[1] Z1_ANSWER_1 Z1_ANSWER_2
[2] Z2_ANSWER_1 Z2_ANSWER_2
[3] Z3_ANSWER_1 Z3_ANSWER_2
Since nlminb returns the summary of statistics, I needed to use CASEZ1<-nlminb$par, CASEZ2<-nlminb$par, CASEZ3<-nlminb$par and then use cbind to combine them. However, I would like to automate this process as the real data I'm working on has a lot more categories than z presented here.
If I'm not making myself clear, please let me know. I'll see if I can replicate the actual data set and functions I'm working on (I just don't have them on this computer).
Thank you very much in advance.
Let me try an approach
x <- rnbinom(100, mu = 10, size = 10)
y <- rnbinom(100, mu = 10, size = 10)
z <- rep(1:3, length=100)
A <- as.data.frame(cbind(x,y,z))
First, load the plyr library
library(plyr)
The following code returns the results for each z
dlply(A, .(z), function(d) {
# the objective takes its data as an argument instead of reading the global x and y
hdev <- function(par, mydata) {-sum(dnbinom(mydata, mu = par[1], size = par[2], log = TRUE))}
nlminb(c(9, 12), hdev, mydata = d$x + d$y)
}
)
Now, with this one you will get a data frame with one row per value of z and the two $par results as columns
ddply(A, .(z), function(d) {
# d is the subset of A for one value of z
hdev <- function(par, mydata) {-sum(dnbinom(mydata, mu = par[1], size = par[2], log = TRUE))}
res <- nlminb(c(9, 12), hdev, mydata = d$x + d$y)
return(res$par)
}
)
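The reason by() returned identical values in the question is that hdev read the global x and y, so every group was fitted to the full data. The fix is to pass each group's data into the objective. A minimal sketch of that pattern in plain Python, with two deliberate substitutions so it stays self-contained: a Poisson likelihood instead of the negative binomial, and a simple golden-section search instead of nlminb. All names and the toy rows are hypothetical.

```python
import math
from collections import defaultdict

def poisson_nll(mu, data):
    """Negative log-likelihood of a Poisson(mu) sample, up to a constant."""
    return sum(mu - k * math.log(mu) for k in data)

def golden_section_min(f, lo, hi, tol=1e-8):
    """Minimise a unimodal function f on the interval [lo, hi]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while abs(b - a) > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) < f(d):
            b = d
        else:
            a = c
    return (a + b) / 2

# Toy rows of (x, y, z); z is the group index.
rows = [(3, 4, 1), (5, 6, 2), (10, 12, 3), (2, 5, 1), (7, 6, 2), (11, 9, 3)]

groups = defaultdict(list)
for x, y, z in rows:
    groups[z].append(x + y)

# One fit per group; the objective receives that group's data explicitly.
fits = {
    z: golden_section_min(lambda mu, d=data: poisson_nll(mu, d), 0.01, 100.0)
    for z, data in sorted(groups.items())
}
print(fits)
```

For a Poisson model the estimate for each group should land on that group's sample mean, which gives an easy sanity check that the groups really were fitted separately.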
