Error: Model is large in H2o autoencoder training - h2o

I have a table of size 5360*51200, where 5360 is the number of instances and 51200 is the number of features. I need to reduce the dimensionality of the features. I was trying to do this with a stacked autoencoder in H2O, but it would not let me train, raising the error:
Model is a large and large number of parameters
Here is the code:
library(h2o)
h2o.init(nthreads = -1)

check.deeplearning_stacked_autoencoder <- function() {
  # this function builds a vector of autoencoder models, one per layer
  #library(h2o)
  #h2o.init()
  get_stacked_ae_array <- function(training_data, layers, args) {
    vector <- c()
    index <- 0
    for (i in 1:length(layers)) {
      index <- index + 1
      ae_model <- do.call(h2o.deeplearning,
                          modifyList(
                            list(
                              x = names(training_data),
                              training_frame = training_data,
                              autoencoder = T,
                              hidden = layers[i]
                            ),
                            args
                          ))
      training_data <- h2o.deepfeatures(ae_model, training_data, layer = 3)
      names(training_data) <-
        gsub("DF", paste0("L", index), names(training_data))
      vector <- c(vector, ae_model)
    }
    cat(length(vector))
    vector   # return the list of AE models
  }

  # this function returns the final encoded contents
  apply_stacked_ae_array <- function(data, ae) {
    index <- 0
    for (i in 1:length(ae)) {
      index <- index + 1
      data <- h2o.deepfeatures(ae[[i]], data, layer = 3)
      names(data) <- gsub("DF", paste0("L", index), names(data))
    }
    data
  }

  TRAIN <- "E:/Chiranjibi file/Geometric features/Lu/Train/d_features.csv"
  TEST  <- "E:/Chiranjibi file/Geometric features/Lu/Test/d_features.csv"
  response <- 51201

  # set to T for RUnit
  # set to F for stand-alone demo
  if (T) {
    train_hex <- h2o.importFile(TRAIN)
    test_hex  <- h2o.importFile(TEST)
  } else {
    library(h2o)
    h2o.init()
    homedir <- paste0(path.expand("~"), "/h2o-dev/") # modify if needed
    train_hex <- h2o.importFile(path = paste0(homedir, TRAIN),
                                header = F,
                                sep = ',')
    test_hex <- h2o.importFile(path = paste0(homedir, TEST),
                               header = F,
                               sep = ',')
  }

  train <- train_hex[, -response]
  test  <- test_hex[, -response]
  train_hex[, response] <- as.factor(train_hex[, response])
  test_hex[, response]  <- as.factor(test_hex[, response])

  ## Build a reference model on the full dataset and evaluate it on the test set
  model_ref <-
    h2o.deeplearning(
      training_frame = train_hex,
      x = 1:(ncol(train_hex) - 1),
      y = response,
      hidden = c(67),
      epochs = 50
    )
  p_ref <- h2o.performance(model_ref, test_hex)
  h2o.logloss(p_ref)

  ## Now build a stacked autoencoder, one AE model per entry of `layers`
  ## (each AE compresses the previous layer's features to the given size)
  layers <- c(50000, 20000, 10000, 5000, 2000, 1000, 500)
  args <- list(activation = "Tanh",
               epochs = 1,
               l1 = 1e-5)
  ae <- get_stacked_ae_array(train, layers, args)

  ## Now compress the training/testing data with this set of AE models
  train_compressed <- apply_stacked_ae_array(train, ae)
  test_compressed  <- apply_stacked_ae_array(test, ae)

  ## Build a simple model using these new features (compressed training data)
  ## and evaluate it on the compressed test set.
  train_w_resp <- h2o.cbind(train_compressed, train_hex[, response])
  test_w_resp  <- h2o.cbind(test_compressed, test_hex[, response])
  model_on_compressed_data <-
    h2o.deeplearning(
      training_frame = train_w_resp,
      x = 1:(ncol(train_w_resp) - 1),
      y = ncol(train_w_resp),
      hidden = c(67),
      epochs = 1
    )
  p <- h2o.performance(model_on_compressed_data, test_w_resp)
  h2o.logloss(p)
}
#h2o.describe(train)
#doTest("Deep Learning Stacked Autoencoder", check.deeplearning_stacked_autoencoder)

As Tom says, your autoencoder first layer is too big.
51,200 is a lot of features. How much correlation is there between them? The more correlation you have, the smaller the first layer of your autoencoder can happily be.
Try h2o.prcomp() and see how many dimensions cover 99% of the variance; that is often a good guide to how big your first layer can/should be.
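For example, something along these lines (a rough, untested sketch; train is the predictor frame from your code, and k is capped at a manageable number of components rather than all 51,200):
pca <- h2o.prcomp(training_frame = train, k = 500, transform = "STANDARDIZE")
pca@model$importance  # if I remember the output right, this table has a "Cumulative Proportion" row; see where it first passes 0.99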
Or, if you prefer a more experimental approach:
Start with, e.g., 200 neurons in one layer.
Look at the MSE it reaches after enough epochs to stop improving.
Double the number of neurons in that layer.
See if the MSE gets any better. If not, stop there.
If it did, double again and repeat (a rough sketch of this loop is shown below).
You could then try moving to multiple layers, but there is not much point in using a bigger first layer than the best you can get from a single layer.
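Something like the following captures that doubling experiment (my own untested sketch; train is the frame of predictors from your code, and the epoch count is arbitrary):
hidden_size <- 200
best_mse <- Inf
repeat {
  ae <- h2o.deeplearning(
    x = names(train),
    training_frame = train,
    autoencoder = TRUE,
    hidden = c(hidden_size),
    activation = "Tanh",
    epochs = 20
  )
  mse <- h2o.mse(ae)   # reconstruction MSE on the training data
  cat("hidden =", hidden_size, " MSE =", mse, "\n")
  if (mse >= best_mse) break   # no improvement: stop, the previous size was enough
  best_mse <- mse
  hidden_size <- hidden_size * 2
}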

Since your dataset has 51,200 features and your layers array has 50,000 as its first value, that is 51,200 * 50,000 == 2.56e9 weights in the first set of network connections.
That is far too many; try much smaller numbers.
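(A back-of-the-envelope check of that number in R, counting only the first input-to-hidden weight matrix and ignoring biases:)
n_features   <- 51200
first_hidden <- 50000
n_features * first_hidden   # 2.56e9 weights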

Related

Add multiple parallel planes to 3D plot with plotly package (datacamp exercise)

I am trying to visualize two parallel planes in a 3D plot with the plotly package (for those who have access to DataCamp, it is this exercise: https://campus.datacamp.com/courses/multiple-and-logistic-regression/multiple-regression?ex=9).
At some point in the construction of this graph, you need to create a matrix of fitted values for each plane, matching the levels of the categorical variable (plane_new for condition == "new", plane_used for condition == "used"). The problem is that the two datasets I used as input (corresponding to the two levels of condition) have different numbers of observations. I can't figure out how to make these matrices comparable so that the planes are fit correctly as geometric objects in the figure.
I hope an R wizard can help me out ;). Here's my code:
# libraries
library(openintro) # exemplary datasets
library(modelr)    # multivariate methods
library(broom)     # tidy
library(ggplot2)   # visualizing data
library(plotly)    # visualizing models in 3D

## Fit the model
# Interpretation: with every 1-unit increase of auction duration (unit = day), the price of
# the game decreases by .51 units in the response variable (total price), keeping startPr
# constant. The predicted value also depends on condition (categorical), which gets its own
# y-intercept.
lm_ext <- lm(totalPr ~ duration + startPr + cond,
             data = marioKart)

## Visualize the model (including predictions)
# To visualize planes in a 3D plotly graph, the dataframe needs to be split by the levels
# of the categorical variable
marioKart_ss_new  <- subset(marioKart, cond == "new")
marioKart_ss_used <- subset(marioKart, cond == "used")

# These vectors hold the explanatory variables for condition = new
duration_new <- as.vector(marioKart_ss_new$duration)
startPr_new  <- as.vector(marioKart_ss_new$startPr)
# These vectors hold the explanatory variables for condition = used
duration_used <- as.vector(marioKart_ss_used$duration)
startPr_used  <- as.vector(marioKart_ss_used$startPr)

# Make two separate models, one per level of the categorical explanatory variable
lm_new  <- lm(totalPr ~ duration + startPr, data = marioKart_ss_new)
lm_used <- lm(totalPr ~ duration + startPr, data = marioKart_ss_used)

# Make two grids with all combinations of the two numerical explanatory variables
grid_new <- marioKart_ss_new %>%
  data_grid(duration = seq_range(duration, by = 1),
            startPr  = seq_range(startPr, by = 1))
grid_used <- marioKart_ss_used %>%
  data_grid(duration = seq_range(duration, by = 1),
            startPr  = seq_range(startPr, by = 1))

# Predictions
pred_new  <- augment(lm_new, newdata = grid_new)
pred_used <- augment(lm_used, newdata = grid_used)

# Matrices of predictions as input for the planes
plane_new  <- matrix(pred_new$.fitted,  nrow = 70, ncol = 70)
plane_used <- matrix(pred_used$.fitted, nrow = 55, ncol = 55)

# 3D plot of the data points
plot <- plot_ly(data = marioKart,
                z = ~totalPr,
                x = ~duration,
                y = ~startPr,
                opacity = 0.6) %>%
  add_markers(color = ~cond)

# Add the planes
plot %>%
  add_surface(x = ~duration_new,   ### NOT WORKING, WAIT FOR DATACAMP
              y = ~startPr_new,
              z = ~plane_new,
              showscale = FALSE) %>%
  add_surface(x = ~duration_used,
              y = ~startPr_used,
              z = ~plane_used,
              showscale = FALSE)
No code wizard here, but I was after the same thing:
library(tidyverse)
library(modelr)
# grid of predictor values: 70 steps per numeric predictor, for each level of cond
grid <- mario_kart %>%
  modelr::data_grid(
    duration = seq_range(duration, n = 70),
    startPr  = seq_range(startPr, n = 70),
    cond
  )
library(broom)
# mod is the fitted model that includes cond (lm_ext in the question above)
tidy_planes <- mod %>%
  augment(newdata = grid)
x <- unique(grid$duration)
y <- unique(grid$startPr)
plane0 <- tidy_planes %>%
  filter(cond == "new") %>%
  pull(.fitted) %>%
  matrix(nrow = length(x), byrow = TRUE)
plane1 <- tidy_planes %>%
  filter(cond == "used") %>%
  pull(.fitted) %>%
  matrix(nrow = length(x), byrow = TRUE)
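A minimal sketch of feeding those matrices back into plotly (untested; it assumes mario_kart, plane0, plane1, x and y from the snippet above):
plot_ly(data = mario_kart, z = ~totalPr, x = ~duration, y = ~startPr, opacity = 0.6) %>%
  add_markers(color = ~cond) %>%
  add_surface(x = ~x, y = ~y, z = ~plane0, showscale = FALSE) %>%
  add_surface(x = ~x, y = ~y, z = ~plane1, showscale = FALSE)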

Get better performance for converting matrix to vector

When working with images, they usually include 3 layers (RGB). In order to do some computation, I need to convert each layer of the image into a vector.
I1 = ones(70,50,3);       % the first image
I2 = 0.4 * ones(70,50,3); % the second image
for dd = 1:3
    ILayer1 = I1(:,:,dd);
    ILayerLinear1 = ILayer1(:);
    ILayer2 = I2(:,:,dd);
    ILayerLinear2 = ILayer2(:);
    comp = ILayerLinear1 * ILayerLinear1.';
end
Here I have replaced the main computation with a very simple one, but that is not the point.
Is there a better way to avoid repeating the matrix-to-vector conversion, or to do it more efficiently? It may happen multiple times throughout the code.
Update:
I can also define a function as follows to pass an image and retrieve a vector, but it still does not improve the code.
function V = I2V(I)
    [r,c,d] = size(I);
    V = zeros(d,r*c);
    for dd = 1:d
        layer = I(:,:,dd);
        V(dd,:) = layer(:);
    end
end
I'm not sure about the outer product, but here's everything else:
I1 = reshape(1:70*50*3, 70,50,3);
I2 = 0.4*reshape(1:70*50*3, 70,50,3);
i1 = reshape(I1, [], 3);
i2 = reshape(I2, [], 3);

Increase performance/speed

I need to take data from 1303 rasters (each raster has data for 1 month) and make a time series for each grid cell in the rasters. In the end I will join all the time series into one massive (zoo) file.
I have code that can do it (I tried it on a small portion of the dataset and it worked), but it seems to be taking forever just to stack the rasters (more than 2 hours now and still counting), and this is not even the slowest part; that will be building the time series. So here is my code. If anyone knows a faster way to stack rasters and/or to create the time series (maybe without the double loop?), please help...
I don't know any other programming language, but would this be just too much to ask of R?
library(raster)
library(zoo)
library(xts)

files <- list.files(pattern=".asc")
pat <- "^.*pet_([0-9]{1,})_([0-9]{1,}).asc$"
ord_files <- as.Date(gsub(pat, sprintf("%s-%s-01", "\\1", "\\2"), files))
files <- files[order(ord_files)]

# using the "raster" package to import data
s <- raster(files[1])
pet <- vector()
for (i in 2:length(files)) {
  r <- raster(files[i])
  s <- stack(s, r)
}

# creating a date vector
beginning <- as.Date("1901-01-01")
full <- seq(beginning, by='1 month', length=length(files))
dat <- as.yearmon(full)

# building the time series
for (lat in 1:360) {
  for (long in 1:720) {
    pet <- as.vector(s[lat,long])
    x <- xts(pet, dat)
    write.zoo(x, file=paste("P:/WRSRL/Users1/ncgk/IBERIA/cru_pet/zoo/","lat",lat,"long",long,".csv", sep=""), sep=",")
  }
}
The first bit could simply be:
s <- stack(files)
The reason why creating a stack is somewhat slow is that each file needs to be opened and checked to see if it has the same nrow, ncol etc. as the other files. If you are absolutely certain that is the case, you can use a shortcut like this (NOT generally recommended)
quickStack <- function(f) {
  r <- raster(f[1])
  ln <- extension(basename(f), '')
  s <- stack(r)
  s@layers <- sapply(1:length(f), function(x){ r@file@name = f[x]; r@layernames = ln[x]; r@data@haveminmax = FALSE; r })
  s@layernames <- ln
  s
}
quickStack(files)
You can probably also speed up the second part as in the below examples, depending on how much RAM you have.
Read row by row:
for (lat in 1:360) {
  pet <- getValues(s, lat, 1)
  for (long in 1:720) {
    x <- xts(pet[long,], dat)
    write.zoo(x, file=paste("P:/WRSRL/Users1/ncgk/IBERIA/cru_pet/zoo/","lat",lat,"long",long,".csv", sep=""), sep=",")
  }
}
More extreme, read all values in one go:
pet <- getValues(s)
for (lat in 1:360) {
  for (long in 1:720) {
    cell <- (lat-1) * 720 + long
    x <- xts(pet[cell,], dat)
    write.zoo(x, file=paste("P:/WRSRL/Users1/ncgk/IBERIA/cru_pet/zoo/","lat",lat,"long",long,".csv", sep=""), sep=",")
  }
}
I will repost my comment here and give a better example:
The general idea: allocate the space for s before the 'raster'-loop is executed. If you concatenate s and r to a new object s inside the loop, R has to allocate new memory for s for each iteration. This is really slow, especially if s is large.
s <- c()
system.time(for(i in 1:1000){ s <- c(s, rnorm(100))})
# user system elapsed
# 0.584 0.244 0.885
s <- rep(NA, 1000*100)
system.time(for(i in seq(1,1000*100,100)){ s[i:(i+99)] <- rnorm(100) })
# user system elapsed
# 0.052 0.000 0.050
As you can see, pre-allocation is around 10 times faster.
Unfortunately I am not familiar with raster and stack, so I cannot tell you how to apply this to your code.
Something like this should work (if you have enough memory):
#using "raster" package to import data
rlist <- lapply(files, raster)
s <- do.call(stack, rlist)
rlist <- NULL # to allow freeing of memory
It loads all raster objects into a big list and then calls stack once.
Here's an example of the speed gains: 1.25 sec vs 8 secs for 60 files - but your old code is quadratic in time so the gains are much higher for more files...
library(raster)
f <- system.file("external/test.grd", package="raster")
files <- rep(f, 60)
system.time({
  rlist <- lapply(files, raster)
  s <- do.call(stack, rlist)
  rlist <- NULL # to allow freeing of memory
}) # 1.25 secs

system.time({
  s <- raster(files[1])
  for (i in 2:length(files)) {
    r <- raster(files[i])
    s <- stack(s, r)
  }
}) # 8 secs
I tried another way of dealing with numerous files.
First I combined the time series of rasters into one file in NetCDF format, using writeRaster(x, format="CDF", ...),
and then just read one file for each year; this time I used brick(netcdffile, varname='') and the reading is much faster.
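For example (a sketch of those two calls; the filename and varname here are made up):
writeRaster(s, filename = "tmin_1901.nc", format = "CDF", varname = "tmin")
b_tmin <- brick("tmin_1901.nc", varname = "tmin")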
However, I need to save each cell's value for all the years in a predefined format, for which I use write.fwf(x=v, ..., append=TRUE),
but it takes a long time for nearly 500,000 points.
Does anyone have the same experience or advice on how to speed up this process?
Here is my code for extracting all the values for each point:
library(raster)  # brick, extract
library(gdata)   # write.fwf

# pl: the points to extract (presumably a SpatialPoints* object, defined elsewhere)
weather4Point <- function(startyear, endyear)
{
  for (year in startyear:endyear)
  {
    # get the combined netCDF files
    tminfile <- paste("tmin", "_", year, ".nc", sep='')
    b_tmin <- brick(tminfile, varname='tmin')
    pptfile <- paste("ppt", "_", year, ".nc", sep='')
    b_ppt <- brick(pptfile, varname='ppt')
    tmaxfile <- paste("tmax", "_", year, ".nc", sep='')
    b_tmax <- brick(tmaxfile, varname='tmax')

    # Get the first year here!!!
    print(paste("processing year :", year, sep=''))

    for (l in 1:length(pl))
    {
      v <- NULL
      # generate a file with the name convention t_n(latitude)w(longitude).txt; 5 digits after the point should work
      filename <- paste("c:/PRISM/MD/N", round(coordinates(pl[l,])[2],5), "W", abs(round(coordinates(pl[l,])[1],5)), ".wth", sep='')
      print(paste("processing file :", filename, sep=''))

      tmin <- as.numeric(round(extract(b_tmin, coordinates(pl[l,])), digits=1))
      tmax <- as.numeric(round(extract(b_tmax, coordinates(pl[l,])), digits=1))
      ppt  <- as.numeric(round(extract(b_ppt, coordinates(pl[l,])), digits=2))

      v <- cbind(tmin, tmax, ppt)
      tablename <- c("tmin", "tmax", "ppt")
      v <- data.frame(v)
      colnames(v) <- tablename
      v["default"] <- 0
      v["year"] <- year
      date <- seq(as.Date(paste(year, "/1/1", sep='')), as.Date(paste(year, "/12/31", sep='')), "days")
      month <- as.numeric(substr(date, 6, 7))
      day <- as.numeric(substr(date, 9, 10))
      v["month"] <- month
      v["day"] <- day
      v <- v[c("year", "month", "day", "default", "tmin", "tmax", "ppt")]

      # write into a file with fixed-width format
      write.fwf(x=v, filename, append=TRUE, na="NA", rownames=FALSE, colnames=FALSE, width=c(6,3,3,5,5,5,6))
    }
  }
}

Make this loop faster in R

How can I speed up the following (noob) code:
#"mymatrix" is the matrix of word counts (docs X terms)
#"tfidfmatrix" is the transformed matrix
tfidfmatrix = Matrix(mymatrix, nrow=num_of_docs, ncol=num_of_words, sparse=T)
#Apply a transformation on each row of the matrix
for(i in 1:dim(mymatrix)[[1]]){
r = mymatrix[i,]
s = sapply(r, function(x) ifelse(x==0, 0, (1+log(x))*log((1+ndocs)/(1+x)) ) )
tfmat[i,] = s/sqrt(sum(s^2))
}
return (tfidfmatrix)
The problem is that the matrices I am working on are fairly large (~40k x 100k), and this code is very slow.
The reason I am not using apply (instead of a for loop and sapply) is that apply would give me the transpose of the matrix I want: I want num_of_docs x num_of_words, but apply will give me the transpose, and I would then have to spend more time computing the transpose and re-allocating it.
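(A tiny standalone illustration of that transpose behaviour, with a made-up 2x3 matrix:)
m <- matrix(1:6, nrow = 2)            # 2 x 3: "docs" x "words"
a <- apply(m, 1, function(r) r * 2)   # apply over rows returns a 3 x 2 result
dim(a)      # 3 2
dim(t(a))   # 2 3 -- an extra transpose is needed to get back to docs x words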
Any thoughts on making this faster?
Thanks much.
Edit: I have found that the suggestions below greatly speed up my code (besides making me feel stupid). Any suggestions on where I can learn to write "optimized" R code?
Edit 2: OK, so something is not right. Once I do s.vec[!is.finite(s.vec)] <- 0, every element of s.vec is set to 0. Just to reiterate, my original matrix is a sparse matrix containing integers. This seems to be due to some quirk of the Matrix package I am using. When I do s.vec[which(s.vec==-Inf)] <- 0, things work as expected. Thoughts?
As per my comment,
#Slightly larger example data
mymatrix <- matrix(runif(10000),nrow=10)
mymatrix[sample(10000,100)] <- 0
tfmat <- matrix(nrow=10, ncol=1000)
ndocs <- 1
justin <- function(){
s.vec <- ifelse(mymatrix==0, 0, (1 + log(mymatrix)) * log((1 + ndocs)/(1 + mymatrix)))
tfmat.vec <- s.vec/sqrt(rowSums(s.vec^2))
}
joran <- function(){
s.vec <- (1 + log(mymatrix)) * log((1 + ndocs)/(1 + mymatrix))
s.vec[!is.finite(s.vec)] <- 0
tfmat.vec <- s.vec/sqrt(rowSums(s.vec^2))
}
require(rbenchmark)
benchmark(justin(),joran(),replications = 1000)
test replications elapsed relative user.self sys.self user.child sys.child
2 joran() 1000 0.940 1.00000 0.842 0.105 0 0
1 justin() 1000 2.786 2.96383 2.617 0.187 0 0
So it's around 3x faster.
Not sure what ndocs is, but ifelse is already vectorized, so you should be able to use the ifelse statement without walking through the matrix row by row and sapply-ing along each row. The same can be said for the final calculation.
However, you haven't given a complete example to replicate...
mymatrix <- matrix(runif(100),nrow=10)
tfmat <- matrix(nrow=10, ncol=10)
ndocs <- 1
s.vec <- ifelse(mymatrix==0, 0, 1 + log(mymatrix)) * log((1 + ndocs)/(1 + mymatrix))
for (i in 1:dim(mymatrix)[[1]]) {
  r = mymatrix[i,]
  s = sapply(r, function(x) ifelse(x==0, 0, (1+log(x))*log((1+ndocs)/(1+x))))
  tfmat[i,] <- s
}
all.equal(s.vec, tfmat)
So the only piece missing is the rowSums in your final calc.
tfmat.vec <- s.vec/sqrt(rowSums(s.vec^2))
for (i in 1:dim(mymatrix)[[1]]) {
  r = mymatrix[i,]
  s = sapply(r, function(x) ifelse(x==0, 0, (1+log(x))*log((1+ndocs)/(1+x))))
  tfmat[i,] = s/sqrt(sum(s^2))
}
all.equal(tfmat, tfmat.vec)

Applying nlminb to subsets of data (by index or label) and store what the program returns as a new data frame

I was wondering if anyone could kindly help me with this seemingly easy task. I'm using nlminb to conduct optimization and compute some statistics by index. Here's an example from the nlminb help page.
> x <- rnbinom(100, mu = 10, size = 10)
> hdev <- function(par) {
+ -sum(dnbinom(x, mu = par[1], size = par[2], log = TRUE))
+ }
> nlminb(c(9, 12), hdev)
$par
[1] 9.730000 5.954936
$objective
[1] 297.2074
$convergence
[1] 0
$message
[1] "relative convergence (4)"
$iterations
[1] 10
$evaluations
function gradient
12 27
Suppose I generate random variables x, y, and z where z acts as an index (from 1 to 3).
> x <- rnbinom(100, mu = 10, size = 10)
> y <- rnbinom(100, mu = 10, size = 10)
> z <- rep(1:3, length=100)
> A <- cbind(x,y,z)
> hdev <- function(par) {
+ -sum(dnbinom(x+y, mu = par[1], size = par[2], log = TRUE))}
How can I apply nlminb(c(9, 12), hdev) to the data set by index z? In other words, I would like to compute nlminb(c(9, 12), hdev) for z=1, z=2, and z=3 separately. I tried by(A, z, function(A) nlminb(c(9,12), hdev)) and sparseby(A, z, function(A) nlminb(c(9,12), hdev)), but they return exactly the same values for each value of z.
I would like to turn each output into a new data frame so that it becomes a 3x2 matrix:
[1] Z1_ANSWER_1 Z1_ANSWER_2
[2] Z2_ANSWER_1 Z2_ANSWER_2
[3] Z3_ANSWER_1 Z3_ANSWER_2
Since nlminb returns a list of results, I needed to use CASEZ1 <- nlminb$par, CASEZ2 <- nlminb$par, CASEZ3 <- nlminb$par and then cbind to combine them. However, I would like to automate this process, as the real data I'm working on has many more categories than the z presented here.
If I'm not making myself clear, please let me know. I'll see if I can replicate the actual data set and functions I'm working on (I just don't have them on this computer).
Thank you very much in advance.
Let me try an approach:
x <- rnbinom(100, mu = 10, size = 10)
y <- rnbinom(100, mu = 10, size = 10)
z <- rep(1:3, length=100)
A <- as.data.frame(cbind(x,y,z))
First, load the plyr library:
library(plyr)
The following code returns the results for each z. (Note that hdev now takes the data as an explicit mydata argument; in your by() attempt, hdev kept using the global x and y, which is why every group returned exactly the same values.)
dlply(A, .(z), function(x) {
  hdev <- function(par, mydata) {
    -sum(dnbinom(mydata, mu = par[1], size = par[2], log = TRUE))
  }
  nlminb(c(9, 12), hdev, mydata = t(as.vector(x[1] + as.vector(x[2]))))
})
Now, with this one you will get a 3x2 data frame with the $par results:
ddply(A, .(z), function(x) {
  hdev <- function(par, mydata) {
    -sum(dnbinom(mydata, mu = par[1], size = par[2], log = TRUE))
  }
  res <- nlminb(c(9, 12), hdev, mydata = t(as.vector(x[1] + as.vector(x[2]))))
  return(res$par)
})
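For reference, the same idea works in base R as well, as long as the objective takes the subset explicitly instead of closing over the global x and y (untested sketch, using the data frame A defined above):
hdev2 <- function(par, mydata) {
  -sum(dnbinom(mydata, mu = par[1], size = par[2], log = TRUE))
}
res <- by(A, A$z, function(d) nlminb(c(9, 12), hdev2, mydata = d$x + d$y)$par)
do.call(rbind, res)   # one row of (mu, size) estimates per level of z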
