Deepwater threw java.lang.ArrayIndexOutOfBoundsException during training if balance_classes=TRUE - h2o

In AWS, I followed the instructions here and launched a g2.2xlarge EC2 instance using the community AMI ami-97591381.
On the Docker image, I can run a simple Deep Water tutorial without a problem. However, when I tried to train a Deep Water model using my own data (which worked fine with a non-GPU deeplearning model), h2o gave me this exception:
java.lang.RuntimeException: java.lang.ArrayIndexOutOfBoundsException: 0 <= 186393 < 170807
at water.Futures.blockForPending(Futures.java:88)
at hex.deepwater.DeepWaterDatasetIterator.Next(DeepWaterDatasetIterator.java:99)
at hex.deepwater.DeepWaterTask.setupLocal(DeepWaterTask.java:168)
at water.MRTask.setupLocal0(MRTask.java:550)
at water.MRTask.dfork(MRTask.java:456)
at water.MRTask.doAll(MRTask.java:389)
at water.MRTask.doAll(MRTask.java:385)
at hex.deepwater.DeepWater$DeepWaterDriver.trainModel(DeepWater.java:345)
at hex.deepwater.DeepWater$DeepWaterDriver.buildModel(DeepWater.java:205)
at hex.deepwater.DeepWater$DeepWaterDriver.computeImpl(DeepWater.java:118)
at hex.ModelBuilder$Driver.compute2(ModelBuilder.java:173)
at hex.deepwater.DeepWater$DeepWaterDriver.compute2(DeepWater.java:111)
at water.H2O$H2OCountedCompleter.compute(H2O.java:1256)
at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 0 <= 186393 < 170807
at water.fvec.Vec.elem2ChunkIdx(Vec.java:925)
at water.fvec.Vec.chunkForRow(Vec.java:1063)
at hex.deepwater.DeepWaterDatasetIterator$FrameDataConverter.compute2(DeepWaterDatasetIterator.java:76)
... 6 more
This is my code, which you can run as I made the S3 links public:
library(h2o)
library(jsonlite)
library(curl)
h2o.init()
df.truth <- h2o.importFile("https://s3.amazonaws.com/nw.data.test.us.east/df.truth.zeroed", header = T, sep=",")
df.truth$isFemale <- h2o.asfactor(df.truth$isFemale)
hotnames.truth <- fromJSON("https://s3.amazonaws.com/nw.data.test.us.east/hotnames.json", simplifyVector = T)
# Training and validation sets
splits <- h2o.splitFrame(df.truth, c(0.9), seed=1234)
train.truth <- h2o.assign(splits[[1]], "train.truth.hex")
valid.truth <- h2o.assign(splits[[2]], "valid.truth.hex")
dl.2.balanced <- h2o.deepwater(
  training_frame = train.truth, model_id = "dl.2.balanced",
  x = setdiff(hotnames.truth[1:(length(hotnames.truth) / 2)],
              c("isFemale", "nwtcs")),
  y = "isFemale", stopping_metric = "AUTO", seed = 1000000,
  sparse = FALSE,
  balance_classes = TRUE,
  mini_batch_size = 20)
The h2o version is 3.13.0.356.
Update:
I think I found the h2o bug: if I set balance_classes to FALSE, it runs without crashing.
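For anyone hitting the same crash, the workaround is the identical call with only balance_classes flipped (the model_id below is my own label, not from the original run):
dl.2.unbalanced <- h2o.deepwater(
  training_frame = train.truth, model_id = "dl.2.unbalanced",
  x = setdiff(hotnames.truth[1:(length(hotnames.truth) / 2)],
              c("isFemale", "nwtcs")),
  y = "isFemale", stopping_metric = "AUTO", seed = 1000000,
  sparse = FALSE,
  balance_classes = FALSE,  # avoids the ArrayIndexOutOfBoundsException
  mini_batch_size = 20)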

Please note that Deep Water is a legacy project (as of December 2017), which means that it is no longer under active development. The H2O.ai team has no current plans to add new features, however, contributions from the community (in the form of pull requests) are welcome.


Error message: Cube::init(): requested size is too large; suggest to enable ARMA_64BIT_WORD using sommer package for GWAS

Output from sessionInfo()
R version 4.0.5 (2021-03-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] sommer_4.1.4 crayon_1.4.1 lattice_0.20-41 MASS_7.3-53.1 Matrix_1.3-2 data.table_1.14.0
loaded via a namespace (and not attached):
[1] compiler_4.0.5 tools_4.0.5 rstudioapi_0.13 Rcpp_1.0.6 grid_4.0.5
I have been trying to carry out a GWAS using the sommer package with the following code:
var_cov <- A.mat(m_matrix)  ## additive relationship matrix
model <- GWAS(cbind(DW20, PLA07, PLA08, PLA09, PLA10, PLA11, PLA12, PLA13,
                    PLA14, PLA15, PLA16, PLA17, PLA18, RGR07_09, RGR08_10,
                    RGR09_11, RGR10_12, RGR11_13, RGR12_14, RGR13_15,
                    RGR14_16, RGR15_17, RGR16_18, SA, SL, SW) ~ 1,
              random = ~ vs(accession, Gu = var_cov), data = pheno2,
              M = m_matrix, gTerm = "u:accession", n.PC = 5)
As described in the code, I have 26 traits and I would like to use the K+P model. My SNP matrix has 211,260 markers and 309 accessions.
When I run this code for one or two traits, it works fine. But when I try to run it with all 26 traits, I get the error message:
Error in GWAS(cbind(DW20, PLA07, PLA08, PLA09, PLA10, PLA11, PLA12, PLA13, :
Cube::init(): requested size is too large; suggest to enable ARMA_64BIT_WORD
I searched online and found that this error is related to the package RcppArmadillo.
Following the suggestions here (http://arma.sourceforge.net/docs.html#config_hpp_arma_64bit_word) and here (Large Matrices in RcppArmadillo via the ARMA_64BIT_WORD define), I tried to enable ARMA_64BIT_WORD by uncommenting the line #define ARMA_64BIT_WORD (below) in the file RcppArmadillo\include\armadillo_bits\config.hpp:
#if !defined(ARMA_64BIT_WORD)
//#define ARMA_64BIT_WORD
//// Uncomment the above line if you require matrices/vectors capable of holding more than 4 billion elements.
//// Note that ARMA_64BIT_WORD is automatically enabled when std::size_t has 64 bits and ARMA_32BIT_WORD is not defined.
#endif
and also by adding the following line to the file Makevars.win in RcppArmadillo\skeleton:
PKG_CPPFLAGS = -DARMA_64BIT_WORD=1
Neither suggestion worked; I keep getting the same error message. My questions are: is there another way to enable ARMA_64BIT_WORD that I am missing? Is it possible to run the GWAS function in the sommer package with as many as 26 traits, or is that too many? Could the error message result from a mistake in the GWAS code?
Thank you very much in advance.
My first take, Ana, is that you're trying to fit an unstructured multivariate model with 26 traits when you use cbind(). With your 309 accessions, that means a model of 309 x 26 = 8,034 records, which is a bit too big for the direct-inversion algorithm that sommer uses; on top of that, the number of parameters to estimate is large (think of all the covariance parameters: (26*25)/2 = 325). I would suggest fitting a GWAS per trait in a for loop to solve your issue. Unless you have a good justification to run a multivariate GWAS, this is the issue with your analysis rather than the C++ code behind it. For example:
var_cov <- A.mat(m_matrix)  ## additive relationship matrix
traits <- c("DW20", "PLA07", "PLA08", "PLA09", "PLA10", "PLA11", "PLA12",
            "PLA13", "PLA14", "PLA15", "PLA16", "PLA17", "PLA18",
            "RGR07_09", "RGR08_10", "RGR09_11", "RGR10_12", "RGR11_13",
            "RGR12_14", "RGR13_15", "RGR14_16", "RGR15_17", "RGR16_18",
            "SA", "SL", "SW")
models <- list()
for (itrait in traits) {  # one single-trait GWAS per iteration
  models[[itrait]] <- GWAS(as.formula(paste(itrait, "~ 1")),
                           random = ~ vs(accession, Gu = var_cov),
                           data = pheno2, M = m_matrix,
                           gTerm = "u:accession", n.PC = 5)
}
If it turns out that even with a single trait the arma::cube allocation presents memory issues, then we definitely need to look at why the armadillo library cannot deal with those dimensions.
Cheers,
Eduardo
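A follow-up note on the original ARMA_64BIT_WORD attempt: the define is only read at compile time, so editing RcppArmadillo's headers or its skeleton Makevars.win has no effect on an already-installed sommer; the package itself must be rebuilt from source with the flag visible to the compiler. A sketch, assuming the usual user-level Makevars.win location on Windows (whether the flag propagates cleanly depends on sommer's own Makevars, so treat this as a starting point, not a guaranteed fix):
# write the flag where R's build system picks it up, then rebuild
# sommer from source so its C++ code is recompiled with 64-bit words
dir.create("~/.R", showWarnings = FALSE)
writeLines("PKG_CPPFLAGS = -DARMA_64BIT_WORD=1", "~/.R/Makevars.win")
install.packages("sommer", type = "source")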

VIF values in R

I have a question: has anyone run the corvif function from the code HighstatLibV10.R, available on the page http://www.highstat.com/index.php/mixed-effects-models-and-extensions-in-ecology-with-r? I can't get the VIF values because the output gives me this error:
Error in myvif(lm_mod) : object 'tmp_cor' not found!
I have 6 physical variables and I'm looking for collinearity among them. Any help is more than welcome!
If working with corvif() is not of utmost importance, you can use vif() from the R package 'car' to get VIF values for your linear models.
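A minimal sketch of that route (the model formula and variable names are placeholders for the six physical variables, not from the question):
library(car)
lm_mod <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6, data = mydata)
vif(lm_mod)  # one VIF per predictor; values above ~10 are a common red flag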
So tmp_cor is an object that is supposed to be created in corvif.
tmp_cor is created using the cor function (in the base stats package that ships with R) via: tmp_cor <- cor(dataz, use = "complete.obs").
However, I noticed that this error occurs with both v1 and v10 of Zuur et al.'s HighstatLib.R code:
Error in myvif(lm_mod) : object 'tmp_cor' not found!
First I checked V10:
It seems that the "final" version of corvif created when sourcing HighstatLibV10.R actually neglects to create tmp_cor at all!
> print(corvif)
function(dataz) {
  dataz <- as.data.frame(dataz)
  # vif part
  form <- formula(paste("fooy ~ ", paste(strsplit(names(dataz), " "), collapse = " + ")))
  dataz <- data.frame(fooy = 1 + rnorm(nrow(dataz)), dataz)
  lm_mod <- lm(form, dataz)
  cat("\n\nVariance inflation factors\n\n")
  print(myvif(lm_mod))
}
But I noticed that the error in the OP's post also occurred when using V1 (i.e., the HighstatLib.R associated with Zuur et al. 2010). Although that code file creates 2 versions of corvif, they (and especially the latter of the two, which would supersede the first) include a line to create tmp_cor:
corvif <- function(dataz) {
  dataz <- as.data.frame(dataz)
  # correlation part
  cat("Correlations of the variables\n\n")
  tmp_cor <- cor(dataz, use = "complete.obs")
  print(tmp_cor)
  # vif part
  form <- formula(paste("fooy ~ ", paste(strsplit(names(dataz), " "), collapse = " + ")))
  dataz <- data.frame(fooy = 1, dataz)
  lm_mod <- lm(form, dataz)
  cat("\n\nVariance inflation factors\n\n")
  print(myvif(lm_mod))
}
So even though the code for corvif creates tmp_cor in the V1 code file, it appears that the helper function myvif (which actually uses the tmp_cor object) is not accessing it.
This suggests that we have a scoping problem...
Sure enough, if I just quickly change the tmp_cor line to create a global object, the code works fine:
tmp_cor <<- cor(dataz,use="complete.obs")
Specifically:
corvif <- function(dataz) {
  dataz <- as.data.frame(dataz)
  # correlation part
  cat("Correlations of the variables\n\n")
  tmp_cor <<- cor(dataz, use = "complete.obs")
  print(tmp_cor)
  # vif part
  form <- formula(paste("fooy ~ ", paste(strsplit(names(dataz), " "), collapse = " + ")))
  dataz <- data.frame(fooy = 1, dataz)
  lm_mod <- lm(form, dataz)
  cat("\n\nVariance inflation factors\n\n")
  print(myvif(lm_mod))
}
A more complete "fix" could be done by manipulating environments, as in the sketch below.
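For instance, a minimal sketch of that idea (assuming myvif is defined as in HighstatLibV10.R): give myvif an enclosing environment where tmp_cor actually exists, instead of writing to the global workspace.
corvif <- function(dataz) {
  dataz <- as.data.frame(dataz)
  # correlation part
  cat("Correlations of the variables\n\n")
  tmp_cor <- cor(dataz, use = "complete.obs")
  print(tmp_cor)
  # vif part
  form <- formula(paste("fooy ~ ", paste(strsplit(names(dataz), " "), collapse = " + ")))
  dataz <- data.frame(fooy = 1, dataz)
  lm_mod <- lm(form, dataz)
  # rebind a local copy of myvif whose enclosure is this call's frame,
  # so its free variable tmp_cor resolves here rather than in .GlobalEnv
  environment(myvif) <- environment()
  cat("\n\nVariance inflation factors\n\n")
  print(myvif(lm_mod))
}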

SparkR dapply not working

I'm trying to call lapply within a function applied to a Spark data frame. According to the documentation, this is possible since Spark 2.0.
wrapper <- function(df) {
  out <- df
  out$len <- unlist(lapply(df$value, function(y) length(y)))
  return(out)
}
# dd is Spark Data Frame with one column (value) of type raw
dapplyCollect(dd, wrapper)
It returns error:
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...): org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 37.0 failed 1 times, most recent failure: Lost task 0.0 in stage 37.0 (TID 37, localhost): org.apache.spark.SparkException: R computation failed with
Error in (function (..., deparse.level = 1, make.row.names = TRUE) :
incompatible types (from raw to logical) in subassignment type fix
The following works fine:
wrapper(collect(dd))
But we want the computation to run on the nodes (not on the driver).
What could be the problem? There is a related question, but it does not help.
Thanks.
You need to add a schema: it can only be defaulted when the columns of the output have the same mode as the columns of the input.
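A sketch of that route, using dapply() with an explicit output schema instead of dapplyCollect() (the schema below assumes dd has a single raw/binary column named value, as in the question; the exact types may need adjusting to your data):
library(SparkR)

# output columns: the original binary column plus the new integer length
schema <- structType(structField("value", "binary"),
                     structField("len", "integer"))

wrapper <- function(df) {
  df$len <- unlist(lapply(df$value, length))
  df
}

result <- dapply(dd, wrapper, schema)
head(collect(result))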

Clustering of Items in a UI-Matrix with package skmeans

I've got a specific question about the package "skmeans".
I am working with the MovieLens 100k dataset, whose "u.data" file has four columns in the following order: "User", "Item", "Rating", and "Timestamp". I've implemented the following code:
library(Matrix)  # for sparseMatrix()
UI_ratings_raw <- scan(file = "u1.base",
                       what = list(user = 0, movie = 0, rating = 0),
                       flush = TRUE)
# entry from a GitHub forum post; see the R file Matrixreduzierung
UI_ratings_sparse <- sparseMatrix(UI_ratings_raw$user, UI_ratings_raw$movie,
                                  x = UI_ratings_raw$rating,
                                  dims = c(943, 1682))
UI_ratings_sparse_dgT <- as(UI_ratings_sparse, "dgTMatrix")
install.packages("skmeans")
library(skmeans)
install.packages("cluster")
library(cluster)
UI_ratings_sparse_clust_sk <- skmeans(UI_ratings_sparse_dgT, 20,
                                      control = list(verbose = TRUE))
summary(silhouette(UI_ratings_sparse_clust_sk))
Clustering performed very well, but only on the user side. Is there any way to change the code so that I can compute clusters for the items?
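Not from the original thread, but one way to get there: skmeans() clusters the rows of its input matrix, so transposing the user-item matrix puts the 1,682 items in the rows and clusters them instead of the 943 users. A sketch (items with no ratings, i.e. all-zero rows, may need to be dropped first, since spherical k-means cannot normalise a zero row):
# transpose so that items become rows, then cluster as before
IU_ratings_sparse_dgT <- as(t(UI_ratings_sparse), "dgTMatrix")
item_clust_sk <- skmeans(IU_ratings_sparse_dgT, 20,
                         control = list(verbose = TRUE))
summary(silhouette(item_clust_sk))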

error in running auto.arima function

I am downloading a stock's daily close data using the quantmod package:
library(quantmod)
library(dygraphs)
library(forecast)
date <- as.Date("2014-11-01")
getSymbols("SBIN.BO",from = date )
close <- SBIN.BO[, 4]
dygraph(close)
dat <- data.frame(date = index(SBIN.BO),SBIN.BO)
acf1 <- acf(close)
When I tried to execute the auto.arima function from the forecast package:
fit <- auto.arima(close, seasonal=FALSE, xreg=fourier(close, K=4))
I encountered the following error:
Error in ...fourier(x, K, 1:length(x)) :
K must be not be greater than period/2
So I want to know why this error occurs. Did I make a mistake in the code, which is based on tutorials available on Rob's website/blog?
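One hedged observation, since no answer was recorded here: fourier() derives the period from frequency(x), and a daily xts of closes has frequency 1, so the check K <= period/2 can never pass with K = 4. If a 5-day trading week is a reasonable seasonality assumption for this series (an assumption, not something given in the question), converting to a ts with that frequency makes K up to 2 legal:
# assume a 5-day trading week; the frequency here is an assumption
close_ts <- ts(as.numeric(close), frequency = 5)
fit <- auto.arima(close_ts, seasonal = FALSE,
                  xreg = fourier(close_ts, K = 2))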
