LGOCV caret train - validation

Why does the initial part of the code below run, but when I try to run the later part I get an error? I am learning data mining from a tutorial page and am trying to understand how to perform cross-validation using the LGOCV option.
library(mlbench)
data(Sonar)
str(Sonar)

library(caret)
set.seed(998)
inTraining <- createDataPartition(Sonar$Class, p = 0.75, list = FALSE)
training <- Sonar[inTraining, ]
testing  <- Sonar[-inTraining, ]

fitControl <- trainControl(## 10-fold CV
                           method = "repeatedcv",
                           number = 10,
                           ## repeated ten times
                           repeats = 10)

gbmGrid <- expand.grid(.interaction.depth = c(1, 5, 9),
                       .n.trees = (1:15) * 100,
                       .shrinkage = 0.1)

fitControl <- trainControl(method = "repeatedcv",
                           number = 10,
                           repeats = 10,
                           ## Estimate class probabilities
                           classProbs = TRUE,
                           ## Evaluate performance using
                           ## the following function
                           summaryFunction = twoClassSummary)

set.seed(825)
gbmFit3 <- train(Class ~ ., data = training,
                 method = "gbm",
                 trControl = fitControl,
                 verbose = FALSE,
                 tuneGrid = gbmGrid,
                 ## Specify which metric to optimize
                 metric = "ROC")
gbmFit3
The code above runs fine, but I get the error below when I run this part:
datarow <- 1:nrow(training)
fitControl <- trainControl(method = "LGOCV",
                           summaryFunction = twoClassSummary,
                           classProbs = TRUE,
                           index = list(TrainSet = datarow),
                           savePredictions = TRUE)
gbmFit4 <- train(Class ~ ., data = training,
                 method = "gbm",
                 trControl = fitControl,
                 verbose = FALSE,
                 tuneGrid = gbmGrid,
                 ## Specify which metric to optimize
                 metric = "ROC")
The error is:
Error in { :
task 1 failed - "arguments imply differing number of rows: 0, 1"
In addition: Warning messages:
1: In eval(expr, envir, enclos) :
predictions failed for TrainSet: interaction.depth=1, shrinkage=0.1, n.trees=1500 Error in 1:ncol(tmp) : argument of length 0
2: In eval(expr, envir, enclos) :
predictions failed for TrainSet: interaction.depth=5, shrinkage=0.1, n.trees=1500 Error in 1:ncol(tmp) : argument of length 0
3: In eval(expr, envir, enclos) :
predictions failed for TrainSet: interaction.depth=9, shrinkage=0.1, n.trees=150
session info:
sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] parallel splines stats graphics grDevices utils datasets methods base
other attached packages:
[1] gbm_2.1 survival_2.37-4 mlbench_2.1-1 pROC_1.5.4 caret_5.17-7 reshape2_1.2.2
[7] plyr_1.8 lattice_0.20-15 foreach_1.4.1 cluster_1.14.4
loaded via a namespace (and not attached):
[1] codetools_0.2-8 compiler_3.0.1 grid_3.0.1 iterators_1.0.6 stringr_0.6.2 tools_3.0.1

You also posted the same question on Cross Validated. We normally say to make very sure that you are not in error before looking for help and then contacting the package author.
The problem is your use of datarow <- 1:nrow(training). You are tuning the model on all of the instances and leaving nothing with which to compute the hold-out estimates.
I'm not really sure what you are trying to do.
Max
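
For reference, here is a minimal sketch (not from the original answer) of two ways to set up LGOCV so that rows remain for the hold-out estimate; the proportion and seed values are illustrative assumptions:

## Option 1: let caret generate the leave-group-out splits itself
fitControl <- trainControl(method = "LGOCV",
                           p = 0.75,        # fraction of rows used for training in each split
                           number = 10,     # number of random splits
                           summaryFunction = twoClassSummary,
                           classProbs = TRUE,
                           savePredictions = TRUE)

## Option 2: supply an index list that does NOT cover every row,
## so the remaining rows form the hold-out set
set.seed(123)
trainRows <- createDataPartition(training$Class, p = 0.75, list = FALSE)
fitControl <- trainControl(method = "LGOCV",
                           index = list(TrainSet = as.vector(trainRows)),
                           summaryFunction = twoClassSummary,
                           classProbs = TRUE,
                           savePredictions = TRUE)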

Related

fable package: Could not find an appropriate ARIMA model

I am trying to fit an ARIMA model to hourly data. First, I tried the fable package, and the ARIMA() function could not find an appropriate model. Second, I used the forecast package with the auto.arima function, which worked perfectly. I have one example series (available here: https://gist.github.com/mizhozan/800fec80682822969e7d35ebba395) and the results as an example here:
data.arima <- read.csv('test.csv', header = TRUE)[, -1]

## fable package
data.arima$Date <- lubridate::ymd_hms(data.arima$Date, truncated = 2)
library(tidyverse)
library(fable)
result.arima <- data.arima %>%
  as_tsibble(., index = Date) %>%
  model(ARIMA(value ~ PDQ() + pdq() +
                fourier(period = "day", K = 3) +
                fourier(period = "week", K = 2), seasonal.test = "ocsb")) %>%
  forecast(h = 24)
Warning message:
1 error encountered for ARIMA(value ~ PDQ() + pdq() + fourier(period = "day", K = 3) +
fourier(period = "week", K = 2), seasonal.test = "ocsb")
[1] Could not find an appropriate ARIMA model.
This is likely because automatic selection does not select models with characteristic roots that may be numerically unstable.
For more details, refer to https://otexts.com/fpp3/arima-r.html#plotting-the-characteristic-roots
## forecast package
library(forecast)
series.arima <- msts(data.arima$value, seasonal.periods = c(24, 24*7))
model.arima <- auto.arima(series.arima, seasonal.test = "ocsb", xreg=fourier(series.arima,K=c(3,2)))
Series: series.arima
Regression with ARIMA(4,0,1) errors
Coefficients:
ar1 ar2 ar3 ar4 ma1 intercept S1-24 C1-24 S2-24 C2-24 S3-24 C3-24 S1-168 C1-168 S2-168 C2-168
1.9064 -1.4934 0.8292 -0.3056 -0.8728 664263.21 -310891.13 -349744.23 -133862.32 -20587.2 69313.88 51963.803 43880.66 1524.578 -3823.166 5642.26
s.e. 0.0755 0.1192 0.1085 0.0521 0.0605 7781.72 20778.06 20591.69 11662.66 11606.0 8792.99 8768.856 11342.32 11669.244 12819.074 13091.08
sigma^2 estimated as 5.122e+09: log likelihood=-4225.19
AIC=8484.38 AICc=8486.31 BIC=8549.28
result.arima.2 <- forecast(model.arima, xreg=fourier(series.arima, K = c(3,2), h = 24))
I would appreciate it if someone could explain the problem here.
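
One thing that may be worth trying (an assumption on my part, not a confirmed fix): when seasonality is handled through fourier() regressors, the seasonal ARIMA component is usually fixed at PDQ(0, 0, 0), which narrows the search space for automatic selection:

library(tsibble)
library(fable)

## Sketch: seasonal orders fixed to zero because fourier() terms model the seasonality
result.arima <- data.arima %>%
  as_tsibble(index = Date) %>%
  model(ARIMA(value ~ PDQ(0, 0, 0) + pdq() +
                fourier(period = "day", K = 3) +
                fourier(period = "week", K = 2))) %>%
  forecast(h = 24)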

Error in (function (classes, fdef, mtable) unable to find an inherited method for function ‘krige’ for signature ‘"formula", "tbl_df"’

I have a strange error and don't know how to solve it, even after checking other posts. Everything runs until the kriging, and then I receive the error: Error in (function (classes, fdef, mtable) unable to find an inherited method for function ‘krige’ for signature ‘"formula", "tbl_df"’
The strange thing is that everything worked a few days ago; I did not change anything in the code, and now it doesn't run anymore. Some other posts related the problem to the raster, but I could not find any discrepancies. Is it something to do with recent updates? I use, for example, the sp package.
Unfortunately I cannot provide the data I use; hopefully the issue can be solved without it.
How can I solve the issue? Thank you in advance for the help.
homeDir = "D:/Folder/DataXYyear/"
y = 1992
Source = paste("Year", y, ".csv")
File = file.path(homeDir, Source)

GWMeas <- read_csv(File)
GWMeasX <- na.omit(GWMeas)

ggplot(
  data = GWMeasX,
  mapping = aes(x = X, y = Y, color = level)
) +
  geom_point(size = 3) +
  scale_color_viridis(option = "B") +
  theme_classic()

GWMX_sf <- st_as_sf(GWMeasX, coords = c("X", "Y"), crs = 25832) %>%
  cbind(st_coordinates(.))

v_emp_OK <- gstat::variogram(
  level ~ 1,
  as(GWMX_sf, "Spatial") # switch from {sf} to {sp}
)
v_mod_OK <- automap::autofitVariogram(level ~ 1, as(GWMX_sf, "Spatial"), model = "Sph")$var_model

GWMeasX %>% as.data.frame %>% glimpse

GW.vgm <- variogram(level ~ 1, locations = ~X + Y, data = GWMeasX) # calculates sample variogram values
GW.fit <- fit.variogram(GW.vgm, model = vgm(model = "Gau"))        # fit model

sf_GWlevel <- st_as_sf(GWMeasX, coords = c("X", "Y"), crs = 25833)

grd_sf <- sf_GWlevel %>%
  st_bbox() %>%
  st_as_sfc() %>%
  st_make_grid(
    cellsize = c(5000, 5000), # 5000 m pixel size
    what = "centers"
  ) %>%
  st_as_sf() %>%
  cbind(., st_coordinates(.))

grid <- as(grd_sf, "Spatial")
gridded(grid) <- TRUE
grid <- as(grid, "SpatialPixels")

createGrid <- function(XY.Spacing)

crs(grid) <- crs(GWMX_sf)

OK3 <- krige(formula = level ~ 1,  # variable to interpolate
             data = GWMX_sf,       # gauge data
             newdata = grid,       # grid to interpolate on
             model = v_mod_OK,     # variogram model to use
             nmin = 4,             # minimum number of points to use for the interpolation
             nmax = 20,            # maximum number of points to use for the interpolation
             maxdist = 120e3       # maximum distance of points to use for the interpolation
)
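
A hedged sketch of one possible direction (an assumption based on the error signature, not a verified fix): the signature ‘"formula", "tbl_df"’ suggests krige() is dispatching on a plain tibble rather than an sf or Spatial object, so re-creating the spatial object right before kriging and passing it as the locations argument may help:

library(sf)
library(gstat)

GWMX_sf <- st_as_sf(GWMeasX, coords = c("X", "Y"), crs = 25832)   # spatial point data
GWMX_sp <- as(GWMX_sf, "Spatial")                                 # sp fallback for older gstat

OK3 <- krige(formula = level ~ 1,   # variable to interpolate
             locations = GWMX_sp,   # gauge data as a Spatial object
             newdata = grid,        # SpatialPixels grid to interpolate on
             model = v_mod_OK,      # fitted variogram model
             nmin = 4, nmax = 20, maxdist = 120e3)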

How to distribute and fit h2o model by group in sparklyr

I would like to fit a model by group in h2o using some type of distributed apply function.
I tried the following, but it doesn't work, probably because I cannot pipe the sc object through.
df %>%
  spark_apply(function(e)
                h2o.coxph(x = predictors,
                          event_column = "event",
                          stop_column = "time_to_next",
                          training_frame = as_h2o_frame(sc, e, strict_version_check = FALSE)),
              group_by = "id"
  )
I receive a pretty generic spark error like this:
error : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 23.0 failed 4 times, most recent failure: Lost task 0.3 in stage 23.0 :
I'm not sure you can return an entire H2OCoxPHModel from sparklyr::spark_apply(): the errors are "no method for coercing this S4 class to a vector" if you set the fetch_result_as_sdf argument to FALSE, and "cannot coerce class ‘structure("H2OCoxPHModel", package = "h2o")’ to a data.frame" if you set it to TRUE.
But if you can make your own vector or data frame from the relevant parts of the model, I think you can do it.
Here I'll use the sample Cox proportional hazards file from the H2O docs (Cox Proportional Hazards (CoxPH)) and I'll use group_by = "surgery".
heart_hf <- h2o::h2o.importFile("http://s3.amazonaws.com/h2o-public-test-data/smalldata/coxph_test/heart.csv")

##### Convert to Spark DataFrame since I assume that is the use case
heart_sf <- sparklyr::copy_to(sc, heart_hf %>% as.data.frame())

##### Use sparklyr::spark_apply() on Spark DataFrame to "distribute and fit h2o model by group"
sparklyr::spark_apply(
  x = heart_sf,
  f = function(x) {
    h2o::h2o.init()
    heart_coxph <- h2o::h2o.coxph(x = c("age", "year"),
                                  event_column = "event",
                                  start_column = "start",
                                  stop_column = "stop",
                                  ties = "breslow",
                                  training_frame = h2o::as.h2o(x, strict_version_check = FALSE))
    return(data.frame(conc = heart_coxph@model$model_summary$concordance))
  },
  columns = list(surgery = "integer", conc = "numeric"),
  group_by = c("surgery"))
# Source: spark<?> [?? x 2]
surgery conc
<int> <dbl>
1 1 0.588
2 0 0.614

How to do parallel processing in pytorch

I am working on a deep learning problem and solving it with PyTorch. I have two GPUs on the same machine (16273 MiB and 12193 MiB), and I want to use both for training on a video dataset.
I get a warning:
There is an imbalance between your GPUs. You may want to exclude GPU 1 which
has less than 75% of the memory or cores of GPU 0. You can do so by setting
the device_ids argument to DataParallel, or by setting the CUDA_VISIBLE_DEVICES
environment variable.
warnings.warn(imbalance_warn.format(device_ids[min_pos], device_ids[max_pos]))
I also get an error:
raise TypeError('Broadcast function not implemented for CPU tensors')
TypeError: Broadcast function not implemented for CPU tensors
if __name__ == '__main__':
    opt.scales = [opt.initial_scale]
    for i in range(1, opt.n_scales):
        opt.scales.append(opt.scales[-1] * opt.scale_step)
    opt.arch = '{}-{}'.format(opt.model, opt.model_depth)
    opt.mean = get_mean(opt.norm_value)
    opt.std = get_std(opt.norm_value)
    print("opt", opt)
    with open(os.path.join(opt.result_path, 'opts.json'), 'w') as opt_file:
        json.dump(vars(opt), opt_file)

    torch.manual_seed(opt.manual_seed)
    model, parameters = generate_model(opt)
    # print(model)
    pytorch_total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print("Total number of trainable parameters: ", pytorch_total_params)

    # Define class weights
    if opt.weighted:
        print("Weighted Loss is created")
        if opt.n_finetune_classes == 2:
            weight = torch.tensor([1.0, 3.0])
        else:
            weight = torch.ones(opt.n_finetune_classes)
    else:
        weight = None

    criterion = nn.CrossEntropyLoss()
    if not opt.no_cuda:
        criterion = nn.DataParallel(criterion.cuda())

    if opt.no_mean_norm and not opt.std_norm:
        norm_method = Normalize([0, 0, 0], [1, 1, 1])
    elif not opt.std_norm:
        norm_method = Normalize(opt.mean, [1, 1, 1])
    else:
        norm_method = Normalize(opt.mean, opt.std)

    train_loader = torch.utils.data.DataLoader(
        training_data,
        batch_size=opt.batch_size,
        shuffle=True,
        num_workers=opt.n_threads,
        pin_memory=True)
    train_logger = Logger(
        os.path.join(opt.result_path, 'train.log'),
        ['epoch', 'loss', 'acc', 'precision', 'recall', 'lr'])
    train_batch_logger = Logger(
        os.path.join(opt.result_path, 'train_batch.log'),
        ['epoch', 'batch', 'iter', 'loss', 'acc', 'precision', 'recall', 'lr'])

    if opt.nesterov:
        dampening = 0
    else:
        dampening = opt.dampening
    optimizer = optim.SGD(
        parameters,
        lr=opt.learning_rate,
        momentum=opt.momentum,
        dampening=dampening,
        weight_decay=opt.weight_decay,
        nesterov=opt.nesterov)
    # scheduler = lr_scheduler.ReduceLROnPlateau(
    #     optimizer, 'min', patience=opt.lr_patience)

    if not opt.no_val:
        spatial_transform = Compose([
            Scale(opt.sample_size),
            CenterCrop(opt.sample_size),
            ToTensor(opt.norm_value), norm_method
        ])

    print('run')
    for i in range(opt.begin_epoch, opt.n_epochs + 1):
        if not opt.no_train:
            adjust_learning_rate(optimizer, i, opt.lr_steps)
            train_epoch(i, train_loader, model, criterion, optimizer, opt,
                        train_logger, train_batch_logger)
I have also made changes in my train file:
model = nn.DataParallel(model(),device_ids=[0,1]).cuda()
outputs = model(inputs)
It does not seem to work properly and gives an error. Please advise; I am new to PyTorch.
Thanks
As mentioned in this link, you have to do model.cuda() before passing it to nn.DataParallel.
net = nn.DataParallel(model.cuda(), device_ids=[0,1])
https://github.com/pytorch/pytorch/issues/17065

Cross validation in R

I have a problem cross validating a dataset in R.
mypredict.rpart <- function(object, newdata){
  predict(object, newdata, type = "class")
}
res <- errorest(win~., data=df, model = rpart, predict = mypredict.rpart)
I get this error.
Error in predict.rpart(object, newdata, type = "class") :
Invalid prediction for rpart object
My dataset is made of 16 numerical attributes, and win has two values, 0 and 1.
You can download the dataset at the link.
If you're doing classification, win should be a factor.
df$win = factor(df$win)
Then your code works for me:
> res
Call:
errorest.data.frame(formula = win ~ ., data = df, model = rpart,
predict = mypredict.rpart)
10-fold cross-validation estimator of misclassification error
Misclassification error: 0.4844
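
For completeness, a minimal self-contained sketch of the fix, with the packages and the factor conversion spelled out (df is assumed to be the downloaded dataset with its 16 numeric attributes and the win column):

library(ipred)   # provides errorest()
library(rpart)

df$win <- factor(df$win)   # classification target must be a factor

mypredict.rpart <- function(object, newdata) {
  predict(object, newdata, type = "class")
}

set.seed(1)
res <- errorest(win ~ ., data = df, model = rpart, predict = mypredict.rpart)
res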
