Cross validation in R

Cross validation in R - validation

I have a problem cross validating a dataset in R.
mypredict.rpart <- function(object, newdata){
predict(object, newdata, type = "class")
}
res <- errorest(win~., data=df, model = rpart, predict = mypredict.rpart)
I get this error.
Error in predict.rpart(object, newdata, type = "class") :
Invalid prediction for rpart object
My dataset is made out of 16 numerical atributes and win is has two factor 0 and 1.
You can download the dataset on link

If you're doing classification, win should be a factor.
df$win = factor(df$win)
Then your code works for me:
> res
Call:
errorest.data.frame(formula = win ~ ., data = df, model = rpart,
predict = mypredict.rpart)
10-fold cross-validation estimator of misclassification error
Misclassification error: 0.4844

Related

Is "insample" in mlr3tuning resampling can be used when we want to do hyperparameter tuning with the full dataset?

I've been trying to do some tuning hyperparameters for the survival SVM model. I used the AutoTuner function from the mlr3tuning package. I want to do tuning for the whole dataset (No train & test split). I've found the resampling class which is "insample". When I look at the mlr3 dictionary, it said "Uses all observations as training and as test set."
My questions is, Is "insample" in mlr3tuning resampling can be used when we want to do hyperparameter tuning with the full dataset and if it applies, why when I tried to use the hyperparameter to the survivalsvm function from the survivalsvm package, it gives the different output of concordance index?
This is the code I used for hyperparameter tuning
veteran<-veteran
set.seed(1)
task = as_task_surv(x = veteran, time = 'time', event = 'status')
learner = lrn("surv.svm", type = "hybrid", diff.meth = "makediff3",
gamma.mu = c(0.1, 0.1),kernel = 'rbf_kernel')
search_space = ps(gamma = p_dbl(2^-5, 2^5),mu = p_dbl(2^-5, 2^5))
search_space$trafo = function(x, param_set) {
x$gamma.mu = c(x$gamma, x$mu)
x$gamma = x$mu = NULL
x}
ssvm_at = AutoTuner$new(
learner = learner,
resampling = rsmp("insample"),
search_space = search_space,
measure = msr('surv.cindex'),
terminator = trm('evals', n_evals = 5),
tuner = tnr('grid_search'))
ssvm_at$train(task)
And this is the code that I've been trying using the survivalsvm function from the survivalsvm package
survsvm.reg <- survivalsvm(Surv(veteran$time , veteran$status ) ~ .,
data = veteran,
type = "hybrid", gamma.mu = c(32,32),diff.meth = "makediff3",
opt.meth = "quadprog", kernel = "rbf_kernel")
pred.survsvm.reg <- predict(survsvm.reg,veteran)
conindex(pred.survsvm.reg, veteran$time)

fable package: Could not find an appropriate ARIMA model

I am trying to fit the Arima model to hourly data. First, I tried fable package, and the ARIMA function could not find the appropriate model. Second, I used forecast package with auto.arima function, which worked perfectly. I have one example series (available here: https://gist.github.com/mizhozan/800fec80682822969e7d35ebba395) and the results as an example here:
data.arima <- read.csv('test.csv', header = TRUE)[,-1]
## fable package
data.arima$Date <- lubridate::ymd_hms(data.arima$Date, truncated = 2)
library(tidyverse)
library(fable)
result.arima <- data.arima %>%
as_tsibble(., index = Date)%>%
model(ARIMA(value ~ PDQ() + pdq() +
fourier(period = "day", K = 3) +
fourier(period = "week", K = 2), seasonal.test = "ocsb")) %>%
forecast(h = 24)
Warning message:
1 error encountered for ARIMA(value ~ PDQ() + pdq() + fourier(period = "day", K = 3) +
fourier(period = "week", K = 2), seasonal.test = "ocsb")
[1] Could not find an appropriate ARIMA model.
This is likely because automatic selection does not select models with characteristic roots that may be numerically unstable.
For more details, refer to https://otexts.com/fpp3/arima-r.html#plotting-the-characteristic-roots
## forecast package
library(forecast)
series.arima <- msts(data.arima$value, seasonal.periods = c(24, 24*7))
model.arima <- auto.arima(series.arima, seasonal.test = "ocsb", xreg=fourier(series.arima,K=c(3,2)))
Series: series.arima
Regression with ARIMA(4,0,1) errors
Coefficients:
ar1 ar2 ar3 ar4 ma1 intercept S1-24 C1-24 S2-24 C2-24 S3-24 C3-24 S1-168 C1-168 S2-168 C2-168
1.9064 -1.4934 0.8292 -0.3056 -0.8728 664263.21 -310891.13 -349744.23 -133862.32 -20587.2 69313.88 51963.803 43880.66 1524.578 -3823.166 5642.26
s.e. 0.0755 0.1192 0.1085 0.0521 0.0605 7781.72 20778.06 20591.69 11662.66 11606.0 8792.99 8768.856 11342.32 11669.244 12819.074 13091.08
sigma^2 estimated as 5.122e+09: log likelihood=-4225.19
AIC=8484.38 AICc=8486.31 BIC=8549.28
result.arima.2 <- forecast(model.arima, xreg=fourier(series.arima, K = c(3,2), h = 24))
I would appreciate that if someone could explain the problem here.

Properly evaluate a test dataset

I trained a machine translation model using huggingface library:
def compute_metrics(eval_preds):
preds, labels = eval_preds
if isinstance(preds, tuple):
preds = preds[0]
decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
# Replace -100 in the labels as we can't decode them.
labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
# Some simple post-processing
decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)
result = metric.compute(predictions=decoded_preds, references=decoded_labels)
result = {"bleu": result["score"]}
prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
result["gen_len"] = np.mean(prediction_lens)
result = {k: round(v, 4) for k, v in result.items()}
return result
trainer = Seq2SeqTrainer(
model,
args,
train_dataset=tokenized_datasets['train'],
eval_dataset=tokenized_datasets['test'],
data_collator=data_collator,
tokenizer=tokenizer,
compute_metrics=compute_metrics
)
trainer.train()
model_dir = './models/'
trainer.save_model(model_dir)
The code above is taken from this Google Colab notebook. After the training, I can see the trained model is saved to the folder models and the metric is calculated. Now I want to load the trained model and do the prediction on a new dataset, here is what I tried:
dataset = load_dataset('csv', data_files='data/training_data.csv')
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
# Tokenize the test dataset
tokenized_datasets = train_test.map(preprocess_function_v2, batched=True)
test_dataset = tokenized_datasets['test']
model = AutoModelForSeq2SeqLM.from_pretrained('models')
model(test_dataset)
It threw the following error:
*** AttributeError: 'Dataset' object has no attribute 'size'
I tried the evaluate() function as well, but it said:
*** torch.nn.modules.module.ModuleAttributeError: 'MarianMTModel' object has no attribute 'evaluate'
And the function eval only prints the configuration of the model.
What is the proper way to evaluate the performance of the trained model on a new dataset?

Turned out that the prediction can be produced using the following code:
inputs = tokenizer(
questions,
max_length=max_input_length,
truncation=True,
return_tensors='pt',
padding=True).to('cuda')
translation = model.generate(**inputs)

How to distribute and fit h2o model by group in sparklyr

I would like to fit a model by group in h2o using some type of distributed apply function.
I tried the following but it doesn't work. Probably due to the fact I cannot pipe the sc object through.
df%>%
spark_apply(function(e)
h2o.coxph(x = predictors,
event_column = "event",
stop_column = "time_to_next",
training_frame = as_h2o_frame(sc, e, strict_version_check = FALSE))
group_by = "id"
)
I receive a pretty generic spark error like this:
error : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 23.0 failed 4 times, most recent failure: Lost task 0.3 in stage 23.0 :

I'm not sure if you can return an entire H2OCoxPH model from sparklyr::spark_apply(): Errors are no method for coercing this S4 class to a vector if you set the fetch_result_as_sdf argument to FALSE and cannot coerce class ‘structure("H2OCoxPHModel", package = "h2o")’ to a data.frame if set to TRUE.
But if you can make your own vector or dataframe from the relevant parts of the model, I think you can do it.
Here I'll use a sample Cox Proportional Hazards file from H2O Docs Cox Proportional Hazards (CoxPH) and I'll use group_by = "surgery".
heart_hf <- h2o::h2o.importFile("http://s3.amazonaws.com/h2o-public-test-data/smalldata/coxph_test/heart.csv")
##### Convert to Spark DataFrame since I assume that is the use case
heart_sf <- sparklyr::copy_to(sc, heart_hf %>% as.data.frame())
##### Use sparklyr::spark_apply() on Spark DataFrame to "distribute and fit h2o model by group"
sparklyr::spark_apply(
x = heart_sf,
f = function(x) {
h2o::h2o.init()
heart_coxph <- h2o::h2o.coxph(x = c("age", "year"),
event_column = "event",
start_column = "start",
stop_column = "stop",
ties = "breslow",
training_frame = h2o::as.h2o(x, strict_version_check = FALSE))
return(data.frame(conc = heart_coxph#model$model_summary$concordance))
},
columns = list(surgery = "integer", conc = "numeric"),
group_by = c("surgery"))
# Source: spark<?> [?? x 2]
surgery conc
<int> <dbl>
1 1 0.588
2 0 0.614

How to do parallel processing in pytorch

I am working on a deep learning problem. I am solving it using pytorch. I have two GPU's which are on the same machine (16273MiB,12193MiB). I want to use both the GPU's for my training (video dataset).
I get a warning:
There is an imbalance between your GPUs. You may want to exclude GPU 1 which
has less than 75% of the memory or cores of GPU 0. You can do so by setting
the device_ids argument to DataParallel, or by setting the CUDA_VISIBLE_DEVICES
environment variable.
warnings.warn(imbalance_warn.format(device_ids[min_pos], device_ids[max_pos]))
I also get an error:
raise TypeError('Broadcast function not implemented for CPU tensors')
TypeError: Broadcast function not implemented for CPU tensors
if __name__ == '__main__':
opt.scales = [opt.initial_scale]
for i in range(1, opt.n_scales):
opt.scales.append(opt.scales[-1] * opt.scale_step)
opt.arch = '{}-{}'.format(opt.model, opt.model_depth)
opt.mean = get_mean(opt.norm_value)
opt.std = get_std(opt.norm_value)
print("opt",opt)
with open(os.path.join(opt.result_path, 'opts.json'), 'w') as opt_file:
json.dump(vars(opt), opt_file)
torch.manual_seed(opt.manual_seed)
model, parameters = generate_model(opt)
#print(model)
pytorch_total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print("Total number of trainable parameters: ", pytorch_total_params)
# Define Class weights
if opt.weighted:
print("Weighted Loss is created")
if opt.n_finetune_classes == 2:
weight = torch.tensor([1.0, 3.0])
else:
weight = torch.ones(opt.n_finetune_classes)
else:
weight = None
criterion = nn.CrossEntropyLoss()
if not opt.no_cuda:
criterion = nn.DataParallel(criterion.cuda())
if opt.no_mean_norm and not opt.std_norm:
norm_method = Normalize([0, 0, 0], [1, 1, 1])
elif not opt.std_norm:
norm_method = Normalize(opt.mean, [1, 1, 1])
else:
norm_method = Normalize(opt.mean, opt.std)
train_loader = torch.utils.data.DataLoader(
training_data,
batch_size=opt.batch_size,
shuffle=True,
num_workers=opt.n_threads,
pin_memory=True)
train_logger = Logger(
os.path.join(opt.result_path, 'train.log'),
['epoch', 'loss', 'acc', 'precision','recall','lr'])
train_batch_logger = Logger(
os.path.join(opt.result_path, 'train_batch.log'),
['epoch', 'batch', 'iter', 'loss', 'acc', 'precision', 'recall', 'lr'])
if opt.nesterov:
dampening = 0
else:
dampening = opt.dampening
optimizer = optim.SGD(
parameters,
lr=opt.learning_rate,
momentum=opt.momentum,
dampening=dampening,
weight_decay=opt.weight_decay,
nesterov=opt.nesterov)
# scheduler = lr_scheduler.ReduceLROnPlateau(
# optimizer, 'min', patience=opt.lr_patience)
if not opt.no_val:
spatial_transform = Compose([
Scale(opt.sample_size),
CenterCrop(opt.sample_size),
ToTensor(opt.norm_value), norm_method
])
print('run')
for i in range(opt.begin_epoch, opt.n_epochs + 1):
if not opt.no_train:
adjust_learning_rate(optimizer, i, opt.lr_steps)
train_epoch(i, train_loader, model, criterion, optimizer, opt,
train_logger, train_batch_logger)
I have also made changes in my train file:
model = nn.DataParallel(model(),device_ids=[0,1]).cuda()
outputs = model(inputs)
It does not seem to work properly and is giving error. Please advice, I am new to pytorch.
Thanks

As mentioned in this link, you have to do model.cuda() before passing it to nn.DataParallel.
net = nn.DataParallel(model.cuda(), device_ids=[0,1])
https://github.com/pytorch/pytorch/issues/17065

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Cross validation in R - validation

Related

Is "insample" in mlr3tuning resampling can be used when we want to do hyperparameter tuning with the full dataset?

fable package: Could not find an appropriate ARIMA model

Properly evaluate a test dataset

How to distribute and fit h2o model by group in sparklyr

How to do parallel processing in pytorch

Categories

Resources