RayTune HyperOptSearch - fitting resampling into pipeline throws error: All intermediate steps should be transformers and implement fit and transform - resampling

I'm getting started with Raytune and trying to set up a HyperOptSearch with imbalanced data.
Fitting a pipeline without RandomOverSampler works fine, but when I add that in, I get the error:
TypeError: All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough'
Code sample here, and works fine without the RandomOverSampler step:
cfg_hgb = {
'clf__learning_rate' : tune.loguniform(0.001, 0.8),
'clf__max_leaf_nodes' : tune.randint(2,20),
'clf__min_samples_leaf' : tune.randint(50,500),
'clf__max_depth' : tune.randint(2,15),
'clf__l2_regularization' : tune.loguniform(0.001, 1000),
'clf__max_iter' : tune.choice([800]),
}
hyperopt = HyperOptSearch(
# space=cfg_hgb,
metric="mean_auc",
mode="max",
points_to_evaluate=None,
n_initial_points=20,
random_state_seed=RANDOM_STATE,
gamma=0.25,
)
def train_hgb(config):
# LOAD DATA
X, y, nominal, ordinal, numeric = load_clean_data()
# LOAD TRANSFORMERS
prep = Preprocessor(nominal, ordinal, numeric)
# CHOOSE CV STRATEGY
splitter = StratifiedKFold(CV, random_state=RANDOM_STATE, shuffle=True)
# TRAIN
scores = []
for train_ind, val_ind in splitter.split(X,y):
hgb_os = Pipeline(steps=[
('coltrans', prep.transformer_ord),
('ros', RandomOverSampler(random_state=RANDOM_STATE)), # if I comment out, works fine
('clf', HistGradientBoostingClassifier(
categorical_features=prep.cat_feature_mask,
random_state=RANDOM_STATE))
])
hgb_os.set_params(**config)
hgb_os.fit(X.iloc[train_ind], y[train_ind])
y_pred = hgb_os.predict(X.iloc[val_ind])
scores.append(roc_auc_score(y_true=y[val_ind], y_score=y_pred, average="macro"))
# REPORT SCORES
session.report({
'mean_auc' : np.array(scores).mean(),
'std_auc' : np.array(scores).std(),
})
tuner = tune.Tuner(
trainable=train_hgb,
param_space=cfg_hgb,
tune_config=tune.TuneConfig(
num_samples=10,
search_alg=hyperopt,
),
run_config=RunConfig(
name="experiment_name",
local_dir="./results/hgb",
)
)
results = tuner.fit()
Whereas if using ray.tune.sklearn.TuneSearchCV, RandomOverSampler works fine in the pipeline:
hgb_tune = {
'learning_rate' : tune.loguniform(0.001, 0.15),
'max_leaf_nodes' : tune.randint(2,4),
'min_samples_leaf' : tune.randint(160,300),
'max_depth' : tune.randint(2,7),
'l2_regularization' : tune.loguniform(5, 1000),
'max_iter' : tune.choice([400]),
}
hgb_os = Pipeline(steps=[
('trans', prep.transformer_ord),
('ros', RandomOverSampler(random_state=RANDOM_STATE)),
('clf', TuneSearchCV(
HistGradientBoostingClassifier(
categorical_features=prep.cat_feature_mask,
random_state=RANDOM_STATE),
param_distributions=hgb_tune,
cv=CV, scoring=SCORER,
verbose=VERBOSE, search_optimization="bayesian",
n_trials=N_TRIALS, )) # local_dir='~/rayresults/hgbtune'
])
results, params = fit_eval(hgb_os, X_train, X_test, y_train, y_test)
I understand that probably tune is expecting .fit_transform for intermediate steps whereas RandomOverSampler uses .fit_resample. Also note that RandomOverSampler requires imblearn.pipeline.Pipeline rather than sklearn.pipeline.Pipeline so perhaps therein lies the problem.
Is there a way to add any form of resampling with the current Tuner API? Or do I need to part out the pipeline and resample it first outside of this loop?
Thanks in advance.

Related

Fine-tune a pre-trained model

I am new to transformer based models. I am trying to fine-tune the following model (https://huggingface.co/Chramer/remote-sensing-distilbert-cased) on my dataset. The code:
enter image description here
and I got the following error:
enter image description here
I will be thankful if anyone could help.
The preprocessing steps I followed:
input_ids_t = []
attention_masks_t = []
for sent in df_train['text_a']:
encoded_dict = tokenizer.encode_plus(
sent,
add_special_tokens = True,
max_length = 128,
pad_to_max_length = True,
return_attention_mask = True,
return_tensors = 'tf',
)
input_ids_t.append(encoded_dict['input_ids'])
attention_masks_t.append(encoded_dict['attention_mask'])
# Convert the lists into tensors.
input_ids_t = tf.concat(input_ids_t, axis=0)
attention_masks_t = tf.concat(attention_masks_t, axis=0)
labels_t = np.asarray(df_train['label'])
and i did the same for testing data. Then:
train_data = tf.data.Dataset.from_tensor_slices((input_ids_t,attention_masks_t,labels_t))
and the same for testing data
It sounds like you are feeding the transformer_model 1 input instead of 3. Try removing the square brackets around transformer_model([input_ids, input_mask, segment_ids])[0] so that it reads transformer_model(input_ids, input_mask, segment_ids)[0]. That way, the function will have 3 arguments and not just 1.

Change all images in training set

I have a convolutional neural network. And I wanted to train it on images from the training set but first they should be wrapped with my function change(tensor, float) that takes in a tensor/image of the form [hight,width,3] and a float.
Batch size =4
loading data
trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size,
shuffle=True, num_workers=2)
Cnn architecture
for epoch in range(2): # loop over the dataset multiple times
running_loss = 0.0
for i, data in enumerate(trainloader, 0):
# get the inputs; data is a list of [inputs, labels]
inputs, labels = data
#size of inputs [4,3,32,32]
#size of labels [4]
inputs = change(inputs,0.1) <----------------------------
# zero the parameter gradients
optimizer.zero_grad()
# forward + backward + optimize
outputs = net(inputs) #[4, 10]
loss = criterion(outputs, labels)
loss.backward()
optimizer.step()
# print statistics
running_loss += loss.item()
if i % 2000 == 1999: # print every 2000 mini-batches
print(f'[{epoch + 1}, {i + 1:5d}] loss: {running_loss / 2000:.3f}')
running_loss = 0.0
print('Finished Training')
I am trying to apply the image function change but it gives an object error.
it there a quick way to fix it?
I am using a Julia function but it works completely fine with other objects. Error message:
JULIA: MethodError: no method matching copy(::PyObject)
Closest candidates are:
copy(!Matched::T) where T<:SHA.SHA3_CTX at /opt/julia-1.7.2/share/julia/stdlib/v1.7/SHA/src/types.jl:213
copy(!Matched::T) where T<:SHA.SHA2_CTX at /opt/julia-1.7.2/share/julia/stdlib/v1.7/SHA/src/types.jl:212
copy(!Matched::Number) at /opt/julia-1.7.2/share/julia/base/number.jl:113
I would recommend to put change function to transforms list, so you do data changes on transformation stage.
partial from functools will help you to fix number of arguments, like this:
from functools import partial
def change(input, float):
pass
# Use partial to fix number of params, such that change accepts only input
change_partial = partial(change, float=pass_float_value_here)
# Add change_partial to a list of transforms before or after converting to tensors
transforms = Compose([
RandomResizedCrop(img_size), # example
# Add change_partial here if it operates on PIL Image
change_partial,
ToTensor(), # convert to tensor
# Add change_partial here if it operates on torch tensors
change_partial,
])

Properly evaluate a test dataset

I trained a machine translation model using huggingface library:
def compute_metrics(eval_preds):
preds, labels = eval_preds
if isinstance(preds, tuple):
preds = preds[0]
decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
# Replace -100 in the labels as we can't decode them.
labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
# Some simple post-processing
decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)
result = metric.compute(predictions=decoded_preds, references=decoded_labels)
result = {"bleu": result["score"]}
prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
result["gen_len"] = np.mean(prediction_lens)
result = {k: round(v, 4) for k, v in result.items()}
return result
trainer = Seq2SeqTrainer(
model,
args,
train_dataset=tokenized_datasets['train'],
eval_dataset=tokenized_datasets['test'],
data_collator=data_collator,
tokenizer=tokenizer,
compute_metrics=compute_metrics
)
trainer.train()
model_dir = './models/'
trainer.save_model(model_dir)
The code above is taken from this Google Colab notebook. After the training, I can see the trained model is saved to the folder models and the metric is calculated. Now I want to load the trained model and do the prediction on a new dataset, here is what I tried:
dataset = load_dataset('csv', data_files='data/training_data.csv')
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
# Tokenize the test dataset
tokenized_datasets = train_test.map(preprocess_function_v2, batched=True)
test_dataset = tokenized_datasets['test']
model = AutoModelForSeq2SeqLM.from_pretrained('models')
model(test_dataset)
It threw the following error:
*** AttributeError: 'Dataset' object has no attribute 'size'
I tried the evaluate() function as well, but it said:
*** torch.nn.modules.module.ModuleAttributeError: 'MarianMTModel' object has no attribute 'evaluate'
And the function eval only prints the configuration of the model.
What is the proper way to evaluate the performance of the trained model on a new dataset?
Turned out that the prediction can be produced using the following code:
inputs = tokenizer(
questions,
max_length=max_input_length,
truncation=True,
return_tensors='pt',
padding=True).to('cuda')
translation = model.generate(**inputs)

LSTM - LSTM - future value prediction error

After some research, I was able to predict the future value using the LSTM code below. I have also attached the Dmd1ahr.csv file in the github link that I am using.
https://github.com/ukeshchawal/hello-world/blob/master/Dmd1ahr.csv
As you all can see below, 90 data points are training sets and 91st to 100th are future value prediction.
However some of the questions that I still have are:
In order to predict these values I had to originally take more than hundred data sets (here, I have taken 500 data sets) which is not exactly what my primary goal is. Is there a way that given 500 data sets, it will predict the rest 10 or 20 out of sample data points? If yes, will you please write me a sample code where you can just take 500 data points from Dmd1ahr.csv file attached below and it will predict some future values (say 501 to 520) based on those 500 points?
The prediction are way off compared to the one who have in your blogs (definitely indicates for parameter tuning - I tried changing epochs, LSTM layers, Activation, optimizer). What other parameter tuning I can do to make it more robust?
Thank you'll in advance.
import numpy as np
import matplotlib.pyplot as plt
import pandas
# By twaking the architecture it could be made more robust
np.random.seed(7)
numOfSamples = 500
lengthTrain = 90
lengthValidation = 100
look_back = 1 # Can be set higher, in my experiments it made performance worse though
transientTime = 90 # Time to "burn in" time series
series = pandas.read_csv('Dmd1ahr.csv')
def generateTrainData(series, i, look_back):
return series[i:look_back+i+1]
trainX = np.stack([generateTrainData(series, i, look_back) for i in range(lengthTrain)])
testX = np.stack([generateTrainData(series, lengthTrain + i, look_back) for i in range(lengthValidation)])
trainX = trainX.reshape((lengthTrain,look_back+1,1))
testX = testX.reshape((lengthValidation, look_back + 1, 1))
trainY = trainX[:,1:,:]
trainX = trainX[:,:-1,:]
testY = testX[:,1:,:]
testX = testX[:,:-1,:]
############### Build Model ###############
import keras
from keras.models import Model
from keras import layers
from keras import regularizers
inputs = layers.Input(batch_shape=(1,look_back,1), name="main_input")
inputsAux = layers.Input(batch_shape=(1,look_back,1), name="aux_input")
# this layer makes the actual prediction, i.e. decides if and how much it goes up or down
x = layers.recurrent.LSTM(300,return_sequences=True, stateful=True)(inputs)
x = layers.recurrent.LSTM(200,return_sequences=True, stateful=True)(inputs)
x = layers.recurrent.LSTM(100,return_sequences=True, stateful=True)(inputs)
x = layers.recurrent.LSTM(50,return_sequences=True, stateful=True)(inputs)
x = layers.wrappers.TimeDistributed(layers.Dense(1, activation="linear",
kernel_regularizer=regularizers.l2(0.005),
activity_regularizer=regularizers.l1(0.005)))(x)
# auxillary input, the current input will be feed directly to the output
# this way the prediction from the step before will be used as a "base", and the Network just have to
# learn if it goes a little up or down
auxX = layers.wrappers.TimeDistributed(layers.Dense(1,
kernel_initializer=keras.initializers.Constant(value=1),
bias_initializer='zeros',
input_shape=(1,1), activation="linear", trainable=False
))(inputsAux)
outputs = layers.add([x, auxX], name="main_output")
model = Model(inputs=[inputs, inputsAux], outputs=outputs)
model.compile(optimizer='adam',
loss='mean_squared_error',
metrics=['mean_squared_error'])
#model.summary()
#model.fit({"main_input": trainX, "aux_input": trainX[look_back-1,look_back,:]},{"main_output": trainY}, epochs=4, batch_size=1, shuffle=False)
model.fit({"main_input": trainX, "aux_input": trainX[:,look_back-1,:].reshape(lengthTrain,1,1)},{"main_output": trainY}, epochs=100, batch_size=1, shuffle=False)
############### make predictions ###############
burnedInPredictions = np.zeros(transientTime)
testPredictions = np.zeros(len(testX))
# burn series in, here use first transitionTime number of samples from test data
for i in range(transientTime):
prediction = model.predict([np.array(testX[i, :, 0].reshape(1, look_back, 1)), np.array(testX[i, look_back - 1, 0].reshape(1, 1, 1))])
testPredictions[i] = prediction[0,0,0]
burnedInPredictions[:] = testPredictions[:transientTime]
# prediction, now dont use any previous data whatsoever anymore, network just has to run on its own output
for i in range(transientTime, len(testX)):
prediction = model.predict([prediction, prediction])
testPredictions[i] = prediction[0,0,0]
# for plotting reasons
testPredictions[:np.size(burnedInPredictions)-1] = np.nan
############### plot results ###############
#import matplotlib.pyplot as plt
plt.plot(testX[:, 0, 0])
plt.show()
plt.plot(burnedInPredictions, label = "training")
plt.plot(testPredictions, label = "prediction")
plt.legend()
plt.show()

Using a complex likelihood in PyMC3

pymc.__version__ = '3.0'
theano.__version__ = '0.6.0.dev-RELEASE'
I'm trying to use PyMC3 with a complex likelihood function:
First question: Is this possible?
Here's my attempt using Thomas Wiecki's post as a guide:
import numpy as np
import theano as th
import pymc as pm
import scipy as sp
# Actual data I'm trying to fit
x = np.array([52.08, 58.44, 60.0, 65.0, 65.10, 66.0, 70.0, 87.5, 110.0, 126.0])
y = np.array([0.522, 0.659, 0.462, 0.720, 0.609, 0.696, 0.667, 0.870, 0.889, 0.919])
yerr = np.array([0.104, 0.071, 0.138, 0.035, 0.102, 0.096, 0.136, 0.031, 0.024, 0.035])
th.config.compute_test_value = 'off'
a = th.tensor.dscalar('a')
with pm.Model() as model:
# Priors
alpha = pm.Normal('alpha', mu=0.3, sd=5)
sig_alpha = pm.Normal('sig_alpha', mu=0.03, sd=5)
t_double = pm.Normal('t_double', mu=4, sd=20)
t_delay = pm.Normal('t_delay', mu=21, sd=20)
nu = pm.Uniform('nu', lower=0, upper=20)
# Some functions needed for calculation of the y estimator
def T(eqd):
doses = np.array([52.08, 58.44, 60.0, 65.0, 65.10,
66.0, 70.0, 87.5, 110.0, 126.0])
tmt_times = np.array([29,29,43,29,36,48,22,11,7,8])
return np.interp(eqd, doses, tmt_times)
def TCP(a):
time = T(x)
BCP = pm.exp(-1E7*pm.exp(-alpha*x*1.2 + 0.69315/t_delay(time-t_double)))
return pm.prod(BCP)
def normpdf(a, alpha, sig_alpha):
return 1./(sig_alpha*pm.sqrt(2.*np.pi))*pm.exp(-pm.sqr(a-alpha)/(2*pm.sqr(sig_alpha)))
def normcdf(a, alpha, sig_alpha):
return 1./2.*(1+pm.erf((a-alpha)/(sig_alpha*pm.sqrt(2))))
def integrand(a):
return normpdf(a,alpha,sig_alpha)/(1.-normcdf(0,alpha,sig_alpha))*TCP(a)
func = th.function([a,alpha,sig_alpha,t_double,t_delay], integrand(a))
y_est = sp.integrate.quad(func(a, alpha, sig_alpha,t_double,t_delay), 0, np.inf)[0]
likelihood = pm.T('TCP', mu=y_est, nu=nu, observed=y_tcp)
start = pm.find_MAP()
step = pm.NUTS(state=start)
trace = pm.sample(2000, step, start=start, progressbar=True)
which produces the following message regarding the expression for y_est:
TypeError: ('Bad input argument to theano function with name ":42" at index 0(0-based)', 'Expected an array-like object, but found a Variable: maybe you are trying to call a function on a (possibly shared) variable instead of a numeric array?')
I've overcome various other hurdles to get this far, and this is where I'm stuck. So, provided the answer to my first question is 'yes', then am I on the right track? Any guidance would be helpful!
N.B. Here is a similar question I found, and another.
Disclaimer: I'm very new at this. My only previous experience is successfully reproducing the linear regression example in Thomas' post. I've also successfully run the Theano test suite, so I know it works.
Yes, its possible to make something with a complex or arbitrary likelihood. Though that doesn't seem like what you're doing here. It looks like you have a complex transformation of one variable into another, the integration step.
Your particular exception is that integrate.quad is expecting a numpy array, not a pymc Variable. If you want to do quad within pymc, you'll have to make a custom theano Op (with derivative) for it.

Resources