Saving bert model at every epoch for further training - huggingface-transformers

I am using bert_model.save_pretrained to save the model at the end, since this is the command that saves the model together with all its configuration and weights. However, it cannot be used during model.fit: a ModelCheckpoint callback saves the model at each epoch, but not via save_pretrained. Can anybody help me save the BERT model at each epoch, since I cannot train the whole BERT model in one go?
Edit
Code for loading the pre-trained BERT model
bert_model = TFAutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=num_classes)
Code for compiling the BERT model
from tensorflow.keras import optimizers
bert_model.compile(loss='categorical_crossentropy',
                   optimizer=optimizers.Adam(learning_rate=0.00005),
                   metrics=['accuracy'])
bert_model.summary()
Code for training and saving the BERT model
checkpoint_filepath_1 = 'callbacks_models/BERT1.{epoch:02d}-{val_loss:.2f}.h5'
checkpoint_filepath_2 = 'callbacks_models/complete_best_BERT_model_1.h5'
callbacks_1 = ModelCheckpoint(
    filepath=checkpoint_filepath_1,
    monitor='val_loss',
    mode='min',
    save_best_only=False,
    save_weights_only=False,
    save_freq='epoch')
callbacks_2 = ModelCheckpoint(
    filepath=checkpoint_filepath_2,
    monitor='val_loss',
    mode='min',
    save_best_only=True)
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1,
                   patience=5)
hist = bert_model.fit([train1_input_ids, train1_attention_masks],
                      y_train1, batch_size=16, epochs=1,
                      validation_data=([val_input_ids, val_attention_masks], y_val),
                      callbacks=[es, callbacks_1, callbacks_2, history_logger])
min_val_score = min(hist.history['val_loss'])
print("\nMinimum validation loss = ", min_val_score)
bert_model.save_pretrained("callbacks_models/Complete_BERT_model_1.h5")
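One possible approach (a minimal sketch, not from the original post: the SavePretrainedCallback class and the output directory below are made up for illustration) is a small custom Keras callback that calls save_pretrained at the end of every epoch, so each checkpoint keeps the full Hugging Face config and weights and can later be reloaded with from_pretrained:
import tensorflow as tf

class SavePretrainedCallback(tf.keras.callbacks.Callback):
    '''Hypothetical helper: save the Hugging Face model with save_pretrained()
    after every epoch so training can be resumed later with from_pretrained().'''
    def __init__(self, output_dir):
        super().__init__()
        self.output_dir = output_dir

    def on_epoch_end(self, epoch, logs=None):
        # save_pretrained writes a config.json plus the weights into a directory,
        # so a per-epoch directory is used instead of a single .h5 file
        self.model.save_pretrained(f"{self.output_dir}/epoch_{epoch + 1:02d}")

# usage: add it to the callbacks list passed to bert_model.fit(...)
# callbacks=[es, callbacks_1, callbacks_2, SavePretrainedCallback("callbacks_models/bert_pretrained")]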

Related

Properly evaluate a test dataset

I trained a machine translation model using the huggingface library:
def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)
    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)
trainer.train()
model_dir = './models/'
trainer.save_model(model_dir)
The code above is taken from this Google Colab notebook. After the training, I can see that the trained model is saved to the folder models and the metric is calculated. Now I want to load the trained model and run predictions on a new dataset; here is what I tried:
dataset = load_dataset('csv', data_files='data/training_data.csv')
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
# Tokenize the test dataset
tokenized_datasets = train_test.map(preprocess_function_v2, batched=True)
test_dataset = tokenized_datasets['test']
model = AutoModelForSeq2SeqLM.from_pretrained('models')
model(test_dataset)
It threw the following error:
*** AttributeError: 'Dataset' object has no attribute 'size'
I tried the evaluate() function as well, but it said:
*** torch.nn.modules.module.ModuleAttributeError: 'MarianMTModel' object has no attribute 'evaluate'
And the eval() function only prints the configuration of the model.
What is the proper way to evaluate the performance of the trained model on a new dataset?
It turned out that the predictions can be produced using the following code:
inputs = tokenizer(
    questions,
    max_length=max_input_length,
    truncation=True,
    return_tensors='pt',
    padding=True).to('cuda')
translation = model.generate(**inputs)
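To turn the generated token ids back into text, they can be decoded with the tokenizer (a short sketch; the variable names follow the snippet above):
# decode the generated token ids back into strings, dropping pad/eos tokens
decoded_translation = tokenizer.batch_decode(translation, skip_special_tokens=True)
print(decoded_translation)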

Uploading models with custom forward functions to the huggingface model hub?

Is it possible to upload a model with a custom forward function to the huggingface model hub?
I can see how to do it if your model is of a normal form but can't see how to customise the forward function and do it?
Yes, absolutely. You can create your own model with any number of added layers/customisations you want and upload it to the model hub. Let me present a demo that describes the entire process.
Uploading custom model to 🤗 model hub
import tqdm
from datasets import load_dataset
import transformers
from transformers import AutoTokenizer, AutoModel, BertConfig
from transformers import AdamW
from transformers import get_scheduler
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

# setting device to `cuda` if a GPU exists
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

# initialising the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("google/bert_uncased_L-2_H-128_A-2")
bert = AutoModel.from_pretrained("google/bert_uncased_L-2_H-128_A-2")

def tokenize_function(examples):
    '''Function for tokenizing raw texts'''
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=128)

# downloading the IMDB dataset from 🤗 `datasets`
raw_datasets = load_dataset("imdb")

# running the tokenizing function on the raw texts
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

# for simplicity I have taken only the train split
tokenized_datasets = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))

# now let's create the torch Dataset class
class IMDBClassificationDataset(Dataset):
    def __init__(self, dataset):
        self.dataset = dataset

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        d = self.dataset[idx]
        ids = torch.tensor(d['input_ids'])
        mask = torch.tensor(d['attention_mask'])
        label = torch.tensor(d['label'])
        return ids, mask, label

# preparing the dataset and the DataLoader
dataset = IMDBClassificationDataset(tokenized_datasets)
train_dataloader = DataLoader(dataset, shuffle=True, batch_size=8)

# now let's create a custom BERT model
class CustomBert(transformers.PreTrainedModel):
    '''Custom model class
    ---------------------
    The trick is to inherit not from `nn.Module` but from `transformers.PreTrainedModel`.
    You also need to pass the model config during initialisation.'''
    def __init__(self, bert):
        super(CustomBert, self).__init__(config=BertConfig.from_pretrained('google/bert_uncased_L-2_H-128_A-2'))
        self.bert = bert
        self.l1 = nn.Linear(128, 1)
        self.do = nn.Dropout(0.1)
        self.relu = nn.ReLU()
        self.sigmoid = nn.Sigmoid()

    def forward(self, sent_id, mask):
        '''For simplicity only one linear layer is added here; you can create any type of network you want'''
        bert_out = self.bert(sent_id, attention_mask=mask)
        o = bert_out.last_hidden_state[:, 0, :]
        o = self.do(o)
        o = self.relu(o)
        o = self.l1(o)
        o = self.sigmoid(o)
        return o

# initialising model, loss and optimizer
model = CustomBert(bert)
model.to(device)
criterion = torch.nn.BCELoss()
optimizer = AdamW(model.parameters(), lr=5e-5)

# setting epochs, num_training_steps and the lr_scheduler
num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps
)

# training loop
model.train()
for epoch in tqdm.tqdm(range(num_epochs)):
    for batch in train_dataloader:
        ids, masks, labels = batch
        labels = labels.type(torch.float32)
        o = model(ids.to(device), masks.to(device))
        loss = criterion(torch.squeeze(o), labels.to(device))
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()

# save the tokenizer and the model in the `./test-model/` directory
tokenizer.save_pretrained("./test-model/")
model.save_pretrained("./test-model/", push_to_hub=False)
Now create a new model repository on 🤗 and push all the contents inside test-model to the 🤗 model hub.
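Alternatively (a sketch, assuming you are already authenticated with huggingface-cli login; the repo name test-model is just the one used in this demo), you can push the tokenizer and the model straight from code:
# assumes you are logged in (e.g. via `huggingface-cli login`)
tokenizer.push_to_hub("test-model")
model.push_to_hub("test-model")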
To sanity-check the uploaded model, you can try 🤗's pipeline to make sure nothing is wrong.
from transformers import pipeline
# as this is classification, you need to pass `text-classification` as the task
classifier = pipeline('text-classification', model='tanmoyio/test-model')
classifier("This movie was superb")
It will output something like this
[{'label': 'LABEL_0', 'score': 0.5571992993354797}]
This is a real demo, check the model here - https://huggingface.co/tanmoyio/test-model. Let me know if you have further questions.

Why fine-tuning BERT mlm on specific domain doesn't work? What am I doing wrong?

I'm new to this. I'm trying to fine-tune a BERT MLM (bert-base-uncased) on a target domain. Unfortunately, the results are not good.
Before fine-tuning, the pre-trained model fills the mask of a sentence with words in line of human expectations.
E.g. Wikipedia is a free online [MASK], created and edited by volunteers around the world.
The most probable predictions are encyclopedia (score: 0.650) and resource (score: 0.087).
After fine-tuning, the predictions are completely wrong. Often stopwords are predicted as the result.
E.g. Wikipedia is a free online [MASK], created and edited by volunteers around the world.
The most probable predictions are the (score: 0.052) and be (score: 0.033).
I experimented with different numbers of epochs (from 1 to 10) and different datasets (from a few MB to a few GB), but I got the same issue. What am I doing wrong? I'm using the following code; I hope you can help me.
from transformers import AutoConfig, AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
config = AutoConfig.from_pretrained('bert-base-uncased', output_hidden_states=True)
model = AutoModelForMaskedLM.from_config(config)  # BertForMaskedLM.from_pretrained(path)

from transformers import LineByLineTextDataset
dataset = LineByLineTextDataset(tokenizer=tokenizer,
                                file_path="data/english/corpora.txt",
                                block_size=512)

from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

from transformers import Trainer, TrainingArguments
training_args = TrainingArguments(output_dir="output/models/english",
                                  overwrite_output_dir=True,
                                  num_train_epochs=5,
                                  per_gpu_train_batch_size=8,
                                  save_steps=22222222,
                                  save_total_limit=2)
trainer = Trainer(model=model, args=training_args, data_collator=data_collator, train_dataset=dataset)
trainer.train()
trainer.save_model("output/models/english")
from transformers import pipeline
# Initialize MLM pipeline
mlm = pipeline('fill-mask', model="output/models/english", tokenizer="output/models/english")
# Get mask token
mask = mlm.tokenizer.mask_token
# Get result for particular masked phrase
phrase = f'Wikipedia is a free online {mask}, created and edited by volunteers around the world'
result = mlm(phrase)
# Print result
print(result)
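One thing worth checking here (my note, not part of the original post): AutoModelForMaskedLM.from_config(config) builds a model with randomly initialised weights, so the run above trains BERT from scratch rather than fine-tuning the pre-trained checkpoint, which would explain why stopwords dominate the predictions. A minimal sketch of loading the pre-trained weights instead:
from transformers import AutoModelForMaskedLM

# load the pre-trained bert-base-uncased weights instead of a randomly initialised model
model = AutoModelForMaskedLM.from_pretrained('bert-base-uncased', output_hidden_states=True)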

Train GPT2 with Trainer & TrainingArguments using/specifying attention_mask

I'm using Trainer & TrainingArguments to train GPT2 Model, but it seems that this does not work well.
My datasets have the ids of the tokens of my corpus and the mask of each text, to indicate where to apply the attention:
Dataset({
    features: ['attention_mask', 'input_ids', 'labels'],
    num_rows: 2012860
})
I am doing the training with Trainer & TrainingArguments, passing my model and my previous dataset as follows. But nowhere do I specify anything about the attention_mask:
training_args = TrainingArguments(
    output_dir=path_save_checkpoints,
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    logging_steps=5_000, save_steps=5_000,
    fp16=True,
    deepspeed="ds_config.json",
    remove_unused_columns=True,
    debug=True
)
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
    tokenizer=tokenizer,
)
trainer.train()
How should I tell the Trainer to use this feature (attention_mask)?
If you take a look at the file /transformers/trainer.py there is no reference to "attention" or "mask".
Thanks in advance!
Somewhere in the source code, you will see that the inputs are passed to the model roughly like this:
outputs = model(**inputs)
As long as your collator returns a dictionary that includes the attention_mask key, your attention mask will be passed to your GPT2 model.
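For example (a minimal sketch, not from the original answer), the stock default_data_collator already batches every column of the dataset, including attention_mask, into the dictionary that reaches the model:
from transformers import default_data_collator

# each feature dict carries input_ids, attention_mask and labels; the collator
# stacks them into batched tensors under the same keys, so model(**inputs)
# receives the attention_mask automatically
features = [{"input_ids": [10, 20, 30], "attention_mask": [1, 1, 1], "labels": [10, 20, 30]}]
batch = default_data_collator(features)
print(batch.keys())  # the batch still contains 'attention_mask'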

How to train tfidfvectorizer for new dataset

I am doing document classification using TfidfVectorizer and LinearSVC. I need to train the TfidfVectorizer again and again as new datasets come in. Is there any way to store the current TfidfVectorizer and mix in new features when a new dataset arrives?
Code :
if os.path.exists("trans.pkl"):
    with open("trans.pkl", "rb") as fid:
        transformer = cPickle.load(fid)
else:
    transformer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
    with open("trans.pkl", "wb") as fid:
        cPickle.dump(transformer, fid)
X_train = transformer.fit_transform(train_data)
X_test = transformer.transform(test_data)
print X_train.shape[1]
if os.path.exists("store_model.pkl"):
    print "model exists"
    with open("store_model.pkl", "rb") as fid:
        classifier = cPickle.load(fid)
    print classifier
else:
    print "model created"
    classifier = LinearSVC().fit(X_train, train_target)
    with open("store_model.pkl", "wb") as fid:
        cPickle.dump(classifier, fid)
predictions = classifier.predict(X_test)
I have two different train files and one test file. When I executed the code with the first train file, it worked well. But when I tried the second train file, the number of features was different from the first, so it gives an error. How can I train my model if I have multiple such dataset files?
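One thing worth noting (my observation, not part of the original post): calling fit_transform on each new training file re-learns a fresh vocabulary, which is why the number of features changes between files. A minimal sketch that keeps the feature space fixed by fitting the vectorizer once, persisting the fitted object, and only calling transform afterwards (it reuses train_data / test_data from the snippet above and uses the Python 3 pickle module):
import os
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer

if os.path.exists("trans.pkl"):
    # reuse the vectorizer that was already fitted on the first training file
    with open("trans.pkl", "rb") as fid:
        transformer = pickle.load(fid)
    X_train = transformer.transform(train_data)      # transform only: same feature space
else:
    transformer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
    X_train = transformer.fit_transform(train_data)  # fit once, then persist the fitted object
    with open("trans.pkl", "wb") as fid:
        pickle.dump(transformer, fid)

X_test = transformer.transform(test_data)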
