How to train TfidfVectorizer for a new dataset - text-classification

I am doing document classification using TfidfVectorizer and LinearSVC. I need to train the TfidfVectorizer again and again as new datasets come in. Is there any way to store the current TfidfVectorizer and mix in new features when a new dataset arrives?
Code:
if os.path.exists("trans.pkl"):
    with open("trans.pkl", "rb") as fid:
        transformer = cPickle.load(fid)
else:
    transformer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
    with open("trans.pkl", "wb") as fid:
        cPickle.dump(transformer, fid)
X_train = transformer.fit_transform(train_data)
X_test = transformer.transform(test_data)
print X_train.shape[1]
if os.path.exists("store_model.pkl"):
    print "model exists"
    with open("store_model.pkl", "rb") as fid:
        classifier = cPickle.load(fid)
    print classifier
else:
    print "model created"
    classifier = LinearSVC().fit(X_train, train_target)
    with open("store_model.pkl", "wb") as fid:
        cPickle.dump(classifier, fid)
predictions = classifier.predict(X_test)
I have two different train files and one test file. When I run the code on the first train file, it works well. But when I try the second train file, the number of features is different from the first, so it raises an error. How can I train my model when I have multiple such dataset files?
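One common way around this (a sketch, not the only option) is to keep the raw training texts from every file and refit the vectorizer on the combined corpus whenever new data arrives, so the classifier is always trained on the feature space it will see at prediction time. The names old_train_data, new_train_data and their targets below are hypothetical placeholders:
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# hypothetical placeholders: lists of raw documents and labels from each train file
all_train_data = old_train_data + new_train_data
all_train_target = old_train_target + new_train_target

# refit the vectorizer on the combined corpus so the vocabulary covers both files
transformer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
X_train = transformer.fit_transform(all_train_data)
classifier = LinearSVC().fit(X_train, all_train_target)

# pickle the *fitted* vectorizer together with the classifier, so that transform()
# at test time produces exactly the number of features the classifier expects
with open("trans.pkl", "wb") as fid:
    pickle.dump(transformer, fid)
with open("store_model.pkl", "wb") as fid:
    pickle.dump(classifier, fid)
For true incremental learning without refitting from scratch, a HashingVectorizer (which has a fixed feature space) combined with an estimator that supports partial_fit, such as SGDClassifier, is another option.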

Related

Training loss and validation loss all 0

I'm trying to fine-tune a T5 model on my own dataset for grammatical error correction, but when I run the model I keep getting all 0's for my results. I'm following the Hugging Face translation tutorial.
I think it's a problem with the preprocess function, but I can't seem to figure out why.
prefix = ''
max_input_length = 128
max_target_length = 128
source_lang = "ar"
target_lang = "ar"

def preprocess_function(examples):
    inputs = [prefix + ex for ex in examples["original"]]
    targets = [ex for ex in examples["corrected"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)
    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=max_target_length, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
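One quick sanity check (a sketch, assuming dataset is the 🤗 Dataset with the "original" and "corrected" columns used above) is to run the preprocess function on a couple of rows and confirm the labels actually contain token ids:
sample = preprocess_function(dataset[:2])
print(sample["input_ids"][0][:10])            # first token ids of the first input
print(sample["labels"][0][:10])               # first label ids; should not be empty
print(tokenizer.decode(sample["labels"][0]))  # round-trip the target back to text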

Properly evaluate a test dataset

I trained a machine translation model using the Hugging Face library:
def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)
    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)
trainer.train()

model_dir = './models/'
trainer.save_model(model_dir)
The code above is taken from this Google Colab notebook. After training, I can see that the trained model is saved to the models folder and the metric is calculated. Now I want to load the trained model and run predictions on a new dataset; here is what I tried:
dataset = load_dataset('csv', data_files='data/training_data.csv')
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
# Tokenize the test dataset
tokenized_datasets = train_test.map(preprocess_function_v2, batched=True)
test_dataset = tokenized_datasets['test']
model = AutoModelForSeq2SeqLM.from_pretrained('models')
model(test_dataset)
It threw the following error:
*** AttributeError: 'Dataset' object has no attribute 'size'
I tried the evaluate() function as well, but it said:
*** torch.nn.modules.module.ModuleAttributeError: 'MarianMTModel' object has no attribute 'evaluate'
And calling eval() only prints the configuration of the model.
What is the proper way to evaluate the performance of the trained model on a new dataset?
It turned out that predictions can be produced using the following code:
inputs = tokenizer(
    questions,
    max_length=max_input_length,
    truncation=True,
    return_tensors='pt',
    padding=True).to('cuda')
translation = model.generate(**inputs)
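To score the model on a whole new dataset rather than on individual generations, one option (a sketch, assuming test_dataset, data_collator, compute_metrics and args are built as above, with predict_with_generate=True in the training arguments) is to wrap the loaded model in a Seq2SeqTrainer again and call evaluate:
from transformers import Seq2SeqTrainer

eval_trainer = Seq2SeqTrainer(
    model=model,                      # the model loaded from 'models'
    args=args,                        # assumes predict_with_generate=True was set
    eval_dataset=test_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,  # the same BLEU computation used during training
)
metrics = eval_trainer.evaluate()     # runs generation + compute_metrics on test_dataset
print(metrics)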

Uploading models with custom forward functions to the huggingface model hub?

Is it possible to upload a model with a custom forward function to the Hugging Face model hub?
I can see how to do it if the model has a standard form, but I can't see how to customise the forward function and still upload it.
Yes, absolutely. You can create your own model with any number of added layers/customisations you want and upload it to the model hub. Here is a demo that walks through the entire process.
Uploading custom model to 🤗 model hub
import tqdm
from datasets import load_dataset
import transformers
from transformers import AutoTokenizer, AutoModel, BertConfig
from transformers import AdamW
from transformers import get_scheduler
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

# setting device to `cuda` if a gpu exists
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

# initialising the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("google/bert_uncased_L-2_H-128_A-2")
bert = AutoModel.from_pretrained("google/bert_uncased_L-2_H-128_A-2")

def tokenize_function(examples):
    '''Function for tokenizing raw texts'''
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=128)

# downloading the IMDB dataset from 🤗 `datasets`
raw_datasets = load_dataset("imdb")

# Running the tokenizing function on the raw texts
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

# for simplicity I have taken only the train split
tokenized_datasets = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))

# Now let's create the torch Dataset class
class IMDBClassificationDataset(Dataset):
    def __init__(self, dataset):
        self.dataset = dataset

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        d = self.dataset[idx]
        ids = torch.tensor(d['input_ids'])
        mask = torch.tensor(d['attention_mask'])
        label = torch.tensor(d['label'])
        return ids, mask, label

# Preparing the dataset and the DataLoader
dataset = IMDBClassificationDataset(tokenized_datasets)
train_dataloader = DataLoader(dataset, shuffle=True, batch_size=8)

# Now let's create a custom BERT model
class CustomBert(transformers.PreTrainedModel):
    '''Custom model class
    ------------------
    The trick is not to inherit the class from `nn.Module` but from `transformers.PreTrainedModel`.
    You also need to pass the model config during initialisation.'''
    def __init__(self, bert):
        super(CustomBert, self).__init__(config=BertConfig.from_pretrained('google/bert_uncased_L-2_H-128_A-2'))
        self.bert = bert
        self.l1 = nn.Linear(128, 1)
        self.do = nn.Dropout(0.1)
        self.relu = nn.ReLU()
        self.sigmoid = nn.Sigmoid()

    def forward(self, sent_id, mask):
        '''For simplicity I have added only one linear layer; you can create any type of network you want'''
        bert_out = self.bert(sent_id, attention_mask=mask)
        o = bert_out.last_hidden_state[:, 0, :]
        o = self.do(o)
        o = self.relu(o)
        o = self.l1(o)
        o = self.sigmoid(o)
        return o

# initialising model, loss and optimizer
model = CustomBert(bert)
model.to(device)
criterion = torch.nn.BCELoss()
optimizer = AdamW(model.parameters(), lr=5e-5)

# setting epochs, num_training_steps and the lr_scheduler
num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps
)

# training loop
model.train()
for epoch in tqdm.tqdm(range(num_epochs)):
    for batch in train_dataloader:
        ids, masks, labels = batch
        labels = labels.type(torch.float32)
        o = model(ids.to(device), masks.to(device))
        loss = criterion(torch.squeeze(o), labels.to(device))
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()

# save the tokenizer and the model in the `./test-model/` directory
tokenizer.save_pretrained("./test-model/")
model.save_pretrained("./test-model/", push_to_hub=False)
Now create a new model repository on 🤗 and push all the contents of the test-model directory to the 🤗 model hub.
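If you prefer not to upload the files by hand, the same push can be done programmatically; a minimal sketch, assuming you are already authenticated with the hub (for example via huggingface-cli login) and that test-model is the repository name you want:
model.push_to_hub("test-model")      # uploads the custom model weights and config
tokenizer.push_to_hub("test-model")  # uploads the tokenizer files to the same repo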
To verify the uploaded model, you can try 🤗's pipeline to check whether anything is wrong.
from transformers import pipeline
# as this is classification, you need to pass `text-classification` as the task
classifier = pipeline('text-classification', model='tanmoyio/test-model')
classifier("This movie was superb")
It will output something like this
[{'label': 'LABEL_0', 'score': 0.5571992993354797}]
This is a real demo, check the model here - https://huggingface.co/tanmoyio/test-model. Let me know if you have further questions.

Saving the BERT model at every epoch for further training

I am using bert_model.save_pretrained to save the model at the end, since that is the command that saves the model with all its configuration and weights, but it cannot be used inside model.fit: the ModelCheckpoint callbacks that save the model at each epoch do not use save_pretrained. Can anybody help me save the BERT model at each epoch, since I cannot train the whole BERT model in one go?
Edit
Code for loading the pre-trained BERT model
bert_model = TFAutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=num_classes)
Code for compiling the BERT model
from tensorflow.keras import optimizers

bert_model.compile(loss='categorical_crossentropy',
                   optimizer=optimizers.Adam(learning_rate=0.00005),
                   metrics=['accuracy'])
bert_model.summary()
Code for training and saving the BERT model
checkpoint_filepath_1 = 'callbacks_models/BERT1.{epoch:02d}-{val_loss:.2f}.h5'
checkpoint_filepath_2 = 'callbacks_models/complete_best_BERT_model_1.h5'

callbacks_1 = ModelCheckpoint(
    filepath=checkpoint_filepath_1,
    monitor='val_loss',
    mode='min',
    save_best_only=False,
    save_weights_only=False,
    save_freq='epoch')

callbacks_2 = ModelCheckpoint(
    filepath=checkpoint_filepath_2,
    monitor='val_loss',
    mode='min',
    save_best_only=True)

es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=5)

hist = bert_model.fit([train1_input_ids, train1_attention_masks],
                      y_train1, batch_size=16, epochs=1,
                      validation_data=([val_input_ids, val_attention_masks], y_val),
                      callbacks=[es, callbacks_1, callbacks_2, history_logger])

min_val_score = min(hist.history['val_loss'])
print("\nMinimum validation loss = ", min_val_score)

bert_model.save_pretrained("callbacks_models/Complete_BERT_model_1.h5")
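One way to get a save_pretrained-style checkpoint at the end of every epoch is a small custom Keras callback; this is a sketch, and the directory pattern callbacks_models/BERT_epoch_{n} is just an example name:
import tensorflow as tf

class SavePretrainedCallback(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        # self.model is the model passed to fit(), i.e. the
        # TFAutoModelForSequenceClassification instance, so it has save_pretrained()
        self.model.save_pretrained(f"callbacks_models/BERT_epoch_{epoch + 1}")

# then pass an instance alongside the other callbacks:
# callbacks=[es, callbacks_1, callbacks_2, history_logger, SavePretrainedCallback()]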

XAI for transformer custom model using AllenNLP

I have been solving an NER problem on a Vietnamese dataset with 15 tags in IO format. I have been using the AllenNLP Interpret toolkit with my model, but I cannot configure it completely.
I have used the pre-trained language model "xlm-roberta-base" from Hugging Face. I concatenate the last 4 BERT layers and pass the result through a linear layer. You can see the model architecture in the source below.
class BaseBertSoftmax(nn.Module):
    def __init__(self, model, drop_out, num_labels):
        super(BaseBertSoftmax, self).__init__()
        self.num_labels = num_labels
        self.model = model
        self.dropout = nn.Dropout(drop_out)
        self.classifier = nn.Linear(4 * 768, num_labels)  # last 4 layers concatenated

    def forward_custom(self, input_ids, attention_mask=None,
                       labels=None, head_mask=None):
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
        sequence_output = torch.cat((outputs[1][-1], outputs[1][-2],
                                     outputs[1][-3], outputs[1][-4]), -1)
        sequence_output = self.dropout(sequence_output)
        logits = self.classifier(sequence_output)  # bsz, seq_len, num_labels
        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here
        if labels is not None:
            loss_fct = nn.CrossEntropyLoss(ignore_index=0)
            if attention_mask is not None:
                active_loss = attention_mask.view(-1) == 1
                active_logits = logits.view(-1, self.num_labels)[active_loss]
                active_labels = labels.view(-1)[active_loss]
                loss = loss_fct(active_logits, active_labels)
            else:
                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
            outputs = (loss,) + outputs
        return outputs  # scores, (hidden_states), (attentions)
What steps do I have to take to integrate this model with AllenNLP Interpret?
Could you please help me with this problem?
