How can I control data feeding order to model using Huggingface Trainer? - transformer-model

I want to train a model on the data in the order in which they are stored.
For example, with 100 examples and batch_size=2, I want to feed the 1st and 2nd examples together, then the 3rd and 4th, then the 5th and 6th, and so on.
However, the Hugging Face Trainer trains the model using a data collator, and the data are fed to the model in random order, controlled by the data_seed parameter.
How can I train the model while feeding the data in the order in which they are stored?
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# load tokenizer
model_checkpoint = "facebook/bart-base"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
# load model
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
# make batch
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
batch_size = 2
epochs = 3
args = Seq2SeqTrainingArguments(
    output_dir="saved_model",
    overwrite_output_dir=True,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    gradient_accumulation_steps=2,
    weight_decay=0.01,
    num_train_epochs=epochs,
    predict_with_generate=True,
    fp16=False,
    dataloader_num_workers=8,
)
# tokenized_datasets and compute_metrics are defined elsewhere (not shown)
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
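To make the requested order concrete, here is a minimal plain-Python sketch of sequential batching. It is not Trainer code: `sequential_batches` is a hypothetical helper, and the list of integers stands in for the stored examples. In PyTorch-based training the order is normally decided by the data loader's sampler rather than by the data collator, so the equivalent there would be a sequential sampler instead of a random one. How to plug that into the Trainer depends on the installed transformers version, so treat that part as an assumption to verify.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def sequential_batches(data: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive batches in storage order, without shuffling."""
    for start in range(0, len(data), batch_size):
        yield list(data[start:start + batch_size])


# Stand-in for 100 stored examples, numbered 1..100.
examples = list(range(1, 101))
batches = list(sequential_batches(examples, batch_size=2))
print(batches[:3])  # [[1, 2], [3, 4], [5, 6]]
```

Note that with gradient_accumulation_steps=2, as in the arguments above, two such batches contribute to each optimizer step.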

Related

Match individual records during Batch predictions with VertexAI pipeline

I have a custom model in Vertex AI and a table that stores the model's features along with a record_id.
I am building a pipeline component for batch prediction and have run into a critical issue.
When I submit the batch prediction job, I have to exclude the record_id from the input. But how can I map each prediction back to its record if the record_id is not in the result?
from google.cloud import bigquery
from google.cloud import aiplatform

aiplatform.init(project=project_id)
client = bigquery.Client(project=project_id)
query = '''
SELECT * except(record_id) FROM `table`
'''
df = client.query(query).to_dataframe()  # drop the record_id and load it to another table
job = client.load_table_from_dataframe(
    df, "table_wo_id",  # was X, which is undefined; df holds the query result
)
# model_name takes the model ID or the full resource name
clf = aiplatform.Model(model_name='custom_model')
clf.batch_predict(
    job_display_name='custom model batch prediction',
    bigquery_source='bq://table_wo_id',
    instances_format='bigquery',
    bigquery_destination_prefix='bq://prediction_result_table',
    predictions_format='bigquery',
    machine_type='n1-standard-4',
    max_replica_count=1,
)
As the example above shows, there is no record_id column in prediction_result_table, so there is no way to map the results back to the individual records.
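One possible workaround is to join the predictions back to the source table on the feature values. The sketch below shows that join in plain Python. It assumes the prediction table echoes the input feature columns next to each prediction, which is worth verifying on your own output table. `attach_record_ids`, the column names x and y, and the sample rows are all hypothetical.

```python
from collections import defaultdict, deque
from typing import Any, Dict, List, Sequence

Row = Dict[str, Any]


def attach_record_ids(
    source_rows: Sequence[Row],
    prediction_rows: Sequence[Row],
    feature_cols: Sequence[str],
    id_col: str = "record_id",
) -> List[Row]:
    """Match each prediction row to one source record with the same feature values."""
    ids_by_features = defaultdict(deque)
    for row in source_rows:
        key = tuple(row[col] for col in feature_cols)
        ids_by_features[key].append(row[id_col])

    matched = []
    for pred in prediction_rows:
        key = tuple(pred[col] for col in feature_cols)
        queue = ids_by_features.get(key)
        if queue:
            # Rows with identical features are interchangeable, so each
            # prediction simply consumes the next unused id for that key.
            matched.append({id_col: queue.popleft(), **pred})
    return matched


source = [
    {"record_id": "a", "x": 1, "y": 2},
    {"record_id": "b", "x": 3, "y": 4},
]
predictions = [
    {"x": 3, "y": 4, "prediction": 0.9},
    {"x": 1, "y": 2, "prediction": 0.1},
]
result = attach_record_ids(source, predictions, ["x", "y"])
print([row["record_id"] for row in result])  # ['b', 'a']
```

Exact matching on floating-point features can be fragile after a round trip through a table, so a join like this is safest on integer, string, or rounded columns.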

Fix tokenization to tensors with padding Huggingface

I'm trying to tokenize my dataset with the following preprocessing function. I've already downloaded the tokenizer with AutoTokenizer from the Spanish BERT version.
max_input_length = 280
max_target_length = 280
source_lang = "es"
target_lang = "en"
prefix = "translate spanish_to_women to spanish_to_men: "

def preprocess_function(examples):
    inputs = [prefix + ex for ex in examples["mujeres_tweet"]]
    targets = [ex for ex in examples["hombres_tweet"]]
    model_inputs = tokz(inputs,
                        padding=True,
                        truncation=True,
                        max_length=max_input_length,
                        return_tensors='pt'
                        )
    # Setup the tokenizer for targets
    with tokz.as_target_tokenizer():
        labels = tokz(targets,
                      padding=True,
                      truncation=True,
                      max_length=max_target_length,
                      return_tensors='pt'
                      )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
And I get the following error when trying to pass my dataset object through the function.
I've already tried dropping the columns that contain strings. I've also seen that when I do not set return_tensors, it does tokenize my dataset (but later on I have the same problem when trying to train my BERT model). Does anyone know what might be going on? *inserts crying face*
Also, I've tried tokenizing without return_tensors and then calling set_format, but it returns an empty dataset object *inserts another crying face*.
My Dataset looks like the following
And an example of the inputs
So that I just do:
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)
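The error message is not shown, so the following is only one possible contributor, not a diagnosis. With padding=True the tokenizer pads to the longest sequence in whatever batch it receives, so different batches passed through map can end up with different widths. Padding to a fixed length (padding="max_length") gives every row the same width. The sketch below mimics both behaviours on plain lists. `pad_batch` is a hypothetical helper, not a tokenizer API.

```python
from typing import List, Optional


def pad_batch(
    seqs: List[List[int]], pad_id: int = 0, max_length: Optional[int] = None
) -> List[List[int]]:
    """Pad (and truncate) token id lists to a common width.

    With max_length=None the width is the longest sequence in this batch,
    which mirrors "pad to longest"; otherwise the width is fixed.
    """
    width = max_length if max_length is not None else max(len(s) for s in seqs)
    padded = []
    for seq in seqs:
        clipped = seq[:width]
        padded.append(clipped + [pad_id] * (width - len(clipped)))
    return padded


batch_a = [[1, 2, 3], [4]]
batch_b = [[5, 6]]

# Padding to the longest sequence: the two batches get different widths.
print(len(pad_batch(batch_a)[0]), len(pad_batch(batch_b)[0]))  # 3 2

# Padding to a fixed length: every row has the same width.
print(len(pad_batch(batch_a, max_length=5)[0]), len(pad_batch(batch_b, max_length=5)[0]))  # 5 5
```

A common alternative is to leave padding out of the preprocessing function entirely and let a data collator pad each training batch.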

How to split a dataset into 3: train, test and validation with ImageDataGenerator for my dataset from directory?

I want to split my image dataset into three parts (train = 80%, validation = 10%, test = 10%) using ImageDataGenerator, with the dataset loaded from a directory:
train_dataset = image_generator.flow_from_directory(...)
validation_dataset = image_generator.flow_from_directory(...)
test_dataset = image_generator.flow_from_directory(...)
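ImageDataGenerator's validation_split only produces two subsets (training and validation), so one option is to split the file paths into three groups first and point each generator at its own directory. The sketch below shows just the splitting step in plain Python. `split_three` and the file names are hypothetical, and copying the files into train, validation, and test folders is left out.

```python
import random
from typing import List, Sequence, Tuple


def split_three(
    items: Sequence[str],
    fractions: Tuple[float, float, float] = (0.8, 0.1, 0.1),
    seed: int = 0,
) -> Tuple[List[str], List[str], List[str]]:
    """Shuffle items reproducibly and cut them into train, validation, and test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * fractions[0])
    n_val = int(len(shuffled) * fractions[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains goes to test
    return train, val, test


files = [f"img_{i}.jpg" for i in range(100)]
train_files, val_files, test_files = split_three(files)
print(len(train_files), len(val_files), len(test_files))  # 80 10 10
```

Running the split once per class folder keeps the class proportions similar across the three parts.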

h2o.auc( perf , xval =TRUE) - what does this call return?

My code is as follows:
gbm.fit.hex = h2o.gbm(x = xcols, y = 1865, training_frame = tr.hex,
                      distribution = "bernoulli", model_id = "gbm.model",
                      key = "gbm.model.key", ntrees = gbm.trees,
                      max_depth = gbm.depth, min_rows = gbm.min.rows,
                      learn_rate = gbm.learn.rate, nbins = 20,
                      balance_classes = gbm.balance, nfolds = gbm.folds)
perf <- h2o.performance(gbm.fit.hex, tr.hex)
a = h2o.auc(perf, xval = TRUE)
What does the auc call return? Does it return the AUC on the training dataset or on the cross-validation results?
It retrieves the cross-validated AUC.
Since you set the nfolds argument to something non-zero, the h2o.gbm function also performs k-fold cross-validation in addition to training a GBM model on the full training set. In your command, you did not specify a validation set, so the AUC values you can retrieve are training AUC, h2o.auc(perf, train = TRUE), and cross-validated AUC (as above).
If you want to evaluate performance on a separate validation (or test) set, you can pass that frame using the validation_frame argument and retrieve the validation AUC using h2o.auc(perf, valid = TRUE).
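As a reminder of what the number itself means, AUC is the probability that a randomly chosen positive example is scored above a randomly chosen negative one. Training, validation, and cross-validated AUC differ only in which rows' scores go into that computation. The sketch below is a plain-Python illustration with made-up labels and scores. The `auc` function is a hypothetical helper and is not how H2O computes it internally.

```python
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Made-up labels and scores, for illustration only.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(labels, scores))  # 0.75
```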

Labeled LDA learn in Stanford Topic Modeling Toolbox

It's ok when I run the example-6-llda-learn.scala as follows:
val source = CSVFile("pubmed-oa-subset.csv") ~> IDColumn(1);
val tokenizer = {
  SimpleEnglishTokenizer() ~>            // tokenize on space and punctuation
  CaseFolder() ~>                        // lowercase everything
  WordsAndNumbersOnlyFilter() ~>         // ignore non-words and non-numbers
  MinimumLengthFilter(3)                 // take terms with >=3 characters
}
val text = {
  source ~>                              // read from the source file
  Column(4) ~>                           // select column containing text
  TokenizeWith(tokenizer) ~>             // tokenize with tokenizer above
  TermCounter() ~>                       // collect counts (needed below)
  TermMinimumDocumentCountFilter(4) ~>   // filter terms in <4 docs
  TermDynamicStopListFilter(30) ~>       // filter out 30 most common terms
  DocumentMinimumLengthFilter(5)         // take only docs with >=5 terms
}
// define fields from the dataset we are going to slice against
val labels = {
  source ~>                              // read from the source file
  Column(2) ~>                           // take column two, the year
  TokenizeWith(WhitespaceTokenizer()) ~> // turns label field into an array
  TermCounter() ~>                       // collect label counts
  TermMinimumDocumentCountFilter(10)     // filter labels in < 10 docs
}
val dataset = LabeledLDADataset(text, labels);
// define the model parameters
val modelParams = LabeledLDAModelParams(dataset);
// Name of the output model folder to generate
val modelPath = file("llda-cvb0-"+dataset.signature+"-"+modelParams.signature);
// Trains the model, writing to the given output path
TrainCVB0LabeledLDA(modelParams, dataset, output = modelPath, maxIterations = 1000);
// or could use TrainGibbsLabeledLDA(modelParams, dataset, output = modelPath, maxIterations = 1500);
But it does not work when I change the last line from:
TrainCVB0LabeledLDA(modelParams, dataset, output = modelPath, maxIterations = 1000);
to:
TrainGibbsLabeledLDA(modelParams, dataset, output = modelPath, maxIterations = 1500);
Also, the CVB0 method uses a lot of memory: training on a corpus of 10,000 documents with about 10 labels per document takes about 30 GB of memory.
I've encountered the same situation, and I believe it is indeed a bug. Check GibbsLabeledLDA.scala in edu.stanford.nlp.tmt.model.llda under the src/main/scala folder, from line 204:
val z = doc.labels(zI);
val pZ = (doc.theta(z)+topicSmoothing(z)) *
(countTopicTerm(z)(term)+termSmooth) /
(countTopic(z)+termSmoothDenom);
doc.labels is self-explanatory, and doc.theta records the distribution (counts, actually) of the document's labels, so it has the same size as doc.labels.
zI is the index variable iterating over doc.labels, while z holds the actual label number. Here is the problem: a document may have only one label, say 1000, so zI is 0 and z is 1000, and doc.theta(z) then goes out of range.
I suppose the solution would be to change doc.theta(z) to doc.theta(zI).
(I'm still checking whether the results are meaningful; in any case, this bug has made me less confident in this toolbox.)
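The index-versus-value mix-up can be reproduced in a few lines of Python. This is only an illustration of the failure mode described above, not the toolbox's code. The lists `doc_labels` and `doc_theta` and the two functions are hypothetical stand-ins for doc.labels and doc.theta.

```python
doc_labels = [1000]  # the document has a single label, whose id is 1000
doc_theta = [3.0]    # per-label counts, same length as doc_labels


def theta_buggy(z_index: int) -> float:
    """Look up theta by label value, as in the quoted snippet."""
    z = doc_labels[z_index]  # z is the label id (1000), not a position
    return doc_theta[z]      # position 1000 does not exist in a 1-element list


def theta_fixed(z_index: int) -> float:
    """Look up theta by position, as the proposed fix does."""
    return doc_theta[z_index]


print(theta_fixed(0))  # 3.0
try:
    theta_buggy(0)
except IndexError as err:
    print("out of range:", err)
```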
