GPT-3: Fine-Tune a Fine-Tuned Model?

The OpenAI documentation for the model attribute in the fine-tune API states, somewhat confusingly:
model
The name of the base model to fine-tune. You can select one of "ada", "babbage", "curie", "davinci", or a fine-tuned model created after 2022-04-21.
My question: is it better to fine-tune a base model or a fine-tuned model?
I created a fine-tuned model from ada with the file mydata1K.jsonl:
ada + mydata1K.jsonl --> ada:ft-acme-inc-2022-06-25
Now I have a bigger file of samples mydata2K.jsonl that I want to use to improve the fine-tuned model.
In this second round of fine-tuning, is it better to fine-tune ada again or to fine-tune my fine-tuned model ada:ft-acme-inc-2022-06-25? I'm assuming this is possible because my fine-tuned model was created after 2022-04-21.
ada + mydata2K.jsonl --> better-model
or
ada:ft-acme-inc-2022-06-25 + mydata2K.jsonl --> even-better-model?

If you read the Fine-tuning documentation as of Jan 4, 2023, the only part that discusses "fine-tuning a fine-tuned model" is the following passage under Advanced usage:
Continue fine-tuning from a fine-tuned model
If you have already fine-tuned a model for your task and now have additional training data that you would like to incorporate, you can continue fine-tuning from the model. This creates a model that has learned from all of the training data without having to re-train from scratch.
To do this, pass in the fine-tuned model name when creating a new fine-tuning job (e.g., -m curie:ft-<org>-<date>). Other training parameters do not have to be changed, however if your new training data is much smaller than your previous training data, you may find it useful to reduce learning_rate_multiplier by a factor of 2 to 4.
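For concreteness, here is a minimal sketch of continuing a fine-tune from a fine-tuned model with the legacy openai Python SDK (v0.x); the file and model names come from the question above, and the learning-rate value is only an illustrative placeholder:

import openai  # legacy openai-python v0.x API, assumed here

# Upload the additional training data (returns a file ID).
upload = openai.File.create(file=open("mydata2K.jsonl", "rb"), purpose="fine-tune")

# Continue fine-tuning from the existing fine-tuned model instead of base ada.
job = openai.FineTune.create(
    training_file=upload.id,
    model="ada:ft-acme-inc-2022-06-25",
    learning_rate_multiplier=0.05,  # optional: reduce if the new data is much smaller
)
print(job.id)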
Which option to choose?
You're asking about two options:
Option 1: ada + bigger-training-dataset.jsonl
Option 2: ada:ft-acme-inc-2022-06-25 + additional-training-dataset.jsonl
The documentation says nothing about which option yields better results.
However...
Choose Option 2
Why?
When training a fine-tuned model, the total tokens used will be billed
according to our training rates.
If you choose Option 1, you'll pay for some tokens in your training dataset twice: first when fine-tuning with the initial training dataset, and again when fine-tuning with the bigger training dataset (i.e., bigger-training-dataset.jsonl = initial-training-dataset.jsonl + additional-training-dataset.jsonl).
It's better to continue fine-tuning from a fine-tuned model because you'll pay only for tokens in your additional training dataset.
Read more about fine-tuning pricing calculation.
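As a back-of-the-envelope sketch of the billing difference (all token counts, the price, and the epoch count below are hypothetical placeholders; check OpenAI's pricing page for current rates):

tokens_initial = 100_000     # assumed tokens in mydata1K.jsonl
tokens_additional = 200_000  # assumed tokens in the new samples only
price_per_1k = 0.0004        # assumed ada training price, USD per 1K tokens
n_epochs = 4                 # assumed number of training epochs

# Option 1: fine-tune base ada on the combined dataset -> pay for old + new tokens.
cost_option_1 = (tokens_initial + tokens_additional) / 1000 * price_per_1k * n_epochs  # $0.48

# Option 2: continue from the fine-tuned model -> pay only for the new tokens.
cost_option_2 = tokens_additional / 1000 * price_per_1k * n_epochs  # $0.32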

Related

How to change the task of a pretrained model without jeopardizing its capacity?

I want to build a text2text model. Specifically, I want to transform automatically generated, scrambled text pieces into a smooth paragraph in the same language. I've already prepared the text inputs and outputs, so the corpus is not the primary problem now.
I want to use Hugging Face models like:
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForMaskedLM.from_pretrained("bert-base-chinese")
because it has already acquired the capacity to generate the language. However, the model is built for masked language modeling, and there is no established task quite like mine, as it is really customized. So how can I use the Hugging Face masked language model as a base text2text model without jeopardizing its capacity? I want to fine-tune it to achieve that goal, and I want to know how.
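The thread does not include an answer, but one common pattern for turning a masked-LM checkpoint into a text2text model is to warm-start an encoder-decoder from it with Hugging Face's EncoderDecoderModel; the sketch below assumes a BERT-to-BERT seq2seq setup fits the task:

from transformers import AutoTokenizer, EncoderDecoderModel

# Warm-start a seq2seq (text2text) model from the masked-LM checkpoint,
# reusing the pretrained Chinese BERT weights for both encoder and decoder.
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-chinese", "bert-base-chinese"
)

# Generation settings required for BERT-to-BERT models.
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

In this setup the pretrained weights are kept; only the cross-attention layers added to the decoder are initialized from scratch, so fine-tuning on the prepared input/output pairs preserves most of the model's capacity.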

Difference between from_config and from_pretrained in HuggingFace

# `task` and `model_checkpoint` are defined earlier in the notebook this snippet is adapted from.
from transformers import AutoModelForSequenceClassification, DistilBertConfig

num_labels = 3 if task.startswith("mnli") else 1 if task == "stsb" else 2
preconfig = DistilBertConfig(n_layers=6)
model1 = AutoModelForSequenceClassification.from_config(preconfig)
model2 = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)
I am modifying this code (the modified code is provided above) to test the DistilBERT transformer layer depth via from_config, since to my knowledge from_pretrained uses 6 layers, because in section 3 of the paper they said:
we initialize the student from the teacher by taking one layer out of two
What I want to test, though, is various numbers of layers. To check whether the two functions behave the same, I tried running from_config with n_layers=6, because according to the DistilBertConfig documentation, n_layers determines the transformer block depth. However, when I ran model1 and model2 on the SST-2 dataset, the accuracies were:
model1 achieved only 0.8073
model2 achieved 0.901
If they both behaved the same, I would expect the results to be similar, but a 10% drop is significant, so I believe there has to be a difference between the two functions. Is there a reason behind the difference (for example, that model1 has not had hyperparameter search applied), and is there a way to make both functions behave the same? Thank you!
The two functions you described, from_config and from_pretrained, do not behave the same. For a model M, with a reference R:
from_config allows you to instantiate a blank model, which has the same configuration (the same shape) as your model of choice: M is as R was before training
from_pretrained allows you to load a pretrained model, which has already been trained on a specific dataset for a given number of epochs: M is as R after training.
To cite the docs: "Note: Loading a model from its configuration file does not load the model weights. It only affects the model's configuration. Use from_pretrained() to load the model weights."
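A minimal sketch of the distinction (the checkpoint name is assumed here, matching the notebook the question adapts):

from transformers import AutoConfig, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased"  # assumed checkpoint

# from_config: same architecture, but randomly initialized (untrained) weights.
config = AutoConfig.from_pretrained(checkpoint, n_layers=6, num_labels=2)
model_blank = AutoModelForSequenceClassification.from_config(config)

# from_pretrained: same architecture plus the weights learned during pretraining.
model_trained = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

This is also why model1 trails model2 by about 10% on SST-2: model1 starts from random weights, and fine-tuning on SST-2 alone cannot recover what pretraining provides.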

ML.NET doesn't support resuming training for ImageClassificationTrainer

I want to continue training the model.zip file with more images, without retraining from scratch starting at the baseline model. How do I do that?
This isn't possible at the moment. ML.NET's ImageClassificationTrainer already uses a pre-trained model, so you're using transfer learning to create your model. Any additions would have to be "from scratch" on the pre-trained model.
Also, looking at the existing trainers that can be re-trained, the ImageClassificationTrainer isn't listed among them.

Why can't TFBertForSequenceClassification.from_pretrained('bert-base-chinese') be used?

I want to do Chinese textual similarity with Hugging Face:
from transformers import BertTokenizer, TFBertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
model = TFBertForSequenceClassification.from_pretrained('bert-base-chinese')
It doesn't work; the system reports:
Some weights of the model checkpoint at bert-base-chinese were not used when initializing TFBertForSequenceClassification: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-chinese and are newly initialized: ['classifier', 'dropout_37']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
But I can use Hugging Face to do named entity recognition:
from transformers import BertTokenizer, TFBertForTokenClassification

tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
model = TFBertForTokenClassification.from_pretrained("bert-base-chinese")
Does that mean Hugging Face hasn't provided a Chinese sequence classification model? If my judgment is right, how do I solve this problem in Colab with only 12 GB of memory?
The reason is simple: the checkpoint has not been fine-tuned for the sequence classification task. When you load 'bert-base-chinese' into a sequence classification model, the pretraining heads ['nsp___cls', 'mlm___cls'] are dropped, and the new layers ['classifier', 'dropout_37'] are initialized randomly.
And it's a warning, which means the model will give essentially random results until that randomly initialized last layer is trained.
By the way, @andy, you didn't post the output for token classification; it should show a similar warning, but with the ['classifier'] layer randomly initialized.
Use an already fine-tuned checkpoint if one exists for your task; otherwise you would need to fine-tune this loaded model yourself, as sketched below.
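A minimal Keras fine-tuning sketch (the sentence pairs and labels are made-up placeholders; a real run needs a labeled Chinese similarity dataset):

import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = TFBertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=2)

# Placeholder sentence pairs and similarity labels, for illustration only.
enc = tokenizer(["今天天气很好", "我喜欢猫"], ["今天天气不错", "我讨厌下雨"],
                padding=True, truncation=True, return_tensors="tf")
labels = tf.constant([1, 0])

# Training updates the randomly initialized classification head (and the encoder).
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
model.fit(dict(enc), labels, epochs=1, batch_size=2)

A small batch size like this should also fit within Colab's 12 GB of memory.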

Multiple Stanford CoreNLP model files made, which one is the correct one to use?

I made a sentiment analysis model using Stanford CoreNLP's library, so I have a bunch of ser.gz model files, such as model-0014-93.73.ser.gz.
I was wondering which model to use in my Java code. Based on a previous question, I just used the model with the highest F1 score, which in this case is model-0014-93.73.ser.gz, and in my Java code I pointed to the model I want to use with the following line:
props.put("sentiment.model", "/path/to/model-0014-93.73.ser.gz");
However, by referring to just that model, am I excluding the sentiment analysis from the other models that were made? Should I be referring to all the model files to make sure I "covered" all the bases or does the highest scoring model trump everything else?
You should point to only the single highest scoring model. The code has no way to make use of multiple models at the same time.
