Difference between from_config and from_pretrained in HuggingFace - huggingface-transformers

from transformers import AutoModelForSequenceClassification, DistilBertConfig

# task and model_checkpoint are defined earlier in the notebook (the GLUE task name and a DistilBERT checkpoint)
num_labels = 3 if task.startswith("mnli") else 1 if task == "stsb" else 2
preconfig = DistilBertConfig(n_layers=6)
model1 = AutoModelForSequenceClassification.from_config(preconfig)
model2 = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)
I am modifying this code (the modified code is shown above) to test different DistilBERT transformer depths via from_config. As far as I know, from_pretrained uses 6 layers, because in Section 3 of the paper the authors say:
we initialize the student from the teacher by taking one layer out of two
What I want to test, however, is various numbers of layers. To check whether the two functions behave the same, I ran from_config with n_layers=6, since according to the DistilBertConfig documentation n_layers determines the number of transformer blocks. However, when I ran model1 and model2 on the SST-2 dataset, the accuracies were:
model1 achieved only 0.8073
model2 achieved 0.901
If they both behaved the same I would expect similar results, but a 10% drop is significant, so I believe there has to be a difference between the functions. Is there a reason behind the difference (for example, that model1 has not yet had hyperparameter search applied), and is there a way to make both functions behave the same? Thank you!

The two functions you described, from_config and from_pretrained, do not behave the same. For a model M, with a reference R:
from_config allows you to instantiate a blank model, which has the same configuration (the same shape) as your model of choice: M is as R was before training
from_pretrained allows you to load a pretrained model, which has already been trained on a specific dataset for a given number of epochs: M is as R after training.
To cite the documentation: "Note: Loading a model from its configuration file does not load the model weights. It only affects the model's configuration. Use from_pretrained() to load the model weights."
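A minimal sketch of the difference (assuming distilbert-base-uncased as the checkpoint; any DistilBERT checkpoint behaves the same way):

from transformers import AutoConfig, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased"  # assumed checkpoint, substitute your own

# from_config: build the architecture only; every weight is randomly initialized
config = AutoConfig.from_pretrained(checkpoint, num_labels=2)
blank_model = AutoModelForSequenceClassification.from_config(config)

# from_pretrained: build the architecture and load the pretrained weights;
# only the new classification head is randomly initialized
pretrained_model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

So model1 in the question starts from random weights and learns only from the SST-2 fine-tuning data, which is consistent with the accuracy gap you observed.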

Related

GPT-3 Fine Tune a Fine Tuned Model?

The OpenAI documentation for the model attribute in the fine-tune API states a bit confusingly:
model
The name of the base model to fine-tune. You can select one of "ada", "babbage", "curie", "davinci", or a fine-tuned model created after 2022-04-21.
My question: is it better to fine-tune a base model or a fine-tuned model?
I created a fine-tune model from ada with file mydata1K.jsonl:
ada + mydata1K.jsonl --> ada:ft-acme-inc-2022-06-25
Now I have a bigger file of samples mydata2K.jsonl that I want to use to improve the fine-tuned model.
In this second round of fine-tuning, is it better to fine-tune ada again or to fine-tune my fine-tuned model ada:ft-acme-inc-2022-06-25? I'm assuming this is possible because my fine-tuned model was created after 2022-04-21.
ada + mydata2K.jsonl --> better-model
or
ada:ft-acme-inc-2022-06-25 + mydata2K.jsonl --> even-better-model?
If you read the Fine-tuning documentation as of Jan 4, 2023, the only part talking about "fine-tuning a fine-tuned model" is the following part under Advanced usage:
Continue fine-tuning from a fine-tuned model
If you have already fine-tuned a model for your task and now have additional training data that you would like to incorporate, you can continue fine-tuning from the model. This creates a model that has learned from all of the training data without having to re-train from scratch.
To do this, pass in the fine-tuned model name when creating a new fine-tuning job (e.g., -m curie:ft-<org>-<date>). Other training parameters do not have to be changed, however if your new training data is much smaller than your previous training data, you may find it useful to reduce learning_rate_multiplier by a factor of 2 to 4.
Which option to choose?
You're asking about two options:
Option 1: ada + bigger-training-dataset.jsonl
Option 2: ada:ft-acme-inc-2022-06-25 + additional-training-dataset.jsonl
The documentation says nothing about which option is better in terms of which would yield better results.
However...
Choose Option 2
Why?
When training a fine-tuned model, the total tokens used will be billed according to our training rates.
If you choose Option 1, you'll pay for some tokens in your training dataset twice: first when fine-tuning with the initial training dataset, and again when fine-tuning with the bigger training dataset (i.e., bigger-training-dataset.jsonl = initial-training-dataset.jsonl + additional-training-dataset.jsonl).
It's better to continue fine-tuning from a fine-tuned model because you'll pay only for tokens in your additional training dataset.
Read more about fine-tuning pricing calculation.
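As a sketch, Option 2 with the legacy (pre-1.0) openai Python package that the quoted documentation corresponds to might look like this (file and model names are taken from the question; the learning-rate value is only a placeholder):

import openai  # legacy (pre-1.0) openai package

# upload the additional training data
upload = openai.File.create(file=open("mydata2K.jsonl", "rb"), purpose="fine-tune")

# continue fine-tuning from the already fine-tuned model
job = openai.FineTune.create(
    training_file=upload["id"],
    model="ada:ft-acme-inc-2022-06-25",
    # if mydata2K.jsonl is much smaller than the original data, the docs
    # suggest reducing learning_rate_multiplier by a factor of 2 to 4, e.g.:
    # learning_rate_multiplier=0.05,  # placeholder value
)
print(job["id"])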

Why can't TFBertForSequenceClassification.from_pretrained('bert-base-chinese') be used directly?

I want to do Chinese textual similarity with Hugging Face:
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
model = TFBertForSequenceClassification.from_pretrained('bert-base-chinese')
It doesn't work; the system reports:
Some weights of the model checkpoint at bert-base-chinese were not used when initializing TFBertForSequenceClassification: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-chinese and are newly initialized: ['classifier', 'dropout_37']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
But I can use Hugging Face to do named entity recognition:
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
model = TFBertForTokenClassification.from_pretrained("bert-base-chinese")
Does that mean Hugging Face hasn't provided Chinese sequence classification? If my guess is right, how can I solve this problem on Colab with only 12 GB of memory?
The reason is simple: the checkpoint has not been fine-tuned for the sequence classification task. When you load 'bert-base-chinese' into a sequence classification model, the pretraining heads ['nsp___cls', 'mlm___cls'] are discarded and the new classification head is initialized randomly.
It is only a warning, but it means the model will give essentially random predictions until that randomly initialized last layer is trained.
By the way, @andy, you didn't post the output for token classification; it should show a similar warning, with the ['classifier'] layer reported as randomly initialized.
Use a model that is already fine-tuned for the task, or fine-tune this loaded model yourself, as sketched below.
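A minimal sketch of fine-tuning the loaded model yourself (the sentence pairs and labels below are made-up toy data; real training would use your similarity dataset):

import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = TFBertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=2)

# toy sentence pairs; 1 = similar, 0 = not similar (made-up examples)
first = ["今天天气很好", "我喜欢猫"]
second = ["今天天气不错", "他在踢足球"]
labels = tf.constant([1, 0])

enc = tokenizer(first, second, padding=True, truncation=True, return_tensors="tf")

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
# with only 12 GB of memory, keep the batch size small
model.fit(dict(enc), labels, epochs=1, batch_size=2)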

Linear Chain CRF with Feature Function Filtering

I am working on collective classification of entities and am using the CRFClassifier class for sequence labelling. I have a requirement that a certain feature F_i should NOT be considered together with a certain class label C_i.
I have specified various flags in the property file for CRFClassifier (of Stanford CoreNLP), and NERFeatureFactory generates the features accordingly. Internally, I think it generates a total of L*N binary feature functions (indicator functions), where L = #classLabels and N = #features. Out of this cross product, I do not want to consider a few (feature, label) pairs. What is the best way to achieve this?
Note: I think the L*N functions are generated by getObjectiveFunction at the following location:
CRFClassifier {
    protected double[] trainWeights(int[][][][] data, .......) {
        CRFLogConditionalObjectiveFunction func =
            getObjectiveFunction(data, labels);
    }
}
The protected variable EHat in the class CRFLogConditionalObjectiveFunction contains the empirical counts for these L*N features.
For the combinations that I do not want in my classifier: would it be okay to explicitly set their empirical counts to 0 (in the EHat variable) before I call the minimizer? Would that be the same as saying that the combination does not exist in my model?
Does MALLET provide a way of doing this?

Laravel / Eloquent special relation type based on parsed string attribute

I have developed a system where various classes have attributes consisting of a custom formula. The formula can contain special tokens which refer to different types of object. For example an object of class FruitSalad may have the following attribute;
$contents = "[A12] + [B76]";
In somewhat abstract terms, this means "add apple 12 to banana 76". It can also get significantly more complex than that with as many as 15 or 20 references to other objects involved in one formula.
I have a trait which parses formulae such as this, and each time it finds a reference to a model (i.e. "[A12]") it gets it from the database with A::find(12) and adds it to an array of component objects which can be used by other processes later in the request.
So, in essence, it's a relationship. But instead of a pivot table to describe the relationship, there is a formula on the parent model which can include references to child models.
This is all working. Yay! But it's really inefficient because there are so many tiny queries to get single models as formulae are parsed. One request may quite easily result in hundreds of queries. Oops.
I see two potential options;
1. Get all my apples and bananas from the database at the start of the request and get them from an in-memory store instead of from the database when parsing a formula (is this the repository pattern??).
2. Create a custom relation type (something like hasManyFromFormula) which makes eager loading work so that the parsing becomes much simpler because the relevant apples and bananas would already be loaded into the parent model.
Is there a precedent for this? As for why I am doing it like this, it would be a bit tough to explain in brief, but suffice to say it is to support a highly configurable data retrieval system which supports as-yet-unknown input data configurations.
Help!
Thanks,
Geoff
I'm not completely sure if it is the best solution, but in the end I created a new directory class for basic components and then set it up in the app service provider as a singleton. The constructor for the directory class loads all models of several relevant classes and makes them available as collections throughout the app.

statsmodels - create model from params

I am trying to create an empty model from params saved from a previously trained model, but the constructor stubbornly wants me to provide both endogenous and exogenous variables, which I don't have. Is there any way to get around this?
For example, I only want to do:
logit = sm.Logit()
pred = logit.predict(params, X)
But the first line won't work.
No, this is not supported in statsmodels. Models are always associated with data.
However, for the use case of prediction, it is possible to pickle the model and, optionally, delete all full-length arrays (including the data) from the model instance and from the results instance before pickling. This doesn't work with formulas.
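A minimal sketch of that pickling approach, assuming a plain Logit fit on array data and statsmodels' pickle helpers (the toy data and file name are illustrative):

import numpy as np
import statsmodels.api as sm

# fit once on the training data you do have
rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(100, 3)))
y = (rng.random(100) > 0.5).astype(int)
results = sm.Logit(y, X).fit(disp=0)

# drop the full-length arrays (including the data) and pickle the results
results.remove_data()
results.save("logit_results.pickle")

# later, in another process: no endog/exog needed, just new exog for prediction
loaded = sm.load("logit_results.pickle")
X_new = sm.add_constant(rng.normal(size=(5, 3)))
pred = loaded.predict(X_new)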
On the other hand, since this is Python, there might be several ways to cheat, at your own risk.
It would be helpful if you opened an issue on GitHub (https://github.com/statsmodels/statsmodels/issues) with a description of your use case; it might be possible to get the relevant features into a future version.
