why TFBertForSequenceClassification.from_pretrained('bert-base-chinese') can't use? - huggingface-transformers

I want to do chinese Textual Similarity with huggingface:
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
model = TFBertForSequenceClassification.from_pretrained('bert-base-chinese')
It doesn't work, system report errors:
Some weights of the model checkpoint at bert-base-chinese were not used when initializing TFBertForSequenceClassification: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-chinese and are newly initialized: ['classifier', 'dropout_37']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
But I can use huggingface to do name entity:
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
model = TFBertForTokenClassification.from_pretrained("bert-base-chinese")
Does that mean huggingface haven't done chinese sequenceclassification? If my judge is right, how to sove this problem with colab with only 12G memory?

The reason is simple. The model has not been fine-tuned for the Sequence classification task hence when you try to load the 'bert-base-chinese' model over a Sequence classification model. It updates the rest of the layers ['nsp___cls', 'mlm___cls'] randomly.
And it's a warning which means the model will be giving random results due to the random last layer initialization.
BTW #andy you didn't upload the output for token classification? It should also show a similar warning but with ['classifier'] layer as randomly initiated.
Do use a fine-tuned model, else you would need to fine-tune this loaded model.

Related

How to change the task of a pretrained model without jeopardizing the capacity of it?

I want to build a text2text model. Specifically, I want to transfer some automatically generated scrabbling text pieces into a smooth paragraph within the same language. I've already prepared the text inputs and outputs. So corpus is not the primary problem now.
I want to use hugging face models like:
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForMaskedLM.from_pretrained("bert-base-chinese")
because it has already obtained the capacity to generate the language, the model is made for masked language, and there's no mature task like mine as it is really customized. So how could I use the hugging face masked language model as a base text2text model without jeopardizing its capacity? I want to fine-tune it to achieve that task/goal. I want to know how.

GPT-3 Fine Tune a Fine Tuned Model?

The OpenAI documentation for the model attribute in the fine-tune API states a bit confusingly:
model
The name of the base model to fine-tune. You can select one of "ada", "babbage", "curie", "davinci", or a fine-tuned model created after 2022-04-21.
My question: is it better to fine-tune a base model or a fine-tuned model?
I created a fine-tune model from ada with file mydata1K.jsonl:
ada + mydata1K.jsonl --> ada:ft-acme-inc-2022-06-25
Now I have a bigger file of samples mydata2K.jsonl that I want to use to improve the fine-tuned model.
In this second round of fine-tuning, is it better to fine-tune ada again or to fine-tune my fine-tuned model ada:ft-acme-inc-2022-06-25? I'm assuming this is possible because my fine tuned model is created after 2022-04-21.
ada + mydata2K.jsonl --> better-model
or
ada:ft-acme-inc-2022-06-25 + mydata2K.jsonl --> even-better-model?
If you read the Fine-tuning documentation as of Jan 4, 2023, the only part talking about "fine-tuning a fine-tuned model" is the following part under Advanced usage:
Continue fine-tuning from a fine-tuned model
If you have already fine-tuned a model for your task and now have
additional training data that you would like to incorporate, you can
continue fine-tuning from the model. This creates a model that has
learned from all of the training data without having to re-train from
scratch.
To do this, pass in the fine-tuned model name when creating a new
fine-tuning job (e.g., -m curie:ft-<org>-<date>). Other training
parameters do not have to be changed, however if your new training
data is much smaller than your previous training data, you may find it
useful to reduce learning_rate_multiplier by a factor of 2 to 4.
Which option to choose?
You're asking about two options:
Option 1: ada + bigger-training-dataset.jsonl
Option 2: ada:ft-acme-inc-2022-06-25 + additional-training-dataset.jsonl
The documentation says nothing about which option is better in terms of which would yield better results.
However...
Choose Option 2
Why?
When training a fine-tuned model, the total tokens used will be billed
according to our training rates.
If you choose Option 1, you'll pay for some tokens in your training dataset twice. First when doing fine-tuning with initial training dataset, second when doing fine-tuning with bigger training dataset (i.e., bigger-training-dataset.jsonl = initial-training-dataset.jsonl + additional-training-dataset.jsonl).
It's better to continue fine-tuning from a fine-tuned model because you'll pay only for tokens in your additional training dataset.
Read more about fine-tuning pricing calculation.

Difference between from_config and from_pretrained in HuggingFace

num_labels = 3 if task.startswith("mnli") else 1 if task=="stsb" else 2
preconfig = DistilBertConfig(n_layers=6)
model1 = AutoModelForSequenceClassification.from_config(preconfig)
model2 = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)
I am modifying this code (modified code is provided above) to test DistilBERT transformer layer depth size via from_config since from my knowledge from_pretrained uses 6 layers because in the paper section 3 they said:
we initialize the student from the teacher by taking one layer out of two
While what I want to test is various sizes of layers. To test whether both are the same, I tried running the from_config
with n_layers=6 because based on the documentation DistilBertConfig the n_layers is used to determine the transformer block depth. However as I run model1 and model2 I found that with SST-2 dataset, in accuracy:
model1 achieved only 0.8073
model2 achieved 0.901
If they both behave the same I expect the result to be somewhat similar but 10% drop is a significant drop, therefore I believe there ha to be a difference between the functions. Is there a reason behind the difference of the approach (for example model1 has not yet applied hyperparameter search) and is there a way to make both functions behave the same? Thank you!
The two functions you described, from_config and from_pretrained, do not behave the same. For a model M, with a reference R:
from_config allows you to instantiate a blank model, which has the same configuration (the same shape) as your model of choice: M is as R was before training
from_pretrained allows you to load a pretrained model, which has already been trained on a specific dataset for a given number of epochs: M is as R after training.
To cite the doc, Note: Loading a model from its configuration file does not load the model weights. It only affects the model’s configuration. Use from_pretrained() to load the model weights.

What is the purpose of EventMention in Stanford-NLP?

When using the pipeline annotator, relation, I get back appropriate RelationMention objects for each sentence. These are binary type objects with two entity mentions and a corresponding relation type.
However, in the code, I also see EventMention objects which can be obtained from the sentence in much the same way. In the class MachineReadingProperties, I see that extraction of relations and extraction of events both default to true. However, I am only seeing generated relations and not generated events.
I can find no mention of events in the Stanford documentation, nor does the page describing the relation annotator or how to train a custom relation model describe it. There are no links to research papers on the event portion, just the link to the Roth and Yih paper on relations.
So do events work with the relation annotator and if so are there any more documents describing them?
From tracing the code, I see that EventMention is only used in conjunction with the ACE2005 code. It does not appear to be connected with any of the pipeline annotators and cannot be directly created in that manner.

Laravel / Eloquent special relation type based on parsed string attribute

I have developed a system where various classes have attributes consisting of a custom formula. The formula can contain special tokens which refer to different types of object. For example an object of class FruitSalad may have the following attribute;
$contents = "[A12] + [B76]";
In somewhat abstract terms, this means "add apple 12 to banana 76". It can also get significantly more complex than that with as many as 15 or 20 references to other objects involved in one formula.
I have a trait which passes formulae such as this and each time it finds a reference to a model (i.e. "[A12]") it gets it from the database with A::find(12) and adds it to an array of component objects which can be used for other processes later on in the request.
So, in essence, it's a relationship. But instead of a pivot table to describe the relationship, there is a formula on the parent model which can include references to child models.
This is all working. Yay! But it's really inefficient because there are so many tiny queries to get single models as formulae are parsed. One request may quite easily result in hundreds of queries. Oops.
I see two potential options;
1. Get all my apples and bananas from the database at the start of the request and get them from an in-memory store instead of from the database when parsing a formula (is this the repository pattern??).
2. Create a custom relation type (something like hasManyFromFormula) which makes eager loading work so that the parsing becomes much simpler because the relevant apples and bananas would already be loaded into the parent model.
Is there a precedent for this? As for why I am doing it like this, it would a bit tough to explain in brief but suffice to say it is to support a highly configurable data retrieval system which supports as-yet unknown input data configurations.
Help!
Thanks,
Geoff
Am not completely sure if it is the best solution, but in the end I created a new directory class for basic components and then set it up in the app service provider as a singleton. The constructor for the directory class loaded all models of several relevant classes and made them available as collections throughout the app.

Resources