How to fix max_length and max_features? - text-classification

I am working on text classification using the pre-trained language model BERT, and I don't understand how to set parameters like max_features and max_length. In other words, how can I choose values of max_features and max_length that give good model performance (good accuracy)?
Thank you.
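A common way to choose max_length is to look at the token-length distribution of your own corpus and pick a value that covers most examples (for instance the 95th percentile), since max_length is simply the length BERT's tokenizer truncates or pads every input to. Below is a minimal sketch of that idea, assuming a bert-base-uncased tokenizer and a placeholder list of texts:

import numpy as np
from transformers import BertTokenizerFast

# Placeholder texts; substitute your own training corpus.
texts = ["This is a short example.", "A much longer document with many more words in it.", "Another sample text."]

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# Token count per example (including [CLS]/[SEP]) to see how long inputs really are.
lengths = [len(tokenizer(t)["input_ids"]) for t in texts]
print("95th percentile length:", int(np.percentile(lengths, 95)))

# Use the chosen value as the truncation/padding length when encoding for BERT.
encoded = tokenizer(texts, max_length=128, truncation=True, padding="max_length", return_tensors="pt")
print(encoded["input_ids"].shape)  # (num_examples, 128)

Note that max_features usually refers to the vocabulary cap of a bag-of-words vectorizer (for example scikit-learn's TfidfVectorizer); BERT itself uses a fixed WordPiece vocabulary, so max_features only applies if you are building such a vectorizer alongside or instead of BERT.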

Related

DistilBert for self-supervision - switch heads for pre-training: MaskedLM and SequenceClassification

Say I want to train a model for sequence classification. And so I define my model to be:
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")
My question is: what would be the optimal way to pre-train this model on the masked language modeling task? After pre-training, I would like the model to train on the downstream task of sequence classification.
My understanding is that I can somehow swap the head of my model with that of a DistilBertForMaskedLM for pre-training, and then switch back for the original downstream task, though I haven't figured out whether this is indeed optimal or how to write it.
Does Hugging Face offer any built-in function that accepts the input ids and a percentage of (non-pad) tokens to mask, and simply trains the model?
Thank you in advance
I've tried to implement this myself, and while it does seem to work, it is extremely slow. I figured there might already be an implemented solution instead of my trying to optimize my own code.
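For the masking part specifically, the transformers library provides DataCollatorForLanguageModeling, which masks a configurable percentage of non-special tokens for you, so there is no need to implement the masking loop by hand. Here is a rough sketch of the two-stage setup; the checkpoint names, tokenized_dataset, and TrainingArguments values are placeholders rather than a recommended recipe:

from transformers import (
    DistilBertForMaskedLM,
    DistilBertForSequenceClassification,
    DistilBertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")

# Stage 1: continue pre-training with masked language modeling.
mlm_model = DistilBertForMaskedLM.from_pretrained("distilbert-base-uncased")
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=mlm_model,
    args=TrainingArguments(output_dir="mlm-checkpoint", num_train_epochs=1),
    data_collator=collator,
    train_dataset=tokenized_dataset,  # your tokenized, unlabeled corpus (placeholder)
)
trainer.train()
trainer.save_model("mlm-checkpoint")

# Stage 2: load the saved weights into a sequence classification model. The shared
# transformer weights are reused; only the classification head is newly initialized.
clf_model = DistilBertForSequenceClassification.from_pretrained("mlm-checkpoint", num_labels=2)

The head switching happens implicitly: saving the MaskedLM model and re-loading the checkpoint with the SequenceClassification class keeps the base DistilBERT weights and discards the MLM head.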

If I train a custom tokenizer on my dataset, I would still be able to leverage pre-trained model weights

This is a declaration, but I'm not sure it is correct. I can elaborate.
I have a considerably large dataset (23 GB). I'd like to pre-train RoBERTa-base or XLM-RoBERTa-base so the language model fits better for use in further downstream tasks.
I know I can just run it against my dataset for a few epochs and get good results. But what if I also train the tokenizer to generate new vocab and merges files? Will the weights from the pre-trained model I started from still be usable, or will the new set of tokens demand complete training from scratch?
I'm asking this because maybe some layers can still contribute knowledge, so the final model would have the best of both worlds: a tokenizer that fits my dataset, and the weights from previous training.
Does that make sense?
In short: no.
You cannot use your own pretrained tokenizer with a pretrained model. The reason is that the vocabulary of your tokenizer and the vocabulary of the tokenizer used to pretrain the model are different. Thus a word-piece token that is present in your tokenizer's vocabulary may not be present in the pretrained model's vocabulary.
Detailed answers can be found here.
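To make the mismatch concrete, here is a small sketch: the model's input embedding matrix has exactly one row per token id of the original tokenizer, so ids produced by a tokenizer trained from scratch on your corpus would index into unrelated (or out-of-range) rows.

from transformers import RobertaModel, RobertaTokenizerFast

model = RobertaModel.from_pretrained("roberta-base")
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")

# The embedding matrix is sized for the original vocabulary.
emb = model.get_input_embeddings().weight
print(emb.shape[0], tokenizer.vocab_size)  # rows correspond to roberta-base's vocab

# A tokenizer trained from scratch assigns different ids to different strings,
# so feeding its ids into this embedding matrix would look up the wrong vectors.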

Transformer: Which parameters are learned during training process?

I have some questions after trying to read and understand the Transformer paper "Attention is all you need":
Which parameters exactly does the Transformer model learn during training, given that the attention weight matrix is computed on the fly from softmax(QK^T/√d_k)? The only trained parameters I know of are the linear projections applied to the input before multi-head attention and the weights inside the FFN. Are there any other parameters? I would appreciate a clear and unambiguous summary.
What is the role of the FFN in this model? How does it process the data, and why do we need it? I would appreciate a simple and direct explanation.
Please forgive my grammar mistakes, since English is not my native language. Thank you so much.
The parameters are the weights of the linear layers; refer to this question.
Take a look at this answer.
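One way to see the full set of learned parameters is to instantiate a single encoder layer and list its parameters: the attention weights softmax(QK^T/√d_k) are computed on the fly, but the projection matrices W_Q, W_K, W_V, W_O, the FFN weights, and the layer-norm scales and biases are all trained (plus the token and position embeddings in a full model). A small PyTorch sketch with arbitrary layer sizes:

import torch.nn as nn

# A single Transformer encoder layer: self-attention + feed-forward network.
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048)

for name, p in layer.named_parameters():
    print(f"{name:30s} {tuple(p.shape)}")

# Typical output includes:
#   self_attn.in_proj_weight  (1536, 512)  -> stacked W_Q, W_K, W_V projections
#   self_attn.out_proj.weight (512, 512)   -> W_O
#   linear1 / linear2 weights and biases   -> the FFN
#   norm1 / norm2 weight and bias          -> layer normalization parameters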

How do you treat multi-class classification use case?

I have a list of labelled texts. Some have one label, others have two, and some even have three. Do you treat this as a multi-class classification problem?
The type of classification problem to solve depends on what your goal is. I don't know exactly what type of problem you are trying to solve, but from the form of the data I presume you are talking about a multi-label classification problem.
In any case, let's make some clarifications:
Multi-class classification:
you can have many classes (dog, cat, bear, ...), but each sample can be assigned to only one class; a dog cannot be a cat.
Multi-label classification:
the goal of this approach is to assign a set of labels to each sample; in the text classification scenario, for example, the phrase "Today the weather is sunny" might be assigned the set of labels ["weather", "good"].
So, if you need to assign each sample to exactly one class, based on some metric that may for example be tied to the labels, you should use a multi-class algorithm;
but if your goal is predicting the labels that are most appropriate for your sample (text tagging, for example), then we are talking about a multi-label classification problem.
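A minimal sketch of the multi-label setup with scikit-learn (the texts and label sets below are made-up examples): labels are encoded as a binary indicator matrix, and a one-vs-rest classifier predicts each label independently.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

texts = [
    "Today the weather is sunny",
    "Heavy rain expected tomorrow",
    "Great match, our team won",
]
labels = [["weather", "good"], ["weather"], ["sports", "good"]]

# Multi-label: each sample gets a 0/1 indicator per label and may have several 1s.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)

clf = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LogisticRegression()))
clf.fit(texts, Y)

pred = clf.predict(["sunny day for the match"])
print(mlb.inverse_transform(pred))  # a set of predicted labels, possibly more than one

In the multi-class case you would instead keep a single class per sample and fit an ordinary classifier on those single labels.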

Machine Learning/Artificial Intelligence - Classify column based on the value / pattern

I have been trying some frameworks and algorithms, and I can't find one that does what I want, which is to classify a column of data based on its values.
I tried to use the Bayes algorithm, but it isn't very precise, because I can't expect the exact data being searched for to be in the training set; I can only expect the pattern to be there.
I don't have a background in Machine Learning / AI, but I was looking for a working example before really going deeper into the implementation.
I built a smaller ARFF file to illustrate. I also tried lots of Weka classification algorithms, but none of them gave me good results.
@relation recommend
@attribute class {name,email,taxid,phone}
@attribute text string
@data
name,'Erik Kolh'
name,'Eric Candid'
name,'Allan Pavinan'
name,'Jubaru Guttenberg'
name,'Barabara Bere'
name,'Chuck Azul'
email,'erik@gmail.com'
email,'steven@spielberg.com'
email,'dogs@cats.com'
taxid,'123611216'
taxid,'123545413'
taxid,'562321677'
taxid,'671312678'
taxid,'123123216'
phone,'438-597-7427'
phone,'478-711-7678'
phone,'321-651-5468'
My expectation is to train on a huge dataset like the one above and get recommendations based on the pattern, e.g.:
joao@bing.com -> email
Joao Vitor -> name
400-123-5519 -> phone
Can you please suggest any algorithms, examples or ideas to research?
I couldn't find a good fit; maybe it's just my lack of the right vocabulary.
Thank you!
What you are trying to do is called named entity recognition (NER). Weka is most likely not of much help here. The library Mallet (http://mallet.cs.umass.edu) might be a good fit. I would recommend a Conditional Random Field (CRF) based approach.
If you would like to stay with Weka, you need to change your feature space; then Naive Bayes will do OK on your data as presented.
E.g., add features for (see the sketch after this list):
whether the word contains only letters
whether it is alphanumeric
whether it is numeric
the number of digits
whether it starts capitalized
... (just be creative)
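As a rough illustration of that feature-space idea, in Python/scikit-learn rather than Weka and with made-up example strings, each value can be turned into a handful of pattern features before training a simple classifier:

from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

def pattern_features(value):
    # Describe the *shape* of a value rather than its literal content.
    return {
        "only_letters": value.replace(" ", "").isalpha(),
        "alphanumeric": value.isalnum(),
        "only_digits": value.isdigit(),
        "digit_count": sum(c.isdigit() for c in value),
        "has_at_sign": "@" in value,
        "has_dash": "-" in value,
        "starts_capitalized": value[:1].isupper(),
    }

# Tiny made-up training set mirroring the ARFF example above.
values = ["Erik Kolh", "erik@gmail.com", "123611216", "438-597-7427"]
classes = ["name", "email", "taxid", "phone"]

clf = make_pipeline(DictVectorizer(sparse=False), GaussianNB())
clf.fit([pattern_features(v) for v in values], classes)

for v in ["Joao Vitor", "joao@bing.com", "400-123-5519"]:
    print(v, "->", clf.predict([pattern_features(v)])[0])

With pattern features like these, unseen values that share the same shape as the training examples should map to the expected column type, which is exactly the "pattern, not exact value" behaviour asked for above.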
