How to add weights in BERT loss function - huggingface-transformers

I have an unbalanced dataset of size N with the following classes:
class 1 - size 0.554*N
class 2 - size 0.271*N
class 3 - size 0.185*N
I'm trying to solve an NER task by fine-tuning the BERT model “dslim/bert-large-NER”, but during training my eval F1 score doesn't rise above 0.53.
How can I add class weights to the BERT loss function to overcome the low F1 score?
I tried fine-tuning other NER models from Hugging Face, but they didn't help.
I use the Trainer from Transformers to train the model.

As mentioned before, the BERT loss function is defined inside the model. If you want to modify the loss function, you need to define the BERT class again (i.e. subclass it). For example:
class Bert_modified(BertForTokenClassification):
    def forward(*some input parameters*):
        *important class code*
        loss_fct = CrossEntropyLoss(weight=class_weights_tensor)
        loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
        *important class code*
You can find BERT's classes in Transformers here.
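For illustration, here is a minimal sketch of such a subclass, assuming a hypothetical class_weights_tensor built from the class frequencies above; the exact forward signature and return format vary between transformers versions, so treat this as a starting point rather than a drop-in implementation:
import torch
from torch.nn import CrossEntropyLoss
from transformers import BertForTokenClassification

# hypothetical weights: inverse class frequencies for the three classes above
class_weights_tensor = torch.tensor([1 / 0.554, 1 / 0.271, 1 / 0.185])

class WeightedBertForTokenClassification(BertForTokenClassification):
    def forward(self, input_ids=None, attention_mask=None, token_type_ids=None,
                labels=None, **kwargs):
        # run the BERT encoder and the token-classification head as usual
        outputs = self.bert(input_ids, attention_mask=attention_mask,
                            token_type_ids=token_type_ids, **kwargs)
        sequence_output = self.dropout(outputs[0])
        logits = self.classifier(sequence_output)
        loss = None
        if labels is not None:
            # the only change: pass class weights to CrossEntropyLoss
            loss_fct = CrossEntropyLoss(weight=class_weights_tensor.to(logits.device))
            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
        # Trainer expects the loss first in the output when labels are provided
        return (loss, logits) if loss is not None else (logits,)
You can then pass an instance of this class to Trainer in place of the stock model. Newer versions of transformers also let you subclass Trainer and override its compute_loss method instead, which achieves the same effect without touching the model class.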

Related

Reduce the output layer size from XLTransformers

I'm running the following using the huggingface implementation:
t1 = "My example sentence is really great."
tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')
model = TransfoXLLMHeadModel.from_pretrained("transfo-xl-wt103")
encoded_input = tokenizer(t1, return_tensors='pt', add_space_before_punct_symbol=True)
output = model(**encoded_input)
tmp = output[0].detach().numpy()
print(tmp.shape)
>>> (1, 7, 267735)
With the goal of getting output embeddings that I'll use downstream.
The last dimension is substantially larger than I expected, and it looks like it is the size of the entire vocab_size rather than a reduction based on the ECL from the paper (which I may be misinterpreting).
What argument would I provide the model to reduce this layer size to a smaller dimensional space, something more like the basic BERT at 400 or 768 and still obtain good performance based on the pretrained embeddings?
That's because you used TransfoXLLMHeadModel, which predicts the next token and therefore projects onto the full vocabulary. You can use TransfoXLModel.from_pretrained("transfo-xl-wt103") instead; then output[0] is the last hidden state, which has the shape (batch_size, sequence_length, hidden_size).
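For reference, here is a minimal sketch of that change, reusing the setup from the question; the last dimension should then be the model's hidden size (its d_model) rather than the vocabulary size:
from transformers import TransfoXLTokenizer, TransfoXLModel
tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')
model = TransfoXLModel.from_pretrained('transfo-xl-wt103')
encoded_input = tokenizer("My example sentence is really great.",
                          return_tensors='pt', add_space_before_punct_symbol=True)
output = model(**encoded_input)
hidden = output[0].detach().numpy()
print(hidden.shape)  # (1, 7, hidden_size) instead of (1, 7, vocab_size)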

Is there a way to infer topic distributions on unseen document from gensim LDA pre-trained model using matrix multiplication?

Is there a way to get the topic distribution of an unseen document using a pretrained LDA model without using the LDA_Model[unseenDoc] syntax? I am trying to integrate my LDA model into a web application, and if there were a way to use matrix multiplication to get a similar result, then I could use the model in JavaScript.
For example, I tried the following:
import numpy as np
import gensim
from gensim.corpora import Dictionary
from gensim import models
import nltk
from nltk.stem import WordNetLemmatizer, SnowballStemmer
nltk.download('wordnet')

def Preprocesser(text_list):
    smallestWordSize = 3
    processedList = []
    for token in gensim.utils.simple_preprocess(text_list):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > smallestWordSize:
            processedList.append(StemmAndLemmatize(token))  # StemmAndLemmatize defined elsewhere
    return processedList

lda_model = models.LdaModel.load('LDAModel\GoldModel')  # Load pretrained LDA model
dictionary = Dictionary.load("ModelTrain\ManDict")      # Load dictionary the model was trained on

# Sample unseen doc to analyze
doc = "I am going to write a string about how I can't get my task executor \
to travel properly. I am trying to use the \
AGV navigator, but it doesn't seem to be working network. I have been trying\
to use the AGV Process flow but that isn't working either speed\
trailer offset I am now going to change this so I can see how fast it runs"

termTopicMatrix = lda_model.get_topics()  # Get term-topic matrix from pretrained LDA model
cleanDoc = Preprocesser(doc)              # Tokenize, lemmatize, clean and stem words
bowDoc = dictionary.doc2bow(cleanDoc)     # Create bow using dictionary
dictSize = len(termTopicMatrix[0])        # Get number of terms in dictionary
fullDict = np.zeros(dictSize)             # Initialize array of dictionary length
First = [first[0] for first in bowDoc]    # Get index of each term in bag of words
Second = [second[1] for second in bowDoc] # Get frequency of each term in bag of words
fullDict[First] = Second                  # Add word frequencies to full dictionary vector
print('Matrix Multiplication: \n', np.dot(termTopicMatrix, fullDict))
print('Conventional Syntax: \n', lda_model[bowDoc])
Output:
Matrix Multiplication:
[0.0283254 0.01574513 0.03669142 0.01671816 0.03742738 0.01989461
0.01558603 0.0370233 0.04648389 0.02887623 0.00776652 0.02147539
0.10045133 0.01084273 0.01229849 0.00743788 0.03747379 0.00345913
0.03086953 0.00628912 0.29406082 0.10656977 0.00618827 0.00406316
0.08775404 0.00785408 0.02722744 0.09957815 0.01669402 0.00744392
0.31177135 0.03063149 0.07211428 0.01192056 0.03228589]
Conventional Syntax:
[(0, 0.070313625), (2, 0.056414187), (18, 0.2016589), (20, 0.46500313), (24, 0.1589748)]
In the pretrained model there are 35 topics and 1155 words.
In the "Conventional Syntax" output, the first element of each tuple is the index of the topic and the second element is the probability of the topic. In the "Matrix Multiplication" version, the probability is the index and the value is the probability. Clearly the two don't match up.
For example, the lda_model[unseenDoc] shows that topic 0 has a 0.07 probability, but the matrix multiplication method says that topic has a 0.028 probability. Am I missing a step here?
You can review the full source code used by LDAModel's get_document_topics() method in your installation, or online at:
https://github.com/RaRe-Technologies/gensim/blob/e75f6c8e8d1dee0786b1b2cd5ef60da2e290f489/gensim/models/ldamodel.py#L1283
(It also makes use of the inference() method in the same file.)
It's doing a lot more scaling/normalization/clipping than your code, which is likely the cause of the discrepancy. But you should be able to examine, line by line, where your process and gensim's differ, and get the steps to match up.
It also shouldn't be hard to use the gensim code's steps as guidance for creating parallel Javascript code that, given the right parts of the model's state, can reproduce its results.
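For example, a minimal sketch of the main missing step (an assumption on my part, not part of the original answer): the conventional syntax runs a variational inference pass and then normalizes the resulting topic weights, rather than taking a single dot product. Using the lda_model and bowDoc from the question:
# gamma holds unnormalized per-topic weights from the variational inference step
gamma, _ = lda_model.inference([bowDoc])
doc_topics = gamma[0] / gamma[0].sum()  # normalize to a probability distribution
print(doc_topics)  # comparable to lda_model[bowDoc], before minimum_probability filtering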

using ols from statsmodels.formula.api with only a constant term?

I'd like to show students what happens when only a constant is used in a regression model. I specified one model as price ~ age for an OLS model of the price of used cars as a function of age plus a constant. Now I'd like to drop the age variable and just have the constant. How do I do this?
The formula fitting in statsmodels uses Patsy, which tries to mimic R-style model specifications.
Since you didn't specify a data source, I've taken a dataset from the statsmodels OLS guide to provide a worked example: can wealth explain lottery spending?
import statsmodels.api as sm
import statsmodels.formula.api as smf
# load example and trim to a few features
df = sm.datasets.get_rdataset("Guerry", "HistData").data
df = df[['Lottery', 'Literacy', 'Wealth', 'Region']].dropna()
# fit with y=mx + c
model1 = smf.ols(formula='Lottery ~ Wealth', data=df).fit()
print(model1.summary())
# fit with y=c (only an intercept)
model2 = smf.ols(formula='Lottery ~ 1', data=df).fit()
print(model2.summary())
For your question, a model with only an intercept is nothing more than the mean of the dependent variable, but presumably you are interested in techniques for comparing different models, so let's do a quick comparison to see whether the simpler model gives a better fit. One option is the F-test:
f_val, p_val, _ = model1.compare_f_test(model2)
print(f_val, p_val, p_val<0.01)
The p value is below the 1% significance level, so we conclude that the more complex model is "more correct" in this case.
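As a quick check of the claim above that an intercept-only model just predicts the mean (reusing model2 and df from the example; Patsy names the constant term 'Intercept'):
print(model2.params['Intercept'], df['Lottery'].mean())  # the two values should match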
For completeness, to specify a model without an intercept (useful e.g. if we have already mean-centered the data), we can exclude it with -1 in the formula:
# y = mx
model3 = smf.ols(formula='Lottery ~ Wealth -1', data=df).fit()
print(model3.summary())
f_val, p_val, _ = model1.compare_f_test(model3)
print(f_val, p_val, p_val<0.01)
Again, p_val is below the 1% significance level, so including both the intercept and the slope improves the model fit. (No multiple-testing correction here, but the p values are << 1%.)

Can I perform Keras training in a deterministic manner?

I'm using a Keras Sequential model where the inputs and labels are exactly the same each run. Keras is using a Tensorflow backend.
I've set the layer weight and bias initializers to 'zeros' and disabled batch shuffling during training:
model = Sequential()
model.add(Dense(128,
                activation='relu',
                kernel_initializer='zeros',
                bias_initializer='zeros'))
...
model.compile(optimizer='rmsprop', loss='binary_crossentropy')
model.fit(x_train, y_train,
          batch_size=128, verbose=1, epochs=200,
          validation_data=(x_validation, y_validation),
          shuffle=False)
I've also tried seeding NumPy's random number generator:
np.random.seed(7) # fix random seed for reproducibility
With the above in place I still receive different accuracy and loss values after training.
Am I missing something or is there no way to fully remove the variance between trainings?
Since this seems to be a real issue, as commented before, maybe you could manually initialize your weights instead of trusting the 'zeros' parameter passed in the layer constructor:
# where you see layers[0], it's possible that the correct layer is layers[1] - I can't test at this moment
weights = model.layers[0].get_weights()
ws = np.zeros(weights[0].shape)
bs = np.zeros(weights[1].shape)
model.layers[0].set_weights([ws, bs])
It seems the problem occurs in training and not initialization. You can check this by first initializing two models model1 and model2 and running the following code:
w1 = model1.get_weights()
w2 = model2.get_weights()
for i in range(len(w1)):
    w1i = w1[i]
    w2i = w2[i]
    assert np.allclose(w1i, w2i), (w1i, w2i)
    print("Weights %i were equal." % i)
print("All initial weights were equal.")
Even though all assertions passed, training model1 and model2 with shuffle=False yielded different models. That is, if I perform similar assertions on the weights of model1 and model2 after training, the assertions all fail. This suggests that the problem lies in randomness introduced during training.
As of this post I have not managed to figure out how to circumvent this.
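Not part of the answers above, but a commonly suggested starting point is to seed every random source Keras touches, not just NumPy; note that some GPU operations remain nondeterministic even then, so full run-to-run equality is still not guaranteed:
import os
import random
import numpy as np
import tensorflow as tf
os.environ['PYTHONHASHSEED'] = '0'  # ideally set before the interpreter starts
random.seed(7)                      # Python's built-in RNG
np.random.seed(7)                   # NumPy RNG (weight init, shuffling)
tf.set_random_seed(7)               # TF 1.x graph-level seed; use tf.random.set_seed(7) on TF 2.x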

GenSim Word2Vec unexpectedly pruning

My objective is to find a vector representation of phrases. Below is the code I have, which works partially for bigrams using the Word2Vec model provided by the Gensim library.
from gensim.models import word2vec, Phrases

def bigram2vec(unigrams, bigram_to_search):
    bigrams = Phrases(unigrams)
    model = word2vec.Word2Vec(sentences=bigrams[unigrams], size=20, min_count=1, window=4, sg=1, hs=1, negative=0, trim_rule=None)
    if bigram_to_search in model.vocab.keys():
        return model[bigram_to_search]
    else:
        return None
The problem is that the Word2Vec model is seemingly doing automatic pruning of some of the bigrams, i.e. len(model.vocab.keys()) != len(bigrams.vocab.keys()). I've tried adjusting various parameters such as trim_rule, min_count, but they don't seem to affect the pruning.
PS - I am aware that bigrams to look up need to be represented using underscore instead of space, i.e. proper way to call my function would be bigram2vec(unigrams, 'this_report')
Thanks to further clarification at the GenSim support forum, the solution is to set the appropriate min_count and threshold values for the Phrases being generated (see documentation for details about these parameters in the Phrases class). The corrected solution code is below.
from gensim.models import word2vec, Phrases

def bigram2vec(unigrams, bigram_to_search):
    bigrams = Phrases(unigrams, min_count=1, threshold=0.1)
    model = word2vec.Word2Vec(sentences=bigrams[unigrams], size=20, min_count=1, trim_rule=None)
    if bigram_to_search in model.vocab.keys():
        return model[bigram_to_search]
    else:
        return []
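A hypothetical usage example, assuming the same (older) gensim version as above, where Word2Vec still exposes .vocab and a size parameter, and where unigrams is a list of tokenized sentences:
sentences = [['this', 'report', 'is', 'great'],
             ['this', 'report', 'covers', 'the', 'results']]
vec = bigram2vec(sentences, 'this_report')  # returns the 20-dimensional vector or []
print(vec if len(vec) else 'bigram not found in the Word2Vec vocabulary')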
