BERT without positional embeddings - huggingface-transformers

I am trying to build a pipeline in HuggingFace which will not use the positional embeddings in BERT, in order to study the role of the embeddings for a particular use case. I have looked through the documentation and the code, but I have not been able to find a way to implement a model like that. Will I need to modify BERT source code, or is there a configuration I can fiddle around with?

A workaround is to set the position embedding layer to zeros. When you inspect the embeddings part of BERT, you can see that the position embeddings are a separate PyTorch module:
from transformers import AutoModel
bert = AutoModel.from_pretrained("bert-base-cased")
print(bert.embeddings)
BertEmbeddings(
(word_embeddings): Embedding(28996, 768, padding_idx=0)
(position_embeddings): Embedding(512, 768)
(token_type_embeddings): Embedding(2, 768)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
You can assign the position embedding parameters any value you want, including zeros, which effectively disables the position embeddings:
import torch
bert.embeddings.position_embeddings.weight.data = torch.zeros((512, 768))
If you plan to fine-tune the modified model, make sure the zeroed parameters do not get updated by freezing them (note that requires_grad lives on the weight tensor, not on the module):
bert.embeddings.position_embeddings.weight.requires_grad = False
Bypassing the position embeddings like this might work well when you train a model from scratch. When you work with a pre-trained model, removing these parameters might confuse the model quite a bit, so more fine-tuning data might be needed. In that case, there might be better strategies for replacing the position embeddings, e.g., using the average value across all positions.
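For instance, a minimal sketch of that averaging strategy, which replaces every one of the 512 position vectors with their mean, so the inputs keep the scale the model was trained with but carry no positional information:
import torch
from transformers import AutoModel

bert = AutoModel.from_pretrained("bert-base-cased")
with torch.no_grad():
    pe = bert.embeddings.position_embeddings.weight   # shape (512, 768)
    mean_pe = pe.mean(dim=0, keepdim=True)            # average over all positions
    pe.copy_(mean_pe.expand_as(pe))                   # every position now gets the same vector
bert.embeddings.position_embeddings.weight.requires_grad = False   # keep it frozen during fine-tuning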

Related

fine tuning word2vec on a specific article, using transfer learning

I am trying to fine-tune an existing model on a specific article. I have tried transfer learning using gensim's build_vocab, adding GloVe vectors (via glove2word2vec) to a base model I trained on the article, but build_vocab does not change the base model: it stays very small and no words are added to its vocabulary.
this is the code:
from gensim.test.utils import datapath, get_tmpfile
from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import KeyedVectors, Word2Vec

# load the GloVe model
glove_file = datapath("/content/glove.6B.200d.txt")
tmp_file = get_tmpfile("test_word2vec.txt")
_ = glove2word2vec(glove_file, tmp_file)
glove_vectors = KeyedVectors.load_word2vec_format(tmp_file)
(at this point, len(glove_vectors.wv.vocab) = 40000)
# create the basic model for the article
base_model = Word2Vec(size=300, min_count=5)
base_model.build_vocab([tokenizer.tokenize(data.text[0])])
total_examples = base_model.corpus_count
(at this point, len(base_model.wv.vocab) = 24)
# add GloVe's vocabulary & weights
base_model.build_vocab([list(glove_vectors.vocab.keys())], update=True)
(at this point, still len(base_model.wv.vocab) = 24)
# training
base_model.train([tokenizer.tokenize(good_trump.text[0])], total_examples=total_examples, epochs=base_model.epochs + 5)
base_model_wv = base_model.wv
I think that
base_model.build_vocab([list(glove_vectors.vocab.keys())], update=True)
does nothing, so there is no transfer learning.
Any recommendations?
I relied on this article for guidance...
Many articles at the 'Towards Data Science' site are very confused, to the point of misleading more than helping. Unfortunately, the article you've linked is a good example:
The author first uses an unsupported value (workers=-1) that manages to make his local-corpus training do nothing, and rather than discovering & fixing that error, incorrectly concludes he needs to use 'transfer learning'/'fine-tuning' instead. (He doesn't.)
He then tries to improvise a re-use of the GloVe vectors, but as you've noted, his build_vocab() only manages to add the word-tokens to the model's vocabulary. This operation does not copy over any of the actual vectors!
Then, by doing training in a model where the default workers=3 was still in effect, he finally does real training on just his own texts – no contribution from GloVe values at all. He attributes the improvement to GloVe, but really multiple mistakes have just cancelled each other.
I would avoid relying on a 'Towards Data Science' source if any other docs or tutorials are available.
Further, many who think they want to do re-use of someone else's pretrained vectors, with a small update from their own texts, should really just improve their own training corpus, so that they have one unified, evenly-trained model that covers all their needed words.
There's no explicit support for 'fine-tuning' in Gensim. Bold advanced users can try to cobble it together from other methods, and tampering with the model between usual steps, but I've never seen a well-characterized & evaluated process for doing so. (Lots of the people fumbling through the process aren't even doing a good check of end-quality versus other approaches, just noting some improvement on a few ad hoc, perhaps unrepresentative tests.)
Are you sure you need to do this? What was wrong with vectors taught on just your corpus? Might extending your corpus with extra texts to expand its vocabulary work as well or better?
Or, you could try translating the new domain words from your limited corpus & model into the same coordinate space as some older larger set of pretrained vectors that you like. There's an example of that process in a Gensim demo notebook using its utility TranslationMatrix class: https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/translation_matrix.ipynb
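If you do go the simpler route of one unified model, a minimal sketch of training on a combined corpus (the article_sentences and extra_sentences names are placeholders for your own tokenized texts, not from your code):
from gensim.models import Word2Vec

# combine the article's tokenized sentences with extra texts that cover the missing vocabulary
combined_corpus = article_sentences + extra_sentences   # each a list of token lists

model = Word2Vec(
    sentences=combined_corpus,
    size=300,       # 'size' in gensim 3.x; called 'vector_size' in gensim 4.x
    min_count=5,
    workers=3,
)
word_vectors = model.wv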

Dutch pre-trained model not working in gensim

When trying to load the fastText model (cc.nl.300.bin) in gensim I get the following error:
!wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.nl.300.bin.gz
!gunzip cc.nl.300.bin.gz
model = FastText_gensim.load_fasttext_format('cc.nl.300.bin')
model.build_vocab(cleaned_text, update=True)
AttributeError: 'FastTextTrainables' object has no attribute 'syn1neg'
The code goes wrong when building the vocab with my own dataset. The format of that dataset is all right, as I already used it to build and train other (not pre-trained) Word2Vec and FastText models.
I saw other had the same error on this blog, however their solution did not work for me: https://github.com/RaRe-Technologies/gensim/issues/2588
Also, I read somewhere that I should use 'load_facebook_model', but I was not able to import load_facebook_model at all. Is this even a good way to solve this problem?
Any other suggestions?
Are you sure you're using the latest version of Gensim, 4.0.1, with many improvements to the FastText implementation?
And in that version you will definitely want to use .load_facebook_model() to load a full .bin Facebook-format model:
https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.load_facebook_model
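For example, a minimal sketch of loading the full model and then updating it with your own data (cleaned_text stands for your own tokenized corpus, as in your snippet):
from gensim.models.fasttext import load_facebook_model

model = load_facebook_model('cc.nl.300.bin')   # full Facebook-format .bin model, including subword info
model.build_vocab(cleaned_text, update=True)   # vocabulary expansion - experimental, see the caveat below
model.train(cleaned_text, total_examples=len(cleaned_text), epochs=model.epochs)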
But also note: the post-training expansion of the vocabulary is best considered an advanced & experimental function. It may not offer any improvement on typical tasks – indeed, without careful consideration of tradeoffs & balancing the influence of later training against earlier training, it can make things worse.
A FastText model trained on a large, diverse corpus may already be able to synthesize better-than-nothing guess vectors for out-of-vocabulary words, via its subword vectors.
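For instance, once the Facebook model is loaded in Gensim, you can look up a word that was never in the training vocabulary and still get a subword-based guess vector (the word below is just a hypothetical out-of-vocabulary compound):
oov_vector = model.wv["zonnepaneleninstallatie"]
print(oov_vector.shape)   # (300,) - synthesized from character n-grams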
If there's some data with very-different words & word-senses you need to integrate, it will often be better to re-train from scratch, using an equal combination of all desired text influences. Then you'll be doing things in a standard and balanced way, without harder-to-tune and harder-to-evaluate improvised changes to usual practice.

Can you provide additional tags for documents using TaggedLineDocument?

When training a doc2vec model using a corpus in the TaggedDocument class, you can provide a list of tags. When the doc2vec model is trained it learns a vector representation for the tags. For example you could have one tag representing the document, and another representing some classification that can be shared between documents.
How would one provide additional tags when streaming a corpus using TaggedLineDocument?
The TaggedLineDocument class only considers documents to be one per line, with a single tag that is their line-number.
If you want more tags, you'll have to provide your own iterable which does that. It should only be a few lines of code, depending on where your other tags come from. You can use the source for TaggedLineDocument – which is itself only 9 lines of Python code – as a model to build on:
https://github.com/RaRe-Technologies/gensim/blob/e4199cb4e9a90df44ca59c1d0505b138caa21951/gensim/models/doc2vec.py#L1126
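For example, a minimal sketch of such an iterable, assuming a hypothetical get_category(line_no) lookup that supplies the extra shared tag for each document:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

class MultiTagLineDocument:
    """Like TaggedLineDocument, but yields an extra tag per document."""
    def __init__(self, path):
        self.path = path
    def __iter__(self):
        with open(self.path, encoding='utf8') as fin:
            for line_no, line in enumerate(fin):
                words = line.split()
                tags = [str(line_no), get_category(line_no)]   # line-number tag plus a shared category tag
                yield TaggedDocument(words, tags)

model = Doc2Vec(MultiTagLineDocument('corpus.txt'), vector_size=100, epochs=20)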
Note: while supplying more than one tag per document is a natural extension of the original 'Paragraph Vectors' approach, and often can provide benefits, sometimes it also 'dilutes' the salience of each tag's vector – which will be a special concern as the average number of tags per document grows, or the model acquires many more tags than unique documents. So be sure to comparatively evaluate whether any multiple-tag strategy is helping or hurting, in different modes, and whether things like pre-known categories work better as extra tags or as known labels for some later step.

Train Model with Token Features

I want to train a BERT-like model for Hebrew, where for every word I know:
Lemma
Gender
Number
Voice
And I would like to train a model where for each token these features are concatenated
Embedding(Token) = E1(Lemma):E2(Gender):E3(Number):E4(Voice)
Is there a way to do such a thing with the current huggingface transformers library?
Models in Huggingface's Transformers do not support factored inputs by default. As a workaround, you can embed the inputs yourself and bypass the embedding layer in BERT. Instead of providing input_ids when you call the model, you can provide inputs_embeds. The model will use the provided embeddings and add the position embeddings to them. Note that the provided embeddings need to have the same dimension as the rest of the model.
You need to have one embedding layer per input type (lemma, gender, number, voice), which also means having factor-specific vocabularies that will assign indices to the inputs that are used for the embedding lookup. It makes sense to have a larger embedding for lemmas than for the grammatical categories, which only have a few possible values each.
Then you just concatenate the embeddings, optionally project them, and feed them as inputs_embeds to the model.
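A minimal sketch of that idea (the FactoredEmbedder class, all vocabulary sizes, embedding dimensions, and the checkpoint name are illustrative assumptions, not part of the Transformers API):
import torch
import torch.nn as nn
from transformers import AutoModel

class FactoredEmbedder(nn.Module):
    # one embedding table per factor; sizes are illustrative
    def __init__(self, n_lemmas, n_genders, n_numbers, n_voices, hidden_size=768):
        super().__init__()
        self.lemma = nn.Embedding(n_lemmas, 512)
        self.gender = nn.Embedding(n_genders, 32)
        self.number = nn.Embedding(n_numbers, 32)
        self.voice = nn.Embedding(n_voices, 32)
        self.proj = nn.Linear(512 + 32 + 32 + 32, hidden_size)  # project to the model's hidden size

    def forward(self, lemma_ids, gender_ids, number_ids, voice_ids):
        x = torch.cat(
            [self.lemma(lemma_ids), self.gender(gender_ids),
             self.number(number_ids), self.voice(voice_ids)], dim=-1)
        return self.proj(x)  # shape: (batch, seq_len, hidden_size)

bert = AutoModel.from_pretrained("bert-base-multilingual-cased")
embedder = FactoredEmbedder(n_lemmas=50000, n_genders=4, n_numbers=4, n_voices=8)

# dummy ids standing in for indices from your own factor-specific vocabularies
lemma_ids = torch.randint(0, 50000, (2, 16))
gender_ids = torch.randint(0, 4, (2, 16))
number_ids = torch.randint(0, 4, (2, 16))
voice_ids = torch.randint(0, 8, (2, 16))
attention_mask = torch.ones(2, 16, dtype=torch.long)

inputs_embeds = embedder(lemma_ids, gender_ids, number_ids, voice_ids)
outputs = bert(inputs_embeds=inputs_embeds, attention_mask=attention_mask)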

Add domain-specific entities to spaCy or Stanford NLP training set

We would like to add some custom entities to the training set of either Stanford NLP or spaCy, before re-training the model. We are willing to label our custom entities, but we would like to add these to the existing training set, so as to not spend too much time labeling.
We assume that the NLP model was trained on a large labeled data set, which includes labels for words that are labeled "O" ("other", i.e. nothing of interest) as well as words that are labeled "DATE", "PERSON", "ORGANIZATION", etc. We have a custom set of ORGANIZATION words, but we would like to add these to all the other labeled data, before re-training the model.
Is this possible? How can we do this? Do we have to get the labeled dataset that the models were trained on, so we can add our own data? If so, how can we do that?
We have built prototypes using both Stanford NLP and spaCy, so an answer for either one works for us.
For spaCy, you should just be able to call nlp.update(). This will make a weight update against the current weights, allowing you to resume training. If you want to make many updates, you might want to parse some text with the original model and mix that through your training, to avoid the "catastrophic forgetting" problem.
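A minimal sketch of that resume-and-update loop, assuming spaCy v2.x (which accepts raw (text, annotations) pairs; v3 uses Example objects instead), with a one-example illustrative training set:
import random
import spacy

nlp = spacy.load("en_core_web_sm")   # the pre-trained pipeline you want to extend

# your labelled ORG examples, ideally mixed with text parsed by the original model
# to reduce catastrophic forgetting; offsets are character start/end of the entity
TRAIN_DATA = [
    ("Acme Rocket Corp hired ten engineers.", {"entities": [(0, 16, "ORG")]}),
]

other_pipes = [p for p in nlp.pipe_names if p != "ner"]
with nlp.disable_pipes(*other_pipes):        # only update the NER weights
    optimizer = nlp.resume_training()        # keep existing weights instead of re-initialising
    for _ in range(10):
        random.shuffle(TRAIN_DATA)
        for text, annotations in TRAIN_DATA:
            nlp.update([text], [annotations], sgd=optimizer, drop=0.35)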
You can use this entity tagger tool by helkaroui to create your own training set.
