How to know if HuggingFace's pipeline text input exceeds 512 tokens - huggingface-transformers

I've finetuned a Huggingface BERT model for Named Entity Recognition based on 'bert-base-uncased'. I perform inference like this:
from transformers import pipeline
ner_pipeline = pipeline('token-classification', model=model_folder, tokenizer=model_folder)
out = ner_pipeline(text, aggregation_strategy='simple')
I want to obtain results on very long texts and, since I know about the 512-token maximum for both training and inference, I split my texts into smaller chunks before passing them to the ner_pipeline.
But how do I split the text without tokenizing it myself just to check the length of each chunk? I want the chunks to be as long as possible, but at the same time I don't want to exceed the 512-token maximum and risk that no predictions are computed on what's left of the text.
Is there a way to know if the texts I'm feeding exceed the 512 maximum tokens?

Finding out whether a text exceeds 512 tokens is simply a matter of checking its tokenized output. For this purpose, you can use Hugging Face's AutoTokenizer class. For example,
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
sentence = "Sentence to check whether it exceeds 512 tokens"
tokenized_sentence = tokenizer.tokenize(sentence)
print(len(sentence.split()))       # number of whitespace-separated words
print(len(tokenized_sentence))     # number of wordpiece tokens (the [CLS] and [SEP] special tokens add two more)
You can try this on long documents and observe that at some point the tokenized length exceeds 512 tokens. This may not be a problem for text classification, but for a token classification task you may lose the entity labels on everything past the cutoff. So, before feeding long documents to your Transformer-based network, preprocess your texts with AutoTokenizer, find the points where the tokenized text reaches the model's maximum input size (e.g., 512), cut the text there, and create a new sample from the remaining part of the long document.
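As a concrete illustration, here is a minimal sketch of that preprocessing step, assuming the text can be split on whitespace and chunk boundaries are chosen greedily word by word (the function name and the example text are just placeholders):
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def split_into_chunks(text, max_tokens=512):
    # Leave room for the special tokens ([CLS], [SEP]) the tokenizer adds later.
    budget = max_tokens - tokenizer.num_special_tokens_to_add()
    chunks, current, current_len = [], [], 0
    for word in text.split():
        n = len(tokenizer.tokenize(word))
        if current and current_len + n > budget:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(word)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks

chunks = split_into_chunks("some very long document " * 500)
print([len(tokenizer.tokenize(c)) for c in chunks])  # each stays below the budget
Each chunk can then be passed to ner_pipeline separately.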

Related

What is the maximum text length in tokens that can be given as input for a summarisation task using sentence transformer models?

Most BERT models take a maximum input length of 512 tokens. When I used the sentence-transformer model multi-qa-distilbert-cos-v1 with bert-extractive-summarizer for a summarisation task, a text with 792 tokens was accepted by the model and the summary contained the last line of the original text. Usually the text after 512 tokens is truncated by the model and not considered for the NLP task. The documentation also states a max sequence length of 512 tokens. How is the model able to read beyond 512 tokens?

How to get immediate next word probability using GPT2 model?

I was trying the hugging face gpt2 model. I have seen the run_generation.py script, which generates a sequence of tokens given a prompt. I am aware that we can use GPT2 for NLG.
In my use case, I wish to determine the probability distribution for (only) the immediate next word following the given prompt. Ideally this distribution would be over the entire vocab.
For example, given the prompt: "How are ", it should give a probability distribution where "you" or "they" have high probability values and other vocab words have very low values.
How to do this using hugging face transformers? If it is not possible in hugging face, is there any other transformer model that does this?
You can have a look at how the generation script works with the probabilities.
GPT2LMHeadModel (as well as other "LMHead" models) returns a tensor that contains, for each position in the input, the unnormalized scores (logits) for what the next token might be. I.e., the model's output at the last position scores the token that would follow the prompt (assuming input_ids is a tensor with token indices from the tokenizer):
outputs = model(input_ids)                 # forward pass; outputs[0] holds the logits
next_token_logits = outputs[0][:, -1, :]   # logits for the token following the prompt, shape (batch, vocab)
You get the distribution by normalizing the logits using softmax. The indices in the last dimension of next_token_logits correspond to indices in the vocabulary that you get from the tokenizer object.
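Putting that together, a minimal sketch of the whole flow might look like this (using the stock gpt2 checkpoint and a toy prompt; adapt the names to your setup):
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer.encode("How are", return_tensors="pt")
with torch.no_grad():
    outputs = model(input_ids)
next_token_logits = outputs[0][:, -1, :]           # shape (1, vocab_size)
probs = torch.softmax(next_token_logits, dim=-1)   # distribution over the whole vocab

# Print the five most likely next tokens and their probabilities.
top_probs, top_ids = probs.topk(5, dim=-1)
for p, i in zip(top_probs[0], top_ids[0]):
    print(repr(tokenizer.decode([i.item()])), float(p))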
Selecting the last logits becomes tricky when you use a batch size bigger than 1 and sequences of different lengths. In that case, you would need to specify attention_mask in the model call to mask out padding tokens and then select each sequence's last non-padding logits, e.g. with torch.index_select on a flattened view. It is much easier to use either batch size 1 or a batch of equally long sequences.
You can use any autoregressive model in Transformers: there is distilGPT-2 (a distilled version of GPT-2), CTRL (which is basically GPT-2 trained with some additional "commands"), the original GPT (under the name openai-gpt), and XLNet (designed for contextual embeddings, but usable for generation in arbitrary order). There are probably more; you can browse the Hugging Face Model Hub.

Correct way to represent documents containing multiple sentences in gensim file-based training

I am trying to use gensim's file-based training (example from documentation below):
from multiprocessing import cpu_count
import gensim.downloader as api
from gensim.utils import save_as_line_sentence
from gensim.test.utils import get_tmpfile
from gensim.models import Word2Vec, Doc2Vec, FastText
# Convert any corpus to the needed format: 1 document per line, words delimited by " "
corpus = api.load("text8")
corpus_fname = get_tmpfile("text8-file-sentence.txt")
save_as_line_sentence(corpus, corpus_fname)
# Choose num of cores that you want to use (let's use all, models scale linearly now!)
num_cores = cpu_count()
# Train models using all cores
w2v_model = Word2Vec(corpus_file=corpus_fname, workers=num_cores)
d2v_model = Doc2Vec(corpus_file=corpus_fname, workers=num_cores)
ft_model = FastText(corpus_file=corpus_fname, workers=num_cores)
However, my actual corpus contains many documents, each containing many sentences.
For example, let's assume my corpus is the plays of Shakespeare - Each play is a document, each document has many many sentences, and I would like to learn embeddings for each play, but the word embeddings only from within the same sentence.
Since the file-based training is meant to be one document per line, I assume that I should put one play per line. However, the documentation for file-based-training doesn't have an example of any documents with multiple sentences.
Is there a way to peek inside the model to see the documents and word context pairs that have been found before they are trained?
What is the correct way to build this file, maintaining sentence boundaries?
Thank you
These algorithm implementations don't have any real understanding of, or dependence on, actual sentences. They just take texts – runs of word-tokens.
Often the texts provided to Word2Vec will be multiple sentences. Sometimes punctuation like sentence-ending periods is even retained as pseudo-words. (And when the sentences really were consecutive in the source data, the overlapping word-context windows between sentences may even be a benefit.)
So you don't have to worry about "maintaining sentence boundaries". Any texts you provide that are sensible units of words that really co-occur will work about as well. (Especially in Word2Vec and FastText, even changing your breaks between texts to be sentences, or paragraphs, or sections, or documents is unlikely to have very much effect on the final word-vectors – it's just changing a subset of the training contexts, and probably not in any way that significantly changes which words influence which other words.)
There is, however, another implementation limit in gensim that you should watch out for: each training text can be at most 10,000 tokens long, and if you supply larger texts, the extra tokens will be silently ignored.
So, be sure to use texts that are 10k tokens or shorter – even if you have to arbitrarily split longer ones. (Per above, any such arbitrary extra break in the token grouping is unlikely to have a noticeable effect on results.)
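For instance, a rough sketch of such an arbitrary split before writing the corpus file might look like this (the document list here is a toy placeholder):
from gensim.utils import save_as_line_sentence

def capped_texts(docs, max_len=10000):
    # Yield consecutive slices of at most max_len tokens from each document.
    for tokens in docs:
        for start in range(0, len(tokens), max_len):
            yield tokens[start:start + max_len]

docs = [["some", "very", "long", "document"] * 5000]   # one 20,000-token document
save_as_line_sentence(capped_texts(docs), "capped-corpus.txt")  # written as 2 lines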
However, this presents a special problem when using Doc2Vec in corpus_file mode, because in that mode you don't get to specify your preferred tags for a text. (A text's tag, in this mode, is essentially just its line number.)
In the original mode (an iterable of TaggedDocument objects), the workaround for this 10k-token limit was simply to break larger docs up into multiple docs, but re-use the same tag for all sub-documents from an original document. (This very closely approximates how a doc of any size would affect training.)
If you have documents with more than 10k tokens, I'd recommend either not using corpus_file mode, or figuring out some way to use logical sub-documents of fewer than 10k tokens, then perhaps modeling your larger docs as the set of their sub-documents, or otherwise adjusting your downstream tasks to work on the same sub-document units.
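Here is a rough sketch of that repeated-tag workaround in the plain iterable (non-corpus_file) mode, with a toy two-play corpus standing in for the real documents (assumes gensim 4.x attribute names such as model.dv):
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

plays = {
    "Hamlet": "to be or not to be that is the question".split(),
    "Macbeth": "double double toil and trouble fire burn and cauldron bubble".split(),
}

def to_tagged_docs(named_docs, max_len=10000):
    # Split each document into <=10k-token pieces that all share the original tag.
    for name, tokens in named_docs.items():
        for start in range(0, len(tokens), max_len):
            yield TaggedDocument(words=tokens[start:start + max_len], tags=[name])

corpus = list(to_tagged_docs(plays))
d2v_model = Doc2Vec(documents=corpus, vector_size=50, min_count=1, epochs=20)
print(d2v_model.dv["Hamlet"])   # one vector per original document tag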

How are word vectors co-trained with paragraph vectors in doc2vec DBOW?

I don't understand how word vectors are involved at all in the training process with gensim's doc2vec in DBOW mode (dm=0). I know that it's disabled by default with dbow_words=0. But what happens when we set dbow_words to 1?
In my understanding of DBOW, the context words are predicted directly from the paragraph vectors. So the only parameters of the model are the N p-dimensional paragraph vectors plus the parameters of the classifier.
But multiple sources hint that it is possible in DBOW mode to co-train word and doc vectors. For instance:
section 5 of An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation
this SO answer: How to use Gensim doc2vec with pre-trained word vectors?
So, how is this done? Any clarification would be much appreciated!
Note: for DM, the paragraph vectors are averaged/concatenated with the word vectors to predict the target words. In that case, it's clear that word vectors are trained simultaneously with document vectors, and there are N*p + M*q parameters plus those of the classifier (where M is the vocab size and q the word-vector dimensionality).
If you set dbow_words=1, then skip-gram word-vector training is added to the training loop, interleaved with the normal PV-DBOW training.
So, for a given target word in a text, first the candidate doc-vector is used (alone) to try to predict that word, with backpropagation adjustments then occurring to the model & doc-vector. Then, a bunch of the surrounding words are each used, one at a time in skip-gram fashion, to try to predict that same target word, with the follow-up adjustments made.
Then, the next target word in the text gets the same PV-DBOW plus skip-gram treatment, and so on, and so on.
As some logical consequences of this:
training takes longer than plain PV-DBOW - by about a factor equal to the window parameter
word-vectors overall wind up getting more total training attention than doc-vectors, again by a factor equal to the window parameter
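For reference, a tiny sketch of the two configurations being contrasted (toy corpus, illustrative hyperparameters):
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words=["the", "quick", "brown", "fox"], tags=[0]),
        TaggedDocument(words=["jumps", "over", "the", "lazy", "dog"], tags=[1])]

# Pure PV-DBOW: only doc-vectors (and the output layer) are trained; word vectors stay random.
pure_dbow = Doc2Vec(documents=docs, dm=0, dbow_words=0,
                    vector_size=50, window=5, min_count=1, epochs=20)

# PV-DBOW with interleaved skip-gram word training: word vectors are co-trained,
# at roughly `window`-times the cost of the pure mode.
dbow_plus_words = Doc2Vec(documents=docs, dm=0, dbow_words=1,
                          vector_size=50, window=5, min_count=1, epochs=20)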

How to rotate a word2vec onto another word2vec?

I am training multiple word2vec models with Gensim. Each of the word2vec will have the same parameter and dimension, but trained with slightly different data. Then I want to compare how the change in data affected the vector representation of some words.
But every time I train a model, the vector representation of the same word is wildly different. Its similarities to other words remain similar, but the whole vector space seems to be rotated.
Is there any way I can rotate both word2vec representations so that the same words occupy the same position in vector space, or are at least as close as possible?
Thanks in advance.
That the locations of words vary between runs is to be expected. There's no one 'right' place for words, just mutual arrangements that are good at the training task (predicting words from other nearby words) – and the algorithm involves random initialization, random choices during training, and (usually) multithreaded operation which can change the effective ordering of training examples, and thus final results, even if you were to try to eliminate the randomness by reliance on a deterministically-seeded pseudorandom number generator.
There's a class called TranslationMatrix in gensim that implements the learn-a-projection-between-two-spaces method, as used for machine-translation between natural languages in one of the early word2vec papers. It requires you to have some words that you specify should have equivalent vectors – an anchor/reference set – then lets other words find their positions in relation to those. There's a demo of its use in gensim's documentation notebooks:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/translation_matrix.ipynb
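A rough sketch of how that might look, assuming two saved models and a handful of shared anchor words (file names are placeholders; check the notebook above for the exact, current TranslationMatrix API):
from gensim.models import Word2Vec
from gensim.models.translation_matrix import TranslationMatrix

model_a = Word2Vec.load("word2vec_run_a.model")   # hypothetical saved models
model_b = Word2Vec.load("word2vec_run_b.model")

# Words assumed to exist in both vocabularies, used as the anchor/reference set.
anchor_words = ["hot", "cold", "good", "bad", "day", "night"]
word_pairs = [(w, w) for w in anchor_words]       # same word on both sides

# Learn a linear projection from model_a's space into model_b's space.
transmat = TranslationMatrix(model_a.wv, model_b.wv, word_pairs=word_pairs)
# Project words of interest from space A and look at their neighbours in space B.
print(transmat.translate(["tamale", "skiing"], topn=3))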
But, there are some other techniques you could also consider:
transform & concatenate the training corpora instead, so that you keep some words that are the same across all corpora (such as very frequent words) but make other words of interest different per segment. For example, you might leave words like "hot" and "cold" unchanged, but replace words like "tamale" or "skiing" with subcorpus-specific versions, like "tamale(A)", "tamale(B)", "skiing(A)", "skiing(B)". Shuffle all the data together for training in a single session, then check the distances/directions between "tamale(A)" and "tamale(B)", since each was only trained by its respective subset of the data. (It's still important to have many 'anchor' words, shared between the different sets, to force a correlation on those words, and thus a shared influence/meaning for the varying words.) See the first sketch below.
create a model for all the data, with a single vector per word. Save that model aside. Then re-load it, try re-training it with just subsets of the whole data, and check how much words move when trained on just those segments (see the second sketch below). It might again help comparability to hold certain prominent anchor words constant: there's an experimental property in model.trainables, with a name ending in _lockf, that lets you scale the updates to each word. If you set its values to 0.0, instead of the default 1.0, for certain word slots, those words can't be further updated. So after re-loading the model, you could 'freeze' your reference words by setting their _lockf values to 0.0, so that only the other words get updated by the secondary training, and they're still bound to have coordinates that make sense with regard to the unmoving anchor words. Read the source code to better understand how _lockf works.
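A toy sketch of the first technique, the per-subcorpus word tagging (corpus contents and the words of interest are illustrative):
from gensim.models import Word2Vec

words_of_interest = {"tamale", "skiing"}

def tag_corpus(sentences, label):
    # Give the words of interest a subcorpus-specific suffix; leave anchor words untouched.
    for sentence in sentences:
        yield [f"{w}({label})" if w in words_of_interest else w for w in sentence]

corpus_a = [["the", "tamale", "was", "hot"], ["skiing", "in", "cold", "weather"]]
corpus_b = [["a", "cold", "tamale"], ["hot", "weather", "ruins", "skiing"]]

combined = list(tag_corpus(corpus_a, "A")) + list(tag_corpus(corpus_b, "B"))
# In practice, shuffle `combined` before training so the subcorpora are interleaved.
model = Word2Vec(sentences=combined, vector_size=50, min_count=1, window=3)
print(model.wv.similarity("tamale(A)", "tamale(B)"))
And a toy sketch of the second technique, re-training a saved all-data model on one subset and measuring how far words move (the _lockf freezing of anchor words would slot in right after loading, but its exact attribute path is version-dependent, so it is omitted here):
import numpy as np
from gensim.models import Word2Vec

all_sentences = [["hot", "tamale", "sauce"], ["cold", "skiing", "mountain"],
                 ["hot", "weather"], ["cold", "weather"]]
base = Word2Vec(sentences=all_sentences, vector_size=50, min_count=1, window=3)
base.save("all_data_word2vec.model")

# Re-load a fresh copy and continue training on just one subset of the data.
subset_model = Word2Vec.load("all_data_word2vec.model")
subset = [["hot", "tamale", "sauce"], ["hot", "weather"]]
subset_model.train(subset, total_examples=len(subset), epochs=subset_model.epochs)

# Cosine similarity between each word's position before and after the extra training.
for word in ["hot", "tamale"]:
    v0, v1 = base.wv[word], subset_model.wv[word]
    print(word, float(np.dot(v0, v1) / (np.linalg.norm(v0) * np.linalg.norm(v1))))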
