What is the maximum text length in tokens that can be given as input for a summarisation task using a sentence transformer model - huggingface-transformers

Most BERT models take a maximum input length of 512 tokens. When I used the sentence transformer multi-qa-distilbert-cos-v1 model with bert-extractive-summarizer for a summarisation task, a text with 792 tokens was accepted by the model and the summary contained the last line of the original text. Usually, text beyond 512 tokens is truncated by the model and not considered for the NLP task. The documentation also states a maximum sequence length of 512 tokens. How is the model able to read beyond 512 tokens?
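For reference, the declared limit can be checked programmatically. A minimal sketch, assuming the sentence-transformers package and the model name mentioned above:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("multi-qa-distilbert-cos-v1")
# The model card reports 512; longer inputs are truncated by default when encoding.
print(model.max_seq_length)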

Related

How to know if HuggingFace's pipeline text input exceeds 512 tokens

I've finetuned a Huggingface BERT model for Named Entity Recognition based on 'bert-base-uncased'. I perform inference like this:
from transformers import pipeline
ner_pipeline = pipeline('token-classification', model=model_folder, tokenizer=model_folder)
out = ner_pipeline(text, aggregation_strategy='simple')
I want to obtain results on very long texts, and since I know of the 512-token maximum capacity for both training and inference, I split my texts into smaller chunks before passing them to the ner_pipeline.
But, how do I split the text without actually tokenizing the texts myself in order to check for the length of each chunk? I want to make them as long as possible, but at the same time I don't want to exceed the maximum 512 tokens, risking that no predictions are computed on what's left of the sentence.
Is there a way to know if the texts I'm feeding exceed the 512 maximum tokens?
Finding out whether tokenized text exceeds 512 tokens is simply a matter of checking its tokenized length. For this purpose, you can use the AutoTokenizer class from Hugging Face. For example,
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
sentence = "Sentence to check whether it exceeds 512 tokens"
tokenized_sentence = tokenizer.tokenize(sentence)
print(len(sentence.split())) # number of whitespace-separated words in the sentence
print(len(tokenized_sentence)) # number of tokens after tokenization
You can try this on long documents and observe that at some point the tokenized length exceeds 512 tokens. This may not be a problem for text classification, but you may lose entity labels in a token-classification task. So, before feeding long documents to your Transformer-based network, preprocess the texts with AutoTokenizer, find the points where the tokenized text reaches the model's maximum input size (e.g., 512 tokens), cut the text at those points, and create new samples from the remaining parts of the long document.
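A minimal sketch of that chunking approach, reusing the bert-base-uncased tokenizer from above (the split_into_chunks helper and the 510-token budget are illustrative assumptions; adjust them to your model):
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
max_len = 512 - 2  # reserve room for the [CLS] and [SEP] tokens the model adds

def split_into_chunks(text, tokenizer, max_len):
    chunks, current, current_len = [], [], 0
    for word in text.split():
        n_tokens = len(tokenizer.tokenize(word))   # wordpiece count for this word
        if current and current_len + n_tokens > max_len:
            chunks.append(" ".join(current))        # close the chunk before it overflows
            current, current_len = [], 0
        current.append(word)
        current_len += n_tokens
    if current:
        chunks.append(" ".join(current))
    return chunks

long_text = " ".join(["example"] * 2000)            # stand-in for your long document
for chunk in split_into_chunks(long_text, tokenizer, max_len):
    print(len(tokenizer.tokenize(chunk)))           # each chunk should stay at or below 510 tokens
Splitting at word boundaries keeps each chunk decodable back into plain text, so the chunks can be passed straight to the pipeline without re-tokenizing by hand.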

How to get immediate next word probability using GPT2 model?

I was trying the hugging face gpt2 model. I have seen the run_generation.py script, which generates a sequence of tokens given a prompt. I am aware that we can use GPT2 for NLG.
In my use case, I wish to determine the probability distribution for (only) the immediate next word following the given prompt. Ideally this distribution would be over the entire vocab.
For example, given the prompt "How are ", it should give a probability distribution where "you" or "they" have relatively high probabilities and other vocabulary words have very low ones.
How to do this using hugging face transformers? If it is not possible in hugging face, is there any other transformer model that does this?
You can have a look at how the generation script works with the probabilities.
GPT2LMHeadModel (as well as the other "LMHead" models) returns a tensor that contains, at each input position, the unnormalized probabilities (logits) of what the next token might be. I.e., the model's output at the last position gives the distribution over the next token, once normalized (assuming input_ids is a tensor with token indices from the tokenizer):
outputs = model(input_ids)
next_token_logits = outputs[0][:, -1, :]
You get the distribution by normalizing these logits with a softmax. The indices in the last dimension of next_token_logits correspond to indices in the vocabulary that you get from the tokenizer object.
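Putting it together, a minimal sketch (using the standard GPT-2 model and tokenizer classes; the prompt and top-k value are arbitrary) that prints the most probable next tokens:
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer.encode("How are", return_tensors="pt")   # shape [1, seq_len]
with torch.no_grad():
    outputs = model(input_ids)

next_token_logits = outputs[0][:, -1, :]                        # logits for the next token, shape [1, vocab_size]
next_token_probs = torch.softmax(next_token_logits, dim=-1)     # normalize into a distribution over the vocab

top_probs, top_ids = torch.topk(next_token_probs[0], k=5)       # 5 most probable next tokens
for prob, token_id in zip(top_probs, top_ids):
    print(tokenizer.decode([int(token_id)]), float(prob))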
Selecting the last logits becomes tricky when you use a batch size bigger than 1 and sequences of different lengths. In that case, you need to pass attention_mask in the model call to mask out the padding tokens and then select each sequence's last real logits, e.g. with torch.index_select. It is much easier to use either batch size 1 or batches of equally long sequences.
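For completeness, a hedged sketch of the batched case (this uses advanced indexing rather than torch.index_select, and reuses GPT-2's EOS token as the padding token, a common workaround since GPT-2 ships without one):
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token            # GPT-2 has no pad token by default
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

batch = tokenizer(["How are", "The weather today is"], return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"])[0]

last_positions = batch["attention_mask"].sum(dim=1) - 1                     # index of each sequence's last real token
next_token_logits = logits[torch.arange(logits.size(0)), last_positions]    # shape [batch, vocab_size]
next_token_probs = torch.softmax(next_token_logits, dim=-1)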
You can use any autoregressive model in Transformers: there is distilGPT-2 (a distilled version of GPT-2), CTRL (which is basically GPT-2 trained with some additional "commands"), the original GPT (under the name openai-gpt), and XLNet (designed for contextual embeddings, but usable for generation in arbitrary order). There are probably more; you can browse the Hugging Face Model Hub.

How are word vectors co-trained with paragraph vectors in doc2vec DBOW?

I don't understand how word vectors are involved at all in the training process with gensim's doc2vec in DBOW mode (dm=0). I know that it's disabled by default with dbow_words=0. But what happens when we set dbow_words to 1?
In my understanding of DBOW, the context words are predicted directly from the paragraph vectors. So the only parameters of the model are the N p-dimensional paragraph vectors plus the parameters of the classifier.
But multiple sources hint that it is possible in DBOW mode to co-train word and doc vectors. For instance:
section 5 of An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation
this SO answer: How to use Gensim doc2vec with pre-trained word vectors?
So, how is this done? Any clarification would be much appreciated!
Note: for DM, the paragraph vectors are averaged/concatenated with the word vectors to predict the target words. In that case, it's clear that word vectors are trained simultaneously with document vectors, and there are N*p + M*q + classifier parameters (where M is the vocab size and q the word-vector dimensionality).
If you set dbow_words=1, then skip-gram word-vector training is added to the training loop, interleaved with the normal PV-DBOW training.
So, for a given target word in a text, first the candidate doc-vector is used (alone) to try to predict that word, with backpropagation adjustments then occurring to the model & doc-vector. Then, a bunch of the surrounding words are each used, one at a time in skip-gram fashion, to try to predict that same target word – with the follow-up adjustments made.
Then, the next target word in the text gets the same PV-DBOW plus skip-gram treatment, and so on, and so on.
As some logical consequences of this:
training takes longer than plain PV-DBOW - by about a factor equal to the window parameter
word-vectors overall wind up getting more total training attention than doc-vectors, again by a factor equal to the window parameter
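A minimal sketch of the two modes in gensim (gensim 4.x attribute names; the toy documents are made up):
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [
    TaggedDocument(words=["the", "cat", "sat", "on", "the", "mat"], tags=["doc0"]),
    TaggedDocument(words=["dogs", "chase", "cats", "in", "the", "yard"], tags=["doc1"]),
]

# Plain PV-DBOW: only doc-vectors get trained; word-vectors are allocated but left untrained.
plain_dbow = Doc2Vec(docs, dm=0, dbow_words=0, vector_size=50, window=5, min_count=1, epochs=40)

# PV-DBOW + interleaved skip-gram: word-vectors are co-trained alongside the doc-vectors.
dbow_plus_words = Doc2Vec(docs, dm=0, dbow_words=1, vector_size=50, window=5, min_count=1, epochs=40)

print(plain_dbow.dv["doc0"][:5])        # trained doc-vector
print(dbow_plus_words.wv["cat"][:5])    # word-vector that actually received training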

text classification: how many dimensions does my data have?

I am classifying text using the bag of words model. I read in 800 text files, each containing a sentence.
The sentences are then represented like this:
[{"OneWord":True,"AnotherWord":True,"AndSoOn":True},{"FirstWordNewSentence":True,"AnSoOn":True},...]
How many dimensions does my data have?
Is it the number of entries in the largest vector? Or is it the number of unique words? Or something else?
For each doc, the bag of words model has a set of sparse features. For example (use your first sentence in your example):
OneWord
AnotherWord
AndSoOn
The above three are the three active features for the document. The representation is sparse because we never list the inactive features explicitly AND we have a very large vocabulary (all the possible unique words that you consider as features). In other words, we did not say:
OneWord: true
AnotherWord: true
AndSoOn: true
FirstWordNewSentence: false
We only include those words that are "true".
How many dimensions does my data have?
Is it the number of entries in the largest vector? Or is it the number of unique words? Or something else?
If you stick with the sparse feature representation, you might want to estimate the average number of active features per document instead. That number is 2.5 in your example ((3+2)/2 = 2.5).
If you use a dense representation (e.g., one-hot encoding; though it is not a good idea if the vocabulary is large), the input dimension is equal to your vocabulary size.
If you use 100-dimensional word embeddings and combine all the words' embeddings to form a new input vector representing the document, your input dimension is then 100. In this case, you convert your sparse features into dense features via the embedding.
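A minimal sketch of the sparse-versus-dense counts above, using scikit-learn's DictVectorizer on the two example sentences (DictVectorizer is just one convenient choice here):
from sklearn.feature_extraction import DictVectorizer

docs = [
    {"OneWord": True, "AnotherWord": True, "AndSoOn": True},
    {"FirstWordNewSentence": True, "AndSoOn": True},
]

vec = DictVectorizer(sparse=True)
X = vec.fit_transform(docs)

print(X.shape)             # (2, 4): 2 documents, 4 unique words = vocabulary size = dense dimension
print(X.nnz / X.shape[0])  # 2.5: average number of active features per document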

number of vocabulary in gensim is much lower than the ones in training data

I am using Gensim to train sentences of size 4, and I have 1192 unique words in the training dataset. The number of words in the model, len(model.vocab), is 141, which does not make sense. Is there any reason for seeing this? How can I change the model so that it has a key for every word in the training data?
model = Word2Vec(windows, min_count=1)
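A small diagnostic sketch (gensim 4.x attribute names assumed; the toy windows stand in for the real size-4 training data) to compare the unique tokens in the training data with what the model kept:
from gensim.models import Word2Vec

windows = [["w1", "w2", "w3", "w4"], ["w2", "w5", "w6", "w7"]]  # stand-in for the real windows

model = Word2Vec(windows, min_count=1)

unique_tokens = {token for window in windows for token in window}
print(len(unique_tokens))            # unique words in the training data
print(len(model.wv.key_to_index))    # words the model actually kept
With min_count=1 the two counts should match; a large gap usually means the iterable passed to Word2Vec is not the tokenized data you expect.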
