How to represent paginated documents as a single instance of training data for whole document classification? - huggingface-transformers

The Huggingface Transformers library includes a number of document processing models that can do whole document classification. At least one of these models (LayoutLMv2) requires 3 inputs for each instance of training data:
a resized image of the document,
the words in the document
and the word bounding boxes
(I suspect a number of these models require the same inputs). The HF documentation provides a number of examples that support this use case, but I can't find any that discuss paginated documents. Bounding boxes, for example, are based on the dimensions of a given page, so the paginated nature of the document needs to pass through HF Datasets, into torch, and into training (e.g. you can't just concatenate all the paginated data). In essence, you need an HF Datasets representation and a torch representation that encode the paginated nature of the document and carry a single label (if you're doing classification). This was my naive idea for supporting paginated documents in HF Datasets:
from datasets import Features, ClassLabel, Array2D, Array3D, Array4D

features = Features({
    'image': Array4D(dtype="uint8", shape=(None, 3, 224, 224)),
    'input_ids': Array2D(dtype='int64', shape=(None, 512)),
    'attention_mask': Array2D(dtype='int64', shape=(None, 512)),
    'token_type_ids': Array2D(dtype='int64', shape=(None, 512)),
    'bbox': Array3D(dtype="int64", shape=(None, 512, 4)),
    'labels': ClassLabel(num_classes=len(unique_labels), names=unique_labels),
})
Here, every training data instance is represented as a stack of per-page arrays whose first dimension (the None) is the number of pages, and the instance is given a single label (the Processor uses the key labels, so multi-label classification is supported). This data is loaded and passed into:
dataloader = torch.utils.data.DataLoader(encoded_data, batch_size=None)
This basically exploits the batch dimension as the representation of the pages. It seems to work until torch reaches the labels portion of the instance, where the batch sizes don't match because the single label is supposed to represent the entire batch:
ValueError: Expected input batch_size (57) to match target batch_size (1).
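For concreteness, here is a rough sketch (illustrative, not from the original post) of how one multi-page document might be packed into the per-page arrays that the schema above expects. It assumes apply_ocr=False so that words and boxes are supplied manually; page_images, page_words, page_boxes and label are hypothetical per-document inputs, and the exact processor keys/shapes may differ between transformers versions:
import numpy as np
from transformers import (LayoutLMv2FeatureExtractor, LayoutLMv2Processor,
                          LayoutLMv2TokenizerFast)

feature_extractor = LayoutLMv2FeatureExtractor(apply_ocr=False)
tokenizer = LayoutLMv2TokenizerFast.from_pretrained("microsoft/layoutlmv2-base-uncased")
processor = LayoutLMv2Processor(feature_extractor, tokenizer)

def encode_document(page_images, page_words, page_boxes, label):
    # encode each page separately, then stack along a leading "pages" axis
    pages = [
        processor(img, words, boxes=boxes, truncation=True,
                  padding="max_length", max_length=512, return_tensors="np")
        for img, words, boxes in zip(page_images, page_words, page_boxes)
    ]
    return {
        'image': np.concatenate([p['image'] for p in pages]),
        'input_ids': np.concatenate([p['input_ids'] for p in pages]),
        'attention_mask': np.concatenate([p['attention_mask'] for p in pages]),
        'token_type_ids': np.concatenate([p['token_type_ids'] for p in pages]),
        'bbox': np.concatenate([p['bbox'] for p in pages]),
        'labels': label,   # one label for the whole document
    }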
Anyway, I wanted to put this idea and the broader question to the SO community: how do you represent paginated documents for HF Datasets/Transformers models?

Related

Can MappingScore() be used to get an absolute measure of scRNAseq dataset similarity to the reference dataset?

I have been using Seurat v4 Reference Mapping to align some query scRNAseq datasets that come from iPSC-derived cells subjected to several directed cortical differentiation protocols at multiple timepoints. I made the reference dataset by merging several individual fetal cortical sample datasets that I had annotated based on their unsupervised cluster DEGs (following this vignette with the default parameters).
I am interested in seeing which protocol produces cells most similar to the cells found in the fetal datasets, as well as which fetal timepoints the query datasets tend to map to. I understand that the MappingScore() function can show me query cells that aren't well represented in the reference dataset, so I figured that these scores could tell me which datasets are most similar to the reference dataset. However, when comparing the violin plots of the mapping scores for a query dataset from one of the differentiation protocols to a query dataset that contains just pluripotent cells, there are cells with high mapping scores in both cases (see attached images), even though only the differentiated cells should really contain cells closely resembling the fetal cortical tissue cells. I attached the code as a .txt file.
My question is whether or not the mapping score can be used as an absolute measurement of query to reference dataset similarity or if it is always just a relative measure where the high and low thresholds are set by the query dataset. If the latter, what alternative functions might I use here to get information about absolute similarity?
Thanks.
Attachments:
Pluripotent Cell Mapping Score
Differentiated Cell Mapping Score
Code Used For Mapping

Gensim Doc2Vec model returns different cosine similarity depending on the dataset

I trained two versions of doc2vec models with two datasets.
The first dataset was made with 2400 documents and the second one was made with 3000 documents including the documents which were used in the first dataset.
For example,
dataset 1 = doc1, doc2, ... doc2400
dataset 2 = doc1, doc2, ... doc2400, doc2401, ... doc3000
I thought that both Doc2Vec models should return the same similarity score between doc1 and doc2; however, they returned different scores.
Does a Doc2Vec model's result change depending on the dataset, even when the datasets include the same documents?
Yes, any addition to the training set will change the relative results.
Further, as explained in the Gensim FAQ, even re-training with the exact same data will typically result in different end coordinates for each training doc, though each run should be about equivalently useful:
https://github.com/RaRe-Technologies/gensim/wiki/Recipes-&-FAQ#q11-ive-trained-my-word2vec--doc2vec--etc-model-repeatedly-using-the-exact-same-text-corpus-but-the-vectors-are-different-each-time-is-there-a-bug-or-have-i-made-a-mistake-2vec-training-non-determinism
What should remain roughly the same between runs is the neighborhoods around each document. That is, adding some extra training docs shouldn't change the general result that some candidate doc is "very close" or "closer than other docs" to some target doc - except to the extent that (1) the new docs might include some even-closer docs; and (2) there's a small amount of 'jitter' between runs, per the FAQ answer above.
If in fact you see lots of change in the relative neighborhoods and top-N neighbors of a document, either in repeated runs or runs with small increments of extra data, there's possibly something else wrong in the training.
In particular, 2400 docs is a pretty small dataset for Doc2Vec - smaller datasets might need a smaller vector_size and/or more epochs and/or other tweaks to get more reliable results, and even then might not show off the strengths of this algorithm, which come through on larger (tens-of-thousands to millions of docs) datasets.
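As a concrete way to check the neighborhood-stability point above, here is a minimal sketch (not part of the original answer) that measures how much a document's top-N neighbor set overlaps between the two models. It assumes gensim 4.x (model.dv; in gensim 3.x use model.docvecs) and that model_2400 and model_3000 are hypothetical names for the two trained models:
def neighborhood_overlap(model_a, model_b, tag, topn=10):
    # fraction of shared top-N neighbor tags for one document across two models
    neighbors_a = {t for t, _ in model_a.dv.most_similar(tag, topn=topn)}
    neighbors_b = {t for t, _ in model_b.dv.most_similar(tag, topn=topn)}
    return len(neighbors_a & neighbors_b) / topn

# e.g. compare the neighborhood of one document's tag in both models
print(neighborhood_overlap(model_2400, model_3000, 'doc1'))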

How to get immediate next word probability using GPT2 model?

I was trying the hugging face gpt2 model. I have seen the run_generation.py script, which generates a sequence of tokens given a prompt. I am aware that we can use GPT2 for NLG.
In my use case, I wish to determine the probability distribution for (only) the immediate next word following the given prompt. Ideally this distribution would be over the entire vocab.
For example, given the prompt: "How are ", it should give a probability distribution where "you" or "they" have high probabilities and the other vocab words have very low ones.
How to do this using hugging face transformers? If it is not possible in hugging face, is there any other transformer model that does this?
You can have a look at how the generation script works with the probabilities.
GPT2LMHeadModel (as well as the other "LMHead" models) returns a tensor that contains, for each input position, the unnormalized scores (logits) of what the next token might be. I.e., the last position of the output holds the logits for the token that would follow the prompt (assuming input_ids is a tensor with token indices from the tokenizer):
outputs = model(input_ids)
next_token_logits = outputs[0][:, -1, :]
You get the distribution by normalizing the logits with a softmax. The indices in the last dimension of next_token_logits correspond to indices in the vocabulary that you get from the tokenizer object.
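Putting this together, here is a short sketch (model and tokenizer names chosen for illustration, recent transformers/torch APIs assumed) that prints the most likely next words for a prompt:
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer.encode("How are", return_tensors="pt")
with torch.no_grad():
    outputs = model(input_ids)
next_token_logits = outputs[0][:, -1, :]          # logits for the next token
probs = torch.softmax(next_token_logits, dim=-1)  # distribution over the vocab

# show the five most likely next tokens and their probabilities
top_probs, top_ids = probs[0].topk(5)
for p, i in zip(top_probs.tolist(), top_ids.tolist()):
    print(tokenizer.decode([i]), p)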
Selecting the last logits becomes tricky when you use a batch size bigger than 1 and sequences of different lengths. In that case, you would need to specify attention_mask in the model call to mask out padding tokens and then select the last logits using torch.index_select. It is much easier either to use a batch size of 1 or a batch of equally long sequences.
You can use any autoregressive model in Transformers: there is DistilGPT-2 (a distilled version of GPT-2), CTRL (which is basically GPT-2 trained with some additional "commands"), the original GPT (under the name openai-gpt), and XLNet (designed for contextual embeddings, but usable for generation in arbitrary order). There are probably more; you can check the Hugging Face Model Hub.

How to rotate a word2vec onto another word2vec?

I am training multiple word2vec models with Gensim. Each of the word2vec models will have the same parameters and dimensions, but will be trained with slightly different data. Then I want to compare how the change in data affected the vector representation of some words.
But every time I train a model, the vector representation of the same word is wildly different. Their similarity to other words remains similar, but the whole vector space seems to be rotated.
Is there any way I can rotate both of the word2vec representations in such a way that the same words occupy the same positions in vector space, or are at least as close as possible?
Thanks in advance.
That the locations of words vary between runs is to be expected. There's no one 'right' place for words, just mutual arrangements that are good at the training task (predicting words from other nearby words) – and the algorithm involves random initialization, random choices during training, and (usually) multithreaded operation, which can change the effective ordering of training examples and thus the final results, even if you try to eliminate the randomness by relying on a deterministically-seeded pseudorandom number generator.
There's a class called TranslationMatrix in gensim that implements the learn-a-projection-between-two-spaces method, as used for machine-translation between natural languages in one of the early word2vec papers. It requires you to have some words that you specify should have equivalent vectors – an anchor/reference set – then lets other words find their positions in relation to those. There's a demo of its use in gensim's documentation notebooks:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/translation_matrix.ipynb
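Here is a rough sketch (not from the original answer) of the TranslationMatrix workflow; the exact constructor and method signatures can vary between gensim versions, and anchor_words, model_a and model_b are hypothetical names:
from gensim.models.translation_matrix import TranslationMatrix

# anchor words expected to mean the same thing in both models,
# paired with themselves, e.g. [("hot", "hot"), ("cold", "cold"), ...]
word_pairs = [(w, w) for w in anchor_words]

tm = TranslationMatrix(model_a.wv, model_b.wv, word_pairs=word_pairs)
tm.train(word_pairs)
# map a word from model_a's space into model_b's space and list its nearest words there
print(tm.translate(["skiing"], topn=5))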
But, there are some other techniques you could also consider:
transform & concatenate the training corpuses instead, to both retain some words that are the same across all corpuses (such as very frequent words) and make other words of interest different per segment. For example, you might leave words like "hot" and "cold" unchanged, but replace words like "tamale" or "skiing" with subcorpus-specific versions, like "tamale(A)", "tamale(B)", "skiing(A)", "skiing(B)". Shuffle all data together for training in a single session, then check the distances/directions between "tamale(A)" and "tamale(B)" - since they were each only trained by their respective subsets of the data. (It's still important to have many 'anchor' words, shared between different sets, to force a correlation on those words, and thus a shared influence/meaning for the varying words.)
create a model for all the data, with a single vector per word. Save that model aside. Then, re-load it, and try re-training it with just subsets of the whole data. Check how much words move when trained on just the segments. (It might again help comparability to hold certain prominent anchor words constant. There's an experimental property in model.trainables, with a name ending in _lockf, that lets you scale the updates to each word. If you set its values to 0.0, instead of the default 1.0, for certain word slots, those words can't be further updated. So after re-loading the model, you could 'freeze' your reference words by setting their _lockf values to 0.0, so that only other words get updated by the secondary training, and they're still bound to have coordinates that make sense with regard to the unmoving anchor words. Read the source code to better understand how _lockf works; a rough sketch of this freezing approach follows below.)
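A rough sketch (not from the original answer) of that freeze-then-retrain idea, assuming the gensim 3.x attribute layout described above (model.trainables.vectors_lockf, model.wv.vocab); the attribute names changed in gensim 4.x, and anchor_words and subcorpus_sentences are hypothetical:
from gensim.models import Word2Vec

base_model = Word2Vec.load('all_data.model')   # model trained on all the data

anchor_words = ['hot', 'cold']                 # hypothetical anchor/reference set
for word in anchor_words:
    idx = base_model.wv.vocab[word].index
    base_model.trainables.vectors_lockf[idx] = 0.0   # 0.0 = no further updates

# continue training on just one subcorpus; only the non-frozen words move
base_model.train(subcorpus_sentences,
                 total_examples=len(subcorpus_sentences),
                 epochs=base_model.epochs)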

How to get word vectors from a gensim Doc2Vec?

I trained a gensim.models.doc2vec.Doc2Vec model
d2v_model = Doc2Vec(sentences, size=100, window=8, min_count=5, workers=4)
and I can get document vectors by
docvec = d2v_model.docvecs[0]
How can I get word vectors from the trained model?
Doc2Vec inherits from Word2Vec, and thus you can access word vectors the same way as in Word2Vec, via the model's wv attribute (older gensim versions also allowed indexing the model directly):
wv = d2v_model.wv['apple']
Note, however, that a Doc2Vec training mode like pure DBOW (dm=0) doesn't need or create word vectors. (Pure DBOW still works pretty well and fast for many purposes!) If you do access word vectors from such a model, they'll just be the automatic randomly-initialized vectors, with no meaning.
Only when the Doc2Vec mode itself co-trains word-vectors, as in the DM mode (default dm=1) or when adding optional word-training to DBOW (dm=0, dbow_words=1), are word-vectors and doc-vectors both learned simultaneously.
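As a small illustration (not from the original answer), this sketch trains in the default DM mode, which co-trains word vectors, and then reads both kinds of vectors; it uses recent gensim parameter names (vector_size rather than the older size), and raw_texts is a hypothetical list of document strings:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words=text.split(), tags=[i])
        for i, text in enumerate(raw_texts)]

model = Doc2Vec(docs, vector_size=100, window=8, min_count=5, workers=4, dm=1)

doc_vec = model.dv[0]           # doc vector (model.docvecs in older gensim)
word_vec = model.wv['apple']    # word vector, meaningful because dm=1 co-trains words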
If you want to get all the trained doc vectors, you can easily use model.docvecs.doctag_syn0. If you want a single indexed doc vector, you can use model.docvecs[i].
If you are training a Word2Vec model, you can get the word vectors from model.wv.syn0.
For more, see this GitHub issue: https://github.com/RaRe-Technologies/gensim/issues/1513
