GenSim Word2Vec unexpectedly pruning - gensim

My objective is to find a vector representation of phrases. Below is the code I have, which works partially for bigrams using the Word2Vec model provided by the gensim library.
from gensim.models import word2vec, Phrases

def bigram2vec(unigrams, bigram_to_search):
    bigrams = Phrases(unigrams)
    model = word2vec.Word2Vec(sentences=bigrams[unigrams], size=20, min_count=1, window=4, sg=1, hs=1, negative=0, trim_rule=None)
    if bigram_to_search in model.vocab.keys():
        return model[bigram_to_search]
    else:
        return None
The problem is that the Word2Vec model is seemingly doing automatic pruning of some of the bigrams, i.e. len(model.vocab.keys()) != len(bigrams.vocab.keys()). I've tried adjusting various parameters such as trim_rule, min_count, but they don't seem to affect the pruning.
PS - I am aware that bigrams to look up need to be represented using underscore instead of space, i.e. proper way to call my function would be bigram2vec(unigrams, 'this_report')

Thanks to further clarification at the GenSim support forum, the solution is to set the appropriate min_count and threshold values for the Phrases being generated (see documentation for details about these parameters in the Phrases class). The corrected solution code is below.
from gensim.models import word2vec, Phrases

def bigram2vec(unigrams, bigram_to_search):
    bigrams = Phrases(unigrams, min_count=1, threshold=0.1)
    model = word2vec.Word2Vec(sentences=bigrams[unigrams], size=20, min_count=1, trim_rule=None)
    if bigram_to_search in model.vocab.keys():
        return model[bigram_to_search]
    else:
        return []
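For illustration, here is how the function might be called on a tiny made-up corpus; the toy sentences and the looked-up bigram are not from the original question, they just show the expected input format (each sentence as a list of unigram tokens) and the underscore convention.

# Hypothetical toy corpus, purely for illustration.
unigrams = [
    ['this', 'report', 'covers', 'quarterly', 'results'],
    ['this', 'report', 'is', 'final'],
    ['another', 'sentence', 'entirely'],
]

vector = bigram2vec(unigrams, 'this_report')
print(vector)  # a 20-dimensional vector, or [] if the bigram was not kept by Phrases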

Related

On-the-fly tokenization with datasets, tokenizers, and torch Datasets and Dataloaders

I have a question regarding "on-the-fly" tokenization. This question was prompted by reading "How to train a new language model from scratch using Transformers and Tokenizers" here. Towards the end there is this sentence: "If your dataset is very large, you can opt to load and tokenize examples on the fly, rather than as a preprocessing step". I've tried coming up with a solution that would combine both datasets and tokenizers, but did not manage to find a good pattern.
I guess the solution would entail wrapping a dataset into a Pytorch dataset.
As a concrete example from the docs
import torch

class SquadDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        # instead of doing this beforehand, I'd like to do tokenization on the fly
        self.encodings = encodings

    def __getitem__(self, idx):
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}

    def __len__(self):
        return len(self.encodings.input_ids)

train_dataset = SquadDataset(train_encodings)
How would one implement this with "on-the-fly" tokenization exploiting the vectorized capabilities of tokenizers?
UPDATE Feb 2021
As of v1.3.0, datasets supports lazy evaluation of functions via the set_transform method. Therefore, you can apply on-the-fly tokenization directly, as shown here.
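As a minimal sketch of that approach (the dataset name, model name, and text column below are placeholders for illustration, not from the original question):

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder model
dataset = load_dataset("imdb", split="train")                   # placeholder dataset

def tokenize(batch):
    # applied lazily to each accessed batch, not as a preprocessing step
    return tokenizer(batch["text"], truncation=True, padding=True, return_tensors="pt")

dataset.set_transform(tokenize)
print(dataset[0])  # tokenized on access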
OLD ANSWER
In the end I settled on this solution. I do not like that the batch_size is now controlled at the dataset level, but it does its job.
In this way we exploit two nice things:
fast indexing into the HuggingFace dataset
vectorization capabilities of the HuggingFace tokenizer
from typing import Dict, List

import torch
from torch.utils.data import BatchSampler, Dataset, SequentialSampler

class CustomPytorchDataset(Dataset):
    """
    This class wraps the HuggingFace dataset and allows for
    batch indexing into the dataset. This allows exploiting
    the capabilities of the tokenizer to work on batches.

    NOTE: now we control batch_size at the Dataset level, not
    in the DataLoader, therefore the DataLoader should always be
    used with `batch_size=1`.
    """
    def __init__(self, batch_size: int):
        self.batch_size = batch_size
        self.dataset = train_ds          # HuggingFace dataset
        self.tokenizer = bert_tokenizer  # HuggingFace tokenizer

    def __getitem__(self, batch_idx: List[int]):
        instance = self.dataset[batch_idx]
        # tokenize on-the-fly
        tokenized_instance = self.tokenizer(
            instance[text_col],
            truncation=True,
            padding=True
        )
        return tokenized_instance

    def __len__(self):
        return len(self.dataset)

    def sampler(self):
        # shuffling can be controlled by the sampler,
        # without touching the dataset
        return BatchSampler(
            SequentialSampler(self),
            batch_size=self.batch_size,
            drop_last=True
        )

    @staticmethod
    def collate_fn(batches: List[Dict[str, int]]):
        return {
            k: torch.tensor(v, dtype=torch.int64)
            for k, v in batches[0].items()
        }

Huggingface transformer model returns string instead of logits

I am trying to run this example from the huggingface website: https://huggingface.co/transformers/task_summary.html. It seems that the model returns two strings instead of logits, which leads to an error thrown by torch.argmax().
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad", return_dict=True)

text = r"""🤗 Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose
architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural
Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between
TensorFlow 2.0 and PyTorch.
"""

questions = ["How many pretrained models are available in 🤗 Transformers?",
             "What does 🤗 Transformers provide?",
             "🤗 Transformers provides interoperability between which frameworks?"]

for question in questions:
    inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="pt")
    input_ids = inputs["input_ids"].tolist()[0]  # the list of all indices of words in question + context
    text_tokens = tokenizer.convert_ids_to_tokens(input_ids)  # Get the tokens for the question + context
    answer_start_scores, answer_end_scores = model(**inputs)
    answer_start = torch.argmax(answer_start_scores)  # Get the most likely beginning of answer with the argmax of the score
    answer_end = torch.argmax(answer_end_scores) + 1  # Get the most likely end of answer with the argmax of the score
    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))
    print(f"Question: {question}")
    print(f"Answer: {answer}")
Since one of the recent updates, the models now return task-specific output objects (which are dictionaries) instead of plain tuples. The page you used has not been updated to reflect that change. You can either force the model to return a tuple by specifying return_dict=False:
answer_start_scores, answer_end_scores = model(**inputs, return_dict=False)
or you can extract the values from the QuestionAnsweringModelOutput object by calling the values() method:
answer_start_scores, answer_end_scores = model(**inputs).values()
or even utilizing the QuestionAnsweringModelOutput object:
outputs = model(**inputs)
answer_start_scores = outputs.start_logits
answer_end_scores = outputs.end_logits
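Putting it together, the loop body from the question could then read, for example:

for question in questions:
    inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="pt")
    input_ids = inputs["input_ids"].tolist()[0]

    # the output object exposes the logits as named attributes
    outputs = model(**inputs)
    answer_start = torch.argmax(outputs.start_logits)
    answer_end = torch.argmax(outputs.end_logits) + 1

    answer = tokenizer.convert_tokens_to_string(
        tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end])
    )
    print(f"Question: {question}")
    print(f"Answer: {answer}")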

Is there a way to infer topic distributions on unseen document from gensim LDA pre-trained model using matrix multiplication?

Is there a way to get the topic distribution of an unseen document using a pretrained LDA model without using the LDA_Model[unseenDoc] syntax? I am trying to implement my LDA model into a web application, and if there was a way to use matrix multiplication to get a similar result then I could use the model in javascript.
For example, I tried the following:
import numpy as np
import gensim
from gensim.corpora import Dictionary
from gensim import models
import nltk
from nltk.stem import WordNetLemmatizer, SnowballStemmer
nltk.download('wordnet')

def Preprocesser(text_list):
    smallestWordSize = 3
    processedList = []
    for token in gensim.utils.simple_preprocess(text_list):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > smallestWordSize:
            processedList.append(StemmAndLemmatize(token))
    return processedList

lda_model = models.LdaModel.load('LDAModel\GoldModel')  # Load pretrained LDA model
dictionary = Dictionary.load("ModelTrain\ManDict")      # Load dictionary model was trained on

# Sample Unseen Doc to Analyze
doc = "I am going to write a string about how I can't get my task executor \
to travel properly. I am trying to use the \
AGV navigator, but it doesn't seem to be working network. I have been trying\
to use the AGV Process flow but that isn't working either speed\
trailer offset I am now going to change this so I can see how fast it runs"

termTopicMatrix = lda_model.get_topics()   # Get Term-topic Matrix from pretrained LDA model
cleanDoc = Preprocesser(doc)               # Tokenize, lemmatize, clean and stem words
bowDoc = dictionary.doc2bow(cleanDoc)      # Create bow using dictionary

dictSize = len(termTopicMatrix[0])         # Get length of terms in dictionary
fullDict = np.zeros(dictSize)              # Initialize array which is length of dictionary size
First = [first[0] for first in bowDoc]     # Get index of terms in bag of words
Second = [second[1] for second in bowDoc]  # Get frequency of term in bag of words
fullDict[First] = Second                   # Add word frequency to full dictionary

print('Matrix Multiplication: \n', np.dot(termTopicMatrix, fullDict))
print('Conventional Syntax: \n', lda_model[bowDoc])
Output:
Matrix Multiplication:
[0.0283254 0.01574513 0.03669142 0.01671816 0.03742738 0.01989461
0.01558603 0.0370233 0.04648389 0.02887623 0.00776652 0.02147539
0.10045133 0.01084273 0.01229849 0.00743788 0.03747379 0.00345913
0.03086953 0.00628912 0.29406082 0.10656977 0.00618827 0.00406316
0.08775404 0.00785408 0.02722744 0.09957815 0.01669402 0.00744392
0.31177135 0.03063149 0.07211428 0.01192056 0.03228589]
Conventional Syntax:
[(0, 0.070313625), (2, 0.056414187), (18, 0.2016589), (20, 0.46500313), (24, 0.1589748)]
In the pretrained model there are 35 topics and 1155 words.
In the "Conventional Syntax" output, the first element of each tuple is the index of the topic and the second element is the probability of the topic. In the "Matrix Multiplication" version, the probability is the index and the value is the probability. Clearly the two don't match up.
For example, the lda_model[unseenDoc] shows that topic 0 has a 0.07 probability, but the matrix multiplication method says that topic has a 0.028 probability. Am I missing a step here?
You can review the full source code used by LDAModel's get_document_topics() method in your installation, or online at:
https://github.com/RaRe-Technologies/gensim/blob/e75f6c8e8d1dee0786b1b2cd5ef60da2e290f489/gensim/models/ldamodel.py#L1283
(It also makes use of the inference() method in the same file.)
It's doing a lot more scaling/normalization/clipping than your code, which is likely the cause of the discrepancy. But you should be able to examine, line by line, where your process and gensim's differ, to get the steps to match up.
It also shouldn't be hard to use the gensim code's steps as guidance for creating parallel JavaScript code that, given the right parts of the model's state, can reproduce its results.
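As a rough illustration of what get_document_topics() does beyond a plain dot product, here is a simplified sketch of gensim's variational inference for a single document. It assumes a recent gensim LdaModel that exposes expElogbeta, alpha and num_topics as attributes, and it omits the convergence check and the minimum_probability clipping that the real code performs, so treat it as a guide rather than a drop-in replacement.

import numpy as np
from scipy.special import psi  # digamma function

def infer_doc_topics(lda_model, bow, iterations=50):
    """Simplified sketch of gensim's per-document variational inference."""
    ids = [term_id for term_id, _ in bow]
    cts = np.array([count for _, count in bow], dtype=float)

    # exp(E[log beta]) restricted to the terms present in this document
    expElogbeta_d = lda_model.expElogbeta[:, ids]  # shape: (num_topics, n_doc_terms)

    # random initialisation of the variational Dirichlet parameter gamma
    gamma = np.random.gamma(100., 1. / 100., lda_model.num_topics)
    for _ in range(iterations):
        expElogtheta = np.exp(psi(gamma) - psi(gamma.sum()))
        phinorm = expElogtheta @ expElogbeta_d + 1e-100
        gamma = lda_model.alpha + expElogtheta * ((cts / phinorm) @ expElogbeta_d.T)

    return gamma / gamma.sum()  # normalised topic distribution

# Should land close to lda_model[bowDoc], up to random initialisation and convergence.
print(infer_doc_topics(lda_model, bowDoc))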

Why does a Gensim Doc2vec object return empty doctags?

My question is: how should I interpret this situation?
I trained a Doc2Vec model following this tutorial https://blog.griddynamics.com/customer2vec-representation-learning-and-automl-for-customer-analytics-and-personalization/.
For some reason, doc_model.docvecs.doctags returns {}. But doc_model.docvecs.vectors_docs seems to return a proper value.
Why does the Doc2Vec object return no doctags, yet does return vectors_docs?
Thank you for any comments and answers in advance.
This is the code I used to train a Doc2Vec model.
from gensim.models.doc2vec import LabeledSentence, TaggedDocument, Doc2Vec
import multiprocessing as mp
import timeit
import gensim
from tqdm import tqdm

embeddings_dim = 200  # dimensionality of user representation
filename = f'models/customer2vec.{embeddings_dim}d.model'

if TRAIN_USER_MODEL:
    class TaggedDocumentIterator(object):
        def __init__(self, df):
            self.df = df
        def __iter__(self):
            for row in self.df.itertuples():
                yield TaggedDocument(words=dict(row._asdict())['all_orders'].split(),
                                     tags=[dict(row._asdict())['user_id']])

    it = TaggedDocumentIterator(combined_orders_by_user_id)

    doc_model = gensim.models.Doc2Vec(vector_size=embeddings_dim,
                                      window=5,
                                      min_count=10,
                                      workers=mp.cpu_count()-1,
                                      alpha=0.055,
                                      min_alpha=0.055,
                                      epochs=20)  # use fixed learning rate
    train_corpus = list(it)
    doc_model.build_vocab(train_corpus)
    for epoch in tqdm(range(10)):
        doc_model.alpha -= 0.005               # decrease the learning rate
        doc_model.min_alpha = doc_model.alpha  # fix the learning rate, no decay
        doc_model.train(train_corpus, total_examples=doc_model.corpus_count, epochs=doc_model.iter)
        print('Iteration:', epoch)
    doc_model.save(filename)
    print(f'Model saved to [{filename}]')
else:
    doc_model = Doc2Vec.load(filename)
    print(f'Model loaded from [{filename}]')
doc_model.docvecs.vectors_docs returns a properly populated array of vectors.
If all of the tags you supply are plain Python ints, those ints are used as the direct-indexes into the vectors-array.
This saves the overhead of maintaining a mapping from arbitrary tags to indexes.
But, it may also cause an over-allocation of the vectors array, to be large enough for the largest int tag you provided, even if other lower ints are never used. (That is: if you provided a single document, with a tags=[1000000], it will allocate an array sufficient for tags 0 to 1000000, even if most of those never appear in your training data.)
If you want model.docvecs.doctags to collect a list of all your tags, use string tags rather than plain ints.
Separately: don't call train() multiple times in your own loop, or manage the alpha learning rate in your own code, unless you have an overwhelmingly good reason to do so. It's inefficient and error-prone. (Your code, for example, is actually performing 200 training epochs, and if you were to increase the loop count without carefully adjusting your alpha decrement, you could wind up with nonsensical negative alpha values – a very common error in code following this bad practice.) Call .train() once with your desired number of epochs, set alpha and min_alpha to reasonable starting and nearly-zero values – probably just the defaults, unless you're sure your change is helping – and then leave them alone.
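For example, a minimal sketch of that recommended pattern, reusing the variables from the question (note that string tags would be needed if you want docvecs.doctags populated):

# Build the model once and let gensim manage the alpha decay internally.
doc_model = gensim.models.Doc2Vec(vector_size=embeddings_dim,
                                  window=5,
                                  min_count=10,
                                  workers=mp.cpu_count() - 1,
                                  epochs=20)   # default alpha/min_alpha

# If you want docvecs.doctags populated, supply string tags in the corpus,
# e.g. TaggedDocument(words=..., tags=[str(user_id)]).
train_corpus = list(it)
doc_model.build_vocab(train_corpus)
doc_model.train(train_corpus,
                total_examples=doc_model.corpus_count,
                epochs=doc_model.epochs)
doc_model.save(filename)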

Gensim LDA topic assignment

I am hoping to assign each document to one topic using LDA. I realise that what you get from LDA is a distribution over topics; however, as you can see from the last line below, I assign each document to its most probable topic.
My question is this: I have to run lda[corpus] essentially a second time in order to get these topic assignments. Is there some other built-in gensim function that will give me the topic-assignment vectors directly? Since the LDA algorithm has already passed through the documents, might it have saved these topic assignments?
# Get the Dictionary and BoW of the corpus after some stemming/cleansing
# (assumes `cleanDF` (a DataFrame with a text column) and `stem` are defined elsewhere)
from gensim import corpora, models
from gensim.parsing.preprocessing import STOPWORDS

texts = [[stem(word) for word in document.split() if word not in STOPWORDS] for document in cleanDF.text.values]
dictionary = corpora.Dictionary(texts)
dictionary.filter_extremes(no_below=5, no_above=0.9)
corpus = [dictionary.doc2bow(text) for text in texts]

# The actual LDA component
lda = models.LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=30, chunksize=10000, passes=10, workers=4)

# Assign each document to most prevalent topic
lda_topic_assignment = [max(p, key=lambda item: item[1]) for p in lda[corpus]]
There is no other built-in gensim function that will give the topic-assignment vectors directly.
You are right that the LDA algorithm has already passed through the documents, but the implementation updates the model in chunks (based on the value of the chunksize parameter), so it does not keep the entire corpus, or per-document topic assignments, in memory.
Hence you have to use lda[corpus] or the lda.get_document_topics() method, for example:
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

test = lda[corpus[0]]
print(test)
sorted(test, reverse=True, key=lambda x: x[1])

Topics = ['Topic_' + str(sorted(lda[i], reverse=True, key=lambda x: x[1])[0][0]).zfill(3) for i in corpus]
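For instance, the same most-probable-topic assignment from the question can be written with get_document_topics():

# Assign each document to its single most probable topic in one pass.
lda_topic_assignment = [
    max(lda.get_document_topics(bow), key=lambda item: item[1])[0]
    for bow in corpus
]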
