Doc2Vec How to find most similar document - gensim

I am using Gensim's Doc2Vec, and was wondering if there is a way to get the most similar document to another document that is outside the list of TaggedDocuments used to train the Doc2Vec model.
Right now I can infer a vector from a document not in the training set:
# 'model' here is a instance of Doc2Vec class that has been trained
# Inferring a vector
doc_not_in_training_set = "Foo Foo Foo Foo Foo Foo Fie"
v1 = model.infer_vector(word_tokenize(doc_not_in_training_set.lower()))
print("V1_infer", v1)
This prints out a vector representation of the 'doc_not_in_training_set' string. However, is there a way to use this vector to find the n most similar documents to the 'doc_not_in_training_set' string (in the TaggedDocuments training set for this word2vec model)?
Looking under the documentation, the closest I could find was the model.docvec.most_similar() method:
# Finding most similar to first
similar_doc = model.docvecs.most_similar('0')
This returns the document in the training set most similar to the document in the training set with tag '0'.
In the documentation of this method, it looks like there is not yet the functionality I am looking for:
TODO: Accept vectors of out-of-training-set docs, as if from inference.
Is there another method I can use to find documents similar to a document not in the training set?

The .most_similar() method will also take a raw vectors as the target position.
It helps to explicitly name the positive parameter, to prevent other logic of that method, which tries to intuit what other strings/etc supplied as arguments might mean, from misinterpreting a single raw vector.
So try:
similar_docs = model.docvecs.most_similar(positive=[v1])
You should get back a list of nearest-neighbors to the v1 vector that you'd previously inferred.

Related

how to handle spelling mistake(typos) in entity extraction in Rasa NLU?

I have few intents in my training set(nlu_data.md file) with sufficient amount of training examples under each intent.
Following is an example,
##intent: SEARCH_HOTEL
- find good [hotel](place) for me in Mumbai
I have added multiple sentences like this.
At the time of testing, all sentences in training file are working fine. But if any input query is having spelling mistake e.g, hotol/hetel/hotele for hotel keyword then Rasa NLU is unable to extract it as an entity.
I want to resolve this issue.
I am allowed to change only training data, also restricted not to write any custom component for this.
To handle spelling mistakes like this in entities, you should add these examples to your training data. So something like this:
##intent: SEARCH_HOTEL
- find good [hotel](place) for me in Mumbai
- looking for a [hotol](place) in Chennai
- [hetel](place) in Berlin please
Once you've added enough examples, the model should be able to generalise from the sentence structure.
If you're not using it already, it also makes sense to use the character-level CountVectorFeaturizer. That should be in the default pipeline described on this page already
One thing I would highly suggest you to use is to use look-up tables with fuzzywuzzy matching. If you have limited number of entities (like country names) look-up tables are quite fast, and fuzzy matching catches typos when that entity exists in your look-up table (searching for typo variations of those entities). There's a whole blogpost about it here: on Rasa.
There's a working implementation of fuzzy wuzzy as a custom component:
class FuzzyExtractor(Component):
name = "FuzzyExtractor"
provides = ["entities"]
requires = ["tokens"]
defaults = {}
language_list ["en"]
threshold = 90
def __init__(self, component_config=None, *args):
super(FuzzyExtractor, self).__init__(component_config)
def train(self, training_data, cfg, **kwargs):
pass
def process(self, message, **kwargs):
entities = list(message.get('entities'))
# Get file path of lookup table in json format
cur_path = os.path.dirname(__file__)
if os.name == 'nt':
partial_lookup_file_path = '..\\data\\lookup_master.json'
else:
partial_lookup_file_path = '../data/lookup_master.json'
lookup_file_path = os.path.join(cur_path, partial_lookup_file_path)
with open(lookup_file_path, 'r') as file:
lookup_data = json.load(file)['data']
tokens = message.get('tokens')
for token in tokens:
# STOP_WORDS is just a dictionary of stop words from NLTK
if token.text not in STOP_WORDS:
fuzzy_results = process.extract(
token.text,
lookup_data,
processor=lambda a: a['value']
if isinstance(a, dict) else a,
limit=10)
for result, confidence in fuzzy_results:
if confidence >= self.threshold:
entities.append({
"start": token.offset,
"end": token.end,
"value": token.text,
"fuzzy_value": result["value"],
"confidence": confidence,
"entity": result["entity"]
})
file.close()
message.set("entities", entities, add_to_output=True)
But I didn't implement it, it was implemented and validated here: Rasa forum
Then you will just pass it to your NLU pipeline in config.yml file.
Its a strange request that they ask you not to change the code or do custom components.
The approach you would have to take would be to use entity synonyms. A slight edit on a previous answer:
##intent: SEARCH_HOTEL
- find good [hotel](place) for me in Mumbai
- looking for a [hotol](place:hotel) in Chennai
- [hetel](place:hotel) in Berlin please
This way even if the user enters a typo, the correct entity will be extracted. If you want this to be foolproof, I do not recommend hand-editing the intents. Use some kind of automated tool for generating the training data. E.g. Generate misspelled words (typos)
First of all, add samples for the most common typos for your entities as advised here
Beyond this, you need a spellchecker.
I am not sure whether there is a single library that can be used in the pipeline, but if not you need to create a custom component. Otherwise, dealing with only training data is not feasible. You can't create samples for each typo.
Using Fuzzywuzzy is one of the ways, generally, it is slow and it doesn't solve all the issues.
Universal Encoder is another solution.
There should be more options for spell correction, but you will need to write code in any way.

Gensim's FastText KeyedVector out of vocab

I want to use the read-only version of Gensim's FastText Embedding to save some RAM compared to the full model.
After loading the KeyVectors version, I get the following Error when fetching a vector:
IndexError: index 878080 is out of bounds for axis 0 with size 761210
The error occurs when using words that should be out-of-vocabulary e.g. "lawyerxy" instead of "lawyer". The full model returns a vector for both.
from gensim.models import KeyedVectors
model = KeyedVectors.load("model.kv")
model .wv.__getitem__("lawyerxy")
So, my assumption is that the KeyedVectors do not offer FastText's out of vacabulary function - a key feature for my usecase. This limitation is not given in the documentation:
https://radimrehurek.com/gensim/models/word2vec.html
Can anyone prove that assumption and/or name a fix to allow vectors for "lawyerxy" etc. ?
The KeyedVectors name is (as of gensim-3.8.0) just an alias for class Word2VecKeyedVectors, which only maintains a simple word (as key) to vector (as value) mapping.
You shouldn't expect FastText's advanced ability to synthesize vectors for out-of-vocabulary words to appear in any model/representation that doesn't explicitly claim to offer that ability.
(I would expect a lookup of an out-of-vocabulary word to give a clearer KeyError rather than the IndexError you've reported. But, you'd need to show exactly what code created the file you're loading, and triggered the error, and the full error stack, to further guess what's going wrong in your case.)
Depending on how your model.kv file was saved, you might be able to load it, with retained OOV-vector functionality, by using the class FastTextKeyedVectors instead of plain KeyedVectors.

Gensim Predict Output Word Function Syntax

How do you use the Gensim predict output word function?
model = KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)
model.predict_output_word(['Hi', 'how', 'you'], topn=10)
AttributeError: 'Word2VecKeyedVectors' object has no attribute 'predict_output_word'
I tried Word2Vec.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True), which was deprecated as well.
A file like GoogleNews-vectors-negative300.bin only contains the word vectors, not the complete model used for training. So it is not possible to use predict_output_word in this case. If you would have trained a full model yourself and saved it with model.save(), then the method predict_output_word would be available.

Is there a way to set min_df and max_df in gensim's tfidf model?

I am using gensim's tdidf model like so:
from gensim import corpora, models
dictionary = corpora.Dictionary(some_corpus)
mapped_corpus = [dictionary.doc2bow(text)
for text in some_corpus]
tfidf = models.TfidfModel(mapped_corpus)
Now I'd like to apply thresholds to remove terms that appear too frequently (max_df) and too infrequently (min_df). I know that scikit's CountVectorizer allows you to do this, but I can't seem to find how to set these thresholds in gensim's tfidf. Could someone please help?
You can filter your dictionary with
dictionary.filter_extremes(no_below=min_df, no_above=rel_max_df)
Note that no_below expects the minimum number of documents in which tokens must appear, whereas no_above expects a maximum relative frequency, e.g. 0.5. Afterwards you can then construct your corpus with the filtered dictionary. According to the gensim docs it is also possible to construct a TfidfModel with only a dictionary.

Condense nested for loop to improve processing time with text analysis python

I am working on an untrained classifier model. I am working in Python 2.7. I have a loop. It looks like this:
features = [0 for i in xrange(len(dictionary))]
for bgrm in new_scored:
for i in xrange(len(dictionary)):
if bgrm[0] == dictionary[i]:
features[i] = int(bgrm[1])
break
I have a "dictionary" of bigrams that I have collected from a data set containing customer reviews and I would like to construct feature arrays of each review corresponding to the dictionary I have created. It would contain the frequencies of the bigrams found within the review of the features in the dictionary (I hope that makes sense). new_scored is a list of tuples which contains the bigrams found within a particular review paired with their relative frequency of occurrence in that review. The final feature arrays will be the same length as the original dictionary with few non zero entries.
The above works fine but I am looking at a data set of 13000 reviews, for each review to loop through this code is going to take for eeever (if my computer doesnt run out of RAM first). I have been sitting with it for a while and cannot see how I can condense it.
I am very new to python so I was hoping a more experienced could help with condensing it or perhaps point me in the right direction towards a library that will contain the function I need.
Thank you in advance!
Consider making dictionary an actual dict object (or some fancier subclass of dict if it better suits your needs), as opposed to an iterable (list or tuple seems like what it is now). dictionary could map bigrams as keys to an integer identifier that would identify a feature position.
If you refactor dictionary that way, then the loop can be rewritten as:
features = [0 for key in dictionary]
for bgram in new_scored:
try:
features[dictionary[bgram[0]]] = int(bgrm[1])
except KeyError:
# do something if the bigram is not in the dictionary for some reason
This should convert what was an O(n) traversal through dictionary into a hash lookup.
Hope this helps.

Resources