Understanding Gensim Doc2vec ranking - gensim

I use gensim 4.0.1 and follow tutorial 1 and 2:
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
texts = [
"Human machine interface for lab abc computer applications",
"A survey of user opinion of computer system response time",
"The EPS user interface management system",
"System and human system engineering testing of EPS",
"Relation of user perceived response time to error measurement",
"The generation of random binary unordered trees",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
"Graph minors A survey",
]
texts = [t.lower().split() for t in texts]
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(texts)]
model = Doc2Vec(documents, epochs=50, vector_size=5, window=2, min_count=2, workers=4)
new_vector = model.infer_vector("human machine interface".split())
for rank, (doc_id, score) in enumerate(model.dv.most_similar_cosmul(positive=[new_vector])):
    print('{}. {:.5f} [{}] {}'.format(rank, score, doc_id, ' '.join(documents[doc_id].words)))
1. 0.56613 [7] graph minors iv widths of trees and well quasi ordering
2. 0.55941 [6] the intersection graph of paths in trees
3. 0.55061 [2] the eps user interface management system
4. 0.54981 [1] a survey of user opinion of computer system response time
5. 0.52249 [4] relation of user perceived response time to error measurement
6. 0.52240 [8] graph minors a survey
7. 0.49214 [0] human machine interface for lab abc computer applications
8. 0.49016 [3] system and human system engineering testing of eps
9. 0.47899 [5] the generation of random binary unordered trees
Why does document[0], which contains "human machine interface", rank so low (position 7)? Is this a result of semantic generalization, or does the model need to be tuned? Is there a tutorial with a larger corpus that gives repeatable results?

The problem is the same as in my prior answer to a similar question:
https://stackoverflow.com/a/66976706/130288
Doc2Vec needs far more data to start working. 9 texts, with maybe 55 total words and perhaps around half that many unique words, are far too few to show any interesting results with this algorithm.
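The scale of the corpus is easy to check directly. The following is a minimal sketch using only the standard library; it counts tokens the same way the question does (lowercase, split on whitespace) and assumes that `min_count=2` simply discards every word seen fewer than two times.

```python
from collections import Counter

texts = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

# Same tokenization as in the question: lowercase, split on whitespace
tokens = [t.lower().split() for t in texts]
counts = Counter(word for doc in tokens for word in doc)

total_words = sum(counts.values())   # all tokens in the corpus
unique_words = len(counts)           # distinct tokens
# Words that would survive min_count=2, the setting used in the question
kept = {word for word, c in counts.items() if c >= 2}

print(total_words, unique_words, len(kept))
print("machine" in kept)
```

Note that "machine" occurs only once, so with `min_count=2` it never enters the vocabulary at all, which leaves even less signal for the query "human machine interface".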
A few of Gensim's Doc2Vec-specific test cases & tutorials manage to squeeze some vaguely understandable similarities out of a test dataset (from a file lee_background.cor) that has 300 documents, each of a few hundred words - so tens of thousands of words, several thousand of which are unique. But even those need to reduce the dimensionality & up the epochs, and the results are still very weak.
If you want to see meaningful results from Doc2Vec, you should be aiming for tens-of-thousands of documents, ideally with each document having dozens or hundreds of words.
Everything short of that is going to be disappointing and unrepresentative of the sort of tasks the algorithm was designed to work with.
There's a tutorial using a larger movie-review dataset (100K documents) that was also used in the original 'Paragraph Vector' paper at:
https://radimrehurek.com/gensim/auto_examples/howtos/run_doc2vec_imdb.html#sphx-glr-auto-examples-howtos-run-doc2vec-imdb-py
There's a tutorial based on Wikipedia (millions of documents) that might need some fixup to work nowadays at:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-wikipedia.ipynb

Related

Is it possible to search for part the of text using word embeddings?

I have found a successful weighting scheme for adding word vectors which seems to work for sentence comparison in my case:
query1 = vectorize_query("human cat interaction")
query2 = vectorize_query("people and cats talk")
query3 = vectorize_query("monks predicted frost")
query4 = vectorize_query("man found his feline in the woods")
>>> print(1 - spatial.distance.cosine(query1, query2))
0.7154500319
>>> print(1 - spatial.distance.cosine(query1, query3))
0.415183904078
>>> print(1 - spatial.distance.cosine(query1, query4))
0.690741014142
When I add additional information to the sentence, which acts as noise, the similarity decreases:
>>> query4 = vectorize_query("man found his feline in the dark woods while picking white mushrooms and watching unicorns")
>>> print(1 - spatial.distance.cosine(query1, query4))
0.618269123349
Are there any ways to deal with additional information when comparing texts using word vectors, when I know that some subset of the text can provide a better match?
UPD: edited the code above to make it more clear.
vectorize_query in my case does so-called smooth inverse frequency weighting, where word vectors from a GloVe model (it could be word2vec as well, etc.) are added with weights a/(a+w), where w should be the word frequency. In place of the frequency I use the word's inverse tfidf score, i.e. w = 1/tfidf(word). The coefficient a is typically set to 1e-3 in this approach. Taking just the tfidf score as the weight instead of that fraction gives an almost identical result. I also played with normalization, etc.
But I wanted to have just "vectorize sentence" in my example so as not to overload the question, as I think it does not depend on how I add word vectors using the weighting scheme - the problem is only that the comparison works best when sentences have approximately the same number of meaningful words.
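For readers unfamiliar with the weighting described above, here is a minimal sketch of the a/(a+w) idea. It is not the asker's `vectorize_query`: the function names, the toy 3-d vectors and the word probabilities are all made up for illustration, w is taken to be a plain unigram probability rather than 1/tfidf, and the principal-component removal step of the full smooth inverse frequency method is left out.

```python
import math

# Toy 3-d vectors and unigram probabilities, made up purely for illustration
VECTORS = {
    "the": [0.3, 0.3, 0.3],
    "cat": [0.9, 0.1, 0.0],
    "feline": [0.85, 0.2, 0.05],
    "woods": [0.0, 0.2, 0.9],
}
WORD_PROB = {"the": 0.05, "cat": 1e-4, "feline": 1e-5, "woods": 5e-5}


def sif_weight(word, a=1e-3):
    """Weight a / (a + p(word)): frequent words get small weights."""
    return a / (a + WORD_PROB[word])


def sif_vector(text, a=1e-3):
    """Weighted average of the vectors of all known words in the text."""
    words = [w for w in text.lower().split() if w in VECTORS]
    dim = len(next(iter(VECTORS.values())))
    total = [0.0] * dim
    for w in words:
        weight = sif_weight(w, a)
        for i, x in enumerate(VECTORS[w]):
            total[i] += weight * x
    return [x / len(words) for x in total] if words else total


def cosine_similarity(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


# A frequent word such as "the" barely moves the sentence vector
print(cosine_similarity(sif_vector("the cat"), sif_vector("cat")))
print(cosine_similarity(sif_vector("the cat"), sif_vector("woods")))
```

The weighting suppresses frequent function words, but rare content words that are merely irrelevant to the query ("mushrooms", "unicorns") keep large weights, which is consistent with the drop in similarity reported above.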
I am aware of another approach, where the distance between a sentence and a text is computed as the sum or mean of minimal pairwise word distances, e.g.
"Obama speaks to the media in Illinois" <-> "The President greets the press in Chicago" where we have dist = d(Obama, president) + d(speaks, greets) + d(media, press) + d(Chicago, Illinois). But this approach does not take into account that an adjective can change the meaning of a noun significantly, etc. - which is more or less incorporated in vector models. Adjectives like 'good', 'bad', 'nice', etc. become noise there, as they match in two texts and contribute zero or low distances, thus decreasing the distance between sentence and text.
I played a bit with doc2vec models (I believe it was the gensim doc2vec implementation) and with skip-thoughts embeddings, but in my case (matching a short query against a much bigger amount of text) I had unsatisfactory results.
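The pairwise approach mentioned in the question (sum of minimal word-to-word distances) can be sketched in a few lines. This is only an illustration under stated assumptions: the 3-d vectors are invented, cosine distance is used as d, stop words are assumed to be removed beforehand, and the function name `relaxed_distance` is hypothetical.

```python
import math

# Invented 3-d vectors; real ones would come from GloVe, word2vec, etc.
VECTORS = {
    "obama": [0.9, 0.1, 0.0], "president": [0.8, 0.2, 0.1],
    "speaks": [0.1, 0.9, 0.0], "greets": [0.2, 0.8, 0.1],
    "media": [0.1, 0.1, 0.9], "press": [0.2, 0.1, 0.8],
    "illinois": [0.6, 0.1, 0.6], "chicago": [0.5, 0.2, 0.6],
    "monks": [-0.2, -0.9, 0.1], "frost": [-0.7, 0.1, -0.6],
}


def cosine_distance(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)


def relaxed_distance(query_words, text_words):
    """Sum, over the query words, of the distance to the closest text word."""
    return sum(
        min(cosine_distance(VECTORS[q], VECTORS[t]) for t in text_words)
        for q in query_words
    )


a = ["obama", "speaks", "media", "illinois"]
b = ["president", "greets", "press", "chicago"]
c = ["monks", "frost"]

print(relaxed_distance(a, b))  # small: every word has a close partner
print(relaxed_distance(a, c))  # large: no word has a close partner
```

Each query word is matched independently here, so word order and modifiers play no role, which is exactly the limitation the question points out.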
If you are interested in part-of-speech to trigger similarity (e.g. only interested in nouns and noun phrases and ignore adjectives), you might want to look at sense2vec, which incorporates word classes into the model. https://explosion.ai/blog/sense2vec-with-spacy ...after which you can weight the word class while performing a dot product across all terms, effectively deboosting what you consider the 'noise'.
It's not clear that your original result, the similarity decreasing when a bunch of words are added, is 'bad' in general. A sentence that says a lot more is a very different sentence!
If that result is specifically bad for your purposes – you need a model that captures whether a sentence says "the same and then more", you'll need to find/invent some other tricks. In particular, you might need a non-symmetric 'contains-similar' measure – so that the longer sentence is still a good match for the shorter one, but not vice-versa.
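One hypothetical way to build such a non-symmetric measure is to ask how well each word of one text is covered by its best match in the other. The sketch below is an assumption-laden illustration, not an established method or a Gensim feature: the vectors are invented and `directed_coverage` is a made-up name.

```python
import math

# Invented 3-d vectors, for illustration only
VECTORS = {
    "human": [0.1, 0.9, 0.0], "man": [0.0, 1.0, 0.0],
    "cat": [1.0, 0.0, 0.0], "feline": [0.9, 0.1, 0.0],
    "woods": [0.0, 0.0, 1.0], "mushrooms": [0.0, 0.3, 0.9],
    "unicorns": [-0.5, -0.5, 0.5],
}


def cosine_similarity(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def directed_coverage(source_words, target_words):
    """Mean, over source words, of the best similarity found in the target.

    High when everything in the source has a counterpart in the target,
    regardless of how much extra material the target contains.
    """
    best = [
        max(cosine_similarity(VECTORS[s], VECTORS[t]) for t in target_words)
        for s in source_words
    ]
    return sum(best) / len(best)


short = ["human", "cat"]
long_text = ["man", "feline", "woods", "mushrooms", "unicorns"]

print(directed_coverage(short, long_text))  # query is well covered by the text
print(directed_coverage(long_text, short))  # text is poorly covered by the query
```

With this direction-dependent score, the extra words in the longer sentence no longer penalize the match from the short query's point of view, while the reverse direction still reports that the long sentence says more.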
Any shallow, non-grammar-sensitive embedding that's fed by word-vectors will likely have a hard time with single-word reversals-of-meaning, as for example the difference between:
After all considerations, including the relevant measures of economic, cultural, and foreign-policy progress, historians should conclude that Nixon was one of the very *worst* Presidents
After all considerations, including the relevant measures of economic, cultural, and foreign-policy progress, historians should conclude that Nixon was one of the very *best* Presidents
The words 'worst' and 'best' will already be quite-similar, as they serve the same functional role and appear in the same sorts of contexts, and may only contrast with each other a little in the full-dimensional space. And then their influence may be swamped in the influence of all the other words. Only more sophisticated analyses may highlight their role as reversing the overall import of the sentence.
While it's not yet an option in gensim, there are alternative ways to calculate the "Word Mover's Distance" that report the unmatched 'remainder' after all the easy pairwise-meaning-measuring is finished. While I don't know any prior analysis or code that'd flesh out this idea for your needs, or prove its value, I have a hunch such an analysis might help better discover cases of "says the same and more", or "says mostly the same but with reversal in a few words/aspects".

Trying to understand one-class SVM

I am trying to use one-class SVM with Python scikit-learn.
But I do not understand what the different variables X_outliers, n_error_train, n_error_test, n_error_outliers, etc. at this address are. Why is X randomly generated instead of being part of a dataset?
The scikit-learn "documentation" did not help me much. Also, I found very few examples on the Internet.
Can I use one-class SVM for outlier detection when I have a huge amount of data and do not know whether there are anomalies in my training set?
One-class SVM is an Unsupervised Outlier Detection (here)
One-class SVM is not an outlier-detection method, but a novelty-detection method (here)
Is this possible?
Ok, so this is not really a Python question, more of an SVM comprehension question, but eh. A typical SVM is two-class, and is an algorithm which has two phases:
First, it will learn relationships between variables and attributes. For example, you show your algorithm tomato pictures and banana pictures, telling him each time if it's a banana or a tomato, and you tell him to count the number of red pixels in each picture. If you do it correctly, the SVM will be trained, meaning he will know that pictures with lots of red pixels are more likely to be tomatoes than bananas.
Then comes the predicting phase. You show him a picture of a tomato or a banana without telling him which it is. And since he has been trained before, he will count the red pixels, and know which it is.
In your case of a one-class SVM, it's a bit simpler: basically, the training phase consists of showing him a bunch of variables which are all supposed to be similar. You show him a bunch of tomato pictures, telling him "these are tomatoes; anything too different from these is not a tomato".
The code you link to tests the SVM's capability of learning. You start by creating the variables X_train. Then you generate two other sets: X_test, which is similar to X_train (tomato pictures), and X_outliers, which is very different (banana pictures).
Then you show him the X_train variables and tell your SVM "this is the kind of variables we're looking for" with the line clf.fit(X_train). This is equivalent in my example to showing him lots of tomato images, and the SVM learning what a "tomato" is.
And then you test your SVM's capability to sort new variables, by showing him your two other sets (X_test and X_outliers), and asking him whether he thinks they are similar to X_train or not. You ask him that with the predict function, and predict will yield for every element in the sets either "1", i.e. "yes, this is a similar element to X_train", or "-1", i.e. "this element is very different".
In an ideal case, the SVM should yield only "1" for X_test and only "-1" for X_outliers. But this code is there to show you that this is not always the case. The n_error_ variables count the mistakes that the SVM makes: misclassifying X_test elements as "not similar to X_train" and X_outliers elements as "similar to X_train". You can see that there are even errors when the SVM is asked to predict on the very set that it has been trained on! (n_error_train)
Why are there such errors? Welcome to machine learning. The main difficulty with SVMs is building the training set so that it enables the SVM to learn to distinguish between classes efficiently. So you need to choose carefully the number of images you show him (and what he has to look out for in the images; in my example it was the number of red pixels, in the code it is the value of the variable, but that is a different question).
In the code, the bounded but random initialization of the X sets means that, for example, during one run you could train the SVM on an X_train set with lots of values between -0.3 and 0, even though they are randomly initialized between -0.3 and 0.3 (especially if you have few elements per set, say 5, and you get [-0.2 -0.1 0 -0.1 0.1]). And so, when you show the SVM an element with a value of 0.2, he will have trouble associating it with X_train, because he will have learned that X_train elements are more likely to have negative values.
This is equivalent to showing your SVM a few yellowish tomatoes when you train him, so when you show him a really red tomato afterwards, he will have trouble classifying it as a tomato.
This one-class SVM is a classifier to determine whether entries are similar or dissimilar to entries that the classifier has been trained with.
The script generates three sets:
A training set.
A test-set of entries that are similar to the training set.
A test-set of entries that are dissimilar to the training set.
The error is the number of entries from each of the sets that have been classified wrongly. That is, entries that have been classified as dissimilar to the training set when they were similar (for sets 1 and 2), or that have been classified as similar to the training set when they were dissimilar (set 3).
X_outliers: This is set 3.
n_error_train: The number of classification errors for the elements in the train-set (1).
n_error_test: The number of classification errors for the elements in the test-set (2).
n_error_outliers: The number of classification errors for the elements in the outlier-set (3).
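To make the bookkeeping concrete without depending on any library, the sketch below swaps the one-class SVM for a much simpler stand-in: a detector that accepts a point if it lies within a fixed radius of the training centroid. It is explicitly not an SVM, and the class name and its `nu` argument are hypothetical; only the three sets and the three error counters mirror the script discussed above.

```python
import math
import random


class CentroidNoveltyDetector:
    """Stand-in detector (NOT a one-class SVM): flags points far from the training centroid."""

    def __init__(self, nu=0.1):
        self.nu = nu  # fraction of training points allowed to fall outside

    def fit(self, X):
        dim = len(X[0])
        self.center = [sum(p[i] for p in X) / len(X) for i in range(dim)]
        dists = sorted(math.dist(p, self.center) for p in X)
        n_outside = round(self.nu * len(X))
        # Radius = distance of the farthest training point that is still accepted
        self.radius = dists[len(X) - n_outside - 1]
        return self

    def predict(self, X):
        # 1 = similar to the training set, -1 = dissimilar
        return [1 if math.dist(p, self.center) <= self.radius else -1 for p in X]


rng = random.Random(0)
# Set 1 and set 2: drawn from the same cluster around (2, 2)
X_train = [[rng.gauss(2, 0.3), rng.gauss(2, 0.3)] for _ in range(100)]
X_test = [[rng.gauss(2, 0.3), rng.gauss(2, 0.3)] for _ in range(20)]
# Set 3: spread uniformly over a much larger square
X_outliers = [[rng.uniform(-4, 4), rng.uniform(-4, 4)] for _ in range(20)]

clf = CentroidNoveltyDetector(nu=0.1).fit(X_train)
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)
y_pred_outliers = clf.predict(X_outliers)

n_error_train = y_pred_train.count(-1)       # training points rejected
n_error_test = y_pred_test.count(-1)         # similar points wrongly rejected
n_error_outliers = y_pred_outliers.count(1)  # dissimilar points wrongly accepted
print(n_error_train, n_error_test, n_error_outliers)
```

Even this toy detector makes "errors" on its own training set by construction, because it is told to leave a fraction `nu` of the training points outside the boundary. In scikit-learn's OneClassSVM the `nu` parameter plays a comparable role, as an upper bound on the fraction of training errors.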
This answer is meant to complement the scikit-learn description, which I agree is a bit technical. I will elaborate on some aspects of the One-Class SVM algorithm (OCSVM) here. OCSVM is designed to solve the unsupervised anomaly detection problem.
Given unlabelled data, it finds, in an n-dimensional space, a weight vector w with d components (w^T denotes its transpose).
The decision function of all SVM-based methods (including OCSVM) is:
$$f(x) = \operatorname{sign}(w^T x + b)$$
where the sign gives the label (-1 anomalous, 1 nominal) and b is a bias term that shifts the boundary.
In the classification problem, w is associated with the distance (margin) between 2 classes, but this differs in OCSVM: since there is only 1 class, it maximizes the margin from the origin (the original OCSVM paper demonstrates this).
As you see, it is a generic algorithm, because SVMs are a family of models that, like neural networks, can approximate any non-linear boundary. To achieve something complicated you have to construct your own kernel matrix.
To do this you need to find some convenient mathematical property (suggestions to improve the answer are welcome at this point).
But in most cases the Gaussian kernel is a kernel that has some quite nice mathematical properties and associated ML theorems, such as the law of large numbers.
The scikit-learn implementation is a wrapper around the LIBSVM implementation of SVM and offers 4 such kernels.
-nu
This is a problem-formulation parameter: it lets you tell the model how dirty your sample is.
More formally, it makes the problem an outlier detection problem, where you know your data is mixed (nominal and anomalous), instead of pure, where the problem is different and is called novelty detection.
-kernel
One of the most important decisions. Mathematically, the kernel is a big matrix of numbers through which the data is implicitly projected into a higher-dimensional space. A nice read demonstrating the issue is here, while the paper of Scholkopf, who created OCSVM, goes into more detail.
-gamma
In the case of the RBF kernel you essentially use a Gaussian projection.
Disclaimer, my interpretation: essentially, the gamma parameter describes how wide the Normal distribution $N(\mu, \sigma)$ is. Gamma is inversely related to its variance, so a larger gamma means a narrower Gaussian.
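The link between gamma and the Gaussian can be written down explicitly. The RBF kernel is commonly defined as k(x, y) = exp(-gamma * ||x - y||^2), which matches the Gaussian form exp(-||x - y||^2 / (2 * sigma^2)) when gamma = 1 / (2 * sigma^2). A minimal sketch, with hypothetical helper names:

```python
import math


def rbf_kernel(x, y, gamma):
    """k(x, y) = exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def gamma_from_sigma(sigma):
    """gamma that corresponds to a Gaussian with standard deviation sigma."""
    return 1.0 / (2.0 * sigma ** 2)


x, y = [0.0, 0.0], [1.0, 1.0]

# Small gamma (wide Gaussian): distant points still look similar
print(rbf_kernel(x, y, gamma=0.1))
# Large gamma (narrow Gaussian): similarity drops off quickly
print(rbf_kernel(x, y, gamma=10.0))
print(rbf_kernel(x, y, gamma_from_sigma(2.0)))
```

A very large gamma makes every training point its own small island, so the learned boundary hugs the training data tightly; a very small gamma gives a smoother, looser boundary.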
-tolerance
One-class SVM searches for the margin that best separates the training data from the origin. The tolerance refers to the stopping criterion, i.e. how small the tolerance for satisfying the quadratic optimization of the objective function should be. The objective function is the thing that tells the SVM what the parameters should look like to describe a specific margin (the space between nominal and anomalous).
Many sklearn examples are based on randomly generated data. If you want to see an example of how OneClassSVM works on a real dataset for outlier detection, you can go through my post: https://justanoderbit.com/outlier-detection/one-class-svm/

In general, when does TF-IDF reduce accuracy?

I'm classifying a corpus of 200000 reviews into positive and negative reviews using a Naive Bayes model, and I noticed that applying TF-IDF actually reduced the accuracy (measured on a test set of 50000 reviews) by about 2%. So I was wondering if TF-IDF makes any underlying assumptions about the data or model that it works with, i.e. are there cases where using it reduces accuracy?
The IDF component of TF*IDF can harm your classification accuracy in some cases.
Let's suppose the following artificial, easy classification task, made for the sake of illustration:
Class A: texts containing the word 'corn'
Class B: texts not containing the word 'corn'
Suppose now that in Class A, you have 100 000 examples and in class B, 1000 examples.
What will happen to TFIDF? The inverse document frequency of corn will be very low (because it is found in almost all documents), and the feature 'corn' will get a very small TFIDF, which is the weight of the feature used by the classifier. Obviously, 'corn' was THE best feature for this classification task. This is an example where TFIDF may reduce your classification accuracy. In more general terms, this can happen:
When there is class imbalance. If you have more instances in one class, the good word features of the frequent class risk having a lower IDF, thus its best features will have a lower weight.
When you have words with high frequency that are very predictive of one of the classes (words found in most documents of that class).
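The numbers in the 'corn' example can be plugged into the textbook definition idf = log(N / df). This is a minimal sketch: libraries often use smoothed variants of the formula, and the second word, found in only 50 documents, is a made-up comparison point.

```python
import math

n_class_a = 100_000  # texts containing the word 'corn'
n_class_b = 1_000    # texts not containing it
n_docs = n_class_a + n_class_b


def idf(doc_freq, n_docs):
    """Textbook inverse document frequency: log(N / df)."""
    return math.log(n_docs / doc_freq)


idf_corn = idf(n_class_a, n_docs)
# Hypothetical word that appears in only 50 documents, for comparison
idf_rare = idf(50, n_docs)

print(idf_corn)
print(idf_rare)
print(idf_rare / idf_corn)
```

The most predictive feature of the task ends up with an IDF close to zero, hundreds of times smaller than that of an arbitrary rare word.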
You can heuristically determine whether the usage of IDF on your training data decreases your predictive accuracy by performing grid search as appropriate.
For example, if you are working in sklearn, and you want to determine whether IDF decreases the predictive accuracy of your model, you can perform a grid search on the use_idf parameter of the TfidfVectorizer.
As an example, this code runs a grid search over the use of IDF for classification with SGDClassifier:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X = ...  # your training data (an iterable of raw text documents)
y = ...  # your labels
pipeline = Pipeline([('tfidf', TfidfVectorizer()),
                     ('sgd', SGDClassifier())])
# Try the pipeline both without and with IDF weighting
params = {'tfidf__use_idf': (False, True)}
gridsearch = GridSearchCV(pipeline, params)
gridsearch.fit(X, y)
print(gridsearch.best_params_)
The output would be either:
Parameters selected as the best fit:
{'tfidf__use_idf': False}
or
{'tfidf__use_idf': True}
TF-IDF, as far as I understand, is a feature. TF is term frequency, i.e. the frequency of occurrence of a term in a document. IDF is inverse document frequency, i.e. the inverse of the fraction of documents in which the term occurs.
Here, the model uses the TF-IDF information from the training corpus to make estimates for new documents. As a very simple example, say the word "bad" has a pretty high term frequency in training documents labelled negative. Then any new document containing "bad" will be more likely to be classified as negative.
To improve accuracy, you can manually select a training corpus which contains the most frequently used negative or positive words. This will boost the accuracy.

Document classification, using genetic algorithms

I have a bit of a problem with my project for the university.
I have to implement document classification using genetic algorithm.
I've had a look at this example and (let's say) understood the principles of genetic algorithms, but I'm not sure how they can be applied to document classification. I can't figure out the fitness function.
Here is what I've managed to think of so far (it's probably totally wrong...)
Accept that I have the categories and each category is described by some keywords.
Split the file into words.
Create the first population from arrays (100 arrays for example, but it will depend on the size of the file) filled with random words from the file.
1:
Choose the best category for each child in the population (by counting the keywords in it).
Cross over each 2 children in the population (new array containing half of each child) - "crossover"
Fill the rest of the children left from the crossover with random unused words from the file - "evolution??"
Replace random words in a random child from the new population with a random word from the file (used or not) - "mutation"
Copy the best results to the new population.
Go to 1 until some population limit is reached or some category is found enough times
I'm not sure if this is correct and would be happy to get some advice, guys.
Much appreciate it!
Ivane, in order to properly apply GA's to document classification:
You have to reduce the problem to a system of components that can be evolved.
You can't do GA training for document classification on a single document.
So the steps that you've described are on the right track, but I'll give you some improvements:
Have a sufficient amount of training data: you need a set of documents which are already classified and are diverse enough to cover the range of documents which you're likely to encounter.
Train your GA to correctly classify a subset of those documents, aka the Training Data Set.
At each generation, test your best specimen against a Validation Data Set and stop training if the validation accuracy starts to decrease.
So what you want to do is:
prevValidationFitness = default;
currentValidationFitness = default;
bestGA = default;
// Randomly generate the initial population of GAs once, before the loop
population[] = randomlyGenerateGAs();
do
{
    prevValidationFitness = currentValidationFitness;
    // Train your population on the training data set
    bestGA = Train(population);
    // Get the validation fitness of the best GA
    currentValidationFitness = Validate(bestGA);
    // Make your selection (i.e. half of the population, roulette wheel selection, or random selection)
    selection[] = makeSelection(population);
    // Mate the specimens in the selection (each mating involves a crossover and possibly a mutation)
    population = mate(selection);
}
while(currentValidationFitness.IsBetterThan( prevValidationFitness ) );
Whenever you get a new document (one which has not been classified before), you can now classify it with your best GA:
category = bestGA.Classify(document);
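For the fitness function the question asks about, one natural choice is the fraction of already-labelled documents that a candidate classifies correctly. The following is a minimal, self-contained sketch of that idea and not a reference implementation: the toy documents, the two categories and the genome encoding (one weight per vocabulary word) are all assumptions, and the validation-based stopping described above is left out for brevity.

```python
import random

rng = random.Random(42)

# Hypothetical labelled training documents: (words, category)
TRAIN = [
    ("goal match team score".split(), "sports"),
    ("team player goal".split(), "sports"),
    ("match score league".split(), "sports"),
    ("stock market price".split(), "finance"),
    ("bank market shares".split(), "finance"),
    ("price bank stock".split(), "finance"),
]
VOCAB = sorted({w for words, _ in TRAIN for w in words})
INDEX = {w: i for i, w in enumerate(VOCAB)}


def classify(genome, words):
    """genome[i] is the evolved weight of VOCAB[i]; a positive sum means 'sports'."""
    score = sum(genome[INDEX[w]] for w in words if w in INDEX)
    return "sports" if score > 0 else "finance"


def fitness(genome, data):
    """Fitness = fraction of labelled documents classified correctly."""
    hits = sum(classify(genome, words) == label for words, label in data)
    return hits / len(data)


def crossover(a, b):
    cut = rng.randrange(1, len(a))  # one-point crossover
    return a[:cut] + b[cut:]


def mutate(genome, rate=0.1):
    return [rng.uniform(-1, 1) if rng.random() < rate else g for g in genome]


def evolve(data, pop_size=20, generations=30):
    population = [[rng.uniform(-1, 1) for _ in VOCAB] for _ in range(pop_size)]
    history = []  # best training fitness per generation
    for _ in range(generations):
        population.sort(key=lambda g: fitness(g, data), reverse=True)
        history.append(fitness(population[0], data))
        parents = population[: pop_size // 2]  # the fitter half survives unchanged
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents)))
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children
    best = max(population, key=lambda g: fitness(g, data))
    return best, history


best_genome, history = evolve(TRAIN)
print(history[0], history[-1], classify(best_genome, "team goal".split()))
```

Because the fitter half of each generation is carried over unchanged, the best training fitness can never decrease from one generation to the next. With real data, the same `fitness` function evaluated on a held-out validation set would provide the stopping signal.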
So this is not the end-all-be-all solution, but it should give you a decent start.
Regards,
Kiril
You might find Learning Classifier Systems useful/interesting. An LCS is a type of evolutionary algorithm intended for classification problems. There is a chapter about them in Eiben & Smith's Introduction to Evolutionary Computing.

Classifying English words into rare and common

I'm trying to devise a method that will be able to classify a given number of English words into 2 sets - "rare" and "common" - the reference being to how much they are used in the language.
The number of words I would like to classify is bounded - currently at around 10,000, and includes everything from articles to proper nouns that could be borrowed from other languages (and would thus be classified as "rare"). I've done some frequency analysis from within the corpus, and I have a distribution of these words (ranging from 1 use to about 100 at most).
My intuition for such a system was to use word lists (such as the BNC word frequency corpus, wordnet, internal corpus frequency), and assign weights to a word's occurrence in each of them.
For instance, a word that has a mid-level frequency in the corpus (say 50), but appears in a word list W, can be regarded as common since it's one of the most frequent in the entire language. My question was - what's the best way to create a weighted score for something like this? Should I go discrete or continuous? In either case, what kind of a classification system would work best for this?
Or do you recommend an alternative method?
Thanks!
EDIT:
To answer Vinko's question on the intended use of the classification -
These words are tokenized from a phrase (eg: book title) - and the intent is to figure out a strategy to generate a search query string for the phrase, searching a text corpus. The query string can support multiple parameters such as proximity, etc - so if a word is common, these params can be tweaked.
To answer Igor's question -
(1) how big is your corpus?
Currently, the list is limited to 10k tokens, but this is just a training set. It could go up to a few 100k once I start testing it on the test set.
(2) do you have some kind of expected proportion of common/rare words in the corpus?
Hmm, I do not.
Assuming you have a way to evaluate the classification, you can use the "boosting" approach to machine learning. Boosting classifiers use a set of weak classifiers combined to a strong classifier.
Say, you have your corpus and K external wordlists you can use.
Pick N frequency thresholds. For example, you may have 10 thresholds: 0.1%, 0.2%, ..., 1.0%.
For your corpus and each of the external word lists, create N "experts", one expert per threshold per wordlist/corpus, total of N*(K+1) experts. Each expert is a weak classifier, with a very simple rule: if the frequency of the word is higher than its threshold, they consider the word to be "common". Each expert has a weight.
The learning process is as follows: assign the weight 1 to each expert. For each word in your corpus, make the experts vote. Sum their votes: 1 * weight(i) for "common" votes and (-1) * weight(i) for "rare" votes. If the result is positive, mark the word as common.
Now, the overall idea is to evaluate the classification and increase the weight of experts that were right and decrease the weight of the experts that were wrong. Then repeat the process again and again, until your evaluation is good enough.
The specifics of the weight adjustment depend on how you evaluate the classification. For example, if you don't have per-word evaluation, you may still evaluate the classification as "too many common" or "too many rare" words. In the first case, promote all the pro-"rare" experts and demote all pro-"common" experts, or vice-versa.
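The voting scheme described in this answer can be sketched as follows. Everything concrete here is an assumption made for illustration: the two sources, the relative frequencies, the multiplicative update factor and the helper names. It uses per-word feedback, which is only one of the evaluation options mentioned above.

```python
# Hypothetical relative frequencies (fraction of all tokens) per source
FREQ = {
    "corpus": {"the": 0.05, "graph": 0.0045, "quasi": 0.0002},
    "wordlist": {"the": 0.06, "graph": 0.0005, "quasi": 0.0001},
}
THRESHOLDS = [k / 1000 for k in range(1, 11)]  # 0.1%, 0.2%, ..., 1.0%

# One expert per (source, threshold) pair, all starting with weight 1
experts = [(source, th) for source in FREQ for th in THRESHOLDS]
weights = {expert: 1.0 for expert in experts}


def expert_vote(expert, word):
    """+1 ('common') if the word's frequency exceeds the expert's threshold, else -1."""
    source, threshold = expert
    return 1 if FREQ[source].get(word, 0.0) > threshold else -1


def classify(word):
    total = sum(weights[e] * expert_vote(e, word) for e in experts)
    return "common" if total > 0 else "rare"


def update(word, true_label, factor=1.5):
    """Promote experts that voted correctly on this word, demote the others."""
    target = 1 if true_label == "common" else -1
    for e in experts:
        if expert_vote(e, word) == target:
            weights[e] *= factor
        else:
            weights[e] /= factor


before = classify("graph")
# Feedback: an evaluator says "graph" should have been "common"
for _ in range(2):
    update("graph", "common")
after = classify("graph")
print(before, after, classify("the"), classify("quasi"))
```

After the feedback, the low-threshold experts of the in-house corpus carry more weight than the others, which is the "increase the weight of experts that were right" step in miniature.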
Your distribution is most likely a Pareto distribution (a superset of Zipf's law as mentioned above). I am shocked that the most common word is used only 100 times - this is including "a" and "the" and words like that? You must have a small corpus if that is the case.
Anyways, you will have to choose a cutoff for "rare" and "common". One potential choice is the mean expected number of appearances (see the linked wiki article above to calculate the mean). Because of the "fat tail" of the distribution, a fairly small number of words will have appearances above the mean -- these are the "common". The rest are "rare". This will have the effect that many more words are rare than common. Not sure if that is what you are going for but you can just move the cutoff up and down to get your desired distribution (say, all words with > 50% of expected value are "common").
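The effect of a mean-based cutoff on a fat-tailed distribution is easy to see on synthetic data. The sketch below invents Zipf-like counts for 10,000 words (the r-th most frequent word occurs about 100 / r times, at least once) and is only meant to illustrate the imbalance described above, not to model any real corpus.

```python
# Hypothetical Zipf-like counts: rank r occurs about 100 / r times, at least once
counts = {f"word{r}": max(1, round(100 / r)) for r in range(1, 10_001)}

mean_count = sum(counts.values()) / len(counts)

# Cutoff at the mean: above it is "common", everything else is "rare"
common = [w for w, c in counts.items() if c > mean_count]
rare = [w for w, c in counts.items() if c <= mean_count]

print(mean_count, len(common), len(rare))

# Moving the cutoff changes the split, e.g. half of the mean
relaxed_common = [w for w, c in counts.items() if c > 0.5 * mean_count]
print(len(relaxed_common))
```

With counts like these the mean is pulled only slightly above 1 by a handful of very frequent words, so the vast majority of the vocabulary lands on the "rare" side, and lowering the cutoff below 1 occurrence flips everything to "common". The cutoff therefore needs to be tuned against the intended use.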
While this is not an answer to your question, you should know that you are reinventing the wheel here.
Information Retrieval experts have devised ways to weight search words according to their frequency. A very popular weight is TF-IDF, which uses a word's frequency in a document and its frequency in a corpus. TF-IDF is also explained here.
An alternative score is the Okapi BM25, which uses similar factors.
See also the Lucene Similarity documentation for how TF-IDF is implemented in a popular search library.

Resources