Suppose I build an LDA topic model using gensim or sklearn and assign top topics to each document, but some documents don't match their assigned top topics. Besides trying out different numbers of topics, or using the coherence score to find the optimal number of topics, what other tricks can I use to improve my model?
LDA also (semi-secretly) takes the parameters alpha and beta. Think of alpha as the parameter that tells LDA how many topics each document should be generated from, and beta as the parameter that tells LDA how many topics each word should appear in. You can play with these and you may get better results.
However, LDA is an unsupervised model, and even the perfect settings for k, alpha, and beta will result in some incorrectly assigned documents. And if your data isn't preprocessed well, it almost doesn't matter what you set the parameters to; the model will always produce poor results.
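For instance, a minimal gensim preprocessing sketch might look like the following (the placeholder docs list and the filter thresholds are assumptions you would adapt to your corpus); it produces the corpus, id2word, and tokenized_articles objects used by the implementation below:

# One common preprocessing recipe: tokenize, drop stopwords, prune the vocabulary.
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from gensim.corpora import Dictionary

docs = ["Some raw document text ...", "Another raw document ..."]  # placeholder

tokenized_articles = [[t for t in simple_preprocess(d) if t not in STOPWORDS]
                      for d in docs]

id2word = Dictionary(tokenized_articles)
id2word.filter_extremes(no_below=5, no_above=0.5)  # drop rare/ubiquitous terms
corpus = [id2word.doc2bow(doc) for doc in tokenized_articles]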
A sample Python implementation for @rchurch4's answer:
We can try different numbers of topics and different values of alpha and beta (eta in gensim) to increase the coherence score; a higher coherence score indicates a better model.
import gensim
from gensim.models import CoherenceModel

# Assumes `corpus`, `id2word`, and the tokenized texts `tokenized_articles`
# were built during preprocessing (see the sketch above).
def calculate_coherence_score(n, alpha, beta):
    lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                                id2word=id2word,
                                                num_topics=n,
                                                random_state=100,
                                                update_every=1,
                                                chunksize=100,
                                                passes=10,
                                                alpha=alpha,
                                                eta=beta,
                                                per_word_topics=True)
    coherence_model_lda = CoherenceModel(model=lda_model, texts=tokenized_articles,
                                         dictionary=id2word, coherence='c_v')
    coherence_lda = coherence_model_lda.get_coherence()
    return coherence_lda
# lists containing various hyperparameter values
no_of_topics = [2, 5, 7, 10, 12, 14]
alpha_list = ['symmetric', 0.3, 0.5, 0.7]
beta_list = ['auto', 0.3, 0.5, 0.7]

for n in no_of_topics:
    for alpha in alpha_list:
        for beta in beta_list:
            coherence_score = calculate_coherence_score(n, alpha, beta)
            print(f"n : {n} ; alpha : {alpha} ; beta : {beta} ; Score : {coherence_score}")
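If you also want the winning combination at the end, a small (hypothetical) variant of the same loop tracks the best score as it goes:

# Hypothetical extension: remember the best-scoring combination.
best_score, best_params = float("-inf"), None
for n in no_of_topics:
    for alpha in alpha_list:
        for beta in beta_list:
            score = calculate_coherence_score(n, alpha, beta)
            if score > best_score:
                best_score, best_params = score, (n, alpha, beta)

print(f"Best: n={best_params[0]}, alpha={best_params[1]}, beta={best_params[2]} with coherence {best_score}")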
I have two Doc2Vec models trained on the same corpus but with different parameters. I would like to concatenate the two of them and calculate the similarity of a given input word, choosing the returned vectors from the concatenated model. I have read many comments saying that this method may not be particularly suited to improving performance, and that it might be necessary to change the source code of the KeyedVectors class in gensim to enable it. So far I have attempted to do it using the Translation Matrix, but it returns 5 features from the second model and I am not sure whether it is performing the translation correctly.
Has anybody already encountered this issue? Is there another way to calculate the similarity for an input word in a concatenated doc2vec model?
Up to now I have been able to reproduce this:
import numpy as np

vocab1 = model1.wv
vocab2 = model2.wv
concatenated_vectors = {}
vocab_concatenated = vocab1  # note: this aliases vocab1 rather than copying it
for i in range(len(vocab1.vectors)):
    v1 = vocab1.vectors[i]
    v2 = vocab2.vectors[i]
    vocab_concatenated[list(vocab1.vocab.keys())[i]] = np.concatenate((v1, v2))
In order to re-calculate the most_similar() top-n features for a passed argument, how should I re-instantiate the newly created object? It seems that
.add_vectors(list(vocab1.vocab.keys()), vocab_concatenated[list(vocab1.vocab.keys())])
is not working, but I am sure I am missing something.
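For reference, here is roughly what I am aiming for; a minimal sketch assuming the gensim 4.x KeyedVectors API and that both models share the same vocabulary in the same order:

import numpy as np
from gensim.models import KeyedVectors

# Horizontally stack the two vector tables (same vocab, same order assumed).
keys = model1.wv.index_to_key
vectors = np.hstack([model1.wv.vectors, model2.wv.vectors])

# Build a fresh KeyedVectors of the combined dimensionality.
combined = KeyedVectors(vector_size=vectors.shape[1])
combined.add_vectors(keys, vectors)

# most_similar() should then work in the concatenated space
# ('cat' is just a placeholder query word).
print(combined.most_similar('cat', topn=5))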
I am using the H2O autoencoder in R for anomaly detection. I don't have a training dataset, so I am using data.hex to train the model, and then the same data.hex to calculate the reconstruction errors. The rows in data.hex with the largest reconstruction errors are considered anomalous. The mean squared error (MSE) of the model, which is calculated by the model itself, is the sum of the squared reconstruction errors divided by the number of rows (i.e. examples). Below is some pseudo-code for the model.
# Deep learning model
model.dl <- h2o.deeplearning(x = x, training_frame = data.hex,
                             autoencoder = TRUE, activation = "Tanh",
                             hidden = c(25, 25, 25), variable_importances = TRUE)
# Anomaly detection algorithm
errors <- h2o.anomaly(model.dl, data.hex, per_feature = FALSE)
Currently there are about 10 features (factors) in my data.hex, and they are all categorical features. I have two questions below:
(1) Do I need to perform feature selection to select a subset of the 10 features before the data go into the deep learning model (with autoencoder = TRUE), in case some features are significantly associated with each other? Or do I not need to, since the data will go into an autoencoder, which already compresses the data and keeps only the most important information, so feature selection would be redundant?
(2) The purpose of using the H2O autoencoder here is to identify the senders in data.hex whose actions are anomalous. Here are two examples of data.hex: Example B is a transformed version of Example A, created by concatenating all the actions for each sender-receiver pair in Example A.
After running the model on data.hex in Example A and in Example B separately, what I got is
(a) the MSE from Example A (~0.005) is 20+ times larger than the MSE from Example B;
(b) when I put the reconstruction errors in ascending order and plot them (so errors increase from left to right in the plot), the reconstruction-error curve from Example A rises very steeply (skyrockets) at the right end, while the curve from Example B increases more gradually.
My question is, which example of data.hex works better for my purpose to identify anomalies?
Thanks for your insights!
Question 1
You shouldn't need to reduce the number of input features to the model. I can't say exactly what would happen during training, but collinear/associated features could be eliminated in the hidden layers, as you said. You could consider adjusting your hidden nodes and see how the model behaves: hidden = c(25,25,25) -> hidden = c(25,10,25) or hidden = c(15,15), or even hidden = c(7,5,7) for your few features.
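If it helps, here is a rough sketch of that experiment using H2O's Python API (h2o-py); the file path is a placeholder and I haven't run this against your data, so treat it as an illustration of the idea rather than a drop-in script:

# Compare reconstruction errors across different hidden-layer shapes.
import h2o
from h2o.estimators.deeplearning import H2OAutoEncoderEstimator

h2o.init()
data_hex = h2o.import_file("data.csv")  # placeholder path

for hidden in ([25, 25, 25], [25, 10, 25], [15, 15], [7, 5, 7]):
    model = H2OAutoEncoderEstimator(activation="Tanh", hidden=hidden)
    model.train(x=data_hex.columns, training_frame=data_hex)
    errors = model.anomaly(data_hex, per_feature=False)
    print(hidden, errors.mean())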
Question 2
What is the purpose of your model? Are you trying to determine which "Sender/Receiver" combinations are anomalies, or which "Sender/Receiver + specific Action" combinations are anomalies? If it's the former, I would guess Example B is better.
If you want to score "Sender/Receiver" combinations but use Example A, how would you aggregate all the actions for one Sender-Receiver combo? Would you average their errors?
That said, it sounds like Example A separates anomalies more sharply in the ascending-order plot (only a few rows have high error). I would sample different rows and check, as a domain expert, whether the rows with higher errors do look anomaly-like.
I'm looking for a way to do something like most_similar() from Gensim, but using spaCy.
I want to find the most similar sentence in a list of sentences using NLP.
I tried using similarity() from spaCy (e.g. https://spacy.io/api/doc#similarity) pair by pair in a loop, but it takes a very long time.
To go deeper:
I would like to put all these sentences in a graph (like this) to find sentence clusters.
Any idea?
This is a simple, built-in solution you could use:
import spacy
nlp = spacy.load("en_core_web_lg")
text = (
    "Semantic similarity is a metric defined over a set of documents or terms, where the idea of distance between items is based on the likeness of their meaning or semantic content as opposed to lexicographical similarity."
    " These are mathematical tools used to estimate the strength of the semantic relationship between units of language, concepts or instances, through a numerical description obtained according to the comparison of information supporting their meaning or describing their nature."
    " The term semantic similarity is often confused with semantic relatedness."
    " Semantic relatedness includes any relation between two terms, while semantic similarity only includes 'is a' relations."
    " My favorite fruit is apples."
)
doc = nlp(text)

max_similarity = 0.0
most_similar = None, None
for i, sent in enumerate(doc.sents):
    for j, other in enumerate(doc.sents):
        if j <= i:
            continue
        similarity = sent.similarity(other)
        if similarity > max_similarity:
            max_similarity = similarity
            most_similar = sent, other

print("Most similar sentences are:")
print(f"-> '{most_similar[0]}'")
print("and")
print(f"-> '{most_similar[1]}'")
print(f"with a similarity of {max_similarity}")
(text from Wikipedia)
It will yield the following output:
Most similar sentences are:
-> 'Semantic similarity is a metric defined over a set of documents or terms, where the idea of distance between items is based on the likeness of their meaning or semantic content as opposed to lexicographical similarity.'
and
-> 'These are mathematical tools used to estimate the strength of the semantic relationship between units of language, concepts or instances, through a numerical description obtained according to the comparison of information supporting their meaning or describing their nature.'
with a similarity of 0.9583859443664551
Note the following information from spacy.io:
To make them compact and fast, spaCy’s small pipeline packages (all packages that end in sm) don’t ship with word vectors, and only include context-sensitive tensors. This means you can still use the similarity() methods to compare documents, spans and tokens – but the result won’t be as good, and individual tokens won’t have any vectors assigned. So in order to use real word vectors, you need to download a larger pipeline package:
- python -m spacy download en_core_web_sm
+ python -m spacy download en_core_web_lg
Also see Document similarity in Spacy vs Word2Vec for advice on how to improve the similarity scores.
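If the pairwise similarity() loop is too slow for a long list of sentences, one alternative is to stack the sentence vectors and compute all pairwise cosine similarities in a single matrix product. This is a sketch assuming a vector-bearing pipeline such as en_core_web_lg (spaCy's similarity() is itself a cosine over these vectors, so the result should closely match the loop above):

import numpy as np
import spacy

nlp = spacy.load("en_core_web_lg")
sentences = list(nlp(text).sents)  # `text` as defined above

# Stack sentence vectors and L2-normalize the rows; one matrix product
# then yields every pairwise cosine similarity at once.
vectors = np.array([sent.vector for sent in sentences])
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
sim_matrix = unit @ unit.T

# Mask self-similarity, then read off the most similar pair.
np.fill_diagonal(sim_matrix, -1.0)
i, j = np.unravel_index(sim_matrix.argmax(), sim_matrix.shape)
print(f"'{sentences[i]}' ~ '{sentences[j]}' ({sim_matrix[i, j]:.4f})")

The resulting sim_matrix is also exactly the kind of input you could feed into a graph or clustering step for the "go deeper" part of the question.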
I'm a very new student of doc2vec and have some questions about document vectors.
What I'm trying to get is a vector for a phrase like 'cat-like mammal'.
So, what I've tried so far, using a pre-trained doc2vec model, is the code below:

import gensim.models as g

model = "path/pre-trained doc2vec model.bin"
m = g.Doc2Vec.load(model)

oneword = 'cat'
phrase = 'cat like mammal'

oneword_vec = m[oneword]
phrase_vec = m[phrase]
When I tried this code, I could get a vector for the single word 'cat', but not for 'cat-like mammal'.
That's because word2vec only provides vectors for single words like 'cat', right? (If I'm wrong, please correct me.)
So I've searched and found infer_vector() and tried the code below
phrase = phrase.lower().split(' ')
phrase_vec = m.infer_vector(phrase)
When I tried this code I could get a vector, but I got a different value each time I ran
phrase_vec = m.infer_vector(phrase)
because infer_vector has a 'steps' parameter.
When I set steps=0, I always get the same vector.
phrase_vec = m.infer_vector(phrase, steps=0)
However, I also found that a document vector can be obtained by averaging the vectors of the words in the document:
if the document is composed of the three words 'cat-like mammal', add the three vectors for 'cat', 'like', and 'mammal', then average them, and that would be the document vector. (If I'm wrong, please correct me.)
So here are some questions.
Is using infer_vector() with steps=0 the right way to get the vector of a phrase?
If averaging word vectors is the right way to get a document vector, is there no need to use infer_vector() at all?
What is model.docvecs for?
Using 0 steps means no inference at all happens: the vector stays at its randomly-initialized position. So you definitely don't want that. That the vectors for the same text vary a little each time you run infer_vector() is normal: the algorithm is using randomness. The important thing is that they're similar-to-each-other, within a small tolerance. You are more likely to make them more similar (but still not identical) with a larger steps value.
You can see also an entry about this non-determinism in Doc2Vec training or inference in the gensim FAQ.
Averaging word-vectors together to get a doc-vector is one useful technique, that might be good as a simple baseline for many purposes. But it's not the same as what Doc2Vec.infer_vector() does - which involves iteratively adjusting a candidate vector to be better and better at predicting the text's words, just like Doc2Vec training. For your doc-vector to be comparable to other doc-vectors created during model training, you should use infer_vector().
The model.docvecs object holds all the doc-vectors that were learned during model training, for lookup (by the tags given as their names during training) or other operations, like finding the most_similar() N doc-vectors to a target tag/vector amongst those learned during training.
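To make the contrast concrete, here is a small sketch, assuming `m` is your loaded Doc2Vec model (in gensim 4.x the inference parameter is named epochs rather than steps, and model.docvecs also goes by model.dv):

import numpy as np

tokens = 'cat like mammal'.split()

# 1) Real inference: iteratively fits a vector to the text; more
#    epochs give more stable (though never bit-identical) results.
inferred_vec = m.infer_vector(tokens, epochs=50)

# 2) Baseline: simple average of the words' individual vectors.
averaged_vec = np.mean([m.wv[t] for t in tokens if t in m.wv], axis=0)

# Compare the inferred vector against the doc-vectors learned in training.
print(m.docvecs.most_similar([inferred_vec], topn=5))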
I need to cluster different descriptions of parts from catalog data from different vendors. I am trying to find an approach that can detect clusters of similar descriptions, for the purpose of grouping them together.
This is a sample dataset for one part number i.e.
A100: ["COCPIT VOICE RECORDER", "RECORDER", "VOICE RECORDER","BELT", "REGULARTOR BELT", "OXIGEN REGULATOR", "BULB", "OXIGEN REG"]
Expected result of clustering will be i.e. :
Cluster 1: ["COCPIT VOICE RECORDER", "RECORDER", "VOICE RECORDER"],
Cluster 2 : ["BELT"],
Cluster 3: ["OXIGEN REG", "OXIGEN REGULATOR"],
Cluster 4: ["BULB"]
or variations of it.
I have never had experience with this, but my basic research on ML shows that the first thing you need to do is extract features from the data, so I tried coming up with some.
My feature-extraction approach was to compare every description with every other one using a string-similarity function, e.g. edit (Levenshtein) distance or Jaro-Winkler distance.
Then my idea was to use the k-means algorithm to find clusters.
Any ideas on whether this feature selection is good?
Any other idea about feature extraction or an approach to this problem?
Thanks !
I have done something similar, where my feature vector is the number of times each product description contains each dictionary word (so for each entry you get a long vector which is mostly 0s, with a few 1s or 2s). You can then feed this into a clustering algorithm of your choice (I also used k-means).
In python the general idea is:
# loop over all descriptions to build the dictionary of known words
words = {}
for productDesc in products:
    for word in productDesc.split(" "):
        if word not in words:
            words[word] = 0

# build a word-count vector for each description
matrix = []
for productDesc in products:
    vec = words.copy()
    for word in productDesc.split(" "):
        vec[word] = vec[word] + 1
    matrix.append(vec)
Once you have a feature matrix like this, you can use your favourite clustering algorithm. For this I would either use k-means directly, or compute a similarity matrix (for each pair of rows, count the number of words in common) and then use spectral clustering.
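For completeness, here is a compact sketch of that first pipeline with scikit-learn, where CountVectorizer builds the word-count matrix described above; n_clusters=4 simply matches the expected result in the question and is an assumption:

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

products = ["COCPIT VOICE RECORDER", "RECORDER", "VOICE RECORDER",
            "BELT", "REGULARTOR BELT", "OXIGEN REGULATOR",
            "BULB", "OXIGEN REG"]

# Word-count feature matrix (rows = descriptions, columns = words).
X = CountVectorizer().fit_transform(products)

# Cluster the count vectors; the cluster count is an assumption.
labels = KMeans(n_clusters=4, random_state=0, n_init=10).fit_predict(X)

for label, desc in sorted(zip(labels, products)):
    print(label, desc)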