ML approach for clustering product descriptions - algorithm

I need to cluster different descriptions of parts from catalog data from different vendors. I am trying to find an "approach" that can detect clusters of similar descriptions for the purpose of grouping them together.
This is a sample dataset for one part number i.e.
Expected result of clustering will be i.e. :
Cluster 2 : ["BELT"],
Cluster 4: ["BULB"]
or variations of it.
I have no experience with this, but my basic research on ML shows that the first thing you need to do is extract features from the data, so I tried coming up with some features...
My feature extraction approach was to compare each and every one of these parts with each other using a similarity function (e.g. edit distance, also called Levenshtein distance, or Jaro-Winkler distance).
My idea was then to use the k-means algorithm to find clusters.
Any ideas if this feature selection is good?
Any other idea about feature extraction or an approach to this problem?
Thanks !

I have done something similar where my feature vector is how many times each product description contains each dictionary word (so for each entry you get a long vector which is mostly 0's with a few 1's or 2's). You can then feed this into a clustering algorithm of your choice (I also used k-means).
In Python the general idea is:
# loop over all descriptions to build the word list (vocabulary)
# `products` is assumed to be a list of description strings
words = {}
for productDesc in products:
    for word in productDesc.split():
        if word not in words:
            words[word] = 0
# build a count vector for each description
matrix = []
for productDesc in products:
    vec = words.copy()
    for word in productDesc.split():
        vec[word] += 1
    # dicts keep insertion order, so every row has the same column order
    matrix.append(list(vec.values()))
Once you have a feature matrix like this you can use your favourite clustering algorithm, for this I would use either kmeans directly, or compute a similarity matrix (for each pair of rows in the matrix compute the number of words in common) and then use spectral clustering.
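The similarity matrix mentioned above (number of words in common for each pair of descriptions) could be sketched roughly as follows. This is only an illustration: the function name `common_word_similarity` and the sample descriptions are made up, and it counts distinct shared words after lower-casing, which is one of several reasonable readings of "words in common".

```python
def common_word_similarity(descriptions):
    """Return an n x n matrix counting distinct words shared by each pair."""
    # Lower-case and split each description into a set of distinct words.
    word_sets = [set(d.lower().split()) for d in descriptions]
    return [[len(a & b) for b in word_sets] for a in word_sets]

# Hypothetical sample descriptions, for illustration only.
products = ["FAN BELT", "BELT V-RIBBED", "HEADLIGHT BULB", "BULB 12V"]
similarity = common_word_similarity(products)
for row in similarity:
    print(row)
```

A matrix like this can then be handed to any clustering method that accepts precomputed similarities.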


Use Spacy to find most similar sentences in doc

I'm looking for a solution to use something like most_similar() from Gensim but using Spacy.
I want to find the most similar sentence in a list of sentences using NLP.
I tried to use similarity() from Spacy (e.g. one by one in a loop), but it takes a very long time.
To go deeper :
I would like to put all these sentences in a graph (like this) to find sentence clusters.
Any idea ?
This is a simple, built-in solution you could use:
import spacy

nlp = spacy.load("en_core_web_lg")
text = (
    "Semantic similarity is a metric defined over a set of documents or terms, where the idea of distance between items is based on the likeness of their meaning or semantic content as opposed to lexicographical similarity."
    " These are mathematical tools used to estimate the strength of the semantic relationship between units of language, concepts or instances, through a numerical description obtained according to the comparison of information supporting their meaning or describing their nature."
    " The term semantic similarity is often confused with semantic relatedness."
    " Semantic relatedness includes any relation between two terms, while semantic similarity only includes 'is a' relations."
    " My favorite fruit is apples."
)
doc = nlp(text)
max_similarity = 0.0
most_similar = None, None
for i, sent in enumerate(doc.sents):
    for j, other in enumerate(doc.sents):
        # skip self-comparisons and pairs that were already compared
        if j <= i:
            continue
        similarity = sent.similarity(other)
        if similarity > max_similarity:
            max_similarity = similarity
            most_similar = sent, other
print("Most similar sentences are:")
print(f"-> '{most_similar[0]}'")
print(f"-> '{most_similar[1]}'")
print(f"with a similarity of {max_similarity}")
(text from wikipedia)
It will yield the following output:
Most similar sentences are:
-> 'Semantic similarity is a metric defined over a set of documents or terms, where the idea of distance between items is based on the likeness of their meaning or semantic content as opposed to lexicographical similarity.'
-> 'These are mathematical tools used to estimate the strength of the semantic relationship between units of language, concepts or instances, through a numerical description obtained according to the comparison of information supporting their meaning or describing their nature.'
with a similarity of 0.9583859443664551
Note the following information from the spaCy documentation:
To make them compact and fast, spaCy’s small pipeline packages (all packages that end in sm) don’t ship with word vectors, and only include context-sensitive tensors. This means you can still use the similarity() methods to compare documents, spans and tokens – but the result won’t be as good, and individual tokens won’t have any vectors assigned. So in order to use real word vectors, you need to download a larger pipeline package:
- python -m spacy download en_core_web_sm
+ python -m spacy download en_core_web_lg
Also see Document similarity in Spacy vs Word2Vec for advice on how to improve the similarity scores.
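The pairwise search can also be written over plain vectors with itertools.combinations, which visits each unordered pair exactly once. The sketch below is only an illustration on made-up toy vectors; `cosine` and `most_similar_pair` are hypothetical helper names, and with spaCy the vectors would presumably come from each sentence's `vector` attribute.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def most_similar_pair(vectors):
    """Return (i, j, score) for the pair of distinct items with the highest similarity."""
    return max(
        ((i, j, cosine(vectors[i], vectors[j]))
         for i, j in combinations(range(len(vectors)), 2)),
        key=lambda item: item[2],
    )

# Hypothetical toy vectors standing in for sentence vectors.
vectors = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
i, j, score = most_similar_pair(vectors)
print(i, j, round(score, 3))
```

The same pair scores could also serve as edge weights if the sentences are later placed in a graph for clustering.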

Doc2vec - About getting document vector

I'm a very new student of doc2vec and have some questions about document vector.
What I'm trying to get is a vector of phrase like 'cat-like mammal'.
So far, I have tried the code below with a pre-trained doc2vec model:
import gensim.models as g
model = "path/pre-trained doc2vec model.bin"
m = g.Doc2Vec.load(model)
oneword = 'cat'
phrase = 'cat like mammal'
oneword_vec = m[oneword]
phrase_vec = m[phrase]
With this code, I could get a vector for the single word 'cat', but not for 'cat-like mammal'.
Is that because word2vec only provides vectors for single words like 'cat'? (If I'm wrong, please correct me.)
So I searched, found infer_vector(), and tried the code below:
phrase = phrase.lower().split(' ')
phrase_vec = m.infer_vector(phrase)
With this code, I could get a vector, but I get a different value every time I run
phrase_vec = m.infer_vector(phrase)
because infer_vector has 'steps'.
When I set steps=0, I always get the same vector.
phrase_vec = m.infer_vector(phrase, steps=0)
However, I also found that a document vector can be obtained by averaging the vectors of the words in the document.
For example, if the document is composed of the three words 'cat-like mammal', you add the three vectors of 'cat', 'like' and 'mammal' and then average them, and that would be the document vector. (If I'm wrong, please correct me.)
So here are some questions.
Is using infer_vector() with 0 steps the right way to get a vector for a phrase?
If averaging the word vectors is a valid way to get a document vector, is there no need to use infer_vector()?
What is model.docvecs for?
Using 0 steps means no inference at all happens: the vector stays at its randomly-initialized position. So you definitely don't want that. That the vectors for the same text vary a little each time you run infer_vector() is normal: the algorithm is using randomness. The important thing is that they're similar-to-each-other, within a small tolerance. You are more likely to make them more similar (but still not identical) with a larger steps value.
You can see also an entry about this non-determinism in Doc2Vec training or inference in the gensim FAQ.
Averaging word-vectors together to get a doc-vector is one useful technique, that might be good as a simple baseline for many purposes. But it's not the same as what Doc2Vec.infer_vector() does - which involves iteratively adjusting a candidate vector to be better and better at predicting the text's words, just like Doc2Vec training. For your doc-vector to be comparable to other doc-vectors created during model training, you should use infer_vector().
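As a rough illustration of the averaging baseline only (not of what infer_vector() does), a sketch could look like this. The 2-dimensional vectors and the helper name `average_vector` are made up; in practice the word vectors would come from a trained model.

```python
def average_vector(tokens, word_vectors):
    """Average the vectors of the tokens that have one; return None if none do."""
    found = [word_vectors[t] for t in tokens if t in word_vectors]
    if not found:
        return None
    dim = len(found[0])
    return [sum(vec[k] for vec in found) / len(found) for k in range(dim)]

# Hypothetical 2-dimensional word vectors, for illustration only.
toy_vectors = {"cat": [1.0, 0.0], "like": [0.0, 1.0], "mammal": [0.5, 0.5]}
phrase_vec = average_vector("cat like mammal".split(), toy_vectors)
print(phrase_vec)  # [0.5, 0.5]
```

Words without a vector are simply skipped here, which is one possible design choice among several.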
The model.docvecs object holds all the doc-vectors that were learned during model training, for lookup (by the tags given as their names during training) or other operations, like finding the most_similar() N doc-vectors to a target tag/vector amongst those learned during training.

Tag based clustering algorithm

I am looking to cluster many feeds based on their tags.
A typical example would be Twitter feeds. Each feed has user-defined tags associated with it. By analyzing the tags, is it possible to cluster the feeds into different groups and tell how many feeds are based on which tags?
An example would be -
Feed1 - Earthquake in indonasia #earthquake #asia #bad
Feed2 - There is a large earthquake in my area #earthquake #bad
Feed3 - My parents went to singapore #asia #tour
Feed4 - XYZ company is laying off many people #XYZ #layoff #bear
Feed5 - XYZ is getting bad is planning to layoff #XYZ #layoff #bad
Feed6 - XYZ is in a layoff spree #layoff #XYZ #worst
After clustering
#asia, #earthquake - Feed1, Feed2
#XYZ, #layoff - Feed4, Feed5, Feed6
Here clustering is found purely on basis of tags.
Is there any good algorithm to achieve this?
If I understand your question correctly, you would like to cluster the tags together and then put the feeds into these clusters based on the tags in the feed.
For this, you could create a similarity measure between the tags based on the number of feeds that the tags appear in together. For your example, this would be something like this
            | #earthquake | #asia | #bad | ...
#earthquake | 1           | 1/2   | 2/2
#asia       | 1/2         | 1     | 1/2
#bad        | 2/3         | 1/3   | 1
Here, value at (i,j) equals frequency of (i,j)/frequency of (i).
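A minimal sketch of how such a matrix could be computed from the example feeds is shown below. The function name `tag_similarity` and the nested-dict layout are assumptions made for illustration; exact fractions are used so the values can be compared with the table.

```python
from fractions import Fraction

# Tags per feed, taken from the example in the question.
feeds = {
    "Feed1": {"earthquake", "asia", "bad"},
    "Feed2": {"earthquake", "bad"},
    "Feed3": {"asia", "tour"},
    "Feed4": {"XYZ", "layoff", "bear"},
    "Feed5": {"XYZ", "layoff", "bad"},
    "Feed6": {"layoff", "XYZ", "worst"},
}

def tag_similarity(feeds):
    """Return sim[i][j] = (feeds containing both i and j) / (feeds containing i)."""
    tags = sorted({t for tag_set in feeds.values() for t in tag_set})
    count = {t: sum(t in s for s in feeds.values()) for t in tags}
    sim = {}
    for i in tags:
        sim[i] = {}
        for j in tags:
            both = sum(i in s and j in s for s in feeds.values())
            sim[i][j] = Fraction(both, count[i])
    return sim

sim = tag_similarity(feeds)
print(sim["bad"]["earthquake"])  # 2/3
print(sim["earthquake"]["bad"])  # 1
```

Note that the measure is not symmetric, so a clustering algorithm that expects a symmetric matrix may need the two directions combined first.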
Now you have a similarity matrix between the tags and you could use virtually any clustering algorithm that suits your needs. Since the number of tags can be very large and estimating the number of clusters is difficult before running the algorithm, I would suggest using some hierarchical clustering algorithm like Fast Modularity clustering, which is also very fast (See some details here). However, if you have some estimate of the number of clusters that you would like to break this into, then Spectral clustering might be useful too (See some details here).
After you cluster the tags together, you could use a simple approach to assign each feed to a cluster. This can be very simple, for example, counting the number of tags from each cluster in a feed and assigning a cluster with the maximum number of matching tags.
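The assignment step described above might be sketched like this. The cluster names and their tag sets are hypothetical stand-ins for whatever the tag clustering produces, and `assign_feed` is an illustrative name.

```python
def assign_feed(feed_tags, tag_clusters):
    """Return the name of the cluster sharing the most tags with the feed, or None."""
    best_name, best_count = None, 0
    for name, cluster_tags in tag_clusters.items():
        matches = len(feed_tags & cluster_tags)
        if matches > best_count:
            best_name, best_count = name, matches
    return best_name

# Hypothetical tag clusters, as a clustering step might produce them.
tag_clusters = {
    "disaster": {"earthquake", "asia", "bad"},
    "business": {"XYZ", "layoff", "bear", "worst"},
}
print(assign_feed({"earthquake", "bad"}, tag_clusters))      # disaster
print(assign_feed({"XYZ", "layoff", "bad"}, tag_clusters))   # business
print(assign_feed({"tour"}, tag_clusters))                   # None
```

In this sketch a tie goes to the cluster listed first, and a feed with no matching tags stays unassigned; both are choices you may want to change.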
If you are flexible on your clustering strategy, then you could also try clustering the feeds together in a similar way by creating a similarity between the feeds based on the number of common tags between the feeds and then applying a clustering algorithm on the similarity matrix.
Interesting question. I'm making things up here, but I think this would work.
For each feed, come up with a complete list of tag combinations (of length >= 2), probably sorted for consistency. For example:
Feed1: (asia-bad), (asia-earthquake), (bad-earthquake), (asia-bad-earthquake)
Feed2: (bad-earthquake)
Feed3: (asia-tour)
Feed4: (bear-layoff), (bear-XYZ), (layoff-XYZ), (bear-layoff-XYZ)
Feed5: (bad-layoff), (bad-XYZ), (layoff-XYZ), (bad-layoff-XYZ)
Feed6: (layoff-worst), (layoff-XYZ), (worst-XYZ), (layoff-worst-XYZ)
Then reverse the mapping:
(asia-bad): Feed1
(asia-earthquake): Feed1
(bad-earthquake): Feed1, Feed2
(asia-bad-earthquake): Feed1
(asia-tour): Feed3
(bear-layoff): Feed4
(layoff-XYZ): Feed4, Feed5, Feed6
You can then keep only the entries whose frequency meets some threshold. In this case, if we take a frequency threshold of 2, then you'd get (bad-earthquake) with Feed1 and Feed2, and (layoff-XYZ) with Feed4, Feed5 and Feed6.
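A rough sketch of this idea on the example feeds follows. The function name and the parameters `min_feeds` and `max_len` are made up for illustration; tags are sorted case-insensitively so that the combinations match the spelling used above, and capping the combination length keeps the number of combinations per feed manageable.

```python
from collections import defaultdict
from itertools import combinations

# Tags per feed, taken from the example in the question.
feeds = {
    "Feed1": ["earthquake", "asia", "bad"],
    "Feed2": ["earthquake", "bad"],
    "Feed3": ["asia", "tour"],
    "Feed4": ["XYZ", "layoff", "bear"],
    "Feed5": ["XYZ", "layoff", "bad"],
    "Feed6": ["layoff", "XYZ", "worst"],
}

def frequent_tag_combinations(feeds, min_feeds=2, max_len=3):
    """Map each tag combination to the feeds containing it, keeping frequent ones."""
    index = defaultdict(list)
    for name, tags in feeds.items():
        ordered = sorted(set(tags), key=str.lower)  # consistent order per feed
        for size in range(2, min(max_len, len(ordered)) + 1):
            for combo in combinations(ordered, size):
                index[combo].append(name)  # reverse mapping: combination -> feeds
    return {combo: names for combo, names in index.items() if len(names) >= min_feeds}

result = frequent_tag_combinations(feeds)
for combo, names in result.items():
    print("-".join(combo), names)
```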
Performance Concerns
A naive implementation of this would have extremely poor performance -- exponential in the number of tags per feed (not to mention space requirements). However, there are various ways to apply heuristics to improve this. For example:
Determine the most popular X tags by scanning all feeds (or a random selection of X feeds) -- this is linear in the number of tags per feed. Then only consider the Y most popular tags for each feed.
Determine the frequency of all (or most) tags. Then, for each post, only consider the X most popular tags in that post. This prevents situations where you have, say, fifteen tags for some post, resulting in a huge list of combinations, most of which would never occur.
For each post, only consider combinations of length <= X. For example, if a feed had fifteen tags, you could end up with a huge number of combinations, but most of them would have very few occurrences, especially the long ones. So only consider combinations of two or three tags.
Only scan a random selection of X feeds.
Hope this helps!

Effective clustering of a similarity matrix

My topic is similarity and clustering of (a bunch of) texts. In a nutshell: I want to cluster collected texts so that they end up in meaningful clusters. My approach so far is as follows; my problem is with the clustering. The current software is written in PHP.
1) Similarity:
I treat every document as a "bag-of-words" and convert words into vectors. I use
filtering (only "real" words)
tokenization (split sentences into words)
stemming (reduce words to their base form; Porter's stemmer)
pruning (cut off words with too high or too low frequency)
as methods for dimensionality reduction. After that, I'm using cosine similarity (as suggested / described on various sites on the web and here).
The result then is a similarity matrix like this:
    A   B   C   D   E
A   0  30  51  75  80
B   X   0  21  55  70
C   X   X   0  25  10
D   X   X   X   0  15
E   X   X   X   X   0
A…E are my texts and the number is the similarity in percent; the higher, the more similar the texts are. Because sim(A,B) == sim(B,A) only half of the matrix is filled in. So the similarity of Text A to Text D is 75%.
I now want to generate an a priori unknown(!) number of clusters out of this matrix. The clusters should group similar items together (up to a certain stopping criterion).
I tried a basic implementation myself, which was basically like this (60% as a fixed similarity threshold)
foreach article
    get similar entries where sim > 60
    foreach similar entry
        check if one of the entries already has a cluster number
        if no: assign new cluster number to all similar entries
        if yes: use that number
It worked (somehow), but wasn't good at all and the results were often monster-clusters.
So, I want to redo this and have already looked into all kinds of clustering algorithms, but I'm still not sure which one will work best. I think it should be an agglomerative algorithm, because every pair of texts can be seen as a cluster in the beginning. But the open questions are still what the stopping criterion is and whether the algorithm should divide and/or merge existing clusters.
Sorry if some of the stuff seems basic, but I am relatively new in this field. Thanks for the help.
Since you're new to the field, have an unknown number of clusters, and are already using cosine distance, I would recommend the FLAME clustering algorithm.
It's intuitive, easy to implement, and has implementations in a large number of languages (not PHP though, largely because very few people use PHP for data science).
Not to mention, it's actually good enough to be used in research by a large number of people. If nothing else you can get an idea of what exactly the shortcomings are in this clustering algorithm that you want to address in moving onto another one.
Just try some. There are so many clustering algorithms out there, nobody will know all of them. Plus, it also depends a lot on your data set and the clustering structure that is there.
In the end, there may also be just this one monster cluster with respect to cosine distance and bag-of-words features.
Maybe you can transform your similarity matrix into a dissimilarity matrix, for example by transforming x to 1/x; your problem is then to cluster a dissimilarity matrix. I think hierarchical clustering may work. These may help you: hierarchical clustering and Clustering a dissimilarity matrix
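To give an idea of what agglomerative clustering with a stopping criterion can look like, here is a minimal sketch that works directly on the similarity values from the question rather than on a converted dissimilarity matrix. Average linkage and the 60% threshold (borrowed from the question's own attempt) are assumptions, not the only sensible choices, and `agglomerate` is an illustrative name.

```python
def agglomerate(labels, sim, threshold):
    """Average-linkage agglomerative clustering that stops below `threshold`."""
    def pair_sim(a, b):
        # Only the upper triangle is stored, so try both key orders.
        return sim[(a, b)] if (a, b) in sim else sim[(b, a)]

    def cluster_sim(c1, c2):
        total = sum(pair_sim(a, b) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters = [[label] for label in labels]  # every text starts as its own cluster
    while len(clusters) > 1:
        # Find the two clusters with the highest average similarity.
        score, i, j = max(
            (cluster_sim(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if score < threshold:
            break  # stopping criterion: nothing left that is similar enough
        clusters[i] = sorted(clusters[i] + clusters[j])
        del clusters[j]
    return sorted(clusters)

# Upper triangle of the similarity matrix from the question (in percent).
sim = {
    ("A", "B"): 30, ("A", "C"): 51, ("A", "D"): 75, ("A", "E"): 80,
    ("B", "C"): 21, ("B", "D"): 55, ("B", "E"): 70,
    ("C", "D"): 25, ("C", "E"): 10,
    ("D", "E"): 15,
}
clusters = agglomerate("ABCDE", sim, threshold=60)
print(clusters)  # [['A', 'E'], ['B'], ['C'], ['D']]
```

Because a merged cluster is compared by its average similarity to the others, one strong link is not enough to pull a text into a cluster, which tends to limit monster clusters compared with chaining everything above the threshold.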

Document classification, using genetic algorithms

I have a bit of a problem with my project for the university.
I have to implement document classification using genetic algorithm.
I've had a look at this example and (let's say) understood the principles of genetic algorithms, but I'm not sure how they can be implemented for document classification. I can't figure out the fitness function.
Here is what I've managed to think of so far (it's probably totally wrong...)
Accept that I have the categories and each category is described by some keywords.
Split the file to words.
Create the first population from arrays (100 arrays for example, but it will depend on the size of the file) filled with random words from the file.
Choose the best category for each child in the population (by counting the keywords in it).
Cross over every 2 children in the population (new array containing half of each child) - "crossover"
Fill the rest of the children left from the crossover with random not used words from the file - "evolution??"
Replace random words in random child from the new population with random word from the file (used or not) - "mutation"
Copy the best results to the new population.
Go to 1 until some population limit is reached or some category is found enough times
I'm not sure if this is correct and would be happy to get some advice, guys.
Much appreciated!
Ivane, in order to properly apply GA's to document classification:
You have to reduce the problem to a system of components that can be evolved.
You can't do GA training for document classification on a single document.
So the steps that you've described are on the right track, but I'll give you some improvements:
Have a sufficient amount of training data: you need a set of documents which are already classified and are diverse enough to cover the range of documents which you're likely to encounter.
Train your GA to correctly classify a subset of those documents, aka the Training Data Set.
At each generation, test your best specimen against a Validation Data Set and stop training if the validation accuracy starts to decrease.
So what you want to do is:
prevValidationFitness = default;
currentValidationFitness = default;
bestGA = default;
// Randomly generate the initial population of GAs (once, before the loop)
population[] = randomlyGenerateGAs();
do
{
    prevValidationFitness = currentValidationFitness;
    // Train your population on the training data set
    bestGA = Train(population);
    // Get the validation fitness of the best GA
    currentValidationFitness = Validate(bestGA);
    // Make your selection (e.g. half of the population, roulette wheel selection, or random selection)
    selection[] = makeSelection(population);
    // Mate the specimens in the selection (each mating involves a crossover and possibly a mutation)
    population = mate(selection);
} while (currentValidationFitness.IsBetterThan(prevValidationFitness));
Whenever you get a new document (one which has not been classified before), you can now classify it with your best GA:
category = bestGA.Classify(document);
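To make the loop more concrete, here is one possible Python sketch under strong simplifying assumptions: each individual is a list of weights over a fixed keyword vocabulary, there are only two categories (1 and 0), and fitness is plain accuracy. All names, the toy documents, and the parameter values are hypothetical; it is not meant as the definitive way to set this up.

```python
import random

def classify(weights, doc_words, vocab):
    """Label 1 if the weighted keyword count is positive, else 0."""
    score = sum(w * doc_words.count(word) for w, word in zip(weights, vocab))
    return 1 if score > 0 else 0

def fitness(weights, dataset, vocab):
    """Fraction of (words, label) pairs classified correctly."""
    correct = sum(classify(weights, words, vocab) == label for words, label in dataset)
    return correct / len(dataset)

def evolve(train, validation, vocab, pop_size=20, max_generations=50, seed=0):
    rng = random.Random(seed)
    population = [[rng.uniform(-1, 1) for _ in vocab] for _ in range(pop_size)]
    best, best_val = None, -1.0
    for _ in range(max_generations):
        # Rank the population by fitness on the training data.
        population.sort(key=lambda w: fitness(w, train, vocab), reverse=True)
        val = fitness(population[0], validation, vocab)
        if val < best_val:
            break  # validation fitness started to decrease: stop training
        best, best_val = population[0], val
        parents = population[: pop_size // 2]  # selection: keep the fitter half
        children = []
        while len(parents) + len(children) < pop_size:
            mom, dad = rng.sample(parents, 2)
            cut = rng.randrange(1, len(vocab))  # one-point crossover
            child = mom[:cut] + dad[cut:]
            if rng.random() < 0.2:  # occasional mutation of one weight
                child[rng.randrange(len(vocab))] += rng.gauss(0, 0.5)
            children.append(child)
        population = parents + children
    return best, best_val

# Hypothetical toy data: label 1 = sports, label 0 = politics.
vocab = ["goal", "match", "election", "vote"]
train = [("goal match goal".split(), 1), ("vote election".split(), 0),
         ("match goal".split(), 1), ("election vote vote".split(), 0)]
validation = [("goal".split(), 1), ("vote".split(), 0)]
best, best_val = evolve(train, validation, vocab)
print(best_val, classify(best, "match goal vote".split(), vocab))
```

Keeping the fitter half unchanged means the best individual found so far on the training data is never lost between generations.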
So this is not the end-all-be-all solution, but it should give you a decent start.
You might find Learning Classifier Systems useful/interesting. An LCS is a type of evolutionary algorithm intended for classification problems. There is a chapter about them in Eiben & Smith's Introduction to Evolutionary Computing.
