What is the similarity score in the gensim similar_by_word function?

I was reading here about the gensim similar_by_word function:
https://radimrehurek.com/gensim/models/keyedvectors.html
The similar_by_word function returns a sequence of (word, similarity) pairs. What is the definition of similarity here, and how is it calculated?

The similarity measure used here is the cosine similarity, which takes values between -1 and 1. The cosine similarity measures the cosine of the angle between two vectors. If the angle is very small, the vectors are considered similar, since they are pointing in the same direction. This way of measuring similarity is common when working with high-dimensional vector spaces such as word embeddings.
The formula for the cosine similarity of two vectors A and B is:

similarity(A, B) = cos(θ) = (A · B) / (‖A‖ ‖B‖)
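For concreteness, here is a minimal sketch of that computation in Python (the function name is illustrative; gensim reports this same cosine score, computed on its normalized word vectors):

import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b; lies in [-1, 1]."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy example: b points in exactly the same direction as a.
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])
print(cosine_similarity(a, b))  # 1.0, up to floating-point rounding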

Related

Cosine similarity concept for matrices

Can you explain to me the concept of cosine similarity for matrices? Cosine similarity for two vectors involves cos(a), where a is the angle between the two vectors, but for two matrices, what is the meaning of it?
Thank you so much.

Algorithm: How to smoothly interpolate/reconstruct sparse samples with noise?

This question is not directly related to a particular programming language but is an algorithmic question.
What I have is a lot of samples of a 2D function. The samples are at random locations, they are not uniformly distributed over the domain, the sample values contain noise, and each sample has a confidence weight assigned to it.
What I'm looking for is an algorithm to reconstruct the original 2D function based on the samples: a function y' = G(x0, x1) that approximates the original well and smoothly interpolates areas where samples are sparse.
It goes into the direction of what scipy.interpolate.griddata is doing, but with the added difficulty that:
the sample values contain noise - meaning that samples should not just be interpolated, but nearby samples should also be averaged in some way to smooth out the sampling noise.
the samples are weighted, so samples with higher weight should contribute more strongly to the reconstruction than those with lower weight.
scipy.interpolate.griddata seems to do a Delaunay triangulation and then use the barycentric coordinates of the triangles to interpolate values. This doesn't seem to be compatible with my requirements of weighting samples and averaging out noise, though.
Can someone point me in the right direction on how to solve this?
Based on the comments, the function is defined on a sphere. That simplifies life because your region is both well-studied and nicely bounded!
First, decide how many spherical harmonic functions you will use in your approximation. The fewer you use, the more you smooth out noise; the more you use, the more accurately you fit the samples. But if you use any harmonics of a particular degree, you should use all of them (all 2l + 1 orders of that degree).
And now you just impose the condition that the sum of the squares of the weighted errors should be minimized. That will lead to a system of linear equations, which you then solve to get the coefficients of each harmonic function.
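To make the recipe concrete, here is a minimal sketch of that weighted least-squares fit, assuming the samples arrive as polar/azimuthal angles with a value and a weight each (the function names are illustrative; the basis comes from scipy.special.sph_harm):

import numpy as np
from scipy.special import sph_harm

def real_sh_basis(theta, phi, lmax):
    """Design matrix of real spherical harmonics up to degree lmax.
    theta: azimuthal angle in [0, 2*pi); phi: polar angle in [0, pi].
    Returns an (n_samples, (lmax + 1)**2) array, one column per basis function."""
    cols = []
    for l in range(lmax + 1):
        for m in range(-l, l + 1):
            Y = sph_harm(abs(m), l, theta, phi)  # complex Y_l^{|m|}
            if m < 0:
                cols.append(np.sqrt(2.0) * Y.imag)
            elif m == 0:
                cols.append(Y.real)
            else:
                cols.append(np.sqrt(2.0) * Y.real)
    return np.column_stack(cols)

def fit_weighted(theta, phi, values, weights, lmax):
    """Minimize sum_i w_i * (values_i - G(theta_i, phi_i))**2 over the coefficients."""
    B = real_sh_basis(theta, phi, lmax)
    sw = np.sqrt(weights)
    # Scaling each row by sqrt(w_i) turns the weighted problem into ordinary least squares.
    coeffs, *_ = np.linalg.lstsq(B * sw[:, None], values * sw, rcond=None)
    return coeffs

def reconstruct(coeffs, theta, phi, lmax):
    """Evaluate the smoothed approximation G at new points."""
    return real_sh_basis(theta, phi, lmax) @ coeffs

Keeping lmax small smooths aggressively; raising it tracks the samples more closely but lets noise back in.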

Can the Hamming distance be used with non-binary structures?

The Hamming distance is typically used to calculate the difference between two binary strings. Is it possible to apply it to non-binary structures?
The Hamming distance of two strings of the same length is the sum of the distances between each pair of corresponding bits (i.e., the L1 distance), where the per-bit distance is 0 for identical bits and 1 for nonidentical bits (i.e., the discrete metric). If you want to apply the Hamming distance to alphabets that are not binary, you can replace the discrete metric with another metric of your choice; e.g., the Lee distance is the distance between two numbers on a circle. If the strings have different lengths, then you have to change to something like the Levenshtein distance, but even there you can choose whatever deletion/insertion/substitution costs you want.
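As a quick illustration, the generalized Hamming distance is just a per-position comparison, so it works for equal-length sequences over any alphabet (a small Python sketch):

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ.
    Works for any symbols, not just bits."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))

print(hamming("karolin", "kathrin"))  # 3
print(hamming((1, 5, 9), (1, 6, 9)))  # 1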

Similarity for arrays of parts of speech

K-nearest neighbor and natural language processing: how do you measure the distance between arrays of parts of speech? E.g.
('verb','adverb','noun') and ('adjective','adverb','pronoun')?
A better-phrased question would be: how do you tell the similarity between the two, given that they are parts of speech and not just strings?
As a general approach, you can use the cosine between POS vectors as a measure of their similarity. An alternative approach would be the Hamming distance between the two vectors.
There are plenty of other distance functions between vectors, but it really depends on what you want to do and what your data looks like. You should answer questions like: Does position matter? How much similarity would you give to ('noun', 'verb') and ('verb', 'noun')? Is the distance between ('adverb') and ('adjective') less than the distance between ('adverb') and ('noun')? And so on.
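For example, here is a sketch of the bag-of-tags cosine approach (illustrative code; by design it ignores position, so ('noun', 'verb') and ('verb', 'noun') would score 1.0, which answers one of the questions above in a particular way):

from collections import Counter
import math

def pos_cosine(tags_a, tags_b):
    """Cosine similarity between bag-of-tags count vectors."""
    a, b = Counter(tags_a), Counter(tags_b)
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Only 'adverb' is shared, so the similarity is 1/3.
print(pos_cosine(('verb', 'adverb', 'noun'), ('adjective', 'adverb', 'pronoun')))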

Measuring distance between vectors

I have a set of 300,000 or so vectors which I would like to compare in some way, and given one vector I want to be able to find the closest one. I have thought of a few methods:
Simple Euclidean distance
Cosine similarity
Use a kernel (for instance Gaussian) to calculate the Gram matrix.
Treat the vector as a discrete probability distribution (which makes sense to do) and calculate some divergence measure.
I do not really understand when it is useful to do one rather than the other. My data has a lot of zero elements. With that in mind, is there some general rule of thumb as to which of these methods is best?
Sorry for the weak question, but I had to start somewhere...
Thank you!
Your question is not quite clear: are you looking for a distance metric between vectors, or an algorithm to efficiently find the nearest neighbour?
If your vectors just contain a numeric type such as doubles or integers, you can find a nearest neighbour efficiently using a structure such as the kd-tree, since you are just looking at points in d-dimensional space. See http://en.wikipedia.org/wiki/Nearest_neighbor_search for other methods.
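For example, a sketch with scipy.spatial.cKDTree (the data here is a random stand-in for your 300,000 vectors):

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
vectors = rng.normal(size=(300_000, 16))  # stand-in data: n vectors in 16-d

tree = cKDTree(vectors)             # build the index once
query = rng.normal(size=16)
dist, idx = tree.query(query, k=1)  # nearest neighbour by Euclidean distance
print(idx, dist)

Note that kd-trees lose their advantage as the dimension grows, which connects to the curse-of-dimensionality caveat below.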
Otherwise, choosing a distance metric and algorithm is very much dependent on the content of the vectors.
If your vectors are sparse and binary, you can use the Hamming or Hellinger distance. When your vectors have many dimensions, avoid Euclidean distance (see http://en.wikipedia.org/wiki/Curse_of_dimensionality).
Refer to http://citeseerx.ist.psu.edu/viewdoc/download?rep=rep1&type=pdf&doi=10.1.1.154.8446 for a survey of distance/similarity measures, although the paper limits itself to pairs of probability distributions.
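If you take the divergence route from the question, here is a small sketch of the Hellinger distance, which stays well-behaved for sparse vectors (it assumes each vector is non-negative and normalized to sum to 1):

import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions.
    Bounded in [0, 1]; zero entries are handled naturally."""
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)) / np.sqrt(2.0)

p = np.array([0.5, 0.5, 0.0])
q = np.array([0.1, 0.2, 0.7])
print(hellinger(p, q))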
