Alternatives to Jaccard Distance

To calculate the distance between two sets of words I am using the Jaccard distance:
JaccardDistance(A, B) = 1 - JaccardIndex(A, B) = 1 - (|A ∩ B| / |A ∪ B|)
Now I wonder: are there other similar distance metrics that return values between [0, 1]?
Here 0 means that the two sets contain exactly the same elements, while 1 means they are completely different. The two sets may have different sizes, and the order of words is not important.
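For reference, a minimal Python sketch of the Jaccard distance as defined above (function and variable names are ours):

def jaccard_distance(a, b):
    a, b = set(a), set(b)
    if not a and not b:   # convention: two empty sets are treated as identical
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

print(jaccard_distance({'red', 'green', 'blue'}, {'green', 'blue', 'black'}))  # 0.5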

The most common distance measure between two sets (more generally, multi-sets) is the cosine distance (which is the angle) between the vector representations of the multi-sets.
Now let's see how you can represent a multi-set as a vector.
The first step is to represent each set as a bag of its members, e.g.
X = {a, b, a, c} ==> (a:2) (b:1) (c:1)
Y = {d, b, a, d} ==> (a:1) (b:1) (d:2)
Each set is thus represented as a sparse vector of membership weights of the union set of all the members. For instance, the universal set of members in the above example is {a, b, c, d}, and the implicit weights of d and c in X and Y are 0.
With this sparse representation, which is convenient to store as a hashmap, you could then compute the cosine distance which is the arccos (inverse cosine) of the cosine similarity of the two vectors.
For two vectors x, y, the cosine similarity is computed as sum_i x_i*y_i / (|x| |y|), i.e. the inner product of x and y divided by the product of the lengths of x and y.
In our example, the numerator is computed as 2x1 (product of the weight of member a in X and Y) + 1x1 + 1x0 + 2x0 = 3.
The length of x is sqrt(2x2+ 1x1 + 1x1) = sqrt(6), and it is easy to see that the length of y is also sqrt(6).
Hence the cosine similarity is 3/(sqrt(6)*sqrt(6)) = 1/2, and the cosine distance (the angle between the vectors) is arccos(1/2) = 60 degrees.
Note: It is more common to omit the arccos operation and directly use the cosine similarity as a similarity (inverse distance) measure between multi-sets (represented as vectors).
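To make the computation concrete, here is a small Python sketch of this bag-of-members cosine similarity, storing each sparse vector as a hashmap (Counter); the function names are ours:

from collections import Counter
from math import sqrt, acos, degrees

def bag(members):
    return Counter(members)                    # sparse vector: member -> weight

def cosine_similarity(x, y):
    dot = sum(w * y.get(m, 0) for m, w in x.items())
    len_x = sqrt(sum(w * w for w in x.values()))
    len_y = sqrt(sum(w * w for w in y.values()))
    return dot / (len_x * len_y)

X = bag(['a', 'b', 'a', 'c'])                  # (a:2) (b:1) (c:1)
Y = bag(['d', 'b', 'a', 'd'])                  # (a:1) (b:1) (d:2)
sim = cosine_similarity(X, Y)
print(sim, degrees(acos(sim)))                 # 0.5 and 60.0, as in the worked example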

Related

How to calculate the square of Frobenius norm of a matrix faster?

Let's say A is a column vector with shape (m,1), B is a row vector with shape (1,p), and C is the matrix product of A and B, i.e. C=AB, so C's shape is (m,p).
Now I want to compute the square of Frobenius norm of C, i.e. sum_i sum_j c_{ij}^2 (sum of all the squares of C's elements)
Note that c_{ij} = a_i * b_j, where a_i and b_j are the elements of A and B, so I can rewrite the formula above.
sum_i sum_j c_{ij}^2=sum_i sum_j (a_ib_j)^2=sum_i a_i^2 *sum_j b_j^2
The complexity of formula sum_i sum_j c_{ij}^2 is O(mp).
And the complexity of the formula sum_i a_i^2 * sum_j b_j^2 is O(m+p), since A and B have lower dimension than C.
However, this trick breaks down when A and B are both matrices.
Consider this, A is a matrix with shape (m,n), B is a matrix with shape (n,p), C is the matrix product of A and B, i.e. C=AB, so C's shape is also (m,p).
I still want to compute this sum_i sum_j c_{ij}^2
Note that c_{ij}=sum_k a_{ik}*b_{kj}, so
sum_i sum_j c_{ij}^2=sum_i sum_j(sum_k a_{ik}*b_{kj})^2
Therefore, this time there is no trick I can use like before (the vector case).
So my question is, I need matrix C and I also need the square of Frobenius norm of C, would it be faster using C directly than using A and B?
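For what it's worth, the rank-1 identity from the question is easy to check numerically; a NumPy sketch (names are ours):

import numpy as np

a = np.random.randn(1000, 1)              # column vector, shape (m, 1)
b = np.random.randn(1, 500)               # row vector, shape (1, p)

direct = np.sum((a @ b) ** 2)             # O(mp): form C = ab, then sum the squares
fast = np.sum(a ** 2) * np.sum(b ** 2)    # O(m+p): sum_i a_i^2 * sum_j b_j^2

print(np.isclose(direct, fast))           # True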
square of Frobenius norm of C = trace of CC^T
You want the square of the Frobenius norm of AB. The Frobenius norm is preserved by rotations, which leads to the following O((m+n+p)·n^2)-time algorithm.
Assuming that m ≥ n and p ≥ n (this shouldn't be critical if we define the degenerate cases the right way), let A = QR be the reduced QR-decomposition of A and B^T = Q′R′ be the reduced QR-decomposition of B^T. We want the Frobenius norm of AB = QR(Q′R′)^T = Q R R′^T Q′^T. Since Q and Q′ are semi-orthogonal, it suffices to find the Frobenius norm of R R′^T. Since R and R′ are n × n matrices, we can use the straightforward O(n^3)-time algorithm.
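A NumPy sketch of this QR route, assuming m ≥ n and p ≥ n (variable names are ours):

import numpy as np

m, n, p = 400, 50, 300
A = np.random.randn(m, n)
B = np.random.randn(n, p)

direct = np.linalg.norm(A @ B, 'fro') ** 2       # forms C = AB explicitly, O(mnp)

Q1, R1 = np.linalg.qr(A)                         # reduced QR: A = Q1 R1, R1 is n x n
Q2, R2 = np.linalg.qr(B.T)                       # reduced QR: B^T = Q2 R2, R2 is n x n
via_qr = np.linalg.norm(R1 @ R2.T, 'fro') ** 2   # ||AB||_F = ||R1 R2^T||_F

print(np.isclose(direct, via_qr))                # True up to floating-point error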

Algorithm to transform one set of numbers to another with optimization

We need to transform a set of integers A to another set of integers B such that the sum of squares of the elements of B is equal to a certain given value M.
Since there can be multiple such transformations, we need to find the one in which the sum of the square of the difference between the corresponding elements of A and B is minimum.
Input:
A set of non-negative integers A = {a_1, a_2, a_3, ..., a_n}
A non-negative integer M
Output:
A set of numbers B = {b_1, b_2, b_3, ..., b_n}, such that:
sum_{i=1..n} b_i^2 = M
sum_{i=1..n} (a_i - b_i)^2 = S is minimized.
The minimum sum S.
A bit of math.
Sum (a_i - b_i)^2 = Sum (a_i^2 - 2 a_i b_i + b_i^2) = Sum a_i^2 - 2 Sum a_i b_i + Sum b_i^2
The first term is constant; the last one is M (also constant), so you are seeking to maximize
Sum a_i b_i
with the restriction Sum b_i^2 = M.
In other words, you need a hyperplane normal to the vector A = { a_i }, tangent to a hypersphere of radius sqrt(M). Such a hyperplane passes through the point where the normal line intersects the sphere. This point is f·A with |f·A| = sqrt(M):
f = sqrt(M)/sqrt(Sum a_i^2)
The solution to your problem is
b_i = a_i * sqrt(M)/sqrt(Sum a_i^2)
EDIT: The answers so far, including the one below, map A to a set of real numbers instead of integers. As far as I can tell there is no general fix for this because there are many values of M for which there is no integer vector satisfying the constraint. Ex: M = 2. There is no vector of integers the sum of whose squares is 2. Even if M is a sum of squares, it is a sum of a certain number of squares, so even M = 4 has no solution if A has 3 or more non-zero components. As such, there is no general mapping that satisfies the problem as stated.
Here is the version that allows B to be a vector of reals:
The answer by user58697 is quite elegant. Here is a restatement that is, perhaps, more intuitive for those of us less used to thinking in terms of hyper-geometry:
Treat A and B as vectors. Then start the same way: sum(a_i - b_i)^2 = sum(a_i^2) - 2 sum(a_i b_i) + sum(b_i^2)
The first term is the squared magnitude of vector A, just as the last term is the squared magnitude of vector B. Both are constant, so only the middle term can change. That means we want to maximize sum(a_i b_i), which is exactly the dot product of A and B (https://en.wikipedia.org/wiki/Dot_product). The dot product of two vectors is maximized when the angle between them is 0, which is to say when they are co-directional (they point in the same direction).
This means that the unit-vector forms of A and B must be the same. That is:
a_i/|A| = b_i/|B|. Solving this for b_i: b_i = a_i * |B| / |A|
But |B| is just sqrt(M) and |A| is sqrt(sum(a_i^2)). So, just like in user58697's version:
b_i = a_i * sqrt(M) / sqrt(sum(a_i^2))
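A minimal NumPy sketch of this closed-form, real-valued solution (the function name is ours; it assumes A is not all zeros):

import numpy as np

def closest_with_norm(a, M):
    a = np.asarray(a, dtype=float)
    b = a * np.sqrt(M) / np.sqrt(np.sum(a ** 2))   # scale a onto the sphere of radius sqrt(M)
    s = np.sum((a - b) ** 2)                       # the minimized sum of squared differences
    return b, s

b, s = closest_with_norm([1, 2, 3], M=25)
print(b, np.sum(b ** 2), s)                        # sum of squares of b is exactly 25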

Algorithm to create a vector based puzzle

I am working on a little puzzle-game project. The basic idea is built around projecting multi-dimensional data down to 2D. My only problem is how to generate the randomized scenario data. Here is the problem:
I have multiple randomized vectors v_i and a target vector t, all 2D. Now I want to generate randomized scalar values c_i such that:
t = sum c_i v_i
Because there are more than two v_i, this is an underdetermined system. I also took care that a linear combination of the v_i is actually able to reach t.
How can I create (randomized) values for my c_i?
Edit: After finding this question I can additionally state that it is also possible for me to (slightly) change the v_i.
All values are doubles.
Let's say your v_i form a matrix V with 2 rows and n columns, each vector is a column. The coefficients c_i form a column vector c. Then the equation can be written in matrix form as
V×c = t
Now apply a Singular Value Decomposition to matrix V:
V = A×D×B
with A being an orthogonal 2×2 matrix, D is a 2×n matrix and B an orthogonal n×n matrix. The original equation now becomes
A×D×B×c = t
Multiply this equation by the inverse of A; since A is orthogonal, the inverse is the same as the transposed matrix A^T:
D×B×c = A^T×t
Let's introduce new symbols c'=B×c and t'=A^T×t:
D×c' = t'
The solution of this equation is simple, because Matrix D looks like this:
u 0 0 0 ... // n columns
0 v 0 0 ...
The solution is
c1' = t1' / u
c2' = t2' / v
And because all the other columns of D are zero, the remaining components c3'...cn' can be chosen freely. This is the place where you can create random numbers for c3'...cn'. Having the vector c', you can calculate c as
c = B^T×c'
with B^T being the inverse/transpose of B.
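The question is not tied to a particular language, but here is a NumPy sketch of this construction (the names A, D, B follow the answer; everything else is ours, and the v_i are assumed to span the plane so both singular values are non-zero):

import numpy as np

rng = np.random.default_rng()
n = 5
V = rng.standard_normal((2, n))                  # columns are the vectors v_i
t = rng.standard_normal(2)                       # target vector

A, s, B = np.linalg.svd(V, full_matrices=True)   # V = A x D x B, s holds u and v

t_prime = A.T @ t
c_prime = np.empty(n)
c_prime[:2] = t_prime / s                        # c1' = t1'/u, c2' = t2'/v
c_prime[2:] = rng.standard_normal(n - 2)         # the free components, randomized

c = B.T @ c_prime                                # back to the original coordinates
print(np.allclose(V @ c, t))                     # True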
Since the v_i are linearly dependent, there are non-trivial solutions to 0 = sum l_i v_i.
If you have n vectors you can find n-2 independent such solutions.
If you now have one solution to t = sum c_i v_i, you can add any multiple of l_i to c_i and you will still have a solution: c_i' = p l_i + c_i.
For each independent solution l^(j) of the homogeneous problem, determine a random p_j and calculate
c_i'' = c_i + sum_j p_j l_i^(j).
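A sketch of the same idea using a particular solution plus randomized homogeneous solutions (NumPy/SciPy; all names are ours):

import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng()
n = 5
V = rng.standard_normal((2, n))                 # columns are the v_i
t = rng.standard_normal(2)

c0 = np.linalg.lstsq(V, t, rcond=None)[0]       # one particular solution of V c = t
L = null_space(V)                               # n x (n-2) basis of solutions of V l = 0
p = rng.standard_normal(L.shape[1])             # random coefficients p_j
c = c0 + L @ p                                  # still satisfies t = sum c_i v_i

print(np.allclose(V @ c, t))                    # True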

Matching points in 2 D space

I have two matrices A and B of size m×2 and n×2 respectively, whose rows are points in the Euclidean plane.
The task I wish to perform is to match the maximum number of points from A and B (assuming A has fewer points than B), given the condition that the distance is less than a threshold d and each pair is unique.
I have seen this nearest point pairs question, but it won't work on my problem because for every point in A it selects the closest point still left in B. However, it may happen that the first pair I picked from A and B was wrong, leading to fewer matching pairs.
I am looking for a fast solution, since both A and B consist of about 1000 points each. Again, some points will be left unmatched, and I am aware that this could lead to an exhaustive search.
I am looking for a solution using built-in MATLAB functions, or data structures for which MATLAB code is available, such as k-d trees. As mentioned, I have to find unique nearest matching points from B to A.
You can use pdist2 to compute the pairwise distances between two sets of observations (of different sizes). The resulting distance matrix will be an m × n matrix which you can probe for all values above the desired threshold.
A = randn(1000, 2);
B = randn(500, 2);
D = pdist2(A, B, 'euclidean'); % euclidean distance, 1000 x 500 matrix
d = 0.5; % threshold
indexD = D > d; % logical matrix: true where the pair is farther apart than d
pointsA = any(indexD, 2); % points of A with at least one such pair in B
pointsB = any(indexD, 1); % points of B with at least one such pair in A
The two vectors provide logical indexes to the points in A and B that have at least one match in the other matrix, as defined by the minimum distance d. The resulting sets consist of all elements of matrix A (or B) whose distance from some element of the other matrix B (or A) is above d.
You can also generalize to more than 2 dimensions or different distance metrics.

how to fast compute distance between high dimension vectors

Assume there are three groups of high-dimensional vectors:
{a_1, a_2, ..., a_N},
{b_1, b_2, ... , b_N},
{c_1, c_2, ..., c_N}.
Each of my vectors can be represented as x = a_i + b_j + c_k, where 1 <= i, j, k <= N. The vector is then encoded as (i, j, k), which can later be decoded as x = a_i + b_j + c_k.
My question is: given two vectors x = (i_1, j_1, k_1) and y = (i_2, j_2, k_2), is there a method to compute the Euclidean distance between them without decoding x and y?
Square root of the sum of squares of the differences between components. There's no other way to do it.
You should scale the values to guard against overflow/underflow issues. Search for the max difference and divide all the components by it before squaring, summing, and taking the square root.
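A small Python sketch of that scaled computation (in the spirit of the classic hypot trick; the function name is ours):

import numpy as np

def scaled_distance(x, y):
    d = np.abs(np.asarray(x, float) - np.asarray(y, float))
    m = d.max()
    if m == 0.0:
        return 0.0
    d = d / m                          # largest component becomes 1, limiting overflow/underflow
    return m * np.sqrt(np.sum(d * d))

print(scaled_distance([1e200, 0.0], [0.0, 1e200]))  # ~1.414e200, where naive squaring overflows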
Let's assume you have only two groups. You are trying to compute the scalar product
(a_i1 + b_j1, a_i2 + b_j2)
= (a_i1,a_i2) + (b_j1,b_j2) + (a_i1,b_j2) + (a_i2,b_j1) # <- elementary scalar products
So if you know the necessary elementary scalar products between the elements of your vectors a_i, b_j, c_k, then, you do not need to "decode" x and y and can compute the scalar product directly.
Note that this is exactly what happens when you compute an ordinary Euclidean distance in a non-orthogonal basis.
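A NumPy sketch of this idea for the three-group case from the question: precompute the elementary scalar products once as Gram tables, then evaluate distances from the codes (i, j, k) alone. All names are ours:

import numpy as np

rng = np.random.default_rng()
N, dim = 100, 4096
a = rng.standard_normal((N, dim))
b = rng.standard_normal((N, dim))
c = rng.standard_normal((N, dim))

# Elementary scalar products, computed once: O(N^2 * dim) per pair of groups.
Gaa, Gbb, Gcc = a @ a.T, b @ b.T, c @ c.T
Gab, Gac, Gbc = a @ b.T, a @ c.T, b @ c.T

def inner(u, v):
    # <a_i1 + b_j1 + c_k1, a_i2 + b_j2 + c_k2> via nine table lookups
    (i1, j1, k1), (i2, j2, k2) = u, v
    return (Gaa[i1, i2] + Gab[i1, j2] + Gac[i1, k2]
            + Gab[i2, j1] + Gbb[j1, j2] + Gbc[j1, k2]
            + Gac[i2, k1] + Gbc[j2, k1] + Gcc[k1, k2])

def dist(u, v):
    return np.sqrt(inner(u, u) - 2.0 * inner(u, v) + inner(v, v))

x, y = (3, 7, 2), (5, 1, 9)
decoded = np.linalg.norm((a[3] + b[7] + c[2]) - (a[5] + b[1] + c[9]))
print(np.isclose(dist(x, y), decoded))   # True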
If you are happy with an approximate result, you could project your high dimension basis vectors using a random projection into a small dimensional space. Johnson-Lindenstrauss lemma says that you can reduce your dimension to O(log N), so that distances remain approximately the same with high probability.
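If an approximation is acceptable, here is a minimal NumPy sketch of such a Gaussian random projection (the scaling by 1/sqrt(k) keeps distances approximately preserved; names are ours):

import numpy as np

rng = np.random.default_rng()
dim, k = 10000, 256                            # original and reduced dimensions
x = rng.standard_normal(dim)
y = rng.standard_normal(dim)

P = rng.standard_normal((k, dim)) / np.sqrt(k) # Gaussian random projection

exact = np.linalg.norm(x - y)
approx = np.linalg.norm(P @ x - P @ y)
print(exact, approx)                           # close, with high probability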
