Why don't we include 0 matches while calculating Jaccard distance between binary numbers?

I am working on a program based on Jaccard Distance, and I need to calculate the Jaccard Distance between two binary bit vectors. I came across the following on the net:
If p1 = 10111 and p2 = 10011,
The counts of each combination of attribute values for p1 and p2 are:
M11 = total number of attributes where p1 & p2 have a value 1,
M01 = total number of attributes where p1 has a value 0 & p2 has a value 1,
M10 = total number of attributes where p1 has a value 1 & p2 has a value 0,
M00 = total number of attributes where p1 & p2 have a value 0.
Jaccard similarity coefficient = J =
intersection/union = M11/(M01 + M10 + M11)
= 3 / (0 + 1 + 3) = 3/4,
Jaccard distance = J' = 1 - J = 1 - 3/4 = 1/4,
Or J' = 1 - (M11/(M01 + M10 + M11)) = (M01 + M10)/(M01 + M10 + M11)
= (0 + 1)/(0 + 1 + 3) = 1/4
Now, while calculating the coefficient, why was "M00" not included in the denominator? Can anyone please explain?

The Jaccard coefficient is a measure for asymmetric binary attributes, i.e., scenarios where the presence of an item is more important than its absence.
Since M00 deals only with joint absence, we do not consider it while calculating the Jaccard coefficient.
For example, while checking for the presence/absence of a disease, the presence of the disease is the more significant outcome.
Hope it helps!

The Jaccard index of A and B is |A∩B|/|A∪B| = |A∩B|/(|A| + |B| - |A∩B|).
We have: |A∩B| = M11, |A| = M11 + M10, |B| = M11 + M01.
So |A∩B|/(|A| + |B| - |A∩B|) = M11 / (M11 + M10 + M11 + M01 - M11) = M11 / (M10 + M01 + M11).
A Venn diagram of A∩B and A∪B may also help to visualize this.
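To make the counting concrete, here is a small sketch (a hypothetical helper, not from the original post) that tallies M11, M10 and M01 for two equal-length bit strings and returns the Jaccard distance; note that M00 is never counted:

def jaccard_distance(p1, p2):
    # p1, p2: equal-length strings of '0'/'1' characters
    m11 = sum(a == '1' and b == '1' for a, b in zip(p1, p2))
    m10 = sum(a == '1' and b == '0' for a, b in zip(p1, p2))
    m01 = sum(a == '0' and b == '1' for a, b in zip(p1, p2))
    # M00 (joint absence) is deliberately left out of the denominator
    similarity = m11 / (m11 + m10 + m01)
    return 1 - similarity

print(jaccard_distance("10111", "10011"))  # 0.25, as in the example above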


Set values of a matrix on positions inside a triangle

I have an N x N matrix with all values equal to zero. I need to get the coordinates of a triangle and set the values inside this triangle to one (1).
How can I determine the position of each element in the matrix that forms the triangle faces?
Like in this 10x10 matrix, where I have a triangle with vertices at (9,1), (5,5) and (9,5):
0000000000
0000000000
0000000000
0000000000
0000000000
0000010000
0000110000
0001010000
0010010000
0111110000
I don't need the code written for me; I want to know if there is a proper way (maybe using math) to get the coordinates.
When you have two points x1,y1 and x2,y2, you can use these to create a formula for the line using the "point-slope form":
Calculate slope with m = (y1 - y2) / (x1 - x2)
Then you have a formula of y - y1 = m(x - x1)
This further goes to y = m(x - x1) + y1
So in your example of (9,1),(5,5) you calculate the m = (1 - 5) / (9 - 5) = (-4) / (4) = -1
Then your formula becomes, for that line, y = (-1)(x - 9) + 1
Then iterate x between 5 and 9:
f(5) = -(5-9) + 1 = -(-4) + 1 = 4 + 1 = 5
f(6) = -(6-9) + 1 = -(-3) + 1 = 3 + 1 = 4
f(7) = -(7-9) + 1 = -(-2) + 1 = 2 + 1 = 3
f(8) = -(8-9) + 1 = -(-1) + 1 = 1 + 1 = 2
f(9) = -(9-9) + 1 = -(0) + 1 = 0 + 1 = 1
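If it helps to see that idea in code, here is a minimal sketch (hypothetical helper names; it assumes the points are given as (row, column) pairs, as in the matrix above, and that the edges hit integer cells):

def draw_line(matrix, p1, p2):
    # Mark the cells of matrix along the straight line from p1 to p2,
    # using the point-slope form described above.
    (r1, c1), (r2, c2) = p1, p2
    if r1 == r2:                        # horizontal edge
        for c in range(min(c1, c2), max(c1, c2) + 1):
            matrix[r1][c] = 1
        return
    m = (c1 - c2) / (r1 - r2)           # slope
    for r in range(min(r1, r2), max(r1, r2) + 1):
        c = round(m * (r - r1) + c1)    # c = m(r - r1) + c1
        matrix[r][c] = 1

n = 10
grid = [[0] * n for _ in range(n)]
for a, b in [((9, 1), (5, 5)), ((5, 5), (9, 5)), ((9, 5), (9, 1))]:
    draw_line(grid, a, b)
print("\n".join("".join(map(str, row)) for row in grid))  # reproduces the matrix above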
Triangles have nice properties allowing a very simple algorithm to suffice.
Find Ymax, the topmost Y coordinate set in the triangle. Then for Ymax, find Xmin and Xmax, of the left and rightmost pixels set in that row. Now there are 2 cases. If Xmin == Xmax, then one vertex is (Xmin,Ymax), otherwise two of the coordinates are (Xmin, Ymax) and (Xmax, Ymax).
With this you've found the topmost coordinate or coordinates.
It's pretty simple to continue this reasoning to find the other ones. I'll let you puzzle it out for the fun...
You can combine the min and max-finding in the algorithm above with the algorithm that does the filling as required in the second part of the problem.
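For just the first step (leaving the rest as the puzzle), a rough sketch could look like this, where row 0 is taken as the top of the matrix:

def top_vertices(matrix):
    # Find the topmost row containing a 1, then the leftmost and
    # rightmost 1 in that row (the Xmin/Xmax of the description above).
    for r, row in enumerate(matrix):
        cols = [c for c, v in enumerate(row) if v == 1]
        if cols:
            if min(cols) == max(cols):
                return [(r, cols[0])]            # single apex vertex
            return [(r, min(cols)), (r, max(cols))]
    return []                                    # empty matrix
    # For the question's example matrix this returns [(5, 5)].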

Number of ways of distributing n identical balls into groups such that each group has at least k balls?

I am trying to do this using recursion with memoization, and I have identified the following base cases:
I) when n == k there is only one group with all the balls.
II) when k > n then no group can have at least k balls, hence zero.
I am unable to move forward from here. How can this be done?
As an illustration, when n = 6, k = 2:
(2,2,2)
(4,2)
(3,3)
(6)
That is 4 different groupings can be formed.
This can be represented by the two-dimensional recursive formula described below:
T(0, k) = 1
T(n, k) = 0                          if 0 < n < k
T(n, k) = T(n-k, k) + T(n, k + 1)    otherwise
(first term: there is a box with exactly k balls, put them in; second term: no box has exactly k balls, advance to the next k)
In the above, T(n,k) is the number of distributions of n balls such that each box gets at least k.
And the trick is to think of k as the lowest possible number of balls in a group, and to separate the problem into two scenarios: either there is a box with exactly k balls (if so, place them and recurse with n-k balls), or there is not (and then recurse with the minimum value k+1 and the same number of balls).
For example, to calculate T(6,2) from your example (6 balls, minimum 2 per box):
T(6,2) = T(4,2) + T(6,3)
T(4,2) = T(2,2) + T(4,3) = T(0,2) + T(2,3) + T(1,3) + T(4,4) =
= T(0,2) + T(2,3) + T(1,3) + T(0,4) + T(4,5) =
= 1 + 0 + 0 + 1 + 0
= 2
T(6,3) = T(3,3) + T(6,4) = T(0,3) + T(3,4) + T(2,4) + T(6,5)
= T(0,3) + T(3,4) + T(2,4) + T(1,5) + T(6,6) =
= T(0,3) + T(3,4) + T(2,4) + T(1,5) + T(0,6) + T(6,7) =
= 1 + 0 + 0 + 0 + 1 + 0
= 2
T(6,2) = T(4,2) + T(6,3) = 2 + 2 = 4
Using Dynamic Programming, it can be calculated in O(n^2) time.
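A direct memoized transcription of this recursion in Python (a minimal sketch) could look like this:

from functools import lru_cache

@lru_cache(maxsize=None)
def T(n, k):
    # Number of ways to split n identical balls into (unordered) groups
    # of at least k balls each.
    if n == 0:
        return 1
    if n < k:
        return 0
    # Either some group gets exactly k balls, or every group gets more than k.
    return T(n - k, k) + T(n, k + 1)

print(T(6, 2))  # 4, matching the worked example above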
This can be solved pretty simply:
Number of buckets
The maximum number of buckets b can be determined as follows:
b = roundDown(n / k)
Each valid distribution can use at most b buckets.
Number of distributions with x buckets
For a given number of buckets x, the number of distributions can be found pretty simply:
Distribute k balls to each bucket, then find the number of ways to distribute the remaining r = n - k * x balls to the x buckets (stars and bars):
total_distributions(x) = bincoefficient(r + x - 1 , x - 1)
EDIT: this will only work if order matters. Since it doesn't for this question, we can use a few tricks here:
Each distribution can be mapped to a sequence of numbers, e.g. d = {d1, d2, ..., dx}. We can easily generate all of these sequences, starting with the "first" sequence {r, 0, ..., 0} and subsequently moving 1s from the left to the right; so the next sequence would look like this: {r - 1, 1, ..., 0}. If only sequences matching d1 >= d2 >= ... >= dx are generated, no duplicates will be produced. This constraint can easily be used to prune the search a bit: we can only move a 1 from da to db (with a = b - 1) if da - 1 >= db + 1 holds, since otherwise the constraint that the sequence is sorted would be violated. The 1s to move are always the rightmost that can be moved. Another way to think of this is to view r as a unary number and simply split that string into groups such that each group is at least as long as its successor.
countSequences(x)
    sequence = new int[x]      // all zeros
    sequence[0] = r
    sequenceCount = 1
    while true
        i = findRightmostMoveable(sequence)
        if i == -1
            return sequenceCount
        sequence[i] -= 1
        sequence[i + 1] += 1
        sequenceCount += 1

findRightmostMoveable(sequence)
    for i in [length(sequence) - 1 , 0)
        if sequence[i - 1] > sequence[i] + 1
            return i - 1
    return -1
Actually, findRightmostMoveable could be optimized a bit if we look at the structure transitions of the sequence (more precisely, the differences between adjacent elements). But to be honest I'm far too lazy to optimize this further.
Putting the pieces together
range(1 , roundDown(n / k)).map(b -> countSequences(b)).sum()

How do you manually compute the silhouette, cohesion and separation of a cluster?

Good day!
I have been looking all over the Internet for how to compute the silhouette coefficient, cohesion and separation, but despite the resources, I just can't understand the formulas posted. I know that there are implementations in some tools, but I want to know how to manually compute them, especially given a vector space model.
Assuming that I have the following clusters:
Cluster 1 ={{1,0},{1,1}}
Cluster 2 ={{1,2},{2,3},{2,2},{1,2}},
Cluster 3 ={{3,1},{3,3},{2,1}}
The way I understood it according to [1] is that I have to get the average of the points per cluster:
C1 X = 1; Y = .5
C2 X = 1.5; Y = 2.25
C3 X = 2.67; Y = 1.67
Given the mean, I have to compute the cohesion via the Sum of Squared Errors (SSE):
Cohesion(C1) = (1-1)^2 + (1-1)^2 + (0-.5)^2 + (0-.5)^2 = 0.5
Cohesion(C2) = (1-1.5)^2 + (2-1.5)^2 + (2-1.5)^2 + (1-1.5)^2 + (2-2.5)^2 + (3-2.5)^2 + (2-2.5)^2 +(2-2.5)^2 = 2
Cohesion(C3) = (3-2.67)^2 + (3-2.67)^2 + (2-2.67)^2 + (1-1.67)^2 + (3-1.67)^2 + (1-1.67)^2 = 3.3334
Cluster(C) = 0.5 + 2 + 3.3334 = 5.8334
My questions are:
1. Did I perform cohesion correctly?
2. How do I compute Separation?
3. How do I compute the Silhouette Coefficient?
Thank you.
References: [1] http://www.cs.kent.edu/~jin/DM08/ClusterValidation.pdf
Cluster 1 ={{1,0},{1,1}}
Cluster 2 ={{1,2},{2,3},{2,2},{1,2}},
Cluster 3 ={{3,1},{3,3},{2,1}}
Take a point {1,0} in cluster 1
Calculate its average distance to all other points in its cluster, i.e. cluster 1.
So a1 =√( (1-1)^2 + (0-1)^2) =√(0+1)=√1=1
Now for the object {1,0} in cluster 1 calculate its average distance from all the objects in cluster 2 and cluster 3. Of these take the minimum average distance.
So for cluster 2
{1,0} ----> {1,2} = distance = √((1-1)^2 + (0-2)^2) =√(0+4)=√4=2
{1,0} ----> {2,3} = distance = √((1-2)^2 + (0-3)^2) =√(1+9)=√10=3.16
{1,0} ----> {2,2} = distance = √((1-2)^2 + (0-2)^2) =√(1+4)=√5=2.24
{1,0} ----> {1,2} = distance = √((1-1)^2 + (0-2)^2) =√(0+4)=√4=2
Therefore, the average distance of point {1,0} in cluster 1 to all the points in cluster 2 =
(2+3.16+2.24+2)/4 = 2.35
Similarly, for cluster 3
{1,0} ----> {3,1} = distance = √((1-3)^2 + (0-1)^2) =√(4+1)=√5=2.24
{1,0} ----> {3,3} = distance = √((1-3)^2 + (0-3)^2) =√(4+9)=√13=3.61
{1,0} ----> {2,1} = distance = √((1-2)^2 + (0-1)^2) =√(1+1)=√2=2.24
Therefore, the average distance of point {1,0} in cluster 1 to all the points in cluster 3 =
(2.24+3.61+2.24)/3 = 2.7
Now, the minimum average distance of the point {1,0} in cluster 1 to the other clusters 2 and 3 is
b1 = 2.35 (2.35 < 2.7)
So the silhouette coefficient of the point {1,0} in cluster 1 is
s1 = 1 - (a1/b1) = 1 - (1/2.35) = 1 - 0.4255 = 0.5745
In a similar fashion you need to calculate the silhouette coefficient for every point in cluster 2 and cluster 3, repeating the steps above, and average the values per cluster. Of these, the cluster with the greatest average silhouette coefficient is the best as per this evaluation.
Note: The distance here is the Euclidean Distance! You can also have a look at this video for further explanation:
https://www.coursera.org/learn/cluster-analysis/lecture/RJJfM/6-2-clustering-evaluation-measuring-clustering-quality
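If you want to check the arithmetic, here is a minimal sketch in plain Python (Euclidean distances, and the general form s = (b - a) / max(a, b), which reduces to 1 - a/b when a < b) that computes the per-point silhouettes and averages them per cluster:

from math import dist  # Python 3.8+

clusters = [
    [(1, 0), (1, 1)],                   # cluster 1
    [(1, 2), (2, 3), (2, 2), (1, 2)],   # cluster 2
    [(3, 1), (3, 3), (2, 1)],           # cluster 3
]

def mean_dist(p, pts):
    return sum(dist(p, q) for q in pts) / len(pts)

def silhouette(i, cluster, others):
    p = cluster[i]
    a = mean_dist(p, cluster[:i] + cluster[i + 1:])   # avg distance within its own cluster
    b = min(mean_dist(p, c) for c in others)          # avg distance to the nearest other cluster
    return (b - a) / max(a, b)

for ci, cluster in enumerate(clusters, 1):
    others = [c for c in clusters if c is not cluster]
    scores = [silhouette(i, cluster, others) for i in range(len(cluster))]
    print("cluster %d: mean silhouette %.3f" % (ci, sum(scores) / len(scores)))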
Computation of Silhouette is straightforward, but it does not involve the centroids.
So don't try to compute it from what you did for cohesion; compute it from your original data.
There is a mistake in how you calculated the Cohesion of C1. It should be:
Cohesion(C1) = (1 - 1) ^ 2 + (1 - 1) ^ 2 + (0 - .5) ^ 2 + (1 - .5) ^ 2 = 0.5
This is the Prototype-Based (Centroid in this case) Cohesion calculation.
For calculating Separation: {Between clusters i.e. (C1,C2) , (C1,C3) & (C2,C3)}
Separation(C1,C2) = SSE(Centroid(C1), Centroid(C2))
= (1 - 1.5) ^ 2 + (0.5 - 2.25) ^ 2 = 0.25 + 3.0625 = 3.3125
Silhouette Coefficient: Combines both the Cohesion and Separation.
Refer https://cs.fit.edu/~pkc/classes/ml-internet/silhouette.pdf
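For the centroid-based measures in this answer, a short sketch in plain Python (centroid and sse are hypothetical helper names):

def centroid(points):
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def sse(points, center):
    # sum of squared differences of every coordinate to the given center
    return sum((x - cx) ** 2 for p in points for x, cx in zip(p, center))

c1 = [(1, 0), (1, 1)]
c2 = [(1, 2), (2, 3), (2, 2), (1, 2)]
m1, m2 = centroid(c1), centroid(c2)

print("Cohesion(C1) =", sse(c1, m1))         # 0.5
print("Separation(C1,C2) =", sse([m1], m2))  # 3.3125, squared distance between the centroids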
Thanks for your answer.
'Calculate its average distance to all other points in its cluster, i.e. cluster 1' --> this part has to be corrected.
So
a1 =√( (1-1)^2 + (0-1)^2) =√(0+1)=√1=1
{1,0} ----> {2,1} = distance = √((1-2)^2 + (0-1)^2) =√(1+1)=√2=2.24
This is an error because √2 is approximately 1.41, not 2.24.

Segmented Least Squares

Give an algorithm that takes a sequence of points in the plane (x_1, y_1), (x_2, y_2), ...., (x_n, y_n) and an integer k as input and returns the best piecewise linear function f consisting of at most k pieces that minimizes the sum of squared errors. You may assume that you have access to an algorithm that computes the sum of squared errors for one segment through a set of n points in Θ(n) time. The solution should use O(n^2 k) time and O(nk) space.
Can anyone help me with this problem? Thank you so much!
(This is too late for your homework, but hope it helps anyway.)
First is dynamic programming in python / numpy for k = 4 only,
to help you understand how dynamic programming works;
once you understand that, writing a loop for any k should be easy.
Also, Cost[] is a 2d matrix, space O(n^2);
see the notes at the end for getting down to space O(n k)
#!/usr/bin/env python
""" split4.py: min-cost split into 4 pieces, dynamic programming k=4 """
from __future__ import division
import numpy as np
__version__ = "2014-03-09 mar denis"
#...............................................................................
def split4( Cost, verbose=1 ):
""" split4.py: min-cost split into 4 pieces, dynamic programming k=4
min Cost[0:a] + Cost[a:b] + Cost[b:c] + Cost[c:n]
Cost[a,b] = error in least-squares line fit to xy[a] .. xy[b] *including b*
or error in lsq horizontal lines, sum (y_j - av y) ^2 for each piece --
o--
o-
o---
o----
| | | |
0 2 5 9
(Why 4 ? to walk through step by step, then put in a loop)
"""
# speedup: maxlen 2 n/k or so
Cost = np.asanyarray(Cost)
n = Cost.shape[1]
# C2 C3 ... costs, J2 J3 ... indices of best splits
J2 = - np.ones(n, dtype=int) # -1, NaN mark undefined / bug
C2 = np.ones(n) * np.NaN
J3 = - np.ones(n, dtype=int)
C3 = np.ones(n) * np.NaN
# best 2-splits of the left 2 3 4 ...
for nleft in range( 1, n ):
J2[nleft] = j = np.argmin([ Cost[0,j-1] + Cost[j,nleft] for j in range( 1, nleft+1 )]) + 1
C2[nleft] = Cost[0,j-1] + Cost[j,nleft]
# an idiom for argmin j, min value c together
# best 3-splits of the left 3 4 5 ...
for nleft in range( 2, n ):
J3[nleft] = j = np.argmin([ C2[j-1] + Cost[j,nleft] for j in range( 2, nleft+1 )]) + 2
C3[nleft] = C2[j-1] + Cost[j,nleft]
# best 4-split of all n --
j4 = np.argmin([ C3[j-1] + Cost[j,n-1] for j in range( 3, n )]) + 3
c4 = C3[j4-1] + Cost[j4,n-1]
j3 = J3[j4]
j2 = J2[j3]
jsplit = np.array([ 0, j2, j3, j4, n ])
if verbose:
print "split4: len %s pos %s cost %.3g" % (np.diff(jsplit), jsplit, c4)
print "split4: J2 %s C2 %s" %(J2, C2)
print "split4: J3 %s C3 %s" %(J3, C3)
return jsplit
#...............................................................................
if __name__ == "__main__":
import random
import sys
import spread
n = 10
ncycle = 2
plot = 0
seed = 0
# run this.py a=1 b=None c=[3] 'd = expr' ... in sh or ipython
for arg in sys.argv[1:]:
exec( arg )
np.set_printoptions( 1, threshold=100, edgeitems=10, linewidth=100, suppress=True )
np.random.seed(seed)
random.seed(seed)
print "\n", 80 * "-"
title = "Dynamic programming least-square horizontal lines %s n %d seed %d" % (
__file__, n, seed)
print title
x = np.arange( n + 0. )
y = np.sin( 2*np.pi * x * ncycle / n )
# synthetic time series ?
print "y: %s av %.3g variance %.3g" % (y, y.mean(), np.var(y))
print "Cost[j,k] = sum (y - av y)^2 --" # len * var y[j:k+1]
Cost = spread.spreads_allij( y )
print Cost # .round().astype(int)
jsplit = split4( Cost )
# split4: len [3 2 3 2] pos [ 0 3 5 8 10]
if plot:
import matplotlib.pyplot as pl
title += "\n lengths: %s" % np.diff(jsplit)
pl.title( title )
pl.plot( y )
for js, js1 in zip( jsplit[:-1], jsplit[1:] ):
if js1 <= js: continue
yav = y[js:js1].mean() * np.ones( js1 - js + 1 )
pl.plot( np.arange( js, js1 + 1 ), yav )
# pl.legend()
pl.show()
Then, the following code does Cost[] for horizontal lines only, slope 0;
extending it to line segments of any slope, in time O(n), is left as an exercise.
""" spreads( all y[:j] ) in time O(n)
define spread( y[] ) = sum (y - average y)^2
e.g. spread of 24 hourly temperatures y[0:24] i.e. y[0] .. y[23]
around a horizontal line at the average temperature
(spread = 0 for constant temperature,
24 c^2 for constant + [c -c c -c ...],
24 * variance(y) )
How fast can one compute all 24 spreads
1 hour (midnight to 1 am), 2 hours ... all 24 ?
A simpler problem: compute all 24 averages in time O(n):
N = np.arange( 1, len(y)+1 )
allav = np.cumsum(y) / N
= [ y0, (y0 + y1) / 2, (y0 + y1 + y2) / 3 ...]
An identity:
spread(y) = sum(y^2) - n * (av y)^2
Voila: the code below, all spreads() in time O(n).
Exercise: extend this to spreads around least-squares lines
fit to [ y0, [y0 y1], [y0 y1 y2] ... ], not just horizontal lines.
"""
from __future__ import division
import sys
import numpy as np
#...............................................................................
def spreads( y ):
""" [ spread y[:1], spread y[:2] ... spread y ] in time O(n)
where spread( y[] ) = sum (y - average y )^2
= n * variance(y)
"""
N = np.arange( 1, len(y)+1 )
return np.cumsum( y**2 ) - np.cumsum( y )**2 / N
def spreads_allij( y ):
""" -> A[i,j] = sum (y - av y)^2, spread of y around its average
for all y[i:j+1]
time, space O(n^2)
"""
y = np.asanyarray( y, dtype=float )
n = len(y)
A = np.zeros((n,n))
for i in range(n):
A[i,i:] = spreads( y[i:] )
return A
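As a quick sanity check with made-up values, each prefix spread equals len(prefix) * variance(prefix):

y = np.array([ 0., 1., 0., 1. ])
print(spreads(y))                                           # ~ [ 0.  0.5  0.667  1. ]
print([ m * np.var(y[:m]) for m in range(1, len(y) + 1) ])  # same values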
So far we have an n x n cost matrix, space O(n^2).
To get down to space O( n k ),
look closely at the pattern of Cost[i,j] accesses in the dyn-prog code:
for nleft .. to n:
Cost_nleft = Cost[j,nleft ] -- time nleft or nleft^2
for k in 3 4 5 ...:
min [ C[k-1, j-1] + Cost_nleft[j] for j .. to nleft ]
Here Cost_nleft is one row of the full n x n cost matrix, ~ n segments, generated as needed.
This can be done in time O(n) for line segments.
But if "error for one segment through a set of n points takes O(n) time",
it seems we're up to time O(n^3). Comments anyone ?
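For reference, a compact general-k version of the dynamic program above (a hedged sketch, not the poster's code; it assumes a precomputed Cost[i, j] matrix as produced by spreads_allij, and keeps only O(n k) dp/back tables besides it):

import numpy as np

def segmented_least_squares(Cost, k):
    # Cost[i, j] = error of a single segment fit to points i..j inclusive.
    # dp[p, j]   = minimum total error covering points 0..j with exactly p pieces.
    n = Cost.shape[0]
    dp = np.full((k + 1, n), np.inf)
    back = np.zeros((k + 1, n), dtype=int)
    dp[1, :] = Cost[0, :]                       # one piece covers everything so far
    for p in range(2, k + 1):
        for j in range(p - 1, n):               # need at least p points for p pieces
            for i in range(1, j + 1):           # last piece covers points i..j
                c = dp[p - 1, i - 1] + Cost[i, j]
                if c < dp[p, j]:
                    dp[p, j], back[p, j] = c, i
    p = int(np.argmin(dp[1:, n - 1])) + 1       # best number of pieces (<= k)
    cuts, j = [], n - 1
    while p > 1:                                # walk back-pointers to recover split positions
        i = back[p, j]
        cuts.append(int(i))
        j, p = i - 1, p - 1
    return float(dp[1:, n - 1].min()), sorted(cuts)

This is O(n^2 k) time once Cost[] is available; the n x n Cost matrix itself still dominates the space unless it is generated row by row as sketched above.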
If you can do least squares for one segment in O(n^2), it's easy to do what you want in O(n^2 k^2) with dynamic programming. You might be able to optimize that down to a single factor of k.

Calculating a weighted similarity

I have 2 data rows and each of them has 4 fields,
something like this:
field1 field2 field3 field4
Row 1
Row 2
Now I have to compare these two records and calculate the similarity. I calculate the similarity for each field by deriving the cosine similarity.
So I end up with similarities something like this:
(0 signifying a weak similarity and 1 signifying a strong similarity)
field1: 0.12
field2: 0.67
field3: 1.00
field4: 0.93
I can now find the total similarity by averaging the values, but the problem is:
I want to add weights to the fields
so if field2 has a higher weight than field1, then the similarity of field2 will have a significant contribution to the average similarity.
Can you suggest a formula or algorithm to satisfy such a requirement?
Simple,
multiply each of the 4 values by their weight
add the results together
divide by the sum of the weights
Examples
In the example each of the fields can be thought to have an equal weight of 1
((0.12 * 1) + (0.67 * 1) + (1.00 * 1) + (0.93 * 1)) / 4 = 0.68
Now if we want to make field2 worth 2x more than the other fields
// Weights are (1 + 2 + 1 + 1) = 5
((0.12 * 1) + (0.67 * 2) + (1.00 * 1) + (0.93 * 1)) / 5 = 0.678
If we want field 3 to have 100 times the weight (field 2 is still 2x)
// Weights are (1 + 2 + 100 + 1) = 104
((0.12 * 1) + (0.67 * 2) + (1.00 * 100) + (0.93 * 1)) / 104 = 0.9845192307692308
Formula
((field1 * field1_weight) + (field2 * field2_weight) + ... + (fieldn * fieldn_weight)) / (field1_weight + field2_weight + ... + fieldn_weight) = weighted_average
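In code form, the same formula might look like this (a trivial sketch):

def weighted_average(values, weights):
    # sum of value * weight, divided by the sum of the weights
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

print(weighted_average([0.12, 0.67, 1.00, 0.93], [1, 2, 100, 1]))  # ~0.9845, as in the example above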
Fractional weights
The formula works just the same if you give fractions as weights. For example, if you would like the 4th field to be weighted 1.5 times as much as the other fields, you can assign it a weight of 1.5.
// Weights are (1 + 1 + 1 + 1.5) = 4.5
((0.12 * 1) + (0.67 * 1) + (1.00 * 1) + (0.93 * 1.5)) / 4.5 = 0.7077777777777778
Weights are relative
You don't need to start with each of the weights set to 1, you can use 100 or 1000 if you like.
For example, if the weights for all 4 fields were 100, the final average would be the same as if they were all 1.
Further reading
wikipedia: Weighted arithmetic mean
You just want to find the weighted average. Multiply each similarity by the weight, then add the products together, divide at the end by the sum of the weights to get the average:
total, totalw = 0, 0
for w, s in weighted_sims:
    total += w * s
    totalw += w
result = total / totalw
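For example, plugging in the similarities from the question with field2 given twice the weight (made-up weights):

weighted_sims = [(1, 0.12), (2, 0.67), (1, 1.00), (1, 0.93)]  # (weight, similarity) pairs
# running the loop above gives result == 3.39 / 5 == 0.678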
