Showing two k-means cost functions are equal - algorithm

So I am learning about the k-means algorithm for clustering and have seen a couple of different cost functions that can be used, in particular $$J_{avg} = \sum_{i=1}^k\sum_{x\in C_i}d(x,m_i)^2$$ $$J_{IC}=\sum_{i=1}^k\frac{1}{|C_i|}\sum_{x\in C_i}\sum_{x'\in C_i}d(x,x')^2.$$ Now I am trying to show that if $m_i=\frac{1}{|C_i|}\sum_{x\in C_i}x$ then $J_{IC}=2J_{avg}.$ This makes intuitive sense to me, since it seems to be the difference between the average distance to the centroid and the average distance between two points (which should be double the distance to the center). Would appreciate any help, thanks!

For the cost functions to be equivalent they don't have to be exactly equal, just monotonically related so that optimizing one means optimizing the other.
SUM_ij (Xi - Xj)^2 = SUM_ij (Xi - x + x - Xj)^2 = SUM_ij [(Xi - x)^2 + (Xj - x)^2 + 2(Xi - x).(x - Xj)]
If x is the mean of the Xi then SUM_j (x - Xj) = 0, so the dot product term goes away. For a cluster of n points that leaves SUM_ij (Xi - Xj)^2 = 2n SUM_i (Xi - x)^2, which is the connection between the sum of squared distances from the mean and the sum of squared distances between any two points that I think you need.
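A quick numerical sanity check of the identity can help before attempting the proof. The following is only a sketch: the helper names (`sq_dist`, `centroid`, `j_avg`, `j_ic`) and the random test data are my own assumptions, and d is taken to be the Euclidean distance.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def j_avg(clusters):
    """Sum of squared distances from each point to its own cluster mean."""
    total = 0.0
    for c in clusters:
        m = centroid(c)
        total += sum(sq_dist(x, m) for x in c)
    return total

def j_ic(clusters):
    """Per cluster: (1/|C|) times the sum of squared distances over all ordered pairs."""
    return sum(
        sum(sq_dist(x, y) for x in c for y in c) / len(c)
        for c in clusters
    )

# Hypothetical data: three clusters of random 2D points (sizes chosen arbitrarily).
rng = random.Random(0)
clusters = [
    [(rng.uniform(-5, 5), rng.uniform(-5, 5)) for _ in range(size)]
    for size in (3, 7, 12)
]
print(j_ic(clusters), 2 * j_avg(clusters))  # the two values should agree up to rounding
```

The identity holds cluster by cluster, so it does not matter how the points were assigned to clusters, only that each m_i is the mean of its cluster.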

Related

Estimating an unknown basis function from a sum of time-scaled versions

Not sure if this is a 'mathematical' problem or a 'numerical/computational' problem, but I thought would post it here in hope that someone might be able to help point me in the right direction.
Consider a function g(t) in the time-domain that consists of a sum of time-scaled versions of an unknown basis function f(t) ie.
g(t) = f(a_0.t) + f(a_1.t) + ... + f(a_N.t)
where N and the time scaling coefficients (a_0, a_1,..., a_N) are known. We can assume all coefficients are positive and no greater than one. ie. 0 < a_i <= 1.
Observations of g(t) are made at 'M' uniformly sampled discrete points ie. t = [0,1,...,M-1]. It is assumed that g(t) and f(t) are continuous and otherwise "well behaved".
The objective is to obtain an estimate of the function f(t) at the sample points.
One approach I have tried is to express both f(t) and g(t) as polynomials of degree K, equate polynomial coefficients and solve; ie.
g(t) = \sum_{k=0}^{K} g_k.t^k
f(t) = \sum_{k=0}^{K} f_k.t^k
and hence
f_K = g_K / (a_0^K + a_1^K + .... + a_N^K)
f_{K-1} = g_{K-1} / (a_0^(K-1) + a_1^(K-1) + .... + a_N^(K-1))
....etc....
f_0 = g_0 / (N + 1)   (there are N + 1 coefficients a_0, ..., a_N, and each a_i^0 = 1)
Directly implementing this approach (naively?) suffers from the sensitivity of estimating g(t) with a degree K polynomial. Does anyone know of other approaches to this problem that may be more numerically stable?
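For reference, the coefficient-matching step described above can be sketched as follows. This only illustrates the relation g_k = f_k * (a_0^k + ... + a_N^k); it does not address the stability concern. The function names and the example scales and coefficients are hypothetical.

```python
def basis_coeffs(g_coeffs, scales):
    """Recover f_k from g_k via f_k = g_k / sum_i a_i**k.

    g_coeffs[k] is the coefficient of t**k in g; scales is (a_0, ..., a_N).
    """
    return [g_k / sum(a ** k for a in scales) for k, g_k in enumerate(g_coeffs)]

def poly_eval(coeffs, t):
    """Evaluate sum_k coeffs[k] * t**k with Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * t + c
    return result

# Hypothetical example: a known f, used only to build a consistent g.
scales = [1.0, 0.5, 0.25]      # a_0, a_1, a_2 (so N = 2)
f_true = [2.0, -1.0, 0.5]      # f(t) = 2 - t + 0.5 t^2
g_coeffs = [f_k * sum(a ** k for a in scales) for k, f_k in enumerate(f_true)]

f_est = basis_coeffs(g_coeffs, scales)
print(f_est)
```

Note that for k = 0 the divisor is `len(scales)`, i.e. the number of terms in the sum.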

Genetic Algorithm : Find curve that fits points

I am working on a genetic algorithm. Here is how it works :
Input : a list of 2D points
Input : the degree of the curve
Output : the equation of the curve that passes through points the best way (try to minimize the sum of vertical distances from point's Ys to the curve)
The algorithm finds good equations for simple straight lines and for 2-degree equations.
But for 4 points and equations of degree 3 or more, it gets more complicated. I cannot find the right combination of parameters: sometimes I have to wait 5 minutes and the curve found is still very bad. I tried modifying many parameters, from population size to the number of parents selected...
Are there well-known parameter combinations or theorems in GA programming that can help me?
Thank you ! :)
Based on what is given, you would need a polynomial interpolation in which the degree of the equation is the number of points minus 1.
n = (Number of points) - 1
Now having said that, let's assume you have 5 points that need to be fitted and I am going to define them in a variable:
var points = [[0,0], [2,3], [4,-1], [5,7], [6,9]]
Please note that the array of points is ordered by x value, which you need to do.
Then the equation would be:
f(x) = a1*x^4 + a2*x^3 + a3*x^2 + a4*x + a5
Now, based on the definition (https://en.wikipedia.org/wiki/Polynomial_interpolation#Constructing_the_interpolation_polynomial), you can compute the coefficients; use the referenced page to work them out.
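As a rough sketch of what that construction amounts to for the five example points: set up the Vandermonde system (one row `1, x, x^2, x^3, x^4` per point) and solve it. The function names `interpolate` and `evaluate` are my own, exact fractions are used to avoid rounding, and the coefficients come back lowest degree first, so the order is the reverse of a1..a5 in the equation above.

```python
from fractions import Fraction

def interpolate(points):
    """Coefficients [c0, c1, ..., cn] of the polynomial through n+1 points
    with distinct x values, found by Gauss-Jordan elimination on the
    Vandermonde system."""
    n = len(points)
    # Augmented matrix rows: [1, x, x^2, ..., x^(n-1) | y]
    rows = [[Fraction(x) ** k for k in range(n)] + [Fraction(y)] for x, y in points]
    for col in range(n):
        # Find a row with a non-zero pivot and swap it into place.
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        pivot_val = rows[col][col]
        rows[col] = [v / pivot_val for v in rows[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [v - factor * p for v, p in zip(rows[r], rows[col])]
    return [rows[i][n] for i in range(n)]

def evaluate(coeffs, x):
    """Evaluate c0 + c1*x + ... + cn*x^n exactly."""
    return sum(c * Fraction(x) ** k for k, c in enumerate(coeffs))

points = [(0, 0), (2, 3), (4, -1), (5, 7), (6, 9)]
coeffs = interpolate(points)
for x, y in points:
    print(x, y, evaluate(coeffs, x))  # the last two columns should match
```

This gives the exact curve directly, so it can also serve as a reference to compare a genetic algorithm's result against.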
It is not that complicated, for the polynomial interpolation of degree n you get the following equation:
p(x) = c0 + c1 * x + c2 * x^2 + ... + cn * x^n = y
This means we need n + 1 genes for the coefficients c0 to cn.
The fitness function is the sum of all squared distances from the points to the curve; the formula for the squared distance is below. With this definition a smaller value is better; if you don't want that, you can take the inverse (1 / sum of squared distances):
d_squared(xi, yi) = (yi - p(xi))^2
I think for faster convergence you could limit the mutation, e.g. when mutating, choose a new value between min and max (e.g. -1000 and 1000) with 20% probability, and with 80% probability multiply the old value by a random factor between 0.8 and 1.2.
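The fitness function and the limited mutation could look roughly like this. It is a sketch under my own assumptions: the names `fitness` and `mutate` are hypothetical, and mutating exactly one randomly chosen gene per call is a choice made here, not something prescribed above.

```python
import random

def p(coeffs, x):
    """Evaluate c0 + c1*x + ... + cn*x^n."""
    return sum(c * x ** k for k, c in enumerate(coeffs))

def fitness(coeffs, points):
    """Sum of squared vertical distances to the curve; smaller is better."""
    return sum((y - p(coeffs, x)) ** 2 for x, y in points)

def mutate(coeffs, rng, lo=-1000.0, hi=1000.0, reset_prob=0.2):
    """Return a copy of coeffs with one randomly chosen gene changed.

    With probability reset_prob the gene gets a fresh value in [lo, hi];
    otherwise the old value is multiplied by a random factor in [0.8, 1.2].
    """
    child = list(coeffs)
    i = rng.randrange(len(child))
    if rng.random() < reset_prob:
        child[i] = rng.uniform(lo, hi)
    else:
        child[i] *= rng.uniform(0.8, 1.2)
    return child

rng = random.Random(42)
points = [(0, 1), (1, 3), (2, 5)]   # hypothetical points on y = 1 + 2x
parent = [1.0, 2.0]
child = mutate(parent, rng)
print(fitness(parent, points), fitness(child, points))
```

One caveat of the multiplicative step: a gene that is exactly 0 stays 0 until the 20% reset branch hits it.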

optimization of sum of multi variable functions

Imagine that I'm a bakery trying to maximize the number of pies I can produce with my limited quantities of ingredients.
Each of the following pie recipes A, B, C, and D produce exactly 1 pie:
A = i + j + k
B = t + z
C = 2z
D = 2j + 2k
*The recipes always have linear form, like above.
I have the following ingredients:
4 of i
5 of z
4 of j
2 of k
1 of t
I want an algorithm to maximize my pie production given my limited amount of ingredients.
The optimal solution of these example inputs would yield me the following quantities of pies:
2 x A
1 x B
2 x C
0 x D
= a total of 5 pies
I can solve this easily enough by taking the maximal producer of all combinations, but the number of combos becomes prohibitive as the quantities of ingredients increase. I feel like there must be generalizations of this type of optimization problem, I just don't know where to start.
While I can only bake whole pies, I would still be interested in seeing a method which may produce non-integer results.
You can define the linear programming problem. I'll show the usage on the example, but it can of course be generalized to any data.
Denote your pies as your variables (x1 = A, x2 = B, ...) and the LP problem will be as follows:
maximize x1 + x2 + x3 + x4
s.t. x1 <= 4 (needed i's)
x1 + 2x4 <= 4 (needed j's)
x1 + 2x4 <= 2 (needed k's)
x2 <= 1 (needed t's)
x2 + 2x3 <= 5 (needed z's)
and x1,x2,x3,x4 >= 0
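To check the formulation on the small example, here is a sketch that encodes the same constraints and searches all integer pie counts exhaustively. To be clear, this is brute force and not an LP solver, so it only works for tiny inputs; it is here to confirm that the constraints above give the expected answer. The `recipes` and `stock` dictionaries are just my encoding of the question's data.

```python
from itertools import product

# Ingredients used per pie, taken from the recipes in the question.
recipes = {
    "A": {"i": 1, "j": 1, "k": 1},
    "B": {"t": 1, "z": 1},
    "C": {"z": 2},
    "D": {"j": 2, "k": 2},
}
stock = {"i": 4, "z": 5, "j": 4, "k": 2, "t": 1}

def feasible(counts):
    """True if the pie counts respect every ingredient constraint."""
    for ingredient, available in stock.items():
        used = sum(recipes[pie].get(ingredient, 0) * n for pie, n in counts.items())
        if used > available:
            return False
    return True

def upper_bound(pie):
    """Most pies of one kind that the stock allows on its own."""
    return min(stock[ing] // qty for ing, qty in recipes[pie].items())

pies = list(recipes)
candidates = (
    dict(zip(pies, combo))
    for combo in product(*(range(upper_bound(pie) + 1) for pie in pies))
)
# Keep only feasible combinations and take the one with the most pies.
best = max((c for c in candidates if feasible(c)), key=lambda c: sum(c.values()))
print(best, sum(best.values()))
```

For real problem sizes the same `recipes` and `stock` data would be handed to an LP or integer programming solver instead of being enumerated.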
The fractional solution to this problem can be found in polynomial time, but integer linear programming is NP-Complete.
The problem is indeed NP-Complete, because given an integer linear programming problem, you can reduce it to "maximize the number of pies" using the same approach, where each constraint is an ingredient in the pie and the variables are the numbers of pies.
For the integer problem, there are a lot of approximation techniques in the literature if you can live with a solution that is "close, up to a certain bound" (for example, the local ratio technique or primal-dual methods are often used). If you need an exact solution, an exponential-time algorithm is probably your best shot (unless, of course, P=NP).
Since all your functions are linear, it sounds like you're looking for either linear programming (if continuous values are acceptable) or integer programming (if you require your variables to be integers).
Linear programming is a standard technique, and is efficiently solvable. A traditional algorithm for doing this is the simplex method.
Integer programming is intractable in general, because adding integral constraints allows it to describe intractable combinatorial problems. There seems to be a large number of approximation techniques (for example, you might try just using regular linear programming to see what that gets you), but of course they depend on the specific nature of your problem.

Number of ways to move from Point 1 to Point 2 in a co-ordinate 2D plane

I came across a question that asked for the number of unique ways of getting from point 1 to point 2 in a 2D co-ordinate plane.
Note: It can be assumed without loss of generality that x1 < x2 and y1 < y2.
Moreover, the motions are constrained in the following manner: one can move only right or up. This means a valid move is from (xa, ya) to (xb, yb) if xa < xb and ya < yb.
Mathematically, this can be found by [(x2-x1)+(y2-y1)]! / ( [(x2-x1)!] * [(y2-y1)!] ). I have thought of code too.
I have coded an approach with dynamic programming; it takes around O([max(x2,y2)]^2) time and Theta( x2 * y2 ) space, where I can just manage with the upper or lower triangular matrix.
Can you think of some other approaches where the running time is less than this? I am thinking of a recursive solution where the minimum running time is O(max(x2,y2)).
A simple efficient solution is the mathematical one.
Let x2-x1 = n and y2-y1 = m.
You need to take exactly n steps to the right and m steps up; all that is left to determine is their order.
This can be modeled as the number of binary vectors with n+m elements with exactly n elements set to 1.
Thus, the total number of possibilities is choose(n+m, n) = (n+m)! / (n! * m!), which is exactly what you got.
Given that the mathematical answer is both proven and faster to calculate, I see no reason for using a different solution with these restrictions.
If you are eager to use recursion here, the recursive formula for binomial coefficient will probably be a good fit here.
EDIT:
You might be looking for the multiplicative formula to calculate it.
To compute the answer, you can use this formula:
(n+m)!/(n!m!) = (n+1)/1 * (n+2)/2 * (n+3)/3 * … * (n+m)/m
So the pseudo code is:
let foo(n,m) =
    ans = 1;
    for i = 1 to m do
        ans = ans*(n+i)/i;
    done;
    ans
The order of multiplications and divisions is important: if you modify it, you can get an overflow even if the result is not so large.
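A runnable version of the same loop, as a sketch in Python (the function name `count_paths` and its four-argument signature are my own choice). Python integers do not overflow, but the multiply-then-divide order still matters because it keeps every intermediate division exact.

```python
def count_paths(x1, y1, x2, y2):
    """Number of right/up lattice paths from (x1, y1) to (x2, y2)."""
    n, m = x2 - x1, y2 - y1
    if n < 0 or m < 0:
        return 0  # the target is not reachable with right/up moves
    # Loop over the smaller of the two so the loop runs min(n, m) times.
    n, m = max(n, m), min(n, m)
    ans = 1
    for i in range(1, m + 1):
        # Multiply before dividing: ans * (n + i) is always divisible by i.
        ans = ans * (n + i) // i
    return ans

print(count_paths(0, 0, 2, 2))
```

The loop runs min(n, m) times, so the running time is linear in the smaller side of the rectangle rather than quadratic.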
I finally managed to write the article to describe this question in detail and complete the answer as well. Here is the link for the same. http://techieme.in/dynamic-programming-distinct-paths-between-two-points/
try this formula:
ans = (x2-x1) * (y2-y1) + 1;

What is the most efficient algorithm to find a straight line that goes through most points?

The problem:
N points are given on a 2-dimensional plane. What is the maximum number of points on the same straight line?
The problem has an O(N^2) solution: go through each point and find the number of points which have the same dx / dy with relation to the current point. Store dx / dy relations in a hash map for efficiency.
Is there a better solution to this problem than O(N^2)?
There is likely no solution to this problem that is significantly better than O(n^2) in a standard model of computation.
The problem of finding three collinear points reduces to the problem of finding the line that goes through the most points, and finding three collinear points is 3SUM-hard, meaning that solving it in less than O(n^2) time would be a major theoretical result.
See the previous question on finding three collinear points.
For your reference (using the known proof), suppose we want to answer a 3SUM problem such as finding x, y, z in list X such that x + y + z = 0. If we had a fast algorithm for the collinear point problem, we could use that algorithm to solve the 3SUM problem as follows.
For each x in X, create the point (x, x^3) (for now we assume the elements of X are distinct). Next, check whether there exist three collinear points among the created points.
To see that this works, note that if x + y + z = 0 then the slope of the line from x to y is
(y^3 - x^3) / (y - x) = y^2 + yx + x^2
and the slope of the line from x to z is
(z^3 - x^3) / (z - x) = z^2 + zx + x^2 = (-(x + y))^2 - (x + y)x + x^2
= x^2 + 2xy + y^2 - x^2 - xy + x^2 = y^2 + yx + x^2
Conversely, if the slope from x to y equals the slope from x to z then
y^2 + yx + x^2 = z^2 + zx + x^2,
which implies that
(y - z) (x + y + z) = 0,
so either y = z (ruled out, because the elements are distinct) or z = -x - y, which suffices to prove that the reduction is valid.
If there are duplicates in X, you first check whether x + 2y = 0 for any x and duplicate element y (in linear time using hashing or O(n lg n) time using sorting), and then remove the duplicates before reducing to the collinear point-finding problem.
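The equivalence used in the reduction can be checked by brute force on small inputs. This sketch is cubic in the input size, so it only demonstrates that the mapping x -> (x, x^3) is correct; it says nothing about speed. The function names are hypothetical.

```python
from itertools import combinations

def collinear(p, q, r):
    """Exact integer test: zero cross product of (q - p) and (r - p)."""
    return (q[0] - p[0]) * (r[1] - p[1]) == (q[1] - p[1]) * (r[0] - p[0])

def lift(x):
    """Map a number x to the point (x, x^3) on the cubic curve."""
    return (x, x ** 3)

def has_zero_triple(values):
    """True if three distinct values sum to zero, decided via collinearity."""
    pts = [lift(x) for x in set(values)]
    return any(collinear(p, q, r) for p, q, r in combinations(pts, 3))

print(has_zero_triple([1, 2, -3, 10]))  # 1 + 2 + (-3) == 0
print(has_zero_triple([1, 2, 4, 8]))    # all positive, so no zero triple
```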
If you limit the problem to lines passing through the origin, you can convert the points to polar coordinates (angle, distance from origin) and sort them by angle. All points with the same angle lie on the same line. This takes O(n log n) time.
I don't think there is a faster solution in the general case.
The Hough Transform can give you an approximate solution. It is approximate because the binning technique has a limited resolution in parameter space, so the maximum bin will give you some limited range of possible lines.
Again an O(n^2) solution, with pseudo code. The idea is to create a hash table with the line itself as the key. A line is defined by the slope between the two points and the point where the line cuts the x-axis (or the y-axis, if the line is parallel to the x-axis).
The solution assumes languages like Java or C#, where the equals and hashcode methods of the object are used by the hashing function.
Create an object (call it SlopeObject) with 3 fields:
Slope // Can be Infinity
Point of intercept with x-axis -- poix // Will be (Infinity, some y value) or (x value, 0)
Count
poix will be an (x, y) pair. If the line crosses the x-axis, poix will be (some number, 0). If the line is parallel to the x-axis, then poix = (Infinity, some number), where the y value is where the line crosses the y-axis.
Override the equals method so that 2 objects are equal if Slope and poix are equal.
Hashcode is overridden with a function which provides a hashcode based on the combination of the values of Slope and poix. Some pseudo code below:
HashMap<SlopeObject, SlopeObject> map;
foreach (point a in the array) {
    foreach (every other point b) {
        slope = calculateSlope(a, b);
        poix = calculateXInterception(a, b);
        SlopeObject so = new SlopeObject(slope, poix, 1); // slope, poix and initial count 1
        SlopeObject inMapSlopeObj = map.get(so);
        if (inMapSlopeObj == null) {
            map.put(so, so); // first time this line is seen
        } else {
            inMapSlopeObj.setCount(inMapSlopeObj.getCount() + 1);
        }
    }
}
SlopeObject maxCounted = getObjectWithMaxCount(map);
print("line is through " + maxCounted.poix + " with slope " + maxCounted.slope);
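For comparison, here is a runnable O(n^2) sketch in Python. Note that it is a variant, not a translation of the pseudo code above: instead of one global map keyed by line, it keeps a per-point counter of directions (the approach described in the question), and it assumes integer coordinates so that directions can be normalized exactly with gcd instead of using floating-point slopes. The function name is hypothetical.

```python
from collections import Counter
from math import gcd

def max_points_on_line(points):
    """Largest number of the given integer points that lie on one line."""
    points = list(points)
    if len(points) <= 2:
        return len(points)
    best = 0
    for i, (x0, y0) in enumerate(points):
        directions = Counter()
        duplicates = 0
        for j, (x1, y1) in enumerate(points):
            if i == j:
                continue
            dx, dy = x1 - x0, y1 - y0
            if dx == 0 and dy == 0:
                duplicates += 1  # same coordinates as the anchor point
                continue
            g = gcd(dx, dy)
            dx, dy = dx // g, dy // g
            # Fix the sign so opposite directions hash to the same key.
            if dx < 0 or (dx == 0 and dy < 0):
                dx, dy = -dx, -dy
            directions[(dx, dy)] += 1
        most = max(directions.values(), default=0)
        # +1 for the anchor point itself, plus any copies of it.
        best = max(best, most + duplicates + 1)
    return best

print(max_points_on_line([(0, 0), (1, 1), (2, 2), (3, 3), (0, 1), (1, 0)]))
```

Using a reduced (dx, dy) pair as the key avoids the Infinity special case for vertical lines.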
Move to the dual plane using the point-line duality transform: a point p = (a, b) maps to the line p*: y = a*x + b.
Now, using a line sweep algorithm, find all intersection points in NlogN time.
(If you have points which are one above the other, just rotate the points by some small angle.)
The intersection points in the dual plane correspond to lines in the primal plane.
To whoever said that, since 3SUM reduces to this problem, the complexity is O(n^2): please note that the complexity of 3SUM is less than that.
Please check https://en.wikipedia.org/wiki/3SUM and also read
https://tmc.web.engr.illinois.edu/reduce3sum_sosa.pdf
As already mentioned, there probably isn't a way to solve the general case of this problem better than O(n^2). However, if you assume a large number of points lie on the same line (say the probability that a random point in the set of points lies on the line with the maximum number of points is p) and don't need an exact algorithm, a randomized algorithm is more efficient.
maxPoints = 0
Repeat for k iterations:
    1. Pick 2 distinct points uniformly at random
    2. maxPoints = max(maxPoints, number of points that lie on the line defined by the 2 points chosen in step 1)
Note that in the first step, if you picked 2 points which lie on the line with the maximum number of points, you'll get the optimal solution. Assuming n is very large (i.e. we can treat the probability of finding 2 desirable points as sampling with replacement), the probability of this happening is p^2. Therefore the probability of finding a suboptimal solution after k iterations is (1 - p^2)^k.
Suppose you can tolerate a false negative rate of err. Then this algorithm runs in O(nk) = O(n * log(err) / log(1 - p^2)). If both n and p are large enough, this is significantly more efficient than O(n^2). (For example, suppose n = 1,000,000 and you know there are at least 10,000 points that lie on the same line. Then p = 0.01, and n^2 would require on the order of 10^12 operations, while the randomized algorithm would need about k = 10^5 iterations, i.e. on the order of 10^11 operations, to get an error rate of less than 5*10^-5.)
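A sketch of the randomized procedure, together with the iteration count implied by (1 - p^2)^k <= err. The function names and the small example are hypothetical, and integer coordinates are assumed so that the on-line test is exact.

```python
import math
import random

def on_line(p, q, r):
    """True if r lies on the line through p and q (exact for integers)."""
    return (q[0] - p[0]) * (r[1] - p[1]) == (q[1] - p[1]) * (r[0] - p[0])

def iterations_needed(p, err):
    """Smallest k with (1 - p^2)^k <= err, for 0 < p < 1 and 0 < err < 1."""
    return math.ceil(math.log(err) / math.log(1 - p * p))

def randomized_max_line(points, k, rng):
    """Best count found after k random pairs; may underestimate the true maximum."""
    best = 0
    for _ in range(k):
        a, b = rng.sample(points, 2)
        if a == b:
            continue  # coincident points do not define a line
        best = max(best, sum(1 for r in points if on_line(a, b, r)))
    return best

# Hypothetical example: 8 of 10 points lie on y = 2x.
points = [(i, 2 * i) for i in range(8)] + [(0, 5), (3, -1)]
rng = random.Random(1)
result = randomized_max_line(points, 200, rng)
print(result, iterations_needed(0.01, 5e-5))
```

Each iteration costs O(n) for the counting pass, which is where the O(nk) total comes from.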
It is unlikely that an $o(n^2)$ algorithm exists, since the problem (even of checking whether any 3 points in R^2 are collinear) is 3SUM-hard (http://en.wikipedia.org/wiki/3SUM)
This is not a solution better than O(n^2), but you can do the following:
1. For each point, treat it as if it were at the (0,0) coordinate, and apply the equivalent translation to all the other points by moving them the same x,y distance you needed to move the originally chosen point.
2. Convert this new set of translated points to angles with respect to the new (0,0).
3. Store the maximum number (MSN) of points that share the same angle.
4. Choose the largest stored number (MSN) over all points; that will be the solution.

Resources