Should we use k-means++ instead of k-means? - algorithm

The k-means++ algorithm is said to improve on the original k-means algorithm in the following two respects:
The original k-means algorithm has a worst-case running time that is super-polynomial in the input size, while k-means++ is claimed to be O(log k).
The clustering found by the original algorithm can be quite unsatisfactory with respect to the objective function compared to the optimal clustering.
But are there any drawbacks to k-means++? Should we always use it instead of k-means from now on?

Nobody claims k-means++ runs in O(lg k) time; its solution quality is O(lg k)-competitive with the optimal solution. Both k-means++ and the common method, called Lloyd's algorithm, are approximations to an NP-hard optimization problem.
I'm not sure what the worst case running time of k-means++ is; note that in Arthur & Vassilvitskii's original description, steps 2-4 of the algorithm refer to Lloyd's algorithm. They do claim that it works both better and faster in practice because it starts from a better position.
The drawbacks of k-means++ are thus:
It too can find a suboptimal solution (it's still an approximation).
It's not consistently faster than Lloyd's algorithm (see Arthur & Vassilvitskii's tables).
It's more complicated than Lloyd's algorithm.
It's relatively new, while Lloyd's has proven its worth for over 50 years.
Better algorithms may exist for specific metric spaces.
That said, if your k-means library supports k-means++, then by all means try it out.
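For readers who want to see what the extra complication amounts to, here is a minimal sketch of the k-means++ seeding step (the D^2-weighted choice of starting centres), after which ordinary Lloyd iterations would run. The names `sq_dist` and `kmeanspp_seeds` are made up for this illustration, points are assumed to be tuples of numbers, and this is not taken from any particular library.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeanspp_seeds(points, k, rng=None):
    """Pick k starting centres: the first uniformly at random, each later one
    with probability proportional to its squared distance to the nearest
    centre chosen so far (D^2 weighting)."""
    rng = rng or random.Random()
    centres = [rng.choice(points)]
    # d2[i] = squared distance from points[i] to its nearest chosen centre
    d2 = [sq_dist(p, centres[0]) for p in points]
    while len(centres) < k:
        if sum(d2) == 0:
            # Every point coincides with a chosen centre; fall back to uniform.
            centres.append(rng.choice(points))
        else:
            centres.append(rng.choices(points, weights=d2, k=1)[0])
        # Refresh nearest-centre distances against the newly added centre.
        d2 = [min(d, sq_dist(p, centres[-1])) for d, p in zip(d2, points)]
    return centres
```

A point that is already a centre has weight zero, so it cannot be drawn again while any other point is still at a positive distance.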

Not your question, but here is an easy speedup to any k-means method for large N:
1) first run k-means on a random sample of, say, sqrt(N) of the points;
2) then run full k-means from those centres.
I've found this 5-10 times faster than k-means++ for N = 10000, k = 20, with similar results.
How well it works for you will depend on how well a sqrt(N) sample
approximates the whole, as well as on N, dim, k, ninit, delta ...
What are your N (number of data points), dim (number of features), and k?
The huge range in users' N, dim, k, data noise, metrics ...,
not to mention the lack of public benchmarks, makes it tough to compare methods.
Added: Python code for kmeans() and kmeanssample() is
here on SO; comments are welcome.
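To make the two steps concrete, here is a small pure-Python sketch. It is not the linked code: `lloyd` and `kmeans_sample` are hypothetical names for this illustration, points are assumed to be tuples, and the sample size is taken as max(k, floor(sqrt(N))).

```python
import math
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def lloyd(points, centres, max_iter=100):
    """Plain Lloyd iterations starting from the given centres."""
    centres = [tuple(c) for c in centres]
    for _ in range(max_iter):
        # Assignment step: group each point with its nearest centre.
        groups = [[] for _ in centres]
        for p in points:
            j = min(range(len(centres)), key=lambda i: sq_dist(p, centres[i]))
            groups[j].append(p)
        # Update step: move each centre to the mean of its group
        # (an empty group keeps its old centre).
        new = [tuple(sum(x) / len(g) for x in zip(*g)) if g else c
               for g, c in zip(groups, centres)]
        if new == centres:
            break
        centres = new
    return centres

def kmeans_sample(points, k, rng=None):
    """Step 1: k-means on a ~sqrt(N) random sample. Step 2: full k-means
    started from the centres found in step 1."""
    rng = rng or random.Random()
    n_sample = max(k, math.isqrt(len(points)))
    sample = rng.sample(points, n_sample)
    rough = lloyd(sample, rng.sample(sample, k))
    return lloyd(points, rough)
```

The speedup, if any, comes from step 2 needing fewer full passes over the data when it starts near a good solution; whether that holds for your data is something to measure.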

Related

Meaning of O(logk) competitive complexity

I am working on an existing algorithm to improve its complexity. The existing algorithm uses K-means to perform clustering, whereas I chose to use K-means++ to do the same.
K-means++ was chosen because it mostly gives faster and more accurate clustering results than K-means.
Now, towards the end, where I have to compare the complexity of the new and existing algorithms, I find that I can't make sense of the statement that K-means++ has a complexity of "O(log k) competitive".
I have tried looking everywhere on the web for an explanation, including Stack Overflow.
The only thing I have understood is that competitive has something to do with "on-line" and "off-line" algorithms. Could anyone please explain how it applies here?
The full sentence that you are reading says something like "The k-means++ clustering is O(log k)-competitive to the optimal k-means solution".
This is not a statement about its algorithmic complexity. It's a statement about its effectiveness. O-notation can be used to bound quantities other than running time.
K-means attempts to minimize a "potential" that is calculated as the sum of the squared distances of points from their cluster centers.
For any specific clustering problem, the expected potential of a K-means++ solution is at most 8(ln k + 2) times the potential of the best possible solution. That 8(ln k + 2) is shortened to O(log k) for brevity.
The precise meaning of the statement that the k-means++ solution is O(log k)-competitive is that there is some constant C such that the expected ratio between the k-means++ potential and the best possible potential is less than C*(log k) for all sufficiently large k.
With the bound above and natural logarithms, any C slightly larger than 8 works.
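A short sketch may make the two quantities concrete: the potential being minimised, and the factor 8(ln k + 2) from the bound. The function names are invented for this example, and points and centres are assumed to be tuples of numbers.

```python
import math

def potential(points, centres):
    """Sum over all points of the squared distance to the nearest centre."""
    return sum(
        min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centres)
        for p in points
    )

def kmeanspp_bound_factor(k):
    """The factor 8(ln k + 2) bounding expected potential / optimal potential."""
    return 8 * (math.log(k) + 2)
```

For k = 20 the factor is just under 40, so the guarantee is loose; it says nothing about how long the algorithm takes to run.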

Algorithm Analysis: In practice, do coefficients of higher order terms matter?

Consider an^2 + bn + c. I understand that for large n, bn and c become insignificant.
I also understand that for large n, the differences between 2n^2 and n^2 are pretty insignificant compared to the differences between, say n^2 and n*log(n).
However, there is still a factor of 2 difference between 2n^2 and n^2. Does this matter in practice? Or do people just think about algorithms without coefficients? Why?
The actual coefficients matter if you're interested in timing. But big-O isn't actually about timing, it's about scalability. When you see an algorithm described as O(n^2), you don't really know how long it will take to solve a problem of size n on a particular computer in a particular language with a particular compiler, but you know that a problem of size 2n should take about 4 times as long.
The reason you can ignore the coefficients is that if you consider the ratio of different size problems, the lower order terms' coefficients are asymptotically dominated, and the highest order term's coefficients cancel in the ratio.
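The cancellation can be checked numerically. In this sketch the coefficients a, b, c are arbitrary made-up values; the point is only that the ratio cost(2n)/cost(n) approaches 4 whatever they are.

```python
def cost(n, a=3.0, b=50.0, c=1000.0):
    """Model running time a*n^2 + b*n + c with illustrative coefficients."""
    return a * n * n + b * n + c

def doubling_ratio(n, **coeffs):
    """How much longer a problem of size 2n takes than one of size n."""
    return cost(2 * n, **coeffs) / cost(n, **coeffs)
```

For small n the lower-order terms still dominate and the ratio is well below 4, which is why the statement is only about large n.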
We use time complexity analysis to help us estimate the time cost and understand how far we can go. For example, the lower bound for comparison-based sorting is Ω(n lg n); this is proved in theory, so we should never try to design a comparison sort that is asymptotically better than this.
As for the coefficient, in many cases it is not easy to find an accurate number in theory, since it can be affected by the input data. But that doesn't mean it's unimportant. Quicksort is one of the most widely used sorting algorithms because its constant factor is really small: about 1.39 N lg N comparisons in the average case.
Another interesting fact about quicksort: we all know that its worst case costs O(N^2). We can use the median-of-medians algorithm to reduce the worst-case time complexity of quicksort to O(N lg N), but we seldom use this version in practice, because the coefficient of the median-of-medians version is too big, which makes it impractical.

What is the purpose of Big-O notation in computer science if it doesn't give all the information needed?

What is the use of Big-O notation in computer science if it doesn't give all the information needed?
For example, if one algorithm runs at 1000n and one at n, it is true that they are both O(n). But I still may make a foolish choice based on this information, since one algorithm takes 1000 times as long as the other for any given input.
I still need to know all the parts of the equation, including the constant, to make an informed choice, so what is the importance of this "intermediate" comparison? I end up losing important information when it gets reduced to this form, and what do I gain?
What does that constant factor represent? You can't say with certainty, for example, that an algorithm that is O(1000n) will be slower than an algorithm that's O(5n). It might be that the 1000n algorithm loads all data into memory and makes 1,000 passes over that data, and the 5n algorithm makes five passes over a file that's stored on a slow I/O device. The 1000n algorithm will run faster even though its "constant" is much larger.
In addition, some computers perform some operations more quickly than other computers do. It's quite common, given two O(n) algorithms (call them A and B), for A to execute faster on one computer and B to execute faster on the other computer. Or two different implementations of the same algorithm can have widely varying runtimes on the same computer.
Asymptotic analysis, as others have said, gives you an indication of how an algorithm's running time varies with the size of the input. It's useful for giving you a good starting place in algorithm selection. Quick reference will tell you that a particular algorithm is O(n) or O(n log n) or whatever, but it's very easy to find more detailed information on most common algorithms. Still, that more detailed analysis will only give you a constant number without saying how that number relates to real running time.
In the end, the only way you can determine which algorithm is right for you is to study it yourself and then test it against your expected data.
In short, I think you're expecting too much from asymptotic analysis. It's a useful "first line" filter. But when you get beyond that you have to look for more information.
As you correctly noted, it does not give you information on the exact running time of an algorithm. It is mainly used to indicate the complexity of an algorithm, to indicate if it is linear in the input size, quadratic, exponential, etc. This is important when choosing between algorithms if you know that your input size is large, since even a 1000n algorithm will beat a 1.23 exp(n) algorithm for large enough n.
In real world algorithms, the hidden 'scaling factor' is of course important. It is therefore not uncommon to use an algorithm with a 'worse' complexity if it has a lower scaling factor. Many practical implementations of sorting algorithms are for example 'hybrid' and will resort to some 'bad' algorithm like insertion sort (which is O(n^2) but very simple to implement) for n < 10, while changing to quicksort (which is O(n log(n)) but more complex) for n >= 10.
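A rough sketch of such a hybrid, assuming a cutoff of 10 as in the text and a simple middle-element pivot; real library sorts are more elaborate, and the names here are just for illustration.

```python
def insertion_sort(a, lo, hi):
    """Sort the slice a[lo..hi] (inclusive) in place."""
    for i in range(lo + 1, hi + 1):
        x = a[i]
        j = i - 1
        while j >= lo and a[j] > x:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = x

def hybrid_sort(a, lo=0, hi=None, cutoff=10):
    """Quicksort that hands ranges shorter than `cutoff` to insertion sort."""
    if hi is None:
        hi = len(a) - 1
    if hi - lo + 1 < cutoff:
        insertion_sort(a, lo, hi)
        return
    # Partition around the middle element (Hoare-style two-pointer sweep).
    pivot = a[(lo + hi) // 2]
    i, j = lo, hi
    while i <= j:
        while a[i] < pivot:
            i += 1
        while a[j] > pivot:
            j -= 1
        if i <= j:
            a[i], a[j] = a[j], a[i]
            i += 1
            j -= 1
    hybrid_sort(a, lo, j, cutoff)
    hybrid_sort(a, i, hi, cutoff)
```

The best cutoff depends on the machine and the data, so it is usually tuned by measurement rather than derived.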
Big-O tells you how the runtime or memory consumption of a process changes as the size of its input changes. O(n) and O(1000n) are both still O(n) -- if you double the size of the input, then for all practical purposes the runtime doubles too.
Now, we can have an O(n) algorithm and an O(n^2) algorithm where the coefficient of n is 1000000 and the coefficient of n^2 is 1, in which case the O(n^2) algorithm would outperform the O(n) for smaller n values. This doesn't change the fact, however, that the second algorithm's runtime grows more rapidly than the first's, and this is the information that big-O tells us. There will be some input size at which the O(n) algorithm begins to outperform the O(n^2) algorithm.
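For that example the crossover can be found directly. This sketch assumes the costs are exactly 1000000*n and 1*n^2 operations, with no lower-order terms, and the function name is invented for the illustration.

```python
def first_n_where_quadratic_is_slower(lin=1_000_000, quad=1, hi=10**12):
    """Smallest n >= 1 with quad*n^2 > lin*n, found by binary search.
    Assumes such an n exists at or below `hi`."""
    lo = 1
    while lo < hi:
        mid = (lo + hi) // 2
        if quad * mid * mid > lin * mid:
            hi = mid          # mid already loses; the answer is at or below it
        else:
            lo = mid + 1      # quadratic still no slower; look higher
    return lo
```

With the default coefficients the two are equal at n = 1000000 and the quadratic algorithm is slower from n = 1000001 onwards.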
In addition to the hidden impact of the constant term, complexity is usually quoted only for the worst-case instance of a problem.
Case in point: the simplex method (linear programming) has exponential worst-case complexity for all known implementations. However, the simplex method works much faster in practice than the provably polynomial-time interior point methods.
Complexity notation has much value for theoretical problem classification. If you want some more information on practical consequences check out "Smoothed Analysis" by Spielman: http://www.cs.yale.edu/homes/spielman
This is what you are looking for.
Its main purpose is for rough comparisons of logic. The difference between O(n) and O(1000n) is large for n ~ 1000 (n roughly equal to 1000) and n < 1000, but when you compare it to values where n >> 1000 (n much larger than 1000) the difference is minuscule.
You are right in saying they both scale linearly, and knowing the coefficient helps in a detailed analysis, but generally in computing the difference between linear (O(cn)) and polynomial (O(cn^x)) performance is more important to note than the difference between two linear times. There is more value in comparing runtimes of different orders, where the performance difference grows with the input size.
The overall purpose of Big O notation is to give a sense of relative performance time in order to compare and further optimize algorithms.
You're right that it doesn't give you all information, but there's no single metric in any field that does that.
Big-O notation tells you how quickly the performance gets worse, as your dataset gets larger. In other words, it describes the type of performance curve, but not the absolute performance.
Generally, Big-O notation is useful to express an algorithm's scaling performance as it falls into one of three basic categories:
Linear
Logarithmic (or "linearithmic", i.e. n log n)
Exponential
It is possible to do deep analysis of an algorithm for very accurate performance measurements, but it is time consuming and not really necessary to get a broad indication of performance.

What is the Best Complexity of a Greedy Algorithm?

It seems like the best complexity would be linear O(n).
Doesn't matter the case really, I'm speaking of greedy algorithms in general.
Sometimes it pays off to be greedy?
The specific case I am interested in is computing change.
Say you need to give 35 cents in change. You have coins of 1, 5, 10, 25. The greedy algorithm, coded simply, would solve this problem quickly and easily: it first grabs 25 cents, the highest value that fits into 35, and then 10 cents to complete the total. This would be the best case. Of course there are bad cases, and cases where this greedy algorithm would have issues. I'm talking about best-case complexity for this type of problem.
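For reference, the greedy change-maker described here can be sketched in a few lines. The function name is made up, and it assumes a 1-cent coin is available so that every amount can be paid out.

```python
def greedy_change(amount, coins=(25, 10, 5, 1)):
    """Repeatedly take as many of the largest remaining coin as will fit."""
    result = []
    for coin in sorted(coins, reverse=True):
        count, amount = divmod(amount, coin)
        result.extend([coin] * count)
    return result
```

`greedy_change(35)` gives `[25, 10]`. One of the bad cases: with coins (4, 3, 1) and amount 6, greedy returns `[4, 1, 1]` although `[3, 3]` uses fewer coins.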
Any algorithm that has an output of n items that must be taken individually has at best O(n) time complexity; greedy algorithms are no exception. A more natural greedy version of e.g. a knapsack problem converts something that is NP-complete into something that is O(n^2)--you try all items, pick the one that leaves the least free space remaining; then try all the remaining ones, pick the best again; and so on. Each step is O(n). But the complexity can be anything--it depends on how hard it is to be greedy. (For example, a greedy clustering algorithm like hierarchical agglomerative clustering has individual steps that are O(n^2) to evaluate (at least naively) and requires O(n) of these steps.)
When you're talking about greedy algorithms, typically you're talking about the correctness of the algorithm rather than the time complexity, especially for problems such as change making.
Greedy heuristics are used because they're simple. This means easy implementations for easy problems, and reasonable approximations for hard problems. In the latter case you'll find time complexities that are better than guaranteed correct algorithms. In the former case, you can't hope for better than optimal time complexity.
Greedy approach:
Knapsack problem: sort the given elements using merge sort, O(n log n);
find the max deadline, which takes O(n);
select the elements one by one using linear search, O(n²);
n log n + n + n² = O(n²) in the worst case.
Now, can we apply binary search instead of linear search?
Greedy or not has essentially nothing to do with computational complexity, other than the fact that greedy algorithms tend to be simpler than other algorithms to solve the same problem, and hence they tend to have lower complexity.

How to test an algorithm for perfect optimization?

Is there any way to test an algorithm for perfect optimization?
There is no easy way to prove that any given algorithm is asymptotically optimal.
A proof of optimality, if one ever comes, sometimes follows years or even decades after the algorithm was written. A classic example is the Union-Find/disjoint-set data structure.
Disjoint-set forests are a data structure where each set is represented by a tree data structure, in which each node holds a reference to its parent node. They were first described by Bernard A. Galler and Michael J. Fischer in 1964, although their precise analysis took years.
[...] These two techniques complement each other; applied together, the amortized time per operation is only O(α(n)), where α(n) is the inverse of the function f(n) = A(n,n), and A is the extremely quickly-growing Ackermann function.
[...] In fact, this is asymptotically optimal: Fredman and Saks showed in 1989 that Ω(α(n)) words must be accessed by any disjoint-set data structure per operation on average.
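For context, here is a compact sketch of a disjoint-set forest, assuming the two techniques the quote refers to are union by rank and path compression (the elided part of the quote is not reproduced here, so treat that as an assumption).

```python
class DisjointSet:
    """Disjoint-set forest over the elements 0..n-1."""

    def __init__(self, n):
        self.parent = list(range(n))   # each element starts as its own root
        self.rank = [0] * n            # upper bound on the height of each tree

    def find(self, x):
        """Return the representative of x's set, compressing the path."""
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path straight at the root.
        while self.parent[x] != root:
            nxt = self.parent[x]
            self.parent[x] = root
            x = nxt
        return root

    def union(self, a, b):
        """Merge the sets of a and b; return False if already merged."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        # Union by rank: attach the shallower tree under the deeper one.
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1
        return True
```

The code is short; it is the O(α(n)) analysis and the matching lower bound that took decades.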
For some algorithms optimality can be proven after very careful analysis, but generally speaking, there's no easy way to tell if an algorithm is optimal once it's written. In fact, it's not always easy to prove if the algorithm is even correct.
See also
Wikipedia/Matrix multiplication
The naive algorithm is O(N^3), Strassen's is roughly O(N^2.807), Coppersmith-Winograd is O(N^2.376), and we still don't know what is optimal.
Wikipedia/Asymptotically optimal
it is an open problem whether many of the most well-known algorithms today are asymptotically optimal or not. For example, there is an O(nα(n)) algorithm for finding minimum spanning trees. Whether this algorithm is asymptotically optimal is unknown, and would be likely to be hailed as a significant result if it were resolved either way.
Practical considerations
Note that sometimes asymptotically "worse" algorithms are better in practice due to many factors (e.g. ease of implementation, actually better performance for the given input parameter range, etc).
A typical example is quicksort with a simple pivot selection that may exhibit quadratic worst-case performance, but is still favored in many scenarios over a more complicated variant and/or other asymptotically optimal sorting algorithms.
For those among us mortals that merely want to know if an algorithm:
reasonably works as expected;
is faster than others;
there is an easy step called 'benchmark'.
Pick the best contenders in the area and compare them with your algorithm.
If your algorithm wins then it better matches your needs (the ones defined by
your benchmarks).
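A minimal sketch of such a benchmark using the standard `timeit` module. The `benchmark` helper and its parameters are invented for this example; each candidate is assumed to be a function taking one list argument.

```python
import timeit

def benchmark(candidates, make_input, repeat=5):
    """Return {name: best wall-clock seconds over `repeat` runs}."""
    data = make_input()
    results = {}
    for name, fn in candidates.items():
        # Each run gets a fresh copy, so in-place algorithms never see
        # input already modified by an earlier run.
        timer = timeit.Timer(lambda: fn(list(data)))
        results[name] = min(timer.repeat(repeat=repeat, number=1))
    return results
```

For example, `benchmark({"builtin": sorted, "mine": my_sort}, make_data)` would compare a hypothetical `my_sort` against the built-in on whatever `make_data` returns. The numbers only mean something for inputs that resemble your real workload.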

Resources