Meaning of O(log k)-competitive complexity - algorithm

I am working on an existing algorithm to improve its complexity. The existing algorithm uses K-means to perform clustering, whereas I chose to use K-means++ to do the same.
K-means++ was chosen because it generally gives faster and more accurate clustering results than k-means.
Now, towards the end, where I have to compare the complexity of the new and existing algorithms, I find that I can't make sense of the claim that k-means++ is O(log k)-competitive.
I have tried looking everywhere on the web for an explanation, including Stack Overflow.
The only thing I have understood is that competitive has something to do with "on-line" and "off-line" algorithms. Could anyone please explain how it applies here?

The full sentence that you are reading says something like "The k-means++ clustering is O(log k)-competitive to the optimal k-means solution".
This is not a statement about its algorithmic complexity. It's a statement about its effectiveness. You can use O-notation for other things.
K-means attempts to minimize a "potential" that is calculated as the sum of the squared distances of points from their cluster centers.
For any specific clustering problem, the expected potential of a K-means++ solution is at most 8(ln k + 2) times the potential of the best possible solution. That 8(ln k + 2) is shortened to O(log k) for brevity.
The precise meaning of the statement that the k-means++ solution is O(log k)-competitive is that there is some constant C such that the expected ratio between the k-means++ potential and the best possible potential is less than C*(log k) for all sufficiently large k.
In fact, the smallest such constant is about 8.
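To make the "potential" concrete, here is a small sketch (Python/NumPy, with hypothetical names) of the quantity being compared; the O(log k)-competitive guarantee bounds the expected ratio of this value for a k-means++ seeding against the optimal clustering:

```python
import numpy as np

def kmeans_potential(points, centers):
    """Sum of squared distances from each point to its nearest center:
    the 'potential' that k-means tries to minimize."""
    # diffs[i, j, :] = vector from center j to point i
    diffs = points[:, None, :] - centers[None, :, :]
    sq_dists = (diffs ** 2).sum(axis=2)        # shape (n_points, k)
    return sq_dists.min(axis=1).sum()

# Hypothetical usage: compare a k-means++ solution against a reference solution.
# ratio = kmeans_potential(X, centers_kmeanspp) / kmeans_potential(X, centers_optimal)
# The guarantee says E[ratio] <= 8 * (ln k + 2), i.e. O(log k).
```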

Related

Worst case of one algorithm equal to best case of another

I'm trying to answer this question about algorithms and I can't figure out what it could possibly mean. I don't have any example to provide; I'm sharing it exactly as it was shared with me:
"If the complexity of the X algorithm for the worst case is equal to the complexity of the Y algorithm for the best case, which of these two algorithms is faster? Explain why!"
They're not looking for any specific answer. They're looking for how you reason about the question. For example, you can reason as follows:
Obviously, one would prefer an algorithm whose worst case is as good as another algorithm's best case: in the worst case they're equal, and in every other case it's better. But complexity isn't the only criterion by which algorithms should be judged, ...
This is one of those "see how you reason about things" questions and not a "get the right answer" question.
Let me try to explain it in steps:
1) Understand that an algorithm can have different best-, average-, and worst-case complexities depending on the characteristics of the input, not just its size.
2) If algorithm X's worst-case complexity is equal to algorithm Y's best-case complexity, then you can reason that, overall, algorithm X is at least as fast as algorithm Y, but this considers only asymptotic complexity; see 3) and the sketch below.
3) Of course there are many other factors to consider. Suppose algorithm X performs better than Y only for very specific inputs, while on average and in the worst case both perform the same; then it is worth understanding the trade-offs between the two algorithms, such as space complexity and amortized complexity.
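As a concrete, purely hypothetical instance of that reasoning: take X to be an algorithm whose worst case is n log n and Y one whose best case is n log n but whose worst case is n^2. The sketch below just tabulates how the two worst-case costs grow:

```python
import math

def cost_x_worst(n):
    # Hypothetical algorithm X: worst case ~ n log n (same as Y's best case).
    return n * math.log2(n)

def cost_y_worst(n):
    # Hypothetical algorithm Y: best case ~ n log n, worst case ~ n^2.
    return n * n

for n in (1_000, 10_000, 100_000):
    ratio = cost_y_worst(n) / cost_x_worst(n)
    print(f"n={n:>7}  X worst={cost_x_worst(n):>14.0f}  Y worst={cost_y_worst(n):>14.0f}  ratio={ratio:,.0f}")
# The ratio keeps growing with n: X's worst case only matches Y's best case,
# so asymptotically X never loses to Y.
```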

Algorithm Analysis: In practice, do coefficients of higher order terms matter?

Consider an^2 + bn + c. I understand that for large n, bn and c become insignificant.
I also understand that for large n, the differences between 2n^2 and n^2 are pretty insignificant compared to the differences between, say n^2 and n*log(n).
However, there is still a factor-of-2 difference between 2n^2 and n^2. Does this matter in practice? Or do people just think about algorithms without coefficients? Why?
The actual coefficients matter if you're interested in timing. But big-O isn't actually about timing, it's about scalability. When you see an algorithm described as O(n^2), you don't really know how long it will take to solve a problem of size n on a particular computer in a particular language with a particular compiler, but you know that a problem of size 2n should take about 4 times as long.
The reason you can ignore the coefficients is that if you consider the ratio of different size problems, the lower order terms' coefficients are asymptotically dominated, and the highest order term's coefficients cancel in the ratio.
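A quick demonstration of the "a problem of size 2n takes about 4 times as long" point, using a deliberately quadratic routine (timings are illustrative and machine-dependent):

```python
import time

def quadratic_work(n):
    # A deliberately O(n^2) routine: count all ordered pairs.
    total = 0
    for i in range(n):
        for j in range(n):
            total += 1
    return total

for n in (1_000, 2_000, 4_000):
    start = time.perf_counter()
    quadratic_work(n)
    print(n, round(time.perf_counter() - start, 3), "s")
# Each doubling of n should take roughly 4x longer, regardless of the
# constant factor contributed by the machine, language, or interpreter.
```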
We use time complexity analysis to help us estimate the time cost and understand how far we can go. For example, the lower bound for comparison-based sorting is Ω(n lg n); this is proved in theory, and no comparison-based sorting algorithm can beat it.
As for the coefficient, in many cases it's not easy to derive an accurate number in theory, since it can be affected by the input data. But that doesn't mean it's unimportant. Quicksort is the most widely used sorting algorithm because its constant factor is very small: about 1.39 n lg n comparisons in the average case.
Another interesting fact about quicksort: we all know that its worst case costs O(n^2). We can use the median-of-medians algorithm to reduce quicksort's worst-case time complexity to O(n lg n), but we seldom use that version in practice, because the constant factor of the median-of-medians version is so large that it becomes impractical.
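To see the coefficient show up empirically, here is a rough sketch (simplified Python implementations, not production sorts) that counts comparisons for quicksort and mergesort on random input; the quicksort count should land near 1.39 n lg n, while mergesort stays near n lg n:

```python
import math
import random

def quicksort_comparisons(a):
    """Approximate comparison count of a simple randomized quicksort."""
    if len(a) <= 1:
        return 0
    pivot = random.choice(a)
    less = [x for x in a if x < pivot]
    greater = [x for x in a if x > pivot]
    # Roughly len(a) comparisons to partition, then recurse on both sides.
    return len(a) + quicksort_comparisons(less) + quicksort_comparisons(greater)

def mergesort_comparisons(a):
    """Sort a copy of a with mergesort; return (sorted list, comparison count)."""
    if len(a) <= 1:
        return list(a), 0
    mid = len(a) // 2
    left, cl = mergesort_comparisons(a[:mid])
    right, cr = mergesort_comparisons(a[mid:])
    merged, count, i, j = [], cl + cr, 0, 0
    while i < len(left) and j < len(right):
        count += 1
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    merged.extend(left[i:]); merged.extend(right[j:])
    return merged, count

n = 10_000
data = [random.random() for _ in range(n)]
_, merge_count = mergesort_comparisons(data)
quick_count = quicksort_comparisons(data)
print("n lg n    ", int(n * math.log2(n)))
print("mergesort ", merge_count)
print("quicksort ", quick_count)   # typically around 1.39 * n lg n
```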

How do you get various algorithm analysis factors in your code?

I am attempting to prepare a presentation to explain the basics of algorithm analysis to my co-workers - some of them have never had a lecture on the subject before, but everyone has at least a few years programming behind them and good math backgrounds, so I think I can teach this. I can explain the concepts fine, but I need concrete examples of some code structures or patterns that result in factors so I can demonstrate them.
Polynomial factors (n, n^2, n^3, etc.) are easy: nested loops over the same input. But I am getting lost on how to describe and show off some of the less common ones.
I would like to incorporate exponential (2^n or c^n), logarithmic (n log(n) or just log(n)) and factorial (n!) factors in the presentation. What are some short, teachable ways to get these in code?
A divide-and-conquer algorithm that does a constant amount of work for each time it divides the problem in half is O(log n). For example a binary search.
A divide-and-conquer algorithm that does a linear amount of work for each time it divides the problem in half is O(n * log n). For example a merge sort.
Exponential and factorial are probably best illustrated by iterating respectively over all subsets of a set, or all permutations of a set.
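For slides, two compact sketches in Python that exhibit those shapes: binary search doing constant work per halving (O(log n)), and merge sort doing linear merge work at each of its log n levels (O(n log n)):

```python
def binary_search(sorted_list, target):
    """O(log n): constant work each time the search range is halved."""
    lo, hi = 0, len(sorted_list) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_list[mid] == target:
            return mid
        elif sorted_list[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

def merge_sort(items):
    """O(n log n): linear merge work at each of the log n levels of recursion."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    left, right = merge_sort(items[:mid]), merge_sort(items[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]
```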
Exponential: naive Fibonacci implementation.
n log(n) or just log(n): sorting and binary search
Factorial: Naive traveling salesman solutions. Many naive solutions to NP-complete problems.
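Two more short, teachable sketches along those lines: a naive recursive Fibonacci (roughly O(2^n) calls) and brute-force enumeration of permutations (O(n!)), which is essentially what a naive travelling-salesman solver does. The distance dictionary below is a hypothetical input format:

```python
from itertools import permutations

def fib(n):
    """Naive recursion: the call tree roughly doubles at each level, so O(2^n) calls."""
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

def brute_force_tsp(distance, cities):
    """Try every ordering of the cities: O(n!) tours to evaluate.
    `distance` is assumed to be a dict mapping (a, b) pairs to a cost."""
    best_tour, best_cost = None, float("inf")
    for tour in permutations(cities):
        cost = sum(distance[tour[i], tour[i + 1]] for i in range(len(tour) - 1))
        if cost < best_cost:
            best_tour, best_cost = tour, cost
    return best_tour, best_cost
```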
n! examples are easy to come by: many naive solutions to hard problems, such as brute-forcing the travelling salesman problem, run in n! time.
If in doubt, pick one of the sorting algorithms: everyone knows what they're supposed to do, so they're easy to explain in relation to the complexity material, and Wikipedia has a quite good overview.

What is the Best Complexity of a Greedy Algorithm?

It seems like the best complexity would be linear O(n).
The specific case doesn't really matter; I'm speaking of greedy algorithms in general.
Sometimes it pays off to be greedy?
In the specific case that I am interested would be computing change.
Say you need to give 35 cents in change, and you have coins of 1, 5, 10, and 25. The greedy algorithm, coded simply, would solve this problem quickly and easily: first grab 25 cents, the highest value that goes into 35, and then 10 cents to complete the total. This would be the best case. Of course there are bad cases, and cases where this greedy algorithm would have issues; I'm asking about the best-case complexity for this type of problem.
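For reference, a minimal sketch of that greedy change-making routine (Python; the descending denominations are the ones from the question, and greedy happens to be optimal for them, though not for arbitrary coin systems):

```python
def greedy_change(amount, denominations=(25, 10, 5, 1)):
    """Repeatedly take the largest coin that still fits.
    With k denominations this is O(k + number of coins returned)."""
    coins = []
    for coin in denominations:          # denominations assumed sorted descending
        while amount >= coin:
            coins.append(coin)
            amount -= coin
    return coins

print(greedy_change(35))  # [25, 10]
```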
Any algorithm that has an output of n items that must be taken individually has at best O(n) time complexity; greedy algorithms are no exception. A more natural greedy version of e.g. a knapsack problem converts something that is NP-complete into something that is O(n^2)--you try all items, pick the one that leaves the least free space remaining; then try all the remaining ones, pick the best again; and so on. Each step is O(n). But the complexity can be anything--it depends on how hard it is to be greedy. (For example, a greedy clustering algorithm like hierarchical agglomerative clustering has individual steps that are O(n^2) to evaluate (at least naively) and requires O(n) of these steps.)
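A rough sketch of the O(n^2) greedy knapsack idea described above (the setup, with only item sizes and a capacity, is a simplification for illustration):

```python
def greedy_knapsack(capacity, item_sizes):
    """Repeatedly take the item that leaves the least free space remaining.
    Each pass over the remaining items is O(n), and there are up to n passes: O(n^2)."""
    remaining = list(item_sizes)
    chosen = []
    while remaining:
        fitting = [s for s in remaining if s <= capacity]
        if not fitting:
            break
        best = max(fitting)             # the item that leaves the least free space
        chosen.append(best)
        remaining.remove(best)
        capacity -= best
    return chosen

print(greedy_knapsack(10, [8, 5, 4, 3, 1]))  # e.g. [8, 1]
```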
When you're talking about greedy algorithms, typically you're talking about the correctness of the algorithm rather than the time complexity, especially for problems such as change making.
Greedy heuristics are used because they're simple. This means easy implementations for easy problems, and reasonable approximations for hard problems. In the latter case you'll find time complexities that are better than guaranteed correct algorithms. In the former case, you can't hope for better than optimal time complexity.
GREEDY APPROACH (knapsack / job-sequencing style problems):
1) Sort the given elements using merge sort: O(n log n).
2) Find the maximum deadline: O(n).
3) Select elements one by one using linear search: O(n²).
Total: n log n + n + n² = O(n²) in the worst case.
Now, can we apply binary search instead of linear search?
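Reading the steps above as the classic job-sequencing-with-deadlines problem, a minimal sketch of the greedy approach with the O(n²) linear slot search might look like this (input format assumed):

```python
def job_sequencing(jobs):
    """jobs: list of (profit, deadline) pairs, deadlines being positive integers.
    Greedy: take jobs in decreasing profit order and place each in the
    latest free time slot at or before its deadline."""
    jobs = sorted(jobs, key=lambda j: j[0], reverse=True)   # O(n log n)
    max_deadline = max(d for _, d in jobs)                  # O(n)
    slots = [None] * (max_deadline + 1)                     # slots 1..max_deadline
    total_profit = 0
    for profit, deadline in jobs:                           # O(n) jobs ...
        for t in range(deadline, 0, -1):                    # ... each with an O(n) linear slot search
            if slots[t] is None:
                slots[t] = (profit, deadline)
                total_profit += profit
                break
    return total_profit, slots
    # Note: the usual way to speed up the slot search is a disjoint-set
    # (union-find) structure rather than binary search.
```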
Greedy or not has essentially nothing to do with computational complexity, other than the fact that greedy algorithms tend to be simpler than other algorithms to solve the same problem, and hence they tend to have lower complexity.

Should we use k-means++ instead of k-means?

The k-means++ algorithm helps in two following points of the original k-means algorithm:
The original k-means algorithm has a worst-case running time that is super-polynomial in the input size, while k-means++ is claimed to be O(log k).
The approximation found can yield an unsatisfactory result with respect to the objective function, compared to the optimal clustering.
But are there any drawbacks to k-means++? Should we always use it instead of k-means from now on?
Nobody claims k-means++ runs in O(lg k) time; its solution quality is O(lg k)-competitive with the optimal solution. Both k-means++ and the common method, called Lloyd's algorithm, are approximations to an NP-hard optimization problem.
I'm not sure what the worst case running time of k-means++ is; note that in Arthur & Vassilvitskii's original description, steps 2-4 of the algorithm refer to Lloyd's algorithm. They do claim that it works both better and faster in practice because it starts from a better position.
The drawbacks of k-means++ are thus:
It too can find a suboptimal solution (it's still an approximation).
It's not consistently faster than Lloyd's algorithm (see Arthur & Vassilvitskii's tables).
It's more complicated than Lloyd's algo.
It's relatively new, while Lloyd's has proven its worth for over 50 years.
Better algorithms may exist for specific metric spaces.
That said, if your k-means library supports k-means++, then by all means try it out.
Not your question, but an easy speedup to any kmeans method for large N:
1) first do k-means on a random sample of say sqrt(N) of the points
2) then run full k-means from those centres.
I've found this 5-10 times faster than kmeans++ for N ~ 10000, k ~ 20, with similar results.
How well it works for you will depend on how well a sqrt(N) sample approximates the whole, as well as on N, dim, k, ninit, delta ...
What are your N (number of data points), dim (number of features), and k? The huge range in users' N, dim, k, data noise, metrics ... not to mention the lack of public benchmarks, makes it tough to compare methods.
Added: Python code for kmeans() and kmeanssample() is here on SO; comments are welcome.
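For illustration, a minimal sketch of that sample-then-refine idea (this is not the linked SO code; it assumes scikit-learn's KMeans is available and uses hypothetical names):

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_sample_then_refine(X, k, random_state=0):
    """1) run k-means on a random sample of ~sqrt(N) points,
       2) run full k-means on all points, seeded with those centres."""
    n = len(X)
    sample_size = max(k, int(np.sqrt(n)))
    rng = np.random.default_rng(random_state)
    sample = X[rng.choice(n, size=sample_size, replace=False)]

    # Stage 1: cheap clustering of the sample.
    km_sample = KMeans(n_clusters=k, n_init=1, random_state=random_state).fit(sample)

    # Stage 2: full k-means, initialised from the sample's centres.
    km_full = KMeans(n_clusters=k, init=km_sample.cluster_centers_,
                     n_init=1, random_state=random_state).fit(X)
    return km_full
```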

Resources