Quickhull - all points on convex hull - bad performance

How to avoid bad performance in the quickhull algorithm when all input points lie on the convex hull?

QuickHull's performance comes mainly from being able to throw away a portion of the input with each recursive call (or iteration). This does not happen when all the points lie on a circle, unfortunately. Even in this case, it's still possible to obtain O(n log n) worst-case performance if the split step gives a fairly balanced partition at every recursive call. The ultimate worst case, which results in quadratic runtime, is when we have grossly imbalanced splits (say one split always ends up empty) in each call. Because this pretty much depends on the dataset, there's not much one can do about it.
You might want to try other algorithms instead, such as Andrew's variant of Graham's scan, or MergeHull. Both have guaranteed O(n log n) worst-case time complexity.
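For reference, a minimal sketch of Andrew's monotone chain variant (a textbook version, assuming the input is a list of (x, y) tuples; not tuned for production use):

```python
def cross(o, a, b):
    """Cross product of vectors OA and OB; > 0 means a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain: O(n log n) no matter how the points are placed."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:                                  # lower hull, left to right
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):                        # upper hull, right to left
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]                 # endpoints appear in both halves
```

Sorting dominates the running time, so the O(n log n) bound holds even when every point ends up on the hull.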

For a comparison of algorithm implementations, I suggest looking at my article Fast and improved 2D Convex Hull algorithm and its implementation in O(n log h), which compares the performance of many 2D algorithms, such as:
Monotone chain
Graham scan
Delaunay/Voronoi
Chan
Liu and Chen
Ouellet (mine)

Related

Meaning of O(log k)-competitive complexity

I am working on an existing algorithm to improve its complexity. The existing algorithm uses K-means to perform clustering, whereas I chose to use K-means++ to do the same.
K-means++ was chosen because it mostly gives faster and more accurate clustering results compared to k-means.
Now, towards the end, where I have to compare the complexity of the new and existing algorithms, I find that I can't make sense of the fact that k-means++ has a complexity of O(log k)-competitive.
I have tried looking everywhere on the web for an explanation, including stack overflow.
The only thing I have understood is that competitive has something to do with "on-line" and "off-line" algorithms. Could anyone please explain how it applies here?
The full sentence that you are reading says something like "The k-means++ clustering is O(log k)-competitive to the optimal k-means solution".
This is not a statement about its algorithmic complexity. It's a statement about its effectiveness. You can use O-notation for other things.
K-means attempts to minimize a "potential" that is calculated as the sum of the squared distances of points from their cluster centers.
For any specific clustering problem, the expected potential of a K-means++ solution is at most 8(ln k + 2) times the potential of the best possible solution. That 8(ln k + 2) is shortened to O(log k) for brevity.
The precise meaning of the statement that the k-means++ solution is O(log k)-competitive is that there is some constant C such that the expected ratio between the k-means++ potential and the best possible potential is less than C*(log k) for all sufficiently large k (the smallest such constant is about 8).
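Written out (using the potential defined above and the bound from Arthur & Vassilvitskii's k-means++ paper, with X the data set and C the chosen centers):

```latex
\phi(C) \;=\; \sum_{x \in X} \min_{c \in C} \lVert x - c \rVert^{2},
\qquad
\mathbb{E}\bigl[\phi(C_{\text{k-means++}})\bigr]
\;\le\; 8(\ln k + 2)\,\phi(C_{\text{OPT}})
\;=\; O(\log k)\cdot\phi(C_{\text{OPT}}).
```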

Dynamic convex hull trick

I was reading about interesting algorithms in my free time and I just discovered the convex hull trick algorithm, with which we can compute the maximum of several lines in the plane at a given x coordinate. I found this article:
http://wcipeg.com/wiki/Convex_hull_trick
Here the author says that the dynamic version of this algorithm runs in logarithmic time, but there is no proof. When we insert a line, we test some of its neighbors, but I don't understand how it can be O(log N) when a single insert can end up testing all N lines. Is it correct, or did I miss something?
UPDATE: this question has already been answered; the interesting ones are the remaining questions below.
How can we delete?
I mean... if we delete a line, we will probably need the previously discarded ones to rebuild the whole hull, but that algorithm deletes all the unnecessary lines when inserting a new one.
Is there another approach to solve problems like the one above (or similar ones, for example handling queries such as insert, delete, and find the maximum at an x point or over a given range)?
Thank you in advance!
To answer your first question: "How can insertion be O(log n)?", you can indeed end up checking O(n) neighbors, but note that you only need to check an extra neighbor when you discover that you need to do a delete operation.
The point is that if you are going to insert n new lines, then you can do the delete operation at most n times. Therefore the total amount of extra work is at most O(n), in addition to the O(log n) work per line that you need to find its position in the sorted data structure.
So the total effort for inserting all n lines is O(n) + O(n log n) = O(n log n), or in other words, amortized O(log n) per line.
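To make the argument concrete, here is a hedged sketch in Python of such a structure for maximum queries. A plain sorted list stands in for the balanced BST, so the list insertion itself is O(n) rather than O(log n); the point is the neighbor checks, each of which either stops the loop or deletes a line, which is exactly where the amortized bound comes from. The class name UpperEnvelope and the _bad test are my own naming, not taken from the linked article.

```python
import bisect

class UpperEnvelope:
    """Lines y = m*x + b kept sorted by slope, supporting insert and max-at-x queries."""

    def __init__(self):
        self.lines = []  # list of (m, b), sorted by slope m

    @staticmethod
    def _bad(l1, l2, l3):
        # With slopes m1 < m2 < m3, the middle line never strictly beats both others.
        (m1, b1), (m2, b2), (m3, b3) = l1, l2, l3
        return (b3 - b1) * (m1 - m2) >= (b2 - b1) * (m1 - m3)

    def insert(self, m, b):
        i = bisect.bisect_left(self.lines, (m, b))
        if i < len(self.lines) and self.lines[i][0] == m:
            return                      # same slope, intercept at least b: nothing to do
        if i > 0 and self.lines[i - 1][0] == m:
            self.lines.pop(i - 1)       # same slope, smaller intercept: replace it
            i -= 1
        self.lines.insert(i, (m, b))
        # If the new line is dominated by its two neighbours, drop it and stop.
        if 0 < i < len(self.lines) - 1 and self._bad(
                self.lines[i - 1], self.lines[i], self.lines[i + 1]):
            self.lines.pop(i)
            return
        # Each iteration below either stops or deletes one line: the amortized argument.
        while i + 2 < len(self.lines) and self._bad(
                self.lines[i], self.lines[i + 1], self.lines[i + 2]):
            self.lines.pop(i + 1)
        while i >= 2 and self._bad(
                self.lines[i - 2], self.lines[i - 1], self.lines[i]):
            self.lines.pop(i - 1)
            i -= 1

    def query(self, x):
        """Maximum of all inserted lines at x (assumes at least one line was inserted)."""
        lo, hi = 0, len(self.lines) - 1
        while lo < hi:                  # the values along the envelope are unimodal in x
            mid = (lo + hi) // 2
            if (self.lines[mid][0] * x + self.lines[mid][1]
                    < self.lines[mid + 1][0] * x + self.lines[mid + 1][1]):
                lo = mid + 1
            else:
                hi = mid
        m, b = self.lines[lo]
        return m * x + b
```

Replacing the plain list with a balanced search tree gives the claimed O(log n) amortized insertion.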
The article claims amortized (not worst-case) O(log N) time per insertion. The amortized bound is easy to prove (each line is removed at most once, and each check is either the last one or leads to the deletion of one line).
The article does not say that this data structure supports deletions at all. I'm not sure if it is possible to handle them efficiently. There is an obstacle: the time complexity analysis is based on the fact that if we remove a line, we will never need it in the future, which is not the case when deletions are allowed.
Insertion can be quicker than O(log n): it can be achieved in O(log h), where h is the number of points already on the hull. Insertion in batch or one by one can be done in O(log h) per point. You can read my articles about that:
A Convex Hull Algorithm and its implementation in O(n log h)
Fast and improved 2D Convex Hull algorithm and its implementation in O(n log h)
First and Extremely fast Online 2D Convex Hull Algorithm in O(Log h) per point
About deletion: I'm pretty sure, though it has to be proven, that it can be achieved in O(log n + log h) = O(log n) per point. A text about it is available at the end of my third article.

Why do divide and conquer algorithms often run faster than brute force?

Why do divide and conquer algorithms often run faster than brute force? For example, to find the closest pair of points. I know you can show me the mathematical proof. But intuitively, why does this happen? Magic?
Theoretically, is it true that "divide and conquer is always better than brute force"? If it is not, is there any counterexample?
For your first question, the intuition behind divide-and-conquer is that in many problems the amount of work that has to be done is based on some combinatorial property of the input that scales more than linearly.
For example, in the closest pair of points problem, the runtime of the brute-force answer is determined by the fact that you have to look at all O(n^2) possible pairs of points.
If you take something that grows quadratically and cut it into two pieces, each half the size of the original, it takes one quarter of the initial time to solve the problem in each half, so solving the problem in both halves takes roughly half the time required for the brute-force solution. Cutting it into four pieces would take one fourth of the time, cutting it into eight pieces one eighth the time, etc.
The recursive version ends up being faster in this case because at each step we avoid a lot of the work of dealing with pairs of elements, by ensuring that there aren't too many pairs that we actually need to check. Most algorithms that have a divide and conquer solution end up being faster for a similar reason.
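As a rough sketch of the numbers for closest pair (the O(n) term is the linear-time combine step across the dividing line, assuming the points are pre-sorted):

```latex
\text{brute force: } \binom{n}{2} = \frac{n(n-1)}{2} = \Theta(n^{2}),
\qquad
T(n) = 2\,T(n/2) + O(n) \;\Rightarrow\; T(n) = O(n \log n).
```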
For your second question, no, divide and conquer algorithms are not necessarily faster than brute-force algorithms. Consider the problem of finding the maximum value in an array. The brute-force algorithm takes O(n) time and uses O(1) space as it does a linear scan over the data. The divide-and-conquer algorithm is given here:
If the array has just one element, that's the maximum.
Otherwise:
Cut the array in half.
Find the maximum in each half.
Compute the maximum of these two values.
This takes time O(n) as well, but uses O(log n) memory for the stack space. It's actually worse than the simple linear algorithm.
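A minimal sketch of both versions, to make the comparison concrete (the function names are mine, chosen for illustration):

```python
def max_brute_force(arr):
    """Linear scan: O(n) time, O(1) extra space."""
    best = arr[0]
    for x in arr[1:]:
        if x > best:
            best = x
    return best

def max_divide_and_conquer(arr, lo=0, hi=None):
    """Recursive halving: still O(n) time, but O(log n) stack space."""
    if hi is None:
        hi = len(arr) - 1
    if lo == hi:                       # one element: that's the maximum
        return arr[lo]
    mid = (lo + hi) // 2
    left = max_divide_and_conquer(arr, lo, mid)
    right = max_divide_and_conquer(arr, mid + 1, hi)
    return left if left >= right else right
```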
As another example, the maximum single-sell profit problem has a divide-and-conquer solution, but the optimized dynamic programming solution is faster in both time and memory.
Hope this helps!
I recommend you read through chapter 5 of Algorithm Design; it explains divide-and-conquer very well.
Intuitively, for a problem, if you can divide it into two sub-problems with the same pattern as the original one, and the time complexity of merging the results of the two sub-problems into the final result is small, then it's faster than solving the original complete problem by brute force.
As said in Algorithm Design, you actually cannot gain too much from divide-and-conquer in terms of time; generally you can only reduce the time complexity from a higher polynomial to a lower one (e.g. from O(n^3) to O(n^2)), but hardly ever from exponential to polynomial (e.g. from O(2^n) to O(n^3)).
I think the most you can gain from divide-and-conquer is the mindset for problem solving. It's always a good attempt to break the original big problem down to smaller and easier sub-problems. Even if you don't get a better running time, it still helps you think through the problem.

Should we use k-means++ instead of k-means?

The k-means++ algorithm helps in two following points of the original k-means algorithm:
The original k-means algorithm has a worst-case running time that is super-polynomial in the input size, while k-means++ is claimed to be O(log k).
The approximation found can yield a not-so-satisfactory result with respect to the objective function, compared to the optimal clustering.
But are there any drawbacks of k-means++? Should we always use it instead of k-means from now on?
Nobody claims k-means++ runs in O(lg k) time; its solution quality is O(lg k)-competitive with the optimal solution. Both k-means++ and the common method, called Lloyd's algorithm, are approximations to an NP-hard optimization problem.
I'm not sure what the worst case running time of k-means++ is; note that in Arthur & Vassilvitskii's original description, steps 2-4 of the algorithm refer to Lloyd's algorithm. They do claim that it works both better and faster in practice because it starts from a better position.
The drawbacks of k-means++ are thus:
It too can find a suboptimal solution (it's still an approximation).
It's not consistently faster than Lloyd's algorithm (see Arthur & Vassilvitskii's tables).
It's more complicated than Lloyd's algo.
It's relatively new, while Lloyd's has proven its worth for over 50 years.
Better algorithms may exist for specific metric spaces.
That said, if your k-means library supports k-means++, then by all means try it out.
Not your question, but an easy speedup to any kmeans method for large N:
1) first do k-means on a random sample of say sqrt(N) of the points
2) then run full k-means from those centres.
I've found this 5-10 times faster than kmeans++ for N around 10000 and k around 20, with similar results.
How well it works for you will depend on how well a sqrt(N) sample approximates the whole, as well as on N, dim, k, ninit, delta ...
What are your N (number of data points), dim (number of features), and k ?
The huge range in users' N, dim, k, data noise, metrics ...
not to mention the lack of public benchmarks, make it tough to compare methods.
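A minimal sketch of those two steps, assuming scikit-learn's KMeans (the function name two_stage_kmeans is just for illustration; the answerer's own implementation is the SO link mentioned below):

```python
import numpy as np
from sklearn.cluster import KMeans

def two_stage_kmeans(X, k, seed=None):
    """1) k-means on a random sample of ~sqrt(N) points, 2) full k-means from those centres."""
    X = np.asarray(X)
    rng = np.random.default_rng(seed)
    n = len(X)
    sample_size = max(k, int(np.sqrt(n)))
    sample = X[rng.choice(n, size=sample_size, replace=False)]
    # Stage 1: cluster the sample (scikit-learn's default init is k-means++).
    seed_centers = KMeans(n_clusters=k, n_init=10).fit(sample).cluster_centers_
    # Stage 2: run k-means on all points, started from the sample's centres.
    return KMeans(n_clusters=k, init=seed_centers, n_init=1).fit(X)
```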
Added: Python code for kmeans() and kmeanssample() is
here on SO; comments are welcome.

Algorithms for Big O Analysis

Which algorithms do you find have amazing (tough, strange) complexity analyses, in terms of both the resulting O notation and the uniqueness of the way they are analyzed?
I have (quite) a few examples:
The union-find data structure, which supports operations in (amortized) inverse Ackermann time. It's particularly nice because the data structure is incredibly easy to code (see the sketch after this list).
Splay trees, which are self-balancing binary trees (that is, no extra information is stored other than the BST -- no red/black information). Amortized analysis was essentially invented to prove bounds for splay trees; splay trees run in amortized logarithmic time, but worst-case linear time. The proofs are cool.
Fibonacci heaps, which perform most of the priority queue operations in amortized constant time, thus improving the runtime of Dijkstra's algorithm and other problems. As with splay trees, there are slick "potential function" proofs.
Bernard Chazelle's algorithm for computing minimum spanning trees in linear times inverse Ackermann time. The algorithm uses soft heaps, a variant of the traditional priority queue, except that some "corruption" might occur and queries might not be answered correctly.
While on the topic of MSTs: an optimal algorithm has been given by Pettie and Ramachandran, but we don't know the running time!
Lots of randomized algorithms have interesting analyses. I'll only mention one example: Delaunay triangulation can be computed in expected O(n log n) time by incrementally adding points; the analysis is apparently intricate, though I haven't seen it.
Algorithms that use "bit tricks" can be neat, e.g. sorting in O(n log log n) time (and linear space) -- that's right, it breaks the O(n log n) barrier by using more than just comparisons.
Cache-oblivious algorithms often have interesting analyses. For example, cache-oblivious priority queues (see page 3) use log log n levels of sizes n, n^(2/3), n^(4/9), and so on.
(Static) range-minimum queries on arrays are neat. The standard proof tests your limits with respect to reduction: range-minimum queries are reduced to lowest common ancestor in trees, which is in turn reduced to range-minimum queries on a specific kind of array. The final step uses a cute trick, too.
Ackermann's function.
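Since the union-find item above calls it incredibly easy to code, here is a minimal sketch (union by rank plus path halving, a simple form of path compression):

```python
class UnionFind:
    """Disjoint-set forest with union by rank and path halving.

    Any sequence of m find/union operations on n elements runs in
    O(m * alpha(n)), where alpha is the inverse Ackermann function.
    """

    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]   # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra            # attach the shallower tree under the deeper one
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1
        return True
```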
This one is kinda simple but Comb Sort blows my mind a little.
http://en.wikipedia.org/wiki/Comb_sort
It is such a simple algorithm that for the most part it reads like an overly complicated bubble sort, yet it is O(n log n). I find that mildly impressive.
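For reference, a minimal comb sort sketch (the 1.3 shrink factor is the commonly quoted choice):

```python
def comb_sort(a):
    """In-place comb sort: bubble-sort passes with a gap that shrinks by ~1.3 each pass."""
    n = len(a)
    gap, swapped = n, True
    while gap > 1 or swapped:
        gap = max(1, int(gap / 1.3))
        swapped = False
        for i in range(n - gap):
            if a[i] > a[i + gap]:
                a[i], a[i + gap] = a[i + gap], a[i]
                swapped = True
    return a
```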
The plethora of Algorithms for Fast Fourier Transforms are impressive too, the math that proves their validity is trippy and it was fun to try to prove a few on my own.
http://en.wikipedia.org/wiki/Fast_Fourier_transform
I can fairly easily understand the prime radix, multiple prime radix, and mixed radix algorithms, but one that works on sets whose size is prime is quite cool.
2D ordered search analysis is quite interesting. You've got a 2-dimensional numeric array of numbers NxN where each row is sorted left-right and each column is sorted top-down. The task is to find a particular number in the array.
The recursive algorithm: pick the element in the middle, compare it with the target number, discard a quarter of the array (depending on the result of the comparison), and apply the algorithm recursively to the remaining 3 quarters. It is quite interesting to analyze.
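A hedged sketch of that recursion (the helper names are mine; it returns the position of the target or None):

```python
def search_2d(a, target):
    """Search a matrix whose rows and columns are both sorted ascending."""
    def go(r1, r2, c1, c2):
        if r1 > r2 or c1 > c2:
            return None
        rm, cm = (r1 + r2) // 2, (c1 + c2) // 2
        v = a[rm][cm]
        if v == target:
            return (rm, cm)
        if v < target:
            # The top-left quadrant (rows r1..rm, cols c1..cm) is entirely <= v < target,
            # so only the other three quadrants remain.
            return (go(r1, rm, cm + 1, c2)
                    or go(rm + 1, r2, c1, cm)
                    or go(rm + 1, r2, cm + 1, c2))
        # v > target: the bottom-right quadrant (rows rm..r2, cols cm..c2) is entirely > target.
        return (go(r1, rm - 1, c1, cm - 1)
                or go(r1, rm - 1, cm, c2)
                or go(rm, r2, c1, cm - 1))
    if not a or not a[0]:
        return None
    return go(0, len(a) - 1, 0, len(a[0]) - 1)
```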
Non-deterministic polynomial (NP) complexity gets my vote, especially with the (admittedly considered unlikely) possibility that it may turn out to be the same as polynomial. In the same vein, anything that can theoretically benefit from quantum computing (N.B. this set is by no means all algorithms).
The other that would get my vote would be common mathematical operations on arbitrary-precision numbers -- this is where you have to consider things like the fact that multiplying big numbers is more expensive than multiplying small ones. There is quite a lot of analysis of this in Knuth (which shouldn't be news to anyone). Karatsuba's method is pretty neat: split the two factors in half by digits, (A1;A2) and (B1;B2), compute A1·B1, A2·B2 and (A1+A2)·(B1+B2), then combine the results -- the third product minus the first two gives the middle term A1·B2 + A2·B1, so only three multiplications are needed instead of four. Recurse if desired...
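A minimal sketch of Karatsuba on Python integers, splitting on bits rather than decimal digits (the idea is the same):

```python
def karatsuba(x, y):
    """Multiply non-negative integers with 3 recursive products instead of 4."""
    if x < 10 or y < 10:
        return x * y
    half = max(x.bit_length(), y.bit_length()) // 2
    a1, a2 = x >> half, x & ((1 << half) - 1)     # x = a1 * 2^half + a2
    b1, b2 = y >> half, y & ((1 << half) - 1)     # y = b1 * 2^half + b2
    hi = karatsuba(a1, b1)
    lo = karatsuba(a2, b2)
    mid = karatsuba(a1 + a2, b1 + b2) - hi - lo   # equals a1*b2 + a2*b1
    return (hi << (2 * half)) + (mid << half) + lo
```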
Shell sort. There are tons of variants with various increments, most of which have no benefits except to make the complexity analysis simpler.

Resources