most efficient way to select boundary points on a cluster - curve-fitting

I'm playing with an algorithm which generates a sequence of integers as shown in the diagram. What I want to calculate is the polynomial (or other) curve fit to the upper boundary.
I know I can use a convex hull algo to select the boundary points, then reject all points along the bottom and right-hand margin, do some smoothing of the remainder, etc.
Alternatively, I could sample the dataset in order and extract only the values which set a new maximum as I progress. I'm pretty sure there are no "outlier" values that would grossly shift the curve with this approach.
So, my question really boils down to: is there any real advantage to going with the more time-consuming convex-hull approach?
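The "new maximum" sampling described above can be sketched as follows. This is a minimal illustration, assuming the data is a plain sequence of numbers indexed by position; the function name running_maxima is made up for this example.

```python
def running_maxima(values):
    """Return (index, value) pairs where value exceeds every earlier value."""
    records = []
    best = None
    for i, v in enumerate(values):
        # Keep the point only if it sets a new record.
        if best is None or v > best:
            best = v
            records.append((i, v))
    return records


if __name__ == "__main__":
    print(running_maxima([3, 1, 4, 1, 5, 9, 2, 6]))
```

This is a single O(n) pass. The result is a non-decreasing staircase, so it only captures the boundary where the envelope is rising.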


Use of Hilbert Curve to query a rectangular area and see if it overlaps other rectangles

I am looking for a method that can help me in a project I am working on. The idea is that there are some rectangles in 2D space, and I want to query a rectangular area and see if it overlaps any rectangle in that area. If it does, the query fails. If it doesn't, meaning the space is empty, it succeeds.
I was linked to z-order curves to help turn 2d coordinates into 1d. While I was reading about it, I encountered the Hilbert curve. I read that the Hilbert curve is preferred over a z-order curve because it maintains better proximity of points. I also read that the Hilbert curve is used to make more efficient quadtrees and octrees.
I was reading this article for a possible solution but I don't know if this applies to my case.
I also saw this comment which mentioned multiple index entries for non point objects.
Is there an elegant method where I can use the Hilbert curve to achieve this? Is it possible with just an array of rectangles?
I am pretty sure it is possible and can be very efficient. But there is a lot of complexity involved.
I have implemented this for a z-order curve and called it PH-Tree. You can find implementations in Java and C++, as well as theoretical background, in the link. The PH-Tree is similar to a z-ordered quadtree but with several modifications that allow taking full advantage of the space filling curve.
There are a few things to unpack here.
Hilbert curves vs z-order curves: Yes, Hilbert curves have slightly better proximity than z-order curves. Better proximity does two things: 1) it reduces the number of elements (nodes, data) to look at and 2) it improves linear access patterns (less hopping around). Both are important if the spatial index is stored on disk and I/O is expensive (especially on old disk drives).
If I remember correctly, Hilbert curves and z-order curves are similar enough that they always access the same amount of data. The only thing that Hilbert curves are better at is linear access.
However, Hilbert curves are much more complicated to calculate. In the tests I made with an in-memory index (not very thorough testing, I admit) I found that z-order curves are considerably more efficient because the reduced CPU time outweighs the cost of accessing data slightly out of order.
Space filling curves (at least Hilbert and z-curve) allow for some neat bit-level hacks that can speed up index operations. However, even for z-ordering I found that getting these right required a lot of thinking (I wrote a whole paper about it). I believe these operations can be adapted for Hilbert curves but it may take some time and, as I wrote above, it may not improve performance at all.
Storing rectangles in a spatial curve index:
There are different approaches to encoding rectangles in a spatial curve. All approaches that I am aware of encode the rectangle as a multi-dimensional point. I found the following encoding to work best (assuming axis-aligned rectangles). We define the rectangle by the lower left (minimum) corner and the upper right (maximum) corner, e.g. min={min0, min1}={2,3}/max={max0, max1}={8,4}. We transform this (by interleaving the min/max values) into a 4-dimensional point {2,8,3,4} and store this point in the index. Other approaches use a different ordering (2,3,8,4) or, instead of two corners, store the center point and the lengths of the edges.
If you want to find any rectangles that overlap/intersect with a given region (we call that a window query) we need to create a 4-dimensional query box, i.e. an axis aligned box that is defined by a 4D min-point and 4D max-point (copied from here):
min = {−∞, min_0, −∞, min_1} = {−∞, 2, −∞, 3}
max = {max_0, +∞, max_1, +∞} = {8, +∞, 4, +∞}
We can process the dimensions one by one. The first min/max pair is {−∞, 8}, which will match any 2D rectangle whose minimum x-coordinate is 8 or lower. All coordinates:
d=0: min/max pair is {−∞, 8}: matches any 2D min-x <= 8
d=1: min/max pair is {2, +∞}: matches any 2D max-x >= 2
d=2: min/max pair is {−∞, 4}: matches any 2D min-y <= 4
d=3: min/max pair is {3, +∞}: matches any 2D max-y >= 3
If all these conditions hold true, then the stored rectangle overlaps with the query window.
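The encoding and the window query box can be illustrated with a short sketch. It is not the PH-Tree implementation: it checks the 4D box against a plain list of rectangles, and all function names are made up for this example. A real index would apply the same box test while traversing its nodes. Rectangles that merely touch the query window count as overlapping here.

```python
import math


def encode_rect(min_pt, max_pt):
    """Interleave the corners: (min0, min1), (max0, max1) -> (min0, max0, min1, max1)."""
    return (min_pt[0], max_pt[0], min_pt[1], max_pt[1])


def window_query_box(q_min, q_max):
    """Build the 4D query box for a 2D query window."""
    lo = (-math.inf, q_min[0], -math.inf, q_min[1])
    hi = (q_max[0], math.inf, q_max[1], math.inf)
    return lo, hi


def point_in_box(point, lo, hi):
    """True if the 4D point lies inside the axis-aligned 4D box."""
    return all(l <= p <= h for l, p, h in zip(lo, point, hi))


def any_overlap(rects, q_min, q_max):
    """Linear scan stand-in for an index lookup; stops at the first match."""
    lo, hi = window_query_box(q_min, q_max)
    return any(point_in_box(encode_rect(a, b), lo, hi) for a, b in rects)


if __name__ == "__main__":
    stored = [((2, 3), (8, 4))]
    print(any_overlap(stored, (0, 0), (1, 1)))
    print(any_overlap(stored, (7, 3), (10, 10)))
```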
Final words:
This sounds complicated but can be implemented very efficiently (it also lends itself to vectorization). I found that it is on par with other indexes (quadtree, R-Star-Tree), see some benchmarks I made here.
Specifically, I found that the z-ordered indexes have very good insertion/update/removal times (I don't know whether that matters for you) and are very good for small query result sizes (it sounds like you often expect the result to be empty, i.e. no overlapping rectangle found). They generally work well with large datasets (1000s or millions of entries) and datasets that have strong clusters.
For smaller datasets, or if you expect to typically find many results (you can of course abort a query early once you find the first match!), other index types may be better.
On a last note, I found the dataset characteristics to have considerable influence on which index worked best. Also, implementation appears to be at least as important as the underlying algorithms. Especially for quadtrees I found huge variations in performance from different implementations.

Efficient sorting of integer vertices of a convex polygon

I am given as input n pairs of integers which describe points in the 2D plane that are known ahead of time to be vertices of some convex polygon.
I'd like to efficiently sort these points in a clockwise or counter-clockwise fashion.
At first I thought of doing something like the initial step of Graham Scan, but I can't see a simple way to break ties for vertices that make the same angle with the anchor point.
Notice that, as you walk along the sides of the polygon, sometimes these vertices may be getting closer to the anchor point, and sometimes they may be getting farther.
Something that does seem to work is producing a point in the interior of the polygon (for instance, the average of the n points) and using it as an anchor point for radial sorting of the input.
Indeed, because the anchor point lies in the interior, any ray emanating from it contains at most one input point, so there will be no ties.
The overall complexity is not affected: computing the midpoint is an O(n) task, and the bottleneck is the actual sorting.
This does involve a few more operations than the hopeful version of Graham Scan (where we assume there are no ties to be broken), but the bigger loss is leaving integer arithmetic behind by introducing division into the mix.
This in turn can be remedied with scaling everything by a factor of n, but at this point it seems like grasping at straws.
Am I missing something?
Is there a simpler, efficient way to solve this sorting problem, preferably one that can avoid floating point calculations?
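For reference, the scaled-centroid idea described above can be written with integer arithmetic only. The sketch below assumes n >= 3 and that the points are not all collinear, so the centroid is strictly interior. The function name is made up for this example. Angles are compared with a half-plane test plus a cross product instead of atan2.

```python
from functools import cmp_to_key


def sort_convex_vertices(points):
    """Return the vertices in counter-clockwise order around their centroid."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)

    def rel(p):
        # Vector from the anchor to p, with everything scaled by n to stay integral.
        return (n * p[0] - sx, n * p[1] - sy)

    def half(v):
        # 0 for directions with angle in [0, pi), 1 for [pi, 2*pi).
        return 0 if (v[1] > 0 or (v[1] == 0 and v[0] > 0)) else 1

    def compare(p, q):
        a, b = rel(p), rel(q)
        if half(a) != half(b):
            return half(a) - half(b)
        cross = a[0] * b[1] - a[1] * b[0]
        # cross > 0 means b is counter-clockwise from a, so a comes first.
        return -1 if cross > 0 else (1 if cross < 0 else 0)

    return sorted(points, key=cmp_to_key(compare))


if __name__ == "__main__":
    print(sort_convex_vertices([(1, 1), (0, 0), (1, 0), (0, 1)]))
```

Python integers do not overflow. In a language with fixed-width integers the cross product grows by a factor of about n^2 compared to the input coordinates, which has to fit into the chosen type.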

Algorithm for multiple polyline and polygon decimation

We have some polylines (list of points, has start and end point, not cyclic) and polygons (list of points, cyclic, no such thing as endpoints).
We want to map each polyline to a new polyline and each polygon to a new polygon so the total number of edges is small enough.
Let's say the number of edges originally is N, and we want our result to have M edges. N is much larger than M.
Polylines need to keep their start and end points, so they contribute at least 1 edge, one less than their vertex count. Polygons need to still be polygons, so they contribute at least 3 edges, equal to their vertex count. M will be at least large enough for this requirement.
The outputs should be as close as possible to the inputs. This would end up being an optimization problem of minimizing some metric to within some small tolerance of the true optimal solution. Originally I'd have used the area of the symmetric difference of the original and result (area between), but if another metric makes this easier to do I'll gladly take that instead.
It's okay if the results only include vertices in the original, the fit will be a little worse but it might be necessary to keep the time complexity down.
Since I'm asking for an algorithm, it'd be nice to also see an implementation. I'll likely have to re-implement it for where I'll be using it anyway, so details like what language or what data structures won't matter too much.
As for how good the approximation needs to be, about what you'd expect from getting a vector image from a bitmap image. The actual use here is for a tool for a game though, there's some strange details for the specific game, that's why the output edge count is fixed rather than the tolerance.
It's pretty hard to find any information on this kind of thing, so without even providing a full workable algorithm, just some pointers would be very much appreciated.
The Ramer–Douglas–Peucker algorithm (mentioned in the comment) is definitely good, but it has some disadvantages:
It requires an open polyline as input. For a closed polygon one has to fix an arbitrary starting point, which either lowers the final quality or forces one to try many starting points, which hurts performance.
The vertices of the simplified polyline are a subset of the original polyline's vertices; other positions are not considered. This permits very fast implementations, but again reduces the precision of the simplified polyline.
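For readers who have not seen it, a minimal recursive sketch of Ramer–Douglas–Peucker for an open polyline follows. It uses the distance to the chord segment (a common variant; some descriptions use the distance to the infinite line) and a tolerance epsilon rather than a target edge count. The names are made up for this example.

```python
import math


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment a-b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:
        return math.hypot(p[0] - a[0], p[1] - a[1])
    # Clamp the projection parameter so the foot point stays on the segment.
    t = ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(p[0] - (a[0] + t * dx), p[1] - (a[1] + t * dy))


def rdp(points, epsilon):
    """Simplify an open polyline; the result is a subset of the input vertices."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    index, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = point_segment_distance(points[i], first, last)
        if d > dmax:
            index, dmax = i, d
    if dmax <= epsilon:
        return [first, last]
    # Split at the farthest vertex and simplify both halves.
    left = rdp(points[:index + 1], epsilon)
    right = rdp(points[index:], epsilon)
    return left[:-1] + right


if __name__ == "__main__":
    print(rdp([(0, 0), (1, 0.1), (2, 0), (3, 5), (4, 0)], 0.5))
```

Since the question fixes the edge count instead of the tolerance, one possible way to use this (an assumption, not something stated in the answer) is to search over epsilon until the output has the desired number of edges.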
Another alternative is to take the well-known algorithm for simplification of triangular meshes, "Surface Simplification Using Quadric Error Metrics", and adapt it for polylines:
distances to planes containing triangles are replaced with distances to lines containing polyline segments,
quadratic forms lose one dimension if the polyline is two dimensional.
But the majority of the algorithm is kept, including the priority queue (min-heap) of edge contractions, ordered by the estimated distortion each contraction would produce in the polyline.
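To make the 2D adaptation concrete, here is a small sketch of the per-line quadric only, not of the full decimation loop and not of MeshLib's code. For a line with unit normal (nx, ny) and offset d, the squared distance of a point (x, y) is (nx*x + ny*y + d)^2, which can be written as a 3x3 quadratic form in (x, y, 1). Quadrics of neighbouring segments are summed, as in the mesh version. The function names are made up for this example.

```python
import math


def line_quadric(a, b):
    """3x3 matrix Q with [x, y, 1] Q [x, y, 1]^T = squared distance to the line through a and b."""
    nx, ny = b[1] - a[1], a[0] - b[0]       # a normal of the segment direction
    length = math.hypot(nx, ny)
    nx, ny = nx / length, ny / length       # unit normal
    d = -(nx * a[0] + ny * a[1])            # offset so that n.p + d = 0 on the line
    v = (nx, ny, d)
    return [[v[i] * v[j] for j in range(3)] for i in range(3)]


def add_quadrics(q1, q2):
    """Element-wise sum; the sum measures the total squared distance to both lines."""
    return [[q1[i][j] + q2[i][j] for j in range(3)] for i in range(3)]


def quadric_error(q, p):
    """Evaluate the quadratic form at point p = (x, y)."""
    h = (p[0], p[1], 1.0)
    return sum(h[i] * q[i][j] * h[j] for i in range(3) for j in range(3))


if __name__ == "__main__":
    q = add_quadrics(line_quadric((0, 0), (4, 0)), line_quadric((0, 0), (0, 3)))
    print(quadric_error(q, (1, 2)))
```

In the full algorithm, the cost of contracting an edge is the summed quadric evaluated at the position chosen for the merged vertex, and that cost is what orders the heap.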
Here is an example of this algorithm application:
Red is the original polyline and blue is the simplified polyline. One can see that the vertices of the simplified polyline generally do not lie on the original polyline, while the general shape is preserved as much as possible with so few line segments.
"it'd be nice to also see an implementation."
One can find an implementation in MeshLib, see MRPolylineDecimate.h/.cpp

Algorithm to Produce an Evenly Spaced Grid

I'm looking for a general algorithm for creating an evenly spaced grid, and I've been surprised how difficult it is to find!
Is this a well solved problem whose name I don't know?
Or is this an unsolved problem that is best done by self organising map?
More specifically, I'm attempting to make a grid on a 2D Cartesian plane in which the Euclidean distance between each point and 4 bounding lines (or "walls" to make a bounding box) are equal or nearly equal.
For a square number, this is as simple as making a grid with sqrt(n) rows and sqrt(n) columns with equal spacing positioned in the center of the bounding box. For 5 points, the pattern would presumably either be circular or 4 points with a point in the middle.
I didn't find a very good solution, so I've sadly left the problem alone and settled with a quick function that produces the following grid:
There is no simple general solution to this problem. A self-organizing map is probably one of the best choices.
Another way to approach this problem is to imagine the points as particles that repel each other and that are also repelled by the walls. As an initial arrangement, you could evenly distribute the points up to the next smaller square number, for which you already have a solution. Then randomly add the remaining points.
Iteratively modify the locations to minimize the energy function based on the total force between the particles and walls. The result will of course depend on the force law, i.e. how the force depends on the distance.
To solve this, you can use numerical methods like FEM.
A simplified and less efficient method that is based on the same principle is to first set up an estimated minimal distance, based on the square number case, which you can calculate. Then iterate through all points a number of times and for each one calculate the distance to its closest neighbor. If this is smaller than the estimated distance, move your point in the opposite direction by a certain fraction of the difference.
This method will generally not lead to a stable minimum but should find an acceptable solution after a number of iterations. You will have to experiment with the step size and the number of iterations.
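A minimal sketch of this simplified iteration is shown below. The function name and parameters are made up for this example. The walls are handled only by clamping points back into the box, which is an assumption added here, not part of the description above. The target distance is left as a parameter to be estimated from the square-number case.

```python
import math
import random


def relax_points(points, width, height, target, step=0.5, iterations=100, seed=0):
    """Push each point away from its nearest neighbour while that neighbour is closer than target."""
    rng = random.Random(seed)
    pts = [tuple(p) for p in points]
    for _ in range(iterations):
        for i in range(len(pts)):
            x, y = pts[i]
            nearest, nearest_d = None, math.inf
            for j, q in enumerate(pts):
                if j != i:
                    d = math.hypot(x - q[0], y - q[1])
                    if d < nearest_d:
                        nearest, nearest_d = q, d
            if nearest is None or nearest_d >= target:
                continue
            if nearest_d > 0:
                ux, uy = (x - nearest[0]) / nearest_d, (y - nearest[1]) / nearest_d
            else:
                # Coincident points: pick an arbitrary direction.
                angle = rng.uniform(0.0, 2.0 * math.pi)
                ux, uy = math.cos(angle), math.sin(angle)
            # Move by a fraction of the shortfall, then clamp into the box.
            shift = step * (target - nearest_d)
            pts[i] = (min(max(x + ux * shift, 0.0), width),
                      min(max(y + uy * shift, 0.0), height))
    return pts


if __name__ == "__main__":
    print(relax_points([(0.45, 0.5), (0.55, 0.5)], 1.0, 1.0, target=0.5, iterations=1))
```

Each pass costs O(n^2) because of the brute-force nearest-neighbour search, which is acceptable for small point counts.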
To summarize, you have three options:
FEM method: Efficient but difficult to implement
Self organizing map: Slightly less efficient, medium complexity of implementation.
Iteration described in last section: Less efficient but easy to implement.
Unfortunately your problem is still not very clearly specified. You say you want the points to be "equidistant" yet in your example, some pairs of points are far apart (eg top left and bottom right) and the points are all different distances from the walls.
Perhaps you want the points to have equal minimum distance? In which case a simple solution is to draw a cross shape, with one point in the centre and the remainder forming a vertical and horizontal crossed line. The gap between the walls and the points, and the points in the lines can all be equal and this can work with any number of points.

extract points which satisfy certain conditions

I have an array of points in one plane. They form some shape. I need to extract points from this array which only form straight lines of this shape.
At this moment I have an algorithm but it does not work very well. I take the first two points, make a straight line and then check if the following points lie on it within some tolerance. But there is a problem: the points which form a straight line are not really on the line but have some deviation. This deviation is quite large. If I make the tolerance in my algorithm large enough to get the points from the straight part, then other points which are on the slightly bent part but deviate less than the specified tolerance are also extracted.
I am looking for some idea on how to perform such task.
Here is the picture:
The circled parts are the ones I want to extract. Red points are the parts which I could extract with my approach. If I tighten the tolerance, then I miss the straight pieces too.
First, if you already have some candidate subset of points and want to check whether they lie on a straight line, use a form of linear regression to identify the best-fitting line, then check how well it fits and accept or reject the hypothesis that this particular segment is linear based on that.
One of the most standard ways of doing that is the least squares method.
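As a minimal sketch, an ordinary least squares fit of y = slope * x + intercept, together with the root-mean-square residual as a goodness-of-fit measure, could look like this. It assumes the points do not all share the same x value. The function name is made up for this example.

```python
import math


def fit_line(points):
    """Least squares fit; returns (slope, intercept, rms_residual)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx                      # fails if all x are equal
    intercept = mean_y - slope * mean_x
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in points)
    return slope, intercept, math.sqrt(sse / n)


if __name__ == "__main__":
    print(fit_line([(0, 1.1), (1, 2.9), (2, 5.2), (3, 6.8)]))
```

A candidate segment would then be accepted as straight if the RMS residual stays below a threshold chosen for the data.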
Identifying the subset is a different problem, the best solution to which will depend strongly on the kind of data you have and the objective. I suggest that enumerating all the segments is a good starting point, if the amount of data is not extremely large, -- that should be doable in no more than cubic time, I gather.
There are certainly some approximations one can apply, e.g. choosing a point in the sequence and building a subset by iteratively adding points on either side as long as the segment remains linear within the tolerance threshold, then accepting or rejecting it depending on whether the segment is long enough.
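A simplified version of this growing strategy is sketched below. It grows runs in one direction only (left to right) instead of on both sides of a seed point, and it uses the RMS residual of a least squares fit as the linearity test. The names and the thresholds tol and min_len are made up for this example, and x values are assumed to be increasing.

```python
import math


def line_rms(points):
    """RMS residual of the least squares line through the points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        return math.inf                    # all x equal: cannot fit y = a*x + b
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    sse = sum((y - mean_y - slope * (x - mean_x)) ** 2 for x, y in points)
    return math.sqrt(sse / n)


def straight_runs(points, tol, min_len):
    """Return (start, end) index pairs, end exclusive, of runs that stay linear within tol."""
    runs = []
    n = len(points)
    start = 0
    while start < n - 1:
        end = start + 2                    # two points always fit a line exactly
        while end < n and line_rms(points[start:end + 1]) <= tol:
            end += 1
        if end - start >= min_len:
            runs.append((start, end))
            start = end                    # continue after the accepted run
        else:
            start += 1                     # too short: try the next seed
    return runs


if __name__ == "__main__":
    data = [(x, 1.0) for x in range(10)] + [(10 + i, 1.0 + (i + 1) ** 2) for i in range(5)]
    print(straight_runs(data, tol=0.05, min_len=5))
```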
I assume here that the curve is parameterizable by one of the coordinates. If this is not the case, e.g. if the curve is closed, additional steps may be required to separate the curve into parameterizable segments.
EDIT: how to check a segment is straight
There's a number of options.
First, I would expect that for a straight line the average deviation would stay roughly the same as you add the new points, then you can simply find a reasonable threshold on that given the data.
Second option is to further split the subset into a fixed number of parts (e.g. 2), find the best fitting line for each one and then compare these. In case of a straight line, roughly the same line should be predicted, but for a curve it would be different.
Third option is to perform nonlinear curve fitting, e.g. fit a quadratic curve and check the coefficient for the quadratic term -- if the line is straight, it should be close to zero.
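The second option can be sketched as follows. For brevity it compares only the slopes of the two half-fits, not the intercepts, which is a simplification of "compare these". The function names and the threshold max_slope_diff are made up for this example, and each half is assumed to contain at least two distinct x values.

```python
def ls_slope(points):
    """Slope of the least squares line through the points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


def halves_agree(points, max_slope_diff):
    """Fit each half separately and check that both fits predict roughly the same slope."""
    mid = len(points) // 2
    return abs(ls_slope(points[:mid]) - ls_slope(points[mid:])) <= max_slope_diff


if __name__ == "__main__":
    print(halves_agree([(x, 3 * x) for x in range(10)], 0.01))
    print(halves_agree([(x, x * x) for x in range(10)], 0.5))
```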
In each case, of course, there is a tradeoff between the segment size and the deviation of the points from that segment. In the extreme case, there would either be one huge linear segment with huge deviation or a whole bunch of 2-point segments with 0 deviation. The actual threshold on the deviation, the difference between the fitted lines, or the magnitude of the quadratic term (depending on the option you prefer) has to be selected for the given dataset to suit your needs. Looking at the plot, I would say that the threshold should be picked so as to allow for segments of length 10 or so.
