What's the best method to compare an original trajectory with two compressed trajectories - algorithm

Suppose we have a GPS trajectory, i.e. a series of spatio-temporal coordinates, where every coordinate is an (x, y, t) triple: x is longitude, y is latitude and t is the time stamp.
Suppose each trajectory is identified by 1000 (x, y, t) points; a compressed trajectory is a trajectory with fewer points than the original, for instance 300 points. A compression algorithm (Douglas-Peucker, Bellman, etc.) decides which points will be kept in the compressed trajectory and which points will be discarded.
Each algorithm makes its own choice. Better algorithms choose the points not only by their spatial characteristics (x, y) but by their spatio-temporal characteristics (x, y, t).
Now I need a way to compare the two compressed trajectories against the original to understand which compression algorithm better reduces a spatio-temporal trajectory (the temporal component is really important).
I've thought of the DTW algorithm to check trajectory similarity, but it probably doesn't take the temporal component into account. What algorithm can I use for this check?

What the best compression algorithm is depends to a large extent on what you are trying to achieve with it, and on other external variables. Typically, we're going to identify and remove spikes, and then remove redundant data. For example:
Known minimum and maximum velocity, acceleration, and ability to turn will let you remove spikes. If we look at the joining distance between a pair of points divided by the elapsed time, where
velocity = sqrt((xb - xa)^2 + (yb - ya)^2) / (tb - ta)
we can eliminate points where the distance couldn't have been travelled in the elapsed time given the speed constraint. We can do the same with acceleration constraints, and with change-of-direction constraints for a given velocity. These constraints change depending on whether the GPS receiver is static, hand-held, in a car, in an aeroplane, etc.
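A minimal sketch of that speed check, assuming planar coordinates (metres) and seconds and a hypothetical max_speed constraint; longitude/latitude would first need projecting or a haversine distance:

```python
import math

def remove_speed_spikes(points, max_speed):
    """Drop points that could not have been reached from the previous kept point
    without exceeding max_speed (units per second).

    points: list of (x, y, t) tuples, sorted by t."""
    kept = [points[0]]
    for x, y, t in points[1:]:
        xa, ya, ta = kept[-1]
        dt = t - ta
        if dt <= 0:
            continue  # duplicate or out-of-order time stamp: treat as a spike
        velocity = math.hypot(x - xa, y - ya) / dt
        if velocity <= max_speed:
            kept.append((x, y, t))
        # otherwise the distance couldn't be travelled in the elapsed time: discard
    return kept
```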
We can remove redundant points using a moving window over three points: an interpolated (x, y, t) position for the middle point is compared with the observed point, and the observed point is removed if it lies within a specified distance and time tolerance of the interpolated point. We can also curve-fit the data and consider the distance to the curve rather than using a moving 3-point window.
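A sketch of that window under one simplifying assumption (the left neighbour is the previously kept point rather than the previously observed one), with a hypothetical dist_tol tolerance:

```python
import math

def drop_redundant(points, dist_tol):
    """One pass of the 3-point window: drop the middle point when the position
    interpolated (by time) between its neighbours lies within dist_tol of it."""
    out = [points[0]]
    for i in range(1, len(points) - 1):
        xa, ya, ta = out[-1]          # previously kept point
        xb, yb, tb = points[i]        # candidate middle point
        xc, yc, tc = points[i + 1]    # next observed point
        a = 0.0 if tc == ta else (tb - ta) / (tc - ta)
        xi, yi = xa + a * (xc - xa), ya + a * (yc - ya)
        if math.hypot(xb - xi, yb - yi) > dist_tol:
            out.append(points[i])     # keep: the interpolation error is too large
        # else: redundant within tolerance, skip it
    out.append(points[-1])
    return out
```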
The compression may also have differing goals based on the constraints given, e.g. to simply reduce the data size by removing redundant observations and spikes, or to smooth the data as well.
For the former, after checking for spikes based on the defined constraints, we simply check the 3D distance of each point to the polyline connecting the compressed points. This is achieved by finding the pair of kept points before and after the point that has been removed, interpolating a position on the line connecting those points based on the observed time, and comparing the interpolated position with the observed position. The number of points removed will increase as we allow this distance tolerance to increase.
For the latter we also have to consider how well the smoothed result models the data, the weights imposed by the constraints, and the design shape / curve parameters.
Hope this makes some sense.

Maybe you could use the mean squared distance between the trajectories over time.
Probably simply looking at the distance at times 1 s, 2 s, ... will be enough, but you can also do it more precisely between time stamps by integrating (x1(t) - x2(t))^2 + (y1(t) - y2(t))^2. Note that between two time stamps both trajectories are straight lines.
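Since both trajectories are piecewise linear in t, between any two consecutive time stamps the integrand is a quadratic in t and Simpson's rule integrates it exactly. A sketch of that integral, assuming both trajectories are lists of (x, y, t) tuples sorted by t with overlapping time ranges:

```python
def position_at(traj, t):
    """Linear interpolation of (x, y) at time t on a time-sorted (x, y, t) list
    (linear scan for clarity; use bisect for speed)."""
    for (x0, y0, t0), (x1, y1, t1) in zip(traj, traj[1:]):
        if t0 <= t <= t1:
            a = 0.0 if t1 == t0 else (t - t0) / (t1 - t0)
            return x0 + a * (x1 - x0), y0 + a * (y1 - y0)
    raise ValueError("t outside trajectory time range")

def integrated_squared_distance(tr1, tr2):
    """Integral over time of (x1(t)-x2(t))^2 + (y1(t)-y2(t))^2."""
    def sq(t):
        x1, y1 = position_at(tr1, t)
        x2, y2 = position_at(tr2, t)
        return (x1 - x2) ** 2 + (y1 - y2) ** 2

    t_lo = max(tr1[0][2], tr2[0][2])
    t_hi = min(tr1[-1][2], tr2[-1][2])
    stamps = sorted({t for _, _, t in tr1 + tr2 if t_lo <= t <= t_hi})
    total = 0.0
    for ta, tb in zip(stamps, stamps[1:]):
        tm = 0.5 * (ta + tb)
        # Simpson's rule, exact here because the integrand is quadratic on [ta, tb]
        total += (tb - ta) / 6.0 * (sq(ta) + 4.0 * sq(tm) + sq(tb))
    return total
```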

I've found what I need to compute spatio-temporal error.
As written in the paper "Compression and Mining of GPS Trace Data: New Techniques and Applications" by Lawson, Ravi & Hwang:
Synchronized Euclidean distance (SED) measures the distance between two points at identical time stamps. In Figure 1, five time steps (t1 through t5) are shown. The simplified line (which can be thought of as the compressed representation of the trace) is comprised of only two points (P't1 and P't5); thereby, it does not include points P't2, P't3 and P't4. To quantify the error introduced by these missing points, distance is measured at the identical time steps. Since three points were removed between P't1 and P't5, the line is divided into four equal-sized line segments using the three points P't2, P't3 and P't4 for the purposes of measuring the error. The total error is measured as the sum of the distance between all points at the synchronized time instants, as shown below. (In the following expression, n represents the total number of points considered.)
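A minimal sketch of SED based on that description, assuming (x, y, t) tuples sorted by t (the quoted expression is just the sum of these per-time-stamp distances):

```python
import bisect
import math

def sed(original, compressed):
    """Synchronized Euclidean distance: sum, over all original time stamps, of the
    distance between the observed point and the compressed trajectory evaluated
    at that same time stamp."""
    comp_times = [t for _, _, t in compressed]
    total = 0.0
    for x, y, t in original:
        # segment of the compressed line that brackets t (clamped at the ends)
        i = min(max(bisect.bisect_right(comp_times, t), 1), len(compressed) - 1)
        x0, y0, t0 = compressed[i - 1]
        x1, y1, t1 = compressed[i]
        a = 0.0 if t1 == t0 else (t - t0) / (t1 - t0)
        xs, ys = x0 + a * (x1 - x0), y0 + a * (y1 - y0)
        total += math.hypot(x - xs, y - ys)
    return total
```

The compression with the lower SED against the original is the one that better preserves the spatio-temporal shape, which is exactly the comparison the question asks for.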

Related

Efficiently finding out which 3D lines intersect which 3D points

Given
A set of many points in 3D space
(each represented as 3 floating-point coordinates (x, y, z))
A set of many infinite lines in 3D space
(each represented by an arbitrary point on the line, and a 3D direction vector)
Is there a way to find out which of the points lie on which of the lines (with a little tolerance to account for floating-point errors), that is more efficient than the trivial O(n²) approach of testing every point against every line in a nested loop?
I'm thinking along the lines of storing one of the two sets in a special data structure that helps with the intersection tests. But what would such a data structure look like?
(Links to relevant academic literature are also appreciated.)
This is a complement to Pinhead's answer.
Consider the bounding box of the P points, assumed roughly cubic, and subdivide it into C³ cells. If the spread of the points is uniform, every cell will contain P/C³ points on average.
Now for every line you find all the cells it intersects by a process similar to drawing a digital line; on average it intersects αC cells, where α is a small constant. Hence the total workload for the search is proportional to α L P/C².
Anyway, you have to account for the initialization time of the grid, which is proportional to C³. Hence the total including initialization is of the form C³ + β L P/C², which is minimized by C ~ (L P)^(1/5), giving time and space complexity O((L P)^(3/5)), a significant saving over O(L P).
You can also think of a binary tree that contains a hierarchy of bounding spheres, starting from the tolerance radius around each point. You could create the tree in the same way one creates a kD-tree, by recursive subdivisions along some dimension.
It is much harder to make precise estimates of the speedup, but you could expect a time complexity like O((L + P) Log(P)).
Divide 3D space into cubes of side N, so that a cube can contain more than one point. An infinite line will intersect an infinite number of cubes, but checking whether a cube contains any points takes O(1). When a line intersects a cube with more than one point, you only check the points in that cube, which will be far fewer than the total number of points provided 1) they are evenly distributed and 2) you use amortized constants (constant amortized time). On average you would only end up checking all points at once, at O(n²), if all the points were packed inside a single cube. You can also nest cubes inside cubes, but that requires more bookkeeping.
This idea is inspired by a quadtree:
https://en.wikipedia.org/wiki/Quadtree
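A rough sketch of the grid idea from both answers, assuming points are (x, y, z) tuples; it samples positions along a finite parameter range of the line rather than doing an exact 3D digital-line traversal, and assumes the tolerance is no larger than the cell size:

```python
import math
from collections import defaultdict

def build_grid(points, cell):
    """Bucket points into cubic cells of side `cell`."""
    grid = defaultdict(list)
    for p in points:
        grid[tuple(int(math.floor(c / cell)) for c in p)].append(p)
    return grid

def dist_to_line(p, a, d):
    """Distance from point p to the infinite line through a with unit direction d."""
    t = sum((p[i] - a[i]) * d[i] for i in range(3))
    return math.dist(p, [a[i] + t * d[i] for i in range(3)])

def points_on_line(grid, cell, a, d, t_range, tol):
    """Collect grid points within tol of the line, visiting only the cells near
    sample positions spaced half a cell apart along the parameter range."""
    n = math.sqrt(sum(c * c for c in d))
    d = [c / n for c in d]
    t0, t1 = t_range
    steps = max(1, int(2 * (t1 - t0) / cell))
    hits, seen = [], set()
    for s in range(steps + 1):
        q = [a[i] + (t0 + (t1 - t0) * s / steps) * d[i] for i in range(3)]
        base = tuple(int(math.floor(c / cell)) for c in q)
        # check the cell containing the sample and its neighbours to cover tol
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for dz in (-1, 0, 1):
                    for p in grid.get((base[0] + dx, base[1] + dy, base[2] + dz), ()):
                        if p not in seen and dist_to_line(p, a, d) <= tol:
                            seen.add(p)
                            hits.append(p)
    return hits
```

The parameter range would typically be chosen so the sampled stretch of the line spans the points' bounding box.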

Data structure for piecewise circular trajectory in plane

I'm trying to design a data structure to hold/express a piecewise circular trajectory in the Euclidean plane. The trajectory is constrained to be continuous and to have finite curvature everywhere, and therefore the circular arcs meet tangentially.
Storing all the circle centers, radii, and touching points would allow for inspecting the geometry anywhere in O(1) but would require explicit enforcement of the continuity and curvature constraints due to data redundancy. In my view, this would make the code messy.
Storing only the circle touching points (which are waypoints along the curve) along with the curve's initial direction would be sufficient in principle, and avoid data redundancy, but then it would be necessary to do an O(n) calculation to inspect the geometry of arc n, since that arc depends on all the arcs preceding it in the trajectory.
I would like to avoid data redundancy, but I also don't want to make the cost of geometric inspection prohibitive.
Does anyone have any high-level idea/advice to share?
For the most efficient traversal of the trajectory, if I am right you need
the ending curvilinear abscissas of every arc (cumulative),
the radii,
the starting angles,
the coordinates of the centers,
so that for a given s you find the index of the arc, then the azimuth and the coordinates of the point. (Either incrementally for a sequence of points, or by dichotomy for a single point.) That takes five parameters per arc.
Only the cumulative abscissas are global, but you can't do without them for single-point accesses. You can drop the radii and starting angles and retrieve them for any arc from the difference of curvilinear abscissas and the limit angles (see below). This reduces to three parameters.
On the other hand, knowing just the coordinates of the centers and those of the starting and ending points is enough to recover the whole geometry, and this takes two parameters per arc.
The meeting point of two arcs is found on the line through the centers, and if you know one radius, the other follows. And the limit angle is given by the direction of the line. So for an incremental traversal, this non-redundant description can do.
For convenient computation, knowing s and the arc index, consider the vectors from the center to the centers of the adjoining arcs. Rotate them so that the first becomes horizontal. The components of the other will give you the amplitude angle. The fraction (s - S_(i-1)) / (S_i - S_(i-1)) of the amplitude gives you the azimuth of the point, to which you apply the counter-rotation.
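A sketch of the single-point access with the five-parameters-per-arc representation described above (cumulative end abscissas, centers, radii, starting angles, plus a sweep direction); straight segments would need a separate branch:

```python
import bisect
import math

def point_at(s, cum_s, centers, radii, start_angles, sweep):
    """Evaluate the trajectory at curvilinear abscissa s.

    cum_s[i]        : cumulative abscissa at the end of arc i (sorted)
    centers[i]      : (cx, cy) of arc i
    radii[i]        : radius of arc i (> 0 here)
    start_angles[i] : angle of the arc's starting point as seen from its center
    sweep[i]        : +1 for counter-clockwise, -1 for clockwise
    """
    i = min(bisect.bisect_left(cum_s, s), len(cum_s) - 1)    # dichotomy on the abscissas
    s0 = cum_s[i - 1] if i > 0 else 0.0
    angle = start_angles[i] + sweep[i] * (s - s0) / radii[i]  # arc length -> swept angle
    cx, cy = centers[i]
    return cx + radii[i] * math.cos(angle), cy + radii[i] * math.sin(angle)
```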
I'd store items with the data required to get info for any point of that element. For example, an arc needs x, y, initial direction, radius, and length (or end point, or angle difference, or whatever you find easiest).
Because you need continuity (same x, y, same bearing, perhaps same curvature) between two ending points, a node with these properties is needed. Notice that these properties are common to arcs and straights (a straight being a special arc identified by radius = 0). So you can treat a node the same as an item.
The trajectory should be calculated before any request. So you have all items-data in advance.
The container depends on how you request info.
If the trajectory can be somehow represented in a grid, then you better use a quad-tree.
I guess you must find the item from an x, y or accumulated-length input. You will have to iterate through the container to find the element closest to the input data. Sorted data may help.
My choice is a simple vector with the consecutive elements, which happens to be sorted on accumulated trajectory length.
Finding by x, y on an x-sorted container (or a tree) is not so simple, because a given x, y may have perpendiculars to several items, consecutive or not, near or not, and you need to select the nearest one.
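For the find-by-(x, y) case, a brute-force sketch of the nearest-element lookup, treating each stored item as a full circle (hypothetical cx, cy, r fields; clipping to the arc's angular extent and the radius = 0 straight case are left out):

```python
import math

def nearest_item(x, y, items):
    """Return (index, distance) of the stored arc whose circle is closest to (x, y)."""
    best_i, best_d = -1, float("inf")
    for i, it in enumerate(items):
        d = abs(math.hypot(x - it["cx"], y - it["cy"]) - it["r"])
        if d < best_d:
            best_i, best_d = i, d
    return best_i, best_d
```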

seeking approximate algorithm to find largest clear circle in an area

Related: Is there a simple algorithm for calculating the maximum inscribed circle into a convex polygon?
I'm writing a graphics program whose goals are artistic rather than mathematical. It composes a picture step by step, using geometric primitives such as line segments or arcs of small angle. As it goes, it looks for open areas to fill in with more detail; as the available open areas get smaller, the detail gets finer, so it's loosely fractal.
At a given step, in order to decide what to do next, we want to find out: where is the largest circular area that's still free of existing geometric primitives?
Some constraints of the problem
It does not need to be exact. A close-enough answer is fine.
Imprecision should err on the conservative side: an almost-maximal circle is acceptable, but a circle that's not quite empty isn't acceptable.
CPU efficiency is a priority, because it will be called often.
The program will run in a browser, so memory efficiency is a priority too.
I'll have to set a limit on level of detail, constrained presumably by memory space.
We can keep track of the primitives already drawn in any way desired, e.g. a spatial index. Exactness of these is not required; e.g. storing bounding boxes instead of arcs would be OK. However the more precision we have, the better, because it will allow the program to draw to a higher level of detail. But, given that the number of primitives can increase exponentially with the level of detail, we'd like storage of past detail not to increase linearly with the number of primitives.
To summarize the order of priorities
Memory efficiency
CPU efficiency
Precision
P.S.
I framed this question in terms of circles, but if it's easier to find the largest clear golden rectangle (or golden ellipse), that would work too.
P.P.S.
This image gives some idea of what I'm trying to achieve. Here is the start of a tendril-drawing program, in which decisions about where to sprout a tendril, and how big, are made without regard to remaining open space. But now we want to know, where is there room to draw a tendril next, and how big? And where after that?
One very efficient way would be to recursively divide your area into rectangular sub-areas, splitting them when necessary to divide occupied areas from unoccupied areas. Then you would simply need to keep track of the largest unoccupied area at each time. See https://en.wikipedia.org/wiki/Quadtree - but you needn't split into squares.
Given any rectangle, you can draw a line inside it so that at least one of the rectangles on either side of the line is a golden rectangle. Therefore you can recursively erect partitions within a rectangle so that all but one of the rectangles formed by the partitions are golden rectangles, and the odd rectangle left over is vanishingly small. You could do this to create a quadtree-like structure where almost all of the rectangles left over are golden rectangles.
This seems like the kind of situation where a randomized algorithm might be helpful. Choose points at random, reject and choose more if they're inappropriate for some reason, then find the minimum distance from each of your choices to each of the figures already included. The random point with the maximum of the minimums would be your choice.
The number of sample points might have to increase as the complexity of the figure increases.
The randomized algorithm could be improved by checking points near good choices. Keep checking neighbors until no more improvement is possible.
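A small sketch of that sampling scheme, assuming a rectangular canvas and a caller-supplied dist_to(point, primitive) function (both assumptions, not part of the question); the neighbor-refinement step is left out:

```python
import math
import random

def best_random_center(width, height, primitives, dist_to, n_samples=200):
    """Sample random points and return the one whose minimum distance to any
    existing primitive is largest, together with that distance (a safe radius)."""
    best_p, best_r = None, -1.0
    for _ in range(n_samples):
        p = (random.uniform(0, width), random.uniform(0, height))
        r = min((dist_to(p, prim) for prim in primitives),
                default=math.hypot(width, height))   # empty canvas: anything fits
        if r > best_r:
            best_p, best_r = p, r
    return best_p, best_r
```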
Here's a simple way that uses a fixed amount of memory and time per update, regardless of how many drawing primitives you use. How much memory (and time per update) is needed can be controlled according to how high a "resolution" you need:
Divide the space up into a grid of points. We will maintain a 2D array, d[], which records the minimum distance from the grid point (x, y) to any already-drawn primitive in the entry d[x, y]. Initially, set every element in this array to infinity (or some huge number).
Whenever you draw some primitive, iterate over all grid points (x, y) calculating the minimum distance (or some conservative approximation to it) from (x, y) to the just-drawn primitive. E.g., if the primitive just drawn was a circle of radius r centered at (p, q), then this distance would be sqrt((x-p)^2 + (y-q)^2) - r. Then update d[x, y] with this new distance value if it is smaller than its current value.
The grid point at which the largest circle can be drawn without touching any already-drawn primitive is the grid point that is the farthest away from any primitive drawn so far. To find it, simply scan through d[] to find its maximum value, and note the corresponding indices (x, y). d[x, y] will be the maximum radius you could safely use for this circle.
Repeat steps 2 and 3 as necessary.
A couple of points:
For primitives that have area, you can assign 0 or a negative value to all d[x, y] corresponding to grid points inside the primitive.
For any convex primitive, you can often avoid updating most of the d[] array by scanning rows (or columns) "outward" from the just-drawn primitive's border: the distance from the current grid point to the primitive will never decrease, so if this distance becomes larger than the previous maximum value in d[] then we know that we can stop scanning this row, because no further distance value that we would compute on it could possibly be less than an existing distance on it.
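A minimal sketch of this distance grid, using a circle as the example primitive; the world size and resolution are assumptions, entries start at infinity so the caller should cap the first radius to the canvas, and the row-scanning optimization from the last point is omitted:

```python
import math

class DistanceField:
    """d[i][j] = minimum distance from grid point (i, j) to anything drawn so far."""

    def __init__(self, width, height, nx, ny):
        self.width, self.height, self.nx, self.ny = width, height, nx, ny
        self.d = [[float("inf")] * ny for _ in range(nx)]

    def _world(self, i, j):
        return (i + 0.5) * self.width / self.nx, (j + 0.5) * self.height / self.ny

    def add_circle(self, p, q, r):
        """Update the field after drawing a circle of radius r centred at (p, q)."""
        for i in range(self.nx):
            for j in range(self.ny):
                x, y = self._world(i, j)
                dist = math.hypot(x - p, y - q) - r      # negative inside the circle
                if dist < self.d[i][j]:
                    self.d[i][j] = dist

    def largest_clear_circle(self):
        """Grid point farthest from everything drawn; returns (x, y, radius)."""
        i, j = max(((i, j) for i in range(self.nx) for j in range(self.ny)),
                   key=lambda ij: self.d[ij[0]][ij[1]])
        x, y = self._world(i, j)
        return x, y, self.d[i][j]
```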

Efficiently calculating a segmented regression on a large dataset

I currently have a large data set for which I need to calculate a segmented regression (or fit a piecewise linear function in some similar way). However, I have both a large data set and a very large number of pieces.
Currently I have the following approach:
Let s_i be the end of segment i
Let (x_i, y_i) denote the i-th data point
Assume the data point x_k lies within segment j; then I can create a vector from x_k as
(s_1, s_2 - s_1, s_3 - s_2, ..., x_k - s_(j-1), 0, 0, ...)
To do a segmented regression on the data, I can then run a normal linear regression over these vectors.
However, my current estimates show that, defining the problem this way, I will get about 600,000 vectors with about 2,000 components each. I haven't benchmarked it yet, but I don't think my computer will be able to solve such a large regression problem in any acceptable time.
Is there a better way to calculate this kind of regression problem? One idea was to use some kind of hierarchical approach, i.e. calculate one regression problem by combining multiple segments so that I can determine the start and end points for this set, and then calculate an individual segmented regression for this set of segments. However, I cannot figure out how to calculate the regression for this set of segments so that the endpoints match (I can only match the start or the end point by fixing the intercept, but not both).
Another idea I had was to calculate an individual regression for each of the segments and then only use the slope for that segment. However, with that approach errors might start to accumulate, and I have no way to control for this kind of error accumulation.
Yet another idea is to do an individual regression for each segment but fix the intercept to the endpoint of the previous segment. However, I am still not sure whether I would get some kind of error accumulation this way.
Clarification
Not sure if this was clear from the rest of the question: I know where the segments start and end. The most important part is that each line segment has to intersect the next segment at the segment boundary.
EDIT
Maybe another fact that could help: all points have different x values.
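For the efficiency concern, one standard trick (not the cumulative-length vectors above, but an equivalent way to write the same continuous piecewise-linear fit) is to parameterize the function by its values at the known breakpoints: each data point then depends only on the two knot values that bracket it, the design matrix has two non-zeros per row, continuity holds by construction, and a sparse least-squares solver handles 600,000 x 2,000 comfortably. A sketch assuming SciPy is available:

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import lsqr

def fit_piecewise_linear(x, y, knots):
    """Least-squares continuous piecewise-linear fit with fixed breakpoints.

    knots: sorted breakpoint positions covering the data range (both ends included).
    Returns the fitted function values at the knots."""
    x, y, knots = np.asarray(x, float), np.asarray(y, float), np.asarray(knots, float)
    j = np.clip(np.searchsorted(knots, x, side="right") - 1, 0, len(knots) - 2)
    w = (x - knots[j]) / (knots[j + 1] - knots[j])   # position of each x within its segment
    rows = np.repeat(np.arange(len(x)), 2)
    cols = np.column_stack([j, j + 1]).ravel()
    data = np.column_stack([1.0 - w, w]).ravel()     # hat-function weights
    A = sparse.csr_matrix((data, (rows, cols)), shape=(len(x), len(knots)))
    return lsqr(A, y)[0]
```

The per-segment slopes, if needed, are just the differences of consecutive knot values divided by the knot spacing.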
I would group points into rectangular grid areas based on their position, so you process this task on several smaller datasets and then merge the results together when all are done.
I would process each group like this:
compute a histogram of angles
take only the most frequently occurring angles
their count determines the number of line segments present in the group
do the regression/line fit for these angles (see this answer, which does something very similar, just for a single line)
compute the intersection points between line segments to get the endpoints of your piecewise polyline, and also the connectivity info (join the closest endpoints)
[edit1] after OP edit
You know the edge x coordinates of all segments (x0, x1, ...), so just compute the average y coordinate of the points near each segment edge (grey area, green points) and you have the segment line endpoints (blue points). Of course this is no fit or regression, because all the other points are discarded, so it leads to bigger errors (unless the segment x coordinates correspond to the regressed lines ...), but there is no way around it with the constraints of the solution you have (at least I do not see any).
Because if you use regression on the segment data then you cannot connect it to the other segments, and if you try to merge them you get almost the same result as this:
the size of the grey area determines the output ... so play with it a bit ...
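A sketch of that edge-averaging step, with a hypothetical half_width standing in for the size of the grey area:

```python
import numpy as np

def endpoints_by_edge_average(x, y, edges, half_width):
    """For each known segment-edge x coordinate, average the y of the points whose
    x lies within +/- half_width of it and use that as the polyline endpoint."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    endpoints = []
    for e in edges:
        mask = np.abs(x - e) <= half_width
        endpoints.append((e, float(y[mask].mean()) if mask.any() else float("nan")))
    return endpoints
```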

extract points which satisfy certain conditions

I have an array of points in one plane. They form some shape. I need to extract the points from this array which form only the straight-line parts of this shape.
At the moment I have an algorithm, but it does not work very well. I take the first two points, make a straight line and then check whether the following points lie on it within some tolerance. But there is a problem: the points which form a straight line are not exactly on it but have some deviation, and this deviation is quite large. If I make the tolerance in my algorithm large enough to catch the points from the straight part, then other points, which lie on the slightly bent parts but deviate by less than the tolerance, are also extracted.
I am looking for some idea on how to perform such task.
Here is the picture:
In the circles are the parts which I want to extract. The red points are the parts which I could extract with my approach. If I decrease the tolerance then I miss the straight pieces too.
First, suppose you already have some candidate subset of points and want to check whether they lie on a straight line: use a form of linear regression to identify the best-fitting line, then check how well it fits and accept or reject the hypothesis that this particular segment is linear based on that.
One of the most standard ways of doing that is the least-squares method.
Identifying the subset is a different problem, and the best solution will depend strongly on the kind of data you have and on the objective. I suggest that enumerating all the segments is a good starting point if the amount of data is not extremely large -- that should be doable in no more than cubic time, I gather.
There are certainly approximations one can apply, e.g. choosing a point in the sequence and building a subset by iteratively adding points on either side as long as the segment remains linear within the tolerance threshold, then accepting or rejecting it if the segment is long enough.
I assume here that the curve is parameterizable by one of the coordinates. If this is not the case, e.g. if the curve is closed, additional steps may be required to separate the curve into parameterizable segments.
EDIT: how to check a segment is straight
There are a number of options.
First, I would expect that for a straight line the average deviation stays roughly the same as you add new points, so you can simply find a reasonable threshold on that given the data.
The second option is to further split the subset into a fixed number of parts (e.g. 2), find the best-fitting line for each one and then compare these. In the case of a straight line, roughly the same line should be predicted for both parts, but for a curve they would differ.
The third option is to perform nonlinear curve fitting, e.g. fit a quadratic curve and check the coefficient of the quadratic term -- if the line is straight, it should be close to zero.
In each case, of course, there is a trade-off between the segment size and the deviation of the points from that segment. In the extreme cases there would either be one huge linear segment with a huge deviation, or a whole bunch of 2-point segments with zero deviation. The actual threshold on the deviation, on the difference between the fitted lines, or on the magnitude of the quadratic term (depending on the option you prefer) has to be selected for the given dataset to suit your needs. Looking at the plot, I would say that the threshold should be picked so as to allow for segments of length 10 or so.
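A small sketch combining the first and third options, assuming the candidate segment is parameterizable by x as noted above (thresholds on both returned values are data-dependent):

```python
import numpy as np

def straightness(x, y):
    """Return (|quadratic coefficient|, RMS residual of the best straight line)
    for a candidate segment; both should be small if the segment is straight."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    a, _, _ = np.polyfit(x, y, 2)                    # quadratic term ~ 0 when straight
    slope, intercept = np.polyfit(x, y, 1)
    rms = float(np.sqrt(np.mean((y - (slope * x + intercept)) ** 2)))
    return abs(a), rms
```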
