Efficient algorithm to fit a linear line along the upper boundary of data only - algorithm

I'm currently trying to fit a line through a spread of scattered data in MATLAB. This is easy enough using the polyfit function, which gives me the y = mx + c equation. However, I now need to fit a line along the upper boundary of my data, i.e., through only the top few data points. I know this description is vague, so let's assume my scattered data form a cone shape with its apex on the y-axis, spreading outwards and upwards in the +x and +y directions. I need to fit a best-fit line along the 'upper edge of the cone', if you will.
I've developed an algorithm, but it's extremely slow. It fits a line of best fit through ALL the data, deletes every data point below that line, and iterates until only 5% of the initial data points are left; the final best-fit line then lies close to the top edge of the cone. For 250 data points this takes about 5 s, and since I'm dealing with more than a million data points, the algorithm is simply too inefficient.
I guess my question is: is there an algorithm to more efficiently achieve what I need? Or is there a way to sharpen up my code to eliminate unnecessary complexity?
Here is my code in MATLAB:
(As an example)
a = [4, 5, 1, 8, 1.6, 3, 8, 9.2]; %To be used as x-axis points
b = [45, 53, 12, 76, 25, 67, 75, 98]; %To be used as y-axis points
while prod(size(a)) > (0.05*prod(size(a))) %Iterative line fitting occurs until there are less than 5% of the data points left
    lobf = polyfit(a,b,1); %Line of Best Fit for current data points
    alen = length(a);
    for aindex = alen:-1:1 %For loop to delete all points below line of best fit
        ValLoBF = lobf(1)*a(aindex) + lobf(2);
        if ValLoBF > b(aindex) %if LoBF is above current point...
            a(aindex) = []; %delete x coordinate...
            b(aindex) = []; %and delete its corresponding y coordinate
        end
    end
end

Well, first of all, your example code seems to run indefinitely ;)
Some optimizations for your code:
a = [4, 5, 1, 8, 1.6, 3, 8, 9.2]; %To be used as x-axis points
b = [45, 53, 12, 76, 25, 67, 75, 98]; %To be used as y-axis points
n_init_a = length(a);
while length(a) > 0.05*n_init_a %Iterative line fitting occurs until there are less than 5% of the data points left
    lobf = polyfit(a,b,1); % Line of Best Fit for current data points
    % Delete data points below line using logical indexing
    % First create values of the polyfit points using element-wise vector multiplication
    temp = lobf(1)*a + lobf(2); % Containing all polyfit values
    % Using logical indexing to discard all points below
    a(b<temp)=[]; % First destroy a
    b(b<temp)=[]; % Then b, very important!
end
Also you should try profiling your code by typing in the command window
profile viewer
and check what takes the most time in calculating your results. I suspect it is polyfit, but that probably can't be sped up much.

What you are looking for is not line fitting. You are trying to find the convex hull of the points.
You should check out the function convhull. Once you find the hull, you can remove all of the points that aren't close to it, and fit each part independently to cope with the noise in the data.
Alternatively, you could render the points onto some pixel grid, then do some kind of morphological operation, like imclose, and finish with a Hough transform. Also check out this answer.
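For illustration, here is a minimal MATLAB sketch of the convhull route, using the a/b vectors from the question; splitting off the "upper" hull vertices with the global best-fit line is just one possible heuristic:
k  = convhull(a, b);             % indices of the convex hull vertices, in order
hx = a(k);  hy = b(k);
p  = polyfit(a, b, 1);           % global best-fit line, used only to split the hull
up = hy > polyval(p, hx);        % hull vertices above it ~ the upper boundary
upperFit = polyfit(hx(up), hy(up), 1);   % line along the top edge of the cone
For a million points this should be far cheaper than iterative deletion, since convhull in 2D is roughly O(n log n) and the hull itself has only a handful of vertices.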

Related

Cover a polygonal line using the fewest given rectangles while keeping its continuity

Given a list of points forming a polygonal line, and both height and width of a rectangle, how can I find the number and positions of all rectangles needed to cover all the points?
The rectangles may be rotated and may overlap, but must follow the path of the polyline (a rectangle may contain multiple segments of the line, but each rectangle must contain a segment that is contiguous with the previous one).
Having the intersections fall on the smallest side of the rectangle, when possible, would be much appreciated.
All the solutions I have found so far were not clean; here is the result I get:
You can see that it gives a good result in near-flat cases, but overlaps too much on sharp curves. One rectangle could clearly be removed if the previous one were offset.
Currently, I place a rectangle centered at width/2 along the line, rotate it using convex-hull and modified rotating-calipers algorithms, and repeat starting at the intersection point of the previous rectangle and the line.
You may notice that I took inspiration from the minimum oriented bounding-rectangle algorithm for the orientation, but it doesn't include the cutting aspect, nor the fixed size.
Thanks for your help!
I modified k-means to solve this. It's not fast, it's not optimal, it's not guaranteed, but (IMHO) it's a good start.
There are two important modifications:
1- The distance measure
I used a Chebyshev-distance-inspired measure to see how far points are from each rectangle. To find the distance from the points to each rectangle, I first transformed all points into a new coordinate system, shifted to the rectangle's center and rotated to its direction:
Then I used these transformed points to calculate distance:
d = max(2*abs(X)/w, 2*abs(Y)/h);
It gives equal values for all points that are the same distance from each side of the rectangle, and the result is less than 1.0 for points that lie inside the rectangle. Now we can assign each point to its closest rectangle.
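For concreteness, a small MATLAB sketch of this distance (the names are illustrative: P is an N-by-2 list of points, C = [cx cy] the rectangle center, a its rotation angle, w and h its size):
ca = cos(a); sa = sin(a);
Q = bsxfun(@minus, P, C) * [ca -sa; sa ca];   % points expressed in the rectangle's local frame
d = max(2*abs(Q(:,1))/w, 2*abs(Q(:,2))/h);    % < 1.0 for points inside the rectangle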
2- Strategy for updating cluster centers
Each cluster center is a combination of C, the center of the rectangle, and a, its rotation angle. At each iteration, a new set of points is assigned to each cluster. Here we have to find C and a so that the rectangle covers the maximum possible number of points. I don't know if there is an analytical solution for that, so I used a statistical approach: I updated C using a weighted average of the points, and used the direction of the first principal component of the points to update a. As the weight of each point in the weighted average I used the proposed distance raised to the power of 500; this moves the rectangle towards points that are located outside of it.
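A rough sketch of one such update step under the same assumptions (Pk holds the points currently assigned to this rectangle, d their distances from the previous iteration):
wts = d(:).^500;                                      % heavily favor points that are still outside
C   = sum(bsxfun(@times, Pk, wts), 1) / sum(wts);     % weighted average -> new center
[V, D] = eig(cov(Pk));                                % principal axes of the point cloud
[~, imax] = max(diag(D));
a = atan2(V(2,imax), V(1,imax));                      % direction of the first principal component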
How to Find K
Initialize it to 1 and increase it until all distances from points to their corresponding rectangles become less than 1.0, meaning every point is inside a rectangle.
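In other words, something along these lines (kmeansRect is a hypothetical name for the modified k-means described above, returning the rectangles and each point's distance to its assigned rectangle):
K = 0; covered = false;
while ~covered
    K = K + 1;
    [rects, d] = kmeansRect(P, K, w, h);   % hypothetical wrapper around the steps above
    covered = all(d < 1.0);                % every point lies inside some rectangle
end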
The results
Iterations 0, 10, 20, 30, 40, and 50 of updating cluster centers (rectangles):
Solution for test case 1:
Trying Ks: 2, 4, 6, 8, 10, and 12 for complete coverage:
Solution for test case 2:
P.S.: I used parts of Chalous Road as data. It was fun downloading it from Google Maps. Then I used the technique described here to sample a set of equally spaced points.
It’s a little late and you’ve probably figured this out. But, I was free today and worked on the constraint reflected in your last edit (continuity of segments). As I said before in the comments, I suggest using a greedy algorithm. It’s composed of two parts:
A search algorithm that looks for the furthest point from an initial point (I used a binary search), such that all points between them lie inside a rectangle of the given w and h.
A loop that repeatedly finds the best rectangle at each step and advances the initial point.
The pseudocode for each is as follows:
function getBestMBR( P, iFirst, w, h )
    nP = length(P);
    iStart = iFirst;
    iEnd = nP;
    while iStart <= iEnd
        m = floor((iStart + iEnd) / 2);
        MBR = getMBR(P[iFirst->m]);
        if (MBR.w < w) & (MBR.h < h)   {*}
            iStart = m + 1;
            iLast = m;
            bestMBR = MBR;
        else
            iEnd = m - 1;
        end
    end
    return bestMBR, iLast;
end
function getRectList( P, w, h )
    nP = length(P);
    rects = [];
    iFirst = 1;
    iLast = iFirst;
    while iLast < nP
        [bestMBR, iLast] = getBestMBR(P, iFirst, w, h);
        rects.add(bestMBR.x, bestMBR.y, bestMBR.a);
        iFirst = iLast;
    end
    return rects;
end
Solution for test case 1:
Solution for test case 2:
Just keep in mind that it’s not meant to find the optimal solution, but finds a sub-optimal one in a reasonable time. It’s greedy after all.
Another point is that you can improve this a little in order to decrease the number of rectangles. As you can see in the line marked with (*), I kept the resulting rectangle in the direction of the MBR (Minimum Bounding Rectangle), even though you can cover larger MBRs with rectangles of the same w and h if you rotate the rectangle. (1) (2)

finding thermocline using depth and temperature

temp = {23,23,23,22,20,20,19,12,11,10,10 };
depth= {0,1,2,3, 8, 9, 10, 12, 18, 23, 29 };
I have two arrays as shown. I need to find the thermocline using the following statement:
It is easily seen that the slope of the curve (i.e. dT/dh) is a maximum at the (very!) obvious thermocline.
Furthermore, because the curvature of the curve changes at the thermocline, it is a point of inflection, so by definition d2T/dh2 = 0.
Another way to look at it is that to maximise the slope, i.e. the 1st derivative, the 2nd derivative must be equal to zero.
Please help!
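For what it's worth, a minimal MATLAB sketch of that idea, assuming the data are plain numeric vectors (the braces above suggest cell arrays or initializer lists, so convert first):
temp  = [23 23 23 22 20 20 19 12 11 10 10];
depth = [ 0  1  2  3  8  9 10 12 18 23 29];
dTdh  = diff(temp) ./ diff(depth);               % first derivative dT/dh between samples
[~, i] = max(abs(dTdh));                         % steepest segment, i.e. the thermocline
thermoclineDepth = (depth(i) + depth(i+1)) / 2;  % midpoint of that segment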

How to compute the variances in Expectation Maximization with n dimensions?

I have been reviewing Expectation Maximization (EM) in research papers such as this one:
http://pdf.aminer.org/000/221/588/fuzzy_k_means_clustering_with_crisp_regions.pdf
I have some doubts that I have not figured out. For example, what happens if we have many dimensions for each data point?
For example I have the following dataset with 6 datapoints and 4 dimensions:
D1  D2  D3  D4
5, 19, 72, 5
6, 18, 14, 1
7, 22, 29, 4
3, 22, 51, 1
2, 21, 89, 2
1, 12, 28, 1
Does it mean that for the expectation step I need to compute 4 standard deviations (one for each dimension)?
Do I also have to compute the variance for each cluster assuming k=3 (I don't know if that is necessary based on the formula from the paper...), or just the variances for each dimension (4 attributes)?
Usually, you use a Covariance matrix, which also includes variances.
But it really depends on your chosen model. The simplest model does not use variances at all.
A more complex model has a single variance value, the average variance over all dimensions.
Next, you can have a separate variance for each dimension independently; and last but not least a full covariance matrix. That is probably the most flexible GMM in popular use.
Depending on your implementation, there can be many more.
From R's mclust documentation:
univariate mixture
"E" = equal variance (one-dimensional)
"V" = variable variance (one-dimensional)
multivariate mixture
"EII" = spherical, equal volume
"VII" = spherical, unequal volume
"EEI" = diagonal, equal volume and shape
"VEI" = diagonal, varying volume, equal shape
"EVI" = diagonal, equal volume, varying shape
"VVI" = diagonal, varying volume and shape
"EEE" = ellipsoidal, equal volume, shape, and orientation
"EEV" = ellipsoidal, equal volume and equal shape
"VEV" = ellipsoidal, equal shape
"VVV" = ellipsoidal, varying volume, shape, and orientation
single component
"X" = univariate normal
"XII" = spherical multivariate normal
"XXI" = diagonal multivariate normal
"XXX" = elliposidal multivariate normal

How to filter a set of 2D points moving in a certain way

I have a list of points moving in two dimensions (x- and y-axis) represented as rows in an array. I might have N points - i.e., N rows:
1 t1 x1 y1
2 t2 x2 y2
.
.
.
N tN xN yN
where ti, xi, and yi are the time index, x-coordinate, and y-coordinate for point i. The time index ti is an integer from 1 to T. The number of points at each such time index can vary from 0 to N (still with only N points in total).
My goal is to filter out all the points that do not move in a certain way, or to keep only those that do. A point must move in a parabolic trajectory with decreasing x- and y-coordinates (i.e., moving to the left and downwards only). Points with other dynamic behaviour must be removed.
Can I use a simple sorting mechanism on this array and then analyse the order of the time index? I have also considered the fact that points sharing the same time index ti are physically distinct points, and so should be paired up with points at other time indices. The complexity of the problem grew, and now I turn to you.
NOTE: You can assume that the points are confined to a sub-region of the (x,y)-plane between two parabolic curves. These curves intersect at only one point: a point close to the origin of motion for any point.
More Information:
I have made some datafiles available:
MATLAB datafile (1.17 kB)
same data as CSV with semicolon as column separator (2.77 kB)
Necessary context:
The datafile holds one uint32 array with 176 rows and 5 columns. The columns are:
pixel x-coordinate in 175-by-175 lattice
pixel y-coordinate in 175-by-175 lattice
discrete theta angle-index
time index (from 1 to T = 10)
row index for this original sorting
The points "live" in a 175-by-175 pixel-lattice - and again inside the upper quadrant of a circle with radius 175. The points travel on the circle circumference in a counterclockwise rotation to a certain angle theta with horizontal, where they are thrown off into something close to a parabolic orbit. Column 3 holds a discrete index into a list with indices 1 to 45 from 0 to 90 degress (one index thus spans 2 degrees). The theta-angle was originally deduces solely from the points by setting up the trivial equations of motions and solving for the angle. This gives rise to a quasi-symmetric quartic which can be solved in close-form. The actual metric radius of the circle is 0.2 m and the pixel coordinate were converted from pixel-coordinate to metric using simple linear interpolation (but what we see here are the points in original pixel-space).
My problem is that some points are not behaving properly and since I need to statistics on the theta angle, I need to remove the points that certainly do NOT move in a parabolic trajoctory. These error are expected and fully natural, but still need to be filtered out.
MATLAB plot code:
% load data and setup variables:
load mat_points.mat;
num_r = 175;
num_T = 10;
num_gridN = 20;
% begin plotting:
figure(1000);
clf;
plot( ...
num_r * cos(0:0.1:pi/2), ...
num_r * sin(0:0.1:pi/2), ...
'Color', 'k', ...
'LineWidth', 2 ...
);
axis equal;
xlim([0 num_r]);
ylim([0 num_r]);
hold all;
% setup grid (yea... went crazy with one):
vec_tickValues = linspace(0, num_r, num_gridN);
cell_tickLabels = repmat({''}, size(vec_tickValues));
cell_tickLabels{1} = sprintf('%u', vec_tickValues(1));
cell_tickLabels{end} = sprintf('%u', vec_tickValues(end));
set(gca, 'XTick', vec_tickValues);
set(gca, 'XTickLabel', cell_tickLabels);
set(gca, 'YTick', vec_tickValues);
set(gca, 'YTickLabel', cell_tickLabels);
set(gca, 'GridLineStyle', '-');
grid on;
% plot points per timeindex (with increasing brightness):
vec_grayIndex = linspace(0,0.9,num_T);
for num_kt = 1:num_T
    vec_xCoords = mat_points((mat_points(:,4) == num_kt), 1);
    vec_yCoords = mat_points((mat_points(:,4) == num_kt), 2);
    plot(vec_xCoords, vec_yCoords, 'o', ...
        'MarkerEdgeColor', 'k', ...
        'MarkerFaceColor', vec_grayIndex(num_kt) * ones(1,3) ...
    );
end
Thanks :)
Why, it looks almost as if you're simulating a radar tracking debris from the collision of two missiles...
Anyway, let's coin a new term: object. Objects are moving along parabolae and at certain times they may emit flashes that appear as points. There are also other points which we are trying to filter out.
We will need some more information:
Can we assume that the objects obey the physics of things falling under gravity?
Must every object emit a point at every timestep during its lifetime?
Speaking of lifetime, do all objects begin at the same time? Can some expire before others?
How precise is the data? Is it exact? Is there a measure of error? To put it another way, do we understand how poorly the points from an object might fit a perfect parabola?
Sort the data with (index, time) as keys, and for all locations of a point i check whether they follow a parabolic trajectory?
Which part are you having a problem with? Sorting should be very easy. IMHO, it is the second part (testing whether a set of points follows a parabolic trajectory) that is difficult.
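If it helps, here is a rough MATLAB sketch of that second part, assuming the samples belonging to one candidate object have already been grouped, in time order, into vectors xs and ys (the grouping itself being the hard part):
p   = polyfit(xs, ys, 2);                         % best-fit parabola y = p1*x^2 + p2*x + p3
res = ys - polyval(p, xs);                        % residuals of the fit
isParabolic = (p(1) < 0) ...                      % concave down, as gravity requires
           && all(diff(xs) < 0) ...               % x strictly decreasing over time
           && norm(res) / norm(ys - mean(ys)) < 0.1;   % 0.1 is an arbitrary tolerance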

An algorithm to solve a simple(?) array problem

For this problem speed is pretty crucial. I've drawn a nice image to explain the problem better. The algorithm needs to determine whether, if the edges of a rectangle are extended within the confines of the canvas, they will intersect another rectangle.
We know:
The size of the canvas
The size of each rectangle
The position of each rectangle
The faster the solution is the better! I'm pretty stuck on this one and don't really know where to start.
(image: http://www.freeimagehosting.net/uploads/8a457f2925.gif)
Cheers
Just create the set of intervals for each of the X and Y axes. Then, for each new rectangle, see if there are intersecting intervals on the X or Y axis. See here for one way of implementing the interval sets.
In your first example, the interval set on the horizontal axis would be { [0-8], [0-8], [9-10] }, and on the vertical: { [0-3], [4-6], [0-4] }
This is only a sketch; I abstracted many details here (e.g. usually one would ask an interval set/tree "which intervals overlap this one" instead of "which intersect this one", but nothing that isn't doable).
Edit
Please watch this related MIT lecture (it's a bit long, but absolutely worth it).
Even if you find simpler solutions (than implementing an augmented red-black tree), it's good to know the ideas behind these things.
Lines that are not parallel to each other will intersect at some point. Calculate the slope of each line and then determine which lines it won't intersect.
Start with that, and then let's see how to optimize it. I'm not sure how your data is represented and I can't see your image.
Using slopes is a simple equality check which probably means you can take advantage of sorting the data. In fact, you can probably just create a set of distinct slopes. You'll have to figure out how to represent the data such that the two slopes of the same rectangle are not counted as intersecting.
EDIT: Wait... how can two rectangles whose edges go to infinity not intersect? Rectangles are basically two pairs of lines that are perpendicular to each other. Shouldn't that mean one always intersects another if those lines are extended to infinity?
Since you didn't mention which language you chose to solve the problem, I will use some kind of pseudocode.
The idea is that if everything is OK, then a sorted collection of rectangle edges along one axis should be a sequence of non-overlapping intervals.
1. Number all your rectangles, assigning them individual ids.
2. Create an empty binary tree collection (btc). This collection should have a method to insert an integer node with info: btc::insert(key, value).
3. For all rectangles, do:
foreach rect in rects do
    btc.insert(rect.top, rect.id)
    btc.insert(rect.bottom, rect.id)
4. Now iterate through the btc (this will give you a sorted order):
btc_item = btc.first()
do
    id = btc_item.id
    btc_item = btc.next()
    if (id != btc_item.id)
        then report_invalid_placement(id, btc_item.id)
    btc_item = btc.next()
while btc_item is valid
5-7. Repeat steps 2-4 for the rect.left and rect.right coordinates.
I like this question. Here is my attempt at it:
If possible:
Create a polygon from each rectangle. Treat each edge as a line of maximum length that must be clipped. Use a clipping algorithm to check whether or not a line intersects another, for example this one: Line Clipping.
But keep in mind: if you find an intersection at a vertex position, it's a valid one.
Here's an idea. Instead of creating each rectangle with (x, y, width, height), instantiate them with (x1, y1, x2, y2), or at least have it interpret these values given the width and height.
That way, you can check which rectangles have a similar x or y value and make sure the corresponding rectangle has the same secondary value.
Example:
The rectangles you have given have the following values:
Square 1: [0, 0, 8, 3]
Square 3: [0, 4, 8, 6]
Square 4: [9, 0, 10, 4]
First, we compare Square 1 to Square 3 (no collision):
Compare the x values
[0, 8] to [0, 8] These are exactly the same, so there's no crossover.
Compare the y values
[0, 3] to [4, 6] None of these numbers are similar, so they're not a factor
Next, we compare Square 3 to Square 4 (collision):
Compare the x values
[0, 8] to [9, 10] None of these numbers are similar, so they're not a factor
Compare the y values
[4, 6] to [0, 4] The rectangles have the number 4 in common, but 0 != 6, therefore, there is a collision
By now we know that a collision will occur, so the method would end, but let's evaluate Square 1 and Square 4 for some extra clarity.
Compare the x values
[0, 8] to [9, 10] None of these numbers are similar, so they're not a factor
Compare the y values
[0, 3] to [0, 4] The rectangles have the number 0 in common, but 3 != 4, therefore, there is a collision
Let me know if you need any extra details :)
Heh, taking the overlapping-intervals answer to the extreme, you simply determine all distinct intervals along the x and y axes. For each cutting line, do an upper-bound search along the axis it will cut, based on the intervals' starting values. If you don't find an interval, or the interval does not intersect the line, then it's a valid line.
The slightly tricky part is to realize that valid cutting lines will not intersect a rectangle's bounds along an axis, so you can combine overlapping intervals into a single interval. You end up with a simple sorted array (which you fill in O(n) time) and an O(log n) search for each cutting line.
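A rough MATLAB sketch of that check (iv is the merged, sorted list of non-overlapping [start end] intervals along one axis, c the coordinate of a candidate cutting line on that axis; find stands in for the O(log n) binary search here):
i = find(iv(:,1) < c, 1, 'last');          % last interval that starts before the line
lineIsValid = isempty(i) || c >= iv(i,2);  % valid if the line is not strictly inside it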
