I have a 3D mesh of ~200,000 triangles.
To find all the flat (or near enough flat) surfaces on the model, I thought I could group triangles by their normal vectors (giving me the ones that face the same way) and then search those smaller sets for triangles that are similar in position or connected.
I cannot think of a good way to do this in practice while also keeping things relatively speedy. I have come up with solutions that would take O(n²), but none that are elegant and quicker than that.
I have vertex information and triangle information (vertices, centre and normal).
Any suggestions would be appreciated.
It is possible that I have misunderstood the problem, so I am stating what I think you need to do: "Given a set of vectors, group parallel vectors together".
You could use a hash map to solve this problem. I am assuming that you store each normal vector as a triple of components:
(a, b, c)
You just need to write a function that converts a vector to an integer. For example, if I know that 0 <= a, b, c <= 1000, then I can use F(a, b, c) = a + 1001·b + 1001²·c, which guarantees a unique integer for every unique vector. After this, it's just a matter of creating a hash map which maps each such integer to a list and storing all the parallel vectors in the same list.
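As an illustration (not code from the question), here is a minimal C++ sketch of that idea, assuming the normals are roughly unit length and that a caller-chosen quantization step controls how coarse the grouping is; the Tri struct and the 0.01 step are made up for the example.
#include <cmath>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical triangle record: index plus unit normal components.
struct Tri { int id; double nx, ny, nz; };

std::unordered_map<std::int64_t, std::vector<int>>
groupByNormal(const std::vector<Tri>& tris, double step = 0.01)
{
    std::unordered_map<std::int64_t, std::vector<int>> groups;
    const std::int64_t base = static_cast<std::int64_t>(2.0 / step) + 1;
    auto q = [step](double v) {                       // bucket index of one component in [-1, 1]
        return static_cast<std::int64_t>(std::floor((v + 1.0) / step));
    };
    for (const Tri& t : tris) {
        std::int64_t key = q(t.nx) + base * q(t.ny) + base * base * q(t.nz);
        groups[key].push_back(t.id);                  // triangles facing (almost) the same way
    }
    return groups;
}
Note that two nearly parallel normals can still land in adjacent buckets, so a tolerance pass over neighbouring keys (or the position/connectivity search mentioned in the question) is still needed for the "near enough flat" case.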
You want to find connected components in the graph formed by your triangles. The only thing you need is to store the adjacency information in a convenient form.
Create a list of all edges as (min, max) vertex pairs; if every edge has two adjacent triangles, there are 300'000 edges. This can be done in linear time:
For every vertex, count the number of adjacent vertices with a greater index, then take the partial sums of these counts.
Allocate and fill an array of edges (second vertex plus utility data). Use the array from step 1 to access the edges adjacent to a vertex. Such an access takes constant time if the number of edges adjacent to a vertex is bounded from above by a constant, so the whole step runs in linear time.
The utility data mentioned above is the pair of indices of the two triangles adjacent to the edge.
OK, now you have the adjacency info; it is time to find the connected components. You can use DFS for this. It runs in linear time because every triangle has at most three (a constant number of) neighbors.
Here you need to allocate about 200'000 * 4 * sizeof(int) bytes, and this, too, can be done in linear time.
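A minimal sketch of that traversal, assuming the adjacency has already been packed so that neighbor[i] holds the (up to three) triangles sharing an edge with triangle i, with -1 marking boundary edges; the explicit stack avoids recursion-depth problems on 200'000 triangles.
#include <array>
#include <cstddef>
#include <stack>
#include <vector>

std::vector<int> labelComponents(const std::vector<std::array<int, 3>>& neighbor)
{
    std::vector<int> comp(neighbor.size(), -1);       // comp[i] = component id of triangle i
    int next = 0;
    for (std::size_t s = 0; s < neighbor.size(); ++s) {
        if (comp[s] != -1) continue;                  // already labeled
        std::stack<int> stack;
        stack.push(static_cast<int>(s));
        comp[s] = next;
        while (!stack.empty()) {
            int t = stack.top(); stack.pop();
            for (int n : neighbor[t])                 // at most three neighbors per triangle
                if (n != -1 && comp[n] == -1) { comp[n] = next; stack.push(n); }
        }
        ++next;                                       // one component finished
    }
    return comp;
}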
You might also want to read about the doubly connected edge list, but it is fairly expensive.
I have an STL file that contains the coordinates (x, y, z) of the 3 points (p0, p1, p2) of each triangle. These triangles represent a 3D surface f(x, y, z). The STL file might have over 1,000 triangles to represent a complex geometry.
For my application, I need to know the neighboring triangles for each triangle entry from the STL file, meaning that for each triangle I have to pick 3 pairs of points pair1 = (p0, p1), pair2 = (p0, p2), pair3 = (p1, p2) and compare them with the pairs of points of the other triangles in the set.
What is the best and most efficient algorithm to achieve this? Can I use a hash tree or hash map?
Change the mesh representation to a point table and a triangle (face) table. STL demands that all triangles are joined at their vertices, so edges are never split, which means neighboring triangles always share one complete edge.
double pnt[points][3];
int tri[triangles][3];
The pnt table should be the list of all distinct points (index-sort it to improve speed for high point counts). Each tri entry should contain the 3 indexes of the points used by the triangle; sort them (ascending or descending) to improve match speed.
Now if any triangle tri[i] shares an edge with tri[j], then those two are neighboring triangles. For example:
int shared = 0;                                   // count vertex indexes common to both triangles
for (int a = 0; a < 3; a++)
 for (int b = 0; b < 3; b++)
  if (tri[i][a] == tri[j][b]) shared++;
if (shared >= 2) { /* triangles i, j are neighbors (they share a complete edge) */ }
This single test covers all the index combinations at once.
If you need just the neighboring points, then find all the triangles containing a given point; all the other points used in those triangles are its neighbors.
To load an STL file into such a structure, do this:
Clear the pnt[], tri[] lists/tables.
Process each triangle of the STL.
For each point of the triangle:
Check whether it is already in pnt[]; if yes, use its index for the new triangle. If not, add the new point to pnt[] and use its index. When all 3 points are done, add the new triangle to tri[].
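A rough sketch of that loading loop; it uses a std::map keyed on the raw coordinates as a simpler stand-in for the index sort described in the next section, and assumes the STL facets arrive as 9 doubles per triangle (a made-up stlTriangles layout). Exact double comparison only works because STL files normally repeat identical coordinate values bit for bit.
#include <array>
#include <map>
#include <vector>

void buildTables(const std::vector<std::array<double, 9>>& stlTriangles,
                 std::vector<std::array<double, 3>>& pnt,
                 std::vector<std::array<int, 3>>& tri)
{
    std::map<std::array<double, 3>, int> index;   // point -> position in pnt
    for (const auto& f : stlTriangles) {
        std::array<int, 3> t{};
        for (int p = 0; p < 3; ++p) {
            std::array<double, 3> v{f[3 * p], f[3 * p + 1], f[3 * p + 2]};
            auto it = index.find(v);
            if (it == index.end()) {              // new distinct point
                it = index.emplace(v, static_cast<int>(pnt.size())).first;
                pnt.push_back(v);
            }
            t[p] = it->second;                    // reuse the existing index
        }
        tri.push_back(t);                         // all 3 points resolved: add the triangle
    }
}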
Improving pnt[] performance
Add an index sort for pnt[], sorted by any coordinate (for example x), to improve the performance of checking whether a point is already present in pnt[].
So while adding (xi, yi, zi) into pnt[], use a binary search on the x-sorted index to find where xi falls, then scan points only until the x coordinate crosses xi (so xi < pnt[i1][0]); this way you do not need to check all points.
If this is too slow (usually when the number of points is bigger than 40,000), you can improve performance further by segmented index sorting (dividing the index sort into segment pages of a finite size, for example 8,192 points).
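For illustration, a small sketch of that look-up, assuming idx[] already holds the point indices sorted by x; std::lower_bound does the binary search and the scan covers only the run of points whose x matches xi.
#include <algorithm>
#include <array>
#include <vector>

// Returns the index of a matching point in pnt, or -1 if it is not present yet.
int findPoint(const std::vector<std::array<double, 3>>& pnt,
              const std::vector<int>& idx,            // indices into pnt, sorted by x
              double xi, double yi, double zi)
{
    auto it = std::lower_bound(idx.begin(), idx.end(), xi,
                               [&](int i, double x) { return pnt[i][0] < x; });
    for (; it != idx.end() && pnt[*it][0] <= xi; ++it) // scan the run of equal x only
        if (pnt[*it][1] == yi && pnt[*it][2] == zi) return *it;
    return -1;
}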
Improving tri[] performance
You can also sort the tri[] by tri[i][0] so you can use binary search similarly to pnt[].
I would suggest going with a hash map where the values are sets (tree-based) of references to Triangles, the keys are those pairs of Points (let's call these pairs simply Sides), and the hash function takes into account the property that the hash of Side (a, b) should equal the hash of (b, a).
Some kind of algorithm:
Read 3 Points and create from them 3 Sides and a Triangle.
Add all of that to the hash map: map[side[i]].insert(triangle)
Repeat steps 1-2 until you have read all the data.
Now you have a map filled with data. About the complexity of filling: insertion into a hash map is constant time on average (it also depends on the hash function) and insertion into a set is logarithmic, so the complete complexity of filling the data is O(n*log m), where n is the number of Sides and m is the average number of Triangles sharing the same Side.
Normally each set contains only a couple of Triangles (a Side is usually shared by just two), so log m is relatively small (compared to n) and can be treated as a constant. This leads to a conclusion: the best-case complexity for filling is O(n) (no collisions, no rehashing, etc.) and the worst case is O(n*log n) (average-case insertion of n Sides into the map, with log n insertion into a single set, meaning all Triangles share the same Side).
Now to get all side-neighbours for some Triangle:
Get the 3 sets for the Sides of that Triangle (e.g., set[i] = map[triangle.sides[i]]).
Collect the other Triangles from those 3 sets (excluding the Triangle itself gives exactly its side-neighbours).
Done.
About the complexity of getting side-neighbours: it depends linearly on the sizes of the sets, which are relatively small compared to n in the normal case.
Note: to get point-neighbours instead of side-neighbours (calling any 2 Triangles with a common Point, not a common Side, neighbours), simply key the map by Points instead of Sides. The above assumptions about time complexity hold except for the constants.
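A compact sketch of this scheme, assuming the points have already been deduplicated to integer vertex indices (as in the previous answer), so a Side is just a pair of indices. Instead of a custom symmetric hash function, the key is normalized to (min, max) before hashing, which gives the same hash(a, b) == hash(b, a) behaviour; the neighbour query then collects the other triangles from the three per-side sets.
#include <cstdint>
#include <set>
#include <unordered_map>
#include <utility>
#include <vector>

using Side = std::uint64_t;                        // packed (min, max) vertex pair

inline Side makeSide(int a, int b)
{
    if (a > b) std::swap(a, b);                    // order-independent key
    return (static_cast<Side>(a) << 32) | static_cast<std::uint32_t>(b);
}

// side -> set of triangle indices that use it
using SideMap = std::unordered_map<Side, std::set<int>>;

void addTriangle(SideMap& map, int triIndex, int p0, int p1, int p2)
{
    map[makeSide(p0, p1)].insert(triIndex);
    map[makeSide(p0, p2)].insert(triIndex);
    map[makeSide(p1, p2)].insert(triIndex);
}

// Side-neighbours of a triangle: every other triangle sharing one of its sides.
std::vector<int> sideNeighbours(const SideMap& map, int triIndex,
                                int p0, int p1, int p2)
{
    std::vector<int> out;
    for (Side s : {makeSide(p0, p1), makeSide(p0, p2), makeSide(p1, p2)}) {
        auto it = map.find(s);
        if (it == map.end()) continue;
        for (int t : it->second)
            if (t != triIndex) out.push_back(t);   // exclude the triangle itself
    }
    return out;
}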
Consider this question in terms of graph theory:
Let G be a complete (every vertex is connected to all the other vertices) undirected graph on N vertices, so the distance matrix is N x N. Two "salesmen" travel this way: the first always visits the nearest unvisited vertex, the second the farthest, until they have both visited all the vertices. We must generate a matrix of distances and the starting points for the two salesmen (they can be different) such that:
All the distances are unique (edit: positive integers).
The distance from a vertex to itself is always 0.
The difference between the total distance covered by the two salesmen must be a specific number, D.
The distance from A to B is equal to the distance from B to A
What efficient algorithms can be useful here? I can only think of backtracking, but I don't see any way to reduce the work to be done by the program.
Geometry is helpful.
Using the distances between points on a circle seems like it would work. It seems you could adjust D by making the circle radius larger or smaller.
Alternatively, really any 2D shape where the distances are all different could probably be used as well. In this case you would scale the shape up or down to obtain the correct D.
Edit: Now that I think about it, the simplest solution may be to simply pick N random 2D points, say with 32-bit integer coordinates, to lower the chances of any two distances being too close to equal. If two distances are too close, just pick a different point for one of them until the set is valid.
Ideally, you'd then just need to work out a formula for the relationship between D and the scaling factor, which I'm not sure of offhand. If nothing else, you could also use binary search or interpolation search to find the scaling factor that produces the required D, but that's a slower method.
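A sketch of the "random points" idea from the edit, leaving the D search aside: draw N integer points and re-draw whenever a pairwise distance collides. Squared Euclidean distances are stored so the matrix entries stay positive integers (my assumption; plain Euclidean distances of integer points are generally irrational). The coordinate range and seed are arbitrary.
#include <cstdint>
#include <random>
#include <set>
#include <vector>

struct Pt { std::int64_t x, y; };

// Builds a symmetric matrix of unique, positive integer (squared) distances.
std::vector<std::vector<std::int64_t>> randomDistanceMatrix(int n, std::uint32_t seed = 42)
{
    std::mt19937 rng(seed);
    std::uniform_int_distribution<std::int64_t> coord(0, 1 << 20);
    auto sq = [](const Pt& a, const Pt& b) {
        std::int64_t dx = a.x - b.x, dy = a.y - b.y;
        return dx * dx + dy * dy;
    };

    std::vector<Pt> pts;
    std::set<std::int64_t> used;                        // all pairwise distances seen so far
    while (static_cast<int>(pts.size()) < n) {
        Pt p{coord(rng), coord(rng)};
        std::set<std::int64_t> fresh;                   // distances this candidate would add
        bool ok = true;
        for (const Pt& q : pts) {
            std::int64_t d = sq(p, q);
            if (d == 0 || used.count(d) || !fresh.insert(d).second) { ok = false; break; }
        }
        if (!ok) continue;                              // collision: pick a different point
        used.insert(fresh.begin(), fresh.end());
        pts.push_back(p);
    }

    std::vector<std::vector<std::int64_t>> dist(n, std::vector<std::int64_t>(n, 0));
    for (int i = 0; i < n; ++i)
        for (int j = i + 1; j < n; ++j)
            dist[i][j] = dist[j][i] = sq(pts[i], pts[j]);   // symmetric, zero diagonal
    return dist;
}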
I want a function that transforms an input (an n-sized vector) into one int.
Formally:
F: (x1,x2,...,xn) -> y
This could be something like:
http://en.wikipedia.org/wiki/G%C3%B6del_numbering
but it should be:
- unambiguous
- a big input must not produce a huge output number.
My use case is encoding graph neighbours. I just want to keep the edge information for a specific vertex.
The first idea was this: assume some vertex has the value 41 assigned to it. In binary representation that is 101001, which means this vertex is connected to vertices number 1, 4 and 6. Getting info about a neighbour is then simply tab[i] & (1 << j), where tab is a 1-D array of vertices that stores, for example, 41 for vertex i, and j is the number of the vertex being checked. But there is one problem with this solution: I can only encode information about 32 neighbours (the size of an int, 2^32), or am I wrong?
So I want a simple representation of a vertex's neighbours as a number, with a fast method of checking whether two vertices are adjacent. Does such a solution exist, or does someone have an idea how to deal with it?
Thanks for any help.
I don't think it is possible.
If you have k potential neighbors, then your representation needs 2^k possible values in order to map a value to every combination of neighbors. So you do need at least k bits to represent k neighbors.
Also, your proposed solution of using a 32-bit int to represent neighbors is not much different from an adjacency matrix. In fact it is exactly the same thing, except that you are using bits instead of booleans. You could use an array of bit vectors instead of a 2-D array of booleans.
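For illustration, a small sketch of that bit-vector idea: each row of the adjacency matrix is packed into 64-bit words, so a vertex can have arbitrarily many neighbours while the adjacency test stays the same kind of tab[i] & (1 << j) check. The class and method names are made up.
#include <cstdint>
#include <vector>

class BitAdjacency {
public:
    explicit BitAdjacency(int vertices)
        : words_((vertices + 63) / 64),
          rows_(vertices, std::vector<std::uint64_t>(words_, 0)) {}

    void addEdge(int i, int j)                  // undirected edge i -- j
    {
        rows_[i][j / 64] |= std::uint64_t(1) << (j % 64);
        rows_[j][i / 64] |= std::uint64_t(1) << (i % 64);
    }

    bool connected(int i, int j) const          // the tab[i] & (1 << j) test, generalized
    {
        return (rows_[i][j / 64] >> (j % 64)) & 1u;
    }

private:
    int words_;                                 // 64-bit words per row
    std::vector<std::vector<std::uint64_t>> rows_;
};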
How do I select a subset of points at a regular density? More formally:
Given
a set A of irregularly spaced points,
a metric of distance dist (e.g., Euclidean distance),
and a target density d,
how can I select a smallest subset B that satisfies the following?
for every point x in A,
there exists a point y in B
which satisfies dist(x,y) <= d
My current best shot is to
start with A itself
pick out the closest (or just particularly close) couple of points
randomly exclude one of them
repeat as long as the condition holds
and repeat the whole procedure a few times, keeping the best result. But are there better ways?
I'm trying to do this with 280,000 18-D points, but my question is about the general strategy, so I also wish to know how to do it with 2-D points. And I don't really need a guarantee of a smallest subset. Any useful method is welcome. Thank you.
bottom-up method
select a random point
among the unselected points, select the y for which min(d(x,y) for x in selected) is largest
keep going!
I'll call it bottom-up and the one I originally posted top-down. This is much faster in the beginning, so for sparse sampling this should be better?
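For reference, a minimal 2-D sketch of the bottom-up selection (essentially farthest-point sampling): keep each point's distance to its nearest selected point and repeatedly take the point where that distance is largest, stopping once it drops to the target density d. The brute-force update is O(n) per selected point; a spatial index would be needed for 280,000 18-D points. Names are made up.
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

struct P { double x, y; };

std::vector<int> bottomUpSelect(const std::vector<P>& pts, double d, int start = 0)
{
    const double inf = std::numeric_limits<double>::infinity();
    std::vector<double> nearest(pts.size(), inf);   // distance to the closest selected point
    std::vector<int> selected{start};

    auto update = [&](int s) {                      // refresh nearest[] after selecting point s
        for (std::size_t i = 0; i < pts.size(); ++i)
            nearest[i] = std::min(nearest[i],
                                  std::hypot(pts[i].x - pts[s].x, pts[i].y - pts[s].y));
    };
    update(start);

    for (;;) {
        int best = -1;                              // unselected point farthest from the selection
        for (std::size_t i = 0; i < pts.size(); ++i)
            if (best < 0 || nearest[i] > nearest[best]) best = static_cast<int>(i);
        if (nearest[best] <= d) break;              // every point is within d of the selection
        selected.push_back(best);
        update(best);
    }
    return selected;
}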
performance measure
If guarantee of optimality is not required, I think these two indicators could be useful:
radius of coverage: max {y in unselected} min(d(x,y) for x in selected)
radius of economy: min {y in selected} min(d(x,y) for x in selected, x != y)
RC is the minimum d that the current selection would still satisfy, and there is no absolute inequality between the two. But RC <= RE is more desirable.
my little methods
For a little demonstration of that "performance measure," I generated 256 2-D points distributed uniformly or by standard normal distribution. Then I tried my top-down and bottom-up methods with them. And this is what I got:
RC is red, RE is blue. X axis is number of selected points. Did you think bottom-up could be as good? I thought so watching the animation, but it seems top-down is significantly better (look at the sparse region). Nevertheless, not too horrible given that it's much faster.
Here I packed everything.
http://www.filehosting.org/file/details/352267/density_sampling.tar.gz
You can model your problem with a graph: treat the points as nodes, and connect two nodes with an edge if their distance is smaller than d. Now you need to find the minimum number of vertices such that they, together with their connected vertices, cover all the nodes of the graph. This is the minimum vertex cover problem (which is NP-hard in general), but you can use the fast 2-approximation: repeatedly take both endpoints of an edge into the vertex cover, then remove them from the graph.
P.S. Of course you should also select the nodes which are fully disconnected from the graph; after removing these nodes (i.e., selecting them), your problem is vertex cover.
A genetic algorithm may well produce good results here.
update:
I have been playing a little with this problem and these are my findings:
A simple method (call it random-selection) to obtain a set of points fulfilling the stated condition is as follows:
start with B empty
select a random point x from A and place it in B
remove from A every point y such that dist(x, y) < d
while A is not empty go to 2
A kd-tree can be used to perform the look ups in step 3 relatively fast.
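A minimal sketch of the random-selection procedure in 2-D, with a brute-force scan standing in for the kd-tree look-ups mentioned above (fine for small point counts; a kd-tree would replace the per-point distance test for large ones). Names and the seed parameter are made up.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <random>
#include <vector>

struct Pt2 { double x, y; };

std::vector<Pt2> randomSelection(std::vector<Pt2> a, double d, std::uint32_t seed = 1)
{
    std::mt19937 rng(seed);
    std::vector<Pt2> b;                                // step 1: start with B empty
    while (!a.empty()) {                               // step 4: while A is not empty
        std::uniform_int_distribution<std::size_t> pick(0, a.size() - 1);
        Pt2 x = a[pick(rng)];                          // step 2: random point x from A ...
        b.push_back(x);                                // ... placed in B
        a.erase(std::remove_if(a.begin(), a.end(),     // step 3: drop every y with dist(x, y) < d
                               [&](const Pt2& y) {
                                   return std::hypot(x.x - y.x, x.y - y.y) < d;
                               }),
                a.end());
    }
    return b;
}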
The experiments I have run in 2D show that the subsets generated are approximately half the size of the ones generated by your top-down approach.
Then I have used this random-selection algorithm to seed a genetic algorithm that resulted in a further 25% reduction on the size of the subsets.
For mutation, given a chromosome representing a subset B, I randomly choose a hyperball inside the minimal axis-aligned hyperbox that covers all the points in A. Then I remove from B all the points that are inside the hyperball and use random-selection to complete it again.
For crossover I employ a similar approach, using a random hyperball to divide the mother and father chromosomes.
I have implemented everything in Perl using my wrapper for the GAUL library (GAUL can be obtained from here).
The script is here: https://github.com/salva/p5-AI-GAUL/blob/master/examples/point_density.pl
It accepts a list of n-dimensional points from stdin and generates a collection of pictures showing the best solution for every iteration of the genetic algorithm. The companion script https://github.com/salva/p5-AI-GAUL/blob/master/examples/point_gen.pl can be used to generate the random points with a uniform distribution.
Here is a proposal which assumes the Manhattan distance metric:
Divide up the entire space into a grid of granularity d. Formally: partition A so that points (x1,...,xn) and (y1,...,yn) are in the same partition exactly when (floor(x1/d),...,floor(xn/d))=(floor(y1/d),...,floor(yn/d)).
Pick one point (arbitrarily) from each grid space -- that is, choose a representative from each set in the partition created in step 1. Don't worry if some grid spaces are empty! Simply don't choose a representative for this space.
Actually, the implementation won't have to do any real work to do step one, and step two can be done in one pass through the points, using a hash of the partition identifier (the (floor(x1/d),...,floor(xn/d))) to check whether we have already chosen a representative for a particular grid space, so this can be very, very fast.
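For illustration, a small sketch of that single pass, assuming points are plain vectors of doubles and using a string cell identifier for simplicity (a packed integer key would be faster).
#include <cmath>
#include <string>
#include <unordered_map>
#include <vector>

using Point = std::vector<double>;                 // n-dimensional point

std::vector<Point> gridRepresentatives(const std::vector<Point>& a, double d)
{
    std::unordered_map<std::string, Point> cell;   // cell id -> chosen representative
    for (const Point& p : a) {
        std::string id;                            // the partition identifier (floor(x1/d), ..., floor(xn/d))
        for (double v : p)
            id += std::to_string(static_cast<long long>(std::floor(v / d))) + ",";
        cell.emplace(id, p);                       // keeps the first point seen in each cell
    }
    std::vector<Point> b;
    for (auto& kv : cell) b.push_back(kv.second);  // one representative per non-empty cell
    return b;
}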
Some other distance metrics may be able to use an adapted approach. For example, the Euclidean metric could use d/sqrt(n)-size grids. In this case, you might want to add a post-processing step that tries to reduce the cover a bit (since the grids described above are no longer exactly radius-d balls -- the balls overlap neighboring grids a bit), but I'm not sure how that part would look.
To be lazy, this can be cast as a set cover problem, which can be handled by a mixed-integer problem solver/optimizer. Here is a GNU MathProg model for the GLPK LP/MIP solver. Here C[i] denotes which points can "satisfy" point i.
param N, integer, > 0;
set C{1..N};
var x{i in 1..N}, binary;
s.t. cover{i in 1..N}: sum{j in C[i]} x[j] >= 1;
minimize goal: sum{i in 1..N} x[i];
With 1,000 normally distributed points it did not find the optimum subset within 4 minutes, but it reported that it knew the true minimum, and its selection used only one more point than that.
My goal is a more efficient implementation of the algorithm posed in this question.
Consider two sets of points in N-space (3-space for the example case of RGB colorspace, while a solution for 1-space or 2-space differs only in the distance calculation). How do you find the point in the first set that is farthest from its nearest neighbor in the second set?
In a 1-space example, given the sets A:{2,4,6,8} and B:{1,3,5}, the answer would be
8, as 8 is 3 units away from 5 (its nearest neighbor in B) while all other members of A are just 1 unit away from their nearest neighbor in B. Edit: 1-space is overly simplified, as sorting is related to distance in a way that it is not in higher dimensions.
The solution in the source question involves a brute force comparison of every point in one set (all R,G,B where 512>=R+G+B>=256 and R%4=0 and G%4=0 and B%4=0) to every point in the other set (colorTable). Ignore, for the sake of this question, that the first set is elaborated programmatically instead of iterated over as a stored list like the second set.
First you need to find every element's nearest neighbor in the other set.
To do this efficiently you need a nearest neighbor algorithm. Personally I would implement a kd-tree just because I've done it in the past in my algorithm class and it was fairly straightforward. Another viable alternative is an R-tree.
Do this once for each element in the smaller set (take each element of the smaller set in turn and query the structure built on the larger one to find its nearest neighbor).
From this you should be able to get a list of nearest neighbors for each element.
While finding the pairs of nearest neighbors, keep them in a sorted data structure which has a fast addition method and a fast getMax method, such as a heap, sorted by Euclidean distance.
Then, once you're done simply ask the heap for the max.
The run time for this breaks down as follows:
N = size of smaller set
M = size of the larger set
N * O(log M + 1) for all the kd-tree nearest neighbor checks.
N * O(1) for calculating the Euclidean distance before adding it to the heap.
N * O(log N) for adding the pairs into the heap.
O(1) to get the final answer :D
So in the end the whole algorithm is O(N*(log M + log N)), which becomes O(N*log M) if you only keep the running max (see below) instead of a heap.
If you don't care about the order of each pair you can save a bit of time and space by only keeping the max found so far.
*Disclaimer: This all assumes you won't be using an enormously high number of dimensions and that your elements follow a mostly random distribution.
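A bare-bones sketch of the overall procedure in 3-space, with a brute-force scan standing in for the kd-tree / R-tree query (so the sketch is O(N*M) rather than O(N*log M)) and only the running maximum kept, as suggested above; names are made up.
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

struct P3 { double x, y, z; };                     // 3-space (RGB) example

int farthestFromNearest(const std::vector<P3>& a, const std::vector<P3>& b)
{
    auto dist = [](const P3& p, const P3& q) {
        return std::sqrt((p.x - q.x) * (p.x - q.x) +
                         (p.y - q.y) * (p.y - q.y) +
                         (p.z - q.z) * (p.z - q.z));
    };
    int best = -1;
    double bestNearest = -1.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        double nearest = std::numeric_limits<double>::infinity();
        for (const P3& q : b)                       // a kd-tree query would replace this loop
            nearest = std::min(nearest, dist(a[i], q));
        if (nearest > bestNearest) { bestNearest = nearest; best = static_cast<int>(i); }
    }
    return best;                                    // index into the first set of the answer point
}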
The most obvious approach seems to me to be to build a tree structure on one set to allow you to search it relatively quickly. A kd-tree or similar would probably be appropriate for that.
Having done that, you walk over all the points in the other set and use the tree to find their nearest neighbour in the first set, keeping track of the maximum as you go.
It's n log(n) to build the tree and log(n) per search, so the whole thing should run in n log(n).
To make things more efficient, consider using a Pigeonhole algorithm - group the points in your reference set (your colorTable) by their location in n-space. This allows you to efficiently find the nearest neighbour without having to iterate all the points.
For example, if you were working in 2-space, divide your plane into a 5 x 5 grid, giving 25 squares, with 25 groups of points.
In 3 space, divide your cube into a 5 x 5 x 5 grid, giving 125 cubes, each with a set of points.
Then, to test point n, find the square/cube/group that contains n and test distance to those points. You only need to test points from neighbouring groups if point n is closer to the edge than to the nearest neighbour in the group.
For each point in set B, find the distance to its nearest neighbor in set A.
To find the distance to each nearest neighbor, you can use a kd-tree as long as the number of dimensions is reasonable, there aren't too many points, and you will be doing many queries - otherwise it will be too expensive to build the tree to be worthwhile.
Maybe I'm misunderstanding the question, but wouldn't it be easiest to just reverse the sign on all the coordinates in one data set (i.e. multiply one set of coordinates by -1), then find the first nearest neighbour (which would be the farthest neighbour)? You can use your favourite knn algorithm with k=1.
EDIT: I meant nlog(n) where n is the sum of the sizes of both sets.
In the 1-space case, I think you could do something like this (pseudocode):
Use a structure like this
struct Item {
    int value;
    int setid;
};
(1) MaxDistance = 0
(2) Read all the sets into Item structures
(3) Create an array of pointers to all the Items
(4) Sort the array of pointers by the Item->value field of the structure
(5) Walk the array from beginning to end, checking whether the Item->setid differs from the previous Item->setid
    if (the setids are different)
        check whether this distance is greater than MaxDistance; if so, set MaxDistance to this distance
(6) Return MaxDistance