Find areas in a matrix?

Let's say I have a very big matrix, 10000x10000 elements, all having the value '0'. And let's say there are some big 'nests' of '1's. Those areas might even be connected, but only very weakly, through a 'pipe' of '1's.
I want an algorithm that very quickly (and dirty, if necessary) finds these 'nests' of '1's. It shouldn't 'cut apart' two weakly connected 'nests'.
Any idea how I should design such an algorithm?

Maybe a pathfinding algorithm like A* (or something simpler like BFS or DFS) would work here.
You can:
find starting points for your searches by locating small nests (ignoring pipes), say at least a 3x3 block of 1's;
then pathfind from there through 1's until you have exhausted your "connected component" (poetic license) inside the matrix;
repeat, starting from another small block of 1's (a sketch of the search follows).
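For illustration, here is a minimal sketch in Java, assuming the matrix is a byte[][] of 0s and 1s and that a seed cell inside a nest (e.g. the centre of a 3x3 block of 1's) has already been found; a plain BFS over 4-connected 1-cells then collects the whole connected component, pipes included, so weakly connected nests are not cut apart. All names are illustrative.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class NestSearch {
    // Collect every 1-cell reachable from (startR, startC) via 4-connected steps.
    static List<int[]> collectNest(byte[][] m, boolean[][] visited, int startR, int startC) {
        List<int[]> nest = new ArrayList<>();
        Deque<int[]> queue = new ArrayDeque<>();
        queue.add(new int[]{startR, startC});
        visited[startR][startC] = true;
        int[] dr = {1, -1, 0, 0}, dc = {0, 0, 1, -1};
        while (!queue.isEmpty()) {
            int[] cell = queue.poll();
            nest.add(cell);
            for (int k = 0; k < 4; k++) {
                int r = cell[0] + dr[k], c = cell[1] + dc[k];
                if (r >= 0 && r < m.length && c >= 0 && c < m[0].length
                        && !visited[r][c] && m[r][c] == 1) {
                    visited[r][c] = true;
                    queue.add(new int[]{r, c});
                }
            }
        }
        return nest;
    }
}
```

Calling this once per unvisited seed block gives one component per (weakly connected) group of nests; the shared visited array keeps already-found components from being searched again.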

I would say it depends on how the data is needed. If, given two points, you need to check whether they are in the same block of 1's, I think @Jack's answer is best. This is also true if you have some knowledge of where the blocks are initially, as you can use those as starting points for your algorithm.
If you don't have any other information, maybe one of these would be a possibility:
If, given a point, you wish to find all elements in the same block, a flood fill would be appropriate. You could then cache each nest as you find it; when you get another point, first see if it's in a known nest, and if it isn't, do a flood fill to find its nest and add that to the cache.
As an implementation detail, as you traverse the matrix each row should have available the set of nests present on the previous row. Then you would only need to check new points against those nests, rather than the complete set, to determine if a new point is in a known set or not.
Be sure that you use a set implementation with a very low lookup cost such as a hashtable or possibly a Bloom filter if you can deal with the probabilistic effects.
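As a sketch of that row-by-row traversal (my own reading of the paragraph above, not code from the answer): scan the matrix once and, for each 1-cell, look only at its upper and left neighbours, i.e. the nests already seen on the previous row and column; a small union-find merges labels when a cell touches two different nests. This is the classic connected-component labelling scheme.

```java
class NestLabeller {
    // Returns a label per cell: 0 for background, otherwise a nest id; weakly connected
    // nests share one id because the pipe cells join their labels.
    static int[][] labelNests(byte[][] m) {
        int rows = m.length, cols = m[0].length;
        int[][] label = new int[rows][cols];
        int[] parent = new int[rows * cols / 2 + 2];   // enough provisional labels for the worst case
        int next = 1;
        for (int r = 0; r < rows; r++)
            for (int c = 0; c < cols; c++) {
                if (m[r][c] != 1) continue;
                int up = r > 0 ? label[r - 1][c] : 0;
                int left = c > 0 ? label[r][c - 1] : 0;
                if (up == 0 && left == 0) {            // touches no known nest: start a new one
                    parent[next] = next;
                    label[r][c] = next++;
                } else if (up == 0 || left == 0) {     // touches exactly one nest: join it
                    label[r][c] = Math.max(up, left);
                } else {                               // touches two nests: merge their labels
                    label[r][c] = find(parent, up);
                    parent[find(parent, left)] = find(parent, up);
                }
            }
        for (int r = 0; r < rows; r++)                 // second pass: resolve merged labels
            for (int c = 0; c < cols; c++)
                if (label[r][c] != 0) label[r][c] = find(parent, label[r][c]);
        return label;
    }

    static int find(int[] parent, int x) {
        while (parent[x] != x) x = parent[x] = parent[parent[x]];  // path halving
        return x;
    }
}
```

After this, answering "which nest is point (r, c) in?" is a single array lookup, which is the cheap-lookup property the paragraph above asks for.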

1. Turn the matrix into a black & white bitmap.
2. Scale the matrix so that nests of size N become a single pixel (so if you look for 10x10 nests, scale by a factor of N=10); a sketch of this step follows the list.
3. Use the remaining pixels of the output to locate the nests. Use the center coordinate (multiplied by the factor above) to locate the same nest in the matrix.
4. Use a low-pass filter to get rid of all "pipes" that connect the nests.
5. Find the borders of the nests with a contrast filter on the bitmap.
6. Create a bitmap which doesn't contain the nests (i.e. set all pixels of the nests to 0).
7. Use a filter that widens single pixels to grow the outline of the nests.
8. Bitwise AND the outputs of steps 7 and 5 to get the connection points of all pipes.
9. Follow the pipes to see how they connect the nests.
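A minimal Java sketch of step 2, under the assumption that a block which is at least half filled with 1's should survive as a nest pixel (the 50% threshold and the handling of partial blocks at the edges are my own choices, not from the answer):

```java
class NestBitmap {
    // Shrink the 0/1 matrix by a factor n: each n x n block becomes one output pixel,
    // set only if the block is mostly filled. Thin pipes (much narrower than n) mostly
    // vanish here, which already does part of the low-pass work of step 4.
    static boolean[][] downscale(byte[][] matrix, int n) {
        int rows = matrix.length / n, cols = matrix[0].length / n;  // partial edge blocks are ignored in this sketch
        boolean[][] out = new boolean[rows][cols];
        for (int r = 0; r < rows; r++)
            for (int c = 0; c < cols; c++) {
                int ones = 0;
                for (int dr = 0; dr < n; dr++)
                    for (int dc = 0; dc < n; dc++)
                        if (matrix[r * n + dr][c * n + dc] == 1) ones++;
                out[r][c] = ones >= n * n / 2;
            }
        return out;
    }
}
```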


How to find the nearest neighbor of a sparse vector

I have about 500 vectors, each a 1500-dimensional vector,
and almost every vector is very sparse: only about 30-70 of its dimensions are non-zero.
Now the problem is that I'm given another vector, also 1500-dimensional, and I need to compare it against the 500 vectors to find which of the 500 is the nearest one (in Euclidean distance).
There is no doubt that the brute-force method is a solution, but it means calculating the distance 500 times, which takes a long time.
Yesterday I read the paper "Object retrieval with large vocabularies and fast spatial matching", which says that using an inverted index will help.
But after my tests it made almost no difference: imagine a 1500-dimensional vector in which 50 dimensions are non-zero; compared with another such vector, the two will almost always share some non-zero dimension. In other words, the index can only rule out a few vectors, and I still need to compare against most of the ones that are left.
Thank you for reading this far. My questions are:
1. Does this algorithm make sense?
2. Is there any other way to do what I want, such as FLANN or a k-d tree? Note that I want the exact nearest neighbor; an approximate one is not enough.
This kind of index is called an inverted list (or inverted index), and it is commonly used for text.
For example, Apache Lucene uses this kind of indexing for text similarity search.
Essentially, you use a columnar layout, and you only store the non-zero values. For on-disk efficiency, various compression techniques can be employed.
You can then compute many similarities using set operations on these lists.
k-d-trees cannot be used here. They will be extremely inefficient if you have many duplicate (zero) values.
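To make this concrete, here is a sketch (my own illustration, not code from the answer or the paper): one posting list per dimension holding (vectorId, value) pairs for the non-zero entries, plus each vector's squared norm. For a query q you accumulate dot products only over q's non-zero dimensions and then use dist² = |q|² + |v|² − 2·q·v, which gives exact Euclidean distances while only ever touching non-zero entries.

```java
import java.util.ArrayList;
import java.util.List;

class SparseNearestNeighbour {
    static class Posting { final int id; final double val; Posting(int id, double val) { this.id = id; this.val = val; } }

    final List<Posting>[] postings;   // one posting list per dimension
    final double[] sqNorm;            // |v|^2 for every stored vector

    @SuppressWarnings("unchecked")
    SparseNearestNeighbour(double[][] vectors) {
        int dims = vectors[0].length;
        postings = new List[dims];
        for (int d = 0; d < dims; d++) postings[d] = new ArrayList<>();
        sqNorm = new double[vectors.length];
        for (int id = 0; id < vectors.length; id++)
            for (int d = 0; d < dims; d++)
                if (vectors[id][d] != 0.0) {
                    postings[d].add(new Posting(id, vectors[id][d]));
                    sqNorm[id] += vectors[id][d] * vectors[id][d];
                }
    }

    // Exact nearest neighbour by squared Euclidean distance.
    int nearest(double[] q) {
        double[] dot = new double[sqNorm.length];
        double qNorm = 0;
        for (int d = 0; d < q.length; d++) {
            if (q[d] == 0.0) continue;                 // only walk posting lists of the query's non-zero dims
            qNorm += q[d] * q[d];
            for (Posting p : postings[d]) dot[p.id] += q[d] * p.val;
        }
        int best = -1;
        double bestDist = Double.MAX_VALUE;
        for (int id = 0; id < sqNorm.length; id++) {
            double dist = qNorm + sqNorm[id] - 2 * dot[id];
            if (dist < bestDist) { bestDist = dist; best = id; }
        }
        return best;
    }
}
```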
I don't know your context, but if you don't mind a long preprocessing step and you have to make this check often and fast, you can build a neighborhood graph with each point's neighbors sorted by distance.
To build this graph efficiently you can sort the points by taxicab distance or by squared distance, which avoids heavy calculations (squared distance preserves the Euclidean ordering; taxicab distance only approximates it).
Then if you want the nearest neighbor you just have to pick the first neighbor :p.
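A minimal sketch of that precomputation (names are mine); note it only answers "nearest neighbour of a stored vector", so it pays off when the queries themselves come from the stored set and recur:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class NeighbourhoodGraph {
    final Integer[][] sortedNeighbours;   // sortedNeighbours[i] = indices of all other vectors, nearest first

    NeighbourhoodGraph(double[][] vectors) {
        int n = vectors.length;
        sortedNeighbours = new Integer[n][];
        for (int i = 0; i < n; i++) {
            final int src = i;
            List<Integer> others = new ArrayList<>();
            for (int j = 0; j < n; j++) if (j != i) others.add(j);
            // Squared distance gives the same ordering as Euclidean distance, with no square roots.
            others.sort(Comparator.comparingDouble(j -> squaredDistance(vectors[src], vectors[j])));
            sortedNeighbours[i] = others.toArray(new Integer[0]);
        }
    }

    static double squaredDistance(double[] a, double[] b) {
        double s = 0;
        for (int d = 0; d < a.length; d++) s += (a[d] - b[d]) * (a[d] - b[d]);
        return s;
    }

    int nearestOf(int i) { return sortedNeighbours[i][0]; }   // "pick the first neighbour"
}
```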

How to do the final steps of the k-means algorithm

I'm having a problem with my program. I am trying to implement k-means (manually) by clustering a set of RGB values into 3 clusters. I don't need help with coding, just with understanding. So far I have done this:
Created 3 cluster objects, each with a mean and an array to hold that clusters members.
Imported the text file and saved the RGB values in an array.
Looped over the array and, for each stored RGB value, found the cluster mean at the smallest Euclidean distance.
Added each RGB value to the array of the cluster with the closest mean.
I have done research and I can’t seem to understand the next step. Research has suggested:
In each cluster, add all of the RGB values together and divide by the number of values in that cluster, then update the mean with that value; or
Find the average distance between all RGB values and the mean, then update the mean with that value; or
Update the mean each time you add an RGB value to the cluster…
I just can’t seem to understand the last steps, thanks.
Once you have a per-cluster array, recompute the "cluster center" as the average of the items in that cluster. Then, reassign every item to the appropriate cluster (the one whose "center" it's closest to after the recomputation). You're done when no item changes cluster (it's theoretically possible to end up in a situation where one item keeps flipping between two clusters, with generalized distance measures -- this can be detected to stop the loop anyway -- but I don't think it can happen with Euclidean distances).
IOW, that would be the first one of your three alternatives. I'm not even sure what you mean by the second alternative; the third one would likely not be stable, and depend on the arbitrary order of items, so I feel strongly against it.
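A compact sketch of that loop in Java (the names and the random initialisation are mine, not the asker's): assign every RGB value to the closest mean, recompute each mean as the average of its members, and stop when no assignment changes, with an iteration cap as a guard against the flip-flopping case mentioned above.

```java
import java.util.Arrays;
import java.util.Random;

class KMeansSketch {
    // rgb: one double[3] per value; k: number of clusters. Returns a cluster index per value.
    static int[] cluster(double[][] rgb, int k) {
        Random rnd = new Random(42);
        double[][] means = new double[k][];
        for (int c = 0; c < k; c++) means[c] = rgb[rnd.nextInt(rgb.length)].clone(); // crude init: random samples
        int[] assign = new int[rgb.length];
        Arrays.fill(assign, -1);
        boolean changed = true;
        for (int iter = 0; changed && iter < 100; iter++) {   // cap guards against rare flip-flopping
            changed = false;
            // Step 1: assign each RGB value to the closest mean (squared Euclidean distance).
            for (int i = 0; i < rgb.length; i++) {
                int best = 0;
                double bestD = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double d = 0;
                    for (int j = 0; j < 3; j++) d += (rgb[i][j] - means[c][j]) * (rgb[i][j] - means[c][j]);
                    if (d < bestD) { bestD = d; best = c; }
                }
                if (assign[i] != best) { assign[i] = best; changed = true; }
            }
            // Step 2: recompute each mean as the component-wise average of its members.
            for (int c = 0; c < k; c++) {
                double[] sum = new double[3];
                int count = 0;
                for (int i = 0; i < rgb.length; i++)
                    if (assign[i] == c) { count++; for (int j = 0; j < 3; j++) sum[j] += rgb[i][j]; }
                if (count > 0) for (int j = 0; j < 3; j++) means[c][j] = sum[j] / count;  // empty cluster: keep old mean
            }
        }
        return assign;
    }
}
```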

Simple k-nearest-neighbor algorithm for euclidean data with variable density?

An elaboration on this question, but with more constraints.
The idea is the same, to find a simple, fast algorithm for k-nearest-neighbors in 2 euclidean dimensions. The bucketing grid seems to work nicely if you can find a grid size that will suitably partition your data. However, what if the data is not uniformly distributed, but has areas with both very high and very low density (for example, the US population), so that no fixed grid size could guarantee both enough neighbors and efficiency? Can this method still be salvaged?
If not, other suggestions would be helpful, though I hope for answers less complex than moving to kd-trees, etc.
If you don't have too many elements, just compare each with all the others. This can be a lot faster than you'd think; today's machines are fast. Unfortunately, the quadratic factor will catch you sooner or later; I figure a linear search of a million objects won't take too long, so you may be okay with up to 1000 elements (1000 elements means about a million pairwise comparisons). Using a grid, or even stripes, might boost that number substantially.
But I think you're stuck with a quadtree (a specific form of k-d tree). Your whole map is one block, which can contain four subblocks (upper left, upper right, lower left, lower right). When a block fills up with more elements than you want to do a linear search on, break it into smaller ones and transfer the elements. (Only leaf nodes have elements.) It's easy to search within a given radius of a given point: start at the top, and if a part of a block is within range of the point, check its subblocks the same way if it has them; if it doesn't, check its elements.
(When searching for "closest", take care. The square grid means a nearer object might be in a farther block. You have to get everything within a given radius, then check 'em all. If you want the 10 closest and your radius of 20 only picked up 5, you need to try a larger radius. You may have a rejected item that proved to be 30 away and think you should grab it and a few others to make up your 10. However, there may be a few items at 25 away whose whole blocks were rejected, and you want them instead. There ought to be a better solution for this, but I haven't figured it out yet. I just make a guess at the radius and double it till I get enough.)
Quadtrees are fun. If you can set up your data and then access it, it's easy. The problems come when your mapped elements appear, disappear, and move while you are trying to figure out who's near what.
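For reference, a minimal point-quadtree sketch along those lines; the capacity, the split rule and the radius query are my own choices for illustration, and the "grow the radius until you have enough" loop described above would sit on top of queryRadius.

```java
import java.util.ArrayList;
import java.util.List;

class QuadTree {
    static final int CAPACITY = 16;            // max points per leaf before it splits
    final double x, y, half;                   // centre and half-width of this block
    List<double[]> points = new ArrayList<>(); // leaf storage: {px, py}; null once split
    QuadTree[] children;                       // null while this node is a leaf

    QuadTree(double x, double y, double half) { this.x = x; this.y = y; this.half = half; }

    void insert(double px, double py) {
        if (children != null) { child(px, py).insert(px, py); return; }
        points.add(new double[]{px, py});
        if (points.size() > CAPACITY && half > 1e-9) {   // split a full leaf (but stop splitting tiny blocks)
            children = new QuadTree[]{
                new QuadTree(x - half / 2, y - half / 2, half / 2),
                new QuadTree(x + half / 2, y - half / 2, half / 2),
                new QuadTree(x - half / 2, y + half / 2, half / 2),
                new QuadTree(x + half / 2, y + half / 2, half / 2)};
            for (double[] p : points) child(p[0], p[1]).insert(p[0], p[1]);
            points = null;
        }
    }

    private QuadTree child(double px, double py) {
        return children[(px < x ? 0 : 1) + (py < y ? 0 : 2)];
    }

    // Collect all points within radius r of (qx, qy), skipping blocks that cannot intersect the circle.
    void queryRadius(double qx, double qy, double r, List<double[]> out) {
        if (Math.abs(qx - x) > half + r || Math.abs(qy - y) > half + r) return;   // block is too far away
        if (children == null) {
            for (double[] p : points)
                if ((p[0] - qx) * (p[0] - qx) + (p[1] - qy) * (p[1] - qy) <= r * r) out.add(p);
        } else {
            for (QuadTree c : children) c.queryRadius(qx, qy, r, out);
        }
    }
}
```

As the answer suggests, you would call queryRadius with a guessed radius and double it until at least k points come back, then sort those by distance and keep the k closest.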
Have you looked at this?
http://www.cs.sunysb.edu/~algorith/major_section/1.4.shtml
kd-trees are quite simple to implement; there are standard Java/C implementations.
Also:
You may want to post your question here:
https://cstheory.stackexchange.com/?as=1

Collision detection complexity <O(n²): Simpler approach than grid, quadtrees, BSP?

I have a large number of objects (balls, for a start) that move stepwise in space, one at a time, and must not overlap. Currently, for every move I check for collision with every other object. Several other questions here deal with this; however, I thought of a seemingly simple solution that does not seem to come up in this context, and I wonder why.
Why not simply keep 2 collections (for 2D, or 3 in three dimensions) of all objects, sorted by the x and y (and z) coordinate, respectively, and at every move look up all other objects within a given distance (ball diameter here) in each dimension and do the actual collision check only on objects in both (or all 3) result sets?
I realize this only works for equally-sized objects, but alternatively one could use twice as many collections, sorted by the (1) highest and (2) lowest coordinate of each object for each dimension. Any reason why this would not work, or would give significantly less of an improvement than going from the O(n)-per-move "pairwise check" to a "grid method" or quad-/octrees? I see the update of these sorted collections as the costly operation here, but using e.g. a TreeSet (my implementation would be in Java) it should still be only O(log n) per update, i.e. significantly less than O(n), right?
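To make the question's idea concrete, a rough Java sketch for equal-sized balls (my own, not the asker's code): the question proposes intersecting an x-sorted and a y-sorted view; the sketch keeps only the x-sorted TreeMap to narrow down candidates and then checks y and the exact distance directly, which yields the same collision answer. Removing and re-inserting a ball when it moves is the O(log n) update cost mentioned above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SortedAxisBroadPhase {
    final TreeMap<Double, List<double[]>> byX = new TreeMap<>();   // x-coordinate -> balls {x, y} at that x

    void add(double x, double y) {
        byX.computeIfAbsent(x, k -> new ArrayList<>()).add(new double[]{x, y});
    }

    // True if any other ball's centre lies within 'diameter' of (x, y).
    boolean collides(double x, double y, double diameter) {
        for (Map.Entry<Double, List<double[]>> e : byX.subMap(x - diameter, true, x + diameter, true).entrySet())
            for (double[] b : e.getValue()) {
                if (b[0] == x && b[1] == y) continue;   // crude self-exclusion for the moving ball
                double dx = b[0] - x, dy = b[1] - y;
                if (Math.abs(dy) <= diameter && dx * dx + dy * dy < diameter * diameter) return true;
            }
        return false;
    }
}
```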
The check for which objects are in both result sets involves looking at all objects in two strips of the plane. That is a much larger area, and therefore involves more objects, than the enclosing square that a quadtree lets you immediately narrow down to. More objects means that it is slower.
You want to use a spatial index or a space-filling curve instead of a quadtree. An SFC reduces the 2D problem to a 1D one and differs from a quadtree in that it can only store one object per (x, y) pair (I think). Maybe this works for your problem? Search for Nick's Hilbert curve quadtree spatial index blog.
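For illustration, the simplest space-filling-curve key is a Morton (Z-order) key rather than the Hilbert curve that blog uses (my substitution, chosen because it is a few lines of bit twiddling): interleave the coordinate bits and store the objects in any ordinary sorted structure keyed by the result, so nearby points usually (not always) get nearby keys.

```java
class MortonKey {
    // Morton (Z-order) key: interleave the bits of x and y so a 2D position maps to one sortable long.
    // Coordinates are assumed to fit in 16 bits for this sketch.
    static long of(int x, int y) {
        long key = 0;
        for (int i = 0; i < 16; i++) {
            key |= (long) ((x >> i) & 1) << (2 * i);       // even bit positions carry x
            key |= (long) ((y >> i) & 1) << (2 * i + 1);   // odd bit positions carry y
        }
        return key;
    }
}
```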

Averaging a set of points on a Google Map into a smaller set

I'm displaying a small Google map on a web page using the Google Maps Static API.
I have a set of 15 co-ordinates, which I'd like to represent as points on the map.
Due to the map being fairly small (184 x 90 pixels) and the upper limit of 2000 characters on a Google Maps URL, I can't represent every point on the map.
So instead I'd like to generate a small list of co-ordinates that represents an average of the big list.
So instead of having 15 sets, I'd end up with 5 sets whose positions approximate the positions of the 15. Say there are 3 points that are in closer proximity to each other than to any other point on the map; those points would be collapsed into 1 point.
So I guess I'm looking for an algorithm that can do this.
Not asking anyone to spell out every step, but perhaps point me in the direction of a mathematical principle or general-purpose function for this kind of thing?
I'm sure a similar function is used in, say, graphics software, when pixellating an image.
(If I solve this I'll be sure to post my results.)
I recommend K-means clustering when you need to cluster N objects into a known number K < N of clusters, which seems to be your case. Note that one cluster may end up with a single outlier point and another with say 5 points very close to each other: that's OK, it will look closer to your original set than if you forced exactly 3 points into every cluster!-)
If you are searching for such functions/classes, have a look at MarkerClusterer and MarkerManager utility classes. MarkerClusterer closely matches the described functionality, as seen in this demo.
In general I think the area you need to search around in is "Vector Quantization". I've got an old book titled Vector Quantization and Signal Compression by Allen Gersho and Robert M. Gray which provides a bunch of examples.
From memory, the Lloyd Iteration was a good algorithm for this sort of thing. It can take the input set and reduce it to a fixed sized set of points. Basically, uniformly or randomly distribute your points around the space. Map each of your inputs to the nearest quantized point. Then compute the error (e.g. sum of distances or Root-Mean-Squared). Then, for each output point, set it to the center of the set that maps to it. This will move the point and possibly even change the set that maps to it. Perform this iteratively until no changes are detected from one iteration to the next.
Hope this helps.
