Simple k-nearest-neighbor algorithm for euclidean data with variable density? - performance

An elaboration on this question, but with more constraints.
The idea is the same: to find a simple, fast algorithm for k-nearest-neighbors in two Euclidean dimensions. The bucketing grid seems to work nicely if you can find a grid size that will suitably partition your data. However, what if the data is not uniformly distributed, but has areas with both very high and very low density (for example, the US population), so that no fixed grid size can guarantee both enough neighbors and efficiency? Can this method still be salvaged?
If not, other suggestions would be helpful, though I hope for answers less complex than moving to kd-trees, etc.

If you don't have too many elements, just compare each point with all the others. This can be a lot faster than you'd think; today's machines are fast. Unfortunately, the quadratic cost will catch you sooner or later; I figure a million distance checks won't take too long, so you may be okay with up to about 1,000 elements (1,000 squared is a million comparisons). Using a grid, or even stripes, might boost that number substantially.
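For reference, the brute-force search is tiny; here's a rough sketch in Python (the point format and function name are just illustrative):

```python
import math

# Brute-force k-NN over a list of (x, y) tuples: sort by distance to the
# query point and keep the k closest. O(n log n) per query as written.
def knn_brute(points, query, k):
    return sorted(points, key=lambda p: math.hypot(p[0] - query[0],
                                                   p[1] - query[1]))[:k]
```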
But I think you're stuck with a quadtree (a close relative of the k-d tree). Your whole map is one block, which can contain four sub-blocks (upper left, upper right, lower left, lower right). When a block fills up with more elements than you want to search linearly, break it into smaller blocks and transfer the elements. (Only leaf nodes hold elements.) It's easy to search within a given radius of a given point: start at the top, and if any part of a block is within range of the point, check its sub-blocks the same way if it has them; if it doesn't, check its elements.
(When searching for the "closest", take care: the square grid means a nearer object might live in a farther block. You have to collect everything within a given radius, then check them all. If you want the 10 closest and your radius of 20 only picked up 5, you need to try a larger radius. You may have a rejected item that turned out to be 30 away and be tempted to grab it and a few others to make up your 10, but there may be items only 25 away whose whole blocks were rejected, and you want those instead. There ought to be a better solution for this, but I haven't figured it out yet; I just guess a radius and double it until I get enough.)
Quadtrees are fun. If you can set up your data and then access it, it's easy. The problems come when your mapped elements appear, disappear, and move while you are trying to figure out who's near what.
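A minimal sketch of that structure, plus the guess-a-radius-and-double-it loop, might look like this (Python; the class and method names are illustrative, not any particular library, and degenerate cases such as many coincident points aren't handled):

```python
class Quadtree:
    """One block of the map; splits into four sub-blocks when it fills up."""
    def __init__(self, x0, y0, x1, y1, capacity=8):
        self.x0, self.y0, self.x1, self.y1 = x0, y0, x1, y1
        self.capacity = capacity
        self.points = []      # only leaf nodes hold points
        self.children = None  # four sub-blocks once split

    def insert(self, p):
        if self.children is not None:
            self._child_for(p).insert(p)
            return
        self.points.append(p)
        if len(self.points) > self.capacity:
            self._split()

    def _split(self):
        mx, my = (self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2
        self.children = [
            Quadtree(self.x0, self.y0, mx, my, self.capacity),  # lower left
            Quadtree(mx, self.y0, self.x1, my, self.capacity),  # lower right
            Quadtree(self.x0, my, mx, self.y1, self.capacity),  # upper left
            Quadtree(mx, my, self.x1, self.y1, self.capacity),  # upper right
        ]
        for p in self.points:
            self._child_for(p).insert(p)
        self.points = []

    def _child_for(self, p):
        mx, my = (self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2
        return self.children[(1 if p[0] >= mx else 0) + (2 if p[1] >= my else 0)]

    def query_radius(self, cx, cy, r, out=None):
        """Collect every stored point within distance r of (cx, cy)."""
        if out is None:
            out = []
        # Reject blocks that cannot intersect the search circle.
        nx = min(max(cx, self.x0), self.x1)
        ny = min(max(cy, self.y0), self.y1)
        if (nx - cx) ** 2 + (ny - cy) ** 2 > r * r:
            return out
        if self.children is not None:
            for c in self.children:
                c.query_radius(cx, cy, r, out)
        else:
            out.extend(p for p in self.points
                       if (p[0] - cx) ** 2 + (p[1] - cy) ** 2 <= r * r)
        return out


def knn(tree, cx, cy, k, r0=1.0):
    # Guess a radius and double it until at least k points turn up, then
    # keep the k nearest (assumes the tree holds at least k points).
    r = r0
    while True:
        hits = tree.query_radius(cx, cy, r)
        if len(hits) >= k:
            hits.sort(key=lambda p: (p[0] - cx) ** 2 + (p[1] - cy) ** 2)
            return hits[:k]
        r *= 2
```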

Have you looked at this?
http://www.cs.sunysb.edu/~algorith/major_section/1.4.shtml
kd-trees are quite simple to implement, and there are standard Java/C implementations.
Also:
You may want to post your question here:
https://cstheory.stackexchange.com/?as=1

Related

Minimal data structure to prevent 2D-grid traveler from repeat itself

I'm sorry if this is a duplicate of some thread, but I'm really not sure how to describe the question.
I'm wondering what is the minimal data structure needed to prevent a 2D-grid traveler from repeating itself (i.e., traveling to some point it has already visited). The traveler can only move horizontally or vertically, one step at a time. For my particular case (below), the 2D grid is actually a lower-left triangle where one coordinate never exceeds the other.
For example, in the 1D case this can be done simply by recording the direction of the last move: if the direction changes, the traveler is repeating itself.
For the 2D case it becomes complicated. The most trivial way would be to keep a list of the points visited so far, but I'm wondering whether there are more efficient ways to do that.
I'm implementing a more-or-less "4-finger" algorithm for 4-sum, where the 2 fingers in the middle move in two directions (the fingers being i, j, k, and l):
i=> <=j=> <=k=> <=l
1 2 3 ... 71 72 ... 123 124 ... 201 202 203
The directions the fingers travel are decided (or suggested) by some algorithm but might lead to an infinite loop. Therefore, I have to refuse a suggestion if the 2 fingers in the middle start to repeat a previous position.
EDIT
In the days since, I have found 2 solutions. Neither is an ideal solution to this problem, but they're at least somewhat usable:
As @Sorin mentions below, one solution is to store a bit array representing the state of all cells. For the triangular grid here, we can even condense the array to cut the memory cost in half (though computing the bit position then takes k^2 time, where k is the degree of freedom, i.e. 2 here; a standard array needs only linear time).
Another solution is to directly avoid backward travel: set up the algorithm so that j and k only move in one direction (this is probably greedy).
But since the 2D-grid traveler has the nice property that it moves along an axis one step at a time, I'm still wondering whether there are more "specialized" representations for this kind of movement.
Thanks for your help!
If you are looking for optimal lookup complexity, then a hash set is the best thing: you need O(N) memory, but all lookups and insertions are O(1).
If you tend to visit most of the cells, you can even skip the hashing and store a plain bit array, i.e. one bit per cell, and just check whether the corresponding bit is 0 or 1. This is much more compact in memory (at least 32x, one bit vs. one int, and likely more since you also skip the pointers internal to the data structure, 64 bits each).
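For the triangular grid in the question, a condensed bit array might look roughly like this (a sketch; the class name and the i*(i+1)/2 + j indexing are my own assumptions, not anything standard):

```python
# Bit-array "visited" set for a lower-left triangular grid (cells (i, j)
# with j <= i), storing only n*(n+1)/2 bits for an n-by-n triangle.
class TriangularVisited:
    def __init__(self, n):
        self.bits = bytearray((n * (n + 1) // 2 + 7) // 8)

    def _index(self, i, j):
        assert 0 <= j <= i, "lower-left triangle only"
        return i * (i + 1) // 2 + j

    def visit(self, i, j):
        idx = self._index(i, j)
        self.bits[idx >> 3] |= 1 << (idx & 7)

    def visited(self, i, j):
        idx = self._index(i, j)
        return bool(self.bits[idx >> 3] & (1 << (idx & 7)))
```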
If this still takes too much space, you could use a Bloom filter, but that will give you some false positives (it says you've visited a cell when in fact you haven't). If that's something you can live with, the space savings are fairly huge.
Other structures like BSPs or k-d trees could work as well: once a region is entirely free or entirely occupied (ignoring the unused cells in the upper triangle), you can store all that information in a single node.
This is hard to recommend because of its complexity, and because it will likely also use O(N) memory in many cases, just with a larger constant. Also, all checks will be O(log N).

Retrieve set of rectangles containing a specified point

I can't figure out how to implement this in a performing way, so I decided to ask you guys.
I have a list of rectangles - actually only squares at the moment, but I might have to migrate to rectangles later, so let's stick with rectangles and keep it a bit more general - in a 2-dimensional space. Each rectangle is specified by two points; rectangles can overlap, and I don't care all too much about setup time, because the rectangles are basically static and there's some room to precalculate any setup stuff (building trees, sorting, precalculating additional vectors, etc.). Oh, and I'm developing in JavaScript, if that's of any concern.
To my actual question: given a point, how do I get a set of all rectangles that include that point?
Linear approaches do not perform well enough, so I'm looking for something better than O(n). I've read some stuff, for instance on Bounding Volume Hierarchies and similar structures, but whatever I tried, the fact that rectangles can overlap (and that I actually want all of them if the point lies within multiple rectangles) always seemed to get in my way.
Are there any suggestions? Have I missed something obvious? Are BVH even applicable to possibly overlapping bounds? If so, how do I build such a possibly overlapping tree? If not, what else could I use? It is of no concern to me if borders are inside, outside or not determined.
If someone could come up with anything helpful, like a link or a rant on how stupid I am to use a BVH and not Some_Super_Cool_Structure_Perfectly_Suited_For_My_Problem, I'd really appreciate it!
Edit: OK, I played around a bit with R-trees and this is exactly what I was looking for. In fact, I'm currently using the R-tree implementation at http://stackulator.com/rtree/ as suggested by endy_c. It performs really well and fulfills my requirements entirely. Thanks a lot for your support, guys!
You could look at R-Trees
Java code
there's also a wiki, but I can only post one link ;-)
You can divide the space into a grid, and for each grid cell keep a list of the rectangles (or rectangle identifiers) that lie at least partially within that cell. To search, you only look at the rectangles in the cell containing the point. The complexity should work out to around O(sqrt(n)).
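A rough sketch of that grid-bucket idea (Python rather than the asker's JavaScript; the cell size and all names are illustrative):

```python
# Bucket axis-aligned rectangles (x1, y1, x2, y2) by the grid cells they
# touch, then answer point queries by checking only one cell's bucket.
class RectGrid:
    def __init__(self, rects, cell=64.0):
        self.cell = cell
        self.rects = rects
        self.buckets = {}  # (cx, cy) -> indices of rectangles touching that cell
        for i, (x1, y1, x2, y2) in enumerate(rects):
            for cx in range(int(x1 // cell), int(x2 // cell) + 1):
                for cy in range(int(y1 // cell), int(y2 // cell) + 1):
                    self.buckets.setdefault((cx, cy), []).append(i)

    def containing(self, px, py):
        """Return the indices of all rectangles that contain (px, py)."""
        key = (int(px // self.cell), int(py // self.cell))
        return [i for i in self.buckets.get(key, [])
                if self.rects[i][0] <= px <= self.rects[i][2]
                and self.rects[i][1] <= py <= self.rects[i][3]]
```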
Another approach is to maintain four sorted arrays of the x1, y1, x2, and y2 values, and binary-search your point in each of the 4 arrays. Each search yields a set of candidate rectangles, and the final result is the intersection of those 4 sets. Depending on how the set intersection is implemented, this should be more efficient than O(n).
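A sketch of that four-sorted-arrays approach, again assuming (x1, y1, x2, y2) tuples (function names are illustrative):

```python
import bisect

# A rectangle contains (px, py) iff x1 <= px <= x2 and y1 <= py <= y2, so
# the result is the intersection of four binary-searched candidate sets.
def build_index(rects):
    def sorted_by(coord):
        ids = sorted(range(len(rects)), key=lambda i: rects[i][coord])
        keys = [rects[i][coord] for i in ids]
        return keys, ids
    return [sorted_by(c) for c in range(4)]

def containing_point(index, px, py):
    x1, y1, x2, y2 = index
    lo = lambda s, v: set(s[1][:bisect.bisect_right(s[0], v)])  # coord <= v
    hi = lambda s, v: set(s[1][bisect.bisect_left(s[0], v):])   # coord >= v
    return lo(x1, px) & lo(y1, py) & hi(x2, px) & hi(y2, py)
```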

Writing Simulated Annealing algorithm for 0-1 knapsack in C#

I'm in the process of learning about simulated annealing algorithms and have a few questions on how I would modify an example algorithm to solve a 0-1 knapsack problem.
I found this great code on CodeProject:
http://www.codeproject.com/KB/recipes/simulatedAnnealingTSP.aspx
I'm pretty sure I understand how it all works now (except the whole Boltzmann condition, which as far as I'm concerned is black magic, though I understand it's about escaping local optima and apparently does exactly that). I'd like to redesign this to solve a 0-1 knapsack-"ish" problem. Basically I'm putting one of 5,000 objects into 10 sacks and need to optimize for the least unused space. The actual "score" I assign to a solution is a bit more complex, but not related to the algorithm.
This seems easy enough, which means the Anneal() function would stay basically the same and I'd have to implement the GetNextArrangement() function to fit my needs. In the TSP example, he just swaps two random nodes along the path (i.e., he makes a very small change each iteration).
For my problem, on the first iteration I'd pick 10 random objects and look at the leftover space. For the next iteration, would I just pick 10 new random objects? Or am I better off swapping out only a few of the objects, like half of them or even just one? Or maybe the number of objects I swap out should be relative to the temperature? Any of these seem doable to me; I'm just wondering if someone has advice on the best approach (though I can mess around with improvements once I have the code working).
Thanks!
Mike
With simulated annealing, you want to make neighbour states as close in energy as possible. If the neighbours have significantly greater energy, the algorithm will simply never jump to them without a very high temperature -- so high that it will never make progress. On the other hand, if you can come up with heuristics that find lower-energy states, exploit them.
For the TSP, this means swapping adjacent cities. For your problem, I'd suggest a conditional neighbour-selection algorithm as follows (a rough sketch appears after these rules):
If there are objects that fit in the empty space, always put the biggest one in.
If no objects fit in the empty space, then pick an object to swap out -- but prefer to swap objects of similar sizes.
That is, give each object a selection probability inversely proportional to the size difference. You might want to use something like roulette selection here, with the slice size being something like 1 / (size1 - size2)^2.
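Here is a rough sketch of that conditional move, simplified to a single sack with a size-based roulette swap (Python rather than the asker's C#; the names, the single-sack simplification, and the epsilon guard against equal sizes are all my own assumptions):

```python
import random

def next_arrangement(sizes, in_sack, capacity):
    """One neighbour move: fill free space if possible, otherwise swap,
    preferring to swap objects of similar sizes (roulette selection)."""
    free_space = capacity - sum(sizes[i] for i in in_sack)
    outside = [i for i in range(len(sizes)) if i not in in_sack]

    # Rule 1: if anything fits in the empty space, put the biggest such object in.
    fitting = [i for i in outside if sizes[i] <= free_space]
    if fitting:
        in_sack.add(max(fitting, key=lambda i: sizes[i]))
        return in_sack

    # Rule 2: otherwise pick an incoming object at random and choose the
    # outgoing one by roulette, with slice size ~ 1 / (size difference)^2.
    incoming = random.choice(outside)
    members = list(in_sack)
    weights = [1.0 / ((sizes[incoming] - sizes[j]) ** 2 + 1e-9) for j in members]
    outgoing = random.choices(members, weights=weights, k=1)[0]
    in_sack.remove(outgoing)
    in_sack.add(incoming)
    return in_sack
```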
Ah, I think I found my answer on Wikipedia. It suggests moving to a "neighbour" state, which usually implies changing as little as possible (like swapping two cities in a TSP tour).
From: http://en.wikipedia.org/wiki/Simulated_annealing
"The neighbours of a state are new states of the problem that are produced after altering the given state in some particular way. For example, in the traveling salesman problem, each state is typically defined as a particular permutation of the cities to be visited. The neighbours of some particular permutation are the permutations that are produced for example by interchanging a pair of adjacent cities. The action taken to alter the solution in order to find neighbouring solutions is called "move" and different "moves" give different neighbours. These moves usually result in minimal alterations of the solution, as the previous example depicts, in order to help an algorithm to optimize the solution to the maximum extent and also to retain the already optimum parts of the solution and affect only the suboptimum parts. In the previous example, the parts of the solution are the parts of the tour."
So I believe my GetNextArrangement function would want to swap a random item in the current set with a random unused item.
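A minimal sketch of that simpler move (Python; the names and the set-of-indices representation are just illustrative):

```python
import random

# Neighbour move: swap one randomly chosen item in the solution with one
# randomly chosen unused item (assumes at least one item on each side).
def get_next_arrangement(num_items, in_sack):
    unused = [i for i in range(num_items) if i not in in_sack]
    neighbour = set(in_sack)
    neighbour.remove(random.choice(list(in_sack)))
    neighbour.add(random.choice(unused))
    return neighbour
```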

Averaging a set of points on a Google Map into a smaller set

I'm displaying a small Google map on a web page using the Google Maps Static API.
I have a set of 15 co-ordinates, which I'd like to represent as points on the map.
Due to the map being fairly small (184 x 90 pixels) and the upper limit of 2000 characters on a Google Maps URL, I can't represent every point on the map.
So instead I'd like to generate a small list of co-ordinates that represents an average of the big list.
So instead of having 15 points, I'd end up with, say, 5 points whose positions approximate the positions of the original 15. For example, if there are 3 points in closer proximity to each other than to any other point on the map, those points would be collapsed into 1 point.
So I guess I'm looking for an algorithm that can do this.
Not asking anyone to spell out every step, but perhaps point me in the direction of a mathematical principle or general-purpose function for this kind of thing?
I'm sure a similar function is used in, say, graphics software, when pixellating an image.
(If I solve this I'll be sure to post my results.)
I recommend k-means clustering when you need to cluster N objects into a known number K < N of clusters, which seems to be your case. Note that one cluster may end up with a single outlier point and another with, say, 5 points very close to each other: that's OK, and it will look closer to your original set than if you forced exactly 3 points into every cluster!-)
If you are searching for such functions/classes, have a look at MarkerClusterer and MarkerManager utility classes. MarkerClusterer closely matches the described functionality, as seen in this demo.
In general, I think the area you want to search around in is "vector quantization". I've got an old book titled Vector Quantization and Signal Compression by Allen Gersho and Robert M. Gray which provides a bunch of examples.
From memory, the Lloyd iteration is a good algorithm for this sort of thing: it takes the input set and reduces it to a fixed-size set of points. Basically, uniformly or randomly distribute your output points around the space, map each of your inputs to the nearest output point, and compute the error (e.g., sum of distances or root mean square). Then, for each output point, move it to the center of the set that maps to it. This moves the point and may even change the set that maps to it. Repeat until no changes are detected from one iteration to the next.
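A sketch of that iteration, treating the lat/long pairs as planar points for simplicity (the function name is illustrative; this is essentially k-means on the 15 coordinates):

```python
import random

def reduce_points(points, k, max_iters=100):
    """Lloyd-style iteration: assign each point to its nearest representative,
    move each representative to the centroid of its assignees, repeat."""
    centers = random.sample(points, k)
    for _ in range(max_iters):
        clusters = [[] for _ in range(k)]
        for px, py in points:
            nearest = min(range(k),
                          key=lambda i: (px - centers[i][0]) ** 2 +
                                        (py - centers[i][1]) ** 2)
            clusters[nearest].append((px, py))
        new_centers = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centers[i]                    # keep empty clusters in place
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:                  # no change: converged
            break
        centers = new_centers
    return centers
```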
Hope this helps.

Possible to calculate closest locations via lat/long in better than O(n) time?

I'm wondering if there is an algorithm for calculating the nearest locations (represented by lat/long) in better than O(n) time.
I know I could use the Haversine formula to get the distance from the reference point to each location and sort ASC, but this is inefficient for large data sets.
How does the MySQL DISTANCE() function perform? I'm guessing O(n)?
If you use a kd-tree to store your points, you can do this in O(log n) time (expected) or O(sqrt(n)) worst case.
You mention MySQL, but there are some pretty sophisticated spatial features in SQL Server 2008, including a geography data type. There's some information out there about doing the kinds of things you're asking about. I don't know spatial well enough to talk about performance, but I doubt there is a bounded-time algorithm for what you're asking; you might, however, be able to do some fast set operations on locations.
If the data set being searched is static, e.g., the coordinates of all gas stations in the US, then a proper index (BSP) would allow for efficient searching. Postgres has had good support since the mid 90's for 2-dimensional indexed data so you can do just this sort of query.
Better than O(n)? Only if you go the way of a radix sort, or store the locations with hash keys that represent the general area they are in.
For instance, you could divide the globe by latitude and longitude down to the minute, enumerate the resulting areas, and make a location's hash its area. When the time comes to find the closest location, you only need to check at most 9 hash keys -- and you can test beforehand whether an adjacent cell could possibly hold a closer location than the best found so far, shrinking the set of locations you compute distances to. It's still O(n), but with a much smaller constant factor; properly implemented, you won't even notice it.
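A rough sketch of that grid-hash lookup, using one-arc-minute cells and a plain Euclidean approximation instead of Haversine to keep it short (all names are illustrative):

```python
import math

CELL = 1 / 60.0  # one arc-minute per cell

def build_grid(locations):
    """Bucket (lat, lon) pairs by their grid cell."""
    grid = {}
    for lat, lon in locations:
        grid.setdefault((int(lat // CELL), int(lon // CELL)), []).append((lat, lon))
    return grid

def nearest(grid, lat, lon):
    """Check the query cell and its 8 neighbours."""
    ci, cj = int(lat // CELL), int(lon // CELL)
    best, best_d = None, float("inf")
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            for plat, plon in grid.get((ci + di, cj + dj), []):
                d = math.hypot(plat - lat, plon - lon)
                if d < best_d:
                    best, best_d = (plat, plon), d
    return best  # None means all 9 cells were empty: fall back to a wider ring
```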
Or, if the data is in memory or otherwise randomly accessible, you could store it sorted by both latitude and longitude. You then use binary search to find the closest latitude and longitude in the respective data sets. Next, you keep reading locations with increasing latitude or longitude (i.e., the preceding and succeeding locations) until it becomes impossible to find a closer one.
You know you can't find a closer location when the next location to either side in the latitude-sorted data would not be closer than the best found so far, even if it had the same longitude as the query point. A similar test applies to the longitude-sorted data.
This actually gives you better than O(n) -- closer to O(log N), I think -- but it requires random rather than sequential access to the data, and duplication of all the data (or at least of the keys to the data).
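A sketch of the latitude-sorted half of that idea (the longitude-sorted copy would be handled the same way); Euclidean distance stands in for Haversine, and the names are illustrative:

```python
import bisect
import math

def nearest_by_lat(pts, qlat, qlon):
    """pts must be sorted by latitude. Binary-search the query latitude, then
    scan outward in both directions, stopping a direction as soon as the
    latitude gap alone already exceeds the best distance found so far."""
    lats = [p[0] for p in pts]
    start = bisect.bisect_left(lats, qlat)
    best, best_d = None, float("inf")
    for indices in (range(start, len(pts)), range(start - 1, -1, -1)):
        for i in indices:
            plat, plon = pts[i]
            if abs(plat - qlat) >= best_d:
                break  # no later point in this direction can be closer
            d = math.hypot(plat - qlat, plon - qlon)
            if d < best_d:
                best, best_d = (plat, plon), d
    return best
```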
I wrote an article about finding the nearest line in DDJ a couple of years ago, using a grid (I call it quadrants). Using it to find the nearest point (instead of lines) is just a simplification of that.
Using quadrants reduces the time drastically, although the complexity cannot be pinned down mathematically (all points could theoretically lie in a single quadrant). A precondition of using quadrants/grids is that you have a maximum distance for the point searched. If you just look for the nearest point, without giving a maximum distance, you can't use quadrants.
In that case, have a look at "A Template for the Nearest Neighbor Problem" (Larry Andrews, DDJ), which has a retrieval complexity of O(log n). I did not compare the runtimes of the two approaches; probably, if you have a reasonable maximum distance, quadrants are better. The better general-purpose algorithm is the one from Larry Andrews.
If you are looking for the (1) closest location, there's no need to sort. Simply iterate through your list, calculating the distance to each point and keeping track of the closest one. By the time you get through the list, you'll have your answer.
Even better would be to introduce the concept of grids: assign each point to a grid cell, then, for a search, first determine which cell the query falls in and compute distances only for the points in that cell. You'll need to be a little careful, though: if the query location is close to the boundary of a cell, you'll need to search the neighbouring cell(s) as well. Still, this is likely to be highly performant.
I haven't looked at it myself, but Postgres does have a module dedicated to the management of GIS data.
In an application I worked on in a previous life, we took all of the data, computed its key for a quadtree (for 2D space) or an octree (for 3D space), and stored that in the database. It was then a simple matter of loading the values from the database (so you don't have to recompute the quadtree) and following the standard quadtree search algorithm.
This does of course mean you will touch all of the data at least once to get it into the data structure, but persisting that structure means you get better lookup speeds from then on. I imagine you'll do a lot of nearest-neighbour checks against each data set.
(For kd-trees, Wikipedia has a good explanation: http://en.wikipedia.org/wiki/Kd-tree)
You need a spatial index. Fortunately, MySQL provides just such an index, in its Spatial Extensions. They use an R-Tree index internally - though it shouldn't really matter what they use. The manual page referenced above has lots of details.
I guess you could do it theoretically if you had a large enough table for it... and secondly, perhaps caching correctly could get you a very good average case?
An R-Tree index can be used to speed spatial searches like this. Once created, it allows such searches to be better than O(n).
