How does "Find Nearest Locations" work? - algorithm

Nowadays most restaurants and other businesses have a "Find Locations" feature on their websites that lists the nearest locations for a given address/ZIP. How is this implemented? Matching the ZIP code against the DB is a simple, no-brainer approach, but it may not always work: there may be a branch closer to the given location that happens to be in a different ZIP. One approach that comes to mind is to convert the given ZIP code/address into map coordinates and list any branches falling within a pre-defined radius. I welcome your thoughts on how this is usually implemented. If possible, provide implementation details such as any web services used.

A lot of geospatial frameworks will help you out with this. In the geospatial world, a ZIP code is just a "polygon": an area on a map with clearly defined boundaries (not necessarily a polygon in the strict mathematical sense). With SQL Server 2008's spatial support, for example, you can create a new polygon based on your original polygon, so you can dynamically build a polygon that is your ZIP code extended by a certain distance at every point. That takes the funky shape of the ZIP code into account. With an address it's even easier, because you just create a circle around the one point. You can then run queries that give you all points within the new polygon, whichever way you created it.
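As a rough illustration of the buffered-polygon idea, here is a minimal sketch using the Python shapely library (the coordinates are made up; in practice the ZIP boundary would come from a boundary dataset or live in the spatial database itself):

    # Extend a ZIP polygon by a fixed distance, then test membership.
    from shapely.geometry import Point, Polygon

    zip_boundary = Polygon([(0, 0), (4, 0), (5, 3), (2, 5), (0, 3)])

    # The ZIP code extended by a certain distance at every point.
    search_area = zip_boundary.buffer(1.0)

    # For a single address, the "polygon" is just a circle around one point.
    address_area = Point(2.0, 2.0).buffer(1.0)

    branches = [Point(4.8, 4.2), Point(9.0, 9.0)]
    print([search_area.contains(b) for b in branches])   # [True, False]
    print([address_area.contains(b) for b in branches])  # [False, False]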
A lot of these sites are basically just doing this. They give you all points within a 5 mile extended polygon, and then maybe a 10 mile extended polygon, and so on and so forth. They are not actually calculating distance. Most mapping stuff on the web is not sophisticated at all.
You can see some basic examples here to get the general idea of what I'm talking about.

There is a standard zipcode/location database available. Here is one version in Access format that includes the lat/long of each zipcode as well as other information. You can then use the PostGIS extensions for PostgreSQL to run searches on the locations, for example
(assuming, of course, that you extract the Access DB and load it into a friendlier database like PostgreSQL).
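A minimal sketch of such a search against a PostGIS-enabled PostgreSQL instance (the stores table, its columns, and the connection string are assumptions for illustration):

    # Find stores within 10 km of a geocoded point using PostGIS, via psycopg2.
    import psycopg2

    conn = psycopg2.connect("dbname=stores_db")
    cur = conn.cursor()

    lat, lon, radius_m = 40.7128, -74.0060, 10000
    cur.execute(
        """
        SELECT name, ST_Distance(geog, ST_MakePoint(%s, %s)::geography) AS meters
        FROM stores
        WHERE ST_DWithin(geog, ST_MakePoint(%s, %s)::geography, %s)
        ORDER BY meters
        """,
        (lon, lat, lon, lat, radius_m),
    )
    for name, meters in cur.fetchall():
        print(name, round(meters / 1000, 1), "km")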

First, you geocode the address, translating it into (usually) a latitude and longitude. Then you do a nearest-neighbour query on your database for points of interest.
Most spatial indexes don't directly support nearest-neighbour queries, so the usual approach here is to query on a bounding box of a reasonable size with the geocoded point at the center, then sort the results in memory to pick the closest ones.
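A sketch of that bounding-box-then-sort approach in pure Python (in a real deployment the box filter would be an indexed query; the equirectangular distance used for the sort is an approximation that is fine at city scale):

    import math

    def nearest(points, lat, lon, radius_km, limit=10):
        # points: list of (name, lat, lon) tuples.
        # One degree of latitude is ~111 km; a degree of longitude
        # shrinks by cos(latitude).
        dlat = radius_km / 111.0
        dlon = radius_km / (111.0 * math.cos(math.radians(lat)))
        box = [p for p in points
               if abs(p[1] - lat) <= dlat and abs(p[2] - lon) <= dlon]

        # Sort the (small) candidate set in memory and keep the closest.
        def approx_km(p):
            dy = (p[1] - lat) * 111.0
            dx = (p[2] - lon) * 111.0 * math.cos(math.radians(lat))
            return math.hypot(dx, dy)

        return sorted(box, key=approx_km)[:limit]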

Just like you said. Convert an address/ZIP into a 2D world coordinate and compare it to other known locations. Pick the nearest. :) I think some DBs (Oracle, MSSQL 2008) even offer functions that can help, but I've never used them.

I think it is pretty universal. They take the address or zipcode and turn it into a "map coordinate" (the details differ depending on the implementation, but it's probably a lat/long), and then, using the "map coordinates" of the things in the database, it is easy to calculate a distance.
Note that some poor implementations convert the zipcode into a coordinate representing the center of the zipcode area, which sometimes gives bad results.
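For the distance calculation itself, the standard great-circle (haversine) formula is enough; a small self-contained sketch:

    import math

    def haversine_km(lat1, lon1, lat2, lon2):
        # Great-circle distance between two lat/long points, in km.
        R = 6371.0  # mean Earth radius
        p1, p2 = math.radians(lat1), math.radians(lat2)
        dphi = math.radians(lat2 - lat1)
        dlam = math.radians(lon2 - lon1)
        a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
        return 2 * R * math.asin(math.sqrt(a))

    print(haversine_km(40.7128, -74.0060, 34.0522, -118.2437))  # NYC-LA, ~3936 km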

Your thoughts on how to do it are how I would probably do it. You can geocode the coordinates for the ZIP and then do calculations based on that. I know SQL Server 2008 has some new functionality to help with queries based on these geocoded lat/long coordinates.

There are actual geometric algorithms and data structures that support nearest-location queries with better asymptotic complexity on point, line, and/or region data.
See this book as an example of information on some of them, like Voronoi diagrams, quadtrees, etc. (a small k-d tree sketch follows the list below).
However, I think the other answers here are right about most of the software you find today:
geocode (a single point in) the search area
bounding box query to get an initial ballpark
in memory sorting/selecting
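If you do want one of the fancier structures, a k-d tree is only a few lines with scipy (a sketch; the coordinates are treated as planar, which is a reasonable approximation at city scale):

    import numpy as np
    from scipy.spatial import cKDTree

    locations = np.array([[40.71, -74.00], [40.73, -73.99], [41.00, -75.00]])
    tree = cKDTree(locations)

    query = np.array([40.72, -74.01])    # the geocoded search point
    dists, idx = tree.query(query, k=2)  # two nearest neighbours
    print(idx, dists)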

I had a database table that I recompiled every six months. It had three columns, and I used it for a few clients in Australia. It contained about 40k rows and was very lightweight to query. This is quite quick if you're just looking to get something off the ground for a client:
Postal Code From
Postal Code To
Distance
    SELECT Store_ID, Store_AccountName, Store_PostalCode, Store_Address,
           Store_Suburb, Store_Phone, Store_State, Code_Distance
    FROM Store,
         (SELECT Code_To AS Code_To, Code_Distance
          FROM Code
          WHERE Code_From = #PostalCode
          UNION ALL
          SELECT Code_From AS Code_To, Code_Distance
          FROM Code
          WHERE Code_To = #PostalCode
          UNION ALL
          SELECT #PostalCode AS Code_To, 0 AS Code_Distance) AS Code
    WHERE Store_PostalCode = Code_To
      AND Code_Distance <= #Distance
    ORDER BY Code_Distance
There is plenty of optimization you could do to speed up this query!

Related

Dividing the world into a thousand or so locations

Background: I want to create a weather service, and since most available APIs limit the number of daily calls, I want to divide the planet into a thousand or so areas.
Obviously, internet users are not uniformly distributed, so the sampling should be finer around densely populated regions.
How should I go about implementing this?
Where can I find data regarding geographical internet user density?
The algorithm will probably be something similar to k-means. However, implementing it on a sphere with oceans may be a bit tricky. Any insight?
Finally, maybe there is a way I can avoid doing all of this?
Very similar to k-means is the centroidal Voronoi diagram (it is the continuous version of k-means). However, this would produce a uniform tessellation of your sphere that does not account for user density as you wish.
So a similar solution is the same technique but used with a power diagram: a power diagram is a Voronoi diagram that accounts for a density (by assigning a weight to each Voronoi seed). Such a diagram can be computed using an embedding in a 3D space (instead of 2D) that consists of the first two (x, y) coordinates plus a third one which is the square root of [any large positive constant minus the weight for the given point].
Using that, you can obtain a tessellation of your domain that accounts for user density.
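If implementing power diagrams is too much, a rough way to get a similar density-aware effect is weighted k-means on 3D unit vectors (a sketch, not the power-diagram construction itself; the weights standing in for user density are an assumption):

    import numpy as np

    def weighted_kmeans(points, weights, k, iters=50, seed=0):
        # points: (n, 3) unit vectors on the sphere; weights: (n,) densities.
        rng = np.random.default_rng(seed)
        centers = points[rng.choice(len(points), k, replace=False)]
        for _ in range(iters):
            # Assign each point to its nearest centre (chord distance).
            labels = np.argmin(((points[:, None] - centers) ** 2).sum(-1), axis=1)
            for j in range(k):
                mask = labels == j
                if mask.any():
                    c = np.average(points[mask], axis=0, weights=weights[mask])
                    centers[j] = c / np.linalg.norm(c)  # project back onto sphere
        return centers, labels

    # Demo with random points, weighted toward a "dense" northern half.
    pts = np.random.default_rng(1).normal(size=(500, 3))
    pts /= np.linalg.norm(pts, axis=1, keepdims=True)
    w = 1.0 + 9.0 * (pts[:, 2] > 0)  # 10x weight in the north
    centers, labels = weighted_kmeans(pts, w, k=12)

Heavily weighted regions pull more centres toward them, so the resulting cells come out finer where the weight is high.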
You don't care about internet user density in general. You care about the density of users using your service, and you don't care where those users are, you care where they ask about. So once your site has been going for more than a day, you can use the locations people asked about the previous day to work out what the areas should be for the next day.
Dynamic programming on a tree is easy. What I would do for an algorithm is build a tree of successively more finely divided cells. More cells mean a smaller error, because people get predictions for points closer to them, and you can work out the error, or at least the relative error between more cells and fewer cells. Starting from the bottom up, work out the smallest possible total error contributed by each subtree when it is allowed to be divided into up to 1, 2, 3, ..., N cells. You can work out the best possible division and smallest possible error for each k = 1..N for a node by looking at the smallest possible errors you have already calculated for each of its descendants, and working out how best to share the available k divisions between them.
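A sketch of that bottom-up budget-sharing DP (the cell tree and its error values are assumptions; in practice the errors would come from your query logs):

    import functools

    class Cell:
        # A map cell; `error` is the prediction error if it is kept whole.
        def __init__(self, error, children=()):
            self.error = error
            self.children = tuple(children)

    @functools.lru_cache(maxsize=None)
    def best_error(cell, budget):
        # Smallest total error for this subtree using at most `budget` cells.
        if budget < 2 or not cell.children:
            return cell.error
        # Share the budget among the children (each needs at least one
        # cell), folding them in one at a time, knapsack-style.
        alloc = {0: 0.0}  # cells used so far -> min error so far
        for child in cell.children:
            nxt = {}
            for used, err in alloc.items():
                for give in range(1, budget - used + 1):
                    cand = err + best_error(child, give)
                    if cand < nxt.get(used + give, float("inf")):
                        nxt[used + give] = cand
            alloc = nxt
        best_split = min(alloc.values()) if alloc else float("inf")
        return min(cell.error, best_split)

    # Tiny example: splitting the "denser" child pays off first.
    root = Cell(10.0, [Cell(6.0, [Cell(2.0), Cell(2.0)]), Cell(1.0)])
    print(best_error(root, 3))  # 5.0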
I would try to avoid doing this by thinking of a different idea. Depending on the way you look at life, there are at least two disadvantages of this:
1) You don't seem to be adding anything to the party. It looks like you are interposing yourself between organizations that actually make weather forecasts and their clients. Organizations lose direct contact with their clients, which might for instance lose them advertising revenue. Customers get a poorer weather forecast.
2) Most sites have legal terms of service, which most clients can ignore without worrying. My guess is that you would be breaking those terms of service, and if your service gets popular enough to be noticed, they will be enforced against you.

Sort POIs by distance from current location

Trover is an awesome app: it shows you a stream of discoveries (POIs) people have uploaded, sorted by distance from any location you specify (usually your current location). The further you scroll through the feed, the farther away the displayed discoveries are. An indicator tells you quite accurately how far away the currently shown discoveries are (see the screenshots on their website).
This is different from most other location-based apps, which deliver their results (POIs) based on fixed regions (e.g. give me all pizzerias within a 10 km radius) and can be implemented with a single spatial data structure (or an SQL engine supporting spatial data types). Delivering the results the way Trover does is considerably harder:
You can query POIs for arbitrary locations. Give Trover a location in the far east of Russia and it will deliver discoveries, the first of which might be 2000 km away, with distances increasing continuously from there.
The result list of POIs is not limited by some spatial range. If you scroll long enough through the feed you will probably see discoveries which are on the other side of the globe.
The above points require a semi-strict ordering of their POIs for any location. The fact that you can scroll down and reload more discoveries implies that they can deliver specific sections of the sorted data (e.g. give me the next 20 discoveries that are at least 100 km away from my current location).
It's fast: the fetching and distance indications are instant. The discoveries must be pre-sorted. I don't know how many discoveries they have in their DB, but it must be more than you would want to sort ad hoc.
I find these characteristics quite remarkable and wonder how this is implemented. Any suggestions what kind of data-structure, algorithms or caching might be used?
I don't get the question. What do you want an answer to?
Edit:
They might use a graph database where an edge represents the distance between two nodes. That way you can get the distance via the relationships of nearby POIs: you calculate the distance once and create edges to nearby nodes. To get the distance from an arbitrary point you just do a circle-distance calculation; for another node you just add up the edge values, as they represent distance (this is for the case of a walking, biking, or car calculation). The sum may not give the exact shortest distance, but it will give a relative indication, which seems to be what they use.
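Another plausible implementation of the "next 20 by distance" behaviour: index the POIs as 3D unit vectors in a k-d tree and page through the results by asking for successively larger k. Chord distance through the Earth is monotonic in great-circle distance, so the ordering is exact (a sketch; the sample POIs are made up):

    import numpy as np
    from scipy.spatial import cKDTree

    def to_xyz(lat, lon):
        la, lo = np.radians(lat), np.radians(lon)
        return np.stack([np.cos(la) * np.cos(lo),
                         np.cos(la) * np.sin(lo),
                         np.sin(la)], -1)

    poi_latlon = np.array([[48.85, 2.35], [51.51, -0.13],
                           [40.71, -74.00], [35.68, 139.69]])
    tree = cKDTree(to_xyz(poi_latlon[:, 0], poi_latlon[:, 1]))

    def page(lat, lon, page_no, page_size=2):
        # One page of (poi_index, distance_km), nearest first.
        k = min((page_no + 1) * page_size, tree.n)
        chord, idx = tree.query(to_xyz(lat, lon), k=k)
        km = 2 * 6371.0 * np.arcsin(np.clip(chord / 2, 0, 1))  # chord -> great circle
        return list(zip(idx, km))[page_no * page_size : k]

    print(page(52.52, 13.40, 0))  # first page of POIs nearest Berlin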

Tools/recommendations to create and check if address falls within a specified boundary on a map?

Abstract problem: Define some (non-rectangular, non-circular) topologically closed area on a map. Find a way to query that map such that it returns true if a given longitude/latitude is within the boundary.
Applied problem:
Let's say we're dealing with newspaper boy coverage. A coverage area is defined for each newspaper boy, and I query each house address to find who services what address.
I am looking for suggestions/hints/tips on how best to do this (real world, so helpful APIs and tools would be much appreciated).
So, first defining a boundary, then allowing for an address to query for membership within a specific boundary.
We have mapping software at work where we implemented this exact problem (obviously in a different domain than paperboy coverage). We could not find an out-of-the-box solution, so we implemented our own.
We solved this problem by defining the geographical areas as set of points (given in latitude and longitude) and used the ray-casting point in polygon method.
http://en.wikipedia.org/wiki/Point_in_polygon
The math is not too complex, but there is a fair bit of setup work involved.
A quick google search brought up this sample code for implementation:
http://www.ecse.rpi.edu/Homepages/wrf/Research/Short_Notes/pnpoly.html#The%20C%20Code
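For reference, here is a minimal Python port of the same ray-casting routine (vertices as (lat, lon) pairs; points that fall exactly on an edge may go either way):

    def point_in_polygon(lat, lon, vertices):
        # Count how many polygon edges a ray going east from the point crosses:
        # an odd number of crossings means the point is inside.
        inside = False
        j = len(vertices) - 1
        for i in range(len(vertices)):
            lat_i, lon_i = vertices[i]
            lat_j, lon_j = vertices[j]
            crosses = (lat_i > lat) != (lat_j > lat)
            if crosses and lon < (lon_j - lon_i) * (lat - lat_i) / (lat_j - lat_i) + lon_i:
                inside = not inside
            j = i
        return inside

    route = [(0.0, 0.0), (0.0, 4.0), (3.0, 4.0), (3.0, 0.0)]  # one paperboy's area
    print(point_in_polygon(1.5, 2.0, route))  # True
    print(point_in_polygon(5.0, 2.0, route))  # False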
Good luck! I would be happy to clarify anything if needed.

What's a good algorithm for nearest neighbour problem in two dimensions?

I would like to build an app that gives you the closest restaurant depending on your location. We'll have a database with all the POIs corresponding to restaurants, and we'll get your location from your phone's GPS...
What algorithm would be appropriate? Where can I find good documentation about it?
Thanks
Here's an informative presentation: http://dimacs.rutgers.edu/Workshops/MiningTutorial/pindyk-slides.ppt
I would either use a Quadtree or a Kd-tree.
See some benchmarks here: http://www.flegg.net/brett/pubs/spatial/index.html. It really all depends on your data size and range.
The main problem is how you store and search the data. If you are using a SQL database that doesn't support spatial indexes (say, SQLite on Android), consider converting the spatial data to a linear Z-order curve. The algorithm is simple, and I know about (well, wrote) one implementation.
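A sketch of the Z-order idea: quantize lat/long to integers, interleave the bits into a single key, put a plain index on that key column, and range-scan it (the bit width and quantization here are arbitrary choices):

    def morton_key(lat, lon, bits=16):
        # Quantize lat/lon to `bits`-bit ints and interleave their bits.
        y = int((lat + 90.0) / 180.0 * ((1 << bits) - 1))
        x = int((lon + 180.0) / 360.0 * ((1 << bits) - 1))
        key = 0
        for i in range(bits):
            key |= ((x >> i) & 1) << (2 * i) | ((y >> i) & 1) << (2 * i + 1)
        return key

    # Nearby points get (mostly) nearby keys, so an ordinary B-tree index
    # lets you scan a small key range and then filter by true distance.
    print(morton_key(40.7128, -74.0060))
    print(morton_key(40.7130, -74.0050))  # close by, so a similar key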

anything better than bounding boxes?

I have a scenario where I have X million longitude/latitude points.
When a new long/lat point is added, I want to know efficiently which other points are within a user-configured distance parameter, so I can add them to a list.
got anything better than bounding boxes?
I would love to see algorithms, references and a few implementations ;) thank you kindly!
There are quite a few options that are better, mostly based around space partitioning.
A common and often very good option (which isn't too tough to implement) is a k-d tree. Quadtrees are easier to implement but slower to search. Depending on the distribution of your data and your requirements, other space-partitioning algorithms may perform better, have lower memory requirements, or avoid other related issues.
A colleague told me that he had good experience with using Morton-Code as a spatial index on GIS data, maybe that is something worth investigating.
This quick-and-dirty approach may save you some grief: Divide the surface of the earth into 1 degree boxes. You will then have a 180x360 element array and you will only need to search a small number of boxes, including the box containing the new point and all the boxes immediately around it for which one of the corners is within the user-specified distance. You will find that there are some tricks you can use to quickly figure out what boxes to use without considering them all. Just don't forget latitude and longitude wrap-around.
If your "only" have millions of points, and they aren't clustered into hot-spots, that might get you through.
A theoretically superior way: you could map each point into three-dimensional space and then store the points in an octree, which would let you quickly find nearby points within an arbitrary distance. Of course, the distance in three-dimensional space will be slightly different from the great-circle distance on the globe, so you will have to calculate a conversion factor. That should be simple, though. You don't mention an implementation language, but there is almost certainly a well-tested octree implementation for any language you are working in. If you don't mind pulling in third-party code, this solution is the way to go.
