Suppose I have some (lng, lat) coordinate. I also have a big list of ranges,
[ { northeast: {lng, lat}, southwest: {lng, lat} } ... ]
How can I most efficiently determine which bucket the (lng, lat) point goes into?
Also, from a design perspective: would it make more sense for the "list of ranges" to live in a database like MySQL or MongoDB, or in something like Memcached or Redis?
Thank you.
An SQL database might be a good answer. If you imagine a table like (bucketId, latNe, longNe, latSw, longSw), with indices on all the lat/long columns, then you could very efficiently get an answer by preparing and executing a query like SELECT bucketId FROM bucketTable WHERE latNe >= ? AND longNe >= ? AND latSw <= ? AND longSw <= ?, binding the desired latitude and longitude to the placeholders (the point is inside a bucket when it lies between the south-west and north-east corners).
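A minimal runnable sketch of that approach, using SQLite purely for illustration (any SQL database with ordinary B-tree indexes behaves similarly; the sample rows and the exact index layout are my own assumptions):

import sqlite3

# In-memory database just for illustration; the table layout mirrors the answer above.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE bucketTable (
                  bucketId INTEGER PRIMARY KEY,
                  latNe REAL, longNe REAL,   -- north-east corner
                  latSw REAL, longSw REAL    -- south-west corner
              )""")
db.execute("CREATE INDEX idx_lat ON bucketTable (latSw, latNe)")
db.execute("CREATE INDEX idx_long ON bucketTable (longSw, longNe)")

db.execute("INSERT INTO bucketTable VALUES (1, 55.0, 0.0, 50.0, -5.0)")  # made-up bucket
db.execute("INSERT INTO bucketTable VALUES (2, 45.0, 10.0, 40.0, 5.0)")  # made-up bucket

lat, lng = 54.3, -2.3  # the query point
rows = db.execute(
    "SELECT bucketId FROM bucketTable "
    "WHERE latNe >= ? AND longNe >= ? AND latSw <= ? AND longSw <= ?",
    (lat, lng, lat, lng),
).fetchall()
print(rows)  # -> [(1,)]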
You need to subdivide the list of ranges. You can look into a quadkey; it's similar to a quadtree and uses a Morton curve. You can compute the quadkey of the ranges and of the points very quickly. You could also try a rectangle tree (R-tree) or an interval tree.
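To illustrate the quadkey idea, here is a small sketch (the function name, the quantization resolution, and the sample coordinates are my own) that interleaves the bits of a quantized (lng, lat) pair into a Morton code; ranges and points that share a key prefix fall into the same cell of the implicit quadtree:

def morton_key(lng, lat, bits=16):
    """Interleave the bits of a quantized (lng, lat) pair into a single Morton code."""
    # Quantize each coordinate to an integer in [0, 2**bits)
    x = int((lng + 180.0) / 360.0 * (1 << bits))
    y = int((lat + 90.0) / 180.0 * (1 << bits))
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)        # even bit positions: longitude
        key |= ((y >> i) & 1) << (2 * i + 1)    # odd bit positions: latitude
    return key

# Nearby points produce keys that share their high-order bits,
# so buckets can be grouped by key prefix instead of scanned linearly.
print(bin(morton_key(-2.345, 54.321)))
print(bin(morton_key(-2.346, 54.322)))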
The R-tree is a data structure designed for exactly this kind of thing. Boost contains an implementation of it, as does CGAL. Most modern databases support this kind of query natively as well.
Related
I already have a distance matrix (1609*1609), and each distance is between 0 and 1. I want to cluster the 1609 items into natural groups using TwoStep cluster in SPSS, with the distance matrix as the input for the TwoStep cluster analysis. How do I modify the syntax to do that? Or is it not possible?
DATASET ACTIVATE dataname1.
TWOSTEP CLUSTER
/CATEGORICAL VARIABLES=ROWTYPE VARNAME
/CONTINUOUS VARIABLES=A1 to A1609 *Ignore the A2 to A1608 here.
/DISTANCE LIKELIHOOD
/NUMCLUSTERS AUTO 15 BIC
/HANDLENOISE 0
/MEMALLOCATE 64
/CRITERIA INITHRESHOLD(0) MXBRANCH(8) MXLEVEL(3)
/VIEWMODEL DISPLAY=YES
/SAVE VARIABLE=TSC_4920.
Thanks in advance.
From my understanding of that vaguely-documented "twostep" clustering, it needs to compute the mean of points.
So it cannot be used with a distance matrix. Consider using ELKI, sklearn, or R instead. As a bonus, they are open source, so you can actually check what they are doing and customize them if, for example, they don't accept a distance matrix somewhere. Being open source is a very big feature.
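For example, with Python's scientific stack (one way to go the sklearn/scipy route mentioned above), a precomputed distance matrix can be fed directly into hierarchical clustering. This is only a sketch; the random matrix below is just a placeholder for your real 1609×1609 matrix:

import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

# dist: symmetric 1609 x 1609 matrix of pairwise distances in [0, 1]
# (random placeholder here; load your real matrix instead)
n = 1609
rng = np.random.default_rng(0)
dist = rng.random((n, n))
dist = (dist + dist.T) / 2
np.fill_diagonal(dist, 0.0)

condensed = squareform(dist)                       # condensed form expected by linkage()
Z = linkage(condensed, method="average")           # average-linkage hierarchical clustering
labels = fcluster(Z, t=15, criterion="maxclust")   # cut the tree into at most 15 clusters
print(labels[:10])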
The twostep algorithm, as well as all the others in Statistics, is fully documented in the algorithms manual available from the Help menu.
I have had a look at this post about geohashes. According to the author, the final step in calculating the hash is interleaving the x and y index values. But is this really necessary? Is there a proper reason not to just concatenate these values, as long as the hash table is built according to that altered indexing rule?
From the Wikipedia page:
Geohashes offer properties like arbitrary precision and the
possibility of gradually removing characters from the end of the code
to reduce its size (and gradually lose precision).
If you simply concatenated the x and y coordinates, then users would have to take a lot more care when reducing precision: they would have to remove exactly the right number of characters from both the x and the y part.
There is a related (and more important) reason than arbitrary precision: Geohashes with a common prefix are close to one another. The longer the common prefix, the closer they are.
54.321 -2.345 has geohash gcwm48u6
54.322 -2.346 has geohash gcwm4958
(See http://geohash.org to try this)
This feature enables fast lookup of nearby points (though there are some complications), and only works because we interleave the two dimensions to get a sort of approximate 2D proximity metric.
As the wikipedia entry goes on to explain:
When used in a database, the structure of geohashed data has two
advantages. First, data indexed by geohash will have all points for a
given rectangular area in contiguous slices (the number of slices
depends on the precision required and the presence of geohash "fault
lines"). This is especially useful in database systems where queries
on a single index are much easier or faster than multiple-index
queries. Second, this index structure can be used for a
quick-and-dirty proximity search - the closest points are often among
the closest geohashes.
Note that the converse is not always true - if two points happen to lie on either side of a subdivision (e.g. either side of the equator) then they may be extremely close but have no common prefix. Hence the complications I mentioned earlier.
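To make the interleaving concrete, here is a small pure-Python sketch of the standard geohash encoding (bit interleaving followed by base-32 packing), written for illustration rather than production use:

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"   # geohash base-32 alphabet

def geohash_encode(lat, lon, length=8):
    """Standard geohash: interleave lon/lat bits, then pack 5 bits per base-32 character."""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    bits = []
    even = True                                # geohash starts with a longitude bit
    while len(bits) < length * 5:
        rng = lon_range if even else lat_range
        val = lon if even else lat
        mid = (rng[0] + rng[1]) / 2
        if val > mid:
            bits.append(1)
            rng[0] = mid
        else:
            bits.append(0)
            rng[1] = mid
        even = not even
    chars = []
    for i in range(0, len(bits), 5):           # pack each group of 5 bits into one character
        idx = 0
        for b in bits[i:i + 5]:
            idx = (idx << 1) | b
        chars.append(BASE32[idx])
    return "".join(chars)

print(geohash_encode(54.321, -2.345))   # gcwm48u6, as in the example above
print(geohash_encode(54.322, -2.346))   # gcwm4958 -- shares the prefix "gcwm4"
# Truncating either hash to 5 characters gives the same, coarser cell "gcwm4";
# with simple concatenation of x and y you could not truncate like this.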
I have a dataset. Each element of this set consists of numerical and categorical variables. Categorical variables are nominal and ordinal.
There is some natural structure in this dataset. Commonly, experts clusterize datasets such as mine using their 'expert knowledge', but I want to automate this process of clusterization.
Most algorithms for clustering use a distance (Euclidean, Mahalanobis and so on) between objects to group them into clusters. But it is hard to find a reasonable metric for mixed data types, i.e. we can't find a distance between 'glass' and 'steel'. So I came to the conclusion that I have to use conditional probabilities P(feature = 'something' | Class) and some utility function that depends on them. This is reasonable for categorical variables, and it works fine with numeric variables assuming they are distributed normally.
So it became clear to me that algorithms like K-means will not produce good results.
At this time I am trying to work with the COBWEB algorithm, which fully matches my idea of using conditional probabilities. But I have faced another obstacle: the results of the clustering are really hard to interpret, if not impossible. As a result, I wanted to get something like a set of rules that describes each cluster (e.g. if feature1 = 'a' and feature2 in [30, 60], then it is cluster1), like decision trees for classification.
So, my question is:
Is there any existing clustering algorithm that works with mixed data types and produces an understandable (and reasonable for humans) description of the clusters?
Additional info:
As I understand it, my task is in the field of conceptual clustering. I can't define a similarity function as was suggested (it is an ultimate goal of the whole project), because of the field of study: it is very complicated and merciless in terms of formalization. As far as I understand, the most reasonable approach is the one used in COBWEB, but I'm not sure how to adapt it so that I can get an understandable description of the clusters.
Decision Tree
As was suggested, I tried to train a decision tree on the clustering output, thus getting a description of the clusters as a set of rules. But unfortunately, interpreting these rules is almost as hard as interpreting the raw clustering output. First, only the first few levels of rules from the root node make any sense; the closer to a leaf, the less sense they make. Second, these rules don't match any expert knowledge.
So I came to the conclusion that clustering is a black box, and it is not worth trying to interpret its results.
Also
I had an interesting idea to modify the 'decision tree for regression' algorithm in a certain way: instead of calculating the intra-group variance, calculate a category utility function and use it as the split criterion. As a result we would have a decision tree with leaves as clusters and a description of the clusters out of the box. But I haven't tried to do this, and I am not sure about accuracy and everything else.
For most algorithms, you will need to define similarity. It doesn't need to be a proper distance function (e.g. satisfy the triangle inequality).
K-means is particularly bad, because it also needs to compute means. So it's better to stay away from it if you cannot compute means, or are using a different distance function than Euclidean.
However, consider defining a distance function that captures your domain knowledge of similarity. It can be composed of other distance functions, say you use the harmonic mean of the Euclidean distance (maybe weighted with some scaling factor) and a categorical similarity function.
Once you have a decent similarity function, a whole bunch of algorithms will become available to you, e.g. DBSCAN (Wikipedia) or OPTICS (Wikipedia). ELKI may be of interest to you; they have a tutorial on writing custom distance functions.
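As a rough sketch of a composed distance of this kind (using a simple weighted sum rather than a harmonic mean, and with made-up weights, data, and column layout), plugged into scikit-learn's DBSCAN via a precomputed distance matrix:

import numpy as np
from sklearn.cluster import DBSCAN

# Toy mixed-type data: (numeric_value, category); layout and weights are assumptions.
data = [(1.0, "glass"), (1.2, "glass"), (5.0, "steel"), (5.3, "steel"), (9.0, "wood")]

def mixed_distance(a, b, w_num=1.0, w_cat=2.0):
    """Weighted sum of a numeric distance and a simple categorical mismatch penalty."""
    d_num = abs(a[0] - b[0])                 # Euclidean distance in 1-D
    d_cat = 0.0 if a[1] == b[1] else 1.0     # 0 if categories match, 1 otherwise
    return w_num * d_num + w_cat * d_cat

# Precompute the pairwise distance matrix and hand it to DBSCAN.
n = len(data)
D = np.array([[mixed_distance(data[i], data[j]) for j in range(n)] for i in range(n)])
labels = DBSCAN(eps=1.5, min_samples=2, metric="precomputed").fit_predict(D)
print(labels)   # e.g. [0 0 1 1 -1]: two clusters plus one noise point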
Interpretation is a separate issue. Unfortunately, few clustering algorithms will give you a human-readable interpretation of what they found. They may give you things such as a representative (e.g. the mean of a cluster in k-means), but little more. Of course, you could then train a decision tree on the clustering output and try to interpret the decision tree learned from the clustering; the one really nice feature of decision trees is that they are somewhat human-understandable. But just like a support vector machine will not give you an explanation, most (if not all) clustering algorithms will not either, unless you do this kind of post-processing. Plus, it will actually work with any clustering algorithm, which is a nice property if you want to compare multiple algorithms.
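A minimal sketch of that decision-tree post-processing step using scikit-learn (the toy features, the assumed one-hot encoding of categorical attributes, and the cluster labels are all placeholders):

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# X: numeric feature matrix (categorical features assumed already one-hot encoded),
# labels: cluster assignments produced by whatever clustering algorithm you used.
X = np.array([[1.0, 0.0], [1.2, 0.0], [5.0, 1.0], [5.3, 1.0], [9.0, 2.0]])
labels = np.array([0, 0, 1, 1, 2])

# Fit a shallow tree that predicts the cluster label from the features,
# then print it as human-readable if/else rules.
tree = DecisionTreeClassifier(max_depth=3).fit(X, labels)
print(export_text(tree, feature_names=["feature1", "feature2"]))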
There was a related publication last year. It is a bit obscure and experimental (presented at a workshop at ECML-PKDD), and requires the data set to have quite extensive ground truth in the form of rankings. In the example, they used color similarity rankings and some labels. The key idea is to analyze the cluster and find the best explanation using the given ground truth(s). They were trying to use it to, e.g., say "this cluster is largely based on this particular shade of green, so it is not very interesting, but the other cluster cannot be explained very well; you need to investigate it more closely - maybe the algorithm discovered something new here". But it was very experimental (workshops are for work-in-progress research). You might be able to use this by just using your features as ground truth. It should then detect whether a cluster can be easily explained by things such as "attribute5 is approx. 0.4 with low variance". But it will not necessarily create such an explanation!
H.-P. Kriegel, E. Schubert, A. Zimek
Evaluation of Multiple Clustering Solutions
In: 2nd MultiClust Workshop: Discovering, Summarizing and Using Multiple Clusterings, held in conjunction with ECML PKDD 2011. http://dme.rwth-aachen.de/en/MultiClust2011
A common approach to solving this type of clustering problem is to define a statistical model that captures the relevant characteristics of your data. Cluster assignments can then be derived by fitting a mixture model (as in the Gaussian Mixture Model) and finding the mixture component with the highest probability for a particular data point.
In your case, each example is a vector that has both real and categorical components. A simple approach is to model each component of the vector separately.
I generated a small example dataset where each example is a vector of two dimensions. The first dimension is a normally distributed variable and the second is a choice of five categories (see graph):
There are a number of frameworks available for running Monte Carlo inference for statistical models. BUGS is probably the most popular (http://www.mrc-bsu.cam.ac.uk/bugs/). I created this model in Stan (http://mc-stan.org/), which uses a different sampling technique than BUGS and is more efficient for many problems:
data {
  int<lower=0> N;   // number of data points
  int<lower=0> C;   // number of categories
  real x[N];        // normally distributed component data
  int y[N];         // categorical component data
}
parameters {
  real<lower=0,upper=1> theta;   // mixture probability
  real mu[2];                    // means for the normal component
  simplex[C] phi[2];             // categorical distributions for the categorical component
}
transformed parameters {
  real log_theta;
  real log_one_minus_theta;
  vector[C] log_phi[2];
  vector[C] alpha;
  log_theta <- log(theta);
  log_one_minus_theta <- log(1.0 - theta);
  for (c in 1:C)
    alpha[c] <- .5;
  for (k in 1:2)
    for (c in 1:C)
      log_phi[k,c] <- log(phi[k,c]);
}
model {
  theta ~ uniform(0,1);   // equivalently, ~ beta(1,1);
  for (k in 1:2) {
    mu[k] ~ normal(0,10);
    phi[k] ~ dirichlet(alpha);
  }
  for (n in 1:N) {
    lp__ <- lp__ + log_sum_exp(log_theta + normal_log(x[n],mu[1],1) + log_phi[1,y[n]],
                               log_one_minus_theta + normal_log(x[n],mu[2],1) + log_phi[2,y[n]]);
  }
}
I compiled and ran the Stan model and used the parameters from the final sample to compute the probability of each data point under each mixture component. I then assigned each data point to the mixture component (cluster) with the higher probability to recover the cluster assignments below:
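As a sketch of that assignment step (the parameter values below stand in for whatever the final Stan sample gives you, the data arrays are placeholders, and categories are 0-based here rather than Stan's 1-based indexing):

import numpy as np
from scipy.stats import norm

# Parameters taken from (e.g.) the final posterior sample of the Stan model above.
theta = 0.6                       # mixture probability of component 1
mu = np.array([0.0, 3.0])         # means of the normal component, one per mixture component
phi = np.array([[0.5, 0.2, 0.1, 0.1, 0.1],   # categorical distribution, component 1
                [0.1, 0.1, 0.2, 0.2, 0.4]])  # categorical distribution, component 2

# Placeholder data: x is the real-valued dimension, y the category index (0-based here).
x = np.array([0.1, 2.9, 3.2, -0.5])
y = np.array([0, 4, 3, 1])

# Log-probability of each point under each component, then argmax to assign clusters.
log_p1 = np.log(theta)     + norm.logpdf(x, mu[0], 1.0) + np.log(phi[0, y])
log_p2 = np.log(1 - theta) + norm.logpdf(x, mu[1], 1.0) + np.log(phi[1, y])
assignments = np.where(log_p1 > log_p2, 0, 1)
print(assignments)   # e.g. [0 1 1 0]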
Basically, the parameters for each mixture component will give you the core characteristics of each cluster if you have created a model appropriate for your dataset.
For heterogeneous, non-Euclidean data vectors like you describe, hierarchical clustering algorithms often work best. The conditional probability condition you describe can be incorporated as an ordering of the attributes used to perform cluster agglomeration or division. The semantics of the resulting clusters are easy to describe.
I have a database of addresses, all geocoded.
What is the best way to find all addresses in our database within a certain radius of a given lat, lng?
In other words a user enters (lat, lng) of a location and we return all records from our database that are within 10, 20, 50 ... etc. miles of the given location.
It doesn't have to be very precise.
I'm using MySQL DB as the back end.
There are Spatial extensions available for MySQL 5 - an entry page to the documentation is here:
http://dev.mysql.com/doc/refman/5.0/en/spatial-extensions.html
There are lots of details of how to accomplish what you are asking, depending upon how your spatial data is represented in the DB.
Another option is to make a function for calculating the distance using the Haversine formula mentioned already. The math behind it can be found here:
http://www.movable-type.co.uk/scripts/latlong.html
Hopefully this helps.
You didn't mention your database, but in SQL Server 2008 it is as easy as this when you use the geography data types.
This will find all zipcodes within 20 miles of zipcode 10028:
SELECT h.*
FROM zipcodes g
JOIN zipcodes h ON g.zipcode <> h.zipcode
AND g.zipcode = '10028'
AND h.zipcode <> '10028'
WHERE g.GeogCol1.STDistance(h.GeogCol1)<=(20 * 1609.344)
see also here SQL Server 2008 Proximity Search With The Geography Data Type
The SQL Server 2000 version is here: SQL Server Zipcode Latitude/Longitude proximity distance search
This is a typical spatial search problem.
1> What DB are you using? SQL Server 2008, Oracle, ESRI geodatabase, and PostGIS are some spatial DB engines that have this functionality.
2> Otherwise, you are probably looking for a spatial algorithm library if you want to achieve this. You could code it yourself, but I wouldn't suggest it, because computational geometry is a complicated topic.
If you're using a database which supports spatial types, you can build the query directly, and the database will handle it. PostgreSQL, Oracle, and the latest MS SQL all support this, as do some others.
If not, and precision isn't an issue, you can do a search in a box instead of by radius, as this will be very fast. Otherwise, things get complicated, as the actual conversion from lat-long -> distances needs to happen in a projected space (since the distances change in different areas of the planet), and life gets quite a bit nastier.
I don't remember the equation off the top of my head, but the Haversine formula is what is used to calculate distances between two points on the Earth. You may Google the equation and see if that gives you any ideas. Sorry, I know this isn't much help, but maybe it will give a place to start.
If it doesn't have to be very accurate, and I assume you have an x and y column in your table, then just select all rows in a big bounding rectangle, and use Pythagoras (or Haversine) to trim off the results in the corners.
e.g. select * from locations where (x between xpos - 10 miles and xpos + 10 miles) and (y between ypos - 10 miles and ypos + 10 miles).
Remember Pythagoras is sqrt(x_dist^2 + y_dist^2).
It's quick and simple, easy to understand, and doesn't need funny joins.
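A rough sketch of that bounding-box-then-trim approach in Python (table and column names are made up, db is assumed to be any DB-API connection such as MySQLdb or sqlite3, and the miles-per-degree conversion is approximate and ignores the poles and the date line):

import math

MILES_PER_DEG_LAT = 69.0   # rough average; good enough when precision doesn't matter

def within_radius(db, lat, lng, radius_miles):
    """Bounding-box pre-filter on indexed columns, then Pythagorean trim of the corners."""
    dlat = radius_miles / MILES_PER_DEG_LAT
    dlng = radius_miles / (MILES_PER_DEG_LAT * math.cos(math.radians(lat)))  # shrink with latitude
    candidates = db.execute(
        "SELECT id, lat, lng FROM addresses "
        "WHERE lat BETWEEN ? AND ? AND lng BETWEEN ? AND ?",
        (lat - dlat, lat + dlat, lng - dlng, lng + dlng),
    ).fetchall()
    results = []
    for row_id, r_lat, r_lng in candidates:
        dx = (r_lng - lng) * MILES_PER_DEG_LAT * math.cos(math.radians(lat))
        dy = (r_lat - lat) * MILES_PER_DEG_LAT
        if math.sqrt(dx * dx + dy * dy) <= radius_miles:   # Pythagoras, as described above
            results.append(row_id)
    return results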
How do you calculate the distance between 2 cities?
If you need to take the curvature of the earth into account, the great-circle distance is what you're looking for. The Wikipedia article probably does a better job of explaining how the formula works than I could, and there's also this aviation formulary page that goes into more detail.
The formulas are only the first part of the puzzle though; if you need to make this work for arbitrary cities, you'll need a location database to get the lat/long from. Luckily you can get this for free from Geonames.org, although there are commercial DBs available (ask Google). So, in general, look up the two cities you want, get the lat/long coordinates, and plug them into the formula as in the Wikipedia worked example.
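A small sketch of that last step, using the spherical law of cosines form of the great-circle formula (the city coordinates below are approximate and only for illustration):

import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def great_circle_km(lat1, lon1, lat2, lon2):
    """Great-circle distance via the spherical law of cosines."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lon2 - lon1)
    central = math.acos(math.sin(p1) * math.sin(p2) + math.cos(p1) * math.cos(p2) * math.cos(dlmb))
    return EARTH_RADIUS_KM * central

# Approximate coordinates, looked up from a source such as Geonames.org
london = (51.5074, -0.1278)
paris = (48.8566, 2.3522)
print(round(great_circle_km(*london, *paris)))  # roughly 344 km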
Other suggestions:
For a full commercial solution, there's PC Miler, which is used by many trucking companies to calculate shipping rates.
Make calls to the Google Maps (or other) API. If you need to do many requests per day, consider caching the results on the server.
Also very important is to consider building an equivalence database for cities, suburbs, towns etc. if you think you'll ever need to group your data. This gets really complicated though, and you may not find a one-size-fits-all solution for your problem.
Last but not least, Joel wrote an article about this problem a while back, so here you go: New Feature: Job Search
You use the Haversine formula.
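A self-contained sketch of that formula in Python (uses the mean Earth radius in kilometers; swap in 3958.8 to get miles):

import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points using the Haversine formula."""
    r = 6371.0  # mean Earth radius in kilometers
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

print(round(haversine_km(40.7128, -74.0060, 51.5074, -0.1278)))  # New York -> London, roughly 5570 km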
This is very easy to do with geography type in SQL Server 2008.
SELECT geography::Point(lat1, lon1, 4326).STDistance(geography::Point(lat2, lon2, 4326))
-- computes distance in meters using the elliptical model, accurate to the mm
4326 is the SRID for the WGS84 ellipsoidal Earth model
You can use the A* algorithm to find the shortest path between those two cities, and that way you'll have the distance.
If you're talking about the shortest distance between two real cities on a real spherical planet, like Earth, you want the great circle distance.
If you are working in the plane and you want the Euclidean distance "as the crow flies":
// Cities are points x0,y0 and x1,y1 in kilometers or miles or Smoots[1]
dx = x1 - x0;
dy = y1 - y0;
dist = sqrt(dx*dx + dy*y);
No trigonometry needed! Just the Pythagorean theorem and the fact that squares are always positive so you don't need dx = abs(x1 - x0), etc. to get a positive number to pass to sqrt().
Note that you could probably do this in one line, and a compiler would probably reduce it to the equivalent of the above code:
dist = sqrt((x1-x0)*(x1-x0) + (y1-y0)*(y1-y0));
[1] http://en.wikipedia.org/wiki/Smoot
You can get the distance between two cities from the Google Maps API.
Here is an implementation of it in Python:
#!/usr/bin/python
import requests
from sys import argv

def get_distance(origin, destination):
    """Query the Google Maps Distance Matrix API for the distance between two places."""
    gmap = 'http://maps.googleapis.com/maps/api/distancematrix/json'
    payload = {"origins": origin, "destinations": destination, "sensor": 'false'}
    try:
        a = requests.get(gmap, params=payload)
        data = a.json()
        origin = str(data['origin_addresses'][0])
        destination = str(data['destination_addresses'][0])
        distance = data['rows'][0]['elements'][0]['distance']['text']
        return distance, origin, destination
    except Exception:
        print("The %s or %s does not exist :(" % (origin, destination))
        exit()

if __name__ == "__main__":
    if len(argv) < 3:
        print("sorry, check the format")
    else:
        origin = argv[1]
        destination = argv[2]
        distance, origin, destination = get_distance(origin, destination)
        print("%s ---> %s : %s" % (origin, destination, distance))
Example link: https://gist.github.com/sarathsp06/cf063e47bcc515b51c84
You find the Lat/Lon of the city, then use a distance estimation algorithm for Lat/Lon coordinates.
If you need a code example, I think I have one I could dig up at home, but like many of the previous answers, you need a lat/long DB to do the calculation.
It is better to use a look-up table for obtaining the distance between two cities.
This makes sense because
* The formula to calculate the distance is quite computationally intensive.
* Distances between cities are unlikely to change.
So unless your needs are very specific (like terrain mapping from a satellite, some topography algorithm, or something else), you should really just save the list of cities and the distances between them into a table and look them up as needed.
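A tiny sketch of such a lookup table in Python (the city pairs and distances are made-up placeholders); storing each pair under a sorted key means only one entry per pair is needed:

# Precomputed distances in kilometers, keyed by an order-independent city pair.
DISTANCES_KM = {
    ("London", "Paris"): 344,
    ("London", "New York"): 5570,
}

def lookup_distance(city_a, city_b):
    """Return the stored distance, regardless of the order the cities are given in."""
    key = tuple(sorted((city_a, city_b)))
    return DISTANCES_KM.get(key)

print(lookup_distance("Paris", "London"))  # 344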
I've been doing a lot of work with this recently. I'm finding SQL 2008's new features really make this easy. I can find all the points that are within X km in a 100k-record table in sub-second time... not too shabby.
The great-circle (spherical assumption) method in my testing was about 2.5 miles off when compared to the Vincenty formula (ellipsoidal assumption, which is what the earth actually is).
The real trick is getting the lat and long. For that I'm using Google.
@Jared - a minor correction to your code example. The last line of the first code example should read:
dist = sqrt(dx*dx + dy*dy);
I agree that once you have the info, if it's not going to change, store it somehow. @Marko Tinto: thanks for the T-SQL sample. For those who don't have access to SQL Server or prefer another method: if you need high accuracy, check out Wikipedia's entry on the Vincenty algorithm for more info. I believe there is a JS implementation, which would (if not already done) be easily ported to other languages. Also, at the bottom of that page is a link to GeographicLib, which purports to be 1000 times more accurate than the Vincenty algorithm (if you have data that good, it might matter).
Why would you use something like the Vincenty method? Because the earth is not a perfect sphere and methods like that allow for inputting a more accurate major and minor axis for modeling the earth.
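If you want that in Python, the geographiclib package exposes the same WGS84 geodesic calculations; a minimal sketch, assuming the package is installed (pip install geographiclib):

from geographiclib.geodesic import Geodesic

# Geodesic distance on the WGS84 ellipsoid between New York and London (approximate coordinates).
result = Geodesic.WGS84.Inverse(40.7128, -74.0060, 51.5074, -0.1278)
print(result["s12"] / 1000.0)  # distance in kilometers, roughly 5585 km on the ellipsoid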
I use distancy. It's so simple and clean.