Calculating Distance Between 2 Cities [closed] - algorithm

How do you calculate the distance between 2 cities?

If you need to take the curvature of the earth into account, the great-circle distance is what you're looking for. The Wikipedia article probably does a better job of explaining how the formula works than I can, and there's also this aviation formulary page that goes into more detail.
The formulas are only the first part of the puzzle though; if you need to make this work for arbitrary cities, you'll need a location database to get the lat/long from. Luckily you can get this for free from Geonames.org, although there are commercial databases available (ask Google). So, in general, look up the two cities you want, get the lat/long coordinates and plug them into the formula as in the Wikipedia worked example.
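If you want to compute it yourself, here is a minimal Python sketch of the haversine (great-circle) formula; the example coordinates are approximate placeholders and 6371 km is the usual mean Earth radius.

from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    # great-circle distance on a sphere of radius 6371 km
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

# e.g. London -> Paris with approximate coordinates
print(haversine_km(51.5074, -0.1278, 48.8566, 2.3522))  # roughly 344 km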
Other suggestions:
* For a full commercial solution, there's PC Miler, which is used by many trucking companies to calculate shipping rates.
* Make calls to the Google Maps (or another) API. If you need to make many requests per day, consider caching the results on the server.
* Also very important: consider building an equivalence database for cities, suburbs, towns, etc. if you think you'll ever need to group your data. This gets really complicated though, and you may not find a one-size-fits-all solution for your problem.
* Last but not least, Joel wrote an article about this problem a while back, so here you go: New Feature: Job Search

You use the Haversine formula.

This is very easy to do with the geography type in SQL Server 2008.
SELECT geography::Point(lat1, lon1, 4326).STDistance(geography::Point(lat2, lon2, 4326))
-- computes distance in meters using an ellipsoidal model, accurate to the mm
4326 is the SRID for the WGS84 ellipsoidal Earth model

You can use the A* algorithm to find the shortest path between those two cities, and this way you'll have the (road) distance.
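As a rough sketch of that idea in Python (plain Dijkstra, i.e. A* with a zero heuristic, over a made-up road graph; the city names and distances are placeholders):

import heapq

def shortest_distance(graph, start, goal):
    # graph: {city: [(neighbour, distance_km), ...]}
    queue = [(0.0, start)]
    best = {start: 0.0}
    while queue:
        dist, city = heapq.heappop(queue)
        if city == goal:
            return dist
        if dist > best.get(city, float('inf')):
            continue
        for neighbour, edge in graph[city]:
            candidate = dist + edge
            if candidate < best.get(neighbour, float('inf')):
                best[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return None

roads = {
    "A": [("B", 120), ("C", 90)],
    "B": [("A", 120), ("D", 60)],
    "C": [("A", 90), ("D", 150)],
    "D": [("B", 60), ("C", 150)],
}
print(shortest_distance(roads, "A", "D"))  # 180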

If you're talking about the shortest distance between two real cities on a real spherical planet, like Earth, you want the great circle distance.

If you are working in the plane and you want the Euclidean distance "as the crow flies":
// Cities are points x0,y0 and x1,y1 in kilometers or miles or Smoots[1]
dx = x1 - x0;
dy = y1 - y0;
dist = sqrt(dx*dx + dy*y);
No trigonometry needed! Just the Pythagorean theorem, and since squares are never negative you don't need dx = abs(x1 - x0), etc. to get a non-negative number to pass to sqrt().
Note that you could probably do this in one line and a compiler would probably reduce it to the equivalent of the above code:
dist = sqrt((x1-x0)*(x1-x0) + (y1-y0)*(y1-y0));
[1] http://en.wikipedia.org/wiki/Smoot

You can get the distance between two cities from the Google Maps API.
Here is an implementation of it in Python:
#!/usr/bin/python
import requests
from sys import argv

def get_distance(origin, destination):
    gmap = 'http://maps.googleapis.com/maps/api/distancematrix/json'
    payload = {"origins": origin, "destinations": destination, "sensor": "false"}
    try:
        response = requests.get(gmap, params=payload)
        data = response.json()
        origin = str(data['origin_addresses'][0])
        destination = str(data['destination_addresses'][0])
        distance = data['rows'][0]['elements'][0]['distance']['text']
        return distance, origin, destination
    except Exception:
        print("The origin %s or destination %s does not exist :(" % (origin, destination))
        exit()

if __name__ == "__main__":
    if len(argv) < 3:
        print("Sorry, check the format: %s <origin> <destination>" % argv[0])
    else:
        origin = argv[1]
        destination = argv[2]
        distance, origin, destination = get_distance(origin, destination)
        print("%s ---> %s : %s" % (origin, destination, distance))
Example link: https://gist.github.com/sarathsp06/cf063e47bcc515b51c84

You find the Lat/Lon of the city, then use a distance estimation algorithm for Lat/Lon coordinates.

If you need a code example, I think I have one I could dig up at home, but as with many of the previous answers, you need a lat/long database to do the calculation.

It is better to use a look-up table for obtaining the distance between two cities.
This makes sense because
* the formula to calculate the distance is quite computationally intensive, and
* the distance between cities is unlikely to change.
So unless your needs are very specific (like terrain mapping from a satellite, some topography algorithm, or something else), you should really just save the list of cities and the distances between them into a table and look it up as needed.
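A minimal sketch of such a look-up table in Python, assuming the distances have been computed once up front (the numbers below are approximate placeholders):

# precomputed distances in km, keyed by an order-independent city pair
distance_table = {
    frozenset(("London", "Paris")): 344,
    frozenset(("London", "Berlin")): 932,
    frozenset(("Paris", "Berlin")): 878,
}

def lookup_distance(city_a, city_b):
    return distance_table.get(frozenset((city_a, city_b)))

print(lookup_distance("Paris", "London"))  # 344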

I've been doing a lot of work with this recently. I'm finding SQL 2008's new features really make this easy. I can find all the points that are within X km of a 100k-record table in sub-second time... not too shabby.
The great-circle (spherical assumption) method in my testing was about 2.5 miles off when compared to the Vincenty formula (ellipsoidal assumption, which is much closer to the Earth's actual shape).
The real trick is getting the lat and long; for that I'm using Google.

#Jared - a minor correction to your code example. The last line of the first code example should read:
dist = sqrt(dx*dx + dy*dy);

I agree that once you have the info, if it's not going to change, store it somehow. #Marko Tinto Thanks for the T-SQL sample. For those who don't have access to SQL Server or prefer another method: If you need high accuracy, check out Wikipedia's entry on the Vincenty algorithm for more info. I believe there is a JS implementation, which would (if not already) be easily ported to other languages. Also, at the bottom of that page is a link to GeographicLib, which purports to be 1000 times more accurate than the Vincenty algorithm (if you have data that good, it might matter).
Why would you use something like the Vincenty method? Because the Earth is not a perfect sphere, and methods like that allow for inputting a more accurate major and minor axis for modeling the Earth.
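If you'd rather not implement Vincenty yourself, one possible shortcut (my suggestion, not something from the answers above) is the geopy package, whose geodesic distance is backed by GeographicLib:

from geopy.distance import geodesic

# approximate coordinates for two cities (lat, lon)
london = (51.5074, -0.1278)
paris = (48.8566, 2.3522)

print(geodesic(london, paris).km)  # roughly 344 km on the WGS84 ellipsoid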

I use distancy.
So simple and clean.

Related

Statistics/Algorithm: How do I compare a weekly graph with its own history to see when in the past it was almost the same?

I've got a statistical/mathematical problem I'm stumped on, and I was really hoping to get some help. I'm working on research where I need to compare a weekly graph with its own history to see when in the past it was almost the same. Think of this as "finding the closest match". The information is displayed as a line graph, but it's readily available as raw data:
Date...................Result
08/10/18......52.5
08/07/18......60.2
08/06/18......58.5
08/05/18......55.4
08/04/18......55.2
and so on...
What I really want is for the output to be a form of correlation between the current data points and other sets of 5 consecutive data points in history. So, something like:
Date range.....................Correlation
07/10/18-07/15/18....0.98
We'll be getting code written in Python for the software to do this automatically (so that as new data is added, it automatically runs and finds the closest set of numbers to match the current one).
Here's where the difficulty sets in: since the numbers are on a general upward trend over time, we don't want to compare absolute values (since the numbers might never really match). One suggestion has been to compare the delta (rate of change as a percentage over the previous day), or to use a log scale.
I'm wondering: how do I go about this? What kind of calculation can I use to get the desired results? I've looked at the different kinds of correlation equations, but they don't account for the "shape" of the data; they generally just average it out. The shape of the line chart is the important thing.
Thanks very much in advance!
I would simply divide the data of each week by their average (i.e., normalize them to an average of 1), then sum the squares of the differences of each day of each pair of weeks. This sum is what you want to minimize.
If you don't care about how much a graph oscillates relative to its mean, you can normalize also the variance. For each week, calculate mean and variance, then subtract the mean and divide by the root of the variance. Each week will have mean 0 and variance 1. Then minimize the sum of squares of differences like before.
If the normalization of data is all you can change in your workflow, just leave out the sum of squares of differences minimization part.
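A minimal Python sketch of the mean-normalized sum-of-squares comparison described above; the weekly windows below are made-up placeholder numbers:

def normalize(week):
    mean = sum(week) / len(week)
    return [value / mean for value in week]

def dissimilarity(week_a, week_b):
    a, b = normalize(week_a), normalize(week_b)
    return sum((x - y) ** 2 for x, y in zip(a, b))

current = [52.5, 60.2, 58.5, 55.4, 55.2]
history = {
    "07/10/18-07/15/18": [48.0, 55.1, 53.6, 50.7, 50.5],
    "06/10/18-06/15/18": [44.0, 43.8, 47.2, 51.0, 45.3],
}
# the week with the smallest score is the closest match
print(min(history, key=lambda k: dissimilarity(current, history[k])))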

What is Youtube comment system sorting / ranking algorithm? [closed]

Youtube provides two sorting options: Newest first and Top comments. "Newest first" is pretty simple: we just sort the comments by their post date. But "Top comments" seems to be a lot more complex than just sorting by the number of "thumb up"s.
After a short research, I found out that the order of comments depends on those things:
Number of "thumb up"s and "thumb down"s
Post date
Number of replies to that comment
But I don't know how Youtube uses this information to decide the order, like what information is more important and what is less important.
Is there any article about this topic that I could refer to?
Thanks!
I have the answer to your question.
After searching the internet for the answer to this, I never found precisely what I was looking for. So, my colleagues and I decided to experiment using the system with the Youtube comments.
First of all, we sorted what we believed to be popular videos into one section, average videos into another, and less popular into the last. There were 200 videos in each section, and after days of examining we started to notice a pattern. We found that you were right about the three things required, but we also dove a little deeper and found an additional variable.
The Youtube comment system depends on four things:
1) Time it was posted,
2) Like/dislike ratio of a comment,
3) Number of replies,
4) And, believe it or not, WHO posted it.
The average like/dislike ratio of every public comment you've ever posted factors into it, as (we predicted) they believe that users with low like/dislike ratios tend to post comments that many people do not like or simply disagree with.
There is an algorithm to it, and it is simpler than you might think. Basically there are these things that we called "module points," and each comment gets a certain number of them based on these four factors. First, here's what you need to know about module point conversion for TWO of the factors:
For the like/dislike ratio on the comment, multiply that number by ten.
Each reply (NOT from the original poster) that the comment has is worth two module points.
These are the two basic factors that tell the amount of module points the comment has.
For example, if a comment had 27 likes and 8 dislikes, then the ratio would be 3.375. Multiplying by 10, you would then have 33.75 module points. Using the next factor, amount of replies, let's say this comment has 4 direct replies to it. Multiplying 2 by 4, we get 8. This is the part where you add 8 onto the accumulative module points, giving you a total of 41.75 module points.
But we're not done here; this is where it gets tricky.
Using the average like/dislike ratio of a person's total comments that they've ever posted publicly, we found that the formula added onto the accumulative module points is this:
C = MP(R/3) + (MP/10)
where C = Comment Position Variable; MP = Module Points; R = Person's total like/dislike ratio
Trust me, we spent DAYS just on this part, which was probably the most frustrating. Even though the 3 and the 10 in this equation seem random and unnecessary, so far all of the comments we tested this equation on passed the test, but they did not pass when those two constants were removed. After this equation is done, it gives you a number that we named the Position Variable.
However, we are not even done yet, we still haven't talked about time.
I was actually quite surprised that this part didn't take as long as I expected, but it sure was a pain doing this equation every single time for every comment we tested. At first, when testing it, we figured that the time was just there to break the barrier if 2 comments had equal Position Variables.
In fact, I almost called it a wrap on the experiment when this happened, but upon further inspection, we found out there was more to do. We found that some comments with the same Position Variable outranked each other, and the timing seemed to be random! After a few days of inspection, here is where the final result comes in:
There is yet ANOTHER equation that we must find before applying the 4th variable. Using another separate equation, here's what our algebraic deductions came down to:
X = (1/3) x (S/10 + A) x |A - 3S|
where X = Timing Variable; S = How long ago the video was posted in minutes; A = How long ago the comment was posted in minutes
I wish I was making this up, but unfortunately this is how complicated the system is. There are mathematical reasons behind the other variables, but they are far too complex to explain; it would probably take at least three paragraphs. We tested this equation on more than 150 comments, and all of them checked out to be true.
Once you find X, which is what we called the Timing Variable, all you have to do from here is apply it to this equation:
N = X(C/4 + 1)
where X = Timing Variable; C = Positioning Variable
N is the answer to all your problems.
This is the final equation, the final answer. The simple conclusion: the higher N, the higher up the comment is.
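Taking the formulas above at face value, here is a minimal Python sketch of the whole calculation (the example inputs are made up, and it assumes the comment has at least one dislike so the ratio is defined):

def ranking_score(likes, dislikes, replies, poster_ratio,
                  video_age_min, comment_age_min):
    # module points from the comment's own like/dislike ratio and reply count
    # (assumes dislikes > 0; the answer does not say how zero dislikes is handled)
    mp = (likes / dislikes) * 10 + 2 * replies
    # position variable, folding in the poster's overall like/dislike ratio
    c = mp * (poster_ratio / 3.0) + mp / 10.0
    # timing variable from the video's and comment's ages (in minutes)
    s, a = video_age_min, comment_age_min
    x = (1.0 / 3.0) * (s / 10.0 + a) * abs(a - 3 * s)
    # final score: the higher N, the higher up the comment is
    return x * (c / 4.0 + 1)

print(ranking_score(likes=27, dislikes=8, replies=4,
                    poster_ratio=2.0, video_age_min=600, comment_age_min=30))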
Note: Special thanks to my colleagues: David Mattison, Josh Williams, Diego Mendieta, Steven Orsette, and Kyle Shropshire. I could have never found out this without them and the work they put into this.

Tools to determine exact location when using ibeacons

We are working with a Retail client who would like to know if using multiple iBeacons throughout the store would help track a customer's exact location when they are inside the store (of course when they have the client's App installed).
I would like to know what software tools are already available for this purpose?
What is clear is that at the basic level the location of a device can be determined based on its relative distance from multiple (at least 2) iBeacons. If so, aren't there tools that help with this?
Thanks
Obviously this is unlikely to work well due to the inconsistency of the RSSI value (bluetooth signal). However, this is the direction you may want to take it (adapted from lots of stackoverflow research):
Filter RSSI
I use a rolling filter with this whenever the beacons are ranged, using a kFilteringFactor of 0.1:
rollingRssi = (beacon.rssi * kFilteringFactor) + (rollingRssi * (1.0 - kFilteringFactor));
And I use this to get a rolling Accuracy value (in meters). (Thanks David!)
- (double)calculateAccuracyWithRSSI:(double)rssi {
    // formula adapted from David Young's Radius Networks Android iBeacon code
    if (rssi == 0) {
        return -1.0; // if we cannot determine accuracy, return -1.
    }
    double txPower = -70;
    double ratio = rssi * 1.0 / txPower;
    if (ratio < 1.0) {
        return pow(ratio, 10);
    }
    else {
        double accuracy = (0.89976) * pow(ratio, 7.7095) + 0.111;
        return accuracy;
    }
}
Calculate XY with Trilateration (Beacons 1, 2, and 3 are Beacon subclasses with pre-set X and Y values for location and distance is calculated as above).
float xa = beacon1.locationX;
float ya = beacon1.locationY;
float xb = beacon2.locationX;
float yb = beacon2.locationY;
float xc = beacon3.locationX;
float yc = beacon3.locationY;
float ra = beacon1.filteredDistance;
float rb = beacon2.filteredDistance;
float rc = beacon3.filteredDistance;
float S = (pow(xc, 2.) - pow(xb, 2.) + pow(yc, 2.) - pow(yb, 2.) + pow(rb, 2.) - pow(rc, 2.)) / 2.0;
float T = (pow(xa, 2.) - pow(xb, 2.) + pow(ya, 2.) - pow(yb, 2.) + pow(rb, 2.) - pow(ra, 2.)) / 2.0;
float y = ((T * (xb - xc)) - (S * (xb - xa))) / (((ya - yb) * (xb - xc)) - ((yc - yb) * (xb - xa)));
float x = ((y * (ya - yb)) - T) / (xb - xa);
CGPoint point = CGPointMake(x, y);
return point;
The easiest way to get an exact location is to put one iBeacon at each point you care about, then have an iBeacon-aware app compare the "accuracy" field (which actually gives you a rough distance estimate in meters), and assume the user is at the iBeacon point with the lowest "accuracy" reading. Clearly, this approach will require a large number of iBeacons to give a precise location over a large floorplan.
Lots of folks have proposed triangulation-like strategies for using only a few iBeacons. This is much more complex, and there is no pre-built software to do this. While I have read a lot about people wanting or trying to do this, I have not heard any reports of folks pulling it off yet.
If you want to try this yourself, then you should realize that you are undertaking a bit of a science project, and there may be a great deal of time and effort needed to make it happen with unknown results.
Exact location is something that is unlikely to be achievable, but something within certain tolerances is certainly possible. I've not done extensive testing of this yet, but in a small 3x4 m area with three beacons you can get good positioning in ideal situations. The problem is that we don't normally have ideal situations!
The hard part is getting an accurate distance from the receiver to the iBeacon. RSSI (the received signal strength) is the only information we have; to turn this into a distance we use a measurement based on known signal strengths at various distances from the transmitter, e.g. Qiu, T, Zhou, Y, Xia, F, Jin, N, & Feng, L 2012. This bit works well (and is already implemented, with an averaged accuracy, in the iOS SDK), but unfortunately environmental conditions such as humidity and other objects (such as people) getting between the receiver and transmitter degrade the signal unpredictably.
You can read my initial research presentation on SlideShare, which covers some basic environmental effects and shows the effect on the accuracy of measurement; it also references articles that explain how RSSI is turned into distance, and some approaches to overcome the environmental factors. However, in a retail situation the top tip is to position the iBeacons on the ceiling, as this reduces the number of human obstructions.
Trilateration is basically the same whatever you do; I've been using Gema Megantara's version. To improve the positioning further, a technique will be needed that takes environmental conditions into account, e.g. Hyo-Sung & Wonpil 2009.
The solution is a technique called trilateration. There is a decent wiki article on it.
If you assume that all the beacons and the receiver are on the same plane you can ignore the Z dimension and it simplifies to circles.
The math is still kind of messy. You'd have to do some matrix math on the positions of the beacons to shift one beacon to the origin and put a second beacon on the x axis, and then apply the inverse of your matrix to the result to convert it back to "real" coordinates.
The big problem is that the "accuracy" (aka distance) value is anything but accurate. Even in a wide open space with no interference, the distance signals vary quite a bit. Add any interference (like from your body holding the phone even) and it gets worse. Add walls, furniture, metal surfaces, other people, etc, and it gets really wonky.
I have it on my list of things to do to write trilateration code, measure out a grid in my yard (when the weather warms up), take a tape measure, and do some testing.
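For what it's worth, here is a minimal Python sketch of the same 2D trilateration algebra used in the Objective-C snippet earlier in this thread; the beacon positions and distance estimates are made up:

def trilaterate(xa, ya, ra, xb, yb, rb, xc, yc, rc):
    # intersect the three distance circles; assumes the beacons are not collinear
    s = (xc**2 - xb**2 + yc**2 - yb**2 + rb**2 - rc**2) / 2.0
    t = (xa**2 - xb**2 + ya**2 - yb**2 + rb**2 - ra**2) / 2.0
    y = (t * (xb - xc) - s * (xb - xa)) / ((ya - yb) * (xb - xc) - (yc - yb) * (xb - xa))
    x = (y * (ya - yb) - t) / (xb - xa)
    return x, y

# three beacons at known positions, with filtered distance estimates in meters
print(trilaterate(0.0, 0.0, 2.236, 4.0, 0.0, 2.236, 0.0, 3.0, 2.828))
# should print approximately (2.0, 1.0)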
The problem with all of this is that the RSSI signal you get back is extremely volatile. If you simply take the raw RSSI you will get very unreliable answers. You need to somehow process the data you get back before you run it through any triangulation, and that means either 1) averaging or 2) filtering (or both). If you don't, you may get an "IMMEDIATE" proximity response even though you are in a fringe area.
Bluetooth Low Energy (4.0) alone is not a robust indoor location and mapping technology; it will most likely become part of the fabric of indoor location technologies, in much the same way Wi-Fi signals are used to add fidelity to GPS signals in cities. Currently iBeacon can really only be used for fairly nebulous nodes indoors, like 'the shoe department' (assuming a large store).
I expect via the 'iBeacon' service (or something alternatively-named), Apple are working on high-resolution indoor location for app developers. You need look no further than their purchase of WifiSLAM, in mid 2013, for evidence. As yet, iBeacon and any other solely BLE technology is not going to give you precise indoor location. (Perhaps if you blanket the store with beacons, and combine a probabilistic model with a physical model, you could do it, but not with any practically-implementable beacon strategy.)
Also of note, is the discussion around Nokia's next-gen BLE HAIP (High-Accuracy Indoor Positioning) version http://conversations.nokia.com/2012/08/23/new-alliance-helps-you-find-needle-in-a-haystack/
Basically accurate indoor positioning doesn't exist in the wild yet, but it's about to...
We used Gimbal's iBeacons for indoor localization in a 6 m x 11 m area. The RSSI fluctuation was a major issue that we handled through particle filtering. The code for the project can be found at https://github.com/ipapapa/IoT-MicroLocation/. The GitHub repository contains the iOS application as well as a Java-based Apache Tomcat server. While we did get an accuracy as high as 0.97 meters, it is still very challenging to use beacons for micro-location purposes. Check the paper http://people.engr.ncsu.edu/ipapapa/Files/globecom2015.pdf which has been accepted for publication at the IEEE Globecom conference.
To determine the user's position based on surrounding iBeacons, which have to stay at fixed positions, signal-strength trilateration is possible. Take a look at this thesis about Bluetooth Indoor Positioning ;-)
Certainly, if you can measure the exact position (lat/lon) of an individual iBeacon, that could be included in the beacon's message. And assuming a stationary beacon, this obviously only needs to be done once. It's sort of an independent "calibration" exercise, which could itself be automated.
When discussing positioning, you need to first define your needs more concretely.
Computer/GPS geeks will assume you want accuracy down to the millimeter, if not finer, so they will either provide you with more information than you need, or tell you it can't be done [both viable answers].
However, in the REAL WORLD, most people are looking for accuracy of at most 3 feet [or 1 meter], and most likely are willing to accept accuracy of within 10 feet [i.e. visual distance].
iOS already provides you with that level of accuracy: its API gives you the distance as "near, medium, far", so within 10 feet all you need to check is that the distance is "near" or "medium".
If your needs go beyond that, then you can provide the custom functionality quite easily. You have 32 bits of information [major and minor codes]. That is more than enough information to store the latitude and longitude of each iBeacon IN the beacon itself using Morton coding: http://www.spatial.maine.edu/~mark.a.plummer/Morton-GEOG664-Plummer.pdf
As long as altitude [height] is not a factor and no beacon will be deployed within 1 meter of another beacon, you can encode each lat/long pair into a single 32-bit integer and store it in the major and minor codes.
Using just the major code, you can determine the location of the beacon[and hence the phone] to within 100 meters[conservatively]. This means that many beacons within the same 100 meter radius will have the SAME major code.
Using the minor code, you can determine the location of the beacon to within 1 meter, and the location of the phone to within 10 feet.
The code for this is already written and widely available; just look for code that demonstrates how it is "impossible" to do this, and ignore the comments about it not being possible, since they're focused on precision to a greater degree than you care about.
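To illustrate the encoding side (my own sketch, not code from this answer): quantize lat/long to 16 bits each, interleave the bits into a 32-bit Morton code, and split it across the major and minor fields. Note that 16 bits per axis over the whole globe only resolves a few hundred meters, so a real deployment would restrict the encoded region to reach the finer resolution mentioned above.

def interleave16(a, b):
    # spread the 16 bits of a and b into the even/odd bits of a 32-bit value
    code = 0
    for i in range(16):
        code |= ((a >> i) & 1) << (2 * i)
        code |= ((b >> i) & 1) << (2 * i + 1)
    return code

def encode_beacon(lat, lon):
    # quantize each coordinate to 16 bits over its full range
    qlat = int((lat + 90.0) / 180.0 * 65535)
    qlon = int((lon + 180.0) / 360.0 * 65535)
    morton = interleave16(qlat, qlon)
    major, minor = morton >> 16, morton & 0xFFFF
    return major, minor

print(encode_beacon(51.5074, -0.1278))  # a major/minor pair for a (placeholder) London beacon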
**Note: as mentioned in later posts, external factors will affect signal strength, but again this is likely not relevant for your needs. There are 3 'distances' provided by the iPhone SDK: "close, near, far".
Far is the problematic one. Assume a beacon with a 150-foot range. Check with an iPhone to determine what the 2 closer distances ideally are... assume within 5 feet is "close" and 15 feet is "near".
If phone A is near to Beacon B[which has a known location] then you know the person is within 15 feet of point X. If there is a lot of interference, they may be 3 feet away, or they may be 15 feet, but in either case it is "within 15 feet". That's all you need.
By the same token, if you need to know if they are within 5 feet, then you use the "close" measurement.
I firmly believe that 80% of all positioning needs are covered by the current scheme. Where they are not, do your initial implementation with that limitation as a proof of concept, and then contact one of the many iBeacon experts to provide the last bit of accuracy.
I have not done the extensive research that I believe went into Phil's master's thesis above, but...
There is another team that has claimed to have figured this out using various AI algorithms . See this linkedin post: https://www.linkedin.com/groups/6510475/6510475-5866674142035607552
As someone who develops beacons ( http://www.getgelo.com ) I can share first hand that pretty much any object will drastically change the consistency and accuracy of the RSSI which is going to make computing an exact position impossible. (Phil, I hope you prove me wrong, I haven't read your thesis yet).
If you are the only person in a wide open space that has a grid of beacons, then you can likely get this to work, but as soon as you add other people, walls, objects, etc., then you're SOL.
You can approximate location, which is what iBeacons do and are pretty bad at, but it's directionless.
You could deploy enough beacons so that essentially wherever you are in a retail location you're standing very close to a beacon, and you can have high confidence that you're in aisle 5 about 20 ft down (as opposed to being on the other side of the aisle, in aisle 6, 20 ft down). Cost may become an issue here.
There are teams that are combining BLE with Wifi and other technologies to create a more accurate indoor positioning solution.
In short, and this will come as an echo of what's already been posted, BLE is not a good technology to be used solely for extremely accurate positioning.
The above problem can be solved using technology that combines Wi-Fi trilateration and a phone's sensor data. We get 1 meter accuracy in spaces that are properly outfitted when companies integrate our SDK with their app. The accuracy of these methods has improved dramatically over the last year.

how to find interesting points in time series

I have an array of date => value pairs, like this:
"2010-10-12 14:58:36" =>13.4
"2010-10-17 14:58:36" =>12
"2010-10-22 14:58:36" =>17.6
"2010-10-27 14:58:36" =>22
"2010-11-01 14:58:36" =>10
[...]
I use this date-value combination to paint a graph in JavaScript.
Now I'd like to mark those dates that are "very special".
My problem (and question) is: which aspects should I consider to find those specific dates?
As a human, I prefer the date "2010-10-17 14:58:36", because "something" must have happened on this date: the value rises by 5.6 points towards the next date, which is the biggest step up, followed by one more big step up. On the other hand, the date "2010-10-27 14:58:36" is also a "highlight", because it is
the top of all values and
after this date comes the biggest step down.
So as a human, I would choose both dates.
My problem is: what could an algorithm look like?
I tried averaging the values for n dates before and after the current value, which results in an accumulation of those specific dates at the beginning and at the end of the graph.
So I tried to find the biggest percentage step up (relative to the date before), but I'm not sure if I'm really finding the specific dates I'm looking for.
How would you tackle the problem?
Thank you.
Looks like a financial stock problem :-) You are looking for time series analysis; this is a statistical topic. I'd recommend using the R programming language to play with it (you can do complex statistical things very fast). There are tens of specialized packages, certainly financial ones too. Once you know what you want, you may implement the solution in any other language.
Just try to google "time series analysis r".
EDIT: note that R is very powerful; I'd bet there is a way to use R packages from other languages.
If you have information over a timeline, you could use interpolation.
A polynomial interpolation will give you an approximated polynomial that goes through the points.
What's nice about this is you can then use mathematical analysis, which is easy on polynomials, to find interesting points (large gradients, min-max points, etc.).
You also get an approximation of how the function behaves, so you could extrapolate "future" points and see what may happen in the near future.
Of course looking into the future isn't so accurate, but forms of interpolation are used in analytics to see trends and behaviors.
And of course, it's easy to plot a polynomial, which is always nice.
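A minimal sketch of that approach with NumPy, using the values from the question (the polynomial degree is an arbitrary choice):

import numpy as np

values = [13.4, 12.0, 17.6, 22.0, 10.0]   # the y-values from the question
x = np.arange(len(values))                # days as 0, 1, 2, ...

coeffs = np.polyfit(x, values, deg=3)     # fit a cubic through the points
derivative = np.polyder(coeffs)           # where the fitted curve is steep or flat

# real roots of the derivative inside the data range are candidate min/max points
extrema = [r.real for r in np.roots(derivative) if abs(r.imag) < 1e-9 and 0 <= r.real <= x[-1]]
print(extrema)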
This is really a question of statistics (http://en.wikipedia.org/wiki/Statistics) and of the context of your data and what you're looking to highlight. For example, the fact that between 12/10 and 17/10 the data moved negative 1.4 units may be more useful in some scenarios than a larger positive step change.
You need sample data on which to build a function that can calculate an expected value for any given date, for instance by averaging the values of the day before, of the same weekday of the previous week, of the previous month, and so on. After that, decide on a threshold: interesting dates are those for which the real value is outside expected value +- threshold.
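A minimal Python sketch of that idea, using a trailing average as the expected value (the window size and threshold are placeholder choices):

def interesting_dates(series, window=3, threshold=4.0):
    # series: list of (date, value) pairs in chronological order
    flagged = []
    for i in range(window, len(series)):
        expected = sum(v for _, v in series[i - window:i]) / window
        date, value = series[i]
        if abs(value - expected) > threshold:
            flagged.append(date)
    return flagged

data = [("2010-10-12", 13.4), ("2010-10-17", 12.0), ("2010-10-22", 17.6),
        ("2010-10-27", 22.0), ("2010-11-01", 10.0)]
print(interesting_dates(data, window=2, threshold=4.0))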

Algorithm to score similarness of sets of numbers

What is an algorithm to compare multiple sets of numbers against a target set to determine which ones are the most "similar"?
One use of this algorithm would be to compare today's hourly weather forecast against historical weather recordings to find a day that had similar weather.
The similarity of two sets is a bit subjective, so the algorithm really just needs to differentiate between good matches and bad matches. We have a lot of historical data, so I would like to try to narrow down the number of days the users need to look through by automatically throwing out sets that aren't close and trying to put the "best" matches at the top of the list.
Edit:
Ideally the result of the algorithm would be comparable to results using different data sets. For example, using the mean square error as suggested by Niles produces pretty good results, but the numbers generated when comparing the temperature cannot be compared to numbers generated with other data such as wind speed or precipitation, because the scale of the data is different. Some of the non-weather data is very large, so the mean square error algorithm generates numbers in the hundreds of thousands, compared to the tens or hundreds generated by using temperature.
I think the mean square error metric might work for applications such as weather compares. It's easy to calculate and gives numbers that do make sense.
Since you want to compare measurements over time, you can just leave out missing values from the calculation.
For values that are not time-bound or even unsorted, multi-dimensional scatter data it's a bit more difficult. Choosing a good distance metric becomes part of the art of analysing such data.
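A minimal Python sketch of that metric, skipping missing values as suggested (the series below are placeholders):

def mean_square_error(target, candidate):
    # compare two hourly series, ignoring slots where either value is missing
    pairs = [(t, c) for t, c in zip(target, candidate) if t is not None and c is not None]
    return sum((t - c) ** 2 for t, c in pairs) / len(pairs)

today = [14.0, 15.5, 17.0, None, 21.0]
history = [13.0, 15.0, 18.0, 19.5, 22.5]
print(mean_square_error(today, history))  # lower means a better match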
Use the pearson correlation coefficient. I figured out how to calculate it in an SQL query which can be found here: http://vanheusden.com/misc/pearson.php
In finance they use Beta to measure the correlation of 2 series of numbers. EG, Beta could answer the question "Over the last year, how much would the price of IBM go up on a day that the price of the S&P 500 index went up 5%?" It deals with the percentage of the move, so the 2 series can have different scales.
In my example, the Beta is Covariance(IBM, S&P 500) / Variance(S&P 500).
Wikipedia has pages explaining Covariance, Variance, and Beta: http://en.wikipedia.org/wiki/Beta_(finance)
Look at statistical sites. I think you are looking for correlation.
As an example, I'll assume you're measuring temp, wind, and precip. We'll call these items "features". So valid values might be:
Temp: -50 to 100F (I'm in Minnesota, USA)
Wind: 0 to 120 Miles/hr (not sure if this is realistic but bear with me)
Precip: 0 to 100
Start by normalizing your data. Temp has a range of 150 units, Wind 120 units, and Precip 100 units. Multiply your wind units by 1.25 and Precip by 1.5 to make them roughly the same "scale" as your temp. You can get fancy here and make rules that weigh one feature as more valuable than others. In this example, wind might have a huge range but usually stays in a smaller range so you want to weigh it less to prevent it from skewing your results.
Now, imagine each measurement as a point in multi-dimensional space. This example measures 3d space (temp, wind, precip). The nice thing is, if we add more features, we simply increase the dimensionality of our space but the math stays the same. Anyway, we want to find the historical points that are closest to our current point. The easiest way to do that is Euclidean distance. So measure the distance from our current point to each historical point and keep the closest matches:
import math

# assumes current and each entry in historical_points are dicts with temp, wind, precip keys
matches = []          # list of (distance, point), at most MAX_MATCHES long
MAX_MATCHES = 10

for hp in historical_points:
    distance = math.sqrt(
        (current["temp"] - hp["temp"]) ** 2 +
        (current["wind"] - hp["wind"]) ** 2 +
        (current["precip"] - hp["precip"]) ** 2)
    # keep the point if it beats the worst match collected so far
    if len(matches) < MAX_MATCHES or distance < max(m[0] for m in matches):
        matches.append((distance, hp))
        if len(matches) > MAX_MATCHES:
            matches.remove(max(matches, key=lambda m: m[0]))
This is a brute-force approach. If you have the time, you could get a lot fancier. Multi-dimensional data can be represented as trees like kd-trees or r-trees. If you have a lot of data, comparing your current observation with every historical observation would be too slow. Trees speed up your search. You might want to take a look at Data Clustering and Nearest Neighbor Search.
Cheers.
Talk to a statistician.
Seriously.
They do this type of thing for a living.
You write that the "similarity of two sets is a bit subjective", but it's not subjective at all-- it's a matter of determining the appropriate criteria for similarity for your problem domain.
This is one of those situations where you are much better off speaking to a professional than asking a bunch of programmers.
First of all, ask yourself if these are sets, or ordered collections.
I assume that these are ordered collections with duplicates. The most obvious algorithm is to select a tolerance within which numbers are considered the same, and count the number of slots where the numbers are the same under that measure.
I do have a solution implemented for this in my application, but I'm looking to see if there is something that is better or more "correct". For each historical day I do the following:
function calculate_score(historical_set, forecast_set)
{
double c = correlation(historical_set, forecast_set);
double avg_history = average(historical_set);
double avg_forecast = average(forecast_set);
double penalty = abs(avg_history - avg_forecast) / avg_forecast;
return c - penalty;
}
I then sort all the results from high to low.
Since the correlation is a value from -1 to 1 that says whether the numbers fall or rise together, I then "penalize" that with the percentage difference the averages of the two sets of numbers.
A couple of times, you've mentioned that you don't know the distribution of the data, which is of course true. I mean, tomorrow there could be a day that is 150 degrees F with 2000 km/hr winds, but it seems pretty unlikely.
I would argue that you have a very good idea of the distribution, since you have a long historical record. Given that, you can put everything in terms of quantiles of the historical distribution, and do something with absolute or squared difference of the quantiles on all measures. This is another normalization method, but one that accounts for the non-linearities in the data.
Normalization in any style should make all variables comparable.
As an example, let's say it's a windy, hot day: that might have a temp quantile of .75 and a wind quantile of .75. The .76 quantile for heat might be 1 degree away, and the one for wind might be 3 km/h away.
This focus on the empirical distribution is easy to understand as well, and could be more robust than normal estimation (like Mean-square-error).
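A minimal Python sketch of the quantile idea, using the empirical distribution of the historical record (the numbers are placeholders):

def quantile(history, value):
    # empirical quantile: fraction of the historical record below this value
    return sum(1 for h in history if h < value) / len(history)

temp_history = [55, 60, 62, 68, 70, 75, 80, 85, 90, 95]
wind_history = [0, 2, 5, 8, 10, 12, 15, 20, 30, 40]

today = (quantile(temp_history, 88), quantile(wind_history, 25))
past_day = (quantile(temp_history, 85), quantile(wind_history, 20))

# score a historical day by the squared difference of quantiles across features
score = sum((a - b) ** 2 for a, b in zip(today, past_day))
print(score)  # smaller means more similar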
Are the two data sets ordered, or not?
If ordered, are the indices the same? equally spaced?
If the indices are common (temperatures measured on the same days but at different locations, for example), you can regress the first data set against the second,
and then test that the slope is equal to 1, and that the intercept is 0.
http://stattrek.com/AP-Statistics-4/Test-Slope.aspx?Tutorial=AP
Otherwise, you can do two regressions of the y-values against their indices (http://en.wikipedia.org/wiki/Correlation). You'd still want to compare slopes and intercepts.
====
If unordered, I think you want to look at the cumulative distribution functions
http://en.wikipedia.org/wiki/Cumulative_distribution_function
One relevant test is Kolmogorov-Smirnov:
http://en.wikipedia.org/wiki/Kolmogorov-Smirnov_test
You could also look at
Student's t-test,
http://en.wikipedia.org/wiki/Student%27s_t-test
or a Wilcoxon signed-rank test http://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test
to test equality of means between the two samples.
And you could test for equality of variances with a Levene test http://www.itl.nist.gov/div898/handbook/eda/section3/eda35a.htm
Note: it is possible for dissimilar sets of data to have the same mean and variance -- depending on how rigorous you want to be (and how much data you have), you could consider testing for equality of higher moments, as well.
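If you work in Python, scipy.stats covers several of these tests; a minimal sketch (the series are placeholders, and which test is appropriate still depends on your data):

from scipy import stats

target = [14.0, 15.5, 17.0, 19.5, 21.0, 20.0]
candidate = [13.0, 15.0, 18.0, 19.0, 22.5, 19.0]

print(stats.ks_2samp(target, candidate))    # compare the empirical distributions
print(stats.ttest_ind(target, candidate))   # test equality of means
print(stats.levene(target, candidate))      # test equality of variances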
Maybe you can see your set of numbers as a vector (each number of the set being a component of the vector).
Then you can simply use the dot product to compute the similarity of 2 given vectors (i.e. sets of numbers).
You might need to normalize your vectors first.
More : Cosine similarity
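A minimal Python sketch of that cosine similarity calculation:

import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([14, 15.5, 17, 19.5], [13, 15, 18, 19]))  # close to 1 for similar shapes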
