Exact Matching with r MatchIt with ratio 1:4

I would like to run an exact match by state but at the same time I would like a ratio of 1 to 4 case:control. Will the following code do the trick? I understand method=exact won't work since that method does not support the ratio parameter. If yes, how do I check if matching is exact? If not, how can I fix my code? I have plenty of data (255 cases & 7500 control so exact shouldn't be a problem at all)
Thank you!
exact_match <- matchit(case ~ state , ratio = 4, data = case_crude, exact="statabbr")

I'm not sure why you would want to throw out any units that are more than the 4th exact match. For example, if 5 control units resemble your treated unit, why would you only want to retain 4 of them?
Otherwise, that code should work to do what you want. It performs 4:1 nearest neighbor matching on a propensity score estimated with state as the sole predictor with exact matching on statabbr.
To check if exact matching was successful, check balance using summary(). You should see that the mean differences are 0 and the pair distances are also 0 (i.e., within each pair, the values of statabbr are identical).

No division in numerical differentiation algorithm

I am trying to understand an algorithm that evaluates the speed of a moving element. The position sensors are sampled at a varying but rather high rate (from 16 MSPS to 24 MSPS), and the speed is calculated as a simple difference between the last two values.
The formula for the speed is then v = f(x_(n+1)) - f(x_n), whereas according to all numerical approaches I was expecting v = (f(x+h) - f(x)) / h
I don't really understand why the division is omitted. Under what circumstances can the division be ignored?
This system is implemented on an FPGA.

It can be ignored when h is 1 as divide by 1 is a no-op.

Thanks to many comments I was able to understand the problem:
The unit calculating the speed doesn't need to know the time period. By subtracting the previous sampled value from the next one, it produces output values. These values represent a function that is linearly dependent on the speed. One way to understand this is that the output is a kind of 'speed without units'. The output can then be further manipulated (oversampled, undersampled) to achieve the desired signal quality.
To be able to determine the speed in some exact units (like m/s), at least the sampling frequency has to be given. In the case of rotational movement, other constants are also needed, such as the radius of the axis where the sensor is mounted. This happens at some later point.
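As a minimal sketch of the point above (in Python rather than FPGA logic; the function names, the sample values and the 16 MSPS rate are illustrative assumptions): the raw difference is proportional to the speed, and dividing by h is the same as multiplying by the sampling frequency, so that constant scale factor can be applied at any later stage.

```python
def raw_differences(samples):
    """Unitless 'speed': difference between consecutive position samples."""
    return [b - a for a, b in zip(samples, samples[1:])]


def to_physical_speed(raw_diffs, sample_rate_hz, units_per_count=1.0):
    """Scale raw differences to real units.

    Dividing by h = 1 / sample_rate_hz equals multiplying by sample_rate_hz,
    so this step does not have to live in the unit that takes the difference.
    """
    return [d * sample_rate_hz * units_per_count for d in raw_diffs]


positions = [0, 3, 6, 9, 15]        # hypothetical sensor counts
diffs = raw_differences(positions)  # proportional to the speed
speeds = to_physical_speed(diffs, sample_rate_hz=16_000_000)
```

If the sample rate really varies at run time, the scale factor varies with it, which is one more reason to keep the scaling out of the differencing unit.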

How can I detect these audio abnormalities?

iOS has an issue recording through some USB audio devices. It cannot be reliably reproduced (happens every 1 in ~2000-3000 records in batches and silently disappears), and we currently manually check our audio for any recording issues. It results in small numbers of samples (1-20) being shifted by a small number that sounds like a sort of 'crackle'.
They look like this (figures not preserved): an overview, two progressively closer views, and another single-sample error elsewhere in the same audio file.
The question is: how can these be detected algorithmically (assuming direct access to the samples) without triggering false positives on high-frequency audio such as the clean example listed below?
Bonus points: after determining as many errors as possible, how can the audio be 'fixed'?
Dirty audio file - pictured
Another dirty audio file
Clean audio with valid high frequency - pictured
More bonus points: what could be causing this issue in the iOS USB audio drivers/hardware (assuming it is there).

I do not think there is an out of the box solution to find the disturbances, but here is one (non standard) way of tackling the problem. Using this, I could find most intervals and I only got a small number of false positives, but the algorithm could certainly use some fine tuning.
My idea is to find the start and end point of the deviating samples. The first step should be to make these points stand out more clearly. This can be done by taking the logarithm of the data and taking the differences between consecutive values.
In MATLAB I load the data (in this example I use dirty-sample-other.wav)
y1 = wavread('dirty-sample-pictured.wav');
y2 = wavread('dirty-sample-other.wav');
y3 = wavread('clean-highfreq.wav');
data = y2;
and use the following code:
logdata = log(1+data);
difflogdata = diff(logdata);
So instead of this plot of the original data:
we get:
where the intervals we are looking for stand out as a positive and negative spike. For example zooming in on the largest positive value in the plot of logarithm differences we get the following two figures. One for the original data:
and one for the difference of logarithms:
This plot could help with finding the areas manually but ideally we want to find them using an algorithm. The way I did this was to take a moving window of size 6, computing the mean value of the window (of all points except the minimum value), and compare this to the maximum value. If the maximum point is the only point that is above the mean value and at least twice as large as the mean it is counted as a positive extreme value.
I then used a threshold of counts, at least half of the windows moving over the value should detect it as an extreme value in order for it to be accepted.
Multiplying all points with (-1) this algorithm is then run again to detect the minimum values.
Marking the positive extremes with "o" and negative extremes with "*" we get the following two plots. One for the differences of logarithms:
and one for the original data:
Zooming in on the left part of the figure showing the logarithmic differences we can see that most extreme values are found:
It seems like most intervals are found and there are only a small number of false positives. For example running the algorithm on 'clean-highfreq.wav' I only find one positive and one negative extreme value.
Single values that are falsely classified as extreme values could perhaps be weeded out by matching start and end-points. And if you want to replace the lost data you could use some kind of interpolation using the surrounding data-points, perhaps even a linear interpolation will be good enough.
Here is the MATLAB-code I used:
function test20()
    clc
    clear all
    % Newer MATLAB releases replace wavread with audioread
    y1 = wavread('dirty-sample-pictured.wav');
    y2 = wavread('dirty-sample-other.wav');
    y3 = wavread('clean-highfreq.wav');
    data = y2;
    % Differences of logarithms make the disturbances stand out as spikes
    logdata = log(1+data);
    difflogdata = diff(logdata);
    figure,plot(data),hold on,plot(data,'.')
    figure,plot(difflogdata),hold on,plot(difflogdata,'.')
    figure,plot(data),hold on,plot(data,'.'),xlim([68000,68200])
    figure,plot(difflogdata),hold on,plot(difflogdata,'.'),xlim([68000,68200])
    k = 6;
    % Positive extremes
    myData = difflogdata;
    myPoints = findPoints(myData,k);
    % Negative extremes: same search on the sign-flipped data
    myData2 = -difflogdata;
    myPoints2 = findPoints(myData2,k);
    figure
    plotterFunction(difflogdata,myPoints>=k,'or')
    hold on
    plotterFunction(difflogdata,myPoints2>=k,'*r')
    figure
    plotterFunction(data,myPoints>=k,'or')
    hold on
    plotterFunction(data,myPoints2>=k,'*r')
end
function myPoints = findPoints(myData,k)
    % For each sample, count how many moving windows mark it as an extreme value
    iterationVector = k+1:length(myData);
    myPoints = zeros(size(myData));
    for i = iterationVector
        subVector = myData(i-k:i);
        % Mean of the window, excluding its minimum value
        meanSubVector = mean(subVector(subVector>min(subVector)));
        [maxSubVector, maxIndex] = max(subVector);
        % The maximum must be the only point above the mean and at least twice the mean
        if (sum(subVector>meanSubVector) == 1 && maxSubVector>2*meanSubVector)
            myPoints(i-k-1+maxIndex) = myPoints(i-k-1+maxIndex) +1;
        end
    end
end
function plotterFunction(allPoints,extremeIndices,markerType)
    % Plot the data and mark the detected extreme values
    extremePoints = NaN(size(allPoints));
    extremePoints(extremeIndices) = allPoints(extremeIndices);
    plot(extremePoints,markerType,'MarkerSize',15),
    hold on
    plot(allPoints,'.')
    plot(allPoints)
end
Edit - comments on recovering the original data
Here is a slightly zoomed out view of figure three above: (the disturbance is between 6.8 and 6.82)
When I examine the values, your theory about the data being mirrored to negative values does not seem to fit the pattern exactly. But in any case, my thought about just removing the differences is certainly not correct. Since the surrounding points do not seem to be altered by the disturbance, I would probably go back to the original idea of not trusting the points within the affected region and instead using some sort of interpolation using the surrounding data. It seems like a simple linear interpolation would be a quite good approximation in most cases.

To answer the question of why it happens -
A USB audio device and host are not clock synchronous - that is to say that the host cannot accurately recover the relationship between the host's local clock and the word-clock of the ADC/DAC on the audio interface. Various techniques do exist for clock-recovery with various degrees of effectiveness. To add to the problem, the bus clock is likely to be unrelated to either of the two audio clocks.
Whilst you might imagine this not to be too much of a concern for audio receive (audio capture callbacks could happen whenever there is data), audio interfaces are usually bi-directional, and the host will be rendering audio at regular intervals, which the other end is potentially consuming at a slightly different rate.
In-between are several sets of buffers, which can over- or under-run, which is what looks to be happening here; the interval between it happening certainly seems about right.
You might find that changing USB audio device to one built around a different chip-set (or, simply a different local oscillator) helps.
As an aside, both IEEE1394 audio and MPEG transport streams have the same clock recovery requirement. Both of them solve the problem by embedding a local clock reference packet into the serial bitstream in a very predictable way, which allows accurate clock recovery on the other end.

I think the following algorithm can be applied to samples in order to determine a potential false positive:
First, scan each block for a large amount of high-frequency content, either by running an FFT on the sound block by block (256 values, maybe) or by counting the consecutive samples above and below zero. The latter should keep track of the maximum run above zero, the maximum run below zero, the number of small transitions around zero, and the current volume of the block (0..1, as Audacity displays it). Then treat the block as containing high-frequency content if the maximum run is below 5 (sampling at 44100 Hz, with zero errors being consecutive and outstanding samples being single, a run of 5 corresponds to a frequency of 4410 Hz, which is pretty high), or if the sum of the small transitions' lengths is above a certain value that depends on the maximum run (I believe a first approximation would be 3*5*block size/distance between two maximums, where that distance roughly equates to the period of the loudest FFT frequency). This should also be measured both above and below the threshold, as we can end up with an erroneous peak, which will likely show up as a difference between the main tempo measured on the below-zero and above-zero maximums, and also in the std-dev of the peaks. If high frequency is dominant, the block is eligible only for zero-value testing, and a special means of repairing the data will be needed. If high frequency is merely significant, that is, a dominant low frequency is detected, we can search for peaks bigger than 3.0*high frequency volume, as well as abnormal zeroes, in this block.
Also, your errors seem to be either large excursions or plain zeroes: the large excursions are single-sample errors, while the zero errors span 1-20 samples. So, if there is a run of values under 0.02 in absolute value which is directly surrounded by values of 0.15 (a variable to be fine-tuned) or higher in absolute value AND of the same sign, count this run as an error. Single values that stand out can be detected by calculating 2.0*(current sample)-(previous sample)-(next sample); if it's above a certain threshold (0.1+high frequency volume, or 3.0*high frequency volume, whichever is bigger), count this as an error and average.
What to do with the zero gaps found: we can copy values from 1 period backwards and 1 period forwards (averaging), where "period" is that of the most significant frequency in the FFT of the block. If the "period" is smaller than the gap (say we've detected a gap of zeroes in a high-pitched part of the sound), use two or more periods, so that the source data will all be valid (in this case no averaging can be done, as it's possible that the signal 2 periods forward from the gap and 2 periods back will be in counterphase). If there is more than one frequency of about equal amplitude, we can plainly sample these with the correct phases, cutting the rest of the less significant frequencies altogether.
The outstanding sample should IMO just be averaged by 2-4 surrounding samples, as there seems to be only a single sample ever encountered in your sound files.
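A rough sketch of the single-sample test described above, in Python. The function names and the example threshold are my own assumptions, and so are two details the description leaves open: I compare the absolute value against the threshold, and I keep only the sample where the measure peaks locally, because a lone spike also produces a smaller response on its two neighbours.

```python
def second_differences(samples):
    """2*current - previous - next for every interior sample (0.0 at the ends)."""
    d = [0.0] * len(samples)
    for i in range(1, len(samples) - 1):
        d[i] = 2.0 * samples[i] - samples[i - 1] - samples[i + 1]
    return d


def find_single_sample_outliers(samples, threshold):
    """Indices where the second difference is large and locally the largest."""
    d = second_differences(samples)
    flagged = []
    for i in range(1, len(samples) - 1):
        size = abs(d[i])
        if size > threshold and size >= abs(d[i - 1]) and size >= abs(d[i + 1]):
            flagged.append(i)
    return flagged


def repair_by_averaging(samples, flagged):
    """Replace each flagged sample by the average of its two neighbours."""
    repaired = list(samples)
    for i in flagged:
        repaired[i] = (samples[i - 1] + samples[i + 1]) / 2.0
    return repaired


audio = [0.0, 0.1, 0.2, 0.9, 0.4, 0.5]          # made-up data, spike at index 3
bad = find_single_sample_outliers(audio, 0.3)   # threshold is a tuning value
fixed = repair_by_averaging(audio, bad)
```

In practice the threshold would come from the block's high-frequency volume, as described above, rather than from a fixed constant.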

The discrete wavelet transform (DWT) may be the solution to your problem.
An FFT calculation is not very useful in your case, since it's an average representation of the relative frequency content over the entire duration of the signal, which makes it impossible to detect momentary changes. The discrete short-time Fourier transform (STFT) tries to tackle this by computing the DFT for short consecutive time-blocks of the signal, the length of which is determined by the length (and shape) of a window. But since the resolution of the DFT depends on the data/block length, there is a trade-off between resolution in frequency and resolution in time, and finding this magical fixed window size can be tricky!
What you want is a time-frequency analysis method with good time resolution for high-frequency events, and good frequency resolution for low-frequency events... Enter the discrete wavelet transform!
There are numerous wavelet transforms for different applications and, as you might expect, they are computationally heavy. The DWT may not be a practical solution to your problem, but it's worth considering. Good luck with your problem. Some Friday-evening reading:
http://klapetek.cz/wdwt.html
http://etd.lib.fsu.edu/theses/available/etd-11242003-185039/unrestricted/09_ds_chapter2.pdf
http://en.wikipedia.org/wiki/Wavelet_transform
http://en.wikipedia.org/wiki/Discrete_wavelet_transform

You can try the following super-simple approach (maybe it's enough):
Take each point in your wave-form and subtract its predecessor (look at the changes from one point to the next).
Look at the distribution of these changes and find their standard deviation.
If any given difference is beyond X times this standard deviation (either above or below), flag it as a problem.
Determine the best value for X by playing with it and seeing how well it performs.
Most "problems" should come as a pair of two differences beyond your cutoff, one going up, and one going back down.
To stick with the super-simple approach, you can then fix the data by just interpolating linearly between the last good point before your problem-section and the first good point after. (Make sure you don't just delete the points as this will influence (raise) the pitch of your audio.)
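The steps above could be sketched in Python roughly like this. The function names, the made-up sample data and the value of X are all assumptions for illustration; X in particular has to be tuned on real recordings.

```python
import statistics


def flag_large_jumps(samples, x):
    """Indices k where samples[k+1] - samples[k] is beyond x standard deviations."""
    diffs = [b - a for a, b in zip(samples, samples[1:])]
    sigma = statistics.pstdev(diffs)
    return [k for k, d in enumerate(diffs) if abs(d) > x * sigma]


def interpolate_segment(samples, last_good, first_good):
    """Linearly interpolate the samples strictly between two good indices."""
    repaired = list(samples)
    span = first_good - last_good
    for i in range(last_good + 1, first_good):
        t = (i - last_good) / span
        repaired[i] = samples[last_good] + t * (samples[first_good] - samples[last_good])
    return repaired


# Made-up data: samples 4 and 5 are shifted upwards
audio = [0.0, 0.1, 0.2, 0.3, 2.3, 2.4, 0.6, 0.7, 0.8, 0.9]
jumps = flag_large_jumps(audio, x=1.5)   # one jump up, one jump back down
up, down = jumps[0], jumps[1]
# Jump k sits between samples k and k+1, so the bad run is up+1 .. down
fixed = interpolate_segment(audio, last_good=up, first_good=down + 1)
```

Note that the number of samples is unchanged after the repair, so the pitch is not affected. On a short signal the outliers themselves inflate the standard deviation, which is why X ends up small here.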

Algorithm to calculate a page importance based on its views / comments

I need an algorithm that allows me to determine an appropriate <priority> field for my website's sitemap based on the page's views and comments count.
For those of you unfamiliar with sitemaps, the priority field is used to signal the importance of a page relative to the others on the same website. It must be a decimal number between 0 and 1.
The algorithm will accept two parameters, viewCount and commentCount, and will return the priority value. For example:
GetPriority(100000, 100000); // Damn, a lot of views/comments! The returned value will be very close to 1, for example 0.995
GetPriority(3, 2); // Ok not many users are interested in this page, so for example it will return 0.082

You mentioned doing this in an SQL query, so I'll give samples in that.
If you have a table/view Pages, something like this
Pages
-----
page_id:int
views:int - indexed
comments:int - indexed
Then you can order them by writing
SELECT * FROM Pages
ORDER BY
    (0.3*LOG10(10+views)/LOG10(10+(SELECT MAX(views) FROM Pages))) +
    (0.7*LOG10(10+comments)/LOG10(10+(SELECT MAX(comments) FROM Pages)))
I've deliberately chosen unequal weighting between views and comments. A problem that can arise with keeping an equal weighting of views and comments is that the ranking becomes a self-fulfilling prophecy: a page is returned at the top of the list, so it's visited more often, and thus gets more points, so it's shown at the top of the list, and it's visited more often, and it gets more points.... Putting more weight on the comments reflects that these take real effort and show real interest.
The above formula will give you ranking based on all-time statistics. So an article that amassed the same number of views/comments in the last week as another article amassed in the last year will be given the same priority. It may make sense to repeat the formula, each time specifying a range of dates, and favoring pages with higher activity, e.g.
0.3*(score for views/comments today) - live data
0.3*(score for views/comments in the last week)
0.25*(score for views/comments in the last month)
0.15*(score for all views/comments, all time)
This will ensure that "hot" pages are given higher priority than similarly scored pages that haven't seen much action lately. All values apart from today's scores can be persisted in tables by scheduled stored procedures so that the database isn't having to aggregate many many comments/view stats. Only today's stats are computed "live". Taking it one step further, the ranking formula itself can be computed and stored for historical data by a stored procedure run daily.
EDIT: To get a strict range from 0.1 to 1.0, you would modify the formula like this. But I stress: this will only add overhead and is unnecessary. The absolute values of the priorities are not important, only their values relative to other URLs. The search engine uses these to answer the question: is URL A more important/relevant than URL B? It does this by comparing their priorities (which one is greater), not their absolute values.
// unnormalized - x is some page id
// (this is the original formula, now in pseudo code)
un(x) = 0.3*log(views(x)+10)/log(10+maxViews()) +
        0.7*log(comments(x)+10)/log(10+maxComments())
The maximum will be 1.0; the minimum will start at 1.0 and move downwards as more views/comments are made.
We define un(0) as the minimum value, i.e. the value of the above formula when views(x) and comments(x) are both 0.
To get a normalized formula from 0.1 to 1.0, you then compute n(x), the normalized priority for page x:
n(x) = un(x) - (1.0-un(x)) * (un(0)-0.1) / (1.0-un(0))    when un(0) != 1.0
     = 0.1                                                 otherwise.
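For anyone who prefers running code to pseudo code, here is a small Python sketch of the same two formulas. The function and parameter names are my own, and I compare un(0) to 1.0 with a tolerance instead of an exact equality test, since the weighted sum is computed in floating point.

```python
import math


def unnormalized(views, comments, max_views, max_comments):
    """un(x): weighted log score, 0.3 for views and 0.7 for comments."""
    return (0.3 * math.log(views + 10) / math.log(10 + max_views)
            + 0.7 * math.log(comments + 10) / math.log(10 + max_comments))


def normalized(views, comments, max_views, max_comments, floor=0.1):
    """n(x): un(x) rescaled so that the range is floor .. 1.0."""
    un_x = unnormalized(views, comments, max_views, max_comments)
    un_0 = unnormalized(0, 0, max_views, max_comments)
    if math.isclose(un_0, 1.0):
        # No page has any views or comments yet
        return floor
    return un_x - (1.0 - un_x) * (un_0 - floor) / (1.0 - un_0)


top = normalized(1000, 50, max_views=1000, max_comments=50)
bottom = normalized(0, 0, max_views=1000, max_comments=50)
middle = normalized(10, 1, max_views=1000, max_comments=50)
```

The busiest page lands at 1.0 and a page with no activity at 0.1, with everything else in between.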

Priority = W1 * views / maxViewsOfAllArticles + W2 * comments / maxCommentsOfAllArticles
with W1+W2=1
Although IMHO, just use 0.5*log_10(10+views)/log_10(10+maxViews) + 0.5*log_10(10+comments)/log_10(10+maxComments)

What you're looking for here is not an algorithm, but a formula.
Unfortunately, you haven't really specified the details of what you want, so there's no way we can provide the formula to you.
Instead, let's try to walk through the problem together.
You've got two incoming parameters, the viewCount and the commentCount. You want to return a single number, Priority. So far, so good.
You say that Priority should range between 0 and 1, but this isn't really important. If we were to come up with a formula we liked, but resulted in values between 0 and N, we could just divide the results by N-- so this constraint isn't really relevant.
Now, the first thing we need to decide is the relative weight of Comments vs Views.
If page A has 100 comments and 10 views, and page B has 10 comments and 100 views, which should have a higher priority? Or, should it be the same priority? You need to decide what's right for your definition of Priority.
If you decide, for example, that comments are 5 times more valuable than views, then we can begin with a formula like
Priority = 5 * Comments + Views
Obviously, this can be generalized to
Priority = A * Comments + B * Views
Where A and B are relative weights.
But, sometimes we want our weights to be exponential instead of linear, like
Priority = Comments ^ A + Views ^ B
which will give a very different curve than the earlier formula.
Similarly,
Priority = Comments ^ A * Views ^ B
will give higher value to a page with 20 comments and 20 views than one with 1 comment and 40 views, if the weights are equal.
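To see how differently these shapes behave, here is a small Python sketch of the three formulas (the function names are made up for illustration, and A = B = 1 is just the "equal weights" case from the example):

```python
def linear_priority(comments, views, a=1.0, b=1.0):
    """Priority = A * Comments + B * Views"""
    return a * comments + b * views


def power_sum_priority(comments, views, a=1.0, b=1.0):
    """Priority = Comments ^ A + Views ^ B"""
    return comments ** a + views ** b


def power_product_priority(comments, views, a=1.0, b=1.0):
    """Priority = Comments ^ A * Views ^ B"""
    return comments ** a * views ** b


balanced = (20, 20)   # 20 comments, 20 views
lopsided = (1, 40)    # 1 comment, 40 views

# The linear formula slightly prefers the lopsided page (41 vs 40),
# while the product formula strongly prefers the balanced one (400 vs 40).
linear_scores = (linear_priority(*balanced), linear_priority(*lopsided))
product_scores = (power_product_priority(*balanced), power_product_priority(*lopsided))
```

This is the same experiment you would do in a spreadsheet, just with fewer cells.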
So, to summarize:
You really ought to make a spreadsheet with sample values for Views and Comments, and then play around with various formulas until you get one that has the distribution that you are hoping for.
We can't do it for you, because we don't know how you want to value things.

I know it has been a while since this was asked, but I encountered a similar problem and had a different solution.
When you want to have a way to rank something, and there are multiple factors that you're using to perform that ranking, you're doing something called multi-criteria decision analysis. (MCDA). See: http://en.wikipedia.org/wiki/Multi-criteria_decision_analysis
There are several ways to handle this. In your case, your criteria have different "units". One is in units of comments, the other is in units of views. Furthermore, you may want to give different weight to these criteria based on whatever business rules you come up with.
In that case, the best solution is something called a weighted product model. See: http://en.wikipedia.org/wiki/Weighted_product_model
The gist is that you take each of your criteria and turn it into a percentage (as was previously suggested), then you take that percentage and raise it to the power of X, where X is a number between 0 and 1. This number represents your weight. Your total weights should add up to one.
Lastly, you multiply the results together to come up with a rank. If the rank is greater than 1, then the numerator page has a higher rank than the denominator page.
Each page would be compared against every other page by doing something like:
p1C = page 1 comments
p1V = page 1 views
p2C = page 2 comments
p2V = page 2 views
wC = comment weight
wV = view weight
rank = (p1C/p2C)^(wC) * (p1V/p2V)^(wV)
The end result is a sorted list of pages according to their rank.
I've implemented this in C# by performing a sort on a collection of objects implementing IComparable.
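My C# code isn't shown here, but the same idea can be sketched in Python with a comparison function. The dictionary keys, the function names and the 0.7/0.3 weights are assumptions for the example, and the sketch assumes every page has at least one view and one comment (otherwise the ratios divide by zero).

```python
from functools import cmp_to_key


def wpm_ratio(page_a, page_b, w_comments=0.7, w_views=0.3):
    """Weighted product model: a result above 1 means page_a outranks page_b."""
    return ((page_a["comments"] / page_b["comments"]) ** w_comments
            * (page_a["views"] / page_b["views"]) ** w_views)


def compare_pages(page_a, page_b):
    """Comparison function that puts the higher-ranked page first."""
    ratio = wpm_ratio(page_a, page_b)
    if ratio > 1.0:
        return -1
    if ratio < 1.0:
        return 1
    return 0


def rank_pages(pages):
    return sorted(pages, key=cmp_to_key(compare_pages))


pages = [
    {"name": "a", "comments": 1, "views": 40},
    {"name": "b", "comments": 20, "views": 20},
    {"name": "c", "comments": 5, "views": 5},
]
ranked_names = [page["name"] for page in rank_pages(pages)]
```

With the weights favouring comments, page "b" comes out on top and the view-heavy page "a" ends up last.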

What several posters have essentially advocated without conceptual clarification is that you use linear regression to determine a weighting function of webpage view and comment counts to establish priority.
This technique is pretty easy to implement for your problem, and the basic concept is described well in this Wikipedia article on linear regression models.
A quick summary of how to apply it to your problem is:
Determine the parameters of the line which best fits the view and comment count data for all your site's webpages, i.e., use linear regression.
Use the line parameters to derive your priority function for the view/count parameters.
Code examples for basic linear regression should not be hard to track down if you don't want to implement it from scratch from basic math formulas (use the web, Numerical Recipes, etc.). Also, any general math software package like Matlab, R, etc., comes with linear regression functions.

The most naive approach would be the following:
Let v[i] be the number of views of page i and c[i] the number of comments for page i, then define the relative view weight for page i to be
r_v(i) = v[i]/(sum_j v[j])
where sum_j v[j] is the total of the v[.] over all pages. Similarly define the relative comment weight for page i to be
r_c(i) = c[i]/(sum_j c[j]).
Now you want some constant parameter p, 0 <= p <= 1, which indicates the importance of views over comments: p = 0 means only comments are significant, p = 1 means only views are significant, and p = 0.5 gives equal weight.
Then set the priority to be
p*r_v(i) + (1-p)*r_c(i)
This might be over-simplistic, but it's probably the best starting point.
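A short Python sketch of this naive scheme (the function name and the guard against an all-zero total are my own additions):

```python
def naive_priorities(views, comments, p=0.5):
    """Priority of page i = p * r_v(i) + (1 - p) * r_c(i)."""
    total_views = sum(views)
    total_comments = sum(comments)
    priorities = []
    for v, c in zip(views, comments):
        # Relative weights; fall back to 0 if the site has no views/comments at all
        r_v = v / total_views if total_views else 0.0
        r_c = c / total_comments if total_comments else 0.0
        priorities.append(p * r_v + (1 - p) * r_c)
    return priorities


views = [30, 10]
comments = [1, 3]
equal_weight = naive_priorities(views, comments, p=0.5)
views_only = naive_priorities(views, comments, p=1.0)
```

Keep in mind that these priorities sum to 1 over all pages, so on a site with many pages each individual value will be small; only the relative order is meaningful.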

Floating point calculations with latitudes and longitudes of varying precisions

Background: I receive a long and lat as parameters to a web service. They are typically up to 6 decimal places. When a new request is received, I calculate the distance between the last recorded loc and the long/lat in the params of the request. If the distance is greater than a certain threshold of miles apart, I update the current loc.
Problem: I use the geokit gem/plugin to calculate the distance between the locs. Very rarely, a bug shows up (the zero distance bug mentioned on the author's site; I'm using 1.4.1, which claims the bug is fixed, but I still see it occurring, shrug) that causes the distance calc to return something wildly inaccurate when calculating the distance between two points that are identical (this occurs if the user is not moving). This is causing updates to the user's current loc that should not be happening. You're probably wondering: well, if it's just updating the loc to be exactly the same coordinates, who cares? Well, the answer is that a bunch of other crap occurs when the loc is updated that makes it an actual issue.
Attempted Solution: I tried to add in logic to manually check if the two locs are identical before calculating the distance and then just skipping the calc and not updating if that is the case. The incoming parameters are long/lats with 6 decimal precision; whereas, in my database, I store the values as floats, which appears to only store 4 decimal places. This is causing my float comparison to always fail and the inappropriate loc updates continue to occur.
Phew, ok so the actual question is: How should I perform this comparison? Should I truncate 2 of the decimal places from the incoming lat/longs, round up somehow so the fourth digit is correct and then compare? Or, should I do a "within a certain range" sort of comparison (e.g. reported_loc.long > current_loc.long - .0001 && reported_loc.long < current_loc.long + .0001)? Also any recommendations for existing ruby gem/plugins or built in functions to do this sort of thing would be much appreciated.
Here is sample output from the log:
[update_loc] Last location history record at lat: 41.5024, long: -81.6816
[update_loc] Current loc at lat: 41.502467, long: -81.681623
[update_loc] Distance from current loc and last loc history: 5795.10615113555 miles
[update_loc] Locs not identical and distance greater than threshold, inserting new loc history
[update_loc] Location update complete
Thank you
Tom

The usual way to test if two numbers are close is to use abs, i.e.,
(reported_loc.long - current_loc.long).abs <= tol
where tol is some pre-specified tolerance, e.g., 0.0001.
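The same check, applied to both coordinates, as a small Python sketch (the function name is made up; the coordinates are the ones from the log in the question). As a rough guide, 0.0001 degrees of latitude is on the order of 11 metres, so pick the tolerance to match how much movement you want to ignore.

```python
def same_location(lat_a, lng_a, lat_b, lng_b, tol=0.0001):
    """True if both coordinates agree to within tol degrees."""
    return abs(lat_a - lat_b) <= tol and abs(lng_a - lng_b) <= tol


# Stored (4 decimals) vs. reported (6 decimals) values from the log output
unchanged = same_location(41.5024, -81.6816, 41.502467, -81.681623)
moved = same_location(41.5024, -81.6816, 41.5124, -81.6816)
```

If unchanged is true, skip the distance calculation and the loc update altogether.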

A GPS receiver can give you a location with a precision of many decimals, but that doesn't mean that the measurement is actually that accurate.
Usually about 95% of the measurements lie within a circle of a couple of meters, which is about the same accuracy that you can store with a 32-bit float.
However, you will clearly notice rounding errors when you plot a GPS log as floats on a map.
Anyway, to do this comparison, I'd use a range instead of rounding the incoming value and comparing that to the database value. You won't be able to detect the smallest movements anymore, but at least you won't get false positives anymore either.
As you're using floats to store stuff, you clearly don't care about millimeters or centimeters anyway.

Algorithm to score similarness of sets of numbers

What is an algorithm to compare multiple sets of numbers against a target set to determine which ones are the most "similar"?
One use of this algorithm would be to compare today's hourly weather forecast against historical weather recordings to find a day that had similar weather.
The similarity of two sets is a bit subjective, so the algorithm really just needs to differentiate between good matches and bad matches. We have a lot of historical data, so I would like to narrow down the number of days the users need to look through by automatically throwing out sets that aren't close and trying to put the "best" matches at the top of the list.
Edit:
Ideally the result of the algorithm would be comparable to results using different data sets. For example, using the mean square error as suggested by Niles produces pretty good results, but the numbers generated when comparing the temperature cannot be compared to numbers generated with other data such as Wind Speed or Precipitation, because the scale of the data is different. Some of the non-weather data is very large, so the mean square error algorithm generates numbers in the hundreds of thousands, compared to the tens or hundreds that are generated when using temperature.

I think the mean square error metric might work for applications such as weather comparisons. It's easy to calculate and gives numbers that make sense.
Since you want to compare measurements over time, you can just leave missing values out of the calculation.
For values that are not time-bound or even unsorted, multi-dimensional scatter data it's a bit more difficult. Choosing a good distance metric becomes part of the art of analysing such data.
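A minimal Python sketch of the mean square error with missing values left out. The function names are made up, and missing hours are assumed to be represented as None.

```python
def mean_square_error(target, candidate):
    """MSE over the slots where both series have a value; None if there are none."""
    pairs = [(t, c) for t, c in zip(target, candidate)
             if t is not None and c is not None]
    if not pairs:
        return None
    return sum((t - c) ** 2 for t, c in pairs) / len(pairs)


def rank_days(target, history):
    """Sort (day, series) pairs by MSE against the target, best match first."""
    scored = [(mean_square_error(target, series), day) for day, series in history]
    return [day for score, day in sorted(s for s in scored if s[0] is not None)]


forecast = [1.0, 2.0, None, 4.0]
history = [
    ("day-1", [1.0, 4.0, 3.0, None]),   # hypothetical historical days
    ("day-2", [1.0, 2.0, 3.0, 4.5]),
]
score = mean_square_error(forecast, history[0][1])
best_first = rank_days(forecast, history)
```

The scores are still in squared units of whatever was measured, so they are only comparable within one kind of data.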

Use the Pearson correlation coefficient. I figured out how to calculate it in an SQL query, which can be found here: http://vanheusden.com/misc/pearson.php

In finance they use Beta to measure the correlation of 2 series of numbers. E.g., Beta could answer the question "Over the last year, how much would the price of IBM go up on a day that the price of the S&P 500 index went up 5%?" It deals with the percentage of the move, so the 2 series can have different scales.
In my example, the Beta is Covariance(IBM, S&P 500) / Variance(S&P 500).
Wikipedia has pages explaining Covariance, Variance, and Beta: http://en.wikipedia.org/wiki/Beta_(finance)
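For illustration, the Covariance / Variance ratio can be computed in a few lines of Python. The function names and the return series are made up; I use the population forms of covariance and variance, which gives the same ratio as the sample forms as long as both use the same divisor.

```python
def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    """Population covariance of two equally long series."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def beta(asset_returns, market_returns):
    """Beta = Covariance(asset, market) / Variance(market)."""
    return covariance(asset_returns, market_returns) / covariance(market_returns, market_returns)


market = [0.01, 0.02, -0.01, 0.03]   # hypothetical daily percentage moves
asset = [2 * m for m in market]      # an asset that always moves twice as much
asset_beta = beta(asset, market)     # close to 2.0
```

Because the inputs are percentage moves rather than raw prices, the two series can sit on completely different scales.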

Look at statistical sites. I think you are looking for correlation.

As an example, I'll assume you're measuring temp, wind, and precip. We'll call these items "features". So valid values might be:
Temp: -50 to 100F (I'm in Minnesota, USA)
Wind: 0 to 120 Miles/hr (not sure if this is realistic but bear with me)
Precip: 0 to 100
Start by normalizing your data. Temp has a range of 150 units, Wind 120 units, and Precip 100 units. Multiply your wind units by 1.25 and Precip by 1.5 to make them roughly the same "scale" as your temp. You can get fancy here and make rules that weigh one feature as more valuable than others. In this example, wind might have a huge range but usually stays in a smaller range so you want to weigh it less to prevent it from skewing your results.
Now, imagine each measurement as a point in multi-dimensional space. This example measures 3d space (temp, wind, precip). The nice thing is, if we add more features, we simply increase the dimensionality of our space but the math stays the same. Anyway, we want to find the historical points that are closest to our current point. The easiest way to do that is Euclidean distance. So measure the distance from our current point to each historical point and keep the closest matches:
for each historicalpoint
    distance = sqrt(
        pow(currentpoint.temp - historicalpoint.temp, 2) +
        pow(currentpoint.wind - historicalpoint.wind, 2) +
        pow(currentpoint.precip - historicalpoint.precip, 2))
    if distance is smaller than the largest distance in our match collection
        add historicalpoint to our match collection
        remove the match with the largest distance from our match collection
next
This is a brute-force approach. If you have the time, you could get a lot fancier. Multi-dimensional data can be represented as trees like kd-trees or r-trees. If you have a lot of data, comparing your current observation with every historical observation would be too slow. Trees speed up your search. You might want to take a look at Data Clustering and Nearest Neighbor Search.
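Here is a brute-force version of the above as a Python sketch. The dictionary layout, the function names and the sample days are assumptions; the scale factors are the 1.25 and 1.5 from the normalization step, and heapq.nsmallest takes care of keeping only the closest matches.

```python
import heapq
import math

# Scale factors that bring wind and precip to roughly the same range as temp
SCALE = {"temp": 1.0, "wind": 1.25, "precip": 1.5}


def distance(a, b):
    """Euclidean distance between two observations after scaling each feature."""
    return math.sqrt(sum(((a[k] - b[k]) * s) ** 2 for k, s in SCALE.items()))


def closest_matches(current, history, count=3):
    """Return the `count` historical observations nearest to `current`."""
    return heapq.nsmallest(count, history, key=lambda day: distance(current, day))


current = {"temp": 70, "wind": 10, "precip": 0}
history = [
    {"id": 1, "temp": 30, "wind": 40, "precip": 60},
    {"id": 2, "temp": 71, "wind": 12, "precip": 0},
    {"id": 3, "temp": 68, "wind": 10, "precip": 20},
]
best_ids = [day["id"] for day in closest_matches(current, history, count=2)]
```

Adding another feature only means adding another entry to SCALE; the distance function stays the same.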
Cheers.

Talk to a statistician.
Seriously.
They do this type of thing for a living.
You write that the "similarity of two sets is a bit subjective", but it's not subjective at all; it's a matter of determining the appropriate criteria for similarity for your problem domain.
This is one of those situations where you are much better off speaking to a professional than asking a bunch of programmers.

First of all, ask yourself if these are sets, or ordered collections.
I assume that these are ordered collections with duplicates. The most obvious algorithm is to select a tolerance within which numbers are considered the same, and count the number of slots where the numbers are the same under that measure.
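A minimal Python sketch of that tolerance count, assuming both collections have the same length and are aligned slot by slot (the function name and the equal-length check are my own choices):

```python
def matching_slots(a, b, tolerance):
    """Count positions where the paired values differ by at most tolerance."""
    if len(a) != len(b):
        raise ValueError("collections must have the same length")
    return sum(1 for x, y in zip(a, b) if abs(x - y) <= tolerance)
```

Dividing the count by the number of slots gives a similarity between 0 and 1, which makes collections of different sizes easier to compare.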

I do have a solution implemented for this in my application, but I'm looking to see if there is something that is better or more "correct". For each historical day I do the following:
function calculate_score(historical_set, forecast_set)
{
double c = correlation(historical_set, forecast_set);
double avg_history = average(historical_set);
double avg_forecast = average(forecast_set);
double penalty = abs(avg_history - avg_forecast) / avg_forecast;
return c - penalty;
}
I then sort all the results from high to low.
Since the correlation is a value from -1 to 1 that says whether the numbers fall or rise together, I then "penalize" that with the percentage difference between the averages of the two sets of numbers.
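For anyone who wants to try this, here is a rough Python version of the scoring function. It is a sketch, not my production code: `pearson` is a plain Pearson correlation written out by hand, and I divide by the absolute value of the forecast average so a negative average cannot flip the sign of the penalty (that part is an assumption on top of the pseudocode above). It is undefined when either series is constant or the forecast average is zero.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def calculate_score(historical_set, forecast_set):
    """Correlation minus the relative difference between the two averages."""
    c = pearson(historical_set, forecast_set)
    avg_history = sum(historical_set) / len(historical_set)
    avg_forecast = sum(forecast_set) / len(forecast_set)
    # Assumption: abs() in the denominator keeps the penalty non-negative.
    penalty = abs(avg_history - avg_forecast) / abs(avg_forecast)
    return c - penalty
```

Sorting the historical days by this score in descending order reproduces the ranking step.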

A couple of times, you've mentioned that you don't know the distribution of the data, which is of course true. I mean, tomorrow there could be a day that is 150 degrees F, with 2000 km/hr winds, but it seems pretty unlikely.
I would argue that you have a very good idea of the distribution, since you have a long historical record. Given that, you can put everything in terms of quantiles of the historical distribution, and do something with absolute or squared difference of the quantiles on all measures. This is another normalization method, but one that accounts for the non-linearities in the data.
Normalization in any style should make all variables comparable.
As an example, let's say that it's a windy, hot day: that might have a temp quantile of .75 and a wind quantile of .75. The .76 quantile for heat might be 1 degree away, and the one for wind might be 3 km/h away.
This focus on the empirical distribution is easy to understand as well, and could be more robust than normal estimation (like Mean-square-error).
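A small Python sketch of the quantile idea, assuming the history is stored as one list of past values per feature (the data layout and function names are hypothetical). Each value is replaced by the fraction of historical observations at or below it, and two days are compared by the squared difference of those fractions:

```python
from bisect import bisect_right

def empirical_quantile(sorted_history, value):
    """Fraction of historical observations less than or equal to value."""
    return bisect_right(sorted_history, value) / len(sorted_history)

def quantile_distance(history_by_feature, day_a, day_b):
    """Sum of squared quantile differences across all features."""
    total = 0.0
    for name, values in history_by_feature.items():
        ordered = sorted(values)
        qa = empirical_quantile(ordered, day_a[name])
        qb = empirical_quantile(ordered, day_b[name])
        total += (qa - qb) ** 2
    return total
```

With a long record you would sort each feature's history once up front rather than on every call.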

Are the two data sets ordered, or not?
If ordered, are the indices the same? equally spaced?
If the indices are common (temperatures measured on the same days but at different locations, for example), you can regress the first data set against the second,
and then test that the slope is equal to 1, and that the intercept is 0.
http://stattrek.com/AP-Statistics-4/Test-Slope.aspx?Tutorial=AP
Otherwise, you can do two regressions, of the y-values against their indices. http://en.wikipedia.org/wiki/Correlation. You'd still want to compare slopes and intercepts.
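As a rough illustration, an ordinary least-squares fit can be written in a few lines of Python. This sketch only produces the point estimates of slope and intercept; a proper test that the slope is 1 and the intercept is 0 also needs their standard errors, as described at the stattrek link. The function name is my own.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

For two data sets sharing the same indices, `fit_line(first, second)` returning values near `(1, 0)` suggests the sets track each other closely.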
====
If unordered, I think you want to look at the cumulative distribution functions
http://en.wikipedia.org/wiki/Cumulative_distribution_function
One relevant test is Kolmogorov-Smirnov:
http://en.wikipedia.org/wiki/Kolmogorov-Smirnov_test
You could also look at
Student's t-test,
http://en.wikipedia.org/wiki/Student%27s_t-test
or a Wilcoxon signed-rank test http://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test
to test equality of means between the two samples.
And you could test for equality of variances with a Levene test http://www.itl.nist.gov/div898/handbook/eda/section3/eda35a.htm
Note: it is possible for dissimilar sets of data to have the same mean and variance. Depending on how rigorous you want to be (and how much data you have), you could consider testing for equality of higher moments, as well.
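To give a feel for the Kolmogorov-Smirnov approach mentioned above, here is a sketch of the two-sample statistic in plain Python: the largest vertical gap between the two empirical cumulative distribution functions. It returns only the statistic, not a p-value; a library routine such as SciPy's `ks_2samp` provides both. The function name here is my own.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Largest gap between the empirical CDFs of two samples."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    largest_gap = 0.0
    # Both CDFs are step functions, so the gap peaks at an observed value.
    for value in a + b:
        cdf_a = bisect_right(a, value) / len(a)
        cdf_b = bisect_right(b, value) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap
```

A value near 0 means the two samples have very similar distributions; a value near 1 means they barely overlap.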

Maybe you can see your set of numbers as a vector (each number of the set being a component of the vector).
Then you can simply use the dot product to compute the similarity of 2 given vectors (i.e. sets of numbers).
You might need to normalize your vectors.
More : Cosine similarity
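A minimal Python sketch of cosine similarity, which is the dot product of the two vectors after each is normalized to unit length (the function name is my own, and it assumes neither vector is all zeros):

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)
```

The result is 1 for vectors pointing the same way, 0 for orthogonal vectors, and -1 for opposite ones. Note that it ignores magnitude, so `[1, 2, 3]` and `[2, 4, 6]` count as identical.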
