Test the randomness of a black box that outputs random 64-bit floats - algorithm

I got this interview question and need to write a function for it. I failed.
Because it is a phone interview question, I don't think the code I am supposed to write really needs to be a perfect randomness tester.
Any ideas?
How can I write a reasonable randomness tester within about 30 minutes during an interview?
The numbers in this question are supposed to be uniformly distributed.

As this is an interview question, I think the interviewers are looking to assess in two ways:
Ability to understand what the requirements of the problem really are.
Ability to think of some code that would address those requirements.
This could be a really good interview question in certain settings, especially if the interviewer were willing to prompt the candidate with questions as and when necessary.
In terms of understanding the requirements of the question, it helps if you know that this is a really difficult problem, witness the Diehard tests mentioned in pjs's answer. Fundamentally I think a candidate would need to demonstrate appreciation of two things:
(a) The overall distribution of the numbers should match the desired distribution (I'm assuming it is uniform in this case, but as @pjs points out in comments this assumption should be made explicit).
(b) Each number drawn should be independent from the previous numbers drawn.
With half an hour to code something up in a phone interview, you can't go very far. If I were answering this question I would try to suggest something like:
(a) To test the distribution, come up with a set of equal-sized bins for the floating point numbers, and count the numbers that fall into each bin. Plot a histogram and eyeball it (plotting the data is always a good idea). To extend this, you could use a chi-squared test, as described in amit's answer.
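A minimal sketch of the binning idea, assuming the black box returns floats in [0, 1] and using 10 equal bins. The function name and the bin count are my own choices, not part of the question. The constant 16.919 is the tabulated 5% critical value of the chi-squared distribution with 9 degrees of freedom (bins minus one).

```python
import random

def chi_squared_uniform(samples, num_bins=10):
    """Chi-squared statistic for samples assumed to lie in [0, 1]."""
    counts = [0] * num_bins
    for x in samples:
        # Clamp so that x == 1.0 lands in the last bin instead of overflowing.
        index = min(int(x * num_bins), num_bins - 1)
        counts[index] += 1
    expected = len(samples) / num_bins
    return sum((c - expected) ** 2 / expected for c in counts)

# Tabulated 5% critical value for 9 degrees of freedom (10 bins - 1).
CRITICAL_9_DF_5_PERCENT = 16.919

rng = random.Random(12345)
uniform_like = [rng.random() for _ in range(10_000)]
skewed = [rng.random() ** 2 for _ in range(10_000)]  # piles up near 0

print("uniform-like statistic:", chi_squared_uniform(uniform_like))
print("skewed statistic:", chi_squared_uniform(skewed))
```

A statistic above the critical value is evidence against uniformity. A statistic below it only means this particular test found nothing wrong.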
However, as discussed in the comments, and here
The main problem with chi squared test is the choice of number and size of the intervals. Although rules of thumb can help produce good results, there is no panacea for all kinds of applications.
To this end, the Kolmogorov-Smirnov test can be used. The idea behind this test is that a plot of the ordered data should be a good fit against the perfect ordered data (known as the cumulative distribution). For a uniform distribution the perfect ordered data is a straight line: you expect the 10th percentile of the data to be 10% of the way through the range, the 20th to be 20% of the way through the range and so on. So, programmatically, you could sort the data, plot it against the ideal value and you should get a straight line. There is also a formal, quantitative statistical test you can apply, which is based on the differences between the actual and ideal values.
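The sort-and-compare idea can be sketched as follows, again assuming values in [0, 1]. The function names are hypothetical. The cut-off 1.36 / sqrt(n) is the usual large-sample approximation for the 5% level, so treat it as a rough guide rather than an exact threshold.

```python
import math
import random

def ks_statistic_uniform(samples):
    """Largest gap between the empirical CDF and the uniform CDF on [0, 1]."""
    ordered = sorted(samples)
    n = len(ordered)
    d = 0.0
    for i, x in enumerate(ordered, start=1):
        # The empirical CDF jumps from (i - 1) / n to i / n at x; the ideal CDF is x.
        d = max(d, i / n - x, x - (i - 1) / n)
    return d

def ks_critical_value(n, coefficient=1.36):
    """Large-sample approximation; 1.36 corresponds to the 5% level."""
    return coefficient / math.sqrt(n)

rng = random.Random(2024)
skewed = [rng.random() ** 2 for _ in range(10_000)]
print(ks_statistic_uniform(skewed), ks_critical_value(len(skewed)))
```

Unlike the binned test, nothing here depends on a choice of intervals.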
(b) To test independence, there are multiple approaches. Autocorrelation at various time lags is one fairly obvious one: to what extent is the value at time t similar to the value at time t+1, for example. The runs test is another nice one: you convert all the numbers into 1 or 0 depending on whether they fall above or below the median, and then the distribution of the length of runs can be used to construct a statistical test. The runs test can also be used to test for runs in one direction or another, as described here and here (this might be more useful in your case). Both of these have fairly straightforward implementations so long as you have the formulas to hand!
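Both independence checks fit in a few lines. This is a sketch under my own assumptions: the runs test below counts the number of runs above and below the median (the Wald-Wolfowitz form) rather than their lengths, and it returns a z-score that should be roughly standard normal for independent data. The function names are made up for illustration.

```python
import math
import statistics

def lag_autocorrelation(samples, lag=1):
    """Sample autocorrelation between x[t] and x[t + lag]."""
    mean = statistics.fmean(samples)
    numerator = sum(
        (samples[i] - mean) * (samples[i + lag] - mean)
        for i in range(len(samples) - lag)
    )
    denominator = sum((x - mean) ** 2 for x in samples)
    return numerator / denominator

def runs_test_z(samples):
    """z-score of the number of runs above/below the median."""
    median = statistics.median(samples)
    # Values equal to the median carry no information and are dropped.
    above = [x > median for x in samples if x != median]
    n1 = sum(above)
    n2 = len(above) - n1
    n = n1 + n2
    runs = 1 + sum(1 for a, b in zip(above, above[1:]) if a != b)
    expected = 2 * n1 * n2 / n + 1
    variance = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n ** 2 * (n - 1))
    return (runs - expected) / math.sqrt(variance)

alternating = [0.1, 0.9] * 50          # far too many runs
trending = [i / 100 for i in range(100)]  # far too few runs
print(runs_test_z(alternating), runs_test_z(trending))
print(lag_autocorrelation(alternating))
```

A |z| well above 2 or an autocorrelation far from 0 suggests the draws are not independent.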
Apart from the diehard tests, other good sources discussing random number generators include here and here.

To check whether a random number generator (or any other source of randomness, for that matter) matches a desired model (in your case, the uniform distribution), you should use a statistical test such as Pearson's chi-squared test.
The test is based on collecting observations and comparing them with the counts expected under the theoretical model you assume the numbers come from.
At the end, the test gives you a p-value: the probability of seeing a deviation at least as large as the observed one if the sample really did come from the given model.
A simple example:
Given a die and the draws [5,3,5,5,1,1], is the die balanced? (p=1/6 for each of {1,...,6})
Given the above observations we create the Expected vector: E = [1,1,1,1,1,1] (each entry is N/6, where 6 is the number of possible outcomes and N is the number of draws, which is also 6 in this example). And the Observed vector: O = [2,0,1,0,3,0]
From this we compute the statistic:
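Chi^2 = sum((O_i - E_i)^2 / E_i) = 1/1 + 1/1 + 0/1 + 1/1 + 4/1 + 1/1 = 8
Now we need the probability P(Chi^2 >= 8) under the chi^2 distribution. With six possible outcomes the test has 6 - 1 = 5 degrees of freedom, which gives a probability of roughly 0.16, so this sample is not enough to reject the hypothesis that the die is unbiased. (The value ~0.005 only comes out if one degree of freedom is used, which is not the right reference distribution here.) With only six draws the expected counts are also too small for the chi-squared approximation to be reliable, so in practice you would collect many more draws.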
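For reference, the observed vector and the statistic from this example can be reproduced with a few lines. This is only a sketch; the function name is my own.

```python
from collections import Counter

def chi_squared_statistic(draws, outcomes):
    """Observed counts and Pearson statistic against equally likely outcomes."""
    counts = Counter(draws)
    expected = len(draws) / len(outcomes)
    observed = [counts[outcome] for outcome in outcomes]
    statistic = sum((o - expected) ** 2 / expected for o in observed)
    return observed, statistic

observed, statistic = chi_squared_statistic([5, 3, 5, 5, 1, 1], range(1, 7))
print(observed, statistic)  # prints [2, 0, 1, 0, 3, 0] 8.0
```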

You're saying that they wanted you to recreate/reinvent the "diehard" battery of tests that it took Marsaglia many years to develop? I'd call them on unreasonable expectations.

Whatever distribution the random floats are supposed to have, say a uniform distribution over the interval [0,1], you can use the Kolmogorov-Smirnov test http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test to check whether a sample fails to follow the desired distribution. This can have advantages over the chi-squared test if you have many possible values (because if you have more possible values than samples, you have to define buckets for the chi-squared test, which makes it less powerful than a general distribution check like Kolmogorov-Smirnov).


Sampling from discrete distribution without replacement where the probabilities change each draw

I have a sequence S = (s1, s2, ..., sk) with a probability weight for each site, P = (p1, p2, ..., pk), where the weights sum to 1. The maximum length of S may be around 10^9.
In the simulation a site is picked and modified on each draw, so its weight also changes on every pass. The expected number of site exchanges is about 50k - 100k per simulation.
Question 1: How would you suggest drawing a site?
Currently I have implemented the following logic, which seems fine in itself and follows the literature (see e.g. here):
counter = 0
P_sum = P[0]
random_number = draw_random() #<= float in range 0,1
while P_sum < random_number
    counter += 1
    P_sum += P[counter]
return counter
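For comparison, here is a runnable sketch of the same cumulative-sum draw. The function name and the fallback for rounding error are my own additions, not part of the simulation described here.

```python
import random

def draw_site(weights, random_number):
    """Return the first index whose cumulative weight exceeds random_number."""
    cumulative = 0.0
    for index, weight in enumerate(weights):
        cumulative += weight
        if random_number < cumulative:
            return index
    # Rounding can leave the total slightly below 1; fall back to the last site.
    return len(weights) - 1

weights = [0.2, 0.5, 0.3]
site = draw_site(weights, random.random())
print(site)
```

Each draw costs time proportional to the number of sites, but changing a single weight between draws needs no extra bookkeeping.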
While testing the simulation I observed a strong bias that seems to reproduce the random generator's own distribution (see_here). Three different generators produce three different results, which is fair enough, but none of them is correct in all states.
Walker's and Knuth's methods with a lookup table seem too expensive for me, as the lookup tables would have to be recalculated after every draw.
Question 2 How can I reduce the bias coming from the random generator? Currently three different generators are built in (only one is used per simulation), and each is supposed to be uniformly distributed. I know this is a hard question to answer without seeing a line of the simulation code.
Question 3 Library for the thing ?
As it's not too much code I have no problem writing it on my own, but is there a library for this other than Boost? I'm asking because this question may be outdated. I'd rather avoid Boost because I don't want to build in a fourth random generator or pull in such a large library.
Question 4 Faster alternative?
I know that this topic has been answered thousands of times before, but none of the answers satisfied me or gave me a sensible alternative. For example, the question here seems to have the same problem, but I don't understand which heap is built where and why, and it seems very complicated for such an "easy" thing.
Thank you for your support!

How many simulations do I need to run?

Hello, my problem is mostly about validating a model. I have written a program in NetLogo that I'm going to use in a report for my thesis, but now the question is: how many repetitions (simulations) do I need to run to justify my results? I have already read about some methods using a statistical approach, and my colleagues have suggested some nice mathematical operations, but I also want to know from people who work with computational models what kind of statistical test or mathematical method they use to decide that.
There are two aspects to this (1) How many parameter combinations (2) How many runs for each parameter combination.
(1) Generally you would do experiments, where you vary some of your input parameter values and see how some model output changes. Take the well known Schelling segregation model as an example, you would vary the tolerance value and see how the segregation index is affected. In this case you might vary the tolerance from 0 to 1 by 0.01 (if you want discrete) or you could just take 100 different random values in the range [0,1]. This is a matter of experimental design and is entirely affected by how fine you wish to examine your parameter space.
(2) For each experimental value, you also need to run multiple simulations so that you can calculate the average and reduce the impact of randomness in the simulation run. For example, say you ran the model with a value of 3 for your input parameter (whatever it means) and got a result of 125. How do you know whether the 'real' answer is 125 or something else? If you ran it 10 times and got 10 different numbers in the range 124.8 to 125.2, then 125 is not an unreasonable estimate. If you ran it 10 times and got numbers ranging from 50 to 500, then 125 is not a useful result to report.
The number of runs for each experiment set depends on the variability of the output and your tolerance. Even the 124.8 to 125.2 is not useful if you want to be able to estimate to 1 decimal place. Look up 'standard error of the mean' in any statistics text book. Basically, if you do N runs, then a 95% confidence interval for the result is the average of the results for your N runs plus/minus 1.96 x standard deviation of the results / sqrt(N). If you want a narrower confidence interval, you need more runs.
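As a small sketch of that calculation (the function names and the pilot numbers are made up for illustration), the interval and a rough estimate of how many runs a target precision needs could look like this:

```python
import math
import statistics

def confidence_interval_95(results):
    """Mean plus/minus 1.96 standard errors for a list of run results."""
    mean = statistics.fmean(results)
    half_width = 1.96 * statistics.stdev(results) / math.sqrt(len(results))
    return mean - half_width, mean + half_width

def runs_needed(pilot_results, target_half_width):
    """Runs needed for the 95% interval to shrink to +/- target_half_width,
    assuming the pilot standard deviation is representative."""
    s = statistics.stdev(pilot_results)
    return math.ceil((1.96 * s / target_half_width) ** 2)

pilot = [124.8, 125.0, 125.2, 124.9, 125.1]  # hypothetical outputs of 5 runs
print(confidence_interval_95(pilot))
print(runs_needed(pilot, 0.05))
```

The 1.96 factor relies on the normal approximation, so with very few runs the true interval is somewhat wider.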
The other thing to consider is that if you are looking for a relationship over the parameter space, then you need fewer runs at each point than if you are trying to do a point estimate of the result.
Not sure exactly what you mean, but maybe you can check the books of Hastie and Tibshirani, especially the sections on resampling methods (cross-validation and bootstrap).
They also have a shorter book that covers the methods possibly relevant to your case, along with the R commands to run them. However, this book, as far as I know, is not free.
Also, you could perturb the initial conditions to see whether the outcome changes after small perturbations of the initial conditions or parameters. On a larger scale, sometimes you can break down the parameter space according to the final state of the system.
1) The number of simulations for each parameter setting can be decided by studying the coefficient of variation Cv = s / u, where s and u are the standard deviation and mean of the result respectively. It is explained in detail in this paper Coefficient of variance.
2) The simulations where parameters are changed can be analyzed using several methods illustrated in the paper Testing methods.
These papers provide rigorous analysis methods and refer to other papers which may be relevant to your question and your research.

Understanding Perceptrons

I just started a Machine learning class and we went over Perceptrons. For homework we are supposed to:
"Choose appropriate training and test data sets of two dimensions (plane). Use 10 data points for training and 5 for testing. " Then we are supposed to write a program that will use a perceptron algorithm and output:
a comment on whether the training data points are linearly separable
a comment on whether the test points are linearly separable
your initial choice of the weights and constants
the final solution equation (decision boundary)
the total number of weight updates that your algorithm made
the total number of iterations made over the training set
the final misclassification error, if any, on the training data and
also on the test data
I have read the first chapter of my book several times and I am still having trouble fully understanding perceptrons.
I understand that you change the weights if a point is misclassified until none are misclassified anymore. I guess what I'm having trouble understanding is:
What do I use the test data for and how does that relate to the training data?
How do I know if a point is misclassified?
How do I go about choosing test points, training points, threshold or a bias?
It's really hard for me to know how to make up one of these without my book providing good examples. As you can tell I am pretty lost, any help would be so much appreciated.
What do I use the test data for and how does that relate to the training data?
Think about a Perceptron as a young child. You want to teach a child how to distinguish apples from oranges. You show it 5 different apples (all red/yellow) and 5 oranges (of different shapes) while telling it what it sees at every turn ("this is an apple, this is an orange"). Assuming the child has perfect memory, it will learn what makes an apple an apple and an orange an orange if you show it enough examples. It will eventually start to use meta-features (like shapes) without you actually telling it to. This is what a Perceptron does. After you have shown it all the examples, you start again at the beginning; this is called a new epoch.
What happens when you want to test the child's knowledge? You show it something new. A green apple (not just yellow/red), a grapefruit, maybe a watermelon. Why not show the child the exact same data as before during training? Because the child has perfect memory, it will only tell you what you told it. You won't see how well it generalizes from known to unseen data unless you have separate test data that you never showed it during training. If the child performs horribly on the test data but scores 100% on the training data, you will know that it has learned nothing: it's simply repeating what it was told during training. You trained it too long, and it only memorized your examples without understanding what makes an apple an apple because you gave it too many details. This is called overfitting. To prevent your Perceptron from only (!) recognizing training data you'll have to stop training at a reasonable time and find a good balance between the sizes of the training and test sets.
How do I know if a point is misclassified?
If its output is different from what it should be. Let's say an apple has class 0 and an orange has class 1 (here you should start reading into Single/MultiLayer Perceptrons and how Neural Networks of multiple Perceptrons work). The network will take your input. How it's coded is irrelevant for this; let's say the input is a string "apple". Your training set then is {(apple1,0), (apple2,0), (apple3,0), (orange1,1), (orange2,1), ...}. Since you know the class beforehand, you can check the output: the network will output either 1 or 0 for the input "apple1". If it outputs 1, you compute (targetValue - actualValue) = (0 - 1) = -1. A nonzero result means that the network gave a wrong output. Compare this to the delta rule and you will see that this small expression is part of the larger update equation. Whenever the result is nonzero you perform a weight update. If target and actual value are the same, you will always get 0 and you know that the network didn't misclassify.
How do I go about choosing test points, training points, threshold or a bias?
In practice the bias and threshold aren't "chosen" per se. The bias is trained like any other unit using a simple "trick", namely using the bias as an additional input unit with value 1. This means the actual bias value is encoded in this additional unit's weight, and the algorithm we use will make sure it learns the bias for us automatically.
Depending on your activation function, the threshold is predetermined. For a simple perceptron, the classification will occur as follows:
Since we use a binary output (between 0 and 1), it's a good start to put the threshold at 0.5 since that's exactly the middle of the range [0,1].
Now to your last question about choosing training and test points: this is quite difficult, and you learn to do it by experience. Where you're at, you start off by implementing simple logical functions like AND, OR, XOR etc. There it's trivial: you put everything in your training set and test with the same values as your training set (since for x XOR y etc. there are only 4 possible inputs: 00, 10, 01, 11). For complex data like images, audio etc. you'll have to try and tweak your data and features until you feel the network can work with it as well as you want it to.
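To make the pieces concrete, here is a minimal perceptron sketch trained on AND, which is linearly separable (a single perceptron cannot learn XOR). It uses the bias trick from above: an extra input fixed at 1 whose weight is learned. The assumptions are mine: the weighted sum is thresholded at 0 and the bias absorbs any other threshold, the weights start at zero, and the function names are made up.

```python
def predict(weights, point):
    """Step activation on the weighted sum; the trailing 1.0 is the bias input."""
    inputs = list(point) + [1.0]
    activation = sum(w * x for w, x in zip(weights, inputs))
    return 1 if activation > 0 else 0

def train_perceptron(data, learning_rate=1.0, max_epochs=100):
    """Return (weights, number of weight updates, epochs used)."""
    weights = [0.0] * (len(data[0][0]) + 1)  # initial choice: all zeros
    updates = 0
    for epoch in range(1, max_epochs + 1):
        errors = 0
        for point, target in data:
            error = target - predict(weights, point)  # 0 means correct
            if error != 0:
                inputs = list(point) + [1.0]
                weights = [w + learning_rate * error * x
                           for w, x in zip(weights, inputs)]
                updates += 1
                errors += 1
        if errors == 0:  # a full pass with no mistakes: training is done
            return weights, updates, epoch
    return weights, updates, max_epochs

AND_DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, updates, epochs = train_perceptron(AND_DATA)
print(weights, updates, epochs)
```

The returned weights define the decision boundary w1*x + w2*y + b = 0. The update and epoch counts are the numbers the homework asks you to report.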
What do I use the test data for and how does that relate to the training data?
Usually, to assess how well a particular algorithm performs, one first trains it and then uses different data to test how well it does on data it has never seen before.
How do I know if a point is misclassified?
Your training data has labels, which means that for each point in the training set, you know what class it belongs to. A point is misclassified when the class the perceptron predicts differs from its label.
How do I go about choosing test points, training points, threshold or a bias?
For simple problems, you usually take all the training data and split it around 80/20. You train on the 80% and test against the remaining 20%.

Why do we multiply the 'most likely estimate' by 4 in three-point estimation?

I have used three-point estimation for one of my projects.
The formula is
Three Point Estimate = (O + 4M + L) / 6
That means,
(Best Estimate + 4 x Most Likely Estimate + Worst Case Estimate) divided by 6
Dividing by 6 gives the average, since the weights 1 + 4 + 1 add up to 6,
and there is less chance of the worst case or the best case happening. In good faith, the most likely estimate (M) is what it will take to get the job done.
But I don't know why they use 4(M). Why do they multiply by 4, and not 5, 6, 7, etc.?
Why is the most likely estimate weighted four times as much as the other two values?
There is a derivation here:
In case the link goes dead, I'll provide a summary here.
So, taking a step back from the question for a moment, the goal here is to come up with a single mean (average) figure that we can say is the expected figure for any given 3 point estimate. That is to say, If I was to attempt the project X times, and add up all the costs of the project attempts for a total of $Y, then I expect the cost of one attempt to be $Y/X. Note that this number may or may not be the same as the mode (most likely) outcome, depending on the probability distribution.
An expected outcome is useful because we can do things like add up a whole list of expected outcomes to create an expected outcome for the project, even if we calculated each individual expected outcome differently.
A mode on the other hand, is not even necessarily unique per estimate, so that's one reason that it may be less useful than an expected outcome. For example, every number from 1-6 is the "most likely" for a dice roll, but 3.5 is the (only) expected average outcome.
The rationale/research behind a 3 point estimate is that in many (most?) real-world scenarios, these numbers can be more accurately/intuitively estimated by people than a single expected value:
A pessimistic outcome (P)
An optimistic outcome (O)
The most likely outcome (M)
However, to convert these three numbers into an expected value we need a probability distribution that interpolates all the other (potentially infinite) possible outcomes beyond the 3 we produced.
The fact that we're even doing a 3-point estimate presumes that we don't have enough historical data to simply lookup/calculate the expected value for what we're about to do, so we probably don't know what the actual probability distribution for what we're estimating is.
The idea behind the PERT estimates is that if we don't know the actual curve, we can plug some sane defaults into a Beta distribution (which is basically just a curve we can customise into many different shapes) and use those defaults for every problem we might face. Of course, if we know the real distribution, or have reason to believe that default Beta distribution prescribed by PERT is wrong for the problem at hand, we should NOT use the PERT equations for our project.
The Beta distribution has two parameters A and B that set the shape of the left and right hand side of the curve respectively. Conveniently, we can calculate the mode, mean and standard deviation of a Beta distribution simply by knowing the minimum/maximum values of the curve, as well as A and B.
PERT sets A and B to the following for every project/estimate:
If M > (O + P) / 2 then A = 3 + √2 and B = 3 - √2, otherwise the values of A and B are swapped.
Now, it just so happens that if you make that specific assumption about the shape of your Beta distribution, the following formulas are exactly true:
Mean (expected value) = (O + 4M + P) / 6
Standard deviation = (P - O) / 6
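The two identities are easy to check numerically. The sketch below uses the textbook mode, mean and variance formulas of a Beta(A, B) distribution rescaled to the interval [O, P]. The function name and the sample numbers are my own. Note that with A and B fixed, the mode M is fixed too, which is exactly the "very specific shape" assumption.

```python
import math

def pert_beta(optimistic, pessimistic, mode_above_midpoint=True):
    """Mode, mean and standard deviation of the fixed-shape Beta curve on [O, P]."""
    a, b = 3 + math.sqrt(2), 3 - math.sqrt(2)
    if not mode_above_midpoint:
        a, b = b, a
    span = pessimistic - optimistic
    # Standard Beta(a, b) formulas, rescaled from [0, 1] to [O, P].
    mode = optimistic + span * (a - 1) / (a + b - 2)
    mean = optimistic + span * a / (a + b)
    std = span * math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return mode, mean, std

o, p = 10.0, 40.0
mode, mean, std = pert_beta(o, p)
print(mean, (o + 4 * mode + p) / 6)  # the two values agree up to rounding
print(std, abs(p - o) / 6)           # likewise
```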
So, in summary
The PERT formulas are not based on a normal distribution, they are based on a Beta distribution with a very specific shape
If your project's probability distribution matches the PERT Beta distribution then the PERT formulas are exactly correct, they are not approximations
It is pretty unlikely that the specific curve chosen for PERT matches any given arbitrary project, and so the PERT formulas will be an approximation in practice
If you don't know anything about the probability distribution of your estimate, you may as well leverage PERT as it's documented, understood by many people and relatively easy to use
If you know something about the probability distribution of your estimate that suggests something about PERT is inappropriate (like the 4x weighting towards the mode), then don't use it, use whatever you think is appropriate instead
The reason why you multiply by 4 to get the Mean (and not 5, 6, 7, etc.) is because the number 4 is tied to the shape of the underlying probability curve
Of course, PERT could have been based off a Beta distribution that yields 5, 6, 7 or any other number when calculating the Mean, or even a normal distribution, or a uniform distribution, or pretty much any other probability curve, but I'd suggest that the question of why they chose the curve they did is out of scope for this answer and possibly quite open ended/subjective anyway
I dug into this once. I cleverly neglected to write down the trail, so this is from memory.
So far as I can make out, the standards documents got it from the textbooks. The textbooks got it from the original 1950s write-up in a statistics journal. The write-up in the journal was based on an internal report done by RAND as part of the overall work done to develop PERT for the Polaris program.
And that's where the trail goes cold. Nobody seems to have a firm idea of why they chose that formula. The best guess seems to be that it's based on a rough approximation of a normal distribution -- strictly, it's a triangular distribution. A lumpy bell curve, basically, that assumes that the "likely case" falls within 1 standard deviation of the true mean estimate.
4/6ths approximates 66.7%, which approximates 68%, which approximates the area under a normal distribution within one standard deviation of the mean.
All that being said, there are two problems:
It's essentially made up. There doesn't seem to be a firm basis for picking it. There's some Operational Research literature arguing for alternative distributions. In what universe are estimates normally distributed around the true outcome? I'd very much like to move there.
The accuracy-improving effect of the 3-point / PERT estimation method might be more about the breaking down of tasks into subtasks than from any particular formula. Psychologists studying what they call "the planning fallacy" have found that breaking down tasks -- "unpacking", in their terminology -- consistently improves estimates by making them higher and thus reducing inaccuracy. So perhaps the magic in PERT/3-point is the unpacking, not the formulae.
Isn't it just a rule of thumb that works well?
The cone of uncertainty uses the factor 4 for the beginning phase of the project.
The book "Software Estimation" by Steve McConnell is based around the "cone of uncertainty" model and gives many rules of thumb. However, every approximated number or rule of thumb is based on statistics from COCOMO or similar solid research, models or studies.
Ideally these factors for O, M and L are derived using historical data for other projects in the same company in the same environment. In other words, the company should have 4 projects completed within M estimate, 1 within O and 1 within L. If my company/team had got 1 project completed within original O estimate, 2 projects within M and 2 within L, I would use another formula - (O + 2M + 2L) / 5. Does it make sense?
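A tiny sketch of that idea, with the weights taken from counts of past projects. The function name and the sample estimates are hypothetical.

```python
def weighted_estimate(optimistic, most_likely, pessimistic, counts=(1, 4, 1)):
    """Weighted average of O, M and L; counts are how often each case occurred."""
    w_o, w_m, w_l = counts
    total = w_o * optimistic + w_m * most_likely + w_l * pessimistic
    return total / (w_o + w_m + w_l)

print(weighted_estimate(2, 4, 12))                    # classic (O + 4M + L) / 6
print(weighted_estimate(2, 4, 12, counts=(1, 2, 2)))  # (O + 2M + 2L) / 5
```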
The cone of uncertainty was referenced above ... it's a well-known foundational element used in agile estimation practices.
What's the problem with it though? Doesn't it look too symmetrical - as if it's not natural, not really based on real data?
If you ever thought that, then you're right. The cone of uncertainty shown in the picture above is made up based on probabilities ... not actual raw data from real projects (but most of the time it's used as such).
Laurent Bossavit wrote a book and also gave a presentation where he presented his research on how that cone came to be (and other 'facts' we often believe in software engineering):
The Leprechauns of Software Engineering
Is there some real data to support a cone of uncertainty? The closest he was able to find was a cone that can go up to 10x in the positive Y direction (so we can be up to 10 times off on our estimation in terms of the project taking 10 times as long in the end).
Hardly anybody estimates a project that ends up finishing 4 times earlier ... or ... gasp ... 10 times earlier.

Algorithm to score similarity of sets of numbers

What is an algorithm to compare multiple sets of numbers against a target set to determine which ones are the most "similar"?
One use of this algorithm would be to compare today's hourly weather forecast against historical weather recordings to find a day that had similar weather.
The similarity of two sets is a bit subjective, so the algorithm really just needs to differentiate between good matches and bad matches. We have a lot of historical data, so I would like to narrow down the number of days the users need to look through by automatically throwing out sets that aren't close and putting the "best" matches at the top of the list.
Ideally the result of the algorithm would be comparable to results using different data sets. For example, using the mean square error as suggested by Niles produces pretty good results, but the numbers generated when comparing temperature cannot be compared to numbers generated with other data such as Wind Speed or Precipitation, because the scale of the data is different. Some of the non-weather data is very large, so the mean square error algorithm generates numbers in the hundreds of thousands, compared to the tens or hundreds generated by using temperature.
I think the mean square error metric might work for applications such as weather compares. It's easy to calculate and gives numbers that do make sense.
Since you want to compare measurements over time you can just leave out missing values from the calculation.
For values that are not time-bound or even unsorted, multi-dimensional scatter data it's a bit more difficult. Choosing a good distance metric becomes part of the art of analysing such data.
Use the Pearson correlation coefficient. I figured out how to calculate it in an SQL query, which can be found here: http://vanheusden.com/misc/pearson.php
In finance they use Beta to measure the correlation of 2 series of numbers. EG, Beta could answer the question "Over the last year, how much would the price of IBM go up on a day that the price of the S&P 500 index went up 5%?" It deals with the percentage of the move, so the 2 series can have different scales.
In my example, the Beta is Covariance(IBM, S&P 500) / Variance(S&P 500).
Wikipedia has pages explaining Covariance, Variance, and Beta: http://en.wikipedia.org/wiki/Beta_(finance)
Look at statistical sites. I think you are looking for correlation.
As an example, I'll assume you're measuring temp, wind, and precip. We'll call these items "features". So valid values might be:
Temp: -50 to 100F (I'm in Minnesota, USA)
Wind: 0 to 120 Miles/hr (not sure if this is realistic but bear with me)
Precip: 0 to 100
Start by normalizing your data. Temp has a range of 150 units, Wind 120 units, and Precip 100 units. Multiply your wind units by 1.25 and Precip by 1.5 to make them roughly the same "scale" as your temp. You can get fancy here and make rules that weigh one feature as more valuable than others. In this example, wind might have a huge range but usually stays in a smaller range so you want to weigh it less to prevent it from skewing your results.
Now, imagine each measurement as a point in multi-dimensional space. This example measures 3d space (temp, wind, precip). The nice thing is, if we add more features, we simply increase the dimensionality of our space but the math stays the same. Anyway, we want to find the historical points that are closest to our current point. The easiest way to do that is Euclidean distance. So measure the distance from our current point to each historical point and keep the closest matches:
for each historicalpoint
    distance = sqrt(
        pow(currentpoint.temp - historicalpoint.temp, 2) +
        pow(currentpoint.wind - historicalpoint.wind, 2) +
        pow(currentpoint.precip - historicalpoint.precip, 2))
    if distance is smaller than the largest distance in our match collection
        add historicalpoint to our match collection
        remove the match with the largest distance from our match collection
This is a brute-force approach. If you have the time, you could get a lot fancier. Multi-dimensional data can be represented as trees like kd-trees or r-trees. If you have a lot of data, comparing your current observation with every historical observation would be too slow. Trees speed up your search. You might want to take a look at Data Clustering and Nearest Neighbor Search.
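A runnable sketch of the brute-force version, with the scaling factors from the normalization step (1.25 for wind, 1.5 for precip) applied inside the distance. The dictionary layout and function names are my own assumptions. `heapq.nsmallest` keeps the k best matches without sorting the whole history.

```python
import heapq
import math

# Scale factors that bring each feature to roughly the same range as temp.
SCALES = {"temp": 1.0, "wind": 1.25, "precip": 1.5}

def distance(a, b):
    """Euclidean distance between two observations after scaling."""
    return math.sqrt(sum((SCALES[k] * (a[k] - b[k])) ** 2 for k in SCALES))

def closest_matches(current, history, k=3):
    """The k historical observations nearest to the current one."""
    return heapq.nsmallest(k, history, key=lambda day: distance(current, day))

current = {"temp": 70, "wind": 10, "precip": 0}
history = [
    {"temp": 71, "wind": 10, "precip": 0},
    {"temp": 30, "wind": 5, "precip": 20},
    {"temp": 70, "wind": 50, "precip": 0},
]
print(closest_matches(current, history, k=2))
```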
Talk to a statistician.
They do this type of thing for a living.
You write that the "similarity of two sets is a bit subjective", but it's not subjective at all-- it's a matter of determining the appropriate criteria for similarity for your problem domain.
This is one of those situations where you are much better off speaking to a professional than asking a bunch of programmers.
First of all, ask yourself if these are sets, or ordered collections.
I assume that these are ordered collections with duplicates. The most obvious algorithm is to select a tolerance within which numbers are considered the same, and count the number of slots where the numbers are the same under that measure.
I do have a solution implemented for this in my application, but I'm looking to see if there is something that is better or more "correct". For each historical day I do the following:
function calculate_score(historical_set, forecast_set)
    double c = correlation(historical_set, forecast_set);
    double avg_history = average(historical_set);
    double avg_forecast = average(forecast_set);
    double penalty = abs(avg_history - avg_forecast) / avg_forecast;
    return c - penalty;
I then sort all the results from high to low.
Since the correlation is a value from -1 to 1 that says whether the numbers fall or rise together, I then "penalize" that with the percentage difference between the averages of the two sets of numbers.
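For anyone who wants to try this scoring, here is a self-contained sketch. It assumes `correlation` means the Pearson coefficient, and it takes the absolute value of the forecast average in the penalty so that negative averages (e.g. temperatures below zero) do not flip the sign. Both are my assumptions.

```python
import math
import statistics

def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) for a constant series."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def calculate_score(historical_set, forecast_set):
    """Correlation minus the relative difference of the averages."""
    c = pearson(historical_set, forecast_set)
    avg_history = statistics.fmean(historical_set)
    avg_forecast = statistics.fmean(forecast_set)
    # Breaks down when the forecast average is 0.
    penalty = abs(avg_history - avg_forecast) / abs(avg_forecast)
    return c - penalty

print(calculate_score([1, 2, 3, 4], [2, 4, 6, 8]))
```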
A couple of times, you've mentioned that you don't know the distribution of the data, which is of course true. I mean, tomorrow there could be a day that is 150 degree F, with 2000km/hr winds, but it seems pretty unlikely.
I would argue that you have a very good idea of the distribution, since you have a long historical record. Given that, you can put everything in terms of quantiles of the historical distribution, and do something with absolute or squared difference of the quantiles on all measures. This is another normalization method, but one that accounts for the non-linearities in the data.
Normalization in any style should make all variables comparable.
As an example, let's say that it's a windy, hot day: it might have a temp quantile of .75 and a wind quantile of .75. The .76 quantile for heat might be 1 degree away, and the one for wind might be 3 km/h away.
This focus on the empirical distribution is easy to understand as well, and could be more robust than normal estimation (like Mean-square-error).
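A sketch of the quantile idea, assuming the history of each measure is kept as a sorted list. The data layout and function names are hypothetical. Each value is replaced by the fraction of historical observations at or below it, and days are compared on those fractions.

```python
import bisect

def empirical_quantile(sorted_history, value):
    """Fraction of historical observations that are <= value."""
    return bisect.bisect_right(sorted_history, value) / len(sorted_history)

def quantile_distance(current, candidate, history_by_feature):
    """Sum of squared quantile differences over all features."""
    total = 0.0
    for feature, history in history_by_feature.items():
        q_current = empirical_quantile(history, current[feature])
        q_candidate = empirical_quantile(history, candidate[feature])
        total += (q_current - q_candidate) ** 2
    return total

history_by_feature = {
    "temp": list(range(100)),                              # evenly spread
    "wind": sorted([0] * 90 + list(range(10, 110, 10))),   # mostly calm
}
today = {"temp": 74, "wind": 20}
other = {"temp": 49, "wind": 0}
print(quantile_distance(today, other, history_by_feature))
```

Because every feature ends up on the same 0-to-1 scale, no hand-picked scaling factors are needed.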
Are the two data sets ordered, or not?
If ordered, are the indices the same? Equally spaced?
If the indices are common (temperatures measured on the same days but at different locations, for example), you can regress the first data set against the second, and then test that the slope is equal to 1 and that the intercept is 0.
Otherwise, you can do two regressions of the y-values against their indices (see http://en.wikipedia.org/wiki/Correlation). You'd still want to compare slopes and intercepts.
If unordered, I think you want to look at the cumulative distribution functions. One relevant test is Kolmogorov-Smirnov.
You could also look at Student's t-test or a Wilcoxon signed-rank test (http://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test) to test equality of means between the two samples.
And you could test for equality of variances with a Levene test (http://www.itl.nist.gov/div898/handbook/eda/section3/eda35a.htm).
Note: it is possible for dissimilar sets of data to have the same mean and variance -- depending on how rigorous you want to be (and how much data you have), you could consider testing for equality of higher moments, as well.
Maybe you can see your set of numbers as a vector (each number of the set being a component of the vector).
Then you can simply use dot product to compute the similarity of 2 given vectors (i.e. set of numbers).
You might need to normalize your vectors.
More : Cosine similarity
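A minimal sketch of the normalized dot product (the function name is my own):

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 2, 3], [2, 4, 6]))
print(cosine_similarity([1, 0], [0, 1]))
```

Keep in mind that this ignores magnitude: a series and a scaled copy of it score as identical, which may or may not be what you want for weather data.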
