I am calculating many (~100 million) floating point values during an operation. I do not want to store them all in memory, but I want to save a rough distribution of the collection.
My idea was to determine the exponents of all values and count them in a histogram. But this, of course, works only if the values have different exponents.
Does anybody have an idea how I can do this without knowing what the distribution looks like?
I would suggest randomly saving some, then making a histogram after the fact from that. For example if you randomly save 0.1% of the numbers then you'd only need to save 100,000, from which you can calculate a highly accurate distribution.
You can reduce the number of calls to rand() by calling it every time you save a number to find a random number in the range 1..2000, then wait that many numbers before saving the next.
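A minimal Python sketch of the gap-based sampling described above. The function name and the `max_gap` parameter are made up for illustration; with gaps drawn uniformly from 1..2000 the average gap is about 1000, so roughly 0.1% of the values are kept and the generator is only called once per saved value.

```python
import random

def sample_with_random_gaps(values, max_gap=2000, rng=None):
    """Keep roughly one value per (max_gap + 1) / 2 by skipping random gaps."""
    rng = rng or random.Random()
    kept = []
    countdown = rng.randint(1, max_gap)  # how many values until the next save
    for v in values:
        countdown -= 1
        if countdown == 0:
            kept.append(v)
            countdown = rng.randint(1, max_gap)  # one RNG call per saved value
    return kept
```

The kept values can then be fed into any ordinary histogram routine after the run has finished.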
If you approximately know the min and max values, I'd think a binning strategy would be a good choice. Here is an outline for what I mean:
Figure out how many bins you need
For all my numbers
Find the bin that this number goes in
Increment that bin
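The outline above could look like this in Python. This is only a sketch: `lo`, `hi` and `n_bins` are assumed to be chosen by you in advance, and values outside the range are clamped into the first or last bin rather than dropped.

```python
def make_histogram(values, lo, hi, n_bins):
    """Count values into n_bins equal-width bins spanning [lo, hi]."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        i = int((v - lo) / width)          # find the bin this number goes in
        i = min(max(i, 0), n_bins - 1)     # clamp out-of-range values to the edge bins
        counts[i] += 1                     # increment that bin
    return counts
```

Memory use depends only on the number of bins, not on how many values are processed.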
Another useful alternative would be to compute the moments of the distribution on the fly, and then reconstruct the PDF from the moments.
https://en.wikipedia.org/wiki/Method_of_moments_(statistics)
https://www.wias-berlin.de/people/john/ELECTRONIC_PAPERS/JAOT07.CES.pdf
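As a sketch of what "on-the-fly" can mean here, the first two moments can be updated one value at a time with Welford's method, so nothing has to be stored. The class name is made up, and higher moments (skewness, kurtosis) would need similar running sums that are not shown.

```python
class RunningMoments:
    """Running mean and variance, updated one value at a time (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        """Population variance of the values seen so far."""
        return self.m2 / self.n if self.n else float("nan")
```

Call `add()` on each value as it is produced, then read `mean` and `variance()` at the end.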
According to a teacher of mine, in order to do this you make two arrays of numbers with several decimal places, one positive array and one negative.
Array 1 [0] = e.g 1.5739
Array 2 [0] = e.g -5.31729
Then you find current time
201305220957 or May 22,2013 at 9:57 AM
And use this equation:
(201305211647*1.5739)--5.31729
Then you take the absolute value, round to one decimal place, and you have your number.
Is it true that in most generators the value is dependent on time?
Bottom line up front - Generating random numbers is really hard to do right, and has badly burned some seriously smart people (John Von Neumann for one). Normal people shouldn't try to create their own RNG algorithms. It requires expertise in number theory, probability & statistics, and numerical computation. Unless you qualify in all three fields, you're much better off using algorithms developed by people who are. If you want to know how to do it right you can find lots of good info at http://en.wikipedia.org/wiki/Random_number_generation and http://en.wikipedia.org/wiki/Pseudorandom_number_generator.
Speaking bluntly, your teacher is totally clueless about this topic.
If you need random numbers for statistical purposes, you need a well-studied generator. One I like is the Mersenne Twister, which has very nice statistical properties, is fast to run and easy to code. It is not cryptographically secure, though, so for encryption use a generator designed for that purpose (a CSPRNG, such as the one your operating system provides).
If you just need a reasonably random generator, for example to make things appear random in a game, you can use the classic linear congruential generator (LCG), which is trivial to write and produces pretty random-looking output. (It is not safe for heavy-duty computation.)
The LCG generator:
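unsigned int seed = 0x333; // choose any number
// unsigned arithmetic wraps around on overflow; signed overflow is undefined in C
unsigned int random() { seed = ( seed * 69069u ) + 1u; return seed; }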
There are several numbers you can use instead of 69069, but don't pick your own. Choose one from here if you don't like 69069.
http://en.wikipedia.org/wiki/Linear_congruential_generator
Wikipedia has a lot of great pages devoted to random number generation (RNG). One page is devoted to just listing the various types of random number generators used throughout history. One of the earliest and weakest is known as the middle-square method, which is easy to implement and suitable for many low-level tasks. Some computers have a linear feedback shift register (LFSR) built into the circuitry for random number generation, but it's not very advanced. One of the more modern generators (not the most modern) is known as the Mersenne Twister. It has good statistical properties, but it is not cryptographically secure: its internal state can be reconstructed from 624 consecutive 32-bit outputs, after which every later value is predictable.
Properly, these are pseudorandom number generators (PRNGs), because they aren't truly random. They aren't truly random because computers are deterministic machines (state machines); no predetermined algorithm can be programmed to generate truly random numbers from a known prior state.
That said, true random number generator (TRNG) hardware circuitry (typically analog) does exist, and it is approached in different ways: from checking ambient conditions such as temperature and pressure, to more nuanced phenomena that depend on atomic or quantum effects, such as which state a multistable circuit with feedback settles in. Most modern personal computers do not use this and, if they do, probably only use it to find the seed value for PRNGs. Then, you also have RNG programs that connect over the internet to random number servers, most of which use TRNGs. You can even rely on lookup tables of documented real-world phenomena; lookup tables are very old school.
Pseudorandom number generators only have the appearance of randomness: they follow a particular distribution, and predicting future values from prior ones is not easy. There is a set of Diehard tests whose sole purpose is testing the quality of a random number generator. In general, though, all a PRNG is, is an algorithm that produces a sequence of integers. Ideally, we are looking for an algorithm that produces a sequence of numbers which are suitably unpredictable from prior terms in the sequence (this is the hardest part), while also following a particular distribution (typically uniform), meaning that every value in a range is produced in equal proportion.
In general, it's a trivial task to convert one distribution into another, regardless of whether it comes from a TRNG or a PRNG. Uniform discrete integer distributions (which is what PRNGs generate) can easily be extended or compressed to span any arbitrary interval of integers using a variety of scaling techniques that preserve uniformity, or converted into uniform floating-point distributions by randomly choosing a large integer and scaling it down into a float, etc.
Uniform floating point numbers can easily be converted to any other non-uniform distribution such as the Normal, Chi-Square, exponential, etc., using a variety of methods, such as Inverse Transform sampling, Rejection sampling, or simple algebraic relationships on distributions (e.g. the Chi-Square is simply the sum of the squares of independent normal distributions). Additionally some distributions can be suitably approximated using simple mathematical functions applied to a uniform float.
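To make two of those conversions concrete, here is a small Python sketch. The function names are made up for illustration: the first applies inverse transform sampling to get an exponential value from a uniform float, the second builds a Chi-Square value as a sum of squared standard normals.

```python
import math
import random

def exponential_from_uniform(u, rate=1.0):
    """Inverse transform sampling: invert the CDF F(x) = 1 - exp(-rate * x)."""
    return -math.log(1.0 - u) / rate

def chi_square_sample(k, rng):
    """Chi-Square with k degrees of freedom: sum of k squared standard normals."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k))

rng = random.Random(1)
x = exponential_from_uniform(rng.random(), rate=2.0)  # u is uniform in [0, 1)
c = chi_square_sample(3, rng)
```

Because `random()` returns values in [0, 1), the argument of the logarithm stays strictly positive.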
Ultimately the hardest part and the heart of most studies in the subject is in the generation of those uniformly distributed integers, that form the basis of all other distributions.
There is nothing fundamentally wrong with using clock time to get an initial seed value, for example. I'd favor using the lower three or four significant digits of the time expressed in microseconds, however, for something suitably unpredictable. Another option is to use the LFSR in a computer for the same purpose: getting a more advanced algorithm jump-started.
I am generating about 100 million random numbers to pick from 300 things. I need to set it up so that I have 10 million independent instances (different seed) that picks 10 times each. The goal is for the aggregate results to have very low discrepancy, as in, each item gets picked about the same number of times.
The problem is that with a regular PRNG, some numbers get chosen more than others (I tried an LCG and the Mersenne Twister). The difference between the most picked and least picked can be several thousand to tens of thousands. With linear congruential generators and the Mersenne Twister, I also tried picking 100 million times with 1 instance, and that also didn't yield uniform results. I'm guessing this is because the period is very long, and perhaps 100 million isn't big enough. Theoretically, if I pick enough numbers, the results should reach uniformity (they should settle at the expected value).
I switched to Sobol, a quasirandom generator, and got much better results with the 100 million from 1 instance test (the difference between most picked and least picked is about 5). But when I split them up into 10 million instances at 10 picks each, the uniformity was lost and I got results similar to the PRNG. Sobol seems very sensitive to sequence: skipping ahead randomly diminishes uniformity.
Is there a class of random generators that can maintain quasirandom-like low discrepancy even when combining 10 million independent instances? Or is that theoretically impossible? One solution I can think of now is to use 1 Sobol generator that is shared across 10 million instances, so effectively it is the same as the 100 million from 1 instance test.
Both the shuffling and proper use of Sobol should give you uniformity as desired. Shuffling needs to be done at the aggregate level (start with a global 100M sample having the desired aggregate frequencies, then shuffle it to introduce randomness, and finally split it into the 10-value instances; shuffling within each instance wouldn't help globally, as you noted).
But that's an additional level of uniformity, you might not really need that: randomness might be enough.
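A rough Python sketch of the aggregate-level shuffle described above. The function and parameter names are made up, and it builds the whole pool in memory, which is fine for a small test but would need a more compact representation (for example an array of small integers) at the full 100M scale.

```python
import random

def balanced_instances(n_items, n_instances, picks_per_instance, rng):
    """Build a globally balanced pool, shuffle it, then cut it into instances."""
    total = n_instances * picks_per_instance
    pool = [i % n_items for i in range(total)]  # item counts differ by at most 1
    rng.shuffle(pool)                           # randomness at the aggregate level
    return [pool[k:k + picks_per_instance]
            for k in range(0, total, picks_per_instance)]

instances = balanced_instances(300, 1000, 10, random.Random(0))
```

Note that the instances are no longer independent of each other: the balance is enforced globally, which is exactly the extra level of uniformity mentioned above.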
First of all I would check the check itself, because it sounds strange that with enough samples you're really getting significant deviations (look up the "chi-square test" to quantify such significance, or equivalently how many samples are "enough"). So for a first sanity check: if you're picking independent values, then simplify to 10M instances picking 10 times out of 2 categories: do you get approximately a binomial distribution? For exclusive picking it's a different distribution (hypergeometric, IIRC, but I'd need to check). Then generalize to more categories (multinomial distribution), and only then is it safe to proceed with your problem.
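As a sketch of the chi-square check mentioned above (the function name is made up), this computes the statistic for counts that are supposed to be uniform. For k categories and truly uniform picks the statistic should be near k - 1, so about 299 for 300 items, give or take a few dozen.

```python
import random

def chi_square_uniform(counts):
    """Chi-square statistic of observed counts against a uniform expectation."""
    total = sum(counts)
    expected = total / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)

# Small simulation: 300,000 independent picks from 300 items
rng = random.Random(0)
counts = [0] * 300
for _ in range(300_000):
    counts[rng.randrange(300)] += 1
stat = chi_square_uniform(counts)
```

It also helps to know what spread to expect. With 100 million independent picks over 300 items, each count has a standard deviation of roughly sqrt(100e6 / 300), which is about 577, so a gap of a few thousand between the most and least picked item is what a perfectly good random generator would produce.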
(Not strictly programming, but a question that programmers need answered.)
I have a benchmark, X, which is made up of a lot of sub-benchmarks x1..xn. It's quite a noisy test, with the results being quite variable. To benchmark accurately, I must reduce that "variability", which requires that I first measure the variability.
I can easily calculate the variability of each sub-benchmark, using perhaps standard deviation or variance. However, I'd like to get a single number which represents the overall variability.
My own attempt at the problem is:
sum = 0
foreach i in 1..n
    calculate mean across the 60 runs of x_i
    foreach j in 1..60
        sum += abs(mean[i] - x_i[j])
variability = sum / 60
Best idea: ask at the statistics Stack Exchange once it hits public beta (in a week).
In the meantime: you might actually be more interested in the extremes of variability, rather than the central tendency (mean, etc.). For many applications, I imagine that there's relatively little to be gained by incrementing the typical user experience, but much to be gained by improving the worst user experiences. Try the 95th percentile of the standard deviations and work on reducing that. Alternatively, if the typical variability is what you want to reduce, plot the standard deviations all together. If they're approximately normally distributed, I don't know of any reason why you couldn't just take the mean.
I think you're misunderstanding the standard deviation -- if you run your test 50 times and have 50 different runtimes the standard deviation will be a single number that describes how tight or loose those 50 numbers are distributed around your average. In conjunction with your average run time, the standard deviation will help you see how much spread there is in your results.
Consider the following run times:
12 15 16 18 19 21 12 14
The mean of these run times is 15.875. The sample standard deviation of this set is 3.27. There's a good explanation of what 3.27 actually means (in a normally distributed population, roughly 68% of the samples will fall within one standard deviation of the mean: e.g., between 15.875-3.27 and 15.875+3.27) but I think you're just looking for a way to quantify how 'tight' or 'spread out' the results are around your mean.
Now consider a different set of run times (say, after you compiled all your tests with -O2):
14 16 14 17 19 21 12 14
The mean of these run times is also 15.875. The sample standard deviation of this set is 3.0. (So, roughly 68% of the samples will fall within 15.875-3.0 and 15.875+3.0.) This set is more closely grouped than the first set.
And you have a single number that summarizes how compact or loose a group of numbers is around the mean.
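If you'd rather not do this by hand, here is a short Python sketch using the standard library on the two sets of run times above (the variable names are made up).

```python
import statistics

before = [12, 15, 16, 18, 19, 21, 12, 14]
after = [14, 16, 14, 17, 19, 21, 12, 14]

for label, runs in (("before", before), ("after", after)):
    mean = statistics.mean(runs)
    sd = statistics.stdev(runs)  # sample standard deviation (divides by n - 1)
    print(f"{label}: mean={mean:.3f} stdev={sd:.2f}")
```

`statistics.pstdev` gives the population version instead, if that is the one you decide to stick with.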
Caveats
The "68%" interpretation of the standard deviation assumes a normal distribution, but your run times may not be normally distributed, so please be aware that it may be a rough guideline at best. Plot your run times in a histogram to see if your data looks roughly normal or uniform or multimodal or...
Also, I'm using the sample standard deviation because these are only a sample out of the population space of benchmark runs. I'm not a professional statistician, so even this basic assumption may be wrong. Either population standard deviation or sample standard deviation will give you good enough results in your application IFF you stick to either sample or population. Don't mix the two.
I mentioned that the standard deviation in conjunction with the mean will help you understand your data: if the standard deviation is almost as large as your mean, or worse, larger, then your data is very dispersed, and perhaps your process is not very repeatable. Interpreting a 3% speedup in the face of a large standard deviation is nearly useless, as you've recognized. And the best judge (in my experience) of the magnitude of the standard deviation is the magnitude of the average.
Last note: yes, you can calculate standard deviation by hand, but it is tedious after the first ten or so. Best to use a spreadsheet or wolfram alpha or your handy high-school calculator.
From Variance:
"the variance of the total group is equal to the mean of the variances of the subgroups, plus the variance of the means of the subgroups."
I had to read that several times, then run it: 464 from this formula == 464, the standard deviation of all the data -- the single number you want.
#!/usr/bin/env python3
import sys
import numpy as np
N = 10
exec( "\n".join( sys.argv[1:] )) # override settings from the command line: this.py N=...
np.set_printoptions( 1, threshold=100, suppress=True ) # .1f
np.random.seed(1)
data = np.random.exponential( size=( N, 60 )) ** 5 # N rows, 60 cols
row_avs = np.mean( data, axis=-1 ) # av of each row
row_devs = np.std( data, axis=-1 ) # spread, stddev, of each row about its av
print( "row averages:", row_avs )
print( "row devs:", row_devs )
print( "average row dev: %.3g" % np.mean( row_devs ))
# http://en.wikipedia.org/wiki/Variance:
# variance of the total group
# = mean of the variances of the subgroups + variance of the means of the subgroups
avvar = np.mean( row_devs ** 2 )
varavs = np.var( row_avs )
print( "sqrt total variance: %.3g = sqrt( av var %.3g + var avs %.3g )" % (
    np.sqrt( avvar + varavs ), avvar, varavs ))
var_all = np.var( data ) # std^2 all N x 60 about the av of the lot
print( "sqrt variance all: %.3g" % np.sqrt( var_all ))
row averages: [ 49.6 151.4 58.1 35.7 59.7 48. 115.6 69.4 148.1 25. ]
row devs: [ 244.7 932.1 251.5 76.9 201.1 280. 513.7 295.9 798.9 159.3]
average row dev: 375
sqrt total variance: 464 = sqrt( av var 2.13e+05 + var avs 1.88e+03 )
sqrt variance all: 464
To see how group variance increases, run the example in Wikipedia Variance.
Say we have
60 men of heights 180 +- 10, exactly 30: 170 and 30: 190
60 women of heights 160 +- 7, 30: 153 and 30: 167.
The average standard dev is (10 + 7) / 2 = 8.5 .
Together though, the heights
-------|||----------|||-|||-----------------|||---
       153          167 170                 190
spread like 170 +- 13.2, much greater than 170 +- 8.5.
Why ? Because we have not only the spreads men +- 10 and women +- 7,
but also the spreads from 160 / 180 about the common mean 170.
Exercise: compute the spread 13.2 in two ways,
from the formula above, and directly.
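One possible way to work the exercise in Python (variable names are made up). The formula as quoted relies on the two groups being the same size, which they are here.

```python
import math
import statistics

men = [170] * 30 + [190] * 30
women = [153] * 30 + [167] * 30

# Way 1: mean of the subgroup variances + variance of the subgroup means
mean_of_vars = statistics.mean([statistics.pvariance(men),
                                statistics.pvariance(women)])
var_of_means = statistics.pvariance([statistics.mean(men),
                                     statistics.mean(women)])
via_formula = math.sqrt(mean_of_vars + var_of_means)

# Way 2: directly, over all 120 heights
direct = statistics.pstdev(men + women)

print(f"{via_formula:.1f} {direct:.1f}")
```

Both routes go through the same total variance, 74.5 + 100 = 174.5.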
This is a tricky problem because benchmarks can be of different natural lengths anyway. So, the first thing you need to do is to convert each of the individual sub-benchmark figures into scale-invariant values (e.g., “speed up factor” relative to some believed-good baseline) so that you at least have a chance to compare different benchmarks.
Then you need to pick a way to combine the figures. Some sort of average. There are, however, many types of average. We can reject the use of the mode and the median here; they throw away too much relevant information. But the different kinds of mean are useful because of the different ways they give weight to outliers. I used to know (but have forgotten) whether it was the geometric mean or the harmonic mean that was most useful in practice (the arithmetic mean is less good here). The geometric mean is basically an arithmetic mean in the log-domain, and a harmonic mean is similarly an arithmetic mean in the reciprocal-domain. (Spreadsheets make this trivial.)
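For anyone who prefers code to a spreadsheet, a small Python sketch of the three means, using made-up speed-up factors. It also checks the "arithmetic mean in the log-domain" and "arithmetic mean in the reciprocal-domain" descriptions directly. `statistics.geometric_mean` needs Python 3.8 or later.

```python
import math
import statistics

speedups = [0.5, 1.0, 2.0, 4.0]  # hypothetical speed-up factors vs. a baseline

arithmetic = statistics.mean(speedups)
geometric = statistics.geometric_mean(speedups)
harmonic = statistics.harmonic_mean(speedups)

# Geometric mean = exp of the arithmetic mean of the logs
geometric_via_logs = math.exp(statistics.mean(math.log(s) for s in speedups))
# Harmonic mean = reciprocal of the arithmetic mean of the reciprocals
harmonic_via_recip = 1.0 / statistics.mean(1.0 / s for s in speedups)
```

On this data the arithmetic mean is pulled up the most by the 4.0 outlier and the harmonic mean the least, which is the difference in outlier weighting mentioned above.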
Now that you have a means to combine the values for a run of the benchmark suite into something suitably informative, you've then got to do lots of runs. You might want to have the computer do that while you get on with some other task. :-) Then try combining the values in various ways. In particular, look at the variance of the individual sub-benchmarks and the variance of the combined benchmark number. Also consider doing some of the analyses in the log and reciprocal domains.
Be aware that this is a slow business that is difficult to get right and it's usually uninformative to boot. A benchmark only does performance testing of exactly what's in the benchmark, and that's mostly not how people use the code. It's probably best to consider strictly time-boxing your benchmarking work and instead focus on whether users think the software is perceived as fast enough or whether required transaction rates are actually attained in deployment (there are many non-programming ways to screw things up).
Good luck!
You are trying to solve the wrong problem: rather than measuring the variability, try to minimize it. The differences can be caused by caching.
Try running the code on a single (same) core with SetThreadAffinityMask() function on Windows.
Drop the first measurement.
Increase the thread priority.
Disable hyperthreading.
If you have many conditional jumps it can introduce visible differences between calls with different input. (this could be solved by giving exactly the same input for i-th iteration, and then comparing the measured times between these iterations).
You can find here some useful hints: http://www.agner.org/optimize/optimizing_cpp.pdf
What is an algorithm to compare multiple sets of numbers against a target set to determine which ones are the most "similar"?
One use of this algorithm would be to compare today's hourly weather forecast against historical weather recordings to find a day that had similar weather.
The similarity of two sets is a bit subjective, so the algorithm really just needs to differentiate between good matches and bad matches. We have a lot of historical data, so I would like to narrow down the number of days the users need to look through by automatically throwing out sets that aren't close and putting the "best" matches at the top of the list.
Edit:
Ideally the result of the algorithm would be comparable to results using different data sets. For example, using the mean square error as suggested by Niles produces pretty good results, but the numbers generated when comparing temperature cannot be compared to numbers generated with other data such as wind speed or precipitation, because the scale of the data is different. Some of the non-weather data is very large, so the mean square error algorithm generates numbers in the hundreds of thousands, compared to the tens or hundreds generated by using temperature.
I think the mean square error metric might work for applications such as weather comparisons. It's easy to calculate and gives numbers that do make sense.
Since you want to compare measurements over time, you can just leave missing values out of the calculation.
For values that are not time-bound or even unsorted, multi-dimensional scatter data it's a bit more difficult. Choosing a good distance metric becomes part of the art of analysing such data.
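A minimal Python sketch of the mean square error with missing values left out, as suggested above. It assumes (my assumption, not something from the question) that both series are aligned hour by hour and that a missing reading is stored as `None`.

```python
def mean_square_error(target, candidate):
    """Mean squared difference over the positions where both values exist."""
    pairs = [(t, c) for t, c in zip(target, candidate)
             if t is not None and c is not None]
    if not pairs:
        return float("inf")  # nothing to compare, so treat it as the worst match
    return sum((t - c) ** 2 for t, c in pairs) / len(pairs)

def rank_days(forecast, historical_days):
    """Sort historical days from most similar (lowest error) to least similar."""
    return sorted(historical_days, key=lambda day: mean_square_error(forecast, day))
```

Lower is better here, so the best matches end up at the front of the list.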
Use the Pearson correlation coefficient. I figured out how to calculate it in an SQL query, which can be found here: http://vanheusden.com/misc/pearson.php
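For reference, the same coefficient as a short Python sketch (this is not a translation of the linked SQL, just the textbook formula). The result always lies between -1 and 1 whatever the units, which is what makes it comparable across temperature, wind speed and so on.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)  # undefined if either series is constant
```

Keep in mind that it ignores offsets and scale entirely: a day that is uniformly 20 degrees warmer still correlates perfectly.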
In finance they use Beta to measure the correlation of 2 series of numbers. EG, Beta could answer the question "Over the last year, how much would the price of IBM go up on a day that the price of the S&P 500 index went up 5%?" It deals with the percentage of the move, so the 2 series can have different scales.
In my example, the Beta is Covariance(IBM, S&P 500) / Variance(S&P 500).
Wikipedia has pages explaining Covariance, Variance, and Beta: http://en.wikipedia.org/wiki/Beta_(finance)
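A small Python sketch of that ratio. The function name and the sample returns are made up, and population covariance and variance are used; the divisor cancels, so the sample versions give the same result.

```python
def beta(asset_returns, market_returns):
    """Beta = Covariance(asset, market) / Variance(market)."""
    n = len(asset_returns)
    mean_a = sum(asset_returns) / n
    mean_m = sum(market_returns) / n
    cov = sum((a - mean_a) * (m - mean_m)
              for a, m in zip(asset_returns, market_returns)) / n
    var = sum((m - mean_m) ** 2 for m in market_returns) / n
    return cov / var

market = [0.01, -0.02, 0.03, 0.00]   # hypothetical daily returns of the index
asset = [2 * m for m in market]      # a series that always moves twice as far
```

For these made-up series the beta comes out at 2, since the asset moves exactly twice as far as the market each day.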
Look at statistical sites. I think you are looking for correlation.
As an example, I'll assume you're measuring temp, wind, and precip. We'll call these items "features". So valid values might be:
Temp: -50 to 100F (I'm in Minnesota, USA)
Wind: 0 to 120 Miles/hr (not sure if this is realistic but bear with me)
Precip: 0 to 100
Start by normalizing your data. Temp has a range of 150 units, Wind 120 units, and Precip 100 units. Multiply your wind units by 1.25 and Precip by 1.5 to make them roughly the same "scale" as your temp. You can get fancy here and make rules that weigh one feature as more valuable than others. In this example, wind might have a huge range but usually stays in a smaller range so you want to weigh it less to prevent it from skewing your results.
Now, imagine each measurement as a point in multi-dimensional space. This example measures 3d space (temp, wind, precip). The nice thing is, if we add more features, we simply increase the dimensionality of our space but the math stays the same. Anyway, we want to find the historical points that are closest to our current point. The easiest way to do that is Euclidean distance. So measure the distance from our current point to each historical point and keep the closest matches:
for each historicalpoint
    distance = sqrt(
        pow(currentpoint.temp - historicalpoint.temp, 2) +
        pow(currentpoint.wind - historicalpoint.wind, 2) +
        pow(currentpoint.precip - historicalpoint.precip, 2))
    if distance is smaller than the largest distance in our match collection
        add historicalpoint to our match collection
        remove the match with the largest distance from our match collection
next
This is a brute-force approach. If you have the time, you could get a lot fancier. Multi-dimensional data can be represented as trees like kd-trees or r-trees. If you have a lot of data, comparing your current observation with every historical observation would be too slow. Trees speed up your search. You might want to take a look at Data Clustering and Nearest Neighbor Search.
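Here is the brute-force version as a runnable Python sketch. The dictionary layout and function names are my own, and instead of multiplying wind and precip up to the temperature scale it divides every feature by its range, which gives the same ordering of matches.

```python
import heapq
import math

# Value ranges taken from the example above
RANGES = {"temp": 150.0, "wind": 120.0, "precip": 100.0}

def distance(a, b):
    """Euclidean distance after scaling each feature by its range."""
    return math.sqrt(sum(((a[k] - b[k]) / RANGES[k]) ** 2 for k in RANGES))

def closest_days(current, history, k=3):
    """Return the k historical points nearest to the current point."""
    return heapq.nsmallest(k, history, key=lambda day: distance(current, day))
```

`heapq.nsmallest` does the "keep only the best k matches" bookkeeping from the pseudocode for you. Adding a feature only means adding an entry to `RANGES`.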
Cheers.
Talk to a statistician.
Seriously.
They do this type of thing for a living.
You write that the "similarity of two sets is a bit subjective", but it's not subjective at all-- it's a matter of determining the appropriate criteria for similarity for your problem domain.
This is one of those situations where you are much better off speaking to a professional than asking a bunch of programmers.
First of all, ask yourself if these are sets, or ordered collections.
I assume that these are ordered collections with duplicates. The most obvious algorithm is to select a tolerance within which numbers are considered the same, and count the number of slots where the numbers are the same under that measure.
I do have a solution implemented for this in my application, but I'm looking to see if there is something that is better or more "correct". For each historical day I do the following:
function calculate_score(historical_set, forecast_set)
{
    double c = correlation(historical_set, forecast_set);
    double avg_history = average(historical_set);
    double avg_forecast = average(forecast_set);
    double penalty = abs(avg_history - avg_forecast) / avg_forecast;
    return c - penalty;
}
I then sort all the results from high to low.
Since the correlation is a value from -1 to 1 that says whether the numbers fall or rise together, I then "penalize" that with the percentage difference between the averages of the two sets of numbers.
A couple of times, you've mentioned that you don't know the distribution of the data, which is of course true. I mean, tomorrow there could be a day that is 150 degree F, with 2000km/hr winds, but it seems pretty unlikely.
I would argue that you have a very good idea of the distribution, since you have a long historical record. Given that, you can put everything in terms of quantiles of the historical distribution, and do something with absolute or squared difference of the quantiles on all measures. This is another normalization method, but one that accounts for the non-linearities in the data.
Normalization in any style should make all variables comparable.
As an example, let's say that a day is windy and hot: it might have a temp quantile of .75 and a wind quantile of .75. The .76 quantile for heat might be 1 degree away, and the one for wind might be 3 km/h away.
This focus on the empirical distribution is easy to understand as well, and could be more robust than normal estimation (like Mean-square-error).
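A sketch of the quantile idea in Python. The names are made up, and the empirical quantile is simply taken as the fraction of historical values at or below the given value.

```python
import bisect

def make_quantile_fn(historical_values):
    """Return a function mapping a value to its empirical quantile in [0, 1]."""
    ordered = sorted(historical_values)
    n = len(ordered)

    def quantile(x):
        # Fraction of historical values that are <= x
        return bisect.bisect_right(ordered, x) / n

    return quantile

def quantile_distance(day_a, day_b, quantile_fns):
    """Sum of squared quantile differences over all measured features."""
    return sum((q(day_a[name]) - q(day_b[name])) ** 2
               for name, q in quantile_fns.items())
```

You would build one quantile function per feature from the historical record (temperature, wind, ...) and pass them in as a dict keyed by feature name. Every feature then contributes on the same 0 to 1 scale.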
Are the two data sets ordered, or not?
If ordered, are the indices the same? equally spaced?
If the indices are common (temperatures measured on the same days but at different locations, for example), you can regress the first data set against the second, and then test that the slope is equal to 1 and that the intercept is 0.
http://stattrek.com/AP-Statistics-4/Test-Slope.aspx?Tutorial=AP
Otherwise, you can do two regressions, of the y-values against their indices. http://en.wikipedia.org/wiki/Correlation. You'd still want to compare slopes and intercepts.
====
If unordered, I think you want to look at the cumulative distribution functions
http://en.wikipedia.org/wiki/Cumulative_distribution_function
One relevant test is Kolmogorov-Smirnov:
http://en.wikipedia.org/wiki/Kolmogorov-Smirnov_test
You could also look at
Student's t-test,
http://en.wikipedia.org/wiki/Student%27s_t-test
or a Wilcoxon signed-rank test http://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test
to test equality of means between the two samples.
And you could test for equality of variances with a Levene test http://www.itl.nist.gov/div898/handbook/eda/section3/eda35a.htm
Note: it is possible for dissimilar sets of data to have the same mean and variance -- depending on how rigorous you want to be (and how much data you have), you could consider testing for equality of higher moments, as well.
Maybe you can see your set of numbers as a vector (each number of the set being a component of the vector).
Then you can simply use the dot product to compute the similarity of 2 given vectors (i.e. sets of numbers).
You might need to normalize your vectors.
More : Cosine similarity
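A minimal Python sketch of the normalized dot product, i.e. cosine similarity (the function name is made up). It assumes both vectors have the same length and neither is all zeros.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

A result of 1 means the two vectors point in the same direction, 0 means they are orthogonal, and -1 means they point in opposite directions. The magnitude of the values is ignored.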