How would you mathematically model the distribution of repeated real life performance measurements - "Real life" meaning you are not just looping over the code in question, but it is just a short snippet within a large application running in a typical user scenario?
My experience shows that you usually have a peak around the average execution time that can be modeled adequately with a Gaussian distribution. In addition, there's a "long tail" containing outliers - often with a multiple of the average time. (The behavior is understandable considering the factors contributing to first execution penalty).
My goal is to model aggregate values that reasonably reflect this, and can be calculated from aggregate values (like for the Gaussian, calculate mu and sigma from N, sum of values and sum of squares). In other terms, number of repetitions is unlimited, but memory and calculation requirements should be minimized.
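As a minimal sketch of the aggregate idea for the Gaussian case (the class name `RunningStats` and the sample values are made up for illustration), mu and sigma can be recovered from N, the sum and the sum of squares alone:

```python
import math


class RunningStats:
    """Keeps only N, the sum and the sum of squares, so memory use is constant."""

    def __init__(self):
        self.n = 0
        self.total = 0.0
        self.total_sq = 0.0

    def add(self, x):
        self.n += 1
        self.total += x
        self.total_sq += x * x

    def mean(self):
        return self.total / self.n

    def stddev(self):
        # Population standard deviation: sqrt(E[x^2] - E[x]^2).
        # Clamp at zero because rounding can make the difference slightly negative.
        variance = self.total_sq / self.n - self.mean() ** 2
        return math.sqrt(max(variance, 0.0))


running = RunningStats()
for t in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    running.add(t)
print(running.mean(), running.stddev())
```

The sum-of-squares formula can lose precision when the mean is large relative to the spread; an incremental update such as Welford's method is a common alternative if that becomes a problem.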
A normal Gaussian distribution can't model the long tail appropriately and will have the average biased strongly even by a very small percentage of outliers.
I am looking for ideas, especially if this has been attempted/analysed before. I've checked various distribution models, and I think I could work out something, but my statistics is rusty and I might end up with an overblown solution. Oh, a complete shrink-wrapped solution would be fine, too ;)
Other aspects / ideas: Sometimes you get "two humps" distributions, which would be acceptable in my scenario with a single mu/sigma covering both, but ideally would be identified separately.
Extrapolating this, another approach would be a "floating probability density calculation" that uses only a limited buffer and adjusts automatically to the range (due to the long tail, bins may not be spaced evenly) - haven't found anything, but with some assumptions about the distribution it should be possible in principle.
Why (since it was asked) -
For a complex process we need to make guarantees such as "only 0.1% of runs exceed a limit of 3 seconds, and the average processing time is 2.8 seconds". The performance of an isolated piece of code can be very different from a normal run-time environment involving varying levels of disk and network access, background services, scheduled events that occur within a day, etc.
This can be solved trivially by accumulating all data. However, to accumulate this data in production, the data produced needs to be limited. For analysis of isolated pieces of code, a gaussian deviation plus first run penalty is ok. That doesn't work anymore for the distributions found above.
[edit] I've already got very good answers (and finally - maybe - some time to work on this). I'm starting a bounty to look for more input / ideas.
Often when you have a random value that can only be positive, a log-normal distribution is a good way to model it. That is, you take the log of each measurement, and assume that is normally distributed.
If you want, you can consider that to have multiple humps, i.e. to be a mixture of two normals with different means. Those are a bit tricky to estimate the parameters of, because you may have to estimate, for each measurement, its probability of belonging to each hump. That may be more than you want to bother with.
Log-normal distributions are very convenient and well-behaved. For example, you don't deal with its average, you deal with its geometric mean, which is the same as its median.
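A rough sketch of how this could look with constant memory, assuming every measurement is strictly positive (the class name `LogNormalStats` and the sample values are invented for illustration): accumulate the logs instead of the raw values, then read off the geometric mean and tail quantiles from the fitted normal.

```python
import math
from statistics import NormalDist


class LogNormalStats:
    """Accumulates N, sum(log x) and sum(log x ** 2) for positive measurements."""

    def __init__(self):
        self.n = 0
        self.sum_log = 0.0
        self.sum_log_sq = 0.0

    def add(self, x):
        lx = math.log(x)  # raises ValueError if x <= 0
        self.n += 1
        self.sum_log += lx
        self.sum_log_sq += lx * lx

    def mu(self):
        return self.sum_log / self.n

    def sigma(self):
        variance = self.sum_log_sq / self.n - self.mu() ** 2
        return math.sqrt(max(variance, 0.0))

    def geometric_mean(self):
        # Equals the median of the fitted log-normal distribution.
        return math.exp(self.mu())

    def quantile(self, p):
        # Quantile of the fitted log-normal; needs sigma > 0.
        return math.exp(NormalDist(self.mu(), self.sigma()).inv_cdf(p))


log_stats = LogNormalStats()
for t in [1.0, 10.0, 100.0]:
    log_stats.add(t)
print(log_stats.geometric_mean(), log_stats.quantile(0.999))
```

The 0.999 quantile is only as trustworthy as the log-normal assumption itself, so it is worth checking the fit before quoting it as a guarantee.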
BTW, in pharmacometric modeling, log-normal distributions are ubiquitous, modeling such things as blood volume, absorption and elimination rates, body mass, etc.
ADDED: If you want what you call a floating distribution, that's called an empirical or non-parametric distribution. To model that, typically you save the measurements in a sorted array. Then it's easy to pick off the percentiles. For example the median is the "middle number". If you have too many measurements to save, you can go to some kind of binning after you have enough measurements to get the general shape.
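A small sketch of the sorted-array approach, using the nearest-rank definition of a percentile (one of several common definitions; the helper names and the sample values are made up):

```python
import bisect
import math

samples = []  # kept sorted at all times


def record(x):
    bisect.insort(samples, x)


def percentile(p):
    """Nearest-rank percentile for an integer p between 0 and 100."""
    rank = max(1, math.ceil(p * len(samples) / 100))
    return samples[rank - 1]


for t in [35, 15, 50, 20, 40]:
    record(t)
print(percentile(50), percentile(100))
```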
ADDED: There's an easy way to tell if a distribution is normal (or log-normal). Take the logs of the measurements and put them in a sorted array. Then generate a QQ plot (quantile-quantile). To do that, generate as many normal random numbers as you have samples, and sort them. Then just plot the points, where X is the normal distribution point, and Y is the log-sample point. If the measurements are log-normal, the points should fall close to a straight line. (A really simple way to generate a normal random number is to just add together 12 uniform random numbers in the range +/- 0.5.)
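The QQ recipe can be sketched without a plotting library by measuring how straight the point cloud is. The correlation coefficient used here is an addition for illustration, not part of the recipe above, and the timings are synthetic:

```python
import math
import random


def approx_normal(rng):
    # Sum of 12 uniform values in [-0.5, 0.5): mean 0, variance 12 * (1/12) = 1.
    return sum(rng.random() - 0.5 for _ in range(12))


def pearson(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def qq_straightness(measurements, rng):
    """Correlation of the QQ points; values near 1 suggest a straight line."""
    ys = sorted(math.log(m) for m in measurements)
    xs = sorted(approx_normal(rng) for _ in measurements)
    return pearson(xs, ys)


rng = random.Random(42)
timings = [rng.lognormvariate(0.0, 0.5) for _ in range(500)]  # synthetic data
r = qq_straightness(timings, rng)
print(r)
```

Plotting `xs` against `ys` is still the better diagnostic, because the way the points bend away from the line shows where the fit fails (usually in the tail).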
The problem you describe is called "distribution fitting" and is not specific to performance measurements; it is the generic problem of fitting a suitable distribution to any gathered or measured data sample.
The standard process is something like this:
Guess the best distribution.
Run hypothesis tests to check how well it describes the gathered data.
Repeat the previous steps if the fit is not good enough.
You can find an interesting article describing how this can be done with the open-source R software system here. The function fitdistr may be especially useful to you.
In addition to the answers already given, consider empirical distributions. I have had success using empirical distributions for performance analysis of several distributed systems. The idea is very straightforward: build a histogram of the performance measurements, discretized to a given accuracy. Once you have the histogram, you can do several useful things:
calculate the probability of any given value (you are bound by accuracy only);
build PDF and CDF functions for the performance measurements;
generate a sequence of response times according to the distribution. This one is very useful for performance modeling.
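A minimal sketch of such a histogram-based empirical distribution, covering the three uses listed above (the class name `EmpiricalDistribution`, the 5 ms accuracy and the sample values are assumptions for illustration):

```python
import random
from collections import Counter


class EmpiricalDistribution:
    """Histogram of measurements discretized to a fixed accuracy."""

    def __init__(self, accuracy):
        self.accuracy = accuracy
        self.counts = Counter()
        self.n = 0

    def _bin(self, x):
        return round(x / self.accuracy)

    def add(self, x):
        self.counts[self._bin(x)] += 1
        self.n += 1

    def probability(self, x):
        # Fraction of measurements that fell into the same bin as x.
        return self.counts[self._bin(x)] / self.n

    def cdf(self, x):
        b = self._bin(x)
        return sum(c for k, c in self.counts.items() if k <= b) / self.n

    def sample(self, rng=random):
        # Draw a bin with probability proportional to its count.
        bins = list(self.counts)
        weights = [self.counts[b] for b in bins]
        return rng.choices(bins, weights=weights)[0] * self.accuracy


dist = EmpiricalDistribution(accuracy=5)  # 5 ms bins
for t in [10, 12, 11, 10, 30]:
    dist.add(t)
print(dist.probability(10), dist.cdf(20), dist.sample())
```

Memory grows with the number of distinct bins rather than the number of measurements, so the accuracy setting is the main tuning knob.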
Try the gamma distribution: http://en.wikipedia.org/wiki/Gamma_distribution
From Wikipedia:
The gamma distribution is frequently a probability model for waiting times; for instance, in life testing, the waiting time until death is a random variable that is frequently modeled with a gamma distribution.
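If a gamma model is tried, its two parameters can be estimated from the same aggregates as a Gaussian. The sketch below uses the method of moments (shape = mean^2 / variance, scale = variance / mean), which is simpler but generally less efficient than a maximum-likelihood fit; the function name and input values are made up:

```python
def gamma_from_aggregates(n, total, total_sq):
    """Method-of-moments gamma fit from N, sum and sum of squares."""
    mean = total / n
    variance = total_sq / n - mean * mean
    shape = mean * mean / variance
    scale = variance / mean
    return shape, scale


# Aggregates of the made-up measurements 2, 4, 4, 6.
shape, scale = gamma_from_aggregates(n=4, total=16.0, total_sq=72.0)
print(shape, scale)
```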
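The standard model for randomized arrival times in performance modelling is the Poisson process: the times between arrivals follow an exponential distribution, and the number of arrivals in a fixed interval follows a Poisson distribution. (The sum of several independent exponential waiting times with the same rate follows an Erlang distribution, a special case of the gamma distribution.)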
Not exactly answering your question, but relevant still: Mor Harchol-Balter did a very nice analysis of the size of jobs submitted to a scheduler, The effect of heavy-tailed job size distributions on computer systems design (1999). She found that the size of jobs submitted to her distributed task assignment system followed a power-law distribution, which meant that certain pieces of conventional wisdom she had assumed in the construction of her task assignment system, most importantly that the jobs should be well load balanced, had awful consequences for submitters of jobs. She's done good follow-up work on this issue.
The broader point is, you need to ask such questions as:
What happens if reasonable-seeming assumptions about the distribution of performance, such as that they take a normal distribution, break down?
Are the data sets I'm looking at really representative of the problem I'm trying to solve?
Related
This is a semi-broad question, but it's one that I feel on some level is answerable or at least approachable.
I've spent the last month or so making a fairly extensive simulation. In order to protect the interests of my employer, I won't state specifically what it does... but an analogy of what it does may be explained by... a high school dance.
A girl or boy enters the dance floor, and based on the selection of free dance partners, an optimal choice is made. After a period of time, two dancers finish dancing and are now free for a new partnership.
I've been making partner selection algorithms designed to maximize average match outcome while not sacrificing wait time for a partner too much.
I want a way to gauge / compare versions of my algorithms in order to make a selection of the optimal algorithm for any situation. This is difficult however since the inputs of my simulation are extremely large matrices of input parameters (2-5 per dancer), and the simulation takes several minutes to run (a fact that makes it difficult to test a large number of simulation inputs). I have a few output metrics, but linking them to the large number of inputs is extremely hard. I'm also interested in finding which algorithms completely fail under certain input conditions...
Any pro tips / online resources which might help me in defining input constraints / output variables which might give clarity on an optimal algorithm?
I might not understand what you exactly want. But here is my suggestion. Let me know if my solution is inaccurate/irrelevant and I will edit/delete accordingly.
Assume you have a certain metric (say compatibility of the pairs or waiting time). If you just have the average or total for this metric over all the users, it is kind of useless. Instead you might want to find the distribution of this metric over all users. If nothing else, you should always keep track of the variance. Once you have the distribution, you can calculate the probability that a particular algorithm A is better than B for a certain metric.
If you do not have the distribution of the metric within an experiment, you can always run multiple experiments, and the number of experiments you need to run depends on the variance of the metric and difference between two algorithms.
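One rough way to put a number on "A is better than B", assuming each algorithm has a list of independent per-run metric values and that higher is better (the function name and the values are invented), is the fraction of (A, B) pairs in which A wins:

```python
def prob_a_better(a_values, b_values):
    """Estimate P(A > B) from samples, counting ties as half a win."""
    wins = sum(1 for x in a_values for y in b_values if x > y)
    ties = sum(1 for x in a_values for y in b_values if x == y)
    return (wins + 0.5 * ties) / (len(a_values) * len(b_values))


p = prob_a_better([3, 4, 5], [1, 2, 3])
print(p)
```

A value near 0.5 means the two algorithms are hard to tell apart on that metric with the runs collected so far.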
I am looking for a method to find the best parameters for a simulation. It's about break-shots in billiards / pool. A shot is defined by 7 parameters, I can simulate the shot and then rate the outcome and I would like to compute the best parameters.
I have found the following link here:
Multiple parameter optimization with lots of local minima
suggesting 4 kinds of algorithms. In the pool simulator I am using, each shot is altered by a small random value every time it is simulated. If I simulate the same shot twice, the outcome will be different. So I am looking for an algorithm like the ones in the link above, only with the addition of a stochastic element, optimizing for the 7 parameters that will on average yield the best outcome, i.e. a break shot that will most likely be a success. My initial idea was simulating the shot 100 or 1000 times and just taking the average as the rating for the algorithms above, but I still feel like there is a better way. Does anyone have an idea?
The 7 parameters are continuous but within different ranges (one from 0 to 10, another from 0.0 to 0.028575 and so on).
Thank you
At least for some of the algorithms, simulating the same shot repeatedly might not be necessary. As long as your alternatives have some form of momentum, like in the swarm simulation approach, you can let that be affected by the outcome of each individual simulation. In that case, a single unlucky simulation would slow the movement in parameter space only slightly, whereas a serious loss of quality should be enough to stop and reverse the movement. Those algorithms which don't use momentum might be tweaked to have momentum. If not, then repeated simulation seems the best approach, unless you can get your hands on the internals of the simulator and rate the shot as a whole without having to simulate it over and over again.
You can use the algorithms you mentioned in your non-deterministic scenario with independent stochastic runs. Your idea of repeated simulations is good; you can read more about how many repeats you might have to consider for your simulations (unfortunately, there is no trivial answer). If you are not so much into maths, and the runs go fast, do 1,000 repeats, then 10,000 repeats, and see if the results differ largely. If yes, you have to collect more samples; if not, you are probably on the safe side (the law of large numbers states that the sample average converges as the number of repeats grows).
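A sketch of how the repeat count could be judged, using the standard error of the mean as the measure of remaining noise. The `simulate_shot` function is a hypothetical stand-in for the real pool simulator, and its noise level is invented:

```python
import math
import random
import statistics


def simulate_shot(params, rng):
    # Hypothetical stand-in for the real simulator: a smooth score plus noise.
    return -sum((p - 0.5) ** 2 for p in params) + rng.gauss(0.0, 0.1)


def rate(params, repeats, rng):
    """Average score over repeated runs, with its standard error."""
    scores = [simulate_shot(params, rng) for _ in range(repeats)]
    mean = statistics.fmean(scores)
    std_error = statistics.stdev(scores) / math.sqrt(repeats)
    return mean, std_error


rng = random.Random(0)
shot = [0.5] * 7  # seven parameters, as in the question
mean_small, err_small = rate(shot, 100, rng)
mean_large, err_large = rate(shot, 10_000, rng)
print(mean_small, err_small)
print(mean_large, err_large)
```

The standard error shrinks roughly with the square root of the number of repeats, so once it is well below the score difference between two candidate shots, further repeats add little.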
Further, do not just consider the average! Make sure to look into the standard deviation for each algorithm's results; you might want to use box plots to compare their quartiles. If you rely on the average only, you could pick an algorithm that produces very varying results, sometimes excellent, sometimes terrible in performance.
I don't know what language you are using, but if you use Java, I am maintaining a tool that could simplify your "monte carlo" style experiments.
When measuring application performance (response time, for example) it's very easy to come across averages (means). ab, httperf and a bunch of other utilities report the mean and standard deviation. But from a theoretical point of view that doesn't make a lot of sense to me. Here is why.
The mean is good at describing a symmetrically distributed population, because for a symmetric distribution the mean coincides with the median and, if the distribution is unimodal, with the mode. But response times are not distributed symmetrically. They are more like exponential. In this case average tells us nothing.
It's more convenient to work with percentile values, which tell us what response time we can expect for what percentage of responses.
Am I missing something, or is the mean popular just because it's very simple to calculate?
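To make the contrast concrete, here is a small sketch with made-up response times, using the nearest-rank percentile definition (the `percentile` helper is written for this example):

```python
import math
import statistics


def percentile(sorted_values, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    rank = math.ceil(p * len(sorted_values) / 100)
    return sorted_values[rank - 1]


# Made-up response times in ms: 95 fast responses and 5 slow ones.
times = sorted([100] * 95 + [2000] * 5)

mean = statistics.fmean(times)
median = statistics.median(times)
p95 = percentile(times, 95)
p99 = percentile(times, 99)
print(mean, median, p95, p99)
```

Here the mean (195 ms) matches no response that was actually observed, while the percentiles show that 95% of responses took 100 ms and the slowest few took 2000 ms.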
All kinds of tools get their features not necessarily from what makes sense, but from users' expectations.
You're absolutely right that the distributions are non-negative and heavily skewed, and that percentiles would be more informative.
Alternatively, a distribution more like lognormal or chi-square would be a little better.
Yes, you are missing something.
The whole point of descriptive statistics is to present a few numbers to describe (or represent or model or ...) a large number of numbers. They aid the comprehension of large datasets, the extraction of information from data, the approximate comparison of datasets whose exact comparison is large and bewildering to the limitations of the human mind.
But no single descriptive statistic is always fit for all purposes, and no one is dictating to you that you must or should or ought to use the mean. If it doesn't suit your purposes, use something else.
As it happens you are quite wrong to write "They are more like exponential. In this case average tells us nothing." For an exponential distribution with rate parameter lambda the mean is simply 1/lambda, so the mean tells you everything about an exponential distribution.
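To illustrate: if response times really were exponential, every percentile would follow from the mean alone, since the p-quantile is -mean * ln(1 - p). A short sketch (the function name and the 200 ms mean are made up):

```python
import math


def exponential_quantile(p, mean):
    """Quantile of an exponential distribution with the given mean (1/lambda)."""
    return -mean * math.log(1.0 - p)


mean_ms = 200.0
median_ms = exponential_quantile(0.5, mean_ms)   # mean * ln(2)
p99_ms = exponential_quantile(0.99, mean_ms)     # mean * ln(100)
print(median_ms, p99_ms)
```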
I'm not an expert in statistics, but I believe average values are used so much because they are the values that help to measure the scalability of a system.
You need to consider your average values first to know how your system needs to behave under certain workloads, and those need to be predictable; you usually are not very interested in outliers, at least not at first.
Of course you need to look into your minimum and peak values to know the moment your system is going to hit a bottleneck, but the average values show you, as I said, the expected and predictable behavior.
What algorithms can be used to characterize an expected clearly bimodal distribution, say a mixture of 2 normal distributions with well separated peaks, in an array of samples? Something that spits out 2 means, 2 standard deviations, and some sort of robustness estimate, would be the desired result.
I am interested in an algorithm that can be implemented in any programming language (for an embedded controller), not an existing C or Python library or stat package.
Would it be easier if I knew that the two modal means differ by a ratio of approximately 3:1 +- 50%, the standard deviations are "small" relative to the peak separation, but the pair of peaks could be anywhere in a 100:1 range?
There are two separate possibilities here. One is that you have a single distribution that is bimodal. The other is that you are observing data from two different distributions. The usual way to estimate the latter is with something called, unsurprisingly, a mixture model.
Your approaches for estimating are to use a maximum likelihood approach or use Markov chain Monte Carlo methods if you want to take a Bayesian view of the problem. If you state your assumptions in a bit more detail I'd be willing to help try and figure out what objective function you'd want to try and maximize.
These types of models can be computationally intensive, so I am not sure you'd want to try and do the whole statistical approach in an embedded controller. A hack might be a better fit. If the peaks are in fact well separated, I think it would be easier to try and identify the two peaks, split your data between them, and estimate the mean and standard deviation for each distribution independently.
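One possible form of that hack, sketched in Python for readability (it only needs sums, logs and comparisons, so it should port to other languages). It assumes positive samples and two well-separated peaks, and it splits in log space because the peaks may sit anywhere in a wide range; the function name and sample values are made up:

```python
import math
import statistics


def split_bimodal(samples, iterations=20):
    """Split samples into a low and a high group; return (mean, stdev, count) for each."""
    logs = [math.log(x) for x in samples]
    low_centre, high_centre = min(logs), max(logs)
    for _ in range(iterations):
        # Assign each point to the nearer centre, then move the centres.
        threshold = (low_centre + high_centre) / 2
        low = [v for v in logs if v <= threshold]
        high = [v for v in logs if v > threshold]
        if not high:
            raise ValueError("samples do not separate into two groups")
        low_centre, high_centre = statistics.fmean(low), statistics.fmean(high)
    cut = math.exp((low_centre + high_centre) / 2)
    group_a = [x for x in samples if x <= cut]
    group_b = [x for x in samples if x > cut]
    return tuple(
        (statistics.fmean(g), statistics.pstdev(g), len(g))
        for g in (group_a, group_b)
    )


low_peak, high_peak = split_bimodal([9, 30, 10, 31, 11, 29])
print(low_peak, high_peak)
```

The group sizes and the gap between the two means relative to the standard deviations give a crude sanity check on whether the split was meaningful.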
Background
Here is the problem:
A black box outputs a new number each day.
Those numbers have been recorded for a period of time.
Detect when a new number from the black box falls outside the pattern of numbers established over the time period.
The numbers are integers, and the time period is a year.
Question
What algorithm will identify a pattern in the numbers?
The pattern might be simple, like always ascending or always descending, or the numbers might fall within a narrow range, and so forth.
Ideas
I have some ideas, but am uncertain as to the best approach, or what solutions already exist:
Machine learning algorithms?
Neural network?
Classify normal and abnormal numbers?
Statistical analysis?
Cluster your data.
If you don't know how many modes your data will have, use something like a Gaussian Mixture Model (GMM) along with a scoring function (e.g., Bayesian Information Criterion (BIC)) so you can automatically detect the likely number of clusters in your data. I recommend this instead of k-means if you have no idea what value k is likely to be. Once you've constructed a GMM for your data for the past year, given a new datapoint x, you can calculate the probability that it was generated by any one of the clusters (modeled by a Gaussian in the GMM). If your new data point has low probability of being generated by any one of your clusters, it is very likely a true outlier.
If this sounds a little too involved, you will be happy to know that the entire GMM + BIC procedure for automatic cluster identification has been implemented for you in the excellent MCLUST package for R. I have used it several times to great success for such problems.
Not only will it allow you to identify outliers, you will have the ability to put a p-value on a point being an outlier if you need this capability (or want it) at some point.
You could try line fitting prediction using linear regression and see how it goes, it would be fairly easy to implement in your language of choice.
After you have fitted a line to your data, you could calculate the standard deviation of the data points around the line.
If the novel point is on the trend line +- the standard deviation, it should not be regarded as an abnormality.
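A minimal sketch of this trend-line check, with the day index as x and ordinary least squares written out by hand. The tolerance factor `k` is an assumption added here: `k=1` matches the rule above, and a larger value flags fewer points. The history values are made up:

```python
import math


def fit_line(ys):
    """Least-squares line through (0, ys[0]), (1, ys[1]), ...; needs at least 3 points."""
    n = len(ys)
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in enumerate(ys)]
    # Residual standard deviation with n - 2 degrees of freedom.
    sigma = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    return slope, intercept, sigma


def is_abnormal(history, new_value, k=1.0):
    slope, intercept, sigma = fit_line(history)
    predicted = intercept + slope * len(history)
    return abs(new_value - predicted) > k * sigma


history = [10, 13, 14, 17, 18, 21, 22, 25]
print(is_abnormal(history, 27), is_abnormal(history, 40))
```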
PCA is another technique that comes to mind when dealing with this type of data.
You could also look into unsupervised learning. This is a machine learning technique that can be used to detect differences in larger data sets.
Sounds like a fun problem! Good luck
There is little magic in all the techniques you mention. I believe you should first try to narrow down the typical abnormalities you may encounter; it helps keep things simple.
Then, you may want to compute derived quantities relevant to those features. For instance: "I want to detect numbers abruptly changing direction" => compute u_{n+1} - u_n, and expect it to have constant sign, or fall in some range. You may want to keep this flexible, and allow your code design to be extensible (the Strategy pattern may be worth looking at if you do OOP).
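As a small sketch of the difference-based check, assuming "normal" simply means that the newest step u_{n+1} - u_n stays within the range of steps seen so far (the function names and values are made up):

```python
def differences(values):
    """First differences u[n+1] - u[n]."""
    return [b - a for a, b in zip(values, values[1:])]


def breaks_pattern(history, new_value):
    """True if the newest step falls outside the range of earlier steps."""
    steps = differences(history)
    newest_step = new_value - history[-1]
    return not (min(steps) <= newest_step <= max(steps))


history = [1, 3, 4, 7, 9]  # steps: 2, 1, 3, 2
print(breaks_pattern(history, 11), breaks_pattern(history, 5))
```

Other derived quantities (second differences, ratios, day-of-week offsets) could be plugged in the same way, each with its own acceptance rule.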
Then, when you have some derived quantities of interest, you do statistical analysis on them. For instance, for a derived quantity A, you assume it should have some distribution P(a, b) (uniform([a, b]), or Beta(a, b), possibly more complex), you put a priori laws on a, b and you adjust them based on successive information. Then, the posterior likelihood of the info provided by the last point added should give you some insight about it being normal or not. Relative entropy between posterior and prior law at each step is a good thing to monitor too. Consult a book on Bayesian methods for more info.
I see little point in complex traditional machine learning stuff (perceptron layers or SVM to cite only them) if you want to detect outliers. These methods work great when classifying data which is known to be reasonably clean.