Is the Average shown in milliseconds? How is the Error % calculated, and on what basis are the total Average and Error % values calculated?
As per the Load Reports guide:
#Samples is the number of samples with the same label.
Average is the average time of a set of results.
Median is a number which divides the samples into two equal halves. Half of the samples are smaller than the median, and half are larger. [Some samples may equal the median.] This is a standard statistical measure. The Median is the same as the 50th Percentile.
90% Line (90th Percentile) meaning 90% of the samples took no more than this time.
Median is the time in the middle of a set of results. 50% of the samples took no more than this time; the remainder took at least as long.
Min is the shortest time for the samples with the same label.
Max is the longest time for the samples with the same label.
Error % is the percentage of requests with errors.
Throughput is measured in requests per second/minute/hour. The time unit is chosen so that the displayed rate is at least 1.0. When the throughput is saved to a CSV file, it is expressed in requests/second, i.e. 30.0 requests/minute is saved as 0.5.
Kb/sec - throughput measured in Kilobytes per second. Time is in milliseconds.
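To make the columns above concrete, here is a minimal Python sketch of how each value could be derived from raw samples. This is not JMeter's actual code; the sample data, the simplified percentile lookup, and the ten-second test duration are all assumptions for illustration.

samples = [(120, True), (250, True), (90, False), (300, True), (180, True)]  # (elapsed ms, success) - made-up data

elapsed = sorted(ms for ms, _ in samples)
count = len(samples)                                   # #Samples
average = sum(elapsed) / count                         # Average (ms)
median = elapsed[count // 2]                           # Median / 50th percentile (simplified)
line_90 = elapsed[min(count - 1, int(count * 0.9))]    # 90% Line (simplified)
minimum, maximum = elapsed[0], elapsed[-1]             # Min / Max (ms)
error_pct = 100.0 * sum(1 for _, ok in samples if not ok) / count  # Error %

test_duration_s = 10.0                                 # assumed wall-clock test duration
throughput = count / test_duration_s                   # Throughput in requests/second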
Related
There are multiple methods for finding the average of a set of numbers.
First, the sum / count quotient: add all the values and divide the sum by the number of values.
Second, the moving average. The function I found in another Stack answer is:
New average = old average * (n - 1) / n + new value / n
This works as long as each value is added to the average one value at a time.
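As a quick illustration, here is how that incremental update could look in Python (the values are arbitrary):

# Incremental update from the formula above: the average is refreshed one
# value at a time without keeping the full sum.
def update_average(old_average, new_value, n):
    # n counts the values seen so far, including new_value
    return old_average * (n - 1) / n + new_value / n

values = [3.0, 5.0, 7.0, 9.0]
avg, n = 0.0, 0
for v in values:
    n += 1
    avg = update_average(avg, v, n)
print(avg)  # 6.0, same as sum(values) / len(values)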
My concern is that the second method is more computationally expensive for my processor to execute, but I also fear that the first method will lose resolution on data sets that produce large sums. On a 32-bit system, for example, the resolution of a stored float value is automatically reduced as the magnitude of the number grows.
Does a moving average preserve resolution?
"moving average" does not calculate average over large interval.
It smooth data in such way that newer measurements give larger impact, and older measurements weight becomes smaller and smaller.
If you are worried about large sums and want to preserve all possible data "bits", consider special methods like the Kahan summation algorithm.
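For reference, a minimal sketch of Kahan summation in Python (the idea, not any particular library's implementation): a small compensation term captures the low-order bits that would otherwise be lost when adding small values to a large running sum.

def kahan_sum(values):
    total = 0.0
    compensation = 0.0                   # running compensation for lost low-order bits
    for v in values:
        y = v - compensation             # apply the correction to the next addend
        t = total + y                    # low-order bits of y may be lost here
        compensation = (t - total) - y   # recover what was lost
        total = t
    return total

# The average is then simply kahan_sum(values) / len(values).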
If I am given the total number of occurrences of an event over the last hour, and I can get this data at arbitrary times (but at least once an hour), how can I work out the total number of occurrences over a 24-hour period?
Obviously, you can't. For example, if the first two observations overlap, it would be impossible to determine the number of occurrences during the overlap. If there is a time gap between the first two observations, there is no way to determine what happened during the gap. You could try to set up a system of equations, but the resulting system will be underdetermined (though it could give you both a min and a max, which might be relevant).
Why not adopt a statistical approach? Let X = the number of occurrences over a 1-hour period. This is a random variable. Estimate its expected value by sampling it at randomly chosen times and multiply your estimate by 24.
As part of the performance tuning and load tests we usually do, I am forced to believe that we need to look at 90th percentiles. As per my understanding, 90 times out of a hundred, people got a response that is equal to or better than the 90th-percentile number. However, my current clients always look at the average number. What is the impact of only looking at the average? Most of the time, I see that between two tests, if the average is lower in test A, then the 90th percentile is also lower in test A.
So should we base the SLA on the average or on the 90th percentile?
I agree that this is not a pure programming question. But in my humble opinion, program performance and statistics are closely related anyway. That's why I think this question deserves an answer.
The two are different in nature. We have the average: the sum of all observations divided by the number of observations. We also have the median, or 50th percentile: half of the observations are above it, half are below.
There is a very visible difference between the two if your observations do not match the bell curve, e.g. if you have positive outliers but no negative outliers.
Let's do a few number examples:
observations 2 4 6 8 - average and median are both 5
observations 1 1 10 - average is 4, median is 1.
The 50th percentile could be argued to be any number between 1 and 10 here: for any of those numbers, two observations are below and one is above.
observations 1 4 1000 - average is 335, median is 4, the 50th percentile is also 4.
As you can see, the distribution of the numbers matters a lot.
Only if you have a symmetrical distribution (like a Gaussian bell curve) does the average equal the 50th percentile.
But you asked about the 90th percentile.
Essentially, nothing changes - the distribution, the number of outliers and the most often observed values affect your percentile.
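A small, made-up example in Python shows how a single outlier pulls the average far away while the median and the 90th percentile barely move:

import statistics

times = [1, 1, 2, 2, 3, 3, 4, 4, 5, 1000]             # one large outlier
mean = statistics.mean(times)                          # 102.5 - dominated by the outlier
median = statistics.median(times)                      # 3.0
p90 = sorted(times)[int(0.9 * (len(times) - 1))]       # simple 90th-percentile estimate: 5
print(mean, median, p90)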
I suggest picking up a good book on statistics if you need to know more.
I am working on an algorithm to compute multiple mean max values of an array. The array contains time/value pairs, such as HR data recorded on a Garmin device over a 5-hour run. The data arrives roughly once a second for an unknown period, but has no guaranteed frequency. An example would be a 10-minute mean maximum, which is the maximum average over any 10-minute span. Assume "mean" is just the average value for this discussion. The desired mean maximal value's duration is arbitrary: 1 min, 5 min, 60 min. And I'm likely going to need many of them, at least 30, but ideally any on demand if it isn't a lengthy request.
Right now I have a straightforward algorithm to compute one value:
1) Start at the beginning of the array and "walk" forward until the subset spans the desired duration, or is one element past it. Stop if the end of the array is reached.
2) Find the average of the values in that subset. Store it as the max average if it is larger than the current max.
3) Shift a single value off the left side of the array.
4) Repeat from step 1 until the end of the array is reached.
It basically computes every possible consecutive average and returns the max.
It does this for each duration. And it computes a real average from scratch each time instead of sliding it somehow by removing the left point and adding the right, like one could do for a simple-moving-average series. It takes about 3-10 seconds per mean max value depending on the total array size.
I'm wondering how to optimize this. For instance, the series of all mean max values will be an exponential-looking curve with the 1-second value highest, decreasing until it reaches the average of the entire array. Can this curve, and all its values, be interpolated from a certain number of points? Or is there some other optimization of the heavy computation above that still maintains accuracy?
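For reference, a literal (unoptimised) Python sketch of the procedure described above; the (timestamp_seconds, value) representation and the exact boundary handling are assumptions:

def mean_max_naive(points, duration):
    # points: list of (timestamp_seconds, value) pairs sorted by time
    best = None
    for start in range(len(points)):
        end = start
        # 1) walk forward until the subset spans the desired duration
        while end < len(points) - 1 and points[end][0] - points[start][0] < duration:
            end += 1
        if points[end][0] - points[start][0] < duration:
            break  # end of array reached; remaining windows are too short
        # 2) average the subset and keep it if it beats the current max
        window = [v for _, v in points[start:end + 1]]
        avg = sum(window) / len(window)
        if best is None or avg > best:
            best = avg
        # 3) and 4) the next iteration drops one value from the left
    return best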
"And it computes a real avg computation continuously instead of sliding it somehow by removing the left point and adding the right, like one could do for a Simple-moving-average series."
Why don't you just slide it (i.e. keep a running sum and divide by the number of elements in that sum)?
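A sketch of that suggestion, under the same assumptions as above: a running sum is updated as points enter on the right and leave on the left, so each window costs O(1) instead of being re-summed from scratch.

def mean_max_sliding(points, duration):
    # points: list of (timestamp_seconds, value) pairs sorted by time
    best = None
    window_sum = 0.0
    right = 0                              # index one past the last point in window_sum
    for left in range(len(points)):
        # grow the right edge until the window spans at least `duration`
        while right < len(points) and points[right][0] - points[left][0] < duration:
            window_sum += points[right][1]
            right += 1
        if right == len(points):
            break                          # remaining windows are shorter than `duration`
        # include the point one step past the duration, as in the original description
        avg = (window_sum + points[right][1]) / (right - left + 1)
        if best is None or avg > best:
            best = avg
        window_sum -= points[left][1]      # the leftmost point leaves before the next window
    return best

The result matches the naive version but runs in roughly linear time, since every point is added to and removed from the running sum at most once.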
I'd like to keep moving averages for a number of different categories when storing log records. Imagine a service that saves web server logs one entry at a time. Let's further imagine we don't have access to the logged records afterwards, so we see each one once but cannot access it later on.
For different pages, I'd like to know
the total number of hits (easy)
a "recent" average (like one month or so)
a "long term" average (over a year)
Is there any clever algorithm/data model that allows saving such moving averages without having to recalculate them by summing up huge quantities of data?
I don't need an exact average (exactly 30 days or so) but just trend indicators. So some fuzziness is not a problem at all. It should just make sure that newer entries are weighted higher than older ones.
One solution probably would be to auto-create statistics records for each month. However, I don't even need past month statistics, so this seems like overkill. And it wouldn't give me a moving average but rather swap to new values from month to month.
An easy solution would be to keep an exponentially decaying total.
It can be calculated using the following formula:
newX = oldX * (p ^ (newT - oldT)) + delta
where oldX is the old value of your total (at time oldT), newX is the new value of your total (at time newT); delta is the contribution of new events to the total (for example the number of hits today); p is less than or equal to 1 and is the decay factor. If we take p = 1, then we have the total number of hits. By decreasing p, we effectively decrease the interval our total describes.
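A minimal Python sketch of that update; the decay factors and the day-based time unit are assumptions:

def decayed_total(old_total, old_t, new_t, delta, p):
    # decay the old total for the elapsed time, then add the new contribution
    return old_total * (p ** (new_t - old_t)) + delta

# Two counters over the same hit stream: one short-memory, one long-memory.
recent = long_term = 0.0
last_t = 0
for t, hits in [(1, 10), (2, 7), (5, 20), (6, 3)]:      # (day, hits that day) - made-up data
    recent = decayed_total(recent, last_t, t, hits, p=0.9)        # "recent" trend
    long_term = decayed_total(long_term, last_t, t, hits, p=0.99) # "long term" trend
    last_t = t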
If all you really want is a smoothed value with a given time constant then the easiest thing is to use a single pole recursive IIR filter (aka AR or auto-regressive filter in time series analysis). This takes the form:
X_new = (1 - k) * X_old + k * x
where X_old is the previous smoothed value, X_new is the new smoothed value, x is the current data point and k is a factor which determines the time constant (usually a small value, < 0.1). You may need to determine the two k values (one value for "recent" and a smaller value for "long term") empirically, based on your sample rate, which ideally should be reasonably constant, e.g. one update per day.
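A minimal sketch of that filter with assumed k values (smaller k means a longer time constant, roughly 1/k updates, hence the long-term trend):

def smooth(previous, current, k):
    return (1 - k) * previous + k * current

K_RECENT = 0.05       # assumed: time constant of roughly 20 daily updates
K_LONG_TERM = 0.005   # assumed: time constant of roughly 200 daily updates

recent = long_term = None
for hits_today in [120, 90, 200, 150, 80]:             # one update per day, as suggested
    if recent is None:
        recent = long_term = float(hits_today)         # seed with the first observation
    else:
        recent = smooth(recent, hits_today, K_RECENT)
        long_term = smooth(long_term, hits_today, K_LONG_TERM)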
This may be a solution for you.
You can aggregate the data into intermediate storage, grouped by hour or day. Then the grouping function will work very fast, because you only need to group a small number of records, and inserts will be fast as well. The precision trade-offs are up to you.
It can be better than the exponential/autoregressive algorithms because it is easier to understand what you are calculating, and it doesn't require math at each step.
For the most recent data you can use capped collections with a limited number of records. They are supported natively by some databases, for example MongoDB.