Better measure than "average" to highlight outlying values?

Better measure than "average" to highlight outlying values? - performance

For web app debugging purposes, I am currently showing a single number that represents the average JSON request time.
We do a large number of JSON requests, and most of them succeed quickly. The ones that occasionally take a few seconds to complete are the real problem. Showing the average request time is nearly useless, as the average is almost always excellent.
What kind of calculation can I do to generate a quality-of-service type of number that will go up when long request times start to become more prevalent?
Something like "max request time time in the last 5 minutes" is one way, but that doesn't give any information about how many times it happened or really how bad the problem is.
Thank you.

Related

Complex Primefaces datatable slow loading on client side

I have a lazy datatable with 40 complex editabe columns (most of them have autocomplete, calendar and selectOneMenu components) plus sorting and filtering... and it's taking too much time to load.
I'm using pagination and I noticed that when I select 10 rows by default, the time to load on the browser seems pretty decent. But if I choose 50 rows, changing from one page to the other takes around 13 seconds in the best cases (yes I know, it's 40 complex columns and 50 rows... ).
At first I thought the time complexity was on the server side, because of the complex query it executes each time you apply a filter or change page, but I measured it and the time it takes to query the DB is basically irrelevant compared with the amount of seconds it takes to load.Even more, I checked the perfomance on Chrome and almost 60% of the time is dedicated to 'Scripting' according to the performance tester (I don't really understand all the process of loading the page onto the browser, but I'll trust them about that).
So, my question is...
Am I missing something crucial about performance here? Is there something I can still do from the code in order to improve the loading time? My client's requirements led me into this solution and I cannot sacrifice ant functionalities... So the components in every column, the number of columns and the number of rows per page are non-negotiable. I hope you can give me some hint about it since I'm having serious claims about the response times and don't really know what to do.
Thanks in advance.

Approach to handle data that keeps changing

I am working on a website, and need to present something like 'x% of users who viewed this page bought this product.'
Despite the discussion of the business value, I want to know what would be a acceptable approach to get the data of x%.
I currently have two approaches. Either requires saving the number of users viewed the page and number of users who bought this product.
One approach is to calculate this data on the fly. The pros of this is that it presents accurate data, while the cons is that the wait time for user increases due to the calculation.
The other approach is that for every users viewed or bought this product, calculate the x% amount and persists the data to database. The pros of this is that it allows the users to quickly get the info, while the cons will be a lot of extra calculations, and the data may not be as accurate.
Assuming we expect hundreds of page views per hour, I wonder which is better approach? Or maybe a third approach will work better?
Thanks!

I think your best bet would be to find a balance between calculating the exact value on every user visit, and having the most accurate display.
You could log every user visit and every purchase in a database, then on every 100th visit or so, perform the calculation. Also log that calculation in your database and have your site pull the information from there, rather then calculate it on every visit.
And depending on how accurate you need to be, and how performance heavy the operation is, you can adjust your interval for calculating the value.
So in all we have that each user's visit increments a value in a database. On the back-end, that value is checked to see if it has went over another interval (so if your interval is every 100, the value is checked to see if its value % 100 == 0). And then you have an operation that only takes place 1/100th of the time a user visits the site, and is still accurate to within the hour (according to your calculation of having hundreds of views per hour).
Having said this, I agree with Jim Garrison's comment about premature optimization. I don't think the operation will have a noticable impact on your site's performance, and if you wanted to be as accurate as possible, you can run the calculation every time a user visits the site or purchases an item.

When timing how long a quick process runs, how many runs should be used?

Lets say I am going to run process X and see how long it takes.
I am going to save into a database a date I ran this process, and the time it took. I want to know what to put into the DB.
Process X almost always runs under 1500ms, so this is a short process. It usually runs between 500 and 1500ms, quite a range (3x difference).
My question is, how many "runs" should be saved into the DB as a single run?
Every run saved into the DB as its
own row?
5 Runs, averaged, then save that
time?
10 Runs averaged?
20 Runs, remove anything more than 2
std deviations away, and save
everything inside that range?
Does anyone have any good info backing them up on this?

Save the data for every run into its own row. Then later you can use and analyze the data however you like... ie, all you the other options you listed can be performed after the fact. It's not really possible for someone else to draw meaningful conclusions about how to average/analyze the data without knowing more about what's going on.

The fastest run is the one that most accurately times only your code.
All slower runs are slower because of noise introduced by the operating system scheduler.
The variance you experience is going to differ from machine to machine, and even on identical machines, the set of runnable processes will introduce noise.

None of the above. Bran is close though. You should save every measurment. But don't average them. The average (arithmetic mean) can be very misleading in this type of analysis. The reason is that some of your measurments will be much longer than the others. This will happen becuse things can interfere with your process - even on 'clean' test systems. It can also happen becuse your process may not be as deterministic as you might thing.
Some people think that simply taking more samples (running more iterations) and averaging the measurmetns will give them better data. It doesn't. The more you run, the more likelty it is that you will encounter a perturbing event, thus making the average overly high.
A better way to do this is to run as many measurments as you can (time permitting). 100 is not a bad number, but 30-ish can be enough.
Then, sort these by magnitude and graph them. Note that this is not a standard distribution. Compute compute some simple statistics: mean, median, min, max, lower quaertile, upper quartile.
Contrary to some guidance, do not 'throw away' outside vaulues or 'outliers'. These are often the most intersting measurments. For example, you may establish a nice baseline, then look for departures. Understanding these departures will help you fully understand how your process works, how the sytsem affecdts your process, and what can interfere with your process. It will often readily expose bugs.

Depends what kind of data you want. I'd say one line per run initially, then analyze the data, go from there. Maybe store a min/max/average of X runs if you want to consolidate it.

http://en.wikipedia.org/wiki/Sample_size
Bryan is right - you need to investigate more. if your code has that much variance even "most" of the time then you might have a lot of fluctuation in your test environment because of other processes, os paging or other factors. If not it seems that you have code paths doing wildly varying amount of work and coming up with a single number/run data to describe the performance of such a multi-modal system is not going to tell you much. So i'd say isolate your setup as much as possible, run at least 30 trials and get a feel for what your performance curve looks like. Once you have that, you can use that wikipedia page to come up with a number that will tell you how many trials you need to run per code-change to see if the performance has increased/decreased with some level of statistical significance.

While saying, "Save every run," is nice, it might not be practical in your case. However, I do think that storing only the average eliminates too much data. I like storing the average of ten runs, but instead of storing just the average, I'd also store the max and min values, so that I can get a feel for the spread of the data in addition to its center.
The max and min information in particular will tell you how often corner cases arise. Is the 1500ms case a one-in-1000 outlier? Or is it something that recurs on a regular basis?

algorithm to calculate ETA of a file downloading session [duplicate]

We've all poked fun at the 'X minutes remaining' dialog which seems to be too simplistic, but how can we improve it?
Effectively, the input is the set of download speeds up to the current time, and we need to use this to estimate the completion time, perhaps with an indication of certainty, like '20-25 mins remaining' using some Y% confidence interval.
Code that did this could be put in a little library and used in projects all over, so is it really that difficult? How would you do it? What weighting would you give to previous download speeds?
Or is there some open source code already out there?
Edit: Summarising:
Improve estimated completion time via better algo/filter etc.
Provide interval instead of single time ('1h45-2h30 mins'), or just limit the precision ('about 2 hours').
Indicate when progress has stalled - although if progress consistently stalls and then continues, we should be able to deal with that. Perhaps 'about 2 hours, currently stalled'

More generally, I think you are looking for a way to give an instant mesure of the transfer speed, which is generally obtained by an average over a small period.
The problem is generally that in order to be reactive, the period is usually extremely small, which leads to the yoyo effect.
I would propose a very simple scheme, let's model it.
Think of a curve speed (y) over time (x).
the Instant Speed, is no more than reading y for the current x (x0).
the Average Speed, is no more than Integral(f(x), x in [x0-T,x0]) / T
the scheme I propose is to apply a filter, to give more weight to the last moments, while still taking into account the past moments.
It can be easily implement as g(x,x0,T) = 2 * (x - x0) + 2T which is a simple triangle of surface T.
And now you can compute Integral(f(x)*g(x,x0,T), x in [x0-T,x0]) / T, which should work because both functions are always positive.
Of course you could have a different g as long as it's always positive in the given interval and that its integral on the interval is T (so that its own average is exactly 1).
The advantage of this method is that because you give more weight to immediate events, you can remain pretty reactive even if you consider larger time intervals (so that the average is more precise, and less susceptible to hiccups).
Also, what I have rarely seen but think would provide more precise estimates would be to correlate the time used for computing the average to the estimated remaining time:
if I download a 5ko file, it's going to be loaded in an instant, no need to estimate
if I download a 15 Mo file, it's going to take between 2 minutes roughly, so I would like estimates say... every 5 seconds ?
if I download a 1.5 Go file, it's going to take... well around 200 minutes (with the same speed)... which is to say 3h20m... perhaps that an estimates every minute would be sufficient ?
So, the longer the download is going to take, the less reactive I need to be, and the more I can average out. In general, I would say that a window could cover 2% of the total time (perhaps except for the few first estimates, because people appreciate immediate feedback). Also, indicating progress by whole % at a time is sufficient. If the task is long, I was prepared to wait anyway.

I wonder, would a state estimation technique produce good results here? Something like a Kalman Filter?
Basically you predict the future by looking at your current model, and change the model at each time step to reflect the changes to the real world. I think this kind of technique is used for estimating the time left on your laptop battery, which can also vary according to use, age of battery, etc'.
see http://en.wikipedia.org/wiki/Kalman_filter for a more in depth description of the algorithm.
The filter also gives a variance measure, which could be used to indicate your confidence of the estimate (allthough, as was mentioned by other answers, it might not be the best idea to show this to the end user)
Does anyone know if this is actually used somewhere for download (or file copy) estimation?

Don't confuse your users by providing more information than they need. I'm thinking of the confidence interval. Skip it.
Internet download times are highly variable. The microwave interferes with WiFi. Usage varies by time of day, day of week, holidays, and releases of new exciting games. The server may be heavily loaded right now. If you carry your laptop to cafe, the results will be different than at home. So, you probably can't rely on historical data to predict the future of download speeds.
If you can't accurately estimate the time remaining, then don't lie to your user by offering such an estimate.
If you know how much data must be downloaded, you can provide % completed progress.
If you don't know at all, provide a "heartbeat" - a piece of moving UI that shows the user that things are working, even through you don't know how long remains.

Improving the estimated time itself: Intuitively, I would guess that the speed of the net connection is a series of random values around some temporary mean speed - things tick along at one speed, then suddenly slow or speed up.
One option, then, could be to weight the previous set of speeds by some exponential, so that the most recent values get the strongest weighting. That way, as the previous mean speed moves further into the past, its effect on the current mean reduces.
However, if the speed randomly fluctuates, it might be worth flattening the top of the exponential (e.g. by using a Gaussian filter), to avoid too much fluctuation.
So in sum, I'm thinking of measuring the standard deviation (perhaps limited to the last N minutes) and using that to generate a Gaussian filter which is applied to the inputs, and then limiting the quoted precision using the standard deviation.
How, though, would you limit the standard deviation calculation to the last N minutes? How do you know how long to use?
Alternatively, there are pattern recognition possibilities to detect if we've hit a stable speed.

I've considered this off and on, myself. I the answer starts with being conservative when computing the current (and thus, future) transfer rate, and includes averaging over longer periods, to get more stable estimates. Perhaps low-pass filtering the time that is displayed, so that one doesn't get jumps between 2 minutes and 2 days.
I don't think a confidence interval is going to be helpful. Most people wouldn't be able to interpret it, and it would just be displaying more stuff that is a guess.

Estimating/forecasting download completion time

We've all poked fun at the 'X minutes remaining' dialog which seems to be too simplistic, but how can we improve it?
Effectively, the input is the set of download speeds up to the current time, and we need to use this to estimate the completion time, perhaps with an indication of certainty, like '20-25 mins remaining' using some Y% confidence interval.
Code that did this could be put in a little library and used in projects all over, so is it really that difficult? How would you do it? What weighting would you give to previous download speeds?
Or is there some open source code already out there?
Edit: Summarising:
Improve estimated completion time via better algo/filter etc.
Provide interval instead of single time ('1h45-2h30 mins'), or just limit the precision ('about 2 hours').
Indicate when progress has stalled - although if progress consistently stalls and then continues, we should be able to deal with that. Perhaps 'about 2 hours, currently stalled'

I wonder, would a state estimation technique produce good results here? Something like a Kalman Filter?
Basically you predict the future by looking at your current model, and change the model at each time step to reflect the changes to the real world. I think this kind of technique is used for estimating the time left on your laptop battery, which can also vary according to use, age of battery, etc'.
see http://en.wikipedia.org/wiki/Kalman_filter for a more in depth description of the algorithm.
The filter also gives a variance measure, which could be used to indicate your confidence of the estimate (allthough, as was mentioned by other answers, it might not be the best idea to show this to the end user)
Does anyone know if this is actually used somewhere for download (or file copy) estimation?

Don't confuse your users by providing more information than they need. I'm thinking of the confidence interval. Skip it.
Internet download times are highly variable. The microwave interferes with WiFi. Usage varies by time of day, day of week, holidays, and releases of new exciting games. The server may be heavily loaded right now. If you carry your laptop to cafe, the results will be different than at home. So, you probably can't rely on historical data to predict the future of download speeds.
If you can't accurately estimate the time remaining, then don't lie to your user by offering such an estimate.
If you know how much data must be downloaded, you can provide % completed progress.
If you don't know at all, provide a "heartbeat" - a piece of moving UI that shows the user that things are working, even through you don't know how long remains.

Improving the estimated time itself: Intuitively, I would guess that the speed of the net connection is a series of random values around some temporary mean speed - things tick along at one speed, then suddenly slow or speed up.
One option, then, could be to weight the previous set of speeds by some exponential, so that the most recent values get the strongest weighting. That way, as the previous mean speed moves further into the past, its effect on the current mean reduces.
However, if the speed randomly fluctuates, it might be worth flattening the top of the exponential (e.g. by using a Gaussian filter), to avoid too much fluctuation.
So in sum, I'm thinking of measuring the standard deviation (perhaps limited to the last N minutes) and using that to generate a Gaussian filter which is applied to the inputs, and then limiting the quoted precision using the standard deviation.
How, though, would you limit the standard deviation calculation to the last N minutes? How do you know how long to use?
Alternatively, there are pattern recognition possibilities to detect if we've hit a stable speed.

I've considered this off and on, myself. I the answer starts with being conservative when computing the current (and thus, future) transfer rate, and includes averaging over longer periods, to get more stable estimates. Perhaps low-pass filtering the time that is displayed, so that one doesn't get jumps between 2 minutes and 2 days.
I don't think a confidence interval is going to be helpful. Most people wouldn't be able to interpret it, and it would just be displaying more stuff that is a guess.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio