Comparison of Sorting Algorithms using running time in terms of seconds - sorting

I have devised a test in order to compare the different running times of my sorting algorithm with Insertion sort, bubble sort, quick sort, selection sort, and shell sort. I have based my test using the test done in this website http://warp.povusers.org/SortComparison/index.html, but I modified my test a bit.
I set up a test manager program server which generates the data, and the test manager sends it to the clients that run the different algorithms, therefore they are sorting the same data to have no bias.
I noticed that the insertion sort, bubble sort, and selection sort algorithms really did run for a very long time (some more than 15 minutes) just to sort one given data for sizes of 100,000 and 1,000,000.
So I changed the number of runs per test case for those two data sizes. My original runs for the 100,000 was 500 but I reduced it to 15, and for 1,000,000 was 100 and I reduced it to 3.
Now my professor doubts the credibility as to why I've reduced it that much, but as I've observed the running time for sorting a specific data distribution varied only by a small percentage, which is why I still find it that even though I've reduced it to that much I'd still be able to approximate the average runtime for that specific test case of that algorithm.
My question now is, is my assumption wrong? Does the machine at times make significant running time changes (>50% changes), like say for example sorting the same data over and over if a first run would give it 0.3 milliseconds will the second run give as much difference as making it run for 1.5 seconds? Because from my observation, the running times don't vary largely given the same type of test distribution (e.g. completely random, completely sorted, completely reversed).

What you are looking for is a way to measure error in your experiments. My favorite book on subject is Error Analysis by Taylor and Chapter 4 has what you need which I'll summarize here.
You need to calculate Standard error of the mean or SDOM. First calculate mean and standard deviation (formulas are on Wikipedia and quite simple). Your SDOM is standard deviation divided by square root of number of measurements. Assuming your timings have Normal distribution (which it should), the twice the value of SDOM is a very common way to specify +/- error.
For example, let's say you run sorting algorithm 5 times and get following numbers: 5, 6, 7, 4, 5. Then mean is 5.4 and standard deviation is 1.1. Therefor SDOM is 1.1/sqrt(5) = 0.5. So 2*SDOM = 1. Now you can say that algorithm rum time was 5.4 ± 1. You professor can determine if this is acceptable error in measurement. Notice that as you take more readings, your SDOM, i.e. plus or minus error, goes down inversely proportional to square root of N. Twice of SDOM interval has 95% probability or confidence that the true value lies within the interval which is accepted standard.
Also you most likely want to measure performance by measuring CPU time instead of simple timer. Modern CPUs are too complex with various cache level and pipeline optimizations and you might end up getting less accurate measurement if you are using timer. More about CPU time is in this answer: How can I measure CPU time and wall clock time on both Linux/Windows?

It absolutely does. You need a variety of "random" samples in order to be able to draw proper conclusions about the population.
Look at it this way. It takes a long time to poll 100,000 people in the U.S. about their political stance. If we reduce the sample size to 100 people in order to complete it faster, we not only reduce the precision of our final result (2 decimal places rather than 5), we also introduce a larger chance that the members of the sample have a specific bias (there is a greater chance that 100 people out of 3xx,000,000 think the same way than 100,000 out of those same 3xx,000,000).

Your professor is right, however he's not provided the details that I mention some of them here :
Sampling issue: It's right that you generate some random numbers and feed them to your sorting methods, but with a few test cases indeed you're biased cause almost all of the random functions are biased to some extent (specially to the state of machine or time at the moment), so you should use more and more test cases to be more confident about the randomness.
Machine state: Suppose you've provide perfect data (fully representative of a uniform distribution), the performance of the electro-mechanical devises like computers may vary in different situations, so you should try for considerable times to smooth the effects of these phenomena.
Note : In advanced technical reports, you should provide a confidence coefficient for the answers you provide derived from statistical analysis, and proven step by step, but if you don't need to be that much exact, simply increase these :
The size of the data
The number of tests

Related

Statistical Analysis of Runtime Measurements of a Parallel Algorithm

Problem Introduction
Assume we have a parallel algorithm f(<params>) running on P cores whereas
<params>: Parameters for algorithm
P: Number of cores it runs on (i.e. threads, cores, processors)
We further assume that out implementation actually consists of three parts:
A - Distribution: We distribute the input to all processors
B - Run the algorithm: We run f(<params>) ("on each processor")
C - Collection: We collect the computed data from all processors
After fixing <params> and P like input size, number of processors etc. the algorithm itself is deterministic i.e. we can write down an exact cDAG for it.
I'm now trying to answer the question: "For a given set of parameters, what is the execution time for a given system?"
With "given system" I mean e.g. "my computer" or "the university super computer" because obviously, the runtime does depend on the system it runs on and obviously the system itself does introduce non-determinism because you never really know the state of the system.
So in short: While the algorithm might be deterministic, runtime measurements aren't. (but e.g. communication measurements would be deterministic.) So we need to do a proper statistical analysis. And this is where I'm unsure.
Measuring Runtime: Basic idea
We are interested in how long part "B - Run the Algorithm" takes. Since the algorithm actually runs on P cores we'll make a measurement on each core and so get P values, let call those P values P_measurements. Some cores might finish before others, so which value does represent the runtime of the whole algorithm? I think a good choice is to simply take value of the core that took the longest i.e. max(P_measurements).
Now there are two things that need consideration here:
We have to repeat the measurement n times since it's a non-deterministic value
Once we have those n*P values, we need to know how to properly summarize them.
(And additional concern would be how to communicate those results in the end, but that's not part of this question.)
Measuring Runtime: Statistical Analysis
So here's what I'd do and this is also the part where I'm very unsure.
We measure the runtime of f(<params>) on each of the P cores. We get P_measurements
We take max(P_measurements)
We repeat 1. & 2. n times and we end up with maxes. Whereas maxes is a list of the n values max(P_measurements)
We check if maxesis normally distributed using a Q-Q-Plot. If not, we normalize. We do expect it to be right-skewed.
Now we take the median of maxes. (If we normalized, we use the normalized values)
We compute the standard deviation, the population mean and the 95% confidence interval.
We might want to say that all values are of an error of e.g. 5% so we check if all the values lie between +-5% of the population mean i.e. the confidence interval should be rather "thin".
We got ourselves some nice runtime measurement.
Clarifications:
Step 4. was necessary because computing the CI in step 6 uses the t-distribution and because later on I want to measure a different implementation of the same algorithm. So I'll have to compare two values and for that I need to do e.g. a t-test. So I need to make sure, the prerequisites for the t-test are met, which are: iid & normally distributed. Iid is assumed.
Question
I am very unsure what I did is statistically sound. Especially step 1-3. I'm not sure if I can do that kind of summarization (just take the max) here. I know that we might have an outsider value that's "especially" high but since we only measure on super computers we can assume the noise to be low and since we take the median in the end any outliners shouldn't have a big impact.
I hope for good input since it's a rather complex topic and I'm very interested in doing it right. I mostly followed the following paper, which I can recommend: http://spcl.inf.ethz.ch/Teaching/2020-dphpc/hoefler-scientific-benchmarking.pdf
But even with the paper, I'm not used to use statistical analysis and thus would just like to get some input from people who actually know this stuff. :)

Measure parallel speedup in randomized-algorithms

I have a randomized program with sequential and parallel variants. The nature of that program is that its run-time varies drastically depending on its "luck". It regularly takes values between 1sec and 2 min in a seemingly geomentric-distribution-ish pattern. Parallel variants show a similar behaviour with different numbers.
What is a "good" way measure parallel speedup in this case?
I have the possibility to just use the mean/median of measured values as a representative for "the run-time"
How would I explain such an approach and is there a (statistically/mathematically) better way to calculate the speedup?
EDIT: Thanks to user3666197, which noted some very important technical details necessary to get good data.
I have done that homework and want to clarify my question.
I made my benchmark process as reliable as I could get it:
the benchmarks are run with seeds, with which the results are reproducible.
every configuration is repeated multiple times (~400 times) with different seeds inside a script
My question remains: How to approach the calculation of the speedup for this program.
What I have done:
Mean sequential run-time is about 8.38, median is 4.8, which is a big difference. For 2 threads, mean run-time is 4.36 while median run-time is 2.42.
If I divide sequential by parallel I get speedups of 1.92 (means) and 1.992 (medians).
For 4 threads similar: means: 2.25 run-time and 3.72 speedups, medians: 1.12 median and 4.3 speedup (superlinear).
Similar numbers exist for 8 threads.
I try to visualize the data in different ways. Plots
The histogram shows the distribtion of run-times using various threads, as does the boxplot to its right. It is visible that some speedup is visible.
If I pair the measurements based on seeds, I get pairs of times: sequential and parallel times.
One of my first ideas was to calculate the speedup by calculating the slope of the regression-line, however, it seems that the regression-line does not properly "summarize" the data and has limited value. In the bottom-right plot, only the points for 4 threads are shown.
I would recommend you compute the speedup based on the arithmetic mean of runtimes of a sufficiently large set of measurements. Make sure to properly communicate what the numbers represent. It can be difficult to ensure that you have a large enough set up measurements to compute a proper mean with a certain confidence, especially since your samples are not normally distributed. Include your findings about distribution and confindence. Make sure to summarize the runtimes first, before computing the speedup.
There is an excellent paper by Torsten Hoefler and Roberto Belli which covers your issues in detail. Particularly Sections 2.1.1 and 3.
How to measure a parallel-speedup versus a pure-[SERIAL] code?
Always be both quantitative and systematic.
That means at least:
1) use all systematic-steps for controlled test-repeatability
2) compare apples to apples, incl. the controlled seed-setup for randomizers
3) best, generate all test-batteries as scripted, auto-repeatable experiments
4) record the performance ( overall and local-sections timing ) in tests' UUID#-distinguishable logs
5) collect rather 1E+3 ~ 1E+4 sized populations of test-runs, not just a few units of individual trials
Given your solution is already implemented in both a pure-[SERIAL] code-execution fashion and in some other [CONCURRENT] or even [PARALLEL], the most exact step is to compare the end-to-end test durations.
It is quite common to use monotonic-clocks, having better than ~ [us] resolution in [TIME]-domain.
For more details about internalities, best view the re-formulated Amdahl's Law and the criticism of of the parallel-speedup initial, unconstrained-resources use formulation.

Rapid change detection algorithm

I'm logging temperature values in a room, saving them to the database. I'd like to be alerted when temperature rises suddenly. I can't set fixed values, because 18°C is acceptable in winter and 25°C is acceptable in summer. But if it jumps from 20°C to 25°C during, let's say, 30 minutes and stays like this for 5 minutes (to eliminate false readouts), I'd like to be informed.
My current idea is to take readouts from last 30 minutes (A) and readouts from last 5 minutes (B), calculate median of A and B and check if difference between them is less then my desired threshold.
Is this correct way to solve this or is there a better algorithm? I searched for a specific one but most of them seem overcomplicated.
Thanks!
Detecting changes in a time-series is a well-researched subject, and hundreds if not thousands of papers have been written on this subject. As you've seen many methods are quite advanced, but proved to be quite useful for many use cases. Whatever method you choose, you should evaluate it against real of simulated data, and optimize its parameters for your use case.
As you require, let me suggest a very simple method that in many cases prove to be good enough, and is quite similar to that you considered.
Basically, you have two concerns:
Detecting a monotonous change in a sampled noisy signal
Ignoring false readouts
First, note that medians are not commonly used for detecting trends. For the series (1,2,3,30,35,3,2,1) the medians of 5 consecutive terms is be (3, 3, 3, 3). It is much more common to use averages.
One common trick is to throw the extreme values before averaging (e.g. for each 7 values average only the middle 5). If many false readouts are expected - try to take measurements at a faster rate, and throw more extreme values (e.g. for each 13 values average the middle 9).
Also, you should throw away unfeasible values and replace them with the last measured value (unfeasible means out of range, or non-physical change rate).
Your idea of comparing a short-period measure with a long-period measure is a good idea, and indeed it is commonly used (e.g. in econometrics).
Quoting from "Financial Econometric Models - Some Contributions to the Field [Nicolau, 2007]:
Buy and sell signals are generated by two moving averages of the price
level: a long-period average and a short-period average. A typical
moving average trading rule prescribes a buy (sell) when the
short-period moving average crosses the long-period moving average
from below (above) (i.e. when the original time series is rising
(falling) relatively fast).
When you say "rises suddenly," mathematically you are talking about the magnitude of the derivative of the temperature signal.
There is a nice algorithm to simultaneously smooth a signal and calculate its derivative called the Savitzky–Golay filter. It's explained with examples on Wikipedia, or you can use Matlab to help you generate the convolution coefficients required. Once you have the coefficients the calculation is very simple.

How many simulations need to do?

Hello my problem is more related with the validation of a model. I have done a program in netlogo that i'm gonna use in a report for my thesis but now the question is, how many repetitions (simulations) i need to do for justify my results? I already have read some methods using statistical approach and my colleagues have suggested me some nice mathematical operations, but i also want to know from people who works with computational models what kind of statistical test or mathematical method used to know that.
There are two aspects to this (1) How many parameter combinations (2) How many runs for each parameter combination.
(1) Generally you would do experiments, where you vary some of your input parameter values and see how some model output changes. Take the well known Schelling segregation model as an example, you would vary the tolerance value and see how the segregation index is affected. In this case you might vary the tolerance from 0 to 1 by 0.01 (if you want discrete) or you could just take 100 different random values in the range [0,1]. This is a matter of experimental design and is entirely affected by how fine you wish to examine your parameter space.
(2) For each experimental value, you also need to run multiple simulations so that you can can calculate the average and reduce the impact of randomness in the simulation run. For example, say you ran the model with a value of 3 for your input parameter (whatever it means) and got a result of 125. How do you know whether the 'real' answer is 125 or something else. If you ran it 10 times and got 10 different numbers in the range 124.8 to 125.2 then 125 is not an unreasonable estimate. If you ran it 10 times and got numbers ranging from 50 to 500, then 125 is not a useful result to report.
The number of runs for each experiment set depends on the variability of the output and your tolerance. Even the 124.8 to 125.2 is not useful if you want to be able to estimate to 1 decimal place. Look up 'standard error of the mean' in any statistics text book. Basically, if you do N runs, then a 95% confidence interval for the result is the average of the results for your N runs plus/minus 1.96 x standard deviation of the results / sqrt(N). If you want a narrower confidence interval, you need more runs.
The other thing to consider is that if you are looking for a relationship over the parameter space, then you need fewer runs at each point than if you are trying to do a point estimate of the result.
Not sure exactly what you mean, but maybe you can check the books of Hastie and Tishbiani
http://web.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII_print4.pdf
specially the sections on resampling methods (Cross-Validation and bootstrap).
They also have a shorter book that covers the possible relevant methods to your case along with the commands in R to run this. However, this book, as a far as a I know, is not free.
http://www.springer.com/statistics/statistical+theory+and+methods/book/978-1-4614-7137-0
Also, could perturb the initial conditions to see you the outcome doesn't change after small perturbations of the initial conditions or parameters. On a larger scale, sometimes you can break down the space of parameters with regard to final state of the system.
1) The number of simulations for each parameter setting can be decided by studying the coefficient of variance Cv = s / u, here s and u are standard deviation and mean of the result respectively. It is explained in detail in this paper Coefficient of variance.
2) The simulations where parameters are changed can be analyzed using several methods illustrated in the paper Testing methods.
These papers provide scrupulous analyzing methods and refer to other papers which may be relevant to your question and your research.

What are some good approaches to predicting the completion time of a long process?

tl;dr: I want to predict file copy completion. What are good methods given the start time and the current progress?
Firstly, I am aware that this is not at all a simple problem, and that predicting the future is difficult to do well. For context, I'm trying to predict the completion of a long file copy.
Current Approach:
At the moment, I'm using a fairly naive formula that I came up with myself: (ETC stands for Estimated Time of Completion)
ETC = currTime + elapsedTime * (totalSize - sizeDone) / sizeDone
This works on the assumption that the remaining files to be copied will do so at the average copy speed thus far, which may or may not be a realistic assumption (dealing with tape archives here).
PRO: The ETC will change gradually, and becomes more and more accurate as the process nears completion.
CON: It doesn't react well to unexpected events, like the file copy becoming stuck or speeding up quickly.
Another idea:
The next idea I had was to keep a record of the progress for the last n seconds (or minutes, given that these archives are supposed to take hours), and just do something like:
ETC = currTime + currAvg * (totalSize - sizeDone)
This is kind of the opposite of the first method in that:
PRO: If the speed changes quickly, the ETC will update quickly to reflect the current state of affairs.
CON: The ETC may jump around a lot if the speed is inconsistent.
Finally
I'm reminded of the control engineering subjects I did at uni, where the objective is essentially to try to get a system that reacts quickly to sudden changes, but isn't unstable and crazy.
With that said, the other option I could think of would be to calculate the average of both of the above, perhaps with some kind of weighting:
Weight the first method more if the copy has a fairly consistent long-term average speed, even if it jumps around a bit locally.
Weight the second method more if the copy speed is unpredictable, and is likely to do things like speed up/slow down for long periods, or stop altogether for long periods.
What I am really asking for is:
Any alternative approaches to the two I have given.
If and how you would combine several different methods to get a final prediction.
If you feel that the accuracy of prediction is important, the way to go about about building a predictive model is as follows:
collect some real-world measurements;
split them into three disjoint sets: training, validation and test;
come up with some predictive models (you already have two plus a mix) and fit them using the training set;
check predictive performance of the models on the validation set and pick the one that performs best;
use the test set to assess the out-of-sample prediction error of the chosen model.
I'd hazard a guess that a linear combination of your current model and the "average over the last n seconds" would perform pretty well for the problem at hand. The optimal weights for the linear combination can be fitted using linear regression (a one-liner in R).
An excellent resource for studying statistical learning methods is The Elements of
Statistical Learning by Hastie, Tibshirani and Friedman. I can't recommend that book highly enough.
Lastly, your second idea (average over the last n seconds) attempts to measure the instantaneous speed. A more robust technique for this might be to use the Kalman filter, whose purpose is exactly this:
Its purpose is to use measurements observed over time, containing
noise (random variations) and other inaccuracies, and produce values
that tend to be closer to the true values of the measurements and
their associated calculated values.
The principal advantage of using the Kalman filter rather than a fixed n-second sliding window is that it's adaptive: it will automatically use a longer averaging window when measurements jump around a lot than when they're stable.
Imho, bad implementations of ETC are wildly overused, which allows us to have a good laugh. Sometimes, it might be better to display facts instead of estimations, like:
5 of 10 files have been copied
10 of 200 MB have been copied
Or display facts and an estimation, and make clear that it is only an estimation. But I would not display only an estimation.
Every user knows that ETCs are often completely meaningless, and then it is hard to distinguish between meaningful ETCs and meaningless ETCs, especially for inexperienced users.
I have implemented two different solutions to address this problem:
The ETC for the current transfer at start time is based on a historic speed value. This value is refined after each transfer. During the transfer I compute a weighted average between the historic data and data from the current transfer, so that the closer to the end you are the more weight is given to actual data from the transfer.
Instead of showing a single ETC, show a range of time. The idea is to compute the ETC from the last 'n' seconds or minutes (like your second idea). I keep track of the best and worst case averages and compute a range of possible ETCs. This is kind of confusing to show in a GUI, but okay to show in a command line app.
There are two things to consider here:
the exact estimation
how to present it to the user
1. On estimation
Other than statistics approach, one simple way to have a good estimation of the current speed while erasing some noise or spikes is to take a weighted approach.
You already experimented with the sliding window, the idea here is to take a fairly large sliding window, but instead of a plain average, giving more weight to more recent measures, since they are more indicative of the evolution (a bit like a derivative).
Example: Suppose you have 10 previous windows (most recent x0, least recent x9), then you could compute the speed:
Speed = (10 * x0 + 9 * x1 + 8 * x2 + ... + x9) / (10 * window-time) / 55
When you have a good assessment of the likely speed, then you are close to get a good estimated time.
2. On presentation
The main thing to remember here is that you want a nice user experience, and not a scientific front.
Studies have demonstrated that users reacted very badly to slow-down and very positively to speed-up. Therefore, a good progress bar / estimated time should be conservative in the estimates presented (reserving time for a potential slow-down) at first.
A simple way to get that is to have a factor that is a percentage of the completion, that you use to tweak the estimated remaining time. For example:
real-completion = 0.4
presented-completion = real-completion * factor(real-completion)
Where factor is such that factor([0..1]) = [0..1], factor(x) <= x and factor(1) = 1. For example, the cubic function produces the nice speed-up toward the completion time. Other functions could use an exponential form 1 - e^x, etc...

Resources