Statistical Analysis of Runtime Measurements of a Parallel Algorithm - performance

Problem Introduction
Assume we have a parallel algorithm f(<params>) running on P cores, where
<params>: Parameters for algorithm
P: Number of cores it runs on (i.e. threads, cores, processors)
We further assume that our implementation actually consists of three parts:
A - Distribution: We distribute the input to all processors
B - Run the algorithm: We run f(<params>) ("on each processor")
C - Collection: We collect the computed data from all processors
After fixing <params> and P (input size, number of processors, etc.), the algorithm itself is deterministic, i.e. we can write down an exact cDAG for it.
I'm now trying to answer the question: "For a given set of parameters, what is the execution time for a given system?"
With "given system" I mean e.g. "my computer" or "the university super computer" because obviously, the runtime does depend on the system it runs on and obviously the system itself does introduce non-determinism because you never really know the state of the system.
So in short: while the algorithm might be deterministic, runtime measurements aren't (whereas e.g. communication measurements would be deterministic). So we need to do a proper statistical analysis, and this is where I'm unsure.
Measuring Runtime: Basic idea
We are interested in how long part "B - Run the Algorithm" takes. Since the algorithm actually runs on P cores, we make a measurement on each core and get P values; let's call those P values P_measurements. Some cores might finish before others, so which value represents the runtime of the whole algorithm? I think a good choice is to simply take the value of the core that took longest, i.e. max(P_measurements).
Now there are two things that need consideration here:
We have to repeat the measurement n times since it's a non-deterministic value
Once we have those n*P values, we need to know how to properly summarize them.
(An additional concern would be how to communicate those results in the end, but that's not part of this question.)
Measuring Runtime: Statistical Analysis
So here's what I'd do and this is also the part where I'm very unsure.
1. We measure the runtime of f(<params>) on each of the P cores, giving us P_measurements.
2. We take max(P_measurements).
3. We repeat 1. & 2. n times and end up with maxes, a list of the n values of max(P_measurements).
4. We check whether maxes is normally distributed using a Q-Q plot. If not, we normalize. We expect it to be right-skewed.
5. We take the median of maxes (if we normalized, we use the normalized values).
6. We compute the standard deviation, the mean and the 95% confidence interval.
7. We might want to claim that the values have an error of at most e.g. 5%, so we check whether they all lie within ±5% of the mean, i.e. the confidence interval should be rather "thin".
8. We've got ourselves a nice runtime measurement.
Clarifications:
Step 4 is necessary because computing the CI in step 6 uses the t-distribution, and because later on I want to measure a different implementation of the same algorithm. I'll then have to compare two values, and for that I need e.g. a t-test. So I need to make sure the prerequisites for the t-test are met: i.i.d. and normally distributed samples. I.i.d. is assumed.
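To make the steps concrete, here is a minimal sketch in Python (NumPy/SciPy) of steps 1-7, assuming the per-core timings have already been collected into an n × P array called timings. The array name, the placeholder data and the 5% tolerance check are illustrative assumptions, not part of the procedure above.

```python
import numpy as np
from scipy import stats

# timings: hypothetical n x P array, one row per repetition,
# one column per core (seconds). Placeholder data for illustration.
rng = np.random.default_rng(0)
timings = rng.gamma(shape=20.0, scale=0.01, size=(50, 16))

# Steps 1-3: per repetition, the runtime of the whole algorithm is the
# slowest core, so take the row-wise maximum.
maxes = timings.max(axis=1)

# Step 4: data for a Q-Q plot against the normal distribution
# (one would normally plot osm vs. osr, e.g. with matplotlib).
(osm, osr), _ = stats.probplot(maxes, dist="norm")

# Steps 5-6: location and spread of the n maxima.
median = np.median(maxes)
mean = maxes.mean()
sd = maxes.std(ddof=1)
n = len(maxes)
ci95 = stats.t.interval(0.95, df=n - 1, loc=mean, scale=sd / np.sqrt(n))

# Step 7: does the confidence interval stay within +/-5% of the mean?
within_5pct = (ci95[0] >= 0.95 * mean) and (ci95[1] <= 1.05 * mean)
print(median, mean, sd, ci95, within_5pct)
```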
Question
I am very unsure whether what I did is statistically sound, especially steps 1-3. I'm not sure if I can do that kind of summarization (just take the max) here. I know that we might have an outlier value that's "especially" high, but since we only measure on supercomputers we can assume the noise to be low, and since we take the median in the end any outliers shouldn't have a big impact.
I hope for good input since it's a rather complex topic and I'm very interested in doing it right. I mostly followed the following paper, which I can recommend: http://spcl.inf.ethz.ch/Teaching/2020-dphpc/hoefler-scientific-benchmarking.pdf
But even with the paper, I'm not used to using statistical analysis and thus would just like to get some input from people who actually know this stuff. :)

Related

Measure parallel speedup in randomized-algorithms

I have a randomized program with sequential and parallel variants. The nature of that program is that its run-time varies drastically depending on its "luck". It regularly takes values between 1 sec and 2 min in a seemingly geometric-distribution-ish pattern. Parallel variants show similar behaviour with different numbers.
What is a "good" way measure parallel speedup in this case?
I could just use the mean/median of the measured values as a representative of "the run-time".
How would I explain such an approach and is there a (statistically/mathematically) better way to calculate the speedup?
EDIT: Thanks to user3666197, who noted some very important technical details necessary to get good data.
I have done that homework and want to clarify my question.
I made my benchmark process as reliable as I could get it:
the benchmarks are run with seeds, with which the results are reproducible.
every configuration is repeated multiple times (~400 times) with different seeds inside a script
My question remains: How to approach the calculation of the speedup for this program.
What I have done:
Mean sequential run-time is about 8.38, median is 4.8, which is a big difference. For 2 threads, mean run-time is 4.36 while median run-time is 2.42.
If I divide sequential by parallel I get speedups of 1.92 (means) and 1.992 (medians).
For 4 threads it is similar: mean run-time 2.25 with a speedup of 3.72; median run-time 1.12 with a speedup of 4.3 (superlinear).
Similar numbers exist for 8 threads.
I have tried to visualize the data in different ways: Plots
The histogram shows the distribution of run-times for various thread counts, as does the boxplot to its right. Some speedup is clearly visible.
If I pair the measurements based on seeds, I get pairs of times: sequential and parallel times.
One of my first ideas was to calculate the speedup from the slope of the regression line; however, it seems that the regression line does not properly "summarize" the data and has limited value. In the bottom-right plot, only the points for 4 threads are shown.
I would recommend you compute the speedup based on the arithmetic mean of the runtimes of a sufficiently large set of measurements. Make sure to properly communicate what the numbers represent. It can be difficult to ensure that you have a large enough set of measurements to compute a proper mean with a certain confidence, especially since your samples are not normally distributed. Include your findings about distribution and confidence. Make sure to summarize the runtimes first, before computing the speedup.
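As a sketch of "summarize first, then compute the speedup", assuming the sequential and 2-thread run-times are collected in arrays t_seq and t_par (the names, the placeholder data and the bootstrap confidence interval are my own illustrative choices, not prescribed by anything above):

```python
import numpy as np

rng = np.random.default_rng(1)
# Placeholder measurements; replace with your ~400 seeded runs.
t_seq = rng.exponential(scale=8.4, size=400)   # sequential runtimes
t_par = rng.exponential(scale=4.4, size=400)   # 2-thread runtimes

# Summarize the runtimes first, then compute the speedup.
speedup = t_seq.mean() / t_par.mean()

# A simple percentile bootstrap to attach a confidence interval
# to that ratio (one of several possible choices).
boot = []
for _ in range(10_000):
    s = rng.choice(t_seq, size=t_seq.size, replace=True).mean()
    p = rng.choice(t_par, size=t_par.size, replace=True).mean()
    boot.append(s / p)
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"speedup ~ {speedup:.2f}, 95% bootstrap CI [{lo:.2f}, {hi:.2f}]")
```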
There is an excellent paper by Torsten Hoefler and Roberto Belli which covers your issues in detail. Particularly Sections 2.1.1 and 3.
How to measure a parallel-speedup versus a pure-[SERIAL] code?
Always be both quantitative and systematic.
That means at least:
1) use all systematic-steps for controlled test-repeatability
2) compare apples to apples, incl. the controlled seed-setup for randomizers
3) best, generate all test-batteries as scripted, auto-repeatable experiments
4) record the performance ( overall and local-sections timing ) in tests' UUID#-distinguishable logs
5) collect rather 1E+3 ~ 1E+4 sized populations of test-runs, not just a few units of individual trials
Given your solution is already implemented both in a pure-[SERIAL] code-execution fashion and in some other [CONCURRENT] or even [PARALLEL] fashion, the most exact step is to compare the end-to-end test durations.
It is quite common to use monotonic-clocks, having better than ~ [us] resolution in [TIME]-domain.
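As an illustration only, Python's time.perf_counter_ns is one such monotonic clock with better than microsecond resolution; the work() function below is just a stand-in for the code section being measured:

```python
import time

def work():
    # stand-in for the code section being measured
    sum(i * i for i in range(100_000))

t0 = time.perf_counter_ns()            # monotonic, high-resolution clock
work()
elapsed_ns = time.perf_counter_ns() - t0
print(f"{elapsed_ns / 1e3:.1f} us")
```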
For more details about the internals, best view the re-formulated Amdahl's Law and the criticism of the initial, unconstrained-resources formulation of parallel speedup.

How to run MCTS on a highly non-deterministic system?

I'm trying to implement an MCTS algorithm for the AI of a small game. The game is an RPG simulation. The AI should decide what moves to play in battle. It's a turn-based battle (FF6-7 style). There is no movement involved.
I won't go into details, but we can safely assume that we know with certainty what move the player will choose in any given situation when it is their turn to play.
A game ends when one party has no unit alive (4v4). It can take any number of turns (and may also never end). There are a lot of RNG elements in the damage computation & skill processing (attacks can hit/miss, crit or not, there are lots of procs that can trigger or not, buffs can have a % chance to happen, etc.).
Units have around 6 skills each to give an idea of the branching factor.
I've built a preliminary version of the MCTS that gives poor results for now. I'm having trouble with a few things:
One of my main issues is how to handle the non-deterministic outcomes of my moves. I've read a few papers about this but I'm still in the dark.
Some suggest determinizing the game information, running an MCTS tree on that, repeating the process N times to cover a broad range of possible game states and using that information to make the final decision. In the end, this multiplies the computing time by a huge factor, since we have to compute N MCTS trees instead of one. I cannot rely on that since over the course of a fight I've got thousands of RNG elements: computing 2^1000 MCTS trees, when I already struggle with one, is not an option :)
I had the idea of adding X children for the same move, but it does not seem to lead to a good answer either. It smooths the RNG curve a bit, but can shift it in the opposite direction if the value of X is too big/small compared to the percentage of a particular RNG. And since I've got multiple RNG rolls per move (hit chance, crit chance, percentage to proc something, etc.) I cannot find a decent value of X that satisfies every case. More of a band-aid than anything else.
Likewise, adding one node per RNG tuple {hit or miss, crit or not, proc1 or not, proc2 or not, etc.} for each move should cover every possible situation, but has some heavy drawbacks: with only 5 RNG mechanisms that already means 2^5 nodes to consider for each move, which is way too much to compute. If we manage to create them all, we could assign each a probability (derived from the probability of each RNG element in the node's tuple) and use that probability during the selection phase. This should work overall, but would be really hard on the CPU :/
I also cannot "merge" them into one single node, since I've got no way of accurately averaging the player/monster stat values based on two different game states; averaging the move's result during the move processing itself is doable, but requires a lot of simplifications that are a pain to code and will hurt our accuracy really fast anyway.
Do you have any ideas how to approach this problem ?
Some other aspects of the algorithm are eluding me:
I cannot do a full playout until an end state because A) it would take a lot of my computing time and B) some battles may never end (by design). I've got 2 solutions (that I can mix):
- Do a random playout for X turns
- Use an evaluation function to try and score the situation.
Even if I consider only health points in the evaluation, I'm failing to find a good evaluation function that returns a reliable value for a given situation (1-4 units for the player and the same for the monsters; I know their current/max HP values). What bothers me is that fights can vary greatly in length and disparity of power. That means that sometimes a 0.01% change in HP matters (in a long game vs. a boss, for example) and sometimes it is just insignificant (when the player is farming a zone far below their level).
The disparity of power and Hp variance between fights means that my Biais parameter in the UCB selection process is hard to fix. i'm currently using something very low, like 0.03. Anything > 0.1 and the exploration factor is so high that my tree is constructed depth by depth :/
For now I'm also using a biased way to choose moves during my simulation phase: it selects the move that the player would choose in the situation, and random ones for the AI, leading to a simulation biased in favor of the player. I've tried using a purely random one for both, but it seems to give worse results. Do you think having a biased simulation phase works against the purpose of the algorithm? I'm inclined to think it would just give a pessimistic view to the AI and would not impact the end result too much. Maybe I'm wrong though.
Any help is welcome :)
I think this question is way too broad for StackOverflow, but I'll give you some thoughts:
Using stochastic outcomes or probabilities in tree searches is usually called expectimax search. You can find a good summary and pseudo-code for Expectimax Approximation with Monte-Carlo Tree Search in chapter 4, but I would recommend using a normal minimax tree search with the expectimax extension. There are a few modifications like Star1, Star2 and Star2.5 for a better runtime (similar to alpha-beta pruning).
It boils down to not only having decision nodes, but also chance nodes. The probability of each possible outcome should be known, and the value of each outcome is weighted by its probability to obtain the node's expected value.
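A minimal sketch of such a tree with decision and chance nodes (the node classes and the example numbers are hypothetical, purely to show how outcome probabilities weight the values):

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

# Leaves are plain floats (an evaluation of the game state).
@dataclass
class Chance:
    outcomes: List[Tuple[float, "Node"]]   # (probability, child) pairs for one move

@dataclass
class Decision:
    children: List["Node"]
    maximizing: bool                       # True for the AI's turn, False for the player's

Node = Union[Chance, Decision, float]

def expectimax(node: Node) -> float:
    if isinstance(node, float):
        return node
    if isinstance(node, Chance):
        # chance node: expected value, each outcome weighted by its probability
        return sum(p * expectimax(child) for p, child in node.outcomes)
    pick = max if node.maximizing else min
    return pick(expectimax(child) for child in node.children)

# Example: a move that crits (30%, value 10) or not (70%, value 4) vs. a safe move worth 5.
risky = Chance(outcomes=[(0.3, 10.0), (0.7, 4.0)])
print(expectimax(Decision(children=[risky, 5.0], maximizing=True)))  # 5.8
```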
2^5 nodes per move is high, but not impossibly high, especially for a low number of moves and a shallow search. Even a search of depth 1-3 should give you some results. In my Tetris AI, there are ~30 different possible moves to consider, and I calculate the result of the three following pieces (for each possibility) to select my move. This is done in 2 seconds. I'm sure you have much more time for calculation, since you're waiting for user input.
If the player's move is obvious to you, shouldn't it also be obvious to your AI?
You don't need to consider a single value (HP); you can have several factors that are weighted differently to calculate the expected value. Coming back to my Tetris AI, there are 7 factors (bumpiness, highest piece, number of holes, ...) that are calculated, weighted and added together. To get the weights, you could use different methods; I used a genetic algorithm to find the combination of weights that resulted in the most lines cleared.
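A tiny sketch of such a weighted evaluation (the factor names and weights are made up for illustration; the weights would be tuned by hand, grid search, or a genetic algorithm as described above):

```python
# Hypothetical factors describing a battle state, each with a hand-picked weight.
WEIGHTS = {"own_hp_fraction": 1.0, "enemy_hp_fraction": -1.0, "units_alive_diff": 0.5}

def evaluate(state: dict) -> float:
    # weighted sum of the factors
    return sum(WEIGHTS[name] * state[name] for name in WEIGHTS)

print(evaluate({"own_hp_fraction": 0.8, "enemy_hp_fraction": 0.4, "units_alive_diff": 1}))
# 0.8 - 0.4 + 0.5 = 0.9 (up to floating-point rounding)
```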

How many simulations need to do?

Hello, my problem is more related to the validation of a model. I have written a program in NetLogo that I'm going to use in a report for my thesis, but now the question is: how many repetitions (simulations) do I need to do to justify my results? I have already read about some statistical methods and my colleagues have suggested some nice mathematical approaches, but I also want to know from people who work with computational models what kind of statistical test or mathematical method is used for that.
There are two aspects to this (1) How many parameter combinations (2) How many runs for each parameter combination.
(1) Generally you would do experiments where you vary some of your input parameter values and see how some model output changes. Take the well-known Schelling segregation model as an example: you would vary the tolerance value and see how the segregation index is affected. In this case you might vary the tolerance from 0 to 1 by 0.01 (if you want discrete values) or you could just take 100 different random values in the range [0,1]. This is a matter of experimental design and is entirely driven by how finely you wish to examine your parameter space.
(2) For each experimental value, you also need to run multiple simulations so that you can calculate the average and reduce the impact of randomness in the simulation run. For example, say you ran the model with a value of 3 for your input parameter (whatever it means) and got a result of 125. How do you know whether the 'real' answer is 125 or something else? If you ran it 10 times and got 10 different numbers in the range 124.8 to 125.2, then 125 is not an unreasonable estimate. If you ran it 10 times and got numbers ranging from 50 to 500, then 125 is not a useful result to report.
The number of runs for each experiment set depends on the variability of the output and your tolerance. Even the 124.8 to 125.2 is not useful if you want to be able to estimate to 1 decimal place. Look up 'standard error of the mean' in any statistics text book. Basically, if you do N runs, then a 95% confidence interval for the result is the average of the results for your N runs plus/minus 1.96 x standard deviation of the results / sqrt(N). If you want a narrower confidence interval, you need more runs.
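For illustration, the same calculation in Python, together with a rough estimate of how many runs a hypothetical target precision would require (the numbers and the target half-width are made up):

```python
import numpy as np

# Hypothetical results of N = 10 runs at one parameter setting.
results = np.array([124.9, 125.1, 124.8, 125.2, 125.0,
                    124.9, 125.1, 125.0, 124.8, 125.2])

n = len(results)
mean = results.mean()
sem = results.std(ddof=1) / np.sqrt(n)      # standard error of the mean
half_width = 1.96 * sem                     # 95% confidence half-width
print(f"{mean:.2f} +/- {half_width:.2f}")

# Rough planning rule: runs needed for a desired half-width
# (halving the half-width takes about four times as many runs).
target = 0.05                               # desired half-width, same units as the output
needed = int(np.ceil((1.96 * results.std(ddof=1) / target) ** 2))
print("runs needed for that precision:", needed)
```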
The other thing to consider is that if you are looking for a relationship over the parameter space, then you need fewer runs at each point than if you are trying to do a point estimate of the result.
Not sure exactly what you mean, but maybe you can check the books of Hastie and Tibshirani
http://web.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII_print4.pdf
especially the sections on resampling methods (cross-validation and bootstrap).
They also have a shorter book that covers the methods possibly relevant to your case, along with the R commands to run them. However, this book, as far as I know, is not free.
http://www.springer.com/statistics/statistical+theory+and+methods/book/978-1-4614-7137-0
Also, you could perturb the initial conditions to check whether the outcome changes after small perturbations of the initial conditions or parameters. On a larger scale, you can sometimes partition the parameter space with regard to the final state of the system.
1) The number of simulations for each parameter setting can be decided by studying the coefficient of variation Cv = s / u, where s and u are the standard deviation and mean of the result, respectively. It is explained in detail in this paper: Coefficient of variance. (A small sketch of this idea follows below.)
2) The simulations where parameters are changed can be analyzed using several methods illustrated in the paper Testing methods.
These papers provide rigorous analysis methods and refer to other papers that may be relevant to your question and your research.
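As a sketch of point 1), one way to turn the coefficient of variation into a stopping rule is to keep adding runs until the relative standard error of the mean (Cv / sqrt(N)) drops below a chosen threshold. The run_model stub and the threshold below are assumptions for illustration only:

```python
import math
import random
import statistics

def run_model(seed: int) -> float:
    # stand-in for one simulation run with a given seed; returns the output metric
    random.seed(seed)
    return 100 + random.gauss(0, 5)

results = []
target = 0.005                      # desired relative precision of the mean (arbitrary choice)
for seed in range(1, 5000):
    results.append(run_model(seed))
    if len(results) >= 10:
        cv = statistics.stdev(results) / statistics.mean(results)   # Cv = s / u
        if cv / math.sqrt(len(results)) < target:                    # relative SEM
            break
print(len(results), "runs, Cv =", round(cv, 4))
```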

Test the randomness of a black box that outputs random 64-bit floats

I got this interview question and need to write a function for it. I failed.
Because it is a phone interview question, I don't think what I am supposed to code really needs to be a perfect randomness tester.
Any ideas?
How to write some code to be a reasonable randomness tester within like 30 minutes during an interview?
edit
The distribution in this question is uniform.
As this is an interview question, I think the interviewers are looking to assess two things:
Ability to understand what the requirements of the problem really are.
Ability to think of some code that would address those requirements.
This could be a really good interview question in certain settings, especially if the interviewer were willing to prompt the candidate with questions as and when necessary.
In terms of understanding the requirements of the question, it helps if you know that this is a really difficult problem, witness the Diehard tests mentioned in pjs's answer. Fundamentally I think a candidate would need to demonstrate appreciation of two things:
(a) The overall distribution of the numbers should match the desired distribution (I'm assuming it is uniform in this case, but as #pjs points out in comments this assumption should be made explicit).
(b) Each number drawn should be independent from the previous numbers drawn.
With half an hour to code something up in a phone interview, you can't go very far. If I were answering this question I would try to suggest something like:
(a) To test the distribution, come up with a set of equal-sized bins for the floating point numbers, and count the numbers that fall into each bin. Plot a histogram and eyeball it (plotting the data is always a good idea). To extend this, you could use a chi-squared test, as described in amit's answer.
However, as discussed in the comments, and here
The main problem with chi squared test is the choice of number and size of the intervals. Although rules of thumb can help produce good results, there is no panacea for all kinds of applications.
To this end, the Kolmogorov-Smirnov test can be used. The idea behind this test is that a plot of the ordered data should be a good fit against the perfectly ordered data (i.e. the cumulative distribution). For a uniform distribution the perfect ordered data is a straight line: you expect the 10th percentile of the data to be 10% of the way through the range, the 20th to be 20% of the way through the range, and so on. So, programmatically, you could sort the data, plot it against the ideal values, and you should get a straight line. There is also a formal, quantitative statistical test you can apply, which is based on the differences between the actual and ideal values.
(b) To test independence, there are multiple approaches. Autocorrelation at various time lags is one fairly obvious one: to what extent is the value at time t similar to the value at time t+1, for example. The runs test is another nice one: you convert all the numbers into 1 or 0 depending on whether they fall above or below the median, and then the distribution of the length of runs can be used to construct a statistical test. The runs test can also be used to test for runs in one direction or another, as described here and here (this might be more useful in your case). Both of these have fairly straightforward implementations so long as you have the formulas to hand!
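As a sketch, a lag-1 autocorrelation check and the run-length ingredient of the runs test might look like this in Python/NumPy (the black-box output is faked here with NumPy's own generator):

```python
import numpy as np

x = np.random.default_rng(2).random(10_000)   # stand-in for the black-box output

def lag_autocorr(x, lag=1):
    # correlation between the series and itself shifted by `lag`
    a, b = x[:-lag], x[lag:]
    return np.corrcoef(a, b)[0, 1]

print(lag_autocorr(x, 1))   # should be close to 0 for independent samples

# Runs-test ingredient: lengths of runs above/below the median.
above = x > np.median(x)
change_points = np.flatnonzero(above[1:] != above[:-1]) + 1
run_lengths = np.diff(np.r_[0, change_points, above.size])
print("number of runs:", run_lengths.size, "longest run:", run_lengths.max())
```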
Apart from the diehard tests, other good sources discussing random number generators include here and here.
To check whether a random number generator (or any other probability distribution, for that matter) matches a desired model (in your case, a uniform distribution), you should use a statistical test: Pearson's chi-squared test.
The test is based on collecting observations and comparing them to the counts expected under the theoretical model you are assuming the numbers come from.
At the end, the test gives you a p-value: the probability of seeing observations at least this extreme if the sample really did come from the given model.
A simple example:
Given a die and the draws [5,3,5,5,1,1]: is the die fair? (p=1/6 for each of {1,...,6})
Given the above observations we create the expected vector E = [1,1,1,1,1,1] (each entry is N/6, where 6 is the number of outcomes and N is the number of draws, here N = 6), and the observed vector O = [2,0,1,0,3,0].
From this we compute the statistic:
chi^2 = sum((O_i - E_i)^2 / E_i) = 1/1 + 1/1 + 0/1 + 1/1 + 4/1 + 1/1 = 8
Now we need the probability P(chi^2 >= 8) under the chi-squared distribution with 6 - 1 = 5 degrees of freedom (number of outcomes minus one). This probability is about 0.16, so with only six draws we cannot reject the hypothesis that the die is fair at the usual 5% level; a larger sample would be needed to draw a confident conclusion.
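Applied to the interview problem, the same test can be sketched by binning the floats into equal-width bins and comparing observed with expected counts; the black_box generator and the bin count below are stand-ins, and scipy.stats.chisquare does the tail-probability lookup:

```python
import numpy as np
from scipy.stats import chisquare

def black_box(n):
    # stand-in for the generator under test
    return np.random.default_rng(3).random(n)

samples = black_box(100_000)
bins = 100
observed, _ = np.histogram(samples, bins=bins, range=(0.0, 1.0))
expected = np.full(bins, samples.size / bins)   # uniform => equal expected counts

stat, p_value = chisquare(observed, expected)
print(f"chi^2 = {stat:.1f}, p = {p_value:.3f}")  # small p => evidence against uniformity
```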
You're saying that they wanted you to recreate/reinvent the "diehard" battery of tests that it took Marsaglia many years to develop? I'd call them on unreasonable expectations.
Whatever distribution the random floats are supposed to have, say a uniform distribution over the interval [0,1], you can use the Kolmogorov-Smirnov test http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test to check whether a sample does not follow the desired distribution. This can have advantages over the chi-squared test if you have many possible values (because if you have more possible values than samples, you have to define buckets for the chi-squared test, which makes the test less powerful compared to general distribution checking like Kolmogorov-Smirnov).
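For illustration, a Kolmogorov-Smirnov check against the uniform distribution on [0,1] might look like this in SciPy (the sample is again a stand-in for the black box):

```python
import numpy as np
from scipy.stats import kstest

samples = np.random.default_rng(4).random(100_000)   # stand-in for the black box

# Compare the empirical CDF against the uniform CDF on [0, 1].
stat, p_value = kstest(samples, "uniform")
print(f"D = {stat:.4f}, p = {p_value:.3f}")   # small p => evidence against uniformity
```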

Comparison of Sorting Algorithms using running time in terms of seconds

I have devised a test to compare the running times of my sorting algorithm with insertion sort, bubble sort, quicksort, selection sort, and shell sort. I have based my test on the test done on this website http://warp.povusers.org/SortComparison/index.html, but I modified it a bit.
I set up a test-manager server program which generates the data and sends it to the clients that run the different algorithms, so they all sort the same data and there is no bias.
I noticed that the insertion sort, bubble sort, and selection sort algorithms really did run for a very long time (some more than 15 minutes) just to sort a single data set of size 100,000 or 1,000,000.
So I changed the number of runs per test case for those two data sizes. My original number of runs for 100,000 was 500 but I reduced it to 15, and for 1,000,000 it was 100 and I reduced it to 3.
Now my professor doubts the credibility of the results because I've reduced the number of runs that much, but as far as I've observed, the running time for sorting a specific data distribution varies only by a small percentage, which is why I believe that even with the reduced number of runs I can still approximate the average runtime for that specific test case of that algorithm.
My question now is: is my assumption wrong? Does the machine at times introduce significant running-time changes (>50%)? For example, sorting the same data over and over, if a first run takes 0.3 milliseconds, could a second run differ as much as taking 1.5 seconds? Because from my observation, the running times don't vary much given the same type of test distribution (e.g. completely random, completely sorted, completely reversed).
What you are looking for is a way to measure the error in your experiments. My favorite book on the subject is Error Analysis by Taylor, and Chapter 4 has what you need, which I'll summarize here.
You need to calculate the standard error of the mean, or SDOM. First calculate the mean and standard deviation (the formulas are on Wikipedia and quite simple). Your SDOM is the standard deviation divided by the square root of the number of measurements. Assuming your timings have a normal distribution (which they should), twice the value of the SDOM is a very common way to specify the +/- error.
For example, let's say you run the sorting algorithm 5 times and get the following numbers: 5, 6, 7, 4, 5. Then the mean is 5.4 and the standard deviation is 1.1. Therefore the SDOM is 1.1/sqrt(5) = 0.5, so 2*SDOM = 1. Now you can say that the algorithm's run time was 5.4 ± 1. Your professor can determine whether this is an acceptable measurement error. Notice that as you take more readings, your SDOM, i.e. the plus-or-minus error, goes down in inverse proportion to the square root of N. The ±2·SDOM interval has a 95% probability (confidence) that the true value lies within it, which is the accepted standard.
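The same arithmetic in a few lines of Python, using the example numbers above:

```python
import math
import statistics

times = [5, 6, 7, 4, 5]
mean = statistics.mean(times)                  # 5.4
sd = statistics.stdev(times)                   # ~1.14 (sample standard deviation)
sdom = sd / math.sqrt(len(times))              # ~0.51
print(f"{mean:.1f} +/- {2 * sdom:.1f}")        # 5.4 +/- 1.0
```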
Also, you most likely want to measure performance using CPU time instead of a simple timer. Modern CPUs are complex, with various cache levels and pipeline optimizations, and you might end up with less accurate measurements if you use a wall-clock timer. More about CPU time is in this answer: How can I measure CPU time and wall clock time on both Linux/Windows?
It absolutely does. You need a variety of "random" samples in order to be able to draw proper conclusions about the population.
Look at it this way. It takes a long time to poll 100,000 people in the U.S. about their political stance. If we reduce the sample size to 100 people in order to complete it faster, we not only reduce the precision of our final result (2 decimal places rather than 5), we also introduce a larger chance that the members of the sample have a specific bias (there is a greater chance that 100 people out of 3xx,000,000 think the same way than 100,000 out of those same 3xx,000,000).
Your professor is right; however, he hasn't provided the details, some of which I mention here:
Sampling issue: It's right that you generate some random numbers and feed them to your sorting methods, but with only a few test cases you are indeed biased, because almost all random functions are biased to some extent (especially with respect to the machine state or the time at that moment), so you should use more test cases to be more confident about the randomness.
Machine state: Even supposing you've provided perfect data (fully representative of a uniform distribution), the performance of electro-mechanical devices like computers may vary in different situations, so you should repeat the runs a considerable number of times to smooth out the effects of these phenomena.
Note: In advanced technical reports, you should provide a confidence coefficient for the answers you give, derived from statistical analysis and justified step by step, but if you don't need to be that exact, simply increase these:
The size of the data
The number of tests
