RNG seed selection highly affects simulation outcomes - random

I'm developing my first simulation in OMNeT++ 5.4, using the available queueinglib library. In particular, I've built a simple model that comprises a server and some passive queues.
I've repeated the simulation several times with different seeds and parameters, as described in the OMNeT++ Simulation Manual. Here is my omnetpp.ini:
# specify how many times a run needs to be repeated
repeat = 100
# default: different seed for each run
# seed-set = ${runnumber}
seed-set = ${repetition} # I tried both lines
OppNet.source.interArrivalTime = exponential(${10,5,2.5}s)
This produces 300 runs, 100 repetitions for each value of the interArrivalTime distribution parameter.
However, I'm observing some "strange" behaviour, namely, the resulting statistics are highly variable according to the RNG seed.
For example, considering the lengths of the queues in my model, in the majority of the runs I get values smaller than 10, while in a few others I get a mean value that differs by orders of magnitude (85000, 45000?).
Does this mean that my implementation is wrong? Or is it possible that the random seed selection could influence the simulation outcomes so heavily?
Any help or hint is appreciated, thank you.

I cannot rule out that your implementation is wrong without seeing it, but it is entirely possible that you just happened to configure a chaotic scenario.
And in that case, yes, any minor change in the inputs (in this case the PRNG seed) might cause a major divergence in the results.
EDIT:
Especially considering a given (non-trivial) queuing network, if you are varying the "load" (number/rate of incoming jobs/messages, with some random distribution), you might observe different "regimes" in the results:
with very light load, there is barely any queuing
with a really heavy load (close to, or at, max capacity), most queues are almost always loaded, or even growing constantly without limit
and somewhere in between the two, you might get this erratic behavior, where sometimes a couple of queues suddenly grow really deep, then are emptied, then different queues are loaded up, and so on; as a function of either time, or the PRNG seed, or both - this is what I mean by chaotic
But this is just speculation...

Nobody can say whether your implementation is right or wrong without seeing it. However, there are some general rules that apply to queues which you should be aware of. You say that you're varying the interArrivalTime distribution parameter. A very important concept in queueing is the traffic intensity, which is the ratio of the arrival rate to the service rate. If that ratio is less than one, the line length can vary by a great deal, but in the long run there will be time intervals when the queue empties out because on average the server can handle more customers than arrive. This is a stable queue. However, if that ratio is greater than one, the queue will grow unboundedly. The longer you run the system, the longer the line will get. The thing that surprises many people is that the line will also go to infinity asymptotically when traffic intensity is equal to one.
The other thing to know is that the closer the traffic intensity gets to one for a stable queue, the greater the variability is. That's because the average increases, but there will always be periods of line length zero as described above. In order to have zeros always present but have the average increasing, there must be times where the queue length gets above the average, which implies the variability must be increasing. Changing the random number seed gives you some visibility on the magnitude of the variance that's possible at any slice of time.
Bottom line here is that you may just be seeing evidence that queues are weirder and more variable than you thought.
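As a rough illustration of the point about traffic intensity, here is a minimal sketch that assumes a single-server M/M/1 queue (Poisson arrivals, exponential service times), which is not necessarily the asker's model. The names mm1_mean_and_variance and mean_wait are hypothetical. The first function uses the standard M/M/1 steady-state formulas; the second simulates waiting times with the Lindley recursion so that different seeds can be compared.

```python
import random


def mm1_mean_and_variance(rho):
    """Steady-state mean and variance of the number in system for M/M/1.

    rho is the traffic intensity (arrival rate / service rate).
    """
    if not 0 <= rho < 1:
        raise ValueError("a steady state exists only for 0 <= rho < 1")
    mean = rho / (1 - rho)
    variance = rho / (1 - rho) ** 2
    return mean, variance


def mean_wait(seed, arrival_rate, service_rate, n_jobs=10000):
    """Average waiting time in queue for one seeded run (Lindley recursion)."""
    rng = random.Random(seed)
    wait = 0.0   # waiting time of the current job
    total = 0.0
    for _ in range(n_jobs):
        total += wait
        service = rng.expovariate(service_rate)  # this job's service time
        gap = rng.expovariate(arrival_rate)      # time until the next arrival
        # The next job waits for whatever work is left when it arrives.
        wait = max(0.0, wait + service - gap)
    return total / n_jobs


# Both the mean and the variance blow up as rho approaches 1.
for rho in (0.5, 0.9, 0.99):
    print(rho, mm1_mean_and_variance(rho))
```

Calling mean_wait with several seeds at arrival_rate=0.5 and again at arrival_rate=0.95 (with service_rate=1.0) should show the spread across seeds widening as the load approaches 1.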

Related

How does the function add_disk_randomness work?

I am trying to understand the add_disk_randomness function from the linux kernel. I read a few papers, but they don't describe it very well. This code is from /drivers/char/random.c:
add_timer_randomness(disk->random, 0x100 + disk_devt(disk));
disk_devt(disk) holds the variable disk->devt. I understand it as a unique block device number. But what are the major and minor device numbers?
Then the block device number and the hex value 0x100 are added together.
Then we also collect the time value disk->random. Is this the seek time for each block?
These two values will be passed to the function add_timer_randomness. It would be nice to get an example with values.
The first parameter of add_timer_randomness is a pointer to struct timer_rand_state. You can confirm this by checking struct gendisk.
timer_rand_state from random.c is reproduced below
/* There is one of these per entropy source */
struct timer_rand_state {
    cycles_t last_time;
    long last_delta, last_delta2;
};
This struct stores the timestamp of the last input event as well as previous "deltas". add_timer_randomness first gets the current time (measured in jiffies), then reads last_time (also in jiffies), then overwrites last_time with the first value.
The first, second, and third order "deltas" are tracked as a means of estimating entropy. The main source of entropy from hard disk events is the timing of those events. More data is hashed into the entropy pool, but it doesn't contribute to entropy estimates. (It is important not to overestimate how unpredictable the data you hash in is. Otherwise your entropy pool, and therefore your RNG output too, may be predictable. Underestimating entropy on the other hand cannot make RNG output more predictable. It is always better to use a pessimistic estimator in this respect. That is why data that doesn't contribute to entropy estimates is still hashed into the entropy pool.)
Delta is the time between two events. (The difference between timestamps.) The second order delta is the difference between the times between two events. (Difference between deltas.) The third order delta is the difference between second order deltas. The timer_rand_state pointer is the memory location that tracks the previous timestamp and deltas. delta3 does not need to be stored.
The entropy estimate from this timing data is based on the logarithm of the smallest absolute value of deltas one, two, and three. (Not exactly the logarithm. It's always an integer, for example. It's always rounded down by one bit. And if the value you're taking the almost-logarithm of is zero the result is also zero.)
Say you have a device used as an entropy source that generates a new event every 50 milliseconds. The delta will always be 50ms. The second order delta is always zero. Since one of the three deltas is zero, this prevents this device's timings from being relied on as a significant entropy source. The entropy estimator successfully fails to overestimate input entropy, so even if this device is used as an entropy source it won't "poison" the entropy pool with predictability.
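The 50 ms example can be traced with a small Python model of the delta bookkeeping. This is only a sketch, not the kernel code: the class and function names are hypothetical, the estimator takes the smallest absolute delta (which is what makes a zero delta suppress the estimate), and the cap of 11 bits is an assumption based on my reading of older kernel versions.

```python
class TimerRandState:
    """Toy stand-in for struct timer_rand_state (hypothetical Python model)."""

    def __init__(self):
        self.last_time = 0
        self.last_delta = 0
        self.last_delta2 = 0


def estimate_entropy_bits(state, now, cap=11):
    """Return the entropy credit (in bits) for an event at timestamp `now`."""
    delta = now - state.last_time        # first order: time since last event
    state.last_time = now
    delta2 = delta - state.last_delta    # second order: change in delta
    state.last_delta = delta
    delta3 = delta2 - state.last_delta2  # third order: change in delta2
    state.last_delta2 = delta2
    smallest = min(abs(delta), abs(delta2), abs(delta3))
    # "Almost-logarithm": bit length after dropping one bit, capped (assumed cap).
    return min((smallest >> 1).bit_length(), cap)


# A perfectly periodic source: one event every 50 time units.
state = TimerRandState()
periodic = [estimate_entropy_bits(state, t) for t in range(50, 501, 50)]
print(periodic)  # only the very first event earns credit; the rest earn 0
```

After the first event, delta2 is zero on every call, so the estimate stays at zero no matter how many events arrive.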
The entropy estimate isn't based on any formal mathematics. We can't construct an accurate model of the entropy source because we don't know what it is. We don't know what the hardware on a user's computer will be exactly or exactly how it will behave in an unknown environment. We just want to know that if we add one to the (estimated) entropy counter then we've hashed at least one bit of entropy worth of unpredictable data into the entropy pool. Extra data besides just the timings is hashed into the pool without increasing the entropy counter, so we hope that if the timer-based entropy estimator sometimes overestimates then maybe there is some unpredictability in the non-timer-based source we didn't account for. (And if that's the case your RNG is still safe.)
I'm sure that sounds unconvincing, but I don't know how to help that. I tried my best to explain the relevant parts of the random.c code. Even if I could mind meld and provide some intuition for how the process works it probably would still be unsatisfying.

How many simulations do I need to run?

Hello, my problem is related to the validation of a model. I have written a program in NetLogo that I'm going to use in a report for my thesis, and the question now is: how many repetitions (simulations) do I need to run to justify my results? I have already read about some methods that use a statistical approach, and my colleagues have suggested some nice mathematical operations, but I would also like to know from people who work with computational models what kind of statistical test or mathematical method they use to decide this.
There are two aspects to this (1) How many parameter combinations (2) How many runs for each parameter combination.
(1) Generally you would do experiments, where you vary some of your input parameter values and see how some model output changes. Take the well known Schelling segregation model as an example, you would vary the tolerance value and see how the segregation index is affected. In this case you might vary the tolerance from 0 to 1 by 0.01 (if you want discrete) or you could just take 100 different random values in the range [0,1]. This is a matter of experimental design and is entirely affected by how fine you wish to examine your parameter space.
(2) For each experimental value, you also need to run multiple simulations so that you can calculate the average and reduce the impact of randomness in the simulation run. For example, say you ran the model with a value of 3 for your input parameter (whatever it means) and got a result of 125. How do you know whether the 'real' answer is 125 or something else? If you ran it 10 times and got 10 different numbers in the range 124.8 to 125.2 then 125 is not an unreasonable estimate. If you ran it 10 times and got numbers ranging from 50 to 500, then 125 is not a useful result to report.
The number of runs for each experiment set depends on the variability of the output and your tolerance. Even the 124.8 to 125.2 is not useful if you want to be able to estimate to 1 decimal place. Look up 'standard error of the mean' in any statistics text book. Basically, if you do N runs, then a 95% confidence interval for the result is the average of the results for your N runs plus/minus 1.96 x standard deviation of the results / sqrt(N). If you want a narrower confidence interval, you need more runs.
The other thing to consider is that if you are looking for a relationship over the parameter space, then you need fewer runs at each point than if you are trying to do a point estimate of the result.
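The confidence interval described above can be computed in a few lines of Python. This is a minimal sketch using only the standard library; the function names are hypothetical, and the 1.96 factor assumes the normal approximation (for a small number of runs, a Student t quantile would give a somewhat wider interval). Rearranging the same formula gives a rough estimate of how many runs are needed for a target interval half-width.

```python
import math
import statistics


def confidence_interval_95(results):
    """95% confidence interval for the mean of `results` (normal approximation)."""
    n = len(results)
    mean = statistics.mean(results)
    # Standard error of the mean: sample standard deviation / sqrt(N).
    sem = statistics.stdev(results) / math.sqrt(n)
    half_width = 1.96 * sem
    return mean - half_width, mean + half_width


def runs_needed(sample_std, half_width):
    """Runs needed so that 1.96 * s / sqrt(N) is at most `half_width`."""
    return math.ceil((1.96 * sample_std / half_width) ** 2)


# Results of three pilot runs (made-up numbers for illustration).
pilot = [10, 12, 14]
print(confidence_interval_95(pilot))
print(runs_needed(statistics.stdev(pilot), half_width=1.0))
```

A common workflow is to do a small pilot batch, estimate the standard deviation from it, and then use runs_needed to size the full experiment.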
Not sure exactly what you mean, but maybe you can check the book by Hastie and Tibshirani
http://web.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII_print4.pdf
especially the sections on resampling methods (cross-validation and bootstrap).
They also have a shorter book that covers the methods possibly relevant to your case, along with the R commands to run them. However, this book, as far as I know, is not free.
http://www.springer.com/statistics/statistical+theory+and+methods/book/978-1-4614-7137-0
Also, you could perturb the initial conditions or parameters slightly to check that the outcome doesn't change after small perturbations. On a larger scale, sometimes you can break down the parameter space with regard to the final state of the system.
1) The number of simulations for each parameter setting can be decided by studying the coefficient of variation Cv = s / u, where s and u are the standard deviation and mean of the result respectively. It is explained in detail in this paper Coefficient of variance.
2) The simulations where parameters are changed can be analyzed using several methods illustrated in the paper Testing methods.
These papers provide rigorous analysis methods and refer to other papers which may be relevant to your question and your research.

Comparison of Sorting Algorithms using running time in terms of seconds

I have devised a test to compare the running times of my sorting algorithm with insertion sort, bubble sort, quick sort, selection sort, and shell sort. I have based my test on the one done on this website http://warp.povusers.org/SortComparison/index.html, but I modified it a bit.
I set up a test manager server program which generates the data and sends it to the clients that run the different algorithms, so they all sort the same data and there is no bias.
I noticed that the insertion sort, bubble sort, and selection sort algorithms really did run for a very long time (some more than 15 minutes) just to sort one given data for sizes of 100,000 and 1,000,000.
So I changed the number of runs per test case for those two data sizes. My original runs for the 100,000 was 500 but I reduced it to 15, and for 1,000,000 was 100 and I reduced it to 3.
Now my professor doubts the credibility of my results because I reduced the number of runs that much. However, I observed that the running time for sorting a specific data distribution varied only by a small percentage, so I believe that even with so few runs I can still approximate the average runtime of that algorithm for that test case.
My question now is: is my assumption wrong? Does the machine sometimes produce significant changes in running time (>50%)? For example, when sorting the same data over and over, if the first run takes 0.3 milliseconds, could the second run take as long as 1.5 seconds? From my observation, the running times don't vary much for the same type of test distribution (e.g. completely random, completely sorted, completely reversed).
What you are looking for is a way to measure error in your experiments. My favorite book on subject is Error Analysis by Taylor and Chapter 4 has what you need which I'll summarize here.
You need to calculate the standard error of the mean, also called the standard deviation of the mean (SDOM). First calculate the mean and standard deviation (the formulas are on Wikipedia and quite simple). Your SDOM is the standard deviation divided by the square root of the number of measurements. Assuming your timings have a Normal distribution (which they should), twice the value of the SDOM is a very common way to specify the +/- error.
For example, let's say you run the sorting algorithm 5 times and get the following numbers: 5, 6, 7, 4, 5. Then the mean is 5.4 and the standard deviation is 1.1. Therefore the SDOM is 1.1/sqrt(5) = 0.5, so 2*SDOM = 1. Now you can say that the algorithm's run time was 5.4 ± 1. Your professor can determine whether this is an acceptable measurement error. Notice that as you take more readings, your SDOM, i.e. the plus or minus error, goes down in inverse proportion to the square root of N. An interval of twice the SDOM gives roughly 95% confidence that the true value lies within the interval, which is the accepted standard.
Also, you most likely want to measure performance using CPU time instead of a simple wall-clock timer. Modern CPUs are complex, with various cache levels and pipeline optimizations, and you might end up with a less accurate measurement if you use a plain timer. More about CPU time is in this answer: How can I measure CPU time and wall clock time on both Linux/Windows?
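To make the distinction between CPU time and wall-clock time concrete, here is a minimal Python sketch (the asker's language is not stated, so Python is an assumption, and time_sort is a hypothetical helper). It times the built-in sort with both clocks: time.perf_counter measures elapsed wall-clock time, while time.process_time counts only CPU time used by the current process.

```python
import random
import time


def time_sort(data, repeats=5):
    """Sort a fresh copy of `data` several times; return (wall, cpu) timings."""
    wall_times, cpu_times = [], []
    for _ in range(repeats):
        work = list(data)  # copy so every run sorts the same unsorted input
        wall_start = time.perf_counter()
        cpu_start = time.process_time()
        work.sort()
        cpu_times.append(time.process_time() - cpu_start)
        wall_times.append(time.perf_counter() - wall_start)
    return wall_times, cpu_times


rng = random.Random(0)  # fixed seed so every algorithm can be fed the same data
data = [rng.random() for _ in range(100000)]
wall, cpu = time_sort(data)
print("wall:", wall)
print("cpu: ", cpu)
```

The per-run lists can then be fed into the mean and SDOM calculation to report a run time with an error bar.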
It absolutely does. You need a variety of "random" samples in order to be able to draw proper conclusions about the population.
Look at it this way. It takes a long time to poll 100,000 people in the U.S. about their political stance. If we reduce the sample size to 100 people in order to complete it faster, we not only reduce the precision of our final result (2 decimal places rather than 5), we also introduce a larger chance that the members of the sample have a specific bias (there is a greater chance that 100 people out of 3xx,000,000 think the same way than 100,000 out of those same 3xx,000,000).
Your professor is right; however, he has not provided the details, some of which I mention here:
Sampling issue: It's right that you generate some random numbers and feed them to your sorting methods, but with only a few test cases you are indeed biased, because almost all random functions are biased to some extent (especially by the state of the machine or the time at that moment), so you should use more test cases to be more confident about the randomness.
Machine state: Suppose you've provided perfect data (fully representative of a uniform distribution). The performance of electro-mechanical devices like computers may still vary in different situations, so you should repeat the test a considerable number of times to smooth out the effects of these phenomena.
Note: In advanced technical reports, you should provide a confidence coefficient for the answers you provide, derived from statistical analysis and proven step by step, but if you don't need to be that exact, simply increase these:
The size of the data
The number of tests

About random number sequence generation

I am new to randomized algorithms and am learning them by myself from books. I am reading the book Data Structures and Algorithm Analysis by Mark Allen Weiss.
Suppose we only need to flip a coin; thus, we must generate a 0 or 1
randomly. One way to do this is to examine the system clock. The clock
might record time as an integer that counts the number of seconds
since January 1, 1970 (at least on Unix systems). We could then use the
lowest bit. The problem is that this does not work well if a sequence
of random numbers is needed. One second is a long time, and the clock
might not change at all while the program is running. Even if the time
were recorded in units of microseconds, if the program were running by
itself the sequence of numbers that would be generated would be far
from random, since the time between calls to the generator would be
essentially identical on every program invocation. We see, then, that
what is really needed is a sequence of random numbers. These numbers
should appear independent. If a coin is flipped and heads appears,
the next coin flip should still be equally likely to come up heads or
tails.
My questions on the above snippet are as follows.
The author suggests using the lowest bit of the seconds counter, then says this does not work because "one second is a long time, and the clock might not change at all". Why is one second a long time, given that the clock changes every second? In what context does the author mean that the clock does not change? Please help me understand with a simple example.
Why does the author say that even with microseconds we don't get a sequence of random numbers?
Thanks!
Programs using random (or in this case pseudo-random) numbers usually need plenty of them in a short time. That's one reason why simply using the clock doesn't really work: the system clock doesn't update as fast as your code is requesting new numbers, so you're quite likely to get the same results over and over again until the clock changes. It's probably more noticeable on Unix systems, where the usual method of getting the time only gives you one-second accuracy. And not even microseconds really help, as computers are way faster than that by now.
The second problem you want to avoid is linear dependency of pseudo-random values. Imagine you want to place a number of dots in a square, randomly. You'll pick an x and a y coordinate. If your pseudo-random values are a simple linear sequence (like what you'd obtain naïvely from a clock) you'd get a diagonal line with many points clumped together in the same place. That doesn't really work.
One of the simplest types of pseudo-random number generators, the Linear Congruential Generator, has a similar problem, even though it's not so readily apparent at first sight. Due to the very simple formula
X(n+1) = (a * X(n) + c) mod m
you'll still get quite predictable results, albeit only if you pick points in 3D space, as all numbers lie on a number of distinct planes (a problem all pseudo-random generators exhibit at a certain dimension).
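For a concrete case, here is a minimal Python sketch of a linear congruential generator, x_next = (a * x + c) mod m. As an illustration I have picked the parameters of RANDU (a = 65539, c = 0, m = 2^31), a historical generator known for this flaw; that choice is mine, not the answer's. With these parameters every three consecutive outputs satisfy a fixed linear relation, which is exactly why triples used as 3D points fall on a small number of planes.

```python
def lcg(seed, a, c, m):
    """Yield an endless stream from x_next = (a * x + c) mod m."""
    x = seed
    while True:
        x = (a * x + c) % m
        yield x


def randu(seed):
    """RANDU parameters, chosen here only as a well-known bad example."""
    return lcg(seed, a=65539, c=0, m=2**31)


gen = randu(1)
xs = [next(gen) for _ in range(1000)]

# Every consecutive triple obeys x[k+2] = 6*x[k+1] - 9*x[k] (mod 2^31),
# so the third coordinate of each 3D point is fixed by the first two.
violations = sum(
    1
    for k in range(len(xs) - 2)
    if (xs[k + 2] - 6 * xs[k + 1] + 9 * xs[k]) % 2**31 != 0
)
print("first outputs:", xs[:2])
print("triples violating the relation:", violations)
```

The relation follows from 65539 = 2^16 + 3, whose square is congruent to 6 * 65539 - 9 modulo 2^31.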
Computers are fast. I'm oversimplifying, but if your clock speed is measured in GHz, the machine can do billions of operations in 1 second. Relatively speaking, 1 second is an eternity, so it is quite possible that the clock does not change between calls.
If your program is doing regular operation, it is not guaranteed to sample the clock at a random time. Therefore, you don't get a random number.
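A quick way to see this is to implement the book's coin flip literally. The following is a small Python sketch (the function name is hypothetical) that takes the lowest bit of the clock in whole seconds a thousand times in a row and counts how often the value changes.

```python
import time


def clock_coin_flips(n):
    """Flip n 'coins' using the lowest bit of the clock in whole seconds."""
    return [int(time.time()) & 1 for _ in range(n)]


flips = clock_coin_flips(1000)
# Count how many times consecutive flips differ.
changes = sum(1 for a, b in zip(flips, flips[1:]) if a != b)
print(f"{len(flips)} flips, value changed {changes} time(s)")
```

The loop finishes in far less than a second, so the flips are all identical, apart from at most one change if the run happens to straddle a second boundary.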
Don't forget that for a computer, a single second can be 'an eternity'. Programs / algorithms are often executed in a matter of milliseconds (thousandths of a second).
The following pseudocode:
for(int i = 0; i < 1000; i++)
    n = rand(0, 1000)
fills n a thousand times with a random number between 0 and 1000. On a typical machine, this script executes almost immediately.
Typically you only initialize the seed once, at the beginning, as in the following pseudocode:
srand(time());
for(int i = 0; i < 1000; i++)
    n = rand(0, 1000)
initializes the seed once and then executes the code, generating a seemingly random set of numbers. The problem arises when you execute the code multiple times. Let's say the code executes in 3 milliseconds. Then the code executes again in 3 milliseconds, with both runs in the same second. The result is then the same set of numbers.
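The same-second problem can be reproduced deterministically in Python by passing the timestamp in as a parameter instead of reading the real clock. This is a sketch: run_program is a hypothetical stand-in for one invocation of the program above, and the timestamp values are arbitrary.

```python
import random


def run_program(timestamp_seconds, count=5):
    """One 'program run': seed from a whole-second timestamp, then draw numbers."""
    rng = random.Random(timestamp_seconds)  # plays the role of srand(time())
    return [rng.randint(0, 1000) for _ in range(count)]


# Two runs that start within the same second see the same timestamp...
first = run_program(1700000000)
second = run_program(1700000000)
# ...while a run that starts one second later gets a different seed.
later = run_program(1700000001)

print(first == second)  # identical sequences
print(first == later)   # different seed, different sequence
```

Seeding from a higher-resolution or operating-system entropy source avoids the collision, at the cost of reproducibility.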
For the second point: the author probably assumes a FAST computer. The above problem still holds...
What he means is that you cannot control how fast your computer, or any other computer, runs your code. So assuming 1 second per execution is far off. If you run the code yourself, you will see that it executes in milliseconds, so even that resolution is not enough to ensure you get random numbers!

Algorithm(s) for spotting anomalies ("spikes") in traffic data

I find myself needing to process network traffic captured with tcpdump. Reading the traffic is not hard, but what gets a bit tricky is spotting where there are "spikes" in the traffic. I'm mostly concerned with TCP SYN packets and what I want to do is find days where there's a sudden rise in the traffic for a given destination port. There's quite a bit of data to process (roughly one year).
What I've tried so far is an exponential moving average. This was good enough to let me get some interesting measures out, but compared with external data sources it seems to be a bit too aggressive in flagging things as abnormal.
I've considered using a combination of the exponential moving average plus historical data (possibly from 7 days in the past, thinking that there ought to be a weekly cycle to what I am seeing), as some papers I've read seem to have managed to model resource usage that way with good success.
So, does anyone know of a good method, or somewhere to go and read up on this sort of thing?
The moving average I've been using looks roughly like:
avg = avg+0.96*(new-avg)
With avg being the EMA and new being the new measure. I have been experimenting with what thresholds to use, but found that a combination of "must be a given factor higher than the average prior to weighing the new value in" and "must be at least 3 higher" gives the least bad result.
This is widely studied in intrusion detection literature. This is a seminal paper on the issue which shows, among other things, how to analyze tcpdump data to gain relevant insights.
This is the paper: http://www.usenix.org/publications/library/proceedings/sec98/full_papers/full_papers/lee/lee_html/lee.html. They use the RIPPER rule induction system; I guess you could replace that old system with something newer, such as http://www.newty.de/pnc2/ or http://www.data-miner.com/rik.html
I would apply two low-pass filters to the data, one with a long time constant, T1, and one with a short time constant, T2. You would then look at the magnitude difference in output from these two filters and when it exceeds a certain threshold, K, then that would be a spike. The hardest part is tuning T1, T2 and K so that you don't get too many false positives and you don't miss any small spikes.
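The following is a single pole IIR low-pass filter, where old is the previous filter output, the new on the right-hand side is the latest input sample, and the new on the left-hand side is the updated output: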
new = k * old + (1 - k) * new
The value of k determines the time constant and is usually close to 1.0 (but < 1.0 of course).
I am suggesting that you apply two such filters in parallel, with different time constants, e.g. start with say k = 0.9 for one (short time constant) and k = 0.99 for the other (long time constant) and then look at the magnitude difference in their outputs. The magnitude difference will be small most of the time, but will become large when there is a spike.
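The two-filter idea can be sketched in Python as follows. This is a minimal illustration, not a tuned detector: detect_spikes is a hypothetical name, both filters are initialised to the first sample, and the threshold of 5.0 is an arbitrary value that would need tuning against real traffic counts.

```python
def detect_spikes(samples, k_short=0.9, k_long=0.99, threshold=5.0):
    """Return indices where the two low-pass filter outputs diverge."""
    short = long_ = float(samples[0])  # start both filters at the first sample
    spikes = []
    for i, x in enumerate(samples):
        # Single pole IIR low-pass filters with different time constants.
        short = k_short * short + (1 - k_short) * x
        long_ = k_long * long_ + (1 - k_long) * x
        if abs(short - long_) > threshold:
            spikes.append(i)
    return spikes


# Flat traffic of 10 packets per interval with one burst of 200 at index 50.
traffic = [10] * 50 + [200] + [10] * 50
print(detect_spikes(traffic))
```

Note that the flagged range extends for a few samples after the burst, because the short filter needs time to decay back toward the long one; reporting only the first index of each run is one way to handle that.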
