About random number sequence generation - algorithm

I am new to randomized algorithms and am learning the topic on my own by reading books. I am currently reading Data Structures and Algorithm Analysis by Mark Allen Weiss.
Suppose we only need to flip a coin; thus, we must generate a 0 or 1
randomly. One way to do this is to examine the system clock. The clock
might record time as an integer that counts the number of seconds
since January 1, 1970 (at least on Unix systems). We could then use the
lowest bit. The problem is that this does not work well if a sequence
of random numbers is needed. One second is a long time, and the clock
might not change at all while the program is running. Even if the time
were recorded in units of microseconds, if the program were running by
itself the sequence of numbers that would be generated would be far
from random, since the time between calls to the generator would be
essentially identical on every program invocation. We see, then, that
what is really needed is a sequence of random numbers. These numbers
should appear independent. If a coin is flipped and heads appears,
the next coin flip should still be equally likely to come up heads or
tails.
Following are my questions on the above text snippet.
The author says that using the lowest bit of a seconds-counting clock does not work well, because "one second is a long time, and the clock might not change at all" while the program runs. My question is: why is one second a long time, and in what sense does the clock not change, given that it does change every second? Please help me understand with a simple example.
Also, why does the author say that even with microseconds we don't get a sequence of random numbers?
Thanks!

Programs using random (or in this case pseudo-random) numbers usually need plenty of them in a short time. That's one reason why simply using the clock doesn't really work: the system clock doesn't update as fast as your code requests new numbers, so you're quite likely to get the same value over and over again until the clock changes. It's particularly noticeable on Unix systems, where the usual way of getting the time only gives you one-second accuracy. And even microseconds don't really help, because computers are far faster than that by now.
The second problem you want to avoid is linear dependency of pseudo-random values. Imagine you want to place a number of dots in a square, randomly. You'll pick an x and a y coordinate. If your pseudo-random values are a simple linear sequence (like what you'd obtain naïvely from a clock) you'd get a diagonal line with many points clumped together in the same place. That doesn't really work.
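To make that concrete, here is a tiny Java sketch (the starting value is made up) of what happens if you take successive readings of a fast-ticking counter, a stand-in for the clock, as x and y coordinates: every point ends up on the same diagonal.
public class DiagonalDots {
    public static void main(String[] args) {
        int clock = 12345;                     // pretend clock/counter value (arbitrary)
        for (int i = 0; i < 10; i++) {
            int x = clock++ % 100;             // "random" x taken from the clock
            int y = clock++ % 100;             // "random" y taken from the next tick
            System.out.println(x + ", " + y);  // y is always x + 1 here: a diagonal line
        }
    }
}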
One of the simplest types of pseudo-random number generators, the linear congruential generator, has a similar problem, even though it's not so readily apparent at first sight. Due to the very simple formula X(n+1) = (a * X(n) + c) mod m, you'll still get quite predictable results, although you only notice it if you plot points in 3D space: all the numbers lie on a small number of distinct planes (a problem all pseudo-random generators exhibit at a certain dimension).
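As an illustration (not from the book), here is a minimal LCG sketch using the infamous RANDU parameters (a = 65539, c = 0, m = 2^31), which make the effect easy to reproduce: plot consecutive output triples as 3D points and they fall on just 15 planes.
public class LcgPlanes {
    static long seed = 1;

    // X(n+1) = (a * X(n) + c) mod m, here with RANDU's a = 65539, c = 0, m = 2^31
    static double next() {
        seed = (65539L * seed) % (1L << 31);
        return seed / (double) (1L << 31);     // scale into [0, 1)
    }

    public static void main(String[] args) {
        // Print consecutive triples (x, y, z); plotted in 3D they line up on planes.
        for (int i = 0; i < 10; i++) {
            System.out.printf("%.4f %.4f %.4f%n", next(), next(), next());
        }
    }
}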

Computers are fast. I'm oversimplifying, but if your clock speed is measured in GHz, the machine can perform billions of operations in one second. Relatively speaking, one second is an eternity, so it is entirely possible that the clock value does not change between reads.
If your program is just doing its regular work, it is not guaranteed to sample the clock at a random moment, so you don't get a random number.

Don't forget that for a computer, a single second can be 'an eternity'. Programs and algorithms often execute in a matter of milliseconds (thousandths of a second).
The following pseudocode:
for(int i = 0; i < 1000; i++)
    n = rand(0, 1000)
fills n a thousand times with a random number between 0 and 1000. On a typical machine, this loop finishes almost immediately.
You typically initialize the seed only once, at the beginning. The following pseudocode:
srand(time());
for(int i = 0; i < 1000; i++)
    n = rand(0, 1000)
initializes the seed once and then runs the loop, generating a seemingly random set of numbers. The problem arises when you execute the program multiple times. Let's say the code executes in 3 milliseconds, and then executes again in another 3 milliseconds, both within the same second. The seed is then the same, and so is the resulting set of numbers.
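Here is a small Java sketch of that failure mode (assuming a second-resolution clock is used as the seed, as in the book's Unix example): two "program runs" started within the same second produce exactly the same sequence.
import java.util.Random;

public class SameSeedDemo {
    public static void main(String[] args) {
        // Both "runs" happen within the same second, so they get the same seed.
        long seed = System.currentTimeMillis() / 1000;  // second-resolution clock
        Random run1 = new Random(seed);
        Random run2 = new Random(seed);
        for (int i = 0; i < 5; i++) {
            // Prints identical pairs: same seed means the same sequence.
            System.out.println(run1.nextInt(1000) + " " + run2.nextInt(1000));
        }
    }
}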
For the second point: the author probably assumes a fast computer. The above problem still holds...

What he means is that you cannot control how fast your computer (or any other computer) runs your code. Assuming the code takes a whole second to execute is far from realistic; if you run it yourself you will see that it finishes in milliseconds, so even millisecond resolution is not enough to ensure you get random numbers!

Related

Getting the maximum data an algorithm can process in a certain time span

Well, meanwhile this is the second time I've been given an exercise where I have to determine (in this case for sorting algorithms) how many numbers I can sort with a certain algorithm (on my own computer) so that the algorithm runs for exactly one minute.
This is a practical exercise, meaning I must actually generate enough numbers so that it runs that long. Now I ask myself, since I haven't had this problem in ten years of programming: how can I possibly do this? My first attempt was a bit brute-forcy, which resulted in an instant StackOverflow.
I could make an array (or several) and fill them up with random numbers, but determining how many end up in one minute of runtime would be a terribly long task, since you would always need to wait.
What can I do to find this out efficiently? Measure the difference between, let's say, 10 and 20 numbers and calculate how many it would take to fill a minute? Sounds easy, but algorithms (especially sorting algorithms) are rarely linear.
You know the time complexity of each algorithm in question. For example, bubble sort takes O(n*n) time. Make a relatively small sample run, D = 1000 records, and measure the time it takes (T milliseconds). For example, suppose it takes 15 seconds = 15000 milliseconds.
Now, with more or less accuracy, you can expect D*2 records to be processed 4 times slower. And vice versa: you need about D*sqrt(60000/T) records to process them in 1 minute. For this example, you need D*sqrt(60000/15000) = D*sqrt(4) = D*2 = 2000 records.
This method is not accurate enough to get an exact number, and in most cases an exact number of records is not fixed anyway; it fluctuates from run to run. Also, for many algorithms the time taken depends on the values in your record set. For example, the worst case for quicksort is O(n*n), while the normal case is O(n*log(n)).
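A rough Java sketch of that procedure (bubble sort stands in for any O(n*n) algorithm; the sample size is made up and the result is only a ballpark estimate):
import java.util.Random;

public class SizeForOneMinute {
    // Stand-in O(n*n) algorithm for the measurement: bubble sort.
    static void bubbleSort(int[] a) {
        for (int i = 0; i < a.length - 1; i++)
            for (int j = 0; j < a.length - 1 - i; j++)
                if (a[j] > a[j + 1]) { int t = a[j]; a[j] = a[j + 1]; a[j + 1] = t; }
    }

    public static void main(String[] args) {
        int d = 1000;                                              // small sample size D
        int[] sample = new Random().ints(d).toArray();
        long start = System.currentTimeMillis();
        bubbleSort(sample);
        long t = Math.max(1, System.currentTimeMillis() - start);  // T in milliseconds
        // For an O(n*n) algorithm the time grows with the square of the size,
        // so a ~60000 ms run needs roughly D * sqrt(60000 / T) records.
        long target = Math.round(d * Math.sqrt(60000.0 / t));
        System.out.println("D=" + d + " took " + t + " ms; try about " + target + " records for one minute.");
    }
}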
You could use something like this:
long startTime = System.currentTimeMillis();
int times = 0;
boolean done = false;
while(!done){
    // run algorithm
    times++;
    if(System.currentTimeMillis() - startTime >= 60000)
        done = true;
}
Or, if you don't want to wait that long, you can replace the 60000 with 1000 and then multiply times by 60; it won't be very accurate though.
It would be time consuming to generate a new random input every time, so you can either use an array that you populate beforehand and then index with the times variable, or always reuse the same input, the one you know is the most expensive to process, so that you get the minimum number of times the algorithm can run in a minute.

Comparison of Sorting Algorithms using running time in terms of seconds

I have devised a test in order to compare the different running times of my sorting algorithm with Insertion sort, bubble sort, quick sort, selection sort, and shell sort. I have based my test using the test done in this website http://warp.povusers.org/SortComparison/index.html, but I modified my test a bit.
I set up a test manager program server which generates the data, and the test manager sends it to the clients that run the different algorithms, therefore they are sorting the same data to have no bias.
I noticed that the insertion sort, bubble sort, and selection sort algorithms really did run for a very long time (some more than 15 minutes) just to sort a single data set of size 100,000 or 1,000,000.
So I changed the number of runs per test case for those two data sizes. My original number of runs for 100,000 was 500, which I reduced to 15, and for 1,000,000 it was 100, which I reduced to 3.
Now my professor doubts the credibility of the results because I've reduced the number of runs that much, but as I've observed, the running time for sorting a specific data distribution varies only by a small percentage. That is why I think that, even with the reduced number of runs, I can still approximate the average runtime for that specific test case of that algorithm.
My question now is: is my assumption wrong? Does the machine sometimes produce significant running-time changes (>50%)? For example, when sorting the same data over and over, if a first run takes 0.3 milliseconds, could a second run differ so much that it takes 1.5 seconds? Because from my observation, the running times don't vary much for the same type of test distribution (e.g. completely random, completely sorted, completely reversed).
What you are looking for is a way to measure the error in your experiments. My favorite book on the subject is Error Analysis by Taylor, and Chapter 4 has what you need, which I'll summarize here.
You need to calculate the standard error of the mean, or SDOM. First calculate the mean and the standard deviation (the formulas are on Wikipedia and quite simple). Your SDOM is the standard deviation divided by the square root of the number of measurements. Assuming your timings have a Normal distribution (which they should), twice the value of the SDOM is a very common way to specify a +/- error.
For example, let's say you run the sorting algorithm 5 times and get the following numbers: 5, 6, 7, 4, 5. Then the mean is 5.4 and the standard deviation is 1.1. Therefore the SDOM is 1.1/sqrt(5) = 0.5, so 2*SDOM = 1. Now you can say that the algorithm run time was 5.4 ± 1. Your professor can decide whether this is an acceptable measurement error. Notice that as you take more readings, your SDOM, i.e. the plus-or-minus error, goes down in inverse proportion to the square root of N. The twice-SDOM interval gives 95% probability (confidence) that the true value lies within it, which is the accepted standard.
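Here is a short Java sketch of that calculation, using the same example timings (5, 6, 7, 4, 5) so the numbers can be checked against the ones above:
public class SdomExample {
    public static void main(String[] args) {
        double[] times = {5, 6, 7, 4, 5};   // example run times
        int n = times.length;

        double mean = 0;
        for (double t : times) mean += t;
        mean /= n;                                      // 5.4

        double sumSq = 0;
        for (double t : times) sumSq += (t - mean) * (t - mean);
        double stdDev = Math.sqrt(sumSq / (n - 1));     // sample standard deviation, about 1.1

        double sdom = stdDev / Math.sqrt(n);            // standard error of the mean, about 0.5
        System.out.printf("run time = %.1f +/- %.1f%n", mean, 2 * sdom);  // 5.4 +/- 1.0
    }
}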
Also, you most likely want to measure performance using CPU time instead of a simple wall-clock timer. Modern CPUs are complex, with multiple cache levels and pipeline optimizations, and you might end up with a less accurate measurement if you use a plain timer. More about CPU time is in this answer: How can I measure CPU time and wall clock time on both Linux/Windows?
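For the CPU-time point, Java exposes per-thread CPU time through the standard ThreadMXBean API; here is a minimal sketch, with a dummy workload standing in for the sort under test:
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class CpuVsWallTime {
    public static void main(String[] args) {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        long cpuStart = bean.getCurrentThreadCpuTime();   // CPU time in nanoseconds
        long wallStart = System.nanoTime();

        // Dummy workload; replace with the sorting run you actually want to measure.
        double x = 0;
        for (int i = 0; i < 10_000_000; i++) x += Math.sqrt(i);

        long cpuMs = (bean.getCurrentThreadCpuTime() - cpuStart) / 1_000_000;
        long wallMs = (System.nanoTime() - wallStart) / 1_000_000;
        System.out.println("CPU time:  " + cpuMs + " ms (result " + x + ")");
        System.out.println("Wall time: " + wallMs + " ms");
    }
}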
It absolutely does. You need a variety of "random" samples in order to be able to draw proper conclusions about the population.
Look at it this way. It takes a long time to poll 100,000 people in the U.S. about their political stance. If we reduce the sample size to 100 people in order to complete it faster, we not only reduce the precision of our final result (2 decimal places rather than 5), we also introduce a larger chance that the members of the sample have a specific bias (there is a greater chance that 100 people out of 3xx,000,000 think the same way than 100,000 out of those same 3xx,000,000).
Your professor is right; however, he has not provided the details, so I mention some of them here:
Sampling issue: It's fine that you generate some random numbers and feed them to your sorting methods, but with only a few test cases you are indeed biased, because almost all random functions are biased to some extent (especially with respect to the state of the machine or the time of the call), so you should use more and more test cases to be more confident about the randomness.
Machine state: Even if you provided perfect data (fully representative of a uniform distribution), the performance of electro-mechanical devices like computers may vary in different situations, so you should repeat the runs a considerable number of times to smooth out these effects.
Note: In advanced technical reports, you should provide a confidence coefficient for the answers you give, derived from statistical analysis and proven step by step, but if you don't need to be that exact, simply increase these:
The size of the data
The number of tests

efficient way to detect periodicity of network flows

I have lots of netflow data (i.e. src_ip, dest_ip, beg_time, end_time, data_size, etc.), and some of the flows occur periodically; those are the ones I want to find.
Suppose I have n netflows (maybe around 10^6) and m of them are periodic. How can I find which ones are periodic?
I can write code for this, but it would be at least O(n^3 log n), which takes forever beyond about 10^4 netflows.
I have searched for this but couldn't find anything.
Note: You can assume the data is sorted by start time and that the start time is a 32-bit unsigned int (uint32 in C++).
Correction: src_ip is unique and dest_ip is not unique; the period is unknown, it may be 5 minutes or it may be 5 days. You can forget about src_ip, dest_ip, end_time, data_size and the other attributes of a flow. I'm only looking for events whose beginning times are periodic, and you can assume I have already eliminated unrelated events, such as those with different src_ip's, and so on...
Any help will be appreciated,
Thanks
I'd try computing FFT on signals corresponding to your data.
For example, I'd transform the chunk beg_time=1, end_time=5, data_size=100 into a square pulse from 1 to 5 units of time with the amplitude 100.
If you want analyze everything together, you superimpose all the pulses you've got.
If it doesn't make sense to put everything together, superimpose only the pulses from the same src_ip or from the same pair of src_ip and dst_ip.
And then run the FFT on the signals obtained through superposition and see whether there are any noticeable peaks in the frequency domain, or whether it all looks randomish, with no outstanding peaks.
FFT runs in O(n*log(n)) time, where n is the number of signal samples.
I'm sure there must be better ways to do it, but it may be worth a try.
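A very rough Java sketch of that idea (a plain O(n^2) DFT over a binned signal of start times; the flow times are made up, and a real run over 10^6 flows would use an FFT library instead):
public class PeriodicityCheck {
    public static void main(String[] args) {
        // Made-up flow start times in seconds: roughly one flow every 300 s, plus two stray ones.
        long[] startTimes = {0, 300, 601, 900, 1199, 1500, 1801, 2100, 437, 1712};

        int binSeconds = 60;                    // resolution of the sampled signal
        int bins = 2400 / binSeconds;           // 2400 s observation window -> 40 bins
        double[] signal = new double[bins];
        for (long t : startTimes) {
            int b = (int) (t / binSeconds);
            if (b < bins) signal[b] += 1.0;     // amplitude = flows starting in this bin
        }

        // Direct DFT magnitudes; a peak at frequency k corresponds to a period of bins/k bins.
        for (int k = 1; k < bins / 2; k++) {
            double re = 0, im = 0;
            for (int t = 0; t < bins; t++) {
                double angle = 2 * Math.PI * k * t / bins;
                re += signal[t] * Math.cos(angle);
                im -= signal[t] * Math.sin(angle);
            }
            double magnitude = Math.sqrt(re * re + im * im);
            System.out.printf("period %4d s -> magnitude %.2f%n", bins * binSeconds / k, magnitude);
        }
    }
}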

Valid random generation algorithm or not?

long timeValue = timeInMillis();
int rand = (int) (timeValue % 100) + 1;
If we execute the above code N times in a loop, it will generate N random numbers between 1 and 100. I know generating random numbers is a tough problem. I just want to know: is this a good random number generation algorithm, or is it a pseudo-random number generator?
Why I think this will produce a good approximation of random behaviour:
1) All numbers from 1 to 100 will be uniformly distributed; there is no bias.
2) timeInMillis() will show somewhat random behaviour, because we can never really guess at what time the CPU will execute this function. There are so many different tasks running on the CPU that the exact time at which the timeInMillis() instruction executes is not predictable in the next iteration of the loop.
No. For a start, on most processors, this will loop many times (probably the full 100) within 1 millisecond, which will result in 100 identical numbers.
Even seeding a random number generator with a timer tick can be dangerous - the timer tick is rarely as "random" as you might expect.
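A quick way to see this for yourself (a Java sketch of the code in the question, using the real millisecond clock): the whole loop usually finishes inside one millisecond, so every "random" number comes out the same.
public class ClockModuloDemo {
    public static void main(String[] args) {
        for (int i = 0; i < 10; i++) {
            long timeValue = System.currentTimeMillis();
            int rand = (int) (timeValue % 100) + 1;
            // Typically prints the same value ten times, because the loop
            // finishes well within a single millisecond.
            System.out.println(rand);
        }
    }
}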
Here is my suggestion for generating random numbers:
1- Choose a bunch of websites that are as far away from your location as possible, e.g. if you are in the US, try websites whose servers are in Malaysia, China, Russia, India, etc. Servers with high traffic are better.
2- During times of high internet traffic in your country (in my country it is roughly 7 to 11 pm), ping those websites many, many times, take each ping result (use only the integer value) and compute its value modulo 2 (i.e. from each ping operation you get one bit: either 0 or 1).
3- Repeat the process for several days, recording the results.
4- Collect all the bits you got from all your pings (you will probably have hundreds of thousands of bits) and choose your bits from them (maybe you want to choose the bits using some data obtained from the same method mentioned above :) ).
BE CAREFUL: in your code you should check for timeouts, etc.

How do I find the running time given algorithm speed and computer speed?

I'm currently working through an assignment that deals with Big-O and running times. I have this one question presented to me that seems to be very easy, but I'm not sure if I'm doing it correctly. The rest of the problems have been quite difficult, and I feel like I'm overlooking something here.
First, you have these things:
Algorithm A, which has a running time of 50n^3.
Computer A, which has a speed of 1 millisecond per operation.
Computer B, which has a speed of 2 milliseconds per operation.
An instance of size 300.
I want to find how long it takes algorithm A to solve this instance on computer A, and how long it takes it on computer B.
What I want to do is substitute 300 for n, so you have 50*(300^3) = 1,350,000,000.
Then, multiply that by 1 millisecond for the first computer, and by 2 milliseconds for the second computer.
This feels odd to me, though, because it says the "running time" is 50n^3, not, "the number of operations is 50n^3", so I get the feeling that I'm multiplying time by time, and would end up with units of milliseconds squared, which doesn't seem right at all.
I would like to know if I'm right, and if not, what the question actually means.
It wouldn't make sense if you had O(n^3), but you are not using big-O notation in your example. That is, if you had O(n^3) you would know the complexity of your algorithm, but you would not know the exact number of operations, as you said.
Instead, it looks as though you are given the exact number of operations performed (even though it is not explicitly stated), so substituting for n makes sense.
Big-O notation describes how the size of the input affects your running time or memory usage, but with big O alone you could not deduce an exact running time, even given the speed of each operation.
Adding an explanation of why your answer looks so simple (as I described above) would also be a safe move. But I'm sure that even without it you'll get the marks for the question.
Well, aside from the pointlessness of estimating how long something will take this way on most modern computers (though it might have some meaning in an embedded system), the way you did it does look right to me.
If the algorithm needs 50n^3 operations to complete something, where n is the number of elements to process, then substituting 300 for n will give you the number of operations to perform, not a time-unit.
So multiply by the time per operation and you get the time needed.
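A tiny worked version of that substitution (just the arithmetic from the question, nothing more):
public class RunningTimeEstimate {
    public static void main(String[] args) {
        long n = 300;
        long operations = 50L * n * n * n;    // 50n^3 = 1,350,000,000 operations
        long msPerOpA = 1;                    // computer A: 1 ms per operation
        long msPerOpB = 2;                    // computer B: 2 ms per operation
        System.out.println("Computer A: " + operations * msPerOpA + " ms");
        System.out.println("Computer B: " + operations * msPerOpB + " ms");
    }
}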
Looks right to me.
Your 50*n^3 figure is called a "running time", but that's because the model used for speed evaluation assumes a machine with several basic operations, each of which takes 1 time unit.
In your case, running the algorithm takes 50*300^3 time units. On computer A each time unit is 1 ms, and on computer B it is 2 ms.
Hope this puts some sense into the units,
Asaf.
