Python Probability Algorithm

I am looking for help with a Python algorithm that will take a percent or fraction (such as 45% or 4500/10000), test it multiple times, and count how many times it comes out true and how many times it comes out false.
Basically, I am looking for an algorithm that takes a probability, tests it multiple times, and reports how many times you, say, survived or died.
Is this possible, and can anyone help me?

Loop over the following for the number of trials you want:
Generate a random integer from 0 (inclusive) up to the denominator (exclusive) if it's a fraction, or a real number between 0 and 1 if it's a percent.
If the value is less than the numerator/percent, record a success; otherwise record a failure.
You can find information on generating random values in the Python documentation, and how you determine whether you're working with a percent or a fraction will depend on how you accept and parse user input.
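A minimal Python sketch of the fraction path (the function name and numbers are my own; the percent path would just use random.random() < probability instead):

import random

def run_trials(numerator, denominator, trials):
    # Simulate `trials` events that each succeed with probability
    # numerator/denominator (e.g. 4500/10000 for 45%).
    successes = 0
    for _ in range(trials):
        # randrange(denominator) is uniform over 0..denominator-1, so the
        # comparison is true with probability numerator/denominator.
        if random.randrange(denominator) < numerator:
            successes += 1
    return successes, trials - successes

# Example: a 45% survival chance tested 10,000 times.
survived, died = run_trials(4500, 10000, 10_000)
print(f"survived: {survived}, died: {died}")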

Related

How many simulations do I need to do?

Hello, my problem is more related to the validation of a model. I have written a program in NetLogo that I'm going to use in a report for my thesis, but now the question is: how many repetitions (simulations) do I need to do to justify my results? I have already read about some methods using a statistical approach, and my colleagues have suggested some nice mathematical techniques, but I also want to hear from people who work with computational models which statistical tests or mathematical methods they use for this.
There are two aspects to this: (1) how many parameter combinations, and (2) how many runs for each parameter combination.
(1) Generally you would do experiments where you vary some of your input parameter values and see how some model output changes. Take the well-known Schelling segregation model as an example: you would vary the tolerance value and see how the segregation index is affected. In this case you might vary the tolerance from 0 to 1 by 0.01 (if you want discrete steps), or you could just take 100 different random values in the range [0,1]. This is a matter of experimental design and is entirely driven by how finely you wish to examine your parameter space.
(2) For each experimental value, you also need to run multiple simulations so that you can calculate the average and reduce the impact of randomness in the simulation run. For example, say you ran the model with a value of 3 for your input parameter (whatever it means) and got a result of 125. How do you know whether the 'real' answer is 125 or something else? If you ran it 10 times and got 10 different numbers in the range 124.8 to 125.2, then 125 is not an unreasonable estimate. If you ran it 10 times and got numbers ranging from 50 to 500, then 125 is not a useful result to report.
The number of runs for each experiment set depends on the variability of the output and your tolerance. Even the 124.8 to 125.2 range is not useful if you want to estimate to one decimal place. Look up 'standard error of the mean' in any statistics textbook. Basically, if you do N runs, then a 95% confidence interval for the result is the average of the results for your N runs plus/minus 1.96 × (standard deviation of the results / sqrt(N)). If you want a narrower confidence interval, you need more runs.
The other thing to consider is that if you are looking for a relationship over the parameter space, then you need fewer runs at each point than if you are trying to do a point estimate of the result.
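As a sketch of that calculation in Python (the results list is made-up data, just to show the arithmetic):

import math
import statistics

results = [124.9, 125.1, 124.8, 125.2, 125.0,
           124.9, 125.1, 125.0, 124.8, 125.2]  # hypothetical N = 10 runs

n = len(results)
mean = statistics.mean(results)
sem = statistics.stdev(results) / math.sqrt(n)  # standard error of the mean
half_width = 1.96 * sem                         # 95% confidence half-width

print(f"mean = {mean:.3f} +/- {half_width:.3f} (95% CI)")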
Not sure exactly what you mean, but maybe you can check the books of Hastie and Tibshirani:
http://web.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII_print4.pdf
especially the sections on resampling methods (cross-validation and the bootstrap).
They also have a shorter book that covers the methods possibly relevant to your case, along with the R commands to run them. However, that book, as far as I know, is not free.
http://www.springer.com/statistics/statistical+theory+and+methods/book/978-1-4614-7137-0
Also, you could perturb the initial conditions to check that the outcome doesn't change after small perturbations of the initial conditions or parameters. On a larger scale, you can sometimes partition the parameter space according to the final state of the system.
1) The number of simulations for each parameter setting can be decided by studying the coefficient of variation, Cv = s / u, where s and u are the standard deviation and the mean of the result, respectively. It is explained in detail in this paper: Coefficient of variance.
2) The simulations where parameters are changed can be analyzed using several methods illustrated in the paper Testing methods.
These papers provide rigorous analysis methods and refer to other papers that may be relevant to your question and your research.
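As a rough illustration (my own, not from either paper) of turning the coefficient of variation into a stopping rule:

import random
import statistics

def cv(results):
    # Coefficient of variation: sample standard deviation / mean
    # (assumes a nonzero mean).
    return statistics.stdev(results) / statistics.mean(results)

def run_until_stable(run_model, threshold=0.05, min_runs=10, max_runs=1000):
    # Keep adding simulation runs until the coefficient of variation
    # of the collected outputs drops below `threshold`.
    results = []
    while len(results) < max_runs:
        results.append(run_model())
        if len(results) >= min_runs and cv(results) < threshold:
            break
    return results

# Usage with a dummy "model" returning noisy values around 125:
runs = run_until_stable(lambda: random.gauss(125, 5))
print(len(runs), statistics.mean(runs))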

combining different instances of a random generator but still maintaining low discrepancy

I am generating about 100 million random numbers to pick from 300 items. I need to set it up so that I have 10 million independent instances (each with a different seed) that pick 10 times each. The goal is for the aggregate results to have very low discrepancy; that is, each item gets picked about the same number of times.
The problem is that with a regular PRNG, some numbers get chosen more often than others (I tried an LCG and the Mersenne Twister); the difference between the most picked and least picked can be several thousand to tens of thousands. With linear congruential generators and the Mersenne Twister, I also tried picking 100 million times with one instance, and that didn't yield uniform results either. I'm guessing this is because the period is very long and perhaps 100 million isn't big enough. Theoretically, if I pick enough numbers, the results should approach uniformity (they should settle at the expected value).
I switched to Sobol, a quasirandom generator, and got much better results with the 100-million-from-one-instance test (the difference between most picked and least picked is about 5). But when I split the picks into 10 million instances of 10 each, the uniformity was lost and I got results similar to the PRNG. Sobol seems very sensitive to sequence: skipping ahead randomly diminishes uniformity.
Is there a class of random generators that can maintain quasirandom-like low discrepancy even when combining 10 million independent instances, or is that theoretically impossible? One solution I can think of now is to use one Sobol generator shared across the 10 million instances, so effectively it is the same as the 100-million-from-one-instance test.
Both shuffling and proper use of Sobol should give you the uniformity you want. Shuffling needs to be done at the aggregate level: start with a global 100M sample having the desired aggregate frequencies, then shuffle it to introduce randomness, and finally split it into the 10-value instances (shuffling within each instance wouldn't help globally, as you noted).
But that's an additional level of uniformity that you might not really need: plain randomness might be enough.
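A sketch of that aggregate-level shuffle in Python, scaled down from 10 million instances to 10,000 so it runs quickly (the other constants are from the question):

import random
from collections import Counter

ITEMS = 300
INSTANCES = 10_000      # scaled down from 10 million for illustration
PICKS = 10
TOTAL = INSTANCES * PICKS

# Global sample with near-exact aggregate frequencies: every item
# appears either TOTAL // ITEMS or TOTAL // ITEMS + 1 times.
pool = [i % ITEMS for i in range(TOTAL)]
random.shuffle(pool)    # randomness is introduced at the aggregate level

# Split the shuffled pool into per-instance pick lists.
instances = [pool[k * PICKS:(k + 1) * PICKS] for k in range(INSTANCES)]

counts = Counter(pool)
print(max(counts.values()) - min(counts.values()))  # spread is 0 or 1 by construction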
First of all I would check the check itself, because it sounds strange that with enough samples you're really getting significant deviations (look up the chi-square test to quantify such significance, or equivalently to determine how many samples are 'enough'). So as a first sanity check: if you're picking independent values, simplify to 10M instances picking 10 values out of 2 categories: do you get approximately a binomial distribution? For exclusive picking it's a different distribution (hypergeometric, IIRC, but check). Then generalize to more categories (multinomial distribution), and only then is it safe to proceed with your problem.
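A sketch of that sanity check with scipy (scipy.stats.chisquare tests against uniform expected frequencies by default; the sample size is scaled down here):

import numpy as np
from scipy.stats import chisquare

ITEMS = 300
N = 1_000_000           # scaled down from 100 million

rng = np.random.default_rng()
picks = rng.integers(0, ITEMS, size=N)
observed = np.bincount(picks, minlength=ITEMS)

stat, p = chisquare(observed)   # H0: all 300 items are equally likely
print(f"spread = {observed.max() - observed.min()}, p-value = {p:.3f}")
# Note: a raw spread of a few thousand at N = 100M is what independent
# uniform picking actually produces; a tiny p-value, not a large spread,
# is the evidence of bias.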

About random number sequence generation

I am new to randomized algorithms, and am learning on my own by reading books. I am reading Data Structures and Algorithm Analysis by Mark Allen Weiss.
Suppose we only need to flip a coin; thus, we must generate a 0 or 1
randomly. One way to do this is to examine the system clock. The clock
might record time as an integer that counts the number of seconds
since January 1, 1970 (at least on Unix systems). We could then use the
lowest bit. The problem is that this does not work well if a sequence
of random numbers is needed. One second is a long time, and the clock
might not change at all while the program is running. Even if the time
were recorded in units of microseconds, if the program were running by
itself the sequence of numbers that would be generated would be far
from random, since the time between calls to the generator would be
essentially identical on every program invocation. We see, then, that
what is really needed is a sequence of random numbers. These numbers
should appear independent. If a coin is flipped and heads appears,
the next coin flip should still be equally likely to come up heads or
tails.
Following are my questions on the above text snippet.
In the above text, where the author says we could use the lowest bit of the count of seconds, he mentions that this does not work well because one second is a long time and the clock might not change at all while the program runs. My question is: why is one second a long time, and in what sense does the clock 'not change'? Doesn't the clock change every second? Please help me understand with a simple example.
Also, why does the author say that even with microsecond resolution we don't get a sequence of random numbers?
Thanks!
Programs using random (or in this case pseudo-random) numbers usually need plenty of them in a short time. That's one reason why simply using the clock doesn't really work: the system clock doesn't update as fast as your code requests new numbers, so you're quite likely to get the same result over and over again until the clock changes. It's probably more noticeable on Unix systems, where the usual method of getting the time only gives you second accuracy. Even microseconds don't really help, as computers are way faster than that by now.
The second problem you want to avoid is linear dependency of pseudo-random values. Imagine you want to place a number of dots in a square, randomly. You'll pick an x and a y coordinate. If your pseudo-random values are a simple linear sequence (like what you'd obtain naïvely from a clock) you'd get a diagonal line with many points clumped together in the same place. That doesn't really work.
One of the simplest types of pseudo-random number generators, the linear congruential generator, has a similar problem, even though it's not so readily apparent at first sight. Due to the very simple formula x_{n+1} = (a * x_n + c) mod m, you'll still get quite predictable results, albeit only apparent if you pick points in 3D space: all the numbers lie on a small number of distinct planes (a problem all pseudo-random generators exhibit at some dimension).
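A minimal LCG in Python to make the formula concrete (the constants are the classic ones used by many C library rand() implementations):

def lcg(seed, a=1103515245, c=12345, m=2**31):
    # Linear congruential generator: x_{n+1} = (a * x_n + c) mod m
    x = seed
    while True:
        x = (a * x + c) % m
        yield x

gen = lcg(42)
# Consecutive triples (x, y, z) from an LCG fall on a small number of
# planes in 3D space (Marsaglia's "random numbers fall mainly in the
# planes" result), which is the structure described above.
triples = [(next(gen), next(gen), next(gen)) for _ in range(5)]
print(triples)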
Computers are fast. I'm oversimplifying, but if your clock speed is measured in GHz, a CPU can do billions of operations in one second. Relatively speaking, one second is an eternity, so it is quite possible the clock does not change between calls.
If your program is doing regular operation, it is not guaranteed to sample the clock at a random time. Therefore, you don't get a random number.
Don't forget that for a computer, a single second can be 'an eternity'. Programs and algorithms often execute in a matter of milliseconds (1000ths of a second).
The following pseudocode:
for (int i = 0; i < 1000; i++)
    n = rand(0, 1000);
fills n a thousand times with a random number between 0 and 1000. On a typical machine, this script executes almost immediately.
You typically only initialize the seed once, at the beginning. The following pseudocode:
srand(time());
for (int i = 0; i < 1000; i++)
    n = rand(0, 1000);
initializes the seed once and then executes the loop, generating a seemingly random set of numbers. The problem arises when you execute the code multiple times. Let's say the code executes in 3 milliseconds, and then executes again in 3 milliseconds, both within the same second: time() returns the same seed, and the result is the same set of numbers.
For the second point: the author probably assumes a FAST computer. The above problem still holds...
What the author means is that you cannot control how fast your computer, or any other computer, runs your code. Assuming one second per execution is far off the mark: if you run the code yourself you will see that it executes in milliseconds, so even microsecond resolution is not enough to ensure you get random numbers!

How to generate a longer random number from a short random number?

I have a short random number input, let's say an int in the range 0-999.
I don't know the distribution of the input. Now I want to generate a random number in the range 0-99999 based on the input, without changing the shape of the distribution.
I know there is a way to map the input to [0, 1] by dividing it by 999 and then multiplying by 99999 to get the result. However, this method doesn't cover all the possible values: only 1000 of the 100000 outputs are reachable, so a value like 99998 will never get hit.
Assuming your input is some kind of source of randomness...
You can take two consecutive inputs and combine them:
input() + 1000*(input()%100)
Be careful though. This relies on the source having plenty of entropy, so that a given input number isn't always followed by the same subsequent input number. If your source is a PRNG designed to cycle between the numbers 0–999 in some fashion, this technique won't work.
With most production entropy sources (e.g., /dev/urandom), this should work fine. OTOH, with a production entropy source, you could fetch a random number between 0–99999 fairly directly.
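A sketch of that combination in Python, with a stand-in source() for whatever produces the 0-999 inputs:

import random

def source():
    # Stand-in for the 0-999 input source (here just uniform random).
    return random.randrange(1000)

def wide_random():
    # Combine two consecutive 0-999 inputs into one 0-99999 value.
    # 1000 is divisible by 100, so source() % 100 introduces no bias.
    return source() + 1000 * (source() % 100)

print(wide_random())  # uniform over 0..99999 when source() is uniform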
You can try something like the following:
(input * 100) + random
where random is a random number between 0 and 99.
The problem is that the input only specifies which block of 100 to use. For instance, 50 just says you will get a number between 5000 and 5099 (keeping a similar distribution shape). Which number between 5000 and 5099 to pick is up to you.
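The same idea as a quick sketch (again with a hypothetical 0-999 input value):

import random

def scaled_random(value):
    # The 0-999 input selects one of 1000 blocks of 100, and fresh
    # randomness picks the position inside that block.
    return value * 100 + random.randrange(100)

print(scaled_random(50))  # somewhere in 5000..5099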

Valid random generation algorithm or not?

long timeValue = timeInMillis();
int rand = timeValue%100 + 1;
If we execute the above code N times in a loop, it will generate N 'random' numbers between 1 and 100. I know generating random numbers is a tough problem; I just wanted to know whether this is a good random number generation algorithm, or whether it is a pseudo-random number generator.
Why I think this will produce a good approximation of random behavior:
1) All numbers from 1 to 100 will be uniformly distributed; there is no bias.
2) timeInMillis will show somewhat random behavior, because we can never really guess at what time the CPU will execute this function. There are so many different tasks running on the CPU that the exact time at which the timeInMillis() instruction executes is not predictable from one iteration of the loop to the next.
No. For a start, on most processors, this will loop many times (probably the full 100) within 1 millisecond, which will result in 100 identical numbers.
Even seeding a random number generator with a timer tick can be dangerous - the timer tick is rarely as "random" as you might expect.
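A quick Python equivalent of the snippet above makes the failure visible:

import time

values = [int(time.time() * 1000) % 100 + 1 for _ in range(100)]
print(len(set(values)), "distinct values out of 100")
# The whole loop runs well inside one millisecond on a typical machine,
# so this usually prints 1 (or occasionally 2) distinct values.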
Here is my suggestion for generating random numbers:
1- Choose a bunch of websites that are as far away from your location as possible, e.g. if you are in the US, try websites whose server IPs are in Malaysia, China, Russia, India, etc. Servers with high traffic are better.
2- During times of high internet traffic in your country (in my country it is roughly 7 to 11 pm), ping those websites many, many times, take each ping result (use only the integer value) and calculate its modulus 2 (i.e. from each ping operation you get one bit: either 0 or 1).
3- Repeat the process for several days, recording the results.
4- Collect all the bits you got from all your pings (you will probably get hundreds of thousands of bits) and choose your bits from them (maybe you want to choose which bits to keep using more data from the same method :) ).
BE CAREFUL: in your code you should check for timeouts, etc.
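A rough sketch of the idea (Unix-style ping output and flags are assumed here, and example.com is a placeholder host; parsing and options vary by platform, so treat this purely as illustration):

import re
import subprocess

def ping_bit(host):
    # One "random" bit: the integer part of a ping round-trip time,
    # reduced mod 2, as described above. Returns None on failure.
    try:
        out = subprocess.run(["ping", "-c", "1", "-W", "2", host],
                             capture_output=True, text=True,
                             timeout=5).stdout
    except subprocess.TimeoutExpired:
        return None
    m = re.search(r"time=([\d.]+)", out)
    if m is None:
        return None
    return int(float(m.group(1))) % 2

bits = [b for b in (ping_bit("example.com") for _ in range(8)) if b is not None]
print(bits)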
