How to distinguish between biased and random distributions - random

I am working on cryptography and need to distinguish numbers produced by a true random number generator from numbers produced by a biased generator.
Is there a theorem that tells us how many samples are needed to tell the two apart, and how large the bias must be?
Thank you in advance.

Related

Is it possible to predict what number will computer choose randomly?

I'm trying to find out which number the computer will randomly choose. Will I need a specific algorithm? Or do I need artificial intelligence?
You can't predict a true random number, by definition, as it is random. Pseudorandom numbers can be predicted if you know both the algorithm being used and the seed number being provided.
Here's a good article on the different methods of generating random numbers:
https://www.howtogeek.com/183051/htg-explains-how-computers-generate-random-numbers/
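To illustrate the point about pseudorandom numbers, here is a minimal sketch using Python's standard `random` module. The helper name `first_outputs` and the seed values are only illustrative assumptions; the idea is simply that anyone who knows the algorithm and the seed can reproduce the whole sequence.

```python
import random

def first_outputs(seed: int, count: int) -> list[int]:
    """Return the first `count` values in [0, 100) from a generator seeded with `seed`."""
    rng = random.Random(seed)  # private generator, so global state is untouched
    return [rng.randrange(100) for _ in range(count)]

# Someone who knows the seed regenerates exactly the same "random" numbers.
original = first_outputs(2024, 10)
predicted = first_outputs(2024, 10)
print(original == predicted)
```

Without the seed (or enough observed output to reconstruct the internal state), this kind of replay is not possible.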

Random number from many other random numbers, is it more random?

We want to generate a uniform random number from the interval [0, 1].
Let's first generate k random booleans (for example by rand()<0.5) and decide according to these on what subinterval [m*2^{-k}, (m+1)*2^{-k}] the number will fall. Then we use one rand() to get the final output as m*2^{-k} + rand()*2^{-k}.
Let's assume we have arbitrary precision.
Will a random number generated this way be 'more random' than the usual rand()?
PS. I guess the subinterval picking amounts to just choosing the binary representation of the output 0. b_1 b_2 b_3... one digit b_i at a time and the final step is adding the representation of rand() to the end of the output.
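The construction described in the question can be sketched roughly as follows. This uses ordinary floating-point numbers rather than arbitrary precision, and the function name `layered_uniform` is a made-up name for illustration.

```python
import random

def layered_uniform(k: int, rng: random.Random) -> float:
    """Pick one of 2**k equal subintervals with k coin flips, then a point inside it."""
    m = 0
    for _ in range(k):
        bit = 1 if rng.random() < 0.5 else 0  # one random boolean per binary digit
        m = (m << 1) | bit                    # m ends up in 0 .. 2**k - 1
    width = 2.0 ** -k
    # Offset of the chosen subinterval plus a uniform position within it.
    return m * width + rng.random() * width

rng = random.Random(1)
print([round(layered_uniform(4, rng), 3) for _ in range(5)])
```

Note that all k + 1 draws here come from the same generator, which is the situation the answers below discuss.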
It depends on the definition of "more random". Using more random generators means more internal state, and therefore a longer cycle length. But cycle length is just one property of a random generator. A cycle length of 2^64 is usually OK for almost any purpose (the only exception I know of is when you need a lot of different, long sequences, for example in some kinds of simulation).
However, if you combine two bad random generators, the result is not necessarily better; you have to analyze it. There are generators that do work this way, though. KISS is one example: it combines three not-too-good generators, and the result is a good generator.
For card shuffling, you'll need a cryptographic RNG. Even a very good but non-cryptographic RNG is inadequate for this purpose. For example, Mersenne Twister, which is a good RNG, is not suitable for secure card shuffling! This is because its internal state can be reconstructed by observing enough of its output numbers, so the shuffle result can be predicted.
This can help, but only if you use a different pseudorandom generator for the first and last bits. (It doesn't have to be a different pseudorandom algorithm, just a different seed.)
If you use the same generator, then you will still only be able to construct 2^n different shuffles, where n is the number of bits in the random generator's state.
If you have two generators, each with n bits of state, then you can produce up to a total of 2^(2n) different shuffles.
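The counting argument above can be checked with a little arithmetic. The sketch below assumes a standard 52-card deck and a 32-bit generator state purely as example figures; the function names are illustrative.

```python
import math

def bits_needed_for_all_shuffles(deck_size: int) -> float:
    """Bits of state required before every ordering of the deck is reachable."""
    return math.log2(math.factorial(deck_size))

def max_reachable_shuffles(state_bits: int, generators: int = 1) -> int:
    """Upper bound on distinct shuffles: one per combined generator state."""
    return 2 ** (state_bits * generators)

print(bits_needed_for_all_shuffles(52))                      # roughly 225.6 bits
print(max_reachable_shuffles(32) < math.factorial(52))       # one 32-bit generator
print(max_reachable_shuffles(32, 2) < math.factorial(52))    # two 32-bit generators
```

Under these assumptions, even two independent 32-bit generators reach only 2^64 of the roughly 2^225.6 possible orderings.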
Tinkering with a random number generator, as you are doing by using only one bit of its output and then calling it repeatedly, usually weakens its random properties. All RNGs fail some statistical tests for randomness, but you are more likely to find that a noticeable cycle crops up if you start making many calls and combining them.

Pseudorandom permutations vs random shuffle

I would like to apply a permutation test to a sequence with 4,000,000 elements. To my knowledge, it is infeasible because the number of possible permutations is ridiculously large (no RNG will generate uniformly distributed values in the range {1 ... 4000000!}). I've heard of pseudorandom permutations, though, and it sounds like something I need, but I can't work out whether it's actually a proper replacement for a random shuffle in my case.
If you are running a permutation test I presume that you want to generate a random sample from the set of all possible permutations, so that you can test some statistic calculated on the real data against the distribution of statistics calculated on the permuted data.
Algorithms for generating random permutations, such as those described at http://en.wikipedia.org/wiki/Random_permutation, typically use many random numbers, so there is no requirement for any single step of the generation process to need numbers as large as 4000000!. The only worry would be that, since the seed used to generate the random numbers is typically much smaller than 4000000!, not all permutations are possible.
There are other statistical tests which consume very large quantities of pseudo-random numbers (e.g. MCMC), so I wouldn't worry about this if you are using a random number generator which is commonly used for statistical tests. If you are worried about this, you could repeat the test with a cryptographically secure random number generator, such as http://docs.oracle.com/javase/6/docs/api/java/security/SecureRandom.html. This will be slower, so you might need to reduce the number of permutations tested, but it is very unlikely that it has any characteristic which would stand out far enough to affect your test results, because any such characteristic would be a security weakness - it would mean that, given a large quantity of random numbers already generated, you would have a slightly better than random chance of guessing the next number correctly.
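As a concrete instance of generating a permutation from many small random numbers, here is a sketch of the Fisher-Yates shuffle, one of the standard algorithms of this kind. The function name is illustrative, and Python's `random.Random` stands in for whichever generator you actually use.

```python
import random

def fisher_yates_shuffle(items, rng: random.Random) -> list:
    """Return a shuffled copy of `items`; each step draws one number no larger than len(items) - 1."""
    result = list(items)
    for i in range(len(result) - 1, 0, -1):
        j = rng.randrange(i + 1)                     # uniform index in 0..i
        result[i], result[j] = result[j], result[i]  # move the chosen element into place
    return result

rng = random.Random(0)
print(fisher_yates_shuffle(range(10), rng))
```

For a sequence of n elements this makes n - 1 draws, and no single draw needs a range anywhere near n!. To use a cryptographically secure source in Python instead, `random.SystemRandom()` can be passed as `rng`, since it offers the same `randrange` method.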

predicting non-random number from a series of random number

I got the following interesting task:
Given a list of 1 million numbers with 16 digits (say, credit card numbers), which includes 990,000 purely random numbers generated by a computer system, and 10,000 created manually by fraudsters. These numbers are labeled as genuine or fraud. Build an algorithm to predict non-random numbers.
My approach so far is a bit brute-force: looking at the non-random numbers to find patterns (such as repeated digits, like 22222, or sequences, like 01234).
I wonder if there's a ready-made algorithm or tool for this kind of task. I imagine this task is quite common in the fraud analytics community.
Thanks.
First off, if you know they're credit card numbers, use Luhn's algorithm, which is a quick checksum algorithm for valid credit card numbers.
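For reference, a minimal sketch of the Luhn check is shown below. The function name `luhn_valid` is just an illustrative choice; the sample inputs are a commonly used test card number and the same number with its last digit changed.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]  # ignore spaces and dashes
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0

print(luhn_valid("4111 1111 1111 1111"))  # True
print(luhn_valid("4111 1111 1111 1112"))  # False
```

A purely random 16-digit number passes this check only about one time in ten, so the checksum alone already separates many random strings from well-formed card numbers.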
However, if they are simply 16-digit integers, there are a couple of approaches that you can use. It is hard to tell whether an individual number came from a random source (the number 1111111111111111 is just as likely as any other number to come out of a random number generator). As for your repeated numbers and patterns, that is very reminiscent of the concept of Kolmogorov complexity (see links below). You could try looking for patterns with this brute-force method, but I feel it would be quite inaccurate, as humans might actually tend to avoid putting repeated digits and sequences in these numbers!
Instead, I suggest focusing on the way people generate numbers. You can treat human input like a very poor random number generator. So, if you don't have another dataset, I recommend making a list of human-entered "random" numbers yourself. Then you can use machine learning to train a classifier that distinguishes purely random numbers (those without the 'human-like' attributes your machine learning algorithm has recognized) from human-generated ones. As features for the statistical classifier, Kolmogorov complexity could be one, frequency of digits another (see Benford's law on Wikipedia), and the number of repeating digits a third (humans might try to avoid repeating digits to look non-random, so let your classifier do the work!).
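A rough sketch of how such features might be computed is shown below. Kolmogorov complexity is not computable, so the compressed length from `zlib` is used here only as a crude stand-in; the feature names and the function `digit_features` are assumptions for illustration, not an established recipe.

```python
import zlib
from collections import Counter

def digit_features(number: str) -> dict:
    """Compute simple per-number features that a classifier could be trained on."""
    digits = "".join(ch for ch in number if ch.isdigit())
    if not digits:
        raise ValueError("number contains no digits")
    counts = Counter(digits)
    # Length of the longest run of one repeated digit, e.g. 5 for "22222".
    longest_run = run = 1
    for previous, current in zip(digits, digits[1:]):
        run = run + 1 if current == previous else 1
        longest_run = max(longest_run, run)
    return {
        "distinct_digits": len(counts),
        "max_digit_share": max(counts.values()) / len(digits),
        "longest_run": longest_run,
        # Crude proxy for Kolmogorov complexity: size after compression.
        "compressed_len": len(zlib.compress(digits.encode("ascii"), 9)),
    }

print(digit_features("1111111111111111"))
print(digit_features("4929381723650847"))
```

Feature dictionaries like these, together with the genuine/fraud labels, could then be fed to whichever classifier you prefer.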
From my personal experience, tough problems like this are a textbook case for machine learning algorithms and statistical classifiers.
Hope this helps!
Links:
Kolmogorov Complexity
Complexity calculator

Relationship between random numbers

If I have a set of randomly generated numbers (integers), how do I find a relationship between them so as to express them as a finite sequence and develop an algorithm that can generate any nth term of the sequence given some seed data?
Is there any existing algorithm, framework, or library that does this? If there isn't, any suggestions on how to proceed?
Thanks.
It depends on the algorithm used to generate the (pseudo-random) numbers. If you want to predict future terms, you need some number of past terms and the algorithm used. If the algorithm is cryptographically secure, then you are out of luck. If it isn't, then you have a good chance of working out future terms.
I asked this question regarding linear congruential generators (commonly used for simple applications) a while ago. It gives a pretty good discussion of how to predict terms for that class of generator.
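As a small illustration of why such generators are predictable, the sketch below recovers the multiplier and increment of a generator of the form x_{n+1} = (a * x_n + c) mod m from three consecutive outputs. It assumes that the modulus m is known, that the full state is output each time, and that the difference between the first two outputs is invertible mod m. The constants are example textbook parameters, not those of any particular library.

```python
M = 2 ** 32            # assumed known modulus
A = 1664525            # example multiplier (treated as secret by the attacker)
C = 1013904223         # example increment (treated as secret by the attacker)

def lcg_next(x: int, a: int = A, c: int = C, m: int = M) -> int:
    """One step of the linear congruential generator."""
    return (a * x + c) % m

def recover_params(x0: int, x1: int, x2: int, m: int) -> tuple[int, int]:
    """Recover (a, c) from three consecutive outputs, given the modulus m."""
    # x2 - x1 = a * (x1 - x0) mod m, so a = (x2 - x1) * (x1 - x0)^-1 mod m.
    # pow(..., -1, m) needs Python 3.8+ and raises ValueError if no inverse exists.
    a = ((x2 - x1) * pow((x1 - x0) % m, -1, m)) % m
    c = (x1 - a * x0) % m
    return a, c

x0 = 12345
x1 = lcg_next(x0)
x2 = lcg_next(x1)
a, c = recover_params(x0, x1, x2, M)
print((a, c) == (A, C))
print(lcg_next(x2, a, c) == lcg_next(x2))  # the next term is predicted correctly
```

Real-world generators often output only some of the state bits or hide the modulus, which makes recovery harder but, for this class of generator, generally still feasible.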
