Generate n random integers between 1-n in small amount of space? - random

More specifically, is there an algorithm that can generate, deterministically, provided a seed, n integers from 0-(n-1), with no duplicates or missing numbers, in linear or sub-linear time and constant space?
All the answers I've found or seen online require linear space, as they need to store information about every number in the sequence before they can give the first number at all. This becomes unreasonable memory usage when there are millions or trillions of possible numbers, a range that is useful for random ID generation. Is there an algorithm, say an iterative formula, which nicely spits out one number after another, without having to know anything about the numbers before or after it? Or am I living in a pipe dream right now?
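One direction that fits the constraints in the question, sketched here under assumptions the question does not state, is a full-period linear congruential generator over the next power of two, skipping any value that falls outside the range. The name `permuted_range` and the way the multiplier and increment are derived from the seed are illustrative choices only, and the resulting order is of low statistical quality: it is a permutation, not a uniformly random one.

```python
def permuted_range(n, seed):
    """Yield each integer in 0..n-1 exactly once, using O(1) extra space."""
    if n <= 0:
        return
    # Smallest power of two m >= n (at least 4, so m is divisible by 4).
    m = 4
    while m < n:
        m *= 2
    # Hull-Dobell conditions for a full period modulo a power of two:
    # the increment c must be odd and the multiplier a must satisfy a % 4 == 1.
    a = (4 * (seed % m) + 1) % m
    c = (2 * (seed // 3) + 1) % m
    x = seed % m
    for _ in range(m):
        if x < n:
            yield x  # values >= n are skipped
        x = (a * x + c) % m


print(list(permuted_range(10, seed=12345)))
```

Because m is less than 2n once n exceeds 2, the loop runs a linear number of steps, and only the four integers `m`, `a`, `c`, and `x` are kept between outputs.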

Related

What is the most efficient algorithm to give out prime numbers, up to very high values (all a 32bit machine can handle)

My program is supposed to loop forever and give out via print every prime number it comes along. Doing this in x86-NASM btw.
My first attempt divided it by EVERY previous number until either the Carry is 0 (not a prime) or the result is 1.
My second attempt improved this by only testing every second number, so only odd numbers.
The third thing I am currently implementing is to not divide by EVERY previous number but only by the previous numbers up to half of the candidate, since no number bigger than half of the candidate can divide it evenly.
Another thing that might help is to test it with only odd numbers, like the sieve of eratosthenes, but only excluding even numbers.
Anyway, if there is another thing I can do, all help welcome.
edit:
If you need to test a handful of numbers, possibly only one, for primality, the AKS primality test is polynomial in the length of n.
If you want to find a very big prime, of cryptographic size, then select a random range of odd numbers, sieve out all the numbers that have small prime factors (e.g. factors up to somewhere between 64K and 240K), then test the remaining numbers for primality.
If you want to find the primes in a range then use a sieve. The sieve of Eratosthenes is very easy to implement but runs slower and requires more memory than the alternatives.
The sieve of Atkin is faster, the wheels sieve requires far less memory.
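As a rough illustration of the simplest option mentioned here, the following is a minimal sketch of the sieve of Eratosthenes restricted to odd numbers. It is written in Python rather than NASM for readability, and the function name `primes_up_to` is a made-up one.

```python
def primes_up_to(limit):
    """Return all primes <= limit using a sieve over odd numbers only."""
    if limit < 2:
        return []
    # is_prime[i] stands for the odd number 2*i + 1.
    size = (limit + 1) // 2
    is_prime = bytearray([1]) * size
    is_prime[0] = 0  # the number 1 is not prime
    i = 1
    while (2 * i + 1) ** 2 <= limit:
        if is_prime[i]:
            p = 2 * i + 1
            # Start at p*p and step by 2p in value, which is a step of p in index.
            start = (p * p) // 2
            is_prime[start::p] = bytes(len(range(start, size, p)))
        i += 1
    return [2] + [2 * i + 1 for i in range(1, size) if is_prime[i]]


print(primes_up_to(50))
```

Storing only odd numbers halves the memory of the textbook version; covering the full 32-bit range this way would still need a segmented variant, which is not shown.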
The size of the problem is exponential if approached naively, so before micro-optimising it is mandatory to macro-optimise first.
More or less all prime number algorithms require familiarity with number theory, so pay particular attention to the group/ring/field the algorithm is working in, because mathematicians write operations like inversion or multiplication with the same symbol for all algebraic structures.
Once you have a fast algorithm, you can start micro-optimising.
At this level it's really impossible to answer how to proceed with such optimisations.

Space Complexity and Modifying the Data Set

What is the space complexity of the following algorithm?
Given an array of n 32-bit signed integers, where each value is positive and less than two to the power of 30, negate all values in the array, then negate all values in the array a second time.
The question arose for me out of a discussion in the comment section here: Rearrange an array so that arr[i] becomes arr[arr[i]] with O(1) extra space
I am specifically interested in different opinions and definitions. I think subtle distinctions and definitions may be missing sometimes in some stackoverflow discussions on this subject.
Space complexity usually refers to the added space requirements for an algorithm, over and above the original data set itself. In this case, the original data set is the n 32-bit signed integers, so you're only concerned with extra storage above that.
In that case, that extra storage is basically nothing, which translates to constant O(1) space complexity.
If you were required to create a separate array (negated, then negated again), it would be O(n) since the space required is in proportion to the original data set.
But, since you're doing the negations in-place, that's not the case.
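A minimal sketch of the in-place version this answer has in mind, assuming the data lives in a mutable Python list (the function name `negate_twice_in_place` is illustrative). The only storage beyond the input is the two loop counters, independent of n.

```python
def negate_twice_in_place(values):
    """Negate every element, then negate every element again, reusing the list."""
    for _ in range(2):
        for i in range(len(values)):
            values[i] = -values[i]  # overwrites the slot; no second array


data = [5, 17, 2 ** 30 - 1]
negate_twice_in_place(data)
print(data)
```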
You are confusing two different, though related, things: computer-theoretic space complexity of an algorithm, and practical memory requirements of a program.
Algorithms in computer science are normally not formulated in terms of integers of certain predefined size which is imposed by currently predominant computer architectures. If anything, they are parameterized by integer size. So "given an array of n 32-bit signed integers" should be replaced with "given an array of n k-bit signed integers".
Now if all integers of the input array are actually m<k bits wide, and all integers of the output array are also known to be m<k bits wide, and nothing else outside your algorithm imposes k-bit-wide integers, then sneaking k into the problem description is just cheating in order to make your complexity look better than it actually is. Likewise, saying "signed" if both input and output data are supposed to be positive is cheating.
Real-life programs don't have complexity, they have memory requirements. It is perfectly fine to say that your program does not use any extra memory if it only temporarily uses otherwise unused sign bits of your array elements. Just don't act surprised when one fine day you discover you have too large an array and you must pack it, so that it no longer has any unused bits. That is, you are reusing your algorithm in a different program with a different data representation, one that does not have any spare bits. Then you are forced to recall that the added space complexity of your algorithm is actually O(n).
Since you're interested in space complexity, the only relevant part of the question is:
"an array of n 32-bit signed integers"
From the above, the answer is pretty straightforward - O(n)
This whole blurb:
"negate all values in the array, then negate all values in the array a second time"
only affects the time complexity, which seems like a poorly crafted distraction in a homework assignment.

why is integer factorization a non-polynomial time?

I am just a beginner of computer science. I learned something about running time but I can't be sure what I understood is right. So please help me.
So integer factorization is currently not a polynomial time problem but primality test is. Assume the number to be checked is n. If we run a program just to decide whether every number from 1 to sqrt(n) can divide n, and if the answer is yes, then store the number. I think this program is polynomial time, isn't it?
One possible way that I am wrong would be a factorization program should find all primes, instead of the first prime discovered. So maybe this is the reason why.
However, in public key cryptography, finding a prime factor of a large number is essential to attack the cryptography. Since usually a large number (public key) is only the product of two primes, finding one prime means finding the other. This should be polynomial time. So why is it difficult or impossible to attack?
Casual descriptions of complexity like "polynomial factoring algorithm" generally refer to the complexity with respect to the size of the input, not the interpretation of the input. So when people say "no known polynomial factoring algorithm", they mean there is no known algorithm for factoring N-bit natural numbers that runs in time polynomial with respect to N. Not polynomial with respect to the number itself, which can be up to 2^N.
The difficulty of factorization is one of those beautiful mathematical problems that's simple to understand and takes you immediately to the edge of human knowledge. To summarize (today's) knowledge on the subject: we don't know why it's hard, not with any degree of proof, and the best methods we have run in more than polynomial time (but also significantly less than exponential time). The result that primality testing is even in P is pretty recent; see the linked Wikipedia page.
The best heuristic explanation I know for the difficulty is that primes are randomly distributed. One of the easier-to-understand results is Dirichlet's theorem. This theorem says that every arithmetic progression a, a+d, a+2d, ... with a and d coprime contains infinitely many primes; in other words, you can think of primes as being dense with respect to progressions, meaning you can't avoid running into them. This is the simplest of a rather large collection of such results; in all of them, primes appear in ways very much analogous to random numbers.
The difficulty of factoring is thus analogous to the impossibility of reversing a one-time pad. In a one-time pad, a bit we don't know is XORed with another bit we don't know. We get zero information about an individual bit from knowing the result of the XOR. Replace "bit" with "prime" and XOR with multiplication, and you have the factoring problem. It's as if you've multiplied two random numbers together, and you get very little information from the product (instead of zero information).
If we run a program just to decide whether every number from 1 to sqrt(n) can divide n, and if the answer is yes, then store the number.
Even ignoring that the divisibility test will take longer for bigger numbers, this approach takes about 1.4 times as long (a factor of sqrt(2)) if you just add a single (binary) digit to n, and twice as long if you add two digits.
I think that is the definition of exponential runtime: every extra bit in n multiplies the running time by a constant factor.
But note that this observation applies only to the algorithm you proposed. It is still unknown if integer factorization is polynomial or not. The cryptographers sure hope that it is not, but there are also alternative algorithms that do not depend on prime factorization being hard (such as elliptic curve cryptography), just in case...
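The growth described in this answer can be checked with a few lines of arithmetic. This sketch only counts how many candidate divisors the loop from 1 to sqrt(n) would visit; the helper name `candidate_count` is made up for the illustration.

```python
import math


def candidate_count(n):
    """Number of integers from 1 to floor(sqrt(n)) that trial division would test."""
    return math.isqrt(n)


for bits in (16, 18, 32, 34, 64, 66):
    n = 2 ** bits
    print(bits, candidate_count(n))
```

Each pair of rows differs by two bits of input length, and the count doubles between them, so the work is exponential in the number of bits even though it looks modest as a function of n itself.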

Generate N quasi random numbers in less than O(N)

This was inspired by a question at a job interview: how do you efficiently generate N unique random numbers? Their security and distribution/bias don't matter.
I proposed a naive way of calling rand() N times and eliminating dupes by trial and error, thus getting an inefficient and flawed solution. Then I read this SO question; these algorithms are great for getting quality unique numbers and they are O(N).
But I suspect there are ways to get low-quality unique random numbers for dummy tasks in less than O(N) time complexity. I got some possible ideas:
Store many precomputed lists each containing N numbers and retrieve one list randomly. Complexity is O(1) for fixed N. Storage space used is O(NR) where R is number of lists.
Generate N/2 unique random numbers and then divide each of them into 2 unequal parts (floor/ceil for odd numbers, n+1/n-1 for even). I know this is flawed (duplicates can pop up) and O(N/2) is still O(N). This is more of a food for thought.
Generate one big random number and then squeeze more variants from it by some fixed manipulations like bitwise operations, factorization, recursion, MapReduce or something else.
Use a quasi-random sequence somehow (not a math guy, just googled this term).
Your ideas?
Presumably this routine has some kind of output (i.e. the results are written to an array of some kind). Populating an array (or some other data-structure) of size N is at least an O(N) operation, so you can't do better than O(N).
You can generate random numbers one after another, and if the result array already contains the new number, just add to it the maximum of the numbers generated so far.
Detecting whether a number has already been generated is O(1) (using a hash set). So it's O(N) overall, with only N random() calls.
Of course, this assumes that we do not overflow the upper limit (i.e. BigInteger).
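One possible reading of this answer, sketched in Python: draw a number, and on a collision add the largest value produced so far, which pushes it above everything already in the set. The function name `unique_numbers` and its parameters are hypothetical, and draws are assumed to start at 1 so that the sum is strictly larger than the current maximum.

```python
import random


def unique_numbers(count, upper, rng=None):
    """Return `count` distinct positive integers using one RNG call per number."""
    rng = rng or random.Random()
    seen = set()      # hash set gives O(1) expected membership tests
    result = []
    largest = 0
    for _ in range(count):
        value = rng.randint(1, upper)
        if value in seen:
            # value >= 1, so value + largest exceeds every number produced so far.
            value += largest
        seen.add(value)
        result.append(value)
        largest = max(largest, value)
    return result


print(unique_numbers(10, 5, random.Random(1)))
```

The results can exceed `upper`, which is the overflow caveat the answer mentions; Python integers are unbounded, so nothing wraps around here.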

Does Repeating a Biased Random Shuffle Reduce the Bias?

I'd like to produce fast random shuffles repeatedly with minimal bias.
It's known that the Fisher-Yates shuffle is unbiased as long as the underlying random number generator (RNG) is unbiased.
To shuffle an array a of n elements:
for i from n − 1 downto 1 do
    j ← random integer with 0 ≤ j ≤ i
    exchange a[j] and a[i]
But what if the RNG is biased (but fast)?
Suppose I want to produce many random permutations of an array of 25 elements. If I use the Fisher-Yates algorithm with a biased RNG, then my permutation will be biased, but I believe this assumes that the 25-element array starts from the same state before each application of the shuffle algorithm. One problem, for example, is if the RNG only has a period of 2^32 ~ 10^9 we can not produce every possible permutation of the 25 elements because this is 25! ~ 10^25 permutations.
My general question is, if I leave the shuffled elements shuffled before starting each new application of the Fisher-Yates shuffle, would this reduce the bias and/or allow the algorithm to produce every permutation?
My guess is it would generally produce better results, but it seems like if the array being repeatedly shuffled had a number of elements that was related to the underlying RNG that the permutations could actually repeat more often than expected.
Does anyone know of any research that addresses this?
As a sub-question, what if I only want repeated permutations of 5 of the 25 elements in the array, so I use the Fisher-Yates algorithm to select 5 elements and stop before doing a full shuffle? (I use the 5 elements on the end of the array that got swapped.) Then I start over using the previous partially shuffled 25-element array to select another permutation of 5. Again, it seems like this would be better than starting from the original 25-element array if the underlying RNG had a bias. Any thoughts on this?
I think it would be easier to test the partial shuffle case since there are only 6,375,600 possible permutations of 5 out of 25 elements, so are there any simple tests to use to check for biases?
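The partial shuffle described in this question can be sketched as follows. Python's `random.Random` stands in for the fast, possibly biased RNG, and the function name `draw_tail` is an illustrative choice.

```python
import math
import random


def draw_tail(a, k, rng):
    """Run the first k steps of Fisher-Yates on list a in place; return the last k items."""
    n = len(a)
    for i in range(n - 1, n - 1 - k, -1):
        j = rng.randint(0, i)  # inclusive bounds: 0 <= j <= i
        a[i], a[j] = a[j], a[i]
    return a[n - k:]


deck = list(range(25))
rng = random.Random(2024)
first = draw_tail(deck, 5, rng)
second = draw_tail(deck, 5, rng)  # continues from the partially shuffled deck
print(first, second)
print(math.perm(25, 5))  # number of ordered selections of 5 out of 25
```

For a simple bias check, one option is to shrink the problem (say, 2 out of 4 elements), tally how often each ordered outcome appears over many draws, and compare the counts against the uniform expectation.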
"if the RNG only has a period of 2^32 ~ 10^9 we can not produce every possible permutation of the 25 elements because this is 25! ~ 10^25 permutations"
This is only true as long as the seed determines every successive selection. As long as your RNG can be expected to deliver a precisely even distribution over the range specified for each next selection, then it can produce every permutation. If your RNG cannot do that, having a larger seed base will not help.
As for your side question, you might as well reseed for every draw. However, reseeding the generator is only useful if the new seed contains enough entropy. Time stamps don't contain much entropy, and neither do algorithmic calculations.
I'm not sure what this solution is part of because you have not listed it, but if you are trying to calculate something from a larger domain using random input, there are probably better methods.
A couple of points:
1) Anyone using the Fisher Yates shuffle should read this and make doubly sure their implementation is correct.
2) Doesn't repeating the shuffle defeat the purpose of using a faster random number generator? Surely if you're going to have to repeat every shuffle 5 times to get the desired entropy you're better off using a low bias generator.
3) Do you have a set up where you can test this? If so, start trying things - Jeff's graphs make it clear that you can easily detect quite a lot of errors by using small decks and visually portraying the results.
My feeling is that with a biased RNG repeated runs of the Knuth shuffle would produce all the permutations, but I'm not able to prove it (it depends on the period of the RNG and how biased it is).
So let's reverse the question: given an algorithm that requires a random input and a biased RNG, is it easier to de-skew the algorithm's output or to de-skew the RNG's output?
Unsurprisingly, the latter is much easier to do (and is of broader interest): there are several standard techniques to do it. A simple technique, due to Von Neumann, is: given a bitstream from a biased RNG, take bits in pairs, throw away every (0,0) and (1,1) pair, return a 1 for every (1,0) pair and a 0 for every (0,1) pair. This technique assumes that the bits are from a stream where each bit has the same probability of being a 0 or 1 as any other bit in the stream and that bits are not correlated. Elias generalized von Neumann's technique to a more efficient scheme (one where fewer bits are discarded).
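A minimal sketch of the von Neumann technique as described in this paragraph, assuming the input is an iterable of 0/1 values whose bits are independent and share the same bias (the function name `von_neumann_extract` is illustrative):

```python
def von_neumann_extract(bits):
    """Yield debiased bits from an iterable of 0/1 values, consuming them in pairs."""
    it = iter(bits)
    # zip(it, it) pulls two consecutive items per step, forming non-overlapping pairs.
    for first, second in zip(it, it):
        if first != second:
            # (1, 0) -> 1 and (0, 1) -> 0, so the output equals the first bit.
            yield first
        # (0, 0) and (1, 1) pairs are thrown away.


print(list(von_neumann_extract([0, 0, 1, 0, 0, 1, 1, 1, 1, 0])))
```

The more biased the source, the more pairs are equal and discarded, which is the inefficiency that the Elias scheme reduces.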
But even strongly biased or correlated bits may contain useful amounts of randomness, which can be extracted, for example, using a technique based on the Fast Fourier Transform.
Another option is to feed the biased RNG output to a cryptographically strong function, for example a message digest algorithm, and use its output.
For further references on how to de-skew random number generators, I suggest reading the Randomness Recommendations for Security RFC.
My point is that the quality of the output of a random-based algorithm is upper bounded by the entropy provided by the RNG: if it is extremely biased the output will be extremely biased, no matter what you do. The algorithm can't squeeze out more entropy than is contained in the biased random bitstream. Worse: it will probably lose some random bits. Even assuming that the algorithm works with a biased RNG, to obtain a good result you'll have to put in a computational effort at least as great as the effort that it would take to de-skew the RNG (but it will probably require more effort, since you'll have to both run the algorithm and "defeat" the biasing at the same time).
If your question is just theoretical, then please disregard this answer. If it is practical, then please seriously think about de-skewing your RNG instead of making assumptions about the output of the algorithm.
I can't completely answer your question, but this observation seemed too long for a comment.
What happens if you ensure that the number of random numbers pulled from your RNG for each iteration of Fisher-Yates has a high least common multiple with the RNG period? That may mean that you "waste" a random integer at the end of the algorithm. When shuffling 25 elements, you need 24 random numbers. If you pull one more random number at the end, making 25 random numbers, you're not guaranteed to have a repetition for much longer than the RNG period. Now, randomly, you could have the same 25 numbers occur in succession before reaching the period, of course. But, as 25 has no common factors other than 1 with 2^32, you wouldn't hit a guaranteed repetition until 25*(2^32). Now, that isn't a huge improvement, but you said this RNG is fast. What if the "waste" value was much larger? It may still not be practical to get every permutation, but you could at least increase the number you can reach.
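The arithmetic behind this observation can be made concrete. The sketch below assumes the RNG simply cycles through a fixed sequence of 2^32 outputs and asks how many shuffles can run before the draws line up with the start of the cycle again; the helper name `shuffles_before_forced_repeat` is made up.

```python
import math

PERIOD = 2 ** 32  # RNG period assumed in the question


def shuffles_before_forced_repeat(draws_per_shuffle, period=PERIOD):
    """Shuffles completed when the RNG stream first realigns with a shuffle boundary."""
    total_draws = draws_per_shuffle * period // math.gcd(draws_per_shuffle, period)  # lcm
    return total_draws // draws_per_shuffle


for draws in (24, 25):
    print(draws, shuffles_before_forced_repeat(draws))
```

With 24 draws per shuffle the shared factor of 8 cuts the count to 2^29 shuffles, while 25 draws per shuffle gives the full 2^32 shuffles, i.e. 25*(2^32) draws in total.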
It depends entirely on the bias. In general I would say "don't count on it".
Biased algorithm that converges to non-biased:
Do nothing half of the time, and a correct shuffle the other half. Converges towards non-biased exponentially. After n shuffles there is a 1-1/2^n chance the shuffle is non-biased and a 1/2^n chance the input sequence was selected.
Biased algorithm that stays biased:
Shuffle all elements except the last one. Permanently biased towards not moving the last element.
More General Example:
Think of a shuffle algorithm as a weighted directed graph of permutations, where the weights out of a node correspond to the probability of transitioning from one permutation to another when shuffled. A biased shuffle algorithm will have non-uniform weights.
Now suppose you filled one node in that graph with water, and water flowed from one node to the next based on the weights. The algorithm will converge to non-biased if the distribution of water converges to uniform no matter the starting node.
So in what cases will the water not spread out uniformly? Well, if you have a cycle of above-average weights, nodes in the cycle will tend to feed each other and stay above the average amount of water. They won't take all of it, since as they get more water the amount coming in decreases and the amount going out increases, but it will be above average.
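The "water" picture can be simulated exactly for a small case. This sketch takes the first example above (do nothing half of the time, shuffle correctly the other half), applies it to 3 elements, and tracks the probability of each permutation with exact fractions; the function name `lazy_shuffle_distribution` is illustrative.

```python
from fractions import Fraction
from itertools import permutations


def lazy_shuffle_distribution(rounds, size=3):
    """Distribution over permutations after repeated 'half lazy, half correct' shuffles."""
    perms = list(permutations(range(size)))
    uniform = Fraction(1, len(perms))
    # All the "water" starts in the identity permutation.
    dist = {p: Fraction(0) for p in perms}
    dist[perms[0]] = Fraction(1)
    for _ in range(rounds):
        # Half of the mass stays where it is; the other half is spread evenly,
        # because a correct shuffle reaches every permutation with equal probability.
        dist = {p: dist[p] / 2 + uniform / 2 for p in perms}
    return dist


for rounds in (0, 1, 3, 10):
    dist = lazy_shuffle_distribution(rounds)
    print(rounds, dist[(0, 1, 2)], dist[(2, 1, 0)])
```

The probability of the starting permutation follows 1/2^n + (1 - 1/2^n)/6, matching the statement that the input sequence survives with chance 1/2^n and the rest is uniform.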

Resources