Generate sequence of integers in random order without constructing the whole list upfront [duplicate] - algorithm

How can I generate the list of integers from 1 to N but in a random order, without ever constructing the whole list in memory?
(To be clear: Each number in the generated list must only appear once, so it must be the equivalent to creating the whole list in memory first, then shuffling.)

A very simple scheme: 1+((power(r,x)-1) mod p) takes each value from 1 to p-1 exactly once as x runs from 1 to p-1, in a scrambled-looking order, provided p is prime and r is a primitive root modulo p (r and p merely being distinct primes is not quite sufficient).
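A minimal Python sketch of this idea (p = 101 and r = 2 are arbitrary choices; 2 happens to be a primitive root modulo 101, so the assertion holds; the order is fixed for a given r and p, so vary r over different primitive roots for different orders):

def scrambled_range(p=101, r=2):
    # yields a permutation of 1..p-1, one value at a time, in O(1) memory
    for x in range(1, p):
        yield pow(r, x, p)   # equal to 1 + ((r**x - 1) % p) here

seq = list(scrambled_range())
assert sorted(seq) == list(range(1, 101))   # every value exactly once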

Not the whole list technically, but you could use a bit mask to record whether each number has already been selected. This takes far less storage than the list of numbers itself.
Set all N bits to 0, then for each desired number:
use one of the normal linear congruential methods to select a number from 1 to N.
if that number has already been used, find the next highest unused (0 bit), with wrap.
set that number's bit to 1 and return it.
That way you're guaranteed only one use per number and relatively random results.
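A rough Python sketch of this scheme, using one big integer as the bit mask and random.randint standing in for the linear congruential generator:

import random

def shuffled_numbers(n):
    used = 0                     # bit k-1 set means k was already produced
    for _ in range(n):
        k = random.randint(1, n)
        while (used >> (k - 1)) & 1:
            k = k % n + 1        # next highest unused, wrapping from n back to 1
        used |= 1 << (k - 1)
        yield k

print(sorted(shuffled_numbers(10)))   # [1, 2, ..., 10]

Note that the wrap-around scan makes some numbers slightly more likely to come out early, which is why the results are only "relatively random".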

It might help to specify the language you want a solution in.
You could keep a dynamic list of the numbers you have generated so far, since you need some record of which numbers you have already created. Each time you generate a new number, check whether it is already in the list; if it is, throw it away and try again.
The only way to avoid such a list entirely is to use numbers large enough that duplicates are unlikely, like UUIDs, assuming the generator works correctly - but that doesn't guarantee that no duplicate is generated; it is just highly unlikely.
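For example, a minimal sketch of that idea in Python, with a set standing in for the dynamic list (a set just makes the membership check cheap):

import random

seen = set()

def next_unique(n):
    while True:
        k = random.randint(1, n)
        if k in seen:        # already created once: throw it away, try again
            continue
        seen.add(k)
        return k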

You will need at least half of the total list's memory, just to remember what you did already.
If you are in tough memory conditions, you may try this:
Keep the results generated so far in a tree: generate a random number and try to insert it into the tree. If the insert fails because the number is already there, generate another and try again, and so on, until the tree is half full.
Once the tree is half full, you invert it: construct a tree holding the numbers you haven't used yet, then pick from that in random order.
There is some overhead for maintaining the tree structure, but it can help when your pointers are considerably smaller than your data items.
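A rough sketch of the two phases in Python, with a set standing in for the tree:

import random

def random_order(n):
    used = set()
    while len(used) < n // 2:             # phase 1: insert, retry on duplicates
        k = random.randint(1, n)
        if k not in used:
            used.add(k)
            yield k
    rest = [k for k in range(1, n + 1) if k not in used]
    random.shuffle(rest)                  # phase 2: the inverted structure
    yield from rest

print(sorted(random_order(7)))            # [1, 2, 3, 4, 5, 6, 7]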

Related

How to quickly search for a specified element in an ordered array consisting of only two types of elements?

The array mentioned in the question is as follows:
[1,1,...,1,1,-1,-1,...,-1,-1]
How to quickly find the index of the 1 closest to -1?
Note: both 1 and -1 are guaranteed to be present, and the counts of 1s and -1s are large.
For example, for an array like this:
[1,1,1,1,1,-1,-1,-1]
the result should be 4.
The fastest way I can think of is binary search, is there a faster way?
With the current representation of the data, binary search is the fastest way I can think of. Of course, you can cache and reuse the result in constant time, since the answer is always the same.
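For instance, a minimal Python version of that binary search (0-based indices, as in the example above):

def last_one_index(a):
    lo, hi = 0, len(a) - 1   # invariant: a[lo] == 1 and a[hi] == -1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if a[mid] == 1:
            lo = mid
        else:
            hi = mid
    return lo                # index of the 1 closest to the -1s

print(last_one_index([1, 1, 1, 1, 1, -1, -1, -1]))   # 4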
On the other hand if you change the representation of the array to some simple numbers you can find the next element in constant time. Since the data can always be mapped to a binary value, you can reduce the whole array to 2 numbers. The length of the first partition and the length of the second partition. Or the length of the whole array and the partitioning point. This way you can easily change the length of both partitions in constant time and have access to the next element of the second partition in constant time.
Of course, changing the representation of the array itself is a logarithmic process since you need to find the partitioning point.
By a simple information-theoretic argument, you can't be faster than log(n) using only comparisons: there are n possible outcomes, and you need to collect at least log(n) bits of information to distinguish them.
If you have extra information about the statistical distribution of the values, then maybe you can exploit it. But this is to be discussed on a case-by-case basis.

Amount of arrays with unique numbers

I have been wondering if there is any better solution of this problem:
Let's assume that there are n containers (they might not have the same length), each holding some numbers. How many n-length arrays can be created by taking one element from every container? The numbers in a newly formed array must be unique (e.g. (2,3,3) cannot be created but (2,4,3) can).
Here is an example:
n=3
c1=(1,6,7)
c2=(1,6,7)
c3=(6,7)
The correct answer is 4, because we can create these four arrays: (1,6,7), (1,7,6), (6,1,7), (7,1,6).
Edit: None of the n containers contain duplicates and all the elements in the new arrays must have the same order as the order of the containers they belong to.
So my question is: Is there any better way to calculate the number of those arrays than just by generating every single possibility and checking if it has no repetitions?
You do not need to generate each possibility and then check whether or not it has repetitions - you can do that before adding the would-be duplicate element, saving a lot of wasted work further down the line. But yes, given the requirement that
all the elements in the new arrays must have the same order as the
order of the containers they belong to
you cannot simply count permutations, or combinations of m-over-n, which would have been much quicker (as there is a closed formula for those).
Therefore, the optimal algorithm is probably to use a backtracking approach with a set to avoid duplicates while building partial answers, and count the number of valid answers found.
The problem looks somewhat like counting the possible answers to a 1-dimensional sudoku: choose one element from each region, ensuring no duplicates. In many cases there may be 0 answers - imagine n=4, c=[[1,2],[2,3],[3,1],[2,3]]. For example, if a subset of k containers holds fewer than k unique elements between them, no answer is possible.
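A short Python sketch of that backtracking approach (count_arrays is just an illustrative name); it reproduces the answer 4 for the example in the question:

def count_arrays(containers):
    used = set()                        # values placed so far
    def go(i):
        if i == len(containers):        # one complete valid array
            return 1
        total = 0
        for v in containers[i]:
            if v not in used:           # prune duplicates before recursing
                used.add(v)
                total += go(i + 1)
                used.remove(v)
        return total
    return go(0)

print(count_arrays([(1, 6, 7), (1, 6, 7), (6, 7)]))   # 4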

Find random numbers in a given range with certain possible numbers excluded

Suppose you are given a range and a few numbers in the range (exceptions). Now you need to generate a random number in the range except the given exceptions.
For example, if range = [1..5] and exceptions = {1, 3, 5} you should generate either 2 or 4 with equal probability.
What logic should I use to solve this problem?
If you have no constraints at all, I guess this is the easiest way: create an array containing the valid values, a[0]...a[m]. Return a[rand(0,...,m)].
If you don't want to create an auxiliary array, but you can count the number of exceptions e and of elements n in the original range, you can simply generate a random number r = rand(1 ... n-e), then walk the range with a counter that doesn't tick on exceptions and stop when the counter equals r.
Depends on the specifics of the case. For your specific example, I'd return a 2 if a Uniform(0,1) was below 1/2, 4 otherwise. Similarly, if I saw a pattern such as "the exceptions are odd numbers", I'd generate values for half the range and double. In general, though, I'd generate numbers in the range, check if they're in the exception set, and reject and re-try if they were - a technique known as acceptance/rejection for obvious reasons. There are a variety of techniques to make the exception-list check efficient, depending on how big it is and what patterns it may have.
Let's assume, to keep things simple, that arrays are indexed starting at 1, and your range runs from 1 to k. Of course, you can always shift the result by a constant if this is not the case. We'll call the array of exceptions ex_array, and let's say we have c exceptions. These need to be sorted, which shall turn out to be pretty important in a while.
Now, you only have k-c useful numbers to work with, so it is meaningful to look for a random number in the range 1 to k-c. Say we end up with the number r. Now we just need to find the r-th valid number in your range. Simple? Not so much. Remember, you can never simply walk over any of your arrays in a linear fashion, because that can really slow down your implementation when you have a lot of numbers. You have to use some sort of binary search, say, to come up with a fast enough algorithm.
So let's try something better. The r-th number would nominally have lain at index r in your original array had you had no exceptions. The number at index r is r, of course, since your range and your array indices start from 1. But you have a bunch of invalid numbers between 1 and r, and you want to somehow get to the r-th valid number. So let's do a binary search on the array of exceptions, ex_array, to find how many invalid numbers are less than or equal to r, because that is how many invalid numbers lie between 1 and r. If this count is 0, we're all done, but if it isn't, we have a bit more work to do.
Assume the binary search found n invalid numbers between 1 and r. Let's advance n indices in your array to the index r+n, and find the number of invalid numbers lying between 1 and r+n, using a binary search to count how many elements in ex_array are less than or equal to r+n. If this count is exactly n, no more invalid numbers were encountered, and you've hit upon your r-th valid number. Otherwise, repeat again, this time for the index r+n', where n' is the number of invalid numbers lying between 1 and r+n.
Repeat until you reach a stage where no excess exceptions are found. The important thing here is that you never once have to walk over any of the arrays in a linear fashion. You can also optimize the binary searches so they don't always start at index 0: if you know there are n invalid numbers between 1 and r, then instead of starting the next binary search from 1, you can start it one index past the position corresponding to n in ex_array.
In the worst case, you'll be doing a binary search for each element in ex_array, which means you'll do c binary searches, the first starting from index 1, the next from index 2, and so on, giving a time complexity of O(log(c!)). Now, Stirling's approximation tells us that O(ln(x!)) = O(x ln(x)), so using the algorithm above only makes sense if c is small enough that O(c ln(c)) < O(k), since you can achieve O(k) complexity with the trivial method of extracting the valid elements from your array first.
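Here is a Python sketch of the scheme, with the bisect module doing the binary searches (it leaves out the start-offset optimization described above):

import bisect
import random

def rand_excluding(k, ex_array):     # ex_array: sorted exceptions within 1..k
    c = len(ex_array)
    r = random.randint(1, k - c)     # rank of the valid number we want
    n = 0                            # exceptions known to lie at or below r+n
    while True:
        n_new = bisect.bisect_right(ex_array, r + n)
        if n_new == n:               # no new exceptions turned up: done
            return r + n
        n = n_new

print(rand_excluding(5, [1, 3, 5]))  # 2 or 4, each with probability 1/2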
In Python the solution is very simple (given your example):
import random
rng = set(range(1, 6))
ex = {1, 3, 5}
random.choice(list(rng-ex))
To optimize the solution, one needs to know how long is the range and how many exceptions there are. If the number of exceptions is very low, it's possible to generate a number from the range and just check if it's not an exception. If the number of exceptions is dominant, it probably makes sense to gather the remaining numbers into an array and generate random index for fetching non-exception.
In this answer I assume that it is known how to get an integer random number from a range.
Here's another approach...just keep on generating random numbers until you get one that isn't excluded.
Suppose your desired range was [0,100) excluding 25, 50, and 75.
Put the excluded values in a hashtable or bitarray for fast lookup.
var excludedValues = new HashSet<int> { 25, 50, 75 };
var rng = new Random();
int randNum = rng.Next(0, 100);
while (excludedValues.Contains(randNum))
{
    randNum = rng.Next(0, 100);
}
The complexity analysis is more difficult, since potentially rand(0,100) could return 25, 50, or 75 every time. However that is quite unlikely (assuming a random number generator), even if half of the range is excluded.
In the above case, we re-generate a random value for only 3/100 of the original values.
So 3% of the time you regenerate once. Of those 3%, only 3% will need to be regenerated, etc.
Suppose the initial range is [1,n] and the exclusion set's size is x. First generate a map from [1, n-x] to the numbers [1,n], excluding the numbers in the exclusion set. This mapping is 1-1, since there are equal numbers on both sides. In the example given in the question, the mapping would be {1->2, 2->4}.
As another example, suppose the list is [1,10] and the exclusion list is [2,5,8,9]; then the mapping is {1->1, 2->3, 3->4, 4->6, 5->7, 6->10}. This map can be created in a worst-case time complexity of O(nlogn).
Now generate a random number between [1, n-x] and map it to the corresponding number using the mapping. Map lookups can be done in O(logn).
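A small Python sketch of that mapping (built here as a plain dict, which takes O(n) to build and O(1) to look up; a balanced tree map gives the O(nlogn)/O(logn) bounds quoted above; make_exclusion_map is an illustrative name):

import random

def make_exclusion_map(n, exclusions):
    ex = set(exclusions)
    valid = (v for v in range(1, n + 1) if v not in ex)
    return {i + 1: v for i, v in enumerate(valid)}

m = make_exclusion_map(5, {1, 3, 5})     # {1: 2, 2: 4}, as in the question
print(m[random.randint(1, len(m))])      # uniform over {2, 4}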
You can do it in a versatile way if you have enumerators or set operations. For example using Linq:
void Main()
{
    var exceptions = new[] { 1, 3, 5 };
    RandomSequence(1, 5).Where(n => !exceptions.Contains(n))
                        .Take(10)
                        .ToList()
                        .ForEach(Console.WriteLine);
}

static Random r = new Random();

// An endless stream of random values in [min, max]; callers filter and Take().
IEnumerable<int> RandomSequence(int min, int max)
{
    while (true)
        yield return r.Next(min, max + 1);
}
I would like to acknowledge some comments that are now deleted:
It's possible that this program never ends (only theoretically), because there could be a sequence that never contains valid values. Fair point. I think this is something that could be explained to the interviewer; however, I believe my example is good enough for the context.
The distribution is fair because each of the elements has the same chance of coming up.
The advantage of answering this way is that you show understanding of modern "functional-style" programming, which may be interesting to the interviewer.
The other answers are also correct. This is a different take on the problem.

Optimal way to find number of unique numbers in an array

What is the optimal way to find the number of unique numbers in an array? One way is to add them to a HashSet and then take the size of the hash set. Is there any way better than this?
I just need the number of unique numbers. Their frequency is not required.
Any help is appreciated.
Thanks,
Harish
What tradeoff of memory for fewer CPU cycles are you willing to accept? Which matters more for your optimal solution?
A variant of counting sort is very inefficient in space, but extremely fast.
For larger datasets you'll be wanting to use hashing, which is what hashset already does. Assuming you're willing to take the overhead of it actually storing the data, just go with your idea. It has the added advantage of being simpler to implement in any language with a decent standard library.
You don't say what is known about the numbers, but if 1) they are integers and 2) you know the range (max and min) and 3) the range isn't too large, then you can allocate an array of ints equal in length to ceiling(range / 32) (assuming 32-bit integers) all initialized to zero. Then go through the data set and set the bit corresponding to each number to 1. At the end, just count the number of 1 bits.
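A sketch of that idea in Python, using a bytearray in place of an array of 32-bit ints:

def count_unique(nums, lo, hi):           # all values known to lie in [lo, hi]
    bits = bytearray((hi - lo) // 8 + 1)  # one bit per possible value
    for x in nums:
        i = x - lo
        bits[i >> 3] |= 1 << (i & 7)
    return sum(bin(b).count('1') for b in bits)

print(count_unique([3, 1, 4, 1, 5, 9, 2, 6, 5, 3], 1, 9))   # 7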
One simple algorithm is to loop through the list adding numbers to a hash set as you said, but each time check if it is already in the set, and if not add 1 to a running count. Then when you finish looping through the list you will have the number of distinct elements in the final value of the running count. Here is a python example:
count = 0
s = set()
for i in numbers:      # 'numbers' is the input list
    if i not in s:
        s.add(i)
        count += 1
Edit: I use a running count instead of checking the length of a set because in the background the set may be implemented as a sparse array and an extra loop over that array may be needed to check if each hash has a corresponding value. The running count avoids that potential additional overhead.
I would suggest sorting the array first and then looking for unique elements.

Generating random fixed length permutations of a string

Let's say my alphabet contains X letters and my language supports only Y-letter words (Y < X, of course). I need to generate all the possible words in random order.
E.g.
Alphabet=a,b,c,d,e,f,g
Y=3
So the words would be:
aaa
aab
aac
aba
..
bbb
ccc
..
(the above should be generated in random order)
The trivial way to do it would be to generate the words and then randomize the list. I DONT want to do that. I want to generate the words in random order.
random(n)=letter[x].random(n-1) will not work, because then you'd get a run of words all starting with letter[x], which would make the list not so random.
Any code/pseudocode appreciated.
As other answers have implied, there are two main approaches: 1) track what you've already generated (the proposed solutions in this category suffer from possibly never terminating), or 2) track which permutations have yet to be produced (which implies that the permutations must be pre-generated, which was specifically disallowed in the requirements). Here's another solution that is guaranteed to terminate and does not require pre-generation, but may not meet your randomization requirements (which are vague at this point).
General overview: build a tree to track what's been generated or what remains. "Select" new permutations by traversing random links in the tree, pruning the tree at the leaves after a permutation is generated to prevent it from being generated again.
Without a whiteboard to diagram this, I hope this description is good enough to describe what I mean: Create a "node" that has links to other nodes for every letter in the alphabet. This could be implemented using a generic map of alphabet letters to nodes or if your alphabet is fixed, you could create specific references. The node represents the available letters in the alphabet that can be "produced" next for generating a permutation. Start generating permutations by visiting the root node, selecting a random letter from the available letters in that node, then traversing that reference to the next node. With each traversal, a letter is produced for the permutation. When a leaf is reached (i.e. a permutation is fully constructed), you'd backtrack up the tree to see if the parent nodes have any available permutations remaining; if not, the parent node can be pruned.
As an implementation detail, each node could store either the set of letters no longer available at that point or the set of letters still available at that point. To possibly reduce storage, you could let a node store either set, with a flag indicating which: while more than half the alphabet is still available the node stores the letters produced so far, and it switches to storing the letters remaining once less than half the alphabet is available.
Using such a tree structure limits what can be produced without having to pre-generate all combinations since you don't need to pre-construct the entire tree (it can be constructed as the permutations are generated) and you're guaranteed to complete because of the purging of the nodes (i.e. you're only traversing links to nodes when that's an allowed combination for an unproduced permutation).
I believe the randomization of the technique is a little odd, however, and I don't think each combination is equally likely to be generated at any given time, though I haven't really thought through this. It's also probably worth noting that even though the full tree isn't necessarily generated up front, the overhead involved will likely be enough such that you may be better off pre-generating all permutations.
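For what it's worth, here is a rough Python sketch of the tree scheme (the node layout and names are my own invention, and as noted above the resulting order is probably not uniform over all possible orderings):

import random

def random_words(alphabet, y):
    def node():                           # nodes are created lazily
        return {'children': {}, 'remaining': list(alphabet)}
    root = node()
    for _ in range(len(alphabet) ** y):
        cur, path = root, []
        for depth in range(y):
            letter = random.choice(cur['remaining'])
            path.append((cur, letter))
            if depth < y - 1:
                cur = cur['children'].setdefault(letter, node())
        yield ''.join(letter for _, letter in path)
        path[-1][0]['remaining'].remove(path[-1][1])   # prune the produced word
        for depth in range(y - 1, 0, -1):              # purge exhausted nodes
            if path[depth][0]['remaining']:
                break
            parent, parent_letter = path[depth - 1]
            parent['remaining'].remove(parent_letter)
            del parent['children'][parent_letter]

print(sorted(random_words('ab', 2)))      # ['aa', 'ab', 'ba', 'bb']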
I think you can do something pretty straightforward by generating a random array of characters based on the alphabet you have (in C#):
char[] alphabet = { 'a', 'b', 'c', 'd' };
int wordLength = 3;
Random rand = new Random();
for (int i = 0; i < 5; i++)
{
    char[] word = new char[wordLength];
    for (int j = 0; j < wordLength; j++)
    {
        word[j] = alphabet[rand.Next(alphabet.Length)];
    }
    Console.WriteLine(new string(word));
}
Obviously this might generate duplicates but you could maybe store results in a hashmap or something to check for duplicates if you need to.
So I take it what you want is to produce a permutation of the set using as little memory as possible.
First off, it can't be done using no memory. For your first string, you want a function that could produce any of the strings with equal likelihood. Say that function is called nextString(). If you call nextString() again without changing anything in the state, of course it will once again be able to produce any of the strings.
So you need to store something. The question is, what do you need to store, and how much space will it take?
The strings can be seen as the numbers 0 to X^Y - 1 (aaa=0, aab=1, aac=2, ..., aba=X, ...). So to store a single string as efficiently as possible, you'd need lg(X^Y) bits. Let's say X = 16 and Y = 2. Then you'd need 1 byte of storage to uniquely specify a string.
Of course the most naive algorithm is to mark each string as it is produced, which takes X^Y bits, which in my example is 256 bits (32 bytes). This is what you've said you don't want to do. You can use a shuffle algorithm as discussed in this question: Creating a random ordered list from an ordered list (you won't need to store the strings as you produce them through the shuffle algorithm, but you still need to mark them).
Ok, now the question is, can we do better than that? How much do we need to store, total?
Well, on the first call, we don't need any storage. On the second call, we need to know which one was produced before. On the last call, we only need to know which one is the last one left. So the worst case is when we're halfway through. When we're halfway through, there have been 128 strings produced, and there are 128 to go. We need to know which are left to produce. Assuming the process is truly random, any split is possible. There are (256 choose 128) possibilities. In order to potentially be able to store any of these, we need lg(256 choose 128) bits, which according to google calculator is 251.67. So if you were really clever you could squeeze the information into 4 fewer bits than the naive algorithm. Probably not worth it.
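That figure is easy to verify in Python:

from math import comb, log2
print(log2(comb(256, 128)))   # 251.6729..., just under the naive 256 bits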
If you just want it to look randomish with very little storage, see this question: Looking for an algorithm to spit out a sequence of numbers in a (pseudo) random order
