Threshold to stop generating random unique things

Given a population size P, I must generate P random, but unique objects. An object is an unordered list of X unique unordered pairs.
I am currently just using a while loop with T attempts at generating a random ordering before giving up. Currently T = some constant.
So my question is: at what point should I stop attempting to generate more unique objects, i.e. what is a reasonable value of T?
For example:
1) If I have 3 unique objects and I need just one more, I can attempt up to e.g. 4 times
2) But if I have 999 unique objects and I need just one more, I do not want to make e.g. 1000 attempts
The problem I'm dealing with doesn't absolutely require every unique ordering. The user actually specifies the number, so I want to determine at what point to say that it is not reasonable to generate any more.
I hope that makes sense
If not, a more general case:
Choosing N numbers, at what value of T does it start to get very difficult to generate more unique random numbers from the possible N?
I'm not sure if T would be the same in both cases but maybe this second case would be sufficient for my needs. I need a relatively large threshold for small values of N and a relatively small threshold for large values of N.
Not that it matters, but this is for a basic genetic algorithm.
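For the general case, a rough way to pick T (a hedged sketch based on simple collision probabilities; the eps parameter and function name below are my own, not from the question): if k of the N possible values have already been generated, a uniform random attempt collides with probability k/N, so choosing T such that (k/N)^T <= eps bounds the chance of giving up while unseen values still exist. This automatically gives a large T when the pool is nearly exhausted and a T of 1 or 2 when N is huge.
```python
import math

def attempt_threshold(n_possible, n_found, eps=0.01):
    """Return a number of attempts T such that, if an unused value still
    exists, the chance that T independent uniform draws are ALL collisions
    is at most eps (eps is an assumed tuning knob, not from the question)."""
    if n_found >= n_possible:
        return 0                      # nothing new can ever be generated
    p_collision = n_found / n_possible
    if p_collision == 0.0:
        return 1                      # the very first draw is always new
    return math.ceil(math.log(eps) / math.log(p_collision))

print(attempt_threshold(1000, 999))   # small pool, nearly full -> large T
print(attempt_threshold(10**9, 999))  # huge pool -> T of 1 is plenty
```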

Are you asking for something like lottery ticket/ball selection? For that there is a well-known shuffle algorithm: the Fisher–Yates (Knuth) shuffle.

Related

Distribute a quantity randomly

I'm starting a project where I'm simulating an explosion of an object. I want to randomly distribute the total mass of the exploding object into the fragments. For example, if the object has a mass of 3 kg and breaks into 3 fragments, their masses could be 1, 0.5, and 1.5 kg respectively. I want to do the same thing with energy and other quantities. Also, I would like to have control over the random distribution used.
I think I could do this simply by generating a random number, somehow relating it to the quantity I want to distribute, and repeating that while subtracting from the total pool. The problem with this approach is that at first sight it doesn't seem very efficient, and it may give problems for a fixed number of fragments.
So the question is, is there an algorithm or an efficient way to do this?
An example will be thoroughly appreciated.
For this problem, the first thing I would try is this:
Generate N-1 random numbers between 0 and 1
Sort them
Raise them to the xth power
Take the N differences between 0, the successive numbers, and 1, and multiply each by the quantity you want to distribute. These differences always add up to 1, so you'll end up distributing exactly the target quantity.
A nice advantage of this method is that you can adjust the parameter x to get an aesthetically pleasing distribution of chunks. Natural explosions won't produce a uniform distribution of chunk sizes, so you'll want to play with this.
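A minimal Python sketch of those steps (the function name and the default exponent are my own choices):
```python
import random

def split_quantity(total, n, x=2.0):
    """Split `total` into n non-negative parts that sum to `total`,
    using the sorted-uniforms / spacings idea described above.
    The exponent x (an assumed tuning parameter) skews the chunk sizes."""
    cuts = sorted(random.random() for _ in range(n - 1))
    cuts = [c ** x for c in cuts]          # still sorted and in [0, 1]
    points = [0.0] + cuts + [1.0]
    return [(b - a) * total for a, b in zip(points, points[1:])]

parts = split_quantity(3.0, 3)   # e.g. masses of 3 fragments of a 3 kg object
print(parts, sum(parts))         # parts always sum to 3.0 (up to float error)
```
With x > 1 the cut points are pushed toward 0, so you tend to get one large fragment and several smaller ones; x = 1 gives uniform spacings.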
So here's a generic algorithm that might work for you:
Generate N random numbers using a distribution of your choosing
Find the sum of all the numbers
Divide each number by that sum
Multiply by the fixed total mass of your object
This will only take O(N) time, and will allow you to control the distribution and number of chunks.
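A sketch of this variant (names are mine; the draw parameter is there because the answer says you can choose the distribution):
```python
import random

def distribute(total, n, draw=random.random):
    """Distribute `total` over n chunks by normalising n random draws.
    `draw` can be any non-negative random generator you choose
    (an assumption: it defaults to uniform(0, 1) here)."""
    weights = [draw() for _ in range(n)]
    s = sum(weights)
    return [w / s * total for w in weights]

print(distribute(3.0, 3))                                    # fairly even chunks
print(distribute(3.0, 3, lambda: random.expovariate(1.0)))   # more skewed chunks
```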

Is a zinterstore going to be faster/slower when one of the two input sets is a normal set?

I know I can do a zinterstore with a normal set as an argument (Redis: How to intersect a "normal" set with a sorted set?). Is that going to affect performance? Is it going to be faster/slower than working only with zsets?
According to the sorted-set source code, ZINTERSTORE treats a plain set like a sorted set whose members all have score 1; the relevant function is zunionInterGenericCommand.
Intersecting sets will take more or less time depending on the sorting algorithm used in this step, for example:
/* sort sets from the smallest to largest, this will improve our
* algorithm's performance */
qsort(src,setnum,sizeof(zsetopsrc),zuiCompareByCardinality);
There are also differences in how Sets and Zsets are stored, which will affect how they are read. Redis decides how to encode a (sorted) set depending on how many elements it contains, so iterating through them requires different amounts of work.
However, for any practical purpose, I'd say that your best bet is to use ZINTERSTORE, and I'll explain why: I hardly see how anything you might write in your own code will beat Redis's performance at the intersection you want to do.
If your concern is performance, you're getting too deep into the details. Your focus should instead be on the big-O of the operation, shown in the command documentation:
Time complexity: O(N*K)+O(M*log(M)) worst case with N being the smallest input sorted set, K being the number of input sorted sets and M being the number of elements in the resulting sorted set.
What this tells you is:
1) The size of the smallest set and the number of sets you plan to intersect determine the first part. Therefore, if you know that you'll always intersect 2 sets, one small and the other huge, you can treat the first part as constant. A good example of this would be intersecting a set of all available products in a store (where the score is how many are in stock) with a sorted set of products in a user's cart (see the sketch at the end of this answer).
In this case you'll have only 2 sets, and you'll know one of them will be very small.
2) The size of the resulting sorted set M can cause a big performance issue. But there's a detail here: sorted sets are encoded as a skip list once they grow too big, while a small sorted set is stored as a zip list, so the heavier skip-list encoding only becomes an important hit for big resulting sorted sets.
However, for the case of intersection, you know that the resulting set cannot be bigger than the smallest set you provide. For a union, the resulting set will contain all elements of all sets, so attention needs to be on the size of the bigger sets more than on the smallest.
In summary, the answer to the question of performance with (sorted) sets is: it depends on the sizes of the sets much more than on the actual datatype. Take into consideration that the resulting data structure will be a sorted set even if all the inputs are plain sets. Therefore a big result will be stored (less efficiently) as a skip list.
Knowing beforehand how many sets you plan to intersect (2, 3, depending on user input?) and the size of the smaller set (10? hundreds? thousands?) will give you a much better idea than the internal datatypes. The algorithm for intersecting is the same for both types.
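To make the stock/cart example above concrete, here is a hedged sketch using the redis-py client (key names and data are made up, and the products set is kept as a plain set to mirror the question's normal-set-plus-zset scenario):
```python
import redis

r = redis.Redis()  # assumes a Redis server on localhost:6379

# Plain set of available products (the "normal" set from the question).
r.sadd("stock", "apple", "bread", "milk")

# Sorted set: the user's cart, score = quantity wanted (made-up data).
r.zadd("cart", {"apple": 2, "milk": 1, "caviar": 5})

# Intersect one plain set with one sorted set. Members of the plain set
# are treated as having score 1, so the default SUM aggregation yields
# cart quantity + 1 for every product that is actually in stock.
r.zinterstore("in_stock_cart", ["stock", "cart"])

print(r.zrange("in_stock_cart", 0, -1, withscores=True))
# e.g. [(b'milk', 2.0), (b'apple', 3.0)] -- 'caviar' drops out
```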
Redis by default gives each element of the normal set the same default score, i.e. it treats the normal set like a sorted set in which all elements have an equal default score. I believe performance should be the same as intersecting 2 sorted sets.

read file only once for Stratified sampling

If I do not know the distribution (or size/probability) of each subpopulation (stratum), and also do not know the total population size, is it possible to do stratified sampling by reading the file only once? Thanks.
https://en.wikipedia.org/wiki/Stratified_sampling
regards,
Lin
Assuming that each record in the file can be identified as being in a particular sub-population, and that you know ahead of time what size of random sample you want from that sub-population, you could hold, for each sub-population, a data structure allowing you to do reservoir sampling for that sub-population (https://en.wikipedia.org/wiki/Reservoir_sampling#Algorithm_R).
So repeatedly:
Read a record
Find out which sub-population it is in and get the datastructure representing the reservoir sampling for that sub-population, creating it if necessary.
Use that data-structure and the record read to do reservoir sampling for that sub-population.
At the end you will have, for each sub-population seen, a reservoir sampling data-structure containing a random sample from that population.
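A minimal sketch of that loop in Python (Algorithm R per stratum; the record format and the get_stratum callback are assumptions for illustration):
```python
import random

def stratified_reservoirs(records, get_stratum, k):
    """One pass over `records`, keeping a size-k reservoir (Algorithm R)
    for every stratum encountered. `get_stratum` maps a record to its
    sub-population key, which is assumed to be identifiable per record."""
    reservoirs = {}   # stratum -> list of kept records
    seen = {}         # stratum -> number of records of that stratum so far
    for rec in records:
        s = get_stratum(rec)
        n = seen.get(s, 0) + 1
        seen[s] = n
        res = reservoirs.setdefault(s, [])
        if len(res) < k:
            res.append(rec)               # reservoir not yet full
        else:
            j = random.randrange(n)       # uniform in [0, n)
            if j < k:
                res[j] = rec              # replace with probability k/n
    return reservoirs

# Usage sketch: records could be lines of a file, stratum = first field.
# samples = stratified_reservoirs(open("data.txt"),
#                                 lambda line: line.split(",")[0], k=100)
```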
For the case when you wish to end up with k of N samples forming a stratified sample over the different classes of records, I don't think you can do much better than keeping k of each class and then downsampling from this. Suppose you can, and I give you an initial block of records organised so that the stratified sample will keep fewer than k/2 of some class. Now I follow that block with a huge number of records, all of this class, which is now clearly underrepresented. In this case the final random sample should have much more than k/2 from this class, and (if it is really random) there should be a very small but non-zero probability that more than k/2 of those randomly chosen records came from the first block. But the fact that we never keep more than k/2 of these records from the first block means that with this sampling scheme the probability is exactly zero, so keeping fewer than k of each class won't work in the worst case.
Here is a cheat method. Suppose that instead of reading the records sequentially we can read them in any order we choose. If you look through stackoverflow you will see (rather contrived) methods based on cryptography for generating a random permutation of N items without holding N items in memory at any one time, so you could do this. Now keep a pool of k records such that at any time the proportions of the items in the pool form a stratified sample, only adding or removing items from the pool when you are forced to in order to keep the proportions correct. I think you can do this because you need to add an item of class X to keep the proportions correct exactly when you have just observed another item of class X. Because you went through the records in a random order, I claim that you have a random stratified sample. Clearly you have a stratified sample, so the only departure from randomness can be in the items selected for a particular class. But consider the permutations which select items not of that class in the same order as the permutation actually chosen, but which select items of that class in different orders. If there is bias in the way that items of that class are selected (as there probably is), then because the bias affects different items of that class in different ways depending on which permutation is selected, the result of the random choice between all of these different permutations is that the total effect is unbiased.
To do sampling in a single pass is simple, if you are able to keep the results in memory. It consists of two parts:
Calculate the odds of the new item being part of the result set, and use a random number to determine if the item should be part of the result or not.
If the item is to be kept, determine whether it should be added to the set or replace an existing member. If it should replace an existing member, use a random number to determine which existing member it should replace. Depending on how you calculate your random numbers, this can be the same one as the previous step or it can be a new one.
For stratified sampling, the only modification required for this algorithm is to determine which stratum the item belongs to. The result lists for each stratum should be kept separate.

Shuffling a huge range of numbers using minimal storage

I've got a very large range/set of numbers, (1..1236401668096), that I would basically like to 'shuffle', i.e. randomly traverse without revisiting the same number. I will be running a web service, and each time a request comes in it will increment a counter and pull the next 'shuffled' number from the range. The algorithm will have to accommodate the server going offline, being able to restart traversal using the persisted value of the counter (something like how you can seed a pseudo-random number generator and get the same pseudo-random number given the seed and which iteration you are on).
I'm wondering if such an algorithm exists or is feasible. I've seen the Fisher-Yates shuffle, but its first step is to "Write down the numbers from 1 to N", which would take terabytes of storage for my entire range. Generating a pseudo-random number for each request might work for a while, but as the database/tree gets full, collisions will become more common and could degrade performance (already a 0.08% chance of collision after 1 billion hits according to my calculation). Is there a more ideal solution for my scenario, or is this just a pipe dream?
The reason for the shuffling is that being able to correctly guess the next number in the sequence could lead to a minor DoS vulnerability in my app, but also because the presentation layer will look much nicer with a wider number distribution (I'd rather not go into details about exactly what the app does). At this point I'm considering just using a PRNG and dealing with collisions, or shuffling range slices (starting with (1..10000000).to_a.shuffle, then (10000001..20000000).to_a.shuffle, etc. as each range's numbers start to run out).
Any mathemagicians out there have any better ideas/suggestions?
Concatenate a PRNG or LFSR sequence with /dev/random bits
There are several algorithms that can generate pseudo-random numbers with arbitrarily large and known periods. The two obvious candidates are the LCPRNG (LCG) and the LFSR, but there are more algorithms such as the Mersenne Twister.
The period of these generators can easily be chosen to fit your requirements, and then you simply won't have collisions.
You could deal with the predictable behavior of PRNGs and LFSRs by adding 10, 20, or 30 bits of cryptographically hashed entropy from an interface like /dev/random. Because the deterministic part of your number is known to be unique, it makes no difference if you ever repeat the actually random part of it.
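As a hedged illustration of the LCG route (the constants below are my own choices satisfying the Hull-Dobell full-period conditions for a power-of-two modulus; the range 1..1236401668096 comes from the question): run a full-period LCG modulo the next power of two and skip outputs that fall outside the range (cycle walking). The only state to persist between requests is the current LCG value.
```python
RANGE_MAX = 1236401668096      # serve numbers 1..RANGE_MAX (from the question)
M = 1 << 41                    # smallest power of two covering the range
A = 6364136223846793005 % M    # multiplier with A % 4 == 1 -> full period
C = 1442695040888963407 % M    # odd increment              -> full period

def next_number(state):
    """Advance a full-period LCG modulo 2**41, skipping states outside the
    target range (cycle walking). Over one full period every number in
    1..RANGE_MAX appears exactly once."""
    while True:
        state = (A * state + C) % M
        if state < RANGE_MAX:
            return state + 1, state   # (number to serve, state to persist)

value, state = next_number(0)      # first request, starting from seed 0
value, state = next_number(state)  # next request, and so on
```
About 56% of the 2^41 states fall inside the range, so on average fewer than two steps are needed per request; the predictability concern is what the /dev/random bits above are meant to address.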
Divide and conquer? Break the range down into manageable chunks and shuffle them. You could divide the number range e.g. by value modulo n. Each group can be constructed on the fly and is quite small, depending on n. Once a group is exhausted, you can use the next one.
For example if you choose an n of 1000, you create 1000 different groups. Pick a random number between 1 and 1000 (let's call this x) and shuffle the numbers whose value modulo 1000 equals x. Once you have exhausted that range, you can choose a new random number between 1 and 1000 (without x obviously) to get the next subset to shuffle. It shouldn't exactly be challenging to keep track of which numbers of the 1..1000 range have already been used, so you'd just need a repeatable shuffle algorithm for the numbers in the subset (e.g. Fisher-Yates on their "indices").
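A sketch of the modulo grouping (I've used a larger n than the answer's 1000 so each group comfortably fits in memory; persisting or seeding the shuffles so they are repeatable is left out):
```python
import random

RANGE_MAX = 1236401668096
N_GROUPS = 1_000_000          # my choice; larger n means smaller groups

def group_members(x, n=N_GROUPS, upper=RANGE_MAX):
    """All numbers in 1..upper whose value modulo n equals x."""
    start = x if x != 0 else n          # smallest member that is >= 1
    return list(range(start, upper + 1, n))

# Pick residue classes in a random order, shuffle each class's members,
# hand them out, then move on to the next unused class.
unused = list(range(N_GROUPS))
random.shuffle(unused)

x = unused.pop()
current = group_members(x)
random.shuffle(current)       # Fisher-Yates under the hood
first_batch = current[:5]     # the next few numbers to serve
```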
I guess the best option is to use a GUID/UUID. They are made for this type of thing, and it shouldn't be hard to find an existing implementation to suit your needs.
While collisions are theoretically possible, they are extremely unlikely. To quote Wikipedia:
The probability of one duplicate would be about 50% if every person on earth owns 600 million UUIDs
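For completeness, generating a random (version 4) UUID in Python is a one-liner:
```python
import uuid

print(uuid.uuid4())   # a random version-4 UUID: 122 random bits, collisions negligible
```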

Efficiently estimating the number of unique elements in a large list

This problem is a little similar to the one solved by reservoir sampling, but not the same. I think it's also a rather interesting problem.
I have a large dataset (typically hundreds of millions of elements), and I want to estimate the number of unique elements in this dataset. There may be anywhere from a few to millions of unique elements in a typical dataset.
Of course the obvious solution is to maintain a running hashset of the elements you encounter and count them at the end. This would yield an exact result, but would require me to carry a potentially large amount of state with me as I scan through the dataset (ie. all unique elements encountered so far).
Unfortunately in my situation this would require more RAM than is available to me (noting that the dataset may be far larger than available RAM).
I'm wondering if there would be a statistical approach to this that would allow me to do a single pass through the dataset and come up with an estimated unique element count at the end, while maintaining a relatively small amount of state while I scan the dataset.
The input to the algorithm would be the dataset (an Iterator in Java parlance), and it would return an estimated unique object count (probably a floating point number). It is assumed that these objects can be hashed (ie. you can put them in a HashSet if you want to). Typically they will be strings, or numbers.
You could use a Bloom Filter for a reasonable lower bound. You just do a pass over the data, counting and inserting items which were definitely not already in the set.
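A minimal sketch of that counting Bloom-filter pass (the bit-array size and the SHA-1-based hashing scheme are my own choices):
```python
import hashlib

class BloomCounter:
    """Counts items that were *definitely* not seen before, giving a lower
    bound on the number of distinct elements."""
    def __init__(self, n_bits=1 << 24, n_hashes=4):
        self.bits = bytearray(n_bits // 8)
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.definitely_new = 0

    def _positions(self, item):
        digest = hashlib.sha1(str(item).encode()).digest()
        for i in range(self.n_hashes):           # carve 4 indexes out of one digest
            chunk = digest[4 * i: 4 * i + 4]
            yield int.from_bytes(chunk, "big") % self.n_bits

    def add(self, item):
        new = False
        for pos in self._positions(item):
            byte, bit = divmod(pos, 8)
            if not (self.bits[byte] >> bit) & 1:  # any unset bit => definitely new
                new = True
                self.bits[byte] |= 1 << bit
        if new:
            self.definitely_new += 1

bc = BloomCounter()
for x in ["a", "b", "a", "c"]:
    bc.add(x)
print(bc.definitely_new)   # lower bound on distinct elements (3 here, barring false positives)
```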
This problem is well-addressed in the literature; a good review of various approaches is http://www.edbt.org/Proceedings/2008-Nantes/papers/p618-Metwally.pdf. The simplest approach (and most compact for very high accuracy requirements) is called Linear Counting. You hash elements to positions in a bitvector just like you would a Bloom filter (except only one hash function is required), but at the end you estimate the number of distinct elements by the formula D = -total_bits * ln(unset_bits/total_bits). Details are in the paper.
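A sketch of Linear Counting as described (the bitvector size m and the hash are assumed choices; m should comfortably exceed the expected number of distinct elements):
```python
import hashlib
import math

def linear_count(items, m=1 << 20):
    """Linear Counting: hash each item to one of m bit positions, then
    estimate D = -m * ln(unset_bits / m)."""
    bits = bytearray(m // 8)
    for item in items:
        h = int.from_bytes(
            hashlib.blake2b(str(item).encode(), digest_size=8).digest(), "big") % m
        bits[h // 8] |= 1 << (h % 8)
    unset = m - sum(bin(b).count("1") for b in bits)
    if unset == 0:
        raise ValueError("bitvector saturated; increase m")
    return -m * math.log(unset / m)

print(linear_count(["a", "b", "a", "c", "b"]))   # approximately 3
```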
If you have a hash function that you trust, then you could maintain a hashset just like you would for the exact solution, but throw out any item whose hash value is outside of some small range. E.g., use a 32-bit hash, but only keep items where the first two bits of the hash are 0. Then multiply by the appropriate factor at the end to approximate the total number of unique elements.
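A sketch of that hash-range trick (crc32 stands in for "a hash function that you trust", which is an assumption on my part):
```python
import zlib

def estimate_distinct(items):
    """Maintain a set only of items whose 32-bit hash has its top two
    bits equal to zero (1/4 of the hash space), then scale up by 4."""
    kept = set()
    for item in items:
        h = zlib.crc32(str(item).encode()) & 0xFFFFFFFF
        if h >> 30 == 0:          # top two bits are zero
            kept.add(item)
    return 4 * len(kept)          # memory used is roughly 1/4 of the exact solution

print(estimate_distinct(["a", "b", "a", "c"]))   # a (noisy) estimate of the distinct count
```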
Nobody has mentioned the approximate algorithm designed specifically for this problem: HyperLogLog.
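If you'd rather not implement HyperLogLog yourself, a usage sketch with the third-party datasketch package (an assumption on my part; any HLL implementation would do) looks roughly like this:
```python
from datasketch import HyperLogLog   # pip install datasketch (third-party library)

hll = HyperLogLog(p=12)              # 2**12 registers, a few KB of state
for item in ["a", "b", "a", "c"]:
    hll.update(item.encode("utf8"))
print(hll.count())                   # approximate number of distinct items
```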
