read file only once for Stratified sampling - algorithm

If we do not know the distribution (or size/probability) of each subpopulation (stratum), and we also do not know the total population size, is it possible to do stratified sampling by reading the file only once? Thanks.
https://en.wikipedia.org/wiki/Stratified_sampling
regards,
Lin

Assuming that each record in the file can be identified as belonging to a particular sub-population, and that you know ahead of time what size of random sample you want from that sub-population, you could hold, for each sub-population, a data structure that lets you do reservoir sampling for that sub-population (https://en.wikipedia.org/wiki/Reservoir_sampling#Algorithm_R).
So repeatedly:
Read a record
Find out which sub-population it is in and get the data structure representing the reservoir sample for that sub-population, creating it if necessary.
Use that data structure and the record just read to do reservoir sampling for that sub-population.
At the end you will have, for each sub-population seen, a reservoir-sampling data structure containing a random sample from that sub-population.
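A minimal sketch of that loop in Python, assuming records arrive as (stratum, value) pairs and that the per-stratum sample size k is known up front; it simply runs Algorithm R independently for each stratum:

import random

def stratified_reservoir(records, k):
    """One pass: keep a k-element Algorithm R reservoir per stratum.
    records is an iterable of (stratum, value) pairs; k is the desired
    per-stratum sample size (both are assumptions of this sketch)."""
    reservoirs = {}   # stratum -> list of sampled values
    counts = {}       # stratum -> number of records seen so far
    for stratum, value in records:
        n = counts.get(stratum, 0) + 1
        counts[stratum] = n
        res = reservoirs.setdefault(stratum, [])
        if len(res) < k:
            res.append(value)            # reservoir not full yet: always keep
        else:
            j = random.randrange(n)      # keep with probability k/n
            if j < k:
                res[j] = value           # replace a uniformly chosen element
    return reservoirs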
For the case when you wish to end up with k of N samples forming a stratified sample over the different classes of records, I don't think you can do much better than keeping k of each class and then downsampling from this. Suppose you could, and I give you an initial block of records organised so that your scheme keeps fewer than k/2 records of some class. Now I follow that block with a huge number of records, all of this class, which is now clearly underrepresented. In this case the final random sample should take much more than k/2 of its records from this class, and (if it is really random) there should be a very small but non-zero probability that more than k/2 of those randomly chosen records came from the first block. But the fact that we never kept more than k/2 of this class from the first block means that under this sampling scheme the probability is exactly zero, so keeping fewer than k of each class won't work in the worst case.
Here is a cheat method. Suppose that instead of reading the records sequentially we can read them in any order we choose. If you look through Stack Overflow you will see (rather contrived) methods based on cryptography for generating a random permutation of N items without holding N items in memory at any one time, so you could do this. Now keep a pool of k records such that at any time the proportions of the items in the pool form a stratified sample, only adding or removing items from the pool when you are forced to do so to keep the proportions correct. I think you can do this because you need to add an item of class X to keep the proportions correct exactly when you have just observed another item of class X. Because you went through the records in a random order, I claim that you end up with a random stratified sample. Clearly you have a stratified sample, so the only departure from randomness can be in which items are selected for a particular class. But consider the permutations which select items not of that class in the same order as the permutation actually chosen, but which select items of that class in different orders. If there is bias in the way that items of that class are selected (as there probably is), then because that bias affects different items of that class in different ways depending on which permutation is selected, the result of the random choice between all of these different permutations is that the total effect is unbiased.

To do sampling in a single pass is simple, if you are able to keep the results in memory. It consists of two parts:
Calculate the odds of the new item being part of the result set, and use a random number to determine if the item should be part of the result or not.
If the item is to be kept, determine whether it should be added to the set or replace an existing member. If it should replace an existing member, use a random number to determine which existing member it should replace. Depending on how you calculate your random numbers, this can be the same one as the previous step or it can be a new one.
For stratified sampling, the only modification required for this algorithm is to determine which stratum the item belongs to. The result lists for each stratum should be kept separate.
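As a concrete reading of those two parts, here is a sketch for one stratum's result list, assuming a fixed capacity k and that n counts how many items of this stratum have been seen so far, including the new one (the function name and arguments are mine, not from the answer):

import random

def offer(result, k, n, item):
    """Single-pass update for one stratum's result list (Algorithm R style)."""
    if len(result) < k:
        result.append(item)              # still filling the list: always keep
        return
    if random.random() < k / n:          # part 1: odds of the new item being kept
        victim = random.randrange(k)     # part 2: which existing member it replaces
        result[victim] = item
    # As the answer notes, both parts can also be folded into one draw of randrange(n).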

Related

Randomly Consuming a Set of Elements Indexed From 1 to N

In this problem, I have a set of elements that are indexed from 1 to n. Each element actually corresponds to a graph node and I am trying to calculate random one-to-one matchings between the nodes. For the sake of simplicity, I neglect further details of the actual problem. I need to write a fast algorithm to randomly consume these elements (nodes) and do this operation multiple times in order to calculate different matchings. The purpose here is to create randomized inputs to another algorithm and each calculated matching at the end of this will be another input to that algorithm.
The most basic algorithm I can think of is to create copies of the elements in the form of an array, generate random integers, and use them as array indices to apply swap operations. This way each random copy can be created in O(n) but in practice, it uses a lot of copy and swap operations. Performance is very important and I am looking for faster ways (algorithms and data structures) of achieving this goal. It just needs to satisfy the two conditions:
It shall be able to consume a random element.
It shall be able to consume an element on the given index.
I tried to write this as clearly as possible. If you have any questions, feel free to ask and I will be happy to clarify. Thanks in advance.
Note: Matching is an operation where you pair the vertices on a graph if there exists an edge between them.
Shuffle index array (for example, with Fisher-Yates shuffling)
ia = [3,1,4,2]
Walk through index array and "consume" set element with current index
for x in ia:
    consume(Set[x])  # consume the set element whose index is x
So for this example you will get the order Set[3], Set[1], Set[4], Set[2].
No element swaps are needed; only the array of integers is changed.
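A minimal sketch of that idea in Python, assuming the elements live in a 1-indexed dict and that consume is whatever processing the caller needs (both names are placeholders, not from the question):

import random

def consume_in_random_order(elements, consume):
    """elements: dict mapping index 1..n to a node; consume: a callback."""
    ia = list(elements.keys())     # index array
    random.shuffle(ia)             # Fisher-Yates shuffle under the hood
    for x in ia:
        consume(elements[x])       # the elements themselves are never moved

# Example: nodes indexed 1..4
nodes = {1: "a", 2: "b", 3: "c", 4: "d"}
consume_in_random_order(nodes, print)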

Algorithmic help needed (N bags and items distributed randomly)

I have encountered an algorithmic problem but am not able to figure out anything better than brute force, or a reduction to a better-known problem. Any hints?
There are N bags of variable sizes and N types of items. Each type of item belongs to one bag. There are lots of items of each type, and each item may have a different size. Initially, these items are distributed across all the bags randomly. We have to place the items in their respective bags. However, we can only operate on a pair of bags at a time, exchanging items (as much as possible) before proceeding to the next pair. The aim is to reduce the total number of pairs. Edit: the aim is to find a sequence of transfers that minimizes the total number of bag pairs involved.
Clarification:
The bags are not arbitrarily large (you can assume the bag and item sizes to be integers between 0 and 1000 if it helps). You'll frequently encounter scenarios where all the items between two bags cannot be swapped due to the limited capacity of one of the bags. This is where the algorithm needs to make an optimisation: perhaps, if another pair of bags were swapped first, the current swap could be done in one go. To illustrate this, let's consider bags A, B and C and their items 1, 2, 3 respectively. The number in the brackets is the size.
A(10) : 3(8)
B(10): 1(2), 1(3)
C(10): 1(4)
The swap orders can be AB, AC, AB or AC, AB. The latter is optimal, as the number of swaps is smaller.
Since I cannot come up with an algorithm that will always find an optimal answer, and an approximation of the solution quality (the number of swaps) is also fine, I suggest a stochastic local search algorithm with pruning.
Given a random starting configuration, this algorithm considers all possible swaps and makes a weighted random decision: the better a swap is, the more likely it is to be chosen.
The value of a swap would be the sum, over the items moved, of the value of moving each item, which is zero if the item does not end up in its own bag and positive if it does. The value increases with the item's size (the idea being that a larger item is harder to move many times than smaller ones). This fitness function can be replaced by any other fitness function; its effectiveness is unknown until shown empirically.
Since any configuration can be the consequence of many preceding swaps, we keep track of which configurations we have seen before, along with a fitness (based on how many items are in their correct bag; this fitness is not related to the value of a swap) and the list of preceding swaps. If the fitness of a configuration is the number of items that are in their correct bags, then the number of items in the problem is the highest possible fitness (and therefore marks a configuration as a solution).
A swap is not possible if:
Either of the affected bags would hold more than its capacity after the potential swap.
The new swap brings you back to the configuration you were in before the last swap you did (i.e. a reversed swap).
When we identify potential swaps, we look into our list of previously seen configurations (use a hash function for O(1) lookup). Then we either set its preceding-swap list to ours (if our list is shorter than its), or we set our preceding-swap list to its (if its list is shorter than ours). We can do this because it does not matter which swaps we did, as long as the number of swaps is as small as possible.
If there are no more possible swaps left in a configuration, it means you're stuck. Local search tells you to 'reset', which you can do in many ways, for instance:
Reset to a previously seen state (maybe the best one you've seen so far?)
Reset to a new valid random solution
Note
Since the algorithm only allows you to do valid swaps, all constraints will be met for each configuration.
The algorithm does not guarantee to 'stop' out of the box; you can implement a maximum number of iterations (swaps).
The algorithm does not guarantee to find a correct solution, as it only does its best to find a better configuration each iteration. However, since a perfect solution (set of swaps) should look close to an almost-perfect one, a human might be able to finish what the local search algorithm could not after it ends in a configuration where not every item is in its correct bag.
The fitness functions and strategies used here are very likely not the most efficient out there. You could look around for better ones. A more efficient fitness function / strategy should result in a good solution faster (fewer iterations).
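A rough Python skeleton of this kind of local search, under modelling assumptions of my own: a configuration is a tuple of bags, each bag a frozenset of (item_id, home_bag, size) triples, caps[b] is the capacity of bag b, and the greedy exchange below is just one possible reading of "swap as much as possible", not the answer's definition. It is a sketch of the strategy, not a tested implementation.

import random

def exchange(config, caps, i, j):
    """Tentatively move every item that belongs in the other bag, then undo
    moves (largest first) into whichever bag ended up over capacity."""
    bags = [set(b) for b in config]
    moves = [(x, i, j) for x in bags[i] if x[1] == j] + \
            [(x, j, i) for x in bags[j] if x[1] == i]
    for item, src, dst in moves:
        bags[src].remove(item)
        bags[dst].add(item)
    def load(b):
        return sum(size for _, _, size in bags[b])
    pending = sorted(moves, key=lambda m: m[0][2], reverse=True)
    while load(i) > caps[i] or load(j) > caps[j]:
        over = i if load(i) > caps[i] else j
        item, src, dst = next(m for m in pending if m[2] == over)
        pending.remove((item, src, dst))
        bags[dst].remove(item)           # undo this move; the starting
        bags[src].add(item)              # configuration fits, so this terminates
    return tuple(frozenset(b) for b in bags)

def homed_size(config):
    """Total size of items already in their home bag (used as the swap value)."""
    return sum(s for b, bag in enumerate(config) for _, home, s in bag if home == b)

def fitness(config):
    """Number of items already in their home bag."""
    return sum(1 for b, bag in enumerate(config) for _, home, _ in bag if home == b)

def local_search(config, caps, max_iters=10000, rng=None):
    rng = rng or random.Random()
    n_items = sum(len(b) for b in config)
    seen = {config: []}                  # configuration -> shortest known swap list
    path = []
    for _ in range(max_iters):
        if fitness(config) == n_items:
            return path                  # every item is in its home bag
        candidates = []
        for i in range(len(config)):
            for j in range(i + 1, len(config)):
                new = exchange(config, caps, i, j)
                if new != config:
                    candidates.append(((i, j), new, homed_size(new) - homed_size(config)))
        if not candidates:
            break                        # stuck; a full version would reset here
        # Weighted random choice: the more valuable a swap, the likelier it is picked.
        pair, config, _ = rng.choices(candidates,
                                      weights=[c[2] for c in candidates], k=1)[0]
        path = path + [pair]
        if config in seen and len(seen[config]) < len(path):
            path = list(seen[config])    # a shorter route to this state was already known
        else:
            seen[config] = list(path)
    return None

# The A/B/C example from the question: capacities 10, items as (id, home, size).
caps = (10, 10, 10)
start = (frozenset({("a", 2, 8)}),
         frozenset({("b", 0, 2), ("c", 0, 3)}),
         frozenset({("d", 0, 4)}))
print(local_search(start, caps))   # one possible result: [(0, 2), (0, 1)], the AC, AB order

Note that because this sketch's exchange only ever moves items toward their home bag, the "reversed swap" rule above never triggers and is omitted.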

Threshold to stop generating random unique things

Given a population size P, I must generate P random, but unique objects. An object is an unordered list of X unique unordered pairs.
I am currently just using a while loop with T attempts at generating a random ordering before giving up. Currently T = some constant.
So my question is at what point should I stop attempting to generate more unique objects i.e. the reasonable value of T.
For example:
1) If I have 3 unique objects and I need just one more, I can attempt up to e.g. 4 times
2) But if I have 999 unique objects and I need just one more, I do not want to make e.g. 1000 attempts
The problem I'm dealing with doesn't absolutely require every unique ordering. The user specifies the number actually, so I want to determine at what point to say that it is not reasonable to generate any more.
I hope that makes sense
If not, a more general case:
Choosing N numbers, at what value of T does it start to get very difficult to generate more unique random numbers out of the possible N?
I'm not sure if T would be the same in both cases but maybe this second case would be sufficient for my needs. I need a relatively large threshold for small values of N and a relatively small threshold for large values of N.
Not that it matters, but this is for a basic genetic algorithm.
Are you asking for something like lottery ticket/ball selection? For that there is a well-known shuffle algorithm: the Fisher–Yates (Knuth) shuffle.
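For reference, a minimal Fisher-Yates shuffle in Python; drawing a prefix of a pre-shuffled array gives unique selections without repeated rejection attempts (the ball/ticket framing here is just an illustration):

import random

def fisher_yates(items):
    """Fisher-Yates (Knuth) shuffle of a copy of items."""
    a = list(items)
    for i in range(len(a) - 1, 0, -1):
        j = random.randint(0, i)       # pick from the not-yet-fixed prefix
        a[i], a[j] = a[j], a[i]
    return a

# Drawing k unique "balls" is then just taking a prefix of the shuffle.
balls = fisher_yates(range(1, 50))
print(balls[:6])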

Subset generation by rules

Let's say that we have 5000 users in a database. Each user row has a sex column, a place-of-birth column and a status column (married or not married).
How to generate a random subset (let's say 100 users) that would satisfy these conditions:
40% should be males and 60% - females
50% should be born in USA, 20% born in UK, 20% born in Canada, 10% in Australia
70% should be married and 30% not.
These conditions are independent, that is, we cannot just do this:
(0.4 * 0.5 * 0.7) * 100 = 14 users that are males, born in USA and married
(0.4 * 0.5 * 0.3) * 100 = 6 users that are males, born in USA and not married.
Is there an algorithm for this kind of generation?
Does the breakdown need to be exact, or approximate? Typically if you are generating a sample like this then you are doing some statistical study, so it is sufficient to generate an approximate sample.
Here's how to do this:
Have a function genRandomIndividual().
Each time you generate an individual, use the random function to choose the sex - male with probability 40%
Choose the birth location using the random function again (just generate a real number in the interval 0-1; if it falls in 0-.5, choose USA; if .5-.7, then UK; if .7-.9, then Canada; otherwise Australia).
Choose married status using random function (again generate in 0-1, if 0-.7 then married, otherwise not).
Once you have a set of characteristics, search the database for the first individual who satisfies these characteristics, add them to your sample, and tag them as already added in the database. Keep doing this until you have filled your sample size.
There may be no individual that satisfies the characteristics. In that case, just generate a new random individual instead. Since the generations are independent and produce the characteristics according to the required probabilities, in the end you will have a sample of the correct size with the individuals chosen randomly according to the probabilities specified.
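A hedged sketch of that procedure, assuming the users sit in a Python list of dicts rather than a real database, and using hypothetical field names (sex, country, married) that are not from the question:

import random

def gen_random_characteristics():
    """Draw one (sex, country, married) combination with the required marginals."""
    sex = "male" if random.random() < 0.4 else "female"
    r = random.random()
    country = "USA" if r < 0.5 else "UK" if r < 0.7 else "Canada" if r < 0.9 else "Australia"
    married = random.random() < 0.7
    return sex, country, married

def build_sample(users, size=100):
    sample, used = [], set()
    while len(sample) < size:            # a real version might cap the attempts
        sex, country, married = gen_random_characteristics()
        # First not-yet-used user matching the generated characteristics, if any.
        match = next((i for i, u in enumerate(users)
                      if i not in used and u["sex"] == sex
                      and u["country"] == country and u["married"] == married), None)
        if match is not None:            # otherwise just draw a new combination
            used.add(match)
            sample.append(users[match])
    return sample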
You could try something like this:
Pick a random initial set of 100
Until you have the right distribution (or give up):
Pick a random record not in the set, and a random one that is
If swapping in the other record gets you closer to the set you want, exchange them. Otherwise, don't.
I'd probably use the sum of squares of the distance to the desired distribution as the metric for deciding whether to swap.
That's what comes to mind that keeps the set random. Keep in mind that there may be no subset which matches the distribution you're after.
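A sketch of that swap loop, reusing the hypothetical sex/country/married field names from the previous snippet and a sum-of-squared-distances score against the target proportions (it assumes the universe is larger than the requested subset):

import random

TARGETS = {("sex", "male"): 0.4, ("sex", "female"): 0.6,
           ("country", "USA"): 0.5, ("country", "UK"): 0.2,
           ("country", "Canada"): 0.2, ("country", "Australia"): 0.1,
           ("married", True): 0.7, ("married", False): 0.3}

def score(sample):
    """Sum of squared distances between the sample's proportions and the targets."""
    n = len(sample)
    return sum((sum(1 for u in sample if u[col] == val) / n - p) ** 2
               for (col, val), p in TARGETS.items())

def refine(users, size=100, iters=10000):
    indices = random.sample(range(len(users)), size)     # random initial set
    rest = [i for i in range(len(users)) if i not in set(indices)]
    for _ in range(iters):
        if score([users[i] for i in indices]) < 1e-9:
            break                                        # distribution matched
        a = random.randrange(len(indices))               # member to maybe evict
        b = random.randrange(len(rest))                  # outsider to maybe bring in
        trial = indices[:a] + [rest[b]] + indices[a + 1:]
        if score([users[i] for i in trial]) < score([users[i] for i in indices]):
            indices[a], rest[b] = rest[b], indices[a]    # keep the better set
    return [users[i] for i in indices]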
It is important to note that you may not be able to find a subset that satisfies these conditions. To take an example, suppose your database contained only American males, and only Australian females. Clearly you could not generate any subset that satisfies your distribution constraints.
(Rewrote my post completely (actually, wrote a new one and deleted the old) because I thought of a much simpler and more efficient way to do the same thing.)
I'm assuming you actually want the exact proportions and not just to satisfy them on average. This is a pretty simple way to accomplish that, but depending on your data it might take a while to run.
First, arrange your original data so that you can access each combination of types easily, that is, group married US men in one pile, unmarried US men in another, and so on. Then, assuming that you have p conditions and you want to select k elements, make p arrays of size k each; one array will represent one condition. Make the elements of each array be the types of that condition, in the proportions that you require. So, in your example, the gender array would have 40 males and 60 females.
Now, shuffle each of the p arrays independently (actually, you can leave one array unshuffled if you like). Then, for each index i, take the required type of the i-th picked element to be the combination of the values at index i in the shuffled p arrays, and pick one element of that type at random from the ones remaining in your original data, removing it. If there are no elements of that type left, the algorithm has failed, so reshuffle the arrays and start picking elements again.
To use this, you need to first make sure that the conditions are satisfiable at all because otherwise it will just loop infinitely. To be honest, I don't see a simple way to verify that the conditions are satisfiable, but if the number of elements in your original data is large compared to k and their distribution isn't too skewed, there should be solutions. Also, if there are only a few ways in which the conditions can be satisfied, it might take a long time to find one; though the method will terminate with probability 1, there is no upper bound that you can place on the running time.
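A sketch of that shuffled-arrays idea, again with the hypothetical sex/country/married fields; a real version would wrap this in the retry loop described above:

import random
from collections import defaultdict

def shuffled_column(proportions, k):
    """Array of k type labels in the requested proportions, then shuffled."""
    col = [t for t, p in proportions.items() for _ in range(round(p * k))]
    random.shuffle(col)
    return col

def pick_sample(users, k=100):
    # One shuffled array per condition (the proportions are those in the question).
    sexes = shuffled_column({"male": 0.4, "female": 0.6}, k)
    countries = shuffled_column({"USA": 0.5, "UK": 0.2, "Canada": 0.2, "Australia": 0.1}, k)
    married = shuffled_column({True: 0.7, False: 0.3}, k)
    # Group the original data by full type so each combination is easy to draw from.
    piles = defaultdict(list)
    for u in users:
        piles[(u["sex"], u["country"], u["married"])].append(u)
    sample = []
    for combo in zip(sexes, countries, married):   # the combination at each index i
        if not piles[combo]:
            return None                            # failed: caller reshuffles and retries
        sample.append(piles[combo].pop(random.randrange(len(piles[combo]))))
    return sample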
Algorithm may be too strong a word, since to me that implies formalism and publication, but there is a method to select subsets with exact proportions (assuming your percentages yield whole numbers of subjects from the sample universe), and it's much simpler than the other proposed solutions. I've built one and tested it.
Incidentally, I'm sorry to be a slow responder here, but my time is constrained these days. I wrote a hard-coded solution fairly quickly, and since then I've been refactoring it into a decent general-purpose implementation. Because I've been busy, that's still not complete yet, but I didn't want to delay answering any longer.
The method:
Basically, you're going to consider each row separately, and decide whether it's selectable based on whether your criteria give you room to select each of its column values.
In order to do that, you'll consider each of your column rules (e.g., 40% males, 60% females) as an individual target (e.g., given a desired subset size of 100, you're looking for 40 males, 60 females). Make a counter for each.
Then you loop, until you've either created your subset, or you've examined all the rows in the sample universe without finding a match (see below for what happens then). This is the loop in pseudocode:
- Randomly select a row.
- Mark the row examined.
- For each column constraint:
* Get the value for the relevant column from the row
* Test for selectability:
If there's a value target for the value,
and if we haven't already selected our target number of incidences of this value,
then the row is selectable with respect to this column
* Else: the row fails.
- If the row didn't fail, select it: add it to the subset and count each of its column values toward that value's target
That's the core of it. It will provide a subset which matches your rules, or it will fail to do so... which brings me to what happens when we can't find a match.
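In Python, the loop above might look roughly like this; the field names, target counts and retry budget are placeholders of mine, not part of the description above:

import random

TARGET_COUNTS = {"sex": {"male": 40, "female": 60},
                 "country": {"USA": 50, "UK": 20, "Canada": 20, "Australia": 10},
                 "married": {True: 70, False: 30}}

def try_select(universe, size=100):
    counts = {col: {val: 0 for val in vals} for col, vals in TARGET_COUNTS.items()}
    order = random.sample(range(len(universe)), len(universe))   # random examination order
    subset = []
    for idx in order:                      # examine each row at most once
        row = universe[idx]
        selectable = all(row[col] in TARGET_COUNTS[col] and
                         counts[col][row[col]] < TARGET_COUNTS[col][row[col]]
                         for col in TARGET_COUNTS)
        if selectable:
            subset.append(row)
            for col in TARGET_COUNTS:      # update the per-value counters
                counts[col][row[col]] += 1
            if len(subset) == size:
                return subset
    return None                            # ran out of rows: the caller retries

def select_with_retries(universe, size=100, patience=20):
    for _ in range(patience):              # the "patience test" described below
        subset = try_select(universe, size)
        if subset is not None:
            return subset
    return None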
Unsatisfiability:
As others have pointed out, it's not always possible to satisfy any arbitrary set of rules for any arbitrary sample universe. Even assuming that the rules are valid (percentages for each value sum to 100), the subset size is less than the universe size, and the universe does contain enough individuals with each selected value to hit the targets, it's still possible to fail if the values are actually not distributed independently.
Consider the case where all the males in the sample universe are Australian: in this case, you can only select as many males as you can select Australians, and vice-versa. So a set of constraints (subset size: 100; male: 40%; Australian 10%) cannot be satisfied at all from such a universe, even if all the Australians we select are male.
If we change the constraints (subset size: 100; male: 40%; Australian 40%), now we can possibly make a matching subset, but all of the Australians we select must be male. And if we change the constraints again (subset size: 100; male: 20%; Australian 40%), now we can possibly make a matching subset, but only if we don't pick too many Australian women (no more than half in this case).
In this latter case, selection order is going to matter. Depending on our random seed, sometimes we might succeed, and sometimes we might fail.
For this reason, the algorithm must be prepared to retry (and my implementation is). I think of this as a patience test: the question is how many times we are willing to let it fail before we decide that the constraints are not compatible with the sample population.
Suitability
This method is well suited to the OP's task as described: selecting a random subset which matches given criteria. It is not suited to answering a slightly different question: "is it possible to form a subset with the given criteria?"
My reasoning for this is simple: the situations in which the algorithm fails to find a subset are those in which the data contains unknown linkages, or where the criteria allow only a very limited number of subsets from the sample universe. In these cases, the use of any such subset would be questionable for statistical analysis, at least without further thought.
But for the purpose of answering the question of whether it's possible to form a subset, this method is non-deterministic and inefficient. It would be better to use one of the more complex shuffle-and-sort algorithms proposed by others.
Pre-Validation:
The immediate thought upon discovering that not all subsets can be satisfied is to perform some initial validation, and perhaps to analyze the data to see whether it's answerable or only conditionally answerable.
My position is that other than initially validating that each of the column rules is valid (i.e., the column percentages sum to 100, or near enough) and that the subset size is less than the universe size, there's no other prior validation which is worth doing. An argument can be made that you might want to check that the universe contains enough individuals with each selected value (e.g., that there actually are 40 males and 60 females in the universe), but I haven't implemented that.
Other than those, any analysis to identify linkages in the population is itself so time-consuming that you might be better served just running the thing with more retries. Maybe that's just my lack of statistics background talking.
Not quite the subset sum problem
It has been suggested that this problem is like the subset sum problem. I contend that this is subtly and yet significantly different. My reasoning is as follows: for the subset sum problem, you must form and test a subset in order to answer the question of whether it meets the rules: it is not possible (except in certain edge conditions) to test an individual element before adding it to the subset.
For the OP's question, though, it is possible. As I'll explain, we can randomly select rows and test them individually, because each has a weight of one.

Efficiently estimating the number of unique elements in a large list

This problem is a little similar to the one solved by reservoir sampling, but not the same. I think it's also a rather interesting problem.
I have a large dataset (typically hundreds of millions of elements), and I want to estimate the number of unique elements in this dataset. There may be anywhere from a few, to millions of unique elements in a typical dataset.
Of course the obvious solution is to maintain a running hashset of the elements you encounter and count them at the end. This would yield an exact result, but would require me to carry a potentially large amount of state with me as I scan through the dataset (i.e. all unique elements encountered so far).
Unfortunately, in my situation this would require more RAM than is available to me (noting that the dataset may be far larger than available RAM).
I'm wondering if there would be a statistical approach to this that would allow me to do a single pass through the dataset and come up with an estimated unique element count at the end, while maintaining a relatively small amount of state while I scan the dataset.
The input to the algorithm would be the dataset (an Iterator in Java parlance), and it would return an estimated unique object count (probably a floating-point number). It is assumed that these objects can be hashed (i.e. you can put them in a HashSet if you want to). Typically they will be strings or numbers.
You could use a Bloom Filter for a reasonable lower bound. You just do a pass over the data, counting and inserting items which were definitely not already in the set.
This problem is well-addressed in the literature; a good review of various approaches is http://www.edbt.org/Proceedings/2008-Nantes/papers/p618-Metwally.pdf. The simplest approach (and most compact for very high accuracy requirements) is called Linear Counting. You hash elements to positions in a bitvector just like you would a Bloom filter (except only one hash function is required), but at the end you estimate the number of distinct elements by the formula D = -total_bits * ln(unset_bits/total_bits). Details are in the paper.
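A compact Linear Counting sketch in Python, using a plain byte array as the bitvector and Python's built-in hash as a stand-in for a real hash function; the bitmap size is an assumption of mine and controls the accuracy:

import math

def linear_count(iterator, total_bits=1 << 20):
    """Single-pass distinct-count estimate via Linear Counting."""
    bitmap = bytearray(total_bits)          # one byte per "bit" for simplicity
    for item in iterator:
        bitmap[hash(item) % total_bits] = 1
    unset = total_bits - sum(bitmap)
    if unset == 0:
        return float(total_bits)            # bitmap saturated: this is only a lower bound
    return -total_bits * math.log(unset / total_bits)

# Example: 200,000 items, 5,000 of them distinct.
print(linear_count(str(i % 5000) for i in range(200000)))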
If you have a hash function that you trust, then you could maintain a hashset just like you would for the exact solution, but throw out any item whose hash value is outside of some small range. E.g., use a 32-bit hash, but only keep items where the first two bits of the hash are 0. Then multiply by the appropriate factor at the end to approximate the total number of unique elements.
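And a sketch of that hash-range idea: keep only items whose hash falls in a quarter of the hash space (first two bits zero), then scale up at the end; hashlib's MD5 is used here just to get a stable 32-bit value, as an assumption of this sketch:

import hashlib

def estimate_unique(iterator):
    """Keep items whose 32-bit hash starts with two zero bits, then scale by 4."""
    kept = set()
    for item in iterator:
        h = int.from_bytes(hashlib.md5(str(item).encode()).digest()[:4], "big")
        if h >> 30 == 0:                    # top two bits are zero: 1/4 of hash space
            kept.add(item)
        # items outside the range are simply discarded
    return 4 * len(kept)

print(estimate_unique(i % 5000 for i in range(200000)))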
Nobody has mentioned the approximate algorithm designed specifically for this problem: HyperLogLog.

Resources