Subset generation by rules - algorithm

Let's say we have 5000 users in a database. Each user row has a sex column, a birthplace column, and a status column (married or not married).
How to generate a random subset (let's say 100 users) that would satisfy these conditions:
40% should be male and 60% female
50% should be born in the USA, 20% in the UK, 20% in Canada, and 10% in Australia
70% should be married and 30% not.
These conditions are independent; that is, we cannot simply compute:
(0.4 * 0.5 * 0.7) * 100 = 14 users that are male, born in the USA, and married
(0.4 * 0.5 * 0.3) * 100 = 6 users that are male, born in the USA, and not married.
Is there an algorithm for this kind of generation?

Does the breakdown need to be exact, or approximate? Typically if you are generating a sample like this then you are doing some statistical study, so it is sufficient to generate an approximate sample.
Here's how to do this:
Have a function genRandomIndividual().
Each time you generate an individual, use the random function to choose the sex - male with probability 40%
Choose the birth location using the random function again (generate a real number in the interval 0-1; if it falls in 0-.5, choose USA; if .5-.7, UK; if .7-.9, Canada; otherwise Australia).
Choose married status using random function (again generate in 0-1, if 0-.7 then married, otherwise not).
Once you have a set of characteristics, search the database for the first individual who satisfies them, add them to your sample, and tag them as already added in the database. Keep doing this until you have filled your sample.
There may be no individual who satisfies the characteristics. In that case, just generate a new random individual instead. Since the generations are independent and produce the characteristics with the required probabilities, in the end you will have a sample of the correct size with individuals drawn randomly according to the probabilities specified.
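The steps above can be sketched in runnable form. This assumes the database is a list of dicts; the field names (`sex`, `country`, `married`) and the distributions are taken from the question's example.

```python
import random

SEX = [("male", 0.4), ("female", 0.6)]
COUNTRY = [("USA", 0.5), ("UK", 0.2), ("Canada", 0.2), ("Australia", 0.1)]
MARRIED = [(True, 0.7), (False, 0.3)]

def pick(distribution):
    """Choose a value with the given probability (the 0-1 interval trick)."""
    r, cumulative = random.random(), 0.0
    for value, p in distribution:
        cumulative += p
        if r < cumulative:
            return value
    return distribution[-1][0]

def gen_random_individual():
    return {"sex": pick(SEX), "country": pick(COUNTRY), "married": pick(MARRIED)}

def sample(users, k):
    used, out = set(), []
    while len(out) < k:
        wanted = gen_random_individual()
        for i, user in enumerate(users):
            if i not in used and all(user[f] == wanted[f] for f in wanted):
                used.add(i)      # tag as already added
                out.append(user)
                break
        # if no match was found, fall through and generate a new individual
    return out
```

Note that, as the answer says, this loops forever if some generated combination has positive probability but no remaining representative in the database.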

You could try something like this:
Pick a random initial set of 100
Until you have the right distribution (or give up):
Pick a random record not in the set, and a random one that is
If swapping in the other record gets you closer to the set you want, exchange them. Otherwise, don't.
I'd probably use the sum of squares of the distances to the desired distribution as the metric for deciding whether to swap.
That's what comes to mind that keeps the set random. Keep in mind that there may be no subset which matches the distribution you're after.
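The swap-based refinement above can be sketched as follows, using the question's targets and the suggested sum-of-squares metric. The field names and the `users` list of dicts are assumptions.

```python
import random

TARGETS = {"sex": {"male": 40, "female": 60},
           "country": {"USA": 50, "UK": 20, "Canada": 20, "Australia": 10},
           "married": {True: 70, False: 30}}

def error(subset):
    """Sum of squared distances from the desired distribution."""
    total = 0
    for field, wanted in TARGETS.items():
        for value, target in wanted.items():
            have = sum(1 for u in subset if u[field] == value)
            total += (have - target) ** 2
    return total

def refine(users, k=100, max_tries=20000):
    indices = list(range(len(users)))
    random.shuffle(indices)                      # random initial set of k
    inside, outside = indices[:k], indices[k:]
    best = error([users[i] for i in inside])
    for _ in range(max_tries):                   # until right, or give up
        if best == 0:
            break
        a, b = random.randrange(k), random.randrange(len(outside))
        inside[a], outside[b] = outside[b], inside[a]      # try a swap
        candidate = error([users[i] for i in inside])
        if candidate < best:
            best = candidate                               # keep it
        else:
            inside[a], outside[b] = outside[b], inside[a]  # undo it
    return [users[i] for i in inside]
```

As the answer warns, this may stop short of a perfect match if no matching subset exists (or if the hill climb gets stuck), which is why it gives up after a bounded number of tries.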

It is important to note that you may not be able to find a subset that satisfies these conditions. To take an example, suppose your database contained only American males, and only Australian females. Clearly you could not generate any subset that satisfies your distribution constraints.

(Rewrote my post completely (actually, wrote a new one and deleted the old) because I thought of a much simpler and more efficient way to do the same thing.)
I'm assuming you actually want the exact proportions and not just to satisfy them on average. This is a pretty simple way to accomplish that, but depending on your data it might take a while to run.
First, arrange your original data so that you can access each combination of types easily, that is, group married US men in one pile, unmarried US men in another, and so on. Then, assuming that you have p conditions and you want to select k elements, make p arrays of size k each; one array will represent one condition. Make the elements of each array be the types of that condition, in the proportions that you require. So, in your example, the gender array would have 40 males and 60 females.
Now, shuffle each of the p arrays independently (actually, you can leave one array unshuffled if you like). Then, for each index i, take the type of the picked element to be the combination from the shuffled p arrays at index i, and pick one such type at random from the remaining ones in your original group, removing the picked element. If there are no elements of that type left, the algorithm has failed, so reshuffle the arrays and start again to pick elements.
To use this, you need to first make sure that the conditions are satisfiable at all because otherwise it will just loop infinitely. To be honest, I don't see a simple way to verify that the conditions are satisfiable, but if the number of elements in your original data is large compared to k and their distribution isn't too skewed, there should be solutions. Also, if there are only a few ways in which the conditions can be satisfied, it might take a long time to find one; though the method will terminate with probability 1, there is no upper bound that you can place on the running time.
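The shuffle-and-retry method above can be sketched like this. `pools` groups the original data by its full combination of attribute values, as the answer's first step describes; the conditions and proportions are the question's example.

```python
import random

def make_column(proportions, k):
    """One array of length k holding a condition's types in the
    required proportions (e.g. 40 males and 60 females for k=100)."""
    column = []
    for value, fraction in proportions:
        column += [value] * round(fraction * k)
    return column

def pick_subset(pools, conditions, k, max_restarts=1000):
    columns = [make_column(c, k) for c in conditions]
    for _ in range(max_restarts):
        for column in columns:            # shuffle each array independently
            random.shuffle(column)
        remaining = {combo: list(members) for combo, members in pools.items()}
        picked = []
        for i in range(k):
            combo = tuple(column[i] for column in columns)
            pool = remaining.get(combo)
            if not pool:                  # no element of that type left:
                break                     # reshuffle and start again
            picked.append(pool.pop(random.randrange(len(pool))))
        else:
            return picked
    raise RuntimeError("gave up: conditions may be unsatisfiable")
```

The `max_restarts` cap is an addition to guard against the infinite loop the answer warns about when the conditions are unsatisfiable.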

Algorithm may be too strong a word, since to me that implies formalism and publication, but there is a method to select subsets with exact proportions (assuming your percentages yield whole numbers of subjects from the sample universe), and it's much simpler than the other proposed solutions. I've built one and tested it.
Incidentally, I'm sorry to be a slow responder here, but my time is constrained these days. I wrote a hard-coded solution fairly quickly, and since then I've been refactoring it into a decent general-purpose implementation. Because I've been busy, that's still not complete yet, but I didn't want to delay answering any longer.
The method:
Basically, you're going to consider each row separately, and decide whether it's selectable based on whether your criteria give you room to select each of its column values.
In order to do that, you'll consider each of your column rules (e.g., 40% males, 60% females) as an individual target (e.g., given a desired subset size of 100, you're looking for 40 males, 60 females). Make a counter for each.
Then you loop, until you've either created your subset, or you've examined all the rows in the sample universe without finding a match (see below for what happens then). This is the loop in pseudocode:
- Randomly select a row.
- Mark the row examined.
- For each column constraint:
* Get the value for the relevant column from the row
* Test for selectability:
If there's a value target for the value,
and if we haven't already selected our target number of incidences of this value,
then the row is selectable with respect to this column
* Else: the row fails.
- If the row didn't fail, select it: add it to the subset and increment the counter for each of its column values
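The loop above can be written out in runnable form. The targets are the question's example numbers, and the rows are assumed to be dicts with matching field names.

```python
import random

TARGETS = {"sex": {"male": 40, "female": 60},
           "country": {"USA": 50, "UK": 20, "Canada": 20, "Australia": 10},
           "married": {True: 70, False: 30}}

def try_select(rows, k):
    counts = {col: {value: 0 for value in targets}
              for col, targets in TARGETS.items()}
    subset = []
    for row in random.sample(rows, len(rows)):   # random examination order
        # selectable only if every column value still has room in its counter
        if all(row[col] in TARGETS[col] and
               counts[col][row[col]] < TARGETS[col][row[col]]
               for col in TARGETS):
            subset.append(row)
            for col in TARGETS:
                counts[col][row[col]] += 1
            if len(subset) == k:
                return subset
    return None   # examined every row without filling the subset

def select_subset(rows, k, retries=100):
    """The patience test: retry with a fresh random order, then give up."""
    for _ in range(retries):
        subset = try_select(rows, k)
        if subset is not None:
            return subset
    return None   # constraints likely incompatible with this universe
```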
That's the core of it. It will provide a subset which matches your rules, or it will fail to do so... which brings me to what happens when we can't find a match.
Unsatisfiability:
As others have pointed out, it's not always possible to satisfy any arbitrary set of rules for any arbitrary sample universe. Even assuming that the rules are valid (percentages for each value sum to 100), the subset size is less than the universe size, and the universe does contain enough individuals with each selected value to hit the targets, it's still possible to fail if the values are actually not distributed independently.
Consider the case where all the males in the sample universe are Australian: in this case, you can only select as many males as you can select Australians, and vice-versa. So a set of constraints (subset size: 100; male: 40%; Australian 10%) cannot be satisfied at all from such a universe, even if all the Australians we select are male.
If we change the constraints (subset size: 100; male: 40%; Australian 40%), now we can possibly make a matching subset, but all of the Australians we select must be male. And if we change the constraints again (subset size: 100; male: 20%; Australian 40%), now we can possibly make a matching subset, but only if we don't pick too many Australian women (no more than half in this case).
In this latter case, selection order is going to matter. Depending on our random seed, sometimes we might succeed, and sometimes we might fail.
For this reason, the algorithm must be prepared to retry (and my implementation is). I think of this as a patience test: the question is how many times we are willing to let it fail before we decide that the constraints are not compatible with the sample population.
Suitability:
This method is well suited to the OP's task as described: selecting a random subset which matches given criteria. It is not suitable to answering a slightly different question: "is it possible to form a subset with the given criteria".
My reasoning for this is simple: the situations in which the algorithm fails to find a subset are those in which the data contains unknown linkages, or where the criteria allow only a very limited number of subsets from the sample universe. In these cases, the use of any such subset would be questionable for statistical analysis, at least without further thought.
But for the purpose of answering the question of whether it's possible to form a subset, this method is non-deterministic and inefficient. It would be better to use one of the more complex shuffle-and-sort algorithms proposed by others.
Pre-Validation:
The immediate thought upon discovering that not all subsets can be satisfied is to perform some initial validation, and perhaps to analyze the data to see whether it's answerable or only conditionally answerable.
My position is that other than initially validating that each of the column rules is valid (i.e., the column percentages sum to 100, or near enough) and that the subset size is less than the universe size, there's no other prior validation which is worth doing. An argument can be made that you might want to check that the universe contains enough individuals with each selected value (e.g., that there actually are 40 males and 60 females in the universe), but I haven't implemented that.
Other than those, any analysis to identify linkages in the population is itself so time-consuming that you might be better served just running the thing with more retries. Maybe that's just my lack of statistics background talking.
Not quite the subset sum problem:
It has been suggested that this problem is like the subset sum problem. I contend that this is subtly and yet significantly different. My reasoning is as follows: for the subset sum problem, you must form and test a subset in order to answer the question of whether it meets the rules: it is not possible (except in certain edge conditions) to test an individual element before adding it to the subset.
For the OP's question, though, it is possible. As I'll explain, we can randomly select rows and test them individually, because each has a weight of one.

Related

Algorithm to find outliers of DateTimes outside of a 2 day window

I have a set of arbitrary DateTime values that are input by users. The requirement is that the values are to be within a certain window, e.g. no more than 2 days apart from each other. There is no reference value to work from.
An unknown, but small percentage (say < 5%) of them will be outside the 2 day window because of user error. At some point, the values are aggregated and processed, at which point the requirement is checked. Validation at input time is not practical. How do I determine the largest set of values that fulfill the requirement, so that I can report back the other, incorrect values that don't fulfill the requirement?
I know about determining Interquartile Range. Can I somehow modify that algorithm to include the boundary condition? Or do I need a different algorithm?
One good "quick strike" solution from Machine Learning is the support vector machine (SVM). A 1-class method will be relatively fast, and will identify clustered values vs outliers with very good accuracy for this application. Otherwise ...
You do not want the mean of the dates: a single error could skew the mean right out of the rest of the distribution, such as today's date being 20 Aug 2109.
The median is likely a good starting guess for this distribution. Sort the values, grab the median, and then examine the distribution on either side. At roughly 24 hours each way, there should be a sudden difference in values. Those difference points will absolutely identify your proper boundaries.
In most data sets, you'll be able to find that point easily: look at the differences between adjacent values in your sorted list of dates.
Very simply:
Sort the list of dates
Make a new list, shifting one element left (i.e. delete the first element)
Subtract the two lists.
Move through the difference list; there will be a large cluster of small values in the middle, bounded by a pair of large jumps. The pair of large jumps will be at most 48 hours apart. Those points are your boundaries.
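The sort-and-diff recipe above can be sketched as follows. The `jump` threshold is the gap size treated as a cluster boundary; one day matches the answer's "roughly 24 hours each way" and is an assumption you may need to tune for your data.

```python
from datetime import datetime, timedelta

def find_outliers(dates, jump=timedelta(days=1)):
    ordered = sorted(dates)
    # differences between adjacent values in the sorted list
    diffs = [b - a for a, b in zip(ordered, ordered[1:])]
    mid = len(ordered) // 2          # start from the median
    lo, hi = mid, mid
    while lo > 0 and diffs[lo - 1] <= jump:          # walk left to a big jump
        lo -= 1
    while hi < len(ordered) - 1 and diffs[hi] <= jump:   # walk right
        hi += 1
    # everything outside the central cluster is an outlier
    return [d for d in dates if d < ordered[lo] or d > ordered[hi]]
```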

Statistical Spell Checking: General Approach and Avoid Feedback Loops

A database I am building has a large number of names that are often repeated. However, many names have misspellings, and I want to attempt to correct them automatically. I do not know the correct list of names beforehand.
My current approach has been to keep a list of the top N names along with their frequencies in the corpus. Then, when a new name is input into the database, I find the name which maximizes term_freq(name) / edit_distance(new_name, name). That is, among the top N names I find the one with the highest frequency divided by its edit distance from the new name.
Is this a sound approach to checking for names? Am I going about this incorrectly?
I am concerned that if the system detects a group of documents that misspell a name, they can make it into the top N names, and then cause all other names to change as well.
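The scoring rule described above can be sketched like this; the name list is hypothetical and the distance is plain Levenshtein.

```python
def edit_distance(a, b):
    """Standard dynamic-programming Levenshtein distance."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,               # deletion
                               current[j - 1] + 1,            # insertion
                               previous[j - 1] + (ca != cb))) # substitution
        previous = current
    return previous[-1]

def best_correction(new_name, top_names):
    """top_names maps name -> frequency in the corpus."""
    if new_name in top_names:     # exact match: nothing to correct
        return new_name
    return max(top_names,
               key=lambda name: top_names[name] / edit_distance(new_name, name))
```

Note the exact-match guard: without it, a distance of zero would divide by zero, a case the question leaves open.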
So, first of all... I'm obligated to say that I suspect you'll create more trouble with this than you save, and that you might be happier just requiring everyone to spell things correctly (and accepting that things will break in a nice, predictable way if they don't). That being said, if you're going to do things this way, I would look to Bayes' law:
P(H|E) = P(H) * P(E|H) / P(E)
(That reads: "The probability of a hypothesis given certain evidence is equal to the probability of the hypothesis multiplied by the probability of the evidence given the hypothesis, divided by the probability of the evidence.")
In this case, you would be considering the probability that a name is a misspelled version of another name, given the frequencies and some preconceived expectations about the likelihood of typos:
P(H) would be the probability of (some number of specific typos | length of name)
P(E) would be the base probability of seeing this name (for this, I would keep the un-modified list of all entered words)
P(E|H) would be the sum of probabilities of all names that are (some number of typos) away from this name.
I would then compare the results of that to the probability that they were right in the first place, i.e., P(Zero Typos| Length of Name) * P(This Spelling).

read file only once for Stratified sampling

If we do not know the distribution (or size/probability) of each subpopulation (stratum), and also do not know the total population size, is it possible to do stratified sampling by reading the file only once? Thanks.
https://en.wikipedia.org/wiki/Stratified_sampling
Assuming that each record in the file can be identified as being in a particular sub-population, and that you know ahead of time what size of random sample you want from that sub-population, you could hold, for each sub-population, a data structure allowing you to do Reservoir Sampling for that sub-population (https://en.wikipedia.org/wiki/Reservoir_sampling#Algorithm_R).
So repeatedly:
Read a record
Find out which sub-population it is in and get the data structure representing the reservoir sampling for that sub-population, creating it if necessary.
Use that data structure and the record read to do reservoir sampling for that sub-population.
At the end you will have, for each sub-population seen, a reservoir sampling data structure containing a random sample from that population.
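The loop above can be sketched as one Algorithm R reservoir per sub-population, filled in a single pass. `stratum_of` stands in for however a record's sub-population is identified.

```python
import random
from collections import defaultdict

def stratified_reservoirs(records, k, stratum_of):
    reservoirs = defaultdict(list)
    seen = defaultdict(int)                  # records seen per stratum
    for record in records:
        stratum = stratum_of(record)
        seen[stratum] += 1
        if len(reservoirs[stratum]) < k:
            reservoirs[stratum].append(record)    # reservoir not yet full
        else:
            j = random.randrange(seen[stratum])   # Algorithm R replacement
            if j < k:
                reservoirs[stratum][j] = record
    return dict(reservoirs)
```

A stratum with fewer than k records simply ends up with all of them, which is the best any single-pass scheme can do.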
For the case when you wish to end up with k of N samples forming a stratified sample over the different classes of records, I don't think you can do much better than keeping k of each class and then downsampling from this. Suppose you could, and I give you an initial block of records organised so that the stratified sample keeps fewer than k/2 records of some class. Now I follow that block with a huge number of records, all of this class, which is now clearly underrepresented. In this case the final random sample should contain far more than k/2 records of this class, and (if it is really random) there should be a very small but non-zero probability that more than k/2 of those randomly chosen records came from the first block. But a scheme that never keeps more than k/2 of these records from the first block makes that probability exactly zero, so keeping fewer than k of each class won't work in the worst case.
Here is a cheat method. Suppose that instead of reading the records sequentially we can read them in any order we choose. If you look through Stack Overflow you will see (rather contrived) methods, based on cryptography, for generating a random permutation of N items without holding N items in memory at any one time, so you could do this. Now keep a pool of k records such that at any time the proportions of the items in the pool form a stratified sample, adding or removing items only when you are forced to in order to keep the proportions correct. I think you can do this because you need to add an item of class X to keep the proportions correct exactly when you have just observed another item of class X. Because you went through the records in a random order, I claim that you have a random stratified sample. Clearly you have a stratified sample, so the only departure from randomness can be in the items selected for a particular class. But consider the permutations which select items not of that class in the same order as the permutation actually chosen, but which select items of that class in different orders. If there is bias in the way items of that class are selected (as there probably is), the bias will affect different items of that class in different ways depending on which permutation is selected, so the random choice between all of these different permutations makes the total effect unbiased.
To do sampling in a single pass is simple, if you are able to keep the results in memory. It consists of two parts:
Calculate the odds of the new item being part of the result set, and use a random number to determine if the item should be part of the result or not.
If the item is to be kept, determine whether it should be added to the set or replace an existing member. If it should replace an existing member, use a random number to determine which existing member it should replace. Depending on how you calculate your random numbers, this can be the same one as the previous step or it can be a new one.
For stratified sampling, the only modification required to this algorithm is to determine which stratum the item belongs to. The result list for each stratum should be kept separate.

What data structure/algorithm to use to compute similarity between input sequence and a database of stored sequences?

By this question, I mean if I have an input sequence abchytreq and a database / data structure containing jbohytbbq, I would compare the two elements pairwise to get a match of 5/9, or 55%, because of the pairs (b-b, hyt-hyt, q-q). Each sequence additionally needs to be linked to another object (but I don't think this will be hard to do). The sequence does not necessarily need to be a string.
The maximum number of elements in the sequence is about 100. This is easy to do when the database/datastructure has only one or a few sequences to compare to, but I need to compare the input sequence to over 100000 (mostly) unique sequences, and then return a certain number of the most similar previously stored data matches. Additionally, each element of the sequence could have a different weighting. Back to the first example: if the first input element was weighted double, abchytreq would only be a 50% match to jbohytbbq.
I was thinking of using BLAST and creating a little hack as needed to account for any weighting, but I figured that might be a little bit overkill. What do you think?
One more thing. Like I said, comparison needs to be pairwise, e.g. abcdefg would be a zero percent match to bcdefgh.
A modified Edit Distance algorithm with weightings for character positions could help.
https://www.biostars.org/p/11863/
Multiply the resulting distance matrix by a matrix of weights for character positions.
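The question's position-weighted pairwise metric itself can be sketched as follows: compare element by element and sum the weights at matching positions (weights default to 1 per position).

```python
def weighted_similarity(a, b, weights=None):
    if weights is None:
        weights = [1] * len(a)
    matched = sum(w for x, y, w in zip(a, b, weights) if x == y)
    return matched / sum(weights)

def top_matches(query, database, k, weights=None):
    """Return the k stored sequences most similar to the query."""
    return sorted(database,
                  key=lambda s: weighted_similarity(query, s, weights),
                  reverse=True)[:k]
```

The brute-force `top_matches` scan is linear in the database size; for 100,000 sequences of length 100 that may already be acceptable, and a cheap pre-filter (as suggested below the question) can prune it further.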
I'm not entirely clear on the question; for instance, would you return all matches of 90% or better, regardless of how many or few there are, or would you return the best 10% of the input, even if some of them match only 50%? Here are a couple of suggestions:
First: Do you know the story of the wise bachelor? The foolish bachelor makes a list of requirements for his mate --- slender, not blonde (Mom was blonde, and he hates her), high IQ, rich, good cook, loves horses, etc --- then spends his life considering one mate after another, rejecting each for failing one of his requirements, and dies unfulfilled. The wise bachelor considers that he will meet 100 marriageable women in his life, examines the first 100/e ≈ 37 of them, then marries the next mate with a better score than the best of those; she might not be perfect, but she's good enough. This optimal-stopping result is known as the secretary problem, and the optimal cutoff is n/e of the population size.
Second: I suppose that you have a scoring function that tells you exactly which of two dictionary words is the better match to the target, but is expensive to compute. Perhaps you can find a partial scoring function that is easy to compute and would allow you to quickly scan the dictionary, discarding those inputs that are unlikely to be winners, and then apply your total scoring function only to that subset of the dictionary that passed the partial scoring function. You'll have to define the partial scoring function based on your needs. For instance, you might want to apply your total scoring function to only the first five characters of the target and the dictionary word; if that doesn't eliminate enough of the dictionary, increase to ten characters on each side.
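The two-stage filter suggested above can be sketched like this: a cheap partial score on a prefix prunes the dictionary before the expensive full score runs on the survivors. `full_score`, the `prefix` length, and the `keep` fraction are all assumptions to tune.

```python
def two_stage_search(query, dictionary, full_score, prefix=5, keep=0.1):
    def partial(word):
        # cheap score: matches within the first `prefix` characters only
        return sum(a == b for a, b in zip(query[:prefix], word[:prefix]))
    # keep only the best-scoring fraction under the cheap score
    survivors = sorted(dictionary, key=partial, reverse=True)
    survivors = survivors[:max(1, int(len(survivors) * keep))]
    # apply the expensive total score only to the survivors
    return max(survivors, key=lambda w: full_score(query, w))
```

If the prefix filter doesn't eliminate enough of the dictionary, increase `prefix`, exactly as the answer suggests.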

Grouping individuals into families

We have a simulation program where we take a very large population of individual people and group them into families. Each family is then run through the simulation.
I am in charge of grouping the individuals into families, and I think it is a really cool problem.
Right now, my technique is pretty naive/simple. Each individual record has some characteristics, including married/single, age, gender, and income level. For married people I select an individual and loop through the population looking for a match based on a match function. For people/couples with children I essentially do the same thing, drawing a random number of children (selected according to an empirical distribution) and then looping through all of the children, picking them out and adding them to the family based on a match function. After this, not everybody is matched, so I relax the restrictions in my match function and loop through again. I keep doing this, but I stop before my match function gets too ridiculous (marrying 85-year-olds to 20-year-olds, for example). Anyone who is left over is written out as a single person.
This works well enough for our current purposes, and I'll probably never get time or permission to rework it, but I at least want to plan for the occasion or learn some cool stuff - even if I never use it. Also, I'm afraid the algorithm will not work very well for smaller sample sizes. Does anybody know what type of algorithms I can study that might relate to this problem or how I might go about formalizing it?
For reference, I'm comfortable with chapters 1-26 of CLRS, but I haven't really touched NP-Completeness or Approximation Algorithms. Not that you shouldn't bring up those topics, but if you do, maybe go easy on me because I probably won't understand everything you are talking about right away. :) I also don't really know anything about evolutionary algorithms.
Edit: I am specifically looking to improve the following:
Fewer ridiculous marriages.
Fewer single people at the end.
Perhaps what you are looking for is cluster analysis?
Let's try to think of your problem like this (starting by solving the spouse matching):
If you were to have a matrix where each row is a male and each column is a female, and every cell in that matrix is the match function's returned value, what you are now looking for is a selection of cells such that no row or column contains more than one selected cell, and the total sum of all selected cells is maximal. This is the classic assignment problem; it is also similar to the N Queens Problem, with the modification that each allocation of a "queen" has a reward (which we should maximize).
You could solve this problem by using a graph where:
You have a root,
each of the first row's cells' values is an edge weight leading to the first-depth vertices,
each of the second row's cells' values is an edge weight leading to the second-depth vertices,
Etc.
(Notice that when you find a match to the first female, you shouldn't consider her anymore, and so for every other female you find a match to)
Then finding the maximum allocation can be done by BFS, or better still by A* (note that A* typically looks for a minimum cost, so you'll have to modify it a bit).
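For small groups, the matrix formulation above can be solved exactly by brute force over permutations: rows are males, columns are females, cells are match scores, and we want a one-per-row, one-per-column selection with maximal total score. (This is exponential; for large matrices use a proper assignment-problem solver such as the Hungarian algorithm.)

```python
from itertools import permutations

def best_matching(scores):
    """scores[i][j] is the match value for male i and female j."""
    n = len(scores)
    best_total, best_pairs = float("-inf"), None
    for perm in permutations(range(n)):          # each perm is one matching
        total = sum(scores[i][perm[i]] for i in range(n))
        if total > best_total:
            best_total = total
            best_pairs = [(i, perm[i]) for i in range(n)]
    return best_total, best_pairs
```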
For matching between couples (or singles, more on that later..) and children, I think KNN with some modifications is your best bet, but you'll need to optimize it to your needs. But now I have to relate to your edit..
How do you measure your algorithm's efficiency?
You need a function that receives the expected distribution of all states (single, married with one child, single with two children, etc.) and the distribution of all states in your solution, and grades the solution accordingly. How do you calculate the expected distribution? That's quite a bit of statistics work..
First you need to know the distribution of all states (single, married.. as mentioned above) in the population,
then you need to know the distribution of ages and genders in the population,
and last thing you need to know - the distribution of ages and genders in your population.
Only then, according to those three, can you calculate how many people you expect to be in each state.. And then you can measure the distance between what you expected and what you got... That is a lot of typing.. Sorry for the general parts...
