Weighting Different Outcomes when Pseudorandomly Choosing from an Arbitrarily Large Sample

So, I was sitting in my backyard thinking about Pokemon, as we're all wont to do, and it got me thinking: when you encounter a 'random' Pokemon, some specimens appear a lot more often than others, which means they're weighted more heavily than the ones that appear less often.
Now, were I to approach the problem of getting the different Pokemon to appear with a certain probability, I would most likely do so by simply increasing the number of entries that certain Pokemon have in the pool of choices (like so),
Pool:
C1 C1 C1 C1
C2 C2
C3 C3 C3 C3 C3
C4
so C1 has a 1/3 chance of being pulled, C2 has a 1/6 chance, and so on. But I understand that this may be a very simple and naive approach, and it is unlikely to scale well with a large number of choices.
So, my question is this, SO: given an arbitrarily large sample size, how would you go about weighting the chance of one outcome as greater than another? And, as a follow-up: what if you want the probabilities of certain options to stand in a ratio with floating-point precision rather than in whole-number ratios?

If you know the probability of each event happening, you need to map these probabilities to the range 0-100 (or 0 to 1 if you want to use real numbers and probabilities).
So in the example above there are 12 Cs. C1 is 4/12 or ~33%,
C2 is 2/12 or ~17%, C3 is 5/12 or ~42%, and C4 is 1/12 or ~8%.
Notice that these all add up to 100%. So if we choose a random number between 0 and 100 we can map C1 to 0-33, C2 to 33-50 (17 more than C1's upper bound), C3 to 50-92, and C4 to 92-100.
An if statement could make the choice:
def choose
  r = rand(100) # uniform integer between 0 and 99
  if r < 33
    "C1"
  elsif r < 50
    "C2"
  elsif r < 92
    "C3"
  else
    "C4"
  end
end
If you want more accuracy than 1 in 100, just go from 0-1000 or whatever range you want. It's probably better form to use integers and scale them rather than to use floating-point numbers, as floating point can behave oddly if the spread between values gets large.
If you wanted to go the binning route like you show above, you could try something like this (in Ruby, though the idea is more general):
a = ["C1"]*4 + ["C2"]*2 + ["C3"]*5 + ["C4"]
# ["C1", "C1", "C1", "C1", "C2", "C2",
# "C3", "C3", "C3", "C3", "C3", "C4"]
a[rand(a.length)] # => "C1" with probability 4/12
Binning would be slower as you need to create the array, but easier to add alternatives as you wouldn't need to recalculate the probabilities each time.
You could also generate the above if code from the array representation, so you'd take the pre-processing hit once when the code was generated and then get a fast answer from the generated code.
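For very large pools, or for the floating-point weights asked about in the question, a cumulative-weight table generalizes both ideas: store running totals of the weights, draw one uniform number over the grand total, and binary-search for the first running total that exceeds it. Here is a minimal sketch in Python (the function names are just for illustration):
import bisect
import itertools
import random

def make_picker(weights):
    # weights: dict mapping option -> weight (integers or floats)
    keys = list(weights)
    cumulative = list(itertools.accumulate(weights[k] for k in keys))  # e.g. [4, 6, 11, 12]
    total = cumulative[-1]
    def pick():
        r = random.random() * total                      # uniform in [0, total)
        return keys[bisect.bisect_right(cumulative, r)]  # first running total greater than r
    return pick

pick = make_picker({"C1": 4, "C2": 2, "C3": 5, "C4": 1})  # floating-point weights work the same way
pick()  # => "C1" about 1/3 of the time, "C3" about 5/12 of the time, and so on
Building the table is O(n) once and each pick is O(log n), so the lookup stays fast as the pool grows; Python's standard library also offers random.choices(population, weights=...), which does essentially this for you.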

Related

Best way to generate U(1,5) from U(1,3)?

I am given a uniform integer random number generator ~ U3(1,3) (inclusive). I would like to generate integers ~ U5(1,5) (inclusive) using U3. What is the best way to do this?
The simplest approach I can think of is to sample twice from U3 and then use rejection sampling. I.e., sampling twice from U3 gives us 9 possible combinations. We can assign the first 5 combinations to 1, 2, 3, 4, 5, and reject the last 4 combinations.
In expectation, this approach samples from U3 2 * 9/5 = 18/5 = 3.6 times.
Another approach could be to sample three times from U3. This gives us a sample space of 27 possible combinations. We can make use of 25 of these combinations and reject the last 2. In expectation, this approach uses U3 3 * 27/25 = 3.24 times. It is a little more tedious to write out, since there are many more combinations than in the first approach, but the expected number of samples from U3 is better.
Are there other, perhaps better, approaches to doing this?
I have this marked as language agnostic, but I'm primarily looking into doing this in either Python or C++.
You do not need combinations. A slight tweak using base 3 arithmetic removes the need for a table. Rather than using the 1..3 result directly, subtract 1 to get it into the range 0..2 and treat it as a base 3 digit. For three samples you could do something like:
function sample3()
    result <- 0
    result <- result + 9 * (randU3() - 1) // High digit: 9
    result <- result + 3 * (randU3() - 1) // Middle digit: 3
    result <- result + 1 * (randU3() - 1) // Units digit: 1
    return result
end function
That will give you a number in the range 0..26, or 1..27 if you add one. You can use that number directly in the rest of your program.
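To get U(1,5) specifically, you would still reject the two highest outcomes, as described in the question. Here is a minimal Python sketch of the full three-sample approach (randU3 is just a stand-in for the given U(1,3) generator):
import random

def randU3():
    # stand-in for the given U(1,3) generator
    return random.randint(1, 3)

def randU5():
    # three base-3 digits give a uniform value in 0..26; reject 25 and 26
    while True:
        q = 9 * (randU3() - 1) + 3 * (randU3() - 1) + (randU3() - 1)
        if q < 25:
            return q % 5 + 1  # the 25 accepted outcomes map evenly onto 1..5
Each call uses 3 * 27/25 = 3.24 draws from U3 on average, matching the second approach in the question.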
For the range [1, 3] to [1, 5], this is equivalent to rolling a 5-sided die with a 3-sided one.
However, this can't be done without "wasting" randomness (or running forever in the worst case), since all the prime factors of 5 (namely 5) don't divide 3. Thus, the best that can be done is to use rejection sampling to get arbitrarily close to no "waste" of randomness (such as by batching multiple rolls of the 3-sided die until 3^n is "close enough" to a power of 5). In other words, the approaches you give in your question are as good as they can get.
More generally, an algorithm to roll a k-sided die with a p-sided die will inevitably "waste" randomness (and run forever in the worst case) unless "every prime number dividing k also divides p", according to Lemma 3 in "Simulating a dice with a dice" by B. Kloeckner. For example:
Take the much more practical case that p is a power of 2 (and any block of random bits is the same as rolling a die with a power of 2 number of faces) and k is arbitrary. In this case, this "waste" and indefinite running time are inevitable unless k is also a power of 2.
This result applies to any case of rolling an n-sided die with an m-sided die, where n and m are distinct prime numbers. For example, look at the answers to a question for the case n = 7 and m = 5.
See also this question: Frugal conversion of uniformly distributed random numbers from one range to another.
Peter O. is right: you cannot escape losing some randomness. So the only choice is a trade-off between how expensive calls to U(1,3) are, code clarity, simplicity, etc.
Here is my variant, making bits from U(1,3) and combining them with rejection.
C/C++ (untested!)
int U13(); // your U(1,3)

int getBit() { // single unbiased random bit
    int v;
    do {
        v = U13();
    } while (v == 3); // reject 3 so that 0 and 1 come out equally likely
    return v - 1;     // maps 1 -> 0, 2 -> 1
}

int U15() {
    int r;
    for (;;) {
        int q = getBit() + 2*getBit() + 4*getBit(); // uniform in [0...8)
        if (q < 5) {   // need range [0...5)
            r = q + 1; // q accepted, make it in [1...5]
            break;
        }
    }
    return r;
}

Generate a number within a range while considering a mean value

I want to generate a random number within a range while considering a mean value.
I have a solution for generating the range:
turtles-own [age]
to setup
  crt 2 [
    get-age
  ]
end
to get-age
  let min-age 65
  let max-age 105
  set age ( min-age + random ( max-age - min-age ) )
end
However, with this approach every number is generated with the same probability, which doesn't make much sense in this case, as far more people are 65 than 105 years old.
Therefore, I want to include a mean value. I found random-normal but as I don't have a standard deviation and my values are not normally distributed, I can't use this approach.
Edit:
An example: I have two agent typologies. Agent typology 1 has the mean age 79 and the age range 67-90. Agent typology 2 has the mean age 77 and the age range 67-92.
If I implement the agent typologies in NetLogo as described above, I get for agent typology 1 the mean age 78 and for agent typology 2 the mean age 79. The reason for that is that for every age the exact same number of agents is generated. In the end this gives me the wrong result for my artificial population.
[Editor's note: Comment from asker added here.]
I want a distribution of values with most values for the min value and fewest values for the max value. However, the curve of the distribution is not necessarily negative linear. Therefore, I need the mean value. I need this approach because there is the possibility that one agent typology has the range for age 65 - 90 and the mean age 70 and another agent typology has the same age range but the mean age is 75. So the real age distribution for the agents would look different.
This is a maths problem rather than a NetLogo problem. You haven't worked out what you want your distribution to look like (lots of different curves can have the same min, max and mean). If you don't know what your curve looks like, it's pretty hard to code it in NetLogo.
However, let's take the simplest curve. This is two uniform distributions, one from the min to the mean and the other from the mean to the max. While it's not decreasing along the length, it will give you the min, max and mean that you want and the higher values will have lower probability as long as the mean is less than the midway point from min to max (as it is if your target is decreasing). The only question is what is the probability to select from each of the two uniform distributions.
If L is your min (low value), H is your max (high value) and M is your mean, then you need to find the probability P of selecting from the lower range, with (1-P) for the upper range. The lower uniform has mean (L+M)/2 and the upper uniform has mean (M+H)/2, and the mean of the combined distribution has to come out to M, so P(L+M)/2 + (1-P)(M+H)/2 = M. Rearranging, the probability-weighted spread below the mean must balance the one above it: P(M-L) = (1-P)(H-M). Solving for P gets you:
P = (H-M) / (H - L)
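As a quick check with the figures from the question (range 65-90, mean 70): P = (90 - 70) / (90 - 65) = 0.8, so 80% of draws come from the uniform on [65, 70) and 20% from the uniform on [70, 90), and the combined mean is 0.8 * 67.5 + 0.2 * 80 = 70, as required.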
Put it into a function:
to-report random-tworange [#min #max #mean]
  let prob (#max - #mean) / (#max - #min)
  ifelse random-float 1 < prob
    [ report #min + random-float (#mean - #min) ]
    [ report #mean + random-float (#max - #mean) ]
end
To test this, try different values in the following code:
to testme
  let testvals []
  let low 77
  let high 85
  let target 80
  repeat 10000 [ set testvals lput (random-tworange low high target) testvals ]
  print mean testvals
end
One other thing you should think about - how much does age matter? This is a design question. You only need to include things that change an agent's behaviour. If agents with age 70 make the same decisions as those with age 80, then all you really need is that the age is in this range and not the specific value.

Design L1 and L2 distance functions to assess the similarity of bank customers. Each customer is characterized by the following attributes

I am having a hard time with the question below. I am not sure if I got it correct, but either way, I need some help further understanding it. If anyone has time to explain, please do.
Design L1 and L2 distance functions to assess the similarity of bank customers. Each customer is characterized by the following attributes:
− Age (customer’s age, which is a real number with a maximum of 90 years and a minimum of 15 years)
− Cr (“credit rating”), which is an ordinal attribute with values ‘very good’, ‘good’, ‘medium’, ‘poor’, and ‘very poor’.
− Av_bal (average account balance, which is a real number with mean 7000 and standard deviation 4000)
Using the L1 distance function, compute the distance between the following 2 customers: c1 = (55, good, 7000) and c2 = (25, poor, 1000). [15 points]
Using the L2 distance function, compute the distance between the above-mentioned 2 customers.
Answer with L1
d(c1,c2) = (c1.cr - c2.cr)/4 + (c1.avg.bal - c2.avg.bal/4000) * (c1.age - mean.age/std.age) - (c2.age - mean.age/std.age)
The question as it is leaves some room for interpretation, mainly because similarity is not specified exactly. I will try to explain what the standard approach would be.
Usually, before you start, you want to normalize values such that they are roughly in the same range. Otherwise, your similarity will be dominated by the feature with the largest variance.
If you have no information about the distribution but just the range of the values, you want to normalize them to [0,1]. For your example this means
norm_age = (age-15)/(90-15)
For nominal or ordinal values you want to find a mapping to numeric values if you want to use Lp-norms. Note: this is not always possible (e.g., colors cannot intuitively be mapped to ordinal values). In your case you can transform the credit rating like this:
cr = {0 if ‘very good’, 1 if ‘good’, 2 if ‘medium’, 3 if ‘poor’, 4 if ‘very poor’}
Afterwards you can do the same normalization as for age:
norm_cr = cr/4
Lastly, for normally distributed values you usually perform standardization by subtracting the mean and dividing by the standard deviation.
norm_av_bal = (av_bal-7000)/4000
Now that you have normalized your values, you can go ahead and define the distance functions:
L1(c1, c2) = |c1.norm_age - c2.norm_age| + |c1.norm_cr - c2.norm_cr| + |c1.norm_av_bal - c2.norm_av_bal|
and
L2(c1, c2) = sqrt((c1.norm_age - c2.norm_age)^2 + (c1.norm_cr - c2.norm_cr)^2 + (c1.norm_av_bal - c2.norm_av_bal)^2)
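As a sanity check, here is a small Python sketch (the helper names are just for illustration) that applies the normalizations above to c1 = (55, good, 7000) and c2 = (25, poor, 1000):
from math import sqrt

CR = {"very good": 0, "good": 1, "medium": 2, "poor": 3, "very poor": 4}

def normalize(age, cr, av_bal):
    return ((age - 15) / (90 - 15),   # min-max scale age to [0, 1]
            CR[cr] / 4,               # ordinal credit rating scaled to [0, 1]
            (av_bal - 7000) / 4000)   # standardize balance (mean 7000, sd 4000)

def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def l2(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

c1 = normalize(55, "good", 7000)   # (0.533..., 0.25, 0.0)
c2 = normalize(25, "poor", 1000)   # (0.133..., 0.75, -1.5)
print(l1(c1, c2))  # 0.4 + 0.5 + 1.5 = 2.4
print(l2(c1, c2))  # sqrt(0.16 + 0.25 + 2.25), roughly 1.63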

Storing a number bigger than the integer limit in vhdl

Let me explain my problem with an example.
I have two variables
a=74686 and b=20930625.
I want to store
c= (a x 2^16) + b.
This exceeds the integer limit (32 bits) in VHDL.
It is okay for me to store c in two separate registers, say c1 and c2, and tell users to concatenate the bits of c1 and c2 to get the actual result. I.e., I want to store the lower 32 bits of c in c1 and the remaining bits in c2.
How can I do this?
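This isn't the VHDL itself, but as a quick illustration of the intended split in ordinary integer arithmetic, a minimal Python sketch:
a = 74686
b = 20930625
c = (a << 16) + b       # a * 2^16 + b, too wide for a 32-bit integer
c1 = c & 0xFFFFFFFF     # lower 32 bits
c2 = c >> 32            # remaining upper bits
assert (c2 << 32) | c1 == c  # concatenating c2 and c1 recovers c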

generate clustered spatstat marks?

I was wondering if anyone knows how to assign marks in spatstat so that they tend to cluster spatially? I have a set of lat long coordinates that I want to categorize into 4 groups. I have figured out how to randomly assign marks/groups to these points using the following code:
as.ppp(data, window, marks = factor(sample(1:4, replace = TRUE)))
But I can't figure out how to assign the marks so that groups tend to occupy points closer to one another. As a further complication, I would also like the number of points within each group to be the same, specified number each time. Does anyone have any leads? Thanks in advance!
Typically in spatstat we define models which describe/generate points at random locations and possibly with random marks. If I understand you correctly you have a fixed set of locations and you simply want to assign random marks. How many points do you have? If you don't have too many points a simple suggestion could be to generate a multivariate normally distributed variable and then take the n_1 lowest values for the first mark, the n_2 next values for the second mark, and so on. A simple example with 4 equal sized groups of points:
library(spatstat)
library(mvtnorm)
set.seed(42) # Make reproducible
X <- redwood # Example data
n <- npoints(redwood)
Xdist <- pairdist(X) # n x n matrix of distances in X
decay_rate <- 1 # Parameter for covariance structure
sigma <- exp(-decay_rate * Xdist)
m <- rmvnorm(1, rep(0, n), sigma)
breaks <- quantile(m, probs = c(0, .25, .5, .75, 1)) # breaks to cut marks in four equal sized groups
marks(X) <- cut(m, breaks = breaks, include.lowest=TRUE, labels = 1:4)
plot(X)
