Algorithm that increases randomness?

Suppose I provide you with random seeds between 0 and 1, but after some observations you find out that my seeds are not properly distributed and most of them are less than 0.5. Would you still be able to use this source, by applying an algorithm that makes the seeds more uniformly distributed?
If yes, please provide me with the necessary sources.

It really depends on how the numbers are distributed over the interval [0...1]. In general, you need the CDF (cumulative distribution function) to map an arbitrary distribution on [0...1] into the uniform distribution on [0...1]. But for some particular cases you can get away with a simpler transformation. The code below (in Python) first constructs a simple unfair RNG which generates 60% of its numbers below 0.5 and 40% above.
import random

def unfairRng():
    q = random.random()
    if q < 0.6:  # result is skewed toward the [0...0.5] interval
        return 0.5 * random.random()
    return 0.5 + 0.5 * random.random()

random.seed(312345)
nof_trials = 100000
h = [0, 0]
for k in range(0, nof_trials):
    q = unfairRng()
    h[0 if q < 0.5 else 1] += 1
print(h)
I then count the numbers below and above 0.5, and the output on my machine is
[60086, 39914]
which is quite close to the 60/40 split I described.
Ok, let's "fix" the RNG by taking numbers from unfairRng and alternating between returning the value as-is and returning 1-value. Again, in Python:
def fairRng():
    if fairRng.even == 0:
        fairRng.even = 1
        return unfairRng()
    else:
        fairRng.even = 0
        return 1.0 - unfairRng()
fairRng.even = 0

h = [0, 0]
for k in range(0, nof_trials):
    q = fairRng()
    h[0 if q < 0.5 else 1] += 1
print(h)
Counting the histogram again, the result is
[49917, 50083]
which "fixes" the unfair RNG and makes it fair (for this particular source, the mixture of the two densities is in fact exactly uniform: (1.2 + 0.8) / 2 = 1 on both halves of the interval).

Getting a fair coin flip out of an unfair coin is done by flipping twice and, if the results differ, using the first result; otherwise, discarding the pair and flipping again.
This yields a coin with exactly 50/50 odds, but it's not guaranteed to finish in any fixed number of flips.
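A minimal sketch of that trick in Python (the 60%-heads source here is just an assumption for demonstration):

import random

def von_neumann(biased_bit):
    # flip twice; if the results differ, use the first; otherwise retry
    while True:
        a, b = biased_bit(), biased_bit()
        if a != b:
            return a

biased = lambda: random.random() < 0.6  # a coin with 60% heads
print(sum(von_neumann(biased) for _ in range(10000)))  # close to 5000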

Random number sequences generated by any algorithm will have no more entropy ("randomness") than the seeds themselves. For instance, if each seed has an entropy of only 1 bit for every 64 bits, they can each be transformed, at least in theory, to a 1 bit random number with full entropy. However, measuring the entropy of those seeds is nontrivial (entropy estimation). Moreover, not every algorithm is suitable in all cases for extracting the entropy of random seeds (entropy extraction, randomness extraction).

Related

What is the most rng-efficient uniform random integer algorithm? [duplicate]

This question already has answers here:
How to generate a random integer in the range [0,n] from a stream of random bits without wasting bits?
This is not a duplicate of #11766794 “What is the optimal algorithm for generating an unbiased random integer within a range?”. Both its top-upvoted answer and its Accepted answer have to do with floating-point extrapolation, which will not even yield perfectly uniform integers. That question is asking after ways to quickly get good approximations of a uniform random integer given a serviceably uniform floating-point rand() value; I am asking this question in the context of a perfectly uniform random integer algorithm given a true random bit generator (deterministic or otherwise; the question applies equally either way).
I am asking, specifically, about theoretical optimality in terms only of efficiency w/r/t random bits used: given a random bit stream, what is the algorithm which consumes the fewest number of bits from it in the process of generating a perfectly uniform random integer within a given range?
For instance, CPython 3.9.0's random.randbelow has at least one trivial inefficiency: it wastes a random bit when called over any power of 2 (including the trivial range):
def randbelow(n):
    "Return a random int in the range [0,n).  Returns 0 if n==0."
    if not n:
        return 0
    k = n.bit_length()  # don't use (n-1) here because n can be 1
    r = getrandbits(k)  # 0 <= r < 2**k
    while r >= n:
        r = getrandbits(k)
    return r
While this is easily enough patched by replacing "not n" with "n <= 1" and "n.bit_length()" with "(n-1).bit_length()", a little analysis shows it leaves even more to be desired:
Say one is generating an integer over the range [0, 4097): almost half of all calls to getrandbits(13) will produce a value that overshoots the range. Worse, if the first bit and, say, the second bit come out high, the value is already doomed to overshoot, yet the algorithm consumes 11 more bits anyway and then discards them. So it would seem that this algorithm is obviously nonoptimal.
The best I could come with in an hour this evening was the following algorithm:
def randbelow(n):
    if n <= 1:
        return 0
    k = (n - 1).bit_length()  # number of bits needed to represent n-1
    while True:
        r = 0
        for i in reversed(range(k)):
            r |= getrandbits(1) << i
            if r >= n:
                break
        else:
            return r
However, I am no mathematician, and just because I fixed the inefficiencies that I could immediately see doesn't give me any confidence that I have just instantly stumbled in an afternoon on the most efficient possible uniform integer selection algorithm.
Say, for instance, the bits are being purchased from a quantum or atmospheric RNG service; or as part of a multi-party protocol in which every individual bit generation takes several round-trips; or on an embedded device without any hardware RNG support… whatever the case may be, I'm only asking the direct question: what algorithm for generating (perfectly) uniformly random integers from a true random bit stream is the most efficient with respect to random bits consumed? (Or, if not known with certainty, what is the best current candidate?)
(I have used Python in these examples because it's what I am working in primarily this season, but the question is by no means specific to any language, except insofar as the algorithm itself must generalize to numbers above 2^64.)
The Python below implements arithmetic coding in exact arithmetic. This becomes very expensive computationally but achieves entropy + O(1) bits in expectation, which is basically optimal.
from fractions import Fraction
from math import floor, log2
import collections
import random

meter = 0

def metered_random_bits():
    global meter
    while True:
        yield bool(random.randrange(2))
        meter += 1

class ArithmeticDecoder:
    def __init__(self, bits):
        self._low = Fraction(0)
        self._width = Fraction(1)
        self._bits = bits

    def randrange(self, n):
        self._low *= n
        self._width *= n
        while True:
            f = floor(self._low)
            if self._low + self._width <= f + 1:
                self._low -= f
                return f
            self._width /= 2
            if next(self._bits):
                self._low += self._width

if __name__ == "__main__":
    k = 3000
    n = 7
    decoder = ArithmeticDecoder(metered_random_bits())
    print(collections.Counter(decoder.randrange(n) for i in range(k)))
    print("used", meter, "bits")
    print("entropy", k * log2(n), "bits")

How to get the number e (2.718) using a random number sensor?

Is it possible to calculate the number e (2.718) using random numbers?
I'm assuming that when you say "using random numbers" you mean "using some sort of random sampling scheme." If you want the exact answer to an infinite number of decimals, the answer is "no, not unless you have an infinite amount of time." However, we can generate random sequences whose expected value is e, and we can assess the sampling error using basic statistics. By increasing the sample size, we can decrease the sampling error to any precision you want as long as you specify your desired confidence level.
It turns out that if you sum a bunch of random uniform(0,1)'s until the sum exceeds 1, the quantity of uniforms required has an expected value of e. We can turn that into a sampling problem by writing a method/function to return the count, and taking the average of the values obtained by calling that method multiple times.
You didn't specify any particular language, so here it is in Ruby (which is practically like pseudocode):
require 'quickstats' # install from rubygems w/ 'gem install quickstats'

def trial # generate results of one trial
  count = 0
  sum = 0.0
  while sum < 1.0
    count += 1
    sum += rand # Ruby's rand produces U(0,1) values by default
  end
  return count # added "return" keyword for non-rubyists' readability
end

stats = QuickStats.new
10_000_000.times { stats.new_obs trial } # more precision? bump up sample size
puts "Average = #{stats.avg}"
half_width = 1.96 * stats.std_err
puts "CI half-width = #{half_width}"
deviation = (stats.avg - Math::E).abs
puts " |E - avg| = #{deviation} (should be ≤ half-width 95% of the time)"
This runs in under 4 seconds on my laptop and produces outputs such as:
Average = 2.7179918000002234
CI half-width = 0.0005421324752620413
|E - avg| = 0.0002900284588216451 (should be ≤ half-width 95% of the time)
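For comparison, here is a minimal Python version of the same estimator (without the confidence-interval bookkeeping; the function name is mine):

import random

def trial():
    # count how many U(0,1) draws it takes for the running sum to exceed 1
    count, total = 0, 0.0
    while total < 1.0:
        count += 1
        total += random.random()
    return count

n = 1_000_000
print(sum(trial() for _ in range(n)) / n)  # prints a value close to 2.71828...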
Here's another option. Consider the following probability question: you have a biased coin that comes up heads with probability 1/n. You then flip the coin n times. What is the probability that you never flip heads? Well, that's the probability that you flip tails n times, which is (1 - 1/n)^n, which as n tends towards infinity rapidly approaches 1/e. You could therefore estimate e by picking some modest value of n, simulating n tosses of a coin that comes up heads with probability 1/n, and seeing whether you never flip heads. The proportion of trials that don't yield heads will approach 1/e, and from there you can estimate e.
For example, here's Python code to flip a coin with heads probability 1/n a total of n times (done by sampling a uniformly random number between 0 and 1) and see if all of them are tails:
from random import random

def one_trial(n):
    for i in range(n):
        if random() < 1 / n:
            return False
    return True
We can then run a large number of trials and see which fraction of them are all tails. That fraction will be approximately 1/e, so we just take the reciprocal:
def estimate_e(n, num_trials):
    successes = 0
    for i in range(num_trials):
        if one_trial(n):
            successes += 1
    return num_trials / successes
Doing this with n = 2^10 and num_trials = 2^20 gave me the estimate
e ≈ 2.7198016257969466,
which isn't too bad.

Conditional sampling of binary vectors (?)

I'm trying to find a name for my problem, so I don't have to reinvent the wheel when coding an algorithm which solves it...
I have, say, 2,000 binary (row) vectors and I need to pick 500 of them. In the picked sample I take column sums, and I want my sample to be as close as possible to a pre-defined distribution of the column sums. I'll be working with 20 to 60 columns.
A tiny example:
Out of the vectors:
110
010
011
110
100
I need to pick 2 to get column sums 2, 1, 0. The solution (exact in this case) would be
110
100
My ideas so far
one could maybe call this a binary multidimensional knapsack, but I did not find any algos for that
Linear Programming could help, but I'd need some step by step explanation as I got no experience with it
as an exact solution is not always feasible, something like simulated annealing or brute force could work well
a hacky way using constraint solvers comes to mind - first set the constraints tight and gradually loosen them until some solution is found - given that CSP should be much faster than ILP...?
My concrete, practical suggestion (if the approximation guarantee works out for you) would be to apply the maximum entropy method (Chapter 7 of Boyd and Vandenberghe's book Convex Optimization; you can probably find several implementations with your favorite search engine). Find the maximum-entropy probability distribution on row indexes such that (1) no row index is more likely than 1/500 and (2) the expected value of the row vector chosen is 1/500th of the predefined distribution. Given this distribution, choose each row independently with probability 500 times its likelihood under the distribution, which will give you 500 rows on average. If you need exactly 500, repeat until you get exactly 500 (this shouldn't take too many tries, due to concentration bounds).
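For concreteness, here is a hedged sketch of that recipe on toy data. The use of cvxpy (and its entr atom) is my choice of tool, not something the referenced book prescribes, and the target below is deliberately constructed so the constraints are feasible:

import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
N, M, C = 2000, 500, 30
A = rng.integers(0, 2, size=(N, C))                      # binary row vectors
target = A[rng.choice(N, M, replace=False)].sum(axis=0)  # a feasible column-sum target

p = cp.Variable(N, nonneg=True)        # probability distribution on row indexes
constraints = [cp.sum(p) == 1,
               p <= 1.0 / M,           # (1) no row index more likely than 1/M
               A.T @ p == target / M]  # (2) expected row = 1/M of the target
cp.Problem(cp.Maximize(cp.sum(cp.entr(p))), constraints).solve()

# choose each row independently with probability M * p_i (M rows on average);
# repeat the draw until exactly M come out if an exact count is required
keep = rng.random(N) < M * p.value
print(keep.sum(), np.abs(A[keep].sum(axis=0) - target).sum())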
Firstly I will make some assumptions regarding this problem:
Regardless of whether the column sum of the selected solution is over or under the target, it weighs the same.
The sums of the first, second, and third columns are equally weighted in the solution (i.e. if one solution's first column sum is off by 1 and another solution's third column sum is off by 1, the two solutions are equally good).
The closest problem I can think of is the subset sum problem, which can itself be seen as a special case of the knapsack problem.
However, both of these problems are NP-complete: no polynomial-time algorithm is known that solves them, even though solutions are easy to verify.
If I were you, the two arguably most practical approaches to this problem would be linear programming and machine learning.
With linear programming, depending on how many columns you are optimising over, you can control how finely tuned you want the solution to be, in exchange for time. You should read up on this, because it is fairly simple and efficient.
With machine learning, you need a lot of data (sets of vectors and the corresponding solutions). You don't even need to specify what you want; many machine learning algorithms can generally deduce what to optimise from the data set.
Both approaches have pros and cons; you should decide which one to use based on your circumstances and problem set.
This can definitely be modeled as an (integer!) linear program (many problems can). Once you have the model, you can use a solver such as lpsolve to solve it.
We model "vector i is selected" as a 0/1 variable x_i.
Then for each column c, we have a constraint:
sum over i of (x_i * value of vector i in column c) = target for column c
Taking your example, in lp_solve this could look like the following (each >=/<= pair encodes one equality constraint):
min: ;
+x1 +x4 +x5 >= 2;
+x1 +x4 +x5 <= 2;
+x1 +x2 +x3 +x4 <= 1;
+x1 +x2 +x3 +x4 >= 1;
+x3 <= 0;
+x3 >= 0;
bin x1, x2, x3, x4, x5;
If you are fine with a heuristic-based search approach, here is one.
Go over the list and score each bit string by the squared sum of the digit-wise differences between it and the goal. For example, if we are looking for 2, 1, 0, and we are scoring 0, 1, 1, we would do it the following way:
Take the digit wise difference:
2, 0, 1
Square the digit wise difference:
4, 0, 1
Sum:
5
As a side note, squaring the difference when scoring is a common method in heuristic search. In your case, it makes sense because bit strings that have a 1 as the first digit are a lot more interesting to us. With your example this simple algorithm would pick first 110, then 100, which is the best solution.
In any case, there are some optimizations that could be made to this, I will post them here if this kind of approach is what you are looking for, but this is the core of the algorithm.
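For concreteness, here is a small runnable sketch of that scoring loop; the greedy re-scoring against the remaining goal after each pick is my reading of the approach, not the answerer's own code:

def pick_greedy(vectors, goal, k):
    remaining = list(range(len(vectors)))
    sums = [0] * len(goal)
    chosen = []
    for _ in range(k):
        # squared digit-wise difference between the goal and (current sums + candidate)
        def score(i):
            return sum((g - s - b) ** 2 for g, s, b in zip(goal, sums, vectors[i]))
        best = min(remaining, key=score)
        remaining.remove(best)
        chosen.append(vectors[best])
        sums = [s + b for s, b in zip(sums, vectors[best])]
    return chosen

vs = [(1, 1, 0), (0, 1, 0), (0, 1, 1), (1, 1, 0), (1, 0, 0)]
print(pick_greedy(vs, (2, 1, 0), 2))  # picks 110 then 100, as described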
You have a given target vector of column sums. You want to select M vectors out of N whose sum is closest to the target. Let's say you use the Euclidean distance to measure whether one selection is better than another.
If you want an exact sum, have a look at the k-sum problem, which is a generalization of the 3SUM problem. The problem is harder than the subset sum problem, because you want an exact number of elements to add up to a target value. There is a solution in O(N^(M/2) * lg N), but that means more than 2000^250 * 7.6 > 10^826 operations in your case (in the favorable case where vector operations have a cost of 1).
First conclusion: do not try to get an exact result unless your vectors have some characteristics that may reduce the complexity.
Here's a hill climbing approach:
sort the vectors by number of 1's: 111... first, 000... last;
use the polynomial time approximate algorithm for the subset sum;
you now have an approximate solution with K elements. Because of the ordering of the elements (the big ones come first), K should be as small as possible:
if K >= M, you take the first M vectors of the solution and that's probably near the best you can do.
if K < M, you can remove the first vector and try to replace it with 2 or more vectors from the rest of the N vectors, using the same technique, until you have M vectors. To summarize: split the big vectors into smaller ones until you reach the correct number of vectors.
Here's a proof of concept with numbers, in Python:
import random

def distance(x, y):
    return abs(x - y)

def show(ls):
    if len(ls) < 10:
        return str(ls)
    else:
        return ", ".join(map(str, ls[:5] + ("...",) + ls[-5:]))

def find(is_xs, target):
    # see https://en.wikipedia.org/wiki/Subset_sum_problem#Pseudo-polynomial_time_dynamic_programming_solution
    S = [(0, ())]  # we store indices along with values to get the path
    for i, x in is_xs:
        T = [(x + t, js + (i,)) for t, js in S]
        U = sorted(S + T)
        y, ks = U[0]
        S = [(y, ks)]
        for z, ls in U:
            if z == target:  # use the Euclidean distance here if you want an approximation
                return ls
            if z != y and z < target:
                y, ks = z, ls
                S.append((z, ls))
    ls = S[-1][1]  # take the closest element to target
    return ls

N = 2000
M = 500
target = 1000
xs = [random.randint(0, 10) for _ in range(N)]
print("Take {} numbers out of {} to make a sum of {}".format(M, N, target))
xs = sorted(xs, reverse=True)
is_xs = list(enumerate(xs))
print("Sorted numbers: {}".format(show(tuple(is_xs))))
ls = find(is_xs, target)
print("FIRST TRY: {} elements ({}) -> {}".format(len(ls), show(ls), sum(x for i, x in is_xs if i in ls)))
splits = 0
while len(ls) < M:
    first_x = xs[ls[0]]
    js_ys = [(i, x) for i, x in is_xs if i not in ls and x != first_x]
    replace = find(js_ys, first_x)
    splits += 1
    if len(replace) < 2 or len(replace) + len(ls) - 1 > M or sum(xs[i] for i in replace) != first_x:
        print("Give up: can't replace {}.\nAdd the lowest elements.".format(ls[:1]))
        ls += tuple([i for i, x in is_xs if i not in ls][len(ls) - M:])
        break
    print("Replace {} (={}) by {} (={})".format(ls[:1], first_x, replace, sum(xs[i] for i in replace)))
    ls = tuple(sorted(ls[1:] + replace))  # use a heap?
    print("{} elements ({}) -> {}".format(len(ls), show(ls), sum(x for i, x in is_xs if i in ls)))
print("AFTER {} splits, {} -> {}".format(splits, ls, sum(x for i, x in is_xs if i in ls)))
The result is obviously not guaranteed to be optimal.
Remarks:
Complexity: find has a pseudo-polynomial time complexity (see the Wikipedia page) and is called at most M^2 times, hence the overall complexity remains polynomial. In practice, the process is reasonably fast (the split calls have a small target).
Vectors: to ensure that you reach the target with the minimum number of elements, you can improve the ordering of the elements. Your target is (t_1, ..., t_c): if you sort the t_j from max to min, you get the most important columns first. You can then sort the vectors by number of 1s, and then by the presence of a 1 in the most important columns. E.g. for target = 4 8 6: 1 1 1 > 0 1 1 > 1 1 0 > 1 0 1 > 0 1 0 > 0 0 1 > 1 0 0 > 0 0 0.
find (vectors): if the current sum exceeds the target in all of the columns, then you cannot get any closer to the target (any vector you add will take you farther from it): don't add that sum to S (this is the z >= target case for numbers).
I propose a simple ad hoc algorithm, which, broadly speaking, is a kind of gradient descent algorithm. It seems to work relatively well for input vectors which have a distribution of 1s “similar” to the target sum vector, and probably also for all “nice” input vectors, as defined in a comment of yours. The solution is not exact, but the approximation seems good.
The distance between the sum vector of the output vectors and the target vector is taken to be Euclidean. Minimizing it means minimizing the sum of the squared differences of the sum vector and the target vector (the square root is not needed, because it is monotonic). The algorithm does not guarantee to yield the sample that minimizes the distance from the target, but it makes a serious attempt at doing so, by always moving in some locally optimal direction.
The algorithm can be split into 3 parts.
First of all, the first M of the N input vectors (e.g., N=2000, M=500) are put in a list of candidate output vectors, and the remaining vectors are put in another list.
Then "approximately optimal" swaps between vectors in the two lists are done, until either the distance would not decrease any more, or a predefined maximum number of iterations is reached. An approximately optimal swap is one where removing the first vector from the list of output vectors causes a maximal decrease or minimal increase of the distance, and then, after the removal of the first vector, adding the second vector to the same list causes a maximal decrease of the distance. The whole swap is avoided if the net result is not a decrease of the distance.
Then, as a last phase, "optimal" swaps are done, again stopping on no decrease in distance or maximum number of iterations reached. Optimal swaps cause a maximal decrease of the distance, without requiring the removal of the first vector to be optimal in itself. To find an optimal swap all vector pairs have to be checked. This phase is much more expensive, being O(M(N-M)), while the previous "approximate" phase is O(M+(N-M))=O(N). Luckily, when entering this phase, most of the work has already been done by the previous phase.
from typing import List, Tuple

def get_sample(vects: List[Tuple[int]], target: Tuple[int], n_out: int,
               max_approx_swaps: int = None, max_optimal_swaps: int = None,
               verbose: bool = False) -> List[Tuple[int]]:
    """
    Get a sample of the input vectors having a sum close to the target vector.
    Closeness is measured in Euclidean metrics. The output is not guaranteed to be
    optimal (minimum square distance from target), but a serious attempt is made.
    The max_* parameters can be used to avoid too long execution times,
    tune them to your needs by setting verbose to True, or leave them None (∞).
    :param vects: the list of vectors (tuples) with the same number of "columns"
    :param target: the target vector, with the same number of "columns"
    :param n_out: the requested sample size
    :param max_approx_swaps: the max number of approximately optimal vector swaps,
        None means unlimited (default: None)
    :param max_optimal_swaps: the max number of optimal vector swaps,
        None means unlimited (default: None)
    :param verbose: print some info if True (default: False)
    :return: the sample of n_out vectors having a sum close to the target vector
    """
    def square_distance(v1, v2):
        return sum((e1 - e2) ** 2 for e1, e2 in zip(v1, v2))

    n_vec = len(vects)
    assert n_vec > 0
    assert n_out > 0
    n_rem = n_vec - n_out
    assert n_rem > 0
    output = vects[:n_out]
    remain = vects[n_out:]
    n_col = len(vects[0])
    assert n_col == len(target) > 0
    sumvect = (0,) * n_col
    for outvect in output:
        sumvect = tuple(map(int.__add__, sumvect, outvect))
    sqdist = square_distance(sumvect, target)
    if verbose:
        print(f"sqdist = {sqdist:4} after"
              f" picking the first {n_out} vectors out of {n_vec}")
    if max_approx_swaps is None:
        max_approx_swaps = sqdist
    n_approx_swaps = 0
    while sqdist and n_approx_swaps < max_approx_swaps:
        # find the best vect to subtract (the square distance MAY increase)
        sqdist_0 = None
        index_0 = None
        sumvect_0 = None
        for index in range(n_out):
            tmp_sumvect = tuple(map(int.__sub__, sumvect, output[index]))
            tmp_sqdist = square_distance(tmp_sumvect, target)
            if sqdist_0 is None or sqdist_0 > tmp_sqdist:
                sqdist_0 = tmp_sqdist
                index_0 = index
                sumvect_0 = tmp_sumvect
        # find the best vect to add,
        # but only if there is a net decrease of the square distance
        sqdist_1 = sqdist
        index_1 = None
        sumvect_1 = None
        for index in range(n_rem):
            tmp_sumvect = tuple(map(int.__add__, sumvect_0, remain[index]))
            tmp_sqdist = square_distance(tmp_sumvect, target)
            if sqdist_1 > tmp_sqdist:
                sqdist_1 = tmp_sqdist
                index_1 = index
                sumvect_1 = tmp_sumvect
        if sumvect_1:
            tmp = output[index_0]
            output[index_0] = remain[index_1]
            remain[index_1] = tmp
            sqdist = sqdist_1
            sumvect = sumvect_1
            n_approx_swaps += 1
        else:
            break
    if verbose:
        print(f"sqdist = {sqdist:4} after {n_approx_swaps}"
              f" approximately optimal swap{'s'[n_approx_swaps == 1:]}")
    diffvect = tuple(map(int.__sub__, sumvect, target))
    if max_optimal_swaps is None:
        max_optimal_swaps = sqdist
    n_optimal_swaps = 0
    while sqdist and n_optimal_swaps < max_optimal_swaps:
        # find the best pair to swap,
        # but only if the square distance decreases
        best_sqdist = sqdist
        best_diffvect = diffvect
        best_pair = None
        for i0 in range(n_out):  # was range(M): use the parameter, not the global
            tmp_diffvect = tuple(map(int.__sub__, diffvect, output[i0]))
            for i1 in range(n_rem):
                new_diffvect = tuple(map(int.__add__, tmp_diffvect, remain[i1]))
                new_sqdist = sum(d * d for d in new_diffvect)
                if best_sqdist > new_sqdist:
                    best_sqdist = new_sqdist
                    best_diffvect = new_diffvect
                    best_pair = (i0, i1)
        if best_pair:
            tmp = output[best_pair[0]]
            output[best_pair[0]] = remain[best_pair[1]]
            remain[best_pair[1]] = tmp
            sqdist = best_sqdist
            diffvect = best_diffvect
            n_optimal_swaps += 1
        else:
            break
    if verbose:
        print(f"sqdist = {sqdist:4} after {n_optimal_swaps}"
              f" optimal swap{'s'[n_optimal_swaps == 1:]}")
    return output
from random import randrange

C = 30  # number of columns
N = 2000  # total number of vectors
M = 500  # number of output vectors
F = 0.9  # fill factor of the target sum vector
T = int(M * F)  # maximum value + 1 that can appear in the target sum vector
A = 10000  # maximum number of approximately optimal swaps, may be None (unlimited)
B = 10  # maximum number of optimal swaps, may be None (unlimited)

target = tuple(randrange(T) for _ in range(C))
vects = [tuple(int(randrange(M) < t) for t in target) for _ in range(N)]
sample = get_sample(vects, target, M, A, B, True)
Typical output:
sqdist = 2639 after picking the first 500 vectors out of 2000
sqdist = 9 after 27 approximately optimal swaps
sqdist = 1 after 4 optimal swaps
P.S.: As it stands, this algorithm is not limited to binary input vectors; integer vectors would work too. Intuitively, though, I suspect that the quality of the optimization could suffer, and that this algorithm is more appropriate for binary vectors.
P.P.S.: Execution times with your kind of data are probably acceptable with standard CPython, and get better (down to a couple of seconds, almost a factor of 10) with PyPy. To handle bigger sets of data, the algorithm would have to be translated to C or some other language, which should not be difficult at all.

Random function returning number from interval

How would you implement a function that returns a random number from the interval 1..1000,
where a number N determines the chance of reaching higher or lower numbers?
It should behave as follows:
e.g.
if N = 0, and we generate the random number many times, we will reach a certain equilibrium (every number from the interval 1..1000 has an equal chance).
if N = 2321 (I call it a positive factor), it will be very hard to get a small number (numbers > 900 will come up often, numbers near 500 sometimes, and numbers < 100 rarely). The higher the positive factor, the higher the probability of high numbers.
if N = -2321 (a negative factor), this will be the opposite of the positive factor.
It's clear that for a given N the generated numbers will follow a certain characteristic curve. Could you advise me how to achieve this goal, and what curves I can create? What possibilities do I have here? How would you limit the positive and negative factors, etc.?
Thank you for your help.
If you generate a uniform random number, and then raise it to a power > 1, it will get smaller, but stay in the range [0, 1]. If you raise it to a power greater than 0 but less than 1, it will get larger, but stay in the range [0, 1].
So you can map your bias factor N to an exponent when generating your random numbers.
import random

def biased_random(scale, bias):
    return random.random() ** bias * scale

>>> sum(biased_random(1000, 2.5) for x in range(100)) / 100
291.59652962214676  # average less than 500
>>> max(biased_random(1000, 2.5) for x in range(100))
963.81166161355998  # but still occasionally generates large numbers
>>> sum(biased_random(1000, .3) for x in range(100)) / 100
813.90199860117821  # average > 500
>>> min(biased_random(1000, .3) for x in range(100))
265.25040459294883  # but still occasionally generates small numbers
This problem is severely underspecified; as stated, there are a million ways to solve it.
Instead of arbitrary positive and negative values, try to think about what they mean. IMHO, the beta distribution is the one you should consider: by selecting the parameters α and β, you can appropriately modulate the behavior of your distribution.
See what shapes you can get with various α and β: http://en.wikipedia.org/wiki/Beta_distribution#Shapes
http://en.wikipedia.org/wiki/File:Beta_distribution_pdf.svg
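As one hedged illustration, Python's standard library already exposes the beta distribution; the mapping from N to α and β below is an arbitrary choice of mine, just to show the mechanics:

import random

def beta_biased(n, lo=1, hi=1000):
    # n > 0 skews toward hi, n < 0 toward lo, n == 0 is uniform (alpha = beta = 1);
    # the /1000 scaling is an assumption to tame large factors like 2321
    alpha = 1 + max(n, 0) / 1000
    beta = 1 + max(-n, 0) / 1000
    return lo + random.betavariate(alpha, beta) * (hi - lo)

print(sum(beta_biased(2321) for _ in range(10000)) / 10000)  # well above 500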
To begin, let's decide to pick numbers from [0,1], because it makes things simpler.
Let n be the number that represents the distribution (2321 or -2321, as in your example).
We only need a solution for n > 0, because for n < 0 you can take the positive version of n and subtract the result from 1.
One simple idea for a PDF on the interval [0,1] is x^n (or at least something of that shape).
The CDF is then the integral of x^n, which is x^(n+1)/(n+1).
Because the CDF must equal 1 at the right end of the interval (in our case, at 1), our final, properly weighted CDF is x^(n+1).
In order to generate this kind of distribution, we must calculate the quantile function.
The quantile function is just the inverse of the CDF, which in our case is x^(1/(n+1)).
And that is it. Your QF is x^(1/(n+1)).
To generate numbers from [0,1], pick a uniformly distributed random number x from [0,1] (the most common random function in programming languages)
and then raise it to the power 1/(n+1).
The only problem I see is that it can be tricky to calculate 1 - x^(1/(-n+1)) accurately when n < 0, but I think you can use log1p,
so it becomes exp(log1p(-x^(1/(-n+1)))) if n < 0.
Conclusion, with normalization:
if n >= 0: (x^(1/(n/1000+1))) * 1000
if n < 0: exp(log1p(-(x^(1/(-(n/1000)+1))))) * 1000
where x is a uniformly distributed random value in the interval [0,1]
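Transcribed directly into Python (a sketch; the /1000 normalization constants are the ones proposed above, and the function name is mine):

import math
import random

def skewed(n):
    x = random.random()
    if n >= 0:
        return (x ** (1 / (n / 1000 + 1))) * 1000
    # exp(log1p(-y)) computes 1 - y while keeping precision when y is near 1
    return math.exp(math.log1p(-(x ** (1 / (-(n / 1000) + 1))))) * 1000

print(sum(skewed(2321) for _ in range(10000)) / 10000)   # skewed high
print(sum(skewed(-2321) for _ in range(10000)) / 10000)  # skewed low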

How to compute the "15% of the time" randomness?

I'm looking for a decent, elegant method of calculating this simple logic.
Right now I can't think of one; it's making my head spin.
I am required to do some action only 15% of the time.
I'm used to "50% of the time" where I just mod the milliseconds of the current time and see if it's odd or even, but I don't think that's elegant.
How would I elegantly calculate "15% of the time"? Random number generator maybe?
Pseudo-code or any language are welcome.
Hope this is not subjective, since I'm looking for the "smartest" short-hand method of doing that.
Thanks.
Solution 1 (double)
get a random double between 0 and 1 (whatever language you use, there must be such a function)
do the action only if it is smaller than 0.15
Solution 2 (int)
You can also achieve this by generating a random int and checking whether it is divisible by 6 or 7. UPDATE: this is not optimal.
You can produce a random number between 0 and 99, and check if it's less than 15:
if (rnd.Next(100) < 15) ...
You can also reduce the numbers, as 15/100 is the same as 3/20:
if (rnd.Next(20) < 3) ...
A random number generator would give you the best randomness. Generate a random number between 0 and 1, and test for < 0.15.
Using the time like that isn't truly random, as it's influenced by processing time: if a task takes less than 1 millisecond to run, the next "random" choice will be the same one.
That said, if you do want to use the millisecond-based method, do milliseconds % 20 < 3.
Just use a PRNG. Like always, it's a performance vs. accuracy trade-off. I think rolling your own directly off the time is a waste of time (pun intended). You'll probably get biasing effects even worse than those of a run-of-the-mill linear congruential generator.
In Java, I would use nextInt:
myRNG.nextInt(100) < 15
Or (mostly) equivalently:
myRNG.nextInt(20) < 3
There are ways to get a random integer in other languages, too (multiple ways, actually, depending on how accurate it has to be).
Using modulo arithmetic you can easily do something on roughly every Xth run, like so (6 will give you roughly 15%, since one run in six qualifies):
if (microtime() % 6 === 0) do it
Another way:
if (rand(0, 1) < 0.15) do it
import random

# an array with exactly 15% true outcomes
outcomes = [True] * 15 + [False] * 85
random.shuffle(outcomes)
while outcomes:
    # pop one element of the shuffled array
    if outcomes.pop():
        do_action()
    else:
        do_something_else()
# redo the whole thing again when no elements are left
Here's one approach that combines randomness with a guarantee that you eventually get a positive outcome within a predictable window:
Have a target (15 in your case), a counter (initialized to 0), and a flag (initialized to false).
Accept a request.
If the counter is 15, reset the counter and the flag.
If the flag is true, return a negative outcome.
Get a random true or false based on one of the methods described in the other answers, but use a probability of 1/(15 - counter).
Increment the counter.
If the result is true, set the flag to true and return a positive outcome; else return a negative outcome.
Accept the next request.
This means the first request has a probability of 1/15 of returning positive, but by the 15th request, if no positive result has been returned yet, the probability of a positive result is 1/1 (certain).
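A sketch of those steps in Python (one detail is my assumption: the counter also advances on flagged requests, so the window actually rolls over):

import random

class OneInWindow:
    def __init__(self, window=15):
        self.window, self.counter, self.flag = window, 0, False

    def request(self):
        if self.counter == self.window:  # reset the counter and the flag
            self.counter, self.flag = 0, False
        if self.flag:                    # already fired in this window
            self.counter += 1
            return False
        hit = random.random() < 1.0 / (self.window - self.counter)
        self.counter += 1
        if hit:
            self.flag = True
        return hit

r = OneInWindow()
print(sum(r.request() for _ in range(15000)))  # exactly 1000: one hit per window of 15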
This quote is from a great article about how to use a random number generator:
Note: Do NOT use
y = rand() % M;
as this focuses on the lower bits of rand(). For linear congruential random number generators, which rand() often is, the lower bytes are much less random than the higher bytes. In fact the lowest bit cycles between 0 and 1. Thus rand() may cycle between even and odd (try it out). Note rand() does not have to be a linear congruential random number generator. It's perfectly permissible for it to be something better which does not have this problem.
and it contains formulas and pseudo-code for:
r = [0,1) = {r: 0 <= r < 1} (real)
x = [0,M) = {x: 0 <= x < M} (real)
y = [0,M) = {y: 0 <= y < M} (integer)
z = [1,M] = {z: 1 <= z <= M} (integer)
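The lowest-bit claim is easy to check; here is a quick demonstration with a small LCG (the constants are the well-known Numerical Recipes ones, used purely for illustration):

def lcg(seed):
    # a classic linear congruential generator: state = (a*state + c) mod 2**32
    state = seed
    while True:
        state = (1664525 * state + 1013904223) % 2**32
        yield state

g = lcg(42)
print([next(g) % 2 for _ in range(12)])  # the lowest bit strictly alternates 0,1,0,1,...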
