Memory efficient version for the construction of disjoint random lists - python-2.x

We have two positive integers b, n with b < n/2. We want to generate two random disjoint lists I1, I2, each with b elements from {0, 1, ..., n}. A simple way to do this is the following:
def disjoint_sets(bound, n):
    import random
    I1 = []; I2 = []
    L = random.sample(range(0, n + 1), n + 1)
    I1 = L[0:bound]
    I2 = L[bound:2 * bound]
    return I1, I2
For large b, n (say b = 100, n > 1e7) the above is not memory efficient, since L is large. I am wondering if there is a method to get I1, I2 without building range(0, n+1)?

Here is a hit-and-miss approach which works well for numbers in the range that you mentioned:
import random

def rand_sample(k, n):
    # picks k distinct random integers from range(n)
    # assumes that k is much smaller than n
    choices = set()
    sample = []
    for i in range(k):  # xrange(k) in Python 2
        choice = random.randint(0, n - 1)
        while choice in choices:
            choice = random.randint(0, n - 1)
        choices.add(choice)
        sample.append(choice)
    return sample
For your problem, you could do something like:
def rand_pair(b, n):
    sample = rand_sample(2 * b, n + 1)  # n + 1 so the values cover {0, 1, ..., n}
    return sample[:b], sample[b:]
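A quick sanity check of the pair (illustrative only):

I1, I2 = rand_pair(100, 10**7)
assert len(I1) == 100 and len(I2) == 100
assert not set(I1) & set(I2)  # the two samples share no elements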

Related

Algorithm for an arbitrary number of nested for loops to calculate a discrete probability distribution

This question is not specific to any programming language
Let's say I have a List<Hashmap<int,float>>, where each Hashmap<int,float> represents the discrete probability distribution of a random variable.
For example, the distribution of a fair coin could be represented by the Hashmap {0: 0.5, 1: 0.5} (head = 1, tail = 0).
If we have n discrete random variables, we could store their distributions as a List of n Hashmaps.
Question: How could we now iterate over this List to obtain the distribution of the sum of the random variables?
More Information:
For example, with three random variables X, Y, Z, where we want the distribution of W = X + Y + Z, we could do something like this:
hashmap_w = {}
for (kx, vx) in hashmap_x:
    for (ky, vy) in hashmap_y:
        for (kz, vz) in hashmap_z:
            k = kx + ky + kz
            v = vx * vy * vz
            if hashmap_w.contains_key(k):
                hashmap_w[k] += v
            else:
                hashmap_w[k] = v
How could we generalize this code to not only work for 3 random variables but for an arbitrary number?
Use dynamic programming.
sum_prob = {0: 1}
for hashmap in hashmaps:
    next_sum_prob = {}
    for kh, vh in hashmap.items():
        for ks, vs in sum_prob.items():
            k = kh + ks
            v = vh * vs
            if k in next_sum_prob:
                next_sum_prob[k] += v
            else:
                next_sum_prob[k] = v
    sum_prob = next_sum_prob
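As a quick check, wrapping the loop above in a function (the function name is mine) and feeding it three fair coins gives the expected binomial distribution:

def sum_distribution(hashmaps):
    sum_prob = {0: 1}
    for hashmap in hashmaps:
        next_sum_prob = {}
        for kh, vh in hashmap.items():
            for ks, vs in sum_prob.items():
                next_sum_prob[kh + ks] = next_sum_prob.get(kh + ks, 0) + vh * vs
        sum_prob = next_sum_prob
    return sum_prob

print(sum_distribution([{0: 0.5, 1: 0.5}] * 3))
# {0: 0.125, 1: 0.375, 2: 0.375, 3: 0.125}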

Best iterative way to calculate the fundamental matrix of an absorbing Markov Chain?

I have a very large absorbing Markov chain. I want to obtain the fundamental matrix of this chain to calculate the expected number of steps before absorption. From this question I know that this can be calculated by solving the equation
(I - Q) t = 1
(with 1 the all-ones vector), which can be done using the following Python code:
import numpy

def expected_steps_fast(Q):
    I = numpy.identity(Q.shape[0])
    o = numpy.ones(Q.shape[0])
    return numpy.linalg.solve(I - Q, o)
However, I would like to calculate it using some kind of iterative method similar to the power iteration method used to calculate PageRank. This method would allow me to calculate an approximation to the expected number of steps before absorption in a MapReduce-like system.
Does something similar exist?
If you have a sparse matrix, check if scipy.sparse.linalg.spsolve works. No guarantees about numerical robustness, but at least for trivial examples it's significantly faster than solving with dense matrices.
import networkx as nx
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def example(n):
    """Generate a very simple transition matrix from a directed graph."""
    g = nx.DiGraph()
    for i in xrange(n - 1):
        g.add_edge(i + 1, i)
        g.add_edge(i, i + 1)
    g.add_edge(n - 1, n)
    g.add_edge(n, n)
    m = nx.to_numpy_matrix(g)
    # normalize rows to ensure m is a valid right stochastic matrix
    m = m / np.sum(m, axis=1)
    return m

A = sp.csr_matrix(example(2000)[:-1, :-1])
Ad = np.array(A.todense())

def sparse_solve(Q):
    I = sp.identity(Q.shape[0], format='csr')
    o = np.ones(Q.shape[0])
    return spla.spsolve(I - Q, o)

def dense_solve(Q):
    I = np.identity(Q.shape[0])
    o = np.ones(Q.shape[0])
    return np.linalg.solve(I - Q, o)
Timings for sparse solution:
%timeit sparse_solve(A)
1000 loops, best of 3: 1.08 ms per loop
Timings for dense solution:
%timeit dense_solve(Ad)
1 loops, best of 3: 216 ms per loop
Like Tobias mentions in the comments, I would have expected other solvers to outperform the generic one, and they may for very large systems. For this toy example, the generic solve seems to work well enough.
I arrived at this answer thanks to @tobias-ribizel's suggestion of using the Neumann series. If we start from the equation
t = (I - Q)^-1 * 1
and expand the inverse with the Neumann series
(I - Q)^-1 = I + Q + Q^2 + Q^3 + ...
then, multiplying each term of the series by the vector 1, we can operate separately over each row of the matrix Q and approximate t successively with
t ≈ 1 + Q*1 + Q^2*1 + ... + Q^n*1
This is the python code I use to calculate this:
import numpy as np

def expected_steps_iterative(Q, n=10):
    N = Q.shape[0]
    acc = np.ones(N)
    r_k_1 = np.ones(N)
    for k in range(1, n):
        r_k = np.zeros(N)
        for i in range(N):
            for j in range(N):
                r_k[i] += r_k_1[j] * Q[i, j]
        if np.allclose(acc, acc + r_k, rtol=1e-8):
            acc += r_k
            break
        acc += r_k
        r_k_1 = r_k
    return acc
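The two inner loops compute a plain matrix-vector product, so the same iteration can be written more compactly with NumPy (a sketch of the same idea, assuming Q is a dense ndarray):

import numpy as np

def expected_steps_iterative_vec(Q, n=10, rtol=1e-8):
    acc = np.ones(Q.shape[0])
    r = np.ones(Q.shape[0])
    for _ in range(1, n):
        r = Q.dot(r)  # r_k = Q r_{k-1} = Q^k * 1
        if np.allclose(acc, acc + r, rtol=rtol):
            acc += r
            break
        acc += r
    return acc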
And this is the code using Spark. This code expects that Q is an RDD where each row is a tuple (row_id, dict of weights for that row of the matrix).
def expected_steps_spark(sc, Q, n=10):
    def dict2np(d, sz):
        vec = np.zeros(sz)
        for k, v in d.iteritems():
            vec[k] = v
        return vec
    sz = Q.count()
    acc = np.ones(sz)
    x = {i: 1.0 for i in range(sz)}
    for k in range(1, n):
        bc_x = sc.broadcast(x)
        x_old = x
        x = Q.map(lambda (u, ol): (u, reduce(lambda s, j: s + bc_x.value[j] * ol[j], ol, 0.0)))
        x = x.collectAsMap()
        v_old = dict2np(x_old, sz)
        v = dict2np(x, sz)
        acc += v
        if np.allclose(v, v_old, rtol=1e-8):
            break
    return acc

numpy: evaluating function in matrix, using previous array as argument in calculating the next

I have an m x n array a, where m > 1e6 and n <= 5.
I have functions F and G, which are composed like this: F( u, G ( u, t)). u is a 1 x n array, t is a scalar, and F and G returns 1 x n arrays.
I need to evaluate each row of a in F, and use previously evaluated row as the u-array for the next evaluation. I need to make m such evaluations.
This has to be really fast. I was previously impressed by scitools.std StringFunction evaluation for a whole array, but this problem requires using the previously calculated array as an argument in calculating the next. I don't know if StringFunction can do this.
For example:
a = zeros((1000000, 4))
a[0] = asarray([1., 69., 3., 4.1])
# A is a float defined elsewhere; h is a function which accepts a float as its argument
# and returns an arbitrary float. h is defined elsewhere.
def G(u, t):
    return asarray([u[0], u[1]*A, cos(u[2]), t*h(u[3])])
def F(u, t):
    return u + G(u, t)
dt = 1E-6
for i in range(1, 1000000):
    a[i] = F(a[i-1], i*dt)
The problem with the above code is that it is extremely slow. I need numpy to get these calculations done in milliseconds.
How can I do what I want?
Thank you for your time.
Kind regards,
Marius
This sort of thing is very difficult to do in numpy. If we look at it column by column, we see a few simpler solutions.
a[:,0] is very easy:
col0 = np.ones((1000))*2
col0[0] = 1 #Or whatever start value.
np.cumprod(col0, out=col0)
np.allclose(col0, a[:1000,0])
True
As mentioned earlier this will overflow very quickly. a[:,1] can be done much along the same lines.
I do not believe there is a way to do the next two columns inside numpy alone quickly. We can turn to numba for this:
from numba import autojit

def python_loop(start, count):
    out = np.zeros(count, dtype=np.double)
    out[0] = start
    for x in xrange(count - 1):
        out[x + 1] = out[x] + np.cos(out[x])
    return out

numba_loop = autojit(python_loop)
np.allclose(numba_loop(3, 1000), a[:1000, 2])
True
%timeit python_loop(3,1000000)
1 loops, best of 3: 4.14 s per loop
%timeit numba_loop(3,1000000)
1 loops, best of 3: 42.5 ms per loop
Although it's worth pointing out that this converges to pi/2 very, very quickly, and there is little point in calculating this recursion past ~20 values for any start value. This returns the exact same answer to double precision - I didn't bother finding the cutoff, but it is much less than 50:
%timeit tmp = np.empty(1000000); tmp[:50] = numba_loop(3, 50); tmp[50:] = np.pi / 2
100 loops, best of 3: 2.25 ms per loop
You can do something similar with the fourth column. Of course you can autojit all of the functions, but this gives you several different options to try out depending on numba usage:
Use cumprod for the first two columns
Use an approximation for column 3 (and possibly 4) where only the first few iterations are calculated
Implement columns 3 and 4 in numba using autojit
Wrap everything inside of an autojit loop (the best option)
The way you have presented this, all rows past ~200 will be either np.inf or np.pi/2. Exploit this.
Slightly faster. Your first column is basically 2^n. Calculating 2^n for n up to 1,000,000 is going to overflow; the second column is even worse.
def calc(arr, t0=1E-6):
    u = arr[0]
    dt = 1E-6
    h = lambda x: np.random.random(1) * 50.0

    def firstColGen(uStart):
        u = uStart
        while True:
            u += u
            yield u

    def secondColGen(uStart, A):
        u = uStart
        while True:
            u += u * A
            yield u

    def thirdColGen(uStart):
        u = uStart
        while True:
            u += np.cos(u)
            yield u

    def fourthColGen(uStart, h, t0, dt):
        u = uStart
        t = t0
        while True:
            u += h(u) * dt
            t += dt
            yield u

    first = firstColGen(u[0])
    second = secondColGen(u[1], A)
    third = thirdColGen(u[2])
    fourth = fourthColGen(u[3], h, t0, dt)
    for i in xrange(1, len(arr)):
        arr[i] = [first.next(), second.next(), third.next(), fourth.next()]

Generate Random(a, b) making calls to Random(0, 1)

There is a known Random(0,1) function; it is a uniform random function, which means it returns 0 or 1, each with probability 50%. Implement Random(a, b) that only makes calls to Random(0,1).
What I have thought of so far is to put the range a..b in a 0-based array, so that I have indices 0, 1, 2, ..., b-a,
then call RANDOM(0,1) b-a times, sum the results as the generated index, and return that element.
However, since there is no answer in the book, I don't know if this way is correct or the best. How can I prove that the probability of returning each element is exactly the same, namely 1/(b-a+1)?
And what is the right/better way to do this?
If your RANDOM(0, 1) returns either 0 or 1, each with probability 0.5, then you can generate bits until you have enough to represent the number (b-a+1) in binary. This gives you a random number in a slightly too large range: you can test and repeat if it fails. Something like this (in Python):
import math

def rand_pow2(bit_count):
    """Return a random number with the given number of bits."""
    result = 0
    for i in xrange(bit_count):
        result = 2 * result + RANDOM(0, 1)
    return result

def random_range(a, b):
    """Return a random integer in the closed interval [a, b]."""
    bit_count = int(math.ceil(math.log(b - a + 1, 2)))  # math.log2 in Python 3
    while True:
        r = rand_pow2(bit_count)
        if a + r <= b:
            return a + r
When you sum random numbers, the result is no longer evenly distributed - it looks like a Gaussian function. Look up the central limit theorem or read any probability book / article. Just like flipping coins 100 times is highly unlikely to give 100 heads; it's likely to give close to 50 heads and 50 tails.
Your inclination to map the range onto 0 to b-a first is correct. However, you cannot do it as you stated. This question asks exactly how to do that, and the answer relies on the uniqueness of the base-2 expansion. Write m = b-a+1 in base 2, keeping track of the largest needed exponent, say e. Then find the largest k such that k*m does not exceed 2^e. Finally, generate e numbers with RANDOM(0,1) and take them as the base-2 expansion of some number x; if x < k*m, return x modulo m, otherwise try again. The program looks something like this:
int random_0_to_m(int m) {
    // returns a uniform value in {0, ..., m-1}
    // find the smallest e such that m <= 2^e
    int e = 0;
    while (m > (1 << e)) {
        ++e;
    }
    // find the largest k such that k*m <= 2^e (also handles m being a power of two)
    int k = 1;
    while (k * m <= (1 << e)) {
        ++k;
    }
    --k;  // we went one too far
    while (1) {
        // generate a random e-bit number in base 2
        int x = 0;
        for (int i = 0; i < e; ++i) {
            x = x * 2 + RANDOM(0, 1);
        }
        // if x isn't too large, return x modulo m
        if (x < m * k)
            return (x % m);
    }
}
Now you can simply add a to the result to get uniformly distributed numbers between a and b.
Divide and conquer can help us generate a random number in the range [a,b] using Random(0,1). The idea is:
If a is equal to b, then the random number is a
Find the mid-point of the range [a,b]
Generate Random(0,1)
If the above is 0, return a random number in the range [a,mid] using recursion
else return a random number in the range [mid+1, b] using recursion
The working 'C' code is as follows.
int random(int a, int b)
{
    if (a == b)
        return a;
    int c = RANDOM(0, 1);  // returns 0 or 1 with probability 0.5
    int mid = a + (b - a) / 2;
    if (c == 0)
        return random(a, mid);
    else
        return random(mid + 1, b);
}
If you have an RNG that returns {0, 1} with equal probability, you can easily create an RNG that returns numbers in {0, ..., 2^n - 1} with equal probability.
To do this you just use your original RNG n times and read the results as a binary number like 0010110111. Each of the numbers from 0 to 2^n - 1 is equally likely.
Now it is easy to get an RNG from a to b when b - a + 1 = 2^n: you just create the previous RNG and add a to the result.
Now the last question is what you should do if b - a + 1 is not a power of two.
The good thing is that you have to do almost nothing, relying on the rejection sampling technique: if you have an RNG over a big set and you need to select an element from a subset of it, you can keep drawing elements from the big set and discarding them until they land in your subset.
So all you do is find the first n such that b - a + 1 <= 2^n, then use rejection sampling until you pick an element smaller than b - a + 1, and add a to it.
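A minimal Python sketch of this recipe (RANDOM(0,1) is emulated here with random.getrandbits(1), purely as a stand-in for the given primitive):

import random

def RANDOM01():
    return random.getrandbits(1)  # stand-in for the RANDOM(0, 1) primitive

def rand_range_bits(a, b):
    m = b - a + 1                 # number of values in [a, b]
    n = 0
    while (1 << n) < m:           # first n with m <= 2^n
        n += 1
    while True:
        x = 0
        for _ in range(n):        # n fair bits -> uniform on {0, ..., 2^n - 1}
            x = (x << 1) | RANDOM01()
        if x < m:                 # rejection step: discard values outside the range
            return a + x

print(rand_range_bits(3, 12))     # uniform over {3, ..., 12}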

Randomly Generate a set of numbers of n length totaling x

I'm working on a project for fun and I need an algorithm to do as follows:
Generate a list of numbers of Length n which add up to x
I would settle for list of integers, but ideally, I would like to be left with a set of floating point numbers.
I would be very surprised if this problem wasn't heavily studied, but I'm not sure what to look for.
I've tackled similar problems in the past, but this one is decidedly different in nature. Previously I've generated different combinations of a list of numbers that add up to x. I'm sure that I could simply brute-force this problem, but that hardly seems like the ideal solution.
Anyone have any idea what this may be called, or how to approach it? Thanks all!
Edit: To clarify, I mean that the list should be length N while the numbers themselves can be of any size.
Edit 2: Sorry for my improper use of 'set'; I was using it as a catch-all term for a list or an array. I understand that it was causing confusion, my apologies.
This is how to do it in Python:
import random

def random_values_with_prescribed_sum(n, total):
    x = [random.random() for i in range(n)]
    k = total / sum(x)
    return [v * k for v in x]
Basically you pick n random numbers, compute their sum and compute a scale factor so that the sum will be what you want it to be.
Note that this approach will not produce "uniform" slices, i.e. the distribution you will get will tend to be more "egalitarian" than it would be if it was picked at random among all distributions with the given sum.
To see the reason you can just picture what the algorithm does in the case of two numbers with a prescribed sum (e.g. 1):
The point P is a generic point obtained by picking two random numbers; it is uniform inside the square [0,1]x[0,1]. The point Q is the point obtained by scaling P so that the sum is 1, i.e. Q lies on the segment from (0,1) to (1,0). Points of that segment close to its center have a higher probability: for example, the exact center (0.5, 0.5) is obtained by projecting any point of the diagonal (0,0)-(1,1), while the point (0,1) is obtained by projecting only points of the side (0,0)-(0,1)... and the diagonal length is sqrt(2) = 1.4142..., while the side is only 1.0.
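A quick Monte Carlo check of this bias (an illustrative sketch, not part of the original answer): scale pairs of uniforms to sum 1 and see how often the first share falls near the middle.

import random

N = 100000
near_middle = 0
for _ in range(N):
    x, y = random.random(), random.random()
    share = x / (x + y)  # first coordinate of the scaled point Q
    if 0.4 < share < 0.6:
        near_middle += 1
print(near_middle / float(N))  # about 1/3, not the 0.2 a uniform split would give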
Actually, you need to generate a partition of x into n parts. This is usually done in the following way: a partition of x into n non-negative parts can be represented by reserving n + x - 1 free places, putting n - 1 borders at arbitrary positions, and stones in the rest. The stone groups between consecutive borders add up to x, thus the number of possible partitions is the binomial coefficient C(n + x - 1, n - 1).
So your algorithm could be as follows: choose an arbitrary (n-1)-subset of an (n + x - 1)-set; it determines uniquely a partition of x into n parts.
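A minimal sketch of this construction (assuming x is an integer; the function name is mine):

import random

def random_partition(x, n):
    # pick n - 1 border positions among the n + x - 1 places and read off
    # the gaps between consecutive borders as the n non-negative parts
    borders = sorted(random.sample(range(n + x - 1), n - 1))
    extended = [-1] + borders + [n + x - 1]
    return [extended[i + 1] - extended[i] - 1 for i in range(n)]

print(random_partition(10, 4))  # e.g. [3, 0, 5, 2]; the parts always sum to 10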
In Knuth's TAOCP, section 3.4.2 discusses random sampling; see Algorithm S there.
Algorithm S (choose n records at random from a total of N):
1. Set t = 0, m = 0.
2. Generate u, uniformly distributed on (0, 1).
3. If (N - t)*u >= n - m, skip the t-th record and increase t by 1; otherwise include the t-th record in the sample and increase both m and t by 1.
4. If m < n, return to step 2; otherwise the algorithm is finished.
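A minimal Python sketch of Algorithm S (variable names follow the steps above):

import random

def algorithm_s(n, N):
    # choose n of the records 0, 1, ..., N-1; every n-subset is equally likely
    sample = []
    m = 0                        # records selected so far
    for t in range(N):           # records examined so far
        u = random.random()
        if (N - t) * u < n - m:  # select record t with probability (n - m) / (N - t)
            sample.append(t)
            m += 1
            if m == n:
                break
    return sample

print(algorithm_s(3, 10))  # e.g. [1, 4, 8]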
The solution for non-integers is algorithmically trivial: you just select n arbitrary numbers that don't sum to 0, normalize them by their sum, and multiply by x.
If you want to sample uniformly in the region of N-1-dimensional space defined by x1 + x2 + ... + xN = x, then you're looking at a special case of sampling from a Dirichlet distribution. The sampling procedure is a little more involved than generating uniform deviates for the xi. Here's one way to do it, in Python:
import random

xs = [random.gammavariate(1, 1) for a in range(N)]
xs = [x * v / sum(xs) for v in xs]
If you don't care too much about the sampling properties of your results, you can just generate uniform deviates and correct their sum afterwards.
Here is a version of the above algorithm in JavaScript:
function getRandomArbitrary(min, max) {
    return Math.random() * (max - min) + min;
};

function getRandomArray(min, max, n) {
    var arr = [];
    for (var i = 0, l = n; i < l; i++) {
        arr.push(getRandomArbitrary(min, max));
    };
    return arr;
};

function randomValuesPrescribedSum(min, max, n, total) {
    var arr = getRandomArray(min, max, n);
    var sum = arr.reduce(function(pv, cv) { return pv + cv; }, 0);
    var k = total / sum;
    var delays = arr.map(function(x) { return k * x; });
    return delays;
};
You can call it with
var myarray = randomValuesPrescribedSum(0,1,3,3);
And then check it with
var sum = myarray.reduce(function(pv, cv) { return pv + cv;},0);
This code does a reasonable job. I think it produces a different distribution than 6502's answer, but I am not sure which is better or more natural. Certainly his code is clearer/nicer.
import random

def parts(total_sum, num_parts):
    points = [random.random() for i in range(num_parts - 1)]
    points.append(0)
    points.append(1)
    points.sort()
    ret = []
    for i in range(1, len(points)):
        ret.append((points[i] - points[i - 1]) * total_sum)
    return ret

def test(total_sum, num_parts):
    ans = parts(total_sum, num_parts)
    assert abs(sum(ans) - total_sum) < 1e-7
    print ans

test(5.5, 3)
test(10, 1)
test(10, 5)
In Python:
a: create a list of (random numbers from 0 to 1) times total; append 0 and total to the list
b: sort the list, and measure the distance between adjacent elements
c: round the list elements
import random
import time

TOTAL = 15
PARTS = 4
PLACES = 3

def random_sum_split(parts, total, places):
    a = [0, total] + [random.random() * total for i in range(parts - 1)]
    a.sort()
    b = [(a[i] - a[i - 1]) for i in range(1, (parts + 1))]
    if places == None:
        return b
    else:
        b.pop()
        c = [round(x, places) for x in b]
        c.append(round(total - sum(c), places))
        return c

def tick():
    # info and log come from the author's environment
    if info.tick == 1:
        start = time.time()
        alpha = random_sum_split(PARTS, TOTAL, PLACES)
        end = time.time()
        log('alpha: %s' % alpha)
        log('total: %.7f' % sum(alpha))
        log('parts: %s' % PARTS)
        log('places: %s' % PLACES)
        log('elapsed: %.7f' % (end - start))
yields:
[2014-06-13 01:00:00] alpha: [0.154, 3.617, 6.075, 5.154]
[2014-06-13 01:00:00] total: 15.0000000
[2014-06-13 01:00:00] parts: 4
[2014-06-13 01:00:00] places: 3
[2014-06-13 01:00:00] elapsed: 0.0005839
To the best of my knowledge, this distribution is uniform.
