Understanding the Spark correlation algorithm

I was reading the Spark correlation algorithm source code and, while going through it, I couldn't understand this particular piece of code.
This is from the file org/apache/spark/mllib/linalg/BLAS.scala:
def spr(alpha: Double, v: Vector, U: Array[Double]): Unit = {
  val n = v.size
  v match {
    case DenseVector(values) =>
      NativeBLAS.dspr("U", n, alpha, values, 1, U)
    case SparseVector(size, indices, values) =>
      val nnz = indices.length
      var colStartIdx = 0
      var prevCol = 0
      var col = 0
      var j = 0
      var i = 0
      var av = 0.0
      while (j < nnz) {
        col = indices(j)
        // Skip empty columns.
        colStartIdx += (col - prevCol) * (col + prevCol + 1) / 2
        av = alpha * values(j)
        i = 0
        while (i <= j) {
          U(colStartIdx + indices(i)) += av * values(i)
          i += 1
        }
        j += 1
        prevCol = col
      }
  }
}
I do not know Scala, and that could be the reason I cannot understand it. Can someone explain what is happening here?
It is being called from RowMatrix.scala:
def computeGramianMatrix(): Matrix = {
  val n = numCols().toInt
  checkNumColumns(n)
  // Computes n*(n+1)/2, avoiding overflow in the multiplication.
  // This succeeds when n <= 65535, which is checked above
  val nt = if (n % 2 == 0) ((n / 2) * (n + 1)) else (n * ((n + 1) / 2))
  // Compute the upper triangular part of the gram matrix.
  val GU = rows.treeAggregate(new BDV[Double](nt))(
    seqOp = (U, v) => {
      BLAS.spr(1.0, v, U.data)
      U
    }, combOp = (U1, U2) => U1 += U2)
  RowMatrix.triuToFull(n, GU.data)
}
The correlation is defined here:
https://en.wikipedia.org/wiki/Pearson_correlation_coefficient
The final goal is to understand the Spark correlation algorithm.
Update 1: Relevant paper: https://stanford.edu/~rezab/papers/linalg.pdf
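On the final goal: once the Gramian G = A^T A is available, the sample covariance and the Pearson correlation follow from standard algebra, which, as far as I can tell, is the same algebra Spark applies to the covariance matrix in its Pearson correlation code. A plain NumPy sketch (not Spark code):

import numpy as np

rng = np.random.default_rng(0)
A = rng.random((100, 3))
m = A.shape[0]  # number of rows (samples)

G = A.T @ A  # what computeGramianMatrix returns (as a full matrix)
mean = A.mean(axis=0)
cov = (G - m * np.outer(mean, mean)) / (m - 1)  # sample covariance from the Gramian
d = np.sqrt(np.diag(cov))
corr = cov / np.outer(d, d)  # Pearson correlation matrix

assert np.allclose(corr, np.corrcoef(A, rowvar=False))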

Related

Reverse the isometric projection algorithm

I've got this code:
const a = 2; // always > 0 and known in advance
const b = 3; // always > 0 and known in advance
const c = 4; // always > 0 and known in advance
for (let x = 0; x <= a; x++) {
  for (let y = 0; y <= b; y++) {
    for (let z = 0; z <= c; z++) {
      for (let p = 0; p <= 1; p++) {
        for (let q = 0; q <= 2; q++) {
          let u = b + x - y + p;
          let v = a + b + 2 * c - x - y - 2 * z + q;
          let w = c + x + y - z;
        }
      }
    }
  }
}
The code generates (a+1)*(b+1)*(c+1)*2*3 triplets of (u, v, w), each of which is unique. Because of that, I think it should be possible to write a reversed version of this algorithm that calculates x, y, z, p, q based on u, v, w. I understand that there are only 3 equations and 5 variables to recover, but the known bounds for x, y, z, p, q and the fact that all variables are integers should probably help.
for (let u = ?; u <= ?; u++) {
  for (let v = ?; v <= ?; v++) {
    for (let w = ?; w <= ?; w++) {
      x = ?;
      y = ?;
      z = ?;
      p = ?;
      q = ?;
    }
  }
}
I even managed to produce the first line, for (let u = 0; u <= a + b + 1; u++), by taking the equation for u and finding its min and max, but I'm struggling to move forward. I understand that the min and max values for v depend on u, but I can't figure out the formulas.
Examples are in JS, but I will be thankful for any help in any programming language or even plain math formulas.
If anyone is interested in what this code is actually about: it projects a voxel 3D model to triangles on a plane. u, v are the resulting 2D coordinates and w is the distance from the camera. The reversed algorithm will actually be a kind of raytracing.
UPDATE: Using line equations from 2 points, I managed to create min/max conditions for v, and the code now looks like this:
for (let u = 0; u <= a + b + 1; u++) {
  let minv = u <= a ? a - u : -a + u - 1;
  let maxv = u <= b ? a + 2 * c + u + 2 : a + 2 * b + 2 * c - u + 3;
  for (let v = minv; v <= maxv; v++) {
    ...
  }
}
I think I know what to do with x, y, z, p, q in the last step, so the remaining problem is minw and maxw. As far as I understand, those values should depend on both u and v, and I must use plane equations?
If the triplets are really unique (I didn't check that) and if p and q always go up to 1 and 2 (respectively), then you can "group" triplets together and go up the loop chain.
We'll first find the 3 triplets that were made in the same "q loop": the triplets made with the same x, y, z, p. As only q changes, the only difference will be in v, and the values will be 3 consecutive numbers.
For that, let's group triplets such that, within a group, all triplets have the same u and the same w. Then we sort the triplets in each group by their v parameter and group them 3 by 3. Inside each group of 3 it's easy to assign the correct q value to each triplet.
Then reduce each group of 3 to its first triplet (the one with q == 0). We start over to assign the p variable: group all triplets that have the same v and w. Then sort them by their u value and group them 2 by 2. This lets us find their p value. Remember that each triplet in the original group of 3 (before reducing) has that same p value.
Then, for each triplet, we have found p and q. We solve the 3 equations for x, y, z:
z = -((v + w) - a - b - 3c - q) / 3
y = (w - u + z + b - c + p) / 2
x = u + y - b - p
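Assuming p and q have already been recovered by the grouping step above, the back-substitution is straightforward. A small Python sketch (the helper name invert_uvw is mine, not from the original code), checked against the forward loop:

def invert_uvw(u, v, w, p, q, a, b, c):
    # Recover x, y, z from u, v, w once p and q are known,
    # using the three equations of the forward loop.
    z = (a + b + 3 * c + q - (v + w)) // 3
    y = (w - u + z + b - c + p) // 2
    x = u + y - b - p
    return x, y, z

a, b, c = 2, 3, 4
for x in range(a + 1):
    for y in range(b + 1):
        for z in range(c + 1):
            for p in range(2):
                for q in range(3):
                    u = b + x - y + p
                    v = a + b + 2 * c - x - y - 2 * z + q
                    w = c + x + y - z
                    assert invert_uvw(u, v, w, p, q, a, b, c) == (x, y, z)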
After spending some time with articles on geometry, and with huge help from Wolfram Alpha, I managed to write the needed equations myself. And yes, I had to use plane equations.
const a = 2; // always > 0 and known in advance
const b = 3; // always > 0 and known in advance
const c = 4; // always > 0 and known in advance
const minu = 0;
const maxu = a + b + 1;
let minv, maxv, minw, maxw;
let x, y, z, p, q;
for (let u = minu; u <= maxu; u++) {
  if (u <= a) {
    minv = a - u;
  } else {
    minv = -a + u - 1;
  }
  if (u <= b) {
    maxv = a + 2 * c + u + 2;
  } else {
    maxv = a + 2 * b + 2 * c - u + 3;
  }
  for (let v = minv; v <= maxv; v++) {
    if (u <= b && v >= a + u + 1) {
      minw = (-a + 2 * b - 3 * u + v - 2) / 2;
    } else if (u > b && v >= a + 2 * b - u + 2) {
      minw = (-a - 4 * b + 3 * u + v - 5) / 2;
    } else {
      minw = a + b - v;
    }
    if (u <= a && v <= a + 2 * c - u + 1) {
      maxw = (-a + 2 * b + 3 * u + v - 1) / 2;
    } else if (u > a && v <= -a + 2 * c + u) {
      maxw = (5 * a + 2 * b - 3 * u + v + 2) / 2;
    } else {
      maxw = a + b + 3 * c - v + 2;
    }
    minw = Math.round(minw);
    maxw = Math.round(maxw);
    for (let w = minw; w <= maxw; w++) {
      z = (a + b + 3 * c - v - w + 2) / 3;
      q = Math.round(2 - (z % 1) * 3);
      z = Math.floor(z);
      y = (a + 4 * b + q - 3 * u - v + 2 * w + 3) / 6;
      p = 1 - (y % 1) * 2;
      y = Math.floor(y);
      x = (a - 2 * b - 3 * p + q + 3 * u - v + 2 * w) / 6;
      x = Math.round(x);
    }
  }
}
This code passes my tests, but if someone can create a better solution, I would be very interested.

How to build a list of vectors taken from another list that satisfies a constraint

It's my first try with z3.
I want to find which vectors taken from a list I have to sum to get a given result.
I've tried this, but it doesn't work because R isn't an index:
Tr_tuple = ((-1,1,0,1,0,0,0,-1),
            (1,-1,1,0,0,0,-1,0),
            (0,-1,-1,1,0,1,0,0),
            (-1,0,1,-1,0,0,0,0),
            (0,0,0,-1,-1,1,0,1),
            (0,0,-1,0,1,-1,1,0),
            (0,1,0,0,0,-1,-1,1),
            (1,0,0,0,-1,0,1,-1),
            (1,1,-1,-1,1,1,-1,-1),
            (-1,-1,1,1,-1,-1,1,1))
Start_tuple = (1,-1,0,-1,0,0,0,1)
depth = 2
G = [Int('g_%s' % i) for i in range(8)]
R = [Int('r_%s' % i) for i in range(depth)]
R_c = [And(R[i] >= 0, R[i] < 10) for i in range(depth)]
G_c = [G[i] == Start_tuple[i] + sum([Tr_tuple[j][i] for j in R]) for i in range(8)]
G_g = [G[i] == 0 for i in range(8)]
I found something, but it's like brute force:
M = [[Int('m_%s_%s' % (j, i)) for i in range(8)] for j in range(depth)]
T = [Int('t_%s' % i) for i in range(8)]
M_c = And([
    Or([
        And([M[j][i] == Tr_tuple[k][i] for i in range(8)])
        for k in range(10)])
    for j in range(depth)])
G_c = (And([Start_tuple[i] + sum([M[j][i] for j in range(depth)]) == 0 for i in range(8)]))
s = Solver()
s.add(M_c)
s.add(G_c)
if s.check() == sat:
    pp(s.model())
else:
    print('No solution found')
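One common z3py workaround for "R isn't an index" is to build the lookup out of a chain of If expressions, so the solver itself decides which row of Tr_tuple each R[j] refers to. A sketch along those lines, reusing the Tr_tuple, Start_tuple and depth definitions above (the helper name select_row is mine):

from z3 import Int, If, And, Sum, Solver, sat

def select_row(rows, idx, col):
    # Symbolic lookup rows[idx][col] for a z3 Int idx: an If-chain over the
    # concrete row indices. An out-of-range idx falls through to row 0.
    expr = rows[0][col]
    for k in range(1, len(rows)):
        expr = If(idx == k, rows[k][col], expr)
    return expr

R = [Int('r_%s' % j) for j in range(depth)]
R_c = [And(R[j] >= 0, R[j] < len(Tr_tuple)) for j in range(depth)]
G_c = [Start_tuple[i] + Sum([select_row(Tr_tuple, R[j], i) for j in range(depth)]) == 0
       for i in range(8)]

s = Solver()
s.add(R_c + G_c)
if s.check() == sat:
    print(s.model())
else:
    print('No solution found')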

Logical matrix: how to efficiently find a row/column with all true values

I'm trying to find an efficient solution for the following riddle:
I have a logical matrix of size (n * n) filled with false values.
I need to create a function that takes zero or one as an argument. It shifts all the values in the matrix one step to the left (meaning the first element of the first row is deleted and the new bit becomes the last element of the last row) and returns true if any row or column of the matrix contains only ones.
No limitation on the data structure.
My naive solution in JavaScript:
const next = (bit, matrix) => {
  matrix.shift()
  matrix.push(bit);
  const matrix_size = Math.sqrt(matrix.length);
  let col_sum = 0;
  let row_sum = 0;
  for (let i = 0; i < matrix.length; ++i) {
    col_sum = matrix[i];
    row_sum += matrix[i];
    if ((i + 1) % matrix_size === 0) {
      if (row_sum === matrix_size) return true;
      row_sum = 0;
    }
    for (let j = i + matrix_size; j < (i + ((matrix_size * matrix_size) - 1)); j += matrix_size) {
      col_sum += matrix[j];
    }
    if (col_sum === matrix_size) return true;
  }
  return false;
}
I used a 1D array as the data structure, but it doesn't really help me reduce the time complexity.
Would love to hear some ideas :)
Let's think about the following example matrix:
[0, 0, 0, 0,
0, 0, 0, 0,
0, 0, 1, 1,
1, 1, 1, 1]
and push zero 16 times.
Then False, True, True, True, False, True, True, True, False, True, True, True, False, False, False and False will be obtained.
There is a cyclic behavior (False, True, True, True).
If the length of each run of consecutive ones were fixed, it would not be necessary to recalculate everything on each update.
When the matrix is updated, the lengths of the runs of consecutive ones at the top-left and bottom-right can change, so the cyclic bookkeeping may need to be updated.
By maintaining the runs of consecutive ones, and the total count of cyclic matches contributed by those runs, the complexity for the rows will be O(1).
For the columns, instead of shifting and pushing, let matrix[cur] = bit and cur = (cur+1) % (matrix_size*matrix_size), so that cur represents the actual upper-left of the matrix.
By maintaining col_sum for each column, and the total count of columns satisfying the all-ones condition, the complexity will be O(1).
class Matrix:
    def __init__(self, n):
        self.mat = [0] * (n*n)
        self.seq_len = [0] * (n*n)
        self.col_total = [0] * n
        self.col_archive = 0
        self.row_cycle_cnt = [0] * n
        self.cur = 0
        self.continued_one = 0
        self.n = n

    def update(self, bit):
        prev_bit = self.mat[self.cur]
        self.mat[self.cur] = bit
        # update col total
        col = self.cur % self.n
        if self.col_total[col] == self.n:
            self.col_archive -= 1
        self.col_total[col] += bit - prev_bit
        if self.col_total[col] == self.n:
            self.col_archive += 1
        # update row index
        # process shift out
        if prev_bit == 1:
            prev_len = self.seq_len[self.cur]
            if prev_len > 1:
                self.seq_len[(self.cur + 1) % (self.n * self.n)] = prev_len - 1
            if self.n <= prev_len and prev_len < self.n*2:
                self.row_cycle_cnt[self.cur % self.n] -= 1
        # process new bit
        if bit == 0:
            self.continued_one = 0
        else:
            self.continued_one = min(self.continued_one + 1, self.n*self.n)
        # write the length of continued_one at the head of sequence
        self.seq_len[self.cur+1 - self.continued_one] = self.continued_one
        if self.n <= self.continued_one and self.continued_one < self.n*2:
            self.row_cycle_cnt[(self.cur+1) % self.n] += 1
        # update cursor
        self.cur = (self.cur + 1) % (self.n * self.n)
        return (self.col_archive > 0) or (self.row_cycle_cnt[self.cur % self.n] > 0)

    def check2(self):
        for y in range(self.n):
            cnt = 0
            for x in range(self.n):
                cnt += self.mat[(self.cur + y*self.n + x) % (self.n*self.n)]
            if cnt == self.n:
                return True
        for x in range(self.n):
            cnt = 0
            for y in range(self.n):
                cnt += self.mat[(self.cur + y*self.n + x) % (self.n*self.n)]
            if cnt == self.n:
                return True
        return False

if __name__ == "__main__":
    import random
    random.seed(123)
    m = Matrix(4)
    for i in range(100000):
        ans1 = m.update(random.randint(0, 1))
        ans2 = m.check2()
        assert(ans1 == ans2)
        print("epoch:{} mat={} ans={}".format(i, m.mat[m.cur:] + m.mat[:m.cur], ans1))

How to find ith item in zigzag ordering?

A question last week defined the zig zag ordering on an n by m matrix and asked how to list the elements in that order.
My question is how to quickly find the ith item in the zigzag ordering? That is, without traversing the matrix (for large n and m that's much too slow).
For example with n=m=8 as in the picture and (x, y) describing (row, column)
f(0) = (0, 0)
f(1) = (0, 1)
f(2) = (1, 0)
f(3) = (2, 0)
f(4) = (1, 1)
...
f(63) = (7, 7)
Specific question: what is the ten billionth (1e10) item in the zigzag ordering of a million by million matrix?
Let's assume that the desired element is located in the upper half of the matrix. The lengths of the diagonals are 1, 2, 3, ..., n.
Let's find the desired diagonal. It satisfies the following property:
sum(1, 2, ..., k) >= pos but sum(1, 2, ..., k - 1) < pos. The sum of 1, 2, ..., k is k * (k + 1) / 2. So we just need to find the smallest integer k such that k * (k + 1) / 2 >= pos. We can either use a binary search or solve this quadratic inequality explicitly.
When we know k, we just need to find the pos - (k - 1) * k / 2 element of this diagonal. We know where it starts and in which direction we should move (up or down, depending on the parity of k), so we can find the desired cell using a simple formula.
This solution has an O(1) or an O(log n) time complexity (it depends on whether we use a binary search or solve the inequality explicitly in step 2).
If the desired element is located in the lower half of the matrix, we can solve the problem for pos' = n * n - pos + 1 and then use symmetry to get the solution to the original problem.
I used 1-based indexing in this solution; using 0-based indexing might require adding +1 or -1 somewhere, but the idea of the solution is the same.
If the matrix is rectangular rather than square, we need to take into account that the lengths of the diagonals look like this: 1, 2, 3, ..., m, m, m, ..., m, m - 1, ..., 1 (if m <= n) when we search for k, so the sum becomes k * (k + 1) / 2 if k <= m and m * (m + 1) / 2 + m * (k - m) otherwise.
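A short Python sketch of the square-matrix case described above (1-based pos, 0-based coordinates; the diagonal directions happen to match the ordering in the question, but may need flipping for a different convention; the function name zigzag_cell is mine):

import math

def zigzag_cell(pos, n):
    # pos is 1-based; returns a 0-based (row, col) in an n x n matrix.
    half = n * (n + 1) // 2
    if pos > half:  # lower half: mirror the problem
        r, c = zigzag_cell(n * n - pos + 1, n)
        return n - 1 - r, n - 1 - c
    # smallest k with k*(k+1)/2 >= pos: quadratic estimate, then exact adjustment
    k = int((math.sqrt(1 + 8 * pos) - 1) / 2)
    while k * (k + 1) // 2 < pos:
        k += 1
    while (k - 1) * k // 2 >= pos:
        k -= 1
    offset = pos - (k - 1) * k // 2 - 1  # 0-based position on diagonal k
    if k % 2 == 0:  # even diagonals run downwards in this convention
        return offset, k - 1 - offset
    return k - 1 - offset, offset

# matches the example: f(0)=(0,0), f(1)=(0,1), f(2)=(1,0), f(3)=(2,0), f(4)=(1,1)
print([zigzag_cell(i + 1, 8) for i in range(5)])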
import math, random

def naive(n, m, ord, swap = False):
    dx = 1
    dy = -1
    if swap:
        dx, dy = dy, dx
    cur = [0, 0]
    for i in range(ord):
        cur[0] += dy
        cur[1] += dx
        if cur[0] < 0 or cur[1] < 0 or cur[0] >= n or cur[1] >= m:
            dx, dy = dy, dx
            if cur[0] >= n:
                cur[0] = n - 1
                cur[1] += 2
            if cur[1] >= m:
                cur[1] = m - 1
                cur[0] += 2
            if cur[0] < 0: cur[0] = 0
            if cur[1] < 0: cur[1] = 0
    return cur

def fast(n, m, ord, swap = False):
    if n < m:
        x, y = fast(m, n, ord, not swap)
        return [y, x]
    alt = n * m - ord - 1
    if alt < ord:
        x, y = fast(n, m, alt, swap if (n + m) % 2 == 0 else not swap)
        return [n - x - 1, m - y - 1]
    if ord < (m * (m + 1) / 2):
        diag = int((-1 + math.sqrt(1 + 8 * ord)) / 2)
        parity = (diag + (0 if swap else 1)) % 2
        within = ord - (diag * (diag + 1) / 2)
        if parity: return [diag - within, within]
        else: return [within, diag - within]
    else:
        ord -= (m * (m + 1) / 2)
        diag = int(ord / m)
        within = ord - diag * m
        diag += m
        parity = (diag + (0 if swap else 1)) % 2
        if not parity:
            within = m - within - 1
        return [diag - within, within]

if __name__ == "__main__":
    for i in range(1000):
        n = random.randint(3, 100)
        m = random.randint(3, 100)
        ord = random.randint(0, n * m - 1)
        swap = random.randint(0, 99) < 50
        na = naive(n, m, ord, swap)
        fa = fast(n, m, ord, swap)
        assert na == fa, "(%d, %d, %d, %s) ==> (%s), (%s)" % (n, m, ord, swap, na, fa)
    print fast(1000000, 1000000, 9999999999, False)
    print fast(1000000, 1000000, 10000000000, False)
So the 10-billionth element (the one with ordinal 9999999999), and the 10-billion-first element (the one with ordinal 10^10) are:
[20331, 121089]
[20330, 121090]
An analytical solution
In the general case, your matrix will be divided in 3 areas:
an initial triangle t1
a skewed part mid where diagonals have a constant length
a final triangle t2
Let's call p the index of your diagonal run.
We want to define two functions x(p) and y(p) that give you the column and row of the pth cell.
Initial triangle
Let's look at the initial triangular part t1, where each new diagonal is one unit longer than the preceding.
Now let's call d the index of the diagonal that holds the cell, di the length of diagonal i, and
Sp = sum(di) for i in [0..d-1]
We have p = Sp + k, with 0 <=k <= d and
Sp = d(d+1)/2
if we solve for d, it brings
d²+d-2p = 0, a quadratic equation where we retain only the positive root:
d = (-1+sqrt(1+8*p))/2
Now we want the largest integer value not exceeding d, which is floor(d).
In the end, we have
p = d + k with d = floor((-1+sqrt(1+8*p))/2) and k = p - d(d+1)/2
Let's call
o(d) the function that equals 1 if d is odd and 0 otherwise, and
e(d) the function that equals 1 if d is even and 0 otherwise.
We can compute x(p) and y(p) like so:
d = floor((-1+sqrt(1+8*p))/2)
k = p - d(d+1)/2
o = d % 2
e = 1 - o
x = e*d + (o-e)*k
y = o*d + (e-o)*k
The even and odd functions are used to try to salvage some clarity, but you can replace
e(p) with 1 - o(p) and get slightly more efficient but less symmetric formulas for x and y.
Middle part
Let's consider the smallest matrix dimension s, i.e. s = min(m, n).
The previous formulas hold until x or y (whichever comes first) reaches the value s.
The upper bound of p such as x(i) <= s and y(i) <= s for all i in [0..p]
(i.e. the cell indexed by p is inside the initial triangle t1) is given by
pt1 = s(s+1)/2.
For p >= pt1, diagonal length remains equal to s until we reach the second triangle t2.
when inside mid, we have:
p = s(s+1)/2 + ds + k with k in [0..s[.
which yields:
d = floor ((p - s(s+1)/2)/s)
k = p - ds
We can then use the same even/odd trick to compute x(p) and y(p):
p -= s(s+1)/2
d = floor (p / s)
k = p - d*s
o = (d+s) % 2
e = 1 - o
x = o*s + (e-o)*k
y = e*s + (o-e)*k
if (n > m)
x += d+e
y -= e
else
y += d+o
x -= o
Final triangle
Using symmetry, we can calculate pt2 = m*n - s(s+1)/2
We now face nearly the same problem as for t1, except that the diagonal may run in the same direction as for t1 or in the reverse direction (if n+m is odd).
Using symmetry tricks, we can compute x(p) and y(p) like so:
p = n*m -1 - p
d = floor((-1+sqrt(1+8*p))/2)
k = p - d*(d+1)/2
o = (d+m+n) % 2
e = 1 - o
x = n-1 - (o*d + (e-o)*k)
y = m-1 - (e*d + (o-e)*k)
Putting all together
Here is a sample C++ implementation.
I used 64-bit integers out of sheer laziness. Most could be replaced by 32-bit values.
The computations could be made more efficient by precomputing a few more coefficients.
A good part of the code could be factored out, but I doubt it is worth the effort.
Since this is just a quick and dirty proof of concept, I did not optimize it.
#include <cstdio>     // printf
#include <cstdlib>    // exit
#include <cmath>      // sqrt, floor
#include <algorithm>  // min
using namespace std;

typedef long long tCoord;

void panic(const char * msg)
{
    printf("PANIC: %s\n", msg);
    exit(-1);
}

struct tPoint {
    tCoord x, y;
    tPoint(tCoord x = 0, tCoord y = 0) : x(x), y(y) {}
    tPoint operator+(const tPoint & p) const { return{ x + p.x, y + p.y }; }
    bool operator!=(const tPoint & p) const { return x != p.x || y != p.y; }
};

class tMatrix {
    tCoord n, m;     // dimensions
    tCoord s;        // smallest dimension
    tCoord pt1, pt2; // t1 / mid / t2 limits for p
public:
    tMatrix(tCoord n, tCoord m) : n(n), m(m)
    {
        s = min(n, m);
        pt1 = (s*(s + 1)) / 2;
        pt2 = n*m - pt1;
    }

    tPoint diagonal_cell(tCoord p)
    {
        tCoord x, y;
        if (p < pt1) // inside t1
        {
            tCoord d = (tCoord)floor((-1 + sqrt(1 + 8 * p)) / 2);
            tCoord k = p - (d*(d + 1)) / 2;
            tCoord o = d % 2;
            tCoord e = 1 - o;
            x = o*d + (e - o)*k;
            y = e*d + (o - e)*k;
        }
        else if (p < pt2) // inside mid
        {
            p -= pt1;
            tCoord d = (tCoord)floor(p / s);
            tCoord k = p - d*s;
            tCoord o = (d + s) % 2;
            tCoord e = 1 - o;
            x = o*s + (e - o)*k;
            y = e*s + (o - e)*k;
            if (m > n) // vertical matrix
            {
                x -= o;
                y += d + o;
            }
            else // horizontal matrix
            {
                x += d + e;
                y -= e;
            }
        }
        else // inside t2
        {
            p = n * m - 1 - p;
            tCoord d = (tCoord)floor((-1 + sqrt(1 + 8 * p)) / 2);
            tCoord k = p - (d*(d + 1)) / 2;
            tCoord o = (d + m + n) % 2;
            tCoord e = 1 - o;
            x = n - 1 - (o*d + (e - o)*k);
            y = m - 1 - (e*d + (o - e)*k);
        }
        return{ x, y };
    }

    void check(void)
    {
        tPoint move[4] = { { 1, 0 }, { -1, 1 }, { 1, -1 }, { 0, 1 } };
        tPoint pos;
        tCoord dir = 0;
        for (tCoord p = 0; p != n * m; p++)
        {
            tPoint dc = diagonal_cell(p);
            if (pos != dc) panic("zot!");
            pos = pos + move[dir];
            if (dir == 0)
            {
                if (pos.y == m - 1) dir = 2;
                else dir = 1;
            }
            else if (dir == 3)
            {
                if (pos.x == n - 1) dir = 1;
                else dir = 2;
            }
            else if (dir == 1)
            {
                if (pos.y == m - 1) dir = 0;
                else if (pos.x == 0) dir = 3;
            }
            else
            {
                if (pos.x == n - 1) dir = 3;
                else if (pos.y == 0) dir = 0;
            }
        }
    }
};

int main(void)
{
    const tPoint dim[] = { { 10, 10 }, { 11, 11 }, { 10, 30 }, { 30, 10 }, { 10, 31 }, { 31, 10 }, { 11, 31 }, { 31, 11 } };
    for (tPoint d : dim)
    {
        printf("Checking a %lldx%lld matrix...", d.x, d.y);
        tMatrix(d.x, d.y).check();
        printf("done\n");
    }
    tCoord p = 10000000000;
    tMatrix matrix(1000000, 1000000);
    tPoint cell = matrix.diagonal_cell(p);
    printf("Coordinates of %lldth cell: (%lld,%lld)\n", p, cell.x, cell.y);
    return 0;
}
Results are checked against a "manual" sweep of the matrix.
This "manual" sweep is an ugly hack that won't work for a one-row or one-column matrix, though diagonal_cell() does work on any matrix (the "diagonal" sweep becomes linear in that case).
The coordinates found for the 10,000,000,000th cell of a 1,000,000 x 1,000,000 matrix seem consistent, since the diagonal d on which the cell stands is about sqrt(2*1e10), approx. 141421, and the sum of the cell coordinates is about equal to d (121090 + 20330 = 141420). Besides, it is also what the two other posters report.
I would say there is a good chance this lump of obfuscated code actually produces an O(1) solution to your problem.

Algorithm to generate a sequence proportional to specified percentage

Given a Map of objects and designated proportions (let's say they add up to 100 to make it easy):
val ss : Map[String,Double] = Map("A"->42, "B"->32, "C"->26)
How can I generate a sequence such that for a subset of size n there are ~42% "A"s, ~32% "B"s and ~26% "C"s? (Obviously, small n will have larger errors).
(Work language is Scala, but I'm just asking for the algorithm.)
UPDATE: I resisted a random approach since, for instance, there's a ~16% chance that the sequence would start with AA and an ~11% chance it would start with BB, and there would be very low odds that, for n precisely equal to the sum of the proportions, the distribution would be perfect. So, following MvG's answer, I implemented it as follows:
/**
 * Returns the key whose achieved proportions are most below desired proportions
 */
def next[T](proportions : Map[T, Double], achievedToDate : Map[T, Double]) : T = {
  val proportionsSum = proportions.values.sum
  val desiredPercentages = proportions.mapValues(v => v / proportionsSum)
  // Initially no achieved percentages, so avoid / 0
  val toDateTotal = if (achievedToDate.values.sum == 0.0) {
    1
  } else {
    achievedToDate.values.sum
  }
  val achievedPercentages = achievedToDate.mapValues(v => v / toDateTotal)
  val gaps = achievedPercentages.map { case (k, v) =>
    val gap = desiredPercentages(k) - v
    (k -> gap)
  }
  val maxUnder = gaps.values.toList.sortWith(_ > _).head
  //println("Max gap is " + maxUnder)
  val gapsForMaxUnder = gaps.mapValues { v => Math.abs(v - maxUnder) < Double.Epsilon }
  val keysByHasMaxUnder = gapsForMaxUnder.map(_.swap)
  keysByHasMaxUnder(true)
}

/**
 * Stream of most-fair next element
 */
def proportionalStream[T](proportions : Map[T, Double], toDate : Map[T, Double]) : Stream[T] = {
  val nextS = next(proportions, toDate)
  val tailToDate = toDate + (nextS -> (toDate(nextS) + 1.0))
  Stream.cons(
    nextS,
    proportionalStream(proportions, tailToDate)
  )
}
That, when used like this:
val ss : Map[String,Double] = Map("A"->42, "B"->32, "C"->26)
val none : Map[String,Double] = ss.mapValues(_ => 0.0)
val mySequence = (proportionalStream(ss, none) take 100).toList
println("Desired : " + ss)
println("Achieved : " + mySequence.groupBy(identity).mapValues(_.size))
mySequence.map(s => print(s))
println
produces :
Desired : Map(A -> 42.0, B -> 32.0, C -> 26.0)
Achieved : Map(C -> 26, A -> 42, B -> 32)
ABCABCABACBACABACBABACABCABACBACABABCABACABCABACBA
CABABCABACBACABACBABACABCABACBACABABCABACABCABACBA
For a deterministic approach, the most obvious solution would probably be this:
Keep track of the number of occurrences of each item in the sequence so far.
For the next item, choose that item for which the difference between intended and actual count (or proportion, if you prefer that) is maximal, but only if the intended count (resp. proportion) is greater than the actual one.
If there is a tie, break it in an arbitrary but deterministic way, e.g. choosing the alphabetically lowest item.
This approach would ensure an optimal adherence to the prescribed ratio for every prefix of the infinite sequence generated in this way.
Quick & dirty python proof of concept (don't expect any of the variable “names” to make any sense):
import sys
p = [0.42, 0.32, 0.26]
c = [0, 0, 0]
a = ['A', 'B', 'C']
n = 0
while n < 70*5:
    n += 1
    x = 0
    s = n*p[0] - c[0]
    for i in [1, 2]:
        si = n*p[i] - c[i]
        if si > s:
            x = i
            s = si
    sys.stdout.write(a[x])
    if n % 70 == 0:
        sys.stdout.write('\n')
    c[x] += 1
Generates
ABCABCABACABACBABCAABCABACBACABACBABCABACABACBACBAABCABCABACABACBABCAB
ACABACBACABACBABCABACABACBACBAABCABCABACABACBABCAABCABACBACABACBABCABA
CABACBACBAABCABCABACABACBABCABACABACBACBAACBABCABACABACBACBAABCABCABAC
ABACBABCABACABACBACBAACBABCABACABACBACBAABCABCABACABACBABCABACABACBACB
AACBABCABACABACBACBAABCABCABACABACBABCAABCABACBACBAACBABCABACABACBACBA
For every item of the sequence, compute a (pseudo-)random number r equidistributed between 0 (inclusive) and 100 (exclusive).
If 0 ≤ r < 42, take A
If 42 ≤ r < (42+32), take B
If (42+32) ≤ r < (42+32+26)=100, take C
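A tiny Python sketch of that random approach, using the proportions from the question:

import random

proportions = {"A": 42, "B": 32, "C": 26}

def pick(weights):
    # Draw r uniformly in [0, total) and walk the cumulative ranges.
    r = random.uniform(0, sum(weights.values()))
    cumulative = 0.0
    for key, weight in weights.items():
        cumulative += weight
        if r < cumulative:
            return key
    return key  # guard against floating-point edge cases

sequence = [pick(proportions) for _ in range(100)]
print({k: sequence.count(k) for k in proportions})

In recent Python versions, random.choices(list(proportions), weights=list(proportions.values()), k=100) does the same in one call.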
The number of each entry in your subset is going to be the same as in your map, but with a scaling factor applied.
The scaling factor is n/100.
So if n was 50, you would have { Ax21, Bx16, Cx13 }.
Randomize the order to your liking.
The simplest "deterministic" solution [in terms of the number of elements of each category, IMO] is: add elements in a predefined order, and then shuffle the resulting list.
First, add map(x)/100 * n elements for each element x [choose how you handle the integer arithmetic to avoid being off by one element], and then shuffle the resulting list.
Shuffling a list is simple with the Fisher-Yates shuffle, which is implemented in most languages: for example, Java has Collections.shuffle() and C++ has random_shuffle().
In Java, it will be as simple as:
int N = 107;
List<String> res = new ArrayList<String>();
for (Entry<String, Integer> e : map.entrySet()) { // map is a predefined Map<String,Integer> of frequencies
    for (int i = 0; i < Math.round(e.getValue() / 100.0 * N); i++) {
        res.add(e.getKey());
    }
}
Collections.shuffle(res);
This is nondeterministic, but gives a distribution of values close to MvG's. It suffers from the problem that it could give AAA right at the start. I post it here for completeness' sake given how it proves my dissent with MvG was misplaced (and I don't expect any upvotes).
Now, if someone has an idea for an expand function that is deterministic and won't just duplicate MvG's method (rendering the calc function useless), I'm all ears!
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
    "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
    <title>ErikE's answer</title>
</head>
<body>
    <div id="output"></div>
    <script type="text/javascript">
        if (!Array.prototype.each) {
            Array.prototype.each = function(callback) {
                var i, l = this.length;
                for (i = 0; i < l; i += 1) {
                    callback(i, this[i]);
                }
            };
        }
        if (!Array.prototype.sum) {
            Array.prototype.sum = function() {
                var sum = 0;
                this.each(function(i, val) {
                    sum += val;
                });
                return sum;
            };
        }
        function expand(counts) {
            var
                result = "",
                charlist = [],
                l,
                index,
                char;
            counts.each(function(i, val) {
                char = String.fromCharCode(i + 65);
                for ( ; val > 0; val -= 1) {
                    charlist.push(char);
                }
            });
            l = charlist.length;
            for ( ; l > 0; l -= 1) {
                index = Math.floor(Math.random() * l);
                result += charlist[index];
                charlist.splice(index, 1);
            }
            return result;
        }
        function calc(n, proportions) {
            var percents = [],
                counts = [],
                errors = [],
                fnmap = [],
                errorSum,
                worstIndex,
                adjust;
            fnmap[1] = "min";
            fnmap[-1] = "max";
            proportions.each(function(i, val) {
                percents[i] = val / proportions.sum() * n;
                counts[i] = Math.round(percents[i]);
                errors[i] = counts[i] - percents[i];
            });
            errorSum = counts.sum() - n;
            while (errorSum != 0) {
                adjust = errorSum < 0 ? 1 : -1;
                worstIndex = errors.indexOf(Math[fnmap[adjust]].apply(0, errors));
                counts[worstIndex] += adjust;
                errors[worstIndex] = counts[worstIndex] - percents[worstIndex];
                errorSum += adjust;
            }
            return expand(counts);
        }
        document.body.onload = function() {
            document.getElementById('output').innerHTML = calc(99, [25.1, 24.9, 25.9, 24.1]);
        };
    </script>
</body>
</html>
