Efficiently calculate histogram of a 3D numpy array along an axis with different bin edges

Efficiently calculate histogram of a 3D numpy array along an axis with different bin edges - performance

Problem description
I have a 3D numpy array, denoted as data, of shape N x R x C, i.e. N samples, R rows and C columns. I would like to obtain histograms along column for each combination of sample and row. However bin edges (see argument bins in numpy.histogram), of fixed length S, will be different at different rows but are shared across samples. Consider this example for illustration, for the 1st sample (data[0]), bin edge sequence for its 1st row is different from that for its 2nd row, but is the same as that for the 1st row from the 2nd sample (data[1]). Thus all the bin edge sequences are stored in a 2D numpy array of shape R x S, denoted as bin_edges.
My question is how to efficiently calculate the histograms?
A working but slow solution
Using numpy.histogram, I was able to come up with a working but fairly slow solution as shown in the below code snippet
```
Get dummy data
N: number of samples
R: number of rows (or kernels)
C: number of columns (or pixels)
S: number of bins
```
import numpy as np
N, R, C, S = 100, 50, 1000, 10
data = np.random.randn(N, R, C)
# for each row/kernel, pool pixels of all samples
poolsamples = np.swapaxes(data, 0, 1).reshape(R, -1)
# use quantiles as bin edges
percentiles = np.linspace(0, 100, num=(S + 1))
bin_edges = np.transpose(np.percentile(poolsamples, percentiles, axis=1))
```
A working but slow solution of getting histograms along column
```
hist = np.empty((N, R, S))
for idx in np.arange(R):
bin_edges_i = bin_edges[idx, :]
counts = np.apply_along_axis(
lambda a: np.histogram(a, bins=bin_edges_i)[0],
1, data[:, idx, :])
hist[:, idx, :] = counts
Possible directions
Fancy numpy reshape to avoid using for loop at all
This problem arises from extracting low-end characteristics for each image forwarded through a trained neural network. Therefore, if the histogram extraction can be embedded in TensorFlow graph and ultimately be carried out on GPU, that would be ideal!
I noticed a python package fast-histogram which claims to be 7-15x faster than numpy.histogram. However 1d histogram function can only takes number of bins instead of actual bin positions
numexpr?
I would love to hear any inputs! Thanks in advance!

Making use of 2D version of np.searchsorted : searchsorted2d -
def vectorized_app(data, bin_edges):
N, R, C = data.shape
a = np.sort(data.reshape(-1,C),1)
b = np.repeat(bin_edges[None],N,axis=0).reshape(-1,bin_edges.shape[-1])
idx = searchsorted2d(a,b)
idx[:,0] = 0
idx[:,-1] = a.shape[1]
out = (idx[:,1:] - idx[:,:-1]).reshape(N,R,-1)
return out
Runtime test -
In [591]: N, R, C, S = 100, 50, 1000, 10
...: data = np.random.randn(N, R, C)
...:
...: # for each row/kernel, pool pixels of all samples
...: poolsamples = np.swapaxes(data, 0, 1).reshape(R, -1)
...: # use quantiles as bin edges
...: percentiles = np.linspace(0, 100, num=(S + 1))
...: bin_edges = np.transpose(np.percentile(poolsamples, percentiles, axis=1))
...:
In [592]: %timeit org_app(data, bin_edges)
1 loop, best of 3: 481 ms per loop
In [593]: %timeit vectorized_app(data, bin_edges)
1 loop, best of 3: 224 ms per loop
In [595]: np.allclose(org_app(data, bin_edges), vectorized_app(data, bin_edges))
Out[595]: True
More than 2x speedup there.
Closer look reveals that the bottleneck with the proposed vectorized method is the sorting itself -
In [594]: %timeit np.sort(data.reshape(-1,C),1)
1 loop, best of 3: 194 ms per loop
We need this sorting to use searchsorted.

Related

Faster way to find the size of the intersection of any two corresponding multisets from two 3D arrays of multisets

I have two uint16 3D (GPU) arrays A and B in MATLAB, which have the same 2nd and 3rd dimension. For instance, size(A,1) = 300 000, size(B,1) = 2000, size(A,2) = size(B,2) = 20, and size(A,3) = size(B,3) = 100, to give an idea about the orders of magnitude. Actually, size(A,3) = size(B,3) is very big, say ~ 1 000 000, but the arrays are stored externally in small pieces cut along the 3rd dimension. The point is that there is a very long loop along the 3rd dimension (cfg. MWE below), so the code inside of it needs to be optimized further (if possible). Furthermore, the values of A and B can be assumed to be bounded way below 65535, but there are still hundreds of different values.
For each i,j, and d, the rows A(i,:,d) and B(j,:,d) represent multisets of the same size, and I need to find the size of the largest common submultiset (multisubset?) of the two, i.e. the size of their intersection as multisets. Moreover, the rows of B can be assumed sorted.
For example, if [2 3 2 1 4 5 5 5 6 7] and [1 2 2 3 5 5 7 8 9 11] are two such multisets, respectively, then their multiset intersection is [1 2 2 3 5 5 7], which has the size 7 (7 elements as a multiset).
I am currently using the following routine to do this:
s = 300000; % 1st dim. of A
n = 2000; % 1st dim. of B
c = 10; % 2nd dim. of A and B
depth = 10; % 3rd dim. of A and B (corresponds to a batch of size 10 of A and B along the 3rd dim.)
N = 100; % upper bound on the possible values of A and B
A = randi(N,s,c,depth,'uint16','gpuArray');
B = randi(N,n,c,depth,'uint16','gpuArray');
Sizes_of_multiset_intersections = zeros(s,n,depth,'uint8'); % too big to fit in GPU memory together with A and B
for d=1:depth
A_slice = A(:,:,d);
B_slice = B(:,:,d);
unique_B_values = permute(unique(B_slice),[3 2 1]); % B is smaller than A
% compute counts of the unique B-values for each multiset:
A_values_counts = permute(sum(uint8(A_slice==unique_B_values),2,'native'),[1 3 2]);
B_values_counts = permute(sum(uint8(B_slice==unique_B_values),2,'native'),[1 3 2]);
% compute the count of each unique B-value in the intersection:
Sizes_of_multiset_intersections_tmp = gpuArray.zeros(s,n,'uint8');
for i=1:n
Sizes_of_multiset_intersections_tmp(:,i) = sum(min(A_values_counts,B_values_counts(i,:)),2,'native');
end
Sizes_of_multiset_intersections(:,:,d) = gather(Sizes_of_multiset_intersections_tmp);
end
One can also easily adapt above code to compute the result in batches along dimension 3 rather than d=1:depth (=batch of size 1), though at the expense of even bigger unique_B_values vector.
Since the depth dimension is large (even when working in batches along it), I am interested in faster alternatives to the code inside the outer loop. So my question is this: is there a faster (e.g. better vectorized) way to compute sizes of intersections of multisets of equal size?

Disclaimer : This is not a GPU based solution (Don't have a good GPU). I find the results interesting and want to share, but I can delete this answer if you think it should be.
Below is a vectorized version of your code, that makes it possible to get rid of the inner loop, at the cost of having to deal with a bigger array, that might be too big to fit in the memory.
The idea is to have the matrices A_values_counts and B_values_counts be 3D matrices shaped in such a way that calling min(A_values_counts,B_values_counts) will calculate everything in one go due to implicit expansion. In the background it will create a big array of size s x n x length(unique_B_values) (Probably most of the time too big)
In order to go around the constraint on the size, the results are calculated in batches along the n dimension, i.e. the first dimension of B:
tic
nBatches_B = 2000;
sBatches_B = n/nBatches_B;
Sizes_of_multiset_intersections_new = zeros(s,n,depth,'uint8');
for d=1:depth
A_slice = A(:,:,d);
B_slice = B(:,:,d);
% compute counts of the unique B-values for each multiset:
unique_B_values = reshape(unique(B_slice),1,1,[]);
A_values_counts = sum(uint8(A_slice==unique_B_values),2,'native'); % s x 1 x length(uniqueB) array
B_values_counts = reshape(sum(uint8(B_slice==unique_B_values),2,'native'),1,n,[]); % 1 x n x length(uniqueB) array
% Not possible to do it all in one go, must split in batches along B
for ii = 1:nBatches_B
Sizes_of_multiset_intersections_new(:,((ii-1)*sBatches_B+1):ii*sBatches_B,d) = sum(min(A_values_counts,B_values_counts(:,((ii-1)*sBatches_B+1):ii*sBatches_B,:)),3,'native'); % Vectorized
end
end
toc
Here is a little benchmark with different values of the number of batches. You can see that a minimum is found around a number of 400 (batch size 50), with a decrease of around 10% in processing time (each point is an average over 3 runs). (EDIT : x axis is amount of batches, not batches size)
I'd be interested in knowing how it behaves for GPU arrays as well!

Perform numpy.sum (or scipy.integrate.simps()) on large splitted array efficiently

Let's consider a very large numpy array a (M, N).
where M can typically be 1 or 100 and N 10-100,000,000
We have the array of indices that can split it into many (K = 1,000,000) along axis=1.
We want to efficiently perform an operation like integration along axis=1 (np.sum to take the simplest form) on each sub-array and return a (M, K) array.
An elegant and efficient solution was proposed by #Divakar in question [41920367]how to split numpy array and perform certain actions on split arrays [Python] but my understanding is that it only applies to cases where all sub-arrays have the same shape, which allows for reshaping.
But in our case the sub-arrays don't have the same shape, which, so far has forced me to loop on the index... please take me out of my misery...
Example
a = np.random.random((10, 100000000))
ind = np.sort(np.random.randint(10, 9000000, 1000000))
The size of the sub-arrays are not homogenous:
sizes = np.diff(ind)
print(sizes.min(), size.max())
2, 8732
So far, the best I found is:
output = np.concatenate([np.sum(vv, axis=1)[:, None] for vv in np.split(a, ind, axis=1)], axis=1)
Possible feature request for numpy and scipy:
If looping is really unavoidable, at least having it done in C inside the numpy and scipy.integrate.simps (or romb) functions would probably speed-up the output.
Something like
output = np.sum(a, axis=1, split_ind=ind)
output = scipy.integrate.simps(a, x=x, axis=1, split_ind=ind)
output = scipy.integrate.romb(a, x=x, axis=1, split_ind=ind)
would be very welcome !
(where x itself could be splitable, or not)
Side note:
While trying this example, I noticed that with these numbers there was almost always an element of sizes equal to 0 (the sizes.min() is almost always zero).
This looks peculiar to me, as we are picking 10,000 integers between 10 and 9,000,000, the odds that the same number comes up twice (such that diff = 0) should be close to 0. It seems to be very close to 1.
Would that be due to the algorithm behind np.random.randint ?

What you want is np.add.reduceat
output = np.add.reduceat(a, ind, axis = 1)
output.shape
Out[]: (10, 1000000)
Universal Functions (ufunc) are a very powerful tool in numpy
As for the repeated indices, that's simply the Birthday Problem cropping up.

Great !
Thanks ! on my VM Cent OS 6.9 I have the following results:
In [71]: a = np.random.random((10, 10000000))
In [72]: ind = np.unique(np.random.randint(10, 9000000, 100000))
In [73]: ind2 = np.append([0], ind)
In [74]: out = np.concatenate([np.sum(vv, axis=1)[:, None] for vv in np.split(a, ind, axis=1)], axis=1)
In [75]: out2 = np.add.reduceat(a, ind2, axis=1)
In [83]: np.allclose(out, out2)
Out[83]: True
In [84]: %timeit out = np.concatenate([np.sum(vv, axis=1)[:, None] for vv in np.split(a, ind, axis=1)], axis=1)
2.7 s ± 40.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [85]: %timeit out2 = np.add.reduceat(a, ind2, axis=1)
179 ms ± 15.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
That's a good 93 % speed gain (or factor 15 faster) over the list concatenation :-)
Great !

Conditional sampling of binary vectors (?)

I'm trying to find a name for my problem, so I don't have to re-invent wheel when coding an algorithm which solves it...
I have say 2,000 binary (row) vectors and I need to pick 500 from them. In the picked sample I do column sums and I want my sample to be as close as possible to a pre-defined distribution of the column sums. I'll be working with 20 to 60 columns.
A tiny example:
Out of the vectors:
110
010
011
110
100
I need to pick 2 to get column sums 2, 1, 0. The solution (exact in this case) would be
110
100
My ideas so far
one could maybe call this a binary multidimensional knapsack, but I did not find any algos for that
Linear Programming could help, but I'd need some step by step explanation as I got no experience with it
as exact solution is not always feasible, something like simulated annealing brute force could work well
a hacky way using constraint solvers comes to mind - first set the constraints tight and gradually loosen them until some solution is found - given that CSP should be much faster than ILP...?

My concrete, practical (if the approximation guarantee works out for you) suggestion would be to apply the maximum entropy method (in Chapter 7 of Boyd and Vandenberghe's book Convex Optimization; you can probably find several implementations with your favorite search engine) to find the maximum entropy probability distribution on row indexes such that (1) no row index is more likely than 1/500 (2) the expected value of the row vector chosen is 1/500th of the predefined distribution. Given this distribution, choose each row independently with probability 500 times its distribution likelihood, which will give you 500 rows on average. If you need exactly 500, repeat until you get exactly 500 (shouldn't take too many tries due to concentration bounds).

Firstly I will make some assumptions regarding this problem:
Regardless whether the column sum of the selected solution is over or under the target, it weighs the same.
The sum of the first, second, and third column are equally weighted in the solution (i.e. If there's a solution whereas the first column sum is off by 1, and another where the third column sum is off by 1, the solution are equally good).
The closest problem I can think of this problem is the Subset sum problem, which itself can be thought of a special case of Knapsack problem.
However both of these problem are NP-Complete. This means there are no polynomial time algorithm that can solve them, even though it is easy to verify the solution.
If I were you the two most arguably efficient solution of this problem are linear programming and machine learning.
Depending on how many columns you are optimising in this problem, with linear programming you can control how much finely tuned you want the solution, in exchange of time. You should read up on this, because this is fairly simple and efficient.
With Machine learning, you need a lot of data sets (the set of vectors and the set of solutions). You don't even need to specify what you want, a lot of machine learning algorithms can generally deduce what you want them to optimise based on your data set.
Both solution has pros and cons, you should decide which one to use yourself based on the circumstances and problem set.

This definitely can be modeled as (integer!) linear program (many problems can). Once you have it, you can use a program such as lpsolve to solve it.
We model vector i is selected as x_i which can be 0 or 1.
Then for each column c, we have a constraint:
sum of all (x_i * value of i in column c) = target for column c
Taking your example, in lp_solve this could look like:
min: ;
+x1 +x4 +x5 >= 2;
+x1 +x4 +x5 <= 2;
+x1 +x2 +x3 +x4 <= 1;
+x1 +x2 +x3 +x4 >= 1;
+x3 <= 0;
+x3 >= 0;
bin x1, x2, x3, x4, x5;

If you are fine with a heuristic based search approach, here is one.
Go over the list and find the minimum squared sum of the digit wise difference between each bit string and the goal. For example, if we are looking for 2, 1, 0, and we are scoring 0, 1, 0, we would do it in the following way:
Take the digit wise difference:
2, 0, 1
Square the digit wise difference:
4, 0, 1
Sum:
5
As a side note, squaring the difference when scoring is a common method when doing heuristic search. In your case, it makes sense because bit strings that have a 1 in as the first digit are a lot more interesting to us. In your case this simple algorithm would pick first 110, then 100, which would is the best solution.
In any case, there are some optimizations that could be made to this, I will post them here if this kind of approach is what you are looking for, but this is the core of the algorithm.

You have a given target binary vector. You want to select M vectors out of N that have the closest sum to the target. Let's say you use the eucilidean distance to measure if a selection is better than another.
If you want an exact sum, have a look at the k-sum problem which is a generalization of the 3SUM problem. The problem is harder than the subset sum problem, because you want an exact number of elements to add to a target value. There is a solution in O(N^(M/2)). lg N), but that means more than 2000^250 * 7.6 > 10^826 operations in your case (in the favorable case where vectors operations have a cost of 1).
First conclusion: do not try to get an exact result unless your vectors have some characteristics that may reduce the complexity.
Here's a hill climbing approach:
sort the vectors by number of 1's: 111... first, 000... last;
use the polynomial time approximate algorithm for the subset sum;
you have an approximate solution with K elements. Because of the order of elements (the big ones come first), K should be a little as possible:
if K >= M, you take the M first vectors of the solution and that's probably near the best you can do.
if K < M, you can remove the first vector and try to replace it with 2 or more vectors from the rest of the N vectors, using the same technique, until you have M vectors. To sumarize: split the big vectors into smaller ones until you reach the correct number of vectors.
Here's a proof of concept with numbers, in Python:
import random
def distance(x, y):
return abs(x-y)
def show(ls):
if len(ls) < 10:
return str(ls)
else:
return ", ".join(map(str, ls[:5]+("...",)+ls[-5:]))
def find(is_xs, target):
# see https://en.wikipedia.org/wiki/Subset_sum_problem#Pseudo-polynomial_time_dynamic_programming_solution
S = [(0, ())] # we store indices along with values to get the path
for i, x in is_xs:
T = [(x + t, js + (i,)) for t, js in S]
U = sorted(S + T)
y, ks = U[0]
S = [(y, ks)]
for z, ls in U:
if z == target: # use the euclidean distance here if you want an approximation
return ls
if z != y and z < target:
y, ks = z, ls
S.append((z, ls))
ls = S[-1][1] # take the closest element to target
return ls
N = 2000
M = 500
target = 1000
xs = [random.randint(0, 10) for _ in range(N)]
print ("Take {} numbers out of {} to make a sum of {}", M, xs, target)
xs = sorted(xs, reverse = True)
is_xs = list(enumerate(xs))
print ("Sorted numbers: {}".format(show(tuple(is_xs))))
ls = find(is_xs, target)
print("FIRST TRY: {} elements ({}) -> {}".format(len(ls), show(ls), sum(x for i, x in is_xs if i in ls)))
splits = 0
while len(ls) < M:
first_x = xs[ls[0]]
js_ys = [(i, x) for i, x in is_xs if i not in ls and x != first_x]
replace = find(js_ys, first_x)
splits += 1
if len(replace) < 2 or len(replace) + len(ls) - 1 > M or sum(xs[i] for i in replace) != first_x:
print("Give up: can't replace {}.\nAdd the lowest elements.")
ls += tuple([i for i, x in is_xs if i not in ls][len(ls)-M:])
break
print ("Replace {} (={}) by {} (={})".format(ls[:1], first_x, replace, sum(xs[i] for i in replace)))
ls = tuple(sorted(ls[1:] + replace)) # use a heap?
print("{} elements ({}) -> {}".format(len(ls), show(ls), sum(x for i, x in is_xs if i in ls)))
print("AFTER {} splits, {} -> {}".format(splits, ls, sum(x for i, x in is_xs if i in ls)))
The result is obviously not guaranteed to be optimal.
Remarks:
Complexity: find has a polynomial time complexity (see the Wikipedia page) and is called at most M^2 times, hence the complexity remains polynomial. In practice, the process is reasonably fast (split calls have a small target).
Vectors: to ensure that you reach the target with the minimum of elements, you can improve the order of element. Your target is (t_1, ..., t_c): if you sort the t_js from max to min, you get the more importants columns first. You can sort the vectors: by number of 1s and then by the presence of a 1 in the most important columns. E.g. target = 4 8 6 => 1 1 1 > 0 1 1 > 1 1 0 > 1 0 1 > 0 1 0 > 0 0 1 > 1 0 0 > 0 0 0.
find (Vectors) if the current sum exceed the target in all the columns, then you're not connecting to the target (any vector you add to the current sum will bring you farther from the target): don't add the sum to S (z >= target case for numbers).

I propose a simple ad hoc algorithm, which, broadly speaking, is a kind of gradient descent algorithm. It seems to work relatively well for input vectors which have a distribution of 1s “similar” to the target sum vector, and probably also for all “nice” input vectors, as defined in a comment of yours. The solution is not exact, but the approximation seems good.
The distance between the sum vector of the output vectors and the target vector is taken to be Euclidean. To minimize it means minimizing the sum of the square differences off sum vector and target vector (the square root is not needed because it is monotonic). The algorithm does not guarantee to yield the sample that minimizes the distance from the target, but anyway makes a serious attempt at doing so, by always moving in some locally optimal direction.
The algorithm can be split into 3 parts.
First of all the first M candidate output vectors out of the N input vectors (e.g., N=2000, M=500) are put in a list, and the remaining vectors are put in another.
Then "approximately optimal" swaps between vectors in the two lists are done, until either the distance would not decrease any more, or a predefined maximum number of iterations is reached. An approximately optimal swap is one where removing the first vector from the list of output vectors causes a maximal decrease or minimal increase of the distance, and then, after the removal of the first vector, adding the second vector to the same list causes a maximal decrease of the distance. The whole swap is avoided if the net result is not a decrease of the distance.
Then, as a last phase, "optimal" swaps are done, again stopping on no decrease in distance or maximum number of iterations reached. Optimal swaps cause a maximal decrease of the distance, without requiring the removal of the first vector to be optimal in itself. To find an optimal swap all vector pairs have to be checked. This phase is much more expensive, being O(M(N-M)), while the previous "approximate" phase is O(M+(N-M))=O(N). Luckily, when entering this phase, most of the work has already been done by the previous phase.
from typing import List, Tuple
def get_sample(vects: List[Tuple[int]], target: Tuple[int], n_out: int,
max_approx_swaps: int = None, max_optimal_swaps: int = None,
verbose: bool = False) -> List[Tuple[int]]:
"""
Get a sample of the input vectors having a sum close to the target vector.
Closeness is measured in Euclidean metrics. The output is not guaranteed to be
optimal (minimum square distance from target), but a serious attempt is made.
The max_* parameters can be used to avoid too long execution times,
tune them to your needs by setting verbose to True, or leave them None (∞).
:param vects: the list of vectors (tuples) with the same number of "columns"
:param target: the target vector, with the same number of "columns"
:param n_out: the requested sample size
:param max_approx_swaps: the max number of approximately optimal vector swaps,
None means unlimited (default: None)
:param max_optimal_swaps: the max number of optimal vector swaps,
None means unlimited (default: None)
:param verbose: print some info if True (default: False)
:return: the sample of n_out vectors having a sum close to the target vector
"""
def square_distance(v1, v2):
return sum((e1 - e2) ** 2 for e1, e2 in zip(v1, v2))
n_vec = len(vects)
assert n_vec > 0
assert n_out > 0
n_rem = n_vec - n_out
assert n_rem > 0
output = vects[:n_out]
remain = vects[n_out:]
n_col = len(vects[0])
assert n_col == len(target) > 0
sumvect = (0,) * n_col
for outvect in output:
sumvect = tuple(map(int.__add__, sumvect, outvect))
sqdist = square_distance(sumvect, target)
if verbose:
print(f"sqdist = {sqdist:4} after"
f" picking the first {n_out} vectors out of {n_vec}")
if max_approx_swaps is None:
max_approx_swaps = sqdist
n_approx_swaps = 0
while sqdist and n_approx_swaps < max_approx_swaps:
# find the best vect to subtract (the square distance MAY increase)
sqdist_0 = None
index_0 = None
sumvect_0 = None
for index in range(n_out):
tmp_sumvect = tuple(map(int.__sub__, sumvect, output[index]))
tmp_sqdist = square_distance(tmp_sumvect, target)
if sqdist_0 is None or sqdist_0 > tmp_sqdist:
sqdist_0 = tmp_sqdist
index_0 = index
sumvect_0 = tmp_sumvect
# find the best vect to add,
# but only if there is a net decrease of the square distance
sqdist_1 = sqdist
index_1 = None
sumvect_1 = None
for index in range(n_rem):
tmp_sumvect = tuple(map(int.__add__, sumvect_0, remain[index]))
tmp_sqdist = square_distance(tmp_sumvect, target)
if sqdist_1 > tmp_sqdist:
sqdist_1 = tmp_sqdist
index_1 = index
sumvect_1 = tmp_sumvect
if sumvect_1:
tmp = output[index_0]
output[index_0] = remain[index_1]
remain[index_1] = tmp
sqdist = sqdist_1
sumvect = sumvect_1
n_approx_swaps += 1
else:
break
if verbose:
print(f"sqdist = {sqdist:4} after {n_approx_swaps}"
f" approximately optimal swap{'s'[n_approx_swaps == 1:]}")
diffvect = tuple(map(int.__sub__, sumvect, target))
if max_optimal_swaps is None:
max_optimal_swaps = sqdist
n_optimal_swaps = 0
while sqdist and n_optimal_swaps < max_optimal_swaps:
# find the best pair to swap,
# but only if the square distance decreases
best_sqdist = sqdist
best_diffvect = diffvect
best_pair = None
for i0 in range(M):
tmp_diffvect = tuple(map(int.__sub__, diffvect, output[i0]))
for i1 in range(n_rem):
new_diffvect = tuple(map(int.__add__, tmp_diffvect, remain[i1]))
new_sqdist = sum(d * d for d in new_diffvect)
if best_sqdist > new_sqdist:
best_sqdist = new_sqdist
best_diffvect = new_diffvect
best_pair = (i0, i1)
if best_pair:
tmp = output[best_pair[0]]
output[best_pair[0]] = remain[best_pair[1]]
remain[best_pair[1]] = tmp
sqdist = best_sqdist
diffvect = best_diffvect
n_optimal_swaps += 1
else:
break
if verbose:
print(f"sqdist = {sqdist:4} after {n_optimal_swaps}"
f" optimal swap{'s'[n_optimal_swaps == 1:]}")
return output
from random import randrange
C = 30 # number of columns
N = 2000 # total number of vectors
M = 500 # number of output vectors
F = 0.9 # fill factor of the target sum vector
T = int(M * F) # maximum value + 1 that can be appear in the target sum vector
A = 10000 # maximum number of approximately optimal swaps, may be None (∞)
B = 10 # maximum number of optimal swaps, may be None (unlimited)
target = tuple(randrange(T) for _ in range(C))
vects = [tuple(int(randrange(M) < t) for t in target) for _ in range(N)]
sample = get_sample(vects, target, M, A, B, True)
Typical output:
sqdist = 2639 after picking the first 500 vectors out of 2000
sqdist = 9 after 27 approximately optimal swaps
sqdist = 1 after 4 optimal swaps
P.S.: As it stands, this algorithm is not limited to binary input vectors, integer vectors would work too. Intuitively I suspect that the quality of the optimization could suffer, though. I suspect that this algorithm is more appropriate for binary vectors.
P.P.S.: Execution times with your kind of data are probably acceptable with standard CPython, but get better (like a couple of seconds, almost a factor of 10) with PyPy. To handle bigger sets of data, the algorithm would have to be translated to C or some other language, which should not be difficult at all.

Randomly pick elements from a vector of counts

I'm currently trying to optimize some MATLAB/Octave code by means of an algorithmic change, but can't figure out how to deal with some randomness here. Suppose that I have a vector V of integers, with each element representing a count of some things, photons in my case. Now I want to randomly pick some amount of those "things" and create a new vector of the same size, but with the counts adjusted.
Here's how I do this at the moment:
function W = photonfilter(V, eff)
% W = photonfilter(V, eff)
% Randomly takes photons from V according to the given efficiency.
%
% Args:
% V: Input vector containing the number of emitted photons in each
% timeslot (one element is one timeslot). The elements are rounded
% to integers before processing.
% eff: Filter efficiency. On the average, every 1/eff photon will be
% taken. This value must be in the range 0 < eff <= 1.
% W: Output row vector with the same length as V and containing the number
% of received photons in each timeslot.
%
% WARNING: This function operates on a photon-by-photon basis in that it
% constructs a vector with one element per photon. The storage requirements
% therefore directly depend on sum(V), not only on the length of V.
% Round V and make it flat.
Ntot = length(V);
V = round(V);
V = V(:);
% Initialize the photon-based vector, so that each element contains
% the original index of the photon.
idxV = zeros(1, sum(V), 'uint32');
iout = 1;
for i = 1:Ntot
N = V(i);
idxV(iout:iout+N-1) = i;
iout = iout + N;
end;
% Take random photons.
idxV = idxV(randperm(length(idxV)));
idxV = idxV(1:round(length(idxV)*eff));
% Generate the output vector by placing the remaining photons back
% into their timeslots.
[W, trash] = hist(idxV, 1:Ntot);
This is a rather straightforward implementation of the description above. But it has an obvious performance drawback: The function creates a vector (idxV) containing one element per single photon. So if my V has only 1000 elements but an average count of 10000 per element, the internal vector will have 10 million elements making the function slow and heavy.
What I'd like to achieve now is not to directly optimize this code, but to use some other kind of algorithm which immediately calculates the new counts without giving each photon some kind of "identity". This must be possible somehow, but I just can't figure out how to do it.
Requirements:
The output vector W must have the same number of elements as the input vector V.
W(i) must be an integer and bounded by 0 <= W(i) <= V(i).
The expected value of sum(W) must be sum(V)*eff.
The algorithm must somehow implement this "random picking" of photons, i.e. there should not be some deterministic part like "run through V dividing all counts by the stepsize and propagating the remainders", as the whole point of this function is to bring randomness into the system.
An explicit loop over V is allowed if unavoidable, but a vectorized approach is preferable.
Any ideas how to implement something like this? A solution using only a random vector and then some trickery with probabilities and rounding would be ideal, but I haven't had any success with that so far.
Thanks! Best regards, Philipp

The method you employ to compute W is called Monte Carlo method. And indeed there can be some optimizations. Once of such is instead of calculating indices of photons, let's imagine a set of bins. Each bin has some probability and the sum of all bins' probabilities adds up to 1. We divide the segment [0, 1] into parts whose lengths are proportional to the probabilities of the bins. Now for every random number within [0, 1) that we generate we can quickly find the bin that it belongs to. Finally, we count numbers in the bins to obtain the final result. The code below illustrates the idea.
% Population size (number of photons).
N = 1000000;
% Sample size, size of V and W as well.
% For convenience of plotting, V and W are of the same size, but
% the algorithm doesn't enforce this constraint.
M = 10000;
% Number of Monte Carlo iterations, greater numbers give better quality.
K = 100000;
% Generate population of counts, use gaussian distribution to test the method.
% If implemented correctly histograms should have the same shape eventually.
V = hist(randn(1, N), M);
P = cumsum(V / sum(V));
% For every generated random value find its bin and then count the bins.
% Finally we normalize counts by the ration of N / K.
W = hist(lookup(P, rand(1, K)), M) * N / K;
% Compare distribution plots, they should be the same.
hold on;
plot(W, '+r');
plot(V, '*b');
pause

Based on the answer from Alexander Solovets, this is how the code now looks:
function W = photonfilter(V, eff, impl=1)
Ntot = length(V);
V = V(:);
if impl == 0
% Original "straightforward" solution.
V = round(V);
idxV = zeros(1, sum(V), 'uint32');
iout = 1;
for i = 1:Ntot
N = V(i);
idxV(iout:iout+N-1) = i;
iout = iout + N;
end;
idxV = idxV(randperm(length(idxV)));
idxV = idxV(1:round(length(idxV)*eff));
[W, trash] = hist(idxV, 1:Ntot);
else
% Monte Carlo approach.
Nphot = sum(V);
P = cumsum(V / Nphot);
W = hist(lookup(P, rand(1, round(Nphot * eff))), 0:Ntot-1);
end;
The results are quite comparable, as long as eff if not too close to 1 (with eff=1, the original solution yields W=V while the Monte Carlo approach still has some randomness, thereby violating the upper bound constraints).
Test in the interactive Octave shell:
octave:1> T=linspace(0,10*pi,10000);
octave:2> V=100*(1+sin(T));
octave:3> W1=photonfilter(V, 0.1, 0);
octave:4> W2=photonfilter(V, 0.1, 1);
octave:5> plot(T,V,T,W1,T,W2);
octave:6> legend('V','Random picking','Monte Carlo')
octave:7> sum(W1)
ans = 100000
octave:8> sum(W2)
ans = 100000
Plot:

In matlab in a product dense matrix * sparse matrix, how can I only calculate specific entries?

We have a matlab program in which we want to calculate the following expression:
sum( (M*x) .* x)
Here, M is a small dense matrix (say 100 by 100) and x is a sparse fat matrix (say of size 100 by 1 000 000, with 5% non-zero entries). When I run the code, then first M*x is calculated, which is a dense matrix-- however, most of the computation that went into computing that matrix is a complete waste of time, as most of it will be zero-ed out in the point-wise product with x afterwards.
In other words: What I want to do is to only calculate those entries (i,j) of M*x which correspond to (i,j) for which x(i,j) is non-zero. In the end, I will then also only be interested in each column count.
It seems pretty simple to start with but I could not figure out how to tell matlab to do it or how to reshape the calculation so that matlab does it efficiently. I would really like to avoid having to code up a mex-file for this operation, and this operation is eating up most of the computation time.
Here is a code snippet for comparison:
m = 100;
n = 100000;
density = 0.05;
M = randn(m); M = M * M';
x = sprandn(m,n,density);
tic
for i = 1:100
xsi = sum((M * x).*x,1);
end
toc
Elapsed time is 13.570713 seconds.

To compute (M*x) .* x: find which entries of the final result can be nonzero (using find), compute manually only for those (sum(M(...).'.*x(...)) .* nonzeros(x).'), and from that build the final matrix (using sparse):
[ii jj] = find(x);
R = sparse(ii, jj, sum(M(ii,:).'.*x(:,jj)) .* nonzeros(x).');
Of course, to compute sum((M*x) .* x) you then simply use
full(sum(R))

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Efficiently calculate histogram of a 3D numpy array along an axis with different bin edges - performance

Related

Faster way to find the size of the intersection of any two corresponding multisets from two 3D arrays of multisets

Perform numpy.sum (or scipy.integrate.simps()) on large splitted array efficiently

Conditional sampling of binary vectors (?)

Randomly pick elements from a vector of counts

In matlab in a product dense matrix * sparse matrix, how can I only calculate specific entries?

Categories

Resources