My Questions
Is there any way that I can speed up this calculation?
Is there a better algorithm or implementation that I can use to calculate the same values?
Describing the algorithm
I have a complex indexing problem that I'm struggling to solve in an efficient way.
The goal is to calculate the matrix w_prime using a combination of values from the equally sized matrices w, dY, and dX.
The value of w_prime(i,j) is calculated as mean( w( indY & indX ) ), where indY and indX are the indices of dY and dX that are equal to i and j respectively.
Here's a simple implementation in MATLAB of an algorithm to compute w_prime:
for i = 1:size(w_prime,1)
    indY = dY == i;
    for j = 1:size(w_prime,2)
        indX = dX == j;
        w_prime(i,j) = mean( w( indY & indX ) );
    end
end
Performance Problems
This implementation is sufficient for the example case below; however, in my actual use case w, dY, and dX are ~3000x3000 and w_prime is ~60x900, meaning that each index calculation operates on ~9 million elements. Needless to say, this implementation is too slow to be usable. Additionally, I'll need to run this code a few dozen times.
Example Calculation
If I want to compute w_prime(1,1):
Find the indices of dY that equal 1, save as indY
Find the indices of dX that equal 1, save as indX
Find the intersection of indY and indX, save as ind
Save the mean( w(ind) ) to w_prime(1,1)
General Problem Description
I have a set of points defined by two vectors, X and T, both 1xN, where N is ~3000. Additionally, the values of X and T are integers bounded by the intervals [1, 60] and [1, 900] respectively.
The matrices dX and dT are simply distance matrices, meaning that they contain the pairwise distances between the points, i.e. dx(i,j) is equal to abs( x(i) - x(j) ).
They are calculated using: dx = pdist(x);
The matrix w can be thought of as a weight matrix that describes how much influence one point has on another.
The purpose of calculating w_prime(a,b) is to determine the average weight between the sub-set of points that are separated by a in the X dimension and b in the T dimension.
This can be expressed as follows: w_prime(a,b) = mean( w(i,j) ) over all pairs (i,j) for which dX(i,j) = a and dT(i,j) = b.

This is straightforward with ACCUMARRAY:
nx = max(dX(:));
ny = max(dY(:));
w_prime = accumarray([dX(:),dY(:)],w(:),[nx,ny],@mean,NaN)
The output will be an nx-by-ny array with NaNs wherever there was no corresponding pair of indices. If you're sure that there will be a full complement of indices all the time, you can simplify the above calculation to
w_prime = accumarray([dX(:),dY(:)],w(:),[],@mean)
So, what does accumarray do? It looks at the rows of [dX(:),dY(:)]. Each row gives the (i,j) coordinate pair in w_prime to which the corresponding entry of w(:) contributes. For all rows equal to (1,1), for example, it applies the function (@mean) to the corresponding entries in w(:), and writes the output into w_prime(1,1).
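To make the grouping step concrete, here is a rough pure-Python sketch of the same idea: every (row, column) subscript pair collects its values, and each cell of the result holds the mean of its group. The function name `grouped_mean` and its arguments are made up for this illustration; it mimics accumarray with @mean and a NaN fill value and is not a drop-in replacement for it.

```python
from collections import defaultdict

def grouped_mean(rows, cols, values, shape, fill=float("nan")):
    """Mean of the values sharing each (row, col) subscript; subscripts are 1-based."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for r, c, v in zip(rows, cols, values):
        sums[(r, c)] += v
        counts[(r, c)] += 1
    n_rows, n_cols = shape
    result = []
    for r in range(1, n_rows + 1):
        row = []
        for c in range(1, n_cols + 1):
            n = counts.get((r, c), 0)
            # Cells that received no values keep the fill value.
            row.append(sums[(r, c)] / n if n else fill)
        result.append(row)
    return result

# Two values land in cell (1,1), one in cell (2,2), none elsewhere.
w_prime = grouped_mean([1, 1, 2], [1, 1, 2], [2.0, 4.0, 5.0], shape=(2, 2))
```

Each input element is visited exactly once, which is why this kind of grouping avoids the repeated full-matrix comparisons of the nested loop.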


Calculating large exponential shares / probabilities

Let there be an event space ES.
Let there be some sets of objects OS[].
The probabilities of selecting any object are mutually disjoint.
Now, assume that the size of each set is based on a number X[i] assigned to it.
The size of each set rises exponentially with that number.
The base (B) used for exponentiation could be Euler's number (e), due to its nice properties, but let's assume that this might not be the case.
Now, we are after calculating the probability of selecting any member of a selected set, at random, while keeping in mind that the cardinality of each set might be very large.
After the sequence of probabilities is known, it's used to compute P[i]*C.
I wonder if this could be optimized or approximated for very large exponents, i.e. implemented so that it can be computed with low memory consumption.
A related question I found is here; still, the answers there seem to tackle only the opposite probabilities.
// Numerical example:
// A,C - constants, natural numbers
X[1] = 3432342332;
X[2] = 55438849;
X[3] = 34533;
P1 = A^X[1]/(A^X[1]+A^X[2]+A^X[3]);
P2 = A^X[2]/(A^X[1]+A^X[2]+A^X[3]);
P3 = A^X[3]/(A^X[1]+A^X[2]+A^X[3]);
R1 = P1 *C;
R2 = P2 *C;
R3 = P3 *C;
Excel fails when the exponents are larger than a few hundred.
So you have a number a>1, an integer array B of n elements, and for each i, you are to calculate a^B[i] / (a^B[1] + a^B[2] + ... + a^B[n]) .
Let C[i] = B[i] - max(B[1], ..., B[n]). Then you calculate
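a^C[i] / (a^C[1] + a^C[2] + ... + a^C[n]). This is the same value, because the numerator and the denominator have both been divided by a^max(B[1], ..., B[n]). Since all elements of C are now non-positive, every power lies in (0, 1] and you don't care about overflow.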
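A minimal sketch of this shift in Python, assuming a base greater than 1 as stated above. The function name `exponential_shares` is invented for the illustration, and the powers are evaluated as exp((x - max) * ln(base)) so that very negative exponents quietly underflow to 0.0 instead of raising an error.

```python
import math

def exponential_shares(base, exponents):
    """Return base**x / sum(base**y for y in exponents) for each x, without overflow."""
    if base <= 1:
        raise ValueError("this sketch assumes base > 1")
    top = max(exponents)
    log_base = math.log(base)
    # Every shifted exponent is <= 0, so each term lies in [0, 1];
    # the term belonging to the maximum is exactly 1.
    terms = [math.exp((x - top) * log_base) for x in exponents]
    total = sum(terms)
    return [t / total for t in terms]

# The exponents from the numerical example in the question.
shares = exponential_shares(2, [3432342332, 55438849, 34533])
```

With exponents this far apart, the largest set takes essentially the whole share and the others underflow to zero in double precision; if those tiny shares matter, they would have to be kept as logarithms instead.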

Generate random non repeating pairs of numbers within 2 ranges

I want to create random pairs of numbers within 2 ranges.
So for example if I want 3 random pairs of numbers where 10 < n1 < 20 and 30 < n2 < 50 then an acceptable output would be this: [[11,35],[15,30],[15,42]] but not [[11,35],[11,35],[12,39]]
I would like an efficient (both computationally and memory wise) algorithm to do this. The language doesn't really matter because I can adapt it later (although Python would be preferred).
So far the best idea I have had is to create a dictionary with all the possible numbers in n1 and as values a list of the numbers which have been used in n2. Then I can just pick a random n1 and find a number which hasn't been used in n1[n2] set.
This isn't very efficient space wise though and I'm hoping for something better. It also seems to be computationally inefficient to find a number not in n1[n2] many times.
I could also do the opposite and have the dictionary populated with all the numbers not used and just pop a random number off the list. But this would use much more space.
Is there any efficient way to do this? Is this a common problem?
Edit: It would be good if this could easily be expanded to more dimensions (so sets of N numbers). But this isn't really needed yet.
An integer pair (x, y) in [min_x, min_x + s) X [min_y, min_y + t) can be mapped to an integer m within the 1D space [min_x * t, (min_x + s) * t) by calculating m = x * t + y - min_y. The inverse mapping from m to (x, y) can be achieved by (m // t, min_y + m % t) in Python.
Therefore the problem is transformed to choosing multiple values from [min_x * t, (min_x + s) * t) without replacement (i.e. no duplicates in the returned sequence). This can be done by simply calling the random.sample function in Python. According to the doc, the underlying implementation is space efficient for sequence inputs. So the entire problem can be done in Python as shown in the following:
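from random import sample
# Example bounds taken from the question: 10 < n1 < 20 and 30 < n2 < 50
# max_x and max_y are exclusive while min_x and min_y are inclusive
min_x, max_x = 11, 20
min_y, max_y = 31, 50
t = max_y - min_y
sampled_pairs = [(m//t, min_y + m%t) for m in sample(range(min_x * t, max_x * t), k=3)]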
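The same mapping extends to tuples of N numbers, as asked in the question's edit, by treating the 1D index as a mixed-radix number. The sketch below is one way to do that; `sample_unique_tuples` and its `ranges` argument (a list of (low, high) pairs with high exclusive) are names chosen for this illustration only.

```python
import random

def sample_unique_tuples(ranges, k, rng=random):
    """Draw k distinct tuples; ranges is a list of (low, high) pairs, high exclusive."""
    sizes = [high - low for low, high in ranges]
    total = 1
    for size in sizes:
        total *= size
    result = []
    # Sampling from range(total) without replacement guarantees distinct tuples.
    for m in rng.sample(range(total), k):
        coords = []
        # Decode m as a mixed-radix number, last dimension varying fastest.
        for (low, _), size in zip(reversed(ranges), reversed(sizes)):
            m, digit = divmod(m, size)
            coords.append(low + digit)
        result.append(tuple(reversed(coords)))
    return result

# Three distinct pairs with 10 < n1 < 20 and 30 < n2 < 50.
pairs = sample_unique_tuples([(11, 20), (31, 50)], k=3)
```

Because `range` is lazy, memory use depends on k rather than on the size of the whole product space.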

How to select a uniformly distributed subset of a partially dense dataset?

P is an n*d matrix holding n d-dimensional samples. In some areas P is several times denser than in others. I want to select a subset of P in which the distance between any pair of samples is more than d0, and I need it to be spread over the whole area. All samples have the same priority and there's no need to optimize anything (e.g. covered area or sum of pairwise distances).
Here is a sample code that does so, but it's really slow. I need a more efficient code since I need to call it several times.
%% generating sample data
n_4 = 1000; n_2 = n_4*2;n = n_4*4;
x1=[ randn(n_4, 1)*10+30; randn(n_4, 1)*3 + 60];
y1=[ randn(n_4, 1)*5 + 35; randn(n_4, 1)*20 + 80 ];
x2 = rand(n_2, 1)*(max(x1)-min(x1)) + min(x1);
y2 = rand(n_2, 1)*(max(y1)-min(y1)) + min(y1);
P = [x1,y1;x2, y2];
%% eliminating close ones
d0 = 1.5;
D = pdist2(P, P);D(1:n+1:end) = inf;
E = zeros(n, 1); % eliminated ones
for i=1:n-1
    if ~E(i)
        CloseOnes = (D(i,:)<d0) & ((1:n)>i) & (~E');
        E(CloseOnes) = 1;
    end
end
P2 = P(~E, :);
%% plotting samples
subplot(121); scatter(P(:, 1), P(:, 2)); axis equal;
subplot(122); scatter(P2(:, 1), P2(:, 2)); axis equal;
Edit: How big the subset should be?
As j_random_hacker pointed out in the comments, one can say that P(1, :) is the fastest answer if we don’t define a constraint on the number of selected samples. It neatly shows the incoherence of the title! But I think the current title better describes the purpose. So let’s define a constraint: “Try to select m samples if it’s possible”. Now, with the implicit assumption of m=n, we can get the biggest possible subset. As I mentioned before, a faster method is preferable to one that finds the optimal answer.
Finding closest points over and over suggests a different data structure that is optimized for spatial searches. I suggest a Delaunay triangulation.
The solution below is "approximate" in the sense that it will likely remove more points than strictly necessary. I'm batching all the computations and, in each iteration, removing all points that contribute to distances that are too short; in many cases removing one point may already remove an edge that appears later in the same iteration. If this matters, the edge list can be further processed to avoid duplicates, or even to find points to remove that will impact the greatest number of distances.
This is fast.
dt = delaunayTriangulation(P(:,1), P(:,2));
d0 = 1.5;
while 1
    edge = edges(dt); % vertex ids in pairs
    % Lookup the actual locations of each point and reorganize
    pwise = reshape(dt.Points(edge.', :), 2, size(edge,1), 2);
    % Compute length of each edge
    difference = pwise(1,:,:) - pwise(2,:,:);
    edge_lengths = sqrt(difference(1,:,1).^2 + difference(1,:,2).^2);
    % Find edges less than minimum length
    idx = find(edge_lengths < d0);
    % stop once no edge is too short
    if isempty(idx)
        break;
    end
    % pick first vertex of each too-short edge for deletion
    % This could be smarter to avoid overdeleting
    points_to_delete = unique(edge(idx, 1));
    % remove them. triangulation auto-updates
    dt.Points(points_to_delete, :) = [];
    % repeat until no edge is too short
end
P2 = dt.Points;
You don't specify how many points you want to select. This is crucial to the problem.
I don't readily see a way to optimise your method.
Assuming that Euclidean distance is acceptable as a distance measure, the following implementation is much faster when selecting only a small number of points, and faster even when trying to find the subset with 'all' valid points (note that finding the maximum possible number of points is hard).
subplot(121); scatter(P(:, 1), P(:, 2)); axis equal;
d0 = 1.5;
m_range = linspace(1, 2000, 100);
m_time = NaN(size(m_range));
for m_i = 1:length(m_range);
    m = m_range(m_i)
    a = tic;
    % Test points in random order.
    r = randperm(n);
    r_i = 1;
    S = false(n, 1); % selected ones
    for i=1:m
        found = false;
        while ~found
            j = r(r_i);
            r_i = r_i + 1;
            if r_i > n
                % We have tried all points. Nothing else can be valid.
                break;
            end
            if sum(S) == 0
                % This is the first point.
                found = true;
                break;
            end
            % Get the points already selected
            P_selected = P(S, :);
            % Exclude points >= d0 along either axis - they cannot have
            % a Euclidean distance less than d0.
            P_valid = (abs(P_selected(:, 1) - P(j, 1)) < d0) & (abs(P_selected(:, 2) - P(j, 2)) < d0);
            if sum(P_valid) == 0
                % There are no points that can be < d0.
                found = true;
                break;
            end
            % Implement Euclidean distance explicitly rather than
            % using pdist - this makes a large difference to
            % timing.
            found = min(sqrt(sum((P_selected(P_valid, :) - repmat(P(j, :), sum(P_valid), 1)) .^ 2, 2))) >= d0;
        end
        if found
            % We found a valid point - select it.
            S(j) = true;
        else
            % Nothing found, so we must have exhausted all points.
            break;
        end
    end
    P2 = P(S, :);
    m_time(m_i) = toc(a);
    subplot(122); scatter(P2(:, 1), P2(:, 2)); axis equal;
end
plot(m_range, m_time);
hold on;
plot(m_range([1 end]), ones(2, 1) * original_time);
hold off;
where original_time is the time taken by your method. This gives the following timings, where the red line is your method, and the blue is mine, with the number of points selected along the x axis. Note that the line flattens when 'all' points meeting the criteria have been selected.
As you say in your comment, performance is highly dependent on the value of d0. In fact, as d0 is reduced, the method above appears to have even greater improvement in performance (this is for d0=0.1):
Note however that this is also dependent on other factors such as the distribution of your data. This method exploits specific properties of your data set, and reduces the number of expensive calculations by filtering out points where calculating the Euclidean distance is pointless. This works particularly well for selecting fewer points, and it is actually faster for smaller d0 because there are fewer points in the data set that match the criteria (so there are fewer computations of the Euclidean distance required). The optimal solution for a problem like this will usually be specific to the exact data set used.
Also note that in my code above, manually calculating the Euclidean distance is much faster than calling pdist. The flexibility and generality of the MATLAB built-ins is often detrimental to performance in simple cases.
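The filtering idea described above can be sketched in a few lines of Python: visit the points in random order, discard already-selected points that differ by at least d0 along either axis, and compute the Euclidean distance only for the remaining candidates. The function name `select_spread_subset` and its parameters are assumptions made for this illustration; it is a simplified greedy variant, not a port of the timing script.

```python
import math
import random

def select_spread_subset(points, d0, m=None, seed=None):
    """Greedily pick 2D points that are pairwise at least d0 apart (at most m of them)."""
    rng = random.Random(seed)
    order = list(range(len(points)))
    rng.shuffle(order)  # test points in random order
    selected = []
    for j in order:
        if m is not None and len(selected) >= m:
            break
        xj, yj = points[j]
        ok = True
        for xs, ys in selected:
            # Cheap per-axis test: a gap of >= d0 along either axis means
            # the Euclidean distance cannot be less than d0.
            if abs(xs - xj) >= d0 or abs(ys - yj) >= d0:
                continue
            if math.hypot(xs - xj, ys - yj) < d0:
                ok = False
                break
        if ok:
            selected.append((xj, yj))
    return selected

# Example: 300 random points in a 20 x 20 square, minimum spacing 1.5.
data_rng = random.Random(0)
pts = [(data_rng.uniform(0, 20), data_rng.uniform(0, 20)) for _ in range(300)]
chosen = select_spread_subset(pts, 1.5, seed=1)
```

Passing a small `m` stops early, which matches the observation above that selecting only a few points is much cheaper than selecting 'all' valid ones.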

Randomly pick elements from a vector of counts

I'm currently trying to optimize some MATLAB/Octave code by means of an algorithmic change, but can't figure out how to deal with some randomness here. Suppose that I have a vector V of integers, with each element representing a count of some things, photons in my case. Now I want to randomly pick some amount of those "things" and create a new vector of the same size, but with the counts adjusted.
Here's how I do this at the moment:
function W = photonfilter(V, eff)
% W = photonfilter(V, eff)
% Randomly takes photons from V according to the given efficiency.
% Args:
% V: Input vector containing the number of emitted photons in each
% timeslot (one element is one timeslot). The elements are rounded
% to integers before processing.
% eff: Filter efficiency. On the average, every 1/eff photon will be
% taken. This value must be in the range 0 < eff <= 1.
% W: Output row vector with the same length as V and containing the number
% of received photons in each timeslot.
% WARNING: This function operates on a photon-by-photon basis in that it
% constructs a vector with one element per photon. The storage requirements
% therefore directly depend on sum(V), not only on the length of V.
% Round V and make it flat.
Ntot = length(V);
V = round(V);
V = V(:);
% Initialize the photon-based vector, so that each element contains
% the original index of the photon.
idxV = zeros(1, sum(V), 'uint32');
iout = 1;
for i = 1:Ntot
    N = V(i);
    idxV(iout:iout+N-1) = i;
    iout = iout + N;
end
% Take random photons.
idxV = idxV(randperm(length(idxV)));
idxV = idxV(1:round(length(idxV)*eff));
% Generate the output vector by placing the remaining photons back
% into their timeslots.
[W, trash] = hist(idxV, 1:Ntot);
This is a rather straightforward implementation of the description above. But it has an obvious performance drawback: The function creates a vector (idxV) containing one element per single photon. So if my V has only 1000 elements but an average count of 10000 per element, the internal vector will have 10 million elements making the function slow and heavy.
What I'd like to achieve now is not to directly optimize this code, but to use some other kind of algorithm which immediately calculates the new counts without giving each photon some kind of "identity". This must be possible somehow, but I just can't figure out how to do it.
The output vector W must have the same number of elements as the input vector V.
W(i) must be an integer and bounded by 0 <= W(i) <= V(i).
The expected value of sum(W) must be sum(V)*eff.
The algorithm must somehow implement this "random picking" of photons, i.e. there should not be some deterministic part like "run through V dividing all counts by the stepsize and propagating the remainders", as the whole point of this function is to bring randomness into the system.
An explicit loop over V is allowed if unavoidable, but a vectorized approach is preferable.
Any ideas how to implement something like this? A solution using only a random vector and then some trickery with probabilities and rounding would be ideal, but I haven't had any success with that so far.
Thanks! Best regards, Philipp
The method you employ to compute W is called the Monte Carlo method, and indeed some optimizations are possible. One of them is this: instead of calculating the indices of photons, imagine a set of bins. Each bin has some probability, and the probabilities of all bins add up to 1. We divide the segment [0, 1] into parts whose lengths are proportional to the probabilities of the bins. Now, for every random number in [0, 1) that we generate, we can quickly find the bin it belongs to. Finally, we count the numbers in each bin to obtain the final result. The code below illustrates the idea.
% Population size (number of photons).
N = 1000000;
% Sample size, size of V and W as well.
% For convenience of plotting, V and W are of the same size, but
% the algorithm doesn't enforce this constraint.
M = 10000;
% Number of Monte Carlo iterations, greater numbers give better quality.
K = 100000;
% Generate population of counts, use gaussian distribution to test the method.
% If implemented correctly histograms should have the same shape eventually.
V = hist(randn(1, N), M);
P = cumsum(V / sum(V));
% For every generated random value find its bin and then count the bins.
% Finally we normalize counts by the ratio of N / K.
W = hist(lookup(P, rand(1, K)), M) * N / K;
% Compare distribution plots, they should be the same.
hold on;
plot(W, '+r');
plot(V, '*b');
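For readers without Octave's lookup, the bin search can be sketched with Python's standard library: the cumulative counts act as bin edges, and a binary search places each random draw in its bin. The function name `thin_counts` is invented here, and the number of draws is assumed to be round(sum(counts) * eff). Since the draws are made with replacement, an output count can occasionally exceed the corresponding input count.

```python
import bisect
import itertools
import random

def thin_counts(counts, eff, seed=None):
    """Distribute round(sum(counts) * eff) draws over bins weighted by counts."""
    rng = random.Random(seed)
    total = sum(counts)
    cumulative = list(itertools.accumulate(counts))  # right edge of every bin
    out = [0] * len(counts)
    for _ in range(round(total * eff)):
        u = rng.random() * total  # uniform in [0, total)
        # Binary search for the first bin whose right edge lies above u;
        # empty bins have zero width and are never hit.
        i = min(bisect.bisect_right(cumulative, u), len(counts) - 1)
        out[i] += 1
    return out

W = thin_counts([10, 0, 30, 60], eff=0.5, seed=1)
```

The working memory is proportional to the number of bins, not to the number of photons, which is the point of the approach.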
Based on the answer from Alexander Solovets, this is how the code now looks:
function W = photonfilter(V, eff, impl)
    % impl is optional and defaults to 1 (Monte Carlo approach).
    if nargin < 3
        impl = 1;
    end
    Ntot = length(V);
    V = V(:);
    if impl == 0
        % Original "straightforward" solution.
        V = round(V);
        idxV = zeros(1, sum(V), 'uint32');
        iout = 1;
        for i = 1:Ntot
            N = V(i);
            idxV(iout:iout+N-1) = i;
            iout = iout + N;
        end
        idxV = idxV(randperm(length(idxV)));
        idxV = idxV(1:round(length(idxV)*eff));
        [W, trash] = hist(idxV, 1:Ntot);
    else
        % Monte Carlo approach.
        Nphot = sum(V);
        P = cumsum(V / Nphot);
        W = hist(lookup(P, rand(1, round(Nphot * eff))), 0:Ntot-1);
    end
end
The results are quite comparable, as long as eff is not too close to 1 (with eff=1, the original solution yields W=V while the Monte Carlo approach still has some randomness, thereby violating the upper bound constraints).
Test in the interactive Octave shell:
octave:1> T=linspace(0,10*pi,10000);
octave:2> V=100*(1+sin(T));
octave:3> W1=photonfilter(V, 0.1, 0);
octave:4> W2=photonfilter(V, 0.1, 1);
octave:5> plot(T,V,T,W1,T,W2);
octave:6> legend('V','Random picking','Monte Carlo')
octave:7> sum(W1)
ans = 100000
octave:8> sum(W2)
ans = 100000

Weights Optimization in matlab

I have to do optimization in supervised learning to get my weights.
I have to learn the values (w1,w2,w3,w4) such that whenever the label of my vector A = [a1 a2 a3 a4] is 1, the sum w1*a1 + w2*a2 + w3*a3 + w4*a4 is greater than 0.5, and whenever the label is -1, the sum is less than 0.5.
Can somebody tell me how I can approach this problem in MATLAB? One way that I know is to use evolutionary algorithms: take a random value vector and then keep changing it to pick the best n values.
Is there any other way that this can be approached?
You can do it using linprog.
Let A be a matrix of size n by 4 consisting of all n training 4-vectors you have. You should also have a vector y with n elements (each either plus or minus 1), representing the label of each training 4-vector.
Using A and y we can write a linear program (look at the doc for the names of the parameters I'm using). Now, you do not have an objective function, so you can simply set f to be f = zeros(4,1);.
The only thing you have is an inequality constraint (< a_i , w > - .5) * y_i >= 0 (where <.,.> is a dot-product between 4-vector a_i and weight vector w).
If my calculations are correct, this constraint can be written as
cmat = bsxfun( @times, A, y );
Overall you get
w = linprog( zeros(4,1), -cmat, -.5*y );
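Rearranging (< a_i , w > - .5) * y_i >= 0 into the "matrix times w <= vector" form that linprog expects gives -y_i * < a_i , w > <= -.5 * y_i for every training vector. The Python sketch below builds exactly those rows and checks a candidate w against them. `build_constraints` and `satisfies` are hypothetical helper names, and the sketch only verifies feasibility; it does not solve the linear program.

```python
def build_constraints(A, y):
    """Rows G and bounds h such that G*w <= h encodes (a_i . w - 0.5) * y_i >= 0."""
    G = [[-yi * a for a in row] for row, yi in zip(A, y)]
    h = [-0.5 * yi for yi in y]
    return G, h

def satisfies(G, h, w, tol=1e-12):
    """True if every inequality G[i] . w <= h[i] holds (up to a small tolerance)."""
    return all(
        sum(g * wi for g, wi in zip(row, w)) <= hi + tol
        for row, hi in zip(G, h)
    )

# Two toy training vectors: the first is labelled +1, the second -1.
A = [[1, 0, 0, 0], [0, 1, 0, 0]]
y = [1, -1]
G, h = build_constraints(A, y)
good = satisfies(G, h, [1, 0, 0, 0])   # sums are 1 (label +1) and 0 (label -1)
bad = satisfies(G, h, [0, 1, 0, 0])    # sums are 0 (label +1) and 1 (label -1)
```

A check like this is a cheap way to confirm the sign conventions before handing the matrices to a solver.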
