optimization of pairwise L2 distance computations - performance

I need help optimizing this loop. Matrix_1 is an (n x 2) int matrix and Matrix_2 is an (m x 2) matrix; m and n vary.
index_j = 1;
for index_k = 1:size(Matrix_1,1)
    for index_l = 1:size(Matrix_2,1)
        M2_Index_Dist(index_j,:) = [index_l, sqrt(bsxfun(@plus,sum(Matrix_1(index_k,:).^2,2),sum(Matrix_2(index_l,:).^2,2)')-2*(Matrix_1(index_k,:)*Matrix_2(index_l,:)'))];
        index_j = index_j + 1;
    end
end
I need M2_Index_Dist to provide a ((n*m) x 2) matrix with the index of matrix_2 in the first column and the distance in the second column.
Output example:
M2_Index_Dist = [ 1, 5.465
2, 56.52
3, 6.21
1, 35.3
2, 56.52
3, 0
1, 43.5
2, 9.3
3, 236.1
1, 8.2
2, 56.52
3, 5.582]

Here's how to apply bsxfun with your formula (||A-B|| = sqrt(||A||^2 + ||B||^2 - 2*A*B)):
d = real(sqrt(bsxfun(@plus, dot(Matrix_1,Matrix_1,2), ...
    bsxfun(@minus, dot(Matrix_2,Matrix_2,2).', 2 * Matrix_1*Matrix_2.')))).';
You can avoid the final transpose if you change your interpretation of the matrix.
Note: In exact arithmetic there are no complex values for real to handle, but it's there in case rounding leaves a tiny negative number under the square root.
Edit: It may be faster without dot:
d = sqrt(bsxfun(@plus, sum(Matrix_1.*Matrix_1,2), ...
    bsxfun(@minus, sum(Matrix_2.*Matrix_2,2)', 2 * Matrix_1*Matrix_2.'))).';
Or with just one call to bsxfun:
d = sqrt(bsxfun(@plus, sum(Matrix_1.*Matrix_1,2), sum(Matrix_2.*Matrix_2,2)') ...
    - 2 * Matrix_1*Matrix_2.').';
Note: This last order of operations gives results identical to yours, rather than results that differ by an error of ~1e-14.
Edit 2: To replicate M2_Index_Dist:
II = ndgrid(1:size(Matrix_2,1),1:size(Matrix_1,1)); %// same size as d: rows index Matrix_2, columns index Matrix_1
M2_Index_Dist = [II(:) d(:)];
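For readers who want to check the expansion outside MATLAB, here is a minimal plain-Python sketch of the same formula. It is an illustration only, not part of the answer above: the function name pairwise_index_dist and the sample points are made up, and the max(d2, 0.0) clamp plays the role that real plays in the MATLAB version.

```python
import math

def pairwise_index_dist(matrix_1, matrix_2):
    """Return rows [j, dist] for each pair (row i of matrix_1, row j of matrix_2).

    j is 1-based to mirror the MATLAB output; j varies fastest, as in the original loop.
    """
    # Squared norms of every row, computed once.
    sq1 = [sum(x * x for x in a) for a in matrix_1]
    sq2 = [sum(x * x for x in b) for b in matrix_2]
    out = []
    for i, a in enumerate(matrix_1):
        for j, b in enumerate(matrix_2):
            dot = sum(x * y for x, y in zip(a, b))
            # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 * (a . b)
            d2 = sq1[i] + sq2[j] - 2 * dot
            # Clamp at zero so rounding can never put a negative under the root.
            out.append([j + 1, math.sqrt(max(d2, 0.0))])
    return out

# Hypothetical sample data (not from the question).
rows = pairwise_index_dist([[0, 0], [3, 4]], [[3, 4], [0, 0], [6, 8]])
print(rows)
```

The result has n*m rows, with the Matrix_2 index in the first column and the distance in the second, which is the layout asked for in the question.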

If I understand correctly, this does what you want:
ind = repmat((1:size(Matrix_2,1)).',size(Matrix_1,1),1); %'// first column: index
d = pdist2(Matrix_2,Matrix_1); %// compute distance between each pair of rows
d = d(:); %// second column: distance
result = [ind d]; %// build result from first column and second column
As you see, this code calls pdist2 to compute the distance between every pair of rows of your matrices. By default this function uses Euclidean distance.
If you don't have pdist2 (which is part of the Statistics Toolbox), you can replace line 2 above with bsxfun:
d = squeeze(sqrt(sum(bsxfun(@minus,Matrix_2,permute(Matrix_1, [3 2 1])).^2,2)));

Related

Find second minimum for each row of a matrix

I have a set i of customers and a set j of facilities. I have two binary variables: y[i,j] is 1 if client i is served by primary facility j, and 0 otherwise; b[i,j] is 1 if client i is served by backup facility j, and 0 otherwise.
Given the starting matrix d:
- I must set y[i,j] = 1 based on the minimum distance in each row of the matrix (this I have done);
- I have to fix b[i,j] = 1 according to the second minimum distance in each row of the matrix. I don't know how to do this: I wrote findmax in the code below as a placeholder, but that is not what I need. I've tried removing the first minimum from each row with pop, deleteat, splice, etc., but the solver gives me an error.
using JuMP
using Gurobi
using DelimitedFiles
import Random
import LinearAlgebra
import Plots
n = 3
m = 5
model = Model(Gurobi.Optimizer);
@variable(model, y[1:m,1:n] >= 0, Bin);
@variable(model, b[1:m,1:n] >= 0, Bin);
d = [
[80 20 40]
[71 55 24]
[56 47 81]
[10 20 30]
[31 41 21]
];
#PRIMARY ASSIGNMENTS
# 1) For each customer find the minimum d i-j and its position in the matrix, and build a vector V of all the d i-j just found
V = [];
for i = 1:m;
c = findmin(d[i,j] for j = 1:n);
push!(V,[c[1] ,c[2], i]);
end
println(V)
# 2) Sort vector's elements from the smallest to the largest
S = sort(V)
println(S)
for i = 1:m
println(S[i][2])
println(S[i][3])
end
# 3) Fix primary assignments for the first 50% of customers
for i = 1:3
fix(y[S[i][3], S[i][2]], 1.0, force = true);
end
# SECONDARY ASSIGNMENTS
# 1) For each customer find the second minimum d i-j and its position in the matrix, and build a vector W of all the d i-j just found
W = [];
for i = 1:m;
f = findmax(d[i,j] for j = 1:n);
push!(W,[f[1] ,f[2], i]);
end
println(W)
# 2) Sort vector's elements from the smallest to the largest
T = sort(W)
println(T)
for i = 1:3
println(T[i][2])
println(T[i][3])
end
# 3) Fix secondary assignments for the first 50% of customers
for i = 1:3
fix(b[T[i][3], T[i][2]], 1.0, force = true);
end
optimize!(model)
I tried to find for each line the second minimum, but I could not.
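To make the target concrete, here is a minimal sketch of what "second minimum of each row" could look like. It is written in plain Python rather than Julia/JuMP, it is only one possible reading of the requirement, and the function name second_minimum_per_row is hypothetical. It ranks the column indices of each row by value instead of deleting elements, so the column position of the second-smallest entry stays valid.

```python
def second_minimum_per_row(d):
    """For each row, return (second-smallest value, its 1-based column, 1-based row).

    The tuple layout mirrors the [value, column, customer] entries pushed to W above.
    Ties are broken by column order because sorted() is stable.
    """
    result = []
    for i, row in enumerate(d, start=1):
        # Column indices ordered by the value they hold; nothing is removed from the row.
        order = sorted(range(len(row)), key=lambda j: row[j])
        j2 = order[1]  # index of the second-smallest entry (assumes at least 2 columns)
        result.append((row[j2], j2 + 1, i))
    return result

# The distance matrix from the question.
d = [
    [80, 20, 40],
    [71, 55, 24],
    [56, 47, 81],
    [10, 20, 30],
    [31, 41, 21],
]
W = second_minimum_per_row(d)
print(W)
```

Sorting W then orders customers by their second-smallest distance, analogous to what the question's code does with sort(W).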

Rearrange list to satisfy a condition

I was asked this during a coding interview but wasn't able to solve this. Any pointers would be very helpful.
I was given an integer list (think of it as a number line) which needs to be rearranged so that the difference between elements is equal to M (an integer which is given). The list needs to be rearranged in such a way that the value of the max absolute difference between the elements' new positions and the original positions needs to be minimized. Eventually, this value multiplied by 2 is returned.
Test cases:
//1.
original_list = [1, 2, 3, 4]
M = 2
rearranged_list = [-0.5, 1.5, 3.5, 5.5]
// difference in values of original and rearranged lists
diff = [1.5, 0.5, 0.5, 1.5]
max_of_diff = 1.5 // list is rearranged in such a way so that this value is minimized
return_val = 1.5 * 2 = 3
//2.
original_list = [1, 2, 4, 3]
M = 2
rearranged_list = [-1, 1, 3, 5]
// difference in values of original and rearranged lists
diff = [2, 1, 1, 2]
max_of_diff = 2 // list is rearranged in such a way so that this value is minimized
return_val = 2 * 2 = 4
Constraints:
1 <= list_length <= 10^5
1 <= M <= 10^4
-10^9 <= list[i] <= 10^9
There's a question on leetcode which is very similar to this: https://leetcode.com/problems/minimize-deviation-in-array/ but there, the operations that are performed on the array are mentioned while that's not been mentioned here. I'm really stumped.
Here is how you can think of it:
The "rearranged" list is like a straight line that has a slope that corresponds to M.
Here is a visualisation for the first example:
The black dots are the input values [1, 2, 3, 4] where the index of the array is the X-coordinate, and the actual value at that index, the Y-coordinate.
The green line is determined by M. Initially this line runs through the origin at (0, 0). The red line segments represent the differences that must be taken into account.
Now the green line has to move vertically to its optimal position. We can see that we only need to look at the difference it makes with the first and with the last point. The other two inputs will never contribute to an extreme. This is generally true: there are only two input elements that need to be taken into account. They are the points that make the greatest (signed -- not absolute) difference and the least difference.
We can see that we need to move the green line in such a way that the signed differences with these two extremes are each other's opposite: i.e. their absolute values become the same, but their signs are opposite.
Twice this absolute difference is what we need to return, and it is actually the difference between the greatest (signed) difference and the least (signed) difference.
So, in conclusion, we must generate the values on the green line, find the least and greatest (signed) difference with the data points (Y-coordinates) and return the difference between those two.
Here is an implementation in JavaScript running the two examples you provided:
function solve(y, slope) {
    let low = Infinity;
    let high = -Infinity;
    for (let x = 0; x < y.length; x++) {
        let dy = y[x] - x * slope;
        low = Math.min(low, dy);
        high = Math.max(high, dy);
    }
    return high - low;
}
console.log(solve([1, 2, 3, 4], 2)); // 3
console.log(solve([1, 2, 4, 3], 2)); // 4
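The function above returns only the final value. As a hedged illustration of the reasoning, the following Python sketch also reconstructs the rearranged list: the best vertical offset for the line is the midpoint of the least and greatest signed differences, which makes the two extremes deviate by the same amount in opposite directions. The name rearrange is made up for this sketch.

```python
def rearrange(values, m):
    """Return (rearranged list, twice the minimized maximum deviation)."""
    # Signed difference between each point and the line through the origin with slope m.
    diffs = [y - x * m for x, y in enumerate(values)]
    low, high = min(diffs), max(diffs)
    # Shifting the line by the midpoint balances the two extreme differences.
    offset = (low + high) / 2
    rearranged = [offset + x * m for x in range(len(values))]
    return rearranged, high - low

print(rearrange([1, 2, 3, 4], 2))
print(rearrange([1, 2, 4, 3], 2))
```

For the two test cases in the question this reproduces the rearranged lists [-0.5, 1.5, 3.5, 5.5] and [-1, 1, 3, 5].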

Compute mean of columns for groups of rows in Octave

I have a matrix, for example:
1 2
3 4
4 5
And I also have a rule of grouping the rows, which is defined as a vector of group IDs like this:
1
2
1
Which means that the first and the third rows belong to the same group (ID 1) and the second row belong to another group (ID 2). So, I would like to compute the mean value for each group. Here is the result for my example:
2.5 3.5
3 4
More formally, there is a matrix A of size (m, n), a number of groups k and a vector v of size (m, 1), values of which are integers in range from 1 to k. The result is a matrix R of size (k, n), where each row with index r corresponds to the mean value of the group r.
Here is my solution (which does what I need) using for-loop in Octave:
R = zeros(k, n);
for r = 1:k
R(r, :) = mean(A((v == r), :), 1);
end
I wonder whether it could be vectorized. So, what I need is to replace the for-loop with a vectorized solution, which is going to be much more efficient than the iterative one.
Here is one of my many attempts (which do not work) to solve the problem in a vectorized way:
R = mean(A((v == 1:k), :);
As long as your data is floating point, you can do it manually: compute the per-group sums yourself with accumdim, then divide by the per-group counts. Like so:
octave:1> A = [1 2; 3 4; 4 5];
octave:2> subs = [1; 2; 1];
octave:3> accumdim (subs, A) ./ accumdim (subs, ones (rows (subs), 1))
ans =
2.5000 3.5000
3.0000 4.0000
You can consider it as a matrix multiplication problem. For instance, for your example this corresponds to
A = [1 2; 3 4; 4 5];
B = [0.5,0,0.5;0,1,0];
C = B*A
The main issue is to construct B from your list of indices efficiently. My suggestion is to use the implicit expansion of ==.
A = [1 2; 3 4; 4 5]; % Input data
idx = [1;2;1]; % Input Grouping
k = 2; % number of groups, ( = max(idx) )
m = 3; % Number of "observations"
Btmp = (idx == 1:k)'; % Mark locations
B = Btmp ./sum(Btmp,2); % Normalise
C = B*A
C =
2.5000 3.5000
3.0000 4.0000
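Both answers rest on the same idea: accumulate a sum and a count per group, then divide. A minimal plain-Python sketch of that idea follows. It is not Octave, the name group_means is hypothetical, and it assumes every group ID from 1 to k occurs at least once (otherwise the division fails).

```python
def group_means(a, v, k):
    """Mean of the rows of a for each group ID in v (IDs are integers 1..k)."""
    n = len(a[0])
    sums = [[0.0] * n for _ in range(k)]
    counts = [0] * k
    for row, g in zip(a, v):
        counts[g - 1] += 1
        for c, x in enumerate(row):
            sums[g - 1][c] += x
    # Assumption: no group is empty, so counts[r] is never zero.
    return [[s / counts[r] for s in sums[r]] for r in range(k)]

R = group_means([[1, 2], [3, 4], [4, 5]], [1, 2, 1], 2)
print(R)
```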

SPSS: select a subset of columns or rows from a matrix

How can I select a subset of columns or rows from a matrix in SPSS?
Given the following example, I want to compute a matrix X2 containing the first two columns of X.
MATRIX.
COMPUTE
X = {1, 2, 2;
0, -1, 1;
1, 1, -2}.
* Compute new matrix X2 that contains the first two columns of X
MAGIC CODE ;)
END MATRIX.
What is the syntax for matrix subsetting operations in SPSS?
You can subset a matrix, so it would be simply COMPUTE XSub = X(:,1:2). Full example below.
MATRIX.
COMPUTE X = {1, 2, 2;
0, -1, 1;
1, 1, -2}.
COMPUTE XSub = X(:,1:2).
PRINT XSub.
END MATRIX.
Regarding the add-on question in the comments: SPSS interprets 1:n as the row vector 1 2 3 ... n. You can also supply your own vector to subset the matrix, such as {1,3}, {2,2}, or {3,1}. The last of these returns the 3rd column first and the 1st column second in the subsetted matrix. Example below:
MATRIX.
COMPUTE X = {1, 2, 2;
0, -1, 1;
1, 1, -2}.
COMPUTE XSub = X(:,{3,1}).
PRINT XSub.
END MATRIX.
Which prints out
Run MATRIX procedure:
XSUB
2 1
1 0
-2 1
------ END MATRIX -----
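For comparison only, the same column selection can be sketched in plain Python. This is an analogy rather than SPSS syntax; select_columns is a made-up helper that takes 1-based column numbers in the order they should appear, which is the behavior described above for X(:,{3,1}).

```python
def select_columns(x, cols):
    """Pick columns of x by 1-based index, in the order given (repeats allowed)."""
    return [[row[c - 1] for c in cols] for row in x]

X = [[1, 2, 2],
     [0, -1, 1],
     [1, 1, -2]]

first_two = select_columns(X, [1, 2])   # analogous to X(:,1:2)
reordered = select_columns(X, [3, 1])   # analogous to X(:,{3,1})
print(first_two)
print(reordered)
```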
MATRIX.
COMPUTE X = {1, 2, 3; 4, 5, 6; 7, 8, 9}.
COMPUTE Y=MAKE(NROW(X),2,0).
LOOP i=1 to NROW(Y).
LOOP j=1 to NCOL(Y).
COMPUTE Y(i,j)=X(i,j).
END LOOP.
END LOOP.
PRINT X.
PRINT Y.
END MATRIX.

Vectorized search for permutations (with repetitions) that contain given subpermutations (with repetitions)

This question can be viewed as a continuation/extension/generalization of a previous question of mine.
Some definitions: I have a set of integers S = {1,2,...,s}, say s = 20, and two matrices N and M whose rows are finite sequences of numbers from S (i.e. permutations with possible repetitions), of order n and m respectively, where 1 <= n <= m. Let us think of N as a collection of candidate sub-sequences for the sequences from M.
Example: [2 3 4 3] is a sub-sequence of [1 2 2 3 5 4 1 3] that occurs with multiplicity 2 (=in how many different ways one can find the sub-seq. in the main seq.), whereas [3 2 2 3] is not a sub-sequence of it. In particular, a valid sub-sequence by definition must preserve the order of the indices.
Problem statement:
(P1) For each row of M, obtain the number of sub-sequences of it, with multiplicity and without multiplicity, that occur in N as rows (it can be zero if none are contained in N);
(P2) For each row of N, find out how many times, with multiplicity and without multiplicity, it is contained in M as a sub-sequence (again, this number can be zero);
Example: Let N = [1 2 2; 2 3 4] and M = [1 1 2 2 3; 1 2 2 3 4; 1 2 3 5 6]. Then (P1) returns [2; 3; 0] for 'with multiplicities' and [1; 2; 0] for 'without multiplicities'. (P2) returns [3; 2] for 'with multiplicities' and [2; 1] without multiplicities.
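To pin down the definitions, here is a small reference sketch in plain Python. It is not vectorized and not MATLAB; the names count_embeddings and p1_p2 are made up for illustration. It counts, by dynamic programming, in how many ways one sequence embeds into another while preserving order, then derives (P1) and (P2). It assumes the rows of N are distinct and reads "without multiplicity" as "counted at most once per (row of M, row of N) pair", which is the reading that matches the example numbers above.

```python
def count_embeddings(sub, seq):
    """Number of ways sub occurs in seq as an order-preserving sub-sequence."""
    # ways[k] = number of ways to match the first k items of sub so far.
    ways = [1] + [0] * len(sub)
    for item in seq:
        # Go backwards so each item of seq extends a match at most once per position.
        for k in range(len(sub), 0, -1):
            if sub[k - 1] == item:
                ways[k] += ways[k - 1]
    return ways[len(sub)]

def p1_p2(N, M):
    """Return (P1 with, P1 without, P2 with, P2 without) multiplicity."""
    counts = [[count_embeddings(n_row, m_row) for n_row in N] for m_row in M]
    p1_with = [sum(r) for r in counts]
    p1_without = [sum(1 for c in r if c > 0) for r in counts]
    p2_with = [sum(r[j] for r in counts) for j in range(len(N))]
    p2_without = [sum(1 for r in counts if r[j] > 0) for j in range(len(N))]
    return p1_with, p1_without, p2_with, p2_without

N = [[1, 2, 2], [2, 3, 4]]
M = [[1, 1, 2, 2, 3], [1, 2, 2, 3, 4], [1, 2, 3, 5, 6]]
print(p1_p2(N, M))
```

Because the work per pair of rows is proportional to the product of the two row lengths, a sketch like this avoids enumerating all combinations of columns, which may matter for rows with 30-40 columns.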
Order of magnitude: M could typically have up to 30-40 columns and a few thousand rows, although I currently have M with only a few hundred rows and ~10 columns. N could be approaching the size of M or could be also much smaller.
What I have so far: Not much, to be honest. I believe I might be able to slightly modify my not-very-well-vectorized solution from my previous question to tackle permutations with repetitions, but I am still thinking on that and will update as soon as I have something working. But given my (lack of) experience so far, it would be in all likelihood very suboptimal :(
Thanks!
Introduction : Owing to the repetitions in the input data in each row, the combination finding process doesn't have the sort of "uniqueness" among elements which was exploited in your previous problem and hence the loops used here. Also, note that the without multiplicity codes don't use nchoosek and as such, I feel more optimistic about them for performance.
Notations :
p1wim -> P1 with multiplicity
p2wim -> P2 with multiplicity
p1wom -> P1 without multiplicity
p2wom -> P2 without multiplicity
Codes :
I. Code for P1, 2 with multiplicity
permN = permute(N,[3 2 1]);
p1wim(size(M,1),1)=0;
p2wim(size(N,1),1)=0;
for k1 = 1:size(M,1)
    d1 = nchoosek(M(k1,:),size(N,2)); %// all sub-sequences of length size(N,2) (3 in the example)
    t1 = all(bsxfun(@eq,d1,permN),2);
    p1wim(k1) = sum(t1(:));
    p2wim = p2wim + squeeze(sum(t1,1));
end
II. Code for P1, 2 without multiplicity
eqmat = bsxfun(@eq,M,permute(N,[3 4 2 1])); %// equality matrix
[m,n,p,q] = size(eqmat); %// get sizes
inds = zeros(size(M,1),p,q); %// pre-allocate for indices array
vec1 = [1:m]'; %//' setup constants to loop
vec2 = [0:q-1]*m*n*p;
vec3 = permute([0:p-1]*m*n,[1 3 2]);
for iter = 1:p
    [~,ind1] = max(eqmat(:,:,iter,:),[],2);
    inds(:,iter,:) = reshape(ind1,m,1,q);
    ind2 = squeeze(ind1);
    ind3 = bsxfun(@plus,vec1,(ind2-1)*m); %//' setup forward moving equalities
    ind4 = bsxfun(@plus,ind3,vec2);
    ind5 = bsxfun(@plus,ind4,vec3);
    eqmat(ind5(:)) = 0;
end
p1wom = sum(all(diff(inds,[],2)>0,2),3);
p2wom = squeeze(sum(all(diff(inds,[],2)>0,2),1));
As usual, I would encourage you to use gpuArrays too with your favorite parfor.
This approach uses only one loop over the rows of M (P1) or N (P2). The code makes use of linear indexing and the very powerful bsxfun function. Note that if the number of columns is large you may experience problems because of nchoosek.
[mr mc] = size(M);
[nr nc] = size(N);
%// P1
combs = nchoosek(1:mc, nc)-1;
P1mu = NaN(mr,1);
P1nm = NaN(mr,1);
for r = 1:mr
    aux = M(r+mr*combs);
    P1mu(r) = sum(ismember(aux, N, 'rows'));
    P1nm(r) = sum(ismember(unique(aux, 'rows'), N, 'rows'));
end
%// P2. Multiplicity defined to span across different rows
rr = reshape(repmat(1:mr, size(combs,1), 1),[],1);
P2mu = NaN(nr,1);
P2nm = NaN(nr,1);
for r = 1:nr
    aux = M(bsxfun(@plus, rr, mr*repmat(combs, mr, 1)));
    P2mu(r) = sum(all(bsxfun(@eq, N(r,:), aux), 2));
    P2nm(r) = sum(all(bsxfun(@eq, N(r,:), unique(aux, 'rows')), 2));
end
%// P2. Multiplicity defined restricted to within one row
rr = reshape(repmat(1:mr, size(combs,1), 1),[],1);
P2mur = NaN(nr,1);
P2nmr = NaN(nr,1);
for r = 1:nr
    aux = M(bsxfun(@plus, rr, mr*repmat(combs, mr, 1)));
    P2mur(r) = sum(all(bsxfun(@eq, N(r,:), aux), 2));
    aux2 = unique([aux rr], 'rows'); %// concat rr to differentiate rows...
    aux2 = aux2(:,1:end-1); %// ...and now remove it
    P2nmr(r) = sum(all(bsxfun(@eq, N(r,:), aux2), 2));
end
Results for your example data:
P1mu =
2
3
0
P1nm =
1
2
0
P2mu =
3
2
P2nm =
1
1
P2mur =
3
2
P2nmr =
2
1
Some optimizations to the code would be possible. Not sure they are worth the effort:
- Replace repmat by another bsxfun (using a 3rd dimension). That may save some memory.
- Transpose the original matrices and work down columns, instead of along rows. That may be faster.

Resources