I have a really big (1.5M x 16M) sparse CSR scipy matrix A. What I need to compute is the similarity of each pair of rows. I have defined the similarity as follows:
Assume a and b are two rows of matrix A
a = (0, 1, 0, 4)
b = (1, 0, 2, 3)
Similarity (a, b) = 0*1 + 1*0 + 0*2 + 4*3 = 12
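In numpy terms, this is simply the dot product of the two rows:

import numpy as np

a = np.array([0, 1, 0, 4])
b = np.array([1, 0, 2, 3])
print(np.dot(a, b))  # 12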
To compute all pairwise row similarities I use this (or Cosine similarity):
AT = np.transpose(A)
pairs = A.dot(AT)
Now pairs[i, j] is the similarity of row i and row j for all such i and j.
This is quite similar to pairwise cosine similarity of rows, so if there is an efficient parallel algorithm that computes pairwise cosine similarity it would work for me as well.
The problem: this dot product is very slow because it uses just one CPU (I have access to 64 CPUs on my server).
I can also export A and AT to a file and run any other external program that does the multiplication in parallel and get the results back to the Python program.
Is there any more efficient way of doing this dot product, or of computing the pairwise similarity in parallel?
I finally used the 'cosine' distance metric of scikit-learn and its pairwise_distances function, which supports sparse matrices and is highly parallelised.
sklearn.metrics.pairwise.pairwise_distances(X, Y=None, metric='euclidean', n_jobs=1, **kwds)
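A minimal usage sketch (the random sparse matrix here is just a small stand-in for the real A; note that pairwise_distances returns a dense n_rows x n_rows array, so the full 1.5M-row matrix would still need to be processed in row blocks):

from sklearn.metrics.pairwise import pairwise_distances
from scipy.sparse import random as sparse_random

A_small = sparse_random(1000, 5000, density=0.001, format='csr')  # small stand-in for A
D = pairwise_distances(A_small, metric='cosine', n_jobs=-1)       # dense distance matrix, all cores
S = 1.0 - D                                                       # cosine similarity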
I could also divide A into n horizontal parts, use the parallel python package to run the multiplications in parallel, and stack the results back together afterwards; a rough sketch of this idea is shown below.
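A rough sketch of that block-splitting idea, assuming the standard multiprocessing module instead of the parallel python package (the function names are mine; each worker gets its own pickled copy of the data, so memory use can be substantial):

import numpy as np
from scipy import sparse
from multiprocessing import Pool

def _block_product(args):
    block, AT = args                      # a row block of A and the full transposed A
    return block.dot(AT)

def parallel_pairwise(A, n_jobs=4):
    A = A.tocsr()
    AT = A.T.tocsc()
    row_blocks = np.array_split(np.arange(A.shape[0]), n_jobs)
    chunks = [(A[rows, :], AT) for rows in row_blocks]
    with Pool(n_jobs) as pool:
        results = pool.map(_block_product, chunks)
    return sparse.vstack(results)         # stack the row blocks of A * A.T back together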
I also wrote my own implementation using sklearn. It is not parallel, but it is quite fast for large matrices.
from scipy.sparse import spdiags
from sklearn.preprocessing import normalize

def get_similarity_by_x_dot_x_greedy_for_memory(sp_matrix):
    sp_matrix = sp_matrix.tocsr()
    matrix = sp_matrix.dot(sp_matrix.T)
    # zero diagonal
    diag = spdiags(-matrix.diagonal(), [0], *matrix.shape, format='csr')
    matrix = matrix + diag
    return matrix

def get_similarity_by_cosine(sp_matrix):
    sp_matrix = normalize(sp_matrix.tocsr())
    return get_similarity_by_x_dot_x_greedy_for_memory(sp_matrix)
I have a set of points W = {(x1, y1), (x2, y2), ..., (xn, yn)} in the 2D plane. Can you find an algorithm that takes these points as input and returns a point (x, y) in the 2D plane which has the minimum sum of distances to the points in W? In other words, if
di = Euclidean_distance((x, y), (xi, yi))
I want to minimize:
d1 + d2 + ... + dn
The Problem
You're looking for the geometric median.
An Easy Solution
There is no closed-form solution to this problem, so iterative or probabilistic methods are used. The easiest way to find this is probably with Weiszfeld's algorithm:
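In each iteration the current estimate y_k is replaced by a distance-weighted average of the data points x_i:

y_{k+1} = ( sum_i x_i / ||x_i - y_k|| ) / ( sum_i 1 / ||x_i - y_k|| )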
We can implement this in Python as follows:
import numpy as np
from numpy.linalg import norm as npnorm

# pts is assumed to be a (POINT_NUM, 2) array of data points, as in the test data below
c_pt_old = np.random.rand(2)
c_pt_new = np.array([0, 0])
while npnorm(c_pt_old - c_pt_new) > 1e-6:
    num = 0
    denom = 0
    for i in range(POINT_NUM):
        dist = npnorm(c_pt_new - pts[i, :])
        num += pts[i, :] / dist
        denom += 1 / dist
    c_pt_old = c_pt_new
    c_pt_new = num / denom

print(c_pt_new)
There's a chance that Weiszfeld's algorithm won't converge, so it might be best to run it several times from different starting points.
A General Solution
You can also find this using second-order cone programming (SOCP). In addition to solving your specific problem, this general formulation then allows you to easily add constraints and weightings, such as variable uncertainty in the location of each data point.
To do so, you create a number of indicator variables representing the distance between the proposed center point and the data points.
You then minimize the sum of these indicator variables. The resulting program follows:
import cvxpy as cp
import numpy as np
import matplotlib.pyplot as plt
#Generate random test data
POINT_NUM = 100
pts = np.random.rand(POINT_NUM,2)
c_pt = cp.Variable(2) #The center point we wish to locate
distances = cp.Variable(POINT_NUM) #Distance from the center point to each data point
#Generate constraints. These are used to hold distances.
constraints = []
for i in range(POINT_NUM):
    constraints.append( cp.norm(c_pt - pts[i,:]) <= distances[i] )
objective = cp.Minimize(cp.sum(distances))
problem = cp.Problem(objective,constraints)
optimal_value = problem.solve()
print("Optimal value = {0}".format(optimal_value))
print("Optimal location = {0}".format(c_pt.value))
plt.scatter(x=pts[:,0], y=pts[:,1], s=1)
plt.scatter(c_pt.value[0], c_pt.value[1], s=10)
plt.show()
SOCPs are available in a number of solvers including CPLEX, Elemental, ECOS, ECOS_BB, GUROBI, MOSEK, CVXOPT, and SCS.
I've tested both approaches and they give the same answers to within tolerance.
Weiszfeld, E. (1937). "Sur le point pour lequel la somme des distances de n points donnés est minimum". Tohoku Mathematical Journal. 43: 355–386.
If that point does not need to be one of your samples, note that the mean minimises the sum of squared Euclidean distances; the point minimising the sum of (unsquared) distances is the geometric median, which is generally different.
A third method would be to use a compact nonlinear programming formulation. An unconstrained NLP model would be:
min sum(i, ||x-p(i)|| )
This has just 2 variables (the coordinates of x).
There is a very good initial point available. Let p(i,c) be the coordinates of the data points. Then the mean is
m(c) = sum(i, p(i,c)) / n
where n is the number of data points. This point is often very close to the optimal value of x. So we can use m as an excellent initial point for x.
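A minimal sketch of this NLP approach in Python, assuming scipy.optimize rather than an algebraic modelling language (the original answer does not specify a solver):

import numpy as np
from scipy.optimize import minimize

pts = np.random.rand(100, 2)                 # data points p(i)
m = pts.mean(axis=0)                         # the mean, used as the initial point

def total_distance(x):
    # objective: sum(i, ||x - p(i)||)
    return np.linalg.norm(pts - x, axis=1).sum()

res = minimize(total_distance, m)
print(res.x)                                 # approximate geometric median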
Some limited experiments indicate this approach is considerably faster than the cone programming formulation for large n.
For details, see the blog post "Finding the Central Point in a Point Cloud" on Yet Another Math Programming Consultant.
I found a Python library for Laplacian Score Feature Selection. But the implementation is seemingly different from the research paper.
I implemented the selection method according to the algorithm from the paper (https://papers.nips.cc/paper/2909-laplacian-score-for-feature-selection.pdf).
However, I found a Python library that implements the Laplacian method (https://github.com/jundongl/scikit-feature/blob/master/skfeature/function/similarity_based/lap_score.py).
So to check that my implementation was correct, I ran both versions on my dataset, and got different answers. In the process of debugging, I saw that the library used different formulas when calculating the affinity matrix (The S matrix from the paper).
The paper uses this formula:
S_ij = exp( -||x_i - x_j||^2 / t )
while the library uses:
W_ij = exp( -||x_i - x_j||^2 / (2*t^2) )
Further investigation revealed that the library calculates the affinity matrix as follows:
t = kwargs['t']
# compute pairwise euclidean distances
D = pairwise_distances(X)
D **= 2
# sort the distance matrix D in ascending order
dump = np.sort(D, axis=1)
idx = np.argsort(D, axis=1)
idx_new = idx[:, 0:k+1]
dump_new = dump[:, 0:k+1]
# compute the pairwise heat kernel distances
dump_heat_kernel = np.exp(-dump_new/(2*t*t))
G = np.zeros((n_samples*(k+1), 3))
G[:, 0] = np.tile(np.arange(n_samples), (k+1, 1)).reshape(-1)
G[:, 1] = np.ravel(idx_new, order='F')
G[:, 2] = np.ravel(dump_heat_kernel, order='F')
# build the sparse affinity matrix W
W = csc_matrix((G[:, 2], (G[:, 0], G[:, 1])), shape=(n_samples, n_samples))
bigger = np.transpose(W) > W
W = W - W.multiply(bigger) + np.transpose(W).multiply(bigger)
return W
I'm not sure why the library squares each value in the distance matrix. I see that they also do some reordering, and they use a different heat kernel formula.
So I'd just like to know whether either of the resources (the paper or the library) is wrong, whether they're somehow equivalent, or why they differ.
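For reference, here is a minimal dense sketch of how I read the paper's kernel on a k-nearest-neighbour graph (my own code, not the library's, so the graph construction details are assumptions):

import numpy as np
from sklearn.metrics.pairwise import pairwise_distances

def paper_affinity(X, t, k=5):
    D2 = pairwise_distances(X) ** 2            # squared Euclidean distances
    S = np.exp(-D2 / t)                        # the paper's heat kernel
    # keep only each sample's k nearest neighbours (plus itself), symmetrised
    idx = np.argsort(D2, axis=1)[:, :k + 1]
    mask = np.zeros_like(S, dtype=bool)
    mask[np.arange(X.shape[0])[:, None], idx] = True
    return np.where(mask | mask.T, S, 0.0)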
I have a list of small image blocks. All of them are the same size (e.g. 16x16). They are black and white, with pixel values between 0 and 255. I would like to sort those images to minimize the difference between them.
What I have in mind is to compute the mean absolute difference (MAD) of pixels between adjacent blocks (e.g. block N vs block N+1, block N+1 vs block N+2, ...).
From this I can calculate the sum of those MAD values (e.g. sum = mad1 + mad2 + ...). What I would like is to find the ordering that minimizes that sum.
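For concreteness, the quantity to minimize could be computed like this (a sketch; blocks is assumed to be a list of equal-sized numpy arrays and order a permutation of their indices):

import numpy as np

def total_adjacent_mad(blocks, order):
    # sum of mean absolute differences between consecutive blocks in the given order
    return sum(np.abs(blocks[a].astype(int) - blocks[b].astype(int)).mean()
               for a, b in zip(order, order[1:]))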
Before: [image of the unsorted blocks]
After: [image of a hand-made ordering] (this was done by hand only to provide an example; there is probably a better ordering for those blocks, especially the ones with vertical stripes)
Based on RaymoAisla's comment, which pointed out that this is similar to the Travelling Salesman Problem, I decided to implement a solution using the nearest-neighbour algorithm. While not perfect, it gives good results:
Here is the code:
from PIL import Image
import numpy as np

def sortImages(images):  # nearest-neighbour ordering of the blocks
    result = []
    tovisit = list(range(len(images)))  # list() so that .remove() works in Python 3
    best = max(tovisit, key=lambda x: images[x].sum())  # brightest block as the first one
    tovisit.remove(best)
    result.append(best)
    while tovisit:
        best = min(tovisit, key=lambda x: MAD(images[x], images[best]))  # nearest neighbour of the last block
        tovisit.remove(best)
        result.append(best)
    return result  # indices of the sorted image blocks

def MAD(a, b):  # sum of absolute differences (proportional to MAD for equal-sized blocks)
    return np.absolute(a - b).sum()

images = ...  # should contain b&w images only (e.g. from PIL Image)
arrays = [np.array(x).astype(int) for x in images]  # convert images to numpy arrays
result = sortImages(arrays)
I have an RGB uint8 image of size 576x720x3 in which I want to classify each pixel to a set of colors. I have transformed it using rgb2lab from RGB to LAB space, and then removed the L layer, so it is now a double array of size 576x720x2 consisting of the A and B channels.
Now, I want to classify this to some colors that I have trained on another image, and calculated their respective AB-representations as:
Cluster 1: -17.7903 -13.1170
Cluster 2: -30.1957 40.3520
Cluster 3: -4.4608 47.2543
Cluster 4: 46.3738 36.5225
Cluster 5: 43.3134 -17.6443
Cluster 6: -0.9003 1.4042
Cluster 7: 7.3884 11.5584
Now, in order to classify/label each pixel to a cluster 1-7, I currently do the following (pseudo-code):
idx = zeros(size(im,1), size(im,2));
for x = 1:size(im,1)
    for y = 1:size(im,2)
        ab = squeeze(im(x,y,2:3)).';                          % 1x2 AB vector of this pixel
        dist = sqrt(sum(bsxfun(@minus, clusters, ab).^2, 2)); % distance from ab to each cluster centre
        [~, idx(x,y)] = min(dist);                            % index of the nearest cluster
    end
end
However, this is terribly slow (52 seconds) because of the image resolution and the fact that I manually loop over each x and y.
Are there some built-in functions I can use that performs the same job? There must be.
To summarize: I need a classification method that classifies pixel images to an already defined set of clusters.
Approach #1
For an N x 2 array of points/pixels, you can avoid the permute suggested in the other solution by Luis (which could slow things down a bit) by using a kind of "permute-unrolled" version of it, letting bsxfun work on a 2D array instead of a 3D array, which should be better for performance.
Thus, assuming clusters is an N x 2 array, you may try this other bsxfun-based approach -
%// Get a's and b's
im_a = im(:,:,2);
im_b = im(:,:,3);
%// Get the minimum indices that correspond to the cluster IDs
[~,idx] = min(bsxfun(@minus,im_a(:),clusters(:,1).').^2 + ...
              bsxfun(@minus,im_b(:),clusters(:,2).').^2,[],2);
idx = reshape(idx,size(im,1),[]);
Approach #2
You can try out another approach that leverages fast matrix multiplication in MATLAB and is based on this smart solution -
d = 2; %// dimension of the problem size
im23 = reshape(im(:,:,2:3),[],2);
numA = size(im23,1);
numB = size(clusters,1);
A_ext = zeros(numA,3*d);
B_ext = zeros(numB,3*d);
for id = 1:d
    A_ext(:,3*id-2:3*id) = [ones(numA,1), -2*im23(:,id), im23(:,id).^2 ];
    B_ext(:,3*id-2:3*id) = [clusters(:,id).^2 , clusters(:,id), ones(numB,1)];
end
[~, idx] = min(A_ext * B_ext',[],2); %//'
idx = reshape(idx, size(im,1),[]); %// Desired IDs
What’s going on with the matrix multiplication based distance matrix calculation?
Let us consider two matrices A and B between which we want to calculate the distance matrix. For the sake of an easier explanation, let us consider A as a 3 x 2 and B as a 4 x 2 array, indicating that we are working with X-Y points. If A were N x 3 and B were M x 3, those would be X-Y-Z points.
Now, if we have to manually calculate the first element of the squared distance matrix, it would look like this:
first_element = ( A(1,1) - B(1,1) )^2 + ( A(1,2) - B(1,2) )^2
which would be:
first_element = A(1,1)^2 + B(1,1)^2 - 2*A(1,1)*B(1,1) + ...
                A(1,2)^2 + B(1,2)^2 - 2*A(1,2)*B(1,2)    ... Equation (1)
Now, according to our proposed matrix multiplication, if you check the output of A_ext and B_ext after the loop in the earlier code ends, their rows look like this (for d = 2):
A_ext(i,:) = [ 1, -2*A(i,1), A(i,1)^2, 1, -2*A(i,2), A(i,2)^2 ]
B_ext(j,:) = [ B(j,1)^2, B(j,1), 1, B(j,2)^2, B(j,2), 1 ]
So, if you perform matrix multiplication between A_ext and the transpose of B_ext, the first element of the product is the sum of the elementwise products of the first rows of A_ext and B_ext, i.e. the sum of these terms:
B(1,1)^2 - 2*A(1,1)*B(1,1) + A(1,1)^2 + B(1,2)^2 - 2*A(1,2)*B(1,2) + A(1,2)^2
The result would be identical to the result obtained from Equation (1) earlier. This would continue for all the elements of A against all the elements of B that are in the same column as in A. Thus, we would end up with the complete squared distance matrix. That’s all there is!!
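As a quick sanity check of the same identity outside MATLAB, here is a small numpy sketch (not part of the original answer) for the d = 2 case:

import numpy as np

A = np.random.rand(3, 2)
B = np.random.rand(4, 2)
A_ext = np.hstack([np.ones((3, 1)), -2 * A[:, [0]], A[:, [0]]**2,
                   np.ones((3, 1)), -2 * A[:, [1]], A[:, [1]]**2])
B_ext = np.hstack([B[:, [0]]**2, B[:, [0]], np.ones((4, 1)),
                   B[:, [1]]**2, B[:, [1]], np.ones((4, 1))])
D2 = A_ext @ B_ext.T                                   # squared distances via one matmul
ref = ((A[:, None, :] - B[None, :, :])**2).sum(-1)     # direct computation
print(np.allclose(D2, ref))                            # True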
Vectorized Variations
Vectorized variations of the matrix multiplication based distance matrix calculations are possible, though there weren't any big performance improvements seen with them. Two such variations are listed next.
Variation #1
[nA,dim] = size(A);
nB = size(B,1);
A_ext = ones(nA,dim*3);
A_ext(:,2:3:end) = -2*A;
A_ext(:,3:3:end) = A.^2;
B_ext = ones(nB,dim*3);
B_ext(:,1:3:end) = B.^2;
B_ext(:,2:3:end) = B;
distmat = A_ext * B_ext.';
Variation #2
[nA,dim] = size(A);
nB = size(B,1);
A_ext = [ones(nA*dim,1) -2*A(:) A(:).^2];
B_ext = [B(:).^2 B(:) ones(nB*dim,1)];
A_ext = reshape(permute(reshape(A_ext,nA,dim,[]),[1 3 2]),nA,[]);
B_ext = reshape(permute(reshape(B_ext,nB,dim,[]),[1 3 2]),nB,[]);
distmat = A_ext * B_ext.';
So, these could be considered as experimental versions too.
Use pdist2 (Statistics Toolbox) to compute the distances in a vectorized manner:
ab = im(:,:,2:3); % // get A, B components
ab = reshape(ab, [size(im,1)*size(im,2) 2]); % // reshape into 2-column
dist = pdist2(clusters, ab); % // compute distances
[~, idx] = min(dist); % // find minimizer for each pixel
idx = reshape(idx, size(im,1), size(im,2)); % // reshape result
If you don't have the Statistics Toolbox, you can replace the third line by
dist = squeeze(sum(bsxfun(@minus, clusters, permute(ab, [3 2 1])).^2, 2));
This gives squared distance instead of distance, but for the purposes of minimizing it doesn't matter.
I'm working on a data mining algorithm where I want to pick a random direction from a particular point in the feature space.
If I pick a random number for each of the n dimensions from [-1,1] and then normalize the vector to a length of 1 will I get an even distribution across all possible directions?
I'm speaking only theoretically here since computer generated random numbers are not actually random.
One simple trick is to select each dimension from a gaussian distribution, then normalize:
from random import gauss

def make_rand_vector(dims):
    vec = [gauss(0, 1) for i in range(dims)]
    mag = sum(x**2 for x in vec) ** .5
    return [x/mag for x in vec]
For example, if you want a 7-dimensional random vector, select 7 random values (from a Gaussian distribution with mean 0 and standard deviation 1). Then, compute the magnitude of the resulting vector using the Pythagorean formula (square each value, add the squares, and take the square root of the result). Finally, divide each value by the magnitude to obtain a normalized random vector.
If your number of dimensions is large, this has the strong benefit of always working immediately, whereas generating random vectors until you find one whose magnitude is less than one will cause your computer to simply hang beyond a dozen or so dimensions, because the probability of any sample qualifying becomes vanishingly small.
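If you need many directions at once, the same idea vectorizes naturally with numpy (a sketch, not part of the original answer; the function name is mine):

import numpy as np

def make_rand_vectors(n_vectors, dims):
    v = np.random.standard_normal((n_vectors, dims))      # Gaussian samples, one row per vector
    return v / np.linalg.norm(v, axis=1, keepdims=True)   # normalize each row to unit length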
You will not get a uniformly distributed ensemble of angles with the algorithm you described. The angles will be biased toward the corners of your n-dimensional hypercube.
This can be fixed by eliminating any points with distance greater than 1 from the origin. Then you're dealing with a spherical rather than a cubical (n-dimensional) volume, and your set of angles should then be uniformly distributed over the sample space.
Pseudocode:
Let n be the number of dimensions, K the desired number of vectors:
vec_count = 0
while vec_count < K
    generate n uniformly distributed values a[0..n-1] over [-1, 1]
    r_squared = sum over i=0..n-1 of a[i]^2
    if 0 < r_squared <= 1.0
        b[i] = a[i]/sqrt(r_squared)   ; normalize to length 1
        add vector b[0..n-1] to output list
        vec_count = vec_count + 1
    else
        reject this sample
end while
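A direct Python translation of that pseudocode might look like this (a sketch; the function name is mine):

import numpy as np

def random_unit_vectors_rejection(n, K):
    out = []
    while len(out) < K:
        a = np.random.uniform(-1.0, 1.0, size=n)    # n uniform values in [-1, 1]
        r_squared = np.dot(a, a)
        if 0.0 < r_squared <= 1.0:                  # keep only samples inside the unit ball
            out.append(a / np.sqrt(r_squared))      # normalize to length 1
    return np.array(out)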
There is a Boost implementation of the algorithm that samples from normal distributions: random::uniform_on_sphere.
I had the exact same question while developing an ML algorithm.
I got to the same conclusion as Jim Lewis after drawing samples for the 2-d case and plotting the resulting distribution of the angle.
Furthermore, if you try to derive the density distribution of the direction in 2D when you draw x and y uniformly at random from [-1,1], you will see that:
f_X(x) = 1/(4*cos²(x)) if 0 < x < 45°
and
f_X(x) = 1/(4*sin²(x)) if x > 45°
where x is the angle and f_X is the probability density function.
I have written about this here:
https://aerodatablog.wordpress.com/2018/01/14/random-hyperplanes/
For the 3D case specifically, Marsaglia's method gives a uniformly distributed point on the unit sphere:
// unitrand() returns a uniform value in [-1,1].
double u, v, s;
do {
    u = unitrand();
    v = unitrand();
    s = u*u + v*v;
} while (s >= 1.0);               // reject samples outside the unit disk
double w = 2.0 * sqrt(1.0 - s);
double x = w * u;
double y = w * v;
double z = 1.0 - 2.0 * s;