Faster alternative to compute convolution in the frequency domain in torch

I'm implementing a custom convolution in torch, which converts an image to the frequency domain using FFT, computes the product between kernel and image and then computes the inverse FFT. Although it works, I've noticed that the product computation is rather slow. Is there a way to optimize it?
I've added a timer to everything to see how it goes and got the following results (testing on cpu):
squeezes - done in 4.17232513E-05s
real - done in 1.67846680E-04s
im - done in 7.53402710E-05s
stack - done in 8.36849213E-05s
sum - done in 3.96490097E-04s
bias - done in 1.64508820E-05s
Here is my implementation:
Note that I'm not computing the FFT and its inverse in here. The torch implementations of these operations are rather fast.
def fconv2d(input, kernel, bias=None):
    # Computes the convolution in the frequency domain given
    # an input of shape (B, Cin, H, W) and kernel of shape (Cout, Cin, H, W).
    # Expects input and kernel already in frequency domain!
    with timer('squeezes'):
        # Expand kernel to (B, Cout, Cin, H, W)
        kernel = kernel.unsqueeze(0)
        # Expand input to (B, Cout, Cin, H, W)
        input = input.unsqueeze(1)
    # Compute the multiplication
    # (a+bj)*(c+dj) = (ac-bd)+(ad+bc)j
    with timer('real'):
        real = input[..., 0] * kernel[..., 0] - \
            input[..., 1] * kernel[..., 1]
    with timer('im'):
        im = input[..., 0] * kernel[..., 1] + \
            input[..., 1] * kernel[..., 0]
    # Stack both channels and sum-reduce the input channels dimension
    with timer('stack'):
        out = torch.stack([real, im], -1)
    with timer('sum'):
        out = out.sum(dim=-4)
    # Add bias
    with timer('bias'):
        if bias is not None:
            bias = bias.expand(1, 1, 1, bias.shape[0]).permute(0, 3, 1, 2)
            out += bias
    return out
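This is not from the original post, but one direction that might be worth timing is to express the multiply-and-sum as a single contraction over the input-channel axis, so the full (B, Cout, Cin, H, W) product is never stored as a separate tensor. The sketch below uses NumPy with complex arrays purely to keep it dependency-free; the shapes and the helper names fconv2d_broadcast and fconv2d_einsum are assumptions made for illustration. In torch, torch.view_as_complex and torch.einsum play the analogous roles, and whether this is actually faster would need to be measured.

```python
import numpy as np

def fconv2d_broadcast(x, k):
    # Same idea as the original: broadcast to (B, Cout, Cin, H, W),
    # multiply element-wise, then sum over the Cin axis.
    return (x[:, None] * k[None]).sum(axis=2)

def fconv2d_einsum(x, k):
    # Contract over the shared input-channel index i in one call.
    # x: (B, Cin, H, W) complex, k: (Cout, Cin, H, W) complex
    return np.einsum('bihw,oihw->bohw', x, k)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 4, 4)) + 1j * rng.standard_normal((2, 3, 4, 4))
k = rng.standard_normal((5, 3, 4, 4)) + 1j * rng.standard_normal((5, 3, 4, 4))

out_a = fconv2d_broadcast(x, k)
out_b = fconv2d_einsum(x, k)
print(out_a.shape, np.allclose(out_a, out_b))
```

Both helpers return an array of shape (B, Cout, H, W); the einsum form only changes how the reduction is expressed, not the result.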

Related

Why do i have "OutOfMemoryError" in my Kmeans CuPy code?

I'm really new to GPU programming. I found this k-means CuPy code, and my goal is to run it on a large dataset of shape (n, 3) to compare the timing on GPU and CPU. I want to use a huge number of clusters, but I am getting a memory management error. Can someone point me to what I should research to fix it? I have already looked around, but I don't have a clear starting point yet.
import contextlib
import time
import cupy
import matplotlib.pyplot as plt
import numpy
@contextlib.contextmanager
def timer(message):
    cupy.cuda.Stream.null.synchronize()
    start = time.time()
    yield
    cupy.cuda.Stream.null.synchronize()
    end = time.time()
    print('%s: %f sec' % (message, end - start))
var_kernel = cupy.ElementwiseKernel(
    'T x0, T x1, T c0, T c1', 'T out',
    'out = (x0 - c0) * (x0 - c0) + (x1 - c1) * (x1 - c1)',
    'var_kernel'
)
sum_kernel = cupy.ReductionKernel(
    'T x, S mask', 'T out',
    'mask ? x : 0',
    'a + b', 'out = a', '0',
    'sum_kernel'
)
count_kernel = cupy.ReductionKernel(
    'T mask', 'float32 out',
    'mask ? 1.0 : 0.0',
    'a + b', 'out = a', '0.0',
    'count_kernel'
)
def fit_xp(X, n_clusters, max_iter):
    assert X.ndim == 2
    # Get NumPy or CuPy module from the supplied array.
    xp = cupy.get_array_module(X)
    n_samples = len(X)
    # Make an array to store the labels indicating which cluster each sample is
    # contained.
    pred = xp.zeros(n_samples)
    # Choose the initial centroid for each cluster.
    initial_indexes = xp.random.choice(n_samples, n_clusters, replace=False)
    centers = X[initial_indexes]
    for _ in range(max_iter):
        # Compute the new label for each sample.
        distances = xp.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_pred = xp.argmin(distances, axis=1)
        # If the label is not changed for each sample, we suppose the
        # algorithm has converged and exit from the loop.
        if xp.all(new_pred == pred):
            break
        pred = new_pred
        # Compute the new centroid for each cluster.
        i = xp.arange(n_clusters)
        mask = pred == i[:, None]
        sums = xp.where(mask[:, :, None], X, 0).sum(axis=1)
        counts = xp.count_nonzero(mask, axis=1).reshape((n_clusters, 1))
        centers = sums / counts
    return centers, pred
def fit_custom(X, n_clusters, max_iter):
    assert X.ndim == 2
    n_samples = len(X)
    pred = cupy.zeros(n_samples,dtype='float32')
    initial_indexes = cupy.random.choice(n_samples, n_clusters, replace=False)
    centers = X[initial_indexes]
    for _ in range(max_iter):
        distances = var_kernel(X[:, None, 0], X[:, None, 1],
                               centers[None, :, 1], centers[None, :, 0])
        new_pred = cupy.argmin(distances, axis=1)
        if cupy.all(new_pred == pred):
            break
        pred = new_pred
        i = cupy.arange(n_clusters)
        mask = pred == i[:, None]
        sums = sum_kernel(X, mask[:, :, None], axis=1)
        counts = count_kernel(mask, axis=1).reshape((n_clusters, 1))
        centers = sums / counts
    return centers, pred
def draw(X, n_clusters, centers, pred, output):
    # Plot the samples and centroids of the fitted clusters into an image file.
    for i in range(n_clusters):
        labels = X[pred == i]
        plt.scatter(labels[:, 0], labels[:, 1], c=numpy.random.rand(3))
    plt.scatter(
        centers[:, 0], centers[:, 1], s=120, marker='s', facecolors='y',
        edgecolors='k')
    plt.savefig(output)
def run_cpu(gpuid, n_clusters, num, max_iter, use_custom_kernel):##, output
    samples = numpy.random.randn(num, 3)
    X_train = numpy.r_[samples + 1, samples - 1]
    with timer(' CPU '):
        centers, pred = fit_xp(X_train, n_clusters, max_iter)
def run_gpu(gpuid, n_clusters, num, max_iter, use_custom_kernel):##, output
    samples = numpy.random.randn(num, 3)
    X_train = numpy.r_[samples + 1, samples - 1]
    with cupy.cuda.Device(gpuid):
        X_train = cupy.asarray(X_train)
        with timer(' GPU '):
            if use_custom_kernel:
                centers, pred = fit_custom(X_train, n_clusters, max_iter)
            else:
                centers, pred = fit_xp(X_train, n_clusters, max_iter)
By the way, I am working in Colab Pro with 25 GB of RAM. The code works with n_clusters=200 and num=1000000, but if I use bigger numbers the error appears. I am running the code like this:
run_gpu(0,200,1000000,10,True)
This is the error that I get.
Any suggestion is welcome. Thanks for your time.

Assuming that CuPy is clever enough not to create explicit copies of the broadcasted input of var_kernel, the output distances has to have a size of 2 * num * num_clusters, which accounts for exactly the 6,400,000,000 bytes it is trying to allocate. You could get a much smaller memory footprint by never actually writing the distances to memory, which means fusing the var_kernel with argmin. See this part of the docs.
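To make the size estimate concrete, here is a small back-of-the-envelope sketch. It assumes float64 data (the default for numpy.random.randn) and that the sample count is 2 * num because of numpy.r_[samples + 1, samples - 1]; the helper name distances_nbytes is made up for this illustration. The question does not state which parameters triggered the error, so the numbers below only cover the configuration that is reported to work and what happens when one parameter is doubled.

```python
import numpy as np

def distances_nbytes(num, n_clusters, dtype=np.float64):
    # The training set stacks two shifted copies of the samples,
    # so it holds 2 * num rows.
    n_samples = 2 * num
    # distances has one entry per (sample, cluster) pair.
    return n_samples * n_clusters * np.dtype(dtype).itemsize

working = distances_nbytes(1_000_000, 200)
doubled = distances_nbytes(1_000_000, 400)
print(working)  # 3200000000
print(doubled)  # 6400000000
```

Under these assumptions the configuration that runs already needs about 3.2 GB for a single distances array, and doubling either num or n_clusters reaches 6.4 GB.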
If I understand the example there correctly, this should work:
@cupy.fuse(kernel_name='argmin_distance')
def argmin_distance(x1, y1, x2, y2):
    return cupy.argmin((x1 - x2) * (x1 - x2) + (y1 - y2) * (y1 - y2), axis=1)
The next question would be where the other 13.7GB come from. A big part of them might just be the instances of distances from earlier iterations. I'm not a CuPy expert, but at least in Python/Numpy your use of distances inside the loop would not reuse the same memory, but allocate more memory each time you call the var_kernel. The same problem is visible with pred which is allocated before the loop. If CuPy does things the Numpy way, the solution would be to just put [:] in there like
pred[:] = new_pred
or
distances[:, :] = var_kernel(X[:, None, 0], X[:, None, 1],
                             centers[None, :, 1], centers[None, :, 0])
For this to work, you need to allocate distances before the loop as well. Also this isn't needed anymore when using kernel fusion, so just take it as an example. It may be best to allocate everything beforehand and then use this syntax everywhere in the loop.
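The difference between rebinding a name and writing into an existing array can be checked directly. The sketch below uses NumPy, on the assumption stated above that CuPy follows the same semantics; the variable names are illustrative only.

```python
import numpy as np

pred = np.zeros(5, dtype=np.int64)
# Address of the buffer that backs pred.
buffer_before = pred.__array_interface__['data'][0]

new_pred = np.arange(5, dtype=np.int64)

# Slice assignment copies the values into the buffer pred already owns.
pred[:] = new_pred
same_buffer = pred.__array_interface__['data'][0] == buffer_before

# Plain assignment only rebinds the name; pred now refers to new_pred's
# buffer and the old one is left for the allocator to reclaim.
pred = new_pred
rebound = pred is new_pred

print(same_buffer, rebound)  # True True
```

Note that slice assignment still needs the right-hand side to be computed first, so it avoids growing the set of live arrays but does not by itself avoid the temporary.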
I don't know enough about CuPy to answer why fit_xp doesn't have the same problem (or does it?). But my guess would be that garbage collection with CuPy objects works differently there. If garbage collection were "quick enough" in fit_custom it should work even without kernel fusion or reusing already allocated arrays.
Other problems or at least oddities with your code:
Why are you comparing the zeroth coordinate of centers with the first coordinate of X? Wouldn't it make more sense to call
distances = var_kernel(X[:, None, 0], X[:, None, 1],
centers[None, :, 0], centers[None, :, 1])
Why are you creating 3D data when only using the projection on the 2D plane? So why not
samples = numpy.random.randn(num, 2)
Why are you using floats for (the initial version of) pred? The argmin should give an integer type result.

How to perform matrix by vector multiplication with sympy?

I have:
a vector of type <class 'sympy.vector.vector.VectorMul'>; and
a matrix of type <class 'sympy.matrices.dense.MutableDenseMatrix'>
I would like to multiply the matrix by the vector in order to produce a vector.
Can I perform this operation conveniently or do I need to do some extra manipulation first?
For reference I am attempting to get the symbolic result of a rotation matrix applied to a vector.
Also below, is some of my code that deals with the above matrix and vector.
from sympy import symbols, sin, cos, Matrix
from sympy.vector import CoordSys3D
σ, θ, γ, λ, a, b, c, a_v, b_v, c_v = symbols('σ, θ, γ, λ, a, b, c, a_v, b_v, c_v')
σ = sin(θ)
γ = cos(θ)
λ = 1 - γ
N = CoordSys3D('N')
u = a*N.i + b*N.j + c*N.k # Axis of rotation
R = Matrix([
    [a*a*λ + γ, a*b*λ-c*σ, a*c*λ+b*σ],
    [b*a*λ+c*σ, b*b*λ + γ, b*c*λ-a*σ],
    [c*a*λ-b*σ, c*b*λ+a*σ, c*c*λ + γ],
])
# Input vector prior to rotation
v = a_v*N.i + b_v*N.j + c_v*N.k
# How to calculate the post rotation output vector w = Rv?
In summary is there a built-in mechanism in sympy for matrix by vector multiplication?

Although I didn't find a function to do what I wanted, this code achieved the same result. I'm posting it here in case it is useful for others.
w = R * Matrix([v.coeff(N.i), v.coeff(N.j), v.coeff(N.k)])
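As a possible alternative to extracting the coefficients by hand, sympy.vector vectors have a to_matrix method, and sympy.vector also exports matrix_to_vector for the way back. The sketch below is a minimal illustration under the assumption that a plain rotation about the z axis is an acceptable stand-in for the R in the question; the symbol names are chosen for the example.

```python
from sympy import Matrix, cos, sin, symbols
from sympy.vector import CoordSys3D, matrix_to_vector

theta, a_v, b_v, c_v = symbols('theta a_v b_v c_v')
N = CoordSys3D('N')

# Stand-in rotation matrix (rotation about the z axis), not the
# axis-angle matrix from the question.
R = Matrix([
    [cos(theta), -sin(theta), 0],
    [sin(theta),  cos(theta), 0],
    [0,           0,          1],
])

v = a_v*N.i + b_v*N.j + c_v*N.k

# Express v as a 3x1 column matrix in the basis of N, then multiply.
w_matrix = R * v.to_matrix(N)

# Convert the column matrix back into a sympy.vector Vector.
w = matrix_to_vector(w_matrix, N)
print(w_matrix)
print(w)
```

The result w is again a vector expressed in N, so it can be used with the other sympy.vector operations such as dot and cross.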

In the current version of SymPy (1.11), you can calculate the matrix-vector product by using the matmul operator (@).
The following code works for me:
from sympy import Matrix, symbols, sin, cos
x, y, z, kx = symbols('x y z kx')
v = Matrix([x, y, z])
Kx = Matrix([[1, 0, 0],
             [0, cos(kx), -sin(kx)],
             [0, sin(kx), cos(kx)]])
product = Kx @ v
# Don't:
# product = v @ Kx

Octave: function doesn't return expected value?

This code is a programming assignment for Andrew Ng's machine learning course.
The function is expecting a row vector [J grad]. The code computes J (albeit wrongly, but that's not the issue here), and I put in a dummy value for grad (because I haven't written the code to compute it yet). When I run the code, it only outputs ans as a scalar with the value of J. Where did grad go?
function [J grad] = nnCostFunction(nn_params, ...
input_layer_size, ...
hidden_layer_size, ...
num_labels, ...
X, y, lambda)
%NNCOSTFUNCTION Implements the neural network cost function for a two layer
%neural network which performs classification
% [J grad] = NNCOSTFUNCTON(nn_params, hidden_layer_size, num_labels, ...
% X, y, lambda) computes the cost and gradient of the neural network. The
% parameters for the neural network are "unrolled" into the vector
% nn_params and need to be converted back into the weight matrices.
%
% The returned parameter grad should be a "unrolled" vector of the
% partial derivatives of the neural network.
%
% Reshape nn_params back into the parameters Theta1 and Theta2, the weight matrices
% for our 2 layer neural network
Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
hidden_layer_size, (input_layer_size + 1));
Theta2 = reshape(nn_params((1 + (hidden_layer_size * (input_layer_size + 1))):end), ...
num_labels, (hidden_layer_size + 1));
% Setup some useful variables
m = size(X, 1);
% You need to return the following variables correctly
J = 0;
Theta1_grad = zeros(size(Theta1));
Theta2_grad = zeros(size(Theta2));
% ====================== YOUR CODE HERE ======================
% Instructions: You should complete the code by working through the
% following parts.
%
% Part 1: Feedforward the neural network and return the cost in the
% variable J. After implementing Part 1, you can verify that your
% cost function computation is correct by verifying the cost
% computed in ex4.m
%
% Part 2: Implement the backpropagation algorithm to compute the gradients
% Theta1_grad and Theta2_grad. You should return the partial derivatives of
% the cost function with respect to Theta1 and Theta2 in Theta1_grad and
% Theta2_grad, respectively. After implementing Part 2, you can check
% that your implementation is correct by running checkNNGradients
%
% Note: The vector y passed into the function is a vector of labels
% containing values from 1..K. You need to map this vector into a
% binary vector of 1's and 0's to be used with the neural network
% cost function.
%
% Hint: We recommend implementing backpropagation using a for-loop
% over the training examples if you are implementing it for the
% first time.
%
% Part 3: Implement regularization with the cost function and gradients.
%
% Hint: You can implement this around the code for
% backpropagation. That is, you can compute the gradients for
% the regularization separately and then add them to Theta1_grad
% and Theta2_grad from Part 2.
%
% PART 1
a1 = [ones(m,1) X]; % set a1 to equal X and add column of 1's
z2 = a1 * Theta1'; % matrix times matrix [5000*401 * 401*25 = 5000*25]
a2 = [ones(m,1),sigmoid(z2)]; % sigmoid function on matrix [5000*26]
z3 = a2 * Theta2'; % matrix times matrix [5000*26 * 26*10 = 5000 * 10]
hox = sigmoid(z3); % sigmoid function on matrix [5000*10]
for k = 1:num_labels
  yk = y == k; % using the correct column vector y each loop
  J = J + sum(-yk.*log(hox(:,k)) - (1-yk).*log(1-hox(:,k)));
end
J = 1/m * J;
% -------------------------------------------------------------
% =========================================================================
% Unroll gradients
% grad = [Theta1_grad(:) ; Theta2_grad(:)];
grad = 6.6735;
end

You have specified in your function declaration that the function can simultaneously return more than one output value:
function [J grad] = nnCostFunction(nn_params, ... % etc
You can capture both outputs if you 'request' them by assigning to a matrix of variables instead of a single variable:
[a, b] = nnCostFunction(input1, input2, etc)
If you don't do this, you're essentially 'requesting' only the first of the returned variables:
a = nnCostFunction(input1, input2, etc) % output 'b' is discarded.
If you don't specify a variable to assign to at all, Octave assigns the result to the default variable ans. So it's essentially equivalent to doing
ans = nnCostFunction(input1, input2, etc) % output 'b' is discarded.
See the documentation for the find function (i.e. type help find in your Octave terminal) to see an example of such a function.
PS. If you only wanted the second output and did not want to 'waste' a variable name for the first one, you can do this by specifying ~ as the first output, e.g.:
[~, b] = nnCostFunction(input1, input2, etc) % output 'a' is discarded

Can someone explain this piece of code that recognises a digit from the Coursera Machine Learning course

This is a snippet from the predict function of exercise 4 of the Coursera machine learning course. What it does is it stores the predicted digit from a trained neural network in p. Can someone explain how it does this?
function p = predict(Theta1, Theta2, x)
p = 0;
h1 = sigmoid(double([1 x]) * Theta1');
h2 = sigmoid([1 h1] * Theta2');
[dummy, p] = max(h2, [], 2);
end
x = 1x784 matrix of pixel intensity values.
Theta1 = 100x785 matrix.
Theta2 = 10x101 matrix.
I have already trained the network and have gotten the optimum value of Theta1 and Theta2. What I want to know is how that last line of code takes the forward propagated values and stores 1/2/3/4/5/6/7/8/9/10 in p. Whichever digit is stored is the predicted digit.
Sigmoid function:
function g = sigmoid(z)
g = 1 ./ (1 + e.^-z);
end

The last line simply returns the index of the neuron with the highest value. In MATLAB/Octave,
[M, I] = max(A, [], dim)
stores in I the indices of the entries of A that have the highest values along dimension dim. In your case, h2 holds the activation of each output neuron, and by construction of your neural network the classification is simply the index of the one with the highest value:
cl(x) = arg max_i f_i(x)
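For readers more familiar with NumPy, the same step can be sketched as follows. The activation values are made up for illustration, and the + 1 is there because Octave indices start at 1 while NumPy indices start at 0.

```python
import numpy as np

# Hypothetical activations of the 10 output neurons for one input (1x10 row).
h2 = np.array([[0.01, 0.02, 0.05, 0.90, 0.03, 0.01, 0.02, 0.04, 0.01, 0.02]])

# Index of the largest activation in each row, shifted to 1-based
# so that it matches the labels 1..10 used in the course.
p = np.argmax(h2, axis=1) + 1
print(p)  # [4]
```

Here the fourth neuron has the largest activation, so the predicted label is 4.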

Singular value decomposition of complex 2x2 matrix

I was looking for example code showing how to compute a singular value decomposition of a 2x2 matrix that can contain complex values.
For example, this would be useful for "repairing" user-entered matrices to be unitary. You just take u, s, v = svd(m) then omit the s part from the product: repaired = u * v.
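The repair step described above can be sketched with NumPy's built-in SVD. This is only an illustration of the idea, not the hand-written 2x2 routine being asked for; the function name repair_to_unitary and the sample matrix are assumptions. Note that np.linalg.svd returns the singular values as a 1-D array and the third factor already conjugate-transposed.

```python
import numpy as np

def repair_to_unitary(m):
    # m = u @ diag(s) @ vh; dropping the singular values leaves u @ vh,
    # which is unitary because u and vh are.
    u, _s, vh = np.linalg.svd(m)
    return u @ vh

# A slightly perturbed unitary matrix, standing in for user input.
m = np.array([[1.0 + 0.1j, 0.05 + 0.0j],
              [0.0 - 0.02j, 0.9 + 0.0j]])

repaired = repair_to_unitary(m)
is_unitary = np.allclose(repaired @ repaired.conj().T, np.eye(2))
print(is_unitary)  # True
```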

Here's some Python code that does the trick. It basically just extracts the complex parts then delegates to the solution from this answer for real 2x2 matrices.
I've written the code in Python, using NumPy. This is a bit ironic, because if you have NumPy you should just use np.linalg.svd. Clearly this is intended as example code suitable for learning or translating into other languages in a pinch.
I'm also not an expert on numerical stability, so... buyer beware.
import numpy as np
import math
# Note: in practice in python just use np.linalg.svd instead
def singular_value_decomposition_complex_2x2(m):
    """
    Returns a singular value decomposition of the given 2x2 complex numpy
    matrix.
    :param m: A 2x2 numpy matrix with complex values.
    :returns: A tuple (U, S, V) where U*S*V ~= m, where U and V are complex
    2x2 unitary matrices, and where S is a 2x2 diagonal matrix with
    non-negative real values.
    """
    # Make top row non-imaginary and non-negative by column phasing.
    # m2 = m p = | >     >    |
    #            | ?+?i  ?+?i |
    p = phase_cancel_matrix(m[0, 0], m[0, 1])
    m2 = m * p
    # Cancel top-right value by rotation.
    # m3 = m p r = | ?+?i  0    |
    #              | ?+?i  ?+?i |
    r = rotation_matrix(math.atan2(m2[0, 1].real, m2[0, 0].real))
    m3 = m2 * r
    # Make bottom row non-imaginary and non-negative by column phasing.
    # m4 = m p r q = | ?+?i  0 |
    #                | >     > |
    q = phase_cancel_matrix(m3[1, 0], m3[1, 1])
    m4 = m3 * q
    # Cancel imaginary part of top left value by row phasing.
    # m5 = t m p r q = | > 0 |
    #                  | > > |
    t = phase_cancel_matrix(m4[0, 0], 1)
    m5 = t * m4
    # All values are now real (also the top-right is zero), so delegate to a
    # singular value decomposition that works for real matrices.
    # t m p r q = u s v
    u, s, v = singular_value_decomposition_real_2x2(np.real(m5))
    # m = (t* u) s (v q* r* p*)
    return adjoint(t) * u, s, v * adjoint(q) * adjoint(r) * adjoint(p)
def singular_value_decomposition_real_2x2(m):
    """
    Returns a singular value decomposition of the given 2x2 real numpy matrix.
    :param m: A 2x2 numpy matrix with real values.
    :returns: A tuple (U, S, V) where U*S*V ~= m, where U and V are 2x2
    rotation matrices, and where S is a 2x2 diagonal matrix with
    non-negative real values.
    """
    a = m[0, 0]
    b = m[0, 1]
    c = m[1, 0]
    d = m[1, 1]
    t = a + d
    x = b + c
    y = b - c
    z = a - d
    theta_0 = math.atan2(x, t) / 2.0
    theta_d = math.atan2(y, z) / 2.0
    s_0 = math.sqrt(t**2 + x**2) / 2.0
    s_d = math.sqrt(z**2 + y**2) / 2.0
    # np.matrix is used instead of the np.mat alias, which newer NumPy removed.
    return \
        rotation_matrix(theta_0 - theta_d), \
        np.matrix([[s_0 + s_d, 0], [0, s_0 - s_d]]), \
        rotation_matrix(theta_0 + theta_d)
def adjoint(m):
    """
    Returns the adjoint, i.e. the conjugate transpose, of the given matrix.
    When the matrix is unitary, the adjoint is also its inverse.
    :param m: A numpy matrix to transpose and conjugate.
    :return: A numpy matrix.
    """
    return m.conjugate().transpose()
def rotation_matrix(theta):
    """
    Returns a 2x2 unitary matrix corresponding to a 2d rotation by the given angle.
    :param theta: The angle, in radians, that the matrix should rotate by.
    :return: A 2x2 orthogonal matrix.
    """
    c, s = math.cos(theta), math.sin(theta)
    return np.matrix([[c, -s],
                      [s, c]])
def phase_cancel_complex(c):
    """
    Returns a unit complex number p that cancels the phase of the given complex
    number c. That is, c * p will be real and non-negative (approximately).
    :param c: A complex number.
    :return: A complex number on the complex unit circle.
    """
    m = abs(c)
    # For small values, where the division is in danger of exploding small
    # errors, use trig functions instead.
    if m < 0.0001:
        theta = math.atan2(c.imag, c.real)
        return math.cos(theta) - math.sin(theta) * 1j
    return (c / float(m)).conjugate()
def phase_cancel_matrix(p, q):
    """
    Returns a 2x2 unitary matrix M such that M cancels out the phases in the
    column {{p}, {q}} so that the result of M * {{p}, {q}} should be a vector
    with non-negative real values.
    :param p: A complex number.
    :param q: A complex number.
    :return: A 2x2 diagonal unitary matrix.
    """
    return np.matrix([[phase_cancel_complex(p), 0],
                      [0, phase_cancel_complex(q)]])
I tested the above code by fuzzing it with matrices filled with random values in [-10, 10] + [-10, 10]i, and checking that the decomposed factors had the right properties (i.e. unitary, diagonal, real, as appropriate) and that their product was (approximately) equal to the input.
But here's a simple smoke test:
m = np.matrix([[5, 10], [1j, -1]])
u, s, v = singular_value_decomposition_complex_2x2(m)
np.set_printoptions(precision=5, suppress=True)
print("M:", m, sep="\n")
print("U*S*V:", u*s*v, sep="\n")
print("U:", u, sep="\n")
print("S:", s, sep="\n")
print("V:", v, sep="\n")
print("M ~= U*S*V:", np.all(np.abs(m - u*s*v) < 0.1**14))
This outputs the following. You can confirm that the factored S matches the SVD from Wolfram Alpha, although of course the U and V can be (and are) different.
M:
[[ 5.+0.j 10.+0.j]
[ 0.+1.j -1.+0.j]]
U*S*V:
[[ 5.+0.j 10.+0.j]
[ 0.+1.j -1.-0.j]]
U:
[[-0.89081-0.44541j 0.08031+0.04016j]
[ 0.08979+0.j 0.99596+0.j ]]
S:
[[ 11.22533 0. ]
[ 0. 0.99599]]
V:
[[-0.39679+0.20639j -0.80157+0.39679j]
[ 0.40319+0.79837j -0.19359-0.40319j]]
M ~= U*S*V: True
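The fuzz check described earlier (random entries in [-10, 10] + [-10, 10]i, then verifying that the factors are unitary, diagonal and real as appropriate, and that their product reproduces the input) could look roughly like the sketch below. To keep it self-contained it exercises np.linalg.svd as a stand-in for the function under test; the helper name check_decomposition and the tolerance are assumptions, not the author's actual test code.

```python
import numpy as np

def check_decomposition(m, u, s, v, tol=1e-9):
    eye = np.eye(2)
    # U and V must be unitary.
    unitary = (np.allclose(u @ u.conj().T, eye, atol=tol)
               and np.allclose(v @ v.conj().T, eye, atol=tol))
    # S must be diagonal, real, and non-negative.
    diagonal = np.allclose(s, np.diag(np.diag(s)), atol=tol)
    real_nonneg = (np.allclose(np.imag(s), 0, atol=tol)
                   and bool(np.all(np.real(np.diag(s)) >= -tol)))
    # The product must reproduce the input (approximately).
    product = np.allclose(u @ s @ v, m, atol=tol)
    return unitary and diagonal and real_nonneg and product

rng = np.random.default_rng(0)
all_ok = True
for _ in range(1000):
    m = rng.uniform(-10, 10, (2, 2)) + 1j * rng.uniform(-10, 10, (2, 2))
    # Stand-in decomposition; swap in the 2x2 routine to test it instead.
    u, sigma, vh = np.linalg.svd(m)
    all_ok = all_ok and check_decomposition(m, u, np.diag(sigma), vh)

print(all_ok)
```

To test the hand-written routine, the call to np.linalg.svd would be replaced by singular_value_decomposition_complex_2x2, with the returned matrices converted via np.asarray before the checks.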
