Cython running slower than NumPy for distance calculation - performance

I'm trying to learn Cython; however, I must be doing something wrong. This little piece of test code runs about 50 times slower than my vectorized NumPy version of it. Can someone please tell me why my Cython is slower than my Python? Thanks.
The code calculates the distance between a point in R^3, loc, and an array of points in R^3, points.
import numpy as np
cimport numpy as np
import cython
cimport cython

DTYPE = np.float64
ctypedef np.float64_t DTYPE_t

@cython.boundscheck(False)  # turn off bounds-checking for the entire function
@cython.wraparound(False)
@cython.nonecheck(False)
def distMeasureCython(np.ndarray[DTYPE_t, ndim=2] points, np.ndarray[DTYPE_t, ndim=1] loc):
    cdef unsigned int i
    cdef unsigned int L = points.shape[0]
    cdef np.ndarray[DTYPE_t, ndim=1] d = np.zeros(L)
    for i in xrange(0, L):
        d[i] = np.sqrt((points[i,0] - loc[0])**2 + (points[i,1] - loc[1])**2 + (points[i,2] - loc[2])**2)
    return d
This is the numpy code that it's being compared against.
from numpy import *

N = 10**6
points = random.uniform(0, 1, (N, 3))
loc = random.uniform(0, 1, (3,))

def distMeasureNumpy(points, loc):
    d = points - loc
    d = sqrt(sum(d*d, axis=1))
    return d
The numpy/python version takes about 44 ms and the Cython version takes about 2 seconds. I'm running Python 2.7 on Mac OS X and using IPython's %timeit command to time the two functions.

The call to np.sqrt, which is a Python function call, is killing your performance. You are computing the square root of a scalar floating-point value, so you should use the sqrt function from the C math library instead. Here's a modified version of your code:
import numpy as np
cimport numpy as np
import cython
cimport cython
from libc.math cimport sqrt

DTYPE = np.float64
ctypedef np.float64_t DTYPE_t

@cython.boundscheck(False)  # turn off bounds-checking for the entire function
@cython.wraparound(False)
@cython.nonecheck(False)
def distMeasureCython(np.ndarray[DTYPE_t, ndim=2] points,
                      np.ndarray[DTYPE_t, ndim=1] loc):
    cdef unsigned int i
    cdef unsigned int L = points.shape[0]
    cdef np.ndarray[DTYPE_t, ndim=1] d = np.zeros(L)
    for i in xrange(0, L):
        d[i] = sqrt((points[i,0] - loc[0])**2 +
                    (points[i,1] - loc[1])**2 +
                    (points[i,2] - loc[2])**2)
    return d
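For completeness, a minimal build script for the extension might look like the sketch below (the file name check_speed.pyx is an assumption; adjust it to whatever you called your module), compiled with python setup.py build_ext --inplace:
# setup.py -- minimal build sketch for the Cython module above
from setuptools import setup
from Cython.Build import cythonize
import numpy as np

setup(
    ext_modules=cythonize("check_speed.pyx"),
    include_dirs=[np.get_include()],   # needed because the module cimports numpy
)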
The following demonstrates the performance improvement. Your original code is in the module check_speed_original, and the modified version is in check_speed:
In [11]: import check_speed_original
In [12]: import check_speed
Set up the test data:
In [13]: N = 10**6
In [14]: points = random.uniform(0,1,(N,3))
In [15]: loc = random.uniform(0,1,(3,))
The original version takes 1.26 seconds on my computer:
In [16]: %timeit check_speed_original.distMeasureCython(points, loc)
1 loops, best of 3: 1.26 s per loop
The modified version takes 4.47 milliseconds:
In [17]: %timeit check_speed.distMeasureCython(points, loc)
100 loops, best of 3: 4.47 ms per loop
In case anyone is worried that the results might be different:
In [18]: d1 = check_speed.distMeasureCython(points, loc)
In [19]: d2 = check_speed_original.distMeasureCython(points, loc)
In [20]: np.all(d1 == d2)
Out[20]: True
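As an aside, the pure-NumPy baseline itself can be tightened a little by avoiding the temporary d*d array, for instance with einsum. This is only a sketch for comparison, not a claim that it beats the Cython loop:
import numpy as np

def distMeasureNumpyEinsum(points, loc):
    diff = points - loc
    # row-wise dot product, without materializing diff * diff
    return np.sqrt(np.einsum('ij,ij->i', diff, diff))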

As already mentioned, it's the numpy.sqrt call in the code. However, I don't think you need to use cdef extern, since Cython already provides these basic C/C++ library functions (see the docs). You can just cimport it like this:
from libc.math cimport sqrt
Just to get rid of the overhead.


Numba CUDA computation seems to be slower than sequential run. Did I make obvious mistakes?

There are several threads covering similar topics, but unfortunately these seem too complicated for me, so I would like to ask a similar question, hoping that someone will look at my code specifically and tell me if I got something wrong.
I am learning Numba CUDA right now, starting with the simple examples one can find online. I started with this tutorial:
https://github.com/ContinuumIO/gtc2017-numba/blob/master/4%20-%20Writing%20CUDA%20Kernels.ipynb
which shows how to do an addition of arrays in parallel. The system configuration they used to evaluate the times is not given. For the code replication, I use a Geforce GTX 1080 Ti and an Intel Core i7 8700K CPU.
I basically copied the addition script from the tutorial, but also added sequential code for comparison:
from numba import cuda
import numpy as np
import time
import math

@cuda.jit
def addition_kernel(x, y, out):
    tx = cuda.threadIdx.x
    ty = cuda.blockIdx.x
    block_size = cuda.blockDim.x
    grid_size = cuda.gridDim.x

    start = tx + ty * block_size
    stride = block_size * grid_size
    for i in range(start, x.shape[0], stride):
        out[i] = y[i] + x[i]

def add(n, x, y):
    for i in range(n):
        y[i] = y[i] + x[i]

if __name__ == "__main__":
    print(cuda.gpus[0])
    print("")

    n = 100000
    x = np.arange(n).astype(np.float32)
    y = 2 * x
    out = np.empty_like(x)

    x_device = cuda.to_device(x)
    y_device = cuda.to_device(y)
    out_device = cuda.device_array_like(x)

    # Set the number of threads in a block
    threadsperblock = 128
    # Calculate the number of thread blocks in the grid
    blockspergrid = 30  # math.ceil(n[0] / threadsperblock)

    # Now start the kernel
    start = time.process_time()
    cuda.synchronize()
    addition_kernel[blockspergrid, threadsperblock](x_device, y_device, out_device)
    cuda.synchronize()
    end = time.process_time()
    out_global_mem = out_device.copy_to_host()
    print("parallel time: ", end - start)

    start = time.process_time()
    add(n, x, y)
    end = time.process_time()
    print("sequential time: ", end - start)
The parallel time is on average around 0.14 seconds, while the code without the GPU kernel takes only 0.02 seconds.
This seems quite strange to me. Is there anything I did wrong? Or is this problem not a good example for parallelism? (I don't think so, since the for loop can be run in parallel.)
What is odd is that I hardly notice a difference if I do not use the to_device() functions. As far as I understand, these should be important, as they avoid the communication between CPU and GPU after each iteration.
addition_kernel is compiled at runtime the first time it is called, i.e. in the middle of your measured time! The compilation of a kernel is a pretty intensive operation. You can force the compilation to be done eagerly (i.e. when the function is defined) by providing the types to Numba.
Note that the arrays are also a bit too small for you to see a big improvement on GPUs. Moreover, the comparison with the CPU version is not really fair: you should also use Numba for the CPU implementation, or at least NumPy (but not an interpreted pure-CPython loop).
Here is an example:
import numba as nb
from numba import cuda

@cuda.jit('void(float32[::1], float32[::1], float32[::1])')
def addition_kernel(x, y, out):
    tx = cuda.threadIdx.x
    ty = cuda.blockIdx.x
    block_size = cuda.blockDim.x
    grid_size = cuda.gridDim.x

    start = tx + ty * block_size
    stride = block_size * grid_size
    for i in range(start, x.shape[0], stride):
        out[i] = y[i] + x[i]

@nb.njit('void(int64, float32[::1], float32[::1])')
def add(n, x, y):
    for i in range(n):
        y[i] = y[i] + x[i]
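As a rough sketch of how the comparison could then be timed more fairly (the array size and launch configuration below are illustrative assumptions, and addition_kernel is the eagerly compiled kernel defined above): launch the kernel once as a warm-up so that any remaining setup cost is excluded, synchronize, and time only the second launch.
import time
import numpy as np
from numba import cuda

n = 10_000_000                      # larger arrays give the GPU more work per launch
x = np.arange(n, dtype=np.float32)
y = 2 * x
x_d = cuda.to_device(x)
y_d = cuda.to_device(y)
out_d = cuda.device_array_like(x_d)

threadsperblock = 128
blockspergrid = 30

# warm-up launch: any remaining JIT/driver work happens here, not in the timing below
addition_kernel[blockspergrid, threadsperblock](x_d, y_d, out_d)
cuda.synchronize()

start = time.perf_counter()
addition_kernel[blockspergrid, threadsperblock](x_d, y_d, out_d)
cuda.synchronize()
print("kernel time:", time.perf_counter() - start)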

How to speed up this Python loop script or parallelize it

I am currently working on a script which takes in data for a correlation matrix and computes a bunch of values. This step right here is very costly, and I would like to find ways to speed it up or parallelize it. I have tried using Parallel (from Python's joblib); however, because of CPU overhead (at least because of the way I parametrized it), it's significantly slower than a sequential loop.
import time
import numpy as np
import itertools
from sklearn.datasets import make_blobs

N = 5000
data, _ = make_blobs(n_samples=N, n_features=500)
G = np.corrcoef(data)

''' the cluster function '''
def clus_lc(i, j, G, ns=2):
    ''' c_s '''
    cs = 2*(G[i,j]+1) - 1e-6
    ''' A and B '''
    if cs <= ns:
        return 0
    return 0.5*( np.log(ns/cs) + (ns - 1)*np.log( (ns**2 - ns) / ( ns**2 - cs) ) )

''' merge and time '''
indices = list(itertools.combinations(range(N), 2))

t0 = time.time()
costs = np.zeros(len(indices))
k = 0
for i, j in indices:
    costs[k] = clus_lc(i, j, G)
    k += 1
t1 = time.time()
toseq = t1 - t0
print(toseq)
I think I solved the issue by using numba and adding the @jit decorator. This seems to work fine because all operations in the function are calls to numpy functions. On a dataset with N=5000 it goes from 75 sec to 10 sec. Fantastic improvement. Whether this can be further improved, I am interested in hearing other inputs.
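In case it helps anyone else, here is a sketch of going one step further and moving the pair loop itself into a compiled, parallel function with numba's prange. The function name, the ns default, and the index arithmetic that maps a pair (i, j) to its position in the flat costs array are my own additions, not from the original post:
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def all_pair_costs(G, ns=2.0):
    N = G.shape[0]
    costs = np.zeros(N * (N - 1) // 2)
    for i in prange(N - 1):
        # offset such that pair (i, j) with j > i lands at its lexicographic index
        base = i * N - i * (i + 1) // 2 - i - 1
        for j in range(i + 1, N):
            cs = 2.0 * (G[i, j] + 1.0) - 1e-6
            if cs > ns:
                costs[base + j] = 0.5 * (np.log(ns / cs)
                                         + (ns - 1.0) * np.log((ns**2 - ns) / (ns**2 - cs)))
    return costs

costs = all_pair_costs(G)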

Python: how to write this code to run on GPU?

I have been trying for quite some time to implement my code to run on the GPU, but with little success. I would really appreciate someone helping with the implementation.
Let me say a few words about the problem. I have a graph G with N nodes and a distribution mx on each node x. I would like to compute the distance between the distributions for every pair of nodes connected by an edge. For a given pair (x,y), I use the code ot.sinkhorn(mx, my, dNxNy) from the Python POT package to compute the distance. Again, mx and my are vectors of size Nx and Ny on nodes x and y, and dNxNy is an Nx x Ny distance matrix.
Now, I discovered that there is a GPU implementation of this code, ot.gpu.sinkhorn(mx, my, dNxNy). However, this is not good enough, because mx, my and dNxNy would need to be uploaded to the GPU at every iteration, which is a massive overhead. So the idea is to parallelise this for all edges on the GPU.
The essence of the code is as follows; mx_all holds all the distributions:
for i, e in enumerate(G.edges):
    W[i] = W_comp(mx_all, dist, e)

def W_comp(mx_all, dist, e):
    i = e[0]
    j = e[1]
    Nx = np.array(mx_all[i][1]).flatten()
    Ny = np.array(mx_all[j][1]).flatten()
    mx = np.array(mx_all[i][0]).flatten()
    my = np.array(mx_all[j][0]).flatten()
    dNxNy = dist[Nx,:][:,Ny].copy(order='C')
    W = ot.sinkhorn2(mx, my, dNxNy, 1)
    return W
Below is a minimal working example. Please ignore everything except the part between dashed === signs.
import ot
import numpy as np
import scipy as sc
import scipy.sparse.linalg  # needed so sc.sparse.linalg is available

def main():
    import networkx as nx

    # some example graph
    G = nx.planted_partition_graph(4, 20, 0.6, 0.3, seed=2)
    L = nx.normalized_laplacian_matrix(G)

    # this just computes all distributions (IGNORE)
    mx_all = []
    for i in G.nodes:
        mx_all.append(mx_comp(L, 1, 1, i))

    # some random distance matrix (IGNORE)
    dist = np.random.randint(5, size=(nx.number_of_nodes(G), nx.number_of_nodes(G)))

    # =============================================================================
    # this is what needs to be parallelised on GPU
    W = np.zeros(nx.Graph.size(G))
    for i, e in enumerate(G.edges):
        print(i)
        W[i] = W_comp(mx_all, dist, e)
    return W

def W_comp(mx_all, dist, e):
    i = e[0]
    j = e[1]
    Nx = np.array(mx_all[i][1]).flatten()
    Ny = np.array(mx_all[j][1]).flatten()
    mx = np.array(mx_all[i][0]).flatten()
    my = np.array(mx_all[j][0]).flatten()
    dNxNy = dist[Nx,:][:,Ny].copy(order='C')
    return ot.sinkhorn2(mx, my, dNxNy, 1)
# =============================================================================

# some other functions (IGNORE)
def delta(i, n):
    p0 = np.zeros(n)
    p0[i] = 1.
    return p0

# all neighbourhood densities
def mx_comp(L, t, cutoff, i):
    N = np.shape(L)[0]
    mx_all = sc.sparse.linalg.expm_multiply(-t*L, delta(i, N))
    Nx_all = np.argwhere(mx_all > (1-cutoff)*np.max(mx_all))
    return mx_all, Nx_all

if __name__ == "__main__":
    main()
Thank you!!
There are some packages which allow you to run code on your GPU.
You can use one of the following packages:
pyCuda
numba(Pro)
Theano
When you want to use numba, the Anaconda Python distribution is recommended for doing this. Also, Anaconda Accelerate is needed; you can install it using conda install accelerate. In this example you can see how the usage of the GPU is achieved: https://gist.githubusercontent.com/aweeraman/ae6e40f54a924f1f5832081be9521d92/raw/d6775c421aa4fa4c0d582e6c58873499d28b913a/gpu.py .
It's done by adding target='cuda' to the @vectorize decorator. Note the import from numba import vectorize. The vectorize decorator takes the signature of the function that is to be accelerated as input.
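As a sketch of that pattern (the function below is illustrative rather than taken from the linked gist, and running it requires a CUDA-capable GPU with a matching numba/CUDA toolkit install):
import numpy as np
from numba import vectorize

# target='cuda' turns the scalar function into a GPU element-wise kernel
@vectorize(['float32(float32, float32)'], target='cuda')
def add_gpu(a, b):
    return a + b

x = np.arange(1000000, dtype=np.float32)
y = 2 * x
out = add_gpu(x, y)   # numba handles the host/device transfers here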
Good luck!
Sources:
https://weeraman.com/put-that-gpu-to-good-use-with-python-e5a437168c01
https://www.researchgate.net/post/How_do_I_run_a_python_code_in_the_GPU

Avoid loops in the computation of the logistic equation?

I am trying to calculate the nth value of a logistic equation in Python. It is easy to do it with a loop:
import timeit
tic = timeit.default_timer()

x = 0.23
i = 0
n = 1000000000
while (i < n):
    x = 4 * x * (1 - x)
    i += 1

toc = timeit.default_timer()
toc - tic
However it is also generally time-consuming. Doing it in PyPy greatly improves the performance, as suggested by abarnert in Is MATLAB faster than Python (little simple experiment).
I have also been suggested to avoid Python loops and use NumPy arrays and vector operations instead - actually I do not see how these can help (it seems to me that NumPy operations are similar to Matlab ones, and I am unaware of any way the code above can be vectorized in Matlab either).
Is there a way to optimize the code without loops?
Without loops? Maybe, but this is probably not the best way to go. It's important to realize that loops are not slow per se. You try to avoid them in Python or MATLAB in high-performance code. If you are writing C code, you don't have to care.
So one idea for optimizing here would be to use Cython to compile your code to C code:
python version:
def calc_x(x, n):
    i = 0
    while (i < n):
        x = 4 * x * (1 - x)
        i += 1
    return x
statically typed cython version:
def calc_x_cy(double x, long n):
    cdef long i = 0
    while (i < n):
        x = 4 * x * (1 - x)
        i += 1
    return x
And all of a sudden, you are almost two orders of magnitude faster:
%timeit calc_x(0.23, n) -> 1 loops, best of 3: 26.9 s per loop
%timeit calc_x_cy(0.23, n) -> 1 loops, best of 3: 370 ms per loop
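If you want to try the statically typed version without writing a setup.py, one low-friction route is pyximport, which compiles .pyx modules transparently on import. The file name calc_x_cy.pyx below is a hypothetical choice for where the Cython function above lives:
# a sketch, assuming calc_x_cy is defined in calc_x_cy.pyx
import pyximport
pyximport.install()                 # compile .pyx files on import

from calc_x_cy import calc_x_cy

import timeit
n = 10**7                           # smaller n than above, just to sanity-check the speed-up
print(timeit.timeit(lambda: calc_x_cy(0.23, n), number=3))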

Speeding up element-wise array multiplication in python

I have been playing around with numba and numexpr trying to speed up a simple element-wise matrix multiplication. I have not been able to get better results; they both are basically (speed-wise) equivalent to numpy's multiply function. Has anyone had any luck in this area? Am I using numba and numexpr wrong (I'm quite new to this), or is this altogether a bad approach to try to speed this up? Here is reproducible code, thank you in advance:
import numpy as np
from numba import autojit
import numexpr as ne

a = np.random.rand(10, 5000000)

# numpy
multiplication1 = np.multiply(a, a)

# numba
def multiplix(X, Y):
    M = X.shape[0]
    N = X.shape[1]
    D = np.empty((M, N), dtype=np.float)
    for i in range(M):
        for j in range(N):
            D[i, j] = X[i, j] * Y[i, j]
    return D

mul = autojit(multiplix)
multiplication2 = mul(a, a)

# numexpr
def numexprmult(X, Y):
    M = X.shape[0]
    N = X.shape[1]
    return ne.evaluate("X * Y")

multiplication3 = numexprmult(a, a)
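A side note worth checking, offered as a hedge rather than a guaranteed win: with arrays this size (10 x 5,000,000 float64 is roughly 400 MB), part of the cost is allocating the output array, so reusing a preallocated buffer via the out= argument of np.multiply and numexpr.evaluate may shave something off. Whether it matters depends on your machine:
import numpy as np
import numexpr as ne

a = np.random.rand(10, 5000000)
out = np.empty_like(a)

np.multiply(a, a, out=out)       # numpy, writing into the existing buffer
ne.evaluate("a * a", out=out)    # numexpr also accepts an out= buffer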
What about using Fortran and ctypes?
elementwise.F90:
subroutine elementwise( a, b, c, M, N ) bind(c, name='elementwise')
  use iso_c_binding, only: c_float, c_int
  integer(c_int), intent(in) :: M, N
  real(c_float),  intent(in) :: a(M, N), b(M, N)
  real(c_float),  intent(out):: c(M, N)
  integer :: i, j

  forall (i=1:M, j=1:N)
    c(i,j) = a(i,j) * b(i,j)
  end forall
end subroutine
elementwise.py:
from ctypes import CDLL, POINTER, c_int, c_float
import numpy as np
import time

fortran = CDLL('./elementwise.so')
fortran.elementwise.argtypes = [ POINTER(c_float),
                                 POINTER(c_float),
                                 POINTER(c_float),
                                 POINTER(c_int),
                                 POINTER(c_int) ]
# Setup
M = 10
N = 5000000

a = np.empty((M, N), dtype=c_float)
b = np.empty((M, N), dtype=c_float)
c = np.empty((M, N), dtype=c_float)

a[:] = np.random.rand(M, N)
b[:] = np.random.rand(M, N)

# Fortran call
start = time.time()
fortran.elementwise( a.ctypes.data_as(POINTER(c_float)),
                     b.ctypes.data_as(POINTER(c_float)),
                     c.ctypes.data_as(POINTER(c_float)),
                     c_int(M), c_int(N) )
stop = time.time()
print 'Fortran took ', stop - start, 'seconds'

# Numpy
start = time.time()
c = np.multiply(a, b)
stop = time.time()
print 'Numpy took ', stop - start, 'seconds'
I compiled the Fortran file using
gfortran -O3 -funroll-loops -ffast-math -floop-strip-mine -shared -fPIC \
-o elementwise.so elementwise.F90
The output yields a speed-up of ~10%:
$ python elementwise.py
Fortran took 0.213667869568 seconds
Numpy took 0.230120897293 seconds
$ python elementwise.py
Fortran took 0.209784984589 seconds
Numpy took 0.231616973877 seconds
$ python elementwise.py
Fortran took 0.214708089828 seconds
Numpy took 0.25369310379 seconds
How are you doing your timings?
The creation of your random array takes up the bulk of your calculation, and if you include it in your timing you will hardly see any real difference in the results. However, if you create it up front, you can actually compare the methods.
Here are my results, and I'm consistently seeing what you are seeing. numpy and numba give about the same results (with numba being a little bit faster).
(I don't have numexpr available)
In [1]: import numpy as np
In [2]: from numba import autojit
In [3]: a=np.random.rand(10,5000000)
In [4]: %timeit multiplication1 = np.multiply(a,a)
10 loops, best of 3: 90 ms per loop
In [5]: # numba
In [6]: def multiplix(X,Y):
   ...:     M = X.shape[0]
   ...:     N = X.shape[1]
   ...:     D = np.empty((M, N), dtype=np.float)
   ...:     for i in range(M):
   ...:         for j in range(N):
   ...:             D[i,j] = X[i, j] * Y[i, j]
   ...:     return D
   ...:
In [7]: mul = autojit(multiplix)
In [26]: %timeit multiplication1 = np.multiply(a,a)
10 loops, best of 3: 182 ms per loop
In [27]: %timeit multiplication1 = np.multiply(a,a)
10 loops, best of 3: 185 ms per loop
In [28]: %timeit multiplication1 = np.multiply(a,a)
10 loops, best of 3: 181 ms per loop
In [29]: %timeit multiplication2 = mul(a,a)
10 loops, best of 3: 179 ms per loop
In [30]: %timeit multiplication2 = mul(a,a)
10 loops, best of 3: 180 ms per loop
In [31]: %timeit multiplication2 = mul(a,a)
10 loops, best of 3: 178 ms per loop
Update:
I used the latest version of numba, just compiled it from source: '0.11.0-3-gea20d11-dirty'
I tested this with the default numpy in Fedora 19, '1.7.1'
and numpy '1.6.1' compiled from source, linked against:
Update 3:
My earlier results were of course incorrect: I had return D in the inner loop, so I was skipping 90% of the calculations.
This provides more evidence for ali_m's assumption that it is really hard to do better than the already very optimized C code.
However, if you are trying to do something more complicated, e.g.,
np.sqrt(((X[:, None, :] - X) ** 2).sum(-1))
I can reproduce the figures Jake Vanderplas gets:
In [14]: %timeit pairwise_numba(X)
10000 loops, best of 3: 92.6 us per loop
In [15]: %timeit pairwise_numpy(X)
1000 loops, best of 3: 662 us per loop
So it seems you are doing something that has already been so well optimized by numpy that it is hard to do any better.
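For reference, the two pairwise-distance functions being compared there look roughly like this. They are reconstructed from Jake Vanderplas's benchmark post, so treat the exact code (and the array shape) as an approximation:
import numpy as np
from numba import jit

def pairwise_numpy(X):
    return np.sqrt(((X[:, None, :] - X) ** 2).sum(-1))

@jit(nopython=True)
def pairwise_numba(X):
    M, N = X.shape
    D = np.empty((M, M))
    for i in range(M):
        for j in range(M):
            d = 0.0
            for k in range(N):
                tmp = X[i, k] - X[j, k]
                d += tmp * tmp
            D[i, j] = np.sqrt(d)
    return D

X = np.random.random((1000, 3))   # illustrative shape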
Edit: never mind this answer, I'm wrong (see comment below).
I'm afraid it will be very, very hard to get faster matrix multiplication in Python than with numpy's. NumPy usually uses internal Fortran libraries like ATLAS/LAPACK that are very well optimized.
To check if your version of NumPy was built with LAPACK support: open a terminal, go to your Python install directory and type:
for f in `find lib/python2.7/site-packages/numpy/* -name \*.so`; do echo $f; ldd $f;echo "\n";done | grep lapack
Note that the path can vary depending on your python version.
If some lines get printed, you surely have LAPACK support... so having faster matrix multiplication on a single core will be very hard to achieve.
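A shorter, shell-independent way to inspect the same thing is numpy's own config helper (available on any reasonably recent NumPy):
import numpy as np

# prints the BLAS/LAPACK libraries this NumPy build was linked against
np.__config__.show()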
Now I don't know about using multiple cores to perform matrix multiplication, so you might want to look into that (see ali_m's comment).
Use a GPU, for example with the following package:
gnumpy
The speed of np.multiply heavily relies on the arrays having exactly the same shape.
a = np.random.rand(80000,1)
b = np.random.rand(80000,1)
c = np.multiply(a, b)
is fast as hell, whereas the following code takes more than a minute and uses up all my 16 GB of RAM:
a = np.squeeze(np.random.rand(80000,1))
b = np.random.rand(80000,1)
c = np.multiply(a, b)
So my advice would be to use arrays of exactly the same dimensions. Hope this is useful for someone looking to speed up element-wise multiplication.
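What actually happens in the slow case is broadcasting: a squeezed array of shape (80000,) multiplied by an array of shape (80000, 1) produces an (80000, 80000) result, which is what eats the time and memory. A quick way to see it, with the sizes shrunk so it runs instantly:
import numpy as np

a = np.squeeze(np.random.rand(5, 1))   # shape (5,)
b = np.random.rand(5, 1)               # shape (5, 1)
print(np.multiply(a, b).shape)         # (5, 5): broadcasting, not an element-wise product of 5 values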
