Speeding up element-wise array multiplication in Python

I have been playing around with numba and numexpr trying to speed up a simple element-wise matrix multiplication. I have not been able to get better results: both are basically equivalent (speed-wise) to NumPy's multiply function. Has anyone had any luck in this area? Am I using numba and numexpr wrong (I'm quite new to this), or is this altogether a bad approach to try to speed this up? Here is reproducible code, thank you in advance:
import numpy as np
from numba import autojit
import numexpr as ne
a=np.random.rand(10,5000000)
# numpy
multiplication1 = np.multiply(a,a)
# numba
def multiplix(X, Y):
    M = X.shape[0]
    N = X.shape[1]
    D = np.empty((M, N), dtype=np.float)
    for i in range(M):
        for j in range(N):
            D[i, j] = X[i, j] * Y[i, j]
    return D
mul = autojit(multiplix)
multiplication2 = mul(a,a)
# numexpr
def numexprmult(X, Y):
    M = X.shape[0]
    N = X.shape[1]
    return ne.evaluate("X * Y")
multiplication3 = numexprmult(a,a)

What about using fortran and ctypes?
elementwise.F90:
subroutine elementwise( a, b, c, M, N ) bind(c, name='elementwise')
  use iso_c_binding, only: c_float, c_int
  integer(c_int), intent(in) :: M, N
  real(c_float),  intent(in) :: a(M, N), b(M, N)
  real(c_float),  intent(out):: c(M, N)
  integer :: i, j
  forall (i=1:M, j=1:N)
    c(i, j) = a(i, j) * b(i, j)
  end forall
end subroutine
elementwise.py:
from ctypes import CDLL, POINTER, c_int, c_float
import numpy as np
import time
fortran = CDLL('./elementwise.so')
fortran.elementwise.argtypes = [ POINTER(c_float),
                                 POINTER(c_float),
                                 POINTER(c_float),
                                 POINTER(c_int),
                                 POINTER(c_int) ]
# Setup
M=10
N=5000000
a = np.empty((M,N), dtype=c_float)
b = np.empty((M,N), dtype=c_float)
c = np.empty((M,N), dtype=c_float)
a[:] = np.random.rand(M,N)
b[:] = np.random.rand(M,N)
# Fortran call
start = time.time()
fortran.elementwise( a.ctypes.data_as(POINTER(c_float)),
                     b.ctypes.data_as(POINTER(c_float)),
                     c.ctypes.data_as(POINTER(c_float)),
                     c_int(M), c_int(N) )
stop = time.time()
print 'Fortran took ',stop - start,'seconds'
# Numpy
start = time.time()
c = np.multiply(a,b)
stop = time.time()
print 'Numpy took ',stop - start,'seconds'
I compiled the Fortran file using
gfortran -O3 -funroll-loops -ffast-math -floop-strip-mine -shared -fPIC \
-o elementwise.so elementwise.F90
The output yields a speed-up of ~10%:
$ python elementwise.py
Fortran took 0.213667869568 seconds
Numpy took 0.230120897293 seconds
$ python elementwise.py
Fortran took 0.209784984589 seconds
Numpy took 0.231616973877 seconds
$ python elementwise.py
Fortran took 0.214708089828 seconds
Numpy took 0.25369310379 seconds
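One possible refinement (my suggestion, not part of the original answer): numpy.ctypeslib.ndpointer lets ctypes check the dtype, rank and contiguity of the arrays for you, instead of passing raw float pointers:

from ctypes import CDLL, POINTER, c_int
import numpy as np
from numpy.ctypeslib import ndpointer

fortran = CDLL('./elementwise.so')
# Declare each array argument as a 2-D, C-contiguous float32 array.
arr = ndpointer(dtype=np.float32, ndim=2, flags='C_CONTIGUOUS')
fortran.elementwise.argtypes = [arr, arr, arr, POINTER(c_int), POINTER(c_int)]

M, N = 10, 5000000
a = np.random.rand(M, N).astype(np.float32)
b = np.random.rand(M, N).astype(np.float32)
c = np.empty((M, N), dtype=np.float32)
# The numpy arrays can now be passed directly; ctypes validates them.
fortran.elementwise(a, b, c, c_int(M), c_int(N))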

How are you doing your timings?
The creation of your random array takes up the bulk of your calculation, and if you include it in your timing you will hardly see any real difference in the results; however, if you create it up front, you can actually compare the methods.
Here are my results, and I'm consistently seeing what you are seeing: numpy and numba give about the same results (with numba being a little bit faster).
(I don't have numexpr available)
In [1]: import numpy as np
In [2]: from numba import autojit
In [3]: a=np.random.rand(10,5000000)
In [4]: %timeit multiplication1 = np.multiply(a,a)
10 loops, best of 3: 90 ms per loop
In [5]: # numba
In [6]: def multiplix(X, Y):
   ...:     M = X.shape[0]
   ...:     N = X.shape[1]
   ...:     D = np.empty((M, N), dtype=np.float)
   ...:     for i in range(M):
   ...:         for j in range(N):
   ...:             D[i, j] = X[i, j] * Y[i, j]
   ...:     return D
   ...:
In [7]: mul = autojit(multiplix)
In [26]: %timeit multiplication1 = np.multiply(a,a)
10 loops, best of 3: 182 ms per loop
In [27]: %timeit multiplication1 = np.multiply(a,a)
10 loops, best of 3: 185 ms per loop
In [28]: %timeit multiplication1 = np.multiply(a,a)
10 loops, best of 3: 181 ms per loop
In [29]: %timeit multiplication2 = mul(a,a)
10 loops, best of 3: 179 ms per loop
In [30]: %timeit multiplication2 = mul(a,a)
10 loops, best of 3: 180 ms per loop
In [31]: %timeit multiplication2 = mul(a,a)
10 loops, best of 3: 178 ms per loop
Update:
I used the latest version of numba, just compiled it from source: '0.11.0-3-gea20d11-dirty'
I tested this with the default numpy in Fedora 19, '1.7.1'
and numpy '1.6.1' compiled from source, linked against:
Update 3:
My earlier results were of course incorrect: I had return D in the inner loop, so it was skipping 90% of the calculations.
This provides more evidence for ali_m's assumption that it is really hard to do better than the already very optimized C code.
However, if you are trying to do something more complicated, e.g. a pairwise distance computation such as
np.sqrt(((X[:, None, :] - X) ** 2).sum(-1))
I can reproduce the figures Jake Vanderplas gets:
In [14]: %timeit pairwise_numba(X)
10000 loops, best of 3: 92.6 us per loop
In [15]: %timeit pairwise_numpy(X)
1000 loops, best of 3: 662 us per loop
So it seems you are doing something that has already been optimized so well by numpy that it is hard to do any better.
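The two functions timed above are not shown in the post; here is a sketch of what they typically look like, reconstructed from Jake VanderPlas's Numba-vs-Cython pairwise-distance benchmark, so treat the exact bodies as my assumption rather than the answerer's code. The timings above suggest a fairly small input, e.g. X = np.random.random((1000, 3)).

import numpy as np
from numba import autojit

def pairwise_numpy(X):
    # Vectorized numpy version: broadcasts to an (M, M, N) intermediate.
    return np.sqrt(((X[:, None, :] - X) ** 2).sum(-1))

@autojit
def pairwise_numba(X):
    # Explicit loops, compiled by numba; avoids the large intermediate array.
    M, N = X.shape[0], X.shape[1]
    D = np.empty((M, M), dtype=np.float64)
    for i in range(M):
        for j in range(M):
            d = 0.0
            for k in range(N):
                tmp = X[i, k] - X[j, k]
                d += tmp * tmp
            D[i, j] = np.sqrt(d)
    return D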

Edit: nevermind this answer, I'm wrong (see comment below).
I'm afraid it will be very, very hard to get faster matrix multiplication in Python than with NumPy's. NumPy usually relies on internal Fortran libraries like ATLAS/LAPACK that are very, very well optimized.
To check if your version of NumPy was built with LAPACK support: open a terminal, go to your Python install directory and type:
for f in `find lib/python2.7/site-packages/numpy/* -name \*.so`; do echo $f; ldd $f;echo "\n";done | grep lapack
Note that the path can vary depending on your python version.
If some lines get printed, you surely have LAPACK support... so having faster matrix multiplication on a single core will be very hard to achieve.
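As a side note (my addition, not part of the original answer), a simpler, version-independent check from inside Python is numpy's own build-configuration report:

import numpy as np
np.show_config()  # lists the BLAS/LAPACK libraries numpy was built against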
Now I don't know about using multiple cores to perform matrix multiplication, so you might want to look into that (see ali_m's comment).

Use a GPU. For example, use the following package:
gnumpy
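A rough sketch of the idea (my addition; the exact gnumpy calls are an assumption based on its numpy-like garray interface, and a CUDA-capable GPU with cudamat installed is assumed):

import numpy as np
import gnumpy as gpu

a = gpu.garray(np.random.rand(10, 5000000))  # copy the data to the GPU
c = a * a                                    # element-wise multiply runs on the GPU
result = c.as_numpy_array()                  # copy back to the host when needed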

The speed of np.multiply heavily relies on the arrays being exactly the same shape.
a = np.random.rand(80000,1)
b = np.random.rand(80000,1)
c = np.multiply(a, b)
is fast as hell, whereas the following code takes more than a minute and uses up all of my 16 GB of RAM:
a = np.squeeze(np.random.rand(80000,1))
b = np.random.rand(80000,1)
c = np.multiply(a, b)
The reason is broadcasting: after the squeeze, a has shape (80000,) while b has shape (80000, 1), so the product broadcasts to a full (80000, 80000) matrix instead of an element-wise product of two vectors. So my advice would be to use arrays of exactly the same shape. I hope this is useful for someone looking to speed up element-wise multiplication.
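A quick way to see this (my addition), using np.broadcast, which computes the broadcast shape without allocating the result:

import numpy as np

a = np.squeeze(np.random.rand(80000, 1))  # shape (80000,)
b = np.random.rand(80000, 1)              # shape (80000, 1)

# np.broadcast only builds the broadcast object, so this check is cheap:
print(np.broadcast(a, b).shape)           # -> (80000, 80000)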

Related

Numba Cuda computation seems to be slower than sequential run. Did I do obvious mistakes?

There are several threads covering similar topics, but unfortunately, these seem to be too complicated for me, so I would like to ask a similar question, hoping that someone will have a look at my code specifically to tell me if I got something wrong.
I am learning numba cuda right now, starting with the simple examples one can find on the net. I started with this tutorial:
https://github.com/ContinuumIO/gtc2017-numba/blob/master/4%20-%20Writing%20CUDA%20Kernels.ipynb
which shows how to add arrays in parallel. The system configuration they used to evaluate the times is not given. For my replication of the code, I use a GeForce GTX 1080 Ti and an Intel Core i7 8700K CPU.
I basically copied the addition script from the tutorial, but added also sequential code for comparison:
from numba import cuda
import numpy as np
import time
import math
@cuda.jit
def addition_kernel(x, y, out):
    tx = cuda.threadIdx.x
    ty = cuda.blockIdx.x
    block_size = cuda.blockDim.x
    grid_size = cuda.gridDim.x

    start = tx + ty * block_size
    stride = block_size * grid_size
    for i in range(start, x.shape[0], stride):
        out[i] = y[i] + x[i]

def add(n, x, y):
    for i in range(n):
        y[i] = y[i] + x[i]

if __name__ == "__main__":
    print(cuda.gpus[0])
    print("")

    n = 100000
    x = np.arange(n).astype(np.float32)
    y = 2 * x
    out = np.empty_like(x)

    x_device = cuda.to_device(x)
    y_device = cuda.to_device(y)
    out_device = cuda.device_array_like(x)

    # Set the number of threads in a block
    threadsperblock = 128
    # Calculate the number of thread blocks in the grid
    blockspergrid = 30  # math.ceil(n / threadsperblock)

    # Now start the kernel
    start = time.process_time()
    cuda.synchronize()
    addition_kernel[blockspergrid, threadsperblock](x_device, y_device, out_device)
    cuda.synchronize()
    end = time.process_time()
    out_global_mem = out_device.copy_to_host()
    print("parallel time: ", end - start)

    start = time.process_time()
    add(n, x, y)
    end = time.process_time()
    print("sequential time: ", end - start)
The parallel time is on average around 0.14 seconds, while the code without GPU kernel takes only 0.02 seconds.
This seems quite strange to me. Did I do anything wrong? Or is this problem not a good example for parallelism? (I would not think so, since the for loop can be run in parallel.)
What is odd is that I hardly notice a difference if I do not use the to_device() functions. As far as I understand, those should be important, as they avoid repeated communication between CPU and GPU.
addition_kernel is compiled at runtime when it is called the first time, i.e. in the middle of your measured time! Compiling a kernel is a pretty expensive operation. You can force the compilation to happen eagerly (i.e. when the function is defined) by providing the types to Numba.
Note also that the arrays are a bit too small for you to see a big improvement on the GPU. Moreover, the comparison with the CPU version is not really fair: you should also use Numba for the CPU implementation, or at least NumPy (but not an interpreted pure-CPython loop).
Here is an example:
import numba as nb
from numba import cuda

@cuda.jit('void(float32[::1], float32[::1], float32[::1])')
def addition_kernel(x, y, out):
    tx = cuda.threadIdx.x
    ty = cuda.blockIdx.x
    block_size = cuda.blockDim.x
    grid_size = cuda.gridDim.x

    start = tx + ty * block_size
    stride = block_size * grid_size
    for i in range(start, x.shape[0], stride):
        out[i] = y[i] + x[i]

@nb.njit('void(int64, float32[::1], float32[::1])')
def add(n, x, y):
    for i in range(n):
        y[i] = y[i] + x[i]
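A possible way to time this more fairly (my sketch, reusing the addition_kernel defined above): launch the kernel once so any compilation cost is paid up front, then time only the launch itself, with no host/device copies inside the measured region.

import time
import numpy as np
from numba import cuda

n = 100000
x = np.arange(n).astype(np.float32)
y = 2 * x
x_device = cuda.to_device(x)
y_device = cuda.to_device(y)
out_device = cuda.device_array_like(x)

threadsperblock = 128
blockspergrid = 30

# Warm-up launch: with lazy compilation, this is where the kernel gets compiled.
addition_kernel[blockspergrid, threadsperblock](x_device, y_device, out_device)
cuda.synchronize()

# Timed launch: kernel execution only, no compilation, no host<->device copies.
start = time.perf_counter()
addition_kernel[blockspergrid, threadsperblock](x_device, y_device, out_device)
cuda.synchronize()
print("kernel time:", time.perf_counter() - start)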

How to speed up this python loop script or parallelize it

I am currently working on a script which takes in data for a correlation matrix and computes a bunch of values. This step right here is very costly, and I would like to find ways to speed it up or parallelize it. I have tried Parallel (from Python's joblib), but because of CPU overhead (at least with the way I parametrized it) it is significantly slower than a sequential loop.
import time
import numpy as np
import itertools
from sklearn.datasets import make_blobs
N = 5000
data,_ = make_blobs(n_samples=N,n_features=500)
G = np.corrcoef(data)
''' the cluster function '''
def clus_lc(i, j, G, ns=2):
    ''' c_s '''
    cs = 2*(G[i,j]+1)-1e-6
    ''' A and B '''
    if cs <= ns:
        return 0
    return 0.5*( np.log(ns/cs) + (ns - 1)*np.log( (ns**2 - ns) / ( ns**2 - cs) ) )
''' merge and time '''
indices = list(itertools.combinations(range(N),2))
t0 = time.time()
costs = np.zeros(len(indices))
k=0
for i, j in indices:
    costs[k] = clus_lc(i,j,G)
    k += 1
t1 = time.time()
toseq = t1-t0
print(toseq)
I think I solved the issue by using numba and adding the @jit decorator. This seems to work fine because all operations in the function are calls to numpy functions. On a dataset with N=5000 it goes from 75 s to 10 s, a fantastic improvement. Whether this can be improved further, I am interested in hearing other input.
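A minimal sketch of that approach (my reconstruction, not the poster's exact code): jit the scoring function with @njit and move the pair loop into compiled code as well, so the Python-level loop over 12.5 million pairs disappears entirely.

import numpy as np
from numba import njit

@njit
def clus_lc_nb(i, j, G, ns=2):
    cs = 2.0 * (G[i, j] + 1.0) - 1e-6
    if cs <= ns:
        return 0.0
    return 0.5 * (np.log(ns / cs) + (ns - 1) * np.log((ns**2 - ns) / (ns**2 - cs)))

@njit
def all_pair_costs(G, ns=2):
    # Loop over all i < j pairs in compiled code instead of Python.
    N = G.shape[0]
    costs = np.empty(N * (N - 1) // 2)
    k = 0
    for i in range(N):
        for j in range(i + 1, N):
            costs[k] = clus_lc_nb(i, j, G, ns)
            k += 1
    return costs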

Avoid loops in the computation of logistic equation?

I am trying to calculate the nth value of a logistic equation in Python. It is easy to do it with a loop:
import timeit
tic = timeit.default_timer()
x = 0.23
i = 0
n = 1000000000
while i < n:
    x = 4 * x * (1 - x)
    i += 1
toc = timeit.default_timer()
toc - tic
However it is also generally time-consuming. Doing it in PyPy greatly improves the performance, as suggested by abarnert in Is MATLAB faster than Python (little simple experiment).
I have also been advised to avoid Python loops and to use NumPy arrays and vector operations instead, but I do not see how those can help here (it seems to me that NumPy operations are similar to MATLAB's, and I am unaware of any way the code above can be vectorized in MATLAB either).
Is there a way to optimize the code without loops?
Without loops? Maybe, but this is probably not the best way to go. It's important to realize that loops are not slow per se; you try to avoid them in Python or MATLAB in high-performance code, but if you are writing C code you don't have to care.
So one idea for optimizing here would be to use Cython to compile your code to C:
python version:
def calc_x(x, n):
    i = 0
    while i < n:
        x = 4 * x * (1 - x)
        i += 1
    return x
statically typed cython version:
def calc_x_cy(double x, long n):
    cdef long i = 0
    while i < n:
        x = 4 * x * (1 - x)
        i += 1
    return x
And all of a sudden, you are almost two orders of magnitude faster:
%timeit calc_x(0.23, n) -> 1 loops, best of 3: 26.9 s per loop
%timeit calc_x_cy(0.23, n) -> 1 loops, best of 3: 370 ms per loop
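Not from the original answer, but in the same spirit: numba's @njit typically gives a comparable speedup here without a separate compilation step (assuming numba is installed):

from numba import njit

@njit
def calc_x_nb(x, n):
    # Same recurrence as above, compiled to machine code on first call.
    for _ in range(n):
        x = 4 * x * (1 - x)
    return x

calc_x_nb(0.23, 10)             # first call triggers JIT compilation
result = calc_x_nb(0.23, 10**9) # subsequent calls run at compiled speed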

parallelizing prime generation in sage

Since ask.sagemath.org is down, I figure I'd ask this here...
I'm trying to parallelize the generation of a bunch of random primes using the random_prime function in sage. Here's some code:
#!/usr/bin/env sage -python
from __future__ import print_function
from sage.all import *
import time
N = 100
B = 200
length = (1 << B) - 1
print('Generating primes (sequentially)')
start = time.time()
ps = []
for _ in xrange(N):
    ps.append(random_prime(length))
end = time.time()
print('Took: %f seconds' % (end - start))
print(ps)
@parallel(ncpus=10)
def _random_prime(length):
    return random_prime(length)
print('Generating primes (in parallel)')
start = time.time()
ps = [length] * N
ps = list(_random_prime(ps))
end = time.time()
print('Took: %f seconds' % (end - start))
ps = [p for _, p in ps]
print(ps)
The first run through computes the primes sequentially, and it works.
The second run through computes them using sage's @parallel decorator. It "works" in the sense that the computation is parallelized, but all the output primes are the same (i.e., it doesn't generate 100 random primes but rather 100 instances of the same random prime). I'd think this would be a common use case of @parallel, but I cannot find any details on what the issue is here. Anyone have any ideas? Thanks.
Just a hint on this - is it possible that you'd need to set a different seed each time? Note that the doc doesn't have a #random in the test, so perhaps the seed for all the parallel instances is the same for some reason (which seems reasonable).
Edit to put Volker's more detailed description of how to do this:
import os
import sage.misc.randstate as randstate
randstate.set_random_seed(os.getpid())
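Putting the two together, a sketch of how this might look (my arrangement, not from the original answers): re-seed inside the decorated function, so each forked worker draws from its own random stream.

import os
import sage.misc.randstate as randstate
from sage.all import random_prime, parallel

@parallel(ncpus=10)
def _random_prime(length):
    # Each worker runs in its own process, so seed from its pid.
    randstate.set_random_seed(os.getpid())
    return random_prime(length)

# Same call pattern as in the question: one input per prime to generate.
ps = [p for _, p in _random_prime([(1 << 200) - 1] * 100)]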

cython running slower than numpy for distance calculation

I'm trying to learn cython; however, I must be doing something wrong. This little piece of test code is running about 50 times slower than my vectorized numpy version of it. Can someone please tell me why my cython is slower than my python? Thanks.
The code calculates the distance between a point in R^3, loc, and an array of points in R^3, points.
import numpy as np
cimport numpy as np
import cython
cimport cython
DTYPE = np.float64
ctypedef np.float64_t DTYPE_t
@cython.boundscheck(False)  # turn off bounds-checking for entire function
@cython.wraparound(False)
@cython.nonecheck(False)
def distMeasureCython(np.ndarray[DTYPE_t, ndim=2] points, np.ndarray[DTYPE_t, ndim=1] loc):
    cdef unsigned int i
    cdef unsigned int L = points.shape[0]
    cdef np.ndarray[DTYPE_t, ndim=1] d = np.zeros(L)
    for i in xrange(0, L):
        d[i] = np.sqrt((points[i,0] - loc[0])**2 + (points[i,1] - loc[1])**2 + (points[i,2] - loc[2])**2)
    return d
This is the numpy code that it's being compared against.
from numpy import *
N = 1e6
points = random.uniform(0,1,(N,3))
loc = random.uniform(0,1,(3))
def distMeasureNumpy(points, loc):
    d = points - loc
    d = sqrt(sum(d*d, axis=1))
    return d
The numpy/python version takes about 44 ms and the cython version takes about 2 seconds. I'm running Python 2.7 on Mac OS X, and I'm using IPython's %timeit command to time the two functions.
The call to np.sqrt, which is a Python function call, is killing your performance. You are computing the square root of a scalar floating-point value, so you should use the sqrt function from the C math library instead. Here's a modified version of your code:
import numpy as np
cimport numpy as np
import cython
cimport cython
from libc.math cimport sqrt
DTYPE = np.float64
ctypedef np.float64_t DTYPE_t
@cython.boundscheck(False)  # turn off bounds-checking for entire function
@cython.wraparound(False)
@cython.nonecheck(False)
def distMeasureCython(np.ndarray[DTYPE_t, ndim=2] points,
                      np.ndarray[DTYPE_t, ndim=1] loc):
    cdef unsigned int i
    cdef unsigned int L = points.shape[0]
    cdef np.ndarray[DTYPE_t, ndim=1] d = np.zeros(L)
    for i in xrange(0, L):
        d[i] = sqrt((points[i,0] - loc[0])**2 +
                    (points[i,1] - loc[1])**2 +
                    (points[i,2] - loc[2])**2)
    return d
The following demonstrates the performance improvement. Your original code is in the module check_speed_original, and the modified version is in check_speed:
In [11]: import check_speed_original
In [12]: import check_speed
Set up the test data:
In [13]: N = 10**6
In [14]: points = random.uniform(0,1,(N,3))
In [15]: loc = random.uniform(0,1,(3,))
The original version takes 1.26 seconds on my computer:
In [16]: %timeit check_speed_original.distMeasureCython(points, loc)
1 loops, best of 3: 1.26 s per loop
The modified version takes 4.47 milliseconds:
In [17]: %timeit check_speed.distMeasureCython(points, loc)
100 loops, best of 3: 4.47 ms per loop
In case anyone is worried that the results might be different:
In [18]: d1 = check_speed.distMeasureCython(points, loc)
In [19]: d2 = check_speed_original.distMeasureCython(points, loc)
In [20]: np.all(d1 == d2)
Out[20]: True
As already mentioned, it's the numpy.sqrt call in the code. However, I don't think one needs to employ cdef extern, since Cython already provides these basic C/C++ libraries (see the docs). So you can just cimport it like this:
from libc.math cimport sqrt
Just to get rid of the overhead.
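For completeness (my addition, not part of either answer), a minimal build script for the extension, assuming the Cython source is saved as check_speed.pyx; build it with python setup.py build_ext --inplace:

from setuptools import setup
from Cython.Build import cythonize
import numpy

setup(
    ext_modules=cythonize("check_speed.pyx"),
    include_dirs=[numpy.get_include()],  # needed because the module cimports numpy
)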
