I am currently working on a script which takes in data for a correlation matrix and computes a number of values. This step is very costly, and I would like to find ways to speed it up or parallelize it. I have tried using Parallel (from Python's joblib), but because of CPU overhead (at least with the way I parametrized it) it is significantly slower than a sequential loop.
import time
import itertools

import numpy as np
from sklearn.datasets import make_blobs

N = 5000
data, _ = make_blobs(n_samples=N, n_features=500)
G = np.corrcoef(data)

# the cluster function
def clus_lc(i, j, G, ns=2):
    # c_s
    cs = 2 * (G[i, j] + 1) - 1e-6
    # A and B
    if cs <= ns:
        return 0.0
    return 0.5 * (np.log(ns / cs) + (ns - 1) * np.log((ns**2 - ns) / (ns**2 - cs)))

# merge and time
indices = list(itertools.combinations(range(N), 2))
t0 = time.time()
costs = np.zeros(len(indices))
k = 0
for i, j in indices:
    costs[k] = clus_lc(i, j, G)
    k += 1
t1 = time.time()
toseq = t1 - t0
print(toseq)
I think I solved the issue by using numba and adding the @jit decorator. This seems to work well because all operations in the function are NumPy calls or scalar math. On a dataset with N=5000 it goes from 75 s to 10 s, a fantastic improvement. Whether this can be further improved, I am interested in hearing other inputs.
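For reference, the same pairwise computation can also be fully vectorized with NumPy, removing the Python-level loop entirely. A minimal sketch (not the code I timed; np.triu_indices(N, k=1) enumerates the same i < j pairs, in the same order, as itertools.combinations(range(N), 2)):

iu = np.triu_indices(N, k=1)      # all pairs with i < j
cs = 2 * (G[iu] + 1) - 1e-6
ns = 2
costs_vec = np.zeros(cs.shape)
mask = cs > ns                    # pairs with cs <= ns keep a cost of 0
csm = cs[mask]
costs_vec[mask] = 0.5 * (np.log(ns / csm) +
                         (ns - 1) * np.log((ns**2 - ns) / (ns**2 - csm)))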
Related: There are several threads covering similar topics, but unfortunately they seem too complicated for me, so I would like to ask a similar question, hoping that someone will look at my code specifically and tell me if I got something wrong.
I am learning numba cuda right now, starting with the simple examples one can find on the net. I started with this tutorial:
https://github.com/ContinuumIO/gtc2017-numba/blob/master/4%20-%20Writing%20CUDA%20Kernels.ipynb
which shows how to add two arrays in parallel. The system configuration used to evaluate the timings is not given; for my replication I use a GeForce GTX 1080 Ti and an Intel Core i7-8700K CPU.
I basically copied the addition script from the tutorial, but also added sequential code for comparison:
from numba import cuda
import numpy as np
import time
import math
@cuda.jit
def addition_kernel(x, y, out):
    tx = cuda.threadIdx.x
    ty = cuda.blockIdx.x
    block_size = cuda.blockDim.x
    grid_size = cuda.gridDim.x
    start = tx + ty * block_size
    stride = block_size * grid_size
    for i in range(start, x.shape[0], stride):
        out[i] = y[i] + x[i]

def add(n, x, y):
    for i in range(n):
        y[i] = y[i] + x[i]
if __name__ == "__main__":
    print(cuda.gpus[0])
    print("")
    n = 100000
    x = np.arange(n).astype(np.float32)
    y = 2 * x
    out = np.empty_like(x)

    x_device = cuda.to_device(x)
    y_device = cuda.to_device(y)
    out_device = cuda.device_array_like(x)

    # Set the number of threads in a block
    threadsperblock = 128
    # Calculate the number of thread blocks in the grid
    blockspergrid = 30  # math.ceil(n / threadsperblock)

    # Now start the kernel
    start = time.process_time()
    cuda.synchronize()
    addition_kernel[blockspergrid, threadsperblock](x_device, y_device, out_device)
    cuda.synchronize()
    end = time.process_time()
    out_global_mem = out_device.copy_to_host()
    print("parallel time: ", end - start)

    start = time.process_time()
    add(n, x, y)
    end = time.process_time()
    print("sequential time: ", end - start)
The parallel time is on average around 0.14 seconds, while the code without GPU kernel takes only 0.02 seconds.
This seems quite strange to me. Is there anything I did wrong? Or is this problem not a good example for parallelism? (I don't think so, since the for loop can run in parallel.)
What is odd is that I hardly notice a difference if I do not use the to_device() functions. As far as I understand, these should be important, as they avoid communication between CPU and GPU after each iteration.
addition_kernel is compiled at runtime the first time it is called, which lands right in the middle of your measured time! Compiling a kernel is a fairly expensive operation. You can force the compilation to happen eagerly (i.e. when the function is defined) by providing the types to Numba.
Note that the arrays are a bit too small for you to see a big improvement on GPUs. Moreover, the comparison with the CPU version is not really fair: you should also use Numba for the CPU implementation, or at least NumPy (but not an interpreted pure-CPython loop).
Here is an example:
import numba as nb
from numba import cuda

@cuda.jit('void(float32[::1], float32[::1], float32[::1])')
def addition_kernel(x, y, out):
    tx = cuda.threadIdx.x
    ty = cuda.blockIdx.x
    block_size = cuda.blockDim.x
    grid_size = cuda.gridDim.x
    start = tx + ty * block_size
    stride = block_size * grid_size
    for i in range(start, x.shape[0], stride):
        out[i] = y[i] + x[i]

@nb.njit('void(int64, float32[::1], float32[::1])')
def add(n, x, y):
    for i in range(n):
        y[i] = y[i] + x[i]
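With eager compilation in place, it also helps to run one warm-up launch before timing, so that any remaining one-time costs stay outside the measurement. A minimal sketch, reusing the variables from the question:

# warm-up launch: remaining one-time setup happens here, outside the timed region
addition_kernel[blockspergrid, threadsperblock](x_device, y_device, out_device)
cuda.synchronize()

start = time.process_time()
addition_kernel[blockspergrid, threadsperblock](x_device, y_device, out_device)
cuda.synchronize()
end = time.process_time()
print("parallel time (steady state): ", end - start)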
In the time tests shown below, I found that Skyfield takes several hundred microseconds up to a millisecond to return obj.at(jd).position.km for a single time value in jd, but the incremental cost for longer JulianDate objects (a list of points in time) is only about one microsecond per point. I see similar speeds using Jplephem and with two different ephemerides.
My question here is: if I want to random-access points in time, for example as a slave to an external Runge-Kutta routine which uses its own variable stepsize, is there a way I can do this faster within python (without having to learn to compile code)?
I understand this is not at all the typical way Skyfield is intended to be used. Normally we'd load a JulianDate object with a long list of time points and then calculate them at once, and probably do that a few times, not thousands of times (or more), the way an orbit integrator might do.
Workaround: I can imagine a workaround where I build my own NumPy database by running Skyfield once with a JulianDate object of fine time granularity, then write my own Runge-Kutta routine that changes step sizes up and down by discrete amounts, such that the timesteps always correspond directly to the striding of the NumPy array.
Or I could even try re-interpolating. I am not doing highly precise calculations, so a simple NumPy or SciPy 2nd-order interpolation might be fine; a sketch of that idea follows.
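A minimal sketch of that re-interpolation idea, reusing earth and JulianDate from the script below (the grid size and interpolation order are arbitrary choices, and I am assuming the JulianDate object exposes its times as jd.tt, as in the Skyfield version used here):

from scipy.interpolate import interp1d

grid_years = np.linspace(0, 100, 100001)           # fine fixed time grid, ~8.8 h spacing
jd = JulianDate(utc=(1900 + grid_years, 1, 1))
pos = earth.at(jd).position.km                     # one vectorized Skyfield call, shape (3, npoints)
interp = interp1d(jd.tt, pos, kind='quadratic', axis=1)

xyz = interp(jd.tt[0] + 12345.6)                   # random access is now a cheap local evaluation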
Ultimately I'd like to try integrating the path of objects under the influence of the gravity field of the solar system (e.g. a deep-space satellite, comet, or asteroid). When looking for an orbit solution one might try millions of starting state vectors in 6D phase space. I know I should be using the ob.at(jd).observe(large_body).position.km method because gravity travels at the speed of light like everything else. This seems to cost substantial time since (I'm guessing) it's an iterative calculation ("Let's see... where would Jupiter have been such that I feel its gravity right NOW?"). But let's peel the cosmic onion one layer at a time.
Figure 1. Skyfield and JPLephem performance on my laptop for different-length JulianDate objects, for de405 and de421. They are all about the same: (very) roughly a half-millisecond for the first point and a microsecond for each additional point. Also, the very first point calculated when the script runs (Earth, blue, with len(jd) = 1) carries an additional millisecond of overhead.
Earth and Moon are slower because each is a two-step calculation internally (the Earth-Moon barycenter, plus the individual orbit about that barycenter). Mercury may be slower because it moves so fast relative to the ephemeris time steps that it needs more coefficients in the (costly) Chebyshev interpolation.
SCRIPT FOR SKYFIELD DATA (the JPLephem script is farther down)
import time

import numpy as np
import matplotlib.pyplot as plt
from skyfield.api import load, JulianDate

ephem = 'de421.bsp'
ephem = 'de405.bsp'

de = load(ephem)

earth = de['earth']
moon = de['moon']
earth_barycenter = de['earth barycenter']
mercury = de['mercury']
jupiter = de['jupiter barycenter']
pluto = de['pluto barycenter']

things = [earth, moon, earth_barycenter, mercury, jupiter, pluto]
names = ['earth', 'moon', 'earth barycenter', 'mercury', 'jupiter', 'pluto']

ntimes = [i * 10**n for n in range(5) for i in [1, 2, 5]]
years = [np.zeros(1)] + [np.linspace(0, 100, n) for n in ntimes[1:]]  # 100 years

microsecs = []
for y in years:
    jd = JulianDate(utc=(1900 + y, 1, 1))
    mics = []
    for thing in things:
        tstart = time.clock()
        answer = thing.at(jd).position.km
        mics.append(1E+06 * (time.clock() - tstart))
    microsecs.append(mics)

microsecs = np.array(microsecs).T
many = [len(y) for y in years]

fig = plt.figure()
ax = plt.subplot(111, xlabel='length of JD object',
                 ylabel='microseconds',
                 title='time for thing.at(jd).position.km with ' + ephem)
# larger fonts, per http://stackoverflow.com/a/14971193/3904031
for item in ([ax.title, ax.xaxis.label, ax.yaxis.label] +
             ax.get_xticklabels() + ax.get_yticklabels()):
    item.set_fontsize(item.get_fontsize() + 4)
for name, mics in zip(names, microsecs):
    ax.plot(many, mics, lw=2, label=name)
plt.legend(loc='upper left', shadow=False, fontsize='x-large')
plt.xscale('log')
plt.yscale('log')
plt.savefig("skyfield speed test " + ephem.split('.')[0])
plt.show()
SCRIPT FOR JPLEPHEM DATA (the Skyfield script is above)
import time

import numpy as np
import matplotlib.pyplot as plt
from jplephem.spk import SPK

ephem = 'de421.bsp'
ephem = 'de405.bsp'

kernel = SPK.open(ephem)

jd_1900_01_01 = 2415020.5004882407
ntimes = [i * 10**n for n in range(5) for i in [1, 2, 5]]
years = [np.zeros(1)] + [np.linspace(0, 100, n) for n in ntimes[1:]]  # 100 years

barytup = (0, 3)
earthtup = (3, 399)
# moontup = (3, 301)

microsecs = []
for y in years:
    mics = []
    jd = jd_1900_01_01 + y * 365.25  # roughly, it doesn't matter here
    tstart = time.clock()
    answer = kernel[earthtup].compute(jd) + kernel[barytup].compute(jd)
    mics.append(1E+06 * (time.clock() - tstart))
    microsecs.append(mics)

microsecs = np.array(microsecs)
many = [len(y) for y in years]

fig = plt.figure()
ax = plt.subplot(111, xlabel='length of JD object',
                 ylabel='microseconds',
                 title='time for jplephem [0,3] and [3,399] with ' + ephem)
# larger fonts, from http://stackoverflow.com/a/14971193/3904031
for item in ([ax.title, ax.xaxis.label, ax.yaxis.label] +
             ax.get_xticklabels() + ax.get_yticklabels()):
    item.set_fontsize(item.get_fontsize() + 4)
ax.plot(many, microsecs, lw=2, label='earth')
plt.legend(loc='upper left', shadow=False, fontsize='x-large')
plt.xscale('log')
plt.yscale('log')
plt.ylim(1E+02, 1E+06)
plt.savefig("jplephem speed test " + ephem.split('.')[0])
plt.show()
I am trying to calculate the nth value of a logistic equation in Python. It is easy to do it with a loop:
import timeit

tic = timeit.default_timer()

x = 0.23
i = 0
n = 1000000000
while i < n:
    x = 4 * x * (1 - x)
    i += 1

toc = timeit.default_timer()
toc - tic
However it is also generally time-consuming. Doing it in PyPy greatly improves the performance, as suggested by abarnert in Is MATLAB faster than Python (little simple experiment).
I have also been suggested to avoid Python loops and use NumPy arrays and vector operations instead - actually I do not see how these can help (it seems to me that NumPy operations are similar to Matlab ones, and I am unaware of any way the code above can be vectorized in Matlab either).
Is there a way to optimize the code without loops?
Without loops? Maybe, but this is probably not the best way to go. It's important to realize that loops are not slow per se; you try to avoid them in Python or MATLAB in high-performance code. If you are writing C code, you don't have to care.
So one idea to optimize here would be to use cython to compile your code to C code:
python version:
def calc_x(x, n):
    i = 0
    while i < n:
        x = 4 * x * (1 - x)
        i += 1
    return x
statically typed cython version:
def calc_x_cy(double x, long n):
    cdef long i = 0
    while i < n:
        x = 4 * x * (1 - x)
        i += 1
    return x
And all of a sudden, you are almost two orders of magnitude faster:
%timeit calc_x(0.23, n) -> 1 loops, best of 3: 26.9 s per loop
%timeit calc_x_cy(0.23, n) -> 1 loops, best of 3: 370 ms per loop
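To actually build the typed version outside a notebook, one option is a minimal setup.py; this is a sketch, assuming the function lives in a file named calc_x_cy.pyx:

# setup.py -- build in place with: python setup.py build_ext --inplace
from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("calc_x_cy.pyx"))

In IPython, the %%cython cell magic (after %load_ext cython) compiles the cell without a separate build step.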
Since ask.sagemath.org is down, I figured I'd ask this here...
I'm trying to parallelize the generation of a bunch of random primes using the random_prime function in sage. Here's some code:
#!/usr/bin/env sage -python

from __future__ import print_function
from sage.all import *
import time

N = 100
B = 200
length = (1 << B) - 1

print('Generating primes (sequentially)')
start = time.time()
ps = []
for _ in xrange(N):
    ps.append(random_prime(length))
end = time.time()
print('Took: %f seconds' % (end - start))
print(ps)

@parallel(ncpus=10)
def _random_prime(length):
    return random_prime(length)

print('Generating primes (in parallel)')
start = time.time()
ps = [length] * N
ps = list(_random_prime(ps))
end = time.time()
print('Took: %f seconds' % (end - start))
ps = [p for _, p in ps]
print(ps)
The first run through computes the primes sequentially, and it works.
The second run through computes them using sage's @parallel decorator. It "works" in the sense that the computation is parallelized, but all the output primes are the same (i.e., it doesn't generate 100 random primes but rather 100 instances of the same random prime). I'd think this would be a common use case of @parallel, but I cannot find any details on what the issue is. Anyone have any ideas? Thanks.
Just a hint on this: is it possible that you need to set a different seed each time? Note that the doc doesn't have a "# random" marker in the test, so perhaps the seed for all the parallel instances is the same for some reason (which seems reasonable).
Edit, to include Volker's more detailed description of how to do this:
import os
import sage.misc.randstate as randstate
randstate.set_random_seed(os.getpid())
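Applied to the code in the question, that would look something like this (a sketch; the point is that the seed is set inside the decorated function, so it runs in each forked worker process):

@parallel(ncpus=10)
def _random_prime(length):
    import os
    import sage.misc.randstate as randstate
    randstate.set_random_seed(os.getpid())   # a different pid per worker means a different seed
    return random_prime(length)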
I have been playing around with numba and numexpr, trying to speed up a simple element-wise matrix multiplication. I have not been able to get better results; they are both basically (speed-wise) equivalent to NumPy's multiply function. Has anyone had any luck in this area? Am I using numba and numexpr wrong (I'm quite new to this), or is this altogether a bad approach to speeding this up? Here is reproducible code; thanks in advance:
import numpy as np
from numba import autojit
import numexpr as ne

a = np.random.rand(10, 5000000)

# numpy
multiplication1 = np.multiply(a, a)

# numba
def multiplix(X, Y):
    M = X.shape[0]
    N = X.shape[1]
    D = np.empty((M, N), dtype=np.float64)  # np.float is removed in newer NumPy
    for i in range(M):
        for j in range(N):
            D[i, j] = X[i, j] * Y[i, j]
    return D

mul = autojit(multiplix)
multiplication2 = mul(a, a)

# numexpr
def numexprmult(X, Y):
    M = X.shape[0]
    N = X.shape[1]
    return ne.evaluate("X * Y")

multiplication3 = numexprmult(a, a)
What about using Fortran and ctypes?
elementwise.F90:
subroutine elementwise( a, b, c, M, N ) bind(c, name='elementwise')
  use iso_c_binding, only: c_float, c_int
  integer(c_int), intent(in) :: M, N
  real(c_float), intent(in)  :: a(M, N), b(M, N)
  real(c_float), intent(out) :: c(M, N)
  integer :: i, j

  forall (i=1:M, j=1:N)
    c(i, j) = a(i, j) * b(i, j)
  end forall
end subroutine
elementwise.py:
from ctypes import CDLL, POINTER, c_int, c_float
import numpy as np
import time

fortran = CDLL('./elementwise.so')
fortran.elementwise.argtypes = [POINTER(c_float),
                                POINTER(c_float),
                                POINTER(c_float),
                                POINTER(c_int),
                                POINTER(c_int)]

# Setup
M = 10
N = 5000000

a = np.empty((M, N), dtype=c_float)
b = np.empty((M, N), dtype=c_float)
c = np.empty((M, N), dtype=c_float)

a[:] = np.random.rand(M, N)
b[:] = np.random.rand(M, N)

# Fortran call
start = time.time()
fortran.elementwise(a.ctypes.data_as(POINTER(c_float)),
                    b.ctypes.data_as(POINTER(c_float)),
                    c.ctypes.data_as(POINTER(c_float)),
                    c_int(M), c_int(N))
stop = time.time()
print('Fortran took', stop - start, 'seconds')

# Numpy
start = time.time()
c = np.multiply(a, b)
stop = time.time()
print('Numpy took', stop - start, 'seconds')
I compiled the Fortran file using
gfortran -O3 -funroll-loops -ffast-math -floop-strip-mine -shared -fPIC \
-o elementwise.so elementwise.F90
The output yields a speed-up of ~10%:
$ python elementwise.py
Fortran took 0.213667869568 seconds
Numpy took 0.230120897293 seconds
$ python elementwise.py
Fortran took 0.209784984589 seconds
Numpy took 0.231616973877 seconds
$ python elementwise.py
Fortran took 0.214708089828 seconds
Numpy took 0.25369310379 seconds
How are you doing your timings? The creation of your random array takes up the bulk of your calculation, and if you include it in your timing you will hardly see any real difference in the results; however, if you create it up front, you can actually compare the methods.
Here are my results, and I'm consistently seeing what you are seeing: numpy and numba give about the same results (with numba being a little bit faster). (I don't have numexpr available.)
In [1]: import numpy as np
In [2]: from numba import autojit
In [3]: a=np.random.rand(10,5000000)
In [4]: %timeit multiplication1 = np.multiply(a,a)
10 loops, best of 3: 90 ms per loop
In [5]: # numba
In [6]: def multiplix(X,Y):
...: M = X.shape[0]
...: N = X.shape[1]
...: D = np.empty((M, N), dtype=np.float)
...: for i in range(M):
...: for j in range(N):
...: D[i,j] = X[i, j] * Y[i, j]
...: return D
...:
In [7]: mul = autojit(multiplix)
In [26]: %timeit multiplication1 = np.multiply(a,a)
10 loops, best of 3: 182 ms per loop
In [27]: %timeit multiplication1 = np.multiply(a,a)
10 loops, best of 3: 185 ms per loop
In [28]: %timeit multiplication1 = np.multiply(a,a)
10 loops, best of 3: 181 ms per loop
In [29]: %timeit multiplication2 = mul(a,a)
10 loops, best of 3: 179 ms per loop
In [30]: %timeit multiplication2 = mul(a,a)
10 loops, best of 3: 180 ms per loop
In [31]: %timeit multiplication2 = mul(a,a)
10 loops, best of 3: 178 ms per loop
Update: I used the latest version of numba, just compiled from source: '0.11.0-3-gea20d11-dirty'. I tested this with the default numpy in Fedora 19 ('1.7.1') and with numpy '1.6.1' compiled from source, linked against:
Update 3: My earlier results were of course incorrect: I had return D in the inner loop, so 90% of the calculations were skipped. This provides more evidence for ali_m's assumption that it is really hard to do better than the already very optimized C code.
However, if you are trying to do something more complicated, e.g.,
np.sqrt(((X[:, None, :] - X) ** 2).sum(-1))
I can reproduce the figures Jake Vanderplas gets:
In [14]: %timeit pairwise_numba(X)
10000 loops, best of 3: 92.6 us per loop
In [15]: %timeit pairwise_numpy(X)
1000 loops, best of 3: 662 us per loop
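For context, the two functions being timed follow the pattern from Jake Vanderplas's post; this is a sketch from memory, not a verbatim copy:

def pairwise_numpy(X):
    # broadcasting builds the full (M, M, N) difference array before reducing
    return np.sqrt(((X[:, None, :] - X) ** 2).sum(-1))

@autojit
def pairwise_numba(X):
    # explicit loops, compiled by numba, never materialize the big intermediate
    M = X.shape[0]
    N = X.shape[1]
    D = np.empty((M, M), dtype=np.float64)
    for i in range(M):
        for j in range(M):
            d = 0.0
            for k in range(N):
                tmp = X[i, k] - X[j, k]
                d += tmp * tmp
            D[i, j] = np.sqrt(d)
    return D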
So it seems you are doing something that numpy has already optimized so heavily that it is hard to do any better.
Edit: never mind this answer, I'm wrong (see comment below).
I'm afraid it will be very, very hard to have a faster matrix multiplication in Python than with NumPy's. NumPy usually uses internal Fortran libraries like ATLAS/LAPACK that are very well optimized.
To check if your version of NumPy was built with LAPACK support: open a terminal, go to your Python install directory and type:
for f in `find lib/python2.7/site-packages/numpy/* -name \*.so`; do echo $f; ldd $f;echo "\n";done | grep lapack
Note that the path can vary depending on your python version.
If some lines get printed, you surely have LAPACK support... so having faster matrix multiplication on a single core will be very hard to achieve.
Now I don't know about using multiple cores to perform matrix multiplication, so you might want to look into that (see ali_m's comment).
Use a GPU, for example via the following package:
gnumpy
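A minimal sketch of what that looks like (this assumes a working CUDA/cudamat setup; a is the array from the question):

import gnumpy as gpu

a_gpu = gpu.garray(a)                        # copy the array to GPU memory once
prod_gpu = a_gpu * a_gpu                     # element-wise multiply runs on the GPU
multiplication4 = prod_gpu.as_numpy_array()  # copy the result back to a NumPy array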
The speed of np.multiply heavily relies on the arrays being exactly the same shape.
a = np.random.rand(80000,1)
b = np.random.rand(80000,1)
c = np.multiply(a, b)
is fast as hell, whereas the following code takes more than a minute and uses up all of my 16 GB of RAM:
a = np.squeeze(np.random.rand(80000,1))
b = np.random.rand(80000,1)
c = np.multiply(a, b)
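What happens here is broadcasting: after the squeeze, the shapes are (80000,) and (80000, 1), and NumPy broadcasts them to an 80000 x 80000 outer product instead of an element-wise one. A small example makes the shape blow-up visible:

a = np.squeeze(np.random.rand(5, 1))   # shape (5,)
b = np.random.rand(5, 1)               # shape (5, 1)
print(np.multiply(a, b).shape)         # (5, 5): a broadcast outer product, not element-wise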
So my advice would be to use arrays of exactly the same dimensions. Hope this is useful for someone looking to speed up element-wise multiplication.