The two ways of computing tanh are shown below. Why is torch.tanh (1) so much faster than the direct expression (2)? And where can I find the source code of torch.tanh in PyTorch? Is it written in C/C++?
import torch
import time

def tanh(x):
    return (torch.exp(x) - torch.exp(-x)) / (torch.exp(x) + torch.exp(-x))

class Function(torch.nn.Module):
    def __init__(self):
        super(Function, self).__init__()
        self.Linear1 = torch.nn.Linear(3, 50)
        self.Linear2 = torch.nn.Linear(50, 50)
        self.Linear3 = torch.nn.Linear(50, 50)
        self.Linear4 = torch.nn.Linear(50, 1)

    def forward(self, x):
        # (1) torch.tanh
        x = torch.tanh(self.Linear1(x))
        x = torch.tanh(self.Linear2(x))
        x = torch.tanh(self.Linear3(x))
        x = torch.tanh(self.Linear4(x))
        # (2) direct expression
        # x = tanh(self.Linear1(x))
        # x = tanh(self.Linear2(x))
        # x = tanh(self.Linear3(x))
        # x = tanh(self.Linear4(x))
        return x

func = Function()
x = torch.ones(1000, 3)
T1 = time.time()
for i in range(10000):
    y = func(x)
T2 = time.time()
print(T2 - T1)
The built-in mathematical functions are written in highly optimized code: they can use advanced CPU features and multiple cores, and they can even take advantage of GPUs.
Your tanh function evaluates exp four times, negates x twice, and performs one subtraction, one addition, and one division. Each of those steps creates a temporary tensor, and the memory allocation can be slow as well, not to mention the overhead of the Python interpreter. Being 4 to 10 times slower is reasonable.
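Part of the redundant work in the hand-written version can be removed algebraically. The following is a minimal scalar sketch, assuming plain Python floats rather than tensors (the name tanh_one_exp is hypothetical and this is not how PyTorch implements tanh): since tanh(x) = (1 - e^(-2x)) / (1 + e^(-2x)), one exponential is enough. It only illustrates the arithmetic and would not make a Python-level version competitive with torch.tanh.

```python
import math

def tanh_one_exp(x):
    """tanh(x) with a single exp call (illustrative sketch only)."""
    # Use |x| so that exp(-2|x|) lies in [0, 1] and cannot overflow.
    t = math.exp(-2.0 * abs(x))
    y = (1.0 - t) / (1.0 + t)
    # tanh is odd, so restore the sign afterwards.
    return y if x >= 0 else -y

# Compare against the standard library for a few sample values.
for v in (-3.0, -0.5, 0.0, 0.5, 3.0):
    print(v, tanh_one_exp(v), math.tanh(v))
```

The same rearrangement applies element-wise to tensors, but every remaining operation would still allocate a temporary, which is the cost the answer describes.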
There are several threads covering similar topics, but unfortunately, these seem to be too complicated for me, so I would like to ask a similar question, hoping that someone will have a look at my code specifically to tell me if I got something wrong.
I am learning numba cuda right now, starting with the simple examples one can find in the net. I started with this tutorial here:
https://github.com/ContinuumIO/gtc2017-numba/blob/master/4%20-%20Writing%20CUDA%20Kernels.ipynb
which shows how to do an addition of arrays in parallel. The system configuration they used to evaluate the times is not given. For the code replication, I use a Geforce GTX 1080 Ti and an Intel Core i7 8700K CPU.
I basically copied the addition script from the tutorial, but added also sequential code for comparison:
from numba import cuda
import numpy as np
import time
import math

@cuda.jit
def addition_kernel(x, y, out):
    tx = cuda.threadIdx.x
    ty = cuda.blockIdx.x
    block_size = cuda.blockDim.x
    grid_size = cuda.gridDim.x
    start = tx + ty * block_size
    stride = block_size * grid_size
    for i in range(start, x.shape[0], stride):
        out[i] = y[i] + x[i]

def add(n, x, y):
    for i in range(n):
        y[i] = y[i] + x[i]

if __name__ == "__main__":
    print(cuda.gpus[0])
    print("")
    n = 100000
    x = np.arange(n).astype(np.float32)
    y = 2 * x
    out = np.empty_like(x)
    x_device = cuda.to_device(x)
    y_device = cuda.to_device(y)
    out_device = cuda.device_array_like(x)
    # Set the number of threads in a block
    threadsperblock = 128
    # Calculate the number of thread blocks in the grid
    blockspergrid = 30  # math.ceil(n[0] / threadsperblock)
    # Now start the kernel
    start = time.process_time()
    cuda.synchronize()
    addition_kernel[blockspergrid, threadsperblock](x_device, y_device, out_device)
    cuda.synchronize()
    end = time.process_time()
    out_global_mem = out_device.copy_to_host()
    print("parallel time: ", end - start)
    start = time.process_time()
    add(n, x, y)
    end = time.process_time()
    print("sequential time: ", end - start)
The parallel time is on average around 0.14 seconds, while the code without GPU kernel takes only 0.02 seconds.
This seems quite strange to me. Is there anything I did wrong? Or is this problem not a good example for parallelism? (I do not think so, since the for loop can be run in parallel.)
What is odd is that I hardly notice a difference if I do not use the to_device() functions. As far as I understand, these should be important, as they avoid the communication between CPU and GPU after each iteration.
addition_kernel is compiled at runtime the first time it is called, so the compilation happens in the middle of your measured time! Compiling a kernel is a fairly expensive operation. You can force the compilation to be done eagerly (i.e. when the function is defined) by providing the argument types to Numba.
Note that the arrays are a bit too small for you to see a big improvement on the GPU. Moreover, the comparison with the CPU version is not really fair: you should also use Numba for the CPU implementation, or at least NumPy (but not an interpreted pure-CPython loop).
Here is an example:
import numba as nb
from numba import cuda

# Explicit signature: the kernel is compiled when it is defined
@cuda.jit('void(float32[::1], float32[::1], float32[::1])')
def addition_kernel(x, y, out):
    tx = cuda.threadIdx.x
    ty = cuda.blockIdx.x
    block_size = cuda.blockDim.x
    grid_size = cuda.gridDim.x
    start = tx + ty * block_size
    stride = block_size * grid_size
    for i in range(start, x.shape[0], stride):
        out[i] = y[i] + x[i]

# CPU version, also compiled by Numba for a fair comparison
@nb.njit('void(int64, float32[::1], float32[::1])')
def add(n, x, y):
    for i in range(n):
        y[i] = y[i] + x[i]
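Independently of eager compilation, one-off costs can be kept out of a measurement by making an untimed warm-up call first. The following is a minimal sketch using only the standard library; the helper time_after_warmup and the stand-in fake_jit_add are hypothetical names, and the artificial sleep merely imitates a first-call compilation step.

```python
import time

def time_after_warmup(fn, *args, repeats=5):
    """Best wall-clock time of fn(*args) over `repeats` runs,
    measured after one untimed warm-up call."""
    fn(*args)  # warm-up: absorbs one-off costs such as JIT compilation
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best

# Stand-in for a JIT-compiled function: only the first call is slow.
calls = {"n": 0}

def fake_jit_add(a, b):
    calls["n"] += 1
    if calls["n"] == 1:
        time.sleep(0.05)  # pretend this is the compilation step
    return a + b

best = time_after_warmup(fake_jit_add, 1, 2)
print(f"best steady-state time: {best:.6f} s")
```

For a CUDA kernel, the timed callable would also have to include the cuda.synchronize() call, as in the question, because kernel launches return before the GPU has finished.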
I am currently working on a script which takes in data for a correlation matrix and computes a bunch of values. The step shown here is very costly, and I would like to find ways to speed it up or parallelize it. I have tried using Parallel (from Python's joblib); however, because of CPU overhead (at least with the way I parametrized it), it is significantly slower than a sequential loop.
import time
import numpy as np
import itertools
from sklearn.datasets import make_blobs

N = 5000
data, _ = make_blobs(n_samples=N, n_features=500)
G = np.corrcoef(data)

''' the cluster function '''
def clus_lc(i, j, G, ns=2):
    ''' c_s'''
    cs = 2*(G[i,j]+1)-1e-6
    ''' A and B'''
    if cs <= ns:
        return 0
    return 0.5*( np.log(ns/cs) + (ns - 1)*np.log( (ns**2 - ns) / ( ns**2 - cs) ) )

''' merge and time '''
indices = list(itertools.combinations(range(N), 2))
t0 = time.time()
costs = np.zeros(len(indices))
k = 0
for i, j in indices:
    costs[k] = clus_lc(i, j, G)
    k += 1
t1 = time.time()
toseq = t1 - t0
print(toseq)
I think I solved the issue by using numba and adding the @jit decorator. This seems to work fine because all operations in the function are calls to numpy functions. On a dataset with N=5000 the run time goes from 75 sec to 10 sec, a fantastic improvement. I am interested in hearing other suggestions on whether this can be improved further.
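A different route, not the numba approach described above, is to drop the Python loop altogether and evaluate all pairs with NumPy array operations. The following is a hedged sketch: clus_lc_all_pairs is a hypothetical name, the small random matrix is made-up test data rather than the make_blobs data, and no timing claim is made for N=5000.

```python
import itertools
import numpy as np

def clus_lc(i, j, G, ns=2):
    """Scalar version from the question, kept as a reference."""
    cs = 2 * (G[i, j] + 1) - 1e-6
    if cs <= ns:
        return 0
    return 0.5 * (np.log(ns / cs) + (ns - 1) * np.log((ns**2 - ns) / (ns**2 - cs)))

def clus_lc_all_pairs(G, ns=2):
    """Same cost for every pair i < j, computed with array operations."""
    # Upper-triangle indices come out in the same order as
    # itertools.combinations(range(N), 2).
    i, j = np.triu_indices(G.shape[0], k=1)
    cs = 2 * (G[i, j] + 1) - 1e-6
    costs = np.zeros_like(cs)
    keep = cs > ns                      # entries with cs <= ns stay 0
    c = cs[keep]
    costs[keep] = 0.5 * (np.log(ns / c) + (ns - 1) * np.log((ns**2 - ns) / (ns**2 - c)))
    return costs

# Small made-up example to check that both versions agree.
rng = np.random.default_rng(0)
G_small = np.corrcoef(rng.normal(size=(30, 8)))
reference = np.array([clus_lc(i, j, G_small)
                      for i, j in itertools.combinations(range(30), 2)])
vectorized = clus_lc_all_pairs(G_small)
print(np.allclose(reference, vectorized))
```

The trade-off is memory: the index and cost arrays each hold one entry per pair, which for N=5000 is roughly 12.5 million elements.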
I am trying to calculate the nth value of a logistic equation in Python. It is easy to do it with a loop:
import timeit
tic = timeit.default_timer()
x = 0.23
i = 0
n = 1000000000
while (i < n):
    x = 4 * x * (1 - x)
    i += 1
toc = timeit.default_timer()
toc - tic
However it is also generally time-consuming. Doing it in PyPy greatly improves the performance, as suggested by abarnert in Is MATLAB faster than Python (little simple experiment).
I have also been suggested to avoid Python loops and use NumPy arrays and vector operations instead - actually I do not see how these can help (it seems to me that NumPy operations are similar to Matlab ones, and I am unaware of any way the code above can be vectorized in Matlab either).
Is there a way to optimize the code without loops?
Without loops? Maybe, but this is probably not the best way to go. It's important to realize that loops are not slow per se. You try to avoid them in Python or MATLAB in high-performance code because they run in an interpreter. If you are writing C code, you don't have to care.
So one idea to optimize here would be to use cython to compile your code to C code:
python version:
def calc_x(x, n):
    i = 0
    while (i < n):
        x = 4 * x * (1 - x)
        i += 1
    return x
statically typed cython version:
def calc_x_cy(double x, long n):
    cdef long i = 0
    while (i < n):
        x = 4 * x * (1 - x)
        i += 1
    return x
And all of a sudden, you are almost two orders of magnitude faster:
%timeit calc_x(0.23, n) -> 1 loops, best of 3: 26.9 s per loop
%timeit calc_x_cy(0.23, n) -> 1 loops, best of 3: 370 ms per loop
I have been playing around with numba and numexpr trying to speed up a simple element-wise matrix multiplication. I have not been able to get better results; both are basically equivalent in speed to numpy's multiply function. Has anyone had any luck in this area? Am I using numba and numexpr wrong (I'm quite new to this), or is this altogether a bad approach to speeding this up? Here is reproducible code. Thank you in advance:
import numpy as np
from numba import autojit
import numexpr as ne

a = np.random.rand(10, 5000000)

# numpy
multiplication1 = np.multiply(a, a)

# numba
def multiplix(X, Y):
    M = X.shape[0]
    N = X.shape[1]
    D = np.empty((M, N), dtype=np.float)
    for i in range(M):
        for j in range(N):
            D[i,j] = X[i, j] * Y[i, j]
    return D

mul = autojit(multiplix)
multiplication2 = mul(a, a)

# numexpr
def numexprmult(X, Y):
    M = X.shape[0]
    N = X.shape[1]
    return ne.evaluate("X * Y")

multiplication3 = numexprmult(a, a)
What about using fortran and ctypes?
elementwise.F90:
subroutine elementwise( a, b, c, M, N ) bind(c, name='elementwise')
  use iso_c_binding, only: c_float, c_int
  integer(c_int),intent(in) :: M, N
  real(c_float), intent(in) :: a(M, N), b(M, N)
  real(c_float), intent(out):: c(M, N)
  integer :: i,j
  forall (i=1:M,j=1:N)
    c(i,j) = a(i,j) * b(i,j)
  end forall
end subroutine
elementwise.py:
from ctypes import CDLL, POINTER, c_int, c_float
import numpy as np
import time
fortran = CDLL('./elementwise.so')
fortran.elementwise.argtypes = [ POINTER(c_float),
POINTER(c_float),
POINTER(c_float),
POINTER(c_int),
POINTER(c_int) ]
# Setup
M=10
N=5000000
a = np.empty((M,N), dtype=c_float)
b = np.empty((M,N), dtype=c_float)
c = np.empty((M,N), dtype=c_float)
a[:] = np.random.rand(M,N)
b[:] = np.random.rand(M,N)
# Fortran call
start = time.time()
fortran.elementwise( a.ctypes.data_as(POINTER(c_float)),
b.ctypes.data_as(POINTER(c_float)),
c.ctypes.data_as(POINTER(c_float)),
c_int(M), c_int(N) )
stop = time.time()
print 'Fortran took ',stop - start,'seconds'
# Numpy
start = time.time()
c = np.multiply(a,b)
stop = time.time()
print 'Numpy took ',stop - start,'seconds'
I compiled the Fortran file using
gfortran -O3 -funroll-loops -ffast-math -floop-strip-mine -shared -fPIC \
-o elementwise.so elementwise.F90
The output yields a speed-up of ~10%:
$ python elementwise.py
Fortran took 0.213667869568 seconds
Numpy took 0.230120897293 seconds
$ python elementwise.py
Fortran took 0.209784984589 seconds
Numpy took 0.231616973877 seconds
$ python elementwise.py
Fortran took 0.214708089828 seconds
Numpy took 0.25369310379 seconds
How are you doing your timings?
The creation of your random array takes up the bulk of the run time, and if you include it in your timing you will hardly see any real difference in the results.
However, if you create it up front, you can actually compare the methods.
Here are my results, and I'm consistently seeing what you are seeing. numpy and numba give about the same results (with numba being a little bit faster.)
(I don't have numexpr available)
In [1]: import numpy as np
In [2]: from numba import autojit
In [3]: a=np.random.rand(10,5000000)
In [4]: %timeit multiplication1 = np.multiply(a,a)
10 loops, best of 3: 90 ms per loop
In [5]: # numba
In [6]: def multiplix(X,Y):
   ...:     M = X.shape[0]
   ...:     N = X.shape[1]
   ...:     D = np.empty((M, N), dtype=np.float)
   ...:     for i in range(M):
   ...:         for j in range(N):
   ...:             D[i,j] = X[i, j] * Y[i, j]
   ...:     return D
   ...:
In [7]: mul = autojit(multiplix)
In [26]: %timeit multiplication1 = np.multiply(a,a)
10 loops, best of 3: 182 ms per loop
In [27]: %timeit multiplication1 = np.multiply(a,a)
10 loops, best of 3: 185 ms per loop
In [28]: %timeit multiplication1 = np.multiply(a,a)
10 loops, best of 3: 181 ms per loop
In [29]: %timeit multiplication2 = mul(a,a)
10 loops, best of 3: 179 ms per loop
In [30]: %timeit multiplication2 = mul(a,a)
10 loops, best of 3: 180 ms per loop
In [31]: %timeit multiplication2 = mul(a,a)
10 loops, best of 3: 178 ms per loop
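The point about creating the data up front can also be reproduced outside IPython with the standard library's timeit module, whose setup argument runs once and is excluded from the measurement. This is a minimal sketch with a pure-Python list standing in for the NumPy array; the variable names are illustrative only.

```python
import timeit

# Data creation goes into `setup`, so it is not part of the timed region.
setup = "import random; data = [random.random() for _ in range(100_000)]"
# Only the element-wise multiplication itself is timed.
stmt = "[v * v for v in data]"

number = 10
elapsed = timeit.timeit(stmt, setup=setup, number=number)
per_call = elapsed / number
print(f"{per_call * 1e3:.3f} ms per call, excluding data creation")
```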
Update:
I used the latest version of numba, just compiled it from source: '0.11.0-3-gea20d11-dirty'
I tested this with the default numpy in Fedora 19, '1.7.1'
and numpy '1.6.1' compiled from source, linked against:
Update3
My earlier results were of course incorrect, I had return D in the inner loop, so skipping 90% of the calculations.
This provides more evidence for ali_m's assumption that it is really hard to do better than the already heavily optimized C code.
However, if you are trying to do something more complicated, e.g.,
np.sqrt(((X[:, None, :] - X) ** 2).sum(-1))
I can reproduce the figures Jake Vanderplas gets:
In [14]: %timeit pairwise_numba(X)
10000 loops, best of 3: 92.6 us per loop
In [15]: %timeit pairwise_numpy(X)
1000 loops, best of 3: 662 us per loop
So it seems you are doing something that numpy has already optimized so far that it is hard to do any better.
Edit: nevermind this answer, I'm wrong (see comment below).
I'm afraid it will be very, very hard to have a faster matrix multiplication in python than by using numpy's. NumPy usually uses internal fortran libraries like ATLAS/LAPACK that are very very well optimized.
To check if your version of NumPy was built with LAPACK support: open a terminal, go to your Python install directory and type:
for f in `find lib/python2.7/site-packages/numpy/* -name \*.so`; do echo $f; ldd $f;echo "\n";done | grep lapack
Note that the path can vary depending on your python version.
If some lines get printed, you almost certainly have LAPACK support, so achieving faster matrix multiplication on a single core will be very hard.
Now I don't know about using multiple cores to perform matrix multiplication, so you might want to look into that (see ali_m's comment).
Use a GPU, for example with the following package:
gnumpy
The speed of np.multiply depends heavily on the arrays having exactly the same shape.
a = np.random.rand(80000,1)
b = np.random.rand(80000,1)
c = np.multiply(a, b)
is extremely fast, whereas the following code takes more than a minute and uses up all my 16 GB of RAM:
a = np.squeeze(np.random.rand(80000,1))
b = np.random.rand(80000,1)
c = np.multiply(a, b)
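So my advice would be to use arrays of exactly the same shape. The cause is broadcasting: in the second snippet a has shape (80000,) and b has shape (80000, 1), so the result is an 80000 x 80000 array instead of 80000 elements. I hope this is useful for someone looking for how to speed up element-wise multiplication.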
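The shape mismatch in the second snippet can be checked on a small scale. The following is a minimal sketch with 4 elements instead of 80000 (the variable names are illustrative): multiplying a 1-D array by a column vector does not raise an error; NumPy broadcasts the pair to a full 2-D result.

```python
import numpy as np

a_col = np.ones((4, 1))
b_col = np.ones((4, 1))
a_flat = np.squeeze(a_col)                 # shape (4,), as in the slow snippet

same_shape = np.multiply(a_col, b_col)     # element-wise, shape (4, 1)
broadcast = np.multiply(a_flat, b_col)     # broadcast to shape (4, 4)

print(same_shape.shape, broadcast.shape)

# Size of the broadcast result for the original 80000-element arrays.
n = 80000
print(n * n * 8 / 1e9, "GB for an n x n float64 result")
```

With n = 80000 the broadcast result has 6.4 billion float64 entries, about 51 GB, which explains why 16 GB of RAM is exhausted.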