I've got a look-up problem that boils down to the following situation.
Three columns with positive integers. For some value i, which values in 'column_3' have a value in 'column_1' below i and a value in 'column_2' above i?
import numpy as np

rows = 10**6        # number of records (must be an int for np.zeros / size=)
i = 5 * 10**8

ts = np.zeros((rows,), dtype=[('column_1', 'int64'),
                              ('column_2', 'int64'),
                              ('column_3', 'int64')])
ts['column_1'] = np.random.randint(low=0, high=10**9, size=rows)
ts['column_2'] = np.random.randint(low=0, high=10**9, size=rows)
ts['column_3'] = np.random.randint(low=0, high=10**9, size=rows)
This is the operation I'd like to optimize:
%%timeit
a = ts[(ts['column_1'] < i)&(ts['column_2'] > i)]['column_3']
Is there anything I'm overlooking that could make this faster?
I'd be grateful for any advice!
Assigning your 3 columns to standalone arrays A, B, C at creation as well:
In [3]: %%timeit
...: a = ts[(ts['column_1'] < i)&(ts['column_2'] > i)]['column_3']
...:
22.5 ms ± 838 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [4]: %%timeit
...: a = C[(A < i)&(B > i)]
...:
...:
9.36 ms ± 15 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Using a,b,c = ts['column_1'],ts['column_2'],ts['column_3'] instead falls in between.
Those are variants and timings you can play with. As far as I can see, there are just minor differences due to how the indexing is done, nothing like an order-of-magnitude difference.
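For completeness, a rough sketch of what I mean by that A, B, C assignment (one way is to keep a reference to each random array when the test data is built; copying the fields out of ts afterwards works too):
A = np.random.randint(low=0, high=10**9, size=rows)
B = np.random.randint(low=0, high=10**9, size=rows)
C = np.random.randint(low=0, high=10**9, size=rows)
# fill the structured array from the same data, so both variants see identical values
ts['column_1'], ts['column_2'], ts['column_3'] = A, B, C

a = C[(A < i) & (B > i)]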
I have the below dummy dataframe:
import pandas
tab_size = 300000
tab_de_merde = [[i*1, 0, i*3, [i*7%3, i*11%3] ] for i in range(tab_size)]
colnames = ['Id', "Date", "Account","Value"]
indexnames = ['Id']
df = pandas.DataFrame(tab_de_merde, columns = colnames ).set_index(indexnames)
And I want to check if the column "Value" contains a 0.
I've tried 3 different solutions, and I was wondering whether the third one (Python vectorization) is correctly implemented, since it doesn't seem to speed up the code.
%timeit df[[(0 in x) for x in df['Value'].values]]
#108 ms ± 418 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df[df['Value'].apply(lambda x: 0 in x)]
#86.2 ms ± 649 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
def list_contains_element(np_array): return [(0 in x) for x in np_array]
%timeit df[list_contains_element(df['Value'].values)]
#106 ms ± 807 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
I would be very glad if someone could help me better understand how to make vector manipulation faster.
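For what it's worth, the only truly vectorized variant I could come up with first converts the list column into a 2D numpy array (this is just a sketch, and it assumes every list in "Value" has the same length, which is true for this dummy data):
import numpy as np

vals = np.array(df['Value'].tolist())   # shape (tab_size, 2), since every list holds 2 items
mask = (vals == 0).any(axis=1)          # True for rows whose list contains a 0
result = df[mask]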
Assume I have two pyomo sets A & B, which contain the following elements:
m.A = {1,2,3,4,5}
m.B = {'a','b','c','d',5}
I want to check if A has some elements which are also in B:
EDIT:
Well, the following does not work:
if m.A & m.B is not None:
raise ValueError
At least in my case, when m.A = [None] and m.B = ['some_string'], the if-statement is still triggered (the intersection is a set, and a set is never None, even when it is empty), but bool(m.A & m.B) works.
The most compact way you could achieve this is by using the & operator:
a = {1,2,3,4}
b = {4,5,6}
result = bool(a & b)
Speed comparison
Using the & operator:
%timeit bool(a & b)
297 ns ± 3.04 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Using the intersection method:
%timeit bool(a.intersection(b))
365 ns ± 27.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
The two solutions are pretty similar; the second one most probably pays some extra overhead for the method call.
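If all you need is the yes/no answer, set.isdisjoint is another standard option worth timing; a minimal sketch:
a = {1,2,3,4}
b = {4,5,6}
# isdisjoint returns True when the sets share no elements,
# so negating it tells you whether there is any overlap
result = not a.isdisjoint(b)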
You are looking for intersection:
>>> A = {1,2,3,4,5}
>>> B = {'a','b','c','d',5}
>>> A.intersection(B)
set([5])
I am aware that this may seem like a vague question, but I wonder if (e.g. in Python) it is preferable to store multiple values in separate variables, or to store them in logical groups (lists, arrays...).
In my precise case I am supposed to translate MATLAB code into Python 2.7. It is a physically based model that digests 8 input variables and creates two large lists as output. I found that the original model has a huge number of variables that are calculated along the way (>100). As a rule of thumb: if one calculation is accessed more than once, it is stored in a new variable. A demonstrative example:
from math import exp

x = 3
y = 5
x2 = x**2
z = x2 + exp(y)
zz = (y+x)/x2
x**2 is used twice (in the calculation of both z and zz), so it is stored as x2. Is this really faster than letting Python calculate x**2 twice? Also, would it be faster if I stored the values in lists, like this:
x = [3, 5]
z = x[0]**2 + exp(x[1])
zz = sum(x)/x[0]**2
The organisation of variables in lists may come at the expense of readability of the code, but I would gladly take that if it makes my code run faster.
There is no performance advantage I can see to keeping it in a list. On the contrary, putting it in a list makes it slower:
>>> %%timeit
...: x = 3
...: y = 5
...: x2 = x**2
...: z = x2 + exp(y)
...: zz = (y+x)/x2
...:
337 ns ± 1.87 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
>>> %%timeit
...: x = [3, 5]
...: z = x[0]**2 + exp(x[1])
...: zz = sum(x)/x[0]**2
...:
716 ns ± 4.87 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Now part of that is because you are calculating x**2 twice in the list version, but even fixing that issue doesn't make the list version faster:
>>> %%timeit
...: x = [3, 5]
...: x0 = x[0]**2
...: z = x0 + exp(x[1])
...: zz = sum(x)/x0
...:
502 ns ± 12.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
If you care about performance, another big issue is that you are defining ints and then converting them to floats. In MATLAB, x = 5 makes a float, while in Python it makes an integer. It is much faster to do everything with floats from the beginning, which you can do by just putting a . or .0 at the end of the number:
>>> %%timeit
...: x = 3.0
...: y = 5.0
...: x2 = x**2.0
...: z = x2 + exp(y)
...: zz = (y+x)/x2
...:
166 ns ± 1.12 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
If you were to use numpy arrays rather than lists, it is even worse, because you start with a list of floats, then have to convert both the numbers and the list, and then convert them back, all of which is slow:
>>> %%timeit
...: x = np.array([3., 5.])
...: x0 = x[0]**2.
...: z = x0 + np.exp(x[1])
...: zz = x.sum()/x0
...:
3.22 µs ± 8.96 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
As a general rule, avoid doing type conversions wherever possible, and avoid indexing when it doesn't help readability. If you have a bunch of values, then the conversion to numpy is useful. But for just two or three it is going to hurt speed and readability.
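As a rough sketch of that last point (hypothetical sizes, not one of the timings above), the picture flips once there are many values to process, because the conversion and call overhead is paid once for the whole array instead of per element:
import numpy as np
from math import exp

ys = [v / 1000.0 for v in range(100000)]   # many values instead of two or three

# pure-Python loop: one exp() call and one float object per element
z_loop = [exp(v) for v in ys]

# numpy: one conversion to an array, then a single vectorized exp over all elements
z_vec = np.exp(np.array(ys))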
I have a matrix A and I want to get the dot product of A with every row of B.
import numpy as np
a = np.array([[1.0, 2.0],
[3.0, 4.0]])
b = np.array([[1.0, 1.0],
[2.0, 2.0],
[3.0, 3.0]])
If the goal was to do it manually (or in a loop):
c = np.array([np.dot(a, b[0])])
c = np.append(c, [np.dot(a, b[1])], axis=0)
c = np.append(c, [np.dot(a, b[2])], axis=0)
print(c)
c = [[ 3. 7.]
[ 6. 14.]
[ 9. 21.]]
With some transposing and matrix-multiplication using np.dot -
a.dot(b.T).T
b.dot(a.T)
With np.einsum -
np.einsum('ij,kj->ki',a,b)
With np.tensordot -
np.tensordot(b,a,axes=((1,1)))
Runtime test -
In [123]: a = np.random.rand(2000, 2000)
...: b = np.random.rand(3000, 2000)
...:
In [124]: %timeit a.dot(b.T).T
...: %timeit b.dot(a.T)
...: %timeit np.einsum('ij,kj->ki',a,b)
...: %timeit np.tensordot(b,a,axes=((1,1)))
...:
1 loops, best of 3: 234 ms per loop
10 loops, best of 3: 169 ms per loop
1 loops, best of 3: 7.59 s per loop
10 loops, best of 3: 170 ms per loop
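As a quick sanity check (not part of the timings above), all four variants agree on the small example from the question:
import numpy as np

a = np.array([[1.0, 2.0],
              [3.0, 4.0]])
b = np.array([[1.0, 1.0],
              [2.0, 2.0],
              [3.0, 3.0]])

ref = b.dot(a.T)
assert np.allclose(ref, a.dot(b.T).T)
assert np.allclose(ref, np.einsum('ij,kj->ki', a, b))
assert np.allclose(ref, np.tensordot(b, a, axes=(1, 1)))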
I'm trying to learn how to write GPU-optimized OpenCL kernels. I took the example of matrix multiplication using square tiles in local memory. However, I got at best only a ~10x speedup (~50 Gflops) compared to numpy.dot() (5 Gflops; it uses BLAS).
I found studies where they got speedups of >200x (>1000 Gflops):
ftp://ftp.u-aizu.ac.jp/u-aizu/doc/Tech-Report/2012/2012-002.pdf
I don't know what I'm doing wrong, whether it is just because of my GPU (nvidia GTX 275), or whether it is some pyOpenCl overhead. I also measured how long it takes just to copy the result from the GPU to RAM, and it is only ~10% of the matrix multiplication time.
#define BLOCK_SIZE 22
__kernel void matrixMul(
__global float* Cij,
__global float* Aik,
__global float* Bkj,
__const int ni,
__const int nj,
__const int nk
){
// WARNING: interchanging the i and j dimensions lowers the performance >2x on my nV GT275 GPU
int gj = get_global_id(0); int gi = get_global_id(1);
int bj = get_group_id(0); int bi = get_group_id(1); // Block index
int tj = get_local_id(0); int ti = get_local_id(1); // Thread index
int oj = bi*BLOCK_SIZE; int oi = bj*BLOCK_SIZE;
float Csub =0;
__local float As [BLOCK_SIZE][BLOCK_SIZE];
__local float Bs [BLOCK_SIZE][BLOCK_SIZE];
for (int ok = 0; ok < nk; ok += BLOCK_SIZE ) {
As[ti][tj] = Aik[ nk*(gi ) + tj + ok ]; // A[i][k]
Bs[ti][tj] = Bkj[ nj*(ti+ok) + gj ]; // B[k][j]
barrier(CLK_LOCAL_MEM_FENCE);
for (int k = 0; k < BLOCK_SIZE; ++k) Csub += As[ti][k] * Bs[k][tj];
barrier(CLK_LOCAL_MEM_FENCE);
}
Cij[ nj * ( gi ) + gj ] = Csub;
}
NOTE - the strange BLOCK_SIZE=22 is the maximum BLOCK_SIZE that fits the max work_group_size, which is 512 on my GPU. In this code the condition BLOCK_SIZE^2 < max work_group_size must hold; 22 = int(sqrt(512)). I also tried BLOCK_SIZE=16 and 8, but both were slower than 22.
I also tried a simple matrixMul (without using local memory), but it was even 10 times slower than numpy.dot().
I copied the code from here:
http://gpgpu-computing4.blogspot.cz/2009/10/matrix-multiplication-3-opencl.html
They say that even the simple version (without local memory) should run 200x faster than the CPU. I don't understand that.
The dependence of performance on N in my case is:
N = 220 numpy 3.680 [Gflops] GPU 16.428 [Gflops] speedUp 4.464
N = 330 numpy 4.752 [Gflops] GPU 29.487 [Gflops] speedUp 6.205
N = 440 numpy 4.914 [Gflops] GPU 37.096 [Gflops] speedUp 7.548
N = 550 numpy 3.849 [Gflops] GPU 47.019 [Gflops] speedUp 12.217
N = 660 numpy 5.251 [Gflops] GPU 49.999 [Gflops] speedUp 9.522
N = 770 numpy 4.565 [Gflops] GPU 48.567 [Gflops] speedUp 10.638
N = 880 numpy 5.452 [Gflops] GPU 44.444 [Gflops] speedUp 8.152
N = 990 numpy 4.976 [Gflops] GPU 42.187 [Gflops] speedUp 8.478
N = 1100 numpy 5.324 [Gflops] GPU 83.187 [Gflops] speedUp 15.625
N = 1210 numpy 5.401 [Gflops] GPU 57.147 [Gflops] speedUp 10.581
N = 1320 numpy 5.450 [Gflops] GPU 48.936 [Gflops] speedUp 8.979
NOTE - the "Gflops" number is obtained as N^3/time, and it does include the time required to copy results from the GPU to main memory, but this time is only a few percent of the total time, especially for N>1000.
Maybe more illustrative are the times in seconds:
N = 220 numpy 0.003 [s] GPU 0.001 [s] load 0.001 [s] speedUp 5.000
N = 330 numpy 0.008 [s] GPU 0.001 [s] load 0.001 [s] speedUp 7.683
N = 440 numpy 0.017 [s] GPU 0.002 [s] load 0.001 [s] speedUp 7.565
N = 550 numpy 0.043 [s] GPU 0.004 [s] load 0.001 [s] speedUp 11.957
N = 660 numpy 0.055 [s] GPU 0.006 [s] load 0.002 [s] speedUp 9.298
N = 770 numpy 0.100 [s] GPU 0.009 [s] load 0.003 [s] speedUp 10.638
N = 880 numpy 0.125 [s] GPU 0.010 [s] load 0.000 [s] speedUp 12.097
N = 990 numpy 0.195 [s] GPU 0.015 [s] load 0.000 [s] speedUp 12.581
N = 1100 numpy 0.250 [s] GPU 0.031 [s] load 0.000 [s] speedUp 8.065
N = 1210 numpy 0.328 [s] GPU 0.031 [s] load 0.000 [s] speedUp 10.581
N = 1320 numpy 0.422 [s] GPU 0.047 [s] load 0.000 [s] speedUp 8.979
I was thinking that maybe some speed improvement could be obtained by using async_work_group_copy or even read_imageui to copy blocks to local memory. But I don't understand why there is such a big difference when I'm using basically the same code as people who say they get a 200x speedup?
Without even looking at your code let me make some comments about your benchmarks. Let's ignore numpy and compare the maximum SP FLOPs/s and DP FLOPs/s of an Intel CPU versus Nvidia and AMD GPUs.
An Intel 2600K at 4 GHz can do 4 GHz * (8 AVX) * (2 ILP) * (4 cores) = 256 SP GFLOPs/s. For DP it's half: 128 DP GFLOPs/s. Haswell, which comes out in a few weeks, will double both of those. The Intel MKL library gets better than 80% efficiency in GEMM. My own GEMM code gets 70% on my i7-2700, so the 5 GFlops/s you quote with numpy is tiny and not a fair comparison.
I don't know what the GTX 275 is capable of but I would guess it's much more than 50 GFLOPs/s.
The article you reference compares an AMD 7970. They get 848 DP GFlops/s (90% efficiency) and 2646 SP GFlops/s (70% efficiency). That's closer to 10x the performance of the CPU, not 200x!
Edit:
Your calculation of FLOPs is wrong; it should be 2.0*n^3. That's still approximate, but it's asymptotically true. Let me explain.
Consider a 3D dot product. It's x1*x2+y1*y2+z1*z2. That's 3 multiplications and 2 additions. So an n-dimensional dot product is n multiplications and (n-1) additions. A matrix product is equivalent to n*n dot products, i.e. n*n*n multiplications and n*n*(n-1) additions. That's approximately 2.0*n^3 FLOPs. So you should double all your Gflops/s numbers.
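As a small sketch of the corrected bookkeeping (using the N=1320 row from your second table; time_s is the measured GPU multiply time in seconds):
n = 1320          # matrix dimension, from the table above
time_s = 0.047    # measured GPU multiply time for that N

flops = 2.0 * n**3             # ~n^3 multiplications plus ~n^3 additions
gflops = flops / time_s / 1e9  # roughly double the N^3/time figure
print(gflops)                  # ~98 Gflops/s instead of ~49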
Edit:
You might want to consider the kernel time. It's been a while since I used OpenCL, but using the C++ bindings I did something like this:
queue = cl::CommandQueue(context, devices[device],
                         CL_QUEUE_PROFILING_ENABLE | CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);
// ...other code: enqueue the kernel, passing a cl::Event (clevent) so it records the timestamps
time_end   = clevent.getProfilingInfo<CL_PROFILING_COMMAND_END>();    // in nanoseconds
time_start = clevent.getProfilingInfo<CL_PROFILING_COMMAND_START>();  // in nanoseconds
A good GPU matrix-multiply does not just use local memory; it stores blocks of A, B, and/or C in registers (which results in higher register usage and lower occupancy, but is much faster in the end). This is because GPUs have more registers than local memory (128-256 KB vs 48 KB for NVIDIA), and registers offer as much bandwidth as the ALUs can handle.