OpenCV 3.1 optimization - performance

I'm currently trying to implement an algorithm from a paper with OpenCV 3.1 on Python 2.7, but the process is taking way too long.
The section of my code that's giving me trouble looks something like this:
width, height = mr.shape[:2]
Pm = []
for i in d:
    M = np.float32([[1,0,-d[i]], [0,1,1]])
    mrd = cv2.warpAffine(mr, M, (height,width))
    C = cv2.subtract(ml, mrd)
    C = cv2.pow(C,2)
    C = np.divide(C, sigma_m)
    C = p0 + (1-p0)**(-C)
    Pm.append(C)
Where ml, mr and mrd are cv2 objects and d, p0 and sigma_m are integers.
The division and the final equation in the last three lines are the real troublemakers here. Every iteration of this loop is independent, so in theory I could just split the for loop across a few processors, but that seems like a lazy approach that would merely bypass the problem instead of fixing it.
Does anyone know a way to perform those computations faster?

We can leverage the numexpr module to efficiently perform all of those latter arithmetic operations as one evaluate expression.
Thus, these steps :
C = cv2.subtract(ml, mrd)
C = cv2.pow(C,2)
C = np.divide(C, sigma_m)
C = p0 + (1-p0)**(-C)
could be replaced by one expression -
import numexpr as ne
C = ne.evaluate('p0 +(1-p0)**(-((ml-mrd)**2)/sigma_m)')
Let's verify things. The original approach as a function -
def original_app(ml, mrd, sigma_m, p0):
    C = cv2.subtract(ml, mrd)
    C = cv2.pow(C,2)
    C = np.divide(C, sigma_m)
    C = p0 + (1-p0)**(-C)
    return C
Verification -
In [28]: # Setup inputs
...: S = 1024 # Size parameter
...: ml = np.random.randint(0,255,(S,S))/255.0
...: mrd = np.random.randint(0,255,(S,S))/255.0
...: sigma_m = 0.45
...: p0 = 0.56
...:
In [29]: out1 = original_app(ml, mrd, sigma_m, p0)
In [30]: out2 = ne.evaluate('p0 +(1-p0)**(-((ml-mrd)**2)/sigma_m)')
In [31]: np.allclose(out1, out2)
Out[31]: True
Timings across various sizes of datasets -
In [19]: # Setup inputs
...: S = 1024 # Size parameter
...: ml = np.random.randint(0,255,(S,S))/255.0
...: mrd = np.random.randint(0,255,(S,S))/255.0
...: sigma_m = 0.45
...: p0 = 0.56
...:
In [20]: %timeit original_app(ml, mrd, sigma_m, p0)
10 loops, best of 3: 67.1 ms per loop
In [21]: %timeit ne.evaluate('p0 +(1-p0)**(-((ml-mrd)**2)/sigma_m)')
100 loops, best of 3: 12.9 ms per loop
In [22]: # Setup inputs
...: S = 512 # Size parameter
In [23]: %timeit original_app(ml, mrd, sigma_m, p0)
100 loops, best of 3: 15.3 ms per loop
In [24]: %timeit ne.evaluate('p0 +(1-p0)**(-((ml-mrd)**2)/sigma_m)')
100 loops, best of 3: 3.39 ms per loop
In [25]: # Setup inputs
...: S = 256 # Size parameter
In [26]: %timeit original_app(ml, mrd, sigma_m, p0)
100 loops, best of 3: 3.65 ms per loop
In [27]: %timeit ne.evaluate('p0 +(1-p0)**(-((ml-mrd)**2)/sigma_m)')
1000 loops, best of 3: 878 µs per loop
Around 5x speedup across various sizes with better speedups for larger arrays!
Also, as a side note, I would advise using a pre-initialized output array instead of appending at the final step. Thus, we could initialize before going into the loop with something like out = np.zeros((len(d), width, height)) (or np.empty) and, at the final step of each iteration, assign into the output array with out[iteration_ID] = C, as in the sketch below.
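For concreteness, here is a rough sketch of how the whole loop could look with numexpr plus a preallocated output. The dummy inputs and the name shift are only placeholders standing in for the question's actual ml, mr, d, p0 and sigma_m:
import cv2
import numpy as np
import numexpr as ne

# Placeholder inputs standing in for the question's ml, mr, d, p0, sigma_m
ml = np.random.rand(480, 640).astype(np.float32)
mr = np.random.rand(480, 640).astype(np.float32)
d = range(32)                 # assumed: a sequence of shift values
p0, sigma_m = 0.56, 0.45

width, height = mr.shape[:2]                      # note: shape[:2] is (rows, cols)
out = np.empty((len(d), width, height))           # preallocated instead of list-append
for k, shift in enumerate(d):
    M = np.float32([[1, 0, -shift], [0, 1, 1]])
    mrd = cv2.warpAffine(mr, M, (height, width))  # warpAffine's dsize is (cols, rows)
    out[k] = ne.evaluate('p0 + (1-p0)**(-((ml-mrd)**2)/sigma_m)')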

Related

Speed up multiple cross product and dot product operations in numpy python

I have a loop in one part of my code. I replaced all the vectors with constants to make a simplified example, shown below. This loop runs 230000 times inside another loop, and this part takes about 26.36 seconds.
Is there any way to speed it up or tune it for better performance?
trr = time.time()
for i in range(230000):
    print(+1 * 0.0001 * 1 * 1000 * (
        1 * np.dot(np.subtract([2,1], [4,3]), [1,2]) + 1
        * np.dot(
            np.cross(np.array([0, 0, 0.5]),
                     np.array([1,2,3])),
            np.array([1,0,0]))
        - 1 * np.dot((np.cross(
            np.array([0,0,-0.5]),
            np.array([2,4,1]))), np.array(
            [0,1,0]))))
print(time.time()-trr)
The code with variables:
for i in range(23000):
    .......
    .....
    else:
        delta_fs = +1 * dt * 1 * ks_smoot * A_2d * (
            np.dot(np.subtract(grains[p].v, grains[w].v), vti) * sign +
            np.dot(np.cross(np.array([0, 0, grains[p].rotational_speed]),
                            np.array(np.array(xj_c) - np.array(xj_p))),
                   np.array([vti[0], vti[1], 0])) * sign
            - np.dot((np.cross(np.array([0, 0, grains[w].rotational_speed]),
                               np.array(np.array(xj_c) - np.array(xj_w)))),
                     np.array([vti[0], vti[1], 0])) * sign)
It would've been better if you had kept your example in variables, since your code is very difficult to read. Ignoring the fact that the loop in your example just computes the same constant value over and over again, I am working under the assumption that you need to run a specific set of numpy operations many times on various numpy arrays/vectors. You may find it useful to spend some time looking into the documentation for numba. Here's a very basic example:
import numpy as np
import numba as nb

CONST = 1*0.0001*1*1000
a0 = np.array([2.,1.])
a1 = np.array([4.,3.])
a2 = np.array([1.,2.])
b0 = np.array([0., 0., 0.5])
b1 = np.array([1.,2.,3.])
b2 = np.array([1.,0.,0.])
c0 = np.array([0.,0.,-0.5])
c1 = np.array([2.,4.,1.])
c2 = np.array([0.,1.,0.])

@nb.jit()
def op1(iters):
    for i in range(iters):
        op = CONST * (1 * np.dot(a0-a1,a2)
                      + 1 * np.dot(np.cross(b0,b1),b2)
                      - 1 * np.dot(np.cross(c0,c1),c2))

op1(1)  # Initial compilation

def op2(iters):
    for i in range(iters):
        op = CONST * (1 * np.dot(a0-a1,a2)
                      + 1 * np.dot(np.cross(b0,b1),b2)
                      - 1 * np.dot(np.cross(c0,c1),c2))
%timeit -n 100 op1(100)
# 54 µs ± 2.49 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit -n 100 op2(100)
# 15.5 ms ± 313 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Seems like it'll be multiple orders of magnitude faster, which should easily bring your runtime down to a fraction of a second.
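One follow-up worth knowing (an addition, not part of the original answer): in numba's nopython mode, global arrays are baked in as compile-time constants, so for the real code it is usually better to pass the vectors in as arguments. A minimal sketch with illustrative names, writing the 3-vector cross product out by hand to stay within operations nopython mode reliably supports:
import numpy as np
import numba as nb

@nb.njit
def cross3(u, v):
    # Explicit 3-vector cross product
    w = np.empty(3)
    w[0] = u[1]*v[2] - u[2]*v[1]
    w[1] = u[2]*v[0] - u[0]*v[2]
    w[2] = u[0]*v[1] - u[1]*v[0]
    return w

@nb.njit
def combined(a0, a1, a2, b0, b1, b2, c0, c1, c2, const):
    # Same arithmetic as op1/op2 above, but every input is an argument,
    # so the compiled function can be reused with different vectors
    return const * (np.dot(a0 - a1, a2)
                    + np.dot(cross3(b0, b1), b2)
                    - np.dot(cross3(c0, c1), c2))
The first call pays the compilation cost; afterwards it can be called inside the loop as combined(a0, a1, a2, b0, b1, b2, c0, c1, c2, CONST) with whatever vectors the real code produces.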

Julia: why doesn't shared memory multi-threading give me a speedup?

I want to use shared-memory multi-threading in Julia. As is done by the Threads.@threads macro, I can use ccall(:jl_threading_run ...) to do this. And whilst my code now runs in parallel, I don't get the speedup I expected.
The following code is intended as a minimal example of the approach I'm taking and the performance problem I'm having: [EDIT: see below for an even more minimal example]
nthreads = Threads.nthreads()
test_size = 1000000
println("STARTED with ", nthreads, " thread(s) and test size of ", test_size, ".")

# Something to be processed:
objects = rand(test_size)

# Somewhere for our results
results = zeros(nthreads)
counts = zeros(nthreads)

# A function to do some work.
function worker_fn()
    work_idx = 1
    my_result = results[Threads.threadid()]
    while work_idx > 0
        my_result += objects[work_idx]
        work_idx += nthreads
        if work_idx > test_size
            break
        end
        counts[Threads.threadid()] += 1
    end
end

# Call our worker function using jl_threading_run
@time ccall(:jl_threading_run, Ref{Cvoid}, (Any,), worker_fn)

# Verify that we made as many calls as we think we did.
println("\nCOUNTS:")
println("\tPer thread:\t", counts)
println("\tSum:\t\t", sum(counts))
On an i7-7700, a typical single threaded result is:
STARTED with 1 thread(s) and test size of 1000000.
0.134606 seconds (5.00 M allocations: 76.563 MiB, 1.79% gc time)
COUNTS:
Per thread: [999999.0]
Sum: 999999.0
And with 4 threads:
STARTED with 4 thread(s) and test size of 1000000.
0.140378 seconds (1.81 M allocations: 25.661 MiB)
COUNTS:
Per thread: [249999.0, 249999.0, 249999.0, 249999.0]
Sum: 999996.0
Multi-threading slows things down! Why?
EDIT: A better minimal example can be created using the @threads macro itself.
a = zeros(Threads.nthreads())
b = rand(test_size)
calls = zeros(Threads.nthreads())
@time Threads.@threads for i = 1 : test_size
    a[Threads.threadid()] += b[i]
    calls[Threads.threadid()] += 1
end
I falsely assumed that the @threads macro's inclusion in Julia would mean that there was a benefit to be had.
The problem you have is most probably false sharing: all the threads repeatedly write to neighbouring elements of a and calls, which live on the same cache line, so every write forces the other cores to invalidate and re-fetch that line.
You can solve it by separating the areas you write to far enough apart, like this (here is a "quick and dirty" implementation to show the essence of the change):
julia> function f(spacing)
           test_size = 1000000
           a = zeros(Threads.nthreads()*spacing)
           b = rand(test_size)
           calls = zeros(Threads.nthreads()*spacing)
           Threads.@threads for i = 1 : test_size
               @inbounds begin
                   a[Threads.threadid()*spacing] += b[i]
                   calls[Threads.threadid()*spacing] += 1
               end
           end
           a, calls
       end
f (generic function with 1 method)

julia> @btime f(1);
  41.525 ms (35 allocations: 7.63 MiB)

julia> @btime f(8);
  2.189 ms (35 allocations: 7.63 MiB)
or by doing per-thread accumulation on a local variable, like this (this is the preferred approach, as it should be uniformly faster):
function getrange(n)
    tid = Threads.threadid()
    nt = Threads.nthreads()
    d, r = divrem(n, nt)
    from = (tid - 1) * d + min(r, tid - 1) + 1
    to = from + d - 1 + (tid ≤ r ? 1 : 0)
    from:to
end

function f()
    test_size = 10^8
    a = zeros(Threads.nthreads())
    b = rand(test_size)
    calls = zeros(Threads.nthreads())
    Threads.@threads for k = 1 : Threads.nthreads()
        local_a = 0.0
        local_c = 0.0
        for i in getrange(test_size)
            for j in 1:10
                local_a += b[i]
                local_c += 1
            end
        end
        a[Threads.threadid()] = local_a
        calls[Threads.threadid()] = local_c
    end
    a, calls
end
Also note that if you are running more threads than you have physical cores (relying on hyper-threaded virtual cores), the gains from threading will not be linear.

how to vectorize this function

The following code works, but I would like to create Z by vectorization. How to achieve that?
import numpy as np
from numpy import sqrt
from math import fsum

points = np.array([[0,0],
                   [5,-1],
                   [4,6],
                   [1,3]])

d = lambda x: fsum([sqrt((x[0]-z[0])**2 + (x[1]-z[1])**2) for z in points])

x = np.linspace(min(points[:,0]), max(points[:,0]), 100)
y = np.linspace(min(points[:,1]), max(points[:,1]), 100)
X, Y = np.meshgrid(x,y)

Z = np.zeros(np.shape(X))
for (i,j),_ in np.ndenumerate(Z):
    Z[i,j] = d([X[i,j],Y[i,j]])
#Z=d([X,Y]) #this fails
We can leverage broadcasting to work directly with the 1D versions and thus be more memory efficient and give ourselves a vectorized one-liner, like so -
Z = np.sqrt((x[:,None] - points[:,0])**2 + (y[:,None,None] - points[:,1])**2).sum(2)
Timings on posted sample data -
In [80]: %%timeit
...: X, Y = np.meshgrid(x,y)
...: Z = np.zeros(np.shape(X))
...: for (i,j),_ in np.ndenumerate(Z):
...: Z[i,j] = d([X[i,j],Y[i,j]])
10 loops, best of 3: 101 ms per loop
In [81]: %timeit np.sqrt((x[:,None] - points[:,0])**2 + (y[:,None,None] - points[:,1])**2).sum(2)
1000 loops, best of 3: 246 µs per loop
400x speedup there!
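If the broadcasting is hard to follow, here is the same expression unpacked into named intermediates with their shapes annotated (a sketch reusing the question's setup):
import numpy as np

points = np.array([[0, 0], [5, -1], [4, 6], [1, 3]])
x = np.linspace(points[:, 0].min(), points[:, 0].max(), 100)
y = np.linspace(points[:, 1].min(), points[:, 1].max(), 100)

dx = x[:, None] - points[:, 0]          # (100, 4): x-offset of every grid column to every point
dy = y[:, None, None] - points[:, 1]    # (100, 1, 4): y-offset of every grid row to every point
Z = np.sqrt(dx**2 + dy**2).sum(axis=2)  # broadcasts to (100, 100, 4), then sums over the 4 points
Z[i, j] then equals d([x[j], y[i]]), matching the meshgrid-based loop.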

Numpy version of rolling MAD (mean absolute deviation)

How to make a rolling version of the following MAD function
from numpy import mean, absolute
def mad(data, axis=None):
    return mean(absolute(data - mean(data, axis)), axis)
This code is an answer to this question
At the moment I convert the numpy array to pandas, apply this function, and then convert the result back to numpy
pandasDataFrame.rolling(window=90).apply(mad)
but this is inefficient on larger data-frames. How can I get a rolling window for the same function in numpy, without looping, that gives the same result?
Here's a vectorized NumPy approach -
import numpy as np

# From this post : http://stackoverflow.com/a/40085052/3293881
def strided_app(a, L, S):  # Window len = L, Stride len/stepsize = S
    nrows = ((a.size-L)//S)+1
    n = a.strides[0]
    return np.lib.stride_tricks.as_strided(a, shape=(nrows,L), strides=(S*n,n))

# From this post : http://stackoverflow.com/a/14314054/3293881 by @Jaime
def moving_average(a, n=3):
    ret = np.cumsum(a, dtype=float)
    ret[n:] = ret[n:] - ret[:-n]
    return ret[n - 1:] / n

def mad_numpy(a, W):
    a2D = strided_app(a,W,1)
    return np.absolute(a2D - moving_average(a,W)[:,None]).mean(1)
Runtime test -
In [617]: data = np.random.randint(0,9,(10000))
...: df = pd.DataFrame(data)
...:
In [618]: pandas_out = pd.rolling_apply(df,90,mad).values.ravel()
In [619]: numpy_out = mad_numpy(data,90)
In [620]: np.allclose(pandas_out[89:], numpy_out) # Nans part clipped
Out[620]: True
In [621]: %timeit pd.rolling_apply(df,90,mad)
10 loops, best of 3: 111 ms per loop
In [622]: %timeit mad_numpy(data,90)
100 loops, best of 3: 3.4 ms per loop
In [623]: 111/3.4
Out[623]: 32.64705882352941
Huge 32x+ speedup there over the loopy pandas solution!
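As a small usage note (an addition, not part of the original answer): if you want the NumPy result to line up index-for-index with the pandas output, you can pad the first W-1 positions with NaNs via a thin wrapper around the mad_numpy defined above:
import numpy as np

def mad_numpy_aligned(a, W):
    # Same values as mad_numpy, but with leading NaNs so the output has the
    # same length and alignment as pandas' rolling(window=W).apply(mad)
    out = np.full(a.size, np.nan)
    out[W - 1:] = mad_numpy(a, W)
    return out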

speeding up some for loops in matlab

Basically I am trying to solve a 2nd order differential equation with the forward euler method. I have some for loops inside my code, which take considerable time to solve and I would like to speed things up a bit. Does anyone have any suggestions how could I do this?
And also when looking at the time it takes, I notice that my end at line 14 takes 45 % of my total time. What is end actually doing and why is it taking so much time?
Here is my simplified code:
t = 0:0.01:100;
dt = t(2)-t(1);
B = 3.5 * t;
F0 = 2 * t;
BB = zeros(1,length(t)); % Preallocation
x = 2; % Initial value
u = 0; % Initial value

for ii = 1:length(t)
    for kk = 1:ii
        BB(ii) = BB(ii) + B(kk) * u(ii-kk+1)*dt; % This line takes the most time
    end % This end takes 45% of the other time
    x(ii+1) = x(ii) + dt*u(ii);
    u(ii+1) = u(ii) + dt * (F0(ii) - BB(ii));
end
Running the code takes me 8.552 seconds.
You can remove the inner loop, I think:
for ii = 1:length(t)
    for kk = 1:ii
        BB(ii) = BB(ii) + B(kk) * u(ii-kk+1)*dt; % This line takes the most time
    end % This end takes 45% of the other time
    x(ii+1) = x(ii) + dt*u(ii);
    u(ii+1) = u(ii) + dt * (F0(ii) - BB(ii));
end
So BB(ii) = BB(ii) (zero at initialisation) + the sum for kk = 1 to ii of B(kk) * u(ii-kk+1) * dt,
but kk = 1:ii, so for a given ii, ii-kk+1 → ii-(1:ii)+1 → ii:-1:1.
So I think this is equivalent to:
for ii = 1:length(t)
    BB(ii) = sum(B(1:ii).*u(ii:-1:1)*dt);
    x(ii+1) = x(ii) + dt*u(ii);
    u(ii+1) = u(ii) + dt * (F0(ii) - BB(ii));
end
It doesn't take as long as 8 seconds for me using either method, but the version with only one loop is about 2x as fast (the output of BB appears to be the same).
Is the sum loop of B(kk) * u(ii-kk+1) just conv(B(1:ii),u(1:ii),'same')?
The best way to speed up loops in MATLAB is to try to avoid them. Check whether you can perform a matrix operation instead of the inner loop. For example, try to break the calculation into small parts, and then decide whether there are parts you can compute in advance, without needing the results of the next iteration of the loop.
As for the second part of your question, my guess: the end contains the check of whether the loop runs for another round, and this check by itself is not that long, but it is called 50,015,001 times!
