how to vectorize this function - performance

The following code works, but I would like to create Z by vectorization. How to achieve that?
import numpy as np
from numpy import sqrt
from math import fsum
points = np.array([[0,0],\
[5,-1],\
[4,6],\
[1,3]])
d = lambda x: fsum([sqrt((x[0]-z[0])**2 + (x[1]-z[1])**2) for z in points])
x = np.linspace(min(points[:,0]),max(points[:,0]),100)
y = np.linspace(min(points[:,1]),max(points[:,1]),100)
X, Y = np.meshgrid(x,y)
Z = np.zeros(np.shape(X))
for (i,j),_ in np.ndenumerate(Z):
Z[i,j] = d([X[i,j],Y[i,j]])
#Z=d([X,Y]) #this fails

We can leverage broadcasting to work directly with the 1D versions and thus be more memory efficient and give ourselves a vectorized one-liner, like so -
Z = np.sqrt((x[:,None] - points[:,0])**2 + (y[:,None,None] - points[:,1])**2).sum(2)
Timings on posted sample data -
In [80]: %%timeit
...: X, Y = np.meshgrid(x,y)
...: Z = np.zeros(np.shape(X))
...: for (i,j),_ in np.ndenumerate(Z):
...: Z[i,j] = d([X[i,j],Y[i,j]])
10 loops, best of 3: 101 ms per loop
In [81]: %timeit ((x[:,None] - points[:,0])**2 + (y[:,None,None] - points[:,1])**2).sum(2)
1000 loops, best of 3: 246 µs per loop
400x speedup there!

Related

Runge-Kutta curve fitting extremely slow

I am currently trying to do a regression of a function calculated via a RK4 method performed on a non-linear Volterra integral of the second kind. The problem I found is that the code is extremely slow, for 1 call of the curve_fit function (fitt), it takes about 30-40 minute to generate a data. Overall, there will be a lot of calls to fitt before the parameters are determined, this takes more than 6 hours to run. Is there anyway to optimize this code? Thanks in advance!
from scipy.special import gamma
from ml_internal import LTInversion
from scipy.optimize import curve_fit , fsolve
from scipy.misc import derivative
from sklearn.metrics import r2_score
from math import comb , factorial
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
# Gets the data
df = pd.read_excel('D:\\CoMat\\Fractional_fit\\optimized\\data_optimized.xlsx')
skipTime = 1
skipIndex = df[df['Time']== skipTime].index.values[0]
xls = pd.read_excel('D:\\CoMat\\Fractional_fit\\optimized\\data_optimized.xlsx',skiprows=np.arange(1,skipIndex+1,1))
timeDF = xls['Time']
tempDF = xls['Temp']
taDF = xls['Ta']
timeDF = timeDF - timeDF[0]
tempDF = tempDF + 273.15
t0 = tempDF[0]
ta = sum(taDF)/len(taDF)
ta = ta + 273.15
###########################################
#Spliting into intervals
h = 0.05
a = 0
b = timeDF[len(timeDF)-1]
N = int(np.round((b-a)/h))
#Each xi
def xidx(index):
return a + h*index
#Function in the image are written here.
def gx(t,lamda,alpha):
return t0 * ml(lamda*(t**alpha),alpha)
gx = np.vectorize(gx)
def kernel(t,s,rad,lamda,alpha,beta):
if t == s:
return 0
return (t-s)**(alpha-1) * ml_(lamda*((t-s)**alpha),alpha,alpha,1) * (beta*(rad**4) - beta*(ta**4) - lamda*ta)
kernel = np.vectorize(kernel)
############################
# The problem is here!!!!!!
def fx(x,n,lamda,alpha,beta):
ans = gx(x,lamda,alpha)
for j in range(n):
ans += (h/6)*(kernel(x,xidx(j),f0[j],lamda,alpha,beta) + 2*kernel(x,xidx(j+1/2),f1[j],lamda,alpha,beta) + 2*kernel(x,xidx(j+1/2),f2[j],lamda,alpha,beta) + kernel(x,xidx(j+1),f3[j],lamda,alpha,beta))
return ans
#########################
f0 = np.zeros(N+1)
f0[0] = t0
f1 = np.zeros(N+1)
f2 = np.zeros(N+1)
f3 = np.zeros(N+1)
F = np.zeros((3,N+1))
def fitt(xvalue,lamda,alpha,beta):
global f0,f1,f2,f3,F
n = int(np.round(xvalue/h))
f1[n] = fx(xidx(n) + 1/2,n,lamda,alpha,beta) + (h/2)*kernel(xidx(n + 1/2),xidx(n),f0[n],lamda,alpha,beta)
f2[n] = fx(xidx(n + 1/2),n,lamda,alpha,beta)
f3[n] = fx(xidx(n+1),n,lamda,alpha,beta) + h*kernel(xidx(n+1),xidx(n+1/2),f2[n],lamda,alpha,beta)
if n+1 <= N:
f0[n+1] = fx(xidx(n+1),n,lamda,alpha,beta) + (h/6)*(kernel(xidx(n+1),xidx(n),f0[n],lamda,alpha,beta) + 2*kernel(xidx(n+1),xidx(n+1/2),f1[n],lamda,alpha,beta) + 2*kernel(xidx(n+1),xidx(n+1/2),f2[n],lamda,alpha,beta) + kernel(xidx(n+1),xidx(n+1),f3[n],lamda,alpha,beta))
if(xvalue == timeDF[len(timeDF) - 1]):
print(f0[n],n)
returnValue = f0[n]
f0 = np.zeros(N+1)
f0[0] = t0
f1 = np.zeros(N+1)
f2 = np.zeros(N+1)
f3 = np.zeros(N+1)
return returnValue
print(f0[n],n)
return f0[n]
fitt = np.vectorize(fitt)
#Fitting, plotting and giving (Adj) R-squared
popt , pcov = curve_fit(fitt,timeDF,tempDF,p0=(-0.1317,0.95,-1e-11),bounds=((-np.inf,0,-np.inf),(0,1,0)))
print(popt)
y_fit = np.array(fitt(timeDF,popt[0],popt[1],popt[2]))
plt.scatter(timeDF,tempDF,color='ORANGE',marker='.',s=0.5)
plt.fill_between(timeDF,tempDF-0.5,tempDF+0.5,color='ORANGE', alpha=0.2)
plt.plot(timeDF,y_fit,color='RED',linewidth=1)
plt.legend(["Experimental data", "Caputo fit"], loc ="upper right")
plt.xlabel("Time (min)")
plt.ylabel("Temperature (Kelvin)")
plt.show()
plt.close()
r2 = r2_score(tempDF,y_fit)
print(r2)
adjr2 = 1 - (1 - r2)*((len(xls)-1)/(len(xls)-3-1))
print(adjr2)
I already tried computing the values f0,f1,f2,f3 all at once, but the thing consuming the most time is Fn(x) which I haven't figured in out how to compute them all at once. If this is possible to compute at once, I think the program will run much faster. PS: ML,ML_ is a function from https://github.com/khinsen/mittag-leffler.
This is the function necesssary. Fn is the only one I haven't figured out yet.
There are two typing errors in the cited image. The combination of x_n and 1/2 is always meant to be the midpoint x_{n+1/2} = x_n + h/2. The second error is a duplication of x_{n+1/2} in the formula for f^{(4)}_n in its third term. The first error is probably producing errors that are large enough to make convergence complicated and any limit wrong for the intended problem.
In the Simpson/RK4 step, the 4 fx computations can be reduced to 2.
The F_n implement the left side of the integral equation
F(x) = g(x) + int(s=0 to x of K(x,s,f(s))
where the integral is approximated with the sample sequences f0,...,f3. Due to the structure of problem and algorithm F_n(x_n)=f^0_n = f^4_{n-1}.
Note that K(x,s,f) should be set to zero for s >= x. In the exact version of the equation these values "above the diagonal" are not used.
If an increase in accuracy is needed, for instance to avoid divergence where there is none in the exact solution, you can decrease the step site by a factor of 10 and then sub-sample the f^0_n sequence to produce the numerical guess for the given data. Other factors than 10 are of course also possible.

Regression with constraints on contribution from variables

I'm trying to develop a regression model with constraints on effect from the independent variables. So my model equation is y = a0 + a1x1 + a2x2 with 200 datapoints. What I want to achieve is sum(a1x1) over 200 datapoints should fall in certain range i.e. lb1<sum(a1x1)<ub1. I am using Gekko for the optimization part and a got stuck while applying this condition.
I am using the following code where ubdict is the dictionary for the boundaries:
m = gk.GEKKO(remote=False)
m.options.IMODE=2 #Regression mode
y = np.array(df['y']) #dependant vars for optimization
x = np.array(df[X]) #array of independent vars for optimization
n = x.shape[1] #number of variables
c = m.Array(m.FV, n+1) #array of parameters and intercept
for ci in c:
ci.STATUS = 1 #calculate fixed parameter
xp = [None]*n
#load data
xd = m.Array(m.Param,n)
yd = m.Param(value=y)
for i in range(n):
xd[i].value = x[:,i]
xp[i] = m.Var()
if ubound_dict[i] >= 0:
xp[i] = m.Var(lb=0, ub=ubdict[i])
elif ubound_dict[i] < 0:
xp[i] = m.Var(lb=ubdict[i], ub=0)
m.Equation(xp[i]==c[i]*xd[i])
yp = m.Var()
m.Equation(yp==m.sum([xp[i] for i in range(n)] + [c[n]]))
#Minimize difference between actual and predicted y
m.Minimize((yd-yp)**2)
#APOPT solver
m.options.SOLVER = 1
#Solve
m.solve(disp=True)
#Retrieve parameter values
a = [i.value[0] for i in c]
print(a)
But this is applying the constraint row-wise. What I want is something like
xp[i] = m.Var(lb=0, ub=ubdict[i])
m.Equation(xp[i]==sum(c[i]*xd[i]) over observations)
Any suggestion would be of great help!
Below is a similar problem with sample data.
Regression Mode with IMODE=2
Use the m.vsum() object in Gekko with IMODE=2. Gekko lets you write the equations once and then applies the data to each equation. This is more efficient for large-scale data sets.
import numpy as np
from gekko import GEKKO
# load data
x1 = np.array([1,2,5,3,2,5,2])
x2 = np.array([5,6,7,2,1,3,2])
ym = np.array([3,2,3,5,6,7,8])
# model
m = GEKKO()
c = m.Array(m.FV,3)
for ci in c:
ci.STATUS=1
x1 = m.Param(value=x1)
x2 = m.Param(value=x2)
ymeas = m.Param(value=ym)
ypred = m.Var()
m.Equation(ypred == c[0] + c[1]*x1 + c[2]*x2)
# add constraint on sum(c[1]*x1) with vsum
v1 = m.Var(); m.Equation(v1==c[1]*x1)
con = m.Var(lb=0,ub=10); m.Equation(con==m.vsum(v1))
m.Minimize((ypred-ymeas)**2)
m.options.IMODE = 2
m.solve()
print('Final SSE Objective: ' + str(m.options.objfcnval))
print('Solution')
for i,ci in enumerate(c):
print(i,ci.value[0])
# plot solution
import matplotlib.pyplot as plt
plt.figure(figsize=(8,4))
plt.plot(ymeas,ypred,'ro')
plt.plot([0,10],[0,10],'k-')
plt.xlabel('Meas')
plt.ylabel('Pred')
plt.savefig('results.png',dpi=300)
plt.show()
Optimization Mode (IMODE=3)
The optimization mode 3 allows you to write each equation and objective term individually. Both give the same solution.
import numpy as np
from gekko import GEKKO
# load data
x1 = np.array([1,2,5,3,2,5,2])
x2 = np.array([5,6,7,2,1,3,2])
ym = np.array([3,2,3,5,6,7,8])
n = len(ym)
# model
m = GEKKO()
c = m.Array(m.FV,3)
for ci in c:
ci.STATUS=1
yp = m.Array(m.Var,n)
for i in range(n):
m.Equation(yp[i]==c[0]+c[1]*x1[i]+c[2]*x2[i])
m.Minimize((yp[i]-ym[i])**2)
# add constraint on sum(c[1]*x1)
s = m.Var(lb=0,ub=10); m.Equation(s==c[1]*sum(x1))
m.options.IMODE = 3
m.solve()
print('Final SSE Objective: ' + str(m.options.objfcnval))
print('Solution')
for i,ci in enumerate(c):
print(i,ci.value[0])
# plot solution
import matplotlib.pyplot as plt
plt.figure(figsize=(8,4))
ypv = [yp[i].value[0] for i in range(n)]
plt.plot(ym,ypv,'ro')
plt.plot([0,10],[0,10],'k-')
plt.xlabel('Meas')
plt.ylabel('Pred')
plt.savefig('results.png',dpi=300)
plt.show()
For future questions, please create a simple and complete example that demonstrates the issue.

OpenCV 3.1 optimization

I'm currently trying to implement an algorithm from a paper with OpenCV 3.1 on python 2.7 but the process is taking way too long.
The section of my code that's giving me trouble looks something like this:
width, height = mr.shape[:2]
Pm = []
for i in d:
M = np.float32([[1,0,-d[i]], [0,1,1]])
mrd = cv2.warpAffine(mr, M, (height,width))
C = cv2.subtract(ml, mrd)
C = cv2.pow(C,2)
C = np.divide(C, sigma_m)
C = p0 + (1-p0)**(-C)
Pm.append(C)
Where ml, mr and mrd are cv2 objects and d, p0 and sigma_m are integers.
The division and final equation in the last 3 lines are the real troublemakers here. Every iteration of this cycle is independent so in theory I could just split the 'for loop' through a few processors, but that seems like a lazy approach where I would just bypass the problem instead of fixing it.
Does anyone know a way to perform those computations faster?
We can leverage numexpr module to efficiently perform all of those latter arithmetic operations as one evaluate expression.
Thus, these steps :
C = cv2.subtract(ml, mrd)
C = cv2.pow(C,2)
C = np.divide(C, sigma_m)
C = p0 + (1-p0)**(-C)
could be replaced by one expression -
import numexpr as ne
C = ne.evaluate('p0 +(1-p0)**(-((ml-mrd)**2)/sigma_m)')
Let's verify things. The original approach as func -
def original_app(ml, mrd, sigma_m, p0):
C = cv2.subtract(ml, mrd)
C = cv2.pow(C,2)
C = np.divide(C, sigma_m)
C = p0 + (1-p0)**(-C)
return C
Verification -
In [28]: # Setup inputs
...: S = 1024 # Size parameter
...: ml = np.random.randint(0,255,(S,S))/255.0
...: mrd = np.random.randint(0,255,(S,S))/255.0
...: sigma_m = 0.45
...: p0 = 0.56
...:
In [29]: out1 = original_app(ml, mrd, sigma_m, p0)
In [30]: out2 = ne.evaluate('p0 +(1-p0)**(-((ml-mrd)**2)/sigma_m)')
In [31]: np.allclose(out1, out2)
Out[31]: True
Timings across various sizes of datasets -
In [19]: # Setup inputs
...: S = 1024 # Size parameter
...: ml = np.random.randint(0,255,(S,S))/255.0
...: mrd = np.random.randint(0,255,(S,S))/255.0
...: sigma_m = 0.45
...: p0 = 0.56
...:
In [20]: %timeit original_app(ml, mrd, sigma_m, p0)
10 loops, best of 3: 67.1 ms per loop
In [21]: %timeit ne.evaluate('p0 +(1-p0)**(-((ml-mrd)**2)/sigma_m)')
100 loops, best of 3: 12.9 ms per loop
In [22]: # Setup inputs
...: S = 512 # Size parameter
In [23]: %timeit original_app(ml, mrd, sigma_m, p0)
100 loops, best of 3: 15.3 ms per loop
In [24]: %timeit ne.evaluate('p0 +(1-p0)**(-((ml-mrd)**2)/sigma_m)')
100 loops, best of 3: 3.39 ms per loop
In [25]: # Setup inputs
...: S = 256 # Size parameter
In [26]: %timeit original_app(ml, mrd, sigma_m, p0)
100 loops, best of 3: 3.65 ms per loop
In [27]: %timeit ne.evaluate('p0 +(1-p0)**(-((ml-mrd)**2)/sigma_m)')
1000 loops, best of 3: 878 µs per loop
Around 5x speedup across various sizes with better speedups for larger arrays!
Also, as a side-note, I would advise using initialized arrays instead of appending as you are doing at the final step. Thus, we could initialize before going into the loop with something like out = np.zeros((len(d), width, height)) / np.empty and at the final step assign into the output array with : out[iteration_ID] = C.

Numpy version of rolling MAD (mean absolute deviation)

How to make a rolling version of the following MAD function
from numpy import mean, absolute
def mad(data, axis=None):
return mean(absolute(data - mean(data, axis)), axis)
This code is an answer to this question
At the moment i convert numpy to pandas then apply this function, then convert the result back to numpy
pandasDataFrame.rolling(window=90).apply(mad)
but this is inefficient on larger data-frames. How to get a rolling window for the same function in numpy without looping and give the same result?
Here's a vectorized NumPy approach -
# From this post : http://stackoverflow.com/a/40085052/3293881
def strided_app(a, L, S ): # Window len = L, Stride len/stepsize = S
nrows = ((a.size-L)//S)+1
n = a.strides[0]
return np.lib.stride_tricks.as_strided(a, shape=(nrows,L), strides=(S*n,n))
# From this post : http://stackoverflow.com/a/14314054/3293881 by #Jaime
def moving_average(a, n=3) :
ret = np.cumsum(a, dtype=float)
ret[n:] = ret[n:] - ret[:-n]
return ret[n - 1:] / n
def mad_numpy(a, W):
a2D = strided_app(a,W,1)
return np.absolute(a2D - moving_average(a,W)[:,None]).mean(1)
Runtime test -
In [617]: data = np.random.randint(0,9,(10000))
...: df = pd.DataFrame(data)
...:
In [618]: pandas_out = pd.rolling_apply(df,90,mad).values.ravel()
In [619]: numpy_out = mad_numpy(data,90)
In [620]: np.allclose(pandas_out[89:], numpy_out) # Nans part clipped
Out[620]: True
In [621]: %timeit pd.rolling_apply(df,90,mad)
10 loops, best of 3: 111 ms per loop
In [622]: %timeit mad_numpy(data,90)
100 loops, best of 3: 3.4 ms per loop
In [623]: 111/3.4
Out[623]: 32.64705882352941
Huge 32x+ speedup there over the loopy pandas solution!

Python performance: iteration and operations on nested lists

Problem Hey folks. I'm looking for some advice on python performance. Some background on my problem:
Given:
A (x,y) mesh of nodes each with a value (0...255) starting at 0
A list of N input coordinates each at a specified location within the range (0...x, 0...y)
A value Z that defines the "neighborhood" in count of nodes
Increment the value of the node at the input coordinate and the node's neighbors. Neighbors beyond the mesh edge are ignored. (No wrapping)
BASE CASE: A mesh of size 1024x1024 nodes, with 400 input coordinates and a range Z of 75 nodes.
Processing should be O(x*y*Z*N). I expect x, y and Z to remain roughly around the values in the base case, but the number of input coordinates N could increase up to 100,000. My goal is to minimize processing time.
Current results Between my start and the comments below, we've got several implementations.
Running speed on my 2.26 GHz Intel Core 2 Duo with Python 2.6.1:
f1: 2.819s
f2: 1.567s
f3: 1.593s
f: 1.579s
f3b: 1.526s
f4: 0.978s
f1 is the initial naive implementation: three nested for loops.
f2 is replaces the inner for loop with a list comprehension.
f3 is based on Andrei's suggestion in the comments and replaces the outer for with map()
f is Chris's suggestion in the answers below
f3b is kriss's take on f3
f4 is Alex's contribution.
Code is included below for your perusal.
Question How can I further reduce the processing time? I'd prefer sub-1.0s for the test parameters.
Please, keep the recommendations to native Python. I know I can move to a third-party package such as numpy, but I'm trying to avoid any third party packages. Also, I've generated random input coordinates, and simplified the definition of the node value updates to keep our discussion simple. The specifics have to change slightly and are outside the scope of my question.
thanks much!
**`f1` is the initial naive implementation: three nested `for` loops.**
def f1(x,y,n,z):
rows = [[0]*x for i in xrange(y)]
for i in range(n):
inputX, inputY = (int(x*random.random()), int(y*random.random()))
topleft = (inputX - z, inputY - z)
for i in xrange(max(0, topleft[0]), min(topleft[0]+(z*2), x)):
for j in xrange(max(0, topleft[1]), min(topleft[1]+(z*2), y)):
if rows[i][j] <= 255: rows[i][j] += 1
f2 is replaces the inner for loop with a list comprehension.
def f2(x,y,n,z):
rows = [[0]*x for i in xrange(y)]
for i in range(n):
inputX, inputY = (int(x*random.random()), int(y*random.random()))
topleft = (inputX - z, inputY - z)
for i in xrange(max(0, topleft[0]), min(topleft[0]+(z*2), x)):
l = max(0, topleft[1])
r = min(topleft[1]+(z*2), y)
rows[i][l:r] = [j+(j<255) for j in rows[i][l:r]]
UPDATE: f3 is based on Andrei's suggestion in the comments and replaces the outer for with map(). My first hack at this requires several out-of-local-scope lookups, specifically recommended against by Guido: local variable lookups are much faster than global or built-in variable lookups I hardcoded all but the reference to the main data structure itself to minimize that overhead.
rows = [[0]*x for i in xrange(y)]
def f3(x,y,n,z):
inputs = [(int(x*random.random()), int(y*random.random())) for i in range(n)]
rows = map(g, inputs)
def g(input):
inputX, inputY = input
topleft = (inputX - 75, inputY - 75)
for i in xrange(max(0, topleft[0]), min(topleft[0]+(75*2), 1024)):
l = max(0, topleft[1])
r = min(topleft[1]+(75*2), 1024)
rows[i][l:r] = [j+(j<255) for j in rows[i][l:r]]
UPDATE3: ChristopeD also pointed out a couple improvements.
def f(x,y,n,z):
rows = [[0] * y for i in xrange(x)]
rn = random.random
for i in xrange(n):
topleft = (int(x*rn()) - z, int(y*rn()) - z)
l = max(0, topleft[1])
r = min(topleft[1]+(z*2), y)
for u in xrange(max(0, topleft[0]), min(topleft[0]+(z*2), x)):
rows[u][l:r] = [j+(j<255) for j in rows[u][l:r]]
UPDATE4: kriss added a few improvements to f3, replacing min/max with the new ternary operator syntax.
def f3b(x,y,n,z):
rn = random.random
rows = [g1(x, y, z) for x, y in [(int(x*rn()), int(y*rn())) for i in xrange(n)]]
def g1(x, y, z):
l = y - z if y - z > 0 else 0
r = y + z if y + z < 1024 else 1024
for i in xrange(x - z if x - z > 0 else 0, x + z if x + z < 1024 else 1024 ):
rows[i][l:r] = [j+(j<255) for j in rows[i][l:r]]
UPDATE5: Alex weighed in with his substantive revision, adding a separate map() operation to cap the values at 255 and removing all non-local-scope lookups. The perf differences are non-trivial.
def f4(x,y,n,z):
rows = [[0]*y for i in range(x)]
rr = random.randrange
inc = (1).__add__
sat = (0xff).__and__
for i in range(n):
inputX, inputY = rr(x), rr(y)
b = max(0, inputX - z)
t = min(inputX + z, x)
l = max(0, inputY - z)
r = min(inputY + z, y)
for i in range(b, t):
rows[i][l:r] = map(inc, rows[i][l:r])
for i in range(x):
rows[i] = map(sat, rows[i])
Also, since we all seem to be hacking around with variations, here's my test harness to compare speeds: (improved by ChristopheD)
def timing(f,x,y,z,n):
fn = "%s(%d,%d,%d,%d)" % (f.__name__, x, y, z, n)
ctx = "from __main__ import %s" % f.__name__
results = timeit.Timer(fn, ctx).timeit(10)
return "%4.4s: %.3f" % (f.__name__, results / 10.0)
if __name__ == "__main__":
print timing(f, 1024, 1024, 400, 75)
#add more here.
On my (slow-ish;-) first-day Macbook Air, 1.6GHz Core 2 Duo, system Python 2.5 on MacOSX 10.5, after saving your code in op.py I see the following timings:
$ python -mtimeit -s'import op' 'op.f1()'
10 loops, best of 3: 5.58 sec per loop
$ python -mtimeit -s'import op' 'op.f2()'
10 loops, best of 3: 3.15 sec per loop
So, my machine is slower than yours by a factor of a bit more than 1.9.
The fastest code I have for this task is:
def f3(x=x,y=y,n=n,z=z):
rows = [[0]*y for i in range(x)]
rr = random.randrange
inc = (1).__add__
sat = (0xff).__and__
for i in range(n):
inputX, inputY = rr(x), rr(y)
b = max(0, inputX - z)
t = min(inputX + z, x)
l = max(0, inputY - z)
r = min(inputY + z, y)
for i in range(b, t):
rows[i][l:r] = map(inc, rows[i][l:r])
for i in range(x):
rows[i] = map(sat, rows[i])
which times as:
$ python -mtimeit -s'import op' 'op.f3()'
10 loops, best of 3: 3 sec per loop
so, a very modest speedup, projecting to more than 1.5 seconds on your machine - well above the 1.0 you're aiming for:-(.
With a simple C-coded extensions, exte.c...:
#include "Python.h"
static PyObject*
dopoint(PyObject* self, PyObject* args)
{
int x, y, z, px, py;
int b, t, l, r;
int i, j;
PyObject* rows;
if(!PyArg_ParseTuple(args, "iiiiiO",
&x, &y, &z, &px, &py, &rows
))
return 0;
b = px - z;
if (b < 0) b = 0;
t = px + z;
if (t > x) t = x;
l = py - z;
if (l < 0) l = 0;
r = py + z;
if (r > y) r = y;
for(i = b; i < t; ++i) {
PyObject* row = PyList_GetItem(rows, i);
for(j = l; j < r; ++j) {
PyObject* pyitem = PyList_GetItem(row, j);
long item = PyInt_AsLong(pyitem);
if (item < 255) {
PyObject* newitem = PyInt_FromLong(item + 1);
PyList_SetItem(row, j, newitem);
}
}
}
Py_RETURN_NONE;
}
static PyMethodDef exteMethods[] = {
{"dopoint", dopoint, METH_VARARGS, "process a point"},
{0}
};
void
initexte()
{
Py_InitModule("exte", exteMethods);
}
(note: I haven't checked it carefully -- I think it doesn't leak memory due to the correct interplay of reference stealing and borrowing, but it should be code inspected very carefully before being put in production;-), we could do
import exte
def f4(x=x,y=y,n=n,z=z):
rows = [[0]*y for i in range(x)]
rr = random.randrange
for i in range(n):
inputX, inputY = rr(x), rr(y)
exte.dopoint(x, y, z, inputX, inputY, rows)
and the timing
$ python -mtimeit -s'import op' 'op.f4()'
10 loops, best of 3: 345 msec per loop
shows an acceleration of 8-9 times, which should put you in the ballpark you desire. I've seen a comment saying you don't want any third-party extension, but, well, this tiny extension you could make entirely your own;-). ((Not sure what licensing conditions apply to code on Stack Overflow, but I'll be glad to re-release this under the Apache 2 license or the like, if you need that;-)).
1. A (smaller) speedup could definitely be the initialization of your rows...
Replace
rows = []
for i in range(x):
rows.append([0 for i in xrange(y)])
with
rows = [[0] * y for i in xrange(x)]
2. You can also avoid some lookups by moving random.random out of the loops (saves a little).
3. EDIT: after corrections -- you could arrive at something like this:
def f(x,y,n,z):
rows = [[0] * y for i in xrange(x)]
rn = random.random
for i in xrange(n):
topleft = (int(x*rn()) - z, int(y*rn()) - z)
l = max(0, topleft[1])
r = min(topleft[1]+(z*2), y)
for u in xrange(max(0, topleft[0]), min(topleft[0]+(z*2), x)):
rows[u][l:r] = [j+(j<255) for j in rows[u][l:r]]
EDIT: some new timings with timeit (10 runs) -- seems this provides only minor speedups:
import timeit
print timeit.Timer("f1(1024,1024,400,75)", "from __main__ import f1").timeit(10)
print timeit.Timer("f2(1024,1024,400,75)", "from __main__ import f2").timeit(10)
print timeit.Timer("f(1024,1024,400,75)", "from __main__ import f3").timeit(10)
f1 21.1669280529
f2 12.9376120567
f 11.1249599457
in your f3 rewrite, g can be simplified. (Can also be applied to f4)
You have the following code inside a for loop.
l = max(0, topleft[1])
r = min(topleft[1]+(75*2), 1024)
However, it appears that those values never change inside the for loop. So calculate them once, outside the loop instead.
Based on your f3 version I played with the code. As l and r are constants you can avoid to compute them in g1 loop. Also using new ternary if instead of min and max seems to be consistently faster. Also simplified expression with topleft. On my system it appears to be about 20% faster using with the code below.
def f3b(x,y,n,z):
rows = [g1(x, y, z) for x, y in [(int(x*random.random()), int(y*random.random())) for i in range(n)]]
def g1(x, y, z):
l = y - z if y - z > 0 else 0
r = y + z if y + z < 1024 else 1024
for i in xrange(x - z if x - z > 0 else 0, x + z if x + z < 1024 else 1024 ):
rows[i][l:r] = [j+(j<255) for j in rows[i][l:r]]
You can create your own Python module in C, and control the performance as you want:
http://docs.python.org/extending/

Resources