Implementing a FIR filter using Vectors - algorithm

I have implemented a FIR filter in Haskell. I don't know that much about FIR filters and my code is heavily based on an existing C# implementation. Therefore, I have a feeling that my implementation is has too much of a C# style and is not really Haskell-like. I would like to know if there is a more idiomatic Haskell way of implementing my code. Ideally, I'm lucky for some combination of higher-order functions (map, filter, fold, etc.) that implement the algorithm.
My Haskell code looks like this:
applyFIR :: Vector Double -> Vector Double -> Vector Double
applyFIR b x = generate (U.length x) help
where
help i = if i >= (U.length b - 1) then loop i (U.length b - 1) else 0
loop yi bi = if bi < 0 then 0 else b !! bi * x !! (yi-bi) + loop yi (bi-1)
vec !! i = unsafeIndex vec i -- Shorthand for unsafeIndex
This code is based on the following C# code:
public float[] RunFilter(double[] x)
{
int M = coeff.Length;
int n = x.Length;
//y[n]=b0x[n]+b1x[n-1]+....bmx[n-M]
var y = new float[n];
for (int yi = 0; yi < n; yi++)
{
double t = 0.0f;
for (int bi = M - 1; bi >= 0; bi--)
{
if (yi - bi < 0) continue;
t += coeff[bi] * x[yi - bi];
}
y[yi] = (float) t;
}
return y;
}
As you can see, it's almost a straight copy. How can I turn my implementation into a more Haskell-like one? Do you have any ideas? The only thing I could come up with was using Vector.generate.
I know that the DSP library has an implementation available. But it uses lists and is way too slow for my use case. This Vector implementation is a lot faster than the one in DSP.
I've also tried implementing the algorithm using Repa. It is faster than the Vector implementation. Here is the result:
applyFIR :: V.Vector Float -> Array U DIM1 Float -> Array D DIM1 Float
applyFIR b x = R.traverse x id (\_ (Z :. i) -> if i >= len then loop i (len - 1) else 0)
where
len = V.length b
loop :: Int -> Int -> Float
loop yi bi = if bi < 0 then 0 else (V.unsafeIndex b bi) * x !! (Z :. (yi-bi)) + loop yi (bi-1)
arr !! i = unsafeIndex arr i

First of all, I don't think that your initial vector code is a faithful translation - that is, I think it disagrees with the C# code. For example, suppose that both "x" and "b" ("b" is coeff in C#) have length 3, and have all values of 1.0. Then for y[0] the C# code would produce x[0] * coeff[0], or 1.0. (it would hit continue for all other values of bi)
With your Haskell code, however, help 0 produces 0. Your Repa version seems to suffer from the same problem.
So let's start with a more faithful translation:
applyFIR :: Vector Double -> Vector Double -> Vector Double
applyFIR b x = generate (U.length x) help
where
help i = loop i (min i $ U.length b - 1)
loop yi bi = if bi < 0 then 0 else b !! bi * x !! (yi-bi) + loop yi (bi-1)
vec !! i = unsafeIndex vec i -- Shorthand for unsafeIndex
Now, you're basically doing a calculation like this for computing, say, y[3]:
... b[3] | b[2] | b[1] | b[0]
x[0] | x[1] | x[2] | x[3] | x[4] | x[5] | ....
multiply
b[3]*x[0]|b[2]*x[1] |b[1]*x[2] |b[0]*x[3]
sum
y[3] = b[3]*x[0] + b[2]*x[1] + b[1]*x[2] + b[0]*x[3]
So one way to think of what you're doing is "take the b vector, reverse it, and to compute spot i of the result, line b[0] up with x[i], multiply all the corresponding x and b entries, and compute the sum".
So let's do that:
applyFIR :: Vector Double -> Vector Double -> Vector Double
applyFIR b x = generate (U.length x) help
where
revB = U.reverse b
bLen = U.length b
help i = let sliceLen = min (i+1) bLen
bSlice = U.slice (bLen - sliceLen) sliceLen revB
xSlice = U.slice (i + 1 - sliceLen) sliceLen x
in U.sum $ U.zipWith (*) bSlice xSlice

Related

Solving linear equations

I have to find out the integral solution of a equation ax+by=c such that x>=0 and y>=0 and value of (x+y) is minimum.
I know if c%gcd(a,b)}==0 then it's always possible. How to find the values of x and y?
My approach
for(i 0 to 2*c):
x=i
y= (c-a*i)/b
if(y is integer)
ans = min(ans,x+y)
Is there any better way to do this ? Having better time complexity.
Using the Extended Euclidean Algorithm and the theory of linear Diophantine equations there is no need to search. Here is a Python 3 implementation:
def egcd(a,b):
s,t = 1,0 #coefficients to express current a in terms of original a,b
x,y = 0,1 #coefficients to express current b in terms of original a,b
q,r = divmod(a,b)
while(r > 0):
a,b = b,r
old_x, old_y = x,y
x,y = s - q*x, t - q*y
s,t = old_x, old_y
q,r = divmod(a,b)
return b, x ,y
def smallestSolution(a,b,c):
d,x,y = egcd(a,b)
if c%d != 0:
return "No integer solutions"
else:
u = a//d #integer division
v = b//d
w = c//d
x = w*x
y = w*y
k1 = -x//v if -x % v == 0 else 1 + -x//v #k1 = ceiling(-x/v)
x1 = x + k1*v # x + k1*v is solution with smallest x >= 0
y1 = y - k1*u
if y1 < 0:
return "No nonnegative integer solutions"
else:
k2 = y//u #floor division
x2 = x + k2*v #y-k2*u is solution with smallest y >= 0
y2 = y - k2*u
if x2 < 0 or x1+y1 < x2+y2:
return (x1,y1)
else:
return (x2,y2)
Typical run:
>>> smallestSolution(1001,2743,160485)
(111, 18)
The way it works: first use the extended Euclidean algorithm to find d = gcd(a,b) and one solution, (x,y). All other solutions are of the form (x+k*v,y-k*u) where u = a/d and v = b/d. Since x+y is linear, it has no critical points, hence is minimized in the first quadrant when either x is as small as possible or y is as small as possible. The k above is an arbitrary integer parameter. By appropriate use of floor and ceiling you can locate the integer points with either x as small as possible or y is as small as possible. Just take the one with the smallest sum.
On Edit: My original code used the Python function math.ceiling applied to -x/v. This is problematic for very large integers. I tweaked it so that the ceiling is computed with just int operations. It can now handle arbitrarily large numbers:
>>> a = 236317407839490590865554550063
>>> b = 127372335361192567404918884983
>>> c = 475864993503739844164597027155993229496457605245403456517677648564321
>>> smallestSolution(a,b,c)
(2013668810262278187384582192404963131387, 120334243940259443613787580180)
>>> x,y = _
>>> a*x+b*y
475864993503739844164597027155993229496457605245403456517677648564321
Most of the computation takes place in the running the extended Euclidean algorithm, which is known to be O(min(a,b)).
First let assume a,b,c>0 so:
a.x+b.y = c
x+y = min(xi+yi)
x,y >= 0
a,b,c > 0
------------------------
x = ( c - b.y )/a
y = ( c - a.x )/b
c - a.x >= 0
c - b.y >= 0
c >= b.y
c >= a.x
x <= c/x
y <= c/b
So naive O(n) solution is in C++ like this:
void compute0(int &x,int &y,int a,int b,int c) // naive
{
int xx,yy;
xx=-1; yy=-1;
for (y=0;;y++)
{
x = c - b*y;
if (x<0) break; // y out of range stop
if (x%a) continue; // non integer solution
x/=a; // remember minimal solution
if ((xx<0)||(x+y<=xx+yy)) { xx=x; yy=y; }
}
x=xx; y=yy;
}
if no solution found it returns -1,-1 If you think about the equation a bit then you should realize that min solution will be when x or y is minimal (which one depends on a<b condition) so adding such heuristics we can increase only the minimal coordinate until first solution found. This will speed up considerably the whole thing:
void compute1(int &x,int &y,int a,int b,int c)
{
if (a<=b){ for (x=0,y=c;y>=0;x++,y-=a) if (y%b==0) { y/=b; return; } }
else { for (y=0,x=c;x>=0;y++,x-=b) if (x%a==0) { x/=a; return; } }
x=-1; y=-1;
}
I measured this on my setup:
x y ax+by x+y a=50 b=105 c=500000000
[ 55.910 ms] 10 4761900 500000000 4761910 naive
[ 0.000 ms] 10 4761900 500000000 4761910 opt
x y ax+by x+y a=105 b=50 c=500000000
[ 99.214 ms] 4761900 10 500000000 4761910 naive
[ 0.000 ms] 4761900 10 500000000 4761910 opt
The ~2.0x difference for naive method times is due to a/b=~2.0and selecting worse coordinate to iterate in the second run.
Now just handle special cases when a,b,c are zero (to avoid division by zero)...

Finding the continued fraction of 2^(1/3) to very high precision

Here I'll use the notation
It is possible to find the continued fraction of a number by computing it then applying the definition, but that requires at least O(n) bits of memory to find a0, a1 ... an, in practice it is a much worse. Using double floating point precision it is only possible to find a0, a1 ... a19.
An alternative is to use the fact that if a,b,c are rational numbers then there exist unique rationals p,q,r such that 1/(a+b*21/3+c*22/3) = x+y*21/3+z*22/3, namely
So if I represent x,y, and z to absolute precision using the boost rational lib I can obtain floor(x + y*21/3+z*22/3) accurately only using double precision for 21/3 and 22/3 because I only need it to be within 1/2 of the true value. Unfortunately the numerators and denominators of x,y, and z grow considerably fast, and if you use regular floats instead the errors pile up quickly.
This way I was able to compute a0, a1 ... a10000 in under an hour, but somehow mathematica can do that in 2 seconds. Here's my code for reference
#include <iostream>
#include <boost/multiprecision/cpp_int.hpp>
namespace mp = boost::multiprecision;
int main()
{
const double t_1 = 1.259921049894873164767210607278228350570251;
const double t_2 = 1.587401051968199474751705639272308260391493;
mp::cpp_rational p = 0;
mp::cpp_rational q = 1;
mp::cpp_rational r = 0;
for(unsigned int i = 1; i != 10001; ++i) {
double p_f = static_cast<double>(p);
double q_f = static_cast<double>(q);
double r_f = static_cast<double>(r);
uint64_t floor = p_f + t_1 * q_f + t_2 * r_f;
std::cout << floor << ", ";
p -= floor;
//std::cout << floor << " " << p << " " << q << " " << r << std::endl;
mp::cpp_rational den = (p * p * p + 2 * q * q * q +
4 * r * r * r - 6 * p * q * r);
mp::cpp_rational a = (p * p - 2 * q * r) / den;
mp::cpp_rational b = (2 * r * r - p * q) / den;
mp::cpp_rational c = (q * q - p * r) / den;
p = a;
q = b;
r = c;
}
return 0;
}
The Lagrange algorithm
The algorithm is described for example in Knuth's book The Art of Computer Programming, vol 2 (Ex 13 in section 4.5.3 Analysis of Euclid's Algorithm, p. 375 in 3rd edition).
Let f be a polynomial of integer coefficients whose only real root is an irrational number x0 > 1. Then the Lagrange algorithm calculates the consecutive quotients of the continued fraction of x0.
I implemented it in python
def cf(a, N=10):
"""
a : list - coefficients of the polynomial,
i.e. f(x) = a[0] + a[1]*x + ... + a[n]*x^n
N : number of quotients to output
"""
# Degree of the polynomial
n = len(a) - 1
# List of consecutive quotients
ans = []
def shift_poly():
"""
Replaces plynomial f(x) with f(x+1) (shifts its graph to the left).
"""
for k in range(n):
for j in range(n - 1, k - 1, -1):
a[j] += a[j+1]
for _ in range(N):
quotient = 1
shift_poly()
# While the root is >1 shift it left
while sum(a) < 0:
quotient += 1
shift_poly()
# Otherwise, we have the next quotient
ans.append(quotient)
# Replace polynomial f(x) with -x^n * f(1/x)
a.reverse()
a = [-x for x in a]
return ans
It takes about 1s on my computer to run cf([-2, 0, 0, 1], 10000). (The coefficients correspond to the polynomial x^3 - 2 whose only real root is 2^(1/3).) The output agrees with the one from Wolfram Alpha.
Caveat
The coefficients of the polynomials evaluated inside the function quickly become quite large integers. So this approach needs some bigint implementation in other languages (Pure python3 deals with it, but for example numpy doesn't.)
You might have more luck computing 2^(1/3) to high accuracy and then trying to derive the continued fraction from that, using interval arithmetic to determine if the accuracy is sufficient.
Here's my stab at this in Python, using Halley iteration to compute 2^(1/3) in fixed point. The dead code is an attempt to compute fixed-point reciprocals more efficiently than Python via Newton iteration -- no dice.
Timing from my machine is about thirty seconds, spent mostly trying to extract the continued fraction from the fixed point representation.
prec = 40000
a = 1 << (3 * prec + 1)
two_a = a << 1
x = 5 << (prec - 2)
while True:
x_cubed = x * x * x
two_x_cubed = x_cubed << 1
x_prime = x * (x_cubed + two_a) // (two_x_cubed + a)
if -1 <= x_prime - x <= 1: break
x = x_prime
cf = []
four_to_the_prec = 1 << (2 * prec)
for i in range(10000):
q = x >> prec
r = x - (q << prec)
cf.append(q)
if True:
x = four_to_the_prec // r
else:
x = 1 << (2 * prec - r.bit_length())
while True:
delta_x = (x * ((four_to_the_prec - r * x) >> prec)) >> prec
if not delta_x: break
x += delta_x
print(cf)

How to wrap last/first element making building interpolation?

I've this code that iterate some samples and build a simple linear interpolation between the points:
foreach sample:
base = floor(index_pointer)
frac = index_pointer - base
out = in[base] * (1 - frac) + in[base + 1] * frac
index_pointer += speed
// restart
if(index_pointer >= sample_length)
{
index_pointer = 0
}
using "speed" equal to 1, the game is done. But if the index_pointer is different than 1 (i.e. got fractional part) I need to wrap last/first element keeping the translation consistent.
How would you do this? Double indexes?
Here's an example of values I have. Let say in array of 4 values: [8, 12, 16, 20].
It will be:
1.0*in[0] + 0.0*in[1]=8
0.28*in[0] + 0.72*in[1]=10.88
0.56*in[1] + 0.44*in[2]=13.76
0.84*in[2] + 0.14*in[3]=16.64
0.12*in[2] + 0.88*in[3]=19.52
0.4*in[3] + 0.6*in[4]=8 // wrong; here I need to wrapper
the last point is wrong. [4] will be 0 because I don't have [4], but the first part need to take care of 0.4 and the weight of first sample (I think?).
Just wrap around the indices:
out = in[base] * (1 - frac) + in[(base + 1) % N] * frac
, where % is the modulo operator and N is the number of input samples.
This procedure generates the following line for your sample data (the dashed lines are the interpolated sample points, the circles are the input values):
I think I understand the problem now (answer only applies if I really did...):
You sample values at a nominal speed sn. But actually your sampler samples at a real speed s, where s != sn. Now, you want to create a function which re-samples the series, sampled at speed s, so it yields a series as if it were sampled with speed sn by means of linear interpolation between 2 adjacent samples. Or, your sampler jitters (has variances in time when it actually samples, which is sn + Noise(sn)).
Here is my approach - a function named "re-sample". It takes the sample data and a list of desired re-sample-points.
For any re-sample point which would index outside the raw data, it returns the respective border value.
let resample (data : float array) times =
let N = Array.length data
let maxIndex = N-1
let weight (t : float) =
t - (floor t)
let interpolate x1 x2 w = x1 * (1.0 - w) + x2 * w
let interp t1 t2 w =
//printfn "t1 = %d t2 = %d w = %f" t1 t2 w
interpolate (data.[t1]) (data.[t2]) w
let inter t =
let t1 = int (floor t)
match t1 with
| x when x >= 0 && x < maxIndex ->
let t2 = t1 + 1
interp t1 t2 (weight t)
| x when x >= maxIndex -> data.[maxIndex]
| _ -> data.[0]
times
|> List.map (fun t -> t, inter t)
|> Array.ofList
let raw_data = [8; 12; 16; 20] |> List.map float |> Array.ofList
let resampled = resample raw_data [0.0..0.2..4.0]
And yields:
val resample : data:float array -> times:float list -> (float * float) []
val raw_data : float [] = [|8.0; 12.0; 16.0; 20.0|]
val resampled : (float * float) [] =
[|(0.0, 8.0); (0.2, 8.8); (0.4, 9.6); (0.6, 10.4); (0.8, 11.2); (1.0, 12.0);
(1.2, 12.8); (1.4, 13.6); (1.6, 14.4); (1.8, 15.2); (2.0, 16.0);
(2.2, 16.8); (2.4, 17.6); (2.6, 18.4); (2.8, 19.2); (3.0, 20.0);
(3.2, 20.0); (3.4, 20.0); (3.6, 20.0); (3.8, 20.0); (4.0, 20.0)|]
Now, I still fail to understand the "wrap around" part of your question. In the end, interpolation - in contrast to extrapolation is only defined for values in [0..N-1]. So it is up to you to decide if the function should produce a run time error or simply use the edge values (or 0) for time values out of bounds of your raw data array.
EDIT
As it turned out, it is about how to use a cyclic (ring) buffer for this as well.
Here, a version of the resample function, using a cyclic buffer. Along with some operations.
update adds a new sample value to the ring buffer
read reads the content a ring buffer element as if it were a normal array, indexed from [0..N-1].
initXXX functions which create the ring buffer in various forms.
length which returns the length or capacity of the ring buffer.
The ring buffer logics is factored into a module to keep it all clean.
module Cyclic =
let wrap n x = x % n // % is modulo operator, just like in C/C++
type Series = { A : float array; WritePosition : int }
let init (n : int) =
{ A = Array.init n (fun i -> 0.);
WritePosition = 0
}
let initFromArray a =
let n = Array.length a
{ A = Array.copy a;
WritePosition = 0
}
let initUseArray a =
let n = Array.length a
{ A = a;
WritePosition = 0
}
let update (sample : float ) (series : Series) =
let wrapper = wrap (Array.length series.A)
series.A.[series.WritePosition] <- sample
{ series with
WritePosition = wrapper (series.WritePosition + 1) }
let read i series =
let n = Array.length series.A
let wrapper = wrap (Array.length series.A)
series.A.[wrapper (series.WritePosition + i)]
let length (series : Series) = Array.length (series.A)
let resampleSeries (data : Cyclic.Series) times =
let N = Cyclic.length data
let maxIndex = N-1
let weight (t : float) =
t - (floor t)
let interpolate x1 x2 w = x1 * (1.0 - w) + x2 * w
let interp t1 t2 w =
interpolate (Cyclic.read t1 data) (Cyclic.read t2 data) w
let inter t =
let t1 = int (floor t)
match t1 with
| x when x >= 0 && x < maxIndex ->
let t2 = t1 + 1
interp t1 t2 (weight t)
| x when x >= maxIndex -> Cyclic.read maxIndex data
| _ -> Cyclic.read 0 data
times
|> List.map (fun t -> t, inter t)
|> Array.ofList
let input = raw_data
let rawSeries0 = Cyclic.initFromArray input
(resampleSeries rawSeries0 [0.0..0.2..4.0]) = resampled

3n+1 implementing with Haskell, compile error

everyone. I'm a newcomer to Haskell and just implemented the '3n + 1' problem with it. I checked a lot but the type error seemed strange, could you please help me find what the problem is?
import qualified Data.Vector as V
import qualified Data.Matrix as M
nMax = 1000000
table = V.fromList $ 0 : 1 : [cycleLength x | x <- [2 .. nMax]] where
cycleLength x = if x' <= nMax then table V.! x' + 1 else cycleLength x' + 1 where
x' = if even x then x `div` 2 else 3 * x + 1
sparseTable = M.fromLists $ [] : [[f i j | j <- [0 .. ceiling $ logBase 2 nMax]] | i <- [1 .. nMax]] where
f i 0 = table V.! i
f i j = maxValue i j
maxValue i j = max $ (leftValue i j) (rightValue i j) where
leftValue i j = sparseTable M.! (i, j - 1)
rightValue i j = sparseTable M.! (i + 2 ^ (j - 1), j - 1)
I used the Vector and Matrix (download with cabal) modules to implement the functions. I think the first function (table) has been proved that no mistakes in it, probably mistakes are in the last two function, which I used to implement the sparse table algorithm.
Since I just signed up and don't have enough reputation now, I just paste the error message here:
[1 of 1] Compiling Main ( 001.hs, interpreted )
001.hs:14:39:
Occurs check: cannot construct the infinite type: s0 ~ s0 -> s0
Relevant bindings include
leftValue :: Int -> Int -> s0 -> s0 (bound at 001.hs:15:9)
rightValue :: Int -> Int -> s0 -> s0 (bound at 001.hs:16:9)
maxValue :: Int -> Int -> s0 -> s0 (bound at 001.hs:14:1)
In the third argument of ‘leftValue’, namely ‘(rightValue i j)’
In the second argument of ‘($)’, namely
‘(leftValue i j) (rightValue i j)’
Failed, modules loaded: none.
The problem is the $ in max $ (leftValue i j) (rightValue i j).
The ($) operator binds less tightly than any other operator, including the 'normal function application you get when you just use a space.
So with the $, it parses as
max ((leftvalue i j) (rightValue i j))
if you remove it that should parse as you intended, which was presumably
max (leftValue i j) (rightValue i j)
You can get a hint of this from the error message, where it talks about the "third argument of leftValue".
There's some more information about ($) in When should I use $ (and can it always be replaced with parentheses)?

Python performance: iteration and operations on nested lists

Problem Hey folks. I'm looking for some advice on python performance. Some background on my problem:
Given:
A (x,y) mesh of nodes each with a value (0...255) starting at 0
A list of N input coordinates each at a specified location within the range (0...x, 0...y)
A value Z that defines the "neighborhood" in count of nodes
Increment the value of the node at the input coordinate and the node's neighbors. Neighbors beyond the mesh edge are ignored. (No wrapping)
BASE CASE: A mesh of size 1024x1024 nodes, with 400 input coordinates and a range Z of 75 nodes.
Processing should be O(x*y*Z*N). I expect x, y and Z to remain roughly around the values in the base case, but the number of input coordinates N could increase up to 100,000. My goal is to minimize processing time.
Current results Between my start and the comments below, we've got several implementations.
Running speed on my 2.26 GHz Intel Core 2 Duo with Python 2.6.1:
f1: 2.819s
f2: 1.567s
f3: 1.593s
f: 1.579s
f3b: 1.526s
f4: 0.978s
f1 is the initial naive implementation: three nested for loops.
f2 is replaces the inner for loop with a list comprehension.
f3 is based on Andrei's suggestion in the comments and replaces the outer for with map()
f is Chris's suggestion in the answers below
f3b is kriss's take on f3
f4 is Alex's contribution.
Code is included below for your perusal.
Question How can I further reduce the processing time? I'd prefer sub-1.0s for the test parameters.
Please, keep the recommendations to native Python. I know I can move to a third-party package such as numpy, but I'm trying to avoid any third party packages. Also, I've generated random input coordinates, and simplified the definition of the node value updates to keep our discussion simple. The specifics have to change slightly and are outside the scope of my question.
thanks much!
**`f1` is the initial naive implementation: three nested `for` loops.**
def f1(x,y,n,z):
rows = [[0]*x for i in xrange(y)]
for i in range(n):
inputX, inputY = (int(x*random.random()), int(y*random.random()))
topleft = (inputX - z, inputY - z)
for i in xrange(max(0, topleft[0]), min(topleft[0]+(z*2), x)):
for j in xrange(max(0, topleft[1]), min(topleft[1]+(z*2), y)):
if rows[i][j] <= 255: rows[i][j] += 1
f2 is replaces the inner for loop with a list comprehension.
def f2(x,y,n,z):
rows = [[0]*x for i in xrange(y)]
for i in range(n):
inputX, inputY = (int(x*random.random()), int(y*random.random()))
topleft = (inputX - z, inputY - z)
for i in xrange(max(0, topleft[0]), min(topleft[0]+(z*2), x)):
l = max(0, topleft[1])
r = min(topleft[1]+(z*2), y)
rows[i][l:r] = [j+(j<255) for j in rows[i][l:r]]
UPDATE: f3 is based on Andrei's suggestion in the comments and replaces the outer for with map(). My first hack at this requires several out-of-local-scope lookups, specifically recommended against by Guido: local variable lookups are much faster than global or built-in variable lookups I hardcoded all but the reference to the main data structure itself to minimize that overhead.
rows = [[0]*x for i in xrange(y)]
def f3(x,y,n,z):
inputs = [(int(x*random.random()), int(y*random.random())) for i in range(n)]
rows = map(g, inputs)
def g(input):
inputX, inputY = input
topleft = (inputX - 75, inputY - 75)
for i in xrange(max(0, topleft[0]), min(topleft[0]+(75*2), 1024)):
l = max(0, topleft[1])
r = min(topleft[1]+(75*2), 1024)
rows[i][l:r] = [j+(j<255) for j in rows[i][l:r]]
UPDATE3: ChristopeD also pointed out a couple improvements.
def f(x,y,n,z):
rows = [[0] * y for i in xrange(x)]
rn = random.random
for i in xrange(n):
topleft = (int(x*rn()) - z, int(y*rn()) - z)
l = max(0, topleft[1])
r = min(topleft[1]+(z*2), y)
for u in xrange(max(0, topleft[0]), min(topleft[0]+(z*2), x)):
rows[u][l:r] = [j+(j<255) for j in rows[u][l:r]]
UPDATE4: kriss added a few improvements to f3, replacing min/max with the new ternary operator syntax.
def f3b(x,y,n,z):
rn = random.random
rows = [g1(x, y, z) for x, y in [(int(x*rn()), int(y*rn())) for i in xrange(n)]]
def g1(x, y, z):
l = y - z if y - z > 0 else 0
r = y + z if y + z < 1024 else 1024
for i in xrange(x - z if x - z > 0 else 0, x + z if x + z < 1024 else 1024 ):
rows[i][l:r] = [j+(j<255) for j in rows[i][l:r]]
UPDATE5: Alex weighed in with his substantive revision, adding a separate map() operation to cap the values at 255 and removing all non-local-scope lookups. The perf differences are non-trivial.
def f4(x,y,n,z):
rows = [[0]*y for i in range(x)]
rr = random.randrange
inc = (1).__add__
sat = (0xff).__and__
for i in range(n):
inputX, inputY = rr(x), rr(y)
b = max(0, inputX - z)
t = min(inputX + z, x)
l = max(0, inputY - z)
r = min(inputY + z, y)
for i in range(b, t):
rows[i][l:r] = map(inc, rows[i][l:r])
for i in range(x):
rows[i] = map(sat, rows[i])
Also, since we all seem to be hacking around with variations, here's my test harness to compare speeds: (improved by ChristopheD)
def timing(f,x,y,z,n):
fn = "%s(%d,%d,%d,%d)" % (f.__name__, x, y, z, n)
ctx = "from __main__ import %s" % f.__name__
results = timeit.Timer(fn, ctx).timeit(10)
return "%4.4s: %.3f" % (f.__name__, results / 10.0)
if __name__ == "__main__":
print timing(f, 1024, 1024, 400, 75)
#add more here.
On my (slow-ish;-) first-day Macbook Air, 1.6GHz Core 2 Duo, system Python 2.5 on MacOSX 10.5, after saving your code in op.py I see the following timings:
$ python -mtimeit -s'import op' 'op.f1()'
10 loops, best of 3: 5.58 sec per loop
$ python -mtimeit -s'import op' 'op.f2()'
10 loops, best of 3: 3.15 sec per loop
So, my machine is slower than yours by a factor of a bit more than 1.9.
The fastest code I have for this task is:
def f3(x=x,y=y,n=n,z=z):
rows = [[0]*y for i in range(x)]
rr = random.randrange
inc = (1).__add__
sat = (0xff).__and__
for i in range(n):
inputX, inputY = rr(x), rr(y)
b = max(0, inputX - z)
t = min(inputX + z, x)
l = max(0, inputY - z)
r = min(inputY + z, y)
for i in range(b, t):
rows[i][l:r] = map(inc, rows[i][l:r])
for i in range(x):
rows[i] = map(sat, rows[i])
which times as:
$ python -mtimeit -s'import op' 'op.f3()'
10 loops, best of 3: 3 sec per loop
so, a very modest speedup, projecting to more than 1.5 seconds on your machine - well above the 1.0 you're aiming for:-(.
With a simple C-coded extensions, exte.c...:
#include "Python.h"
static PyObject*
dopoint(PyObject* self, PyObject* args)
{
int x, y, z, px, py;
int b, t, l, r;
int i, j;
PyObject* rows;
if(!PyArg_ParseTuple(args, "iiiiiO",
&x, &y, &z, &px, &py, &rows
))
return 0;
b = px - z;
if (b < 0) b = 0;
t = px + z;
if (t > x) t = x;
l = py - z;
if (l < 0) l = 0;
r = py + z;
if (r > y) r = y;
for(i = b; i < t; ++i) {
PyObject* row = PyList_GetItem(rows, i);
for(j = l; j < r; ++j) {
PyObject* pyitem = PyList_GetItem(row, j);
long item = PyInt_AsLong(pyitem);
if (item < 255) {
PyObject* newitem = PyInt_FromLong(item + 1);
PyList_SetItem(row, j, newitem);
}
}
}
Py_RETURN_NONE;
}
static PyMethodDef exteMethods[] = {
{"dopoint", dopoint, METH_VARARGS, "process a point"},
{0}
};
void
initexte()
{
Py_InitModule("exte", exteMethods);
}
(note: I haven't checked it carefully -- I think it doesn't leak memory due to the correct interplay of reference stealing and borrowing, but it should be code inspected very carefully before being put in production;-), we could do
import exte
def f4(x=x,y=y,n=n,z=z):
rows = [[0]*y for i in range(x)]
rr = random.randrange
for i in range(n):
inputX, inputY = rr(x), rr(y)
exte.dopoint(x, y, z, inputX, inputY, rows)
and the timing
$ python -mtimeit -s'import op' 'op.f4()'
10 loops, best of 3: 345 msec per loop
shows an acceleration of 8-9 times, which should put you in the ballpark you desire. I've seen a comment saying you don't want any third-party extension, but, well, this tiny extension you could make entirely your own;-). ((Not sure what licensing conditions apply to code on Stack Overflow, but I'll be glad to re-release this under the Apache 2 license or the like, if you need that;-)).
1. A (smaller) speedup could definitely be the initialization of your rows...
Replace
rows = []
for i in range(x):
rows.append([0 for i in xrange(y)])
with
rows = [[0] * y for i in xrange(x)]
2. You can also avoid some lookups by moving random.random out of the loops (saves a little).
3. EDIT: after corrections -- you could arrive at something like this:
def f(x,y,n,z):
rows = [[0] * y for i in xrange(x)]
rn = random.random
for i in xrange(n):
topleft = (int(x*rn()) - z, int(y*rn()) - z)
l = max(0, topleft[1])
r = min(topleft[1]+(z*2), y)
for u in xrange(max(0, topleft[0]), min(topleft[0]+(z*2), x)):
rows[u][l:r] = [j+(j<255) for j in rows[u][l:r]]
EDIT: some new timings with timeit (10 runs) -- seems this provides only minor speedups:
import timeit
print timeit.Timer("f1(1024,1024,400,75)", "from __main__ import f1").timeit(10)
print timeit.Timer("f2(1024,1024,400,75)", "from __main__ import f2").timeit(10)
print timeit.Timer("f(1024,1024,400,75)", "from __main__ import f3").timeit(10)
f1 21.1669280529
f2 12.9376120567
f 11.1249599457
in your f3 rewrite, g can be simplified. (Can also be applied to f4)
You have the following code inside a for loop.
l = max(0, topleft[1])
r = min(topleft[1]+(75*2), 1024)
However, it appears that those values never change inside the for loop. So calculate them once, outside the loop instead.
Based on your f3 version I played with the code. As l and r are constants you can avoid to compute them in g1 loop. Also using new ternary if instead of min and max seems to be consistently faster. Also simplified expression with topleft. On my system it appears to be about 20% faster using with the code below.
def f3b(x,y,n,z):
rows = [g1(x, y, z) for x, y in [(int(x*random.random()), int(y*random.random())) for i in range(n)]]
def g1(x, y, z):
l = y - z if y - z > 0 else 0
r = y + z if y + z < 1024 else 1024
for i in xrange(x - z if x - z > 0 else 0, x + z if x + z < 1024 else 1024 ):
rows[i][l:r] = [j+(j<255) for j in rows[i][l:r]]
You can create your own Python module in C, and control the performance as you want:
http://docs.python.org/extending/

Resources