Different expressions for different outputs in Halide

I'm new to Halide so also kinda didn't know how to ask the question. Let me explain. Let's assume I have a simple code for Halide's generator like this:
class Blur : public Generator<Blur>{
public:
Input<Buffer<float>> in_func{"in_func", 2};
Output<Buffer<float>> forward{"forward", 2};
Var x, y, n;
void generate(){
Expr m1 = in_func(x+1, y+2)+in_func(x+2, y+1);
Expr m2 = in_func(x+1, y+2)-in_func(x+2, y+1);
Expr m3 = in_func(x+2, y+1)+in_func(x+1, y+1);
Expr m4 = in_func(x+2, y+1)-in_func(x+1, y+1);
Expr w0010_2 = -in_func(x+2, y+2)+in_func(x, y+2);
Expr w0111_2 = -in_func(x+3, y+2)+in_func(x+1, y+2);
forward(0,0) = w0010_2+m4+m3+m2+m1;
forward(1,0) = -w0111_2+m4+m3-m2-m1;
forward(0,1) = w0010_2-m4+m3-m2+m1;
forward(1,1) = w0111_2-m4+m3+m2-m1;
}
};
What I want to achieve is to define that output at index (0,0) should be the result of m1 + m2 but output at index (1,0) should be the result of different expression, for example, m1 - m2. I would be really grateful for help.

What I want to achieve is to define that output at index (0,0) should be the result of m1 + m2 but output at index (1,0) should be the result of different expression, for example, m1 - m2. [...] I want result[0][0] = expression1, result[0][1] = expression2, result[1][0] = expression3 and result[1][1] = expression4. But also result[0][2], result[0][4] and so on = expression1
Compute the values x%2 and y%2 and use their values in a select:
forward(x, y) = select(
x % 2 == 0 && y % 2 == 0, m1 + m2,
x % 2 == 1 && y % 2 == 0, m1 - m2,
x % 2 == 0 && y % 2 == 1, expr3,
/* otherwise, */ expr4
);
Select is a pure if-then-else. It evaluates all of its arguments and then picks the one corresponding to the first true predicate. If the expressions all use nearby points of in_func, this might not be too slow.
If you find that performance suffers, I'd try to create four funcs, one for each of the four expressions, and then select loads from those. If that's still too slow, you might be able to optimize the indexing to not compute any extra points. If you show all four expressions, I might be able to help you do that.
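To make the parity pattern concrete, here is a small plain-Python model of the select above. It is only a sketch of the semantics: `select`, `forward` and the placeholder values are hypothetical stand-ins written for illustration, not Halide API calls, and a real generator would keep the C++ form shown in the answer.

```python
def select(*args):
    """Toy model of a multi-way select: (cond1, val1, cond2, val2, ..., default).

    All values are already evaluated when this is called; the value paired
    with the first true condition is returned.
    """
    *pairs, default = args
    for cond, value in zip(pairs[0::2], pairs[1::2]):
        if cond:
            return value
    return default


def forward(x, y, m1, m2, expr3, expr4):
    """Pick one of four expressions from the parity of (x, y)."""
    return select(
        x % 2 == 0 and y % 2 == 0, m1 + m2,
        x % 2 == 1 and y % 2 == 0, m1 - m2,
        x % 2 == 0 and y % 2 == 1, expr3,
        expr4,  # otherwise: x and y are both odd
    )


if __name__ == "__main__":
    # Placeholder values, chosen only so that each case is recognisable.
    m1, m2, expr3, expr4 = 10, 1, 300, 400
    for y in range(4):
        print([forward(x, y, m1, m2, expr3, expr4) for x in range(4)])
```

Printing a 4x4 grid this way shows the 2x2 tile of expressions repeating across the output, which is the behaviour the question asks for.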

Related

Solving linear equations

I have to find out the integral solution of a equation ax+by=c such that x>=0 and y>=0 and value of (x+y) is minimum.
I know that if c % gcd(a,b) == 0 then it's always possible. How do I find the values of x and y?
My approach
for(i 0 to 2*c):
x=i
y= (c-a*i)/b
if(y is integer)
ans = min(ans,x+y)
Is there a better way to do this, with better time complexity?
Using the Extended Euclidean Algorithm and the theory of linear Diophantine equations there is no need to search. Here is a Python 3 implementation:
def egcd(a,b):
    s,t = 1,0 #coefficients to express current a in terms of original a,b
    x,y = 0,1 #coefficients to express current b in terms of original a,b
    q,r = divmod(a,b)
    while(r > 0):
        a,b = b,r
        old_x, old_y = x,y
        x,y = s - q*x, t - q*y
        s,t = old_x, old_y
        q,r = divmod(a,b)
    return b, x ,y
def smallestSolution(a,b,c):
    d,x,y = egcd(a,b)
    if c%d != 0:
        return "No integer solutions"
    else:
        u = a//d #integer division
        v = b//d
        w = c//d
        x = w*x
        y = w*y
        k1 = -x//v if -x % v == 0 else 1 + -x//v #k1 = ceiling(-x/v)
        x1 = x + k1*v # x + k1*v is solution with smallest x >= 0
        y1 = y - k1*u
        if y1 < 0:
            return "No nonnegative integer solutions"
        else:
            k2 = y//u #floor division
            x2 = x + k2*v #y-k2*u is solution with smallest y >= 0
            y2 = y - k2*u
            if x2 < 0 or x1+y1 < x2+y2:
                return (x1,y1)
            else:
                return (x2,y2)
Typical run:
>>> smallestSolution(1001,2743,160485)
(111, 18)
The way it works: first use the extended Euclidean algorithm to find d = gcd(a,b) and one solution, (x,y). All other solutions are of the form (x+k*v,y-k*u), where u = a/d, v = b/d and k is an arbitrary integer parameter. Since x+y is linear in k, it has no critical points, hence it is minimized in the first quadrant either when x is as small as possible or when y is as small as possible. By appropriate use of floor and ceiling you can locate the integer points with x as small as possible and with y as small as possible. Just take the one with the smaller sum.
On Edit: My original code used the Python function math.ceil applied to -x/v. This is problematic for very large integers. I tweaked it so that the ceiling is computed with just int operations. It can now handle arbitrarily large numbers:
>>> a = 236317407839490590865554550063
>>> b = 127372335361192567404918884983
>>> c = 475864993503739844164597027155993229496457605245403456517677648564321
>>> smallestSolution(a,b,c)
(2013668810262278187384582192404963131387, 120334243940259443613787580180)
>>> x,y = _
>>> a*x+b*y
475864993503739844164597027155993229496457605245403456517677648564321
Most of the computation takes place in running the extended Euclidean algorithm, which is known to take O(log(min(a,b))) division steps.
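As a cross-check of the endpoint argument above, here is a compact sketch that finds the two endpoints with modular inverses instead of a hand-written extended Euclid. This is a swapped-in technique, not the answer's code: the function name `min_sum_solution` is made up for illustration, it assumes a > 0, b > 0 and c >= 0, and it relies on `pow(p, -1, q)`, which needs Python 3.8 or newer.

```python
from math import gcd


def min_sum_solution(a, b, c):
    """Return (x, y) with a*x + b*y == c, x, y >= 0 and x + y minimal.

    Returns None when no such pair exists. Assumes a > 0, b > 0, c >= 0.
    """
    d = gcd(a, b)
    if c % d != 0:
        return None
    u, v, w = a // d, b // d, c // d

    def smallest(p, q):
        # Smallest s >= 0 with p*s congruent to w modulo q (p and q coprime).
        return 0 if q == 1 else (w * pow(p, -1, q)) % q

    candidates = []
    x1 = smallest(u, v)      # endpoint with x as small as possible
    y1 = (w - u * x1) // v   # exact division by construction
    if y1 >= 0:
        candidates.append((x1, y1))
    y2 = smallest(v, u)      # endpoint with y as small as possible
    x2 = (w - v * y2) // u
    if x2 >= 0:
        candidates.append((x2, y2))
    return min(candidates, key=sum) if candidates else None


if __name__ == "__main__":
    print(min_sum_solution(1001, 2743, 160485))
```

On the inputs from the typical run above this agrees with smallestSolution, and for small a, b, c it can be compared against an exhaustive search.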
First, let's assume a,b,c>0, so:
a.x+b.y = c
x+y = min(xi+yi)
x,y >= 0
a,b,c > 0
------------------------
x = ( c - b.y )/a
y = ( c - a.x )/b
c - a.x >= 0
c - b.y >= 0
c >= b.y
c >= a.x
x <= c/a
y <= c/b
So naive O(n) solution is in C++ like this:
void compute0(int &x,int &y,int a,int b,int c) // naive
{
int xx,yy;
xx=-1; yy=-1;
for (y=0;;y++)
{
x = c - b*y;
if (x<0) break; // y out of range stop
if (x%a) continue; // non integer solution
x/=a; // remember minimal solution
if ((xx<0)||(x+y<=xx+yy)) { xx=x; yy=y; }
}
x=xx; y=yy;
}
If no solution is found it returns -1,-1. If you think about the equation a bit, you should realize that the minimal solution occurs when x or y is minimal (which one depends on whether a<b), so with this heuristic we only need to increase the minimal coordinate until the first solution is found. This speeds the whole thing up considerably:
void compute1(int &x,int &y,int a,int b,int c)
{
if (a<=b){ for (x=0,y=c;y>=0;x++,y-=a) if (y%b==0) { y/=b; return; } }
else { for (y=0,x=c;x>=0;y++,x-=b) if (x%a==0) { x/=a; return; } }
x=-1; y=-1;
}
I measured this on my setup:
a=50 b=105 c=500000000
[   time   ]       x        y     ax+by      x+y  method
[ 55.910 ms]      10  4761900 500000000  4761910  naive
[  0.000 ms]      10  4761900 500000000  4761910  opt
a=105 b=50 c=500000000
[   time   ]       x        y     ax+by      x+y  method
[ 99.214 ms] 4761900       10 500000000  4761910  naive
[  0.000 ms] 4761900       10 500000000  4761910  opt
The ~2.0x difference between the naive method times is due to a/b=~2.0 and selecting the worse coordinate to iterate over in the second run.
Now just handle special cases when a,b,c are zero (to avoid division by zero)...

K-means for color quantization - Code not vectorized

I'm doing this exercise by Andrew Ng about using k-means to reduce the number of colors in an image. It works correctly, but I'm afraid it's a little slow because of all the for loops in the code, so I'd like to vectorize them. But there are some loops that I just can't seem to vectorize effectively. Please help me, thank you very much!
Also if possible please give some feedback on my coding style :)
Here is the link of the exercise, and here is the dataset.
The correct result is given in the link of the exercise.
And here is my code:
function [] = KMeans()
Image = double(imread('bird_small.tiff'));
[rows,cols, RGB] = size(Image);
Points = reshape(Image,rows * cols, RGB);
K = 16;
Centroids = zeros(K,RGB);
s = RandStream('mt19937ar','Seed',0);
% Initialization :
% Pick out K random colours and make sure they are all different
% from each other! This prevents the situation where two of the means
% are assigned to the exact same colour, therefore we don't have to
% worry about division by zero in the E-step
% However, if K = 16 for example, and there are only 15 colours in the
% image, then this while loop will never exit!!! This needs to be
% addressed in the future :(
% TODO : Vectorize this part!
done = false;
while done == false
RowIndex = randperm(s,rows);
ColIndex = randperm(s,cols);
RowIndex = RowIndex(1:K);
ColIndex = ColIndex(1:K);
for i = 1 : K
for j = 1 : RGB
Centroids(i,j) = Image(RowIndex(i),ColIndex(i),j);
end
end
Centroids = sort(Centroids,2);
Centroids = unique(Centroids,'rows');
if size(Centroids,1) == K
done = true;
end
end;
% imshow(imread('bird_small.tiff'))
%
% for i = 1 : K
% hold on;
% plot(RowIndex(i),ColIndex(i),'r+','MarkerSize',50)
% end
eps = 0.01; % Epsilon
IterNum = 0;
while 1
% E-step: Estimate membership given parameters
% Membership: The centroid that each colour is assigned to
% Parameters: Location of centroids
Dist = pdist2(Points,Centroids,'euclidean');
[~, WhichCentroid] = min(Dist,[],2);
% M-step: Estimate parameters given membership
% Membership: The centroid that each colour is assigned to
% Parameters: Location of centroids
% TODO: Vectorize this part!
OldCentroids = Centroids;
for i = 1 : K
PointsInCentroid = Points((find(WhichCentroid == i))',:);
NumOfPoints = size(PointsInCentroid,1);
% Note that NumOfPoints is never equal to 0, as a result of
% the initialization. Or .... ???????
if NumOfPoints ~= 0
Centroids(i,:) = sum(PointsInCentroid , 1) / NumOfPoints ;
end
end
% Check for convergence: Here we use the L2 distance
IterNum = IterNum + 1;
Margins = sqrt(sum((Centroids - OldCentroids).^2, 2));
if sum(Margins > eps) == 0
break;
end
end
IterNum;
Centroids ;
% Load the larger image
[LargerImage,ColorMap] = imread('bird_large.tiff');
LargerImage = double(LargerImage);
[largeRows,largeCols,NewRGB] = size(LargerImage); % RGB is always 3
% TODO: Vectorize this part!
largeRows
largeCols
NewRGB
% Replace each of the pixel with the nearest centroid
NewPoints = reshape(LargerImage,largeRows * largeCols, NewRGB);
Dist = pdist2(NewPoints,Centroids,'euclidean');
[~,WhichCentroid] = min(Dist,[],2);
NewPoints = Centroids(WhichCentroid,:);
LargerImage = reshape(NewPoints,largeRows,largeCols,NewRGB);
% for i = 1 : largeRows
% for j = 1 : largeCols
% Dist = pdist2(Centroids,reshape(LargerImage(i,j,:),1,RGB),'euclidean');
% [~,WhichCentroid] = min(Dist);
% LargerImage(i,j,:) = Centroids(WhichCentroid,:);
% end
% end
% Display new image
imshow(uint8(round(LargerImage)),ColorMap)
UPDATE: Replaced
for i = 1 : K
for j = 1 : RGB
Centroids(i,j) = Image(RowIndex(i),ColIndex(i),j);
end
end
with
for i = 1 : K
Centroids(i,:) = Image(RowIndex(i),ColIndex(i),:);
end
I think this may be vectorized further by using linear indexing, but for now I should just focus on the while loop since it takes most of the time.
Also when I tried @Dev-iL's suggestion and replaced
for i = 1 : K
PointsInCentroid = Points((find(WhichCentroid == i))',:);
NumOfPoints = size(PointsInCentroid,1);
% Note that NumOfPoints is never equal to 0, as a result of
% the initialization. Or .... ???????
if NumOfPoints ~= 0
Centroids(i,:) = sum(PointsInCentroid , 1) / NumOfPoints ;
end
end
with
E = sparse(1:size(WhichCentroid), WhichCentroid' , 1, Num, K, Num);
Centroids = (E * spdiags(1./sum(E,1)',0,K,K))' * Points ;
the results were always worse: With K = 16, the first takes 2,414s , the second takes 2,455s ; K = 32, the first takes 4,529s , the second takes 5,022s. Seems like vectorization does not help, but maybe there's something wrong with my code :( .
However, when I replaced
Dist = pdist2(Points,Centroids,'euclidean');
[~, WhichCentroid] = min(Dist,[],2);
(in the while loop) with
Dist = bsxfun(@minus,dot(Centroids',Centroids',1)' / 2 , Centroids * Points' );
[~, WhichCentroid] = min(Dist,[],1);
WhichCentroid = WhichCentroid';
the code ran much faster, especially when K is large (K=32)
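A short derivation of why the replacement picks the same centroid as pdist2 followed by min. The symbols p (one row of Points) and c_k (row k of Centroids) are introduced here only for illustration:

```latex
\[
\lVert p - c_k \rVert^2
  = \lVert p \rVert^2 - 2\, c_k \cdot p + \lVert c_k \rVert^2
\]
% \lVert p \rVert^2 does not depend on k, so it can be dropped from the
% minimisation; dividing the rest by 2 does not change the minimiser either.
\[
\arg\min_k \lVert p - c_k \rVert
  = \arg\min_k \left( \tfrac{1}{2} \lVert c_k \rVert^2 - c_k \cdot p \right)
\]
```

The right-hand side is what the bsxfun expression evaluates for every centroid and point pair: half the squared norm of each centroid minus the matrix product of Centroids with the transposed Points. It avoids the square root and the per-point norm that the full Euclidean distance would need.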
Thank you everyone!

Use two random functions to get a specific random function

There are two random functions f1(),f2().
f1() returns 1 with probability p1, and 0 with probability 1-p1.
f2() returns 1 with probability p2, and 0 with probability 1-p2.
I want to implement a new function f3() which returns 1 with probability p3 (a given probability), and returns 0 with probability 1-p3. In the implementation of f3() we can use f1() and f2(), but no other random function.
If p3=0.5, an example implementation:
int f3()
{
    do
    {
        int a = f1();
        int b = f1();
        if (a==b) continue;
        // when we reach here
        // a==1 with probability p1(1-p1)
        // b==1 with probability (1-p1)p1
        if (a==1) return 1;//now returns 1 with probability 0.5
        if (b==1) return 0;
    }while(1);
}
This implementation of f3() gives a random function that returns 1 with probability 0.5, and 0 with probability 0.5. But how do I implement f3() with p3=0.4? I have no idea.
I wonder, is that task possible? And how to implement f3()?
Thanks in advance.
p1 = 0.77 -- arbitrary value between 0 and 1
function f1()
if math.random() < p1 then
return 1
else
return 0
end
end
-- f1() is enough. We don't need f2()
p3 = 0.4 -- arbitrary value between 0 and 1
--------------------------
function f3()
left = 0
right = 1
repeat
middle = left + (right - left) * p1
if f1() == 1 then
right = middle
else
left = middle
end
if right < p3 then -- completely below
return 1
elseif left >= p3 then -- completely above
return 0
end
until false -- loop forever
end
This can be solved if p3 is a rational number.
We should use conditional probabilities for this.
For example, if you want to make this for p3=0.4, the method is the following:
Calculate the fractional form of p3. In our case it is p3=0.4=2/5.
Now generate as many random variables from the same distribution (let's say, from f1, we won't use f2 anyway) as the denominator, call them X1, X2, X3, X4, X5.
We should regenerate all these random X variables until their sum equals the numerator in the fractional form of p3.
Once this is achieved then we just return X1 (or any other Xn, where n was chosen independently of the values of the X variables). Since there are 2 1s among the 5 X variables (because their sum equals the numerator), the probability of X1 being 1 is exactly p3.
For irrational p3, the problem cannot be solved by using only f1. I'm not sure now, but I think, it can be solved for p3 of the form p1*q+p2*(1-q), where q is rational with a similar method, generating the appropriate amount of Xs with distribution f1 and Ys with distribution f2, until they have a specific predefined sum, and returning one of them. This still needs to be detailed.
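The rational-p3 recipe described above can be sketched as follows. This is an illustrative version under stated assumptions, not the answerer's code: the name `f3_rational` and the way f1 is passed in as an argument are choices made here, and the seeded f1 with p1 = 0.77 in the demo is only an example source.

```python
import random


def f3_rational(f1, numerator, denominator):
    """Return 1 with probability numerator/denominator using only draws from f1.

    Draw `denominator` values from f1 and retry until exactly `numerator`
    of them are 1, then return the first one. Assumes
    0 <= numerator <= denominator and 0 < p1 < 1.
    """
    while True:
        draws = [f1() for _ in range(denominator)]
        if sum(draws) == numerator:
            return draws[0]


if __name__ == "__main__":
    rng = random.Random(0)   # seeded only to make the demo repeatable
    p1 = 0.77                # example value, as in the Lua answer above

    def f1():
        return 1 if rng.random() < p1 else 0

    trials = 5000
    hits = sum(f3_rational(f1, 2, 5) for _ in range(trials))
    print(hits / trials)     # should land near 2/5
```

Each accepted batch has its ones in uniformly random positions, which is why the first element is 1 with probability numerator/denominator. The cost is the rejection rate: most batches are thrown away when p1 is far from p3.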
First of all, that's a nice problem to exercise one's brain. I managed to solve the problem for p3 = 0.4, which is what you asked for! I think generalising it is not so trivial. :D
Here is how you can solve it for p3 = 0.4:
The intuition comes from your example. If we generate a number from f1() five times in an iteration (see the code below), we can get 32 possible results, as listed below:
1: 00000
2: 00001
3: 00010
4: 00011
.....
.....
32: 11111
Among these, there are 10 results with exactly two 1's, and they are all equally likely, since each has probability p1^2(1-p1)^3. After identifying this, the problem becomes simple. Return 1 for 4 of those combinations and return 0 for the other 6 (a probability of 0.4 means getting 1 four times out of 10). You can do that as shown below:
int f3()
{
    do{
        int a[5];
        int numberOfOneInA = 0;
        for(int i = 0; i < 5; i++){
            a[i] = f1();
            if(a[i] == 1){
                numberOfOneInA++;
            }
        }
        if (numberOfOneInA != 2) continue;
        else return a[0]; //out of 10 times, 4 times a[0] is 1!
    }while(1);
}
Waiting to see a generalised solution.
Cheers!
Here is an idea that will work when p3 is of a form a/2^n (a rational number with a denominator that is a power of 2).
Generate n random numbers with probability distribution of 0.5:
x1, x2, ..., xn
Interpret this as a binary number in the range 0...2^n-1; each number in this range has equal probability. If this number is less than a, return 1, else return 0.
Now, since this question is in the context of computer science, it seems reasonable to assume that p3 has the form a/2^n (this is a common representation of numbers in computers).
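A minimal sketch of the a/2^n idea, assuming the fair bits come from f1 through the pairing trick shown in the question. The names `fair_bit` and `f3_dyadic`, and passing f1 in as an argument, are choices made for this illustration.

```python
import random


def fair_bit(f1):
    """Draw pairs from f1 until the two values differ; return the first.

    Both orders (1,0) and (0,1) have probability p1*(1-p1), so the result
    is 0 or 1 with probability 0.5 each, whatever p1 is (0 < p1 < 1).
    """
    while True:
        a, b = f1(), f1()
        if a != b:
            return a


def f3_dyadic(f1, a, n):
    """Return 1 with probability a / 2**n, using only f1 for randomness."""
    value = 0
    for _ in range(n):
        value = (value << 1) | fair_bit(f1)   # build a uniform n-bit number
    return 1 if value < a else 0


if __name__ == "__main__":
    rng = random.Random(0)   # seeded only to make the demo repeatable
    p1 = 0.77                # example value

    def f1():
        return 1 if rng.random() < p1 else 0

    trials = 20000
    hits = sum(f3_dyadic(f1, 3, 3) for _ in range(trials))
    print(hits / trials)     # should land near 3/8
```

The n-bit number is uniform on 0...2^n-1, and exactly a of those values are below a, which gives the probability a/2^n.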
I implement the idea of anatolyg and Egor:
inline double random(void)
{
return static_cast<double>(rand()) / static_cast<double>(RAND_MAX);
}
const double p1 = 0.8;
int rand_P1(void)
{
return random() < p1;
}
int rand_P2(void)//return 0 with 0.5
{
int x, y; while (1)
{
mystep++;
x = rand_P1(); y = rand_P1();
if (x ^ y) return x;
}
}
double p3 = random();
int rand_P3(void)//anatolyg's idea
{
double tp = p3; int bit, x;
while (1)
{
if (tp * 2 >= 1) {bit = 1; tp = tp * 2 - 1;}
else {bit = 0; tp = tp * 2;}
x = rand_P2();
if (bit ^ x) return bit;
}
}
int rand2_P3(void)//Egor's idea
{
double left = 0, right = 1, mid;
while (1)
{
dashenstep++;
mid = left + (right - left) * p1;
int x = rand_P1();
if (x) right = mid; else left = mid;
if (right < p3) return 1;
if (left > p3) return 0;
}
}
After a fair amount of math, and assuming p3 is uniformly distributed in [0,1), I get that the expected number of steps for Egor's method is (1-p1^2-(1-p1)^2)^(-1), and for anatolyg's method it is 2(1-p1^2-(1-p1)^2)^(-1).
Algorithmically speaking, yes, this task is possible.
It can also be done programmatically, but it is a complex problem.
Let's take an example.
Let
F1(1) = .5, which means F1(0) = .5
F2(1) = .8, which means F2(0) = .2
Suppose you need an F3 such that F3(1) = .128
Let's try decomposing it.
.128
= (2^7)*(10^-3) // decompose this into known values
= (8/10)*(8/10)*(2/10)
= F2(1)&F2(1)*(20/100) // as no Fi(1)==2/10
= F2(1)&F2(1)*(5/10)*(4/10)
= F2(1)&F2(1)&F1(1)*(40/100)
= F2(1)&F2(1)&F1(1)*(8/10)*(5/10)
= F2(1)&F2(1)&F1(1)&F2(1)&F1(1)
So F3(1) = .128 if we define F3() = F2()&F2()&F2()&F1()&F1()
Similarly, if you want F4(1) = .9,
define F4() = F1() | F2(), so that F4(0) = F1(0)*F2(0) = .5*.2 = .1, which means F4(1) = 1-0.1 = 0.9
That is, F4 is zero only when both are zero, which happens with probability .1.
So combining f1 and f2 with these operations (&, |, not (!), and xor (^) if you want) can give you an F3 built purely out of f1 and f2, whenever p3 can be decomposed this way.
Finding the combination that gives you the exact probability may be an NP-hard problem.
So, finally, the answer to your question of whether it is possible is YES, and this is one way of doing it; there may be many tricks to optimize it further.
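The arithmetic in this answer can be checked with exact fractions. The helpers `p_and` and `p_or` below are hypothetical names for this sketch; they assume every call to f1 or f2 is independent of the others.

```python
from fractions import Fraction


def p_and(*probs):
    """Probability that all independent 0/1 draws are 1 (the & of the draws)."""
    result = Fraction(1)
    for p in probs:
        result *= p
    return result


def p_or(*probs):
    """Probability that at least one independent draw is 1 (the | of the draws)."""
    all_zero = Fraction(1)
    for p in probs:
        all_zero *= 1 - p
    return 1 - all_zero


if __name__ == "__main__":
    p1 = Fraction(1, 2)   # F1(1) = .5
    p2 = Fraction(4, 5)   # F2(1) = .8
    print(p_and(p2, p2, p2, p1, p1))   # F3 = F2&F2&F2&F1&F1
    print(p_or(p1, p2))                # F4 = F1|F2
```

With these values the two results are 16/125 and 9/10, matching .128 and .9 in the text.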

Can You Use Arithmetic Operators to Flip Between 0 and 1

Is there a way without using logic and bitwise operators, just arithmetic operators, to flip between integers with the value 0 and 1?
i.e.
variable ?= variable will make the variable 1 if it is 0, or 0 if it is 1.
x = 1 - x
Will switch between 0 and 1.
Edit: I misread the question, thought the OP could use any operator
A Few more...(ignore these)
x ^= 1 // bitwise operator
x = !x // logical operator
x = (x <= 0) // kinda the same as x != 1
Without using an operator?
int arr[] = {1,0}
x = arr[x]
Yet another way:
x = (x + 1) % 2
Assuming that it is initialized as a 0 or 1:
x = 1 - x
Comedy variation on st0le's second method
x = "\1"[x]
Another way to flip a bit.
x = ABS(x - 1) // the absolute of (x - 1)
int flip(int i){
return 1 - i;
};
Just for a bit of variety:
x = 1 / (x + 1);
x = (x == 0);
x = (x != 1);
Not sure whether you consider == and != to be arithmetic operators. Probably not, and obviously although they work in C, more strongly typed languages wouldn't convert the result to integer.
you can simply try this
+(!0) // output:1
+(!1) // output:0
You can use simple:
abs(x-1)
or just:
int(not x)
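A quick way to compare the purely arithmetic suggestions from this thread is to run each on both inputs. The dictionary below is only an illustration in Python; `1 // (x + 1)` stands in for the integer division `1 / (x + 1)` of the C-style answer.

```python
# Each entry maps a description to a function that should turn 0 into 1 and 1 into 0.
flips = {
    "1 - x": lambda x: 1 - x,
    "(x + 1) % 2": lambda x: (x + 1) % 2,
    "abs(x - 1)": lambda x: abs(x - 1),
    "1 // (x + 1)": lambda x: 1 // (x + 1),  # integer division
}

if __name__ == "__main__":
    for name, flip in flips.items():
        print(name, [flip(x) for x in (0, 1)])
```

All four only behave as a flip when x starts out as exactly 0 or 1, which is the assumption stated in the answers.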

Python performance: iteration and operations on nested lists

Problem Hey folks. I'm looking for some advice on python performance. Some background on my problem:
Given:
A (x,y) mesh of nodes each with a value (0...255) starting at 0
A list of N input coordinates each at a specified location within the range (0...x, 0...y)
A value Z that defines the "neighborhood" in count of nodes
Increment the value of the node at the input coordinate and the node's neighbors. Neighbors beyond the mesh edge are ignored. (No wrapping)
BASE CASE: A mesh of size 1024x1024 nodes, with 400 input coordinates and a range Z of 75 nodes.
Processing should be O(x*y*Z*N). I expect x, y and Z to remain roughly around the values in the base case, but the number of input coordinates N could increase up to 100,000. My goal is to minimize processing time.
Current results Between my start and the comments below, we've got several implementations.
Running speed on my 2.26 GHz Intel Core 2 Duo with Python 2.6.1:
f1: 2.819s
f2: 1.567s
f3: 1.593s
f: 1.579s
f3b: 1.526s
f4: 0.978s
f1 is the initial naive implementation: three nested for loops.
f2 replaces the inner for loop with a list comprehension.
f3 is based on Andrei's suggestion in the comments and replaces the outer for with map()
f is Chris's suggestion in the answers below
f3b is kriss's take on f3
f4 is Alex's contribution.
Code is included below for your perusal.
Question How can I further reduce the processing time? I'd prefer sub-1.0s for the test parameters.
Please, keep the recommendations to native Python. I know I can move to a third-party package such as numpy, but I'm trying to avoid any third party packages. Also, I've generated random input coordinates, and simplified the definition of the node value updates to keep our discussion simple. The specifics have to change slightly and are outside the scope of my question.
thanks much!
f1 is the initial naive implementation: three nested for loops.
def f1(x,y,n,z):
    rows = [[0]*x for i in xrange(y)]
    for i in range(n):
        inputX, inputY = (int(x*random.random()), int(y*random.random()))
        topleft = (inputX - z, inputY - z)
        for i in xrange(max(0, topleft[0]), min(topleft[0]+(z*2), x)):
            for j in xrange(max(0, topleft[1]), min(topleft[1]+(z*2), y)):
                if rows[i][j] < 255: rows[i][j] += 1 # cap at 255 (was <= 255, which let values reach 256)
f2 replaces the inner for loop with a list comprehension.
def f2(x,y,n,z):
rows = [[0]*x for i in xrange(y)]
for i in range(n):
inputX, inputY = (int(x*random.random()), int(y*random.random()))
topleft = (inputX - z, inputY - z)
for i in xrange(max(0, topleft[0]), min(topleft[0]+(z*2), x)):
l = max(0, topleft[1])
r = min(topleft[1]+(z*2), y)
rows[i][l:r] = [j+(j<255) for j in rows[i][l:r]]
UPDATE: f3 is based on Andrei's suggestion in the comments and replaces the outer for with map(). My first hack at this requires several out-of-local-scope lookups, specifically recommended against by Guido: local variable lookups are much faster than global or built-in variable lookups. I hardcoded all but the reference to the main data structure itself to minimize that overhead.
rows = [[0]*x for i in xrange(y)]
def f3(x,y,n,z):
inputs = [(int(x*random.random()), int(y*random.random())) for i in range(n)]
rows = map(g, inputs)
def g(input):
inputX, inputY = input
topleft = (inputX - 75, inputY - 75)
for i in xrange(max(0, topleft[0]), min(topleft[0]+(75*2), 1024)):
l = max(0, topleft[1])
r = min(topleft[1]+(75*2), 1024)
rows[i][l:r] = [j+(j<255) for j in rows[i][l:r]]
UPDATE3: ChristopheD also pointed out a couple of improvements.
def f(x,y,n,z):
rows = [[0] * y for i in xrange(x)]
rn = random.random
for i in xrange(n):
topleft = (int(x*rn()) - z, int(y*rn()) - z)
l = max(0, topleft[1])
r = min(topleft[1]+(z*2), y)
for u in xrange(max(0, topleft[0]), min(topleft[0]+(z*2), x)):
rows[u][l:r] = [j+(j<255) for j in rows[u][l:r]]
UPDATE4: kriss added a few improvements to f3, replacing min/max with the new ternary operator syntax.
def f3b(x,y,n,z):
rn = random.random
rows = [g1(x, y, z) for x, y in [(int(x*rn()), int(y*rn())) for i in xrange(n)]]
def g1(x, y, z):
l = y - z if y - z > 0 else 0
r = y + z if y + z < 1024 else 1024
for i in xrange(x - z if x - z > 0 else 0, x + z if x + z < 1024 else 1024 ):
rows[i][l:r] = [j+(j<255) for j in rows[i][l:r]]
UPDATE5: Alex weighed in with his substantive revision, adding a separate map() operation to cap the values at 255 and removing all non-local-scope lookups. The perf differences are non-trivial.
def f4(x,y,n,z):
rows = [[0]*y for i in range(x)]
rr = random.randrange
inc = (1).__add__
sat = (0xff).__and__
for i in range(n):
inputX, inputY = rr(x), rr(y)
b = max(0, inputX - z)
t = min(inputX + z, x)
l = max(0, inputY - z)
r = min(inputY + z, y)
for i in range(b, t):
rows[i][l:r] = map(inc, rows[i][l:r])
for i in range(x):
rows[i] = map(sat, rows[i])
Also, since we all seem to be hacking around with variations, here's my test harness to compare speeds: (improved by ChristopheD)
def timing(f,x,y,z,n):
fn = "%s(%d,%d,%d,%d)" % (f.__name__, x, y, z, n)
ctx = "from __main__ import %s" % f.__name__
results = timeit.Timer(fn, ctx).timeit(10)
return "%4.4s: %.3f" % (f.__name__, results / 10.0)
if __name__ == "__main__":
print timing(f, 1024, 1024, 400, 75)
#add more here.
On my (slow-ish;-) first-day Macbook Air, 1.6GHz Core 2 Duo, system Python 2.5 on MacOSX 10.5, after saving your code in op.py I see the following timings:
$ python -mtimeit -s'import op' 'op.f1()'
10 loops, best of 3: 5.58 sec per loop
$ python -mtimeit -s'import op' 'op.f2()'
10 loops, best of 3: 3.15 sec per loop
So, my machine is slower than yours by a factor of a bit more than 1.9.
The fastest code I have for this task is:
def f3(x=x,y=y,n=n,z=z):
rows = [[0]*y for i in range(x)]
rr = random.randrange
inc = (1).__add__
sat = (0xff).__and__
for i in range(n):
inputX, inputY = rr(x), rr(y)
b = max(0, inputX - z)
t = min(inputX + z, x)
l = max(0, inputY - z)
r = min(inputY + z, y)
for i in range(b, t):
rows[i][l:r] = map(inc, rows[i][l:r])
for i in range(x):
rows[i] = map(sat, rows[i])
which times as:
$ python -mtimeit -s'import op' 'op.f3()'
10 loops, best of 3: 3 sec per loop
so, a very modest speedup, projecting to more than 1.5 seconds on your machine - well above the 1.0 you're aiming for:-(.
With a simple C-coded extension, exte.c...:
#include "Python.h"
static PyObject*
dopoint(PyObject* self, PyObject* args)
{
int x, y, z, px, py;
int b, t, l, r;
int i, j;
PyObject* rows;
if(!PyArg_ParseTuple(args, "iiiiiO",
&x, &y, &z, &px, &py, &rows
))
return 0;
b = px - z;
if (b < 0) b = 0;
t = px + z;
if (t > x) t = x;
l = py - z;
if (l < 0) l = 0;
r = py + z;
if (r > y) r = y;
for(i = b; i < t; ++i) {
PyObject* row = PyList_GetItem(rows, i);
for(j = l; j < r; ++j) {
PyObject* pyitem = PyList_GetItem(row, j);
long item = PyInt_AsLong(pyitem);
if (item < 255) {
PyObject* newitem = PyInt_FromLong(item + 1);
PyList_SetItem(row, j, newitem);
}
}
}
Py_RETURN_NONE;
}
static PyMethodDef exteMethods[] = {
{"dopoint", dopoint, METH_VARARGS, "process a point"},
{0}
};
void
initexte()
{
Py_InitModule("exte", exteMethods);
}
(note: I haven't checked it carefully -- I think it doesn't leak memory due to the correct interplay of reference stealing and borrowing, but it should be code inspected very carefully before being put in production;-), we could do
import exte
def f4(x=x,y=y,n=n,z=z):
rows = [[0]*y for i in range(x)]
rr = random.randrange
for i in range(n):
inputX, inputY = rr(x), rr(y)
exte.dopoint(x, y, z, inputX, inputY, rows)
and the timing
$ python -mtimeit -s'import op' 'op.f4()'
10 loops, best of 3: 345 msec per loop
shows an acceleration of 8-9 times, which should put you in the ballpark you desire. I've seen a comment saying you don't want any third-party extension, but, well, this tiny extension you could make entirely your own;-). ((Not sure what licensing conditions apply to code on Stack Overflow, but I'll be glad to re-release this under the Apache 2 license or the like, if you need that;-)).
1. A (smaller) speedup could definitely be the initialization of your rows...
Replace
rows = []
for i in range(x):
rows.append([0 for i in xrange(y)])
with
rows = [[0] * y for i in xrange(x)]
2. You can also avoid some lookups by moving random.random out of the loops (saves a little).
3. EDIT: after corrections -- you could arrive at something like this:
def f(x,y,n,z):
rows = [[0] * y for i in xrange(x)]
rn = random.random
for i in xrange(n):
topleft = (int(x*rn()) - z, int(y*rn()) - z)
l = max(0, topleft[1])
r = min(topleft[1]+(z*2), y)
for u in xrange(max(0, topleft[0]), min(topleft[0]+(z*2), x)):
rows[u][l:r] = [j+(j<255) for j in rows[u][l:r]]
EDIT: some new timings with timeit (10 runs) -- seems this provides only minor speedups:
import timeit
print timeit.Timer("f1(1024,1024,400,75)", "from __main__ import f1").timeit(10)
print timeit.Timer("f2(1024,1024,400,75)", "from __main__ import f2").timeit(10)
print timeit.Timer("f(1024,1024,400,75)", "from __main__ import f").timeit(10)
f1 21.1669280529
f2 12.9376120567
f 11.1249599457
In your f3 rewrite, g can be simplified. (This can also be applied to f4.)
You have the following code inside a for loop.
l = max(0, topleft[1])
r = min(topleft[1]+(75*2), 1024)
However, it appears that those values never change inside the for loop. So calculate them once, outside the loop instead.
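A minimal sketch of that hoisting, written for Python 3 rather than the Python 2 used in the thread. The function name `apply_point` and its explicit `rows`, `width` and `height` parameters are assumptions made here so the example is self-contained; the window bounds follow the f3 code above (from the point minus z up to, but not including, the point plus z).

```python
def apply_point(rows, input_x, input_y, z, width, height):
    """Increment every node in the window around one input point, capped at 255."""
    # The column bounds do not depend on the row, so compute them once.
    l = max(0, input_y - z)
    r = min(input_y + z, height)
    for i in range(max(0, input_x - z), min(input_x + z, width)):
        row = rows[i]
        row[l:r] = [v + (v < 255) for v in row[l:r]]


if __name__ == "__main__":
    width, height = 8, 8
    rows = [[0] * height for _ in range(width)]
    apply_point(rows, 3, 3, 2, width, height)
    for row in rows:
        print(row)
```

Only the two bound computations move out of the loop; the slice assignment per row is unchanged, so the saving is small compared with the cost of rebuilding each slice.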
Based on your f3 version I played with the code. As l and r are constant, you can avoid computing them inside the g1 loop. Using the new ternary if instead of min and max also seems to be consistently faster. I also simplified the expression with topleft. On my system the code below appears to be about 20% faster.
def f3b(x,y,n,z):
rows = [g1(x, y, z) for x, y in [(int(x*random.random()), int(y*random.random())) for i in range(n)]]
def g1(x, y, z):
l = y - z if y - z > 0 else 0
r = y + z if y + z < 1024 else 1024
for i in xrange(x - z if x - z > 0 else 0, x + z if x + z < 1024 else 1024 ):
rows[i][l:r] = [j+(j<255) for j in rows[i][l:r]]
You can create your own Python module in C, and control the performance as you want:
http://docs.python.org/extending/
