How can I speed up this Julia code?

The code implements Pollard's rho algorithm, pollard_rho(), for finding a factor of a positive integer n. I've examined some of the code in the Julia "Primes" package, which runs rapidly, in an attempt to speed up the pollard_rho() function, all to no avail. The code should factor n = 1524157897241274137 in roughly 100 ms to 30 s (as it does in Erlang, Haskell, Mercury and SWI-Prolog), but it takes about 3 to 4 minutes on JuliaBox, IJulia and the Julia REPL. How can I make this go fast?
pollard_rho(1524157897241274137) = 1234567891
__precompile__()
module Pollard
export pollard_rho
function pollard_rho(n::T) where {T<:Integer}
f(x::T, r::T, n) = rem(((x ^ T(2)) + r), n)
r::T = 7; x::T = 2; y::T = 11; y1::T = 11; z::T = 1
while z == 1
x = f(x, r, n)
y1 = f(y, r, n)
y = f(y1, r, n)
z = gcd(n, abs(x - y))
end
z >= n ? "error" : z
end
end # module

There are quite a few problems with type instability here.
Don't return either the string "error" or a result; instead explicitly call error().
As Chris mentioned, x and r ought to be annotated to be of type T, else they will be unstable.
There also seems to be a potential problem with overflow: x can be almost as large as n (about 1.5e18 here), so x^2 overflows Int64. A solution is to widen in the squaring step (Base.widemul) before truncating back to type T.
function pollard_rho(n::T) where {T<:Integer}
f(x::T, r::T, n) = rem(Base.widemul(x, x) + r, n) % T
r::T = 7; x::T = 2; y::T = 11; y1::T = 11; z::T = 1
while z == 1
x = f(x, r, n)
y1 = f(y, r, n)
y = f(y1, r, n)
z = gcd(n, abs(x - y))
end
z >= n ? error() : z
end
After making these changes, the function runs about as fast as you could expect (timed here with the @btime macro from BenchmarkTools.jl):
julia> @btime pollard_rho(1524157897241274137)
4.128 ms (0 allocations: 0 bytes)
1234567891
To find these type-instability problems, use the @code_warntype macro.
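For comparison, here is a minimal Python sketch of the same iteration, using the same constant r = 7 and the same starting values (2 and 11) as the Julia code. Python integers have arbitrary precision, so the squaring step cannot overflow; that is the problem `widemul` solves in the Julia version. Raising `ValueError` when the gcd comes back as n is an assumption standing in for Julia's `error()`.

```python
from math import gcd


def pollard_rho(n: int) -> int:
    """Return a nontrivial factor of n, mirroring the Julia pollard_rho loop."""
    r = 7

    def f(v: int) -> int:
        # v^2 + r (mod n); Python ints never overflow, so no widening is needed
        return (v * v + r) % n

    x, y, z = 2, 11, 1
    while z == 1:
        x = f(x)          # x advances one step per iteration
        y = f(f(y))       # y advances two steps (y1 = f(y); y = f(y1) in Julia)
        z = gcd(n, abs(x - y))
    if z >= n:
        # the sequences met without exposing a factor; try another r or start
        raise ValueError("no factor found with these parameters")
    return z
```

The result is only guaranteed to be some nontrivial divisor of n; which divisor comes out depends on r and the starting values.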

Related

Will this binary search algorithm run into infinite loop?

The specific binary search implementation is shown below. Is it possible for this algorithm to run into an infinite loop?
One possible situation I could think of is when l == r == UINT_MAX and the target x is larger than all elements in the array. Is it true that in this situation the algorithm will get stuck in an infinite loop?
Are there any other situations that lead to an infinite loop?
Thanks for your help!!!
// An iterative binary search function. It returns the location of x in
// given array arr[l..r] if present, otherwise -1.
int binarySearch(vector<double> arr, double x) {
unsigned int l = 0;
unsigned int r = arr.size() - 1;
while (l <= r) {
int m = l + (r - l) / 2;
if (arr[m] == x)
return m;
if (arr[m] < x)
l = m + 1;
else
r = m - 1;
}
return -1;
}
No, it doesn't! An infinite loop can only happen here if l and r stay at the same values forever. For that to happen, one of these things needs to happen:
1) new value of l = old value of l:
m + 1 = l + (r - l) / 2 + 1 = l --> (r - l)/2 + 1 = 0 (which never happens, since the left side is always positive given that r is greater than or equal to l)
2) new value of r = old value of r:
m - 1 = l + (r - l) / 2 - 1 = r --> (r - l)/2 = r - l + 1 (this also never happens, because the right side is always strictly bigger)
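The argument can be checked empirically with a Python port of the loop (the name `binary_search` is illustrative) that asserts the window r - l strictly shrinks on every iteration. One caveat: Python integers never wrap around, so this models the arithmetic of the argument, not the C++ types. In the C++ version, if x is smaller than every element, the line r = m - 1 eventually runs with m == 0 and the unsigned r wraps to UINT_MAX, after which arr[m] indexes far outside the vector.

```python
def binary_search(arr, x):
    """Iterative binary search over a sorted list; returns an index of x or -1."""
    l, r = 0, len(arr) - 1
    while l <= r:
        width = r - l
        m = l + (r - l) // 2
        if arr[m] == x:
            return m
        if arr[m] < x:
            l = m + 1
        else:
            r = m - 1
        # the argument above: l or r always moves, so the window strictly shrinks
        assert r - l < width
    return -1
```

Because the width is a nonnegative integer that strictly decreases, the loop must terminate (with unbounded integers).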

Solving linear equations

I have to find an integral solution of the equation ax+by=c such that x>=0, y>=0 and the value of (x+y) is minimal.
I know that if c%gcd(a,b)==0, then an integer solution always exists. How can I find the values of x and y?
My approach
for i from 0 to 2*c:
x=i
y= (c-a*i)/b
if(y is a nonnegative integer)
ans = min(ans,x+y)
Is there a better way to do this, with better time complexity?
Using the Extended Euclidean Algorithm and the theory of linear Diophantine equations there is no need to search. Here is a Python 3 implementation:
def egcd(a,b):
    s,t = 1,0 #coefficients to express current a in terms of original a,b
    x,y = 0,1 #coefficients to express current b in terms of original a,b
    q,r = divmod(a,b)
    while(r > 0):
        a,b = b,r
        old_x, old_y = x,y
        x,y = s - q*x, t - q*y
        s,t = old_x, old_y
        q,r = divmod(a,b)
    return b, x ,y

def smallestSolution(a,b,c):
    d,x,y = egcd(a,b)
    if c%d != 0:
        return "No integer solutions"
    else:
        u = a//d #integer division
        v = b//d
        w = c//d
        x = w*x
        y = w*y
        k1 = -x//v if -x % v == 0 else 1 + -x//v #k1 = ceiling(-x/v)
        x1 = x + k1*v # x + k1*v is solution with smallest x >= 0
        y1 = y - k1*u
        if y1 < 0:
            return "No nonnegative integer solutions"
        else:
            k2 = y//u #floor division
            x2 = x + k2*v #y-k2*u is solution with smallest y >= 0
            y2 = y - k2*u
            if x2 < 0 or x1+y1 < x2+y2:
                return (x1,y1)
            else:
                return (x2,y2)
Typical run:
>>> smallestSolution(1001,2743,160485)
(111, 18)
The way it works: first, use the extended Euclidean algorithm to find d = gcd(a,b) and one solution (x,y). All other solutions are of the form (x+k*v, y-k*u), where u = a/d, v = b/d and k is an arbitrary integer parameter. Since x+y is linear in k, it has no critical points, hence it is minimized in the first quadrant when either x is as small as possible or y is as small as possible. By appropriate use of floor and ceiling you can locate the integer point with x as small as possible and the one with y as small as possible. Just take the one with the smaller sum.
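The parametrization is easy to see on a small case. This sketch (the function name and the example equation 3x + 5y = 30 are illustrative, not from the answer) takes one particular solution (x0, y0) and lists every solution with x, y >= 0, computing the ceiling with integer floor division only. Because x + y changes linearly with k, the sums are monotone along the list, so the minimum sits at one end, as the answer argues.

```python
from math import gcd


def nonnegative_solutions(a, b, c, x0, y0):
    """List all (x, y) with x, y >= 0 and a*x + b*y == c, given one solution (x0, y0).

    Assumes a, b > 0 and a*x0 + b*y0 == c.
    """
    d = gcd(a, b)
    u, v = a // d, b // d
    # every solution is (x0 + k*v, y0 - k*u) for some integer k
    k_lo = -(x0 // v)   # ceil(-x0 / v): smallest k with x >= 0
    k_hi = y0 // u      # floor(y0 / u): largest k with y >= 0
    return [(x0 + k * v, y0 - k * u) for k in range(k_lo, k_hi + 1)]
```

For 3x + 5y = 30, starting from (10, 0), this gives (0, 6), (5, 3) and (10, 0), with sums 6, 8 and 10; the minimum is at the end with the smallest x.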
On Edit: My original code used the Python function math.ceil applied to -x/v. This is problematic for very large integers, because -x/v is computed as a float. I tweaked it so that the ceiling is computed with integer operations only. It can now handle arbitrarily large numbers:
>>> a = 236317407839490590865554550063
>>> b = 127372335361192567404918884983
>>> c = 475864993503739844164597027155993229496457605245403456517677648564321
>>> smallestSolution(a,b,c)
(2013668810262278187384582192404963131387, 120334243940259443613787580180)
>>> x,y = _
>>> a*x+b*y
475864993503739844164597027155993229496457605245403456517677648564321
Most of the computation takes place in running the extended Euclidean algorithm, which needs only O(log(min(a,b))) division steps.
First, let's assume a,b,c>0, so:
a.x+b.y = c
x+y = min(xi+yi)
x,y >= 0
a,b,c > 0
------------------------
x = ( c - b.y )/a
y = ( c - a.x )/b
c - a.x >= 0
c - b.y >= 0
c >= b.y
c >= a.x
x <= c/a
y <= c/b
So a naive O(n) solution in C++ (n being the number of candidate values of y, about c/b) looks like this:
void compute0(int &x,int &y,int a,int b,int c) // naive
{
int xx,yy;
xx=-1; yy=-1;
for (y=0;;y++)
{
x = c - b*y;
if (x<0) break; // y out of range stop
if (x%a) continue; // non integer solution
x/=a; // remember minimal solution
if ((xx<0)||(x+y<=xx+yy)) { xx=x; yy=y; }
}
x=xx; y=yy;
}
If no solution is found, it returns -1,-1. If you think about the equation a bit, you should realize that the minimal solution occurs when x or y is minimal (which one depends on whether a<b), so with that heuristic we only need to increase the minimal coordinate until the first solution is found. This speeds up the whole thing considerably:
void compute1(int &x,int &y,int a,int b,int c)
{
if (a<=b){ for (x=0,y=c;y>=0;x++,y-=a) if (y%b==0) { y/=b; return; } }
else { for (y=0,x=c;x>=0;y++,x-=b) if (x%a==0) { x/=a; return; } }
x=-1; y=-1;
}
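For readers following along in Python, here is a port of the compute1 heuristic, as a sketch assuming a, b > 0 and c >= 0 (the zero special cases are not handled). Like the C++ code, it returns (-1, -1) when there is no nonnegative solution.

```python
def compute1(a, b, c):
    """Walk the cheaper coordinate upward until a*x + b*y == c has an integer solution.

    Assumes a, b > 0 and c >= 0. Returns (-1, -1) if no nonnegative solution exists.
    """
    if a <= b:
        # for a <= b, x + y never decreases as x grows, so the first valid x wins
        x, rest = 0, c            # rest = c - a*x
        while rest >= 0:
            if rest % b == 0:
                return x, rest // b
            x += 1
            rest -= a
    else:
        # symmetric case: the first valid y wins
        y, rest = 0, c            # rest = c - b*y
        while rest >= 0:
            if rest % a == 0:
                return rest // a, y
            y += 1
            rest -= b
    return -1, -1
```

For a=50, b=105, c=500000000 it stops after ten steps at (10, 4761900), the same answer as in the timings.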
I measured this on my setup:
x y ax+by x+y a=50 b=105 c=500000000
[ 55.910 ms] 10 4761900 500000000 4761910 naive
[ 0.000 ms] 10 4761900 500000000 4761910 opt
x y ax+by x+y a=105 b=50 c=500000000
[ 99.214 ms] 4761900 10 500000000 4761910 naive
[ 0.000 ms] 4761900 10 500000000 4761910 opt
The ~2x difference between the naive timings comes from the ratio of a and b (105/50 = 2.1): in the second run the naive loop iterates over the worse coordinate.
Now just handle special cases when a,b,c are zero (to avoid division by zero)...

Recursive algorithm for power of a power

I need to calculate a power of a power, for example 3^2^n, i.e. 3^(2^n). You can think of n as the input; note that this is not the same thing as 9^n. I wrote an algorithm using loops, but now I need to write a recursive one, and I couldn't find an efficient way to write it.
I went ahead and implemented this in Ruby, which is pretty darn close to pseudocode and has the added benefit of being testable. Since Ruby also has arbitrary precision integer arithmetic, the following code works with non-trivial arguments.
This implementation is based on the old trick of squaring the base and raising it to half the specified power when the exponent is even, so the recursive stack grows logarithmically rather than linearly in powers. This was inspired by Ilya's answer, but I found that the y > 1 and n > 1 case is not correct, which led me to use the recursive call within a recursive call implemented in the elsif n > 1 line below:
def powpow(x, y, n)
if y == 0
return 1
elsif y == 1 || n == 0
return x
elsif n > 1
return powpow(x, powpow(y, n, 1), 1)
elsif y.even?
return powpow(x * x, y / 2, 1)
else
return x * powpow(x * x, y / 2, 1)
end
end
p powpow(3,2,5) # => 1853020188851841
I was able to confirm that result directly:
irb(main):001:0> 2**5
=> 32
irb(main):002:0> 3**32
=> 1853020188851841
Let's say x^(y^n) = powpow(x, y, n) with y and n >= 1
If y > 1 and n > 1, powpow(x, y, n) = powpow(x, y, 1) * powpow(x, y, n-1) (getting closer to the result)
If y > 1 and n = 1, powpow(x, y, 1) = x * powpow(x, y-1, 1) (getting closer)
If y = 1 and n = 1, powpow(x, 1, 1) = x (solved)
That's less efficient than a loop, but it's recursive. Is that what you're aiming for ...?
EDIT: as @pjs has pointed out, the first case should be:
powpow(x, y, n) = powpow(x, powpow(y, n, 1), 1)
public class Power {
int ans = 1;
int z = 1;
int x = 1;
int pow1(int b, int c) {
if (c > 1) {
pow1(b, c - 1);
}
ans = ans * b;
return ans;
}
void pow(int a, int b, int c) {
x = pow1(b, c);
ans = a;
pow1(a, x - 1);
}
public static void main(String[] args) {
Power p = new Power();
p.pow(3, 2, 3);
System.out.println(p.ans);
}
}
Recursive approach
We can compute power(x, y) efficiently, in O(log y), using the following recurrence:
power(x, y) :
if y is 0 : return 1
if y is even :
return square( power(x, y / 2))
else :
return square( power(x, (y - 1) / 2 ) ) * x
Using the master theorem, we can show that the complexity of the above procedure is O(log y) (similar to binary search).
Now, suppose we use the above procedure to compute 3 ^ (2 ^ n).
Computing k = 2 ^ n takes O(log n), and computing 3 ^ k then takes O(log k) = O(log (2 ^ n)) = O(n).
So, by applying the binary exponentiation trick twice in sequence, we can solve this in O(n).
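The recurrence translates almost line for line into Python; this sketch relies on Python's arbitrary-precision integers, and the helper name `three_pow_two_pow` is illustrative. Note that the complexity figures above count multiplications: 3 ^ (2 ^ n) has about 2^n * log2(3) bits, so each multiplication gets more expensive as n grows.

```python
def power(x, y):
    """Binary exponentiation by recursion: O(log y) multiplications."""
    if y == 0:
        return 1
    half = power(x, y // 2)   # y // 2 equals (y - 1) / 2 when y is odd
    if y % 2 == 0:
        return half * half
    return half * half * x


def three_pow_two_pow(n):
    """3 ** (2 ** n): compute the exponent first, then the power."""
    k = power(2, n)        # O(log n) multiplications
    return power(3, k)     # O(log k) = O(n) multiplications
```

For n = 5 this is 3^32 = 1853020188851841, matching the Ruby answer's output.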
Iterative approach
Idea: suppose we have calculated 3 ^ (2 ^ x). Then we can easily calculate 3 ^ (2 ^ (x + 1)) by squaring 3 ^ (2 ^ x):
( 3 ^ (2 ^ x) ) * ( 3 ^ (2 ^ x) ) = 3 ^ ( (2 ^ x) + (2 ^ x) )
= 3 ^ ( 2 * (2 ^ x) )
= 3 ^ ( 2 ^ (x + 1) )
So, if we start with 3 ^ (2 ^ 0), we reach 3 ^ (2 ^ n) in n steps:
def solve(n):
    ans = 3  # 3 ** (2 ** 0)
    for i in range(n):
        ans = ans * ans  # 3 ** (2 ** i) squared is 3 ** (2 ** (i + 1))
    return ans
Clearly, the complexity of the above solution is also O(n) multiplications.

How to find the number of values in a given range divisible by a given value?

I have three numbers x, y, z.
For the range between x and y,
how can I find the count of numbers whose remainder modulo z is 0, i.e. how many numbers between x and y are divisible by z?
It can be done in O(1): find the first multiple, find the last one, then count the multiples from the first to the last.
I'm assuming the range is inclusive. If your ranges are exclusive, adjust the bounds by one:
find the first value at or after x that is divisible by z. You can discard x:
x_mod = x % z;
if(x_mod != 0)
x += (z - x_mod);
find the last value at or before y that is divisible by z. You can discard y:
y -= y % z;
find the size of this range:
if(x > y)
return 0;
else
return (y - x) / z + 1;
If mathematical floor and ceil functions are available, the first two parts can be written more readably. Also the last part can be compressed using math functions:
x = ceil (x, z);
y = floor (y, z);
return max((y - x) / z + 1, 0);
if the input is guaranteed to be a valid range (x <= y), the last test or max is unnecessary:
x = ceil (x, z);
y = floor (y, z);
return (y - x) / z + 1;
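In languages without a two-argument ceil/floor, both roundings can be done with integer floor division. A Python sketch of the steps above, assuming z > 0 and an inclusive range; `ceil_multiple` and `floor_multiple` are hypothetical names standing in for the answer's ceil(x, z) and floor(y, z). Python's `//` rounds toward negative infinity, which also makes this correct for negative bounds.

```python
def ceil_multiple(x, z):
    """Smallest multiple of z that is >= x (z > 0)."""
    return -((-x) // z) * z


def floor_multiple(y, z):
    """Largest multiple of z that is <= y (z > 0)."""
    return (y // z) * z


def count_divisible(x, y, z):
    """Number of integers in [x, y] divisible by z."""
    first = ceil_multiple(x, z)
    last = floor_multiple(y, z)
    # first and last are multiples of z, so the division below is exact
    return max((last - first) // z + 1, 0)
```

For example, count_divisible(30, 90, 13) rounds the bounds to 39 and 78 and returns 4.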
(2017, answer rewritten thanks to comments)
The number of multiples of z in [1, n] is simply n / z,
/ being integer division, meaning any fractional part of the result is discarded (for instance 17/5 => 3, not 3.4).
Now, in a range from x to y, how many multiples of z are there?
Let's see how many multiples m we have up to y:
0----------------------------------x------------------------y
-m---m---m---m---m---m---m---m---m---m---m---m---m---m---m---
You see where I'm going: to get the number of multiples in the range [x, y], take the number of multiples up to y, then subtract the number of multiples before x, (x-1) / z
Solution: ( y / z ) - (( x - 1 ) / z )
Programmatically, you could make a function numberOfMultiples:
function numberOfMultiples(n, z) {
return Math.floor(n / z); // integer division
}
to get the number of multiples in a range [x, y]:
numberOfMultiples(y, z) - numberOfMultiples(x - 1, z)
The function is O(1); there is no need for a loop to get the number of multiples.
Examples of results you should find
[30, 90] ÷ 13 => 4
[1, 1000] ÷ 6 => 166
[100, 1000000] ÷ 7 => 142843
[777, 777777777] ÷ 7 => 111111001
For the first example, 90 / 13 = 6, (30-1) / 13 = 2, and 6-2 = 4
---26---39---52---65---78---91--
^ ^
30<---(4 multiples)-->90
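A Python sketch of the formula ( y / z ) - (( x - 1 ) / z ), cross-checked against a brute-force count. Python's `//` is floor division, which matches the integer division the answer assumes for nonnegative inputs and also keeps the formula valid when x is 0 or negative (for any x <= y and z > 0). The function names are illustrative.

```python
def number_of_multiples(n, z):
    """floor(n / z); for n >= 0 this is the number of multiples of z in [1, n]."""
    return n // z


def multiples_in_range(x, y, z):
    """Number of multiples of z in the inclusive range [x, y]."""
    return number_of_multiples(y, z) - number_of_multiples(x - 1, z)


def brute_force(x, y, z):
    """Reference count by direct enumeration."""
    return sum(1 for v in range(x, y + 1) if v % z == 0)
```

For the first example, 90 // 13 = 6 and 29 // 13 = 2, so multiples_in_range(30, 90, 13) returns 4.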
I also encountered this on Codility. It took me much longer than I'd like to admit to come up with a good solution, so I figured I would share what I think is an elegant solution!
Straightforward Approach 1/2:
O(N) time solution with a loop and counter, unrealistic when N = 2 billion.
Awesome Approach 3:
We want the number of integers in some range that are divisible by K.
Simple case: assume range [0 .. n*K], N = n*K
N/K represents the number of integers in [0,N) that are divisible by K, given N%K = 0 (i.e. N is divisible by K)
ex. N = 9, K = 3, count = |{0 3 6}| = 3 = 9/3
Similarly,
N/K + 1 represents the number of integers in [0,N] divisible by K
ex. N = 9, K = 3, count = |{0 3 6 9}| = 4 = 9/3 + 1
I think really understanding the above fact is the trickiest part of this question. It works because the multiples of K in [0, N) are exactly 0, K, 2K, ..., (n-1)K, which is n = N/K values; including N itself adds one more.
The rest boils down to prefix sums and handling special cases.
Now we don't always have a range that begins with 0, and we cannot assume the two bounds will be divisible by K.
But wait! We can fix this by calculating our own nice upper and lower bounds and using some subtraction magic :)
First, find the closest upper and lower bounds in the range [A,B] that are divisible by K.
Upper bound (easier): ex. B = 10, K = 3, new_B = 9... the pattern is B - B%K
Lower bound: ex. A = 10, K = 3, new_A = 12... try a few more and you will see the pattern is A - A%K + K (when A%K != 0)
Then calculate the following using the above technique:
Determine the total number of integers X in [0,B] that are divisible by K
Determine the total number of integers Y in [0,A) that are divisible by K
Calculate the number of integers in [A,B] that are divisible by K in constant time with the expression X - Y
Website: https://codility.com/demo/take-sample-test/count_div/
class CountDiv {
public int solution(int A, int B, int K) {
int firstDivisible = A%K == 0 ? A : A + (K - A%K);
int lastDivisible = B%K == 0 ? B : B - B%K; //same as (B/K)*K for nonnegative B
return (lastDivisible - firstDivisible)/K + 1;
}
}
This is my first time explaining an approach like this. Feedback is very much appreciated :)
This is one of the Codility Lesson 3 questions. For this question, the input is guaranteed to be a valid range. I answered it using JavaScript:
function solution(x, y, z) {
var totalDivisibles = Math.floor(y / z),
excludeDivisibles = Math.floor((x - 1) / z),
divisiblesInArray = totalDivisibles - excludeDivisibles;
return divisiblesInArray;
}
https://codility.com/demo/results/demoQX3MJC-8AP/
(I actually wanted to ask about some of the other comments on this page but I don't have enough rep points yet).
Divide y-x by z, rounding down. Add one if y%z < x%z or if x%z == 0.
No mathematical proof, unless someone cares to provide one, but test cases, in Perl:
#!perl
use strict;
use warnings;
use Test::More;
sub multiples_in_range {
my ($x, $y, $z) = @_;
return 0 if $x > $y;
my $ret = int( ($y - $x) / $z);
$ret++ if $y%$z < $x%$z or $x%$z == 0;
return $ret;
}
for my $z (2 .. 10) {
for my $x (0 .. 2*$z) {
for my $y (0 .. 4*$z) {
is multiples_in_range($x, $y, $z),
scalar(grep { $_ % $z == 0 } $x..$y),
"[$x..$y] mod $z";
}
}
}
done_testing;
Output:
$ prove divrange.pl
divrange.pl .. ok
All tests successful.
Files=1, Tests=3405, 0 wallclock secs ( 0.20 usr 0.02 sys + 0.26 cusr 0.01 csys = 0.49 CPU)
Result: PASS
Let [A;B] be an interval of nonnegative integers including A and B such that 0 <= A <= B, and let K be the divisor.
It is easy to see that there are N(A) = ⌊A / K⌋ = floor(A / K) multiples of K in the interval (0;A]:
1K 2K 3K 4K 5K
●········x········x··●·····x········x········x···>
0 A
Similarly, there are N(B) = ⌊B / K⌋ = floor(B / K) multiples of K in the interval (0;B]:
1K 2K 3K 4K 5K
●········x········x········x········x···●····x···>
0 B
Then N = N(B) - N(A) equals to the number of K's (the number of integers divisible by K) in range (A;B]. The point A is not included, because the subtracted N(A) includes this point. Therefore, the result should be incremented by one, if A mod K is zero:
N := N(B) - N(A)
if (A mod K = 0)
N := N + 1
Implementation in PHP
function solution($A, $B, $K) {
if ($K < 1)
return 0;
$c = floor($B / $K) - floor($A / $K);
if ($A % $K == 0)
$c++;
return (int)$c;
}
In PHP, for nonnegative values, the effect of the floor function can be achieved by casting to the integer type:
$c = (int)($B / $K) - (int)($A / $K);
which, I think, is faster.
Here is a short and simple O(1) solution in C++.
I hope it's not difficult to understand.
int solution(int A, int B, int K) {
// write your code in C++11
// multiples of K in (0, B] minus multiples of K in (0, A], assuming 0 <= A <= B and K > 0
int cnt = B / K - A / K;
if (A % K == 0)
cnt++; // A itself is a multiple of K and was subtracted above
return cnt;
}
(floor)(high/d) - (floor)(low/d) - (high%d==0)
Explanation:
There are floor(a/d) numbers divisible by d in (0, a] (d != 0).
Therefore (floor)(high/d) - (floor)(low/d) gives the count of numbers divisible by d in the range (low, high] (note that low is excluded and high is included in this range).
Now, to remove high from the range, just subtract (high%d==0); the result is the count in the open range (low, high).
Works for integers, floats or whatever (Use fmodf function for floats)
I won't strive for an O(1) solution; I'll leave that to cleverer people :) I just feel this is a perfect use case for functional programming. Simple and straightforward.
> x,y,z=1,1000,6
=> [1, 1000, 6]
> (x..y).select {|n| n%z==0}.size
=> 166
EDIT: after reading the others' O(1) solutions, I feel ashamed. Programming makes people too lazy to think...
Division (a/b = c) is, by definition, taking a set of size a and forming groups of size b; the number of such groups that can be formed, c, is the quotient of a and b. This quotient is nothing more than the number of integers within the interval ]0..a] (not including zero, but including a) that are divisible by b.
so by definition:
Y/Z - number of integers within ]0..Y] that are divisible by Z
and
X/Z - number of integers within ]0..X] that are divisible by Z
thus:
result = [Y/Z] - [X/Z] + x (where x = 1 if and only if X is divisible by Z, otherwise 0, assuming the given range [X..Y] includes X)
example :
for (6, 12, 2) we have 12/2 - 6/2 + 1 (as 6%2 == 0) = 6 - 3 + 1 = 4 // {6, 8, 10, 12}
for (5, 12, 2) we have 12/2 - 5/2 + 0 (as 5%2 != 0) = 6 - 2 + 0 = 4 // {6, 8, 10, 12}
The time complexity of the solution is constant, O(1).
Code Snippet :
int countDiv(int a, int b, int m)
{
int mod = (min(a, b)%m==0);
int cnt = abs(floor(b/m) - floor(a/m)) + mod;
return cnt;
}
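Here, n is the count of numbers in [a, b] divisible by k, and the program prints the sum of all those numbers:
import java.util.Scanner;
public class DivisibleSum {
public static void main(String[] args) {
Scanner sc = new Scanner(System.in);
int a = sc.nextInt();
int b = sc.nextInt();
int k = sc.nextInt();
// first multiple of k that is >= a (assumes a >= 0 and k > 0)
int first = (a % k == 0) ? a : a + (k - a % k);
// last multiple of k that is <= b
int last = b - b % k;
if (first > last) {
System.out.println(0);
} else {
long n = (last - first) / k + 1; // count of multiples in [a, b]
System.out.println(n * ((long) first + last) / 2); // arithmetic-series sum
}
}
}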
Here is the solution to the problem written in Swift Programming Language.
Step 1: Find the first number in the range divisible by z.
Step 2: Find the last number in the range divisible by z.
Step 3: Use a mathematical formula to find the number of divisible numbers by z in the range.
func solution(_ x : Int, _ y : Int, _ z : Int) -> Int {
var numberOfDivisible = 0
var firstNumber: Int
var lastNumber: Int
if y == x {
return x % z == 0 ? 1 : 0
}
//Find first number divisible by z
let moduloX = x % z
if moduloX == 0 {
firstNumber = x
} else {
firstNumber = x + (z - moduloX)
}
//Find last number divisible by z
let moduloY = y % z
if moduloY == 0 {
lastNumber = y
} else {
lastNumber = y - moduloY
}
//Math formula
numberOfDivisible = (lastNumber - firstNumber) / z + 1
return numberOfDivisible
}
public static int Solution(int A, int B, int K)
{
int count = 0;
//If A is divisible by K
if(A % K == 0)
{
count = (B / K) - (A / K) + 1;
}
//If A is not divisible by K
else if(A % K != 0)
{
count = (B / K) - (A / K);
}
return count;
}
This can be done in O(1).
Here is a solution in C++:
auto first{ x % z == 0 ? x : x + z - x % z };
auto last{ y % z == 0 ? y : y - y % z };
auto ans{ (last - first) / z + 1 };
Where first is the first number that ∈ [x; y] and is divisible by z, last is the last number that ∈ [x; y] and is divisible by z and ans is the answer that you are looking for.

Python performance: iteration and operations on nested lists

Problem
Hey folks. I'm looking for some advice on Python performance. Some background on my problem:
Given:
A (x,y) mesh of nodes each with a value (0...255) starting at 0
A list of N input coordinates each at a specified location within the range (0...x, 0...y)
A value Z that defines the "neighborhood" in count of nodes
Increment the value of the node at the input coordinate and the node's neighbors. Neighbors beyond the mesh edge are ignored. (No wrapping)
BASE CASE: A mesh of size 1024x1024 nodes, with 400 input coordinates and a range Z of 75 nodes.
Processing should be O(x*y*Z*N). I expect x, y and Z to remain roughly around the values in the base case, but the number of input coordinates N could increase up to 100,000. My goal is to minimize processing time.
Current results
Between my start and the comments below, we've got several implementations.
Running speed on my 2.26 GHz Intel Core 2 Duo with Python 2.6.1:
f1: 2.819s
f2: 1.567s
f3: 1.593s
f: 1.579s
f3b: 1.526s
f4: 0.978s
f1 is the initial naive implementation: three nested for loops.
f2 replaces the inner for loop with a list comprehension.
f3 is based on Andrei's suggestion in the comments and replaces the outer for with map()
f is Chris's suggestion in the answers below
f3b is kriss's take on f3
f4 is Alex's contribution.
Code is included below for your perusal.
Question
How can I further reduce the processing time? I'd prefer sub-1.0s for the test parameters.
Please, keep the recommendations to native Python. I know I can move to a third-party package such as numpy, but I'm trying to avoid any third party packages. Also, I've generated random input coordinates, and simplified the definition of the node value updates to keep our discussion simple. The specifics have to change slightly and are outside the scope of my question.
thanks much!
f1 is the initial naive implementation: three nested for loops.
import random

def f1(x,y,n,z):
    rows = [[0]*x for i in xrange(y)]
    for i in range(n):
        inputX, inputY = (int(x*random.random()), int(y*random.random()))
        topleft = (inputX - z, inputY - z)
        for i in xrange(max(0, topleft[0]), min(topleft[0]+(z*2), x)):
            for j in xrange(max(0, topleft[1]), min(topleft[1]+(z*2), y)):
                if rows[i][j] < 255: rows[i][j] += 1  # cap node values at 255
f2 replaces the inner for loop with a list comprehension.
def f2(x,y,n,z):
    rows = [[0]*x for i in xrange(y)]
    for i in range(n):
        inputX, inputY = (int(x*random.random()), int(y*random.random()))
        topleft = (inputX - z, inputY - z)
        for i in xrange(max(0, topleft[0]), min(topleft[0]+(z*2), x)):
            l = max(0, topleft[1])
            r = min(topleft[1]+(z*2), y)
            rows[i][l:r] = [j+(j<255) for j in rows[i][l:r]]
UPDATE: f3 is based on Andrei's suggestion in the comments and replaces the outer for with map(). My first hack at this requires several out-of-local-scope lookups, which Guido specifically recommends against: local variable lookups are much faster than global or built-in variable lookups. I hardcoded all but the reference to the main data structure itself to minimize that overhead.
rows = [[0]*x for i in xrange(y)]
def f3(x,y,n,z):
    inputs = [(int(x*random.random()), int(y*random.random())) for i in range(n)]
    rows = map(g, inputs)
def g(input):
    inputX, inputY = input
    topleft = (inputX - 75, inputY - 75)
    for i in xrange(max(0, topleft[0]), min(topleft[0]+(75*2), 1024)):
        l = max(0, topleft[1])
        r = min(topleft[1]+(75*2), 1024)
        rows[i][l:r] = [j+(j<255) for j in rows[i][l:r]]
UPDATE3: ChristopheD also pointed out a couple of improvements.
def f(x,y,n,z):
    rows = [[0] * y for i in xrange(x)]
    rn = random.random
    for i in xrange(n):
        topleft = (int(x*rn()) - z, int(y*rn()) - z)
        l = max(0, topleft[1])
        r = min(topleft[1]+(z*2), y)
        for u in xrange(max(0, topleft[0]), min(topleft[0]+(z*2), x)):
            rows[u][l:r] = [j+(j<255) for j in rows[u][l:r]]
UPDATE4: kriss added a few improvements to f3, replacing min/max with the new ternary operator syntax.
def f3b(x,y,n,z):
    rn = random.random
    rows = [g1(x, y, z) for x, y in [(int(x*rn()), int(y*rn())) for i in xrange(n)]]
def g1(x, y, z):
    l = y - z if y - z > 0 else 0
    r = y + z if y + z < 1024 else 1024
    for i in xrange(x - z if x - z > 0 else 0, x + z if x + z < 1024 else 1024 ):
        rows[i][l:r] = [j+(j<255) for j in rows[i][l:r]]
UPDATE5: Alex weighed in with his substantive revision, adding a separate map() pass that masks the values to 8 bits with & 0xff (note that this wraps a value of 256 back to 0 rather than capping it at 255) and removing all non-local-scope lookups. The perf differences are non-trivial.
def f4(x,y,n,z):
    rows = [[0]*y for i in range(x)]
    rr = random.randrange
    inc = (1).__add__
    sat = (0xff).__and__
    for i in range(n):
        inputX, inputY = rr(x), rr(y)
        b = max(0, inputX - z)
        t = min(inputX + z, x)
        l = max(0, inputY - z)
        r = min(inputY + z, y)
        for i in range(b, t):
            rows[i][l:r] = map(inc, rows[i][l:r])
    for i in range(x):
        rows[i] = map(sat, rows[i])
Also, since we all seem to be hacking around with variations, here's my test harness to compare speeds: (improved by ChristopheD)
import timeit

def timing(f,x,y,n,z):
    fn = "%s(%d,%d,%d,%d)" % (f.__name__, x, y, n, z)
    ctx = "from __main__ import %s" % f.__name__
    results = timeit.Timer(fn, ctx).timeit(10)
    return "%4.4s: %.3f" % (f.__name__, results / 10.0)

if __name__ == "__main__":
    print timing(f, 1024, 1024, 400, 75)
    #add more here.
On my (slow-ish;-) first-day MacBook Air, 1.6GHz Core 2 Duo, system Python 2.5 on MacOSX 10.5, after saving your code in op.py I see the following timings:
$ python -mtimeit -s'import op' 'op.f1()'
10 loops, best of 3: 5.58 sec per loop
$ python -mtimeit -s'import op' 'op.f2()'
10 loops, best of 3: 3.15 sec per loop
So, my machine is slower than yours by a factor of a bit more than 1.9.
The fastest code I have for this task is:
def f3(x=x,y=y,n=n,z=z):
    rows = [[0]*y for i in range(x)]
    rr = random.randrange
    inc = (1).__add__
    sat = (0xff).__and__
    for i in range(n):
        inputX, inputY = rr(x), rr(y)
        b = max(0, inputX - z)
        t = min(inputX + z, x)
        l = max(0, inputY - z)
        r = min(inputY + z, y)
        for i in range(b, t):
            rows[i][l:r] = map(inc, rows[i][l:r])
    for i in range(x):
        rows[i] = map(sat, rows[i])
which times as:
$ python -mtimeit -s'import op' 'op.f3()'
10 loops, best of 3: 3 sec per loop
so, a very modest speedup, projecting to more than 1.5 seconds on your machine - well above the 1.0 you're aiming for:-(.
With a simple C-coded extension, exte.c:
#include "Python.h"
static PyObject*
dopoint(PyObject* self, PyObject* args)
{
int x, y, z, px, py;
int b, t, l, r;
int i, j;
PyObject* rows;
if(!PyArg_ParseTuple(args, "iiiiiO",
&x, &y, &z, &px, &py, &rows
))
return 0;
b = px - z;
if (b < 0) b = 0;
t = px + z;
if (t > x) t = x;
l = py - z;
if (l < 0) l = 0;
r = py + z;
if (r > y) r = y;
for(i = b; i < t; ++i) {
PyObject* row = PyList_GetItem(rows, i);
for(j = l; j < r; ++j) {
PyObject* pyitem = PyList_GetItem(row, j);
long item = PyInt_AsLong(pyitem);
if (item < 255) {
PyObject* newitem = PyInt_FromLong(item + 1);
PyList_SetItem(row, j, newitem);
}
}
}
Py_RETURN_NONE;
}
static PyMethodDef exteMethods[] = {
{"dopoint", dopoint, METH_VARARGS, "process a point"},
{0}
};
void
initexte()
{
Py_InitModule("exte", exteMethods);
}
(note: I haven't checked it carefully -- I think it doesn't leak memory due to the correct interplay of reference stealing and borrowing, but it should be code inspected very carefully before being put in production;-), we could do
import exte
def f4(x=x,y=y,n=n,z=z):
    rows = [[0]*y for i in range(x)]
    rr = random.randrange
    for i in range(n):
        inputX, inputY = rr(x), rr(y)
        exte.dopoint(x, y, z, inputX, inputY, rows)
and the timing
$ python -mtimeit -s'import op' 'op.f4()'
10 loops, best of 3: 345 msec per loop
shows an acceleration of 8-9 times, which should put you in the ballpark you desire. I've seen a comment saying you don't want any third-party extension, but, well, this tiny extension you could make entirely your own;-). ((Not sure what licensing conditions apply to code on Stack Overflow, but I'll be glad to re-release this under the Apache 2 license or the like, if you need that;-)).
1. A (smaller) speedup could definitely be the initialization of your rows...
Replace
rows = []
for i in range(x):
    rows.append([0 for i in xrange(y)])
with
rows = [[0] * y for i in xrange(x)]
2. You can also avoid some lookups by moving random.random out of the loops (saves a little).
3. EDIT: after corrections -- you could arrive at something like this:
def f(x,y,n,z):
    rows = [[0] * y for i in xrange(x)]
    rn = random.random
    for i in xrange(n):
        topleft = (int(x*rn()) - z, int(y*rn()) - z)
        l = max(0, topleft[1])
        r = min(topleft[1]+(z*2), y)
        for u in xrange(max(0, topleft[0]), min(topleft[0]+(z*2), x)):
            rows[u][l:r] = [j+(j<255) for j in rows[u][l:r]]
EDIT: some new timings with timeit (10 runs) -- seems this provides only minor speedups:
import timeit
print timeit.Timer("f1(1024,1024,400,75)", "from __main__ import f1").timeit(10)
print timeit.Timer("f2(1024,1024,400,75)", "from __main__ import f2").timeit(10)
print timeit.Timer("f(1024,1024,400,75)", "from __main__ import f").timeit(10)
f1 21.1669280529
f2 12.9376120567
f 11.1249599457
In your f3 rewrite, g can be simplified (this can also be applied to f4).
You have the following code inside a for loop.
l = max(0, topleft[1])
r = min(topleft[1]+(75*2), 1024)
However, it appears that those values never change inside the for loop. So calculate them once, outside the loop instead.
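A Python 3 sketch of that hoisting, assuming the same half-open neighborhood [p - z, p + z) the other versions use: the column bounds l and r depend only on the input point, so they are computed once per point rather than once per row. The name `apply_point` and its signature are illustrative, and the mesh size is taken from the list itself instead of the hardcoded 1024.

```python
def apply_point(rows, px, py, z, cap=255):
    """Increment the neighborhood of (px, py) in place, saturating at cap."""
    # loop invariants: the column window is the same for every row
    l = max(0, py - z)
    r = min(py + z, len(rows[0]))
    for i in range(max(0, px - z), min(px + z, len(rows))):
        row = rows[i]
        # True/False add as 1/0, so values stop growing once they reach cap
        row[l:r] = [v + (v < cap) for v in row[l:r]]
```

Binding `rows[i]` to a local name also saves one indexing operation per row, in the same spirit as the other local-lookup tricks in this thread.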
Based on your f3 version, I played with the code. As l and r are constants, you can avoid computing them inside the loop in g1. Also, using the new ternary if instead of min and max seems to be consistently faster. I also simplified the expression with topleft. On my system, the code below appears to be about 20% faster.
def f3b(x,y,n,z):
    rows = [g1(x, y, z) for x, y in [(int(x*random.random()), int(y*random.random())) for i in range(n)]]
def g1(x, y, z):
    l = y - z if y - z > 0 else 0
    r = y + z if y + z < 1024 else 1024
    for i in xrange(x - z if x - z > 0 else 0, x + z if x + z < 1024 else 1024 ):
        rows[i][l:r] = [j+(j<255) for j in rows[i][l:r]]
You can create your own Python module in C, and control the performance as you want:
http://docs.python.org/extending/
