Fast bounding of data in R - performance

Suppose I have a long vector, vec (starting at 1e8 entries), that I would like to bound to the range [a,b]. I can certainly code vec[vec < a] = a and vec[vec > b] = b, but this requires two passes over the data and a large RAM allocation for the temporary indicator vector (~800MB, twice). The two passes burn time, because we could do better by copying data from main memory to the local cache just once (calls to main memory are bad, as are cache misses). And who knows how much this could be improved with multiple threads, but let's not get greedy. :)
Is there a nice implementation in base R or some package that I'm overlooking, or is this a job for Rcpp (or my old friend data.table)?

A naive C solution is
library(inline)
body4 <- "
    R_len_t len = Rf_length(x);
    SEXP result = Rf_allocVector(REALSXP, len);
    const double aa = REAL(a)[0], bb = REAL(b)[0], *xp = REAL(x);
    double *rp = REAL(result);
    for (int i = 0; i < len; ++i)
        if (xp[i] < aa)
            rp[i] = aa;
        else if (xp[i] > bb)
            rp[i] = bb;
        else
            rp[i] = xp[i];
    return result;
"
fun4 <-
cfunction(c(x="numeric", a="numeric", b="numeric"), body4,
language="C")
With a simple parallel version (as Dirk points out, this requires CFLAGS = -fopenmp in ~/.R/Makevars and a platform / compiler that supports OpenMP):
body5 <- "
    R_len_t len = Rf_length(x);
    const double aa = REAL(a)[0], bb = REAL(b)[0], *xp = REAL(x);
    SEXP result = Rf_allocVector(REALSXP, len);
    double *rp = REAL(result);
    #pragma omp parallel for
    for (int i = 0; i < len; ++i)
        if (xp[i] < aa)
            rp[i] = aa;
        else if (xp[i] > bb)
            rp[i] = bb;
        else
            rp[i] = xp[i];
    return result;
"
fun5 <-
cfunction(c(x="numeric", a="numeric", b="numeric"), body5,
language="C")
And benchmarks
> z <- runif(1e7)
> benchmark(fun1(z,0.25,0.75), fun4(z, .25, .75), fun5(z, .25, .75),
+ replications=10)
test replications elapsed relative user.self sys.self
1 fun1(z, 0.25, 0.75) 10 9.087 14.609325 8.335 0.739
2 fun4(z, 0.25, 0.75) 10 1.505 2.419614 1.305 0.198
3 fun5(z, 0.25, 0.75) 10 0.622 1.000000 2.156 0.320
user.child sys.child
1 0 0
2 0 0
3 0 0
> identical(res1 <- fun1(z,0.25,0.75), fun4(z,0.25,0.75))
[1] TRUE
> identical(res1, fun5(z, 0.25, 0.75))
[1] TRUE
on my quad-core laptop. Assumes numeric input, no error checking, NA handling, etc.

Just to start things off: not much difference between your solution and the pmin/pmax solution (trying things out with n=1e7 rather than n=1e8 because I'm impatient) -- pmin/pmax is actually marginally slower.
fun1 <- function(x,a,b) {x[x<a] <- a; x[x>b] <- b; x}
fun2 <- function(x,a,b) pmin(pmax(x,a),b)
library(rbenchmark)
z <- runif(1e7)
benchmark(fun1(z,0.25,0.75),fun2(z,0.25,0.75),replications=10)
test replications elapsed relative user.self sys.self
1 fun1(z, 0.25, 0.75) 10 21.607 1.00000 6.556 15.001
2 fun2(z, 0.25, 0.75) 10 23.336 1.08002 5.656 17.605


Subset sum with maximum equal sums and without using all elements

You are given a set of integers and your task is the following: split them into 2 subsets with an equal sum in such way that these sums are maximal. You are allowed not to use all given integers, that's fine. If it's just impossible, report error somehow.
My approach is rather straightforward: at each step, we pick a single item, mark it as visited, update the current sum and pick another item recursively. Finally, we try skipping the current element.
It works on simpler test cases, but it fails one:
T = 1
N = 25
Elements: 5 27 24 12 12 2 15 25 32 21 37 29 20 9 24 35 26 8 31 5 25 21 28 3 5
One can run it as follows:
1 25 5 27 24 12 12 2 15 25 32 21 37 29 20 9 24 35 26 8 31 5 25 21 28 3 5
I expect the sum to equal 239, but the algorithm fails to find such a solution.
I've ended up with the following code:
#include <iostream>
#include <unordered_set>

using namespace std;

unordered_set<uint64_t> visited;

const int max_N = 50;
int data[max_N];
int p1[max_N];
int p2[max_N];
int out1[max_N];
int out2[max_N];
int n1 = 0;
int n2 = 0;
int o1 = 0;
int o2 = 0;
int N = 0;

void max_sum(int16_t &sum_out, int16_t sum1 = 0, int16_t sum2 = 0, int idx = 0) {
    if (idx < 0 || idx > N) return;
    if (sum1 == sum2 && sum1 > sum_out) {
        sum_out = sum1;
        o1 = n1;
        o2 = n2;
        for (int i = 0; i < n1; ++i) {
            out1[i] = p1[i];
        }
        for (int i = 0; i < n2; ++i) {
            out2[i] = p2[i];
        }
    }
    if (idx == N) return;
    uint64_t key = (static_cast<uint64_t>(sum1) << 48) | (static_cast<uint64_t>(sum2) << 32) | idx;
    if (visited.find(key) != visited.end()) return;
    visited.insert(key);
    p1[n1] = data[idx];
    ++n1;
    max_sum(sum_out, sum1 + data[idx], sum2, idx + 1);
    --n1;
    p2[n2] = data[idx];
    ++n2;
    max_sum(sum_out, sum1, sum2 + data[idx], idx + 1);
    --n2;
    max_sum(sum_out, sum1, sum2, idx + 1);
}

int main() {
    int T = 0;
    cin >> T;
    for (int t = 1; t <= T; ++t) {
        int16_t sum_out;
        cin >> N;
        for (int i = 0; i < N; ++i) {
            cin >> data[i];
        }
        n1 = 0;
        n2 = 0;
        o1 = 0;
        o2 = 0;
        max_sum(sum_out);
        int res = 0;
        int res2 = 0;
        for (int i = 0; i < o1; ++i) res += out1[i];
        for (int i = 0; i < o2; ++i) res2 += out2[i];
        if (res != res2) cerr << "ERROR: " << "res1 = " << res << "; res2 = " << res2 << '\n';
        cout << "#" << t << " " << res << '\n';
        visited.clear();
    }
}
I have the following questions:
Could someone help me to troubleshoot the failing test? Are there any obvious problems?
How could I get rid of unordered_set for marking already visited sums? I prefer to use plain C.
Is there a better approach? Maybe using dynamic programming?
Another approach is to consider all the numbers in [1, 2^N - 2].
Map the position of each bit to the position of an element, then iterate over all numbers in [1, 2^N - 2] and check each one.
If a bit is set, count that element in set1, otherwise put it in set2; then check whether the sums of both sets are equal. This way you get all possible sets; if you just want one, break as soon as you find it.
1) Could someone help me to troubleshoot the failing test? Are there any obvious problems?
The only issue I could see is that you have not set sum_out to 0.
When I tried running the program it seemed to work correctly for your test case.
2) How could I get rid of unordered_set for marking already visited sums? I prefer to use plain C.
See the answer to question 3
3) Is there a better approach? Maybe using dynamic programming?
You are currently keeping track of whether you have seen each choice of value for first subset, value for second subset, amount through array.
If instead you keep track of the difference between the values then the complexity significantly reduces.
In particular, you can use dynamic programming to store an array A[diff] that for each value of the difference either stores -1 (to indicate that the difference is not reachable), or the greatest value of subset1 when the difference between subset1 and subset2 is exactly equal to diff.
You can then iterate over the entries in the input and update the array based on either assigning each element to subset1/subset2/ or not at all. (Note you need to make a new copy of the array when computing this update.)
In this form there is no use of unordered_set because you can simply use a straight C array. There is also no distinction between subset1 and subset2, so you need only keep the non-negative differences.
Example Python Code
from collections import defaultdict

data = map(int, "5 27 24 12 12 2 15 25 32 21 37 29 20 9 24 35 26 8 31 5 25 21 28 3 5".split())

A = defaultdict(int)  # Map from difference to best value of subset sum 1
A[0] = 0              # We start with a difference of 0
for a in data:
    A2 = defaultdict(int)
    def add(s1, s2):
        if s1 > s2:
            s1, s2 = s2, s1
        d = s2 - s1
        if d in A2:
            A2[d] = max(A2[d], s1)
        else:
            A2[d] = s1
    for diff, sum1 in A.items():
        sum2 = sum1 + diff
        add(sum1, sum2)
        add(sum1 + a, sum2)
        add(sum1, sum2 + a)
    A = A2
print A[0]
This prints 239 as the answer.
For simplicity I haven't bothered with the optimization of using a linear array instead of the dictionary.
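For illustration, here is a rough sketch of that linear-array variant (the same DP, indexed by the non-negative difference, with -1 marking unreachable differences; the code below is only an illustration, not part of the answer above):

data = [5, 27, 24, 12, 12, 2, 15, 25, 32, 21, 37, 29, 20, 9,
        24, 35, 26, 8, 31, 5, 25, 21, 28, 3, 5]
total = sum(data)
A = [-1] * (total + 1)   # A[d] = best subset1 sum with difference exactly d
A[0] = 0
for a in data:
    A2 = A[:]            # copy, so each element is considered at most once
    for d in range(total + 1):
        if A[d] < 0:
            continue
        s1, s2 = A[d], A[d] + d
        # assign a to subset1 or to subset2 (skipping it is covered by the copy)
        for n1, n2 in ((s1 + a, s2), (s1, s2 + a)):
            lo, hi = min(n1, n2), max(n1, n2)
            if lo > A2[hi - lo]:
                A2[hi - lo] = lo
    A = A2
print(A[0])  # 239, as with the dictionary version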
A very different approach would be to use a constraint or mixed integer solver. Here is a possible formulation.
Let
x(i,g) = 1 if value v(i) belongs to group g
0 otherwise
The optimization model can look like:
max s
s = sum(i, x(i,g)*v(i)) for all g
sum(g, x(i,g)) <= 1 for all i
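As an illustration, the same model can be sketched in Python with the PuLP package (just one possible solver interface; the listings below were produced with a different modelling system):

# Illustrative sketch of the model above using PuLP; names are my own choices.
from pulp import LpProblem, LpMaximize, LpVariable, lpSum, value

v = [5, 27, 24, 12, 12, 2, 15, 25, 32, 21, 37, 29, 20, 9,
     24, 35, 26, 8, 31, 5, 25, 21, 28, 3, 5]
groups = range(2)                       # two groups; increase for more
prob = LpProblem("equal_sum_subsets", LpMaximize)
x = LpVariable.dicts("x", (range(len(v)), groups), cat="Binary")
s = LpVariable("s", lowBound=0)
prob += s                               # max s
for g in groups:                        # s = sum(i, x(i,g)*v(i)) for all g
    prob += lpSum(x[i][g] * v[i] for i in range(len(v))) == s
for i in range(len(v)):                 # sum(g, x(i,g)) <= 1 for all i
    prob += lpSum(x[i][g] for g in groups) <= 1
prob.solve()
print(value(s))                         # should report 239 for two groups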
For two groups we get:
---- 31 VARIABLE s.L = 239.000
---- 31 VARIABLE x.L
g1 g2
i1 1
i2 1
i3 1
i4 1
i5 1
i6 1
i7 1
i8 1
i9 1
i10 1
i11 1
i12 1
i13 1
i14 1
i15 1
i16 1
i17 1
i18 1
i19 1
i20 1
i21 1
i22 1
i23 1
i25 1
We can easily do more groups. E.g. with 9 groups:
---- 31 VARIABLE s.L = 52.000
---- 31 VARIABLE x.L
g1 g2 g3 g4 g5 g6 g7 g8 g9
i2 1
i3 1
i4 1
i5 1
i6 1
i7 1
i8 1
i9 1
i10 1
i11 1
i12 1
i13 1
i14 1
i15 1
i16 1
i17 1
i19 1
i20 1
i21 1
i22 1
i23 1
i24 1
i25 1
If there is no solution, the solver will select zero elements in each group with a sum s=0.

Using multiplication instead of logical AND

Is there any reason why using multiplication instead of logical AND operator would be preferred or discouraged (using any programming language)? Example below shows that it could make the code simpler, but are there any other advantages (or disadvantages)?
int x = 1;
int y = 0;
int z = 1;
int xyz_mult = x*y*z;
int xyz_and = ((x && y) && z);
Take the simple example in R:
library(rbenchmark)
library(Rcpp)
benchmark(T*F*T, (T&&F)&&T, replications = 1e6)
## test replications elapsed relative user.self sys.self user.child sys.child
## 2 (T && F) && T 1000000 2.974 1.000 2.965 0.004 0 0
## 1 T * F * T 1000000 3.201 1.076 3.187 0.008 0 0
ANDs are slightly faster. But using Rcpp, logical AND stays marginally faster with int variables, while (counter-intuitively) multiplication becomes marginally faster with bool variables:
xyz_and_int <- cppFunction("
    int foo() {
        int x = 1;
        int y = 0;
        int z = 1;
        return (x && y) && z;
    }
")
xyz_mult_int <- cppFunction("
    int foo() {
        int x = 1;
        int y = 0;
        int z = 1;
        return x*y*z;
    }
")
xyz_and_bool <- cppFunction("
    int foo() {
        bool x = 1;
        bool y = 0;
        bool z = 1;
        return (x && y) && z;
    }
")
xyz_mult_bool <- cppFunction("
    int foo() {
        bool x = 1;
        bool y = 0;
        bool z = 1;
        return x*y*z;
    }
")
And here are the simulation results:
benchmark(xyz_and_int(), xyz_mult_int(), replications = 1e6)
test replications elapsed relative user.self sys.self user.child sys.child
## 1 xyz_and_int() 1000000 3.32 1.000 3.33 0 NA NA
## 2 xyz_mult_int() 1000000 3.34 1.006 3.33 0 NA NA
benchmark(xyz_and_bool(), xyz_mult_bool(), replications = 1e6)
test replications elapsed relative user.self sys.self user.child sys.child
## 1 xyz_and_bool() 1000000 3.36 1.015 3.34 0 NA NA
## 2 xyz_mult_bool() 1000000 3.31 1.000 3.31 0 NA NA
If I'm not mistaken, multiplication is implemented with shift registers or adders of some sort. Their implementation is always more complicated than a plain AND gate, so they are "less efficient".

An efficient algorithm to calculate the integer square root (isqrt) of arbitrarily large integers

Notice
For a solution in Erlang or C / C++, go to Trial 4 below.
Wikipedia Articles
Integer square root
The definition of "integer square root" can be found in this article.
Methods of computing square roots
An algorithm that does "bit magic" can be found in this article.
[ Trial 1 : Using Library Function ]
Code
isqrt(N) when erlang:is_integer(N), N >= 0 ->
    erlang:trunc(math:sqrt(N)).
Problem
This implementation uses the sqrt() function from the C library, so it does not work with arbitrarily large integers (Note that the returned result does not match the input. The correct answer should be 12345678901234567890):
Erlang R16B03 (erts-5.10.4) [source] [64-bit] [smp:8:8] [async-threads:10] [hipe] [kernel-poll:false]
Eshell V5.10.4 (abort with ^G)
1> erlang:trunc(math:sqrt(12345678901234567890 * 12345678901234567890)).
12345678901234567168
2>
[ Trial 2 : Using Bigint + Only ]
Code
isqrt2(N) when erlang:is_integer(N), N >= 0 ->
    isqrt2(N, 0, 3, 0).

isqrt2(N, I, _, Result) when I >= N ->
    Result;
isqrt2(N, I, Times, Result) ->
    isqrt2(N, I + Times, Times + 2, Result + 1).
Description
This implementation is based on the following observation:
isqrt(0) = 0 # <--- One 0
isqrt(1) = 1 # <-+
isqrt(2) = 1 # |- Three 1's
isqrt(3) = 1 # <-+
isqrt(4) = 2 # <-+
isqrt(5) = 2 # |
isqrt(6) = 2 # |- Five 2's
isqrt(7) = 2 # |
isqrt(8) = 2 # <-+
isqrt(9) = 3 # <-+
isqrt(10) = 3 # |
isqrt(11) = 3 # |
isqrt(12) = 3 # |- Seven 3's
isqrt(13) = 3 # |
isqrt(14) = 3 # |
isqrt(15) = 3 # <-+
isqrt(16) = 4 # <--- Nine 4's
...
Problem
This implementation involves only bigint additions so I expected it to run fast. However, when I fed it with 1111111111111111111111111111111111111111 * 1111111111111111111111111111111111111111, it seems to run forever on my (very fast) machine.
[ Trial 3 : Using Binary Search with Bigint +1, -1 and div 2 Only ]
Code
Variant 1 (My original implementation)
isqrt3(N) when erlang:is_integer(N), N >= 0 ->
    isqrt3(N, 1, N).

isqrt3(_N, Low, High) when High =:= Low + 1 ->
    Low;
isqrt3(N, Low, High) ->
    Mid = (Low + High) div 2,
    MidSqr = Mid * Mid,
    if
        %% This also catches N = 0 or 1
        MidSqr =:= N ->
            Mid;
        MidSqr < N ->
            isqrt3(N, Mid, High);
        MidSqr > N ->
            isqrt3(N, Low, Mid)
    end.
Variant 2 (the above code, modified so that the boundaries move to Mid + 1 or Mid - 1 instead, following the answer by Vikram Bhat)
isqrt3a(N) when erlang:is_integer(N), N >= 0 ->
    isqrt3a(N, 1, N).

isqrt3a(N, Low, High) when Low >= High ->
    HighSqr = High * High,
    if
        HighSqr > N ->
            High - 1;
        HighSqr =< N ->
            High
    end;
isqrt3a(N, Low, High) ->
    Mid = (Low + High) div 2,
    MidSqr = Mid * Mid,
    if
        %% This also catches N = 0 or 1
        MidSqr =:= N ->
            Mid;
        MidSqr < N ->
            isqrt3a(N, Mid + 1, High);
        MidSqr > N ->
            isqrt3a(N, Low, Mid - 1)
    end.
Problem
Now it solves the 79-digit number (namely 1111111111111111111111111111111111111111 * 1111111111111111111111111111111111111111) at lightning speed; the result is shown immediately. However, it takes 60 seconds (+- 2 seconds) on my machine to solve one million (1,000,000) 61-digit numbers (namely, from 1000000000000000000000000000000000000000000000000000000000000 to 1000000000000000000000000000000000000000000000000000001000000). I would like to do it even faster.
[ Trial 4 : Using Newton's Method with Bigint + and div Only ]
Code
isqrt4(0) -> 0;
isqrt4(N) when erlang:is_integer(N), N >= 0 ->
    isqrt4(N, N).

isqrt4(N, Xk) ->
    Xk1 = (Xk + N div Xk) div 2,
    if
        Xk1 >= Xk ->
            Xk;
        Xk1 < Xk ->
            isqrt4(N, Xk1)
    end.
Code in C / C++ (for your interest)
Recursive variant
#include <stdint.h>

uint32_t isqrt_impl(
    uint64_t const n,
    uint64_t const xk)
{
    uint64_t const xk1 = (xk + n / xk) / 2;
    return (xk1 >= xk) ? xk : isqrt_impl(n, xk1);
}

uint32_t isqrt(uint64_t const n)
{
    if (n == 0) return 0;
    if (n == 18446744073709551615ULL) return 4294967295U;
    return isqrt_impl(n, n);
}
Iterative variant
#include <stdint.h>

uint32_t isqrt_iterative(uint64_t const n)
{
    uint64_t xk = n;
    if (n == 0) return 0;
    if (n == 18446744073709551615ULL) return 4294967295U;
    do
    {
        uint64_t const xk1 = (xk + n / xk) / 2;
        if (xk1 >= xk)
        {
            return xk;
        }
        else
        {
            xk = xk1;
        }
    } while (1);
}
Problem
The Erlang code solves one million (1,000,000) 61-digit numbers in 40 seconds (+- 1 second) on my machine, so this is faster than Trial 3. Can it go even faster?
About My Machine
Processor : 3.4 GHz Intel Core i7
Memory : 32 GB 1600 MHz DDR3
OS : Mac OS X Version 10.9.1
Related Questions
Integer square root in python
The answer by user448810 uses "Newton's Method". I'm not sure whether doing the division using "integer division" is okay or not. I'll try this later as an update. [UPDATE (2015-01-11): It is okay to do so; see the short Python sketch after this list.]
The answer by math involves using a 3rd party Python package gmpy, which is not very favourable to me, since I'm primarily interested in solving it in Erlang with only builtin facilities.
The answer by DSM seems interesting. I don't really understand what is going on, but it seems that "bit magic" is involved there, and so it's not quite suitable for me too.
Infinite Recursion in Meta Integer Square Root
This question is for C++, and the algorithm by AraK (the questioner) looks like it's from the same idea as Trial 2 above.
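For reference, here is a rough Python sketch of that Newton iteration using integer division only, following the same scheme as Trial 4 (the sketch is mine, not from any of the linked answers):

def isqrt(n):
    # Integer Newton iteration: x1 = (x + n // x) // 2, starting at x = n.
    # The first step at which the iterate stops decreasing is floor(sqrt(n)).
    if n < 0:
        raise ValueError("isqrt is undefined for negative numbers")
    if n == 0:
        return 0
    x = n
    while True:
        x1 = (x + n // x) // 2
        if x1 >= x:
            return x
        x = x1

print(isqrt(12345678901234567890 * 12345678901234567890))  # 12345678901234567890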
How about a binary search like the following? It needs no floating-point divisions, only integer multiplications (though it is slower than Newton's method):
low = 1;
/* More efficient bound:
   high = pow(10, log10(target)/2 + 1);
*/
high = target;
while (low < high) {
    mid = (low + high) / 2;
    currsq = mid * mid;
    if (currsq == target) {
        return(mid);
    }
    if (currsq < target) {
        if ((mid + 1) * (mid + 1) > target) {
            return(mid);
        }
        low = mid + 1;
    }
    else {
        high = mid - 1;
    }
}
This runs for O(log N) iterations, so it should not run forever even for very large numbers.
Log10(target) computation, if needed:
acc = target;
log10 = 0;
while (acc > 0) {
    log10 = log10 + 1;
    acc = acc / 10;
}
Note: acc/10 is integer division.
Edit:
Efficient bound: sqrt(n) has about half as many digits as n, so you can pass high = 10^(log10(N)/2 + 1) and low = 10^(log10(N)/2 - 1) to get a tighter bound, which should give roughly a 2x speedup.
Evaluate the bound:
bound = 1;
acc = N;
count = 0;
while (acc > 0) {
    acc = acc / 10;
    if (count % 2 == 0) {
        bound = bound * 10;
    }
    count++;
}
high = bound * 10;
low = bound / 10;
isqrt(N, low, high);

Minimum distance between 2 times on clock board

Given 2 times (as int) on clock board, I have to calculate the minimum distance between them.
For example -
d(12,1) = 1 //not 11
d(3,5) = 2
d(10,10) = 0
What is the fastest way for that ?
If a and b are from 1 to 12:
min(abs(a - b), 12 - abs(a - b))
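A quick Python sketch of that formula (assuming a and b are hours in 1..12), checked against the examples from the question:

def clock_distance(a, b):
    # Walk the shorter way around a 12-hour clock face.
    d = abs(a - b)
    return min(d, 12 - d)

print(clock_distance(12, 1))   # 1
print(clock_distance(3, 5))    # 2
print(clock_distance(10, 10))  # 0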
What have you tried?
Dim dif = Math.Min((t1 - t2 + 12) Mod 12, (t2 - t1 + 12) Mod 12)
Pure arithmetic (without any libraries):
int d(int first, int second) {
    int temp = first - second;
    if (temp < 0) temp = -temp;
    int distance = temp > 6 ? 12 - temp : temp;
    return distance;
}

Python performance: iteration and operations on nested lists

Problem
Hey folks, I'm looking for some advice on Python performance. Some background on my problem:
Given:
A (x,y) mesh of nodes each with a value (0...255) starting at 0
A list of N input coordinates each at a specified location within the range (0...x, 0...y)
A value Z that defines the "neighborhood" in count of nodes
Increment the value of the node at the input coordinate and the node's neighbors. Neighbors beyond the mesh edge are ignored. (No wrapping)
BASE CASE: A mesh of size 1024x1024 nodes, with 400 input coordinates and a range Z of 75 nodes.
Processing should be O(x*y*Z*N). I expect x, y and Z to remain roughly around the values in the base case, but the number of input coordinates N could increase up to 100,000. My goal is to minimize processing time.
Current results
Between my start and the comments below, we've got several implementations.
Running speed on my 2.26 GHz Intel Core 2 Duo with Python 2.6.1:
f1: 2.819s
f2: 1.567s
f3: 1.593s
f: 1.579s
f3b: 1.526s
f4: 0.978s
f1 is the initial naive implementation: three nested for loops.
f2 replaces the inner for loop with a list comprehension.
f3 is based on Andrei's suggestion in the comments and replaces the outer for with map()
f is Chris's suggestion in the answers below
f3b is kriss's take on f3
f4 is Alex's contribution.
Code is included below for your perusal.
Question
How can I further reduce the processing time? I'd prefer sub-1.0s for the test parameters.
Please, keep the recommendations to native Python. I know I can move to a third-party package such as numpy, but I'm trying to avoid any third party packages. Also, I've generated random input coordinates, and simplified the definition of the node value updates to keep our discussion simple. The specifics have to change slightly and are outside the scope of my question.
thanks much!
f1 is the initial naive implementation: three nested for loops.
def f1(x,y,n,z):
    rows = [[0]*x for i in xrange(y)]
    for i in range(n):
        inputX, inputY = (int(x*random.random()), int(y*random.random()))
        topleft = (inputX - z, inputY - z)
        for i in xrange(max(0, topleft[0]), min(topleft[0]+(z*2), x)):
            for j in xrange(max(0, topleft[1]), min(topleft[1]+(z*2), y)):
                if rows[i][j] <= 255: rows[i][j] += 1
f2 replaces the inner for loop with a list comprehension.
def f2(x,y,n,z):
    rows = [[0]*x for i in xrange(y)]
    for i in range(n):
        inputX, inputY = (int(x*random.random()), int(y*random.random()))
        topleft = (inputX - z, inputY - z)
        for i in xrange(max(0, topleft[0]), min(topleft[0]+(z*2), x)):
            l = max(0, topleft[1])
            r = min(topleft[1]+(z*2), y)
            rows[i][l:r] = [j+(j<255) for j in rows[i][l:r]]
UPDATE: f3 is based on Andrei's suggestion in the comments and replaces the outer for with map(). My first hack at this required several out-of-local-scope lookups, which Guido specifically recommends against ("local variable lookups are much faster than global or built-in variable lookups"), so I hardcoded all but the reference to the main data structure itself to minimize that overhead.
rows = [[0]*x for i in xrange(y)]

def f3(x,y,n,z):
    inputs = [(int(x*random.random()), int(y*random.random())) for i in range(n)]
    rows = map(g, inputs)

def g(input):
    inputX, inputY = input
    topleft = (inputX - 75, inputY - 75)
    for i in xrange(max(0, topleft[0]), min(topleft[0]+(75*2), 1024)):
        l = max(0, topleft[1])
        r = min(topleft[1]+(75*2), 1024)
        rows[i][l:r] = [j+(j<255) for j in rows[i][l:r]]
UPDATE3: ChristopheD also pointed out a couple of improvements.
def f(x,y,n,z):
    rows = [[0] * y for i in xrange(x)]
    rn = random.random
    for i in xrange(n):
        topleft = (int(x*rn()) - z, int(y*rn()) - z)
        l = max(0, topleft[1])
        r = min(topleft[1]+(z*2), y)
        for u in xrange(max(0, topleft[0]), min(topleft[0]+(z*2), x)):
            rows[u][l:r] = [j+(j<255) for j in rows[u][l:r]]
UPDATE4: kriss added a few improvements to f3, replacing min/max with the new ternary operator syntax.
def f3b(x,y,n,z):
    rn = random.random
    rows = [g1(x, y, z) for x, y in [(int(x*rn()), int(y*rn())) for i in xrange(n)]]

def g1(x, y, z):
    l = y - z if y - z > 0 else 0
    r = y + z if y + z < 1024 else 1024
    for i in xrange(x - z if x - z > 0 else 0, x + z if x + z < 1024 else 1024):
        rows[i][l:r] = [j+(j<255) for j in rows[i][l:r]]
UPDATE5: Alex weighed in with his substantive revision, adding a separate map() operation to cap the values at 255 and removing all non-local-scope lookups. The perf differences are non-trivial.
def f4(x,y,n,z):
    rows = [[0]*y for i in range(x)]
    rr = random.randrange
    inc = (1).__add__
    sat = (0xff).__and__
    for i in range(n):
        inputX, inputY = rr(x), rr(y)
        b = max(0, inputX - z)
        t = min(inputX + z, x)
        l = max(0, inputY - z)
        r = min(inputY + z, y)
        for i in range(b, t):
            rows[i][l:r] = map(inc, rows[i][l:r])
    for i in range(x):
        rows[i] = map(sat, rows[i])
Also, since we all seem to be hacking around with variations, here's my test harness to compare speeds: (improved by ChristopheD)
def timing(f,x,y,z,n):
    fn = "%s(%d,%d,%d,%d)" % (f.__name__, x, y, z, n)
    ctx = "from __main__ import %s" % f.__name__
    results = timeit.Timer(fn, ctx).timeit(10)
    return "%4.4s: %.3f" % (f.__name__, results / 10.0)

if __name__ == "__main__":
    print timing(f, 1024, 1024, 400, 75)
    #add more here.
On my (slow-ish;-) first-day Macbook Air, 1.6GHz Core 2 Duo, system Python 2.5 on MacOSX 10.5, after saving your code in op.py I see the following timings:
$ python -mtimeit -s'import op' 'op.f1()'
10 loops, best of 3: 5.58 sec per loop
$ python -mtimeit -s'import op' 'op.f2()'
10 loops, best of 3: 3.15 sec per loop
So, my machine is slower than yours by a factor of a bit more than 1.9.
The fastest code I have for this task is:
def f3(x=x,y=y,n=n,z=z):
    rows = [[0]*y for i in range(x)]
    rr = random.randrange
    inc = (1).__add__
    sat = (0xff).__and__
    for i in range(n):
        inputX, inputY = rr(x), rr(y)
        b = max(0, inputX - z)
        t = min(inputX + z, x)
        l = max(0, inputY - z)
        r = min(inputY + z, y)
        for i in range(b, t):
            rows[i][l:r] = map(inc, rows[i][l:r])
    for i in range(x):
        rows[i] = map(sat, rows[i])
which times as:
$ python -mtimeit -s'import op' 'op.f3()'
10 loops, best of 3: 3 sec per loop
so, a very modest speedup, projecting to more than 1.5 seconds on your machine - well above the 1.0 you're aiming for:-(.
With a simple C-coded extension, exte.c...:
#include "Python.h"
static PyObject*
dopoint(PyObject* self, PyObject* args)
{
int x, y, z, px, py;
int b, t, l, r;
int i, j;
PyObject* rows;
if(!PyArg_ParseTuple(args, "iiiiiO",
&x, &y, &z, &px, &py, &rows
))
return 0;
b = px - z;
if (b < 0) b = 0;
t = px + z;
if (t > x) t = x;
l = py - z;
if (l < 0) l = 0;
r = py + z;
if (r > y) r = y;
for(i = b; i < t; ++i) {
PyObject* row = PyList_GetItem(rows, i);
for(j = l; j < r; ++j) {
PyObject* pyitem = PyList_GetItem(row, j);
long item = PyInt_AsLong(pyitem);
if (item < 255) {
PyObject* newitem = PyInt_FromLong(item + 1);
PyList_SetItem(row, j, newitem);
}
}
}
Py_RETURN_NONE;
}
static PyMethodDef exteMethods[] = {
{"dopoint", dopoint, METH_VARARGS, "process a point"},
{0}
};
void
initexte()
{
Py_InitModule("exte", exteMethods);
}
(note: I haven't checked it carefully -- I think it doesn't leak memory due to the correct interplay of reference stealing and borrowing, but it should be code inspected very carefully before being put in production;-), we could do
import exte

def f4(x=x,y=y,n=n,z=z):
    rows = [[0]*y for i in range(x)]
    rr = random.randrange
    for i in range(n):
        inputX, inputY = rr(x), rr(y)
        exte.dopoint(x, y, z, inputX, inputY, rows)
and the timing
$ python -mtimeit -s'import op' 'op.f4()'
10 loops, best of 3: 345 msec per loop
shows an acceleration of 8-9 times, which should put you in the ballpark you desire. I've seen a comment saying you don't want any third-party extension, but, well, this tiny extension you could make entirely your own;-). ((Not sure what licensing conditions apply to code on Stack Overflow, but I'll be glad to re-release this under the Apache 2 license or the like, if you need that;-)).
1. A (smaller) speedup could definitely be the initialization of your rows...
Replace
rows = []
for i in range(x):
    rows.append([0 for i in xrange(y)])
with
rows = [[0] * y for i in xrange(x)]
2. You can also avoid some lookups by moving random.random out of the loops (saves a little).
3. EDIT: after corrections -- you could arrive at something like this:
def f(x,y,n,z):
    rows = [[0] * y for i in xrange(x)]
    rn = random.random
    for i in xrange(n):
        topleft = (int(x*rn()) - z, int(y*rn()) - z)
        l = max(0, topleft[1])
        r = min(topleft[1]+(z*2), y)
        for u in xrange(max(0, topleft[0]), min(topleft[0]+(z*2), x)):
            rows[u][l:r] = [j+(j<255) for j in rows[u][l:r]]
EDIT: some new timings with timeit (10 runs) -- seems this provides only minor speedups:
import timeit
print timeit.Timer("f1(1024,1024,400,75)", "from __main__ import f1").timeit(10)
print timeit.Timer("f2(1024,1024,400,75)", "from __main__ import f2").timeit(10)
print timeit.Timer("f(1024,1024,400,75)", "from __main__ import f3").timeit(10)
f1 21.1669280529
f2 12.9376120567
f 11.1249599457
In your f3 rewrite, g can be simplified (this can also be applied to f4).
You have the following code inside a for loop.
l = max(0, topleft[1])
r = min(topleft[1]+(75*2), 1024)
However, it appears that those values never change inside the for loop. So calculate them once, outside the loop instead.
Based on your f3 version I played with the code. As l and r are constants, you can avoid computing them inside the g1 loop. Using the new ternary if instead of min and max also seems to be consistently faster, and I simplified the expression involving topleft. On my system the code below appears to be about 20% faster.
def f3b(x,y,n,z):
    rows = [g1(x, y, z) for x, y in [(int(x*random.random()), int(y*random.random())) for i in range(n)]]

def g1(x, y, z):
    l = y - z if y - z > 0 else 0
    r = y + z if y + z < 1024 else 1024
    for i in xrange(x - z if x - z > 0 else 0, x + z if x + z < 1024 else 1024):
        rows[i][l:r] = [j+(j<255) for j in rows[i][l:r]]
You can create your own Python module in C, and control the performance as you want:
http://docs.python.org/extending/
