Python performance: iteration and operations on nested lists

Python performance: iteration and operations on nested lists - performance

Problem Hey folks. I'm looking for some advice on python performance. Some background on my problem:
Given:
A (x,y) mesh of nodes each with a value (0...255) starting at 0
A list of N input coordinates each at a specified location within the range (0...x, 0...y)
A value Z that defines the "neighborhood" in count of nodes
Increment the value of the node at the input coordinate and the node's neighbors. Neighbors beyond the mesh edge are ignored. (No wrapping)
BASE CASE: A mesh of size 1024x1024 nodes, with 400 input coordinates and a range Z of 75 nodes.
Processing should be O(x*y*Z*N). I expect x, y and Z to remain roughly around the values in the base case, but the number of input coordinates N could increase up to 100,000. My goal is to minimize processing time.
Current results Between my start and the comments below, we've got several implementations.
Running speed on my 2.26 GHz Intel Core 2 Duo with Python 2.6.1:
f1: 2.819s
f2: 1.567s
f3: 1.593s
f: 1.579s
f3b: 1.526s
f4: 0.978s
f1 is the initial naive implementation: three nested for loops.
f2 is replaces the inner for loop with a list comprehension.
f3 is based on Andrei's suggestion in the comments and replaces the outer for with map()
f is Chris's suggestion in the answers below
f3b is kriss's take on f3
f4 is Alex's contribution.
Code is included below for your perusal.
Question How can I further reduce the processing time? I'd prefer sub-1.0s for the test parameters.
Please, keep the recommendations to native Python. I know I can move to a third-party package such as numpy, but I'm trying to avoid any third party packages. Also, I've generated random input coordinates, and simplified the definition of the node value updates to keep our discussion simple. The specifics have to change slightly and are outside the scope of my question.
thanks much!
**`f1` is the initial naive implementation: three nested `for` loops.**
def f1(x,y,n,z):
rows = [[0]*x for i in xrange(y)]
for i in range(n):
inputX, inputY = (int(x*random.random()), int(y*random.random()))
topleft = (inputX - z, inputY - z)
for i in xrange(max(0, topleft[0]), min(topleft[0]+(z*2), x)):
for j in xrange(max(0, topleft[1]), min(topleft[1]+(z*2), y)):
if rows[i][j] <= 255: rows[i][j] += 1
f2 is replaces the inner for loop with a list comprehension.
def f2(x,y,n,z):
rows = [[0]*x for i in xrange(y)]
for i in range(n):
inputX, inputY = (int(x*random.random()), int(y*random.random()))
topleft = (inputX - z, inputY - z)
for i in xrange(max(0, topleft[0]), min(topleft[0]+(z*2), x)):
l = max(0, topleft[1])
r = min(topleft[1]+(z*2), y)
rows[i][l:r] = [j+(j<255) for j in rows[i][l:r]]
UPDATE: f3 is based on Andrei's suggestion in the comments and replaces the outer for with map(). My first hack at this requires several out-of-local-scope lookups, specifically recommended against by Guido: local variable lookups are much faster than global or built-in variable lookups I hardcoded all but the reference to the main data structure itself to minimize that overhead.
rows = [[0]*x for i in xrange(y)]
def f3(x,y,n,z):
inputs = [(int(x*random.random()), int(y*random.random())) for i in range(n)]
rows = map(g, inputs)
def g(input):
inputX, inputY = input
topleft = (inputX - 75, inputY - 75)
for i in xrange(max(0, topleft[0]), min(topleft[0]+(75*2), 1024)):
l = max(0, topleft[1])
r = min(topleft[1]+(75*2), 1024)
rows[i][l:r] = [j+(j<255) for j in rows[i][l:r]]
UPDATE3: ChristopeD also pointed out a couple improvements.
def f(x,y,n,z):
rows = [[0] * y for i in xrange(x)]
rn = random.random
for i in xrange(n):
topleft = (int(x*rn()) - z, int(y*rn()) - z)
l = max(0, topleft[1])
r = min(topleft[1]+(z*2), y)
for u in xrange(max(0, topleft[0]), min(topleft[0]+(z*2), x)):
rows[u][l:r] = [j+(j<255) for j in rows[u][l:r]]
UPDATE4: kriss added a few improvements to f3, replacing min/max with the new ternary operator syntax.
def f3b(x,y,n,z):
rn = random.random
rows = [g1(x, y, z) for x, y in [(int(x*rn()), int(y*rn())) for i in xrange(n)]]
def g1(x, y, z):
l = y - z if y - z > 0 else 0
r = y + z if y + z < 1024 else 1024
for i in xrange(x - z if x - z > 0 else 0, x + z if x + z < 1024 else 1024 ):
rows[i][l:r] = [j+(j<255) for j in rows[i][l:r]]
UPDATE5: Alex weighed in with his substantive revision, adding a separate map() operation to cap the values at 255 and removing all non-local-scope lookups. The perf differences are non-trivial.
def f4(x,y,n,z):
rows = [[0]*y for i in range(x)]
rr = random.randrange
inc = (1).__add__
sat = (0xff).__and__
for i in range(n):
inputX, inputY = rr(x), rr(y)
b = max(0, inputX - z)
t = min(inputX + z, x)
l = max(0, inputY - z)
r = min(inputY + z, y)
for i in range(b, t):
rows[i][l:r] = map(inc, rows[i][l:r])
for i in range(x):
rows[i] = map(sat, rows[i])
Also, since we all seem to be hacking around with variations, here's my test harness to compare speeds: (improved by ChristopheD)
def timing(f,x,y,z,n):
fn = "%s(%d,%d,%d,%d)" % (f.__name__, x, y, z, n)
ctx = "from __main__ import %s" % f.__name__
results = timeit.Timer(fn, ctx).timeit(10)
return "%4.4s: %.3f" % (f.__name__, results / 10.0)
if __name__ == "__main__":
print timing(f, 1024, 1024, 400, 75)
#add more here.

On my (slow-ish;-) first-day Macbook Air, 1.6GHz Core 2 Duo, system Python 2.5 on MacOSX 10.5, after saving your code in op.py I see the following timings:
$ python -mtimeit -s'import op' 'op.f1()'
10 loops, best of 3: 5.58 sec per loop
$ python -mtimeit -s'import op' 'op.f2()'
10 loops, best of 3: 3.15 sec per loop
So, my machine is slower than yours by a factor of a bit more than 1.9.
The fastest code I have for this task is:
def f3(x=x,y=y,n=n,z=z):
rows = [[0]*y for i in range(x)]
rr = random.randrange
inc = (1).__add__
sat = (0xff).__and__
for i in range(n):
inputX, inputY = rr(x), rr(y)
b = max(0, inputX - z)
t = min(inputX + z, x)
l = max(0, inputY - z)
r = min(inputY + z, y)
for i in range(b, t):
rows[i][l:r] = map(inc, rows[i][l:r])
for i in range(x):
rows[i] = map(sat, rows[i])
which times as:
$ python -mtimeit -s'import op' 'op.f3()'
10 loops, best of 3: 3 sec per loop
so, a very modest speedup, projecting to more than 1.5 seconds on your machine - well above the 1.0 you're aiming for:-(.
With a simple C-coded extensions, exte.c...:
#include "Python.h"
static PyObject*
dopoint(PyObject* self, PyObject* args)
{
int x, y, z, px, py;
int b, t, l, r;
int i, j;
PyObject* rows;
if(!PyArg_ParseTuple(args, "iiiiiO",
&x, &y, &z, &px, &py, &rows
))
return 0;
b = px - z;
if (b < 0) b = 0;
t = px + z;
if (t > x) t = x;
l = py - z;
if (l < 0) l = 0;
r = py + z;
if (r > y) r = y;
for(i = b; i < t; ++i) {
PyObject* row = PyList_GetItem(rows, i);
for(j = l; j < r; ++j) {
PyObject* pyitem = PyList_GetItem(row, j);
long item = PyInt_AsLong(pyitem);
if (item < 255) {
PyObject* newitem = PyInt_FromLong(item + 1);
PyList_SetItem(row, j, newitem);
}
}
}
Py_RETURN_NONE;
}
static PyMethodDef exteMethods[] = {
{"dopoint", dopoint, METH_VARARGS, "process a point"},
{0}
};
void
initexte()
{
Py_InitModule("exte", exteMethods);
}
(note: I haven't checked it carefully -- I think it doesn't leak memory due to the correct interplay of reference stealing and borrowing, but it should be code inspected very carefully before being put in production;-), we could do
import exte
def f4(x=x,y=y,n=n,z=z):
rows = [[0]*y for i in range(x)]
rr = random.randrange
for i in range(n):
inputX, inputY = rr(x), rr(y)
exte.dopoint(x, y, z, inputX, inputY, rows)
and the timing
$ python -mtimeit -s'import op' 'op.f4()'
10 loops, best of 3: 345 msec per loop
shows an acceleration of 8-9 times, which should put you in the ballpark you desire. I've seen a comment saying you don't want any third-party extension, but, well, this tiny extension you could make entirely your own;-). ((Not sure what licensing conditions apply to code on Stack Overflow, but I'll be glad to re-release this under the Apache 2 license or the like, if you need that;-)).

1. A (smaller) speedup could definitely be the initialization of your rows...
Replace
rows = []
for i in range(x):
rows.append([0 for i in xrange(y)])
with
rows = [[0] * y for i in xrange(x)]
2. You can also avoid some lookups by moving random.random out of the loops (saves a little).
3. EDIT: after corrections -- you could arrive at something like this:
def f(x,y,n,z):
rows = [[0] * y for i in xrange(x)]
rn = random.random
for i in xrange(n):
topleft = (int(x*rn()) - z, int(y*rn()) - z)
l = max(0, topleft[1])
r = min(topleft[1]+(z*2), y)
for u in xrange(max(0, topleft[0]), min(topleft[0]+(z*2), x)):
rows[u][l:r] = [j+(j<255) for j in rows[u][l:r]]
EDIT: some new timings with timeit (10 runs) -- seems this provides only minor speedups:
import timeit
print timeit.Timer("f1(1024,1024,400,75)", "from __main__ import f1").timeit(10)
print timeit.Timer("f2(1024,1024,400,75)", "from __main__ import f2").timeit(10)
print timeit.Timer("f(1024,1024,400,75)", "from __main__ import f3").timeit(10)
f1 21.1669280529
f2 12.9376120567
f 11.1249599457

in your f3 rewrite, g can be simplified. (Can also be applied to f4)
You have the following code inside a for loop.
l = max(0, topleft[1])
r = min(topleft[1]+(75*2), 1024)
However, it appears that those values never change inside the for loop. So calculate them once, outside the loop instead.

Based on your f3 version I played with the code. As l and r are constants you can avoid to compute them in g1 loop. Also using new ternary if instead of min and max seems to be consistently faster. Also simplified expression with topleft. On my system it appears to be about 20% faster using with the code below.
def f3b(x,y,n,z):
rows = [g1(x, y, z) for x, y in [(int(x*random.random()), int(y*random.random())) for i in range(n)]]
def g1(x, y, z):
l = y - z if y - z > 0 else 0
r = y + z if y + z < 1024 else 1024
for i in xrange(x - z if x - z > 0 else 0, x + z if x + z < 1024 else 1024 ):
rows[i][l:r] = [j+(j<255) for j in rows[i][l:r]]

You can create your own Python module in C, and control the performance as you want:
http://docs.python.org/extending/

Related

Fast summing of subarrays in Python

I have a data cube a of radius w and for every element of that cube, I would like to add the element and all surrounding values within a cube of radius r, where r < w. The result should be returned in an array of the same shape, b.
As a simple example, suppose:
a = numpy.ones(shape=(2*w,2*w,2*w),dtype='float32')
kernel = numpy.ones(shape=(2*r,2*r,2*r),dtype='float32')
b = convolve(a,kernel,mode='constant',cval=0)
then b would have the value (2r)(2r)(2r) for all the indices not on the edge.
Currently I am using a loop to do this and it is very slow, especially for larger w and r. I tried scipy convolution but got little speedup over the loop. I am now looking at numba's parallel computation feature but cannot figure out how to rewrite the code to work with numba. I have a Nvidia RTX card so CUDA GPU calculations are also possible.
Suggestions are welcome.
Here is my current code:
for x in range(0,w*2):
print(x)
for y in range(0,w*2):
for z in range(0,w*2):
if x >= r:
x1 = x - r
else:
x1 = 0
if x < w*2-r:
x2 = x + r
else:
x2 = w*2 - 1
if y >= r:
y1 = y - r
else:
y1 = 0
if y < w*2-r:
y2 = y + r
else:
y2 = w*2 - 1
if z >= r:
z1 = z - r
else:
z1 = 0
if z < w*2-r:
z2 = z + r
else:
z2 = w*2 - 1
b[x][y][z] = numpy.sum(a[x1:x2,y1:y2,z1:z2])
return b

Here is a very simple version of your code so that it works with numba. I was finding speed-ups of a factor of 10 relative to the pure numpy code. However, you should be able to get even greater speed-ups using a FFT convolution algorithm (e.g. scipy's fftconvolve). Can you share your attempt at getting convolution to work?
from numba import njit
#njit
def sum_cubes(a,b,w,r):
for x in range(0,w*2):
#print(x)
for y in range(0,w*2):
for z in range(0,w*2):
if x >= r:
x1 = x - r
else:
x1 = 0
if x < w*2-r:
x2 = x + r
else:
x2 = w*2 - 1
if y >= r:
y1 = y - r
else:
y1 = 0
if y < w*2-r:
y2 = y + r
else:
y2 = w*2 - 1
if z >= r:
z1 = z - r
else:
z1 = 0
if z < w*2-r:
z2 = z + r
else:
z2 = w*2 - 1
b[x,y,z] = np.sum(a[x1:x2,y1:y2,z1:z2])
return b
EDIT: Your original code has a small bug in it. The way numpy indexing works, the final line should be
b[x,y,z] = np.sum(a[x1:x2+1,y1:y2+1,z1:z2+1])
unless you want the cube to be off-centre.
Assuming you do want the cube to be centred, then a much faster way to do this calculation is using scipy's uniform filter:
from scipy.ndimage import uniform_filter
def sum_cubes_quickly(a,b,w,r):
b = uniform_filter(a,mode='constant',cval=0,size=2*r+1)*(2*r+1)**3
return b
A few quick runtime comparisons for randomly generated data with w = 50, r = 10:
Original raw numpy code - 15.1 sec
Numba'd numpy code - 8.1 sec
uniform_filter - 13.1 ms

For given two integers A and B, find a pair of numbers X and Y such that A = X*Y and B = X xor Y

I'm struggling with this problem I've found in a competitive programming book, but without a solution how to do it. For given two integers A and B (can fit in 64-bit integer type), where A is odd, find a pair of numbers X and Y such that A = X*Y and B = X xor Y.
My approach was to list all divisors of A and try pairing numbers under sqrt(A) with numbers over sqrt(A) that multiply up to A and see if their xor is equal to B. But I don't know if that's efficient enough.
What would be a good solution/algorithm to this problem?

You know that at least one factor is <= sqrt(A). Let's make that one X.
The length of X in bits will be about half the length of A.
The upper bits of X, therefore -- the ones higher in value than sqrt(A) -- are all 0, and the corresponding bits in B must have the same value as the corresponding bits in Y.
Knowing the upper bits of Y gives you a pretty small range for the corresponding factor X = A/Y. Calculate Xmin and Xmax corresponding to the largest and smallest possible values for Y, respectively. Remember that Xmax must also be <= sqrt(A).
Then just try all the possible Xs between Xmin and Xmax. There won't be too many, so it won't take very long.

The other straightforward way to solve this problem relies on the fact that the lower n bits of XY and X xor Y depend only on the lower n bits of X and Y. Therefore, you can use the possible answers for the lower n bits to restrict the possible answers for the lower n+1 bits, until you're done.
I've worked out that, unfortunately, there can be more than one possibility for a single n. I don't know how often there will be a lot of possibilities, but it's probably not too often if at all, so this may be fine in a competitive context. Probabilistically, there will only be a few possibilities, since a solution for n bits will provide either 0 or two solutions for n+1 bits, with equal probability.
It seems to work out pretty well for random input. Here's the code I used to test it:
public static void solve(long A, long B)
{
List<Long> sols = new ArrayList<>();
List<Long> prevSols = new ArrayList<>();
sols.add(0L);
long tests=0;
System.out.print("Solving "+A+","+B+"... ");
for (long bit=1; (A/bit)>=bit; bit<<=1)
{
tests += sols.size();
{
List<Long> t = prevSols;
prevSols = sols;
sols = t;
}
final long mask = bit|(bit-1);
sols.clear();
for (long prevx : prevSols)
{
long prevy = (prevx^B) & mask;
if ((((prevx*prevy)^A)&mask) == 0)
{
sols.add(prevx);
}
long x = prevx | bit;
long y = (x^B)&mask;
if ((((x*y)^A)&mask) == 0)
{
sols.add(x);
}
}
}
tests += sols.size();
{
List<Long> t = prevSols;
prevSols = sols;
sols = t;
}
sols.clear();
for (long testx: prevSols)
{
if (A/testx >= testx)
{
long testy = B^testx;
if (testx * testy == A)
{
sols.add(testx);
}
}
}
System.out.println("" + tests + " checks -> X=" + sols);
}
public static void main(String[] args)
{
Random rand = new Random();
for (int range=Integer.MAX_VALUE; range > 32; range -= (range>>5))
{
long A = rand.nextLong() & Long.MAX_VALUE;
long X = (rand.nextInt(range)) + 2L;
X|=1;
long Y = A/X;
if (Y==0)
{
Y = rand.nextInt(65536);
}
Y|=1;
solve(X*Y, X^Y);
}
}
You can see the results here: https://ideone.com/cEuHkQ
Looks like it usually only takes a couple thousand checks.

Here's a simple recursion that observes the rules we know: (1) the least significant bits of both X and Y are set since only odd multiplicands yield an odd multiple; (2) if we set X to have the highest set bit of B, Y cannot be greater than sqrt(A); and (3) set bits in X or Y according to the current bit in B.
The following Python code resulted in under 300 iterations for all but one of the random pairs I picked from Matt Timmermans' example code. But the first one took 231,199 iterations :)
from math import sqrt
def f(A, B):
i = 64
while not ((1<<i) & B):
i = i - 1
X = 1 | (1 << i)
sqrtA = int(sqrt(A))
j = 64
while not ((1<<j) & sqrtA):
j = j - 1
if (j > i):
i = j + 1
memo = {"it": 0, "stop": False, "solution": []}
def g(b, x, y):
memo["it"] = memo["it"] + 1
if memo["stop"]:
return []
if y > sqrtA or y * x > A:
return []
if b == 0:
if x * y == A:
memo["solution"].append((x, y))
memo["stop"] = True
return [(x, y)]
else:
return []
bit = 1 << b
if B & bit:
return g(b - 1, x, y | bit) + g(b - 1, x | bit, y)
else:
return g(b - 1, x | bit, y | bit) + g(b - 1, x, y)
g(i - 1, X, 1)
return memo
vals = [
(6872997084689100999, 2637233646), # 1048 checks with Matt's code
(3461781732514363153, 262193934464), # 8756 checks with Matt's code
(931590259044275343, 5343859294), # 4628 checks with Matt's code
(2390503072583010999, 22219728382), # 5188 checks with Matt's code
(412975927819062465, 9399702487040), # 8324 checks with Matt's code
(9105477787064988985, 211755297373604352), # 3204 checks with Matt's code
(4978113409908739575,67966612030), # 5232 checks with Matt's code
(6175356111962773143,1264664368613886), # 3756 checks with Matt's code
(648518352783802375, 6) # B smaller than sqrt(A)
]
for A, B in vals:
memo = f(A, B)
[(x, y)] = memo["solution"]
print "x, y: %s, %s" % (x, y)
print "A: %s" % A
print "x*y: %s" % (x * y)
print "B: %s" % B
print "x^y: %s" % (x ^ y)
print "%s iterations" % memo["it"]
print ""
Output:
x, y: 4251585939, 1616572541
A: 6872997084689100999
x*y: 6872997084689100999
B: 2637233646
x^y: 2637233646
231199 iterations
x, y: 262180735447, 13203799
A: 3461781732514363153
x*y: 3461781732514363153
B: 262193934464
x^y: 262193934464
73 iterations
x, y: 5171068311, 180154313
A: 931590259044275343
x*y: 931590259044275343
B: 5343859294
x^y: 5343859294
257 iterations
x, y: 22180179939, 107776541
A: 2390503072583010999
x*y: 2390503072583010999
B: 22219728382
x^y: 22219728382
67 iterations
x, y: 9399702465439, 43935
A: 412975927819062465
x*y: 412975927819062465
B: 9399702487040
x^y: 9399702487040
85 iterations
x, y: 211755297373604395, 43
A: 9105477787064988985
x*y: 9105477787064988985
B: 211755297373604352
x^y: 211755297373604352
113 iterations
x, y: 68039759325, 73164771
A: 4978113409908739575
x*y: 4978113409908739575
B: 67966612030
x^y: 67966612030
69 iterations
x, y: 1264664368618221, 4883
A: 6175356111962773143
x*y: 6175356111962773143
B: 1264664368613886
x^y: 1264664368613886
99 iterations
x, y: 805306375, 805306369
A: 648518352783802375
x*y: 648518352783802375
B: 6
x^y: 6
59 iterations

How can I speedup this Julia code?

The code implements an example of a Pollard rho() function for finding a factor of a positive integer, n. I've examined some of the code in the Julia "Primes" package that runs rapidly in an attempt to speedup the pollard_rho() function, all to no avail. The code should execute n = 1524157897241274137 in approximately 100 mSec to 30 Sec (Erlang, Haskell, Mercury, SWI Prolog) but takes about 3 to 4 minutes on JuliaBox, IJulia, and the Julia REPL. How can I make this go fast?
pollard_rho(1524157897241274137) = 1234567891
__precompile__()
module Pollard
export pollard_rho
function pollard_rho{T<:Integer}(n::T)
f(x::T, r::T, n) = rem(((x ^ T(2)) + r), n)
r::T = 7; x::T = 2; y::T = 11; y1::T = 11; z::T = 1
while z == 1
x = f(x, r, n)
y1 = f(y, r, n)
y = f(y1, r, n)
z = gcd(n, abs(x - y))
end
z >= n ? "error" : z
end
end # module

There are quite a few problems with type instability here.
Don't return either the string "error" or a result; instead explicitly call error().
As Chris mentioned, x and r ought to be annotated to be of type T, else they will be unstable.
There also seems to be a potential problem with overflow. A solution is to widen in the squaring step before truncating back to type T.
function pollard_rho{T<:Integer}(n::T)
f(x::T, r::T, n) = rem(Base.widemul(x, x) + r, n) % T
r::T = 7; x::T = 2; y::T = 11; y1::T = 11; z::T = 1
while z == 1
x = f(x, r, n)
y1 = f(y, r, n)
y = f(y1, r, n)
z = gcd(n, abs(x - y))
end
z >= n ? error() : z
end
After making these changes the function will run as fast as you could expect.
julia> #btime pollard_rho(1524157897241274137)
4.128 ms (0 allocations: 0 bytes)
1234567891
To find these problems with type instability, use the #code_warntype macro.

Solving linear equations

I have to find out the integral solution of a equation ax+by=c such that x>=0 and y>=0 and value of (x+y) is minimum.
I know if c%gcd(a,b)}==0 then it's always possible. How to find the values of x and y?
My approach
for(i 0 to 2*c):
x=i
y= (c-a*i)/b
if(y is integer)
ans = min(ans,x+y)
Is there any better way to do this ? Having better time complexity.

Using the Extended Euclidean Algorithm and the theory of linear Diophantine equations there is no need to search. Here is a Python 3 implementation:
def egcd(a,b):
s,t = 1,0 #coefficients to express current a in terms of original a,b
x,y = 0,1 #coefficients to express current b in terms of original a,b
q,r = divmod(a,b)
while(r > 0):
a,b = b,r
old_x, old_y = x,y
x,y = s - q*x, t - q*y
s,t = old_x, old_y
q,r = divmod(a,b)
return b, x ,y
def smallestSolution(a,b,c):
d,x,y = egcd(a,b)
if c%d != 0:
return "No integer solutions"
else:
u = a//d #integer division
v = b//d
w = c//d
x = w*x
y = w*y
k1 = -x//v if -x % v == 0 else 1 + -x//v #k1 = ceiling(-x/v)
x1 = x + k1*v # x + k1*v is solution with smallest x >= 0
y1 = y - k1*u
if y1 < 0:
return "No nonnegative integer solutions"
else:
k2 = y//u #floor division
x2 = x + k2*v #y-k2*u is solution with smallest y >= 0
y2 = y - k2*u
if x2 < 0 or x1+y1 < x2+y2:
return (x1,y1)
else:
return (x2,y2)
Typical run:
>>> smallestSolution(1001,2743,160485)
(111, 18)
The way it works: first use the extended Euclidean algorithm to find d = gcd(a,b) and one solution, (x,y). All other solutions are of the form (x+k*v,y-k*u) where u = a/d and v = b/d. Since x+y is linear, it has no critical points, hence is minimized in the first quadrant when either x is as small as possible or y is as small as possible. The k above is an arbitrary integer parameter. By appropriate use of floor and ceiling you can locate the integer points with either x as small as possible or y is as small as possible. Just take the one with the smallest sum.
On Edit: My original code used the Python function math.ceiling applied to -x/v. This is problematic for very large integers. I tweaked it so that the ceiling is computed with just int operations. It can now handle arbitrarily large numbers:
>>> a = 236317407839490590865554550063
>>> b = 127372335361192567404918884983
>>> c = 475864993503739844164597027155993229496457605245403456517677648564321
>>> smallestSolution(a,b,c)
(2013668810262278187384582192404963131387, 120334243940259443613787580180)
>>> x,y = _
>>> a*x+b*y
475864993503739844164597027155993229496457605245403456517677648564321
Most of the computation takes place in the running the extended Euclidean algorithm, which is known to be O(min(a,b)).

First let assume a,b,c>0 so:
a.x+b.y = c
x+y = min(xi+yi)
x,y >= 0
a,b,c > 0
------------------------
x = ( c - b.y )/a
y = ( c - a.x )/b
c - a.x >= 0
c - b.y >= 0
c >= b.y
c >= a.x
x <= c/x
y <= c/b
So naive O(n) solution is in C++ like this:
void compute0(int &x,int &y,int a,int b,int c) // naive
{
int xx,yy;
xx=-1; yy=-1;
for (y=0;;y++)
{
x = c - b*y;
if (x<0) break; // y out of range stop
if (x%a) continue; // non integer solution
x/=a; // remember minimal solution
if ((xx<0)||(x+y<=xx+yy)) { xx=x; yy=y; }
}
x=xx; y=yy;
}
if no solution found it returns -1,-1 If you think about the equation a bit then you should realize that min solution will be when x or y is minimal (which one depends on a<b condition) so adding such heuristics we can increase only the minimal coordinate until first solution found. This will speed up considerably the whole thing:
void compute1(int &x,int &y,int a,int b,int c)
{
if (a<=b){ for (x=0,y=c;y>=0;x++,y-=a) if (y%b==0) { y/=b; return; } }
else { for (y=0,x=c;x>=0;y++,x-=b) if (x%a==0) { x/=a; return; } }
x=-1; y=-1;
}
I measured this on my setup:
x y ax+by x+y a=50 b=105 c=500000000
[ 55.910 ms] 10 4761900 500000000 4761910 naive
[ 0.000 ms] 10 4761900 500000000 4761910 opt
x y ax+by x+y a=105 b=50 c=500000000
[ 99.214 ms] 4761900 10 500000000 4761910 naive
[ 0.000 ms] 4761900 10 500000000 4761910 opt
The ~2.0x difference for naive method times is due to a/b=~2.0and selecting worse coordinate to iterate in the second run.
Now just handle special cases when a,b,c are zero (to avoid division by zero)...

what is the algorithm for an RBF kernel matrix in Matlab?

If I am given training data sets and unlabeled datasets, what is the RBF Kernel matrix algorithm for Matlab?

This should be what you are looking for. It is taken from here
% With Fast Computation of the RBF kernel matrix
% To speed up the computation, we exploit a decomposition of the Euclidean distance (norm)
%
% Inputs:
% ker: 'lin','poly','rbf','sam'
% X: data matrix with training samples in rows and features in columns
% X2: data matrix with test samples in rows and features in columns
% sigma: width of the RBF kernel
% b: bias in the linear and polinomial kernel
% d: degree in the polynomial kernel
%
% Output:
% K: kernel matrix
%
% Gustavo Camps-Valls
% 2006(c)
% Jordi (jordi#uv.es), 2007
% 2007-11: if/then -> switch, and fixed RBF kernel
function K = kernelmatrix(ker,X,X2,sigma)
switch ker
case 'lin'
if exist('X2','var')
K = X' * X2;
else
K = X' * X;
end
case 'poly'
if exist('X2','var')
K = (X' * X2 + b).^d;
else
K = (X' * X + b).^d;
end
case 'rbf'
n1sq = sum(X.^2,1);
n1 = size(X,2);
if isempty(X2);
D = (ones(n1,1)*n1sq)' + ones(n1,1)*n1sq -2*X'*X;
else
n2sq = sum(X2.^2,1);
n2 = size(X2,2);
D = (ones(n2,1)*n1sq)' + ones(n1,1)*n2sq -2*X'*X2;
end;
K = exp(-D/(2*sigma^2));
case 'sam'
if exist('X2','var');
D = X'*X2;
else
D = X'*X;
end
K = exp(-acos(D).^2/(2*sigma^2));
otherwise
error(['Unsupported kernel ' ker])
end

To produce the grahm/kernel matrix (matrix of inner products) do:
function [ Kern ] = produce_kernel_matrix( X, t, beta )
%
X = X';
t = t';
X_T_2 = sum(X.^2,2) + sum(t.^2,2).' - (2*X)*t.'; % ||x||^2 + ||t||^2 - 2<x,t>
Kern =exp(-beta*X_T_2); %
end
then to do the interpolation do:
function [ mdl ] = learn_RBF_linear_algebra( X_training_data, Y_training_data, mdl )
%
Kern_matrix = produce_kernel_matrix_bsxfun(X_training_data, mdl.t, mdl.beta); % (N x K)
C = Kern_matrix \ Y_training_data'; % (K x D) = (N x K)' x (N x D)
mdl.c = C; % (K x D)
end
note beta is 1/2sigma

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Python performance: iteration and operations on nested lists - performance

in your f3 rewrite, g can be simplified. (Can also be applied to f4) You have the following code inside a for loop. l = max(0, topleft[1]) r = min(topleft[1]+(75*2), 1024) However, it appears that those values never change inside the for loop. So calculate them once, outside the loop instead.

You can create your own Python module in C, and control the performance as you want: http://docs.python.org/extending/

Related

Fast summing of subarrays in Python

For given two integers A and B, find a pair of numbers X and Y such that A = X*Y and B = X xor Y

How can I speedup this Julia code?

Solving linear equations

what is the algorithm for an RBF kernel matrix in Matlab?

Categories

Resources