Why do Extra Function Calls Speed Up a Program in Python? - performance

If I extract a computation and place it in another function shouldn't
the code be slower? Evidently not. Below, I can't believe fun2 is slower
than fun1, because fun1 clearly does more computation. What is going on?
(Maybe I can have functions call functions call function and REALLY speed
up my code.)
Heading
##
Python code:
MAX = 10000000
def fun1(): # 4.26 seconds.
def multiply (X, Y): # multiply two 2x2 matrices
a, b, c, d = X
e, f, g, h = Y
return a*e+b*g, a*f+b*h, c*e+d*g, c*f+d*h
X = [1,2,3,4]
Y = [5,6,7,8]
for n in range (MAX):
Z = multiply (X, Y) # Make the call
return Z
#-------------------------------------------------
def fun2(): # 6.56 seconds.
X = [1,2,3,4]
Y = [5,6,7,8]
for n in range (MAX):
Z = X[0]*Y[0] + X[1]*Y[2], \
X[0]*Y[1] + X[1]*Y[3], \
X[2]*Y[0] + X[3]*Y[2], \
X[2]*Y[1] + X[3]*Y[3] # Don't make the call.
return Z

I'm not sure, but I think it might be that
a,b,c,d = X
and then referencing a,b,c and d directly is faster that referencing X[0] (and so on).
Every index in list is another lookup, while a,b,c,d=X is only one lookup (I think).

I finally figured out my own question. The function dispensed with square brackets, and that is where the speed increase came from, not from the function call itself. A Python list contains both values and bit-sizes (or addresses, I don't know which). To access x[3], the computer goes to address x reads the address of x[1], moves there, reads the address of x[2], moves there, reads the address of x[3], moves there and finally accesses the value. This takes time and can be speeded up by assigning the list elements to simple identifiers.

Related

Can I put multiple functions into one matrix using iteration in Julia?

I am new to Julia and trying to see whether I can put different functions as an element of a Mtrix
I constructed a matrix B (2x2)
And I want to put function,for example, x^1 + 1x as a (1,1) element, x^1 + 2x as a (1,2) element, x^2 + 1x as a (2,1) element and x^2 + 2x as a (2,2) element
I was planning to do something like as below but couldn't find a way to implement this. Is it possible to make such Matrix?
B = Array{Function}(undef,2,2)
agrid = 1:1:2
dgrid = 1:1:2
for(i_a,a) in enumerate(agrid)
for(i_d,d) in enumerate(dgrid)
B[i_a,i_d](x) = x^a+d*x
end
end
The reason I want to construct this matrix is to solve the following model.
I need to solve the equation with two variables 'm' and 'n' given 'a' and 'd'.
And I thought if I have a matrix consisting of each function with all possible combinations of 'a' and 'd'(in the sample code, the possible combination would be (1,1) (1,2) (2,1) (2,2)), then it would be easier to solve the model at once.
Any help will be appreciated.
Thank you.
Yes it's possible. You could insert Functions one by one into the B you have. In this case, there's a pattern so you could have built B with anonymous functions and comprehension:
M = 2
N = 2
x = 10
B = [ ((y) -> (y^a + b*y)) for a in 1:M, b in 1:N]
result1 = x .|> B # result1[i] = B[i](x)
I'd rather not tie the array's indices to the functions though; you would have to keep making a new Function matrix for different M and N. Instead, you could make a more memory-efficient CartesianIndices matrix and use a function that takes in CartesianIndex:
# memory-efficient matrix of CartesianIndex{2}
indices = CartesianIndices((M,N))
# if x were an MxN Matrix, you could also do CartesianIndices(x)
f(y, a, b) = y^a + b*y
g(z, index::CartesianIndex{2}) = f(z, index[1], index[2])
result2 = g.(x, indices) # result2[i] = g(x, indices[i])
As an aside, it appears that in the anonymous function comprehension used to create my B, only 1 actual method is compiled, and each element is similar to an instance of a functor with a and b as fields:
struct P{A,B} <: Function # subtype Function to be similar to B
a::A
b::B
end
(p::P)(y) = y^(p.a) + (p.b)*x
PB = [P(a, b) for a in 1:M, b in 1:N]
result3 = x .|> PB # result3[i] = PB[i](x) = P(a, b)(x)

Loop invariant proof on multiply algorithm

I'm currently stuck on a loop invariant proof in my home assignment. The algorithm that I need to prove correctness of, is:
Multiply(a,b)
x=a
y=0
WHILE x>=b DO
x=x-b
y=y+1
IF x=0 THEN
RETURN(y)
ELSE
RETURN(-1)
I've tried to look at several examples of loop invariants and I have some sense of idea of how its supposed to work out. However in this algorithm above, I have two exit conditions, and I'm a bit lost on how to approach this in a loop invariant proof. In particular its the termination part I'm struggling with, around the IF and ELSE statements.
So far what I've constructed is simply by looking at the termination of the algorithm in which case if x = 0 then it returns the value of y containing the value of n (number of iterations in the while loop), where as if x is not 0, and x < b then it returns -1. I just have a feeling I need to prove this some how.
I hope someone can help share some light on this for me, as the similar cases I've found in here, have not been sufficient.
Thanks alot in advance for your time.
Provided that the algorithm terminates (for this let's assume a>0 and b>0, which is sufficient), one invariant is that at every iteration of your while loop, you have x + by = a.
Proof:
at first, x = a and y = 0 so that's ok
If x + by = a, then (x - b) + (y + 1)b = a, which are the values of x and y for your next iteration
Illustration:
Multiply(a,b)
x=a
y=0
// x + by = a, is true
WHILE x>=b DO
// x + by = a, is true
x=x-b // X = x - b
y=y+1 // Y = y + 1
// x + by = a
// x - b + by + b = a
// (x-b) + (y+1)b = a
// X + bY = a, is still true
// x + by = a, will remain true when you exit the loop
// since we exited the loop, x < b
IF x=0 THEN
// 0 + by = a, and 0 < b
// y = a/b
RETURN(y)
ELSE
RETURN(-1)
This algorithm returns a/b when b divides a, and -1 otherwise. Multiply does not quite sound like an appropriate name for it...
We can't prove correctness without a specification of exactly what the function is supposed to do, which I can't find in your question. Even the name of the function doesn't help: as noted already, your function returns a/b most of the time when b divides a, and -1 otherwise. Multiply is an inappropriate name for it.
Furthermore, if b=0 and a>=b the "algorithm" doesn't terminate so it isn't even an algorithm.
As Alex M noted, a loop invariant for the loop is x + by = a. At the moment the loop exits, we also have x < b. There are no other guarantees on x because (presumably) a could be negative. If we had a guarantee that a and b are positive, then we could guarantee that 0<=x<b at the moment the loop exits, which would mean that it implements the division with remainder algorithm (at the end of the loop, y is quotient and x is remainder, and it terminates by an "infinite descent" type argument: a decreasing sequence of positive integers x must terminate). Then you could conclude that if x=0, b divides a evenly, and the quotient is returned, otherwise -1 is returned.
But that is not a proof, because we are lacking a specification for what the algorithm is supposed to do, and a specification on restrictions on its inputs. (Are a and b any positive integers? Negative and 0 not allowed?)

How to parallelize computation of pairwise distance matrix?

My problem is roughly as follows. Given a numerical matrix X, where each row is an item. I want to find each row's nearest neighbor in terms of L2 distance in all rows except itself. I tried reading the official documentation but was still a little confused about how to achieve this. Could someone give me some hint?
My code is as follows
function l2_dist(v1, v2)
return sqrt(sum((v1 - v2) .^ 2))
end
function main(Mat, dist_fun)
n = size(Mat, 1)
Dist = SharedArray{Float64}(n) #[Inf for i in 1:n]
Id = SharedArray{Int64}(n) #[-1 for i in 1:n]
#parallel for i = 1:n
Dist[i] = Inf
Id[i] = 0
end
Threads.#threads for i in 1:n
for j in 1:n
if i != j
println(i, j)
dist_temp = dist_fun(Mat[i, :], Mat[j, :])
if dist_temp < Dist[i]
println("Dist updated!")
Dist[i] = dist_temp
Id[i] = j
end
end
end
end
return Dict("Dist" => Dist, "Id" => Id)
end
n = 4000
p = 30
X = [rand() for i in 1:n, j in 1:p];
main(X[1:30, :], l2_dist)
#time N = main(X, l2_dist)
I'm trying to distributed all the i's (i.e. calculating each row minimum) over different cores. But the version above apparently isn't working correctly. It is even slower than the sequential version. Can someone point me to the right direction? Thanks.
Maybe you're doing something in addition to what you have written down, but, at this point from what I can see, you aren't actually doing any computations in parallel. Julia requires you to tell it how many processors (or threads) you would like it to have access to. You can do this through either
Starting Julia with multiple processors julia -p # (where # is the number of processors you want Julia to have access to)
Once you have started a Julia "session" you can call the addprocs function to add additional processors.
To have more than 1 thread, you need to run command export JULIA_NUM_THREADS = #. I don't know very much about threading, so I will be sticking with the #parallel macro. I suggest reading documentation for more details on threading -- Maybe #Chris Rackauckas could expand a little more on the difference.
A few comments below about my code and on your code:
I'm on version 0.6.1-pre.0. I don't think I'm doing anything 0.6 specific, but this is a heads up just in case.
I'm going to use the Distances.jl package when computing the distances between vectors. I think it is a good habit to farm out as many of my computations to well-written and well-maintained packages as possible.
Rather than compute the distance between rows, I'm going to compute the distance between columns. This is because Julia is a column-major language, so this will increase the number of cache hits and give a little extra speed. You can obviously get the row-wise results you want by just transposing the input.
Unless you expect to have that many memory allocations then that many allocations are a sign that something in your code is inefficient. It is often a type stability problem. I don't know if that was the case in your code before, but that doesn't seem to be an issue in the current version (it wasn't immediately clear to me why you were having so many allocations).
Code is below
# Make sure all processors have access to Distances package
#everywhere using Distances
# Create a random matrix
nrow = 30
ncol = 4000
# Seed creation of random matrix so it is always same matrix
srand(42)
X = rand(nrow, ncol)
function main(X::AbstractMatrix{Float64}, M::Distances.Metric)
# Get size of the matrix
nrow, ncol = size(X)
# Create `SharedArray` to store output
ind_vec = SharedArray{Int}(ncol)
dist_vec = SharedArray{Float64}(ncol)
# Compute the distance between columns
#sync #parallel for i in 1:ncol
# Initialize various temporary variables
min_dist_i = Inf
min_ind_i = -1
X_i = view(X, :, i)
# Check distance against all other columns
for j in 1:ncol
# Skip comparison with itself
if i==j
continue
end
# Tell us who is doing the work
# (can uncomment if you want to verify stuff)
# println("Column $i compared with Column $j by worker $(myid())")
# Evaluate the new distance...
# If it is less then replace it, otherwise proceed
dist_temp = evaluate(M, X_i, view(X, :, j))
if dist_temp < min_dist_i
min_dist_i = dist_temp
min_ind_i = j
end
end
# Which column is minimum distance from column i
dist_vec[i] = min_dist_i
ind_vec[i] = min_ind_i
end
return dist_vec, ind_vec
end
# Using Euclidean metric
metric = Euclidean()
inds, dist = main(X, metric)
#time main(X, metric);
#show dist[[1, 5, 25]], inds[[1, 5, 25]]
You can run the code with
1 processor julia testfile.jl
% julia testfile.jl
0.640365 seconds (16.00 M allocations: 732.495 MiB, 3.70% gc time)
(dist[[1, 5, 25]], inds[[1, 5, 25]]) = ([2541, 2459, 1602], [1.40892, 1.38206, 1.32184])
n processors (in this case 4) julia -p n testfile.jl
% julia -p 4 testfile.jl
0.201523 seconds (2.10 k allocations: 99.107 KiB)
(dist[[1, 5, 25]], inds[[1, 5, 25]]) = ([2541, 2459, 1602], [1.40892, 1.38206, 1.32184])

Efficient partial permutation sort in Julia

I am dealing with a problem that requires a partial permutation sort by magnitude in Julia. If x is a vector of dimension p, then what I need are the first k indices corresponding to the k components of x that would appear first in a partial sort by absolute value of x.
Refer to Julia's sorting functions here. Basically, I want a cross between sortperm and select!. When Julia 0.4 is released, I will be able to obtain the same answer by applying sortperm! (this function) to the vector of indices and choosing the first k of them. However, using sortperm! is not ideal here because it will sort the remaining p-k indices of x, which I do not need.
What would be the most memory-efficient way to do the partial permutation sort? I hacked a solution by looking at the sortperm source code. However, since I am not versed in the ordering modules that Julia uses there, I am not sure if my approach is intelligent.
One important detail: I can ignore repeats or ambiguities here. In other words, I do not care about the ordering by abs() of indices for two components 2 and -2. My actual code uses floating point values, so exact equality never occurs for practical purposes.
# initialize a vector for testing
x = [-3,-2,4,1,0,-1]
x2 = copy(x)
k = 3 # num components desired in partial sort
p = 6 # num components in x, x2
# what are the indices that sort x by magnitude?
indices = sortperm(x, by = abs, rev = true)
# now perform partial sort on x2
select!(x2, k, by = abs, rev = true)
# check if first k components are sorted here
# should evaluate to "true"
isequal(x2[1:k], x[indices[1:k]])
# now try my partial permutation sort
# I only need indices2[1:k] at end of day!
indices2 = [1:p]
select!(indices2, 1:k, 1, p, Base.Perm(Base.ord(isless, abs, true, Base.Forward), x))
# same result? should evaluate to "true"
isequal(indices2[1:k], indices[1:k])
EDIT: With the suggested code, we can briefly compare performance on much larger vectors:
p = 10000; k = 100; # asking for largest 1% of components
x = randn(p); x2 = copy(x);
# run following code twice for proper timing results
#time {indices = sortperm(x, by = abs, rev = true); indices[1:k]};
#time {indices2 = [1:p]; select!(indices2, 1:k, 1, p, Base.Perm(Base.ord(isless, abs, true, Base.Forward), x))};
#time selectperm(x,k);
My output:
elapsed time: 0.048876901 seconds (19792096 bytes allocated)
elapsed time: 0.007016534 seconds (2203688 bytes allocated)
elapsed time: 0.004471847 seconds (1657808 bytes allocated)
The following version appears to be relatively space-efficient because it uses only an integer array of the same length as the input array:
function selectperm (x,k)
if k > 1 then
kk = 1:k
else
kk = 1
end
z = collect(1:length(x))
return select!(z,1:k,by = (i)->abs(x[i]), rev = true)
end
x = [-3,-2,4,1,0,-1]
k = 3 # num components desired in partial sort
print (selectperm(x,k))
The output is:
[3,1,2]
... as expected.
I'm not sure if it uses less memory than the originally-proposed solution (though I suspect the memory usage is similar) but the code may be clearer and it does produce only the first k indices whereas the original solution produced all p indices.
(Edit)
selectperm() has been edited to deal with the BoundsError that occurs if k=1 in the call to select!().

numpy: evaluating function in matrix, using previous array as argument in calculating the next

I have an m x n array: a, where the integers m > 1E6, and n <= 5.
I have functions F and G, which are composed like this: F( u, G ( u, t)). u is a 1 x n array, t is a scalar, and F and G returns 1 x n arrays.
I need to evaluate each row of a in F, and use previously evaluated row as the u-array for the next evaluation. I need to make m such evaluations.
This has to be really fast. I was previously impressed by scitools.std StringFunction evaluaion for a whole array, but this problem requires using the previously calculated array as an argument in calculating the next. I don't know if StringFunction can do this.
For example:
a = zeros((1000000, 4))
a[0] = asarray([1.,69.,3.,4.1])
# A is a float defined elsewhere, h is a function which accepts a float as its argument and returns an arbitrary float. h is defined elsewhere.
def G(u, t):
return asarray([u[0], u[1]*A, cos(u[2]), t*h(u[3])])
def F(u, t):
return u + G(u, t)
dt = 1E-6
for i in range(1, 1000000):
a[i] = F(a[i-1], i*dt)
i += 1
The problem with the above code is that it is slow as hell. I need to get these calculations done by numpy milliseconds.
How can I do what I want?
Thank you for our time.
Kind regards,
Marius
This sort of thing is very difficult to do in numpy. If we look at this by column we see a few simpler solutions.
a[:,0] is very easy:
col0 = np.ones((1000))*2
col0[0] = 1 #Or whatever start value.
np.cumprod(col0, out=col0)
np.allclose(col0, a[:1000,0])
True
As mentioned earlier this will overflow very quickly. a[:,1] can be done much along the same lines.
I do not believe there is a way to do the next two columns inside numpy alone quickly. We can turn to numba for this:
from numba import auotojit
def python_loop(start, count):
out = np.zeros((count), dtype=np.double)
out[0] = start
for x in xrange(count-1):
out[x+1] = out[x] + np.cos(out[x+1])
return out
numba_loop = autojit(python_loop)
np.allclose(numba_loop(3,1000),a[:1000,2])
True
%timeit python_loop(3,1000000)
1 loops, best of 3: 4.14 s per loop
%timeit numba_loop(3,1000000)
1 loops, best of 3: 42.5 ms per loop
Although its worth pointing out that this converges to pi/2 very very quickly and there is little point in calculating this recursion past ~20 values for any start value. This returns the exact same answer to double point precision- I didn't bother finding the cutoff, but it is much less then 50:
%timeit tmp = np.empty((1000000));
tmp[:50] = numba_loop(3,50);
tmp[50:] = np.pi/2
100 loops, best of 3: 2.25 ms per loop
You can do something similar with the fourth column. Of course you can autojit all of the functions, but this gives you several different options to try out depending on numba usage:
Use cumprod for the first two columns
Use an approximation for column 3 (and possible 4) where only the first few iterations are calculated
Implement columns 3 and 4 in numba using autojit
Wrap everything inside of an autojit loop (the best option)
The way you have presented this all rows past ~200 will either be np.inf or np.pi/2. Exploit this.
Slightly faster. Your first column is basicly 2^n. Calculating 2^n for n up to 1000000 is gonna overflow.. second column is even worse.
def calc(arr, t0=1E-6):
u = arr[0]
dt = 1E-6
h = lambda x: np.random.random(1)*50.0
def firstColGen(uStart):
u = uStart
while True:
u += u
yield u
def secondColGen(uStart, A):
u = uStart
while True:
u += u*A
yield u
def thirdColGen(uStart):
u = uStart
while True:
u += np.cos(u)
yield u
def fourthColGen(uStart, h, t0, dt):
u = uStart
t = t0
while True:
u += h(u) * dt
t += dt
yield u
first = firstColGen(u[0])
second = secondColGen(u[1], A)
third = thirdColGen(u[2])
fourth = fourthColGen(u[3], h, t0, dt)
for i in xrange(1, len(arr)):
arr[i] = [first.next(), second.next(), third.next(), fourth.next()]

Resources