Insane amount of space allocated solving system of stochastic differential equations - performance

first time asking a question here.
I previously used a simple MATLAB script to model 90 Hopf oscillators, coupled through a matrix, with randn noise and a simple Euler-step integration. I wanted to upgrade this, so I got into Julia, which seems to have many exciting properties.
This is the system of equations I'm solving: a network of Hopf normal-form oscillators with diffusive coupling and additive noise (the drift and diffusion functions below spell it out).
I'm kinda lost. I started with DifferentialEquations.jl (the stochastic solvers), arrived at a solution, and found myself with a benchmark telling me that solving 200 seconds of model time allocates about 4 GB (2.5 GB with alg_hints=[:stiff]). I haven't fixed dt; previously I used dt=0.1.
function Shopf(du,u,p,t)
    du[1:90,1]=(p[1:90,1]-u[1:90,1].^2.0-u[1:90,2].^2.0).*u[1:90,1]-p[1:90,2].*u[1:90,2] + 0.5*(-p[:,end].*u[:,1]+p[:,4:end-1]*u[:,1])
    du[1:90,2]=(p[1:90,1]-u[1:90,1].^2.0-u[1:90,2].^2.0).*u[1:90,1]+p[1:90,2].*u[1:90,1] + 0.5*(-p[:,end].*u[:,2]+p[:,4:end-1]*u[:,2])
end

function σ_Shopf(du,u,p,t)
    du[1:90,1]=0.04*ones(90,1)
    du[1:90,2]=0.04*ones(90,1)
end
#initial condition
u0 = -0.1*ones(90,2);
#initial time
t0 = 0.0;
#final time
tend = 200.0;
#setting parameter matrix
p0 = [0.1, 2*pi*0.04]
push!(p0, -p0[2])
p = p0' .* ones(90,3);
#SC is the 90×90 coupling matrix (defined earlier in the script)
p = [p SC]
p = [p sum(SC, dims=2)]
#
#col 1    : alpha
#col 2-3  : [w0 -w0]
#col 4-93 : coupling matrix
#col 94   : row-wise sums of the coupling matrix (sum(SC, dims=2))
@benchmark solve(prob_sde_Shopf,nlsolver=Rosenbrock23(),alg_hints=[:stiff])
BenchmarkTools.Trial:
  memory estimate:  2.30 GiB
  allocs estimate:  722769
  --------------
  minimum time:     859.224 ms (13.24% GC)
  median time:      942.707 ms (13.10% GC)
  mean time:        975.430 ms (12.99% GC)
  maximum time:     1.223 s (13.00% GC)
  --------------
  samples:          6
  evals/sample:     1
Any thoughts? I'm checking out several solutions, but none of them brings the memory usage down to a reasonable level.
Thanks in advance.

You are creating a staggering number of temporary arrays. Every slice creates a temporary. You put in a dot here and there, but you have to dot everything to get fused broadcasting. Instead, you can just use the @. macro, which will do it for you. Also, using @views will make sure that slices don't copy:
function Shopf(du, u, p, t)
    @. du[1:90, 1] = @views (p[1:90, 1] - u[1:90, 1]^2 - u[1:90, 2]^2) * u[1:90, 1] -
        p[1:90, 2] * u[1:90, 2] + 0.5 * (-p[:, end] * u[:, 1] + p[:, 4:end-1] * u[:, 1])
    @. du[1:90, 2] = @views (p[1:90, 1] - u[1:90, 1]^2 - u[1:90, 2]^2) * u[1:90, 1] +
        p[1:90, 2] * u[1:90, 1] + 0.5 * (-p[:, end] * u[:, 2] + p[:, 4:end-1] * u[:, 2])
end
Also, don't write x^2.0; use x^2. The former is a slow float power, while the latter is a fast x * x. In fact, try to use integers wherever you can: in multiplications, additions, and so on.
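If you want to see the difference yourself, a quick check along these lines should show the literal power winning by a wide margin (a sketch using BenchmarkTools; the $ interpolation avoids global-variable overhead):

using BenchmarkTools
x = 1.23
@btime $x^2.0   # generic floating-point power
@btime $x^2     # literal integer power, lowered to x * x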
Here's another thing
function σ_Shopf(du,u,p,t)
    du[1:90,1]=0.04*ones(90,1)
    du[1:90,2]=0.04*ones(90,1)
end
No need to create two temporary arrays on the right side of the assignment. Just write this:
function σ_Shopf(du, u, p, t)
    du[1:90, 1:2] .= 0.04
end
Faster and simpler. Note that I haven't tested this, so please fix any typos.
(Finally, please use indentation and put spaces around operators, it makes your code much nicer to read.)
Update: I don't really know what your code is supposed to do, what with the strange indices, but here is a possible improvement that just uses loops (which I think is actually cleaner, and will let you make further optimizations):
The operation producing A is a matrix product, so you cannot avoid allocations there, unless you can pass in a cache array to work on, using mul!. Aside from that, you should have no allocations below.
function shopf!(du, u, p, t)
    A = @views p[:, 4:end-1] * u
    # mul!(A, view(p, :, 4:end-1), u) # in-place matrix product into a preallocated A
    for i in axes(u, 1)
        val = (p[i, 1] - u[i, 1]^2 - u[i, 2]^2) * u[i, 1]  # don't calculate this twice
        du[i, 1] = val - (p[i, 2] * u[i, 2]) - (0.5 * p[i, end] * u[i, 1]) +
                   (0.5 * A[i, 1])
        du[i, 2] = val + (p[i, 2] * u[i, 1]) - (0.5 * p[i, end] * u[i, 2]) +
                   (0.5 * A[i, 2])
    end
end
After this, you can add various optimizations: @inbounds if you are sure about the array sizes, multithreading, @simd, or even @avx from the experimental LoopVectorization package.
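One more knob worth checking, though I'm assuming here how you call solve since that line isn't shown: part of the reported memory is simply the saved solution, because by default the solver stores every step of a 180-variable system over 200 time units. Saving less is a cheap win. A sketch, untested:

using DifferentialEquations
prob = SDEProblem(shopf!, σ_Shopf, u0, (t0, tend), p)
# keep one snapshot every 0.5 time units instead of every step:
sol = solve(prob, SOSRI(), saveat=0.5, dense=false)
# or fixed-step Euler–Maruyama, like the old MATLAB script:
sol_em = solve(prob, EM(), dt=0.1, saveat=0.5)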

Related

Optimizing matrix multiplication with varying sizes

Suppose I have the following data generating process
using Random
using StatsBase
m_1 = [1.0 2.0]
m_2 = [1.0 2.0; 3.0 4.0]
DD = []
y = zeros(2,200)
for i in 1:100
    rand!(m_1)
    rand!(m_2)
    push!(DD, m_1)
    push!(DD, m_2)
end
idxs = sample(1:200,10)
for i in idxs
    DD[i] = DD[1]
end
and suppose that, given the data, I have the following function:
function test(y, DD, n)
    v_1 = [1 2]
    v_2 = [3 4]
    for j in 1:n
        for i in 1:size(DD,1)
            if size(DD[i],1) == 1
                y[1:size(DD[i],1),i] .= (v_1 * DD[i]')[1]
            else
                y[1:size(DD[i],1),i] = (v_2 * DD[i]')'
            end
        end
    end
end
I'm struggling to optimize the speed of test. In particular, memory allocation increases as I increase n. However, I'm not really allocating anything new.
The data generating process captures the fact that I don't know for sure the size of DD[i] beforehand. That is, the first time I call test, DD[1] could be a 2x2 matrix. The second time I call test, DD[1] could be a 1x2 matrix. I think this could be part of the issue with memory allocation: Julia doesn't know the sizes beforehand.
I'm completely stuck. I've tried @inbounds, but that didn't help. Is there a way to improve this?
One important thing to check for performance is whether Julia can infer the types. You can check this by running @code_warntype test(y, DD, 1); the output will make it clear that DD is of type Vector{Any} (since you declared it that way). Working with Any can incur quite a performance penalty, so declaring DD = Matrix{Float64}[] cuts the time to a third in my testing.
I'm not sure how close this example is to the actual code you want to write, but in this particular case the size(DD[i],1) == 1 branch can be replaced by a call to LinearAlgebra.dot:
y[1:size(DD[i],1),i] .= dot(v_1, DD[i])
This cuts the time by another 50% for me. Finally, you can squeeze out just a tiny bit more by using mul! to perform the other multiplication in place:
mul!(view(y, 1:size(DD[i],1),i:i), DD[i], v_2')
Full example:
using Random
using LinearAlgebra
DD = [rand(i,2) for _ in 1:100 for i in 1:2]
y = zeros(2,200)
shuffle!(DD)
function test(y, DD, n)
    v_1 = [1 2]
    v_2 = [3 4]'
    for j in 1:n
        for i in 1:size(DD,1)
            if size(DD[i],1) == 1
                y[1:size(DD[i],1),i] .= dot(v_1, DD[i])
            else
                mul!(view(y, 1:size(DD[i],1),i:i), DD[i], v_2)
            end
        end
    end
end
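If you want to verify the speedups on your side, the usual pattern is BenchmarkTools with interpolated arguments (a sketch):

using BenchmarkTools
@btime test($y, $DD, 100)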

Why does finding the eigenvalues of a 4×4 matrix with z3py take so much time and not give any solutions?

I'm trying to calculate the eigenvalues of a 4×4 matrix called A in my code (I know that the eigenvalues are real). All the elements of A are z3 expressions that have to be computed from earlier constraints. The code below is the last part of a long script that tries to find the matrix A and then its eigenvalues. The code was written as one piece, but I've split it into two parts to debug it: part 1, in which the code finds the matrix A, and part 2, the eigenvalue calculation. Part 1 works very fast and finds A in under a second, but when I add part 2, the code no longer returns any solution.
I was wondering what the reason could be. Is it the order of the polynomial (which is 4), or something else? I would appreciate any help finding an alternative way to calculate the eigenvalues, or hints on how to rewrite the code so it can solve the problem.
(Note that A2 in the actual code is a matrix whose elements are all z3 expressions defined by previous constraints in the code. Here I've defined the elements as real values just to make the code executable. With these values the code finds a solution very quickly, but in the real situation it takes very long, like days.
For example, one of the elements of A looks roughly like this:
0 +
1*Vq0__1 +
2 * -Vd0__1 +
0 +
((5.5 * Iq0__1 - 0)/64/5) *
(0 +
0 * (Vq0__1 - 0) +
-521702838063439/62500000000000 * (-Vd0__1 - 0)) +
((0.10 * Id0__1 - Etr_q0__1)/64/5) *
(0 +
521702838063439/62500000000000 * (Vq0__1 - 0) +
0.001 * (-Vd0__1 - 0)) +
0 +
0 + 0 +
0 +
((100 * Iq0__1 - 0)/64/5) * 0 +
((20 * Id0__1 - Etr_q0__1)/64/5) * 0 +
0 +
-5/64
All the variables in this example are z3 variables.)
from z3 import *
import numpy as np

def sub(*arg):
    counter = 0
    for matrix in arg:
        if counter == 0:
            counter += 1
            Sub = []
            for i in range(len(matrix)):
                Sub1 = []
                for j in range(len(matrix[0])):
                    Sub1 += [matrix[i][j]]
                Sub += [Sub1]
        else:
            row = len(matrix)
            colmn = len(matrix[0])
            for i in range(row):
                for j in range(colmn):
                    Sub[i][j] = Sub[i][j] - matrix[i][j]
    return Sub
Landa = RealVector('Landa', 2)  # eigenvalues considered as real values
LandaI0 = np.diag([Landa[0] for i in range(4)]).tolist()
ALandaz3 = RealVector('ALandaz3', 4 * 4)

############# Building ( A - \lambda * I ) to find the eigenvalues ############
A2 = [[1, 2, 3, 4],
      [5, 6, 7, 8],
      [3, 7, 4, 1],
      [4, 9, 7, 1]]

s = Solver()
for i in range(4):
    for j in range(4):
        s.add(ALandaz3[4 * i + j] == sub(A2, LandaI0)[i][j])

ALanda = [[ALandaz3[0],  ALandaz3[1],  ALandaz3[2],  ALandaz3[3]],
          [ALandaz3[4],  ALandaz3[5],  ALandaz3[6],  ALandaz3[7]],
          [ALandaz3[8],  ALandaz3[9],  ALandaz3[10], ALandaz3[11]],
          [ALandaz3[12], ALandaz3[13], ALandaz3[14], ALandaz3[15]]]

Determinant = (
    ALandaz3[0] * ALandaz3[5] * (ALandaz3[10] * ALandaz3[15] - ALandaz3[14] * ALandaz3[11]) -
    ALandaz3[1] * ALandaz3[4] * (ALandaz3[10] * ALandaz3[15] - ALandaz3[14] * ALandaz3[11]) +
    ALandaz3[2] * ALandaz3[4] * (ALandaz3[9] * ALandaz3[15] - ALandaz3[13] * ALandaz3[11]) -
    ALandaz3[3] * ALandaz3[4] * (ALandaz3[9] * ALandaz3[14] - ALandaz3[13] * ALandaz3[10]))

tol = 0.001
s.add(And(Determinant >= -tol, Determinant <= tol))  # allow some slack instead of requiring exactly zero
print(s.check())
print(s.model())
Note that you seem to be using Z3 for a type of problem it absolutely isn't meant for. Z3 is a SAT/SMT solver. Such a solver works internally with a huge number of boolean equations. Integers and fractions can be converted to boolean expressions, but with general floats Z3 quickly reaches its limits. See here and here for a lot of typical examples, and note how floats are avoided.
Z3 can work with floats in a limited way, converting them to fractions, but it doesn't work with the approximations and tolerances that numerical algorithms need. Therefore, the results are usually not what you are hoping for.
Finding eigenvalues is a typical numerical problem, where accuracy issues are very tricky. Python has libraries such as numpy and scipy to efficiently deal with those. See e.g. numpy.linalg.eig.
If, however, your A2 matrix contains some symbolic expressions (and uses fractions instead of floats), sympy's matrix functions could be an interesting alternative.
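For completeness, here is what the two suggested routes look like on the placeholder A2 from the question (a sketch: numpy for the fast numerical case, sympy for exact arithmetic):

import numpy as np
import sympy as sp

A2 = [[1, 2, 3, 4],
      [5, 6, 7, 8],
      [3, 7, 4, 1],
      [4, 9, 7, 1]]

# numerical eigenvalues, fast floating point:
print(np.linalg.eigvals(np.array(A2, dtype=float)))

# exact eigenvalues from the characteristic polynomial; works with
# rational or symbolic entries, but can be slow for large matrices:
print(sp.Matrix(A2).eigenvals())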

Why is Julia allocating so much memory?

I am trying to write a fast coordinate descent algorithm for solving ordinary least squares regression. The following Julia code works, but I don't understand why it's allocating so much memory
function OLS_cd{T<:Float64}(A::Array{T,2}, b::Array{T,1}, tolerance::T=1e-12)
    N,P = size(A)
    x = zeros(P)
    r = copy(b)
    d = ones(P)
    while sum(d.*d) > tolerance
        @inbounds for j = 1:P
            d[j] = sum(A[:,j].*r)
            x[j] += d[j]
            r -= d[j]*A[:,j]
        end
    end
    return(x)
end
On the data I generate with
n = 100
p = 75
σ = 0.1
β_nz = float([i*(-1)^i for i in 1:10])
β = append!(β_nz,zeros(p-length(β_nz)))
X = randn(n,p); X .-= mean(X,1); X ./= sqrt(sum(abs2(X),1))
y = X*β + σ*randn(n); y .-= mean(y);
Using @benchmark OLS_cd(X, y) I get
BenchmarkTools.Trial:
  memory estimate:  65.94 mb
  allocs estimate:  151359
  --------------
  minimum time:     19.316 ms (16.49% GC)
  median time:      20.545 ms (16.60% GC)
  mean time:        22.164 ms (16.24% GC)
  maximum time:     42.114 ms (10.82% GC)
  --------------
  samples:          226
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
The OLS problem gets harder as p gets bigger, and I've noticed that as I make p bigger and the algorithm has to run longer, Julia allocates more memory.
Why would each pass through the while loop allocate more memory? To my eye, it seems like all of my operations are in place, and the types are clearly specified.
Nothing popped out to me while profiling, but I could post that output as well if it's useful.
Update:
As pointed out below, temporary arrays caused by using vectorized operations were the culprit. The following eliminated extraneous allocations and runs pretty quickly:
function OLS_cd_unrolled{T<:Float64}(A::Array{T,2}, b::Array{T,1}, tolerance::T=1e-12)
    N,P = size(A)
    x = zeros(P)
    r = copy(b)
    d = ones(P)
    while norm(d,Inf) > tolerance
        @inbounds for j = 1:P
            d[j] = 0.0; @inbounds for i = 1:N d[j] += A[i,j]*r[i] end
            @inbounds for i = 1:N r[i] -= d[j]*A[i,j] end
            x[j] += d[j]
        end
    end
    return(x)
end
A[:,j] creates a copy, not a view. You want to use @view A[:,j] or view(A,:,j).
You can devectorize r -= d[j]*A[:,j] with r .= -.(r,d[j]*A[:,j]) to get rid of some more temporaries. As @LutfullahTomak said, sum(A[:,j].*r) should devectorize as dot(view(A,:,j),r) to get rid of all of the temporaries in there. To use an infix operator, you can use \cdot, as in view(A,:,j)⋅r.
You should read up on copies vs. views and how vectorization causes temporary arrays. The gist of it is that when vectorized operations occur, they have to create a new vector as output. Instead, you want to write into an existing vector. For an array, r = ... changes the reference: r = ex for some expression ex that makes an array will allocate a new array and then point r at it. r .= ex will instead replace the values of the existing array r with the values from the expression. The former allocates a temporary, the latter does not. Repeated application of this idea is where all of the temporaries come from.
Actually, sum(d.*d), sum(A[:,j].*r), and so on are not in-place and make temporary arrays. First, sum(d.*d) == dot(d,d), I think, and sum(A[:,j].*r) makes 2 temporary arrays. I'd do dot(view(A,:,j),r) for the latter. The current stable version of Julia (0.5) doesn't have a short version of r -= d[j]*A[:,j], so you need to devectorize it and write a loop.
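Putting these suggestions together (views plus dot, and the fused in-place update of r that later Julia versions provide), the function might look like this. A sketch in current Julia syntax, untested:

using LinearAlgebra

function OLS_cd_views(A, b, tolerance=1e-12)
    N, P = size(A)
    x = zeros(P)
    r = copy(b)
    d = ones(P)
    while dot(d, d) > tolerance
        @inbounds for j in 1:P
            Aj = view(A, :, j)   # no copy of the column
            d[j] = dot(Aj, r)    # no temporary from A[:,j].*r
            x[j] += d[j]
            r .-= d[j] .* Aj     # fused broadcast, updates r in place
        end
    end
    return x
end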

Dynamic Programming and Probability

I've been staring at this problem for hours and I'm still as lost as I was at the beginning. It's been a while since I took discrete math or statistics, so I tried watching some videos on YouTube, but I couldn't find anything that would help me solve the problem in less than what seems to be exponential time. Any tips on how to approach the problem below would be very much appreciated!
A certain species of fern thrives in lush rainy regions, where it typically rains almost every day.
However, a drought is expected over the next n days, and a team of botanists is concerned about
the survival of the species through the drought. Specifically, the team is convinced of the following
hypothesis: the fern population will survive if and only if it rains on at least n/2 days during the
n-day drought. In other words, for the species to survive there must be at least as many rainy days
as non-rainy days.
Local weather experts predict that the probability that it rains on a day i ∈ {1, . . . , n} is
pi ∈ [0, 1], and that these n random events are independent. Assuming both the botanists and
weather experts are correct, show how to compute the probability that the ferns survive the drought.
Your algorithm should run in time O(n^2).
Have an n×(n+1) matrix such that C[i][j] denotes the probability that after the i-th day there will have been j rainy days (i runs from 1 to n, j runs from 0 to n). Initialize:
C[1][0] = 1 - p[1]
C[1][1] = p[1]
C[1][j] = 0 for j > 1
Now loop over the days and set the values of the matrix like this:
C[i][0] = (1 - p[i]) * C[i-1][0]
C[i][j] = (1 - p[i]) * C[i-1][j] + p[i] * C[i - 1][j - 1] for j > 0
Finally, sum the values from C[n][ceil(n/2)] to C[n][n] to get the probability of fern survival.
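A direct implementation of this recurrence in Python (a sketch: 0-based indexing, a rolling 1-D array instead of the full matrix since each day only needs the previous day's row, and ceil(n/2) to cover odd n):

import math

def survival_probability(p):
    n = len(p)
    # C[j] = probability of exactly j rainy days among the days processed so far
    C = [0.0] * (n + 1)
    C[0] = 1.0
    for i in range(n):
        # sweep right-to-left so C[j-1] still holds the previous day's value
        for j in range(i + 1, 0, -1):
            C[j] = (1 - p[i]) * C[j] + p[i] * C[j - 1]
        C[0] = (1 - p[i]) * C[0]
    # survive with at least ceil(n/2) rainy days
    return sum(C[math.ceil(n / 2):])

print(survival_probability([0.2, 0.4, 0.6, 0.8]))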
Dynamic programming problems can be solved in a top-down or bottom-up fashion.
You've already had the bottom-up version described. To do the top-down version, write a recursive function, then add a caching layer so you don't recompute any results that you already computed. In pseudo-code:
cache = {}
function whatever(args)
    if args not in cache
        compute result
        cache[args] = result
    return cache[args]
This process is called "memoization" and many languages have ways of automatically memoizing things.
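In Python, for instance, functools.lru_cache gives you this caching for free; a sketch of the same pattern:

from functools import lru_cache

@lru_cache(maxsize=None)   # results are cached, keyed on the arguments
def whatever(day, rained):
    ...                    # compute and return the result; repeat calls hit the cache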
Here is a Python implementation of this specific example:
def prob_survival(daily_probabilities):
    days = len(daily_probabilities)
    days_needed = days / 2
    # An inner function to do the calculation.
    cached_odds = {}
    def prob_survival(day, rained):
        if days_needed <= rained:
            return 1.0
        elif days <= day:
            return 0.0
        elif (day, rained) not in cached_odds:
            p = daily_probabilities[day]
            p_a = p * prob_survival(day+1, rained+1)
            p_b = (1 - p) * prob_survival(day+1, rained)
            cached_odds[(day, rained)] = p_a + p_b
        return cached_odds[(day, rained)]
    return prob_survival(0, 0)
And then you would call it as follows:
print(prob_survival([0.2, 0.4, 0.6, 0.8]))

Efficient way of computing dot product inside double sum in python3

I'm looking into how to compute, as efficiently as possible in Python 3, a dot product inside a double sum of the form:
import cmath
for j in range(0,N):
    for k in range(0,N):
        sum_p += cmath.exp(-1j * sum(a*b for a,b in zip(x, [l - m for l, m in zip(r_p[j], r_p[k])])))
where r_p is an array of several thousand triples and x is a constant triple. Timing for N=1000 triples is about 2.4 s. The same using numpy:
import numpy as np
for j in range(0,N):
    for k in range(0,N):
        sum_np = np.add(sum_np, np.exp(-1j * np.inner(x_np, (r_np[j] - r_np[k]))))
is actually slower, with a runtime of about 4.0 s. I presume this is because there is no big vectorization advantage: the only vectorized operation is the short length-3 dot product np.inner, whose benefit is eaten up by launching N^2 of them inside the loop.
However, I could get a modest speedup over the first example by using plain Python 3 with map and mul:
from operator import mul
for j in range(0,N):
    for k in range(0,N):
        sum_p += cmath.exp(-1j * sum(map(mul, x, [l - m for l, m in zip(r_p[j], r_p[k])])))
with a runtime of about 2.0 s.
Attempts either to use an if condition to skip the case j = k, where r_np[j] - r_np[k] = 0 and thus the dot product is also 0, or to split the sum in two to achieve the same:
for j in range(0,N):
    for k in range(j+1,N):
        ...
for k in range(0,N):
    for j in range(k+1,N):
        ...
both made it even slower. So the whole thing scales with O(N^2), and I wonder whether some method, such as sorting or something else, could get rid of the loops and make it scale with O(N log N).
The problem is that I need single-digit-second runtimes for sets of N~6000 triples, as I have thousands of these sums to compute. Otherwise I have to try scipy's weave, numba, or pyrex, or go down the C path entirely…
Thanks in advance for any help!
Edit:
this is what a data sample looks like:
# numpy arrays
x_np = np.array([0,0,1], dtype=np.float64)
N=1000
xy = np.multiply(np.subtract(np.random.rand(N,2),0.5),8)
z = np.linspace(0,40,N).reshape(N,1)
r_np = np.hstack((xy,z))
# in python format
x = (0,0,1)
r_p = r_np.tolist()
I used this to generate test data:
x = (1, 2, 3)
r_p = [(i, j, k) for i in range(10) for j in range(10) for k in range(10)]
On my machine, this took 2.7 seconds with your algorithm.
Then I got rid of the zips and sum:
for j in range(0,N):
    for k in range(0,N):
        s = 0
        for t in range(3):
            s += x[t] * (r_p[j][t] - r_p[k][t])
        sum_p += cmath.exp(-1j * s)
This brought it down to 2.4 seconds.
Then I noted that x is constant so:
x * (p - q) = x1*p1 - x1*q1 + x2*p2 - x2*q2 + ...
So I changed the generation code to:
x = (1, 2, 3)
r_p = [(x[0] * i, x[1] * j, x[2] * k) for i in range(10) for j in range(10) for k in range(10)]
And the algorithm to:
for j in range(0,N):
    for k in range(0,N):
        s = 0
        for t in range(3):
            s += r_p[j][t] - r_p[k][t]
        sum_p += cmath.exp(-1j * s)
Which got me to 2.0 seconds.
Then I realized we can rewrite it as:
for j in range(0,N):
    for k in range(0,N):
        sum_p += cmath.exp(-1j * (sum(r_p[j]) - sum(r_p[k])))
Which, surprisingly, got me to 1.1 seconds, which I can't really explain - maybe some caching going on?
Anyway, caching or not, you can precompute the sums of your triples and then you won't have to rely on the caching mechanism. I did that:
import time

sums = [sum(a) for a in r_p]
sum_p = 0
N = len(r_p)
start = time.clock()
for j in range(0,N):
    for k in range(0,N):
        sum_p += cmath.exp(-1j * (sums[j] - sums[k]))
Which got me to 0.73 seconds.
I hope this is good enough!
Update:
Here's one around 0.01 seconds with a single for loop. It seems mathematically sound, but it's giving slightly different results, which I'm guessing is due to precision issues. I'm not sure how to fix those, but I thought I'd post it in case you can live with the precision issues or someone knows how to fix them.
However, considering that I'm using fewer exp calls than your initial code, maybe this is actually the more correct version, and your initial approach is the one with precision issues.
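For reference, here is the identity I believe the single-loop version exploits: writing $s_j$ for the precomputed sum of the j-th triple,

$$\sum_{j=1}^{N}\sum_{k=1}^{N} e^{-i(s_j - s_k)} = \left(\sum_{j=1}^{N} e^{-i s_j}\right)\left(\sum_{k=1}^{N} e^{+i s_k}\right) = \left|\sum_{j=1}^{N} e^{-i s_j}\right|^2,$$

so the N^2 exponentials collapse to 2N of them, and the exact result is real and non-negative, which is a handy sanity check against the tiny imaginary parts the double loop leaves behind.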
sums = [sum(a) for a in r_p]
e_denom = sum([cmath.exp(1j * p) for p in sums])
sum_p = 0
N = len(r_p)
start = time.clock()
for j in range(0,N):
    sum_p += e_denom * cmath.exp(-1j * sums[j])
print(sum_p)
end = time.clock()
print(end - start)
Update 2:
The same, except with fewer multiplications and a sum function call:
sum_p = e_denom * sum([np.exp(-1j * p) for p in sums])
That double loop is a time killer in numpy. If you use vectorized array operations, the evaluation is cut to under a second.
In [1764]: sum_np = 0
In [1765]: for j in range(0,N):
      ...:     for k in range(0,N):
      ...:         sum_np += np.exp(-1j * np.inner(x_np, (r_np[j] - r_np[k])))
In [1766]: sum_np
Out[1766]: (2116.3316526447466-1.0796252780664872e-11j)
In [1767]: np.exp(-1j * np.inner(x_np, (r_np[:N,None,:] - r_np[None,:N,:]))).sum((0,1))
Out[1767]: (2116.3316526447466-1.0796252780664872e-11j)
Timings:
In [1768]: timeit np.exp(-1j * np.inner(x_np, (r_np[:N,None,:] - r_np[None,:N,:]))).sum((0,1))
1 loops, best of 3: 506 ms per loop
In [1769]: %%timeit
      ...: sum_np = 0
      ...: for j in range(0,N):
      ...:     for k in range(0,N):
      ...:         sum_np += np.exp(-1j * np.inner(x_np, (r_np[j] - r_np[k])))
1 loops, best of 3: 12.9 s per loop
Replacing np.inner with np.einsum shaves 20% off the time:
np.exp(-1j * np.einsum('k,ijk', x_np, r_np[:N,None,:]-r_np[None,:N,:])).sum((0,1))
OK guys, thanks a lot for the help. IVlad's last code, which uses the identity sum_j sum_k a[j]*a[k] = (sum_j a[j]) * (sum_k a[k]), makes the biggest difference. This now also scales with less than O(N^2).
Precalculating the dot products before the sum makes hpaulj's numpy suggestion exactly as fast:
sum_np = 0
dotprods = np.inner(q_np,r_np)
sum_rkexp = np.exp(1j * dotprods).sum()
sum_np = sum_rkexp * np.exp(-1j * dotprods).sum()
both with runtimes of about 0.0003 s. However, I found one more thing that gives another ~50% improvement: instead of computing the exponential twice, I take the complex conjugate inside the sum:
sum_np = 0
dotprods = np.inner(q_np,r_np)
rkexp = np.exp(1j * dotprods)
sum_rkexp = rkexp.sum()
sum_np = sum_rkexp * np.conj(rkexp).sum()
which runs in around 0.0002 s. Compared with my first non-vectorized numpy attempts, which took ~4 s, this is a speedup of about 2*10^4, and for my 'real data' arrays of N~6000, which ran for about 125 s, I now get 0.0005 s, an amazing speedup of about 2.5*10^5. Thanks a lot, IVlad and hpaulj, I learned a lot in the last day :)
P.S. I'm amazed by how quickly you guys answer with things that took me half a day just to follow ;)
