Why is pypy3 slower than python - performance

In an attempt to run my code faster I thought pypy would be just the job. However, I am finding that it is actually slower for some of my code.
Can someone help me understand why this is the case?
This is the call (a weighted sum) that I have identified which is slower in pypy, which is the main method within a class, at line 663 of functions.py.
It is called 500000 times.
def __call__(self, verbose=False):
    if len(self.links) != self.weights.size:
        raise Exception(f'Number of links ...')
    super().check_links(len(self.links))
    inputs = np.array([link.get_value() for link in self.links])
    self.value = np.dot(inputs, self.weights)
    return super().__call__(verbose)
Here is the snakeviz view of pypy run with cProfile
And here is the snakeviz view run with python
EDIT: 20210612
@mattip I took your advice and tried a dot product in standard python (sdot). Here are timings with numpy dot (ndot) for python and pypy.
It is good that pypy sdot (0.299) is faster than python sdot (0.749) and faster than both ndots (1.075/4.165). However, what surprises me is that sdot (python lists) is faster than ndot (numpy arrays), with the python interpreter.
Why is that? I had thought that numpy was supposed to be an optimised, fast package for this sort of thing.
Here is the code:
numpy dot product
def runndot(runs):
    weightslist = [0.5, 0.5, 0.5, 0.5, 0.5]
    weights = np.array(weightslist)
    inputslist = [0.1, 0.1, 0.1, 0.1, 0.1]
    inputs = np.array(inputslist)
    for _ in range(runs):
        value = np.dot(inputs, weights)
    return value
python lists dot product
def runsdot(runs):
    weights = [0.5, 0.5, 0.5, 0.5, 0.5]
    inputs = [0.1, 0.1, 0.1, 0.1, 0.1]
    for _ in range(runs):
        value = dot(inputs, weights)
    return value
def dot(inputs, weights):
    sum = 0
    for i in range(len(inputs)):
        sum += inputs[i]*weights[i]
    return sum

You are using NumPy, which is written in C. For PyPy to use C extensions like NumPy, it has to jump through some hoops that make the Python-C-Python transitions slow. I am not aware of a quick replacement for np.dot on PyPy, sorry. There is work afoot to make it happen, but it will not be available for a year or two.
You may be interested in using Numba to speed up this kind of code.
If the shape of your array is small, you can hand-write the dot product in python, avoid NumPy, and be fast.
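A minimal sketch of such a hand-written dot product, assuming the inputs are plain Python sequences of equal length (the function name `dot` and the length check are illustrative choices, not part of any library):

```python
def dot(inputs, weights):
    """Plain-Python dot product of two equal-length sequences."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    # zip pairs the elements up; sum adds the products without indexing.
    return sum(x * w for x, w in zip(inputs, weights))


value = dot([0.1, 0.1, 0.1, 0.1, 0.1], [0.5, 0.5, 0.5, 0.5, 0.5])
```

Because no C extension is involved, PyPy's JIT can compile the whole loop, which is likely why this style pays off for short vectors.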

Related

Optimize rank computation for high dimension tensor

My programme spends a lot of time on the code below, even though it runs on a GPU machine. How can I optimise it, please? The tensors can be of this size: y_ul.shape = [8, 512, 128, 128]
for i, m in enumerate(y_ul):
    for j, l in enumerate(m):
        ranks_topleft.append(torch.matrix_rank(l))
mean_rank_topleft = torch.mean(ranks_topleft.float())
Unlike torch.matrix_rank, torch.linalg.matrix_rank accepts batched inputs (see its documentation). You can try:
ranks = torch.linalg.matrix_rank(y_ul) # shape (8, 512)
mean_rank_topleft = torch.mean(ranks.float()) # scalar; ranks are integers, so cast before averaging
Note that you can adjust the tolerance of the computation (though the default should be good). Also, if your matrices are symmetric, you can pass hermitian=True to speed up the calculation.
Note that torch.matrix_rank is deprecated and will be removed in a future version!

Python Gekko - How to use built-in maximum function with sequential solvers?

Solving models sequentially in Python GEKKO (i.e. with IMODE >= 4) fails when using the max2 and max3 functions that come with GEKKO.
This is for use cases, where np.maximum or the standard max function treat a GEKKO parameter like an array, which is not always the intended usage or can create errors when comparing against integers for example.
minimal code example:
from gekko import GEKKO
import numpy as np
m = GEKKO()
m.time = np.arange(0,20)
y = m.Var(value=5)
forcing = m.Param(value=np.arange(-5,15))
m.Equation(y.dt()== m.max2(forcing,0) * y)
m.options.IMODE=4
m.solve(disp=False)
returns:
Exception: #error: Degrees of Freedom
* Error: DOF must be zero for this mode
STOPPING...
I know from looking at the code that both max2 and max3 use inequality expressions in the equations, which understandably introduces degrees of freedom, so was this functionality never intended? Could there be some workaround to fix this?
Any help would be much appreciated!
Note:
I hope this is not a duplicate of How to define maximum of Intermediate and another value in Python Gekko, when using sequential solver?, but instead asking a more concise & different question, about essentially the same issue.
You can get a successful solution by switching to IMODE=6. IMODE=4 (simultaneous simulation) and IMODE=7 (sequential simulation) require zero degrees of freedom. Both m.max2() and m.max3() require degrees of freedom and an optimizer to solve.
from gekko import GEKKO
import numpy as np
m = GEKKO(remote=False)
m.time = np.arange(0,20)
y = m.Var(value=5)
forcing = m.Param(value=np.arange(-5,15))
m.Equation(y.dt()== -m.max2(forcing,0) * y)
m.options.IMODE=6
m.solve(disp=True)
The original equation y.dt()== m.max2(forcing,0) * y increases exponentially beyond machine precision, so I switched the sign to get an equation that can be solved.

I cannot control the number of chains and jobs in pymc3

I am trying to use pymc3 to generate some samples from a GMM distribution, here is my code:
w = sp.array([.3, .6, 0.1])
mu = sp.array([-2, 1, 4])
sd = sp.array([1, 0.5, 0.5])
with pm.Model() as model:
    pm.NormalMixture('x', w=w, mu=mu, sd=sd)
    step = pm.Metropolis(tune=False, S=sp.array([1]))
    trace = pm.sampling.sample(1000, step=step, start={'x':5},
                               chain=10, cores=1, tune=0)
result = trace['x']
However, no matter what I do with "chain" and "cores", I get the following:
Multiprocess sampling (2 chains in 2 jobs)
Metropolis: [x]
100%|██████████| 1000/1000 [00:00<00:00, 1407.68it/s]
You should use chains and njobs. Note that running n_chains chains with 1000 samples each means you will actually get n_chains * 1000 total draws from your model. The njobs argument is passed to joblib, which figures out how to distribute those chains on your machine.
cores will be accepted starting with PyMC 3.4 (or on master as of January, 2018). It is A Bad Thing that sample accepts keyword arguments and silently does nothing with them. That would be a useful contribution, or issue, in the project.

A very quick method to approximate np.random.dirichlet with large dimension

I'd like to evaluate np.random.dirichlet with large dimension as quickly as possible. More precisely, I'd like a function that approximates the code below and runs at least 10 times faster. Empirically, I observed that a small-dimension version of this function outputs one or two entries of order 0.1, while all other entries are so small that they are immaterial. But this observation isn't based on any rigorous assessment. The approximation doesn't need to be very accurate, but I want something not too crude, as I'm using this noise for MCTS.
def g():
    np.random.dirichlet([0.03]*4840)
>>> timeit.timeit(g,number=1000)
0.35117408499991143
Assuming your alpha is fixed over components and used for many iterations, you could tabulate the ppf of the corresponding gamma distribution. This is probably available as scipy.stats.gamma.ppf, but we can also use scipy.special.gammaincinv. This function seems rather slow, so this is a significant upfront investment.
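The gamma distribution enters because a symmetric Dirichlet draw can be formed by taking independent Gamma(alpha, 1) draws and dividing each by their sum. A minimal standard-library sketch of that identity follows; it shows the relationship only, not the tabulated speed-up, and the function name `symmetric_dirichlet` is made up for illustration:

```python
import random


def symmetric_dirichlet(alpha, n_comp, rng=random):
    """Draw one sample from a symmetric Dirichlet(alpha, ..., alpha)."""
    # Independent Gamma(shape=alpha, scale=1) draws ...
    gammas = [rng.gammavariate(alpha, 1.0) for _ in range(n_comp)]
    # ... normalised by their sum give a point on the simplex.
    total = sum(gammas)
    return [g / total for g in gammas]


random.seed(0)
sample = symmetric_dirichlet(0.5, 10)
```

Tabulating the gamma ppf replaces the per-component gamma draw with a table lookup at a uniformly random index, which is where the speed-up would come from.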
Here is a crude implementation of the general idea:
import numpy as np
from scipy import special
class symm_dirichlet:
    def __init__(self, alpha, resolution=2**16):
        self.alpha = alpha
        self.resolution = resolution
        self.range, delta = np.linspace(0, 1, resolution,
                                        endpoint=False, retstep=True)
        self.range += delta / 2
        self.table = special.gammaincinv(self.alpha, self.range)
    def draw(self, n_sampl, n_comp, interp='nearest'):
        if interp != 'nearest':
            raise NotImplementedError
        gamma = self.table[np.random.randint(0, self.resolution,
                                             (n_sampl, n_comp))]
        return gamma / gamma.sum(axis=1, keepdims=True)
import time, timeit
t0 = time.perf_counter()
X = symm_dirichlet(0.03)
t1 = time.perf_counter()
print(f'Upfront cost {t1-t0:.3f} sec')
print('Running cost per 1000 samples of width 4840')
print('tabulated {:3f} sec'.format(timeit.timeit(
    'X.draw(1, 4840)', number=1000, globals=globals())))
print('np.random.dirichlet {:3f} sec'.format(timeit.timeit(
    'np.random.dirichlet([0.03]*4840)', number=1000, globals=globals())))
Sample output:
Upfront cost 13.067 sec
Running cost per 1000 samples of width 4840
tabulated 0.059365 sec
np.random.dirichlet 0.980067 sec
Better check whether it is roughly correct:

How can I efficiently calculate the binomial cumulative distribution function?

Let's say that I know the probability of a "success" is P. I run the test N times, and I see S successes. The test is akin to tossing an unevenly weighted coin (perhaps heads is a success, tails is a failure).
I want to know the approximate probability of seeing either S successes, or a number of successes less likely than S successes.
So for example, if P is 0.3, N is 100, and I get 20 successes, I'm looking for the probability of getting 20 or fewer successes.
If, on the other hand, P is 0.3, N is 100, and I get 40 successes, I'm looking for the probability of getting 40 or more successes.
I'm aware that this problem relates to finding the area under a binomial curve, however:
My math-fu is not up to the task of translating this knowledge into efficient code
While I understand a binomial curve would give an exact result, I get the impression that it would be inherently inefficient. A fast method to calculate an approximate result would suffice.
I should stress that this computation has to be fast, and should ideally be determinable with standard 64 or 128 bit floating point computation.
I'm looking for a function that takes P, S, and N - and returns a probability. As I'm more familiar with code than mathematical notation, I'd prefer that any answers employ pseudo-code or code.
Exact Binomial Distribution
from functools import reduce  # reduce is no longer a builtin in Python 3
def factorial(n):
    if n < 2: return 1
    return reduce(lambda x, y: x*y, range(2, int(n)+1))
def prob(s, p, n):
    x = 1.0 - p
    a = n - s
    b = s + 1
    c = a + b - 1
    prob = 0.0
    for j in range(a, c + 1):
        # integer division keeps the binomial coefficient exact
        prob += factorial(c) // (factorial(j)*factorial(c-j)) \
                * x**j * (1 - x)**(c-j)
    return prob
>>> prob(20, 0.3, 100)
0.016462853241869437
>>> 1-prob(40-1, 0.3, 100)
0.020988576003924564
Normal Estimate, good for large n
import math
def erf(z):
    t = 1.0 / (1.0 + 0.5 * abs(z))
    # use Horner's method
    ans = 1 - t * math.exp( -z*z - 1.26551223 +
                            t * ( 1.00002368 +
                            t * ( 0.37409196 +
                            t * ( 0.09678418 +
                            t * (-0.18628806 +
                            t * ( 0.27886807 +
                            t * (-1.13520398 +
                            t * ( 1.48851587 +
                            t * (-0.82215223 +
                            t * ( 0.17087277))))))))))
    if z >= 0.0:
        return ans
    else:
        return -ans
def normal_estimate(s, p, n):
    u = n * p
    o = (u * (1-p)) ** 0.5
    return 0.5 * (1 + erf((s-u)/(o*2**0.5)))
>>> normal_estimate(20, 0.3, 100)
0.014548164531920815
>>> 1-normal_estimate(40-1, 0.3, 100)
0.024767304545069813
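On Python 3 the standard library already provides math.erf, so the hand-written polynomial is optional. A minimal sketch of the same normal estimate using it (the variable names `mean` and `std` are mine, and this is a variant, not the answer's original code):

```python
import math


def normal_estimate(s, p, n):
    """Normal approximation to P(X <= s) for X ~ Binomial(n, p)."""
    mean = n * p
    std = math.sqrt(n * p * (1 - p))
    # Standard normal CDF evaluated at the standardised value of s.
    return 0.5 * (1 + math.erf((s - mean) / (std * math.sqrt(2))))


lower_tail = normal_estimate(20, 0.3, 100)
upper_tail = 1 - normal_estimate(40 - 1, 0.3, 100)
```

The results should agree with the hand-written version to within the accuracy of its erf approximation.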
Poisson Estimate: Good for large n and small p
import math
def poisson(s,p,n):
    # uses factorial() from the exact version above
    L = n*p
    sum = 0
    for i in range(0, s+1):
        sum += L**i/factorial(i)
    return sum*math.e**(-L)
>>> poisson(20, 0.3, 100)
0.013411150012837811
>>> 1-poisson(40-1, 0.3, 100)
0.046253037645840323
I was on a project where we needed to be able to calculate the binomial CDF in an environment that didn't have a factorial or gamma function defined. It took me a few weeks, but I ended up coming up with the following algorithm which calculates the CDF exactly (i.e. no approximation necessary). Python is basically as good as pseudocode, right?
import numpy as np
def binomial_cdf(x,n,p):
    cdf = 0
    b = 0
    for k in range(x+1):
        if k > 0:
            b += np.log(n-k+1) - np.log(k)
        log_pmf_k = b + k * np.log(p) + (n-k) * np.log(1-p)
        cdf += np.exp(log_pmf_k)
    return cdf
Performance scales with x. For small values of x, this solution is about an order of magnitude faster than scipy.stats.binom.cdf, with similar performance at around x=10,000.
I won't go into a full derivation of this algorithm because stackoverflow doesn't support MathJax, but the thrust of it is first identifying the following equivalence:
For all k > 0, sp.misc.comb(n,k) == np.prod([(n-i+1)/i for i in range(1,k+1)])
Which we can rewrite as:
sp.misc.comb(n,k) == sp.misc.comb(n,k-1) * (n-k+1)/k
or in log space:
np.log( sp.misc.comb(n,k) ) == np.log(sp.misc.comb(n,k-1)) + np.log(n-k+1) - np.log(k)
Because the CDF is a summation of PMFs, we can use this formulation to calculate the binomial coefficient (the log of which is b in the function above) for PMF_{x=i} from the coefficient we calculated for PMF_{x=i-1}. This means we can do everything inside a single loop using accumulators, and we don't need to calculate any factorials!
The reason most of the calculations are done in log space is to improve the numerical stability of the polynomial terms, i.e. p^x and (1-p)^(n-x) have the potential to be extremely large or extremely small, which can cause computational errors.
EDIT: Is this a novel algorithm? I've been poking around on and off since before I posted this, and I'm increasingly wondering if I should write this up more formally and submit it to a journal.
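The recurrence described in this answer can be checked without NumPy or SciPy. The sketch below restates it with the standard math module and compares it against a direct sum built on math.comb (Python 3.8+). The helper name `binomial_cdf_direct` is made up for the comparison, and 0 < p < 1 is assumed so the logarithms are defined:

```python
import math


def binomial_cdf(x, n, p):
    """P(X <= x) for X ~ Binomial(n, p), 0 < p < 1, via the log-space recurrence."""
    log_coeff = 0.0  # log C(n, 0)
    cdf = 0.0
    for k in range(x + 1):
        if k > 0:
            # log C(n, k) = log C(n, k-1) + log(n-k+1) - log(k)
            log_coeff += math.log(n - k + 1) - math.log(k)
        log_pmf = log_coeff + k * math.log(p) + (n - k) * math.log(1 - p)
        cdf += math.exp(log_pmf)
    return cdf


def binomial_cdf_direct(x, n, p):
    """Reference value: the same sum with exact binomial coefficients."""
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x + 1))


approx = binomial_cdf(20, 100, 0.3)
reference = binomial_cdf_direct(20, 100, 0.3)
```

Both values should match the 0.0164628... figure quoted in the other answers for P = 0.3, N = 100, S = 20.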
I think you want to evaluate the incomplete beta function.
There's a nice implementation using a continued fraction representation in "Numerical Recipes In C", chapter 6: 'Special Functions'.
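For reference, the standard identity linking the binomial CDF to the regularized incomplete beta function is given below, writing $k$ for the number of successes, $n$ for the number of trials and $p$ for the success probability (symbols chosen here, not taken from the answer):

```latex
F(k; n, p) = \Pr(X \le k) = I_{1-p}(n-k,\; k+1)
           = (n-k)\binom{n}{k}\int_{0}^{1-p} t^{\,n-k-1}(1-t)^{k}\,dt
```

So any routine for $I_x(a, b)$ gives the CDF directly with $x = 1-p$, $a = n-k$, $b = k+1$.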
I can't totally vouch for the efficiency, but Scipy has a module for this
from scipy.stats.distributions import binom
binom.cdf(successes, attempts, chance_of_success_per_attempt)
An efficient and, more importantly, numerically stable algorithm exists in the domain of Bezier curves used in Computer Aided Design. It is called de Casteljau's algorithm, and it is used to evaluate the Bernstein polynomials that define Bezier curves.
I believe that I am only allowed one link per answer so start with Wikipedia - Bernstein Polynomials
Notice the very close relationship between the Binomial Distribution and the Bernstein Polynomials. Then click through to the link on de Casteljau's algorithm.
Let's say I know the probability of throwing a heads with a particular coin is P. What is the probability of me throwing the coin T times and getting at least S heads?
Set n = T
Set beta[i] = 0 for i = 0, ... S - 1
Set beta[i] = 1 for i = S, ... T
Set t = p
Evaluate B(t) using de Casteljau
or at most S heads?
Set n = T
Set beta[i] = 1 for i = 0, ... S
Set beta[i] = 0 for i = S + 1, ... T
Set t = p
Evaluate B(t) using de Casteljau
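The two recipes above can be sketched in a few lines of Python. This is an illustration under my own naming (`de_casteljau`, `prob_at_least`, `prob_at_most`), not code from any CAD toolkit, and it costs O(T^2) operations:

```python
def de_casteljau(beta, t):
    """Evaluate a polynomial given by Bernstein coefficients beta at t."""
    b = list(beta)
    n = len(b) - 1
    for r in range(1, n + 1):
        # Each pass replaces neighbouring pairs by their convex combination.
        for i in range(n - r + 1):
            b[i] = (1 - t) * b[i] + t * b[i + 1]
    return b[0]


def prob_at_least(s, n, p):
    """P(X >= s) for X ~ Binomial(n, p): beta[i] = 1 for i = s, ..., n."""
    beta = [0.0] * s + [1.0] * (n - s + 1)
    return de_casteljau(beta, p)


def prob_at_most(s, n, p):
    """P(X <= s) for X ~ Binomial(n, p): beta[i] = 1 for i = 0, ..., s."""
    beta = [1.0] * (s + 1) + [0.0] * (n - s)
    return de_casteljau(beta, p)


lower_tail = prob_at_most(20, 100, 0.3)
upper_tail = prob_at_least(40, 100, 0.3)
```

Every step is a weighted average of two numbers in [0, 1], which is where the numerical stability comes from.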
Open source code probably exists already. NURBS Curves (Non-Uniform Rational B-spline Curves) are a generalization of Bezier Curves and are widely used in CAD. Try openNurbs (the license is very liberal) or failing that Open CASCADE (a somewhat less liberal and opaque license). Both toolkits are in C++, though, IIRC, .NET bindings exist.
If you are using Python, there is no need to code it yourself. SciPy has you covered:
from scipy.stats import binom
# probability that you get 20 or fewer successes out of 100, when p=0.3
binom.cdf(20, 100, 0.3)
>>> 0.016462853241869434
# probability that you get exactly 20 successes out of 100, when p=0.3
binom.pmf(20, 100, 0.3)
>>> 0.0075756449257260777
From the portion of your question "getting at least S heads", you want the cumulative binomial distribution function. See http://en.wikipedia.org/wiki/Binomial_distribution for the equation, which is described in terms of the "regularized incomplete beta function" (as already answered). If you just want to calculate the answer without having to implement the entire solution yourself, the GNU Scientific Library provides the functions gsl_cdf_binomial_P and gsl_cdf_binomial_Q.
The DCDFLIB Project has C# functions (wrappers around C code) to evaluate many CDF functions, including the binomial distribution. You can find the original C and FORTRAN code here. This code is well tested and accurate.
If you want to write your own code to avoid being dependent on an external library, you could use the normal approximation to the binomial mentioned in other answers. Here are some notes on how good the approximation is under various circumstances. If you go that route and need code to compute the normal CDF, here's Python code for doing that. It's only about a dozen lines of code and could easily be ported to any other language. But if you want high accuracy and efficient code, you're better off using third party code like DCDFLIB. Several man-years went into producing that library.
Try this one, used in GMP. Another reference is this.
import numpy as np
np.random.seed(1)
# 10000 trials of 20 coin flips each, with probability 0.6 of heads
x = np.random.binomial(20, 0.6, 10000)
sum(x > 12) / len(x)
The output shows that in about 41% of the trials we got more than 12 heads.
