Improving performance of Cython function for faster file read - performance

I'm trying to read an ascii file(text-based) of a particular format. I have done some line-profiling and post of the time is being used in looping. I'm trying if the code inside the loop could be improved in performance.
Things I've tried
Faster indexing of numpy array by initializing numpy array with buffer interface as in official docs, which I expected to speed up a lot but barely made a difference.
Custom type conversion function(no python interaction) to replace int(line[0:5]) but it ended being rather costly
The custom function for type conversion
cdef int fast_atoi(str buf):
cdef int i=0 ,c = 0, x = 0
for i in range(5):
c = buf[i]
if c > 47 and c < 58:
x = x * 10 + c - 48
return x
The main code-block which I want to optimize
def func(filename):
cdef np.ndarray[np.int32_t] a1
cdef np.ndarray[object] a2
cdef np.ndarray[object] a3
cdef np.ndarray[np.int32_t] a4
cdef int count = 0
cdef int n_lines
cdef str line
with open(filename) as inf:
next(inf)
n_lines = int(next(inf))
a1 = np.zeros(n_atoms, dtype=np.int32)
a2 = np.zeros(n_atoms, dtype=object)
a3 = np.zeros(n_atoms, dtype = object)
a4 = np.zeros(n_atoms, dtype=np.int32)
for i,line in enumerate(inf):
if i == n_lines:
break
try:
a1[i] = int(line[0:5]) #custom function fast_atoi(line[0:5])
a2[i] = line[5:10].strip()
a3[i] = line[10:15].strip()
a4[i] = int(line[15:20])
except (ValueError, TypeError) as e:
break
I'm having a 4.3 mb file
Author
n_lines
1xyz A 1 5.202 4.356 3.155
1mno A1 2 5.119 4.411 3.172
1mno A2 3 5.155 4.283 3.104
1nnn B3 4 5.247 4.318 3.237
1xax KA 5 5.306 4.421 3.075
1ooo MA 6 5.383 4.347 3.054
1cbd NB 7 5.257 4.474 2.941
1orc OB1 8 5.189 4.404 2.893
Current implementation takes 76ms on average on my machine, Adding the custom function mentioned makes it worse.
I'd be very grateful if some improvements could be suggested. I'm new to cython.

Related

Runge-Kutta curve fitting extremely slow

I am currently trying to do a regression of a function calculated via a RK4 method performed on a non-linear Volterra integral of the second kind. The problem I found is that the code is extremely slow, for 1 call of the curve_fit function (fitt), it takes about 30-40 minute to generate a data. Overall, there will be a lot of calls to fitt before the parameters are determined, this takes more than 6 hours to run. Is there anyway to optimize this code? Thanks in advance!
from scipy.special import gamma
from ml_internal import LTInversion
from scipy.optimize import curve_fit , fsolve
from scipy.misc import derivative
from sklearn.metrics import r2_score
from math import comb , factorial
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
# Gets the data
df = pd.read_excel('D:\\CoMat\\Fractional_fit\\optimized\\data_optimized.xlsx')
skipTime = 1
skipIndex = df[df['Time']== skipTime].index.values[0]
xls = pd.read_excel('D:\\CoMat\\Fractional_fit\\optimized\\data_optimized.xlsx',skiprows=np.arange(1,skipIndex+1,1))
timeDF = xls['Time']
tempDF = xls['Temp']
taDF = xls['Ta']
timeDF = timeDF - timeDF[0]
tempDF = tempDF + 273.15
t0 = tempDF[0]
ta = sum(taDF)/len(taDF)
ta = ta + 273.15
###########################################
#Spliting into intervals
h = 0.05
a = 0
b = timeDF[len(timeDF)-1]
N = int(np.round((b-a)/h))
#Each xi
def xidx(index):
return a + h*index
#Function in the image are written here.
def gx(t,lamda,alpha):
return t0 * ml(lamda*(t**alpha),alpha)
gx = np.vectorize(gx)
def kernel(t,s,rad,lamda,alpha,beta):
if t == s:
return 0
return (t-s)**(alpha-1) * ml_(lamda*((t-s)**alpha),alpha,alpha,1) * (beta*(rad**4) - beta*(ta**4) - lamda*ta)
kernel = np.vectorize(kernel)
############################
# The problem is here!!!!!!
def fx(x,n,lamda,alpha,beta):
ans = gx(x,lamda,alpha)
for j in range(n):
ans += (h/6)*(kernel(x,xidx(j),f0[j],lamda,alpha,beta) + 2*kernel(x,xidx(j+1/2),f1[j],lamda,alpha,beta) + 2*kernel(x,xidx(j+1/2),f2[j],lamda,alpha,beta) + kernel(x,xidx(j+1),f3[j],lamda,alpha,beta))
return ans
#########################
f0 = np.zeros(N+1)
f0[0] = t0
f1 = np.zeros(N+1)
f2 = np.zeros(N+1)
f3 = np.zeros(N+1)
F = np.zeros((3,N+1))
def fitt(xvalue,lamda,alpha,beta):
global f0,f1,f2,f3,F
n = int(np.round(xvalue/h))
f1[n] = fx(xidx(n) + 1/2,n,lamda,alpha,beta) + (h/2)*kernel(xidx(n + 1/2),xidx(n),f0[n],lamda,alpha,beta)
f2[n] = fx(xidx(n + 1/2),n,lamda,alpha,beta)
f3[n] = fx(xidx(n+1),n,lamda,alpha,beta) + h*kernel(xidx(n+1),xidx(n+1/2),f2[n],lamda,alpha,beta)
if n+1 <= N:
f0[n+1] = fx(xidx(n+1),n,lamda,alpha,beta) + (h/6)*(kernel(xidx(n+1),xidx(n),f0[n],lamda,alpha,beta) + 2*kernel(xidx(n+1),xidx(n+1/2),f1[n],lamda,alpha,beta) + 2*kernel(xidx(n+1),xidx(n+1/2),f2[n],lamda,alpha,beta) + kernel(xidx(n+1),xidx(n+1),f3[n],lamda,alpha,beta))
if(xvalue == timeDF[len(timeDF) - 1]):
print(f0[n],n)
returnValue = f0[n]
f0 = np.zeros(N+1)
f0[0] = t0
f1 = np.zeros(N+1)
f2 = np.zeros(N+1)
f3 = np.zeros(N+1)
return returnValue
print(f0[n],n)
return f0[n]
fitt = np.vectorize(fitt)
#Fitting, plotting and giving (Adj) R-squared
popt , pcov = curve_fit(fitt,timeDF,tempDF,p0=(-0.1317,0.95,-1e-11),bounds=((-np.inf,0,-np.inf),(0,1,0)))
print(popt)
y_fit = np.array(fitt(timeDF,popt[0],popt[1],popt[2]))
plt.scatter(timeDF,tempDF,color='ORANGE',marker='.',s=0.5)
plt.fill_between(timeDF,tempDF-0.5,tempDF+0.5,color='ORANGE', alpha=0.2)
plt.plot(timeDF,y_fit,color='RED',linewidth=1)
plt.legend(["Experimental data", "Caputo fit"], loc ="upper right")
plt.xlabel("Time (min)")
plt.ylabel("Temperature (Kelvin)")
plt.show()
plt.close()
r2 = r2_score(tempDF,y_fit)
print(r2)
adjr2 = 1 - (1 - r2)*((len(xls)-1)/(len(xls)-3-1))
print(adjr2)
I already tried computing the values f0,f1,f2,f3 all at once, but the thing consuming the most time is Fn(x) which I haven't figured in out how to compute them all at once. If this is possible to compute at once, I think the program will run much faster. PS: ML,ML_ is a function from https://github.com/khinsen/mittag-leffler.
This is the function necesssary. Fn is the only one I haven't figured out yet.
There are two typing errors in the cited image. The combination of x_n and 1/2 is always meant to be the midpoint x_{n+1/2} = x_n + h/2. The second error is a duplication of x_{n+1/2} in the formula for f^{(4)}_n in its third term. The first error is probably producing errors that are large enough to make convergence complicated and any limit wrong for the intended problem.
In the Simpson/RK4 step, the 4 fx computations can be reduced to 2.
The F_n implement the left side of the integral equation
F(x) = g(x) + int(s=0 to x of K(x,s,f(s))
where the integral is approximated with the sample sequences f0,...,f3. Due to the structure of problem and algorithm F_n(x_n)=f^0_n = f^4_{n-1}.
Note that K(x,s,f) should be set to zero for s >= x. In the exact version of the equation these values "above the diagonal" are not used.
If an increase in accuracy is needed, for instance to avoid divergence where there is none in the exact solution, you can decrease the step site by a factor of 10 and then sub-sample the f^0_n sequence to produce the numerical guess for the given data. Other factors than 10 are of course also possible.

Finding the formula for an alphanumeric code

A script I am making scans a 5-character code and assigns it a number based on the contents of characters within the code. The code is a randomly-generated number/letter combination. For example 7D3B5 or HH42B where any position can be any one of (26 + 10) characters.
Now, the issue I am having is I would like to figure out the number from 1-(36^5) based on the code. For example:
00000 = 0
00001 = 1
00002 = 2
0000A = 10
0000B = 11
0000Z = 36
00010 = 37
00011 = 38
So on and so forth until the final possible code which is:
ZZZZZ = 60466176 (36^5)
What I need to work out is a formula to figure out, let's say G47DU in its number form, using the examples below.
Something like this?
function getCount(s){
if (!isNaN(s))
return Number(s);
return s.charCodeAt(0) - 55;
}
function f(str){
let result = 0;
for (let i=0; i<str.length; i++)
result += Math.pow(36, str.length - i - 1) * getCount(str[i]);
return result;
}
var strs = [
'00000',
'00001',
'00002',
'0000A',
'0000B',
'0000Z',
'00010',
'00011',
'ZZZZZ'
];
for (str of strs)
console.log(str, f(str));
You are trying to create a base 36 numeric system. Since there are 5 'digits' each digit being 0 to Z, the value can go from 0 to 36^5. (If we are comparing this with hexadecimal system, in hexadecimal each 'digit' goes from 0 to F). Now to convert this to decimal, you could try use the same method used to convert from hex or binary etc... system to the decimal system.
It will be something like d4 * (36 ^ 4) + d3 * (36 ^ 3) + d2 * (36 ^ 2) + d1 * (36 ^ 1) + d0 * (36 ^ 0)
Note: Here 36 is the total number of symbols.
d0, d1, d2, d3, d4 can range from 0 to 35 in decimal (Important: Not 0 to 36).
Also, you can extend this for any number of digits or symbols and you can implement operations like addition, subtraction etc in this system itself as well. (It will be fun to implement that. :) ) But it will be easier to convert it to decimal do the operations and convert it back though.

All possible combinations of N objects in K buckets

Suppose I have 3 boxes labeled A, B, C and I have 2 balls, B1 and B2. I want to get all possible combinations of these balls in the boxes. Please note, it is important to know which ball is in each box, meaning B1 and B2 are not the same.
A B C
B1, B2
B1 B2
B1 B2
B2 B1
B2 B1
B1, B2
B1 B2
B2 B1
B1, B2
Edit
If there is a known algorithm for this problem, please tell me its name.
Let N be number of buckets (3 in the example), M number of balls (2). Now, let's have a look at numbers in a range [0..N**M) - [0..9) in the example; these numbers we represent with radix = N. For the example in the question we have trinary numbers
Now we can easily interprete these numbers: first digit shows 1st ball location, second - 2nd ball position.
|--- Second Ball position [0..2]
||-- First Ball position [0..2]
||
0 = 00 - both balls are in the bucket #0 (`A`)
1 = 01 - first ball is in the bucket #1 ('B'), second is in the bucket #0 (`A`)
2 = 02 - first ball is in the bucket #2 ('C'), second is in the bucket #0 (`A`)
3 = 10 - first ball is in the bucket #0 ('A'), second is in the bucket #1 (`B`)
4 = 11 - both balls are in the bucket #1 (`B`)
5 = 12 ...
6 = 20
7 = 21 ...
8 = 22 - both balls are in the bucket #2 (`C`)
the general algorithm is:
For each number in 0 .. N**M range
ith ball (i = 0..M-1) will be in the bucket # (number / N**i) % N (here / stands for integer division, % for remainder)
If you want just total count, the answer is simple N ** M, in the example above 3 ** 2 == 9
C# Code The algorithm itself is easy to implement:
static IEnumerable<int[]> BallsLocations(int boxCount, int ballCount) {
BigInteger count = BigInteger.Pow(boxCount, ballCount);
for (BigInteger i = 0; i < count; ++i) {
int[] balls = new int[ballCount];
int index = 0;
for (BigInteger value = i; value > 0; value /= boxCount)
balls[index++] = (int)(value % boxCount);
yield return balls;
}
}
It's answer representation which can be entangled:
static IEnumerable<string> BallsSolutions(int boxCount, int ballCount) {
foreach (int[] balls in BallsLocations(boxCount, ballCount)) {
List<int>[] boxes = Enumerable
.Range(0, boxCount)
.Select(_ => new List<int>())
.ToArray();
for (int j = 0; j < balls.Length; ++j)
boxes[balls[j]].Add(j + 1);
yield return string.Join(Environment.NewLine, boxes
.Select((item, index) => $"Box {index + 1} : {string.Join(", ", item.Select(b => $"B{b}"))}"));
}
}
Demo:
int balls = 3;
int boxes = 2;
string report = string.Join(
Environment.NewLine + "------------------" + Environment.NewLine,
BallsSolutions(boxes, balls));
Console.Write(report);
Outcome:
Box 1 : B1, B2, B3
Box 2 :
------------------
Box 1 : B2, B3
Box 2 : B1
------------------
Box 1 : B1, B3
Box 2 : B2
------------------
Box 1 : B3
Box 2 : B1, B2
------------------
Box 1 : B1, B2
Box 2 : B3
------------------
Box 1 : B2
Box 2 : B1, B3
------------------
Box 1 : B1
Box 2 : B2, B3
------------------
Box 1 :
Box 2 : B1, B2, B3
Fiddle
There's a very simple recursive implementation that at each level adds the current ball to each box. The recursion ends when all balls have been processed.
Here's some Java code to illustrate. We use a Stack to represent each box so we can simply pop the last-added ball after each level of recursion.
void boxBalls(List<Stack<String>> boxes, String[] balls, int i)
{
if(i == balls.length)
{
System.out.println(boxes);
return;
}
for(Stack<String> box : boxes)
{
box.push(balls[i]);
boxBalls(boxes, balls, i+1);
box.pop();
}
}
Test:
String[] balls = {"B1", "B2"};
List<Stack<String>> boxes = new ArrayList<>();
for(int i=0; i<3; i++) boxes.add(new Stack<>());
boxBalls(boxes, balls, 0);
Output:
[[B1, B2], [], []]
[[B1], [B2], []]
[[B1], [], [B2]]
[[B2], [B1], []]
[[], [B1, B2], []]
[[], [B1], [B2]]
[[B2], [], [B1]]
[[], [B2], [B1]]
[[], [], [B1, B2]]

cython memoryviews slices without GIL

I want to release the GIL in order to parallelise loop in cython, where different slices of memoryviews are passed to a some function inside the loop. The code looks like this:
cpdef void do_sth_in_parallel(bint[:,:] input, bint[:] output, int D):
for d in prange(D, schedule=dynamic, nogil=True):
ouput[d] = some_function_not_requiring_gil(x[d,:])
This is not possible, since selecting the slice x[d,:], seems to require GIL. Running cython -a, and using a normal for loop, I get the code posted below. How can this be done in pure C?
__pyx_t_5.data = __pyx_v_x.data;
__pyx_t_5.memview = __pyx_v_x.memview;
__PYX_INC_MEMVIEW(&__pyx_t_5, 0);
{
Py_ssize_t __pyx_tmp_idx = __pyx_v_d;
Py_ssize_t __pyx_tmp_shape = __pyx_v_x.shape[0];
Py_ssize_t __pyx_tmp_stride = __pyx_v_x.strides[0];
if (0 && (__pyx_tmp_idx < 0))
__pyx_tmp_idx += __pyx_tmp_shape;
if (0 && (__pyx_tmp_idx < 0 || __pyx_tmp_idx >= __pyx_tmp_shape)) {
PyErr_SetString(PyExc_IndexError, "Index out of bounds (axis 0)");
__PYX_ERR(0, 130, __pyx_L1_error)
}
__pyx_t_5.data += __pyx_tmp_idx * __pyx_tmp_stride;
}
__pyx_t_5.shape[0] = __pyx_v_x.shape[1];
__pyx_t_5.strides[0] = __pyx_v_x.strides[1];
__pyx_t_5.suboffsets[0] = -1;
__pyx_t_6.data = __pyx_v_u.data;
__pyx_t_6.memview = __pyx_v_u.memview;
__PYX_INC_MEMVIEW(&__pyx_t_6, 0);
__pyx_t_6.shape[0] = __pyx_v_u.shape[0];
__pyx_t_6.strides[0] = __pyx_v_u.strides[0];
__pyx_t_6.suboffsets[0] = -1;
The following works for me:
from cython.parallel import prange
cdef bint some_function_not_requiring_gil(bint[:] x) nogil:
return x[0]
cpdef void do_sth_in_parallel(bint[:,:] input, bint[:] output, int D):
cdef int d
for d in prange(D, schedule=dynamic, nogil=True):
output[d] = some_function_not_requiring_gil(input[d,:])
The two main changes I had to make were x to input (because it's assuming it can find x as a python object at the global scope) to fix the error
Converting to Python object not allowed without gil
and adding cdef int d to force the type of d and fix the error
Coercion from Python not allowed without the GIL
(I also created an example some_function_not_requiring_gil but I assume this is fairly obvious)
Solution that works for me:
Access the array slice using
input[d:d+1, :]
instead of
input [d,:]
And pass a 2D array.

Haskell performance tuning

I'm quite new to Haskell, and to learn it better I started solving problems here and there and I ended up with this (project Euler 34).
145 is a curious number, as 1! + 4! + 5! = 1 + 24 + 120 = 145.
Find the sum of all numbers which are equal to the sum of the factorial >of their digits.
Note: as 1! = 1 and 2! = 2 are not sums they are not included.
I wrote a C and an Haskell brute force solution.
Could someone explain me the Haskell version is ~15x (~0.450 s vs ~6.5s )slower than the C implementation and how to possibly tune and speedup the Haskell solution?
unsigned int solve(){
unsigned int result = 0;
unsigned int i=10;
while(i<2540161){
unsigned int sumOfFacts = 0;
unsigned int number = i;
while (number > 0) {
unsigned int d = number % 10;
number /= 10;
sumOfFacts += factorial(d);
}
if (sumOfFacts == i)
result += i;
i++;
}
return result;
}
here the haskell solution
--BRUTE FORCE SOLUTION
solve:: Int
solve = sum (filter (\x-> sfc x 0 == x) [10..2540160])
--sum factorial of digits
sfc :: Int -> Int -> Int
sfc 0 acc = acc
sfc n acc = sfc n' (acc+fc r)
where
n' = div n 10
r = mod n 10 --n-(10*n')
fc 0 =1
fc 1 =1
fc 2 =2
fc 3 =6
fc 4 =24
fc 5 =120
fc 6 =720
fc 7 =5040
fc 8 =40320
fc 9 =362880
First, compile with optimizations. With ghc-7.10.1 -O2 -fllvm, the Haskell version runs in 0.54 secs for me. This is already pretty good.
If we want to do even better, we should first replace div with quot and mod with rem. div and mod do some extra work, because they handle the rounding of negative numbers differently. Since we only have positive numbers here, we should switch to the faster functions.
Second, we should replace the pattern matching in fc with an array lookup. GHC uses a branching construct for Int patterns, and uses binary search when the number of cases is large enough. We can do better here with a lookup.
The new code looks like this:
import qualified Data.Vector.Unboxed as V
facs :: V.Vector Int
facs =
V.fromList [1, 1, 2, 6, 24, 120, 720, 5040, 40320, 362880]
--BRUTE FORCE SOLUTION
solve:: Int
solve = sum (filter (\x-> sfc x 0 == x) [10..2540160])
--sum factorial of digits
sfc :: Int -> Int -> Int
sfc 0 acc = acc
sfc n acc = sfc n' (acc + V.unsafeIndex facs r)
where
(n', r) = quotRem n 10
main = print solve
It runs in 0.095 seconds on my computer.

Resources