Vectorized 2D array scipy BDF solver - performance

I'm trying to solve the same ODE simultaneously at different points (each point n is an independent vector of shape m) using the scipy BDF solver. In other words, I have an n x m matrix, and I want to solve n points (by solving, I mean advancing them in time with a while loop), knowing that the n points are independent of each other.
Obviously you can loop on the different points, but this method takes too much time. Is there any way to make this faster and use it as a vectorized function?
I also tried reshaping my matrix to a 1D vector, but it looks like the solver computes the Jacobian matrix of the complete vector, which takes too much time and is wasteful since the points along n are independent.
Maybe there is a way to specify that the derivatives of points n-m are zeros to speed up the jacobian computation ?
Thanks for your answer #Lutz Lehmann. I was able to speed up the computation a little using jac_sparsity, which avoids computing a lot of unnecessary entries.
The other improvement I can imagine concerns the step size h_abs: each independent ODE should have its own h_abs. Using the 1D vector method implies that all the ODEs advance with the same step size h_abs, i.e. the most restrictive one. I don't know if there is any way of doing this.
I am already using a vectorized atol built as an n x m matrix and reshaped the same way as the complete set of ODEs, to make sure that the right tolerances are applied to each variable. I've never used numba so far, but I will definitely have a look.
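For reference, here is a minimal sketch of the jac_sparsity approach described above. The sizes n and m, the matrix K and the right-hand side rhs are made up for illustration; the only real API used is scipy.integrate.solve_ivp with method="BDF" and its jac_sparsity and atol arguments. With the n x m state flattened row by row, the Jacobian is block diagonal with n blocks of size m x m, and that is the pattern passed to the solver.

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.sparse import block_diag

n, m = 50, 3  # hypothetical sizes: n independent points, m variables each

# Hypothetical ODE, identical at every point: dy/dt = K y
K = np.array([[-1.0, 1.0, 0.0],
              [0.0, -2.0, 1.0],
              [0.0, 0.0, -3.0]])

def rhs(t, y):
    Y = y.reshape(n, m)          # one row per independent point
    return (Y @ K.T).ravel()     # same flattening order as the state vector

# Block-diagonal sparsity pattern: variables of one point only depend
# on the other variables of the same point.
pattern = block_diag([np.ones((m, m))] * n, format="csr")

rng = np.random.default_rng(0)
Y0 = rng.uniform(0.5, 1.5, size=(n, m))
atol = np.full((n, m), 1e-9).ravel()   # per-variable tolerances, flattened like y

sol = solve_ivp(rhs, (0.0, 1.0), Y0.ravel(), method="BDF",
                jac_sparsity=pattern, rtol=1e-6, atol=atol)
Y_end = sol.y[:, -1].reshape(n, m)
print(sol.success, pattern.nnz, "nonzeros instead of", (n * m) ** 2)
```

Note that this still advances all points with one shared step size, so it does not address the per-point h_abs concern.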


Should one calculate QR decomposition before Least Squares to speed up the process?

I am reading the book "Introduction to linear algebra" by Gilbert Strang. The section is called "Orthonormal Bases and Gram-Schmidt". The author several times emphasised the fact that with orthonormal basis it's very easy and fast to calculate Least Squares solution, since Qᵀ*Q = I, where Q is a design matrix with orthonormal basis. So your equation becomes x̂ = Qᵀb.
And I got the impression that it's a good idea to always calculate the QR decomposition before applying Least Squares. But later I worked out the time complexity of the QR decomposition, and it turned out that calculating the QR decomposition and then applying Least Squares is more expensive than the regular x̂ = inv(AᵀA)Aᵀb.
Is that right that there is no point in using QR decomposition to speed up Least Squares? Or maybe I got something wrong?
So the only purpose of QR decomposition regarding Least Squares is numerical stability?
There are many ways to do least squares; typically these vary in applicability, accuracy and speed.
Perhaps the Rolls-Royce method is to use SVD. This can be used to solve under-determined (fewer obs than states) and singular systems (where A'*A is not invertible) and is very accurate. It is also the slowest.
QR can only be used to solve non-singular systems (that is we must have A'*A invertible, ie A must be of full rank), and though perhaps not as accurate as SVD is also a good deal faster.
The normal equations ie
compute P = A'*A
solve P*x = A'*b
is the fastest (perhaps by a large margin if P can be computed efficiently, for example if A is sparse) but is also the least accurate. This too can only be used to solve non singular systems.
Inaccuracy should not be taken lightly nor dismissed as some academic fanciness. If you happen to know that the problems you will be solving are nicely behaved, then it might well be fine to use an inaccurate method. But otherwise the inaccurate routine might well fail (ie say there is no solution when there is, or worse come up with a totally bogus answer).
I'm a bit confused that you seem to be suggesting forming and solving the normal equations after performing the QR decomposition. The usual way to use QR in least squares is, if A is nObs x nStates:
decompose A as A = Q*[R; 0]
(here R is upper triangular, stacked on top of a block of zeros)
transform b into b~ = Q'*b
solve R * x = b# for x
(here b# is the first nStates entries of b~)
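The QR steps above and the normal equations can be compared side by side in a few lines of NumPy. This is only a sketch on random, well-conditioned data; the names n_obs, n_states, b_tilde and b_sharp are mine and mirror nObs, nStates, b~ and b#. It also prints the condition numbers, since forming A'*A squares the 2-norm condition number of A, which is where the accuracy loss of the normal equations comes from.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_states = 20, 3
A = rng.standard_normal((n_obs, n_states))
b = rng.standard_normal(n_obs)

# QR route: A = Q * [R; 0], with Q square and orthogonal
Q, R_full = np.linalg.qr(A, mode="complete")
R = R_full[:n_states, :]            # upper triangular block
b_tilde = Q.T @ b                   # b~ = Q' * b
b_sharp = b_tilde[:n_states]        # first n_states entries of b~
# np.linalg.solve is a general solver; a triangular solver would be cheaper
x_qr = np.linalg.solve(R, b_sharp)

# Normal equations route: P = A' * A, then solve P * x = A' * b
P = A.T @ A
x_ne = np.linalg.solve(P, A.T @ b)

# Reference solution from NumPy's SVD-based least-squares routine
x_ref = np.linalg.lstsq(A, b, rcond=None)[0]

cond_A = np.linalg.cond(A)
cond_P = np.linalg.cond(P)
print(x_qr, x_ne, x_ref)
print(cond_A, cond_P)               # cond_P is about cond_A squared
```

On a well-behaved problem like this one all three answers agree; the differences only show up when A is badly conditioned.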

How are sparse Ax = b systems solved in practice?

Let A be an n x n sparse matrix, represented by a sequence of m tuples of the form (i,j,a) --- with indices i,j (between 0 and n-1) and a being a value a in the underlying field F.
What algorithms are used, in practice, to solve linear systems of equations of the form Ax = b? Please describe them, don't just link somewhere.
I'm interested both in exact solutions for finite fields, and in exact and bounded-error solutions for reals or complex numbers using floating-point representation. I suppose exact or bounded-solutions for rational numbers are also interesting.
I'm particularly interested in parallelizable solutions.
A is not fixed, i.e. you don't just get different b's for the same A.
The main two algorithms that I have used and parallelised are the Wiedemann algorithm and the Lanczos algorithm (and their block variants for GF(2) computations), both of which are better than structured gaussian elimination.
The LaMacchia-Odlyzko paper (the one for the Lanczos algorithm) will tell you what you need to know. The algorithms involve repeatedly multiplying your sparse matrix by a sequence of vectors. To do this efficiently, you need to use the right data structure (linked list) to make the matrix-vector multiply time proportional to the number of non-zero values in the matrix (i.e. the sparsity).
Parallelisation of these algorithms is trivial, but optimisation will depend upon the architecture of your system. The parallelisation of the matrix-vector multiply is done by splitting the matrix into blocks of rows (each processor gets one block); each block of rows multiplies by the vector separately. Then you combine the results to get the new vector.
I've done these types of computations extensively. The original authors that broke the RSA-129 factorisation took 6 weeks using structured gaussian elimination on a 16,384 processor MasPar. On the same machine, I worked with Arjen Lenstra (one of the authors) to solve the matrix in 4 days with block Wiedemann and 1 day with block Lanczos. Unfortunately, I never published the result!
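To make the row-block matrix-vector multiply concrete, here is a small sketch over a prime field GF(p), taking the (i,j,a) tuples from the question as input. The function names, the per-row Python lists (standing in for the linked lists mentioned above) and the thread pool are my own choices for illustration. Python threads will not give a real speedup here; the point is only to show how the rows are partitioned and the partial results concatenated.

```python
from concurrent.futures import ThreadPoolExecutor

def build_rows(n, triples):
    """Store only the non-zeros: rows[i] is a list of (column, value) pairs."""
    rows = [[] for _ in range(n)]
    for i, j, a in triples:
        rows[i].append((j, a))
    return rows

def block_matvec(rows_block, x, p):
    """Multiply one block of rows by x, with arithmetic modulo p."""
    return [sum(a * x[j] for j, a in row) % p for row in rows_block]

def parallel_matvec(rows, x, p, n_workers=4):
    """Split the rows into contiguous blocks, one per worker, then concatenate."""
    size = max(1, -(-len(rows) // n_workers))   # ceiling division
    blocks = [rows[s:s + size] for s in range(0, len(rows), size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = list(pool.map(lambda blk: block_matvec(blk, x, p), blocks))
    return [v for part in parts for v in part]

# 3 x 3 example over GF(7): A = [[1,0,2],[0,3,0],[4,0,5]]
triples = [(0, 0, 1), (0, 2, 2), (1, 1, 3), (2, 0, 4), (2, 2, 5)]
rows = build_rows(3, triples)
y = parallel_matvec(rows, [1, 2, 3], p=7, n_workers=2)
print(y)  # [0, 6, 5]
```

The work per multiply is proportional to the number of stored tuples, which is the property the Wiedemann and Lanczos iterations rely on.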

paralleling sequence of matrix multiplication for speed up

In my function, there are a lot of element-wise matrix multiplications which are independent. Is there a way to calculate them in parallel?
All of them are very simple operations, but 70% of my run time is for these parts of code because this function is invoked millions of times.
function [r1,r2,r3]=backward(A,B,C,D,E,F,r1,r2,r3)
for i=1:300
EDIT: After writing the answer, I observed that you are not multiplying all the input matrices by means of matrix multiplication. Some of them are elementwise multiplications. If this is what you intended, the following answer won't apply.
You are looking for an optimal algorithm for computing product of multiple matrices. People have studied this problem long ago and they have come up with a dynamic programming algorithm to decide the optimal order.
For example, if A is of size 10000 x 1, B is of size 1 x 10000 and C is of size 10000 x 1, it would be a lot more efficient if we computed A*B*C as A*(B*C), instead of (A*B)*C. The proof of correctness of this technique lies in the fact that matrix multiplication is associative. You can read more about this on Wikipedia.
If you want a good quality MATLAB implementation of this, you can find it here. It takes the matrices as input and gives out the product. It seems like this implementation does a decent job of finding the optimal way of computing the product of up to 10 matrices.
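The dynamic programming idea can be sketched in a few lines. This is the textbook matrix-chain ordering recurrence, not the MATLAB implementation mentioned above; best_chain_order and its dims argument are names I made up. It only counts scalar multiplications and reports the cheapest parenthesization, here for the 10000 x 1, 1 x 10000, 10000 x 1 example.

```python
def best_chain_order(dims):
    """dims[i] x dims[i+1] is the shape of matrix i.
    Returns (minimal number of scalar multiplications, parenthesization)."""
    k = len(dims) - 1                       # number of matrices
    cost = [[0] * k for _ in range(k)]
    split = [[0] * k for _ in range(k)]
    for length in range(2, k + 1):          # length of the sub-chain
        for i in range(k - length + 1):
            j = i + length - 1
            cost[i][j] = float("inf")
            for s in range(i, j):           # try every split point
                c = (cost[i][s] + cost[s + 1][j]
                     + dims[i] * dims[s + 1] * dims[j + 1])
                if c < cost[i][j]:
                    cost[i][j], split[i][j] = c, s

    def paren(i, j):
        if i == j:
            return chr(ord("A") + i)        # matrices are named A, B, C, ...
        s = split[i][j]
        return "(" + paren(i, s) + paren(s + 1, j) + ")"

    return cost[0][k - 1], paren(0, k - 1)

best_cost, order = best_chain_order([10000, 1, 10000, 1])
print(best_cost, order)  # 20000 (A(BC))
```

For comparison, (A*B)*C costs 10000*1*10000 + 10000*10000*1 = 200,000,000 scalar multiplications, against 20,000 for A*(B*C).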
First thing to note: the last 3 variables that you provide as input are not being used. I don't think this will matter much, but it would be better to clean it up.
Now the actual answer:
MATLAB is all about matrix operations, and this has been highly optimized. Even using C++ you will not expect a significant speedup (and be wary of a slowdown). As such, with the information that is provided in the question, the conclusion would be that you cannot do anything to speed up independent matrix calculations.
That being said: If you could reduce the number of sequential function calls, there may be something to gain.
It is hard to say how to do this in general, but two ideas:
If you call the function in a for loop, use a parfor loop instead (assuming you have the Parallel Computing Toolbox; otherwise manually break up the loop and open 4 MATLAB instances to parallelize it, which can be automated if needed).
See whether you really need this many function calls to small matrix operations. If you could improve your algorithm, that could offer a huge improvement, but otherwise you may still be able to combine multiple matrices (multiple versions of A with multiple versions of B, for instance) and do 1 big multiplication rather than 100 tiny ones.

Accurate least-squares fit algorithm needed

I've experimented with the two ways of implementing a least-squares fit (LSF) algorithm shown here.
The first code is simply the textbook approach, as described by Wolfram's page on LSF. The second code re-arranges the equation to minimize machine errors. Both codes produce similar results for my data. I compared these results with Matlab's p=polyfit(x,y,1) function, using correlation coefficients to measure the "goodness" of fit and compare each of the 3 routines. I observed that while all 3 methods produced good results, at least for my data, Matlab's routine had the best fit (the other 2 routines had similar results to each other).
Matlab's p=polyfit(x,y,1) function uses a Vandermonde matrix, V (n x 2 matrix) and QR factorization to solve the least-squares problem. In Matlab code, it looks like:
V = [x1,1; x2,1; x3,1; ... xn,1] % this line is pseudo-code
[Q,R] = qr(V,0);
p = R\(Q'*y); % performs same as p = V\y
I'm not a mathematician, so I don't understand why it would be more accurate. Although the difference is slight, in my case I need to obtain the slope from the LSF and multiply it by a large number, so any improvement in accuracy shows up in my results.
For reasons I can't get into, I cannot use Matlab's routine in my work. So, I'm wondering if anyone has a more accurate equation-based approach recommendation I could use that is an improvement over the above two approaches, in terms of rounding errors/machine accuracy/etc.
For a polynomial fit, you can create a Vandermonde matrix and solve the linear system, as you have already done.
Another solution is to use a method like Gauss-Newton to fit the data (since the system is linear, one iteration should do fine). There are differences between the methods. One possible reason is Runge's phenomenon.
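If an equation-based formula is required, one common rearrangement for a straight-line fit is to subtract the means of x and y before forming the sums. I am not claiming this is what either code in the question does; it is a sketch of a well-known way to avoid the cancellation in the textbook formula n*sum(x*y) - sum(x)*sum(y) when x has a large offset. The function name fit_line_centered is made up, and math.fsum is used for accurately rounded sums.

```python
import math

def fit_line_centered(x, y):
    """Least-squares line y = slope * x + intercept using mean-centered sums."""
    n = len(x)
    x_mean = math.fsum(x) / n
    y_mean = math.fsum(y) / n
    # Sums of products of deviations from the mean
    sxy = math.fsum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = math.fsum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Data with a large offset in x, lying exactly on y = 2x + 1
x = [1e6 + i for i in range(10)]
y = [2.0 * xi + 1.0 for xi in x]
slope, intercept = fit_line_centered(x, y)
print(slope, intercept)
```

Whether this beats the QR route on your data is something you would have to check; both avoid forming large, nearly cancelling sums.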

Difference between a linear problem and a non-linear problem? Essence of Dot-Product and Kernel trick

The kernel trick maps a non-linear problem into a linear problem.
My questions are:
1. What is the main difference between a linear and a non-linear problem? What is the intuition behind the difference between these two classes of problem? And how does the kernel trick help use linear classifiers on a non-linear problem?
2. Why is the dot product so important in the two cases?
When people say linear problem with respect to a classification problem, they usually mean linearly separable problem. Linearly separable means that there is some function that can separate the two classes that is a linear combination of the input variable. For example, if you have two input variables, x1 and x2, there are some numbers theta1 and theta2 such that the function theta1.x1 + theta2.x2 will be sufficient to predict the output. In two dimensions this corresponds to a straight line, in 3D it becomes a plane and in higher dimensional spaces it becomes a hyperplane.
You can get some kind of intuition about these concepts by thinking about points and lines in 2D/3D. Here's a very contrived pair of examples...
This is a plot of a linearly inseparable problem. There is no straight line that can separate the red and blue points.
However, if we give each point an extra coordinate (specifically 1 - sqrt(x*x + y*y)... I told you it was contrived), then the problem becomes linearly separable since the red and blue points can be separated by a 2-dimensional plane going through z=0.
Hopefully, these examples demonstrate part of the idea behind the kernel trick:
Mapping a problem into a space with a larger number of dimensions makes it more likely that the problem will become linearly separable.
The second idea behind the kernel trick (and the reason why it is so tricky) is that it is usually very awkward and computationally expensive to work in a very high-dimensional space. However, if an algorithm only uses the dot products between points (which you can think of as distances), then you only have to work with a matrix of scalars. You can implicitly perform the calculations in the higher-dimensional space without ever actually having to do the mapping or handle the higher-dimensional data.
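The extra-coordinate example can be checked numerically. Since the plot itself is not reproduced here, I am assuming that one class lies inside the unit circle and the other outside it; the radii 0.5 and 2.0 and the helper names are made up. After adding z = 1 - sqrt(x*x + y*y), the sign of z alone separates the two classes, which is a linear rule in the lifted 3D space.

```python
import math

def lift(point):
    """Add the extra coordinate z = 1 - sqrt(x^2 + y^2) to a 2D point."""
    x, y = point
    return (x, y, 1.0 - math.sqrt(x * x + y * y))

def ring(radius, count=12):
    """Points evenly spaced on a circle of the given radius."""
    return [(radius * math.cos(2 * math.pi * k / count),
             radius * math.sin(2 * math.pi * k / count)) for k in range(count)]

inner = ring(0.5)   # assumed class 1: inside the unit circle
outer = ring(2.0)   # assumed class 2: outside the unit circle

# In 3D the plane z = 0 separates the classes: the classifier is the
# linear function w . p with w = (0, 0, 1).
inner_z = [lift(p)[2] for p in inner]
outer_z = [lift(p)[2] for p in outer]
print(min(inner_z) > 0, max(outer_z) < 0)
```

In the original 2D coordinates no single straight line can do this, because the outer ring surrounds the inner one.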
Many classifiers, among them the linear Support Vector Machine (SVM), can only solve problems that are linearly separable, i.e. where the points belonging to class 1 can be separated from the points belonging to class 2 by a hyperplane.
In many cases, a problem that is not linearly separable can be solved by applying a transform phi() to the data points; this transform is said to transform the points to feature space. The hope is that, in feature space, the points will be linearly separable. (Note: This is not the kernel trick yet... stay tuned.)
It can be shown that, the higher the dimension of the feature space, the greater the number of problems that are linearly separable in that space. Therefore, one would ideally want the feature space to be as high-dimensional as possible.
Unfortunately, as the dimension of feature space increases, so does the amount of computation required. This is where the kernel trick comes in. Many machine learning algorithms (among them the SVM) can be formulated in such a way that the only operation they perform on the data points is a scalar product between two data points. (I will denote a scalar product between x1 and x2 by <x1, x2>.)
If we transform our points to feature space, the scalar product now looks like this:
<phi(x1), phi(x2)>
The key insight is that there exists a class of functions called kernels that can be used to optimize the computation of this scalar product. A kernel is a function K(x1, x2) that has the property that
K(x1, x2) = <phi(x1), phi(x2)>
for some function phi(). In other words: We can evaluate the scalar product in the low-dimensional data space (where x1 and x2 "live") without having to transform to the high-dimensional feature space (where phi(x1) and phi(x2) "live") -- but we still get the benefits of transforming to the high-dimensional feature space. This is called the kernel trick.
Many popular kernels, such as the Gaussian kernel, actually correspond to a transform phi() that transforms into an infinite-dimensional feature space. The kernel trick allows us to compute scalar products in this space without having to represent points in this space explicitly (which, obviously, is impossible on computers with finite amounts of memory).
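A small finite-dimensional case makes the identity K(x1, x2) = <phi(x1), phi(x2)> easy to verify by hand. The kernel below, K(x1, x2) = <x1, x2>^2 on 2D inputs, is my choice for illustration and is not taken from the text above; its feature map is phi(x) = (a^2, sqrt(2)*a*b, b^2) for x = (a, b).

```python
import math

def dot(u, v):
    """Ordinary scalar product of two vectors."""
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit feature map into 3D for the degree-2 polynomial kernel."""
    a, b = x
    return (a * a, math.sqrt(2.0) * a * b, b * b)

def kernel(x1, x2):
    """Same scalar product, computed in the 2D data space."""
    return dot(x1, x2) ** 2

x1 = (1.0, 2.0)
x2 = (3.0, -1.0)
explicit = dot(phi(x1), phi(x2))   # work done in feature space
implicit = kernel(x1, x2)          # work done in data space
print(explicit, implicit)          # both equal 1 up to rounding
```

Expanding (a1*a2 + b1*b2)^2 gives a1^2*a2^2 + 2*a1*a2*b1*b2 + b1^2*b2^2, which is exactly the scalar product of the two mapped vectors.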
The main difference (for practical purposes) is: A linear problem either does have a solution (and then it's easily found), or you get a definite answer that there is no solution at all. You do know this much, before you even know the problem at all. As long as it's linear, you'll get an answer; quickly.
The intuition behind this is the fact that if you have two straight lines in some space, it's pretty easy to see whether they intersect or not, and if they do, it's easy to know where.
If the problem is not linear -- well, it can be anything, and you know just about nothing.
The dot product of two vectors just means the following: The sum of the products of the corresponding elements. So if your problem is
c1 * x1 + c2 * x2 + c3 * x3 = 0
(where you usually know the coefficients c, and you're looking for the variables x), the left hand side is the dot product of the vectors (c1,c2,c3) and (x1,x2,x3).
The above equation is (pretty much) the very definition of a linear problem, so there's your connection between the dot product and linear problems.
Linear equations are homogeneous, and superposition applies. You can create solutions using combinations of other known solutions; this is one reason why Fourier transforms work so well. Non-linear equations are not homogeneous, and superposition does not apply. Non-linear equations usually have to be solved numerically using iterative, incremental techniques.
I'm not sure how to express the importance of the dot product, but it does take two vectors and returns a scalar. Certainly a solution to a scalar equation is less work than solving a vector or higher-order tensor equation, simply because there are fewer components to deal with.
My intuition in this matter is based more on physics, so I'm having a hard time translating to AI.
