Sparse matrix solver for f90

I am dealing with matrices up to N = 10^7 x N = 10^7; the number of nonzero elements is about 6N, and they are clustered around the diagonal. I have 16 GB of RAM, so I clearly need a sparse matrix solver. I run Ubuntu Linux and use Fortran 90 (gfortran), or more precisely, ratfor90.
I have LAPACK, but it doesn't seem to support sparse matrix solving (am I wrong about that?). MATLAB is probably good for this, but I don't want to spend much time getting familiar with it; time is pressing. I have the old/gold SLATEC installed and use it for special functions; does it have sparse matrix routines?
I have heard about ARPACK, but can it be used as a plain solver? Can it be called from gfortran?
Any other suggestions?
Thanks, -- Alex

You are right: LAPACK is not applicable to this problem.
Direct sparse solvers are provided by the MUMPS, UMFPACK, and SuperLU libraries.
PETSc is also a collection of libraries where you can find a lot of relevant functionality.
Ubuntu packages are available for all of these libraries.
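For illustration, here is a minimal sketch of solving Ax = b with UMFPACK's C interface (shown here from C++; Fortran wrappers exist as well). It uses a tiny 5x5 matrix in the style of UMFPACK's own quick-start demo; a problem with N ~ 10^7 and ~6N nonzeros uses exactly the same calls, with the matrix stored in compressed sparse column form.

#include <umfpack.h>
#include <cstdio>

int main() {
    // Tiny 5x5 example in compressed sparse column (CSC) format
    const int n = 5;
    int    Ap[] = {0, 2, 5, 9, 10, 12};                  // column pointers
    int    Ai[] = {0, 1, 0, 2, 4, 1, 2, 3, 4, 2, 1, 4};  // row indices
    double Ax[] = {2, 3, 3, -1, 4, 4, -3, 1, 2, 2, 6, 1};
    double b[]  = {8, 45, -3, 3, 19};
    double x[5];

    void *Symbolic, *Numeric;
    umfpack_di_symbolic(n, n, Ap, Ai, Ax, &Symbolic, nullptr, nullptr);
    umfpack_di_numeric(Ap, Ai, Ax, Symbolic, &Numeric, nullptr, nullptr);
    umfpack_di_free_symbolic(&Symbolic);
    umfpack_di_solve(UMFPACK_A, Ap, Ai, Ax, x, b, Numeric, nullptr, nullptr);
    umfpack_di_free_numeric(&Numeric);

    for (int i = 0; i < n; ++i) std::printf("x[%d] = %g\n", i, x[i]);
    return 0;
}

On Ubuntu the libsuitesparse-dev package provides the header and library (link with -lumfpack).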

ARPACK is a package for solving eigenvalue problems; it is not a linear-system solver by itself.
I am not sure you can solve your problem in 16 GB. I recommend having a look at FreeFem++.

Related

State-of-the-art out-of-place matrix transposition in libraries such as LAPACK?

I'm looking for the most efficient way to compute an out-of-place transposition of large matrices (>> 1024x1024) in C/C++. I've already come across several answers on SO, but I need more "trustworthy" sources for my work (like BLAS/LAPACK).
From a quick online search I understand that BLAS has no such function, but it was implied that LAPACK implements matrix transposition. I've been looking for a while (including in the LAPACK documentation) but found no answer.
I know MKL BLAS implements matrix transposition, but I'm working on a remote server and I'm not able to install it there.
OpenBLAS (a BLAS implementation) supports this:
https://github.com/xianyi/OpenBLAS/wiki/OpenBLAS-Extensions
?omatcopy (s, d, c, z): out-of-place transposition/copying
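As a quick sketch of how that extension can be called from C or C++ (assuming the cblas_domatcopy prototype exported by OpenBLAS's cblas.h; MKL offers the analogous mkl_domatcopy):

#include <cblas.h>   // OpenBLAS also declares its extensions here
#include <cstdio>
#include <vector>

int main() {
    const int m = 3, n = 4;                  // source A is m x n, row-major
    std::vector<double> A(m * n), B(n * m);  // B will hold A^T (n x m)
    for (int i = 0; i < m * n; ++i) A[i] = i;

    // B = 1.0 * A^T, out of place
    cblas_domatcopy(CblasRowMajor, CblasTrans,
                    m, n,          // dimensions of the source matrix A
                    1.0,           // scale factor
                    A.data(), n,   // lda = row length of A
                    B.data(), m);  // ldb = row length of B
    std::printf("B(0,1) = %g, A(1,0) = %g\n", B[1], A[n]);  // should match
    return 0;
}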

Eigendecompositions are 5 times slower in Julia than in Mathematica?

I am new to Julia and primarily work in Mathematica, so I probably have a few elementary mistakes floating around. I attempted to time how long Julia took to compute the eigensystem of a random matrix, and found it was 5-6 times slower than in Mathematica.
In Julia:
D=1000*(rand(1000,1000)-0.5);
@time (E,F)=eig(D);
Out: elapsed time: 7.47950706 seconds (79638920 bytes allocated)
In Mathematica:
First@Timing@Eigensystem[RandomReal[{-500, 500}, {1000, 1000}]]
Out: 1.310408
For 2000 x 2000 arrays the results are similar: Julia slows down by slightly less than Mathematica does, but it is still slower; Julia takes 22 seconds, whereas Mathematica computes it in 8 seconds.
As far as I can tell from the Julia standard library for linear algebra, decompositions are implemented by calling LAPACK, which I thought was supposed to be very good, so I'm confused as to why the Julia code is running so much slower. Does anyone know why this is the case? Is it doing some kind of balancing or array-symmetry detection that Mathematica doesn't do? Or is it actually slower?
Also, this is a syntax question and probably a silly error, but how do you change the balancing in Julia? I tried
@time (E,F)=eig(D[, balance=:nobalance]);
exactly as copied and pasted from the Julia manual, but it just gave a syntax error, so something's wrong.
I am using Windows 7 64-bit, with Julia version 0.2.0 64-bit, installed using the instructions at Steven Johnson's site, with Anaconda installed first to take care of prerequisites. I am using Mathematica student edition version 9.0.1.
EDIT 1:
Executing versioninfo() yielded
Julia Version 0.2.0
Commit 05c6461 (2013-11-16 23:44 UTC)
Platform Info:
System: Windows (x86_64-w64-mingw32)
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY)
LAPACK: libopenblas
LIBM: libopenlibm
So it looks like I'm using OpenBLAS for both LAPACK and BLAS. Once I get the Mathematica implementation info I will add that as well.
EDIT 2:
It appears that Windows Mathematica probably uses Intel MKL BLAS.
The eigen calculation in Julia is outsourced to LAPACK and BLAS, and I think the same is true for Mathematica. Julia can use different versions of BLAS and LAPACK, so you are effectively comparing your chosen LAPACK and BLAS for Julia against Mathematica's LAPACK and BLAS (probably Intel MKL).
The default choice for Julia is OpenBLAS, which is fast on most architectures; on my machine Julia is faster than Mathematica for this eigen calculation. If you are on Linux and have chosen BLAS and LAPACK from a repo, it is very likely that they are much slower than OpenBLAS.
The option for balancing was recently added to Julia, but by mistake it was not added to the function eig, which is only a MATLAB-compatible interface to the eigfact function. Writing eigfact(A,balance=:nobalance) should work.
Edit 1:
Further investigation has shown that the difference is due to a threading problem in OpenBLAS on Windows. If Julia's BLAS is restricted to one thread the timings are comparable to Mathematica, but if more threads are allowed the calculation slows down. This doesn't seem to be a problem on Mac or Linux, but as mentioned above, in general the performance of OpenBLAS depends on the architecture.
Edit 2:
Recently, the balancing option has changed. Balancing can be switched off by writing eigfact(A,permute=false,scale=false).
There's definitely at least one thing wrong, and of course the good news is that it's likely to be fixable. For this kind of thing, you're much better off filing an issue on GitHub. I've done that for you here.
Regarding the speed, you may want to check that the accuracy of the eigendecomposition is comparable. Accuracy sometimes depends on the condition number of the matrix; it's possible that Julia is using a more careful algorithm, and your example may not reveal that if its condition number is not a problem. This is definitely something to discuss over at the issue.
Regarding :nobalance: in the documentation, [, something] is used to indicate that something is optional. You want to use @time (E,F)=eig(D, balance=:nobalance);. However, eig doesn't take keyword arguments, so the code and the documentation are not currently in sync.

Boost compressed_matrix & Cholesky factorization & BLAS/LAPACK

I'd like to perform the following steps:
fill a boost::numeric::ublas::compressed_matrix;
then apply a Cholesky factorization to it.
However, Boost has no such function.
So I started looking for another library; I was thinking of BLAS or a LAPACK-like library. Although those have the algorithms I need, how do I bind a boost::numeric::ublas::compressed_matrix to a BLAS or LAPACK function/algorithm? Is there a way to do that?
Finally, I need to solve Ax = b, where A is Boost's compressed_matrix factorized by the Cholesky algorithm. So how do I solve for x using Boost's and/or LAPACK's algorithms or functions?
Thanks in advance.
LuP
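One simple way to wire ublas to LAPACK, sketched here under the assumption that the matrix is small enough to densify (LAPACK's Cholesky routines are dense; for genuinely large sparse SPD systems a sparse Cholesky such as CHOLMOD is a better fit). The sketch copies the compressed_matrix into a row-major buffer and calls LAPACKE's dposv driver, which factorizes A = L*L^T and solves in one call:

#include <boost/numeric/ublas/matrix_sparse.hpp>
#include <lapacke.h>
#include <vector>

std::vector<double> cholesky_solve(
        const boost::numeric::ublas::compressed_matrix<double>& A,
        std::vector<double> b) {
    const lapack_int n = static_cast<lapack_int>(A.size1());
    std::vector<double> dense(n * n, 0.0);
    // Copy the stored nonzeros into a dense row-major buffer.
    for (auto it1 = A.begin1(); it1 != A.end1(); ++it1)
        for (auto it2 = it1.begin(); it2 != it1.end(); ++it2)
            dense[it2.index1() * n + it2.index2()] = *it2;

    // Cholesky factorization + solve; b is overwritten with x.
    // A nonzero return value means A was not positive definite.
    LAPACKE_dposv(LAPACK_ROW_MAJOR, 'L', n, 1,
                  dense.data(), n, b.data(), 1);
    return b;
}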

fast matrix multiplication in Matlab

I need to perform a matrix/vector multiplication in Matlab at very large sizes: "A" is a 655360 by 5 real-valued matrix that is not necessarily sparse, and "B" is a 655360 by 1 real-valued vector. My question is how to compute B'*A efficiently.
I have noticed a slight time improvement by computing A'*B instead, which gives a column vector. But it is still quite slow (I need to perform this operation several times in the program).
With a little bit of searching I found an interesting Matlab toolbox, MTIMESX by James Tursa, which I hoped would improve the performance of the above matrix multiplication. After several trials, I can only get very marginal gains over Matlab's native matrix multiplication.
Any suggestions about how I should rewrite A'*B so that the operation is more efficient? Thanks.
Matlab's raison d'être is doing matrix computations. I would be fairly surprised if you could significantly outperform its built-in matrix multiplication with hand-crafted tools. First of all, you should make sure your multiplication can actually be performed significantly faster. You could check this by implementing a similar multiplication in C++ with Eigen.
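For reference, a minimal Eigen sketch of the same product, with the shapes taken from the question; this is the kind of C++ baseline suggested above:

#include <Eigen/Dense>
#include <iostream>

int main() {
    // Same shapes as in the question: A is 655360 x 5, B is 655360 x 1.
    Eigen::MatrixXd A = Eigen::MatrixXd::Random(655360, 5);
    Eigen::VectorXd B = Eigen::VectorXd::Random(655360);

    // B' * A gives a 1 x 5 row vector; Eigen dispatches to optimized kernels.
    Eigen::RowVectorXd result = B.transpose() * A;
    std::cout << result << std::endl;
    return 0;
}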
I have had good results with Matlab matrix multiplication using the GPU.
In order to avoid the transpose operation, you could try:
sum(bsxfun(@times, A, B), 1)
But I would be astonished if it were faster than the direct version. See @thiton's answer.
Also look at http://www.mathworks.co.uk/company/newsletters/news_notes/june07/patterns.html to see why the column-vector-based version is faster than the row-vector-based version.
Matlab is built on fairly optimized libraries (BLAS, etc.), so you can't easily improve upon it from within Matlab. Where you can improve is by getting a better BLAS, such as one optimized for your processor; this enables better use of the caches by fetching appropriately sized blocks of data from main memory. Take a look into building your own compiled versions of ATLAS, ACML, MKL, or GotoBLAS.
I wouldn't try to speed up this one particular multiplication unless it's really killing you. Switching the BLAS is likely to lead to a happier solution, especially if you're not currently making use of multicore processors.
Your #1 option, if this is your bottleneck, is to re-examine your algorithm. See this question Optimizing MATLAB code for a great example of how choosing a different algorithm reduced runtime by three orders of magnitude.

Efficient EigenSolver Implementation

I am looking for an efficient eigensolver (language not important, although I would be programming in C#) that utilizes the multi-core features found in modern CPUs. Being able to work directly with the PARDISO solver is a major plus. My matrices are mostly sparse, so an ideal solver should take advantage of this fact to greatly improve memory usage and performance.
So far I have only found LAPACK and ARPACK. LAPACK, as implemented in Intel MKL, is a good candidate, as it offers multi-core optimization. But it seems that the drivers inside LAPACK don't work directly with the PARDISO solver; furthermore, it seems that they don't take advantage of sparse matrices (though I am not sure on this point).
ARPACK, on the other hand, seems to be pretty hard to set up in a Windows environment, and the parallel version, PARPACK, doesn't work so well. The bonus is that it can work with the PARDISO solver.
The best would be Intel MKL + ARPACK with multi-core speedup. Is there any existing implementation that already does what I want?
I'm working on a problem with needs very similar to the ones you describe. I'm considering FEAST:
http://www.ecs.umass.edu/~polizzi/feast/index.htm
I'm trying to make it work right now, but it seems perfect. I'd be interested in hearing about your experience with it, if you use it.
Cheers,
Ned
Have a look at the Eigen2 library.
I've implemented this already, in C#.
The idea is to convert the matrix to CSR format. Then you can use MKL for solving the linear equations (using the PARDISO solver) and for the matrix-vector operations.
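As a rough illustration of the first step, here is a hedged C++ sketch that builds the three CSR arrays (row pointers, column indices, values) from coordinate triplets; the 1-based indexing follows the Fortran-style convention that MKL PARDISO uses by default, and the PARDISO call itself is omitted:

#include <algorithm>
#include <tuple>
#include <vector>

struct Csr {
    std::vector<int>    ia;  // row pointers, size n+1, 1-based
    std::vector<int>    ja;  // column indices, size nnz, 1-based
    std::vector<double> a;   // nonzero values, size nnz
};

// Assumes the triplets (row, col, value) use 0-based indices and contain
// no duplicate (row, col) pairs.
Csr to_csr(int n, std::vector<std::tuple<int,int,double>> triplets) {
    std::sort(triplets.begin(), triplets.end());      // order by (row, col)
    Csr m;
    m.ia.assign(n + 1, 0);
    m.ja.reserve(triplets.size());
    m.a.reserve(triplets.size());
    for (const auto& [r, c, v] : triplets) {
        ++m.ia[r + 1];                                 // count nonzeros per row
        m.ja.push_back(c + 1);                         // 1-based column index
        m.a.push_back(v);
    }
    for (int i = 0; i < n; ++i) m.ia[i + 1] += m.ia[i];  // prefix sums
    for (int& p : m.ia) p += 1;                           // shift to 1-based
    return m;
}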
