Cheaply compute QR factorization in Julia?

In Julia, I would like to compute the QR factorization of a matrix inside a program. However, calling the qr() function is relatively expensive compared to the rest of my program. Is there any way to compute a matrix's QR factorization more cheaply than simply calling qr()? I also want to avoid or minimize storing and allocating arrays wherever possible.
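One standard trick worth knowing (a minimal sketch, not a complete answer): LinearAlgebra's in-place qr! overwrites its input instead of copying it, so if you preallocate a buffer once and refill it each iteration, the large per-call allocation disappears from the hot loop (the small factor object qr! returns still allocates). The sizes and loop below are made up for illustration:

```julia
using LinearAlgebra

A = rand(100, 100)        # hypothetical data for one problem instance
buf = similar(A)          # reusable workspace, allocated once

for trial in 1:10
    copyto!(buf, A)       # refill the buffer with this iteration's matrix
    F = qr!(buf)          # in-place QR: the factors are stored in buf itself
    # ... use F.Q and F.R here; note F aliases buf, so don't overwrite buf
    # until you are done with F
end
```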

Related

Precise matrix inversion in Q

Given an invertible matrix M over the rationals Q, the inverse matrix M^(-1) is again a matrix over Q. Are there (efficient) libraries to compute the inverse precisely?
I am aware of high-performance linear algebra libraries such as BLAS/LAPACK, but these libraries are based on floating point arithmetic and are thus not suitable for computing precise (analytical) solutions.
Motivation: I want to compute the absorption probabilities of a large absorbing Markov chain using its fundamental matrix. I would like to do so precisely.
Details: By large, I mean a 1000x1000 matrix in the best case, and a several-million-dimensional matrix in the worst case. The further I can scale things the better. (I realize that the worst case is likely far out of reach.)
You can use the Eigen matrix library, which works on arbitrary scalar types with little effort. There is an example in the documentation of how to use it with GMP's mpq_class: http://eigen.tuxfamily.org/dox/TopicCustomizing_CustomScalar.html
Of course, as @btilly noted, most of the time you should not calculate the inverse itself, but rather compute a matrix decomposition and use that to solve equation systems. For rational numbers you can use any LU decomposition, or, if the matrix is symmetric, the LDLt decomposition. See here for a catalogue of decompositions.
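Since the thread opened with Julia: the same decomposition-first idea can be sketched there with exact Rational{BigInt} arithmetic. The 2x2 blocks below are made-up placeholders for the transient (Q) and absorbing (R) parts of the chain; the absorption probabilities B = (I - Q)^-1 R come out of a solve, never an explicit inverse:

```julia
using LinearAlgebra

# Hypothetical transient-to-transient (Q) and transient-to-absorbing (R) blocks.
Q = Rational{BigInt}[0 1//2; 1//3 0]
R = Rational{BigInt}[1//2 0; 1//3 1//3]

F = lu(I - Q)   # exact LU over the rationals (Julia's generic, non-BLAS path)
B = F \ R       # absorption probabilities B = (I - Q)^-1 * R, computed exactly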

Cholesky Decomposition for a large number of small matrices using OpenCL

My question is similar to what was asked here: Performing many small matrix operations in parallel in OpenCL, except that I want to do Cholesky decomposition. Also, my matrices will range from 15x15 to 100x100, and I will have up to 100,000 of them. All of the matrices will have the same dimensions. The decomposed matrices will be used further within the GPU.
This paper (http://icl.cs.utk.edu/news_pub/submissions/haidar_iccp.pdf) discusses the algorithms at a high level. They use the term batched Cholesky for such a problem (a large number of small matrices). The way they have done it is to implement a batched version of all the steps involved in a Cholesky decomposition.
So I want to start with a batched matrix multiplication (since it is one of the steps in a Cholesky decomposition). For large matrices, matrix multiplication is done in a blocked manner on a GPU. My question: would that approach be suitable for the kind of problem I have? Any other suggestions would be helpful; I am a little unsure how to approach this.
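For reference, here is a CPU sketch in Julia (not OpenCL) of the unblocked, single-matrix Cholesky loop that each work-group would execute in the batched scheme, one work-group per matrix; the batched matrix multiplication only becomes necessary once this loop is blocked for the larger (say 100x100) sizes:

```julia
# Unblocked lower Cholesky of one SPD matrix, overwriting its lower triangle.
function chol_lower!(A::AbstractMatrix{<:AbstractFloat})
    n = size(A, 1)
    for j in 1:n
        for k in 1:j-1
            A[j, j] -= A[j, k]^2        # subtract the already-computed row
        end
        A[j, j] = sqrt(A[j, j])
        for i in j+1:n
            for k in 1:j-1
                A[i, j] -= A[i, k] * A[j, k]
            end
            A[i, j] /= A[j, j]          # scale the column below the pivot
        end
    end
    return A   # lower triangle now holds L, with A ≈ L * L'
end
```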

Algorithms for Performing Large Integer Matrix Operations w/ Numerical Stability

I'm looking for a library that performs matrix operations on large sparse matrices w/o sacrificing numerical stability. Matrices will be 1000+ by 1000+ and values of the matrix will be between 0 and 1000. I will be performing the index calculus algorithm (en.wikipedia.org/wiki/Index_calculus_algorithm) so I will be generating (sparse) row vectors of the matrix serially. As I develop each row, I will need to test for linear independence. Once I fill my matrix with the desired number of linearly independent vectors, I will then need to transform the matrix into reduced row echelon form.
The problem now is that my implementation uses Gaussian elimination to determine linear independence (ensuring row echelon form once all my row vectors have been found). However, given the density and size of the matrix, this means the entries in each new row become exponentially larger over time, as the lcm of the leading entries must be found in order to perform cancellation. Finding the reduced form of the matrix further exacerbates the problem.
So my question is: is there an algorithm, or better yet an implementation, that can test linear independence and compute the reduced row echelon form while keeping the entries as small as possible? An efficient test for linear independence is especially important, since in the index calculus algorithm it is by far the most frequently performed operation.
Thanks in advance!
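For what it's worth, here is one way the incremental test can be organized, sketched in Julia with exact rationals (my own illustration, not the OP's code): keeping the accepted rows in reduced row echelon form means every reduction divides by a pivot of 1, and Julia's Rational reduces by gcd at every step, which keeps entries far smaller than lcm-based integer cancellation:

```julia
# Maintain accepted rows in reduced row echelon form over Rational{BigInt}.
# Returns true (and extends the basis) iff `row` is independent of it.
function try_add_row!(basis::Vector{Vector{Rational{BigInt}}}, pivots::Vector{Int},
                      row::AbstractVector{<:Integer})
    r = Rational{BigInt}.(row)
    for (b, p) in zip(basis, pivots)
        c = r[p]
        iszero(c) || (r .-= c .* b)     # b[p] == 1, so this zeroes r[p]
    end
    q = findfirst(!iszero, r)
    q === nothing && return false       # linearly dependent: reject the row
    c = r[q]
    r ./= c                             # normalize the new pivot to 1
    for b in basis
        c = b[q]
        iszero(c) || (b .-= c .* r)     # keep every stored row fully reduced
    end
    push!(basis, r); push!(pivots, q)
    return true
end

basis, pivots = Vector{Rational{BigInt}}[], Int[]
try_add_row!(basis, pivots, [2, 4, 6])   # true: first row accepted
try_add_row!(basis, pivots, [1, 2, 3])   # false: a multiple of the first row
```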
Usually if you are working with large matrices, people use LAPACK: this library contains all the basic matrix routines and supports many different matrix types (sparse, ...). You can use this library to implement your algorithm; I think it will help you.

Fast algorithm to calculate Pi in parallel

I am starting to learn CUDA and I think calculating long digits of pi would be a nice, introductory project.
I have already implemented the simple Monte Carlo method, which is easily parallelizable. I simply have each thread randomly generate points on the unit square, figure out how many lie within the unit circle, and tally up the results using a reduction operation.
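For concreteness, the serial version of that estimator looks like this in Julia (the GPU version splits the loop across threads and reduces the per-thread hit counts):

```julia
# Monte Carlo pi: fraction of random points in the unit square that land
# inside the quarter circle of radius 1, times 4.
function mc_pi(n)
    hits = 0
    for _ in 1:n
        x, y = rand(), rand()
        hits += (x^2 + y^2 <= 1.0)
    end
    return 4 * hits / n
end

mc_pi(10_000_000)   # ~3.1416; the error shrinks only like 1/sqrt(n)
```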
But that is certainly not the fastest algorithm for calculating the constant. Before, when I did this exercise on a single threaded CPU, I used Machin-like formulae to do the calculation for far faster convergence. For those interested, this involves expressing pi as the sum of arctangents and using Taylor series to evaluate the expression.
An example of such a formula is Machin's: π/4 = 4·arctan(1/5) − arctan(1/239).
Unfortunately, I found that parallelizing this technique to thousands of GPU threads is not easy. The problem is that the majority of the operations are simply doing high precision math as opposed to doing floating point operations on long vectors of data.
So I'm wondering, what is the most efficient way to calculate arbitrarily long digits of pi on a GPU?
You should use the Bailey–Borwein–Plouffe (BBP) formula.
Why? First of all, you need an algorithm that can be broken down into independent pieces. So, the first thing that came to my mind is a representation of pi as an infinite sum: each processor computes one term, and you sum them all at the end.
Then, it is preferable that each processor manipulates small-precision values, as opposed to very high precision ones. For example, if you want one billion decimals and you use one of the expressions discussed here, such as the Chudnovsky algorithm, each of your processors would need to manipulate a billion-digit number. That is simply not the appropriate method for a GPU.
So, all in all, the BBP formula will allow you to compute the digits of pi separately (the algorithm is very cool), and with "low precision" processors! Read the "BBP digit-extraction algorithm for π" section.
Advantages of the BBP algorithm for computing π
This algorithm computes π without requiring custom data types having thousands or even millions of digits. The method calculates the nth digit without calculating the first n − 1 digits, and can use small, efficient data types.
The algorithm is the fastest way to compute the nth digit (or a few digits in a neighborhood of the nth), but π-computing algorithms using large data types remain faster when the goal is to compute all the digits from 1 to n.
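For reference, the identity behind this is π = Σ_{k≥0} 16^(−k) · (4/(8k+1) − 2/(8k+4) − 1/(8k+5) − 1/(8k+6)), and digit extraction works by multiplying through by 16^n and keeping only fractional parts via modular exponentiation. A CPU sketch in Julia (double precision, so trustworthy only up to positions around 10^7; a GPU version would give each thread its own position n):

```julia
# Fractional part of 16^n * sum_{k>=0} 1 / (16^k * (8k + j)).
function sfrac(n, j)
    s = 0.0
    for k in 0:n                       # finite part, kept in [0, 1) as we go
        d = 8k + j
        s += powermod(16, n - k, d) / d
        s -= floor(s)
    end
    for k in n+1:n+32                  # rapidly vanishing tail
        s += 16.0^(n - k) / (8k + j)
    end
    return s - floor(s)
end

# Hex digits of pi starting at (0-based) fractional position n.
function pi_hex(n, m = 8)
    x = 4 * sfrac(n, 1) - 2 * sfrac(n, 4) - sfrac(n, 5) - sfrac(n, 6)
    x -= floor(x)
    out = IOBuffer()
    for _ in 1:m
        x *= 16
        d = floor(Int, x)
        print(out, string(d, base = 16))
        x -= d
    end
    return String(take!(out))
end

pi_hex(0)   # "243f6a88": pi = 3.243f6a88... in base 16
```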

Efficient algorithm for finding largest eigenpair of small general complex matrix

I am looking for an efficient algorithm to find the largest eigenpair of a small, general (non-square, non-sparse, non-symmetric), complex matrix A of size m x n. By small I mean m and n are typically between 4 and 64, and usually around 16, but with m not equal to n.
This problem is straightforward to solve with the general LAPACK SVD algorithms, i.e. gesvd or gesdd. However, as I am solving millions of these problems and only require the largest eigenpair, I am looking for a more efficient algorithm. Additionally, in my application the eigenvectors will generally be similar for all cases. This led me to investigate Arnoldi iteration based methods, but I have found neither a good library nor an algorithm that applies to my small, general, complex matrix. Is there an appropriate algorithm and/or library?
Rayleigh quotient iteration has cubic convergence. You may also want to implement the power method and see how the two compare, since Rayleigh iteration requires an LU or QR decomposition of your matrix at every step.
http://en.wikipedia.org/wiki/Rayleigh_quotient_iteration
Following @rchilton's comment, you can apply this to A*A, where A* denotes the conjugate transpose of A (so that the product is square).
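A sketch of that suggestion in Julia (my own illustration, with made-up sizes): run Rayleigh quotient iteration on the square matrix B = A*A; each step solves one shifted system, which is where the LU factorization comes in:

```julia
using LinearAlgebra

# Rayleigh quotient iteration on the square matrix B (e.g. B = A' * A).
# Cubic convergence near a solution; each step pays for one LU inside `\`.
function rqi(B; iters = 20, x = normalize!(randn(ComplexF64, size(B, 1))))
    μ = dot(x, B * x)                    # Rayleigh quotient of the start vector
    for _ in 1:iters
        x = normalize!((B - μ * I) \ x)  # shifted solve sharpens the estimate
        μ = dot(x, B * x)
        # (if the shift lands exactly on an eigenvalue, the solve is singular;
        # production code would guard or slightly perturb μ here)
    end
    return μ, x
end
```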
The idea of looking for the largest eigenpair is analogous to computing a large power of the matrix, as the lower-frequency modes get damped out during the iteration. The Lanczos algorithm is one of a few such algorithms that rely on the so-called Ritz eigenvectors during the decomposition. From Wikipedia:
The Lanczos algorithm is an iterative algorithm ... that is an adaptation of power methods to find eigenvalues and eigenvectors of a square matrix or the singular value decomposition of a rectangular matrix. It is particularly useful for finding decompositions of very large sparse matrices. In latent semantic indexing, for instance, matrices relating millions of documents to hundreds of thousands of terms must be reduced to singular-value form.
The technique works even if the system is not sparse, but if it is large and dense it has the advantage that it doesn't all have to be stored in memory at the same time.
How does it work?
The power method for finding the largest eigenvalue of a matrix A can be summarized by noting that if x_{0} is a random vector and x_{n+1}=A x_{n}, then in the large n limit, x_{n} / ||x_{n}|| approaches the normed eigenvector corresponding to the largest eigenvalue.
Non-square matrices?
Noting that your system is not a square matrix, I'm pretty sure that the SVD problem can be decomposed into separate linear algebra problems where the Lanczos algorithm would apply. A good place to ask such questions would be over at https://math.stackexchange.com/.
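Putting the two subsections together, a minimal Julia sketch (my illustration; the warm start is motivated by the OP's remark that the eigenvectors are similar across problems): power iteration applied implicitly to A*A yields the dominant singular triplet of the non-square A:

```julia
using LinearAlgebra

# Power iteration on A' * A without ever forming it: v converges to the top
# right singular vector, and (sigma, u) follow from one extra product with A.
function top_triplet(A; iters = 100, v = normalize!(randn(ComplexF64, size(A, 2))))
    for _ in 1:iters
        v = normalize!(A' * (A * v))   # one application of A, then of A'
    end
    u = A * v
    σ = norm(u)
    return σ, u / σ, v                 # largest singular value and its vectors
end

# Warm start: reuse the previous problem's v when the matrices are similar.
A1 = randn(ComplexF64, 16, 12)
σ1, u1, v1 = top_triplet(A1)
σ2, u2, v2 = top_triplet(A1 + 0.01 * randn(ComplexF64, 16, 12); iters = 10, v = v1)
```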
