ConjugateGradient in Eigen for Hermitian Matrices - eigen

In the literature, the conjugate gradient method is typically presented for real symmetric positive-definite matrices. However, in the description of the CG method in the Eigen library:
https://eigen.tuxfamily.org/dox/group__IterativeLinearSolvers__Module.html
one can find the statement:
"ConjugateGradient for selfadjoint (hermitian) matrices"
This implies that it should also work for Hermitian (complex, not purely real) matrices. Is that the case?
A minimal example shows that it doesn't work naively with Hermitian matrices. Is there a trick that one needs to know, or is this an error in the description?
My minimal example uses the spin-3/2 matrices Sx (real symmetric) and Sy (complex Hermitian), whose eigenvalues are known to be -1.5, -0.5, 0.5, 1.5.
The results for the real symmetric case are fine, but the complex case results in NaN.
#include <cmath>      // sqrt
#include <complex>
#include <iostream>
#include <Eigen/Core>
#include <Eigen/IterativeLinearSolvers>

int main(int argc, char **argv){
    Eigen::VectorXcd b=Eigen::VectorXcd::Ones(4);
    Eigen::VectorXcd x;
    std::complex<double> i_unit(0,1);

    //Hermitian matrix (Sy):
    Eigen::MatrixXcd A(4,4);
    A<<0,-i_unit*sqrt(3.)/2., 0 ,0,
       i_unit*sqrt(3.)/2., 0 ,-i_unit, 0,
       0,i_unit,0,-i_unit*sqrt(3.)/2.,
       0,0,i_unit*sqrt(3.)/2.,0;

    //Real symmetric matrix (Sx):
    Eigen::MatrixXcd B(4,4);
    B<<0,sqrt(3.)/2., 0 ,0,
       sqrt(3.)/2., 0 ,1, 0,
       0,1,0,sqrt(3.)/2.,
       0,0,sqrt(3.)/2.,0;

    Eigen::ConjugateGradient< Eigen::MatrixXcd, Eigen::Lower|Eigen::Upper> cg;

    // Solve A x = b with the complex Hermitian matrix
    cg.compute(A);
    x = cg.solve(b);
    std::cout<<"Hermitian matrix:"<<std::endl;
    std::cout<<"A: "<<std::endl<<A<<std::endl;
    std::cout<<"b: "<<std::endl<<b<<std::endl;
    std::cout<<"x: "<<std::endl<<x<<std::endl;
    std::cout<<"(b-A*x).norm(): "<<(b-A*x).norm()<<std::endl;
    std::cout<<"cg.error(): "<<cg.error()<<std::endl;
    std::cout<<std::endl;

    // Solve B x = b with the real symmetric matrix
    cg.compute(B);
    x = cg.solve(b);
    std::cout<<"Real symmetric matrix:"<<std::endl;
    std::cout<<"B: "<<std::endl<<B<<std::endl;
    std::cout<<"b: "<<std::endl<<b<<std::endl;
    std::cout<<"x: "<<std::endl<<x<<std::endl;
    std::cout<<"(b-B*x).norm(): "<<(b-B*x).norm()<<std::endl;
    std::cout<<"cg.error(): "<<cg.error()<<std::endl;
    std::cout<<std::endl;
    return 0;
}

Being Hermitian is not enough: the matrix also needs to be positive definite, which is not the case here, since your matrix has both positive and negative eigenvalues. In any case, CG is designed for very large sparse matrices; for a 4x4 matrix, a dense decomposition is the better choice. In your case, LDLT will do well.
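To illustrate the point, here is a minimal plain-Python sketch of textbook conjugate gradient with the complex inner product. It is not Eigen's implementation, and all helper names are hypothetical. The shift by 2 is an assumption made only for this illustration: it moves the eigenvalues of Sy from -1.5, -0.5, 0.5, 1.5 to 0.5, 1.5, 2.5, 3.5, so the shifted matrix is Hermitian and positive definite and CG converges on it.

```python
import math

def matvec(A, x):
    # Dense matrix-vector product for lists of (complex) numbers
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def vdot(u, v):
    # Complex inner product <u, v> = sum(conj(u_i) * v_i)
    return sum(ui.conjugate() * vi for ui, vi in zip(u, v))

def conjugate_gradient(A, b, tol=1e-12, max_iter=100):
    # Textbook CG; assumes A is Hermitian positive definite
    x = [0j] * len(b)
    r = list(b)                      # residual b - A x for the start x = 0
    p = list(r)
    rs_old = vdot(r, r).real
    for _ in range(max_iter):
        if math.sqrt(rs_old) < tol:
            break
        Ap = matvec(A, p)
        alpha = rs_old / vdot(p, Ap).real   # p^H A p is real and > 0 for HPD A
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = vdot(r, r).real
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x

s = math.sqrt(3.0) / 2.0
Sy = [[0, -1j * s, 0, 0],
      [1j * s, 0, -1j, 0],
      [0, 1j, 0, -1j * s],
      [0, 0, 1j * s, 0]]
shift = 2.0   # assumption: chosen only to make the matrix positive definite
A_pd = [[Sy[i][j] + (shift if i == j else 0) for j in range(4)] for i in range(4)]
b = [1 + 0j] * 4
x = conjugate_gradient(A_pd, b)
residual = math.sqrt(sum(abs(bi - axi) ** 2 for bi, axi in zip(b, matvec(A_pd, x))))
print("residual norm:", residual)
```

Without the shift the matrix is indefinite, and p^H A p can be zero or negative, which is where the method breaks down.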

Related

numerical diagonalization of a unitary matrix

To numerically diagonalize a unitary matrix I use the LAPACK routine zgeev.
The problem is: In case of degeneracies the degenerate subspace is not orthonormalized, since the routine is for general matrices.
However, since in my case the matrices are unitary, the basis can be always orthonormalized. Is there a better solution than applying QR-algorithm afterwards to the degenerate subspace?
Short answer: Schur decomposition!
If a square matrix A is complex, then its Schur factorization is A=ZTZ*, where Z is unitary and T is upper triangular.
If A happens to be unitary, T must also be unitary. Since T is both unitary (hence normal) and triangular, it is diagonal (proof here, or there).
Let's consider the vectors Z.e_i, where e_i are the vectors of the canonical basis. These vectors obviously form an orthonormal basis. Moreover, these vectors are eigenvectors of the matrix A.
Hence, the columns of the unitary matrix Z are eigenvectors of the unitary matrix A and form an orthonormal basis.
As a consequence, computing a Schur decomposition of a unitary matrix is equivalent to finding an orthonormal basis of its eigenvectors.
ZGEESX computes the eigenvalues, the Schur form, and, optionally, the matrix of Schur vectors for GE matrices
The resulting T can also be tested to check that A is unitary.
Here is a piece of Python code testing it; scipy's scipy.linalg.schur makes use of LAPACK's zgees for the Schur decomposition. I used hpaulj's code to generate a random unitary matrix, as shown in How to create random orthonormal matrix in python numpy.
import numpy as np
import scipy.linalg

#from hpaulj, https://stackoverflow.com/questions/38426349/how-to-create-random-orthonormal-matrix-in-python-numpy
def rvs(dim=3):
    random_state = np.random
    H = np.eye(dim)
    D = np.ones((dim,))
    for n in range(1, dim):
        x = random_state.normal(size=(dim-n+1,))
        D[n-1] = np.sign(x[0])
        x[0] -= D[n-1]*np.sqrt((x*x).sum())
        # Householder transformation
        Hx = (np.eye(dim-n+1) - 2.*np.outer(x, x)/(x*x).sum())
        mat = np.eye(dim)
        mat[n-1:, n-1:] = Hx
        H = np.dot(H, mat)
    # Fix the last sign such that the determinant is 1
    D[-1] = (-1)**(1-(dim % 2))*D.prod()
    # Equivalent to np.dot(np.diag(D), H) but faster, apparently
    H = (D*H.T).T
    return H

n=42
A= rvs(n)
A = A.astype(complex)
T,Z=scipy.linalg.schur(A,output='complex',lwork=None,overwrite_a=False,sort=None,check_finite=True)
#print(T)
normT=np.linalg.norm(T,ord=None) # Frobenius norm (ord=None on a matrix)
# Save the diagonal (the eigenvalues) and zero it, leaving only the off-diagonal part of T
eigenvalues=[]
for i in range(n):
    eigenvalues.append(T[i,i])
    T[i,i]=0.
normTu=np.linalg.norm(T,ord=None)
print('must be very low if A is unitary: ',normTu/normT)
#print(Z)
for i in range(n):
    v=Z[:,i]
    w=A.dot(v)-eigenvalues[i]*v
    print(i,'must be very low if column i of Z is eigenvector of A: ',np.linalg.norm(w,ord=None)/np.linalg.norm(v,ord=None))

Blas daxpy routine with matrices

I am working on some matrix-related problems in C++. I want to compute Y = aX + Y, where X and Y are matrices and a is a constant. I thought about using the BLAS routine daxpy; however, according to the documentation, DAXPY is a vector routine, and I am not getting the same results as when I solve the same problem in MATLAB.
I am currently running this:
F77NAME(daxpy)(N, a, X, 1, Y, 1);
When you need to perform the operation Y = a*X + Y, it does not matter whether X and Y are 1D or 2D arrays, since the operation is done element-wise.
So, if you allocated each matrix through a single pointer, e.g. double *A = new double[M*N];, then you can use daxpy by passing M*N as the vector length:
int MN = M*N;
int one = 1;
// X and Y are already double* here, so they are passed directly
F77NAME(daxpy)(&MN, &a, X, &one, Y, &one);
The same goes for a two-dimensional array on the stack, such as double A[3][2];, since its memory is laid out contiguously.
Otherwise, you need to use a for loop and add each row separately.
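As a small illustration of the flattening argument, here is a plain-Python sketch. The axpy function below is a hypothetical stand-in that only mirrors the BLAS argument order; it is not a binding to the real routine. It treats an M x N matrix stored in one contiguous buffer as a vector of length M*N.

```python
def axpy(n, a, x, incx, y, incy):
    # y[i*incy] += a * x[i*incx] for i in 0..n-1, same argument order as BLAS axpy
    for i in range(n):
        y[i * incy] += a * x[i * incx]

M, N = 2, 3
a = 2.0
# Both matrices live in one contiguous buffer of M*N elements
X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
Y = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]

axpy(M * N, a, X, 1, Y, 1)   # one call covers the whole matrix
print(Y)
```

Because the update is element-wise, the result is the same whether the buffer is read as row-major or column-major, as long as X and Y use the same layout.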

I am looking for a simple algorithm for fast DCT and IDCT of matrix [NxM]

I am looking for a simple algorithm to perform fast DCT (type 2) of a matrix of any size [NxM], and also an algorithm for the inverse transformation IDCT (also called DCT type 3).
I need a DCT-2D algorithm, but even a DCT-1D algorithm is good enough because I can use DCT-1D to implement DCT-2D (and IDCT-1D to implement IDCT-2D ).
PHP code is preferable, but any algorithm that is clear enough will do.
My current PHP script for implementing DCT/IDCT is very slow whenever matrix size is more than [200x200].
I was hoping to find a way to perform a DCT of up to [4000x4000] in less than 20 seconds. Does anyone know how to do it?
Here is my computation of the 1D FDCT and IFDCT via an FFT of the same length:
//---------------------------------------------------------------------------
void DFCTrr(double *dst,double *src,double *tmp,int n)
{
    // exact normalized DCT II by N DFFT
    int i,j;
    double nn=n,a,da=(M_PI*(nn-0.5))/nn,a0,b0,a1,b1,m;
    for (j=  0,i=n-1;i>=0;i-=2,j++) dst[j]=src[i];
    for (j=n-1,i=n-2;i>=0;i-=2,j--) dst[j]=src[i];
    DFFTcr(tmp,dst,n);
    m=2.0*sqrt(2.0);
    for (a=0.0,j=0,i=0;i<n;i++,j+=2,a+=da)
    {
        a0=tmp[j+0]; a1= cos(a);
        b0=tmp[j+1]; b1=-sin(a);
        a0=(a0*a1)-(b0*b1);
        if (i) a0*=m; else a0*=2.0;
        dst[i]=a0;
    }
}
//---------------------------------------------------------------------------
void iDFCTrr(double *dst,double *src,double *tmp,int n)
{
    // exact normalized DCT III = iDCT II by N iDFFT
    int i,j;
    double nn=n,a,da=(M_PI*(nn-0.5))/nn,a0,m,aa,bb;
    m=1.0/sqrt(2.0);
    for (a=0.0,j=0,i=0;i<n;i++,j+=2,a+=da)
    {
        a0=src[i];
        if (i) a0*=m;
        aa= cos(a)*a0;
        bb=+sin(a)*a0;
        tmp[j+0]=aa;
        tmp[j+1]=bb;
    }
    m=src[0]*0.25;
    iDFFTrc(src,tmp,n);
    for (j=  0,i=n-1;i>=0;i-=2,j++) dst[i]=src[j]-m;
    for (j=n-1,i=n-2;i>=0;i-=2,j--) dst[i]=src[j]-m;
}
//---------------------------------------------------------------------------
dst is the destination vector [n]
src is the source vector [n]
tmp is a temporary vector [2n]
These arrays must not overlap! The code is taken from my transform class, so I hope I did not forget to copy something.
XXXrr means the destination is real and the source is also real
XXXrc means the destination is real and the source is complex
XXXcr means the destination is complex and the source is real
All data are double arrays. In the complex domain the first number is the real part and the second the imaginary part, so the array has size 2N. Both functions use FFT and iFFT; if you need the code for them too, leave me a comment. Just to be sure, I added slow (non-fast) implementations of them below. They are much easier to copy, because the fast ones rely too heavily on the transform class hierarchy.
slow DFT,iDFT implementations for testing:
//---------------------------------------------------------------------------
void transform::DFTcr(double *dst,double *src,int n)
{
    int i,j;
    double a,b,a0,_n,q,qq,dq;
    dq=+2.0*M_PI/double(n); _n=2.0/double(n);
    for (q=0.0,j=0;j<n;j++,q+=dq)
    {
        a=0.0; b=0.0;
        for (qq=0.0,i=0;i<n;i++,qq+=q)
        {
            a0=src[i];
            a+=a0*cos(qq);
            b+=a0*sin(qq);
        }
        dst[j+j  ]=a*_n;
        dst[j+j+1]=b*_n;
    }
}
//---------------------------------------------------------------------------
void transform::iDFTrc(double *dst,double *src,int n)
{
    int i,j;
    double a,a0,a1,b0,b1,q,qq,dq;
    dq=+2.0*M_PI/double(n);
    for (q=0.0,j=0;j<n;j++,q+=dq)
    {
        a=0.0;
        for (qq=0.0,i=0;i<n;i++,qq+=q)
        {
            a0=src[i+i  ]; a1=+cos(qq);
            b0=src[i+i+1]; b1=-sin(qq);
            a+=(a0*a1)-(b0*b1);
        }
        dst[j]=a*0.5;
    }
}
//---------------------------------------------------------------------------
So for testing, just rename these to DFFTcr and iDFFTrc (or use them to check your own FFT and iFFT against). Once the code works properly, implement your own FFT and iFFT. For more info on that see:
How to compute Discrete Fourier Transform?
2D DFCT
1. resize the src matrix to a power of 2 by adding zeros; to use the fast algorithm the size must always be a power of 2!
2. allocate NxN real matrices tmp, dst and a 1xN complex vector t
3. transform the lines by DFCTrr:
   DFCT(tmp.line(i),src.line(i),t,N)
4. transpose the tmp matrix
5. transform the lines by DFCTrr:
   DFCT(dst.line(i),tmp.line(i),t,N)
6. transpose the dst matrix
7. normalize dst by multiplying the matrix by 0.0625
2D iDFCT
Is the same as above, but use iDFCTrr and multiply by 16.0 instead.
[Notes]
Before implementing your own FFT and iFFT, make sure they give the same results as mine; otherwise the DCT/iDCT will not work properly!
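For checking a fast implementation against a known-good result, a slow reference can help. The following plain-Python sketch evaluates the textbook DCT-II and DCT-III sums directly, so it is O(N^2) per line and not the FFT-based routine above. It uses the unnormalized convention X_k = sum_n x_n cos(pi/N (n + 1/2) k), which is an assumption and differs from the scaling in the code above. The 2D transform follows the same rows, transpose, rows, transpose pattern. All function names are hypothetical.

```python
import math

def dct2_1d(x):
    # Unnormalized DCT-II: X_k = sum_n x_n * cos(pi/N * (n + 1/2) * k)
    N = len(x)
    return [sum(x[n] * math.cos(math.pi / N * (n + 0.5) * k) for n in range(N))
            for k in range(N)]

def idct2_1d(X):
    # Inverse of the DCT-II above: a DCT-III scaled by 2/N, with X_0 halved
    N = len(X)
    return [(X[0] / 2 + sum(X[k] * math.cos(math.pi / N * (n + 0.5) * k)
                            for k in range(1, N))) * 2 / N
            for n in range(N)]

def transpose(m):
    return [list(col) for col in zip(*m)]

def apply_2d(m, f):
    # Separable 2D transform: apply f to every row, transpose,
    # apply f to every row again (the former columns), transpose back
    rows = [f(r) for r in m]
    cols = [f(c) for c in transpose(rows)]
    return transpose(cols)

src = [[1.0, 2.0, 3.0],
       [4.0, 5.0, 6.0]]
coeffs = apply_2d(src, dct2_1d)     # forward 2D DCT-II
back = apply_2d(coeffs, idct2_1d)   # inverse, should reproduce src
print(coeffs[0][0], back)
```

Since the sums are evaluated directly, this version works for any size [NxM] without padding to a power of 2, at the cost of speed.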

CUBLAS - is matrix-element exponentiation possible?

I'm using CUBLAS (Cuda Blas libraries) for matrix operations.
Is it possible to use CUBLAS to compute the element-wise exponentiation/root mean square of a matrix's items?
I mean, having the 2x2 matrix
1 4
9 16
What I want is a function that raises each element to a given power, e.g. 2
1 16
81 256
and calculating the root mean square e.g.
1 2
3 4
Is this possible with CUBLAS? I can't find a function suitable to this goal, but I'll ask here first to begin coding my own kernel.
So this may well be something you have to implement yourself, because the library won't do it for you. (There's probably some way to implement some of it in terms of BLAS level 3 routines, certainly the squaring of the matrix elements, but it would involve expensive and otherwise unnecessary matrix-vector multiplications. And I still don't know how you'd do the square-root operation.) The reason is that these operations aren't really linear-algebra procedures; taking the square root of each matrix element doesn't really correspond to any fundamental linear algebra operation.
The good news is that these elementwise operations are very simple to implement in CUDA. Again, there are lots of tuning options one could play with for best performance, but one can get started fairly easily.
As with the matrix addition operations, you'll be treating the NxM matrices here as (N*M)-length vectors; the structure of the matrix doesn't matter for these elementwise operations. So you'll be passing in a pointer to the first element of the matrix and treating it as a single list of N*M numbers. (I'm going to assume you're using floats here, as you were talking about SGEMM and SAXPY earlier.)
The kernel, the actual bit of CUDA code which implements the operation, is quite simple. For now, each thread will compute the square (or square root) of one array element. (Whether this is optimal or not for performance is something you could test.) So the kernels would look like the following. I'm assuming you're doing something like B_ij = (A_ij)^2; if you wanted to do the operation in place, e.g. A_ij = (A_ij)^2, you could do that, too:
__global__ void squareElements(float *a, float *b, int N) {
    /* which element does this compute? */
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    /* if valid, square the array element */
    if (tid < N)
        b[tid] = (a[tid]*a[tid]);
}
__global__ void sqrtElements(float *a, float *b, int N) {
    /* which element does this compute? */
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    /* if valid, sqrt the array element */
    if (tid < N)
        b[tid] = sqrt(a[tid]); /* or sqrtf() */
}
Note that if you're ok with very slightly increased error, the 'sqrtf()' function which has maximum error of 3 ulp (units in the last place) is significantly faster.
How you call these kernels will depend on the order in which you're doing things. If you've already made some CUBLAS calls on these matrices, you'll want to use them on the arrays which are already in GPU memory.
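To see why the kernels need the tid < N guard, here is a plain-Python emulation of the launch arithmetic. It runs on the CPU and is not CUDA code; the function name and the block size are assumptions made for illustration. The number of blocks is rounded up so that every element is covered, which means the last block usually contains threads past the end of the array.

```python
def launch_square_elements(a, threads_per_block=256):
    # Emulates a 1D grid: one "thread" per array element
    n = len(a)
    b = [0.0] * n
    blocks = (n + threads_per_block - 1) // threads_per_block   # ceil(n / threads_per_block)
    for block_idx in range(blocks):
        for thread_idx in range(threads_per_block):
            tid = threads_per_block * block_idx + thread_idx
            if tid < n:   # same bounds guard as in the kernels
                b[tid] = a[tid] * a[tid]
    return b, blocks

# 2x2 matrix from the question, flattened to a 4-element vector
out, blocks = launch_square_elements([1.0, 4.0, 9.0, 16.0], threads_per_block=3)
print(out, blocks)
```

With 4 elements and 3 threads per block, two blocks are launched and the last two threads of the second block do nothing.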

Using basic arithmetic for calculating Pi with arbitrary precision

I am looking for a formula/algorithm to calculate PI~3.14 in a given precision.
The formula/algorithm must have only very basic arithmetic as
+: Addition
-: Subtraction
*: Multiplication
/: Division
because I want to implement these operations in C++ and want to keep the implementation as simple as possible (no bignum library is allowed).
I have found that this formula for calculating Pi is pretty simple:
Pi/4 = 1 - 1/3 + 1/5 - 1/7 + ... = sum( (-1)^(k+1)/(2*k-1) , k=1..inf )
(note that (-1)^(k+1) can be implemented easily by above operators).
But the problem about this formula is the inability to specify the number of digits to calculate. In other words, there is no direct way to determine when to stop the calculation.
Maybe a workaround to this problem is calculating the difference between n-1th and nth calculated term and considering it as the current error.
Anyway, I am looking for a formula/algorithm that has these properties and also converges faster to Pi.
Codepad link:
#include <iostream>
#include <iomanip>   // std::setprecision
#include <cmath>     // M_PI
int main()
{
    double p16 = 1, pi = 0, precision = 10;
    for(int k=0; k<=precision; k++)
    {
        pi += 1.0/p16 * (4.0/(8*k + 1) - 2.0/(8*k + 4) - 1.0/(8*k + 5) - 1.0/(8*k+6));
        p16 *= 16;
    }
    std::cout<<std::setprecision(80)<<pi<<'\n'<<M_PI;
}
Output:
3.141592653589793115997963468544185161590576171875
3.141592653589793115997963468544185161590576171875
This is actually the Bailey-Borwein-Plouffe formula, also taken from the link from wikipedia.
In your original (slowly converging) example, the error term can be computed because this is an alternating series; see http://en.wikipedia.org/wiki/Alternating_series#Approximating_Sums
Essentially, the next uncomputed term is a bound on the error.
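A minimal sketch of that stopping rule, using only +, -, * and / on the series from the question. The function name and the tolerance value are assumptions made for illustration. The loop stops as soon as the next uncomputed term, 4/(2k+1), is no larger than the requested tolerance.

```python
import math

def leibniz_pi(tolerance):
    # pi = 4 * (1 - 1/3 + 1/5 - 1/7 + ...)
    total = 0.0
    sign = 1.0
    k = 0
    # The next uncomputed term bounds the error of an alternating series
    while 4.0 / (2 * k + 1) > tolerance:
        total += sign * 4.0 / (2 * k + 1)
        sign = -sign
        k += 1
    return total, k

approx, terms = leibniz_pi(1e-4)
print(approx, terms, abs(approx - math.pi))
```

The number of terms grows like 2/tolerance, which shows how slowly this series converges compared with the formula above.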
You can just use the Taylor expansion of arctan(x) at x = 1; summing its terms gives pi/4.
The Taylor series of arctan(1):
http://en.wikipedia.org/wiki/Taylor_series
You can also use the Euler formula with z=1 and then multiply the result by 4.
http://upload.wikimedia.org/math/2/7/9/279bed5a2ea3b80a71f5b22078090168.png

Resources