I'm new to OpenCL programming. I have to perform a multiplication of two complex matrices, but I don't know how to deal with complex matrices in OpenCL. Can anyone help? I have already implemented matrix multiplication with real numbers.
One way, though probably not the most efficient, would be to regard your complex matrix, say Z, as a pair of real matrices: X (the real parts) and Y (the imaginary parts), i.e.
X[i,j] = Re(Z[i,j]), Y[i,j] = Im(Z[i,j])
If you have another complex matrix, say W, split in the same way into U and V, then the product is
Z*W = (X*U - Y*V) + i*(X*V + Y*U)
where on the right-hand side we have real matrices, real matrix multiplication, and real addition.
In terms of multiplies and adds this is the same amount of computation as doing the complex multiplications and additions of the elements directly. The inefficiency arises if you are given, and must return, arrays of complex numbers: then you have to split the matrices you are going to multiply into real ones as above, and combine the product back into a complex array.
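A quick illustration of the split in numpy (the OpenCL kernel would implement the same four real multiplications tile by tile; numpy here just checks the algebra):

    import numpy as np

    Z = np.random.rand(4, 4) + 1j * np.random.rand(4, 4)
    W = np.random.rand(4, 4) + 1j * np.random.rand(4, 4)

    X, Y = Z.real, Z.imag   # real and imaginary parts of Z
    U, V = W.real, W.imag   # real and imaginary parts of W

    # Z*W = (X*U - Y*V) + i*(X*V + Y*U): four real matrix multiplies
    P = (X @ U - Y @ V) + 1j * (X @ V + Y @ U)
    assert np.allclose(P, Z @ W)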
I'm trying to solve the same ODE simultaneously at different points (each point n is an independent vector of shape m) using the scipy BDF solver. In other words, I have an n x m matrix, and I want to solve for n points (by solving, I mean advancing them in time with a while loop), knowing that the n points are independent of each other.
Obviously you can loop over the different points, but this method takes too much time. Is there any way to make this faster and use it as a vectorized function?
I also tried reshaping my matrix into a 1D vector, but then the solver computes the Jacobian of the complete vector, which takes too much time and is useless since the points are independent.
Maybe there is a way to specify that the cross-derivatives between different points are zero, to speed up the Jacobian computation?
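For reference, a minimal sketch of the flattened-vector approach described above (toy right-hand side; the sizes are hypothetical):

    import numpy as np
    from scipy.integrate import solve_ivp

    n, m = 50, 3                     # n independent points, each a vector of length m

    def rhs(t, y):
        Y = y.reshape(n, m)          # each row is one independent point
        return (-Y).ravel()          # toy dynamics: dy/dt = -y for every point

    y0 = np.random.rand(n * m)
    sol = solve_ivp(rhs, (0.0, 1.0), y0, method="BDF")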
Thanks in advance for the answer
Edit:
Thanks for your answer @Lutz Lehmann. I was able to speed up the computation a little using jac_sparsity, which avoids computing a lot of unnecessary Jacobian entries.
The other improvement I can imagine concerns the step size h_abs: each independent ODE should have its own h_abs. Using the 1D-vector method means that all the ODEs advance at the same step size h_abs, i.e. the most restrictive one. I don't know if there is any way of doing this.
I am already using a vectorized atol built as an n x m matrix and reshaped the same way as the complete set of ODEs, to make sure that the right tolerances are applied to each variable. I've never used numba so far, but I will definitely have a look.
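A sketch of what the jac_sparsity fix looks like for a flattened system like the one above: the Jacobian is block diagonal, one m x m block per independent point, so declaring the pattern lets the solver skip the zero cross-terms. The per-variable atol is built the way the edit describes:

    import numpy as np
    from scipy.integrate import solve_ivp
    from scipy.sparse import block_diag

    n, m = 50, 3

    def rhs(t, y):
        return -y                                      # toy dynamics, independent per variable

    y0 = np.random.rand(n * m)
    sparsity = block_diag([np.ones((m, m))] * n)       # (n*m) x (n*m) block-diagonal pattern
    atol = np.full((n, m), 1e-8).ravel()               # n x m tolerances, flattened like y
    sol = solve_ivp(rhs, (0.0, 1.0), y0, method="BDF",
                    jac_sparsity=sparsity, atol=atol)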
Related question: Multiplying real matrix with a complex vector using BLAS
Suppose I want C = A*B, where A, B, and C are real, complex, and complex matrices, respectively; elementwise the products are A[i,j] * B[j,k] = A[i,j] * Re(B[j,k]) + i * A[i,j] * Im(B[j,k]). Is there any available subroutine in BLAS?
I can think of splitting B into two real matrices for the real and imaginary parts, doing a dgemm on each, then combining (combining should be faster than the matrix multiplications, even with nested loops), as suggested by Multiplying real matrix with a complex vector using BLAS.
I don't know if there is a direct option in BLAS.
No, there is no routine in standard BLAS that multiplies real and complex matrices together to produce a complex result.
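Given that, the split-and-combine workaround looks like this in Python via scipy's BLAS bindings (a sketch, not the only way to arrange it):

    import numpy as np
    from scipy.linalg.blas import dgemm

    A = np.random.rand(3, 4)                               # real
    B = np.random.rand(4, 5) + 1j * np.random.rand(4, 5)   # complex

    # two real GEMMs on the real and imaginary parts of B, then recombine
    C = dgemm(1.0, A, B.real) + 1j * dgemm(1.0, A, B.imag)
    assert np.allclose(C, A @ B)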
Let A be an n x n sparse matrix, represented by a sequence of m tuples of the form (i, j, a), with indices i, j (between 0 and n-1) and a value a in the underlying field F.
What algorithms are used, in practice, to solve linear systems of equations of the form Ax = b? Please describe them, don't just link somewhere.
Notes:
I'm interested both in exact solutions over finite fields, and in exact and bounded-error solutions over the reals or complex numbers using floating-point representation. I suppose exact or bounded-error solutions over the rationals are also interesting.
I'm particularly interested in parallelizable solutions.
A is not fixed, i.e. you don't just get different b's for the same A.
The two main algorithms that I have used and parallelised are the Wiedemann algorithm and the Lanczos algorithm (and their block variants for GF(2) computations), both of which are better than structured Gaussian elimination.
The LaMacchia-Odlyzko paper (the one on the Lanczos algorithm) will tell you what you need to know. The algorithms involve repeatedly multiplying your sparse matrix by a sequence of vectors. To do this efficiently, you need to use the right data structure (a linked list of non-zeros) to make the matrix-vector multiply take time proportional to the number of non-zero values in the matrix (i.e. the sparsity).
Parallelisation of these algorithms is trivial, but optimisation will depend upon the architecture of your system. The matrix-vector multiply is parallelised by splitting the matrix into blocks of rows (each processor gets one block); each block of rows is multiplied by the vector separately, and the results are then combined to form the new vector.
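A sketch of that row-block split in Python (scipy's CSR format stands in for the linked-list structure; each slice could be handed to its own worker):

    import numpy as np
    from scipy.sparse import random as sprandom

    A = sprandom(1000, 1000, density=0.01, format="csr")
    x = np.random.rand(1000)

    n_blocks = 4
    bounds = np.linspace(0, A.shape[0], n_blocks + 1, dtype=int)
    # each block of rows multiplies by the vector independently...
    parts = [A[lo:hi] @ x for lo, hi in zip(bounds[:-1], bounds[1:])]
    # ...then the partial results are combined into the new vector
    y = np.concatenate(parts)
    assert np.allclose(y, A @ x)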
I've done these types of computations extensively. The team that broke the RSA-129 factorisation took 6 weeks using structured Gaussian elimination on a 16,384-processor MasPar. On the same machine, I worked with Arjen Lenstra (one of the authors) to solve the matrix in 4 days with block Wiedemann and 1 day with block Lanczos. Unfortunately, I never published the result!
Assume I have a very fast subroutine for fixed-size unitary matrix multiplication (the subroutine may involve hardware acceleration). Say, a function called quantum_unmm_256(A, U, m) right-multiplies an m by 256 matrix A by a 256 by 256 unitary matrix U.
Now I want to multiply something by a unitary matrix whose size is a multiple of 256, say a 1280x1280 unitary matrix. What would be a fast algorithm that makes the best use of the fast subroutine?
All matrices are assumed dense, with 64- or 128-bit complex float type.
Have a look at blocked (parallel) matrix multiplication algorithms. You can always divide the matrices into blocks and multiply them piece by piece, and you can even reduce the number of operations needed. For example, as the block matrix multiplication description on Wikipedia lays out, if A and U are partitioned into 256x256 blocks A_ik and U_kj, the product has blocks C_ij = sum_k A_ik * U_kj.
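A sketch of that blocking built around the poster's routine (quantum_unmm_256 is emulated here with a plain matmul; assume the real kernel has the same signature):

    import numpy as np

    BS = 256

    def quantum_unmm_256(A_block, U_block, m):
        return A_block @ U_block            # stand-in for the accelerated kernel

    def blocked_right_multiply(A, U):
        m, n = A.shape                      # n and U's dimensions are multiples of 256
        nb = n // BS
        C = np.zeros_like(A)
        for j in range(nb):                 # block column of the result
            for k in range(nb):             # accumulate C[:, j] = sum_k A[:, k] * U[k, j]
                C[:, j*BS:(j+1)*BS] += quantum_unmm_256(
                    A[:, k*BS:(k+1)*BS], U[k*BS:(k+1)*BS, j*BS:(j+1)*BS], m)
        return C

    A = np.random.rand(1280, 1280) + 1j * np.random.rand(1280, 1280)
    U = np.random.rand(1280, 1280) + 1j * np.random.rand(1280, 1280)
    assert np.allclose(blocked_right_multiply(A, U), A @ U)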
This isn't a full answer, but too long for a comment:
It might be easier to work with the (1280, 1280) array if it were reshaped to (5, 256, 5, 256) and then transposed to (5, 5, 256, 256), since 1280 = 5 * 256. Even then it could require a copy() to ensure that the innermost (256, 256) blocks are contiguous (in numpy terms).
It could even be cast as a (5, 5) object-dtype array, where each element is one 256x256 operand for your 'fast' unitary routine.
I could elaborate on those actions if needed, but I suspect you have enough numpy skills to do it.
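For concreteness, the reshape/transpose described above, with the sizes from the question:

    import numpy as np

    A = np.random.rand(1280, 1280)
    blocks = A.reshape(5, 256, 5, 256).transpose(0, 2, 1, 3).copy()   # (5, 5, 256, 256)
    # blocks[i, j] is the 256x256 block at block-row i, block-column j
    assert np.array_equal(blocks[1, 2], A[256:512, 512:768])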
There are a lot of things that are unclear about this question.
Why both MATLAB and numpy tags?
How is this block acceleration coded? If it's fast it must be compiled code; if so, what's the proposed link to the interpreted code?
What are the constraints on the data structure? I suspect it must be some sort of contiguous block(s) of data; that's why I suggest the reshape and transpose.
I have an MxM matrix S whose entries are zero on the diagonal and non-zero everywhere else. I need to build a larger block matrix: the blocks will be of size NxN, and there will be MxM of them.
The (i,j)th block will be S(i,j)*I, where I = eye(N) is the NxN identity. This matrix will certainly be sparse: S has M^2 - M nonzero entries, so my block matrix will have N*(M^2 - M) nonzeros out of (N*M)^2 entries, a fraction of roughly 1/N. But I'll be adding it to another NMxNM matrix that I do not expect to be sparse.
Since I will be adding my block matrix to a full matrix, would there be any speed gain from writing my code in a 'sparse' way? I keep going back and forth, but my thinking is settling on this: even if my code to turn S into a sparse block matrix isn't very efficient, when I add a full and a sparse matrix together, wouldn't MATLAB know that it only needs to iterate over the nonzero elements?
I've been trained that for loops are slow in MATLAB and that things like repmat and padding with zeros are faster, but my guess is that the fastest thing would be not to build the block matrix at all, and instead write code that adds the entries of the small matrix S to my other (large, full) matrix in a sparse way. If I learned how to build the block matrix with sparse code (faster than building it full and passing it to sparse), then that code should be able to do the addition for me in a sparse way without even building the block matrix, right?
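Note that the block matrix described here is exactly a Kronecker product, kron(S, eye(N)); in MATLAB the sparse one-liner would be kron(sparse(S), speye(N)). A sketch of the structure in Python/scipy, with hypothetical small sizes:

    import numpy as np
    from scipy import sparse

    M, N = 4, 3
    S = np.random.rand(M, M)
    np.fill_diagonal(S, 0.0)                   # zero diagonal, as in the question
    big = sparse.kron(S, sparse.identity(N))   # (N*M) x (N*M) block matrix
    assert big.nnz == N * (M*M - M)            # N*(M^2 - M) nonzeros, as counted above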
If you can keep a full NMxNM matrix in memory, don't bother with sparse operations. In fact, in most cases A+B, with A full and B sparse, will take longer than A+B where both A and B are full.
From your description, using sparse is likely slower for your problem:
If you're adding a sparse matrix A to a full matrix B, the result is full and there's almost certainly no advantage to having A sparse.
For example:
    n = 12000;
    A = rand(n, n);
    B1 = rand(n, n);
    B2 = spalloc(n, n, n*n);
B2 is as sparse as possible, that is, it's all zeros!
On my machine, A + B1 takes about 0.23 seconds while A + B2 takes about 0.7 seconds.
Basically, operations on full matrices use BLAS/LAPACK library calls that are insanely optimized. The overhead associated with sparse storage makes things worse unless you're in the special cases where sparsity is super useful.
When is sparse super useful?
Sparse is super useful when the size of matrices suggest that some algorithm should be very slow, but because of sparsity (+ perhaps special matrix structure), the actual number of computations required is orders of magnitude less.
EXAMPLE: Solving the linear system A*x = b where A is a block diagonal matrix:
    As = sparse(rand(5, 5));
    for i = 1:999
        As = blkdiag(As, sparse(rand(5, 5)));  % build a 5000x5000 sparse block-diagonal matrix of 5x5 blocks
    end
    Af = full(As);
    b = rand(5000, 1);
On my machine, solving the linear system on the full matrix (i.e. Af \ b) takes about 2.3 seconds, while As \ b takes 0.0012 seconds.
Sparse can be awesome, but it's only helpful for large problems where you can cleverly exploit structure.