CUDA thread allocation [duplicate] - matrix

This question already has answers here:
How do I choose grid and block dimensions for CUDA kernels?
(3 answers)
Closed 4 years ago.
I am trying to implement a iterative linear solver named "Conjugate Gradient Solver" in CUDA which solves equation of form,
A*x=b,
where A is sparse symmetric positive definite matrix of size nXn,
x is unknown vector of size n with initial guess as 0 and
b is a vector of size n on right hand side of the equation.
There are many operations included in my code like Sparse Matrix-vector multiplication,vector-vector operations.
My code works fine with matrix size upto 31 X 31,but not more than 31 X 31. It may be because of the number of threads allocated to a kernel function. I am allocating threads as
mul<<<1,nrows>>>()
Here mul is a function used to perform Sparse matrix-vector multiplication and nrows is the number of rows in a sparse matrix,A.
Is this problem related to 1 wrap size=32 threads ?
If anyone knows,please suggest me.
Thank you..!!

Try to run the "devicequery" program from the NVIDIA CUDA Samples to get warp size present in your installation. If its shows warp size=32 then your problem may be related to it else specific code snippet is mandatory to give any solution.

Related

Parallellize least squares for large (> 30k x 30k) non-square dense matrices

Let RG = A for dense unstructured matrices with shapes (e.g. roughly) R: (30k x 40k, entries float32) and G: (40k x 50k, entries either 0.0 or 1.0, roughly equally often) and of course A: (30k x 50k, entries float32).
Given A and G, I want to find the least squares solution for R.
I can use hundreds of CPU cores, hundreds of GB of RAM and also an A40 GPU. What is the best way to use such resources to solve the problem? I'm using Julia 1.7 in the examples below but I'm open to other options!
First question: Can I somehow exploit that the entries of G are only zeros and ones?
Trying to use Julia LinearAlgebra with many CPUs
I've tried two methods: "Penrose inverse" and "right division"
using LinearAlgebra
#show BLAS.get_num_threads()
# defaults to 8. Can change using BLAS.set_num_threads(N)
# build toy problem (order of magnitude smaller sizes)
R_true = rand(Float32, 3_000, 4_000)
G = rand([0., 1.], 4_000, 5_000)
# note: using true/false here gives same results but is much slower!
A = R_true * G
# solve toy problem using matrix (right) division
R_fitted_rdiv = A / G
# solve toy problem using Penrose inverse
R_fitted_pinv = (pinv(G') * A')'
First, setting BLAS.set_num_threads(64) (or any bigger number) actually only gives me BLAS.get_num_threads() returning 32. Apparantly that's an upper limit. Second,
using 32 BLAS threads is actually slower than using 8.
(e.g. performing right division with sizes (4000, 9800) / (8500, 9800) takes less than 50 seconds on 8 threads but more than 55 seconds on 32 threads. I ran things multiple times to exclude compilation time issues.) I don't know why this is or if it's normal. How can I make use of my computing power for this problem?
I think that the matrix division is faster than the Penrose inverse method. Should this be expected? I don't know what either of the functions do exactly for these inputs. The docs say that left division (\) uses pivoted QR factorization. I couldn't find what algorithm(s) are used for pinv or right division (/) (although it's probably the same as \ since they are related by transposing the matrices). I'd rather not delve too deeply because my knowledge in numerical linear algebra is quite limited.
The issue is that for my large matrices either method takes forever. Is there a way to make use of my ~100 cores somehow?
Trying to use the GPU:
Using CUDA.jl, Matrices of size around 10k work fine and take a minute to pinv:
using CUDA
#time matrix = CUDA.rand(Float32, 10_000, 10_500) # 0.003037 seconds (5 allocations: 160 bytes)
#time pinv(matrix) # 57.417559 seconds (678 allocations: 172.094 KiB)
However, when I try to do matrices around size 20k, I get right away the error InexactError: trunc(Int32, 4811456640). I assume this is due to CUBLAS using int32 for indexing, even though I don't understand why it leads to an error in this case. (edit: it's about the size of the array in bytes fitting into 31 bits.)
Trying to use right division with CuArrays gives the error "DimensionMismatch("LU factored matrix A must be square!")". I guess I have to choose a different algorithm manually? I don't know what it's called. (Although, it probably would still crash for large matrices...?)
To summarize, it doesn't look like I can use the GPU from Julia easily to solve my problem. Should I keep trying to use the GPU for this task or stick to the many CPUs?
Yes this is really my problem, please refrain from commenting "nobody should ever need such large least squares"
Naive answer
Using pytorch, this will require at least 30GB GPU memory
import torch
A = torch.randint(0, 2, (50000, 40000), device='cuda', dtype=torch.float32).T
G = torch.randint(0, 2, (50000, 30000), device='cuda', dtype=torch.float32).T
R = torch.lstsq(G.T, A.T)
If the system can sustain the same operation throughput as my laptop you should have an answer in about 15 minutes.
I would suggest you to try a generalized version scaling up the dimensions to get a better feeling of how your system will handle it
def try_it(a,b,c):
A = torch.randint(0, 2, (a, b), device='cuda', dtype=torch.float32).T
G = torch.randint(0, 2, (a, c), device='cuda', dtype=torch.float32).T
R = torch.lstsq(G.T, A.T)
I transposed the dimensions in the generation in order to make sure G.T and A.T would be contiguous.
You can't take much advantage of the entries being integer. This type of problem is easier to solve on the reals than on the integers, because finding integer solutions would require you to search the solutions, while the real solution you can find by doing algebraic manipulations.

Parallel method to get all the eigenvalues of a large sparse matrix

Is it possible to compute all the eigenvalues of a large sparse matrix using multiple CPUs ?
If yes, then is it possible to do it without storing the full dense matrix in memory ? (using only the stored sparse matrix)
If yes, then what's a good (rapid and low memory usage) method to do it ?
Can numpy or scipy do it ?
My matrix is complex, non-hermitian, as sparse as the identity matrix and of dimension N x N where N = BinomialCoefficient(L,Floor(L/2)) where we need to take L as large as possible.
For example, with L = 20, N = 184 756 it is 99.9995% sparse, having just N non-zero elements. So, the memory usage of the sparse matrix is ~0.1GB but would be ~10TB for the dense matrix.
With L = 30, N = 155 117 520 and we use ~60GB (sparse) and ~10EB (dense). So it's impraticable to store the full dense matrix in memory.
I have access to IntelĀ® Gold 6148 Skylake # 2.4 [GHz] CPUs with up to 752 [GB] of RAM each. I could use Python, C (ScaLAPACK, OpenBLAS, MAGMA, ELPA, MUMPS, SuperLU, SuiteSparse, PETSc, Lis,...), C++ (Armadillo, Eigen, BLitz++, Trilinos,...), Matlab, R, Perl, Fortran, mpi4py, CUDA, IntelĀ® Math Kernel Library, and a few other softwares.
I build my matrix using Python (scipy.sparse, numpy and multiprocessing). I've tried using numpy.linalg.eigvals() and scipy.linalg.eigvals(), but it seems that they only use the cores of one CPU. I could look further into those, but I wont if there's a better way to solve my matrix.
For the curious ones, my matrix comes from a representation of a non-hermitian operator on a subset of states of a length L quantum spin 1/2 chain with strong interactions. I need the full spectrum because it allows me to study the level spacing distribution of the energy spectrum for a fixed set of quantum numbers.
I'm far from being a professional in computer science, so if I missed some basic concept please be clement.

Octave - out of memory or dimension too large for Octave's index type

I'm aware of the fact that there are 3 questions with a similar exception message. Unfortunately none of the questions is answered and the comments could not solve my problem.
I use Octave 4.2.1 in the 64 bit version on a Windows 10 System with 16 GB RAM in total and ~11 GB free during runtime.
When i try to multiply a 60000 x 10 matrix with a 10 x 60000 matrix Octave comes up with the following exception:
error: out of memory or dimension too large for Octave's index type
This multiplication would result in a 60000 x 60000 matrix and therefore should not be a problem for a 64 bit index.
I can't even do zeros(60000,60000);
I don't get what i am doing wrong. Could someone point me into the right direction?
As is often the case, this error is often misinterpreted (maybe we should address this as a bug on the octave tracker already ;) )
>> 60000*60000
ans = 3.6000e+09
>> intmax
ans = 2147483647
>> 60000*60000 > intmax
ans = 1
I.e. the number of elements of the resulting 60000x60000 matrix is larger than the maximum integer representation supported by the system, therefore there is no way to linearly index such a matrix using an integer index.
Also, in order to use actual 64-bit indexing, you need to compile octave in that manner, as this tends not to be the default, but unfortunately that's not as straightforward as you might wish, as you'll have to use the respective 64-bit supporting libraries as well. More on that here.
Having said that, it may well be possible to make use of sparse matrices instead if your matrices are indeed sparse in nature. If not, you're essentially using 'big data', and you need to find workarounds, such as blockprocessing / mapping large arrays to files etc. It's worth reading up on common "big data" techniques. Unfortunately octave does not seem to support matlab's memmapfile command just yet, but you can simulate this using fwrite / fread / fseek appropriately to read appropriate ranges from a file.

CUDA implementation for arbitrary precision arithmetics

I have to multiply two very large (~ 2000 X 2000) dense matrices whose entries are floats with arbitrary precision (I am using GMP and the precision is currently set to 600). I was wondering if there is any CUDA library that supports arbitrary precision arithmetics? The only library that I have found is called CAMPARY however it seems to be missing some references to some of the used functions.
The other solution that I was thinking about was implementing a version of the Karatsuba algorithm for multiplying matrices with arbitrary precision entries. The end step of the algorithm would just be multiplying matrices of doubles, which could be done very efficiently using cuBLAS. Is there any similar implementation already out there?
Since nobody has suggested such a library so far, let's assume that one doesn't exist.
You could always implement the naive implementation:
One grid thread for each pair of coordinates in the output matrix.
Each thread performs an inner product of a row and a column in the input matrices.
Individual element operations will use the code taken from the GMP (hopefully not much more than copy-and-paste).
But you can also do better than this - just like you can do better for regular-float matrix multiplication. Here's my idea (likely not the best of course):
Consider the worked example of matrix multiplication using shared memory in the CUDA C Programming Guide. It suggests putting small submatrices in shared memory. You can still do this - but you need to be careful with shared memory sizes (they're small...):
A typical GPU today has 64 KB shared memory usable per grid block (or more)
They take 16 x 16 submatrix.
Times 2 (for the two multiplicands)
Times ceil(801/8) (assuming the GMP representation uses 600 bits from the mantissa, one bit for the sign and 200 bits from the exponent)
So 512 * 101 < 64 KB !
That means you can probably just use the code in their worked example as-is, again replacing the float multiplication and addition with code from GMP.
You may then want to consider something like parallelizing the GMP code itself, i.e. using multiple threads to work together on single pairs of 600-bit-precision numbers. That would likely help your shared memory reading pattern. Alternatively, you could interleave the placement of 4-byte sequences from the representation of your elements, in shared memory, for the same effect.
I realize this is a bit hand-wavy, but I'm pretty certain I've waved my hands correctly and it would be a "simple matter of coding".

How many number of primitive operations does a 16, 32 or a 64-bit processor execute to perform logical right shift of an N-bit Binary number? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 7 years ago.
Improve this question
Recently,I have been trying to understand how the Binary Extended Euclidean Algorithm works at the processor level. This question is all about finding an Inverse element in GF(2^m) with polynomial basis.
Generally I came across the Extended Euclidean Algorithm for evaluating an inverse element but the fact is that it involves too many addition and multiplication operations. The Binary EEA algorithm requires just bit shifting operations (equivalent to division by 2--logical shift right). The algorithm is in this link, page number 8.
In step 3 and 5 of this algorithm, every iteration shifts the parameters u and b by 1 bit to the right adding zero to the MSB at the same time. The loop ends when u == 1 and returns b. My question is how many primitive operations does a processor (say a 32 bit processor for example) perform in step 3 or step 5 of every iteration?
I came across barrel shifter and I am quite confused about how fast the shifting takes place. Should I really consider these primitive operations or should I ignore them if because the shifting may be faster?
It would really help me a lot if someone would show the primitive operations for the case where the size of u is 194 bits.
In case you might be wondering about the denominator x in step 3 and 5 of the algorithm, its the polynomial representation and x means nothing but 10 in binary and parameter u is an N-bit binary number.
There is no generic answer to this question: you can use portable code that will be tedious to optimize or highly machine specific code that will be even more complicated to optimize without breaking.
If you want real performance, you have to use MMX/AVX registers on the maximum width you can get your hands on. Intel provides lightweight wrappers on low-level instructions as macros and inline functions.
Always use unsigned types for your shifting operations to avoid unnecessary steps.
Usually ther is a "right shift" assembly OP code which is able to right shift a register a given number of bits. Such an operation takes one cycle.
This assumes thet your value is already loaded to the register however.
The best answer anyway: Implement this algorithm in a low level language (C, C++) and look at the assembly code produced by the compiler.

Resources