CUDA implementation for arbitrary precision arithmetics - matrix

I have to multiply two very large (~ 2000 X 2000) dense matrices whose entries are floats with arbitrary precision (I am using GMP and the precision is currently set to 600). Is there any CUDA library that supports arbitrary precision arithmetic? The only library that I have found is called CAMPARY; however, it seems to be missing references to some of the functions it uses.
The other solution that I was thinking about was implementing a version of the Karatsuba algorithm for multiplying matrices with arbitrary precision entries. The end step of the algorithm would just be multiplying matrices of doubles, which could be done very efficiently using cuBLAS. Is there any similar implementation already out there?

Since nobody has suggested such a library so far, let's assume that one doesn't exist.
You could always write the naive implementation:
One grid thread for each pair of coordinates in the output matrix.
Each thread performs an inner product of a row and a column in the input matrices.
Individual element operations will use the code taken from the GMP (hopefully not much more than copy-and-paste).
But you can also do better than this - just like you can do better for regular-float matrix multiplication. Here's my idea (likely not the best of course):
Consider the worked example of matrix multiplication using shared memory in the CUDA C Programming Guide. It suggests putting small submatrices in shared memory. You can still do this - but you need to be careful with shared memory sizes (they're small...):
A typical GPU today has 64 KB shared memory usable per grid block (or more)
The guide's example uses 16 x 16 submatrices.
Times 2 (for the two multiplicands)
Times ceil(801/8) bytes per element (assuming the GMP representation uses 600 bits for the mantissa, one bit for the sign and 200 bits for the exponent)
So 512 * 101 = 51,712 bytes < 64 KB!
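The budget above can be checked at compile time. This is only a sketch of the arithmetic: the 801-bit layout is the assumption stated in the list, not GMP's actual in-memory format, and the helper names are made up for illustration.

```cpp
#include <cstddef>

// Bytes needed for one number under the assumed 801-bit layout
// (600 mantissa + 1 sign + 200 exponent bits). Hypothetical helper.
constexpr std::size_t bytes_per_element(std::size_t bits) {
    return (bits + 7) / 8;  // ceil(bits / 8)
}

// Shared memory for two tile x tile submatrices, one per multiplicand.
constexpr std::size_t tile_bytes(std::size_t tile, std::size_t bits) {
    return 2 * tile * tile * bytes_per_element(bits);
}

static_assert(bytes_per_element(801) == 101, "101 bytes per element");
static_assert(tile_bytes(16, 801) == 51712, "512 * 101 bytes");
static_assert(tile_bytes(16, 801) < 64 * 1024, "fits in 64 KB");
```

If your real element size turns out larger, the same function tells you how far the tile has to shrink.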
That means you can probably just use the code in their worked example as-is, again replacing the float multiplication and addition with code from GMP.
You may then want to consider something like parallelizing the GMP code itself, i.e. using multiple threads to work together on single pairs of 600-bit-precision numbers. That would likely help your shared memory reading pattern. Alternatively, you could interleave the placement of 4-byte sequences from the representation of your elements, in shared memory, for the same effect.
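To make the interleaving idea concrete, here is a sketch of the index mapping, written as plain C++ so it can be checked on the host. The function names and the choice of 4-byte limbs are assumptions for illustration; nothing here comes from GMP or the CUDA guide.

```cpp
#include <cstddef>

// Contiguous layout: all limbs of element 0, then all limbs of element 1, ...
constexpr std::size_t contiguous_index(std::size_t elem, std::size_t limb,
                                       std::size_t limbs_per_elem) {
    return elem * limbs_per_elem + limb;
}

// Interleaved layout: limb 0 of every element, then limb 1 of every element, ...
// Threads working on neighbouring elements then touch neighbouring 4-byte words.
constexpr std::size_t interleaved_index(std::size_t elem, std::size_t limb,
                                        std::size_t num_elems) {
    return limb * num_elems + elem;
}

// With 256 elements (a 16 x 16 tile), elements 0 and 1 are one word apart.
static_assert(interleaved_index(1, 0, 256) - interleaved_index(0, 0, 256) == 1, "");
// In the contiguous layout they are a whole element apart.
static_assert(contiguous_index(1, 0, 26) - contiguous_index(0, 0, 26) == 26, "");
```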
I realize this is a bit hand-wavy, but I'm pretty certain I've waved my hands correctly and it would be a "simple matter of coding".


Say you have Matrix<int, 4, 2> and Matrix<int, 3, 2> which you want to multiply in the natural way that consumes the -1 dimension without first transposing.
Is this possible, or do we have to transpose first? That would be silly (inefficient) from a cache perspective, because then the elements we are multiplying and summing aren't contiguous.
Here's a playground.
Pytorch provides torch.inner and torch.tensordot which do this.
Just like in Numpy, transpose() just creates a "view". It doesn't do any expensive memory operations (unless you assign it to a new matrix). Just call a * b.transpose() and let Eigen handle the details of the memory access. A properly optimized BLAS library like Eigen handles the transposition on smaller tiles in temporary memory for optimal performance.
Memory order still matters for fine tuning though. If you can, write your matrix multiplications in the form a.transpose() * b for column-major matrices (like Eigen, Matlab), or a * b.transpose() for row-major matrices like those in Numpy. That saves the BLAS library the trouble of doing that transposition.
Side note: You used auto for your result. Please read the Common Pitfalls chapter in the documentation. Your code didn't compute a matrix multiplication, it stored an expression of one.
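For reference, this is what a * b.transpose() computes for the shapes in the question, written as plain loops over row-major std::array storage. It is a standard-library sketch, not Eigen code, and multiply_by_transpose is a name invented here. Note that the inner loop walks both operands contiguously, which is why the row-major a * b.transpose() form is the friendly one.

```cpp
#include <array>
#include <cstddef>

// c[i][j] = sum over k of a[i][k] * b[j][k]
// i.e. the contraction over the last dimension of both inputs.
template <typename T, std::size_t M, std::size_t N, std::size_t K>
std::array<std::array<T, N>, M> multiply_by_transpose(
    const std::array<std::array<T, K>, M>& a,
    const std::array<std::array<T, K>, N>& b) {
    std::array<std::array<T, N>, M> c{};  // zero-initialized
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t j = 0; j < N; ++j)
            for (std::size_t k = 0; k < K; ++k)
                c[i][j] += a[i][k] * b[j][k];  // both rows read contiguously
    return c;
}
```

Called with a 4 x 2 and a 3 x 2 input, it returns a 4 x 3 result without ever materializing the transpose.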

I'm reading the documentation to the math/big package here:
I am trying to understand how large a number is too big for math.big, and this looked like a constant I could interrogate.
I see on my machine:
How does this relate to the largest integer possible on my machine, for the purpose of calculation? What are the units of this number? Is this bytes, or decimal places or something other than the number itself?
bignum libraries usually store big numbers as a sequence of digits (e.g. in base 2^64). Their limitation is related to the memory available. So the largest number you could represent is tied to the limitation of your virtual address space. You can safely assume that a number even as large as 10^10000 is representable in bignum. Of course, a googolplex is not representable as a bignum (because it has more bits than the number of particles in the universe).
Another limitation is the complexity of arithmetic operations. But there exist very efficient bignum algorithms.
FWIW, GMPlib (a C library for bignums) can deal with numbers as long as there is memory for them. However, it is rumored that when malloc fails, GMPlib aborts.
I don't know what happens inside Go bignums when a number is too big to be representable (and that limit varies from one machine to the next and could be different from one run to the next). For example, Go's Int.Mul gives a product whose size is the sum of the size of the arguments, and the "out of memory" error is undocumented (but obviously can happen).
When using bignums, prefer iterative algorithms to recursive ones. For example, a naive recursive factorial might overflow the call stack with large enough bignums, so you want to code it iteratively.
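As an illustration of the iterative style, here is a small sketch with a toy bignum (little-endian limbs in base 10^9). It is C++ rather than Go, and the representation and function names are made up for the example; they are not how math/big or GMP work internally.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

using Limbs = std::vector<std::uint32_t>;  // little-endian, base 1e9 (toy format)
constexpr std::uint64_t kBase = 1000000000ULL;

// Multiply n in place by a small factor, propagating the carry limb by limb.
void mul_small(Limbs& n, std::uint32_t factor) {
    std::uint64_t carry = 0;
    for (auto& limb : n) {
        std::uint64_t cur = std::uint64_t{limb} * factor + carry;
        limb = static_cast<std::uint32_t>(cur % kBase);
        carry = cur / kBase;
    }
    while (carry != 0) {  // the number grew: append new high limbs
        n.push_back(static_cast<std::uint32_t>(carry % kBase));
        carry /= kBase;
    }
}

// Iterative factorial: a plain loop, so the call stack stays flat.
Limbs factorial(std::uint32_t n) {
    Limbs result{1};
    for (std::uint32_t k = 2; k <= n; ++k) mul_small(result, k);
    return result;
}

// Decimal string: highest limb unpadded, the rest zero-padded to 9 digits.
std::string to_decimal(const Limbs& n) {
    std::string s = std::to_string(n.back());
    char buf[16];
    for (std::size_t i = n.size() - 1; i-- > 0;) {
        std::snprintf(buf, sizeof buf, "%09u", static_cast<unsigned>(n[i]));
        s += buf;
    }
    return s;
}
```

The only thing that grows with n is the vector holding the result, which is the memory limit discussed above.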

I'm a J newbie, and am trying to import one of my large datasets for further experimentation. It is a 2D matrix of doubles, approximately 80000x50000. So far, I have found two different methods to load data into J.
The first is to convert the data into J format (replacing negatives with underscores, putting exponential notation numbers into J format, etc) and then load with (adapted from J: Handy method to enter a matrix?):
(".;._2) fread 'path/to/file'
The second method is to use tables/dsv.
I am experiencing the same problem with both methods: namely, that these methods work with small matrices, but fail at approximately 10M values. It seems the input just gets truncated to some arbitrary limit. How can I load matrices of arbitrary size? If I have to convert to some binary format, that's OK, as long as there is a description of the format somewhere.
I should add that this is a 64-bit system and build of J, and I can successfully create a matrix of random numbers of the appropriate size, so it doesn't seem to be a limitation on matrix size per se but only during I/O.
EDIT: I did not find what exactly was causing this, but thanks to Dane, I did find a workaround by using JMF ( 'data/jmf' package). It turns out that JMF is just straight binary data with no header and native (?) or little-endian data can be mapped directly with JFL map_jmf_ 'x';'whatever.bin'
You're running out of memory. A quick test to see how much space integers take up yields the following:
7!:2 'i. 80000 5000'
That is, an 80,000 by 5,000 matrix of integers requires 8 GB of memory. Your 80,000 by 50,000 matrix, if it were of integers, would require approximately 80 GB of memory.
Your next question should be about performing array or matrix operations on a matrix too big to load into memory.

I'm still working on routines for arbitrary long integers in C++. So far, I have implemented addition/subtraction and multiplication for 64-bit Intel CPUs.
Everything works fine, but I wondered if I can speed it a bit by using SSE. I browsed through the SSE docs and processor instruction lists, but I could not find anything I think I can use and here is why:
SSE has some integer instructions, but most instructions handle floating point. It doesn't look like it was designed for use with integers (e.g. is there an integer compare for less?)
The SSE idea is SIMD (same instruction, multiple data), so it provides instructions for 2 or 4 independent operations. I, on the other hand, would like to have something like a 128 bit integer add (128 bit input and output). This doesn't seem to exist. (Yet? In AVX2 maybe?)
The integer additions and subtractions handle neither input nor output carries. So it's very cumbersome (and thus, slow) to do it by hand.
My question is: is my assessment correct or is there anything I have overlooked? Can long integer routines benefit from SSE? In particular, can they help me to write a quicker add, sub or mul routine?
In the past, the answer to this question was a solid, "no". But as of 2017, the situation is changing.
But before I continue, time for some background terminology:
Full Word Arithmetic
Partial Word Arithmetic
Full-Word Arithmetic:
This is the standard representation where the number is stored in base 2^32 or 2^64 using an array of 32-bit or 64-bit integers.
Many bignum libraries and applications (including GMP) use this representation.
In full-word representation, every integer has a unique representation. Operations like comparisons are easy. But operations like addition are more difficult because of the need for carry-propagation.
It is this carry-propagation that makes bignum arithmetic almost impossible to vectorize.
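To see where the dependency comes from, here is a minimal portable sketch of full-word addition in base 2^64. The function name is hypothetical, and a real implementation would use adc (or a compiler intrinsic) instead of the comparisons.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Adds two equal-length little-endian arrays of 64-bit limbs.
// Writes the sum into out and returns the final carry (0 or 1).
std::uint64_t add_full_word(const std::vector<std::uint64_t>& a,
                            const std::vector<std::uint64_t>& b,
                            std::vector<std::uint64_t>& out) {
    assert(a.size() == b.size());
    out.resize(a.size());
    std::uint64_t carry = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        std::uint64_t sum = a[i] + b[i];         // may wrap around
        std::uint64_t c1 = sum < a[i];           // 1 if it wrapped
        std::uint64_t with_carry = sum + carry;  // add the incoming carry
        std::uint64_t c2 = with_carry < sum;     // 1 if that wrapped
        out[i] = with_carry;
        carry = c1 | c2;  // at most one of the two can be set
    }
    return carry;
}
```

Limb i cannot be finished until the carry out of limb i-1 is known, so the loop is a serial chain rather than a set of independent lanes.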
Partial-Word Arithmetic
This is a lesser-used representation where the number uses a base less than the hardware word-size. For example, putting only 60 bits in each 64-bit word. Or using base 1,000,000,000 with a 32-bit word-size for decimal arithmetic.
The authors of GMP call this, "nails" where the "nail" is the unused portion of the word.
In the past, use of partial-word arithmetic was mostly restricted to applications working in non-binary bases. But nowadays, it's becoming more important in that it allows carry-propagation to be delayed.
Problems with Full-Word Arithmetic:
Vectorizing full-word arithmetic has historically been a lost cause:
SSE/AVX2 has no support for carry-propagation.
SSE/AVX2 has no 128-bit add/sub.
SSE/AVX2 has no 64 x 64-bit integer multiply.*
*AVX512-DQ adds a lower-half 64x64-bit multiply. But there is still no upper-half instruction.
Furthermore, x86/x64 has plenty of specialized scalar instructions for bignums:
Add-with-Carry: adc, adcx, adox.
Double-word Multiply: Single-operand mul and mulx.
In light of this, both bignum-add and bignum-multiply are difficult for SIMD to beat scalar on x64. Definitely not with SSE or AVX.
With AVX2, SIMD is almost competitive with scalar bignum-multiply if you rearrange the data to enable "vertical vectorization" of 4 different (and independent) multiplies of the same lengths in each of the 4 SIMD lanes.
AVX512 will tip things more in favor of SIMD again assuming vertical vectorization.
But for the most part, "horizontal vectorization" of bignums is largely still a lost cause unless you have many of them (of the same size) and can afford the cost of transposing them to make them "vertical".
Vectorization of Partial-Word Arithmetic
With partial-word arithmetic, the extra "nail" bits enable you to delay carry-propagation.
So as long as you don't overflow the word, SIMD add/sub can be done directly. In many implementations, partial-word representation uses signed integers to allow words to go negative.
Because there is (usually) no need to perform carryout, SIMD add/sub on partial words can be done equally efficiently on both vertically and horizontally-vectorized bignums.
Carryout on horizontally-vectorized bignums is still cheap, as you merely shift the nails over to the next lane. A full carryout to completely clear the nail bits and get to a unique representation usually isn't necessary unless you need to do a comparison of two numbers that are almost the same.
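Here is a scalar sketch of the delayed-carry idea, using 60 bits per 64-bit word as in the example above. For simplicity it uses unsigned limbs (not the signed variant just mentioned), and the names are invented for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int kLimbBits = 60;  // 4 nail bits left over in each 64-bit word
constexpr std::uint64_t kLimbMask = (std::uint64_t{1} << kLimbBits) - 1;

// Limb-wise add with no carry handling. Every iteration is independent,
// so this loop is the part that maps directly onto SIMD lanes.
void add_lazy(std::vector<std::uint64_t>& acc, const std::vector<std::uint64_t>& x) {
    assert(acc.size() == x.size());
    for (std::size_t i = 0; i < acc.size(); ++i) acc[i] += x[i];
}

// Full carryout: push each limb's nail bits into the next limb.
// Only needed occasionally, e.g. before the nails would overflow.
void normalize(std::vector<std::uint64_t>& v) {
    std::uint64_t carry = 0;
    for (auto& limb : v) {
        limb += carry;
        carry = limb >> kLimbBits;
        limb &= kLimbMask;
    }
    if (carry != 0) v.push_back(carry);
}
```

With 4 nail bits, up to 16 normalized numbers can be summed with add_lazy before a normalize is required.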
Multiplication is more complicated with partial-word arithmetic since you need to deal with the nail bits. But as with add/sub, it is nevertheless possible to do it efficiently on horizontally-vectorized bignums.
AVX512-IFMA (coming with Cannonlake processors) will have instructions that give the full 104 bits of a 52 x 52-bit multiply (presumably using the FPU hardware). This will play very well with partial-word representations that use 52 bits per word.
Large Multiplication using FFTs
For really large bignums, multiplication is most efficiently done using Fast-Fourier Transforms (FFTs).
FFTs are completely vectorizable since they work on independent doubles. This is possible because fundamentally, the representation that FFTs use is a partial word representation.
To summarize, vectorization of bignum arithmetic is possible. But sacrifices must be made.
If you expect SSE/AVX to be able to speed up some existing bignum code without fundamental changes to the representation and/or data layout, that's not likely to happen.
But nevertheless, bignum arithmetic is possible to vectorize.
I'm the author of y-cruncher which does plenty of large number arithmetic.

I've tried for hours to find the implementation of rand() function used in gcc...
It would be much appreciated if someone could point me to the file containing its implementation, or to a website with the implementation.
By the way, which directory (I'm using Ubuntu if that matters) contains the c standard library implementations for the gcc compiler?
rand consists of a call to a function __random, which mostly just calls another function called __random_r in random_r.c.
Note that the function names above are hyperlinks to the glibc source repository, at version 2.28.
The glibc random library supports two kinds of generator: a simple linear congruential one, and a more sophisticated linear feedback shift register one. It is possible to construct instances of either, but the default global generator, used when you call rand, uses the linear feedback shift register generator (see the definition of unsafe_state.rand_type).
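For a feel of the simpler of the two, here is a sketch of one linear congruential step. The constants are the ones I believe random_r.c uses for its smallest-state generator, so treat them as an assumption and check them against the source; lcg_next is a name made up here. Keep in mind this is not the generator that rand uses by default.

```cpp
#include <cstdint>

// One step of a 31-bit linear congruential generator.
// Constants assumed from glibc's random_r.c; verify before relying on them.
constexpr std::uint32_t lcg_next(std::uint32_t state) {
    return (state * 1103515245u + 12345u) & 0x7fffffffu;
}

static_assert(lcg_next(1) == 1103527590u, "first step from state 1");
```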
You will find C library implementation used by GCC in the GNU GLIBC project.
You can download its sources and you should find the rand() implementation there. Sources with function definitions are usually not installed on a Linux distribution. Only the header files are, and as I guess you already know, they are usually stored in the /usr/include directory.
If you are familiar with GIT source code management, you can do:
$ git clone git://
To get GLIBC source code.
The files are available via FTP. There is more to the rand() used in stdlib, which comes from glibc. In the 2.32 version (glibc-2.32.tar.gz) obtained from here, the stdlib folder contains a random.c file whose header comment (quoted below) explains that a simple linear congruential algorithm is used only when the state is smaller than 32 bytes; otherwise a linear feedback shift register approach is used. The folder also has rand.c and rand_r.c which can show you more of the source code. stdlib.h contained in the same folder will show you the values used for macros like RAND_MAX.
/* An improved random number generation package. In addition to the
standard rand()/srand() like interface, this package also has a
special state info interface. The initstate() routine is called
with a seed, an array of bytes, and a count of how many bytes are
being passed in; this array is then initialized to contain
information for random number generation with that much state
information. Good sizes for the amount of state information are
32, 64, 128, and 256 bytes. The state can be switched by calling
the setstate() function with the same array as was initialized with
initstate(). By default, the package runs with 128 bytes of state
information and generates far better random numbers than a linear
congruential generator. If the amount of state information is less
than 32 bytes, a simple linear congruential R.N.G. is used.
Internally, the state information is treated as an array of longs;
the zeroth element of the array is the type of R.N.G. being used
(small integer); the remainder of the array is the state
information for the R.N.G. Thus, 32 bytes of state information
will give 7 longs worth of state information, which will allow a
degree seven polynomial. (Note: The zeroth word of state
information also has some other information stored in it; see setstate
for details). The random number generation technique is a linear
feedback shift register approach, employing trinomials (since there
are fewer terms to sum up that way). In this approach, the least
significant bit of all the numbers in the state table will act as a
linear feedback shift register, and will have period 2^deg - 1
(where deg is the degree of the polynomial being used, assuming
that the polynomial is irreducible and primitive). The higher order
bits will have longer periods, since their values are also
influenced by pseudo-random carries out of the lower bits. The
total period of the generator is approximately deg*(2**deg - 1); thus
doubling the amount of state information has a vast influence on the
period of the generator. Note: The deg*(2**deg - 1) is an
approximation only good for large deg, when the period of the shift
register is the dominant factor. With deg equal to seven, the
period is actually much longer than the 7*(2**7 - 1) predicted by
this formula. */
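The period estimate in that comment is easy to evaluate. This sketch assumes the default 128 bytes of state corresponds to deg = 31, by the same arithmetic the comment uses for 32 bytes giving 7 words; approx_period is a hypothetical helper, not a glibc function.

```cpp
#include <cstdint>

// The comment's approximation: deg * (2**deg - 1). Valid here for deg < 58,
// so that the product still fits in 64 bits.
constexpr std::uint64_t approx_period(unsigned deg) {
    return deg * ((std::uint64_t{1} << deg) - 1);
}

static_assert(approx_period(7) == 889, "7 * (2**7 - 1)");
static_assert(approx_period(31) == 66571993057ULL, "31 * (2**31 - 1)");
```

As the comment itself warns, the real period for small deg is much longer than this formula predicts.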
