Fast algorithm to calculate Pi in parallel - algorithm

I am starting to learn CUDA and I think calculating long digits of pi would be a nice, introductory project.
I have already implemented the simple Monte Carlo method which is easily parallelize-able. I simply have each thread randomly generate points on the unit square, figure out how many lie within the unit circle, and tally up the results using a reduction operation.
But that is certainly not the fastest algorithm for calculating the constant. Before, when I did this exercise on a single threaded CPU, I used Machin-like formulae to do the calculation for far faster convergence. For those interested, this involves expressing pi as the sum of arctangents and using Taylor series to evaluate the expression.
An example of such a formula:
Unfortunately, I found that parallelizing this technique to thousands of GPU threads is not easy. The problem is that the majority of the operations are simply doing high precision math as opposed to doing floating point operations on long vectors of data.
So I'm wondering, what is the most efficient way to calculate arbitrarily long digits of pi on a GPU?

You should use the Bailey–Borwein–Plouffe formula
Why? First of all, you need an algorithm that can be broken down. So, the first thing that came to my mind is having a representation of pi as an infinite sum. Then, each processor just computes one term, and you sum them all in the end.
Then, it is preferable that each processor manipulates small-precision values, as opposed to very high precision ones. For example, if you want one billion decimals, and you use some of the expressions used here, like the Chudnovsky algorithm, each of your processor will need to manipulate a billion long number. That's simply not the appropriate method for a GPU.
So, all in all, the BBP formula will allow you to compute the digits of pi separately (the algorithm is very cool), and with "low precision" processors! Read the "BBP digit-extraction algorithm for π"
Advantages of the BBP algorithm for computing π
This algorithm computes π without requiring custom data types having thousands or even millions of digits. The method calculates the nth digit without calculating the first n − 1 digits, and can use small, efficient data types.
The algorithm is the fastest way to compute the nth digit (or a few digits in a neighborhood of the nth), but π-computing algorithms using large data types remain faster when the goal is to compute all the digits from 1 to n.

Related

Why is exponentiation not atomic?

In calculating the efficiency of algorithms, I have read that the exponentiation operation is not considered to be an atomic operation (like multiplication).
Is it because exponentiation is the same as the multiplication operation repeated several times over?
In principle, you can pick any set of "core" operations on numbers that you consider to take a single time unit to evaluate. However, there are a couple of reasons, though, why we typically don't count exponentiation as one of them.
Perhaps the biggest has to do with how large of an output you produce. Suppose you have two numbers x and y that are each d digits long. Then their sum x + y has (at most) d + 1 digits - barely bigger than what we started with. Their product xy has at most 2d digits - larger than what we started with, but not by a huge amount. On the other hand, the value xy has roughly yd digits, which can be significantly bigger than what we started with. (A good example of this: think about computing 100100, which has about 200 digits!) This means that simply writing down the result of the exponentiation would require a decent amount of time to complete.
This isn't to say that you couldn't consider exponentiation to be a constant-time operation. Rather, I've just never seen it done.
(Fun fact: some theory papers don't consider multiplication to be a constant-time operation, since the complexity of a hardware circuit to multiply two b-bit numbers grows quadratically with the size of b. And some theory papers don't consider addition to be constant-time either, especially when working with variable-length numbers! It's all about context. If you're dealing with "smallish" numbers that fit into machine words, then we can easily count addition and multiplication as taking constant time. If you have huge numbers - say, large primes for RSA encryption - then the size of the numbers starts to impact the algorithm's runtime and implementation.)
This is a matter of definition. For example in hardware-design and biginteger-processing multiplication is not considered an atomic operation (see e.g. this analysis of the karatsuba-algorithm).
On the level that is relevant for general purpose software-design on the other hand, multiplication can be considered as a fairly fast operation on fixed-digit numbers implemented in hardware. Exponentiation on the other hand is rarely implemented in hardware and an upper bound for the complexity can only be given in terms of the exponent, rather than the number of digits.

How does one actually implement polynomial multiplications using FFT?

I have been studying this topic in an Algorithms textbook.
The clever usage of the complex roots of unity seems to be mathematically working. However, I do not understand how one could actually represent this in a computer.
I can think of two things:
Use the real/imaginary decomposition to represent the complex numbers. But this means using floats, which means I open up my algorithm to numerical error and I would loose out precision even if I want to multiply two polynomials with integer coefficients.
Represent exp(i 2pi/n) as n. So, I'd eventually get a tuple in omega, and if I have to keep it in this form, I'd essentially be doing polynomial multiplication in omega again, taking us back to square one.
I'd really like to see an implementation of this algorithm in a familiar programming language.
Indeed as you identify, the roots of unity are typically not nice numbers that can be stored well in a computer. Since the numerical error is small, if you know the output should be integers, rounding usually produces the right result.
If you don't want to (or cannot) rely on that, an exact option is the Number Theoretic Transform. It substitutes the roots of unity in the complex plane with roots of unity in a finite field ℤ/pℤ where p is a suitable prime. p has to be large enough for all the necessary roots to exist, and the efficiency is affected by properties of p. If you choose a Fermat prime then the roots of unity have convenient forms and there is a trick to do reduction modulo p more efficiently than usual. That is all exact integer arithmetic and the values stay small, so there is no problem implementing it in a computer.
That technique is used in the Schönhage–Strassen algorithm so you can look up the specifics there.

Which algorithm to choose for a huge integer multiplication, depending on N size

In my free time I'm preparing for interview questions like: implement multiplying numbers represented as arrays of digits. Obviously I'm forced to write it from the scratch in a language like Python or Java, so an answer like "use GMP" is not acceptable (as mentioned here: Understanding Schönhage-Strassen algorithm (huge integer multiplication)).
For which exactly range of sizes of those 2 numbers (i.e. number of digits), I should choose
School grade algorithm
Karatsuba algorithm
Toom-Cook
Schönhage–Strassen algorithm ?
Is Schönhage–Strassen O(n log n log log n) always a good solution? Wikipedia mentions that Schönhage–Strassen is advisable for numbers beyond 2^2^15 to 2^2^17. What to do when one number is ridiculously huge (e.g. 10,000 to 40,000 decimal digits), but second consists of just couple of digits?
Does all those 4 algorithms parallelizes easily?
You can browse the GNU Multiple Precision Arithmetic Library's source and see their thresholds for switching between algorithms.
More pragmatically, you should just profile your implementation of the algorithms. GMP puts a lot of effort into optimizing, so their algorithms will have different constant factors than yours. The difference could easily move the thresholds around by an order of magnitude. Find out where the times cross as input size increases for your code, and set the thresholds correspondingly.
I think all of the algorithms are amenable to parallelization, since they're mostly made up up of divide and conquer passes. But keep in mind that parallelizing is another thing that will move the thresholds around quite a lot.

Why is division more expensive than multiplication?

I am not really trying to optimize anything, but I remember hearing this from programmers all the time, that I took it as a truth. After all they are supposed to know this stuff.
But I wonder why is division actually slower than multiplication? Isn't division just a glorified subtraction, and multiplication is a glorified addition? So mathematically I don't see why going one way or the other has computationally very different costs.
Can anyone please clarify the reason/cause of this so I know, instead of what I heard from other programmer's that I asked before which is: "because".
CPU's ALU (Arithmetic-Logic Unit) executes algorithms, though they are implemented in hardware. Classic multiplications algorithms includes Wallace tree and Dadda tree. More information is available here. More sophisticated techniques are available in newer processors. Generally, processors strive to parallelize bit-pairs operations in order the minimize the clock cycles required. Multiplication algorithms can be parallelized quite effectively (though more transistors are required).
Division algorithms can't be parallelized as efficiently. The most efficient division algorithms are quite complex (The Pentium FDIV bug demonstrates the level of complexity). Generally, they requires more clock cycles per bit. If you're after more technical details, here is a nice explanation from Intel. Intel actually patented their division algorithm.
But I wonder why is division actually slower than multiplication? Isn't division just a glorified subtraction, and multiplication is a glorified addition?
The big difference is that in a long multiplication you just need to add up a bunch of numbers after shifting and masking. In a long division you have to test for overflow after each subtraction.
Lets consider a long multiplication of two n bit binary numbers.
shift (no time)
mask (constant time)
add (neively looks like time proportional to n²)
But if we look closer it turns out we can optimise the addition by using two tricks (there are further optimisations but these are the most important).
We can add the numbers in groups rather than sequentially.
Until the final step we can add three numbers to produce two rather than adding two to produce one. While adding two numbers to produce one takes time proportional to n, adding three numbers to produce two can be done in constant time because we can eliminate the carry chain.
So now our algorithm looks like
shift (no time)
mask (constant time)
add numbers in groups of three to produce two until there are only two left (time proportional to log(n))
perform the final addition (time proportional to n)
In other words we can build a multiplier for two n bit numbers in time roughly proportional to n (and space roughly proportional to n²). As long as the CPU designer is willing to dedicate the logic multiplication can be almost as fast as addition.
In long division we need to know whether each subtraction overflowed before we can decide what inputs to use for the next one. So we can't apply the same parallising tricks as we can with long multiplication.
There are methods of division that are faster than basic long division but still they are slower than multiplication.

What algorithms have high time complexity, to help "burn" more CPU cycles?

I am trying to write a demo for an embedded processor, which is a multicore architecture and is very fast in floating point calculations. The problem is that the current hardware I have is the processor connected through an evaluation board where the DRAM to chip rate is somewhat limited, and the board to PC rate is very slow and inefficient.
Thus, when demonstrating big matrix multiplication, I can do, say, 128x128 matrices in a couple of milliseconds, but the I/O takes (lots of) seconds kills the demo.
So, I am looking for some kind of a calculation with higher complexity than n^3, the more the better (but preferably easy to program and to explain/understand) to make the computation part more dominant in the time budget, where the dataset is preferably bound to about 16KB per thread (core).
Any suggestion?
PS: I think it is very similar to this question in its essence.
You could generate large (256-bit) numbers and factor them; that's commonly used in "stress-test" tools. If you specifically want to exercise floating point computation, you can build a basic n-body simulator with a Runge-Kutta integrator and run that.
What you can do is
Declare a std::vector of int
populate it with N-1 to 0
Now keep using std::next_permutation repeatedly until they are sorted again i..e..next_permutation returns false.
With N integers this will need O(N !) calculations and also deterministic
PageRank may be a good fit. Articulated as a linear algebra problem, one repeatedly squares a certain floating-point matrix of controllable size until convergence. In the graphical metaphor, one "ripples" change coming into each node onto the other edges. Both treatments can be made parallel.
You could do a least trimmed squares fit. One use of this is to identify outliers in a data set. For example you could generate samples from some smooth function (a polynomial say) and add (large) noise to some of the samples, and then the problem is to find a subset H of the samples of a given size that minimises the sum of the squares of the residuals (for the polynomial fitted to the samples in H). Since there are a large number of such subsets, you have a lot of fits to do! There are approximate algorithms for this, for example here.
Well one way to go would be to implement brute-force solver for the Traveling Salesman problem in some M-space (with M > 1).
The brute-force solution is to just try every possible permutation and then calculate the total distance for each permutation, without any optimizations (including no dynamic programming tricks like memoization).
For N points, there are (N!) permutations (with a redundancy factor of at least (N-1), but remember, no optimizations). Each pair of points requires (M) subtractions, (M) multiplications and one square root operation to determine their pythagorean distance apart. Each permutation has (N-1) pairs of points to calculate and add to the total distance.
So order of computation is O(M((N+1)!)), whereas storage space is only O(N).
Also, this should not be either too hard, nor too intensive to parallelize across the cores, though it does take some overhead. (I can demonstrate, if needed).
Another idea might be to compute a fractal map. Basically, choose a grid of whatever dimensionality you want. Then, for each grid point, do the fractal iteration to get the value. Some points might require only a few iterations; I believe some will iterate forever (chaos; of course, this can't really happen when you have a finite number of floating-point numbers, but still). The ones that don't stop you'll have to "cut off" after a certain number of iterations... just make this preposterously high, and you should be able to demonstrate a high-quality fractal map.
Another benefit of this is that grid cells are processed completely independently, so you will never need to do communication (not even at boundaries, as in stencil computations, and definitely not O(pairwise) as in direct N-body simulations). You can usefully use O(gridcells) number of processors to parallelize this, although in practice you can probably get better utilization by using gridcells/factor processors and dynamically scheduling grid points to processors on an as-ready basis. The computation is basically all floating-point math.
Mandelbrot/Julia and Lyupanov come to mind as potential candidates, but any should do.

Resources