Why is division more expensive than multiplication?

I am not really trying to optimize anything, but I have heard this from programmers so often that I took it as a given. After all, they are supposed to know this stuff.
But I wonder: why is division actually slower than multiplication? Isn't division just glorified subtraction, and multiplication glorified addition? Mathematically, I don't see why going one way or the other should have such different computational costs.
Can anyone please clarify the actual reason for this, instead of what I heard from the other programmers I asked before, which was: "because".

A CPU's ALU (Arithmetic Logic Unit) executes algorithms, though they are implemented in hardware. Classic multiplication algorithms include the Wallace tree and the Dadda tree. More information is available here. More sophisticated techniques are available in newer processors. Generally, processors strive to parallelize bit-pair operations in order to minimize the clock cycles required. Multiplication algorithms can be parallelized quite effectively (though more transistors are required).
Division algorithms can't be parallelized as efficiently. The most efficient division algorithms are quite complex (the Pentium FDIV bug demonstrates the level of complexity). Generally, they require more clock cycles per bit. If you're after more technical details, here is a nice explanation from Intel. Intel actually patented their division algorithm.

But I wonder why is division actually slower than multiplication? Isn't division just a glorified subtraction, and multiplication is a glorified addition?
The big difference is that in a long multiplication you just need to add up a bunch of numbers after shifting and masking. In a long division you have to test for overflow after each subtraction.
Let's consider long multiplication of two n-bit binary numbers.
shift (no time)
mask (constant time)
add (naively looks like time proportional to n²)
But if we look closer it turns out we can optimise the addition by using two tricks (there are further optimisations but these are the most important).
We can add the numbers in groups rather than sequentially.
Until the final step we can add three numbers to produce two rather than adding two to produce one. While adding two numbers to produce one takes time proportional to n, adding three numbers to produce two can be done in constant time because we can eliminate the carry chain.
So now our algorithm looks like
shift (no time)
mask (constant time)
add numbers in groups of three to produce two until there are only two left (time proportional to log(n))
perform the final addition (time proportional to n)
In other words, we can build a multiplier for two n-bit numbers in time roughly proportional to n (and space roughly proportional to n²). As long as the CPU designer is willing to dedicate the logic, multiplication can be almost as fast as addition.
In long division we need to know whether each subtraction overflowed before we can decide what inputs to use for the next one, so we can't apply the same parallelising tricks as we can with long multiplication.
There are methods of division that are faster than basic long division, but they are still slower than multiplication.
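To make that data dependence concrete, here is a rough Python sketch of the two schoolbook algorithms (the function names and the 32-bit width are just illustrative, and this models the data flow, not how the hardware is actually wired): in the multiply, every shifted partial product is known up front and could be summed by a tree of adders, whereas in restoring division each step needs the comparison result of the previous one.

    def long_multiply(a, b, n_bits=32):
        # Schoolbook shift-and-add. Every shifted/masked partial product is
        # known up front, so hardware can sum them with a tree of adders.
        result = 0
        for i in range(n_bits):
            if (b >> i) & 1:            # mask: does bit i of b contribute?
                result += a << i        # add the shifted partial product
        return result

    def restoring_divide(dividend, divisor, n_bits=32):
        # Restoring long division. Each trial subtraction depends on the
        # outcome of the previous one, so the n steps are strictly sequential.
        quotient, remainder = 0, 0
        for i in reversed(range(n_bits)):
            remainder = (remainder << 1) | ((dividend >> i) & 1)
            if remainder >= divisor:    # must know this before the next step
                remainder -= divisor
                quotient |= 1 << i
        return quotient, remainder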

Related

Why is exponentiation not atomic?

In calculating the efficiency of algorithms, I have read that the exponentiation operation is not considered to be an atomic operation (like multiplication).
Is it because exponentiation is the same as the multiplication operation repeated several times over?
In principle, you can pick any set of "core" operations on numbers that you consider to take a single time unit to evaluate. However, there are a couple of reasons why we typically don't count exponentiation as one of them.
Perhaps the biggest has to do with how large an output you produce. Suppose you have two numbers x and y that are each d digits long. Then their sum x + y has (at most) d + 1 digits - barely bigger than what we started with. Their product x · y has at most 2d digits - larger than what we started with, but not by a huge amount. On the other hand, the value x^y has roughly y · d digits, which can be significantly bigger than what we started with. (A good example of this: think about computing 100^100, which has about 200 digits!) This means that simply writing down the result of the exponentiation would require a decent amount of time to complete.
This isn't to say that you couldn't consider exponentiation to be a constant-time operation. Rather, I've just never seen it done.
(Fun fact: some theory papers don't consider multiplication to be a constant-time operation, since the complexity of a hardware circuit to multiply two b-bit numbers grows quadratically with the size of b. And some theory papers don't consider addition to be constant-time either, especially when working with variable-length numbers! It's all about context. If you're dealing with "smallish" numbers that fit into machine words, then we can easily count addition and multiplication as taking constant time. If you have huge numbers - say, large primes for RSA encryption - then the size of the numbers starts to impact the algorithm's runtime and implementation.)
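If you want to see the size blow-up from a couple of paragraphs above directly, here is a quick Python check of the digit counts (the particular 100-digit values are just illustrative):

    x = 9 * 10**99 + 7          # a 100-digit number
    y = 9 * 10**99 + 9          # another 100-digit number

    print(len(str(x + y)))      # 101 digits: the sum gains at most one digit
    print(len(str(x * y)))      # 200 digits: the product at most doubles the length
    print(len(str(100**100)))   # 201 digits: even this tiny exponentiation explodes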
This is a matter of definition. For example, in hardware design and big-integer processing, multiplication is not considered an atomic operation (see e.g. this analysis of the Karatsuba algorithm).
At the level relevant for general-purpose software design, on the other hand, multiplication can be considered a fairly fast operation on fixed-width numbers implemented in hardware. Exponentiation, by contrast, is rarely implemented in hardware, and an upper bound for its complexity can only be given in terms of the exponent, rather than the number of digits.

Which algorithm to choose for a huge integer multiplication, depending on N size

In my free time I'm preparing for interview questions like: implement multiplying numbers represented as arrays of digits. Obviously I'm forced to write it from scratch in a language like Python or Java, so an answer like "use GMP" is not acceptable (as mentioned here: Understanding Schönhage-Strassen algorithm (huge integer multiplication)).
For exactly which ranges of sizes of those two numbers (i.e. number of digits) should I choose
School grade algorithm
Karatsuba algorithm
Toom-Cook
Schönhage–Strassen algorithm ?
Is Schönhage–Strassen O(n log n log log n) always a good solution? Wikipedia mentions that Schönhage–Strassen is advisable for numbers beyond 2^2^15 to 2^2^17. What should I do when one number is ridiculously huge (e.g. 10,000 to 40,000 decimal digits) but the second consists of just a couple of digits?
Do all four of those algorithms parallelize easily?
You can browse the GNU Multiple Precision Arithmetic Library's source and see their thresholds for switching between algorithms.
More pragmatically, you should just profile your implementation of the algorithms. GMP puts a lot of effort into optimizing, so their algorithms will have different constant factors than yours. The difference could easily move the thresholds around by an order of magnitude. Find out where the times cross as input size increases for your code, and set the thresholds correspondingly.
I think all of the algorithms are amenable to parallelization, since they're mostly made up of divide-and-conquer passes. But keep in mind that parallelizing is another thing that will move the thresholds around quite a lot.
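For a feel of what the threshold idea looks like in code, here is a minimal Karatsuba sketch in Python with a schoolbook fallback below a cutoff. The cutoff of 32 bits is a placeholder rather than GMP's value, and CPython's own int multiplication already switches to Karatsuba internally for large operands, so this is purely illustrative:

    SCHOOLBOOK_CUTOFF_BITS = 32     # placeholder threshold; find yours by profiling

    def karatsuba(x, y):
        if x.bit_length() <= SCHOOLBOOK_CUTOFF_BITS or y.bit_length() <= SCHOOLBOOK_CUTOFF_BITS:
            return x * y                            # fall back to the "school grade" multiply
        m = max(x.bit_length(), y.bit_length()) // 2
        xh, xl = x >> m, x & ((1 << m) - 1)         # split x into high and low halves
        yh, yl = y >> m, y & ((1 << m) - 1)
        a = karatsuba(xh, yh)                       # high * high
        b = karatsuba(xl, yl)                       # low  * low
        c = karatsuba(xh + xl, yh + yl) - a - b     # both cross terms from one extra multiply
        return (a << (2 * m)) + (c << m) + b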

Does multiplication take unit time?

I have the following problem
Under what circumstances can multiplication be regarded as a unit time
operation?
But I thought multiplication is always considered to be taking unit time. Was I wrong?
It depends on what N is. If N is the number of bits in an arbitrarily large number, then as the number of bits increases, it takes longer to compute the product. However, in most programming languages and applications, the size of numbers is capped to some reasonable number of bits (usually 32 or 64). In hardware, these numbers are multiplied in one step that does not depend on the size of the number.
When the number of bits is a fixed number, like 32, then it doesn't make sense to talk about asymptotic complexity, and you can treat multiplication like an O(1) operation in terms of whatever algorithm you're looking at. When numbers can become arbitrarily large, like with Java's BigInteger class, then multiplication depends on the size of those numbers, as does the memory required to store them.
Only in cases where you're performing operations on two numbers of a plain numeric type (emphasis here; not going into the binary detail) can you simply assume that the operation being performed takes constant time.
It's not defined as unit time but, more strictly, as a constant time interval that doesn't change even if we increase the size of the numbers (in reality the calculation does take slightly more time on larger numbers). These differences are generally considered negligible, unless the numbers being multiplied are very large, like BigIntegers in Java, etc.
But as soon as we move to multiplying arbitrary-length binary strings, the complexity increases, and the naive method yields a complexity of O(n^2).
So, to simplify, we use a divide-and-conquer multiplication, also known as Karatsuba's algorithm, which has a complexity of O(n^1.59); it reduces the total number of sub-multiplications at the cost of a few extra additions.
I hope I haven't misjudged the question. If so, please alert me so that I can remove this answer. If I understood the question properly, then the other answer posted here seems incomplete.
The expression unit time is a little ambiguous (and AFAIK not much used).
True unit time is achieved when the multiply is performed in a single clock cycle. This rarely occurs on modern processors.
If the execution time of the multiply does not depend on the particular values of the operands, we can say that it is performed in constant time.
When the operand length is bounded, so that the time never exceeds a given duration, we also say that an operation is performed in constant time.
This constant duration can be used as the timing unit of the running time, so that you count in "multiplies" instead of seconds (ops, flops).
Lastly, you can evaluate the performance of an algorithm in terms of the number of multiplies it performs, independently of the time they take.
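A quick way to see the bounded-size vs. arbitrary-size distinction is to time Python int multiplications of very different lengths. Python ints are arbitrary precision and the exact numbers are machine-dependent, so treat this only as an illustration of how the cost depends on operand length:

    import timeit

    # word-sized operands: cost per multiply is effectively constant
    t_small = timeit.timeit("x * y", globals={"x": 12345678, "y": 87654321}, number=200_000)

    # operands with thousands of digits: cost grows with the operand length
    x_big, y_big = 10**5000 + 7, 10**5000 + 9
    t_big = timeit.timeit("x * y", globals={"x": x_big, "y": y_big}, number=200_000)

    print(f"word-sized multiply : {t_small:.3f} s")
    print(f"5000-digit multiply : {t_big:.3f} s")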

Where is strassen's matrix multiplication useful?

Strassen's algorithm for matrix multiplication just gives a marginal improvement over the conventional O(N^3) algorithm. It has higher constant factors and is much harder to implement. Given these shortcomings, is Strassen's algorithm actually useful, and is it implemented in any library for matrix multiplication? Moreover, how is matrix multiplication implemented in libraries?
Generally, Strassen's method is not preferred for practical applications, for the following reasons.
The constants used in Strassen's method are high, and for a typical application the naive method works better.
For sparse matrices, there are better methods especially designed for them.
The submatrices in the recursion take extra space.
Because of the limited precision of computer arithmetic on non-integer values, larger errors accumulate in Strassen's algorithm than in the naive method.
So the idea of Strassen's algorithm is that it's faster (asymptotically speaking). This could make a big difference if you are dealing with either huge matrices or a very large number of matrix multiplications. However, just because it's faster asymptotically doesn't make it the most efficient algorithm in practice. There are all sorts of implementation considerations such as caching and architecture-specific quirks. Also, there is parallelism to consider.
I think your best bet would be to look at the common libraries and see what they are doing. Have a look at BLAS for example. And I think that Matlab uses MAGMA.
If your contention is that you don't think O(n^2.8) is that much faster than O(n^3), this chart shows you that n doesn't need to be very large before that difference becomes significant.
It's very important to stop at the right moment.
With 1,000 x 1,000 matrices, you can multiply them by doing seven 500 x 500 products plus a few additions. That's probably useful. With 500 x 500, maybe. With 10 x 10 matrices, most likely not. You'd just have to run some experiments first to find the point at which to stop.
But Strassen's algorithm only saves a factor of 2 (at best) when the number of rows grows by a factor of 32, the number of coefficients grows by 1,024, and the total time grows by a factor of 16,807 instead of 32,768. In practice, that's a "constant factor". I'd say you gain more by transposing the second matrix first so you can multiply rows by rows, then look carefully at cache sizes, vectorise as much as possible, and distribute over multiple cores that don't step on each other's feet.
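For what "stopping at the right moment" looks like, here is a rough NumPy sketch of Strassen's recursion with a base-case cutoff. It assumes square matrices whose size is a power of two (pad otherwise), and the cutoff of 128 is just a placeholder to be tuned by experiment; it is a sketch of the idea, not a tuned kernel:

    import numpy as np

    CUTOFF = 128   # below this size, fall back to the ordinary product

    def strassen(A, B):
        n = A.shape[0]
        if n <= CUTOFF:
            return A @ B                    # conventional multiply for small blocks
        h = n // 2
        A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
        B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
        # the seven recursive products
        M1 = strassen(A11 + A22, B11 + B22)
        M2 = strassen(A21 + A22, B11)
        M3 = strassen(A11, B12 - B22)
        M4 = strassen(A22, B21 - B11)
        M5 = strassen(A11 + A12, B22)
        M6 = strassen(A21 - A11, B11 + B12)
        M7 = strassen(A12 - A22, B21 + B22)
        # recombine into the four quadrants of the result
        C = np.empty_like(A)
        C[:h, :h] = M1 + M4 - M5 + M7
        C[:h, h:] = M3 + M5
        C[h:, :h] = M2 + M4
        C[h:, h:] = M1 - M2 + M3 + M6
        return C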
Marginal improvement: True, but growing as the matrix sizes grow.
Higher constant factors: Practical implementations of Strassen's algorithm use conventional n^3 for blocks below a particular size, so this doesn't really matter.
Harder to implement: whatever.
As for what's used in practice: First, you have to understand that multiplying two ginormous dense matrices is unusual. Much more often, one or both of them is sparse, or symmetric, or upper triangular, or some other pattern, which means that there are quite a few specialized tools which are essential to the efficient large matrix multiplication toolbox. With that said, for giant dense matrices, Strassen's is The Solution.

What algorithms have high time complexity, to help "burn" more CPU cycles?

I am trying to write a demo for an embedded processor, which is a multicore architecture and is very fast in floating point calculations. The problem is that the current hardware I have is the processor connected through an evaluation board where the DRAM to chip rate is somewhat limited, and the board to PC rate is very slow and inefficient.
Thus, when demonstrating big matrix multiplication, I can do, say, 128x128 matrices in a couple of milliseconds, but the I/O, which takes (lots of) seconds, kills the demo.
So, I am looking for some kind of calculation with higher complexity than n^3, the more the better (but preferably easy to program and to explain/understand), to make the computation part more dominant in the time budget, where the dataset is preferably bounded to about 16 KB per thread (core).
Any suggestion?
PS: I think it is very similar to this question in its essence.
You could generate large (256-bit) numbers and factor them; that's commonly used in "stress-test" tools. If you specifically want to exercise floating point computation, you can build a basic n-body simulator with a Runge-Kutta integrator and run that.
What you can do is:
Declare a std::vector of int.
Populate it with 0 to N-1.
Now keep calling std::next_permutation repeatedly until the elements are sorted again, i.e. next_permutation returns false.
With N integers this will need O(N!) calculations, and it is also deterministic.
PageRank may be a good fit. Articulated as a linear algebra problem, one repeatedly squares a certain floating-point matrix of controllable size until convergence. In the graphical metaphor, one "ripples" change coming into each node onto the other edges. Both treatments can be made parallel.
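A rough NumPy sketch of the matrix-squaring formulation (the function name, damping factor and convergence tolerance are just typical illustrative choices): the Google matrix is dense and row-stochastic, so every squaring is a full O(n^3) matrix product, and each row of the repeatedly squared matrix converges to the PageRank vector.

    import numpy as np

    def pagerank_by_squaring(adj, damping=0.85, tol=1e-10, max_steps=60):
        n = adj.shape[0]
        adj = adj.astype(float)
        row_sums = adj.sum(axis=1, keepdims=True)
        # dangling nodes (no out-links) are treated as linking to every page
        transition = np.where(row_sums == 0, 1.0 / n,
                              adj / np.where(row_sums == 0, 1.0, row_sums))
        google = damping * transition + (1.0 - damping) / n
        m = google
        for _ in range(max_steps):
            m_next = m @ m                      # the expensive O(n^3) step
            if np.abs(m_next - m).max() < tol:  # stop once the powers have converged
                break
            m = m_next
        return m[0]                             # each row converges to the PageRank vector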
You could do a least trimmed squares fit. One use of this is to identify outliers in a data set. For example you could generate samples from some smooth function (a polynomial say) and add (large) noise to some of the samples, and then the problem is to find a subset H of the samples of a given size that minimises the sum of the squares of the residuals (for the polynomial fitted to the samples in H). Since there are a large number of such subsets, you have a lot of fits to do! There are approximate algorithms for this, for example here.
Well, one way to go would be to implement a brute-force solver for the Traveling Salesman problem in some M-space (with M > 1).
The brute-force solution is to just try every possible permutation and then calculate the total distance for each permutation, without any optimizations (including no dynamic programming tricks like memoization).
For N points, there are (N!) permutations (with a redundancy factor of at least (N-1), but remember, no optimizations). Each pair of points requires (M) subtractions, (M) multiplications and one square root operation to determine their Pythagorean distance apart. Each permutation has (N-1) pairs of points to calculate and add to the total distance.
So order of computation is O(M((N+1)!)), whereas storage space is only O(N).
Also, this should be neither too hard nor too intensive to parallelize across the cores, though it does take some overhead. (I can demonstrate, if needed.)
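A small Python sketch of the brute force described above (no fixed starting point, no pruning; the point count and dimension are just example values), mainly to show how quickly the work grows:

    import itertools, math, random

    def shortest_open_tour(points):
        # score every ordering; O(M * (N+1)!) work, O(N) storage
        best = math.inf
        for perm in itertools.permutations(points):
            total = 0.0
            for p, q in zip(perm, perm[1:]):                 # the N-1 consecutive pairs
                total += math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
            best = min(best, total)
        return best

    # 9 points in 3-space is already 9! = 362,880 tours; each extra point
    # multiplies the work by roughly N
    pts = [tuple(random.random() for _ in range(3)) for _ in range(9)]
    print(shortest_open_tour(pts))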
Another idea might be to compute a fractal map. Basically, choose a grid of whatever dimensionality you want. Then, for each grid point, do the fractal iteration to get the value. Some points might require only a few iterations; I believe some will iterate forever (chaos; of course, this can't really happen when you have a finite number of floating-point numbers, but still). The ones that don't stop you'll have to "cut off" after a certain number of iterations... just make this preposterously high, and you should be able to demonstrate a high-quality fractal map.
Another benefit of this is that grid cells are processed completely independently, so you will never need to do communication (not even at boundaries, as in stencil computations, and definitely not O(pairwise) as in direct N-body simulations). You can usefully use O(gridcells) number of processors to parallelize this, although in practice you can probably get better utilization by using gridcells/factor processors and dynamically scheduling grid points to processors on an as-ready basis. The computation is basically all floating-point math.
Mandelbrot/Julia and Lyapunov come to mind as potential candidates, but any should do.
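A bare-bones escape-time sketch in Python along those lines (the grid bounds, resolution and iteration cap are all arbitrary knobs): every grid cell is computed independently, so rows or tiles can be handed to different cores with no communication at all.

    def escape_time_grid(width=512, height=512, max_iters=10_000, bound=2.0):
        counts = [[0] * width for _ in range(height)]
        for j in range(height):
            y = -1.5 + 3.0 * j / (height - 1)
            for i in range(width):
                x = -2.0 + 3.0 * i / (width - 1)
                c = complex(x, y)
                z = 0j
                k = 0
                while abs(z) <= bound and k < max_iters:    # cut off the non-escapers
                    z = z * z + c
                    k += 1
                counts[j][i] = k                            # iteration count = the "fractal map" value
        return counts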
