Addition efficiency proportional to size of operands? - performance

Question that just popped into my head, and I don't think I've seen an answer on here. Is the time taken by a binary addition algorithm, proportional to the size of the operands?
Obviously, adding 1101011010101010101101010 and 10110100101010010101 is going to take longer than 1 + 1, but my question refers more to the smaller values. Is there a negligible difference, no difference, a theoretical difference?
At what point, with these sorts of rudimentary calculations should we start looking into more efficient methods of calculation? ie: Exponentiation by squaring with large exponents for calculating huge powers.

How we see the binary patterns...
1101011010101010101101010 (big)
10110100101010010101 (medium)
1 (small)
How a 32bit computer sees the binary patterns...
00000001101011010101010101101010 32bit,
00000000000010110100101010010101 32bit,
00000000000000000000000000000001 i'm lovin it
On a 32bit system, all the above numbers will take the same time (no. of CPU instructions) to be added. As all of them fit within the basic computational block i.e. the 32bit CPU register.
How a 16bit computer sees the binary patterns...
1
+1 = ?
0000000000000001 i'm lovin it
0000000000000001 i'm lovin it
00000001101011010101010101101010
+00000000000010110100101010010101 = ?
00000001101011010101010101101010 too BIG for me!
00000000000010110100101010010101 too BIG for me!
On a 16bit system, as the larger numbers will NOT fit in a 16bit register, it will need an additional pass(to add the significant bits that remain after the first 16LSBs are added).
Step1: ADD Least significant bits
0101010101101010
0100101010010101
Step2: ADD the rest (remember carry bit from previous operation)
000000000000000C
0000000110101101
0000000000001011
We can start thinking of optimising the mathematical operations on
numbers once the numbers no longer fit in the basic computation unit
of the system i.e. the CPU-register.
Modern hardware architectures are developed keeping this in mind and support SIMD instructions. Compilers will often employ them (SSE on x86, NEON on ARM) when they see such a case being made i.e. 128bit decryption logic being run on a 32bit system.
Also instead of checking ONLY the size of the operands, the size of the result also determines whether the system can accomplish the mathematical operation within one step. Not only the operands involved, but the operation being performed needs to be taken into consideration as well.
For example, on a 32bit system, adding two 30bit numbers can be definitely carried out using the regular operations as the result is guaranteed to NOT exceed a 32bit register. But multiplying the same two 30bit numbers may result in a number that does NOT fit within 32bits.
In the absence of such a guarantee of being able to store the result in a single computational unit, to ensure validity of the result for all possible values, the architecture(and the compiler) must :
go the long way i.e. multi-step mathematical operations
or
employ SIMD optimisations
or
define and implement custom mechanisms
(like register-pairs EDX:EAX to hold the result on x86)

In practice, there's no (or completely negligible) difference between adding different integers that fit in the processor words as that should always be a fixed-time operation.
In theory, the complexity for adding two unsigned integers should be O(log(n)) where n is the bigger of the two. As such, you need to go pretty high before mere additions become a problem.
As for where exactly to draw the line between simple and complex algorithms for computing numbers, I don't have an exact answer. However, the GMP library comes to mind. From what I understand, they've carefully chosen their algorithms and under what circumstances to use each in terms of performance. You may want to look into what they did.

I somewhat disagree with the above answers. It very much depends on the context.
For simple integer arithmetic (for loop counters etc), then on 64bit machines that computation will be done using 64bit general purpose registers (RSI/RCX/etc). In those cases, there is no difference in speed between an 8bit or 64bit addition.
If however you are processing arrays of integers, and assuming the compiler has been able to optimise the code nicely, then yes, smaller is faster (but not for the reason you think).
In the AVX2 instruction set, you have access to 4 integer addition instructions:
__m256i _mm256_add_epi8 (__m256i a, __m256i b); // 32 x 8bit
__m256i _mm256_add_epi16(__m256i a, __m256i b); // 16 x 16bit
__m256i _mm256_add_epi32(__m256i a, __m256i b); // 8 x 32bit
__m256i _mm256_add_epi64(__m256i a, __m256i b); // 4 x 64bit
You'll notice that all of them operate on 256bits at a time, which means you can process 4 integer additions if you're using 64bit, compared to 32 additions if you are using 8bit integers. (As mentioned above, you'd need to make sure you have enough precision). They all take the same number of clock cycles to compute - 1clk.
There are also other effects of using smaller data types, which are mainly better CPU cache usage, and a reduced number of memory reads/writes.
However, back to your original question on bit-by-bit computation. Prior to the new AVX-512 instruction set, it might not have seemed a little silly. However, the new instruction set contains a ternary logic instruction. With this instruction, it is possible to compute 512 additions on numbers of any bit length fairly easily.
inline __m512i add(__m512i x, __m512i x, __m512i carry_in)
{
return _mm512_ternarylogic_epi32(carry_in, y, x, 0x96);
}
inline __m512i adc(__m512i x, __m512i x, __m512i carry_in)
{
return _mm512_ternarylogic_epi32(carry_in, y, x, 0xE8);
}
__m512i A[NUM_BITS];
__m512i B[NUM_BITS];
__m512i RESULT[NUM_BITS];
__m512i CARRY = _mm512_setzero_ps();
for(int i = 0; i < NUM_BITS; ++i)
{
RESULT[i] = add(A[i], B[i], CARRY);
CARRY = adc(A[i], B[i], CARRY);
}
In this particular example (which to be honest, probably has very limited real world usage!), The time it takes to perform the 512 additions, is indeed directly proportional to NUM_BITS.

Related

Is it still worth using the Quake fast inverse square root algorithm nowadays on x86-64?

Specifically, this is the code I'm talking about:
float InvSqrt(float x) {
float xhalf = 0.5f*x;
int i = *(int*)&x; // warning: strict-aliasing UB, use memcpy instead
i = 0x5f375a86- (i >> 1);
x = *(float*)&i; // same
x = x*(1.5f-xhalf*x*x);
return x;
}
I forgot where I got this from but it's apparently better and more efficient or precise than the original Quake III algorithm (slightly different magic constant), but it's been more than 2 decades since this algorithm was created, and I just want to know if it's still worth using it in terms of performance, or if there's an instruction that implements it already in modern x86-64 CPUs.
Origins:
See John Carmack's Unusual Fast Inverse Square Root (Quake III)
Modern usefulness: none, obsoleted by SSE1 rsqrtss
Use _mm_rsqrt_ps or ss to get a very approximate reciprocal-sqrt for 4 floats in parallel, much faster than even a good compiler could do with this (using SSE2 integer shift/add instructions to keep the FP bit pattern in an XMM register, which is probably not how it would actually compile with the type-pun to integer. Which is strict-aliasing UB in C or C++; use memcpy or C++20 std::bit_cast.)
https://www.felixcloutier.com/x86/rsqrtss documents the scalar version of the asm instruction, including the |Relative Error| ≤ 1.5 ∗ 2−12 guarantee. (i.e. about half the mantissa bits are correct.) One Newton-Raphson iteration can refine it to within 1ulp of being correct, although still not the 0.5ulp you'd get from actual sqrt. See Fast vectorized rsqrt and reciprocal with SSE/AVX depending on precision)
rsqrtps performs only slightly slower than a mulps / mulss instruction on most CPUs, like 5 cycle latency, 1/clock throughput. (With a Newton iteration to refine it, more uops.) Latency various by microarchitecture, as low as 3 uops in Zen 3, but Intel runs it with about 5c latency since Conroe at least (https://uops.info/).
The integer shift / subtract from the magic number in the Quake InvSqrt similarly provides an even rougher initial-guess, and the rest (after type-punning the bit-pattern back to a float is a Newton Raphson iteration.
Compilers will even use rsqrtss for you when compiling sqrt with -ffast-math, depending on context and tuning options. (e.g. modern clang compiling 1.0f/sqrtf(x) with -O3 -ffast-math -march=skylake https://godbolt.org/z/fT86bKesb uses vrsqrtss and 3x vmulss plus an FMA.) Non-reciprocal sqrt is usually not worth it, but rsqrt + refinement avoids a division as well as a sqrt.
Full-precision square root and division themselves are not as slow as they used to be, at least if you use them infrequently compared to mul/add/sub. (e.g. if you can hide the latency, one sqrt every 12 or so other operations might cost about the same, still a single uop instead of multiple for rsqrt + Newton iteration.) See Floating point division vs floating point multiplication
But sqrt and div do compete with each other for throughput so needing to divide by a square root is a nasty case.
So if you have a bad loop over an array that mostly just does sqrt, not mixed with other math operations, that's a use-case for _mm_rsqrt_ps (and a Newton iteration) as a higher throughput approximation than _mm_sqrt_ps
But if you can combine that pass with something else to increase computational intensity and get more work done overlapped with keeping the div/sqrt unit, often it's better to use a real sqrt instruction on its own, since that's still just 1 uop for the front-end to issue, and for the back-end to track and execute. vs. a Newton iteration taking something like 5 uops if FMA is available for reciprocal square root, else more (also if non-reciprocal sqrt is needed).
With Skylake for example having 1 per 3 cycle sqrtps xmm throughput (128-bit vectors), it costs the same as a mul/add/sub/fma operation if you don't do more than one per 6 math operations. (Throughput is worse for 256-bit YMM vectors, 6 cycles.) A Newton iteration would cost more uops, so if uops for port 0/1 are the bottleneck, it's a win to just use sqrt directly. (This is assuming that out-of-order exec can hide the latency, typically when each loop iteration is independent.) This kind of situation is common if you're using a polynomial approximation as part of something like log or exp in a loop.
See also Fast vectorized rsqrt and reciprocal with SSE/AVX depending on precision re: performance on modern OoO exec CPUs.

Fastest way to implement floating-point multiplication by a small integer constant

Suppose you are trying to multiply a floating-point number k by a small integer constant n (by small I mean -20 <= n <= 20). The naive way of doing this is converting n to a floating point number (which for the purposes of this question does not count towards the runtime) and executing a floating-point multiply. However, for n = 2, it seems likely that k + k is a faster way of computing it. At what n does the multiply instruction become faster than repeated additions (plus an inversion at the end if n < 0)?
Note that I am not particularly concerned about accuracy here; I am willing to allow unsound optimizations as long as they get roughly the right answer (i.e.: up to 1024 ULP error is probably fine).
I am writing OpenCL code, so I'm interested in the answer to this question in many computational contexts (x86-64, x86-64 + AVX256, GPUs).
I could benchmark this, but since I don't have a particular architecture in mind, I'd prefer a theoretical justification of the choice.
According to AMD's OpenCL optimisation guide for GPUs, section 3.8.1 "Instruction Bandwidths", for single-precision floating point operands, addition, multiplication and 'MAD' (multiply-add) all have a throughput of 5 per cycle on GCN based GPUs. The same is true for 24-bit integers. Only once you move to 32-bit integers are multiplications much more expensive (1/cycle). Int-to-float conversions and vice versa are also comparatively slow (1/cycle), and unless you have a double-precision float capable model (mostly FirePro/Radeon Pro series or Quadro/Tesla from nvidia) operations on doubles are super slow (<1/cycle). Negation is typically "free" on GPUs - for example GCN has sign flags on instruction operands, so -(a + b) compiles to one instruction after transforming to (-a) + (-b).
Nvidia GPUs tend to be a bit slower at integer operations, for floats it's a similar story to AMD's though: multiplications are just as fast as addition, and if you can combine them into MAD operations, you can double throughput. Intel's GPUs are quite different in other regards, but again they're very fast at FP multiplication and addition.
Basically, it's really hard to beat a GPU at floating-point multiplication, as that's essentially the one thing they're optimised for.
On the CPU it's typically more complicated - Agner Fog's optimisation resources and instruction tables are the place to go for the details. Note though that on many CPUs you'll pay a penalty for interpreting float data as integer and back because ALU and FPU are typically separate. (For example if you wanted to optimise multiplying floats by a power of 2 by performing an integer addition on their exponents. On x86, you can easily do this by operating on SSE or AVX registers using first float instructions, then integer ones, but it's generally not good for performance.)

Arithmetic Operations using only 32 bit integers

How would you compute the multiplication of two 1024 bit numbers on a microprocessor that is only capable of multiplying 32 bit numbers?
The starting point is to realize that you already know how to do this: in elementary school you were taught how to do arithmetic on single digit numbers, and then given data structures to represent larger numbers (e.g. decimals) and algorithms to compute arithmetic operations (e.g. long division).
If you have a way to multiply two 32-bit numbers to give a 64-bit result (note that unsigned long long is guaranteed to be at least 64 bits), then you can use those same algorithms to do arithmetic in base 2^32.
You'll also need, e.g., an add with carry operation. You can determine the carry when adding two unsigned numbers of the same type by detecting overflow, e.g. as follows:
uint32_t x, y; // set to some value
uint32_t sum = x + y;
uint32_t carry = (sum < x);
(technically, this sort of operation requires that you do unsigned arithmetic: overflow in signed arithmetic is undefined behavior, and optimizers will do surprising things to your code you least expect it)
(modern processors usually give a way to multiply two 64-bit numbers to give a 128-bit result, but to access it you will have to use compiler extensions like 128-bit types, or you'll have to write inline assembly code. modern processors also have specialized add-with-carry instructions)
Now, to do arithmetic efficiently is an immense project; I found it quite instructive to browse through the documentation and source code to gmp, the GNU multiple precision arithmetic library.
look at any implementation of bigint operations
here are few of mine approaches in C++ for fast bignum square
some are solely for sqr but others are usable for multiplication...
use 32bit arithmetics as a module for 64/128/256/... bit arithmetics
see mine 32bit ALU in x86 C++
use long multiplication with digit base 2^32
can use also Karatsuba this way

Why is comparing double faster than uint64?

I analysed the following program using Matlab's profile. Both double and uint64 are 64-bit variables. Why is comparing two double much faster than comparing two uint64? Aren't they both compared bitwise?
big = 1000000;
a = uint64(randi(100,big,1));
b = uint64(randi(100,big,1));
c = uint64(zeros(big,1));
tic;
for i=1:big
if a(i) == b(i)
c(i) = c(i) + 1;
end
end
toc;
a = randi(100,big,1);
b = randi(100,big,1);
c = zeros(big,1);
tic;
for i=1:big
if a(i) == b(i)
c(i) = c(i) + 1;
end
end
toc;
This is the measurement of profile:
This is what tictoc measures:
Elapsed time is 6.259040 seconds.
Elapsed time is 0.015387 seconds.
The effect disappears when uint8..uint32 or int8..int32 are used instead of 64-bit data types.
It's probably a combination of the Matlab interpreter and the underlying CPU not supporting 64-bit int types as well as the others.
Matlab favors double over int operations. Most values are stored in double types, even if they represent integer values. The double and int == operations will take different code paths, and MathWorks will have spent a lot more attention on tuning and optimizing the code for double than for ints, especially int64. In fact, older versions of Matlab did not support arithmetic operations on int64 at all. (And IIRC, it still doesn't support mixed-integer math.) When you do int64 math, you're using less mature, less tuned code, and the same may apply to ==. Int types are not a priority in Matlab. The presence of the int64 may even interfere with the JIT optimizing that code, but that's just a guess.
But there might be an underlying hardware reason for this too. Here's a hypothesis: if you're on 32-bit x86, you're working with 32-bit general purpose (integer) registers. That means the smaller int types can fit in a register and be compared using fast instructions, but the 64-bit int values won't fit in a register, and may take more expensive instruction sequences to compare. The doubles, even though they are 64 bits wide, will fit in the wide floating-point registers of the x87 floating point unit, and can be compared in hardware using fast floating-point comparison instructions. This means the [u]int64s are the only ones that can't be compared using fast single-register, single-instruction operations.
If that's the case, if you run this same code on 64-bit x86-64 (in 64-bit Matlab), the difference may disappear because you then have 64-bit wide general purpose registers. But then again, it may not, if the Matlab interpreter's code isn't compiled to take advantage of it.

Can long integer routines benefit from SSE?

I'm still working on routines for arbitrary long integers in C++. So far, I have implemented addition/subtraction and multiplication for 64-bit Intel CPUs.
Everything works fine, but I wondered if I can speed it a bit by using SSE. I browsed through the SSE docs and processor instruction lists, but I could not find anything I think I can use and here is why:
SSE has some integer instructions, but most instructions handle floating point. It doesn't look like it was designed for use with integers (e.g. is there an integer compare for less?)
The SSE idea is SIMD (same instruction, multiple data), so it provides instructions for 2 or 4 independent operations. I, on the other hand, would like to have something like a 128 bit integer add (128 bit input and output). This doesn't seem to exist. (Yet? In AVX2 maybe?)
The integer additions and subtractions handle neither input nor output carries. So it's very cumbersome (and thus, slow) to do it by hand.
My question is: is my assessment correct or is there anything I have overlooked? Can long integer routines benefit from SSE? In particular, can they help me to write a quicker add, sub or mul routine?
In the past, the answer to this question was a solid, "no". But as of 2017, the situation is changing.
But before I continue, time for some background terminology:
Full Word Arithmetic
Partial Word Arithmetic
Full-Word Arithmetic:
This is the standard representation where the number is stored in base 232 or 264 using an array of 32-bit or 64-bit integers.
Many bignum libraries and applications (including GMP) use this representation.
In full-word representation, every integer has a unique representation. Operations like comparisons are easy. But stuff like addition are more difficult because of the need for carry-propagation.
It is this carry-propagation that makes bignum arithmetic almost impossible to vectorize.
Partial-Word Arithmetic
This is a lesser-used representation where the number uses a base less than the hardware word-size. For example, putting only 60 bits in each 64-bit word. Or using base 1,000,000,000 with a 32-bit word-size for decimal arithmetic.
The authors of GMP call this, "nails" where the "nail" is the unused portion of the word.
In the past, use of partial-word arithmetic was mostly restricted to applications working in non-binary bases. But nowadays, it's becoming more important in that it allows carry-propagation to be delayed.
Problems with Full-Word Arithmetic:
Vectorizing full-word arithmetic has historically been a lost cause:
SSE/AVX2 has no support for carry-propagation.
SSE/AVX2 has no 128-bit add/sub.
SSE/AVX2 has no 64 x 64-bit integer multiply.*
*AVX512-DQ adds a lower-half 64x64-bit multiply. But there is still no upper-half instruction.
Furthermore, x86/x64 has plenty of specialized scalar instructions for bignums:
Add-with-Carry: adc, adcx, adox.
Double-word Multiply: Single-operand mul and mulx.
In light of this, both bignum-add and bignum-multiply are difficult for SIMD to beat scalar on x64. Definitely not with SSE or AVX.
With AVX2, SIMD is almost competitive with scalar bignum-multiply if you rearrange the data to enable "vertical vectorization" of 4 different (and independent) multiplies of the same lengths in each of the 4 SIMD lanes.
AVX512 will tip things more in favor of SIMD again assuming vertical vectorization.
But for the most part, "horizontal vectorization" of bignums is largely still a lost cause unless you have many of them (of the same size) and can afford the cost of transposing them to make them "vertical".
Vectorization of Partial-Word Arithmetic
With partial-word arithmetic, the extra "nail" bits enable you to delay carry-propagation.
So as long as you as you don't overflow the word, SIMD add/sub can be done directly. In many implementations, partial-word representation uses signed integers to allow words to go negative.
Because there is (usually) no need to perform carryout, SIMD add/sub on partial words can be done equally efficiently on both vertically and horizontally-vectorized bignums.
Carryout on horizontally-vectorized bignums is still cheap as you merely shift the nails over the next lane. A full carryout to completely clear the nail bits and get to a unique representation usually isn't necessary unless you need to do a comparison of two numbers that are almost the same.
Multiplication is more complicated with partial-word arithmetic since you need to deal with the nail bits. But as with add/sub, it is nevertheless possible to do it efficiently on horizontally-vectorized bignums.
AVX512-IFMA (coming with Cannonlake processors) will have instructions that give the full 104 bits of a 52 x 52-bit multiply (presumably using the FPU hardware). This will play very well with partial-word representations that use 52 bits per word.
Large Multiplication using FFTs
For really large bignums, multiplication is most efficiently done using Fast-Fourier Transforms (FFTs).
FFTs are completely vectorizable since they work on independent doubles. This is possible because fundamentally, the representation that FFTs use is
a partial word representation.
To summarize, vectorization of bignum arithmetic is possible. But sacrifices must be made.
If you expect SSE/AVX to be able to speed up some existing bignum code without fundamental changes to the representation and/or data layout, that's not likely to happen.
But nevertheless, bignum arithmetic is possible to vectorize.
Disclosure:
I'm the author of y-cruncher which does plenty of large number arithmetic.

Resources