What is the difference between the OpenCL functions length() and fast_length()? - performance

On page three of this OpenCL reference sheet (broken link) there are two built-in vector length functions with identical parameters: length() and fast_length().
What is the difference between these functions? I gather from the name that one is 'faster' than the other, but in what circumstances? Does it sacrifice accuracy for this speed increase? If not, why would one ever use length() over fast_length()?

According to the OpenCL spec (version 1.1, page 215):
float length(floatn p): Return the length of vector p, i.e. sqrt(p.x²+p.y²+...)
float fast_length(floatn p): Return the length of vector p computed as half_sqrt(p.x²+p.y²+...)
So fast_length uses half_sqrt, while length uses sqrt. As you can guess sqrt has better guarantees on accuracy, but might be slower. More to the point:
Min Accuracy of sqrt: 3ulp (unit of least precision)
Min Accuracy of half_sqrt: 8192ulp
So half_sqrt can be about 11 bits less accurate than sqrt (actually it can be 13 bits less accurate, since there is no requirement that sqrt not be better than strictly necessary). Since float has a mantissa of 23 bits (plus one implicit bit), half_sqrt only promises about 10 bits of precision (11 bits including the implicit 1). It might however be faster, if the hardware has such a function. In hardware it's not unusual to have sqrt or rsqrt instructions providing only a small number of bits (like 10-14) and to use Newton-Raphson iterations after the instruction to get the necessary precision. In such a case using half_sqrt is obviously faster.
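For context, here is a minimal OpenCL C kernel sketch (the kernel and buffer names are made up for illustration) showing where the choice shows up in practice: length() where you need the accurate result, fast_length() where roughly 10-11 good bits are enough.
__kernel void vec_lengths(__global const float4 *v,
                          __global float *precise,
                          __global float *approx)
{
    size_t i = get_global_id(0);
    precise[i] = length(v[i]);      // sqrt-based, at most 3 ulp error
    approx[i]  = fast_length(v[i]); // half_sqrt-based, up to 8192 ulp error
}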

Related

Is it still worth using the Quake fast inverse square root algorithm nowadays on x86-64?

Specifically, this is the code I'm talking about:
float InvSqrt(float x) {
    float xhalf = 0.5f * x;
    int i = *(int*)&x;          // warning: strict-aliasing UB, use memcpy instead
    i = 0x5f375a86 - (i >> 1);
    x = *(float*)&i;            // same
    x = x * (1.5f - xhalf * x * x);
    return x;
}
I forgot where I got this from, but it's apparently better and more efficient or precise than the original Quake III algorithm (it uses a slightly different magic constant). It's been more than two decades since this algorithm was created, and I just want to know if it's still worth using in terms of performance, or if there's an instruction that already implements it in modern x86-64 CPUs.
Origins:
See John Carmack's Unusual Fast Inverse Square Root (Quake III)
Modern usefulness: none, obsoleted by SSE1 rsqrtss
Use _mm_rsqrt_ps (or the _ss scalar version) to get a very approximate reciprocal square root for 4 floats in parallel, much faster than anything a compiler could generate from this code (even if it used SSE2 integer shift/add instructions to keep the FP bit pattern in an XMM register, which is probably not how the type-pun to integer would actually compile; that type-pun is also strict-aliasing UB in C and C++, so use memcpy or C++20 std::bit_cast instead).
https://www.felixcloutier.com/x86/rsqrtss documents the scalar version of the asm instruction, including the |Relative Error| ≤ 1.5 × 2^−12 guarantee (i.e. about half the mantissa bits are correct). One Newton-Raphson iteration can refine it to within 1 ulp of the correct result, although still not the 0.5 ulp you'd get from an actual sqrt. See Fast vectorized rsqrt and reciprocal with SSE/AVX depending on precision.
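As a concrete sketch (plain C with SSE intrinsics, not tuned production code; the helper name rsqrt_nr is made up), this is rsqrtss followed by one Newton-Raphson step y1 = y0 * (1.5 - 0.5 * x * y0 * y0):
#include <immintrin.h>

static inline float rsqrt_nr(float x)
{
    __m128 vx  = _mm_set_ss(x);
    __m128 y0  = _mm_rsqrt_ss(vx);                    /* ~12-bit approximation of 1/sqrt(x) */
    __m128 hx  = _mm_mul_ss(_mm_set_ss(0.5f), vx);    /* 0.5 * x */
    __m128 y0s = _mm_mul_ss(y0, y0);                  /* y0^2 */
    __m128 y1  = _mm_mul_ss(y0,
        _mm_sub_ss(_mm_set_ss(1.5f), _mm_mul_ss(hx, y0s)));  /* one NR refinement step */
    return _mm_cvtss_f32(y1);                         /* close to full single precision for normal inputs */
}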
rsqrtps is only slightly slower than a mulps / mulss instruction on most CPUs: around 5 cycles of latency, 1/clock throughput. (With a Newton iteration to refine it, more uops.) Latency varies by microarchitecture, down to about 3 cycles on Zen 3, but Intel has run it at about 5c latency since Conroe at least (https://uops.info/).
The integer shift / subtract from the magic number in the Quake InvSqrt similarly provides an even rougher initial guess, and the rest (after type-punning the bit pattern back to a float) is a Newton-Raphson iteration.
Compilers will even use rsqrtss for you when compiling sqrt with -ffast-math, depending on context and tuning options. (e.g. modern clang compiling 1.0f/sqrtf(x) with -O3 -ffast-math -march=skylake https://godbolt.org/z/fT86bKesb uses vrsqrtss and 3x vmulss plus an FMA.) Non-reciprocal sqrt is usually not worth it, but rsqrt + refinement avoids a division as well as a sqrt.
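For reference, this is the kind of source that triggers that transformation (the function name is just illustrative); compile with something like -O3 -ffast-math -march=skylake and inspect the asm:
#include <math.h>

/* With -ffast-math, compilers may turn this into vrsqrtss plus a refinement
   step instead of an actual divide and sqrt. */
float inv_norm(float x) {
    return 1.0f / sqrtf(x);
}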
Full-precision square root and division themselves are not as slow as they used to be, at least if you use them infrequently compared to mul/add/sub. (e.g. if you can hide the latency, one sqrt every 12 or so other operations might cost about the same, still a single uop instead of multiple for rsqrt + Newton iteration.) See Floating point division vs floating point multiplication
But sqrt and div do compete with each other for throughput so needing to divide by a square root is a nasty case.
So if you have a bad loop over an array that mostly just does sqrt, not mixed with other math operations, that's a use-case for _mm_rsqrt_ps (and a Newton iteration) as a higher-throughput approximation than _mm_sqrt_ps.
But if you can combine that pass with something else to increase computational intensity and get more work done overlapped with keeping the div/sqrt unit busy, it's often better to use a real sqrt instruction on its own, since that's still just 1 uop for the front-end to issue and for the back-end to track and execute, vs. a Newton iteration taking something like 5 uops if FMA is available (for reciprocal square root; more if a non-reciprocal sqrt is needed).
With Skylake for example having 1 per 3 cycle sqrtps xmm throughput (128-bit vectors), it costs the same as a mul/add/sub/fma operation if you don't do more than one per 6 math operations. (Throughput is worse for 256-bit YMM vectors, 6 cycles.) A Newton iteration would cost more uops, so if uops for port 0/1 are the bottleneck, it's a win to just use sqrt directly. (This is assuming that out-of-order exec can hide the latency, typically when each loop iteration is independent.) This kind of situation is common if you're using a polynomial approximation as part of something like log or exp in a loop.
See also Fast vectorized rsqrt and reciprocal with SSE/AVX depending on precision re: performance on modern OoO exec CPUs.

Fastest way to implement floating-point multiplication by a small integer constant

Suppose you are trying to multiply a floating-point number k by a small integer constant n (by small I mean -20 <= n <= 20). The naive way of doing this is converting n to a floating-point number (which for the purposes of this question does not count towards the runtime) and executing a floating-point multiply. However, for n = 2, it seems likely that k + k is a faster way of computing it. At what n does the multiply instruction become faster than repeated additions (plus a negation at the end if n < 0)?
Note that I am not particularly concerned about accuracy here; I am willing to allow unsound optimizations as long as they get roughly the right answer (i.e.: up to 1024 ULP error is probably fine).
I am writing OpenCL code, so I'm interested in the answer to this question in many computational contexts (x86-64, x86-64 + AVX256, GPUs).
I could benchmark this, but since I don't have a particular architecture in mind, I'd prefer a theoretical justification of the choice.
According to AMD's OpenCL optimisation guide for GPUs, section 3.8.1 "Instruction Bandwidths", for single-precision floating point operands, addition, multiplication and 'MAD' (multiply-add) all have a throughput of 5 per cycle on GCN based GPUs. The same is true for 24-bit integers. Only once you move to 32-bit integers are multiplications much more expensive (1/cycle). Int-to-float conversions and vice versa are also comparatively slow (1/cycle), and unless you have a double-precision float capable model (mostly FirePro/Radeon Pro series or Quadro/Tesla from nvidia) operations on doubles are super slow (<1/cycle). Negation is typically "free" on GPUs - for example GCN has sign flags on instruction operands, so -(a + b) compiles to one instruction after transforming to (-a) + (-b).
Nvidia GPUs tend to be a bit slower at integer operations, for floats it's a similar story to AMD's though: multiplications are just as fast as addition, and if you can combine them into MAD operations, you can double throughput. Intel's GPUs are quite different in other regards, but again they're very fast at FP multiplication and addition.
Basically, it's really hard to beat a GPU at floating-point multiplication, as that's essentially the one thing they're optimised for.
On the CPU it's typically more complicated - Agner Fog's optimisation resources and instruction tables are the place to go for the details. Note though that on many CPUs you'll pay a penalty for interpreting float data as integer and back because ALU and FPU are typically separate. (For example if you wanted to optimise multiplying floats by a power of 2 by performing an integer addition on their exponents. On x86, you can easily do this by operating on SSE or AVX registers using first float instructions, then integer ones, but it's generally not good for performance.)
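As a sketch of that power-of-two trick (the helper name is made up, and it is only valid for finite, normal inputs where the result stays normal, so it is an illustration rather than a safe general-purpose routine):
#include <stdint.h>
#include <string.h>

/* Multiply a float by 2^k by adding k to the exponent field (bits 23..30).
   Breaks on zero, subnormals, infinities, NaN, and on exponent overflow/underflow. */
static inline float mul_pow2(float x, int k)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);   /* bit-cast without the strict-aliasing type-pun */
    bits += (uint32_t)k << 23;        /* adjust the biased exponent */
    memcpy(&x, &bits, sizeof bits);
    return x;
}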

When is it useful to compare floating-point values for equality?

I seem to see people asking all the time around here questions about comparing floating-point numbers. The canonical answer is always: just see if the numbers are within some small number of each other…
So the question is this: Why would you ever need to know if two floating point numbers are equal to each other?
In all my years of coding I have never needed to do this (although I will admit that even I am not omniscient). From my point of view, if you are using floating-point numbers and for some reason want to know whether two of them are equal, you should probably be using an integer type instead (or a decimal type in languages that support it). Am I just missing something?
A few reasons to compare floating-point numbers for equality are:
Testing software. Given software that should conform to a precise specification, exact results might be known or feasibly computable, so a test program would compare the subject software’s results to the expected results.
Performing exact arithmetic. Carefully designed software can perform exact arithmetic with floating-point. At its simplest, this may simply be integer arithmetic. (On platforms which provide IEEE-754 64-bit double-precision floating-point but only 32-bit integer arithmetic, floating-point arithmetic can be used to perform 53-bit integer arithmetic.) Comparing for equality when performing exact arithmetic is the same as comparing for equality with integer operations.
Searching sorted or structured data. Floating-point values can be used as keys for searching, in which case testing for equality is necessary to determine that the sought item has been found. (There are issues if NaNs may be present, since they report false for any order test.)
Avoiding poles and discontinuities. Functions may have special behaviors at certain points, the most obvious of which is division. Software may need to test for these points and divert execution to alternate methods.
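Since the last reason is the one most relevant to ordinary numerical code, here is a minimal C sketch of it (the function name is made up, and the fallback value at the pole is an application-specific assumption):
#include <math.h>

/* Divert at the pole of 1/x rather than dividing by zero. */
double safe_reciprocal(double x)
{
    if (x == 0.0)          /* exact comparison is exactly the right tool at a pole */
        return HUGE_VAL;   /* or whatever the application defines at the singularity */
    return 1.0 / x;
}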
Note that only the last of these tests for equality when using floating-point arithmetic to approximate real arithmetic. (This list of examples is not complete, so I do not expect this is the only such use.) The first three are special situations. Usually when using floating-point arithmetic, one is approximating real arithmetic and working with mostly continuous functions. Continuous functions are “okay” for working with floating-point arithmetic because they transmit errors in “normal” ways. For example, if your calculations so far have produced some a' that approximates an ideal mathematical result a, and you have a b' that approximates an ideal mathematical result b, then the computed sum a'+b' will approximate a+b.
Discontinuous functions, on the other hand, can disrupt this behavior. For example, if we attempt to round a number to the nearest integer, what happens when a is 3.49? Our approximation a' might be 3.48 or 3.51. When the rounding is computed, the approximation may produce 3 or 4, turning a very small error into a very large error. When working with discontinuous functions in floating-point arithmetic, one has to be careful. For example, consider evaluating the quadratic formula, (−b ± sqrt(b² − 4ac)) / (2a). If there is a slight error during the calculations for b² − 4ac, the result might be negative, and then sqrt will return NaN. So software cannot simply use floating-point arithmetic as if it easily approximated real arithmetic. The programmer must understand floating-point arithmetic and be wary of the pitfalls, and these issues and their solutions can be specific to the particular software and application.
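One common mitigation for the quadratic-formula case, sketched in C (the function name is made up; clamping the discriminant to zero is an application-level choice, not a universal fix, and this sketch ignores the separate cancellation issue in −b ± s):
#include <math.h>

void solve_quadratic(double a, double b, double c, double *r0, double *r1)
{
    double disc = b * b - 4.0 * a * c;
    if (disc < 0.0)
        disc = 0.0;        /* treat a slightly negative discriminant as a double root */
    double s = sqrt(disc);
    *r0 = (-b - s) / (2.0 * a);
    *r1 = (-b + s) / (2.0 * a);
}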
Testing for equality is a discontinuous function. It is a function f(a, b) that is 0 everywhere except along the line a=b. Since it is a discontinuous function, it can turn small errors into large errors—it can report as equal numbers that are unequal if computed with ideal mathematics, and it can report as unequal numbers that are equal if computed with ideal mathematics.
With this view, we can see testing for equality is a member of a general class of functions. It is not any more special than square root or division—it is continuous in most places but discontinuous in some, and so its use must be treated with care. That care is customized to each application.
I will relate one place where testing for equality was very useful. We implement some math library routines that are specified to be faithfully rounded. The best quality for a routine is that it is correctly rounded. Consider a function whose exact mathematical result (for a particular input x) is y. In some cases, y is exactly representable in the floating-point format, in which case a good routine will return y. Often, y is not exactly representable. In this case, it is between two numbers representable in the floating-point format, some numbers y0 and y1. If a routine is correctly rounded, it returns whichever of y0 and y1 is closer to y. (In case of a tie, it returns the one with an even low digit. Also, I am discussing only the round-to-nearest ties-to-even mode.)
If a routine is faithfully rounded, it is allowed to return either y0 or y1.
Now, here is the problem we wanted to solve: We have some version of a single-precision routine, say sin0, that we know is faithfully rounded. We have a new version, sin1, and we want to test whether it is faithfully rounded. We have multiple-precision software that can evaluate the mathematical sin function to great precision, so we can use that to check whether the results of sin1 are faithfully rounded. However, the multiple-precision software is slow, and we want to test all four billion inputs. sin0 and sin1 are both fast, but sin1 is allowed to have outputs different from sin0, because sin1 is only required to be faithfully rounded, not to be the same as sin0.
However, it happens that most of the sin1 results are the same as sin0. (This is partly a result of how math library routines are designed, using some extra precision to get a very close result before using a few final arithmetic operations to deliver the final result. That tends to get the correctly rounded result most of the time but sometimes slips to the next nearest value.) So what we can do is this:
For each input, calculate both sin0 and sin1.
Compare the results for equality.
If the results are equal, we are done. If they are not, use the extended precision software to test whether the sin1 result is faithfully rounded.
Again, this is a special case for using floating-point arithmetic. But it is one where testing for equality serves very well; the final test program runs in a few minutes instead of many hours.
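A sketch in C of that test loop, assuming hypothetical sin0/sin1 implementations and a hypothetical (slow) multiple-precision checker is_faithful_mp(); none of these are real APIs:
#include <stdint.h>
#include <string.h>

extern float sin0(float x);                          /* known faithfully rounded version */
extern float sin1(float x);                          /* version under test */
extern int   is_faithful_mp(float x, float result);  /* hypothetical slow multiple-precision check */

int test_sin1_faithful(void)
{
    for (uint64_t u = 0; u <= 0xFFFFFFFFu; ++u) {    /* all four billion float bit patterns */
        uint32_t bits = (uint32_t)u;
        float x;
        memcpy(&x, &bits, sizeof x);
        float r1 = sin1(x);
        if (r1 == sin0(x))                           /* equal: sin1 inherits sin0's guarantee */
            continue;
        if (!is_faithful_mp(x, r1))                  /* rare case: fall back to the slow check */
            return 0;
    }
    return 1;
}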
The only time I needed, it was to check if the GPU was IEEE 754 compliant.
It was not.
Anyway, I didn't use a comparison in a programming language. I just ran the program on the CPU and on the GPU, producing some binary output (no literals), and compared the outputs with a simple diff.
There are plenty of possible reasons.
Since I know Squeak/Pharo Smalltalk better, here are a few trivial examples taken from it (it relies on a strict IEEE 754 model):
Float>>isFinite
    "simple, byte-order independent test for rejecting Not-a-Number and (Negative)Infinity"
    ^(self - self) = 0.0

Float>>isInfinite
    "Return true if the receiver is positive or negative infinity."
    ^ self = Infinity or: [self = NegativeInfinity]

Float>>predecessor
    | ulp |
    self isFinite ifFalse: [
        (self isNaN or: [self negative]) ifTrue: [^self].
        ^Float fmax].
    ulp := self ulp.
    ^self - (0.5 * ulp) = self
        ifTrue: [self - ulp]
        ifFalse: [self - (0.5 * ulp)]
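The same ideas translate directly to C (the function names here are made up), assuming strict IEEE 754 semantics, i.e. no -ffast-math, which would license the compiler to fold these tests away:
#include <math.h>

int is_finite_eq(double x)   { return x - x == 0.0; }   /* NaN and +/-Inf make x - x a NaN */
int is_infinite_eq(double x) { return x == INFINITY || x == -INFINITY; }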
I'm sure that you would find some more involved uses of == if you open some libm implementation and check... Unfortunately, I don't know how to search for == through the github web interface, but manually I found this example in julia's openlibm (a variant of fdlibm):
https://github.com/JuliaLang/openlibm/blob/master/src/s_remquo.c
remquo(double x, double y, int *quo)
{
    ...
fixup:
    INSERT_WORDS(x,hx,lx);
    y = fabs(y);
    if (y < 0x1p-1021) {
        if (x+x>y || (x+x==y && (q & 1))) {
            q++;
            x-=y;
        }
    } else if (x>0.5*y || (x==0.5*y && (q & 1))) {
        q++;
        x-=y;
    }
    GET_HIGH_WORD(hx,x);
    SET_HIGH_WORD(x,hx^sx);
    q &= 0x7fffffff;
    *quo = (sxy ? -q : q);
    return x;
}
Here, the remainder function answers a result x between -y/2 and y/2. If it is exactly y/2, then there are 2 choices (a tie)... The == test under the fixup label is there to detect the case of an exact tie (resolved so as to always have an even quotient).
There are also a few == zero tests, for example in __ieee754_logf (test for the trivial case log(1)) or __ieee754_rem_pio2 (modulo pi/2, used for trigonometric functions).

Time taken in executing % / * + - operations

Recently, I heard that the % operator is costly in terms of time.
So the question is: is there a faster way to find the remainder?
It would also help if anyone could explain the difference in execution cost between the % / * + - operations.
In some cases where you're using power-of-2 divisors you can do better with roll-your-own techniques for calculating remainder, but generally a halfway decent compiler will do the best job possible with variable divisors, or "odd" divisors that don't fit any pattern.
Note that a few CPUs don't even have a multiply operation, and so (on those) multiply is quite slow vs add (at least 64x for a 32-bit multiply). (But a smart compiler may improve on this if the multiplier is a literal.) A slightly larger number do not have a divide operation or have a pretty slow one. (On a CPU with a fast multiplier multiply may only be on the order of 4 times slower than add, but on "normal" hardware it's 16-32 times slower for a 32 bit operation. Divide is inherently 2-4x slower than multiply, but can be much slower on some hardware.)
The remainder operation is rarely implemented in hardware, and normally A % B maps to something along the lines of A - ((A / B) * B) (a few extra operations may be required to assure the proper sign, et al).
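Spelled out in C (the function names are made up; the mask shortcut is only valid for unsigned operands and a power-of-two divisor):
#include <stdint.h>

uint32_t rem_generic(uint32_t a, uint32_t b) { return a - (a / b) * b; }  /* same result as a % b */
uint32_t rem_pow2(uint32_t a, uint32_t p2)   { return a & (p2 - 1); }     /* p2 must be a power of two */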
(I learned about this stuff while microprogramming the instruction set for the SUMC computer for RCA/NASA back in the early 70s.)
No, the compiler is going to implement % in the most efficient way possible.
In terms of speed, + and - are the fastest (and are equally fast, generally done by the same hardware).
*, /, and % are much slower. Multiplication is basically done by the method you learn in grade school: multiply the first number by every digit of the second number and add the results, with some hacks made possible by binary. As of a few years ago, multiply was 3x slower than add. Division should be similar to multiply. Remainder is similar to division (in fact it generally calculates both at once).
Exact differences depend on the CPU type and exact model. You'd need to look up the latencies in the CPU spec sheets for your particular machine.

Addition efficiency proportional to size of operands?

This is a question that just popped into my head, and I don't think I've seen an answer on here: is the time taken by a binary addition algorithm proportional to the size of the operands?
Obviously, adding 1101011010101010101101010 and 10110100101010010101 is going to take longer than 1 + 1, but my question refers more to the smaller values. Is there a negligible difference, no difference, a theoretical difference?
At what point, with these sorts of rudimentary calculations, should we start looking into more efficient methods of calculation? E.g. exponentiation by squaring with large exponents for calculating huge powers.
How we see the binary patterns...
1101011010101010101101010 (big)
10110100101010010101 (medium)
1 (small)
How a 32bit computer sees the binary patterns...
00000001101011010101010101101010 32bit,
00000000000010110100101010010101 32bit,
00000000000000000000000000000001 i'm lovin it
On a 32bit system, all the above numbers will take the same time (no. of CPU instructions) to be added. As all of them fit within the basic computational block i.e. the 32bit CPU register.
How a 16bit computer sees the binary patterns...
1
+1 = ?
0000000000000001 i'm lovin it
0000000000000001 i'm lovin it
00000001101011010101010101101010
+00000000000010110100101010010101 = ?
00000001101011010101010101101010 too BIG for me!
00000000000010110100101010010101 too BIG for me!
On a 16bit system, as the larger numbers will NOT fit in a 16bit register, it will need an additional pass (to add the significant bits that remain after the first 16 LSBs are added).
Step1: ADD Least significant bits
0101010101101010
0100101010010101
Step2: ADD the rest (remember carry bit from previous operation)
000000000000000C
0000000110101101
0000000000001011
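The two-pass scheme above, written out as a C sketch for a hypothetical 16-bit machine (the function name is made up, and uint16_t stands in for the 16-bit registers):
#include <stdint.h>

uint32_t add32_on_16bit(uint32_t a, uint32_t b)
{
    uint16_t a_lo = (uint16_t)a,         b_lo = (uint16_t)b;
    uint16_t a_hi = (uint16_t)(a >> 16), b_hi = (uint16_t)(b >> 16);

    uint16_t lo    = (uint16_t)(a_lo + b_lo);         /* Step 1: add the 16 LSBs */
    uint16_t carry = lo < a_lo;                       /* did Step 1 wrap around? */
    uint16_t hi    = (uint16_t)(a_hi + b_hi + carry); /* Step 2: add the rest plus the carry */

    return ((uint32_t)hi << 16) | lo;
}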
We can start thinking of optimising the mathematical operations on numbers once the numbers no longer fit in the basic computation unit of the system, i.e. the CPU register.
Modern hardware architectures are developed keeping this in mind and support SIMD instructions. Compilers will often employ them (SSE on x86, NEON on ARM) when they detect such a case, e.g. 128-bit decryption logic being run on a 32-bit system.
Also, it is not only the size of the operands that matters: the size of the result also determines whether the system can accomplish the mathematical operation within one step, so the operation being performed needs to be taken into consideration as well.
For example, on a 32bit system, adding two 30bit numbers can be definitely carried out using the regular operations as the result is guaranteed to NOT exceed a 32bit register. But multiplying the same two 30bit numbers may result in a number that does NOT fit within 32bits.
In the absence of such a guarantee of being able to store the result in a single computational unit, to ensure validity of the result for all possible values, the architecture(and the compiler) must :
go the long way i.e. multi-step mathematical operations
or
employ SIMD optimisations
or
define and implement custom mechanisms
(like register-pairs EDX:EAX to hold the result on x86)
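The 30-bit example above spelled out in C (the function names are made up): the sum always fits in 32 bits, but the product in general needs a wider type or a register pair.
#include <stdint.h>

uint32_t sum30(uint32_t a, uint32_t b)  { return a + b; }            /* at most 31 bits: fits in a 32-bit register */
uint64_t prod30(uint32_t a, uint32_t b) { return (uint64_t)a * b; }  /* up to 60 bits: does not fit in 32 */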
In practice, there's no (or completely negligible) difference between adding different integers that fit in the processor words as that should always be a fixed-time operation.
In theory, the complexity for adding two unsigned integers should be O(log(n)) where n is the bigger of the two. As such, you need to go pretty high before mere additions become a problem.
As for where exactly to draw the line between simple and complex algorithms for computing numbers, I don't have an exact answer. However, the GMP library comes to mind. From what I understand, they've carefully chosen their algorithms and under what circumstances to use each in terms of performance. You may want to look into what they did.
I somewhat disagree with the above answers. It very much depends on the context.
For simple integer arithmetic (for loop counters etc), then on 64bit machines that computation will be done using 64bit general purpose registers (RSI/RCX/etc). In those cases, there is no difference in speed between an 8bit or 64bit addition.
If however you are processing arrays of integers, and assuming the compiler has been able to optimise the code nicely, then yes, smaller is faster (but not for the reason you think).
In the AVX2 instruction set, you have access to 4 integer addition instructions:
__m256i _mm256_add_epi8 (__m256i a, __m256i b); // 32 x 8bit
__m256i _mm256_add_epi16(__m256i a, __m256i b); // 16 x 16bit
__m256i _mm256_add_epi32(__m256i a, __m256i b); // 8 x 32bit
__m256i _mm256_add_epi64(__m256i a, __m256i b); // 4 x 64bit
You'll notice that all of them operate on 256bits at a time, which means you can process 4 integer additions if you're using 64bit, compared to 32 additions if you are using 8bit integers. (As mentioned above, you'd need to make sure you have enough precision). They all take the same number of clock cycles to compute - 1clk.
There are also other effects of using smaller data types, which are mainly better CPU cache usage, and a reduced number of memory reads/writes.
However, back to your original question on bit-by-bit computation. Prior to the new AVX-512 instruction set, this might have seemed a little silly. However, the new instruction set contains a ternary logic instruction. With this instruction, it is possible to compute 512 additions on numbers of any bit length fairly easily.
inline __m512i add(__m512i x, __m512i y, __m512i carry_in)
{
    // 0x96 = three-way XOR: the sum bit of a full adder
    return _mm512_ternarylogic_epi32(carry_in, y, x, 0x96);
}
inline __m512i adc(__m512i x, __m512i y, __m512i carry_in)
{
    // 0xE8 = majority function: the carry-out bit of a full adder
    return _mm512_ternarylogic_epi32(carry_in, y, x, 0xE8);
}

__m512i A[NUM_BITS];       // A[i] holds bit i of 512 independent numbers
__m512i B[NUM_BITS];       // B[i] holds bit i of 512 independent numbers
__m512i RESULT[NUM_BITS];
__m512i CARRY = _mm512_setzero_si512();   // zero initial carry

for(int i = 0; i < NUM_BITS; ++i)
{
    RESULT[i] = add(A[i], B[i], CARRY);   // sum bits for this bit position
    CARRY = adc(A[i], B[i], CARRY);       // carry into the next bit position
}
In this particular example (which, to be honest, probably has very limited real-world usage!), the time it takes to perform the 512 additions is indeed directly proportional to NUM_BITS.
