On a particular STM32 microcontroller, the system clock is driven by a PLL whose frequency F is given by the following formula:
F := (S/M * (N + K/8192)) / P
S is the PLL input source frequency (1 to 64000000 Hz, i.e. up to 64 MHz).
The other factors M, N, K, and P are the parameters the user can modify to calibrate the frequency. Judging by the bitmasks in the SDK I'm using, each is limited to M < 64, N < 512, K < 8192, and P < 128.
Unfortunately, my target firmware does not have FPU support, so floating-point arithmetic is out. Instead, I need to compute F using integer-only arithmetic.
I have tried to rearrange the given formula with 3 goals in mind:
Expand and distribute all multiplication factors
Minimize the number of factors in each denominator
Minimize the total number of divisions performed
If two expressions have the same number of divisions, choose the one whose denominators have the smallest maximum value (per the limits identified in the earlier paragraph)
However, all of my attempts to expand and rearrange the expression produce errors greater than the original formula as first expressed verbatim.
To test out different arrangements of the formula and compare error, I've written a small Go program you can run online here.
Is it possible to improve this formula so that error is minimized when using integer arithmetic? Also are any of my goals listed above incorrect or useless?
I took your formula (your first set of parentheses is redundant, so I removed it):
 S            K
--- * ( N + ------ )
 M           8192
--------------------
         P
and ran through QuickMath [1], and I got this:
S * (8192 * N + K)
------------------
8192 * M * P
or in Go code:
S * (8192 * N + K) / (8192 * M * P)
So it does reduce the number of divisions. You could improve it further by pulling the power-of-two constant out as a shift:
S * (8192 * N + K) / (M * P) >> 13
[1] https://quickmath.com
Looking at the answer by @StevenPerry, I realized the majority of the error is introduced by the limited precision we have to represent K/8192. This error then gets propagated into the other factors and dividends.
Postponing that division, however, often results in integer overflow before it's ever reached. Thus, the solution I've found unfortunately depends on widening these operands to 64-bit.
The result is of the same form as the other answer, but it must be emphasized that widening the operands to 64-bit is essential. In Go source code, this looks like:
var S, N, M, P, K uint32
...
F := uint32(uint64(S) * uint64(8192*N+K) / uint64(8192*M*P))
To see the accuracy of all three of these expressions, run the code yourself on the Go Playground.
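For anyone doing the same from C or C++, here is a minimal sketch of the equivalent widened computation (pllFreq is a hypothetical name; it assumes the parameter ranges given in the question, under which the 64-bit intermediate cannot overflow):

#include <cstdint>

// Widen to 64 bits before multiplying: with S < 2^26 and
// 8192*N + K < 2^23, the product stays far below 2^63.
uint32_t pllFreq(uint32_t S, uint32_t M, uint32_t N, uint32_t K, uint32_t P) {
    return static_cast<uint32_t>(
        static_cast<uint64_t>(S) * (8192ULL * N + K) / (8192ULL * M * P));
}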
We have two int8 matrices
A = B = [200, 200; 200, 200]. How can we get the int matrix product
C = A * B without converting A and B in advance?
Just use
C = A.cast<int>() * B.cast<int>();
If you want to make sure that no temporaries are generated (for casting A or B to int), try
C = A.cast<int>().lazyProduct(B.cast<int>());
For small (fixed-size) matrices this is likely equivalent to the standard product above. What is generated depends on your compiler (and optimization level and target machine).
If the code is performance critical, always benchmark and have a look at the generated assembly.
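For completeness, a minimal self-contained sketch (the values here are illustrative; note that 100*100 + 100*100 = 20000 already overflows int8, which is exactly why the cast matters):

#include <Eigen/Dense>
#include <cstdint>
#include <iostream>

int main() {
    Eigen::Matrix<std::int8_t, 2, 2> A, B;
    A << 100, 100,
         100, 100;
    B = A;
    // Cast the operands so products and sums accumulate in int, not int8.
    Eigen::Matrix<int, 2, 2> C = A.cast<int>() * B.cast<int>();
    std::cout << C << std::endl;  // every entry is 20000
}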
I'll make examples in Python, since I use Python, but the question is not about Python.
Let's say I want to increment a variable by a specific value so that it stays within given boundaries.
So for increment and decrement I have these two functions:
def up(a, s, Bmax):
    r = a + s
    if r > Bmax: return Bmax
    else: return r

def down(a, s, Bmin):
    r = a - s
    if r < Bmin: return Bmin
    else: return r
Note: it is assumed that the initial value of the variable "a" is already within the boundaries (min <= a <= max), so additional initial checking does not belong in these functions. What makes me curious is that almost every program I write needs these functions.
The question is:
Are these classified as typical operations, and do they have specific names?
If yes, do they correspond to intrinsic processor functionality, so that they are optimised by some compilers?
The reason I ask is pure curiosity; of course I cannot optimise it in Python, and I know little about CPU architecture.
To be more specific: at a lower level, for an unsigned 8-bit integer, I suppose the increment would look like this:
def up(a, s, Bmax):
    counter = 0
    while True:
        if counter == s: break
        if a == Bmax: break
        if a == 255: break
        a += 1
        counter += 1
    return a
I know the latter would not make any sense in Python, so treat it as my naive attempt to imagine low-level code which adds the value in place. There are some nuances, e.g. signed vs. unsigned, but I am interested mainly in unsigned integers since I come across them more often.
It is called saturation arithmetic; it has native support on DSPs and GPUs (not a random pair: both deal with signals).
For example, the NVIDIA PTX ISA lets the programmer choose whether an addition is saturated or not:
add.type d, a, b;
add{.sat}.s32 d, a, b; // .sat applies only to .s32
.sat
limits result to MININT..MAXINT (no overflow) for the size of the operation.
The TI TMS320C64x/C64x+ DSP has support for
Dual 16-bit saturated arithmetic operations
and instructions like sadd to perform a saturated add, and even a whole register (the Saturation Status Register) dedicated to collecting precise information about any saturation that occurs while executing a sequence of instructions.
Even the mainstream x86 has support for saturation with instructions like vpaddsb and similar (including conversions).
Another example is the GLSL clamp function, used to make sure color values are not outside the range [0, 1].
In general, if an architecture is optimized for signal/media processing, it has support for saturation arithmetic.
Much rarer is support for saturation with arbitrary bounds, e.g. asymmetrical bounds, non-power-of-two bounds, non-word-sized bounds.
However, saturation can be implemented easily as min(max(v, b), B) where v is the result of the unsaturated (and not overflowed) operation, b the lower bound and B the upper bound.
So any architecture that supports finding the minimum and the maximum without a branch can implement any form of saturation efficiently.
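As a minimal C++ sketch of that min/max formulation (sat_add is a hypothetical helper; widening to int guarantees the intermediate sum itself cannot overflow):

#include <algorithm>
#include <cstdint>

// Saturating add with arbitrary bounds, via min(max(v, b), B).
std::uint8_t sat_add(std::uint8_t a, std::uint8_t s,
                     std::uint8_t bmin, std::uint8_t bmax) {
    int v = int(a) + int(s);  // the unsaturated result, no overflow possible
    return static_cast<std::uint8_t>(
        std::min(std::max(v, int(bmin)), int(bmax)));
}

A compiler will typically turn the min/max pair into branchless conditional-move or SIMD min/max instructions.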
See also this question for a more realistic example of how saturated addition is implemented.
As a side note, the default behavior is wrap-around: for 8-bit quantities the sum 255 + 1 equals 0 (i.e. operations are modulo 2^8).
I'm unsure of the sophistication of mobile GPUs, but aware that conditional statements have historically been expensive, so I have been experimenting with replacing conditionals with mathematical equivalents. I've read that the ALUs (and FPUs?) of modern GPUs can execute rounding and clamping "for free". 90% of the time that I need an if statement, I can harvest the equivalent result from a rounding function. The trick is as follows:
// with an if statement
vec4 a_fac;
vec4 b_fac;
if (a > b) {
    result = a * a_fac;
} else {
    result = b * b_fac;
}
Instead, perform the following math:
// equivalent math
a_greater = floor(0.5 + (a - b));
result = a_greater * a * a_fac + (1.0 - a_greater) * b * b_fac;
This is only viable when the magnitudes of a and b are always in the unity range and the difference between a and b is always less than 0.5 (to avoid the floor producing values outside {0.0, 1.0}). To prevent floor from going awry, I can use the following modification:
// equivalent math
a_greater = clamp(floor(0.5 + (a - b)), 0.0, 1.0);
result = a_greater * a * a_fac + (1.0 - a_greater) * b * b_fac;
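To sanity-check the select logic off-GPU, here is a scalar C++ mirror of the same math (selectResult is a placeholder name, using float where the shader uses vec4):

#include <cmath>

// Branchless select: a_greater is 1.0 when the floored difference
// indicates a > b, otherwise 0.0; the clamp guards stray floor results.
float selectResult(float a, float b, float a_fac, float b_fac) {
    float a_greater =
        std::fmin(std::fmax(std::floor(0.5f + (a - b)), 0.0f), 1.0f);
    return a_greater * a * a_fac + (1.0f - a_greater) * b * b_fac;
}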
Obviously this means many more FP calculations, so I'm not entirely sure if this technique is "worth it" or how to assemble a benchmark without becoming texture-fill limited. The maximum shader instruction count might become a limit unless I create a loop, which is also expensive?
I've been profiling some of our core math on an Intel Core Duo, and while looking at various approaches to square root I've noticed something odd: using the SSE scalar operations, it is faster to take a reciprocal square root and multiply it to get the sqrt, than it is to use the native sqrt opcode!
I'm testing it with a loop something like:
inline float TestSqrtFunction( float in );

void TestFunc()
{
    #define ARRAYSIZE 4096
    #define NUMITERS 16386
    float flIn[ ARRAYSIZE ];  // filled with random numbers ( 0 .. 2^22 )
    float flOut[ ARRAYSIZE ]; // filled with 0 to force fetch into L1 cache

    cyclecounter.Start();
    for ( int i = 0 ; i < NUMITERS ; ++i )
        for ( int j = 0 ; j < ARRAYSIZE ; ++j )
        {
            flOut[j] = TestSqrtFunction( flIn[j] );
            // unrolling this loop makes no difference -- I tested it.
        }
    cyclecounter.Stop();

    printf( "%d loops over %d floats took %.3f milliseconds",
            NUMITERS, ARRAYSIZE, cyclecounter.Milliseconds() );
}
I've tried this with a few different bodies for the TestSqrtFunction, and I've got some timings that really make me scratch my head. The worst of all by far was using the native sqrt() function and letting the "smart" compiler "optimize". At 24ns/float using the x87 FPU, this was pathetically bad:
inline float TestSqrtFunction( float in )
{ return sqrt(in); }
The next thing I tried was using an intrinsic to force the compiler to use SSE's scalar sqrt opcode:
inline void SSESqrt( float * restrict pOut, float * restrict pIn )
{
    _mm_store_ss( pOut, _mm_sqrt_ss( _mm_load_ss( pIn ) ) );
    // compiles to movss, sqrtss, movss
}
This was better, at 11.9ns/float. I also tried Carmack's wacky Newton-Raphson approximation technique, which ran even better than the hardware, at 4.3ns/float, although with an error of 1 in 2^10 (which is too much for my purposes).
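For reference, the technique in question is a bit-level initial guess refined by one Newton-Raphson step; a sketch (FastRsqrt is a placeholder name, with the commonly quoted magic constant):

#include <cstdint>
#include <cstring>

// Approximate 1/sqrt(x); sqrt(x) is then recovered as x * FastRsqrt(x).
inline float FastRsqrt( float x )
{
    float xhalf = 0.5f * x;
    std::uint32_t i;
    std::memcpy( &i, &x, sizeof i );      // reinterpret the float's bits
    i = 0x5f3759df - ( i >> 1 );          // magic initial estimate
    std::memcpy( &x, &i, sizeof x );
    return x * ( 1.5f - xhalf * x * x );  // one Newton-Raphson refinement
}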
The doozy was when I tried the SSE op for reciprocal square root, and then used a multiply to get the square root ( x * 1/√x = √x ). Even though this takes two dependent operations, it was the fastest solution by far, at 1.24ns/float and accurate to 2^-14:
inline void SSESqrt_Recip_Times_X( float * restrict pOut, float * restrict pIn )
{
    __m128 in = _mm_load_ss( pIn );
    _mm_store_ss( pOut, _mm_mul_ss( in, _mm_rsqrt_ss( in ) ) );
    // compiles to movss, movaps, rsqrtss, mulss, movss
}
My question is basically what gives? Why is SSE's built-in-to-hardware square root opcode slower than synthesizing it out of two other math operations?
I'm sure that this is really the cost of the op itself, because I've verified:
all data fits in cache, and accesses are sequential
the functions are inlined
unrolling the loop makes no difference
compiler flags are set to full optimization (and the assembly is good, I checked)
(edit: stephentyrone correctly points out that operations on long strings of numbers should use the vectorizing SIMD packed ops, like rsqrtps — but the array data structure here is for testing purposes only: what I am really trying to measure is scalar performance for use in code that can't be vectorized.)
sqrtss gives a correctly rounded result. rsqrtss gives an approximation to the reciprocal, accurate to about 11 bits.
sqrtss is generating a far more accurate result, for when accuracy is required. rsqrtss exists for the cases when an approximation suffices, but speed is required. If you read Intel's documentation, you will also find an instruction sequence (reciprocal square-root approximation followed by a single Newton-Raphson step) that gives nearly full precision (~23 bits of accuracy, if I remember properly), and is still somewhat faster than sqrtss.
edit: If speed is critical, and you're really calling this in a loop for many values, you should be using the vectorized versions of these instructions, rsqrtps or sqrtps, both of which process four floats per instruction.
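That sequence (rsqrtss, one Newton-Raphson step, then a multiply to recover the square root) might look like the following sketch; the exact instruction scheduling is up to the compiler:

#include <xmmintrin.h>

// sqrt(x) ~= x * (refined 1/sqrt(x)): rsqrtss gives ~12 bits, and one
// N-R step, y' = 0.5*y*(3 - x*y*y), brings it to roughly 22-23 bits.
inline float SSESqrt_RsqrtNR( float x )
{
    __m128 v   = _mm_set_ss( x );
    __m128 y   = _mm_rsqrt_ss( v );                      // ~12-bit 1/sqrt(x)
    __m128 xyy = _mm_mul_ss( v, _mm_mul_ss( y, y ) );    // x*y*y
    y = _mm_mul_ss( _mm_mul_ss( _mm_set_ss( 0.5f ), y ),
                    _mm_sub_ss( _mm_set_ss( 3.0f ), xyy ) );
    return _mm_cvtss_f32( _mm_mul_ss( v, y ) );          // x * 1/sqrt(x)
}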
There are a number of other answers to this already from a few years ago. Here's what the consensus got right:
The rsqrt* instructions compute an approximation to the reciprocal square root, good to about 11-12 bits.
It's implemented with a lookup table (i.e. a ROM) indexed by the mantissa. (In fact, it's a compressed lookup table, similar to mathematical tables of old, using adjustments to the low-order bits to save on transistors.)
The reason why it's available is that it is the initial estimate used by the FPU for the "real" square root algorithm.
There's also an approximate reciprocal instruction, rcp. Both of these instructions are a clue to how the FPU implements square root and division.
Here's what the consensus got wrong:
SSE-era FPUs do not use Newton-Raphson to compute square roots. It's a great method in software, but it would be a mistake to implement it that way in hardware.
The N-R algorithm to compute reciprocal square root has this update step, as others have noted:
x' = 0.5 * x * (3 - n*x*x);
That's a lot of data-dependent multiplications and one subtraction.
What follows is the algorithm that modern FPUs actually use.
Given b[0] = n, suppose we can find a series of numbers Y[i] such that b[n] = b[0] * Y[0]^2 * Y[1]^2 * ... * Y[n]^2 approaches 1. Then consider:
x[n] = b[0] * Y[0] * Y[1] * ... * Y[n]
y[n] = Y[0] * Y[1] * ... * Y[n]
Clearly x[n] approaches sqrt(n) and y[n] approaches 1/sqrt(n).
We can use the Newton-Raphson update step for reciprocal square root to get a good Y[i]:
b[i] = b[i-1] * Y[i-1]^2
Y[i] = 0.5 * (3 - b[i])
Then:
x[0] = n * Y[0]
x[i] = x[i-1] * Y[i]
and:
y[0] = Y[0]
y[i] = y[i-1] * Y[i]
The next key observation is that b[i] = x[i-1] * y[i-1]. So:
Y[i] = 0.5 * (3 - x[i-1] * y[i-1])
= 1 + 0.5 * (1 - x[i-1] * y[i-1])
Then:
x[i] = x[i-1] * (1 + 0.5 * (1 - x[i-1] * y[i-1]))
     = x[i-1] + x[i-1] * 0.5 * (1 - x[i-1] * y[i-1])
y[i] = y[i-1] * (1 + 0.5 * (1 - x[i-1] * y[i-1]))
     = y[i-1] + y[i-1] * 0.5 * (1 - x[i-1] * y[i-1])
That is, given initial x and y, we can use the following update step:
r = 0.5 * (1 - x * y)
x' = x + x * r
y' = y + y * r
Or, even fancier, we can set h = 0.5 * y. This is the initialisation:
Y = approx_rsqrt(n)
x = Y * n
h = Y * 0.5
And this is the update step:
r = 0.5 - x * h
x' = x + x * r
h' = h + h * r
This is Goldschmidt's algorithm, and it has a huge advantage if you're implementing it in hardware: the "inner loop" is three multiply-adds and nothing else, and two of them are independent and can be pipelined.
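In software the same iteration can be sketched like this (goldschmidt_sqrt is a hypothetical helper; a real FPU seeds y0 from its internal lookup table rather than from code):

// x converges to sqrt(n) and h to 0.5/sqrt(n); y0 is an initial
// estimate of 1/sqrt(n), e.g. from rsqrtss or a small table.
float goldschmidt_sqrt(float n, float y0, int steps) {
    float x = y0 * n;
    float h = y0 * 0.5f;
    for (int i = 0; i < steps; ++i) {
        float r = 0.5f - x * h;  // three multiply-adds per step;
        x = x + x * r;           // these two updates are independent
        h = h + h * r;           // and can be pipelined
    }
    return x;
}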
In 1999, FPUs already needed a pipelined add/subtract circuit and a pipelined multiply circuit, otherwise SSE would not be very "streaming". Only one of each circuit was needed in 1999 to implement this inner loop in a fully-pipelined way without wasting a lot of hardware just on square root.
Today, of course, we have fused multiply-add exposed to the programmer. Again, the inner loop is three pipelined FMAs, which are (again) generally useful even if you're not computing square roots.
This is also true for division. MULSS(a,RCPSS(b)) is way faster than DIVSS(a,b). In fact it's still faster even when you increase its precision with a Newton-Raphson iteration.
Intel and AMD both recommend this technique in their optimisation manuals. In applications which don't require IEEE-754 compliance, the only reason to use div/sqrt is code readability.
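A sketch of the division analogue (RCPSS plus one Newton-Raphson step, r' = r * (2 - b*r)); again, only appropriate when exact IEEE-754 rounding is not required:

#include <xmmintrin.h>

// a / b via the reciprocal approximation: rcpss gives ~12 bits of 1/b,
// one N-R step refines it, then a single multiply by a finishes the job.
inline float FastDiv( float a, float b )
{
    __m128 vb = _mm_set_ss( b );
    __m128 r  = _mm_rcp_ss( vb );                        // ~12-bit 1/b
    r = _mm_mul_ss( r, _mm_sub_ss( _mm_set_ss( 2.0f ),
                                   _mm_mul_ss( vb, r ) ) );
    return _mm_cvtss_f32( _mm_mul_ss( _mm_set_ss( a ), r ) );
}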
Instead of supplying an answer that might actually be incorrect (I'm also not going to check or argue about cache and other stuff; let's say they are identical), I'll try to point you to a source that can answer your question.
The difference might lie in how sqrt and rsqrt are computed. You can read more at http://www.intel.com/products/processor/manuals/. I'd suggest starting by reading about the processor functions you are using; there is some info, especially about rsqrt (the CPU uses an internal lookup table with a rough approximation, which makes it much simpler to get the result). It may seem that rsqrt is so much faster than sqrt that one additional mul operation (which isn't too costly) doesn't change the situation here.
Edit: A few facts that might be worth mentioning:
1. Once, while doing some micro-optimizations for my graphics library, I used rsqrt for computing the length of vectors (instead of sqrt, I multiplied my sum of squares by its rsqrt, which is exactly what you've done in your tests), and it performed better.
2. Computing rsqrt using a simple lookup table might be easier: for rsqrt, as x goes to infinity, 1/sqrt(x) goes to 0, so for large x the function values don't change (much), whereas sqrt goes off to infinity, so it's not that simple a case ;).
Also, a clarification: I'm not sure where I found it in the books I linked, but I'm pretty sure I've read that rsqrt uses some lookup table, and that it should only be used when the result doesn't need to be exact, although I might be wrong as well, as it was some time ago :).
Newton-Raphson converges to the zero of f(x) using increments equal to -f/f', where f' is the derivative.
For x=sqrt(y), you can try to solve f(x) = 0 for x using f(x) = x^2 - y;
Then the increment is: dx = -f/f' = 1/2 (y/x - x) = 1/2 (y - x^2) / x
which has a slow divide in it.
You can try other functions (like f(x) = 1/y - 1/x^2) but they will be equally complicated.
Let's look at 1/sqrt(y) now. You can try f(x) = x^2 - 1/y, but it will be equally complicated: dx = (1 - y*x^2) / (2*x*y), for instance.
One non-obvious alternate choice for f(x) is: f(x) = y - 1/x^2
Then: dx = -f/f' = -(y - 1/x^2) / (2/x^3) = 1/2 * x * (1 - y * x^2)
Ah! It's not a trivial expression, but you only have multiplies in it, no divide. => Faster!
And: the full update step new_x = x + dx then reads:
x *= 3/2 - y/2 * x * x, which is easy too.
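In code, one such multiply-only refinement step might read (rsqrt_nr_step is a hypothetical name):

// One Newton-Raphson step for x ~ 1/sqrt(y): multiplies only, no divide.
inline float rsqrt_nr_step(float x, float y) {
    return x * (1.5f - 0.5f * y * x * x);
}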
It is faster because these instructions ignore rounding modes and do not handle floating-point exceptions or denormalized numbers. For these reasons it is much easier to pipeline, speculate, and execute other FP instructions out of order.