LUT versus Newton-Raphson Division for IEEE-754 32-bit Floating Point

What are the trade-offs when implementing 32-bit IEEE-754 floating-point division using LUTs versus the Newton-Raphson method?
By trade-offs I mean memory size, instruction count, and so on.
I have a small memory of 130 words (each 16 bits). I store the upper 12 bits of the mantissa (including the hidden bit) in one memory location and the lower 12 bits of the mantissa in another.
Currently I am using Newton-Raphson division, but I am weighing the trade-offs of changing my method. Here is a link to my algorithm: Newton's Method for finding the reciprocal of a floating point number for division
Thank you and please explain your reasoning.

The trade-off is fairly simple. A LUT uses extra memory in the hope of reducing the instruction count enough to save some time. Whether it's effective will depend a lot on the details of the processor -- caching in particular.
For Newton-Raphson, you change X/Y to X * (1/Y) and use your iteration to find 1/Y. At least in my experience, if you need full precision, it's rarely useful -- its primary strength is in letting you find something to (say) 16-bit precision more quickly.
The usual method for division is a bit-by-bit method. Although that particular answer deals with integers, for floating point you do essentially the same thing, except that along with it you subtract the exponents. A floating point number is basically A*2^N, where A is the significand and N is the exponent part of the number. So, you take two numbers A*2^N / B*2^M, and carry out the division as (A/B) * 2^(N-M), with A and B treated as (essentially) integers in this case. The only real difference is that with floating point you normally want to round rather than truncate the result. That basically just means carrying out the division to (at least) one extra bit of precision, then rounding up if that extra bit is a one.
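The exponent bookkeeping described above can be sketched with the standard frexp/ldexp calls (the function name div_via_parts is mine, not from the answer). Note the significand quotient here still uses the hardware divide; a real implementation would divide the integer significands bit by bit and round, as described.

```c
#include <math.h>

/* Sketch: divide A*2^N by B*2^M as (A/B) * 2^(N-M).
   frexp splits a double into a significand in [0.5, 1) and an exponent;
   ldexp reassembles. The significand division itself is left to the
   hardware here; only the exponent handling is illustrated. */
static double div_via_parts(double a, double b) {
    int ea, eb;
    double ma = frexp(a, &ea);      /* a = ma * 2^ea */
    double mb = frexp(b, &eb);      /* b = mb * 2^eb */
    return ldexp(ma / mb, ea - eb); /* (ma/mb) * 2^(ea-eb) */
}
```

The sketch assumes finite, nonzero inputs; zeros, infinities, and NaNs would need the special-case handling the IEEE standard spells out.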
The most common method using lookup tables is SRT division. This is most often done in hardware, so I'd probably Google for something like "Verilog SRT" or "VHDL SRT". Rendering it in C++ shouldn't be terribly difficult, though. Where the method I outlined in the linked answer produces one bit per iteration, this can be written to produce 2, 4, etc. If memory serves, the size of the table grows quadratically with the number of bits produced per iteration, though, so you rarely see much more than 4 bits per iteration in practice.

Each Newton-Raphson step roughly doubles the number of digits of precision, so if you can work out the number of bits of precision you expect from a LUT of a particular size, you should be able to work out how many NR steps you need to attain your desired precision. The Cray-1 used NR as the final stage of its reciprocal calculation. Looking for this I found a fairly detailed article on this sort of thing: An Accurate, High Speed Implementation of Division by Reciprocal Approximation from the 9th IEEE Symposium on Computer Arithmetic (September 6-8, 1989).
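The doubling behaviour is easy to see in code. Here is a minimal sketch of the iteration x' = x*(2 - d*x) (the name nr_recip and the parameters are my own illustration):

```c
/* Newton-Raphson reciprocal iteration: x' = x * (2 - d*x).
   If x approximates 1/d with relative error e, the next iterate has
   relative error e^2, so each step roughly doubles the number of
   correct bits. */
static double nr_recip(double d, double x0, int steps) {
    double x = x0;
    for (int i = 0; i < steps; i++)
        x = x * (2.0 - d * x);
    return x;
}
```

With a seed good to about 3 bits (e.g. x0 = 0.3 for d = 3), four steps already reach roughly double-precision accuracy; a 12-bit LUT seed would need only a single step for single precision, which is the trade-off the question is about.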

Related

Precision of digital computers

I read that multiplying multiple values between 0 and 1 will significantly reduce the precision of digital computers; I want to know the basis for this claim. And does it still hold for modern-day computers?
The typical IEEE-conformant representation of fractional numbers only supports a limited number of (binary) digits. So, very often, the result of some computation isn't an exact representation of the expected mathematical value, but something close to it (rounded to the next number representable within the digits limit), meaning that there is some amount of error in most calculations.
If you do multi-step calculations, you might be lucky that the error introduced by one step is compensated by some complementary error at a later step. But that's pure luck, and statistics teaches us that the expected error will indeed increase with every step.
If you e.g. do 1000 multiplications using the float datatype (typically achieving 6-7 significant decimal digits accuracy), I'd expect the result to be correct only up to about 5 digits, and in worst case only 3-4 digits.
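A small sketch of that accumulation (my own illustration, not from the answer): run the same 1000-step product in float and in double and compare. Each float multiply can contribute up to about 2^-24 relative error, so after 1000 steps the two results can disagree from roughly the fifth significant digit on.

```c
/* Run the same 1000-step product in float and in double precision.
   The double chain stays accurate to ~15 digits, so the relative gap
   between the two results is essentially the float chain's
   accumulated rounding error. */
static double chain_relative_error(void) {
    float  pf = 1.0f;
    double pd = 1.0;
    double factor = 1.001f;   /* identical value feeds both chains */
    for (int i = 0; i < 1000; i++) {
        pf *= (float)factor;
        pd *= factor;
    }
    return (pd - pf) / pd;    /* signed relative error of the float chain */
}
```

The worst-case bound is 1000 * 2^-24, about 6e-5, which matches the "correct to about 5 digits" estimate above; the typical error is smaller because individual roundings partly cancel.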
There are ways to do precise calculations (at least for addition, subtraction, multiplication and division), e.g. using the ratio type in the LISP programming language, but in practice they are rarely used.
So yes, doing multi-step calculations in datatypes supporting fractional numbers quickly degrades precision, and it happens with all number ranges, not only with numbers between 0 and 1.
If this is a problem for some application, it's a special skill to transform mathematical formulas into equivalent ones that can be computed with better precision (e.g. formulas with fewer intermediate steps).

Fast algorithm for computing 1/d? (SRT, Goldschmidt, Newton-Raphson, ...)

I want to find a fast algorithm for computing 1/d, where d is a double (although it can be converted to an integer). Which is the best of the many algorithms (SRT, Goldschmidt, Newton-Raphson, ...)? I'm writing my program in C.
Thanks in advance.
The fastest program is: double result = 1 / d;
CPUs already use an iterative root-finding algorithm like the ones you describe to find the reciprocal 1/d, so you will find it difficult to beat with a software implementation of the same algorithm.
If you have few/known denominators then try a lookup table. This is the usual approach for even slower functions such as trig functions.
Otherwise: just compute 1/d. It will be the fastest you can do. And there is an endless list of things you can do to speed up arithmetic if you have to:
use 32-bit (single) instead of 64-bit (double) precision. FP division takes a number of cycles proportional to the number of bits.
vectorize the operations. For example, I believe you can compute four 32-bit float divisions in parallel with SSE2, or even more in parallel by doing it on the GPU.
I asked someone about this, and here is the answer I got:
So, you can't add a hardware divider to the FPGA then? Or fast reciprocal support?
Anyway it depends. Does it have fast multiplication? If not, well, that's a problem, you could only implement the slow methods then.
If you have fast multiplication and IEEE floats, you can use the weird trick I linked to in my previous post with a couple of refinement steps. That's really just Newton-Raphson division with a simpler calculation for the initial approximation (but AFAIK it still takes only 3 refinements for single-precision floats, just like the regular initial approximation). Fast reciprocal support works that way too: produce a fast initial approximation (handling the exponent correctly and getting the significant bits from a lookup table; if you get 12 significant bits that way, you need only one refinement step for single precision, and 13 are enough to reach double precision in 2 steps), and optionally provide instructions that help implement the refinement step (like AMD's PFRCPIT1 and PFRCPIT2), for example to calculate Y = (1 - D*X) and X + X*Y.
Even without those tricks Newton–Raphson division is still not bad, with the linear approximation it takes only 4 refinements for double-precision floats, but it also takes some annoying exponent adjustments to get in the right range first (in hardware that wouldn't be half as annoying).
Goldschmidt division is, AFAIK, roughly equivalent in performance and might have a slightly less complex implementation. It's really the same sort of deal: trickery with the exponent to get into the right range, the "2 - something" estimation trick (which is rearranged in Newton-Raphson division, but it's really the same thing), and repeating the refinement step until all the bits are right. It just looks a little different.
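To make the comparison concrete, here is a minimal Goldschmidt sketch (the name goldschmidt_div and the use of double arithmetic are mine; real hardware works on fixed-point significands, and frexp/ldexp stand in for the exponent trickery). No special cases for zero, negative, or non-finite inputs.

```c
#include <math.h>

/* Goldschmidt division: scale d into [0.5, 1), then repeatedly
   multiply BOTH numerator and denominator by f = 2 - d. The error
   term (1 - d) squares each step, so d converges quadratically to 1
   and n converges to the quotient. */
static double goldschmidt_div(double n, double d) {
    int e;
    double m = frexp(d, &e);   /* d = m * 2^e, m in [0.5, 1) */
    n = ldexp(n, -e);          /* scale the numerator by the same 2^-e */
    d = m;
    for (int i = 0; i < 6; i++) {   /* quadratic convergence: 6 steps is plenty */
        double f = 2.0 - d;
        n *= f;
        d *= f;
    }
    return n;                  /* d is now ~1.0, so n holds n/d */
}
```

Notice the refinement is the same "2 - something" multiply as Newton-Raphson; the difference is that both operands are updated, so the two multiplies per step are independent and can run in parallel in hardware.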

Denormalized Numbers - IEEE 754 Floating Point

So I'm trying to learn more about Denormalized numbers as defined in the IEEE 754 standard for Floating Point numbers. I've already read several articles thanks to Google search results, and I've gone through several StackOverFlow posts. However I still have some questions unanswered.
First off, just to review my understanding of what a Denormalized float is:
Numbers which have fewer bits of precision, and are smaller (in magnitude) than normalized numbers
Essentially, a denormalized float has the ability to represent the SMALLEST (in magnitude) number that is possible to be represented with any floating point value.
Does that sound correct? Anything more to it than that?
I've read that:
using denormalized numbers comes with a performance cost on many platforms
Any comments on this?
I've also read in one of the articles that
one should "avoid overlap between normalized and denormalized numbers"
Any comments on this?
In some presentations of the IEEE standard, when floating point ranges are presented the denormalized values are excluded and the tables are labeled as an "effective range", almost as if the presenter is thinking "We know that denormalized numbers CAN represent the smallest possible floating point values, but because of certain disadvantages of denormalized numbers, we choose to exclude them from ranges that will better fit common use scenarios" -- As if denormalized numbers are not commonly used.
I guess I just keep getting the impression that using denormalized numbers turns out to not be a good thing in most cases?
If I had to answer that question on my own I would want to think that:
Using denormalized numbers is good because you can represent the smallest (in magnitude) numbers possible -- As long as precision is not important, and you do not mix them up with normalized numbers, AND the resulting performance of the application fits within requirements.
Using denormalized numbers is a bad thing because most applications do not require representations so small -- the precision loss is detrimental, you can shoot yourself in the foot too easily by mixing them up with normalized numbers, AND the performance is not worth the cost in most cases.
Any comments on these two answers? What else might I be missing or not understand about denormalized numbers?
Essentially, a denormalized float has the ability to represent the SMALLEST (in magnitude) number that is possible to be represented with any floating point value.
That is correct.
using denormalized numbers comes with a performance cost on many platforms
The penalty is different on different processors, but it can be up to 2 orders of magnitude. The reason? The same as for this advice:
one should "avoid overlap between normalized and denormalized numbers"
Here's the key: denormals are a fixed-point "micro-format" within the IEEE-754 floating-point format. In normal numbers, the exponent indicates the position of the binary point. Denormal numbers contain the last 52 bits in fixed-point notation with a fixed exponent of 2^-1074 for doubles.
So, denormals are slow because they require special handling. In practice, they occur very rarely, and chip makers don't like to spend too many valuable resources on rare cases.
Mixing denormals with normals is slow because then you're mixing formats and you have the additional step of converting between the two.
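You can see the "micro-format" directly by inspecting the bits (an illustration of mine; it assumes IEEE-754 binary64 doubles, which virtually all current platforms use):

```c
#include <stdint.h>
#include <string.h>
#include <float.h>

/* Return the raw IEEE-754 bit pattern of a double. memcpy is the
   portable way to reinterpret the bytes. */
static uint64_t bits_of(double d) {
    uint64_t u;
    memcpy(&u, &d, sizeof u);
    return u;
}

/* The exponent field occupies bits 62..52. A zero field marks a
   denormal (or zero): the 52 significand bits are then read as a
   plain fixed-point integer scaled by 2^-1074, with no hidden bit. */
static unsigned exp_field(double d) {
    return (unsigned)((bits_of(d) >> 52) & 0x7FF);
}
```

The smallest denormal, 2^-1074 (about 4.9e-324), has bit pattern 1, while DBL_MIN, the smallest normal, has exponent field 1 and an all-zero significand.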
I guess I just keep getting the impression that using denormalized numbers turns out to not be a good thing in most cases?
Denormals were created for one primary purpose: gradual underflow. It's a way to keep the relative difference between tiny numbers small. If you go straight from the smallest normal number to zero (abrupt underflow), the relative change is infinite. If you go to denormals on underflow, the relative change is still not fully accurate, but at least more reasonable. And that difference shows up in calculations.
To put it a different way. Floating-point numbers are not distributed uniformly. There are always the same amount of numbers between successive powers of two: 2^52 (for double precision). So without denormals, you always end up with a gap between 0 and the smallest floating-point number that is 2^52 times the size of the difference between the smallest two numbers. Denormals fill this gap uniformly.
As an example about the effects of abrupt vs. gradual underflow, look at the mathematically equivalent x == y and x - y == 0. If x and y are tiny but different and you use abrupt underflow, then if their difference is less than the minimum cutoff value, their difference will be zero, and so the equivalence is violated.
With gradual underflow, the difference between two tiny but different normal numbers gets to be a denormal, which is still not zero. The equivalence is preserved.
So, using denormals on purpose is not advised, because they were designed only as a backup mechanism in exceptional cases.

Faster integer division when denominator is known?

I am working on GPU device which has very high division integer latency, several hundred cycles. I am looking to optimize divisions.
All divisions are by a denominator from the set { 1, 3, 6, 10 }; the numerator is a runtime positive value, roughly 32000 or less. Due to memory constraints, a lookup table may not be a good option.
Can you think of alternatives?
I have thought of computing float point inverses, and using those to multiply numerator.
Thanks
PS. Thank you, people. The bit-shift hack is really cool.
To recover from roundoff, I use the following C segment:
// q = m/n
q += (n*(j+1) - 1) < m;
a/b = a * (1/b)
x = (1 << 16) / b
a/b = (a * x) >> 16
Can you build a lookup table for the denominators? Since you said 15-bit numerators, you could use 17 for the shift if everything is unsigned 32-bit:
a/b = a * ((1 << 17) / b) >> 17
The larger the shift, the smaller the rounding error. You can do a brute-force check to see how many times, if any, this is actually wrong.
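Here is one way to run that brute-force check (a sketch of mine, using the denominator set {1, 3, 6, 10} and the 15-bit numerator range from the question). I round the scaled reciprocal up instead of truncating and widen the shift to 20, which makes the result exact over this range, at the cost of a 64-bit intermediate product:

```c
#include <stdint.h>

/* a/b as a multiply and a shift: x = ceil(2^20 / b) is precomputable
   per denominator. Rounding x up biases the product slightly high,
   but by less than one quotient step for these ranges, so the final
   truncation lands on the exact quotient. */
static uint32_t div_by_const(uint32_t a, uint32_t b) {
    uint64_t x = ((UINT64_C(1) << 20) + b - 1) / b;  /* ceil(2^20 / b) */
    return (uint32_t)(((uint64_t)a * x) >> 20);
}
```

Running the same exhaustive loop with the truncated 17-bit reciprocal from above instead reports the off-by-one cases that the asker's roundoff-recovery line is compensating for.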
The book, "Hacker's Delight" by Henry Warren, has a whole chapter devoted to integer division by constants, including techniques that transform an integer division to a multiply/shift/add series of operations.
This page calculates the magic numbers for the multiply/shift/add operations:
http://www.hackersdelight.org/magic.htm
The standard embedded systems hack for this is to convert an integer division by N into a fixed-point multiplication by 1/N.
Assuming 16 bits, 1/3 ≈ 0.33333 can be represented as 21845 (decimal), i.e. 21845/65536. Multiply, giving a 32-bit integer product, and shift down 16 bits.
You will almost certainly encounter some roundoff (truncation) error. This may or may not be something you can live with.
It MIGHT be worthwhile to look hard at your GPU and see if you can hand-code a faster integer division routine, taking advantage of your knowledge of the restricted range of the numerator.

Why is float division slow?

What are the steps in the algorithm to do floating point division?
Why is the result slower than say, multiplication?
Is it done the same way we do division by hand? By repeatedly subtracting multiples of the divisor to obtain a remainder, aligning again, and continuing until the remainder is less than the divisor?
Also, why do we gain on performance if instead of doing
a = b / c
we do
d = 1 / c
a = b * d
?
Edit:
Basically I was asking because someone asked me to distribute a value among contenders based on the assignment of weights. I did all this in integers and was later asked to convert to float, which caused a slowdown in performance. I was just interested in knowing how would C or C++ do these operations that would cause the slowness.
FPU division often basically uses Newton-Raphson (or some other algorithm) to get a reciprocal then multiplies by that reciprocal. That's why the reciprocal operation is slightly faster than the general division operation.
This HP paper (which is actually more understandable than most papers I come across talking about Newton-Raphson) has this to say about floating point division:
Floating point division and square root take considerably longer to compute than addition and multiplication. The latter two are computed directly while the former are usually computed with an iterative algorithm. The most common approach is to use a division-free Newton-Raphson iteration to get an approximation to the reciprocal of the denominator (division) or the reciprocal square root, and then multiply by the numerator (division) or input argument (square root).
From a hardware point of view, division is an iterative algorithm, and the time it takes is proportional to the number of bits. The fastest division currently around uses a radix-4 algorithm, which generates 4 bits of result per iteration. For a 32-bit divide you need at least 8 steps.
Multiplication can be done in parallel to a certain degree. Without going into detail you can break up a large multiplication into several smaller, independent ones. These multiplications can again be broken down until you're at a bit-level, or you stop earlier and use a small lookup-table in hardware. This makes the multiplication hardware heavy from a silicon real estate point of view but very fast as well. It's the classic size/speed tradeoff.
You need log2(n) steps to combine the partial results, so a 32-bit multiply needs 5 logical steps (if you go down to the minimum). Fortunately these 5 steps are a good deal simpler than the division steps (they're just additions). That means in practice multiplies are even faster.
As described in the Wikipedia article Division algorithm, there are two main approaches to division in computers:
Slow Division
Uses the following recurrence and finds one digit per iteration:
partialRemainder[j+1] = radix * partialRemainder[j] - quotientDigit[n-(j+1)]*denominator
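For radix 2 and non-negative integers, that recurrence is the classic restoring division. A minimal C sketch (the name restoring_div is mine; it assumes den is nonzero and below 2^31 so the shifted partial remainder cannot overflow):

```c
#include <stdint.h>

/* Restoring division, radix 2: shift in one numerator bit per step,
   subtract the denominator when it fits, and record a quotient bit.
   This is the recurrence above with quotient digits 0 and 1. */
static uint32_t restoring_div(uint32_t num, uint32_t den, uint32_t *rem) {
    uint32_t q = 0, r = 0;
    for (int i = 31; i >= 0; i--) {
        r = (r << 1) | ((num >> i) & 1);  /* partial remainder * radix + next bit */
        if (r >= den) {
            r -= den;                     /* digit 1: subtraction succeeded */
            q |= UINT32_C(1) << i;
        }                                 /* else digit 0: nothing to restore */
    }
    *rem = r;
    return q;
}
```

SRT division speeds this up by choosing quotient digits from a table (allowing negative digits) instead of the full comparison, and by retiring more than one bit per step.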
Fast Division
Starts with an estimation and converges on the quotient. How accurate you are depends on the number of iterations.
Newton-Raphson division (very briefly):
Calculate estimate of the reciprocal.
Compute more accurate estimates of the reciprocal.
Compute quotient by multiplying the dividend by the reciprocal.
Think of the hardware involved, and you'll understand a lot better why it takes so much longer to divide than multiply. Both operations are done down at the Floating Point Unit (FPU) level, and even in the world of integral ALUs, the division circuit is a far busier place than a multiplication circuit. I would suspect this is only more painful in the world of floating point, as now the data isn't just least to most significant digit ordered, but is instead ordered by the IEEE 754 standard.
As for the round-off, it's really about wherever the signals traveling between the gates get tied to ground; where that happens, you lose digits. It's not rounding so much as truncation.
Or were you asking about simulating floating point arithmetic using just integers?
You won't gain performance by doing
d = 1 / c
a = b * d
You probably mean:
d = 1 / c
a1 = b1 * d
a2 = b2 * d
This way the division is done only once.
Division is per se slower than multiplication, however, I don't know the details. The basic reason is that, similar to functions such as sin or sqrt, it's just mathematically more complex. IIRC, a multiplication takes about 10 cycles on an average CPU, while a division takes about 50 or more.
How it is actually done was nicely explained by John Mulder.
Float division is not much slower than integer division, but the compiler may be unable to do the same optimizations.
For example, the compiler can replace integer division by 3 with a multiplication and a binary shift.
It can also replace float division by 2.0 with a multiplication by 0.5, but it cannot replace division by 3.0 with a multiplication by 1/3.0, as 1/3.0 cannot be represented exactly in binary, so rounding errors may change the result of the division.
As the compiler doesn't know how sensitive your application is to rounding errors (say you were doing a weather simulation; see Butterfly effect), it cannot do that optimization.
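That difference is easy to observe (a quick check of mine): dividing by 3.0 and multiplying by the pre-rounded constant 1/3.0 produce differently rounded results for a noticeable fraction of inputs, which is exactly why the compiler may not substitute one for the other.

```c
/* Count integers x in [1, 100000) where x/3.0 and x*(1.0/3.0) differ.
   The constant 1.0/3.0 is already rounded once, and the multiply
   rounds again, so some results land one ulp away from the correctly
   rounded quotient that x/3.0 delivers. */
static int count_mismatches(void) {
    const double third = 1.0 / 3.0;
    int count = 0;
    for (int i = 1; i < 100000; i++) {
        double x = (double)i;
        if (x / 3.0 != x * third)
            count++;
    }
    return count;
}
```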
