Algorithm and data structure for calculating trig functions to arbitrary precision

I need to do a lot of calculations to arbitrarily high precision, in JavaScript, which only has a 64-bit float representation of numbers.
I can see how I could combine multiple variables to represent large numbers: for example, to represent a large decimal of m digits, where a 64-bit float can reliably hold n digits, I need roughly m / n variables.
But how can I implement an algorithm that calculates tan() to an arbitrary precision, using only 64-bit floating-point arithmetic?
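Not an answer to tan() itself, but to make the "m / n variables" idea concrete, here is a hedged C++ sketch of one possible representation (base-10^7 limbs, addition only; the names are my own choice). A JavaScript version would work the same way with plain number arrays, and sin/cos/tan would then be built on top of such arithmetic via their series expansions.

#include <cstdint>
#include <cstdio>
#include <vector>

// Illustrative sketch only: a number stored as base-10^7 "limbs" (least
// significant limb first), so m decimal digits need roughly m/7 entries.
// Only addition is shown; subtraction, multiplication and the series for
// sin/cos/tan would be built on top of the same representation.
using BigNat = std::vector<uint32_t>;
const uint32_t LIMB_BASE = 10000000;   // 10^7, safely below 2^32 even with a carry

BigNat add(const BigNat &a, const BigNat &b)
{
    BigNat r;
    uint32_t carry = 0;
    for (size_t i = 0; i < a.size() || i < b.size() || carry; i++)
    {
        uint32_t s = carry;
        if (i < a.size()) s += a[i];
        if (i < b.size()) s += b[i];
        r.push_back(s % LIMB_BASE);     // keep 7 digits in this limb
        carry = s / LIMB_BASE;          // propagate the rest
    }
    return r;
}

int main()
{
    BigNat x = { 9999999, 9999999 };    // 99999999999999
    BigNat y = { 1 };                   // 1
    BigNat z = add(x, y);               // expect 100000000000000
    std::printf("%u", z.back());
    for (size_t i = z.size() - 1; i-- > 0; ) std::printf("%07u", z[i]);
    std::printf("\n");
    return 0;
}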

Why do you want to (re)do it yourself? I'd use a library for that, for instance http://mathjs.org/docs/datatypes/bignumbers.html.

Related

Floating point random number generation in Verilog

Is there a way to generate random floating point numbers in either Verilog or SystemVerilog? More specifically, through some hardware implementation?
A floating point number is just a number of bits. As such, generating a random floating point number can be done by generating random bits and then interpreting them as a float (or real, as they are called in both VHDL and Verilog).
A standard way of generating a series of random bits in hardware is using a PRBS generator (Pseudo-Random Bit Sequence generator): a linear feedback shift register with special feedback taps to get the maximum-length sequence. There are various polynomials depending on how long a PRBS you want to have.
For an exact implementation I suggest you search for PRBS.
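Not Verilog, but here is a hedged software model of the PRBS idea (assumptions: a 16-bit Fibonacci LFSR with the maximal-length taps 16, 14, 13, 11, seeded with an arbitrary non-zero value): collect the output bits into a 32-bit word and reinterpret it as a float.

#include <cstdint>
#include <cstring>
#include <cstdio>

// Software model of the PRBS/LFSR idea (not synthesizable code): a 16-bit
// Fibonacci LFSR with taps 16,14,13,11 (a maximal-length polynomial).
// 32 output bits are packed into a word and reinterpreted as an IEEE-754 float.
static uint16_t lfsr = 0xACE1u;          // any non-zero seed works

unsigned lfsr_bit()
{
    unsigned bit = ((lfsr >> 0) ^ (lfsr >> 2) ^ (lfsr >> 3) ^ (lfsr >> 5)) & 1u;
    lfsr = (uint16_t)((lfsr >> 1) | (bit << 15));
    return bit;
}

float random_float_bits()
{
    uint32_t word = 0;
    for (int i = 0; i < 32; i++) word = (word << 1) | lfsr_bit();
    float f;
    std::memcpy(&f, &word, sizeof f);    // reinterpret the raw bits as a float
    return f;                            // may be Inf/NaN/denormal -- these are just raw bits
}

int main()
{
    for (int i = 0; i < 4; i++) std::printf("%g\n", random_float_bits());
    return 0;
}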

What is the fastest way to get the value of e?

What is the most optimised algorithm which finds the value of e with moderate accuracy?
I am looking for a comparison between optimised approaches giving more importance to speed than high precision.
Edit: By moderate accuracy I mean up to 6-7 decimal places.
But if there is a HUGE difference in speed, then I can settle with 4-5 places.
basic datatype
As mentioned in the comments, 6-7 decimal places is too little accuracy to be worth an algorithm. Instead use a constant, which is the fastest way anyway:
const double e=2.7182818284590452353602874713527;
If an FPU is involved, the constant is usually stored there too... Also, a single constant occupies much less space than a function that computes it ...
finite accuracy
Only once bignums are involved does it have any merit to use an algorithm to compute e. The algorithm depends on the target accuracy. Again, for smaller accuracies predefined constants are used:
e=2.71828182845904523536028747135266249775724709369995957496696762772407663035354759457138217852516642742746639193200305992181741359662904357290033429526059563073813232862794349076323382988075319525101901157383418793070215408914993488416750924476146066808226480016847741185374234544243710753907774499206955170189
but usually in hex format for faster and more precise manipulation:
e=2.B7E151628AED2A6ABF7158809CF4F3C762E7160F38B4DA56A784D9045190CFEF324E7738926CFBE5F4BF8D8D8C31D763DA06C80ABB1185EB4F7C7B5757F5958490CFD47D7C19BB42158D9554F7B46BCED55C4D79FD5F24D6613C31C3839A2DDF8A9A276BCFBFA1C877C56284DAB79CD4C2B3293D20E9E5EAF02AC60ACC93ECEBh
For limited/finite accuracy and best speed, the PSLQ algorithm is best. My understanding is that it is an algorithm for finding integer relations among real numbers.
Here is my favourite PSLQ example, up to 800 digits of Pi: PSLQ example
arbitrary accuracy
For arbitrary or "fixed" precision you need an algorithm with variable precision. This is what I use in my arbnum class:
e=(1+1/x)^x where x -> +infinity
If you choose x as a power of 2, realize that x is then just a single set bit of the number and 1/x has a predictable bit width. So e is obtained with a single division and one power. Here is an example:
arbnum arithmetics_e()                       // e computation, min(_arbnum_max_a,_arbnum_max_b)*5 decimals
    {                                        // e=(1+1/x)^x ... x -> +inf
    int i; arbnum c,x;
    i=_arbnum_bits_a;                        // i = half of the smaller bit count
    if (i>_arbnum_bits_b) i=_arbnum_bits_b;
    i>>=1;
    c.zero(); c.bitset(_arbnum_bits_b-i);    // c = 2^-i  (a single set fractional bit)
    x.one(); x/=c;                           // x = 1/c = 2^i  (the only bignum division)
    c++;                                     // c = 1 + 1/x
    for (;!x.bitget(_arbnum_bits_b);x>>=1)   // square c i times ...
        c*=c;                                // ... giving c = (1+1/x)^x by power by squaring
    return c;
    }
Where _arbnum_bits_a, _arbnum_bits_b are the numbers of bits before and after the decimal point in binary. So it breaks down to a few bit operations, one bignum division and a single power by squaring. Beware that multiplication and division are not that simple with bignums and usually involve Karatsuba or worse ...
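To see the formula itself in action without bignums, here is a plain-double sketch of the same idea (k = 26 is an arbitrary choice of mine); it tops out at roughly 8 correct digits, which is exactly why the arbnum version above is needed for more:

#include <cstdio>
#include <cmath>

// Plain-double sketch of the same idea: e ~= (1 + 2^-k)^(2^k), computed with a
// division-free setup and k squarings. With 64-bit doubles the method reaches
// only about 7-8 correct digits.
int main()
{
    const int k = 26;                       // x = 2^k
    double c = 1.0 + std::ldexp(1.0, -k);   // c = 1 + 1/x (exactly representable)
    for (int i = 0; i < k; i++) c *= c;     // c = (1 + 1/x)^x by repeated squaring
    std::printf("approx = %.15f\n", c);
    std::printf("e      = %.15f\n", std::exp(1.0));
    return 0;
}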
There are also polynomial approaches out there that do not require bignum arithmetic, similar to those used to compute Pi. The idea is to compute a chunk of binary bits per iteration without affecting the previously computed bits (too much). They should be faster, but as usual for any optimization, that depends on the implementation and the HW it runs on.
For reference, see Brothers' formula here: https://www.intmath.com/exponential-logarithmic-functions/calculating-e.php
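As a quick illustration (plain doubles only, not a bignum implementation), one of the Brothers series is e = sum over k >= 0 of (2k+2)/(2k+1)!, and a handful of terms already exhausts double precision:

#include <cstdio>
#include <cmath>

// Minimal sketch of one of the Brothers series: e = sum_{k>=0} (2k+2)/(2k+1)!
// Each term adds roughly two more correct digits.
int main()
{
    double e = 0.0, fact = 1.0;                            // fact = (2k+1)!
    for (int k = 0; k < 10; k++)
    {
        fact *= (k == 0) ? 1.0 : (2.0*k) * (2.0*k + 1.0);  // (2k+1)! from (2k-1)!
        e += (2.0*k + 2.0) / fact;
        std::printf("k=%d  e~=%.15f\n", k, e);
    }
    std::printf("exp(1)=%.15f\n", std::exp(1.0));
    return 0;
}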

LUT versus Newton-Raphson Division For IEEE-754 32-bit Floating Point

I was wondering what are the tradeoffs when implementing 32-bit IEEE-754 floating point division: using LUTs versus through the Newton-Raphson method?
When I say tradeoffs I mean in terms of memory size, instruction count, etc.
I have a small memory (130 words, each 16 bits). I am storing the upper 12 bits of the mantissa (including the hidden bit) in one memory location and the lower 12 bits of the mantissa in another location.
Currently I am using newton-raphson division, but am considering what are the tradeoffs if I changed my method. Here is a link to my algorithm: Newton's Method for finding the reciprocal of a floating point number for division
Thank you and please explain your reasoning.
The trade-off is fairly simple. A LUT uses extra memory in the hope of reducing the instruction count enough to save some time. Whether it's effective will depend a lot on the details of the processor -- caching in particular.
For Newton-Raphson, you change X/Y to X * (1/Y) and use your iteration to find 1/Y. At least in my experience, if you need full precision, it's rarely useful -- its primary strength is in allowing you to find something to (say) 16-bit precision more quickly.
The usual method for division is a bit-by-bit method. Although that particular answer deals with integers, for floating point you do essentially the same except that along with it you subtract the exponents. A floating point number is basically A*2^N, where A is the significand and N is the exponent part of the number. So, you take two numbers A*2^N and B*2^M, and carry out the division as (A/B) * 2^(N-M), with A and B being treated as (essentially) integers in this case. The only real difference is that with floating point you normally want to round rather than truncate the result. That basically just means carrying out the division to (at least) one extra bit of precision, then rounding up if that extra bit is a one.
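As a hedged sketch of that description (a toy format of my own, not IEEE-754: value = significand * 2^exponent with a normalised 24-bit significand, no rounding modes, specials or overflow handling), the significands are divided bit by bit while the exponents are simply subtracted:

#include <cstdint>
#include <cstdio>
#include <cmath>

// Toy format: value = sig * 2^exp, with sig a normalised 24-bit significand
// in [2^23, 2^24). Shows the (A/B) * 2^(N-M) structure only.
struct ToyFloat { uint32_t sig; int exp; };

ToyFloat toy_div(ToyFloat a, ToyFloat b)
{
    int pre = (a.sig < b.sig) ? 1 : 0;     // pre-normalise so the quotient starts with a 1 bit
    uint64_t rem = (uint64_t)a.sig << pre;
    uint64_t quo = 0;
    for (int i = 0; i < 25; i++)           // 24 result bits + 1 extra bit for rounding
    {
        quo <<= 1;
        if (rem >= b.sig) { rem -= b.sig; quo |= 1; }  // one quotient bit per iteration
        rem <<= 1;
    }
    uint32_t sig = (uint32_t)((quo >> 1) + (quo & 1)); // round up if the extra bit is one
    return { sig, a.exp - b.exp - pre - 23 };          // exponents are just subtracted
}

int main()
{
    ToyFloat three = { 3u << 22, -22 };    // 3.0
    ToyFloat two   = { 1u << 23, -22 };    // 2.0
    ToyFloat q = toy_div(three, two);      // expect 1.5
    std::printf("%g\n", q.sig * std::ldexp(1.0, q.exp));
    return 0;
}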
The most common method using lookup tables is SRT division. This is most often done in hardware, so I'd probably Google for something like "Verilog SRT" or "VHDL SRT". Rendering it in C++ shouldn't be terribly difficult though. Where the method I outlined in the linked answer produces one bit per iteration, this can be written to do 2, 4, etc. If memory serves, the size of the table grows quadratically with the number of bits produced per iteration though, so you rarely see much more than 4 in practice.
Each Newton-Raphson step roughly doubles the number of digits of precision, so if you can work out the number of bits of precision you expect from a LUT of a particular size, you should be able to work out how many NR steps you need to attain your desired precision. The Cray-1 used NR as the final stage of its reciprocal calculation. Looking for this I found a fairly detailed article on this sort of thing: An Accurate, High Speed Implementation of Division by Reciprocal Approximation from the 9th IEEE Symposium on Computer Arithmetic (September 6-8, 1989).
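For comparison, a minimal sketch of the Newton-Raphson reciprocal iteration itself, assuming a crude linear seed in place of a LUT (the helper name nr_reciprocal is mine): each x <- x*(2 - d*x) step roughly doubles the number of correct bits.

#include <cstdio>
#include <cmath>

// Newton-Raphson reciprocal: to approximate 1/d, iterate x <- x*(2 - d*x).
// The seed is a classical linear guess for d scaled into [0.5, 1); any small
// lookup table would serve the same purpose.
double nr_reciprocal(double d, int steps)
{
    int e;
    double m = std::frexp(d, &e);          // d = m * 2^e, m in [0.5, 1)
    double x = 48.0/17.0 - 32.0/17.0 * m;  // linear first guess for 1/m
    for (int i = 0; i < steps; i++)
        x = x * (2.0 - m * x);             // quadratic convergence
    return std::ldexp(x, -e);              // undo the scaling: 1/d = (1/m) * 2^-e
}

int main()
{
    for (int steps = 0; steps <= 4; steps++)
        std::printf("steps=%d  err=%.3e\n", steps,
                    std::fabs(nr_reciprocal(7.0, steps) - 1.0/7.0));
    return 0;
}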

Faster integer division when denominator is known?

I am working on a GPU device which has very high integer division latency, several hundred cycles. I am looking to optimize divisions.
All divisions are by a denominator from the set { 1,3,6,10 }; however, the numerator is a runtime positive value, roughly 32000 or less. Due to memory constraints, a lookup table may not be a good option.
Can you think of alternatives?
I have thought of computing floating point inverses and using those to multiply the numerator.
Thanks
PS. Thank you people, the bit shift hack is really cool.
To recover from roundoff, I use the following C segment:
// q = m/n
q += (n*(j +1)-1) < m;
a/b=a*(1/b)
x=(1<<16)/b
a/b=(a*x)>>16
Can you build a lookup table for the denominators? Since you said 15-bit numerators, you could use 17 for the shifts if everything is unsigned 32-bit:
a/b=a*((1<<17)/b)>>17
The larger the shift, the smaller the rounding error. You can do a brute-force check to see how many times, if any, this is actually wrong.
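Here is one way to do that brute-force check (C++ for convenience; assumptions: unsigned 32-bit arithmetic, the question's denominators, numerators below 2^15). It also tries the reciprocal rounded up by one, since the truncated one systematically underestimates; the printed counts tell you whether a given shift/rounding choice is safe over your input range.

#include <cstdint>
#include <cstdio>

// Brute-force check of the trick above, q = (a*((1<<17)/b))>>17, for the
// denominators {1,3,6,10} and 15-bit numerators. Counts how often the result
// differs from the exact a/b, both for the truncated reciprocal as written
// and for one rounded up by 1.
int main()
{
    const uint32_t dens[] = { 1, 3, 6, 10 };
    for (uint32_t b : dens)
    {
        uint32_t x = (1u << 17) / b;       // truncated reciprocal
        uint32_t bad = 0, bad_up = 0;
        for (uint32_t a = 0; a < 32768u; a++)
        {
            if (((a * x) >> 17) != a / b)       bad++;
            if (((a * (x + 1)) >> 17) != a / b) bad_up++;
        }
        std::printf("b=%u: mismatches=%u (truncated), %u (rounded up)\n",
                    b, bad, bad_up);
    }
    return 0;
}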
The book, "Hacker's Delight" by Henry Warren, has a whole chapter devoted to integer division by constants, including techniques that transform an integer division to a multiply/shift/add series of operations.
This page calculates the magic numbers for the multiply/shift/add operations:
http://www.hackersdelight.org/magic.htm
The standard embedded systems hack for this is to convert an integer division by N into a fixed-point multiplication by 1/N.
Assuming a 16-bit fraction, 1/3 ≈ 0.33333 can be represented as 21845 (decimal). Multiply, giving a 32-bit integer product, and shift down 16 bits.
You will almost certainly encounter some roundoff (truncation) error. This may or may not be something you can live with.
It MIGHT be worthwhile to look hard at your GPU and see if you can hand-code a faster integer division routine, taking advantage of your knowledge of the restricted range of the numerator.

Why is float division slow?

What are the steps in the algorithm to do floating point division?
Why is the result slower than say, multiplication?
Is it done the same way we do division by hand? By repeatedly subtracting multiples of the divisor to obtain a remainder, aligning the number again and continuing until the remainder is less than a particular value?
Also, why do we gain on performance if instead of doing
a = b / c
we do
d = 1 / c
a = b * d
?
Edit:
Basically I was asking because someone asked me to distribute a value among contenders based on an assignment of weights. I did all this in integers and was later asked to convert to float, which caused a slowdown in performance. I was just interested in knowing how C or C++ would do these operations in a way that would cause the slowness.
FPU division often basically uses Newton-Raphson (or some other algorithm) to get a reciprocal then multiplies by that reciprocal. That's why the reciprocal operation is slightly faster than the general division operation.
This HP paper (which is actually more understandable than most papers I come across talking about Newton-Raphson) has this to say about floating point division:
Floating point division and square root take considerably longer to compute than addition and multiplication. The latter two are computed directly while the former are usually computed with an iterative algorithm. The most common approach is to use a division-free Newton-Raphson iteration to get an approximation to the reciprocal of the denominator (division) or the reciprocal square root, and then multiply by the numerator (division) or input argument (square root).
From a hardware point of view, division is an iterative algorithm, and the time it takes is proportional to the number of bits. The fastest dividers currently around use high-radix SRT algorithms that generate up to 4 bits of result per iteration; for a 32-bit divide you then need at least 8 steps.
Multiplication can be done in parallel to a certain degree. Without going into detail you can break up a large multiplication into several smaller, independent ones. These multiplications can again be broken down until you're at a bit-level, or you stop earlier and use a small lookup-table in hardware. This makes the multiplication hardware heavy from a silicon real estate point of view but very fast as well. It's the classic size/speed tradeoff.
You need log2 steps to combine the partial results computed in parallel, so a 32-bit multiply needs 5 logical steps (if you go down to the minimum). Fortunately these 5 steps are a good deal simpler than the division steps (they are just additions). That means in practice multiplies are even faster.
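To make that decomposition concrete, here is a hedged software sketch (not how the hardware is wired, just the same idea): the product is a sum of shifted partial products, and a hardware multiplier produces all of them at once and sums them through a log-depth adder tree.

#include <cstdint>
#include <cstdio>

// A 32-bit multiply is just a sum of (at most 32) shifted copies of the
// multiplicand -- the partial products. Here they are generated and added
// sequentially; in hardware they are summed in parallel.
uint64_t shift_add_multiply(uint32_t a, uint32_t b)
{
    uint64_t sum = 0;
    for (int i = 0; i < 32; i++)
        if ((b >> i) & 1u)
            sum += (uint64_t)a << i;   // partial product for bit i of b
    return sum;
}

int main()
{
    std::printf("%llu\n", (unsigned long long)shift_add_multiply(123456789u, 987654321u));
    std::printf("%llu\n", 123456789ull * 987654321ull);   // should match
    return 0;
}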
As described in the Wikipedia article Division algorithm, there are two main approaches to division in computers:
Slow Division
Uses the following recurrence and finds one digit per iteration:
partialRemainder[j+1] = radix * partialRemainder[j] - quotientDigit[n-(j+1)]*denominator
Fast Division
Starts with an estimation and converges on the quotient. How accurate you are depends on the number of iterations.
Newton-Raphson division (very briefly):
Calculate estimate of the reciprocal.
Compute more accurate estimates of the reciprocal.
Compute quotient by multiplying the dividend by the reciprocal.
Think of the hardware involved, and you'll understand a lot better why it takes so much longer to divide than multiply. Both operations are done down at the Floating Point Unit (FPU) level, and even in the world of integral ALUs, the division circuit is a far busier place than a multiplication circuit. I would suspect this is only more painful in the world of floating point, as now the data isn't just least to most significant digit ordered, but is instead ordered by the IEEE 754 standard.
As for the round off, it's really about wherever the signals traveling between the gates get soldered to ground; where that happens, you lose digits. Not rounding, so much as truncation.
Or were you asking about simulating floating point arithmetic using just integers?
You won't gain performance by doing
d = 1 / c
a = b * d
You probably mean:
d = 1 / c
a1 = b1 * d
a2 = b2 * d
This way the division is done only once.
Division is per se slower than multiplication, however, I don't know the details. The basic reason is that, similar to functions such as sin or sqrt, it's just mathematically more complex. IIRC, a multiplication takes about 10 cycles on an average CPU, while a division takes about 50 or more.
How it is actually done was nicely explained by John Mulder.
Float division is not much slower than integer division, but the compiler may be unable to do the same optimizations.
For example, the compiler can replace an integer division by 3 with a multiplication and a binary shift.
Also, it can replace a float division by 2.0 with a multiplication by 0.5, but it cannot replace a division by 3.0 with a multiplication by 1/3.0, as 1/3.0 cannot be represented exactly using binary numbers, so rounding errors may change the result of the division.
As the compiler doesn't know how sensitive your application is to rounding errors (say you were doing a weather simulation; see the butterfly effect), it cannot do that optimization.
