Implementation of strtof(), floating-point multiplication and mantissa rounding issues

Implementation of strtof(), floating-point multiplication and mantissa rounding issues - algorithm

This question is not so much about the C as about the algorithm. I need to implement strtof() function, which would behave exactly the same as GCC one - and do it from scratch (no GNU MPL etc.).
Let's skip checks, consider only correct inputs and positive numbers, e.g. 345.6e7. My basic algorithm is:
Split the number into fraction and integer exponent, so for 345.6e7 fraction is 3.456e2 and exponent is 7.
Create a floating-point exponent. To do this, I use these tables:
static const float powersOf10[] = {
1.0e1f,
1.0e2f,
1.0e4f,
1.0e8f,
1.0e16f,
1.0e32f
};
static const float minuspowersOf10[] = {
1.0e-1f,
1.0e-2f,
1.0e-4f,
1.0e-8f,
1.0e-16f,
1.0e-32f
};
and get float exponent as a product of corresponding bits in integer exponent, e.g. 7 = 1+2+4 => float_exponent = 1.0e1f * 1.0e2f * 1.0e4f.
Multiply fraction by floating exponent and return the result.
And here comes the first problem: since we do a lot of multiplications, we get a somewhat big error becaule of rounding multiplication result each time. So, I decided to dive into floating point multiplication algorithm and implement it myself: a function takes a number of floats (in my case - up to 7) and multiplies them on bit level. Consider I have uint256_t type to fit mantissas product.
Now, the second problem: round mantissas product to 23 bits. I've tried several rounding methods (round-to-even, Von Neumann rounding - a small article about them), but no of them can give the correct result for all the test numbers. And some of them really confuse me, like this one:
7038531e-32. GCC's strtof() returns 0x15ae43fd, so correct unbiased mantissa is 2e43fd. I go for multiplication of 7.038531e6 (biased mantissa d6cc86) and 1e-32 (b.m. cfb11f). The resulting unbiased mantissa in binary form is
( 47)0001 ( 43)0111 ( 39)0010 ( 35)0001
( 31)1111 ( 27)1110 ( 23)1110 ( 19)0010
( 15)1011 ( 11)0101 ( 7)0001 ( 3)1101
which I have to round to 23 bits. However, by all rounding methods I have to round it up, and I'll get 2e43fe in result - wrong! So, for this number the only way to get correct mantissa is just to chop it - but chopping does not work for other numbers.
Having this worked on countless nights, my questions are:
Is this approach to strtof() correct? (I know that GCC uses GNU MPL for it, and tried to see into it. However, trying to copy MPL's implementation would require porting the entire library, and this is definitely not what I want). Maybe this split-then-multiply algorithm is inevitably prone to errors? I did some other small tricks, (e.g. create exponent tables for all integer exponents in float range), but they led to even more failed conversions.
If so, did I miss something while rounding? I thought so for long time, but this 7038531e-32 number completely confused me.

If I want to be as precise as I can I usually do stuff like this (however I usually do the reverse operation float -> text):
use only integers (no floats what so ever)
as you know float is integer mantissa bit-shifted by integer exponent so no need for floats.
For constructing the final float datatype you can use simple union with float and 32 bit unsigned integer in it ... or pointers to such types pointing to the same address.
This will avoid rounding errors for numbers that fit completely and shrink error for those that don't fit considerably.
use hex numbers
You can convert your text of decadic number on the run into its hex counterpart (still as text) from there creating mantissa and exponent integers is simple.
Here:
How to convert a gi-normous integer (in string format) to hex format? (C#)
is C++ implementation example of dec2hex and hex2dec number conversions done on text
use more bits for mantissa while converting
for task like this and single precision float I usually use 2 or 3 32 bit DWORDs for the 24 bit mantissa to still hold some precision after the multiplications If you want to be precise you have to deal with 128+24 bits for both integer and fractional part of number so 5x32 bit numbers in sequence.
For more info and inspiration see (reverse operation):
my best attempt to print 32 bit floats with least rounding errors (integer math only)
Your code will be just inverse of that (so many parts will be similar)
Since I post that I made even more advanced version that recognize formatting just like printf , supports much more datatypes and more without using any libs (however its ~22.5 KByte of code). I needed it for MCUs as GCC implementation of prints are not very good there ...

Related

Fastest algorithm to convert hexadecimal numbers into decimal form without using a fixed length variable to store the result

I want to write a program to convert hexadecimal numbers into their decimal forms without using a variable of fixed length to store the result because that would restrict the range of inputs that my program can work with.
Let's say I were to use a variable of type long long int to calculate, store and print the result. Doing so would limit the range of hexadecimal numbers that my program can handle to between 8000000000000001 and 7FFFFFFFFFFFFFFF. Anything outside this range would cause the variable to overflow.
I did write a program that calculates and stores the decimal result in a dynamically allocated string by performing carry and borrow operations but it runs much slower, even for numbers that are as big as 7FFFFFFFF!
Then I stumbled onto this site which could take numbers that are way outside the range of a 64 bit variable. I tried their converter with numbers much larger than 16^65 - 1 and still couldn't get it to overflow. It just kept on going and printing the result.
I figured that they must be using a much better algorithm for hex to decimal conversion, one that isn't limited to 64 bit values.
So far, Google's search results have only led me to algorithms that use some fixed-length variable for storing the result.
That's why I am here. I wanna know if such an algorithm exists and if it does, what is it?

Well, it sounds like you already did it when you wrote "a program that calculates and stores the decimal result in a dynamically allocated string by performing carry and borrow operations".
Converting from base 16 (hexadecimal) to base 10 means implementing multiplication and addition of numbers in a base 10x representation. Then for each hex digit d, you calculate result = result*16 + d. When you're done you have the same number in a 10-based representation that is easy to write out as a decimal string.
There could be any number of reasons why your string-based method was slow. If you provide it, I'm sure someone could comment.
The most important trick for making it reasonably fast, though, is to pick the right base to convert to and from. I would probably do the multiplication and addition in base 109, so that each digit will be as large as possible while still fitting into a 32-bit integer, and process 7 hex digits at a time, which is as many as I can while only multiplying by single digits.
For every 7 hex digts, I'd convert them to a number d, and then do result = result * ‭(16^7) + d.
Then I can get the 9 decimal digits for each resulting digit in base 109.
This process is pretty easy, since you only have to multiply by single digits. I'm sure there are faster, more complicated ways that recursively break the number into equal-sized pieces.

Is using integers as fractional coefficients instead of floats a good idea for a monetary application?

My application requires a fractional quantity multiplied by a monetary value.
For example, $65.50 × 0.55 hours = $36.025 (rounded to $36.03).
I know that floats should not be used to represent money, so I'm storing all of my monetary values as cents. $65.50 in the above equation is stored as 6550 (integer).
For the fractional coefficient, my issue is that 0.55 does not have a 32-bit float representation. In the use case above, 0.55 hours == 33 minutes, so 0.55 is an example of a specific value that my application will need to account for exactly. The floating point representation of 0.550000012 is insufficient, because the user will not understand where the additional 0.000000012 came from. I cannot simply call a rounding function on 0.550000012 because it will round to the whole number.
Multiplication solution
To solve this, my first idea was to store all quantities as integers and multiply × 1000. So 0.55 entered by the user would become 550 (integer) when stored. All calculations would happen without floats, and then simply divide by 1000 (integer division, not float) when presenting the result to the user.
I realize that this would permanently limit me to 3 decimal places of
precision. If I decide that 3 is adequate for the lifetime of my
application, does this approach make sense?
Are there potential rounding issues if I were to use integer division?
Is there a name for this process? EDIT: As indicated by #SergGr, this is fixed-point arithmetic.
Is there a better approach?
EDIT:
I should have clarified, this is not time-specific. It is for generic quantities like 1.256 pounds of flour, 1 sofa, or 0.25 hours (think invoices).
What I'm trying to replicate here is a more exact version of Postgres's extra_float_digits = 0 functionality, where if the user enters 0.55 (float32), the database stores 0.550000012 but when queried for the result returns 0.55 which appears to be exactly what the user typed.
I am willing to limit this application's precision to 3 decimal places (it's business, not scientific), so that's what made me consider the × 1000 approach.
I'm using the Go programming language, but I'm interested in generic cross-language solutions.

Another solution to store the result is using the rational form of the value. You can explain the number by two integer value which the number is equal p/q, such that both p and q are integers. Hence, you can have more precision for your numbers and do some math with the rational numbers in the format of two integers.

Note: This is an attempt to merge different comments into one coherent answer as was requested by Matt.
TL;DR
Yes, this approach makes sense but most probably is not the best choice
Yes, there are rounding issues but there inevitably will be some no matter what representation you use
What you suggest using is called Decimal fixed point numbers
I'd argue yes, there is a better approach and it is to use some standard or popular decimal floating point numbers library for your language (Go is not my native language so I can't recommend one)
In PostgreSQL it is better to use Numeric (something like Numeric(15,3) for example) rather than a combination of float4/float8 and extra_float_digits. Actually this is what the first item in the PostgreSQL doc on Floating-Point Types suggests:
If you require exact storage and calculations (such as for monetary amounts), use the numeric type instead.
Some more details on how non-integer numbers can be stored
First of all there is a fundamental fact that there are infinitely many numbers in the range [0;1] so you obviously can't store every number there in any finite data structure. It means you have to make some compromises: no matter what way you choose, there will be some numbers you can't store exactly so you'll have to round.
Another important point is that people are used to 10-based system and in that system only results of division by numbers in a form of 2^a*5^b can be represented using a finite number of digits. For every other rational number even if you somehow store it in the exact form, you will have to do some truncation and rounding at the formatting for human usage stage.
Potentially there are infinitely many ways to store numbers. In practice only a few are widely used:
floating point numbers with two major branches of binary (this is what most today's hardware natively implements and what is support by most of the languages as float or double) and decimal. This is the format that store mantissa and exponent (can be negative), so the number is mantissa * base^exponent (I omit sign and just say it is logically a part of the mantissa although in practice it is usually stored separately). Binary vs. decimal is specified by the base. For example 0.5 will be stored in binary as a pair (1,-1) i.e. 1*2^-1 and in decimal as a pair (5,-1) i.e. 5*10^-1. Theoretically you can use any other base as well but in practice only 2 and 10 make sense as the bases.
fixed point numbers with the same division in binary and decimal. The idea is the same as in floating point numbers but some fixed exponent is used for all the numbers. What you suggests is actually a decimal fixed point number with the exponent fixed at -3. I've seen a usage of binary fixed-point numbers on some embedded hardware where there is no built-in support of floating point numbers, because binary fixed-point numbers can be implemented with reasonable efficiency using integer arithmetic. As for decimal fixed-point numbers, in practice they are not much easier to implement that decimal floating-point numbers but provide much less flexibility.
rational numbers format i.e. the value is stored as a pair of (p, q) which represents p/q (and usually q>0 so sign stored in p and either p=0, q=1 for 0 or gcd(p,q) = 1 for every other number). Usually this requires some big integer arithmetic to be useful in the first place (here is a Go example of math.big.Rat). Actually this might be an useful format for some problems and people often forget about this possibility, probably because it is often not a part of a standard library. Another obvious drawback is that as I said people are not used to think in rational numbers (can you easily compare which is greater 123/456 or 213/789?) so you'll have to convert the final results to some other form. Another drawback is that if you have a long chain of computations, internal numbers (p and q) might easily become very big values so computations will be slow. Still it may be useful to store intermediate results of calculations.
In practical terms there is also a division into arbitrary length and fixed length representations. For example:
IEEE 754 float or double are fixed length floating-point binary representations,
Go math.big.Float is an arbitrary length floating-point binary representations
.Net decimal is a fixed length floating-point decimal representations
Java BigDecimal is an arbitrary length floating-point decimal representations
In practical terms I'd says that the best solution for your problem is some big enough fixed length floating point decimal representations (like .Net decimal). An arbitrary length implementation would also work. If you have to make an implementation from scratch, than your idea of a fixed length fixed point decimal representation might be OK because it is the easiest thing to implement yourself (a bit easier than the previous alternatives) but it may become a burden at some point.

As mentioned in the comments, it would be best to use some builtin Decimal module in your language to handle exact arithmetic. However, since you haven't specified a language, we cannot be certain that your language may even have such a module. If it does not, here is how to go about doing so.
Consider using Binary Coded Decimal to store your values. The way it works is by restricting the values that can be stored per byte to 0 through 9 (inclusive), "wasting" the rest. You can encode a decimal representation of a number byte by byte that way. For example, 613 would become
6 -> 0000 0110
1 -> 0000 0001
3 -> 0000 0011
613 -> 0000 0110 0000 0001 0000 0011
Where each grouping of 4 digits above is a "nibble" of a byte. In practice, a packed variant is used, where two decimal digits are packed into a byte (one per nibble) to be less "wasteful". You can then implement a few methods to do your basic addition, subtract, multiplication, etc. Just iterate over an array of bytes, and perform your classic grade school addition / multiplication algorithms (keep in mind for the packed variant that you may need to pad a zero to get an even number of nibbles). You just need to keep a variable to store where the decimal point is, and remember to carry where necessary to preserve the encoding.

why does Ruby's Rational class treat string arguments differently from numeric arguments?

I'm using ruby's Rational library to convert the width & height of images to aspect ratios.
I've noticed that string arguments are treated differently than numeric arguments.
>> Rational('1.91','1')
=> (191/100)
>> Rational(1.91,1)
=> (8601875288277647/4503599627370496)
>> RUBY_VERSION
=> "2.1.5"
>> RUBY_ENGINE
=> "ruby"
FYI 1.91:1 is an aspect ratio recommended by Facebook for images on their platform.
Values like 191 and 100 are much more convenient to store in my database than 8601875288277647 and 4503599627370496. But I'd like to understand where this different originates before deciding which approach to use.
The Rational test suite doesn't seem to cover this exact case.

Disclaimer: This is only an educated guess, based on some knowledge on how to implement such a feat.
As Kent Dahl already said, Floats are not precise, they have a fixed precision, which means 1.91 is really 1.910000000000000000001 or something like that, which ruby "knows" should be displayed as 1.91.
"1.91" on the other hand is a string, basically an array of characters: '1', '.', '9', '1'.
This said, here is what you need to do, to build the rational out of floats:
Get rid of the . (mathematically by multiplying the numerator and denominator with 10^x, or multiplying with ten as many times, as there are numbers behind the .)
Find the greatest common denominator (gcd)
Divide num and denom with the gcd
Step 1 however, is a little different for Float and String:
The Float, we will have to multiply with 10^x, where x is (because of the precision) not 2 (as one would think with 1.91), but more something like 16 (remember: 1.9100...1).
For the String, we COULD cast it into a float and do the same trick, but hey, there is an easier way: We just count the number of characters behind the dot (which is 2), remove the dot and multiply the denom with 10^2... This is not only the easier, but also the more precise way.
The big numbers might disappear again, when applying step 3, that's why you will not always get those strange results when dealing with rationals from floats.
TLDR: The numbers will be built differently based on the arguments being String, or FLoat. FLoats can produce long-ass numbers, because precision.

The Float 1.91 is stored as a double which has a given amount of precision, limited by binary presentation. The equivalent Rational object retains this precision a such as possible, so it is huge. There is no way of storing 1.91 exactly in a double, but the value you get is close enough for most uses.
As for the String, it represents a different value - the exact value of 1.91 - and as you create a Rational it retains it better. It is more correct than the Float, UT takes longer to use for calculations.
This is similar to the problem with 1.0/3 as it "goes on forever" 0.333333...etc, but Rational can represent it exactly.

Arithmetic Operations using only 32 bit integers

How would you compute the multiplication of two 1024 bit numbers on a microprocessor that is only capable of multiplying 32 bit numbers?

The starting point is to realize that you already know how to do this: in elementary school you were taught how to do arithmetic on single digit numbers, and then given data structures to represent larger numbers (e.g. decimals) and algorithms to compute arithmetic operations (e.g. long division).
If you have a way to multiply two 32-bit numbers to give a 64-bit result (note that unsigned long long is guaranteed to be at least 64 bits), then you can use those same algorithms to do arithmetic in base 2^32.
You'll also need, e.g., an add with carry operation. You can determine the carry when adding two unsigned numbers of the same type by detecting overflow, e.g. as follows:
uint32_t x, y; // set to some value
uint32_t sum = x + y;
uint32_t carry = (sum < x);
(technically, this sort of operation requires that you do unsigned arithmetic: overflow in signed arithmetic is undefined behavior, and optimizers will do surprising things to your code you least expect it)
(modern processors usually give a way to multiply two 64-bit numbers to give a 128-bit result, but to access it you will have to use compiler extensions like 128-bit types, or you'll have to write inline assembly code. modern processors also have specialized add-with-carry instructions)
Now, to do arithmetic efficiently is an immense project; I found it quite instructive to browse through the documentation and source code to gmp, the GNU multiple precision arithmetic library.

look at any implementation of bigint operations
here are few of mine approaches in C++ for fast bignum square
some are solely for sqr but others are usable for multiplication...
use 32bit arithmetics as a module for 64/128/256/... bit arithmetics
see mine 32bit ALU in x86 C++
use long multiplication with digit base 2^32
can use also Karatsuba this way

How is π calculated within sas?

just curious! but I spotted that the value of π held by SAS is in fact incorrect.
for instance:
data _null_;
x= constant('pi') * 1000000000000000000000000000;
put x= 32.;
run;
gives a π value of (3.)141592653589792961327005696
however - π is of course (3.)1415926535897932384626433832795 ( http://www.joyofpi.com/pi.html ) - to 31 dp.
what gives??!!

SAS stores PI as a constant to 14 decimal places. The difference you are seeing is an artifact of floating point math when you did the multiplication step.
data _null_;
pi=constant("PI");
put pi= 32.30;
run;
/*On Log */
pi=3.141592653589790000000000000000

PI is held as a constant in all programming languages to a set precision. It isn't calculated. Your code just exposes how accurate PI is in SAS.

You got 16 digits of precision. Which means it probably uses an IEEE 754 double-precision floating point representation, which only gives about 16-17 decimal digits of precision. It is impossible for π to be represented in any finite number of digits, so any computer representation is going to be truncated at some number of digits. There are ways of doing arbitrary-precision math (Java has a BigDecimal class), but even then you'd have to truncate π somewhere. And math done that way is several orders of magnitude slower (because it is not handled by direct CPU instructions).

As Garry Shutler said, it's held as a constant. Note that that small fractional values in the numeric types of computer languages are rarely all that accurate (in fact, their accuracy can be lower than their precision), because they're stored as very good approximations that can be manipulated quickly. If you need excellent precision (as in financial and scientific endeavors), you need to use special types like Java's BigDecimal that handle being completely accurate (at the cost of computational speed). (Sorry, don't know SAS so don't know of an analog for BigDecimal.)

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio