I am working on a program that will convert a floating point number in base ten to its binary representation in base two. I understand that we cannot use bitwise operators on floating point numbers, but I am told that if we cast its bits to an unsigned int we can, e.g. unsigned int float_int = *(unsigned int *)&f;. My assignment states that we can separate out the fields using bitwise operators. I have tried everything, but I cannot come up with a solution that will separate the number from its decimal part.
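For reference, here is a minimal C sketch of the kind of bit-level access the assignment describes (the field widths assume IEEE 754 single precision; memcpy is used instead of the pointer cast so the reinterpretation stays well-defined):
#include <stdint.h>
#include <stdio.h>
#include <string.h>
int main(void)
{
    float f = -13.625f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);          /* reinterpret the float's bytes as an integer */
    uint32_t sign     = bits >> 31;          /* 1 bit */
    uint32_t exponent = (bits >> 23) & 0xFF; /* 8 bits, biased by 127 */
    uint32_t mantissa = bits & 0x7FFFFF;     /* 23 bits, implicit leading 1 not stored */
    printf("sign=%u exponent=%u mantissa=0x%06X\n", sign, exponent, mantissa);
    return 0;
}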
This question is not so much about C as about the algorithm. I need to implement a strtof() function which behaves exactly the same as GCC's - and do it from scratch (no GNU MPL etc.).
Let's skip the checks and consider only correct inputs and positive numbers, e.g. 345.6e7. My basic algorithm is:
Split the number into a fraction and an integer exponent, so for 345.6e7 the fraction is 3.456e2 and the exponent is 7.
Create a floating-point exponent. To do this, I use these tables:
static const float powersOf10[] = {
1.0e1f,
1.0e2f,
1.0e4f,
1.0e8f,
1.0e16f,
1.0e32f
};
static const float minuspowersOf10[] = {
1.0e-1f,
1.0e-2f,
1.0e-4f,
1.0e-8f,
1.0e-16f,
1.0e-32f
};
and get the float exponent as a product of the table entries selected by the set bits of the integer exponent, e.g. 7 = 1+2+4 => float_exponent = 1.0e1f * 1.0e2f * 1.0e4f.
Multiply the fraction by the floating exponent and return the result.
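A minimal sketch of steps 2-3 as described above, using the two tables from step 2 (the function name is illustrative; this is the scheme that accumulates one rounding error per multiplication):
/* Scale the fraction by 10^exp10, built as a product of the table
   entries selected by the set bits of |exp10| (e.g. 7 = 1+2+4
   => 1.0e1f * 1.0e2f * 1.0e4f). Assumes |exp10| < 64. */
static float scale_by_pow10(float fraction, int exp10)
{
    const float *table = (exp10 < 0) ? minuspowersOf10 : powersOf10;
    unsigned e = (exp10 < 0) ? (unsigned)(-exp10) : (unsigned)exp10;
    float scale = 1.0f;
    for (int bit = 0; e != 0; ++bit, e >>= 1)
        if (e & 1u)
            scale *= table[bit];
    return fraction * scale;
}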
And here comes the first problem: since we do a lot of multiplications, we accumulate a fairly large error because the multiplication result is rounded each time. So I decided to dive into the floating point multiplication algorithm and implement it myself: a function takes a number of floats (in my case - up to 7) and multiplies them at the bit level. Assume I have a uint256_t type to hold the product of the mantissas.
Now, the second problem: rounding the mantissa product to 23 bits. I've tried several rounding methods (round-to-even, Von Neumann rounding - a small article about them), but none of them gives the correct result for all the test numbers. And some of them really confuse me, like this one:
7038531e-32. GCC's strtof() returns 0x15ae43fd, so the correct unbiased mantissa is 2e43fd. I multiply 7.038531e6 (biased mantissa d6cc86) by 1e-32 (b.m. cfb11f). The resulting unbiased mantissa in binary form is
( 47)0001 ( 43)0111 ( 39)0010 ( 35)0001
( 31)1111 ( 27)1110 ( 23)1110 ( 19)0010
( 15)1011 ( 11)0101 ( 7)0001 ( 3)1101
which I have to round to 23 bits. However, all rounding methods tell me to round it up, and I get 2e43fe as a result - wrong! So for this number the only way to get the correct mantissa is simply to chop it - but chopping does not work for other numbers.
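For reference, one way to implement the round-to-nearest-even step looks roughly like this (sketched on a 64-bit container, since two 24-bit mantissas fit in 48 bits; the same logic applies to a uint256_t product, and the names are illustrative):
#include <stdint.h>
/* Round a product down to a 24-bit significand with round-to-nearest,
   ties-to-even. `wide` must be normalized so its leading 1 sits at
   bit (23 + extra); `extra` is the number of bits to drop (>= 1).
   If the result carries into bit 24 the caller must renormalize. */
static uint32_t round_nearest_even(uint64_t wide, unsigned extra)
{
    uint64_t kept      = wide >> extra;                    /* 24 kept bits */
    uint64_t round_bit = (wide >> (extra - 1)) & 1u;       /* first dropped bit */
    int      sticky    = (wide & ((1ull << (extra - 1)) - 1u)) != 0;
    if (round_bit && (sticky || (kept & 1u)))
        kept += 1;
    return (uint32_t)kept;
}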
Having worked on this for countless nights, my questions are:
Is this approach to strtof() correct? (I know that GCC uses GNU MPL for it, and I tried to look into it. However, copying MPL's implementation would require porting the entire library, and this is definitely not what I want.) Maybe this split-then-multiply algorithm is inevitably prone to errors? I did some other small tricks (e.g. creating exponent tables for all integer exponents in float range), but they led to even more failed conversions.
If so, did I miss something while rounding? I thought so for a long time, but this 7038531e-32 number completely confused me.
If I want to be as precise as I can, I usually do stuff like this (however, I usually do the reverse operation, float -> text):
use only integers (no floats whatsoever)
as you know, a float is an integer mantissa bit-shifted by an integer exponent, so there is no need for floats.
For constructing the final float datatype you can use a simple union with a float and a 32 bit unsigned integer in it ... or pointers to such types pointing to the same address (a small sketch follows below).
This will avoid rounding errors for numbers that fit completely, and considerably shrink the error for those that don't fit.
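A minimal sketch of that construction, assuming IEEE 754 single precision (memcpy is shown in the body and the union in the comment, since both reinterpretations are common; names are illustrative):
#include <stdint.h>
#include <string.h>
union f32_bits { float f; uint32_t u; };
/* Assemble a float from a sign bit, a biased 8-bit exponent and a
   23-bit mantissa field. */
static float make_float(uint32_t sign, uint32_t biased_exp, uint32_t mantissa)
{
    uint32_t bits = (sign << 31) | ((biased_exp & 0xFFu) << 23) | (mantissa & 0x7FFFFFu);
    float f;
    memcpy(&f, &bits, sizeof f);   /* or: union f32_bits b; b.u = bits; return b.f; */
    return f;
}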
use hex numbers
You can convert your text of the decadic number on the fly into its hex counterpart (still as text); from there, creating the mantissa and exponent integers is simple.
Here:
How to convert a gi-normous integer (in string format) to hex format? (C#)
is a C++ implementation example of dec2hex and hex2dec number conversions done on text.
use more bits for mantissa while converting
for a task like this with single precision floats, I usually use 2 or 3 32 bit DWORDs for the 24 bit mantissa to still hold some precision after the multiplications. If you want to be precise you have to deal with 128+24 bits for both the integer and fractional parts of the number, so 5x32 bit numbers in sequence.
For more info and inspiration see (reverse operation):
my best attempt to print 32 bit floats with least rounding errors (integer math only)
Your code will be just the inverse of that (so many parts will be similar).
Since I posted that, I have made an even more advanced version that recognizes formatting just like printf, supports many more datatypes and more, without using any libs (however it's ~22.5 KByte of code). I needed it for MCUs, as the GCC implementations of print functions are not very good there ...
Assume we have some sequence as input. For performance reasons we may want to convert it to a homogeneous representation, and in order to do that we try to convert everything to the same type. Here let's consider only 2 types of input - int64 and float64 (in my simple code I will use numpy and Python; that is not the point of this question - one may think only about 64-bit integers and 64-bit floats).
First we may try to cast everything to float64.
So we want an input like:
31 1.2 -1234
to be converted to float64. If everything were int64 we could leave it unchanged ("already homogeneous"), or if something else were found we would return "not homogeneous". Pretty straightforward.
But here is the problem. Consider a bit modified input:
31000000 1.2 -1234
The idea is clear - we need to check that our "caster" is able to handle int64 values that are large in absolute value properly:
format(np.float64(31000000), '.0f') # just convert to float64 and print
'31000000'
Seems like no problem at all. So let's get right to it:
im = np.iinfo(np.int64).max # maximum of int64 type
format(np.float64(im), '.0f')
'9223372036854775808'
format(np.float64(im-100), '.0f')
'9223372036854775808'
Now this is really undesirable - we lose some information which may be needed, i.e. we want to preserve all the information provided in the input sequence.
So our im and im-100 values are cast to the same float64 representation. The reason for this is clear - float64 has only a 53-bit significand out of 64 bits total. That is why its precision is enough to represent log10(2^53) ~= 15.95, i.e. roughly any 16-digit int64 without information loss. But the int64 type holds up to 19 digits.
So we end up with roughly the [10^16; 10^19] range (more precisely [2^53; int64.max]) in which an int64 may be represented with information loss.
Q: What decision should one make in such a situation in order to represent int64 and float64 homogeneously?
I see several options for now:
Just convert the whole int64 range to float64 and "forget" about the possible information loss.
The motivation here is "the majority of inputs will rarely contain int64 values > 10^16".
EDIT: This clause was misleading. In clear formulation we don't consider such solutions (but left it for completeness).
Do not make such automatic conversions at all. Only if explicitly specified.
I.e. we accept the performance drawbacks for any int-float arrays, even ones like the simplest 1st case.
Calculate a threshold for performing the conversion to float64 without possible information loss, and use it when making the casting decision. If an int64 above this threshold is found - do not convert (return "not homogeneous").
We've already found this threshold: it is 2^53 (about 16 decimal digits, since log10(2^53) ~= 15.95). A sketch of such a check follows below.
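A minimal C sketch of such a check, assuming the usual IEEE 754 binary64 double (the function name is illustrative; it simply round-trips the value through double):
#include <stdbool.h>
#include <stdint.h>
/* True if v survives int64 -> double -> int64 without loss, i.e. it is
   below the 2^53 threshold or has enough trailing zero bits. The range
   guard avoids undefined behaviour when the rounded double falls
   outside the int64 range. */
static bool fits_double_exactly(int64_t v)
{
    double d = (double)v;
    if (d < -9223372036854775808.0 || d >= 9223372036854775808.0)
        return false;
    return (int64_t)d == v;
}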
Create a new type, "fint64". This is an exotic decision, but I'm considering even it for completeness.
The motivation here consists of 2 points. First: it is a frequent situation that a user wants to store int and float types together. Second: the structure of the float64 type. I don't quite understand why one would need a ~308-digit value range if the significand holds only ~16 of those digits and the other ~292 are essentially noise. So we might use one of the float64 exponent bits to indicate whether a float or an int is stored there. For int64 it would definitely be a drawback to lose 1 bit, because it would halve our integer range, but we would gain the ability to freely store ints along with floats without any additional overhead.
EDIT: While I initially thought of this as an "exotic" decision, in fact it is just a variant of another alternative - a composite type for our representation (see clause 5). But I need to add here that my 1st composition has a definite drawback - losing some range for float64 and for int64. What we would rather do is not subtract 1 bit but add one bit which acts as a flag for whether an int or a float is stored in the following 64 bits.
As proposed by @Brendan, one may use a composite type consisting of "a combination of 2 or more primitive types". So using additional primitives we may cover our "problem" range for int64, for example, and get a homogeneous representation in this "new" type (a small sketch follows).
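A minimal sketch of one such composite element in C (names are illustrative; on typical platforms alignment padding makes each element 16 bytes, which is exactly the space overhead being weighed here):
#include <stdint.h>
/* One element of a "homogeneous" array that can hold either type
   losslessly; the explicit tag plays the role of the flag bit from
   the "fint64" idea above, just rounded up to a whole byte. */
typedef struct {
    uint8_t is_float;   /* 0 = int64 payload, 1 = float64 payload */
    union {
        int64_t i;
        double  d;
    } v;
} fint64_t;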
EDITs:
Because a question arose here, I need to be very specific: the devised application in question does the following - it converts a sequence of int64 or float64 values to some homogeneous representation, losslessly if possible. The solutions are compared by performance (e.g. the total excess RAM needed for the representation). That is all. No other requirements are considered here (because we should consider the problem in its minimal state - not write the whole application). Correspondingly, an algorithm that represents our data in a homogeneous state losslessly (so we are sure we have not lost any information) fits our app.
I've decided to remove the words "app" and "user" from the question - they were also misleading.
When choosing a data type there are 3 requirements:
if values may have different signs
needed precision
needed range
Of course hardware doesn't provide a lot of types to choose from; so you'll need to select the next largest provided type. For example, if you want to store values ranging from 0 to 500 with 8 bits of precision; then hardware won't provide anything like that and you will need to use either 16-bit integer or 32-bit floating point.
When choosing a homogeneous representation there are 3 requirements:
if values may have different signs; determined from the requirements from all of the original types being represented
needed precision; determined from the requirements from all of the original types being represented
needed range; determined from the requirements from all of the original types being represented
For example, if you have integers from -10 to +10000000000 you need a 35 bit integer type; that doesn't exist, so you'll use a 64-bit integer. If you need floating point values from -2 to +2 with 31 bits of precision, you'll need a 33 bit floating point type; that doesn't exist, so you'll use a 64-bit floating point type. From the requirements of these two original types you'll know that a homogeneous representation needs a sign flag, a 33 bit significand (with an implied bit), and a 1-bit exponent; that doesn't exist either, so you'll use a 64-bit floating point type as the homogeneous representation.
However, if you don't know anything about the requirements of the original data types (and only know that, whatever the requirements were, they led to the selection of a 64-bit integer type and a 64-bit floating point type), then you'll have to assume "worst cases". This leads to needing a homogeneous representation that has a sign flag, 62 bits of precision (plus an implied 1 bit) and an 11 bit exponent. Of course this 74 bit floating point type doesn't exist, so you need to select the next largest type.
Also note that sometimes there is no "next largest type" that hardware supports. When this happens you need to resort to "composed types" - a combination of 2 or more primitive types. This can include anything up to and including "big rational numbers" (numbers represented by 3 big integers in "numerator / divisor * (1 << exponent)" form).
Of course if the original types (the 64-bit integer type and 64-bit floating point type) were primitive types and your homogeneous representation needs to use a "composed type"; then your "for performance reasons we may want to convert it in homogeneous representation" assumption is likely to be false (it's likely that, for performance reasons, you want to avoid using a homogeneous representation).
In other words:
If you don't know anything about the requirements of the original data types, it's likely that, for performance reasons, you want to avoid using a homogeneous representation.
Now...
Let's rephrase your question as "How to deal with design failures (choosing the wrong types which don't meet requirements)?". There is only one answer, and that is to avoid the design failure. Run-time checks (e.g. throwing an exception if the conversion to the homogeneous representation caused precision loss) serve no purpose other than to notify developers of design failures.
It is actually very basic: use 64-bit floating point. Floating point is an approximation, and you will lose precision for many ints. But there are no uncertainties other than "might this originally have been integral" and "does the original value deviate by more than 1.0".
I know of one non-standard floating point representation that would be more powerful (to be found on the net). That might (or might not) help cover the ints.
The only way to have an exact int mapping would be to reduce the int range and guarantee (say) 60-bit ints to be exact, with the remaining range approximated by floating point. Floating point would have to be reduced too, either in exponent range as mentioned, or in precision (the mantissa).
I am using a 64-bit server. My golang program needs integer types.
So, if I use the uint16 and uint32 types in source code, does it cost more than using the regular int type?
I am considering both computing cost and developing cost.
For the vast majority of cases using int makes more sense.
Here are some reasons:
Go doesn't implicitly convert between the numeric types, even when you think it should. If you start using some unsigned type instead of int, you should expect to pepper your code with multiple type conversions, because of other libraries or APIs preferring not to bother with unsigned types, because of untyped constant numerical expressions returning int values, etc.
Unsigned types are more prone to underflowing than signed types, because 0 (an unsigned type's boundary value) is much more of a naturally occurring value in computer programs than, for example, -9223372036854775808.
If you want to use an unsigned type because it restricts the values that you can put in it, keep in mind that when you combine silent underflow and compile time-only constant propagation, you probably aren't getting the bargain you were looking for. For example, while you cannot convert the constant math.MinInt64 to a uint, you can easily convert an int variable with value math.MinInt64 to a uint. And arguably it's not a bad Go style to have an if check whether the value you're trying to assign is valid for your program.
Unless you are experiencing significant memory pressure and your value space is somewhere slightly over what a smaller signed type would offer you, I'd think that using int will be much more efficient even if only because of development cost.
And even then, chances are that either there's a problem somewhere else in your program's memory footprint, or a managed language like Go is not the best fit for your needs.
How would you compute the multiplication of two 1024 bit numbers on a microprocessor that is only capable of multiplying 32 bit numbers?
The starting point is to realize that you already know how to do this: in elementary school you were taught how to do arithmetic on single digit numbers, and then given data structures to represent larger numbers (e.g. decimals) and algorithms to compute arithmetic operations (e.g. long division).
If you have a way to multiply two 32-bit numbers to give a 64-bit result (note that unsigned long long is guaranteed to be at least 64 bits), then you can use those same algorithms to do arithmetic in base 2^32.
You'll also need, e.g., an add with carry operation. You can determine the carry when adding two unsigned numbers of the same type by detecting overflow, e.g. as follows:
uint32_t x, y; // set to some value
uint32_t sum = x + y;
uint32_t carry = (sum < x);
(technically, this sort of operation requires that you do unsigned arithmetic: overflow in signed arithmetic is undefined behavior, and optimizers will do surprising things to your code when you least expect it)
(modern processors usually give a way to multiply two 64-bit numbers to give a 128-bit result, but to access it you will have to use compiler extensions like 128-bit types, or you'll have to write inline assembly code. modern processors also have specialized add-with-carry instructions)
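A minimal sketch of schoolbook long multiplication in base 2^32, using exactly the 32x32 -> 64 bit multiply and carry handling described above (words are stored least significant first; names and layout are illustrative):
#include <stddef.h>
#include <stdint.h>
/* result must have room for na + nb words and is fully overwritten.
   a has na words and b has nb words, least significant word first;
   two 1024-bit numbers are na = nb = 32. */
static void bigmul(uint32_t *result, const uint32_t *a, size_t na,
                   const uint32_t *b, size_t nb)
{
    for (size_t i = 0; i < na + nb; ++i)
        result[i] = 0;
    for (size_t i = 0; i < na; ++i) {
        uint64_t carry = 0;
        for (size_t j = 0; j < nb; ++j) {
            /* 32x32 -> 64 bit multiply, plus the partial sum already
               stored at this position, plus the running carry */
            uint64_t t = (uint64_t)a[i] * b[j] + result[i + j] + carry;
            result[i + j] = (uint32_t)t;
            carry = t >> 32;
        }
        result[i + nb] = (uint32_t)carry;  /* this word was still zero */
    }
}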
Now, to do arithmetic efficiently is an immense project; I found it quite instructive to browse through the documentation and source code to gmp, the GNU multiple precision arithmetic library.
look at any implementation of bigint operations
here are a few of my approaches in C++ for fast bignum squaring
some are solely for sqr but others are usable for multiplication...
use 32-bit arithmetic as a building block for 64/128/256/... bit arithmetic
see my 32bit ALU in x86 C++
use long multiplication with digit base 2^32
you can also use Karatsuba this way
I am using IEEE fixed point package in VHDL.
It works well, but I am now facing a problem concerning their string representation in a test bench: I would like to dump them into a text file.
I have found that it is indeed possible to directly write ufixed or sfixed using :
write(buf, to_string(x)); --where x is either sfixed or ufixed (and buf : line)
But then I get values like 11110001.10101 (for sfixed q8.5 representation).
So my question: how do I convert these fixed point numbers back to reals (and then to strings)?
The variable needs to be split into two std_logic_vector parts; the integer part can be converted to a string using the standard conversion, but for the fractional part the string conversion is a bit different. For the integer part you need a loop: divide by 10 and convert the modulo remainder into an ASCII character, building up from the lowest digit to the highest digit. For the fractional part you also need a loop, but you multiply by 10, take the floor, and isolate that digit to get the corresponding character; then that integer is subtracted from the fraction, and so on. This is the concept; I worked it out in MATLAB to test it and am making a VHDL version that I will share soon. I was surprised not to find such a useful function anywhere. Of course the fixed-point format can vary - Q(N,M) where N and M can have all sorts of values - while for floating point it is standardized.
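For illustration, here is the same concept as a small C sketch rather than VHDL (the names and the Q8.5 example value are only illustrative): the integer part is peeled off by repeated division by 10, and the fractional part by repeated multiplication by 10.
#include <stdint.h>
#include <stdio.h>
/* Convert a signed fixed-point value with frac_bits fractional bits to a
   decimal string with `digits` digits after the point. `out` must be
   large enough (sign + integer digits + '.' + digits + NUL). */
static void fixed_to_string(int32_t raw, unsigned frac_bits, unsigned digits, char *out)
{
    char *p = out;
    uint32_t mag = (raw < 0) ? (uint32_t)(-(int64_t)raw) : (uint32_t)raw;
    uint32_t ipart = mag >> frac_bits;
    uint32_t fpart = mag & ((1u << frac_bits) - 1u);
    if (raw < 0)
        *p++ = '-';
    char tmp[16];                      /* integer part, lowest digit first */
    int n = 0;
    do {
        tmp[n++] = (char)('0' + ipart % 10u);
        ipart /= 10u;
    } while (ipart != 0);
    while (n > 0)
        *p++ = tmp[--n];
    *p++ = '.';
    for (unsigned i = 0; i < digits; ++i) {   /* fractional part */
        fpart *= 10u;
        *p++ = (char)('0' + (fpart >> frac_bits));
        fpart &= (1u << frac_bits) - 1u;
    }
    *p = '\0';
}
int main(void)
{
    char buf[32];
    fixed_to_string(0x0E35, 5, 5, buf);   /* 01110001.10101 in Q8.5 */
    printf("%s\n", buf);                  /* prints 113.65625 */
    return 0;
}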