Oracle Floats vs Number - oracle

I'm seeing conflicting references in Oracles documentation. Is there any difference between how decimals are stored in a FLOAT and a NUMBER types in the database?
As I recall from C, et al, a float has accuracy limitations that an int doesn't have. R.g., For 'float's, 0.1(Base 10) is approximated as 0.110011001100110011001101(Base 2) which equals roughtly something like 0.100000001490116119384765625 (Base 10). However, for 'int's, 5(Base 10) is exactly 101(Base 2).
Which is why the following won't terminate as expected in C:
float i;
i = 0;
for (i=0; i != 10; )
{
i += 0.1
}
However I see elsewhere in Oracle's documentation that FLOAT has been defined as a NUMBER. And as I understand it, Oracle's implementation of the NUMBER type does not run into the same problem as C's float.
So, what's the real story here? Has Oracle deviated from the norm of what I expect to happen with floats/FLOATs?
(I'm sure it's a bee-fart-in-a-hurricane of difference for what I'll be using them for, but I know I'm going to have questions if 0.1*10 comes out to 1.00000000000000001)

Oracle's BINARY_FLOAT stores the data internally using IEEE 754 floating-point representation, like C and many other languages do. When you fetch them from the database, and typically store them in an IEEE 754 data type in the host language, it's able to copy the value without transforming it.
Whereas Oracle's FLOAT data type is a synonym for the ANSI SQL NUMERIC data type, called NUMBER in Oracle. This is an exact numeric, a scaled decimal data type that doesn't have the rounding behavior of IEEE 754. But if you fetch these values from the database and put them into a C or Java float, you can lose precision during this step.

The Oracle BINARY_FLOAT and BINARY_DOUBLE are mostly equivalent to the IEEE 754 standard but they are definitely not stored internally in the standard IEEE 754 representation.
For example, a BINARY_DOUBLE takes 9 bytes of storage vs. IEEE's 8. Also the double floating number -3.0 is represented as 3F-F7-FF-FF-FF-FF-FF-FF which if you use real IEEE would be C0-08-00-00-00-00-00-00. Notice that bit 63 is 0 in the Oracle representation while it is 1 in the IEEE one (if 's' is the sign bit, according to IEEE, the sign of the number is (-1)^s). See the very good IEEE 754 calculators at http://babbage.cs.qc.cuny.edu/IEEE-754/
You can easily find this if you have a BINARY__DOUBLE column BD in table T with the query:
select BD,DUMP(BD) from T
Now all of that is fine and interesting (maybe) but when one works in C and gets a numeric value from Oracle (by binding a variable to a numeric column of any kind), one typically gets the result in a real IEEE double as is supported by C. Now this value is subject to all of the usual IEEE inaccuracies.
If one wants to do precise arithmetic one can either do it in PL/SQL or using special precise-arithmetic C libraries.
For Oracle's own explanation of their numeric data types see: http://download.oracle.com/docs/cd/B19306_01/server.102/b14220/datatype.htm#i16209

Oracle's Number is in fact a Decimal (base-10) floating point representation...
Float is just an alias for Number and does the exact same thing.
if you want Binary (base-2) floats, you need to use Oracle's BINARY_FLOAT or BINARY_DOUBLE datatypes.
link text

Bill's answer about Oracle's FLOAT is only correct to late version(say 11i), in Oracle 8i, the document says:
You can specify floating-point numbers with the form discussed in
"NUMBER Datatype". Oracle also supports the ANSI datatype FLOAT. You
can specify this datatype using one of these syntactic forms:
FLOAT specifies a floating-point number with decimal precision 38, or
binary precision 126. FLOAT(b) specifies a floating-point number with
binary precision b. The precision b can range from 1 to 126. To
convert from binary to decimal precision, multiply b by 0.30103. To
convert from decimal to binary precision, multiply the decimal
precision by 3.32193. The maximum of 126 digits of binary precision is
roughly equivalent to 38 digits of decimal precision.
It sounds like a Quadruple precision(126 binary precision). If I am not mistaken, IEEE754 only requires b = 2, p = 24 for single precision and p = 53 for double precision. The differences between 8i an 11i caused a lot of confusion when I was looking into a conversion plan between Oracle and PostgreSQL.

Like the PLS_INTEGER mentioned previously, the BINARY_FLOAT and BINARY_DOUBLE types in Oracle 10g use machine arithmetic and require less storage space, both of which make them more efficient than the NUMBER type
ONLY BINARY_FLOAT and BINARY_DOUBLE supports NAN values
-not precise calculations

Related

What is the purpose of arbitrary precision constants in Go?

Go features untyped exact numeric constants with arbitrary size and precision. The spec requires all compilers to support integers to at least 256 bits, and floats to at least 272 bits (256 bits for the mantissa and 16 bits for the exponent). So compilers are required to faithfully and exactly represent expressions like this:
const (
PI = 3.1415926535897932384626433832795028841971
Prime256 = 84028154888444252871881479176271707868370175636848156449781508641811196133203
)
This is interesting...and yet I cannot find any way to actually use any such constant that exceeds the maximum precision of the 64 bit concrete types int64, uint64, float64, complex128 (which is just a pair of float64 values). Even the standard library big number types big.Int and big.Float cannot be initialized from large numeric constants -- they must instead be deserialized from string constants or other expressions.
The underlying mechanics are fairly obvious: the constants exist only at compile time, and must be coerced to some value representable at runtime to be used at runtime. They are a language construct that exists only in code and during compilation. You cannot retrieve the raw value of a constant at runtime; it is is not stored at some address in the compiled program itself.
So the question remains: Why does the language make such a point of supporting enormous constants when they cannot be used in practice?
TLDR; Go's arbitrary precision constants give you the possibility to work with "real" numbers and not with "boxed" numbers, so "artifacts" like overflow, underflow, infinity corner cases are relieved. You have the possibility to work with higher precision, and only the result have to be converted to limited-precision, mitigating the effect of intermediate errors.
The Go Blog: Constants: (emphasizes are mine answering your question)
Numeric constants live in an arbitrary-precision numeric space; they are just regular numbers. But when they are assigned to a variable the value must be able to fit in the destination. We can declare a constant with a very large value:
const Huge = 1e1000
—that's just a number, after all—but we can't assign it or even print it. This statement won't even compile:
fmt.Println(Huge)
The error is, "constant 1.00000e+1000 overflows float64", which is true. But Huge might be useful: we can use it in expressions with other constants and use the value of those expressions if the result can be represented in the range of a float64. The statement,
fmt.Println(Huge / 1e999)
prints 10, as one would expect.
In a related way, floating-point constants may have very high precision, so that arithmetic involving them is more accurate. The constants defined in the math package are given with many more digits than are available in a float64. Here is the definition of math.Pi:
Pi = 3.14159265358979323846264338327950288419716939937510582097494459
When that value is assigned to a variable, some of the precision will be lost; the assignment will create the float64 (or float32) value closest to the high-precision value. This snippet
pi := math.Pi
fmt.Println(pi)
prints 3.141592653589793.
Having so many digits available means that calculations like Pi/2 or other more intricate evaluations can carry more precision until the result is assigned, making calculations involving constants easier to write without losing precision. It also means that there is no occasion in which the floating-point corner cases like infinities, soft underflows, and NaNs arise in constant expressions. (Division by a constant zero is a compile-time error, and when everything is a number there's no such thing as "not a number".)
See related: How does Go perform arithmetic on constants?

64 bit integer and 64 bit float homogeneous representation

Assume we have some sequence as input. For performance reasons we may want to convert it in homogeneous representation. And in order to transform it into homogeneous representation we are trying to convert it to same type. Here lets consider only 2 types in input - int64 and float64 (in my simple code I will use numpy and python; it is not the matter of this question - one may think only about 64-bit integer and 64-bit floats).
First we may try to cast everything to float64.
So we want something like so as input:
31 1.2 -1234
be converted to float64. If we would have all int64 we may left it unchanged ("already homogeneous"), or if something else was found we would return "not homogeneous". Pretty straightforward.
But here is the problem. Consider a bit modified input:
31000000 1.2 -1234
Idea is clear - we need to check that our "caster" is able to handle large by absolute value int64 properly:
format(np.float64(31000000), '.0f') # just convert to float64 and print
'31000000'
Seems like not a problem at all. So lets go to the deal right away:
im = np.iinfo(np.int64).max # maximum of int64 type
format(np.float64(im), '.0f')
format(np.float64(im-100), '.0f')
'9223372036854775808'
'9223372036854775808'
Now its really undesired - we lose some information which maybe needed. I.e. we want to preserve all the information provided in the input sequence.
So our im and im-100 values cast to the same float64 representation. The reason of this is clear - float64 has only 53 significand of total 64 bits. That is why its precision enough to represent log10(2^53) ~= 15.95 i.e. about all 16-length int64 without any information loss. But int64 type contains up to 19 digits.
So we end up with about [10^16; 10^19] (more precisely [10^log10(53); int64.max]) range in which each int64 may be represented with information loss.
Q: What decision in such situation should one made in order to represent int64 and float64 homogeneously.
I see several options for now:
Just convert all int64 range to float64 and "forget" about possible information loss.
Motivation here is "majority of input barely will be > 10^16 int64 values".
EDIT: This clause was misleading. In clear formulation we don't consider such solutions (but left it for completeness).
Do not make such automatic conversions at all. Only if explicitly specified.
I.e. we agree with performance drawbacks. For any int-float arrays. Even with ones as in simplest 1st case.
Calculate threshold for performing conversion to float64 without possible information loss. And use it while making casting decision. If int64 above this threshold found - do not convert (return "not homogeneous").
We've already calculate this threshold. It is log10(2^53) rounded.
Create new type "fint64". This is an exotic decision but I'm considering even this one for completeness.
Motivation here consists of 2 points. First one: it is frequent situation when user wants to store int and float types together. Second - is structure of float64 type. I'm not quite understand why one will need ~308 digits value range if significand consists only of ~16 of them and other ~292 is itself a noise. So we might use one of float64 exponent bits to indicate whether its float or int is stored here. But for int64 it would be definitely drawback to lose 1 bit. Cause would reduce our integer range twice. But we would gain possibility freely store ints along with floats without any additional overhead.
EDIT: While my initial thinking of this was as "exotic" decision in fact it is just a variant of another solution alternative - composite type for our representation (see 5 clause). But need to add here that my 1st composition has definite drawback - losing some range for float64 and for int64. What we rather do - is not to subtract 1 bit but add one bit which represents a flag for int or float type stored in following 64 bits.
As proposed #Brendan one may use composite type consists of "combination of 2 or more primitive types". So using additional primitives we may cover our "problem" range for int64 for example and get homogeneous representation in this "new" type.
EDITs:
Because here question arisen I need to try be very specific: Devised application in question do following thing - convert sequence of int64 or float64 to some homogeneous representation lossless if possible. The solutions are compared by performance (e.g. total excessive RAM needed for representation). That is all. No any other requirements is considered here (cause we should consider a problem in its minimal state - not writing whole application). Correspondingly algo that represents our data in homogeneous state lossless (we are sure we not lost any information) fits into our app.
I've decided to remove words "app" and "user" from question - it was also misleading.
When choosing a data type there are 3 requirements:
if values may have different signs
needed precision
needed range
Of course hardware doesn't provide a lot of types to choose from; so you'll need to select the next largest provided type. For example, if you want to store values ranging from 0 to 500 with 8 bits of precision; then hardware won't provide anything like that and you will need to use either 16-bit integer or 32-bit floating point.
When choosing a homogeneous representation there are 3 requirements:
if values may have different signs; determined from the requirements from all of the original types being represented
needed precision; determined from the requirements from all of the original types being represented
needed range; determined from the requirements from all of the original types being represented
For example, if you have integers from -10 to +10000000000 you need a 35 bit integer type that doesn't exist so you'll use a 64-bit integer, and if you need floating point values from -2 to +2 with 31 bits of precision then you'll need a 33 bit floating point type that doesn't exist so you'll use a 64-bit floating point type; and from the requirements of these two original types you'll know that a homogeneous representation will need a sign flag, a 33 bit significand (with an implied bit), and a 1-bit exponent; which doesn't exist so you'll use a 64-bit floating point type as the homogeneous representation.
However; if you don't know anything about the requirements of the original data types (and only know that whatever the requirements were they led to the selection of a 64-bit integer type and a 64-bit floating point type), then you'll have to assume "worst cases". This leads to needing a homogeneous representation that has a sign flag, 62 bits of precision (plus an implied 1 bit) and an 8 bit exponent. Of course this 71 bit floating point type doesn't exist, so you need to select the next largest type.
Also note that sometimes there is no "next largest type" that hardware supports. When this happens you need to resort to "composed types" - a combination of 2 or more primitive types. This can include anything up to and including "big rational numbers" (numbers represented by 3 big integers in "numerator / divisor * (1 << exponent)" form).
Of course if the original types (the 64-bit integer type and 64-bit floating point type) were primitive types and your homogeneous representation needs to use a "composed type"; then your "for performance reasons we may want to convert it in homogeneous representation" assumption is likely to be false (it's likely that, for performance reasons, you want to avoid using a homogeneous representation).
In other words:
If you don't know anything about the requirements of the original data types, it's likely that, for performance reasons, you want to avoid using a homogeneous representation.
Now...
Let's rephrase your question as "How to deal with design failures (choosing the wrong types which don't meet requirements)?". There is only one answer, and that is to avoid the design failure. Run-time checks (e.g. throwing an exception if the conversion to the homogeneous representation caused precision loss) serve no purpose other than to notify developers of design failures.
It is actually very basic: use 64 bits floating point. Floating point is an approximation, and you will loose precision for many ints. But there are no uncertainties other than "might this originally have been integral" and "does the original value deviates more than 1.0".
I know of one non-standard floating point representation that would be more powerfull (to be found in the net). That might (or might not) help cover the ints.
The only way to have an exact int mapping, would be to reduce the int range, and guarantee (say) 60 bits ints to be precise, and the remaining range approximated by floating point. Floating point would have to be reduced too, either exponential range as mentioned, or precision (the mantissa).

Is using integers as fractional coefficients instead of floats a good idea for a monetary application?

My application requires a fractional quantity multiplied by a monetary value.
For example, $65.50 × 0.55 hours = $36.025 (rounded to $36.03).
I know that floats should not be used to represent money, so I'm storing all of my monetary values as cents. $65.50 in the above equation is stored as 6550 (integer).
For the fractional coefficient, my issue is that 0.55 does not have a 32-bit float representation. In the use case above, 0.55 hours == 33 minutes, so 0.55 is an example of a specific value that my application will need to account for exactly. The floating point representation of 0.550000012 is insufficient, because the user will not understand where the additional 0.000000012 came from. I cannot simply call a rounding function on 0.550000012 because it will round to the whole number.
Multiplication solution
To solve this, my first idea was to store all quantities as integers and multiply × 1000. So 0.55 entered by the user would become 550 (integer) when stored. All calculations would happen without floats, and then simply divide by 1000 (integer division, not float) when presenting the result to the user.
I realize that this would permanently limit me to 3 decimal places of
precision. If I decide that 3 is adequate for the lifetime of my
application, does this approach make sense?
Are there potential rounding issues if I were to use integer division?
Is there a name for this process? EDIT: As indicated by #SergGr, this is fixed-point arithmetic.
Is there a better approach?
EDIT:
I should have clarified, this is not time-specific. It is for generic quantities like 1.256 pounds of flour, 1 sofa, or 0.25 hours (think invoices).
What I'm trying to replicate here is a more exact version of Postgres's extra_float_digits = 0 functionality, where if the user enters 0.55 (float32), the database stores 0.550000012 but when queried for the result returns 0.55 which appears to be exactly what the user typed.
I am willing to limit this application's precision to 3 decimal places (it's business, not scientific), so that's what made me consider the × 1000 approach.
I'm using the Go programming language, but I'm interested in generic cross-language solutions.
Another solution to store the result is using the rational form of the value. You can explain the number by two integer value which the number is equal p/q, such that both p and q are integers. Hence, you can have more precision for your numbers and do some math with the rational numbers in the format of two integers.
Note: This is an attempt to merge different comments into one coherent answer as was requested by Matt.
TL;DR
Yes, this approach makes sense but most probably is not the best choice
Yes, there are rounding issues but there inevitably will be some no matter what representation you use
What you suggest using is called Decimal fixed point numbers
I'd argue yes, there is a better approach and it is to use some standard or popular decimal floating point numbers library for your language (Go is not my native language so I can't recommend one)
In PostgreSQL it is better to use Numeric (something like Numeric(15,3) for example) rather than a combination of float4/float8 and extra_float_digits. Actually this is what the first item in the PostgreSQL doc on Floating-Point Types suggests:
If you require exact storage and calculations (such as for monetary amounts), use the numeric type instead.
Some more details on how non-integer numbers can be stored
First of all there is a fundamental fact that there are infinitely many numbers in the range [0;1] so you obviously can't store every number there in any finite data structure. It means you have to make some compromises: no matter what way you choose, there will be some numbers you can't store exactly so you'll have to round.
Another important point is that people are used to 10-based system and in that system only results of division by numbers in a form of 2^a*5^b can be represented using a finite number of digits. For every other rational number even if you somehow store it in the exact form, you will have to do some truncation and rounding at the formatting for human usage stage.
Potentially there are infinitely many ways to store numbers. In practice only a few are widely used:
floating point numbers with two major branches of binary (this is what most today's hardware natively implements and what is support by most of the languages as float or double) and decimal. This is the format that store mantissa and exponent (can be negative), so the number is mantissa * base^exponent (I omit sign and just say it is logically a part of the mantissa although in practice it is usually stored separately). Binary vs. decimal is specified by the base. For example 0.5 will be stored in binary as a pair (1,-1) i.e. 1*2^-1 and in decimal as a pair (5,-1) i.e. 5*10^-1. Theoretically you can use any other base as well but in practice only 2 and 10 make sense as the bases.
fixed point numbers with the same division in binary and decimal. The idea is the same as in floating point numbers but some fixed exponent is used for all the numbers. What you suggests is actually a decimal fixed point number with the exponent fixed at -3. I've seen a usage of binary fixed-point numbers on some embedded hardware where there is no built-in support of floating point numbers, because binary fixed-point numbers can be implemented with reasonable efficiency using integer arithmetic. As for decimal fixed-point numbers, in practice they are not much easier to implement that decimal floating-point numbers but provide much less flexibility.
rational numbers format i.e. the value is stored as a pair of (p, q) which represents p/q (and usually q>0 so sign stored in p and either p=0, q=1 for 0 or gcd(p,q) = 1 for every other number). Usually this requires some big integer arithmetic to be useful in the first place (here is a Go example of math.big.Rat). Actually this might be an useful format for some problems and people often forget about this possibility, probably because it is often not a part of a standard library. Another obvious drawback is that as I said people are not used to think in rational numbers (can you easily compare which is greater 123/456 or 213/789?) so you'll have to convert the final results to some other form. Another drawback is that if you have a long chain of computations, internal numbers (p and q) might easily become very big values so computations will be slow. Still it may be useful to store intermediate results of calculations.
In practical terms there is also a division into arbitrary length and fixed length representations. For example:
IEEE 754 float or double are fixed length floating-point binary representations,
Go math.big.Float is an arbitrary length floating-point binary representations
.Net decimal is a fixed length floating-point decimal representations
Java BigDecimal is an arbitrary length floating-point decimal representations
In practical terms I'd says that the best solution for your problem is some big enough fixed length floating point decimal representations (like .Net decimal). An arbitrary length implementation would also work. If you have to make an implementation from scratch, than your idea of a fixed length fixed point decimal representation might be OK because it is the easiest thing to implement yourself (a bit easier than the previous alternatives) but it may become a burden at some point.
As mentioned in the comments, it would be best to use some builtin Decimal module in your language to handle exact arithmetic. However, since you haven't specified a language, we cannot be certain that your language may even have such a module. If it does not, here is how to go about doing so.
Consider using Binary Coded Decimal to store your values. The way it works is by restricting the values that can be stored per byte to 0 through 9 (inclusive), "wasting" the rest. You can encode a decimal representation of a number byte by byte that way. For example, 613 would become
6 -> 0000 0110
1 -> 0000 0001
3 -> 0000 0011
613 -> 0000 0110 0000 0001 0000 0011
Where each grouping of 4 digits above is a "nibble" of a byte. In practice, a packed variant is used, where two decimal digits are packed into a byte (one per nibble) to be less "wasteful". You can then implement a few methods to do your basic addition, subtract, multiplication, etc. Just iterate over an array of bytes, and perform your classic grade school addition / multiplication algorithms (keep in mind for the packed variant that you may need to pad a zero to get an even number of nibbles). You just need to keep a variable to store where the decimal point is, and remember to carry where necessary to preserve the encoding.

Which cfsqltype to use for Oracle's Number(*,0) datatype in cfqueryparam?

I am a little bit confused with which datatype I should use for Oracle's Number(*,0) with zero scale and any precision?
Which one should I use CF_SQL_INTEGER or CF_SQL_FLOAT? and why?
According to the documentation, Number(*,0) means you are working with very large integers, ie up to 38 digits and no decimal places:
column_name NUMBER (precision, scale)
... precision (total number of digits) and scale (number of digits to
the right of the decimal point):
column_name NUMBER (*, scale)
In this case, the precision is 38, and the specified scale is
maintained.
That is too many digits to store in CF_SQL_INTEGER. To support the full range, requires a type with a much greater capacity. Looking at the standard JDBC Mappings that means either java.sql.Types.NUMERIC or java.sql.Types.DECIMAL. Both of those use java's BigDecimal for storage, which has more than enough capacity for Number(38,0).
The cfqueryparam matrix and Oracle JDBC driver documentation both say the same thing about the DECIMAL type. Since java.sql.Types.NUMERIC is really just a synonym for java.sql.Types.DECIMAL you can use either one.
Note: When using cfqueryparam, if you omit the "scale" attribute, it defaults to scale="0", ie no decimal places.
<cfqueryparam type="CF_SQL_DECIMAL" scale="0" value="....">

How is π calculated within sas?

just curious! but I spotted that the value of π held by SAS is in fact incorrect.
for instance:
data _null_;
x= constant('pi') * 1000000000000000000000000000;
put x= 32.;
run;
gives a π value of (3.)141592653589792961327005696
however - π is of course (3.)1415926535897932384626433832795 ( http://www.joyofpi.com/pi.html ) - to 31 dp.
what gives??!!
SAS stores PI as a constant to 14 decimal places. The difference you are seeing is an artifact of floating point math when you did the multiplication step.
data _null_;
pi=constant("PI");
put pi= 32.30;
run;
/*On Log */
pi=3.141592653589790000000000000000
PI is held as a constant in all programming languages to a set precision. It isn't calculated. Your code just exposes how accurate PI is in SAS.
You got 16 digits of precision. Which means it probably uses an IEEE 754 double-precision floating point representation, which only gives about 16-17 decimal digits of precision. It is impossible for π to be represented in any finite number of digits, so any computer representation is going to be truncated at some number of digits. There are ways of doing arbitrary-precision math (Java has a BigDecimal class), but even then you'd have to truncate π somewhere. And math done that way is several orders of magnitude slower (because it is not handled by direct CPU instructions).
As Garry Shutler said, it's held as a constant. Note that that small fractional values in the numeric types of computer languages are rarely all that accurate (in fact, their accuracy can be lower than their precision), because they're stored as very good approximations that can be manipulated quickly. If you need excellent precision (as in financial and scientific endeavors), you need to use special types like Java's BigDecimal that handle being completely accurate (at the cost of computational speed). (Sorry, don't know SAS so don't know of an analog for BigDecimal.)

Resources