IEEE-754 Standard - format

I have what is actually a very easy question about the IEEE-754 standard, in which numbers are encoded and stored on the computer.
At uni (exams) I have come across the following definition for 16-bit IEEE-754-format (half precision): 1 sign bit, 6 exponent bits & 9 mantissa bits.
An internet search (or books) reveals another definition:
1 sign bit, 5 exponent bits & 10 mantissa bits
The reason why I’m asking is that I cannot believe the uni might have made such a simple mistake, so are there multiple definitions for numbers given in 16-bit IEEE-754 format?

Conforming to an IEEE standard is voluntary. People are free to use other formats. The IEEE-754 standard specifies a binary16 format that uses 1 bit for the sign, 5 bits for the exponent, and 10 bits for the primary significand encoding.
People may use other formats because they want more or less precision in the significand or range in the exponent.
Textbooks and academic exercises often use non-standard formats for the purpose of inducing students to reason about them on their own rather than looking up answers or learning existing formats by rote.
If the hardware you are using supports a 16-bit floating-point format, the binding specification for that format is in the hardware documentation, not in the IEEE-754 standard.
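For concreteness, here is a minimal sketch in Go of decoding the standard binary16 layout described above (1 sign bit, 5 exponent bits with a bias of 15, 10 fraction bits). The function name and the choice to return a float64 are mine, purely for illustration:

package main

import (
	"fmt"
	"math"
)

// decodeBinary16 interprets a 16-bit pattern as IEEE-754 binary16:
// 1 sign bit, 5 exponent bits (bias 15), 10 fraction bits.
func decodeBinary16(bits uint16) float64 {
	sign := float64(1)
	if bits&0x8000 != 0 {
		sign = -1
	}
	exp := int((bits >> 10) & 0x1F)
	frac := float64(bits & 0x3FF)

	switch exp {
	case 0: // subnormals and zero: no implicit leading 1
		return sign * frac * math.Pow(2, -24) // frac/2^10 * 2^(1-15)
	case 0x1F: // all-ones exponent: infinities and NaNs
		if frac == 0 {
			return sign * math.Inf(1)
		}
		return math.NaN()
	default: // normal numbers: implicit leading 1
		return sign * (1 + frac/1024) * math.Pow(2, float64(exp-15))
	}
}

func main() {
	fmt.Println(decodeBinary16(0x3C00)) // 1
	fmt.Println(decodeBinary16(0xC000)) // -2
	fmt.Println(decodeBinary16(0x3555)) // 0.333251953125, the closest binary16 to 1/3
}

A non-standard classroom format (such as the 1/6/9 split in the question) would be decoded the same way, only with different field widths and a different bias.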

Related

Does an OCaml char represent exactly 8 bits?

From the Char library documentation, I see that chars are able to represent at least the ISO/IEC 8859-1 character set, a character set that uses 8 bits per character. Do OCaml chars represent exactly 8 bits, no more and no less? Where is this documented?
The document says this:
Character values are represented as 8-bit integers between 0 and 255. Character codes between 0 and 127 are interpreted following the ASCII standard. The current implementation interprets character codes between 128 and 255 following the ISO 8859-1 standard.
So yes, an OCaml char represents exactly 8 bits.
The documentation for base values is here: OCaml Manual, Chapter 9.2. Values.
Update
It might be worth noting that although a char value in OCaml can take on only values from 0 to 255, in the mainline OCaml version (from INRIA) the actual space occupied in memory by a char value is the same as for int. On a 32-bit implementation this will be 32 bits and on a 64-bit implementation it will be 64 bits. So (for example) a char array is not a space-efficient way to store more than a few chars. You can use string or bytes to get compact storage of char values (as 8 bits each).
The documentation for representation of OCaml values is here: OCaml Manual, Chapter 20.3, Representation of OCaml Data Types.
The representation of the char type could be different depending on the implementation of the OCaml language and runtime. While all chars shall fit into 8 bits, an implementation may use a bigger type to represent it. The Char abstraction guarantees that it is impossible to create a character that uses more than 8 bits. And even though the INRIA implementation of OCaml represents Char.t the same as Int.t, it still relies on the assumption that char will fit into 8 bits. For example, a bigarray of n chars will take n bytes. And String.t will have a size in bytes proportional to the number of characters that comprise the string. Last but not least, various external (i.e., implemented in C) functions and the optimized compiler itself will assume that a character fits into 8 bits.

Is there a reason why arbitrary precision arithmetic (such as BigInt in JavaScript) is implemented in binary?

From this question, it seems Google Chrome and Node.js both chose to implement arbitrary precision arithmetic in binary. Is there a good reason to do that?
If we can add, subtract, multiply, or divide one decimal digit at a time, doing 7 + 8 = 15 and carrying to the next digit, that seems faster than doing it bit by bit, where 7 + 8 needs 4 single-bit additions.
V8 developer here. Binary is a good choice because hardware is binary [*]. That doesn't mean that operations happen one bit at a time. In V8, a BigInt's "digits" are uintptr_t values, i.e. register-sized (32 bit on a 32-bit machine, 64 bit on a 64-bit machine) unsigned integers. See our blog post for an overview, and the source for all the gory details. FWIW, many other implementations (e.g. GMP, OpenJDK, Go, Dart) have made the same basic choice.
[*] Some hardware architectures have instructions for "binary coded decimal" arithmetic, which is similar to what you're describing, but this approach is (1) generally considered less efficient, and (2) not available on all architectures that we want V8 to run on.
One possible answer: it is done by adding two 32- or 64-bit integers together, so it is faster than doing it one decimal digit at a time.
For multiplication, two 64-bit integers can be multiplied, probably in a single machine instruction, and all the digits of the result obtained at once.
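As a rough illustration of the "register-sized digits" idea (this is not V8's actual code; the limb layout and function name are invented for the sketch), here is how adding two arbitrary-precision values stored as little-endian slices of 64-bit limbs might look in Go, using the carry-propagating add from math/bits:

package main

import (
	"fmt"
	"math/bits"
)

// addLimbs adds two non-negative big integers stored as little-endian
// slices of 64-bit "digits" (limbs). Each step adds one full machine
// word at a time; the carry out of one limb feeds into the next.
func addLimbs(a, b []uint64) []uint64 {
	if len(a) < len(b) {
		a, b = b, a
	}
	sum := make([]uint64, len(a), len(a)+1)
	var carry uint64
	for i := range a {
		var x uint64
		if i < len(b) {
			x = b[i]
		}
		sum[i], carry = bits.Add64(a[i], x, carry)
	}
	if carry != 0 {
		sum = append(sum, carry)
	}
	return sum
}

func main() {
	// 2^64 - 1 plus 1 overflows the low limb and carries into a new one.
	fmt.Println(addLimbs([]uint64{^uint64(0)}, []uint64{1})) // [0 1], i.e. 2^64
}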

Is using integers as fractional coefficients instead of floats a good idea for a monetary application?

My application requires a fractional quantity multiplied by a monetary value.
For example, $65.50 × 0.55 hours = $36.025 (rounded to $36.03).
I know that floats should not be used to represent money, so I'm storing all of my monetary values as cents. $65.50 in the above equation is stored as 6550 (integer).
For the fractional coefficient, my issue is that 0.55 does not have an exact 32-bit float representation. In the use case above, 0.55 hours == 33 minutes, so 0.55 is an example of a specific value that my application will need to account for exactly. The nearest floating point value, 0.550000012, is insufficient, because the user will not understand where the additional 0.000000012 came from. I cannot simply call a rounding function on 0.550000012 because it would round to a whole number.
Multiplication solution
To solve this, my first idea was to store all quantities as integers and multiply × 1000. So 0.55 entered by the user would become 550 (integer) when stored. All calculations would happen without floats, and then simply divide by 1000 (integer division, not float) when presenting the result to the user.
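A minimal Go sketch of that ×1000 idea, using the $65.50 × 0.55 example from above (the helper name and the round-half-up rule are my own choices, purely for illustration):

package main

import "fmt"

const scale = 1000 // quantities are stored as thousandths

// mulMoney multiplies a money amount in cents by a quantity in
// thousandths, rounding half up back to whole cents.
// A real implementation would also guard against int64 overflow
// and pick a rounding rule for negative amounts.
func mulMoney(cents, thousandths int64) int64 {
	product := cents * thousandths     // cents * 1000, exact in integers
	return (product + scale/2) / scale // round half up to cents
}

func main() {
	cents := mulMoney(6550, 550)                   // $65.50 * 0.550
	fmt.Printf("$%d.%02d\n", cents/100, cents%100) // $36.03
}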
I realize that this would permanently limit me to 3 decimal places of precision. If I decide that 3 is adequate for the lifetime of my application, does this approach make sense?
Are there potential rounding issues if I were to use integer division?
Is there a name for this process? EDIT: As indicated by #SergGr, this is fixed-point arithmetic.
Is there a better approach?
EDIT:
I should have clarified, this is not time-specific. It is for generic quantities like 1.256 pounds of flour, 1 sofa, or 0.25 hours (think invoices).
What I'm trying to replicate here is a more exact version of Postgres's extra_float_digits = 0 functionality, where if the user enters 0.55 (float32), the database stores 0.550000012 but when queried for the result returns 0.55 which appears to be exactly what the user typed.
I am willing to limit this application's precision to 3 decimal places (it's business, not scientific), so that's what made me consider the × 1000 approach.
I'm using the Go programming language, but I'm interested in generic cross-language solutions.
Another solution is to store the result in rational form. You can express the number as two integers p and q such that the number equals p/q. This gives you more precision, and you can do the math on rational numbers represented as pairs of integers.
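In Go this is available out of the box via math/big's Rat type; a small sketch of the $65.50 × 0.55 calculation from the question in exact rational arithmetic:

package main

import (
	"fmt"
	"math/big"
)

func main() {
	price := big.NewRat(6550, 100) // $65.50 as the exact fraction 131/2
	hours := big.NewRat(55, 100)   // 0.55 as the exact fraction 11/20

	total := new(big.Rat).Mul(price, hours) // exact: 1441/40 = 36.025

	// FloatString rounds the last digit, so this prints "36.03".
	fmt.Println(total.FloatString(2))
}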
Note: This is an attempt to merge different comments into one coherent answer as was requested by Matt.
TL;DR
Yes, this approach makes sense but most probably is not the best choice
Yes, there are rounding issues but there inevitably will be some no matter what representation you use
What you suggest using is called Decimal fixed point numbers
I'd argue yes, there is a better approach and it is to use some standard or popular decimal floating point numbers library for your language (Go is not my native language so I can't recommend one)
In PostgreSQL it is better to use Numeric (something like Numeric(15,3) for example) rather than a combination of float4/float8 and extra_float_digits. Actually this is what the first item in the PostgreSQL doc on Floating-Point Types suggests:
If you require exact storage and calculations (such as for monetary amounts), use the numeric type instead.
Some more details on how non-integer numbers can be stored
First of all there is a fundamental fact that there are infinitely many numbers in the range [0;1] so you obviously can't store every number there in any finite data structure. It means you have to make some compromises: no matter what way you choose, there will be some numbers you can't store exactly so you'll have to round.
Another important point is that people are used to the base-10 system, and in that system only fractions whose denominators have the form 2^a*5^b can be represented using a finite number of digits. For every other rational number, even if you somehow store it in exact form, you will have to do some truncation and rounding at the formatting-for-human-usage stage.
Potentially there are infinitely many ways to store numbers. In practice only a few are widely used:
floating point numbers, with two major branches: binary (this is what most of today's hardware natively implements and what most languages support as float or double) and decimal. This is the format that stores a mantissa and an exponent (which can be negative), so the number is mantissa * base^exponent (I omit the sign and just say it is logically a part of the mantissa, although in practice it is usually stored separately). Binary vs. decimal is determined by the base. For example, 0.5 will be stored in binary as the pair (1,-1), i.e. 1*2^-1, and in decimal as the pair (5,-1), i.e. 5*10^-1. Theoretically you could use any other base as well, but in practice only 2 and 10 make sense.
fixed point numbers, with the same division into binary and decimal. The idea is the same as for floating point numbers, but some fixed exponent is used for all the numbers. What you suggest is actually a decimal fixed point number with the exponent fixed at -3 (see the small Go illustration after this list). I've seen binary fixed-point numbers used on embedded hardware with no built-in support for floating point, because binary fixed-point numbers can be implemented with reasonable efficiency using integer arithmetic. As for decimal fixed-point numbers, in practice they are not much easier to implement than decimal floating-point numbers but provide much less flexibility.
rational numbers format, i.e. the value is stored as a pair (p, q) which represents p/q (and usually q > 0, so the sign is stored in p, and either p=0, q=1 for 0 or gcd(p,q) = 1 for every other number). Usually this requires some big integer arithmetic to be useful in the first place (here is a Go example: math/big.Rat). Actually this might be a useful format for some problems, and people often forget about the possibility, probably because it is often not a part of a standard library. Another obvious drawback is that, as I said, people are not used to thinking in rational numbers (can you easily tell which is greater, 123/456 or 213/789?), so you'll have to convert the final results to some other form. Another drawback is that if you have a long chain of computations, the internal numbers (p and q) might easily become very big, so computations will be slow. Still, it may be useful for storing intermediate results of calculations.
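A tiny Go illustration of the binary floating point vs. decimal fixed point difference for the questioner's 0.55 (the scaled integer here is just the ×1000 idea written out, not a library API):

package main

import "fmt"

func main() {
	// Binary floating point: 0.55 has no exact base-2 representation,
	// so the stored value is the nearest double, visible with enough digits.
	f := 0.55
	fmt.Printf("%.20f\n", f) // 0.55000000000000004441...

	// Decimal fixed point with the exponent fixed at -3: 0.55 is stored
	// exactly as the integer 550, meaning 550 * 10^-3.
	const exponent = 3
	stored := int64(550)
	fmt.Printf("%d * 10^-%d = 0.%03d\n", stored, exponent, stored) // 550 * 10^-3 = 0.550
}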
In practical terms there is also a division into arbitrary length and fixed length representations. For example:
IEEE 754 float or double are fixed length floating-point binary representations,
Go math/big.Float is an arbitrary length floating-point binary representation
.NET decimal is a fixed length floating-point decimal representation
Java BigDecimal is an arbitrary length floating-point decimal representation
In practical terms I'd say that the best solution for your problem is some big enough fixed length floating-point decimal representation (like .NET decimal). An arbitrary length implementation would also work. If you have to make an implementation from scratch, then your idea of a fixed length fixed point decimal representation might be OK, because it is the easiest thing to implement yourself (a bit easier than the previous alternatives), but it may become a burden at some point.
As mentioned in the comments, it would be best to use some built-in Decimal module in your language to handle exact arithmetic. However, since you haven't specified a language, we cannot be certain that your language even has such a module. If it does not, here is how you might go about it yourself.
Consider using Binary Coded Decimal to store your values. The way it works is by restricting the values that can be stored per byte to 0 through 9 (inclusive), "wasting" the rest. You can encode a decimal representation of a number byte by byte that way. For example, 613 would become
6 -> 0000 0110
1 -> 0000 0001
3 -> 0000 0011
613 -> 0000 0110 0000 0001 0000 0011
Where each group of 4 bits above is a "nibble" of a byte. In practice, a packed variant is used, where two decimal digits are packed into a byte (one per nibble) to be less "wasteful". You can then implement a few methods to do your basic addition, subtraction, multiplication, etc. Just iterate over an array of bytes and perform your classic grade-school addition/multiplication algorithms (keep in mind that for the packed variant you may need to pad a zero to get an even number of nibbles). You just need to keep a variable that stores where the decimal point is, and remember to carry where necessary to preserve the encoding.
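A minimal Go sketch of the unpacked (one digit per byte) variant and the grade-school addition it enables; the type and function names here are invented for the illustration:

package main

import "fmt"

// A bcd number is stored one decimal digit (0-9) per byte,
// least significant digit first, e.g. 613 -> [3 1 6].
type bcd []byte

// add performs grade-school addition digit by digit, carrying
// whenever a column sum reaches 10, so the result stays valid BCD.
func add(a, b bcd) bcd {
	if len(a) < len(b) {
		a, b = b, a
	}
	out := make(bcd, 0, len(a)+1)
	carry := byte(0)
	for i := 0; i < len(a); i++ {
		d := a[i] + carry
		if i < len(b) {
			d += b[i]
		}
		out = append(out, d%10)
		carry = d / 10
	}
	if carry > 0 {
		out = append(out, carry)
	}
	return out
}

func main() {
	fmt.Println(add(bcd{3, 1, 6}, bcd{9, 8})) // 613 + 89 = 702 -> [2 0 7]
	fmt.Println(add(bcd{9, 9, 9}, bcd{1}))    // 999 + 1 = 1000 -> [0 0 0 1]
}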

Decimal Presentation : different purpose between zoned decimal and packed decimal

I'm taking a Computer Science course, and when I read these definitions I understand them. But I don't know the different purposes of the two representations, or why they differ.
Here is the short explanation of purpose that my book gives:
Zoned decimal: highly compatible with text data.
Packed decimal: faster computing speed.
What I want to know is:
1) In the zoned decimal representation there is a zone section repeated for every digit. Why? I don't see the purpose. :(
2) Why do they say zoned decimal is compatible with text data, and why is packed decimal faster?
Thanks :)
Firstly, where are you learning CS? Those terms are from the 1960s; the more common name is BCD (Binary Coded Decimal).
Zoned decimal uses an entire byte for each digit. This means you can just print a number as if it were text (each 'character' stores a digit 0-9), but since there are only 10 digits and a byte can hold 256 different values, this is a bit wasteful.
Packed decimal uses the fact that 4 bits can store 16 different values. So you can store two digits in a byte (top 4 bits and bottom 4 bits). This is still a bit wasteful, since only 10 of the 16 values of each nibble are used, but it's pretty easy to extract the two digits with just shift and mask operations.
Pretty much the only place you would see BCD these days is in some low-level hardware where you want to read or transmit a digit without using a microprocessor at all. It's easy to make a BCD counter purely in transistors,
but if you want to do any maths you either have to do long arithmetic on each digit, like you would on paper, or convert into regular ints and back again.
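To make the shift-and-mask point concrete, here is a small Go sketch of packing two decimal digits into one byte and getting them back out (the helper names are mine, for illustration only):

package main

import "fmt"

// packDigits stores two decimal digits in one byte:
// the first in the top nibble, the second in the bottom nibble.
func packDigits(hi, lo byte) byte {
	return hi<<4 | lo
}

// unpackDigits recovers the two digits with a shift and a mask.
func unpackDigits(b byte) (hi, lo byte) {
	return b >> 4, b & 0x0F
}

func main() {
	b := packDigits(4, 2)
	fmt.Printf("packed: %#02x\n", b) // packed: 0x42
	hi, lo := unpackDigits(b)
	fmt.Println(hi, lo) // 4 2
}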
Both of these representations have fallen out of favor, perhaps because they are not directly supported by C, and hence not by any of the systems descended from Unix.
Packed decimal has an advantage in two respects: since it takes up less space it can get off the bus and into the processor faster, and many CISC instruction sets have dedicated instructions for arithmetic on it. To quote from http://en.wikipedia.org/wiki/Packed_decimal#Packed_BCD:
Packed BCD [binary coded decimal] is supported in the COBOL programming language as the "COMPUTATIONAL-3" (an IBM extension adopted by many other compiler vendors) or "PACKED-DECIMAL" (part of the 1985 COBOL standard) data type. Besides the IBM System/360 and later compatible mainframes, packed BCD was implemented in the native instruction set of the original VAX processors from Digital Equipment Corporation and was the native format for the Burroughs Corporation Medium Systems line of mainframes (descended from the 1950s Electrodata 200 series).
Zoned decimal (http://en.wikipedia.org/wiki/Zoned_decimal#Zoned_decimal) has an easy mapping between characters on punch cards and their representation in memory, which perhaps explains your textbook's claim that it is "highly compatible with text data." As the Wikipedia article suggests, it's a term more used in IBM mainframe circles. On minis, we tended to just call it plain old decimal, PIC 9 data.
"Zoned Decimal" in its natural environment is meant to be compatable with the EBCDIC char set .
ASCII represents the digits as x'30' to x'39', which display as characters '0' to '9'.
The EBCDIC character set (which has its origins in Hollerith punched cards) uses a similar but different scheme, where x'F0' is displayed as character '0' and x'F9' is displayed as character '9'.
Punched cards had a fixed length of 80 characters, and in many cases 10 or 12 of these characters were eaten up by record type identifiers and sequence numbers (desperately important if you dropped a bunch of cards on the floor!). So space was at a premium. Rather than enter a "+" or "-" character next to each number, an "overpunch" (extra holes near the top of the card column) was used to indicate a positive or negative number, saving a byte.
These overpunched characters were encoded in EBCDIC as x'D0' to x'D9' for -0 to -9 and x'C0' to x'C9' for +0 to +9, usually in the last digit of the number.
Hence the "Zoned Decimal" format: the first four bits of each byte are the zone, the second four bits the number, so -42 was encoded as x'F4D2'.
This is more of a convention than anything else, as the computer could not do arithmetic directly on this format. So it needed to be converted into "packed" format before any calculations took place. This is pretty easy: x'F4D2' -> x'042D' is mostly a case of grabbing the zone of the last byte and extracting the "numeric" four bits from each byte, which could then be converted to binary.
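A rough Go sketch of that zoned-to-packed step for x'F4D2' (a simplified illustration of the idea, not the mainframe instruction itself):

package main

import "fmt"

// zonedToPacked converts EBCDIC zoned decimal bytes (one digit per byte,
// zone in the top nibble, sign carried in the zone of the last byte)
// into packed decimal (two digits per byte, sign in the final nibble).
func zonedToPacked(zoned []byte) []byte {
	sign := zoned[len(zoned)-1] >> 4 // zone of the last digit: C/F = +, D = -

	// Collect the numeric (low) nibble of every byte, then append the sign.
	nibbles := []byte{}
	for _, z := range zoned {
		nibbles = append(nibbles, z&0x0F)
	}
	nibbles = append(nibbles, sign)

	// Pad with a leading zero if needed so the nibbles fill whole bytes.
	if len(nibbles)%2 != 0 {
		nibbles = append([]byte{0}, nibbles...)
	}

	packed := make([]byte, 0, len(nibbles)/2)
	for i := 0; i < len(nibbles); i += 2 {
		packed = append(packed, nibbles[i]<<4|nibbles[i+1])
	}
	return packed
}

func main() {
	fmt.Printf("% x\n", zonedToPacked([]byte{0xF4, 0xD2})) // 04 2d, i.e. -42
}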
When IBM mainframes were designed, the largest group of users were banks, insurance companies and utility companies. The bulk of their processing followed this pattern:
read punch card.
read tape record.
add monthly payment to balance
store new balance on tape
print new balance
Most of the calculations involved currency amounts and most of the results were displayed immediately. It became clear that if the machine could do the arithmetic directly on the packed decimal values you could avoid several expensive "convert to binary" and "convert to decimal" instructions. As a bonus it made it easy to place the decimal point at the correct position and perform any decimal rounding. So a great deal of work went into implementing native packed decimal instructions (zero, add, subtract, multiply, divide, shift and round etc.).
This has been the preferred currency format for IBM mainframes ever since.
For many years developers on other platforms poured scorn on the mainframers for using such an archaic format, and only recently began to realize how difficult it is to do fixed-point decimal arithmetic to the standards accountants and tax collectors expect. Thanks to the efforts of Mike Cowlishaw and others, the rest of the world has caught up with the venerable IBM 360, and Java programmers can now calculate sales tax correctly using the BigDecimal library, which is based on a variation of the old packed decimal format.

Implied bit in IEEE floating point format

Why is there an implied (or hidden) bit in IEEE floating point format? What is the purpose of it? It is mentioned in passing on Wikipedia.
From (Complete) Tutorial to Understand IEEE Floating-Point Errors:
"[Fraction] is the normalized fractional part of the number, normalized because the exponent is adjusted so that the leading bit is always a 1. This way, it does not have to be stored, and you get one more bit of precision. This is why there is an implied bit."
It basically allows for higher precision.
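A small Go illustration of the implied bit, pulling apart the bit pattern of a float32 (the field widths and the bias of 127 are from the standard; the rest is just for demonstration):

package main

import (
	"fmt"
	"math"
)

func main() {
	f := float32(6.5) // 6.5 = 1.625 * 2^2, so the stored fraction is .625
	bits := math.Float32bits(f)

	sign := bits >> 31
	exponent := int((bits>>23)&0xFF) - 127 // remove the bias of 127
	fraction := bits & 0x7FFFFF            // the 23 explicitly stored bits

	// For normal numbers the leading 1 is implied, not stored:
	significand := 1 + float64(fraction)/(1<<23)

	fmt.Println(sign, exponent, significand)                  // 0 2 1.625
	fmt.Println(significand * math.Pow(2, float64(exponent))) // 6.5
}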

Resources