Misunderstanding Go Language specification on floating-point rounding - go

The Go language specification on the section about Constant expressions states:
A compiler may use rounding while computing untyped floating-point or complex constant expressions; see the implementation restriction in the section on constants. This rounding may cause a floating-point constant expression to be invalid in an integer context, even if it would be integral when calculated using infinite precision, and vice versa.
Does the sentence
This rounding may cause a floating-point constant expression to be invalid in an integer context
point to something like the following:
func main() {
a := 853784574674.23846278367
fmt.Println(int8(a)) // output: 0
}

The quoted part from the spec does not apply to your example, as a is not a constant expression but a variable, so int8(a) is converting a non-constant expression. This conversion is covered by Spec: Conversions, Conversions between numeric types:
When converting a floating-point number to an integer, the fraction is discarded (truncation towards zero).
[...] In all non-constant conversions involving floating-point or complex values, if the result type cannot represent the value the conversion succeeds but the result value is implementation-dependent.
Since you convert a non-constant expression a being 853784574674.23846278367 to an integer, the fraction part is discarded, and since the result does not fit into an int8, the result is not specified, it's implementation-dependent.
The quoted part means that while constants are represented with a lot higher precision than the builtin types (eg. float64 or int64), the precision that a compiler (have to) implement is not infinite (for practical reasons), and even if a floating point literal is representable precisely, performing operations on them may be carried out with intermediate roundings and may not give mathematically correct result.
The spec includes the minimum supportable precision:
Implementation restriction: Although numeric constants have arbitrary precision in the language, a compiler may implement them using an internal representation with limited precision. That said, every implementation must:
Represent integer constants with at least 256 bits.
Represent floating-point constants, including the parts of a complex constant, with a mantissa of at least 256 bits and a signed binary exponent of at least 16 bits.
Give an error if unable to represent an integer constant precisely.
Give an error if unable to represent a floating-point or complex constant due to overflow.
Round to the nearest representable constant if unable to represent a floating-point or complex constant due to limits on precision.
For example:
const (
x = 1e100000 + 1
y = 1e100000
)
func main() {
fmt.Println(x - y)
}
This code should output 1 as x is being 1 larger than y. Running it on the Go Playground outputs 0 because the constant expression x - y is executed with roundings, and the +1 is lost as a result. Both x and y are integers (have no fraction part), so in integer context the result should be 1. But the number being 1e100000, representing it requires around ~333000 bits, which is not a valid requirement from a compiler (according to the spec, 256 bit mantissa is sufficient).
If we lower the constants, we get correct result:
const (
x = 1e1000 + 1
y = 1e1000
)
func main() {
fmt.Println(x - y)
}
This outputs the mathematically correct 1 result. Try it on the Go Playground. Representing the number 1e1000 requires around ~3333 bits which seems to be supported (and it's way above the minimum 256 bit requirement).

An int8 is a signed integer, and can have a value from -128 to 127. That's why you are seeing unexpected value with int8(a) conversion.

Related

How go compare overflow int? [duplicate]

I've been reading this post on constants in Go, and I'm trying to understand how they are stored and used in memory. You can perform operations on very large constants in Go, and as long as the result fits in memory, you can coerce that result to a type. For example, this code prints 10, as you would expect:
const Huge = 1e1000
fmt.Println(Huge / 1e999)
How does this work under the hood? At some point, Go has to store 1e1000 and 1e999 in memory, in order to perform operations on them. So how are constants stored, and how does Go perform arithmetic on them?
Short summary (TL;DR) is at the end of the answer.
Untyped arbitrary-precision constants don't live at runtime, constants live only at compile time (during the compilation). That being said, Go does not have to represent constants with arbitrary precision at runtime, only when compiling your application.
Why? Because constants do not get compiled into the executable binaries. They don't have to be. Let's take your example:
const Huge = 1e1000
fmt.Println(Huge / 1e999)
There is a constant Huge in the source code (and will be in the package object), but it won't appear in your executable. Instead a function call to fmt.Println() will be recorded with a value passed to it, whose type will be float64. So in the executable only a float64 value being 10.0 will be recorded. There is no sign of any number being 1e1000 in the executable.
This float64 type is derived from the default type of the untyped constant Huge. 1e1000 is a floating-point literal. To verify it:
const Huge = 1e1000
x := Huge / 1e999
fmt.Printf("%T", x) // Prints float64
Back to the arbitrary precision:
Spec: Constants:
Numeric constants represent exact values of arbitrary precision and do not overflow.
So constants represent exact values of arbitrary precision. As we saw, there is no need to represent constants with arbitrary precision at runtime, but the compiler still has to do something at compile time. And it does!
Obviously "infinite" precision cannot be dealt with. But there is no need, as the source code itself is not "infinite" (size of the source is finite). Still, it's not practical to allow truly arbitrary precision. So the spec gives some freedom to compilers regarding to this:
Implementation restriction: Although numeric constants have arbitrary precision in the language, a compiler may implement them using an internal representation with limited precision. That said, every implementation must:
Represent integer constants with at least 256 bits.
Represent floating-point constants, including the parts of a complex constant, with a mantissa of at least 256 bits and a signed exponent of at least 32 bits.
Give an error if unable to represent an integer constant precisely.
Give an error if unable to represent a floating-point or complex constant due to overflow.
Round to the nearest representable constant if unable to represent a floating-point or complex constant due to limits on precision.
These requirements apply both to literal constants and to the result of evaluating constant expressions.
However, also note that when all the above said, the standard package provides you the means to still represent and work with values (constants) with "arbitrary" precision, see package go/constant. You may look into its source to get an idea how it's implemented.
Implementation is in go/constant/value.go. Types representing such values:
// A Value represents the value of a Go constant.
type Value interface {
// Kind returns the value kind.
Kind() Kind
// String returns a short, human-readable form of the value.
// For numeric values, the result may be an approximation;
// for String values the result may be a shortened string.
// Use ExactString for a string representing a value exactly.
String() string
// ExactString returns an exact, printable form of the value.
ExactString() string
// Prevent external implementations.
implementsValue()
}
type (
unknownVal struct{}
boolVal bool
stringVal string
int64Val int64 // Int values representable as an int64
intVal struct{ val *big.Int } // Int values not representable as an int64
ratVal struct{ val *big.Rat } // Float values representable as a fraction
floatVal struct{ val *big.Float } // Float values not representable as a fraction
complexVal struct{ re, im Value }
)
As you can see, the math/big package is used to represent untyped arbitrary precision values. big.Int is for example (from math/big/int.go):
// An Int represents a signed multi-precision integer.
// The zero value for an Int represents the value 0.
type Int struct {
neg bool // sign
abs nat // absolute value of the integer
}
Where nat is (from math/big/nat.go):
// An unsigned integer x of the form
//
// x = x[n-1]*_B^(n-1) + x[n-2]*_B^(n-2) + ... + x[1]*_B + x[0]
//
// with 0 <= x[i] < _B and 0 <= i < n is stored in a slice of length n,
// with the digits x[i] as the slice elements.
//
// A number is normalized if the slice contains no leading 0 digits.
// During arithmetic operations, denormalized values may occur but are
// always normalized before returning the final result. The normalized
// representation of 0 is the empty or nil slice (length = 0).
//
type nat []Word
And finally Word is (from math/big/arith.go)
// A Word represents a single digit of a multi-precision unsigned integer.
type Word uintptr
Summary
At runtime: predefined types provide limited precision, but you can "mimic" arbitrary precision with certain packages, such as math/big and go/constant. At compile time: constants seemingly provide arbitrary precision, but in reality a compiler may not live up to this (doesn't have to); but still the spec provides minimal precision for constants that all compiler must support, e.g. integer constants must be represented with at least 256 bits which is 32 bytes (compared to int64 which is "only" 8 bytes).
When an executable binary is created, results of constant expressions (with arbitrary precision) have to be converted and represented with values of finite precision types – which may not be possible and thus may result in compile-time errors. Note that only results –not intermediate operands– have to be converted to finite precision, constant operations are carried out with arbitrary precision.
How this arbitrary or enhanced precision is implemented is not defined by the spec, math/big for example stores "digits" of the number in a slice (where digits is not a digit of the base 10 representation, but "digit" is an uintptr which is like base 4294967295 representation on 32-bit architectures, and even bigger on 64-bit architectures).
Go constants are not allocated to memory. They are used in context by the compiler. The blog post you refer to gives the example of Pi:
Pi = 3.14159265358979323846264338327950288419716939937510582097494459
If you assign Pi to a float32 it will lose precision to fit, but if you assign it to a float64, it will lose less precision, but the compiler will determine what type to use.

Max value on 64 bit

Below code compiles:
package main
import "fmt"
const (
// Max integer value on 64 bit architecture.
maxInt = 9223372036854775807
// Much larger value than int64.
bigger = 9223372036854775808543522345
// Will NOT compile
// biggerInt int64 = 9223372036854775808543522345
)
func main() {
fmt.Println("Will Compile")
//fmt.Println(bigger) // error
}
Type is size in memory + representation of bits in that memory
What is the implicit type assigned to bigger at compile time? Because error constant 9223372036854775808543522345 overflows int for line fmt.Println(bigger)
Those are untyped constants. They have larger limits than typed constants:
https://golang.org/ref/spec#Constants
In particular:
Represent integer constants with at least 256 bits.
None, it's an untyped constant. Because you haven't assigned it to any variable or used it in any expression, it doesn't "need" to be given a representation as any concrete type yet. Numeric constants in Go have effectively unlimited precision (required by the language spec to be at least 256 bits for integers, and at least 256 mantissa bits for floating-point numbers, but I believe that the golang/go compiler uses the Go arbitrary-precision types internally which are only limited by memory). See the section about Constants in the language spec.
What is the use of a constant if you can't assign it to a variable of any type? Well, it can be part of a constant expression. Constant expressions are evaluated at arbitrary precision, and their results may be able to be assigned to a variable. In other words, it's allowed to use values that are too big to represent to reach an answer that is representable, as long as all of that computation happens at compile time.
From this comment:
my goal is to convertf bigger = 9223372036854775808543522345 to binary form
we find that your question is an XY problem.
Since we do know that the constant exceeds 64 bits, we'll need to take it apart into multiple 64-bit words, or store it in some sort of bigger-integer storage.
Go provides math/big for general purpose large-number operations, or in this case we can take advantage of the fact that it's easy to store up to 127-bit signed values (or 128-bit unsigned values) in a struct holding two 64-bit integers (at least one of which is unsigned).
This rather trivial program prints the result of converting to binary:
500000000 x 2-sup-64 + 543522345 as binary:
111011100110101100101000000000000000000000000000000000000000000100000011001010111111000101001
package main
import "fmt"
const (
// Much larger value than int64.
bigger = 9223372036854775808543522345
d64 = 1 << 64
)
type i128 struct {
Upper int64
Lower uint64
}
func main() {
x := i128{Upper: bigger / d64, Lower: bigger % d64}
fmt.Printf("%d x 2-sup-64 + %d as binary:\n%b%.64b\n", x.Upper, x.Lower, x.Upper, x.Lower)
}

Go convert int to byte

var x uint64 = 257
var y int = 257
fmt.Println("rv1 is ", byte(x)) // ok
fmt.Println("rv2 is ", byte(y)) // ok
fmt.Println("rv3 is ", byte(257)) // constant 257 overflows byte
fmt.Println("rv4 is ", byte(int(257))) // constant 257 overflows byte
It is strange.
All of them are converting int to byte,so all of them should be error.
But case 1,2 is ok!
How could be that?
Variable numeric values can be converted to smaller types, with the normal loss of the high bits.
The compiler refuses to do this for constant values (that is clearly always an error). This is required by the spec (emphasize mine):
Every implementation must:
Represent integer constants with at least 256 bits.
Represent floating-point constants, including the parts of a complex constant, > with a mantissa of at least 256 bits and a signed binary exponent of at least 16 bits.
Give an error if unable to represent an integer constant precisely.
Give an error if unable to represent a floating-point or complex constant due to overflow.
Round to the nearest representable constant if unable to represent a floating-point or complex constant due to limits on precision.
These requirements apply both to literal constants and to the result of evaluating constant expressions.
Consequently, if you change var x and var y to const x and const y, you get an error for all four cases.

How does Go perform arithmetic on constants?

I've been reading this post on constants in Go, and I'm trying to understand how they are stored and used in memory. You can perform operations on very large constants in Go, and as long as the result fits in memory, you can coerce that result to a type. For example, this code prints 10, as you would expect:
const Huge = 1e1000
fmt.Println(Huge / 1e999)
How does this work under the hood? At some point, Go has to store 1e1000 and 1e999 in memory, in order to perform operations on them. So how are constants stored, and how does Go perform arithmetic on them?
Short summary (TL;DR) is at the end of the answer.
Untyped arbitrary-precision constants don't live at runtime, constants live only at compile time (during the compilation). That being said, Go does not have to represent constants with arbitrary precision at runtime, only when compiling your application.
Why? Because constants do not get compiled into the executable binaries. They don't have to be. Let's take your example:
const Huge = 1e1000
fmt.Println(Huge / 1e999)
There is a constant Huge in the source code (and will be in the package object), but it won't appear in your executable. Instead a function call to fmt.Println() will be recorded with a value passed to it, whose type will be float64. So in the executable only a float64 value being 10.0 will be recorded. There is no sign of any number being 1e1000 in the executable.
This float64 type is derived from the default type of the untyped constant Huge. 1e1000 is a floating-point literal. To verify it:
const Huge = 1e1000
x := Huge / 1e999
fmt.Printf("%T", x) // Prints float64
Back to the arbitrary precision:
Spec: Constants:
Numeric constants represent exact values of arbitrary precision and do not overflow.
So constants represent exact values of arbitrary precision. As we saw, there is no need to represent constants with arbitrary precision at runtime, but the compiler still has to do something at compile time. And it does!
Obviously "infinite" precision cannot be dealt with. But there is no need, as the source code itself is not "infinite" (size of the source is finite). Still, it's not practical to allow truly arbitrary precision. So the spec gives some freedom to compilers regarding to this:
Implementation restriction: Although numeric constants have arbitrary precision in the language, a compiler may implement them using an internal representation with limited precision. That said, every implementation must:
Represent integer constants with at least 256 bits.
Represent floating-point constants, including the parts of a complex constant, with a mantissa of at least 256 bits and a signed exponent of at least 32 bits.
Give an error if unable to represent an integer constant precisely.
Give an error if unable to represent a floating-point or complex constant due to overflow.
Round to the nearest representable constant if unable to represent a floating-point or complex constant due to limits on precision.
These requirements apply both to literal constants and to the result of evaluating constant expressions.
However, also note that when all the above said, the standard package provides you the means to still represent and work with values (constants) with "arbitrary" precision, see package go/constant. You may look into its source to get an idea how it's implemented.
Implementation is in go/constant/value.go. Types representing such values:
// A Value represents the value of a Go constant.
type Value interface {
// Kind returns the value kind.
Kind() Kind
// String returns a short, human-readable form of the value.
// For numeric values, the result may be an approximation;
// for String values the result may be a shortened string.
// Use ExactString for a string representing a value exactly.
String() string
// ExactString returns an exact, printable form of the value.
ExactString() string
// Prevent external implementations.
implementsValue()
}
type (
unknownVal struct{}
boolVal bool
stringVal string
int64Val int64 // Int values representable as an int64
intVal struct{ val *big.Int } // Int values not representable as an int64
ratVal struct{ val *big.Rat } // Float values representable as a fraction
floatVal struct{ val *big.Float } // Float values not representable as a fraction
complexVal struct{ re, im Value }
)
As you can see, the math/big package is used to represent untyped arbitrary precision values. big.Int is for example (from math/big/int.go):
// An Int represents a signed multi-precision integer.
// The zero value for an Int represents the value 0.
type Int struct {
neg bool // sign
abs nat // absolute value of the integer
}
Where nat is (from math/big/nat.go):
// An unsigned integer x of the form
//
// x = x[n-1]*_B^(n-1) + x[n-2]*_B^(n-2) + ... + x[1]*_B + x[0]
//
// with 0 <= x[i] < _B and 0 <= i < n is stored in a slice of length n,
// with the digits x[i] as the slice elements.
//
// A number is normalized if the slice contains no leading 0 digits.
// During arithmetic operations, denormalized values may occur but are
// always normalized before returning the final result. The normalized
// representation of 0 is the empty or nil slice (length = 0).
//
type nat []Word
And finally Word is (from math/big/arith.go)
// A Word represents a single digit of a multi-precision unsigned integer.
type Word uintptr
Summary
At runtime: predefined types provide limited precision, but you can "mimic" arbitrary precision with certain packages, such as math/big and go/constant. At compile time: constants seemingly provide arbitrary precision, but in reality a compiler may not live up to this (doesn't have to); but still the spec provides minimal precision for constants that all compiler must support, e.g. integer constants must be represented with at least 256 bits which is 32 bytes (compared to int64 which is "only" 8 bytes).
When an executable binary is created, results of constant expressions (with arbitrary precision) have to be converted and represented with values of finite precision types – which may not be possible and thus may result in compile-time errors. Note that only results –not intermediate operands– have to be converted to finite precision, constant operations are carried out with arbitrary precision.
How this arbitrary or enhanced precision is implemented is not defined by the spec, math/big for example stores "digits" of the number in a slice (where digits is not a digit of the base 10 representation, but "digit" is an uintptr which is like base 4294967295 representation on 32-bit architectures, and even bigger on 64-bit architectures).
Go constants are not allocated to memory. They are used in context by the compiler. The blog post you refer to gives the example of Pi:
Pi = 3.14159265358979323846264338327950288419716939937510582097494459
If you assign Pi to a float32 it will lose precision to fit, but if you assign it to a float64, it will lose less precision, but the compiler will determine what type to use.

Implement equation in VHDL

I am trying to implement the equation in VHDL which has multiplication by some constant and addition. The equation is as below,
y<=-((x*x*x)*0.1666)+(2.5*(x*x))- (21.666*x) + 36.6653; ----error
I got the error
HDLCompiler:1731 - found '0' definitions of operator "*",
can not determine exact overloaded matching definition for "*".
entity is
entity eq1 is
Port ( x : in signed(15 downto 0);
y : out signed (15 downto 0) );
end eq1;
I tried using the function RESIZE and x in integer but it gives same error. Should i have to use another data type? x is having pure integer values like 2,4,6..etc.
Since x and y are of datatype signed, you can multiply them. However, there is no multiplication of signed with real. Even if there was, the result would be real (not signed or integer).
So first, you need to figure out what you want (the semantics). Then you should add type casts and conversion functions.
y <= x*x; -- OK
y <= 0.5 * x; -- not OK
y <= to_signed(integer(0.5 * real(to_integer(x))),y'length); -- OK
This is another case where simulating before synthesis might be handy. ghdl for instances tells you which "*" operator it finds the first error for:
ghdl -a implement.vhdl
implement.vhdl:12:21: no function declarations for operator "*"
y <= -((x*x*x) * 0.1666) + (2.5 * (x*x)) - (21.666 * x) + 36.6653;
---------------^ character position 21, line 12
The expressions with x multiplied have both operands with a type of signed.
(And for later, we also note that the complex expression on the right hand side of the signal assignment operation will eventually be interpreted as a signed value with a narrow subtype constraint when assigned to y).
VHDL determines the type of the literal 0.1666, it's an abstract literal, that is decimal literal or floating-point literal (IEEE Std 1076-2008 5.2.5 Floating-point types, 5.2.5.1 General, paragraph 5):
Floating-point literals are the literals of an anonymous predefined type that is called universal_real in this standard. Other floating-point types have no literals. However, for each floating-point type there exists an implicit conversion that converts a value of type universal_real into the corresponding value (if any) of the floating-point type (see 9.3.6).
There's only one predefined floating-point type in VHDL, see 5.2.5.2, and the floating-point literal of type universal_real is implicitly converted to type REAL.
9.3.6 Type conversions paragraph 14 tells us:
In certain cases, an implicit type conversion will be performed. An implicit conversion of an operand of type universal_integer to another integer type, or of an operand of type universal_real to another floating-point type, can only be applied if the operand is either a numeric literal or an attribute, or if the operand is an expression consisting of the division of a value of a physical type by a value of the same type; such an operand is called a convertible universal operand. An implicit conversion of a convertible universal operand is applied if and only if the innermost complete context determines a unique (numeric) target type for the implicit conversion, and there is no legal interpretation of this context without this conversion.
Because you haven't included a package containing another floating-point type that leaves us searching for a "*" multiplying operator with one operand of type signed and one of type REAL with a return type of signed (or another "*" operator with the opposite operand type arguments) and VHDL found 0 of those.
There is no
function "*" (l: signed; r: REAL) return REAL;
or
function "*" (l: signed; r: REAL) return signed;
found in package numeric_std.
Phillipe suggests one way to overcome this by converting signed x to integer.
Historically synthesis doesn't encompass type REAL, prior to the 2008 version of the VHDL standard you were likely to have arbitrary precision, while 5.2.5 paragraph 7 now tells us:
An implementation shall choose a representation for all floating-point types except for universal_real that conforms either to IEEE Std 754-1985 or to IEEE Std 854-1987; in either case, a minimum representation size of 64 bits is required for this chosen representation.
And that doesn't help us unless the synthesis tool supports floating-point types of REAL and is -2008 compliant.
VHDL has the float_generic_pkg package introduced in the 2008 version, which performs synthesis eligible floating point operations and is compatible with the used of signed types by converting to and from it's float type.
Before we suggest something so drastic as performing all these calculations as 64 bit floating point numbers and synthesize all that let's again note that the result is a 16 bit signed which is an array type of std_ulogic and represents a 16 bit integer.
You can model the multiplications on the right hand side as distinct expressions executed in both floating point or signed
representation to determine when the error is significant.
Because you are using a 16 bit signed value for y, significant would mean a difference greater than 1 in magnitude. Flipped signs or unexpected 0s between the two methods will likely tell you there's a precision issue.
I wrote a little C program to look at the differences and right off the bat it tells us 16 bits isn't enough to hold the math:
int16_t x, y, Y;
int16_t a,b,c,d;
double A,B,C,D;
a = x*x*x * 0.1666;
A = x*x*x * 0.1666;
b = 2.5 * x*x;
B = 2.5 * x*x;
c = 21.666 * x;
C = 21.666 * x;
d = 36;
D = 36.6653;
y = -( a + b - c + d);
Y = (int16_t) -(A + B - C + D);
And outputs for the left most value of x:
x = -32767, a = 11515, b = 0, c = 10967, y = -584, Y = 0
x = -32767, A = -178901765.158200, B = 2684190722.500000, C = -709929.822000
x = -32767 , y = -584 , Y= 0, double = -2505998923.829100
The first line of output is for 16 bit multiplies and you can see all three expressions with multiplies are incorrect.
The second line says double has enough precision, yet Y (-(A + B - C + D)) doesn't fit in a 16 bit number. And you can't cure that by making the result size larger unless the input size remains the same. Chaining operations then becomes a matter of picking best product and keeping track of the scale, meaning you might as well use floating point.
You could of course do clamping if it were appropriate. The double value on the third line of output is the non truncated value. It's more negative than x'LOW.
You could also do clamping in the 16 bit math domain, though all this tells you this math has no meaning in the hardware domain unless it's done in floating point.
So if you were trying to solve a real math problem in hardware it would require floating point, likely accomplished using package float_generic_pkg, and wouldn't fit meaningfully in a 16 bit result.
As stated in found '0' definitions of operator “+” in VHDL, the VHDL compiler is unable to find the matching operator for your operation, which is e.g. multiplying x*x. You probably want to use numeric_std (see here) in order to make operators for signed (and unsigned) available.
But note, that VHDL is not a programming language but a hardware design language. That is, if your long-term goal is to move the code to an FPGA or CPLD, these functions might not work any longer, because they are not synthesizable.
I'm stating this, because you will become more problems when you try to multiply with e.g. 0.1666, because VHDL usually has no knowledge about floating point numbers out of the box.

Resources