Float Arithmetic inconsistent between golang programs - go

When decoding audio files with pion/opus I will occasionally get values that are incorrect.
I have debugged it down to the following code. When this routine runs inside the Opus decoder I get a different value than when I run it outside. When the two floats are added together, the rightmost bit is different. The difference eventually becomes a problem as the program runs longer.
Is this a bug or expected behavior? I don't know how to debug this deeper/dump state of my program to understand more.
Outside decoder
package main

import (
	"fmt"
	"math"
)

func main() {
	a := math.Float32frombits(uint32(955684399))
	b := math.Float32frombits(uint32(927295728))
	fmt.Printf("%b\n", math.Float32bits(a))
	fmt.Printf("%b\n", math.Float32bits(b))
	fmt.Printf("%b\n", math.Float32bits(a+b))
}
Returns
111000111101101001011000101111
110111010001010110100011110000
111001000001111010000110100110
Then Inside decoder
fmt.Printf("%b\n", math.Float32bits(lpcVal))
fmt.Printf("%b\n", math.Float32bits(val))
fmt.Printf("%b\n", math.Float32bits(lpcVal+val))
Returns
111000111101101001011000101111
110111010001010110100011110000
111001000001111010000110100111

I guess that lpcVal and val are not float32 but rather float64.
If that is the case, then you are performing two different operations:
in the former case, you round each operand to float32 first and then add, that is float32(lpcVal) + float32(val)
in the latter case, you add first and round once, that is float32(lpcVal + val)
the two 32 bits floats are in binary:
1.11101101001011000101111 * 2^-14
1.10001010110100011110000 * 2^-17
The exact sum is
1.000011110100001101001101 * 2^-13
which is an exact tie between two representable Float32
the result is rounded to the Float32 with even significand
1.00001111010000110100110 * 2^-13
But lpcVal and val are float64: instead of 23 bits after the binary point, they have 52 (29 more).
If a single bit among those 29 extra bits is nonzero, the result might not be an exact tie, but slightly larger than the exact tie.
Once converted to the nearest float32, that will be
1.00001111010000110100111 * 2^-13
Since we have no idea what lpcVal and val contain in those low-order bits, anything can happen, even without the use of FMA operations.

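This was happening because of fused multiply-add (FMA). Multiple floating-point operations were being combined into one operation, which rounds only once instead of after each step.
You can read more about it in the Go language spec, in the section on floating-point operators. The spec also says that an explicit floating-point type conversion rounds to the precision of the target type, which prevents fusion across that conversion.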
The change I made to my code was
- lpcVal += currentLPCVal * (aQ12 / 4096.0)
+ lpcVal = float32(lpcVal) + float32(currentLPCVal)*float32(aQ12)/float32(4096.0)
Thank you to Bryan C. Mills for answering this on the #performance channel on the Gophers slack.

Related

golang losing precision while converting float32 to float64?

In Go, it seems that when a float64 variable is first converted to float32 and then back to float64, its value changes.
a := -8888.95
fmt.Println(a) // -8888.95
fmt.Println(float32(a)) // -8888.95
fmt.Println(float64(float32(a))) // -8888.9501953125
How can I keep the value unchanged?
The way you have described the problem is perhaps misleading.
The precision is not lost "when converting float32 to float64"; rather, it is lost when converting from float64 to float32.
So how can you avoid losing precision when converting from float64 to float32? You can't. This task is impossible, and it's quite easy to see the reason why:
float64 has twice as many bits as float32
multiple different float64 values will map to the same float32 value due to the pigeonhole principle
the conversion is therefore not reversible.
Adjusting your program to show a more precise output of each value, you'll see exactly where the precision is lost:
package main

import (
	"fmt"
)

func main() {
	a := -8888.95
	fmt.Printf("%.20f\n", a)
	fmt.Printf("%.20f\n", float32(a))
	fmt.Printf("%.20f\n", float64(float32(a)))
}
This prints:
-8888.95000000000072759576
-8888.95019531250000000000
-8888.95019531250000000000
That is, after the float32 conversion (as is expected).
It's also worth noting that neither float64 nor float32 can represent your value -8888.95 exactly. If you convert this number to a fraction, you will get -177779/20. Notice the denominator, 20. The prime factorization of 20 is 2 * 2 * 5.
If you apply this process to a number and the prime factorization of the denominator contains any factors which are NOT 2, then you can rest assured that this number is definitely not representable exactly in binary floating point form. You may discover that the probability of any number passing this test is quite low.
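The denominator test can be sketched with math/big. The helper name below is hypothetical. Note that a power-of-two denominator is only a necessary condition: the numerator still has to fit in the mantissa and the exponent has to be in range.

```go
package main

import (
	"fmt"
	"math/big"
)

// hasPowerOfTwoDenominator reports whether the decimal string s, reduced to
// lowest terms, has a denominator that is a power of two.
func hasPowerOfTwoDenominator(s string) (bool, error) {
	r, ok := new(big.Rat).SetString(s)
	if !ok {
		return false, fmt.Errorf("cannot parse %q as a rational", s)
	}
	d := r.Denom() // always > 0, and the fraction is in lowest terms
	// A positive integer is a power of two when its only set bit is the top bit.
	return d.TrailingZeroBits() == uint(d.BitLen()-1), nil
}

func main() {
	for _, s := range []string{"-8888.95", "0.375", "0.1"} {
		ok, err := hasPowerOfTwoDenominator(s)
		if err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Printf("%s: power-of-two denominator = %v\n", s, ok)
	}
}
```

Here -8888.95 reduces to -177779/20 and fails the test, while 0.375 reduces to 3/8 and passes.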

Max value on 64 bit

The code below compiles:
package main

import "fmt"

const (
	// Max integer value on 64 bit architecture.
	maxInt = 9223372036854775807
	// Much larger value than int64.
	bigger = 9223372036854775808543522345
	// Will NOT compile
	// biggerInt int64 = 9223372036854775808543522345
)

func main() {
	fmt.Println("Will Compile")
	//fmt.Println(bigger) // error
}
A type is a size in memory plus a representation of the bits in that memory.
What is the implicit type assigned to bigger at compile time? I ask because the line fmt.Println(bigger) fails with the error: constant 9223372036854775808543522345 overflows int.
Those are untyped constants. They have larger limits than typed constants:
https://golang.org/ref/spec#Constants
In particular:
Represent integer constants with at least 256 bits.
None, it's an untyped constant. Because you haven't assigned it to any variable or used it in any expression, it doesn't "need" to be given a representation as any concrete type yet. Numeric constants in Go have effectively unlimited precision (required by the language spec to be at least 256 bits for integers, and at least 256 mantissa bits for floating-point numbers, but I believe that the golang/go compiler uses the Go arbitrary-precision types internally which are only limited by memory). See the section about Constants in the language spec.
What is the use of a constant if you can't assign it to a variable of any type? Well, it can be part of a constant expression. Constant expressions are evaluated at arbitrary precision, and their results may be able to be assigned to a variable. In other words, it's allowed to use values that are too big to represent to reach an answer that is representable, as long as all of that computation happens at compile time.
From this comment:
my goal is to convert bigger = 9223372036854775808543522345 to binary form
we find that your question is an XY problem.
Since we do know that the constant exceeds 64 bits, we'll need to take it apart into multiple 64-bit words, or store it in some sort of bigger-integer storage.
Go provides math/big for general purpose large-number operations, or in this case we can take advantage of the fact that it's easy to store up to 127-bit signed values (or 128-bit unsigned values) in a struct holding two 64-bit integers (at least one of which is unsigned).
This rather trivial program prints the result of converting to binary:
500000000 x 2-sup-64 + 543522345 as binary:
111011100110101100101000000000000000000000000000000000000000000100000011001010111111000101001
package main

import "fmt"

const (
	// Much larger value than int64.
	bigger = 9223372036854775808543522345
	d64    = 1 << 64
)

type i128 struct {
	Upper int64
	Lower uint64
}

func main() {
	x := i128{Upper: bigger / d64, Lower: bigger % d64}
	fmt.Printf("%d x 2-sup-64 + %d as binary:\n%b%.64b\n", x.Upper, x.Lower, x.Upper, x.Lower)
}

How does Go perform arithmetic on constants?

I've been reading this post on constants in Go, and I'm trying to understand how they are stored and used in memory. You can perform operations on very large constants in Go, and as long as the result fits in memory, you can coerce that result to a type. For example, this code prints 10, as you would expect:
const Huge = 1e1000
fmt.Println(Huge / 1e999)
How does this work under the hood? At some point, Go has to store 1e1000 and 1e999 in memory, in order to perform operations on them. So how are constants stored, and how does Go perform arithmetic on them?
Short summary (TL;DR) is at the end of the answer.
Untyped arbitrary-precision constants don't live at runtime, constants live only at compile time (during the compilation). That being said, Go does not have to represent constants with arbitrary precision at runtime, only when compiling your application.
Why? Because constants do not get compiled into the executable binaries. They don't have to be. Let's take your example:
const Huge = 1e1000
fmt.Println(Huge / 1e999)
There is a constant Huge in the source code (and will be in the package object), but it won't appear in your executable. Instead a function call to fmt.Println() will be recorded with a value passed to it, whose type will be float64. So in the executable only a float64 value being 10.0 will be recorded. There is no sign of any number being 1e1000 in the executable.
This float64 type is derived from the default type of the untyped constant Huge. 1e1000 is a floating-point literal. To verify it:
const Huge = 1e1000
x := Huge / 1e999
fmt.Printf("%T", x) // Prints float64
Back to the arbitrary precision:
Spec: Constants:
Numeric constants represent exact values of arbitrary precision and do not overflow.
So constants represent exact values of arbitrary precision. As we saw, there is no need to represent constants with arbitrary precision at runtime, but the compiler still has to do something at compile time. And it does!
Obviously "infinite" precision cannot be dealt with. But there is no need, as the source code itself is not "infinite" (the size of the source is finite). Still, it's not practical to allow truly arbitrary precision. So the spec gives compilers some freedom regarding this:
Implementation restriction: Although numeric constants have arbitrary precision in the language, a compiler may implement them using an internal representation with limited precision. That said, every implementation must:
Represent integer constants with at least 256 bits.
Represent floating-point constants, including the parts of a complex constant, with a mantissa of at least 256 bits and a signed exponent of at least 32 bits.
Give an error if unable to represent an integer constant precisely.
Give an error if unable to represent a floating-point or complex constant due to overflow.
Round to the nearest representable constant if unable to represent a floating-point or complex constant due to limits on precision.
These requirements apply both to literal constants and to the result of evaluating constant expressions.
However, with all of the above said, note that the standard library still gives you the means to represent and work with values (constants) with "arbitrary" precision: see package go/constant. You may look into its source to get an idea of how it's implemented.
Implementation is in go/constant/value.go. Types representing such values:
// A Value represents the value of a Go constant.
type Value interface {
	// Kind returns the value kind.
	Kind() Kind
	// String returns a short, human-readable form of the value.
	// For numeric values, the result may be an approximation;
	// for String values the result may be a shortened string.
	// Use ExactString for a string representing a value exactly.
	String() string
	// ExactString returns an exact, printable form of the value.
	ExactString() string
	// Prevent external implementations.
	implementsValue()
}

type (
	unknownVal struct{}
	boolVal    bool
	stringVal  string
	int64Val   int64                    // Int values representable as an int64
	intVal     struct{ val *big.Int }   // Int values not representable as an int64
	ratVal     struct{ val *big.Rat }   // Float values representable as a fraction
	floatVal   struct{ val *big.Float } // Float values not representable as a fraction
	complexVal struct{ re, im Value }
)
As you can see, the math/big package is used to represent untyped arbitrary precision values. big.Int is for example (from math/big/int.go):
// An Int represents a signed multi-precision integer.
// The zero value for an Int represents the value 0.
type Int struct {
	neg bool // sign
	abs nat  // absolute value of the integer
}
Where nat is (from math/big/nat.go):
// An unsigned integer x of the form
//
// x = x[n-1]*_B^(n-1) + x[n-2]*_B^(n-2) + ... + x[1]*_B + x[0]
//
// with 0 <= x[i] < _B and 0 <= i < n is stored in a slice of length n,
// with the digits x[i] as the slice elements.
//
// A number is normalized if the slice contains no leading 0 digits.
// During arithmetic operations, denormalized values may occur but are
// always normalized before returning the final result. The normalized
// representation of 0 is the empty or nil slice (length = 0).
//
type nat []Word
And finally Word is (from math/big/arith.go)
// A Word represents a single digit of a multi-precision unsigned integer.
type Word uintptr
Summary
At runtime: predefined types provide limited precision, but you can "mimic" arbitrary precision with certain packages, such as math/big and go/constant. At compile time: constants seemingly provide arbitrary precision, but in reality a compiler may not live up to this (it doesn't have to). Still, the spec sets a minimum precision for constants that every compiler must support, e.g. integer constants must be represented with at least 256 bits, which is 32 bytes (compared to int64, which is "only" 8 bytes).
When an executable binary is created, results of constant expressions (with arbitrary precision) have to be converted and represented with values of finite precision types – which may not be possible and thus may result in compile-time errors. Note that only results –not intermediate operands– have to be converted to finite precision, constant operations are carried out with arbitrary precision.
How this arbitrary or enhanced precision is implemented is not defined by the spec. math/big, for example, stores the "digits" of the number in a slice, where a "digit" is not a digit of the base-10 representation but a uintptr. That is like a base-4294967296 (2^32) representation on 32-bit architectures, and an even bigger base on 64-bit architectures.
Go constants are not allocated to memory. They are used in context by the compiler. The blog post you refer to gives the example of Pi:
Pi = 3.14159265358979323846264338327950288419716939937510582097494459
If you assign Pi to a float32 it will lose precision to fit, and if you assign it to a float64 it will lose less precision. The compiler determines which type to use from the context in which the constant appears.

How does 1 << 64 - 1 work?

At http://tour.golang.org/#14 they show an example where the number 1 is shifted left by 64 bits. This of course would result in an overflow, but then 1 is subtracted and all is well. How does half of the expression result in a failure while the expression as a whole works just fine?
Thoughts:
I would assume that the setting of the unsigned to a number larger than what it allows is what causes the explosion. It would seem that memory is allocated more loosely on the right hand side of the expression than on the left? Is this true?
The result of your expression is a (compile time) constant and the expression is therefore evaluated during compilation. The language specification mandates that
Constant expressions are always evaluated exactly; intermediate values
and the constants themselves may require precision significantly
larger than supported by any predeclared type in the language. The
following are legal declarations:
const Huge = 1 << 100 // Huge == 1267650600228229401496703205376 (untyped integer constant)
const Four int8 = Huge >> 98 // Four == 4 (type int8)
https://golang.org/ref/spec#Constant_expressions
That is because the Go compiler handles constant expressions as numeric constants. Unlike the data types, which are bound by a range, a number of storage bits and side effects like overflow, numeric constants never lose precision.
Numeric constants only get reduced to a data type with limited precision and range at the moment you assign them to a variable (which has a known type and thus a numeric range and a defined way to store the number in bits). You can also force them to be reduced to an ordinary data type by using them in an expression that contains operands that are not numeric constants.
Confused? So was I..
Here is a longer write-up on the data-types and how constants are handled: http://www.goinggo.net/2014/04/introduction-to-numeric-constants-in-go.html?m=1
I decided to try it. For reasons that are subtle, executing the expression as a constant expression (1 << 64 -1) or piece by piece at run time gives the same answer. This is because of 2 different mechanisms. A constant expression is fully evaluated with infinite precision before being assigned to the variable. The step by step execution explicitly allows overflows and underflows through addition, subtraction and shift operations, and thus the result is the same.
See https://golang.org/ref/spec#Integer_overflow for a description of how integers are supposed to overflow.
However, doing it in groups, i.e. 1<<64 and then -1, causes overflow errors!
You can make a variable overflow through arithmetic, but you cannot assign an overflowing value to a variable.
Try it yourself. Paste the code below into http://try.golang.org/
This one works:
// You can edit this code!
// Click here and start typing.
package main

import "fmt"

func main() {
	var MaxInt uint64 = 1
	MaxInt = MaxInt << 64 // run-time shift: the result wraps to 0
	MaxInt = MaxInt - 1   // 0 - 1 wraps to the largest uint64
	fmt.Printf("%d\n", MaxInt)
}
This one doesn't work:
// You can edit this code!
// Click here and start typing.
package main

import "fmt"

func main() {
	var MaxInt uint64 = 1 << 64 // compile error: the constant 1 << 64 overflows uint64
	MaxInt = MaxInt - 1
	fmt.Printf("%d\n", MaxInt)
}
Actually, 1 << 64 - 1 does not always mean a left shift by 64 followed by minus 1. In most languages, at least the ones I know (C++, Java, ...), the - operator binds tighter than the << operator. There, 1 << 64 - 1 <=> 1 << 63.
But Go behaves differently: https://golang.org/ref/spec#Operator_precedence
In Go, the << operator has higher precedence than the - operator, so the shift is applied first.
The result of a 64-bit left shift depends on the data type. It is like appending 64 zero bits on the right while cutting off any bits that extend past the data type on the left. In some languages an overflow may be valid, in others not.
Compilers may also behave differently when the shift count is greater than or equal to the size of the data type. I know that Java reduces the shift count by the size of the data type, repeatedly, until it is smaller than that size.
That sounds difficult, but here is an easy example for a long data type with a size of 64 bits.
so i << 64 <=> i << 0 <=> i
or i << 65 <=> i << 1
or i << 130 <=> i << 66 <=> i << 2.
As said, this may differ between compilers and languages. There is never a solid answer without referring to a specific language.
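To make the contrast concrete in Go, the sketch below shows Go's own behaviour for an oversized run-time shift next to an emulation of the Java-style reduction described above. Both function names are hypothetical; the mask of 63 corresponds to a 64-bit operand.

```go
package main

import "fmt"

// goShift uses Go's semantics: a run-time shift count at or above the
// operand width shifts every bit out, leaving 0.
func goShift(x uint64, n uint) uint64 {
	return x << n
}

// javaStyleShift emulates the reduction described above for a 64-bit
// operand by keeping only the low six bits of the shift count.
func javaStyleShift(x uint64, n uint) uint64 {
	return x << (n & 63)
}

func main() {
	fmt.Println(goShift(1, 64))         // 0
	fmt.Println(javaStyleShift(1, 64))  // 1, same as a shift by 0
	fmt.Println(javaStyleShift(1, 65))  // 2, same as a shift by 1
	fmt.Println(javaStyleShift(1, 130)) // 4, same as a shift by 2
}
```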
For learning I would suggest a more common language than Go, maybe like something from the C family.
var MaxInt uint64 = 1<<64 - 1
BITSHIFTING
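In binary, 1<<64 is a 1 with 64 zeros after it (10000...).
That is the same as 2^64.
This is one more than the largest value a 64-bit unsigned integer (non-negative numbers only) can hold, so on its own it does not fit.
So we subtract 1, which brings the value back into range.
Therefore, 1<<64 - 1 is the maximum unsigned 64-bit integer we can write.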
An untyped integer constant in Go is effectively a big integer at compile time, so the constant 1 << 63 does not overflow by itself. But var a int64 = 1 << 63 will overflow, because you are assigning a value bigger than int64 can hold.

GoLang for loop with floats creates error

Can someone explain the following? I have a function in Go that accepts a few float64 values and then uses them to calculate a lot of other values. The function looks like
func (g *Geometry) CalcStresses(x, zmax, zmin float64)(Vertical)
the result is put into a struct like
type Vertical struct {
X float64
Stresses []Stress
}
Now the funny thing is this. If I call the function like this;
for i := 14.0; i < 15.0; i += 0.1 {
	result := geo.CalcStresses(i, 10, -10)
}
then I get a lot of results where the Stresses array is empty. Another interesting detail is that x sometimes shows up as a number with a LOT of decimals (like 14.3999999999999999998).
However, if I call the function like this;
for i := 0; i < 10; i++ {
	x := 14.0 + float64(i)*0.1
	result := geo.CalcStresses(x, 10, -10)
}
then everything is fine.
Does anyone know why this happens?
Thanks in advance,
Rob
Not all real numbers can be represented precisely in binary floating-point format, so using a floating-point number as a loop counter is asking for trouble.
From Wikipedia on Floating point
The fact that floating-point numbers cannot precisely represent all real numbers, and that floating-point operations cannot precisely represent true arithmetic operations, leads to many surprising situations. This is related to the finite precision with which computers generally represent numbers.
For example, the non-representability of 0.1 and 0.01 (in binary) means that the result of attempting to square 0.1 is neither 0.01 nor the representable number closest to it.
This code
for i := 14.0; i < 15.0; i += 0.1 {
	fmt.Println(i)
}
produces this
14
14.1
14.2
14.299999999999999
14.399999999999999
14.499999999999998
14.599999999999998
14.699999999999998
14.799999999999997
14.899999999999997
14.999999999999996
You may use the big.Rat type from the math/big package to represent rational numbers accurately.
Example
x := big.NewRat(14, 1)
y := big.NewRat(15, 1)
z := big.NewRat(1, 10)
// i aliases x, so x is advanced in place; each step adds exactly 1/10.
for i := x; i.Cmp(y) < 0; i = i.Add(i, z) {
	v, _ := i.Float64()
	fmt.Println(v)
}

Resources