Bitwise manipulation and Go newbie here :D I am reading some data from a sensor with Go and I get it as 2 bytes - let's say 0xFFFE. It is easy to cast it to uint16, since in Go we can just do uint16(0xFFFE), but what I need is to convert it to a signed integer, because the sensor in fact returns values in the range from -32768 to 32767. I thought "Maybe Go will be this nice and if I do int16(0xFFFE) it will understand what I want?", but no. I ended up using the following solution (translated from some Python code I found on the web):
x := 0xFFFE
if (x & (1 << 15)) != 0 {
x = x - (1<<16)
}
It seems to work, but A) I am not entirely sure why, and B) it looks a bit ugly compared to what I imagined should be a trivial solution for casting uint16 to int16. Could anyone give me a hand and clarify why this is the only way to do this? Or is there another possible way?
But what you want does work; Go is this nice:
ui := uint16(0xFFFE)
fmt.Println(ui)
i := int16(ui)
fmt.Println(i)
Output (try it on the Go Playground):
65534
-2
int16(0xFFFE) doesn't work because 0xFFFE is an untyped integer constant which cannot be represented by a value of type int16; that's why the compiler complains. But you can certainly convert any non-constant uint16 value to int16.
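For completeness, a minimal sketch of going from the two raw sensor bytes straight to an int16; it assumes the bytes arrive big-endian (use binary.LittleEndian instead if your sensor sends them the other way around):
package main

import (
	"encoding/binary"
	"fmt"
)

func main() {
	raw := []byte{0xFF, 0xFE}         // the two bytes read from the sensor
	u := binary.BigEndian.Uint16(raw) // 65534, the unsigned view of the bits
	i := int16(u)                     // -2, the same bits reinterpreted as signed
	fmt.Println(u, i)
}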
See possible duplicates:
Golang: on-purpose int overflow
Does go compiler's evaluation differ for constant expression and other expression
When decoding audio files with pion/opus I will occasionally get values that are incorrect.
I have debugged it down to the following code. When this routine runs inside the Opus decoder I get a different value than when I run it outside. When the two floats are added together, the rightmost bit is different. The difference in values eventually becomes a problem as the program runs longer.
Is this a bug or expected behavior? I don't know how to debug this deeper/dump state of my program to understand more.
Outside decoder
package main
import (
"fmt"
"math"
)
func main() {
a := math.Float32frombits(uint32(955684399))
b := math.Float32frombits(uint32(927295728))
fmt.Printf("%b\n", math.Float32bits(a))
fmt.Printf("%b\n", math.Float32bits(b))
fmt.Printf("%b\n", math.Float32bits(a+b))
}
Returns
111000111101101001011000101111
110111010001010110100011110000
111001000001111010000110100110
Then Inside decoder
fmt.Printf("%b\n", math.Float32bits(lpcVal))
fmt.Printf("%b\n", math.Float32bits(val))
fmt.Printf("%b\n", math.Float32bits(lpcVal+val))
Returns
111000111101101001011000101111
110111010001010110100011110000
111001000001111010000110100111
I guess that lpcVal and val are not Float32 but rather Float64.
If that is the case, then you are comparing two different operations:
in the former case, each operand is rounded to Float32 first and the addition is done in Float32
in the latter case, the addition is done in Float64 and only the sum is rounded to Float32
The two 32-bit floats are, in binary:
1.11101101001011000101111 * 2^-14
1.10001010110100011110000 * 2^-17
The exact sum is
1.000011110100001101001101 * 2^-13
which is an exact tie between two representable Float32 values
the result is rounded to the Float32 with even significand
1.00001111010000110100110 * 2^-13
But lpcVal and val are Float64: instead of 23 significand bits after the binary point, they have 52 (29 more).
If a single bit among those 29 extra bits is different from zero, the result might not be an exact tie, but slightly larger than the exact tie.
Once converted to nearest Float32, that will be
1.00001111010000110100111 * 2^-13
Since we have no idea what lpcVal and val contain in those low significant bits, anything can happen, even without the use of FMA operations.
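A compact way to see the difference this answer describes (a sketch, with lpcVal64 and val64 as hypothetical float64 stand-ins for the decoder's values):
// Round each operand to float32 first, then add in float32.
sum1 := math.Float32bits(float32(lpcVal64) + float32(val64))
// Add in float64 at full precision, then round the sum once.
sum2 := math.Float32bits(float32(lpcVal64 + val64))
// sum1 and sum2 can differ in the last bit, exactly as shown above.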
This was happening because of fused multiply-add (FMA): multiple floating-point operations were being combined into one operation.
You can read more about it in the Go Language Spec#Floating_Point_Operators
The change I made to my code was
- lpcVal += currentLPCVal * (aQ12 / 4096.0)
+ lpcVal = float32(lpcVal) + float32(currentLPCVal)*float32(aQ12)/float32(4096.0)
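The general pattern, per the spec: an explicit floating-point conversion rounds its operand, which prevents that operand from being fused into a single multiply-add. A sketch with hypothetical float32 variables x, y and z:
r1 := x*y + z          // may be computed as a fused multiply-add (a single rounding)
r2 := float32(x*y) + z // the conversion rounds x*y first, so no fusion can occur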
Thank you to Bryan C. Mills for answering this on the #performance channel on the Gophers slack.
I need to find the square root of a big.Rat. Is there a way to do it without losing (already existing) accuracy?
For example, I could convert the numerator and denominator into floats, get the square root, and then convert it back...
func ratSquareRoot(num *big.Rat) *big.Rat {
f, exact := num.Float64() //Yuck! Floats!
squareRoot := math.Sqrt(f)
var accuracy int64 = 1e15 // ~15 significant digits of precision for float64 (note: 10 ^ 15 would be XOR in Go, not exponentiation)
return big.NewRat(int64(squareRoot*float64(accuracy)), accuracy)
// ^ This is now totally worthless. And also probably not simplified very well.
}
...but that would eliminate all of the accuracy of using a rational. Is there a better way of doing this?
The big.Float type has a Sqrt operation, and it lets you explicitly define the precision you aim for. I'd try to use that and convert the result back to a Rat with the same approach as in your question, only manipulating big values instead of floats.
r := big.NewRat(1, 3)
var x big.Float
x.SetPrec(30) // I didn't figure out the 'Prec' part completely; read the docs more carefully than I did and experiment
x.SetRat(r)
var s big.Float
s.SetPrec(15)
s.Sqrt(&x)
r, _ = s.Rat(nil)
fmt.Println(x.String(), s.String())
fmt.Println(r.String(), float64(18919)/float64(32768))
playground
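Wrapping the same idea into a reusable helper (a sketch; the prec parameter is the mantissa precision in bits, so pick it to match the accuracy you need):
// ratSqrt returns an approximation of sqrt(r) as a big.Rat,
// computed through big.Float at the given precision (in bits).
func ratSqrt(r *big.Rat, prec uint) *big.Rat {
	var x, s big.Float
	x.SetPrec(prec).SetRat(r)
	s.SetPrec(prec).Sqrt(&x)
	res, _ := s.Rat(nil)
	return res
}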
In Go, it seems that when a float64 variable is first converted to float32 and then converted back to float64, its value changes.
a := -8888.95
fmt.Println(a) // -8888.95
fmt.Println(float32(a)) // -8888.95
fmt.Println(float64(float32(a))) // -8888.9501953125
How can I keep the value unchanged?
The way you have described the problem is perhaps misleading.
The precision is not lost "when converting float32 to float64"; rather, it is lost when converting from float64 to float32.
So how can you avoid losing precision when converting from float64 to float32? You can't. This task is impossible, and it's quite easy to see the reason why:
float64 has twice as many bits as float32
multiple different float64 values will map to the same float32 value due to the pigeonhole principle
the conversion is therefore not reversible.
package main
import (
"fmt"
)
func main() {
a := -8888.95
fmt.Printf("%.20f\n", a)
fmt.Printf("%.20f\n", float32(a))
fmt.Printf("%.20f\n", float64(float32(a)))
}
Adjusting your program to show a more precise output of each value, you'll see exactly where the precision is lost:
-8888.95000000000072759576
-8888.95019531250000000000
-8888.95019531250000000000
That is, after the float32 conversion (as is expected).
It's also worth noting that neither float64 nor float32 can represent your value -8888.95 exactly. If you convert this number to a fraction, you will get -177779/20. Notice the denominator, 20. The prime factorization of 20 is 2 * 2 * 5.
If you apply this process to a number, reducing the fraction to lowest terms, and the prime factorization of the denominator contains any factor which is NOT 2, then you can rest assured that this number is definitely not representable exactly in binary floating point form. You may discover that the probability of an arbitrary number passing this test is quite low.
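A small sketch of that test with math/big; big.Rat always keeps the fraction in lowest terms, so it is enough to check whether the denominator is a power of two (this only checks the condition above and ignores the precision and exponent-range limits of float32/float64):
// isDyadic reports whether r can be written with a power-of-two denominator.
func isDyadic(r *big.Rat) bool {
	d := r.Denom() // reduced denominator, always positive
	var m big.Int
	m.Sub(d, big.NewInt(1))
	m.And(&m, d)
	return m.Sign() == 0 // d&(d-1) == 0, i.e. d is a power of two
}
For example, isDyadic(big.NewRat(-177779, 20)) is false (20 has the prime factor 5), while isDyadic(big.NewRat(3, 8)) is true (0.375 is exact in binary).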
Go's builtin len() function returns a signed int. Why wasn't a uint used instead?
Is it ever possible for len() to return something negative?
As far as I can tell, the answer is no:
Arrays: "The number of elements is called the length and is never negative."
Slices: "At any time the following relationship holds: 0 <= len(s) <= cap(s)"
Maps "The number of map elements is called its length". (I couldn't find anything in the spec that explicitly restricts this to a nonnegative value, but it's difficult for me to understand how there could be fewer than 0 elements in a map)
Strings "A string value is a (possibly empty) sequence of bytes.... The length of a string s (its size in bytes) can be discovered using the built-in function len()" (Again, hard to see how a sequence could have a negative number of bytes)
Channels "number of elements queued in channel buffer (ditto)
len() (and cap()) return int because that is what is used to index slices and arrays (not uint). So the question is more "Why does Go use signed integers to index slices/arrays when there are no negative indices?".
The answer is simple: it is common to compute an index, and such computations underflow much too easily if done with unsigned integers. Innocent-looking code like i := a - b + 7 might yield i == 4294967291 for innocent 32-bit unsigned values of a and b like 6 and 18. Such an index will probably overflow your slice. Lots of index calculations happen around 0 and are tricky to get right using unsigned integers; these bugs hide behind mathematically totally sensible and sound formulas. This is neither safe nor convenient.
This is a tradeoff based on experience: Underflow tends to happen often for index calculations done with unsigned ints while overflow is much less common if signed integers are used for index calculations.
Additionally: There is basically zero benefit from using unsigned integers in these cases.
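A small illustration of that hazard (hypothetical values; the unsigned version silently wraps around, while the signed version produces an obviously wrong negative number that is easy to check for):
package main

import "fmt"

func main() {
	a, b := uint(6), uint(18)
	fmt.Println(a - b + 7) // 18446744073709551611 on a 64-bit platform: wrapped around

	c, d := 6, 18 // plain ints
	fmt.Println(c - d + 7) // -5: clearly not a valid index, trivial to test for
}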
There is a proposal in progress, "issue 31795 Go 2: change len, cap to return untyped int if result is constant".
It might be included in Go 1.14 (Q1 2020):
we should be able to do it for len and cap without problems - and indeed there aren't any in the stdlib, as type-checking it via a modified type checker shows
See CL 179184 as a PoC: this is still experimental.
As noted below by peterSO, this has been closed.
Robert Griesemer explains:
As you noted, the problem with making len always untyped is the size of the
result. For booleans (and also strings) the size is known, no matter what
kind of boolean (or string).
Russ Cox added:
I am not sure the costs here are worth the benefit. Today there is a simple
rule: len(x) has type int. Changing the type to depend on what x is
will interact in non-orthogonal ways with various code changes. For example,
under the proposed semantics, this code compiles:
const x string = "hello"
func f(uintptr)
...
f(len(x))
but suppose then someone comes along and wants to be able to modify x for
testing or something like that, so they s/const/var/. That's usually fairly
safe, but now the f(len(x)) call fails to type-check, and it will be
mysterious why it ever worked.
This change seems like it might add more rough edges than it removes.
Length and capacity
The built-in functions len and cap take arguments of various types and return a result of type int. The implementation guarantees that the result always fits into an int.
Go is a strongly typed language, so if len() returned uint, then instead of:
i := 0 // int
if len(a) == i {
}
you should write:
if len(a) == uint(i) {
}
or:
if int(len(a)) == i {
}
Also See:
uint either 32 or 64 bits
int same size as uint
uintptr an unsigned integer large enough to store the uninterpreted bits of a pointer value
Also, for compatibility with C (cgo): there is C.size_t, and the size of an array in C is of type int.
From the spec:
The length is part of the array's type; it must evaluate to a non-negative constant representable by a value of type int. The length of array a can be discovered using the built-in function len. The elements can be addressed by integer indices 0 through len(a)-1. Array types are always one-dimensional but may be composed to form multi-dimensional types.
I realize it's maybe a little circular to say the spec dictates X because the spec dictates Y, but since the length can't exceed the maximum value of an int, it's just as impossible for len to return a uint-exclusive value as it is for it to return a negative value.
I'm using the hash function murmur2 which returns me an uint64.
I want then to store it in PostgreSQL, which only support BIGINT (signed 64 bits).
I'm not interested in the number itself, just the binary value: I use it as an id for detecting uniqueness, and since my set has only ~1000 values, a 64-bit hash is enough for me. So I would like to convert it into int64 by "just" changing the type.
How does one do that in a way that pleases the compiler?
You can simply use a type conversion:
i := uint64(0xffffffffffffffff)
i2 := int64(i)
fmt.Println(i, i2)
Output:
18446744073709551615 -1
Converting uint64 to int64 always succeeds: it doesn't change the memory representation, just the type. What may confuse you is trying to convert an untyped integer constant to int64:
i3 := int64(0xffffffffffffffff) // Compile time error!
This is a compile time error as the constant value 0xffffffffffffffff (which is represented with arbitrary precision) does not fit into int64 because the max value that fits into int64 is 0x7fffffffffffffff:
constant 18446744073709551615 overflows int64
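Going the other way is just as mechanical: converting the int64 you read back from the BIGINT column to uint64 restores the original hash bits, since the conversion only reinterprets them:
h := uint64(0xffffffffffffffff) // the murmur2 hash value
stored := int64(h)              // what you write to the BIGINT column
back := uint64(stored)          // what you get after reading it back
fmt.Println(back == h)          // true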