At http://tour.golang.org/#14 they show an example where the number 1 is shifted by 64 bits. This on its own would result in an overflow, but then 1 is subtracted from it and all is well. How does half of the expression result in a failure while the expression as a whole works just fine?
Thoughts:
I would assume that setting the unsigned variable to a number larger than it allows is what causes the explosion. It would seem that memory is allocated more loosely on the right-hand side of the expression than on the left? Is this true?
The result of your expression is a (compile time) constant and the expression is therefore evaluated during compilation. The language specification mandates that
Constant expressions are always evaluated exactly; intermediate values
and the constants themselves may require precision significantly
larger than supported by any predeclared type in the language. The
following are legal declarations:
const Huge = 1 << 100 // Huge == 1267650600228229401496703205376 (untyped integer constant)
const Four int8 = Huge >> 98 // Four == 4 (type int8)
https://golang.org/ref/spec#Constant_expressions
That is because the Go compiler handles constant expressions as numeric constants. Unlike typed values, which are bound by their range, their storage size in bits, and side effects such as overflow, numeric constants never lose precision.
Numeric constants are only reduced to a data type with limited precision and range at the moment you assign them to a variable (which has a known type, and therefore a numeric range and a defined way to store the number in bits). You can also force them to be reduced to an ordinary data type by using them in an expression that contains operands of ordinary, non-constant types.
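For instance, here is a small sketch of my own (the constant names are just for illustration) showing the same untyped constant being reduced to a concrete type only at the point of assignment:

package main

import "fmt"

func main() {
    const big = 1 << 70 // untyped constant, far too large for uint64

    var f float64 = big / 4  // constant arithmetic happens first; 2^68 fits in a float64
    var u uint64 = big >> 10 // 2^60 fits in a uint64

    fmt.Println(f, u)

    // var x uint64 = big // would not compile: the constant overflows uint64
}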
Confused? So was I..
Here is a longer write-up on the data-types and how constants are handled: http://www.goinggo.net/2014/04/introduction-to-numeric-constants-in-go.html?m=1
I decided to try it. For reasons that are subtle, evaluating the expression as a constant expression (1<<64 - 1) or piece by piece at run time gives the same answer. This is because of two different mechanisms. A constant expression is fully evaluated with arbitrary precision before being assigned to the variable. The step-by-step execution explicitly allows overflows and underflows through addition, subtraction and shift operations, and thus the result is the same.
See https://golang.org/ref/spec#Integer_overflow for a description of how integers are supposed to overflow.
However, doing it in stages, i.e. 1<<64 first and then -1, causes an overflow error!
You can make a variable overflow through arithmetic, but you cannot assign an overflowing constant to a variable.
Try it yourself. Paste the code below into http://try.golang.org/
This one works:
// You can edit this code!
// Click here and start typing.
package main
import "fmt"
func main() {
    var MaxInt uint64 = 1
    MaxInt = MaxInt << 64
    MaxInt = MaxInt - 1
    fmt.Printf("%d\n", MaxInt)
}
This one doesn't work:
// You can edit this code!
// Click here and start typing.
package main
import "fmt"
func main() {
    var MaxInt uint64 = 1 << 64 // compile error: the constant 1<<64 overflows uint64
    MaxInt = MaxInt - 1
    fmt.Printf("%d\n", MaxInt)
}
Actually, 1 << 64 - 1 does not always mean a left shift by 64 followed by subtracting 1. In most languages I know (C++, Java, ...), the binary - operator is applied before the << operator, so there 1 << 64 - 1 is equivalent to 1 << 63.
But Go behaves differently: https://golang.org/ref/spec#Operator_precedence
In Go, the - operator is applied after the << operator.
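A quick way to see Go's precedence in action (an illustrative sketch of my own):

package main

import "fmt"

func main() {
    fmt.Println(1<<4 - 1) // prints 15: parsed as (1 << 4) - 1
    // In C, C++ or Java the same expression means 1 << (4 - 1) == 8,
    // because the additive operators bind tighter than the shifts there.
}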
The result of a 64-bit left shift depends on the data type. It is like appending 64 zeros on the right while cutting off, on the left, any bits that no longer fit into the data type. In some languages such an overflow may be valid, in others it is not.
Compilers may also behave differently when the shift count is greater than or equal to the width of the data type. I know that the Java compiler reduces the shift count by the width of the type until it is smaller than that width.
Sounds difficult, but here is an easy example for a long data type with a 64-bit size:
so i << 64 <=> i << 0 <=> i
or i << 65 <=> i << 1
or i << 130 <=> i << 66 <=> i << 2.
As said, this may differ between compilers and languages; there is never a solid answer without referring to a specific language.
For learning I would suggest a more common language than Go, maybe something from the C family.
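For reference, here is what Go itself does with an oversized shift of a variable (a small sketch; Go does not reduce the shift count the way Java does):

package main

import "fmt"

func main() {
    var i uint64 = 1
    // The shift count is not taken modulo the width of the type:
    // shifting a 64-bit value left by 64 or more simply yields 0 at run time.
    fmt.Println(i << 64)  // 0
    fmt.Println(i << 130) // 0 (not i << 2 as in Java)
}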
var MaxInt uint64 = 1<<64 - 1
BITSHIFTING
In binary, 1 << 64 is a 1 with 64 zeros after it (10000...).
That is the same as 2^64.
This is one more than a 64-bit unsigned integer (positive numbers only) can hold.
So we subtract 1 to avoid the overflow error.
The result is therefore the maximum value a 64-bit unsigned integer can represent.
You can see here that a Go integer constant is effectively a big integer, so 1 << 63 won't overflow. But var a int64 = 1 << 63 will fail, because you are assigning a value bigger than int64 can hold.
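A minimal sketch of that difference (the names are just for illustration):

package main

import "fmt"

func main() {
    const c = 1 << 63 // fine: an untyped constant keeps full precision (and an unused constant is legal)

    var b uint64 = 1 << 63 // fine: 2^63 fits in a uint64
    fmt.Println(b)         // 9223372036854775808

    // var a int64 = 1 << 63 // does not compile: the constant overflows int64
}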
Related
The code below compiles:
package main

import "fmt"

const (
    // Max integer value on 64 bit architecture.
    maxInt = 9223372036854775807

    // Much larger value than int64.
    bigger = 9223372036854775808543522345

    // Will NOT compile:
    // biggerInt int64 = 9223372036854775808543522345
)

func main() {
    fmt.Println("Will Compile")
    //fmt.Println(bigger) // error
}
Type is size in memory + representation of bits in that memory
What is the implicit type assigned to bigger at compile time? I ask because the line fmt.Println(bigger) fails with the error: constant 9223372036854775808543522345 overflows int
Those are untyped constants. They have larger limits than typed constants:
https://golang.org/ref/spec#Constants
In particular:
Represent integer constants with at least 256 bits.
None, it's an untyped constant. Because you haven't assigned it to any variable or used it in any expression, it doesn't "need" to be given a representation as any concrete type yet. Numeric constants in Go have effectively unlimited precision (required by the language spec to be at least 256 bits for integers, and at least 256 mantissa bits for floating-point numbers, but I believe that the golang/go compiler uses the Go arbitrary-precision types internally which are only limited by memory). See the section about Constants in the language spec.
What is the use of a constant if you can't assign it to a variable of any type? Well, it can be part of a constant expression. Constant expressions are evaluated at arbitrary precision, and their results may be able to be assigned to a variable. In other words, it's allowed to use values that are too big to represent to reach an answer that is representable, as long as all of that computation happens at compile time.
From this comment:
my goal is to convert bigger = 9223372036854775808543522345 to binary form
we find that your question is an XY problem.
Since we do know that the constant exceeds 64 bits, we'll need to take it apart into multiple 64-bit words, or store it in some sort of bigger-integer storage.
Go provides math/big for general purpose large-number operations, or in this case we can take advantage of the fact that it's easy to store up to 127-bit signed values (or 128-bit unsigned values) in a struct holding two 64-bit integers (at least one of which is unsigned).
This rather trivial program prints the result of converting to binary:
500000000 x 2-sup-64 + 543522345 as binary:
111011100110101100101000000000000000000000000000000000000000000100000011001010111111000101001
package main

import "fmt"

const (
    // Much larger value than int64.
    bigger = 9223372036854775808543522345
    d64    = 1 << 64
)

type i128 struct {
    Upper int64
    Lower uint64
}

func main() {
    x := i128{Upper: bigger / d64, Lower: bigger % d64}
    fmt.Printf("%d x 2-sup-64 + %d as binary:\n%b%.64b\n", x.Upper, x.Lower, x.Upper, x.Lower)
}
Go's builtin len() function returns a signed int. Why wasn't a uint used instead?
Is it ever possible for len() to return something negative?
As far as I can tell, the answer is no:
Arrays: "The number of elements is called the length and is never negative."
Slices: "At any time the following relationship holds: 0 <= len(s) <= cap(s)"
Maps: "The number of map elements is called its length." (I couldn't find anything in the spec that explicitly restricts this to a nonnegative value, but it's difficult for me to understand how there could be fewer than 0 elements in a map.)
Strings: "A string value is a (possibly empty) sequence of bytes.... The length of a string s (its size in bytes) can be discovered using the built-in function len()." (Again, hard to see how a sequence could have a negative number of bytes.)
Channels: "number of elements queued in channel buffer" (ditto)
len() (and cap()) return int because that is what is used to index slices and arrays (not uint). So the question is really "Why does Go use signed integers to index slices/arrays when there are no negative indices?".
The answer is simple: it is common to compute an index, and such computations underflow much too easily if done in unsigned integers. Innocent-looking code like i := a - b + 7 might yield i == 4294967291 for innocent values of a and b such as 6 and 18 (the mathematical result is -5, but the unsigned subtraction wraps around). Such an index is then far outside the bounds of your slice. Lots of index calculations happen around 0 and are tricky to get right with unsigned integers; these bugs hide behind mathematically sensible and sound formulas. This is neither safe nor convenient.
This is a tradeoff based on experience: underflow tends to happen often in index calculations done with unsigned ints, while overflow is much less common when signed integers are used.
Additionally: There is basically zero benefit from using unsigned integers in these cases.
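To see the failure mode, here is a quick sketch using uint32 indices (so the wrapped value is easy to recognize):

package main

import "fmt"

func main() {
    var a, b uint32 = 6, 18
    i := a - b + 7 // mathematically -5, but unsigned arithmetic wraps around
    fmt.Println(i) // 4294967291
}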
There is a proposal in progress "issue 31795 Go 2: change len, cap to
return untyped int if result is constant"
It might be included in Go 1.14 (Q1 2020).
we should be able to do it for len and cap without problems - and indeed
there aren't any in the stdlib, as type-checking it via a modified type
checker shows
See CL 179184 as a PoC: this is still experimental.
As noted below by peterSO, this has been closed.
Robert Griesemer explains:
As you noted, the problem with making len always untyped is the size of the
result. For booleans (and also strings) the size is known, no matter what
kind of boolean (or string).
Russ Cox added:
I am not sure the costs here are worth the benefit. Today there is a simple
rule: len(x) has type int. Changing the type to depend on what x is
will interact in non-orthogonal ways with various code changes. For example,
under the proposed semantics, this code compiles:
const x string = "hello"
func f(uintptr)
...
f(len(x))
but suppose then someone comes along and wants to be able to modify x for
testing or something like that, so they s/const/var/. That's usually fairly
safe, but now the f(len(x)) call fails to type-check, and it will be
mysterious why it ever worked.
This change seems like it might add more rough edges than it removes.
Length and capacity
The built-in functions len and cap take arguments of various types and
return a result of type int. The implementation guarantees that the
result always fits into an int.
Go is a strongly typed language, so if len() were uint then instead of:
i := 0 // int
if len(a) == i {
}
you should write:
if len(a) == uint(i) {
}
or:
if int(len(a)) == i {
}
Also see:
uint     either 32 or 64 bits
int      same size as uint
uintptr  an unsigned integer large enough to store the uninterpreted bits of a pointer value
Also, for compatibility with C (cgo): C.size_t, and the size of an array in C is of type int.
From the spec:
The length is part of the array's type; it must evaluate to a non-negative constant representable by a value of type int. The length of array a can be discovered using the built-in function len. The elements can be addressed by integer indices 0 through len(a)-1. Array types are always one-dimensional but may be composed to form multi-dimensional types.
I realize it's maybe a little circular to say the spec dictates X because the spec dictates Y, but since the length can't exceed the maximum value of an int, it's just as impossible for len to return a value that only a uint could hold as it is for it to return a negative value.
Suppose you are using a bit set or something similar, essentially some object that allows you to access the value of individual bits. It may be something simple like an integer word or an array of bytes, or something more generic like a BitSet in Java, depending on the number of bits you want to handle.
My question concerns transforming the length of the useful bits into a length expressed as a number of bytes. This is virtually always required because you typically can't allocate less than 8 bits (1 byte) of memory, and so you end up with extra padding bits in your "bit-set" object.
So, to sum things up, how do you correctly get the size in bytes necessary to accommodate a given size in bits?
NOTE: Take into consideration potential integer overflows that may result in an incorrect answer. For example, n_bytes = (n_bits + 7) / 8 may result in an integer overflow if n_bits is large enough.
You can avoid int overflows by using long long ints:
n_bytes = static_cast<int>((n_bits + 7LL) / 8)
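The same idea translated to Go (my own sketch; the function name is just for illustration): widen before adding so the intermediate sum cannot overflow a 32-bit bit count.

package main

import "fmt"

// byteSizeWide widens to int64 before adding 7, so the intermediate
// sum cannot overflow even when nBits is close to the int32 maximum.
func byteSizeWide(nBits int32) int32 {
    return int32((int64(nBits) + 7) / 8)
}

func main() {
    fmt.Println(byteSizeWide(2147483647)) // 268435456
}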
Here is an answer that works, though I think there are faster methods than this one.
if ((bit_size % 8) == 0)
byte_size = bit_size/8
else
byte_size = bit_size/8 + 1
EDIT: For example, to speed things up, you could replace the division by a right shift and the modulus by a bitwise AND.
if ((bit_size & 7) == 0)
byte_size = bit_size >> 3
else
byte_size = (bit_size >> 3) + 1
However, compilers may sometimes make these kinds of optimizations themselves, so this may not be that much better.
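Here is a runnable Go sketch of the divide-first approach above (names are mine); because it divides before rounding up, there is no n_bits + 7 intermediate that could overflow:

package main

import "fmt"

// bytesForBits rounds up to whole bytes without ever computing bitSize+7.
func bytesForBits(bitSize uint64) uint64 {
    byteSize := bitSize / 8
    if bitSize%8 != 0 {
        byteSize++
    }
    return byteSize
}

func main() {
    fmt.Println(bytesForBits(0), bytesForBits(1), bytesForBits(8), bytesForBits(9)) // 0 1 1 2
}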
I am looking for a hash-function which operates on a small integer (say in the range 0...1000) and outputs a 64 bit int.
The result-set should look like a random distribution of 64 bit ints: a uniform distribution with no linear correlation between the results.
I was hoping for a function that only takes a few CPU-cycles to execute. (the code will be in C++).
I considered multiplying the input by a big prime number and taking the modulo 2**64 (something like a linear congruent generator), but there are obvious dependencies between the outputs (in the lower bits).
Googling did not show up anything, but I am probably using wrong search terms.
Does such a function exist?
Some Background-info:
I want to avoid using a big persistent table with pseudo random numbers in an algorithm, and calculate random-looking numbers on the fly.
Security is not an issue.
I tested the 64-bit finalizer of MurmurHash3 (suggested by #aix and this SO post). This gives zero if the input is zero, so I increased the input parameter by 1 first:
typedef unsigned long long uint64;
inline uint64 fasthash(uint64 i)
{
i += 1ULL;
i ^= i >> 33ULL;
i *= 0xff51afd7ed558ccdULL;
i ^= i >> 33ULL;
i *= 0xc4ceb9fe1a85ec53ULL;
i ^= i >> 33ULL;
return i;
}
Here the input argument i is a small integer, for example an element of {0, 1, ..., 1000}. The output looks random:
i fasthash(i) decimal: fasthash(i) hex:
0 12994781566227106604 0xB456BCFC34C2CB2C
1 4233148493373801447 0x3ABF2A20650683E7
2 815575690806614222 0x0B5181C509F8D8CE
3 5156626420896634997 0x47900468A8F01875
... ... ...
There is no linear correlation between subsequent elements of the series:
(Plot omitted; the range of both axes is 0..2^64-1.)
Why not use an existing hash function, such as MurmurHash3 with a 64-bit finalizer? According to the author, the function takes tens of CPU cycles per key on current Intel hardware.
Given: input i in the range of 0 to 1,000.
const MaxInt which is the maximum value that can be contained in a 64-bit int. (You did not say whether it is signed or unsigned; 2^64 = 18446744073709551616.)
and a function rand() that returns a value between 0 and 1 (most languages have such a function)
compute hashvalue = i * rand() * ( MaxInt / 1000 )
1,000 * 1,000 = 1,000,000. That fits well within an Int32.
Subtract the low bound of your range from the number.
Square it, and use it as a direct subscript into some sort of bitmap.
From "Signed Types" on Encoding - Protocol Buffers - Google Code:
ZigZag encoding maps signed integers to unsigned integers so that numbers with a small absolute value (for instance, -1) have a small varint encoded value too. It does this in a way that "zig-zags" back and forth through the positive and negative integers, so that -1 is encoded as 1, 1 is encoded as 2, -2 is encoded as 3, and so on, as you can see in the following table:
Signed Original Encoded As
0 0
-1 1
1 2
-2 3
2147483647 4294967294
-2147483648 4294967295
In other words, each value n is encoded using
(n << 1) ^ (n >> 31)
for sint32s, or
(n << 1) ^ (n >> 63)
for the 64-bit version.
How does (n << 1) ^ (n >> 31) equal what's in the table? I understand that it would work for positives, but how does that work for, say, -1? Wouldn't -1 be 1111 1111, and (n << 1) be 1111 1110? (Is bit-shifting on negatives even well-defined in any language?)
Nonetheless, using the formula and doing (-1 << 1) ^ (-1 >> 31), assuming a 32-bit int, I get all ones (1111 1111...), which is about 4 billion, whereas the table says I should get 1.
Shifting a negative signed integer to the right copies the sign bit, so that
(-1 >> 31) == -1
Then,
(-1 << 1) ^ (-1 >> 31) = -2 ^ -1
= 1
This might be easier to visualise in binary (8 bit here):
(-1 << 1) ^ (-1 >> 7) = 11111110 ^ 11111111
= 00000001
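If you want to check the table numerically, here is a small Go sketch of my own (Go, unlike C and C++, gives shifts of signed values a defined meaning):

package main

import "fmt"

func main() {
    for _, n := range []int32{0, -1, 1, -2, 2147483647, -2147483648} {
        // (n << 1) ^ (n >> 31), reinterpreted as an unsigned 32-bit value
        encoded := uint32(n<<1) ^ uint32(n>>31)
        fmt.Printf("%11d -> %10d\n", n, encoded)
    }
}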
Another way to think about zig zag mapping is that it is a slight twist on a sign and magnitude representation.
In zig zag mapping, the least significant bit (lsb) of the mapping indicates the sign of the value: if it's 0, then the original value is non-negative, if it's 1, then the original value is negative.
Non-negative values are simply left shifted one bit to make room for the sign bit in the lsb.
For negative values, you could do the same one bit left shift for the absolute value (magnitude) of the number and simply have the lsb indicate the sign. For example, -1 could map to 0x03 or 0b00000011, where the lsb indicates that it is negative and the magnitude of 1 is left shifted by 1 bit.
The ugly thing about this sign and magnitude representation is "negative zero," mapped as 0x01 or 0b00000001. This variant of zero "uses up" one of our values and shifts the range of integers we can represent by one. We probably want to special case map negative zero to -2^63, so that we can represent the full 64b 2's complement range of [-2^63, 2^63). That means we've used one of our valuable single byte encodings to represent a value that will very, very, very rarely be used in an encoding optimized for small magnitude numbers and we've introduced a special case, which is bad.
This is where zig zag's twist on this sign and magnitude representation happens. The sign bit is still in the lsb, but for negative numbers, we subtract one from the magnitude rather than special casing negative zero. Now, -1 maps to 0x01 and -2^63 has a non-special case representation too (i.e. - magnitude 2^63 - 1, left shifted one bit, with lsb / sign bit set, which is all bits set to 1s).
So, another way to think about zig zag encoding is that it is a smarter sign and magnitude representation: the magnitude is left shifted one bit, the sign bit is stored in the lsb, and 1 is subtracted from the magnitude of negative numbers.
It is faster to implement these transformations using the unconditional bit-wise operators that you posted rather than explicitly testing the sign, special case manipulating negative values (e.g. - negate and subtract 1, or bitwise not), shifting the magnitude, and then explicitly setting the lsb sign bit. However, they are equivalent in effect and this more explicit sign and magnitude series of steps might be easier to understand what and why we are doing these things.
I will warn you though that bit shifting signed values in C / C++ is not portable and should be avoided. Left shifting a negative value has undefined behavior and right shifting a negative value has implementation defined behavior. Even left shifting a positive integer can have undefined behavior (e.g. - if you shift into the sign bit it might cause a trap or something worse). So, in general, don't bit shift signed types in C / C++. "Just say no."
Cast first to the unsigned version of the type to have safe, well-defined results according to the standards. This does mean that you then won't have arithmetic shift of negative values (i.e. - dragging the sign bit to the right) -- only logical shift, so you need to adjust the logic to account for that.
Here are the safe and portable versions of the zig zag mappings for 2's complement 64b integers in C:
#include <stdint.h>

uint64_t zz_map( int64_t x )
{
    return ( ( uint64_t ) x << 1 ) ^ -( ( uint64_t ) x >> 63 );
}

int64_t zz_unmap( uint64_t y )
{
    return ( int64_t ) ( ( y >> 1 ) ^ -( y & 0x1 ) );
}
Note the arithmetic negation of the sign bit in the right hand term of the XORs. That yields either 0 for non-negatives or all 1's for negatives -- just like arithmetic shift of the sign bit from msb to lsb would do. The XOR then effectively "undoes" / "redoes" the 2's complementation minus 1 (i.e. - 1's complementation or logical negation) for negative values without any conditional logic or further math.
Let me add my two cents to the discussion. As other answers noted, zig-zag encoding can be thought of as a twist on sign-magnitude representation. This fact can be used to implement conversion functions which work for arbitrary-sized integers.
For example, I use the following code in one of my Python projects:
def zigzag(x: int) -> int:
    return x << 1 if x >= 0 else (-x - 1) << 1 | 1

def zagzig(x: int) -> int:
    assert x >= 0
    sign = x & 1
    return -(x >> 1) - 1 if sign else x >> 1
These functions work even though Python's int has no fixed bit width; instead, it extends dynamically. However, this approach may be inefficient in compiled languages since it requires conditional branching.