I'm teaching myself Go from a C background.
The code below works as I expect (the first two Printf() will access bytes, the last two Printf() will access codepoints).
What I am not clear is if this involves any copying of data.
package main
import "fmt"
var a string
func main() {
a = "èe"
fmt.Printf("%d\n", a[0])
fmt.Printf("%d\n", a[1])
fmt.Println("")
fmt.Printf("%d\n", []rune(a)[0])
fmt.Printf("%d\n", []rune(a)[1])
}
In other words:
does []rune("string") create an array of runes and fill it with the runes corresponding to "string", or it's just the compiler that figures out how to get runes from the string bytes?
It is not possible to turn []uint8 (i.e. a string) into []int32 (an alias for []rune) without allocating an array.
Also, strings are immutable in Go but slices are not, so the conversion to both []byte and []rune must copy the string's bytes in some way or another.
It involves a copy because:
strings are immutable; if the conversion []rune(s) didn't make a copy, you would be able to index the rune slice and change the string contents
a string value is a "(possibly empty) sequence of bytes", where byte is an alias of uint8, whereas a rune is a "an integer value identifying a Unicode code point" and an alias of int32. The types are not identical and even the lengths may not be the same:
a = "èe"
r := []rune(a)
fmt.Println(len(a)) // 3 (3 bytes)
fmt.Println(len(r)) // 2 (2 Unicode code points)
Related
Go allows conversion from rune to byte. But the underlying type for rune is int32 (because Go uses UTF-8) and for byte it is uint8, the conversion therefore results in a loss of information. However it is not possible to convert from a rune to []byte.
var b byte = '©'
bs := []byte(string('©'))
fmt.Println(b)
fmt.Println(bs)
// Output
169
[194 169]
Working example
Why does Go allow conversion from rune to byte instead of rune to []byte?
Go supports conversion from rune to byte as it does for all pairs of numeric types. It would be a surprising special case if int32 to byte conversion was not allowed.
But the underlying type for rune is int32 (because Go uses UTF-8)
This misses an important detail: rune is an alias for int32. They are the same type.
It's true that the underlying type is rune is int32, but that's because rune and int32 are the same type and the underlying type of a builtin type is the type itself.
The representation of Unicode code points as int32 values is unrelated to UTF-8 encoding.
the conversion therefore results in a loss of information
Yes, conversions between numeric types can result in loss of information. This is one reason why conversions in Go must be explicit.
Note that the statement var b byte = '©' does not do any conversions. The expression '#' is an untyped constant.
The compiler reports an error if the assignment of an untyped constant results in a loss of information. For example, the statement var b byte = '世' causes a compilation error.
All UTF-8 encoding functionality in the language is related to the string type. The UTF-8 aware conversions are all to or from the string type. The []byte(numericType) conversion could be supported, but that would bring UTF-8 encoding outside of the string type.
The Go authors regret including the string(numericType) conversion because it's not very useful in practice and the conversion is not what some people expect. A library function is a better place for the functionality.
Yes it is possible to convert from a rune to []byte (for example via a byte) and back again.
package main
import "fmt"
func main() {
var b byte = '©'
bs := []byte{b}
fmt.Printf("%T %v\n", b, b) // uint8 169
fmt.Printf("%T %v\n", bs, bs) // []uint8 [169]
s := string(bs[0]) // s := string(b) works too.
r2 := rune(s[0]) // r2 := rune(b) works too.
fmt.Printf("%T %v\n", s, s) // string ©
fmt.Printf("%T %v\n", r2, r2) // int32 169
}
The reason for this behaviour is the same reason why it's legal to do
var b int32
b = 1000000
fmt.Printf("%b\n", b)
fmt.Printf("%b", uint8(b))
// Output:
// 11110100001001000000
// 1000000
You should expect the conversion to loose data when you put data of a type with larger memory footprint into one with a smaller memory footprint.
Also, for encoding a rune you can use EncodeRune which indeed uses a []byte.
I came across a function posted online that used the rune() function in golang, but I am having a hard time looking up what it is. I am going through the tutorial and inexperienced with the docs so it is hard to find what I am looking for.
Specifically, I am trying to see why this fails...
fmt.Println(rune("foo"))
and this does not
fmt.Println([]rune("foo"))
rune is a type in Go. It's just an alias for int32, but it's usually used to represent Unicode points. rune() isn't a function, it's syntax for type conversion into rune. Conversions in Go always have the syntax type() which might make them look like functions.
The first bit of code fails because conversion of strings to numeric types isn't defined in Go. However conversion of strings to slices of runes/int32s is defined like this in language specification:
Converting a value of a string type to a slice of runes type yields a
slice containing the individual Unicode code points of the string.
[golang.org]
So your example prints a slice of runes with values 102, 111 and 111
As stated in #Michael's first-rate comment fmt.Println([]rune("foo")) is a conversion of a string to a slice of runes []rune. When you convert from string to []rune, each utf-8 char in that string becomes a Rune. See https://stackoverflow.com/a/51611567/12817546. Similarly, in the reverse conversion, when converted from []rune to string, each rune becomes a utf-8 char in the string. See https://stackoverflow.com/a/51611567/12817546. A []rune can also be set to a byte, float64, int or a bool.
package main
import (
. "fmt"
)
func main() {
r := []rune("foo")
c := []interface{}{byte(r[0]), float64(r[0]), int(r[0]), r, string(r), r[0] != 0}
checkType(c)
}
func checkType(s []interface{}) {
for k, _ := range s {
Printf("%T %v\n", s[k], s[k])
}
}
byte(r[0]) is set to “uint8 102”, float64(r[0]) is set to “float64 102”,int(r[0]) is set to “int 102”, r is the rune” []int32 [102 111 111]”, string(r) prints “string foo”, r[0] != 0 and shows “bool true”.
[]rune to string conversion is supported natively by the spec. See the comment in https://stackoverflow.com/a/46021588/12817546. In Go then a string is a sequence of bytes. However, since multiple bytes can represent a rune code-point, a string value can also contain runes. So, it can be converted to a []rune , or vice versa. See https://stackoverflow.com/a/19325804/12817546.
Note, there are only two built-in type aliases in Go, byte (alias of uint8) and rune (alias of int32). See https://Go101.org/article/type-system-overview.html. Rune literals are just 32-bit integer values. For example, the rune literal 'a' is actually the number "97". See https://stackoverflow.com/a/19311218/12817546. Quotes edited.
I have this code:
func my_function(hash string) [16]byte {
b, _ := hex.DecodeString(hash)
return b // Compile error: fails since [16]byte != []byte
}
b will be of type []byte. I know that hash is of length 32. How can I make my code above work? Ie. can I somehow cast from a general-length byte array to a fixed-length byte array? I am not interested in allocating 16 new bytes and copying the data over.
There is no direct method to convert a slice to an array. You can however do a copy.
var ret [16]byte
copy(ret[:], b)
The standard library uses []byte and if you insist on using something else you will just have a lot more typing to do. I wrote a program using arrays for my md5 values and regretted it.
Is it true that converting from string to []byte allocates new memory? Also, does converting from []byte to string allocates new memory?
s := "a very long string"
b := []byte(s) // does this doubled the memory requirement?
b := []byte{1,2,3,4,5, ...very long bytes..}
s := string(b) // does this doubled the memory requirement?
Yes in both cases.
String types are immutable. Therefore converting them to a mutable slice type will allocate a new slice. See also http://blog.golang.org/go-slices-usage-and-internals
The same with the inverse. Otherwise mutating the slice would change the string, which would contradict the spec.
I'm having some trouble while reading a file which has a fixed column length format. Some columns may contain umlauts.
Umlauts seem to use 2 bytes instead of one. This is not the behaviour I was expecting. Is there any kind of function which returns a substring? Slice does not seem to work in this case.
Here's some sample code:
http://play.golang.org/p/ZJ1axy7UXe
umlautsString := "Rhön"
fmt.Println(len(umlautsString))
fmt.Println(umlautsString[0:4])
Prints:
5
Rhö
In go, a slice of a string counts bytes, not runes. This is why "Rhön"[0:3] gives you Rh and the first byte of ö.
Characters encoded in UTF-8 are represented as runes because UTF-8 encodes characters in more than one
byte (up to four bytes) to provide a bigger range of characters.
If you want to slice a string with the [] syntax, convert the string to []rune before.
Example (on play):
umlautsString := "Rhön"
runes = []rune(umlautsString)
fmt.Println(string(runes[0:3])) // Rhö
Noteworthy: This golang blog post about string representation in go.
You can convert string to []rune and work with it:
package main
import "fmt"
func main() {
umlautsString := "Rhön"
fmt.Println(len(umlautsString))
subStrRunes:= []rune(umlautsString)
fmt.Println(len(subStrRunes))
fmt.Println(string(subStrRunes[0:4]))
}
http://play.golang.org/p/__WfitzMOJ
Hope that helps!
Another option is the utf8string package:
package main
import "golang.org/x/exp/utf8string"
func main() {
s := utf8string.NewString("🧡💛💚💙💜")
// example 1
n := s.RuneCount()
println(n == 5)
// example 2
t := s.Slice(0, 2)
println(t == "🧡💛")
}
https://pkg.go.dev/golang.org/x/exp/utf8string