Golang unicode character value - go

I ran this code and got the output below, but why is the bytes value E4B8AD while the int value is 20013? Why is column 2 not equal to column 5?
package main
import(
"fmt"
)
func main(){
str2 := "中文"
fmt.Println("index int(rune) rune char bytes")
for index, rune := range str2{
fmt.Printf("%-2d %d %U '%c' %X\n", index, rune, rune, rune, []byte(string(rune)))
}
}
The output is:
index int(rune) rune char bytes
0 20013 U+4E2D '中' E4B8AD
1 25991 U+6587 '文' E69687

A Unicode code point for a character is not necessarily the same as the byte representation of that character in a given character encoding.
For the character 中, the code point is U+4E2D, but the byte representations in various character encodings are:
E4B8AD (UTF-8)
4E2D (UTF-16)
00004E2D (UTF-32)
There's a really good answer here that explains how to convert between code points and byte representations. There's also the excellent The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky.
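As a minimal sketch of the point above, the following program prints the same code point in the three encodings listed. The helper names utf8Hex, utf16Hex and utf32Hex are made up for this illustration; only the standard unicode/utf8 and unicode/utf16 packages are used.

```go
package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// utf8Hex returns the UTF-8 bytes of r as uppercase hex.
func utf8Hex(r rune) string {
	buf := make([]byte, utf8.UTFMax)
	n := utf8.EncodeRune(buf, r)
	return fmt.Sprintf("%X", buf[:n])
}

// utf16Hex returns the UTF-16 code units of r as uppercase hex,
// four digits per 16-bit unit.
func utf16Hex(r rune) string {
	s := ""
	for _, u := range utf16.Encode([]rune{r}) {
		s += fmt.Sprintf("%04X", u)
	}
	return s
}

// utf32Hex returns the code point as a single 32-bit unit in hex.
func utf32Hex(r rune) string {
	return fmt.Sprintf("%08X", r)
}

func main() {
	for _, r := range "中文" {
		fmt.Printf("%U utf8=%s utf16=%s utf32=%s\n",
			r, utf8Hex(r), utf16Hex(r), utf32Hex(r))
	}
	// U+4E2D utf8=E4B8AD utf16=4E2D utf32=00004E2D
	// U+6587 utf8=E69687 utf16=6587 utf32=00006587
}
```

The int value 20013 in the question is simply 0x4E2D in decimal, that is, the code point, while column 5 shows how UTF-8 spreads those bits over three bytes.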

Related

Does the conversion from string to rune slice make a copy?

I'm teaching myself Go from a C background.
The code below works as I expect (the first two Printf() calls access bytes, the last two access code points).
What I am not clear about is whether this involves any copying of data.
package main
import "fmt"
var a string
func main() {
a = "èe"
fmt.Printf("%d\n", a[0])
fmt.Printf("%d\n", a[1])
fmt.Println("")
fmt.Printf("%d\n", []rune(a)[0])
fmt.Printf("%d\n", []rune(a)[1])
}
In other words:
does []rune("string") create an array of runes and fill it with the runes corresponding to "string", or it's just the compiler that figures out how to get runes from the string bytes?
It is not possible to turn the bytes of a string (effectively a []uint8) into a []int32 (rune is an alias for int32) without allocating a new array.
Also, strings are immutable in Go but slices are not, so the conversion to both []byte and []rune must copy the string's bytes in some way or another.
It involves a copy because:
strings are immutable; if the conversion []rune(s) didn't make a copy, you would be able to index the rune slice and change the string contents
a string value is a "(possibly empty) sequence of bytes", where byte is an alias of uint8, whereas a rune is "an integer value identifying a Unicode code point" and an alias of int32. The types are not identical and even the lengths may not be the same:
a = "èe"
r := []rune(a)
fmt.Println(len(a)) // 3 (3 bytes)
fmt.Println(len(r)) // 2 (2 Unicode code points)
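One way to observe the copy is to modify the rune slice and check that the string is untouched. This is only a sketch: mutateCopy and countRunes are hypothetical helper names, and the string is written with the \u00e8 escape so that è is guaranteed to be the two-byte precomposed form.

```go
package main

import "fmt"

// mutateCopy converts s to []rune, overwrites the first rune, and returns
// the original string together with the modified copy as a string.
func mutateCopy(s string) (string, string) {
	r := []rune(s) // allocates a new slice holding the decoded code points
	if len(r) > 0 {
		r[0] = 'X' // changes only the slice, never the string
	}
	return s, string(r)
}

// countRunes walks s with range, which decodes one code point per
// iteration directly from the string's bytes.
func countRunes(s string) int {
	n := 0
	for range s {
		n++
	}
	return n
}

func main() {
	a := "\u00e8e" // "èe"
	orig, changed := mutateCopy(a)
	fmt.Println(orig, changed)         // èe Xe
	fmt.Println(len(a), countRunes(a)) // 3 2
}
```

If only sequential access to the code points is needed, ranging over the string as in countRunes avoids building a []rune in the first place.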

Why does `utf8.Valid()` consider some valid UTF-8 code as invalid?

$ cat main.go
package main
import (
"fmt"
"unicode/utf8"
"io/ioutil"
"os"
"log"
)
func main() {
str, err := ioutil.ReadFile(os.Args[1])
if err != nil { log.Fatal(err) }
fmt.Println(utf8.Valid([]byte(str)))
}
$ go run main.go <(echo $'\xed\xa0\xb5')
false
utf8.Valid() says '\xed\xa0\xb5' is invalid.
\xed: 11101101
\xa0: 10100000
\xb5: 10110101
But those are the three bytes in binary format. A leading byte of the form 1110xxxx means a three-byte character, and according to the table on the Wikipedia page, the following two bytes (10xxxxxx) should be valid.
https://en.wikipedia.org/wiki/UTF-8
Does anybody know why utf8.Valid() reports this three-byte sequence as invalid UTF-8?
As per the codepage layout table on Wikipedia, while the hex sequence \xed\xa0\xb5 follows the encoding rules for the codepoint at 0xd835, it is considered invalid as it matches the codepoint reserved for surrogate halves in UTF-16. When looking at the codepoint table for Unicode, you'll see that the codepoints from 0xD800 to 0xDFFF are all unassigned and marked as "explicitly invalid in UTF-8".
Unicode and ISO/IEC 10646 do not assign actual characters to any of the code points in the D800–DFFF range — these code points only have meaning when used
in surrogate pairs. Hence an individual code point from a surrogate pair does not represent a character, is invalid unless used in a surrogate pair, and is
unconditionally invalid in UTF-32 and UTF-8 (if strict conformance to the standard is applied).
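The behaviour can be checked with a short sketch using the standard unicode/utf8 package. The helper name describe is made up for this example; the second byte sequence, ED 9F BF, encodes U+D7FF, the last code point before the surrogate range, and is included for contrast.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe reports whether b is valid UTF-8 and what decoding the
// first rune of b yields (the rune and the number of bytes consumed).
func describe(b []byte) (valid bool, r rune, size int) {
	r, size = utf8.DecodeRune(b)
	return utf8.Valid(b), r, size
}

func main() {
	surrogate := []byte{0xED, 0xA0, 0xB5} // bit pattern for U+D835
	before := []byte{0xED, 0x9F, 0xBF}    // U+D7FF, just below the surrogates

	v, r, n := describe(surrogate)
	fmt.Println(v, r == utf8.RuneError, n) // false true 1

	v, r, n = describe(before)
	fmt.Printf("%v %U %d\n", v, r, n) // true U+D7FF 3

	// The code point itself is rejected, independent of any encoding.
	fmt.Println(utf8.ValidRune(0xD835)) // false
}
```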

Problem with decoding utf8 characters - šđžčć

I have a word which contains some of these characters: šđžčć. When I take the first letter of that word, I get a byte, and when I convert that byte into a string I get an incorrectly decoded string.
Can someone help me figure out how to properly decode the extracted letter?
This is the example code:
package main
import (
"fmt"
)
func main() {
word := "ŠKOLA"
c := word[0]
fmt.Println(word, string(c)) // ŠKOLA Å
}
https://play.golang.org/p/6T2FX4vN3-U
Š is more than one byte. One method to index runes is to convert the string to []rune
c := []rune(word)[0]
https://play.golang.org/p/NBUopxe-ik1
You can also use the functions provided in the utf8 package, like utf8.DecodeRune and utf8.DecodeRuneInString to iterate over the individual codepoints in the utf8 string.
r, _ := utf8.DecodeRuneInString(word)
fmt.Println(word, string(r))
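Putting the pieces together, a complete program might look like the sketch below. The helper name firstRune is hypothetical; it only wraps utf8.DecodeRuneInString, which returns the first code point and the number of bytes it occupies.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstRune returns the first code point of s and its width in bytes.
func firstRune(s string) (rune, int) {
	return utf8.DecodeRuneInString(s)
}

func main() {
	word := "\u0160KOLA" // "ŠKOLA"; Š is U+0160 and takes two bytes in UTF-8

	r, size := firstRune(word)
	fmt.Printf("%c %d %q\n", r, size, word[size:]) // Š 2 "KOLA"

	// range yields the byte offset and the decoded rune at that offset.
	for i, r := range word {
		fmt.Printf("%d:%c ", i, r)
	}
	fmt.Println() // 0:Š 2:K 3:O 4:L 5:A
}
```

Unlike []rune(word)[0], this decodes only the first character instead of converting the whole string.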

Why does fmt.Printf("%x", 'ᚵ') ~> 16b5, while fmt.Printf("%x", "ᚵ") ~> e19ab5?

package main
import (
"fmt"
)
func main() {
fmt.Printf("%c, %x, %x", 'ᚵ', 'ᚵ', "ᚵ")
}
Outputs:
ᚵ, 16b5, e19ab5
https://play.golang.org/p/_Bs7JcdOfO
Because each does a different thing. Both format the argument as a hexadecimal number, but each views the argument differently.
fmt.Printf("%x", 'ᚵ') prints a single Unicode character (a rune, if you will) as a 32-bit integer (int32).
fmt.Printf("%x", "ᚵ") prints a string (the individual bytes of the string) as 8-bit integers (uint8). The rune is encoded in three bytes when UTF-8 encoding is used. That is why there are six hexadecimal digits (two for each byte).
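The two views can be put side by side in a small sketch. The helper names asRune and asString are invented for this example; the character is written as the escape \u16b5 so the source does not depend on the editor's encoding.

```go
package main

import "fmt"

// asRune formats the code point itself as one hexadecimal integer.
func asRune(r rune) string {
	return fmt.Sprintf("%x", r)
}

// asString formats the UTF-8 bytes of s, two hex digits per byte.
func asString(s string) string {
	return fmt.Sprintf("%x", s)
}

func main() {
	fmt.Println(asRune('\u16b5'))   // 16b5
	fmt.Println(asString("\u16b5")) // e19ab5

	// Converting the rune to a string first gives the byte view again.
	fmt.Println(asString(string('\u16b5'))) // e19ab5

	// The space flag separates the individual bytes.
	fmt.Printf("% x\n", "\u16b5") // e1 9a b5
}
```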
To study printing of a string in detail, start at function fmtString in file fmt/print.go.
func (p *pp) fmtString(v string, verb rune) {

golang how does the rune() function work

I came across a function posted online that used the rune() function in golang, but I am having a hard time looking up what it is. I am going through the tutorial and inexperienced with the docs so it is hard to find what I am looking for.
Specifically, I am trying to see why this fails...
fmt.Println(rune("foo"))
and this does not
fmt.Println([]rune("foo"))
rune is a type in Go. It's just an alias for int32, but it's usually used to represent Unicode code points. rune() isn't a function; it's the syntax for a type conversion to rune. Conversions in Go always have the form T(x), which can make them look like function calls.
The first bit of code fails because conversion of strings to numeric types isn't defined in Go. However, conversion of strings to slices of runes/int32s is defined like this in the language specification:
Converting a value of a string type to a slice of runes type yields a
slice containing the individual Unicode code points of the string.
[golang.org]
So your example prints a slice of runes with the values 102, 111 and 111.
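A short sketch of the conversions that do and do not compile may help. The helper name codePoints is made up for this example; it is just the []rune(s) conversion quoted from the specification.

```go
package main

import "fmt"

// codePoints converts s to a slice of its Unicode code points.
func codePoints(s string) []rune {
	return []rune(s)
}

func main() {
	// rune("foo") does not compile: a string cannot be converted
	// to a single integer value.

	fmt.Println(codePoints("foo"))         // [102 111 111]
	fmt.Println(string(codePoints("foo"))) // foo

	// rune(x) is valid when x is a numeric value.
	fmt.Println(rune(102) == 'f')  // true
	fmt.Println(string(rune(102))) // f
}
```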
As stated in @Michael's first-rate comment, fmt.Println([]rune("foo")) converts a string to a slice of runes, []rune. When you convert from string to []rune, each UTF-8 encoded character in the string becomes a rune. See https://stackoverflow.com/a/51611567/12817546. Similarly, in the reverse conversion from []rune to string, each rune becomes a UTF-8 encoded character in the string. An individual element of a []rune can also be converted to a byte, float64 or int, or compared to produce a bool.
package main
import "fmt"
func main() {
r := []rune("foo")
c := []interface{}{byte(r[0]), float64(r[0]), int(r[0]), r, string(r), r[0] != 0}
checkType(c)
}
// checkType prints the dynamic type and value of each element.
func checkType(s []interface{}) {
for _, v := range s {
fmt.Printf("%T %v\n", v, v)
}
}
byte(r[0]) prints "uint8 102", float64(r[0]) prints "float64 102", int(r[0]) prints "int 102", r prints "[]int32 [102 111 111]", string(r) prints "string foo", and r[0] != 0 prints "bool true".
[]rune to string conversion is supported natively by the spec. See the comment in https://stackoverflow.com/a/46021588/12817546. In Go, a string is a sequence of bytes. However, since one or more bytes can encode a code point, a string can also be viewed as a sequence of runes, so it can be converted to a []rune and back. See https://stackoverflow.com/a/19325804/12817546.
Note that byte (alias of uint8) and rune (alias of int32) are built-in type aliases in Go; since Go 1.18, any is also a built-in alias, for interface{}. See https://Go101.org/article/type-system-overview.html. Rune literals are just 32-bit integer values. For example, the rune literal 'a' is actually the number 97. See https://stackoverflow.com/a/19311218/12817546.

Resources