Why does `utf8.Valid()` consider some valid UTF-8 code as invalid? - go

$ cat main.go
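package main
import (
	"fmt"
	"log"
	"os"
	"unicode/utf8"
)
func main() {
	// Read the whole file named by the first command-line argument.
	data, err := os.ReadFile(os.Args[1])
	if err != nil {
		log.Fatal(err)
	}
	// Report whether the file content is entirely valid UTF-8.
	fmt.Println(utf8.Valid(data))
}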
$ go run main.go <(echo $'\xed\xa0\xb5')
false
utf8.Valid() says '\xed\xa0\xb5' is invalid.
\xed: 11101101
\xa0: 10100000
\xb5: 10110101
Above are the three bytes in binary. A first byte of the form 1110xxxx means the sequence is a three-byte character. According to the table on the wiki page, the following two bytes (10xxxxxx) should be valid continuation bytes.
https://en.wikipedia.org/wiki/UTF-8
Does anybody know why utf8.Valid() reports this three-byte character as invalid UTF-8?

As per the code page layout table on Wikipedia, the byte sequence \xed\xa0\xb5 follows the bit-level encoding rules and decodes to the code point 0xD835, but it is still invalid because that code point lies in the range reserved for surrogate halves in UTF-16. In the code point table for Unicode, you'll see that the code points from 0xD800 to 0xDFFF are not assigned to any character and are marked as "explicitly invalid in UTF-8".
Unicode and ISO/IEC 10646 do not assign actual characters to any of the code points in the D800–DFFF range — these code points only have meaning when used
in surrogate pairs. Hence an individual code point from a surrogate pair does not represent a character, is invalid unless used in a surrogate pair, and is
unconditionally invalid in UTF-32 and UTF-8 (if strict conformance to the standard is applied).
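To make this concrete, here is a minimal sketch that decodes the three bytes by hand and then checks the result against the surrogate range. The helpers decode3 and isSurrogate are hypothetical names written for this illustration; utf8.ValidRune, utf8.Valid and utf8.DecodeRune are the standard library calls.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decode3 applies the plain three-byte UTF-8 bit layout
// (1110xxxx 10xxxxxx 10xxxxxx) without any validity checks.
// Hypothetical helper for illustration only.
func decode3(b0, b1, b2 byte) rune {
	return rune(b0&0x0F)<<12 | rune(b1&0x3F)<<6 | rune(b2&0x3F)
}

// isSurrogate reports whether r lies in the UTF-16 surrogate range.
func isSurrogate(r rune) bool {
	return r >= 0xD800 && r <= 0xDFFF
}

func main() {
	b := []byte{0xED, 0xA0, 0xB5}

	r := decode3(b[0], b[1], b[2])
	fmt.Printf("%U surrogate=%t\n", r, isSurrogate(r)) // U+D835 surrogate=true

	fmt.Println(utf8.ValidRune(r)) // false
	fmt.Println(utf8.Valid(b))     // false

	// The real decoder rejects the sequence and consumes one byte.
	got, size := utf8.DecodeRune(b)
	fmt.Println(got == utf8.RuneError, size) // true 1
}
```

So the bit pattern is well-formed, but the decoded value is one that UTF-8 is not allowed to carry.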

Related

Golang unicode character value

I ran this code and got the output below. Why is the bytes value E4B8AD while the int value is 20013? Why is column 2 not equal to column 5?
package main
import(
"fmt"
)
func main(){
str2 := "中文"
fmt.Println("index int(rune) rune char bytes")
for index, rune := range str2{
fmt.Printf("%-2d %d %U '%c' %X\n", index, rune, rune, rune, []byte(string(rune)))
}
}
the output is :
index int(rune) rune char bytes
0  20013 U+4E2D '中' E4B8AD
3  25991 U+6587 '文' E69687
A Unicode code point for a character is not necessarily the same as the byte representation of that character in a given character encoding.
For the character 中, the code point is U+4E2D, but the byte representations in various character encodings are:
E4B8AD (UTF-8)
4E2D (UTF-16)
00004E2D (UTF-32)
There's a really good answer here that explains how to convert between code points and byte representations. There's also the excellent The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky.
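As a rough sketch, the three byte representations listed above can be reproduced with the standard library. The helper names utf8Bytes, utf16BEBytes and utf32BEBytes are made up for this example, and big-endian byte order is an assumption chosen to match the listing.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// utf8Bytes returns the UTF-8 encoding of r.
func utf8Bytes(r rune) []byte {
	buf := make([]byte, utf8.UTFMax)
	n := utf8.EncodeRune(buf, r)
	return buf[:n]
}

// utf16BEBytes returns the big-endian UTF-16 encoding of r.
func utf16BEBytes(r rune) []byte {
	units := utf16.Encode([]rune{r})
	out := make([]byte, 2*len(units))
	for i, u := range units {
		binary.BigEndian.PutUint16(out[2*i:], u)
	}
	return out
}

// utf32BEBytes returns the big-endian UTF-32 encoding of r.
func utf32BEBytes(r rune) []byte {
	out := make([]byte, 4)
	binary.BigEndian.PutUint32(out, uint32(r))
	return out
}

func main() {
	r := '中'
	fmt.Printf("%U = %d\n", r, r)               // U+4E2D = 20013
	fmt.Printf("%X (UTF-8)\n", utf8Bytes(r))    // E4B8AD (UTF-8)
	fmt.Printf("%X (UTF-16)\n", utf16BEBytes(r)) // 4E2D (UTF-16)
	fmt.Printf("%X (UTF-32)\n", utf32BEBytes(r)) // 00004E2D (UTF-32)
}
```

The code point 20013 (0x4E2D) stays the same in every case; only the bytes used to store it change.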

Problem with decoding utf8 characters - šđžčć

I have a word which contains some of these characters: šđžčć. When I take the first letter out of that word I get a byte, and when I convert that byte to a string I get an incorrectly decoded string.
Can someone help me figure out how to properly decode the extracted letter?
This is the example code:
package main
import (
"fmt"
)
func main() {
word := "ŠKOLA"
c := word[0]
fmt.Println(word, string(c)) // ŠKOLA Å
}
https://play.golang.org/p/6T2FX4vN3-U
Š takes more than one byte in UTF-8 (two bytes, 0xC5 0xA0), so word[0] is only its first byte. One way to index by rune is to convert the string to []rune:
c := []rune(word)[0]
https://play.golang.org/p/NBUopxe-ik1
You can also use the functions provided in the utf8 package, like utf8.DecodeRune and utf8.DecodeRuneInString to iterate over the individual codepoints in the utf8 string.
r, _ := utf8.DecodeRuneInString(word)
fmt.Println(word, string(r))
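For walking over the whole word rather than just the first letter, something along these lines should work. collectRunes is a hypothetical helper name; the loop relies only on utf8.DecodeRuneInString, and a plain range loop over the string does the same decoding for you.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// collectRunes walks s one code point at a time with
// utf8.DecodeRuneInString and returns the decoded runes.
// Hypothetical helper for illustration only.
func collectRunes(s string) []rune {
	var out []rune
	for len(s) > 0 {
		r, size := utf8.DecodeRuneInString(s)
		out = append(out, r)
		s = s[size:] // advance by the width of the decoded rune
	}
	return out
}

func main() {
	word := "ŠKOLA"

	r, size := utf8.DecodeRuneInString(word)
	fmt.Println(string(r), size) // Š 2

	fmt.Println(len(word), len(collectRunes(word))) // 6 5

	// range over a string yields byte offsets and decoded runes.
	for i, r := range word {
		fmt.Printf("%d:%c ", i, r) // 0:Š 2:K 3:O 4:L 5:A
	}
	fmt.Println()
}
```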

Why does fmt.Printf("%x", 'ᚵ') ~> 16b5, while fmt.Printf("%x", "ᚵ") ~> e19ab5?

package main
import (
"fmt"
)
func main() {
fmt.Printf("%c, %x, %x", 'ᚵ', 'ᚵ', "ᚵ")
}
Outputs:
ᚵ, 16b5, e19ab5
https://play.golang.org/p/_Bs7JcdOfO
Because each does a different thing. Both format the argument as a hexadecimal number, but each views the argument differently.
fmt.Printf("%x", 'ᚵ') formats a single Unicode character (a rune, if you will), which is a 32-bit integer (int32) holding the code point U+16B5.
fmt.Printf("%x", "ᚵ") formats a string byte by byte, treating each byte as an 8-bit integer (uint8). The rune is encoded in three bytes when UTF-8 is used, which is why there are six hexadecimal digits (two for each byte).
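A small sketch of the difference, using two hypothetical wrappers (hexOfRune and hexOfString) around fmt.Sprintf so the two views can be compared side by side:

```go
package main

import "fmt"

// hexOfRune formats the code point value of r in hexadecimal.
func hexOfRune(r rune) string {
	return fmt.Sprintf("%x", r)
}

// hexOfString formats the bytes of s in hexadecimal, two digits per byte.
func hexOfString(s string) string {
	return fmt.Sprintf("%x", s)
}

func main() {
	r := 'ᚵ'
	fmt.Println(hexOfRune(r))           // 16b5
	fmt.Println(hexOfString(string(r))) // e19ab5

	// "% x" puts a space between bytes, which makes the
	// three UTF-8 bytes easier to see.
	fmt.Printf("% x\n", []byte(string(r))) // e1 9a b5
}
```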
To study printing of a string in detail, start at function fmtString in file fmt/print.go.
func (p *pp) fmtString(v string, verb rune) {

golang how does the rune() function work

I came across a function posted online that used the rune() function in golang, but I am having a hard time looking up what it is. I am going through the tutorial and inexperienced with the docs so it is hard to find what I am looking for.
Specifically, I am trying to see why this fails...
fmt.Println(rune("foo"))
and this does not
fmt.Println([]rune("foo"))
rune is a type in Go. It's just an alias for int32, but it's usually used to represent Unicode code points. rune() isn't a function, it's the syntax for a type conversion to rune. Conversions in Go always have the form T(x), which might make them look like function calls.
The first bit of code fails because conversion of strings to numeric types isn't defined in Go. However, conversion of strings to slices of runes/int32s is defined like this in the language specification:
Converting a value of a string type to a slice of runes type yields a
slice containing the individual Unicode code points of the string.
[golang.org]
So your example prints a slice of runes with values 102, 111 and 111
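A minimal sketch of the conversions discussed here. codePoints is a hypothetical wrapper around the []rune conversion, added only so the result is easy to inspect.

```go
package main

import "fmt"

// codePoints converts a string to its Unicode code points.
// Hypothetical helper for illustration only.
func codePoints(s string) []rune {
	return []rune(s)
}

func main() {
	rs := codePoints("foo")
	fmt.Println(rs)         // [102 111 111]
	fmt.Println(string(rs)) // foo

	// A single rune is just an integer; string(r) encodes it as UTF-8.
	var r rune = 'f'
	fmt.Println(r, string(r)) // 102 f

	// rune("foo") would not compile: a string cannot be
	// converted to an integer type.
}
```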
As stated in @Michael's first-rate comment, fmt.Println([]rune("foo")) is a conversion of a string to a slice of runes, []rune. When you convert from string to []rune, each UTF-8 encoded character in that string becomes a rune. Similarly, in the reverse conversion from []rune to string, each rune becomes a UTF-8 encoded character in the string. See https://stackoverflow.com/a/51611567/12817546. An individual rune can in turn be converted to a byte, float64 or int, or compared to produce a bool.
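package main
import (
	"fmt"
)
func main() {
	r := []rune("foo")
	// Convert or compare the first rune; keep the slice and its string form too.
	c := []interface{}{byte(r[0]), float64(r[0]), int(r[0]), r, string(r), r[0] != 0}
	checkType(c)
}
// checkType prints the dynamic type and value of each element.
func checkType(s []interface{}) {
	for _, v := range s {
		fmt.Printf("%T %v\n", v, v)
	}
}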
The program prints "uint8 102" for byte(r[0]), "float64 102" for float64(r[0]), "int 102" for int(r[0]), "[]int32 [102 111 111]" for the rune slice r, "string foo" for string(r), and "bool true" for r[0] != 0.
[]rune to string conversion is supported natively by the spec. See the comment in https://stackoverflow.com/a/46021588/12817546. In Go, a string is a sequence of bytes. However, since one or more bytes can encode a rune (a code point), a string value can also be viewed as a sequence of runes, so it can be converted to a []rune and back. See https://stackoverflow.com/a/19325804/12817546.
Note that byte (alias of uint8) and rune (alias of int32) are built-in type aliases in Go. See https://Go101.org/article/type-system-overview.html. Rune literals are just 32-bit integer values. For example, the rune literal 'a' is actually the number 97. See https://stackoverflow.com/a/19311218/12817546. Quotes edited.

Umlauts and slices

I'm having some trouble while reading a file which has a fixed column length format. Some columns may contain umlauts.
Umlauts seem to use 2 bytes instead of one. This is not the behaviour I was expecting. Is there any kind of function which returns a substring? Slice does not seem to work in this case.
Here's some sample code:
http://play.golang.org/p/ZJ1axy7UXe
umlautsString := "Rhön"
fmt.Println(len(umlautsString))
fmt.Println(umlautsString[0:4])
Prints:
5
Rhö
In go, a slice of a string counts bytes, not runes. This is why "Rhön"[0:3] gives you Rh and the first byte of ö.
UTF-8 encodes each code point in one to four bytes, so a character outside ASCII, such as ö, occupies more than one byte of the string. A rune holds one whole code point regardless of how many bytes its encoding uses.
If you want to slice a string by characters with the [] syntax, convert the string to []rune first.
Example (on play):
umlautsString := "Rhön"
runes := []rune(umlautsString)
fmt.Println(string(runes[0:3])) // Rhö
Noteworthy: This golang blog post about string representation in go.
You can convert string to []rune and work with it:
package main
import "fmt"
func main() {
umlautsString := "Rhön"
fmt.Println(len(umlautsString))
subStrRunes:= []rune(umlautsString)
fmt.Println(len(subStrRunes))
fmt.Println(string(subStrRunes[0:4]))
}
http://play.golang.org/p/__WfitzMOJ
Hope that helps!
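If converting every line to []rune feels wasteful for a fixed-width file, one possible alternative is to walk the string once and slice by byte offset. This is only a sketch: runeSlice is a made-up helper, not a standard library function, and it counts code points, so combining characters would still count separately.

```go
package main

import "fmt"

// runeSlice returns the substring of s covering runes [start, end).
// It walks the string once and slices by byte offset, so no []rune
// copy is made. Out-of-range bounds are clamped to the string.
// Hypothetical helper for illustration only.
func runeSlice(s string, start, end int) string {
	if start < 0 {
		start = 0
	}
	if end <= start {
		return ""
	}
	from, to := len(s), len(s)
	n := 0 // index of the current rune
	for i := range s {
		if n == start {
			from = i
		}
		if n == end {
			to = i
			break
		}
		n++
	}
	return s[from:to]
}

func main() {
	s := "Rhön"
	fmt.Println(runeSlice(s, 0, 3)) // Rhö
	fmt.Println(runeSlice(s, 2, 4)) // ön
}
```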
Another option is the utf8string package:
package main
import "golang.org/x/exp/utf8string"
func main() {
s := utf8string.NewString("🧡💛💚💙💜")
// example 1
n := s.RuneCount()
println(n == 5)
// example 2
t := s.Slice(0, 2)
println(t == "🧡💛")
}
https://pkg.go.dev/golang.org/x/exp/utf8string

Resources