Problem with decoding UTF-8 characters - šđžčć - Go

I have a word that contains some of these characters: šđžčć. When I take the first letter out of that word I get a byte, and when I convert that byte into a string I get an incorrectly decoded character.
Can someone help me figure out how to properly decode the extracted letter?
This is example code:
package main

import (
    "fmt"
)

func main() {
    word := "ŠKOLA"
    c := word[0]
    fmt.Println(word, string(c)) // ŠKOLA Å
}
https://play.golang.org/p/6T2FX4vN3-U

Š is encoded as more than one byte in UTF-8. One way to index runes is to convert the string to []rune:
c := []rune(word)[0]
https://play.golang.org/p/NBUopxe-ik1
You can also use the functions provided in the unicode/utf8 package, like utf8.DecodeRune and utf8.DecodeRuneInString, to iterate over the individual code points in the UTF-8 string:
r, _ := utf8.DecodeRuneInString(word)
fmt.Println(word, string(r))
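
For completeness, a minimal runnable sketch combining both approaches from this answer (the ŠKOLA example is the one from the question):

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    word := "ŠKOLA"

    // Index by rune: the conversion decodes the whole string up front.
    first := []rune(word)[0]
    fmt.Println(string(first)) // Š

    // Decode only the first rune without converting the whole string.
    r, size := utf8.DecodeRuneInString(word)
    fmt.Println(string(r), size) // Š 2
}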

Related

golang, £ char causing weird Â character

I have a function that generates a random string from a string of valid characters. I'm occasionally getting weird results when it selects a £.
I've reproduced it with the following minimal example:
func foo() string {
    validChars := "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789~#:!£$%^&*"
    var result strings.Builder
    for i := 0; i < len(validChars); i++ {
        currChar := validChars[i]
        result.WriteString(string(currChar))
    }
    return result.String()
}
I would expect this to return
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789~#:!£$%^&*
But it doesn't; it produces
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789~#:!Â£$%^&*
(note the stray Â before the £ - where did you come from?)
If I take the £ sign out of the original validChars string, that weird Â goes away.
func foo() string {
    validChars := "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789~#:!$%^&*"
    var result strings.Builder
    for i := 0; i < len(validChars); i++ {
        currChar := validChars[i]
        result.WriteString(string(currChar))
    }
    return result.String()
}
This produces
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789~#:!$%^&*
A Go string is, under the hood, a read-only sequence of bytes. Your mental model of a string is probably that it consists of a slice of characters - or, as we call them in Go, a slice of rune.
For many runes in your validChars string this is fine, as they are part of the ASCII chars and can therefore be represented in a single byte in UTF-8. However, the £ rune is represented as 2 bytes.
Now consider the string "£": it consists of 1 rune but 2 bytes. Since indexing a string yields bytes, grabbing the first element as you are effectively doing in your sample only gets the first of the two bytes that represent £. When you convert that byte back to a string, it gives you an unexpected rune.
The fix for your problem is to first convert string validChars to a []rune. Then, you can access its individual runes (rather than bytes) by index, and foo will work as expected. You can see it in action in this playground.
Also note that len(validChars) will give you the count of bytes in the string. To get the count of runes, use utf8.RuneCountInString instead.
Finally, here's a blog post from Rob Pike on the subject that you may find interesting.
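
The playground link isn't reproduced here, but a minimal sketch of that fix, assuming foo should just copy every valid character as in the minimal example above, might look like this:

package main

import (
    "fmt"
    "strings"
)

func foo() string {
    // Converting to []rune up front means indexing and len() count runes, not bytes.
    validChars := []rune("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789~#:!£$%^&*")
    var result strings.Builder
    for i := 0; i < len(validChars); i++ {
        result.WriteRune(validChars[i]) // WriteRune appends the rune's UTF-8 encoding
    }
    return result.String()
}

func main() {
    fmt.Println(foo()) // the £ now comes through intact
}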

Golang unicode character value

I run this code and get the output below, but why is the bytes value E4B8AD while the int value is 20013? Why is column 2 not equal to column 5?
package main

import (
    "fmt"
)

func main() {
    str2 := "中文"
    fmt.Println("index int(rune) rune char bytes")
    for index, rune := range str2 {
        fmt.Printf("%-2d %d %U '%c' %X\n", index, rune, rune, rune, []byte(string(rune)))
    }
}
the output is :
index int(rune) rune char bytes
0  20013 U+4E2D '中' E4B8AD
3  25991 U+6587 '文' E69687
A Unicode code point for a character is not necessarily the same as the byte representation of that character in a given character encoding.
For the character 中, the code point is U+4E2D, but the byte representations in various character encodings are:
E4B8AD (UTF-8)
4E2D (UTF-16)
00004E2D (UTF-32)
There's a really good answer here that explains how to convert between code points and byte representations. There's also the excellent The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky.
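
To see those representations from Go itself, here is a small illustrative sketch (not from the linked answer; the UTF-16 units come from the standard unicode/utf16 package and the UTF-32 value is just the code point widened to uint32):

package main

import (
    "fmt"
    "unicode/utf16"
)

func main() {
    r := '中' // code point U+4E2D

    fmt.Printf("code point:   %U (%d)\n", r, r)               // U+4E2D (20013)
    fmt.Printf("UTF-8 bytes:  % X\n", []byte(string(r)))      // E4 B8 AD
    fmt.Printf("UTF-16 units: %X\n", utf16.Encode([]rune{r})) // [4E2D]
    fmt.Printf("UTF-32 value: %08X\n", uint32(r))             // 00004E2D
}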

Why does fmt.Printf("%x", 'ᚵ') ~> 16b5, while fmt.Printf("%x", "ᚵ") ~> e19ab5?

package main

import (
    "fmt"
)

func main() {
    fmt.Printf("%c, %x, %x", 'ᚵ', 'ᚵ', "ᚵ")
}
Outputs:
ᚵ, 16b5, e19ab5
https://play.golang.org/p/_Bs7JcdOfO
Because each does a different thing. Both format the argument as a hexadecimal number, but each views the argument differently.
fmt.Printf("%x", 'ᚵ') prints a single unicode character (a rune, if you will), as a 32 bit integer (int32).
fmt.Printf("%x", "ᚵ") prints a string (individual bytes of a string) as 8 bit integers (uint8). The rune is encoded on three bytes when utf-8 encoding is used. That is a reason why there is six hexadecimal digits (two for each byte).
To study printing of a string in detail, start at function fmtString in file fmt/print.go.
func (p *pp) fmtString(v string, verb rune) {
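
To make the difference concrete, a short sketch (my own, not from the original answer) that prints the same character both as a rune and as a string:

package main

import "fmt"

func main() {
    r := 'ᚵ' // rune (int32) holding the code point U+16B5
    s := "ᚵ" // string holding the UTF-8 encoding of that code point

    fmt.Printf("%x\n", r)          // 16b5   - one integer in hex
    fmt.Printf("%x\n", s)          // e19ab5 - one hex pair per byte
    fmt.Printf("% x\n", []byte(s)) // e1 9a b5 - the same bytes, spaced out
}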

How to convert ascii code to byte in golang?

As the title says, I can find functions that give me the ASCII code of a byte, but not the other way around.
Go string literals are UTF-8, and since ASCII is a subset of UTF-8 and each of its characters fits in 7 bits, we can easily get them as bytes by converting (e.g. bytes := []byte(str)):
package main

import "fmt"

func main() {
    asciiStr := "ABC"
    asciiBytes := []byte(asciiStr)
    fmt.Printf("OK: string=%v, bytes=%v\n", asciiStr, asciiBytes)
    fmt.Printf("OK: byte(A)=%v\n", asciiBytes[0])
}

// OK: string=ABC, bytes=[65 66 67]
// OK: byte(A)=65
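
Going the other way, from a numeric ASCII code back to a character, is just the reverse conversion; a small sketch:

package main

import "fmt"

func main() {
    var code byte = 65

    // A single code back to a one-character string.
    fmt.Println(string(rune(code))) // A

    // A whole slice of codes back to a string.
    codes := []byte{72, 101, 108, 108, 111}
    fmt.Println(string(codes)) // Hello
}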

Umlauts and slices

I'm having some trouble reading a file that has a fixed-width column format. Some columns may contain umlauts.
Umlauts seem to use 2 bytes instead of one, which is not the behaviour I was expecting. Is there any kind of function that returns a substring? Slicing does not seem to work in this case.
Here's some sample code:
http://play.golang.org/p/ZJ1axy7UXe
umlautsString := "Rhön"
fmt.Println(len(umlautsString))
fmt.Println(umlautsString[0:4])
Prints:
5
Rhö
In Go, slicing a string counts bytes, not runes. This is why "Rhön"[0:3] gives you Rh plus the first byte of ö.
UTF-8 encodes characters in more than one byte (up to four) to provide a bigger range of characters, which is why Go represents them as runes rather than single bytes.
If you want to slice a string with the [] syntax, convert the string to []rune before.
Example (on play):
umlautsString := "Rhön"
runes := []rune(umlautsString)
fmt.Println(string(runes[0:3])) // Rhö
Noteworthy: this Go blog post about string representation in Go.
You can convert string to []rune and work with it:
package main

import "fmt"

func main() {
    umlautsString := "Rhön"
    fmt.Println(len(umlautsString))

    subStrRunes := []rune(umlautsString)
    fmt.Println(len(subStrRunes))
    fmt.Println(string(subStrRunes[0:4]))
}
http://play.golang.org/p/__WfitzMOJ
Hope that helps!
Another option is the utf8string package:
package main

import "golang.org/x/exp/utf8string"

func main() {
    s := utf8string.NewString("🧡💛💚💙💜")

    // example 1
    n := s.RuneCount()
    println(n == 5)

    // example 2
    t := s.Slice(0, 2)
    println(t == "🧡💛")
}
https://pkg.go.dev/golang.org/x/exp/utf8string

Resources