golang, £ char causing weird Â character - go

I have a function that generates a random string from a string of valid characters. I'm occasionally getting weird results when it selects a £.
I've reproduced it to the following minimal example:
func foo() string {
validChars := "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789~#:!£$%^&*"
var result strings.Builder
for i := 0; i < len(validChars); i++ {
currChar := validChars[i]
result.WriteString(string(currChar))
}
return result.String()
}
I would expect this to return
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789~#:!£$%^&*
But it doesn't, it produces
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789~#:!Â£$%^&*
                                                                  ^
                                                  where did you come from?
If I take the £ sign out of the original validChars string, that weird Â goes away.
func foo() string {
validChars := "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789~#:!$%^&*"
var result strings.Builder
for i := 0; i < len(validChars); i++ {
currChar := validChars[i]
result.WriteString(string(currChar))
}
return result.String()
}
This produces
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789~#:!$%^&*

A string in Go is an immutable sequence of bytes, not a sequence of characters. Your mental model of a string is probably that it consists of a slice of characters, or, as we call it in Go, a slice of rune.
For most of the runes in your validChars string the two views coincide, because they are ASCII characters and UTF-8 represents each of them in a single byte. The £ rune, however, is encoded as 2 bytes (0xC2 0xA3).
Now consider the string £: it consists of 1 rune but 2 bytes. Indexing a string, as validChars[i] does in your sample, yields a single byte, so you only get the first of the two bytes that represent £. Converting that byte to a string treats its value, 0xC2, as a code point of its own, which is Â; the second byte, 0xA3, then happens to map back to £.
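To make the byte-level behaviour concrete, here is a minimal sketch. The helper name byteAsString is hypothetical and only mirrors the string(validChars[i]) expression from the question.

```go
package main

import "fmt"

// byteAsString mirrors string(validChars[i]) from the question: it takes the
// i-th byte of s, treats that byte value as a code point, and returns the
// UTF-8 encoding of that code point. (Hypothetical helper for illustration.)
func byteAsString(s string, i int) string {
	return string(s[i])
}

func main() {
	pound := "£"
	fmt.Println(len(pound))    // 2: one rune, two bytes
	fmt.Printf("% x\n", pound) // c2 a3

	// Byte 0xC2 becomes code point U+00C2 (Â); byte 0xA3 becomes U+00A3 (£).
	fmt.Println(byteAsString(pound, 0) + byteAsString(pound, 1)) // Â£
}
```

The second byte only looks right by coincidence: 0xA3 is both the trailing byte of the encoding and the code point of £ itself.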
The fix for your problem is to first convert string validChars to a []rune. Then, you can access its individual runes (rather than bytes) by index, and foo will work as expected. You can see it in action in this playground.
Also note that len(validChars) will give you the count of bytes in the string. To get the count of runes, use utf8.RuneCountInString instead.
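Since the playground link is not reproduced here, the following is a sketch of what the fix could look like, not necessarily the exact code behind that link. The function name copyByRune and the shortened validChars value are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// copyByRune rebuilds s one rune at a time by indexing into a []rune,
// so multi-byte characters such as £ are never split.
// (Hypothetical name; it plays the role of foo from the question.)
func copyByRune(s string) string {
	runes := []rune(s)
	var result strings.Builder
	for i := 0; i < len(runes); i++ {
		result.WriteRune(runes[i])
	}
	return result.String()
}

func main() {
	validChars := "abc~#:!£$%^&*" // shortened for the example
	fmt.Println(copyByRune(validChars) == validChars) // true
	fmt.Println(len(validChars))                      // 14 (bytes)
	fmt.Println(utf8.RuneCountInString(validChars))   // 13 (runes)
}
```

For picking a random character, the same idea applies: index into the []rune with a random number below len(runes), not below len(validChars).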
Finally, here's a blog post from Rob Pike on the subject that you may find interesting.

Related

when do we use rune function in golang work? [closed]

I am a beginner in Golang...
I found that rune(char) == "-" has been used to check if a character in a word matches with hyphen instead of checking it as char == "-".
Here is the code:
package main
import (
"fmt"
"unicode"
)
func CodelandUsernameValidation(str string) bool {
// code goes here
if len(str) >= 4 && len(str) <= 25 {
if unicode.IsLetter(rune(str[0])) {
for _,char := range str {
if !unicode.IsLetter(rune(char)) && !unicode.IsDigit(rune(char)) && !(rune(char) == '_') {
return false
}
}
return true
}
}
return false;
}
func main() {
// do not modify below here, readline is our function
// that properly reads in the input for you
var user string
fmt.Println("Enter Username")
fmt.Scan(&user)
fmt.Println(CodelandUsernameValidation(user))
}
Could you please clarify why rune is required here?
The code in the question must convert the byte str[0] to a rune for the call to unicode.IsLetter. Otherwise, the rune conversions are not needed.
The required byte-to-rune conversion hints at a problem: the application is treating a byte as a rune, but bytes are not runes.
Fix by using for range to iterate through the runes in the string. This eliminates conversions from the code:
func CodelandUsernameValidation(str string) bool {
if len(str) < 4 || len(str) > 25 {
return false
}
for i, r := range str {
if i == 0 && !unicode.IsLetter(r) {
// str must start with a letter
return false
} else if !unicode.IsLetter(r) && !unicode.IsDigit(r) && !(r == '_') {
// str is restricted to letters, digit and _.
return false
}
}
return true
}
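To see why treating a byte as a rune matters, here is a small sketch comparing the two approaches on a string that starts with a multi-byte character. The helper names are hypothetical, and utf8.DecodeRuneInString is used here to decode the first rune the same way for range would.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// startsWithLetterByte follows the question's approach: it converts the
// first byte of s to a rune. (Hypothetical helper.)
func startsWithLetterByte(s string) bool {
	return len(s) > 0 && unicode.IsLetter(rune(s[0]))
}

// startsWithLetterRune decodes the first UTF-8 encoded rune of s.
// For an empty string the decoded rune is utf8.RuneError, which is
// not a letter. (Hypothetical helper.)
func startsWithLetterRune(s string) bool {
	r, _ := utf8.DecodeRuneInString(s)
	return unicode.IsLetter(r)
}

func main() {
	s := "⌘abc" // ⌘ is encoded as e2 8c 98
	// The byte 0xE2 read as a code point is â, which is a letter.
	fmt.Println(startsWithLetterByte(s)) // true (wrong)
	fmt.Println(startsWithLetterRune(s)) // false
}
```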
The first thing to know is that rune is nothing but an alias of int32. Single quotes denote a rune literal and double quotes denote a string literal, so instead of rune(char) == "-" it should be rune(char) == '-'.
Comment from the builtin package:
// rune is an alias for int32 and is equivalent to int32 in all ways. It is
// used, by convention, to distinguish character values from integer values.
Second, indexing a string returns individual bytes, not characters, as in unicode.IsLetter(rune(str[0])) here. str[0] is a byte (an alias of uint8), not a character. This fails for some inputs because UTF-8 encodes many characters in more than one byte. For example, the character ⌘ is represented by the bytes [e2 8c 98], which are its UTF-8 encoding. If str starts with ⌘, str[0] returns e2, which is not valid UTF-8 on its own, and rune(str[0]) turns it into a different character (U+00E2, â). To decode the first character properly, do this instead:
strbytes := []byte(str)
firstChar, size := utf8.DecodeRune(strbytes)
A for range loop, by contrast, decodes one UTF-8-encoded rune on each iteration. Each time around the loop, the index of the loop is the starting position of the current rune, measured in bytes, and the code point is its value. So in the example code, for _, char := range str {, the type of char is already rune, and converting a rune to a rune again is redundant.
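A short sketch of that behaviour, using a hypothetical helper runeOffsets that records the index a for range loop reports for each rune:

```go
package main

import "fmt"

// runeOffsets returns the byte offset at which each rune of s starts,
// as reported by the index variable of a for range loop.
// (Hypothetical helper for illustration.)
func runeOffsets(s string) []int {
	var offsets []int
	for i := range s {
		offsets = append(offsets, i)
	}
	return offsets
}

func main() {
	s := "a⌘b" // ⌘ occupies bytes 1 through 3
	for i, r := range s {
		// r already has type rune; no conversion is needed.
		fmt.Printf("byte offset %d: %q\n", i, r)
	}
	fmt.Println(runeOffsets(s)) // [0 1 4]
}
```

Note that the index jumps from 1 to 4, so it cannot be used as a character count.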
If you want to learn more about how strings work in Go, here is a great post by Rob Pike.
You need to convert str to a []rune:
r := []rune(str)
This must be the first line in the function CodelandUsernameValidation.

Does the conversion from string to rune slice make a copy?

I'm teaching myself Go from a C background.
The code below works as I expect (the first two Printf() will access bytes, the last two Printf() will access codepoints).
What I am not clear is if this involves any copying of data.
package main
import "fmt"
var a string
func main() {
a = "èe"
fmt.Printf("%d\n", a[0])
fmt.Printf("%d\n", a[1])
fmt.Println("")
fmt.Printf("%d\n", []rune(a)[0])
fmt.Printf("%d\n", []rune(a)[1])
}
In other words:
does []rune("string") create an array of runes and fill it with the runes corresponding to "string", or it's just the compiler that figures out how to get runes from the string bytes?
It is not possible to turn a string (a sequence of uint8 bytes) into a []rune (i.e. a []int32) without allocating a new array.
Also, strings are immutable in Go but slices are not, so the conversion to both []byte and []rune must copy the string's bytes in some way or another.
It involves a copy because:
strings are immutable; if the conversion []rune(s) didn't make a copy, you would be able to index the rune slice and change the string contents
a string value is a "(possibly empty) sequence of bytes", where byte is an alias of uint8, whereas a rune is "an integer value identifying a Unicode code point" and an alias of int32. The element types differ, and even the lengths may not be the same:
a = "èe"
r := []rune(a)
fmt.Println(len(a)) // 3 (3 bytes)
fmt.Println(len(r)) // 2 (2 Unicode code points)
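One way to observe the copy is to modify the rune slice and check that the string is unaffected. This is a minimal sketch; the function name mutateCopy is hypothetical.

```go
package main

import "fmt"

// mutateCopy converts s to a []rune, overwrites the first rune in the
// slice, and returns both the untouched original and the modified copy.
// (Hypothetical helper for illustration.)
func mutateCopy(s string) (original, modified string) {
	r := []rune(s) // decodes s into a newly allocated []rune
	if len(r) > 0 {
		r[0] = 'X'
	}
	return s, string(r)
}

func main() {
	a := "\u00e8e" // "èe" with a precomposed è (2 bytes) followed by e
	original, modified := mutateCopy(a)
	fmt.Println(original) // èe
	fmt.Println(modified) // Xe
}
```

Converting back with string(r) re-encodes the runes as UTF-8, so the round trip involves a second copy as well.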

String literal in struct is longer than the same string outside struct

I'm running into a really weird issue in my Go code. It seems that identical strings, one declared inside a struct and one outside, have different lengths when used. The following code shows an example:
type evaluateTest struct {
name string
expected int
fen string
}
func TestEvaluate(t *testing.T) {
cases := []evaluateTest{
{"Pawn testing", 330, "​​8/8/8/8/4P3/3P4/2P5/8 w KQkq - 0 11"},
}
for _, test := range cases {
outside := "​8/8/8/8/4P3/3P4/2P5/8 w KQkq - 0 11"
fmt.Printf("String in struct has length %v\n", len(test.fen))
fmt.Printf("String outside struct has length %v\n", len(outside))
}
}
This outputs:
String in struct has length 41
String outside struct has length 38
Looping through the string and printing the character codes gives junk characters in the first three positions (decimal 226, 128, 139) of the string in the struct, and none in the one declared outside.
I'm really at a loss as to what's going on here. Any help is very appreciated.
One string starts with two zero width spaces (\u200b). The other starts with one.
In situations like this, it's helpful to print using %q to see what's going on. See the playground example.
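The playground link is not reproduced here, so the following is only a sketch of the %q technique. The helper name quoted and the shortened strings are assumptions; the zero-width spaces are written as \u200b escapes so they are visible in the source.

```go
package main

import "fmt"

// quoted formats s with the %q verb, which escapes non-printable
// characters such as the zero width space. (Hypothetical helper.)
func quoted(s string) string {
	return fmt.Sprintf("%q", s)
}

func main() {
	inStruct := "\u200b\u200b8/8" // two leading zero width spaces
	outside := "\u200b8/8"        // one leading zero width space

	// Each U+200B is 3 bytes in UTF-8 (e2 80 8b, i.e. decimal 226 128 139).
	fmt.Println(len(inStruct), len(outside)) // 9 6

	fmt.Println(quoted(inStruct)) // "\u200b\u200b8/8"
	fmt.Println(quoted(outside))  // "\u200b8/8"
}
```

The 3-byte difference matches the lengths 41 and 38 reported in the question.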

golang how does the rune() function work

I came across a function posted online that used the rune() function in golang, but I am having a hard time looking up what it is. I am going through the tutorial and inexperienced with the docs so it is hard to find what I am looking for.
Specifically, I am trying to see why this fails...
fmt.Println(rune("foo"))
and this does not
fmt.Println([]rune("foo"))
rune is a type in Go. It's just an alias for int32, but it's usually used to represent Unicode code points. rune() isn't a function, it's the syntax for a type conversion to rune. Conversions in Go always have the form T(x), which can make them look like function calls.
The first bit of code fails because conversion of strings to numeric types isn't defined in Go. However, conversion of strings to slices of runes/int32s is defined like this in the language specification:
Converting a value of a string type to a slice of runes type yields a
slice containing the individual Unicode code points of the string.
[golang.org]
So your example prints a slice of runes with the values 102, 111 and 111.
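A minimal sketch of both points follows; the helper name codePoints is hypothetical. The failing conversion is left as a comment because it does not compile.

```go
package main

import "fmt"

// codePoints converts s to a slice of its Unicode code points.
// (Hypothetical helper wrapping the []rune conversion.)
func codePoints(s string) []rune {
	return []rune(s)
}

func main() {
	fmt.Println(codePoints("foo"))         // [102 111 111]
	fmt.Println(string(codePoints("foo"))) // foo

	// fmt.Println(rune("foo")) // compile error: a string cannot be converted to rune

	// rune and int32 are the same type, so no conversion is needed here.
	var r rune = 'a'
	var i int32 = r
	fmt.Println(i) // 97
}
```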
As stated in @Michael's first-rate comment, fmt.Println([]rune("foo")) is a conversion of a string to a slice of runes, []rune. When you convert from string to []rune, each UTF-8 encoded character in that string becomes a rune. Similarly, in the reverse conversion from []rune to string, each rune becomes a UTF-8 encoded character in the string. See https://stackoverflow.com/a/51611567/12817546. An individual rune from the slice can also be converted to a byte, float64 or int, or compared to produce a bool.
package main
import (
. "fmt"
)
func main() {
r := []rune("foo")
c := []interface{}{byte(r[0]), float64(r[0]), int(r[0]), r, string(r), r[0] != 0}
checkType(c)
}
func checkType(s []interface{}) {
for k := range s {
Printf("%T %v\n", s[k], s[k])
}
}
byte(r[0]) prints "uint8 102", float64(r[0]) prints "float64 102", int(r[0]) prints "int 102", r prints "[]int32 [102 111 111]", string(r) prints "string foo", and r[0] != 0 prints "bool true".
Conversion between []rune and string is supported natively by the spec. See the comment in https://stackoverflow.com/a/46021588/12817546. In Go, a string is a sequence of bytes. However, since one or more bytes together can encode a rune (a code point), a string can also be viewed as a sequence of runes, so it can be converted to a []rune and vice versa. See https://stackoverflow.com/a/19325804/12817546.
Note that byte (alias of uint8) and rune (alias of int32) are built-in type aliases in Go; since Go 1.18, any (alias of interface{}) is a third. See https://Go101.org/article/type-system-overview.html. Rune literals are just 32-bit integer values. For example, the rune literal 'a' is actually the number 97. See https://stackoverflow.com/a/19311218/12817546.

Umlauts and slices

I'm having some trouble while reading a file which has a fixed column length format. Some columns may contain umlauts.
Umlauts seem to use 2 bytes instead of one. This is not the behaviour I was expecting. Is there any kind of function which returns a substring? Slice does not seem to work in this case.
Here's some sample code:
http://play.golang.org/p/ZJ1axy7UXe
umlautsString := "Rhön"
fmt.Println(len(umlautsString))
fmt.Println(umlautsString[0:4])
Prints:
5
Rhö
In Go, slicing a string counts bytes, not runes. This is why "Rhön"[0:3] gives you Rh and the first byte of ö.
Go represents Unicode characters (code points) as runes, because UTF-8 may encode a single character in more than one byte (up to four bytes) in order to cover a larger range of characters.
If you want to slice a string by character with the [] syntax, convert the string to []rune first.
Example (on play):
umlautsString := "Rhön"
runes := []rune(umlautsString)
fmt.Println(string(runes[0:3])) // Rhö
Noteworthy: This golang blog post about string representation in go.
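For a fixed-column file, the conversion can be wrapped in a small helper. This is only a sketch; the name runeSubstring and the choice to clamp out-of-range bounds are assumptions, not part of any standard package.

```go
package main

import "fmt"

// runeSubstring returns the runes of s in the half-open range [start, end),
// counted in runes rather than bytes. Out-of-range bounds are clamped.
// (Hypothetical helper for illustration.)
func runeSubstring(s string, start, end int) string {
	runes := []rune(s)
	if start < 0 {
		start = 0
	}
	if end > len(runes) {
		end = len(runes)
	}
	if start >= end {
		return ""
	}
	return string(runes[start:end])
}

func main() {
	fmt.Println(runeSubstring("Rhön", 0, 3))  // Rhö
	fmt.Println(runeSubstring("Rhön", 2, 4))  // ön
	fmt.Println(runeSubstring("Rhön", 0, 10)) // Rhön
}
```

This counts code points, so an umlaut written as a base letter followed by a combining diaeresis (U+0308) would still occupy two positions.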
You can convert string to []rune and work with it:
package main
import "fmt"
func main() {
umlautsString := "Rhön"
fmt.Println(len(umlautsString))
subStrRunes := []rune(umlautsString)
fmt.Println(len(subStrRunes))
fmt.Println(string(subStrRunes[0:4]))
}
http://play.golang.org/p/__WfitzMOJ
Hope that helps!
Another option is the utf8string package:
package main
import "golang.org/x/exp/utf8string"
func main() {
s := utf8string.NewString("🧡💛💚💙💜")
// example 1
n := s.RuneCount()
println(n == 5)
// example 2
t := s.Slice(0, 2)
println(t == "🧡💛")
}
https://pkg.go.dev/golang.org/x/exp/utf8string

Resources