Character index in line with UTF-8 files - go

I'm writing a lexical analyzer for UTF-8 text. When an error is detected, I'm supposed to report the line number and the index position within the line.
The user is expected to locate the position by counting the characters they see on the screen (or on paper) until they reach the given index. They could also use the column index of the cursor shown by some editors.
I suppose I can't simply use the rune count as the index, because some Unicode characters have zero width: they are hidden markers, or they combine with a preceding non-zero-width character.
How am I supposed to deal with this?
Is there a function that gives the visual index, given a byte slice containing UTF-8 encoded runes?
Also, does the line index in a file start at 0 or at 1?

I couldn't find anything in the standard library, but this seems to do it:
package main
import "github.com/rivo/uniseg"
func index(s, substr string) int {
g := uniseg.NewGraphemes(s)
for n := 0; g.Next(); n++ {
if g.Str() == substr { return n }
}
return -1
}
func main() {
n := index("Z a̎ B", "B")
println(n == 4)
}
https://pkg.go.dev/github.com/rivo/uniseg
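As a rough illustration of the position question, here is a minimal standard-library-only sketch. The function name position and the choice of 1-based lines and columns are assumptions made for this example (1-based numbering matches what most editors and compilers display). It only approximates visual columns: it skips nonspacing marks, enclosing marks and format characters, and it does not implement full grapheme cluster segmentation the way uniseg does.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// position converts a byte offset in src into a 1-based line number and a
// 1-based approximate visual column. Combining marks (Mn, Me) and format
// characters (Cf) do not advance the column. This is a hypothetical helper,
// not a full UAX #29 implementation.
func position(src []byte, offset int) (line, col int) {
	line, col = 1, 1
	for i := 0; i < offset && i < len(src); {
		r, size := utf8.DecodeRune(src[i:])
		switch {
		case r == '\n':
			line++
			col = 1
		case unicode.In(r, unicode.Mn, unicode.Me, unicode.Cf):
			// Zero-width rune: it shares the column of the previous rune.
		default:
			col++
		}
		i += size
	}
	return line, col
}

func main() {
	src := []byte("ok\nZ a\u030e B")
	line, col := position(src, len(src)-1) // byte offset of 'B'
	fmt.Println(line, col)                 // prints 2 5
}
```

Here 'B' is reported at column 5 of line 2, which is what a user counting visible characters from 1 would find.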

Related

when do we use rune function in golang work? [closed]

I am a beginner in Golang...
I found that rune(char) == "-" has been used to check if a character in a word matches with hyphen instead of checking it as char == "-".
Here is the code:
package main
import (
"fmt"
"unicode"
)
func CodelandUsernameValidation(str string) bool {
// code goes here
if len(str) >= 4 && len(str) <= 25 {
if unicode.IsLetter(rune(str[0])) {
for _,char := range str {
if !unicode.IsLetter(rune(char)) && !unicode.IsDigit(rune(char)) && !(rune(char) == '_') {
return false
}
}
return true
}
}
return false;
}
func main() {
// do not modify below here, readline is our function
// that properly reads in the input for you
var user string
fmt.Println("Enter Username")
fmt.Scan(&user)
fmt.Println(CodelandUsernameValidation(user))
}
Could you please clarify why rune is required here?
The code in the question must convert the byte str[0] to a rune for the call to unicode.IsLetter. Otherwise, the rune conversions are not needed.
The required byte to rune conversion hints a problem: The application is treating a byte as a rune, but bytes are not runes.
Fix by using for range to iterate through the runes in the string. This eliminates conversions from the code:
func CodelandUsernameValidation(str string) bool {
if len(str) < 4 || len(str) > 25 {
return false
}
for i, r := range str {
if i == 0 && !unicode.IsLetter(r) {
// str must start with a letter
return false
} else if !unicode.IsLetter(r) && !unicode.IsDigit(r) && !(r == '_') {
// str is restricted to letters, digit and _.
return false
}
}
return true
}
The first thing to know is that rune is an alias for int32. Single quotes denote a rune literal and double quotes denote a string, so instead of rune(char) == "-" it should be rune(char) == '-'.
Comment from the builtin package:
// rune is an alias for int32 and is equivalent to int32 in all ways. It is
// used, by convention, to distinguish character values from integer values.
Second, indexing a string returns individual bytes, not characters, as in unicode.IsLetter(rune(str[0])). str[0] is a byte (an alias for uint8), not a character. This fails whenever the first character takes more than one byte in UTF-8. For example, ⌘ is encoded as the bytes [e2 8c 98]; str[0] returns only e2, which is not a complete UTF-8 sequence, and rune(0xe2) is a different character (â). To decode the first character, do this instead:
strbytes := []byte(str)
firstChar, size := utf8.DecodeRune(strbytes)
A for range loop, by contrast, decodes one UTF-8-encoded rune on each iteration. Each time around the loop, the index is the starting position of the current rune, measured in bytes, and the value is its code point. So in for _, char := range str, char already has type rune, and converting it again with rune(char) is redundant.
If you want to learn more about how strings work in Go, see the blog post "Strings, bytes, runes and characters in Go" by Rob Pike.
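The difference between indexing and ranging can be seen in a small sketch. The helper name runeStarts is made up for this example; everything else is from the standard library.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeStarts returns the byte offset at which each rune of s begins.
// A for range loop over a string yields exactly these offsets.
func runeStarts(s string) []int {
	var starts []int
	for i := range s {
		starts = append(starts, i)
	}
	return starts
}

func main() {
	s := "⌘x"
	fmt.Printf("%x\n", s[0]) // e2: only the first byte of ⌘

	r, size := utf8.DecodeRuneInString(s)
	fmt.Printf("%c %d\n", r, size) // ⌘ 3

	fmt.Println(runeStarts(s)) // [0 3]
}
```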
You need to convert str to []rune:
r := []rune(str)
This must be the first line in the function CodelandUsernameValidation.

How do I count emojis in a string in go? [duplicate]

How can I get the number of characters of a string in Go?
For example, if I have a string "hello" the method should return 5. I saw that len(str) returns the number of bytes and not the number of characters so len("£") returns 2 instead of 1 because £ is encoded with two bytes in UTF-8.
You can try RuneCountInString from the utf8 package, which
returns the number of runes in p
As illustrated in this script, the byte length of "World" might be 6 (when written in Chinese: "世界"), but the rune count of "世界" is 2:
package main
import "fmt"
import "unicode/utf8"
func main() {
fmt.Println("Hello, 世界", len("世界"), utf8.RuneCountInString("世界"))
}
Phrozen adds in the comments:
Actually you can do len() over runes by just type casting.
len([]rune("世界")) will print 2. At least in Go 1.3.
And with CL 108985 (May 2018, for Go 1.11), len([]rune(string)) is now optimized. (Fixes issue 24923)
The compiler detects len([]rune(string)) pattern automatically, and replaces it with for r := range s call.
Adds a new runtime function to count runes in a string.
Modifies the compiler to detect the pattern len([]rune(string))
and replaces it with the new rune counting runtime function.
name                                old time/op  new time/op  delta
RuneCount/lenruneslice/ASCII        27.8ns ± 2%  14.5ns ± 3%  -47.70%
RuneCount/lenruneslice/Japanese      126ns ± 2%    60ns ± 2%  -52.03%
RuneCount/lenruneslice/MixedLength   104ns ± 2%    50ns ± 1%  -51.71%
Stefan Steiger points to the blog post "Text normalization in Go"
What is a character?
As was mentioned in the strings blog post, characters can span multiple runes.
For example, an 'e' and '◌́' (acute "\u0301") can combine to form 'é' ("e\u0301" in NFD). Together these two runes are one character.
The definition of a character may vary depending on the application.
For normalization we will define it as:
a sequence of runes that starts with a starter,
a rune that does not modify or combine backwards with any other rune,
followed by possibly empty sequence of non-starters, that is, runes that do (typically accents).
The normalization algorithm processes one character at a time.
Using that package and its Iter type, the actual number of "characters" would be:
package main
import "fmt"
import "golang.org/x/text/unicode/norm"
func main() {
var ia norm.Iter
ia.InitString(norm.NFKD, "école")
nc := 0
for !ia.Done() {
nc = nc + 1
ia.Next()
}
fmt.Printf("Number of chars: %d\n", nc)
}
Here, this uses the Unicode Normalization form NFKD "Compatibility Decomposition"
Oliver's answer points to UNICODE TEXT SEGMENTATION as the only way to reliably determine default boundaries between certain significant text elements: user-perceived characters, words, and sentences.
For that, you need an external library like rivo/uniseg, which does Unicode Text Segmentation.
That will actually count grapheme clusters, where multiple code points may be combined into one user-perceived character.
package main
import (
"fmt"
"github.com/rivo/uniseg"
)
func main() {
gr := uniseg.NewGraphemes("👍🏼!")
for gr.Next() {
fmt.Printf("%x ", gr.Runes())
}
// Output: [1f44d 1f3fc] [21]
}
Two graphemes, even though there are three runes (Unicode code points).
You can see other examples in "How to manipulate strings in GO to reverse them?"
👩🏾‍🦰 alone is one grapheme but, according to a Unicode-to-code-points converter, 4 runes:
👩: woman (1f469)
dark skin (1f3fe)
ZERO WIDTH JOINER (200d)
🦰red hair (1f9b0)
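The same breakdown can be reproduced with the standard library alone. This is a small sketch; codePoints is a name chosen for the example, and the emoji is written with escapes so the sequence is unambiguous.

```go
package main

import "fmt"

// codePoints returns the hexadecimal value of every rune in s.
func codePoints(s string) []string {
	var out []string
	for _, r := range s {
		out = append(out, fmt.Sprintf("%x", r))
	}
	return out
}

func main() {
	// Woman, skin tone modifier, ZERO WIDTH JOINER, red hair.
	s := "\U0001F469\U0001F3FE\u200D\U0001F9B0"
	fmt.Println(len(s), codePoints(s)) // 15 [1f469 1f3fe 200d 1f9b0]
}
```

One user-perceived character, four runes, fifteen bytes: three different answers to "how long is this string".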
There is a way to get count of runes without any packages by converting string to []rune as len([]rune(YOUR_STRING)):
package main
import "fmt"
func main() {
russian := "Спутник и погром"
english := "Sputnik & pogrom"
fmt.Println("count of bytes:",
len(russian),
len(english))
fmt.Println("count of runes:",
len([]rune(russian)),
len([]rune(english)))
}
count of bytes: 30 16
count of runes: 16 16
I should point out that none of the answers provided so far give you the number of characters as you would expect, especially when you're dealing with emojis (but also some languages like Thai, Korean, or Arabic). VonC's suggestions will output the following:
fmt.Println(utf8.RuneCountInString("🏳️‍🌈🇩🇪")) // Outputs "6".
fmt.Println(len([]rune("🏳️‍🌈🇩🇪"))) // Outputs "6".
That's because these methods only count Unicode code points. There are many characters which can be composed of multiple code points.
Same for using the Normalization package:
var ia norm.Iter
ia.InitString(norm.NFKD, "🏳️‍🌈🇩🇪")
nc := 0
for !ia.Done() {
nc = nc + 1
ia.Next()
}
fmt.Println(nc) // Outputs "6".
Normalization is not really the same as counting characters and many characters cannot be normalized into a one-code-point equivalent.
masakielastic's answer comes close but only handles modifiers (the rainbow flag contains a modifier which is thus not counted as its own code point):
fmt.Println(GraphemeCountInString("🏳️‍🌈🇩🇪")) // Outputs "5".
fmt.Println(GraphemeCountInString2("🏳️‍🌈🇩🇪")) // Outputs "5".
The correct way to split Unicode strings into (user-perceived) characters, i.e. grapheme clusters, is defined in the Unicode Standard Annex #29. The rules can be found in Section 3.1.1. The github.com/rivo/uniseg package implements these rules so you can determine the correct number of characters in a string:
fmt.Println(uniseg.GraphemeClusterCount("🏳️‍🌈🇩🇪")) // Outputs "2".
If you need to take grapheme clusters into account, use the regexp or unicode package. Counting code points (runes) or bytes is also needed for validation, since the length of a grapheme cluster is unbounded. If you want to reject extremely long sequences, check whether they conform to the stream-safe text format.
package main
import (
"regexp"
"unicode"
"strings"
)
func main() {
str := "\u0308" + "a\u0308" + "o\u0308" + "u\u0308"
str2 := "a" + strings.Repeat("\u0308", 1000)
println(4 == GraphemeCountInString(str))
println(4 == GraphemeCountInString2(str))
println(1 == GraphemeCountInString(str2))
println(1 == GraphemeCountInString2(str2))
println(true == IsStreamSafeString(str))
println(false == IsStreamSafeString(str2))
}
func GraphemeCountInString(str string) int {
re := regexp.MustCompile("\\PM\\pM*|.")
return len(re.FindAllString(str, -1))
}
func GraphemeCountInString2(str string) int {
length := 0
checked := false
index := 0
for _, c := range str {
if !unicode.Is(unicode.M, c) {
length++
if checked == false {
checked = true
}
} else if checked == false {
length++
}
index++
}
return length
}
func IsStreamSafeString(str string) bool {
re := regexp.MustCompile("\\PM\\pM{30,}")
return !re.MatchString(str)
}
There are several ways to get a string length:
package main
import (
"bytes"
"fmt"
"strings"
"unicode/utf8"
)
func main() {
b := "这是个测试"
len1 := len([]rune(b))
len2 := bytes.Count([]byte(b), nil) -1
len3 := strings.Count(b, "") - 1
len4 := utf8.RuneCountInString(b)
fmt.Println(len1)
fmt.Println(len2)
fmt.Println(len3)
fmt.Println(len4)
}
It depends a lot on your definition of what a "character" is. If "a rune equals a character" is OK for your task (generally it isn't), then the answer by VonC is perfect for you. Otherwise, it should probably be noted that there are few situations where the number of runes in a Unicode string is an interesting value. And even in those situations it's better, if possible, to infer the count while traversing the string as the runes are processed, to avoid doubling the UTF-8 decoding effort.
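A minimal sketch of counting during traversal, assuming the real work is some per-rune classification (here, counting letters). The function name scan is hypothetical; the point is only that the string is decoded once.

```go
package main

import (
	"fmt"
	"unicode"
)

// scan walks s once, counting runes while doing its per-rune work
// (counting letters stands in for whatever processing is needed).
func scan(s string) (runes, letters int) {
	for _, r := range s {
		runes++
		if unicode.IsLetter(r) {
			letters++
		}
	}
	return runes, letters
}

func main() {
	fmt.Println(scan("h\u00e9llo, 世界")) // 9 7
}
```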
I tried a faster alternative to the normalization approach; note that it simply counts runes:
en, _ = glyphSmart(data)
func glyphSmart(text string) (int, int) {
gc := 0
dummy := 0
for ind, _ := range text {
gc++
dummy = ind
}
dummy = 0
return gc, dummy
}

how to dynamically pass values to Sprintf or Printf

If I want to pad a string I could use something like this:
https://play.golang.org/p/ATeUhSP18N
package main
import (
"fmt"
)
func main() {
x := fmt.Sprintf("%+20s", "Hello World!")
fmt.Println(x)
}
From https://golang.org/pkg/fmt/
+ always print a sign for numeric values;
guarantee ASCII-only output for %q (%+q)
- pad with spaces on the right rather than the left (left-justify the field)
But if I would like to change the pad size dynamically, how could I pass the value?
My first guess was:
x := fmt.Sprintf("%+%ds", 20, "Hello World!")
But I get this:
%ds%!(EXTRA int=20, string=Hello World!)
Is there a way of doing this without writing a custom pad function that adds spaces on the left or right, probably using a for loop:
for i := 0; i < n; i++ {
out += str
}
Use * to tell Sprintf to get a formatting parameter from the argument list:
fmt.Printf("%*s\n", 20, "Hello World!")
Full code on play.golang.org
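For padding on either side, the * width can be combined with the - flag. The sketch below wraps both cases in a helper; the name pad and its parameters are made up for this example.

```go
package main

import "fmt"

// pad returns s padded with spaces to the given width. The width is taken
// from the argument list via '*'; the '-' flag left-justifies the field.
func pad(s string, width int, left bool) string {
	if left {
		return fmt.Sprintf("%-*s", width, s)
	}
	return fmt.Sprintf("%*s", width, s)
}

func main() {
	fmt.Printf("[%s]\n", pad("go", 6, false)) // [    go]
	fmt.Printf("[%s]\n", pad("go", 6, true))  // [go    ]
}
```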
Go to: https://golang.org/pkg/fmt/
and scroll down until you find this:
fmt.Sprintf("%[3]*.[2]*[1]f", 12.0, 2, 6)
equivalent to
fmt.Sprintf("%6.2f", 12.0)
will yield " 12.00". Because an explicit index affects subsequent verbs,
this notation can be used to print the same values multiple times by
resetting the index for the first argument to be repeated
This sounds like what you want.
The real core of the description of using arguments to set field width and precision occurs further above:
Width and precision are measured in units of Unicode code points, that
is, runes. (This differs from C's printf where the units are always
measured in bytes.) Either or both of the flags may be replaced with
the character '*', causing their values to be obtained from the next
operand, which must be of type int.
The example above is just using explicit indexing into the argument list in addition, which is sometimes nice to have and allows you to reuse the same width and precision values for more conversions.
So you could also write:
fmt.Sprintf("%*.*f", 6, 2, 12.0)
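As a sketch of reusing one width for several values through explicit argument indexes, consider the following. The function name row is hypothetical.

```go
package main

import "fmt"

// row formats a and b in two columns that share one width argument.
// [1]* reads the width from the first argument each time, while
// [2] and [3] select the values to print.
func row(width, a, b int) string {
	return fmt.Sprintf("%[1]*[2]d|%[1]*[3]d", width, a, b)
}

func main() {
	fmt.Printf("%q\n", row(4, 7, 42)) // "   7|  42"
}
```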

Golang - ToUpper() on a single byte?

I have a []byte, b, and I want to select a single byte, b[pos], and change it to upper case (and then lower case). The bytes package has a function called ToUpper(). How can I use this for a single byte?
Calling ToUpper on single Byte
OneOfOne gave the most efficient solution (when calling it thousands of times). I use
val = byte(unicode.ToUpper(rune(b[pos])))
in order to find the byte and change the value
b[pos] = val
Checking if byte is Upper
Sometimes, instead of changing the case of a byte, I want to check if a byte is upper or lower case; All the upper case roman-alphabet bytes are lower than the value of the lower case bytes.
func (b Board) isUpper(x int) bool {
return b.board[x] < []byte{0x5a}[0]
}
For a single byte/rune, you can use unicode.ToUpper.
b[pos] = byte(unicode.ToUpper(rune(b[pos])))
I want to remind the OP that bytes.ToUpper() operates on Unicode code points encoded as UTF-8 in a byte slice, while unicode.ToUpper() operates on a single Unicode code point.
By asking to convert a single byte to upper case, the OP implies that the byte slice b contains something other than UTF-8, perhaps 7-bit ASCII or an 8-bit encoding such as ISO Latin-1. In that case the OP needs to write a ToUpper() function for that encoding, or convert the bytes to UTF-8 or Unicode code points before using bytes.ToUpper() or unicode.ToUpper().
Anything less creates a pending bug. Neither of these functions will properly convert all possible ISO Latin-1 encoded characters to upper case.
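To make the pending bug concrete, here is a small sketch. Both helper names are made up for the example: upperASCII only touches ASCII letters, while naiveUpper reproduces the byte(unicode.ToUpper(rune(...))) pattern and shows what happens when the uppercase code point does not fit in a byte.

```go
package main

import (
	"fmt"
	"unicode"
)

// upperASCII upper-cases c only if it is an ASCII lowercase letter.
func upperASCII(c byte) byte {
	if 'a' <= c && c <= 'z' {
		return c - ('a' - 'A')
	}
	return c
}

// naiveUpper treats the byte as a code point and truncates the result.
func naiveUpper(c byte) byte {
	return byte(unicode.ToUpper(rune(c)))
}

func main() {
	fmt.Printf("%c %c\n", upperASCII('o'), naiveUpper('o')) // O O

	// 0xff is ÿ in ISO Latin-1. Its uppercase form is U+0178, which does
	// not fit in a byte, so the conversion truncates it to 0x78 ('x').
	fmt.Printf("%#x %#x\n", upperASCII(0xff), naiveUpper(0xff)) // 0xff 0x78
}
```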
Use the following code to test if an element of the board is an ASCII uppercase letter:
func (b Board) isUpper(x int) bool {
v := b.board[x]
return 'A' <= v && v <= 'Z'
}
If the application only needs to distinguish between upper and lowercase letters, then there's no need for the lower bound test:
func (b Board) isUpper(x int) bool {
return b.board[x] <= 'Z'
}
The code in this answer improves on the code in the question in a few ways:
The code in the answer returns the correct value for a board element containing 'Z' (run playground example below for demonstration).
'Z' and 0x5a are the same value, but the code is easier to understand with 'Z'.
It's simpler to compare directly with the value 'Z'. No need to create a slice.
playground example
Edit: Revamped answer based on new information in the question since time of my original answer.
You can use bytes.ToUpper, you just need to deal with making the input a slice,
and making the output a byte:
package main
import "bytes"
func main() {
b, pos := []byte("north"), 1
b[pos] = bytes.ToUpper(b)[pos]
println(string(b) == "nOrth")
}

Tour of Go Exercise #23: my word counter doesn't work

I'm trying to solve the puzzle from Go tour #23 and I don't understand why my word counter doesn't work. print seems to print the expected value, but the test sees only 1 regardless of the count.
package main
import (
"strings"
"unicode/utf8"
"golang.org/x/tour/wc"
)
func WordCount(s string) map[string]int {
// explode the string into a slice without whitespaces
ws := strings.Fields(s)
//make a new map
c := make(map[string]int)
//iterate over each word
for _, v := range ws {
c[v] = utf8.RuneCountInString(v)
}
print(c["am"])
return c
}
func main() {
wc.Test(WordCount)
}
The playground is available here
You're solving the wrong problem. It doesn't ask you for the length of each word, but for the number of times each word occurs. Change
c[v] = utf8.RuneCountInString(v)
to
c[v] += 1 // or c[v]++
The problem is c[v] = utf8.RuneCountInString(v). It has two problems:
You're resetting the counter for each word every time you re-encounter it. You should increment, not set.
You are storing the number of runes in the word as the counter. The puzzle is "how many times a word appears in the text", so just do something like c[v] = c[v] + 1 (a missing entry defaults to 0).
Also, I'd normalize the text - strip punctuation marks and lowercase everything.
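A possible sketch of that normalization, using only the standard library. The name NormalizedWordCount is made up for the example, and it is an assumption that this behavior is wanted: the Tour's own wc.Test may expect words to be counted exactly as they appear, so this variant is for illustration rather than for passing the exercise.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// NormalizedWordCount counts how often each word occurs, after trimming
// leading and trailing punctuation and lowercasing the word.
func NormalizedWordCount(s string) map[string]int {
	counts := make(map[string]int)
	for _, w := range strings.Fields(s) {
		w = strings.ToLower(strings.TrimFunc(w, unicode.IsPunct))
		if w != "" {
			counts[w]++ // a missing key starts at 0
		}
	}
	return counts
}

func main() {
	c := NormalizedWordCount("I am. Am I?")
	fmt.Println(c["am"], c["i"]) // 2 2
}
```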
