In golang (go1.17 windows/amd64) the program below gives the following result:
rune1 = U+0130 'İ'
rune2 = U+0131 'ı'
lower(rune1) = U+0069 'i'
upper(rune2) = U+0049 'I'
strings.EqualFold(İ, ı) = false
strings.EqualFold(i, I) = true
I thought that strings.EqualFold would check strings for equality under Unicode case folding; however, the above example seem to give a counter-example. Clearly both runes can be folded (by hand) into code points that are equal under case folding.
Question: is golang correct that strings.EqualFold(İ, ı) is false? I expected it to yield true. And if golang is correct, why would that be? Or is this behaviour according to some Unicode specification.
What am I missing here.
Program:
func TestRune2(t *testing.T) {
r1 := rune(0x0130) // U+0130 'İ'
r2 := rune(0x0131) // U+0131 'ı'
r1u := unicode.ToLower(r1)
r2u := unicode.ToUpper(r2)
t.Logf("\nrune1 = %#U\nrune2 = %#U\nlower(rune1) = %#U\nupper(rune2) = %#U\nstrings.EqualFold(%s, %s) = %v\nstrings.EqualFold(%s, %s) = %v",
r1, r2, r1u, r2u, string(r1), string(r2), strings.EqualFold(string(r1), string(r2)), string(r1u), string(r2u), strings.EqualFold(string(r1u), string(r2u)))
}
Yes, this is "correct" behaviour. These letters do not behave normal under case folding. See:
http://www.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt
U+0131 has full case folding "F" and special "T":
T: special case for uppercase I and dotted uppercase I
- For non-Turkic languages, this mapping is normally not used.
- For Turkic languages (tr, az), this mapping can be used instead
of the normal mapping for these characters.
Note that the Turkic mappings do not maintain canonical equivalence
without additional processing.
See the discussions of case mapping in the Unicode Standard for more information.
I think there is no way of to force package strings to use the tr or az mapping.
From the strings.EqualFold source - unicode.ToLower and unicode.ToUpper are not used.
Instead, it uses unicode.SimpleFold to see if a particular rune is "foldable" and therefore potentially comparable:
// General case. SimpleFold(x) returns the next equivalent rune > x
// or wraps around to smaller values.
r := unicode.SimpleFold(sr)
for r != sr && r < tr {
r = unicode.SimpleFold(r)
}
The rune İ is not foldable. It's lowercase code-point is:
r := rune(0x0130) // U+0130 'İ'
lr := unicode.ToLower(r) // U+0069 'i'
fmt.Printf("foldable? %v\n", r != unicode.SimpleFold(r)) // foldable? false
fmt.Printf("foldable? %v\n", lr != unicode.SimpleFold(lr)) // foldable? true
If a rune is not foldable (i.e. SimpleFold returns itself) - then that rune can only match itself and no other code-point.
https://play.golang.org/p/105x0I714nS
Related
I've a terratest where I get an output from terraform like so s := "[a b]". The terraform output's value = toset([resource.name]), it's a set of strings.
Apparently fmt.Printf("%T", s) returns string. I need to iterate to perform further validation.
I tried the below approach but errors!
var v interface{}
if err := json.Unmarshal([]byte(s), &v); err != nil {
fmt.Println(err)
}
My current implementation to convert to a slice is:
s := "[a b]"
s1 := strings.Fields(strings.Trim(s, "[]"))
for _, v:= range s1 {
fmt.Println("v -> " + v)
}
Looking for suggestions to current approach or alternative ways to convert to arr/slice that I should be considering. Appreciate any inputs. Thanks.
Actually your current implementation seems just fine.
You can't use JSON unmarshaling because JSON strings must be enclosed in double quotes ".
Instead strings.Fields does just that, it splits a string on one or more characters that match unicode.IsSpace, which is \t, \n, \v. \f, \r and .
Moeover this works also if terraform sends an empty set as [], as stated in the documentation:
returning [...] an empty slice if s contains only white space.
...which includes the case of s being empty "" altogether.
In case you need additional control over this, you can use strings.FieldsFunc, which accepts a function of type func(rune) bool so you can determine yourself what constitutes a "space". But since your input string comes from terraform, I guess it's going to be well-behaved enough.
There may be third-party packages that already implement this functionality, but unless your program already imports them, I think the native solution based on the standard lib is always preferrable.
unicode.IsSpace actually includes also the higher runes 0x85 and 0xA0, in which case strings.Fields calls FieldsFunc(s, unicode.IsSpace)
package main
import (
"fmt"
"strings"
)
func main() {
src := "[a b]"
dst := strings.Split(src[1:len(src)-1], " ")
fmt.Println(dst)
}
https://play.golang.org/p/KVY4r_8RWv6
How can I get the number of characters of a string in Go?
For example, if I have a string "hello" the method should return 5. I saw that len(str) returns the number of bytes and not the number of characters so len("£") returns 2 instead of 1 because £ is encoded with two bytes in UTF-8.
You can try RuneCountInString from the utf8 package.
returns the number of runes in p
that, as illustrated in this script: the length of "World" might be 6 (when written in Chinese: "世界"), but the rune count of "世界" is 2:
package main
import "fmt"
import "unicode/utf8"
func main() {
fmt.Println("Hello, 世界", len("世界"), utf8.RuneCountInString("世界"))
}
Phrozen adds in the comments:
Actually you can do len() over runes by just type casting.
len([]rune("世界")) will print 2. At least in Go 1.3.
And with CL 108985 (May 2018, for Go 1.11), len([]rune(string)) is now optimized. (Fixes issue 24923)
The compiler detects len([]rune(string)) pattern automatically, and replaces it with for r := range s call.
Adds a new runtime function to count runes in a string.
Modifies the compiler to detect the pattern len([]rune(string))
and replaces it with the new rune counting runtime function.
RuneCount/lenruneslice/ASCII 27.8ns ± 2% 14.5ns ± 3% -47.70%
RuneCount/lenruneslice/Japanese 126ns ± 2% 60 ns ± 2% -52.03%
RuneCount/lenruneslice/MixedLength 104ns ± 2% 50 ns ± 1% -51.71%
Stefan Steiger points to the blog post "Text normalization in Go"
What is a character?
As was mentioned in the strings blog post, characters can span multiple runes.
For example, an 'e' and '◌́◌́' (acute "\u0301") can combine to form 'é' ("e\u0301" in NFD). Together these two runes are one character.
The definition of a character may vary depending on the application.
For normalization we will define it as:
a sequence of runes that starts with a starter,
a rune that does not modify or combine backwards with any other rune,
followed by possibly empty sequence of non-starters, that is, runes that do (typically accents).
The normalization algorithm processes one character at at time.
Using that package and its Iter type, the actual number of "character" would be:
package main
import "fmt"
import "golang.org/x/text/unicode/norm"
func main() {
var ia norm.Iter
ia.InitString(norm.NFKD, "école")
nc := 0
for !ia.Done() {
nc = nc + 1
ia.Next()
}
fmt.Printf("Number of chars: %d\n", nc)
}
Here, this uses the Unicode Normalization form NFKD "Compatibility Decomposition"
Oliver's answer points to UNICODE TEXT SEGMENTATION as the only way to reliably determining default boundaries between certain significant text elements: user-perceived characters, words, and sentences.
For that, you need an external library like rivo/uniseg, which does Unicode Text Segmentation.
That will actually count "grapheme cluster", where multiple code points may be combined into one user-perceived character.
package uniseg
import (
"fmt"
"github.com/rivo/uniseg"
)
func main() {
gr := uniseg.NewGraphemes("👍🏼!")
for gr.Next() {
fmt.Printf("%x ", gr.Runes())
}
// Output: [1f44d 1f3fc] [21]
}
Two graphemes, even though there are three runes (Unicode code points).
You can see other examples in "How to manipulate strings in GO to reverse them?"
👩🏾🦰 alone is one grapheme, but, from unicode to code points converter, 4 runes:
👩: women (1f469)
dark skin (1f3fe)
ZERO WIDTH JOINER (200d)
🦰red hair (1f9b0)
There is a way to get count of runes without any packages by converting string to []rune as len([]rune(YOUR_STRING)):
package main
import "fmt"
func main() {
russian := "Спутник и погром"
english := "Sputnik & pogrom"
fmt.Println("count of bytes:",
len(russian),
len(english))
fmt.Println("count of runes:",
len([]rune(russian)),
len([]rune(english)))
}
count of bytes 30 16
count of runes 16 16
I should point out that none of the answers provided so far give you the number of characters as you would expect, especially when you're dealing with emojis (but also some languages like Thai, Korean, or Arabic). VonC's suggestions will output the following:
fmt.Println(utf8.RuneCountInString("🏳️🌈🇩🇪")) // Outputs "6".
fmt.Println(len([]rune("🏳️🌈🇩🇪"))) // Outputs "6".
That's because these methods only count Unicode code points. There are many characters which can be composed of multiple code points.
Same for using the Normalization package:
var ia norm.Iter
ia.InitString(norm.NFKD, "🏳️🌈🇩🇪")
nc := 0
for !ia.Done() {
nc = nc + 1
ia.Next()
}
fmt.Println(nc) // Outputs "6".
Normalization is not really the same as counting characters and many characters cannot be normalized into a one-code-point equivalent.
masakielastic's answer comes close but only handles modifiers (the rainbow flag contains a modifier which is thus not counted as its own code point):
fmt.Println(GraphemeCountInString("🏳️🌈🇩🇪")) // Outputs "5".
fmt.Println(GraphemeCountInString2("🏳️🌈🇩🇪")) // Outputs "5".
The correct way to split Unicode strings into (user-perceived) characters, i.e. grapheme clusters, is defined in the Unicode Standard Annex #29. The rules can be found in Section 3.1.1. The github.com/rivo/uniseg package implements these rules so you can determine the correct number of characters in a string:
fmt.Println(uniseg.GraphemeClusterCount("🏳️🌈🇩🇪")) // Outputs "2".
If you need to take grapheme clusters into account, use regexp or unicode module. Counting the number of code points(runes) or bytes also is needed for validaiton since the length of grapheme cluster is unlimited. If you want to eliminate extremely long sequences, check if the sequences conform to stream-safe text format.
package main
import (
"regexp"
"unicode"
"strings"
)
func main() {
str := "\u0308" + "a\u0308" + "o\u0308" + "u\u0308"
str2 := "a" + strings.Repeat("\u0308", 1000)
println(4 == GraphemeCountInString(str))
println(4 == GraphemeCountInString2(str))
println(1 == GraphemeCountInString(str2))
println(1 == GraphemeCountInString2(str2))
println(true == IsStreamSafeString(str))
println(false == IsStreamSafeString(str2))
}
func GraphemeCountInString(str string) int {
re := regexp.MustCompile("\\PM\\pM*|.")
return len(re.FindAllString(str, -1))
}
func GraphemeCountInString2(str string) int {
length := 0
checked := false
index := 0
for _, c := range str {
if !unicode.Is(unicode.M, c) {
length++
if checked == false {
checked = true
}
} else if checked == false {
length++
}
index++
}
return length
}
func IsStreamSafeString(str string) bool {
re := regexp.MustCompile("\\PM\\pM{30,}")
return !re.MatchString(str)
}
There are several ways to get a string length:
package main
import (
"bytes"
"fmt"
"strings"
"unicode/utf8"
)
func main() {
b := "这是个测试"
len1 := len([]rune(b))
len2 := bytes.Count([]byte(b), nil) -1
len3 := strings.Count(b, "") - 1
len4 := utf8.RuneCountInString(b)
fmt.Println(len1)
fmt.Println(len2)
fmt.Println(len3)
fmt.Println(len4)
}
Depends a lot on your definition of what a "character" is. If "rune equals a character " is OK for your task (generally it isn't) then the answer by VonC is perfect for you. Otherwise, it should be probably noted, that there are few situations where the number of runes in a Unicode string is an interesting value. And even in those situations it's better, if possible, to infer the count while "traversing" the string as the runes are processed to avoid doubling the UTF-8 decode effort.
I tried to make to do the normalization a bit faster:
en, _ = glyphSmart(data)
func glyphSmart(text string) (int, int) {
gc := 0
dummy := 0
for ind, _ := range text {
gc++
dummy = ind
}
dummy = 0
return gc, dummy
}
Want to have an empty char/byte, which has zero size/length, in Go, such as byte("").
func main() {
var a byte = '' // not working
var a byte = 0 // not working
}
A more specific example is
func removeOuterParentheses(S string) string {
var stack []int
var res []byte
for i, b := range []byte(S) {
if b == '(' {
stack = append(stack, i)
} else {
if len(stack) == 1 {
res[stack[0]] = '' // set this byte to be empty
res[i] = '' // / set this byte to be empty
}
stack = stack[:len(stack)-1]
}
}
return string(res)
}
There is an equivalent question in Java
A byte is an alias to the uint8 type. Having an "empty byte" doesn't really make any sense, just as having an "empty number" doesn't make any sense (you can have the number 0, but what is an "empty" number?)
You can assign a value of zero (b := byte(0), or var b byte), which can be used to indicate that nothing is assigned yet ("zero value"). The byte value of 0 is is known as a "null byte". It normally never occurs in regular text, but often occurs in binary data (e.g. images, compressed files, etc.)
This is different from byte(""), which is a sequence of bytes. You can have a sequence of zero bytes. To give an analogy: I can have a wallet with no money in it, but I can't have a coin that is worth "empty".
If you really want to distinguish between "value of 0" and "never set" you can use either a pointer or a struct. An example with a pointer:
var b *byte
fmt.Println(b) // <nil>, since it's a pointer which has no address to point to.
one := byte(0)
b = &one // Set address to one.
fmt.Println(b, *b) // 0xc000014178 0 (first value will be different, as it's
// a memory address).
You'll need to be a little bit careful here, as *b will be a panic if you haven't assigned a value yet. Depending on how it's used it can either work quite well, or be very awkward to work with. An example where this is used in the standard library is the flag package.
Another possibility is to use a struct with separate fiels for the byte itself and a flag to record whether it's been set or not. In the database/sql library there are already the Null* types (e.g. NullInt64, which you can use as a starting point.
a single byte is a number. 0 would transform into a 8bit number. 00000000.
A byte slice/array can have a length of 0.
var a byte = 0
var b = [0]byte{}
I wish to ignore a particular field whilst processing a string with sscanf.
Man page for sscanf says
An optional '*' assignment-suppression character: scanf() reads input as directed by the conversion specification, but discards the input. No corresponding pointer argument is required, and this specification is not included in the count of successful assignments returned by scanf().
Attempting to use this in Golang, to ignore the 3rd field:
if c, err := fmt.Sscanf(str, " %s %d %*d %d ", &iface.Name, &iface.BTx, &iface.BytesRx); err != nil || c != 3 {
compiles OK, but at runtime err is set to:
bad verb %* for integer
Golang doco doesn't specifically mention the %* conversion specification, but it does say,
Package fmt implements formatted I/O with functions analogous to C's printf and scanf.
It doesn't indicate that %* is not implemented, so... Am I doing it wrong? Or has it just been quietly omitted? ...but then, why does it compile?
To the best of my knowledge there is no such verb (as the format specifiers are called in the fmt package) for this task. What you can do however, is specifying some verb and ignoring its value. This is not particularly memory friendly, though. Ideally this would work:
fmt.Scan(&a, _, &b)
Sadly, it doesn't. So your next best option would be to declare the variables and ignore the one
you don't want:
var a,b,c int
fmt.Scanf("%d %v %d", &a, &b, &c)
fmt.Println(a,c)
%v would read a space separated token. Depending on what you're scanning on, you may fast forward the
stream to the position you need to scan on. See this answer
for details on seeking in buffers. If you're using stdio or you don't know which length your input may
have, you seem to be out of luck here.
It doesn't indicate that %* is not implemented, so... Am I doing it
wrong? Or has it just been quietly omitted? ...but then, why does it
compile?
It compiles because for the compiler a format string is just a string like any other. The content of that string is evaluated at run time by functions of the fmt package. Some C compilers may check format strings
for correctness, but this is a feature, not the norm. With go, the go vet command will try to warn you about format string errors with mismatched arguments.
Edit:
For the special case of needing to parse a row of integers and just caring for some of them, you
can use fmt.Scan in combination with a slice of integers. The following example reads 3 integers
from stdin and stores them in the slice named vals:
ints := make([]interface{}, 3)
vals := make([]int, len(ints))
for i, _ := range ints {
ints[i] = interface{}(&vals[i])
}
fmt.Scan(ints...)
fmt.Println(vals)
This is probably shorter than the conventional split/trim/strconv chain. It makes a slice of pointers
which each points to a value in vals. fmt.Scan then fills these pointers. With this you can even
ignore most of the values by assigning the same pointer over and over for the values you don't want:
ignored := 0
for i, _ := range ints {
if(i == 0 || i == 2) {
ints[i] = interface{}(&vals[i])
} else {
ints[i] = interface{}(&ignored)
}
}
The example above would assign the address of ignore to all values except the first and the second, thus
effectively ignoring them by overwriting.
I'm looking for a go language capability similar to the "dictionary" in python to facilitate the conversion of some python code.
EDIT: Maps worked quite well for this de-dupe application. I was able to condense 1.3e6 duplicated items down to 2.5e5 unique items using a map with a 16 byte string index in just a few seconds. The map-related code was simple so I've included it below. Worth noting that pre-allocation of map with 1.3e6 elements sped it up by only a few percent:
var m = make(map[string]int, 1300000) // map with initial space for 1.3e6 elements
ct, ok := m[ax_hash]
if ok {
m[ax_hash] = ct + 1
} else {
m[ax_hash] = 1
}
To expand a little on answers already given:
A Go map is a typed hash map data structure. A map's type signature is of the form map[keyType]valueType where keyType and valueType are the types of the keys and values respectively.
To initialize a map, you must use the make function:
m := make(map[string]int)
An uninitialized map is equal to nil, and if read from or written a panic will occur at runtime.
The syntax for storing values is much the same as doing so with arrays or slices:
m["Alice"] = 21
m["Bob"] = 17
Similarly, retrieving values from a map is done like so:
a := m["Alice"]
b := m["Bob"]
You can use the range keyword to iterate over a map with a for loop:
for k, v := range m {
fmt.Println(k, v)
}
This code will print:
Alice 21
Bob 17
Retrieving a value for a key that is not in the map will return the value type's zero value:
c := m["Charlie"]
// c == 0
By reading multiple values from a map, you can test for a key's presence. The second value will be a boolean indicating the key's presence:
a, ok := m["Alice"]
// a == 21, ok == true
c, ok := m["Charlie"]
// c == 0, ok == false
To remove a key/value entry from a map, you flip it around and assign false as the second value:
m["Bob"] = 0, false
b, ok := m["Bob"]
// b == 0, ok == false
You can store arbitrary types in a map by using the empty interface type interface{}:
n := make(map[string]interface{})
n["One"] = 1
n["Two"] = "Two"
The only proviso is that when retrieving those values you must perform a type assertion to use them in their original form:
a := n["One"].(int)
b := n["Two"].(string)
You can use a type switch to determine the types of the values you're pulling out, and deal with them appropriately:
for k, v := range n {
switch u := v.(type) {
case int:
fmt.Printf("Key %q is an int with the value %v.\n", k, u)
case string:
fmt.Printf("Key %q is a string with the value %q.\n", k, u)
}
}
Inside each of those case blocks, u will be of the type specified in the case statement; no explicit type assertion is necessary.
This code will print:
Key "One" is an int with the value 1.
Key "Two" is a string with the value "Two".
The key can be of any type for which the equality operator is defined, such as integers, floats, strings, and pointers. Interface types can also be used, as long as the underlying type supports equality. (Structs, arrays and slices cannot be used as map keys, because equality is not defined on those types.)
For example, the map o can take keys of any of the above types:
o := make(map[interface{}]int)
o[1] = 1
o["Two"] = 2
And that's maps in a nutshell.
The map type. http://golang.org/doc/effective_go.html#maps
There is some difference from python in that the keys have to be typed, so you can't mix numeric and string keys (for some reason I forgot you can), but they're pretty easy to use.
dict := make(map[string]string)
dict["user"] = "so_user"
dict["pass"] = "l33t_pass1"
You're probably looking for a map.