Why _(underscore) ignored in output? - go

I would like to know the reason behind the output of this program.
Program
package main
import (
"fmt"
)
func main() {
a := 1_00_000
fmt.Println(a)
}
Output
100000
How come the underscore is ignored in the output? What is the use of this new feature in Go?

It's not ignored in the output; it's ignored in the source code. The underscores are a convenience to make large number literals in code easier to read; the literal is still an integer, and integers don't contain underscores. You could always use a string of course:
a := "1_00_000"
fmt.Println(a)
Underscores as separators were added as a new feature in Go 1.13: https://golang.org/doc/go1.13#language

Underscores are just digit separators. This feature was introduced in Go 1.13 to improve readability. The underscores are not printed along with the number.
The digits of any number literal can be separated (grouped) using underscores, such as in 1_000_000, 0b_1010_011 to make it more readable.
d := 9795696874578
d := 9_795_696_874_578 // thousand separators
Here underscored literals are much easier to read.
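As a small illustration of the point that separators live only in source code, here is a minimal sketch. The helper name parseLiteral is hypothetical. It relies on strconv.ParseInt accepting underscores only when the base argument is 0, which tells it to follow Go's integer literal syntax.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseLiteral (hypothetical helper) parses s using Go literal syntax.
// Base 0 is the only strconv mode that accepts underscore separators.
func parseLiteral(s string) (int64, error) {
	return strconv.ParseInt(s, 0, 64)
}

func main() {
	// The separators exist only in source; all three values are plain ints.
	fmt.Println(1_000_000, 0b_1010_011, 0xFF_FF) // 1000000 83 65535

	n, err := parseLiteral("1_000_000")
	fmt.Println(n, err) // 1000000 <nil>

	// With an explicit base of 10, underscores are rejected.
	_, err = strconv.ParseInt("1_000_000", 10, 64)
	fmt.Println(err != nil) // true
}
```

So a string such as "1_000_000" read from user input is not automatically a valid number; it only parses when the literal syntax is requested.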

Related

In Go 1.18 strings.Title() is deprecated. What to use now? And how?

As suggested here names of people should be capitalized like John William Smith.
I'm writing a small program in Go which gets the last and first name from a user's form input.
Until Go 1.18 I was using:
lastname = strings.Title(strings.ToLower(strings.TrimSpace(lastname)))
firstname = strings.Title(strings.ToLower(strings.TrimSpace(firstname)))
It works but now Go 1.18 has deprecated strings.Title().
They suggest to use golang.org/x/text/cases instead.
So I think I should change my code to something like this:
caser := cases.Title(language.Und)
lastname = caser.String(strings.ToLower(strings.TrimSpace(lastname)))
firstname = caser.String(strings.ToLower(strings.TrimSpace(firstname)))
It works the same as before.
The difference shows up for Dutch words like ijsland, which should be titled as IJsland and not Ijsland.
The question
In the line caser := cases.Title(language.Und) I'm using Und because I don't know what language Tag to use.
Should I use language.English or language.AmericanEnglish or other?
So far it was like strings.Title() was using Und or English?
As mentioned in documentation strings.Title is deprecated and you should use cases.Title instead.
Deprecated: The rule Title uses for word boundaries does not handle
Unicode punctuation properly. Use golang.org/x/text/cases instead.
Here is example code showing two ways to use it:
// Straightforward approach
caser := cases.Title(language.BrazilianPortuguese)
titleStr := caser.String(str)
// Transformer interface aware approach
src := []byte(s)
dest := []byte(s) // dest can also be `dest := src`
caser := cases.Title(language.BrazilianPortuguese)
_, _, err := caser.Transform(dest, src, true)
Make sure to take a look at transform.Transformer.Transform and cases.Caser in order to understand what each parameter and return value means, as well as the tool's limitations. For example:
A Caser may be stateful and should therefore not be shared between
goroutines.
Regarding which language to use: be aware that the choice affects the results, but beyond that you should be fine with any of them. Here is a copy of 煎鱼's summary of the differences, which cleared it up for me:
Go Playground: https://go.dev/play/p/xp59r1BkC9L
func main() {
src := []string{
"hello world!",
"i with dot",
"'n ijsberg",
"here comes O'Brian",
}
for _, c := range []cases.Caser{
cases.Lower(language.Und),
cases.Upper(language.Turkish),
cases.Title(language.Dutch),
cases.Title(language.Und, cases.NoLower),
} {
fmt.Println()
for _, s := range src {
fmt.Println(c.String(s))
}
}
}
With the following output
hello world!
i with dot
'n ijsberg
here comes o'brian
HELLO WORLD!
İ WİTH DOT
'N İJSBERG
HERE COMES O'BRİAN
Hello World!
I With Dot
'n IJsberg
Here Comes O'brian
Hello World!
I With Dot
'N Ijsberg
Here Comes O'Brian
So far it was like strings.Title() was using Und or English?
strings.Title() uses a simple word-boundary rule that only recognizes ASCII punctuation and Unicode spaces, whereas cases.Title() follows the Unicode rules, so there is no way to get the exact same behavior.
Should I use language.English or language.AmericanEnglish or other?
language.English, language.AmericanEnglish and language.Und all seem to have the same Title rules. Using any of them should get you the closest to the original strings.Title() behavior as you are going to get.
The whole point of using this package with Unicode support is that it is objectively more correct. So pick a tag appropriate for your users.
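To make the deprecation note about Unicode punctuation concrete, here is a minimal sketch using only the standard library. The wrapper name legacyTitle is hypothetical. It calls the deprecated strings.Title on purpose, so linters may flag it.

```go
package main

import (
	"fmt"
	"strings"
)

// legacyTitle (hypothetical wrapper) calls the deprecated strings.Title so
// its word-boundary rule can be observed in isolation.
func legacyTitle(s string) string {
	return strings.Title(s)
}

func main() {
	// The ASCII apostrophe is treated as a word boundary.
	fmt.Println(legacyTitle("here comes o'brian")) // Here Comes O'Brian

	// U+2019 (right single quotation mark) is not, so "brian" stays lowercase.
	fmt.Println(legacyTitle("here comes o\u2019brian")) // Here Comes O’brian
}
```

The same name is therefore capitalized differently depending on which apostrophe the user typed, which is the kind of inconsistency the deprecation notice refers to.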
strings.Title(str) is deprecated; change it to cases.Title(language.Und, cases.NoLower).String(str)
package main
import (
"fmt"
"strings"
"golang.org/x/text/cases"
"golang.org/x/text/language"
)
func main() {
fmt.Println(strings.Title("abcABC")) // AbcABC
fmt.Println(cases.Title(language.Und, cases.NoLower).String("abcABC")) // AbcABC
}
Playground : https://go.dev/play/p/i0Eqh3QfxTx
Here is a straightforward example of how to capitalize the initial letter of each word in a string using the golang.org/x/text package.
package main
import (
"fmt"
"golang.org/x/text/cases"
"golang.org/x/text/language"
)
func main() {
sampleStr := "with value lower, all the letters are lowercase. this is good for poetry perhaps"
caser := cases.Title(language.English)
fmt.Println(caser.String(sampleStr))
}
Output : With Value Lower, All The Letters Are Lowercase. This Is Good For Poetry Perhaps
Playground Example: https://go.dev/play/p/_J8nGVuhYC9

How to convert the string representation of a Terraform set of strings to a slice of strings

I've a terratest where I get an output from terraform like so s := "[a b]". The terraform output's value = toset([resource.name]), it's a set of strings.
Apparently fmt.Printf("%T", s) returns string. I need to iterate to perform further validation.
I tried the approach below, but it returns an error:
var v interface{}
if err := json.Unmarshal([]byte(s), &v); err != nil {
fmt.Println(err)
}
My current implementation to convert to a slice is:
s := "[a b]"
s1 := strings.Fields(strings.Trim(s, "[]"))
for _, v := range s1 {
fmt.Println("v -> " + v)
}
Looking for suggestions to current approach or alternative ways to convert to arr/slice that I should be considering. Appreciate any inputs. Thanks.
Actually your current implementation seems just fine.
You can't use JSON unmarshaling because JSON strings must be enclosed in double quotes ".
Instead, strings.Fields does just that: it splits a string on one or more consecutive characters that match unicode.IsSpace, which for ASCII means \t, \n, \v, \f, \r and ' ' (space).
Moreover, this also works if Terraform sends an empty set as [], as stated in the documentation:
returning [...] an empty slice if s contains only white space.
...which includes the case of s being empty "" altogether.
In case you need additional control over this, you can use strings.FieldsFunc, which accepts a function of type func(rune) bool so you can determine yourself what constitutes a "space". But since your input string comes from terraform, I guess it's going to be well-behaved enough.
There may be third-party packages that already implement this functionality, but unless your program already imports them, I think the native solution based on the standard lib is always preferable.
Note: unicode.IsSpace also includes the higher runes U+0085 and U+00A0; when the input contains non-ASCII runes, strings.Fields calls FieldsFunc(s, unicode.IsSpace).
package main
import (
"fmt"
"strings"
)
func main() {
src := "[a b]"
dst := strings.Split(src[1:len(src)-1], " ")
fmt.Println(dst)
}
https://play.golang.org/p/KVY4r_8RWv6
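The two approaches above differ on edge cases, which the following minimal sketch tries to show. The function names parseSet and parseSetSplit are hypothetical, and strings.Trim is used in both for brevity instead of slicing off the brackets.

```go
package main

import (
	"fmt"
	"strings"
)

// parseSet (hypothetical name) converts Terraform's textual set output,
// e.g. "[a b]", into a slice using strings.Fields.
func parseSet(s string) []string {
	return strings.Fields(strings.Trim(s, "[]"))
}

// parseSetSplit is the strings.Split variant, shown for comparison.
func parseSetSplit(s string) []string {
	return strings.Split(strings.Trim(s, "[]"), " ")
}

func main() {
	// Regular input: both agree.
	fmt.Println(len(parseSet("[a b]")), len(parseSetSplit("[a b]"))) // 2 2

	// Empty set: Split returns one empty string, Fields returns nothing.
	fmt.Println(len(parseSet("[]")), len(parseSetSplit("[]"))) // 0 1

	// Repeated spaces: Split produces an empty element in between.
	fmt.Println(len(parseSet("[a  b]")), len(parseSetSplit("[a  b]"))) // 2 3
}
```

If the validation loop must not see empty strings, the strings.Fields variant is probably the safer default.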

How do I count emojis in a string in go? [duplicate]

How can I get the number of characters of a string in Go?
For example, if I have a string "hello" the method should return 5. I saw that len(str) returns the number of bytes and not the number of characters so len("£") returns 2 instead of 1 because £ is encoded with two bytes in UTF-8.
You can try RuneCountInString from the utf8 package. Like RuneCount, which "returns the number of runes in p", it counts runes rather than bytes.
As illustrated in this script, the byte length of "World" might be 6 (when written in Chinese: "世界"), but the rune count of "世界" is 2:
package main
import "fmt"
import "unicode/utf8"
func main() {
fmt.Println("Hello, 世界", len("世界"), utf8.RuneCountInString("世界"))
}
Phrozen adds in the comments:
Actually you can do len() over runes by just type casting.
len([]rune("世界")) will print 2. At least in Go 1.3.
And with CL 108985 (May 2018, for Go 1.11), len([]rune(string)) is now optimized. (Fixes issue 24923)
The compiler detects len([]rune(string)) pattern automatically, and replaces it with for r := range s call.
Adds a new runtime function to count runes in a string.
Modifies the compiler to detect the pattern len([]rune(string))
and replaces it with the new rune counting runtime function.
name                                old time/op  new time/op  delta
RuneCount/lenruneslice/ASCII        27.8ns ± 2%  14.5ns ± 3%  -47.70%
RuneCount/lenruneslice/Japanese     126ns ± 2%   60ns ± 2%    -52.03%
RuneCount/lenruneslice/MixedLength  104ns ± 2%   50ns ± 1%    -51.71%
Stefan Steiger points to the blog post "Text normalization in Go"
What is a character?
As was mentioned in the strings blog post, characters can span multiple runes.
For example, an 'e' and '◌́' (acute "\u0301") can combine to form 'é' ("e\u0301" in NFD). Together these two runes are one character.
The definition of a character may vary depending on the application.
For normalization we will define it as:
a sequence of runes that starts with a starter, a rune that does not modify or combine backwards with any other rune, followed by possibly empty sequence of non-starters, that is, runes that do (typically accents).
The normalization algorithm processes one character at a time.
Using that package and its Iter type, the actual number of "characters" would be:
package main
import "fmt"
import "golang.org/x/text/unicode/norm"
func main() {
var ia norm.Iter
ia.InitString(norm.NFKD, "école")
nc := 0
for !ia.Done() {
nc = nc + 1
ia.Next()
}
fmt.Printf("Number of chars: %d\n", nc)
}
Here, this uses the Unicode Normalization form NFKD "Compatibility Decomposition"
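To see why one user-perceived character can span several runes, here is a standard-library-only sketch that compares the precomposed and decomposed spellings of é. The wrapper name runeCount is hypothetical. No normalization is performed here; the two strings are written out by hand.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeCount (hypothetical wrapper) returns the number of code points in s.
func runeCount(s string) int {
	return utf8.RuneCountInString(s)
}

func main() {
	composed := "\u00e9"    // é as one precomposed code point
	decomposed := "e\u0301" // e followed by a combining acute accent

	fmt.Println(runeCount(composed), runeCount(decomposed)) // 1 2
	fmt.Println(len(composed), len(decomposed))             // 2 3
	fmt.Println(composed == decomposed)                     // false
}
```

Both strings render the same way, yet they differ in byte length, rune count, and equality, which is why counting or comparing usually needs a normalization or segmentation step first.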
Oliver's answer points to UNICODE TEXT SEGMENTATION as the only way to reliably determine default boundaries between certain significant text elements: user-perceived characters, words, and sentences.
For that, you need an external library like rivo/uniseg, which does Unicode Text Segmentation.
That will actually count "grapheme clusters", where multiple code points may be combined into one user-perceived character.
package main
import (
"fmt"
"github.com/rivo/uniseg"
)
func main() {
gr := uniseg.NewGraphemes("👍🏼!")
for gr.Next() {
fmt.Printf("%x ", gr.Runes())
}
// Output: [1f44d 1f3fc] [21]
}
Two graphemes, even though there are three runes (Unicode code points).
You can see other examples in "How to manipulate strings in GO to reverse them?"
👩🏾‍🦰 alone is one grapheme, but, from unicode to code points converter, 4 runes:
👩: woman (1f469)
🏾: dark skin tone (1f3fe)
ZERO WIDTH JOINER (200d)
🦰: red hair (1f9b0)
There is a way to get count of runes without any packages by converting string to []rune as len([]rune(YOUR_STRING)):
package main
import "fmt"
func main() {
russian := "Спутник и погром"
english := "Sputnik & pogrom"
fmt.Println("count of bytes:",
len(russian),
len(english))
fmt.Println("count of runes:",
len([]rune(russian)),
len([]rune(english)))
}
count of bytes: 30 16
count of runes: 16 16
I should point out that none of the answers provided so far give you the number of characters as you would expect, especially when you're dealing with emojis (but also some languages like Thai, Korean, or Arabic). VonC's suggestions will output the following:
fmt.Println(utf8.RuneCountInString("🏳️‍🌈🇩🇪")) // Outputs "6".
fmt.Println(len([]rune("🏳️‍🌈🇩🇪"))) // Outputs "6".
That's because these methods only count Unicode code points. There are many characters which can be composed of multiple code points.
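To inspect where the count of 6 comes from, a small sketch can list the individual code points. The helper name codePoints is hypothetical, and the string is written with escapes under the assumption that the two flags use their usual encodings (white flag, variation selector, zero width joiner, rainbow, then two regional indicators).

```go
package main

import "fmt"

// codePoints (hypothetical helper) lists every code point in s in U+XXXX
// notation.
func codePoints(s string) []string {
	var out []string
	for _, r := range s {
		out = append(out, fmt.Sprintf("%U", r))
	}
	return out
}

func main() {
	// Rainbow flag followed by the German flag, written with escapes.
	flags := "\U0001F3F3\uFE0F\u200D\U0001F308\U0001F1E9\U0001F1EA"

	fmt.Println(len(codePoints(flags))) // 6
	fmt.Println(codePoints(flags))
}
```

Four code points make up the rainbow flag and two make up the German flag, so a rune count reports 6 for what a reader sees as 2 characters.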
Same for using the Normalization package:
var ia norm.Iter
ia.InitString(norm.NFKD, "🏳️‍🌈🇩🇪")
nc := 0
for !ia.Done() {
nc = nc + 1
ia.Next()
}
fmt.Println(nc) // Outputs "6".
Normalization is not really the same as counting characters and many characters cannot be normalized into a one-code-point equivalent.
masakielastic's answer comes close but only handles modifiers (the rainbow flag contains a modifier which is thus not counted as its own code point):
fmt.Println(GraphemeCountInString("🏳️‍🌈🇩🇪")) // Outputs "5".
fmt.Println(GraphemeCountInString2("🏳️‍🌈🇩🇪")) // Outputs "5".
The correct way to split Unicode strings into (user-perceived) characters, i.e. grapheme clusters, is defined in the Unicode Standard Annex #29. The rules can be found in Section 3.1.1. The github.com/rivo/uniseg package implements these rules so you can determine the correct number of characters in a string:
fmt.Println(uniseg.GraphemeClusterCount("🏳️‍🌈🇩🇪")) // Outputs "2".
If you need to take grapheme clusters into account, use the regexp or unicode package. Counting the number of code points (runes) or bytes is also needed for validation, since the length of a grapheme cluster is unlimited. If you want to eliminate extremely long sequences, check whether they conform to the stream-safe text format.
package main
import (
"regexp"
"unicode"
"strings"
)
func main() {
str := "\u0308" + "a\u0308" + "o\u0308" + "u\u0308"
str2 := "a" + strings.Repeat("\u0308", 1000)
println(4 == GraphemeCountInString(str))
println(4 == GraphemeCountInString2(str))
println(1 == GraphemeCountInString(str2))
println(1 == GraphemeCountInString2(str2))
println(true == IsStreamSafeString(str))
println(false == IsStreamSafeString(str2))
}
func GraphemeCountInString(str string) int {
re := regexp.MustCompile("\\PM\\pM*|.")
return len(re.FindAllString(str, -1))
}
func GraphemeCountInString2(str string) int {
length := 0
seenBase := false // becomes true after the first non-mark rune
for _, c := range str {
if !unicode.Is(unicode.M, c) {
length++
seenBase = true
} else if !seenBase {
// Leading marks with no base character each count as one.
length++
}
}
return length
}
func IsStreamSafeString(str string) bool {
re := regexp.MustCompile("\\PM\\pM{30,}")
return !re.MatchString(str)
}
There are several ways to get a string length:
package main
import (
"bytes"
"fmt"
"strings"
"unicode/utf8"
)
func main() {
b := "这是个测试"
len1 := len([]rune(b))
len2 := bytes.Count([]byte(b), nil) - 1
len3 := strings.Count(b, "") - 1
len4 := utf8.RuneCountInString(b)
fmt.Println(len1)
fmt.Println(len2)
fmt.Println(len3)
fmt.Println(len4)
}
Depends a lot on your definition of what a "character" is. If "rune equals a character" is OK for your task (generally it isn't), then the answer by VonC is perfect for you. Otherwise, it should probably be noted that there are few situations where the number of runes in a Unicode string is an interesting value. And even in those situations it's better, if possible, to infer the count while "traversing" the string as the runes are processed, to avoid doubling the UTF-8 decode effort.
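A minimal sketch of that single-pass idea follows. The function name scan and the choice to count letters are only illustrative assumptions; the point is that the rune count falls out of the same loop that does the real work.

```go
package main

import (
	"fmt"
	"unicode"
)

// scan (hypothetical helper) walks s once, counting runes and letters in
// the same pass so the string is decoded only one time.
func scan(s string) (runes, letters int) {
	for _, r := range s {
		runes++
		if unicode.IsLetter(r) {
			letters++
		}
	}
	return runes, letters
}

func main() {
	fmt.Println(scan("Hello, 世界")) // 9 7
}
```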
I tried to make the normalization a bit faster:
en, _ = glyphSmart(data)
func glyphSmart(text string) (int, int) {
gc := 0
dummy := 0
for ind, _ := range text {
gc++
dummy = ind
}
dummy = 0
return gc, dummy
}

How to strings.Split on newline?

I'm trying to do the rather simple task of splitting a string by newlines.
This does not work:
temp := strings.Split(result,`\n`)
I also tried ' instead of ` but no luck.
Any ideas?
You have to use "\n".
Splitting on `\n` searches for an actual \ followed by n in the text, not the newline byte.
For those of us who at times use Windows, it helps to remember to replace \r\n before splitting:
strings.Split(strings.ReplaceAll(windows, "\r\n", "\n"), "\n")
It does not work because you're using backticks:
Raw string literals are character sequences between back quotes ``. Within the quotes, any character is legal except back quote. The value of a raw string literal is the string composed of the uninterpreted (implicitly UTF-8-encoded) characters between the quotes; in particular, backslashes have no special meaning and the string may contain newlines.
Reference: http://golang.org/ref/spec#String_literals
So, when you're doing
strings.Split(result,`\n`)
you're actually splitting on the two consecutive characters "\" and "n", not on the newline character "\n". To do what you want, simply use "\n" in double quotes instead of backticks.
Your code doesn't work because you're using backticks instead of double quotes. However, you should be using a bufio.Scanner if you want to support Windows.
import (
"bufio"
"strings"
)
func SplitLines(s string) []string {
var lines []string
sc := bufio.NewScanner(strings.NewReader(s))
for sc.Scan() {
lines = append(lines, sc.Text())
}
return lines
}
Alternatively, you can use strings.FieldsFunc (this approach skips blank lines)
strings.FieldsFunc(s, func(c rune) bool { return c == '\n' || c == '\r' })
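The note about blank lines is easy to overlook, so here is a minimal sketch comparing the replace-then-split approach with strings.FieldsFunc on input with Windows line endings. Both helper names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// splitKeepBlank (hypothetical helper) normalizes \r\n and then splits,
// keeping empty lines.
func splitKeepBlank(s string) []string {
	return strings.Split(strings.ReplaceAll(s, "\r\n", "\n"), "\n")
}

// splitSkipBlank (hypothetical helper) treats \r and \n as separators,
// which drops empty lines.
func splitSkipBlank(s string) []string {
	return strings.FieldsFunc(s, func(c rune) bool { return c == '\n' || c == '\r' })
}

func main() {
	in := "a\r\n\r\nb\n"
	fmt.Printf("%q\n", splitKeepBlank(in)) // ["a" "" "b" ""]
	fmt.Printf("%q\n", splitSkipBlank(in)) // ["a" "b"]
}
```

Which behavior is right depends on whether blank lines carry meaning in the input, for example as paragraph separators.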
import "regexp"
var lines []string = regexp.MustCompile("\r?\n").Split(inputString, -1)
MustCompile() creates a regular expression that splits on both \r\n and \n.
Split() performs the split; the second argument sets the maximum number of parts, with -1 meaning unlimited.
' doesn't work because it is not a string type, but instead a rune.
temp := strings.Split(result,'\n')
go compiler: cannot use '\u000a' (type rune) as type string in argument to strings.Split
definition: Split(s, sep string) []string

Umlauts and slices

I'm having some trouble while reading a file which has a fixed column length format. Some columns may contain umlauts.
Umlauts seem to use 2 bytes instead of one. This is not the behaviour I was expecting. Is there any kind of function which returns a substring? Slice does not seem to work in this case.
Here's some sample code:
http://play.golang.org/p/ZJ1axy7UXe
umlautsString := "Rhön"
fmt.Println(len(umlautsString))
fmt.Println(umlautsString[0:4])
Prints:
5
Rhö
In go, a slice of a string counts bytes, not runes. This is why "Rhön"[0:3] gives you Rh and the first byte of ö.
Go represents a Unicode code point as a rune, and UTF-8 encodes each code point in one to four bytes, so a single character such as ö can occupy more than one byte.
If you want to slice a string with the [] syntax, convert the string to []rune before.
Example (on play):
umlautsString := "Rhön"
runes := []rune(umlautsString)
fmt.Println(string(runes[0:3])) // Rhö
Noteworthy: This golang blog post about string representation in go.
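For fixed-width columns it can help to wrap the []rune conversion in a helper that tolerates short lines. The following is a minimal sketch; the name substr and the clamping behavior are assumptions, not something from the answers above.

```go
package main

import "fmt"

// substr (hypothetical helper) returns the runes of s in [start, end),
// clamping out-of-range indices instead of panicking.
func substr(s string, start, end int) string {
	r := []rune(s)
	if start < 0 {
		start = 0
	}
	if end > len(r) {
		end = len(r)
	}
	if start >= end {
		return ""
	}
	return string(r[start:end])
}

func main() {
	fmt.Println(substr("Rhön", 0, 3))  // Rhö
	fmt.Println(substr("Rhön", 2, 10)) // ön
}
```

Note that this counts code points, so a column containing combining marks or emoji sequences would still need grapheme-aware handling.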
You can convert string to []rune and work with it:
package main
import "fmt"
func main() {
umlautsString := "Rhön"
fmt.Println(len(umlautsString))
subStrRunes := []rune(umlautsString)
fmt.Println(len(subStrRunes))
fmt.Println(string(subStrRunes[0:4]))
}
http://play.golang.org/p/__WfitzMOJ
Hope that helps!
Another option is the utf8string package:
package main
import "golang.org/x/exp/utf8string"
func main() {
s := utf8string.NewString("🧡💛💚💙💜")
// example 1
n := s.RuneCount()
println(n == 5)
// example 2
t := s.Slice(0, 2)
println(t == "🧡💛")
}
https://pkg.go.dev/golang.org/x/exp/utf8string
