how to dynamically pass values to Sprintf or Printf - go

If I want to pad a string I could use something like this:
https://play.golang.org/p/ATeUhSP18N
package main
import (
"fmt"
)
func main() {
x := fmt.Sprintf("%+20s", "Hello World!")
fmt.Println(x)
}
From https://golang.org/pkg/fmt/
+ always print a sign for numeric values;
guarantee ASCII-only output for %q (%+q)
- pad with spaces on the right rather than the left (left-justify the field)
But If I would like to dynamically change the pad size how could I pass the value ?
My first guess was:
x := fmt.Sprintf("%+%ds", 20, "Hello World!")
But I get this:
%ds%!(EXTRA int=20, string=Hello World!)
Is there a way of doing this without creating a custom pad function that would add spaces either to the left or to the right, probably using a for loop like this:
for i := 0; i < n; i++ {
out += str
}

Use * to tell Sprintf to get a formatting parameter from the argument list:
fmt.Printf("%*s\n", 20, "Hello World!")
Full code on play.golang.org
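The playground contents aren't reproduced here, but a minimal runnable version of the idea might look like this (the variable name width is my own choice):
package main

import "fmt"

func main() {
    width := 20
    // the '*' takes the field width from the argument list
    fmt.Printf("%*s\n", width, "Hello World!")   // right-justified in a width-rune field
    fmt.Printf("%-*s|\n", width, "Hello World!") // the '-' flag left-justifies; '|' marks the field end
}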

Go to: https://golang.org/pkg/fmt/
and scroll down until you find this:
fmt.Sprintf("%[3]*.[2]*[1]f", 12.0, 2, 6)
equivalent to
fmt.Sprintf("%6.2f", 12.0)
will yield " 12.00". Because an explicit index affects subsequent verbs,
this notation can be used to print the same values multiple times by
resetting the index for the first argument to be repeated
This sounds like what you want.
The real core of the description of using arguments to set field width and precision occurs further above:
Width and precision are measured in units of Unicode code points, that
is, runes. (This differs from C's printf where the units are always
measured in bytes.) Either or both of the flags may be replaced with
the character '*', causing their values to be obtained from the next
operand, which must be of type int.
The example above additionally uses explicit indexing into the argument list, which is sometimes nice to have and lets you reuse the same width and precision values for more conversions.
So you could also write:
fmt.Sprintf("*.*f", 6, 2, 12.0)

Related

Character index in line with UTF-8 files

I'm writing a lexical analyzer for UTF-8 text. When an error is detected, I'm supposed to give the line number and the index position in the line.
The user is expected to identify the location in the line by counting the characters he sees on the screen (or on paper) until he reaches the given index value. He could also use the in-line index of the cursor shown by some editors.
I suppose I can't simply use the rune count as the index, because some Unicode characters have zero width and are meant to be hidden markers or to combine with a non-zero-width character.
How am I supposed to deal with this ?
Is there a function that is able to give the visual unicode index given a byte slice containing runes ?
Also, does the line index in a file start at 0 or at 1?
I couldn't find anything in the standard library, but this seems to do it:
package main
import "github.com/rivo/uniseg"
func index(s, substr string) int {
    g := uniseg.NewGraphemes(s)
    for n := 0; g.Next(); n++ {
        if g.Str() == substr {
            return n
        }
    }
    return -1
}
func main() {
    n := index("Z a̎ B", "B")
    println(n == 4)
}
https://pkg.go.dev/github.com/rivo/uniseg
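To answer the original question more directly (mapping a byte offset in a line to a column the user can count on screen), a sketch along the same lines is possible; columnAt and the 1-based column convention are my own choices here, not part of uniseg:
package main

import (
    "fmt"
    "strings"

    "github.com/rivo/uniseg"
)

// columnAt returns the 1-based grapheme-cluster column containing the
// byte at offset off in line, or -1 if off is out of range.
func columnAt(line string, off int) int {
    g := uniseg.NewGraphemes(line)
    for col := 1; g.Next(); col++ {
        from, to := g.Positions() // byte range covered by this cluster
        if off >= from && off < to {
            return col
        }
    }
    return -1
}

func main() {
    line := "Z a̎ B"
    off := strings.Index(line, "B")  // byte offset of "B"
    fmt.Println(columnAt(line, off)) // 5: "B" is the fifth on-screen character
}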

How do I count emojis in a string in go? [duplicate]

How can I get the number of characters of a string in Go?
For example, if I have a string "hello" the method should return 5. I saw that len(str) returns the number of bytes and not the number of characters so len("£") returns 2 instead of 1 because £ is encoded with two bytes in UTF-8.
You can try RuneCountInString from the utf8 package:
returns the number of runes in p
As illustrated in this script, the length of "World" might be 6 (when written in Chinese: "世界"), but the rune count of "世界" is 2:
package main
import "fmt"
import "unicode/utf8"
func main() {
fmt.Println("Hello, 世界", len("世界"), utf8.RuneCountInString("世界"))
}
Phrozen adds in the comments:
Actually you can do len() over runes by just type casting.
len([]rune("世界")) will print 2. At least in Go 1.3.
And with CL 108985 (May 2018, for Go 1.11), len([]rune(string)) is now optimized. (Fixes issue 24923)
The compiler detects the len([]rune(string)) pattern automatically and replaces it with a runtime call that counts the runes (essentially a for range loop over the string).
Adds a new runtime function to count runes in a string.
Modifies the compiler to detect the pattern len([]rune(string))
and replaces it with the new rune counting runtime function.
RuneCount/lenruneslice/ASCII 27.8ns ± 2% 14.5ns ± 3% -47.70%
RuneCount/lenruneslice/Japanese 126ns ± 2% 60ns ± 2% -52.03%
RuneCount/lenruneslice/MixedLength 104ns ± 2% 50ns ± 1% -51.71%
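Conceptually, the pattern being optimized is equivalent to counting iterations of a range loop over the string (this is just a sketch of the idea, not the runtime's actual code):
s := "Hello, 世界"
n := 0
for range s { // each iteration decodes exactly one rune
    n++
}
// n == 9, the same value utf8.RuneCountInString(s) returns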
Stefan Steiger points to the blog post "Text normalization in Go"
What is a character?
As was mentioned in the strings blog post, characters can span multiple runes.
For example, an 'e' and '◌́' (acute accent "\u0301") can combine to form 'é' ("e\u0301" in NFD). Together these two runes are one character.
The definition of a character may vary depending on the application.
For normalization we will define it as:
a sequence of runes that starts with a starter,
a rune that does not modify or combine backwards with any other rune,
followed by possibly empty sequence of non-starters, that is, runes that do (typically accents).
The normalization algorithm processes one character at a time.
Using that package and its Iter type, the actual number of "characters" would be:
package main
import "fmt"
import "golang.org/x/text/unicode/norm"
func main() {
var ia norm.Iter
ia.InitString(norm.NFKD, "école")
nc := 0
for !ia.Done() {
nc = nc + 1
ia.Next()
}
fmt.Printf("Number of chars: %d\n", nc)
}
Here, this uses the Unicode Normalization form NFKD "Compatibility Decomposition"
Oliver's answer points to UNICODE TEXT SEGMENTATION as the only way to reliably determine default boundaries between certain significant text elements: user-perceived characters, words, and sentences.
For that, you need an external library like rivo/uniseg, which does Unicode Text Segmentation.
That will actually count "grapheme clusters", where multiple code points may be combined into one user-perceived character.
package main
import (
"fmt"
"github.com/rivo/uniseg"
)
func main() {
gr := uniseg.NewGraphemes("👍🏼!")
for gr.Next() {
fmt.Printf("%x ", gr.Runes())
}
// Output: [1f44d 1f3fc] [21]
}
Two graphemes, even though there are three runes (Unicode code points).
You can see other examples in "How to manipulate strings in GO to reverse them?"
👩🏾‍🦰 alone is one grapheme but, according to a Unicode code-point converter, it is made of 4 runes (see the sketch after this list):
👩: woman (1f469)
dark skin tone (1f3fe)
ZERO WIDTH JOINER (200d)
🦰: red hair (1f9b0)
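A minimal sketch (reusing the rivo/uniseg package from above together with unicode/utf8) to confirm those counts:
package main

import (
    "fmt"
    "unicode/utf8"

    "github.com/rivo/uniseg"
)

func main() {
    s := "👩🏾‍🦰"
    fmt.Println(utf8.RuneCountInString(s))      // 4 code points (runes)
    fmt.Println(uniseg.GraphemeClusterCount(s)) // 1 grapheme cluster
}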
There is a way to get the count of runes without any extra packages, by converting the string to []rune and taking len([]rune(YOUR_STRING)):
package main
import "fmt"
func main() {
russian := "Спутник и погром"
english := "Sputnik & pogrom"
fmt.Println("count of bytes:",
len(russian),
len(english))
fmt.Println("count of runes:",
len([]rune(russian)),
len([]rune(english)))
}
count of bytes: 30 16
count of runes: 16 16
I should point out that none of the answers provided so far give you the number of characters as you would expect, especially when you're dealing with emojis (but also some languages like Thai, Korean, or Arabic). VonC's suggestions will output the following:
fmt.Println(utf8.RuneCountInString("🏳️‍🌈🇩🇪")) // Outputs "6".
fmt.Println(len([]rune("🏳️‍🌈🇩🇪"))) // Outputs "6".
That's because these methods only count Unicode code points. There are many characters which can be composed of multiple code points.
Same for using the Normalization package:
var ia norm.Iter
ia.InitString(norm.NFKD, "🏳️‍🌈🇩🇪")
nc := 0
for !ia.Done() {
nc = nc + 1
ia.Next()
}
fmt.Println(nc) // Outputs "6".
Normalization is not really the same as counting characters and many characters cannot be normalized into a one-code-point equivalent.
masakielastic's answer comes close but only handles modifiers (the rainbow flag contains a modifier which is thus not counted as its own code point):
fmt.Println(GraphemeCountInString("🏳️‍🌈🇩🇪")) // Outputs "5".
fmt.Println(GraphemeCountInString2("🏳️‍🌈🇩🇪")) // Outputs "5".
The correct way to split Unicode strings into (user-perceived) characters, i.e. grapheme clusters, is defined in the Unicode Standard Annex #29. The rules can be found in Section 3.1.1. The github.com/rivo/uniseg package implements these rules so you can determine the correct number of characters in a string:
fmt.Println(uniseg.GraphemeClusterCount("🏳️‍🌈🇩🇪")) // Outputs "2".
If you need to take grapheme clusters into account, use the regexp or unicode package. Counting the number of code points (runes) or bytes is also needed for validation, since the length of a grapheme cluster is unlimited. If you want to eliminate extremely long sequences, check whether they conform to the stream-safe text format.
package main
import (
"regexp"
"unicode"
"strings"
)
func main() {
str := "\u0308" + "a\u0308" + "o\u0308" + "u\u0308"
str2 := "a" + strings.Repeat("\u0308", 1000)
println(4 == GraphemeCountInString(str))
println(4 == GraphemeCountInString2(str))
println(1 == GraphemeCountInString(str2))
println(1 == GraphemeCountInString2(str2))
println(true == IsStreamSafeString(str))
println(false == IsStreamSafeString(str2))
}
func GraphemeCountInString(str string) int {
re := regexp.MustCompile("\\PM\\pM*|.")
return len(re.FindAllString(str, -1))
}
func GraphemeCountInString2(str string) int {
length := 0
checked := false
index := 0
for _, c := range str {
if !unicode.Is(unicode.M, c) {
length++
if checked == false {
checked = true
}
} else if checked == false {
length++
}
index++
}
return length
}
func IsStreamSafeString(str string) bool {
re := regexp.MustCompile("\\PM\\pM{30,}")
return !re.MatchString(str)
}
There are several ways to get a string length:
package main
import (
"bytes"
"fmt"
"strings"
"unicode/utf8"
)
func main() {
b := "这是个测试"
len1 := len([]rune(b))
len2 := bytes.Count([]byte(b), nil) - 1
len3 := strings.Count(b, "") - 1
len4 := utf8.RuneCountInString(b)
fmt.Println(len1)
fmt.Println(len2)
fmt.Println(len3)
fmt.Println(len4)
}
It depends a lot on your definition of what a "character" is. If "rune equals a character" is OK for your task (generally it isn't), then the answer by VonC is perfect for you. Otherwise, it should probably be noted that there are few situations where the number of runes in a Unicode string is an interesting value. And even in those situations it's better, if possible, to infer the count while "traversing" the string as the runes are processed, to avoid doubling the UTF-8 decoding effort.
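For example, a sketch of counting in the same pass (process stands in for whatever per-rune work the application actually does):
count := 0
for _, r := range s {
    process(r) // hypothetical per-rune processing
    count++    // counted while decoding; no second pass over the string
}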
I tried to make the normalization a bit faster:
en, _ := glyphSmart(data) // data is the input string
func glyphSmart(text string) (int, int) {
    gc := 0
    dummy := 0
    // ranging over a string yields one iteration per rune,
    // so gc ends up as the rune count
    for ind := range text {
        gc++
        dummy = ind
    }
    dummy = 0
    return gc, dummy
}

Go print large number

I am currently doing the Go Lang tutorial, "Numeric Constants" to be precise. The example code starts with the following statement:
const (
// Create a huge number by shifting a 1 bit left 100 places.
// In other words, the binary number that is 1 followed by 100 zeroes.
Big = 1 << 100
// Shift it right again 99 places, so we end up with 1<<1, or 2.
Small = Big >> 99
)
The constant Big is obviously huge, and I am trying to print it and its type, like this:
fmt.Printf("%T", Big)
fmt.Println(Big)
However, I get the following error for both lines:
# command-line-arguments ./compile26.go:19: constant 1267650600228229401496703205376 overflows int
I tried converting Big to some other type, such as uint64, but it overflows with the same error. I also tried converting it to a string, but when calling Big.String() I get the following error:
Big.String undefined (type int has no field or method String)
It appears that its type is int, yet I can't print it or cast it to anything and it overflows all methods. What do I do with this number/object and how do I print it?
That value is larger than any 64-bit numeric type can hold, so you have no way of manipulating it directly.
If you need to write a numeric constant that can only be manipulated with the math/big package, you need to store it serialized in a format that package can consume. The easiest way is probably to use a base-10 string:
https://play.golang.org/p/Mzwox3I2SL
bigNum := "1267650600228229401496703205376"
b, ok := big.NewInt(0).SetString(bigNum, 10)
fmt.Println(ok, b)
// true 1267650600228229401496703205376
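Another option, not shown in the answer above, is to build the value with math/big operations instead of parsing a string, mirroring the 1 << 100 definition:
package main

import (
    "fmt"
    "math/big"
)

func main() {
    // 1 shifted left 100 places, built directly as a big.Int
    big100 := new(big.Int).Lsh(big.NewInt(1), 100)
    fmt.Println(big100) // 1267650600228229401496703205376
}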

Golang - ToUpper() on a single byte?

I have a []byte, b, and I want to select a single byte, b[pos], and change it to upper case (and then lower case). The bytes package has a function called ToUpper(). How can I use this for a single byte?
Calling ToUpper on a single byte
OneOfOne gave the most efficient approach (when calling it thousands of times); I use
val = byte(unicode.ToUpper(rune(b[pos])))
in order to find the byte and change the value
b[pos] = val
Checking if a byte is upper case
Sometimes, instead of changing the case of a byte, I want to check whether a byte is upper or lower case; all the upper-case Roman-alphabet bytes have lower values than the lower-case ones.
func (b Board) isUpper(x int) bool {
return b.board[x] < []byte{0x5a}[0]
}
For a single byte/rune, you can use unicode.ToUpper.
b[pos] = byte(unicode.ToUpper(rune(b[pos])))
I want to remind OP that bytes.ToUpper() operates on unicode code points encoded using UTF-8 in a byte slice while unicode.ToUpper() operates on a single unicode code point.
By asking to convert a single byte to upper case, OP is implying that the "b" byte slice contains something other than UTF-8, perhaps ASCII-7 or some 8-bit encoding such as ISO Latin-1 (e.g.). In that case OP needs to write an ISO Latin-1 (e.g.) ToUpper() function or OP must convert the ISO Latin-1 (e.g.) bytes to UTF-8 or unicode before using the bytes.ToUpper() or unicode.ToUpper() function.
Anything less creates a pending bug. Neither of the previously mentioned functions will properly convert all possible ISO Latin-1 (e.g.) encoded characters to upper case.
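If the data is known to be plain ASCII, a sketch of a byte-level helper is straightforward (asciiToUpper is a name invented here, not a standard-library function):
// asciiToUpper upper-cases a single ASCII letter and returns every
// other byte value unchanged.
func asciiToUpper(b byte) byte {
    if 'a' <= b && b <= 'z' {
        return b - ('a' - 'A')
    }
    return b
}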
Use the following code to test if an element of the board is an ASCII uppercase letter:
func (b Board) isUpper(x int) bool {
v := b.board[x]
return 'A' <= v && v <= 'Z'
}
If the application only needs to distinguish between upper and lowercase letters, then there's no need for the lower bound test:
func (b Board) isUpper(x int) bool {
return b.board[x] <= 'Z'
}
The code in this answer improves on the code in the question in a few ways:
The code in the answer returns the correct value for a board element containing 'Z' (run playground example below for demonstration).
'Z' and 0x5a are the same value, but the code is easier to understand with 'Z'.
It's simpler to compare directly with the value 'Z'. No need to create a slice.
playground example
Edit: Revamped answer based on new information in the question since time of my original answer.
You can use bytes.ToUpper; you just need to deal with making the input a slice
and making the output a byte:
package main
import "bytes"
func main() {
b, pos := []byte("north"), 1
b[pos] = bytes.ToUpper(b)[pos]
println(string(b) == "nOrth")
}

How to ignore fields with sscanf (%* is rejected)

I wish to ignore a particular field whilst processing a string with sscanf.
Man page for sscanf says
An optional '*' assignment-suppression character: scanf() reads input as directed by the conversion specification, but discards the input. No corresponding pointer argument is required, and this specification is not included in the count of successful assignments returned by scanf().
Attempting to use this in Golang, to ignore the 3rd field:
if c, err := fmt.Sscanf(str, " %s %d %*d %d ", &iface.Name, &iface.BTx, &iface.BytesRx); err != nil || c != 3 {
compiles OK, but at runtime err is set to:
bad verb %* for integer
Golang doco doesn't specifically mention the %* conversion specification, but it does say,
Package fmt implements formatted I/O with functions analogous to C's printf and scanf.
It doesn't indicate that %* is not implemented, so... Am I doing it wrong? Or has it just been quietly omitted? ...but then, why does it compile?
To the best of my knowledge there is no such verb (as the format specifiers are called in the fmt package) for this task. What you can do, however, is specify some verb and ignore its value. This is not particularly memory-friendly, though. Ideally this would work:
fmt.Scan(&a, _, &b)
Sadly, it doesn't. So your next best option would be to declare the variables and ignore the one
you don't want:
var a, b, c int
fmt.Scanf("%d %v %d", &a, &b, &c)
fmt.Println(a, c)
%v would read a space-separated token. Depending on what you're scanning from, you may fast-forward the
stream to the position you need to scan at. See this answer
for details on seeking in buffers. If you're reading from stdin or you don't know what length your input may
have, you seem to be out of luck here.
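Applied to the question's format string, the throwaway-variable approach might look like this sketch (skip is just a name for the discarded field; note the expected count becomes 4, because the skipped field still counts as a parsed item):
// scan the third field into a throwaway variable instead of %*d
var skip int
if c, err := fmt.Sscanf(str, " %s %d %d %d ",
    &iface.Name, &iface.BTx, &skip, &iface.BytesRx); err != nil || c != 4 {
    // handle the error
}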
It doesn't indicate that %* is not implemented, so... Am I doing it
wrong? Or has it just been quietly omitted? ...but then, why does it
compile?
It compiles because, to the compiler, a format string is just a string like any other. The content of that string is evaluated at run time by functions of the fmt package. Some C compilers may check format strings
for correctness, but this is a feature, not the norm. With Go, the go vet command will try to warn you about format string errors with mismatched arguments.
Edit:
For the special case of needing to parse a row of integers and only caring about some of them, you
can use fmt.Scan in combination with a slice of integers. The following example reads 3 integers
from stdin and stores them in the slice named vals:
ints := make([]interface{}, 3)
vals := make([]int, len(ints))
for i := range ints {
ints[i] = interface{}(&vals[i])
}
fmt.Scan(ints...)
fmt.Println(vals)
This is probably shorter than the conventional split/trim/strconv chain. It makes a slice of pointers,
each of which points to a value in vals. fmt.Scan then fills these pointers. With this you can even
ignore most of the values by assigning the same pointer over and over for the values you don't want:
ignored := 0
for i := range ints {
if i == 0 || i == 2 {
ints[i] = interface{}(&vals[i])
} else {
ints[i] = interface{}(&ignored)
}
}
The example above assigns the address of ignored to all entries except the first and the third, thus
effectively ignoring those values by overwriting them into the same throwaway variable.

Resources