Golang Convert UTF-8 string to ASCII [duplicate] - go

How can I remove all diacritics from the given UTF8 encoded string using Go? e.g. transform the string "žůžo" => "zuzo". Is there a standard way?

You can use the libraries described in Text normalization in Go.
Here's an application of those libraries:
// Example derived from: http://blog.golang.org/normalization
package main
import (
"fmt"
"unicode"
"golang.org/x/text/transform"
"golang.org/x/text/unicode/norm"
)
func isMn(r rune) bool {
return unicode.Is(unicode.Mn, r) // Mn: nonspacing marks
}
func main() {
t := transform.Chain(norm.NFD, transform.RemoveFunc(isMn), norm.NFC)
result, _, _ := transform.String(t, "žůžo")
fmt.Println(result)
}

To expand a bit on the existing answer:
The internet standard for comparing strings of different character sets is called "PRECIS" (Preparation, Enforcement, and Comparison of Internationalized Strings in Application Protocols) and is documented in RFC7564. There is also a Go implementation at golang.org/x/text/secure/precis.
None of the standard profiles will do what you want, but it would be fairly straight forward to define a new profile that did. You would want to apply Unicode Normalization Form D ("D" for "Decomposition", which means the accents will be split off and be their own combining character), and then remove any combining character as part of the additional mapping rule, then recompose with the normalization rule. Something like this:
package main
import (
"fmt"
"unicode"
"golang.org/x/text/secure/precis"
"golang.org/x/text/transform"
"golang.org/x/text/unicode/norm"
)
func main() {
loosecompare := precis.NewIdentifier(
precis.AdditionalMapping(func() transform.Transformer {
return transform.Chain(norm.NFD, transform.RemoveFunc(func(r rune) bool {
return unicode.Is(unicode.Mn, r)
}))
}),
precis.Norm(norm.NFC), // This is the default; be explicit though.
)
p, _ := loosecompare.String("žůžo")
fmt.Println(p, loosecompare.Compare("žůžo", "zuzo"))
// Prints "zuzo true"
}
This lets you expand your comparison with more options later (eg. width mapping, case mapping, etc.)
It's also worth noting that removing accents is almost never what you actually want to do when comparing strings like this, however, without knowing your use case I can't actually make that assertion about your project. To prevent the proliferation of precis profiles it's good to use one of the existing profiles where possible. Also note that no effort was made to optimize the example profile.

transform.RemoveFunc is deprecated.
Instead you can use the Remove function from runes package:
t := transform.Chain(norm.NFD, runes.Remove(runes.In(unicode.Mn)), norm.NFC)
result, _, _ := transform.String(t, "žůžo")
fmt.Println(result)

For anyone looking how to remove (or replace / flatten) Polish diacritics in Go, you may define a mapping for runes:
package main
import (
"fmt"
"golang.org/x/text/runes"
"golang.org/x/text/secure/precis"
"golang.org/x/text/transform"
"golang.org/x/text/unicode/norm"
)
func main() {
trans := transform.Chain(
norm.NFD,
precis.UsernameCaseMapped.NewTransformer(),
runes.Map(func(r rune) rune {
switch r {
case 'ą':
return 'a'
case 'ć':
return 'c'
case 'ę':
return 'e'
case 'ł':
return 'l'
case 'ń':
return 'n'
case 'ó':
return 'o'
case 'ś':
return 's'
case 'ż':
return 'z'
case 'ź':
return 'z'
}
return r
}),
norm.NFC,
)
result, _, _ := transform.String(trans, "ŻóŁć")
fmt.Println(result)
}
On Go Playground: https://play.golang.org/p/3ulPnOd3L91

Related

How to convert the string representation of a Terraform set of strings to a slice of strings

I've a terratest where I get an output from terraform like so s := "[a b]". The terraform output's value = toset([resource.name]), it's a set of strings.
Apparently fmt.Printf("%T", s) returns string. I need to iterate to perform further validation.
I tried the below approach but errors!
var v interface{}
if err := json.Unmarshal([]byte(s), &v); err != nil {
fmt.Println(err)
}
My current implementation to convert to a slice is:
s := "[a b]"
s1 := strings.Fields(strings.Trim(s, "[]"))
for _, v:= range s1 {
fmt.Println("v -> " + v)
}
Looking for suggestions to current approach or alternative ways to convert to arr/slice that I should be considering. Appreciate any inputs. Thanks.
Actually your current implementation seems just fine.
You can't use JSON unmarshaling because JSON strings must be enclosed in double quotes ".
Instead strings.Fields does just that, it splits a string on one or more characters that match unicode.IsSpace, which is \t, \n, \v. \f, \r and .
Moeover this works also if terraform sends an empty set as [], as stated in the documentation:
returning [...] an empty slice if s contains only white space.
...which includes the case of s being empty "" altogether.
In case you need additional control over this, you can use strings.FieldsFunc, which accepts a function of type func(rune) bool so you can determine yourself what constitutes a "space". But since your input string comes from terraform, I guess it's going to be well-behaved enough.
There may be third-party packages that already implement this functionality, but unless your program already imports them, I think the native solution based on the standard lib is always preferrable.
unicode.IsSpace actually includes also the higher runes 0x85 and 0xA0, in which case strings.Fields calls FieldsFunc(s, unicode.IsSpace)
package main
import (
"fmt"
"strings"
)
func main() {
src := "[a b]"
dst := strings.Split(src[1:len(src)-1], " ")
fmt.Println(dst)
}
https://play.golang.org/p/KVY4r_8RWv6

Parsefloat give output in scientific format in golang

i am trying to parse this string "7046260" using Parsefloat function in golang , but i am getting output in scientific format 7.04626e+06. i want the output in the format 7046260. how to get this?
package main
import (
"fmt"
"strconv"
)
func main() {
Value := "7046260"
Fval, err := strconv.ParseFloat(Value, 64)
if err == nil {
fmt.Println(Fval)
}
}
ouput :- 7.04626e+06
Parsefloat give output in scientific format in golang
i am trying to parse this string "7046260" using Parsefloat function in golang , but i am getting output in scientific format 7.04626e+06. i want the output in the format 7046260
You're confusing the floating-point value's (default) formatted output with its internal representation.
ParseFloat is working fine.
You just need to specify an output format:
See the fmt package documentation.
Use Printf to specify a format-string.
Use the format %.0f to instruct Go to print the value as-follows:
% marks the start of a placeholder.
. denotes default width (i.e. don't add leading or trailing zeroes).
0 denotes zero radix precision (i.e. don't print any decimal places, even if the value has them)
f denotes the end of the placeholder, and that the placeholder is for a floating-point value.
I have a few other recommendations:
Local variables in Go should use camelCase, not PascalCase. Go does not encourage the use of snake_case.
You should check err != nil after each nil-returning function returns and either fail-fast (if appropriate), pass the error up (and optionally log it), or handle it gracefully.
When working with floating-point numbers, you should be aware of NaN's special status. The IsNaN function is the only way to correctly check for NaN values (because ( aNaNValue1 == math.NaN ) == false).
The same applies in all languages that implement IEEE-754, including Java, JavaScript, C, C#.NET and Go.
Like so:
package main
import (
"fmt"
"strconv"
"math"
"log"
)
func main() {
numberText := "7046260"
numberFloat, err := strconv.ParseFloat(numberText, 64)
if err != nil {
log.Fatal(err)
}
if math.IsNaN(numberFloat) {
log.Fatal("NaN value encountered")
}
fmt.Printf("%.0f",numberFloat)
fmt.Println()
}

How in golang to remove the last letter from the string?

Let's say I have a string called varString.
varString := "Bob,Mark,"
QUESTION: How to remove the last letter from the string? In my case, it's the second comma.
How to remove the last letter from the string?
In Go, character strings are UTF-8 encoded. Unicode UTF-8 is a variable-length character encoding which uses one to four bytes per Unicode character (code point).
For example,
package main
import (
"fmt"
"unicode/utf8"
)
func trimLastChar(s string) string {
r, size := utf8.DecodeLastRuneInString(s)
if r == utf8.RuneError && (size == 0 || size == 1) {
size = 0
}
return s[:len(s)-size]
}
func main() {
s := "Bob,Mark,"
fmt.Println(s)
s = trimLastChar(s)
fmt.Println(s)
}
Playground: https://play.golang.org/p/qyVYrjmBoVc
Output:
Bob,Mark,
Bob,Mark
Here's a much simpler method that works for unicode strings too:
func removeLastRune(s string) string {
r := []rune(s)
return string(r[:len(r)-1])
}
Playground link: https://play.golang.org/p/ezsGUEz0F-D
Something like this:
s := "Bob,Mark,"
s = s[:len(s)-1]
Note that this does not work if the last character is not represented by just one byte.
newStr := strings.TrimRightFunc(str, func(r rune) bool {
return !unicode.IsLetter(r) // or any other validation can go here
})
This will trim anything that isn't a letter on the right hand side.

Does go provide variable sanitization?

I am a beginner in Golang.
I have a problem with variable type assigning from user input.
When the user enters data like "2012BV352" I need to be able to ignore the BV and pass 2012352 to my next function.
There has a package name gopkg.in/validator.v2 in doc
But what it returns is whether or not the variable is safe or not.
I need to cut off the unusual things.
Any idea on how to achieve this?
You could write your own sanitizing methods and if it becomes something you'll be using more often, I'd package it out and add other methods to cover more use cases.
I provide two different ways to achieve the same result. One is commented out.
I haven't run any benchmarks so i couldn't tell you for certain which is more performant, but you could write your own tests if you wanted to figure it out. It would also expose another important aspect of Go and in my opinion one of it's more powerful tools... testing.
package main
import (
"fmt"
"log"
"regexp"
"strconv"
"strings"
)
// using a regex here which simply targets all digits and ignores everything else. I make it a global var and use MustCompile because the
// regex doesn't need to be created every time.
var extractInts = regexp.MustCompile(`\d+`)
func SanitizeStringToInt(input string) (int, error) {
m := extractInts.FindAllString(input, -1)
s := strings.Join(m, "")
return strconv.Atoi(s)
}
/*
// if you didn't want to use regex you could use a for loop
func SanitizeStringToInt(input string) (int, error) {
var s string
for _, r := range input {
if !unicode.IsLetter(r) {
s += string(r)
}
}
return strconv.Atoi(s)
}
*/
func main() {
a := "2012BV352"
n, err := SanitizeStringToInt(a)
if err != nil {
log.Fatal(err)
}
fmt.Println(n)
}

How to create a case insensitive map in Go?

I want to have a key insensitive string as key.
Is it supported by the language or do I have to create it myself?
thank you
Edit: What I am looking for is a way to make it by default instead of having to remember to convert the keys every time I use the map.
Edit: My initial code actually still allowed map syntax and thus allowed the methods to be bypassed. This version is safer.
You can "derive" a type. In Go we just say declare. Then you define methods on your type. It just takes a very thin wrapper to provide the functionality you want. Note though, that you must call get and set with ordinary method call syntax. There is no way to keep the index syntax or optional ok result that built in maps have.
package main
import (
"fmt"
"strings"
)
type ciMap struct {
m map[string]bool
}
func newCiMap() ciMap {
return ciMap{m: make(map[string]bool)}
}
func (m ciMap) set(s string, b bool) {
m.m[strings.ToLower(s)] = b
}
func (m ciMap) get(s string) (b, ok bool) {
b, ok = m.m[strings.ToLower(s)]
return
}
func main() {
m := newCiMap()
m.set("key1", true)
m.set("kEy1", false)
k := "keY1"
b, _ := m.get(k)
fmt.Println(k, "value is", b)
}
Two possiblities:
Convert to uppercase/lowercase if you're input set is guaranteed to be restricted to only characters for which a conversion to uppercase/lowercase will yield correct results (may not be true for some Unicode characters)
Convert to Unicode fold case otherwise:
Use unicode.SimpleFold(rune) to convert a unicode rune to fold case. Obviously this is dramatically more expensive an operation than simple ASCII-style case mapping, but it is also more portable to other languages. See the source code for EqualsFold to see how this is used, including how to extract Unicode runes from your source string.
Obviously you'd abstract this functionality into a separate package instead of re-implementing it everywhere you use the map. This should go without saying, but then you never know.
Here is something more robust than just strings.ToLower, you can use
the golang.org/x/text/cases package. Example:
package main
import "golang.org/x/text/cases"
func main() {
s := cases.Fold().String("March")
println(s == "march")
}
If you want to use something from the standard library, I ran this test:
package main
import (
"strings"
"unicode"
)
func main() {
var (
lower, upper int
m = make(map[string]bool)
)
for n := '\u0080'; n <= '\u07FF'; n++ {
q, r := n, n
for {
q = unicode.SimpleFold(q)
if q == n { break }
for {
r = unicode.SimpleFold(r)
if r == n { break }
s, t := string(q), string(r)
if m[t + s] { continue }
if strings.ToLower(s) == strings.ToLower(t) { lower++ }
if strings.ToUpper(s) == strings.ToUpper(t) { upper++ }
m[s + t] = true
}
}
}
println(lower == 951, upper == 989)
}
So as can be seen, ToUpper is the marginally better choice.

Resources