How to stop a regexp from matching across adjacent options in Go? - go

This is an example of a multiple-choice question. I want to extract the Chinese text, such as "英国、法国", "加拿大、墨西哥", "葡萄牙、加拿大", "墨西哥、德国", from the content in the following Go code, but it does not work.
package main
import (
"fmt"
"regexp"
"testing"
)
func TestRegex(t *testing.T) {
text := `( B )38.目前,亚马逊美国站后台,除了有美国站点外,还有( )站点。
A.英国、法国B.加拿大、墨西哥
C.葡萄牙、加拿大D.墨西哥、德国
`
fmt.Printf("%q\n", regexp.MustCompile(`[A-E]\.(\S+)?`).FindAllStringSubmatch(text, -1))
fmt.Printf("%q\n", regexp.MustCompile(`[A-E]\.`).Split(text, -1))
}
text:
( B )38.目前,亚马逊美国站后台,除了有美国站点外,还有( )站点。
A.英国、法国B.加拿大、墨西哥
C.葡萄牙、加拿大D.墨西哥、德国
pattern: [A-E]\.(\S+)?
Actual result: [["A.英国、法国B.加拿大、墨西哥" "英国、法国B.加拿大、墨西哥"] ["C.葡萄牙、加拿大D.墨西哥、德国" "葡萄牙、加拿大D.墨西哥、德国"]].
Expected result: [["A.英国、法国" "英国、法国"] ["B.加拿大、墨西哥" "加拿大、墨西哥"] ["C.葡萄牙、加拿大" "葡萄牙、加拿大"] ["D.墨西哥、德国" "墨西哥、德国"]]
I think it might be a greedy-matching problem, because my code reads option A and option B as a single option.

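Non-greedy matching won't solve this: a lazy \S+? at the end of the pattern simply stops after the first character, because nothing after it forces the match to extend. You need a positive lookahead, which Go's regexp package (RE2 syntax) doesn't support.
As a workaround, you can search for the labels only and extract the text in between manually: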
re := regexp.MustCompile(`[A-E]\.`)
res := re.FindAllStringIndex(text, -1)
results := make([][]string, len(res))
for i, m := range res {
if i < len(res)-1 {
results[i] = []string{text[m[0]:m[1]], text[m[1]:res[i+1][0]]}
} else {
results[i] = []string{text[m[0]:m[1]], text[m[1]:]}
}
}
fmt.Printf("%q\n", results)
Should print
[["A." "英国、法国"] ["B." "加拿大、墨西哥\n"] ["C." "葡萄牙、加拿大"] ["D." "墨西哥、德国\n"]]
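For this particular input, a negated character class can stand in for the missing lookahead. The sketch below is not part of the answer above; it assumes that an option's text never contains whitespace or the ASCII letters A-E, and `extractOptions` is a hypothetical helper name. Under that assumption it yields the four pairs listed as the expected result in the question.

```go
package main

import (
	"fmt"
	"regexp"
)

// optionRE assumes option text contains no whitespace and no ASCII letters
// A-E, so the negated class stops right before the next label or newline.
var optionRE = regexp.MustCompile(`[A-E]\.([^A-E\s]+)`)

// extractOptions (hypothetical helper) returns one [label+text, text] pair
// per option found in text.
func extractOptions(text string) [][]string {
	return optionRE.FindAllStringSubmatch(text, -1)
}

func main() {
	text := "( B )38.目前,亚马逊美国站后台,除了有美国站点外,还有( )站点。\n" +
		"A.英国、法国B.加拿大、墨西哥\n" +
		"C.葡萄牙、加拿大D.墨西哥、德国\n"
	fmt.Printf("%q\n", extractOptions(text))
}
```

If option text may contain the letters A-E, the index-based workaround above is the safer choice.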

Related

How can I clean the text for search using RegEx

I can use the code below to check whether the text str contains either or both of the keys, i.e. whether it contains "MS", "dynamics", or both:
package main
import (
"fmt"
"regexp"
)
func main() {
keys := []string{"MS", "dynamics"}
keysReg := fmt.Sprintf("(%s %s)|%s|%s", keys[0], keys[1], keys[0], keys[1]) // => "(MS dynamics)|MS|dynamics"
fmt.Println(keysReg)
str := "What is MS dynamics, is it a product from MS?"
re := regexp.MustCompile(`(?i)` + keysReg)
matches := re.FindAllString(str, -1)
fmt.Println("We found", len(matches), "matches, that are:", matches)
}
I want the user to enter a phrase; I then trim unwanted words and characters and run the search as above.
Let's say the user input is This,is,a,delimited,string and I need to build the keys variable dynamically as (delimited string)|delimited|string so that I can search my variable str for all the matches. I wrote the following:
s := "This,is,a,delimited,string"
t := regexp.MustCompile(`(?i),|\.|this|is|a`) // backticks delimit the raw-string pattern; (?i) makes it case-insensitive
v := t.Split(s, -1)
fmt.Println(len(v))
fmt.Println(v)
But I got the output as:
8
[ delimited string]
What is wrong with my cleaning of the input text? I'm expecting the output to be:
2
[delimited string]
Here is my playground
To quote the famous quip from Jamie Zawinski,
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
Two things:
Instead of trying to weed out garbage from the string ("cleaning" it), extract complete words from it.
Unicode is a complicated matter, so even after you have succeeded in extracting words, you have to make sure they are properly "escaped" so that they do not contain any characters that might be interpreted as RE syntax before you build a regexp from them.
package main
import (
"errors"
"fmt"
"regexp"
"strings"
)
func build(words ...string) (*regexp.Regexp, error) {
var sb strings.Builder
switch len(words) {
case 0:
return nil, errors.New("empty input")
case 1:
return regexp.Compile(regexp.QuoteMeta(words[0]))
}
quoted := make([]string, len(words))
for i, w := range words {
quoted[i] = regexp.QuoteMeta(w)
}
sb.WriteByte('(')
for i, w := range quoted {
if i > 0 {
sb.WriteByte('\x20')
}
sb.WriteString(w)
}
sb.WriteString(`)|`)
for i, w := range quoted {
if i > 0 {
sb.WriteByte('|')
}
sb.WriteString(w)
}
return regexp.Compile(sb.String())
}
var words = regexp.MustCompile(`\pL+`)
func main() {
allWords := words.FindAllString("\tThis\v\x20\x20,\t\tis\t\t,?a!,¿delimited?,string‽", -1)
re, err := build(allWords...)
if err != nil {
panic(err)
}
fmt.Println(re)
}
Further reading:
https://pkg.go.dev/regexp/syntax
https://pkg.go.dev/regexp#QuoteMeta
https://pkg.go.dev/unicode#pkg-variables and https://pkg.go.dev/unicode#Categories
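The empty strings in the question's output come from Split itself: it returns an empty string between every pair of adjacent matches. A minimal sketch of the "extract words instead" advice that also drops the unwanted words is shown below. The helper name `keywords` and the stop-word set are assumptions made for illustration, not part of the answer above.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// wordRE matches runs of Unicode letters, as in the answer above.
var wordRE = regexp.MustCompile(`\pL+`)

// keywords (hypothetical helper) extracts whole words from input and skips
// those listed in stopWords; the comparison is case-insensitive.
func keywords(input string, stopWords map[string]bool) []string {
	var out []string
	for _, w := range wordRE.FindAllString(input, -1) {
		if !stopWords[strings.ToLower(w)] {
			out = append(out, w)
		}
	}
	return out
}

func main() {
	// The stop-word list is an assumption taken from the question's pattern.
	stop := map[string]bool{"this": true, "is": true, "a": true}
	v := keywords("This,is,a,delimited,string", stop)
	fmt.Println(len(v))
	fmt.Println(v)
}
```

Because whole words are compared, the "a" inside a longer word is left alone, which the alternation `this|is|a` in the question's Split pattern cannot guarantee. The resulting slice can then be passed to build(v...).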

How can I trim whitespaces in Go from a slice after Split

I have a string that is comma separated, so it could be
test1, test2, test3 or test1,test2,test3 or test1, test2, test3.
I currently split this in Go with strings.Split(s, ","), but now I have a []string whose elements can contain an arbitrary amount of whitespace.
How can I easily trim them off? What is best practice here?
This is my current code
var property= os.Getenv(env.templateDirectories)
if property != "" {
var dirs = strings.Split(property, ",")
for index,ele := range dirs {
dirs[index] = strings.TrimSpace(ele)
}
return dirs
}
I come from Java and assumed that there is a map/reduce etc functionality in Go also, therefore the question.
You can use strings.TrimSpace in a loop. To modify the slice in place, range over the indexes rather than the values, because the loop value is only a copy of each element:
package main
import (
"fmt"
"strings"
)
func main() {
input := "test1, test2, test3"
slc := strings.Split(input , ",")
for i := range slc {
slc[i] = strings.TrimSpace(slc[i])
}
fmt.Println(slc)
}
An easy way without looping (note that this removes every space, including spaces inside a field):
test := "2 , 123, 1"
result := strings.Split(strings.ReplaceAll(test," ","") , ",")
The encoding/csv package can handle this:
package main
import (
"encoding/csv"
"fmt"
"strings"
)
func main() {
for _, each := range []string{
"test1, test2, test3", "test1, test2, test3", "test1,test2,test3",
} {
r := csv.NewReader(strings.NewReader(each))
r.TrimLeadingSpace = true
s, e := r.Read()
if e != nil {
panic(e)
}
fmt.Printf("%q\n", s)
}
}
https://golang.org/pkg/encoding/csv#Reader.TrimLeadingSpace
If you already use regexp, you can split with a regular expression:
regexp.MustCompile(`\s*,\s*`).Split(test, -1)
This solution is probably slower than the standard Split + TrimSpace, but it is more flexible. For example, if you want to skip empty fields, you can use:
regexp.MustCompile(`(\s*,\s*)+`).Split(test, -1)
or, to allow multiple separators:
regexp.MustCompile(`\s*[,;]\s*`).Split(test, -1)
You can test it in the go playground.
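One detail of the regexp variant is that `\s*,\s*` only consumes whitespace around the commas, so whitespace at the very start or end of the input stays attached to the first and last field. A small sketch that combines strings.Split and strings.TrimSpace in a reusable helper is shown below. The name `splitAndTrim` is hypothetical, and dropping empty fields is a design choice made here rather than something the question asks for.

```go
package main

import (
	"fmt"
	"strings"
)

// splitAndTrim (hypothetical helper) splits s on commas, trims surrounding
// whitespace from every field and drops fields that end up empty.
func splitAndTrim(s string) []string {
	parts := strings.Split(s, ",")
	out := make([]string, 0, len(parts))
	for _, p := range parts {
		if p = strings.TrimSpace(p); p != "" {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	fmt.Printf("%q\n", splitAndTrim("  test1 ,   test2,test3 , "))
}
```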

Slice unicode/ascii strings in golang?

I need to slice a string in Go. Possible values can contain Latin characters and/or Arabic/Chinese characters. In the following example, the slice expression [:1] on the Arabic string returns an unexpected value.
package main
import "fmt"
func main() {
a := "a"
fmt.Println(a[:1]) // works
b := "ذ"
fmt.Println(b[:1]) // does not work
fmt.Println(b[:2]) // works
fmt.Println(len(a) == len(b)) // false
}
http://play.golang.org/p/R-JxaxbfNL
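First of all, you should really read about strings, bytes and runes in Go: indexing or slicing a string operates on bytes, and a non-ASCII character such as "ذ" occupies more than one byte in UTF-8.
Here is how you can achieve what you want by converting to []rune first: Go playground (I was not able to properly paste Arabic symbols, but if Chinese works, Arabic should work too).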
s := "abcdefghijklmnop"
fmt.Println(s[2:9])
s = "维基百科:关于中文维基百科"
fmt.Println(string([]rune(s)[2:9]))
The output is:
cdefghi
百科:关于中文
You can use the utf8string package:
package main
import "golang.org/x/exp/utf8string"
func main() {
a := utf8string.NewString("🎈🎄🎀🎢👓")
// example 1
r := a.At(1)
// example 2
s := a.Slice(1, 3)
// example 3
n := a.RuneCount()
// print
println(r == '🎄', s == "🎄🎀", n == 5)
}
https://pkg.go.dev/golang.org/x/exp/utf8string

Evaluate/Execute Golang code/expressions like js' eval()

Is there an eval()-like function in Go?
Evaluate/Execute JavaScript code/expressions:
var x = 10;
var y = 20;
var a = eval("x * y") + "<br>";
var b = eval("2 + 2") + "<br>";
var c = eval("x + 17") + "<br>";
var res = a + b + c;
The result of res will be:
200
4
27
Is this possible in golang? and why?
It's perfectly possible, at least for expressions, which seems to be what you want.
Have a look at:
https://golang.org/src/go/types/eval.go
https://golang.org/src/go/constant/value.go
https://golang.org/pkg/go/types/#Scope
You'd need to create your own Package and Scope objects and Insert constants to the package's scope. Constants are created using types.NewConst by providing appropriate type information.
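A rough sketch of that approach with go/types is shown below. The helper name `evalConst`, the package name "main" and the choice of typed int constants are assumptions made for illustration. Note that this only works for constant expressions, since types.Eval type-checks the expression and reports a value only when it is constant.

```go
package main

import (
	"fmt"
	"go/constant"
	"go/token"
	"go/types"
)

// evalConst (hypothetical helper) evaluates a constant expression in which
// every name in vars is declared as an int constant.
func evalConst(expr string, vars map[string]int64) (string, error) {
	pkg := types.NewPackage("main", "main")
	for name, v := range vars {
		c := types.NewConst(token.NoPos, pkg, name, types.Typ[types.Int], constant.MakeInt64(v))
		pkg.Scope().Insert(c)
	}
	// With an invalid position, the expression is evaluated in the package scope.
	tv, err := types.Eval(token.NewFileSet(), pkg, token.NoPos, expr)
	if err != nil {
		return "", err
	}
	if tv.Value == nil {
		return "", fmt.Errorf("%q is not a constant expression", expr)
	}
	return tv.Value.String(), nil
}

func main() {
	vars := map[string]int64{"x": 10, "y": 20}
	for _, e := range []string{"x * y", "2 + 2", "x + 17"} {
		v, err := evalConst(e, vars)
		fmt.Println(e, "=", v, err)
	}
}
```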
Is this possible in golang? and why?
No, because Go is not that kind of language. It is intended to be compiled, not interpreted, so the runtime contains no “string to code” transformer and does not even know what a syntactically correct program looks like.
Note that in Go, as in most other programming languages, you can write your own interpreter, that is, a function that takes a string and causes computations to be done accordingly. The Go designers simply chose not to force a feature of such dubious value and security on everyone who does not need it.
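To make the "write your own interpreter" remark concrete, here is a deliberately tiny sketch that reuses go/parser for the syntax and walks the resulting AST. It handles only number literals, variables, parentheses and the four basic arithmetic operators. The names `evalNode` and `evalString` are hypothetical, and treating every value as a float64 is a simplification chosen for brevity.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"strconv"
)

// evalNode evaluates a parsed arithmetic expression; anything other than
// literals, variables, parentheses and + - * / is reported as an error.
func evalNode(n ast.Expr, vars map[string]float64) (float64, error) {
	switch e := n.(type) {
	case *ast.BasicLit:
		if e.Kind != token.INT && e.Kind != token.FLOAT {
			return 0, fmt.Errorf("unsupported literal %s", e.Value)
		}
		return strconv.ParseFloat(e.Value, 64)
	case *ast.Ident:
		v, ok := vars[e.Name]
		if !ok {
			return 0, fmt.Errorf("unknown variable %q", e.Name)
		}
		return v, nil
	case *ast.ParenExpr:
		return evalNode(e.X, vars)
	case *ast.BinaryExpr:
		l, err := evalNode(e.X, vars)
		if err != nil {
			return 0, err
		}
		r, err := evalNode(e.Y, vars)
		if err != nil {
			return 0, err
		}
		switch e.Op {
		case token.ADD:
			return l + r, nil
		case token.SUB:
			return l - r, nil
		case token.MUL:
			return l * r, nil
		case token.QUO:
			return l / r, nil
		}
	}
	return 0, fmt.Errorf("unsupported expression")
}

// evalString parses src as a Go expression and evaluates it.
func evalString(src string, vars map[string]float64) (float64, error) {
	expr, err := parser.ParseExpr(src)
	if err != nil {
		return 0, err
	}
	return evalNode(expr, vars)
}

func main() {
	vars := map[string]float64{"x": 10, "y": 20}
	for _, src := range []string{"x * y", "2 + 2", "x + 17"} {
		v, err := evalString(src, vars)
		fmt.Println(src, "=", v, err)
	}
}
```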
There is no built-in eval, but it is possible to implement evaluation that follows most of the Go spec: see the eval package (expressions only, not arbitrary code) on GitHub / on godoc.
Example:
import "github.com/apaxa-go/eval"
...
src := "int8(1*(1+2))"
expr, err := eval.ParseString(src, "")
if err != nil {
return err
}
r, err := expr.EvalToInterface(nil)
if err != nil {
return err
}
fmt.Printf("%v %T", r, r) // "3 int8"
It is also possible to use variables in the evaluated expression, but they must be passed to the Eval method together with their names.
This example parses Go code at runtime (it only parses the source, it does not evaluate it):
package main
import (
"fmt"
"go/parser"
"go/token"
)
func main() {
fset := token.NewFileSet() // positions are relative to fset
src := `package foo
import (
"fmt"
"time"
)
func bar() {
fmt.Println(time.Now())
}`
// Parse src but stop after processing the imports.
f, err := parser.ParseFile(fset, "", src, parser.ImportsOnly)
if err != nil {
fmt.Println(err)
return
}
// Print the imports from the file's AST.
for _, s := range f.Imports {
fmt.Println(s.Path.Value)
}
}
The go-exprtk package can evaluate mathematical expressions dynamically and will probably cover most needs of that kind:
package main
import (
"fmt"
"github.com/Pramod-Devireddy/go-exprtk"
)
func main() {
exprtkObj := exprtk.NewExprtk()
exprtkObj.SetExpression("x * y")
exprtkObj.AddDoubleVariable("x")
exprtkObj.AddDoubleVariable("y")
exprtkObj.CompileExpression()
exprtkObj.SetDoubleVariableValue("x", 10)
exprtkObj.SetDoubleVariableValue("y", 20)
a := exprtkObj.GetEvaluatedValue()
exprtkObj.SetExpression("2 + 2")
exprtkObj.CompileExpression()
b := exprtkObj.GetEvaluatedValue()
exprtkObj.SetExpression("x + 17")
exprtkObj.CompileExpression()
c := exprtkObj.GetEvaluatedValue()
res := a + b + c
fmt.Println(a, b, c, res)
}

strings - get characters before a digit

I have some strings such as E2 9NZ, N29DZ, EW29DZ. I need to extract the characters before the first digit; for the examples above: E, N, EW.
Am I supposed to use a regex? The strings package looks really nice but doesn't seem to handle this case (extract everything before a specific type of character).
Edit:
To clarify the question: I'm wondering which method is more idiomatic in Go and perhaps likely to perform better.
For example,
package main
import (
"fmt"
"unicode"
)
func DigitPrefix(s string) string {
for i, r := range s {
if unicode.IsDigit(r) {
return s[:i]
}
}
return s
}
func main() {
fmt.Println(DigitPrefix("E2 9NZ"))
fmt.Println(DigitPrefix("N29DZ"))
fmt.Println(DigitPrefix("EW29DZ"))
fmt.Println(DigitPrefix("WXYZ"))
}
Output:
E
N
EW
WXYZ
If there is no digit, example "WXYZ", and you don't want anything returned, change return s to return "".
Not sure why almost everyone provided answers in everything but Go. Here is a regex-based Go version:
package main
import (
"fmt"
"regexp"
)
func main() {
pattern, err := regexp.Compile("^[^\\d]*")
if err != nil {
panic(err)
}
part := pattern.Find([]byte("EW29DZ"))
if part != nil {
fmt.Printf("Found: %s\n", string(part))
} else {
fmt.Println("Not found")
}
}
Running:
% go run main.go
Found: EW
Go playground
We don't need a regex for this problem. You can simply walk through a slice of runes and check each character with unicode.IsDigit(): if it is a digit, return everything before it; if it isn't, continue the loop. If there are no digits, return the argument unchanged.
Code
package main
import (
"fmt"
"unicode"
)
func UntilDigit(r []rune) []rune {
var i int
for _, v := range r {
if unicode.IsDigit(v) {
return r[0:i]
}
i++
}
return r
}
func main() {
fmt.Println(string(UntilDigit([]rune("E2 9NZ"))))
fmt.Println(string(UntilDigit([]rune("N29DZ"))))
fmt.Println(string(UntilDigit([]rune("EW29DZ"))))
}
Playground link
I think the best option is to use the index returned by strings.IndexAny, which returns the index of the first occurrence in the string of any character from a given set.
func BeforeNumbers(str string) string {
value := strings.IndexAny(str,"0123456789")
if value >= 0 && value <= len(str) {
return str[:value]
}
return str
}
This slices the string and returns the substring up to (but not including) the first character that appears in "0123456789", i.e. the first digit.
Way later edit:
It would probably be better to use IndexFunc rather than IndexAny:
func BeforeNumbers(str string) string {
indexFunc := func(r rune) bool {
return r >= '0' && r <= '9'
}
value := strings.IndexFunc(str,indexFunc)
if value >= 0 && value <= len(str) {
return str[:value]
}
return str
}
This is more or less equivalent to the loop version, and it eliminates the search through the digit string that IndexAny in my previous version performs for every character. I think it looks cleaner than the loop version, but that is obviously a matter of taste.
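As a variation on the IndexFunc idea, unicode.IsDigit already has the func(rune) bool signature that strings.IndexFunc expects, so it can be passed directly. In the sketch below, `prefixBeforeDigit` is a hypothetical name. Be aware that unicode.IsDigit also accepts non-ASCII decimal digits, unlike the '0' to '9' comparison above.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// prefixBeforeDigit (hypothetical helper) returns the part of s before its
// first decimal digit, or s unchanged when it contains no digit.
func prefixBeforeDigit(s string) string {
	if i := strings.IndexFunc(s, unicode.IsDigit); i >= 0 {
		return s[:i]
	}
	return s
}

func main() {
	for _, s := range []string{"E2 9NZ", "N29DZ", "EW29DZ", "WXYZ"} {
		fmt.Println(prefixBeforeDigit(s))
	}
}
```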
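The Java code below keeps grabbing characters until it reaches a digit (or the end of the string):
int i = 0;
String string2test = "EW29DZ";
String stringOutput = "";
// The length check avoids a StringIndexOutOfBoundsException when there is no digit.
while (i < string2test.length() && !Character.isDigit(string2test.charAt(i)))
{
stringOutput = stringOutput + string2test.charAt(i);
i++;
}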
