How can I clean the text for search using RegEx - go

I can use the code below to check whether the text str contains either or both of the keys, i.e. whether it contains "MS", "dynamics", or both of them:
package main
import (
"fmt"
"regexp"
)
func main() {
keys := []string{"MS", "dynamics"}
keysReg := fmt.Sprintf("(%s %s)|%s|%s", keys[0], keys[1], keys[0], keys[1]) // => "(MS dynamics)|MS|dynamics"
fmt.Println(keysReg)
str := "What is MS dynamics, is it a product from MS?"
re := regexp.MustCompile(`(?i)` + keysReg)
matches := re.FindAllString(str, -1)
fmt.Println("We found", len(matches), "matches, that are:", matches)
}
I want the user to enter a phrase, strip unwanted words and characters from it, and then run the search as above.
Say the user input is This,is,a,delimited,string. I need to build the keys pattern dynamically as (delimited string)|delimited|string so that I can search str for all the matches, so I wrote the following:
s := "This,is,a,delimited,string"
t := regexp.MustCompile(`(?i),|\.|this|is|a`) // backticks are used here to contain the expression, (?i) for case-insensitive
v := t.Split(s, -1)
fmt.Println(len(v))
fmt.Println(v)
But I got the output as:
8
[      delimited string]
What is wrong with my cleaning of the input text? I expect the output to be:
2
[delimited string]
Here is my playground

To quote the famous quip from Jamie Zawinski,
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
Two things:
Instead of trying to weed garbage out of the string ("cleaning" it), extract the complete words from it.
Unicode is a complicated matter, so even after you have succeeded in extracting words, you have to make sure they are properly "escaped", i.e. that they contain no characters which might be interpreted as RE syntax, before building a regexp from them.
package main
import (
"errors"
"fmt"
"regexp"
"strings"
)
func build(words ...string) (*regexp.Regexp, error) {
var sb strings.Builder
switch len(words) {
case 0:
return nil, errors.New("empty input")
case 1:
return regexp.Compile(regexp.QuoteMeta(words[0]))
}
quoted := make([]string, len(words))
for i, w := range words {
quoted[i] = regexp.QuoteMeta(w)
}
sb.WriteByte('(')
for i, w := range quoted {
if i > 0 {
sb.WriteByte('\x20')
}
sb.WriteString(w)
}
sb.WriteString(`)|`)
for i, w := range quoted {
if i > 0 {
sb.WriteByte('|')
}
sb.WriteString(w)
}
return regexp.Compile(sb.String())
}
var words = regexp.MustCompile(`\pL+`)
func main() {
allWords := words.FindAllString("\tThis\v\x20\x20,\t\tis\t\t,?a!,¿delimited?,string‽", -1)
re, err := build(allWords...)
if err != nil {
panic(err)
}
fmt.Println(re)
}
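As for why the original Split call returned 8 elements: Split returns the text between matches, so every pair of adjacent matches ("This" followed by ",", and so on) contributes an empty string. A minimal sketch of the "extract words" approach that also drops the unwanted words might look like the following. The stopWords set and the keywords and pattern helpers are hypothetical names chosen for illustration, not part of the code above.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// wordRe matches runs of Unicode letters, as in the answer above.
var wordRe = regexp.MustCompile(`\pL+`)

// stopWords is an assumed list of words to ignore; adjust to taste.
var stopWords = map[string]bool{"this": true, "is": true, "a": true}

// keywords extracts the words of input and drops the stop words.
func keywords(input string) []string {
	var out []string
	for _, w := range wordRe.FindAllString(input, -1) {
		if !stopWords[strings.ToLower(w)] {
			out = append(out, w)
		}
	}
	return out
}

// pattern builds "(?i)(w1 w2)|w1|w2" from the quoted words.
// It returns "" for empty input.
func pattern(words []string) string {
	if len(words) == 0 {
		return ""
	}
	quoted := make([]string, len(words))
	for i, w := range words {
		quoted[i] = regexp.QuoteMeta(w)
	}
	return "(?i)(" + strings.Join(quoted, " ") + ")|" + strings.Join(quoted, "|")
}

func main() {
	keys := keywords("This,is,a,delimited,string")
	fmt.Println(len(keys), keys) // 2 [delimited string]
	fmt.Println(pattern(keys))   // (?i)(delimited string)|delimited|string
}
```

Whole words are compared against the stop list, so a word such as "delimited" is never cut apart just because it contains a shorter stop word.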
Further reading:
https://pkg.go.dev/regexp/syntax
https://pkg.go.dev/regexp#QuoteMeta
https://pkg.go.dev/unicode#pkg-variables and https://pkg.go.dev/unicode#Categories

Related

How to parse email addresses from a long string in Golang

How can I extract only email addresses from a long string in Golang? For example:
"a bunch of irrelevant text fjewiwofjfjvnvkdlslsosiejwoqlwpwpwo
mail=jim.halpert#gmail.com,ou=f,c=US
mail=apple.pie#gmail.com,ou=f,c=US
mail=hello.world#gmail.com,ou=f,c=US
mail=alex.alex#gmail.com,ou=f,c=US
mail=bob.jim#gmail.com,ou=people,ou=f,c=US
mail=arnold.schwarzenegger#gmail.com,ou=f,c=US"
This would return a list of all the emails:
[jim.halpert#gmail.com, apple.pie#gmail.com, etc...]
Each email address would begin with "mail=" and end with a comma ",".
For this you need to break the long string down into the parts you need. You can filter and search with regular expressions that match the email pattern shown above.
Here's a piece of code that uses a regular expression to first obtain each section starting with "mail=", and then formats the email by removing the prefix and the trailing comma:
package main

import (
	"fmt"
	"regexp"
	"strings"
)

func main() {
	// Match "mail=", the address characters, and the trailing comma.
	re := regexp.MustCompile(`mail=[A-Za-z.#0-9]+,`)
	str := `a bunch of irrelevant text fjewiwofjfjvnvkdlslsosiejwoqlwpwpwo
mail=jim.halpert#gmail.com,ou=f,c=US
mail=apple.pie#gmail.com,ou=f,c=US
mail=hello.world#gmail.com,ou=f,c=US
mail=alex.alex#gmail.com,ou=f,c=US
mail=bob.jim#gmail.com,ou=people,ou=f,c=US
mail=arnold.schwarzenegger#gmail.com,ou=f,c=US`
	for i, match := range re.FindAllString(str, -1) {
		fmt.Println(match, "is match number", i)
		// Keep the part after "mail=" and drop the trailing comma.
		email := strings.Split(match, "=")[1]
		email = strings.ReplaceAll(email, ",", "")
		fmt.Println(email)
	}
}
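The post-processing step can be avoided with a capture group. The sketch below assumes, as the question states, that each address starts with "mail=" at the beginning of a line and ends at the next comma (or at the end of the line). The extractEmails name is hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// mailRe captures everything between a line-initial "mail=" and the
// next comma or line end. (?m) makes ^ match at the start of each line.
var mailRe = regexp.MustCompile(`(?m)^mail=([^,\n]+)`)

// extractEmails returns the captured group of every match.
func extractEmails(text string) []string {
	var out []string
	for _, m := range mailRe.FindAllStringSubmatch(text, -1) {
		out = append(out, m[1])
	}
	return out
}

func main() {
	text := "irrelevant text\n" +
		"mail=jim.halpert#gmail.com,ou=f,c=US\n" +
		"mail=bob.jim#gmail.com,ou=people,ou=f,c=US"
	fmt.Println(extractEmails(text))
}
```

This does not validate the address itself; it only relies on the "mail=" prefix and the comma delimiter.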
While I agree with the comment from user datenwolf, here is another version which does not involve regular expressions.
It also handles more complex email formats, including a comma within the local part, which is awkward to implement with regexp.
See https://stackoverflow.com/a/2049510/11892070
package main

import (
	"bufio"
	"fmt"
	"strings"
)

var str = `a bunch of irrelevant text fjewiwofjfjvnvkdlslsosiejwoqlwpwpwo
mail=jim.halpert#gmail.com,ou=f,c=US
mail=apple.pie#gmail.com,ou=f,c=US
mail=hello.world#gmail.com,ou=f,c=US
mail=alex.alex#gmail.com,ou=f,c=US
mail=bob.jim#gmail.com,ou=people,ou=f,c=US
mail=arnold.schwarzenegger#gmail.com,ou=f,c=US
mail=(comented)arnold.schwarzenegger#gmail.com,ou=f,c=US
mail="(with comma inside)arnold,schwarzenegger#gmail.com",ou=f,c=US
mail=nocommaatall#gmail.com`

func main() {
	var emails []string
	sc := bufio.NewScanner(strings.NewReader(str))
	for sc.Scan() {
		t := sc.Text()
		if !strings.HasPrefix(t, "mail=") {
			continue
		}
		t = t[len("mail="):]
		// Look for the first comma after the '#'.
		at := strings.Index(t, "#")
		if at < 0 {
			// No '#' on this line, so there is no address to extract
			// (slicing with at == -1 would panic).
			continue
		}
		comma := strings.Index(t[at:], ",")
		if comma < 0 {
			// No comma after the '#': the rest of the line is the address.
			emails = append(emails, strings.TrimSpace(t))
			continue
		}
		comma += at
		emails = append(emails, strings.TrimSpace(t[:comma]))
	}
	for _, e := range emails {
		fmt.Println(e)
	}
}
You can use this package to do that:
https://github.com/hamidteimouri/htutils/blob/main/htregex/htregex.go
// Emails finds all email strings
func Emails(text string) []string {
return match(text, EmailsRegex)
}
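You can also use the standard library's regexp package directly, via regexp.Compile or regexp.MustCompile. In the snippet below, regexEmail is assumed to hold your email pattern. Note that FindStringSubmatch returns only the first match; use FindAllString to get every address.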
r, _ := regexp.Compile(regexEmail)
newVariable := `a bunch of irrelevant text fjewiwofjfjvnvkdlslsosiejwoqlwpwpwo
mail=jim.halpert#gmail.com,ou=f,c=US
mail=apple.pie#gmail.com,ou=f,c=US
mail=hello.world#gmail.com,ou=f,c=US
mail=alex.alex#gmail.com,ou=f,c=US
mail=bob.jim#gmail.com,ou=people,ou=f,c=US
mail=arnold.schwarzenegger#gmail.com,ou=f,c=US`
fmt.Printf("%#v\n", r.FindStringSubmatch(newVariable))
fmt.Printf("%#v\n", r.SubexpNames())

how to realize mismatch of regexp in golang?

This is an example of a multiple-choice question. I want to extract the Chinese option texts such as "英国、法国", "加拿大、墨西哥", "葡萄牙、加拿大", "墨西哥、德国" from the content in the following Go code, but it does not work.
package main
import (
"fmt"
"regexp"
"testing"
)
func TestRegex(t *testing.T) {
text := `( B )38.目前,亚马逊美国站后台,除了有美国站点外,还有( )站点。
A.英国、法国B.加拿大、墨西哥
C.葡萄牙、加拿大D.墨西哥、德国
`
fmt.Printf("%q\n", regexp.MustCompile(`[A-E]\.(\S+)?`).FindAllStringSubmatch(text, -1))
fmt.Printf("%q\n", regexp.MustCompile(`[A-E]\.`).Split(text, -1))
}
text:
( B )38.目前,亚马逊美国站后台,除了有美国站点外,还有( )站点。
A.英国、法国B.加拿大、墨西哥
C.葡萄牙、加拿大D.墨西哥、德国
pattern: [A-E]\.(\S+)?
Actual result: [["A.英国、法国B.加拿大、墨西哥" "英国、法国B.加拿大、墨西哥"] ["C.葡萄牙、加拿大D.墨西哥、德国" "葡萄牙、加拿大D.墨西哥、德国"]].
Expect result: [["A.英国、法国" "英国、法国"] ["B.加拿大、墨西哥" "加拿大、墨西哥"] ["C.葡萄牙、加拿大" "葡萄牙、加拿大"] ["D.墨西哥、德国" "墨西哥、德国"]]
I think it might be a greedy mode problem. Because in my code, it reads option A and option B as one option directly.
Non-greedy matching won't solve this; you would need a positive lookahead, which RE2 (the syntax Go's regexp package follows) doesn't support.
As a workaround, you can search for the labels only and extract the text in between manually.
re := regexp.MustCompile(`[A-E]\.`)
res := re.FindAllStringIndex(text, -1)
results := make([][]string, len(res))
for i, m := range res {
if i < len(res)-1 {
results[i] = []string{text[m[0]:m[1]], text[m[1]:res[i+1][0]]}
} else {
results[i] = []string{text[m[0]:m[1]], text[m[1]:]}
}
}
fmt.Printf("%q\n", results)
This should print:
[["A." "英国、法国"] ["B." "加拿大、墨西哥\n"] ["C." "葡萄牙、加拿大"] ["D." "墨西哥、德国\n"]]
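If the trailing newlines in the output are unwanted, the same index-based approach can trim each piece. The following is a sketch only; splitOptions is a hypothetical name, and it assumes that option labels are the only places where a capital letter A to E is followed by a dot.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// label matches an option label such as "A." or "B.".
var label = regexp.MustCompile(`[A-E]\.`)

// splitOptions returns {label, text} pairs, where text is everything
// between one label and the next, with surrounding whitespace trimmed.
func splitOptions(text string) [][]string {
	idx := label.FindAllStringIndex(text, -1)
	out := make([][]string, 0, len(idx))
	for i, m := range idx {
		end := len(text) // the last option runs to the end of the text
		if i+1 < len(idx) {
			end = idx[i+1][0]
		}
		out = append(out, []string{text[m[0]:m[1]], strings.TrimSpace(text[m[1]:end])})
	}
	return out
}

func main() {
	text := "A.英国、法国B.加拿大、墨西哥\nC.葡萄牙、加拿大D.墨西哥、德国\n"
	fmt.Printf("%q\n", splitOptions(text))
	// [["A." "英国、法国"] ["B." "加拿大、墨西哥"] ["C." "葡萄牙、加拿大"] ["D." "墨西哥、德国"]]
}
```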

How can I trim whitespaces in Go from a slice after Split

I have a string that is comma separated, so it could be
test1, test2, test3 or test1,test2,test3 or test1, test2, test3.
I currently split this in Go with strings.Split(s, ","), but now I have a []string whose elements can contain an arbitrary amount of whitespace.
How can I easily trim them off? What is best practice here?
This is my current code
var property= os.Getenv(env.templateDirectories)
if property != "" {
var dirs = strings.Split(property, ",")
for index,ele := range dirs {
dirs[index] = strings.TrimSpace(ele)
}
return dirs
}
I come from Java and assumed that there is a map/reduce etc functionality in Go also, therefore the question.
You can use strings.TrimSpace in a loop. If you want to preserve order too, the indexes can be used rather than values as the loop parameters:
Go Playground Example
EDIT: To see the code without the click:
package main
import (
"fmt"
"strings"
)
func main() {
input := "test1, test2, test3"
slc := strings.Split(input , ",")
for i := range slc {
slc[i] = strings.TrimSpace(slc[i])
}
fmt.Println(slc)
}
An easy way without looping (note that this removes every space, including spaces inside a field):
test := "2 , 123, 1"
result := strings.Split(strings.ReplaceAll(test," ","") , ",")
The encoding/csv package can handle this:
package main
import (
"encoding/csv"
"fmt"
"strings"
)
func main() {
for _, each := range []string{
"test1, test2, test3", "test1, test2, test3", "test1,test2,test3",
} {
r := csv.NewReader(strings.NewReader(each))
r.TrimLeadingSpace = true
s, e := r.Read()
if e != nil {
panic(e)
}
fmt.Printf("%q\n", s)
}
}
https://golang.org/pkg/encoding/csv#Reader.TrimLeadingSpace
If you already use regexp, maybe you can split using regular expressions:
regexp.MustCompile(`\s*,\s*`).Split(test, -1)
This solution is probably slower than the standard Split + TrimSpace, but it is more flexible. For example, if you want to skip empty fields you can use:
regexp.MustCompile(`(\s*,\s*)+`).Split(test, -1)
or, to use multiple separators:
regexp.MustCompile(`\s*[,;]\s*`).Split(test, -1)
You can test it in the go playground.
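One edge case worth noting: the pattern `\s*,\s*` only consumes whitespace next to a comma, so leading whitespace on the first field and trailing whitespace on the last field survive the split. A minimal sketch that trims the whole input first (splitTrim is a hypothetical name):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// sep matches a comma together with any whitespace around it.
var sep = regexp.MustCompile(`\s*,\s*`)

// splitTrim trims the outer whitespace of s and then splits on sep,
// so no returned field has leading or trailing whitespace.
func splitTrim(s string) []string {
	return sep.Split(strings.TrimSpace(s), -1)
}

func main() {
	fmt.Printf("%q\n", splitTrim("  test1 , test2,test3  "))
	// ["test1" "test2" "test3"]
}
```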

Text processing in Go - how to convert string to byte?

I'm writing a small program to number paragraphs:
it puts the paragraph number in front of each paragraph, in the form [1]..., [2]....
The article title should be excluded.
Here is my program:
package main
import (
"fmt"
"io/ioutil"
)
var s_end = [3]string{".", "!", "?"}
func main() {
b, err := ioutil.ReadFile("i_have_a_dream.txt")
if err != nil {
panic(err)
}
p_num, s_num := 1, 1
for _, char := range b {
fmt.Printf("[%s]", p_num)
p_num += 1
if char == byte("\n") {
fmt.Printf("\n[%s]", p_num)
p_num += 1
} else {
fmt.Printf(char)
}
}
}
http://play.golang.org/p/f4S3vQbglY
I got this error:
prog.go:21: cannot convert "\n" to type byte
prog.go:21: cannot convert "\n" (type string) to type byte
prog.go:21: invalid operation: char == "\n" (mismatched types byte and string)
prog.go:25: cannot use char (type byte) as type string in argument to fmt.Printf
[process exited with non-zero status]
How to convert string to byte?
What is the general practice to process text? Read in, parse it by byte, or by line?
Update
I solved the problem by converting the byte buffer to a string and replacing substrings with a regular expression. (Thanks to @Tomasz Kłak for the regexp help.)
I put the code here for reference.
package main
import (
"fmt"
"io/ioutil"
"regexp"
)
func main() {
b, err := ioutil.ReadFile("i_have_a_dream.txt")
if err != nil {
panic(err)
}
s := string(b)
r := regexp.MustCompile("(\r\n)+")
counter := 1
repl := func(match string) string {
p_num := counter
counter++
return fmt.Sprintf("%s [%d] ", match, p_num)
}
fmt.Println(r.ReplaceAllStringFunc(s, repl))
}
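On the "by byte or by line" question, a line-oriented version with bufio.Scanner is one option. The sketch below rests on assumptions about the input that the question does not confirm: every non-empty line is one paragraph, and the first non-empty line is the title. numberParagraphs is a hypothetical name, and the sample text stands in for the file contents.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// numberParagraphs prefixes each non-empty line after the title with
// "[n] ". Blank lines are kept as they are. Assumption: one line per
// paragraph, first non-empty line is the title.
func numberParagraphs(text string) string {
	var sb strings.Builder
	sc := bufio.NewScanner(strings.NewReader(text))
	titleSeen := false
	n := 0
	for sc.Scan() {
		line := sc.Text() // the line without its "\n" or "\r\n"
		switch {
		case strings.TrimSpace(line) == "":
			sb.WriteByte('\n')
		case !titleSeen:
			titleSeen = true
			sb.WriteString(line)
			sb.WriteByte('\n')
		default:
			n++
			fmt.Fprintf(&sb, "[%d] %s\n", n, line)
		}
	}
	return sb.String()
}

func main() {
	fmt.Print(numberParagraphs("I Have a Dream\n\nFirst paragraph.\n\nSecond paragraph.\n"))
}
```

Because the scanner strips both "\n" and "\r\n", this works regardless of the file's line endings. For very long lines the scanner's default buffer limit may need raising with sc.Buffer, and sc.Err() should be checked when reading from a real file.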
"\n" is a string literal, which cannot be converted to or compared with a byte; use '\n', a rune literal, to get a single character that can be.
A string cannot be converted into a byte in a meaningful way. Use one of the following approaches:
If you have a string literal like "a", consider using a rune literal like 'a' which can be converted into a byte.
If you want to take a byte out of a string, use an index expression like myString[42].
If you want to interpret the content of a string as a (decimal) number, use strconv.Atoi() or strconv.ParseInt().
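The three approaches above can be sketched as follows. countNewlines is a hypothetical helper used only to show the comparison that failed in the question.

```go
package main

import (
	"fmt"
	"strconv"
)

// countNewlines compares each byte with the rune literal '\n'.
// The literal is an untyped constant, so it converts to byte implicitly.
func countNewlines(b []byte) int {
	n := 0
	for _, c := range b {
		if c == '\n' {
			n++
		}
	}
	return n
}

func main() {
	fmt.Println(countNewlines([]byte("a\nb\nc"))) // 2

	// Indexing a string yields a byte.
	s := "hello"
	fmt.Printf("%c\n", s[1]) // e

	// Interpreting the content of a string as a decimal number.
	n, err := strconv.Atoi("42")
	fmt.Println(n, err) // 42 <nil>
}
```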
Please notice that it is customary in Go to write programs that can deal with Unicode characters. Explaining how to do this would be too much for this answer, but there are tutorials out there which explain what kind of things to pay attention to.

strings - get characters before a digit

I have some strings such as E2 9NZ, N29DZ, EW29DZ. I need to extract the characters before the first digit; for the examples above: E, N, EW.
Am I supposed to use regex? The strings package looks really nice but just doesn't seem to handle this case (extract everything before a specific type of character).
Edit:
To clarify the "question" I'm wondering what method is more idiomatic to go and perhaps likely to provide better performance.
For example,
package main
import (
"fmt"
"unicode"
)
func DigitPrefix(s string) string {
for i, r := range s {
if unicode.IsDigit(r) {
return s[:i]
}
}
return s
}
func main() {
fmt.Println(DigitPrefix("E2 9NZ"))
fmt.Println(DigitPrefix("N29DZ"))
fmt.Println(DigitPrefix("EW29DZ"))
fmt.Println(DigitPrefix("WXYZ"))
}
Output:
E
N
EW
WXYZ
If there is no digit, example "WXYZ", and you don't want anything returned, change return s to return "".
Not sure why almost everyone provided answers in everything but Go. Here is regex-based Go version:
package main
import (
"fmt"
"regexp"
)
func main() {
pattern, err := regexp.Compile("^[^\\d]*")
if err != nil {
panic(err)
}
part := pattern.Find([]byte("EW29DZ"))
if part != nil {
fmt.Printf("Found: %s\n", string(part))
} else {
fmt.Println("Not found")
}
}
Running:
% go run main.go
Found: EW
Go playground
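A variant of the regex approach, sketched under the assumption that the function is called many times: compile the pattern once at package level and use FindString so no []byte conversion is needed. The prefix function name is hypothetical. Note that in Go's regexp syntax \d and \D are ASCII-only, so unlike unicode.IsDigit this treats non-ASCII digits as ordinary characters.

```go
package main

import (
	"fmt"
	"regexp"
)

// nonDigitPrefix matches the (possibly empty) run of non-digit
// characters at the start of the string. \D means "not 0-9".
var nonDigitPrefix = regexp.MustCompile(`^\D*`)

// prefix returns everything before the first ASCII digit of s,
// or all of s if it contains no digit.
func prefix(s string) string {
	return nonDigitPrefix.FindString(s)
}

func main() {
	for _, s := range []string{"E2 9NZ", "N29DZ", "EW29DZ", "WXYZ"} {
		fmt.Println(prefix(s))
	}
}
```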
We don't need regex for this problem. You can simply walk through a slice of runes and check each character with unicode.IsDigit(): if it is a digit, return what came before it; if it isn't, continue the loop. If there are no digits, return the argument unchanged.
Code
package main
import (
"fmt"
"unicode"
)
func UntilDigit(r []rune) []rune {
var i int
for _, v := range r {
if unicode.IsDigit(v) {
return r[0:i]
}
i++
}
return r
}
func main() {
fmt.Println(string(UntilDigit([]rune("E2 9NZ"))))
fmt.Println(string(UntilDigit([]rune("N29DZ"))))
fmt.Println(string(UntilDigit([]rune("EW29DZ"))))
}
Playground link
I think the best option is to use the index returned by strings.IndexAny, which gives the index of the first occurrence in the string of any character from a given set.
func BeforeNumbers(str string) string {
value := strings.IndexAny(str,"0123456789")
if value >= 0 && value <= len(str) {
return str[:value]
}
return str
}
This slices the string and returns the part up to (but not including) the first character that appears in "0123456789", i.e. the first digit.
Way later edit:
It would probably be better to use IndexFunc rather than IndexAny:
func BeforeNumbers(str string) string {
indexFunc := func(r rune) bool {
return r >= '0' && r <= '9'
}
value := strings.IndexFunc(str,indexFunc)
if value >= 0 && value <= len(str) {
return str[:value]
}
return str
}
This is more or less equivalent to the loop version, and it avoids what IndexAny did in my previous answer: searching the set string for a match on every character. I also think it looks cleaner than the loop version, though that is obviously a matter of taste.
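If Unicode digits should count as well, unicode.IsDigit already has the signature IndexFunc expects, so it can be passed directly. A short sketch (beforeDigit is a hypothetical name):

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// beforeDigit returns the part of s before its first digit (as defined
// by unicode.IsDigit), or all of s if it contains no digit.
func beforeDigit(s string) string {
	if i := strings.IndexFunc(s, unicode.IsDigit); i >= 0 {
		return s[:i]
	}
	return s
}

func main() {
	fmt.Println(beforeDigit("E2 9NZ")) // E
	fmt.Println(beforeDigit("EW29DZ")) // EW
	fmt.Println(beforeDigit("WXYZ"))   // WXYZ
}
```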
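The code below (Java, not Go) keeps grabbing characters until it reaches a digit or the end of the string.
int i = 0;
String string2test = "EW29DZ";
String stringOutput = "";
// The bounds check avoids an exception when the string contains no digit.
while (i < string2test.length() && !Character.isDigit(string2test.charAt(i)))
{
    stringOutput = stringOutput + string2test.charAt(i);
    i++;
}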
