Text processing in Go - how to convert string to byte?

I'm writing a small program to number paragraphs:
put the paragraph number in front of each paragraph in the form [1]..., [2]....
The article title should be excluded.
Here is my program:
package main
import (
"fmt"
"io/ioutil"
)
var s_end = [3]string{".", "!", "?"}
func main() {
b, err := ioutil.ReadFile("i_have_a_dream.txt")
if err != nil {
panic(err)
}
p_num, s_num := 1, 1
for _, char := range b {
fmt.Printf("[%s]", p_num)
p_num += 1
if char == byte("\n") {
fmt.Printf("\n[%s]", p_num)
p_num += 1
} else {
fmt.Printf(char)
}
}
}
http://play.golang.org/p/f4S3vQbglY
I got this error:
prog.go:21: cannot convert "\n" to type byte
prog.go:21: cannot convert "\n" (type string) to type byte
prog.go:21: invalid operation: char == "\n" (mismatched types byte and string)
prog.go:25: cannot use char (type byte) as type string in argument to fmt.Printf
[process exited with non-zero status]
How to convert string to byte?
What is the general practice to process text? Read in, parse it by byte, or by line?
Update
I solved the problem by converting the byte buffer to a string and replacing the line breaks with a regular expression. (Thanks to @Tomasz Kłak for the regexp help.)
I put the code here for reference.
package main
import (
"fmt"
"io/ioutil"
"regexp"
)
func main() {
b, err := ioutil.ReadFile("i_have_a_dream.txt")
if err != nil {
panic(err)
}
s := string(b)
r := regexp.MustCompile("(\r\n)+")
counter := 1
repl := func(match string) string {
p_num := counter
counter++
return fmt.Sprintf("%s [%d] ", match, p_num)
}
fmt.Println(r.ReplaceAllStringFunc(s, repl))
}

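"\n" is a string literal, and a string cannot be converted to or compared with a byte. Use the rune literal '\n' instead: it is an untyped constant, so char == '\n' compiles when char is a byte.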
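As a rough illustration of the rune-literal fix applied to the original program, here is a minimal sketch. The helper name numberParagraphs is hypothetical, it assumes every newline ends a paragraph, and it does not implement the "exclude the title" rule from the question.

```go
package main

import (
	"fmt"
	"strings"
)

// numberParagraphs is a hypothetical helper: it prefixes each line of text
// with [1], [2], ... and treats every '\n' as a paragraph break.
func numberParagraphs(text string) string {
	var sb strings.Builder
	n := 1
	fmt.Fprintf(&sb, "[%d] ", n)
	for i := 0; i < len(text); i++ {
		c := text[i] // indexing a string yields a byte
		sb.WriteByte(c)
		// Compare the byte with the rune literal '\n', not the string "\n".
		// Skip the marker after a trailing newline at the very end.
		if c == '\n' && i < len(text)-1 {
			n++
			fmt.Fprintf(&sb, "[%d] ", n)
		}
	}
	return sb.String()
}

func main() {
	fmt.Print(numberParagraphs("first\nsecond\n"))
}
```

Bytes are copied through unchanged, so multi-byte UTF-8 characters in the text survive even though the loop works byte by byte.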

A string cannot be converted into a byte in a meaningful way. Use one of the following approaches:
If you have a string literal like "a", consider using a rune literal like 'a' which can be converted into a byte.
If you want to take a byte out of a string, use an index expression like myString[42].
If you want to interpret the content of a string as a (decimal) number, use strconv.Atoi() or strconv.ParseInt().
Please note that Go programs are customarily written to handle Unicode text. Explaining how to do this would be too much for this answer, but there are tutorials that explain what to pay attention to.
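The three approaches listed above can be sketched as follows. The helper names isNewline and firstByte are made up for this example; strconv.Atoi is the standard library function mentioned in the answer.

```go
package main

import (
	"fmt"
	"strconv"
)

// isNewline compares a byte with a rune literal (an untyped constant).
func isNewline(c byte) bool {
	return c == '\n'
}

// firstByte uses an index expression to take a byte out of a string.
// It returns 0 for an empty string instead of panicking.
func firstByte(s string) byte {
	if len(s) == 0 {
		return 0
	}
	return s[0]
}

func main() {
	var b byte = 'a' // a rune literal that fits in a byte
	fmt.Println(b, isNewline('\n')) // 97 true

	fmt.Println(firstByte("hello")) // 104, the byte value of 'h'

	// Interpret the content of a string as a decimal number.
	n, err := strconv.Atoi("42")
	if err != nil {
		panic(err)
	}
	fmt.Println(n + 1) // 43

	// A whole string converts to a byte slice, not to a single byte.
	fmt.Println([]byte("hi")) // [104 105]
}
```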

Related

How can I clean the text for search using RegEx

I can use the code below to check whether the text str contains either or both of the keys, i.e. whether it contains "MS", "dynamics", or both:
package main
import (
"fmt"
"regexp"
)
func main() {
keys := []string{"MS", "dynamics"}
keysReg := fmt.Sprintf("(%s %s)|%s|%s", keys[0], keys[1], keys[0], keys[1]) // => "(MS dynamics)|MS|dynamics"
fmt.Println(keysReg)
str := "What is MS dynamics, is it a product from MS?"
re := regexp.MustCompile(`(?i)` + keysReg)
matches := re.FindAllString(str, -1)
fmt.Println("We found", len(matches), "matches, that are:", matches)
}
I want the user to enter a phrase, then trim the unwanted words and characters from it and run the search as above.
Let's say the user input is This,is,a,delimited,string. I need to build the keys expression dynamically as (delimited string)|delimited|string so that I can search my variable str for all the matches, so I wrote the following:
s := "This,is,a,delimited,string"
t := regexp.MustCompile(`(?i),|\.|this|is|a`) // backticks are used here to contain the expression, (?i) for case insensitive
v := t.Split(s, -1)
fmt.Println(len(v))
fmt.Println(v)
But I got the output as:
8
[ delimited string]
What is wrong with the way I clean the input text? I expect the output to be:
2
[delimited string]
Here is my playground
To quote the famous quip from Jamie Zawinski,
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
Two things:
Instead of trying to weed out garbage from the string ("cleaning" it), extract complete words from it instead.
Unicode is a complicated matter, so even after you have succeeded in extracting words, you have to make sure they are properly "escaped" so that they contain no characters which might be interpreted as RE syntax before you build a regexp from them.
package main
import (
"errors"
"fmt"
"regexp"
"strings"
)
func build(words ...string) (*regexp.Regexp, error) {
var sb strings.Builder
switch len(words) {
case 0:
return nil, errors.New("empty input")
case 1:
return regexp.Compile(regexp.QuoteMeta(words[0]))
}
quoted := make([]string, len(words))
for i, w := range words {
quoted[i] = regexp.QuoteMeta(w)
}
sb.WriteByte('(')
for i, w := range quoted {
if i > 0 {
sb.WriteByte('\x20')
}
sb.WriteString(w)
}
sb.WriteString(`)|`)
for i, w := range quoted {
if i > 0 {
sb.WriteByte('|')
}
sb.WriteString(w)
}
return regexp.Compile(sb.String())
}
var words = regexp.MustCompile(`\pL+`)
func main() {
allWords := words.FindAllString("\tThis\v\x20\x20,\t\tis\t\t,?a!,¿delimited?,string‽", -1)
re, err := build(allWords...)
if err != nil {
panic(err)
}
fmt.Println(re)
}
Further reading:
https://pkg.go.dev/regexp/syntax
https://pkg.go.dev/regexp#QuoteMeta
https://pkg.go.dev/unicode#pkg-variables and https://pkg.go.dev/unicode#Categories
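To make the difference between "cleaning" and "extracting" concrete, here is a small sketch. It reruns the Split call from the question to show where the empty strings come from, then extracts words with \pL+ and drops unwanted ones. The stopWords list and the keywords helper are assumptions made for this example, not part of the answer above.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var wordRE = regexp.MustCompile(`\pL+`)

// stopWords is a hypothetical list of words to ignore.
var stopWords = map[string]bool{"this": true, "is": true, "a": true}

// keywords extracts runs of letters from input and drops the stop words.
func keywords(input string) []string {
	var out []string
	for _, w := range wordRE.FindAllString(input, -1) {
		if !stopWords[strings.ToLower(w)] {
			out = append(out, w)
		}
	}
	return out
}

func main() {
	s := "This,is,a,delimited,string"

	// Split returns the text between matches. Adjacent matches such as
	// "This" followed by "," leave an empty string between them.
	parts := regexp.MustCompile(`(?i),|\.|this|is|a`).Split(s, -1)
	fmt.Println(len(parts)) // 8
	fmt.Printf("%q\n", parts)

	fmt.Printf("%q\n", keywords(s)) // ["delimited" "string"]
}
```

The resulting slice can then be passed to a builder such as build(words...) above.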

Struct field tag `name` not compatible with reflect.StructTag.Get: bad syntax for struct tag pair

I read this, but my case is different. I have the code below:
package main
import (
"bytes"
"fmt"
"reflect"
"strconv"
"strings"
)
type User struct {
Name string `name`
Age int64 `age`
}
func main() {
var u User = User{"bob", 10}
res, err := JSONEncode(u)
if err != nil {
panic(err)
}
fmt.Println(string(res))
}
func JSONEncode(v interface{}) ([]byte, error) {
refObjVal := reflect.ValueOf(v)
refObjTyp := reflect.TypeOf(v)
buf := bytes.Buffer{}
if refObjVal.Kind() != reflect.Struct {
return buf.Bytes(), fmt.Errorf(
"val of kind %s is not supported",
refObjVal.Kind(),
)
}
buf.WriteString("{")
pairs := []string{}
for i := 0; i < refObjVal.NumField(); i++ {
structFieldRefObj := refObjVal.Field(i)
structFieldRefObjTyp := refObjTyp.Field(i)
switch structFieldRefObj.Kind() {
case reflect.String:
strVal := structFieldRefObj.Interface().(string)
pairs = append(pairs, `"`+string(structFieldRefObjTyp.Tag)+`":"`+strVal+`"`)
case reflect.Int64:
intVal := structFieldRefObj.Interface().(int64)
pairs = append(pairs, `"`+string(structFieldRefObjTyp.Tag)+`":`+strconv.FormatInt(intVal, 10))
default:
return buf.Bytes(), fmt.Errorf(
"struct field with name %s and kind %s is not supported",
structFieldRefObjTyp.Name,
structFieldRefObj.Kind(),
)
}
}
buf.WriteString(strings.Join(pairs, ","))
buf.WriteString("}")
return buf.Bytes(), nil
}
It works perfectly and gives this output:
{"name":"bob","age":10}
But VS Code reports the following problem:
struct field tag `name` not compatible with reflect.StructTag.Get: bad syntax for struct tag pair
What could be the issue?
Note that that's just a warning telling you that you're not following convention. The code, as you already know, compiles and runs and outputs the result you want: https://go.dev/play/p/gxcv8qPVZ6z.
To avoid the warning, disable your linter, or, better yet, follow the convention by using key:"value" in the struct tags and then extract the value by using the Get method: https://go.dev/play/p/u0VTGL48TjO.
https://pkg.go.dev/reflect@go1.18.3#StructTag
A StructTag is the tag string in a struct field.
By convention, tag strings are a concatenation of optionally
space-separated key:"value" pairs. Each key is a non-empty string
consisting of non-control characters other than space (U+0020 ' '),
quote (U+0022 '"'), and colon (U+003A ':'). Each value is quoted using
U+0022 '"' characters and Go string literal syntax.
A struct tag is supposed to be a key:"value" pair, for example field:"name".
type User struct {
Name string `field:"name"`
Age int64 `field:"age"`
}
Instead of field, as in the other answer, you can use json:"keyname":
type User struct {
Name string `json:"name"`
Age int64 `json:"age"`
}
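Once the tags follow the key:"value" convention, the value can be read with the Get or Lookup method of reflect.StructTag instead of converting the whole tag to a string. A minimal sketch, where tagNames is a hypothetical helper and falling back to the Go field name is an assumption of this example:

```go
package main

import (
	"fmt"
	"reflect"
)

type User struct {
	Name string `json:"name"`
	Age  int64  `json:"age"`
}

// tagNames returns, for each field of the struct v, the value stored under
// the given tag key. Fields without that key fall back to the Go field name.
// It returns nil if v is not a struct.
func tagNames(v interface{}, key string) []string {
	t := reflect.TypeOf(v)
	if t == nil || t.Kind() != reflect.Struct {
		return nil
	}
	names := make([]string, 0, t.NumField())
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		name, ok := f.Tag.Lookup(key) // Lookup also reports whether the key exists
		if !ok {
			name = f.Name
		}
		names = append(names, name)
	}
	return names
}

func main() {
	fmt.Println(tagNames(User{"bob", 10}, "json")) // [name age]
}
```

In JSONEncode from the question, this would amount to replacing string(structFieldRefObjTyp.Tag) with structFieldRefObjTyp.Tag.Get("json").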

Go: CSV NewReader not getting the correct number of fields

How do I get the correct number of fields when using NewReader?
package main
import (
"encoding/csv"
"fmt"
"log"
"strings"
)
func main() {
parser := csv.NewReader(strings.NewReader(`||""FOO""||`))
parser.Comma = '|'
parser.LazyQuotes = true
record, err := parser.Read()
if err != nil {
log.Fatal(err)
}
fmt.Printf("record length: %v\n", len(record))
}
https://go.dev/play/p/gg-KYRciWFH
It should return 5, but instead I'm getting 3:
record length: 3
Program exited.
EDIT
I'm actually working with a big CSV file containing many double quotes.
After examining your code, I decided to modify it slightly and then print the results:
package main
import (
"encoding/csv"
"fmt"
"log"
"strings"
)
func main() {
parser := csv.NewReader(strings.NewReader(`x||""FOO""|x|x\n`))
parser.Comma = '|'
parser.LazyQuotes = true
record, err := parser.Read()
if err != nil {
log.Fatal(err)
}
fmt.Printf("record length: %v, Data: %v\n", len(record), strings.Join(record, ", "))
}
When you run this, the data is printed as x, , "FOO"||x|x\n". My guess is that because the entry ends with two double quotes, the parser assumes it is still inside a quoted field and therefore lumps the rest of the line into the third entry. This looks like a bug in how lazy quoting works in the csv package. However, the documentation for LazyQuotes says:
If LazyQuotes is true, a quote may appear in an unquoted field and a non-doubled quote may appear in a quoted field.
This doesn't mention anything about finding double quotes within double quotes. To fix this, you should either remove the quotes altogether or replace the double double-quotes ("") with double quotes (").
One other thing you might consider would be using the gocsv package. I've worked with this package in the past and it's reasonably stable. I'm not sure how it would respond to this specific issue, but it might be worth your time checking it out.
Note:
The encoding/csv package implements the RFC 4180 standard. If you have such input, that's not an RFC 4180 compliant CSV file and encoding/csv will not parse it properly.
You're misusing the quotes. Quoting a single field FOO is like this:
parser := csv.NewReader(strings.NewReader(`||"FOO"||`))
If you want the field to have the "FOO" value, you have to use 2 double quotes in a quoted field, so it should be:
parser := csv.NewReader(strings.NewReader(`||"""FOO"""||`))
This will output 5. Try it on the Go Playground.
What you have is this:
parser := csv.NewReader(strings.NewReader(`||""FOO""||`))
Since the second " character is not followed by a separator character, the field is not interrupted and the rest is processed as the content of the quoted field (which will terminate at the end of the line).
If you print the record:
fmt.Println(record)
fmt.Printf("%#v", record)
Output will be (try it on the Go Playground):
[ "FOO"||]
[]string{"", "", "\"FOO\"||"}
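The three quoting variants discussed above can be compared side by side. This is a small sketch, with parseLine as a hypothetical helper that wraps the same csv.Reader settings used in the question:

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// parseLine reads one record from line using '|' as the separator.
func parseLine(line string, lazy bool) ([]string, error) {
	r := csv.NewReader(strings.NewReader(line))
	r.Comma = '|'
	r.LazyQuotes = lazy
	return r.Read()
}

func main() {
	inputs := []string{
		`||"FOO"||`,     // quoted field with the value FOO
		`||"""FOO"""||`, // quoted field with the value "FOO"
		`||""FOO""||`,   // input from the question
	}
	for _, in := range inputs {
		rec, err := parseLine(in, true)
		if err != nil {
			fmt.Println(in, "error:", err)
			continue
		}
		fmt.Printf("%s -> %d fields: %q\n", in, len(rec), rec)
	}

	// Without LazyQuotes the input from the question is rejected outright.
	_, err := parseLine(`||""FOO""||`, false)
	fmt.Println("strict mode error:", err)
}
```

The first two inputs yield 5 fields each, while the third yields 3 because everything after the opening quote ends up in a single field.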
Quotes are part of the CSV format.
The problem is with how the quotes are escaped for encoding/csv. You can try something like this:
package main
import (
"encoding/csv"
"fmt"
"log"
"strings"
)
func main() {
parser := csv.NewReader(strings.NewReader(`||FOO||`))
parser.Comma = '|'
parser.LazyQuotes = true
record, err := parser.Read()
if err != nil {
log.Fatal(err)
}
fmt.Printf("record length: %v\n", len(record))
fmt.Println(strings.Join(record, " /SEP/ "))
}
or like this:
package main
import (
"encoding/csv"
"fmt"
"log"
"strings"
)
func main() {
parser := csv.NewReader(strings.NewReader(`||"""FOO"""||`))
parser.Comma = '|'
parser.LazyQuotes = true
record, err := parser.Read()
if err != nil {
log.Fatal(err)
}
fmt.Printf("record length: %v\n", len(record))
fmt.Println(strings.Join(record, " SEP "))
}

How in golang to remove the last letter from the string?

Let's say I have a string called varString.
varString := "Bob,Mark,"
QUESTION: How to remove the last letter from the string? In my case, it's the second comma.
How to remove the last letter from the string?
In Go, character strings are UTF-8 encoded. Unicode UTF-8 is a variable-length character encoding which uses one to four bytes per Unicode character (code point).
For example,
package main
import (
"fmt"
"unicode/utf8"
)
func trimLastChar(s string) string {
r, size := utf8.DecodeLastRuneInString(s)
if r == utf8.RuneError && (size == 0 || size == 1) {
size = 0
}
return s[:len(s)-size]
}
func main() {
s := "Bob,Mark,"
fmt.Println(s)
s = trimLastChar(s)
fmt.Println(s)
}
Playground: https://play.golang.org/p/qyVYrjmBoVc
Output:
Bob,Mark,
Bob,Mark
Here's a much simpler method that works for Unicode strings too:
func removeLastRune(s string) string {
r := []rune(s)
if len(r) == 0 {
return s // nothing to remove; avoids a slice bounds panic
}
return string(r[:len(r)-1])
}
Playground link: https://play.golang.org/p/ezsGUEz0F-D
Something like this:
s := "Bob,Mark,"
s = s[:len(s)-1]
Note that this does not work if the last character is not represented by just one byte.
newStr := strings.TrimRightFunc(str, func(r rune) bool {
return !unicode.IsLetter(r) // or any other validation can go here
})
This will trim anything that isn't a letter on the right hand side.
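When the character to remove is known in advance, as with the trailing comma in the question, strings.TrimSuffix from the standard library may be the simplest option. A minimal sketch, with dropTrailingComma as a hypothetical helper name:

```go
package main

import (
	"fmt"
	"strings"
)

// dropTrailingComma removes a single trailing comma, if there is one.
// Strings without a trailing comma are returned unchanged.
func dropTrailingComma(s string) string {
	return strings.TrimSuffix(s, ",")
}

func main() {
	fmt.Println(dropTrailingComma("Bob,Mark,")) // Bob,Mark
	fmt.Println(dropTrailingComma("Bob,Mark"))  // Bob,Mark
}
```

Unlike slicing with s[:len(s)-1], this does nothing when the suffix is absent, so it is safe on empty strings and on input that has no trailing comma.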

Replace a character at a specific location in a string

I know about strings.Replace(), and it works if you know exactly what to replace and how often it occurs. But what can I do if I want to replace a char at only a known position? I'm thinking of something like this:
randLetter := getRandomChar()
myText := "This is my text"
randPos := rand.Intn(len(myText) - 1)
newText := myText[:randPos] + randLetter + myText[randPos+1:]
But this does not replace the char at randPos, just inserts the randLetter at that position. Right?
I've written some code to replace the character found at indexofcharacter with the replacement. It may not be the best method, but it works fine.
https://play.golang.org/p/9CTgHRm6icK
// replaceAtPosition replaces the character at the 1-based position indexofcharacter.
func replaceAtPosition(originaltext string, indexofcharacter int, replacement string) string {
runes := []rune(originaltext)
partOne := string(runes[:indexofcharacter-1])
partTwo := string(runes[indexofcharacter:])
return partOne + replacement + partTwo
}
UTF-8 is a variable-length encoding. For example,
package main
import "fmt"
func insertChar(s string, c rune, i int) string {
if i >= 0 {
r := []rune(s)
if i < len(r) {
r[i] = c
s = string(r)
}
}
return s
}
func main() {
s := "Hello, 世界"
fmt.Println(s)
s = insertChar(s, 'X', len([]rune(s))-1)
fmt.Println(s)
}
Output:
Hello, 世界
Hello, 世X
A string is a read-only slice of bytes, so you can't replace anything in it in place.
A single rune can consist of multiple bytes, so you should convert the string to an intermediate, mutable slice of runes anyway:
myText := []rune("This is my text")
randPos := rand.Intn(len(myText)) // Intn(n) returns a value in [0, n), so every position can be picked
myText[randPos] = randLetter // randLetter must be a rune here
fmt.Println(string(myText))
