Handling Unicode in string search

Suppose I have a string containing Unicode characters. For example:
s := "foo 日本 foo!"
I'm trying to find the last occurrence of foo in the string:
index := strings.LastIndex(s, "foo")
The expected result here would be 7, but this returns 11 as the index because the string contains multi-byte Unicode characters.
Is there a way to handle this using standard library functions?

You're encountering the difference between runes and bytes in Go. Strings are sequences of bytes, not runes, and strings.LastIndex returns a byte offset. If you haven't read about this yet, see https://blog.golang.org/strings.
Here's my version of a quick function that calculates the number of runes preceding the last match of a substring in a string. The basic approach is to find the byte index, then iterate through the string's runes, counting them, until that number of bytes has been consumed.
I'm not aware of a standard library function that does this directly.
package main
import (
"fmt"
"strings"
)
func LastRuneIndex(s, substr string) (int, error) {
byteIndex := strings.LastIndex(s, substr)
if byteIndex < 0 {
return byteIndex, nil
}
reader := strings.NewReader(s)
count := 0
for byteIndex > 0 {
_, bytes, err := reader.ReadRune()
if err != nil {
return 0, err
}
byteIndex = byteIndex - bytes
count += 1
}
return count, nil
}
func main() {
s := "foo 日本 foo!"
count, err := LastRuneIndex(s, "foo")
fmt.Println(count, err)
// outputs:
// 7 <nil>
}
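The same rune offset can also be derived without a reader: every rune before the match lies entirely inside s[:byteIndex], so counting the runes in that prefix gives the same number. A minimal sketch, assuming the standard unicode/utf8 package; the helper name lastRuneIndex is hypothetical and not part of the answer above:

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// lastRuneIndex is a hypothetical helper: it returns the rune offset of the
// last occurrence of substr in s, or -1 if substr does not occur.
func lastRuneIndex(s, substr string) int {
	byteIndex := strings.LastIndex(s, substr)
	if byteIndex < 0 {
		return -1
	}
	// Count the runes in the prefix that precedes the match.
	return utf8.RuneCountInString(s[:byteIndex])
}

func main() {
	fmt.Println(lastRuneIndex("foo 日本 foo!", "foo")) // 7
}
```

If s contains invalid UTF-8, utf8.RuneCountInString counts each invalid byte as one rune, which may or may not be what you want.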

This gets pretty close:
package main
import (
"golang.org/x/text/language"
"golang.org/x/text/search"
)
func main() {
m := search.New(language.English)
start, end := m.IndexString("foo 日本 foo!", "foo")
println(start == 0, end == 3)
}
but it's searching forward. I tried this:
m.IndexString("foo 日本 foo!", "foo", search.Backwards)
but I get this result:
panic: TODO: implement
https://pkg.go.dev/golang.org/x/text/search
https://github.com/golang/text/blob/v0.3.6/search/search.go#L222-L223

Related

How to convert strings to lower case in GO?

I am new to Go and working on an assignment where I should write code that returns the word frequencies of a text. The words 'Hello', 'HELLO' and 'hello' should all be counted as 'hello', so I need to convert all strings to lower case.
I know that I should use strings.ToLower(), but I don't know where to put it in the code. Can someone please help me?
package main
import (
"fmt"
"io/ioutil"
"log"
"strings"
"time"
)
const DataFile = "loremipsum.txt"
// Return the word frequencies of the text argument.
func WordCount(text string) map[string]int {
fregs := make(map[string]int)
words := strings.Fields(text)
for _, word := range words {
fregs[word] += 1
}
return fregs
}
// Benchmark how long it takes to count word frequencies in text numRuns times.
//
// Return the total time elapsed.
func benchmark(text string, numRuns int) int64 {
start := time.Now()
for i := 0; i < numRuns; i++ {
WordCount(text)
}
runtimeMillis := time.Since(start).Nanoseconds() / 1e6
return runtimeMillis
}
// Print the results of a benchmark
func printResults(runtimeMillis int64, numRuns int) {
fmt.Printf("amount of runs: %d\n", numRuns)
fmt.Printf("total time: %d ms\n", runtimeMillis)
average := float64(runtimeMillis) / float64(numRuns)
fmt.Printf("average time/run: %.2f ms\n", average)
}
func main() {
// read in DataFile as a string called data
data, err:= ioutil.ReadFile("loremipsum.txt")
if err != nil {
log.Fatal(err)
}
// Convert []byte to string and print to screen
text := string(data)
fmt.Println(text)
fmt.Printf("%#v",WordCount(string(data)))
numRuns := 100
runtimeMillis := benchmark(string(data), numRuns)
printResults(runtimeMillis, numRuns)
}

You should convert the words to lowercase when you use them as the map key:
for _, word := range words {
fregs[strings.ToLower(word)] += 1
}

I get [a:822 a.:110 and I want every a counted together. How do I change the code so that a and a. are the same? – hello123
You need to carefully define a word. For example, a string of consecutive letters and numbers converted to lowercase.
func WordCount(s string) map[string]int {
wordFunc := func(r rune) bool {
return !unicode.IsLetter(r) && !unicode.IsNumber(r)
}
counts := make(map[string]int)
for _, word := range strings.FieldsFunc(s, wordFunc) {
counts[strings.ToLower(word)]++
}
return counts
}

To remove all non-word characters, you could use a regular expression:
package main
import (
"bufio"
"fmt"
"log"
"regexp"
"strings"
)
func main() {
str1 := "This is some text! I want to count each word. Is it cool?"
re, err := regexp.Compile(`[^\w]`)
if err != nil {
log.Fatal(err)
}
str1 = re.ReplaceAllString(str1, " ")
scanner := bufio.NewScanner(strings.NewReader(str1))
scanner.Split(bufio.ScanWords)
for scanner.Scan() {
fmt.Println(strings.ToLower(scanner.Text()))
}
}
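One caveat with the pattern above: in Go's regexp syntax, \w only covers ASCII letters, digits and underscore, so accented or non-Latin letters would be replaced by spaces as well. A hedged sketch of a Unicode-aware variant, using the \p{L} (letter) and \p{N} (number) classes; the name wordCountRegexp is hypothetical:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// nonWord matches runs of characters that are neither Unicode letters nor numbers.
var nonWord = regexp.MustCompile(`[^\p{L}\p{N}]+`)

// wordCountRegexp is a hypothetical helper: it splits text on non-word runs
// and counts the lowercased words.
func wordCountRegexp(text string) map[string]int {
	counts := make(map[string]int)
	for _, w := range nonWord.Split(text, -1) {
		if w == "" {
			continue // Split yields empty strings at leading/trailing separators
		}
		counts[strings.ToLower(w)]++
	}
	return counts
}

func main() {
	fmt.Println(wordCountRegexp("Hello, héllo! HELLO"))
}
```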

See strings.EqualFold.
Here is an example.
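The linked example is not reproduced here. As a stand-in, a minimal sketch of how strings.EqualFold might be used to compare words without first converting them to lower case; the helper name countFold is hypothetical:

```go
package main

import (
	"fmt"
	"strings"
)

// countFold is a hypothetical helper: it counts the words that equal target
// under Unicode case folding.
func countFold(words []string, target string) int {
	n := 0
	for _, w := range words {
		if strings.EqualFold(w, target) {
			n++
		}
	}
	return n
}

func main() {
	words := strings.Fields("Hello HELLO hello help")
	fmt.Println(countFold(words, "hello")) // 3
}
```

EqualFold answers "are these two strings the same word?"; for building a frequency map you still need a canonical key, so strings.ToLower remains the simpler choice there.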

Golang index an array of strings

I have a string like this:
"Style: Saison
ABV: 7.7
IBU: 20"
I'm trying to split it into an array so that I can get Saison.
Here is how I convert it to an array:
style :=strings.Split(style, "Style:")
When I do
style[0]
It doesn't give me Saison. I also tried style[1] and style[2], and nothing happens. What am I doing wrong?
style is a []string, so it is a list of strings, right?

You could use strings.FieldsFunc:
FieldsFunc splits the string s at each run of Unicode code points c
satisfying f(c) and returns an array of slices of s. If all code
points in s satisfy f(c) or the string is empty, an empty slice is
returned.
FieldsFunc makes no guarantees about the order in which it calls f(c)
and assumes that f always returns the same value for a given c.
package main
import (
"fmt"
"strconv"
"strings"
)
func main() {
str := `Style: Saison Drink
ABV: 7.7
IBU: 20`
f := func(c rune) bool {
return c == ':' || c == '\n'
}
strFields := strings.FieldsFunc(str, f)
fmt.Printf("%q\n", strFields)
styleValue := strings.TrimSpace(strFields[1])
fmt.Println(styleValue)
abvValue, err := strconv.ParseFloat(strings.TrimSpace(strFields[3]), 32)
if err != nil {
fmt.Println("Error parsing float!")
}
fmt.Printf("%.2f\n", abvValue)
ibuValue, err := strconv.ParseInt(strings.TrimSpace(strFields[5]), 10, 32)
if err != nil {
fmt.Println("Error parsing int!")
}
fmt.Printf("%d\n", ibuValue)
}
Output:
["Style" " Saison Drink" "ABV" " 7.7" "IBU" " 20"]
Saison Drink
7.70
20
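As for why the original attempt fails: strings.Split(s, "Style:") splits around the separator, and because the string starts with "Style:", the first element is the empty string and the second holds everything after it, including the ABV and IBU lines. An alternative sketch, assuming Go 1.18 or later for strings.Cut, that splits each line at its first colon; the helper name parseFields is hypothetical:

```go
package main

import (
	"fmt"
	"strings"
)

// parseFields is a hypothetical helper that turns "Key: value" lines into a map.
func parseFields(s string) map[string]string {
	fields := make(map[string]string)
	for _, line := range strings.Split(s, "\n") {
		key, value, found := strings.Cut(line, ":")
		if !found {
			continue // skip lines without a colon
		}
		fields[strings.TrimSpace(key)] = strings.TrimSpace(value)
	}
	return fields
}

func main() {
	fields := parseFields("Style: Saison\nABV: 7.7\nIBU: 20")
	fmt.Println(fields["Style"]) // Saison
}
```

Looking values up by key avoids depending on the positions 1, 3 and 5 in the FieldsFunc result.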

How to convert []byte to C hex format 0x...?

func main() {
str := hex.EncodeToString([]byte("go"))
fmt.Println(str)
}
This code returns 676f. How can I print it C-style, as 0x67, 0x6f?

I couldn't find any function in the encoding/hex package that would achieve what you want. However, we can write into a buffer in our desired format.
package main
import (
"bytes"
"fmt"
)
func main() {
originalBytes := []byte("go")
// Length 0, capacity 5 bytes per input byte ("0x67 "), so the buffer starts empty.
result := make([]byte, 0, 5*len(originalBytes))
buff := bytes.NewBuffer(result)
for _, b := range originalBytes {
fmt.Fprintf(buff, "0x%02x ", b)
}
fmt.Println(buff.String())
}
Runnable example: https://goplay.space/#fyhDJ094GgZ

Here's a solution that produces the result as specified in the question. Specifically, there's a ", " between each byte and no trailing space.
p := []byte("go")
var buf strings.Builder
if len(p) > 0 {
buf.Grow(len(p)*6 - 2)
for i, b := range p {
if i > 0 {
buf.WriteString(", ")
}
fmt.Fprintf(&buf, "0x%02x", b)
}
}
result := buf.String()
The strings.Builder type is used to avoid allocating memory on the final conversion to a string. Another answer uses bytes.Buffer that does allocate memory at this step.
The builder is initially sized large enough to hold the representation of each byte and the separators. Another answer ignores the size of the separators.
Try this on the Go playground.
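A shorter route, offered as a sketch rather than a replacement for the answers above: the fmt package's %x verb accepts the space and # flags on byte slices, which yields a 0x prefix on every byte with spaces in between. Replacing the spaces with ", " then gives the requested format. The helper name cHex is hypothetical:

```go
package main

import (
	"fmt"
	"strings"
)

// cHex is a hypothetical helper that formats p as "0x67, 0x6f".
func cHex(p []byte) string {
	// "% #x" prints each byte in hex with a 0x prefix, separated by spaces.
	spaced := fmt.Sprintf("% #x", p)
	return strings.ReplaceAll(spaced, " ", ", ")
}

func main() {
	fmt.Println(cHex([]byte("go"))) // 0x67, 0x6f
}
```

This allocates an intermediate string, so the strings.Builder version is the better fit when allocations matter.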

Convert slice of string input from console to slice of numbers

I'm trying to write a Go script that takes in as many lines of comma-separated coordinates as the user wishes, split and convert the string of coordinates to float64, store each line as a slice, and then append each slice in a slice of slices for later usage.
Example inputs are:
1.1,2.2,3.3
3.14,0,5.16
Example outputs are:
[[1.1 2.2 3.3],[3.14 0 5.16]]
The equivalent in Python is
def get_input():
print("Please enter comma separated coordinates:")
lines = []
while True:
line = input()
if line:
line = [float(x) for x in line.replace(" ", "").split(",")]
lines.append(line)
else:
break
return lines
But what I wrote in Go seems way too long (pasted below), and I'm creating a lot of variables without the ability to change variable type as in Python. Since I literally just started writing Golang to learn it, I fear my script is long as I'm trying to convert Python thinking into Go. Therefore, I would like to ask for some advice as to how to write this script shorter and more concise in Go style? Thank you.
package main
import (
"fmt"
"os"
"bufio"
"strings"
"strconv"
)
func main() {
inputs := get_input()
fmt.Println(inputs)
}
func get_input() [][]float64 {
fmt.Println("Please enter comma separated coordinates: ")
var inputs [][]float64
scanner := bufio.NewScanner(os.Stdin)
for scanner.Scan() {
if len(scanner.Text()) > 0 {
raw_input := strings.Replace(scanner.Text(), " ", "", -1)
input := strings.Split(raw_input, ",")
converted_input := str2float(input)
inputs = append(inputs, converted_input)
} else {
break
}
}
return inputs
}
func str2float(records []string) []float64 {
var float_slice []float64
for _, v := range records {
if s, err := strconv.ParseFloat(v, 64); err == nil {
float_slice = append(float_slice, s)
}
}
return float_slice
}

Using only string functions:
package main
import (
"bufio"
"fmt"
"os"
"strconv"
"strings"
)
func main() {
scanner := bufio.NewScanner(os.Stdin)
var result [][]float64
var txt string
for scanner.Scan() {
txt = scanner.Text()
if len(txt) > 0 {
values := strings.Split(txt, ",")
var row []float64
for _, v := range values {
fl, err := strconv.ParseFloat(strings.Trim(v, " "), 64)
if err != nil {
panic(fmt.Sprintf("Incorrect value for float64 '%v'", v))
}
row = append(row, fl)
}
result = append(result, row)
}
}
fmt.Printf("Result: %v\n", result)
}
Run:
$ printf "1.1,2.2,3.3
3.14,0,5.16
2,45,76.0, 45 , 69" | go run experiment2.go
Result: [[1.1 2.2 3.3] [3.14 0 5.16] [2 45 76 45 69]]

With the given input, you can concatenate the lines into a JSON string and then unmarshal (deserialize) that:
func main() {
var lines []string
for {
var line string
fmt.Scanln(&line)
if line == "" {
break
}
lines = append(lines, "["+line+"]")
}
all := "[" + strings.Join(lines, ",") + "]"
inputs := [][]float64{}
if err := json.Unmarshal([]byte(all), &inputs); err != nil {
fmt.Println(err)
return
}
fmt.Println(inputs)
}

Strip consecutive empty lines in a golang writer

I've got a Go text/template that renders a file, however I've found it difficult to structure the template cleanly while preserving the line breaks in the output.
I'd like to have additional, unnecessary newlines in the template to make it more readable, but strip them from the output. Any group of newlines more than a normal paragraph break should be condensed to a normal paragraph break, e.g.
lines with
too many breaks should become lines with
normal paragraph breaks.
The string is potentially too large to store safely in memory, so I want to keep it as an output stream.
My first attempt:
type condensingWriter struct {
writer io.Writer
lastLineIsEmpty bool
}
func (c condensingWriter) Write(b []byte) (n int, err error){
thisLineIsEmpty := strings.TrimSpace(string(b)) == ""
defer func(){
c.lastLineIsEmpty = thisLineIsEmpty
}()
if c.lastLineIsEmpty && thisLineIsEmpty{
return 0, nil
} else {
return c.writer.Write(b)
}
}
This doesn't work because I naively assumed that it would buffer on newline characters, but it doesn't.
Any suggestions on how to get this to work?

Inspired by zmb's approach, I've come up with the following package:
// Package striplines strips runs of consecutive empty lines from an output stream.
package striplines
import (
"io"
"strings"
)
// Striplines wraps an output stream, stripping runs of consecutive empty lines.
// You must call Close before the output stream will be complete.
// Implements io.WriteCloser.
type Striplines struct {
Writer io.Writer
lastLine []byte
currentLine []byte
}
func (w *Striplines) Write(p []byte) (int, error) {
s := string(p)
if !strings.Contains(s, "\n") {
// No complete line yet: buffer the bytes and report them as consumed.
w.currentLine = append(w.currentLine, p...)
return len(p), nil
}
cur := string(append(w.currentLine, p...))
lastN := strings.LastIndex(cur, "\n")
s = cur[:lastN]
for _, line := range strings.Split(s, "\n") {
_, err := w.writeLn(line + "\n")
w.lastLine = []byte(line)
if err != nil {
return 0, err
}
}
// Keep the unterminated remainder for the next Write or Close.
rem := cur[(lastN + 1):]
w.currentLine = []byte(rem)
// io.Writer requires n == len(p) when err is nil, even if some lines were dropped.
return len(p), nil
}
// Close flushes the last of the output into the underlying writer.
func (w *Striplines) Close() error {
_, err := w.writeLn(string(w.currentLine))
return err
}
func (w *Striplines) writeLn(line string) (n int, err error) {
if strings.TrimSpace(string(w.lastLine)) == "" && strings.TrimSpace(line) == "" {
return 0, nil
} else {
return w.Writer.Write([]byte(line))
}
}
See it in action here: http://play.golang.org/p/t8BGPUMYhb

The general idea is you'll have to look for consecutive newlines anywhere in the input slice and if such cases exist, skip over all but the first newline character.
Additionally, you have to track whether the last byte written was a newline, so the next call to Write will know to eliminate a newline if necessary. You were on the right track by adding a bool to your writer type. However, you'll want to use a pointer receiver instead of a value receiver here, otherwise you'll be modifying a copy of the struct.
You would want to change
func (c condensingWriter) Write(b []byte)
to
func (c *condensingWriter) Write(b []byte)
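To illustrate why the receiver type matters, here is a small self-contained sketch (the type state and its methods are hypothetical): a method with a value receiver works on a copy, so any field it sets is gone once the call returns.

```go
package main

import "fmt"

// state is a hypothetical stand-in for the writer's bookkeeping field.
type state struct{ set bool }

// byValue receives a copy, so the assignment is lost when it returns.
func (s state) byValue() { s.set = true }

// byPointer receives the address, so the assignment persists.
func (s *state) byPointer() { s.set = true }

func main() {
	var a, b state
	a.byValue()
	b.byPointer()
	fmt.Println(a.set, b.set) // false true
}
```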
You could try something like this. You'll have to test with larger inputs to make sure it handles all cases correctly.
package main
import (
"bytes"
"io"
"os"
)
var Newline byte = byte('\n')
type ReduceNewlinesWriter struct {
w io.Writer
lastByteNewline bool
}
func (r *ReduceNewlinesWriter) Write(b []byte) (int, error) {
// if the previous call to Write ended with a \n
// then we have to skip over any starting newlines here
i := 0
if r.lastByteNewline {
for i < len(b) && b[i] == Newline {
i++
}
b = b[i:]
}
if len(b) == 0 {
// nothing left to write; indexing b[len(b)-1] below would panic
return 0, nil
}
r.lastByteNewline = b[len(b) - 1] == Newline
i = bytes.IndexByte(b, Newline)
if i == -1 {
// no newlines - just write the entire thing
return r.w.Write(b)
}
// write up to the newline
i++
n, err := r.w.Write(b[:i])
if err != nil {
return n, err
}
// skip over immediate newline and recurse
i++
for i < len(b) && b[i] == Newline {
i++
}
i--
m, err := r.Write(b[i:])
return n + m, err
}
func main() {
r := ReduceNewlinesWriter{
w: os.Stdout,
}
io.WriteString(&r, "this\n\n\n\n\n\n\nhas\nmultiple\n\n\nnewline\n\n\n\ncharacters")
}
