I wrote a piece of code in Go to mimic the standard grep command, but it is far slower than grep.
Could someone give me advice on speeding it up? Here is the code:
package main
import (
"bufio"
"fmt"
"log"
"os"
"strings"
"sync"
)
func parse_args() (file, pat string) {
if len(os.Args) < 3 {
log.Fatal("usage: gorep2 <file_name> <pattern>")
}
file = os.Args[1]
pat = os.Args[2]
return
}
func readFile(file string, to chan<- string) {
f, err := os.Open(file)
if err != nil {
log.Fatal(err)
}
defer f.Close()
freader := bufio.NewReader(f)
for {
line, er := freader.ReadBytes('\n')
if er == nil {
to <- string(line)
} else {
break
}
}
close(to)
}
func grepLine(pat string, from <-chan string, result chan<- bool) {
var wg sync.WaitGroup
for line := range from {
wg.Add(1)
go func(l string) {
defer wg.Done()
if strings.Contains(l, pat) {
result <- true
}
}(string(line))
}
wg.Wait()
close(result)
}
func main() {
file, pat := parse_args()
text_chan := make(chan string, 10)
result_chan := make(chan bool, 10)
go readFile(file, text_chan)
go grepLine(pat, text_chan, result_chan)
var total uint = 0
for r := range result_chan {
if r == true {
total += 1
}
}
fmt.Printf("Total %d\n", total)
}
The time in Go:
>>> time gogrep /var/log/task.log DEBUG
Total 21089
real 0m0.156s
user 0m0.156s
sys 0m0.015s
The time in grep:
>>> time grep DEBUG /var/log/task.log | wc -l
21089
real 0m0.069s
user 0m0.046s
sys 0m0.064s
For an easily reproducible benchmark, I counted the number of occurrences of the text "and" in Shakespeare.
gogrep:
$ go build gogrep.go && time ./gogrep /home/peter/shakespeare.txt and
Total 21851
real 0m0.613s
user 0m0.651s
sys 0m0.068s
grep:
$ time grep and /home/peter/shakespeare.txt | wc -l
21851
real 0m0.108s
user 0m0.107s
sys 0m0.014s
petergrep:
$ go build petergrep.go && time ./petergrep /home/peter/shakespeare.txt and
Total 21851
real 0m0.098s
user 0m0.092s
sys 0m0.008s
petergrep is written in Go. It's fast.
package main
import (
"bufio"
"bytes"
"fmt"
"log"
"os"
)
func parse_args() (file, pat string) {
if len(os.Args) < 3 {
log.Fatal("usage: petergrep <file_name> <pattern>")
}
file = os.Args[1]
pat = os.Args[2]
return
}
func grepFile(file string, pat []byte) int64 {
patCount := int64(0)
f, err := os.Open(file)
if err != nil {
log.Fatal(err)
}
defer f.Close()
scanner := bufio.NewScanner(f)
for scanner.Scan() {
if bytes.Contains(scanner.Bytes(), pat) {
patCount++
}
}
if err := scanner.Err(); err != nil {
fmt.Fprintln(os.Stderr, err)
}
return patCount
}
func main() {
file, pat := parse_args()
total := grepFile(file, []byte(pat))
fmt.Printf("Total %d\n", total)
}
Data: Shakespeare: pg100.txt
Go regular expressions are fully UTF-8 aware, and I think that has some overhead. They also have a different theoretical basis, which means they always run in time proportional to the length of the input. It is noticeable that Go regexps just aren't as fast as the PCRE regexps used by other languages. If you look at the benchmarks game shootouts for the regexp test you'll see what I mean.
You can always use the pcre library directly if you want a bit more speed though.
A datapoint on the relevance of UTF-8 in regexp parsing: I have a long-used custom perl5 script for source grepping. I recently modified it to support UTF-8 so it could match fancy golang symbol names. It ran a FULL ORDER OF MAGNITUDE slower in repeated tests. So while golang regexps do pay a price for the predictability of their runtime, we also have to factor UTF-8 handling into the equation.
I am new to Go and working on an assignment where I should write code that returns the word frequencies of a text. The words 'Hello', 'HELLO' and 'hello' should all be counted as 'hello', so I need to convert all strings to lower case.
I know that I should use strings.ToLower(), but I don't know where I should put it in the code. Can someone please help me?
package main
import (
"fmt"
"io/ioutil"
"log"
"strings"
"time"
)
const DataFile = "loremipsum.txt"
// Return the word frequencies of the text argument.
func WordCount(text string) map[string]int {
fregs := make(map[string]int)
words := strings.Fields(text)
for _, word := range words {
fregs[word] += 1
}
return fregs
}
// Benchmark how long it takes to count word frequencies in text numRuns times.
//
// Return the total time elapsed.
func benchmark(text string, numRuns int) int64 {
start := time.Now()
for i := 0; i < numRuns; i++ {
WordCount(text)
}
runtimeMillis := time.Since(start).Nanoseconds() / 1e6
return runtimeMillis
}
// Print the results of a benchmark
func printResults(runtimeMillis int64, numRuns int) {
fmt.Printf("amount of runs: %d\n", numRuns)
fmt.Printf("total time: %d ms\n", runtimeMillis)
average := float64(runtimeMillis) / float64(numRuns)
fmt.Printf("average time/run: %.2f ms\n", average)
}
func main() {
// read in DataFile as a string called data
data, err:= ioutil.ReadFile("loremipsum.txt")
if err != nil {
log.Fatal(err)
}
// Convert []byte to string and print to screen
text := string(data)
fmt.Println(text)
fmt.Printf("%#v",WordCount(string(data)))
numRuns := 100
runtimeMillis := benchmark(string(data), numRuns)
printResults(runtimeMillis, numRuns)
}
You should convert words to lowercase when you use them as a map key:
for _, word := range words {
fregs[strings.ToLower(word)] += 1
}
I get [a:822 a.:110 ...]. I want every a counted together. How do I change the code so that a and a. are treated as the same word? – hello123
You need to carefully define a word. For example, a string of consecutive letters and numbers converted to lowercase.
func WordCount(s string) map[string]int {
wordFunc := func(r rune) bool {
return !unicode.IsLetter(r) && !unicode.IsNumber(r)
}
counts := make(map[string]int)
for _, word := range strings.FieldsFunc(s, wordFunc) {
counts[strings.ToLower(word)]++
}
return counts
}
To remove all non-word characters you could use a regular expression:
package main
import (
"bufio"
"fmt"
"log"
"regexp"
"strings"
)
func main() {
str1 := "This is some text! I want to count each word. Is it cool?"
re, err := regexp.Compile(`[^\w]`)
if err != nil {
log.Fatal(err)
}
str1 = re.ReplaceAllString(str1, " ")
scanner := bufio.NewScanner(strings.NewReader(str1))
scanner.Split(bufio.ScanWords)
for scanner.Scan() {
fmt.Println(strings.ToLower(scanner.Text()))
}
}
See strings.EqualFold.
Here is an example.
I am trying to end terminal input programmatically in 3 seconds and output the result.
My code is the following:
package main
import (
"bufio"
"fmt"
"os"
"time"
)
var (
result string
err error
)
func main() {
fmt.Println("Please input something, you have 3000 milliseconds")
go func() {
time.Sleep(time.Millisecond * 3000)
fmt.Println("It's time to break input and read what you have already typed")
fmt.Println("result")
fmt.Println(result)
}()
in := bufio.NewReader(os.Stdin)
result, err = in.ReadString('\n')
if err != nil {
fmt.Println(err)
}
}
The output:
Please input something, you have 3000 milliseconds
hello It's time to break input and read what you have already typed
result
I typed hello, and once 3 seconds had passed the program should have ended the input, read my hello and given this output:
result
hello
But I don't know how to achieve this. Is it possible to end terminal input without the user's action and read the value typed so far?
You can't timeout the read on stdin directly, so you need to create a timeout around receiving the result from the reading goroutine:
func getInput(input chan string) {
in := bufio.NewReader(os.Stdin)
result, err := in.ReadString('\n')
if err != nil {
log.Fatal(err)
}
input <- result
}
func main() {
input := make(chan string, 1)
go getInput(input)
select {
case i := <-input:
fmt.Println(i)
case <-time.After(3000 * time.Millisecond):
fmt.Println("timed out")
}
}
I started to do programming contests in go (just to learn the language) and to my surprise found that
var T int
fmt.Scanf("%d", &T)
is unimaginably slow. How slow? Reading 10^5 integers takes 2.5 seconds (in comparison, Python does it in 0.8 seconds).
So why is it so slow, and how should I properly read int, uint64 and float64?
If you have only a single integer as input, this should be faster (not tested though):
package main
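import (
"bytes"
"io/ioutil"
"log"
"os"
"strconv"
)
func read() (int64, error) {
b, err := ioutil.ReadAll(os.Stdin)
if err != nil {
return 0, err
}
// Trim surrounding whitespace (including the trailing newline) before parsing;
// slicing off the last byte would panic on empty input.
// use strconv.ParseUint and strconv.ParseFloat in a similar way
return strconv.ParseInt(string(bytes.TrimSpace(b)), 10, 0)
}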
func main() {
i, err := read()
if err != nil {
log.Fatal(err)
}
println(i)
}
Run it like this:
echo 123 | go run main.go
For interactive input, you might want to use bufio.NewReader; see How to read input from console line?
The case is:
1. I want to read a log like "tail -f" on *NIX.
2. When I kill the program, I want to know how many bytes I have already read, so that I can seek to that position later.
3. When the program starts again, it should continue reading the log line by line, using the seek data from step 2.
I want to get the byte count when I use bufio.NewScanner as a line reader to read a line,
e.g.:
import ...
func main() {
f, err := os.Open("111.txt")
if err != nil {
log.Fatal(err)
}
f.Seek(0,os.SEEK_SET)
scan := bufio.NewScanner(f)
for scan.Scan() {
log.Println(scan.Text())
// what I want to know is how many bytes have been read at this point, after reading a line
} // this loop reads the file line by line
}
thx!
==================================update==========================================
@twotwotwo, this is close to what I want, but I would like to change the io.Reader to an io.ReaderAt. I wrote a demo that uses io.Reader:
import (
"os"
"log"
"io"
)
type Reader struct {
reader io.Reader
count int
}
func (r *Reader) Read(b []byte) (int, error) {
n, err := r.reader.Read(b)
r.count += n
return n, err
}
func (r *Reader) Count() int {
return r.count
}
func NewReader(r io.Reader) *Reader {
return &Reader{reader: r}
}
func ReadLine(r *Reader) (ln int,line []byte,err error) {
line = make([]byte,0,4096)
for {
b := make([]byte,1)
n,er := r.Read(b)
if er == io.EOF {
err = er
break
}
if n > 0{
c := b[0]
if c == '\n' {
break
}
line = append(line, c)
}
if er != nil{
err = er
}
}
ln = r.Count()
return ln,line,err
}
func main() {
f, err := os.Open("111.txt")
if err != nil {
log.Fatal(err)
}
fi,_:=os.Stat("111.txt")
log.Printf("the file have %v bytes",fi.Size())
co := NewReader(f)
for {
count,line,er := ReadLine(co)
if er == io.EOF {
break
}
log.Printf("now read the line :%v",string(line))
log.Printf("in all we have read %v bytes",count)
}
}
This program can tell me how many bytes I have already read, but it cannot start reading from an arbitrary position, so I think using io.ReaderAt should make that possible.
thanks again!
You could consider another approach based on os.File.
See ActiveState/tail, which monitors the state of a file and uses os.File#Seek() to resume tailing a file from a certain point.
See tail.go.
Consider composition.
We know that bufio.NewScanner is interacting with its input through the io.Reader interface. So we may wrap an io.Reader with something else that counts how many bytes have been read so far.
package main
import (
"bufio"
"bytes"
"io"
"log"
)
type ReadCounter struct {
io.Reader
BytesRead int
}
func (r *ReadCounter) Read(p []byte) (int, error) {
n, err := r.Reader.Read(p)
r.BytesRead += n
return n, err
}
func main() {
b := &ReadCounter{Reader: bytes.NewBufferString("hello\nworld\ntesting\n")}
scan := bufio.NewScanner(b)
for scan.Scan() {
log.Println(scan.Text())
log.Println("Read", b.BytesRead, "bytes so far")
}
}
But we'll note that bufio.NewScanner is buffered, so we can see that it reads its input in chunks. So for your purposes, this might not be as useful as you want.
An alternative is to take the content of scan.Text() and count up the lengths. You can compensate for its removal of newline bytes in your internal count.
I need to read a file of integers into an array. I have it working with this:
package main
import (
"fmt"
"io"
"os"
)
func readFile(filePath string) (numbers []int) {
fd, err := os.Open(filePath)
if err != nil {
panic(fmt.Sprintf("open %s: %v", filePath, err))
}
var line int
for {
_, err := fmt.Fscanf(fd, "%d\n", &line)
if err != nil {
fmt.Println(err)
if err == io.EOF {
return
}
panic(fmt.Sprintf("Scan Failed %s: %v", filePath, err))
}
numbers = append(numbers, line)
}
return
}
func main() {
numbers := readFile("numbers.txt")
fmt.Println(len(numbers))
}
The file numbers.txt is just:
1
2
3
...
readFile() seems too long (maybe because of the error handling).
Is there a shorter / more Go idiomatic way to load a file?
Using a bufio.Scanner makes things nice. I've also used an io.Reader rather than taking a filename. Often that's a good technique, since it allows the code to be used on any file-like object and not just a file on disk. Here it's "reading" from a string.
package main
import (
"bufio"
"fmt"
"io"
"strconv"
"strings"
)
// ReadInts reads whitespace-separated ints from r. If there's an error, it
// returns the ints successfully read so far as well as the error value.
func ReadInts(r io.Reader) ([]int, error) {
scanner := bufio.NewScanner(r)
scanner.Split(bufio.ScanWords)
var result []int
for scanner.Scan() {
x, err := strconv.Atoi(scanner.Text())
if err != nil {
return result, err
}
result = append(result, x)
}
return result, scanner.Err()
}
func main() {
tf := "1\n2\n3\n4\n5\n6"
ints, err := ReadInts(strings.NewReader(tf))
fmt.Println(ints, err)
}
I would do it like this:
package main
import (
"fmt"
"io/ioutil"
"strconv"
"strings"
)
// It would be better for such a function to return error, instead of handling
// it on their own.
func readFile(fname string) (nums []int, err error) {
b, err := ioutil.ReadFile(fname)
if err != nil { return nil, err }
lines := strings.Split(string(b), "\n")
// Assign cap to avoid resize on every append.
nums = make([]int, 0, len(lines))
for _, l := range lines {
// Empty line occurs at the end of the file when we use Split.
if len(l) == 0 { continue }
// Atoi better suits the job when we know exactly what we're dealing
// with. Scanf is the more general option.
n, err := strconv.Atoi(l)
if err != nil { return nil, err }
nums = append(nums, n)
}
return nums, nil
}
func main() {
nums, err := readFile("numbers.txt")
if err != nil { panic(err) }
fmt.Println(len(nums))
}
Your solution with fmt.Fscanf is fine. There are certainly a number of other ways to do it, though, depending on your situation. Mostafa's technique is one I use a lot (although I might allocate the result all at once with make. Oops! Scratch that, he did), but for ultimate control you should learn bufio.ReadLine. See go readline -> string for some example code.