Improving performance of reading with bufio.NewScanner - go

A simple program to serve one purpose:
Read a script file line by line and build a single string, ignoring blank lines and comments (including the shebang) and appending a ';' to any line that does not already end with one. (I know, I know: backslashes, ampersands, etc.)
My question is:
How can I improve the performance of this small program? In a different answer I've read about using scanner.Bytes() instead of scanner.Text(), but that doesn't seem feasible since a string is what I want.
Sample code with test file: https://play.golang.org/p/gzSTLkP3BoB
Here is the simple program:
func main() {
file, err := os.Open("./script.sh")
if err != nil {
log.Fatalln(err)
}
defer file.Close()
var a strings.Builder
scanner := bufio.NewScanner(file)
for scanner.Scan() {
lines := scanner.Text()
switch {
case lines == "" || lines[:1] == "#":
continue
case lines[len(lines)-1:] != ";":
a.WriteString(lines + "; ")
default:
a.WriteString(lines + " ")
}
}
fmt.Println(a.String())
}

I used strings.Builder and ioutil.ReadAll to improve the performance. Since you are dealing with small shell scripts, I assumed that reading the whole file at once would not put pressure on memory. I also call Grow once up front so that the strings.Builder has sufficient capacity, which reduces allocations.
doFast: faster implementation
doSlow: slower implementation (what you've originally done)
Now, let's look at the benchmark results:
goos: darwin
goarch: amd64
pkg: test
cpu: Intel(R) Core(TM) i5-1038NG7 CPU @ 2.00GHz
BenchmarkDoFast-8 342602 3334 ns/op 1280 B/op 3 allocs/op
BenchmarkDoSlow-8 258896 4408 ns/op 4624 B/op 8 allocs/op
PASS
ok test 2.477s
We can see that doFast is not only faster but also makes fewer allocations. For all of these metrics, lower is better.
package main
import (
"bufio"
"bytes"
"fmt"
"io/ioutil"
"os"
"strings"
)
func open(filename string) (*os.File, error) {
return os.Open(filename)
}
func main() {
fd, err := open("test.sh")
if err != nil {
panic(err)
}
defer fd.Close()
outputA, err := doFast(fd)
if err != nil {
panic(err)
}
fd.Seek(0, 0)
outputB, err := doSlow(fd)
if err != nil {
panic(err)
}
fmt.Println(outputA)
fmt.Println(outputB)
}
func doFast(fd *os.File) (string, error) {
b, err := ioutil.ReadAll(fd)
if err != nil {
return "", err
}
var res strings.Builder
res.Grow(len(b))
bLines := bytes.Split(b, []byte("\n"))
for i := range bLines {
switch {
case len(bLines[i]) == 0 || bLines[i][0] == '#':
case bLines[i][len(bLines[i])-1] != ';':
res.Write(bLines[i])
res.WriteString("; ")
default:
res.Write(bLines[i])
res.WriteByte(' ')
}
}
return res.String(), nil
}
func doSlow(fd *os.File) (string, error) {
var a strings.Builder
scanner := bufio.NewScanner(fd)
for scanner.Scan() {
lines := scanner.Text()
switch {
case lines == "" || lines[:1] == "#":
continue
case lines[len(lines)-1:] != ";":
a.WriteString(lines + "; ")
default:
a.WriteString(lines + " ")
}
}
return a.String(), nil
}
Note: I didn't use bufio.NewScanner; is it required?
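The benchmark functions themselves are not shown above, so here is a minimal sketch of how such a comparison could be run. It is not the code that produced the numbers above: joinReadAll and joinScan are hypothetical io.Reader-based stand-ins for doFast and doSlow, the sample script is made up, and io.ReadAll assumes Go 1.16 or later.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
	"strings"
	"testing"
)

// sample is a made-up script, used only for this illustration.
const sample = "#!/bin/sh\n# comment\n\necho one\necho two;\nls -l\n"

// joinReadAll mirrors doFast: read everything, then split on '\n'.
func joinReadAll(r io.Reader) (string, error) {
	b, err := io.ReadAll(r)
	if err != nil {
		return "", err
	}
	var res strings.Builder
	res.Grow(len(b))
	for _, line := range bytes.Split(b, []byte("\n")) {
		switch {
		case len(line) == 0 || line[0] == '#':
			// skip blank lines and comments
		case line[len(line)-1] != ';':
			res.Write(line)
			res.WriteString("; ")
		default:
			res.Write(line)
			res.WriteByte(' ')
		}
	}
	return res.String(), nil
}

// joinScan mirrors doSlow: scan line by line and concatenate strings.
func joinScan(r io.Reader) (string, error) {
	var a strings.Builder
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		line := sc.Text()
		switch {
		case line == "" || line[0] == '#':
			// skip blank lines and comments
		case line[len(line)-1] != ';':
			a.WriteString(line + "; ")
		default:
			a.WriteString(line + " ")
		}
	}
	return a.String(), sc.Err()
}

// run feeds the in-memory sample to one implementation.
func run(join func(io.Reader) (string, error)) string {
	out, err := join(strings.NewReader(sample))
	if err != nil {
		panic(err)
	}
	return out
}

// bench measures one implementation with the testing package's benchmark runner.
func bench(join func(io.Reader) (string, error)) testing.BenchmarkResult {
	return testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			run(join)
		}
	})
}

func main() {
	fmt.Println("same output:", run(joinReadAll) == run(joinScan))
	fast, slow := bench(joinReadAll), bench(joinScan)
	fmt.Println("readall:", fast.String(), fast.MemString())
	fmt.Println("scanner:", slow.String(), slow.MemString())
}
```

Because the input lives in memory, this measures only the line handling and allocations, not file I/O, so the absolute numbers will differ from a run against a real file.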

It is feasible to use scanner.Bytes(). Here's the code:
func main() {
file, err := os.Open("./script.sh")
if err != nil {
log.Fatalln(err)
}
defer file.Close()
var a strings.Builder
scanner := bufio.NewScanner(file)
for scanner.Scan() {
lines := scanner.Bytes()
switch {
case len(lines) == 0 || lines[0] == '#':
continue
case lines[len(lines)-1] != ';':
a.Write(lines)
a.WriteString("; ")
default:
a.Write(lines)
a.WriteByte(' ')
}
}
fmt.Println(a.String())
}
This program avoids the string allocation in scanner.Text(). The program may not be faster in practice if the program speed is limited by I/O.
If your goal is to write the result to stdout, then write to a bufio.Writer instead of a strings.Builder. This change replaces one or more allocations in strings.Builder with a single allocation in bufio.Writer.
func main() {
file, err := os.Open("./script.sh")
if err != nil {
log.Fatalln(err)
}
defer file.Close()
a := bufio.NewWriter(os.Stdout)
defer a.Flush() // flush buffered data on return from main.
scanner := bufio.NewScanner(file)
for scanner.Scan() {
lines := scanner.Bytes()
switch {
case len(lines) == 0 || lines[0] == '#':
continue
case lines[len(lines)-1] != ';':
a.Write(lines)
a.WriteString("; ")
default:
a.Write(lines)
a.WriteByte(' ')
}
}
}
Bonus improvement: use lines := bytes.TrimSpace(scanner.Bytes()) to handle whitespace before a '#' and after a ';'
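A minimal sketch of that bonus improvement, written against an io.Reader so it can be tried without a file. joinScript and joinString are hypothetical names chosen for this illustration, and the sample input is made up.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
	"strings"
)

// joinScript joins the non-blank, non-comment lines of a script,
// ignoring leading and trailing whitespace on every line.
func joinScript(r io.Reader) (string, error) {
	var out strings.Builder
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		// TrimSpace returns a subslice of the scanner's buffer, so no copy is made.
		line := bytes.TrimSpace(sc.Bytes())
		switch {
		case len(line) == 0 || line[0] == '#':
			continue
		case line[len(line)-1] != ';':
			out.Write(line)
			out.WriteString("; ")
		default:
			out.Write(line)
			out.WriteByte(' ')
		}
	}
	return out.String(), sc.Err()
}

// joinString is a small convenience wrapper for in-memory input.
func joinString(src string) string {
	s, err := joinScript(strings.NewReader(src))
	if err != nil {
		panic(err)
	}
	return s
}

func main() {
	src := "#!/bin/sh\n   # indented comment\n\techo a\necho b;   \n\n"
	fmt.Printf("%q\n", joinString(src))
}
```

With the trim in place, an indented comment is skipped and a line ending in ";" followed by spaces does not get a second semicolon.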

You may be able to improve performance by buffering the output as well.
func main() {
output := bufio.NewWriter(os.Stdout)
defer output.Flush() // write any buffered data before main returns
// instead of Printf, use
fmt.Fprintf(output, "%s\n", a)
}

Related

Ignore a line containing a pattern from a long text file in Go

I'm trying to implement a function to ignore a line containing a pattern from a long text file (ASCII guaranteed) in Go
The functions I have below, withoutIgnore and withIgnore, both take a filename argument and return a *bytes.Buffer, which can subsequently be written to an io.Writer.
The withIgnore function takes an additional argument, pattern, and excludes every line containing the pattern from the output. The function works, but benchmarking shows it to be about 5x slower than withoutIgnore. Is there a way it could be improved?
package main
import (
"bufio"
"bytes"
"io"
"log"
"os"
)
func withoutIgnore(f string) (*bytes.Buffer, error) {
rfd, err := os.Open(f)
if err != nil {
log.Fatal(err)
}
defer func() {
if err := rfd.Close(); err != nil {
log.Fatal(err)
}
}()
inputBuffer := make([]byte, 1048576)
var bytesRead int
var bs []byte
opBuffer := bytes.NewBuffer(bs)
for {
bytesRead, err = rfd.Read(inputBuffer)
if err == io.EOF {
return opBuffer, nil
}
if err != nil {
return nil, err
}
_, err = opBuffer.Write(inputBuffer[:bytesRead])
if err != nil {
return nil, err
}
}
return opBuffer, nil
}
func withIgnore(f, pattern string) (*bytes.Buffer, error) {
rfd, err := os.Open(f)
if err != nil {
log.Fatal(err)
}
defer func() {
if err := rfd.Close(); err != nil {
log.Fatal(err)
}
}()
scanner := bufio.NewScanner(rfd)
var bs []byte
buffer := bytes.NewBuffer(bs)
for scanner.Scan() {
if !bytes.Contains(scanner.Bytes(), []byte(pattern)) {
_, err := buffer.WriteString(scanner.Text() + "\n")
if err != nil {
return nil, err
}
}
}
return buffer, nil
}
func main() {
// buff, err := withoutIgnore("base64dump.log")
buff, err := withIgnore("base64dump.log", "AUDIT")
if err != nil {
log.Fatal(err)
}
_, err = buff.WriteTo(os.Stdout)
if err != nil {
log.Fatal(err)
}
}
Benchmark test
package main
import "testing"
func BenchmarkTestWithoutIgnore(b *testing.B) {
for i := 0; i < b.N; i++ {
_, err := withoutIgnore("base64dump.log")
if err != nil {
b.Fatal(err)
}
}
}
func BenchmarkTestWithIgnore(b *testing.B) {
for i := 0; i < b.N; i++ {
_, err := withIgnore("base64dump.log", "AUDIT")
if err != nil {
b.Fatal(err)
}
}
}
and the "base64dump.log" can be generated in the command line using
base64 /dev/urandom | head -c 10000000 > base64dump.log
Since ASCII is guaranteed, one can work directly at byte level.
Still, if one checks every byte for a line break while reading the input and then searches each line for the pattern, at least one operation is applied to every input byte.
If, on the other hand, one reads chunks of the input and performs an optimized search for the pattern in the text, not even examining each input byte, one minimizes the operations per input byte.
For example, there is the Boyer-Moore string search algorithm. Go's built-in bytes.Index function is also optimized. The achieved speed depends, of course, on the input data and the actual pattern. For the input as specified in the question, bytes.Index turned out to be significantly faster when measured.
Procedure
read in a chunk, where the chunk size should be significantly longer than the maximum line length, a value >= 64KB should probably be good, in the test 1MB was used as in the question.
a chunk usually doesn't end at a linefeed, so search from the end of the chunk to the next linefeed, limit the search to this slice and remember the remaining data for the next pass
the last chunk does not necessarily end in a linefeed
with the help of the fast Go function bytes.Index, find the places where the pattern occurs in the chunk
from the found location one searches for the preceding and the following linefeed
then the block is output up to the corresponding beginning of the line
and the search is continued from the end of the line where the pattern occurred
if the search does not find another location, the rest is output
read the next chunk and apply the described steps again until the end of the file is reached
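The core of the steps above (find a hit with bytes.Index, widen it to the enclosing line, copy everything before it, continue after it) can be sketched on a single in-memory buffer. This is only a minimal sketch, not the chunked implementation: dropLines is a hypothetical name, and the pattern is assumed to be non-empty and to contain no linefeed.

```go
package main

import (
	"bytes"
	"fmt"
)

// dropLines returns a copy of data without the lines that contain pattern.
// Assumption: pattern is non-empty and contains no '\n'.
func dropLines(data, pattern []byte) []byte {
	out := make([]byte, 0, len(data))
	s := 0 // start of the region not yet copied; always at a line start
	for s < len(data) {
		hit := bytes.Index(data[s:], pattern)
		if hit < 0 {
			break
		}
		hit += s
		// The line starts right after the preceding linefeed, or at s if
		// there is none (LastIndexByte returns -1 in that case).
		start := s + bytes.LastIndexByte(data[s:hit], '\n') + 1
		// The line ends after the next linefeed, or at the end of the data.
		end := len(data)
		if i := bytes.IndexByte(data[hit:], '\n'); i >= 0 {
			end = hit + i + 1
		}
		out = append(out, data[s:start]...) // keep everything before the line
		s = end                             // continue after the line
	}
	return append(out, data[s:]...) // keep the rest
}

func main() {
	in := []byte("keep 1\nxx AUDIT yy\nkeep 2\nAUDIT at the end")
	fmt.Printf("%q\n", dropLines(in, []byte("AUDIT")))
}
```

The chunked version has to do the same thing while also carrying the incomplete last line of each chunk over to the next read, which is where most of the extra index arithmetic comes from.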
Noteworthy
A read operation may return less data than the chunk size, so it makes sense to repeat the read operation until the chunk size data has been read.
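The standard library already provides this "keep reading until the buffer is full" loop as io.ReadFull. The following is a minimal sketch of how it behaves with a reader that delivers very short reads; fillChunk and chunks are hypothetical helper names, and iotest.OneByteReader is used only to simulate a reader that returns one byte per call.

```go
package main

import (
	"fmt"
	"io"
	"strings"
	"testing/iotest"
)

// fillChunk reads until chunk is full or the input ends. It returns the
// number of bytes placed in chunk and whether the input is exhausted.
func fillChunk(r io.Reader, chunk []byte) (n int, done bool, err error) {
	n, err = io.ReadFull(r, chunk)
	switch err {
	case nil:
		return n, false, nil
	case io.EOF, io.ErrUnexpectedEOF:
		// io.EOF: nothing was read. io.ErrUnexpectedEOF: a final, partial chunk.
		return n, true, nil
	default:
		return n, true, err
	}
}

// chunks splits src into pieces of at most size bytes, reading it through
// a reader that returns a single byte per Read call.
func chunks(src string, size int) []string {
	r := iotest.OneByteReader(strings.NewReader(src))
	buf := make([]byte, size)
	var out []string
	for {
		n, done, err := fillChunk(r, buf)
		if err != nil {
			panic(err)
		}
		if n > 0 {
			out = append(out, string(buf[:n]))
		}
		if done {
			return out
		}
	}
}

func main() {
	fmt.Printf("%q\n", chunks("hello world", 4))
}
```

Even though every underlying Read returns one byte, each chunk except the last comes back full.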
Benchmark
Optimized code is often significantly more complicated, but the performance is also significantly better, as we will see in a moment.
BenchmarkTestWithoutIgnore-8 270 4137267 ns/op
BenchmarkTestWithIgnore-8 54 22403931 ns/op
BenchmarkTestFilter-8 150 7947454 ns/op
Here, the optimized code BenchmarkTestFilter-8 is only about 1.9x slower than the operation without filtering while the BenchmarkTestWithIgnore-8 method is 5.4x slower than the comparison value without filtering.
Looked at another way: the optimized code is 2.8 times faster than the unoptimized one.
Code
Of course, here is the code for your own tests:
func filterFile(f, pattern string) (*bytes.Buffer, error) {
rfd, err := os.Open(f)
if err != nil {
log.Fatal(err)
}
defer func() {
if err := rfd.Close(); err != nil {
log.Fatal(err)
}
}()
reader := bufio.NewReader(rfd)
return filter(reader, []byte(pattern), 1024*1024)
}
// chunkSize must be larger than the longest line
// a reasonable size is probably >= 64K
func filter(reader io.Reader, pattern []byte, chunkSize int) (*bytes.Buffer, error) {
var bs []byte
buffer := bytes.NewBuffer(bs)
chunk := make([]byte, chunkSize)
var remaining []byte
for lastChunk := false; !lastChunk; {
n, err := readChunk(reader, chunk, remaining, chunkSize)
if err != nil {
if err == io.EOF {
lastChunk = true
} else {
return nil, err
}
}
remaining = remaining[:0]
if !lastChunk {
for i := n - 1; i > 0; i-- {
if chunk[i] == '\n' {
remaining = append(remaining, chunk[i+1:n]...)
n = i + 1
break
}
}
}
s := 0
for s < n {
hit := bytes.Index(chunk[s:n], pattern)
if hit < 0 {
break
}
hit += s
startOfLine := hit
for ; startOfLine > 0; startOfLine-- {
if chunk[startOfLine] == '\n' {
startOfLine++
break
}
}
endOfLine := hit + len(pattern)
for ; endOfLine < n; endOfLine++ {
if chunk[endOfLine] == '\n' {
break
}
}
endOfLine++
_, err = buffer.Write(chunk[s:startOfLine])
if err != nil {
return nil, err
}
s = endOfLine
}
if s < n {
_, err = buffer.Write(chunk[s:n])
if err != nil {
return nil, err
}
}
}
return buffer, nil
}
func readChunk(reader io.Reader, chunk, remaining []byte, chunkSize int) (int, error) {
copy(chunk, remaining)
r := len(remaining)
for r < chunkSize {
n, err := reader.Read(chunk[r:])
r += n
if err != nil {
return r, err
}
}
return r, nil
}
And the benchmark part might look something like this:
func BenchmarkTestFilter(b *testing.B) {
for i := 0; i < b.N; i++ {
_, err := filterFile("base64dump.log", "AUDIT")
if err != nil {
b.Fatal(err)
}
}
}
The work was split in two: filterFile opens the file, and the actual job is done in func filter(reader io.Reader, pattern []byte, chunkSize int) (*bytes.Buffer, error).
Because the reader and the chunkSize are injected, filter is easy to unit test. Such tests are missing here, but they are definitely recommended for code that does this much index arithmetic.
However, the main point here was to find a way to improve performance significantly.

Count lines via bufio

I'm using bufio to loop over each line in a text file, but I have no idea how to count the number of lines.
scanner := bufio.NewScanner(bufio.NewReader(file))
The above is what I use to scan my file.
You could do something like this:
counter := 0
for scanner.Scan() {
line := scanner.Text()
counter++
// do something with your line
}
fmt.Printf("Lines read: %d", counter)
Keep it simple and fast. No need for buffering, scanner already does that. Don't do unnecessary string conversions. For example,
package main
import (
"bufio"
"fmt"
"os"
)
func lineCount(filename string) (int64, error) {
lc := int64(0)
f, err := os.Open(filename)
if err != nil {
return 0, err
}
defer f.Close()
s := bufio.NewScanner(f)
for s.Scan() {
lc++
}
return lc, s.Err()
}
func main() {
filename := `testfile`
lc, err := lineCount(filename)
if err != nil {
fmt.Println(err)
return
}
fmt.Println(filename+" line count:", lc)
}
As I commented, the accepted answer fails on long lines. The default limit is bufio.MaxScanTokenSize, which is 64 KiB. So if a line is longer than 65536 bytes, Scan stops early, and unless you check scanner.Err() the failure is silent. You've got two options.
(1) Call scanner.Buffer() and supply a sufficient max parameter. The initial buffer may be small because Scanner is smart enough to allocate larger ones as needed. This can be a problem if you don't know the maximum line size beforehand, as with a plain Reader interface, and you've got huge lines: memory consumption grows correspondingly, because Scanner holds the whole line in its buffer.
(2) Recreate the scanner in an outer loop, which ensures that you advance further:
var scanner *bufio.Scanner
counter := 0
for scanner == nil || scanner.Err() == bufio.ErrTooLong {
scanner = bufio.NewScanner(reader)
for scanner.Scan() {
counter++
}
}
The problem with (2) is that you keep allocating and deallocating buffers instead of reusing them. So let's fuse (1) and (2):
var scanner *bufio.Scanner
buffer := make([]byte, bufio.MaxScanTokenSize)
counter := 0
for scanner == nil || scanner.Err() == bufio.ErrTooLong {
scanner = bufio.NewScanner(reader)
scanner.Buffer(buffer, 0)
for scanner.Scan() {
counter++
}
}
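For comparison, option (1) on its own might look like the following minimal sketch. countLines and longInput are hypothetical names, and the 1 MiB limit is an arbitrary value picked for the demonstration.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// countLines counts the lines in r, accepting lines shorter than maxLine bytes.
func countLines(r io.Reader, maxLine int) (int, error) {
	sc := bufio.NewScanner(r)
	// Start with a small buffer; Scanner grows it on demand, up to maxLine.
	sc.Buffer(make([]byte, 0, 4096), maxLine)
	n := 0
	for sc.Scan() {
		n++
	}
	// Err is bufio.ErrTooLong if a line did not fit within maxLine.
	return n, sc.Err()
}

// longInput builds three lines, the first of which is lineLen bytes long.
func longInput(lineLen int) io.Reader {
	return strings.NewReader(strings.Repeat("x", lineLen) + "\nshort\nlast")
}

func main() {
	// A 100000-byte line fits once the limit is raised to 1 MiB.
	n, err := countLines(longInput(100000), 1<<20)
	fmt.Println(n, err)

	// With the default limit the scan stops at the long line and reports an error.
	n, err = countLines(longInput(100000), bufio.MaxScanTokenSize)
	fmt.Println(n, err)
}
```

The second call shows why checking the error matters: the count alone gives no hint that the scan stopped early.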
Here is my approach to do the task:
inputFile, err := os.Open("input.txt")
if err != nil {
panic("Error happened while opening the file. Please check that the file exists!")
}
defer inputFile.Close()
inputReader := bufio.NewReader(inputFile)
scanner := bufio.NewScanner(inputReader)
// Count the lines.
count := 0
for scanner.Scan() {
line := scanner.Text()
fmt.Printf("%v\n", line)
count++
}
if err := scanner.Err(); err != nil {
fmt.Fprintln(os.Stderr, "reading input:", err)
}
fmt.Printf("%d\n", count)

How do I read in a large flat file

I have a flat file that has 339276 lines of text in it, for a size of 62.1 MB. I am attempting to read in all the lines, parse them based on some conditions I have and then insert them into a database.
I originally attempted to use a bufio.Scan() loop and bufio.Text() to get each line, but I was running out of buffer space. I switched to using bufio.ReadLine/ReadString/ReadByte (I tried each) and had the same problem with each: I didn't have enough buffer space.
I tried using read and setting the buffer size, but as the documentation says, it is actually a const that can be made smaller but never bigger than 64*1024 bytes. I then tried to use File.ReadAt, where I set the starting position and moved it along as I brought in each section, to no avail. I have looked at the following examples and explanations (not an exhaustive list):
Read text file into string array (and write)
How to Read last lines from a big file with Go every 10 secs
reading file line by line in go
How do I read in an entire file (either line by line or the whole thing at once) into a slice so I can then go do things to the lines?
Here is some code that I have tried:
file, err := os.Open(feedFolder + value)
handleError(err)
defer file.Close()
// fileInfo, _ := file.Stat()
var linesInFile []string
r := bufio.NewReader(file)
for {
path, err := r.ReadLine("\n") // 0x0A separator = newline
linesInFile = append(linesInFile, path)
if err == io.EOF {
fmt.Printf("End Of File: %s", err)
break
} else if err != nil {
handleError(err) // if you return error
}
}
fmt.Println("Last Line: ", linesInFile[len(linesInFile)-1])
Here is something else I tried:
var fileSize int64 = fileInfo.Size()
fmt.Printf("File Size: %d\t", fileSize)
var bufferSize int64 = 1024 * 60
bytes := make([]byte, bufferSize)
var fullFile []byte
var start int64 = 0
var interationCounter int64 = 1
var currentErr error = nil
for currentErr != io.EOF {
_, currentErr = file.ReadAt(bytes, start)
fullFile = append(fullFile, bytes...)
start = (bufferSize * interationCounter) + 1
interationCounter++
}
fmt.Printf("Err: %s\n", currentErr)
fmt.Printf("fullFile Size: %d\n", len(fullFile))
fmt.Printf("Start: %d", start)
var currentLine []string
for _, value := range fullFile {
if string(value) != "\n" {
currentLine = append(currentLine, string(value))
} else {
singleLine := strings.Join(currentLine, "")
linesInFile = append(linesInFile, singleLine)
currentLine = nil
}
}
I am at a loss. Either I don't understand exactly how the buffer works or I don't understand something else. Thanks for reading.
A loop using Scan() and Text() works perfectly for me on files much larger than yours, so I suppose you have lines that exceed the buffer capacity. In that case:
check your line endings
check which Go version you use: path, err := r.ReadLine("\n") // 0x0A separator = newline does not match the documented signature. func (b *bufio.Reader) ReadLine() (line []byte, isPrefix bool, err error) takes no arguments and has the return value isPrefix specifically for your use case
http://golang.org/pkg/bufio/#Reader.ReadLine
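To make the isPrefix hint concrete, here is a minimal sketch that reassembles lines longer than the reader's buffer. readLongLine, readLines and demo are hypothetical names, and the 16-byte buffer is deliberately tiny so that the 50-byte line in the demo has to be read in several pieces.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// readLongLine reads one complete line, joining the pieces that ReadLine
// returns when the line does not fit into the reader's buffer.
func readLongLine(br *bufio.Reader) (string, error) {
	var sb strings.Builder
	for {
		part, isPrefix, err := br.ReadLine()
		if err != nil {
			return sb.String(), err
		}
		sb.Write(part)
		if !isPrefix {
			return sb.String(), nil
		}
	}
}

// readLines collects all lines of r using a buffer of bufSize bytes.
func readLines(r io.Reader, bufSize int) ([]string, error) {
	br := bufio.NewReaderSize(r, bufSize)
	var lines []string
	for {
		line, err := readLongLine(br)
		if err == io.EOF {
			// Input ended; keep a partially assembled last line, if any.
			if line != "" {
				lines = append(lines, line)
			}
			return lines, nil
		}
		if err != nil {
			return lines, err
		}
		lines = append(lines, line)
	}
}

// demo reads three lines, the middle one longer than the buffer.
func demo() []string {
	src := "short\n" + strings.Repeat("a", 50) + "\nend"
	lines, err := readLines(strings.NewReader(src), 16)
	if err != nil {
		panic(err)
	}
	return lines
}

func main() {
	for _, l := range demo() {
		fmt.Println(len(l), l)
	}
}
```

In practice bufio.Reader.ReadString('\n') or a Scanner with a larger buffer is usually simpler; the sketch is only meant to show what isPrefix is for.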
It's not clear that it's necessary to read in all the lines before parsing them and inserting them into a database. Try to avoid that.
You have a small file: "a flat file that has 339276 line of text in it for a size of 62.1 MB." For example,
package main
import (
"bytes"
"fmt"
"io"
"io/ioutil"
)
func readLines(filename string) ([]string, error) {
var lines []string
file, err := ioutil.ReadFile(filename)
if err != nil {
return lines, err
}
buf := bytes.NewBuffer(file)
for {
line, err := buf.ReadString('\n')
if len(line) == 0 {
if err != nil {
if err == io.EOF {
break
}
return lines, err
}
}
lines = append(lines, line)
if err != nil && err != io.EOF {
return lines, err
}
}
return lines, nil
}
func main() {
// a flat file that has 339276 lines of text in it for a size of 62.1 MB
filename := "flat.file"
lines, err := readLines(filename)
fmt.Println(len(lines))
if err != nil {
fmt.Println(err)
return
}
}
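If the lines do not all have to be in memory at once, each one can be handled as soon as it is read. The following is a minimal sketch of that streaming alternative; eachLine and countNonEmpty are hypothetical names, and the callback stands in for whatever parsing and database insertion the real program does.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// eachLine calls fn for every line in r without keeping earlier lines in
// memory. ReadString grows its result as needed, so long lines are fine.
func eachLine(r io.Reader, fn func(line string) error) error {
	br := bufio.NewReader(r)
	for {
		line, err := br.ReadString('\n')
		if len(line) > 0 {
			// Strip the line ending before handing the line to fn.
			if ferr := fn(strings.TrimRight(line, "\r\n")); ferr != nil {
				return ferr
			}
		}
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
	}
}

// countNonEmpty is a stand-in for "parse the line and insert it into the database".
func countNonEmpty(src string) (int, error) {
	n := 0
	err := eachLine(strings.NewReader(src), func(line string) error {
		if line != "" {
			n++
		}
		return nil
	})
	return n, err
}

func main() {
	n, err := countNonEmpty("a\n\nb\r\nc")
	fmt.Println(n, err)
}
```

Returning an error from the callback stops the loop early, which is convenient when an insert fails.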
It seems to me this variant of readLines is shorter and faster than the one peterSO suggested:
func readLines(filename string) (map[int]string, error) {
lines := make(map[int]string)
data, err := ioutil.ReadFile(filename)
if err != nil {
return nil, err
}
for n, line := range strings.Split(string(data), "\n") {
lines[n] = line
}
return lines, nil
}
package main
import (
"fmt"
"os"
"log"
"bufio"
)
func main() {
FileName := "assets/file.txt"
file, err := os.Open(FileName)
if err != nil {
log.Fatal(err)
}
defer file.Close()
scanner := bufio.NewScanner(file)
for scanner.Scan() {
fmt.Println(scanner.Text())
}
}

How to stop io.CopyN

I have some code that copies from a file to a tcp socket (like an ftp server) and want to be able to abort this copy if needed.
I'm just using io.CopyN(socket, file, size) and can't see a way to signal an abort. Any ideas?
How about just closing the input file? io.CopyN will then return an error and abort.
Here is a demonstration (If not running on Linux change /dev/zero & /dev/null for your OS equivalent!)
package main
import (
"fmt"
"io"
"log"
"os"
"time"
)
func main() {
in, err := os.Open("/dev/zero")
if err != nil {
log.Fatal(err)
}
out, err := os.Create("/dev/null")
if err != nil {
log.Fatal(err)
}
go func() {
time.Sleep(time.Second)
in.Close()
}()
written, err := io.CopyN(out, in, 1E12)
fmt.Printf("%d bytes written with error %s\n", written, err)
}
When run it will print something like
9756147712 bytes written with error read /dev/zero: bad file descriptor
CopyN tries hard to copy N bytes. If you want to optionally copy less than N bytes then don't use CopyN in the first place. I would probably adapt the original code to something like (untested code):
func copyUpToN(dst io.Writer, src io.Reader, n int64, signal chan int) (written int64, err error) {
buf := make([]byte, 32*1024)
for written < n {
select {
default:
case <-signal:
return 0, fmt.Errorf("Aborted") // or whatever
}
l := len(buf)
if d := n - written; d < int64(l) {
l = int(d)
}
nr, er := src.Read(buf[0:l])
if nr > 0 {
nw, ew := dst.Write(buf[0:nr])
if nw > 0 {
written += int64(nw)
}
if ew != nil {
err = ew
break
}
if nr != nw {
err = io.ErrShortWrite
break
}
}
if er != nil {
err = er
break
}
}
return written, err
}
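A different way to get a similar effect while still calling io.CopyN is to wrap the source in a reader that checks an abort channel before every read. This is a swapped-in technique, not the code above, and only a minimal sketch: abortableReader, errAborted and copyN are hypothetical names, and the abort is only noticed between reads, so a Read that is already blocked is not interrupted.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
	"strings"
)

// errAborted is returned once the abort channel has been closed.
var errAborted = errors.New("copy aborted")

// abortableReader passes reads through to r until done is closed.
type abortableReader struct {
	r    io.Reader
	done <-chan struct{}
}

func (a abortableReader) Read(p []byte) (int, error) {
	select {
	case <-a.done:
		return 0, errAborted
	default:
		return a.r.Read(p)
	}
}

// copyN copies up to n bytes of src into a buffer and reports how many
// bytes were written. If abortFirst is set, the abort is signalled before
// the copy starts.
func copyN(src string, n int64, abortFirst bool) (int64, error) {
	done := make(chan struct{})
	if abortFirst {
		close(done)
	}
	var dst bytes.Buffer
	return io.CopyN(&dst, abortableReader{strings.NewReader(src), done}, n)
}

func main() {
	fmt.Println(copyN("hello world", 5, false))
	fmt.Println(copyN("hello world", 5, true))
}
```

In a real program another goroutine would close the channel while the copy is running; closing it up front just keeps the demonstration deterministic.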

How to calculate checksum of a file efficiently

I want to efficiently calculate the checksum of a very large file (multiple GB). This Go program has two approaches: quicksha, which reads the file in chunks and feeds them through a pipe but produces the wrong checksum, and the classical approach, slowsha, which works well.
Can you help me fix quicksha?
package main
import (
"bufio"
"crypto/sha256"
"encoding/hex"
"io"
"log"
"net/http"
"net/http/pprof"
"os"
)
func slowsha(fname string) {
f, err := os.Open(fname)
if err != nil {
log.Fatal(err)
}
defer f.Close()
h := sha256.New()
if _, err := io.Copy(h, f); err != nil {
log.Fatal(err)
}
log.Printf("%s %s", hex.EncodeToString(h.Sum(nil)), fname)
}
func quicksha(fname string) {
f, err := os.Open(fname)
if err != nil {
log.Fatal(err)
}
defer f.Close()
buf := make([]byte, 16*1024)
pr, pw := io.Pipe()
go func() {
w := bufio.NewWriter(pw)
for {
n, err := f.Read(buf)
if n > 0 {
buf = buf[:n]
w.Write(buf)
}
if err == io.EOF {
pw.Close()
break
}
}
}()
h := sha256.New()
io.Copy(h, pr)
log.Printf("%s %s", hex.EncodeToString(h.Sum(nil)), fname)
}
func main() {
fname := os.Args[2]
choice := os.Args[1]
for i := 0; i < 100; i++ {
if choice == "-s" {
slowsha(fname)
} else if choice == "-f" {
quicksha(fname)
} else {
log.Fatal("Bad choice")
}
}
}
Output
shasum -a 256 lessthan20MBTest.doc >> reference answer
d91b998a372035c2378fc40a6d0eee17b9f16d60207343f9fc3558eb77f90b71 lessthan20MBTest.doc
./quicksha -f lessthan20MBTest.doc >> wrong answer
b97d5167bbe945ca90223b7503653df89ba9e7d420268da27851fca6db3fcdcf lessthan20MBTest.doc
./quicksha -s lessthan20MBTest.doc . >>> right answer
d91b998a372035c2378fc40a6d0eee17b9f16d60207343f9fc3558eb77f90b71 lessthan20MBTest.doc
There are several problems in your program:
First: you are already using a buffer for reading/writing, so there is no need for a bufio.Writer; you are double-buffering with it. That also happens to be the reason you don't get the result you want: the data still sitting in the bufio.Writer's buffer has not been written to the pipe, so you have to w.Flush() before closing the pipe:
if err == io.EOF {
w.Flush()
pw.Close()
break
}
Second: you are making your buffer shorter. buf = buf[:n] permanently reslices buf, so after one short read every later read is limited to that smaller size. In general, a read does not have to fill the buffer. If the underlying stream is a network stream, read may return less than the buffer size, and that does not mean the end of the stream has been reached. For files this makes no difference in practice, but in general you should do:
if n > 0 {
w.Write(buf[:n])
}
Third: Did you measure? It is unlikely that the 'faster' implementation is actually faster. Including the buffering in io.Copy, you're triple-buffering with this implementation.
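To see the effect of the first point in isolation, here is a minimal sketch that hashes the same input directly and through a pipe with a bufio.Writer that is flushed before the pipe is closed. hashDirect, hashViaPipe and hashes are hypothetical names, and the input is a short in-memory string rather than a large file.

```go
package main

import (
	"bufio"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"strings"
)

// hashDirect is the plain approach: copy the input straight into the hash.
func hashDirect(r io.Reader) (string, error) {
	h := sha256.New()
	if _, err := io.Copy(h, r); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// hashViaPipe sends the input through a bufio.Writer and a pipe first.
func hashViaPipe(r io.Reader) (string, error) {
	pr, pw := io.Pipe()
	go func() {
		w := bufio.NewWriter(pw)
		_, err := io.Copy(w, r)
		if err == nil {
			// Without this Flush the tail of the data never reaches the pipe.
			err = w.Flush()
		}
		// A nil error closes the pipe normally, so the reader sees io.EOF.
		pw.CloseWithError(err)
	}()
	h := sha256.New()
	if _, err := io.Copy(h, pr); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// hashes returns both digests for the same in-memory input.
func hashes(s string) (direct, piped string) {
	direct, err := hashDirect(strings.NewReader(s))
	if err != nil {
		panic(err)
	}
	piped, err = hashViaPipe(strings.NewReader(s))
	if err != nil {
		panic(err)
	}
	return direct, piped
}

func main() {
	d, p := hashes("abc")
	fmt.Println(d == p, d)
}
```

Both functions produce the same digest, which is a reminder that the pipe and the extra buffer add work without changing the result.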
