To convert a [][]byte to a []string, I do this:
data, err := ioutil.ReadFile("test.txt")
if err != nil {
return nil, err
}
db := bytes.Split(data, []uint8("\n"))
// Convert [][]byte to []string
s := make([]string, len(db))
for i, val := range db {
s[i] = string(val)
}
fmt.Printf("%v", s)
I am new to Go, and I'm not sure this is the most efficient way to do it.
The most efficient way would be to remove the db := bytes.Split(data, []uint8("\n")) step and instead iterate over data directly, like this:
func main() {
data, _ := ioutil.ReadFile("test.txt")
s := make([]string, 0)
start := 0
for i := range data {
if data[i] == '\n' {
// take the line without its trailing '\n'
elem := string(data[start:i])
s = append(s, elem)
start = i + 1
}
}
fmt.Printf("%v", s)
}
Or if you want to convert [][]byte to []string:
func convert(data [][]byte) []string {
s := make([]string, len(data))
for row := range data {
s[row] = string(data[row])
}
return s
}
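For example, combined with the bytes.Split call from your question, convert could be used like this (a sketch; the file name is just a placeholder):
package main

import (
	"bytes"
	"fmt"
	"io/ioutil"
	"log"
)

func main() {
	data, err := ioutil.ReadFile("test.txt")
	if err != nil {
		log.Fatal(err)
	}
	// Split on newlines, then turn each [][]byte element into a string.
	db := bytes.Split(data, []byte("\n"))
	fmt.Println(convert(db))
}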
If you actually want to convert a file's contents to a []string, you can use bufio.Scanner, which is cleaner (IMO) and more efficient than the code you posted:
func readFile(filename string) ([]string, error) {
file, err := os.Open(filename)
if err != nil {
return nil, err
}
defer file.Close()
scanner := bufio.NewScanner(file)
var data []string
for scanner.Scan() {
line := scanner.Text()
data = append(data, line)
}
if err = scanner.Err(); err != nil {
return nil, err
}
return data, nil
}
Here's a benchmark* comparing the original function (readFile1) and my function (readFile2):
BenchmarkReadFile1-8 300 4632189 ns/op 3035552 B/op 10570 allocs/op
BenchmarkReadFile2-8 1000 1695820 ns/op 2169655 B/op 10587 allocs/op
*the benchmark read a sample file of 1.2 MiB and ~10K lines
The new code runs in 36% of the time and 71% of the memory used by the original function.
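For reference, the benchmark harness looked roughly like this (a sketch for a _test.go file with "testing" imported; readFile1 is assumed to be your original code wrapped in a function with the same signature, readFile2 corresponds to the readFile function above, and the file name is a placeholder):
func BenchmarkReadFile1(b *testing.B) {
	for i := 0; i < b.N; i++ {
		if _, err := readFile1("test.txt"); err != nil {
			b.Fatal(err)
		}
	}
}

func BenchmarkReadFile2(b *testing.B) {
	for i := 0; i < b.N; i++ {
		// readFile is the scanner-based version shown above.
		if _, err := readFile("test.txt"); err != nil {
			b.Fatal(err)
		}
	}
}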
Related
I'm trying to implement a function in Go that skips lines containing a given pattern while copying a long text file (ASCII guaranteed).
The functions I have below, withoutIgnore and withIgnore, both take a filename argument and return a *bytes.Buffer, which can subsequently be written to an io.Writer.
The withIgnore function takes an additional argument, pattern, to exclude lines containing that pattern from the file. The function works, but benchmarking shows it to be about 5x slower than withoutIgnore. Is there a way it could be improved?
package main
import (
"bufio"
"bytes"
"io"
"log"
"os"
)
func withoutIgnore(f string) (*bytes.Buffer, error) {
rfd, err := os.Open(f)
if err != nil {
log.Fatal(err)
}
defer func() {
if err := rfd.Close(); err != nil {
log.Fatal(err)
}
}()
inputBuffer := make([]byte, 1048576)
var bytesRead int
var bs []byte
opBuffer := bytes.NewBuffer(bs)
for {
bytesRead, err = rfd.Read(inputBuffer)
if err == io.EOF {
return opBuffer, nil
}
if err != nil {
return nil, err
}
_, err = opBuffer.Write(inputBuffer[:bytesRead])
if err != nil {
return nil, err
}
}
return opBuffer, nil
}
func withIgnore(f, pattern string) (*bytes.Buffer, error) {
rfd, err := os.Open(f)
if err != nil {
log.Fatal(err)
}
defer func() {
if err := rfd.Close(); err != nil {
log.Fatal(err)
}
}()
scanner := bufio.NewScanner(rfd)
var bs []byte
buffer := bytes.NewBuffer(bs)
for scanner.Scan() {
if !bytes.Contains(scanner.Bytes(), []byte(pattern)) {
_, err := buffer.WriteString(scanner.Text() + "\n")
if err != nil {
return nil, err
}
}
}
return buffer, nil
}
func main() {
// buff, err := withoutIgnore("base64dump.log")
buff, err := withIgnore("base64dump.log", "AUDIT")
if err != nil {
log.Fatal(err)
}
_, err = buff.WriteTo(os.Stdout)
if err != nil {
log.Fatal(err)
}
}
Benchmark test
package main
import "testing"
func BenchmarkTestWithoutIgnore(b *testing.B) {
for i := 0; i < b.N; i++ {
_, err := withoutIgnore("base64dump.log")
if err != nil {
b.Fatal(err)
}
}
}
func BenchmarkTestWithIgnore(b *testing.B) {
for i := 0; i < b.N; i++ {
_, err := withIgnore("base64dump.log", "AUDIT")
if err != nil {
b.Fatal(err)
}
}
}
and the "base64dump.log" can be generated in the command line using
base64 /dev/urandom | head -c 10000000 > base64dump.log
Since ASCII is guaranteed, one can work directly at byte level.
Still, if one checks each byte for line breaks while reading the input and then searches for the pattern again within each line, operations are applied to every byte.
If, on the other hand, one reads chunks of the input and performs an optimized search for the pattern in the text, not even examining each input byte, one minimizes the operations per input byte.
For example, there is the Boyer-Moore string search algorithm. Go's built-in bytes.Index function is also optimized. The achieved speed depends, of course, on the input data and the actual pattern. For the input as specified in the question, bytes.Index turned out to be significantly more performant when measured.
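For illustration, here is a minimal, self-contained use of bytes.Index (the data and pattern are made up for the example):
package main

import (
	"bytes"
	"fmt"
)

func main() {
	data := []byte("line one\nline two with AUDIT marker\nline three\n")
	// Index returns the byte offset of the first occurrence of the pattern,
	// or -1 if the pattern does not occur in data.
	fmt.Println(bytes.Index(data, []byte("AUDIT")))   // 23
	fmt.Println(bytes.Index(data, []byte("missing"))) // -1
}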
Procedure
read in a chunk; the chunk size should be significantly larger than the maximum line length, so a value >= 64 KB should probably be good (in the test, 1 MB was used, as in the question)
a chunk usually doesn't end at a linefeed, so search from the end of the chunk to the next linefeed, limit the search to this slice and remember the remaining data for the next pass
the last chunk does not necessarily end in a linefeed
with the help of the performant Go function bytes.Index you can find the places where the pattern occurs in the chunk
from the found location one searches for the preceding and the following linefeed
then the block is output up to the corresponding beginning of the line
and the search is continued from the end of the line where the pattern occurred
if the search does not find another location, the rest is output
read the next chunk and apply the described steps again until the end of the file is reached
Noteworthy
A read operation may return less data than the chunk size, so it makes sense to repeat the read operation until the chunk size data has been read.
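As an aside, the standard library's io.ReadFull gives the same guarantee; a sketch of readChunk built on it (not the code that was benchmarked, and it uses len(chunk) instead of a chunkSize parameter) could look like this:
// readChunkFull is a sketch of readChunk using io.ReadFull: it first copies the
// leftover bytes from the previous pass, then fills the rest of the chunk.
func readChunkFull(reader io.Reader, chunk, remaining []byte) (int, error) {
	copy(chunk, remaining)
	n, err := io.ReadFull(reader, chunk[len(remaining):])
	if err == io.ErrUnexpectedEOF {
		// A short final read: translate it so the caller treats this as the last chunk.
		err = io.EOF
	}
	return len(remaining) + n, err
}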
Benchmark
Optimized code is often significantly more complicated, but the performance is also significantly better, as we will see in a moment.
BenchmarkTestWithoutIgnore-8 270 4137267 ns/op
BenchmarkTestWithIgnore-8 54 22403931 ns/op
BenchmarkTestFilter-8 150 7947454 ns/op
Here, the optimized code BenchmarkTestFilter-8 is only about 1.9x slower than the operation without filtering while the BenchmarkTestWithIgnore-8 method is 5.4x slower than the comparison value without filtering.
Looked at another way: the optimized code is 2.8 times faster than the unoptimized one.
Code
Of course, here is the code for your own tests:
func filterFile(f, pattern string) (*bytes.Buffer, error) {
rfd, err := os.Open(f)
if err != nil {
log.Fatal(err)
}
defer func() {
if err := rfd.Close(); err != nil {
log.Fatal(err)
}
}()
reader := bufio.NewReader(rfd)
return filter(reader, []byte(pattern), 1024*1024)
}
// chunkSize must be larger than the longest line
// a reasonable size is probably >= 64K
func filter(reader io.Reader, pattern []byte, chunkSize int) (*bytes.Buffer, error) {
var bs []byte
buffer := bytes.NewBuffer(bs)
chunk := make([]byte, chunkSize)
var remaining []byte
for lastChunk := false; !lastChunk; {
n, err := readChunk(reader, chunk, remaining, chunkSize)
if err != nil {
if err == io.EOF {
lastChunk = true
} else {
return nil, err
}
}
remaining = remaining[:0]
if !lastChunk {
for i := n - 1; i > 0; i-- {
if chunk[i] == '\n' {
remaining = append(remaining, chunk[i+1:n]...)
n = i + 1
break
}
}
}
s := 0
for s < n {
hit := bytes.Index(chunk[s:n], pattern)
if hit < 0 {
break
}
hit += s
startOfLine := hit
for ; startOfLine > 0; startOfLine-- {
if chunk[startOfLine] == '\n' {
startOfLine++
break
}
}
endOfLine := hit + len(pattern)
for ; endOfLine < n; endOfLine++ {
if chunk[endOfLine] == '\n' {
break
}
}
endOfLine++
_, err = buffer.Write(chunk[s:startOfLine])
if err != nil {
return nil, err
}
s = endOfLine
}
if s < n {
_, err = buffer.Write(chunk[s:n])
if err != nil {
return nil, err
}
}
}
return buffer, nil
}
func readChunk(reader io.Reader, chunk, remaining []byte, chunkSize int) (int, error) {
copy(chunk, remaining)
r := len(remaining)
for r < chunkSize {
n, err := reader.Read(chunk[r:])
r += n
if err != nil {
return r, err
}
}
return r, nil
}
And the benchmark part might look something like this:
func BenchmarkTestFilter(b *testing.B) {
for i := 0; i < b.N; i++ {
_, err := filterFile("base64dump.log", "AUDIT")
if err != nil {
b.Fatal(err)
}
}
}
The filtering was split into two functions, and the actual job is done in func filter(reader io.Reader, pattern []byte, chunkSize int) (*bytes.Buffer, error).
By injecting a reader and a chunkSize, the code is already prepared for unit tests, which are missing here but are definitely recommended when dealing with index arithmetic.
However, the main point here was to find a way to improve performance significantly.
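Such a unit test could feed filter an in-memory reader, for example (a sketch with made-up test data; it belongs in the same package as filter, with "strings" and "testing" imported):
func TestFilter(t *testing.T) {
	input := "keep this line\ndrop this AUDIT line\nkeep this too\n"
	buf, err := filter(strings.NewReader(input), []byte("AUDIT"), 1024)
	if err != nil {
		t.Fatal(err)
	}
	want := "keep this line\nkeep this too\n"
	if got := buf.String(); got != want {
		t.Errorf("got %q, want %q", got, want)
	}
}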
I have a function which splits data and returns a slice of subslices:
func split(buf []byte, lim int) [][]byte
Obviously I get an error if I do:
n, err = out.Write(split(buf[:n], 100))
The error:
cannot convert split(buf[:n], 100) (type [][]byte) to type []byte
How do I convert [][]byte to []byte?
Edit based on #Wishwa Perera: https://play.golang.org/p/nApPAYRV4ZW
Since you are splitting buf into chunks, you can pass them individually to Write by looping over the result of split.
for _, chunk := range split(buf[:n], 100) {
if _, err := out.Write(chunk); err != nil {
panic(err)
}
}
If out is a net.Conn as in your other question, then use net.Buffers to write the [][]byte.
b := net.Buffers(split(buf[:n], 100))
_, err := b.WriteTo(out)
if err != nil {
panic(err)
}
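The split function itself isn't shown in the question; one plausible implementation, included here only so the snippets above have something concrete to call (an assumption, not your actual code), would be:
// split cuts buf into chunks of at most lim bytes; the returned subslices
// share their underlying array with buf.
func split(buf []byte, lim int) [][]byte {
	var chunks [][]byte
	for len(buf) > lim {
		chunks = append(chunks, buf[:lim])
		buf = buf[lim:]
	}
	if len(buf) > 0 {
		chunks = append(chunks, buf)
	}
	return chunks
}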
How do I find and read a specific line number in a file, one that corresponds to some input?
I googled up this code, but it loads the whole content of the file into a single slice with all the lines indexed. Isn't there a simpler way?
func LinesInFile(fileName string) []string {
f, _ := os.Open(fileName)
// Create new Scanner.
scanner := bufio.NewScanner(f)
result := []string{}
// Use Scan.
for scanner.Scan() {
line := scanner.Text()
// Append line to result.
result = append(result, line)
}
return result
}
You should just ignore the lines you're not interested in.
func ReadExactLine(fileName string, lineNumber int) string {
inputFile, err := os.Open(fileName)
if err != nil {
fmt.Println("Error occurred! ", err)
return ""
}
defer inputFile.Close()
br := bufio.NewReader(inputFile)
for i := 1; i < lineNumber; i++ {
_, _ = br.ReadString('\n')
}
str, err := br.ReadString('\n')
fmt.Println("Line is ", str)
return str
}
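A minimal caller might look like this (assuming a file named test.txt with at least three lines; the path is only a placeholder):
func main() {
	line := ReadExactLine("test.txt", 3)
	fmt.Print(line) // the third line, including its trailing '\n'
}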
I have the following:
func getSlice(distinctSymbols []string) []symbols {
// Prepare connection
stmt1, err := db.Prepare("Select count(*) from stockticker_daily where symbol = $1;")
checkError(err)
defer stmt1.Close()
stmt2, err := db.Prepare("Select date from stockticker_daily where symbol = $1 order by date asc limit 1;")
checkError(err)
defer stmt2.Close()
var symbolsSlice []symbols
c := make(chan symbols)
for _, symbol := range distinctSymbols {
go worker(symbol, stmt1, stmt2, c)
symbolsFromChannel := <-c
symbolsSlice = append(symbolsSlice, symbolsFromChannel)
}
return symbolsSlice
}
func worker(symbol string, stmt1 *sql.Stmt, stmt2 *sql.Stmt, symbolsChan chan symbols) {
var countdp int
var earliestdate string
row := stmt1.QueryRow(symbol)
if err := row.Scan(&countdp); err != nil {
log.Fatal(err)
}
row = stmt2.QueryRow(symbol)
if err := row.Scan(&earliestdate); err != nil {
log.Fatal(err)
}
symbolsChan <- symbols{symbol, countdp, earliestdate}
}
Please take a look at the first function. I know it won't work as I expect, since the line symbolsFromChannel := <-c will block until it receives from the channel, so the loop won't start the next go worker goroutine until the previous one has sent its result. What is the best or correct way to fix that?
Just do the loop twice, e.g.
for _, symbol := range distinctSymbols {
go worker(symbol, stmt1, stmt2, c)
}
for range distinctSymbols {
symbolsSlice = append(symbolsSlice, <-c)
}
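Applied to getSlice from your question, that looks like this (a sketch reusing the question's db, symbols, checkError, and worker definitions):
func getSlice(distinctSymbols []string) []symbols {
	stmt1, err := db.Prepare("Select count(*) from stockticker_daily where symbol = $1;")
	checkError(err)
	defer stmt1.Close()
	stmt2, err := db.Prepare("Select date from stockticker_daily where symbol = $1 order by date asc limit 1;")
	checkError(err)
	defer stmt2.Close()
	var symbolsSlice []symbols
	c := make(chan symbols)
	// Start one worker per symbol without blocking on the channel.
	for _, symbol := range distinctSymbols {
		go worker(symbol, stmt1, stmt2, c)
	}
	// Collect exactly one result per started worker.
	for range distinctSymbols {
		symbolsSlice = append(symbolsSlice, <-c)
	}
	return symbolsSlice
}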
According to the Scanner.Scan documentation, Scan() advances the Scanner to the next token, but what does that mean? I find that Scanner.Text and Scanner.Bytes can return different values, which is puzzling.
This code doesn't always cause an error, but as the file becomes larger it does:
func TestScanner(t *testing.T) {
path := "/tmp/test.txt"
f, err := os.Open(path)
if err != nil {
panic(fmt.Sprint("failed to open ", path))
}
defer f.Close()
scanner := bufio.NewScanner(f)
bs := make([][]byte, 0)
for scanner.Scan() {
bs = append(bs, scanner.Bytes())
}
f, err = os.Open(path)
if err != nil {
panic(fmt.Sprint("failed to open ", path))
}
defer f.Close()
scanner = bufio.NewScanner(f)
ss := make([]string, 0)
for scanner.Scan() {
ss = append(ss, scanner.Text())
}
for i, b := range bs {
if string(b) != ss[i] {
t.Errorf("expect %s, got %s", ss[i], string(b))
}
}
}
The token is defined by the scanner's split function. Scan() returns when the split function finds a token or there's an error.
The Text() and Bytes() methods both return the current token. Text() returns a copy of the token as a string. Bytes() does not allocate memory and returns a slice that may use a backing array that is overwritten on a subsequent call to Scan().
Copy the slice returned from Bytes() to avoid this issue:
for scanner.Scan() {
bs = append(bs, append([]byte(nil), scanner.Bytes()...))
}