I'm trying to compress a file from a buffered reader and pass the compressed bytes through a byte channel, but with poor results :). Here's what I've come up with so far; obviously it doesn't work...
func Compress(r io.Reader) (<-chan byte) {
    c := make(chan byte)
    go func(){
        var wBuff bytes.Buffer
        rBuff := make([]byte, 1024)
        writer := zlib.NewWriter(*wBuff)
        for {
            n, err := r.Read(rBuff)
            if err != nil && err != io.EOF { panic(err) }
            if n == 0 { break }
            writer.Write(rBuff) // Compress and write compressed data
            // How to send written compressed bytes through channel?
            // as far as I understand wBuff will eventually contain
            // whole compressed data?
        }
        writer.Close()
        close(c) // Indicate that no more data follows
    }()
    return c
}
Please bear with me, as I'm very new to Go
I suggest using []byte instead of byte; it is more efficient. Because of concurrent memory accesses, it may be necessary to send a copy of the buffer through the channel rather than sending the []byte buffer itself.
You can define a type ChanWriter chan []byte and let it implement the io.Writer interface. Then pass the ChanWriter to zlib.NewWriter.
You can create a goroutine for doing the compression and then immediately return the ChanWriter's channel from your Compress function. If there is no goroutine then there is no reason for the function to return a channel and the preferred return type is io.Reader.
The return type of the Compress function should be changed into something like <-chan BytesWithError, so that read or compression errors can be delivered to the consumer instead of panicking. In this case ChanWriter can be defined as type ChanWriter chan BytesWithError.
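To make that concrete, here is a minimal sketch of the ChanWriter idea (it sends plain []byte values and omits the error-carrying variant for brevity; the names are illustrative and the usual compress/zlib and io imports are assumed):

type ChanWriter chan []byte

// Write copies p before sending it, because the zlib writer may reuse its
// internal buffer after Write returns.
func (w ChanWriter) Write(p []byte) (int, error) {
    buf := make([]byte, len(p))
    copy(buf, p)
    w <- buf
    return len(p), nil
}

func Compress(r io.Reader) <-chan []byte {
    c := make(ChanWriter)
    go func() {
        defer close(c)
        zw := zlib.NewWriter(c)
        if _, err := io.Copy(zw, r); err != nil {
            return // a real implementation would report the error, e.g. via BytesWithError
        }
        zw.Close() // flush the remaining compressed data
    }()
    return c
}

The caller then simply ranges over the returned channel until it is closed.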
Sending bytes one by one down a channel is not going to be particularly efficient. Another approach that may be more useful would be to return an object implementing the io.Reader interface, whose Read() method reads a block from the original io.Reader and compresses it before returning it.
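For illustration, here is a minimal sketch of that io.Reader-returning approach; rather than hand-writing a Read method, it leans on io.Pipe and a goroutine, which is a substitution on my part rather than exactly what is described above (compress/zlib and io imports assumed):

// CompressReader returns an io.Reader that yields the zlib-compressed form of r.
func CompressReader(r io.Reader) io.Reader {
    pr, pw := io.Pipe()
    go func() {
        zw := zlib.NewWriter(pw)
        _, err := io.Copy(zw, r) // compress while reading from r
        if err == nil {
            err = zw.Close() // flush the zlib trailer
        }
        pw.CloseWithError(err) // a nil err surfaces as io.EOF on the read side
    }()
    return pr
}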
Your writer.Write(rBuff) statement always writes len(rBuff) bytes, even when n != len(rBuff).
writer.Write(rBuff[:n])
Also, your Read loop is
for {
    n, err := r.Read(rBuff)
    if err != nil && err != io.EOF {
        panic(err)
    }
    if n == 0 {
        break
    }
    writer.Write(rBuff[:n])
    // ...
}
which is equivalent to
for {
    n, err := r.Read(rBuff)
    if err != nil && err != io.EOF {
        panic(err)
    }
    // !(err != nil && err != io.EOF)
    // !(err != nil) || !(err != io.EOF)
    // err == nil || err == io.EOF
    if err == nil || err == io.EOF {
        if n == 0 {
            break
        }
    }
    writer.Write(rBuff[:n])
    // ...
}
The loop exits prematurely if err == nil && n == 0.
Instead, write
for {
    n, err := r.Read(rBuf)
    if err != nil {
        if err != io.EOF {
            panic(err)
        }
        if n == 0 {
            break
        }
    }
    writer.Write(rBuf[:n])
    // ...
}
OK, I've found a working solution. (Feel free to point out where it can be improved, or whether I'm doing something wrong.)
func Compress(r io.Reader) (<-chan byte) {
    c := make(chan byte)
    go func(){
        var wBuff bytes.Buffer
        rBuff := make([]byte, 1024)
        writer := zlib.NewWriter(&wBuff)
        for {
            n, err := r.Read(rBuff)
            if err != nil {
                if err != io.EOF {
                    panic(err)
                }
                if n == 0 {
                    break
                }
            }
            writer.Write(rBuff[:n])
            for _, v := range wBuff.Bytes() {
                c <- v
            }
            wBuff.Truncate(0)
        }
        writer.Close()
        for _, v := range wBuff.Bytes() {
            c <- v
        }
        close(c) // Indicate that no more data follows
    }()
    return c
}
Related
I'm trying to implement a function in Go that ignores lines containing a pattern from a long text file (ASCII guaranteed).
The functions I have below, withoutIgnore and withIgnore, both take a filename argument and return a *bytes.Buffer, which can subsequently be written to an io.Writer.
The withIgnore function takes an additional argument, pattern, to exclude lines containing the pattern from the file. The function works, but benchmarking shows it to be 5x slower than withoutIgnore. Is there a way it could be improved?
package main

import (
    "bufio"
    "bytes"
    "io"
    "log"
    "os"
)

func withoutIgnore(f string) (*bytes.Buffer, error) {
    rfd, err := os.Open(f)
    if err != nil {
        log.Fatal(err)
    }
    defer func() {
        if err := rfd.Close(); err != nil {
            log.Fatal(err)
        }
    }()
    inputBuffer := make([]byte, 1048576)
    var bytesRead int
    var bs []byte
    opBuffer := bytes.NewBuffer(bs)
    for {
        bytesRead, err = rfd.Read(inputBuffer)
        if err == io.EOF {
            return opBuffer, nil
        }
        if err != nil {
            return nil, err
        }
        _, err = opBuffer.Write(inputBuffer[:bytesRead])
        if err != nil {
            return nil, err
        }
    }
    return opBuffer, nil
}
func withIgnore(f, pattern string) (*bytes.Buffer, error) {
    rfd, err := os.Open(f)
    if err != nil {
        log.Fatal(err)
    }
    defer func() {
        if err := rfd.Close(); err != nil {
            log.Fatal(err)
        }
    }()
    scanner := bufio.NewScanner(rfd)
    var bs []byte
    buffer := bytes.NewBuffer(bs)
    for scanner.Scan() {
        if !bytes.Contains(scanner.Bytes(), []byte(pattern)) {
            _, err := buffer.WriteString(scanner.Text() + "\n")
            if err != nil {
                return nil, err
            }
        }
    }
    return buffer, nil
}
func main() {
    // buff, err := withoutIgnore("base64dump.log")
    buff, err := withIgnore("base64dump.log", "AUDIT")
    if err != nil {
        log.Fatal(err)
    }
    _, err = buff.WriteTo(os.Stdout)
    if err != nil {
        log.Fatal(err)
    }
}
Benchmark test
package main

import "testing"

func BenchmarkTestWithoutIgnore(b *testing.B) {
    for i := 0; i < b.N; i++ {
        _, err := withoutIgnore("base64dump.log")
        if err != nil {
            b.Fatal(err)
        }
    }
}

func BenchmarkTestWithIgnore(b *testing.B) {
    for i := 0; i < b.N; i++ {
        _, err := withIgnore("base64dump.log", "AUDIT")
        if err != nil {
            b.Fatal(err)
        }
    }
}
and the "base64dump.log" can be generated in the command line using
base64 /dev/urandom | head -c 10000000 > base64dump.log
Since ASCII is guaranteed, one can work directly at byte level.
Still, if one checks each byte for line breaks while reading the input and then searches for the pattern again within each line, operations are applied to every byte.
If, on the other hand, one reads chunks of the input and performs an optimized search for the pattern in the text, not even examining each input byte, one minimizes the operations per input byte.
For example, there is the Boyer-Moore string search algorithm. Go's built-in bytes.Index function is also optimized. The achieved speed depends, of course, on the input data and the actual pattern. For the input as specified in the question, bytes.Index turned out to be significantly more performant when measured.
Procedure
read in a chunk; the chunk size should be significantly longer than the maximum line length. A value >= 64 KB is probably good; in the test, 1 MB was used, as in the question
a chunk usually doesn't end at a linefeed, so search backward from the end of the chunk for the last linefeed, limit the search to this slice, and remember the remaining data for the next pass
the last chunk does not necessarily end in a linefeed
with the help of the performant Go function bytes.Index you can find the places where the pattern occurs in the chunk
from the found location one searches for the preceding and the following linefeed
then the block is output up to the corresponding beginning of the line
and the search is continued from the end of the line where the pattern occurred
if the search does not find another location, the rest is output
read the next chunk and apply the described steps again until the end of the file is reached
Noteworthy
A read operation may return less data than the chunk size, so it makes sense to repeat the read operation until a full chunk has been read.
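As an aside, the standard library's io.ReadFull already provides this repeat-until-full behavior; a hypothetical variant of the readChunk helper shown further down could therefore look like this (not the code that was benchmarked):

func readChunkFull(reader io.Reader, chunk, remaining []byte) (int, error) {
    n := copy(chunk, remaining) // carry over the tail of the previous chunk
    m, err := io.ReadFull(reader, chunk[n:])
    if err == io.ErrUnexpectedEOF {
        err = io.EOF // a short final read simply means the input has ended
    }
    return n + m, err
}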
Benchmark
Optimized code is often significantly more complicated, but the performance is also significantly better, as we will see in a moment.
BenchmarkTestWithoutIgnore-8 270 4137267 ns/op
BenchmarkTestWithIgnore-8 54 22403931 ns/op
BenchmarkTestFilter-8 150 7947454 ns/op
Here, the optimized code BenchmarkTestFilter-8 is only about 1.9x slower than the operation without filtering, while the BenchmarkTestWithIgnore-8 method is 5.4x slower than the comparison value without filtering.
Looked at another way: the optimized code is 2.8 times faster than the unoptimized one.
Code
Of course, here is the code for your own tests:
func filterFile(f, pattern string) (*bytes.Buffer, error) {
    rfd, err := os.Open(f)
    if err != nil {
        log.Fatal(err)
    }
    defer func() {
        if err := rfd.Close(); err != nil {
            log.Fatal(err)
        }
    }()
    reader := bufio.NewReader(rfd)
    return filter(reader, []byte(pattern), 1024*1024)
}
// chunkSize must be larger than the longest line
// a reasonable size is probably >= 64K
func filter(reader io.Reader, pattern []byte, chunkSize int) (*bytes.Buffer, error) {
    var bs []byte
    buffer := bytes.NewBuffer(bs)
    chunk := make([]byte, chunkSize)
    var remaining []byte
    for lastChunk := false; !lastChunk; {
        n, err := readChunk(reader, chunk, remaining, chunkSize)
        if err != nil {
            if err == io.EOF {
                lastChunk = true
            } else {
                return nil, err
            }
        }
        remaining = remaining[:0]
        if !lastChunk {
            for i := n - 1; i > 0; i-- {
                if chunk[i] == '\n' {
                    remaining = append(remaining, chunk[i+1:n]...)
                    n = i + 1
                    break
                }
            }
        }
        s := 0
        for s < n {
            hit := bytes.Index(chunk[s:n], pattern)
            if hit < 0 {
                break
            }
            hit += s
            startOfLine := hit
            for ; startOfLine > 0; startOfLine-- {
                if chunk[startOfLine] == '\n' {
                    startOfLine++
                    break
                }
            }
            endOfLine := hit + len(pattern)
            for ; endOfLine < n; endOfLine++ {
                if chunk[endOfLine] == '\n' {
                    break
                }
            }
            endOfLine++
            _, err = buffer.Write(chunk[s:startOfLine])
            if err != nil {
                return nil, err
            }
            s = endOfLine
        }
        if s < n {
            _, err = buffer.Write(chunk[s:n])
            if err != nil {
                return nil, err
            }
        }
    }
    return buffer, nil
}
func readChunk(reader io.Reader, chunk, remaining []byte, chunkSize int) (int, error) {
    copy(chunk, remaining)
    r := len(remaining)
    for r < chunkSize {
        n, err := reader.Read(chunk[r:])
        r += n
        if err != nil {
            return r, err
        }
    }
    return r, nil
}
And the benchmark part might look something like this:
func BenchmarkTestFilter(b *testing.B) {
    for i := 0; i < b.N; i++ {
        _, err := filterFile("base64dump.log", "AUDIT")
        if err != nil {
            b.Fatal(err)
        }
    }
}
The filter function was split and the actual job is done in func filter(reader io.Reader, pattern []byte, chunkSize int) (*bytes.Buffer, error).
By injecting a reader and a chunkSize, the code is prepared for unit tests, which are missing here but are definitely recommended when dealing with index arithmetic.
However, the main point here was to find a way to significantly improve it in terms of performance.
I have a chan string where each entry is a CSV log line that I would like to convert to columns ([]string). Currently I am (inefficiently) creating a csv.NewReader(strings.NewReader(i)) for each item, which looks like a lot more work than it really needs to be:
for i := range feederChan {
    r := csv.NewReader(strings.NewReader(i))
    a, err := r.Read()
    if err != nil {
        // log error...
        continue
    }
    // then do stuff with 'a'
    // ...
}
So I'd really appreciate it if someone could share a more efficient way to do this, like creating the csv.Reader once and then feeding it the channel's content somehow (streaming the chan content to something that implements the io.Reader interface?).
Use the following to convert a channel of strings to a reader:
type chanReader struct {
    c   chan string
    buf string
}

func (r *chanReader) Read(p []byte) (int, error) {
    // Fill the buffer when we have no data to return to the caller
    if len(r.buf) == 0 {
        var ok bool
        r.buf, ok = <-r.c
        if !ok {
            // Return eof on channel closed
            return 0, io.EOF
        }
    }
    n := copy(p, r.buf)
    r.buf = r.buf[n:]
    return n, nil
}
Use it like this:
r := csv.NewReader(&chanReader{c: feederChan})
for {
    a, err := r.Read()
    if err != nil {
        // handle error, break out of loop
    }
    // do something with a
}
Run it on the playground
If the application assumes that newlines separate the values received from the channel, then append a newline to each value received:
...
var ok bool
r.buf, ok = <-r.c
if !ok {
    // Return eof on channel closed
    return 0, io.EOF
}
r.buf += "\n"
...
The += "\n" copies the string. If this does not meet the application's efficiency requirements, then introduce a new field to manage line separators.
type chanReader struct {
    c   chan string // source of lines
    buf string      // the current line
    nl  bool        // true if line separator is pending
}

func (r *chanReader) Read(p []byte) (int, error) {
    // Fill the buffer when we have no data to return to the caller
    if len(r.buf) == 0 && !r.nl {
        var ok bool
        r.buf, ok = <-r.c
        if !ok {
            // Return eof on channel closed
            return 0, io.EOF
        }
        r.nl = true
    }
    // Return data if we have it
    if len(r.buf) > 0 {
        n := copy(p, r.buf)
        r.buf = r.buf[n:]
        return n, nil
    }
    // No data, return the line separator
    n := copy(p, "\n")
    r.nl = n == 0
    return n, nil
}
Run it on the playground.
Another approach is to use an io.Pipe and a goroutine to convert the channel to an io.Reader, as suggested in a comment to the question. A first pass at this approach is:
var nl = []byte("\n")

func createChanReader(c chan string) io.Reader {
    r, w := io.Pipe()
    go func() {
        defer w.Close()
        for s := range c {
            io.WriteString(w, s)
            w.Write(nl)
        }
    }()
    return r
}
Use it like this:
r := csv.NewReader(createChanReader(feederChan))
for {
    a, err := r.Read()
    if err != nil {
        // handle error, break out of loop
    }
    // do something with a
}
This first pass at the io.Pipe solution leaks a goroutine when the application exits the loop before reading the pipe to EOF. The application might break out early because the CSV reader detected a syntax error, the application panicked because of a programmer error, or any number of other reasons.
To fix the goroutine leak, exit the writing goroutine on write error and close the pipe reader when done reading.
var nl = []byte("\n")

func createChanReader(c chan string) *io.PipeReader {
    r, w := io.Pipe()
    go func() {
        defer w.Close()
        for s := range c {
            if _, err := io.WriteString(w, s); err != nil {
                return
            }
            if _, err := w.Write(nl); err != nil {
                return
            }
        }
    }()
    return r
}
Use it like this:
cr := createChanReader(feederChan)
defer cr.Close() // Required for goroutine cleanup
r := csv.NewReader(cr)
for {
    a, err := r.Read()
    if err != nil {
        // handle error, break out of loop
    }
    // do something with a
}
Run it on the playground.
Even though ThunderCat's answer was really useful and appreciated, I ended up using io.Pipe() as mh-cbon mentioned, which is much simpler and seems more efficient (explained below):
rp, wp := io.Pipe()
go func() {
    defer wp.Close()
    for i := range feederChan {
        fmt.Fprintln(wp, i)
    }
}()
r := csv.NewReader(rp)
for { // keep reading
    a, err := r.Read()
    if err == io.EOF {
        break
    }
    // do stuff with 'a'
    // ...
}
io.Pipe() is synchronous and should be fairly efficient: it pipes data from a writer to a reader. I fed csv.NewReader() the reader end and created a goroutine that drains the chan, writing to the writer end.
Thanks a lot.
EDIT: ThunderCat added the io.Pipe approach to his answer (after I posted this I guess) ... his answer is much more comprehensive and was accepted as such.
I've written a little server which receives a blob of data in the form of an io.Reader, adds a header and streams the result back to the caller.
My implementation isn't particularly efficient, as I'm buffering the blob's data in memory so that I can calculate the blob's length, which needs to form part of the header.
I've seen some examples of io.Pipe() with io.TeeReader but they're more for splitting an io.Reader into two, and writing them away in parallel.
The blobs I'm dealing with are around 100KB, so not huge but if my server gets busy, memory's going to quickly become an issue...
Any ideas?
func addHeader(in io.Reader) (out io.Reader, err error) {
    buf := new(bytes.Buffer)
    if _, err = io.Copy(buf, in); err != nil {
        return
    }
    header := bytes.NewReader([]byte(fmt.Sprintf("header:%d", buf.Len())))
    return io.MultiReader(header, buf), nil
}
I appreciate it's not a good idea to return interfaces from functions but this code isn't destined to become an API, so I'm not too concerned with that bit.
In general, the only way to determine the length of data in an io.Reader is to read until EOF. There are ways to determine the length of the data for specific types.
func addHeader(in io.Reader) (out io.Reader, err error) {
    n := 0
    switch v := in.(type) {
    case *bytes.Buffer:
        n = v.Len()
    case *bytes.Reader:
        n = v.Len()
    case *strings.Reader:
        n = v.Len()
    case io.Seeker:
        cur, err := v.Seek(0, 1)
        if err != nil {
            return nil, err
        }
        end, err := v.Seek(0, 2)
        if err != nil {
            return nil, err
        }
        _, err = v.Seek(cur, 0)
        if err != nil {
            return nil, err
        }
        n = int(end - cur)
    default:
        var buf bytes.Buffer
        if _, err := buf.ReadFrom(in); err != nil {
            return nil, err
        }
        n = buf.Len()
        in = &buf
    }
    header := strings.NewReader(fmt.Sprintf("header:%d", n))
    return io.MultiReader(header, in), nil
}
This is similar to how the net/http package determines the content length of the request body.
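A quick usage sketch (a hypothetical example, assuming the addHeader above): for a strings.Reader the length comes straight out of the type switch, so nothing extra is buffered.

r, err := addHeader(strings.NewReader("payload"))
if err != nil {
    log.Fatal(err)
}
if _, err := io.Copy(os.Stdout, r); err != nil {
    log.Fatal(err)
}
// Output: header:7payload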
I’ve written a short program in Go to communicate with a sensor through a serial port:
package main

import (
    "fmt"
    "github.com/tarm/goserial"
    "time"
)

func main() {
    c := &serial.Config{Name: "/dev/ttyUSB0", Baud: 9600}
    s, err := serial.OpenPort(c)
    if err != nil {
        fmt.Println(err)
    }
    _, err = s.Write([]byte("\x16\x02N0C0 G A\x03\x0d\x0a"))
    if err != nil {
        fmt.Println(err)
    }
    time.Sleep(time.Second / 2)
    buf := make([]byte, 40)
    n, err := s.Read(buf)
    if err != nil {
        fmt.Println(err)
    }
    fmt.Println(string(buf[:n]))
    s.Close()
}
It works fine, but after writing to the port I have to wait about half a second before I can start reading from it. I would like to use a while-loop instead of time.Sleep to read all incoming data. My attempt doesn’t work:
buf := make([]byte, 40)
n := 0
for {
    n, _ := s.Read(buf)
    if n > 0 {
        break
    }
}
fmt.Println(string(buf[:n]))
I guess buf gets overwritten after every loop pass. Any suggestions?
Your problem is that Read() will return whenever it has some data - it won't wait for all the data. See the io.Reader specification for more info
What you want to do is read until you reach some delimiter. I don't know exactly what format you are trying to use, but it looks like maybe \x0a is the end delimiter.
In which case you would use a bufio.Reader like this
reader := bufio.NewReader(s)
reply, err := reader.ReadBytes('\x0a')
if err != nil {
    panic(err)
}
fmt.Println(reply)
This will read data until the first \x0a.
I guess buf gets overwritten after every loop pass. Any suggestions?
Yes, buf will get overwritten with every call to Read().
A timeout on the file handle would be the approach I would take.
s, _ := os.OpenFile("/dev/ttyS0", syscall.O_RDWR|syscall.O_NOCTTY|syscall.O_NONBLOCK, 0666)

t := syscall.Termios{
    Iflag:  syscall.IGNPAR,
    Cflag:  syscall.CS8 | syscall.CREAD | syscall.CLOCAL | syscall.B115200,
    Cc:     [32]uint8{syscall.VMIN: 0, syscall.VTIME: uint8(20)}, // 2.0s timeout
    Ispeed: syscall.B115200,
    Ospeed: syscall.B115200,
}

// syscall
syscall.Syscall6(syscall.SYS_IOCTL, uintptr(s.Fd()),
    uintptr(syscall.TCSETS), uintptr(unsafe.Pointer(&t)),
    0, 0, 0)

// Send message
n, _ := s.Write([]byte("Test message"))

// Receive reply
var err error
for {
    buf := make([]byte, 128)
    n, err = s.Read(buf)
    if err != nil { // err will equal io.EOF
        break
    }
    fmt.Printf("%v\n", string(buf[:n]))
}
Also note that if no more data is read and there is no error, os.File.Read() will return io.EOF, as you can see here.
I have some code that copies from a file to a TCP socket (like an FTP server) and want to be able to abort this copy if needed.
I'm just using io.CopyN(socket, file, size) and can't see a way to signal an abort. Any ideas?
How about just closing the input file? io.CopyN will then return an error and abort.
Here is a demonstration (if not running on Linux, change /dev/zero and /dev/null to your OS equivalents!)
package main

import (
    "fmt"
    "io"
    "log"
    "os"
    "time"
)

func main() {
    in, err := os.Open("/dev/zero")
    if err != nil {
        log.Fatal(err)
    }
    out, err := os.Create("/dev/null")
    if err != nil {
        log.Fatal(err)
    }
    go func() {
        time.Sleep(time.Second)
        in.Close()
    }()
    written, err := io.CopyN(out, in, 1E12)
    fmt.Printf("%d bytes written with error %s\n", written, err)
}
When run, it will print something like
9756147712 bytes written with error read /dev/zero: bad file descriptor
CopyN tries hard to copy N bytes. If you want to optionally copy less than N bytes then don't use CopyN in the first place. I would probably adapt the original code to something like (untested code):
func copyUpToN(dst io.Writer, src io.Reader, n int64, signal chan int) (written int64, err error) {
    buf := make([]byte, 32*1024)
    for written < n {
        select {
        default:
        case <-signal:
            return 0, fmt.Errorf("Aborted") // or whatever
        }
        l := len(buf)
        if d := n - written; d < int64(l) {
            l = int(d)
        }
        nr, er := src.Read(buf[0:l])
        if nr > 0 {
            nw, ew := dst.Write(buf[0:nr])
            if nw > 0 {
                written += int64(nw)
            }
            if ew != nil {
                err = ew
                break
            }
            if nr != nw {
                err = io.ErrShortWrite
                break
            }
        }
        if er != nil {
            err = er
            break
        }
    }
    return written, err
}
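A brief usage sketch (hypothetical, reusing the socket, file and size names from the question): close the signal channel from another goroutine to request the abort, since a receive from a closed channel makes the select case fire immediately.

abort := make(chan int)
go func() {
    <-time.After(10 * time.Second) // placeholder for whatever abort condition applies
    close(abort)
}()
written, err := copyUpToN(socket, file, size, abort)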