Reading from a file with bufio with semi-complex sequencing through the file - go

So there may be questions like this already, but it's not a super easy thing to google. Basically I have a file that's a set of protobufs encoded and sequenced as they normally are per the protobuf spec.
So think of the byte values being chunked something like this throughout the file:
[EncodeVarInt(size of protobuf struct)] [protobuf struct bytes]
So you have a few bytes, read one at a time, that are used for a large jump of a read of our protobuf structure.
My implementation using the os ReadAt method on a file currently looks something like this.
// getting the next value in a file context feature
func (geobuf *Geobuf_Reader) Next() bool {
    if geobuf.EndPos <= geobuf.Pos {
        return false
    } else {
        startpos := int64(geobuf.Pos)
        for int(geobuf.Get_Byte(geobuf.Pos)) > 127 {
            geobuf.Pos += 1
        }
        geobuf.Pos += 1
        sizebytes := make([]byte, geobuf.Pos-int(startpos))
        geobuf.File.ReadAt(sizebytes, startpos)
        size, _ := DecodeVarint(sizebytes)
        geobuf.Feat_Pos = [2]int{int(size), geobuf.Pos}
        geobuf.Pos = geobuf.Pos + int(size)
        return true
    }
    return false
}

// reads a geobuf feature as geojson
func (geobuf *Geobuf_Reader) Feature() *geojson.Feature {
    // getting raw bytes
    a := make([]byte, geobuf.Feat_Pos[0])
    geobuf.File.ReadAt(a, int64(geobuf.Feat_Pos[1]))
    return Read_Feature(a)
}
How can I implement something like bufio or another chunked-reading mechanism to avoid so many file ReadAt calls? Most bufio examples I've seen rely on a specific delimiter. Thanks in advance; hopefully this wasn't a horrible question.

Package bufio
import "bufio"
type SplitFunc
SplitFunc is the signature of the split function used to tokenize the
input. The arguments are an initial substring of the remaining
unprocessed data and a flag, atEOF, that reports whether the Reader
has no more data to give. The return values are the number of bytes to
advance the input and the next token to return to the user, plus an
error, if any. If the data does not yet hold a complete token, for
instance if it has no newline while scanning lines, SplitFunc can
return (0, nil, nil) to signal the Scanner to read more data into the
slice and try again with a longer slice starting at the same point in
the input.
If the returned error is non-nil, scanning stops and the error is
returned to the client.
The function is never called with an empty data slice unless atEOF is
true. If atEOF is true, however, data may be non-empty and, as always,
holds unprocessed text.
type SplitFunc func(data []byte, atEOF bool) (advance int, token []byte, err error)
Use bufio.Scanner and write a custom protobuf struct SplitFunc.
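For instance, here is a sketch of such a SplitFunc for the [varint length][record] layout described in the question. It uses encoding/binary's Uvarint in place of the question's DecodeVarint (same base-128 wire format); the function name, file name, and buffer sizes are assumptions made for the example:

package main

import (
    "bufio"
    "encoding/binary"
    "errors"
    "fmt"
    "io"
    "log"
    "os"
)

// splitVarintPrefixed is a bufio.SplitFunc for a stream laid out as
// [varint length][record bytes]... binary.Uvarint decodes the same
// varint encoding that protobuf uses for the length prefix.
func splitVarintPrefixed(data []byte, atEOF bool) (advance int, token []byte, err error) {
    if atEOF && len(data) == 0 {
        return 0, nil, nil
    }
    size, n := binary.Uvarint(data)
    if n == 0 {
        return 0, nil, nil // length prefix incomplete: ask the Scanner for more data
    }
    if n < 0 {
        return 0, nil, errors.New("malformed varint length prefix")
    }
    if len(data) < n+int(size) {
        if atEOF {
            return 0, nil, io.ErrUnexpectedEOF // stream ends in the middle of a record
        }
        return 0, nil, nil // record incomplete: ask for more data
    }
    // The token is the raw protobuf record, without its length prefix.
    return n + int(size), data[n : n+int(size)], nil
}

func main() {
    f, err := os.Open("features.geobuf") // hypothetical file name
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    scanner := bufio.NewScanner(f)
    scanner.Buffer(make([]byte, 64*1024), 1<<20) // raise the max if single records can exceed 64 KB
    scanner.Split(splitVarintPrefixed)
    for scanner.Scan() {
        record := scanner.Bytes() // pass this to Read_Feature / proto.Unmarshal
        fmt.Println(len(record))
    }
    if err := scanner.Err(); err != nil {
        log.Fatal(err)
    }
}

Each scanner.Bytes() call then yields one raw record for Read_Feature, and the Scanner does all the buffered reading internally, replacing the many small ReadAt calls.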

Related

LimitedReader reads only once

I'm trying to understand Go by studying gopl book. I'm stuck when trying to implement the LimitReader function. I realized that I have two problems so let me separate them.
First issue
The description from the official doc says:
A LimitedReader reads from R but limits the amount of data returned to just N bytes. Each call to Read updates N to reflect the new amount remaining. Read returns EOF when N <= 0 or when the underlying R returns EOF.
OK, so my understanding is that I can read from the io.Reader type many times, but I will always be limited to N bytes. Running this code shows me something different:
package main

import (
    "fmt"
    "io"
    "log"
    "strings"
)

func main() {
    r := strings.NewReader("some io.Reader stream to be read\n")
    lr := io.LimitReader(r, 4)

    b := make([]byte, 7)
    n, err := lr.Read(b)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("Read %d bytes: %s\n", n, b)

    b = make([]byte, 5)
    n, _ = lr.Read(b)
    // `if err` check removed because the second Read returns EOF
    fmt.Printf("Read %d bytes: %s\n", n, b)
}

// Output:
// Read 4 bytes: some
// Read 0 bytes:
// I expected the next 4 bytes instead
It seems that this type of object is able to read only once. I'm not quite sure, but maybe this line in the io.go source code could be changed to l.N = 0. The main question is: why is this code inconsistent with the doc description?
Second issue
While I struggled with the first issue I was trying to display the current N value. If I add fmt.Println(lr.N) to the code above, it doesn't compile: lr.N undefined (type io.Reader has no field or method N). I realized that I still don't understand Go's interface concept.
Here is my POV (based on the listing above). Using the io.LimitReader function I create a LimitedReader object (see source code). Because this object has a Read method with the proper signature, it satisfies the io.Reader interface. That's the reason why io.LimitReader returns io.Reader, right? OK, so everything works together.
The question is: why can't lr.N be accessed? As I understood the book, an interface type only requires that the data type has some method(s). Nothing more.
LimitedReader limits the total size of data that can be read, not the amount of data that can be read at each read call. That is, if you set the limit to 4, you can perform 4 reads of 1 byte, or 1 read of 4 bytes, and after that, all reads will fail.
For your second question: lr is an io.Reader, so you cannot read lr.N. However, you can access the underlying concrete type using a type assertion: lr.(*io.LimitedReader).N should work.
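To make that concrete, here is a minimal sketch built on the question's example (nothing here beyond the standard library):

package main

import (
    "fmt"
    "io"
    "strings"
)

func main() {
    r := strings.NewReader("some io.Reader stream to be read\n")
    lr := io.LimitReader(r, 4)

    // lr's static type is io.Reader; the type assertion recovers the
    // concrete *io.LimitedReader so that its N field is reachable.
    limited := lr.(*io.LimitedReader)
    fmt.Println(limited.N) // 4

    b := make([]byte, 2)
    lr.Read(b)             // first read: "so"
    fmt.Println(limited.N) // 2 -- N tracks the total remaining, across calls

    lr.Read(b)             // second read: "me"
    fmt.Println(limited.N) // 0 -- the next Read returns io.EOF
}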

How to transform HTML entities via io.Reader

My Go program makes HTTP requests whose response bodies are large JSON documents whose strings encode the ampersand character & as &amp; (presumably due to some Microsoft platform quirk?). My program needs to convert those entities back to the ampersand character in a way that is compatible with json.Decoder.
An example response might look like the following:
{"name":"A&B","comment":"foo&bar"}
Whose corresponding object would be as below:
pkg.Object{Name:"A&B", Comment:"foo&bar"}
The documents come in various shapes so it's not feasible to convert the HTML entities after decoding. Ideally it would be done by wrapping the response body reader in another reader that performs the transformation.
Is there an easy way to wrap the http.Response.Body in some io.ReadCloser which replaces all instances of &amp; with & (or in the general case, replaces any string X with string Y)?
I suspect this is possible with x/text/transform but don't immediately see how. In particular, I'm concerned about edge cases wherein an entity spans batches of bytes. That is, one batch ends with &am and the next batch starts with p;, for example. Is there some library or idiom that gracefully handles that situation?
If you don't want to rely on an external package like x/text/transform, you can write a custom io.Reader wrapper.
The following will handle the edge case where the find element may span two Read() calls:
type fixer struct {
    r        io.Reader // source reader
    fnd, rpl []byte    // find & replace sequences
    partial  int       // tracks a partial find-match left over from the previous Read()
}

// Read satisfies the io.Reader interface
func (f *fixer) Read(b []byte) (int, error) {
    off := f.partial
    if off > 0 {
        copy(b, f.fnd[:off]) // copy any partial match from the previous Read
    }
    n, err := f.r.Read(b[off:])
    n += off
    if err != io.EOF {
        // no need to check for a partial match at EOF, as that is the last Read!
        f.partial = partialFind(b[:n], f.fnd)
        n -= f.partial // lop off any partial bytes
    }
    fixb := bytes.ReplaceAll(b[:n], f.fnd, f.rpl)
    return copy(b, fixb), err // preserve err as it may be io.EOF etc.
}
Along with this helper (which could probably use some optimization):
// returns the number of matched bytes if the byte slice ends in a partial match
func partialFind(b, find []byte) int {
    for n := len(find) - 1; n > 0; n-- {
        if bytes.HasSuffix(b, find[:n]) {
            return n
        }
    }
    return 0 // no match
}
Working playground example.
Note: to test the edge-case logic, one could use a narrowReader to force short Reads and make a match split across Reads, as in this validation playground example.
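For completeness, a rough sketch of wiring this into the question's HTTP/JSON scenario. The URL and target struct are made up, and this fragment assumes it lives in the same package as fixer, since its fields are unexported:

// Hypothetical usage: wrap the HTTP response body in the fixer above and
// hand it straight to json.Decoder.
resp, err := http.Get("https://example.invalid/api") // placeholder URL
if err != nil {
    log.Fatal(err)
}
defer resp.Body.Close()

clean := &fixer{r: resp.Body, fnd: []byte("&amp;"), rpl: []byte("&")}

var obj struct {
    Name    string `json:"name"`
    Comment string `json:"comment"`
}
if err := json.NewDecoder(clean).Decode(&obj); err != nil {
    log.Fatal(err)
}
fmt.Printf("%+v\n", obj) // e.g. {Name:A&B Comment:foo&bar}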
You need to create a transform.Transformer that replaces your characters.
So we need one that transforms an old []byte to a new []byte while preserving all other data. An implementation could look like this:
type simpleTransformer struct {
    Old, New []byte
}

// Transform transforms `t.Old` bytes to `t.New` bytes.
// The current implementation assumes that len(t.Old) >= len(t.New), but it also
// seems to work when len(t.Old) < len(t.New) (this has not been tested extensively).
func (t *simpleTransformer) Transform(dst, src []byte, atEOF bool) (nDst, nSrc int, err error) {
    // Get the position of the first occurrence of `t.Old` so we can replace it
    var ci = bytes.Index(src[nSrc:], t.Old)

    // Loop over the slice until we can't find any occurrences of `t.Old`,
    // also making sure we don't run into index-out-of-range panics
    for ci != -1 && nSrc < len(src) {
        // Copy source data before `nSrc+ci` that doesn't need transformation
        copied := copy(dst[nDst:nDst+ci], src[nSrc:nSrc+ci])
        nDst += copied
        nSrc += copied

        // Copy new data with transformation to `dst`
        nDst += copy(dst[nDst:nDst+len(t.New)], t.New)

        // Skip the rest of the old bytes in the next iteration
        nSrc += len(t.Old)

        // Search for the next occurrence of `t.Old`
        ci = bytes.Index(src[nSrc:], t.Old)
    }

    // Mark the rest of the data as not completely processed if it contains a start element of `t.Old`
    // (e.g. if the end is `&am` and we're looking for `&amp;`).
    // This data will not yet be copied to `dst`, so we can work with it again.
    // If it is at the end (`atEOF`), we don't need to do the check anymore, as the string might just end with `&amp`.
    if bytes.Contains(src[nSrc:], t.Old[0:1]) && !atEOF {
        err = transform.ErrShortSrc
        return
    }

    // Copy the rest of the data that doesn't need any transformations.
    // The for loop processed everything except this last chunk.
    copied := copy(dst[nDst:], src[nSrc:])
    nDst += copied
    nSrc += copied

    return nDst, nSrc, err
}

// Reset satisfies the transform.Transformer interface
func (t *simpleTransformer) Reset() {}
The implementation has to make sure that it deals with characters that are split between multiple calls of the Transform method, which is why it returns transform.ErrShortSrc to tell the transform.Reader that it needs more information about the next bytes.
This can now be used to replace characters in a stream:
var input = strings.NewReader(`{"name":"A&amp;B","comment":"foo&amp;bar"}`)
r := transform.NewReader(input, &simpleTransformer{[]byte("&amp;"), []byte("&")})
io.Copy(os.Stdout, r) // Instead of io.Copy, use the JSON decoder to read from `r`
Output:
{"name":"A&B","comment":"foo&bar"}
You can also see this in action on the Go Playground.

Write fixed length padded lines to file Go

Printing justified, fixed-length values seems to be what everyone asks about, and there are many examples I have found, like...
package main

import "fmt"

func main() {
    values := []string{"Mustang", "10", "car"}
    for i := range values {
        fmt.Printf("%10v...\n", values[i])
    }
    for i := range values {
        fmt.Printf("|%-10v|\n", values[i])
    }
}
Situation
But what if I need to WRITE to a file with fixed length bytes?
For example: what if I have a requirement that states: write this line to a file, and it must be 32 bytes, left justified and padded to the right with 0's?
Question
So, how do you accomplish this when writing to a file?
There are functions analogous to the fmt.PrintXX() functions that start with an F and take the form fmt.FprintXX(). These variants write the result to an io.Writer, which may be an os.File as well.
So if you have the fmt.Printf() statements which you want to direct to a file, just change them to call fmt.Fprintf() instead, passing the file as the first argument:
var f *os.File = ... // Initialize / open file
fmt.Fprintf(f, "%10v...\n", values[i])
If you look into the implementation of fmt.Printf():
func Printf(format string, a ...interface{}) (n int, err error) {
    return Fprintf(os.Stdout, format, a...)
}
It does exactly this: it calls fmt.Fprintf(), passing os.Stdout as the output to write to.
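To tie this back to the "32 bytes, left justified, padded right with 0's" requirement: fmt's width flags pad with spaces rather than zeros on the right, so a small helper is one option. This is only a sketch; the padding character, truncation rule, and file name are assumptions:

package main

import (
    "fmt"
    "log"
    "os"
    "strings"
)

// pad32 left-justifies s in a 32-byte field, filling the right with '0'
// characters. Swap '0' for ' ' or "\x00" if the spec means something else,
// and adjust the truncation rule to taste.
func pad32(s string) string {
    if len(s) >= 32 {
        return s[:32]
    }
    return s + strings.Repeat("0", 32-len(s))
}

func main() {
    f, err := os.Create("fixed.txt") // hypothetical output file
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    for _, v := range []string{"Mustang", "10", "car"} {
        fmt.Fprintln(f, pad32(v)) // each record is exactly 32 bytes plus the newline
    }
}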
For how to open a file, see How to read/write from/to file using Go?
See related question: Format a Go string without printing?

golang read bytes from net.TCPConn with 4 bytes as message separation

I am working on a SIP-over-TCP mock service in golang. Incoming SIP messages are separated by the '\r\n\r\n' sequence (I do not care about SDP for now). I want to extract each message based on that delimiter and send it over to the processing goroutine. Looking through the golang standard libraries I see no trivial way of achieving it. There seems to be no one-stop shop in the io and bufio packages. Currently I see two options for going forward (bufio):
1. The Reader.ReadBytes function with '\r' set as the delimiter. Further processing is done by using the ReadByte function, comparing sequentially with each byte of the delimiter and unreading bytes if necessary (which looks quite tedious).
2. Using Scanner with a custom split function, which does not look too trivial either.
I wonder whether there are any better options; the functionality seems so common that it is hard to believe it is not possible to just define a delimiter for a TCP stream and extract messages from it.
You can either choose to buffer the reads up yourself and split on the \r\n\r\n delimiter, or let a bufio.Scanner do it for you. There's nothing onerous about implementing a bufio.SplitFunc, and it's definitely simpler than the alternative. Using bufio.ScanLines as an example, you could use:
scanner.Split(func(data []byte, atEOF bool) (advance int, token []byte, err error) {
    delim := []byte{'\r', '\n', '\r', '\n'}
    if atEOF && len(data) == 0 {
        return 0, nil, nil
    }
    if i := bytes.Index(data, delim); i >= 0 {
        return i + len(delim), data[0:i], nil
    }
    if atEOF {
        return len(data), data, nil
    }
    return 0, nil, nil
})
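A rough sketch of how this could be wired into a per-connection goroutine; handleConn, process, and the copy-before-dispatch detail are assumptions rather than part of the answer above, and splitSIP stands for the anonymous split function above extracted into a named function:

// handleConn is a hypothetical per-connection goroutine: it pulls one SIP
// message at a time off the TCP stream and hands each one to process().
func handleConn(conn net.Conn, process func([]byte)) {
    defer conn.Close()

    scanner := bufio.NewScanner(conn)
    scanner.Split(splitSIP)
    for scanner.Scan() {
        msg := append([]byte(nil), scanner.Bytes()...) // copy, since the Scanner reuses its buffer
        go process(msg)
    }
    if err := scanner.Err(); err != nil {
        log.Println("scan:", err)
    }
}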

Most efficient way to read Zlib compressed file in Golang?

I'm reading in and at the same time parsing (decoding) a file in a custom format, which is compressed with zlib. My question is how can I efficiently uncompress and then parse the uncompressed content without growing the slice? I would like to parse it whilst reading it into a reusable buffer.
This is for a speed-sensitive application and so I'd like to read it in as efficiently as possible. Normally I would just ioutil.ReadAll and then loop again through the data to parse it. This time I'd like to parse it as it's read, without having to grow the buffer into which it is read, for maximum efficiency.
Basically I'm thinking that if I can find a buffer of the perfect size then I can read into this, parse it, write over the buffer again, then parse that, etc. The issue here is that the zlib reader appears to read an arbitrary number of bytes each time Read(b) is called; it does not fill the slice. Because of this I don't know what the perfect buffer size would be. I'm concerned that it might break some of the data into two chunks, making it difficult to parse, because a uint64, say, could be split across two reads and therefore not occur in the same buffer read - or perhaps that can never happen and it's always read back in chunks of the same size as were originally written?
What is the optimal buffer size, or is there a way to calculate this?
If I have written data into the zlib writer with f.Write(b []byte) is it possible that this same data could be split into two reads when reading back the compressed data (meaning I will have to have a history during parsing), or will it always come back in the same read?
You can wrap your zlib reader in a bufio reader, then implement a specialized reader on top that will rebuild your chunks of data by reading from the bufio reader until a full chunk is read. Be aware that bufio.Read calls Read at most once on the underlying Reader, so you need to call ReadByte in a loop. bufio will however take care of the unpredictable size of data returned by the zlib reader for you.
If you do not want to implement a specialized reader, you can just go with a bufio reader and read as many bytes as needed with ReadByte() to fill a given data type. The optimal buffer size is at least the size of your largest data structure, up to whatever you can shove into memory.
If you read directly from the zlib reader, there is no guarantee that your data won't be split between two reads.
Another, maybe cleaner, solution is to implement a writer for your data, then use io.Copy(your_writer, zlib_reader).
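To make the bufio approach above concrete, here is a sketch that uses io.ReadFull on a bufio.Reader to pull fixed-size records out of the zlib stream; io.ReadFull loops over Read internally, so it papers over the unpredictable chunk sizes. The file name, record size, and field layout are invented for the example:

package main

import (
    "bufio"
    "compress/zlib"
    "encoding/binary"
    "io"
    "log"
    "os"
)

func main() {
    fi, err := os.Open("data.zlib") // hypothetical file name
    if err != nil {
        log.Fatal(err)
    }
    defer fi.Close()

    zr, err := zlib.NewReader(fi)
    if err != nil {
        log.Fatal(err)
    }
    defer zr.Close()

    // bufio smooths out the arbitrary chunk sizes the zlib reader returns;
    // io.ReadFull then blocks until it has exactly len(rec) bytes (or the stream ends).
    br := bufio.NewReader(zr)
    rec := make([]byte, 10) // assume each record is 10 bytes: a uint64 plus a uint16, say
    for {
        _, err := io.ReadFull(br, rec)
        if err == io.EOF {
            break // clean end of stream, no partial record left over
        }
        if err != nil {
            log.Fatal(err) // io.ErrUnexpectedEOF would mean a truncated record
        }
        v := binary.LittleEndian.Uint64(rec[:8])
        _ = v // parse the remaining fields of the record here
    }
}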
OK, so I figured this out in the end using my own implementation of a reader.
Basically the struct looks like this:
type reader struct {
    at  int
    n   int
    f   io.ReadCloser
    buf []byte
}
This can be attached to the zlib reader:
// Open file for reading
fi, err := os.Open(filename)
if err != nil {
    return nil, err
}
defer fi.Close()

// Attach zlib reader
r := new(reader)
r.buf = make([]byte, 2048)
r.f, err = zlib.NewReader(fi)
if err != nil {
    return nil, err
}
defer r.f.Close()
Then x number of bytes can be read straight out of the zlib reader using a function like this:
mydata := r.readx(10)
func (r *reader) readx(x int) []byte {
    for r.n < x {
        copy(r.buf, r.buf[r.at:r.at+r.n])
        r.at = 0
        m, err := r.f.Read(r.buf[r.n:])
        if err != nil {
            panic(err)
        }
        r.n += m
    }
    tmp := make([]byte, x)
    copy(tmp, r.buf[r.at:r.at+x]) // must be copied to avoid a memory leak
    r.at += x
    r.n -= x
    return tmp
}
Note that I have no need to check for EOF because my parser should stop itself at the right place.
