How to transform HTML entities via io.Reader - go

My Go program makes HTTP requests whose response bodies are large JSON documents in which strings encode the ampersand character & as &amp; (presumably due to some Microsoft platform quirk?). My program needs to convert those entities back to the ampersand character in a way that is compatible with json.Decoder.
An example response might look like the following:
{"name":"A&B","comment":"foo&bar"}
The corresponding object would be:
pkg.Object{Name:"A&B", Comment:"foo&bar"}
The documents come in various shapes so it's not feasible to convert the HTML entities after decoding. Ideally it would be done by wrapping the response body reader in another reader that performs the transformation.
Is there an easy way to wrap the http.Response.Body in some io.ReadCloser which replaces all instances of &amp; with & (or, in the general case, replaces any string X with string Y)?
I suspect this is possible with x/text/transform but don't immediately see how. In particular, I'm concerned about edge cases wherein an entity spans batches of bytes. That is, one batch ends with &am and the next batch starts with p;, for example. Is there some library or idiom that gracefully handles that situation?

If you don't want to rely on an external package like x/text/transform, you can write a custom io.Reader wrapper.
The following will handle the edge case where the find element may span two Read() calls:
type fixer struct {
	r        io.Reader // source reader
	fnd, rpl []byte    // find & replace sequences
	partial  int       // track partial find matches from previous Read()
}

// Read satisfies io.Reader interface
func (f *fixer) Read(b []byte) (int, error) {
	off := f.partial
	if off > 0 {
		copy(b, f.fnd[:off]) // copy any partial match from previous `Read`
	}
	n, err := f.r.Read(b[off:])
	n += off
	if err != io.EOF {
		// no need to check for partial match, if EOF, as that is the last Read!
		f.partial = partialFind(b[:n], f.fnd)
		n -= f.partial // lop off any partial bytes
	}
	fixb := bytes.ReplaceAll(b[:n], f.fnd, f.rpl)
	return copy(b, fixb), err // preserve err as it may be io.EOF etc.
}
Along with this helper (which could probably use some optimization):
// returns number of matched bytes, if byte-slice ends in a partial-match
func partialFind(b, find []byte) int {
	for n := len(find) - 1; n > 0; n-- {
		if bytes.HasSuffix(b, find[:n]) {
			return n
		}
	}
	return 0 // no match
}
Working playground example.
Note: to test the edge-case logic, one could use a narrowReader to ensure short Reads and force a match to be split across Reads, like this: validation playground example
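To connect this back to the original question, here is a minimal sketch (my own addition, not part of the answer above) of wrapping an http.Response.Body with the fixer so that json.Decoder never sees the entities; the newAmpFixer helper name is made up:
// newAmpFixer wraps an io.ReadCloser so that reads replace `&amp;` with `&`,
// while Close still closes the underlying body.
func newAmpFixer(rc io.ReadCloser) io.ReadCloser {
	return struct {
		io.Reader
		io.Closer
	}{
		Reader: &fixer{r: rc, fnd: []byte("&amp;"), rpl: []byte("&")},
		Closer: rc,
	}
}

// usage with the question's example type:
resp, err := http.Get(url) // url: whatever endpoint returns the entity-escaped JSON
if err != nil {
	// handle error
}
body := newAmpFixer(resp.Body)
defer body.Close()

var obj pkg.Object
if err := json.NewDecoder(body).Decode(&obj); err != nil {
	// handle error
}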

You need to create a transform.Transformer that replaces your characters.
So we need one that transforms an old []byte to a new []byte while preserving all other data. An implementation could look like this:
type simpleTransformer struct {
	Old, New []byte
}
// Transform transforms `t.Old` bytes to `t.New` bytes.
// The current implementation assumes that len(t.Old) >= len(t.New), but it also seems to work when len(t.Old) < len(t.New) (this has not been tested extensively)
func (t *simpleTransformer) Transform(dst, src []byte, atEOF bool) (nDst, nSrc int, err error) {
	// Get the position of the first occurrence of `t.Old` so we can replace it
	var ci = bytes.Index(src[nSrc:], t.Old)

	// Loop over the slice until we can't find any occurrences of `t.Old`;
	// also make sure we don't run into index out of range panics
	for ci != -1 && nSrc < len(src) {
		// Copy source data before `nSrc+ci` that doesn't need transformation
		copied := copy(dst[nDst:nDst+ci], src[nSrc:nSrc+ci])
		nDst += copied
		nSrc += copied

		// Copy new data with transformation to `dst`
		nDst += copy(dst[nDst:nDst+len(t.New)], t.New)

		// Skip the rest of the old bytes in the next iteration
		nSrc += len(t.Old)

		// Search for the next occurrence of `t.Old`
		ci = bytes.Index(src[nSrc:], t.Old)
	}

	// Mark the rest of the data as not completely processed if it contains a start element of `t.Old`
	// (e.g. if the end is `&amp` and we're looking for `&amp;`)
	// This data will not yet be copied to `dst` so we can work with it again
	// If it is at the end (`atEOF`), we don't need to do the check anymore as the string might just end with `&amp`
	if bytes.Contains(src[nSrc:], t.Old[0:1]) && !atEOF {
		err = transform.ErrShortSrc
		return
	}

	// Copy the rest of the data that doesn't need any transformations
	// The for loop processed everything except this last chunk
	copied := copy(dst[nDst:], src[nSrc:])
	nDst += copied
	nSrc += copied
	return nDst, nSrc, err
}
// Reset satisfies the transform.Transformer interface
func (t *simpleTransformer) Reset() {}
The implementation has to make sure that it deals with characters that are split between multiple calls of the Transform method, which is why it returns transform.ErrShortSrc to tell the transform.Reader that it needs more information about the next bytes.
This can now be used to replace characters in a stream:
var input = strings.NewReader(`{"name":"A&amp;B","comment":"foo&amp;bar"}`)
r := transform.NewReader(input, &simpleTransformer{[]byte(`&amp;`), []byte(`&`)})
io.Copy(os.Stdout, r) // Instead of io.Copy, use the JSON decoder to read from `r`
Output:
{"name":"A&B","comment":"foo&bar"}
You can also see this in action on the Go Playground.
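For the original problem, the transformer can be applied directly to the HTTP response body before JSON decoding. A rough sketch (my own addition, using the question's pkg.Object as the decode target):
resp, err := http.Get(url) // url: whatever endpoint returns the entity-escaped JSON
if err != nil {
	// handle error
}
defer resp.Body.Close()

tr := transform.NewReader(resp.Body, &simpleTransformer{[]byte(`&amp;`), []byte(`&`)})

var obj pkg.Object
if err := json.NewDecoder(tr).Decode(&obj); err != nil {
	// handle error
}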

Related

Why copyBuffer implements while loop

I am trying to understand how copyBuffer works under the hood, but what is not clear to me is the use of the while-style for loop:
for {
	nr, er := src.Read(buf)
	//...
}
Full code below:
// copyBuffer is the actual implementation of Copy and CopyBuffer.
// if buf is nil, one is allocated.
func copyBuffer(dst Writer, src Reader, buf []byte) (written int64, err error) {
	// If the reader has a WriteTo method, use it to do the copy.
	// Avoids an allocation and a copy.
	if wt, ok := src.(WriterTo); ok {
		return wt.WriteTo(dst)
	}
	// Similarly, if the writer has a ReadFrom method, use it to do the copy.
	if rt, ok := dst.(ReaderFrom); ok {
		return rt.ReadFrom(src)
	}
	size := 32 * 1024
	if l, ok := src.(*LimitedReader); ok && int64(size) > l.N {
		if l.N < 1 {
			size = 1
		} else {
			size = int(l.N)
		}
	}
	if buf == nil {
		buf = make([]byte, size)
	}
	for {
		nr, er := src.Read(buf)
		if nr > 0 {
			nw, ew := dst.Write(buf[0:nr])
			if nw > 0 {
				written += int64(nw)
			}
			if ew != nil {
				err = ew
				break
			}
			if nr != nw {
				err = ErrShortWrite
				break
			}
		}
		if er != nil {
			if er != EOF {
				err = er
			}
			break
		}
	}
	return written, err
}
It writes with nw, ew := dst.Write(buf[0:nr]), where nr is the number of bytes read, so why is the loop necessary?
Let's assume that src does not implement WriterTo and dst does not implement ReaderFrom, since otherwise we would not get down to the for loop at all.
Let's further assume, for simplicity, that src does not implement LimitedReader, so that size is 32 * 1024: 32 kBytes. (There is no real loss of generality here as LimitedReader just allows the source to pick an even smaller number, at least in this case.)
Finally, let's assume buf is nil. (Or, if it's not nil, let's assume it has a capacity of 32768 bytes. If it has a large capacity, we can just change the rest of the assumptions below, so that src has more bytes than there are in the buffer.)
So: we enter the loop with size holding the size of the temporary buffer buf, which is 32k. Now suppose the source is a file that holds 64k. It will take at least two src.Read() calls to read it! Clearly we need an outer loop. That's the overall for here.
Now suppose that src.Read() really does read the full 32k, so that nr is also 32 * 1024. The code will now call dst.Write(), passing the full 32k of data. Unlike src.Read()—which is allowed to only read, say, 1k instead of the full 32k—the next chunk of code requires that dst.Write() write all 32k. If it doesn't, the loop will break with err set to ErrShortWrite.
(An alternative would have been to keep calling dst.Write() with the remaining bytes, so that dst.Write() could write only 1k of the 32k, requiring 32 calls to get it all written.)
Note that src.Read() can choose to read only, say, 1k instead of 32k. If the actual file is 64k, it will then take 64 trips, rather than 2, through the outer loop. (An alternative choice would have been to force such a reader to implement the LimitedReader interface. That's not as flexible, though, and is not what LimitedReader is intended for.)
func copyBuffer(dst Writer, src Reader, buf []byte) (written int64, err error)
When the total amount of data to copy is larger than len(buf), nr, er := src.Read(buf) will read at most len(buf) bytes each time.
That's how copyBuffer works:
for {
	copy `len(buf)` data from `src` to `dst`;
	if EOF {
		// done
		break;
	}
	if other Errors {
		return Error
	}
}
In the normal case, you would just call Copy rather than CopyBuffer.
func Copy(dst Writer, src Reader) (written int64, err error) {
	return copyBuffer(dst, src, nil)
}
The option to have a user-supplied buffer is, I think, just for extreme optimization scenarios. The use of the word "Buffer" in the name is possibly a source of confusion since the function is not copying the buffer -- just using it internally.
There are two reasons for the looping...
1. The buffer might not be large enough to copy all of the data (the size of which is not necessarily known in advance) in one pass.
2. Reader, though not 'Writer', may return partial results when it makes sense to do so.
Regarding the second item, consider that the Reader does not necessarily represent a fixed file or data buffer. It could, instead, be a live stream from some other thread or process. As such, there are many valid scenarios for stream data to be read and processed on an as-available basis. Although CopyBuffer doesn't do this, it still has to work with such behaviors from any Reader.
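To see the partial-read behavior in isolation, here is a small sketch (my own illustration using testing/iotest, not from the answers above); io.Copy still copies everything even though the wrapped reader returns a single byte per Read, it just takes more trips through the loop:
package main

import (
	"io"
	"os"
	"strings"
	"testing/iotest"
)

func main() {
	// OneByteReader returns at most one byte per Read call,
	// so io.Copy's loop runs once per byte and still copies it all.
	src := iotest.OneByteReader(strings.NewReader("hello, copyBuffer\n"))
	if _, err := io.Copy(os.Stdout, src); err != nil {
		panic(err)
	}
}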

How to write a vector

I am using the Go flatbuffers interface for the first time. I find the instructions sparse.
I would like to write a vector of uint64s into a table. Ideally, I would like to store numbers directly in a vector without knowing how many there are up front (I'm reading them from an sql.Rows iterator). I see the generated code for the table has these functions:
func DatasetGridAddDates(builder *flatbuffers.Builder, dates flatbuffers.UOffsetT) {
	builder.PrependUOffsetTSlot(2, flatbuffers.UOffsetT(dates), 0)
}

func DatasetGridStartDatesVector(builder *flatbuffers.Builder, numElems int) flatbuffers.UOffsetT {
	return builder.StartVector(8, numElems, 8)
}
Can I first write the vector using (??), then use DatasetGridAddDates to record the resulting vector in the containing "DatasetGrid" table?
(caveat: I had not heard of FlatBuffers prior to reading your question)
If you do know the length in advance, storing a vector is done as explained in the tutorial:
name := builder.CreateString("hello")

q55310927.DatasetGridStartDatesVector(builder, len(myDates))
for i := len(myDates) - 1; i >= 0; i-- {
	builder.PrependUint64(myDates[i])
}
dates := builder.EndVector(len(myDates))

q55310927.DatasetGridStart(builder)
q55310927.DatasetGridAddName(builder, name)
q55310927.DatasetGridAddDates(builder, dates)
grid := q55310927.DatasetGridEnd(builder)
builder.Finish(grid)
Now what if you don’t have len(myDates)? On a toy example I get exactly the same output if I replace StartDatesVector(builder, len(myDates)) with StartDatesVector(builder, 0). Looking at the source code, it seems like the numElems may be necessary for alignment and for growing the buffer. I imagine alignment might be moot when you’re dealing with uint64, and growing seems to happen automatically on PrependUint64, too.
So, try doing it without numElems:
q55310927.DatasetGridStartDatesVector(builder, 0)
var n int
for rows.Next() { // use ORDER BY to make them go in reverse order
	var date uint64
	if err := rows.Scan(&date); err != nil {
		// ...
	}
	builder.PrependUint64(date)
	n++
}
dates := builder.EndVector(n)
dates := builder.EndVector(n)
and see if it works on your data.
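If you want to sanity-check the result, something like the following sketch could read the vector back. Note that GetRootAsDatasetGrid, DatesLength and Dates are my guesses at the accessor names flatc --go usually generates for a table with a dates:[uint64] field, so adjust them to your generated package:
// Sketch: read the finished buffer back and print the vector.
buf := builder.FinishedBytes()
grid := q55310927.GetRootAsDatasetGrid(buf, 0)
for j := 0; j < grid.DatesLength(); j++ {
	fmt.Println(grid.Dates(j)) // should print the dates in their original order
}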

Reading from a file from bufio with a semi complex sequencing through file

So there may be questions like this but it's not a super easy thing to google. Basically I have a file that's a set of protobufs encoded and sequenced as they normally are from the protobuf spec.
So think of the byte values being chunked something like this throughout the file:
[EncodeVarInt(size of protobuf struct)] [protobuf struct bytes]
So you have a few bytes, read one at a time, that are used to determine how large a jump to read for our protobuf structure.
My implementation, using the os.File ReadAt method, currently looks something like this.
// getting the next value in a file context feature
func (geobuf *Geobuf_Reader) Next() bool {
	if geobuf.EndPos <= geobuf.Pos {
		return false
	} else {
		startpos := int64(geobuf.Pos)
		for int(geobuf.Get_Byte(geobuf.Pos)) > 127 {
			geobuf.Pos += 1
		}
		geobuf.Pos += 1
		sizebytes := make([]byte, geobuf.Pos-int(startpos))
		geobuf.File.ReadAt(sizebytes, startpos)
		size, _ := DecodeVarint(sizebytes)
		geobuf.Feat_Pos = [2]int{int(size), geobuf.Pos}
		geobuf.Pos = geobuf.Pos + int(size)
		return true
	}
	return false
}

// reads a geobuf feature as geojson
func (geobuf *Geobuf_Reader) Feature() *geojson.Feature {
	// getting raw bytes
	a := make([]byte, geobuf.Feat_Pos[0])
	geobuf.File.ReadAt(a, int64(geobuf.Feat_Pos[1]))
	return Read_Feature(a)
}
How can I implement something like bufio or another chunked-reading mechanism to speed up so many file ReadAt's? Most bufio implementations I've seen are for a specific delimiter. Thanks in advance; hopefully this wasn't a horrible question.
Package bufio
import "bufio"
type SplitFunc
SplitFunc is the signature of the split function used to tokenize the
input. The arguments are an initial substring of the remaining
unprocessed data and a flag, atEOF, that reports whether the Reader
has no more data to give. The return values are the number of bytes to
advance the input and the next token to return to the user, plus an
error, if any. If the data does not yet hold a complete token, for
instance if it has no newline while scanning lines, SplitFunc can
return (0, nil, nil) to signal the Scanner to read more data into the
slice and try again with a longer slice starting at the same point in
the input.
If the returned error is non-nil, scanning stops and the error is
returned to the client.
The function is never called with an empty data slice unless atEOF is
true. If atEOF is true, however, data may be non-empty and, as always,
holds unprocessed text.
type SplitFunc func(data []byte, atEOF bool) (advance int, token []byte, err error)
Use bufio.Scanner and write a custom protobuf struct SplitFunc.
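As a concrete (hypothetical) sketch, a SplitFunc for [uvarint length][message bytes] records could use encoding/binary's Uvarint and return (0, nil, nil) whenever the prefix or the message body is not yet fully buffered; the splitProtobuf name is my own:
// splitProtobuf tokenizes a stream of varint-length-prefixed protobuf messages.
func splitProtobuf(data []byte, atEOF bool) (advance int, token []byte, err error) {
	size, n := binary.Uvarint(data)
	if n == 0 {
		// length prefix incomplete; at EOF with leftover bytes that's an error
		if atEOF && len(data) > 0 {
			return 0, nil, io.ErrUnexpectedEOF
		}
		return 0, nil, nil // ask the Scanner for more data
	}
	if n < 0 {
		return 0, nil, errors.New("malformed varint length prefix")
	}
	total := n + int(size)
	if len(data) < total {
		// message body incomplete
		if atEOF {
			return 0, nil, io.ErrUnexpectedEOF
		}
		return 0, nil, nil
	}
	return total, data[n:total], nil
}

// usage, reusing Read_Feature from the question:
scanner := bufio.NewScanner(geobuf.File)
scanner.Buffer(make([]byte, 0, 64*1024), 64*1024*1024) // raise the cap for large features
scanner.Split(splitProtobuf)
for scanner.Scan() {
	feature := Read_Feature(scanner.Bytes()) // one protobuf message per token
	_ = feature
}
if err := scanner.Err(); err != nil {
	// handle error
}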

Writing a struct's fields and values of different types to a file in Go

I'm writing a simple program that takes in input from a form, populates an instance of a struct with the received data, and then writes this received data to a file.
I'm a bit stuck at the moment with figuring out the best way to iterate over the populated struct and write its contents to the file.
The struct in question contains 3 different types of fields (ints, strings, []strings).
I can iterate over them but I am unable to get their actual type.
Inspecting my posted code below with print statements reveals that each of their types is coming back as structs rather than the aforementioned string, int etc.
The desired output format is plain text.
For example:
field_1="value_1"
field_2=10
field_3=["a", "b", "c"]
Anyone have any ideas? Perhaps I'm going about this the wrong way entirely?
func (c *Config) writeConfigToFile(file *os.File) {
	listVal := reflect.ValueOf(c)
	element := listVal.Elem()
	for i := 0; i < element.NumField(); i++ {
		field := element.Field(i)
		myType := reflect.TypeOf(field)
		if myType.Kind() == reflect.Int {
			file.Write(field.Bytes())
		} else {
			file.WriteString(field.String())
		}
	}
}
Instead of using the Bytes method on reflect.Value, which does not work as you initially intended, you can use either the strconv package or the fmt package to format your fields.
Here's an example using fmt:
// loop over the struct's fields (rv corresponds to element in the question's code)
rv := reflect.ValueOf(c).Elem()
for i := 0; i < rv.NumField(); i++ {
	fi := rv.Field(i)

	var s string
	switch fi.Kind() {
	case reflect.String:
		s = fmt.Sprintf("%q", fi.String())
	case reflect.Int:
		s = fmt.Sprintf("%d", fi.Int())
	case reflect.Slice:
		if fi.Type().Elem().Kind() != reflect.String {
			continue
		}
		s = "["
		for j := 0; j < fi.Len(); j++ {
			s = fmt.Sprintf("%s%q, ", s, fi.Index(j).String())
		}
		s = strings.TrimRight(s, ", ") + "]"
	default:
		continue
	}

	sf := rv.Type().Field(i)
	if _, err := fmt.Fprintf(file, "%s=%s\n", sf.Name, s); err != nil {
		panic(err)
	}
}
Playground: https://play.golang.org/p/KQF3CicVzA
Why not use the built-in gob package to store your struct values?
I use it to store different structures, one per line, in files. During decoding, you can test the type conversion or provide a hint in a wrapper - whichever is faster for your given use case.
You'd treat each line as a buffer when encoding, and decode each line when reading it back. You can even gzip/zlib/compress, encrypt/decrypt, etc. the stream in real time.
No point in re-inventing the wheel when you have a polished and armorall'd wheel already at your disposal.
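A minimal sketch of the gob approach (my own addition, reusing the Config type and the writeConfigToFile name from the question, with an added error return):
// write the struct with gob instead of hand-rolled reflection
func (c *Config) writeConfigToFile(file *os.File) error {
	return gob.NewEncoder(file).Encode(c)
}

// read it back later
func readConfigFromFile(file *os.File) (*Config, error) {
	var c Config
	if err := gob.NewDecoder(file).Decode(&c); err != nil {
		return nil, err
	}
	return &c, nil
}
Bear in mind that gob output is binary, not the key=value text shown in the question; the trade-off is that you get encoding and decoding for free.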

Efficiently listing files in a directory having very many entries

I need to recursively read a directory structure, but I also need to perform an additional step once I have read through all entries for each directory. Therefore, I need to write my own recursion logic (and can't use the simplistic filepath.Walk routine). However, the ioutil.ReadDir and filepath.Glob routines only return slices. What if I'm pushing the limits of ext4 or xfs and have a directory with files numbering into the billions? I would expect golang to have a function that returns an unsorted series of os.FileInfo (or, even better, raw strings) over a channel rather than a sorted slice. How do we efficiently read file entries in this case?
All of the functions cited above seem to rely on readdirnames in os/dir_unix.go, and, for some reason, it just builds a slice when it seems like it would've been easy to spawn a goroutine and push the values into a channel. There might have been sound logic behind this, but it's not clear what it is. I'm new to Go, so I also could've easily missed some principle that's obvious to everyone else.
This is the source code, for convenience:
func (f *File) readdirnames(n int) (names []string, err error) {
	// If this file has no dirinfo, create one.
	if f.dirinfo == nil {
		f.dirinfo = new(dirInfo)
		// The buffer must be at least a block long.
		f.dirinfo.buf = make([]byte, blockSize)
	}
	d := f.dirinfo

	size := n
	if size <= 0 {
		size = 100
		n = -1
	}

	names = make([]string, 0, size) // Empty with room to grow.
	for n != 0 {
		// Refill the buffer if necessary
		if d.bufp >= d.nbuf {
			d.bufp = 0
			var errno error
			d.nbuf, errno = fixCount(syscall.ReadDirent(f.fd, d.buf))
			if errno != nil {
				return names, NewSyscallError("readdirent", errno)
			}
			if d.nbuf <= 0 {
				break // EOF
			}
		}

		// Drain the buffer
		var nb, nc int
		nb, nc, names = syscall.ParseDirent(d.buf[d.bufp:d.nbuf], n, names)
		d.bufp += nb
		n -= nc
	}
	if n >= 0 && len(names) == 0 {
		return names, io.EOF
	}
	return names, nil
}
ioutil.ReadDir and filepath.Glob are just convenience functions around reading directory entries.
You can read directory entries in batches by directly using the Readdir or Readdirnames methods, if you supply an n argument > 0.
For something as basic as reading directory entries, there's no need to add the overhead of a goroutine and channel, and also provide an alternate way to return the error. You can always wrap the batched calls with your own goroutine and channel pattern if you wish.
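For illustration, a batched loop over Readdirnames might look like this sketch (the batch size of 1000 is arbitrary):
f, err := os.Open(dirname)
if err != nil {
	// handle error
}
defer f.Close()

for {
	names, err := f.Readdirnames(1000) // read up to 1000 entries per call
	for _, name := range names {
		// process name here (or send it on a channel from your own goroutine)
	}
	if err != nil {
		if err != io.EOF {
			// handle error
		}
		break // io.EOF means the directory is exhausted
	}
}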
