Scanner.Buffer - max value has no effect on custom Split?

To reduce the default 64k scanner buffer (for a microcomputer with low memory), I try to use this buffer and a custom split function:
scanner.Buffer(make([]byte, 5120), 64)
scanner.Split(Scan64Bytes)
Here I noticed that the second buffer argument "max" has no effect. If I instead insert e.g. 0, 1, 5120 or bufio.MaxScanTokenSize, I can't see any difference.
Only the first argument "buf" has consequences. If the capacity is too small, the scan is incomplete, and if it's too large, the B/op benchmem value increases.
From the doc:
The maximum token size is the larger of max and cap(buf). If max <= cap(buf), Scan will use this buffer only and do no allocation.
I don't understand which is the correct max value. Can you maybe explain this to me, please?
Go Playground
package main

import (
	"bufio"
	"bytes"
	"fmt"
)

func Scan64Bytes(data []byte, atEOF bool) (advance int, token []byte, err error) {
	if len(data) < 64 {
		return 0, data[0:], bufio.ErrFinalToken
	}
	return 64, data[0:64], nil
}

func main() {
	// improvised source of the same size:
	cmdstd := bytes.NewReader(make([]byte, 5120))
	scanner := bufio.NewScanner(cmdstd)
	// I guess 64 is the correct max arg:
	scanner.Buffer(make([]byte, 5120), 64)
	scanner.Split(Scan64Bytes)
	for i := 0; scanner.Scan(); i++ {
		fmt.Printf("%v: %v\r\n", i, scanner.Bytes())
	}
	if err := scanner.Err(); err != nil {
		fmt.Println(err)
	}
}

max value has no effect on custom Split?
No; even without the custom split function the result is the same. But this setup wouldn't be possible at all without the split function and ErrFinalToken:
//your reader/input
cmdstd := bytes.NewReader(make([]byte, 5120))
// your scanner buffer size
scanner.Buffer(make([]byte, 5120), 64)
The scanner's buffer should be larger than the largest expected token. This is how I would set buf and max:
scanner.Buffer(make([]byte, 5121), 5120)
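To see when max actually does matter, here is a minimal sketch of my own (assuming the default ScanLines split function and an input that is one oversized line): max only comes into play once a token outgrows cap(buf), because the buffer is only grown up to max.

package main

import (
	"bufio"
	"fmt"
	"strings"
)

func main() {
	input := strings.Repeat("a", 63) + "\n" // one 63-byte line

	// cap(buf) = 16 and max = 16: the line cannot fit, so Scan fails.
	s1 := bufio.NewScanner(strings.NewReader(input))
	s1.Buffer(make([]byte, 16), 16)
	fmt.Println(s1.Scan(), s1.Err()) // false bufio.Scanner: token too long

	// cap(buf) = 16 but max = 64: the buffer may grow up to 64 bytes, so Scan succeeds.
	s2 := bufio.NewScanner(strings.NewReader(input))
	s2.Buffer(make([]byte, 16), 64)
	s2.Scan()
	fmt.Println(len(s2.Text()), s2.Err()) // 63 <nil>
}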

Related

Binary Encoding/Decoding File in Golang Gives Different Checksum

I'm working on encoding and decoding files in golang. I specifically do need the 2D array that I'm using, this is just test code to show the point. I'm not entirely sure what I'm doing wrong, I'm attempting to convert the file into a list of uint32 numbers and then take those numbers and convert them back to a file. The problem is that when I do it the file looks fine but the checksum doesn't line up. I suspect that I'm doing something wrong in the conversion to uint32. I have to do the switch/case because I have no way of knowing how many bytes I'll read for sure at the end of a given file.
package main

import (
	"bufio"
	"encoding/binary"
	"fmt"
	"io"
	"os"
)

const (
	headerSeq = 8
	body      = 24
)

type part struct {
	Seq  int
	Data uint32
}

func main() {
	f, err := os.Open("speech.pdf")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	reader := bufio.NewReader(f)
	b := make([]byte, 4)
	o := make([][]byte, 0)
	var value uint32
	for {
		n, err := reader.Read(b)
		if err != nil {
			if err != io.EOF {
				panic(err)
			}
		}
		if n == 0 {
			break
		}
		fmt.Printf("len array %d\n", len(b))
		fmt.Printf("len n %d\n", n)
		switch n {
		case 1:
			value = uint32(b[0])
		case 2:
			value = uint32(uint32(b[1]) | uint32(b[0])<<8)
		case 3:
			value = uint32(uint32(b[2]) | uint32(b[1])<<8 | uint32(b[0])<<16)
		case 4:
			value = uint32(uint32(b[3]) | uint32(b[2])<<8 | uint32(b[1])<<16 | uint32(b[0])<<24)
		}
		fmt.Println(value)
		bs := make([]byte, 4)
		binary.BigEndian.PutUint32(bs, value)
		o = append(o, bs)
	}
	fo, err := os.OpenFile("test.pdf", os.O_APPEND|os.O_WRONLY|os.O_CREATE, 0600)
	if err != nil {
		panic(err)
	}
	defer fo.Close()
	for _, ba := range o {
		_, err := fo.Write(ba)
		if err != nil {
			panic(err)
		}
	}
}
So, you want to write and read arrays of varying length in a file.
import "encoding/binary"
// You need a consistent byte order for reading and writing multi-byte data types
const order = binary.LittleEndian
var dataToWrite = []byte{ ... ... ... }
var err error
// To write a recoverable array of varying length
var w io.Writer
// First, encode the length of data that will be written
err = binary.Write(w, order, int64(len(dataToWrite)))
// Check error
err = binary.Write(w, order, dataToWrite)
// Check error
// To read a variable length array
var r io.Reader
var dataLen int64
// First, we need to know the length of data to be read
err = binary.Read(r, order, &dataLen)
// Check error
// Allocate a slice to hold the expected amount of data
dataReadIn := make([]byte, dataLen)
err = binary.Read(r, order, dataReadIn)
// Check error
This pattern works not just with byte, but any other fixed size data type. See binary.Write for specifics about the encoding.
If the size of encoded data is a big concern, you can save some bytes by storing the array length as a varint with binary.PutVarint and binary.ReadVarint
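Here is a self-contained round trip of this pattern, using a bytes.Buffer in place of a real file (a sketch; the payload value is made up):

package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

func main() {
	var order binary.ByteOrder = binary.LittleEndian
	payload := []byte("hello, varying length")

	// Write: length prefix first, then the raw bytes.
	var buf bytes.Buffer
	if err := binary.Write(&buf, order, int64(len(payload))); err != nil {
		panic(err)
	}
	if err := binary.Write(&buf, order, payload); err != nil {
		panic(err)
	}

	// Read: length prefix first, then exactly that many bytes.
	var n int64
	if err := binary.Read(&buf, order, &n); err != nil {
		panic(err)
	}
	data := make([]byte, n)
	if err := binary.Read(&buf, order, data); err != nil {
		panic(err)
	}
	fmt.Printf("%s\n", data) // hello, varying length
}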

If I know the max size of many tmp slices, should I set capacity when creating them?

If I need to use tmp slices in a function and the function will be called many times, their max capacity will not exceed 10, but their lengths vary. Just for example, maybe 80% of them only have a size of 1, 10% have a size of 3, and 10% have a size of 10.
I can think of an example function like the following:
func getDataFromDb(s []string) []string {
	tmpSlice := make([]string, 0, 10)
	for _, v := range s {
		if check(v) {
			tmpSlice = append(tmpSlice, v)
		}
	}
	......
	return searchDb(tmpSlice)
}
So should I do var tmpSlice []string, tmpSlice := make([]string, 0, 0), tmpSlice := make([]string, 0, 5), or tmpSlice := make([]string, 0, 10)? or any other suggestions?
Fastest is code that doesn't allocate on the heap.
Create variables that allocate on the stack and do not escape (pass variables by value, otherwise they will escape).
You can check for escapes by building with -gcflags "-m -l".
Here is an example showing that if we substitute the slice with an array and pass it by value, the result is fast code without any allocation (on the heap).
package main

import "testing"

func BenchmarkAllocation(b *testing.B) {
	b.Run("Slice", func(b2 *testing.B) {
		for i := 0; i < b2.N; i++ {
			_ = getDataFromDbSlice([]string{"one", "two"})
		}
	})
	b.Run("Array", func(b2 *testing.B) {
		for i := 0; i < b2.N; i++ {
			_ = getDataFromDbArray([]string{"one", "two"})
		}
	})
}

type DbQuery [10]string
type DbQueryResult [10]string

func getDataFromDbArray(s []string) DbQueryResult {
	q := DbQuery{}
	return processQueryArray(q)
}

func processQueryArray(q DbQuery) DbQueryResult {
	return DbQueryResult(q)
}

func getDataFromDbSlice(s []string) []string {
	tmpArray := make([]string, 0, 10)
	return processQuerySlice(tmpArray)
}

func processQuerySlice(q []string) []string {
	return q
}
Running the benchmark with -benchmem gives these results:
BenchmarkAllocation/Slice-6 30000000 51.8 ns/op 160 B/op 1 allocs/op
BenchmarkAllocation/Array-6 100000000 15.7 ns/op 0 B/op 0 allocs/op
This answer assumes that searchDB does not retain a reference to the slice passed to it. It seems unlikely that the function retains a reference given the variable and function names.
These options have the same memory and performance characteristics:
var tmpSlice []string
tmpSlice := []string{}
tmpSlice := make([]string, 0)
tmpSlice := make([]string, 0, 0)
None of them allocate memory until the first append operation. If these are your only options, then pick one of the first two because they are easier to read.
This option will have the best performance:
tmpSlice := make([]string, 0, 10)
This ensures that the backing array for the slice is allocated once. There will be no reallocations of the backing array as values are appended.
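A quick way to see that appends within the pre-allocated capacity never reallocate the backing array (a minimal sketch of my own):

package main

import "fmt"

func main() {
	s := make([]string, 0, 10)
	before := cap(s)
	for i := 0; i < 10; i++ {
		s = append(s, "x")
	}
	// cap is unchanged: the ten appends reused the original backing array.
	fmt.Println(before == cap(s)) // true
}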
If searchDB's argument does not escape, then the one allocation for the backing array will be made on the stack. This is the best possible performance. You can find out if the argument escapes by building with the -gcflags "-m -l" option.
Given that getDataFromDb invokes a database operation, any performance difference between the options will be in the noise. It's more important to write clear and simple code than to optimize this.
I would probably go with the var tmpSlice []string over tmpSlice := make([]string, 0, 10) because there's no need to understand where the value 10 came from with the former.
I would do
var tmpSlice []string
This gives you a nil string slice, and you can append as needed.
Unless the slice gets big and you know the dimensions beforehand, I wouldn't pre-allocate memory for it.
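Appending to a nil slice is valid Go; the first append allocates as needed. A minimal sketch:

package main

import "fmt"

func main() {
	var tmpSlice []string // nil, no allocation yet
	tmpSlice = append(tmpSlice, "a", "b")
	fmt.Println(len(tmpSlice), cap(tmpSlice) >= 2) // 2 true
}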

Comparing the same value in ioutil package?

I'm puzzled by what this line of code does from the ioutil package. It appears to compare the same value twice but casts it twice on one side. Any insights would be greatly appreciated!
int64(int(capacity)) == capacity
from this function
func readAll(r io.Reader, capacity int64) (b []byte, err error) {
	var buf bytes.Buffer
	// If the buffer overflows, we will get bytes.ErrTooLarge.
	// Return that as an error. Any other panic remains.
	defer func() {
		e := recover()
		if e == nil {
			return
		}
		if panicErr, ok := e.(error); ok && panicErr == bytes.ErrTooLarge {
			err = panicErr
		} else {
			panic(e)
		}
	}()
	if int64(int(capacity)) == capacity {
		buf.Grow(int(capacity))
	}
	_, err = buf.ReadFrom(r)
	return buf.Bytes(), err
}
Converting Tim Cooper's comment into an answer.
bytes.Buffer.Grow takes in an int and capacity is int64.
func (b *Buffer) Grow(n int)
Grow grows the buffer's capacity, if necessary, to guarantee space for
another n bytes. After Grow(n), at least n bytes can be written to the
buffer without another allocation.
As mentioned in GoDoc, Grow is used as an optimisation to prevent further allocations.
int64(int(capacity)) == capacity
makes sure that capacity is within the range of int values so that the optimisation can be applied.
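A minimal sketch of why the check matters: on a platform where int is 32 bits, converting a large int64 to int truncates, so the round trip no longer compares equal and the Grow optimisation is safely skipped.

package main

import (
	"fmt"
	"math"
)

func fitsInInt(capacity int64) bool {
	return int64(int(capacity)) == capacity
}

func main() {
	fmt.Println(fitsInInt(1024))          // true on all platforms
	fmt.Println(fitsInInt(math.MaxInt64)) // false on 32-bit platforms, true on 64-bit
}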

Newbie: Properly sizing a []byte size in GO (Chunking)

Go Newbie alert!
Not quite sure how to do this - I want to make a "file chunker" where I grab fixed slices out of a binary file for later upload as a learning project.
I currently have this:
type (
	fileChunk  []byte
	fileChunks []fileChunk
)

func NumChunks(fi os.FileInfo, chunkSize int) int {
	chunks := fi.Size() / int64(chunkSize)
	if rem := fi.Size()%int64(chunkSize) != 0; rem {
		chunks++
	}
	return int(chunks)
}

// left out err checks for brevity
func chunker(filePtr *string) fileChunks {
	f, err := os.Open(*filePtr)
	defer f.Close()
	// create the initial container to hold the slices
	file_chunks := make(fileChunks, 0)
	fi, err := f.Stat()
	// show me how big the original file is
	fmt.Printf("File Name: %s, Size: %d\n", fi.Name(), fi.Size())
	// let's partition it into 10000 byte pieces
	chunkSize := 10000
	chunks := NumChunks(fi, chunkSize)
	fmt.Printf("Need %d chunks for this file", chunks)
	for i := 0; i < chunks; i++ {
		b := make(fileChunk, chunkSize) // allocate a chunk, 10000 bytes
		n1, err := f.Read(b)
		fmt.Printf("Chunk: %d, %d bytes read\n", i, n1)
		// add chunk to "container"
		file_chunks = append(file_chunks, b)
	}
	fmt.Println(len(file_chunks))
	return file_chunks
}
This all works mostly fine, but here's what happens if my file size is 31234 bytes: I end up with three slices holding the first 30000 bytes from the file, and the final "chunk" consists of 1234 "file bytes" followed by "padding" up to the 10000-byte chunk size. I'd like the remainder fileChunk ([]byte) to be sized to 1234, not the full capacity. What would the proper way to do this be? On the receiving side I would then "stitch" together all the pieces to recreate the original file.
You need to re-slice the remainder chunk to be just the length of the last chunk read:
n1, err := f.Read(b)
fmt.Printf("Chunk: %d, %d bytes read\n", i, n1)
b = b[:n1]
This does the re-slicing for all chunks. Normally, n1 will be 10000 for all the non-remainder chunks, but there is no guarantee. The docs say "Read reads up to len(b) bytes from the File." So it's good to pay attention to n1 all the time.
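Applied to the loop in the question, the fix is one line (a sketch; error handling is still elided, as in the original):

for i := 0; i < chunks; i++ {
	b := make(fileChunk, chunkSize) // allocate a chunk, 10000 bytes
	n1, _ := f.Read(b)
	b = b[:n1] // shrink the chunk to the bytes actually read
	file_chunks = append(file_chunks, b)
}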

Convert an integer to a byte array

I have a function which receives a []byte, but what I have is an int. What is the best way to go about this conversion?
err = a.Write([]byte(myInt))
I guess I could go the long way and get it into a string and put that into bytes, but it sounds ugly and I guess there are better ways to do it.
I agree with Brainstorm's approach: assuming that you're passing a machine-friendly binary representation, use the encoding/binary library. The OP suggests that binary.Write() might have some overhead. Looking at the source for the implementation of Write(), I see that it does some runtime decisions for maximum flexibility.
func Write(w io.Writer, order ByteOrder, data interface{}) error {
	// Fast path for basic types.
	var b [8]byte
	var bs []byte
	switch v := data.(type) {
	case *int8:
		bs = b[:1]
		b[0] = byte(*v)
	case int8:
		bs = b[:1]
		b[0] = byte(v)
	case *uint8:
		bs = b[:1]
		b[0] = *v
	...
Right? Write() takes a very generic third argument, data, and that imposes some overhead, as the Go runtime is then forced to encode type information. Since Write() is making some runtime decisions here that you simply don't need in your situation, maybe you can just call the encoding functions directly and see if it performs better.
Something like this:
package main

import (
	"encoding/binary"
	"fmt"
)

func main() {
	bs := make([]byte, 4)
	binary.LittleEndian.PutUint32(bs, 31415926)
	fmt.Println(bs)
}
Let us know how this performs.
Otherwise, if you're just trying to get an ASCII representation of the integer, you can get the string representation (probably with strconv.Itoa) and convert that string to the []byte type.
package main

import (
	"fmt"
	"strconv"
)

func main() {
	bs := []byte(strconv.Itoa(31415926))
	fmt.Println(bs)
}
Check out the "encoding/binary" package. Particularly the Read and Write functions:
binary.Write(a, binary.LittleEndian, int64(myInt)) // binary.Write needs a fixed-size type, so convert the int first
Sorry, this might be a bit late, but I think I found a better implementation in the Go docs:
buf := new(bytes.Buffer)
var num uint16 = 1234
err := binary.Write(buf, binary.LittleEndian, num)
if err != nil {
	fmt.Println("binary.Write failed:", err)
}
fmt.Printf("% x", buf.Bytes())
I thought the int type would have some method for converting to bytes, but the first thing I found was this math/big approach:
https://golang.org/pkg/math/big/
var f int = 52452356235           // int
var s = big.NewInt(int64(f))      // int to big.Int
var b = s.Bytes()                 // big.Int to byte slice
var r = big.NewInt(0).SetBytes(b) // bytes to big.Int
var i int = int(r.Int64())        // big.Int to int
https://play.golang.org/p/VAKSGw8XNQq
However, this method uses the absolute value.
If you spend one more byte, you can carry the sign as well:
func IntToBytes(i int) []byte {
	if i > 0 {
		return append(big.NewInt(int64(i)).Bytes(), byte(1))
	}
	return append(big.NewInt(int64(i)).Bytes(), byte(0))
}

func BytesToInt(b []byte) int {
	if b[len(b)-1] == 0 {
		return -int(big.NewInt(0).SetBytes(b[:len(b)-1]).Int64())
	}
	return int(big.NewInt(0).SetBytes(b[:len(b)-1]).Int64())
}
https://play.golang.org/p/mR5Sp5hu4jk
or a newer version: https://play.golang.org/p/7ZAK4QL96FO
(The package also provides functions to fill an existing slice.)
https://golang.org/pkg/math/big/#Int.FillBytes
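A quick round trip with the sign-byte helpers above (my own check, assuming the IntToBytes/BytesToInt definitions as given):

fmt.Println(BytesToInt(IntToBytes(42)))  // 42
fmt.Println(BytesToInt(IntToBytes(-42))) // -42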
Adding this option for dealing with a basic uint8-range value to []byte conversion:
foo := 255 // 1 - 255
ufoo := uint16(foo)
far := []byte{0,0}
binary.LittleEndian.PutUint16(far, ufoo)
bar := int(far[0]) // back to int
fmt.Println("foo, far, bar : ",foo,far,bar)
output :
foo, far, bar : 255 [255 0] 255
Here is another option, based on the Go source code [1]:
package main

import (
	"encoding/binary"
	"fmt"
	"math/bits"
)

func encodeUint(x uint64) []byte {
	buf := make([]byte, 8)
	binary.BigEndian.PutUint64(buf, x)
	return buf[bits.LeadingZeros64(x)>>3:]
}

func main() {
	for x := 0; x <= 64; x += 8 {
		buf := encodeUint(1<<x - 1)
		fmt.Println(buf)
	}
}
Result:
[]
[255]
[255 255]
[255 255 255]
[255 255 255 255]
[255 255 255 255 255]
[255 255 255 255 255 255]
[255 255 255 255 255 255 255]
[255 255 255 255 255 255 255 255]
Much faster than math/big:
BenchmarkBig-12 28348621 40.62 ns/op
BenchmarkBit-12 731601145 1.641 ns/op
https://github.com/golang/go/blob/go1.16.5/src/encoding/gob/encode.go#L113-L117
You can try musgo_int. All you need to do is to cast your variable:
package main

import (
	"github.com/ymz-ncnk/musgo_int"
)

func main() {
	var myInt int = 1234
	// from int to []byte
	buf := make([]byte, musgo_int.Int(myInt).SizeMUS())
	musgo_int.Int(myInt).MarshalMUS(buf)
	// from []byte to int
	_, err := (*musgo_int.Int)(&myInt).UnmarshalMUS(buf)
	if err != nil {
		panic(err)
	}
}
Convert Integer to byte slice.
import (
	"bytes"
	"encoding/binary"
	"log"
)

func IntToBytes(num int64) []byte {
	buff := new(bytes.Buffer)
	bigOrLittleEndian := binary.BigEndian
	err := binary.Write(buff, bigOrLittleEndian, num)
	if err != nil {
		log.Panic(err)
	}
	return buff.Bytes()
}
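A quick usage check for this helper (my own example value):

fmt.Printf("% x\n", IntToBytes(1234)) // 00 00 00 00 00 00 04 d2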
Maybe the simplest way is to use protobuf; see Protocol Buffer Basics: Go.
Define a message like:
message MyData {
  int32 id = 1;
}
See more in Defining your protocol format.
// Write
out, err := proto.Marshal(mydata)
Read more in Writing a Message.
Try the math/big package to convert a byte slice to an int and to convert an int to a byte slice.
package main

import (
	"fmt"
	"math/big"
)

func main() {
	// Convert int to []byte
	var int_to_encode int64 = 65535
	var bytes_array []byte = big.NewInt(int_to_encode).Bytes()
	fmt.Println("bytes array", bytes_array)
	// Convert []byte to int
	var decoded_int int64 = new(big.Int).SetBytes(bytes_array).Int64()
	fmt.Println("decoded int", decoded_int)
}
This is the most straightforward (and shortest, and safest, and maybe most performant) way; buf.Bytes() is of type []byte:
var val uint32 = 42
buf := new(bytes.Buffer)
err := binary.Write(buf, binary.LittleEndian, val)
if err != nil {
	fmt.Println("binary.Write failed:", err)
}
fmt.Printf("% x\n", buf.Bytes())
see also https://stackoverflow.com/a/74819602/589493
What's wrong with converting it to a string?
[]byte(fmt.Sprintf("%d", myint))
