Newbie: Properly sizing a []byte in Go (chunking)

Go Newbie alert!
Not quite sure how to do this: as a learning project, I want to make a "file chunker" that grabs fixed-size slices out of a binary file for later upload.
I currently have this:
type (
    fileChunk  []byte
    fileChunks []fileChunk
)

func NumChunks(fi os.FileInfo, chunkSize int) int {
    chunks := fi.Size() / int64(chunkSize)
    if rem := fi.Size()%int64(chunkSize) != 0; rem {
        chunks++
    }
    return int(chunks)
}
// left out err checks for brevity
func chunker(filePtr *string) fileChunks {
    f, err := os.Open(*filePtr)
    defer f.Close()
    // create the initial container to hold the slices
    file_chunks := make(fileChunks, 0)
    fi, err := f.Stat()
    // show me how big the original file is
    fmt.Printf("File Name: %s, Size: %d\n", fi.Name(), fi.Size())
    // let's partition it into 10000 byte pieces
    chunkSize := 10000
    chunks := NumChunks(fi, chunkSize)
    fmt.Printf("Need %d chunks for this file\n", chunks)
    for i := 0; i < chunks; i++ {
        b := make(fileChunk, chunkSize) // allocate a chunk, 10000 bytes
        n1, err := f.Read(b)
        fmt.Printf("Chunk: %d, %d bytes read\n", i, n1)
        // add chunk to "container"
        file_chunks = append(file_chunks, b)
    }
    fmt.Println(len(file_chunks))
    return file_chunks
}
This all works mostly fine, but here's what happens if my file size is 31234 bytes: I end up with three slices holding the first 30000 bytes from the file, and the final "chunk" consists of the 1234 remaining file bytes followed by zero "padding" up to the 10000-byte chunk size. I'd like the remainder fileChunk ([]byte) to be sized to 1234, not the full capacity. What would the proper way to do this be? On the receiving side I would then "stitch" together all the pieces to recreate the original file.

You need to re-slice the remainder chunk down to the number of bytes actually read:
n1, err := f.Read(b)
fmt.Printf("Chunk: %d, %d bytes read\n", i, n1)
b = b[:n1]
This does the re-slicing for all chunks. Normally, n1 will be 10000 for all the non-remainder chunks, but there is no guarantee. The docs say "Read reads up to len(b) bytes from the File." So it's good to pay attention to n1 all the time.
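If you want to guarantee that every chunk except the last one is completely filled, a minimal sketch of the loop using io.ReadFull could look like this (the io import is assumed; variable names match the question's code):
for i := 0; i < chunks; i++ {
    b := make(fileChunk, chunkSize)
    // io.ReadFull reads exactly len(b) bytes unless the file ends first,
    // in which case it returns io.ErrUnexpectedEOF and n is what was read.
    n, err := io.ReadFull(f, b)
    if err != nil && err != io.ErrUnexpectedEOF {
        break // a real read error; handle as needed
    }
    file_chunks = append(file_chunks, b[:n]) // trim to the bytes actually read
}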

Related

Scanner.Buffer - max value has no effect on custom Split?

To reduce the default 64k scanner buffer (for a microcomputer with low memory), I try to use this buffer and a custom split function:
scanner.Buffer(make([]byte, 5120), 64)
scanner.Split(Scan64Bytes)
Here I noticed that the second buffer argument "max" has no effect. If I instead pass e.g. 0, 1, 5120 or bufio.MaxScanTokenSize, I can't see any difference.
Only the first argument "buf" has consequences: if its capacity is too small, the scan is incomplete, and if it's too large, the B/op benchmem value increases.
From the doc:
The maximum token size is the larger of max and cap(buf). If max <= cap(buf), Scan will use this buffer only and do no allocation.
I don't understand which is the correct max value. Can you maybe explain this to me, please?
Go Playground
package main

import (
    "bufio"
    "bytes"
    "fmt"
)

func Scan64Bytes(data []byte, atEOF bool) (advance int, token []byte, err error) {
    if len(data) < 64 {
        return 0, data[0:], bufio.ErrFinalToken
    }
    return 64, data[0:64], nil
}

func main() {
    // improvised source of the same size:
    cmdstd := bytes.NewReader(make([]byte, 5120))
    scanner := bufio.NewScanner(cmdstd)
    // I guess 64 is the correct max arg:
    scanner.Buffer(make([]byte, 5120), 64)
    scanner.Split(Scan64Bytes)
    for i := 0; scanner.Scan(); i++ {
        fmt.Printf("%v: %v\r\n", i, scanner.Bytes())
    }
    if err := scanner.Err(); err != nil {
        fmt.Println(err)
    }
}
max value has no effect on custom Split?
No, without the custom split function the result is the same. But the following wouldn't be possible without the split function and ErrFinalToken:
//your reader/input
cmdstd := bytes.NewReader(make([]byte, 5120))
// your scanner buffer size
scanner.Buffer(make([]byte, 5120), 64)
The scanner's buffer size should be larger than the input. This is how I would set buf and max:
scanner.Buffer(make([]byte, 5121), 5120)
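To see when max does matter, here is a minimal, hypothetical sketch (my own values, using the default line splitter): with an initial buffer smaller than a token, Scan only succeeds if max allows the buffer to grow.
package main

import (
    "bufio"
    "fmt"
    "strings"
)

func main() {
    input := strings.Repeat("a", 200) + "\n" // one 200-byte line

    // max (256) > cap(buf) (16): Scan may grow the buffer up to 256 bytes.
    s := bufio.NewScanner(strings.NewReader(input))
    s.Buffer(make([]byte, 16), 256)
    fmt.Println(s.Scan(), s.Err()) // true <nil>

    // max (16) <= cap(buf) (16): the buffer can never grow, so the
    // 200-byte token doesn't fit and Scan fails.
    s = bufio.NewScanner(strings.NewReader(input))
    s.Buffer(make([]byte, 16), 16)
    fmt.Println(s.Scan(), s.Err()) // false bufio.Scanner: token too long
}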

Reading a random line from a file in constant time in Go

I have the following code to choose 2 random lines from a file containing lines of the form ip:port:
import (
    "os"
    "fmt"
    "math/rand"
    "log"
    "time"
    "unicode/utf8"
    //"bufio"
)

func main() {
    fmt.Println("num bytes in line is: \n", utf8.RuneCountInString("10.244.1.8:8080"))
    file_pods_array, err_file_pods_array := os.Open("pods_array.txt")
    if err_file_pods_array != nil {
        log.Fatalf("failed opening file: %s", err_file_pods_array)
    }
    //16 = num of bytes in ip:port pair
    randsource := rand.NewSource(time.Now().UnixNano())
    randgenerator := rand.New(randsource)
    firstLoc := randgenerator.Intn(10)
    secondLoc := randgenerator.Intn(10)
    candidate1 := ""
    candidate2 := ""
    num_bytes_from_start_first := 16 * (firstLoc + 1)
    num_bytes_from_start_second := 16 * (secondLoc + 1)
    buf_ipport_first := make([]byte, int64(15))
    buf_ipport_second := make([]byte, int64(15))
    start_first := int64(num_bytes_from_start_first)
    start_second := int64(num_bytes_from_start_second)
    _, err_first := file_pods_array.ReadAt(buf_ipport_first, start_first)
    first_ipport_ep := buf_ipport_first
    if err_first == nil {
        candidate1 = string(first_ipport_ep)
    }
    _, err_second := file_pods_array.ReadAt(buf_ipport_second, start_second)
    second_ipport_ep := buf_ipport_second
    if err_second == nil {
        candidate2 = string(second_ipport_ep)
    }
    fmt.Println("first is: ", candidate1)
    fmt.Println("sec is: ", candidate2)
}
This sometimes prints empty or partial lines.
Why does this happen and how can I fix it?
Output example:
num bytes in line is:
15
first is: 10.244.1.17:808
sec is:
10.244.1.11:80
Thank you.
If your lines were of a fixed length you could do this in constant time.
Length of each line is L.
Check the size of the file, S.
Divide S/L to get the number of lines N.
Pick a random number R from 0 to N-1.
Seek to R*L in the file.
Read L bytes.
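In code, that recipe might look like the following sketch (my own function, assuming every line is exactly recordLen bytes including its trailing newline, and that the file has at least one line; os and math/rand imports assumed):
// Pick one record from a file of fixed-length lines in constant time.
func randomFixedLine(f *os.File, recordLen int64) (string, error) {
    fi, err := f.Stat()
    if err != nil {
        return "", err
    }
    n := fi.Size() / recordLen // number of lines N
    r := rand.Int63n(n)        // random line index R in [0, N)
    buf := make([]byte, recordLen)
    if _, err := f.ReadAt(buf, r*recordLen); err != nil {
        return "", err
    }
    return string(buf[:recordLen-1]), nil // drop the trailing newline
}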
But you don't have fixed-length lines. We can't do constant time, but we can do it in constant memory and O(n) time using reservoir sampling, the technique from The Art of Computer Programming, Volume 2, Section 3.4.2, by Donald E. Knuth.
Read a line. Remember its line number M.
Pick a random number from 1 to M.
If it's 1, remember this line.
That is, as you read each line you have a 1/M chance of picking it. Cumulatively this adds up to 1/N for every line.
If we have three lines, the first line has a 1/1 chance of being picked. Then a 1/2 chance of remaining. Then a 2/3 chance of remaining. Total chance: 1 * 1/2 * 2/3 = 1/3.
The second line has a 1/2 chance of being picked and a 2/3 chance of remaining. Total chance: 1/2 * 2/3 = 1/3.
The third line has a 1/3 chance of being picked.
package main

import (
    "bufio"
    "fmt"
    "os"
    "log"
    "math/rand"
    "time"
)

func main() {
    file, err := os.Open("pods_array.txt")
    if err != nil {
        log.Fatal(err)
    }
    defer file.Close()
    scanner := bufio.NewScanner(file)
    randsource := rand.NewSource(time.Now().UnixNano())
    randgenerator := rand.New(randsource)
    lineNum := 1
    var pick string
    for scanner.Scan() {
        line := scanner.Text()
        fmt.Printf("Considering %v at 1/%v.\n", line, lineNum)
        // Instead of 1 to N it's 0 to N-1
        roll := randgenerator.Intn(lineNum)
        fmt.Printf("We rolled a %v.\n", roll)
        if roll == 0 {
            fmt.Printf("Picking line.\n")
            pick = line
        }
        lineNum += 1
    }
    fmt.Printf("Picked: %v\n", pick)
}
Because rand.Intn(n) returns [0,n), that is from 0 to n-1, we check for 0, not 1.
Maybe you're thinking "what if I seek to a random point in the file and then read the next full line?" That wouldn't quite be constant time, it would be O(longest-line), but more importantly it wouldn't be truly random: longer lines would get picked more frequently.
Note that since these are (I assume) all IP addresses and ports, you could have constant record lengths. Store the IPv4 address as 32 bits and the port as 16 bits: 48 bits per line.
However, this will break on IPv6. For forward compatibility, store everything as IPv6: 128 bits for the IP and 16 bits for the port, 144 bits (18 bytes) per line. Convert IPv4 addresses to IPv6 for storage.
This will allow you to pick random addresses in constant time, and it will save disk space.
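As a sketch of that record layout (my own helper, assuming the net and encoding/binary packages), each address becomes one fixed 18-byte record:
// Encode ip:port as a fixed 18-byte record: 16-byte IPv6 + 2-byte port.
// net.ParseIP(...).To16() maps IPv4 addresses into their IPv6 form.
func encodeRecord(ipStr string, port uint16) ([]byte, error) {
    ip := net.ParseIP(ipStr)
    if ip == nil {
        return nil, fmt.Errorf("bad IP: %q", ipStr)
    }
    rec := make([]byte, 18)
    copy(rec[:16], ip.To16())
    binary.BigEndian.PutUint16(rec[16:], port)
    return rec, nil
}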
Alternatively, store them in SQLite.
Found a solution using ioutil and strings:
func main() {
    randsource := rand.NewSource(time.Now().UnixNano())
    randgenerator := rand.New(randsource)
    firstLoc := randgenerator.Intn(10)
    secondLoc := randgenerator.Intn(10)
    candidate1 := ""
    candidate2 := ""
    dat, err := ioutil.ReadFile("pods_array.txt")
    if err == nil {
        ascii := string(dat)
        splt := strings.Split(ascii, "\n")
        candidate1 = splt[firstLoc]
        candidate2 = splt[secondLoc]
    }
    fmt.Println(candidate1)
    fmt.Println(candidate2)
}
Output
10.244.1.3:8080
10.244.1.11:8080

Efficient way to read Mmap

I am using syscall to read a byte array out of mmap:
file, e := os.Open(path)
if e != nil {...}
defer file.Close()
fi, e := file.Stat()
if e != nil {...}
data, e := syscall.Mmap(int(file.Fd()), 0, int(fi.Size()), syscall.PROT_READ, syscall.MAP_SHARED)
if e != nil {...}
data is the binary array I need.
I am using || as a delimiter, so I can get slices by using bytes.Split:
slices := bytes.Split(data, []byte("||"))
for _, s := range slices {
    str := string(s[:])
    fmt.Println(str)
}
This works fine, and I also stored the total number of messages (a uint64, which takes 8 bytes) at the beginning of the mmap.
When a new message is written in, I can get the total number of messages by reading the first 8 bytes.
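For example, assuming the count is stored as a little-endian uint64 (the byte order is an assumption), that header read is just:
n := binary.LittleEndian.Uint64(data[:8]) // message count from the 8-byte header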
Assuming I have the number of messages as n, I still need to do the following to read the new message:
slices := bytes.Split(data, []byte("||"))
s := slices[n - 1]
str := string(s[:])
fmt.Println(str)
Is there a more efficient way to do this?

Why don't goroutines write in parallel using WriteAt?

I'm experimenting a bit with reading and writing from a file.
To write to a file concurrently I created the following function:
func write(f *os.File, b []byte, off int64, c chan int) {
    _, err := f.WriteAt(b, off)
    check(err)
    c <- 0
}
I then create a file and 100000 goroutines to perform the write operations.
They each write an array of 16384 bytes to the hard disk:
func main() {
    path := "E:/test"
    f, err := os.OpenFile(path, os.O_RDWR|os.O_CREATE, 0666)
    check(err)
    size := int64(16384)
    ones := make([]byte, size)
    n := int64(100000)
    c := make(chan int, n)
    for i := int64(0); i < size; i++ {
        ones[i] = 1
    }
    // Start timing
    start := time.Now()
    for i := int64(0); i < n; i++ {
        go write(f, ones, size*i, c)
    }
    for i := int64(0); i < n; i++ {
        <-c
    }
    // Check elapsed time
    fmt.Println(time.Now().Sub(start))
    err = f.Sync()
    check(err)
    err = f.Close()
    check(err)
}
In this case about 1.6 GB is written, where each goroutine writes to a non-overlapping byte range. The documentation for the io package states that "Clients of WriteAt can execute parallel WriteAt calls on the same destination if the ranges do not overlap."
So what I expect to see is that when I use go write(f, ones, 0, c), it would take much longer, since all write operations would be on the same byte range.
However after testing this my results are quite unexpected:
Using go write(f, ones, size*i, c) took an average of about 3s
But using go write(f, ones, 0, c) only took an average of about 480ms
Am I using the WriteAt function in the wrong way? How could I achieve concurrent writing to non-overlapping byte ranges?

How to turn a slice of Uint64 into a slice of Bytes

I currently have a protobuf struct that looks like this:
type RequestEnvelop_MessageQuad struct {
    F1   [][]byte `protobuf:"bytes,1,rep,name=f1,proto3" json:"f1,omitempty"`
    F2   []byte   `protobuf:"bytes,2,opt,name=f2,proto3" json:"f2,omitempty"`
    Lat  float64  `protobuf:"fixed64,3,opt,name=lat" json:"lat,omitempty"`
    Long float64  `protobuf:"fixed64,4,opt,name=long" json:"long,omitempty"`
}
F1 takes some S2 Geometry data which I have generated like so:
ll := s2.LatLngFromDegrees(location.Latitude, location.Longitude)
cid := s2.CellIDFromLatLng(ll).Parent(15)
walkData := []uint64{cid.Pos()}
next := cid.Next()
prev := cid.Prev()
// 10 Before, 10 After
for i := 0; i < 10; i++ {
    walkData = append(walkData, next.Pos())
    walkData = append(walkData, prev.Pos())
    next = next.Next()
    prev = prev.Prev()
}
log.Println(walkData)
The only problem is that the protobuf struct expects a type of [][]byte, and I'm just not sure how I can get my uint64 data into bytes. Thanks.
Integer values can be encoded into byte arrays with the encoding/binary package from the standard library.
For instance, to encode a uint64 into a byte buffer, we could use the binary.PutUvarint function:
big := uint64(257)
buf := make([]byte, 2)
n := binary.PutUvarint(buf, big)
fmt.Printf("Wrote %d bytes into buffer: [% x]\n", n, buf)
Which would print:
Wrote 2 bytes into buffer: [81 02]
We can also write a generic stream to the buffer using the binary.Write function:
buf := new(bytes.Buffer)
var pi float64 = math.Pi
err := binary.Write(buf, binary.LittleEndian, pi)
if err != nil {
    fmt.Println("binary.Write failed:", err)
}
fmt.Printf("% x", buf.Bytes())
Which outputs:
18 2d 44 54 fb 21 09 40
(This second example was borrowed from that package's documentation, where you will find other similar examples.)
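Tying this back to the question, here is a minimal sketch (my own code, assuming a fixed 8-byte big-endian encoding per value is acceptable to whatever consumes the message) that turns the []uint64 walk data into the [][]byte field F1:
// Encode each uint64 as its own fixed 8-byte big-endian slice.
f1 := make([][]byte, 0, len(walkData))
for _, v := range walkData {
    b := make([]byte, 8)
    binary.BigEndian.PutUint64(b, v)
    f1 = append(f1, b)
}
msg := &RequestEnvelop_MessageQuad{F1: f1}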
