Go GZIP Flush and Determining Size - go

I am creating GZIPs on demand by streaming data, but I need to split it because the receving end has a hard code limit. When I Flush() and Close(), I see that the underyling byte buffer grows by 13 bytes. I looked at the source code of Gzip Close:
func (z *Writer) Close() error {
if z.err != nil {
return z.err
}
if z.closed {
return nil
}
z.closed = true
if !z.wroteHeader {
z.Write(nil)
if z.err != nil {
return z.err
}
}
z.err = z.compressor.Close()
if z.err != nil {
return z.err
}
le.PutUint32(z.buf[:4], z.digest)
le.PutUint32(z.buf[4:8], z.size)
_, z.err = z.w.Write(z.buf[:8])
return z.err
}
It indeed writes something but is there someway to determine it more pragmatic than just saying 13 bytes? There can be headers etc. I just want to have a safe margin, is there any possibilities that it can grow way larger than 13 bytes? I can happily set 1kb margin and live with it.

The 13 bytes are the maximum value to my knowledge. 8 bytes come from the gzip footer, the two PutUint32 calls.
The other 5 bytes are added by the huffmann compressor which ads an empty final block when the compressor is closed. It will add 3 bits (= 1 byte) for the final block header and 2 bytes for the length 0 and another 2 bytes which are inverted length 0xffff. So i assume you can calculate with those 13 bytes.

A conservative upper bound for the gzip-compressed output is:
n + ((n + 7) >> 3) + ((n + 63) >> 6) + 23
where n is the size of the input in bytes.

Related

how to manipulate very long string to avoid out of memory with golang

I trying for personal skills improvement to solve the hacker rank challenge:
There is a string, s, of lowercase English letters that is repeated infinitely many times. Given an integer, n, find and print the number of letter a's in the first n letters of the infinite string.
1<=s<=100 && 1<=n<=10^12
Very naively I though this code will be fine:
fs := strings.Repeat(s, int(n)) // full string
ss := fs[:n] // sub string
fmt.Println(strings.Count(ss, "a"))
Obviously I explode the memory and got an: "out of memory".
I never faced this kind of issue, and I'm clueless on how to handle it.
How can I manipulate very long string to avoid out of memory ?
I hope this helps, you don't have to actually count by running through the string.
That is the naive approach. You need to use some basic arithmetic to get the answer without running out of memory, I hope the comments help.
var answer int64
// 1st figure out how many a's are present in s.
aCount := int64(strings.Count(s, "a"))
// How many times will s repeat in its entirety if it had to be of length n
repeats := n / int64(len(s))
remainder := n % int64(len(s))
// If n/len(s) is not perfectly divisible, it means there has to be a remainder, check if that's the case.
// If s is of length 5 and the value of n = 22, then the first 2 characters of s would repeat an extra time.
if remainder > 0{
aCountInRemainder := strings.Count(s[:remainder], "a")
answer = int64((aCount * repeats) + int64(aCountInRemainder))
} else{
answer = int64((aCount * repeats))
}
return answer
There might be other methods but this is what came to my mind.
As you found out, if you actually generate the string you will end up having that huge memory block in RAM.
One common way to represent a "big sequence of incoming bytes" is to implement it as an io.Reader (which you can view as a stream of bytes), and have your code run a r.Read(buff) loop.
Given the specifics of the exercise you mention (a fixed string repeated n times), the number of occurrence of a specific letter can also be computed straight from the number of occurences of that letter in s, plus something more (I'll let you figure out what multiplications and counting should be done).
How to implement a Reader that repeats the string without allocating 10^12 times the string ?
Note that, when implementing the .Read() method, the caller has already allocated his buffer. You don't need to repeat your string in memory, you just need to fill the buffer with the correct values -- for example by copying byte by byte your data into the buffer.
Here is one way to do it :
type RepeatReader struct {
str string
count int
}
func (r *RepeatReader) Read(p []byte) (int, error) {
if r.count == 0 {
return 0, io.EOF
}
// at each iteration, pos will hold the number of bytes copied so far
var pos = 0
for r.count > 0 && pos < len(p) {
// to copy slices over, you can use the built-in 'copy' method
// at each iteration, you need to write bytes *after* the ones you have already copied,
// hence the "p[pos:]"
n := copy(p[pos:], r.str)
// update the amount of copied bytes
pos += n
// bad computation for this first example :
// I decrement one complete count, even if str was only partially copied
r.count--
}
return pos, nil
}
https://go.dev/play/p/QyFQ-3NzUDV
To have a complete, correct implementation, you also need to keep track of the offset you need to start from next time .Read() is called :
type RepeatReader struct {
str string
count int
offset int
}
func (r *RepeatReader) Read(p []byte) (int, error) {
if r.count == 0 {
return 0, io.EOF
}
var pos = 0
for r.count > 0 && pos < len(p) {
// when copying over to p, you should start at r.offset :
n := copy(p[pos:], r.str[r.offset:])
pos += n
// update r.offset :
r.offset += n
// if one full copy of str has been issued, decrement 'count' and reset 'offset' to 0
if r.offset == len(r.str) {
r.count--
r.offset = 0
}
}
return pos, nil
}
https://go.dev/play/p/YapRuioQcOz
You can now count the as while iterating through this Reader.

How to generate a Youtube ID in Go?

I'm assuming all I need to do is encode 2^64 as base64 to get a 11 character Youtube identifier. I created a Go program https://play.golang.org/p/2nuA3JxVMd0
package main
import (
"crypto/rand"
"encoding/base64"
"encoding/binary"
"fmt"
"math"
"math/big"
"strings"
)
func main() {
// For example Youtube uses 11 characters of base64.
// How many base64 characters would it require to express a 2^64 number? 2^6^x = 2^64 .. so x = 64/6 = 10.666666666 … i.e. eleven rounded up.
// Generate a 64 bit number
val, _ := randint64()
fmt.Println(val)
// Encode the 64 bit number
b := make([]byte, 8)
binary.LittleEndian.PutUint64(b, uint64(val))
encoded := base64.StdEncoding.EncodeToString([]byte(b))
fmt.Println(encoded, len(encoded))
// https://youtu.be/gocwRvLhDf8?t=75
ytid := strings.ReplaceAll(encoded, "+", "-")
ytid = strings.ReplaceAll(ytid, "/", "_")
fmt.Println("Youtube ID from 64 bit number:", ytid)
}
func randint64() (int64, error) {
val, err := rand.Int(rand.Reader, big.NewInt(int64(math.MaxInt64)))
if err != nil {
return 0, err
}
return val.Int64(), nil
}
But it has two issues:
The identifier is 12 characters instead of the expected 11
The encoded base64 suffix is "=" which means that it didn't have enough to encode?
So where am I going wrong?
tl;dr
An 8-byte int64 (no matter what value) will always encode to 11 base64 bytes followed by a single padded byte =, so you can reliably do this to get your 11 character YouTubeID:
var replacer = strings.NewReplacer(
"+", "-",
"/", "_",
)
ytid := replacer.Replace(encoded[:11])
or (H/T #Crowman & #Peter) one can encode without padding & without replacing + and / with base64.RawURLEncoding:
//encoded := base64.StdEncoding.EncodeToString(b) // may include + or /
ytid := base64.RawURLEncoding.EncodeToString(b) // produces URL-friendly - and _
https://play.golang.org/p/AjlvtfR7RWD
One byte (i.e. 8-bits) of Base64 output conveys 6-bits of input. So the formula to determine the number of output bytes given a certain inputs is:
out = in * 8 / 6
or
out = in * 4 / 3
With a devisor of 3 this will lead to partial use of output bytes in some cases. If the input bytes length is:
divisible by 3 - the final byte lands on a byte boundary
not divisible by 3 - the final byte is not on a byte-boundary and requires padding
In the case of 8 bytes of input:
out = 8 * 4 / 3 = 10 2/3
will utilize 10 fully utilized output base64 bytes - and one partial byte (for the 2/3) - so 11 base64 bytes plus padding to indicate how many wasted bits.
Padding is indicated via the = character and the number of = indicates the number of "wasted" bits:
waste padding
===== =======
0
1/3 =
2/3 ==
Since the output produces 10 2/3 used bytes - then 1/3 bytes were "wasted" so the padding is a single =
So base64 encoding 8 input bytes will always produce 11 base64 bytes followed by a single = padding character to produce 12 bytes in total.
= in base64 is padding, but in 64-bit numbers, this padding is extra and does not require 12 characters, but why?
see Encoding.Encode function source:
func (enc *Encoding) Encode(dst, src []byte) {
if len(src) == 0 {
return
}
// enc is a pointer receiver, so the use of enc.encode within the hot
// loop below means a nil check at every operation. Lift that nil check
// outside of the loop to speed up the encoder.
_ = enc.encode
di, si := 0, 0
n := (len(src) / 3) * 3
//https://golang.org/src/encoding/base64/base64.go
in this (len(src) / 3) * 3 part , used 3 instead of 6
so output of this function always is string with even length, if your input is always 64-bit, you can delete = after encoding and add it again for decoding.
for i := 8; i <= 18; i++ {
b := make([]byte, i)
binary.LittleEndian.PutUint64(b, uint64(0))
encoded := base64.StdEncoding.EncodeToString(b)
fmt.Println(encoded)
}
AAAAAAAAAAA=
AAAAAAAAAAAA
AAAAAAAAAAAAAA==
AAAAAAAAAAAAAAA=
AAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAA==
AAAAAAAAAAAAAAAAAAA=
AAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAA==
AAAAAAAAAAAAAAAAAAAAAAA=
AAAAAAAAAAAAAAAAAAAAAAAA
What do I mean by 6 (or 3)?
base64 use 64 character, each character map to one value (from 000000 to 111111)
example:
a 64bit value (uint64):
11154013587666973726
binary representation:
1001101011001011000001000100001011110000110001010011010000011110
split each six digit:
001001,101011,001011,000001,000100,001011,110000,110001,010011,010000,011110
J, r, L, B, E, L, w, x, T, Q, e

How to convert any negative value to zero with bitwise operators?

I'm writing the PopBack() operation for a LinkedList in Go, the code looks like this:
// PopBack will remove an item from the end of the linked list
func (ll *LinkedList) PopBack() {
lastNode := &ll.node
for *lastNode != nil && (*lastNode).next != nil {
lastNode = &(*lastNode).next
}
*lastNode = nil
if ll.Size() != 0 {
ll.size -= 1
}
}
I don't like the last if clause; if the size is zero we don't want to decrement to a negative value. I was wondering if there is a bitwise operation in which whatever the value is after the decrement, if it's only negative it should covert to a zero?
Negative values have the sign bit set, so you can do like this
ll.size += (-ll.size >> 31)
Suppose ll.size is int32 and ll.Size() returns ll.size. Of course this also implies that size is never negative. When the size is positive then the right shift will sign-extend -ll.size to make it -1, otherwise it'll be 0
If ll.size is int64 then change the shift count to 63. If ll.size is uint64 you can simply cast to int64 if the size is never larger than 263. But if the size can be that large (although almost impossible to occur in the far future) then things are much more trickier:
mask := uint64(-int64(ll.size >> 63)) // all ones if ll.size >= (1 << 63)
ll.size = ((ll.size - 1) & mask) | ((ll.size + uint64(-int64(ll.size) >> 63)) & ^mask)
It's basically a bitwise mux that's usually used in bithacks because you cannot cast bool to int without if in golang
Neither of these are quite readable at first glance so the if block is usually better
Trade a nil check in each iteration of the loop for a single nil check before the loop. With this change, the loop runs faster and the operator for updating size is subtraction.
func (ll *LinkedList) PopBack() {
if ll.node == nil {
return
}
lastNode := &ll.node
for (*lastNode).next != nil {
lastNode = &(*lastNode).next
}
*lastNode = nil
ll.size -= 1
}

Repeated calls to image.png.Decode() leads to out of memory errors

I am attempting to do what I originally thought would be pretty simple. To wit:
For every file in a list of input files:
open the file with png.Decode()
scan every pixel in the file and test to see if it is "grey".
Return the percentage of "grey" pixels in the image.
This is the function I am calling:
func greyLevel(fname string) (float64, string) {
f, err := os.Open(fname)
if err != nil {
return -1.0, "can't open file"
}
defer f.Close()
i, err := png.Decode(f)
if err != nil {
return -1.0, "unable to decode"
}
bounds := i.Bounds()
var lo uint32 = 122 // Low grey RGB value.
var hi uint32 = 134 // High grey RGB value.
var gpix float64 // Grey pixel count.
var opix float64 // Other (non-grey) pixel count.
var tpix float64 // Total pixels.
for x := bounds.Min.X; x < bounds.Max.X; x++ {
for y := bounds.Min.Y; y < bounds.Max.Y; y++ {
r, g, b, _ := i.At(x, y).RGBA()
if ((r/255)-1 > lo && (r/255)-1 < hi) &&
((g/255)-1 > lo && (g/255)-1 < hi) &&
((b/255)-1 > lo && (b/255)-1 < hi) {
gpix++
} else {
opix++
}
tpix++
}
}
return (gpix / tpix) * 100, ""
}
func main() {
srcDir := flag.String("s", "", "Directory containing image files.")
threshold := flag.Float64("t", 65.0, "Threshold (in percent) of grey pixels.")
flag.Parse()
dirlist, direrr := ioutil.ReadDir(*srcDir)
if direrr != nil {
log.Fatalf("Error reading %s: %s\n", *srcDir, direrr)
}
for f := range dirlist {
src := path.Join(*srcDir, dirlist[f].Name())
level, msg := greyLevel(src)
if msg != "" {
log.Printf("error processing %s: %s\n", src, msg)
continue
}
if level >= *threshold {
log.Printf("%s is grey (%2.2f%%)\n", src, level)
} else {
log.Printf("%s is not grey (%2.2f%%)\n", src, level)
}
}
}
The files are relatively small (960x720, 8-bit RGB)
I am calling ioutil.ReadDir() to generate a list of files, looping over the slice and calling greyLevel().
After about 155 files (out of a list of >4000) the script panics with:
runtime: memory allocated by OS not in usable range
runtime: out of memory: cannot allocate 2818048-byte block (534708224 in use)
throw: out of memory
I figure there is something simple I am missing. I thought that Go would de-allocate the memory allocated in greyLevels() but I guess not?
Follow up:
After inserting runtime.GC() after every call to greyLevels, the memory usage evens out. Last night I was teting to about 800 images then stopped. Today I let it run over the entire input set, approximately 6800 images.
After 1500 images, top looks like this:
top - 10:30:11 up 41 days, 11:47, 2 users, load average: 1.46, 1.25, 0.88
Tasks: 135 total, 2 running, 131 sleeping, 1 stopped, 1 zombie
Cpu(s): 49.8%us, 5.1%sy, 0.2%ni, 29.6%id, 15.0%wa, 0.0%hi, 0.3%si, 0.0%st
Mem: 3090304k total, 2921108k used, 169196k free, 2840k buffers
Swap: 3135484k total, 31500k used, 3103984k free, 640676k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
28474 mtw 20 0 2311m 1.8g 412 R 99 60.5 16:48.52 8.out
And remained steady after processing another 5000 images.
It appears that you are using a 32-bit machine. It is likely that the program runs out of memory because Go's garbage collector is conservative. A conservative garbage collector may fail to detect that some region of memory is no longer in use. There is currently no workaround for this in Go programs other than avoiding data structures that the garbage collector cannot handle (such as: struct {...; binaryData [256]byte})
Try to call runtime.GC() in each iteration of the loop in which you are calling function greyLevel. Maybe it will help the program to process more images.
If calling runtime.GC() fails to improve the situation you may want to change your strategy so that the program processes a smaller number of PNG files per run.
Seems like issue 3173 which was recently fixed. Could you please retry with latest weekly? (Assuming you now use some pre 2012-03-07 version).

With Go, how to append unknown number of byte into a vector and get a slice of bytes?

I'm trying to encode a large number to a list of bytes(uint8 in Go).
The number of bytes is unknown, so I'd like to use vector.
But Go doesn't provide vector of byte, what can I do?
And is it possible to get a slice of such a byte vector?
I intends to implement data compression.
Instead of store small and large number with the same number of bytes,
I'm implements a variable bytes that uses less bytes with small number
and more bytes with large number.
My code can not compile, invalid type assertion:
1 package main
2
3 import (
4 //"fmt"
5 "container/vector"
6 )
7
8 func vbEncodeNumber(n uint) []byte{
9 bytes := new(vector.Vector)
10 for {
11 bytes.Push(n % 128)
12 if n < 128 {
13 break
14 }
15 n /= 128
16 }
17 bytes.Set(bytes.Len()-1, bytes.Last().(byte)+byte(128))
18 return bytes.Data().([]byte) // <-
19 }
20
21 func main() { vbEncodeNumber(10000) }
I wish to writes a lot of such code into binary file,
so I wish the func can return byte array.
I haven't find a code example on vector.
Since you're trying to represent large numbers, you might see if the big package serves your purposes.
The general Vector struct can be used to store bytes. It accepts an empty interface as its type, and any other type satisfies that interface. You can retrieve a slice of interfaces through the Data method, but there's no way to convert that to a slice of bytes without copying it. You can't use type assertion to turn a slice of interface{} into a slice of something else. You'd have to do something like the following at the end of your function: (I haven't tried compiling this code because I can't right now)
byteSlice = make([]byte, bytes.Len())
for i, _ := range byteSlice {
byteSlice[i] = bytes.At(i).(byte)
}
return byteSlice
Take a look at the bytes package and the Buffer type there. You can write your ints as bytes into the buffer and then you can use the Bytes() method to access byte slices of the buffer.
I've found the vectors to be a lot less useful since the generic append and copy were added to the language. Here's how I'd do it in one shot with less copying:
package main
import "fmt"
func vbEncodeNumber(n uint) []byte {
bytes := make([]byte, 0, 4)
for n > 0 {
bytes = append(bytes, byte(n%256))
n >>= 8
}
return bytes
}
func main() {
bytes := vbEncodeNumber(10000)
for i := len(bytes)-1; i >= 0 ; i-- {
fmt.Printf("%02x ", bytes[i])
}
fmt.Println("")
}

Resources