I'm playing around with the Odin language and found something I can't get my head around.
I have the following code:
package main

import "core:fmt"
import "core:os"

File :: struct {
    data: []u8
}

main :: proc() {
    if len(os.args) != 2 {
        fmt.println("usage: jinspector <classfile>")
        os.exit(1)
    }

    file, ok := init_file(os.args[1])
    if !ok {
        os.exit(1)
    }
    deinit_file(file)
}

init_file :: proc(file: string) -> (^File, bool) {
    if !os.exists(file) {
        fmt.printf("file '%s' does not exist\n", file)
        return nil, false
    }

    f := new(File)
    data, ok := os.read_entire_file_from_filename(file)
    if !ok {
        free(f)
        fmt.println("reading file failed")
        return nil, false
    }
    f.data = data
    return f, true
}

deinit_file :: proc(f: ^File) {
    delete(f.data)
    free(f)
}
As far as I can see I freed everything correctly, but when I run the program under Valgrind, I get the following output:
==10344== Memcheck, a memory error detector
==10344== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==10344== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==10344== Command: ./jinspector TestProject.class
==10344==
==10344==
==10344== HEAP SUMMARY:
==10344== in use at exit: 47 bytes in 1 blocks
==10344== total heap usage: 4 allocs, 3 frees, 4,194,961 bytes allocated
==10344==
==10344== 47 bytes in 1 blocks are possibly lost in loss record 1 of 1
==10344== at 0x484DA83: calloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==10344== by 0x421F7A: os.heap_alloc (in /home/luis/PythonProjects/jinspector/jinspector)
==10344== by 0x43E95D: os.heap_allocator_proc.aligned_alloc-0 (in /home/luis/PythonProjects/jinspector/jinspector)
==10344== by 0x421656: os.heap_allocator_proc (in /home/luis/PythonProjects/jinspector/jinspector)
==10344== by 0x43BCE5: runtime.make_aligned-17749 (in /home/luis/PythonProjects/jinspector/jinspector)
==10344== by 0x43ABEC: runtime.make_slice-10183 (in /home/luis/PythonProjects/jinspector/jinspector)
==10344== by 0x406877: os._alloc_command_line_arguments (in /home/luis/PythonProjects/jinspector/jinspector)
==10344== by 0x406775: __$startup_runtime (in /home/luis/PythonProjects/jinspector/jinspector)
==10344== by 0x43A4C3: main (in /home/luis/PythonProjects/jinspector/jinspector)
==10344==
==10344== LEAK SUMMARY:
==10344== definitely lost: 0 bytes in 0 blocks
==10344== indirectly lost: 0 bytes in 0 blocks
==10344== possibly lost: 47 bytes in 1 blocks
==10344== still reachable: 0 bytes in 0 blocks
==10344== suppressed: 0 bytes in 0 blocks
==10344==
==10344== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
I free the pointer if the read fails (in that case the slice returned by the read function should already have been deleted, at least judging from the source code), and if everything is okay, I delete the slice and then free the pointer.
Valgrind still reports possible memory leaks even though I think I freed all the memory.
Is this a false positive on Valgrind's end, or did I overlook something about Odin's memory management?
Odin version I use: dev-2022-10
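Note that the "possibly lost" backtrace above runs through `os._alloc_command_line_arguments` and `__$startup_runtime`, i.e. the runtime's one-time allocation for `os.args`, not any of the code in `init_file`/`deinit_file`. If that is indeed the source, one workaround is to silence just that record with a Valgrind suppression. A minimal sketch, assuming the frame names from the trace above (the suppression name is arbitrary):

```
{
   odin_startup_os_args
   Memcheck:Leak
   match-leak-kinds: possible
   fun:calloc
   fun:os.heap_alloc
   ...
}
```

Saved as e.g. `odin.supp`, it would be passed as `valgrind --suppressions=odin.supp ./jinspector TestProject.class`. The `...` line is Valgrind's frame wildcard, so the rest of the startup call chain doesn't need to be spelled out.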
I am creating GZIPs on demand by streaming data, but I need to split the stream because the receiving end has a hard-coded size limit. When I call Flush() and Close(), I see that the underlying byte buffer grows by 13 bytes. I looked at the source code of gzip's Close:
func (z *Writer) Close() error {
    if z.err != nil {
        return z.err
    }
    if z.closed {
        return nil
    }
    z.closed = true
    if !z.wroteHeader {
        z.Write(nil)
        if z.err != nil {
            return z.err
        }
    }
    z.err = z.compressor.Close()
    if z.err != nil {
        return z.err
    }
    le.PutUint32(z.buf[:4], z.digest)
    le.PutUint32(z.buf[4:8], z.size)
    _, z.err = z.w.Write(z.buf[:8])
    return z.err
}
It indeed writes something, but is there a more pragmatic way to determine the size than just assuming 13 bytes? There can be headers etc. I just want a safe margin: is there any possibility that it can grow much larger than 13 bytes? I can happily set a 1 kB margin and live with it.
The 13 bytes are the maximum value to my knowledge. 8 bytes come from the gzip footer, the two PutUint32 calls.
The other 5 bytes are added by the Huffman compressor, which adds an empty final block when the compressor is closed. It adds 3 bits (rounded up to 1 byte) for the final block header, 2 bytes for the length 0, and another 2 bytes for the inverted length 0xffff. So I assume you can calculate with those 13 bytes.
A conservative upper bound for the gzip-compressed output is:
n + ((n + 7) >> 3) + ((n + 63) >> 6) + 23
where n is the size of the input in bytes.
What I found:
I printed the time cost of Go's copy, and it shows that the first memory copy is slow, while the second is much faster even when I run copy on a different memory address.
Here is my test code:
func TestCopyLoop1x32M(t *testing.T) {
    copyLoopSameDst(32*1024*1024, 1)
}

func TestCopyLoopOnex32M(t *testing.T) {
    copyLoopSameDst(32*1024*1024, 1)
}

func copyLoopSameDst(size, loops int) {
    in := make([]byte, size)
    out := make([]byte, size)
    rand.Seed(0)
    fillRandom(in) // insert random bytes into the slice
    now := time.Now()
    for i := 0; i < loops; i++ {
        copy(out, in)
    }
    cost := time.Since(now)
    fmt.Println(cost.Seconds() / float64(loops))
}
func TestCopyDiffLoop1x32M(t *testing.T) {
    copyLoopDiffDst(32*1024*1024, 1)
}

func copyLoopDiffDst(size, loops int) {
    ins := make([][]byte, loops)
    outs := make([][]byte, loops)
    for i := 0; i < loops; i++ {
        out := make([]byte, size)
        outs[i] = out
        in := make([]byte, size)
        rand.Seed(0)
        fillRandom(in)
        ins[i] = in
    }
    now := time.Now()
    for i := 0; i < loops; i++ {
        copy(outs[i], ins[i])
    }
    cost := time.Since(now)
    fmt.Println(cost.Seconds() / float64(loops))
}
The Result(on a i5-4278U):
Run all the three case:
TestCopyLoop1x32M : 0.023s
TestCopyLoopOnex32M : 0.0038s
TestCopyDiffLoop1x32M : 0.0038s
Run first&second case:
TestCopyLoop1x32M : 0.023s
TestCopyLoopOnex32M : 0.0038s
Run first&third case:
TestCopyLoop1x32M : 0.023s
TestCopyLoop1x32M : 0.023s
My questions:
1. They have different memory addresses and different data, so how can the next case benefit from the first one?
2. Why is Result 3 not the same as Result 2? Don't they do the same thing?
3. If I add loops in copyLoopSameDst, I know the next iterations will be faster because of the cache, but my CPU's L3 cache is only 3 MB, so I can't explain the huge improvement.
4. Why does copyLoopDiffDst speed up after the other two cases?
My guesses:
- the instruction cache helps performance, but that can't explain question 2
- the CPU cache works beyond my imagination, but that can't explain question 2 either
After more research and testing, I think I can answer part of my questions.
The reason the cache helps in the next test case is Go's memory allocation (other languages probably behave similarly, since getting new memory from the OS is expensive). When the data is big, the runtime reuses a block that has just been freed.
I printed the addresses of the in and out []byte slices (in Go, the first 8 bytes of a slice header are its data pointer, so I wrote a little assembly to get the address):
addr: [0 192 8 32 196 0 0 0] [0 192 8 34 196 0 0 0]
cost: 0.019228028
addr: [0 192 8 36 196 0 0 0] [0 192 8 32 196 0 0 0]
cost: 0.003770281
addr: [0 192 8 34 196 0 0 0] [0 192 8 32 196 0 0 0]
cost: 0.003806502
You can see the program reusing some memory addresses, so write hits happen in the next copy.
If I create in/out outside the function, the reuse does not happen, and it slows down again.
But if you make the block very small (for example, under 32 KB), you will see the speed-up again even though the runtime hands you a new memory address. In my opinion the main reason is that the data is not 64-byte aligned, so the next loop's data (which sits near the first one) is already partly in the cache, while the first loop wastes a lot of time filling the cache; the next loop also benefits from a warm instruction cache and other cached data. When the data is small, these little caches make a big difference.
I'm still amazed that the cache helps so much when the data is 10x my CPU's cache size. Anyway, that's another question.
I'm spending some time experimenting with Go's internals and I ended up writing my own implementation of a stack using slices.
As correctly pointed out by a reddit user in this post, and as outlined by another user in this SO answer, Go already tries to optimise slice resizing.
It turns out, though, that I see performance gains using my own implementation of slice growing rather than sticking with the default one.
This is the structure I use for holding the stack:
type Stack struct {
    slice     []interface{}
    blockSize int
}

const s_DefaultAllocBlockSize = 20
This is my own implementation of the Push method
func (s *Stack) Push(elem interface{}) {
    if len(s.slice)+1 == cap(s.slice) {
        slice := make([]interface{}, 0, len(s.slice)+s.blockSize)
        copy(slice, s.slice)
        s.slice = slice
    }
    s.slice = append(s.slice, elem)
}
This is a plain implementation
func (s *Stack) Push(elem interface{}) {
    s.slice = append(s.slice, elem)
}
Running the benchmarks I've implemented using Go's testing package my own implementation performs this way:
Benchmark_PushDefaultStack 20000000 87.7 ns/op 24 B/op 1 allocs/op
While relying on the plain append the results are the following
Benchmark_PushDefaultStack 10000000 209 ns/op 90 B/op 1 allocs/op
The machine I run tests on is an early 2011 Mac Book Pro, 2.3 GHz Intel Core i5 with 8GB of RAM 1333MHz DDR3
EDIT
The actual question is: is my implementation really faster than the default append behavior? Or am I not taking something into account?
Reading your code, tests, benchmarks, and results it's easy to see that they are flawed. A full code review is beyond the scope of StackOverflow.
One specific bug.
// Push pushes a new element to the stack
func (s *Stack) Push(elem interface{}) {
if len(s.slice)+1 == cap(s.slice) {
slice := make([]interface{}, 0, len(s.slice)+s.blockSize)
copy(slice, s.slice)
s.slice = slice
}
s.slice = append(s.slice, elem)
}
Should be
// Push pushes a new element to the stack
func (s *Stack) Push(elem interface{}) {
if len(s.slice)+1 == cap(s.slice) {
slice := make([]interface{}, len(s.slice), len(s.slice)+s.blockSize)
copy(slice, s.slice)
s.slice = slice
}
s.slice = append(s.slice, elem)
}
From the Go specification on copying slices:
The function copy copies slice elements from a source src to a
destination dst and returns the number of elements copied. The
number of elements copied is the minimum of len(src) and len(dst).
You copied 0, you should have copied len(s.slice).
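The zero-copy pitfall is easy to demonstrate in isolation:

```go
package main

import "fmt"

func main() {
	src := []int{1, 2, 3}
	dst0 := make([]int, 0, 3)    // len 0, cap 3
	fmt.Println(copy(dst0, src)) // copies min(0, 3) = 0 elements
	dst3 := make([]int, 3)       // len 3
	fmt.Println(copy(dst3, src)) // copies min(3, 3) = 3 elements
}
```

copy looks only at the lengths, never the capacities, which is exactly why the make with length 0 silently discards the old elements.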
As expected, your Push algorithm is inordinately slow:
append:
Benchmark_PushDefaultStack-4 2000000 941 ns/op 49 B/op 1 allocs/op
alediaferia:
Benchmark_PushDefaultStack-4 100000 1246315 ns/op 42355 B/op 1 allocs/op
This is how append works: append complexity.
There are other things wrong too. Your benchmark results are often not valid.
I believe your example is faster because you have a fairly small data set and are allocating with an initial capacity of 0. In your version of Push you preempt a large number of allocations by growing the backing array more aggressively early on (by 20 at a time), circumventing the (in this case) expensive reallocations that take you through all the trivially small capacities 0, 1, 2, 4, 8, 16, 32, 64, etc.
If your data sets were much larger, this would likely be dwarfed by the cost of the large copies. I've seen a lot of misuse of slices in Go. The clear performance win comes from making your slice with a reasonable default capacity.
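A minimal sketch of that last point (my own code, not the poster's): give the backing slice a sensible initial capacity up front and let append handle growth from there.

```go
package main

import "fmt"

// Stack is a slice-backed stack; the only tuning knob is the
// initial capacity passed to NewStack.
type Stack struct {
	slice []interface{}
}

// NewStack preallocates capacity so early Pushes don't trigger
// a chain of tiny reallocations.
func NewStack(capacity int) *Stack {
	return &Stack{slice: make([]interface{}, 0, capacity)}
}

func (s *Stack) Push(elem interface{}) {
	s.slice = append(s.slice, elem) // append grows the slice when needed
}

func (s *Stack) Pop() (interface{}, bool) {
	if len(s.slice) == 0 {
		return nil, false
	}
	elem := s.slice[len(s.slice)-1]
	s.slice = s.slice[:len(s.slice)-1]
	return elem, true
}

func main() {
	s := NewStack(64) // assumed workload: a few dozen elements
	for i := 0; i < 10; i++ {
		s.Push(i)
	}
	v, _ := s.Pop()
	fmt.Println(v) // → 9
}
```

The capacity of 64 is an assumption for illustration; the right value depends on the expected workload.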
I am attempting to do what I originally thought would be pretty simple. To wit:
For every file in a list of input files:
open the file with png.Decode()
scan every pixel in the file and test to see if it is "grey".
Return the percentage of "grey" pixels in the image.
This is the function I am calling:
func greyLevel(fname string) (float64, string) {
    f, err := os.Open(fname)
    if err != nil {
        return -1.0, "can't open file"
    }
    defer f.Close()

    i, err := png.Decode(f)
    if err != nil {
        return -1.0, "unable to decode"
    }

    bounds := i.Bounds()
    var lo uint32 = 122 // Low grey RGB value.
    var hi uint32 = 134 // High grey RGB value.
    var gpix float64    // Grey pixel count.
    var opix float64    // Other (non-grey) pixel count.
    var tpix float64    // Total pixels.
    for x := bounds.Min.X; x < bounds.Max.X; x++ {
        for y := bounds.Min.Y; y < bounds.Max.Y; y++ {
            r, g, b, _ := i.At(x, y).RGBA()
            if ((r/255)-1 > lo && (r/255)-1 < hi) &&
                ((g/255)-1 > lo && (g/255)-1 < hi) &&
                ((b/255)-1 > lo && (b/255)-1 < hi) {
                gpix++
            } else {
                opix++
            }
            tpix++
        }
    }
    return (gpix / tpix) * 100, ""
}

func main() {
    srcDir := flag.String("s", "", "Directory containing image files.")
    threshold := flag.Float64("t", 65.0, "Threshold (in percent) of grey pixels.")
    flag.Parse()

    dirlist, direrr := ioutil.ReadDir(*srcDir)
    if direrr != nil {
        log.Fatalf("Error reading %s: %s\n", *srcDir, direrr)
    }

    for f := range dirlist {
        src := path.Join(*srcDir, dirlist[f].Name())
        level, msg := greyLevel(src)
        if msg != "" {
            log.Printf("error processing %s: %s\n", src, msg)
            continue
        }
        if level >= *threshold {
            log.Printf("%s is grey (%2.2f%%)\n", src, level)
        } else {
            log.Printf("%s is not grey (%2.2f%%)\n", src, level)
        }
    }
}
The files are relatively small (960x720, 8-bit RGB)
I am calling ioutil.ReadDir() to generate a list of files, looping over the slice and calling greyLevel().
After about 155 files (out of a list of >4000) the script panics with:
runtime: memory allocated by OS not in usable range
runtime: out of memory: cannot allocate 2818048-byte block (534708224 in use)
throw: out of memory
I figure there is something simple I am missing. I thought that Go would de-allocate the memory allocated in greyLevel(), but I guess not?
Follow up:
After inserting runtime.GC() after every call to greyLevel, the memory usage evens out. Last night I was testing with about 800 images and then stopped. Today I let it run over the entire input set, approximately 6800 images.
After 1500 images, top looks like this:
top - 10:30:11 up 41 days, 11:47, 2 users, load average: 1.46, 1.25, 0.88
Tasks: 135 total, 2 running, 131 sleeping, 1 stopped, 1 zombie
Cpu(s): 49.8%us, 5.1%sy, 0.2%ni, 29.6%id, 15.0%wa, 0.0%hi, 0.3%si, 0.0%st
Mem: 3090304k total, 2921108k used, 169196k free, 2840k buffers
Swap: 3135484k total, 31500k used, 3103984k free, 640676k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
28474 mtw 20 0 2311m 1.8g 412 R 99 60.5 16:48.52 8.out
It remained steady after processing another 5000 images.
It appears that you are using a 32-bit machine. It is likely that the program runs out of memory because Go's garbage collector is conservative. A conservative garbage collector may fail to detect that some region of memory is no longer in use. There is currently no workaround for this in Go programs other than avoiding data structures that the garbage collector cannot handle (such as: struct {...; binaryData [256]byte})
Try calling runtime.GC() in each iteration of the loop in which you call greyLevel. Maybe it will help the program process more images.
If calling runtime.GC() fails to improve the situation, you may want to change your strategy so that the program processes a smaller number of PNG files per run.
Seems like issue 3173 which was recently fixed. Could you please retry with latest weekly? (Assuming you now use some pre 2012-03-07 version).
I'm trying to encode a large number into a list of bytes (uint8 in Go).
The number of bytes is unknown, so I'd like to use a vector.
But Go doesn't provide a vector of bytes; what can I do?
And is it possible to get a slice of such a byte vector?
I intend to implement data compression. Instead of storing small and large numbers with the same number of bytes, I'm implementing a variable-byte encoding that uses fewer bytes for small numbers and more bytes for large numbers.
My code does not compile (invalid type assertion):
package main

import (
    //"fmt"
    "container/vector"
)

func vbEncodeNumber(n uint) []byte {
    bytes := new(vector.Vector)
    for {
        bytes.Push(n % 128)
        if n < 128 {
            break
        }
        n /= 128
    }
    bytes.Set(bytes.Len()-1, bytes.Last().(byte)+byte(128))
    return bytes.Data().([]byte) // <- invalid type assertion
}

func main() { vbEncodeNumber(10000) }
I want to write a lot of such encoded numbers into a binary file, so I'd like the function to return a byte array.
I haven't found a code example for vector.
Since you're trying to represent large numbers, you might see if the big package serves your purposes.
The general Vector struct can be used to store bytes. It accepts an empty interface as its type, and any other type satisfies that interface. You can retrieve a slice of interfaces through the Data method, but there's no way to convert that to a slice of bytes without copying it. You can't use type assertion to turn a slice of interface{} into a slice of something else. You'd have to do something like the following at the end of your function: (I haven't tried compiling this code because I can't right now)
byteSlice := make([]byte, bytes.Len())
for i := range byteSlice {
    byteSlice[i] = bytes.At(i).(byte)
}
return byteSlice
Take a look at the bytes package and the Buffer type there. You can write your ints as bytes into the buffer and then you can use the Bytes() method to access byte slices of the buffer.
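A sketch of that idea using the question's own base-128 scheme (my code, assuming the same "high bit marks the last byte" convention as the original vbEncodeNumber):

```go
package main

import (
	"bytes"
	"fmt"
)

// vbEncodeNumber writes n in base 128, least significant group first,
// and sets the high bit on the final byte to mark the end of the number.
func vbEncodeNumber(n uint) []byte {
	var buf bytes.Buffer
	for n >= 128 {
		buf.WriteByte(byte(n % 128))
		n /= 128
	}
	buf.WriteByte(byte(n) | 128) // last byte: set the terminator bit
	return buf.Bytes()
}

func main() {
	fmt.Println(vbEncodeNumber(10000)) // [16 206]: 10000 = 78*128 + 16, 78+128 = 206
}
```

Because bytes.Buffer also implements io.Writer, the same buffer can be handed straight to a file or any other writer when building the binary output.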
I've found the vectors to be a lot less useful since the generic append and copy were added to the language. Here's how I'd do it in one shot with less copying:
package main
import "fmt"
func vbEncodeNumber(n uint) []byte {
    bytes := make([]byte, 0, 4)
    for n > 0 {
        bytes = append(bytes, byte(n%256))
        n >>= 8
    }
    return bytes
}

func main() {
    bytes := vbEncodeNumber(10000)
    for i := len(bytes) - 1; i >= 0; i-- {
        fmt.Printf("%02x ", bytes[i])
    }
    fmt.Println("")
}