How to release memory allocated by a slice? [duplicate] - go

This question already has answers here:
How do you clear a slice in Go?
(3 answers)
Cannot free memory once occupied by bytes.Buffer
(2 answers)
Closed 5 years ago.
package main

import (
    "fmt"
    "time"
)

func main() {
    storage := []string{}
    for i := 0; i < 50000000; i++ {
        storage = append(storage, "string string string string string string string string string string string string")
    }
    fmt.Println("done allocating, emptying")
    storage = storage[:0]
    storage = nil
    for {
        time.Sleep(1 * time.Second)
    }
}
The code above allocates about 30 MB of memory and then never releases it. Why is that? How can I force Go to release the memory used by this slice? I resliced the slice and then set it to nil.
The program I'm debugging is a simple HTTP input buffer: it appends all requests into large chunks and sends these chunks over a channel to a goroutine for processing. But the problem is illustrated above: I can't get storage to release its memory, and eventually I run out of memory.
Edit: as some people pointed out, this looks similar to another question, but first, that answer doesn't work, and second, it isn't what I'm asking: the slice gets emptied, but the memory is not released.

There are several things going on here.
The first thing to absorb is that Go is a garbage-collected language. The actual algorithm of its GC is mostly irrelevant here, but one aspect of it is crucial to understand: it does not use reference counting, so there is no way to make the GC immediately reclaim the memory of any given value whose storage is allocated on the heap.
To put it more simply, it's futile to do

s := make([]string, 10*100*100)
s = nil

as the second statement will indeed remove the sole reference to the slice's underlying memory, but it won't make the GC go and immediately "mark" that memory as available for reuse.
This means two things:
You should know how the GC works.
This explains how it works
since v1.5 and up until now (v1.10 these days).
You should structure those of your algorithms which are
memory-intensive in a way that reduces memory pressure.
The latter can be done in several ways:
Preallocate, when you have a sensible idea of how much you'll need.
In your example, you start with a slice of length 0 and then append to it a lot. Now, almost all library code that deals with growing memory buffers (the Go runtime included) handles these allocations by 1) allocating roughly twice the memory requested, hoping to prevent several future allocations, and 2) copying the "old" contents over when it has to reallocate. This is important: when reallocation happens, two memory regions exist at once, the old one and the new one.
If you can estimate that you may need to hold N elements on
average, preallocate for them using make([]T, 0, N)—
more info here
and here.
If you end up holding fewer than N elements, the tail of that buffer is simply unused, and if you need to hold more than N, you'll have to reallocate; but on average, you won't need any reallocations.
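The effect of preallocating can be seen by counting how often append has to move the backing array; a minimal sketch (the element count here is illustrative):

```go
package main

import "fmt"

func main() {
	const n = 100000

	// Growing from zero: count how many times append replaces the backing array.
	grown := []int{}
	reallocs := 0
	for i := 0; i < n; i++ {
		oldCap := cap(grown)
		grown = append(grown, i)
		if cap(grown) != oldCap {
			reallocs++ // append had to allocate a new, larger backing array
		}
	}

	// Preallocated: capacity is reserved up front, so append never reallocates.
	pre := make([]int, 0, n)
	for i := 0; i < n; i++ {
		pre = append(pre, i)
	}

	fmt.Printf("grown from zero: %d reallocations; preallocated: cap stayed at %d\n",
		reallocs, cap(pre))
}
```

The exact number of reallocations depends on the runtime's growth policy, but the preallocated slice never moves.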
Re-use your slice(s). Say, in your case, you could "reset" the slice
by reslicing it to the zero length and then use it again for the next
request. This is called "pooling", and in the case of mass-parallel access
to such a pool, you could use sync.Pool to hold your buffers.
Limit the load on your system so that the GC is able to cope with
the sustained load. A good overview of two approaches to such
limiting is this.
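The pooling approach mentioned above can be sketched with sync.Pool; the buffer capacity and function names here are illustrative, not from the original post:

```go
package main

import (
	"fmt"
	"sync"
)

// pool hands out reusable string slices; New runs only when the pool is empty.
var pool = sync.Pool{
	New: func() interface{} { return make([]string, 0, 1024) },
}

// process borrows a buffer, reslices it to zero length (keeping the backing
// array), fills it, and returns it to the pool for the next request.
func process(items []string) int {
	buf := pool.Get().([]string)[:0]
	buf = append(buf, items...)
	n := len(buf)
	pool.Put(buf)
	return n
}

func main() {
	for i := 0; i < 3; i++ {
		fmt.Println(process([]string{"a", "b", "c"}))
	}
}
```

Because the backing array survives between requests, steady-state traffic causes almost no new allocations.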

In the program you wrote, it makes no sense to release the memory because no part of the code requests it any more.
To make a valid test case, you have to request new memory and release it inside the loop. Then you will observe that memory consumption stabilizes at some point.
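A sketch of such a test case, watching the live heap via runtime.ReadMemStats (the 32 MB request size is arbitrary):

```go
package main

import (
	"fmt"
	"runtime"
)

// allocateAndDrop requests a large slice, drops the only reference,
// forces a collection, and reports the live heap afterwards.
func allocateAndDrop() uint64 {
	s := make([]byte, 32<<20) // request ~32 MB
	s[0] = 1                  // touch it so it is really used
	s = nil                   // drop the only reference
	runtime.GC()
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.HeapAlloc
}

func main() {
	for i := 0; i < 5; i++ {
		fmt.Printf("pass %d: HeapAlloc = %d MB\n", i, allocateAndDrop()>>20)
	}
}
```

The live heap stays small across passes: each 32 MB chunk is reclaimed once nothing references it.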

How does Golang GC reclaim complete allocations when user references don't point to the start of the allocation?

How does the Go garbage collector know the beginning address of a slice's backing store, if the slice points to the middle of the original store?
Consider the following:
array := make([]int, 1000)
slice := array[999:1000]
array = nil
We are famously warned that the GC will hang on to the entire array although the ptr part of slice points towards the end of the original array.
But where is the original address of the array kept so the GC can clear it entirely?
When the GC begins its mark phase and reaches slice, how does it mark the beginning of the array as "in use"?
The same question could be asked about other cases where the GC does not collect an object even though the user only holds a reference to part of it, such as a reference to a field of a struct or to part of a string.
Edit:
After some discussion, I suspect that in all of these cases, when the GC marks a user reference, it looks up the allocation block containing the referenced address in its internal allocation data structures and marks that entire block as in use.
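The retention described above is easy to observe; here is a hedged sketch (the sink variable, sizes, and thresholds are illustrative):

```go
package main

import (
	"fmt"
	"runtime"
)

// sink is package-level so the compiler must treat it as live.
var sink []int64

// heapBytes forces a collection and returns the live heap size.
func heapBytes() uint64 {
	runtime.GC()
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.HeapAlloc
}

// measureRetention shows that a one-element tail slice keeps the whole
// backing array alive until the slice itself is dropped.
func measureRetention() (withTail, released uint64) {
	array := make([]int64, 1<<21) // ~16 MB backing store
	sink = array[len(array)-1:]   // slice pointing at the very last element
	array = nil

	withTail = heapBytes() // the full ~16 MB is still retained
	sink = nil
	released = heapBytes() // now the backing store can be collected
	return withTail, released
}

func main() {
	withTail, released := measureRetention()
	fmt.Printf("retained: %d MB, after release: %d MB\n", withTail>>20, released>>20)
}
```

Dropping array alone frees nothing; only once the tail slice is gone does the whole block become collectable, which matches the block-marking behavior suspected in the edit.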

Faster memory allocation and freeing algorithm than multiple Free List method

We allocate and free many memory blocks. We use a memory heap, but heap access is costly.
For faster allocation and freeing, we adopted a global Free List. As the program is multithreaded, the Free List is protected by a Critical Section. However, the Critical Section becomes a bottleneck for parallelism.
To remove the Critical Section, we assigned a Free List to each thread, i.e. Thread Local Storage. However, thread T1 always allocates memory blocks and thread T2 always frees them, so the Free List in thread T2 grows without bound, and the Free List gives no benefit.
Despite the Critical Section bottleneck, we adopted Critical Sections again, with a different method. We prepare N Free Lists and N Critical Sections, one Critical Section per Free List, i.e. Free Lists 0..N-1 and Critical Sections 0..N-1. We keep an atomically-operated integer that cycles through 0, 1, 2, ..., N-1, then 0, 1, 2, ... again. For each allocation and free, we read the integer value X, advance it, enter the X-th Critical Section, and access the X-th Free List. However, this is quite a bit slower than the previous method (Thread Local Storage): atomic operations get slower as the number of threads grows.
Since mutating the integer value non-atomically causes no corruption, we did the mutation non-atomically. However, because the value is sometimes stale, there are many chances of different threads hitting the same Critical Section and Free List. This causes the bottleneck again, though far less often than in the previous method.
Instead of the integer value, we then used the thread ID hashed into the range 0..N-1, and performance improved.
I guess there must be a much better way of doing this, but I cannot find an exact one. Are there any ideas for improving what we have made?
Dealing with heap memory is a task for the OS, and nothing guarantees you can do a better or faster job than the OS does.
But there are some conditions where you can get a bit of improvement, especially when you know something about your memory usage that is unknown to the OS.
I'm writing here my untested idea; I hope you'll get some profit from it.
Let's say you have T threads, all of them reserving and freeing memory. The main goal is speed, so I'll try to use neither TLS, nor critical blocking, nor atomic ops.
If (repeat: if, if, if) the app can fit several discrete sizes of memory blocks (not random sizes, so as to avoid fragmentation and useless holes), then start by asking the OS for a number of these discrete blocks.
For example, you have an array of n1 blocks, each of size size1, an array of n2 blocks each of size size2, an array of n3... and so on. Each array is two-dimensional; the second field just stores a used/free flag for each block. If your arrays are very large, it's better to use a dedicated array for the flags (because contiguous memory access is always faster).
Now, someone asks for a block of memory of size sB. A specialized function (or object, or whatever) searches the array of blocks of size greater than or equal to sB, and then selects a block by looking at the used/free flags. Just before finishing this task, the proper block-flag is set to "used".
When two or more threads ask for blocks of the same size, the flag may be corrupted. Using TLS would solve this issue, and so would critical blocking. I think you can set a bool flag at the beginning of the search into the flags-array, which makes the other threads wait until the flag changes, which only happens after the block-flag changes. With pseudo code:
MemoryGetter(sB)
{
    // select which array depending on 'sB'
    for (i = 0; i < numOfArrays; i++)
        if (sizeOfArr(i) >= sB)
            arrMatch = i
            break // exit for

    // wait if another thread wants a block from the same arrMatch array
    while (searching(arrMatch) == true)
        ; // wait

    // block other threads wanting a block from the same arrMatch array
    searching(arrMatch) = true

    // get the first free block
    for (i = 0; i < numOfBlocks; i++)
        if (arrOfUsed(arrMatch, i) != true)
            selectedBlock = addressOf(....)
            // mark the block as used
            arrOfUsed(arrMatch, i) = true
            break // exit for

    // allow other threads
    searching(arrMatch) = false
    return selectedBlock // NOTE: selectedBlock == NULL means no free block
}
Freeing a block is easier, just mark it as free, no thread concurrency issue.
Dealing with no free blocks is up to you (wait, use a bigger block, ask OS for more, etc).
Note that the whole memory is reserved from the OS at app start, which can be a problem.
If this idea makes your app faster, let me know. What I can say for sure is that the memory used is greater than with normal OS requests, but not by much if you choose "good" sizes, the ones most used.
Some improvements can be made:
Cache the last freed block (per size) so as to avoid the search.
Start with fewer blocks, and ask the OS for more memory only
when needed. Play with the number of blocks for each size depending on
your app. Find the optimal case.
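For completeness, the striped free-list idea from the question could look roughly like this in Go. Go does not expose thread IDs, so the caller passes an ID that is reduced to a shard index; the shard count, block size, and all names are illustrative:

```go
package main

import (
	"fmt"
	"sync"
)

const shards = 8

// shardedPool keeps one free list per shard, each behind its own mutex,
// so callers that map to different shards never contend on a lock.
type shardedPool struct {
	mu   [shards]sync.Mutex
	free [shards][][]byte
	size int // block size handed out by this pool
}

// get pops a block from the caller's shard, or allocates when the list is empty.
func (p *shardedPool) get(id int) []byte {
	s := id % shards
	p.mu[s].Lock()
	defer p.mu[s].Unlock()
	if n := len(p.free[s]); n > 0 {
		b := p.free[s][n-1]
		p.free[s] = p.free[s][:n-1]
		return b
	}
	return make([]byte, p.size) // fall back to the allocator
}

// put returns a block to the caller's shard for reuse.
func (p *shardedPool) put(id int, b []byte) {
	s := id % shards
	p.mu[s].Lock()
	p.free[s] = append(p.free[s], b)
	p.mu[s].Unlock()
}

func main() {
	pool := &shardedPool{size: 4096}
	var wg sync.WaitGroup
	for id := 0; id < 16; id++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for i := 0; i < 100; i++ {
				b := pool.get(id)
				b[0] = byte(id)
				pool.put(id, b)
			}
		}(id)
	}
	wg.Wait()
	fmt.Println("done")
}
```

This is essentially the question's scheme with the ID-hashing refinement: contention only occurs when two callers hash to the same shard.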

Write operation cost

I have a Go program which writes strings into a file. I have a loop that iterates 20000 times, and in each iteration I write around 20-30 strings into the file. I just wanted to know the best way to write them into the file.
Approach 1: Keep the file pointer open from the start of the program and
write every string as it comes. That makes it 20000*30 write operations.
Approach 2: Use bytes.Buffer, store everything in the buffer, and
write it all at the end. Also, in this case, should the file be
opened at the beginning of the code or at the end? Does it matter?
I am assuming approach 2 should work better. Can someone confirm this with a reason? How can writing all at once be better than writing periodically, given that the file pointer will be open anyway?
I am using f.WriteString(<string>) and buffer.WriteString(<some string>), where buffer is of type bytes.Buffer and f is the open file pointer.
The bufio package was created exactly for this kind of task. Instead of making a syscall for each Write call, bufio.Writer buffers up to a fixed number of bytes in internal memory before making a syscall. After a syscall, the internal buffer is reused for the next portion of data.
Compared to your second approach, bufio.Writer:
makes more syscalls (N/S instead of 1)
uses less memory (S bytes instead of N bytes)
where S is the buffer size (which can be specified via bufio.NewWriterSize) and N is the total size of data that needs to be written.
Example usage (https://play.golang.org/p/AvBE1d6wpT):
f, err := os.Create("file.txt")
if err != nil {
    log.Fatal(err)
}
defer f.Close()

w := bufio.NewWriter(f)
fmt.Fprint(w, "Hello, ")
fmt.Fprint(w, "world!")
err = w.Flush() // Don't forget to flush!
if err != nil {
    log.Fatal(err)
}
The operations that take time when writing to files are the syscalls and the disk I/O. The fact that the file pointer is open doesn't cost you anything, so naively we could say that the second method is best.
Now, as you may know, your OS doesn't write directly into files; it uses an internal in-memory cache for written data and does the real I/O later. I don't know the exact details of that, and generally speaking I don't need to.
What I would advise is a middle-ground solution: fill a buffer in every loop iteration and write it out, N times in total. That cuts out a large part of the syscalls and (potentially) the disk writes, without consuming too much memory for the buffer (depending on the size of your strings, that may be a point to take into account).
I would suggest benchmarking for the best solution, but due to the caching done by the system, benchmarking disk I/O is a real nightmare.
Syscalls are not cheap, so the second approach is better.
You can use lat_syscall tool from lmbench to measure how long it takes to call single write:
$ ./lat_syscall write
Simple write: 0.1522 microseconds
So, on my system it will take approximately 20000 * 0.15μs = 3ms extra time just to call write for every string.

Go memory consumption with many goroutines

I was trying to check how Go performs with 100,000 goroutines. I wrote a simple program that spawns that many goroutines, which do nothing but print some announcements. I restricted the max stack size to just 512 bytes, but I noticed the program's memory footprint didn't decrease with that: it consumed around 460 MB of memory, which is around 4 KB per goroutine. My question is: can we set the max stack size lower than the "minimum" stack size (which may be 4 KB) for goroutines? How can we set the minimum stack size that a goroutine starts with?
Below is sample code I used for the test:
package main

import (
    "fmt"
    "runtime/debug"
    "time"
)

func main() {
    fmt.Printf("%v\n", debug.SetMaxStack(512))
    for i := 0; i < 100000; i++ {
        go func(x int) {
            for {
                time.Sleep(10 * time.Millisecond)
                //fmt.Printf("I am %v\n", x)
            }
        }(i)
    }
    fmt.Println("Done")
    time.Sleep(999999999999)
}
There's currently no way to set the minimum stack size for goroutines.
Go 1.2 increased the minimum size from 4 KB to 8 KB.
The docs say:
"In Go 1.2, the minimum size of the stack when a goroutine is created has been lifted from 4KB to 8KB. Many programs were suffering performance problems with the old size, which had a tendency to introduce expensive stack-segment switching in performance-critical sections. The new number was determined by empirical testing."
But they go on to say:
"Updating: The increased minimum stack size may cause programs with many goroutines to use more memory. There is no workaround, but plans for future releases include new stack management technology that should address the problem better."
So you may have more luck in the future.
See http://golang.org/doc/go1.2#stack_size for more info.
The runtime/debug.SetMaxStack function only determines at what point Go considers a program infinitely recursive and terminates it: http://golang.org/pkg/runtime/debug/#SetMaxStack
Setting it absurdly low does nothing to the minimum size of stacks; it only limits the maximum size, by virtue of your program crashing when any stack's in-use size exceeds the limit.
Technically, the crash only happens when the stack must be grown, so your program will die when a stack needs more than 8 KB (or 4 KB prior to Go 1.2).
The reason your program uses a minimum of 4 KB * nGoroutines is that stacks are page-aligned, so there can never be more than one stack on a VM page. Therefore your program will use at least nGoroutines' worth of pages, and OSes usually only measure and allocate memory in page-sized increments.
The only way to change the starting (minimum) size of a stack is to modify and recompile the go runtime (and possibly the compiler too).
Go 1.3 will include contiguous stacks, which are generally faster than the split stacks in Go 1.2 and earlier, and which may also lead to smaller initial stacks in the future.
Just a note: in Go 1.4, the minimum size of the goroutine stack was decreased from 8 KB to 2 KB.
And as of Go 1.13 it's still the same; see https://github.com/golang/go/blob/bbd25d26c0a86660fb3968137f16e74837b7a9c6/src/runtime/stack.go#L72:
// The minimum size of stack used by Go code
_StackMin = 2048
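A rough way to see the per-goroutine stack cost for yourself (the count is arbitrary, and the printed number varies by Go version and platform):

```go
package main

import (
	"fmt"
	"runtime"
)

// perGoroutineBytes spawns n parked goroutines and estimates the stack
// memory each one costs by watching runtime.MemStats.StackInuse.
func perGoroutineBytes(n int) uint64 {
	runtime.GC()
	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)

	block := make(chan struct{})
	for i := 0; i < n; i++ {
		go func() { <-block }() // parked goroutine holding only its stack
	}
	runtime.ReadMemStats(&after)
	close(block) // release all the goroutines again

	return (after.StackInuse - before.StackInuse) / uint64(n)
}

func main() {
	fmt.Printf("approx %d bytes of stack per goroutine\n", perGoroutineBytes(100000))
}
```

On a recent Go this lands in the low kilobytes per goroutine, consistent with the 2 KB minimum quoted above plus runtime overhead.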

are array initialization operations cached as well

If you are not reading a value but assigning one, for example:

int[] array = new int[5];
for (int i = 0; i < array.length; i++) {
    array[i] = 2;
}

does the array still come into the cache? Couldn't the CPU bring the array elements one by one into its registers, do the assignment, and then write the updated value to main memory, bypassing the cache because it is not necessary in this case?
The answer depends on the cache protocol; I answered assuming Write-Back, Write-Allocate.
The array will still come into the cache, and it makes a difference. When a cache block is retrieved from memory, it holds more than just a single memory location (the actual size depends on the design of the cache). Since arrays are stored contiguously in memory, pulling in array[0] will pull in the rest of the block, which will include (at least some of) array[1], array[2], array[3] and array[4]. This means the following accesses will not have to go to main memory.
Also, after all this is done, the values will NOT be written to memory immediately (under write-back); instead, the CPU will keep using the cache for reads and writes until that cache block is evicted, at which point the values are written to main memory.
Overall this is preferable to going to memory every time, because the cache is much faster and the chances are the program will use the memory it just wrote relatively soon.
If the protocol is Write-Through, No-Allocate, then the block won't be brought into the cache, and the write will go straight through to main memory.
