Julia supports in-place factorization of matrices (for some factorizations).
I wonder if one could also eliminate any allocation of memory inside the function.
For instance, is there a way to apply a Cholesky factorization on a matrix with no hidden memory allocation?
Non-allocating LAPACK functions have bindings in Julia. They are documented in the Julia documentation under Linear Algebra / LAPACK Functions.
The in-place Cholesky factorization cholesky!(A) overwrites A and allocates only a small, fixed amount of memory, whereas cholesky(A) allocates a full working copy of the matrix, so its allocations (in bytes) grow quadratically with the size of A.
let n = 1000; M = rand(n,n); B = transpose(M)*M
cholesky(B)
@time cholesky(B)
# 0.023478 seconds (5 allocations: 7.630 MiB)
end
vs
let n = 1000; M = rand(n,n); B = transpose(M)*M
cholesky!(copy(B))
@time cholesky!(B)
# 0.021360 seconds (3 allocations: 80 bytes)
end
Performance differences between the two are small, as Oscar Smith pointed out.
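If even that small fixed overhead matters, the underlying LAPACK binding can be called directly. A minimal sketch using LinearAlgebra.LAPACK.potrf!, the raw routine behind cholesky!; it factorizes in place, leaving the factor in the chosen triangle, and mul! is used here to rebuild B without allocating:
using LinearAlgebra

let n = 1000
    M = rand(n, n)
    B = Matrix{Float64}(undef, n, n)
    mul!(B, transpose(M), M)      # B = M'M, computed in place
    LAPACK.potrf!('U', B)         # warm-up (JIT compilation)
    mul!(B, transpose(M), M)      # rebuild B in place
    @time LAPACK.potrf!('U', B)   # Cholesky factor now in B's upper triangle
end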
https://github.com/google/codesearch/blob/master/index/write.go#L498
The following check appears on the page above. Can len() be greater than cap()? I think == rather than >= should be used here. Thanks.
if len(b.buf) >= cap(b.buf) {
Spec: Length and capacity:
The capacity of a slice is the number of elements for which there is space allocated in the underlying array. At any time the following relationship holds:
0 <= len(s) <= cap(s)
So no, length of a slice cannot be greater than its capacity.
In the referenced code, len(b.buf) == cap(b.buf) would be enough. It may be that the length was once calculated some other way (e.g. including the length of data about to be appended), which would make >= sensible; the code may then have been changed or rewritten while the relational operator was left as-is.
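A quick way to see the invariant in action; a small self-contained sketch (the commented-out line would panic at run time):
package main

import "fmt"

func main() {
	s := make([]byte, 3, 8)
	fmt.Println(len(s), cap(s)) // 3 8

	s = s[:cap(s)] // reslicing up to cap(s) is allowed
	fmt.Println(len(s), cap(s)) // 8 8

	// s = s[:cap(s)+1] // panics: slice bounds out of range
}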
I already know that runtime.morestack can cause a goroutine context switch (if the sysmon goroutine has marked the goroutine as needing to be preempted).
While experimenting around this, I found an interesting fact.
Compare the following two programs.
package main

func main() {
	_ = make([]int, 13)
}
package main

func main() {
	_ = make([]int, 14)
}
Compile each of them with the following command (tried with go1.9 and go1.11):
$ go build -gcflags "-S -l -N" x.go
You will find a major difference: the second output contains CALL runtime.morestack_noctxt(SB), while the first doesn't.
I guess it is an optimization, but why?
Finally, I got the answer.
A slice smaller than 65,536 bytes that does not escape the function is allocated on the stack, not the heap.
stackguard0 sits at least 128 bytes above the lowest address of the stack (even after the stack is shrunk).
make([]int, 13) allocates 128 bytes in total:
sizeof(struct slice) + 13 * 8 = 24 + 104 = 128.
So the answer is clear: this is an optimization for amd64. For a leaf function whose frame fits within that 128-byte margin, the compiler does not generate the stack-overflow check, because there is guaranteed to be enough space. make([]int, 14) needs 24 + 112 = 136 bytes, which no longer fits, so the check is emitted.
Here is the explanation from go/src/runtime/stack.go:
The per-goroutine g->stackguard is set to point StackGuard bytes
above the bottom of the stack. Each function compares its stack
pointer against g->stackguard to check for overflow. To cut one
instruction from the check sequence for functions with tiny frames,
the stack is allowed to protrude StackSmall bytes below the stack
guard. Functions with large frames don't bother with the check and
always call morestack. The sequences are (for amd64, others are
similar):
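The comment goes on to list the exact instruction sequences. As an illustration only (the real check is emitted as assembly in the function prologue, and sp/stackguard0 are per-goroutine values maintained by the runtime), the two checked cases amount to something like this Go-flavored pseudocode:
package main

import "fmt"

const stackSmall = 128 // mirrors runtime's StackSmall on amd64

// prologueCheck sketches the comparison a compiled prologue performs;
// it reports whether morestack would be called for the given frame.
func prologueCheck(sp, stackguard0, frameSize uintptr) bool {
	if frameSize <= stackSmall {
		// tiny frame: a single comparison; the frame is allowed
		// to protrude up to stackSmall bytes below the guard
		return sp <= stackguard0
	}
	// larger frame: compare the stack pointer adjusted by the frame size
	return sp-frameSize+stackSmall <= stackguard0
}

func main() {
	// a 120-byte frame with 1 KiB of headroom above the guard: no growth
	fmt.Println(prologueCheck(0x3000, 0x3000-1024, 120)) // false
}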
In Go, we can use the builtin make() function to create a slice with a given initial length and capacity.
Consider the following code, where the slice's length is set to 1 and its capacity to 3:
package main

import "fmt"

func main() {
	var slice = make([]int, 1, 3)
	slice[0] = 1
	slice = append(slice, 6, 0, 2, 4, 3, 1)
	fmt.Println(slice)
}
I was surprised to see that this program prints:
[1 6 0 2 4 3 1]
This got me wondering- what is the point of initially defining a slice's capacity if append() can simply blow past it? Are there performance gains for setting a sufficiently large capacity?
A slice is really just a fancy way to manage an underlying array. It automatically tracks size, and re-allocates new space as needed.
As you append to a slice, the runtime grows its capacity (roughly doubling it, for small slices) whenever the current capacity is exceeded. It has to copy all of the existing elements into the new backing array to do that. If you know how big the slice will be before you start, you can avoid those copy operations and memory allocations by grabbing it all up front.
When you make a slice providing capacity, you set the initial capacity, not any kind of limit.
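You can watch these reallocations happen as a slice grows; a small sketch (the exact growth factors vary across Go versions):
package main

import "fmt"

func main() {
	var s []int
	oldCap := 0
	for i := 0; i < 1000; i++ {
		s = append(s, i)
		if cap(s) != oldCap {
			// a new, larger backing array was just allocated and
			// the existing elements were copied into it
			fmt.Println("len:", len(s), "cap:", cap(s))
			oldCap = cap(s)
		}
	}
}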
See this blog post on slices for some interesting internal details of slices.
A slice is a wonderful abstraction over a simple array. You get all sorts of nice features, but deep down at its core lies an array. (I explain the following in reverse order for a reason.) Therefore, if you specify a capacity of 3, an array of length 3 is allocated in memory, and you can append up to 3 elements without a reallocation. The capacity argument is optional in make, but note that a slice always has a capacity whether or not you choose to specify one.
If you specify a length (which always exists as well), the slice will be indexable up to that length. The rest of the capacity is hidden away behind the scenes, so that append does not have to allocate an entirely new array until the spare capacity is used up.
Here is an example to better explain the mechanics.
s := make([]int, 1, 3)
The underlying array will be allocated with 3 of the zero value of int (which is 0):
[0,0,0]
However, the length is set to 1, so the slice itself will only print [0], and trying to index the second or third value panics, because the slice's mechanics do not allow it. If you run s = append(s, 1), you will find that the slice really was created with zero values up to its length, and you will end up with [0,1]. At this point, you can append once more before the entire underlying array is filled; one further append forces the runtime to allocate a new array with doubled capacity and copy all the values over. This is a rather expensive operation.
Therefore the short answer to your question is that preallocating the capacity can vastly improve the efficiency of your code. Especially so if the slice is going to end up very large, or contains complex structs (or both), because the zero value of a struct is effectively the zero values of each of its fields. This is not because it avoids allocating those values (it has to do that anyway), but because append would otherwise have to allocate a fresh array full of these zero values and copy everything each time the underlying array needs to grow.
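To make those mechanics concrete, here is a small runnable sketch (the final capacity of 6 assumes the usual doubling for small slices and may differ between Go versions):
package main

import "fmt"

func main() {
	s := make([]int, 1, 3)
	fmt.Println(s, len(s), cap(s)) // [0] 1 3

	s = append(s, 1) // fits in the existing backing array
	fmt.Println(s, len(s), cap(s)) // [0 1] 2 3

	s = append(s, 2, 3) // exceeds cap: new array allocated, values copied
	fmt.Println(s, len(s), cap(s)) // [0 1 2 3] 4 6

	// _ = s[4] // would panic: index out of range (beyond len)
}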
Short playground example: https://play.golang.org/p/LGAYVlw-jr
As others have already said, using the cap parameter can avoid unnecessary allocations. To give a sense of the performance difference, imagine you have a []float64 of random values and want a new slice that filters out values that are not above, say, 0.5.
Naive approach - no len or cap param
func filter(input []float64) []float64 {
ret := make([]float64, 0)
for _, el := range input {
if el > .5 {
ret = append(ret, el)
}
}
return ret
}
Better approach - using cap param
func filterCap(input []float64) []float64 {
ret := make([]float64, 0, len(input))
for _, el := range input {
if el > .5 {
ret = append(ret, el)
}
}
return ret
}
Benchmarks (n=10)
filter 131 ns/op 56 B/op 3 allocs/op
filterCap 56 ns/op 80 B/op 1 allocs/op
Using cap made the program 2x+ faster and reduced the number of allocations from 3 to 1. Now what happens at scale?
Benchmarks (n=1,000,000)
filter 9630341 ns/op 23004421 B/op 37 allocs/op
filterCap 6906778 ns/op 8003584 B/op 1 allocs/op
The speed difference is still significant (~1.4x) thanks to 36 fewer calls to runtime.growslice, the routine that reallocates the backing array when append outgrows it. However, the bigger difference is the memory allocation (~4x less).
Even better - calibrating the cap
You may have noticed in the first benchmark that cap makes the overall memory allocation worse (80B vs 56B). This is because you allocate 10 slots but only need, on average, 5 of them. This is why you don't want to set cap unnecessarily high. Given what you know about your program, you may be able to calibrate the capacity. In this case, we can estimate that our filtered slice will need 50% as many slots as the original slice.
func filterCalibratedCap(input []float64) []float64 {
ret := make([]float64, 0, len(input)/2)
for _, el := range input {
if el > .5 {
ret = append(ret, el)
}
}
return ret
}
Unsurprisingly, this calibrated cap allocates 50% as much memory as its predecessor, so that's ~8x improvement on the naive implementation at 1m elements.
Another option - using direct access instead of append
If you are looking to shave even more time off a program like this, initialize with the len parameter (and ignore the cap parameter), access the new slice directly instead of using append, then throw away all the slots you don't need.
func filterLen(input []float64) []float64 {
ret := make([]float64, len(input))
var counter int
for _, el := range input {
if el > .5 {
ret[counter] = el
counter++
}
}
return ret[:counter]
}
This is ~10% faster than filterCap at scale. However, in addition to being more complicated, this pattern does not provide the same safety as cap if you try and calibrate the memory requirement.
With cap calibration, if you underestimate the total capacity required, then the program will automatically allocate more when it needs it.
With this approach, if you underestimate the total len required, the program will fail. In this example, if you initialize as ret := make([]float64, len(input)/2), and it turns out that len(output) > len(input)/2, then at some point the program will try to access a non-existent slot and panic.
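For reference, numbers like those above can be reproduced with the standard testing package. A sketch of the harness, assuming the filter and filterCap functions from above live in the same package (absolute results will differ by machine):
package filter

import (
	"math/rand"
	"testing"
)

// benchFilter runs f over a random input of length n,
// reporting time and allocations per operation.
func benchFilter(b *testing.B, f func([]float64) []float64, n int) {
	input := make([]float64, n)
	for i := range input {
		input[i] = rand.Float64()
	}
	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		f(input)
	}
}

func BenchmarkFilter(b *testing.B)    { benchFilter(b, filter, 10) }
func BenchmarkFilterCap(b *testing.B) { benchFilter(b, filterCap, 10) }
Run it with go test -bench=. -benchmem to get the ns/op, B/op, and allocs/op columns shown above.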
Each time you add an item to a slice that has len(mySlice) == cap(mySlice), the underlying data structure is replaced with a larger structure.
fmt.Printf("Original Capacity: %v", cap(mySlice)) // Output: 8
mySlice = append(mySlice, myNewItem)
fmt.Printf("New Capacity: %v", cap(mySlice)) // Output: 16
Here, mySlice is replaced (through the assignment operator) with a new slice containing all the elements of the original mySlice, plus myNewItem, plus some room (capacity) to grow without triggering this resize.
As you can imagine, this resizing operation is computationally non-trivial.
Quite often, all the resize operations can be avoided if you know how many items you will need to store in mySlice. If you have this foreknowledge, you can set the capacity of the original slice upfront and avoid all the resize operations.
(In practice, it's quite often possible to know how many items will be added to a collection; especially when transforming data from one format to another.)
The bytes.Buffer object has a Truncate(n int) method to discard all but the first n bytes.
I'd need the exact inverse of that - keeping the last n bytes.
I could do the following
b := buf.Bytes()
buf.Reset()
buf.Write(b[offset:])
but I'm not sure if this will re-use the slice efficiently.
Are there better options?
There are two alternatives:
1. The solution you give, which allows the first 'offset' bytes to be reused.
2. Create a bytes.NewBuffer(b[offset:]) and use that. This will not allow the first 'offset' bytes to be collected until you're done with the new buffer, but it avoids the cost of copying.
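For instance, a sketch of the second alternative (offset as in the question; note that the new buffer aliases buf's backing array, so buf itself should no longer be used):
b := buf.Bytes()
nb := bytes.NewBuffer(b[offset:]) // no copy; shares memory with buf
// read from and write to nb from here on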
Let bytes.Buffer handle the buffer management. The internal grow method slides the data down. Use the Next method. For example,
package main
import (
"bytes"
"fmt"
)
func main() {
var buf bytes.Buffer
for i := 0; i < 8; i++ {
buf.WriteByte(byte(i))
}
fmt.Println(buf.Len(), buf.Bytes())
n := buf.Len() / 2
// Keep last n bytes.
if n > buf.Len() {
n = buf.Len()
}
buf.Next(buf.Len() - n)
fmt.Println(buf.Len(), buf.Bytes())
}
Output:
8 [0 1 2 3 4 5 6 7]
4 [4 5 6 7]
I reckon the problem with your idea is that "truncating the buffer from its start" is simply impossible: the memory allocator allocates memory in full chunks, and there is no machinery in it for splitting an already allocated chunk into a set of "sub-chunks" (which is essentially what you're asking for). So to support "trimming from the beginning", the implementation of bytes.Buffer would have to allocate a smaller buffer, move the "tail" there, and then mark the original buffer for reuse.
This naturally leads to another idea: use two (or more) buffers. They might either be allocated separately and treated as adjacent by your algorithms, or you might use custom allocation: allocate one big slice, then reslice it two or more times to produce several physically adjacent buffers, or slide one or more "window" slices over it. This means implementing a custom data structure, of course…