Go: multiple len() calls vs performance?

At the moment I am implementing some sorting algorithms. As is the nature of such algorithms, there are a lot of calls to the built-in len() function on arrays/slices.
Now, given the following code for a (part of) the Mergesort algorithm:
for len(left) > 0 || len(right) > 0 {
    if len(left) > 0 && len(right) > 0 {
        if left[0] <= right[0] {
            result = append(result, left[0])
            left = left[1:len(left)]
        } else {
            result = append(result, right[0])
            right = right[1:len(right)]
        }
    } else if len(left) > 0 {
        result = append(result, left[0])
        left = left[1:len(left)]
    } else if len(right) > 0 {
        result = append(result, right[0])
        right = right[1:len(right)]
    }
}
My question is: do these multiple len() calls affect the performance of the algorithm negatively? Is it better to store the lengths of the left and right slices in temporary variables, or does the compiler do this itself?
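For concreteness, a hand-hoisted variant would look something like this (a sketch; the names nl and nr are mine, and they must be kept in sync with every reslice):

nl, nr := len(left), len(right)
for nl > 0 || nr > 0 {
    if nl > 0 && nr > 0 {
        if left[0] <= right[0] {
            result = append(result, left[0])
            left, nl = left[1:], nl-1
        } else {
            result = append(result, right[0])
            right, nr = right[1:], nr-1
        }
    } else if nl > 0 {
        result = append(result, left[0])
        left, nl = left[1:], nl-1
    } else {
        result = append(result, right[0])
        right, nr = right[1:], nr-1
    }
}

(Note that left[1:len(left)] and left[1:] are equivalent: a missing high bound defaults to the length.)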

There are two cases:
Local slice: length will be cached and there is no overhead
Global slice or passed (by reference): length cannot be cached and there is overhead
No overhead for local slices
For locally defined slices the length is cached, so there is no runtime overhead. You can see this in the assembly of the following program:
func generateSlice(x int) []int {
    return make([]int, x)
}

func main() {
    x := generateSlice(10)
    println(len(x))
}
Compiled with go tool 6g -S test.go (the old pre-1.5 compiler; with modern toolchains the equivalent is go tool compile -S test.go) this yields, amongst other things, the following lines:
MOVQ "".x+40(SP),BX
MOVQ BX,(SP)
// ...
CALL ,runtime.printint(SB)
What happens here is that the first line retrieves the length of x by getting the value located 40 bytes from the beginning of x and most importantly caches this value in BX, which is then used for every occurrence of len(x). The reason for the offset is that an array has the following structure (source):
typedef struct { // must not move anything
    uchar array[8]; // pointer to data
    uchar nel[4];   // number of elements
    uchar cap[4];   // allocated number of elements
} Array;
nel is what is accessed by len(). You can see this in the code generation as well.
Global and referenced slices have overhead
For shared values, caching of the length is not possible, since the compiler has to assume that the slice may change between calls. Therefore the compiler has to write code that accesses the length attribute directly every time. Example:
func accessLocal() int {
    a := make([]int, 1000) // local
    count := 0
    for i := 0; i < len(a); i++ {
        count += len(a)
    }
    return count
}

var ag = make([]int, 1000) // global

func accessGlobal() int {
    count := 0
    for i := 0; i < len(ag); i++ {
        count += len(ag)
    }
    return count
}
Comparing the assembly of both functions reveals the crucial difference: as soon as the variable is global, the access to the nel attribute is no longer cached, and there is runtime overhead:
// accessLocal
MOVQ "".a+8048(SP),SI // cache length in SI
// ...
CMPQ SI,AX // i < len(a)
// ...
MOVQ SI,BX
ADDQ CX,BX
MOVQ BX,CX // count += len(a)
// accessGlobal
MOVQ "".ag+8(SB),BX
CMPQ BX,AX // i < len(ag)
// ...
MOVQ "".ag+8(SB),BX
ADDQ CX,BX
MOVQ BX,CX // count += len(ag)

Despite the good answers you are getting, I'm seeing poorer performance when calling len(a) repeatedly, for example in this test: http://play.golang.org/p/fiP1Sy2Hfk
package main

import "testing"

func BenchmarkTest1(b *testing.B) {
    a := make([]int, 1000)
    for i := 0; i < b.N; i++ {
        count := 0
        for i := 0; i < len(a); i++ {
            count += len(a)
        }
    }
}

func BenchmarkTest2(b *testing.B) {
    a := make([]int, 1000)
    for i := 0; i < b.N; i++ {
        count := 0
        lena := len(a)
        for i := 0; i < lena; i++ {
            count += lena
        }
    }
}
When run as go test -bench=. I get:
BenchmarkTest1 5000000 668 ns/op
BenchmarkTest2 5000000 402 ns/op
So there is clearly a penalty here, possibly because the compiler makes weaker compile-time optimizations in the first version.

Things appear to have improved in recent versions of Go:
go version go1.16.7 linux/amd64
goos: linux
goarch: amd64
pkg: 001_test
cpu: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
BenchmarkTest1-8 4903609 228.8 ns/op
BenchmarkTest2-8 5280086 229.9 ns/op
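If you want to check what the compiler does on your own version, you can dump the optimized assembly with a standard flag (the compiler writes the listing to stderr):

$ go build -gcflags=-S . 2> asm.txt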


Why is accessing a variable so much slower than accessing len()?

I wrote this function uniq that takes in a sorted slice of ints and returns the slice with duplicates removed:
func uniq(x []int) []int {
    i := 0
    for i < len(x)-1 {
        if x[i] == x[i+1] {
            copy(x[i:], x[i+1:])
            x = x[:len(x)-1]
        } else {
            i++
        }
    }
    return x
}
and uniq2, a rewrite of uniq with the same results:
func uniq2(x []int) []int {
    i := 0
    l := len(x)
    for i < l-1 {
        if x[i] == x[i+1] {
            copy(x[i:], x[i+1:])
            l--
        } else {
            i++
        }
    }
    return x[:l]
}
The only difference between the two functions is that in uniq2, instead of reslicing x and directly accessing len(x) each time, I save len(x) to a variable l and decrement it whenever I shift the slice. I thought that uniq2 would be slightly faster than uniq because len(x) would no longer be called on each iteration, but in reality it is inexplicably much slower. I test with the following program, which generates a random sorted slice and calls uniq/uniq2 on it 1000 times, running on Linux:
func main() {
    rand.Seed(time.Now().Unix())
    for i := 0; i < 1000; i++ {
        _ = uniq(genSlice())
        //_ = uniq2(genSlice())
    }
}

func genSlice() []int {
    x := make([]int, 0, 1000)
    for num := 1; num <= 10; num++ {
        amount := rand.Intn(1000)
        for i := 0; i < amount; i++ {
            x = append(x, num)
        }
    }
    return x
}
$ go build uniq.go
$ time ./uniq

uniq usually takes 5-6 seconds to finish, while uniq2 is more than two times slower, taking 12-15 seconds. Why is uniq2, where I save the slice length to a variable, so much slower than uniq, where I directly call len? Shouldn't it be slightly faster?
You expect roughly the same execution time because you think they do roughly the same thing.
The only difference between the two functions is that in uniq2, instead of slicing x and directly accessing len(x) each time, I save len(x) to a variable l and decrement it whenever I shift the slice.
This is wrong.
The first version does:

copy(x[i:], x[i+1:])
x = x[:len(x)-1]

And the second does:

copy(x[i:], x[i+1:])
l--
The first difference is that the first version assigns (copies) a slice header, which is a reflect.SliceHeader-like value of three integer fields (24 bytes on 64-bit architectures), while l-- is a simple decrement and much faster.
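You can check the header size yourself; a minimal sketch (unsafe.Sizeof reports the size of the slice header, not of the backing array):

package main

import (
    "fmt"
    "unsafe"
)

func main() {
    var s []int
    fmt.Println(unsafe.Sizeof(s)) // 24 on 64-bit: pointer + length + capacity
}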
But the main difference does not stem from this. The main difference is that since the first version changes the x slice (the header, length included), you end up copying fewer and fewer elements, while the second version does not change x and always copies to the end of the slice: x[i+1:] is equivalent to x[i+1:len(x)].
To demonstrate, imagine you pass a slice with length=10 and having all equal elements. The first version will copy 9 elements first, then 8, then 7 etc. The second version will copy 9 elements first, then 9 again, then 9 again etc.
Let's modify your functions to count the number of copied elements:
func uniq(x []int) []int {
    count := 0
    i := 0
    for i < len(x)-1 {
        if x[i] == x[i+1] {
            count += copy(x[i:], x[i+1:])
            x = x[:len(x)-1]
        } else {
            i++
        }
    }
    fmt.Println("uniq copied", count, "elements")
    return x
}

func uniq2(x []int) []int {
    count := 0
    i := 0
    l := len(x)
    for i < l-1 {
        if x[i] == x[i+1] {
            count += copy(x[i:], x[i+1:])
            l--
        } else {
            i++
        }
    }
    fmt.Println("uniq2 copied", count, "elements")
    return x[:l]
}
Testing it:
uniq(make([]int, 1000))
uniq2(make([]int, 1000))
Output is:
uniq copied 499500 elements
uniq2 copied 998001 elements
uniq2() copies twice as many elements!
If we test it with a random slice:
uniq(genSlice())
uniq2(genSlice())
Output is:
uniq copied 7956671 elements
uniq2 copied 11900262 elements
Again, uniq2() copies roughly 1.5 times more elements! (But this greatly depends on the random numbers.)
Try the examples on the Go Playground.
The "fix" is to modify uniq2() to copy only up to l:

copy(x[i:], x[i+1:l])
l--
With this "appropriate" change, performance is roughly the same.
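For completeness, the fixed uniq2 in full (identical to the original except for the copy upper bound):

func uniq2(x []int) []int {
    i := 0
    l := len(x)
    for i < l-1 {
        if x[i] == x[i+1] {
            copy(x[i:], x[i+1:l]) // copy only the live part of the slice
            l--
        } else {
            i++
        }
    }
    return x[:l]
}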

Is there another way of testing if a big.Int is 0?

I'm working with big.Ints and need to test for 0. Right now I'm using zero = big.NewInt(0) and Cmp(zero) == 0, which works fine, but I was wondering if there's a quicker way specifically for 0 (I need this program to be very fast).
big.Int exposes Int.Bits() to access the raw words of the representation. It returns a slice that shares the same underlying array (the returned slice is not copied), so it's fast. It's exposed to "support implementation of missing low-level Int functionality".
Perfect, exactly what we want.
0. Testing for 0
The documentation of big.Int also states that "the zero value for an Int represents the value 0". So for the zero value (which represents 0) the slice will be empty (the zero value for a slice is nil, and the length of a nil slice is 0). We can simply check that:
if len(i1.Bits()) == 0 {
}
Also note that there is an Int.BitLen() function, whose documentation states that "the bit length of 0 is 0". So we may also use this:
if i1.BitLen() == 0 {
}
Let's benchmark these solutions:
func BenchmarkCompare(b *testing.B) {
    zero := big.NewInt(0)
    i1 := big.NewInt(1)
    i2 := big.NewInt(0)
    for i := 0; i < b.N; i++ {
        if i1.Cmp(zero) == 0 {
        }
        if i2.Cmp(zero) == 0 {
        }
    }
}

func BenchmarkBits(b *testing.B) {
    i1 := big.NewInt(1)
    i2 := big.NewInt(0)
    for i := 0; i < b.N; i++ {
        if len(i1.Bits()) == 0 {
        }
        if len(i2.Bits()) == 0 {
        }
    }
}

func BenchmarkBitLen(b *testing.B) {
    i1 := big.NewInt(1)
    i2 := big.NewInt(0)
    for i := 0; i < b.N; i++ {
        if i1.BitLen() == 0 {
        }
        if i2.BitLen() == 0 {
        }
    }
}
Benchmark results:
BenchmarkCompare-8 76975251 13.3 ns/op
BenchmarkBits-8 1000000000 0.656 ns/op
BenchmarkBitLen-8 1000000000 1.11 ns/op
Getting the bits and comparing the slice length to 0 is 20 times faster than comparing the value to another big.Int representing 0; using Int.BitLen() is also 10 times faster.
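Also worth knowing (not benchmarked above): the standard library's Int.Sign() returns 0 exactly when the value is 0, and it is likewise a cheap O(1) check that reads more clearly than inspecting the bits:

if i1.Sign() == 0 {
    // i1 is 0
}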
1. Testing for 1
Something similar could be done to test whether a big.Int value equals 1, but it will probably not be as fast as testing for zero: 0 is the most special value. Its internal representation is a nil slice; any other value requires a non-nil slice. 0 also has another special property: its absolute value is itself.
This absolute value property is important, because Int.Bits() returns the absolute value. So in case of a non-zero value checking just the bits slice is insufficient, as it carries no sign information.
So testing for 1 can be implemented by checking if the bits content represents 1, and the sign is positive:
func isOne(i *big.Int) bool {
    bits := i.Bits()
    return len(bits) == 1 && bits[0] == 1 && i.Sign() > 0
}
Let's benchmark this along with comparing the number to one := big.NewInt(1):
func BenchmarkCompareOne(b *testing.B) {
    one := big.NewInt(1)
    i1 := big.NewInt(0)
    i2 := big.NewInt(1)
    i3 := big.NewInt(2)
    for i := 0; i < b.N; i++ {
        if i1.Cmp(one) == 0 {
        }
        if i2.Cmp(one) == 0 {
        }
        if i3.Cmp(one) == 0 {
        }
    }
}

func BenchmarkBitsOne(b *testing.B) {
    i1 := big.NewInt(0)
    i2 := big.NewInt(1)
    i3 := big.NewInt(2)
    for i := 0; i < b.N; i++ {
        if isOne(i1) {
        }
        if isOne(i2) {
        }
        if isOne(i3) {
        }
    }
}
And the benchmark results:
BenchmarkCompareOne-8 58631458 18.9 ns/op
BenchmarkBitsOne-8 715606629 1.76 ns/op
Not bad! Our way of testing for 1 is again 10 times faster.

Why are bitwise operators slower than division and modulo in Go?

Usually I program in C and frequently use bitwise operators, since they are faster. Now I encountered this timing difference while solving Project Euler Problem 14 with bitwise operators versus division and modulo. The program was compiled with go version go1.6.2.
Version with bitwise operators:
package main
import (
"fmt"
)
func main() {
var buf, longest, cnt, longest_start int
for i:=2; i<1e6; i++ {
buf = i
cnt = 0
for buf > 1 {
if (buf & 0x01) == 0 {
buf >>= 1
} else {
buf = buf * 3 + 1
}
cnt++
}
if cnt > longest {
longest = cnt
longest_start = i
}
}
fmt.Println(longest_start)
}
executing the program:
time ./prob14
837799
real 0m0.300s
user 0m0.301s
sys 0m0.000s
Version without bitwise operators (replacing & 0x01 with % 2 and >>= 1 with /=2):
for buf > 1 {
    if (buf % 2) == 0 {
        buf /= 2
    } else {
        buf = buf*3 + 1
    }
    cnt++
}
executing the program:
$ time ./prob14
837799
real 0m0.273s
user 0m0.274s
sys 0m0.000s
Why is the version with the bitwise operators in Go slower?
(I also created a solution for the problem in C. There, the version with the bitwise operators was faster without optimization flags; with -O3 they are equal.)
EDIT
I did a benchmark as suggested in the comments.
package main

import (
    "testing"
)

func Colatz(num int) {
    cnt := 0
    buf := num
    for buf > 1 {
        if (buf % 2) == 0 {
            buf /= 2
        } else {
            buf = buf*3 + 1
        }
        cnt++
    }
}

func ColatzBitwise(num int) {
    cnt := 0
    buf := num
    for buf > 1 {
        if (buf & 0x01) == 0 {
            buf >>= 1
        } else {
            buf = buf*3 + 1
        }
        cnt++
    }
}

func BenchmarkColatz(b *testing.B) {
    for i := 0; i < b.N; i++ {
        Colatz(837799)
    }
}

func BenchmarkColatzBitwise(b *testing.B) {
    for i := 0; i < b.N; i++ {
        ColatzBitwise(837799)
    }
}
Here are the benchmark results:
go test -bench=.
PASS
BenchmarkColatz-8 2000000 650 ns/op
BenchmarkColatzBitwise-8 2000000 609 ns/op
It turns out the bitwise version is faster in the benchmark.
EDIT 2
I changed the type of all variables in the functions to uint. Here is the benchmark:
go test -bench=.
PASS
BenchmarkColatz-8 3000000 516 ns/op
BenchmarkColatzBitwise-8 3000000 590 ns/op
The arithmetic version is now faster, as Marc has written in his answer. I will also test with a newer compiler version.
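For reference, here is a sketch of the uint change described above, applied to Colatz (the name ColatzUint is mine):

func ColatzUint(num uint) {
    cnt := uint(0)
    buf := num
    for buf > 1 {
        if (buf % 2) == 0 {
            buf /= 2 // with an unsigned type the compiler can emit a plain shift
        } else {
            buf = buf*3 + 1
        }
        cnt++
    }
}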
If they ever were, they aren't now.
There are a few problems with your approach:
you're using go1.6.2 which was released over 4 years ago
you're running a binary that does other things and running it just once
you're expecting bitshift and arithmetic operations on signed integers to be the same; they're not
Using go1.15 with micro benchmarks will show the bitwise operations to be faster. The main reason for this is that a bitwise shift and a division by two are absolutely not the same for signed integers: the bitwise shift doesn't care about the sign but the division has to preserve it.
If you want something closer to equivalent, use unsigned integers for your arithmetic operations; the compiler may optimize them to a single bitshift.
In go1.15 on my machine, I see the following being generated for each type of division by 2:
buf >>= 1:
MOVQ AX, DX
SARQ $1, AX
buf /= 2 with var buf int:
MOVQ AX, DX
SHRQ $63, AX
ADDQ DX, AX
SARQ $1, AX
buf /= 2 with var buf uint:
MOVQ CX, BX
SHRQ $1, CX
Even then, all this must be taken with a large grain of salt: the generated code will depend massively on what else is happening and how the results are used.
But the basic rule applies: when performing arithmetic operations, the type matters a lot. Bitshift operators don't care about sign.
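The semantic difference is easy to demonstrate: Go's integer division truncates toward zero, while an arithmetic right shift rounds toward negative infinity, so the two disagree on negative odd values. A minimal example:

package main

import "fmt"

func main() {
    x := -3
    fmt.Println(x / 2)  // -1: division truncates toward zero
    fmt.Println(x >> 1) // -2: arithmetic shift rounds down
}

The extra SHRQ/ADDQ instructions in the signed version above are exactly the fix-up that makes the shift-based division truncate toward zero.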

Efficient allocation of slices (cap vs length)

Assume I am creating a slice which I know in advance I want to populate with 1e5 elements via successive calls to append in a for loop:
// Append 1e5 strings to the slice
for i := 0; i <= 1e5; i++ {
    value := fmt.Sprintf("Entry: %d", i)
    myslice = append(myslice, value)
}
Which is the more efficient way of initialising the slice, and why?
a. declaring a nil slice of strings?
var myslice []string
b. setting its length in advance to 1e5?
myslice = make([]string, 1e5)
c. setting both its length and capacity to 1e5?
myslice = make([]string, 1e5, 1e5)
Your b and c solutions are identical: when you create a slice with make() and don't specify the capacity, the "missing" capacity defaults to the given length.
Also note that if you create the slice with a length in advance, you can't use append() to populate the slice, because it adds new elements to the slice, and it doesn't "reuse" the allocated elements. So in that case you have to assign values to the elements using an index expression, e.g. myslice[i] = value.
If you start with a slice of 0 capacity, a new backing array has to be allocated, and the "old" content has to be copied over, every time you append an element that does not fit into the current capacity, so that solution is inherently slower.
I would define and consider the following different solutions (I use an []int slice to keep fmt.Sprintf() from interfering with our benchmarks):
var s []int

func BenchmarkA(b *testing.B) {
    for i := 0; i < b.N; i++ {
        s = nil
        for j := 0; j < 1e5; j++ {
            s = append(s, j)
        }
    }
}

func BenchmarkB(b *testing.B) {
    for i := 0; i < b.N; i++ {
        s = make([]int, 0, 1e5)
        for j := 0; j < 1e5; j++ {
            s = append(s, j)
        }
    }
}

func BenchmarkBLocal(b *testing.B) {
    for i := 0; i < b.N; i++ {
        s := make([]int, 0, 1e5)
        for j := 0; j < 1e5; j++ {
            s = append(s, j)
        }
    }
}

func BenchmarkD(b *testing.B) {
    for i := 0; i < b.N; i++ {
        s = make([]int, 1e5)
        for j := range s {
            s[j] = j
        }
    }
}
Note: I use package-level variables in the benchmarks (except in BLocal) because some optimizations may, and actually do, happen when using a local slice variable.
And the benchmark results:
BenchmarkA-4 1000 1081599 ns/op 4654332 B/op 30 allocs/op
BenchmarkB-4 3000 371096 ns/op 802816 B/op 1 allocs/op
BenchmarkBLocal-4 10000 172427 ns/op 802816 B/op 1 allocs/op
BenchmarkD-4 10000 167305 ns/op 802816 B/op 1 allocs/op
A: As you can see, starting with a nil slice is the slowest, uses the most memory and allocations.
B: Pre-allocating the slice with capacity (but still 0 length) and using append: it requires only a single allocation and is much faster, almost thrice as fast.
BLocal: Do note that when using a local slice instead of a package variable, (compiler) optimizations happen and it gets a lot faster: twice as fast, almost as fast as D.
D: Not using append() but assigning elements to a preallocated slice wins in every aspect, even when using a non-local variable.
For this use case, since you already know the number of string elements that you want to assign to the slice, I would prefer approach b or c, since they prevent resizing of the slice. If you choose approach a, the slice has to grow its backing array (roughly doubling, for small slices) every time an element is appended once len equals capacity; the link and the snippet below illustrate this growth.
https://play.golang.org/p/kSuX7cE176j
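To see the growth behaviour directly, here is a minimal sketch that prints len and cap each time append reallocates (the exact growth factors are an implementation detail and vary between Go versions):

package main

import "fmt"

func main() {
    var s []int
    oldCap := cap(s)
    for i := 0; i < 1e5; i++ {
        s = append(s, i)
        if cap(s) != oldCap {
            fmt.Printf("len=%d cap=%d\n", len(s), cap(s))
            oldCap = cap(s)
        }
    }
}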

for loop speed comparison

I was wondering how fast the len operator is in Go, so I wrote a simple benchmark. My expectation was that avoiding a call to len during each loop iteration would make the code run faster, but it is in fact the opposite.
Here's the benchmark:
func sumArrayNumber(input []int) int {
    var res int
    for i, length := 0, len(input); i < length; i += 1 {
        res += input[i]
    }
    return res
}

func sumArrayNumber2(input []int) int {
    var res int
    for i := 0; i < len(input); i += 1 {
        res += input[i]
    }
    return res
}

var result int
var input = []int{3, 6, 22, 68, 11, -7, 22, 5, 0, 0, 1}

func BenchmarkSumArrayNumber(b *testing.B) {
    var r int
    for n := 0; n < b.N; n++ {
        r = sumArrayNumber(input)
    }
    result = r
}

func BenchmarkSumArrayNumber2(b *testing.B) {
    var r int
    for n := 0; n < b.N; n++ {
        r = sumArrayNumber2(input)
    }
    result = r
}
And here are the results:
goos: windows
goarch: amd64
BenchmarkSumArrayNumber-8 300000000 4.75 ns/op
BenchmarkSumArrayNumber2-8 300000000 4.67 ns/op
PASS
ok command-line-arguments 4.000s
I confirmed the results are consistent by doing the following:
doubling the input array size roughly doubles the execution time per op; the speed difference scales with the length of the input array.
exchanging the test order does not impact the results.
Why is the code that checks len() at every loop iteration faster?
One may argue that a difference of 0.08 ns is not statistically significant enough to say that one for-loop is faster than the other. You probably need to run the same test many times (at least 20); at that point you should be able to derive a mean value and standard deviation.
Moreover, there are many factors that can speed up the len() operator, like CPU caches and compiler optimizations. I think the most relevant factor in your specific example is that the len() operator for slices and arrays just reads the length field of the slice header, so it is O(1).
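For that kind of repeated measurement, the standard tooling already helps: go test accepts a -count flag to repeat benchmarks, and the benchstat tool (from golang.org/x/perf) computes means and variation across runs. A minimal sketch of the workflow (the file name is arbitrary):

$ go test -bench=. -count=20 > results.txt
$ benchstat results.txt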
