Go memory allocation - new objects, pointers and escape analysis

I read that Go manages memory in a smart way: using escape analysis, it may not allocate heap memory when calling new, and vice versa. Can Go allocate heap memory for a declaration such as var bob *Person = &Person{2, 3}, or will the pointer always point to the stack?

The pointer may escape to the heap, or it may not; it depends on your use case. The compiler is pretty smart. E.g. given:
type Person struct {
    b, c int
}

func foo(b, c int) int {
    bob := &Person{b, c}
    return bob.b
}
The function foo will be compiled into:
TEXT "".foo(SB)
MOVQ "".b+8(SP), AX
MOVQ AX, "".~r2+24(SP)
RET
It's all on the stack here, because even though bob is a pointer, it doesn't escape this function's scope.
However, if we consider a slight (albeit artificial) modification:
var globalBob *Person

func foo(b, c int) int {
    bob := &Person{b, c}
    globalBob = bob
    return bob.b
}
Then bob escapes, and foo will be compiled to:
TEXT "".foo(SB), ABIInternal, $24-24
MOVQ (TLS), CX
CMPQ SP, 16(CX)
PCDATA $0, $-2
JLS foo_pc115
PCDATA $0, $-1
SUBQ $24, SP
MOVQ BP, 16(SP)
LEAQ 16(SP), BP
LEAQ type."".Person(SB), AX
MOVQ AX, (SP)
PCDATA $1, $0
CALL runtime.newobject(SB)
MOVQ 8(SP), AX
MOVQ "".b+32(SP), CX
MOVQ CX, (AX)
MOVQ "".c+40(SP), CX
MOVQ CX, 8(AX)
PCDATA $0, $-2
CMPL runtime.writeBarrier(SB), $0
JNE foo_pc101
MOVQ AX, "".globalBob(SB)
foo_pc83:
PCDATA $0, $-1
MOVQ (AX), AX
MOVQ AX, "".~r2+48(SP)
MOVQ 16(SP), BP
ADDQ $24, SP
RET
Which, as you can see, invokes newobject.
These disassembly listings were generated by https://godbolt.org/, and are for Go 1.16 on amd64.
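If you'd rather not read assembly, the compiler can also report these decisions directly via go build -gcflags=-m (the flag is real; the file name and the exact diagnostic wording below are illustrative):

// escape.go - a minimal sketch; build with:  go build -gcflags=-m escape.go
// The compiler prints its escape-analysis decisions, e.g. a line resembling
// "&Person{...} escapes to heap" for the escaping literal (wording varies by Go version).
package main

type Person struct {
    b, c int
}

var globalBob *Person

func stays(b, c int) int {
    bob := &Person{b, c} // does not escape: only used within this frame
    return bob.b
}

func escapes(b, c int) int {
    bob := &Person{b, c} // escapes to heap: stored in a package-level variable
    globalBob = bob
    return bob.b
}

func main() {
    println(stays(2, 3), escapes(2, 3))
}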

Whether memory is allocated on the stack or "escapes" to the heap is entirely dependent on how you use the memory, not on how you declare the variable.
If you return a pointer to a stack-allocated variable in, say, C, the pointer will be invalid by the time you attempt to use it. This isn't possible in Go, because you cannot explicitly tell Go where to place a variable. It does a very good job of choosing the correct place, and if it sees that references to a blob of memory may live beyond the stack frame, it will ensure that the allocation happens on the heap instead.
Can Go allocate heap memory for a declaration such as
var bob *Person = &Person{2, 3}
or will the pointer always point to the stack?
That line of code cannot be said to "always" point to the stack, though sometimes it will; so yes, it may allocate memory on the heap.
Again, it's not about that line of code, it's about what comes after it. If the value of bob (the address of the Person object) is returned, then the Person cannot be allocated on the stack, because the returned address would end up pointing to reclaimed memory.
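To make the returned-address case concrete, here is a small illustrative sketch (the function names are made up for this example):

package main

type Person struct{ b, c int }

// newBob returns the address of its Person, so the value outlives the call
// and must be heap-allocated.
func newBob() *Person {
    bob := &Person{2, 3}
    return bob
}

// localBob never lets the Person's address leave the function, so the value
// can live on the stack.
func localBob() int {
    bob := &Person{2, 3}
    return bob.b + bob.c
}

func main() {
    println(newBob().b, localBob())
}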

Put simply, if the compiler can prove that the value can safely be created on the stack, it will (probably) be created on the stack. Otherwise, it will be allocated on the heap.
The tools the compiler has to do these proofs are pretty good, but it doesn't get it right all the time. Most of the time, though, the cost of worrying about it outweighs the benefit.

Related

Decoding compiler output for an interface runtime type assertion

I recently encountered empty interfaces while using the Load() method of atomic.Value. I was experimenting a bit with type assertions on empty interfaces - https://play.golang.org/p/CLyY2y9-2VF
This piqued my interest, and I decided to take a peek behind the curtains to see what the compiler does so that the code doesn't panic when trying to read the concrete value of a nil interface{} (e.g., when you type-assert the result of Load() before Store has been called).
I could see that in the unsafe version, the compiler emitted an assembly instruction that causes the panic: call runtime.panicdottypeE(SB)
The panic instruction is obviously not present in the safe version. Can someone please explain in more detail what the compiler is doing when we capture the second return value with ok (and perhaps point me to the corresponding assembly instructions in the godbolt links)?
Here are the godbolt compiler links for unsafe version [1] and safe version [2].
[1] https://godbolt.org/z/76onvj
[2] https://godbolt.org/z/e8aoqe
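The source behind those links isn't reproduced here, but the two forms being compared are presumably along these lines (returnEmptyInterface and the bool element type are taken from the assembly quoted below; the rest is an illustrative sketch):

package main

//go:noinline
func returnEmptyInterface() interface{} {
    return nil // nothing has been stored yet, as with an unused atomic.Value
}

func main() {
    v := returnEmptyInterface()

    // "Unsafe" single-value assertion: panics if v does not hold a bool.
    // x := v.(bool)

    // "Safe" two-value assertion: never panics; ok reports whether v held a bool.
    x, ok := v.(bool)
    println(x, ok)
}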
The empty interface type (called eface in the runtime package) is two pointers: the first one points to the underlying type (e.g. the bool type, int type, YourStruct type, ...), the second one is a pointer to the data (IIRC, in some cases it is the data itself).
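For reference, that two-word layout looks roughly like this (a simplified sketch; the real definition lives in runtime/runtime2.go and uses the runtime's internal *_type for the first field):

package main

import "unsafe"

// eface mirrors the two-word shape of an empty-interface value:
// one word for the type descriptor, one word for the data.
type eface struct {
    typ  unsafe.Pointer // points at the dynamic type's descriptor
    data unsafe.Pointer // points at the value's data
}

func main() {
    var i interface{} = true
    println(unsafe.Sizeof(i), unsafe.Sizeof(eface{})) // both are two machine words
}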
Unsafe version:
call "".returnEmptyInterface(SB) Call the function
pcdata $0, $1 pcdata is not a real instruction, ignore
movq 8(SP), AX AX <- pointer to data
movq (SP), CX CX <- pointer to type
pcdata $0, $2
leaq type.bool(SB), DX DX <- pointer to bool type
cmpq CX, DX Compare CX and DX
jne main_pc156 If they are not equal jump to main_pc156
At main_pc156 the compiler calls runtime.panicdottypeE, which basically panics. Its source code is in runtime/iface.go (you can find it in your $GOPATH/src):
func panicdottypeE(have, want, iface *_type) {
    panic(&TypeAssertionError{iface, have, want, ""})
}
Safe version:
pcdata $0, $0
pcdata $1, $0
call "".returnEmptyInterface(SB)   // call the function
pcdata $0, $1
movq 8(SP), AX                     // AX <- pointer to data
movq (SP), CX                      // CX <- pointer to type
main_pc47:
pcdata $0, $2
leaq type.bool(SB), DX             // DX <- pointer to the bool type
cmpq CX, DX                        // compare CX and DX
jne main_pc186                     // if not equal, jump to main_pc186
pcdata $0, $3
movblzx (AX), AX                   // AX <- dereference AX
main_pc62:
and in main_pc186
pcdata $0, $3
xorl AX, AX                        // AX <- 0
jmp main_pc62                      // jump back to the end of the previous block
Here AX corresponds to x in the code, but what corresponds to ok? Nothing! If you check the code for println, you see:
cmpq CX, DX                        // compare CX and DX
seteq AL                           // AL <- 1 if CX equals DX, otherwise 0
So the compiler decided to compare them again when printing.
The quick summary: up to the type check they do exactly the same thing; it's just that what happens in the ok == false case is different.
The two bits have the following in common:
pcdata $0, $0
pcdata $1, $0
call "".returnEmptyInterface(SB)
pcdata $0, $1
movq 8(SP), AX
movq (SP), CX
pcdata $0, $2
leaq type.bool(SB), DX
cmpq CX, DX
jne main_pc156 <==== jump
pcdata $0, $3
movblzx (AX), AX
The code found in main_pc156 is the part we care about. As you noticed, for the single-value type assertion, this is:
main_pc156:
movq DX, 8(SP)
pcdata $0, $1
leaq type.interface {}(SB), AX
pcdata $0, $0
movq AX, 16(SP)
call runtime.panicdottypeE(SB)
xchgl AX, AX
There's no escaping this, once we jump to main_pc156, we panic.
On the other hand, the code for the two-value type assertion is:
main_pc186:
pcdata $0, $3
xorl AX, AX
jmp main_pc62
This is massively different from the previous case, and takes us right back to the end of the first bit of code, resuming execution.
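In source terms, that is exactly the difference between the two assertion forms; a quick illustrative check (the panic message wording in the comment is approximate and version-dependent):

package main

func main() {
    var v interface{} // nil: in the question's terms, Store was never called

    // Two-value form: no panic; x gets the zero value and ok reports false.
    x, ok := v.(bool)
    println(x, ok) // false false

    // Single-value form: panics with something like
    // "interface conversion: interface {} is nil, not bool".
    defer func() { println("recovered:", recover() != nil) }()
    _ = v.(bool)
}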

Is movzbl followed by testl faster than testb?

Consider this C code:
int f(void) {
int ret;
char carry;
__asm__(
"nop # do something that sets eax and CF"
: "=a"(ret), "=#ccc"(carry)
);
return carry ? -ret : ret;
}
When I compile it with gcc -O3, I get this:
f:
nop # do something that sets eax and CF
setc %cl
movl %eax, %edx
negl %edx
testb %cl, %cl
cmovne %edx, %eax
ret
If I change char carry to int carry, I instead get this:
f:
nop # do something that sets eax and CF
setc %cl
movl %eax, %edx
movzbl %cl, %ecx
negl %edx
testl %ecx, %ecx
cmovne %edx, %eax
ret
That change replaced testb %cl, %cl with movzbl %cl, %ecx and testl %ecx, %ecx. The two versions are equivalent, though, and GCC knows it. As evidence of this, if I compile with -Os instead of -O3, then both char carry and int carry result in the exact same assembly:
f:
nop # do something that sets eax and CF
jnc .L1
negl %eax
.L1:
ret
It seems like one of two things must be true, but I'm not sure which:
A testb is faster than a movzbl followed by a testl, so GCC's use of the latter with int is a missed optimization.
A testb is slower than a movzbl followed by a testl, so GCC's use of the former with char is a missed optimization.
My gut tells me that an extra instruction will be slower, but I also have a nagging doubt that it's preventing a partial register stall that I just don't see.
By the way, the usual recommended approach of xoring the register to zero before the setc doesn't work in my real example. You can't do it after the inline assembly runs, since xor will overwrite the carry flag, and you can't do it before the inline assembly runs, since in the real context of this code, every general-purpose call-clobbered register is already in use somehow.
There's no downside I'm aware of to reading a byte register with test vs. movzb.
If you are going to zero-extend, it's also a missed optimization not to xor-zero a reg ahead of the asm statement, and setc into that so the cost of zero-extension is off the critical path. (On CPUs other than Intel IvyBridge+ where movzx r32, r8 is not zero latency). Assuming there's a free register, of course. Recent GCC does sometimes find this zero/set-flags/setcc optimization for generating a 32-bit boolean from a flag-setting instruction, but often misses it when things get complex.
Fortunately for you, your real use-case couldn't do that optimization anyway (except with mov $0, %eax zeroing, which would be off the critical path for latency but cause a partial-register stall on Intel P6 family, and cost more code size.) But it's still a missed optimization for your test case.

Segmentation fault: 11 With Array Assignment in Loop Using x86 GNU GAS Assembly

This question is similar to another question I posted here. I am attempting to write the assembly version of the following C/C++ code:
int x[10];
for (int i = 0; i < 10; i++){
    x[i] = i;
}
Essentially, this creates an array storing the values 0 through 9.
My current logic is to create a label that loops up to 10 (calling itself until reaching the end value). In the label, I have placed the instructions to update the array at the current index of iteration. However, after compiling with gcc filename.s and running with ./a.out, the error Segmentation fault: 11 is printed to the console. My code is below:
.data
x:.fill 10, 4
index:.int 0
end:.int 10
.text
.globl _main
_main:
pushq %rbp
movq %rsp, %rbp
subq $16, %rsp
jmp outer_loop
leave
ret
outer_loop:
movl index(%rip), %eax;
cmpl end(%rip), %eax
jge end_loop
lea x(%rip), %rdi;
mov index(%rip), %rsi;
movl index(%rip), %eax;
movl %eax, (%rdi, %rsi, 4)
incl index(%rip)
jmp outer_loop
leave
ret
end_loop:
leave
ret
Oddly the code below
lea x(%rip), %rdi;
mov index(%rip), %rsi;
movl index(%rip), %eax;
movl %eax, (%rdi, %rsi, 4)
works only if it is not in a label that is called repetitively. Does anyone know how I can implement the code above in a loop, without Segmentation fault: 11 being raised? I am using x86 Assembly on MacOS with GNU GAS syntax compiled with gcc.
Please note that this question is not a duplicate of this question as different Assembly syntax is being used and the scope of the problem is different.
You're using a 64-bit instruction to access a 32-bit area of memory:
mov index(%rip), %rsi;
This results in %rsi being assigned the contents of memory starting from index and ending at end (I'm assuming no alignment, though I don't remember GAS's rules regarding it). Thus, %rsi is effectively assigned the value 0xa00000000 (assuming the first iteration of the loop), and executing the following movl %eax, (%rdi, %rsi, 4) results in the CPU trying to access an address that's not mapped by your process.
The solution is to remove the assignment, and replace the line after it with movl index(%rip), %esi. 32-bit operations are guaranteed to always clear out the upper bits of 64-bit registers, so you can then safely use %rsi in the address calculation, as it's going to contain the current index and nothing more.
Your debugger would've told you this, so please do use it next time.

Assembly multiplication loop returning wrong high number

I am trying to write a for loop that does multiplication by adding one number (var a) to itself another number (var b) of times.
.globl times
times:
movl $0, %ecx # i = 0
cmpl %ecx, %esi #if b-i
jge end # if >= 0, jump to end
loop:
addl (%edi, %eax), %eax #sum += a
incl %ecx # i++
cmpl %esi, %ecx # compare (i-b)
jl loop # < 0? loop b times total
end:
ret
Where am I going wrong? I've run through the logic and I can't figure out what the problem is.
TL:DR: you didn't zero EAX, and your ADD instruction is using a memory operand.
You should have used a debugger. You'd easily have seen that EAX wasn't zero to start with. See the bottom of the x86 tag wiki for tips on using gdb to debug asm.
I guess you're using the x86-64 System V ABI, so your args (a and b) are in %edi and %esi.
At the start of a function, registers other than the ones holding your args should be assumed to contain garbage. Even the high parts of registers that are holding your args can contain garbage. (exception to this rule: unofficially, narrow args are sign or zero extended to 32-bit by the caller)
Neither arg is a pointer, so you shouldn't dereference them. add (%edi, %eax), %eax calculates a 32-bit address as EDI+EAX, and then loads 32 bits from there. It adds that dword to EAX (the destination operand).
I'm shocked that your program didn't segfault, since you're using your integer arg as a pointer.
For many x86 instructions (like ADD), the destination operand is not write-only. add %edi, %eax does EAX += EDI. I think you're getting mixed up with 3-operand RISC syntax, where you might have an instruction like add %src1, %src2, %dst.
x86 has some instructions like that, added as recent extensions (like BMI2 bzhi), but the usual instructions are all 2-operand with destructive destinations. (The exception is LEA, which, instead of loading from the address, stores the address in the destination. So lea (%edi, %eax), %eax would work, and you could even put the result in a different register. LEA is great for saving MOV instructions by doing a shift+add and a mov all in one instruction, using the addressing-mode syntax and machine-code encoding.)
You have a comment that says ie eax = sum + (a x 4bits). No clue what you're talking about there. a is 4 bytes (not bits), and you're not multiplying a (%edi) by anything.
Just for fun, here's how I'd write your function (if I had to avoid imul %edi, %esi / mov %esi, %eax). I'll assume both args are non-negative, to keep it simple. If your args are signed integers, and you have to loop -b times if b is negative, then you need some extra code.
# args: int a (%edi), int b (%esi)   # comments are important for documenting inputs/outputs to blocks of code
# return value: product in %eax
# assumptions: b is non-negative.
times:
    xor   %eax, %eax      # zero eax (sum = 0)
    test  %esi, %esi      # set flags from b
    jz    loop_end        # early-out if it's zero
loop:                     # do {
    add   %edi, %eax      #   sum += a
    dec   %esi            #   b--  (sets flags based on the result, except CF, so don't use ja or jb after it)
    jg    loop            # } while (b > 0)
loop_end:
    ret
Note the indenting style, so it's easy to find the branch targets. Some people like to indent extra for instructions inside loops.
Your way works fine (if you do it right), but my way illustrates that counting down is easier in asm (no need for an extra register or immediate to hold the upper bound). Also, avoiding redundant compares. But don't worry about optimizing until after you're comfortable writing code that at least works.
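For comparison, the same count-down structure written out in Go (an illustrative sketch, keeping the assumption that b is non-negative):

package main

// times multiplies a by b using repeated addition, counting b down to zero,
// just like the hand-written loop above.
func times(a, b int32) int32 {
    var sum int32
    for ; b > 0; b-- { // skipped entirely when b == 0, like the early-out jz
        sum += a
    }
    return sum
}

func main() {
    println(times(7, 6)) // 42
}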
This is pseudo-code, keep that in mind.
mov X,ebx <- put into EBX your counter, your B
mov Y,edx <- put into EDX your value, your A
mov 0,eax <- Result
loop:
add eax,edx
dec ebx
jnz loop <- While EBX is not zero
The above implementation should leave your result in EAX. Your code looks like it's missing the EAX initialisation.

Why does my for loop have 1 more instruction than expected?

I write a lot of vectorized loops, so 1 common idiom is
volatile int dummy[1<<10];
for (int64_t i = 0; i + 16 <= argc; i += 16) // process all elements with whole vectors
{
    int x = dummy[i];
}
// handle remainder (hopefully with SIMD too)
But the resulting machine code has 1 more instruction than I would like (using gcc 4.9)
.L3:
leaq -16(%rax), %rdx
addq $16, %rax
cmpq %rcx, %rax
movl -120(%rsp,%rdx,4), %edx
jbe .L3
If I change the code to for (int64_t i = 0; i <= argc - 16; i += 16), then the "extra" instruction is gone:
.L2:
movl -120(%rsp,%rax,4), %ecx
addq $16, %rax
cmpq %rdx, %rax
jbe .L2
But why the difference? I was thinking maybe it was due to loop invariants, but only vaguely. Then I noticed that in the 5-instruction case, the increment is done before the load, which would require an extra mov due to x86's destructive 2-operand instructions.
So another explanation could be that it's trading instruction-level parallelism for 1 extra instruction.
Although it seems there would hardly be any performance difference, can someone explain this mystery (preferably someone who knows about compiler transformations)?
Ideally I would like to keep the i + 16 <= size form, since that has a more intuitive meaning (the last element of the vector doesn't go out of bounds).
If argc were below -2147483632, and i were below 2147483632, the expression i + 16 <= argc would be required to yield an arithmetically-correct result, while the expression i <= argc - 16 would not. The need to give an arithmetically-correct result in that corner case prevents the compiler from optimizing the former expression into the latter.
