Why do I find never-called instructions like nopl and nopw after ret or jmp in GCC-compiled code? [duplicate]

I've been working with C for a short while and very recently started to get into ASM. When I compile a program:
int main(void)
{
    int a = 0;
    a += 1;
    return 0;
}
The objdump disassembly shows the expected code, but also a run of nops after the ret:
...
08048394 <main>:
8048394: 55 push %ebp
8048395: 89 e5 mov %esp,%ebp
8048397: 83 ec 10 sub $0x10,%esp
804839a: c7 45 fc 00 00 00 00 movl $0x0,-0x4(%ebp)
80483a1: 83 45 fc 01 addl $0x1,-0x4(%ebp)
80483a5: b8 00 00 00 00 mov $0x0,%eax
80483aa: c9 leave
80483ab: c3 ret
80483ac: 90 nop
80483ad: 90 nop
80483ae: 90 nop
80483af: 90 nop
...
From what I've learned, nops do nothing, and since they come after the ret they would never even be executed.
My question is: why bother? Couldn't ELF (linux-x86) work with a .text section (and main) of any size?
I'd appreciate any help, just trying to learn.

First of all, gcc doesn't always do this. The padding is controlled by -falign-functions, which is automatically turned on by -O2 and -O3:
-falign-functions
-falign-functions=n
Align the start of functions to the next power-of-two greater than n, skipping up to n bytes. For instance, -falign-functions=32 aligns functions to the next 32-byte boundary, but -falign-functions=24 would align to the next 32-byte boundary only if this can be done by skipping 23 bytes or less.
-fno-align-functions and -falign-functions=1 are equivalent and mean that functions will not be aligned.
Some assemblers only support this flag when n is a power of two; in that case, it is rounded up.
If n is not specified or is zero, use a machine-dependent default.
Enabled at levels -O2, -O3.
There could be multiple reasons for doing this, but the main one on x86 is probably this:
Most processors fetch instructions in aligned 16-byte or 32-byte blocks. It can be advantageous to align critical loop entries and subroutine entries by 16 in order to minimize the number of 16-byte boundaries in the code. Alternatively, make sure that there is no 16-byte boundary in the first few instructions after a critical loop entry or subroutine entry.
(Quoted from "Optimizing subroutines in assembly language" by Agner Fog.)
edit: Here is an example that demonstrates the padding:
// align.c
int f(void) { return 0; }
int g(void) { return 0; }
When compiled using gcc 4.4.5 with default settings, I get:
align.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <f>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: b8 00 00 00 00 mov $0x0,%eax
9: c9 leaveq
a: c3 retq
000000000000000b <g>:
b: 55 push %rbp
c: 48 89 e5 mov %rsp,%rbp
f: b8 00 00 00 00 mov $0x0,%eax
14: c9 leaveq
15: c3 retq
Specifying -falign-functions gives:
align.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <f>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: b8 00 00 00 00 mov $0x0,%eax
9: c9 leaveq
a: c3 retq
b: eb 03 jmp 10 <g>
d: 90 nop
e: 90 nop
f: 90 nop
0000000000000010 <g>:
10: 55 push %rbp
11: 48 89 e5 mov %rsp,%rbp
14: b8 00 00 00 00 mov $0x0,%eax
19: c9 leaveq
1a: c3 retq

This is done to align the next function to an 8-, 16-, or 32-byte boundary.
From “Optimizing subroutines in assembly language” by A. Fog:
11.5 Alignment of code
Most microprocessors fetch code in aligned 16-byte or 32-byte blocks. If an important subroutine entry or jump label happens to be near the end of a 16-byte block then the microprocessor will only get a few useful bytes of code when fetching that block of code. It may have to fetch the next 16 bytes too before it can decode the first instructions after the label. This can be avoided by aligning important subroutine entries and loop entries by 16.
[...]
Aligning a subroutine entry is as simple as putting as many NOPs as needed before the subroutine entry to make the address divisible by 8, 16, 32 or 64, as desired.
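For what it's worth, GCC can also be asked for this per function from C. A minimal sketch using a GCC extension (the effect is the same NOP padding as -falign-functions=16, applied to just this one function):

/* GCC extension: request that f start on a 16-byte boundary. The
   assembler fills the gap before it with NOPs, producing exactly the
   kind of padding shown in the disassembly above. */
__attribute__((aligned(16)))
int f(void) { return 0; }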

As far as I remember, instructions are pipelined in the CPU, and different CPU blocks (fetcher, decoder and such) process subsequent instructions. When the RET instruction is being executed, the next few instructions have already been loaded into the pipeline. It's a guess, but you can start digging there, and if you find out more (maybe the specific number of NOPs that are safe), please share your findings.

Related

Why does loop alignment on 32 byte make code faster?

Look at this code:
one.cpp:
bool test(int a, int b, int c, int d);
int main() {
    volatile int va = 1;
    volatile int vb = 2;
    volatile int vc = 3;
    volatile int vd = 4;
    int a = va;
    int b = vb;
    int c = vc;
    int d = vd;
    int s = 0;
    __asm__("nop"); __asm__("nop"); __asm__("nop"); __asm__("nop");
    __asm__("nop"); __asm__("nop"); __asm__("nop"); __asm__("nop");
    __asm__("nop"); __asm__("nop"); __asm__("nop"); __asm__("nop");
    __asm__("nop"); __asm__("nop"); __asm__("nop"); __asm__("nop");
    for (int i = 0; i < 2000000000; i++) {
        s += test(a, b, c, d);
    }
    return s;
}
two.cpp:
bool test(int a, int b, int c, int d) {
    // return a == d || b == d || c == d;
    return false;
}
There are 16 nops in one.cpp. You can comment them out or back in to change the alignment of the loop's entry point between 16 and 32. I compiled with g++ one.cpp two.cpp -O3 -mtune=native.
Here are my questions:
The 32-aligned version is faster than the 16-aligned version: the difference is 20% on Sandy Bridge and 8% on Haswell. Why the difference?
With the 32-aligned version, the code runs at the same speed on Sandy Bridge no matter which return statement is in two.cpp. I thought the return false version should be at least a little faster. But no, exactly the same speed!
If I remove the volatiles from one.cpp, the code becomes slower (Haswell: before: ~2.17 sec, after: ~2.38 sec). Why is that? And this only happens when the loop is aligned to 32.
The fact that the 32-aligned version is faster is strange to me, because the Intel® 64 and IA-32 Architectures Optimization Reference Manual says (page 3-9):
Assembly/Compiler Coding Rule 12. (M impact, H generality) All branch targets should be 16-byte aligned.
Another little question: are there any tricks to make only this loop 32-aligned (so the rest of the code can keep using 16-byte alignment)?
Note: I've tried compilers gcc 6, gcc 7 and clang 3.9, same results.
Here's the code with volatile (the code is the same whether 16- or 32-aligned; only the addresses differ):
0000000000000560 <main>:
560: 41 57 push r15
562: 41 56 push r14
564: 41 55 push r13
566: 41 54 push r12
568: 55 push rbp
569: 31 ed xor ebp,ebp
56b: 53 push rbx
56c: bb 00 94 35 77 mov ebx,0x77359400
571: 48 83 ec 18 sub rsp,0x18
575: c7 04 24 01 00 00 00 mov DWORD PTR [rsp],0x1
57c: c7 44 24 04 02 00 00 mov DWORD PTR [rsp+0x4],0x2
583: 00
584: c7 44 24 08 03 00 00 mov DWORD PTR [rsp+0x8],0x3
58b: 00
58c: c7 44 24 0c 04 00 00 mov DWORD PTR [rsp+0xc],0x4
593: 00
594: 44 8b 3c 24 mov r15d,DWORD PTR [rsp]
598: 44 8b 74 24 04 mov r14d,DWORD PTR [rsp+0x4]
59d: 44 8b 6c 24 08 mov r13d,DWORD PTR [rsp+0x8]
5a2: 44 8b 64 24 0c mov r12d,DWORD PTR [rsp+0xc]
5a7: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0]
5ac: 66 2e 0f 1f 84 00 00 nop WORD PTR cs:[rax+rax*1+0x0]
5b3: 00 00 00
5b6: 66 2e 0f 1f 84 00 00 nop WORD PTR cs:[rax+rax*1+0x0]
5bd: 00 00 00
5c0: 44 89 e1 mov ecx,r12d
5c3: 44 89 ea mov edx,r13d
5c6: 44 89 f6 mov esi,r14d
5c9: 44 89 ff mov edi,r15d
5cc: e8 4f 01 00 00 call 720 <test(int, int, int, int)>
5d1: 0f b6 c0 movzx eax,al
5d4: 01 c5 add ebp,eax
5d6: 83 eb 01 sub ebx,0x1
5d9: 75 e5 jne 5c0 <main+0x60>
5db: 48 83 c4 18 add rsp,0x18
5df: 89 e8 mov eax,ebp
5e1: 5b pop rbx
5e2: 5d pop rbp
5e3: 41 5c pop r12
5e5: 41 5d pop r13
5e7: 41 5e pop r14
5e9: 41 5f pop r15
5eb: c3 ret
5ec: 0f 1f 40 00 nop DWORD PTR [rax+0x0]
Without volatile:
0000000000000560 <main>:
560: 55 push rbp
561: 31 ed xor ebp,ebp
563: 53 push rbx
564: bb 00 94 35 77 mov ebx,0x77359400
569: 48 83 ec 08 sub rsp,0x8
56d: 66 0f 1f 84 00 00 00 nop WORD PTR [rax+rax*1+0x0]
574: 00 00
576: 66 2e 0f 1f 84 00 00 nop WORD PTR cs:[rax+rax*1+0x0]
57d: 00 00 00
580: b9 04 00 00 00 mov ecx,0x4
585: ba 03 00 00 00 mov edx,0x3
58a: be 02 00 00 00 mov esi,0x2
58f: bf 01 00 00 00 mov edi,0x1
594: e8 47 01 00 00 call 6e0 <test(int, int, int, int)>
599: 0f b6 c0 movzx eax,al
59c: 01 c5 add ebp,eax
59e: 83 eb 01 sub ebx,0x1
5a1: 75 dd jne 580 <main+0x20>
5a3: 48 83 c4 08 add rsp,0x8
5a7: 89 e8 mov eax,ebp
5a9: 5b pop rbx
5aa: 5d pop rbp
5ab: c3 ret
5ac: 0f 1f 40 00 nop DWORD PTR [rax+0x0]
This doesn't answer point 2 (return a == d || b == d || c == d; being the same speed as return false). That's still a maybe-interesting question, since it must compile to multiple uop-cache lines of instructions.
The fact that 32-aligned version is faster, is strange to me, because [Intel's manual says to align to 16]
That optimization-guide advice is a very general guideline, and definitely doesn't mean that larger never helps. Usually it doesn't, and padding to 32 would be more likely to hurt than help. (I-cache misses, ITLB misses, and more code bytes to load from disk).
In fact, 16B alignment is rarely necessary, especially on CPUs with a uop cache. For a small loop that can run from the loop buffer, alignment is usually totally irrelevant.
(Skylake microcode updates disabled the loop buffer to work around a partial-register AH-merging bug, SKL150. This creates problems for tiny loops that span a 32-byte boundary, only running one iteration per 2 clocks, instead of the one iteration per 1.5 clocks you might get from a 6 uop loop on Haswell, or on SKL with older microcode. The LSD is not re-enabled until Ice Lake, broken in Kaby/Coffee/Comet Lake which are the same microarchitecture as SKL/SKX.)
Another SKL erratum workaround created another worse code-alignment pothole: How can I mitigate the impact of the Intel jcc erratum on gcc?
16B is still not bad as a broad recommendation, but it doesn't tell you everything you need to know to understand one specific case on a couple of specific CPUs.
Compilers usually default to aligning loop branches and function entry-points, but usually don't align other branch targets. The cost of executing a NOP (and code bloat) is often larger than the likely cost of an unaligned non-loop branch target.
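(To the side question about aligning just one loop: -falign-loops=32 applies to every loop GCC compiles, so the closest per-loop trick I know of is to emit an alignment directive by hand. A fragile sketch, not a supported idiom; the compiler is free to schedule code between the directive and the loop top:

int sum(const int *v, int n) {
    int s = 0;
    // Emit a raw assembler directive: align the next instruction to
    // 2^5 = 32 bytes. GCC/Clang pass the string straight through to
    // the assembler, which pads with multi-byte NOPs.
    __asm__ volatile (".p2align 5");
    for (int i = 0; i < n; i++)
        s += v[i];
    return s;
}

Check the generated asm to confirm the directive actually landed right before the loop.)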
Code alignment has some direct and some indirect effects. The direct effects include the uop cache on Intel SnB-family. For example, see Branch alignment for loops involving micro-coded instructions on Intel SnB-family CPUs.
Another section of Intel's optimization manual goes into some detail about how the uop cache works:
2.3.2.2 Decoded ICache:
All micro-ops in a Way (uop cache line) represent instructions which are statically contiguous in the code and have their EIPs within the same aligned 32-byte region.
(I think this means an instruction that extends past the boundary goes in the uop cache line for the block containing its start, rather than its end. Spanning instructions have to go somewhere, and the branch-target address that would run the instruction is the start of the insn, so it's most useful to put it in a line for that block.)
A multi micro-op instruction cannot be split across Ways.
An instruction which turns on the MSROM consumes an entire Way.
Up to two branches are allowed per Way.
A pair of macro-fused instructions is kept as one micro-op.
See also Agner Fog's microarch guide. He adds:
An unconditional jump or call always ends a μop cache line
and lots of other stuff that probably isn't relevant here.
Also note that if your code doesn't fit in the uop cache, it can't run from the loop buffer.
The indirect effects of alignment include:
larger/smaller code-size (L1I cache misses, TLB). Not relevant for your test
which branches alias each other in the BTB (Branch Target Buffer).
If I remove volatiles from one.cpp, code becomes slower. Why is that?
Without volatile the mov-immediate instructions are larger (5 bytes instead of a 3-byte reg-reg mov), which pushes the last instructions of the loop across a 32B boundary:
59e: 83 eb 01 sub ebx,0x1
5a1: 75 dd jne 580 <main+0x20>
So if you aren't running from the loop buffer (LSD), then without volatile one of the uop-cache fetch cycles gets only 1 uop.
If sub/jne macro-fuses, this might not apply. And I think only crossing a 64B boundary would break macro-fusion.
Also, those aren't real addresses. Have you checked what the addresses are after linking? There could be a 64B boundary there after linking, if the text section has less than 64B alignment.
Also related to 32-byte boundaries, the JCC erratum disables the uop cache for blocks where a branch (including macro-fused ALU+JCC) includes the last byte of the line, on Skylake CPUs.
How can I mitigate the impact of the Intel jcc erratum on gcc?
Sorry, I haven't actually tested this, so I can't say more about this specific case. The point is: when you bottleneck on the front end from things like having a call/ret inside a tight loop, alignment becomes important and can get extremely complex. It affects whether every later instruction crosses a given boundary or not. Do not expect it to be simple. If you've read my other answers, you'll know I'm not usually the kind of person to say "it's too complicated to fully explain", but alignment can be that way.
See also Code alignment in one object file is affecting the performance of a function in another object file
In your case, make sure tiny functions inline. Use link-time optimization if your code-base has any important tiny functions in separate .c files instead of in a .h where they can inline. Or change your code to put them in a .h.
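A minimal sketch of the header route for this benchmark (file layout hypothetical): defining the tiny function in a header lets every caller inline it, so the hot loop contains no call/ret at all and none of the alignment sensitivity discussed above.

// test.h -- replaces the definition in two.cpp
static inline bool test(int a, int b, int c, int d) {
    return a == d || b == d || c == d;
}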

Weird SSE assembler instructions for double negation

The GCC and Clang compilers seem to employ some dark magic here. The C code just negates the value of a double, but the generated instructions involve a bit-wise XOR and the instruction pointer. Can somebody explain what is happening and why it is an optimal solution? Thank you.
Contents of test.c:
void function(double *a, double *b) {
    *a = -(*b); // This line.
}
The resulting assembler instructions:
(gcc)
0000000000000000 <function>:
0: f2 0f 10 06 movsd xmm0,QWORD PTR [rsi]
4: 66 0f 57 05 00 00 00 xorpd xmm0,XMMWORD PTR [rip+0x0] # c <function+0xc>
b: 00
c: f2 0f 11 07 movsd QWORD PTR [rdi],xmm0
10: c3 ret
(clang)
0000000000000000 <function>:
0: f2 0f 10 06 movsd xmm0,QWORD PTR [rsi]
4: 0f 57 05 00 00 00 00 xorps xmm0,XMMWORD PTR [rip+0x0] # b <function+0xb>
b: 0f 13 07 movlps QWORD PTR [rdi],xmm0
e: c3 ret
The instruction at address 0x4 corresponds to "This line", but I can't understand how it works. The xorpd/xorps instructions are supposed to be bit-wise XOR, and PTR [rip] involves the instruction pointer.
I suspect that at the moment of execution rip points somewhere near the 0f 57 05 00 00 00 00 bytes, but I can't quite figure out how this works, or why both compilers choose this approach.
P.S. I should point out that this is compiled using -O3
For me, the output of gcc with the -S -O3 options for the same code is:
.file "test.c"
.text
.p2align 4,,15
.globl function
.type function, #function
function:
.LFB0:
.cfi_startproc
movsd (%rsi), %xmm0
xorpd .LC0(%rip), %xmm0
movsd %xmm0, (%rdi)
ret
.cfi_endproc
.LFE0:
.size function, .-function
.section .rodata.cst16,"aM",#progbits,16
.align 16
.LC0:
.long 0
.long -2147483648
.long 0
.long 0
.ident "GCC: (Ubuntu 6.3.0-12ubuntu2) 6.3.0 20170406"
.section .note.GNU-stack,"",#progbits
Here the xorpd instruction uses instruction-pointer-relative addressing; the offset points to the .LC0 label, whose low 64 bits hold the value 0x8000000000000000 (only bit 63 set).
.LC0:
.long 0
.long -2147483648
If your target were big-endian, these two lines would be swapped.
XORing the double value with 0x8000000000000000 flips the sign bit (bit 63), turning a positive value negative and vice versa.
clang uses the xorps instruction for the same purpose; for a plain bit-wise XOR the single/double distinction makes no difference to the result (xorps just has a one-byte-shorter encoding).
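To see the bit-level effect in isolation, here is a small C sketch of the same trick (an illustration, not the compilers' code):

#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void) {
    double d = 3.5;
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);   /* portable way to view the bits */
    bits ^= 0x8000000000000000ull;    /* flip bit 63, the sign bit */
    memcpy(&d, &bits, sizeof d);
    printf("%f\n", d);                /* prints -3.500000 */
    return 0;
}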
If you run objdump with the -r option, it will show the relocations that will be applied to the program before it runs:
objdump -d test.o -r
test.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <function>:
0: f2 0f 10 06 movsd (%rsi),%xmm0
4: 66 0f 57 05 00 00 00 xorpd 0x0(%rip),%xmm0 # c <function+0xc>
b: 00
8: R_X86_64_PC32 .LC0-0x4
c: f2 0f 11 07 movsd %xmm0,(%rdi)
10: c3 retq
Disassembly of section .text.startup:
0000000000000000 <main>:
0: 31 c0 xor %eax,%eax
2: c3 retq
Here, in the displacement field of the xorpd (the bytes ending at <function+0xb>), we have a relocation of type R_X86_64_PC32.
PS: I'm using gcc 6.3.0
xorps xmm0,XMMWORD PTR [rip+0x0]
Any part of an instruction surrounded by [] is an indirect reference to memory.
In this case a reference to the memory at address RIP+0
(It is probably not actually RIP+0 at run time; in an unlinked object file the displacement is left as 0 and filled in by the linker, as the relocation shown above indicates.)
The X64 instruction set adds instruction pointer relative addressing. This means you can have (usually read-only) data in your program that you can address easily even if the program is moved around in memory.
An XOR xmm0, Y inverts all the bits in xmm0 that are set in Y.
Negation of a float just flips the sign bit, so that's why XOR is used: specifically xorpd/xorps, because we are dealing with double and single precision floats, respectively.

Reversed Mach-O 64-bit x86 Assembly analysis

This question is for Intel x86 assembly experts to answer. Thanks for your effort in advance!
Problem Specification
I am analysing a binary file that is a Mach-O 64-bit x86 executable, working on 64-bit macOS. The assembly comes from objdump.
The problem is that when I was learning assembly I could see variable names ("$xxx"), string values in ASCII, and callee names like "call _printf".
But in this assembly I get none of the above:
no main function:
Disassembly of section __TEXT,__text:
__text:
100000c90: 55 pushq %rbp
100000c91: 48 89 e5 movq %rsp, %rbp
100000c94: 48 83 ec 10 subq $16, %rsp
100000c98: 48 8d 3d bf 02 00 00 leaq 703(%rip), %rdi
100000c9f: b0 00 movb $0, %al
100000ca1: e8 68 02 00 00 callq 616
100000ca6: 89 45 fc movl %eax, -4(%rbp)
100000ca9: 48 83 c4 10 addq $16, %rsp
100000cad: 5d popq %rbp
100000cae: c3 retq
100000caf: 90 nop
100000cb0: 55 pushq %rbp
...
The above is the code that will be executed, but I have no idea from where it is called.
Also, I am new to AT&T assembly, so could you tell me the meaning of these instructions:
0000000100000c90 pushq %rbp
0000000100000c98 leaq 0x2bf(%rip), %rdi ## literal pool for: "xxxx\n"
...
0000000100000cd0 callq 0x100000c90
Is it a loop? I am not sure, but it seems to be. And why do they use the %rip and %rdi registers? In Intel x86 I know that EIP is the instruction pointer, but I don't understand the meaning here.
call integer:
No matter what calling convention is used, I have never seen a code pattern like "call 616":
"100000cd0: e8 bb ff ff ff callq -69 <__mh_execute_header+C90>"
After ret:
ret in Intel x86 means tear down the stack frame and return control flow to the caller; a function should end there. However, after this we see code like:
100000cae: c3 retq
100000caf: 90 nop
/* new function call */
100000cb0: 55 pushq %rbp
...
It is ridiculous!
ASCII strings lost:
I have already viewed the binary in hexadecimal and recognised some ASCII strings before disassembling it.
However, in this asm file no ASCII strings occur!
Overall layout of the binary:
Disassembly of section __TEXT,__text:
__text:
from address 100000c90 to 100000ef6, 145 lines of asm code
Disassembly of section __TEXT,__stubs:
__stubs:
from address 100000efc to 100000f14, 5 lines of asm code:
100000efc: ff 25 16 01 00 00 jmp qword ptr [rip + 278]
100000f02: ff 25 18 01 00 00 jmp qword ptr [rip + 280]
100000f08: ff 25 1a 01 00 00 jmp qword ptr [rip + 282]
100000f0e: ff 25 1c 01 00 00 jmp qword ptr [rip + 284]
100000f14: ff 25 1e 01 00 00 jmp qword ptr [rip + 286]
Disassembly of section __TEXT,__stub_helper:
__stub_helper:
...
Disassembly of section __TEXT,__cstring:
__cstring:
...
Disassembly of section __TEXT,__unwind_info:
__unwind_info:
...
Disassembly of section __DATA,__nl_symbol_ptr:
__nl_symbol_ptr:
...
Disassembly of section __DATA,__got:
__got:
...
Disassembly of section __DATA,__la_symbol_ptr:
__la_symbol_ptr:
...
Disassembly of section __DATA,__data:
__data:
...
Since it might be a virus, I cannot execute it. How should I analyse it?
Update on May 21
I have already identified where the output is, and if I fully understand the data flow in this programme, I might be able to figure out possible solutions.
I would appreciate it if someone could give me a detailed explanation. Thank you!
Update on May 22
I installed macOS in VirtualBox, and after fixing the permissions with chmod I executed the programme; nothing special happened apart from two lines of output. So the result is hiding in the binary file.
You don't need a main if you are not using C. The binary header contains the entry point address.
Nothing special about call 616, it's just that you don't have (all) symbols. It's somewhat strange that objdump didn't calculate the address for you, but it should be 0x100000ca6+616.
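(Working it through: 616 = 0x268, so the target is 0x100000ca6 + 0x268 = 0x100000f0e, which is one of the jmp stubs in the __stubs section shown above, i.e. a call into a dynamically linked library function.)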
Not sure what you find ridiculous there. One function ends, another starts.
That's not a question. Yes, you can create strings at runtime so you won't have them in the image. Possibly they are encrypted.
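As a sketch of what that can look like (a toy illustration, not this binary's actual scheme), a string stored XOR-obfuscated never shows up as readable ASCII in a hex dump:

#include <stdio.h>

int main(void) {
    /* "hi" stored with every byte XORed with 0xaa; the plain string
       never appears in the binary image. */
    unsigned char s[] = { 'h' ^ 0xaa, 'i' ^ 0xaa, '\0' ^ 0xaa };
    for (unsigned i = 0; i < sizeof s; i++)
        s[i] ^= 0xaa;       /* decode in place at runtime */
    printf("%s\n", s);      /* prints "hi" */
    return 0;
}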

What's the purpose of signal pt in this example

What does callq 400b90 <signal#plt> do?
How would it look in C?
4013a2: 48 83 ec 08 sub $0x8,%rsp
4013a6: be a0 12 40 00 mov $0x4012a0,%esi
4013ab: bf 02 00 00 00 mov $0x2,%edi
4013b0: e8 db f7 ff ff callq 400b90 <signal#plt>
4013b5: 48 83 c4 08 add $0x8,%rsp
4013b9: c3 retq
What does callq 400b90 <signal#plt> do?
Call the signal function via the PLT (procedure linkage table). So more technical: It pushes the current instruction pointer onto the stack and jumps to signal#plt.
How would it look in C?
void* foo(void) {
    return signal(2, (void *) 0x4012a0);
}
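More concretely, 2 is SIGINT on Linux and 0x4012a0 is the address of some handler function in the binary, so the original source presumably looked like this (handler name made up):

#include <signal.h>

void my_handler(int sig) { (void)sig; }   /* stands in for the code at 0x4012a0 */

void install(void) {
    signal(SIGINT, my_handler);           /* SIGINT == 2 on Linux */
}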
Let's look at your code line-by-line:
sub $0x8,%rsp
This reserves some stack space. You can ignore this (the stack space is unused).
mov $0x4012a0,%esi
mov $0x2,%edi
Put the values 0x4012a0 and 0x2 in the registers ESI and EDI. By the ABI, this is how the arguments are passed to a function (EDI holds the first argument, ESI the second).
callq 400b90 <signal#plt>
Call the function signal through the PLT. The PLT exists because of dynamic linking: we cannot know where the signal function will end up in memory when this binary is built. Basically, the PLT entry looks up the final memory location and calls signal there.
add $0x8,%rsp
retq
Undo the sub from earlier and return to the caller.

Why do we allocate 12 bytes for each variable?

In Visual Studio 2010 Professional (x86, Windows 7):
... more
00DC1362 B9 39 00 00 00 mov ecx,39h
00DC1367 B8 CC CC CC CC mov eax,0CCCCCCCCh
00DC136C F3 AB rep stos dword ptr es:[edi]
20: int a = 3;
00DC136E C7 45 F8 03 00 00 00 mov dword ptr [ebp-8],3
21: int b = 10;
00DC1375 C7 45 EC 0A 00 00 00 mov dword ptr [ebp-14h],0Ah
22: int c;
23: c = a + b;
00DC137C 8B 45 F8 mov eax,dword ptr [ebp-8]
00DC137F 03 45 EC add eax,dword ptr [ebp-14h]
00DC1382 89 45 E0 mov dword ptr [ebp-20h],eax
24: return 0;
Notice how the stack offsets of variables a and b are not spaced by the word size of 4?
What is happening here?
Also, why do we skip ebp-8?
Turning optimization on shows the ideal addressing scheme instead.
Can someone please explain the reason? Thanks.
The offset between consecutive variables is 12 bytes: a -> b -> c.
Correction: I meant why do we skip the first 8 bytes.
You are looking at the code generated by the default Debug build settings, particularly the /RTC option (enable run-time error checks). Filling the stack frame with 0xCCCCCCCC helps diagnose uninitialized variables; the gaps around the variables help diagnose buffer overflows.
There isn't much point in studying this code; you are not going to ship it. It is purely a Debug build artifact, there only to help you get the bugs out of the code. None of it remains in the Release build.
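As a sketch of what the 0xCC fill buys you (assuming a Debug build with the default /RTC checks enabled):

/* In a Debug build, x reads back as the fill value 0xCCCCCCCC instead
   of some arbitrary stale value, and the run-time check reports that
   x is being used without being initialized. */
int broken(void) {
    int x;      /* never initialized */
    return x;   /* trips the /RTCu check */
}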
