Understanding non-contiguous stack addressing in gcc x86 disassembly

I am compiling this C source code with gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0.
/////////////////////////////////////////////////////////////////////////////////////////////
// Name: megabeets_0x1.c
// Description: Simple crackme intended to teach radare2 framework capabilities.
// Compilation: $ gcc megabeets_0x1.c -o megabeets_0x1 -fno-stack-protector -m32 -z execstac
//
// Author: Itay Cohen (#megabeets)
// Website: https://www.megabeets.net
/////////////////////////////////////////////////////////////////////////////////////////////
#include <stdio.h>
#include <string.h>
void rot13 (char *s) {
    if (s == NULL)
        return;
    int i;
    for (i = 0; s[i]; i++) {
        if (s[i] >= 'a' && s[i] <= 'm') { s[i] += 13; continue; }
        if (s[i] >= 'A' && s[i] <= 'M') { s[i] += 13; continue; }
        if (s[i] >= 'n' && s[i] <= 'z') { s[i] -= 13; continue; }
        if (s[i] >= 'N' && s[i] <= 'Z') { s[i] -= 13; continue; }
    }
}
int beet(char *name)
{
    char buf[128];
    strcpy(buf, name);
    char string[] = "Megabeets";
    rot13(string);
    return !strcmp(buf, string);
}
int main(int argc, char *argv[])
{
    printf("\n .:: Megabeets ::.\n");
    printf("Think you can make it?\n");
    if (argc >= 2 && beet(argv[1]))
    {
        printf("Success!\n\n");
    }
    else
        printf("Nop, Wrong argument.\n\n");
    return 0;
}
The gcc command used:
gcc megabeets_0x1.c -o test32 -fno-stack-protector -z execstack -m32 -no-pie -fno-pic
The disassembly of function beet generated using objdump looks like the following:
080485a8 <beet>:
80485a8: 55 push ebp
80485a9: 89 e5 mov ebp,esp
80485ab: 81 ec 98 00 00 00 sub esp,0x98
80485b1: 83 ec 08 sub esp,0x8
80485b4: ff 75 08 push DWORD PTR [ebp+0x8]
80485b7: 8d 85 78 ff ff ff lea eax,[ebp-0x88]
80485bd: 50 push eax
80485be: e8 6d fd ff ff call 8048330 <strcpy@plt>
80485c3: 83 c4 10 add esp,0x10
80485c6: c7 85 6e ff ff ff 4d mov DWORD PTR [ebp-0x92],0x6167654d
80485cd: 65 67 61
80485d0: c7 85 72 ff ff ff 62 mov DWORD PTR [ebp-0x8e],0x74656562
80485d7: 65 65 74
80485da: 66 c7 85 76 ff ff ff mov WORD PTR [ebp-0x8a],0x73
80485e1: 73 00
80485e3: 83 ec 0c sub esp,0xc
80485e6: 8d 85 6e ff ff ff lea eax,[ebp-0x92]
80485ec: 50 push eax
80485ed: e8 94 fe ff ff call 8048486 <rot13>
80485f2: 83 c4 10 add esp,0x10
80485f5: 83 ec 08 sub esp,0x8
80485f8: 8d 85 6e ff ff ff lea eax,[ebp-0x92]
80485fe: 50 push eax
80485ff: 8d 85 78 ff ff ff lea eax,[ebp-0x88]
8048605: 50 push eax
8048606: e8 15 fd ff ff call 8048320 <strcmp@plt>
804860b: 83 c4 10 add esp,0x10
804860e: 85 c0 test eax,eax
8048610: 0f 94 c0 sete al
8048613: 0f b6 c0 movzx eax,al
8048616: c9 leave
8048617: c3 ret
I have a few questions about this disassembly:
1. After pushing ebp and copying esp to ebp, the stack pointer is decreased by 0x98 and then by 0x8, totalling 0xA0, which leaves the stack frame aligned to 16 bytes. Why didn't the compiler subtract 0xA0 from esp directly instead of using two consecutive subtractions?
2. As can be seen from the C code, the variable buf in function beet is 128 bytes. But in this disassembly buf is at ebp-0x88, which suggests 136 bytes for the buffer. Why are 136 bytes allocated instead of 128?
3. Before calling functions like strcpy or rot13, a seemingly random number of bytes is subtracted from esp, and after the call returns a different, seemingly random number of bytes is added to esp (which I guess is to clear the arguments passed to those functions). For example, before calling rot13, 0xc is subtracted from esp, but after it returns 0x10 is added instead of 0xc. These seemingly random adjustments of esp, together with the pushes, result in non-contiguous data and lower utilization of stack memory. Is there any particular reason for this behaviour?
After searching on Google and Stack Overflow I couldn't find answers to these questions.
Thank you
NOTE:
Enabling GCC optimization results in almost the same disassembly.

1. Subtracting 0x98 from the stack pointer leaves it 16-byte aligned. The additional 8 bytes prepare for pushing the two parameters to strcpy (4 bytes each), so that the stack is 16-byte aligned again at the call.
2. It does allocate 128 bytes for buf. The additional bytes between buf and ebp are either for alignment, for compiler temporaries, or for some other purpose of the compiler. Perhaps there is space for the return value here. In any case, the compiler doesn’t end up needing to use the space. If you enable optimization, it probably wouldn’t be there.
3. As in #1, the stack pointer is adjusted before pushing the parameters for each call so that the stack is 16-byte aligned at the call.

Related

Is tooling available to 'assemble' WebAssembly to x86-64 native code?

I am guessing that a Wasm binary is usually JIT-compiled to native code, but given a Wasm source, is there a tool to see the actual generated x86-64 machine code?
Or asked in a different way, is there a tool that consumes Wasm and outputs native code?
The online WasmExplorer compiles C code to both WebAssembly and Firefox x86, using the SpiderMonkey compiler. Given the following simple function:
int testFunction(int* input, int length) {
    int sum = 0;
    for (int i = 0; i < length; ++i) {
        sum += input[i];
    }
    return sum;
}
Here is the x86 output:
wasm-function[0]:
sub rsp, 8 ; 0x000000 48 83 ec 08
cmp esi, 1 ; 0x000004 83 fe 01
jge 0x14 ; 0x000007 0f 8d 07 00 00 00
0x00000d:
xor eax, eax ; 0x00000d 33 c0
jmp 0x26 ; 0x00000f e9 12 00 00 00
0x000014:
xor eax, eax ; 0x000014 33 c0
0x000016: ; 0x000016 from: [0x000024]
mov ecx, dword ptr [r15 + rdi] ; 0x000016 41 8b 0c 3f
add eax, ecx ; 0x00001a 03 c1
add edi, 4 ; 0x00001c 83 c7 04
add esi, -1 ; 0x00001f 83 c6 ff
test esi, esi ; 0x000022 85 f6
jne 0x16 ; 0x000024 75 f0
0x000026:
nop ; 0x000026 66 90
add rsp, 8 ; 0x000028 48 83 c4 08
ret
You can view this example online.
WasmExplorer compiles code into wasm / x86 via a service. You can see the scripts that are run on GitHub, and you should be able to use these to construct a command-line tool yourself.

Why does loop alignment on 32 byte make code faster?

Look at this code:
one.cpp:
bool test(int a, int b, int c, int d);
int main() {
    volatile int va = 1;
    volatile int vb = 2;
    volatile int vc = 3;
    volatile int vd = 4;
    int a = va;
    int b = vb;
    int c = vc;
    int d = vd;
    int s = 0;
    __asm__("nop"); __asm__("nop"); __asm__("nop"); __asm__("nop");
    __asm__("nop"); __asm__("nop"); __asm__("nop"); __asm__("nop");
    __asm__("nop"); __asm__("nop"); __asm__("nop"); __asm__("nop");
    __asm__("nop"); __asm__("nop"); __asm__("nop"); __asm__("nop");
    for (int i=0; i<2000000000; i++) {
        s += test(a, b, c, d);
    }
    return s;
}
two.cpp:
bool test(int a, int b, int c, int d) {
    // return a == d || b == d || c == d;
    return false;
}
There are 16 nops in one.cpp. You can comment/uncomment them to change the alignment of the loop's entry point between 16 and 32. I've compiled them with g++ one.cpp two.cpp -O3 -mtune=native.
Here are my questions:
1. The 32-aligned version is faster than the 16-aligned version. On Sandy Bridge, the difference is 20%; on Haswell, 8%. Why is there a difference?
2. With the 32-aligned version, the code runs at the same speed on Sandy Bridge no matter which return statement is in two.cpp. I thought the return false version should be at least a little faster. But no, exactly the same speed!
3. If I remove the volatiles from one.cpp, the code becomes slower (Haswell: before: ~2.17 sec, after: ~2.38 sec). Why is that? This only happens when the loop is aligned to 32.
The fact that the 32-aligned version is faster is strange to me, because the Intel® 64 and IA-32 Architectures Optimization Reference Manual says (page 3-9):
Assembly/Compiler Coding Rule 12. (M impact, H generality) All branch targets should be 16-byte aligned.
Another little question: are there any tricks to make only this loop 32-aligned (so the rest of the code could keep using 16-byte alignment)?
Note: I've tried compilers gcc 6, gcc 7 and clang 3.9, same results.
Here's the code with volatile (the code is the same for 16/32 aligned, only the addresses differ):
0000000000000560 <main>:
560: 41 57 push r15
562: 41 56 push r14
564: 41 55 push r13
566: 41 54 push r12
568: 55 push rbp
569: 31 ed xor ebp,ebp
56b: 53 push rbx
56c: bb 00 94 35 77 mov ebx,0x77359400
571: 48 83 ec 18 sub rsp,0x18
575: c7 04 24 01 00 00 00 mov DWORD PTR [rsp],0x1
57c: c7 44 24 04 02 00 00 mov DWORD PTR [rsp+0x4],0x2
583: 00
584: c7 44 24 08 03 00 00 mov DWORD PTR [rsp+0x8],0x3
58b: 00
58c: c7 44 24 0c 04 00 00 mov DWORD PTR [rsp+0xc],0x4
593: 00
594: 44 8b 3c 24 mov r15d,DWORD PTR [rsp]
598: 44 8b 74 24 04 mov r14d,DWORD PTR [rsp+0x4]
59d: 44 8b 6c 24 08 mov r13d,DWORD PTR [rsp+0x8]
5a2: 44 8b 64 24 0c mov r12d,DWORD PTR [rsp+0xc]
5a7: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0]
5ac: 66 2e 0f 1f 84 00 00 nop WORD PTR cs:[rax+rax*1+0x0]
5b3: 00 00 00
5b6: 66 2e 0f 1f 84 00 00 nop WORD PTR cs:[rax+rax*1+0x0]
5bd: 00 00 00
5c0: 44 89 e1 mov ecx,r12d
5c3: 44 89 ea mov edx,r13d
5c6: 44 89 f6 mov esi,r14d
5c9: 44 89 ff mov edi,r15d
5cc: e8 4f 01 00 00 call 720 <test(int, int, int, int)>
5d1: 0f b6 c0 movzx eax,al
5d4: 01 c5 add ebp,eax
5d6: 83 eb 01 sub ebx,0x1
5d9: 75 e5 jne 5c0 <main+0x60>
5db: 48 83 c4 18 add rsp,0x18
5df: 89 e8 mov eax,ebp
5e1: 5b pop rbx
5e2: 5d pop rbp
5e3: 41 5c pop r12
5e5: 41 5d pop r13
5e7: 41 5e pop r14
5e9: 41 5f pop r15
5eb: c3 ret
5ec: 0f 1f 40 00 nop DWORD PTR [rax+0x0]
Without volatile:
0000000000000560 <main>:
560: 55 push rbp
561: 31 ed xor ebp,ebp
563: 53 push rbx
564: bb 00 94 35 77 mov ebx,0x77359400
569: 48 83 ec 08 sub rsp,0x8
56d: 66 0f 1f 84 00 00 00 nop WORD PTR [rax+rax*1+0x0]
574: 00 00
576: 66 2e 0f 1f 84 00 00 nop WORD PTR cs:[rax+rax*1+0x0]
57d: 00 00 00
580: b9 04 00 00 00 mov ecx,0x4
585: ba 03 00 00 00 mov edx,0x3
58a: be 02 00 00 00 mov esi,0x2
58f: bf 01 00 00 00 mov edi,0x1
594: e8 47 01 00 00 call 6e0 <test(int, int, int, int)>
599: 0f b6 c0 movzx eax,al
59c: 01 c5 add ebp,eax
59e: 83 eb 01 sub ebx,0x1
5a1: 75 dd jne 580 <main+0x20>
5a3: 48 83 c4 08 add rsp,0x8
5a7: 89 e8 mov eax,ebp
5a9: 5b pop rbx
5aa: 5d pop rbp
5ab: c3 ret
5ac: 0f 1f 40 00 nop DWORD PTR [rax+0x0]
This doesn't answer point 2 (return a == d || b == d || c == d; being the same speed as return false). That's still a maybe-interesting question, since that must compile to multiple uop-cache lines of instructions.
The fact that 32-aligned version is faster, is strange to me, because [Intel's manual says to align to 16]
That optimization-guide advice is a very general guideline, and definitely doesn't mean that larger never helps. Usually it doesn't, and padding to 32 would be more likely to hurt than help. (I-cache misses, ITLB misses, and more code bytes to load from disk).
In fact, 16B alignment is rarely necessary, especially on CPUs with a uop cache. For a small loop that can run from the loop buffer, alignment is usually totally irrelevant.
(Skylake microcode updates disabled the loop buffer to work around a partial-register AH-merging bug, SKL150. This creates problems for tiny loops that span a 32-byte boundary, which then run only one iteration per 2 clocks, instead of the one iteration per 1.5 clocks you might get from a 6 uop loop on Haswell, or on SKL with older microcode. The LSD was not re-enabled until Ice Lake; it is still broken in Kaby/Coffee/Comet Lake, which are the same microarchitecture as SKL/SKX.)
Another SKL erratum workaround created another worse code-alignment pothole: How can I mitigate the impact of the Intel jcc erratum on gcc?
16B is still not bad as a broad recommendation, but it doesn't tell you everything you need to know to understand one specific case on a couple of specific CPUs.
Compilers usually default to aligning loop branches and function entry-points, but usually don't align other branch targets. The cost of executing a NOP (and code bloat) is often larger than the likely cost of an unaligned non-loop branch target.
Code alignment has some direct and some indirect effects. The direct effects include the uop cache on Intel SnB-family. For example, see Branch alignment for loops involving micro-coded instructions on Intel SnB-family CPUs.
Another section of Intel's optimization manual goes into some detail about how the uop cache works:
2.3.2.2 Decoded ICache:
All micro-ops in a Way (uop cache line) represent instructions which are statically contiguous in the code and have their EIPs within the same aligned 32-byte region. (I think this means an instruction that extends past the boundary goes in the uop cache for the block containing its start, rather than end. Spanning instructions have to go somewhere, and the branch target address that would run the instruction is the start of the insn, so it's most useful to put it in a line for that block).
A multi micro-op instruction cannot be split across Ways.
An instruction which turns on the MSROM consumes an entire Way.
Up to two branches are allowed per Way.
A pair of macro-fused instructions is kept as one micro-op.
See also Agner Fog's microarch guide. He adds:
An unconditional jump or call always ends a μop cache line
lots of other stuff that probably isn't relevant here.
Also, if your code doesn't fit in the uop cache, it can't run from the loop buffer.
The indirect effects of alignment include:
larger/smaller code-size (L1I cache misses, TLB). Not relevant for your test
which branches alias each other in the BTB (Branch Target Buffer).
If I remove volatiles from one.cpp, code becomes slower. Why is that?
The larger instructions push the last instruction in the loop across a 32B boundary:
59e: 83 eb 01 sub ebx,0x1
5a1: 75 dd jne 580 <main+0x20>
So if you aren't running from the loop buffer (LSD), then without volatile one of the uop-cache fetch cycles gets only 1 uop.
If sub/jne macro-fuses, this might not apply. And I think only crossing a 64B boundary would break macro-fusion.
Also, those aren't real addresses. Have you checked what the addresses are after linking? There could be a 64B boundary there after linking, if the text section has less than 64B alignment.
Also related to 32-byte boundaries, the JCC erratum disables the uop cache for blocks where a branch (including macro-fused ALU+JCC) includes the last byte of the line, on Skylake CPUs.
How can I mitigate the impact of the Intel jcc erratum on gcc?
Sorry I haven't actually tested this to say more about this specific case. The point is, when you bottleneck on the front-end from stuff like having a call/ret inside a tight loop, alignment becomes important and can get extremely complex. Whether or not each later instruction crosses a boundary is affected. Do not expect it to be simple. If you've read my other answers, you'll know I'm not usually the kind of person to say "it's too complicated to fully explain", but alignment can be that way.
See also Code alignment in one object file is affecting the performance of a function in another object file
In your case, make sure tiny functions inline. Use link-time optimization if your code-base has any important tiny functions in separate .c files instead of in a .h where they can inline. Or change your code to put them in a .h.

Why are ternary and logical operators more efficient than if branches?

I stumbled upon this question/answer which mentions that in most languages, logical operators such as:
x == y && doSomething();
can be faster than doing the same thing with an if branch:
if(x == y) {
doSomething();
}
Similarly, it says that the ternary operator:
x = y == z ? 0 : 1
is usually faster than using an if branch:
if(y == z) {
x = 0;
} else {
x = 1;
}
This got me Googling, which led me to this fantastic answer which explains branch prediction.
Basically, what it says is that the CPU operates at very fast speeds, and rather than slowing down to compute every if branch, it tries to guess what outcome will take place and places the appropriate instructions in its pipeline. But if it makes the wrong guess, it will have to back up and recompute the appropriate instructions.
But this still doesn't explain to me why logical operators or the ternary operator are treated differently than if branches. Since the CPU doesn't know the outcome of x == y, shouldn't it still have to guess whether to place the call to doSomething() (and therefore, all of doSomething's code) into its pipeline? And, therefore, back up if its guess was incorrect? Similarly, for the ternary operator, shouldn't the CPU have to guess whether y == z will evaluate to true when determining what to store in x, and back up if its guess was wrong?
I don't understand why if branches are treated any differently by the compiler than any other statement which is conditional. Shouldn't all conditionals be evaluated the same way?
Short answer: they simply aren't. While helping branch prediction can improve your performance, writing the condition as part of a logical expression instead of an if doesn't change the compiled code.
If you want to help branch prediction, use __builtin_expect (for GNU compilers).
To emphasize that the two forms compile identically, let's compare the compiler output:
#include <stdio.h>
int main(){
    int foo;
    scanf("%d", &foo); /*Needed to eliminate optimizations*/
#ifdef IF
    if (foo)
        printf("Foo!");
#else
    foo && printf("Foo!");
#endif
    return 0;
}
For gcc -O3 branch.c -DIF
We get:
0000000000400540 <main>:
400540: 48 83 ec 18 sub $0x18,%rsp
400544: 31 c0 xor %eax,%eax
400546: bf 68 06 40 00 mov $0x400668,%edi
40054b: 48 8d 74 24 0c lea 0xc(%rsp),%rsi
400550: e8 e3 fe ff ff callq 400438 <__isoc99_scanf@plt>
400555: 8b 44 24 0c mov 0xc(%rsp),%eax
400559: 85 c0 test %eax,%eax #This is the relevant part
40055b: 74 0c je 400569 <main+0x29>
40055d: bf 6b 06 40 00 mov $0x40066b,%edi
400562: 31 c0 xor %eax,%eax
400564: e8 af fe ff ff callq 400418 <printf@plt>
400569: 31 c0 xor %eax,%eax
40056b: 48 83 c4 18 add $0x18,%rsp
40056f: c3 retq
And for gcc -O3 branch.c
0000000000400540 <main>:
400540: 48 83 ec 18 sub $0x18,%rsp
400544: 31 c0 xor %eax,%eax
400546: bf 68 06 40 00 mov $0x400668,%edi
40054b: 48 8d 74 24 0c lea 0xc(%rsp),%rsi
400550: e8 e3 fe ff ff callq 400438 <__isoc99_scanf@plt>
400555: 8b 44 24 0c mov 0xc(%rsp),%eax
400559: 85 c0 test %eax,%eax
40055b: 74 0c je 400569 <main+0x29>
40055d: bf 6b 06 40 00 mov $0x40066b,%edi
400562: 31 c0 xor %eax,%eax
400564: e8 af fe ff ff callq 400418 <printf@plt>
400569: 31 c0 xor %eax,%eax
40056b: 48 83 c4 18 add $0x18,%rsp
40056f: c3 retq
This is exactly the same code.
The question you linked to measures performance for JavaScript. Note that there the two cases may be translated to something different (since JavaScript is interpreted or JIT-compiled depending on the engine version).
Anyway JavaScript is not the best for learning about performance.

Why is it necessary to use edi constraint in this inline assembly?

CentOS 6.5 64-bit VPS, 500MB RAM, gcc 4.8.2
I have the following function that works only if I use edi as the constraint to hold the string pointer. If I try to use any other register or constraint (g, q, etc.), it segfaults.
BUT this problem only occurs when both link-time optimization and -O3 are used together. With -O2 it's fine. If I don't use -flto, it's fine. But with both together, the only register I can use that doesn't crash is edi.
gcc -flto
CFLAGS=-I. -flto -std=gnu11 -msse4.2 -fno-builtin-printf -Wall -Winline -Wstrict-aliasing -g -pg -O3 -lrt -lpthread
It seems like there might be some sort of register clobbering going on or something else. I'm really at a loss to understand why and how to fix this. Another interesting aspect is the generated assembly puts rdi into rdx before using the pointer but if I try to use either register as the input constraint... it segfaults! If it fails under aggressive compiling options it suggests to me either the compiler is stuffing up somehow, or more likely I'm doing something wrong.
char *sse4_strCRLF(char *str)
{
    __m128i M = _mm_set1_epi8(13);
    char *res;
    __asm__ __volatile__(
        "xor %0,%0\n\t"
        "sub $1, %1\n\t"
        "1:" "sub $15,%1\n\t"
        ".align 16\n\t"
        "2:" "add $16, %1\n\t"
        "pcmpistri $0x08,(%1),%2\n\t"
        "ja 2b\n\t"
        "jnc 2f\n\t"
        "cmpb $10,1(%1,%%rcx)\n\t"
        "jne 1b\n\t"
        "add %%rcx,%1\n\t"
        "mov %1,%0\n\t"
        "2:"
        :"=q"(res)
        :"edi"(str),"x"(M) //<-- if use anything except edi, it segfaults
        :"rcx"
    );
    return (char*) res;
}
Disassembled output:
00000000000002e0 <sse4_strCRLF>:
2e0: 55 push rbp
2e1: 48 89 e5 mov rbp,rsp
2e4: e8 00 00 00 00 call 2e9 <sse4_strCRLF+0x9>
2e9: 66 0f 6f 05 00 00 00 00 movdqa xmm0,[rip+0x0] # 2f1 <sse4_strCRLF+0x11>
2f1: 48 89 fa mov rdx,rdi //<--- puts rdi into rdx!
2f4: 48 31 c0 xor rax,rax
2f7: 48 83 ea 01 sub rdx,0x1
2fb: 48 83 ea 0f sub rdx,0xf
2ff: 90 nop
300: 48 83 c2 10 add rdx,0x10
304: 66 0f 3a 63 02 08 pcmpistri xmm0,[rdx],0x8
30a: 77 f4 ja 300 <sse4_strCRLF+0x20>
30c: 73 0d jae 31b <sse4_strCRLF+0x3b>
30e: 80 7c 0a 01 0a cmp byte[rdx+rcx*1+0x1],0xa
313: 75 e6 jne 2fb <sse4_strCRLF+0x1b>
315: 48 01 ca add rdx,rcx
318: 48 89 d0 mov rax,rdx
31b: 5d pop rbp
31c: c3 ret
@David Wohlferd gave me the answer. I was making 2 dumb mistakes due to ignorance and assumptions. The code below is modified so that the input char pointer is not modified by the routine: it is copied into a register and that register is used instead. Also, I mistakenly thought I could directly name a particular register in a constraint, as opposed to using constraint letters such as a, b, etc.
gcc still seems to be fussy about which constraints I use. E.g., if I use a and b for res and str respectively, it compiles fine but segfaults when run. But using S and D seems to work fine.
@David Wohlferd, I'd like to credit you as the answerer but I don't think I can do that for a comment.
char *sse4_strCRLF(char *str)
{
    __m128i M = _mm_set1_epi8(13);
    char *res;
    __asm__ __volatile__(
        "xor %0,%0\n\t"
        "mov %1,%%rdx\n\t"
        "sub $1,%%rdx\n\t"
        "1:" "sub $15,%%rdx\n\t"
        ".align 16\n\t"
        "2:" "add $16, %%rdx\n\t"
        "pcmpistri $0x08,(%%rdx),%2\n\t"
        "ja 2b\n\t"
        "jnc 2f\n\t"
        "cmpb $10,1(%%rdx,%%rcx)\n\t"
        "jne 1b\n\t"
        "add %%rcx,%%rdx\n\t"
        "mov %%rdx,%0\n\t"
        "2:"
        :"=S"(res)
        :"D"(str),"x"(M)
        :"rcx","rdx"
    );
    return (char*) res;
}

Why do we allocate 12 bytes for each variable?

In Visual Studio 2010 Professional (x86, Windows 7):
... more
00DC1362 B9 39 00 00 00 mov ecx,39h
00DC1367 B8 CC CC CC CC mov eax,0CCCCCCCCh
00DC136C F3 AB rep stos dword ptr es:[edi]
20: int a = 3;
00DC136E C7 45 F8 03 00 00 00 mov dword ptr [ebp-8],3
21: int b = 10;
00DC1375 C7 45 EC 0A 00 00 00 mov dword ptr [ebp-14h],0Ah
22: int c;
23: c = a + b;
00DC137C 8B 45 F8 mov eax,dword ptr [ebp-8]
00DC137F 03 45 EC add eax,dword ptr [ebp-14h]
00DC1382 89 45 E0 mov dword ptr [ebp-20h],eax
24: return 0;
Notice how the ebp-relative offsets of variables a and b are not spaced by the word size of 4?
What is happening here?
Also, why do we skip $ebp - 8?
Turning off the optimization will show the ideal addressing scheme.
Can someone please explain the reason? Thanks.
The offset between consecutive variables is 12 bytes: a -> b -> c.
I made a mistake. I meant: why do we skip the first 8 bytes?
You are looking at the code generated by the default Debug build settings, in particular the /RTC option (enable run-time error checks). Filling the stack frame with 0xcccccccc helps diagnose uninitialized variables, and the gaps around the variables help diagnose buffer overflows.
There isn't much point in looking at this code, you are not going to ship that. It is purely a Debug build artifact, only there to help you get the bugs out of the code. None of it remains in the Release build.

Resources