EDIT The real question is at the end of the post
I'm trying to understand how gcc manages the stack size, but I have some questions I can't find the answers to.
Gcc does something weird when I call a function from another one. It allocates extra bytes and I don't understand what for.
Here is the simplest C code ever:
int f(){
int i =12;
return 0;
}
int main(void){
f();
return 0;
}
and then the disassembly of f() that gdb produces:
0x08048386 <+0>: push %ebp
0x08048387 <+1>: mov %esp,%ebp
0x08048389 <+3>: sub $0x10,%esp <- this part
0x0804838c <+6>: movl $0xc,-0x4(%ebp)
0x08048393 <+13>: mov $0x0,%eax
0x08048398 <+18>: leave
0x08048399 <+19>: ret
OK, this part I understand: gcc keeps the stack 16-byte aligned, so even though i is
an int (4 bytes), gcc allocates 16 bytes on the stack for it.
But as soon as I call a function in f(), I don't get what gcc is doing.
Here is the new C code:
int g(int i){
i=12;
return i;
}
int f(){
int i =12;
g(i);
return 0;
}
int main(void){
f();
return 0;
}
And then the disassembly of f():
0x08048386 <+0>: push %ebp
0x08048387 <+1>: mov %esp,%ebp
0x08048389 <+3>: sub $0x14,%esp <- this is what I don't understand
0x0804838c <+6>: movl $0xc,-0x4(%ebp)
0x08048393 <+13>: mov -0x4(%ebp),%eax
0x08048396 <+16>: mov %eax,(%esp)
0x08048399 <+19>: call 0x8048374 <g>
0x0804839e <+24>: mov $0x0,%eax
0x080483a3 <+29>: leave
0x080483a4 <+30>: ret
Now gcc allocates 4 extra bytes, although the only change is that f()
calls g().
And it gets worse when I play with more functions.
Does anyone know what these extra bytes are for, and what gcc's
stack allocation policy is?
Thank you in advance.
EDIT: real question
OK, sorry, I wrote the question too fast. In fact I'm fine with the
sub $0x14,%esp; what I really don't understand is this piece of code:
#include <string.h>
int f(){
char i[5];
char j[5];
i[4]=0;
j[4]=0;
strcpy(i,j);
return 0;
}
int main(void){
f();
return 0;
}
And then the disassembly of f():
0x080483a4 <+0>: push %ebp
0x080483a5 <+1>: mov %esp,%ebp
0x080483a7 <+3>: sub $0x28,%esp
0x080483aa <+6>: movb $0x0,-0x9(%ebp)
0x080483ae <+10>: movb $0x0,-0xe(%ebp)
0x080483b2 <+14>: lea -0x12(%ebp),%eax
0x080483b5 <+17>: mov %eax,0x4(%esp)
0x080483b9 <+21>: lea -0xd(%ebp),%eax
0x080483bc <+24>: mov %eax,(%esp)
0x080483bf <+27>: call 0x80482d8 <strcpy@plt>
0x080483c4 <+32>: mov $0x0,%eax
0x080483c9 <+37>: leave
0x080483ca <+38>: ret
The stack looks something like this:
[oldip][oldebp][Extra(8B)][Arrays(10B)][Rearrange stack(14B)][Argument1(4B)][Argument2(4B)]
Here we see that 8 extra bytes sit between the saved ebp and the
local variables. This is what I really don't understand.
Sorry for posting too fast, and thanks again for your quick
answers.
0x08048386 <+0>: push %ebp
0x08048387 <+1>: mov %esp,%ebp
0x08048389 <+3>: sub $0x14,%esp <- includes 0x10 bytes for stack alignment and 4 bytes for 1 parameter
0x0804838c <+6>: movl $0xc,-0x4(%ebp)
0x08048393 <+13>: mov -0x4(%ebp),%eax
0x08048396 <+16>: mov %eax,(%esp)
0x08048399 <+19>: call 0x8048374 <g>
0x0804839e <+24>: mov $0x0,%eax
0x080483a3 <+29>: leave
0x080483a4 <+30>: ret
As you can see, it allocates 16 bytes for i plus stack alignment, and 4 more bytes for the one parameter. The stack will look like this:
00 7f 7c 13 --> return from call address
00 00 00 00 --> this one for parameter when call g(i) --> low address
00 00 00 00
00 00 00 00
00 00 00 00
00 00 00 0C ---> i (ignoring endianness) --> high address
I would think that in the first case the 4 bytes are for the one parameter needed when calling g(), and in the second case two 4-byte words are needed for the two parameters when calling strcpy(). Call a dummy function with three parameters and see if it changes to 12 bytes.
I am learning how to make my own instruction for the x86 architecture, but to do that I first need to understand how instructions are decoded and interpreted at a low level.
Taking a simple mov instruction as an example and using the .byte notation, I wanted to understand in detail how instructions are decoded.
My simple code is as follows:
#include <stdio.h>
#include <iostream>
int main(int argc, char const *argv[])
{
int x{5};
int y{0};
// mov %%eax, %0
asm (".byte 0x8b,0x45,0xf8\n\t" //mov %1, eax
".byte 0x89, 0xC0\n\t"
: "=r" (y)
: "r" (x)
);
printf ("dst value : %d\n", y);
return 0;
}
and when I use objdump to see how it is broken down into machine language, I get the following output:
000000000000078a <main>:
78a: 55 push %ebp
78b: 48 dec %eax
78c: 89 e5 mov %esp,%ebp
78e: 48 dec %eax
78f: 83 ec 20 sub $0x20,%esp
792: 89 7d ec mov %edi,-0x14(%ebp)
795: 48 dec %eax
796: 89 75 e0 mov %esi,-0x20(%ebp)
799: c7 45 f8 05 00 00 00 movl $0x5,-0x8(%ebp)
7a0: c7 45 fc 00 00 00 00 movl $0x0,-0x4(%ebp)
7a7: 8b 45 f8 mov -0x8(%ebp),%eax
7aa: 8b 45 f8 mov -0x8(%ebp),%eax
7ad: 89 c0 mov %eax,%eax
7af: 89 45 fc mov %eax,-0x4(%ebp)
7b2: 8b 45 fc mov -0x4(%ebp),%eax
7b5: 89 c6 mov %eax,%esi
7b7: 48 dec %eax
7b8: 8d 3d f7 00 00 00 lea 0xf7,%edi
7be: b8 00 00 00 00 mov $0x0,%eax
7c3: e8 78 fe ff ff call 640 <printf@plt>
7c8: b8 00 00 00 00 mov $0x0,%eax
7cd: c9 leave
7ce: c3 ret
With regard to this objdump output: why is the instruction 7aa: 8b 45 f8 mov -0x8(%ebp),%eax repeated twice? Is there a reason behind it, or am I doing something wrong with the .byte notation?
One of those is compiler-generated, because you asked GCC to have the input in its choice of register for you. That's what "r"(x) means. And you compiled with optimization disabled (the default -O0) so it actually stored x to memory and then reloaded it before your asm statement.
Your code has no business assuming anything about the contents of memory or where EBP points.
Since you're using 89 c0 mov %eax,%eax, the only safe constraints for your asm statement are "a" explicit-register constraints for input and output, forcing the compiler to pick that. If you compile with optimization enabled, your code totally breaks because you lied to the compiler about what your code actually does.
// constraints that match your manually-encoded instruction
asm (".byte 0x89, 0xC0\n\t"
: "=a" (y)
: "a" (x)
);
There's no constraint to force GCC to pick a certain addressing mode for a "m" source or "=m" dest operand so you need to ask for inputs/outputs in specific registers.
If you want to encode your own mov instructions differently from standard mov, see which MOV instructions in the x86 are not used or the least used, and can be used for a custom MOV extension - you might want to use a prefix in front of regular mov opcodes so you can let the assembler encode registers and addressing modes for you, like .byte something; mov %1, %0.
Look at the compiler-generated asm output (gcc -S, not disassembly of the .o or executable). Then you can see which instructions come from the asm statement and which are emitted by GCC.
If you don't explicitly reference some operands in the asm template but still want to see what the compiler picked, you can use them in asm comments like this:
asm (".byte 0x8b,0x45,0xf8 # 0 = %0 1 = %1 \n\t"
".byte 0x89, 0xC0\n\t"
: "=r" (y)
: "r" (x)
);
and gcc will fill it in for you so you can see what operands it expects you to be reading and writing. (Godbolt with g++ -m32 -O3). I put your code in void foo(){} instead of main because GCC -m32 thinks it needs to re-align the stack at the top of main. This makes the code a lot harder to follow.
# gcc-9.2 -O3 -m32 -fverbose-asm
.LC0:
.string "dst value : %d\n"
foo():
subl $20, %esp #,
movl $5, %eax #, tmp84
## Notice that GCC hasn't set up EBP at all before it runs your asm,
## and hasn't stored x in memory.
## It only put it in a register like you asked it to.
.byte 0x8b,0x45,0xf8 # 0 = %eax 1 = %eax # y, tmp84
.byte 0x89, 0xC0
pushl %eax # y
pushl $.LC0 #
call printf #
addl $28, %esp #,
ret
Also note that if you were compiling as 64-bit, it would probably pick %esi as a register because printf will want its 2nd arg there. So the "a" instead of "r" constraint would actually matter.
You could get 32-bit GCC to use a different register if you were assigning to a variable that has to survive across a function call; then GCC would pick a call-preserved reg like EBX instead of EAX.
Consider the following toy example that allocates memory on the stack by means of the alloca() function:
#include <alloca.h>
void foo() {
volatile int *p = alloca(4);
*p = 7;
}
Compiling the function above using gcc 8.2 with -O3 results in the following assembly code:
foo:
pushq %rbp
movq %rsp, %rbp
subq $16, %rsp
leaq 15(%rsp), %rax
andq $-16, %rax
movl $7, (%rax)
leave
ret
Honestly, I would have expected a more compact assembly code.
16-byte alignment for allocated memory
The instruction andq $-16, %rax in the code above results in rax containing the (only) 16-byte-aligned address between the addresses rsp and rsp + 15 (both inclusive).
This alignment enforcement is the first thing I don't understand: Why does alloca() align the allocated memory to a 16-byte boundary?
Possible missed optimization?
Let's consider anyway that we want the memory allocated by alloca() to be 16-byte aligned. Even so, in the assembly code above, keeping in mind that GCC assumes the stack to be aligned to a 16-byte boundary at the moment of performing the function call (i.e., call foo), if we pay attention to the status of the stack inside foo() just after pushing the rbp register:
Size Stack RSP mod 16 Description
-----------------------------------------------------------------------------------
------------------
| . |
| . |
| . |
------------------........0 at "call foo" (stack 16-byte aligned)
8 bytes | return address |
------------------........8 at foo entry
8 bytes | saved RBP |
------------------........0 <----- RSP is 16-byte aligned!!!
I think that by taking advantage of the red zone (i.e., no need to modify rsp) and the fact that rsp already contains a 16-byte aligned address, the following code could be used instead:
foo:
pushq %rbp
movq %rsp, %rbp
movl $7, -16(%rbp)
leave
ret
The address contained in the register rbp is 16-byte aligned, therefore rbp - 16 will also be aligned to a 16-byte boundary.
Even better, the creation of the new stack frame can be optimized away, since rsp is not modified:
foo:
movl $7, -8(%rsp)
ret
Is this just a missed optimization, or am I missing something else here?
This is (partially) a missed optimization in gcc. Clang does it as expected.
I say partially because, if you know you will be using gcc, you can use builtin functions (with conditional compilation for gcc and other compilers to keep the code portable).
__builtin_alloca_with_align is your friend ;)
Here is an example (changed so that the compiler will not reduce the function call to a single ret):
#include <alloca.h>
volatile int* p;
void foo()
{
p = alloca(4) ;
*p = 7;
}
void zoo()
{
// alignment is 16 bits, not bytes
p = __builtin_alloca_with_align(4,16) ;
*p = 7;
}
int main()
{
foo();
zoo();
}
Disassembled code (with objdump -d -w --insn-width=12 -M intel)
Clang will produce the following code (clang -O3 test.c) - both functions look alike
0000000000400480 <foo>:
400480: 48 8d 44 24 f8 lea rax,[rsp-0x8]
400485: 48 89 05 a4 0b 20 00 mov QWORD PTR [rip+0x200ba4],rax # 601030 <p>
40048c: c7 44 24 f8 07 00 00 00 mov DWORD PTR [rsp-0x8],0x7
400494: c3 ret
00000000004004a0 <zoo>:
4004a0: 48 8d 44 24 fc lea rax,[rsp-0x4]
4004a5: 48 89 05 84 0b 20 00 mov QWORD PTR [rip+0x200b84],rax # 601030 <p>
4004ac: c7 44 24 fc 07 00 00 00 mov DWORD PTR [rsp-0x4],0x7
4004b4: c3 ret
GCC produces this one (gcc -g -O3 -fno-stack-protector)
0000000000000620 <foo>:
620: 55 push rbp
621: 48 89 e5 mov rbp,rsp
624: 48 83 ec 20 sub rsp,0x20
628: 48 8d 44 24 0f lea rax,[rsp+0xf]
62d: 48 83 e0 f0 and rax,0xfffffffffffffff0
631: 48 89 05 e0 09 20 00 mov QWORD PTR [rip+0x2009e0],rax # 201018 <p>
638: c7 00 07 00 00 00 mov DWORD PTR [rax],0x7
63e: c9 leave
63f: c3 ret
0000000000000640 <zoo>:
640: 48 8d 44 24 fc lea rax,[rsp-0x4]
645: c7 44 24 fc 07 00 00 00 mov DWORD PTR [rsp-0x4],0x7
64d: 48 89 05 c4 09 20 00 mov QWORD PTR [rip+0x2009c4],rax # 201018 <p>
654: c3 ret
As you can see, zoo now looks as expected and is similar to the clang code.
The x86-64 System V ABI requires VLAs (C99 Variable Length Arrays) to be 16-byte aligned, same for automatic / static arrays that are >= 16 bytes.
It looks like gcc is treating alloca as a VLA, and failing to do constant-propagation into an alloca that only runs once per function call. (Or that it internally uses alloca for VLAs.)
A generic alloca / VLA can't use the red-zone, in case the runtime value is larger than 128 bytes. GCC also makes a stack frame with RBP instead of saving the allocation size and doing an add rsp, rdx later.
So the asm looks exactly like what it would if the size was a function arg or other runtime variable instead of a constant. That's what led me to this conclusion.
Also alignof(max_align_t) == 16, but alloca and malloc can satisfy the requirement to return memory usable for any object without 16-byte alignment for objects smaller than 16 bytes. None of the standard types have alignment requirements wider than their size in x86-64 SysV.
You're right, it should be able to optimize it to this:
void foo() {
alignas(16) int dummy[1];
volatile int *p = dummy; // alloca(4)
*p = 7;
}
and compile it to the movl $7, -8(%rsp) ; ret you suggested.
The alignas(16) might be optional here for alloca.
If you really need gcc to emit better code when constant propagation makes the arg to alloca a compile-time constant, you could consider simply using a VLA in the first place. GNU C++ supports C99-style VLAs in C++ mode, but ISO C++ (and MSVC) don't.
Or possibly use if(__builtin_constant_p(size)) { VLA version } else { alloca version }, but scoping of VLAs means you can't return a VLA from the scope of an if that detects that we're being inlined with a compile-time constant size. So you'd have to duplicate the code that needs the pointer.
Visual C++, using Microsoft's compiler, allows us to define inline assembly code using:
__asm {
nop
}
What I need is a macro that makes possible to multiply such instruction n times like:
ASM_EMIT_MULT(op, times)
for example:
ASM_EMIT_MULT(0x90, 160)
Is that possible? How could I do this?
With MASM, this is very simple to do. Part of the installation is a file named listing.inc (since everyone gets MASM as part of Visual Studio now, this will be located in your Visual Studio root directory/VC/include). This file defines a series of npad macros that take a single size argument and expand to an appropriate sequence of non-destructive "padding" opcodes. If you only need one byte of padding, you use the obvious nop instruction. But rather than using a long series of nops until you reach the desired length, Intel actually recommends other non-destructive opcodes of the appropriate length, as do other vendors. These pre-defined npad macros free you from having to memorize that table, not to mention making the code much more readable.
Unfortunately, inline assembly is not a full-featured assembler. There are a lot of things missing that you would expect to find in real assemblers like MASM. Macros (MACRO) and repeats (REPEAT/REPT) are among the things that are missing.
However, ALIGN directives are available in inline assembly. These will generate the required number of nops or other non-destructive opcodes to enforce alignment of the next instruction. Using this is drop-dead simple. Here is a very stupid example, where I've taken working code and peppered it with random aligns:
unsigned long CountDigits(unsigned long value)
{
__asm
{
mov edx, DWORD PTR [value]
bsr eax, edx
align 4
xor eax, 1073741792
mov eax, DWORD PTR [4 * eax + kMaxDigits+132]
align 16
cmp edx, DWORD PTR [4 * eax + kPowers-4]
sbb eax, 0
align 8
}
}
This generates the following output (MSVC's assembly listings use npad x, where x is the number of bytes, just as you'd write it in MASM):
PUBLIC CountDigits
_TEXT SEGMENT
_value$ = 8
CountDigits PROC
00000 8b 54 24 04 mov edx, DWORD PTR _value$[esp-4]
00004 0f bd c2 bsr eax, edx
00007 90 npad 1 ;// enforcing the "align 4"
00008 35 e0 ff ff 3f xor eax, 1073741792
0000d 8b 04 85 84 00
00 00 mov eax, DWORD PTR _kMaxDigits[eax*4+132]
00014 eb 0a 8d a4 24
00 00 00 00 8d
49 00 npad 12 ;// enforcing the "align 16"
00020 3b 14 85 fc ff
ff ff cmp edx, DWORD PTR _kPowers[eax*4-4]
00027 83 d8 00 sbb eax, 0
0002a 8d 9b 00 00 00
00 npad 6 ;// enforcing the "align 8"
00030 c2 04 00 ret 4
CountDigits ENDP
_TEXT ENDS
If you aren't actually wanting to enforce alignment, but just want to insert an arbitrary number of nops (perhaps as filler for later hot-patching?), then you can use C macros to simulate the effect:
#define NOP1 __asm { nop }
#define NOP2 NOP1 NOP1
#define NOP4 NOP2 NOP2
#define NOP8 NOP4 NOP4
#define NOP16 NOP8 NOP8
// ...
#define NOP64 NOP16 NOP16 NOP16 NOP16
// ...etc.
And then pepper your code as desired:
unsigned long CountDigits(unsigned long value)
{
__asm
{
mov edx, DWORD PTR [value]
bsr eax, edx
NOP8
xor eax, 1073741792
mov eax, DWORD PTR [4 * eax + kMaxDigits+132]
NOP4
cmp edx, DWORD PTR [4 * eax + kPowers-4]
sbb eax, 0
}
}
to produce the following output:
PUBLIC CountDigits
_TEXT SEGMENT
_value$ = 8
CountDigits PROC
00000 8b 54 24 04 mov edx, DWORD PTR _value$[esp-4]
00004 0f bd c2 bsr eax, edx
00007 90 npad 1 ;// these are, of course, just good old NOPs
00008 90 npad 1
00009 90 npad 1
0000a 90 npad 1
0000b 90 npad 1
0000c 90 npad 1
0000d 90 npad 1
0000e 90 npad 1
0000f 35 e0 ff ff 3f xor eax, 1073741792
00014 8b 04 85 84 00
00 00 mov eax, DWORD PTR _kMaxDigits[eax*4+132]
0001b 90 npad 1
0001c 90 npad 1
0001d 90 npad 1
0001e 90 npad 1
0001f 3b 14 85 fc ff
ff ff cmp edx, DWORD PTR _kPowers[eax*4-4]
00026 83 d8 00 sbb eax, 0
00029 c2 04 00 ret 4
CountDigits ENDP
_TEXT ENDS
Or, even cooler, we can use a bit of template meta-programming magic to get the same effect in style. Just define the following template function and its specialization (important to prevent infinite recursion):
template <size_t N> __forceinline void npad()
{
npad<N-1>();
__asm { nop }
}
template <> __forceinline void npad<0>() { }
And use it like this:
unsigned long CountDigits(unsigned long value)
{
__asm
{
mov edx, DWORD PTR [value]
bsr eax, edx
}
npad<8>();
__asm
{
xor eax, 1073741792
mov eax, DWORD PTR [4 * eax + kMaxDigits+132]
}
npad<4>();
__asm
{
cmp edx, DWORD PTR [4 * eax + kPowers-4]
sbb eax, 0
}
}
That'll produce the desired output (exactly the same as the one just above) in all optimized builds—whether you optimize for size (/O1) or speed (/O2)—…but not in debugging builds. If you need it in debug builds, you'll have to resort to the C macros. :-(
This is based on Cody Gray's answer and its code example for metaprogramming with template recursion and inline or __forceinline, as shown in the code above:
template <size_t N> __forceinline void npad()
{
npad<N-1>();
__asm { nop }
}
template <> __forceinline void npad<0>() { }
It won't work in Visual Studio without setting some options, and even then it is not guaranteed to work:
Although __forceinline is a stronger indication to the compiler than
__inline, inlining is still performed at the compiler's discretion, but no heuristics are used to determine the benefits from inlining this function.
You can read more about this here https://learn.microsoft.com/en-us/cpp/error-messages/compiler-warnings/compiler-warning-level-4-c4714?view=vs-2019
Visual Studio produces the following machine code when _InterlockedIncrement is used:
; 40 : _InterlockedIncrement(&framecounter);
00078 b8 00 00 00 00 mov eax, OFFSET ?framecounter@@3JA ; framecounter
0007d b9 01 00 00 00 mov ecx, 1
00082 f0 0f c1 08 lock xadd DWORD PTR [eax], ecx
If I were writing this, I would use just lock inc DWORD PTR [eax] instead of mov and xadd.
Is there a valid reason why Microsoft preferred xadd, using 2 instructions instead of 1?
Because _InterlockedIncrement also returns the new value.
You can't do that with lock inc DWORD PTR [eax], because then neither the old nor the new value is anywhere to be found. Except in memory, but if you do another read, clearly it won't be atomic (the increment itself would be, but you could get a value back that has nothing to do with what happened at the time of the increment).
Returning the value makes _InterlockedIncrement more useful.
When I call WinExec to run an .exe, I get the return value 0x21.
According to MSDN, a return value greater than 31 (0x1F) means the function succeeded.
But what does 0x21 mean? Why didn't it return some other value?
It is not useful for you to know what it means. That is an implementation detail. Even if you knew what it meant for this version, the meaning might change in the next version. As a programmer, you are concerned only with programming against the interface, not the underlying implementation.
However, if you are really interested, I will take you through the approach I would take to reverse engineer the function. On my system, WinExec is disassembled to this:
764F2C21 > 8BFF MOV EDI,EDI
764F2C23 55 PUSH EBP
764F2C24 8BEC MOV EBP,ESP
764F2C26 81EC 80000000 SUB ESP,80
764F2C2C 53 PUSH EBX
764F2C2D 8B5D 0C MOV EBX,DWORD PTR SS:[EBP+C]
764F2C30 56 PUSH ESI
764F2C31 57 PUSH EDI
764F2C32 33FF XOR EDI,EDI
764F2C34 47 INC EDI
764F2C35 33F6 XOR ESI,ESI
764F2C37 85DB TEST EBX,EBX
764F2C39 79 4F JNS SHORT kernel32.764F2C8A
764F2C3B 8D45 FC LEA EAX,DWORD PTR SS:[EBP-4]
764F2C3E 50 PUSH EAX
764F2C3F 56 PUSH ESI
764F2C40 57 PUSH EDI
764F2C41 8D45 C8 LEA EAX,DWORD PTR SS:[EBP-38]
764F2C44 50 PUSH EAX
764F2C45 C745 FC 20000000 MOV DWORD PTR SS:[EBP-4],20
764F2C4C E8 90BE0200 CALL <JMP.&API-MS-Win-Core-ProcessThread>
764F2C51 85C0 TEST EAX,EAX
764F2C53 0F84 D2000000 JE kernel32.764F2D2B
764F2C59 56 PUSH ESI
764F2C5A 56 PUSH ESI
764F2C5B 6A 04 PUSH 4
764F2C5D 8D45 F8 LEA EAX,DWORD PTR SS:[EBP-8]
764F2C60 50 PUSH EAX
764F2C61 68 01000600 PUSH 60001
764F2C66 56 PUSH ESI
764F2C67 8D45 C8 LEA EAX,DWORD PTR SS:[EBP-38]
764F2C6A 50 PUSH EAX
764F2C6B C745 0C 00000800 MOV DWORD PTR SS:[EBP+C],80000
764F2C72 897D F8 MOV DWORD PTR SS:[EBP-8],EDI
764F2C75 E8 5CBE0200 CALL <JMP.&API-MS-Win-Core-ProcessThread>
764F2C7A 85C0 TEST EAX,EAX
764F2C7C 0F84 95000000 JE kernel32.764F2D17
764F2C82 8D45 C8 LEA EAX,DWORD PTR SS:[EBP-38]
764F2C85 8945 C4 MOV DWORD PTR SS:[EBP-3C],EAX
764F2C88 EB 03 JMP SHORT kernel32.764F2C8D
764F2C8A 8975 0C MOV DWORD PTR SS:[EBP+C],ESI
764F2C8D 6A 44 PUSH 44
764F2C8F 8D45 80 LEA EAX,DWORD PTR SS:[EBP-80]
764F2C92 56 PUSH ESI
764F2C93 50 PUSH EAX
764F2C94 E8 B5E9F7FF CALL <JMP.&ntdll.memset>
764F2C99 83C4 0C ADD ESP,0C
764F2C9C 33C0 XOR EAX,EAX
764F2C9E 3975 0C CMP DWORD PTR SS:[EBP+C],ESI
764F2CA1 897D AC MOV DWORD PTR SS:[EBP-54],EDI
764F2CA4 0F95C0 SETNE AL
764F2CA7 66:895D B0 MOV WORD PTR SS:[EBP-50],BX
764F2CAB 8D0485 44000000 LEA EAX,DWORD PTR DS:[EAX*4+44]
764F2CB2 8945 80 MOV DWORD PTR SS:[EBP-80],EAX
764F2CB5 8D45 E8 LEA EAX,DWORD PTR SS:[EBP-18]
764F2CB8 50 PUSH EAX
764F2CB9 8D45 80 LEA EAX,DWORD PTR SS:[EBP-80]
764F2CBC 50 PUSH EAX
764F2CBD 56 PUSH ESI
764F2CBE 56 PUSH ESI
764F2CBF FF75 0C PUSH DWORD PTR SS:[EBP+C]
764F2CC2 56 PUSH ESI
764F2CC3 56 PUSH ESI
764F2CC4 56 PUSH ESI
764F2CC5 FF75 08 PUSH DWORD PTR SS:[EBP+8]
764F2CC8 56 PUSH ESI
764F2CC9 E8 A4E3F7FF CALL kernel32.CreateProcessA
764F2CCE 85C0 TEST EAX,EAX
764F2CD0 74 27 JE SHORT kernel32.764F2CF9
764F2CD2 A1 3C005476 MOV EAX,DWORD PTR DS:[7654003C]
764F2CD7 3BC6 CMP EAX,ESI
764F2CD9 74 0A JE SHORT kernel32.764F2CE5
764F2CDB 68 30750000 PUSH 7530
764F2CE0 FF75 E8 PUSH DWORD PTR SS:[EBP-18]
764F2CE3 FFD0 CALL EAX
764F2CE5 FF75 E8 PUSH DWORD PTR SS:[EBP-18]
764F2CE8 8B35 A0054776 MOV ESI,DWORD PTR DS:[<&ntdll.NtClose>] ; ntdll.ZwClose
764F2CEE FFD6 CALL ESI
764F2CF0 FF75 EC PUSH DWORD PTR SS:[EBP-14]
764F2CF3 FFD6 CALL ESI
764F2CF5 6A 21 PUSH 21
764F2CF7 EB 1D JMP SHORT kernel32.764F2D16
764F2CF9 E8 C9E4F7FF CALL <JMP.&API-MS-Win-Core-ErrorHandling>
764F2CFE 48 DEC EAX
764F2CFF 48 DEC EAX
764F2D00 74 12 JE SHORT kernel32.764F2D14
764F2D02 48 DEC EAX
764F2D03 74 0B JE SHORT kernel32.764F2D10
764F2D05 2D BE000000 SUB EAX,0BE
764F2D0A 75 0B JNZ SHORT kernel32.764F2D17
764F2D0C 6A 0B PUSH 0B
764F2D0E EB 06 JMP SHORT kernel32.764F2D16
764F2D10 6A 03 PUSH 3
764F2D12 EB 02 JMP SHORT kernel32.764F2D16
764F2D14 6A 02 PUSH 2
764F2D16 5E POP ESI
764F2D17 F745 0C 00000800 TEST DWORD PTR SS:[EBP+C],80000
764F2D1E 74 09 JE SHORT kernel32.764F2D29
764F2D20 8D45 C8 LEA EAX,DWORD PTR SS:[EBP-38]
764F2D23 50 PUSH EAX
764F2D24 E8 A2BD0200 CALL <JMP.&API-MS-Win-Core-ProcessThread>
764F2D29 8BC6 MOV EAX,ESI
764F2D2B 5F POP EDI
764F2D2C 5E POP ESI
764F2D2D 5B POP EBX
764F2D2E C9 LEAVE
764F2D2F C2 0800 RETN 8
The calling convention used under Win32 is stdcall, which mandates that return values be held in EAX. In the case of WinExec, there is only one exit from the function (0x764F2D2F). Tracing back from there, EAX is set by (at least when the return is 0x21):
764F2D29 8BC6 MOV EAX,ESI
Tracing back further, ESI itself is set from POP ESI which pops the top of the stack into ESI. The value of this is dependent on what was previously pushed on the stack. In the case of 0x21, this happens at:
764F2CF5 6A 21 PUSH 21
Immediately afterwards, a JMP is made to the POP ESI. How we got to the PUSH 21 is interesting only from after the CreateProcess call.
764F2CC9 E8 A4E3F7FF CALL kernel32.CreateProcessA
764F2CCE 85C0 TEST EAX,EAX
764F2CD0 74 27 JE SHORT kernel32.764F2CF9
764F2CD2 A1 3C005476 MOV EAX,DWORD PTR DS:[7654003C]
764F2CD7 3BC6 CMP EAX,ESI
764F2CD9 74 0A JE SHORT kernel32.764F2CE5
764F2CDB 68 30750000 PUSH 7530
764F2CE0 FF75 E8 PUSH DWORD PTR SS:[EBP-18]
764F2CE3 FFD0 CALL EAX
764F2CE5 FF75 E8 PUSH DWORD PTR SS:[EBP-18]
764F2CE8 8B35 A0054776 MOV ESI,DWORD PTR DS:[<&ntdll.NtClose>] ; ntdll.ZwClose
764F2CEE FFD6 CALL ESI
764F2CF0 FF75 EC PUSH DWORD PTR SS:[EBP-14]
764F2CF3 FFD6 CALL ESI
764F2CF5 6A 21 PUSH 21
To see how the path will take you to the PUSH 21, observe different branches. The first occurs as:
764F2CD0 74 27 JE SHORT kernel32.764F2CF9
This is saying if CreateProcess returned 0, call Win-Core-ErrorHandling. The return values are then set differently (0x2, 0x3 and 0xB are all possible return values if CreateProcess failed).
The next branch is a lot less obvious to reverse engineer:
764F2CD9 74 0A JE SHORT kernel32.764F2CE5
What it does is read a memory address which probably contains a function pointer (we know this because the result of the read is called later on). This JE simply indicates whether or not to make this call at all. Regardless of whether the call is made, the next step is to call ZwClose (twice). Finally 0x21 is returned.
So one simple way of looking at it is that when CreateProcess succeeds, 0x21 is returned, otherwise 0x2, 0x3 or 0xB are returned. This is not to say these are the only return values. For example, 0x0 can also be returned from the branch at 0x764F2C53 (in this case, ESI is not used in the same way at all). There are a few more possible return values but I will leave those for you to look into yourself.
What I've shown you is how to do a very shallow analysis of WinExec specifically for the 0x21 return. If you want to find out more, you need to poke around more in-depth and try to understand from a higher level what is going on. You'll be able to find out a lot more just by breakpointing the function and stepping through it (this way you can observe data values).
One other way is to look at the Wine source, where someone has already done all the hard work for you:
UINT WINAPI WinExec( LPCSTR lpCmdLine, UINT nCmdShow )
{
PROCESS_INFORMATION info;
STARTUPINFOA startup;
char *cmdline;
UINT ret;
memset( &startup, 0, sizeof(startup) );
startup.cb = sizeof(startup);
startup.dwFlags = STARTF_USESHOWWINDOW;
startup.wShowWindow = nCmdShow;
/* cmdline needs to be writable for CreateProcess */
if (!(cmdline = HeapAlloc( GetProcessHeap(), 0, strlen(lpCmdLine)+1 ))) return 0;
strcpy( cmdline, lpCmdLine );
if (CreateProcessA( NULL, cmdline, NULL, NULL, FALSE,
0, NULL, NULL, &startup, &info ))
{
/* Give 30 seconds to the app to come up */
if (wait_input_idle( info.hProcess, 30000 ) == WAIT_FAILED)
WARN("WaitForInputIdle failed: Error %d\n", GetLastError() );
ret = 33;
/* Close off the handles */
CloseHandle( info.hThread );
CloseHandle( info.hProcess );
}
else if ((ret = GetLastError()) >= 32)
{
FIXME("Strange error set by CreateProcess: %d\n", ret );
ret = 11;
}
HeapFree( GetProcessHeap(), 0, cmdline );
return ret;
}
33d is 0x21 so this actually just confirms the fruits of our earlier analysis.
In regards to the reason 0x21 is returned, my guess is that perhaps there exists more internal documentation which makes it more useful in some way.
Other than that this means success, the meaning of the return value is not defined. Perhaps it was chosen such that legacy applications will work well with this particular value. One thing is certain: there are more important things to worry about!
http://msdn.microsoft.com/en-us/library/windows/desktop/ms687393(v=vs.85).aspx
EDIT: This answer is wrong because the OP's result is not an error code. I mistakenly thought it was said that it was an error code. I still think the practical info below can be useful, plus that it can be useful to see what a wrong assumption can lead to, so I let this answer stand.
If you have installed Visual Studio (full or express edition), then you have a tool called errlook, which uses the FormatMessage API function to tell you what an error code or HRESULT value means.
In this case,
The process cannot access the file because another process has locked a portion of the file.
You can do much the same manually by looking in the <winerror.h> file. For example, type an #include of that in a C++ source file in Visual Studio, then right-click and ask it to open the header. There you will find:
//
// MessageId: ERROR_LOCK_VIOLATION
//
// MessageText:
//
// The process cannot access the file because another process has locked a portion of the file.
//
#define ERROR_LOCK_VIOLATION 33L
By the way, WinExec is just an old compatibility function. Preferably use ShellExecute or CreateProcess. The ShellExecute function is able to play more nicely with Windows Vista and 7 User Account Control, and it is simpler to use, so it is generally preferable.