I've developed a little program that tests the performance of 32-bit Windows structured exception handling. To keep the overhead small relative to the rest, I wrote the code that generates and filters the exception in assembly.
This is the C++-code:
#include <Windows.h>
#include <iostream>
using namespace std;
bool __fastcall getPointerFaultSafe( void *volatile *from, void **to );
int main()
{
auto getThreadTimes = []( LONGLONG &kt, LONGLONG &ut )
{
union
{
FILETIME ft;
LONGLONG ll;
} creationTime, exitTime, kernelTime, userTime;
GetThreadTimes( GetCurrentThread(), &creationTime.ft, &exitTime.ft, &kernelTime.ft, &userTime.ft );
kt = kernelTime.ll;
ut = userTime.ll;
};
LONGLONG ktStart, utStart;
getThreadTimes( ktStart, utStart );
size_t const COUNT = 100'000;
void *pv;
for( size_t c = COUNT; c; --c )
getPointerFaultSafe( nullptr, &pv );
LONGLONG ktEnd, utEnd;
getThreadTimes( ktEnd, utEnd );
double ktNsPerException = (ktEnd - ktStart) * 100.0 / COUNT,
utNsPerException = (utEnd - utStart) * 100.0 / COUNT;
cout << "kernel-time per exception: " << ktNsPerException << "ns" << endl;
cout << "user-time per exception: " << utNsPerException << "ns" << endl;
return 0;
}
This is the assembly-code:
.686P
PUBLIC ?getPointerFaultSafe@@YI_NPCRAXPAPAX@Z
PUBLIC sehHandler
.SAFESEH sehHandler
sehHandler PROTO
_DATA SEGMENT
byebyeOffset dd 0
_DATA ENDS
exc_ctx_eax = 0b0h
exc_ctx_eip = 0b8h
_TEXT SEGMENT
?getPointerFaultSafe@@YI_NPCRAXPAPAX@Z PROC
ASSUME ds:_DATA
push OFFSET sehHandler
push dword ptr fs:0
mov dword ptr fs:0, esp
mov byebyeOffset, OFFSET byebye - OFFSET mightfail
mov al, 1
mightfail:
mov ecx, dword ptr [ecx]
mov dword ptr [edx], ecx
byebye:
mov edx, dword ptr [esp]
mov dword ptr fs:0, edx
add esp, 8
ret 0
?getPointerFaultSafe@@YI_NPCRAXPAPAX@Z ENDP
sehHandler PROC
mov eax, dword ptr [esp + 12]
mov dword ptr [eax + exc_ctx_eax], 0
mov edx, byebyeOffset
add [eax + exc_ctx_eip], edx
mov eax, 0
ret 0
sehHandler ENDP
_TEXT ENDS
END
How do I make the asm module of my program /SAFESEH-compatible?
Why does this program consume so much userland CPU time? After the operating system has begun handling the exception, the library code it calls should only have to save all the registers in the CONTEXT structure, fill the EXCEPTION_RECORD structure, and call the topmost exception filter, which in this case shifts execution two instructions further. When the filter returns, the library code restores all the registers and continues execution according to what I returned in EAX. All of that should not take so long that almost 1/3 of the CPU time is spent in userland. That's about 2,3ms, i.e. when my old Ryzen 1800X is boosting on one core with 4GHz, about 5.200 clock-cycles.
I'm using the byebyeOffset variable in my code to carry the distance between the unsafe instruction that might generate an access violation and the safe code after it. I initialize this variable before the unsafe instruction. It would be nicer to have this offset available statically, as an immediate, at the point where I add it to EIP in the exception-filter function sehHandler, but the labels are scoped to getPointerFaultSafe. Of course, storing the offset and fetching it from the variable takes a negligible part of the overall computation time, but a cleaner solution would be welcome.
Related
Before calling a member function of an object, the address of the object will be moved to ECX.
Inside the function, ECX will be moved to dword ptr [this], what does this mean?
C++ Source
#include <iostream>
class CAdd
{
public:
CAdd(int x, int y) : _x(x), _y(y) {}
int Do() { return _x + _y; }
private:
int _x;
int _y;
};
int main()
{
CAdd ca(1, 2);
int n = ca.Do();
std::cout << n << std::endl;
}
Disassembly
...
CAdd ca(1, 2);
00A87B4F push 2
00A87B51 push 1
00A87B53 lea ecx,[ca] ; the instance address
00A87B56 call CAdd::CAdd (0A6BA32h)
int Do() { return _x + _y; }
00A7FFB0 push ebp
00A7FFB1 mov ebp,esp
00A7FFB3 sub esp,0CCh
00A7FFB9 push ebx
00A7FFBA push esi
00A7FFBB push edi
00A7FFBC push ecx
00A7FFBD lea edi,[ebp-0Ch]
00A7FFC0 mov ecx,3
00A7FFC5 mov eax,0CCCCCCCCh
00A7FFCA rep stos dword ptr es:[edi]
00A7FFCC pop ecx
00A7FFCD mov dword ptr [this],ecx ; ========= QUESTION HERE!!! =========
00A7FFD0 mov ecx,offset _CC7F790E_main@cpp (0BC51F2h)
00A7FFD5 call @__CheckForDebuggerJustMyCode@4 (0A6AC36h)
00A7FFDA mov eax,dword ptr [this] ; ========= AND HERE!!! =========
00A7FFDD mov eax,dword ptr [eax]
00A7FFDF mov ecx,dword ptr [this]
00A7FFE2 add eax,dword ptr [ecx+4]
00A7FFE5 pop edi
00A7FFE6 pop esi
00A7FFE7 pop ebx
00A7FFE8 add esp,0CCh
00A7FFEE cmp ebp,esp
00A7FFF0 call __RTC_CheckEsp (0A69561h)
00A7FFF5 mov esp,ebp
00A7FFF7 pop ebp
00A7FFF8 ret
MSVC's own asm output (https://godbolt.org/z/h44rW3Mxh) uses _this$[ebp] with _this$ = -4. This is a debug build, which wastes instructions storing and reloading incoming register args.
_this$ = -4
int CAdd::Do(void) PROC ; CAdd::Do, COMDAT
push ebp
mov ebp, esp
push ecx ; dummy push instead of sub to reserve 4 bytes
mov DWORD PTR _this$[ebp], ecx
mov eax, DWORD PTR _this$[ebp]
...
This is just spilling the register arg to a local on the stack with that name. (The default options for the MSVC version I used on Godbolt, x86 MSVC 19.29.30136, don't include __CheckForDebuggerJustMyCode@4 or the runtime-check stack poisoning (rep stos) in Do(), but the usage of this is still there.)
Amusingly, the push ecx it uses (as a micro-optimization) instead of sub esp, 4 to reserve stack space already stored ECX, making the mov store redundant.
(AFAIK, no compilers actually use push to both initialize and make space for locals, but it would be an optimization for cases like this: What C/C++ compiler can use push pop instructions for creating local variables, instead of just increasing esp once?. Here the push is used only for its effect on ESP, without caring what it stores. That stays true even with optimization enabled, in a function that still needs to spill the register instead of keeping the value in a register.)
Your disassembler apparently folds the frame pointer (EBP +) into what it's defining as a this symbol / macro, which is confusing unless you look around at other lines to find out how it defines that text macro or whatever it is.
What disassembler are you using? The one built-in to Visual Studio's debugger?
I guess that would make sense that it's using C local var names this way, even though it looks super weird to people familiar with asm. (Because only static storage is addressable with a mode like [symbol] not involving any registers.)
I've been through a couple of threads already, and everything seems to be worded correctly, but my output_01 is returning false. It seems that the inline assembly is writing zeros to my variables, and I just can't figure out why. Below is the code from the main C file that calls the assembly, and the assembly it calls (I've thrown in the header file as well, though I don't think it has any bearing, but the devil is in the details, right?).
main file
#include <stdio.h>
#include "lab.h"
int output_01();
int main()
{
printf("Starting exercise lab_01:\n");
if(output_01())
printf("lab_01: Successful!");
else
printf("lab_01: Failed!");
return 0;
}
int output_01()
{
int eax=0;
int edx=0;
int ecx=0;
asm("call lab_01;"
"mov %0, eax;"
"mov %1, edx;"
"mov %2, ecx;"
:
:"r" (eax), "r" (edx), "r" (ecx)
:
);
if(eax==3 && edx==1 && ecx==2)
{
printf("\teax = %i\n",eax);
printf("\tedx = %i\n",edx);
printf("\tecx = %i\n",ecx);
return 1;
}
else
{
printf("\teax = %i\n",eax);
printf("\tedx = %i\n",edx);
printf("\tecx = %i\n",ecx);
return 0;
}
}
assembly file
BITS 32 ;you must specify bits mode
segment .text ;you must specify a section
GLOBAL lab_01, labSize_01
lab_01:
;instructions:
;the following registers have the following values:
;eax = 1
;edx = 2
;ecx = 3
;Make it so that the registers have the following values, using only the allowed opcodes and registers:
;eax = 3
;edx = 1
;ecx = 2
;Allowed registers: eax,ebx,ecx,edx
;Allowed opcodes: mov, int3
;Non volatile registers: ebp, ebx, edi, esi
;Volatile registers: eax, ecx, edx
;Forbidden items: immediate values, memory addresses
;;;;;;;;;;;;; EXERCISE SETUP CODE - DO NOT TOUCH
int3 ;make it 'easier' to debug
push ebx; this is to save ebx onto the stack.
mov eax, 1
mov edx, 2
mov ecx, 3
;;;;;;;;;;;;; YOUR CODE BELOW
;int3 ;make it 'easier' to debug
mov ebx, eax ;hold 1
mov eax, ecx ;eax is set 3
mov ecx, edx ;ecx is set to 2
mov edx, ebx ;edx is set to 1
int3 ;make it 'easier' to debug
;;;;;;;;;;;;; YOUR CODE ABOVE
pop ebx;
ret
labSize_01 dd $-lab_01 -1
lab.h file:
extern int lab_01();
You listed the registers as input only. You have no outputs at all. The correct asm for this is:
asm("call lab_01;"
: "=a" (eax), "=d" (edx), "=c" (ecx)
);
I was trying to reduce the clutter in my original question (below), but I am afraid that made it harder to follow. So here is the original source along with IDA's disassembly.
My question still is this: why does getStruct() pop the return argument and only the return argument off the stack? (It's calling ret 4 instead of ret for no arguments or ret 12 for all three arguments).
#include <iostream>
struct SomeStruct {
char m_buff[0x1000];
};
SomeStruct getStruct(uint32_t someArg1, uint32_t someArg2)
{
return SomeStruct();
}
int main(int argc, const char * argv[])
{
SomeStruct myLocalStruct = getStruct(0x20,0x30);
return 0;
}
; _DWORD __stdcall getStruct(unsigned int, unsigned int)
public getStruct(unsigned int, unsigned int)
getStruct(unsigned int, unsigned int) proc near ; CODE XREF: _main+4Dp
var_8 = dword ptr -8
var_4 = dword ptr -4
arg_0 = dword ptr 8
arg_4 = dword ptr 0Ch
arg_8 = dword ptr 10h
push ebp
mov ebp, esp
sub esp, 18h
mov eax, [ebp+arg_8]
mov ecx, [ebp+arg_4]
mov edx, [ebp+arg_0]
mov [ebp+var_4], ecx
mov [ebp+var_8], eax
mov eax, esp
mov [eax], edx
mov dword ptr [eax+4], 1000h
call ___bzero
add esp, 18h
pop ebp
retn 4
getStruct(unsigned int, unsigned int) endp
; ---------------------------------------------------------------------------
align 10h
; =============== S U B R O U T I N E =======================================
; Attributes: bp-based frame
; int __cdecl main(int argc, const char **argv, const char **envp)
public _main
_main proc near
var_1020 = dword ptr -1020h
var_101C = dword ptr -101Ch
var_1018 = dword ptr -1018h
var_14 = dword ptr -14h
var_10 = dword ptr -10h
var_C = dword ptr -0Ch
argc = dword ptr 8
argv = dword ptr 0Ch
envp = dword ptr 10h
push ebp
mov ebp, esp
push edi
push esi
sub esp, 1030h
mov eax, [ebp+argv]
mov ecx, [ebp+argc]
lea edx, [ebp+var_1018]
mov esi, 20h
mov edi, 30h
mov [ebp+var_C], 0
mov [ebp+var_10], ecx
mov [ebp+var_14], eax
mov [esp], edx ; ptr to destination
mov dword ptr [esp+4], 20h ; unsigned int
mov dword ptr [esp+8], 30h
mov [ebp+var_101C], esi
mov [ebp+var_1020], edi
call getStruct(uint,uint)
sub esp, 4
mov eax, 0
add esp, 1030h
pop esi
pop edi
pop ebp
retn
_main endp
Original question below:
I have some function with the following declaration:
SomeStruct getStruct(uint32_t someArg1, uint32_t someArg2);
getStruct is being called like this:
myLocalStruct = getStruct(someArg1,someArg2);
When compiling this using clang on x86 the calling code looks roughly like this:
lea esi, [ebp-myLocalStructOffset]
mov [esp], esi
mov [esp+4], someArg1
mov [esp+8], someArg2
call getStruct;
sub esp, 4;
So the caller is restoring its stack pointer after the call. Sure enough, the implementation of getStruct ends with a ret 4, effectively popping the struct's pointer.
This looks partially like cdecl, with the caller being responsible for the stack cleanup, and partially like stdcall, with the callee removing the arguments. I just cannot figure out what the reason for this approach is. Why not leave all the cleanup to the caller? Is there any benefit to this?
It looks as if you forgot to quote the few lines of assembler above the part you quoted. I assume there is something like:
sub esp,12
somewhere above what you quoted. The calling convention looks like pure stdcall, and the return value is in reality passed as a hidden pointer argument, i.e. the code is in fact compiled as if you had declared:
void __stdcall getStruct(SomeStruct *returnValue, uint32_t someArg1, uint32_t someArg2);
I've got a little problem with using MapViewOfFile. This function returns the starting address of the mapped view, so I think of it as a sequence of bytes. This is where I'm stuck:
INVOKE MapViewOfFile, hMapFile, FILE_MAP_READ, 0, 0, 0
mov pMemory, eax
mov edx, DWORD PTR [pMemory]
The pointer is correct, because saving the whole block of memory to a file works fine. So my question is: how do I refer to each individual element (byte)?
Thanks in advance
Cast pMemory to the correct type and move the pointer anywhere in the range from pMemory to pMemory + size of the mapped memory - size of the type you are accessing.
In other words, you have effectively allocated memory and associated that memory with a file, which changes as you change the memory.
In C assuming pMemory is the pointer returned by MapViewOfFile:
int x = (*(int *)pMemory); // Read the first int
char c = (*(char *)pMemory); // Read the first char
typedef struct oddball { int x; int y; int z; char str[256]; } oddball; // members are separated by semicolons
oddball *p = (oddball *)pMemory; // point to the base of the mapped memory
p += 14; // p now points to the 15th instance of oddball in the file.
// Or... p = &p[14];
p->x = 0;
p->y = 0;
p->z = 0;
strcpy( p->str, "This is the 0, 0, 0 position" );
// You've now changed the memory to which p[14] refers.
// To read every byte... (Again in C; use the compiler to generate the asm.)
// Assumes:
// fileSize is the size of the mapped memory in bytes
// pMemory is the pointer returned by MapViewOfFile
// buffer is a block of memory that will hold n bytes
// pos is the position from which you want to read
// n is the number of bytes to read from position pos and the smallest size in bytes to which buffer can point
void readBytes( unsigned int fileSize, char *pMemory, char *buffer, unsigned int n, unsigned int pos )
{
char *endP = pMemory + fileSize;
char *start = pMemory + pos;
char *end = start + n;
int i = 0;
// Code to stay within your memory boundaries
if( end > endP )
{
n -= (end - endP); // This might be wrong...
end = endP;
}
if( end < start )
return;
// end boundary check
for( ; start < end; start++, i++ )
{
buffer[i] = *start;
}
}
Here's the asm code generated from the code above by the compiler with -O2
.686P
.XMM
.model flat
PUBLIC _readBytes
_TEXT SEGMENT
_fileSize$ = 8 ; size = 4
_pMemory$ = 12 ; size = 4
_buffer$ = 16 ; size = 4
_n$ = 20 ; size = 4
_pos$ = 24 ; size = 4
_readBytes PROC ; COMDAT
mov eax, DWORD PTR _pMemory$[esp-4]
mov edx, DWORD PTR _fileSize$[esp-4]
mov ecx, DWORD PTR _n$[esp-4]
add edx, eax
add eax, DWORD PTR _pos$[esp-4]
add ecx, eax
cmp ecx, edx
jbe SHORT $LN5@readBytes
mov ecx, edx
$LN5@readBytes:
cmp eax, ecx
jae SHORT $LN1@readBytes
push esi
mov esi, DWORD PTR _buffer$[esp]
sub esi, eax
$LL3@readBytes:
mov dl, BYTE PTR [eax]
mov BYTE PTR [esi+eax], dl
inc eax
cmp eax, ecx
jb SHORT $LL3@readBytes
pop esi
$LN1@readBytes:
ret 0
_readBytes ENDP
_TEXT ENDS
END
I have a function in VS where I pass a pointer to the function. I then want to store the pointer in a register to further manipulate. How do you do that?
I have tried
float __declspec(align(16)) x[16] =
{
0.125000, 0.125000, 0.125000, 0,
-0.125000, 0.125000, -0.125000, 0,
0.125000, -0.125000, -0.125000, 0,
-0.125000, -0.125000, 0.125000, 0
};
void e()
{
__asm mov eax, x // doesn't work
__asm mov ebx, [eax]
}
void f(float *p)
{
__asm mov eax, p // does work
__asm mov ebx, [eax]
}
int main()
{
f(x);
e();
}
Option 1 actually seems to work fine. Consider the following program:
#include <stdio.h>
void f(int *p) {
__asm mov eax, p
__asm mov ebx, [eax]
// break here
}
int main()
{
int i = 0x12345678;
f(&i);
}
With Visual Studio 2008 SP1, a single-file C++ program and debug build, I'm getting the following in the registers window when stepping into the end of f():
EAX = 004DF960
EBX = 12345678
ECX = 00000000
EDX = 00000001
ESI = 00000000
EDI = 004DF884
EIP = 003013C3
ESP = 004DF7B8
EBP = 004DF884
EFL = 00000202
Looking at the values in EAX, EBX and ESP, that is pretty good evidence that you actually have the pointer you wanted in EAX. The address in EAX is just a tad higher than the one in ESP, suggesting it's one frame higher up the stack. The dereferenced value loaded into EBX suggests we got the right address.
Loading the address of a global is subtly different. The following example uses the LEA instruction to accomplish the task.
#include <stdio.h>
int a[] = { 0x1234, 0x4567 };
int main()
{
// __asm mov eax, a ; interpreted as eax <- a[0]
__asm lea eax, a ; interpreted as eax <- &a[0]
__asm mov ebx, [eax]
__asm mov ecx, [eax+4]
// break here
}
Stepping to the end of main() gets you the following register values. EAX gets the address of the first element in the array, while EBX and ECX get the values of its members.
EAX = 00157038
EBX = 00001234
ECX = 00004567
EDX = 00000001
ESI = 00000000
EDI = 0047F800
EIP = 001513C9
ESP = 0047F734
EBP = 0047F800
EFL = 00000202
The magic isn't in the LEA instruction itself. Rather, it appears that the __asm directive treats C/C++ identifiers differently depending on whether a MOV or an LEA instruction is used. Here is the ASM dump of the same program, when the MOV instruction is uncommented. Notice how the MOV instruction gets the content of a[] as its argument (DWORD PTR), while the LEA instruction gets its offset.
; ...
PUBLIC ?a@@3PAHA ; a
_DATA SEGMENT
?a@@3PAHA DD 01234H ; a
DD 04567H
_DATA ENDS
; ...
mov eax, DWORD PTR ?a@@3PAHA
lea eax, OFFSET ?a@@3PAHA
mov ebx, DWORD PTR [eax]
mov ecx, DWORD PTR [eax+4]
; ...
I'm not sure if this is correct, but have you tried casting p to an int first and then loading that value?
void f(float *p)
{
int tmp = (int)p;
// asm stuff...
}