Related
I have to convert this java program into x86 assembly language. I have already tried and coded the whole thing but my program would just get stuck in an infinite loop, or it would end wrong.
public static void main(String[] args)
{
int sum = 0;
int i = 0;
int j = 12;
int var1 = 3;
int var2 = 3;
int var3 = 0;
for(i=0; i<j; i++)
{
if (var1 > var3)
{
var1 = var1-i;
}
else
{
var3 = var3+ i;
}
sum = var1 + var2+ var3;
j = j -1;
}
}
What I have so far
.386
.model flat, stdcall
.stack 4096
ExitProcess PROTO, dwExitCode: DWORD
.data
sum SDWORD 0
i SDWORD 0
j SDWORD 12
var1 SDWORD 3
var2 SDWORD 3
var3 SDWORD 0
.code
main PROC
mov esi, i ; esi = 0
mov eax, j ; eax = 12
mov ebx, var1 ; ebx = 3
mov ecx, var2 ; ecx = 3
mov edx, var3 ; edx = 0
for_loop:
cmp esi, eax ; compare esi and eax
inc esi ; increase esi
jb begin_if ; if esi < eax (i < j) jump to begin_if
jmp end_loop ; if not, jump to end_loop
begin_if:
cmp ebx, edx ; compare ebx and edx
ja if_block ; if ebx > edx (var1 > var3) jump to if_block
jmp else_block ; if not, jump to else_block
if_block:
sub ebx, esi ; subtract esi from ebx which is i from var1
jmp end_for ; jump to end_loop
else_block:
add edx, esi ; add esi to edx which is i + var3
jmp end_for ; jump to end_loop
end_for:
add ecx, edx ; add edx to ecx which is var3 + var2 and store it in ecx
add ebx, ecx ; then add ecx to ebx which is varr2+var3+var1
mov sum, ebx ; then store ebx into sum
sub eax, 1 ; subtract 1 from ebx which is j - 1
jmp for_loop ; repeat the loop
end_loop: ; end of the loop
INVOKE ExitProcess, 0
main ENDP
END main
I have trouble because the sum needs to be 15, but my code ended when the sum is only 6.
Im trying to make a Base64Encode in inline assembler in Visual Studio.
I got this func
char* Base64Encode(char* data, int len)
{
// Tabelle mit den Zeichen für die Codierung
const char* encodeTable = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
//
char* result;
if (len > 0) // sonst gibt es nichts zu tun ...
{
int encodedLength = ((len + 2) / 3) * 4; // effektiv die Ganzzahlfassung von ceil(length/3)*4
result = new char[encodedLength+1]; // +1 wg. Null-Terminierung
_asm
{
mov esi,data
mov edi,encodeTable
xor eax, eax
// get 3 bytes
mov ah, byte ptr[esi]
mov al, byte ptr[esi+1]
shl eax,16
mov ah, byte ptr[esi+2]
//
mov edx,eax
shl eax,6
shr edx, 26
mov bl, byte ptr[edi + edx]
mov [result],bl
//
mov edx, eax
shl eax, 6
shr edx, 26
mov bl, byte ptr[edi + edx]
mov[result+1], bl
//manipulate in edx bitset3
mov edx, eax
shl eax, 6
shr edx, 26
mov bl, byte ptr[edi + edx]
mov[result+2], bl
//manipulate in edx bitset4
mov edx, eax
shl eax, 6
shr edx, 26
mov bl, byte ptr[edi + edx]
mov[result+3], bl
}
}
else
{
result = "";
}
return result;
}
The encoding is working proper, I have in bl always the right letter, but the output is not working ( result array doesn't fill with the letters, im getting the error that the operands have different sizes, I am only allowed to make changes in the __asm function ) .
Could somebody help me how to fill the result-array with the letters I get in bl? Debugging always shows me the right letters in the bl ' s if i comment out all the result lines.
EDIT:
I get nothing in the result array when i use the byte ptr.
Any ideas?
EDIT2:
The issue in your code is a matter of indirection. You define and initialize a variable result like this in the C++ code:
char* result;
result = new char[encodedLength+1];
result is a memory location that holds a pointer to an array of characters returned by new. result is not the memory location where data will be stored, but contains a pointer to that data area. You then access it in the ASM block like this:
mov [result],bl
The compiler/assembler(MASM) warned that there was an operand mismatch when it said Operands different sizes. It knew that result was the location of a 32-bit pointer (not single characters). Since result is the address containing a pointer the code above would have moved the contents of bl to the memory location result. This had the effect of changing the pointer (returned by new) not what result was pointing at.
You need to deal with indirection here. You want to get the address that is stored in result and use that as a base for memory addressing. You can choose an available register like ECX and MOV the contents of result into it. You could do that with something like this at the top of your ASM block:
mov ecx, dword ptr [result]
This takes the 32-bit(dword) value at memory location result and stores that in ECX. Now that we have the memory location to the beginning of the character buffer we can now modify all references of result in the ASM block and change it to ECX. Examples:
mov [result],bl
would become:
mov byte ptr [ecx],bl
and
mov[result+1], bl
would become:
mov byte ptr [ecx+1], bl
The second example is called base plus displacement (or offset) addressing . That link also describes all the addressing modes on x86. If you were using 16-bit code (which you aren't) there are some extra restrictions in the register choices that can be use for base and indexing.
As well user3144770 also pointed out that you didn't null terminate your string (you only allocated space for it), so at the bottom of your ASM block you should have probably used something like:
mov byte ptr[ecx+4], 0
With the changes above your code could look something like:
char* Base64Encode(char* data, int len)
{
// Tabelle mit den Zeichen für die Codierung
const char* encodeTable = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
//
char* result;
if (len > 0) // sonst gibt es nichts zu tun ...
{
int encodedLength = ((len + 2) / 3) * 4; // effektiv die Ganzzahlfassung von ceil(length/3)*4
result = new char[encodedLength + 1]; // +1 wg. Null-Terminierung
_asm
{
mov esi, data
mov edi, encodeTable
mov ecx, dword ptr [result]
xor eax, eax
// get 3 bytes
mov ah, byte ptr[esi]
mov al, byte ptr[esi + 1]
shl eax, 16
mov ah, byte ptr[esi + 2]
//
mov edx, eax
shl eax, 6
shr edx, 26
mov bl, byte ptr[edi + edx]
mov byte ptr [ecx], bl
//
mov edx, eax
shl eax, 6
shr edx, 26
mov bl, byte ptr[edi + edx]
mov byte ptr [ecx + edx + 1], bl
//manipulate in edx bitset3
mov edx, eax
shl eax, 6
shr edx, 26
mov bl, byte ptr[edi + edx]
mov byte ptr [ecx + 2], bl
//manipulate in edx bitset4
mov edx, eax
shl eax, 6
shr edx, 26
mov bl, byte ptr[edi + edx]
mov byte ptr [ecx + 3], bl
mov byte ptr[ecx + 4], 0
}
}
else
{
result = "";
}
return result;
}
Perhaps it's enough to write byte ptr
mov bl, byte ptr[edi + edx]
mov byte ptr[result], bl
Also you don't actually do the null-termination. ( +1 wg. Null-Terminierung)
mov byte ptr[result+4], 0
I ve got a little problem with using MapViewOfFile. This function returns the starting address of the mapped view so as I think it's a sequence of bytes. And this is where I ve stacked:
INVOKE MapViewOfFile, hMapFile, FILE_MAP_READ, 0, 0, 0
mov pMemory, eax
mov edx, DWORD PTR [pMemory]
The pointer is correct cause during saving as a whole block of memory to file, everything is fine. So my question is: how to refer to every single elements(bytes).
Thanks in advance
Cast pMemory to the correct type and move it around from pMemory to pMemory + size of the mapped memory - size of the type to which you refer...
In other words, you have effectively allocated memory and associated the menory with a file that is changed as you change the memory.
In C assuming pMemory is the pointer returned by MapViewOfFile:
int x = (*(int *)pMemory); // Read the first int
char c = (*(char *)pMemory); // Read the first char
typedef struct oddball { int x, int y, int z, char str[256] } oddball; // assume the typedef syntax is right...
oddball *p = (oddball *)pMemory; // point to the base of the mapped memory
p += 14; // p now points to the 15th instance of oddball in the file.
// Or... p = &p[14];
p->x = 0;
p->y = 0;
p->z = 0;
strcpy( p->str( "This is the 0, 0, 0 position" ) );
// You've now changed the memory to which p[14] refers.
// To read every byte... (Again in C, use the compiler to generate asm
// Assumes:
// fileSize is the size of the mapped memory in bytes
// pMemory is the pointer returned by MapViewOfFile
// buffer is a block of memory that will hold n bytes
// pos is the position from which you want to read
// n is the number of bytes to read from position pos and the smallest size in bytes to which buffer can point
void readBytes( unsigned int fileSize, char *pMemory, char *buffer, unsigned int n, unsigned int pos )
{
char *endP = pMemory + fileSize;
char *start = pMemory + pos;
char *end = start + n;
int i = 0;
// Code to stay within your memory boundaries
if( end > endP )
{
n -= (end - endP); // This might be wrong...
end = endP;
}
if( end < start )
return;
// end boundary check
for( ; start < end; start++, i++ )
{
buffer[i] = *start;
}
}
Here's the asm code generated from the code above by the compiler with -O2
.686P
.XMM
.model flat
PUBLIC _readBytes
_TEXT SEGMENT
_fileSize$ = 8 ; size = 4
_pMemory$ = 12 ; size = 4
_buffer$ = 16 ; size = 4
_n$ = 20 ; size = 4
_pos$ = 24 ; size = 4
_readBytes PROC ; COMDAT
mov eax, DWORD PTR _pMemory$[esp-4]
mov edx, DWORD PTR _fileSize$[esp-4]
mov ecx, DWORD PTR _n$[esp-4]
add edx, eax
add eax, DWORD PTR _pos$[esp-4]
add ecx, eax
cmp ecx, edx
jbe SHORT $LN5#readBytes
mov ecx, edx
$LN5#readBytes:
cmp eax, ecx
jae SHORT $LN1#readBytes
push esi
mov esi, DWORD PTR _buffer$[esp]
sub esi, eax
$LL3#readBytes:
mov dl, BYTE PTR [eax]
mov BYTE PTR [esi+eax], dl
inc eax
cmp eax, ecx
jb SHORT $LL3#readBytes
pop esi
$LN1#readBytes:
ret 0
_readBytes ENDP
_TEXT ENDS
END
I have the following code which is the bottleneck in one part of my application. All I do is subtract on Array from another. Both of these arrays have more around 100000 elements. I'm trying to find a way to make this more performant.
var
Array1, Array2 : array of integer;
.....
// Code that fills the arrays
.....
for ix := 0 to length(array1)-1
Array1[ix] := Array1[ix] - Array2[ix];
end;
Does anybody have a suggestion?
Running this on multiple threads, with that big an array will net linear speed-up. It's embarrassingly parallel as they say.
Running subtraction on more threads sounds good, but 100K integer sunstraction don't take a lot of CPU time, so maybe threadpool... However settings threads have also a lot of overhead, so short arrays will have slower productivity in parallel threads than in only one (main) thread!
Did you switch off in compiler settings, overflow and range checking?
You can try to use asm rutine, it is very simple...
Something like:
procedure SubArray(var ar1, ar2; length: integer);
asm
//length must be > than 0!
push ebx
lea ar1, ar1 -4
lea ar2, ar2 -4
#Loop:
mov ebx, [ar2 + length *4]
sub [ar1 + length *4], ebx
dec length
//Here you can put more folloving parts of rutine to more unrole it to speed up.
jz #exit
mov ebx, [ar2 + length *4]
sub [ar1 + length *4], ebx
dec length
//
jnz #Loop
#exit:
pop ebx
end;
begin
SubArray(Array1[0], Array2[0], length(Array1));
It can be much faster...
EDIT: Added procedure with SIMD instructions.
This procedure request SSE CPU support. It can take 4 integers in XMM register and subtract at once. There is also possibility to use movdqa instead movdqu it is faster, but you must first to ensure 16 byte aligment. You can also unrole the XMM par like in my first asm case. (I'm interesting about speed measurment. :) )
var
array1, array2: array of integer;
procedure SubQIntArray(var ar1, ar2; length: integer);
asm
//prepare length if not rounded to 4
push ecx
shr length, 2
jz #LengthToSmall
#Loop:
movdqu xmm1, [ar1] //or movdqa but ensure 16b aligment first
movdqu xmm2, [ar2] //or movdqa but ensure 16b aligment first
psubd xmm1, xmm2
movdqu [ar1], xmm1 //or movdqa but ensure 16b aligment first
add ar1, 16
add ar2, 16
dec length
jnz #Loop
#LengthToSmall:
pop ecx
push ebx
and ecx, 3
jz #Exit
mov ebx, [ar2]
sub [ar1], ebx
dec ecx
jz #Exit
mov ebx, [ar2 + 4]
sub [ar1 + 4], ebx
dec ecx
jz #Exit
mov ebx, [ar2 + 8]
sub [ar1 + 8], ebx
#Exit:
pop ebx
end;
begin
//Fill arrays first!
SubQIntArray(Array1[0], Array2[0], length(Array1));
I was very curious about speed optimisation in this simple case.
So I have made 6 simple procedures and measure CPU tick and time at array size 100000;
Pascal procedure with compiler option Range and Overflow Checking On
Pascal procedure with compiler option Range and Overflow Checking off
Classic x86 assembler procedure.
Assembler procedure with SSE instructions and unaligned 16 byte move.
Assembler procedure with SSE instructions and aligned 16 byte move.
Assembler 8 times unrolled loop with SSE instructions and aligned 16 byte move.
Check results on picture and code for more information.
To get 16 byte memory alignment first delite the dot in file 'FastMM4Options.inc' directive {$.define Align16Bytes}
!
program SubTest;
{$APPTYPE CONSOLE}
uses
//In file 'FastMM4Options.inc' delite the dot in directive {$.define Align16Bytes}
//to get 16 byte memory alignment!
FastMM4,
windows,
SysUtils;
var
Ar1 :array of integer;
Ar2 :array of integer;
ArLength :integer;
StartTicks :int64;
EndTicks :int64;
TicksPerMicroSecond :int64;
function GetCpuTicks: int64;
asm
rdtsc
end;
{$R+}
{$Q+}
procedure SubArPasRangeOvfChkOn(length: integer);
var
n: integer;
begin
for n := 0 to length -1 do
Ar1[n] := Ar1[n] - Ar2[n];
end;
{$R-}
{$Q-}
procedure SubArPas(length: integer);
var
n: integer;
begin
for n := 0 to length -1 do
Ar1[n] := Ar1[n] - Ar2[n];
end;
procedure SubArAsm(var ar1, ar2; length: integer);
asm
//Length must be > than 0!
push ebx
lea ar1, ar1 - 4
lea ar2, ar2 - 4
#Loop:
mov ebx, [ar2 + length * 4]
sub [ar1 + length * 4], ebx
dec length
jnz #Loop
#exit:
pop ebx
end;
procedure SubArAsmSimdU(var ar1, ar2; length: integer);
asm
//Prepare length
push length
shr length, 2
jz #Finish
#Loop:
movdqu xmm1, [ar1]
movdqu xmm2, [ar2]
psubd xmm1, xmm2
movdqu [ar1], xmm1
add ar1, 16
add ar2, 16
dec length
jnz #Loop
#Finish:
pop length
push ebx
and length, 3
jz #Exit
//Do rest, up to 3 subtractions...
mov ebx, [ar2]
sub [ar1], ebx
dec length
jz #Exit
mov ebx, [ar2 + 4]
sub [ar1 + 4], ebx
dec length
jz #Exit
mov ebx, [ar2 + 8]
sub [ar1 + 8], ebx
#Exit:
pop ebx
end;
procedure SubArAsmSimdA(var ar1, ar2; length: integer);
asm
push ebx
//Unfortunately delphi use first 8 bytes for dinamic array length and reference
//counter, from that reason the dinamic array address should start with $xxxxxxx8
//instead &xxxxxxx0. So we must first align ar1, ar2 pointers!
mov ebx, [ar2]
sub [ar1], ebx
dec length
jz #exit
mov ebx, [ar2 + 4]
sub [ar1 + 4], ebx
dec length
jz #exit
add ar1, 8
add ar2, 8
//Prepare length for 16 byte data transfer
push length
shr length, 2
jz #Finish
#Loop:
movdqa xmm1, [ar1]
movdqa xmm2, [ar2]
psubd xmm1, xmm2
movdqa [ar1], xmm1
add ar1, 16
add ar2, 16
dec length
jnz #Loop
#Finish:
pop length
and length, 3
jz #Exit
//Do rest, up to 3 subtractions...
mov ebx, [ar2]
sub [ar1], ebx
dec length
jz #Exit
mov ebx, [ar2 + 4]
sub [ar1 + 4], ebx
dec length
jz #Exit
mov ebx, [ar2 + 8]
sub [ar1 + 8], ebx
#Exit:
pop ebx
end;
procedure SubArAsmSimdAUnrolled8(var ar1, ar2; length: integer);
asm
push ebx
//Unfortunately delphi use first 8 bytes for dinamic array length and reference
//counter, from that reason the dinamic array address should start with $xxxxxxx8
//instead &xxxxxxx0. So we must first align ar1, ar2 pointers!
mov ebx, [ar2]
sub [ar1], ebx
dec length
jz #exit
mov ebx, [ar2 + 4]
sub [ar1 + 4], ebx
dec length
jz #exit
add ar1, 8 //Align pointer to 16 byte
add ar2, 8 //Align pointer to 16 byte
//Prepare length for 16 byte data transfer
push length
shr length, 5 //8 * 4 subtructions per loop
jz #Finish //To small for LoopUnrolled
#LoopUnrolled:
//Unrolle 1, 2, 3, 4
movdqa xmm4, [ar2]
movdqa xmm5, [16 + ar2]
movdqa xmm6, [32 + ar2]
movdqa xmm7, [48 + ar2]
//
movdqa xmm0, [ar1]
movdqa xmm1, [16 + ar1]
movdqa xmm2, [32 + ar1]
movdqa xmm3, [48 + ar1]
//
psubd xmm0, xmm4
psubd xmm1, xmm5
psubd xmm2, xmm6
psubd xmm3, xmm7
//
movdqa [48 + ar1], xmm3
movdqa [32 + ar1], xmm2
movdqa [16 + ar1], xmm1
movdqa [ar1], xmm0
//Unrolle 5, 6, 7, 8
movdqa xmm4, [64 + ar2]
movdqa xmm5, [80 + ar2]
movdqa xmm6, [96 + ar2]
movdqa xmm7, [112 + ar2]
//
movdqa xmm0, [64 + ar1]
movdqa xmm1, [80 + ar1]
movdqa xmm2, [96 + ar1]
movdqa xmm3, [112 + ar1]
//
psubd xmm0, xmm4
psubd xmm1, xmm5
psubd xmm2, xmm6
psubd xmm3, xmm7
//
movdqa [112 + ar1], xmm3
movdqa [96 + ar1], xmm2
movdqa [80 + ar1], xmm1
movdqa [64 + ar1], xmm0
//
add ar1, 128
add ar2, 128
dec length
jnz #LoopUnrolled
#FinishUnrolled:
pop length
and length, $1F
//Do rest, up to 31 subtractions...
#Finish:
mov ebx, [ar2]
sub [ar1], ebx
add ar1, 4
add ar2, 4
dec length
jnz #Finish
#Exit:
pop ebx
end;
procedure WriteOut(EndTicks: Int64; Str: string);
begin
WriteLn(Str + IntToStr(EndTicks - StartTicks)
+ ' Time: ' + IntToStr((EndTicks - StartTicks) div TicksPerMicroSecond) + 'us');
Sleep(5);
SwitchToThread;
StartTicks := GetCpuTicks;
end;
begin
ArLength := 100000;
//Set TicksPerMicroSecond
QueryPerformanceFrequency(TicksPerMicroSecond);
TicksPerMicroSecond := TicksPerMicroSecond div 1000000;
//
SetLength(Ar1, ArLength);
SetLength(Ar2, ArLength);
//Fill arrays
//...
//Tick time info
WriteLn('CPU ticks per mikro second: ' + IntToStr(TicksPerMicroSecond));
Sleep(5);
SwitchToThread;
StartTicks := GetCpuTicks;
//Test 1
SubArPasRangeOvfChkOn(ArLength);
WriteOut(GetCpuTicks, 'SubAr Pas Range and Overflow Checking On, Ticks: ');
//Test 2
SubArPas(ArLength);
WriteOut(GetCpuTicks, 'SubAr Pas, Ticks: ');
//Test 3
SubArAsm(Ar1[0], Ar2[0], ArLength);
WriteOut(GetCpuTicks, 'SubAr Asm, Ticks: ');
//Test 4
SubArAsmSimdU(Ar1[0], Ar2[0], ArLength);
WriteOut(GetCpuTicks, 'SubAr Asm SIMD mem unaligned, Ticks: ');
//Test 5
SubArAsmSimdA(Ar1[0], Ar2[0], ArLength);
WriteOut(GetCpuTicks, 'SubAr Asm with SIMD mem aligned, Ticks: ');
//Test 6
SubArAsmSimdAUnrolled8(Ar1[0], Ar2[0], ArLength);
WriteOut(GetCpuTicks, 'SubAr Asm with SIMD mem aligned 8*unrolled, Ticks: ');
//
ReadLn;
Ar1 := nil;
Ar2 := nil;
end.
...
The fastest asm procedure with 8 times unrolled SIMD instructions takes only 68us and is about 4 time faster than Pascal procedure.
As we can see the Pascal loop procedure probably isn't critical, it takes only about 277us (Overflow and Range checking off) on 2,4GHz CPU at 100000 subtractions.
So this code can't be bottleneck?
I'm not assembly expert but I think the following are near optimal if you don't take into account SIMD instructions or parallel processing, the later can be easily accomplished by passing portions of the array to the function.
like
Thread1: SubArray(ar1[0], ar2[0], 50);
Thread2: SubArray(ar1[50], ar2[50], 50);
procedure SubArray(var Array1, Array2; const Length: Integer);
var
ap1, ap2 : PInteger;
i : Integer;
begin
ap1 := #Array1;
ap2 := #Array2;
i := Length;
while i > 0 do
begin
ap1^ := ap1^ - ap2^;
Inc(ap1);
Inc(ap2);
Dec(i);
end;
end;
// similar assembly version
procedure SubArrayEx(var Array1, Array2; const Length: Integer);
asm
// eax = #Array1
// edx = #Array2
// ecx = Length
// esi = temp register for array2^
push esi
cmp ecx, 0
jle #Exit
#Loop:
mov esi, [edx]
sub [eax], esi
add eax, 4
add edx, 4
dec ecx
jnz #Loop
#Exit:
pop esi
end;
procedure Test();
var
a1, a2 : array of Integer;
i : Integer;
begin
SetLength(a1, 3);
a1[0] := 3;
a1[1] := 1;
a1[2] := 2;
SetLength(a2, 3);
a2[0] := 4;
a2[1] := 21;
a2[2] := 2;
SubArray(a1[0], a2[0], Length(a1));
for i := 0 to Length(a1) - 1 do
Writeln(a1[i]);
Readln;
end;
It's not a real answer to your question, but I would investigate if I could do the subtraction already at some time while filling the arrays with values. I would optionally even consider a third array in memory to store the result of the subtraction. In modern computing, the 'cost' of memory is considerably lower than the 'cost' of the time it takes to perform an extra action on memory.
In theory you'll gain at least a little performance when the subtraction can be done while the values are still in registers or processor cache, but in practice you just might stumble upon a few tricks that could enhance performance of the entire algorithm.
Is it possible to hook into thread termination on Windows? IOW, I would like to be notified if a thread inside the process (not interested in other processes and their threads) has terminated (either normally or - more important - forcefully).
Alternatively, hooking into thread creation would also do.
Rationale: I have a library that manages some information on per-thread basis (think of it as a process-wide per-thread cache for some information). When a thread is terminated I have to remove all thread-specific information from the cache. [Cache associations are implemented using thread ID which may get reused for future threads.]
There's no problem with "normal" execution order as the library user will detach the current thread from the library which will clear the state. Problems start to appear if somebody kills the thread owning cached resource.
The best way is to call
WaitForSingleObject with the HANDLE of the thread (call OpenThread using the thread id to get the HANDLE).
If your program is in a dll, you can set up to handle the DllMain method. This is called when a thread or process starts/ends.
For example,
library MyDLL;
uses
SysUtils, Windows;
procedure DllMain(reason: integer) ;
var
dyingThreadId: Cardinal;
begin
case reason of
DLL_THREAD_DETACH:
begin
dyingThreadId := GetCurrentThreadId();
// handle thread exit with thread id
end;
end;
end;
begin
DllProc := #DllMain;
end.
EDIT: The call is made in the context of the exiting thread, so you can call GetCurrentThreadId() to get the thread's id.
You can use the Win32_ThreadStopTrace WMI event to detect the termination of any thread in the system.
To start monitoring this event you must write a WQLsentence like this
Select * from Win32_ThreadStopTrace Within 1 Where ProcessID=PID_Of_Your_App
check this sample
uses
Classes;
type
TProcWmiEventThreadeCallBack = procedure(AObject: OleVariant) of object;
TWmiEventThread = class(TThread)
private
Success : HResult;
FSWbemLocator: OleVariant;
FWMIService : OleVariant;
FEventSource : OleVariant;
FWbemObject : OleVariant;
FCallBack : TProcWmiEventThreadeCallBack;
FWQL : string;
FServer : string;
FUser : string;
FPassword : string;
FNameSpace : string;
TimeoutMs : Integer;
procedure RunCallBack;
public
Constructor Create(CallBack : TProcWmiEventThreadeCallBack;const Server,User,PassWord,NameSpace,WQL:string;iTimeoutMs : Integer); overload;
destructor Destroy; override;
procedure Execute; override;
end;
implementation
uses
SysUtils,
ComObj,
Variants,
ActiveX;
constructor TWmiEventThread.Create(CallBack : TProcWmiEventThreadeCallBack;const Server,User,PassWord,NameSpace,WQL:string;iTimeoutMs : Integer);
begin
inherited Create(False);
FreeOnTerminate := True;
FCallBack := CallBack;
FWQL := WQL;
FServer := Server;
FUser := User;
FPassword := PassWord;
FNameSpace := NameSpace;
TimeoutMs := iTimeoutMs;
end;
destructor TWmiEventThread.Destroy;
begin
FSWbemLocator:=Unassigned;
FWMIService :=Unassigned;
FEventSource :=Unassigned;
FWbemObject :=Unassigned;
inherited;
end;
procedure TWmiEventThread.Execute;
const
wbemErrTimedout = $80043001;
begin
Success := CoInitialize(nil); //CoInitializeEx(nil, COINIT_MULTITHREADED);
try
FSWbemLocator := CreateOleObject('WbemScripting.SWbemLocator');
FWMIService := FSWbemLocator.ConnectServer(FServer, FNameSpace, FUser, FPassword);
FEventSource := FWMIService.ExecNotificationQuery(FWQL);
while not Terminated do
begin
try
FWbemObject := FEventSource.NextEvent(TimeoutMs); //set the max time to wait (ms)
except
on E:EOleException do
if EOleException(E).ErrorCode=HRESULT(wbemErrTimedout) then //Check for the timeout exception and ignore if exist
FWbemObject:=Null
else
raise;
end;
if FindVarData(FWbemObject)^.VType <> varNull then
Synchronize(RunCallBack);
FWbemObject:=Unassigned;
end;
finally
case Success of
S_OK, S_FALSE: CoUninitialize;
end;
end;
end;
procedure TWmiEventThread.RunCallBack;
begin
FCallBack(FWbemObject);
end;
Now to use this thread in your app you must call it in this way
WmiThread:=TWmiEventThread.Create(
Log,
'.',
'',
'',
'root\cimv2',
Format('Select * from Win32_ThreadStopTrace Within 1 Where ProcessID=%d',[GetCurrentProcessId]),1);
and in the callback function
procedure TForm1.Log(AObject: OleVariant);
begin
{
The OleVariant parameter has these properties
uint32 ProcessID;
uint8 SECURITY_DESCRIPTOR[];
uint32 ThreadID;
uint64 TIME_CREATED;
}
//do your stuff here
Memo1.Lines.Add(Format('Thread %s terminated ',[AObject.ThreadID]));
end;
You could use something like Detours to do API-level hooking of Win32 APIs like TerminateThread.
I'm not seeing why you need to do this, though. It sounds like you need to clear the thread's associated cache when the thread dies so you can re-use that slot if another thread with the same ID comes along. Is this correct?
If so, couldn't you just clear the cache association in DllMain when you get the DLL_THREAD_ATTACH event? This is essentially your new thread notification. At this point, you know you have a new thread, so isn't it safe to clear the existing associated cache?
The other alternative that might work is thread-local storage (TLS). You can use Win32 APIs like TlsAlloc/TlsSetValue to store thread-specific information. You could also define a variable with __declspec(thread) to have the compiler manage the TLS for you. This way, each thread maintains its own cache. The code remains the same for each thread, but the data accesses are relative to the thread.
program ThreadExitHook;
{$APPTYPE CONSOLE}
uses
Windows,
Classes,
madCodeHook;
type
TLdrShutdownThread = procedure; stdcall;
var
LdrShutdownThreadNext : TLdrShutdownThread;
procedure LdrShutdownThreadCallback; stdcall;
begin
WriteLn('Thread terminating:', GetCurrentThreadId);
LdrShutdownThreadNext;
end;
begin
HookAPI('ntdll.dll', 'LdrShutdownThread', #LdrShutdownThreadCallback, #LdrShutdownThreadNext);
TThread.CreateAnonymousThread(procedure begin
WriteLn('Hello from Thread');
Sleep(1000);
end).Start;
ReadLn;
UnhookAPI(#LdrShutdownThreadNext);
end.
Here is a version that does not depend on any external library:
program Project7;
{$APPTYPE CONSOLE}
uses
Windows,
Classes;
{==============================================================================}
function IsWin9x: Boolean;
asm
MOV EAX, FS:[030H]
TEST EAX, EAX
SETS AL
end;
{------------------------------------------------------------------------------}
function CalcJump(Src, Dest: DWORD): DWORD;
begin
if (Dest < Src) then begin
Result := Src - Dest;
Result := $FFFFFFFF - Result;
Result := Result - 4;
end else begin
Result := Dest - Src;
Result := Result - 5;
end;
end;
{------------------------------------------------------------------------------}
function OpCodeLength(Address: DWORD): DWORD; cdecl; assembler;
const
O_UNIQUE = 0;
O_PREFIX = 1;
O_IMM8 = 2;
O_IMM16 = 3;
O_IMM24 = 4;
O_IMM32 = 5;
O_IMM48 = 6;
O_MODRM = 7;
O_MODRM8 = 8;
O_MODRM32 = 9;
O_EXTENDED = 10;
O_WEIRD = 11;
O_ERROR = 12;
asm
pushad
cld
xor edx, edx
mov esi, Address
mov ebp, esp
push 1097F71Ch
push 0F71C6780h
push 17389718h
push 101CB718h
push 17302C17h
push 18173017h
push 0F715F547h
push 4C103748h
push 272CE7F7h
push 0F7AC6087h
push 1C121C52h
push 7C10871Ch
push 201C701Ch
push 4767602Bh
push 20211011h
push 40121625h
push 82872022h
push 47201220h
push 13101419h
push 18271013h
push 28858260h
push 15124045h
push 5016A0C7h
push 28191812h
push 0F2401812h
push 19154127h
push 50F0F011h
mov ecx, 15124710h
push ecx
push 11151247h
push 10111512h
push 47101115h
mov eax, 12472015h
push eax
push eax
push 12471A10h
add cl, 10h
push ecx
sub cl, 20h
push ecx
xor ecx, ecx
dec ecx
##ps:
inc ecx
mov edi, esp
##go:
lodsb
mov bh, al
##ft:
mov ah, [edi]
inc edi
shr ah, 4
sub al, ah
jnc ##ft
mov al, [edi-1]
and al, 0Fh
cmp al, O_ERROR
jnz ##i7
pop edx
not edx
##i7:
inc edx
cmp al, O_UNIQUE
jz ##t_exit
cmp al, O_PREFIX
jz ##ps
add edi, 51h
cmp al, O_EXTENDED
jz ##go
mov edi, [ebp+((1+8)*4)+4]
##i6:
inc edx
cmp al, O_IMM8
jz ##t_exit
cmp al, O_MODRM
jz ##t_modrm
cmp al, O_WEIRD
jz ##t_weird
##i5:
inc edx
cmp al, O_IMM16
jz ##t_exit
cmp al, O_MODRM8
jz ##t_modrm
##i4:
inc edx
cmp al, O_IMM24
jz ##t_exit
##i3:
inc edx
##i2:
inc edx
pushad
mov al, 66h
repnz scasb
popad
jnz ##c32
##d2:
dec edx
dec edx
##c32:
cmp al, O_MODRM32
jz ##t_modrm
sub al, O_IMM32
jz ##t_imm32
##i1:
inc edx
##t_exit:
jmp ##ASMEnded
##t_modrm:
lodsb
mov ah, al
shr al, 7
jb ##prmk
jz ##prm
add dl, 4
pushad
mov al, 67h
repnz scasb
popad
jnz ##prm
##d3: sub dl, 3
dec al
##prmk:jnz ##t_exit
inc edx
inc eax
##prm:
and ah, 00000111b
pushad
mov al, 67h
repnz scasb
popad
jz ##prm67chk
cmp ah, 04h
jz ##prmsib
cmp ah, 05h
jnz ##t_exit
##prm5chk:
dec al
jz ##t_exit
##i42: add dl, 4
jmp ##t_exit
##prm67chk:
cmp ax, 0600h
jnz ##t_exit
inc edx
jmp ##i1
##prmsib:
cmp al, 00h
jnz ##i1
lodsb
and al, 00000111b
sub al, 05h
jnz ##i1
inc edx
jmp ##i42
##t_weird:
test byte ptr [esi], 00111000b
jnz ##t_modrm
mov al, O_MODRM8
shr bh, 1
adc al, 0
jmp ##i5
##t_imm32:
sub bh, 0A0h
cmp bh, 04h
jae ##d2
pushad
mov al, 67h
repnz scasb
popad
jnz ##chk66t
##d4: dec edx
dec edx
##chk66t:
pushad
mov al, 66h
repnz scasb
popad
jz ##i1
jnz ##d2
##ASMEnded:
mov esp, ebp
mov [result+(9*4)], edx
popad
end;
{------------------------------------------------------------------------------}
function ApiHook(ModName, ApiName: PChar; FuncAddr, HookedApi: Pointer; var MainApi: Pointer): Boolean;
var
dwCount, Cnt, i, jmp: DWORD;
P: Pointer;
hMod, OldP, TMP: Cardinal;
begin
Result := False;
if IsWin9x then
Exit;
P := FuncAddr;
if P = nil then begin
hMod := GetModuleHandle(ModName);
if hMod = 0 then
hMod := LoadLibrary(ModName);
P := GetProcAddress(hMod, ApiName);
end;
if (P = nil) or (HookedApi = nil) then
Exit;
if not VirtualProtect(P, $40, PAGE_EXECUTE_READWRITE, #OldP) then
Exit;
if ((Byte(P^) = $68) and (DWORD(Pointer(DWORD(P) + 1)^) = DWORD(HookedApi))) then
Exit;
MainApi := VirtualAlloc(nil, $1000, MEM_COMMIT, PAGE_EXECUTE_READWRITE);
if MainApi = nil then
Exit;
Cnt := 0;
for dwCount := 0 to $3F do begin
Inc(Cnt, OpCodeLength(DWORD(P) + Cnt));
for i := 0 to Cnt - 1 do
PByte(MainApi)[i] := PByte(P)[i];
if Cnt > 5 then
Break;
end;
PByte(MainApi)[Cnt] := $68;
DWORD(Pointer(DWORD(MainApi) + Cnt + 1)^) := DWORD(P) + Cnt;
PByte(MainApi)[Cnt + 5] := $C3;
PByte(MainApi)[Cnt + 6] := $99;
if (OpCodeLength(DWORD(MainApi)) = 5) and
((Byte(MainApi^) = $E8) or (Byte(MainApi^) = $E9)) then
begin
jmp := DWORD(P) + DWORD(Pointer(DWORD(MainApi) + 1)^) + 5;
DWORD(Pointer(DWORD(MainApi) + 1)^) := CalcJump(DWORD(MainApi), jmp);
end;
PByte(P)[0] := $68;
DWORD(Pointer(DWORD(P) + 1)^) := DWORD(HookedApi);
PByte(P)[5] := $C3;
VirtualProtect(P, $40, OldP, #TMP);
Result := True;
end;
{------------------------------------------------------------------------------}
function ApiUnHook(ModName, ApiName: PChar; FuncAddr, HookedApi: Pointer; var MainApi: Pointer): Boolean;
var
dwCount, Cnt, i, jmp: DWORD;
P: Pointer;
hMod, OldP, TMP: Cardinal;
begin
Result := False;
if IsWin9x then
Exit;
P := FuncAddr;
if P = nil then begin
hMod := GetModuleHandle(ModName);
P := GetProcAddress(hMod, ApiName);
end;
if (P = nil) or (MainApi = nil) or (HookedApi = nil) then
Exit;
if not VirtualProtect(P, $40, PAGE_EXECUTE_READWRITE, #OldP) then
Exit;
if ((Byte(P^) <> $68) or (DWORD(Pointer(DWORD(P) + 1)^) <> DWORD(HookedApi))) then
Exit;
Cnt := 0;
for dwCount := 0 to $3F do begin
Inc(Cnt, OpCodeLength(DWORD(MainApi) + Cnt));
if (Byte(Pointer(DWORD(MainApi) + Cnt)^) = $C3) and
(Byte(Pointer(DWORD(MainApi) + Cnt + 1)^) = $99) then
Break;
for i := 0 to Cnt - 1 do
PByte(P)[i] := PByte(MainApi)[i];
end;
if (OpCodeLength(DWORD(P)) = 5) and ((Byte(P^) = $E8) or (byte(P^) = $E9)) then begin
jmp := DWORD(MainApi) + DWORD(Pointer(DWORD(MainApi) + 1)^) + 5;
DWORD(Pointer(DWORD(P) + 1)^) := CalcJump(DWORD(P), jmp);
end;
VirtualProtect(P, $40, OldP, #TMP);
VirtualFree(MainApi, 0, MEM_RELEASE);
Result := True;
end;
{==============================================================================}
type
TLdrShutdownThread = procedure; stdcall;
var
LdrShutdownThreadNext : TLdrShutdownThread;
procedure LdrShutdownThreadCallback; stdcall;
begin
WriteLn('Thread terminating:', GetCurrentThreadId);
LdrShutdownThreadNext;
end;
begin
ApiHook('ntdll.dll', 'LdrShutdownThread', nil, #LdrShutdownThreadCallback, #LdrShutdownThreadNext);
TThread.CreateAnonymousThread(procedure begin
WriteLn('Hello from Thread');
Sleep(1000);
WriteLn('Waking up');
end).Start;
ReadLn;
ApiUnHook('ntdll.dll', 'LdrShutdownThread', nil, #LdrShutdownThreadCallback, #LdrShutdownThreadNext);
TThread.CreateAnonymousThread(procedure begin
WriteLn('Hello from Thread');
Sleep(1000);
WriteLn('Waking up');
end).Start;
ReadLn;
end.
Chris' mention of DLL_THREAD_ATTACH gave me an idea ...
Basically, associating cache with thread ID is a bad thing. I have to rework my library so that a thread will initially establish some kind of handle and then manage associations using this handle.
I guess if you really want to badly enough, you could use the debugging API (e.g., WaitForDebugEvent, ContinueDebugEvent), . You'll get an EXIT_THREAD_DEBUG_EVENT when a thread exits.
I can't say that's exactly a straightforward or clean way to do it, but if you can't come up with anything else, it's probably better than nothing.
Boost provides boost::this_thread::at_thread_exit() which allows you to provide arbitrary code to run when the current thread exits. If you call this on each thread then when it exits normally the code will be run. If a thread is terminated forcibly with TerminateThread then no more code is run on that thread, so the at_thread_exit functions are not called. The only way to handle such cases would be to hook TerminateThread, though this won't necessarily handle the case that another process terminates your threads.
The only way to reliably do this is in a DLL that hooks DLL_THREAD_ATTACH and DLL_THREAD_DETACH. See previous discussion here.