Find the most significant DWORD in a DWORD array - algorithm

I want to find the most significant DWORD which isn't equal to 0 in a DWORD array. The algorithm should be optimized for data sizes up to 128 bytes.
I've made three different functions, which all return the index of that DWORD.
unsigned long msb_msvc(long* dw, std::intptr_t n)
{
    while( --n )
    {
        if( dw[n] )
            break;
    }
    return n;
}
static inline unsigned long msb_386(long* dw, std::intptr_t n)
{
    __asm
    {
        mov ecx, [dw]
        mov eax, [n]
__loop: sub eax, 1
        jz SHORT __exit
        cmp DWORD PTR [ecx + eax * 4], 0
        jz SHORT __loop
__exit:
    }
}
static inline unsigned long msb_sse2(long* dw, std::intptr_t n)
{
    __asm
    {
        mov ecx, [dw]
        mov eax, [n]
        test ecx, 0x0f
        jnz SHORT __128_unaligned
__128_aligned:
        cmp eax, 4
        jb SHORT __64
        sub eax, 4
        movdqa xmm0, XMMWORD PTR [ecx + eax * 4]
        pxor xmm1, xmm1
        pcmpeqd xmm0, xmm1
        pmovmskb edx, xmm0
        not edx
        and edx, 0xffff
        jz SHORT __128_aligned
        jmp SHORT __exit
__128_unaligned:
        cmp eax, 4
        jb SHORT __64
        sub eax, 4
        movdqu xmm0, XMMWORD PTR [ecx + eax * 4]
        pxor xmm1, xmm1
        pcmpeqd xmm0, xmm1
        pmovmskb edx, xmm0
        not edx
        and edx, 0xffff
        jz SHORT __128_unaligned
        jmp SHORT __exit
__64:
        cmp eax, 2
        jb __32
        sub eax, 2
        movq mm0, MMWORD PTR [ecx + eax * 4]
        pxor mm1, mm1
        pcmpeqd mm0, mm1
        pmovmskb edx, mm0
        not edx
        and edx, 0xff
        emms
        jz SHORT __64
        jmp SHORT __exit
__32:
        test eax, eax
        jz SHORT __exit
        xor eax, eax
        jmp __leave ; retn
__exit:
        bsr edx, edx
        shr edx, 2
        add eax, edx
__leave:
    }
}
These functions will be used to preselect data that is then compared against each other, so they need to be fast.
Does anybody know a better algorithm?

I think you are just looking for the first non-zero DWORD in the array, scanning from the top. I would definitely go with a simple loop written in C. If there is some reason why this is super performance-critical, I would look at the larger context of your program and ask, for example, why you need to search for the non-zero element at all and why you can't already know its location.
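For reference, a minimal sketch of such a loop (my own hypothetical helper, not from the question). Unlike msb_msvc above, it returns -1 when every element is zero, so an all-zero array can be told apart from a hit at index 0:
#include <cstddef>

// Scan from the top down and return the index of the highest
// non-zero DWORD, or -1 if the whole array is zero.
std::ptrdiff_t highest_nonzero(const unsigned long* dw, std::ptrdiff_t n)
{
    for (std::ptrdiff_t i = n - 1; i >= 0; --i)
        if (dw[i] != 0)
            return i;
    return -1;
}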

Related

What exactly is GCC's auto-vectorized SSE2 implementation of sum += 1..n doing?

When GCC 8.3 for x86-64 with the -O3 option is fed this small C function
int sum(int n) {
    int sum = 0;
    for (int i = 1; i <= n; i++) {
        sum += i;
    }
    return sum;
}
it produces the following assembly (courtesy of godbolt):
sum:
        test edi, edi
        jle .L8
        lea eax, [rdi-1]
        cmp eax, 17
        jbe .L9
        mov edx, edi
        movdqa xmm1, XMMWORD PTR .LC0[rip]
        xor eax, eax
        pxor xmm0, xmm0
        movdqa xmm2, XMMWORD PTR .LC1[rip]
        shr edx, 2
.L4:
        add eax, 1
        paddd xmm0, xmm1
        paddd xmm1, xmm2
        cmp eax, edx
        jne .L4
        movdqa xmm1, xmm0
        mov ecx, edi
        psrldq xmm1, 8
        and ecx, -4
        paddd xmm0, xmm1
        lea edx, [rcx+1]
        movdqa xmm1, xmm0
        psrldq xmm1, 4
        paddd xmm0, xmm1
        movd eax, xmm0
        cmp edi, ecx
        je .L13
.L7:
        add eax, edx
        add edx, 1
        cmp edi, edx
        jge .L7
        ret
.L13:
        ret
.L8:
        xor eax, eax
        ret
.L9:
        mov edx, 1
        xor eax, eax
        jmp .L7
.LC0:
        .long 1
        .long 2
        .long 3
        .long 4
.LC1:
        .long 4
        .long 4
        .long 4
        .long 4
I understand that for values of n less than 19, a completely unoptimized loop (the code at .L9 and .L7) is used, but I can't make heads or tails of what is happening for larger values of n. Could someone explain it?
Clang, on the other hand, simply calculates (n-1)*(n-2)/2 + 2*n - 1, which is a slightly more roundabout way of calculating n*(n+1)/2 (perhaps to prevent some problems with signed overflow), and which seems to be a much more effective way to optimize this loop.
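As a quick sanity check that the two closed forms really agree (my own snippet, not from either compiler): expanding gives (n-1)*(n-2)/2 + 2*n - 1 = (n*n - 3*n + 2)/2 + (4*n - 2)/2 = (n*n + n)/2 = n*(n+1)/2.
#include <cassert>

int closed_form_clang(int n) { return (n - 1) * (n - 2) / 2 + 2 * n - 1; }
int closed_form_gauss(int n) { return n * (n + 1) / 2; }

int main()
{
    // Both forms overflow for large n, but they agree wherever n*(n+1) fits.
    for (int n = 0; n < 10000; ++n)
        assert(closed_form_clang(n) == closed_form_gauss(n));
}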

Replacing #pragma omp atomic with C++ atomics

I'm replacing some OpenMP code with standard C++11/C++14 atomics and thread support. Here is a minimal OpenMP code example:
#include <vector>
#include <cstdint>

void omp_atomic_add(std::vector<std::int64_t> const& rows,
                    std::vector<std::int64_t> const& cols,
                    std::vector<double>& values,
                    std::size_t const row,
                    std::size_t const col,
                    double const value)
{
    for (auto i = rows[row]; i < rows[row + 1]; ++i)
    {
        if (cols[i] == col)
        {
            #pragma omp atomic
            values[i] += value;
            return;
        }
    }
}
The code updates a matrix in CSR (compressed sparse row) format and sits in a hot path of a scientific computation. It is technically possible to use a std::mutex, but the values vector can have millions of elements and is accessed many times more than that, so a std::mutex is too heavyweight.
Checking the assembly https://godbolt.org/g/nPE9Dt, it seems to use CAS (with the disclaimer my atomic and assembly knowledge is severely limited so my comments are likely incorrect):
        mov rax, qword ptr [rdi]
        mov rdi, qword ptr [rax + 8*rcx]
        mov rax, qword ptr [rax + 8*rcx + 8]
        cmp rdi, rax
        jge .LBB0_6
        mov rcx, qword ptr [rsi]
.LBB0_2: # =>This Inner Loop Header: Depth=1
        cmp qword ptr [rcx + 8*rdi], r8
        je .LBB0_3
        inc rdi
        cmp rdi, rax
        jl .LBB0_2
        jmp .LBB0_6
#### Interesting stuff happens from here onwards
.LBB0_3:
        mov rcx, qword ptr [rdx]             # Load values pointer into register
        mov rax, qword ptr [rcx + 8*rdi]     # Offset to value[i]
.LBB0_4: # =>This Inner Loop Header: Depth=1
        movq xmm1, rax                       # Move value into floating point register
        addsd xmm1, xmm0                     # Add function arg to the value from the vector<double>
        movq rdx, xmm1                       # Move result to register
        lock                                 # x86 lock
        cmpxchg qword ptr [rcx + 8*rdi], rdx # Compare exchange on the value in the vector
        jne .LBB0_4                          # If failed, go back to the top and try again
.LBB0_6:
        ret
Is this possible to do using C++ atomics? The examples I've seen only use std::atomic<double> value{} and nothing in the context of accessing a value through a pointer.
You can create a std::vector<std::atomic<double>>, but you cannot change its size.
The first thing I'd do is get gsl::span, or write my own variant. A gsl::span<std::atomic<double>> is then a better model for values than std::vector<std::atomic<double>>.
Once we have done that, simply remove the #pragma omp atomic and your code is atomic in C++20. In C++17 and before, you have to manually implement +=:
double old = values[i];
while (!values[i].compare_exchange_weak(old, old + value))
{}
Live example.
Clang 5 generates:
omp_atomic_add(std::vector<long, std::allocator<long> > const&, std::vector<long, std::allocator<long> > const&, std::vector<std::atomic<double>, std::allocator<std::atomic<double> > >&, unsigned long, unsigned long, double): # #omp_atomic_add(std::vector<long, std::allocator<long> > const&, std::vector<long, std::allocator<long> > const&, std::vector<std::atomic<double>, std::allocator<std::atomic<double> > >&, unsigned long, unsigned long, double)
        mov rax, qword ptr [rdi]
        mov rdi, qword ptr [rax + 8*rcx]
        mov rax, qword ptr [rax + 8*rcx + 8]
        cmp rdi, rax
        jge .LBB0_6
        mov rcx, qword ptr [rsi]
.LBB0_2: # =>This Inner Loop Header: Depth=1
        cmp qword ptr [rcx + 8*rdi], r8
        je .LBB0_3
        inc rdi
        cmp rdi, rax
        jl .LBB0_2
        jmp .LBB0_6
.LBB0_3:
        mov rax, qword ptr [rdx]
        mov rax, qword ptr [rax + 8*rdi]
.LBB0_4: # =>This Inner Loop Header: Depth=1
        mov rcx, qword ptr [rdx]
        movq xmm1, rax
        addsd xmm1, xmm0
        movq rsi, xmm1
        lock
        cmpxchg qword ptr [rcx + 8*rdi], rsi
        jne .LBB0_4
.LBB0_6:
        ret
which seems identical at a casual glance.
There is a proposal for atomic_view that lets you manipulate a non-atomic value through an atomic view (this work was later standardized as std::atomic_ref in C++20). In general, C++ only lets you operate atomically on atomic data.
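For illustration, here is how the function might look once values is modeled as a span of atomics. This is a sketch under stated assumptions: std::span standing in for gsl::span, C++20's fetch_add on std::atomic<double>, and relaxed memory order as my reading of what #pragma omp atomic requires (tighten it if your algorithm needs ordering):
#include <atomic>
#include <cstdint>
#include <span>

void atomic_add(std::span<std::int64_t const> rows,
                std::span<std::int64_t const> cols,
                std::span<std::atomic<double>> values,
                std::size_t const row,
                std::size_t const col,
                double const value)
{
    for (auto i = rows[row]; i < rows[row + 1]; ++i)
    {
        if (cols[i] == col)
        {
            // C++20: std::atomic<double> has fetch_add; pre-C++20,
            // fall back to the compare_exchange_weak loop shown above.
            values[i].fetch_add(value, std::memory_order_relaxed);
            return;
        }
    }
}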

Signed 64-bit multiply and 128-bit divide on x86 in assembly

I have two functions written in assembly (MASM) in Visual Studio that I use in my C++ project. They are an unsigned 64-bit multiply function that produces a 128-bit result, and an unsigned 128-bit divide function that produces a 128-bit quotient and returns a 32-bit remainder.
What I need are signed versions of these functions, but I'm not sure how to write them.
Below is the code of the .asm file with the unsigned functions:
.MODEL flat, stdcall

.CODE

MUL64 PROC, A:QWORD, B:QWORD, pu128:DWORD
    push EAX
    push EDX
    push EBX
    push ECX
    push EDI
    mov EDI,pu128
    ; LO(A) * LO(B)
    mov EAX,DWORD PTR A
    mov EDX,DWORD PTR B
    MUL EDX
    mov [EDI],EAX    ; Save the partial product.
    mov ECX,EDX
    ; LO(A) * HI(B)
    mov EAX,DWORD PTR A
    mov EDX,DWORD PTR B+4
    MUL EDX
    ADD EAX,ECX
    ADC EDX,0
    mov EBX,EAX
    mov ECX,EDX
    ; HI(A) * LO(B)
    mov EAX,DWORD PTR A+4
    mov EDX,DWORD PTR B
    MUL EDX
    ADD EAX,EBX
    ADC ECX,EDX
    PUSHFD           ; Save carry.
    mov [EDI+4],EAX  ; Save the partial product.
    ; HI(A) * HI(B)
    mov EAX,DWORD PTR A+4
    mov EDX,DWORD PTR B+4
    MUL EDX
    POPFD            ; Retrieve carry from above.
    ADC EAX,ECX
    ADC EDX,0
    mov [EDI+8],EAX  ; Save the partial product.
    mov [EDI+12],EDX ; Save the partial product.
    pop EDI
    pop ECX
    pop EBX
    pop EDX
    pop EAX
    ret 20
MUL64 ENDP

IMUL64 PROC, A:SQWORD, B:SQWORD, pi128:DWORD
    ; How to make this work?
    ret 20
IMUL64 ENDP

DIV128 PROC, pDividend128:DWORD, Divisor:DWORD, pQuotient128:DWORD
    push EDX
    push EBX
    push ESI
    push EDI
    MOV ESI,pDividend128
    MOV EDI,pQuotient128
    MOV EBX,Divisor
    XOR EDX,EDX
    MOV EAX,[ESI+12]
    DIV EBX
    MOV [EDI+12],EAX
    MOV EAX,[ESI+8]
    DIV EBX
    MOV [EDI+8],EAX
    MOV EAX,[ESI+4]
    DIV EBX
    MOV [EDI+4],EAX
    MOV EAX,[ESI]
    DIV EBX
    MOV [EDI],EAX
    MOV EAX,EDX
    pop EDI
    pop ESI
    pop EBX
    pop EDX
    ret 12
DIV128 ENDP

IDIV128 PROC, pDividend128:DWORD, Divisor:DWORD, pQuotient128:DWORD
    ; How to make this work?
    ret 12
IDIV128 ENDP

END
If you found this helpful in any way, please help the project by contributing the signed versions of the functions.
First, the MUL64 function does not work 100%.
If you try to compute 0xFFFFFFFFFFFFFFFF x 0xFFFFFFFFFFFFFFFF, the high 64-bit result is 0xFFFFFFFeFFFFFFFF when it should be 0xFFFFFFFFFFFFFFFe.
To fix this, the carry flag restored by the POPFD instruction must also be added into EDX, the highest 32-bit part of the result. Following Peter Cordes' advice, remove the pushes and pops of EAX/ECX/EDX. Finally, use setc BL and movzx EBX,BL to save the flag. Note: you cannot simply use xor EBX,EBX to zero the register, because xor affects the flags. We use movzx because it's faster than add BL,0xFF, and add is faster than adc (based on Skylake timings).
The Result:
MUL64 PROC, A:QWORD, B:QWORD, pu128:DWORD
    push EBX
    push EDI
    mov EDI,pu128
    ; LO(A) * LO(B)
    mov EAX,DWORD PTR A
    mov EDX,DWORD PTR B
    mul EDX
    mov [EDI],EAX    ; Save the partial product.
    mov ECX,EDX
    ; LO(A) * HI(B)
    mov EAX,DWORD PTR A
    mov EDX,DWORD PTR B+4
    mul EDX
    add EAX,ECX
    adc EDX,0
    mov EBX,EAX
    mov ECX,EDX
    ; HI(A) * LO(B)
    mov EAX,DWORD PTR A+4
    mov EDX,DWORD PTR B
    mul EDX
    add EAX,EBX
    adc ECX,EDX
    setc BL          ; Save carry.
    movzx EBX,BL     ; Zero-extend carry.
    mov [EDI+4],EAX  ; Save the partial product.
    ; HI(A) * HI(B)
    mov EAX,DWORD PTR A+4
    mov EDX,DWORD PTR B+4
    mul EDX
    add EDX,EBX      ; Add carry from above.
    add EAX,ECX
    adc EDX,0
    mov [EDI+8],EAX  ; Save the partial product.
    mov [EDI+12],EDX ; Save the partial product.
    pop EDI
    pop EBX
    ret 20
MUL64 ENDP
Now, to make a signed version of the function, use this formula:
my128.Hi -= (((A < 0) ? B : 0) + ((B < 0) ? A : 0));
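To see why this works: reinterpreting a negative A as unsigned adds 2^64 to it, so the unsigned product is too large by B shifted left 64 bits (and likewise for a negative B), and that excess lands entirely in the high half. Here is a scaled-down sanity check I wrote (32x32 -> 64 instead of 64x64 -> 128, same adjustment):
#include <cassert>
#include <cstdint>

int main()
{
    const std::int32_t tests[] = { 0, 1, -1, 7, -13, INT32_MIN, INT32_MAX };
    for (std::int32_t a : tests)
        for (std::int32_t b : tests)
        {
            // Full unsigned product, as the unsigned MUL64 would compute it.
            std::uint64_t p = std::uint64_t(std::uint32_t(a)) * std::uint32_t(b);
            std::uint32_t hi = std::uint32_t(p >> 32);
            // The adjustment from the formula, applied to the high half.
            hi -= ((a < 0) ? std::uint32_t(b) : 0) + ((b < 0) ? std::uint32_t(a) : 0);
            std::int64_t sp = std::int64_t((std::uint64_t(hi) << 32) | std::uint32_t(p));
            assert(sp == std::int64_t(a) * b);
        }
}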
The Result:
IMUL64 PROC, A:SQWORD, B:SQWORD, pi128:DWORD
    push EBX
    push EDI
    mov EDI,pi128
    ; LO(A) * LO(B)
    mov EAX,DWORD PTR A
    mov EDX,DWORD PTR B
    mul EDX
    mov [EDI],EAX    ; Save the partial product.
    mov ECX,EDX
    ; LO(A) * HI(B)
    mov EAX,DWORD PTR A
    mov EDX,DWORD PTR B+4
    mul EDX
    add EAX,ECX
    adc EDX,0
    mov EBX,EAX
    mov ECX,EDX
    ; HI(A) * LO(B)
    mov EAX,DWORD PTR A+4
    mov EDX,DWORD PTR B
    mul EDX
    add EAX,EBX
    adc ECX,EDX
    setc BL          ; Save carry.
    movzx EBX,BL     ; Zero-extend carry.
    mov [EDI+4],EAX  ; Save the partial product.
    ; HI(A) * HI(B)
    mov EAX,DWORD PTR A+4
    mov EDX,DWORD PTR B+4
    mul EDX
    add EDX,EBX      ; Add carry from above.
    add EAX,ECX
    adc EDX,0
    mov [EDI+8],EAX  ; Save the partial product.
    mov [EDI+12],EDX ; Save the partial product.
    ; Signed version only:
    cmp DWORD PTR A+4,0
    jg zero_b
    jl use_b
    cmp DWORD PTR A,0
    jae zero_b
use_b:
    mov ECX,DWORD PTR B
    mov EBX,DWORD PTR B+4
    jmp test_b
zero_b:
    xor ECX,ECX
    mov EBX,ECX
test_b:
    cmp DWORD PTR B+4,0
    jg zero_a
    jl use_a
    cmp DWORD PTR B,0
    jae zero_a
use_a:
    mov EAX,DWORD PTR A
    mov EDX,DWORD PTR A+4
    jmp do_last_op
zero_a:
    xor EAX,EAX
    mov EDX,EAX
do_last_op:
    add EAX,ECX
    adc EDX,EBX
    sub [EDI+8],EAX
    sbb [EDI+12],EDX
    ; End of signed version!
    pop EDI
    pop EBX
    ret 20
IMUL64 ENDP
The DIV128 function should be fine (and is probably the fastest approach) for getting a 128-bit quotient from a 32-bit divisor, but if you need a 128-bit divisor, look at this code https://www.codeproject.com/Tips/785014/UInt-Division-Modulus which has an example of using the binary shift algorithm for 128-bit division. It could probably be about 3x faster if written in assembly.
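For reference, here is a minimal sketch of that binary shift (shift-subtract) algorithm, assuming a compiler with the unsigned __int128 extension (GCC/Clang; MSVC would need a two-QWORD struct instead) and a non-zero divisor; the linked CodeProject code applies the same idea to paired 64-bit halves:
// One quotient bit per iteration: shift the remainder left, bring down
// the next dividend bit, and subtract the divisor whenever it fits.
unsigned __int128 udiv128(unsigned __int128 n, unsigned __int128 d,
                          unsigned __int128* rem)
{
    unsigned __int128 q = 0, r = 0;
    for (int i = 127; i >= 0; --i)
    {
        r = (r << 1) | ((n >> i) & 1);
        if (r >= d)
        {
            r -= d;
            q |= (unsigned __int128)1 << i;
        }
    }
    *rem = r;
    return q;
}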
To make a signed version of DIV128, first determine whether the signs of the dividend and divisor are the same or different. If they are the same, the quotient should be positive; if they differ, it should be negative. The remainder of a truncated division takes the sign of the dividend. So: make the dividend and divisor positive if they are negative, call DIV128, and afterwards negate the quotient if the signs differed and negate the remainder if the dividend was negative.
Here is some example code written in C++:
VOID IDIV128(PSDQWORD Dividend, PSDQWORD Divisor, PSDQWORD Quotient, PSDQWORD Remainder)
{
    BOOL NegateQuotient, NegateRemainder;
    DQWORD DD, DV;
    // Use local DD and DV so Dividend and Divisor don't get corrupted.
    DD.Lo = Dividend->Lo;
    DD.Hi = Dividend->Hi;
    DV.Lo = Divisor->Lo;
    DV.Hi = Divisor->Hi;
    // The quotient is negative only when the signs differ; the remainder
    // takes the sign of the dividend (truncated division).
    NegateQuotient = ((DD.Hi ^ DV.Hi) & 0x8000000000000000) != 0;
    NegateRemainder = (DD.Hi & 0x8000000000000000) != 0;
    // Convert Dividend and Divisor to positive if negative: (negate)
    if (DD.Hi & 0x8000000000000000) NEG128((PSDQWORD)&DD);
    if (DV.Hi & 0x8000000000000000) NEG128((PSDQWORD)&DV);
    DIV128(&DD, &DV, (PDQWORD)Quotient, (PDQWORD)Remainder);
    if (NegateQuotient) NEG128(Quotient);
    if (NegateRemainder) NEG128(Remainder);
}
EDIT:
Following Peter Cordes' advice, we can optimize MUL64/IMUL64 even more; look at the comments for the specific changes. I have also replaced MUL64 PROC, A:QWORD, B:QWORD, pu128:DWORD with plain MUL64@20: and IMUL64@20: labels, to eliminate the unnecessary use of EBP that MASM adds, and I optimized the sign-fixing work for IMUL64.
The current .asm file for MUL64/IMUL64:
.MODEL flat, stdcall

EXTERNDEF MUL64@20 :PROC
EXTERNDEF IMUL64@20 :PROC

.CODE

MUL64@20:
    push EBX
    push EDI
    ;          -----------------
    ;          |     pu128     |
    ;          |---------------|
    ;          |       B       |
    ;          |---------------|
    ;          |       A       |
    ;          |---------------|
    ;          |  ret address  |
    ;          |---------------|
    ;          |      EBX      |
    ;          |---------------|
    ; ESP---->|      EDI      |
    ;          -----------------
A TEXTEQU <[ESP+12]>
B TEXTEQU <[ESP+20]>
pu128 TEXTEQU <[ESP+28]>
    mov EDI,pu128
    ; LO(A) * LO(B)
    mov EAX,DWORD PTR A
    mul DWORD PTR B
    mov [EDI],EAX    ; Save the partial product.
    mov ECX,EDX
    ; LO(A) * HI(B)
    mov EAX,DWORD PTR A
    mul DWORD PTR B+4
    add EAX,ECX
    adc EDX,0
    mov EBX,EAX
    mov ECX,EDX
    ; HI(A) * LO(B)
    mov EAX,DWORD PTR A+4
    mul DWORD PTR B
    add EAX,EBX
    adc ECX,EDX
    setc BL          ; Save carry.
    mov [EDI+4],EAX  ; Save the partial product.
    ; HI(A) * HI(B)
    mov EAX,DWORD PTR A+4
    mul DWORD PTR B+4
    add EAX,ECX
    movzx ECX,BL     ; Zero-extend saved carry from above.
    adc EDX,ECX
    mov [EDI+8],EAX  ; Save the partial product.
    mov [EDI+12],EDX ; Save the partial product.
    pop EDI
    pop EBX
    ret 20

IMUL64@20:
    push EBX
    push EDI
    ;          -----------------
    ;          |     pi128     |
    ;          |---------------|
    ;          |       B       |
    ;          |---------------|
    ;          |       A       |
    ;          |---------------|
    ;          |  ret address  |
    ;          |---------------|
    ;          |      EBX      |
    ;          |---------------|
    ; ESP---->|      EDI      |
    ;          -----------------
A TEXTEQU <[ESP+12]>
B TEXTEQU <[ESP+20]>
pi128 TEXTEQU <[ESP+28]>
    mov EDI,pi128
    ; LO(A) * LO(B)
    mov EAX,DWORD PTR A
    mul DWORD PTR B
    mov [EDI],EAX    ; Save the partial product.
    mov ECX,EDX
    ; LO(A) * HI(B)
    mov EAX,DWORD PTR A
    mul DWORD PTR B+4
    add EAX,ECX
    adc EDX,0
    mov EBX,EAX
    mov ECX,EDX
    ; HI(A) * LO(B)
    mov EAX,DWORD PTR A+4
    mul DWORD PTR B
    add EAX,EBX
    adc ECX,EDX
    setc BL          ; Save carry.
    mov [EDI+4],EAX  ; Save the partial product.
    ; HI(A) * HI(B)
    mov EAX,DWORD PTR A+4
    mul DWORD PTR B+4
    add EAX,ECX
    movzx ECX,BL     ; Zero-extend saved carry from above.
    adc EDX,ECX
    mov [EDI+8],EAX  ; Save the partial product.
    mov [EDI+12],EDX ; Save the partial product.
    ; Signed version only:
    mov BL,BYTE PTR B+7
    and BL,80H
    jz zero_a
    mov EAX,DWORD PTR A
    mov EDX,DWORD PTR A+4
    jmp test_a
zero_a:
    xor EAX,EAX
    mov EDX,EAX
test_a:
    mov BL,BYTE PTR A+7
    and BL,80H
    jz do_last_op
    add EAX,DWORD PTR B
    adc EDX,DWORD PTR B+4
do_last_op:
    sub [EDI+8],EAX
    sbb [EDI+12],EDX
    ; End of signed version!
    pop EDI
    pop EBX
    ret 20

END

Is there an easier way to write a bubble sort algorithm in MASM in a modular style?

I wrote a bubble sort algorithm in assembly. I'm proud of myself, but at the same time I think my bubble sort is wrong.
Can someone let me know if it's right? And how do I make my program more modular so I can reuse it later?
.386
.model flat,stdcall
.stack 100h

printf proto c arg1:ptr byte, printlist:vararg

.data
array dword 8,9,10,40,80,0
fmtmsg2 db 0dh,0ah,0
fmtmsg1 db "%d ",0

.code
public main
main proc
    mov ecx,0
    mov edx,0
    mov esi,offset array
innerloop:
    inc ecx
    cmp ecx,5
    je outerloop
    mov eax,[esi]
    cmp eax,[esi + 4]
    jge noexchange
    ;exchange values
    xchg eax,[esi+4]
    mov [esi],eax
noexchange:
    add esi,4
    jmp innerloop
outerloop:
    mov esi,offset array
    ;inner loop counter
    mov ecx,0
    ;outer loop counter
    inc edx
    cmp edx,5
    jne innerloop
    ;loop 3 counter
    mov edx,0
    ;load array offset
    mov esi,offset array
loop3:
    mov eax,[esi]
    push edx
    invoke printf,addr fmtmsg1,eax
    pop edx
    add esi,4
    inc edx
    cmp edx,5
    jne loop3
    invoke printf,addr fmtmsg2
    ret
main endp
end main
Your original algorithm works great (congratulations). It sorts the array in descending order: for example, if array is [1,2,3,4,5] the result is [5,4,3,2,1]. If you want ascending order, just change one instruction. I used Visual Studio 2010, but the code is the same (my changes are marked with arrows, but you only need the one change, "jbe"):
void death_reverse () {
    int array[5] = { 5,4,3,2,1 }; // <=====================
    __asm {
        mov ecx,0
        mov edx,0
        lea esi, array            // <=====================
innerloop:
        inc ecx
        cmp ecx,5
        je outerloop
        mov eax,[esi]
        cmp eax,[esi + 4]
        Jbe noexchange            // <=============== ASCENDING ORDER.
        ;exchange values
        xchg eax,[esi+4]
        mov [esi],eax
noexchange:
        add esi,4
        jmp innerloop
outerloop:
        lea esi, array            // <=====================
        ;inner loop counter
        mov ecx,0
        ;outer loop counter
        inc edx
        cmp edx,5
        jne innerloop
        ;loop 3 counter
        mov edx,0
        ;load array offset
        lea esi, array            // <=====================
loop3:
        mov eax,[esi]
        push edx
        invoke printf,addr fmtmsg1,eax
        pop edx
        add esi,4
        inc edx
        cmp edx,5
        jne loop3
        invoke printf,addr fmtmsg2
    }
}

Reversing the _PrepareMenuWindow() subroutine

Can someone help me with reversing the _PrepareMenuWindow() subroutine?
I am trying to find the signature of the method.
__text:000639A7 _PrepareMenuWindow proc near ; CODE XREF: DrawTheMenu(MenuSelectData *,__CFArray **,uchar,uchar *)+274p
__text:000639A7 ; PopUpMenuSelectCore(MenuData *,Point,double,Point,ushort,uint,Rect const*,ushort,ulong,Rect const*,Rect const*,__CFString const*,OpaqueMenuRef **,ushort *)+528p
__text:000639A7
__text:000639A7 var_44 = dword ptr -44h
__text:000639A7 var_40 = dword ptr -40h
__text:000639A7 var_3C = dword ptr -3Ch
__text:000639A7 var_34 = dword ptr -34h
__text:000639A7 var_30 = dword ptr -30h
__text:000639A7 var_2C = dword ptr -2Ch
__text:000639A7 var_28 = dword ptr -28h
__text:000639A7 var_24 = word ptr -24h
__text:000639A7 var_20 = dword ptr -20h
__text:000639A7 var_1A = word ptr -1Ah
__text:000639A7 arg_0 = dword ptr 8
__text:000639A7 arg_4 = dword ptr 0Ch
__text:000639A7 arg_8 = dword ptr 10h
__text:000639A7
__text:000639A7 push ebp
__text:000639A8 mov ebp, esp
__text:000639AA push edi
__text:000639AB push esi
__text:000639AC push ebx
__text:000639AD sub esp, 5Ch
__text:000639B0 xor edi, edi
__text:000639B2 mov eax, [ebp+arg_0]
__text:000639B5 test eax, eax
__text:000639B7 jz short loc_639C6
__text:000639B9 mov eax, [ebp+arg_0]
__text:000639BC mov [esp], eax
__text:000639BF call __ZNK8HIObject13GetEncodedRefEv ; HIObject::GetEncodedRef(void)
__text:000639C4 mov edi, eax
__text:000639C6
__text:000639C6 loc_639C6: ; CODE XREF: _PrepareMenuWindow+10j
__text:000639C6 mov ecx, [ebp+arg_4]
__text:000639C9 mov eax, [ecx]
__text:000639CB mov edx, [ecx+4]
__text:000639CE mov [ebp+var_2C], eax
__text:000639D1 mov [ebp+var_28], edx
__text:000639D4 lea eax, [ebp+var_1A]
__text:000639D7 mov [ebp+var_40], eax
__text:000639DA mov [esp+4], eax
__text:000639DE mov [esp], edi
__text:000639E1 call _GetMenuType
__text:000639E6 mov dword ptr [esp+4], 0
__text:000639EE mov [esp], edi
__text:000639F1 call _IsMenuItemEnabled
__text:000639F6 movzx edx, [ebp+var_1A]
__text:000639FA or dh, 1
__text:000639FD test al, al
__text:000639FF movzx ebx, [ebp+var_1A]
__text:00063A03 cmovz ebx, edx
__text:00063A06 mov [ebp+var_1A], bx
__text:00063A0A mov eax, [ebp+arg_8]
__text:00063A0D mov [esp+0Ch], eax
__text:00063A11 lea ecx, [ebp+var_2C]
__text:00063A14 mov [ebp+var_44], ecx
__text:00063A17 mov [esp+8], ecx
__text:00063A1B mov eax, [ebp+arg_4]
__text:00063A1E mov [esp+4], eax
__text:00063A22 mov [esp], edi
__text:00063A25 call __AddOpenMenu
__text:00063A2A mov ecx, [ebp+var_44]
__text:00063A2D mov [esp], ecx
__text:00063A30 call _EmptyRect
__text:00063A35 test al, al
__text:00063A37 jnz loc_63B94
__text:00063A3D mov [esp], edi
__text:00063A40 call __Z11GetMenuDataP13OpaqueMenuRef ; GetMenuData(OpaqueMenuRef *)
__text:00063A45 mov [ebp+var_3C], eax
__text:00063A48 call _NewRgn
__text:00063A4D mov esi, eax
__text:00063A4F test eax, eax
__text:00063A51 jz loc_63BDD
__text:00063A57 movzx ebx, bx
__text:00063A5A mov eax, [ebp+var_3C]
__text:00063A5D mov eax, [eax+40h]
__text:00063A60 test eax, eax
__text:00063A62 jnz loc_63B23
__text:00063A68 mov [ebp+var_1A], 0
__text:00063A6E mov eax, [ebp+var_2C]
__text:00063A71 mov edx, [ebp+var_28]
__text:00063A74 mov [ebp+var_34], eax
__text:00063A77 mov [ebp+var_30], edx
__text:00063A7A mov ecx, [ebp+var_40]
__text:00063A7D mov [esp+10h], ecx
__text:00063A81 mov dword ptr [esp+0Ch], 0
__text:00063A89 lea eax, [ebp+var_34]
__text:00063A8C mov [esp+8], eax
__text:00063A90 mov dword ptr [esp+4], 7
__text:00063A98 mov eax, [ebp+var_3C]
__text:00063A9B mov [esp], eax
__text:00063A9E call __Z12_CallMenuDefP8MenuDatasP4Rect5PointPs ; _CallMenuDef(MenuData *,short,Rect *,Point,short *)
__text:00063AA3 cmp [ebp+var_1A], 7473h
__text:00063AA9 jz short loc_63ADC
__text:00063AAB add word ptr [ebp+var_2C], 3
__text:00063AB0 mov dword ptr [esp+8], 0FFFFFFFCh
__text:00063AB8 mov dword ptr [esp+4], 0FFFFFFFCh
__text:00063AC0 mov ecx, [ebp+var_44]
__text:00063AC3 mov [esp], ecx
__text:00063AC6 call _InsetRect
__text:00063ACB mov eax, [ebp+var_44]
__text:00063ACE mov [esp+4], eax
__text:00063AD2 mov [esp], esi
__text:00063AD5 call _RectRgn
__text:00063ADA jmp short loc_63B23
__text:00063ADC ; ---------------------------------------------------------------------------
__text:00063ADC
__text:00063ADC loc_63ADC: ; CODE XREF: _PrepareMenuWindow+102j
__text:00063ADC lea eax, [ebp+var_24]
__text:00063ADF mov [esp+8], eax
__text:00063AE3 lea eax, [ebp+var_20]
__text:00063AE6 mov [esp+4], eax
__text:00063AEA mov [esp], edi
__text:00063AED call __GetMenuCallout
__text:00063AF2 movsx eax, [ebp+var_24]
__text:00063AF6 mov [esp+10h], eax
__text:00063AFA mov eax, [ebp+var_20]
__text:00063AFD mov [esp+0Ch], eax
__text:00063B01 mov [esp+8], esi
__text:00063B05 mov [esp+4], ebx
__text:00063B09 mov ecx, [ebp+var_44]
__text:00063B0C mov [esp], ecx
__text:00063B0F call __GetThemeMenuBackgroundRegionWithCallout
__text:00063B14 mov eax, [ebp+var_44]
__text:00063B17 mov [esp+4], eax
__text:00063B1B mov [esp], esi
__text:00063B1E call _GetRegionBounds
__text:00063B23
__text:00063B23 loc_63B23: ; CODE XREF: _PrepareMenuWindow+BBj
__text:00063B23 ; _PrepareMenuWindow+133j
__text:00063B23 mov [esp+0Ch], esi
__text:00063B27 mov ecx, [ebp+var_44]
__text:00063B2A mov [esp+8], ecx
__text:00063B2E mov [esp+4], ebx
__text:00063B32 mov [esp], edi
__text:00063B35 call __ZL13GetMenuWindowP13OpaqueMenuReftPK4RectP15OpaqueRgnHandle ; GetMenuWindow(OpaqueMenuRef *,ushort,Rect const*,OpaqueRgnHandle *)
__text:00063B3A test eax, eax
__text:00063B3C jz short loc_63BA1
__text:00063B3E mov [esp], eax
__text:00063B41 call _GetWindowPort
__text:00063B46 mov [esp], eax
__text:00063B49 call _SetPortWrapper
__text:00063B4E mov [esp], esi
__text:00063B51 call _SetClipWrapper
__text:00063B56 mov [esp], esi
__text:00063B59 call _DisposeRgn
__text:00063B5E mov eax, [ebp+var_3C]
__text:00063B61 mov eax, [eax+40h]
__text:00063B64 test eax, eax
__text:00063B66 jnz short loc_63BDD
__text:00063B68 mov dword ptr [esp+14h], 0
__text:00063B70 mov dword ptr [esp+10h], 0
__text:00063B78 mov [esp+0Ch], ebx
__text:00063B7C mov ecx, [ebp+arg_4]
__text:00063B7F mov [esp+8], ecx
__text:00063B83 mov eax, [ebp+var_44]
__text:00063B86 mov [esp+4], eax
__text:00063B8A mov [esp], edi
__text:00063B8D call __Z18DrawMenuBackgroundP13OpaqueMenuRefRK4RectS3_thPv ; DrawMenuBackground(OpaqueMenuRef *,Rect const&,Rect const&,ushort,uchar,void *)
__text:00063B92 jmp short loc_63BDD
__text:00063B94 ; ---------------------------------------------------------------------------
__text:00063B94
__text:00063B94 loc_63B94: ; CODE XREF: _PrepareMenuWindow+90j
__text:00063B94 mov ecx, [ebp+arg_0]
__text:00063B97 mov [esp], ecx
__text:00063B9A call _DisposeMenuWindow
__text:00063B9F jmp short loc_63BDD
__text:00063BA1 ; ---------------------------------------------------------------------------
__text:00063BA1
__text:00063BA1 loc_63BA1: ; CODE XREF: _PrepareMenuWindow+195j
__text:00063BA1 mov eax, [ebp+arg_0]
__text:00063BA4 mov [esp], eax
__text:00063BA7 call __Z11FindMBEntryP8MenuData ; FindMBEntry(MenuData *)
__text:00063BAC mov ecx, eax
__text:00063BAE test eax, eax
__text:00063BB0 jz short loc_63BD5
__text:00063BB2 mov word ptr [eax+1Eh], 0
__text:00063BB8 mov word ptr [eax+1Ch], 0
__text:00063BBE mov word ptr [eax+1Ah], 0
__text:00063BC4 mov word ptr [eax+18h], 0
__text:00063BCA mov eax, [eax+18h]
__text:00063BCD mov edx, [ecx+1Ch]
__text:00063BD0 mov [ecx], eax
__text:00063BD2 mov [ecx+4], edx
__text:00063BD5
__text:00063BD5 loc_63BD5: ; CODE XREF: _PrepareMenuWindow+209j
__text:00063BD5 mov [esp], esi
__text:00063BD8 call _DisposeRgn
__text:00063BDD
__text:00063BDD loc_63BDD: ; CODE XREF: _PrepareMenuWindow+AAj
__text:00063BDD ; _PrepareMenuWindow+1BFj ...
__text:00063BDD xor eax, eax
__text:00063BDF add esp, 5Ch
__text:00063BE2 pop ebx
__text:00063BE3 pop esi
__text:00063BE4 pop edi
__text:00063BE5 leave
__text:00063BE6 retn
__text:00063BE6 _PrepareMenuWindow endp
What have you got so far that isn't generated by IDA (i.e., your own analysis of the function)?
From the looks of it, it's a __cdecl function that always returns NULL/false/0. It also seems to take 3 arguments (which can be confirmed by looking at what cleanup, if any, is done by the caller).
Arg 0 is a MenuData*, arg 4 seems to be a Rect& (which is secretly just a Rect*), and arg 8 would be whatever type __AddOpenMenu takes as its fourth argument.
So I'd assume something along the lines of typedef BOOL(__cdecl*)(MenuData*, Rect&, void*).
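Written out as a declaration (a hypothetical reconstruction; the typedef name and parameter names are my own, and the exact type of the third parameter depends on what __AddOpenMenu actually takes):
// Sketch only: the names here are guesses based on the analysis above.
typedef BOOL (__cdecl* PrepareMenuWindowFn)(MenuData* menuData,
                                            Rect& menuBounds,
                                            void* addOpenMenuArg);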
