I'm learning x86 assembly using MASM, and I'm trying to write a function with a signature equivalent to the following C function:
void func(double a[], double b[], double c[], int len);
I'm not sure how to implement it.
The asm file will be compiled into a win32 DLL.
So that I can understand how to do this, can someone please translate this very simple function into asm for me:
void func(double a[], double b[], double c[], int len)
{
// a, b, and c have the same length, given by len
for (int i = 0; i < len; i++)
c[i] = a[i] + b[i];
}
I tried writing a function like this in C, compiling it, and looking at the corresponding disassembled code in the exe using OllyDbg but I couldn't even find my function in it.
Thank you kindly.
I haven't written x86 for a while but I can give you a general idea of how to do it. Since I don't have an assembler handy, this is written in notepad.
func proc a:DWORD, b:DWORD, c:DWORD, len:DWORD
    mov eax, len
    test eax, eax
    jnz @F                          ; len != 0: go on to the loop setup
    ret
@@:
    push ebx
    push esi
    xor eax, eax                    ; eax = i = 0
    mov esi, a
    mov ebx, b
    mov ecx, c
@@:
    mov edx, dword ptr [esi+eax*4]  ; edx = a[i]
    add edx, dword ptr [ebx+eax*4]  ; edx += b[i]
    mov [ecx+eax*4], edx            ; c[i] = edx
    inc eax                         ; i++
    cmp eax, len
    jl @B                           ; loop while i < len
    pop esi
    pop ebx
    ret
func endp
The above function conforms to stdcall and is approximately how you would translate to x86 if your arguments were integers. Unfortunately, you are using doubles. The loop would be the same but you'd need to use the FPU stack and opcodes for doing the arithmetic. I haven't used that for a while and couldn't remember the instructions off the top of my head unfortunately.
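For the double case, a plain C version is a handy reference to check an assembly implementation against. The sketch below is only that: `func_ref` is a hypothetical name, and the x87 sequence in the comment (`fld`, `fadd`, `fstp`) is written in pseudo-addressing notation to show the shape of the loop body, not tested MASM.

```c
#include <assert.h>

/* Reference version of func for doubles (func_ref is a hypothetical name).
   An x87 translation would do roughly this per element:
       fld  qword ptr [a + i*8]   ; push a[i] on the FPU stack
       fadd qword ptr [b + i*8]   ; st(0) = a[i] + b[i]
       fstp qword ptr [c + i*8]   ; store to c[i] and pop
   The scale factor is 8 rather than 4 because sizeof(double) is 8. */
void func_ref(const double a[], const double b[], double c[], int len)
{
    assert(len >= 0);
    for (int i = 0; i < len; i++)
        c[i] = a[i] + b[i];
}
```

Calling the assembly routine and `func_ref` on the same inputs and comparing the two output arrays is a quick way to catch indexing or scaling mistakes.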
You have to pass the memory addresses of the arrays. Consider the following code:
.data?
array1 DWORD 4 DUP(?)
.code
main PROC
push LENGTHOF array1
push OFFSET array1
call arrayFunc
main ENDP
arrayFunc PROC
push ebp
mov ebp, esp
push edi
mov edi, [ebp+08h]
mov ecx, [ebp+0Ch]
L1:
;reference each element of given array by [edi]
;add "TYPE" *array* to edi to increment
loop L1
pop edi
pop ebp
ret 8
arrayFunc ENDP
END main
I wrote this code only to illustrate the concept. I leave it to you to work out how to use the registers to achieve your program's goals.
I am currently learning Rust, and as a first exercise I wanted to implement a function that computes the nth fibonacci number:
fn main() {
for i in 0..48 {
println!("{}: {}", i, fibonacci(i));
}
}
fn fibonacci(n: u32) -> u32 {
match n {
0 => 0,
1 => 1,
_ => fibonacci(n - 1) + fibonacci(n - 2),
}
}
I run it as:
$ time cargo run --release
real 0m15.380s
user 0m15.362s
sys 0m0.014s
As an exercise, I also implemented the same algorithm in C++. I was expecting a similar performance, but the C++ code runs in 80% of the time:
#include<iostream>
unsigned int fibonacci(unsigned int n);
int main (int argc, char* argv[]) {
for(unsigned int i = 0; i < 48; ++i) {
std::cout << i << ": " << fibonacci(i) << '\n';
}
return 0;
}
unsigned int fibonacci(unsigned int n) {
if(n == 0) {
return 0;
} else if (n == 1) {
return 1;
} else {
return fibonacci(n - 1) + fibonacci(n - 2);
}
}
Compiled as:
$ g++ test.cpp -o test.exe -O2
And running:
$ time ./test.exe
real 0m12.127s
user 0m12.124s
sys 0m0.000s
Why do I see such a difference in performance? I am not interested in calculating the fibonacci faster in Rust (with a different algorithm); I am only interested in where the difference comes from. This is just an exercise in my progress as I learn Rust.
TL;DR: It's not Rust vs C++, it's LLVM (Clang) vs GCC.
Different optimizers optimize the code differently, and in this case GCC produces larger but faster code.
This can be verified using godbolt.
Here is the Rust version compiled with GCC (via rustgcc-master):
example::fibonacci:
push r15
push r14
push r13
push r12
push rbp
xor ebp, ebp
push rbx
mov ebx, edi
sub rsp, 24
.L2:
test ebx, ebx
je .L1
cmp ebx, 1
je .L4
lea r12d, -1[rbx]
xor r13d, r13d
.L19:
cmp r12d, 1
je .L6
lea r14d, -1[r12]
xor r15d, r15d
.L16:
cmp r14d, 1
je .L8
lea edx, -1[r14]
xor ecx, ecx
.L13:
cmp edx, 1
je .L10
lea edi, -1[rdx]
mov DWORD PTR 12[rsp], ecx
mov DWORD PTR 8[rsp], edx
call example::fibonacci.localalias
mov ecx, DWORD PTR 12[rsp]
mov edx, DWORD PTR 8[rsp]
add ecx, eax
sub edx, 2
jne .L13
.L14:
add r15d, ecx
sub r14d, 2
je .L17
jmp .L16
.L4:
add ebp, 1
.L1:
add rsp, 24
mov eax, ebp
pop rbx
pop rbp
pop r12
pop r13
pop r14
pop r15
ret
.L6:
add r13d, 1
.L20:
sub ebx, 2
add ebp, r13d
jmp .L2
.L8:
add r15d, 1
.L17:
add r13d, r15d
sub r12d, 2
je .L20
jmp .L19
.L10:
add ecx, 1
jmp .L14
And with LLVM (via rustc):
example::fibonacci:
push rbp
push r14
push rbx
mov ebx, edi
xor ebp, ebp
mov r14, qword ptr [rip + example::fibonacci@GOTPCREL]
cmp ebx, 2
jb .LBB0_3
.LBB0_2:
lea edi, [rbx - 1]
call r14
add ebp, eax
add ebx, -2
cmp ebx, 2
jae .LBB0_2
.LBB0_3:
add ebx, ebp
mov eax, ebx
pop rbx
pop r14
pop rbp
ret
We can see that LLVM produces a naive version -- calling the function in each iteration of the loop -- while GCC partially unrolls the recursion by inlining some calls. This results in a smaller number of calls in the case of GCC, and at about 5ns of overhead per function call, it's significant enough.
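To make the shape of the LLVM output easier to see, here is a C rendering of that listing next to the naive version. It is a hand translation for illustration only; the names `fib_naive` and `fib_loop` are hypothetical, and the register comments map each line back to the assembly above.

```c
#include <assert.h>

/* Naive doubly recursive version, as written in the question. */
unsigned int fib_naive(unsigned int n)
{
    if (n < 2)
        return n;
    return fib_naive(n - 1) + fib_naive(n - 2);
}

/* Shape of the LLVM output: one real call per loop iteration, with the
   second recursive call (n - 2) turned into the loop itself. */
unsigned int fib_loop(unsigned int n)
{
    unsigned int acc = 0;            /* xor ebp, ebp */
    while (n >= 2) {                 /* cmp ebx, 2 / jae .LBB0_2 */
        acc += fib_loop(n - 1);      /* lea edi, [rbx - 1] / call r14 */
        n -= 2;                      /* add ebx, -2 */
    }
    return acc + n;                  /* add ebx, ebp / mov eax, ebx */
}
```

The GCC listing goes further along the same lines: it nests several such loops inside one another, so fewer iterations end in an actual `call`.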
We can do the same exercise with the C++ version using LLVM via Clang and GCC and note that the result is pretty much similar.
So, as announced, it's a LLVM vs GCC difference, not a language one.
Incidentally, the fact that optimizers may produce such widely different results is a reason why I am quite excited at the progress of the rustc_codegen_gcc initiative (dubbed rustgcc-master on godbolt), which aims at plugging a GCC backend into the rustc frontend: once complete, anyone will be able to switch to the better optimizer for their own workload.
Before calling a member function of an object, the address of the object will be moved to ECX.
Inside the function, ECX will be moved to dword ptr [this], what does this mean?
C++ Source
#include <iostream>
class CAdd
{
public:
CAdd(int x, int y) : _x(x), _y(y) {}
int Do() { return _x + _y; }
private:
int _x;
int _y;
};
int main()
{
CAdd ca(1, 2);
int n = ca.Do();
std::cout << n << std::endl;
}
Disassembly
...
CAdd ca(1, 2);
00A87B4F push 2
00A87B51 push 1
00A87B53 lea ecx,[ca] ; the instance address
00A87B56 call CAdd::CAdd (0A6BA32h)
int Do() { return _x + _y; }
00A7FFB0 push ebp
00A7FFB1 mov ebp,esp
00A7FFB3 sub esp,0CCh
00A7FFB9 push ebx
00A7FFBA push esi
00A7FFBB push edi
00A7FFBC push ecx
00A7FFBD lea edi,[ebp-0Ch]
00A7FFC0 mov ecx,3
00A7FFC5 mov eax,0CCCCCCCCh
00A7FFCA rep stos dword ptr es:[edi]
00A7FFCC pop ecx
00A7FFCD mov dword ptr [this],ecx ; ========= QUESTION HERE!!! =========
00A7FFD0 mov ecx,offset _CC7F790E_main@cpp (0BC51F2h)
00A7FFD5 call @__CheckForDebuggerJustMyCode@4 (0A6AC36h)
00A7FFDA mov eax,dword ptr [this] ; ========= AND HERE!!! =========
00A7FFDD mov eax,dword ptr [eax]
00A7FFDF mov ecx,dword ptr [this]
00A7FFE2 add eax,dword ptr [ecx+4]
00A7FFE5 pop edi
00A7FFE6 pop esi
00A7FFE7 pop ebx
00A7FFE8 add esp,0CCh
00A7FFEE cmp ebp,esp
00A7FFF0 call __RTC_CheckEsp (0A69561h)
00A7FFF5 mov esp,ebp
00A7FFF7 pop ebp
00A7FFF8 ret
MSVC's asm output itself (https://godbolt.org/z/h44rW3Mxh) uses _this$[ebp] with _this$ = -4, in a debug build like this which wastes instructions storing/reloading incoming register args.
_this$ = -4
int CAdd::Do(void) PROC ; CAdd::Do, COMDAT
push ebp
mov ebp, esp
push ecx ; dummy push instead of sub to reserve 4 bytes
mov DWORD PTR _this$[ebp], ecx
mov eax, DWORD PTR _this$[ebp]
...
This is just spilling the register arg to a local on the stack with that name. (The default options for the MSVC version I used on Godbolt, x86 MSVC 19.29.30136, don't include @__CheckForDebuggerJustMyCode@4 or the runtime-check stack poisoning (rep stos) in Do(), but the usage of this is still there.)
Amusingly, the push ecx it uses (as a micro-optimization) instead of sub esp, 4 to reserve stack space already stored ECX, making the mov store redundant.
(AFAIK, no compilers actually use push to both initialize and make space for locals, but it would be an optimization for cases like this; see "What C/C++ compiler can use push pop instructions for creating local variables, instead of just increasing esp once?". MSVC is just using the push for its effect on ESP, not caring what it stores, and that would still be true with optimization enabled, in a function that did still need to spill the register instead of keeping it there.)
Your disassembler apparently folds the frame-pointer (EBP +) into what it's defining as a this symbol / macro, making it more confusing if you don't look around at other lines to find out how it defines that text macro or whatever it is.
What disassembler are you using? The one built-in to Visual Studio's debugger?
I guess that would make sense that it's using C local var names this way, even though it looks super weird to people familiar with asm. (Because only static storage is addressable with a mode like [symbol] not involving any registers.)
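One way to picture what the debug build is doing is to write the member function as a plain C function that takes the object address as an explicit parameter. This is only a model: the names `CAdd_Do` and `self_in_ecx` are hypothetical, and the offsets in the comments assume a 4-byte int, as in the 32-bit listing above.

```c
#include <assert.h>

/* C model of the CAdd class: two int fields at offsets 0 and 4. */
struct CAdd {
    int x;   /* read as [eax] in the listing     */
    int y;   /* read as [ecx+4] in the listing   */
};

/* Model of CAdd::Do(). In the 32-bit MSVC listing the object address
   arrives in ECX; here it is an ordinary pointer parameter. */
int CAdd_Do(const struct CAdd *self_in_ecx)
{
    /* mov dword ptr [this], ecx : copy the register arg to a stack local */
    const struct CAdd *self = self_in_ecx;

    /* mov eax, [this] / mov eax, [eax] / mov ecx, [this] / add eax, [ecx+4] */
    return self->x + self->y;
}
```

The local `self` plays the role of the stack slot the disassembler labels `this`; an optimized build would simply keep using the register.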
I am looking for a way to print an integer in assembler (the compiler I am using is NASM on Linux), however, after doing some research, I have not been able to find a truly viable solution. I was able to find a description for a basic algorithm to serve this purpose, and based on that I developed this code:
global _start
section .bss
digit: resb 16
count: resb 16
i: resb 16
section .data
section .text
_start:
mov dword[i], 108eh ; i = 4238
mov dword[count], 1
L01:
mov eax, dword[i]
cdq
mov ecx, 0Ah
div ecx
mov dword[digit], edx
add dword[digit], 30h ; add 48 to digit to make it an ASCII char
call write_digit
inc dword[count]
mov eax, dword[i]
cdq
mov ecx, 0Ah
div ecx
mov dword[i], eax
cmp dword[i], 0Ah
jg L01
add dword[i], 48 ; add 48 to i to make it an ASCII char
mov eax, 4 ; system call #4 = sys_write
mov ebx, 1 ; file descriptor 1 = stdout
mov ecx, i ; store *address* of i into ecx
mov edx, 16 ; byte size of 16
int 80h
jmp exit
exit:
mov eax, 01h ; exit()
xor ebx, ebx ; errno
int 80h
write_digit:
mov eax, 4 ; system call #4 = sys_write
mov ebx, 1 ; file descriptor 1 = stdout
mov ecx, digit ; store *address* of digit into ecx
mov edx, 16 ; byte size of 16
int 80h
ret
C# version of what I want to achieve (for clarity):
static string int2string(int i)
{
Stack<char> stack = new Stack<char>();
string s = "";
do
{
stack.Push((char)((i % 10) + 48));
i = i / 10;
} while (i > 10);
stack.Push((char)(i + 48));
foreach (char c in stack)
{
s += c;
}
return s;
}
The issue is that it outputs the characters in reverse, so for 4238, the output is 8324. At first, I thought that I could use the x86 stack to solve this problem, push the digits in, and pop them out and print them at the end, however when I tried implementing that feature, it flopped and I could no longer get an output.
As a result, I am a little perplexed about how to work a stack into this algorithm in order to accomplish my goal, namely printing an integer. I would also be interested in a simpler/better solution if one is available (as it's one of my first assembler programs).
One approach is to use recursion. In this case you divide the number by 10 (getting a quotient and a remainder) and then call yourself with the quotient as the number to display; and then display the digit corresponding to the remainder.
An example of this would be:
;Input
; eax = number to display
section .data
const10: dd 10
section .text
printNumber:
push eax
push edx
xor edx,edx ;edx:eax = number
div dword [const10] ;eax = quotient, edx = remainder
test eax,eax ;Is quotient zero?
je .l1 ; yes, don't display it
call printNumber ;Display the quotient
.l1:
lea eax,[edx+'0']
call printCharacter ;Display the remainder
pop edx
pop eax
ret
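The same recursion reads as follows in C. This is a sketch rather than a translation of the routine above: the hypothetical `put_number` writes into a caller-supplied buffer instead of calling `printCharacter`, which keeps it easy to test.

```c
#include <assert.h>
#include <stddef.h>

/* Appends the decimal digits of n to buf starting at index pos and
   returns the index just past the last digit written. The caller must
   provide enough room (10 digits suffice for a 32-bit value). */
size_t put_number(unsigned int n, char *buf, size_t pos)
{
    unsigned int quotient = n / 10;
    unsigned int remainder = n % 10;

    if (quotient != 0)
        pos = put_number(quotient, buf, pos);  /* higher digits first */

    buf[pos] = (char)('0' + remainder);        /* then this digit */
    return pos + 1;
}
```

Because each call emits its own digit only after the recursive call returns, the digits come out most significant first, which is exactly what the reversed-output version in the question was missing.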
Another approach is to avoid recursion by changing the divisor. An example of this would be:
;Input
; eax = number to display
section .data
divisorTable:
dd 1000000000
dd 100000000
dd 10000000
dd 1000000
dd 100000
dd 10000
dd 1000
dd 100
dd 10
dd 1
dd 0
section .text
printNumber:
push eax
push ebx
push edx
mov ebx,divisorTable
.nextDigit:
xor edx,edx ;edx:eax = number
div dword [ebx] ;eax = quotient, edx = remainder
add eax,'0'
call printCharacter ;Display the quotient
mov eax,edx ;eax = remainder
add ebx,4 ;ebx = address of next divisor
cmp dword [ebx],0 ;Have all divisors been done?
jne .nextDigit
pop edx
pop ebx
pop eax
ret
This example doesn't suppress leading zeros, but that would be easy to add.
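A C sketch of the divisor-table approach with leading-zero suppression added might look like this. The function name `put_number_table` and the `started` flag are my own additions for illustration, and the table assumes `unsigned int` is 32 bits wide.

```c
#include <assert.h>
#include <stddef.h>

/* Writes the decimal digits of n to buf without leading zeros and
   returns the number of characters written (no terminator is added). */
size_t put_number_table(unsigned int n, char *buf)
{
    static const unsigned int divisors[] = {
        1000000000u, 100000000u, 10000000u, 1000000u, 100000u,
        10000u, 1000u, 100u, 10u, 1u
    };
    size_t pos = 0;
    int started = 0;   /* becomes 1 once a digit has been emitted */

    for (size_t i = 0; i < sizeof divisors / sizeof divisors[0]; i++) {
        unsigned int digit = n / divisors[i];
        n %= divisors[i];
        /* Skip leading zeros, but always emit the final (units) digit. */
        if (digit != 0 || started || divisors[i] == 1u) {
            buf[pos++] = (char)('0' + digit);
            started = 1;
        }
    }
    return pos;
}
```

The special case for the units digit is what makes an input of 0 print as "0" instead of nothing.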
I think that maybe implementing a stack is not the best way to do this (and I really think you could figure out how to do that, seeing as pop is just a mov and an increment of sp, so you can set up a stack anywhere you like by allocating memory for it and using one of your registers as your new 'stack pointer').
I think this code could be made clearer and more modular if you allocated memory for a C-style null-terminated string, wrote a function that converts the int to a string using the same algorithm, and then passed the result to another function that prints such strings. That avoids some of the spaghetti-code syndrome you are suffering from and fixes your problem to boot. If you want me to demonstrate, just ask, but if you wrote the code above, I think you can figure it out once the process is split up like this.
; Input
; EAX = the integer to convert
; EDI = address of the result
; Output:
; None
int_to_string:
xor ebx, ebx ; clear the ebx, I will use as counter for stack pushes
.push_chars:
xor edx, edx ; clear edx
mov ecx, 10 ; ecx is the divisor, divide by 10
div ecx ; divide edx:eax by ecx, quotient in eax, remainder in edx
add edx, 0x30 ; add 0x30 to edx convert int => ascii
push edx ; push result to stack
inc ebx ; increment my stack push counter
test eax, eax ; is eax 0?
jnz .push_chars ; if eax not 0 repeat
.pop_chars:
pop eax ; pop result from stack into eax
stosb ; store AL at the address in EDI (the result buffer) and advance EDI
dec ebx ; decrement my stack push counter
cmp ebx, 0 ; check if stack push counter is 0
jg .pop_chars ; not 0 repeat
mov eax, 0x0a
stosb ; add line feed
ret ; return to main
; eax = number to stringify/output
; edi = location of buffer
intToString:
push edx
push ecx
push edi
push ebp
mov ebp, esp
mov ecx, 10
.pushDigits:
xor edx, edx ; zero-extend eax
div ecx ; divide by 10; now edx = next digit
add edx, 30h ; decimal value + 30h => ascii digit
push edx ; push the whole dword, cause that's how x86 rolls
test eax, eax ; leading zeros suck
jnz .pushDigits
.popDigits:
pop eax
stosb ; don't write the whole dword, just the low byte
cmp esp, ebp ; if esp==ebp, we've popped all the digits
jne .popDigits
xor eax, eax ; add trailing nul
stosb
mov eax, edi
pop ebp
pop edi
pop ecx
pop edx
sub eax, edi ; return number of bytes written
ret
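Both routines above follow the same pattern: push the digits as they are produced (least significant first), then pop them into the buffer (most significant first). A C sketch of that pattern, using a small local array as the stack, could look like this; `int_to_string_c` is a hypothetical name, and the 10-entry array assumes a 32-bit `unsigned int`.

```c
#include <assert.h>
#include <stddef.h>

/* Writes n in decimal to out, followed by a terminating NUL, and returns
   the number of bytes written including the NUL (as intToString does). */
size_t int_to_string_c(unsigned int n, char *out)
{
    char stack[10];      /* enough for the digits of a 32-bit value */
    size_t depth = 0;
    size_t len = 0;

    do {                 /* "push" digits, least significant first */
        stack[depth++] = (char)('0' + n % 10);
        n /= 10;
    } while (n != 0);

    while (depth > 0)    /* "pop" them, most significant first */
        out[len++] = stack[--depth];

    out[len++] = '\0';
    return len;
}
```

The do/while form guarantees at least one digit is produced, so 0 is converted to "0" without a special case.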
I am writing this Euclidean GCD program in assembly language, and I think I know what the problem is, but I don't know how to fix it. I am calling GCD recursively from within itself, and every time I call GCD, ESP moves 4 bytes down because the return address has to be stored on the stack with each call. Therefore, my EBP will point 4 bytes down from the previous call. Can someone help me fix this code?
;Kirtan Patel
;Create a Euclidian GCD Program
;10/30/2014
.586
.MODEL FLAT
.STACK 4096
.DATA
numberm DWORD 14
numbern DWORD 10
.CODE
main PROC
push numbern ;push 10 onto the stack
push numberm ;push 14 onto the stack
call gcd ; call gcd function
add esp, 8 ;pop off the parameters from the stack.
ret ;exit the program
main ENDP
gcd PROC
push ebp ;push ebp onto the stack to preserve previous contents of ebp
mov ebp, esp ;copy esp to ebp to access the parameters 10 and 14 later on
push edx ;save the registers
push ebx
push ecx
mov ecx, DWORD PTR[ebp+12] ;copy 10 to ecx
cmp ecx, 0 ;compare to see if the divisor is zero
jnz recur ;if it is not zero then recursively call gcd
mov eax, DWORD PTR[ebp+8] ; if it zero then copy 14 to eax and return
pop ecx ;restore the contents of registers before exiting the function
pop ebx
pop edx
pop ebp
ret
recur: mov eax, DWORD PTR[ebp+8] ;copy 14 to eax
cdq ; prepare the edx register for division to store the remainder
div ecx ;eax/ecx (14/10)
mov DWORD PTR[ebp+12], edx ;copy the remainder into numbern on the stack
mov DWORD PTR[ebp+8], ecx ;copy the new divisor into numberm on the stack
pop ecx ;restore registers
pop ebx
pop edx
pop ebp
call gcd ;recursively call gcd
gcd ENDP
END
You can pass parameters on the stack. Use this C program as a prototype for your recursive function, and use the techniques described here to pass your parameters on each recursive call.
int findgcd(int x,int y){
while(x!=y){
if(x>y)
return findgcd(x-y,y);
else
return findgcd(x,y-x);
}
return x;
}
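Since the assembly in the question divides and keeps the remainder, a remainder-based prototype may be closer to what it is trying to do than the subtraction-based one. A minimal sketch, with `gcd_rec` as a hypothetical name:

```c
#include <assert.h>

/* Recursive Euclid with remainders:
       gcd(m, 0) = m
       gcd(m, n) = gcd(n, m mod n)
   Each call receives fresh arguments. In cdecl assembly that means
   pushing two new values before "call gcd" and removing them with
   "add esp, 8" afterwards, rather than overwriting the caller's slots. */
unsigned int gcd_rec(unsigned int m, unsigned int n)
{
    if (n == 0)
        return m;
    return gcd_rec(n, m % n);
}
```

With the question's inputs, the calls go gcd_rec(14, 10), gcd_rec(10, 4), gcd_rec(4, 2), gcd_rec(2, 0), and the result is 2.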
I'm using C++builder for GUI application on Win32. Borland compiler optimization is very bad and does not know how to use SSE.
I have a function that is 5 times faster when compiled with mingw gcc 4.7.
I am thinking about asking gcc to generate assembler code and then using that code inside my C function, because the Borland compiler allows inline assembler.
The function in C looks like this :
void Test_Fn(double *x, size_t n,double *AV, size_t *mA, size_t NT)
{
double s = 77.777;
size_t m = mA[NT-3];
AV[2]=x[n-4]+m*s;
}
I made the function code very simple in order to simplify my question. My real function contains many loops.
The Borland C++ compiler generated this assembler code :
;
; void Test_Fn(double *x, size_t n,double *AV, size_t *mA, size_t NT)
;
@1:
push ebp
mov ebp,esp
add esp,-16
push ebx
;
; {
; double s = 77.777;
;
mov dword ptr [ebp-8],1580547965
mov dword ptr [ebp-4],1079210426
;
; size_t m = mA[NT-3];
;
mov edx,dword ptr [ebp+20]
mov ecx,dword ptr [ebp+24]
mov eax,dword ptr [edx+4*ecx-12]
;
; AV[2]=x[n-4]+m*s;
;
?live16385@48: ; EAX = m
xor edx,edx
mov dword ptr [ebp-16],eax
mov dword ptr [ebp-12],edx
fild qword ptr [ebp-16]
mov ecx,dword ptr [ebp+8]
mov ebx,dword ptr [ebp+12]
mov eax,dword ptr [ebp+16]
fmul qword ptr [ebp-8]
fadd qword ptr [ecx+8*ebx-32]
fstp qword ptr [eax+16]
;
; }
;
?live16385@64: ;
@2:
pop ebx
mov esp,ebp
pop ebp
ret
While the gcc generated assembler code is :
_Test_Fn:
mov edx, DWORD PTR [esp+20]
mov eax, DWORD PTR [esp+16]
mov eax, DWORD PTR [eax-12+edx*4]
mov edx, DWORD PTR [esp+8]
add eax, -2147483648
cvtsi2sd xmm0, eax
mov eax, DWORD PTR [esp+4]
addsd xmm0, QWORD PTR LC0
mulsd xmm0, QWORD PTR LC1
addsd xmm0, QWORD PTR [eax-32+edx*8]
mov eax, DWORD PTR [esp+12]
movsd QWORD PTR [eax+16], xmm0
ret
LC0:
.long 0
.long 1105199104
.align 8
LC1:
.long 1580547965
.long 1079210426
.align 8
I would like help understanding how the function arguments are accessed in the gcc and Borland C++ output.
My function in C++ for Borland would be something like :
void Test_Fn(double *x, size_t n,double *AV, size_t *mA, size_t NT)
{
__asm
{
put gcc generated assembler here
}
}
Borland addresses the arguments through the ebp register while gcc uses the esp register.
Can I force one of the compilers to generate compatible code for accessing the arguments, using a calling convention like cdecl or stdcall?
The arguments are passed similarly in both cases. The difference is that the code generated by Borland expresses the argument locations relative to EBP register and GCC relative to ESP, but both of them refer to the same addresses.
Borland sets EBP to point to the start of the function's stack frame and expresses locations relative to that, while GCC doesn't set up a new stack frame but expresses locations relative to ESP, which the caller has left pointing to the end of the caller's stack frame.
The code generated by Borland sets up a stack frame at the beginning of the function, causing EBP in the Borland code to be equal to ESP in the GCC code decreased by 4. This can be seen by looking at the first two Borland lines:
push ebp ; decrease esp by 4
mov ebp,esp ; ebp = the original esp decreased by 4
The GCC code doesn't alter ESP and the Borland code doesn't alter EBP until the end of the procedure, so the relationship holds when the arguments are accessed.
The calling convention seems to be cdecl in both of the cases, and there's no difference in how the functions are called. You can add keyword __cdecl to both in order to make that clear.
void __cdecl Test_Fn(double *x, size_t n,double *AV, size_t *mA, size_t NT)
However adding inline assembly compiled with GCC to the function compiled with Borland is not straightforward, because Borland might set up a stack frame even if the function body contains only inline assembly, causing the value of ESP register to differ from the one used in the GCC code. I see three possible workarounds:
Compile with Borland without the option "Standard stack frames". If the compiler figures out that a stack frame is not needed, this might work.
Compile with GCC without the option -fomit-frame-pointer. This should make sure that at least the value of EBP is the same in both. The option is enabled at levels -O, -O2, -O3 and -Os.
Manually edit the assembly produced by GCC, changing references to ESP to EBP and adding 4 to the offset.
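The offset adjustment in the last workaround is simple arithmetic, shown here as a small C helper with the values from the two listings in the question. The helper name is hypothetical, and the rule only holds when the function starts with `push ebp` / `mov ebp, esp` and the GCC code does not move ESP before reading its arguments.

```c
#include <assert.h>

/* After "push ebp / mov ebp, esp", EBP equals the entry ESP minus 4,
   so an argument at [esp + off] on entry is found at [ebp + off + 4].

   Arguments of Test_Fn in the two listings:
       x  : [esp+4]  -> [ebp+8]
       n  : [esp+8]  -> [ebp+12]
       AV : [esp+12] -> [ebp+16]
       mA : [esp+16] -> [ebp+20]
       NT : [esp+20] -> [ebp+24] */
int esp_to_ebp_offset(int esp_offset)
{
    assert(esp_offset >= 4);   /* [esp+0] is the return address */
    return esp_offset + 4;
}
```

For example, GCC's `mov edx, DWORD PTR [esp+20]` would become `mov edx, DWORD PTR [ebp+24]`, which matches the `[ebp+24]` the Borland listing uses for NT.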
I would recommend you do some reading up on Application Binary Interfaces.
Here is a relevant link to help you figure out what compiler generates what sort of code:
https://en.wikipedia.org/wiki/X86_calling_conventions
I'd try either compiling everything with GCC, or see if compiling just the critical file with GCC and the rest with Borland and linking together works. What you explain can be made to work, but it will be a hard job that probably isn't worth your invested time (unless it will run very frequently on many, many machines).