NASM on OS X: work with a bit, not a byte

I am working on my first NASM program, and while trying to figure out the NOT instruction, I realized that instead of inverting just bit 0, it was inverting the whole byte 00000000. How would I tell it to work with a single bit, or otherwise fix this? Here is my code...
section .text
global start

start:
    mov eax, 255    ; value to work on (all eight low bits set)
    not eax         ; bitwise NOT inverts every bit of eax
    push eax        ; push the result as the exit status
    mov eax, 0x1    ; BSD syscall number 1 (exit)
    sub esp, 4      ; adjust the stack pointer
    int 0x80        ; invoke the kernel
Feel free to give me pointers also on my assembly coding, as I don't want to get into any bad habits.

In most computer architectures (including the x86), a bit is not a directly addressable unit of memory. The smallest unit that you can directly refer to is a byte, which contains 8 bits on the x86. You have not stated exactly what you're trying to accomplish, so I'm not able to give you an exact solution to your problem, but working with single bits (or groups of bits) is most often achieved by masking out the bits that are of no interest with the AND instruction, possibly shifting the value left or right, and then doing the processing.
If you want to actually get the value of the n-th bit in a register, then you're most probably looking for the BT instruction, which copies the value of the n-th bit into the Carry Flag.
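For example, here is a minimal NASM sketch of reading a single bit both ways (the bit number 3 and the example value are arbitrary):

    mov  eax, 10110101b   ; example value
    shr  eax, 3           ; move bit 3 down to bit position 0
    and  eax, 1           ; mask off everything else; eax is now 0 or 1

    mov  eax, 10110101b
    bt   eax, 3           ; copy bit 3 of eax into the Carry Flag
    setc cl               ; optionally turn CF into a 0/1 value in cl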
As for other tips: the push instruction decrements the stack pointer by the number of bytes pushed onto the stack. This is a characteristic of the x86 architecture - the stack grows, by design, downwards. Therefore, if you want to free some space on the stack, you use add esp, number_of_bytes, not sub (the way you did), which just reserves more space on the stack.
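To illustrate that balance (the register pushed here is arbitrary):

    push eax        ; esp goes down by 4; the value now lives at [esp]
    ; ... use the value on the stack ...
    add  esp, 4     ; give those 4 bytes back; esp returns to its previous value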

Related

Allocating memory using malloc() in 32-bit and 64-bit assembly language

I have to implement a 64-bit stack. To make myself comfortable with malloc, I managed to write two 32-bit integers into memory and read them back:
But when I try to do this with 64 bits:
The first snippet of code works perfectly fine. As Jester suggested, you are writing a 64-bit value in two separate (32-bit) halves. This is the way you have to do it on a 32-bit architecture. You don't have 64-bit registers available, and you can't write 64-bit chunks of memory at once. But you already seemed to know that, so I won't belabor it.
In the second snippet of code, you tried to target a 64-bit architecture (x86-64). Now, you no longer have to write 64-bit values in two 32-bit halves, since 64-bit architectures natively support 64-bit integers. You have 64-bit wide registers available, and you can write a 64-bit chunk to memory directly. Take advantage of that to simplify (and speed up) the code.
The 64-bit registers are Rxx instead of Exx. When you use QWORD PTR, you will want to use Rxx; when you use DWORD PTR, you will want to use Exx. Both are legal in 64-bit code, but only 32-bit DWORDs are legal in 32-bit code.
A couple of other things to note:
Although it is perfectly valid to clear a register using MOV xxx, 0, it is smaller and faster to use XOR eax, eax, so this is generally what you should write. It is a very old trick, something that any assembly-language programmer should know, and if you ever try to read other people's assembly programs, you'll need to be familiar with this idiom. (But actually, in the code you're writing, you don't need to do this at all. For the reason why, see point #2.)
In 64-bit mode, writing to a 32-bit register implicitly zeros the upper 32 bits of the corresponding 64-bit register, so you can simply write XOR eax, eax instead of XOR rax, rax. This is, again, smaller and faster.
The calling convention for 64-bit programs is different than the one used in 32-bit programs. The exact specification of the calling convention is going to vary, depending on which operating system you're using. As Peter Cordes commented, there is information on this in the x86 tag wiki. Both Windows and Linux x64 calling conventions pass at least the first 4 integer parameters in registers (rather than on the stack like the x86-32 calling convention), but which registers are actually used is different. Also, the 64-bit calling conventions have different requirements than do the 32-bit calling conventions for how you must set up the stack before calling functions.
(Since your screenshot says something about "MASM", I'll assume that you're using Windows in the sample code below.)
; Set up the stack, as required by the Windows x64 calling convention.
; (Note that we use the 64-bit form of the instruction, with the RSP register,
; to support stack pointers larger than 32 bits.)
sub rsp, 40
; Dynamically allocate 8 bytes of memory by calling malloc().
; (Note that the x64 calling convention passes the parameter in a register, rather
; than via the stack. On Windows, the first parameter is passed in RCX.)
; (Also note that we use the 32-bit form of the instruction here, storing the
; value into ECX, which is safe because it implicitly zeros the upper 32 bits.)
mov ecx, 8
call malloc
; Write a single 64-bit value into memory.
; (The pointer to the memory block allocated by malloc() is returned in RAX.)
mov qword ptr [rax], 1
; ... do whatever
; Clean up the stack space that we allocated at the top of the function.
add rsp, 40
If you wanted to do this in 32-bit halves, even on a 64-bit architecture, you certainly could. That would look like the following:
sub rsp, 40 ; set up stack
mov ecx, 8 ; request 8 bytes
call malloc ; allocate memory
mov dword ptr [eax], 1 ; write "1" into low 32 bits
mov dword ptr [eax+4], 2 ; write "2" into high 32 bits
; ... do whatever
add rsp, 40 ; clean up stack
Note that these last two MOV instructions are identical to what you wrote in the 32-bit version of the code. That makes sense, because you're doing exactly the same thing.
The reason the code you originally wrote didn't work is because EAX doesn't contain a QWORD PTR, it contains a DWORD PTR. Hence, the assembler generated the "invalid instruction operands" error, because there was a mismatch. This is the same reason that you don't offset by 8, because a DWORD PTR is only 4 bytes. A QWORD PTR is indeed 8 bytes, but you don't have one of those in EAX.
Or, if you wanted to write 16 bytes:
sub rsp, 40 ; set up stack
mov ecx, 16 ; request 16 bytes
call malloc ; allocate memory
mov qword ptr [rax], 1 ; write "1" into low 64 bits
mov qword ptr [rax+8], 2 ; write "2" into high 64 bits
; ... do whatever
add rsp, 40 ; clean up stack
Compare these three snippets of code, and make sure you understand the differences and why they need to be written as they are!

Windows x86 assembly language syntax [duplicate]

This question already has an answer here:
Which segment register is used by default?
(1) What does the following code mean? I cannot find any reference for the ds:[ ] syntax anywhere online. How is it different from the same instruction without the ds:?
cmp eax,dword ptr ds:[12B656Ch]
(2) In the following instruction,
movsx eax,word ptr [esi+24h]
What is the esi register used for? Is it possible to guess what the original C code is doing from using such a rare register?
DS refers to the Data Segment.
In Win32, CS, DS, ES and SS all describe the same flat segment with base address 0.
That is, these segments do not matter and a flat 32-bit address space is used.
The Data Segment is the default segment when accessing memory. Some disassemblers mistakenly list it, even though listing a default segment serves no purpose.
You can list a different segment if you do wish by using a segment override.
CS is the Code Segment, which is the default segment for jumps and calls, and SS is the Stack Segment, which is the default for addresses based on ESP.
ES is the Extra Segment which is used for string instructions.
The only segment override that makes sense in Win32 is FS (The F does not stand for anything, but it comes after E).
FS links to the Thread Information Block (TIB) which houses thread specific data and is very useful for Thread Local Storage and multi-threading in general.
There is also a GS which is reserved for future use in Win32 and is used for the TIB in Win64.
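As an illustration of such an override (these are the commonly documented 32-bit TIB offsets, not something taken from your disassembly):

mov eax, dword ptr fs:[18h]   ; FS:[18h] holds the linear address of the TIB itself (its self-pointer)
mov edx, dword ptr fs:[30h]   ; FS:[30h] holds a pointer to the Process Environment Block (PEB)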
In Linux the picture is more or less the same.
What is register X for
You must let go of the notion that registers have special purposes.
In x86 you can use almost any register for almost any purpose.
Only a few complex instructions use specific registers, but the normal instructions can use any register.
The compiler will try and use as many registers as possible to avoid having to use memory.
Having said this the original purposes of the 8 x86 registers are as follows:
EAX : accumulator, some instructions using this register have 'short versions' (see the example after this list).
EDX : overflow for EAX, used to store 64 bit values when multiplying or dividing.
ECX : counter, used in repeated string instructions (rep movs etc.) and in shifts.
EBX : miscellaneous general purpose register.
ESI : Source Index register, used as the source pointer for string instructions.
EDI : Destination Index register, used as the destination pointer for string instructions.
ESP : Stack Pointer, used to keep track of the stack.
EBP : Base Pointer, used in stack frames.
You can use any register pretty much as you please, with the exception of ESP. Although ESP will work in many instructions, it is just too awkward to lose track of the stack.
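As an example of the 'short versions' mentioned for EAX (byte counts are for 32-bit code):

add eax, 12345678h    ; EAX-specific short form: 1 opcode byte + 4 immediate bytes = 5 bytes
add ecx, 12345678h    ; generic form: opcode + ModRM byte + 4 immediate bytes = 6 bytes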
Is it possible to guess what the original C code is doing from using such a rare register?
My guess:
struct x {
    int a, b, c, d, e, f, g, h, i; // 9 ints = 36 bytes, so s sits at offset 24h
    short s;
};
....
int i = p->s; // with p being a struct x * held in ESI
ESI likely points to some structure or object. At offset 24h (36 decimal) there is a short, which is transferred into an int (hence the movsx - mov with Sign eXtend).
ESI is probably not pointing to a local variable, because in that case EBP or ESP would be used.
If you want to know more about the C code, you'd need more context.
Many C constructs translate into multiple CPU instructions.
The best way to see this is to write C code and inspect the CPU instructions that get generated.

How does `sub rsp, 16` align the stack on Mac OSX?

I'm currently learning x64 asm on Mac OSX using nasm.
I've come across the problem of aligning the stack, a necessary step before some calls such as malloc, which is done with these instructions:
push rbp
mov rbp, rsp
sub rsp, 16
Can anyone explain to me how the function prologue aligns the stack? I mean, if it's not already at a multiple of 16, why would sub rsp, 16 correct it?
Let's say that rsp = 0x35; after sub rsp, 16, rsp = 0x25, right? So rsp wasn't aligned to a multiple of 16 before the sub, and the sub didn't align it either, so I think I haven't quite understood what "aligning the stack" means.
Can someone tell me what I should understand when I read "the stack needs to be aligned on a 16 bytes boundary" ?
It doesn't align it; as you say, it just keeps the alignment. Obviously the sub rsp, 16 is only needed if you want to allocate space for 1-16 bytes of local variables. You should make the number there the amount of space you need, rounded up to a multiple of 16, assuming the stack is already aligned. Note that the return address and the saved frame pointer also add up to 16 bytes; if you don't use a frame pointer you need to account for that too.
In general, the calling convention mandates that it be aligned in a particular way upon entry to all functions, so you just have to maintain that. The only place that is normally not the case is possibly at process or thread startup but that is usually taken care of by the system libraries.
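Here is a minimal sketch of how such a prologue maintains the alignment (NASM syntax; the function name and the 32-byte allocation are just placeholders):

extern _malloc              ; the C library allocator (leading underscore on OS X)

my_function:                ; at entry, rsp % 16 == 8 (the call pushed an 8-byte return address)
    push rbp                ; another 8 bytes: rsp is 16-byte aligned again
    mov  rbp, rsp
    sub  rsp, 16            ; reserve locals in multiples of 16, so alignment is preserved
    mov  edi, 32            ; first argument (allocation size) goes in rdi/edi in the SysV ABI
    call _malloc            ; rsp is 16-byte aligned at this call, as the ABI requires
    mov  rsp, rbp
    pop  rbp
    ret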

How does address operand affect performance and size of machine code?

Starting with the 32-bit CPU modes, extended address operands are available on the x86 architecture. One can specify a base register, a displacement, an index register and a scaling factor.
For example, say we would like to stride through a list of 32-bit integers (the first two out of each 32-byte-long data structure in an array, with %rdi as the element index and %rbx as the base pointer).
addq $8, %rdi # skip eight values: advance index by 8
movl (%rbx, %rdi, 4), %eax # load data: pointer + scaled index
movl 4(%rbx, %rdi, 4), %edx # load data: pointer + scaled index + displacement
As I know, such complex addressing fits into a single machine-code instruction. But what is the cost of such operation and how does it compare to simple addressing with independent pointer calculation:
addq $32, %rbx # skip eight values: move pointer forward by 32 bytes
movl (%rbx), %eax # load data: pointer
addq $4, %rbx # point to the next value: move pointer forward by 4 bytes
movl (%rbx), %edx # load data: pointer
In the latter example, I have introduced one extra instruction and a dependency. But integer addition is very fast, I gained simpler address operands, and there are no multiplications any more. On the other hand, since the allowed scaling factors are powers of 2, the multiplication comes down to a bit shift, which is also a very fast operation. Still, two additions and a bit shift can be replaced with one addition.
What are the performance and code size differences between these two approaches? Are there any best practices for using the extended addressing operands?
Or, asking it from a C programmer's point of view, what is faster: array indexing or pointer arithmetic?
Is there any assembly editor meant for size/performance tuning? I wish I could see the machine-code size of each assembly instruction, its execution time in clock cycles or a dependency graph. There are thousands of assembly freaks that would benefit from such application, so I bet that something like this already exists!
Address arithmetic is very fast and should always be used when possible.
But here is something that the question misses.
First of all, you can't multiply by 32 using address arithmetic - 8 is the largest possible scale factor.
The first version of the code is not complete, because it needs a second instruction to increment the index register. So, we have the following two variants:
inc rbx
mov eax, [8*rbx+rdi]
vs
add rbx, 8
mov eax, [rbx]
This way, the speed of the two variants will be the same, and the size is the same as well - 6 bytes each.
So, which code is better depends only on the program context: if we have a register that already contains the address of the needed array cell, use mov eax, [rbx].
If we have a register containing the index of the cell and another containing the start address, then use the first variant. That way, after the algorithm ends, we will still have the start address of the array in rdi.
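For instance, the stride from the question could be written either way (Intel syntax; the rcx/rdx loop bounds are assumptions made for this sketch):

; indexed form: rbx = array base, rdi = element index
next_indexed:
    mov eax, [rbx + rdi*4]   ; load the element
    add rdi, 8               ; advance the index by 8 dwords (32 bytes)
    cmp rdi, rcx             ; rcx = element count (an assumption for this sketch)
    jb  next_indexed

; pointer form: rbx walks through memory directly
next_pointer:
    mov eax, [rbx]           ; load the element
    add rbx, 32              ; advance the pointer by 32 bytes
    cmp rbx, rdx             ; rdx = end address (an assumption for this sketch)
    jb  next_pointer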
The answer to your question depends on the local program-flow circumstances, and these, in turn, may vary somewhat between processor manufacturers and architectures. Microanalyzing a single instruction or two is usually pointless. You have a multi-stage pipeline, more than one integer unit, caches and lots more that come into play and that you need to factor into your analysis.
You could try reverse-engineering by looking at generated assembly code and analyzing why the sequence looks the way it does with respect to the different hardware units that will be working on it.
Another way is to use a profiler and experiment with different constructs to see what works well and what doesn't.
You could also download the source code for gcc and see how the really cool programmers evaluate a sequence to produce the fastest possible code. One day you might become one of them :-)
In any case I expect that you will come to the conclusion that the best sequence varies greatly depending on processor, compiler, optimization level and surrounding instructions. If you are using C the quality of the source code is extremely important: garbage in = garbage out.

Why does gcc do this when creating assembler code?

I am playing around with gcc -S to understand how memory and the stack work. During these experiments I found several things unclear to me. Could you please help me understand the reasons?
When the calling function sets up arguments for the callee, it uses mov relative to esp instead of push. What is the advantage of not using push?
A function that works with its stack-located arguments refers to them as ebp + (N + offset) (where N is the size reserved for the return address). I expected to see esp - offset, which seems more understandable. What is the reason to use ebp as the reference point everywhere? I know the two are equivalent, but still.
What is this magic at the beginning of main for? Why must esp be initialized in this way?
and esp,0xfffffff0
Thanks,
I will assume you are working under a 32-bit environment because in a 64-bit environment arguments are passed in registers.
Question 1
Perhaps you are passing a floating point argument here. You cannot push these directly, as the push instruction in a 32-bit runtime pushes 4 bytes at a time, so you would have to break up the value. It is sometimes easier to subtract 8 from esp and then mov the 8-byte quadword into [esp].
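A rough sketch of that pattern (NASM syntax; the value is assumed to be in st0 and the callee name is made up):

sub  esp, 8              ; reserve space for one 8-byte double argument
fstp qword [esp]         ; store the x87 value straight into the argument slot
call takes_a_double      ; hypothetical callee expecting a double on the stack
add  esp, 8              ; release the argument space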
Question 2
ebp is frequently used to index the parameters and locals in stack frames in 32-bit code. This allows the offsets within frames to stay fixed even as the stack pointer moves. For example, consider
void f(int x) {
    int a;
    g(x, 5);
}
Now if you only accessed the stack frame contents with esp, then a is at [esp], the return address would be at [esp+4] and x would be at [esp+8]. Now let's generate code to call g. We have to first push 5 then push x. But after pushing 5, the offset of x from esp has changed! This is why ebp is used. Normally on entry to functions we push the old value of ebp to save it, then copy esp to ebp. Now ebp can be used to access stack frame contents. It won't move when we are in the middle of passing arguments.
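A sketch of what a compiler might emit for f above (NASM syntax; the offsets are there to illustrate the point, real output varies):

f:
    push ebp                 ; save the caller's frame pointer
    mov  ebp, esp            ; now [ebp+8] = x, [ebp+4] = return address, [ebp] = old ebp
    sub  esp, 4              ; room for the local a at [ebp-4]
    push 5                   ; second argument to g (esp moves, ebp stays put)
    push dword [ebp+8]       ; first argument: x, still at the same fixed offset from ebp
    call g
    add  esp, 8              ; pop the two arguments
    mov  esp, ebp            ; tear down the frame
    pop  ebp
    ret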
Question 3
This AND instruction zeroes out the low 4 bits of esp, aligning it to a 16-byte boundary. Since the stack grows downward, rounding esp down simply skips at most 15 bytes and cannot collide with anything already on the stack, so this is nice and safe.
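A concrete example (the starting value of esp is arbitrary):

; suppose esp = 0x0028FF8C on entry to main
and esp, 0xfffffff0      ; low 4 bits cleared -> esp = 0x0028FF80, a 16-byte boundary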
