Windows x86 assembly language syntax [duplicate] - windows

This question already has an answer here:
Which segment register is used by default?
(1 answer)
Closed 6 years ago.
(1) What does the following code mean? I cannot find any reference about the ds:[ ] syntax anywhere online. How is it different from without the ds:?
cmp eax,dword ptr ds:[12B656Ch]
(2) In the following instruction,
movsx eax,word ptr [esi+24h]
What is the esi register used for? Is it possible to guess what the original C code is doing from using such a rare register?

DS refers to the Data Segment.
In Win32, CS = DS = ES = SS = 0.
That is these segments do not matter and a flat 32 bit address space is used.
The Data segment is the default segment when accessing memory. Some disassemblers mistakenly list it, even though it serves no purpose to list a default segment.
You can list a different segment if you do wish by using a segment override.
CS is de Code Segment which is the default segment for jumps and calls and SS is the Stack segment which is the default for addresses based on ESP.
ES is the Extra Segment which is used for string instructions.
The only segment override that makes sense in Win32 is FS (The F does not stand for anything, but it comes after E).
FS links to the Thread Information Block (TIB) which houses thread specific data and is very useful for Thread Local Storage and multi-threading in general.
There is also a GS which is reserved for future use in Win32 and is used for the TIB in Win64.
In Linux the picture is more or less the same.
What is register X for
You must let go of the notion that registers have special purposes.
In x86 you can use almost any register for almost any purpose.
Only a few complex instructions use specific registers, but the normal instructions can use any register.
The compiler will try and use as many registers as possible to avoid having to use memory.
Having said this the original purposes of the 8 x86 registers are as follows:
EAX : accumulator, some instructions using this register have 'short versions'.
EDX : overflow for EAX, used to store 64 bit values when multiplying or dividing.
ECX : counter, used in string instructions like rep mov and shifts.
EBX : miscellaneous general purpose register.
ESI : Source Index register, used as source pointer for string instructions
EDI : Destination Index register, used as destination pointer
ESP : Stack pointer, used to keep track of the stack
EBP : Base pointer, used in stack frames
You can use any register pretty much as you please, with the exception of ESP. Although ESP will work in many instructions, it is just too awkward to lose track of the stack.
Is it possible to guess what the original C code is doing from using such a rare register?
My guess:
struct x {
int a,b,c,d,e,f,g,h,i,j; //36 bytes
short s };
....
int i = x.s;
ESI likely points to some structure or object. At offset 24h (36) a short is present which is transfered into an int. (hence the mov with Sign eXtend).
ESI does not link local variable, because in that case EBP or ESP would be used.
If you want to know more about the c code you'd need more context.
Many c constructs translate into multiple cpu instructions.
The best way to see this is to write c code and inspect the cpu code that gets generated.

Related

Calculate Physical Address at which AL will get stored in MOV AL,5[SI][BP] [duplicate]

I am studying computer architecture from the Intel Manual. The thing that I understand is that the instructions that we give are logical addresses which consist of a segment selector and an offset.
It is basically CS register<<4 + offset. The Segment Selector maps to the GDT or LDT as given in the TI bit of the segment selector. GDT consists of Segment Descriptors which have BASE, LIMIT and RPL and the output is base address. This base address + offset provides the logical address.
What are the rules that decide which segment register (SS, DS, etc.) applies to different memory operations? e.g. what determines which segment is used for mov eax, [edi]?
Code fetch always uses CS.
Data addressing modes default to DS (or SS when EBP or ESP are the base register) in "normal" addressing modes. (e.g. mov eax, [edi] is equivalent to [ds:edi], mov eax, [ebp+edi*4] is equivalent to mov eax, [ss: ebp + edi*4]).
(Some disassemblers make the segment explicit even when it's the default, so you see a lot of DS: cluttering up the disassembly output. (You can use a segment override prefix to select which segment will apply to the memory operand in an instruction.) In NASM syntax, explicitly using a [ds:edi] addressing mode will result in a redundant ds prefix in the machine code.)
Some instructions with implicit memory operands have different defaults:
Some string instructions use ES:EDI implicitly. e.g. The movs instruction reads from [DS:ESI] and writes to [ES:EDI], making it easy to copy between segments without segment override prefixes.
Memory operands using esp or ebp as the base register default to SS, and so do the implicit accesses for stack instructions like push/pop/call/ret.
FS and GS are never the default, so they can be used for special purposes (like thread-local storage) in a flat memory model system like modern 32 and 64-bit OSes.
wikipedia explains the same thing here.
This is also documented officially in Intel's ISA manuals. e.g. in Volume 2 (the instruction-set ref), Table 2-1. 16-Bit Addressing Forms with the ModR/M Byte has a footnote saying:
The default segment register is SS for the effective addresses containing a BP index, DS for other effective addresses.
(note that SP isn't a valid base address for 16-bit addressing modes.
Also note that when they say "index", that means when BP is used at all, even for [bp + si] or [bp+di]. In 32 and 64-bit addressing modes, there is a clearer distinction between base and index, and [symbol + ebp*4] still implies DS as the segment because EBP is used as an index, not the base.)
There's no equivalent footnote for 32 or 64-bit addressing modes, so the details must be in another volume of the manual.
See also the x86 tag wiki for more links.

Can someone help me with segmentation and 8086 intel's microprocessor?

I am reading about the architecture of intel's 8086 and can't figure out the following things about segmentation: I know that segment registers point to segments respectively and contain the base address of a 64kb long segment. But who calculates and in which point sets the physical address in the segment registers? Also, because one physical address can be accessed by multiple segment:offset pairs and segments can overlap, how you can be sure that you won't overwrite something? Where I can read more about this?
Generally speaking the Assembler will only use offset addresses to access a logical address. For example looking at this code:
start lea si,[hello] ; Load effective address of string
mov word [ds:si+10],0 ; Zero-terminate string after 10th letter
jmp $ ; Loop endlessly
; Fill rest of the segment with 0s
times 65536-($-$$) db 0x00
hello db "I'm just outside of the current segment. Hello!",0
The assembler will try to calculate the offset of 'hello' from the origin of the program. Since no origin is defined 0x0 will be assumed. However the offset of 'hello' would be 0x10000 in this case, which does not fit 16-bits. Therefor the Assembler will truncate the address to 0x0000. It will not change any of the Segment registers. However it will likely issue a warning, for example test.asm:1: warning: word data exceeds bounds. What actually happens when you run this program is that the jmp $ line is overwritten with zeroes, because the address of hello wrapped around and the CPU will start executing nothing but Zeroes, which was not what you intended to do.
That is of course only if the code-segment and data-segment are the same. Now who guarantees that is the case? Nobody really. Especially since I still don't know what platform you are coding for. It is entirely your resposibility to set up the segment registers with correct values. The easiest way to do so is:
push cs ; Push address of code segment to stack
pop ds ; Pop address back into data segment
push cs ; Same for extra data segment
pop es ;
This way you can be certain your you are accessing the offset in the correct-data segment.
Now regarding 'How do you make sure the code segment doesnt overlap the data segment', why shouldn't it? When your program with data is smaller than 64KB it is actually the easiest way to access data if your code and data segment are identical.
And how can you be sure that you don't overwrite anything important? Assembler can't help you with that, you have to check yourself if the segment:offset address you are writing to already contains data.

How does `sub rsp, 16` aligns the stack on Mac OSX?

I'm currently learning x64 asm on Mac OSX using nasm.
I've come across the problem of aligning the stack, a necessary step for some system calls such as malloc, which is done with these instructions:
push rbp
mov rbp, rsp
sub rsp, 16
Can anyone explain me how the function prolog does align the stack ? I mean, if it's not already on a multiple of 16, why would sub rsp, 16 correct it ?
Let's say that esp = 0x35 , after sub rsp, 16, esp = 0x25 right ? So esp wasn't aligned on a multiple of 16 before the sub, and the sub didn't align it neither so I think I haven't quite understood what "aligning the stack" means.
Can someone tell me what I should understand when I read "the stack needs to be aligned on a 16 bytes boundary" ?
It doesn't align it, as you say it just keeps alignment. Obviously the sub rsp, 16 is only needed if you want to allocate space for 1-15 bytes of local variables. You should make sure the number there is the next multiple of 16 above the space needed, assuming the stack is already aligned. Note that the return address and the frame pointer also add up to 16 bytes, if you don't use a frame pointer you need to account for that too.
In general, the calling convention mandates that it be aligned in a particular way upon entry to all functions, so you just have to maintain that. The only place that is normally not the case is possibly at process or thread startup but that is usually taken care of by the system libraries.

Why gcc does so when creating assembler code?

I am playing around with gcc -S to understand how memory and stack works. During these plays I found several things unclear to me. Could you please help me to understand the reasons?
When calling function sets arguments for a called one it uses mov to esp instead push. What is the advantage not using push?
Function which works with its stack located arguments points to them as ebp + (N + offset) (where N is a size reserved for return address). I expect to see esp - offset which is more understandable. What is the reason to use ebp as fundamental point everywhere? I know these ones are equal but anyway?
What is this magic for in the beginning of main? Why esp must be initialized in this way only?
and esp,0xfffffff0
Thanks,
I will assume you are working under a 32-bit environment because in a 64-bit environment arguments are passed in registers.
Question 1
Perhaps you are passing a floating point argument here. You cannot push these directly, as the push instruction in a 32-bit runtime pushes 4 bytes at a time so you would have to break up the value. It is sometimes easier to subtract 8 from esp and them mov the 8-byte quadword into [esp].
Question 2
ebp is frequently used to index the parameters and locals in stack frames in 32-bit code. This allows the offsets within frames to be fixed even as the stack pointer moves. For example consider
void f(int x) {
int a;
g(x, 5);
}
Now if you only accessed the stack frame contents with esp, then a is at [esp], the return address would be at [esp+4] and x would be at [esp+8]. Now let's generate code to call g. We have to first push 5 then push x. But after pushing 5, the offset of x from esp has changed! This is why ebp is used. Normally on entry to functions we push the old value of ebp to save it, then copy esp to ebp. Now ebp can be used to access stack frame contents. It won't move when we are in the middle of passing arguments.
Question 3
This and instruction zeros out the last 4 bits of esp, aligning it to a 16-byte boundary. Since the stack grows downward, this is nice and safe.

NASM on OS X work with bit, not byte

I am working on my first NASM program, and while trying to figure out the not instruction, I realized that instead of reversing the bit 0, it was reversing the byte 00000000. How would I tell it to work with a bit or otherwise fix this? Here is my code...
section .text
global start
start:
mov eax, 255
not eax
push eax
mov eax, 0x1
sub esp, 4
int 0x80
Feel free to give me pointers also on my assembly coding, as I don't want to get into any bad habits.
In most computer architectures (including the x86), a bit is not a directly addressable unit of memory. The smallest unit that you can directly refer to is a byte, which happens to contain 8 bits on the x86. You have not stated what you're exactly trying to accomplish, so I'm not able to give you an exact solution to your problem, but working with single bits (or groups of bits) is most often achieved by masking out the bits that are of no interest with the AND instruction, eventually shifting the value left or right, and then doing the processing.
If you want to actually get the value of the n-th bit in a register, then you're most probably looking for the instruction BT. It stores the value of the n-th bit in the Carry Flag.
When it comes to other tips : the push instruction decrements the stack pointer by the number of bytes pushed to the stack. This is a characteristic of the x86 architecture - the stack grows, by design, downwards. Therefore, if you want to free some space on the stack, you do add esp, number_of_bytes, not sub (the way you did), which just reserves more space on the stack.

Resources