Calculate Physical Address at which AL will get stored in MOV AL,5[SI][BP] [duplicate] - x86-16

I am studying computer architecture from the Intel manual. As I understand it, the addresses we give in instructions are logical addresses, which consist of a segment selector and an offset.
In real mode the address is basically (segment register << 4) + offset. In protected mode the segment selector indexes the GDT or LDT, as selected by the TI bit of the selector. The GDT consists of segment descriptors, which hold a BASE, a LIMIT, and access rights (the RPL is part of the selector, not the descriptor). The descriptor's base address + offset gives the linear address.
What are the rules that decide which segment register (SS, DS, etc.) applies to different memory operations? e.g. what determines which segment is used for mov eax, [edi]?
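To make the selector fields mentioned in the question concrete, here is a minimal sketch that splits a 16-bit protected-mode selector into index, TI, and RPL, and also shows the real-mode (segment << 4) + offset calculation. The struct and function names are invented for this illustration; they are not from the Intel manual.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of a protected-mode segment selector (names are illustrative). */
struct selector_fields {
    unsigned index; /* bits 15..3: descriptor index within the GDT or LDT */
    unsigned ti;    /* bit 2: 0 = GDT, 1 = LDT */
    unsigned rpl;   /* bits 1..0: requested privilege level */
};

struct selector_fields decode_selector(uint16_t sel)
{
    struct selector_fields f;
    f.index = sel >> 3;
    f.ti    = (sel >> 2) & 1u;
    f.rpl   = sel & 3u;
    return f;
}

/* Real-mode address: segment * 16 + offset. */
uint32_t real_mode_address(uint16_t segment, uint16_t offset)
{
    uint32_t addr = ((uint32_t)segment << 4) + offset;
    assert(addr <= 0x10FFEFu); /* largest value FFFF:FFFF can produce */
    return addr;
}
```

For example, the selector 0x23 decodes to index 4, TI = 0 (GDT), RPL = 3, and the real-mode pair 1234:0010 gives address 0x12350.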

Code fetch always uses CS.
Data addressing modes default to DS (or SS when EBP or ESP is the base register) in "normal" addressing modes. (e.g. mov eax, [edi] is equivalent to mov eax, [ds:edi], and mov eax, [ebp+edi*4] is equivalent to mov eax, [ss:ebp + edi*4]).
(Some disassemblers make the segment explicit even when it's the default, so you see a lot of DS: cluttering up the disassembly output. You can use a segment override prefix to select which segment will apply to the memory operand in an instruction. In NASM syntax, explicitly using a [ds:edi] addressing mode will result in a redundant ds prefix in the machine code.)
Some instructions with implicit memory operands have different defaults:
Some string instructions use ES:EDI implicitly. e.g. The movs instruction reads from [DS:ESI] and writes to [ES:EDI], making it easy to copy between segments without segment override prefixes.
Memory operands using esp or ebp as the base register default to SS, and so do the implicit accesses for stack instructions like push/pop/call/ret.
FS and GS are never the default, so they can be used for special purposes (like thread-local storage) in a flat memory model system like modern 32 and 64-bit OSes.
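The default-segment rule for 32-bit addressing modes can be sketched as a tiny lookup. This is only a model of the rule as stated above (base register EBP or ESP selects SS, everything else DS); the enum and function names are made up, and it deliberately ignores override prefixes and the implicit ES:EDI operand of string instructions.

```c
#include <assert.h>

enum reg32 { REG_NONE, EAX, ECX, EDX, EBX, ESP, EBP, ESI, EDI };
enum seg   { SEG_DS, SEG_SS };

/* Only the BASE register matters; an index register never changes the default. */
enum seg default_segment32(enum reg32 base)
{
    return (base == EBP || base == ESP) ? SEG_SS : SEG_DS;
}

/* The examples from the text, written as checks. */
void default_segment32_examples(void)
{
    assert(default_segment32(EDI) == SEG_DS);      /* mov eax, [edi] */
    assert(default_segment32(EBP) == SEG_SS);      /* mov eax, [ebp + edi*4] */
    assert(default_segment32(REG_NONE) == SEG_DS); /* [symbol + ebp*4]: EBP is only the index */
}
```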
Wikipedia explains the same thing.
This is also documented officially in Intel's ISA manuals. e.g. in Volume 2 (the instruction-set ref), Table 2-1. 16-Bit Addressing Forms with the ModR/M Byte has a footnote saying:
The default segment register is SS for the effective addresses containing a BP index, DS for other effective addresses.
(Note that SP isn't a valid base register for 16-bit addressing modes.
Also note that when they say "index", that means when BP is used at all, even for [bp + si] or [bp+di]. In 32 and 64-bit addressing modes, there is a clearer distinction between base and index, and [symbol + ebp*4] still implies DS as the segment because EBP is used as an index, not the base.)
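Applied to the instruction in the title, MOV AL,5[SI][BP], the footnote means the access goes through SS. A rough sketch of the real-mode calculation follows; the register values are invented for the example, and wrap-around at 1 MiB is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 16-bit effective address: the sum wraps around at 64 KiB. */
uint16_t effective_address16(uint16_t base, uint16_t index, uint16_t disp)
{
    return (uint16_t)(base + index + disp);
}

/* Real-mode physical address. Per the footnote, any use of BP selects SS
 * by default; every other 16-bit addressing mode selects DS. */
uint32_t physical_address16(bool uses_bp, uint16_t ss, uint16_t ds, uint16_t ea)
{
    uint16_t seg = uses_bp ? ss : ds;
    return ((uint32_t)seg << 4) + ea;
}

/* MOV AL, 5[SI][BP] with made-up register values. */
void mov_al_example(void)
{
    uint16_t ss = 0x2000, ds = 0x1000, bp = 0x0100, si = 0x0020;
    uint16_t ea = effective_address16(bp, si, 5);            /* 0x0125 */
    assert(ea == 0x0125);
    assert(physical_address16(true, ss, ds, ea) == 0x20125); /* SS, not DS */
}
```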
There's no equivalent footnote for 32 or 64-bit addressing modes, so the details must be in another volume of the manual.
See also the x86 tag wiki for more links.

Related

Are some general purpose registers faster than others?

In x86-64, will certain instructions execute faster if some general purpose registers are preferred over others?
For instance, would mov eax, ecx execute faster than mov r8d, ecx? I can imagine that the latter would need a REX prefix which would make the instruction fetch slower?
What about using rax instead of rcx? What about add or xor? Other operations? Smaller registers like r15b vs al? al vs ah?
AMD vs Intel? Newer processors? Older processors? Combinations of instructions?
Clarification: Should certain general purpose registers be preferred over others, and which ones are they?
In general, architectural registers are all equal, and renamed onto a large array of physical registers.
(Except that partial registers can be slower, especially high-byte AH/BH/CH/DH, which are slow to read after writing the full register on Haswell and later. See "How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent" and also "Why doesn't GCC use partial registers?" for problems when writing 8-bit and 16-bit registers. The rest of this answer is just going to consider 32/64-bit operand-size.)
But some instructions require specific registers: legacy variable-count shifts (without BMI2 shrx etc.) require the count in CL, and division requires the dividend in EDX:EAX (or RDX:RAX for the slower 64-bit version).
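As a sketch of what "requires specific registers" means, the two functions below model the architectural behaviour of 32-bit unsigned DIV and of a shift by CL. The function names are made up; the point is only that the inputs and outputs are fixed registers, which the model represents as explicit parameters.

```c
#include <assert.h>
#include <stdint.h>

/* Model of 32-bit unsigned DIV: the 64-bit dividend is EDX:EAX,
 * the quotient goes back to EAX and the remainder to EDX.
 * (The real instruction faults on divide-by-zero or quotient overflow;
 * this sketch just asserts those cases away.) */
void div32(uint32_t *eax, uint32_t *edx, uint32_t divisor)
{
    uint64_t dividend = ((uint64_t)*edx << 32) | *eax;
    assert(divisor != 0);
    assert(dividend / divisor <= UINT32_MAX);
    *eax = (uint32_t)(dividend / divisor);
    *edx = (uint32_t)(dividend % divisor);
}

/* Model of SHL r32, CL: the count always comes from CL and is masked to 5 bits. */
uint32_t shl32_cl(uint32_t value, uint8_t cl)
{
    return value << (cl & 31);
}
```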
Using a call-preserved register like RBX means your function has to spend extra instructions saving/restoring it.
But of course there are perf differences if you need more instructions. So let's assume all else is equal, and just talk about the uops, latency, and code-size of a single instruction when changing only which register is used for one of its operands. TL;DR: the only perf difference is due to instruction-encoding restrictions / differences. Sometimes a different register will allow / require (or get the assembler to pick) a different encoding, which will often be smaller / larger as a special case, and sometimes even executes differently.
Generally smaller code is faster, and packs better in the uop cache and I-cache, so unless you've analyzed a specific case and found a problem, favour the smaller encoding. Often that means keeping a byte value in AL so you can use those special-case instructions, and avoiding RBP / R13 for pointers.
Special cases where a specific encoding is extra slow, not just size
LEA with RBP or R13 as a base can be slower on Intel if the addressing mode didn't already have a +displacement constant.
e.g. lea eax, [rbp + 12] is encodeable as-written, and is just as fast as lea eax, [rcx + 12].
But lea eax, [rbp + rcx*4] can only be encoded in machine code as lea eax, [rbp + rcx*4 + 0] (because the encoding that would mean "RBP base, no displacement" is used as an escape code for a different addressing mode), which is a 3-component LEA, and thus slower on Intel (3 cycle latency on Sandybridge-family instead of 1 cycle, see https://agner.org/optimize/ instruction tables and microarch PDF). On AMD, having a scaled index would already make it a slow LEA even with lea eax, [rdx + rcx*4].
Outside of LEA, using RBP / R13 as the base in any addressing mode always requires a disp8/32 byte or dword, but I don't think the actual AGUs are slower for a 3-component addressing mode. So it's just a code-size effect.
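The code-size effect can be sketched by counting the bytes of the addressing-mode part of an instruction (ModRM, optional SIB, optional displacement). This is a simplified model with a made-up function name: it assumes a base register is always present, and it does not count opcode bytes or REX prefixes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bytes used by ModRM + SIB + disp for [base + index*scale + disp].
 * base_num is the hardware register number 0..15
 * (RSP=4, RBP=5, R12=12, R13=13). */
int addr_mode_bytes(int base_num, bool has_index, int32_t disp)
{
    int low3 = base_num & 7;
    int bytes = 1;                       /* ModRM */
    assert(base_num >= 0 && base_num <= 15);

    if (has_index || low3 == 4)          /* rm=100 is the SIB escape (RSP/R12) */
        bytes += 1;

    if (disp == 0 && low3 != 5)          /* mod=00: no displacement needed */
        return bytes;
    /* mod=00 with base field 101 means "no base, disp32" (or RIP-relative),
     * so RBP/R13 must use mod=01 with an explicit disp8, even if it is 0. */
    if (disp >= -128 && disp <= 127)
        return bytes + 1;                /* disp8 */
    return bytes + 4;                    /* disp32 */
}
```

With this model, [rcx + 12] and [rbp + 12] both take 2 bytes, [rdx + rcx*4] takes 2, and [rbp + rcx*4] takes 3 because of the forced disp8 of 0.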
Other cases include "Which Intel microarchitecture introduced the ADC reg,0 single-uop special case?", where the short-form 2-byte encoding for adc al, imm8 is 2 uops even on modern uarches like Skylake, where adc bl, imm8 is 1 uop.
So not only does the adc reg,0 special case not work for adc al,0 on Sandybridge through Haswell, Broadwell and newer forgot to (or chose not to) optimize how that encoding decodes to uops. (Of course you could manually encode adc al,0 using the 3-byte ModRM encoding, but assemblers will always pick the shortest encoding, so adc al,0 will assemble to the short form by default.) This is only a problem with byte registers; adc eax,0 will use the opcode ModRM imm8 3-byte encoding, not the 5-byte opcode imm32.
For other cases of op al,imm8, the only difference is code-size, which only indirectly matters for performance. (Because of decoding, uop-cache packing, and I-cache misses).
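To show the two encodings side by side, here is a small sketch of an encoder for adc r8, imm8 with a legacy byte register. The function and its allow_short_form flag are hypothetical; real assemblers expose this choice differently, if at all.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode `adc r8, imm8` for a legacy byte register number 0..7
 * (AL=0, CL=1, DL=2, BL=3, ...). Returns the number of bytes written.
 * allow_short_form mimics an assembler picking the AL-only opcode 14 ib. */
size_t encode_adc_r8_imm8(uint8_t out[3], unsigned reg, uint8_t imm,
                          int allow_short_form)
{
    assert(reg <= 7);
    if (reg == 0 && allow_short_form) {
        out[0] = 0x14;                          /* adc al, imm8 */
        out[1] = imm;
        return 2;
    }
    out[0] = 0x80;                              /* group-1 opcode: op r/m8, imm8 */
    out[1] = (uint8_t)(0xC0 | (2u << 3) | reg); /* mod=11, /2 = ADC, rm=reg */
    out[2] = imm;
    return 3;
}
```

So adc al, 0 is 14 00 in the short form or 80 D0 00 in the general form, while adc bl, 0 is always 80 D3 00.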
See Tips for golfing in x86/x64 machine code for more about special cases of code-size, like xchg eax, ecx being 1-byte vs. xchg edx, ecx being 2 bytes.
add rsp, 8 can need an extra stack-sync uop if there hasn't been an explicit use of RSP or ESP since the last push/pop/call/ret (along the path of execution of course, not in the static code layout). (What is the stack engine in the Sandybridge microarchitecture?). This is why compilers like clang use a dummy push or pop to reserve / free a single stack slot: Why does this function push RAX to the stack as the first operation?
LEA will be slower with EBP, RBP, or R13 as the base (PDF warning, page 3-22). But generally the answer is No.
Taking a step back, it's important to realize that since the advent of register renaming, architectural registers no longer correspond to fixed physical registers on most micro-architectures. For example, each Cascade Lake core has a register file of 180 integer and 168 FP registers.
You have packed too many questions together. However, if I understood the question correctly, you are confusing the processor architecture with the small but fast register file, which bridges the speed gap between the processor and memory technologies. The register file is small enough that it can only support one instruction at a time, i.e. the current instruction, and fast enough that it can almost keep up with the processor speed.
To give a short background: the naming conventions of these registers serve two purposes. One, they keep code compatible with older implementations of the x86 ISA, and two, each of these registers has a special purpose besides its general-purpose use. For example, the ECX register is used as a counter to implement loops, i.e. instructions like JECXZ and LOOP use the ECX register exclusively. Though you need to watch out for some flags that you would not want to lose.
The answer to your question stems from the second purpose. Some registers can seem faster because these special registers are hardcoded into the processor and can be accessed more quickly; however, the difference should not be large.
Secondly, as you might know, not all instructions have the same complexity. Especially in x86, the opcode of an instruction can be 1-3 bytes long, and as more functionality is added to the instruction in terms of prefixes, addressing modes, etc., these instructions start to become slower. So it is not that some registers are slower than others; it is just that some registers are encoded into the instruction, and therefore those instructions run faster with that combination of registers, and seem slower if used otherwise. I hope that helps. Thanks

Are scaled-index addressing modes a good idea?

Consider the following code:
void foo(int* __restrict__ a)
{
    int i; int val = 0;
    for (i = 0; i < 100; i++) {
        val = 2 * i;
        a[i] = val;
    }
}
This compiles (with maximum optimization but no unrolling or vectorization) into...
GCC 7.2:
foo(int*):
    xor eax, eax
.L2:
    mov DWORD PTR [rdi], eax
    add eax, 2
    add rdi, 4
    cmp eax, 200
    jne .L2
    rep ret
clang 5.0:
foo(int*): # @foo(int*)
    xor eax, eax
.LBB0_1: # =>This Inner Loop Header: Depth=1
    mov dword ptr [rdi + 2*rax], eax
    add rax, 2
    cmp rax, 200
    jne .LBB0_1
    ret
What are the pros and cons of GCC's vs clang's approach? i.e. an extra variable incremented separately, vs multiplying via a more complex addressing mode?
Notes:
This question also relates to this one with about the same code, but with floats rather than ints.
Yes, take advantage of the power of x86 addressing modes to save uops, in cases where an index doesn't unlaminate into more extra uops than it would cost to do pointer increments.
(In many cases unrolling and using pointer increments is a win because of unlamination on Intel Sandybridge-family, but if you're not unrolling or if you're only using mov loads instead of folding memory operands into ALU ops for micro-fusion, then indexed addressing modes are often break even on some CPUs and a win on others.)
It's essential to read and understand Micro fusion and addressing modes if you want to make optimal choices here. (And note that IACA gets it wrong, and doesn't simulate Haswell and later keeping some uops micro-fused, so you can't even just check your work by having it do static analysis for you.)
Indexed addressing modes are generally cheap. At worst they cost one extra uop for the front-end (on Intel SnB-family CPUs in some situations), and/or prevent a store-address uop from using port7 (which only supports base + displacement addressing modes). See Agner Fog's microarch pdf, and also David Kanter's Haswell write-up, for more about the store-AGU on port7 which Intel added in Haswell.
On Haswell+, if you need your loop to sustain more than 2 memory ops per clock, then avoid indexed stores.
At best they're free other than the code-size cost of the extra byte in the machine-code encoding. (Having an index register requires a SIB (Scale Index Base) byte in the encoding).
More often the only penalty is the 1 extra cycle of load-use latency vs. a simple [base + 0-2047] addressing mode, on Intel Sandybridge-family CPUs.
It's usually only worth using an extra instruction to avoid an indexed addressing mode if you're going to use that addressing mode in multiple instructions. (e.g. load / modify / store).
Scaling the index is free (on modern CPUs at least) if you're already using a 2-register addressing mode. For lea, Agner Fog's table lists AMD Ryzen as having 2c latency and only 2 per clock throughput for lea with scaled-index addressing modes (or 3-component), otherwise 1c latency and 0.25c throughput. (e.g. lea rax, [rcx + rdx] is faster than lea rax, [rcx + 2*rdx], but not by enough to be worth using extra instructions instead.) Ryzen also doesn't like a 32-bit destination in 64-bit mode, for some reason. But the worst-case LEA is still not bad at all. And anyway, this is mostly unrelated to address-mode choice for loads, because most CPUs (other than in-order Atom) run LEA on the ALUs, not the AGUs used for actual loads/stores.
The main question is between one-register unscaled (so it can be a "base" register in the machine-code encoding: [base + idx*scale + disp]) or two-register. Note that for Intel's micro-fusion limitations, [disp32 + idx*scale] (e.g. indexing a static array) is an indexed addressing mode.
Neither function is totally optimal (even without considering unrolling or vectorization), but clang's looks very close.
The only thing clang could do better is save 2 bytes of code size by avoiding the REX prefixes with add eax, 2 and cmp eax, 200. It promoted all the operands to 64-bit because it's using them with pointers and I guess proved that the C loop doesn't need them to wrap, so in asm it uses 64-bit everywhere. This is pointless; 32-bit operations are always at least as fast as 64, and implicit zero-extension is free. But this only costs 2 bytes of code-size, and costs no performance other than indirect front-end effects from that.
You've constructed your loop so the compiler needs to keep a specific value in registers and can't totally transform the problem into just a pointer-increment + compare against an end pointer (which compilers often do when they don't need the loop variable for anything except array indexing).
You also can't transform to counting a negative index up towards zero (which compilers never do, but which reduces the loop overhead to a total of 1 macro-fused add + branch uop on Intel CPUs, which can fuse add + jcc, while AMD can only fuse test or cmp / jcc).
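For reference, this is roughly what the "negative index counting up towards zero" idiom looks like at the source level, on a simpler loop where the index is only used for addressing. It is a sketch with a made-up function name, and there is no guarantee that a given compiler turns it into a single add + jnz pair.

```c
#include <assert.h>
#include <stddef.h>

/* Sum an array by indexing from the END with a negative index that counts
 * up towards zero, so the loop condition is just "index != 0". */
long sum_negative_index(const int *a, ptrdiff_t n)
{
    const int *end = a + n;      /* one past the last element */
    long total = 0;
    assert(n >= 0);
    for (ptrdiff_t i = -n; i != 0; i++)
        total += end[i];         /* end[-n] is a[0], end[-1] is a[n-1] */
    return total;
}
```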
Clang has done a good job noticing that it can use 2*var as the array index (in bytes). This is a good optimization for tune=generic. The indexed store will un-laminate on Intel Sandybridge and Ivybridge, but stay micro-fused on Haswell and later. (And on other CPUs, like Nehalem, Silvermont, Ryzen, Jaguar, or whatever, there's no disadvantage.)
gcc's loop has 1 extra uop in the loop. It can still in theory run at 1 store per clock on Core2 / Nehalem, but it's right up against the 4 uops per clock limit. (And actually, Core2 can't macro-fuse the cmp/jcc in 64-bit mode, so it bottlenecks on the front-end).
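The two code-generation strategies can be written back out as C to make the difference visible. This is only a source-level paraphrase of the asm above, assuming a 4-byte int; the function names are invented.

```c
#include <assert.h>

/* GCC-style: keep a separate pointer and bump it by one element per iteration. */
void foo_pointer_increment(int *a)
{
    int *p = a;
    for (int val = 0; val != 200; val += 2)
        *p++ = val;
}

/* clang-style: derive the address from the value itself.
 * The byte offset of element i is 4*i = 2*val, assuming 4-byte int. */
void foo_indexed(int *a)
{
    assert(sizeof(int) == 4);
    for (long val = 0; val != 200; val += 2)
        *(int *)((char *)a + 2 * val) = (int)val;
}
```

Both versions fill a[0..99] with 0, 2, 4, ..., 198; the second one simply has one fewer variable to update per iteration.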
Indexed addressing (in loads and stores; lea is different still) has some trade-offs, for example:
On many µarchs, instructions that use indexed addressing have a slightly longer latency than instructions that don't. But usually throughput is a more important consideration.
On Netburst, stores with a SIB byte generate an extra µop, and therefore may cost throughput as well. The SIB byte causes an extra µop whether or not you use it for indexed addressing, and since indexed addressing always needs a SIB byte, it always costs the extra µop. This doesn't apply to loads.
On Haswell/Broadwell (and still in Skylake/Kabylake), stores with indexed addressing cannot use port 7 for address generation, instead one of the more general address generation ports will be used, reducing the throughput available for loads.
So for loads it's usually good (or not bad) to use indexed addressing if it saves an add somewhere, unless they are part of a chain of dependent loads. For stores it's more dangerous to use indexed addressing. In the example code it shouldn't make a large difference. Saving the add is not really relevant, ALU instructions wouldn't be the bottleneck. The address generation happening in ports 2 or 3 doesn't matter either since there are no loads.

Windows x86 assembly language syntax [duplicate]

This question already has an answer here:
Which segment register is used by default?
(1) What does the following code mean? I cannot find any reference about the ds:[ ] syntax anywhere online. How is it different from without the ds:?
cmp eax,dword ptr ds:[12B656Ch]
(2) In the following instruction,
movsx eax,word ptr [esi+24h]
What is the esi register used for? Is it possible to guess what the original C code is doing from using such a rare register?
DS refers to the Data Segment.
In Win32, the CS, DS, ES, and SS segments all have a base of 0.
That is these segments do not matter and a flat 32 bit address space is used.
The Data segment is the default segment when accessing memory. Some disassemblers mistakenly list it, even though it serves no purpose to list a default segment.
You can list a different segment if you do wish by using a segment override.
CS is the Code Segment, which is the default segment for jumps and calls, and SS is the Stack Segment, which is the default for addresses based on ESP or EBP.
ES is the Extra Segment which is used for string instructions.
The only segment override that makes sense in Win32 is FS (The F does not stand for anything, but it comes after E).
FS links to the Thread Information Block (TIB) which houses thread specific data and is very useful for Thread Local Storage and multi-threading in general.
There is also a GS which is reserved for future use in Win32 and is used for the TIB in Win64.
In Linux the picture is more or less the same.
What is register X for
You must let go of the notion that registers have special purposes.
In x86 you can use almost any register for almost any purpose.
Only a few complex instructions use specific registers, but the normal instructions can use any register.
The compiler will try and use as many registers as possible to avoid having to use memory.
Having said this the original purposes of the 8 x86 registers are as follows:
EAX : accumulator, some instructions using this register have 'short versions'.
EDX : overflow for EAX, used to store 64 bit values when multiplying or dividing.
ECX : counter, used in string instructions like rep movs and (as CL) in variable-count shifts.
EBX : miscellaneous general purpose register.
ESI : Source Index register, used as source pointer for string instructions
EDI : Destination Index register, used as destination pointer
ESP : Stack pointer, used to keep track of the stack
EBP : Base pointer, used in stack frames
You can use any register pretty much as you please, with the exception of ESP. Although ESP will work in many instructions, it is just too awkward to lose track of the stack.
Is it possible to guess what the original C code is doing from using such a rare register?
My guess:
struct x {
    int a,b,c,d,e,f,g,h,i; // 9 ints = 36 bytes, so s is at offset 24h
    short s;
};
....
struct x obj;
int n = obj.s;
ESI likely points to some structure or object. At offset 24h (36) a short is present, which is transferred into an int (hence the mov with Sign eXtend).
ESI probably does not point to a local variable, because in that case EBP or ESP would be used.
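A quick way to check a guess like this is to ask the compiler for the field offset. The sketch below uses a hypothetical struct with nine 4-byte ints so that the short lands at 0x24; the actual original type is unknown.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout: nine 4-byte ints put the short at offset 36 (0x24). */
struct record {
    int a, b, c, d, e, f, g, h, i;
    short s;
};

/* What `movsx eax, word ptr [esi+24h]` would compute if ESI held `r`:
 * load a 16-bit value and sign-extend it to 32 bits. */
int load_s(const struct record *r)
{
    assert(offsetof(struct record, s) == 0x24); /* holds when int is 4 bytes */
    return r->s;
}
```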
If you want to know more about the C code, you'd need more context.
Many C constructs translate into multiple CPU instructions.
The best way to see this is to write C code and inspect the machine code that gets generated.

What addressing mode is used in 'mov cx, [bp+6]'?

What addressing mode is used in "mov cx, [bp+6]"? The processor is the Intel 8086. I am studying "Microprocessor and Interfacing" by Douglas V. Hall. I know it's a memory addressing mode, but I'm not sure whether it's the based addressing mode or the indexed addressing mode.
[bp+6] is the based addressing mode. From the original 8086 docs:
In based addressing, the effective address is the sum of a displacement value and the content of register BX or register BP.
Indexed addressing mode is similar but with the SI or DI registers.
Basically, you have the following modes:
Direct memory accessing like [1234].
Register indirect like [bx].
Based addressing like 4[bx] or [bp+8].
Indexed addressing like 4[si] or [di+4].
Based indexed addressing (combo of the previous two) such as 4[bx][si] or [bx+si+4].
Some other inconsequential (in this context) ones like implicit, port, string, relative.
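The memory modes in the list above differ only in which components the operand uses, which can be sketched as a small classifier. The enum and function names are made up, and the implicit, port, string, and relative modes are left out.

```c
#include <assert.h>
#include <stdbool.h>

enum mode8086 {
    MODE_DIRECT,            /* [1234] */
    MODE_REGISTER_INDIRECT, /* [bx], [si] */
    MODE_BASED,             /* 4[bx], [bp+8] */
    MODE_INDEXED,           /* 4[si], [di+4] */
    MODE_BASED_INDEXED      /* 4[bx][si], [bx+si+4] */
};

/* base = BX or BP present, index = SI or DI present, disp = constant present. */
enum mode8086 classify_mode(bool base, bool index, bool disp)
{
    if (base && index) return MODE_BASED_INDEXED;
    if (base)          return disp ? MODE_BASED : MODE_REGISTER_INDIRECT;
    if (index)         return disp ? MODE_INDEXED : MODE_REGISTER_INDIRECT;
    assert(disp);      /* with no registers there must be a displacement */
    return MODE_DIRECT;
}
```

For mov cx, [bp+6] the operand has a base (BP) and a displacement but no index, so this returns MODE_BASED.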

How does writing to CPU register actually work?

When writing to a register, say, like mov ax, 1, it overwrites the value it may have had earlier.
Now what I wonder is how big a number or string I can feed into a register, and whether another application can overwrite my app's register values. I mean, are the registers shared among processes, or does each process receive its own sandboxed/virtual registers?
I am interested in Intel x86(-64) Core CPUs and Windows.
Only one thread is scheduled at a time on a single core. The core is what has the registers.
When a new thread is scheduled, the registers of the outgoing thread are first saved, and the previously saved registers of the incoming thread are restored. This includes the program counter (instruction pointer) register, which points to the next instruction to execute.
Registers (from memory):
AX, BX, CX, DX are 16 bits, each broken into two bytes (AH, AL, BH, BL, etc.)
SI, DI, SP and BP are also 16 bits
EAX, EBX, ECX etc. are 32 bits
I'm not sure what they're called on a 64-bit system. I think I saw RAX, but I'm not sure.
There are also special-purpose registers, floating-point registers, etc.
1) The size of registers depends (in well-defined ways) on what names you're using for them. For instance, eax is 32 bits wide, ax is 16 bits, and ah/al are 8 bits. If you're on a 64-bit system, rax is 64 bits wide.
The exact limits of these register sizes will depend somewhat on how you're interpreting the values (in particular, whether you're treating them as signed or unsigned). The size is what fundamentally matters, though.
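Since al, ah, ax, and eax are all views of the same 64-bit register, their relationship can be modelled with masks on a single 64-bit value. This is just a sketch of the architectural behaviour in 64-bit mode; the function names are invented.

```c
#include <assert.h>
#include <stdint.h>

/* Partial-register writes modelled on a 64-bit value standing in for RAX. */
uint64_t write_al(uint64_t rax, uint8_t v)   { return (rax & ~0xFFull) | v; }
uint64_t write_ah(uint64_t rax, uint8_t v)   { return (rax & ~0xFF00ull) | ((uint64_t)v << 8); }
uint64_t write_ax(uint64_t rax, uint16_t v)  { return (rax & ~0xFFFFull) | v; }
/* Writing a 32-bit register zero-extends into the full 64-bit register. */
uint64_t write_eax(uint64_t rax, uint32_t v) { (void)rax; return v; }

/* A few examples, written as checks. */
void partial_register_examples(void)
{
    uint64_t rax = 0x1122334455667788ull;
    assert(write_al(rax, 0xAA)   == 0x11223344556677AAull);
    assert(write_ah(rax, 0xBB)   == 0x112233445566BB88ull);
    assert(write_ax(rax, 0xCCDD) == 0x112233445566CCDDull);
    assert(write_eax(rax, 1)     == 1);
}
```

Note the asymmetry: 8-bit and 16-bit writes leave the rest of the register unchanged, while a 32-bit write clears the upper 32 bits.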
2) The operating system kernel will save your process's registers while other processes, or the kernel, are running. The registers do take on other values while you're not running, but it's all transparent -- while your process is running, registers won't change out from under you.
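As a toy illustration of the save/restore idea, the sketch below models one core with a handful of registers and a per-thread save area. It is not how Windows or any real kernel is written; the struct layout and names are invented, and real context switches save far more state.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model: one "core" with a few registers, and a per-thread save area.
 * Real kernels save far more state (flags, segment, FP/SIMD registers). */
struct regs { uint64_t rax, rbx, rsp, rip; };

struct thread { struct regs saved; };

/* Save the outgoing thread's registers, then load the incoming thread's. */
void context_switch(struct regs *core, struct thread *out, struct thread *in)
{
    assert(out != in);
    out->saved = *core;
    *core = in->saved;
}
```

Each thread only ever sees the values it left in the registers, even though the same physical core ran other code in between.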

Resources