What addressing mode is used in 'mov cx, [bp+6]'? - x86-16

What addressing mode is used in "mov cx, [bp+6]"? The processor is intel 8086. I am studying "Microprocessor and Interfacing" by Douglas V. Hall. I know its memory addressing mode. But not sure whether its based addressing mode or index addressing mode?

[bp+6] is the based addressing mode. From the original 8086 docs:
In based addressing, the effective address is the sum of a displacement value and the content of register BX or register BP.
Indexed addressing mode is similar but with the SI or DI registers.
Basically, you have the following modes:
Direct memory accessing like [1234].
Register indirect like [bx].
Based addressing like 4[bx] or [bp+8].
Indexed addressing like 4[si] or [di+4].
Based indexed addressing (combo of the previous two) such as 4[bx][si] or [bx+si+4].
Some other inconsequential (in this context) ones like implicit, port, string, relative.
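For concreteness, here is a minimal sketch of the first five modes in 16-bit NASM-style syntax (the displacements and register choices are arbitrary examples):

mov ax, [1234h]     ; direct
mov ax, [bx]        ; register indirect
mov cx, [bp+6]      ; based (the instruction from the question; BP implies SS)
mov dx, [si+4]      ; indexed
mov al, [bx+si+4]   ; based indexed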


Calculate Physical Address at which AL will get stored in MOV AL,5[SI][BP] [duplicate]

I am studying computer architecture from the Intel manual. What I understand is that the addresses in our instructions are logical addresses, which consist of a segment selector and an offset.
In real mode it is basically (segment register << 4) + offset. In protected mode, the segment selector maps into the GDT or LDT, as selected by the TI bit of the selector. The GDT consists of segment descriptors, which hold a BASE, a LIMIT, and a DPL, and the lookup yields the base address. This base address + offset produces the linear address.
What are the rules that decide which segment register (SS, DS, etc.) applies to different memory operations? e.g. what determines which segment is used for mov eax, [edi]?
Code fetch always uses CS.
Data addressing modes default to DS (or SS when EBP or ESP is the base register) in "normal" addressing modes. (e.g. mov eax, [edi] is equivalent to mov eax, [ds:edi], and mov eax, [ebp+edi*4] is equivalent to mov eax, [ss:ebp+edi*4].)
(Some disassemblers make the segment explicit even when it's the default, so you see a lot of DS: cluttering up the disassembly output. You can use a segment override prefix to select which segment applies to the memory operand in an instruction. In NASM syntax, explicitly writing a [ds:edi] addressing mode results in a redundant ds prefix byte in the machine code.)
Some instructions with implicit memory operands have different defaults:
Some string instructions use ES:EDI implicitly. e.g. The movs instruction reads from [DS:ESI] and writes to [ES:EDI], making it easy to copy between segments without segment override prefixes.
Memory operands using esp or ebp as the base register default to SS, and so do the implicit accesses for stack instructions like push/pop/call/ret.
FS and GS are never the default, so they can be used for special purposes (like thread-local storage) in a flat memory model system like modern 32 and 64-bit OSes.
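A minimal sketch of these defaults and of an explicit override, in NASM syntax (the registers and offsets here are arbitrary examples):

mov eax, [edi]       ; default segment DS
mov eax, [ebp+8]     ; default segment SS (EBP is the base register)
mov eax, [fs:edi]    ; explicit override: emits an FS segment prefix byte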
Wikipedia explains the same thing here.
This is also documented officially in Intel's ISA manuals. e.g. in Volume 2 (the instruction-set ref), Table 2-1. 16-Bit Addressing Forms with the ModR/M Byte has a footnote saying:
The default segment register is SS for the effective addresses containing a BP index, DS for other effective addresses.
(Note that SP isn't a valid base register for 16-bit addressing modes.
Also note that when the footnote says "index", it means whenever BP is used at all, even in [bp+si] or [bp+di]. In 32- and 64-bit addressing modes there is a clearer distinction between base and index, and [symbol + ebp*4] still implies DS as the segment, because EBP is used as an index there, not the base.)
There's no equivalent footnote for 32 or 64-bit addressing modes, so the details must be in another volume of the manual.
See also the x86 tag wiki for more links.

Do the CS and DS registers still affect instructions in x64 Intel?

AFAIK they are never used and CS = DS = SS nowadays. However, if I were to set these values, would anything change, or does the processor ignore them? I've found really conflicting information on this question, and I don't understand why the registers would still be there if they are ignored.
Yes, the segment registers do still affect code execution.
The question and some of the comments don't seem to distinguish between the selector value and the base address. To clearly understand some of the apparently conflicting information you're reading on this topic, you need to make sure you recognize which one is being discussed.
The CS selector cannot be 0. It must refer to a valid code segment descriptor in the GDT or LDT. The L bit of that code segment descriptor controls whether the current process runs in 64-bit mode or in 32-bit compatibility mode.
CS (the selector) cannot be equal to DS or SS. CS must refer to a code segment, whereas DS and SS must refer to data segments (possibly the same one). The DS and SS selectors are allowed to be 0 in 64-bit mode (in 32-bit mode, a null selector would cause a GP fault when used).
The main aspect of segment registers that no longer has an effect is the base address and segment limit: the bases of CS, DS, ES, and SS are all treated as 0, and there are no segment-limit checks in 64-bit code.
This is the reason you see people saying that they are ignored.
As Margaret mentioned, the current privilege level (CPL) is in the low 2 bits of the CS and SS selector registers and also in the DPL bits of the descriptors in the GDT. These bits should be either 0 or 3, since no current operating systems use rings 1 and 2, as far as I know.
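As a small sketch of that point, user-mode code can read the CPL directly from the selector (NASM, 64-bit; the AND merely isolates the low two bits):

mov ax, cs    ; the low 2 bits of the CS selector hold the CPL
and ax, 3     ; typically 3 in user mode, 0 in the kernel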
One other minor point is that certain faults caused by memory accesses are reported as stack faults instead of GP faults, if the memory access is performed using the SS segment (because RBP or RSP is used as a base register in the instruction operand).

use of multi register to work with large numbers

How do you use multiple registers to work with large numbers?
Assume you need to do some calculation on large numbers that cannot fit in a 32-bit register, and the only way to solve the problem is with registers; the memory-based solution is not available.
For example, multiplication and division:
we have edx:eax
Is there a way, an algorithm, or an instruction to put your value directly into reg1:reg2:reg3...regn?
For example, if you have a dq in memory, how do you store it in two 32-bit registers in one shot, if that is possible?
The old 8080 and Z80 CPUs had direct support for multi-register operations, although only with pre-selected pairs of registers. The Z80 had the 8-bit registers a, b, c, d, e, h, l and instructions like add hl,de operating on 16-bit pairs of them (though this 16-bit add, for example, does not update flags, contrary to the 8-bit add d,e etc., and the pair instructions are somewhat slower than the 8-bit variants, so there is still some penalty for using 16-bit values; even so, 16-bit pairs are usually more efficient than the same task written with 8-bit instructions only).
This feature of the 8080 was, I would guess (no sources), the inspiration for the 8086's ah:al = ax scheme of 8-bit registers forming 16-bit registers, with instructions operating not only on the 8-bit registers but also on the pre-selected pairs. The 8086 is much more of a native 16-bit CPU, though, so this feature reads more like "let's support breaking 16-bit registers down into 8-bit halves to ease migration of 8-bit software" than "let's support pairing two 8-bit registers for 16-bit math".
With the 80386 this practice was abandoned: the 32-bit extension of a register, called eax, didn't add a new alias for the upper 16-bit part, making it harder to access separately (the low 16 bits are aliased by the original ax, which is needed for backward compatibility with the 8086/186/286 anyway).
That's because extra 16-bit registers for the upper parts of eax, ebx, ... would bump up the number of registers considerably, making the old instruction encoding infeasible; given the nature of the x86 instruction set, it would be quite difficult to keep basic instructions mostly 2 bytes long, and the extra combinations would probably push that average toward 3 bytes.
Your idea of supporting many more multi-register combinations would explode the required opcodes even more and even faster, so such an ISA would probably need about 4 to 6 bytes per instruction on average.
It also took a good decade before people started to feel seriously limited by 16-bit math (values from 0 to 65535 looked like quite a lot to me back when I was writing programs on a ZX Spectrum with its Z80 CPU), and 32 bits were a true breakthrough: even most real-life human math tasks, like prices in a shop, can be done easily with 32-bit integers. It took another decade or more to hit that limit regularly (rather than just in special cases), as whole movies started to be stored on disk and disks generally grew past gigabyte sizes.
So what you are asking for is usually not needed at all (pushed even further out by today's 64-bit options, which cover an insane range of values), and when one eventually does need it, it's very simple to build from separate instructions, like this 80386+ code to add esi:edi:edx into eax:ebx:ecx:
; eax:ebx:ecx += esi:edi:edx (96b integer addition)
add ecx, edx  ; lowest 32 bits; sets CF on carry-out
adc ebx, edi  ; middle 32 bits, plus the carry from below
adc eax, esi  ; highest 32 bits, plus the carry
Simple enough not to justify the above-mentioned explosion of opcode sizes that putting such a thing directly into the CPU would cause.
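Similarly, for the multiplication case the question mentions, the one-operand mul already delivers its result across two registers; a minimal sketch (the operand values are arbitrary):

; edx:eax = eax * ecx (unsigned 32x32 -> 64-bit multiply)
mov eax, 100000
mov ecx, 100000
mul ecx             ; low half of the product in EAX, high half in EDX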

Windows x86 assembly language syntax [duplicate]

This question already has an answer here:
Which segment register is used by default?
(1) What does the following code mean? I cannot find any reference to the ds:[ ] syntax anywhere online. How is it different from the version without the ds:?
cmp eax,dword ptr ds:[12B656Ch]
(2) In the following instruction,
movsx eax,word ptr [esi+24h]
What is the esi register used for? Is it possible to guess what the original C code is doing from its use of such a rare register?
DS refers to the Data Segment.
In Win32, the base addresses of CS, DS, ES, and SS are all 0.
That is, these segments do not matter, and a flat 32-bit address space is used.
The data segment is the default segment when accessing memory. Some disassemblers list it anyway, even though it serves no purpose to list a default segment.
You can list a different segment if you do wish by using a segment override.
CS is the Code Segment, which is the default segment for jumps and calls, and SS is the Stack Segment, which is the default for addresses based on ESP and EBP.
ES is the Extra Segment which is used for string instructions.
The only segment override that makes sense in Win32 is FS (The F does not stand for anything, but it comes after E).
FS links to the Thread Information Block (TIB) which houses thread specific data and is very useful for Thread Local Storage and multi-threading in general.
There is also GS, which is reserved for future use in Win32 and is used for the TIB in Win64.
In Linux the picture is more or less the same.
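As a small illustration of that FS usage (Win32-specific; offset 18h is the TIB's documented Self field, which holds the TIB's own linear address):

mov eax, fs:[18h]   ; linear address of the current thread's TIB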
What is register X for
You must let go of the notion that registers have special purposes.
In x86 you can use almost any register for almost any purpose.
Only a few complex instructions use specific registers, but the normal instructions can use any register.
The compiler will try and use as many registers as possible to avoid having to use memory.
Having said this the original purposes of the 8 x86 registers are as follows:
EAX : accumulator, some instructions using this register have 'short versions'.
EDX : overflow for EAX, used to hold the upper 32 bits of 64-bit values when multiplying or dividing.
ECX : counter, used in string instructions (like rep movs) and in shifts.
EBX : miscellaneous general purpose register.
ESI : Source Index register, used as the source pointer for string instructions.
EDI : Destination Index register, used as the destination pointer.
ESP : Stack Pointer, used to keep track of the stack.
EBP : Base Pointer, used in stack frames.
You can use any register pretty much as you please, with the exception of ESP. Although ESP will work in many instructions, it is just too awkward to lose track of the stack.
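For example, the classic string-copy idiom combines three of these implicit roles at once (a minimal sketch; the count is arbitrary):

mov ecx, 100    ; ECX = byte count
rep movsb       ; copy ECX bytes from [ESI] to [EDI]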
Is it possible to guess what the original C code is doing from using such a rare register?
My guess:
struct x {
    int a,b,c,d,e,f,g,h,i; // 9 ints = 36 bytes, so s lands at offset 36 (24h)
    short s;
};
....
int i = p->s; // p is a struct x *
ESI likely points to some structure or object. At offset 24h (36) there is a short, which is transferred into an int (hence the movsx: mov with Sign eXtend).
ESI probably does not point to a local variable, because in that case EBP or ESP would be used.
If you want to know more about the C code, you'd need more context.
Many C constructs translate into multiple CPU instructions.
The best way to get a feel for this is to write C code and inspect the assembly that gets generated.

How to enable alignment exceptions for my process on x64?

I'm curious to see if my 64-bit application suffers from alignment faults.
From Windows Data Alignment on IPF, x86, and x64 archive:
In Windows, an application program that generates an alignment fault will raise an exception, EXCEPTION_DATATYPE_MISALIGNMENT.
On the x64 architecture, the alignment exceptions are disabled by default, and the fix-ups are done by the hardware. The application can enable alignment exceptions by setting a couple of register bits, in which case the exceptions will be raised unless the user has the operating system mask the exceptions with SEM_NOALIGNMENTFAULTEXCEPT. (For details, see the AMD Architecture Programmer's Manual Volume 2: System Programming.)
[Ed. emphasis mine]
On the x86 architecture, the operating system does not make the alignment fault visible to the application. On these two platforms, you will also suffer performance degradation on the alignment fault, but it will be significantly less severe than on the Itanium, because the hardware will make the multiple accesses of memory to retrieve the unaligned data.
On the Itanium, by default, the operating system (OS) will make this exception visible to the application, and a termination handler might be useful in these cases. If you do not set up a handler, then your program will hang or crash. In Listing 3, we provide an example that shows how to catch the EXCEPTION_DATATYPE_MISALIGNMENT exception.
Ignoring the direction to consult the AMD Architecture Programmer's Manual, I will instead consult the Intel 64 and IA-32 Architectures Software Developer's Manual:
5.10.5 Checking Alignment
When the CPL is 3, alignment of memory references can be checked by setting the AM flag in the CR0 register and the AC flag in the EFLAGS register. Unaligned memory references generate alignment exceptions (#AC). The processor does not generate alignment exceptions when operating at privilege level 0, 1, or 2. See Table 6-7 for a description of the alignment requirements when alignment checking is enabled.
Excellent. I'm not sure what that means, but excellent.
Then there's also:
2.5 CONTROL REGISTERS
Control registers (CR0, CR1, CR2, CR3, and CR4; see Figure 2-6) determine operating mode of the processor and the characteristics of the currently executing task. These registers are 32 bits in all 32-bit modes and compatibility mode. In 64-bit mode, control registers are expanded to 64 bits. The MOV CRn instructions are used to manipulate the register bits. Operand-size prefixes for these instructions are ignored.
The control registers are summarized below, and each architecturally defined control field in these control registers are described individually. In Figure 2-6, the width of the register in 64-bit mode is indicated in parenthesis (except for CR0).
CR0 — Contains system control flags that control operating mode and states of the processor.
AM — Alignment Mask (bit 18 of CR0) — Enables automatic alignment checking when set; disables alignment checking when clear. Alignment checking is performed only when the AM flag is set, the AC flag in the EFLAGS register is set, CPL is 3, and the processor is operating in either protected or virtual-8086 mode.
I tried
The language I am actually using is Delphi, but pretend it's language-agnostic pseudocode:
void UnmaskAlignmentExceptions()
{
    asm
    mov rax, cr0;    //copy CR0 flags into RAX
    or rax, 0x40000; //set bit 18 (AM); bit 18 is mask 0x40000
    mov cr0, rax;    //copy flags back
}
The first instruction,
mov rax, cr0;
fails with a Privileged Instruction exception (MOV from a control register is legal only at CPL 0).
How to enable alignment exceptions for my process on x64?
PUSHF
I discovered that the x86 has these instructions:
PUSHF, POPF: Push/pop the low 16 bits of EFLAGS on/off the stack
PUSHFD, POPFD: Push/pop all 32 bits of EFLAGS on/off the stack
That then led me to the x64 version:
PUSHFQ, POPFQ: Push/pop the RFLAGS quadword on/off the stack
(In the 64-bit world, EFLAGS is extended to RFLAGS.)
So i wrote:
void EnableAlignmentExceptions()
{
    asm
    PUSHFQ;         //Push the RFLAGS quadword onto the stack
    POP RAX;        //Pop those flags into RAX
    OR RAX, $40000; //set bit 18 (AC = Alignment Check); bit 18 is mask $40000
    PUSH RAX;       //Push the modified flags back onto the stack
    POPFQ;          //Pop the stack back into RFLAGS
}
And it didn't crash or trigger a protection exception. I have no idea if it does what I want it to.
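For what it's worth, one way to check whether the AC bit took effect is to follow it with a deliberately misaligned access (a sketch; on Windows x64 this should not fault, because the OS leaves CR0.AM clear, as the answer below explains):

mov rax, [rsp+1]    ; 8-byte load from an odd address: raises #AC only if CR0.AM and RFLAGS.AC are both set and CPL is 3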
Bonus Reading
How to catch data-alignment faults on x86 (aka SIGBUS on Sparc) (unrelated question; x86 not x64, Ubuntu not Windows, gcc vs. not)
Applications running on x64 have access to a flags register (EFLAGS, extended to RFLAGS in 64-bit mode). Bit 18 in this register allows applications to get exceptions when alignment errors occur. So, in theory, all a program has to do to enable exceptions for alignment errors is modify the flags register.
However
In order for that to actually work, the operating system kernel must also set bit 18 of CR0 (the AM flag) to allow it. And the Windows operating system doesn't do that. Why not? Who knows?
Applications cannot set values in the control registers; only the kernel can do that. Device drivers run inside the kernel, so they can set it too.
It is possible to muck about and try to get this to work by creating a device driver, see:
Old New Thing - Disabling the program crash dialog archive
and the comments that follow. Note that this post is over a decade old, so some of the links are dead.
You might also find this comment (and some of the other answers in this question) to be useful:
Larry Osterman - 07-28-2004 2:22 AM
We actually built a version of NT with alignment exceptions turned on for x86 (you can do that as Skywing mentioned).
We quickly turned it off, because of the number of apps that broke :)
As an alternative to AC for finding slowdowns due to unaligned accesses, you can use hardware performance counter events on Intel CPUs for mem_inst_retired.split_loads and mem_inst_retired.split_stores to find loads/stores that split across a cache-line boundary.
perf record -c 10 -e mem_inst_retired.split_stores,mem_inst_retired.split_loads ./a.out should be useful on Linux. -c 10 records a sample every 10 HW events. If your program does a lot of unaligned accesses and you only want to find the real hotspots, leave it at the default. But -c 10 can get useful data even on a tiny binary that calls printf once. Other perf options like -g to record parent functions on each sample work as usual, and could be useful.
On Windows, use whatever tool you prefer for looking at perf counters. VTune is popular.
Modern Intel CPUs (P6 family and newer) have no penalty for misalignment within a cache line. https://agner.org/optimize/. In fact, such loads/stores are even guaranteed to be atomic (up to 8 bytes), on Intel CPUs. So AC is stricter than necessary, but it will help find potentially-risky accesses that could be page-splits or cache-line splits with differently-aligned data.
AMD CPUs may have penalties for crossing a 16-byte boundary within a 64-byte cache line. I'm not familiar with what hardware counters are available there. Beware that profiling on Intel HW won't necessarily find slowdowns that occur on AMD CPUs, if the offending access never crosses a cache line boundary.
See How can I accurately benchmark unaligned access speed on x86_64? for some details on the penalties, including my testing on 4k-split latency and throughput on Skylake.
See also http://blog.stuffedcow.net/2014/01/x86-memory-disambiguation/ for possible penalties to store-forwarding efficiency for misaligned loads/stores on Intel/AMD.
Running normal binaries with AC set is not always practical. Compiler-generated code might choose to use an unaligned 8-byte load or store to copy multiple struct members, or to store some literal data.
gcc -O3 -mtune=generic (i.e. the default with optimization enabled) assumes that cache-line splits are cheap enough to be worth the risk of using unaligned accesses instead of the multiple narrow accesses the source does. Page-splits got much cheaper in Skylake, down from ~100-150 cycles in Haswell to ~10 cycles in Skylake (about the same penalty as cache-line splits), apparently because Intel found they were less rare than previously thought.
Many optimized library functions (like memcpy) use unaligned integer accesses. e.g. glibc's memcpy, for a 6-byte copy, would do 2 overlapping 4-byte loads from the start/end of the buffer, then 2 overlapping stores. (It doesn't have a special case for exactly 6 bytes to do a dword + word, just increasing powers of 2). This comment in the source explains its strategies.
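A sketch of that overlapping trick for a 6-byte copy (assumption: RSI = source pointer, RDI = destination pointer):

mov eax, [rsi]      ; load bytes 0..3
mov edx, [rsi+2]    ; load bytes 2..5 (overlaps the first load)
mov [rdi], eax      ; store bytes 0..3
mov [rdi+2], edx    ; store bytes 2..5, overlapping the same way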
So even if your OS would let you enable AC, you might need a special version of libraries to not trigger AC all over the place for stuff like small memcpy.
SIMD
Alignment when looping sequentially over an array really matters for AVX512, where a vector is the same width as a cache line. If your pointers are misaligned, every access is a cache-line split, not just every other with AVX2. Aligned is always better, but for many algorithms with a decent amount of computation mixed with memory access, it only makes a significant difference with AVX512.
(So with AVX1/2, it's often good to just use unaligned loads, instead of always doing extra work to check alignment and go scalar until an alignment boundary, especially if your data is usually aligned but you want the function to still work, just marginally slower, when it isn't.)
Scattered misaligned accesses that cross a cache-line boundary essentially have twice the cache footprint, from touching both lines, if the lines aren't otherwise touched.
Checking for 16, 32 or 64 byte alignment with SIMD is simple in asm: just use [v]movdqa alignment-required loads/stores, or legacy-SSE memory source operands for instructions like paddb xmm0, [rdi]. Instead of vmovdqu or VEX-coded memory source operands like vpaddb xmm0, xmm1, [rdi] which let hardware handle the case of misalignment if/when it occurs.
But in C with intrinsics, some compilers (MSVC and ICC) compile alignment-required intrinsics like _mm_load_si128 into [v]movdqu, never using [v]movdqa, so that's annoying if you actually wanted to use alignment-required loads.
Of course, _mm256_load_si256 or 128 can fold into an AVX memory source operand for vpaddb ymm0, ymm1, [rdi] with any compiler including GCC/clang, same for 128-bit any time AVX and optimization are enabled. But store intrinsics that don't get optimized away entirely do get done with vmovdqa / vmovaps, so at least you can verify store alignment.
To verify load alignment with AVX, you can disable optimization so you'll get separate load / spill into __m256i temporary / reload.
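Putting the asm side of that in one place, a minimal sketch (assumption: RDI holds the pointer being checked):

vmovdqa ymm0, [rdi]         ; alignment-required load: faults if RDI isn't 32-byte aligned
vpaddb ymm0, ymm0, [rdi]    ; VEX-coded memory operand: misalignment is silently allowed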
This works on 64-bit Intel CPUs; it may fail on some AMD CPUs.
pushfq
bts qword ptr [rsp], 12h ; set AC bit of rflags
popfq
It will not work right away on 32-bit CPUs; these first require a kernel driver to change the AM bit of CR0, and then:
pushfd
bts dword ptr [esp], 12h
popfd
