Generating address of a instruction using Code Segment and Instruction Pointer - x86-16

Consider the Code Segment Address to be FE00 and Instruction Pointer to be ABBE. Shifting the Code Segment by 4 bits and Adding the Instruction pointer leads to an additional carry. How do we represent the generated address?

Consider the Code Segment Address to be FE00 and Instruction Pointer to be ABBE. How do we represent the generated address?
Either you represent the address as
0FE00h:0ABBEh, its segmented form using two 16-bit numbers separated by a colon and always (segment - colon - offset)
00108BBEh, its linear form using one 32-bit number
No matter your choice it will always require 2 word sized registers on 8086.
"The instruction pointer is an offset in the 64KB memory segment that starts at the linear address obtained from multiplying the value in the CS code segment register by 16 (same as shifting left 4 times)."
Calculating the linear address could be done inefficiently (but easy to understand) like this:
mov ax, 0FE00h ; The code segment
mov dx, 16
mul dx ; "shifting the code segment by 4 bits"
add ax, 0ABBEh ; "adding the instruction pointer"
adc dx, 0 ; Taking care of the additional carry
This linear address 00108BBEh uses 2 registers AX and DX. The AX register will hold the least significant part 8BBEh and the DX register will hold the most significant part 0010h. If you need to refer to the whole pair of registers you do it like DX:AX. So highWord - colon - lowWord.
Unlike with seg:off notation, this is just one 32-bit number split over 2 registers, no overlapping place values. The low bit of the high half has place value 2^16 when we're talking about a flat 32-bit (or 20-bit) number, not a seg:off address.

Related

Why does 8086 segment addressing not overflow?

I know that the 8086 addressing mode is a left shift of the segment register by four bits then add an offset. But will the information of significant bit not be lost after shifting the 16-bit register?
For example, the segment register stores 0x1000, will it not become 0x0000 after shifting four bits to the left?
Address calculation happens with 20-bit math. Or more on later CPUs. Not the same width as the inputs. The CPU doesn't shift inside the segment reg!
In CPUs that don't support other modes (like protected mode where the base comes from the GDT), I assume the selected segment register is just hard-wired to bits [20:4] of an adder in the AGU, so the "shift" is just built-in to the wiring. Not like when you run a shift instruction and the result goes back into the destination.
(Actual 8086 didn't have an AGU separate from its ALU, so presumably that's what happens when using that block of logic to do address math instead of 16-bit normal integer math. It also supports a mul instruction which produces a 32-bit product, that has to get split across DX:AX when it's done, but internally the ALU has to shift and add numbers across 16-bit boundaries for multiply to work.)

Is there a penalty when base+offset is in a different page than the base?

The execution times for these three snippets:
pageboundary: dq (pageboundary + 8)
...
mov rdx, [rel pageboundary]
.loop:
mov rdx, [rdx - 8]
sub ecx, 1
jnz .loop
And this:
pageboundary: dq (pageboundary - 8)
...
mov rdx, [rel pageboundary]
.loop:
mov rdx, [rdx + 8]
sub ecx, 1
jnz .loop
And this:
pageboundary: dq (pageboundary - 4096)
...
mov rdx, [rel pageboundary]
.loop:
mov rdx, [rdx + 4096]
sub ecx, 1
jnz .loop
Are, on a 4770K, roughly 5 cycles per iteration for the first snippet and roughly 9 cycles per iteration for the second snippet, then 5 cycles for the third snippet. They both access the exact same address, which is 4K-aligned. In the second snippet, only the address calculation crosses the page boundary: rdx and rdx + 8 don't belong to the same page, the load is still aligned. With a large offset it's back to 5 cycles again.
How does this effect work in general?
Routing the result from the load through an ALU instruction like this:
.loop:
mov rdx, [rdx + 8]
or rdx, 0
sub ecx, 1
jnz .loop
Makes it take 6 cycles per iteration, which makes sense as 5+1. Reg+8 should be a special fast load and AFAIK take 4 cycles, so even in this case there seems to be some penalty, but only 1 cycle.
A test like this was used in response to some of the comments:
.loop:
lfence
; or rdx, 0
mov rdx, [rdx + 8]
; or rdx, 0
; uncomment one of the ORs
lfence
sub ecx, 1
jnz .loop
Putting the or before the mov makes the loop faster than without any or, putting the or after the mov makes it a cycle slower.
Optimization rule: in pointer-connected data structures like linked-lists / trees, put the next or left/right pointers in the first 16 bytes of the object. malloc typically returns 16-byte aligned blocks (alignof(maxalign_t)), so this will ensure the linking pointers are in the same page as the start of the object.
Any other way of ensuring that important struct members are in the same page as the start of the object will also work.
Sandybridge-family normally has 5 cycle L1d load-use latency, but there's a special case for pointer-chasing with small positive displacements with base+disp addressing modes.
Sandybridge-family has 4 cycle load-use latency for [reg + 0..2047] addressing modes, when the base reg is the result of a mov load, not an ALU instruction. Or a penalty if reg+disp is in a different page than reg.
Based on these test results on Haswell and Skylake (and probably original SnB but we don't know), it appears that all of the following conditions must be true:
base reg comes from another load. (A rough heuristic for pointer-chasing, and usually means that load latency is probably part of a dep chain). If objects are usually allocated not crossing a page boundary, then this is a good heuristic. (The HW can apparently detect which execution unit the input is being forwarded from.)
Addressing mode is [reg] or [reg+disp8/disp32]. (Or an indexed load with an xor-zeroed index register! Usually not practically useful, but might provide some insight into the issue/rename stage transforming load uops.)
displacement < 2048. i.e. all bits above bit 11 are zero (a condition HW can check without a full integer adder/comparator.)
(Skylake but not Haswell/Broadwell): the last load wasn't a retried-fastpath. (So base = result of a 4 or 5 cycle load, it will attempt the fast path. But base = result of a 10 cycle retried load, it won't. The penalty on SKL seems to be 10, vs. 9 on HSW).
I don't know if it's the last load attempted on that load port that matters, or if it's actually what happened to the load that produced that input. Perhaps experiments chasing two dep chains in parallel could shed some light; I've only tried one pointer chasing dep chain with a mix of page-changing and non-page-changing displacements.
If all those things are true, the load port speculates that the final effective address will be in the same page as the base register. This is a useful optimization in real cases when load-use latency forms a loop-carried dep chain, like for a linked list or binary tree.
microarchitectural explanation (my best guess at explaining the result, not from anything Intel published):
It seems that indexing the L1dTLB is on the critical path for L1d load latency. Starting that 1 cycle early (without waiting for the output of an adder to calculate the final address) shaves a cycle off the full process of indexing L1d using the low 12 bits of the address, then comparing the 8 tags in that set against the high bits of the physical address produced by the TLB. (Intel's L1d is VIPT 8-way 32kiB, so it has no aliasing problems because the index bits all come from the low 12 bits of the address: the offset within a page which is the same in both the virtual and physical address. i.e. the low 12 bits translate for free from virt to phys.)
Since we don't find an effect for crossing 64-byte boundaries, we know the load port is adding the displacement before indexing the cache.
As Hadi suggests, it seems likely that if there's carry-out from bit 11, the load port lets the wrong-TLB load complete and then redoes it using the normal path. (On HSW, the total load latency = 9. On SKL the total load latency can be 7.5 or 10).
Aborting right away and retrying on the next cycle (to make it 5 or 6 cycles instead of 9) would in theory be possible, but remember that the load ports are pipelined with 1 per clock throughput. The scheduler is expecting to be able to send another uop to the load port in the next cycle, and Sandybridge-family standardizes latencies for everything of 5 cycles and shorter. (There are no 2-cycle instructions).
I didn't test if 2M hugepages help, but probably not. I think the TLB hardware is simple enough that it couldn't recognize that a 1-page-higher index would still pick the same entry. So it probably does the slow retry any time the displacement crosses a 4k boundary, even if that's in the same hugepage. (Page-split loads work this way: if the data actually crosses a 4k boundary (e.g. 8-byte load from page-4), you pay the page-split penalty not just the cache-line split penalty, regardless of hugepages)
Intel's optimization manual documents this special case in section 2.4.5.2 L1 DCache (in the Sandybridge section), but doesn't mention any different-page limitation, or the fact that it's only for pointer-chasing, and doesn't happen when there's an ALU instruction in the dep chain.
(Sandybridge)
Table 2-21. Effect of Addressing Modes on Load Latency
-----------------------------------------------------------------------
Data Type | Base + Offset > 2048 | Base + Offset < 2048
| Base + Index [+ Offset] |
----------------------+--------------------------+----------------------
Integer | 5 | 4
MMX, SSE, 128-bit AVX | 6 | 5
X87 | 7 | 6
256-bit AVX | 7 | 7
(remember, 256-bit loads on SnB take 2 cycles in the load port, unlike on HSW/SKL)
The text around this table also doesn't mention the limitations that exist on Haswell/Skylake, and may also exist on SnB (I don't know).
Maybe Sandybridge doesn't have those limitations and Intel didn't document the Haswell regression, or else Intel just didn't document the limitations in the first place. The table is pretty definite about that addressing mode always being 4c latency with offset = 0..2047.
#Harold's experiment of putting an ALU instruction as part of the load/use pointer-chasing dependency chain confirms that it's this effect that's causing the slowdown: an ALU insn decreased the total latency, effectively giving an instruction like and rdx, rdx negative incremental latency when added to the mov rdx, [rdx-8] dep chain in this specific page-crossing case.
Previous guesses in this answer included the suggestion that using the load result in an ALU vs. another load was what determined the latency. That would be super weird and require looking into the future. That was a wrong interpretation on my part of the effect of adding an ALU instruction into the loop. (I hadn't known about the 9-cycle effect on page crossing, and was thinking that the HW mechanism was a forwarding fast-path for the result inside the load port. That would make sense.)
We can prove that it's the source of the base reg input that matters, not the destination of the load result: Store the same address at 2 separate locations, before and after a page boundary. Create a dep chain of ALU => load => load, and check that it's the 2nd load that's vulnerable to this slowdown / able to benefit from the speedup with a simple addressing mode.
%define off 16
lea rdi, [buf+4096 - 16]
mov [rdi], rdi
mov [rdi+off], rdi
mov ebp, 100000000
.loop:
and rdi, rdi
mov rdi, [rdi] ; base comes from AND
mov rdi, [rdi+off] ; base comes from a load
dec ebp
jnz .loop
... sys_exit_group(0)
section .bss
align 4096
buf: resb 4096*2
Timed with Linux perf on SKL i7-6700k.
off = 8, the speculation is correct and we get total latency = 10 cycles = 1 + 5 + 4. (10 cycles per iteration).
off = 16, the [rdi+off] load is slow, and we get 16 cycles / iter = 1 + 5 + 10. (The penalty seems to be higher on SKL than HSW)
With the load order reversed (doing the [rdi+off] load first), it's always 10c regardless of off=8 or off=16, so we've proved that mov rdi, [rdi+off] doesn't attempt the speculative fast-path if its input is from an ALU instruction.
Without the and, and off=8, we get the expected 8c per iter: both use the fast path. (#harold confirms HSW also gets 8 here).
Without the and, and off=16, we get 15c per iter: 5+10. The mov rdi, [rdi+16] attempts the fast path and fails, taking 10c. Then mov rdi, [rdi] doesn't attempt the fast-path because its input failed. (#harold's HSW takes 13 here: 4 + 9. So that confirms HSW does attempt the fast-path even if the last fast-path failed, and that the fast-path fail penalty really is only 9 on HSW vs. 10 on SKL)
It's unfortunate that SKL doesn't realize that [base] with no displacement can always safely use the fast path.
On SKL, with just mov rdi, [rdi+16] in the loop, the average latency is 7.5 cycles. Based on tests with other mixes, I think it alternates between 5c and 10c: after a 5c load that didn't attempt the fast path, the next one does attempt it and fails, taking 10c. That makes the next load use the safe 5c path.
Adding a zeroed index register actually speeds it up in this case where we know the fast-path is always going to fail. Or using no base register, like [nosplit off + rdi*1], which NASM assembles to 48 8b 3c 3d 10 00 00 00 mov rdi,QWORD PTR [rdi*1+0x10]. Notice that this requires a disp32, so it's bad for code size.
Also beware that indexed addressing modes for micro-fused memory operands are un-laminated in some cases, while base+disp modes aren't. But if you're using pure loads (like mov or vbroadcastss), there's nothing inherently wrong with an indexed addressing mode. Using an extra zeroed register isn't great, though.
On Ice Lake, this special 4 cycle fast path for pointer chasing loads is gone: GP register loads that hit in L1 now generally take 5 cycles, with no difference based on the presence of indexing or the size of the offset.
I've conducted a sufficient number of experiments on Haswell to determine exactly when memory loads are issued speculatively before the effective address is fully calculated. These results also confirm Peter's guess.
I've varied the following parameters:
The offset from pageboundary. The offset used is the same in the definition of pageboundary and the load instruction.
The sign of the offset is either + or -. The sign used in the definition is always the opposite of the one used in the load instruction.
The alignment of pageboundary within the executable binary.
In all of the following graphs, the Y axis represents the load latency in core cycles. The X axis represents the configuration in the form NS1S2, where N is the offset, S1 is the sign of the offset used in the definition, and S2 is the sign used in the load instruction.
The following graph shows that loads are issued before calculating the effective address only when the offset is positive or zero. Note that for all of the offsets between 0-15, the base address and the effective address used in the load instruction are both within the same 4K page.
The next graph shows the point where this pattern changes. The change occurs at offset 213, which is the smallest offset where the base address and the effective address used in the load instruction are both within different 4K pages.
Another important observation that can be made from the previous two graphs is that even if the base address points to a different cache set than the effective address, no penalty is incurred. So it seems that the cache set is opened after calculating the effective address. This indicates that the L1 DTLB hit latency is 2 cycles (that is, it takes 2 cycles for the L1D to receive the tag), but it takes only 1 cycle to open the cache's data array set and the cache's tag array set (which occurs in parallel).
The next graph shows what happens when pageboundary is aligned on a 4K page boundary. In this case, any offset that is not zero will make the base and effective addresses reside within different pages. For example, if the base address of pageboundary is 4096, then the base address of pageboundary used in the load instruction is 4096 - offset, which is obviously in a different 4K page for any non-zero offset.
The next graph shows that the pattern changes again starting from offset 2048. At this point, loads are never issued before calculating the effective address.
This analysis can be confirmed by measuring the number of uops dispatched to the load ports 2 and 3. The total number of retired load uops is 1 billion (equal to the number of iterations). However, when the measured load latency is 9 cycles, the number of load uops dispatched to each of the two ports is 1 billion. Also when the load latency is 5 or 4 cycles, the number of load uops dispatched to each of the two ports is 0.5 billion. So something like this would be happening:
The load unit checks whether the offset is non-negative and smaller than 2048. In that case, it will issue a data load request using the base address. It will also begin calculating the effective address.
In the next cycle, the effective address calculation is completed. If it turns out that the load is to a different 4K page, the load unit waits until the issued load completes and then it discards the results and replays the load. Either way, it supplies the data cache with the set index and line offset.
In the next cycle, the tag comparison is performed and the data is forwarded to the load buffer. (I'm not sure whether the address-speculative load will be aborted in the case of a miss in the L1D or the DTLB.)
In the next cycle, the load buffer receives the data from the cache. If it's supposed to discard the data, it's discarded and it tells the dispatcher to replay the load with address speculation disabled for it. Otherwise, the data is written back. If a following instruction requires the data for its address calculation, it will receive the data in the next cycle (so it will be dispatched in the next cycle if all of its other operands are ready).
These steps explain the observed 4, 5, and 9 cycle latencies.
It might happen that the target page is a hugepage. The only way for the load unit to know whether the base address and the effective address point to the same page when using hugepages is to have the TLB supply the load unit with the size of the page being accessed. Then the load unit has to check whether the effective address is within that page. In modern processors, on a TLB miss, dedicated page-walk hardware is used. In this case, I think that the load unit will not supply the cache set index and cache line offset to the data cache and will use the actual effective address to access the TLB. This requires enabling the page-walk hardware to distinguish between loads with speculative addresses and other loads. Only if that other access missed the TLB will the page walk take place. Now if the target page turned out to be a hugepage and it's a hit in the TLB, it might be possible to inform the load unit that the size of the page is larger than 4K or maybe even of the exact size of the page. The load unit can then make a better decision regarding whether the load should be replayed. However, this logic should take no more than the time for the (potentially wrong) data to reach the load buffer allocated for the load. I think this time is only one cycle.

Instruction Code

I cannot understand how the computer implement a binary operation.
First of all, an instruction is stored in the memory unit.
Then, control unit in CPU takes this instruction to store it in instruction register(IR) in itself.
After that, CU decodes operation code in the instruction code by using instruction decoder (ID).
Assume that opcode = sum
If it is so, we need two operands. Some books tell that there is only one address of operand in the address part of the instruction code.
Where is second operand ?
Where is address of second operand ?
Mostly of the time, the second operand is already stored in a register of the CPU, the ALU (Arithmetic Logical Unit).
The instruction sequence may look like this:
lda $1234 ' loads value from memory address $1234 into ALU
add #20 ' adds value 20 to the value stored in ALU and stores the result in the ALU again
sda $5678 ' stores the resulting value from the ALU in the memory at address $5678
Modern CPUs have a lot of registers, so the instruction sequence for a modern CPU may look like this:
load ax, $1234 ' loads value from memory address $1234 into AX
load bx, [si+2] ' load the value from the address stored in register SI+2
add ax, bx ' adds the value in register BX to the value in register AX and stores the result in AX
store di+5, ax ' stores the value of register in memory at address stored in DI+5
Hopefully this clears it up for you.

NASM on OS X work with bit, not byte

I am working on my first NASM program, and while trying to figure out the not instruction, I realized that instead of reversing the bit 0, it was reversing the byte 00000000. How would I tell it to work with a bit or otherwise fix this? Here is my code...
section .text
global start
start:
mov eax, 255
not eax
push eax
mov eax, 0x1
sub esp, 4
int 0x80
Feel free to give me pointers also on my assembly coding, as I don't want to get into any bad habits.
In most computer architectures (including the x86), a bit is not a directly addressable unit of memory. The smallest unit that you can directly refer to is a byte, which happens to contain 8 bits on the x86. You have not stated what you're exactly trying to accomplish, so I'm not able to give you an exact solution to your problem, but working with single bits (or groups of bits) is most often achieved by masking out the bits that are of no interest with the AND instruction, eventually shifting the value left or right, and then doing the processing.
If you want to actually get the value of the n-th bit in a register, then you're most probably looking for the instruction BT. It stores the value of the n-th bit in the Carry Flag.
When it comes to other tips : the push instruction decrements the stack pointer by the number of bytes pushed to the stack. This is a characteristic of the x86 architecture - the stack grows, by design, downwards. Therefore, if you want to free some space on the stack, you do add esp, number_of_bytes, not sub (the way you did), which just reserves more space on the stack.

Behind Windows x64's 44-bit virtual memory address limit

http://www.alex-ionescu.com/?p=50.
I read the above post. The author explains why Windows x64 supports only 44-bit virtual memory address with singly linked list example.
struct { // 8-byte header
ULONGLONG Depth:16;
ULONGLONG Sequence:9;
ULONGLONG NextEntry:39;
} Header8;
The first sacrifice to make was to reduce the space for the sequence
number to 9 bits instead of 16 bits, reducing the maximum sequence
number the list could achieve. This still only left 39 bits for the
pointer — a mediocre improvement over 32 bits. By forcing the
structure to be 16-byte aligned when allocated, 4 more bits could be
won, since the bottom bits could now always be assumed to be 0.
Oh, I can't understand.
What "By forcing the structure to be 16-byte aligned when allocated, 4 more bits could be won, since the bottom bits could now always be assumed to be 0." means?
16 is 0010000 in binary
32 is 0100000 in binary
64 is 1000000 in binary
etc
You can see that for all numbers that are multiples of 16, the last four bits are always zero.
So, instead of storing these bits you can leave them out and add them back in when it's time to use the pointer.
For a 2^N-byte aligned pointer, its address is always divisible by 2^N - which means that lower N bits are always zero. You can store additional information in them:
encode ptr payload = ptr | payload
decode_ptr data = data & ~mask
decode_payload data = data & mask
where mask is (1 << N) - 1 - i.e. a number with low N bits set.
This trick is often used to save space in low-level code (payload can be GC flags, type tag, etc.)
In effect you're not storing a pointer, but a number from which a pointer can be extracted. Of course, care should be taken to not dereference the number as a pointer without decoding.

Resources