Program counter, fences and processor re-ordering - Windows

I understand that instructions can be re-ordered by the processor in addition to compilers.
I have a few questions that I cannot get my head around.
Say we have three instructions:
Program order
S1
S2
S3
After re-ordering by the processor, order becomes (for whatever reason):
S3
S2
S1
So when the processor executes S1 (in program order), what would be the value of the Program Counter?
If Windows (or another OS) context-switches the thread out and schedules it on another processor, how would the other processor know which instruction to execute next? (Is it guaranteed to make the same re-orderings?)
Is a memory fence (for example, a full fence created by an atomic compare-and-swap instruction) issued on one processor still valid after the thread is scheduled on another processor?
Any ideas on this are highly appreciated.
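(For concreteness, by a "full fence from a compare-and-swap" I mean something like the sequentially-consistent CAS below; the variable and function names are just placeholders for the example:)

    #include <atomic>

    std::atomic<int> guard{0};   // placeholder variable for the example

    void take_guard()
    {
        int expected = 0;
        // compare_exchange_strong defaults to memory_order_seq_cst; on x86 it
        // compiles to lock cmpxchg, which acts as a full memory barrier.
        while (!guard.compare_exchange_strong(expected, 1))
            expected = 0;        // CAS updated 'expected'; reset and retry
    }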

There is an instruction pointer associated with each instruction.
Although instructions may be executed out of order, they always complete in order. When an interrupt or fault occurs, all instructions preceding the saved IP address have been completed. The results of any subsequent instructions are discarded. When execution resumes, it starts at the saved address.
The steps taken by the OS to schedule a thread on another processor include fencing operations on both processors, so when the thread resumes on the new processor, all preceding operations are fully fenced (whether or not any explicit fences exist in the code of the thread).

Unlike static compile-time ordering, out-of-order exec preserves the illusion of running instructions in program order, including the state seen by an interrupt handler. Current CPUs don't rename the privilege level, so they generally roll back to a consistent state as part of taking an exception or interrupt, rather than keeping not-yet-executed instructions in flight. (See When an interrupt occurs, what happens to instructions in the pipeline?)
This also means that interrupts are delivered strictly between instructions, not in the middle of one (see Interrupting an assembly instruction while it is operating), except for "interruptible" instructions like rep movsb that logically work as multiple instructions, or vpgatherdd, which has documented semantics for a page fault in one of the gather operands.
Memory ordering as observed by other cores is another matter, and can differ from program order even on an in-order CPU. (Can a speculatively executed CPU branch contain opcodes that access RAM?)
The kernel code for a context switch needs to include a strong enough barrier for a thread to see its own stores in program order when it resumes on another core. Generally just release/acquire sync is sufficient (and you already need something like that for the kernel on the other core to restore register values). Maybe also an sfence to make that apply even for NT stores on x86.
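As a rough user-space analogy of that handoff (this is not actual kernel code; the structure and names below are invented for the sketch), the release/acquire pairing looks like this:

    #include <atomic>
    #include <thread>

    // Hypothetical, heavily simplified model of the scheduler's handoff: the
    // old core publishes the saved context with a release store, and the new
    // core picks it up with an acquire load, so every store the thread made
    // before it was switched out is visible before it runs again.
    struct SavedContext { long regs[16]; };

    SavedContext ctx;                        // the thread's saved state
    std::atomic<bool> ctx_ready{false};

    void old_core_switch_out() {
        // ... copy registers and other state into ctx ...
        ctx_ready.store(true, std::memory_order_release);   // publish
    }

    void new_core_switch_in() {
        while (!ctx_ready.load(std::memory_order_acquire))  // consume
            std::this_thread::yield();
        // ... restore registers from ctx; all of the thread's earlier stores
        // are now guaranteed to be visible on this core ...
    }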

Related

Out-of-order instruction execution: is commit order preserved?

On the one hand, Wikipedia describes the steps of out-of-order execution:
Instruction fetch.
Instruction dispatch to an instruction queue (also called instruction buffer or reservation stations).
The instruction waits in the queue until its input operands are available. The instruction is then allowed to leave the queue before
earlier, older instructions.
The instruction is issued to the appropriate functional unit and executed by that unit.
The results are queued.
Only after all older instructions have their results written back to the register file, then this result is written back to the register file. This is called the graduation or retire stage.
Similar information can be found in the "Computer Organization and Design" book:
To make programs behave as if they were running on a simple in-order
pipeline, the instruction fetch and decode unit is required to issue
instructions in order, which allows dependences to be tracked, and the
commit unit is required to write results to registers and memory in
program fetch order. This conservative mode is called in-order
commit... Today, all dynamically scheduled pipelines use in-order commit.
So, as far as I understand, even if instructions are executed out of order, the results of their execution are preserved in the reorder buffer and then committed to memory/registers in a deterministic order.
On the other hand, there is a known fact that modern CPUs can reorder memory operations for the performance acceleration purposes (for example, two adjacent independent load instructions can be reordered). Wikipedia writes about it here.
Could you please shed some light on this discrepancy?
TL:DR: memory ordering is not the same thing as out-of-order execution. Memory reordering happens even on in-order pipelined CPUs.
In-order commit is necessary1 for precise exceptions that can roll back to exactly the instruction that faulted, without any instructions after that having already retired. The cardinal rule of out-of-order execution is don't break single-threaded code. If you allowed out-of-order commit (retirement) without any other mechanism, you could have a page fault happen while some later instructions had already executed once, and/or some earlier instructions hadn't executed yet. This would make restarting execution after handling the page fault impossible in the normal way.
(In-order issue/rename and dependency-tracking takes care of correct execution in the normal case of no exceptions.)
Memory ordering is all about what other cores see. Also notice that what you quoted is only talking about committing results to the register file, not to memory.
(Footnote 1: Kilo-instruction Processors: Overcoming the Memory Wall is a theoretical paper about checkpointing state to allow rollback to a consistent machine state at some point before an exception, allowing much larger out-of-order windows without a gigantic ROB of that size. AFAIK, no mainstream commercial designs have used that, but it shows that there are in theory approaches other than strictly in-order retirement to building a usable CPU.
Apple's M1 reportedly has a significantly larger out-of-order window than its x86 contemporaries, but I haven't seen any definite info that it uses anything other than a very large ROB.)
Since each core's private L1 cache is coherent with all the other data caches in the system, memory ordering is a question of when instructions read or write cache. This is separate from when they retire from the out-of-order core.
Loads become globally visible when they read their data from cache. This is more or less when they "execute", and definitely way before they retire (aka commit).
Stores become globally visible when their data is committed to cache. This has to wait until they're known to be non-speculative, i.e. that no exceptions or interrupts will cause a roll-back that has to "undo" the store. So a store can commit to L1 cache as early as when it retires from the out-of-order core.
But even in-order CPUs use a store queue or store buffer to hide the latency of stores that miss in L1 cache. The out-of-order machinery doesn't need to keep tracking a store once it's known that it will definitely happen, so a store insn/uop can retire even before it commits to L1 cache. The store buffer holds onto it until L1 cache is ready to accept it. i.e. when it owns the cache line (Exclusive or Modified state of the MESI cache coherency protocol), and the memory-ordering rules allow the store to become globally visible now.
See also my answer on Write Allocate / Fetch on Write Cache Policy
As I understand it, a store's data is added to the store queue when it "executes" in the out-of-order core, and that's what a store execution unit does. (Store-address writing the address, and store-data writing the data into the store-buffer entry reserved for it at allocation/rename time, so either of those parts can execute first on CPUs where those parts are scheduled separately, e.g. Intel.)
Loads have to probe the store queue so that they see recently-stored data.
For an ISA like x86, with strong ordering, the store queue has to preserve the memory-ordering semantics of the ISA, i.e. stores can't reorder with other stores, and stores can't become globally visible before earlier loads. (Only StoreLoad reordering is allowed; StoreStore, LoadStore, and LoadLoad reordering are not.)
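As a concrete illustration of the one reordering x86 does allow, the classic store-buffer litmus test below (a sketch with C++ std::atomic, using relaxed ordering so the compiler adds no fences of its own) can observe r1 == 0 && r2 == 0 when thread1 and thread2 run on different cores: each store is still sitting in its core's store buffer while the other core's load reads the old value from cache. Making both stores (or inserting seq_cst fences between them) forbids that outcome.

    #include <atomic>

    std::atomic<int> x{0}, y{0};
    int r1, r2;

    void thread1() {
        x.store(1, std::memory_order_relaxed);
        r1 = y.load(std::memory_order_relaxed);   // may still see 0
    }

    void thread2() {
        y.store(1, std::memory_order_relaxed);
        r2 = x.load(std::memory_order_relaxed);   // may still see 0
    }

    // Run both in separate std::threads, many times, resetting x and y each
    // iteration; r1 == 0 && r2 == 0 shows StoreLoad reordering in action.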
David Kanter's article on how TSX (transactional memory) could be implemented in different ways than what Haswell does provides some insight into the Memory Order Buffer, and how it's a separate structure from the ReOrder Buffer (ROB) that tracks instruction/uop reordering. He starts by describing how things currently work, before getting into how it could be modified to track a transaction that can commit or abort as a group.

Processor pipeline state preservation

Is there any situation where the state of the processor pipeline (with already decoded or prefetched instructions) is saved and subsequently reloaded after resumption during a thread sleep / context switch / interrupt etc.? (Maybe as an optimization.)
This isn't possible for any CPU I'm aware of. There's no interface for doing it, and no conditions under which a CPU does it on its own. Dumping a huge amount of internal CPU state to RAM would take more cycles than it would save. Having the OS keep track of the variable-size chunks of RAM needed for this would just make the overhead worse.
If anything was worth saving, BTW, it would be the results of already-executed instructions that can't retire yet because of a load that missed in cache. (All the common out-of-order execution designs for mainstream ISAs use in-order retirement to support precise exceptions. Out-of-order retirement with checkpointing / rollback on exceptions and mispredicts has been proposed. Search for kilo-instruction processor, IIRC.)
(flawed idea): An aggressive out-of-order design could avoid wasting too much work on context switches by delaying the write of the interrupt-return address when an external interrupt arrives. i.e. it could pretend that the interrupt came in later than it did, by allowing some instructions already in the pipeline to keep executing. If the user-space instruction pointer isn't needed until the interrupt handler returns, the CPU could keep those in-flight instructions going instead of clearing the pipeline right away.
Hrm, this has the major difficulty that register values on entry into the interrupt handler also depends on the architectural state, so this probably can't work.
This definitely can't work for interrupts generated by user-space, because that fixes the return address.
This isn't an issue for threads that put themselves to sleep while waiting on a spinlock with monitor / mwait or something. mwait presumably doesn't take effect until it retires, and it won't retire until all previous work has been done. It would defeat the intended purpose for the CPU to be aggressive about speculatively executing past mwait, I think. Or maybe mwait doesn't even flush the pipeline, and just saves power.
The idea has been proposed, but you'd need a much denser memory technology which is only now becoming available. See this paper for example:
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6489970
Basically, they propose pipes composed of a new set of latches & registers based on memristors (resistive non-volatile memory components), that can hold multiple values corresponding to multiple threads. Control logic can then tell all latches which thread should be active, and allow simultaneous context switching throughout the entire pipe.
Keep in mind that this only enhances the granularity to the latch level. Modern CPUs with simultaneous multithreading can already have different threads active on different units without context switches, simply through arbitration. Other units with inherent parallelism may already handle multiple threads per cycle (e.g. multi-ported ALUs).

Software Breakpoints and Modern OOOE Processors

I understand that modern operating systems provide APIs for debugging. When a debugger process asks the kernel to set a breakpoint on a machine-code instruction of another process, the kernel replaces the first byte of the instruction with an opcode that causes an interrupt.
The interrupt handler would halt the process, save the registers and notify the debugging process.
What I do not understand is what exactly happens on out-of-order execution processors. The interrupt instruction could be executed before its predecessors or after its successors, and thus, at the time of the interrupt, the registers and memory would contain wrong values.
That is why all the ordered events such as interrupts, faults, exceptions, etc, are always handled at the commit point in out-of-order processors, where the original program order is restored and the correct machine state can be captured. This means that you may know of a pending event but still delay handling it.
Note that actions visible to the external world, such as stores to memory, are also handled past this stage, so you can never view the speculative internal state of an out-of-order core (well, except through side-channel attack methods...), and any interrupt or breakpoint will also be ordered correctly with regard to them.
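For reference, the byte-patching the question describes might look roughly like this on Linux with ptrace (a hedged sketch: error handling, the errno check for PEEKTEXT's -1 return, and restoring the saved byte afterwards are all omitted; 0xCC is the x86 int3 opcode, and the low byte of the word is the first byte of the instruction on little-endian x86):

    #include <sys/ptrace.h>
    #include <sys/types.h>

    // Plant a software breakpoint at 'addr' in the traced process 'pid'.
    long set_breakpoint(pid_t pid, void *addr, long *saved_word)
    {
        long word = ptrace(PTRACE_PEEKTEXT, pid, addr, 0);  // read original word
        *saved_word = word;                                  // keep it to restore later
        long patched = (word & ~0xFFL) | 0xCC;               // overwrite first byte with int3
        return ptrace(PTRACE_POKETEXT, pid, addr, (void *)patched);
    }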

How do I force the CPU to perform in order execution of a program without any loops or branches?

Is it possible? For a small code without any branches/loops.
Are there any gcc flags or intrinsic instructions like SSE's for x86 and other processor families? I am just curious since all the processors available these days follow out of order execution model.
Thanks in advance
Most modern out-of-order CPUs are inherently out-of-order; there is no way to switch between in-order and out-of-order modes.
You can try to find some in-order CPU, and there are some:
x86: Intel Atom (only the 45 nm and older versions; they have two parallel pipelines but execute all instructions in order);
arm: Cortex-A8, and many older cores;
While it is not possible to directly turn off instruction reordering in a typical out-of-order CPU, you can inject something serializing (like cpuid in the x86 world) between every one of your instructions to simulate in-order execution.
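A rough sketch of that idea using GNU-style inline asm (x86-64; the surrounding arithmetic is just a placeholder, and this of course destroys performance):

    static inline void serialize(void)
    {
        unsigned a = 0, b, c, d;
        // cpuid is a serializing instruction: everything before it must be
        // complete, and nothing after it starts until it has executed.
        __asm__ volatile("cpuid"
                         : "+a"(a), "=b"(b), "=c"(c), "=d"(d)
                         :
                         : "memory");
        (void)b; (void)c; (void)d;
    }

    void example(int *p, int x, int y)
    {
        p[0] = x + y;  serialize();
        p[1] = x * y;  serialize();
        p[2] = x - y;  serialize();
    }

Note that this only serializes at the granularity of the statements you put it between, not at the level of every machine instruction the compiler emits.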
There is a part of Intel manuals (vol 3a) about serializing instructions (copied from http://objectmix.com/asm-x86-asm-370/69413-serializing-instructions.html):
Volume 3A: System Programming Guide states
7.4 SERIALIZING INSTRUCTIONS
The Intel 64 and IA-32 architectures define several serializing
instructions. These instructions force the processor to complete all
modifications to flags, registers, and memory by previous instructions
and to drain all buffered writes to memory before the next instruction
is fetched and executed. For example, when a MOV to control register
instruction is used to load a new value into control register CR0 to
enable protected mode, the processor must perform a serializing
operation before it enters protected mode. This serializing operation
insures that all operations that were started while the processor was
in real-address mode are completed before the switch to protected
mode is made.
The concept of serializing instructions was introduced into the IA-32
architecture with the Pentium processor to support parallel
instruction execution. Serializing instructions have no meaning for
the Intel486 and earlier processors that do not implement parallel
instruction execution.
It is important to note that executing of serializing instructions on
P6 and more recent processor families constrain speculative execution
because the results of speculatively executed instructions are
discarded. The following instructions are serializing instructions:
o Privileged serializing instructions - MOV (to control register,
with the exception of MOV CR8), MOV (to debug register), WRMSR, INVD,
INVLPG, WBINVD, LGDT, LLDT, LIDT, and LTR.
o Non-privileged serializing instructions - CPUID, IRET, and RSM.
When the processor serializes instruction execution, it ensures that
all pending memory transactions are completed (including writes
stored in its store buffer) before it executes the next instruction.
Nothing can pass a serializing instruction and a serializing
instruction cannot pass any other instruction (read, write,
instruction fetch, or I/O). For example, CPUID can be executed at any
privilege level to serialize instruction execution with no effect on
program flow, except that the EAX, EBX, ECX, and EDX registers are
modified.
It is possible, but it depends on the CPU. Then again, the instructions themselves don't matter; the memory accesses do.
AFAIK all CPUs guarantee that registers (and thus the internal state) appear updated in order, regardless of how execution happens. In some CPUs temporary registers are allocated with a value and the result is "written back" (register renaming, so there's no copy per se) at the appropriate time.
For memory accesses, most CPUs have memory barriers of some kind, which limit the reordering of memory accesses. There are several different kinds of memory barriers, and they differ from CPU to CPU. You can conceivably place a full memory barrier between each instruction and you'll get halfway there. If you have a multi-processor machine you might need to do some extra work to make sure the caches are also flushed. Without explicit instructions the other core may not see the results in order.
It very much depends on what you're trying to achieve and on which specific CPU. Every CPU out there is different in some way. And there won't be any magical gcc flags. In gcc the best you'll have are builtin atomic types (link below). The topic is huge. No simple answers.
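A minimal sketch of the "full barrier between operations" idea, using the legacy gcc builtins from the link below (the flag/data names are invented; volatile keeps the compiler from hoisting the flag load out of the loop):

    int data;
    volatile int flag;

    void producer(void)
    {
        data = 42;
        __sync_synchronize();   // full barrier: the store to data becomes
                                // visible before the store to flag
        flag = 1;
    }

    int consumer(void)
    {
        while (!flag)
            ;                   // spin until the producer publishes
        __sync_synchronize();   // full barrier before reading data
        return data;
    }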
Recommended reading list:
http://lxr.free-electrons.com/source/Documentation/memory-barriers.txt
https://en.wikipedia.org/wiki/Memory_ordering
http://gcc.gnu.org/onlinedocs/gcc-4.4.5/gcc/Atomic-Builtins.html
http://www.akkadia.org/drepper/cpumemory.pdf

Very long instruction that consists of operations with different latencies

Consider a VLIW processor with an issue width equal to N: this means that it is able to start N operations simultaneously, so each very long instruction can consist of a maximum of N operations.
Suppose that the VLIW processor loads a very long instruction which consists of operations with different latencies: operations belonging to the same very long instruction could end at different times. What happens if an operation finishes its execution before other operations belonging to the same very long instruction? Could a subsequent operation (that is, an operation belonging to the next very long instruction) start execution before the remaining operations of the current very long instruction have finished executing? Or does a very long instruction wait for the completion of all operations belonging to the current very long instruction?
Most VLIW processors I've seen do support operations with different latencies.
It's up to the compiler to schedule these instructions, and to ensure that the
operands are available before the operation executes. A VLIW processor is
dumb, and doesn't check any dependencies between operations. When a long instruction
word executes, each operation in the word simply reads its input data from a register
file, and writes its result back at the end of the same cycle, or later if an
operation takes two or three cycles.
This only works when instructions are deterministic, and always take the same
number of cycles. All VLIW architectures I've seen have operations that take
a fixed number of cycles, no less, no more. In case they do take longer, like for
instance an external memory fetch, the whole machine is simply stalled.
Now there is one key thing that limits the scheduling of instructions that have
different latencies: the number of ports to the register file. The ports are the
connections between the register file and execution units of the operations.
In a VLIW processor, each operation executes in an issue slot, and each issue slot
has its own ports to the register file. Ports are expensive in terms of hardware.
The more ports, the more silicon is required to implement the register file.
Now consider the following situation where a two-cycle operation wants to write its
result to the register file at the same time as a single-cycle operation that
was scheduled right after it. There's now a conflict, as both operations want to
write to the same register file over the same port. Again, it's the compiler's task
to ensure this doesn't happen. In many VLIW architectures, the operations
that execute in the same issue slot all have the same latency, which avoids
this conflict.
Now to answer your questions:
You said: "What happens if an operation finishes its execution before other
operations belonging to the same very long instruction?"
Nothing special happens. The processor just continues to execute the next
very long instruction word.
You said: "Could a subsequent operation (that is an operation belonging to the
next very long instruction) start execution before the remaining operations of
the current very long instruction being executed?"
Yes, but this could present a register port conflict later on. It's up to the
compiler to prevent this situation.
You said: "Or does a very long instruction wait for the completion of all
operations belonging to the current very long instruction?"
No. The processor at every cycle simply goes to the next very long instruction
word. There's an exception and that is when an operation takes longer than
normal, for instance because there's a cache miss, and then the pipeline is
stalled, and the machine does not progress to the next long instruction word.
The idea behind VLIW is that the compiler figures out lots of things for the processor to do in parallel and packages them up in bundles called "very long instruction words".
Amdahl's law tells us that the speedup of a parallel program (e.g., the parallel parts of the VLIW instruction) is constrained by the slowest part (e.g., the longest-latency sub-instruction).
The simple answer with VLIW and "long latencies" is "don't mix sub-instructions with different latencies". The practical answer is that VLIW machines try not to have sub-instructions with different latencies at all; ideally you want "one clock" sub-instructions. Typically even memory fetches take only one clock, by virtue of being split into a "memory fetch start (here's an address to fetch)" sub-instruction, with the only variable-latency sub-instruction being "wait for the previous fetch to arrive"; the idea is that the compiler generates as much other computation as it can, so that the memory-fetch latency is covered by those other instructions.
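A loose C analogue of that scheduling idea (purely illustrative; on a real VLIW machine the compiler, not the programmer, packs these operations into bundles):

    int sum_with_hidden_latency(const int *a, int x, int y)
    {
        int loaded = a[0];        // "memory fetch start": issue the load early
        int t1 = x * y;           // independent work scheduled into the
        int t2 = x + y;           //   load's latency shadow
        return loaded + t1 + t2;  // "wait for fetch": the data is ready by now
    }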

Resources