Consider a VLIW processor with an issue width equal to N: this means that it is able to start N operations simultaneously, so each very long instruction can consist of a maximum of N operations.
Suppose the VLIW processor loads a very long instruction whose operations have different latencies: operations belonging to the same very long instruction could then finish at different times. What happens if an operation finishes its execution before other operations belonging to the same very long instruction? Could a subsequent operation (that is, an operation belonging to the next very long instruction) start executing before the remaining operations of the current very long instruction have finished? Or does a very long instruction wait for the completion of all operations belonging to the current very long instruction?
Most VLIW processors I've seen do support operations with different latencies.
It's up to the compiler to schedule these instructions, and to ensure that the
operands are available before the operation executes. A VLIW processor is
dumb, and doesn't check any dependencies between operations. When a long instruction
word executes, each operation in the word simply reads its input data from a register
file, and writes its result back at the end of the same cycle, or later if an
operation takes two or three cycles.
This only works when instructions are deterministic, and always take the same
number of cycles. All VLIW architectures I've seen have operations that take
a fixed number of cycles, no less, no more. If an operation does take longer, for
instance because of an external memory fetch, the whole machine is simply stalled.
Now there is one key thing that limits the scheduling of instructions that have
different latencies: the number of ports to the register file. The ports are the
connections between the register file and the execution units.
In a VLIW processor, each operation executes in an issue slot, and each issue slot
has its own ports to the register file. Ports are expensive in terms of hardware.
The more ports, the more silicon is required to implement the register file.
Now consider the following situation where a two-cycle operation wants to write its
result to the register file at the same time as a single-cycle operation that
was scheduled right after it. There's now a conflict, as both operations want to
write to the same register file over the same port. Again, it's the compiler's task
to ensure this doesn't happen. In many VLIW architectures, the operations
that execute in the same issue slot all have the same latency. This avoids the
conflict.
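Concretely: suppose a two-cycle multiply issues in some slot in cycle 1, so it writes back in cycle 3; a one-cycle add issued in the same slot in cycle 2 also writes back in cycle 3, and both would need that slot's write port at once. If every operation in that slot has the same latency, that slot's write-backs happen one per cycle and the collision cannot occur.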
Now to answer your questions:
You said: "What happens if an operation finishes its execution before other
operations belonging to the same very long instruction?"
Nothing special happens. The processor just continues to execute the next
very long instruction word.
You said: "Could a subsequent operation (that is an operation belonging to the
next very long instruction) start execution before the remaining operations of
the current very long instruction being executed?"
Yes, but this could present a register port conflict later on. It's up to the
compiler to prevent this situation.
You said: "Or does a very long instruction wait for the completion of all
operations belonging to the current very long instruction?"
No. At every cycle the processor simply moves on to the next very long instruction
word. The exception is when an operation takes longer than normal, for instance
because of a cache miss; then the pipeline is stalled and the machine does not
proceed to the next long instruction word.
The idea behind VLIW is that the compiler figures out lots of things for the processor to do in parallel and packages them up in bundles called "very long instruction words".
Amdahl's law tells us that the speedup of a parallel program (e.g., the parallel parts of the VLIW instruction) is constrained by the slowest part (e.g., the longest-duration sub-instruction).
The simple answer with VLIW and "long latencies" is "don't mix sub-instructions with different latencies". The practical answer is that VLIW machines try not to have sub-instructions with different latencies; ideally you want "one clock" sub-instructions. Typically even memory fetches take only one clock, by virtue of being divided into a "memory fetch start (here's an address to fetch)" sub-instruction, with the only variable-latency sub-instruction being "wait for previous fetch to arrive"; the idea is that the compiler generates as much other computation as it can so that the memory fetch latency is covered by those other instructions.
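A minimal sketch of that "cover the fetch latency with other work" idea, written in plain C rather than VLIW assembly. On a real VLIW the compiler would emit an explicit fetch-start operation, schedule independent work, and then an explicit wait-for-fetch operation; here the split is only approximated by issuing the load early and using its result late.

```c
/* Sketch only: the loop body is an illustrative stand-in for "other work". */
float covered_fetch_sum(const float *a, int n)
{
    float sum = 0.0f;
    float acc = 1.0f;
    for (int i = 0; i < n; i++) {
        float x = a[i];           /* "memory fetch start": load issued early   */
        acc = acc * 0.5f + 1.0f;  /* independent work covers the fetch latency */
        sum += x;                 /* "wait for fetch": first use of the result */
    }
    return sum + acc;
}
```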
I understand that instructions can be re-ordered by the processor, in addition to being re-ordered by compilers.
I have a few questions that I cannot get my head around.
Say we have three instructions:
Program order
S1
S2
S3
After re-ordering by the processor, order becomes (for whatever reason):
S3
S2
S1
So when the processor executes S1 (in the program order), what would be the value of the Program Counter?
If Windows (or another OS) context-switches the thread out and schedules it on another processor, how would the other processor know which instruction to execute next? (Is it guaranteed to make the same re-orderings?)
Is a memory fence (for example, a full fence created by an atomic compare-and-swap instruction) on one processor still valid after the thread is scheduled onto another processor?
Any ideas on this are highly appreciated.
There is an instruction pointer associated with each instruction.
Although instructions may be executed out of order, they always complete in order. When an interrupt or fault occurs, all instructions preceding the saved IP address have been completed. The results of any subsequent instructions are discarded. When execution resumes, it starts at the saved address.
The steps taken by the OS to schedule a thread on another processor include fencing operations on both processors, so when the thread resumes on the new processor, all preceding operations are fully fenced (whether or not any explicit fences exist in the code of the thread).
Unlike static compile-time ordering, out-of-order exec preserves the illusion of running instructions in program order. Including the situation seen by an interrupt handler. Current CPUs don't rename the privilege level, so they generally roll back to a consistent state as part of taking an exception or interrupt, not keeping un-executed instructions in flight. When an interrupt occurs, what happens to instructions in the pipeline?
This also means that interrupts are delivered strictly between instructions, not in the middle of one (see Interrupting an assembly instruction while it is operating), except for "interruptible" instructions like rep movsb that logically work as multiple instructions, or vpgatherdd, which has documented semantics for a page fault in one of the gather operands.
Memory ordering as observed by other cores is another matter, and can differ from program order even on an in-order CPU. (Can a speculatively executed CPU branch contain opcodes that access RAM?)
The kernel code for a context switch needs to include a strong enough barrier for a thread to see its own stores in program order when it resumes on another core. Generally just release/acquire sync is sufficient (and you already need something like that for the kernel on the other core to restore register values). Maybe also an sfence to make that apply even for NT stores on x86.
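A user-space sketch of what that release/acquire handoff looks like, using C11 atomics. The names here (cpu_state, state_ready, publish_state, resume_state) are hypothetical; a real kernel does this with its own primitives, but the ordering guarantee is the same.

```c
#include <stdatomic.h>

struct cpu_state { long regs[16]; };   /* toy stand-in for saved thread state */

struct cpu_state saved_state;
atomic_int state_ready = 0;

/* runs on the core the thread is leaving */
void publish_state(const struct cpu_state *s)
{
    saved_state = *s;                                             /* plain stores */
    atomic_store_explicit(&state_ready, 1, memory_order_release); /* publish      */
}

/* runs on the core the thread resumes on */
void resume_state(struct cpu_state *out)
{
    while (!atomic_load_explicit(&state_ready, memory_order_acquire))
        ;                      /* wait for the handoff                           */
    *out = saved_state;        /* all earlier stores by the old core now visible */
}
```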
I've profiled my code using Instrument's time profiler, and zooming in to the disassembly, here's a snippet of its results:
I wouldn't expect a mov instruction to take 23.3% of the time while a div instruction takes virtually nothing.
This causes me to believe these results are unreliable.
Is this true and known? Or am I just experiencing an Instruments bug? Or is there some option I need to use to obtain reliable results?
Is there any reference expanding on this issue?
First of all, it's possible that some counts that really belong to divss are being charged to later instructions, which is called a "skid". (Also see the rest of that comment thread for some more details.) Presumably Xcode is like Linux perf, and uses the fixed cpu_clk_unhalted.thread counter for cycles instead of one of the programmable counters. This is not a "precise" event (PEBS), so skids are possible. As @BeeOnRope points out, you can use a PEBS event that ticks once per cycle (like UOPS_RETIRED < 16) as a PEBS substitute for the fixed cycles counter, removing some of the dependence on interrupt behaviour.
But the way counters fundamentally work for pipelined / out-of-order execution also explains most of what you're seeing. Or it might; you didn't show the complete loop so we can't simulate the code on a simple pipeline model like IACA does, or by hand using hardware guides like http://agner.org/optimize/ and Intel's optimization manual. (And you haven't even specified what microarchitecture you have. I guess it's some member of Intel Sandybridge-family on a Mac).
Counts for cycles are typically charged to the instruction that's waiting for the result, not usually the instruction that's slow to produce the result. Pipelined CPUs don't stall until you try to read a result that isn't ready yet.
Out-of-order execution massively complicates this, but it's still generally true when there's one really slow instruction, like a load that often misses in cache. When the cycles counter overflows (triggering an interrupt), there are many instructions in flight, but only one can be the RIP associated with that performance-counter event. It's also the RIP where execution will resume after the interrupt.
So what happens when an interrupt is raised? See Andy Glew's answer about that, which explains the internals of perf-counter interrupts in the Intel P6 microarchitecture's pipeline, and why (before PEBS) they were always delayed. Sandybridge-family is similar to P6 for this.
I think a reasonable mental model for perf-counter interrupts on Intel CPUs is that it discards any uops that haven't yet been dispatched to an execution unit. But ALU uops that have been dispatched already go through the pipeline to retirement (if there aren't any younger uops that got discarded) instead of being aborted, which makes sense because the maximum extra latency is ~16 cycles for sqrtpd, and flushing the store queue can easily take longer than that. (Pending stores that have already retired can't be rolled back). IDK about loads/stores that haven't retired; at least the loads are probably discarded.
I'm basing this guess on the fact that it's easy to construct loops that don't show any counts for divss when the CPU is sometimes waiting for it to produce its outputs. If it was discarded without retiring, it would be the next instruction when resuming after the interrupt, so (other than skids) you'd see lots of counts for it.
Thus, the distribution of cycles counts shows you which instructions spend the most time being the oldest not-yet-dispatched instruction in the scheduler. (Or in case of front-end stalls, which instructions the CPU is stalled trying to fetch / decode / issue). Remember, this usually means it shows you the instructions that are waiting for inputs, not the instructions that are slow to produce them.
(Hmm, this might not be right, and I haven't tested this much. I usually use perf stat to look at overall counts for a whole loop in a microbenchmark, not statistical profiles with perf record. addss and mulss are higher latency than andps, so you'd expect andps to get counts waiting for its xmm5 input if my proposed model was right.)
Anyway, the general problem is, with multiple instructions in flight at once, which one does the HW "blame" when the cycles counter wraps around?
Note that divss is slow to produce the result, but is only a single-uop instruction (unlike integer div which is microcoded on AMD and Intel). If you don't bottleneck on its latency or its not-fully-pipelined throughput, it's not slower than mulss because it can overlap with surrounding code just as well.
(divss / divps is not fully pipelined. On Haswell for example, an independent divps can start every 7 cycles. But each only takes 10-13 cycles to produce its result. All other execution units are fully pipelined; able to start a new operation on independent data every cycle.)
Consider a large loop that bottlenecks on throughput, not latency of any loop-carried dependency, and only needs divss to run once per 20 FP instructions. Using divss by a constant instead of mulss with the reciprocal constant should make (nearly) no difference in performance. (In practice out-of-order scheduling isn't perfect, and longer dependency chains hurt some even when not loop-carried, because they require more instructions to be in flight to hide all that latency and sustain max throughput. i.e. for the out-of-order core to find the instruction-level parallelism.)
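As a hedged sketch of that scenario (the exact instruction mix is made up, and the claim only holds if the divide really is rare relative to the other FP work):

```c
/* Throughput-bound loop with one divide per eight elements; the rest is
 * independent FP work. */
float sum_div(const float *a, int n, float c)
{
    float sum = 0.0f;
    for (int i = 0; i + 8 <= n; i += 8) {
        float s = a[i] / c;                      /* the occasional divss */
        for (int j = 1; j < 8; j++)
            sum += a[i + j] * 0.5f + 0.25f;      /* independent adds/muls */
        sum += s;
    }
    return sum;
}

/* Same loop with the divide replaced by a multiply by the reciprocal.
 * If the divide is this infrequent, the two versions should time about
 * the same. */
float sum_mul(const float *a, int n, float c)
{
    float inv_c = 1.0f / c;                      /* reciprocal hoisted once */
    float sum = 0.0f;
    for (int i = 0; i + 8 <= n; i += 8) {
        float s = a[i] * inv_c;                  /* mulss instead of divss */
        for (int j = 1; j < 8; j++)
            sum += a[i + j] * 0.5f + 0.25f;
        sum += s;
    }
    return sum;
}
```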
Anyway, the point here is that divss is a single uop and it makes sense for it not to get many counts for the cycles event, depending on the surrounding code.
You see the same effect with a cache-miss load: the load itself mostly only gets counts if it has to wait for the registers in the addressing mode, and the first instruction in the dependency chain that uses the loaded data gets a lot of counts.
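A sketch of the kind of loop where you'd see that; the sizes are illustrative, and the point is only that in a cycles profile the add tends to collect the samples, not the load itself.

```c
/* With `table` much larger than the caches and `idx` effectively random,
 * the load of table[idx[i]] misses often, but the stall time tends to show
 * up on the `sum += v` that consumes the loaded value. */
long sum_random(const long *table, const int *idx, int n)
{
    long sum = 0;
    for (int i = 0; i < n; i++) {
        long v = table[idx[i]];   /* load that frequently misses in cache */
        sum += v;                 /* consumer that "gets blamed"          */
    }
    return sum;
}
```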
What your profile result might be telling us:
The divss isn't having to wait for its inputs to be ready. (The movaps %xmm3, %xmm5 before the divss sometimes takes some cycles, but the divss never does.)
We may come close to bottlenecking on the throughput of divss
The dependency chain involving xmm5 after divss is getting some counts. Out-of-order execution has to work to keep multiple independent iterations of that in flight at once.
The maxss / movaps loop-carried dependency chain may be a significant bottleneck. (Especially if you're on Skylake where divss throughput is one per 3 clocks, but maxss latency is 4 cycles. And resource conflicts from competition for ports 0 and 1 will delay maxss.)
The high counts for movaps might be due to it following maxss, forming the only loop-carried dependency in the part of the loop you show. So it's plausible that maxss really is slow to produce results. But if it really was a loop-carried dep chain that was the major bottleneck, you'd expect to see lots of counts on maxss itself, as it would be waiting for its input from the last iteration.
But maybe mov-elimination is "special", and all the counts for some reason get charged to movaps? On Ivybridge and later CPUs, register copies don't need an execution unit, but are instead handled in the issue/rename stage of the pipeline.
Is this true and known?
Yes, it is a known problem with profiling tools on Intel x86. I've observed it (time spent suspiciously assigned to seemingly innocent instructions) both with Linux perf_events and Intel VTune. It has also been reported elsewhere by other people.
A better and more honest visualization of the collected results would sum up all samples inside every basic block and show the resulting value for the basic block as a whole, not for its individual instructions. That's not 100% fool-proof, but it is a bit better and more honest.
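A rough sketch of that aggregation, assuming you already have the sampled instruction addresses and the (sorted) start addresses of each basic block; the data layout here is hypothetical.

```c
#include <stddef.h>

/* Sum per-instruction samples into per-basic-block totals.
 * block_starts[] must be sorted ascending; block_total[] zero-initialized. */
void sum_per_block(const unsigned long *sample_ip, const unsigned long *sample_count,
                   size_t nsamples,
                   const unsigned long *block_starts, size_t nblocks,
                   unsigned long *block_total)
{
    for (size_t i = 0; i < nsamples; i++) {
        size_t b = 0;
        while (b + 1 < nblocks && block_starts[b + 1] <= sample_ip[i])
            b++;                     /* last block starting at or before the IP */
        block_total[b] += sample_count[i];
    }
}
```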
Or is there some option I need to use to obtain reliable results?
I do not know if newer profiling hardware, namely tools based on Intel Processor Trace (available starting from Broadwell, but improved in Skylake) instead of older PEBS, would give more accurate data. I guess one needs to experiment with such tools first.
I was wondering if someone could explain what a fetch-execute cycle is and what steps are involved.
I have been looking up online and get definitions like
"An instruction cycle (sometimes called fetch-decode-execute cycle) is the basic operation cycle of a computer. It is the process by which a computer retrieves a program instruction from its memory, determines what actions the instruction requires, and carries out those actions."
But could someone break this down a bit further and explain the steps involved in a fetch-execute cycle?
I'll try to explain, though I don't have all the correct English terms; I think this is related to the operation pointer.
Every program has state that is held in the CPU's registers while it is executing, i.e. while it has not been suspended by the scheduler. One of the values stored there is the current value of the operation pointer. This pointer contains the memory address in RAM of the next operation to execute.
So the CPU reads that value, uses its memory bus to fetch the operation to execute from memory, and then executes it.
Then the operation pointer will contain the address of the next operation to execute: either the following one, or a different one if the operation just executed moved the operation pointer (a jump).
Note that an "operation" is just a raw value in memory; it's the CPU that decodes it into a "physical"/"logical" operation.
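To make the fetch-decode-execute loop concrete, here is a toy software model of it; the four-opcode "ISA" (HALT, LOAD_IMM, ADD, JMP) is invented purely for illustration.

```c
#include <stdint.h>
#include <stdio.h>

enum { HALT, LOAD_IMM, ADD, JMP };

int main(void)
{
    uint8_t mem[] = {            /* a small program sitting in "RAM"        */
        LOAD_IMM, 0, 5,          /* r0 = 5                                  */
        LOAD_IMM, 1, 7,          /* r1 = 7                                  */
        ADD,      0, 1,          /* r0 = r0 + r1                            */
        HALT
    };
    int reg[4] = {0};
    unsigned pc = 0;             /* program counter / operation pointer     */

    for (;;) {
        uint8_t op = mem[pc];                       /* FETCH: read at PC    */
        switch (op) {                               /* DECODE: which op?    */
        case LOAD_IMM:                              /* EXECUTE each case    */
            reg[mem[pc + 1]] = mem[pc + 2];
            pc += 3;                                /* advance to next op   */
            break;
        case ADD:
            reg[mem[pc + 1]] += reg[mem[pc + 2]];
            pc += 3;
            break;
        case JMP:
            pc = mem[pc + 1];                       /* branch: rewrite PC   */
            break;
        case HALT:
            printf("r0 = %d\n", reg[0]);            /* prints 12            */
            return 0;
        }
    }
}
```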
As we know, atomic actions cannot be interleaved, so they can be used without fear of thread interference. For example, in a 32-bit OS "x = 3" is "generally" considered an atomic operation, but a memory access mostly takes more than one clock cycle, let's say 3 cycles. So here is the case:
Assuming we have multiple parallel data & address buses and thread A tries to set "x = 3", isn't there any chance for another thread, let's say thread B, to access the same memory location in the second cycle (while thread A is in the middle of the write operation)? How is atomicity going to be preserved?
Hope I was able to be clear.
Thanks
There is no problem with simple assignments at all, provided the write is performed in a single bus transaction. Even when a memory write transaction takes 3 cycles, there are specific arrangements in place that prevent simultaneous bus access from different cores.
The problems arise when you do read-modify-write operations, as these involve (at least) two bus transactions and thus could lead to race conditions between cores (threads). These cases are solved by specific opcodes (prefixes) that assert the bus lock signal for the whole duration of the following instruction, or by special instructions that do the whole job atomically.
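A small C11 sketch of that difference: a plain `counter++` is the read-modify-write that can race between threads, while `atomic_fetch_add` is the kind of "do the whole job" operation described above (on x86 it typically compiles to a lock-prefixed add/xadd).

```c
#include <stdatomic.h>

int        plain_counter  = 0;   /* racy: ++ is a separate load, add, store   */
atomic_int atomic_counter = 0;   /* safe: the RMW is done as one atomic op    */

void increment_racy(void)
{
    plain_counter++;             /* two threads can interleave and lose counts */
}

void increment_atomic(void)
{
    atomic_fetch_add(&atomic_counter, 1);  /* indivisible read-modify-write    */
}
```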
Let us say we have a fictitious single core CPU with Program Counter and basic instruction set such as Load, Store, Compare, Branch, Add, Mul and some ROM and RAM. Upon switching on it executes a program from ROM.
Would it be fair to say that the work the CPU does is based on the type of instruction it's executing? For example, a MUL operation would likely involve more transistors firing up than, say, a Branch.
However from an outside perspective if the clock speed remains constant then surely the CPU could be said to be running at 100% constantly.
How exactly do we establish a paradigm for measuring the work of the CPU? Is there some kind of standard metric perhaps based on the type of instructions executing, the power consumption of the CPU, number of clock cycles to complete or even whether it's accessing RAM or ROM.
A related second question is what it means for a program to "stop". Does it usually just branch in an infinite loop, or does the PC halt while the CPU waits for an interrupt?
First of all, that a CPU is always executing some code is just an approximation these days. Computer systems have so-called sleep states which allow for energy saving when there is not too much work to do. Modern CPUs can also throttle their speed in order to improve battery life.
Apart from that, there is a difference between the CPU executing "some work" and "useful work". The CPU by itself can't tell, but the operating system usually can. Except for some embedded software, a CPU will never be running a single job, but rather an operating system with different processes within it. If there is no useful process to run, the operating system will schedule the "idle task", which mostly means putting the CPU to sleep for some time (see above) or just burning CPU cycles in a loop which does nothing useful. Calculating the ratio of time spent in the idle task to time spent in regular tasks gives a measure of how busy the CPU is.
So in the old days of DOS, when the computer was running (almost) only a single task, it was true that it was always doing something. Many applications used so-called busy-waiting if they just had to delay their execution for some time, doing nothing useful. But today there will almost always be a smart OS in place which can run the idle process, put the CPU to sleep, throttle down its speed, etc.
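For contrast, a minimal sketch of the two styles of "doing nothing" mentioned above (POSIX `sleep` is used just as one example of handing the delay to the OS):

```c
#include <unistd.h>

/* Busy-waiting: the CPU looks 100% loaded but accomplishes nothing useful. */
void delay_busy(volatile unsigned long spins)
{
    while (spins--)
        ;                        /* burn cycles until the count runs out */
}

/* Delegating the delay to the OS: the scheduler can run other processes
 * or put the CPU into a sleep state in the meantime. */
void delay_sleep(unsigned seconds)
{
    sleep(seconds);
}
```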
Oh boy, this is a toughie. It’s a very practical question as it is a measure of performance and efficiency, and also a very subjective question as it judges what instructions are more or less “useful” toward accomplishing the purpose of an application. The purpose of an application could be just about anything, such as finding the solution to a complex matrix equation or rendering an image on a display.
In addition, modern processors do things like clock gating in power idle states. The oscillator is still producing cycles, but no instructions execute due to certain circuitry being idled due to cycles not reaching them. These are cycles that are not doing anything useful and need to be ignored.
Similarly, modern processors can execute multiple instructions simultaneously, execute them out of order, and predict and execute which instructions will be executed next before your program (i.e. the IP or Instruction Pointer) actually reaches them. You don’t want to include instructions whose execution never actually completes, for example because the processor guessed wrong and has to flush those instructions due to a branch mispredict. So a better metric is counting those instructions that actually complete. Instructions that complete are termed “retired”.
So we should only count those instructions that complete (i.e. retire), and cycles that are actually used to execute instructions (i.e. unhalted).
Perhaps the most practical general metric for “work” is CPI or cycles-per-instruction: CPI = CPU_CLK_UNHALTED.CORE / INST_RETIRED.ANY. CPU_CLK_UNHALTED.CORE are cycles used to execute actual instructions (vs those “wasted” in an idle state). INST_RETIRED are those instructions that complete (vs those that don’t due to something like a branch mispredict).
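For example, a run that retires 1,000,000 instructions (INST_RETIRED.ANY) in 2,500,000 unhalted core cycles (CPU_CLK_UNHALTED.CORE) has CPI = 2,500,000 / 1,000,000 = 2.5, i.e. each retired instruction costs two and a half core cycles on average.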
Trying to get a more specific metric, such as the instructions that contribute to the solution of a matrix multiply, while excluding instructions that don’t directly contribute to computing the solution, such as control instructions, is very subjective and difficult to gather statistics on. (There are some that you can, such as VECTOR_INTENSITY = VPU_ELEMENTS_ACTIVE / VPU_INSTRUCTIONS_EXECUTED, which measures how many SIMD vector elements (e.g. SSE or AVX lanes) are active per vector instruction executed. These instructions are more likely to directly contribute to the solution of a mathematical problem, as that is their primary purpose.)
Now that I’ve talked your ear off, check out some of the optimization resources at your friendly local Intel developer resource, software.intel.com. In particular, check out how to use VTune effectively. I’m not suggesting you need to get VTune, though you can get a free or very discounted student license (I think). But the material will tell you a lot about increasing your program’s performance (i.e. optimizing), which is, if you think about it, increasing the useful work your program accomplishes.
Expanding on Michał's answer a bit:
Programs written for modern multi-tasking OSes are more like a collection of event handlers: they effectively set up listeners for I/O and then yield control back to the OS. The OS wakes them up each time there is something to process (e.g. a user action, data from a device), and they "go to sleep" by calling into the OS once they've finished processing. Most OSes will also preempt in case one process hogs the CPU for too long and starves the others.
The OS can then keep tabs on how long each process is actually running (by remembering the start and end time of each run) and generate statistics like CPU time and load (ready-process queue length).
And to answer your second question:
To stop mostly means a process is no longer scheduled and all associated resources (scheduling data structures, file handles, memory space, ...) are destroyed. This usually requires the process to make a special OS call (syscall/interrupt) so the OS can release the resources gracefully.
If, however, a process runs into an infinite loop and stops responding to OS events, then it can only be forcibly stopped (by simply not scheduling it anymore).