I was wondering if someone could explain what a fetch execution cycle is and what are the steps involved.
I have been looking online and get definitions like
"An instruction cycle (sometimes called fetch-decode-execute cycle) is the basic operation cycle of a computer. It is the process by which a computer retrieves a program instruction from its memory, determines what actions the instruction requires, and carries out those actions."
But could someone break this down a bit further and explain the steps involved in executing a fetch execution cycle?
I'll try to explain, though I don't know all the correct English terms. I think this is related to the instruction pointer (also called the program counter).
Every program has state that is held in the CPU's registers while it is executing, that is, while the scheduler has not left it pending. One of the values stored is the current value of the instruction pointer. This pointer holds the RAM address of the next instruction to execute.
So the CPU reads that value, uses the memory bus to fetch the instruction from memory, then executes it.
Afterwards the instruction pointer holds the address of the next instruction to execute: either the one that follows in memory, or a different one if the instruction itself changed the pointer (a jump).
Note that an "instruction" is just a raw value in memory; it is the CPU that translates it into a "physical"/"logical" operation.
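To make the loop concrete, here is a minimal sketch of a fetch-decode-execute cycle in Python. The machine, its opcode numbers and the name `run` are all invented for illustration; no real CPU works exactly like this, but the shape of the loop (read the pointer, fetch, advance, execute) is the same idea.

```python
# Toy accumulator machine. The opcode numbers are invented for this sketch.
HALT, LOAD, ADD, STORE, JMP = 0, 1, 2, 3, 4


def run(memory, pc=0, max_steps=1000):
    """Run until HALT; return (accumulator, final program counter)."""
    acc = 0
    for _ in range(max_steps):
        # FETCH: read the opcode and its operand from the address held in pc.
        opcode, operand = memory[pc], memory[pc + 1]
        # By default the pointer moves on to the following instruction.
        pc += 2
        # DECODE + EXECUTE: only here does the raw number get a meaning.
        if opcode == HALT:
            break
        elif opcode == LOAD:
            acc = memory[operand]
        elif opcode == ADD:
            acc += memory[operand]
        elif opcode == STORE:
            memory[operand] = acc
        elif opcode == JMP:
            pc = operand  # the instruction itself moves the pointer
        else:
            raise ValueError(f"unknown opcode {opcode} at address {pc - 2}")
    return acc, pc


# Code and data live in the same flat memory.
memory = [
    LOAD, 10,   # address 0: acc = memory[10]
    ADD, 11,    # address 2: acc += memory[11]
    STORE, 12,  # address 4: memory[12] = acc
    HALT, 0,    # address 6
    0, 0,       # addresses 8-9: padding
    2, 3, 0,    # addresses 10-12: data
]
acc, pc = run(memory)
```

After the run, `memory[12]` holds the sum of the two data cells. Nothing in the list marks which cells are code and which are data; only the value of `pc` decides what gets treated as an instruction.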
While reading an ARM core document, I ran into this question: how does the CPU tell whether what it reads from the data bus should be executed as an instruction or treated as data that it can operate upon?
Refer to the excerpt from the document -
"Data enters the processor core
through the Data bus. The data may be
an instruction to execute or a data
item."
Thanks in advance for enlightening me!
/MS
Simple answer - it doesn't. Machine code instructions are just binary numbers, as are data. More complicated answer - your processor may (or may not) provide segmentation of memory, meaning that attempting to execute what has been specified as data causes a trap of some sort. This is one of the meanings of a "segmentation fault" - the processor tried to execute something that was not labelled as being executable code.
Each instruction consists of an opcode of N bytes, which then expects the subsequent M bytes to be operands (memory pointers etc.). So the CPU uses each opcode to determine how many of the following bytes belong to that instruction.
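As a rough illustration of "the opcode tells the CPU how many bytes follow", here is a small sketch that walks a byte string and splits it into instructions. The table uses a handful of 6502 opcodes and their lengths; the function name `split_instructions` and the sample program are made up for this example, and a real decoder covers far more opcodes.

```python
# Total instruction length in bytes (opcode + operand bytes) for a few
# 6502 opcodes. This is a small subset chosen for the sketch.
LENGTHS = {
    0xA9: 2,  # LDA #imm   (opcode + 1 operand byte)
    0xAD: 3,  # LDA abs    (opcode + 2 address bytes)
    0x8D: 3,  # STA abs
    0x4C: 3,  # JMP abs
    0xEA: 1,  # NOP
    0x60: 1,  # RTS
}


def split_instructions(code):
    """Split raw bytes into instructions, starting at offset 0."""
    instructions = []
    pc = 0
    while pc < len(code):
        length = LENGTHS[code[pc]]  # the opcode decides how far to read
        instructions.append(bytes(code[pc:pc + length]))
        pc += length
    return instructions


# LDA #$60 ; STA $0200 ; RTS
program = bytes([0xA9, 0x60, 0x8D, 0x00, 0x02, 0x60])
parts = split_instructions(program)
```

Note that the byte 0x60 appears twice: once as the operand of LDA and once as the RTS opcode. Which one it is depends only on where decoding started, not on the byte itself.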
Certainly for old processors (e.g. old 8-bit types such as 6502 and the like) there was no differentiation. You would normally point the program counter to the beginning of the program in memory and that would reference data from somewhere else in memory, but program/data were stored as simple 8-bit values. The processor itself couldn't differentiate between the two.
It was perfectly possible to point the program counter at what had been deemed data, and in fact I remember an old college tutorial where my professor did exactly that, and we had to point the mistake out to him. His response was "but that's data! It can't execute that! Can it?", at which point I populated our data with valid opcodes to prove that, indeed, it could.
The original ARM design had a three-stage pipeline for executing instructions:
FETCH the instruction into the CPU
DECODE the instruction to configure the CPU for execution
EXECUTE the instruction.
The CPU's internal logic ensures that it knows whether it is fetching data in stage 1 (i.e. an instruction fetch), or in stage 3 (i.e. a data fetch due to a "load" instruction).
Modern ARM processors have a separate bus for fetching instructions (so the pipeline doesn't stall while fetching data), and a longer pipeline (to allow faster clock speeds), but the general idea is still the same.
Each read by the processor is known to be a data fetch or an instruction fetch. All processors, old and new, know their instruction fetches from their data fetches. From the outside you may or may not be able to tell; usually you cannot, except for Harvard architecture processors of course, which the ARM is not. I have been working with the mpcore (ARM11) lately and there are bits on the external interface that tell you a little about what kind of read it is, mostly so you can hook up an external cache. Combine that with knowing whether the MMU and L1 cache are on and you can tell data from instruction, but that is the exception to the rule. From a memory bus perspective it is just data bits and you don't know data from instruction, but the logic that initiated that memory cycle and is waiting for the result knew before it started the cycle what kind of fetch it was and what it is going to do with the data when it arrives.
I think it's down to where the data is stored in the program and OS support for informing the CPU whether it is code or data.
All code is placed in a different segment of the image (along with static data like constant character strings) from the storage for variables. The OS (and memory management unit) need to know this because they can swap code out of memory by simply discarding it and reloading it from the original disk file (at least that's how Windows does it).
So, I think the CPU 'knows' whether memory is data or code. No doubt the modern pipelined CPUs we have now also have instructions to read this memory differently, to help the CPU process it as fast as possible (e.g. code may not be cached, data will always be accessed randomly rather than in a stream).
It's still possible to point your program counter at data, but the OS can tell the CPU to prevent this - see the NX bit and Windows' "Data Execution Prevention" settings (system control panel).
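A very rough model of what an NX-style check does, as I understand it: memory contents are identical for code and data, but each page carries a permission, and the check happens only on instruction fetch. Everything here (`PAGE_SIZE`, `ExecuteFault`, the function names, the set of executable pages) is hypothetical and only meant to show the idea; real hardware does this in the MMU through page table entries.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


class ExecuteFault(Exception):
    """Raised when an instruction fetch hits a non-executable page."""


def fetch_instruction(memory, executable_pages, pc):
    """Instruction fetch: allowed only from pages marked executable."""
    page = pc // PAGE_SIZE
    if page not in executable_pages:
        raise ExecuteFault(f"fetch from non-executable page {page}")
    return memory[pc]


def load_data(memory, address):
    """Data read: the same bytes, but no execute permission is needed."""
    return memory[address]


memory = bytearray(2 * PAGE_SIZE)
memory[0] = 0xEA            # a byte in page 0 (marked as code)
memory[PAGE_SIZE] = 0xEA    # the very same byte value in page 1 (data)
executable_pages = {0}      # the OS marked only page 0 as executable

ok = fetch_instruction(memory, executable_pages, 0)
try:
    fetch_instruction(memory, executable_pages, PAGE_SIZE)
    blocked = False
except ExecuteFault:
    blocked = True
```

The two bytes are indistinguishable; what differs is the kind of access the CPU is making and the permission on the page.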
I understand that instructions can be re-ordered by the processor in addition to compilers.
I have a few questions that I can not get my head around.
Say we have three instructions:
Program order
S1
S2
S3
After re-ordering by the processor, order becomes (for whatever reason):
S3
S2
S1
So when the processor executes S1 (in the program order), what would be the value of the Program Counter?
If Windows (or another OS) context switches the thread out and schedules it on another processor, how would the other processor know which instruction to execute next? (Is it guaranteed to make the same re-orderings?)
Is a memory fence (for example, a full fence created by an atomic compare and swap instruction) on one processor still valid after the thread is scheduled on another processor?
Any ideas on this are highly appreciated.
There is an instruction pointer associated with each instruction.
Although instructions may be executed out of order, they always complete in order. When an interrupt or fault occurs, all instructions preceding the saved IP address have been completed. The results of any subsequent instructions are discarded. When execution resumes, it starts at the saved address.
The steps taken by the OS to schedule a thread on another processor include fencing operations on both processors, so when the thread resumes on the new processor, all preceding operations are fully fenced (whether or not any explicit fences exist in the code of the thread).
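The "execute out of order, complete in order" rule can be sketched as a tiny reorder-buffer model. This is only an illustration under my own assumptions: the addresses, the function name `run_until_interrupt` and the event model are invented, and real hardware is far more involved.

```python
def run_until_interrupt(program, finish_order, interrupt_after):
    """Model in-order completion on top of out-of-order execution.

    program:         list of (address, name) in program order
    finish_order:    names in the order the execution units finish them
    interrupt_after: number of finish events before an interrupt arrives
    Returns (names completed in order, saved instruction pointer).
    """
    finished = set()
    retired = []
    head = 0  # index of the oldest instruction not yet completed
    for event, name in enumerate(finish_order):
        if event == interrupt_after:
            break
        finished.add(name)
        # Complete strictly in program order: stop at the first gap.
        while head < len(program) and program[head][1] in finished:
            retired.append(program[head][1])
            head += 1
    # Results of anything finished but not completed are thrown away;
    # execution later resumes at the oldest incomplete instruction.
    saved_ip = program[head][0] if head < len(program) else None
    return retired, saved_ip


program = [(0x100, "S1"), (0x104, "S2"), (0x108, "S3")]
early = run_until_interrupt(program, ["S3", "S2", "S1"], interrupt_after=2)
late = run_until_interrupt(program, ["S3", "S2", "S1"], interrupt_after=3)
```

In the `early` case S3 and S2 have finished executing, but nothing has completed, so the saved address is that of S1 and the work on S2 and S3 is discarded. That is why another processor never needs to reproduce the same re-ordering: it just starts at the saved address.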
Unlike static compile-time ordering, out-of-order exec preserves the illusion of running instructions in program order, including as seen by an interrupt handler. Current CPUs don't rename the privilege level, so they generally roll back to a consistent state as part of taking an exception or interrupt rather than keeping un-executed instructions in flight. (See "When an interrupt occurs, what happens to instructions in the pipeline?")
This also means that interrupts are delivered strictly between instructions, not in the middle of one (see "Interrupting an assembly instruction while it is operating"). The exceptions are "interruptible" instructions like rep movsb, which logically work as multiple instructions, and vpgatherdd, which has documented semantics for a page fault in one of the gather operands.
Memory ordering as observed by other cores is another matter, and can differ from program order even on an in-order CPU. (See "Can a speculatively executed CPU branch contain opcodes that access RAM?")
The kernel code for a context switch needs to include a strong enough barrier for a thread to see its own stores in program order when it resumes on another core. Generally just release/acquire sync is sufficient (and you already need something like that for the kernel on the other core to restore register values). Maybe also an sfence to make that apply even for NT stores on x86.
I am trying to implement some custom lock-free structures. It operates similarly to a stack, so it has a take() and a free() method and operates on a pointer and an underlying array. Typically it uses optimistic concurrency. free() writes a dummy value to pointer+1, increments the pointer and writes the real value to the new address. take() reads the value at pointer in a spin/sleep style until it doesn't read the dummy value, and then decrements the pointer. In both operations, changes to the pointer are done with compare and swap, and if that fails, the whole operation starts again. The purpose of the dummy value is to ensure consistency, since the write operation can be preempted after the pointer is incremented.
This situation leads me to wonder whether it is possible to prevent preemption in that critical place by somehow determining how much time is left before the thread will be preempted by the scheduler for another thread. I'm not worried about hardware interrupts. I'm trying to eliminate the possible sleep from my reading function so that I can rely on a pure spin.
Is this at all possible?
Are there other means to handle this situation?
EDIT: To clarify how this may be helpful: if the critical operation is interrupted, it will effectively be like taking out an exclusive lock, and all other threads will have to sleep before they can continue with their operations.
EDIT: I am not hell-bent on having it solved like this, I am merely trying to see if it's possible. It is extremely unlikely that the operation will be interrupted in that location for a very long time, and if it does happen it will be OK if all the other operations need to sleep so that it can complete.
Some regard this as premature optimization, but this is just my pet project. Regardless, that does not exclude research and science from attempting to improve techniques. Even though computer science has reasonably matured and every new technology we use today is just an implementation of what was already known 40 years ago, we should not stop being creative in addressing even the smallest of concerns, like trying to make a reasonable set of operations atomic without too many performance implications.
Such information surely exists somewhere, but it is of no use for you.
Under "normal conditions", you can expect upwards of a dozen DPCs and upwards of 1,000 interrupts per second. These do not respect your time slices, they occur when they occur. Which means, on the average, you can expect 15-16 interrupts within a time slice.
Also, scheduling does not strictly go quantum by quantum. The scheduler under present Windows versions will normally let a thread run for 2 quantums, but may change its opinion in the middle if some external condition changes (for example, if an event object is signalled).
So even if you know that you still have so and so many nanoseconds left, whatever you think you know might not be true at all.
Cannot be done without time-travel. You're stuffed.
Consider a VLIW processor with an issue width equal to N: this means that it is able to start N operations simultaneously, so each very long instruction can consist of a maximum of N operations.
Suppose that the VLIW processor loads a very long instruction which consists of operations with different latencies: operations belonging to the same very long instruction could end at different times. What happens if an operation finishes its execution before other operations belonging to the same very long instruction? Could a subsequent operation (that is an operation belonging to the next very long instruction) start execution before the remaining operations of the current very long instruction being executed? Or does a very long instruction wait for the completion of all operations belonging to the current very long instruction?
Most VLIW processors I've seen do support operations with different latencies.
It's up to the compiler to schedule these instructions, and to ensure that the
operands are available before the operation executes. A VLIW processor is
dumb, and doesn't check any dependencies between operations. When a long instruction
word executes, each operation in the word simply reads its input data from a register
file, and writes its result back at the end of the same cycle, or later if an
operation takes two or three cycles.
This only works when instructions are deterministic, and always take the same
number of cycles. All VLIW architectures I've seen have operations that take
a fixed number of cycles, no less, no more. In case they do take longer, like for
instance an external memory fetch, the whole machine is simply stalled.
Now there is one key thing that limits the scheduling of instructions that have
different latencies: the number of ports to the register file. The ports are the
connections between the register file and execution units of the operations.
In a VLIW processor, each operation executes in an issue slot, and each issue slot
has its own ports to the register file. Ports are expensive in terms of hardware.
The more ports, the more silicon is required to implement the register file.
Now consider the following situation where a two-cycle operation wants to write its
result to the register file at the same time as a single-cycle operation that
was scheduled right after it. There's now a conflict, as both operations want to
write to the same register file over the same port. Again, it's the compiler's task
to ensure this doesn't happen. In many VLIW architectures, the operations
that execute in the same issue slot all have the same latency. This avoids this
conflict.
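The kind of check a VLIW compiler has to make for the conflict described above can be sketched like this. The schedule format, the name `find_port_conflicts` and the simplifying assumption of one write port per issue slot are mine, purely for illustration.

```python
from collections import defaultdict


def find_port_conflicts(schedule):
    """Find operations that write back through the same slot in the same cycle.

    schedule: list of (name, issue_cycle, issue_slot, latency).
    Assumes one register-file write port per issue slot, and that an
    operation writes its result at issue_cycle + latency.
    Returns {(slot, writeback_cycle): [names]} for every clash.
    """
    writes = defaultdict(list)
    for name, issue_cycle, slot, latency in schedule:
        writes[(slot, issue_cycle + latency)].append(name)
    return {key: names for key, names in writes.items() if len(names) > 1}


# A two-cycle multiply in slot 0, then a one-cycle add issued right after
# it in the same slot: both want the slot-0 write port in cycle 2.
bad = [("mul", 0, 0, 2), ("add", 1, 0, 1), ("sub", 0, 1, 1)]
conflicts = find_port_conflicts(bad)

# Moving the add to the other issue slot removes the clash.
good = [("mul", 0, 0, 2), ("add", 1, 1, 1), ("sub", 0, 1, 1)]
no_conflicts = find_port_conflicts(good)
```

If every operation in a given slot has the same latency, two operations issued in different cycles can never land on the same write-back cycle, which is why that design choice avoids the problem.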
Now to answer your questions:
You said: "What happens if an operation finishes its execution before other
operations belonging to the same very long instruction?"
Nothing special happens. The processor just continues to execute the next
very long instruction word.
You said: "Could a subsequent operation (that is an operation belonging to the
next very long instruction) start execution before the remaining operations of
the current very long instruction being executed?"
Yes, but this could present a register port conflict later on. It's up to the
compiler to prevent this situation.
You said: "Or does a very long instruction wait for the completion of all
operations belonging to the current very long instruction?"
No. The processor at every cycle simply goes to the next very long instruction
word. There's an exception and that is when an operation takes longer than
normal, for instance because there's a cache miss, and then the pipeline is
stalled, and the machine does not progress to the next long instruction word.
The idea behind VLIW is that the compiler figures out lots of things for the processor to do in parallel and packages them up in bundles called "very long instruction words".
Amdahl's law tells us that the speedup of a parallel program (e.g., the parallel parts of the VLIW instruction) is constrained by the slowest part (e.g., the longest-duration sub-instruction).
The simple answer with VLIW and "long latencies" is "don't mix sub-instructions with different latencies". The practical answer is that VLIW machines try not to have sub-instructions with different latencies; ideally you want "one clock" sub-instructions. Typically even memory fetches take only one clock, by virtue of being divided into a "memory fetch start" sub-instruction (here's an address to fetch) and a "wait for previous fetch to arrive" sub-instruction, which is the only one with variable latency. The idea is that the compiler generates as much other computation as it can between the two, so that the memory fetch latency is covered by the other instructions.