Confusion with instruction register and counter register

Confusion with instruction register and counter register - avr

I have difficulty in understanding the workings of instruction register and counter register in avr micro-controller like an atmega128p.
Is there a way to explain what they do in a simple way?

I suppose you are interested in the difference between the Instruction Register(IR) and Program Counter (PC).
Their working principles are pretty simple.
IR: Holds the instruction that will be or is currently executed
PC: Holds the address of the instruction that should be loaded into IR from flash memory in the next clock cycle.
Note that AVRs normally use a 2-stage pipeline (fetch + execute) therefore the IR is double buffered (inside a clock cycle the next instruction is loaded from flash and the current instruction is executed).

Related

Program counter, fences and processor re-ordering

I understand that instructions can be re-ordered by the processor in addition to compilers.
I have a few questions that I can not get my head around.
Say we have three instructions:
Program order
S1
S2
S3
After re-ordering by the processor, order becomes (for whatever reason):
S3
S2
S1
So when the processor executes S1 (in the program order), what woul be the value of the Program Counter?
If windows (or another OS), context switches the thread out and schedules it in another processor, how would the other processor know which instruction to execute next? (Is it guaranteed to make the same re-orderings?)
Is a memory fence (for example, a full fence created by an atomic compare and swap instruction) on one processor valid after the thread is scheduled on another thread?
Any ideas on this is highly appreciated.

There is an instruction pointer associated with each instruction.
Although instructions may be executed out of order, they always complete in order. When an interrupt or fault occurs, all instructions preceding the saved IP address have been completed. The results of any subsequent instructions are discarded. When execution resumes, it starts at the saved address.
The steps taken by the OS to schedule a thread on another processor include fencing operations on both processors, so when the thread resumes on the new processor, all preceding operations are fully fenced (whether or not any explicit fences exist in the code of the thread).

Unlike static compile-time ordering, out-of-order exec preserves the illusion of running instructions in program order. Including the situation seen by an interrupt handler. Current CPUs don't rename the privilege level, so they generally roll back to a consistent state as part of taking an exception or interrupt, not keeping un-executed instructions in flight. When an interrupt occurs, what happens to instructions in the pipeline?
This also means that interrupts are delivered strictly between instructions, not in the middle of one. Interrupting an assembly instruction while it is operating (except for "interruptible" instructions like rep movsb that logically work as multiple instructions, or vpgatherdd that has documented semantics for a page fault in one of the gather operands.)
Memory ordering as observed by other cores is another matter, and can differ from program order even on an in-order CPU. (Can a speculatively executed CPU branch contain opcodes that access RAM?)
The kernel code for a context switch needs to include a strong enough barrier for a thread to see its own stores in program order when it resumes on another core. Generally just release/acquire sync is sufficient (and you already need something like that for the kernel on the other core to restore register values). Maybe also an sfence to make that apply even for NT stores on x86.

STM32H7 performance

I would appreciate a brief explanation of how my assembler timing loop on a NUCLEO-H723ZG board indicates that it is being executed in a single cpu clock cycle. The two instructions used, a SUBS and a BNE, consume three clock cycles when the loop is branching so there is some magic afoot! I am using the GPIO BSRR to toggle a LED and need to use a timing loop count of 275M to achieve an approximate one flash per second.

For the Cortex M0, M3 and M4 the cycle counts are included in the technical reference manual (eg Cortex M4). For the M7 they are not published, but it sounds like you have measured the answer for yourself so do not need it to be in the manual in this case.
If your code is correct, then the processor is able do those two instructions in a single cycle.
This is not surprising. For example the M4 can carry out a 16-bit data processing instruction and it instruction in a single cycle.
You can disable this if you require deterministic (but worse) performance. See the DISFOLD bit in the Auxiliary Control Register.

Simultaneous reading and writing to registers

I'm planning to design a MIPS-like CPU in VHDL on a FPGA. The CPU will have a classic five stage pipeline without forwarding and hazard prevention. In the computer architecture course I learned that the first MIPS-CPUs used to read from the register file on rising clock edge and write on falling clock edge. The FPGA I'm using doesn't support using rising and falling clock edge at the same time (regarding reading and writing to registers), so I can't exactly do like the original MIPS and have to do it all on rising clock edge.
So, here comes the part where I'm having a problem. The instruction writes back to the register in the write back stage. The write back stage sends the data directly to the decode stage. Another instruction in the decode stage wants to read the same register that also the write back stage wants to write.
What happens in this case? Does the decode stage take the new value for the instruction or the old value that is still in the register file?

A register file that fits in the decode stage of the classic five stage design consists of a triple port RAM (or two dual port RAM) and two muxers and comparators. The comparators and muxers are required to bypass the data coming from the write-back stage. This is needed as the write data is written into the triple port RAM in the next cycle. Because the signals coming from the write-back stage are synchronous, this is not a problem.

The question is what do you understand by the term "register". Or more specifically, how do you would like to map the register bank to the FPGA.
The easiest but not so efficient way is to map each MIPS register to several flip-flops according to the register size. You can update these flip-flops at only clock-edge (e.g. falling edge). After that you can read the new content at any time also known as asynchronous read. This solution is not so efficient because the multiplexer to select one MIPS register from the register bank requires a lot of logic resources.
If you have an FPGA where the LUTs can be used as distributed memory, then almost all of the logic resources for the multiplexers can be saved. Distributed memory typically provides an asynchronous read too (and a synchronous write of course). Please read the vender documentation of the synthesis tool on how to describe this type of memory for synthesis.
Last but not least, you can map the full register bank to on-chip block memory. These typically provide only a synchronous read, i.e., reading starts at a clock-edge. (Of course, they also provide only a synchronous write). However, these are typically dual-ported RAMs. Thus, you can write at the falling edge at one port and read with the rising at on the other port. Please read, the documentation of your FPGA on the timing of the write. For example, on some Altera FPGAs the writing is delayed to the next opposite edge (here rising-edge) of the clock.

InterlockedExchange on two CPU cores

I have a Windows 7 driver where I want to synchronize access to a variable. Can I use InterlockedExchange for it?
My current understanding of InterlockedExchange is, that InterlockedExchange is done via compiler intrinsics. That means, the read (InterlockedExchange returns the old value) and the write is done in one clock cycle. The interlocked functions are atomic only when the variable is always accessed via an interlocked function.
But what happens in this case:
CPU1: InterlockedExchange(&Adapter->StatusVariable, 5);
CPU2: InterlockedExchange(&Adapter->StatusVariable, 3);
StatusVariable is written in the same clock cycle on two CPU cores. Does the function notice that the variable is accessed and defer the write to a different clock cycle? Or is it undefined which value the variable has after the write? Is it also possible that the variable contains garbage?
Edit: I am on x86 or x64.

InterlockedExchange generates a xchg instruction that has an implicit memory barrier.
The Intel Instruction set reference is your friend :) See Chapter 8 for more information on how locks work.
From the XCHG instruction:
The exchange instructions swap the contents of one or more operands and, in some cases, perform additional operations such as asserting the LOCK signal or modifying flags in the EFLAGS register.
The XCHG (exchange) instruction swaps the contents of two operands. This instruction takes the place of three
MOV instructions and does not require a temporary location to save the contents of one operand location while the
other is being loaded. When a memory operand is used with the XCHG instruction, the processor’s LOCK signal is
automatically asserted. This instruction is thus useful for implementing semaphores or similar data structures for
process synchronization. See “Bus Locking” in Chapter 8, “Multiple-Processor Management,”of the Intel® 64 and
IA-32 Architectures Software Developer’s Manual, Volume 3A, for more information on bus locking.
If you have any questions about the reference just ask.

I have a Windows 7 driver where I want to synchronize access to a
variable. Can I use InterlockedExchange for it?
Maybe. Maybe not. It depends on what you are trying to do, what the variable represents and what your expectations are when you say "synchronize access".
With that said, I suspect the answer is no because I can't see how what you are doing counts as synchronization.
That means, the read (InterlockedExchange returns the old value) and
the write is done in one clock cycle.
Not exactly. The interlocked functions ensure that the operation happens atomically. How many clock cycles that takes is another issue. Forget about clock cycles.
The interlocked functions are atomic only when the variable is always
accessed via an interlocked function.
What does that even mean?
Does the function notice that the variable is accessed and defer the
write to a different clock cycle?
It's more accurate to say that the processor notices, which it does. Whether it defers one write to a different clock cycle, what do you care? Maybe it does, maybe it doesn't. It's none of your business what the process does.
All the compiler and processor will guarantee in your example and all that you need to know is that:
after the statement InterlockedExchange(&Adapter->StatusVariable, 3); the value of Adapter->StatusVariable will either be 3 or 5; and
after the statement InterlockedExchange(&Adapter->StatusVariable, 5); the value of Adapter->StatusVariable will either be 3 or 5.
It will have one of those two values and no other values. You just cannot know which of those values it will have and it should be obvious to see why that is.
Or is it undefined which value the variable has after the write?
That depends on your definition of "undefined" I guess. It's unclear which of the two values it will have, but it have either 3 or 5 assuming no other threads are changing the value after that point.
Is it also possible that the variable contains garbage?
If by 'garbage' you mean something other than either 3 or 5 then, in the absence of any other code that messes with the value, the answer is an unequivocal no. The variable will contain either the value 3 or the value 5.

Some doubts regarding diagram of Von Neumann Arcitechture

Well I cant understand the above diagram of Von Neumann architecture [Cited from wikipedia] and not even sure whether it is correct. Some obvious doubts that I have -
How can ALU communicate with Memory? Isn't that supposed to be CU's Job?
How is accumulator a part of ALU?
And, What is exactly the job of accumulator?

Judging from the diagram of the IAS computer (which should be pretty similar to EDVAC, the computer von Neumann wrote about) the control unit provides addresses (register MAR) and controls the bus transactions with signals such as AS, R/W*.
On the other hand, ALU is connected to the data bus (register MDR): it receives the data from memory and stores back the results. The diagram also shows that ALU receives the instructions and forwards them to the CU (register IBR).
For example, suppose the control unit just fetched the instruction ADD $1234. Then the processing proceeds as follows:
CU puts $1234 onto the address bus and initiates a read cycle
the operand is received by ALU (register MDR) and added with accumulator (register AC)
the result of addition is finally stored into the accumulator.
Answers to your questions:
ALU receives the data from memory, perform the operations and stores back the result. At the time all data were stored in memory (there were no general purpose registers), hence it was logical to put MDR in ALU, which means that ALU is supposed to be connected to the data bus.
The IAS computer was designed in a way that one ALU input and the ALU output are hardwired to the accumulator. Hence it was logical to place accumulator in ALU.
The accumulator is conceived as a place to store intermediate results, because having instructions with more than one memory operand was more difficult to implement.
Finally, I believe this discussion is purely historical. There is no particular reason to prefer associating the MDR with ALU rather than CU. It was just that von Neumann happened to think that way when he was writing a paper about EDVAC. To make the story complete, Wikipedia says that EDVAC was actually designed by Eckert and Mauchly, while von Neumann only did consulting and writing.

The accumulator is register where the result of arithmetic operation is stored temporarely. It is faster than directly using the main memory. Since it stores arithmetic results it makes sense to be part of the ALU.
The control unit is like a coordinator that tells to the other components to do this and that. But it does not provide the means how to do it so this is why the ALU need to communicate direclty with the memory.

Well, the ALU changes the flags register when does something, that's why it's connected with memory (flags aren't in the CU and in the ALU neither, and since these are the only components that are shown..). And the accumulator stores data temporarily waiting the ALU to process it. It's connected directly to the ALU because this register was thought to support it with its calculations, just like the ecx register is connected with counter circuits. Of course is possible to add ecx,edx but is slower. Choosing the source and the destination register is very difficult due to the extra circuits needed to implement in a CPU and it was archived recently (relatively). That image is quite old (ssegvic is right!) because it shows that input/output are possible only using the accumulator.
In my opinion this is more clear:
The ALU is connected at the internal bus, but this doesn't mean it will communicate with everything connected to it.
One last thing: looking for better images I noticed that the ALU isn't always connected with memory, in some of them it's connected only with the CU.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio