STM32H7 performance

I would appreciate a brief explanation of how my assembler timing loop on a NUCLEO-H723ZG board manages to execute in a single CPU clock cycle per iteration. The two instructions used, a SUBS and a BNE, should consume three clock cycles when the loop branches, so there is some magic afoot! I am using the GPIO BSRR register to toggle an LED and need a timing loop count of 275M to achieve roughly one flash per second.

For the Cortex-M0, M3 and M4 the cycle counts are included in the Technical Reference Manual (e.g. the Cortex-M4 TRM). For the M7 they are not published, but it sounds like you have measured the answer for yourself, so you do not need it to be in the manual in this case.
If your code is correct, then the processor is able to execute those two instructions in a single cycle.
This is not surprising. For example, the M4 can carry out a 16-bit data-processing instruction and an IT instruction in a single cycle.
You can disable this if you require deterministic (but worse) performance. See the DISFOLD bit in the Auxiliary Control Register.
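For concreteness, here is roughly what such a two-instruction delay loop looks like as GCC inline assembly for the Cortex-M7 (a minimal sketch, not the asker's code; the 275M count is the figure from the question, and the 550 MHz clock is an assumption based on the H723's maximum, at which 275M single-cycle iterations last about half a second per toggle):

    #include <stdint.h>

    /* Hypothetical delay loop: the SUBS + BNE pair discussed above.
     * If the M7 folds the pair into one cycle per iteration, 275,000,000
     * iterations at an assumed 550 MHz core clock take roughly 0.5 s. */
    static void delay_loop(uint32_t count)
    {
        __asm__ volatile(
            "1: subs %0, %0, #1 \n"   /* decrement counter, set flags   */
            "   bne  1b         \n"   /* loop back while counter != 0   */
            : "+r"(count)             /* counter is read and written    */
            :
            : "cc");                  /* condition flags are clobbered  */
    }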

Related

Does changing one single bit consume fewer cycles per byte than adding/subtracting/xoring an entire processor word?

Let's suppose I change one single bit in a word and add two other words.
Does changing one bit in a word consume fewer CPU cycles than changing an entire word?
If it consumes fewer CPU cycles, how much faster would it be?
Performance (in clock cycles) is not data-dependent for integer ALU instructions other than division on most CPUs. ADD and XOR have the same 1-cycle latency on the majority of modern pipelined CPUs. (And the same cycle cost as each other on most older / simpler CPUs, whether or not it's 1 cycle.)
See https://agner.org/optimize/ and https://uops.info/ for numbers on modern x86 CPUs.
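If you want to sanity-check that claim yourself, a rough micro-benchmark along these lines works on x86 with GCC or Clang (a sketch only: __rdtsc counts reference cycles rather than core cycles, so expect numbers near 1 but not exactly 1, and the empty asm statement just stops the compiler from collapsing the loops):

    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>          /* __rdtsc() with GCC/Clang on x86 */

    #define ITERS 100000000ULL

    int main(void)
    {
        uint64_t x = 1, t0, t1;

        t0 = __rdtsc();
        for (uint64_t i = 0; i < ITERS; i++) {
            x = x + i;                        /* one ADD per iteration, dependent on the last */
            __asm__ volatile("" : "+r"(x));   /* keep the chain; prevent loop collapsing      */
        }
        t1 = __rdtsc();
        printf("add: %.2f reference cycles/iteration (x=%llu)\n",
               (double)(t1 - t0) / ITERS, (unsigned long long)x);

        x = 1;
        t0 = __rdtsc();
        for (uint64_t i = 0; i < ITERS; i++) {
            x = x ^ i;                        /* one XOR per iteration, dependent on the last */
            __asm__ volatile("" : "+r"(x));
        }
        t1 = __rdtsc();
        printf("xor: %.2f reference cycles/iteration (x=%llu)\n",
               (double)(t1 - t0) / ITERS, (unsigned long long)x);

        return 0;
    }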
Lower power can indirectly affect performance by allowing higher boost clocks without having to slow down for thermal limits. But the difference in this case is so small that I don't expect it would be a measurable difference on a mainstream CPU, like the efficiency cores of an Alder Lake, or even a mobile phone CPU that's more optimized for low power.
Power in a typical CPU (using CMOS logic) scales with how many gates have their outputs change value per cycle. When a transistor switches on, it conducts current from Vcc or to ground, charging or discharging the tiny parasitic capacitance of the things the logic gate's output is connected to. Since the majority of the (low) resistance in the path of that current is in the transistor itself, that's where the electrical energy turns into heat.
For more details, see:
Why does switching cause power dissipation? on electronics.SE for the details for one CMOS gate
For a mathematical operation in CPU, could power consumption depend on the operands?
Modern Microprocessors: A 90-Minute Guide! has a section about power. (And read the whole article if you have any general interest in CPU architecture; it's good stuff.)
ADD does require carry propagation potentially across the whole width of the word, e.g. for 0xFFFFFFFF + 1, so ALUs use tricks like carry-lookahead or carry-select to keep the worst case gate-delay latency within one cycle.
So ADD involves more gates than a simple bitwise operation like XOR, but still not many compared to the amount of gates involved in controlling all the decode and other control logic to get the operands to the ALU and the result written back (and potentially bypass-forwarded to later instructions that use the result right away.)
Also, a typical ALU probably doesn't have fully separate adder vs. bitwise units, so a lot of those adder gates are probably seeing their inputs change, but control signals block carry propagation. (i.e. a typical ALU implements XOR using a lot of the same gates as ADD, with control signals on AND gates or something to allow or block carry propagation. XOR is add-without-carry.) An integer ALU in a CPU will usually be at least an adder-subtractor, so one of the inputs comes through multiple gates, with other control signals that can make it do bitwise ops.
But there's still maybe a few fewer bit-flips when doing an XOR operation than an ADD. Partly it would depend on what the previous outputs were (of whatever computation it did in the previous cycle, not the value of one of the inputs to the XOR). But with carry propagation blocked by AND gates, flipping the inputs to those gates doesn't change the outputs, so less capacitance is charged or discharged.
In a high-performance CPU, a lot of power is spent on pipelining and out-of-order exec, tracking instructions in flight, and writing back the results. So even the whole ALU ADD operation is a pretty minor component of total energy cost to execute the instruction. Small differences in that power due to operands are an even smaller difference. Pretty much negligible compared to how many gates flip every clock cycle just to get data and control signals sent to the right place.
Another tiny effect: if your CPU didn't do register renaming, then possibly a few fewer transistors might flip (in the register file's SRAM) when writing back the result if it's almost the same as what that register held before.
(Assuming an ISA like x86 where you do xor dst, src for dst ^= src, not a 3-operand ISA where xor dst, src1, src2 could be overwriting a different value if you didn't happen to pick the same register for dst and src1.)
If your CPU does out-of-order exec with register renaming, writes to the register file won't be overwriting the same SRAM cells as the original destination value, so it depends what other values were computed recently in registers.
If you want to see a measurable difference in power, run instructions like integer multiply, or FP mul or FMA. Or SIMD instructions, so the CPU is doing 4x or 8x 32-bit addition or shuffle in parallel. Or 8x 32-bit FMA. The max-power workload on a typical modern x86 CPU is two 256-bit FMAs per clock cycle.
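As a rough illustration of that kind of max-power loop (an assumed example, not from the answer): back-to-back 256-bit FMAs with several independent accumulators so the chains roughly cover the FMA latency on typical cores, built with something like gcc -O2 -mavx -mfma:

    #include <immintrin.h>
    #include <stdio.h>

    int main(void)
    {
        __m256 a = _mm256_set1_ps(1.0f);     /* multiply by 1 keeps values finite */
        __m256 b = _mm256_set1_ps(1e-7f);
        __m256 acc[8];                       /* 8 independent chains              */
        for (int k = 0; k < 8; k++)
            acc[k] = _mm256_set1_ps((float)k);

        for (long i = 0; i < 100000000L; i++)
            for (int k = 0; k < 8; k++)
                acc[k] = _mm256_fmadd_ps(acc[k], a, b);   /* 8-wide FMA per step */

        __m256 sum = acc[0];                 /* keep the results live             */
        for (int k = 1; k < 8; k++)
            sum = _mm256_add_ps(sum, acc[k]);
        float out[8];
        _mm256_storeu_ps(out, sum);
        printf("%f\n", out[0]);
        return 0;
    }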
See also:
Do sse instructions consume more power/energy? - Mysticial's answer is excellent, and discusses the race-to-sleep benefit of doing the same work faster and with fewer instructions, even if each one costs somewhat more power.
Why does the CPU get hotter when performing heavier calculations, compared to being idle?
How do I achieve the theoretical maximum of 4 FLOPs per cycle?

Instruction Completion Rate Vs. Instruction Throughput Vs. Instructions Per Clock

From what I understand:
ICR (Instruction Completion Rate): Is (# of instructions / time)
Instruction Throughput: Is usually an average of the number of instructions completed each clock cycle.
IPC (Instructions Per Clock): Is how many instructions are being completed each clock cycle. (Maybe this is usually an average?)
I'm confused about these definitions and am definitely looking for clarification. They might even be wrong; I've been having a tough time finding clear definitions of them.
How does the instruction completion rate affect overall performance of the processor?
How is Instruction Throughput affected compared to IPC?
Instruction throughput is typically used with respect to a specific type of instruction and is meant to provide instruction scheduling information in the context of structural hazards. For example, one might say "this fully pipelined multiplier has a latency of three cycles and an instruction throughput of one". Repeat rate is the inverse of throughput.
IPC describes performance per cycle, while your definition of instruction completion rate describes performance directly (independent of clock frequency).
(Of course, the performance value of "instruction" depends on the instruction set, the compiler, and the application — all of which influence the number of (and types of) instructions executed to complete a task. In addition, the relative performance of different instructions can depend on the hardware implementation; this can, in turn, drive compilation changes and sometimes application programming changes and even ISA changes.)
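To make the relationship concrete, here is a tiny worked example with invented numbers (ICR is performance directly, while IPC is that same performance normalised by the clock, so ICR = IPC x frequency):

    #include <stdio.h>

    int main(void)
    {
        /* Hypothetical numbers for illustration only. */
        double instructions = 8e9;    /* instructions completed  */
        double seconds      = 2.0;    /* wall-clock time         */
        double clock_hz     = 4e9;    /* core clock frequency    */

        double icr = instructions / seconds;              /* 4e9 instructions/s */
        double ipc = instructions / (clock_hz * seconds); /* 1.0                */

        printf("ICR = %.2e instructions per second\n", icr);
        printf("IPC = %.2f instructions per cycle\n", ipc);
        return 0;
    }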

Post process `objdump --disassemble` with ARM cycle counts

Is there a script available for post-processing some objdump --disassemble output to annotate it with cycle counts? Especially for the ARM family. Most of the time this would only be a pattern match with a table lookup for the count. I guess annotations like +5M for five memory cycles might be needed. Perl, python, bash, C, etc are fine. I think this can be done generically, but I am interested in the ARM, which has an orthogonal instruction set. Here is a thread on the 68HC11 doing the same thing. The script would need a CPU model option to select the appropriate cycle counts; I think these counts already exist in the gcc machine description.
I don't think there is an objdump switch for this, but an RTFM pointer would be great.
Edit: To clarify, assumptions such as a best-case memory sub-system, as will be the case when the code executes from cache, are fine. The goal is not a 100% accurate cycle count as per some running machine. It is possible to get a reasonable estimate, otherwise compiler design would be impossible.
As DWelch points out, a simple running total is not possible with a deeply pipelined architecture, like the more recent Cortex chips. The objdump post-processing would have to look at surrounding opcodes. A gcc plug-in is more likely to be able to accomplish this, and as plug-ins are new (4.5+), I don't think such a thing exists yet. A script for the ARM926 is certainly possible and fairly simple.
The memory latency doesn't matter. The memory controller is like another CPU: it is doing its business while the CPU is doing arithmetic, etc. A good, well-tuned algorithm will overlap the memory accesses with the computations. By counting loads/stores and cycles you can determine how much parallelism is achieved when you actively profile with a timer. The pipeline is significant due to interlocks between registers, but a cycle count for basic blocks can still be calculated and used even on modern ARM processors; it is just too complex for a simple script.
Cycle counts are not something that can be assessed by looking at the instruction alone on a modern high end ARM. There is a lot of runtime state that affects the real world retirement rate of an instruction. Does the data it needs exist in the cache? Does the instruction have any dependencies on previous instruction results? If so, what latencies does the forwarding unit remove? How full is the load/store buffer? What kind of memory mapping is it touching? How full are the processor pipelines that this instruction needs? Are there synchronizing instructions in the stream? Has speculation brought forward some data it depends on? What is the state of the register renamer? Have conditional instructions been filling the pipeline or was the decoder smart enough to skip them completely? What are the ratios between the core clock and the bus and memory clocks? What's the size of the branch prediction table?
Without a full processor simulation all you can get are guesses. Whether those numbers are meaningful to you depends on what you are trying to accomplish with them.
There is an online tool which estimates cycle counts on Cortex-A8. However, this CPU is quite old, and programs optimized for it might be suboptimal on newer CPUs.
AFAIK ARM also provides Cortex-A9 and Cortex-A5 cycle-accurate emulators in their RVDS software, but it is quite expensive.
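For an older in-order core like the ARM926 mentioned in the question, the pattern-match-plus-table-lookup approach could look roughly like the sketch below. The mnemonics and counts in the table are placeholders rather than values from any TRM, and, as the answers point out, a lookup like this cannot model pipeline interlocks on deeper cores:

    #include <stdio.h>
    #include <string.h>

    struct entry { const char *prefix; const char *cycles; };

    /* Placeholder table: fill in per CPU model from its documentation. */
    static const struct entry table[] = {
        { "ldm", "1+N memory cycles" },
        { "ldr", "1+M" },
        { "str", "1+M" },
        { "mul", "2..5" },
        { "b",   "1 (3 if taken)" },
        { NULL,  NULL }
    };

    static const char *lookup(const char *mnemonic)
    {
        /* Crude prefix match; a real script would match whole mnemonics. */
        for (const struct entry *e = table; e->prefix; e++)
            if (strncmp(mnemonic, e->prefix, strlen(e->prefix)) == 0)
                return e->cycles;
        return "1";   /* assume single-cycle data processing by default */
    }

    int main(void)
    {
        char line[512], copy[512];

        while (fgets(line, sizeof line, stdin)) {
            line[strcspn(line, "\n")] = '\0';
            strcpy(copy, line);

            /* objdump fields are tab separated: "addr:", encoding, mnemonic, operands */
            char *tok = strtok(copy, "\t");
            if (tok) tok = strtok(NULL, "\t");
            if (tok) tok = strtok(NULL, " \t");

            if (tok)
                printf("%s\t; ~%s\n", line, lookup(tok));
            else
                puts(line);
        }
        return 0;
    }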

How to compare two implementations of the same algorithm? (by examining their assembly code)

Assume I have two implementations of the same algorithm in assembly. I would like to know, by examining the two code snippets, which one is faster.
The parameters I thought one might take into account are: number of op-codes, number of branches, number of function frames.
My questions are:
Can I assume each opcode execution is one cycle?
What is the overhead of a branch which breaks the pipeline?
What are the effects and overhead of calling a function?
Is there a difference in the analysis between ARM and x86?
The question is theoretical since I have two implementations; one is 130 instructions long and one is 184 instructions long.
And I would like to know if it is definitely true to say that the 130-instruction snippet is faster than the 184-instruction implementation?
"BETTER == FASTER"
Without wanting to be flippant, the answers are
no
that depends on your hardware
that depends on your hardware
yes
You would really need to test things on your target hardware, or have a simulator that understands your hardware fully, in order to answer your question the way you meant to...
For the last part of your question, you need to define "better"…better.
Since you asked about a Cortex A9, the data sheet has instruction cycle counts in appendix B. These counts generally assume that the memory bus is fast enough to keep the CPU busy. In reality this is rarely the case. Many video/audio algorithms will have a big win in how they access memory.
One cycle per op
Of course you can't assume this if you want an exact count. However, if you are deciding which algorithm to choose, you can get a feel for the best one by looking at the instructions in the inner loop. Here, your cache should allow the code to execute as per the instruction counts in the data sheet. If the counts are close, then you probably need to look at each instruction. Loads/stores are more expensive and usually take multiple cycles, etc. Some algorithms, especially cryptographic ones, will have big wins by using assembler that doesn't map well to C: for example, clz, ror, using the carry for multi-word arithmetic, etc.
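As a small illustration of that last point (an assumed example, not from the answer): operations like count-leading-zeros and rotate are single ARM instructions (CLZ, ROR) but have no direct C operator, so builtins or hand-written assembler can beat the naive C version:

    #include <stdint.h>

    /* Naive C: up to 32 iterations of shift and test. */
    static int clz_portable(uint32_t x)
    {
        int n = 0;
        if (x == 0)
            return 32;
        while (!(x & 0x80000000u)) {
            x <<= 1;
            n++;
        }
        return n;
    }

    /* GCC/Clang builtin: compiles to a single CLZ instruction on ARMv5 and later. */
    static int clz_builtin(uint32_t x)
    {
        return x ? __builtin_clz(x) : 32;   /* __builtin_clz(0) is undefined */
    }

    /* Rotate right: recognised by compilers and emitted as ROR on ARM. */
    static uint32_t ror32(uint32_t x, unsigned n)
    {
        n &= 31;
        return (x >> n) | (x << ((32 - n) & 31));
    }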
Branch overhead
Look in Appendix B, or whatever data sheet has cycle counts for your processor. For an ARM926 it is about 3 cycles. The compiler will only generate up to two conditional opcodes in a row to avoid branching; otherwise, it branches. If the algorithm is large, the branch may disrupt the cache. A hard answer depends on your CPU, cache, and memory. According to the Cortex-A9 data sheet (B.5), there is only one cycle of overhead for a fixed branch.
Function overhead
This is much the same as the branch overhead. However, the compiler will also have an influence (as noted by Jim): does it cache-align functions? Does it perform leaf-function optimizations, etc.? With modern gcc versions, if all the functions are static, the compiler will generally inline them when it is advantageous. If the algorithms are particularly large, register spills may come into play. However, with your example of 130/184 instructions, this seems unlikely. The compiler options will obviously affect the overhead. You can use objdump -S to examine the prologue/epilogue and then determine the number of cycles for your hardware.
ARM versus x86
Of course there is a technical difference in the cycle counts. The CISC x86 also has variable instruction size. This complicates the analysis. It is slightly easier on the ARM.
Normally, you want to ballpark things and then actually run them with a profiler. The estimates can help guide development of the algorithms: loop/memory tuning, etc. for your hardware. Something like instruction emulation, page or alignment faults, etc. may be dominant and make all the cycle-count analysis meaningless. If the algorithm is in user space, pre-emption may negate cache wins from run to run. It is possible that one algorithm will work better in a lightly loaded system and the other will work better under a higher load.
A note on cycle counts
See the post-process objdump question for some complications in getting cycle counts. Basically, a typical CPU has several phases (a pipeline) and different conditions can cause stalls. As CPUs become more complex, the pipeline typically gets longer, meaning there are more conditions or phases which can stall. However, cycle-count estimates can be helpful in guiding development of an algorithm and evaluating it. Things like memory timing or branch prediction can be just as important, depending on the algorithm. I.e., cycle counts are not completely useless, but they are not complete either. Profiling should confirm actual algorithm times. If they diverge, instruction re-ordering, pre-fetching and other techniques may bring them closer. The fact that cycle counts and active profiling diverge can be helpful in itself.
It is definitely not true to say that the 130-instruction code is faster than the 184-instruction code. It is very easy to have 1000 instructions run faster than 100, and vice versa, on either of these platforms.
1. Can I assume each opcode execution is one cycle?
Start by looking at the advertised MIPS/MHz; although it is a marketing number, it gives a rough idea of what is possible. If the number is greater than one then more than one instruction per clock is possible.
2. What is the overhead of a branch which breaks the pipeline?
Anywhere from absolutely no effect to a very dramatic effect, on either system. One clock to hundreds is the potential penalty.
3. What are the effects and overhead of calling a function?
Depends heavily on the function, and on the function calling the function. Depending on the calling convention you might have to save registers to the stack, or rearrange the contents of registers to prepare the parameters for the function being called. If passing a struct by value, a copy of the struct may need to be made on the stack; the bigger the struct passed, the bigger the copy. Once in the function, a stack frame may need to be prepared, etc. There are many factors involved. This question and answer are also independent of platform.
4. Is there a difference in the analysis between ARM and x86?
Yes and no. Both systems use all the modern tricks of pipelining, branch prediction, etc. to keep the MIPS/MHz up. ARM is going to give better MIPS per MHz than x86; x86, being variable instruction length, might give more instructions per unit of cache. How you analyze the cache, memory and peripheral systems on the systems side of the analysis is roughly the same. The comparison of the instructions and core is similar and different depending on what aspects you are analyzing. The ARM is not microcoded; the x86 likely is, so you don't really see how many registers there really are, things like that. At the same time, with the x86 you can often get a better look at the memory system than with the ARM, since x86 parts are generally not system-on-a-chip. Depending on what ARM chip you buy you may lose a lot of visibility at the boundaries of the chip, and might not see all the memory and peripheral busses, for example. (x86 is changing that by putting PCIe on chip now, for example.) In the case of something in the Cortex-A class you mentioned, you would have similar edge-of-chip visibility, as those use larger/cheaper DRAM-based memory off chip rather than microcontroller-like on-chip resources.
Bottom line your final question:
"And I would like to know if it is definitely true to say the 130 instructions long snippet is faster than the 184 instructions long implementation?"
It is definitely NOT TRUE to say the 130-instruction snippet is faster than the 184-instruction snippet. It might be faster, it might be slower, and it might be about the same. With a lot more information we might be able to make a pretty good statement, or it may still be non-deterministic. It is easy to choose 100 instructions that execute faster than 1000 instructions, and likewise easy to choose 1000 instructions that execute faster than 100 instructions (even with no branching and no loops, just linear execution).
Your question is almost entirely meaningless: It probably depends on your input.
Most CPUs have something resembling a branch misprediction penalty (e.g. traditional ARM which throws away an instruction fetch/decode on any taken branch, IIRC). ARM and x86 also allow conditional execution, which can be faster than branching. If either of these are dependent on input data, then different inputs will follow different code paths.
Perhaps one version heavily uses conditional execution, which is wasteful when the condition is false. Perhaps another was compiled using some profiling information and performs no branches (except the return at the end) for a specific case. There are many, many reasons why a compiler can take the same source and produce an "optimized" output which is faster for one input and slower for another.
Many optimizations have this characteristic — for example, aligning the start of a loop to 16 bytes helps on some processors, but not when the loop is only executed once.
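To make the conditional-execution point concrete, here is a toy example (assumed, not from the answer) where the same reduction can be compiled as a data-dependent branch or as a conditional select; which version wins depends on how predictable the input data is:

    #include <stddef.h>
    #include <stdint.h>

    /* Branchy form: fast when the comparison is predictable, slow when it is not. */
    int64_t sum_if_branchy(const int32_t *v, size_t n, int32_t limit)
    {
        int64_t sum = 0;
        for (size_t i = 0; i < n; i++) {
            if (v[i] < limit)        /* taken/not-taken pattern depends on the data */
                sum += v[i];
        }
        return sum;
    }

    /* Branchless form: often compiled to a conditional select (IT block, CSEL,
     * or CMOV), so the cost is the same for every element regardless of data. */
    int64_t sum_if_branchless(const int32_t *v, size_t n, int32_t limit)
    {
        int64_t sum = 0;
        for (size_t i = 0; i < n; i++)
            sum += (v[i] < limit) ? v[i] : 0;
        return sum;
    }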
A textbook answer to this question, from the Cortex-A Series Programmer's Guide, chapter 17:
Although cycle timing information can be found in the Technical Reference Manual (TRM) for the processor that you are using, it is very difficult to work out how many cycles even a trivial piece of code will take to execute. The movement of instructions through the pipeline is dependent on the progress of the surrounding instructions and can be significantly affected by memory system activity. Pending loads or instruction fetches which miss in the cache can stall code for tens of cycles. Standard data processing instructions (logical and arithmetic) will take only one or two cycles to execute, but this does not give the full picture. Instead, we must use profiling tools, or the system performance monitor built-in to the processor, to extract useful information about performance.
Also read section 17.4, Cortex-A9 micro-architecture optimizations, which addresses your question in much more detail.
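As a pointer in the direction the quoted text suggests, reading the ARMv7-A PMU cycle counter from C looks roughly like this (a sketch assuming a Cortex-A class target with GCC; user-space access to PMCCNTR has to be enabled from the kernel first, e.g. via PMUSERENR, which is assumed here):

    #include <stdint.h>

    /* Read PMCCNTR, the ARMv7-A performance monitor cycle counter.
     * Assumes the PMU is enabled and user access has been granted by the kernel. */
    static inline uint32_t pmu_cycle_count(void)
    {
        uint32_t cc;
        __asm__ volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(cc));
        return cc;
    }

Take a reading before and after the region of interest and subtract; the 32-bit counter wraps, so keep measured regions short or use the divider and overflow features described in the TRM.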

Count clock cycles from assembly source code?

I have the source code written and I want to measure efficiency as how many clock cycles it takes to complete a particular task. Where can I learn how many clock cycles different commands take? Does every command take the same amount of time on 8086?
RDTSC is the high-resolution clock fetch instruction.
Bear in mind that cache misses, context switches, instruction reordering and pipelining, and multicore contention can all interfere with the results.
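On a CPU new enough to have it, a minimal RDTSC measurement looks something like the sketch below (GCC/Clang intrinsics assumed; the LFENCEs are one common way to limit the reordering effects mentioned above, and RDTSC counts reference cycles, so the other caveats still apply; take the minimum of many runs):

    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>

    static inline uint64_t ticks(void)
    {
        _mm_lfence();                /* order rdtsc with respect to earlier instructions */
        uint64_t t = __rdtsc();
        _mm_lfence();
        return t;
    }

    int main(void)
    {
        volatile uint32_t x = 0;
        uint64_t t0 = ticks();
        for (int i = 0; i < 1000; i++)
            x += i;                  /* region being measured */
        uint64_t t1 = ticks();
        printf("~%llu reference cycles\n", (unsigned long long)(t1 - t0));
        return 0;
    }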
Clock cycles and efficiency are not the same thing.
For efficiency of code you need to consider, in particular, how the memory is utilised, especially the differing levels of the cache. Also important is the branch prediction behaviour of the code, etc. You want a profiler that tells you these things, ideally one that gives you processor-specific information; an example is CodeAnalyst for AMD chips.
To answer your question, particular base instructions do have a given (average) number of cycles (AMD releases approximate numbers for the basic maths functions in their maths library). These numbers are a poor place to start optimising code, however.

Resources