Performance of dependent pre/post-incremented memory accesses

My question primarily applies to firestorm/icestorm (because that's the hardware I have), but I am curious about what other representative arm cores do too. Arm has strange pre- and post-incremented addressing modes. If I have (for instance) two post-incremented loads from the same register, will the second depend on the first, or is the CPU smart enough to perform them in parallel?
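For concreteness, here is a minimal sketch of the pattern I mean, written as GCC-style AArch64 inline assembly (the function itself is just an illustration I made up for this question, not code I'm shipping):

    /* Two post-indexed loads that both update the same base register. */
    #include <stdint.h>

    uint64_t two_postinc_loads(const uint64_t *p) {
        uint64_t a, b;
        __asm__ volatile(
            "ldr %0, [%2], #8\n\t"   /* a = *p; then p += 8 (post-increment writeback) */
            "ldr %1, [%2], #8\n\t"   /* b = *p; its address comes from the first writeback */
            : "=&r"(a), "=&r"(b), "+r"(p)
            :
            : "memory");
        return a + b;
    }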

AFAIK the exact behaviour of the M1 execution units is mostly undocumented. Still, there is certainly a dependency chain in this case. In fact, it would be very hard to break it, and the design of modern processors makes this even harder: the decoders, schedulers and execution units are distinct units, and it would be insane to dynamically adapt the scheduling based on the instructions currently executed by the execution units just so the chain could be broken in this particular case. Not to mention that instructions are pipelined and it generally takes a few cycles for them to be committed. Furthermore, the latency of such instructions varies with the memory location being fetched. Finally, even if this were the case, the Firestorm documentation does not mention such a feedback loop (see below for the links). Another possible way for a processor to optimize such a pattern is to fuse the micro-instructions so as to combine the increment and expose more parallelism, but this is quite complex to do for a relatively small improvement, and there is no evidence so far that Firestorm does it (see here for more information about Firestorm instruction fusion/elimination).
The M1 big cores (Apple's Firestorm) are designed to be massively parallel. They have 6 ALUs per core, so they can execute a lot of instructions in parallel on each core (possibly at the expense of a higher latency). However, this design tends to require many more transistors than the current mainstream x86 Intel/AMD alternatives (Alder Lake/XX-Cove architectures put aside). Thus, the cores operate at a significantly lower frequency so as to keep energy consumption low. This means dependency chains are significantly more expensive on such an architecture compared to others, unless there are enough independent instructions to execute in parallel alongside the critical path. For more information about how CPUs work, please read Modern Microprocessors - A 90-Minute Guide!. For more information about the M1 processors and especially the Firestorm architecture, please read this deep analysis.
Note that the Icestorm cores are designed to be energy efficient, so they are far less parallel, and thus a dependency chain should be less critical on such a core. Still, having fewer dependencies is often a good idea.
As for other ARM processors, recent core architectures are not as parallel as Firestorm. For example, the Cortex-A77 and Neoverse V1 have "only" 4 ALUs (which is already quite good). One also needs to care about the latency of each instruction actually used in a given code. This information is available on the ARM website and AFAIK has not yet been published for Apple processors (one needs to benchmark the instructions).
As for pre- VS post-increment, I expect them to take the same time (same latency and throughput), especially on big cores like Firestorm (which try to reduce the latency of the most frequent instructions at the expense of more transistors). However, the actual scheduling of the instructions for a given code can cause one to be slower than the other if the latency is not hidden by other instructions.

I received an answer to this on IRC: such usage will be fairly fast (which makes sense when you consider that it corresponds to typical looping patterns; good if the loop-carried dependency doesn't hurt too much), but it is still better to avoid it if possible, as it takes up rename bandwidth.
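For the record, one hypothetical way to avoid the pattern (my own sketch, not something from the IRC answer) is to use immediate offsets with a single base-register update, so the second load's address does not wait on the first instruction's writeback and the base register is renamed only once:

    /* Assumed GCC-style AArch64 inline asm; offsets replace the two writebacks. */
    #include <stdint.h>

    uint64_t two_offset_loads(const uint64_t *p) {
        uint64_t a, b;
        __asm__ volatile(
            "ldr %0, [%2]\n\t"        /* a = p[0] */
            "ldr %1, [%2, #8]\n\t"    /* b = p[1], independent of the first load */
            "add %2, %2, #16\n\t"     /* one base-register update for both loads */
            : "=&r"(a), "=&r"(b), "+r"(p)
            :
            : "memory");
        return a + b;
    }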

Related

Are there deterministic architecture emulators available?

Does such a thing as a deterministic (as in same result every run) architecture emulator exist? It is for benchmarking compilers/interpreters.
I do not mean an emulator that simply runs your program on whatever simulated architecture, but something that would compute an efficiency/speed index based on an analysis of the generated code (i.e., something that assigns a deterministic value to the time taken by each instruction).
I can compute benchmark statistics on a real machine, but a deterministic result would eliminate the particularities of my machine and allow me to see the effect of small optimizations.
Intel's IACA is a static analysis tool (see What is IACA and how do I use it?). But it only works for a single loop and doesn't model cache effects, only the pipeline. (And it assumes nearly-ideal OoO scheduling, I think, so probably doesn't find ROB-size limits, only front-end vs. execution port vs. loop-carried dependency latency bottlenecks.) Plus IACA has some bugs in its cost model (e.g. its unlamination rules for micro-fusion of indexed addressing modes are wrong for Haswell).
AFAIK, there are no cycle accurate x86 simulators publicly available for any modern micro-architecture. We only have emulators that don't even try to run at the same speed as any real hardware, just as fast as possible, like BOCHS and qemu. I'm sure Intel and AMD have simulator software internally to validate CPU designs and model their performance, though.
You could probably assign a cycle cost to every instruction in an interpreting emulator like BOCHS and get a deterministic number, and maybe model the cache, too (there are cache simulators). It would be the same every time you ran it, but it wouldn't correspond to the running time on any real hardware!
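For example, a toy version of that idea (the instruction kinds and costs below are made up; the output is deterministic but corresponds to no real hardware):

    #include <stdio.h>

    enum kind { K_ALU, K_LOAD, K_STORE, K_BRANCH, K_DIV, K_KINDS };

    /* Made-up per-instruction costs; a cache simulator could adjust K_LOAD. */
    static const int cost[K_KINDS] = {
        [K_ALU]    = 1,
        [K_LOAD]   = 4,
        [K_STORE]  = 1,
        [K_BRANCH] = 2,
        [K_DIV]    = 20,
    };

    int main(void) {
        /* Stand-in for a decoded instruction trace from an interpreting emulator. */
        enum kind trace[] = { K_LOAD, K_ALU, K_ALU, K_STORE, K_BRANCH, K_DIV };
        long cycles = 0;
        for (size_t i = 0; i < sizeof trace / sizeof trace[0]; i++)
            cycles += cost[trace[i]];
        printf("deterministic 'cycle' count: %ld\n", cycles);
        return 0;
    }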
Being deterministic is nowhere near sufficient to be interesting for tuning software. Modern x86 CPUs have a lot of microarchitectural state for out-of-order execution. We can often predict very closely how they'll run a loop (http://agner.org/optimize/, and other performance links in the x86 tag wiki), but on a larger scale there are many things that are only known by the vendors, so we couldn't write a truly accurate simulator even if we had the time. Things like branch prediction are known in general terms, but the details have not been reverse-engineered in full. But branch prediction is a critical part of making a heavily pipelined CPU sustain anywhere near 3 to 4 fused-domain (front-end) uops per clock in real code.
Things get even more complicated if you want to model a multi-core machine, and SMT / HT adds lots of complexity between threads sharing a core. It's barely deterministic in the real hardware because small timing variations can lead to different threads getting farther out of sync.
To be really useful, you'd want to be able to test your code on Sandybridge, Haswell, Skylake, Bulldozer, Ryzen, and maybe Silvermont. And maybe different variants of those with different amounts of cache, and server vs. desktop where L3 / memory latency differs. (Many-core servers have significantly worse uncore latency, and lower single-threaded bandwidth even though the aggregate bandwidth is higher.)
So the whole idea of a deterministic simulator for "the x86 architecture" is weird. You could make one simply by giving each instruction a cost of 1 cycle, but that would be totally unrealistic.

Emulate a very fast (virtual) CPU core

I know that the usual method when we want to make a big math computation faster is to use multiprocessing / parallel processing: we split the job into, for example, 4 parts, and we let 4 CPU cores run in parallel (parallelization). This is possible, for example, in Python with the multiprocessing module: on a 4-core CPU, it would allow you to use 100% of the computer's processing power instead of only 25% for a single-process job.
But let's say we want to speed up a computation job that cannot easily be split into parts.
Example: we are given a number generator function generate(n) that takes the previously generated number as input and is said to have a period of 10^20. We want to check this assertion with the following pseudo-code:
    a = 17
    for i = 1..10^20
        a = generate(a)
        check if a == 17
Instead of having the computer's 4 CPU cores (3.3 GHz) running "in parallel" with a total of 4 processes, is it possible to emulate one very fast single-core CPU of 13.2 GHz (4 * 3.3) running one single process with the previous code?
Is such technique available for a desktop computer? If not, is it available on cloud computing platforms (AWS EC2, etc.)?
Single-threaded performance is extremely valuable; it's much easier to write sequential code than to explicitly expose thread-level parallelism.
If there was an easy and efficient general-purpose way to do what you're asking which works when there is no parallelism in the code, it would already be in widespread use. Either internally inside multi-core CPUs, or in software if it required higher-level / larger-scale code transformations.
Out-of-order CPUs can find and exploit instruction-level parallelism within a single thread (over short distances, like a couple hundred instructions), but you need explicit thread-level parallelism to take advantage of multiple cores.
This is similar to How does a single thread run on multiple cores? over on SoftwareEngineering.SE, except that you've already ruled out any easy-to-find parallelism including instruction-level parallelism. (And the answer is: it doesn't. It's the hardware of a single core that finds the instruction-level parallelism in a single thread; my answer there explains some of the microarchitectural details of how that works.)
The reverse process: turning one big CPU into multiple weaker CPUs does exist, and is useful for running multiple threads which don't have much instruction-level parallelism. It's called SMT (Simultaneous MultiThreading). You've probably heard of Intel's Hyperthreading, the most widely known implementation of SMT. It trades single-threaded performance for more throughput, keeping more execution units fed with useful work more of the time. The cost of building a single wide core grows at least quadratically, which is why typical desktop CPUs don't just have a single massive core with 8-way SMT. (And note that a really wide CPU still wouldn't help with a totally dependent instruction stream, unless the generate function has some internal instruction-level parallelism.)
SMT would be good if you wanted to test 8 different generate() functions at once on a quad-core CPU. Without SMT, you could alternate in software between two generate chains in one thread, so out-of-order execution could be working on instructions from both dependency chains in parallel.
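A minimal sketch of that software-interleaving idea (generate() here is a stand-in LCG, not your actual function):

    #include <stdint.h>
    #include <stdio.h>

    static inline uint64_t generate(uint64_t a) {
        return a * 6364136223846793005ULL + 1442695040888963407ULL;  /* example LCG */
    }

    int main(void) {
        uint64_t a = 17, b = 42;              /* two unrelated starting points */
        for (uint64_t i = 0; i < 100000000ULL; i++) {
            a = generate(a);                  /* chain 1 */
            b = generate(b);                  /* chain 2 is independent of chain 1, so
                                                 OoO exec can overlap their latencies */
        }
        printf("%llu %llu\n", (unsigned long long)a, (unsigned long long)b);
        return 0;
    }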
Auto-parallelization by compilers at compile time is possible for source that has some visible parallelism, but if generate(a) isn't "separable" (not the correct technical term, I think) then you're out of luck.
e.g. if it's return a + hidden_array[static_counter++]; then the compiler can use math to prove that summing chunks of the array in parallel and adding the partial sums will still give the same result.
But if there's truly a serial dependency through a (like even a simple LCG PRNG), and the software doesn't know any mathematical tricks to break the dependency or reduce it to a closed form, you're out of luck. Compilers do know tricks like sum(0..n) = n*(n+1)/2 (evaluated slightly differently to avoid integer overflow in a partial result), or a+a+a+... (n times) is a * n, but that doesn't help here.
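As a small illustration of "evaluated slightly differently" (my own sketch, not compiler output): dividing the even factor by 2 before the multiply keeps the partial product from overflowing any earlier than the naive summation loop would.

    #include <stdint.h>

    uint64_t sum_0_to_n(uint64_t n) {
        /* exactly one of n and n+1 is even, so halve that factor first */
        return (n % 2 == 0) ? (n / 2) * (n + 1)
                            : n * ((n + 1) / 2);
    }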
There is a scheme studied mostly in academia called "Thread Decomposition". It aims to do more or less what you ask about: given single-threaded code, it tries to break it down into multiple threads in order to divide the work across a multicore system. This process can be done by a compiler (although this requires figuring out all possible side effects at compile time, which is very hard), by a JIT runtime, or through HW binary translation, but each of these methods has complicated limitations and drawbacks.
Unfortunately, other than being automated, this process has very little appeal, as it can hardly match true manual parallelization done by a person who understands the code. It also doesn't simply scale performance according to the number of threads, since it usually incurs a large overhead in the form of code that has to be duplicated.
Example paper by some nice folks from UPC in Barcelona: http://ieeexplore.ieee.org/abstract/document/5260571/

hyperthreading and turbo boost in matrix multiply - worse performance using hyper threading

I am tuning my GEMM code and comparing it with Eigen and MKL. I have a system with four physical cores. Until now I have used the default number of threads from OpenMP (eight on my system). I assumed this would be at least as good as four threads. However, I discovered today that if I run Eigen and my own GEMM code on a large dense matrix (1000x1000) I get better performance using four threads instead of eight. The efficiency jumped from 45% to 65%. I think this can also be seen in this plot
https://plafrim.bordeaux.inria.fr/doku.php?id=people:guenneba
The difference is quite substantial. However, the performance is much less stable: it jumps around quite a bit on each iteration, both with Eigen and my own GEMM code. I'm surprised that hyperthreading makes the performance so much worse. I guess this is not really a question. It's an unexpected observation which I'm hoping to get feedback on.
I see that not using hyper threading is also suggested here.
How to speed up Eigen library's matrix product?
I do have a question regarding measuring max performance. What I do now is run CPUz and look at the frequency as I'm running my GEMM code and then use that number in my code (4.3 GHz on one overclocked system I use). Can I trust this number for all threads? How do I know the frequency per thread to determine the maximum? How do I properly account for turbo boost?
The purpose of hyperthreading is to improve CPU utilization for code exhibiting high latency. Hyperthreading masks this latency by treating two threads at once, thus having more instruction-level parallelism.
However, a well-written matrix product kernel already exhibits excellent instruction-level parallelism and thus exploits nearly 100% of the CPU resources. Therefore there is no room for a second "hyper" thread, and the overhead of managing it can only decrease the overall performance.
Unless I've missed something (always possible), your CPU has one clock shared by all its components, so if you measure its rate at 4.3 GHz (or whatever) then that is the rate of all the components for which it makes sense to figure out a rate. Imagine the chaos if this were not so, with some cores running at one rate and others at another; the shared components (e.g. memory access) would become unmanageable.
As to hyperthreading actually worsening the performance of your matrix multiplication, I'm not surprised. After all, hyperthreading is a poor person's parallelisation technique, duplicating instruction pipelines but not functional units. Once you've got your code screaming along, pushing your n*10^6 contiguous memory locations through the FPUs, a context switch in response to a pipeline stall isn't going to help much. At best the other pipeline will scream along for a while before another context switch robs you of useful clock cycles; at worst all the careful arrangement of data in the memory hierarchy will be horribly mangled at each switch.
Hyperthreading is designed not for parallel numeric computational speed but for improving the performance of a much more general workload; we use general-purpose CPUs in high-performance computing not because we want hyperthreading but because all the specialist parallel numeric CPUs have gone the way of all flesh.
As a provider of multithreaded concurrency services, I have explored how hyperthreading affects performance under a variety of conditions. I have found that with software that limits its own high-utilization threads to no more than the actual physical processors available, the presence or absence of HT makes very little difference. Software that attempts to use more threads than that for heavy computational work is likely unaware that it is doing so, relying merely on the total processor count (which doubles under HT), and predictably runs more slowly. Perhaps the largest benefit that enabling HT may provide is that you can max out all physical processors without bringing the rest of the system to a crawl. Without HT, software often has to leave one CPU free to keep the host system running normally. Hyperthreads are just more switchable threads; they are not additional processors.
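As a concrete illustration of capping high-utilization threads at the physical core count, assuming an OpenMP-based kernel like the one in the question (the core count here is supplied by hand):

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        /* omp_get_num_procs() typically reports logical CPUs (8 with HT on a
           4-core chip), so the physical count is an assumption made explicit. */
        int physical_cores = 4;
        omp_set_num_threads(physical_cores);

        #pragma omp parallel
        {
            #pragma omp single
            printf("running with %d threads\n", omp_get_num_threads());
            /* ... call the GEMM kernel here ... */
        }
        return 0;
    }

Setting the environment variable OMP_NUM_THREADS=4 achieves the same thing without touching the code.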

Post process `objdump --disassemble` with ARM cycle counts

Is there a script available for post-processing some objdump --disassemble output to annotate it with cycle counts? Especially for the ARM family. Most of the time this would only be a pattern match with a table lookup for the count. I guess annotations like +5M for five memory cycles might be needed. Perl, Python, bash, C, etc. are fine. I think this can be done generically, but I am interested in the ARM, which has an orthogonal instruction set. Here is a thread on the 68HC11 doing the same thing. The script would need a CPU model option to select the appropriate cycle counts; I think these counts already exist in the gcc machine description.
I don't think there is an objdump switch for this, but RTFM would be great.
Edit: To clarify, assumptions such as a best-case memory subsystem, as will be the case when the code executes from cache, are fine. The goal is not a 100% accurate cycle count as per some running machine. It is possible to get a reasonable estimate, otherwise compiler design would be impossible.
As DWelch points out, a simple running total is not possible with a deeply pipelined architecture, like the more recent Cortex chips. The objdump post-processing would have to look at surrounding opcodes. A gcc plug-in is more likely to be able to accomplish this, and as plug-ins are new (4.5+), I don't think such a thing exists. A script for the ARM926 is certainly possible and fairly simple.
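To make the idea concrete, here is a rough sketch of the kind of table-lookup pass I have in mind for a simple in-order core; the cycle numbers are placeholders (not vendor data) and the parsing is deliberately naive:

    #include <stdio.h>
    #include <string.h>

    struct cost { const char *mnemonic; int cycles; };

    /* Placeholder costs; a real table would come from the TRM or the gcc
       machine description, selected by a CPU model option. */
    static const struct cost table[] = {
        { "ldr", 3 }, { "str", 2 }, { "mul", 3 },
        { "add", 1 }, { "sub", 1 }, { "mov", 1 }, { "b", 3 },
    };

    static int lookup(const char *m) {
        for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
            if (strcmp(table[i].mnemonic, m) == 0)
                return table[i].cycles;
        return -1;  /* unknown mnemonic: leave the line unannotated */
    }

    int main(void) {
        /* Reads `objdump --disassemble` output on stdin, appends "; ~N cycles". */
        char line[512], addr[32], enc[32], mnem[32];
        while (fgets(line, sizeof line, stdin)) {
            line[strcspn(line, "\n")] = '\0';
            if (sscanf(line, " %31[0-9a-fA-F]: %31s %31s", addr, enc, mnem) == 3) {
                int c = lookup(mnem);
                if (c >= 0) { printf("%s ; ~%d cycles\n", line, c); continue; }
            }
            printf("%s\n", line);
        }
        return 0;
    }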
The memory latency doesn't matter. The memory controller is like another CPU: it is doing its business while the CPU is doing arithmetic, etc. A good, well-tuned algorithm will overlap the memory accesses with the computations. By counting loads/stores and cycles you can determine how much parallelism is achieved when you actively profile with a timer. The pipeline is significant due to interlocks between registers, but a cycle count for basic blocks can reliably be calculated and used even on modern ARM processors; it is just too complex for a simple script.
Cycle counts are not something that can be assessed by looking at the instruction alone on a modern high end ARM. There is a lot of runtime state that affects the real world retirement rate of an instruction. Does the data it needs exist in the cache? Does the instruction have any dependencies on previous instruction results? If so, what latencies does the forwarding unit remove? How full is the load/store buffer? What kind of memory mapping is it touching? How full are the processor pipelines that this instruction needs? Are there synchronizing instructions in the stream? Has speculation brought forward some data it depends on? What is the state of the register renamer? Have conditional instructions been filling the pipeline or was the decoder smart enough to skip them completely? What are the ratios between the core clock and the bus and memory clocks? What's the size of the branch prediction table?
Without a full processor simulation all you can get are guesses. Whether those numbers are meaningful to you depends on what you are trying to accomplish with them.
There is an online tool which estimates cycle counts on Cortex-A8. However, this CPU is quite old, and programs optimized for it might be suboptimal on newer CPUs.
AFAIK ARM also provides Cortex-A9 and Cortex-A5 cycle-accurate emulators in their RVDS software, but it is quite expensive.

measuring real running time of an algorithm

Approximately how many physical MIPS instructions does an abstract algorithm operation amortize to? By an abstract algorithm operation I mean a basic operation, such as add, divide, etc.
I see this is not a strict measuring technique :-)
Kejia
There is a list of the basic MIPS instructions here. Most of the "basic operations" that you mentioned are a single MIPS instruction or perhaps two, which probably holds true on most current CPU families.
However, this does not take into account at all the architecture and performance characteristics of any of the modern CPUs. Different instructions often have different completion times. Current CPUs usually implement branch prediction, instruction pipelines, memory caching, parallelisation and a whole list of other techniques to make code execution faster.
Therefore just having the assembly code implementation of an algorithm says nothing about its execution speed. You would have to measure and profile the code on the actual hardware to obtain comparable results. In fact, some algorithms may be far more effective on certain CPUs, even within the same CPU family.
A common and rather understandable example is the effect of the instruction cache. Unrolling a loop will eliminate a number of branch operations, which intuitively makes code faster. If you run that code on a CPU of the same family with very little instruction cache memory, though, the added accesses to the main memory can make it far slower than the simple branch-based loop.
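A hypothetical illustration of that trade-off (a simple sum loop, unrolled by hand):

    #include <stddef.h>

    /* Straightforward loop: one branch per element, tiny code footprint. */
    long sum_simple(const long *a, size_t n) {
        long s = 0;
        for (size_t i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* Unrolled by 4: a quarter of the branches, but roughly 4x the loop body
       in the instruction cache (n is assumed to be a multiple of 4 here). */
    long sum_unrolled(const long *a, size_t n) {
        long s = 0;
        for (size_t i = 0; i < n; i += 4)
            s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
        return s;
    }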
Computers are complicated. If you want to get down to this level you need to start considering what kind of CPU you are using, how well your compiler can use this CPU's instruction set, what variables are being kept in what registers, what their bit-level representations are, etc. Even then, the number of instructions does not always map easily to the actual running time. Different instructions can take different numbers of clock cycles to execute, and this is not even considering OS threading and your program's cache miss rate.
In the end, there is a good reason we use big-O notation in the first place :)
BTW, most simple operations (add, subtract) on integers should map to a single machine instruction, in case you are worried.
It depends on the CPU architecture. Some processors require several cycles for a single instruction such as divide, while others manage to execute all machine-code instructions in a single cycle each.
It is sometimes relevant to measure an algorithm in how many floating point operations it requires. However this does not take I/O (such as reading memory) into consideration.
The speed of a CPU is sometimes provided in FLOPS (Floating Point OPerations per Second) which could help to give you a time estimate. Again, not taking I/O into consideration - and not multi-threading issues (also a very important measuring factor).
Donald Knuth addressed this very problem in writing Volume 1 of "The Art of Computer Programming".
In the preface he gives a lengthy justification for presenting algorithms in the assembly code for an imaginary machine -
... To avoid this dilemma, I have
attempted to design an "ideal"
computer called "MIX," with very
simple rules of operation ...
That way, one can talk sensibly about how many "cycles" an algorithm would take, without having to care about differences between machines, caching, latency, pipelines, or any of the other ways computers have been optimized to save time, at the expense of knowing how long they will take.

Resources