Have advancements in CPU design like dynamic instruction scheduling narrowed the performance gap between code generated by splat compilers and by optimizing compilers, i.e. can compilers get away with being more stupid these days?
On the contrary, optimizing compilers achieve more on contemporary CPUs. Automatic vectorization makes code up to several times faster. Modern instruction sets also give some optimization opportunities (for example, using CMOV instead of conditional branch on x86).
There are some areas, where performance gap is narrowed. CPU performs function calls faster, so function inlining may be not as beneficial as earlier. Loop unrolling may sometimes make code a little bit slower. But in most cases compiler optimizations and CPU optimizations are orthogonal to each other. CPUs cannot do loop fusion or common subexpression elimination. Compilers cannot provide good alternative to dynamic instruction scheduling, branch prediction or data prefetch.
Related
My question primarily applies to firestorm/icestorm (because that's the hardware I have), but I am curious about what other representative arm cores do too. Arm has strange pre- and post-incremented addressing modes. If I have (for instance) two post-incremented loads from the same register, will the second depend on the first, or is the CPU smart enough to perform them in parallel?
AFAIK the exact behaviour of the M1 execution units is mainly undocumented. Still, there is certainly a dependency chain in this case. In fact, it would be very hard to break it and the design of modern processors make this even harder: the decoders, execution units, schedulers are distinct units and it would be insane to dynamically adapt the scheduling based on the instructions executed in parallel by execution units so to be able to break the chain in this particular case. Not to mention that instructions are pipelined and it generally takes few cycles for them to be committed. Furthermore, the time of the instructions is variable based on the fetched memory location. Finally, even this would be the case, the Firestorm documents does not mention such a feedback loop (see below for the links). Another possible solution for a processor to optimize such a pattern is to fuse the microinstructions so to combine the increment and add more parallelism but this is pretty complex to do for a relatively small improvement and there is no evidence showing Firestorm can do that so far (see here for more information about Firestorm instruction fusion/elimitation).
The M1 big cores (Apple's Firestorm) are designed to be massively parallel. They have 6 ALUs per core so they can execute a lot instructions in parallel on each core (possibly at the expense of a higher latency). However, this design tends to require a lot more transistors than current mainstream x86 Intel/AMD alternative (Alderlake/XX-Cove architecture put aside). Thus, the cores operate at a significantly lower frequency so to keep the energy consumption low. This means dependency chains are significantly more expensive on such an architecture compared to others unless there are enough independent instructions to be execute in parallel on the critical path. For more information about how CPUs works please thread Modern Microprocessors - A 90-Minute Guide!. For more information about the M1 processors and especially the Firestorm architecture, please read this deep analysis.
Note that Icestorm cores are designed to be energy efficient so they are far less parallel and thus having a dependency chain should be less critical on such a core. Still, having less dependency is often a good idea.
As for other ARM processors, recent core architecture are not as parallel as Firestorm. For example, the Cortex-A77 and Neoverse V1 have "only" 4 ALUs (which is already quite good). One need to also care about the latency of each instruction actually used in a given code. This information is available on the ARM website and AFAIK not yet published for Apple processors (one need to benchmark the instructions).
As for the pre VS post increment, I expect them to take the same time (same latency and throughput), especially on big cores like Firestorm (that try to reduce the latency of most frequent instruction at the expense of more transistors). However, the actual scheduling of the instruction for a given code can cause one to be slower than the other if the latency is not hidden by other instructions.
I received an answer to this on IRC: such usage will be fairly fast (makes sense when you consider it corresponds to typical looping patterns; good if the loop-carried dependency doesn't hurt too much), but it is still better to avoid it if possible, as it takes up rename bandwidth.
I know that the usual method when we want to make a big math computation faster is to use multiprocessing / parallel processing: we split the job in for example 4 parts, and we let 4 CPU cores run in parallel (parallelization). This is possible for example in Python with multiprocessing module: on a 4-core CPU, it would allow to use 100% of the processing power of the computer instead of only 25% for a single-process job.
But let's say we want to make faster a non-easily-splittable computation job.
Example: we are given a number generator function generate(n) that takes the previously-generated number as input, and "it is said to have 10^20 as period". We want to check this assertion with the following pseudo-code:
a = 17
for i = 1..10^20
a = generate(a)
check if a == 17
Instead of having a computer's 4 CPU cores (3.3 Ghz) running "in parallel" with a total of 4 processes, is it possible to emulate one very fast single-core CPU of 13.2 Ghz (4*3.3) running one single process with the previous code?
Is such technique available for a desktop computer? If not, is it available on cloud computing platforms (AWS EC2, etc.)?
Single-threaded performance is extremely valuable; it's much easier to write sequential code than to explicitly expose thread-level parallelism.
If there was an easy and efficient general-purpose way to do what you're asking which works when there is no parallelism in the code, it would already be in widespread use. Either internally inside multi-core CPUs, or in software if it required higher-level / larger-scale code transformations.
Out-of-order CPUs can find and exploit instruction-level parallelism within a single thread (over short distances, like a couple hundred instructions), but you need explicit thread-level parallelism to take advantage of multiple cores.
This is similar to How does a single thread run on multiple cores? over on SoftwareEnginnering.SE, except that you've already ruled out any easy-to-find parallelism including instruction-level parallelism. (And the answer is: it doesn't. It's the hardware of a single core that finds the instruction-level parallelism in a single thread; my answer there explains some of the microarchitectural details of how that works.)
The reverse process: turning one big CPU into multiple weaker CPUs does exist, and is useful for running multiple threads which don't have much instruction-level parallelism. It's called SMT (Simultaneous MultiThreading). You've probably heard of Intel's Hyperthreading, the most widely known implementation of SMT. It trades single-threaded performance for more throughput, keeping more execution units fed with useful work more of the time. The cost of building a single wide core grows at least quadratically, which is why typical desktop CPUs don't just have a single massive core with 8-way SMT. (And note that a really wide CPU still wouldn't help with a totally dependent instruction stream, unless the generate function has some internal instruction-level parallelism.)
SMT would be good if you wanted to test 8 different generate() functions at once on a quad-core CPU. Without SMT, you could alternate in software between two generate chains in one thread, so out-of-order execution could be working on instructions from both dependency chains in parallel.
Auto-parallelization by compilers at compile time is possible for source that has some visible parallelism, but if generate(a) isn't "separable" (not the correct technical term, I think) then you're out of luck.
e.g. if it's return a + hidden_array[static_counter++]; then the compiler can use math to prove that summing chunks of the array in parallel and adding the partial sums will still give the same result.
But if there's truly a serial dependency through a (like even a simple LCG PRNG), and the software doesn't know any mathematical tricks to break the dependency or reduce it to a closed form, you're out of luck. Compilers do know tricks like sum(0..n) = n*(n+1)/2 (evaluated slightly differently to avoid integer overflow in a partial result), or a+a+a+... (n times) is a * n, but that doesn't help here.
There is a scheme studied mostly in the academy called "Thread Decomposition". It aims to do more or less what you ask about - given a single-threaded code, it tries to break it down into multiple threads in order to divide the work on a multicore system. This process can be done by a compiler (although this requires figuring out all possible side effects at compile time which is very hard), by a JIT runtime, or through HW binary-translation, but each of these methods has complicated limitations and drawbacks.
Unfortunately, other than being automated, this process has very little appeal as it can hardly match true manual parallelization done by a person how understands the code. It also doesn't simply scale performance according to the number of threads, since it usually incurs a large overhead in the form of code that has to be duplicated.
Example paper by some nice folks from UPC in Barcelona: http://ieeexplore.ieee.org/abstract/document/5260571/
I have a task that requires ultra performance
Of course I can optimize its algorithm but I also want optimize on the hardware level.
I can of course use the CPU affinity in order to allocate a whole core to the thread that processes my task
Another kind of optimization could be to put in the CPU caches (L1, L2, L3) the data my tasks requires to complete, in order to avoid as far as possible the "RAM access" latency
What API can I use for such a development?
(In other words, my questions could be: "how to force to the CPU to place in the cache a given data-structure?")
Thank you for your help
Excellent comment by Peter C about prefetching. As a former optimizer, the first thing we'd do to improve code was to remove all the SW prefetching. Also, don't try to muck around with power states and such. They are so good now days that the effort isn't worth the gain in HPC. A possible exception is hyper threading. The only time you'd want to go there would be for certain benchmarking where you need consistency as well as performance.
Take a look at the Intel optimization resources such as the optimization guide. Also get yourself a good profiler; Intel's VTune is truly one of the best. For info from Intel, use bing (or google) to find stuff. Intel's site is and always has been a glossy mess. VTune has Student and Educator licensing.
Here are the steps I used to take in optimizing apps for performance. First off, exhaust the higher-level software changes. Then get down into tweaking for hardware performance. Why? Two reasons: (1) code changes are generally architecture independent and have a better chance of surviving a move to a different HW platform and generation. (2) They are a heck of a lot simpler to do (though perhaps not as fun).
CODE CHANGES:
Remove all SW prefetching.
Replace any polling with periodic interrupts
Make sure any checking interrupts have appropriate intervals
Use Fortran. Really. There's a reason Fortran is still alive. Take a look at the Intel Fortran forums. The forum's all classical HPC. And Intel's Fortran compiler is one of the best.
Use a good optimizing compiler, and play with the compiler settings and pragmas/annotations (e.g. #pragma loop count). Again, Intel's is one of the best. (I hate saying that, but it's true.)
Use a good SW profiler to find optimization opportunities (where most of your time is being spent). Make sure the profiler is able to dig into the source code to identify time spent in different functions. Optimize those functions first.
Find opportunities for thread parallization (multi-threading) properly scoped to the number of cores
Find opportunities for vectorization
Convert from AoS (Array of Structs) to SofA. Note that if you have to do the conversion on the fly, it may not be worth the performance cost.
Structure your loops such that they are more conducive to the compiler finding vectorization opportunities. See any good optimization book for how to do this.
HARDWARE HACKING/OPTIMIZATION (using a good HW-level performance analyzer)
Identify cache and TLB misses, and restructure code.
Identify branch mispredicts and restructure code.
Identify pipeline stalls and restructure code.
One last suggestion, though I'm sure you already know this. Remember, go after the hottest spots. Smaller opportunities are time consuming and performance improvements are not impactful to the overall application.
Best of luck. Optimization can be fun and rewarding (if you are slightly crazy).
You can't typically override the LRU replacement policies in CPU caches. x86 CPUs at least don't support any way to "pin" certain address ranges into any level of cache.
What you can do is "prefetch" ahead of use. "software prefetch" is only rarely helpful. Usually HW prefetching does a good job, and your data then stays in cache, as long as your cache footprint is small enough. Ulrich Drepper's What every programmer should know about memory covers this, and is still relevant. However, its emphasis on software prefetch (esp. a separate prefetch thread) was appropriate for P4, but not a good idea for other CPUs. Keep that in mind while reading.
Designing your data structures and access patterns to be cache-friendly is very important, too. Try googling "cache aware" algorithms, maybe (or just read Ulrich's paper). Or just tune as you go, using performance counters to see if you've accidentally done something that causes a lot of cache misses.
If you're running this on an Intel Haswell Xeon or newer (Exxx v3 or higher), you can partition the L3 cache so the core running your critical thread owns a chunk of L3, and it won't be evicted by other cores. This is called CAT (Cache Allocation Technology). See also this article by Dan Luu
Well, you'll need to use a low level language (C would probably be the go-to in this case).
Then you have some reading to do : What every programmer should know about memory. Pay special attention to chapter 6, which contains very useful programming advice on how to optimize for specific usage patterns.
Assume I have two implementations of the same algorithm in assembly. I would like to know by examining the two snippets codes which one is faster.
The parameters I thought one might take into account are: number of op-codes, number of branches, number of function frames.
My questions are:
Can I assume each opcode execution is one cycle ?
What is the overhead of branch which break the pipeline ?
What are the effects and overhead of calling a function ?
Is there a difference in the analysis between ARM and x86 ?
The question is theoretical since I have two implementations; one 130 instructions long and one is 184 instructions long.
And I would like to know if it is definitely true to say the 130 instructions long snippet is faster than the 184 instructions long implementation?
"BETTER == FASTER"
Without wanting to be flippant, the answers are
no
that depends on your hardware
that depends on your hardware
yes
You would really need to test things on your target hardware, or have a simulator that understands your hardware fully, in order to answer your question the way you meant to...
For the last part of your question, you need to define "better"…better.
Since you asked about a Cortex A9, the data sheet has instruction cycle counts in appendix B. These counts generally assume that the memory bus is fast enough to keep the CPU busy. In reality this is rarely the case. Many video/audio algorithms will have a big win in how they access memory.
One cycle per op
Of course you can't assume this if you want an exact count. However, if you are deciding which algorithm to choose, you can get a feel for the best algorithm by looking at the instructions in the inner loop. Here, your cache should allow the code to execute as per the instruction counts in the data sheet. If the counts are close, then you probably need to look at each instruction. Load/stores are more expensive and usually multiples, etc. Some algorithms, especially crytographic, will have big wins by using assembler that doesn't map well to C. For example, clz, ror, using the carry for multi-word arithmetic, etc.
Branch overhead
Look in Appendix B, or whatever data sheet has cycle counts for your processor. For an ARM926 it is about 3 cycles. The compiler only generates two conditional opcodes in a row to avoid branching, otherwise, it branches. If the algorithm is large, the branch may disrupt the cache. A hard answer depends on your CPU, cache, and memory. According to the Cortex A9 datasheet (B.5), there is only one cycle overhead to a fixed branch.
Function overhead
This is much the same as the branch overhead. However, the compiler will also have an influence. noted by Jim Does it cache align functions. Does the compiler perform leaf function optimizations, etc. With modern gcc versions, if all the functions are static, the compiler will generally in-line when it is advantageous. If the algorithms are particularly large, a register spill may be advantageous. However, with your example of 130/184 instructions, this seems unlikely. The compiler options will obviously effect the overhead. You can use objdump -S to examine the prologue/epilogue and then determine the number of cycles for your hardware.
ARM verus x86
Of course there is a technical difference in the cycle counts. The CISC x86 also has variable instruction size. This complicates the analysis. It is slightly easier on the ARM.
Normally, you want to ball park things and then actually run them with a profiler. The estimates can help guide development of the algorithms. Loop/memory tuning, etc for your hardware. Something like instruction emulation, page or alignment faults, etc may be dominant and make all the cycle count analysis meaningless. If the algorithm is in user space, per-emption, may negate cache wins from run to run. It is possible that one algorithm will work better in a little loaded system and the other will work better under a higher load.
A note on cycle counts
See the post-process objdump for some complications in getting cycle counts. Basically a typical CPU is several phases (a pipe line) and different conditions can cause stalls. As CPU's become more complex, the pipe line typically gets longer, meaning there are more conditions or phases which can stall. However, cycle count estimates can be helpful in guiding development of an algorithm and evaluating them. Things like memory timing or branch prediction can be just as important, depending on the algorithm. Ie, cycle counts are not completely useless, but they are not complete either. Profiling should confirm actual algorithm times. If they diverge, instruction re-ordering, pre-fetching and other techniques may bring them closer. The fact that cycle counts and active profiling diverge can be helpful in itself.
It is definitely not true to say that the 130 instruction code is faster than the 184 instruction code. it is very easy to have 1000 instructions run faster than 100 and vice versa on either of these platforms.
1 Can I assume each opcode execution is one cycle ?
Start by looking at the advertised mips/mhz, although a marketing number it gives a rough idea of what is possible. If the number is greater than one then more than one instruction per clock is possible.
2 What is the overhead of branch which break the pipeline ?
Anywhere from absolutely no affect to a very dramatic affect, on either system. one clock to hundreds are the potential penalty.
3 What are the effects and overhead of calling a function ?
Depends heavily on the function, and the function calling the function. Depending on the calling convention you might have to save registers to the stack, or rearrange the contents of registers to prepare for the parameters for the function to be called. If passing a struct by value a copy of the struct may need to be made on the stack, the bigger the struct passed the bigger the copy. once in the function a stack frame may need to be prepared, etc, etc. There are many factors involved. This question and answer are also independent of platform.
4 Is there a difference in the analysis between ARM and x86 ?
yes and no, both systems use all the modern tricks of pipelining, branch prediction, etc to keep the mips/mhz up. ARM is going to give a better mips per mhz than x86, x86 being variable instruction length might give more instructions per unit cache. How you analyze the cache, and memory and peripheral systems in the systems side of the analysis is roughly the same. The comparison of the instructions and core are similar and different depending on what aspects you are analyzing. The arm is not microcoded, the x86 likely is so you dont really see how many registers there really are, things like that. at the same time the x86 you can get a better look at the memory system with the arm, since they are generally not system on a chip. Depending on what ARM chip you buy you may lose a lot of the visibility in the boundaries of the chip, might not see all the memory and peripheral busses, for example. (x86 is changing that by putting pcie on chip now for example) in the case of something in the cortex-a class you mentioned you would have similar edge of chip visibility as those would use larger/cheaper dram based memory off chip rather than microcontroller like on chip resources.
Bottom line your final question:
"And I would like to know if it is definitely true to say the 130 instructions long snippet is faster than the 184 instructions long implementation?"
It is definitely NOT TRUE to say the 130 instruction snippet is faster than the 184 instruction snippet. It might be faster it might be slower and it might be about the same. With a lot more information we might be able to make a pretty good statement or it may still be non-deterministic. it is easy to choose 100 instructions that execute faster than 1000 instructions and likewise easy to choose 1000 instructions that execute faster than 100 instructions (even if I were to add no branching and no loops, just linear execution)
Your question is almost entirely meaningless: It probably depends on your input.
Most CPUs have something resembling a branch misprediction penalty (e.g. traditional ARM which throws away an instruction fetch/decode on any taken branch, IIRC). ARM and x86 also allow conditional execution, which can be faster than branching. If either of these are dependent on input data, then different inputs will follow different code paths.
Perhaps one version heavily uses conditional execution, which is wasteful when the condition is false. Perhaps another was compiled using some profiling information that performs no branches (except the return at the end) for a specific case. There are many, many reason why a compiler can take the same source and produce an "optimized" output which is faster for one input and slower for another.
Many optimizations have this characteristic — for example, aligning the start of a loop to 16 bytes helps on some processors, but not when the loop is only executed once.
Some text book answer to this question from Cortex
™
-A Series Programmer’s Guide, chapter 17.
Although cycle timing information can be found in the Technical Reference Manual (TRM) for the processor that you are using, it is very difficult to work out how many cycles even a trivial piece of code will take to execute. The movement of instructions through the pipeline is dependent on the progress of the surrounding instructions and can be significantly affected by memory system
activity. Pending loads or instruction fetches which miss in the cache can stall code for tens of cycles. Standard data processing instructions (logical and arithmetic) will take only one or two cycles to execute, but this does not give the full picture. Instead, we must use profiling tools, or the system performance monitor built-in to the processor, to extract useful information about performance.
Also read under 17.4 Cortex-A9 micro-architecture optimizations which answers your question very very much.
Feel free to correct me if any part of my understanding is wrong.
My understanding is that GPUs offer a subset of the instructions that a normal CPU provides but executes them much faster.
I know there are ways to utilize GPU cycles for non-graphical purpose, but it seems like (in theory) a language that's Just In Time compiled could detect the presence of a suitable GPU and offload some of the work to the GPU behind the scenes without code change.
Is my understanding naive? Is it just a matter of it's really complicated and just hasn't been done it?
My understanding is that GPUs offer a
subset of the instructions that a
normal CPU provides but executes them
much faster.
It's definitly not as simple. The GPU is tailored mainly at SIMD/vector processing. So even though the theoretical potential of GPUs nowadays is vastely superior to CPUs, only programs that can benefit from SIMD instructions can be executed efficiently on the GPU. Also, there is of course a performance penalty when data has to be transfered from the CPU to the GPU to be processed there.
So for a JIT compiler to be able to use the GPU efficiently, it must be able to detect code that can be parallelized to benefit from SIMD instructions and then has to determine, if the overhead induced by transfering data from the CPU to the GPU will be outweight by the performance improvements.
It is possible to use GPU (e.g., a CUDA- or OpenCL-enabled one) to speed up JIT itself. Both register allocation and instruction scheduling could be efficiently implemented.