Cycles/byte calculations - performance

In crypto communities it is common to measure algorithm performance in cycles/byte. My question is: which parameters of the CPU architecture affect this number? Apart from the clock speed, of course :)

Two important factors are:
the ISA of the CPU, or more specifically how closely CPU instructions map to the operations that you need to perform - if you can perform a given operation in one instruction on one CPU but it requires 3 instructions on another CPU, then the first CPU will probably be faster. If you have specific crypto instructions on the CPU, or extensions such as SIMD which can be leveraged, then so much the better.
the instruction issue rate of the CPU, i.e. how many instructions can be issued per clock cycle
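As a rough illustration of how these two factors combine, here is a minimal sketch of a compute-bound estimate. The function name and all numbers are made up for illustration; real code rarely sustains a fixed IPC.

```python
def cycles_per_byte(instructions_per_byte: float, ipc: float) -> float:
    """Compute-bound estimate: instructions needed per byte / instructions issued per cycle."""
    if ipc <= 0:
        raise ValueError("ipc must be positive")
    return instructions_per_byte / ipc

# Hypothetical numbers, for illustration only.
# A scalar implementation needing 12 instructions per byte at 2 instructions/cycle:
scalar = cycles_per_byte(instructions_per_byte=12.0, ipc=2.0)
# A SIMD or crypto-extension version needing only 3 instructions per byte:
simd = cycles_per_byte(instructions_per_byte=3.0, ipc=2.0)
print(scalar, simd)
```

A better-matching ISA lowers the numerator, a higher issue rate raises the denominator.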

Here are some CPU features that can impact cycles/byte:
depth of pipeline
number of integer units (IU) and/or FPUs able to work in parallel
size of cache memories
algorithms for branch prediction
algorithms for handling cache misses
Moreover, you may be interested in the general problem of assessing WCET (worst case execution time)

Mainly:
Memory bus bandwidth
CPU instructions per cycle
How much memory the CPU can access per second can be a limiting factor. That depends on the algorithm and how big a part of the work is memory access. Also, which parts of memory are accessed will affect how well the memory cache works.
Nowadays instruction timing is measured not in how many cycles an instruction takes, but in how many instructions can be executed in the same cycle. The CPU's front end lines up several instructions to be executed in parallel, so it depends on how many parallel execution pipelines the CPU has and how well the code can be parallelised. Generally, a lot of conditional branching in the algorithm makes it harder to parallelise.
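To make the memory-bandwidth point concrete, here is a small sketch that treats cycles/byte as the larger of a compute estimate and a memory-bandwidth floor. This is a simplified model, and the clock speed and bandwidth figures are assumptions chosen for illustration.

```python
def memory_floor_cycles_per_byte(clock_hz: float, bandwidth_bytes_per_s: float) -> float:
    """Lower bound on cycles/byte when every byte must come through the memory bus."""
    return clock_hz / bandwidth_bytes_per_s

def estimated_cycles_per_byte(compute_cpb: float, clock_hz: float,
                              bandwidth_bytes_per_s: float) -> float:
    """The slower of the two limits dominates (simplified, no overlap modelling)."""
    return max(compute_cpb, memory_floor_cycles_per_byte(clock_hz, bandwidth_bytes_per_s))

# Hypothetical machine: 3 GHz core, 12 GB/s usable memory bandwidth.
floor = memory_floor_cycles_per_byte(3e9, 12e9)
compute_bound = estimated_cycles_per_byte(1.5, 3e9, 12e9)   # compute limit dominates
memory_bound = estimated_cycles_per_byte(0.1, 3e9, 12e9)    # bandwidth limit dominates
print(floor, compute_bound, memory_bound)
```

If the data fits in cache, the relevant bandwidth is that of the cache level, not of DRAM.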

Related

If CPU frequencies don't increase, how can CPU be faster for non-parallel code? [closed]

CPUs are still "improving", but their frequency hasn't improved a lot during the last 10 years.
I can understand that the transistor count increases because of smaller and smaller transistors, but I don't understand how a non-parallel program (most programs are non-parallel, I think?) can be executed much faster on new CPUs if the frequency doesn't increase.
I can understand why GPUs can be faster with more transistors because they're parallel processors (is that the right term?) and they only execute parallel code.
But most software is non-parallel, so to me it seems that new CPUs should not become much faster than previous CPUs, unless most programs can be parallelized, which is not the case (I'm not sure, but what are typical algorithms that can't be parallelized?).
Are larger L1/L2/L3 cache sizes allowing new CPUs to be faster? Or are there other things like new instructions or branching things?
What am I missing?
More and more programs are using threads for things that can be parallelized reasonably. But you're asking about single-threaded (or per-core) performance, and that's totally fine and interesting.
You're missing instruction-level parallelism (ILP) and increasing IPC (instructions per cycle).
Also, SIMD (x86's SSE, AVX, AVX-512, or ARM NEON, SVE) to get more work done per instruction, exploiting data-parallelism in a (potentially small) loop that way instead of or as well as with threading. But that isn't a big factor for many applications.
Work per clock is instructions/cycle × work/instruction × threads. (threads is basically the number of clocks ticking at once, if your program is running on multiple cores.) Even if threads is 1, the other two factors can increase.
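The product above can be written out directly. This is only a sketch of the arithmetic; the IPC and SIMD-width values are hypothetical.

```python
def work_per_clock(ipc: float, work_per_insn: float, threads: int = 1) -> float:
    """Work per clock = instructions/cycle * work/instruction * threads."""
    return ipc * work_per_insn * threads

# Hypothetical values, for illustration only.
scalar_single = work_per_clock(ipc=2.0, work_per_insn=1.0)            # scalar, one thread
simd_single = work_per_clock(ipc=2.0, work_per_insn=8.0)              # 8 elements per SIMD instruction
simd_threaded = work_per_clock(ipc=2.0, work_per_insn=8.0, threads=4) # same, on 4 cores
print(scalar_single, simd_single, simd_threaded)
```

The single-threaded question is about the first two factors: both can grow with threads fixed at 1.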
A problem with lots of data parallelism (e.g. summing an array, or adding 1 to every element) can expose that parallelism to the CPU in three ways: SIMD, instruction-level parallelism (e.g. unroll with multiple accumulators if there's a dependency chain like a sum), and thread-level parallelism.
These are all orthogonal. And some of them apply to problems that aren't data parallel, just different steps of a complicated program. IPC applies all the time. With good enough branch prediction, CPUs can see far ahead in the instruction stream and find parallel work to do (especially memory-level parallelism), as long as the code isn't doing something like traversing a linked list where the next load address depends on the current load result. Then you bottleneck on load latency, with no memory-level parallelism (except for whatever work you're doing on each node.)
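The "multiple accumulators" point can be sketched with a simple latency-vs-throughput model. The latency and throughput numbers below are hypothetical, and the model ignores loads, loop overhead and the final step of combining the accumulators.

```python
def reduction_cycles(n_adds: int, add_latency: float,
                     adds_per_cycle: float, accumulators: int) -> float:
    """Estimate cycles for a sum reduction split over independent accumulators."""
    # Each accumulator is its own dependency chain; the chains run in parallel.
    latency_bound = n_adds * add_latency / accumulators
    # The execution units can only start so many adds per cycle in total.
    throughput_bound = n_adds / adds_per_cycle
    return max(latency_bound, throughput_bound)

# Hypothetical core: add latency 4 cycles, 2 adds started per cycle.
one_chain = reduction_cycles(1024, add_latency=4, adds_per_cycle=2, accumulators=1)
eight_chains = reduction_cycles(1024, add_latency=4, adds_per_cycle=2, accumulators=8)
sixteen_chains = reduction_cycles(1024, add_latency=4, adds_per_cycle=2, accumulators=16)
print(one_chain, eight_chains, sixteen_chains)
```

With one chain the loop is latency-bound. In this model, once there are latency × throughput accumulators (8 here), adding more gains nothing.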
Some major factors
Larger caches improve hit rates and effective bandwidth, leading to fewer stalls. That raises average IPC. (Also smarter cache-replacement algorithms, like L3 adaptive replacement in Ivy Bridge.)
Actual DRAM bandwidth increases help, too (especially with good HW prefetching), but DRAM B/W is shared between cores. L1/L2 caches are private in modern CPUs, and L3 bandwidth scales nicely as well with different cores accessing different parts of it. Still, DRAM often comes into play, especially in code that isn't carefully tuned for cache-blocking. DRAM latency is near constant (in absolute nanoseconds, so getting "worse" in core clock cycles), but memory clocks have been climbing significantly in the past decade.
Larger ReOrder Buffers (ROB) and schedulers (RS) allow CPUs to find ILP over larger windows. Similarly, larger load and store buffers allow more memory-level parallelism, e.g. tracking more in-flight cache-miss loads in parallel. They also give you a larger buffer before you have to stall if a store misses in cache.
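The link between in-flight misses and achievable bandwidth follows from Little's law (throughput = concurrency / latency). A minimal sketch, with an assumed 64-byte line size and made-up latency and buffer counts:

```python
def sustained_bandwidth_gb_per_s(in_flight_lines: int, line_bytes: int,
                                 latency_ns: float) -> float:
    """Little's law: bytes in flight / latency. Bytes per nanosecond equals GB/s."""
    return in_flight_lines * line_bytes / latency_ns

# Hypothetical numbers: 64-byte cache lines, 80 ns miss latency.
pointer_chasing = sustained_bandwidth_gb_per_s(1, 64, 80.0)   # one dependent miss at a time
ten_in_flight = sustained_bandwidth_gb_per_s(10, 64, 80.0)    # 10 misses tracked in parallel
twenty_in_flight = sustained_bandwidth_gb_per_s(20, 64, 80.0)
print(pointer_chasing, ten_in_flight, twenty_in_flight)
```

With latency roughly constant, tracking more misses at once is what raises per-core bandwidth. A linked-list traversal stays stuck at one miss in flight.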
Better branch prediction reduces how often this speculative work has to be discarded if the CPU finds it had guessed the wrong path for an earlier branch.
Wider pipelines allow higher peak IPC. At best, in high-throughput code (not a lot of stalls, and lots of ILP), this can be sustained.
Otherwise, it at least helps get to the next stall sooner, doing a burst of work. And to clear out instructions waiting in the ROB when a cache-miss load does finally arrive, making room for new work to potentially see some independent work later. If execution of a loop condition can get far ahead of the actual work in the loop, a mispredict of the loop exit branch might be resolved before the back-end runs out of work to do. So a max IPC higher than the steady-state bottleneck of a loop is useful for loops that aren't infinite.
See also
Modern Microprocessors A 90-Minute Guide! - covers this quite well, how clocks mostly stopped increasing steadily once we hit the "power wall" in Pentium 4. As ever more efficient CPUs are designed, clocks are creeping up again, especially with fine-grained clock gating to stop heat generation from parts of a CPU that aren't doing anything in any given clock cycle.
This allows high turbo clocks when running "inefficient" code that bottlenecks on cache misses, branch misses, and other things like that so there aren't a lot of execution units busy at once.
How does a single thread run on multiple cores? - it doesn't, that's not what IPC is about. My answer there attempts to explain it so a beginner can understand.
https://www.realworldtech.com/sandy-bridge/ A deep dive on Sandybridge, how it finds instruction-level parallelism.
What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand? (much more technical) - what to look for when predicting how fast a CPU might run a given loop or sequence of assembly instructions.

Performance of dependent pre/post-incremented memory accesses

My question primarily applies to firestorm/icestorm (because that's the hardware I have), but I am curious about what other representative arm cores do too. Arm has strange pre- and post-incremented addressing modes. If I have (for instance) two post-incremented loads from the same register, will the second depend on the first, or is the CPU smart enough to perform them in parallel?
AFAIK the exact behaviour of the M1 execution units is mainly undocumented. Still, there is certainly a dependency chain in this case. In fact, it would be very hard to break it, and the design of modern processors makes this even harder: the decoders, execution units and schedulers are distinct units, and it would be insane to dynamically adapt the scheduling based on the instructions executed in parallel by the execution units so as to break the chain in this particular case. Not to mention that instructions are pipelined and it generally takes a few cycles for them to be committed. Furthermore, the latency of the instructions varies with the memory location fetched. Finally, even if this were the case, the Firestorm documents do not mention such a feedback loop (see below for the links). Another possible way for a processor to optimize such a pattern is to fuse the micro-instructions so as to combine the increments and expose more parallelism, but this is pretty complex to do for a relatively small improvement, and there is no evidence so far that Firestorm can do it (see here for more information about Firestorm instruction fusion/elimination).
The M1 big cores (Apple's Firestorm) are designed to be massively parallel. They have 6 ALUs per core, so they can execute a lot of instructions in parallel on each core (possibly at the expense of a higher latency). However, this design tends to require a lot more transistors than current mainstream x86 Intel/AMD alternatives (Alderlake/XX-Cove architecture put aside). Thus, the cores operate at a significantly lower frequency so as to keep the energy consumption low. This means dependency chains are significantly more expensive on such an architecture compared to others, unless there are enough independent instructions to be executed in parallel on the critical path. For more information about how CPUs work, please read Modern Microprocessors - A 90-Minute Guide!. For more information about the M1 processors and especially the Firestorm architecture, please read this deep analysis.
Note that Icestorm cores are designed to be energy efficient, so they are far less parallel, and thus having a dependency chain should be less critical on such a core. Still, having fewer dependencies is often a good idea.
As for other ARM processors, recent core architectures are not as parallel as Firestorm. For example, the Cortex-A77 and Neoverse V1 have "only" 4 ALUs (which is already quite good). One also needs to care about the latency of each instruction actually used in a given code. This information is available on the ARM website and AFAIK not yet published for Apple processors (one needs to benchmark the instructions).
As for pre- vs post-increment, I expect them to take the same time (same latency and throughput), especially on big cores like Firestorm (which try to reduce the latency of the most frequent instructions at the expense of more transistors). However, the actual scheduling of the instructions for a given code can cause one to be slower than the other if the latency is not hidden by other instructions.
I received an answer to this on IRC: such usage will be fairly fast (makes sense when you consider it corresponds to typical looping patterns; good if the loop-carried dependency doesn't hurt too much), but it is still better to avoid it if possible, as it takes up rename bandwidth.

hyperthreading and turbo boost in matrix multiply - worse performance using hyper threading

I am tuning my GEMM code and comparing with Eigen and MKL. I have a system with four physical cores. Until now I have used the default number of threads from OpenMP (eight on my system). I assumed this would be at least as good as four threads. However, I discovered today that if I run Eigen and my own GEMM code on a large dense matrix (1000x1000) I get better performance using four threads instead of eight. The efficiency jumped from 45% to 65%. I think this can also be seen in this plot:
https://plafrim.bordeaux.inria.fr/doku.php?id=people:guenneba
The difference is quite substantial. However, the performance is much less stable. The performance jumps around quite a bit each iteration both with Eigen and my own GEMM code. I'm surprised that Hyperthreading makes the performance so much worse. I guess this is not a question. It's an unexpected observation which I'm hoping to find feedback on.
I see that not using hyper threading is also suggested here.
How to speed up Eigen library's matrix product?
I do have a question regarding measuring max performance. What I do now is run CPUz and look at the frequency as I'm running my GEMM code and then use that number in my code (4.3 GHz on one overclocked system I use). Can I trust this number for all threads? How do I know the frequency per thread to determine the maximum? How do I properly account for turbo boost?
The purpose of hyperthreading is to improve CPU usage for code exhibiting high latency. Hyperthreading masks this latency by treating two threads at once, thus having more instruction-level parallelism.
However, a well-written matrix product kernel exhibits excellent instruction-level parallelism and thus exploits nearly 100% of the CPU resources. Therefore there is no room for a second "hyper" thread, and the overhead of its management can only decrease the overall performance.
Unless I've missed something, always possible, your CPU has one clock shared by all its components, so if you measure its rate at 4.3GHz (or whatever) then that's the rate of all the components for which it makes sense to figure out a rate. Imagine the chaos if this were not so, some cores running at one rate, others at another rate; the shared components (e.g. memory access) would become unmanageable.
As to hyperthreading actually worsening the performance of your matrix multiplication, I'm not surprised. After all, hyperthreading is a poor-person's parallelisation technique, duplicating instruction pipelines but not functional units. Once you've got your code screaming along pushing your n*10^6 contiguous memory locations through the FPUs a context switch in response to a pipeline stall isn't going to help much. At best the other pipeline will scream along for a while before another context switch robs you of useful clock cycles, at worst all the careful arrangement of data in the memory hierarchy will be horribly mangled at each switch.
Hyperthreading is designed not for parallel numeric computational speed but for improving the performance of a much more general workload; we use general-purpose CPUs in high-performance computing not because we want hyperthreading but because all the specialist parallel numeric CPUs have gone the way of all flesh.
As a provider of multithreaded concurrency services, I have explored how hyperthreading affects performance under a variety of conditions. I have found that with software that limits its own high-utilization threads to no more than the actual physical processors available, the presence or absence of HT makes very little difference. Software that attempts to use more threads than that for heavy computational work is likely unaware that it is doing so, relying merely on the total processor count (which doubles under HT), and predictably runs more slowly. Perhaps the largest benefit that enabling HT may provide is that you can max out all physical processors without bringing the rest of the system to a crawl. Without HT, software often has to leave one CPU free to keep the host system running normally. Hyperthreads are just more switchable threads; they are not additional processors.
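On the question of measuring maximum performance, a common way to get the efficiency figure is to divide measured GFLOPS by a theoretical peak based on physical cores. The sketch below assumes a single clock for all cores and a hypothetical 16 floating-point operations per cycle per core; the real value depends on the SIMD width and FMA units of the CPU. The measured figure is also invented.

```python
def peak_gflops(physical_cores: int, clock_ghz: float,
                flops_per_cycle_per_core: float) -> float:
    """Theoretical peak: count physical cores only, since hyperthreads share the FPUs."""
    return physical_cores * clock_ghz * flops_per_cycle_per_core

def efficiency(measured_gflops: float, peak: float) -> float:
    """Fraction of the theoretical peak actually achieved."""
    return measured_gflops / peak

# 4 physical cores at 4.3 GHz (from the question); 16 flops/cycle/core is an assumption.
peak = peak_gflops(physical_cores=4, clock_ghz=4.3, flops_per_cycle_per_core=16)
eff = efficiency(measured_gflops=178.88, peak=peak)  # hypothetical measurement
print(round(peak, 1), round(eff, 2))
```

If the clock drops under sustained load or differs with turbo, the peak should be recomputed with the frequency observed while the kernel runs.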

Count clock cycles from assembly source code?

I have the source code written and I want to measure efficiency as how many clock cycles it takes to complete a particular task. Where can I learn how many clock cycles different commands take? Does every command take the same amount of time on 8086?
RDTSC is the high-resolution clock fetch instruction.
Bear in mind that cache misses, context switches, instruction reordering and pipelining, and multicore contention can all interfere with the results.
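Where RDTSC is not directly accessible, a rough stand-in is to time the code with a high-resolution wall clock and convert to cycles using an assumed clock frequency. The sketch below does that and takes the minimum over several runs to reduce the interference mentioned above. It is not an RDTSC measurement, and `clock_hz` is a value you would have to supply for your own machine.

```python
import time

def estimate_cycles(func, clock_hz: float, repeats: int = 5) -> float:
    """Approximate cycle count of func(): best wall-clock time * assumed clock rate."""
    best_ns = None
    for _ in range(repeats):
        start = time.perf_counter_ns()
        func()
        elapsed = time.perf_counter_ns() - start
        # Keep the fastest run: it is the one least disturbed by
        # context switches and cold caches.
        if best_ns is None or elapsed < best_ns:
            best_ns = elapsed
    return best_ns * clock_hz / 1e9

# Hypothetical 3 GHz clock; the workload is just a placeholder.
cycles = estimate_cycles(lambda: sum(range(1000)), clock_hz=3.0e9)
print(f"about {cycles:.0f} cycles")
```

The result is only as good as the assumed frequency, which can vary with turbo and power management.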
Clock cycles and efficiency are not the same thing.
For efficiency of code you need to consider, in particular, how the memory is utilised, especially the differing levels of the cache. Also important is the branch prediction behaviour of the code, etc. You want a profiler that tells you these things, ideally one that gives you processor-specific information: an example is CodeAnalyst for AMD chips.
To answer your question, particular base instructions do have a given (average) number of cycles (AMD release the approximate numbers for the basic maths functions in their maths library). These numbers are a poor place to start optimising code, however.

measuring real running time of an algorithm

Approximately, how many physical MIPS instructions does an abstract algorithm operation amortize to? By an abstract algorithm operation, I mean a basic operation, such as add, divide, etc.
I see this is not a strict measuring technique :-)
Kejia
There is a list of the basic MIPS instructions here. Most of the "basic operations" that you mentioned are a single MIPS instruction or perhaps two, which probably holds true on most current CPU families.
However, this does not take into account at all the architecture and performance characteristics of any of the modern CPUs. Different instructions often have different completion times. Current CPUs usually implement branch prediction, instruction pipelines, memory caching, parallelisation and a whole list of other techniques to make the code execution faster.
Therefore just having the assembly code implementation of an algorithm says nothing about its execution speed. You would have to measure and profile the code on the actual hardware to obtain comparable results. In fact, some algorithms may be far more effective on certain CPUs, even within the same CPU family.
A common and rather understandable example is the effect of the instruction cache. Unrolling a loop will eliminate a number of branch operations, which intuitively makes code faster. If you run that code on a CPU of the same family with very little instruction cache memory, though, the added accesses to the main memory can make it far slower than the simple branch-based loop.
Computers are complicated. If you want to get down to this level you need to start considering what kind of CPU you are using, how well your compiler can use this CPU's instruction set, what variables are being kept in what registers, what are their bit-level representations, etc. Even then, the number of instructions does not always map easily to the actual running time. Different instructions can take different numbers of clock cycles to execute, and this is not even thinking about OS threading and your program's cache miss rate.
In the end, there is a good reason we use big-O notation in the first place :)
BTW, most simple operations (add, subtract) on integers should map to a single machine instruction, in case you are worried.
It depends on the CPU architecture. Some processors require several cycles for a single instruction such as divide, while others manage to execute all machine code instructions in a single cycle each.
It is sometimes relevant to measure an algorithm in how many floating point operations it requires. However this does not take I/O (such as reading memory) into consideration.
The speed of a CPU is sometimes provided in FLOPS (Floating Point OPerations per Second) which could help to give you a time estimate. Again, not taking I/O into consideration - and not multi-threading issues (also a very important measuring factor).
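As a sketch of that kind of estimate: count the floating-point operations of the algorithm and divide by the machine's FLOPS rating. The example uses naive dense matrix multiplication (about 2n^3 operations: n^3 multiplies and roughly n^3 adds); the 100 GFLOPS rating is a made-up figure, and, as noted, memory traffic is ignored.

```python
def matmul_flops(n: int) -> int:
    """Approximate flop count for a naive n x n dense matrix multiply."""
    return 2 * n ** 3

def estimated_seconds(flop_count: float, flops_per_second: float) -> float:
    """Best-case time estimate: ignores I/O, memory access and threading effects."""
    return flop_count / flops_per_second

# Hypothetical machine rated at 100 GFLOPS.
flops = matmul_flops(1000)
seconds = estimated_seconds(flops, 100e9)
print(flops, seconds)
```

Real runs will take longer than this bound unless the code keeps the floating-point units fully busy.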
Donald Knuth addressed this very problem in writing Volume 1 of "The Art of Computer Programming".
In the preface he gives a lengthy justification for presenting algorithms in the assembly code for an imaginary machine -
"... To avoid this dilemma, I have attempted to design an "ideal" computer called "MIX," with very simple rules of operation ..."
That way, one can talk sensibly about how many "cycles" an algorithm would take, without having to care about differences between machines, caching, latency, pipelines, or any of the other ways computers have been optimized to save time, at the expense of knowing how long they will take.
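The idea of counting cycles on an idealized machine can be sketched as a fixed cost table summed over an instruction trace. The operation names and costs below are invented for illustration and are not meant to reproduce Knuth's MIX definitions.

```python
# Hypothetical per-instruction costs for an idealized machine (illustrative only).
CYCLE_COST = {"load": 2, "add": 2, "store": 2, "jump": 1}

def count_cycles(trace):
    """Sum fixed per-instruction costs: no caches, pipelines or latency effects."""
    return sum(CYCLE_COST[op] for op in trace)

# One loop iteration: load a value, add to it, store it back, jump to the loop top.
loop_body = ["load", "add", "store", "jump"]
total = count_cycles(loop_body * 100)  # 100 iterations
print(total)
```

On such a machine the count is exact and repeatable, which is precisely what real hardware no longer offers.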
