Fast hardware integer division - performance

The hardware instruction for integer division has historically been very slow. For example, DIVQ on Skylake has a latency of 42-95 cycles [1] (and a reciprocal throughput of 24-90) for 64-bit inputs.
There are newer processors, however, which perform much better: Goldmont has 14-43 cycle latency and Ryzen has 14-47 [1], the M1 apparently has a "throughput of 2 clock cycles per divide" [2], and even the Raspberry Pi Pico has an "8-cycle signed/unsigned divide/modulo circuit, per core" (though that seems to be for 32-bit inputs) [3].
My question is: what has changed? Was a new algorithm invented? What algorithms do the new processors employ for division, anyway?
[1] https://www.agner.org/optimize/#manuals
[2] https://ridiculousfish.com/blog/posts/benchmarking-libdivide-m1-avx512.html
[3] https://raspberrypi.github.io/pico-sdk-doxygen/group__hardware__divider.html#details

On Intel before Ice Lake, 64-bit operand size is an outlier, much slower than 32-bit operand size for integer division. div r32 is 10 uops, with 26-cycle worst-case latency but 6-cycle throughput. (https://uops.info/ and https://agner.org/optimize/, and Trial-division code runs 2x faster as 32-bit on Windows than 64-bit on Linux has a detailed exploration.)
There wasn't a fundamental change in how divide units are built, just widening the HW divider to not need extended-precision microcode. (Intel has had fast-ish dividers for FP for much longer, and that's basically the same problem just with only 53 bits instead of 64. The hard part of FP division is integer division of the mantissas; subtracting the exponents is easy and done in parallel.)
The incremental changes are things like widening the radix to handle more bits with each step, and for example pipelining the refinement steps after the initial (table-lookup?) approximation to improve throughput but not latency.
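For intuition about what a digit-recurrence divider iterates on, here is a minimal software sketch of radix-2 restoring division, which retires one quotient bit per step; a radix-4 or radix-1024 hardware divider retires 2 or 10 bits per step by choosing from a wider digit set, which is exactly the "more bits with each step" improvement. This is only an illustration of the recurrence, not the microarchitecture of any particular CPU.

```cpp
#include <cstdint>

// Radix-2 restoring division: one quotient bit per iteration.
// A radix-2^k hardware divider does the equivalent of k of these steps at once.
uint64_t divide_u64(uint64_t n, uint64_t d, uint64_t &rem) {
    uint64_t q = 0, r = 0;
    for (int i = 63; i >= 0; --i) {
        r = (r << 1) | ((n >> i) & 1);   // shift in the next dividend bit
        if (r >= d) {                    // trial subtraction picks the digit (0 or 1)
            r -= d;
            q |= (uint64_t{1} << i);
        }
    }
    rem = r;
    return q;                            // assumes d != 0
}
```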
Related:
How sqrt() of GCC works after compiled? Which method of root is used? Newton-Raphson? has a brief high-level overview of the div/sqrt units that modern CPUs use, with for example a radix-1024 divider being new in Broadwell.
Do FP and integer division compete for the same throughput resources on x86 CPUs? (No in Ice Lake and later on Intel; having a dedicated integer unit instead of using the low element of the FP mantissa divide/sqrt unit is presumably related to making it 64 bits wide.)
Divide units historically were often not pipelined at all, as that's hard because it requires replicating a lot of gates instead of iterating on the same multipliers, I think. And most software usually avoids (or avoided) integer division because it was historically very expensive, or at least does it infrequently enough not to benefit very much from higher-throughput dividers with the same latency.
But with wider CPU pipelines with higher IPC shrinking the cycle gap between divisions, it's more worth doing. Also with huge transistor budgets, spending a bunch on something that will sit idle for a lot of the time in most programs still makes sense if it's very helpful for a few programs. (Like wider SIMD, and specialized execution units like x86 BMI2 pdep / pext). Dark silicon is necessary or chips would melt; power density is a huge concern, see Modern Microprocessors: A 90-Minute Guide!
Also, with more and more software being written by people who don't know much about performance, and more code avoiding compile-time constants in favour of being flexible (divisors coming from function args that ultimately come from some config option), I'd guess modern software doesn't avoid division as much as older programs did.
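To make that compile-time-constant point concrete, a minimal (hypothetical) example: with a constant divisor, compilers replace the division with a multiply by a fixed-point reciprocal plus shifts, but when the divisor only arrives at run time they have to emit an actual divide instruction (libraries like libdivide, the subject of [2], do the same reciprocal trick at run time for divisors that get reused).

```cpp
#include <cstdint>

// Divisor known at compile time: compilers emit a multiply by a "magic"
// fixed-point reciprocal plus shifts, with no div instruction at all.
uint64_t chunks_fixed(uint64_t bytes) {
    return bytes / 4096;
}

// Divisor only known at run time (e.g. read from a config file): the compiler
// has to emit an actual div/udiv instruction for every call.
uint64_t chunks_config(uint64_t bytes, uint64_t chunk_size) {
    return bytes / chunk_size;
}
```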
Floating-point division is often harder to avoid than integer, so it's definitely worth having fast FP dividers. And integer can borrow the mantissa divider from the low SIMD element, if there isn't a dedicated integer-divide unit.
So that FP motivation was likely the actual driving force behind Intel's improvements to divide throughput and latency even though they left 64-bit integer division with garbage performance until Ice Lake.

Related

Does changing one single bit consume fewer cycles per byte than adding/subtracting/XORing an entire processor word?

Let's suppose I change one single bit in a word and add two other words.
Does changing one bit in a word consume fewer CPU cycles than changing an entire word?
If it consumes fewer CPU cycles, how much faster would it be?
Performance (in clock cycles) is not data-dependent for integer ALU instructions other than division on most CPUs. ADD and XOR have the same 1-cycle latency on the majority of modern pipelined CPUs. (And the same cycle cost as each other on most older / simpler CPUs, whether or not it's 1 cycle.)
See https://agner.org/optimize/ and https://uops.info/ for numbers on modern x86 CPUs.
Lower power can indirectly affect performance by allowing higher boost clocks without having to slow down for thermal limits. But the difference in this case is so small that I don't expect it would be measurable on a mainstream CPU, or even on the efficiency cores of an Alder Lake or a mobile-phone CPU that's more optimized for low power.
Power in a typical CPU (using CMOS logic) scales with how many gates have their outputs change value per cycle. When a transistor switches on, it conducts current from Vcc or to ground, charging or discharging the tiny parasitic capacitance of the things the logic gate's output is connected to. Since the majority of the (low) resistance in the path of that current is in the transistor itself, that's where the electrical energy turns into heat.
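For reference, the standard first-order model for this dynamic switching power is P ≈ α · C · V² · f (activity factor × switched capacitance × supply voltage squared × clock frequency), with roughly ½ · C · V² of energy dissipated per output transition.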
For more details, see:
Why does switching cause power dissipation? on electronics.SE for the details for one CMOS gate
For a mathematical operation in CPU, could power consumption depend on the operands?
Modern Microprocessors: A 90-Minute Guide! has a section about power. (And read the whole article if you have any general interest in CPU architecture; it's good stuff.)
ADD does require carry propagation potentially across the whole width of the word, e.g. for 0xFFFFFFFF + 1, so ALUs use tricks like carry-lookahead or carry-select to keep the worst case gate-delay latency within one cycle.
So ADD involves more gates than a simple bitwise operation like XOR, but still not many compared to the amount of gates involved in controlling all the decode and other control logic to get the operands to the ALU and the result written back (and potentially bypass-forwarded to later instructions that use the result right away.)
Also, a typical ALU probably doesn't have fully separate adder vs. bitwise units, so a lot of those adder gates are probably seeing their inputs change, but control signals block carry propagation. (i.e. a typical ALU implements XOR using a lot of the same gates as ADD, but with control signals gating AND gates or something to allow or block carry propagation. XOR is add-without-carry.) An integer ALU in a CPU will usually be at least an adder-subtractor, so one of the inputs comes through multiple gates, with other control signals that can make it do bitwise ops.
But there are still maybe a few fewer bit-flips when doing an XOR operation than an ADD. Partly it would depend on what the previous outputs were (of whatever computation the ALU did in the previous cycle, not the value of one of the inputs to the XOR). But with carry propagation blocked by AND gates, flipping the inputs to those gates doesn't change the outputs, so less capacitance is charged or discharged.
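As a software analogy for that "ADD is XOR plus a carry network" point, here is a minimal sketch: the loop below adds two integers using only XOR (the carry-less sum) and AND (the carry generator); if the carries are never fed back in, the same structure reduces to a plain XOR.

```cpp
#include <cstdint>

// Add using only XOR (sum-without-carry) and AND (carry generation).
// Dropping the carry feedback leaves exactly a ^ b, i.e. XOR really is
// add-without-carry.
uint32_t add_via_gates(uint32_t a, uint32_t b) {
    while (b != 0) {
        uint32_t carry = (a & b) << 1;  // positions where both bits are 1 carry left
        a ^= b;                         // bitwise sum ignoring carries
        b = carry;                      // feed the carries back in as a new addend
    }
    return a;
}
```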
In a high-performance CPU, a lot of power is spent on pipelining and out-of-order exec, tracking instructions in flight, and writing back the results. So even the whole ALU ADD operation is a pretty minor component of total energy cost to execute the instruction. Small differences in that power due to operands are an even smaller difference. Pretty much negligible compared to how many gates flip every clock cycle just to get data and control signals sent to the right place.
Another tiny effect: if your CPU didn't do register renaming, then possibly a few fewer transistors might flip (in the register file's SRAM) when writing back the result if it's almost the same as what that register held before.
(Assuming an ISA like x86 where you do xor dst, src for dst ^= src, not a 3-operand ISA where xor dst, src1, src2 could be overwriting a different value if you didn't happen to pick the same register for dst and src1.)
If your CPU does out-of-order exec with register renaming, writes to the register file won't be overwriting the same SRAM cells as the original destination value, so it depends what other values were computed recently in registers.
If you want to see a measurable difference in power, run instructions like integer multiply, or FP mul or FMA. Or SIMD instructions, so the CPU is doing 4x or 8x 32-bit addition or shuffle in parallel. Or 8x 32-bit FMA. The max-power workload on a typical modern x86 CPU is two 256-bit FMAs per clock cycle.
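As a rough sketch of what such a max-power workload can look like (assuming AVX2 + FMA, e.g. compiled with -mfma; the function name and constants are made up for illustration):

```cpp
#include <immintrin.h>

// Enough independent 256-bit FMA dependency chains (roughly latency x throughput,
// ~8-10 on recent Intel) to keep both FMA pipes full every cycle.
__m256 fma_burn(long iters) {
    const __m256 a = _mm256_set1_ps(1.000001f);
    const __m256 b = _mm256_set1_ps(0.999999f);
    __m256 acc[8];
    for (int k = 0; k < 8; ++k)
        acc[k] = _mm256_set1_ps(1.0f + k);
    for (long i = 0; i < iters; ++i)
        for (int k = 0; k < 8; ++k)           // 8 independent dependency chains
            acc[k] = _mm256_fmadd_ps(acc[k], a, b);
    __m256 sum = acc[0];
    for (int k = 1; k < 8; ++k)               // combine so the work isn't optimized away
        sum = _mm256_add_ps(sum, acc[k]);
    return sum;
}
```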
See also:
Do sse instructions consume more power/energy? - Mysticial's answer is excellent, and discusses the race-to-sleep benefit of doing the same work faster and with fewer instructions, even if each one costs somewhat more power.
Why does the CPU get hotter when performing heavier calculations, compared to being idle?
How do I achieve the theoretical maximum of 4 FLOPs per cycle?

Power consumption estimation from number of FLOPS (floating point operations)?

I have extracted how many FLOPs (floating-point operations) each of my algorithms consumes.
I wonder, if I implement these algorithms on an FPGA or on a CPU, whether I can predict (at least roughly) how much power is going to be consumed?
Power estimates for either a CPU or an ASIC/FPGA would work for me. I am looking for something like a formula. I have this journal paper, for Intel CPUs. It gives power consumption per instruction (not only floating-point operations but all those addressing, control, etc. instructions), so I need something more general that gives power based on FLOPS rather than the number of instructions of the code on a particular processor.
Re CPU: It's not really possible with modern architectures. Let's assume your program is running on bare metal (i.e. avoiding the complexities of modern OSes, other applications, interrupt processing, optimizing compilers, etc.). A modern processor will run circuitry that isn't being used at a reduced power level. There are also hardware power-conservation states such as P (Power) and C (Sleep) states that are instruction-independent and will vary your power consumption even with the same instruction sequence. Even if we assume your app is CPU-bound (meaning there are no periods long enough to allow the processor to drop into hardware power-saving states), we can't predict power usage except at a gross statistical level. Instruction streams are pipelined, executed out of order, fused, etc. And this doesn't even include the memory hierarchy, etc.
FPGA: Oh, heck. My experience with FPGAs is so old that I really don't want to say from when. All I can say is that way back, when huge monsters roamed the earth, you could estimate power usage since you knew the circuit design and the power consumption of on and off gates. I can't imagine that there aren't modern power-conservation technologies built into modern FPGAs. Even so, what little literature I scanned implies that a lot of FPGA power technology is based upon a priori analysis and optimization. See Design techniques for FPGA power optimization, and 40-nm FPGA Power Management and Advantages. (I just did a quick search and scan of the papers, by the way, so don't pay too much attention to my conclusion.)

Double-precision operations: 32-bit vs 64-bit machines

Why don't we see twice the performance when executing 64-bit operations (e.g. double-precision operations) on a 64-bit machine, compared to executing them on a 32-bit machine?
In a 32-bit machine, don't we need to fetch from memory twice as much? More importantly, don't we need twice as many cycles to execute a 64-bit operation?
“64-bit machine” is an ambiguous term but usually means that the processor's General-Purpose Registers are 64-bit wide. Compare 8086 and 8088, which have the same instruction set and can both be called 16-bit processors in this sense.
When the phrase is used in this sense, it says nothing about the width of the memory bus, the width of the internal buses inside the CPU, or the ability of the ALU to operate efficiently on 32- or 64-bit-wide data.
Your question also assumes that the hardest part of a multiplication is moving the operands to the unit that takes care of multiplication inside the processor, which wouldn't be quite true even if the operands came from memory and the bus was 32 bits wide, because latency != throughput. Also, regarding the mathematics of floating-point multiplication, a 64-bit multiplication is not twice as hard as a 32-bit one, it is roughly (53/24)² times as hard (but, again, the transistors can be there to compute the double-precision multiplication efficiently regardless of the width of the General-Purpose Registers).
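(For concreteness, (53/24)² ≈ 4.9, so a double-precision significand multiply involves roughly five times the partial-product work of a single-precision one.)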
In a 32-bit machine, don't we need to fetch from memory twice as much?
No. In most modern CPUs, memory bus width is at least 64 bits, and newer microarchitectures may have wider buses. Quad-channel memory will have at least a 256-bit bus, and many contemporary CPUs even support 6- or 8-channel memory. So you need only one fetch to get a double. Besides, most of the time the value is already in cache, so loading it won't take much time; CPUs don't load a single value but the whole cache line each time.
More importantly, don't we need twice as many cycles to execute a 64-bit operation?
First you should know that the actual number of significant bits in double is 53, which is more than twice the number in float (24 significant bits), so it's not just "twice as much" harder. When the number of bits is doubled, addition and subtraction become twice as hard while multiplication becomes about 4 times as hard, and many other, more complex operations need even more effort.
But despite that harder math work, non-memory operations on float and double usually take the same time on most modern architectures, because both are done in the same set of registers with the same ALU/FPU. Those powerful FPUs can add two doubles in a single cycle, so even if you could add two floats faster, it would still consume one cycle. In the old Intel x87 the internal registers are 80 bits wide and both single and double precision must be extended to 80 bits, hence their performance is also the same; there's no way to do x87 math in types narrower than 80-bit extended precision.
With SIMD support like SSE2/AVX/AVX-512 you'll be able to process 2/4/8 doubles at a time (or even more in other SIMD ISAs), so you can see that adding two doubles like that is only a tiny task for modern FPUs. However, with SIMD we can fit twice as many floats as doubles in a register, so float operations will be faster if you need to do a lot of math in parallel: in a cycle where you can work on 4 doubles at a time, you'll be able to do the same on 8 floats.
Another case where float is faster than double is when you work on a huge array, because more floats fit in the same cache line than doubles. As a result, using double incurs more cache misses when you traverse the array.
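A minimal sketch of that width difference, assuming AVX is available (the function names are made up, and the loops ignore any leftover tail elements):

```cpp
#include <immintrin.h>
#include <cstddef>

// One 256-bit vector add handles 8 floats per instruction...
void add_floats(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i + 8 <= n; i += 8)
        _mm256_storeu_ps(out + i,
                         _mm256_add_ps(_mm256_loadu_ps(a + i), _mm256_loadu_ps(b + i)));
}

// ...but only 4 doubles, and each cache line holds half as many elements.
void add_doubles(const double* a, const double* b, double* out, std::size_t n) {
    for (std::size_t i = 0; i + 4 <= n; i += 4)
        _mm256_storeu_pd(out + i,
                         _mm256_add_pd(_mm256_loadu_pd(a + i), _mm256_loadu_pd(b + i)));
}
```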

hyperthreading and turbo boost in matrix multiply - worse performance using hyper threading

I am tuning my GEMM code and comparing it with Eigen and MKL. I have a system with four physical cores. Until now I have used the default number of threads from OpenMP (eight on my system). I assumed this would be at least as good as four threads. However, I discovered today that if I run Eigen and my own GEMM code on a large dense matrix (1000x1000), I get better performance using four threads instead of eight. The efficiency jumped from 45% to 65%. I think this can also be seen in this plot
https://plafrim.bordeaux.inria.fr/doku.php?id=people:guenneba
The difference is quite substantial. However, the performance is much less stable: it jumps around quite a bit each iteration, both with Eigen and my own GEMM code. I'm surprised that hyperthreading makes the performance so much worse. I guess this is not really a question. It's an unexpected observation which I'm hoping to get feedback on.
I see that not using hyper threading is also suggested here.
How to speed up Eigen library's matrix product?
I do have a question regarding measuring max performance. What I do now is run CPU-Z and look at the frequency as I'm running my GEMM code, and then use that number in my code (4.3 GHz on one overclocked system I use). Can I trust this number for all threads? How do I know the frequency per thread to determine the maximum? How do I properly account for turbo boost?
The purpose of hyperthreading is to improve CPU utilization for code exhibiting high latency. Hyperthreading masks this latency by processing two threads at once, thus having more instruction-level parallelism.
However, a well-written matrix product kernel exhibits excellent instruction-level parallelism and thus exploits nearly 100% of the CPU resources. Therefore there is no room for a second "hyper" thread, and the overhead of its management can only decrease the overall performance.
Unless I've missed something (always possible), your CPU has one clock shared by all its components, so if you measure its rate at 4.3 GHz (or whatever) then that's the rate of all the components for which it makes sense to figure out a rate. Imagine the chaos if this were not so, some cores running at one rate, others at another rate; the shared components (e.g. memory access) would become unmanageable.
As to hyperthreading actually worsening the performance of your matrix multiplication, I'm not surprised. After all, hyperthreading is a poor person's parallelisation technique, duplicating instruction pipelines but not functional units. Once you've got your code screaming along, pushing your n*10^6 contiguous memory locations through the FPUs, a context switch in response to a pipeline stall isn't going to help much. At best the other pipeline will scream along for a while before another context switch robs you of useful clock cycles; at worst all the careful arrangement of data in the memory hierarchy will be horribly mangled at each switch.
Hyperthreading is designed not for parallel numeric computational speed but for improving the performance of a much more general workload; we use general-purpose CPUs in high-performance computing not because we want hyperthreading but because all the specialist parallel numeric CPUs have gone the way of all flesh.
As a provider of multithreaded concurrency services, I have explored how hyperthreading affects performance under a variety of conditions. I have found that with software that limits its own high-utilization threads to no more than the number of physical processors available, the presence or absence of HT makes very little difference. Software that attempts to use more threads than that for heavy computational work is likely unaware that it is doing so, relying merely on the total processor count (which doubles under HT), and predictably runs more slowly. Perhaps the largest benefit that enabling HT may provide is that you can max out all physical processors without bringing the rest of the system to a crawl. Without HT, software often has to leave one CPU free to keep the host system running normally. Hyperthreads are just more switchable threads; they are not additional processors.
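As a minimal sketch of that "no more threads than physical cores" point for the GEMM case, assuming OpenMP on the asker's four-core machine (Eigen also exposes Eigen::setNbThreads for the same purpose):

```cpp
#include <omp.h>

int main() {
    // Use one thread per physical core (4 on the asker's machine), not one per
    // hardware thread (8 with hyperthreading), for a compute-bound GEMM.
    omp_set_num_threads(4);

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        // ... run this thread's share of the blocked matrix multiply ...
        (void)tid;
    }
    return 0;
}
```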

How to evaluate CUDA performance?

I programmed a CUDA kernel on my own.
Compared to the CPU code, my kernel code is 10 times faster.
But I have questions about my experiments.
Is my program fully optimized: using all GPU cores, proper shared memory use, an adequate register count, enough occupancy?
How can I evaluate my kernel code's performance?
How can I calculate CUDA's maximum throughput theoretically?
Am I right that comparing the CPU's GFLOPS and the GPU's GFLOPS rates is a transparent way to compare their theoretical performance?
Thanks in advance.
Is my program fully optimized: using all GPU cores, proper shared memory use, an adequate register count, enough occupancy?
To find this out, you use one of the CUDA profilers. See How Do You Profile & Optimize CUDA Kernels?
How can I calculate CUDA's maximum throughput theoretically?
That math is slightly involved, different for each architecture, and easy to get wrong. Better to look the numbers up in the specs for your chip. There are tables on Wikipedia, such as this one for the GTX 500 cards. For instance, you can see from the table that a GTX 580 has a theoretical peak bandwidth of 192.4 GB/s and a compute throughput of 1581.1 GFLOPS.
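For reference, the back-of-the-envelope formula behind that table entry is CUDA cores × shader clock × 2 FLOPs per fused multiply-add; for the GTX 580 that is 512 × 1.544 GHz × 2 ≈ 1581 GFLOPS, matching the quoted number.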
Am I right that comparing the CPU's GFLOPS and the GPU's GFLOPS rates is a transparent way to compare their theoretical performance?
If I understand correctly, you are asking if the number of theoretical peak GFLOPs on a GPU can be directly compared with the corresponding number on a CPU. There are some things to consider when comparing these numbers:
Older GPUs did not support double precision (DP) floating point, only single precision (SP).
GPUs that do support DP do so with a significant performance degradation as compared to SP. The GFLOPs number I quoted above was for SP. On the other hand, numbers quoted for CPUs are often for DP, and there is less difference between the performance of SP and DP on a CPU.
CPU quotes can be for rates that are achievable only when using SIMD (single instruction, multiple data) vectorized instructions, and it is typically very hard to write algorithms that can approach the theoretical maximum (and they may have to be written in assembly). Sometimes, CPU quotes are for a combination of all computing resources available through different types of instructions, and it is often virtually impossible to write a program that can exploit them all simultaneously.
The rates quoted for GPUs assume that you have enough parallel work to saturate the GPU and that your algorithm is not bandwidth bound.
The preferred measure of performance is elapsed time. GFLOPs can be used as a comparison method but it is often difficult to compare between compilers and architectures due to differences in instruction set, compiler code generation, and method of counting FLOPs.
The best method is to time the performance of the application. For the CUDA code you should time all code that will occur per launch. This includes memory copies and synchronization.
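A minimal host-side sketch of that kind of per-launch timing with CUDA events (the buffers and the kernel are assumed to exist elsewhere; the launch itself is left as a placeholder comment):

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Times everything that happens per launch: H2D copy, kernel, D2H copy, sync.
// d_in/d_out/h_in/h_out and the kernel itself are assumed to exist elsewhere.
float time_one_launch(float* d_in, float* d_out,
                      const float* h_in, float* h_out, std::size_t bytes) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
    // my_kernel<<<grid, block>>>(d_in, d_out, n);   // placeholder kernel launch
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);           // elapsed time in milliseconds
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```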
Nsight Visual Studio Edition and the Visual Profiler provide the most accurate measurement of each operation. Nsight Visual Studio Edition provides theoretical bandwidth and FLOPs values for each device. In addition, the Achieved FLOPs experiment can be used to capture the FLOP count for both single and double precision.

Resources