Double-precision operations: 32-bit vs 64-bit machines - performance

Why don't we see twice the performance when executing a 64-bit operation (e.g. a double-precision operation) on a 64-bit machine, compared to executing it on a 32-bit machine?
On a 32-bit machine, don't we need to fetch twice as much from memory? More importantly, don't we need twice as many cycles to execute a 64-bit operation?

“64-bit machine” is an ambiguous term but usually means that the processor's General-Purpose Registers are 64-bit wide. Compare 8086 and 8088, which have the same instruction set and can both be called 16-bit processors in this sense.
When the phrase is used in this sense, it has nothing to do with the width of the memory bus, the width of the internal buses inside the CPU, or the ability of the ALU to operate efficiently on 32- or 64-bit wide data.
Your question also assumes that the hardest part of a multiplication is moving the operands to the unit that takes care of multiplication inside the processor, which wouldn't quite be true even if the operands came from memory and the bus were 32 bits wide, because latency != throughput. Also, regarding the mathematics of floating-point multiplication, a 64-bit multiplication is not twice as hard as a 32-bit one; it is roughly (53/24)² ≈ 4.9 times as hard, since the cost is dominated by multiplying the 53-bit significand rather than the 24-bit one (but, again, the transistors can be there to compute the double-precision multiplication efficiently regardless of the width of the General-Purpose Registers).

On a 32-bit machine, don't we need to fetch twice as much from memory?
No. In most modern CPUs, the memory bus is at least 64 bits wide, and newer microarchitectures may have wider buses: quad-channel memory gives at least a 256-bit bus, and many contemporary CPUs even support 6- or 8-channel memory. So you need only one fetch to get a double. Besides, most of the time the value is already in cache, so loading it won't take much time; CPUs don't load a single value but a whole cache line at a time.
More importantly, don't we need twice as many cycles to execute a 64-bit operation?
First you should know that the actual number of significant bits in a double is 53, so it's not "twice as hard": it's a bit more than twice the 24 significant bits of a float. When the number of bits is doubled, addition and subtraction become roughly twice as hard while multiplication becomes roughly four times as hard; many other, more complex operations need even more effort.
But despite that harder math, non-memory operations on float and double usually cost the same on most modern architectures, because both are done in the same set of registers with the same ALU/FPU. Those powerful FPUs can add two doubles at a rate of one per cycle, so even if two floats could be added faster, it would still take one cycle. On the old Intel x87, the internal registers are 80 bits wide and both single and double precision must be extended to 80 bits, so their performance is also the same; there is no way to do math in types narrower than 80-bit extended precision.
With SIMD support like SSE2/AVX/AVX-512 you can process 2/4/8 doubles at a time (or even more in other SIMD ISAs), so adding two doubles is only a tiny task for a modern FPU. However, with SIMD a register holds twice as many floats as doubles, so float operations will be faster if you need to do a lot of math in parallel: if you can work on 4 doubles per cycle, you can work on 8 floats per cycle.
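For illustration, here is a minimal sketch using x86 AVX intrinsics (assuming a CPU and compiler with AVX support, e.g. built with -mavx; the function names are made up for this example): one 256-bit register holds 4 doubles or 8 floats, and a single instruction adds all of them.
#include <immintrin.h>

/* One 256-bit AVX add handles 4 doubles or 8 floats at a time. */
void add4_doubles(const double *a, const double *b, double *out)
{
    __m256d va = _mm256_loadu_pd(a);              /* load 4 doubles */
    __m256d vb = _mm256_loadu_pd(b);
    _mm256_storeu_pd(out, _mm256_add_pd(va, vb)); /* 4 additions in one instruction */
}

void add8_floats(const float *a, const float *b, float *out)
{
    __m256 va = _mm256_loadu_ps(a);               /* the same register width holds 8 floats */
    __m256 vb = _mm256_loadu_ps(b);
    _mm256_storeu_ps(out, _mm256_add_ps(va, vb)); /* 8 additions in one instruction */
}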
Another case where float is faster than double is when you work on a huge array, because twice as many floats as doubles fit in a cache line. As a result, using double incurs more cache misses when you traverse the array.
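A rough way to see that effect (a hedged micro-benchmark sketch assuming the arrays are too big for the caches; the element count and the use of clock() are arbitrary choices for this example, not a rigorous measurement):
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)   /* 16M elements: 64 MB as float, 128 MB as double */

int main(void)
{
    float  *f = malloc(N * sizeof *f);
    double *d = malloc(N * sizeof *d);
    for (size_t i = 0; i < N; i++) { f[i] = 1.0f; d[i] = 1.0; }

    clock_t t0 = clock();
    float fsum = 0.0f;
    for (size_t i = 0; i < N; i++) fsum += f[i];   /* streams half as many bytes */
    clock_t t1 = clock();
    double dsum = 0.0;
    for (size_t i = 0; i < N; i++) dsum += d[i];
    clock_t t2 = clock();

    printf("float: %.0f in %ld ticks, double: %.0f in %ld ticks\n",
           (double)fsum, (long)(t1 - t0), dsum, (long)(t2 - t1));
    free(f); free(d);
    return 0;
}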

Related

In C Use 64 bit ints or 16 bit ints for performance on a 64 bit CPU?

On a 64-bit CPU with a decent-sized cache, which will lead to better performance in a C application that uses many fairly large arrays of structures of ints: using 64-bit ints so that everything is always aligned on 8-byte boundaries, which the CPU likes, or 16-bit ints so that more array elements fit in the cache?
Has anyone ever benchmarked this sort of issue?
On mainstream 64-bit processors (i.e. x86-64 and arm64), the size of integers generally does not have a significant impact on the performance of scalar instructions.
However, it is generally better to work on the smallest possible type if the code is vectorized, since SIMD instructions work on fixed-size SIMD registers (128 bits for SSE, 256 for AVX/AVX2, 512 for AVX-512, 128 for Neon), so smaller elements mean more of them per instruction. Note that mixing types of different sizes can introduce fairly expensive conversions or reduce the ability of the compiler to vectorize loops efficiently (recent mainstream optimizing compilers are relatively good at vectorizing such code, although the generated code is often not optimal).
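For example (a sketch assuming AVX2 and an auto-vectorizing compiler such as gcc -O3 -mavx2; the function names are made up here), the same element-wise loop handles four times as many 16-bit elements per vector instruction as 64-bit ones:
#include <stdint.h>
#include <stddef.h>

/* A 256-bit register holds 16 int16_t lanes but only 4 int64_t lanes,
   so the narrow version processes ~4x more elements per instruction
   and its arrays take a quarter of the cache space. */
void add_i16(int16_t *dst, const int16_t *a, const int16_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (int16_t)(a[i] + b[i]);
}

void add_i64(int64_t *dst, const int64_t *a, const int64_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}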
Regarding caches, arrays with smaller integer items can be loaded faster from the memory hierarchy. Indeed, the L1/L2 caches are generally quite small, so if an array can fit in such a cache, accesses to it will be faster (lower latency and higher throughput). The impact is particularly visible for random accesses.
Regarding alignment, it does not generally have a significant impact on x86-64 platforms, as compilers and runtimes do a good job of aligning arrays and processors are optimized to access unaligned data (even with SIMD instructions). For example, malloc/realloc returns memory addresses aligned to 16 bytes by default on most x86-64/arm64 platforms.

Fast hardware integer division

The hardware instruction for integer division has historically been very slow. For example, DIVQ on Skylake has a latency of 42-95 cycles [1] (and a reciprocal throughput of 24-90) for 64-bit inputs.
There are newer processors, however, which perform much better: Goldmont has 14-43 cycle latency and Ryzen 14-47 [1], the M1 apparently has a "throughput of 2 clock cycles per divide" [2], and even the Raspberry Pi Pico has an "8-cycle signed/unsigned divide/modulo circuit, per core" (though that seems to be for 32-bit inputs) [3].
My question is, what has changed? Was there a new algorithm invented? What algorithms do the new processors employ for division, anyway?
[1] https://www.agner.org/optimize/#manuals
[2] https://ridiculousfish.com/blog/posts/benchmarking-libdivide-m1-avx512.html
[3] https://raspberrypi.github.io/pico-sdk-doxygen/group__hardware__divider.html#details
On Intel before Ice Lake, 64-bit operand-size is an outlier, much slower than 32-bit operand size for integer division. div r32 is 10 uops, with 26 cycle worst-case latency but 6 cycle throughput. (https://uops.info/ and https://agner.org/optimize/, and Trial-division code runs 2x faster as 32-bit on Windows than 64-bit on Linux has detailed exploration.)
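That gap is why, on those CPUs, it can pay to drop to 32-bit operand size when the values happen to fit. A hedged sketch of that workaround (div_u64 is a made-up name for this example; division by zero is undefined either way):
#include <stdint.h>

/* If both operands fit in 32 bits, divide at 32-bit operand size so the
   much faster div r32 path is used on pre-Ice Lake Intel CPUs. */
uint64_t div_u64(uint64_t n, uint64_t d)
{
    if ((n | d) <= UINT32_MAX)                /* both values fit in 32 bits */
        return (uint32_t)n / (uint32_t)d;     /* compiles to a 32-bit div */
    return n / d;                             /* falls back to a 64-bit div */
}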
There wasn't a fundamental change in how divide units are built, just widening the HW divider to not need extended-precision microcode. (Intel has had fast-ish dividers for FP for much longer, and that's basically the same problem just with only 53 bits instead of 64. The hard part of FP division is integer division of the mantissas; subtracting the exponents is easy and done in parallel.)
The incremental changes are things like widening the radix to handle more bits with each step, and, for example, pipelining the refinement steps after the initial (table lookup?) value to improve throughput but not latency.
Related:
How sqrt() of GCC works after compiled? Which method of root is used? Newton-Raphson? (a brief high-level overview of the div/sqrt units that modern CPUs use, with for example a Radix-1024 divider being new in Broadwell)
Do FP and integer division compete for the same throughput resources on x86 CPUs? (No in Ice Lake and later on Intel; having a dedicated integer unit instead of using the low element of the FP mantissa divide/sqrt unit is presumably related to making it 64 bits wide.)
Divide units historically were often not pipelined at all, as that's hard because it requires replicating a lot of gates instead of iterating on the same multipliers, I think. And most software usually avoids (or avoided) integer division because it was historically very expensive, or at least does it infrequently enough not to benefit much from higher-throughput dividers with the same latency.
But with wider CPU pipelines with higher IPC shrinking the cycle gap between divisions, it's more worth doing. Also with huge transistor budgets, spending a bunch on something that will sit idle for a lot of the time in most programs still makes sense if it's very helpful for a few programs. (Like wider SIMD, and specialized execution units like x86 BMI2 pdep / pext). Dark silicon is necessary or chips would melt; power density is a huge concern, see Modern Microprocessors: A 90-Minute Guide!
Also, with more and more software being written by people who don't know anything about performance, and more code avoiding compile-time-constant divisors in favour of being flexible (function args that ultimately come from some config option), I'd guess modern software doesn't avoid division as much as older programs did.
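(Compile-time-constant divisors matter because compilers avoid the div instruction for them entirely, replacing it with a multiply by a fixed-point reciprocal plus a shift. As a hedged illustration, the sketch below reproduces by hand what GCC/Clang typically emit for unsigned x / 3 on x86-64; it relies on the __uint128_t extension to get the high 64 bits of the product.)
#include <stdint.h>

uint64_t div3(uint64_t x)        /* the compiler emits multiply + shift, no div */
{
    return x / 3;
}

uint64_t div3_by_hand(uint64_t x)
{
    const uint64_t M = 0xAAAAAAAAAAAAAAABULL;             /* ceil(2^65 / 3) */
    uint64_t hi = (uint64_t)(((__uint128_t)x * M) >> 64); /* high half of the product */
    return hi >> 1;                                       /* equals x / 3 for all 64-bit x */
}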
Floating-point division is often harder to avoid than integer, so it's definitely worth having fast FP dividers. And integer can borrow the mantissa divider from the low SIMD element, if there isn't a dedicated integer-divide unit.
So that FP motivation was likely the actual driving force behind Intel's improvements to divide throughput and latency even though they left 64-bit integer division with garbage performance until Ice Lake.

32-bit versus 64-bit floating-point performance

I have run into a curious problem. An algorithm I am working on consists of lots of computations like this:
q = x(0)*y(0)*z(0) + x(1)*y(1)*z(1) + ...
where the length of summation is between 4 and 7.
The original computations are all done using 64-bit precision. For experimentation, I tried using 32-bit precision for the x, y, z input values (so that the computations are performed in 32-bit), storing the final result as a 64-bit value (a straightforward cast).
I expected 32-bit performance to be better (cache size, SIMD width, etc.), but to my surprise there was no difference in performance, maybe even a decrease.
The architecture in question is Intel 64, Linux, and GCC. Both versions seem to use SSE, and the arrays in both cases are aligned to a 16-byte boundary.
Why would this be so? My guess so far is that 32-bit precision can use SSE only on the first four elements, with the rest being done serially, compounded by cast overhead.
On x87 at least, everything is really done in 80-bit precision internally. The precision really just determines how many of those bits are stored in memory. This is part of the reason why different optimization settings can change results slightly: They change the amount of rounding from 80-bit to 32- or 64-bit.
In practice, using 80-bit floating point (long double in C and C++, real in D) is usually slow because there's no efficient way to load and store 80 bits from memory. 32- and 64-bit are usually equally fast provided that memory bandwidth isn't the bottleneck, i.e. if everything is in cache anyhow. 64-bit can be slower if either of the following happens:
Memory bandwidth is the bottleneck.
The 64-bit numbers aren't properly aligned on 8-byte boundaries. 32-bit numbers only require 4-byte alignment for optimal efficiency, so they're less finicky. Some compilers (the Digital Mars D compiler comes to mind) don't always get this right for 64-bit doubles stored on the stack. This means twice as many memory operations are needed to load one, in practice resulting in about a 2x performance hit compared to properly aligned 64-bit floats or 32-bit floats.
As far as SIMD optimizations go, it should be noted that most compilers are horrible at auto-vectorizing code. If you don't want to write directly in assembly language, the best way to take advantage of these instructions is to use things like array-wise operations, which are available, for example, in D, and implemented in terms of SSE instructions. Similarly, in C or C++, you would probably want to use a high level library of functions that are SSE-optimized, though I don't know of a good one off the top of my head because I mostly program in D.
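For instance, in C or C++ with SSE intrinsics, the sum from the question could be sketched roughly like this (an illustration only, assuming 16-byte-aligned float arrays and a length that is a multiple of 4; triple_dot is a made-up name):
#include <xmmintrin.h>   /* SSE intrinsics */

/* q = x[0]*y[0]*z[0] + x[1]*y[1]*z[1] + ..., four float lanes per step. */
float triple_dot(const float *x, const float *y, const float *z, int n)
{
    __m128 acc = _mm_setzero_ps();
    for (int i = 0; i < n; i += 4) {
        __m128 p = _mm_mul_ps(_mm_load_ps(x + i), _mm_load_ps(y + i));
        acc = _mm_add_ps(acc, _mm_mul_ps(p, _mm_load_ps(z + i)));
    }
    float lanes[4];
    _mm_storeu_ps(lanes, acc);                 /* horizontal sum of the 4 lanes */
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}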
It's probably because your processor still does the computation at 64-bit precision internally and then trims the result. There was some CPU flag you could change, but I can't remember...
First check the ASM that gets produced. It may not be what you expect.
Also try writing it as a loop:
typedef float fp;
fp q = 0;
for (int i = 0; i < N; i++)
    q += x[i]*y[i]*z[i];
A compiler might notice the loop where it misses the unrolled form.
Lastly, your code used () rather than []. If your code is making lots of function calls (12 to 21 per expression), that will swamp the FP cost, and even removing the FP computation altogether won't make much difference. Inlining, OTOH, might.

Processor architecture

While HDDs evolve and offer more and more space in less room, why are we "sticking with" 32-bit or 64-bit processors?
Why can't there be, e.g., a 128-bit processor?
(This is not my homework; I'm just a student interested beyond the things they teach us in informatics)
Because the difference between 32-bit and 64-bit is astronomical - it's really the difference between 2^32 (a ten-digit number in the billions) and 2^64 (a twenty-digit number in the squillions :-).
64 bits will be more than enough for decades to come.
There's very little need for this; when do you deal with numbers that large? The current addressable memory space available with 64 bits is well beyond what any machine can handle for at least a few years... and beyond that it's probably more than any desktop will hold for quite a while.
Yes, desktop memory will continue to increase, but 4 billion times what it is now? That's going to take a while... sure, we'll get to 128-bit eventually, if the whole current model isn't thrown out before then, which I see as equally likely.
Also, it's worth noting that upgrading something from 32-bit to 64-bit puts you in a performance hole immediately in most scenarios (this is a major reason Visual Studio 2010 remains 32-bit only). The same will happen going from 64-bit to 128-bit. The more small objects you have, the more pointers you have, and pointers that are now twice as large mean more data to pass around to do the same thing, especially if you don't need that much addressable memory space.
When we talk about an n-bit architecture we are often conflating two rather different things:
(1) n-bit addressing, e.g. a CPU with 32-bit address registers and a 32-bit address bus can address 4 GB of physical memory
(2) size of CPU internal data paths and general purpose registers, e.g. a CPU with 32-bit internal architecture has 32-bit registers, 32-bit integer ALUs, 32-bit internal data paths, etc
In many cases (1) and (2) are the same, but there are plenty of exceptions and this may become increasingly the case, e.g. we may not need more than 64-bit addressing for the foreseeable future, but we may want more than 64 bits for registers and data paths (this is already the case with many CPUs with SIMD support).
So, in short, you need to be careful when you talk about, e.g. a "64-bit CPU" - it can mean different things in different contexts.
Cost. Also, what do you think a 128-bit architecture will get you? Memory addressing and such, but to handle it effectively you need higher-bandwidth buses and basically a new instruction set that handles it. 64 bits are more than enough for addressing (18,446,744,073,709,551,616 bytes).
HDDs still have a bit of ground to catch up to RAM and such; they're still going to be the I/O bottleneck, I think. Plus, newer chips are just adding more cores rather than making a massive change like that.
Well, I happen to be a professional computer architect (my inventions are probably in the computer you are reading this on), and although I have not yet been paid to work on any processor with more than 64 bits of address, I know some of my friends who have been.
And I have been playing around with 128 bit architectures for fun for a few decades.
I.e. it's already happening.
Actually, it has already happened to a limited extent. The HP Precision Architecture, Intel Itanium, and the higher-end versions of the IBM Power line have what I call a folded virtual memory. I have described these elsewhere, e.g. in comp.arch posts in some detail, http://groups.google.com/group/comp.arch/browse_thread/thread/53a7396f56860e17/f62404dd5782f309?lnk=gst&q=folded+virtual+memory#f62404dd5782f309
I need to create a comp-arch.net wiki post for these.
But you can get the manuals for these processors and read them yourself.
E.g. you might start with a 64 bit user virtual address.
The upper 8 bits may be used to index a region table, which returns an upper 24 bits that are concatenated with the remaining 64-8=56 bits to produce an 80-bit expanded virtual address. That is then translated by TLBs and page tables and hash lookups, as usual, to whatever your physical address is.
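A hedged sketch of that expansion step in C (the struct layout, field names, and widths are illustrative only, not any vendor's actual format):
#include <stdint.h>

typedef struct {
    uint32_t region_id;   /* upper 24 bits of the expanded address */
    uint64_t offset;      /* lower 56 bits, passed through from the user address */
} expanded_va;            /* 24 + 56 = 80 bits of expanded virtual address */

static uint32_t region_table[256];   /* indexed by the top 8 user-address bits */

expanded_va expand(uint64_t user_va)
{
    expanded_va ea;
    ea.region_id = region_table[user_va >> 56] & 0xFFFFFF;  /* 8-bit index -> 24-bit region */
    ea.offset    = user_va & ((1ULL << 56) - 1);            /* keep the low 56 bits */
    return ea;   /* then translated by TLBs/page tables to a physical address */
}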
Why go from 64->80?
One reason is shared libraries. You may want the shared libraries to stay at the same expanded virtual address in all processes, so that you can share TLB entries. But you may be required, by your language tools, to relocate them to different user virtual addresses. Folded virtual addresses allow this.
Folded virtual addresses are not true >64 bit virtual addresses usable by the user.
For that matter, there are many proposals for >64 bit pointers: e.g. I worked on one where a pointer consisted of a 64bit address, and 64 bit lower and upper bounds, and metadata, for a total of 128 bits. Bounds checking. But, although these have >64 bit pointers or capabilities, they are not truly >64 bit virtual addresses.
Linus posts about 128 bit virtual addresses at http://www.realworldtech.com/beta/forums/index.cfm?action=detail&id=103574&threadid=103545&roomid=2
I'd also like to offer a computer architect's view of why 128-bit is impractical at the moment:
Energy cost. See Bill Dally's presentations on how, today, most energy in processors is spent moving data around (dissipated in the wires). However, since the most significant bits of a 128-bit computation should change little, this problem would be somewhat mitigated.
Most arithmetic operations have a non-linear cost w.r.t. operand size:
a. A tree multiplier has space complexity n^2 w.r.t. the number of bits.
b. The delay of a hierarchical carry-lookahead adder is log(n) w.r.t. the number of bits (I think). So a 128-bit adder will be slower than a 64-bit adder. Can anyone give some hard numbers (log(n) seems very cheap)?
Few programs use 128-bit integers or quad-precision floating point, and when they do, there are efficient ways to compose them from 32- or 64-bit ops.
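For example, composing a 128-bit addition from 64-bit operations is just an add plus a carry (a minimal sketch; the u128 struct here is a made-up illustration, not a standard type):
#include <stdint.h>

typedef struct { uint64_t lo, hi; } u128;

u128 add128(u128 a, u128 b)
{
    u128 r;
    r.lo = a.lo + b.lo;
    r.hi = a.hi + b.hi + (r.lo < a.lo);   /* carry out of the low 64 bits */
    return r;
}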
The next big thing in processor architecture will be quantum computing. Instead of being just 0 or 1, a qubit can be in a superposition of 0 and 1.
This will lead to huge improvements in the performance of some algorithms (for instance, it will become much easier to crack RSA private/public keys).
Check http://en.wikipedia.org/wiki/Quantum_computer for more information and see you in 15 years ;-)
The main need for a 64-bit processor is to address more memory, and that is the driving force to switch to 64-bit. On 32-bit systems you can really only address 4 GB of RAM, at least per process. 4 GB is not much.
64 bits give you an address space of 16 exabytes (though a lot of current 64-bit hardware can address "only" 48 bits - that's still enough to support 256 terabytes of RAM).
Upping the natural integer size of a processor does not automatically make it "better", though. There are tradeoffs. With 128-bit you'd need twice as much storage (registers/RAM/caches/etc.) compared to 64-bit for common data types, with all the drawbacks that might have: more RAM needed to store data, more data to transmit = slower, wider buses might require more physical space and perhaps more power, etc.

64-bits and Memory Bandwidth

Mason asked about the advantages of a 64-bit processor.
Well, an obvious disadvantage is that you have to move more bits around. And given that memory accesses are a serious issue these days[1], moving around twice as much memory for a fair number of operations can't be a good thing.
But how bad is the effect of this, really? And what makes up for it? Or should I be running all my small apps on 32-bit machines?
I should mention that I'm considering, in particular, the case where one has a choice of running 32- or 64-bit on the same machine, so in either mode the bandwidth to main memory is the same.
[1]: And even fifteen years ago, for that matter. I remember talk as far back as that about good cache behaviour, and also particularly that the Alpha CPUs that won all the benchmarks had a giant, for the time, 8 MB of L2 cache.
Whether your app should be 64-bit depends a lot on what kind of computation it does. If you need to process very large data sets, you obviously need 64-bit pointers. If not, you need to know whether your app spends relatively more time doing arithmetic or memory accesses. On x86-64, the general purpose registers are not only twice as wide, there are twice as many and they are more "general purpose". This means that 64-bit code can have much better integer op performance. However, if your code doesn't need the extra register space, you'll probably see better performance by using smaller pointers and data, due to increased cache effectiveness. If your app is dominated by floating point operations, there probably isn't much point in making it 32-bit, because most of the memory accesses will be for wide vectors anyways, and having the extra SSE registers will help.
Most 64-bit programming environments use the "LP64" model, meaning that only pointers and long int variables (if you're a C/C++ programmer) are 64 bits. Integers (ints) remain 32 bits unless you're in the "ILP64" model, which is fairly uncommon.
I only bring it up because most int variables aren't being used for size_t-like purposes--that is, they stay within ranges comfortably held by 32 bits. For variables of that nature, you'll never be able to tell the difference.
If you're doing numerical or data-heavy work with > 4GB of data, you'll need 64 bits anyways. If you're not, you won't notice the difference, unless you're in the habit of using longs where most would use ints.
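A trivial way to check which model your environment uses (a sketch; the output in the comment is what a typical LP64 Linux/x86-64 setup would print, not a guarantee):
#include <stdio.h>

int main(void)
{
    /* Typical LP64 output: int=4 long=8 ptr=8; ILP32 would print int=4 long=4 ptr=4. */
    printf("int=%zu long=%zu ptr=%zu\n",
           sizeof(int), sizeof(long), sizeof(void *));
    return 0;
}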
I think you're starting off with a bad assumption here. You say:
moving around twice as much memory for a fair number of operations can't be a good thing
and the first question to ask is "why not?" In a true 64-bit machine, the data path is 64 bits wide, so moving 64 bits takes exactly (to a first approximation) as many cycles as moving 32 bits takes on a 32-bit machine. So, if you need to move 128 bytes, it takes half as many cycles as it would on a 32-bit machine.
