Are there CPUs doing speculative execution that virtualize memory locations?

Consider the classical reuse of a register after an expensive computation, in pseudo assembly:
r2 = cos(r1)
*(r3) = r2
r2 = r5 + r6
*(r4) = r2
To be able to use the arithmetic units fully, the execution unit might do:
r2 = cos(r1)
*(r3) = r2
and in parallel:
r2bis = r5 + r6
*(r4) = r2bis
where r2bis is the virtualized (or renamed) r2 register.
Now imagine we are working on a register-poor CPU (or we have many registers but they are all in use already) and put the data in some temporary stack location:
*(sp+C) = cos(r1)
*(r3) = *(sp+C)
*(sp+C) = r5 + r6
*(r4) = *(sp+C)
Are there cases where the memory location whose address is known (as (sp+C) can be computed already) is virtualized by the execution unit to allow the same two executions to proceed in parallel?
That case may seem very silly, as the compiler could be tasked with finding another location in the not-so-constrained stack space (unlike the very constrained register space). But other cases may not be so silly: virtualized memory could allow speculative execution of a conditional branch that has to store short-term data in memory. This is especially important for languages where there is no easy way to keep object fields in registers, like Java in all but the most simple cases: you have to rule out "reference" (pointer) escape to avoid the dynamic allocation and turn the Java class instance into the equivalent of a C++ automatic instance (which can be stack allocated or kept in registers). (And even C++ has difficulty eliding a real this pointer in apparently simple uses of simple flat classes.)
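To make the pattern concrete, here is a C++-level sketch of the situation described above; the function and variable names are invented for illustration, and whether a given core can overlap the two chains despite the reused stack slot is exactly what the question asks.

#include <cmath>

double sink_a, sink_b;            // stand-ins for *(r3) and *(r4)

void spill_reuse(double x, double a, double b)
{
    double spill;                 // one stack slot, reused for both temporaries

    spill = std::cos(x);          // expensive computation (first chain)
    sink_a = spill;

    spill = a + b;                // cheap and independent (second chain)
    sink_b = spill;               // a core that renames the slot (like it renames r2 above)
                                  // could run this chain without waiting for cos()
}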

Related

Does a process switch affect std::atomic compare and exchange in arm9 processor?

I am new to std::atomic in C++ and trying to understand the implementation of compare-and-exchange operations on ARM processors. I am using gcc on Linux.
When I look at the generated assembly code:
mcr p15, 0, r0, c7, c10, 5    @ memory barrier (CP15 DMB on ARMv6)
.L41:
ldrexb r3, [r2]               @ load byte and mark the location exclusive
cmp r3, r1                    @ does it match the expected value?
bne .L42                      @ no: fail without storing
strexb ip, r0, [r2]           @ try to store; ip = 0 on success
cmp ip, #0
bne .L41                      @ exclusive access was lost: retry
.L42:
mcr p15, 0, r0, c7, c10, 5    @ memory barrier
My understanding is:
It takes multiple instructions to do the compare and exchange.
ldrex marks the memory location as exclusive and reads the data.
strex stores the data and clears the exclusive flag for that location.
My questions are:
Does ldrex mark the virtual address as exclusive, or the physical address?
If process P1 marks the virtual address as exclusive and a process switch occurs to P2, will that virtual address be accessible in P2? What will happen if P2 also executes an ldrex on the same address?
If process P1 marks the physical address as exclusive and a process switch occurs, when P1 resumes isn't there a chance that the data now resides at a different location in physical memory due to paging?
I am trying to understand this because I want to do a compare and exchange on a shared memory location accessed by multiple processes.
My C++ function looks like:
std::atomic<bool> *flag;
flag = (std::atomic<bool> *) (shm_ptr);   // shm_ptr points into a shared memory mapping
bool temp = false;
while (!std::atomic_compare_exchange_strong(flag, &temp, true))
{
    temp = false;                         // a failed CAS writes the observed value into temp
    std::this_thread::yield();
}
// update shared memory
std::atomic_store(flag, false);
Yes, it's safe to use lock-free std::atomic<T> on shared memory mapped by different processes, on all mainstream C++ implementations for ARM.
But non-lock-free atomics won't work, because different processes won't share the same table of locks.
An interrupt before the strex completes will cause it to fail. You don't have to worry about kernel code changing the page tables between ldrex and strex.
Resuming this code in the middle after an interrupt on the same or another CPU will mean the strex simply fails, because it's not executing as part of a "transaction" started by ldrex.
Atomicity is address-free on ARM, and on every normal mainstream system that implements C++11 lock-free atomics.
Everything still works if two threads / processes on different cores have the same physical page mapped to different virtual addresses. The C++11 standard explicitly recommends that implementations work this way for lock-free std::atomic<T>. (It stops short of requiring it, because then it would have to define what a process is, and functions for remapping virtual memory.)
This is nearly a duplicate of Are lock-free atomics address-free in practice?. See that for quotes from the standard and more details.
Modern computer systems ensure that their caches don't have aliasing homonym / synonym problems, because that would cause coherency problems in general, not just for atomic RMWs. Sometimes this requires cooperation from the OS kernel (e.g. page coloring if one cache index bit comes from the page number instead of just the offset-within-a-page part of the address), but in general caches behave as physical.
(Some early CPUs, like early MIPS, did sometimes use virtually-addressed L1 data caches, but that's not done on systems that can support multiple CPUs, AFAIK.)
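For completeness, here is a minimal sketch of the cross-process usage the question describes, under the assumption that std::atomic<bool> is lock-free on the target (checked at runtime). The POSIX shared-memory name "/demo_flag" and the overall structure are illustrative, not taken from the question; error handling is omitted.

#include <atomic>
#include <new>
#include <thread>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main()
{
    // Create/open a shared memory object and map it.
    int fd = shm_open("/demo_flag", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, sizeof(std::atomic<bool>));
    void *shm_ptr = mmap(nullptr, sizeof(std::atomic<bool>),
                         PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    // One process constructs the atomic in place; others just reuse the mapping.
    std::atomic<bool> *flag = new (shm_ptr) std::atomic<bool>(false);
    if (!flag->is_lock_free())
        return 1;   // a non-lock-free atomic would NOT be safe across processes

    bool expected = false;
    while (!flag->compare_exchange_strong(expected, true)) {
        expected = false;                 // a failed CAS overwrites 'expected'
        std::this_thread::yield();
    }
    // ... update the shared memory the flag protects ...
    flag->store(false, std::memory_order_release);

    munmap(shm_ptr, sizeof(std::atomic<bool>));
    close(fd);
}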

How does a CPU know if an address in RAM contains an integer, a pre-defined CPU instruction, or any other kind of data?

The reason this confuses me is that every address holds a sequence of 1's and 0's. So how does the CPU differentiate, let's say, 00000100 (an integer) from 00000100 (a CPU instruction)?
First of all, different commands have different values (opcodes). That's how the CPU knows what to do.
Still, the question remains: what is a command, and what is data?
Modern PCs use the von Neumann architecture (https://en.wikipedia.org/wiki/John_von_Neumann), where data and opcodes are stored in the same memory space. (There are architectures that separate the two, such as the Harvard architecture.)
Explaining everything in detail would be far beyond the scope of Stack Overflow; most likely the character limit per post would not be sufficient.
To answer the question in as few words as possible (anyone actually working at this level will object to the shortcuts in this explanation):
Data in memory is stored at certain addresses.
Each CPU instruction basically consists of 3 different addresses (NOT values, just addresses!):
an address saying what to do,
an address of a value,
an address of an additional value.
So, assuming an addition of 5 + 7 should be performed and you have 3 addresses available in memory, the application would store the following (I used "verbs" for the instructions):
Address | Stored Value
1       | ADD
2       | 5
3       | 7
Finally the CPU receives the instruction 1 2 3, which means ADD 5 7 (these things are order-sensitive: [command] [v1] [v2])... and now things get complicated.
The CPU will move these values (actually not the values, just the addresses of the values) into its registers and then process them. The exact registers chosen depend on the data type, data size and opcode.
In the case of the command #1 #2 #3, the CPU will first read these memory addresses, and then know that ADD 5 7 is desired.
Based on the opcode for ADD, the CPU will:
put address #2 into r1,
put address #3 into r2,
read the memory value stored at the address held in r1,
read the memory value stored at the address held in r2,
add both values,
write the result somewhere in memory,
store the address of where it put the result into r3,
store the address held in r3 into the memory address held in r1.
Note that this is simplified. Actually the CPU needs exact instructions telling it whether it is handling a value or an address. In assembly this is done by writing
eax (meaning the value stored in register eax)
[eax] (meaning the value stored in memory at the address stored in register eax)
In this simplified model the CPU cannot perform calculations directly on values stored in memory, so it is quite busy moving values from memory to registers and from registers to memory.
For example, if you have
eax = 0x2
and in memory
0x2 = 110011
and the instruction
MOV ebx, [eax]
this means: move the value currently stored at the address that is currently stored in eax into the register ebx. So finally
ebx = 110011
(This happens EVERY TIME the CPU does a single calculation: memory -> register -> memory.)
Finally, the demanding application can read its predefined memory address #2, obtaining the address #2568, and then knows that the outcome of the calculation is stored at address #2568. Reading that address will yield the value 12 (5 + 7).
This is just a tiny, tiny example of what is going on. For a more detailed introduction, refer to http://www.cs.virginia.edu/~evans/cs216/guides/x86.html
One cannot really grasp the amount of data movement and calculation done for a simple addition of two values. Doing what a CPU does (on paper) would take you several minutes just to calculate "5+7", since there is no "5" and no "7" anywhere; everything is hidden behind an address in memory, pointing to some bits, whose meaning depends on what the bits at address 0x1 are instructing...
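To show the stored-program idea above in runnable form, here is a minimal toy fetch/decode/execute loop in C++. The opcode values and the three-address instruction format are invented purely for illustration and do not correspond to any real ISA; the point is that one array holds both "commands" and data.

#include <cstdint>
#include <iostream>
#include <vector>

enum : uint32_t { OP_HALT = 0, OP_ADD = 1 };

int main()
{
    // memory[pc..pc+3]: opcode, address of operand 1, address of operand 2, address of result
    std::vector<uint32_t> memory = {
        OP_ADD, 8, 9, 10,   // cell 0: "ADD the values at cells 8 and 9, store at cell 10"
        OP_HALT, 0, 0, 0,   // cell 4: stop
        5, 7, 0,            // cells 8, 9: data; cell 10: result
    };

    uint32_t pc = 0;                       // program counter
    while (memory[pc] != OP_HALT) {        // fetch
        uint32_t op = memory[pc];          // decode
        if (op == OP_ADD)                  // execute
            memory[memory[pc + 3]] = memory[memory[pc + 1]] + memory[memory[pc + 2]];
        pc += 4;                           // default: fall through to the next instruction
    }
    std::cout << memory[10] << "\n";       // prints 12; the same cells could equally be run as code
}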
Short form: The CPU does not know what's stored there, but the instructions tell the CPU how to interpret it.
Let's have a simplified example.
If the CPU is told to add a word (let's say, a 32-bit integer) stored at location X, it fetches the content of that address and adds it.
If the program counter reaches the same location, the CPU will again fetch this word and execute it as a command.
The CPU (other than security stuff like the NX bit) is blind to whether it's data or code.
The only way data doesn't accidentally get executed as code is by carefully organizing the code to never refer to a location holding data with an instruction meant to operate on code.
When a program is started, the processor starts executing it at a predefined spot. The author of a program written in machine language will have intentionally put the beginning of their program there. From there, each instruction always ends up setting the next location the processor will execute to somewhere that holds an instruction. This continues to be the case for all of the instructions that make up the program, unless there is a serious bug in the code.
There are two main ways an instruction can set where the processor goes next: jumps/branches, or not explicitly specifying. If the instruction doesn't explicitly specify where to go next, the CPU defaults to the location directly after the current instruction. Contrast that with jumps and branches, which encode the address of the next instruction. Jumps always jump to the place specified. Branches check whether a condition is true. If it is, the CPU jumps to the encoded location. If the condition is false, it simply goes to the instruction directly after the branch.
Additionally, a machine-language program should never write data to a location that is meant for instructions, or some instruction at some future point in the program could try to run what was overwritten with data. Having that happen could cause all sorts of bad things. The data there could hold an "opcode" that doesn't match anything the processor knows how to do. Or the data there could tell the computer to do something completely unintended. Either way, you're in for a bad day. Be glad that your compiler never messes up and accidentally inserts something that does this.
Unfortunately, sometimes the programmer using the compiler messes up, and does something that tells the CPU to write data outside of the area they allocated for data. (A common way this happens in C/C++ is to allocate an array L items long, and use an index >=L when writing data.) Having data written to an area set aside for code is what buffer overflow vulnerabilities are made of. Some program may have a bug that lets a remote machine trick the program into writing data (which the remote machine sent) beyond the end of an area set aside for data, and into an area set aside for code. Then, at some later point, the processor executes that "data" (which, remember, was sent from a remote computer). If the remote computer/attacker was smart, they carefully crafted the "data" that went past the boundary to be valid instructions that do something malicious. (To give them more access, destroy data, send back sensitive data from memory, etc).
This is because an ISA must define what the valid set of instructions is and how operands are encoded: memory addresses, registers, literals.
See this for more general info on how an ISA is designed:
https://en.wikipedia.org/wiki/Instruction_set
In short, the operating system tells it where the next instruction is. In the case of x64 there is a special register called rip (instruction pointer) which holds the address of the next instruction to be executed. The CPU automatically reads the data at this address, decodes and executes it, and increments rip by the number of bytes of the instruction.
Generally, the OS can mark regions of memory (pages) as holding executable code or not. If an error or exploit tries to modify executable memory an error should occur, similarly if the CPU finds itself trying to execute non-executable memory it will/should also signal an error and terminate the program. Now you're into the wonderful world of software viruses!
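As a small illustration of that page-level marking, here is a sketch using Linux-style mmap/mprotect; the 4096-byte page size and the flag choices are assumptions for the example. The same bytes can be writable-but-not-executable or executable-but-not-writable, and the CPU faults when an access doesn't match the page's protection.

#include <sys/mman.h>
#include <cstddef>
#include <cstdio>

int main()
{
    const std::size_t page = 4096;            // assumed page size for the sketch
    char *p = static_cast<char *>(mmap(nullptr, page, PROT_READ | PROT_WRITE,
                                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));
    p[0] = 0x42;                              // fine: the page is writable, not executable

    mprotect(p, page, PROT_READ | PROT_EXEC); // now it may be fetched as code, but not written
    // p[0] = 0x43;                           // would fault (SIGSEGV): the page is read/execute only
    // ((void (*)())p)();                     // "executing data": the CPU fetches whatever bytes are
                                              //  there; the protections, not the content, decide if
                                              //  the fetch is allowed
    std::printf("%d\n", p[0]);
    munmap(p, page);
}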

Why would introducing useless MOV instructions speed up a tight loop in x86_64 assembly?

Background:
While optimizing some Pascal code with embedded assembly language, I noticed an unnecessary MOV instruction, and removed it.
To my surprise, removing the unnecessary instruction caused my program to slow down.
I found that adding arbitrary, useless MOV instructions increased performance even further.
The effect is erratic, and changes based on execution order: the same junk instructions transposed up or down by a single line produce a slowdown.
I understand that the CPU does all kinds of optimizations and streamlining, but, this seems more like black magic.
The data:
A version of my code conditionally compiles three junk operations in the middle of a loop that runs 2**20==1048576 times. (The surrounding program just calculates SHA-256 hashes).
The results on my rather old machine (Intel(R) Core(TM)2 CPU 6400 @ 2.13 GHz):
avg time (ms) with -dJUNKOPS: 1822.84 ms
avg time (ms) without: 1836.44 ms
The programs were run 25 times in a loop, with the run order changing randomly each time.
Excerpt:
{$asmmode intel}
procedure example_junkop_in_sha256;
var s1, t2 : uint32;
begin
// Here are parts of the SHA-256 algorithm, in Pascal:
// s0 {r10d} := ror(a, 2) xor ror(a, 13) xor ror(a, 22)
// s1 {r11d} := ror(e, 6) xor ror(e, 11) xor ror(e, 25)
// Here is how I translated them (side by side to show symmetry):
asm
MOV r8d, a ; MOV r9d, e
ROR r8d, 2 ; ROR r9d, 6
MOV r10d, r8d ; MOV r11d, r9d
ROR r8d, 11 {13 total} ; ROR r9d, 5 {11 total}
XOR r10d, r8d ; XOR r11d, r9d
ROR r8d, 9 {22 total} ; ROR r9d, 14 {25 total}
XOR r10d, r8d ; XOR r11d, r9d
// Here is the extraneous operation that I removed, causing a speedup
// s1 is the uint32 variable declared at the start of the Pascal code.
//
// I had cleaned up the code, so I no longer needed this variable, and
// could just leave the value sitting in the r11d register until I needed
// it again later.
//
// Since copying to RAM seemed like a waste, I removed the instruction,
// only to discover that the code ran slower without it.
{$IFDEF JUNKOPS}
MOV s1, r11d
{$ENDIF}
// The next part of the code just moves on to another part of SHA-256,
// maj { r12d } := (a and b) xor (a and c) xor (b and c)
mov r8d, a
mov r9d, b
mov r13d, r9d // Set aside a copy of b
and r9d, r8d
mov r12d, c
and r8d, r12d { a and c }
xor r9d, r8d
and r12d, r13d { c and b }
xor r12d, r9d
// Copying the calculated value to the same s1 variable is another speedup.
// As far as I can tell, it doesn't actually matter what register is copied,
// but moving this line up or down makes a huge difference.
{$IFDEF JUNKOPS}
MOV s1, r9d // after mov r12d, c
{$ENDIF}
// And here is where the two calculated values above are actually used:
// T2 {r12d} := S0 {r10d} + Maj {r12d};
ADD r12d, r10d
MOV T2, r12d
end
end;
Try it yourself:
The code is online at GitHub if you want to try it out yourself.
My questions:
Why would uselessly copying a register's contents to RAM ever increase performance?
Why would the same useless instruction provide a speedup on some lines, and a slowdown on others?
Is this behavior something that could be exploited predictably by a compiler?
The most likely cause of the speed improvement is that:
inserting a MOV shifts the subsequent instructions to different memory addresses
one of those moved instructions was an important conditional branch
that branch was being incorrectly predicted due to aliasing in the branch prediction table
moving the branch eliminated the alias and allowed the branch to be predicted correctly
Your Core2 doesn't keep a separate history record for each conditional jump. Instead it keeps a shared history of all conditional jumps. One disadvantage of global branch prediction is that the history is diluted by irrelevant information if the different conditional jumps are uncorrelated.
This little branch prediction tutorial shows how branch prediction buffers work. The cache buffer is indexed by the lower portion of the address of the branch instruction. This works well unless two important uncorrelated branches share the same lower bits. In that case, you end up with aliasing, which causes many mispredicted branches (which stall the instruction pipeline and slow down your program).
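Here is a small sketch of the kind of indexing described above; the 4096-entry table and the exact index function are assumptions for illustration, not Core 2's real predictor parameters. Two unrelated branches whose addresses differ only in the upper bits land in the same predictor slot, and shifting one of them by a few bytes (e.g. by inserting a MOV) picks a different slot.

#include <cstdint>
#include <cstdio>

constexpr uint32_t TABLE_BITS = 12;                           // assumed 4096-entry predictor table

uint32_t predictor_index(uint64_t branch_address)
{
    return (branch_address >> 2) & ((1u << TABLE_BITS) - 1);  // low address bits only
}

int main()
{
    uint64_t branch_a = 0x401234;             // two unrelated branches...
    uint64_t branch_b = 0x409234;             // ...that share the low address bits
    std::printf("%u %u\n", predictor_index(branch_a), predictor_index(branch_b));
    // Same index -> they alias: each branch's outcome pollutes the other's history.
    // Moving branch_b to, say, 0x409238 picks a different slot and removes the alias.
}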
If you want to understand how branch mispredictions affect performance, take a look at this excellent answer: https://stackoverflow.com/a/11227902/1001643
Compilers typically don't have enough information to know which branches will alias and whether those aliases will be significant. However, that information can be determined at runtime with tools such as Cachegrind and VTune.
You may want to read http://research.google.com/pubs/pub37077.html
TL;DR: randomly inserting nop instructions in programs can easily increase performance by 5% or more, and no, compilers cannot easily exploit this. It's usually a combination of branch predictor and cache behaviour, but it can just as well be e.g. a reservation station stall (even in case there are no dependency chains that are broken or obvious resource over-subscriptions whatsoever).
I believe that in modern CPUs, assembly instructions, while being the last layer visible to a programmer for providing execution instructions to a CPU, are actually several layers removed from actual execution by the CPU.
Modern CPUs are RISC/CISC hybrids that translate CISC x86 instructions into internal instructions that are more RISC in behavior. Additionally there are out-of-order execution analyzers, branch predictors, Intel's "micro-ops fusion" that try to group instructions into larger batches of simultaneous work (kind of like the VLIW/Itanium titanic). There are even cache boundaries that could make the code run faster for god-knows-why if it's bigger (maybe the cache controller slots it more intelligently, or keeps it around longer).
CISC has always had an assembly-to-microcode translation layer, but the point is that with modern CPUs things are much much much more complicated. With all the extra transistor real estate in modern semiconductor fabrication plants, CPUs can probably apply several optimization approaches in parallel and then select the one at the end that provides the best speedup. The extra instructions may be biasing the CPU to use one optimization path that is better than others.
The effect of the extra instructions probably depends on the CPU model / generation / manufacturer, and isn't likely to be predictable. Optimizing assembly language this way would require execution against many CPU architecture generations, perhaps using CPU-specific execution paths, and would only be desirable for really really important code sections, although if you're doing assembly, you probably already know that.
Preparing the cache
Move operations to memory can prepare the cache and make subsequent move operations faster. A CPU usually has two load units and one store unit. A load unit can read from memory into a register (one read per cycle), and a store unit stores from a register to memory. There are also other units that perform operations between registers. All the units work in parallel. So, on each cycle, we may do several operations at once, but no more than two loads, one store, and several register operations. Usually that is up to 4 simple operations with plain registers, up to 3 simple operations with XMM/YMM registers and 1-2 complex operations with any kind of registers. Your code has lots of operations with registers, so one dummy memory store operation is free (since there are more than 4 register operations anyway), but it prepares the memory cache for the subsequent store operation. To find out how memory stores work, please refer to the Intel 64 and IA-32 Architectures Optimization Reference Manual.
Breaking the false dependencies
Although this does not exactly apply to your case, sometimes 32-bit mov operations on a 64-bit processor (as in your case) are used to clear the upper bits (32-63) and break dependency chains.
It is well known that under x86-64, using 32-bit operands clears the upper bits of the 64-bit register. Please read the relevant section, 3.4.1.1, of the Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 1:
32-bit operands generate a 32-bit result, zero-extended to a 64-bit result in the destination general-purpose register
So the mov instructions that may seem useless at first sight clear the upper bits of the appropriate registers. What does that give us? It breaks dependency chains and allows instructions to execute in parallel and out of order, thanks to the out-of-order execution machinery implemented internally in CPUs since the Pentium Pro in 1995.
A Quote from the Intel® 64 and IA-32 Architectures Optimization Reference Manual, Section 3.5.1.8:
Code sequences that modifies partial register can experience some delay in its dependency chain, but can be avoided by using dependency breaking idioms. In processors based on Intel Core micro-architecture, a number of instructions can help clear execution dependency when software uses these instruction to clear register content to zero. Break dependencies on portions of registers between instructions by operating on 32-bit registers instead of partial registers. For moves, this can be accomplished with 32-bit moves or by using MOVZX.
Assembly/Compiler Coding Rule 37. (M impact, MH generality): Break dependencies on portions of registers between instructions by operating on 32-bit registers instead of partial registers. For moves, this can be accomplished with 32-bit moves or by using MOVZX.
MOVZX and MOV with 32-bit operands on x64 are equivalent in this respect: they both break dependency chains.
That's why your code executes faster. If there are no dependencies, the CPU can internally rename the registers, even though at first sight it may seem that the second instruction modifies a register used by the first instruction and the two could not execute in parallel. But due to register renaming they can.
Register renaming is a technique used internally by a CPU that eliminates the false data dependencies arising from the reuse of registers by successive instructions that do not have any real data dependencies between them.
Hopefully it is now clear why the seemingly useless mov instructions can make a difference.

Why "execute" located before "memory" in Instruction Set Achitecture?

I learned Processor Architecture 3 years ago.
To this day, I can't figure out why "execute" is located before "memory" in the sequence of pipeline stages.
While executing the instruction mov (%eax), %ebx, doesn't it need to access memory?
Thanks!
Let's remember classic RISC pipeline, which is usually studied: http://en.wikipedia.org/wiki/Classic_RISC_pipeline. Here are its stages:
IF = Instruction Fetch
ID = Instruction Decode
EX = Execute
MEM = Memory access
WB = Register write back
In RISC you can only have loads and stores to work with memory, and the EX stage of a memory-access instruction computes the address (take the base address from the register file, scale it or add an offset). The address is then passed to the MEM stage.
Your example, mov (%eax), %ebx is actually a load from memory without any additional computation and it can be represented even in RISC pipeline:
IF - get the instruction from instruction memory
ID - decode instruction, pass "eax" register to ALU as operand; remember "ebx" as output for WB (in control unit);
EX - compute "eax+0" in ALU and pass result to next stage MEM (as address in memory)
MEM - take address from EX stage (from ALU), go to memory and take value (this stage can take several ticks to reach memory with blocking of the pipeline). Pass value to WB
WB - take value from MEM and pass it back to register file. Control unit should set the register file into mode: "Writing"+"EBX selected"
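The ordering can also be seen in a tiny C++ model of those stages (everything below is a toy for illustration, not a real simulator): the EX step produces the effective address that the MEM step consumes, which is exactly why EX has to come before MEM.

#include <array>
#include <cstdint>
#include <iostream>

int main()
{
    std::array<uint32_t, 16> data_memory{};    // pretend data memory
    data_memory[4] = 42;

    uint32_t eax = 4, ebx = 0;

    // IF/ID: we already "decoded" a load: base register eax, offset 0, destination ebx.
    uint32_t base = eax, offset = 0;

    // EX: the ALU computes the effective address (base + offset). No memory is touched yet.
    uint32_t effective_address = base + offset;

    // MEM: only now is data memory accessed, using the address EX just produced.
    uint32_t loaded = data_memory[effective_address];

    // WB: the loaded value is written back to the destination register.
    ebx = loaded;

    std::cout << "ebx = " << ebx << "\n";      // prints 42
}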
The situation is more complex for a true CISC instruction, e.g. add (%eax), %ebx (load word T from memory at [%eax], then store T+%ebx to %ebx). This instruction needs both an address computation and an addition in the ALU, which can't easily be represented in the simplest RISC (MIPS) pipelines.
The first x86 CPU (the 8086) was not pipelined; it executed only a single instruction at any moment. But since the 80386 there has been a pipeline with 6 stages, which is more complex than in RISC. There is a presentation about its pipeline, comparing it with MIPS: http://www.academic.marist.edu/~jzbv/architecture/Projects/projects2004/INTEL%20X86%20PIPELINING.ppt
Slide 17 says:
Intel combines the mem and EX stages to avoid loads and stalls, but does create stalls for address computation
All stages in mips takes one cycle, where as Intel may take more than one for certain stages. This creates asymmetric performance
In my example, add will be executed in that combined "MEM+EX" stage for several CPU ticks, generating many stalls.
Modern x86 CPUs have a very long pipeline (16 stages is typical), and they are RISC-like CPUs internally. The decoder stages (3 stages or more) break most complex x86 instructions into a series of internal RISC-like micro-operations (sometimes up to 450 micro-operations per instruction are generated with the help of microcode; 2-3 micro-operations is more typical). For a complex ALU/MEM operation, there will be a micro-op for the address computation, then a micro-op for the memory load, and then a micro-op for the ALU action. The micro-operations have dependencies between them and are scheduled onto different execution ports.

If registers are so blazingly fast, why don't we have more of them?

In 32-bit x86, we had 8 "general purpose" registers. With 64-bit, the number doubles, but that seems independent of the 64-bit change itself.
Now, if registers are so fast (no memory access), why aren't there naturally more of them? Shouldn't CPU builders work as many registers as possible into the CPU? What is the logical restriction on why we only have the number we have?
There are many reasons you don't just have a huge number of registers:
They're highly linked to most pipeline stages. For starters, you need to track their lifetime, and forward results back to previous stages. The complexity gets intractable very quickly, and the number of wires (literally) involved grows at the same rate. It's expensive on area, which ultimately means it's expensive on power, price and performance after a certain point.
It takes up instruction encoding space. 16 registers takes up 4 bits for source and destination, and another 4 if you have 3-operand instructions (e.g ARM). That's an awful lot of instruction set encoding space taken up just to specify the register. This eventually impacts decoding, code size and again complexity.
There's better ways to achieve the same result...
These days we really do have lots of registers - they're just not explicitly programmed. We have "register renaming". While you only access a small set (8-32 registers), they're actually backed by a much larger set (e.g 64-256). The CPU then tracks the visibility of each register, and allocates them to the renamed set. For example, you can load, modify, then store to a register many times in a row, and have each of these operations actually performed independently depending on cache misses etc. In ARM:
ldr r0, [r4]
add r0, r0, #1
str r0, [r4]
ldr r0, [r5]
add r0, r0, #1
str r0, [r5]
Cortex A9 cores do register renaming, so the first load to "r0" actually goes to a renamed virtual register - let's call it "v0". The load, increment and store happen on "v0". Meanwhile, we also perform a load/modify/store to r0 again, but that'll get renamed to "v1" because this is an entirely independent sequence using r0. Let's say the load from the pointer in "r4" stalled due to a cache miss. That's ok - we don't need to wait for "r0" to be ready. Because it's renamed, we can run the next sequence with "v1" (also mapped to r0) - and perhaps that's a cache hit and we just had a huge performance win.
ldr v0, [v2]
add v0, v0, #1
str v0, [v2]
ldr v1, [v3]
add v1, v1, #1
str v1, [v3]
I think x86 is up to a gigantic number of renamed registers these days (ballpark 256). That would mean having 8 bits times 2 for every instruction just to say what the source and destination is. It would massively increase the number of wires needed across the core, and its size. So there's a sweet spot around 16-32 registers which most designers have settled for, and for out-of-order CPU designs, register renaming is the way to mitigate it.
Edit: a note on the importance of out-of-order execution and register renaming here. Once you have OOO, the number of registers doesn't matter so much, because they're just "temporary tags" and get renamed to the much larger virtual register set. You don't want the number to be too small, because it gets difficult to write small code sequences. This is a problem for x86-32, because the limited 8 registers means a lot of temporaries end up going through the stack, and the core needs extra logic to forward reads/writes to memory. If you don't have OOO, you're usually talking about a small core, in which case a large register set is a poor cost/performance trade-off.
So there's a natural sweet spot for register bank size which maxes out at about 32 architected registers for most classes of CPU. x86-32 has 8 registers and it's definitely too small. ARM went with 16 registers and it's a good compromise. 32 registers is slightly too many if anything - you end up not needing the last 10 or so.
None of this touches on the extra registers you get for SSE and other vector floating point coprocessors. Those make sense as an extra set because they run independently of the integer core, and don't grow the CPU's complexity exponentially.
We Do Have More of Them
Because almost every instruction must select 1, 2, or 3 architecturally visible registers, expanding the number of them would increase code size by several bits on each instruction and so reduce code density. It also increases the amount of context that must be saved as thread state, and partially saved in a function's activation record. These operations occur frequently. Pipeline interlocks must check a scoreboard for every register and this has quadratic time and space complexity. And perhaps the biggest reason is simply compatibility with the already-defined instruction set.
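As a rough worked example of that encoding cost (the 32-bit instruction width below is an assumption for illustration, not a statement about any particular ISA): with R architected registers each register field needs ceil(log2 R) bits, so a three-operand instruction spends 3*ceil(log2 R) bits on register numbers alone.

#include <cmath>
#include <cstdio>

int main()
{
    const int instruction_bits = 32;                        // assumed fixed-width encoding
    const int reg_counts[] = {8, 16, 32, 64, 128};
    for (int regs : reg_counts) {
        int bits_per_field = (int)std::ceil(std::log2(regs));
        int register_bits  = 3 * bits_per_field;            // 3-operand instruction
        std::printf("%3d regs: %2d bits for registers, %2d left for opcode/immediates\n",
                    regs, register_bits, instruction_bits - register_bits);
    }
}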
But it turns out, thanks to register renaming, we really do have lots of registers available, and we don't even need to save them. The CPU actually has many register sets, and it automatically switches between them as your code executes. It does this purely to get you more registers.
Example:
load r1, a # x = a
store r1, x
load r1, b # y = b
store r1, y
In an architecture that has only r0-r7, the following code may be rewritten automatically by the CPU as something like:
load r1, a
store r1, x
load r10, b
store r10, y
In this case r10 is a hidden register that is substituted for r1 temporarily. The CPU can tell that the value of r1 is never used again after the first store. This allows the first load to be delayed (even an on-chip cache hit usually takes several cycles) without requiring the delay of the second load or the second store.
They add registers all of the time, but they are often tied to special purpose instructions (e.g. SIMD, SSE2, etc) or require compiling to a specific CPU architecture, which lowers portability. Existing instructions often work on specific registers and couldn't take advantage of other registers if they were available. Legacy instruction set and all.
To add a little interesting info here: you'll notice that having 8 same-sized registers lets the opcodes line up neatly with hexadecimal notation. For example, the instruction push ax is opcode 0x50 on x86 and goes up to 0x57 for the last register, di. Then pop ax starts at 0x58 and goes up to 0x5F for pop di, completing the row of sixteen. Hexadecimal consistency is maintained with 8 registers per size.
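A tiny sketch of that pattern: the one-byte push and pop opcodes are simply a base value plus the register number, in the classic 16-bit register encoding order.

#include <cstdio>

int main()
{
    // push reg = 0x50 + register number, pop reg = 0x58 + register number
    const char *regs[8] = {"ax", "cx", "dx", "bx", "sp", "bp", "si", "di"};
    for (int i = 0; i < 8; ++i)
        std::printf("push %s = 0x%02X   pop %s = 0x%02X\n",
                    regs[i], 0x50 + i, regs[i], 0x58 + i);
}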
