Related
I am working on some code that is meant to run on x86 in 32-bit mode. In that mode, I understand that I've got only 8 SIMD/AVX2-Registers (YMM0-7) to freely work with. However, some of my vector subroutines alone sometimes use more than that amount of registers simultainiously (meaning that they are still needed somewhere down the road - mostly not so far afterwards).
My understanding is, that compilers will export older registers to stack memory when they can't find unused registers. But how much does this impact performance ? (e.g. in cycles per export/import later on). Can I trust in stack memory mostly residing in L1-D-Cache (with 2 cycles latency in Haswell) or is there a significant performance impact of avoiding such register-to-memory (and vice versa) transfers ?
So far I wasn't able to find answers to this topic, especially since the registers keep getting bigger and bigger (1 Cacheline per register with upcoming Skylake platform). It would be nice if you could give sources in case you answer.
There is always an impact to hit memory.
A write is generally slow. However, if you only hit the L1 cache, it is close to instantaneous (nearly the same as copying a register to another.) If you hit L2 or L3, that's slower, but still very fast. If you hit actual memory, that's "dead" slow (in comparison.) So if your L1 cache is 12Kb, you can have up to 12Kb of data on your stack and still work really fast (although remember that caches are shared between your data and the code you're running; it may be 6Kb of instruction cache and 6Kb of data including stack.)
The main problem you'll hit is how much memory you're working on. If your input data is very large, that will have the largest impact. Especially if you cannot load the input data in a streaming manner for which processors are well optimized. (read X bytes from (eax), do eax + X, then repeat).
Note that if you have to write it in assembler, you'll have to do all the work that a compiler could do for you with zero errors and fully optimized. Compilers today are really good at optimizing (gcc/g++). It becomes particularly complicated when you PUSH on the stack, changing the offset of all your local variables currently on the stack (unless you use a frame pointer.)
As another detail, compilers see the stack as a structure defined from the set of all the local variables necessary in a function. So accessing the stack is very similar to accessing a structure.
When a dirty cache line is flushed (because of any reason), is the whole cache line written to the memory or CPU tracks down which words got written to and reduces the number of memory writes?
If this differs among architectures, I'm primarily interested in knowing this for Blackfin, but it would be nice to hear practices in x86, ARM, etc...
generally if you have a write buffer it flushes through the write buffer (entire cache line). Then the write buffer at some point completes the writes to ram. I have not heard of a cache that keeps track per item within a line which parts are dirty or not, that is why you have a cache line. So for the cases I have heard of the whole line goes out. Another point is that it is not uncommon for the slow memory on the back side of a cache DDR for example, is accessed through some fixed width, 32 bits at a time 64 bits at a time 128 bits at a time, or each part is at that width and there are multiple parts. That kind of thing, so to avoid a read-modify-write you want to write in complete ram width sizes. Cache lines are multiples of that, sure, and the opportunity to not do writes is there. Also if there is ecc on that ram then you need to write a whole ecc line at once to avoid a read-modify write.
You would need a dirty bit per writeable item in the cache line so that would multiply the dirty bit storage bu some amount, that may or may not have a real impact on size or cost, etc. There may or may not be an overhead on the ram side per transaction and it may be cheaper to do a multi word transaction rather than even two separate transactions, so this scheme might create a performance hit rather than boost (same problem inside the write buffer, instead of one transaction with a start address and length, now multiple transactions).
It just seems like a lot of work for something that may or may not result in a gain. If you find one that does please post it here.
I'm dusting off my cobwebby computer architecture knowledge from classes taken 15 years ago -- please be kind if I'm totally wrong.
I seem to remember that x86, MIPS and Motorola, the whole line gets written. This is because the cache line is the same as the bus width (except in very odd circumstances, such as the moldy old 386-SX line which was a 32-bit architecture with a 16-bit bus), so there's no point in trying to do word-wise optimization, the whole line is going to be written anyway.
I can't imagine any scenario in which a hardware architecture of any kind would do anything different, but I've been known to be wrong in the past.
Suppose I'm copying data between two arrays that are 1024000+1 bytes apart. Since the offset is not a multiple of word size, I'll need to do some misaligned accesses - either loads or stores (for the moment, let's forget that it's possible to avoid misaligned accesses entirely with some ORing and bit shifting). Which of misaligned loads or misaligned stores will be more expensive?
This is a hypothetical situation, so I can't just benchmark it :-) I'm more interested in what factors will lead to performance difference, if any. A pointer to some further reading would be great.
Thanks!
A misaligned write will need to read two destination words, merge in the new data, and write two words. This would be combined with an aligned read. So, 3R + 2W.
A misaligned read will need to read two source words, and merge the data (shift and bitor). This would be combined with an aligned write. So, 2R + 1W.
So, the misaligned read is a clear winner.
Of course, as you say there are more efficient ways to do this that avoid any mis-aligned operations except at the ends of the arrays.
Actually that depends greatly on the CPU you are using. On newer Intel CPUs there is no penalty for loading and storing unaligned words (at least none that you can notice). Only if you load and store 16byte or 32byte unaligned chunks you may see small performance degradation.
How much data? are we talking about two things unaligned at the ends of a large block of data (in the noise) or one item (word, etc) that is unaligned (100% of the data)?
Are you using a memcpy() to move this data, etc?
I'm more interested in what factors will lead to performance
difference, if any.
Memories, modules, chips, on die blocks, etc are usually organized with a fixed access size, at least somewhere along the way there is a fixed access size. Lets just say 64 bits wide, not an uncommon size these days. So at that layer wherever it is you can only write or read in aligned 64 bit units.
If you think about a write vs read, with a read you send out an address and that has to go to the memory and data come back, a full round trip has to happen. With a write everything you need to know to perform the write goes on the outbound path, so it is not uncommon to have a fire and forget type deal where the memory controller takes the address and data and tells the processor the write has finished even though the information has not net reached the memory. It does take time but not as long as a read (not talking about flash/proms just ram here) since a read requires both paths. So for aligned full width stuff a write CAN BE faster, some systems may wait for the data to make it all the way to the memory and then return a completion which is perhaps about the same amount of time as the read. It depends on your system though, the memory technology can make one or the other faster or slower right at the memory itself. Now the first write after nothing has been happening can do this fire and forget thing, but the second or third or fourth or 16th in a row eventually fills up a buffer somewhere along the path and the processor has to wait for the oldest one to make it all the way to the memory before the most recent one has a place in the queue. So for bursty stuff writes may be faster than reads but for large movements of data they approach each other.
Now alignment. The whole memory width will be read on a read, in this case lets say 64 bits, if you were only really interested in 8 of those bits, then somewhere between the memory and the processor the other 24 bits are discarded, where depends on the system. Writes that are not a whole, aligned, size of the memory mean that you have to read the width of the memory, lets say 64 bits, modify the new bits, say 8 bits, then write the whole 64 bits back. A read-modify-write. A read only needs a read a write needs a read-modify-write, the farther away from the memory requiring the read modify write the longer it takes the slower it is, no matter what the read-modify-write cant be any faster than the read alone so the read will be faster, the trimming of bits off the read generally wont take any time so reading a byte compared to reading 16 bits or 32 or 64 bits from the same location so long as the busses and destination are that width all the way, take the same time from the same location, in general, or should.
Unaligned simply multiplies the problem. Say worst case if you want to read 16 bits such that 8 bits are in one 64 bit location and the other 8 in the next 64 bit location, you need to read 128 bits to satisfy that 16 bit read. How that exactly happens and how much of a penalty is dependent on your system. some busses set up the transfer X number of clocks but the data is one clock per bus width after that so a 128 bit read might be only one clock longer (than the dozens to hundreds) of clocks it takes to read 64, or worst case it could take twice as long in order to get the 128 bits needed for this 16 bit read. A write, is a read-modify-write so take the read time, then modify the two 64 bit items, then write them back, same deal could be X+1 clocks in each direction or could be as bad as 2X number of clocks in each direction.
Caches help and hurt. A nice thing about using caches is that you can smooth out the transfers to the slow memory, you can let the cache worry about making sure all memory accesses are aligned and all writes are whole 64 bit writes, etc. How that happens though is the cache will perform same or larger sized reads. So reading 8 bits may result in one or many 64 bit reads of the slow memory, for the first byte, if you perform a second read right after that of the next byte location and if that location is in the same cache line then it doesnt go out to slow memory, it reads from the cache, much faster. and so on until you cross over into another cache boundary or other reads cause that cache line to be evicted. If the location being written is in cache then the read-modify-write happens in the cache, if not in cache then it depends on the system, a write doesnt necessarily mean the read modify write causes a cache line fill, it could happen on the back side as of the cache were not there. Now if you modified one byte in the cache line, now that line has to be written back it simply cannot be discarded so you have a one to few widths of the memory to write back as a result. your modification was fast but eventually the write happens to the slow memory and that affects the overall performance.
You could have situations where you do a (byte) read, the cache line if bigger than the external memory width can make that read slower than if the cache wasnt there, but then you do a byte write to some item in that cache line and that is fast since it is in the cache. So you might have experiments that happen to show writes are faster.
A painful case would be reading say 16 bits unaligned such that not only do they cross over a 64 bit memory width boundary but the cross over a cache line boundary, such that two cache lines have to be read, instead of reading 128 bits that might mean 256 or 512 or 1024 bits have to be read just to get your 16.
The memory sticks on your computer for example are actually multiple memories, say maybe 8 8 bit wide to make a 64 bit overall width or 16 4 bit wide to make an overall 64 bit width, etc. That doesnt mean you can isolate writes on one lane, but maybe, I dont know those modules very well but there are systems where you can/could do this, but those systems I would consider to be 8 or 4 bit wide as far as the smallest addressable size not 64 bit as far as this discussion goes. ECC makes things worse though. First you need an extra memory chip or more, basically more width 72 bits to support 64 for example. You must do full writes with ECC as the whole 72 bits lets say has to be self checking so you cant do fractions. if there is a correctable (single bit) error the read suffers no real penalty it gets the corrected 64 bits (somewhere in the path where this checking happens). Ideally you want a system to write back that corrected value but that is not how all systems work so a read could turn into a read modify write, aligned or not. The primary penalty is if you were able to do fractional writes you cant now with ECC has to be whole width writes.
Now to my question, lets say you use memcpy to move this data, many C libraries are tuned to do aligned transfers, at least where possible, if the source and destination are unaligned in a different way that can be bad, you might want to manage part of the copy yourself. say they are unaligned in the same way, the memcpy will try to copy the unaligned bytes first until it gets to an aligned boundary, then it shifts into high gear, copying aligned blocks until it gets near the end, it downshifts and copies the last few bytes if any, in an unaligned fashion. so if this memory copy you are talking about is thousands of bytes and the only unaligned stuff is near the ends then yes it will cost you some extra reads as much as two extra cache line fills, but that may be in the noise. Even on smaller sizes even if aligned on say 32 bit boundaries if you are not moving whole cache lines or whole memory widths there may still be an extra cache line involved, aligned or not you might only suffer an extra cache lines worth of reading and later writing...
The pure traditional, non-cached memory view of this, all other things held constant is as Doug wrote. Unaligned reads across one of these boundaries, like the 16 bits across two 64 bit words, costs you an extra read 2R vs 1R. A similar write costs you 2R+2W vs 1W, much more expensive. Caches and other things just complicate the problem greatly making the answer "it depends"...You need to know your system pretty well and what other stuff is going on around it, if any. Caches help and hurt, with any cache a test can be crafted to show the cache makes things slower and with the same system a test can be written to show the cache makes things faster.
Further reading would be go look at the databooks/sheets technical reference manuals or whatever the vendor calls their docs for various things. for ARM get the AXI/AMBA documentation on their busses, get the cache documentation for their cache (PL310 for example). Information on ddr memory the individual chips used in the modules you plug into your computer are all out there, lots of timing diagrams, etc. (note just because you think you are buying gigahertz memory, you are not, dram has not gotten faster in like 10 years or more, it is pretty slow around 133Mhz, it is just that the bus is faster and can queue more transfers, it still takes hundreds to thousands of processor cycles for a ddr memory cycle, read one byte that misses all the caches and you processor waits an eternity). so memory interfaces on the processors and docs on various memories, etc may help, along with text books on caches in general, etc.
In implementing most algorithms (sort, search, graph traversal, etc.), there is frequently a trade-off that can be made in reducing memory accesses at the cost of additional ordinary operations.
Knuth has a useful method for comparing the complexity of various algorithm implementations by abstracting it from particular processors and only distinguishing between ordinary operations (oops) and memory operations (mems).
In compiled programs, one typically lets the compiler organise the low level operations, and hopes that the operating system will handle the question of whether data is held in cache memory (faster) or in virtual memory (slower). Furthermore, the exact number / cost of instructions is encapsulated by the compiler.
With Forth, there is no longer such encapsulation, and one is much closer to the machine, albeit perhaps to a stack machine running on top of a register processor.
Ignoring the effect of an operating system (so no memory stalls, etc.), and assuming for the moment a simple processor,
(1) Can anyone advise on how the ordinary stack operations in Forth (e.g. dup, rot, over, swap, etc.) compare with the cost of Forth's memory access fetch (#) or store (!) ?
(2) Is there a rule of thumb I can use to decide how many ordinary operations to trade-off against saving a memory access?
What I'm looking for is something like 'memory access costs as much as 50 ordinary ops, or 500 ordinary ops, or 5 ordinary ops' Ballpark is absolutely fine.
I'm trying to get a sense of the relative expense of fetch and store vs. rot, swap, dup, drop, over, correct to an order of magnitude.
This article How much time does it take to fetch one word from memory? talks about main memory stall times, with some rule of thumb type numbers, but basically you can do lots of instructions while stalling for main memory. As others have said, the numbers vary a lot between systems.
Main memory stalls is a big area of interest, especially as CPUs have more cores, but typically not much faster memory bandwidth. There is some research going on around compressing data in main memory too, so that the CPU can take advantage of 'spare' cycles and tightly packed cache lines http://oai.cwi.nl/oai/asset/15564/15564B.pdf
For those who are really interested in the details, most CPU manufacturers publish in depth guides on memory optimisations etc. mostly aimed at high end and compiler writers, but readable by all 2gl and 3gl programmers.
Ps. Go Forth.
A comparison between memory fetches and register operations is okay for assembler programs, as it is for the output of c-compilers, which is in fact an assembler program.
In Forth this question hardly makes sense. In the first place Forth is an interpreter and in using Forth one foregoes the ultimate in speed. Of course one could add an optimiser on top of Forth but then the question makes even less sense, because the output of a c-optimiser and a Forth optimiser converge to -- you guessed it -- an optimal solution.
Let's look at an elementary operation in Forth like AND.
This is implemented as
> CODE AND
> POP AX
> POP BX
> AND AX, BX
> PUSH AX
> NEXT
So we see already three memory operations for something that looks like an elementary calculation operation. It appears the Knuth metric is not applicable. Also Forth seems to be loosing big time.That is however not true. Those memory operations are all onto the L1 cache of a typical processor. That is about as efficient as local variables in small c functions,
We can compare stack operations with memory operations using VARIABLE's and the stack. The answer is simple. A VARIABLE risks a memory stall. A stack operation will almost certainly be a L1 cache hit. This is the single most important point of consideration. However the question explicitly asks not to consider it!
So there.
I have a basic question about assembly.
Why do we bother doing arithmetic operations only on registers if they can work on memory as well?
For example both of the following cause (essentially) the same value to be calculated as an answer:
Snippet 1
.data
var dd 00000400h
.code
Start:
add var,0000000Bh
mov eax,var
;breakpoint: var = 00000B04
End Start
Snippet 2
.code
Start:
mov eax,00000400h
add eax,0000000bh
;breakpoint: eax = 0000040B
End Start
From what I can see most texts and tutorials do arithmetic operations mostly on registers. Is it just faster to work with registers?
If you look at computer architectures, you find a series of levels of memory. Those that are close to the CPU are the fast, expensive (per a bit), and therefore small, while at the other end you have big, slow and cheap memory devices. In a modern computer, these are typically something like:
CPU registers (slightly complicated, but in the order of 1KB per a core - there
are different types of registers. You might have 16 64 bit
general purpose registers plus a bunch of registers for special
purposes)
L1 cache (64KB per core)
L2 cache (256KB per core)
L3 cache (8MB)
Main memory (8GB)
HDD (1TB)
The internet (big)
Over time, more and more levels of cache have been added - I can remember a time when CPUs didn't have any onboard caches, and I'm not even old! These days, HDDs come with onboard caches, and the internet is cached in any number of places: in memory, on the HDD, and maybe on caching proxy servers.
There is a dramatic (often orders of magnitude) decrease in bandwidth and increase in latency in each step away from the CPU. For example, a HDD might be able to be read at 100MB/s with a latency of 5ms (these numbers may not be exactly correct), while your main memory can read at 6.4GB/s with a latency of 9ns (six orders of magnitude!). Latency is a very important factor, as you don't want to keep the CPU waiting any longer than it has to (this is especially true for architectures with deep pipelines, but that's a discussion for another day).
The idea is that you will often be reusing the same data over and over again, so it makes sense to put it in a small fast cache for subsequent operations. This is referred to as temporal locality. Another important principle of locality is spatial locality, which says that memory locations near each other will likely be read at about the same time. It is for this reason that reading from RAM will cause a much larger block of RAM to be read and put into on-CPU cache. If it wasn't for these principles of locality, then any location in memory would have an equally likely chance of being read at any one time, so there would be no way to predict what will be accessed next, and all the levels of cache in the world will not improve speed. You might as well just use a hard drive, but I'm sure you know what it's like to have the computer come to a grinding halt when paging (which is basically using the HDD as an extension to RAM). It is conceptually possible to have no memory except for a hard drive (and many small devices have a single memory), but this would be painfully slow compared to what we're familiar with.
One other advantage of having registers (and only a small number of registers) is that it lets you have shorter instructions. If you have instructions that contain two (or more) 64 bit addresses, you are going to have some long instructions!
Because RAM is slow. Very slow.
Registers are placed inside the CPU, right next to the ALU so signals can travel almost instantly. They're also the fastest memory type but they take significant space so we can have only a limited number of them. Increasing the number of registers increases
die size
distance needed for signals to travel
work to save the context when switching between threads
number of bits in the instruction encoding
Read If registers are so blazingly fast, why don't we have more of them?
More commonly used data will be placed in caches for faster accessing. In the past caches are very expensive so they're an optional part and can be purchased separately and plug into a socket outside the CPU. Nowadays they're often in the same die with the CPUs. Caches are constructed from SRAM cells which are smaller than register cells but maybe tens or hundreds of times slower.
Main memory will be made from DRAM which needs only one transistor per cell but are thousands of times slower than registers, hence we can't work with only DRAM in a high-performance system. However some embedded system do make use of register file so registers are also main memory
More information: Can we have a computer with just registers as memory?
Registers are much faster and also the operations that you can perform directly on memory are far more limited.
In real, there are tiny implementations that does not separate registers from memory. They can expose it, for example, in the way they have 512 bytes of RAM, and first 64 of them are exposed as 32 16-bit registers and in the same time accessible as addressable RAM. Or, another example, MosTek 6502 "zero page" (RAM range 0-255, accessed used 1-byte address) was a poor substitution for registers, due to small amount of real registers in CPU. But, this is poorly scalable to larger setups.
The advantage of registers are following:
They are the most fast. They are faster in a typical modern system than any cache, more so than DRAM. (In the example above, RAM is likely SRAM. But SRAM of a few gigabytes is unusably expensive.) And, they are close to processor. Difference of time between register access and DRAM access can reach values like 200 or even 1000. Even compared to L1 cache, register access is typically 2-4 times faster.
Their amount is limited. A typical instruction set will become too bloated if any memory location is addressed explicitly.
Registers are specific to each CPU (core, hardware thread, hart) separately. (In systems where fixed RAM addresses serve role of special registers, as e.g. zSeries does, this needs special remapping of such service area in absolute addresses, separate for each core.)
In the same manner as (3), registers are specific to each process thread without a need to adjust locations in code for a thread.
Registers (relatively easily) allow specific optimizations, as register renaming. This is too complex if memory addresses are used.
Additionally, there are registers that could not be implemented in separate block RAM because access to RAM needs their change. I mean the "execution phase" register in the simplest CPU designs, which takes values like "instruction extracting phase", "instruction decoding phase", "ALU phase", "data writing phase" and so on, and this register equivalents in more complicated (pipeline, out-of-order) designs; also different buffer registers on bus access, and so on. But, such registers are not visible to programmer, so you did likely not mean them.
x86, like pretty much every other "normal" CPU you might learn assembly for, is a register machine1. There are other ways to design something that you can program (e.g. a Turing machine that moves along a logical "tape" in memory, or the Game of Life), but register machines have proven to be basically the only way to go for high-performance.
https://www.realworldtech.com/architecture-basics/2/ covers possible alternatives like accumulator or stack machines which are also obsolete now. Although it omits CISCs like x86 which can be either load-store or register-memory. x86 instructions can actually be reg,mem; reg,reg; or even mem,reg. (Or with an immediate source.)
Footnote 1: The abstract model of computation called a register machine doesn't distinguish between registers and memory; what it calls registers are more like memory in real computers. I say "register machine" here to mean a machine with multiple general-purpose registers, as opposed to just one accumulator, or a stack machine or whatever. Most x86 instructions have 2 explicit operands (but it varies), up to one of which can be memory. Even microcontrollers like 6502 that can only really do math into one accumulator register almost invariably have some other registers (e.g. for pointers or indices), unlike true toy ISAs like Marie or LMC that are extremely inefficient to program for because you need to keep storing and reloading different things into the accumulator, and can't even keep an array index or loop counter anywhere that you can use it directly.
Since x86 was designed to use registers, you can't really avoid them entirely, even if you wanted to and didn't care about performance.
Current x86 CPUs can read/write many more registers per clock cycle than memory locations.
For example, Intel Skylake can do two loads and one store from/to its 32KiB 8-way associative L1D cache per cycle (best case), but can read upwards of 10 registers per clock, and write 3 or 4 (plus EFLAGS).
Building an L1D cache with as many read/write ports as the register file would be prohibitively expensive (in transistor count/area and power usage), especially if you wanted to keep it as large as it is. It's probably just not physically possible to build something that can use memory the way x86 uses registers with the same performance.
Also, writing a register and then reading it again has essentially zero latency because the CPU detects this and forwards the result directly from the output of one execution unit to the input of another, bypassing the write-back stage. (See https://en.wikipedia.org/wiki/Classic_RISC_pipeline#Solution_A._Bypassing).
These result-forwarding connections between execution units are called the "bypass network" or "forwarding network", and it's much easier for the CPU to do this for a register design than if everything had to go into memory and back out. The CPU only has to check a 3 to 5 bit register number, instead of an 32-bit or 64-bit address, to detect cases where the output of one instruction is needed right away as the input for another operation. (And those register numbers are hard-coded into the machine-code, so they're available right away.)
As others have mentioned, 3 or 4 bits to address a register make the machine-code format much more compact than if every instruction had absolute addresses.
See also https://en.wikipedia.org/wiki/Memory_hierarchy: you can think of registers as a small fast fixed-size memory space separate from main memory, where only direct absolute addressing is supported. (You can't "index" a register: given an integer N in one register, you can't get the contents of the Nth register with one insn.)
Registers are also private to a single CPU core, so out-of-order execution can do whatever it wants with them. With memory, it has to worry about what order things become visible to other CPU cores.
Having a fixed number of registers is part of what lets CPUs do register-renaming for out-of-order execution. Having the register-number available right away when an instruction is decoded also makes this easier: there's never a read or write to a not-yet-known register.
See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) for an explanation of register renaming, and a specific example (the later edits to the question / later parts of my answer showing the speedup from unrolling with multiple accumulators to hide FMA latency even though it reuses the same architectural register repeatedly).
The store buffer with store forwarding does basically give you "memory renaming". A store/reload to a memory location is independent of earlier stores and load to that location from within this core. (Can a speculatively executed CPU branch contain opcodes that access RAM?)
Repeated function calls with a stack-args calling convention, and/or returning a value by reference, are cases where the same bytes of stack memory can be reused multiple times.
The seconds store/reload can execute even if the first store is still waiting for its inputs. (I've tested this on Skylake, but IDK if I ever posted the results in an answer anywhere.)
Registers are accessed way faster than RAM memory, since you don't have to access the "slow" memory bus!
We use registers because they are fast. Usually, they operate at CPU's speed.
Registers and CPU cache are made with different technology / fabrics and
they are expensive. RAM on the other hand is cheap and 100 times slower.
Generally speaking register arithmetic is much faster and much preferred. However there are some cases where the direct memory arithmetic is useful.
If all you want to do is increment a number in memory (and nothing else at least for a few million instructions) then a single direct memory arithmetic instruction is usually slightly faster than load/add/store.
Also if you are doing complex array operations you generally need a lot of registers to keep track of where you are and where your arrays end. On older architectures you could run out of register really quickly so the option of adding two bits of memory together without zapping any of your current registers was really useful.
Yes, it's much much much faster to use registers. Even if you only consider the physical distance from processor to register compared to proc to memory, you save a lot of time by not sending electrons so far, and that means you can run at a higher clock rate.
Yes - also you can typically push/pop registers easily for calling procedures, handling interrupts, etc
It's just that the instruction set will not allow you to do such complex operations:
add [0x40001234],[0x40002234]
You have to go through the registers.