Does cache line flush write the whole line to the memory?

Does cache line flush write the whole line to the memory? - caching

When a dirty cache line is flushed (because of any reason), is the whole cache line written to the memory or CPU tracks down which words got written to and reduces the number of memory writes?
If this differs among architectures, I'm primarily interested in knowing this for Blackfin, but it would be nice to hear practices in x86, ARM, etc...

generally if you have a write buffer it flushes through the write buffer (entire cache line). Then the write buffer at some point completes the writes to ram. I have not heard of a cache that keeps track per item within a line which parts are dirty or not, that is why you have a cache line. So for the cases I have heard of the whole line goes out. Another point is that it is not uncommon for the slow memory on the back side of a cache DDR for example, is accessed through some fixed width, 32 bits at a time 64 bits at a time 128 bits at a time, or each part is at that width and there are multiple parts. That kind of thing, so to avoid a read-modify-write you want to write in complete ram width sizes. Cache lines are multiples of that, sure, and the opportunity to not do writes is there. Also if there is ecc on that ram then you need to write a whole ecc line at once to avoid a read-modify write.
You would need a dirty bit per writeable item in the cache line so that would multiply the dirty bit storage bu some amount, that may or may not have a real impact on size or cost, etc. There may or may not be an overhead on the ram side per transaction and it may be cheaper to do a multi word transaction rather than even two separate transactions, so this scheme might create a performance hit rather than boost (same problem inside the write buffer, instead of one transaction with a start address and length, now multiple transactions).
It just seems like a lot of work for something that may or may not result in a gain. If you find one that does please post it here.

I'm dusting off my cobwebby computer architecture knowledge from classes taken 15 years ago -- please be kind if I'm totally wrong.
I seem to remember that x86, MIPS and Motorola, the whole line gets written. This is because the cache line is the same as the bus width (except in very odd circumstances, such as the moldy old 386-SX line which was a 32-bit architecture with a 16-bit bus), so there's no point in trying to do word-wise optimization, the whole line is going to be written anyway.
I can't imagine any scenario in which a hardware architecture of any kind would do anything different, but I've been known to be wrong in the past.

Related

Why are far pointers slow?

I have been programming some stuff on 16-bit DOS recently for fun. I see a lot of people mentioning that far pointers are slower than near pointers and to avoid them.
Why?
From an assembly point of view, this makes sense. There are several extra instruction involved. You have to store the old value of a segment register, store the new segment into a register, then move that into CS or DS (immediate mode stores aren't a valid opcode in 8086). You can then do whatever you need to do in that segment. After, you have to restore the old value.
It sounds like a lot, but in reality this doesn't eat up a lot of cycles. I guess if every pointer you were using were in a different segment this could add up, but data is usually grouped. So unless you were bouncing all over the place, which is slow for other reasons, the penalty shouldn't be that bad. If you have to hit DRAM, that should dominate the cost, right?
I feel like there is more to this story and I am having a hard time tracking it down. Hoping an 8086 wizard is hanging around who remembers this stuff.
To clarify: I am interested in actual 16-bit processors like the 8086 and the 80286 in real mode.

Why are far pointers slow?
Segment register loads are too complex for a core's front-end to convert into micro-ops. Instead they're "emulated" by micro-ops stored in a little ROM. This begins with some branches (which CPU mode is it?), and typically those branches can't benefit from CPU's branch prediction causing stall/s.
To avoid segment register loads (e.g. when the same far pointer is used multiple times) software tends to use more segment registers (e.g. tends to use ES, FS and GS); and this adds more prefixes (segment override prefixes) to instructions. These extra prefixes can also slow down instruction decoding.
I guess if every pointer you were using were in a different segment this could add up, but data is usually grouped.
Compilers aren't that smart. If a small piece of code uses 4 far pointers that all happen to use the same segment, the compiler won't know that they're all in the same segment and will do expensive segment register loads regardless. To work around that you could describe the data as a structure (e.g. 1 pointer to a structure that has 4 fields instead of 4 different pointers); but that requires the programmer to write software differently.
For an example; if you do something like "int foo(void) { int a; return bar(&a); }" then ss will probably be passed on the stack and then the callee (bar) will load that into another segment register, because bar() has to assume that the pointer could point anywhere.
The other problem is that sometimes data is larger than a segment (e.g. an array of 100000 bytes that doesn't fit in a 64 KiB segment); so someone (the programmer or the compiler) has to calculate and load a different segment to access (parts of) the same data. All pointer arithmetic may need to take this into account (e.g. a trivial looking pointer++; may become something more like offset++; segment += (offset >> 4); offset &= 0x000F; that causes a segment register load).
If you have to hit DRAM, that should dominate the cost, right?
For real mode; you're limited to about 640 KiB of RAM and the caches in CPUs are typically much larger, so you can expect that every memory access is going to be a cache hit. In some cases (e.g. cascade lake CPUs with 1 MiB of L2 cache per core) you won't even use the L3 cache (it'll be all L2 hits).
You can also expect that a segment register load is more expensive than a cache hit.
Hoping an 8086 wizard is hanging around who remembers this stuff.
When people say "segmentation is awful and should be avoided" they're not thinking of a CPU that's been obsolete for 40 years (8086), they're thinking of CPUs that were relevant during this century. Some of them may also be thinking of more than just performance alone (especially for assembly language programmers, segmentation is an annoyance/extra burden).

Drhystone benchmark on 32 bit micrcontroller

Currently I am doing a performance comparison on two 32bit microcontrollers. I used Dhrystone benchmark to run on both microcontrollers. One microcontroller has 4KB I-cache while second coontroller has 8KB of I-cache. Both microcontrollers are using same tool chain. As much as possible I kept same static and run-time settings onboth microcontrollers. But microcontroller with 4KB cache are faster than 8KB cache microcontroller. Both microcontroller are from same vendor and based on same CPU.
Could anyone provide some information why microcontroller with 4KB cache is faster than other?

Benchmarks are in general useless. dhrystone being one of the oldest ones and it may have had a little bit of value back then, before pipelines and too much compiler optimization. I think I gave up on dhrystone 15 years ago roughly about the time I started using it.
It is trivial to demonstrate that this code
.globl ASMDELAY
ASMDELAY:
sub r0,r0,#1
bne ASMDELAY
bx lr
Which is primarily two INSTRUCTIONS, can vary WIDELY in execution time on the same chip if you understand how modern processors work. The simple trick to seeing this is turn off the caches and prefetchers and such, and place this code at offset 0x0000, call it with some value. place it at 0x0004 repeat, then at 0x0008, repeat. keep doing this. You can put one, two, etc nops between the subtract and branch. try it at various offsets.
THEN, turn on and of caches for each of these alignments, if you have a prefetch outside the processor for the flash, turn that on and off.
THEN, vary your clocks, esp for those MCUs where you have to adjust the wait states based on clock rate.
On a SINGLE MCU, you are going to see those essentially two instructions vary in execution time by a very large amount. Lets just say 20 times longer in some cases than others.
Now take a small program or a small fraction of the dhrystone program. Compile that for your mcu, how many instructions do you see? Make minor to major optimization and other variations on the compile command line, how much does the code change. If two instructions can vary by lets call it 20 times in execution time, how bad can it get for 200 instructions or 2000 instructions? It can get pretty bad.
If you take the dhrystone programs you have right now with the compiler options you have right now, go into your bootstrap, add one nop (causing the entire binary to shift by one instruction in flash) run again. Add two, three, four. You are still not comparing different mcus just running your benchmark on one system.
Run with and without d cache, with and without i cache if you have each of those. turn on and off the flash prefetch if you have one and if you have a write buffer you can turn on and of try that. Still remaining on the same compiler same options same mcu.
Take the different functions in the dhrystone source code, and re-arrange them in the source code. instead of Proc_1, Proc_2, Proc_3 make it Proc_1, Proc_3, Proc_2. Do all of the above again. Re-arrange again, repeat.
Before leaving this mcu you should now see that the execution time of the same source code which is completely unmodified (other than perhaps re-arranging functions) can and will have vastly different execution times.
if you then start changing compiler options or keep the same source and change compilers, you will see even more of a difference in execution time.
How is it possible that dhrystone benchmarks today or from way back had a single result for each platform? Simple, that result was just one of a wide range, not really representing the platform.
So then if you try to compare two different hardware platforms be it the same arm core inside a different mcu from the same vendor or different vendors. the arm core, assuming (which is not a safe assumption) it is the same source with the same compile/build options, even assuming the same verilog compile and synthesis was used. you can have that same core change based on arm provided options. Anyway, how the vendor be it the same vendor in two instances or two different vendors wraps that same core, you will see variations. Then take a completely different core be it another arm or a mips, etc. How could you gain any value comparing those using a program like this that itself varies widely on each platform?
You cant. What you can do is use benchmarks to give the illusion of one thing being better than another, one computer is faster than another, one compiler is faster than another. In order to sell computers or compilers. Sprints coverage is within one percent of Verizons...does that tell us anything useful? Nope.
If you were to eliminate the compiler from the equation, and if these are truly the same "CPU", same rev of source from ARM, built the same way, then they should fetch the same, but the size of the cache is part of that so it may already be a different cpu implementation as the width or depth of the cache can affect things. In software it is like needing a 32 bit pointer rather than a 16 bit pointer (17 bit instead of 16, but you cant have a 17 bit generally in logic you can).
Anyway, if you compile the code under test one time for an address space that is common to both platforms, use that same binary exactly for that space, can attach different bootstrap code as needed, note the C library calls strcpy, etc also have to be the same in the same space between platforms to eliminate the compiler and alignment from messing you up. this may or may not level the playing field.
If you want to believe these are the same cpu, then turn the caches off, eliminate the compiler variations by doing the above. See if they execute the same. Copy the program to ram and run in ram, eliminate the flash issues. I assume you have them both clocked the same with the same wait states in the flash?
if they are the same cpu and the chip vendor has with these two chips made the memory system take the same number of clocks say for ram accesses, and it is really the same cpu, you should be able to get the same time by eliminating optimizations (caching, flash prefetching, alignment).
what you are probably seeing is some form of alignment with how the code lies in memory from the compiler vs cache lines, or it could be much much simpler it could be just the differences in the caches, how the hits and misses work and the 4KB is just more lucky than the 8KB for this particular program compiled a certain way, aligned in memory a certain way, etc.
With the simple two instruction loop above it is easy to see some of the reasons why performance varies on the same system, if your modern cpu fetches 8 instructions at a time and your loop gets too close to the tail end of that fetch, the prefetch may think it needs to fetch another 8 beyond that costing you those clock cycles. Certainly as you exactly straddle two "fetch lines" as I call them with those two instructions it is going to cost you a lot more cycles per loop even with a cache. Same problem happens when those two instructions approach a cache line (as you vary their alignment per test) eventually it takes two cache line reads instead of one to fetch those two instructions. At least the first time through there is an extra cache line read. The extra clocks for that first time through is something you can see using a simple benchmark like that while playing with alignment.
Michael Abrash, Zen of Assembly Language. There is an epub/etc you can build from github of this book. the 8088 was obsolete when this book came out if that is all you see 8088 stuff, then you are completely missing the point. It applies to the most modern processors today, how to view the problem, how to test, how to time the test, how to interpret the results. All the stuff I have mentioned so far, and all the things I know about this that I have not mentioned, all came from that books knowledge applied for however many decades I have been doing this.
So again if you have truly eliminated the compiler, alignment, the cpu, the memory system tied to that cpu, etc and it is down to only the size of the cache varies. Then it is probably related to how the cache line fetches hit and miss differently based on alignment of the code relative to the cache lines for the two caches. One is hitting more and missing less and/or evicting better for this particular binary. You can try rearranging the functions, can add nops, or if you cant get at the bootstrap then add whole functions or more code (another printf, etc) at a lower address in the binary causing the linker to slide the code under test through to different addresses changing how the program lines up with the cache lines. Since the functions in the code under test are so big (more than a few instructions to implement) you would have to start modifying the program in order to get a finer grained adjustment of binary relative to cache lines.
You most definitely should see execution time differences on the two platforms if you adjust alignment and or wholesale re-arrange the binary based on functions being re-arranged.
Bottom line benchmarks dont really tell you much, the results have more of a negative stink to them than a positive joy. Without a re-write a particular benchmark or application may just do better on one platform (be it everything is the same but the size of the cache or two completely different architectures) even when you try to vary the results due to alignment, turning on and off prefetching, write buffering, branch prediction, etc. Pulling out all the tricks you can come up with one may vary from x to y the other from n to m and maybe there is some overlap in the ranges. For these two platforms that you say are similiar except for cache size I would hope you could find a combination where sometimes A is faster than B and at least one combination where B is faster than A, with the same features turned on and off for both in that comparision. If/when you switch to some other application/benchmark this all starts over again, no reason to assume dhrystone results predict any other code under test. The only program that matters, esp on an MCU is the final build of your application. Just remember changing a single line of code, or even adding a single nop in the bootstrap can have dramatic results in performance sometimes several TIMES slower or faster from a single nop.

Which of a misaligned store and misaligned load is more expensive?

Suppose I'm copying data between two arrays that are 1024000+1 bytes apart. Since the offset is not a multiple of word size, I'll need to do some misaligned accesses - either loads or stores (for the moment, let's forget that it's possible to avoid misaligned accesses entirely with some ORing and bit shifting). Which of misaligned loads or misaligned stores will be more expensive?
This is a hypothetical situation, so I can't just benchmark it :-) I'm more interested in what factors will lead to performance difference, if any. A pointer to some further reading would be great.
Thanks!

A misaligned write will need to read two destination words, merge in the new data, and write two words. This would be combined with an aligned read. So, 3R + 2W.
A misaligned read will need to read two source words, and merge the data (shift and bitor). This would be combined with an aligned write. So, 2R + 1W.
So, the misaligned read is a clear winner.
Of course, as you say there are more efficient ways to do this that avoid any mis-aligned operations except at the ends of the arrays.

Actually that depends greatly on the CPU you are using. On newer Intel CPUs there is no penalty for loading and storing unaligned words (at least none that you can notice). Only if you load and store 16byte or 32byte unaligned chunks you may see small performance degradation.

How much data? are we talking about two things unaligned at the ends of a large block of data (in the noise) or one item (word, etc) that is unaligned (100% of the data)?
Are you using a memcpy() to move this data, etc?
I'm more interested in what factors will lead to performance
difference, if any.
Memories, modules, chips, on die blocks, etc are usually organized with a fixed access size, at least somewhere along the way there is a fixed access size. Lets just say 64 bits wide, not an uncommon size these days. So at that layer wherever it is you can only write or read in aligned 64 bit units.
If you think about a write vs read, with a read you send out an address and that has to go to the memory and data come back, a full round trip has to happen. With a write everything you need to know to perform the write goes on the outbound path, so it is not uncommon to have a fire and forget type deal where the memory controller takes the address and data and tells the processor the write has finished even though the information has not net reached the memory. It does take time but not as long as a read (not talking about flash/proms just ram here) since a read requires both paths. So for aligned full width stuff a write CAN BE faster, some systems may wait for the data to make it all the way to the memory and then return a completion which is perhaps about the same amount of time as the read. It depends on your system though, the memory technology can make one or the other faster or slower right at the memory itself. Now the first write after nothing has been happening can do this fire and forget thing, but the second or third or fourth or 16th in a row eventually fills up a buffer somewhere along the path and the processor has to wait for the oldest one to make it all the way to the memory before the most recent one has a place in the queue. So for bursty stuff writes may be faster than reads but for large movements of data they approach each other.
Now alignment. The whole memory width will be read on a read, in this case lets say 64 bits, if you were only really interested in 8 of those bits, then somewhere between the memory and the processor the other 24 bits are discarded, where depends on the system. Writes that are not a whole, aligned, size of the memory mean that you have to read the width of the memory, lets say 64 bits, modify the new bits, say 8 bits, then write the whole 64 bits back. A read-modify-write. A read only needs a read a write needs a read-modify-write, the farther away from the memory requiring the read modify write the longer it takes the slower it is, no matter what the read-modify-write cant be any faster than the read alone so the read will be faster, the trimming of bits off the read generally wont take any time so reading a byte compared to reading 16 bits or 32 or 64 bits from the same location so long as the busses and destination are that width all the way, take the same time from the same location, in general, or should.
Unaligned simply multiplies the problem. Say worst case if you want to read 16 bits such that 8 bits are in one 64 bit location and the other 8 in the next 64 bit location, you need to read 128 bits to satisfy that 16 bit read. How that exactly happens and how much of a penalty is dependent on your system. some busses set up the transfer X number of clocks but the data is one clock per bus width after that so a 128 bit read might be only one clock longer (than the dozens to hundreds) of clocks it takes to read 64, or worst case it could take twice as long in order to get the 128 bits needed for this 16 bit read. A write, is a read-modify-write so take the read time, then modify the two 64 bit items, then write them back, same deal could be X+1 clocks in each direction or could be as bad as 2X number of clocks in each direction.
Caches help and hurt. A nice thing about using caches is that you can smooth out the transfers to the slow memory, you can let the cache worry about making sure all memory accesses are aligned and all writes are whole 64 bit writes, etc. How that happens though is the cache will perform same or larger sized reads. So reading 8 bits may result in one or many 64 bit reads of the slow memory, for the first byte, if you perform a second read right after that of the next byte location and if that location is in the same cache line then it doesnt go out to slow memory, it reads from the cache, much faster. and so on until you cross over into another cache boundary or other reads cause that cache line to be evicted. If the location being written is in cache then the read-modify-write happens in the cache, if not in cache then it depends on the system, a write doesnt necessarily mean the read modify write causes a cache line fill, it could happen on the back side as of the cache were not there. Now if you modified one byte in the cache line, now that line has to be written back it simply cannot be discarded so you have a one to few widths of the memory to write back as a result. your modification was fast but eventually the write happens to the slow memory and that affects the overall performance.
You could have situations where you do a (byte) read, the cache line if bigger than the external memory width can make that read slower than if the cache wasnt there, but then you do a byte write to some item in that cache line and that is fast since it is in the cache. So you might have experiments that happen to show writes are faster.
A painful case would be reading say 16 bits unaligned such that not only do they cross over a 64 bit memory width boundary but the cross over a cache line boundary, such that two cache lines have to be read, instead of reading 128 bits that might mean 256 or 512 or 1024 bits have to be read just to get your 16.
The memory sticks on your computer for example are actually multiple memories, say maybe 8 8 bit wide to make a 64 bit overall width or 16 4 bit wide to make an overall 64 bit width, etc. That doesnt mean you can isolate writes on one lane, but maybe, I dont know those modules very well but there are systems where you can/could do this, but those systems I would consider to be 8 or 4 bit wide as far as the smallest addressable size not 64 bit as far as this discussion goes. ECC makes things worse though. First you need an extra memory chip or more, basically more width 72 bits to support 64 for example. You must do full writes with ECC as the whole 72 bits lets say has to be self checking so you cant do fractions. if there is a correctable (single bit) error the read suffers no real penalty it gets the corrected 64 bits (somewhere in the path where this checking happens). Ideally you want a system to write back that corrected value but that is not how all systems work so a read could turn into a read modify write, aligned or not. The primary penalty is if you were able to do fractional writes you cant now with ECC has to be whole width writes.
Now to my question, lets say you use memcpy to move this data, many C libraries are tuned to do aligned transfers, at least where possible, if the source and destination are unaligned in a different way that can be bad, you might want to manage part of the copy yourself. say they are unaligned in the same way, the memcpy will try to copy the unaligned bytes first until it gets to an aligned boundary, then it shifts into high gear, copying aligned blocks until it gets near the end, it downshifts and copies the last few bytes if any, in an unaligned fashion. so if this memory copy you are talking about is thousands of bytes and the only unaligned stuff is near the ends then yes it will cost you some extra reads as much as two extra cache line fills, but that may be in the noise. Even on smaller sizes even if aligned on say 32 bit boundaries if you are not moving whole cache lines or whole memory widths there may still be an extra cache line involved, aligned or not you might only suffer an extra cache lines worth of reading and later writing...
The pure traditional, non-cached memory view of this, all other things held constant is as Doug wrote. Unaligned reads across one of these boundaries, like the 16 bits across two 64 bit words, costs you an extra read 2R vs 1R. A similar write costs you 2R+2W vs 1W, much more expensive. Caches and other things just complicate the problem greatly making the answer "it depends"...You need to know your system pretty well and what other stuff is going on around it, if any. Caches help and hurt, with any cache a test can be crafted to show the cache makes things slower and with the same system a test can be written to show the cache makes things faster.
Further reading would be go look at the databooks/sheets technical reference manuals or whatever the vendor calls their docs for various things. for ARM get the AXI/AMBA documentation on their busses, get the cache documentation for their cache (PL310 for example). Information on ddr memory the individual chips used in the modules you plug into your computer are all out there, lots of timing diagrams, etc. (note just because you think you are buying gigahertz memory, you are not, dram has not gotten faster in like 10 years or more, it is pretty slow around 133Mhz, it is just that the bus is faster and can queue more transfers, it still takes hundreds to thousands of processor cycles for a ddr memory cycle, read one byte that misses all the caches and you processor waits an eternity). so memory interfaces on the processors and docs on various memories, etc may help, along with text books on caches in general, etc.

In what applications caching does not give any advantage?

Our professor asked us to think of an embedded system design where caches cannot be used to their full advantage. I have been trying to find such a design but could not find one yet. If you know such a design, can you give a few tips?

Caches exploit the fact data (and code) exhibit locality.
So an embedded system wich does not exhibit locality, will not benefit from a cache.
Example:
An embedded system has 1MB of memory and 1kB of cache.
If this embedded system is accessing memory with short jumps it will stay long in the same 1kB area of memory, which could be successfully cached.
If this embedded system is jumping in different distant places inside this 1MB and does that frequently, then there is no locality and cache will be used badly.
Also note that depending on architecture you can have different caches for data and code, or a single one.
More specific example:
If your embedded system spends most of its time accessing the same data and (e.g.) running in a tight loop that will fit in cache, then you're using cache to a full advantage.
If your system is something like a database that will be fetching random data from any memory range, then cache can not be used to it's full advantage. (Because the application is not exhibiting locality of data/code.)
Another, but weird example
Sometimes if you are building safety-critical or mission-critical system, you will want your system to be highly predictable. Caches makes your code execution being very unpredictable, because you can't predict if a certain memory is cached or not, thus you don't know how long it will take to access this memory. Thus if you disable cache it allows you to judge you program's performance more precisely and calculate worst-case execution time. That is why it is common to disable cache in such systems.

I do not know what you background is but I suggest to read about what the "volatile" keyword does in the c language.

Think about how a cache works. For example if you want to defeat a cache, depending on the cache, you might try having your often accessed data at 0x10000000, 0x20000000, 0x30000000, 0x40000000, etc. It takes very little data at each location to cause cache thrashing and a significant performance loss.
Another one is that caches generally pull in a "cache line" A single instruction fetch may cause 8 or 16 or more bytes or words to be read. Any situation where on average you use a small percentage of the cache line before it is evicted to bring in another cache line, will make your performance with the cache on go down.
In general you have to first understand your cache, then come up with ways to defeat the performance gain, then think about any real world situations that would cause that. Not all caches are created equal so there is no one good or bad habit or attack that will work for all caches. Same goes for the same cache with different memories behind it or a different processor or memory interface or memory cycles in front of it. You also need to think of the system as a whole.
EDIT:
Perhaps I answered the wrong question. not...full advantage. that is a much simpler question. In what situations does the embedded application have to touch memory beyond the cache (after the initial fill)? Going to main memory wipes out the word full in "full advantage". IMO.

Caching does not offer an advantage, and is actually a hindrance, in controlling memory-mapped peripherals. Things like coprocessors, motor controllers, and UARTs often appear as just another memory location in the processor's address space. Instead of simply storing a value, those locations can cause something to happen in the real world when written to or read from.
Cache causes problems for these devices because when software writes to them, the peripheral doesn't immediately see the write. If the cache line never gets flushed, the peripheral may never actually receive a command even after the CPU has sent hundreds of them. If writing 0xf0 to 0x5432 was supposed to cause the #3 spark plug to fire, or the right aileron to tilt down 2 degrees, then the cache will delay or stop that signal and cause the system to fail.
Similarly, the cache can prevent the CPU from getting fresh data from sensors. The CPU reads repeatedly from the address, and cache keeps sending back the value that was there the first time. On the other side of the cache, the sensor waits patiently for a query that will never come, while the software on the CPU frantically adjusts controls that do nothing to correct gauge readings that never change.

In addition to almost complete answer by Halst, I would like to mention one additional case where caches may be far from being an advantage. If you have multiple-core SoC where all cores, of course, have own cache(s) and depending on how program code utilizes these cores - caches can be very ineffective. This may happen if ,for example, due to incorrect design or program specific (e.g. multi-core communication) some data block in RAM is concurrently used by 2 or more cores.

Why do we bother with CPU registers in assembly, instead of just working directly with memory?

I have a basic question about assembly.
Why do we bother doing arithmetic operations only on registers if they can work on memory as well?
For example both of the following cause (essentially) the same value to be calculated as an answer:
Snippet 1
.data
var dd 00000400h
.code
Start:
add var,0000000Bh
mov eax,var
;breakpoint: var = 00000B04
End Start
Snippet 2
.code
Start:
mov eax,00000400h
add eax,0000000bh
;breakpoint: eax = 0000040B
End Start
From what I can see most texts and tutorials do arithmetic operations mostly on registers. Is it just faster to work with registers?

If you look at computer architectures, you find a series of levels of memory. Those that are close to the CPU are the fast, expensive (per a bit), and therefore small, while at the other end you have big, slow and cheap memory devices. In a modern computer, these are typically something like:
CPU registers (slightly complicated, but in the order of 1KB per a core - there
are different types of registers. You might have 16 64 bit
general purpose registers plus a bunch of registers for special
purposes)
L1 cache (64KB per core)
L2 cache (256KB per core)
L3 cache (8MB)
Main memory (8GB)
HDD (1TB)
The internet (big)
Over time, more and more levels of cache have been added - I can remember a time when CPUs didn't have any onboard caches, and I'm not even old! These days, HDDs come with onboard caches, and the internet is cached in any number of places: in memory, on the HDD, and maybe on caching proxy servers.
There is a dramatic (often orders of magnitude) decrease in bandwidth and increase in latency in each step away from the CPU. For example, a HDD might be able to be read at 100MB/s with a latency of 5ms (these numbers may not be exactly correct), while your main memory can read at 6.4GB/s with a latency of 9ns (six orders of magnitude!). Latency is a very important factor, as you don't want to keep the CPU waiting any longer than it has to (this is especially true for architectures with deep pipelines, but that's a discussion for another day).
The idea is that you will often be reusing the same data over and over again, so it makes sense to put it in a small fast cache for subsequent operations. This is referred to as temporal locality. Another important principle of locality is spatial locality, which says that memory locations near each other will likely be read at about the same time. It is for this reason that reading from RAM will cause a much larger block of RAM to be read and put into on-CPU cache. If it wasn't for these principles of locality, then any location in memory would have an equally likely chance of being read at any one time, so there would be no way to predict what will be accessed next, and all the levels of cache in the world will not improve speed. You might as well just use a hard drive, but I'm sure you know what it's like to have the computer come to a grinding halt when paging (which is basically using the HDD as an extension to RAM). It is conceptually possible to have no memory except for a hard drive (and many small devices have a single memory), but this would be painfully slow compared to what we're familiar with.
One other advantage of having registers (and only a small number of registers) is that it lets you have shorter instructions. If you have instructions that contain two (or more) 64 bit addresses, you are going to have some long instructions!

Because RAM is slow. Very slow.
Registers are placed inside the CPU, right next to the ALU so signals can travel almost instantly. They're also the fastest memory type but they take significant space so we can have only a limited number of them. Increasing the number of registers increases
die size
distance needed for signals to travel
work to save the context when switching between threads
number of bits in the instruction encoding
Read If registers are so blazingly fast, why don't we have more of them?
More commonly used data will be placed in caches for faster accessing. In the past caches are very expensive so they're an optional part and can be purchased separately and plug into a socket outside the CPU. Nowadays they're often in the same die with the CPUs. Caches are constructed from SRAM cells which are smaller than register cells but maybe tens or hundreds of times slower.
Main memory will be made from DRAM which needs only one transistor per cell but are thousands of times slower than registers, hence we can't work with only DRAM in a high-performance system. However some embedded system do make use of register file so registers are also main memory
More information: Can we have a computer with just registers as memory?

Registers are much faster and also the operations that you can perform directly on memory are far more limited.

In real, there are tiny implementations that does not separate registers from memory. They can expose it, for example, in the way they have 512 bytes of RAM, and first 64 of them are exposed as 32 16-bit registers and in the same time accessible as addressable RAM. Or, another example, MosTek 6502 "zero page" (RAM range 0-255, accessed used 1-byte address) was a poor substitution for registers, due to small amount of real registers in CPU. But, this is poorly scalable to larger setups.
The advantage of registers are following:
They are the most fast. They are faster in a typical modern system than any cache, more so than DRAM. (In the example above, RAM is likely SRAM. But SRAM of a few gigabytes is unusably expensive.) And, they are close to processor. Difference of time between register access and DRAM access can reach values like 200 or even 1000. Even compared to L1 cache, register access is typically 2-4 times faster.
Their amount is limited. A typical instruction set will become too bloated if any memory location is addressed explicitly.
Registers are specific to each CPU (core, hardware thread, hart) separately. (In systems where fixed RAM addresses serve role of special registers, as e.g. zSeries does, this needs special remapping of such service area in absolute addresses, separate for each core.)
In the same manner as (3), registers are specific to each process thread without a need to adjust locations in code for a thread.
Registers (relatively easily) allow specific optimizations, as register renaming. This is too complex if memory addresses are used.
Additionally, there are registers that could not be implemented in separate block RAM because access to RAM needs their change. I mean the "execution phase" register in the simplest CPU designs, which takes values like "instruction extracting phase", "instruction decoding phase", "ALU phase", "data writing phase" and so on, and this register equivalents in more complicated (pipeline, out-of-order) designs; also different buffer registers on bus access, and so on. But, such registers are not visible to programmer, so you did likely not mean them.

x86, like pretty much every other "normal" CPU you might learn assembly for, is a register machine1. There are other ways to design something that you can program (e.g. a Turing machine that moves along a logical "tape" in memory, or the Game of Life), but register machines have proven to be basically the only way to go for high-performance.
https://www.realworldtech.com/architecture-basics/2/ covers possible alternatives like accumulator or stack machines which are also obsolete now. Although it omits CISCs like x86 which can be either load-store or register-memory. x86 instructions can actually be reg,mem; reg,reg; or even mem,reg. (Or with an immediate source.)
Footnote 1: The abstract model of computation called a register machine doesn't distinguish between registers and memory; what it calls registers are more like memory in real computers. I say "register machine" here to mean a machine with multiple general-purpose registers, as opposed to just one accumulator, or a stack machine or whatever. Most x86 instructions have 2 explicit operands (but it varies), up to one of which can be memory. Even microcontrollers like 6502 that can only really do math into one accumulator register almost invariably have some other registers (e.g. for pointers or indices), unlike true toy ISAs like Marie or LMC that are extremely inefficient to program for because you need to keep storing and reloading different things into the accumulator, and can't even keep an array index or loop counter anywhere that you can use it directly.
Since x86 was designed to use registers, you can't really avoid them entirely, even if you wanted to and didn't care about performance.
Current x86 CPUs can read/write many more registers per clock cycle than memory locations.
For example, Intel Skylake can do two loads and one store from/to its 32KiB 8-way associative L1D cache per cycle (best case), but can read upwards of 10 registers per clock, and write 3 or 4 (plus EFLAGS).
Building an L1D cache with as many read/write ports as the register file would be prohibitively expensive (in transistor count/area and power usage), especially if you wanted to keep it as large as it is. It's probably just not physically possible to build something that can use memory the way x86 uses registers with the same performance.
Also, writing a register and then reading it again has essentially zero latency because the CPU detects this and forwards the result directly from the output of one execution unit to the input of another, bypassing the write-back stage. (See https://en.wikipedia.org/wiki/Classic_RISC_pipeline#Solution_A._Bypassing).
These result-forwarding connections between execution units are called the "bypass network" or "forwarding network", and it's much easier for the CPU to do this for a register design than if everything had to go into memory and back out. The CPU only has to check a 3 to 5 bit register number, instead of an 32-bit or 64-bit address, to detect cases where the output of one instruction is needed right away as the input for another operation. (And those register numbers are hard-coded into the machine-code, so they're available right away.)
As others have mentioned, 3 or 4 bits to address a register make the machine-code format much more compact than if every instruction had absolute addresses.
See also https://en.wikipedia.org/wiki/Memory_hierarchy: you can think of registers as a small fast fixed-size memory space separate from main memory, where only direct absolute addressing is supported. (You can't "index" a register: given an integer N in one register, you can't get the contents of the Nth register with one insn.)
Registers are also private to a single CPU core, so out-of-order execution can do whatever it wants with them. With memory, it has to worry about what order things become visible to other CPU cores.
Having a fixed number of registers is part of what lets CPUs do register-renaming for out-of-order execution. Having the register-number available right away when an instruction is decoded also makes this easier: there's never a read or write to a not-yet-known register.
See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) for an explanation of register renaming, and a specific example (the later edits to the question / later parts of my answer showing the speedup from unrolling with multiple accumulators to hide FMA latency even though it reuses the same architectural register repeatedly).
The store buffer with store forwarding does basically give you "memory renaming". A store/reload to a memory location is independent of earlier stores and load to that location from within this core. (Can a speculatively executed CPU branch contain opcodes that access RAM?)
Repeated function calls with a stack-args calling convention, and/or returning a value by reference, are cases where the same bytes of stack memory can be reused multiple times.
The seconds store/reload can execute even if the first store is still waiting for its inputs. (I've tested this on Skylake, but IDK if I ever posted the results in an answer anywhere.)

Registers are accessed way faster than RAM memory, since you don't have to access the "slow" memory bus!

We use registers because they are fast. Usually, they operate at CPU's speed.
Registers and CPU cache are made with different technology / fabrics and
they are expensive. RAM on the other hand is cheap and 100 times slower.

Generally speaking register arithmetic is much faster and much preferred. However there are some cases where the direct memory arithmetic is useful.
If all you want to do is increment a number in memory (and nothing else at least for a few million instructions) then a single direct memory arithmetic instruction is usually slightly faster than load/add/store.
Also if you are doing complex array operations you generally need a lot of registers to keep track of where you are and where your arrays end. On older architectures you could run out of register really quickly so the option of adding two bits of memory together without zapping any of your current registers was really useful.

Yes, it's much much much faster to use registers. Even if you only consider the physical distance from processor to register compared to proc to memory, you save a lot of time by not sending electrons so far, and that means you can run at a higher clock rate.

Yes - also you can typically push/pop registers easily for calling procedures, handling interrupts, etc

It's just that the instruction set will not allow you to do such complex operations:
add [0x40001234],[0x40002234]
You have to go through the registers.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio