I always hear that unaligned accesses are bad because they will either cause runtime errors that crash the program or slow memory accesses down. However, I can't find any actual data on how much they slow things down.
Suppose I'm on x86 and have some (yet unknown) share of unaligned accesses - what's the worst slowdown actually possible and how do I estimate it without eliminating all unaligned accesses and comparing run time of two versions of code?
It depends on the instruction(s). Most x86 SSE load/store instructions (excluding the explicitly unaligned variants) fault on unaligned addresses, which means it'll probably crash your program or lead to lots of round trips through your exception handler (either way, most or all of the performance is lost). The unaligned load/store variants run at double the number of cycles IIRC, as they perform partial reads/writes, so two operations are required to do the access (unless you are lucky and it's in cache, which greatly reduces the penalty).
For general x86 load/store instructions, the penalty is speed: more cycles are required to do the read or write. Misalignment may also affect caching, leading to cache-line splits and accesses that straddle cache-line boundaries. It also prevents atomicity of reads and writes (which is guaranteed for naturally aligned reads/writes on x86; barriers and propagation are something else, but using a LOCK'ed instruction on unaligned data may cause an exception or greatly increase the already massive penalty the bus lock incurs), which is a no-no for concurrent programming.
Intel's x86 & x64 optimization manual goes into great detail about each of the aforementioned problems, their side effects, and how to remedy them.
Agner Fog's optimization manuals should have the exact numbers you are looking for in terms of raw cycle throughput.
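Since the plain (aligned-only) SSE load/store forms fault on misaligned addresses, the simplest remedy is often to force alignment at the definition. A minimal C sketch, assuming a buffer you control (the name, size and runtime check are just for illustration):

#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

/* 16-byte aligned, so aligned SSE loads/stores on it cannot fault */
alignas(16) static float samples[1024];

void check_alignment(void)
{
    /* runtime sanity check that the address really is 16-byte aligned */
    assert(((uintptr_t)(const void *)samples & 15u) == 0);
}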
In general estimating speed on modern processors is extremely complicated. This is true not only for unaligned accesses but in general.
Modern processors have pipelined architectures, out of order and possibly parallel execution of instructions and many other things that may impact execution.
If the unaligned access is not supported you get an exception. But if it is supported you may or may not get a slowdown depending on a lot of factors. These factors include what other instructions you were executing both before and after the unaligned one (because the processor may be able to start fetching your data while executing previous instructions or to go ahead and perform subsequent instructions while it waits).
Another very important difference is whether the unaligned access crosses a cacheline boundary. While in general a 2x access to the cache may happen for an unaligned access, the real slowdown comes when the access crosses a cacheline boundary and causes a double cache miss. In the worst possible case, a 2-byte unaligned read may require the processor to flush out two cachelines to memory and then read two cachelines back from memory. That's a whole lot of data moving.
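If you want to measure the cacheline-split case in isolation, here is a minimal C sketch for a microbenchmark: one 8-byte load that straddles a 64-byte line boundary versus one that stays inside a line. memcpy is the portable way to express an unaligned load (x86 compilers turn it into a single mov); the buffer name, offsets and the 64-byte line size are assumptions.

#include <stdint.h>
#include <string.h>

static char buf[192] __attribute__((aligned(64)));   /* GCC/Clang extension */

uint64_t load_line_split(void)
{
    uint64_t v;
    memcpy(&v, buf + 60, sizeof v);   /* bytes 60..67 cross the boundary at 64 */
    return v;
}

uint64_t load_within_line(void)
{
    uint64_t v;
    memcpy(&v, buf + 64, sizeof v);   /* stays inside one cache line */
    return v;
}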
The general rule for optimization also applies here: first code, then measure, then if and only if there is a problem figure out a solution.
On some Intel micro-architectures, a load that is split by a cacheline boundary takes a dozen cycles longer than usual, and a load that is split by a page boundary takes over 200 cycles longer. It's bad enough that if loads are going to be consistently misaligned in a loop, it's worth doing two aligned loads and merging the results manually, even if palignr is not an option. Even SSE's unaligned loads won't save you, unless they are split exactly down the middle.
On AMD CPUs this was never a problem, and the problem mostly disappeared with Nehalem, but there are still a lot of Core 2's out there too.
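For illustration, a minimal sketch of the "two aligned loads and merge" idea using palignr (SSSE3), for the case where palignr is an option, i.e. the misalignment offset is a compile-time constant. OFFSET and the function name are assumptions.

#include <tmmintrin.h>   /* SSSE3: _mm_alignr_epi8 (palignr) */

#define OFFSET 4   /* bytes of misalignment within a 16-byte block */

static __m128i load_misaligned(const char *aligned_base)
{
    __m128i lo = _mm_load_si128((const __m128i *)aligned_base);
    __m128i hi = _mm_load_si128((const __m128i *)(aligned_base + 16));
    /* concatenate hi:lo and pick the 16 bytes starting at OFFSET, i.e. the
       data at aligned_base + OFFSET, without issuing an unaligned load */
    return _mm_alignr_epi8(hi, lo, OFFSET);
}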
I have an x86-64 Linux program which I am attempting to optimize via perf. The perf report shows the hottest instructions are scalar conversions from double to long with a memory argument, for example:
cvttsd2si (%rax, %rdi, 8), %rcx
which corresponds to C code like:
extern double *array;
long val = (long)array[idx];
(This is an unusual bottleneck but the code itself is very unusual.)
To inform optimizations I want to know if these instructions are hot because of the load from memory, or because of the arithmetic conversion itself. What's the best way to answer this question? What other data should I collect and how should I proceed to optimize this?
Some things I have looked at already. CPU counter results show 1.5% cache misses per instruction:
30686845493287 cache-references
2140314044614 cache-misses # 6.975 % of all cache refs
52970546 page-faults
1244774326560850 instructions
194784478522820 branches
2573746947618 branch-misses # 1.32% of all branches
52970546 faults
Top-down performance monitors show we are primarily backend-bound:
frontend bound   retiring   bad speculation   backend bound
     10.1%        25.9%          4.5%            59.5%
Ad-hoc measurement with top shows all CPUs pegged at 100% suggesting we are not waiting on memory.
A final note of interest: when run on AWS EC2, the code is dramatically slower (44%) on AMD vs Intel with the same core count. (Tested on Ice Lake 8375C vs EPYC 7R13). What could explain this discrepancy?
Thank you for any help.
To inform optimizations I want to know if these instructions are hot because of the load from memory, or because of the arithmetic conversion itself. What's the best way to answer this question?
I think there are two main reasons for this instruction to be slow. 1. There is a dependency chain, and the latency of this instruction is a problem since the processor is waiting on it to execute other instructions. 2. There is a cache miss (saturating the memory with such an instruction is improbable unless many cores are doing memory-based operations).
First of all, tracking what is going on for a specific instruction is hard (especially if the instruction is not executed a lot of times). You need to use precise events to track the root of the problem, that is, events for which the exact instruction addresses that caused the event are available. Only a (small) subset of all events are precise ones.
Regarding (1), the latency of the instruction should be about 12 cycles on both architectures (although it might be slightly higher on the AMD processor, I do not expect a 44% difference). The target processors are able to execute multiple instructions in a given cycle. Instructions are executed on different ports and are also pipelined. The port usage matters to understand what is going on. This means all the instructions in the hot loop matter: you cannot isolate this specific instruction. Modern processors are insanely complex, so even a basic analysis can be tricky. On Ice Lake processors, you can measure the average port usage with events like UOPS_DISPATCHED.PORT_XXX where XXX can be 0, 1, 2_3, 4_9, 5, 6, 7_8. Only the first three matter for this instruction. The EXE_ACTIVITY.XXX events may also be useful. You should check whether a port is saturated and which one. AFAIK, none of these events are precise, so you can only analyse a block of code (typically the hot loop). On Zen 3, the relevant ports are FP23 and FP45. IDK what the useful events are on this architecture (I am not very familiar with it).
On Ice Lake, you can check the FRONTEND_RETIRED.LATENCY_GE_XXX events where XXX is a power-of-two integer (these should be precise ones, so you can see whether this instruction is the one impacting the events). This helps you see whether the front-end or the back-end is the limiting factor.
Regarding (2), you can check the latency of the memory accesses as well as the number of L1/L2/L3 cache hits/misses. On Ice Lake, you can use events like MEM_LOAD_RETIRED.XXX where XXX can be, for example, L1_MISS, L1_HIT, L2_MISS, L2_HIT, L3_MISS and L3_HIT. Still on Ice Lake, it may be useful to track the latency of the memory operations with MEM_TRANS_RETIRED.LOAD_LATENCY_GT_XXX where XXX is again a power-of-two integer.
You can also use LLVM-MCA to statically simulate the scheduling of the loop's instructions on the target architecture (branches are not taken into account). This is very useful to get a deep understanding of what the scheduler can do, fairly easily.
What could explain this discrepancy?
The latency and reciprocal throughput should be about the same on the two platforms, or at least close. That being said, for the same core count, the two certainly do not operate at the same frequency. If the difference is not coming from that, then I doubt this instruction alone is actually the problem (tricky scheduling issues, wrong/inaccurate profiling results, etc.).
CPU counter results show 1.5% cache misses per instruction
The thing is, the cache-misses event is certainly not very informative here. Indeed, it references last-level cache (L3) misses, so it does not give any information about L1/L2 misses (the previously mentioned events do).
how should I proceed to optimize this?
If the code is latency bound, the solution is to first break any dependency chain in this loop. Unrolling the loop and rewriting it so as to make it more SIMD-friendly can help a lot to improve performance (the reciprocal throughput of this instruction is about 1 cycle as opposed to 12 cycles of latency, so there is room for improvement in this case).
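As a minimal sketch of breaking the loop-carried dependency by unrolling with several accumulators: this assumes the hot loop accumulates converted values into a single variable, and the names array/idx/n are made up from the question's snippet. The independent accumulators give the out-of-order core more work to overlap with the ~12-cycle conversion latency.

#include <stddef.h>

long sum_converted(const double *array, const long *idx, size_t n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += (long)array[idx[i + 0]];
        s1 += (long)array[idx[i + 1]];
        s2 += (long)array[idx[i + 2]];
        s3 += (long)array[idx[i + 3]];
    }
    for (; i < n; i++)            /* handle the tail */
        s0 += (long)array[idx[i]];
    return s0 + s1 + s2 + s3;
}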
If the code is memory bound, then you should care about data locality. Data should fit in the L1 cache if possible. There are many tips for doing so, but it is hard to guide you without more context. They include, for example, sorting data, reordering loop iterations, and using smaller data types.
There are many possible sources of weird, unusual, unexpected behaviour that can occur. If such a thing happens, then it is nearly impossible to understand what is going on without the exact code executed. All details matter in this case.
In this article, https://www.geeksforgeeks.org/inline-functions-cpp/, it states that the disadvantages of inlining are:
3) Too much inlining can also reduce your instruction cache hit rate, thus reducing the speed of instruction fetch from that of cache memory to that of primary memory.
How does inlining affect the instruction cache hit rate?
6) Inline functions might cause thrashing because inlining might increase size of the binary executable file. Thrashing in memory causes performance of computer to degrade.
How does inlining increase the size of the binary executable file? Is it just that it increases the code base length? Moreover, it is not clear to me why having a larger binary executable file would cause thrashing, as the two don't seem linked.
It is possible that the confusion about why inlining can hurt i-cache hit rate or cause thrashing lies in the difference between static instruction count and dynamic instruction count. Inlining (almost always) reduces the latter but often increases the former.
Let us briefly examine those concepts.
Static Instruction Count
Static instruction count for some execution trace is the number of unique instructions0 that appear in the binary image. Basically, you just count the instruction lines in an assembly dump. The following snippet of x86 code has a static instruction count of 5 (the .top: line is a label which doesn't translate to anything in the binary):
mov ecx, 10
mov eax, 0
.top:
add eax, ecx
dec ecx
jnz .top
The static instruction count is mostly important for binary size, and caching considerations.
Static instruction count may also be referred to simply as "code size" and I'll sometimes use that term below.
Dynamic Instruction Count
The dynamic instruction count, on the other hand, depends on the actual runtime behavior and is the number of instructions executed. The same static instruction can be counted multiple times due to loops and other branches, and some instructions included in the static count may never execute at all and so don't count in the dynamic case. The snippet above has a dynamic instruction count of 2 + 30 = 32: the first two instructions are executed once, and then the loop executes 10 times with 3 instructions each iteration.
As a very rough approximation, dynamic instruction count is primarily important for runtime performance.
The Tradeoff
Many optimizations such as loop unrolling, function cloning, vectorization and so on increase code size (static instruction count) in order to improve runtime performance (often strongly correlated with dynamic instruction count).
Inlining is also such an optimization, although with the twist that for some call sites inlining reduces both dynamic and static instruction count.
How does inlining affect the instruction cache hit rate?
The article mentioned too much inlining, and the basic idea here is that a lot of inlining increases the code footprint by increasing the working set's static instruction count while usually reducing its dynamic instruction count. Since a typical instruction cache1 caches static instructions, a larger static footprint means increased cache pressure and often results in a worse cache hit rate.
The increased static instruction count occurs because inlining essentially duplicates the function body at each call site. So rather than one copy of the function body and a few instructions to call the function N times, you end up with N copies of the function body.
Now this is a rather naive model of how inlining works, since after inlining it may be the case that further optimizations can be done in the context of a particular call site, which may dramatically reduce the size of the inlined code. In the case of very small inlined functions or a large amount of subsequent optimization, the resulting code may be even smaller after inlining, since the remaining code (if any) may be smaller than the overhead involved in calling the function2.
Still, the basic idea remains: too much inlining can bloat the code in the binary image.
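To make the small-function case concrete, a minimal C sketch (names are just for illustration): the inlined body is a single load, which is no bigger than the call sequence it replaces (argument setup, call, ret, plus caller-saved register effects), so inlining it can actually shrink code.

struct point { int x, y; };

static inline int get_x(const struct point *p) { return p->x; }

int shifted_x(const struct point *p)
{
    return get_x(p) + 1;   /* after inlining: one load and one add */
}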
The way the i-cache works depends on the static instruction count for some execution, or more specifically the number of instruction cache lines touched in the binary image, which is largely a fairly direct function of the static instruction count. That is, the i-cache caches regions of the binary image so the more regions and the larger they are, the larger the cache footprint, even if the dynamic instruction count happens to be lower.
How does inlining increase the size of the binary executable file?
It's exactly the same principle as the i-cache case above: a larger static footprint means that more distinct pages need to be paged in, potentially leading to more pressure on the VM system. Now, we usually measure code sizes in megabytes, while memory on servers, desktops, etc. is usually measured in gigabytes, so it's highly unlikely that excessive inlining is going to meaningfully contribute to thrashing on such systems. It could perhaps be a concern on much smaller or embedded systems (although the latter often don't have an MMU at all).
0 Here unique refers, for example, to the IP of the instruction, not to the actual value of the encoded instruction. You might find inc eax in multiple places in the binary, but each are unique in this sense since they occur at a different location.
1 There are exceptions, such as some types of trace caches.
2 On x86, the necessary overhead is pretty much just the call instruction. Depending on the call site, there may also be other overhead, such as shuffling values into the correct registers to adhere to the ABI, and spilling caller-saved registers. More generally, there may be a large cost to a function call simply because the compiler has to reset many of its assumptions across a function call, such as the state of memory.
Let's say you have a function that's 100 instructions long and it takes 10 instructions to call it whenever it's called.
That means for 10 calls it uses up 100 + 10 * 10 = 200 instructions in the binary.
Now let's say it's inlined everywhere it's used. That uses up 100 * 10 = 1000 instructions in your binary.
So for point 3, this means that it will take significantly more space in the instruction cache (different invocations of an inline function are not 'shared' in the i-cache).
And for point 6, your total binary size is now bigger, and a bigger binary size can lead to thrashing.
If compilers inlined everything they could, most functions would be gigantic. (Although you might just have one gigantic main function that calls library functions, but in the most extreme case all functions within your program would be inlined into main).
Imagine if everything was a macro instead of a function, so it fully expanded everywhere you used it. This is the source-level version of inlining.
Most functions have multiple call-sites. The code-size to call a function scales a bit with the number of args, but is generally pretty small compared to a medium to large function. So inlining a large function at all of its call sites will increase total code size, reducing I-cache hit rates.
But these days its common practice to write lots of small wrapper / helper functions, especially in C++. The code for a stand-alone version of a small function is often not much bigger than code necessary to call it, especially when you include the side-effects of a function call (like clobbering registers). Inlining small functions can often save code size, especially when further optimizations become possible after inlining. (e.g. the function computes some of the same stuff that code outside the function also computes, so CSE is possible).
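A minimal sketch of that last point (names made up): after inlining, the caller's own x * x and the inlined body become identical expressions, so the compiler can compute the multiply once via common subexpression elimination.

static inline int square(int x) { return x * x; }

int f(int x)
{
    int a = square(x);      /* inlined to x * x                 */
    int b = x * x + 1;      /* shares that multiply after CSE   */
    return a + b;
}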
So for a compiler, the decision of whether to inline into any specific call site or not should be based on the size of the called function, and maybe whether it's called inside a loop. (Optimizing away the call/ret overhead is more valuable if the call site runs more often.) Profile-guided optimization can help a compiler make better decisions, by "spending" more code-size on hot functions, and saving code-size in cold functions (e.g. many functions only run once over the lifetime of the program, while a few hot ones take most of the time).
If compilers didn't have good heuristics for when to inline, or you override them to be way too aggressive, then yes, I-cache misses would be the result.
But modern compilers do have good inlining heuristics, and usually this makes programs significantly faster but only a bit larger. The article you read is talking about why there need to be limits.
The above code-size reasoning should make it obvious that executable size increases, because it doesn't shrink the data any. Many functions will still have a stand-alone copy in the executable as well as inlined (and optimized) copies at various call sites.
There are a few factors that mitigate the I-cache hit rate problem. Better locality (from not jumping around as much) lets code prefetch do a better job. Many programs spend most of their time in a small part of their total code, which usually still fits in I-cache after a bit of inlining.
But larger programs (like Firefox or GCC) have lots of code, and call the same functions from many call sites in large "hot" loops. Too much inlining bloating the total code size of each hot loop would hurt I-cache hit rates for them.
Thrashing in memory causes performance of computer to degrade.
https://en.wikipedia.org/wiki/Thrashing_(computer_science)
On modern computers with multiple GiB of RAM, thrashing of virtual memory (paging) is not plausible unless every program on the system was compiled with extremely aggressive inlining. These days most memory is taken up by data, not code (especially pixmaps in a computer running a GUI), so code would have to explode by a few orders of magnitude to start to make a real difference in overall memory pressure.
Thrashing the I-cache is pretty much the same thing as having lots of I-cache misses. But it would be possible to go beyond that into thrashing the larger unified caches (L2 and L3) that cache code + data.
Generally speaking, inlining tends to increase the emitted code size due to call sites being replaced with larger pieces of code. Consequently, more memory space may be required to hold the code, which may cause thrashing. I'll discuss this in a little more detail.
How does inlining affect the instruction cache hit rate?
The impact that inlining can have on performance is very difficult to statically characterize in general without actually running the code and measuring its performance.
Yes, inlining may impact the code size and typically makes the emitted native code larger. Let's consider the following cases:
The code executed within a particular period of time fits within a particular level of the memory hierarchy (say L1I) in both cases (with or without inlining). So performance with respect to that particular level will not change.
The code executed within a particular period of time fits within a particular level of the memory hierarchy in the case of no inlining, but doesn't fit with inlining. The impact this can have on performance depends on the locality of the executed code. Essentially, if the hottest pieces of code fit within that level of memory, then the miss ratio at that level might increase only slightly. Features of modern processors such as speculative execution, out-of-order execution, and prefetching can hide or reduce the penalty of the additional misses. It's important to note that inlining does improve the locality of code, which can result in a net positive impact on performance despite the increased code size. This is particularly true when the code inlined at a call site is frequently executed. Partial inlining techniques have been developed to inline only the parts of the function that are deemed hot.
The code executed within a particular period of time doesn't fit within a particular level of the memory hierarchy in either case. So performance with respect to that particular level will not change.
Moreover, it is not clear to me why having a larger binary executable file would cause thrashing as the two don't seem linked.
Consider the main memory level on a resource-constrained system. Even a mere 5% increase in code size can cause thrashing at main memory, which can result in significant performance degradations. On other resource-rich systems (desktops, workstations, servers), thrashing usually only occurs at caches when the total size of the hot instructions is too large to fit within one or more of the caches.
If new CPUs had a cache buffer which was only committed to the actual CPU cache if the instructions are ever committed, would attacks similar to Meltdown still be possible?
The proposal is to make speculative execution be able to load from memory, but not write to the CPU caches until they are actually committed.
TL:DR: yes I think it would solve Spectre (and Meltdown) in their current form (using a flush+read cache-timing side channel to copy the secret data from a physical register), but probably be too expensive (in power cost, and maybe also performance) to be a likely implementation.
But with hyperthreading (or more generally any SMT), there's also an ALU / port-pressure side-channel if you can get mis-speculation to run data-dependent ALU instructions with the secret data, instead of using it as an array index. The Meltdown paper discusses this possibility before focusing on the flush+reload cache-timing side-channel. (It's more viable for Meltdown than Spectre, because you have much better control of the timing of when the secret data is used.)
So modifying cache behaviour doesn't block the attacks. It would take away the reliable side-channel for getting the secret data into the attacking process, though (i.e. ALU timing has higher noise and thus lower bandwidth to get the same reliability; Shannon's noisy-channel theorem), and you have to make sure your code runs on the same physical core as the code under attack.
On CPUs without SMT (e.g. Intel's desktop i5 chips), the ALU timing side-channel is very hard to use with Spectre, because you can't directly use perf counters on code you don't have privilege for. (But Meltdown could still be exploited by timing your own ALU instructions with Linux perf, for example).
Meltdown specifically is much easier to defend against, microarchitecturally, with simpler and cheaper changes to the hard-wired parts of the CPU that microcode updates can't rewire.
You don't need to block speculative loads from affecting cache; the change could be as simple as letting speculative execution continue after a TLB-hit load that will fault if it reaches retirement, but with the value used by speculative execution of later instructions forced to 0 because of the failed permission check against the TLB entry.
So the mis-speculated load that touches array[secret*4096] (after the faulting load of the secret) would always make the same cache line hot, with no secret-data-dependent behaviour. The secret data itself would enter cache, but not a physical register. (And this stops ALU / port-pressure side-channels, too.)
Stopping the faulting load from even bringing the "secret" line into cache in the first place could make it harder to tell the difference between a kernel mapping and an unmapped page, which could possibly help protect against user-space trying to defeat KASLR by finding which virtual addresses the kernel has mapped. But that's not Meltdown.
Spectre
Spectre is the hard one because the mis-speculated instructions that make data-dependent modifications to microarchitectural state do have permission to read the secret data. Yes, a "load queue" that works similarly to the store queue could do the trick, but implementing it efficiently could be expensive. (Especially given the cache coherency problem that I didn't think of when I wrote this first section.)
(There are other ways of implementing your basic idea; maybe there's even a way that's viable. But extra bits on L1D lines to track their status have downsides and aren't obviously easier.)
The store queue tracks stores from execution until they commit to L1D cache. (Stores can't commit to L1D until after they retire, because that's the point at which they're known to be non-speculative, and thus can be made globally visible to other cores).
A load queue would have to store whole incoming cache lines, not just the bytes that were loaded. (But note that Skylake-X can do 64-byte ZMM stores, so its store-buffer entries do have to be the size of a cache line. But if they can borrow space from each other or something, then there might not be 64 * entries bytes of storage available, i.e. maybe only the full number of entries is usable with scalar or narrow-vector stores. I've never read anything about a limitation like this, so I don't think there is one, but it's plausible)
A more serious problem is that Intel's current L1D design has 2 read ports + 1 write port. (And maybe another port for writing lines that arrive from L2 in parallel with committing a store? There was some discussion about that on Unexpectedly poor and weirdly bimodal performance for store loop on Intel Skylake.)
If your loaded data can't enter L1D until after the loads retire, then they're probably going to be competing for the same write port that stores use.
Loads that hit in L1D can still come directly from L1D, though, and loads that hit in the memory-order-buffer could still be executed at 2 per clock. (The MOB would now include this new load queue as well as the usual store queue + markers for loads to maintain x86 memory ordering semantics). You still need both L1D read ports to maintain performance for code that doesn't touch a lot of new memory, and mostly is reloading stuff that's been hot in L1D for a while.
This would make the MOB about twice as large (in terms of data storage), although it doesn't need any more entries. As I understand it, the MOB in current Intel CPUs is composed of the individual load-buffer and store-buffer entries. (Haswell has 72 and 42 respectively).
Hmm, a further complication is that the load data in the MOB has to maintain cache coherency with other cores. This is very different from store data, which is private and hasn't become globally visible / isn't part of the global memory order and cache coherency until it commits to L1D.
So this proposed "load queue" implementation mechanism for your idea is probably not feasible without tweaks: it would have to be checked by invalidation-requests from other cores, so that's another read-port needed in the MOB.
Any possible implementation would have the problem of needing to later commit to L1D like a store. I think it would be a significant burden not to be able to evict + allocate a new line when it arrived from off-core.
(Even allowing speculative eviction but not speculative replacement from conflicts leaves open a possible cache-timing attack. You'd prime all the lines and then do a load that would evict one from one set of lines or another, and find which line was evicted instead of which one was fetched using a similar cache-timing side channel. So using extra bits in L1D to find / evict lines loaded during recovery from mis-speculation wouldn't eliminate this side-channel.)
Footnote: all instructions are speculative. This question is worded well, but I think many people reading about OoO exec and thinking about Meltdown / Spectre fall into this trap of confusing speculative execution with mis-speculation.
Remember that all instructions are speculative when they're executed. It's not known to be correct speculation until retirement. Meltdown / Spectre depend on accessing secret data and using it during mis-speculation. But the basis of current OoO CPU designs is that you don't know whether you've speculated correctly or not; everything is speculative until retirement.
Any load or store could potentially fault, and so can some ALU instructions (e.g. floating point if exceptions are unmasked), so any performance cost that applies "only when executing speculatively" actually applies all the time. This is why stores can't commit from the store queue into L1D until after the store uops have retired from the out-of-order CPU core (with the store data in the store queue).
However, I think conditional and indirect branches are treated specially, because they're expected to mis-speculate some of the time, and optimizing recovery for them is important. Modern CPUs do better with branches than just rolling back to the current retirement state when a mispredict is detected, I think using a checkpoint buffer of some sort. So out-of-order execution for instructions before the branch can continue during recovery.
But loop and other branches are very common, so most code executes "speculatively" in this sense, too, with at least one branch-rollback checkpoint not yet verified as correct speculation. Most of the time it's correct speculation, so no rollback happens.
Recovery for mis-speculation of memory ordering or faulting loads is a full pipeline-nuke, rolling back to the retirement architectural state. So I think only branches consume the branch checkpoint microarchitectural resources.
Anyway, all of this is what makes Spectre so insidious: the CPU can't tell the difference between mis-speculation and correct speculation until after the fact. If it knew it was mis-speculating, it would initiate rollback instead of executing useless instructions / uops. Indirect branches are not rare, either (in user-space); every DLL or shared library function call uses one in normal executables on Windows and Linux.
I suspect the overhead from buffering and committing the buffer would render the specEx/caching useless?
This is purely speculative (no pun intended) - I would love to see someone with a lower level background weigh in on this!
I have encountered some Intel compiler intrinsic functions which I believe allow developers to bypass the cache?
http://software.intel.com/sites/products/documentation/doclib/stdxe/2013/composerxe/compiler/fortran-mac/GUID-AF42A867-B796-4D29-8FED-C20193FD87E0.htm
I have also come across the GCC compiler prefetch keyword, although I cannot admit to fully appreciating what this does.
With the above in mind I wondered if any members could either elaborate on the above (which I badly described) or provide other techniques which allow the developer to have close control over which data (or instructions) is/isn't loaded in the CPU cache?
This page contains a lot of information about all intrinsics:
Intel Intrinsics Guide
The series of intrinsics that write data to memory while bypassing the caches (and thus avoid evicting other useful data) are generally named _mm_stream_.... As the name implies, these are ideal for applications that write a large stream of data that is basically contiguous in memory and unlikely to be accessed again in the near future. So, for example, if you are mixing audio buffers and producing a single waveform output this would work well.
One of the keys to using these instructions effectively is taking advantage of write combining. If your write locations are scattered throughout memory, these instructions will stall as badly as, or possibly worse than, any other kind of store instruction you attempt. Since these writes do not wind up in cache, if you're not filling an entire write-combining buffer, then essentially your operation becomes a write-through operation, requiring a stall until the write is completed. If you are writing contiguous memory locations, then write combining will apply and make your data writes much more efficient.
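A minimal C sketch of the contiguous, write-combining-friendly case, assuming the destination is 16-byte aligned and the count is a multiple of 4 (function and parameter names are just for illustration):

#include <emmintrin.h>   /* SSE2: _mm_set1_epi32, _mm_stream_si128, _mm_sfence */
#include <stddef.h>
#include <stdint.h>

void fill_stream(int32_t *dst, int32_t value, size_t count)
{
    __m128i v = _mm_set1_epi32(value);
    for (size_t i = 0; i + 4 <= count; i += 4)
        _mm_stream_si128((__m128i *)(dst + i), v);  /* non-temporal store, bypasses cache */
    _mm_sfence();  /* order the NT stores before later ordinary stores */
}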
The flip side of that coin is prefetching. Prefetching tells the system to start pulling a memory address into the desired level of cache so that by the time the memory read is complete, you are ready to use the data. This is much harder to use, and requires an appropriate data "stride" which takes into account the cache sizes, cache line size, and the number of instructions which can execute before the memory read completes. Using the hinting parameter, you can "suggest" that the data goes into the L1, L2, or L3 cache, or that it is "non-temporal", meaning that you're just going to use it once and it should be evicted first before any other cache evictions. The hardware has its own prefetching heuristics that work well for most problems without explicit prefetching instructions, but the classic counter-example is a matrix transpose:
Prefetching examples
Prefetching is generally very difficult to use effectively except in some very specific cases like this. Without a more specific problem statement from you, this is about all I can provide.
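For what it's worth, a minimal sketch of explicit prefetching a fixed distance ahead of a streaming read. PREFETCH_AHEAD is an assumed tuning knob; the right distance depends on the cache-line size, memory latency and the work per iteration. GCC/Clang's __builtin_prefetch(p, 0, 3) is the compiler-builtin equivalent of the intrinsic used here.

#include <stddef.h>
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 */

#define PREFETCH_AHEAD 16

double sum_with_prefetch(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n)
            _mm_prefetch((const char *)&a[i + PREFETCH_AHEAD], _MM_HINT_T0);
        s += a[i];
    }
    return s;
}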
I have a basic question about assembly.
Why do we bother doing arithmetic operations only on registers if they can work on memory as well?
For example both of the following cause (essentially) the same value to be calculated as an answer:
Snippet 1
.data
var dd 00000400h
.code
Start:
add var,0000000Bh
mov eax,var
;breakpoint: var = 0000040B
End Start
Snippet 2
.code
Start:
mov eax,00000400h
add eax,0000000bh
;breakpoint: eax = 0000040B
End Start
From what I can see most texts and tutorials do arithmetic operations mostly on registers. Is it just faster to work with registers?
If you look at computer architectures, you find a series of levels of memory. Those that are close to the CPU are fast, expensive (per bit), and therefore small, while at the other end you have big, slow and cheap memory devices. In a modern computer, these are typically something like:
CPU registers (slightly complicated, but on the order of 1KB per core - there are different types of registers; you might have 16 64-bit general-purpose registers plus a bunch of registers for special purposes)
L1 cache (64KB per core)
L2 cache (256KB per core)
L3 cache (8MB)
Main memory (8GB)
HDD (1TB)
The internet (big)
Over time, more and more levels of cache have been added - I can remember a time when CPUs didn't have any onboard caches, and I'm not even old! These days, HDDs come with onboard caches, and the internet is cached in any number of places: in memory, on the HDD, and maybe on caching proxy servers.
There is a dramatic (often orders of magnitude) decrease in bandwidth and increase in latency in each step away from the CPU. For example, a HDD might be able to be read at 100MB/s with a latency of 5ms (these numbers may not be exactly correct), while your main memory can read at 6.4GB/s with a latency of 9ns (six orders of magnitude!). Latency is a very important factor, as you don't want to keep the CPU waiting any longer than it has to (this is especially true for architectures with deep pipelines, but that's a discussion for another day).
The idea is that you will often be reusing the same data over and over again, so it makes sense to put it in a small fast cache for subsequent operations. This is referred to as temporal locality. Another important principle of locality is spatial locality, which says that memory locations near each other will likely be read at about the same time. It is for this reason that reading from RAM will cause a much larger block of RAM to be read and put into on-CPU cache. If it wasn't for these principles of locality, then any location in memory would have an equally likely chance of being read at any one time, so there would be no way to predict what will be accessed next, and all the levels of cache in the world will not improve speed. You might as well just use a hard drive, but I'm sure you know what it's like to have the computer come to a grinding halt when paging (which is basically using the HDD as an extension to RAM). It is conceptually possible to have no memory except for a hard drive (and many small devices have a single memory), but this would be painfully slow compared to what we're familiar with.
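To make spatial locality concrete, a small C sketch (the array size is arbitrary): traversing a 2-D array in row-major order touches consecutive addresses and uses whole cache lines, while the column-major version strides across lines.

#define N 1024
static double a[N][N];

/* Row-major traversal: consecutive addresses, good spatial locality. */
double sum_rows(void)
{
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Column-major traversal: same result, but each access jumps N*8 bytes,
   so most of every cache line fetched goes unused before eviction. */
double sum_cols(void)
{
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}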
One other advantage of having registers (and only a small number of registers) is that it lets you have shorter instructions. If you have instructions that contain two (or more) 64 bit addresses, you are going to have some long instructions!
Because RAM is slow. Very slow.
Registers are placed inside the CPU, right next to the ALU so signals can travel almost instantly. They're also the fastest memory type but they take significant space so we can have only a limited number of them. Increasing the number of registers increases
die size
distance needed for signals to travel
work to save the context when switching between threads
number of bits in the instruction encoding
Read If registers are so blazingly fast, why don't we have more of them?
More commonly used data will be placed in caches for faster access. In the past, caches were very expensive, so they were an optional part that could be purchased separately and plugged into a socket outside the CPU. Nowadays they're often on the same die as the CPU. Caches are constructed from SRAM cells, which are smaller than register cells but maybe tens or hundreds of times slower.
Main memory is made from DRAM, which needs only one transistor per cell but is thousands of times slower than registers, hence we can't work with only DRAM in a high-performance system. However, some embedded systems do use the register file as their main memory, so there the registers are also the main memory.
More information: Can we have a computer with just registers as memory?
Registers are much faster, and also the operations that you can perform directly on memory are far more limited.
In reality, there are tiny implementations that do not separate registers from memory. They can expose this, for example, by having 512 bytes of RAM, with the first 64 of them exposed as 32 16-bit registers while at the same time being accessible as addressable RAM. Or, another example: the MOS Technology 6502's "zero page" (the RAM range 0-255, accessed using a 1-byte address) was a poor substitute for registers, due to the small number of real registers in the CPU. But this scales poorly to larger setups.
The advantages of registers are the following:
They are the fastest. They are faster in a typical modern system than any cache, and even more so than DRAM. (In the example above, the RAM is likely SRAM. But SRAM of a few gigabytes is unusably expensive.) And they are close to the processor. The difference in access time between a register and DRAM can reach a factor of 200 or even 1000. Even compared to the L1 cache, register access is typically 2-4 times faster.
Their number is limited, so they can be addressed with just a few bits. A typical instruction set would become too bloated if any memory location could be addressed explicitly in every instruction.
Registers are specific to each CPU (core, hardware thread, hart) separately. (In systems where fixed RAM addresses serve role of special registers, as e.g. zSeries does, this needs special remapping of such service area in absolute addresses, separate for each core.)
In the same manner as (3), registers are specific to each process thread without a need to adjust locations in code for a thread.
Registers (relatively easily) allow specific optimizations, as register renaming. This is too complex if memory addresses are used.
Additionally, there are registers that could not be implemented in a separate block of RAM, because accessing RAM itself requires them to change. I mean the "execution phase" register in the simplest CPU designs, which takes values like "instruction fetch phase", "instruction decoding phase", "ALU phase", "data writing phase" and so on, and the equivalents of this register in more complicated (pipelined, out-of-order) designs; also various buffer registers for bus access, and so on. But such registers are not visible to the programmer, so you likely did not mean them.
x86, like pretty much every other "normal" CPU you might learn assembly for, is a register machine1. There are other ways to design something that you can program (e.g. a Turing machine that moves along a logical "tape" in memory, or the Game of Life), but register machines have proven to be basically the only way to go for high-performance.
https://www.realworldtech.com/architecture-basics/2/ covers possible alternatives like accumulator or stack machines which are also obsolete now. Although it omits CISCs like x86 which can be either load-store or register-memory. x86 instructions can actually be reg,mem; reg,reg; or even mem,reg. (Or with an immediate source.)
Footnote 1: The abstract model of computation called a register machine doesn't distinguish between registers and memory; what it calls registers are more like memory in real computers. I say "register machine" here to mean a machine with multiple general-purpose registers, as opposed to just one accumulator, or a stack machine or whatever. Most x86 instructions have 2 explicit operands (but it varies), up to one of which can be memory. Even microcontrollers like 6502 that can only really do math into one accumulator register almost invariably have some other registers (e.g. for pointers or indices), unlike true toy ISAs like Marie or LMC that are extremely inefficient to program for because you need to keep storing and reloading different things into the accumulator, and can't even keep an array index or loop counter anywhere that you can use it directly.
Since x86 was designed to use registers, you can't really avoid them entirely, even if you wanted to and didn't care about performance.
Current x86 CPUs can read/write many more registers per clock cycle than memory locations.
For example, Intel Skylake can do two loads and one store from/to its 32KiB 8-way associative L1D cache per cycle (best case), but can read upwards of 10 registers per clock, and write 3 or 4 (plus EFLAGS).
Building an L1D cache with as many read/write ports as the register file would be prohibitively expensive (in transistor count/area and power usage), especially if you wanted to keep it as large as it is. It's probably just not physically possible to build something that can use memory the way x86 uses registers with the same performance.
Also, writing a register and then reading it again has essentially zero latency because the CPU detects this and forwards the result directly from the output of one execution unit to the input of another, bypassing the write-back stage. (See https://en.wikipedia.org/wiki/Classic_RISC_pipeline#Solution_A._Bypassing).
These result-forwarding connections between execution units are called the "bypass network" or "forwarding network", and it's much easier for the CPU to do this for a register design than if everything had to go into memory and back out. The CPU only has to check a 3 to 5 bit register number, instead of a 32-bit or 64-bit address, to detect cases where the output of one instruction is needed right away as the input for another operation. (And those register numbers are hard-coded into the machine-code, so they're available right away.)
As others have mentioned, 3 or 4 bits to address a register make the machine-code format much more compact than if every instruction had absolute addresses.
See also https://en.wikipedia.org/wiki/Memory_hierarchy: you can think of registers as a small fast fixed-size memory space separate from main memory, where only direct absolute addressing is supported. (You can't "index" a register: given an integer N in one register, you can't get the contents of the Nth register with one insn.)
Registers are also private to a single CPU core, so out-of-order execution can do whatever it wants with them. With memory, it has to worry about what order things become visible to other CPU cores.
Having a fixed number of registers is part of what lets CPUs do register-renaming for out-of-order execution. Having the register-number available right away when an instruction is decoded also makes this easier: there's never a read or write to a not-yet-known register.
See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) for an explanation of register renaming, and a specific example (the later edits to the question / later parts of my answer showing the speedup from unrolling with multiple accumulators to hide FMA latency even though it reuses the same architectural register repeatedly).
The store buffer with store forwarding does basically give you "memory renaming". A store/reload to a memory location is independent of earlier stores and load to that location from within this core. (Can a speculatively executed CPU branch contain opcodes that access RAM?)
Repeated function calls with a stack-args calling convention, and/or returning a value by reference, are cases where the same bytes of stack memory can be reused multiple times.
The second store/reload can execute even if the first store is still waiting for its inputs. (I've tested this on Skylake, but IDK if I ever posted the results in an answer anywhere.)
Registers are accessed way faster than RAM memory, since you don't have to access the "slow" memory bus!
We use registers because they are fast. Usually, they operate at CPU's speed.
Registers and CPU cache are made with different technologies / fabrication processes, and they are expensive. RAM, on the other hand, is cheap and 100 times slower.
Generally speaking register arithmetic is much faster and much preferred. However there are some cases where the direct memory arithmetic is useful.
If all you want to do is increment a number in memory (and nothing else at least for a few million instructions) then a single direct memory arithmetic instruction is usually slightly faster than load/add/store.
Also, if you are doing complex array operations you generally need a lot of registers to keep track of where you are and where your arrays end. On older architectures you could run out of registers really quickly, so the option of adding two bits of memory together without zapping any of your current registers was really useful.
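A minimal sketch of the first point (the name counter is just for illustration): a counter that is only bumped and not used again nearby typically compiles to a single memory-destination add rather than a load/add/store sequence.

static int counter;

void bump(void)
{
    counter += 1;   /* typically compiles to something like: add dword ptr [counter], 1 */
}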
Yes, it's much much much faster to use registers. Even if you only consider the physical distance from processor to register compared to proc to memory, you save a lot of time by not sending electrons so far, and that means you can run at a higher clock rate.
Yes - also you can typically push/pop registers easily for calling procedures, handling interrupts, etc
It's just that the instruction set will not allow you to do such complex operations:
add [0x40001234],[0x40002234]
You have to go through the registers.