Can you directly access the cache using assembly?

Caching is a core thing when it comes to efficiency.
I know that caching usually happens automatically.
However, I'd like to control cache usage myself, because I think that I can do better than some heuristics that don't know the exact program.
Therefore I would need assembly instructions to directly move to or from cache memory cells.
like:
movL1 address content
I know that there are some instructions that give the "caching system" hints, but I'm not sure if that's enough because the hints could be ignored or they maybe aren't sufficient to express anything expressible by such a move to/from cache order.
Are there any assemblers that allow for complete cache control?
Side note: why I'd like to improve caching:
consider a hypothetical CPU with 1 register and a cache containing 2 cells.
consider the following two programs:
(where x,y,z,a are memory cells)
"START"
"move 1 to x"
"move 2 to y"
"move 3 to z"
"move 4 to a"
"move z to x"
"move y to x"
"END"
"START"
"move 1 to x"
"move 2 to y"
"move 3 to z"
"move 4 to a"
"move a to x"
"move y to x"
"END"
In the first case, you'd use the register and the cache for x,y,z (a is only written to once)
In the second case, you'd use the register and the cache for a,x,y (z is only written to once)
If the CPU does the caching, it simply can't decide ahead of time which of the two above cases it's facing.
It has to decide for each of the memory cells x,y,z whether its contents should be cached before it knows whether the program being executed is no. 1 or no. 2, because both programs start out the same.
The programmer on the other hand knows ahead of time which memory cells are reused, and when they are reused.

Peter Cordes wrote:
On most microarchitectures for most ISAs, no, you can't pin a line in cache to stop it from being evicted. The only way to use cache is as a transparent cache that you load/store through.
This is correct, but the exceptions are of interest....
It is common in DSP ("Digital Signal Processing") chips to provide a limited ability to partition SRAM between "cache" and "scratchpad memory" functionality. There are lots of white papers and reference guides on this topic -- an example is http://www.ti.com/lit/ug/sprug82a/sprug82a.pdf. In this chip, there are three blocks of SRAM -- a small "Level-1 Instruction" SRAM, a small "Level-1 Data" SRAM, and a larger "Level-2" SRAM. Each of the three can be partitioned between Cache and directly-addressed memory, with the details depending on the specific chip. For example, a chip may allow no cache, 1/4 SRAM as cache, 1/2 SRAM as cache, or all SRAM as cache. (The ratios are limited so the allowed cache sizes can be indexed efficiently.)
The IBM "Cell" processor (used in the Sony PlayStation 3, released in 2006) was a multi-core chip with one ordinary general-purpose core and eight co-processor cores. The co-processor cores had a limited instruction set, with load and store instructions that could only access their private 256KiB "scratchpad" memory. In order to access main memory, the co-processors had to program a DMA engine to perform a block copy of main memory to local scratchpad memory (or vice versa). This approach provided (and required) perfect control over data motion, resulting in (a very small amount of) very high-performance software.
Some GPUs also have small on-chip SRAMs that can be configured as either an L1 cache or as explicitly controlled local memory.
All of these are considered to be "very hard" (or worse) to use, but this can be the right approach if the product requires very low cost, completely predictable performance, or very low power.
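To make the scratchpad style described above concrete, here is a minimal sketch in C of staging data through a small local buffer. It is not code for any real DSP or for the Cell: `dma_get`, `sum_via_scratchpad` and `SCRATCH_WORDS` are hypothetical names, and the "DMA transfer" is just `memcpy` so that the sketch runs on an ordinary machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scratchpad capacity in 32-bit words, chosen for illustration. */
#define SCRATCH_WORDS 1024

/* Stand-in for a DMA transfer. Real scratchpad hardware would program a DMA
   engine here; memcpy is an assumption made so the sketch runs anywhere. */
static void dma_get(uint32_t *local, const uint32_t *main_mem, size_t n)
{
    memcpy(local, main_mem, n * sizeof *local);
}

/* Sum a large array by explicitly copying one block at a time into the
   local buffer and computing only on the local copy. */
uint64_t sum_via_scratchpad(const uint32_t *main_mem, size_t n)
{
    static uint32_t scratch[SCRATCH_WORDS];
    uint64_t total = 0;

    assert(main_mem != NULL || n == 0);
    for (size_t done = 0; done < n; ) {
        size_t chunk = n - done;
        if (chunk > SCRATCH_WORDS)
            chunk = SCRATCH_WORDS;
        dma_get(scratch, main_mem + done, chunk);   /* explicit data motion */
        for (size_t i = 0; i < chunk; i++)
            total += scratch[i];                    /* compute on local data */
        done += chunk;
    }
    return total;
}
```

The point of the sketch is that the programmer, not the hardware, decides what is resident in fast memory and when, which is also where the extra programming effort comes from.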

On most microarchitectures for most ISAs, no, you can't pin a line in cache to stop it from being evicted. The only way to use cache is as a transparent cache that you load/store through.
Of course, a normal load will definitely bring a cache line into L1d cache, at least temporarily. Nothing stops it from being evicted later, though. e.g. on x86-64: mov eax, [rdi] instead of prefetcht0 [rdi].
Before dedicated prefetch instructions existed, using a plain load as a prefetch was sometimes done (e.g. ahead of some loop-bounds calculations before entering a loop that would start looping over an array). For performance purposes, best-effort software prefetch instructions that the CPU can ignore are usually better.
A plain load has the downside of not being able to retire from the out-of-order back-end until the loaded data actually arrives. (At least I think it can't on x86 CPUs with x86's strongly ordered memory model. Weakly-ordered ISAs that allow out-of-order loads might let the load retire even if it hasn't truly completed yet.) Software prefetch instructions exist to allow prefetch as a hint without bottlenecking the CPU on waiting for the load to finish.
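As a rough illustration of a software prefetch hint from C, the sketch below uses the GCC/Clang builtin `__builtin_prefetch`, which the compiler may turn into an instruction like prefetcht0 on x86. The function name `sum_with_prefetch` and the distance `PREFETCH_AHEAD` are made up for this example. A sequential loop like this one is already handled well by hardware prefetchers, so it only shows the shape of the code, not a case where it would pay off.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical prefetch distance in elements; a useful value depends on the
   machine and the loop, and has to be found by measurement. */
#define PREFETCH_AHEAD 16

uint64_t sum_with_prefetch(const uint32_t *a, size_t n)
{
    uint64_t total = 0;

    assert(a != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__) || defined(__clang__)
        /* Hint only: arguments are (address, 0 = read, 3 = high temporal
           locality). The CPU is free to drop the request. */
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&a[i + PREFETCH_AHEAD], 0, 3);
#endif
        total += a[i];   /* the ordinary load is what actually needs the data */
    }
    return total;
}
```

Unlike a plain load used as a prefetch, the hint does not have to wait for the data before it can retire, and removing it never changes the result.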
On modern x86, forced eviction of a cache line is possible. NT stores guarantee that on Pentium-M or newer, or CPUs after Pentium-M, I forget which. Also, clflush and clflushopt exist specifically for that.
clflush is not just a hint that the CPU can drop; it guarantees correctness for non-volatile DIMMs like Optane DC PM. Why does CLFLUSH exist in x86?
Being guaranteed, not just a hint, makes it slow. You generally don't want to do this for performance. As @old_timer says, burning instructions / cycles micro-managing the cache is almost always a waste of time. Leaving things up to the hardware's pseudo-LRU replacement and HW prefetch algorithms usually provides good results in the long run. SW prefetch can help in a few cases.
Xeon Phi can configure its MCDRAM as a large last-level cache, or as architecturally visible "local memory" that's part of physical address space. But at 6 to 16GiB, it's vastly bigger than the on-die L1/L2 caches, or the L1/L2/L3 caches of modern mainstream CPUs.
Also, x86 CPUs can run in cache-as-RAM no-fill mode, used by the BIOS in early startup before configuring DRAM controllers. But that's really just no fills on read or write, and read-as-zero for invalid lines, so you can't use DRAM at all when no-fill-mode is activated. i.e. only cache is available, and you have to be careful not to evict anything that was cached. It's not usable for any practical purpose except early-boot.
What use is the INVD instruction? and Cache-as-Ram (no fill mode) Executable Code have some details.
I know that there are some instructions that give the "caching system" hints, but I'm not sure if that's enough because the hints could be ignored or they maybe aren't sufficient to express anything expressible by such a move to/from cache order.

Direct access to the cache srams has nothing to do with the instruction set, if you have access then you have access and you access it however the chip/system designers implemented it. It could be as simple as an address space or it may be some indirect peripheral like access where you poke at control registers and that logic accesses that item in the cache for you.
And this doesn't mean that all ARM processors can gain access to their cache in the same way. (arm is an IP company not a chip company) but it might mean that no you can't do this on any existing x86s. I know for a fact on the product I am part of we can do this because we have ECC on those SRAMs and have an access method to initialize the rams from software before enabling the monitor. Some of the srams you can do it through normal accesses, but for example the arm we are using was implemented with parity checking not ECC so we added ECC on the SRAM and a side door access for init because trying to go through the cache with normal accesses and get 100% coverage was a PITA and in the end not the right solution.
I also worked on a product where the dram controller's cache can be accessed directly as an on chip ram; it is up to software to decide whether to use it as an L2 cache or as on chip ram.
So it has and can be done, and these are isolated examples. As part of screening the parts there are mbist tests that run, but often those are driven through jtag and not directly available to the processor and/or the ram isn't, sometimes the mbist can be started and checked by software but the ram can't, and some implementations, the designers made it so software can touch all of it, including tag ram.
Which leads to if you think you can do a better job than the hardware and want to move stuff around then you will also likely need access to the tag ram as well so that you can trace/drive where you want the cache line, its status, etc.
Based on this comment:
Sorry, I'm a [beginner] at assembly, could you please explain this simpler? whats a CPU "mode"? What's that HBM? How to set a CPU mode? what are NDAs? – KGM
Two things: one, you can't do better than the cache, and two, you are not ready for this task.
Even with experience you can't generally do better than the cache. If you want to manipulate the cache, you use that same knowledge in how you write your code, where you place it in memory, and where the data you are using lives, and then the logic implementation can work better for you. Burning instructions and cycles trying to reposition things at run time isn't going to help. You generally need access to the design at a level that is not available to the general public. Thus an NDA (non disclosure agreement), and even then it is extremely unlikely that you will get the info you need and/or the gains will be minimal, may only work on one implementation and not across the whole family of products, etc.
More interesting is what do you think you can do better and how are you thinking you can do it? (also understand that many of us here can make any cache implementation fail and run slower than if it wasn't there, even if you create a newer better cache, by definition it only improves performance in certain cases).

Related

Does page walk take advantage of shared tables?

Suppose two address spaces share a largish lump of non-contiguous memory.
The system might want to share physical page table(s) between them.
These tables wouldn't use Global bits (even if supported), and would tie them to asids if supported.
There are immediate benefits since the data cache will be less polluted than by a copy, less pinned ram, etc.
Does the page walk take explicit advantage of this in any known architecture?
If so, does that imply the mmu is explicitly caching & sharing interior page tree nodes based on physical tag?
Sorry for the multiple questions; it really is one broken down. I am trying to determine if it is worth devising a measurement test for this.

On modern x86 CPUs (like Sandybridge-family), page walks fetch through the cache hierarchy (L1d / L2 / L3), so yes there's an obvious benefit there for having two different page directories point to the same subtree for a shared region of virtual address space. Or for some AMD, fetch through L2, skipping L1d.
What happens after a L2 TLB miss? has more details about the fact that page-walk definitely fetches through cache, e.g. Broadwell perf counters exist to measure hits.
("The MMU" is part of a CPU core; the L1dTLB is tightly coupled to load/store execution units. The page walker is a fairly separate thing, though, and runs in parallel with instruction execution, but is still part of the core and can be triggered speculatively, etc. So it's tightly coupled enough to access memory through L1d cache.)
Higher-level PDEs (page directory entries) can be worth caching inside the page-walk hardware. Section 3 of this paper confirms that Intel and AMD actually do this in practice, so you need to flush the TLB in cases where you might think you didn't need to.
However, I don't think you'll find that PDE caching happening across a change in the top-level page-tables.
On x86, you install a new page table with a mov to CR3; that implicitly flushes all cached translations and internal page-walker PDE caching, like invlpg does for one virtual address. (Or with ASIDs, makes TLB entries from different ASIDs unavailable for hits).
The main issue is that the TLB and page-walker internal caches are not coherent with main memory / data caches. I think all ISAs that do HW page walks at all require manual flushing of TLBs, with semantics like x86 for installing a new page table. (Some ISAs like MIPS only do software TLB management, invoking a special kernel TLB-miss handler; your question won't apply there.)
So yes, they could detect same physical address, but for sanity you also have to avoid using stale cached data from after a store to that physical address.
Without hardware-managed coherence between page-table stores and TLB/pagewalk, there's no way this caching could happen safely.
That said, some x86 CPUs do go beyond what's on paper and do limited coherency with stores, but only protecting you from speculative page walks for backwards compat with OSes that assumed a valid but not-yet-used PTE could be modified without invlpg. http://blog.stuffedcow.net/2015/08/pagewalk-coherence/
So it's not unheard of for microarchitectures to snoop stores to detect stores to certain ranges; you could plausibly have stores snoop the address ranges near locations the page-walker had internally cached, effectively providing coherence for internal page-walker caches.
Modern x86 does in practice detect self-modifying code by snooping for stores near any in-flight instructions (see Observing stale instruction fetching on x86 with self-modifying code). In that case snoop hits are handled by nuking the whole back-end state back to retirement state.
So it's plausible that you could in theory design a CPU with an efficient mechanism to be able to take advantage of this transparently, but it has significant cost (snooping every store against a CAM to check for matches on page-walker-cached addresses) for very low benefit. Unless I'm missing something, I don't think there's an easier way to do this, so I'd bet money that no real designs actually do this.
Hard to imagine outside of x86; almost everything else takes a "weaker" / "fewer guarantees" approach and would only snoop the store buffer (for store-forwarding). CAMs (content-addressable-memory = hardware hash table) are power-hungry, and handling the special case of a hit would complicate the pipeline. Especially an OoO exec pipeline where the store to a PTE might not have its store-address ready until after a load wanted to use that TLB entry. Introducing more pipeline nukes is a bad thing.
The benefit of this would be tiny
After the first page-walk fetches data from L1d cache (or farther away if it wasn't hot in L1d either), then the usual cache-within-page-walker mechanisms can act normally.
So further page walks for nearby pages before the next context switch can benefit from page-walker internal caches. This has benefits, and is what some real HW does (at least some x86; IDK about others).
All the argument above about why this would require snooping for coherent page tables is about having the page-walker internal caches stay hot across a context switch.
L1d can easily do that; VIPT caches that behave like PIPT (no aliasing) simply cache based on physical address and don't need flushing on context switch.
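The "VIPT that behaves like PIPT" condition can be checked with a little arithmetic: if the bytes covered by one way (cache size divided by associativity) fit within a page, all index bits come from the page offset, which is the same in the virtual and physical address. The helper below is a hypothetical sketch of that check; the geometries mentioned in the note after it are example numbers, not taken from the answer.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if a virtually-indexed, physically-tagged cache with this
   geometry is indexed purely by page-offset bits (so it has no aliasing
   and behaves like PIPT), 0 otherwise. */
int vipt_behaves_like_pipt(size_t cache_bytes, size_t ways, size_t page_bytes)
{
    assert(ways != 0);
    /* One way spans cache_bytes / ways bytes; the index and line-offset bits
       together address exactly that span. */
    return cache_bytes / ways <= page_bytes;
}
```

For instance, a 32 KiB 8-way cache with 4 KiB pages gives 4 KiB per way, so the check passes, while a 64 KiB 4-way cache with the same page size gives 16 KiB per way and would need some other way to deal with aliasing.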
If you're context-switching very frequently, the ASIDs let TLB entries proper stay cached. If you're still getting a lot of TLB misses, the worst case is that they have to fetch through cache all the way from the top. This is really not bad and very much not worth spending a lot of transistors and power budget on.
I'm only considering OS on bare metal, not HW virtualization with nested page tables. (Hypervisor virtualizing the guest OS's page tables). I think all the same arguments basically apply, though. Page walk still definitely fetches through cache.

Would buffering cache changes prevent Meltdown?

If new CPUs had a cache buffer which was only committed to the actual CPU cache if the instructions are ever committed would attacks similar to Meltdown still be possible?
The proposal is to make speculative execution be able to load from memory, but not write to the CPU caches until they are actually committed.

TL:DR: yes I think it would solve Spectre (and Meltdown) in their current form (using a flush+reload cache-timing side channel to copy the secret data from a physical register), but probably be too expensive (in power cost, and maybe also performance) to be a likely implementation.
But with hyperthreading (or more generally any SMT), there's also an ALU / port-pressure side-channel if you can get mis-speculation to run data-dependent ALU instructions with the secret data, instead of using it as an array index. The Meltdown paper discusses this possibility before focusing on the flush+reload cache-timing side-channel. (It's more viable for Meltdown than Spectre, because you have much better control of the timing of when the secret data is used).
So modifying cache behaviour doesn't block the attacks. It would take away the reliable side-channel for getting the secret data into the attacking process, though. (i.e. ALU timing has higher noise and thus lower bandwidth to get the same reliability; Shannon's noisy channel theorem), and you have to make sure your code runs on the same physical core as the code under attack.
On CPUs without SMT (e.g. Intel's desktop i5 chips), the ALU timing side-channel is very hard to use with Spectre, because you can't directly use perf counters on code you don't have privilege for. (But Meltdown could still be exploited by timing your own ALU instructions with Linux perf, for example).
Meltdown specifically is much easier to defend against, microarchitecturally, with simpler and cheaper changes to the hard-wired parts of the CPU that microcode updates can't rewire.
You don't need to block speculative loads from affecting cache; the change could be as simple as letting speculative execution continue after a TLB-hit load that will fault if it reaches retirement, but with the value used by speculative execution of later instructions forced to 0 because of the failed permission check against the TLB entry.
So the mis-speculated (after the faulting load of secret) touch array[secret*4096] load would always make the same cache line hot, with no secret-data-dependent behaviour. The secret data itself would enter cache, but not a physical register. (And this stops ALU / port-pressure side-channels, too.)
Stopping the faulting load from even bringing the "secret" line into cache in the first place could make it harder to tell the difference between a kernel mapping and an unmapped page, which could possibly help protect against user-space trying to defeat KASLR by finding which virtual addresses the kernel has mapped. But that's not Meltdown.
Spectre
Spectre is the hard one because the mis-speculated instructions that make data-dependent modifications to microarchitectural state do have permission to read the secret data. Yes, a "load queue" that works similarly to the store queue could do the trick, but implementing it efficiently could be expensive. (Especially given the cache coherency problem that I didn't think of when I wrote this first section.)
(There are other ways of implementing your basic idea; maybe there's even a way that's viable. But extra bits on L1D lines to track their status has downsides and isn't obviously easier.)
The store queue tracks stores from execution until they commit to L1D cache. (Stores can't commit to L1D until after they retire, because that's the point at which they're known to be non-speculative, and thus can be made globally visible to other cores).
A load queue would have to store whole incoming cache lines, not just the bytes that were loaded. (But note that Skylake-X can do 64-byte ZMM stores, so its store-buffer entries do have to be the size of a cache line. But if they can borrow space from each other or something, then there might not be 64 * entries bytes of storage available, i.e. maybe only the full number of entries is usable with scalar or narrow-vector stores. I've never read anything about a limitation like this, so I don't think there is one, but it's plausible)
A more serious problem is that Intel's current L1D design has 2 read ports + 1 write port. (And maybe another port for writing lines that arrive from L2 in parallel with committing a store? There was some discussion about that on Unexpectedly poor and weirdly bimodal performance for store loop on Intel Skylake.)
If your loaded data can't enter L1D until after the loads retire, then they're probably going to be competing for the same write port that stores use.
Loads that hit in L1D can still come directly from L1D, though, and loads that hit in the memory-order-buffer could still be executed at 2 per clock. (The MOB would now include this new load queue as well as the usual store queue + markers for loads to maintain x86 memory ordering semantics). You still need both L1D read ports to maintain performance for code that doesn't touch a lot of new memory, and mostly is reloading stuff that's been hot in L1D for a while.
This would make the MOB about twice as large (in terms of data storage), although it doesn't need any more entries. As I understand it, the MOB in current Intel CPUs is composed of the individual load-buffer and store-buffer entries. (Haswell has 72 and 42 respectively).
Hmm, a further complication is that the load data in the MOB has to maintain cache coherency with other cores. This is very different from store data, which is private and hasn't become globally visible / isn't part of the global memory order and cache coherency until it commits to L1D.
So this proposed "load queue" implementation mechanism for your idea is probably not feasible without tweaks: it would have to be checked by invalidation-requests from other cores, so that's another read-port needed in the MOB.
Any possible implementation would have the problem of needing to later commit to L1D like a store. I think it would be a significant burden not to be able to evict + allocate a new line when it arrived from off-core.
(Even allowing speculative eviction but not speculative replacement from conflicts leaves open a possible cache-timing attack. You'd prime all the lines and then do a load that would evict one from one set of lines or another, and find which line was evicted instead of which one was fetched using a similar cache-timing side channel. So using extra bits in L1D to find / evict lines loaded during recovery from mis-speculation wouldn't eliminate this side-channel.)
Footnote: all instructions are speculative. This question is worded well, but I think many people reading about OoO exec and thinking about Meltdown / Spectre fall into this trap of confusing speculative execution with mis-speculation.
Remember that all instructions are speculative when they're executed. It's not known to be correct speculation until retirement. Meltdown / Spectre depend on accessing secret data and using it during mis-speculation. But the basis of current OoO CPU designs is that you don't know whether you've speculated correctly or not; everything is speculative until retirement.
Any load or store could potentially fault, and so can some ALU instructions (e.g. floating point if exceptions are unmasked), so any performance cost that applies "only when executing speculatively" actually applies all the time. This is why stores can't commit from the store queue into L1D until after the store uops have retired from the out-of-order CPU core (with the store data in the store queue).
However, I think conditional and indirect branches are treated specially, because they're expected to mis-speculate some of the time, and optimizing recovery for them is important. Modern CPUs do better with branches than just rolling back to the current retirement state when a mispredict is detected, I think using a checkpoint buffer of some sort. So out-of-order execution for instructions before the branch can continue during recovery.
But loop and other branches are very common, so most code executes "speculatively" in this sense, too, with at least one branch-rollback checkpoint not yet verified as correct speculation. Most of the time it's correct speculation, so no rollback happens.
Recovery for mis-speculation of memory ordering or faulting loads is a full pipeline-nuke, rolling back to the retirement architectural state. So I think only branches consume the branch checkpoint microarchitectural resources.
Anyway, all of this is what makes Spectre so insidious: the CPU can't tell the difference between mis-speculation and correct speculation until after the fact. If it knew it was mis-speculating, it would initiate rollback instead of executing useless instructions / uops. Indirect branches are not rare, either (in user-space); every DLL or shared library function call uses one in normal executables on Windows and Linux.

I suspect the overhead from buffering and committing the buffer would render the specEx/caching useless?
This is purely speculative (no pun intended) - I would love to see someone with a lower level background weigh in on this!

Techniques available to control data/instructions in/out of the cache?

I have encountered some Intel compiler intrinsic functions which I believe allow developers to bypass the cache?
http://software.intel.com/sites/products/documentation/doclib/stdxe/2013/composerxe/compiler/fortran-mac/GUID-AF42A867-B796-4D29-8FED-C20193FD87E0.htm
I have also come across the GCC compiler prefetch keyword, although I cannot admit to fully appreciating what this does.
With the above in mind I wondered if any members could either elaborate on the above (which I badly described) or provide other techniques which allow the developer to have close control over which data (or instructions) is/isn't loaded in the CPU cache?

This page contains a lot of information about all intrinsics:
Intel Intrinsics Guide
The series of instructions that will write data to memory, avoiding cache evictions are generally named _mm_stream_.... As the name implies, these are ideal for applications that write a large stream of data that is basically contiguous in memory and unlikely to be accessed again in the near future. So, for example, if you are mixing audio buffers and producing a single waveform output this would work well.
One of the keys to using these instructions effectively is taking advantage of write combining. If your write locations are scattered throughout memory, these instructions will stall as badly, or possibly worse than any other kind of memory storage instruction you attempt. Since these writes do not wind up in cache, if you're not filling an entire write buffer then essentially your operation becomes a write-through operation, requiring a stall until the write is completed. If you are writing contiguous memory locations then write combining will apply, and make your data writes much more efficient.
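A minimal sketch of the contiguous-write case described above, in C. `_mm_stream_si32` and `_mm_sfence` are real SSE2/SSE intrinsics from the Intel headers; the function name `fill_streaming` is made up here, and the plain-store fallback for non-x86 builds is an assumption added only so the sketch compiles everywhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>   /* _mm_stream_si32; also pulls in _mm_sfence */
#endif

/* Fill a buffer front to back with non-temporal stores, so consecutive
   stores can be merged in the write-combining buffers instead of pulling
   the destination lines into cache. */
void fill_streaming(int32_t *dst, size_t n, int32_t value)
{
    assert(dst != NULL || n == 0);
#if defined(__SSE2__)
    for (size_t i = 0; i < n; i++)
        _mm_stream_si32((int *)&dst[i], value);   /* contiguous NT stores */
    _mm_sfence();   /* order the NT stores before any later stores */
#else
    for (size_t i = 0; i < n; i++)
        dst[i] = value;                           /* fallback: plain stores */
#endif
}
```

The writes are strictly sequential on purpose: scattering them across memory would defeat write combining, which is the failure mode the paragraph above warns about.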
The flip side of that coin is prefetching. Prefetching tells the system to start pulling a memory address into the desired level of cache so that by the time the memory read is complete, you are ready to use the data. This is much harder to use, and requires an appropriate data "stride" which takes into account the cache sizes, cache line size, and the number of instructions which can execute before the memory read completes. Using the hinting parameter, you can "suggest" that the data goes into the L1, L2, or L3 cache, or that it is "non-temporal", meaning that you're just going to use it once and it should be evicted first before any other cache evictions. The hardware has its own prefetching heuristics that work well for most problems without explicit prefetching instructions, but the classic counter-example is a matrix transpose:
Prefetching examples
Prefetching is generally very difficult to use effectively except in some very specific cases like this. Without a more specific problem statement from you, this is about all I can provide.

Why do we bother with CPU registers in assembly, instead of just working directly with memory?

I have a basic question about assembly.
Why do we bother doing arithmetic operations only on registers if they can work on memory as well?
For example both of the following cause (essentially) the same value to be calculated as an answer:
Snippet 1
.data
var dd 00000400h
.code
Start:
add var,0000000Bh
mov eax,var
;breakpoint: var = 0000040B
End Start
Snippet 2
.code
Start:
mov eax,00000400h
add eax,0000000bh
;breakpoint: eax = 0000040B
End Start
From what I can see most texts and tutorials do arithmetic operations mostly on registers. Is it just faster to work with registers?

If you look at computer architectures, you find a series of levels of memory. Those that are close to the CPU are fast, expensive (per bit), and therefore small, while at the other end you have big, slow and cheap memory devices. In a modern computer, these are typically something like:
CPU registers (slightly complicated, but on the order of 1KB per core - there are different types of registers. You might have 16 64-bit general-purpose registers plus a bunch of registers for special purposes)
L1 cache (64KB per core)
L2 cache (256KB per core)
L3 cache (8MB)
Main memory (8GB)
HDD (1TB)
The internet (big)
Over time, more and more levels of cache have been added - I can remember a time when CPUs didn't have any onboard caches, and I'm not even old! These days, HDDs come with onboard caches, and the internet is cached in any number of places: in memory, on the HDD, and maybe on caching proxy servers.
There is a dramatic (often orders of magnitude) decrease in bandwidth and increase in latency in each step away from the CPU. For example, a HDD might be able to be read at 100MB/s with a latency of 5ms (these numbers may not be exactly correct), while your main memory can read at 6.4GB/s with a latency of 9ns (a latency gap of almost six orders of magnitude!). Latency is a very important factor, as you don't want to keep the CPU waiting any longer than it has to (this is especially true for architectures with deep pipelines, but that's a discussion for another day).
The idea is that you will often be reusing the same data over and over again, so it makes sense to put it in a small fast cache for subsequent operations. This is referred to as temporal locality. Another important principle of locality is spatial locality, which says that memory locations near each other will likely be read at about the same time. It is for this reason that reading from RAM will cause a much larger block of RAM to be read and put into on-CPU cache. If it wasn't for these principles of locality, then any location in memory would have an equally likely chance of being read at any one time, so there would be no way to predict what will be accessed next, and all the levels of cache in the world will not improve speed. You might as well just use a hard drive, but I'm sure you know what it's like to have the computer come to a grinding halt when paging (which is basically using the HDD as an extension to RAM). It is conceptually possible to have no memory except for a hard drive (and many small devices have a single memory), but this would be painfully slow compared to what we're familiar with.
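Spatial locality is easy to see in code. The two hypothetical functions below compute the same sum over a row-major matrix; the first walks memory in address order, so each cache line that is fetched gets fully used, while the second jumps `cols` elements between accesses and on a large matrix touches a new line almost every time. This is a sketch with made-up names, not a benchmark.

```c
#include <assert.h>
#include <stddef.h>

/* Walk the matrix in the order it is laid out in memory (row-major). */
long sum_row_order(const int *m, size_t rows, size_t cols)
{
    long total = 0;
    assert(m != NULL || rows * cols == 0);
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            total += m[r * cols + c];   /* consecutive addresses */
    return total;
}

/* Same result, but each access is cols elements away from the last one. */
long sum_column_order(const int *m, size_t rows, size_t cols)
{
    long total = 0;
    assert(m != NULL || rows * cols == 0);
    for (size_t c = 0; c < cols; c++)
        for (size_t r = 0; r < rows; r++)
            total += m[r * cols + c];   /* strided addresses */
    return total;
}
```

Both functions return the same value; only the order of memory accesses differs, which is exactly the property caches are built to exploit.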
One other advantage of having registers (and only a small number of registers) is that it lets you have shorter instructions. If you have instructions that contain two (or more) 64 bit addresses, you are going to have some long instructions!

Because RAM is slow. Very slow.
Registers are placed inside the CPU, right next to the ALU so signals can travel almost instantly. They're also the fastest memory type but they take significant space so we can have only a limited number of them. Increasing the number of registers increases
die size
distance needed for signals to travel
work to save the context when switching between threads
number of bits in the instruction encoding
Read If registers are so blazingly fast, why don't we have more of them?
More commonly used data is placed in caches for faster access. In the past, caches were very expensive, so they were an optional part that could be purchased separately and plugged into a socket outside the CPU. Nowadays they are usually on the same die as the CPU. Caches are built from SRAM cells, which are smaller than register cells but perhaps tens or hundreds of times slower.
Main memory is made from DRAM, which needs only one transistor per cell but is thousands of times slower than registers, hence we can't work with only DRAM in a high-performance system. However, some embedded systems do use the register file as main memory, so there the registers are also main memory.
More information: Can we have a computer with just registers as memory?

Registers are much faster and also the operations that you can perform directly on memory are far more limited.

In reality, there are tiny implementations that do not separate registers from memory. They can expose this, for example, by having 512 bytes of RAM of which the first 64 are exposed as 32 16-bit registers while remaining accessible as addressable RAM. Or, as another example, the MOS Technology 6502 "zero page" (RAM range 0-255, accessed using a 1-byte address) was a poor substitute for registers, made necessary by the small number of real registers in the CPU. But this scales poorly to larger setups.
The advantages of registers are the following:
1. They are the fastest storage. In a typical modern system they are faster than any cache, and even more so than DRAM. (In the example above, the RAM is likely SRAM. But a few gigabytes of SRAM would be unusably expensive.) They are also close to the processor. The ratio between DRAM access time and register access time can reach values like 200 or even 1000. Even compared to L1 cache, register access is typically 2-4 times faster.
2. Their number is limited. A typical instruction set would become too bloated if every memory location had to be addressed explicitly.
3. Registers are specific to each CPU (core, hardware thread, hart) separately. (In systems where fixed RAM addresses serve the role of special registers, as e.g. zSeries does, this needs special remapping of such a service area in absolute addresses, separate for each core.)
4. In the same manner as (3), registers are specific to each process thread, without a need to adjust locations in code for a thread.
5. Registers (relatively easily) allow specific optimizations, such as register renaming. This is too complex if memory addresses are used.
Additionally, there are registers that could not be implemented in a separate block of RAM because accessing RAM requires changing them. I mean the "execution phase" register in the simplest CPU designs, which takes values like "instruction extracting phase", "instruction decoding phase", "ALU phase", "data writing phase" and so on, and the equivalents of this register in more complicated (pipelined, out-of-order) designs; also the various buffer registers used for bus access, and so on. But such registers are not visible to the programmer, so you likely did not mean them.

x86, like pretty much every other "normal" CPU you might learn assembly for, is a register machine (see footnote 1). There are other ways to design something that you can program (e.g. a Turing machine that moves along a logical "tape" in memory, or the Game of Life), but register machines have proven to be basically the only way to go for high performance.
https://www.realworldtech.com/architecture-basics/2/ covers possible alternatives like accumulator or stack machines which are also obsolete now. Although it omits CISCs like x86 which can be either load-store or register-memory. x86 instructions can actually be reg,mem; reg,reg; or even mem,reg. (Or with an immediate source.)
Footnote 1: The abstract model of computation called a register machine doesn't distinguish between registers and memory; what it calls registers are more like memory in real computers. I say "register machine" here to mean a machine with multiple general-purpose registers, as opposed to just one accumulator, or a stack machine or whatever. Most x86 instructions have 2 explicit operands (but it varies), up to one of which can be memory. Even microcontrollers like 6502 that can only really do math into one accumulator register almost invariably have some other registers (e.g. for pointers or indices), unlike true toy ISAs like Marie or LMC that are extremely inefficient to program for because you need to keep storing and reloading different things into the accumulator, and can't even keep an array index or loop counter anywhere that you can use it directly.
Since x86 was designed to use registers, you can't really avoid them entirely, even if you wanted to and didn't care about performance.
Current x86 CPUs can read/write many more registers per clock cycle than memory locations.
For example, Intel Skylake can do two loads and one store from/to its 32KiB 8-way associative L1D cache per cycle (best case), but can read upwards of 10 registers per clock, and write 3 or 4 (plus EFLAGS).
Building an L1D cache with as many read/write ports as the register file would be prohibitively expensive (in transistor count/area and power usage), especially if you wanted to keep it as large as it is. It's probably just not physically possible to build something that can use memory the way x86 uses registers with the same performance.
Also, writing a register and then reading it again has essentially zero latency because the CPU detects this and forwards the result directly from the output of one execution unit to the input of another, bypassing the write-back stage. (See https://en.wikipedia.org/wiki/Classic_RISC_pipeline#Solution_A._Bypassing).
These result-forwarding connections between execution units are called the "bypass network" or "forwarding network", and it's much easier for the CPU to do this for a register design than if everything had to go into memory and back out. The CPU only has to check a 3 to 5 bit register number, instead of a 32-bit or 64-bit address, to detect cases where the output of one instruction is needed right away as the input for another operation. (And those register numbers are hard-coded into the machine-code, so they're available right away.)
As others have mentioned, 3 or 4 bits to address a register make the machine-code format much more compact than if every instruction had absolute addresses.
See also https://en.wikipedia.org/wiki/Memory_hierarchy: you can think of registers as a small fast fixed-size memory space separate from main memory, where only direct absolute addressing is supported. (You can't "index" a register: given an integer N in one register, you can't get the contents of the Nth register with one insn.)
Registers are also private to a single CPU core, so out-of-order execution can do whatever it wants with them. With memory, it has to worry about what order things become visible to other CPU cores.
Having a fixed number of registers is part of what lets CPUs do register-renaming for out-of-order execution. Having the register-number available right away when an instruction is decoded also makes this easier: there's never a read or write to a not-yet-known register.
See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) for an explanation of register renaming, and a specific example (the later edits to the question / later parts of my answer showing the speedup from unrolling with multiple accumulators to hide FMA latency even though it reuses the same architectural register repeatedly).
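As a rough illustration of the multiple-accumulator idea mentioned above, here is a sketch in C. It uses plain float addition rather than FMA, and the function names are mine. With one accumulator every add has to wait for the previous one; with four independent accumulators the CPU can keep several adds in flight. Note that floating-point addition is not associative, so the two versions can differ in the last bits for general inputs.

```c
#include <assert.h>
#include <stddef.h>

/* One accumulator: each add depends on the result of the previous add,
   so the loop runs at one add per add-latency. */
float sum_single(const float *x, size_t n)
{
    assert(x != NULL || n == 0);
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += x[i];
    return acc;
}

/* Four accumulators: four independent dependency chains that the
   out-of-order core can overlap. */
float sum_four(const float *x, size_t n)
{
    assert(x != NULL || n == 0);
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += x[i];
        a1 += x[i + 1];
        a2 += x[i + 2];
        a3 += x[i + 3];
    }
    for (; i < n; i++)      /* leftover elements */
        a0 += x[i];
    return (a0 + a1) + (a2 + a3);
}
```

Whether the compiler does this transformation for you depends on flags such as those that relax FP strictness, so check the generated asm before assuming either way.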
The store buffer with store forwarding does basically give you "memory renaming". A store/reload to a memory location is independent of earlier stores and loads to that location from within this core. (Can a speculatively executed CPU branch contain opcodes that access RAM?)
Repeated function calls with a stack-args calling convention, and/or returning a value by reference, are cases where the same bytes of stack memory can be reused multiple times.
The second store/reload can execute even if the first store is still waiting for its inputs. (I've tested this on Skylake, but IDK if I ever posted the results in an answer anywhere.)

Registers are accessed way faster than RAM memory, since you don't have to access the "slow" memory bus!

We use registers because they are fast. Usually, they operate at the CPU's speed.
Registers and CPU cache are made with a different technology / fabrication process than RAM, and they are expensive. RAM, on the other hand, is cheap and about 100 times slower.

Generally speaking register arithmetic is much faster and much preferred. However there are some cases where the direct memory arithmetic is useful.
If all you want to do is increment a number in memory (and nothing else at least for a few million instructions) then a single direct memory arithmetic instruction is usually slightly faster than load/add/store.
Also, if you are doing complex array operations you generally need a lot of registers to keep track of where you are and where your arrays end. On older architectures you could run out of registers really quickly, so the option of adding two bits of memory together without zapping any of your current registers was really useful.

Yes, it's much much much faster to use registers. Even if you only consider the physical distance from processor to register compared to proc to memory, you save a lot of time by not sending electrons so far, and that means you can run at a higher clock rate.

Yes - also you can typically push/pop registers easily for calling procedures, handling interrupts, etc

It's just that the instruction set will not allow you to do such complex operations:
add [0x40001234],[0x40002234]
You have to go through the registers.

Seeking articles on shared memory locking issues

I'm reviewing some code and feel suspicious of the technique being used.
In a linux environment, there are two processes that attach multiple
shared memory segments. The first process periodically loads a new set
of files to be shared, and writes the shared memory id (shmid) into
a location in the "master" shared memory segment. The second process
continually reads this "master" location and uses the shmid to attach
the other shared segments.
On a multi-cpu host, it seems to me it might be implementation dependent
as to what happens if one process tries to read the memory while it's
being written by the other. But perhaps hardware-level bus locking prevents
mangled bits on the wire? It wouldn't matter if the reading process got
a very-soon-to-be-changed value, it would only matter if the read was corrupted
to something that was neither the old value nor the new value. This is an edge case: only 32 bits are being written and read.
Googling for shmat stuff hasn't led me to anything that's definitive in this
area.
I suspect strongly it's not safe or sane, and what I'd really
like is some pointers to articles that describe the problems in detail.

It is legal -- as in the OS won't stop you from doing it.
But is it smart? No, you should have some type of synchronization.
There wouldn't be "mangled bits on the wire". They will come out either as ones or zeros. But there's nothing to say that all your bits will be written out before another process tries to read them. And there are NO guarantees on how fast they'll be written vs how fast they'll be read.
You should always assume there is absolutely NO relationship between the actions of 2 processes (or threads for that matter).
Hardware level bus locking does not happen unless you get it right. It can be harder than expected to make your compiler / library / os / cpu get it right. Synchronization primitives are written to make sure it happens right.
Locking will make it safe, and it's not that hard to do. So just do it.
@unknown - The question has changed somewhat since my answer was posted. However, the behavior you describe is definitely platform (hardware, os, library and compiler) dependent.
Without giving the compiler specific instructions, you are actually not guaranteed to have 32 bits written out in one shot. Imagine a situation where the 32 bit word is not aligned on a word boundary. This unaligned access is acceptable on x86, and in the case of the x86, the access is turned into a series of aligned accesses by the cpu.
An interrupt can occur between those operations. If a context switch happens in the middle, some of the bits are written, some aren't. Bang, You're Dead.
Also, let's think about 16 bit cpus or 64 bit cpus. Both of which are still popular and don't necessarily work the way you think.
So, actually you can have a situation where "some other cpu-core picks up a word sized value 1/2 written to". You write your code as if this type of thing is expected to happen if you are not using synchronization.
Now, there are ways to perform your writes to make sure that you get a whole word written out. Those methods fall under the category of synchronization, and creating synchronization primitives is the type of thing that's best left to the library, compiler, os, and hardware designers. Especially if you are interested in portability (which you should be, even if you never port your code).

The problem's actually worse than some of the people have discussed. Zifre is right that on current x86 CPUs memory writes are atomic, but that is rapidly ceasing to be the case - memory writes are only atomic for a single core - other cores may not see the writes in the same order.
In other words if you do
a = 1;
b = 2;
on CPU 2 you might see location b modified before location 'a' is. Also if you're writing a value that's larger than the native word size (32 bits on a 32-bit processor) the writes are not atomic - so the high 32 bits of a 64 bit write will hit the bus at a different time from the low 32 bits of the write. This can complicate things immensely.
Use a memory barrier and you'll be ok.
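For what it's worth, here is one way the a/b example could be written with C11 atomics instead of a hand-rolled barrier. This is only a sketch: the names `shared_a`, `shared_b`, `publish` and `try_consume` are mine, and it assumes a C11 compiler with `<stdatomic.h>`. The release store on `shared_b` keeps the earlier write to `shared_a` from being reordered after it, and the acquire load on the reader side pairs with it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

static int        shared_a;   /* plain data, written first           */
static atomic_int shared_b;   /* flag, written last with release     */

/* Writer side: a = 1; b = 2; with the ordering made explicit. */
void publish(void)
{
    shared_a = 1;
    atomic_store_explicit(&shared_b, 2, memory_order_release);
}

/* Reader side: returns 1 and fills *out only once b == 2 has been
   observed; in that case the write a = 1 is guaranteed to be visible. */
int try_consume(int *out)
{
    assert(out != NULL);
    if (atomic_load_explicit(&shared_b, memory_order_acquire) != 2)
        return 0;
    *out = shared_a;
    return 1;
}
```

The important part is that both sides cooperate: a release store without a matching acquire load (or the other way round) does not give you the ordering.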

You need locking somewhere. If not at the code level, then at the hardware memory cache and bus.
You are probably OK on a post-PentiumPro Intel CPU. From what I just read, Intel made their later CPUs essentially ignore the LOCK prefix on machine code. Instead the cache coherency protocols make sure that the data is consistent between all CPUs. So if the code writes data that doesn't cross a cache-line boundary, it will work. The order of memory writes that cross cache-lines isn't guaranteed, so multi-word writes are risky.
If you are using anything other than x86 or x86_64 then you are not OK. Many non-Intel CPUs (and perhaps Intel Itanium) gain performance by using explicit cache coherency machine commands, and if you do not use them (via custom ASM code, compiler intrinsics, or libraries) then writes to memory via cache are not guaranteed to ever become visible to another CPU or to occur in any particular order.
So just because something works on your Core2 system doesn't mean that your code is correct. If you want to check portability, try your code also on other SMP architectures like PPC (an older MacPro or a Cell blade) or an Itanium or an IBM Power or ARM. The Alpha was a great CPU for revealing bad SMP code, but I doubt you can find one.

Two processes, two threads, two cpus, two cores all require special attention when sharing data through memory.
This IBM article provides an excellent overview of your options.
Anatomy of Linux synchronization methods
Kernel atomics, spinlocks, and mutexes
by M. Tim Jones (mtj@mtjones.com), Consultant Engineer, Emulex
http://www.ibm.com/developerworks/linux/library/l-linux-synchronization.html

I actually believe this should be completely safe (but it depends on the exact implementation). Assuming the "master" segment is basically an array, as long as the shmid can be written atomically (if it's 32 bits then probably okay), and the second process is just reading, you should be okay. Locking is only needed when both processes are writing, or the values being written cannot be written atomically. You will never get a corrupted (half-written) value. Of course, there may be some strange architectures that can't handle this, but on x86/x64 it should be okay (and probably also ARM, PowerPC, and other common architectures).

Read Memory Ordering in Modern Microprocessors, Part I and Part II
They give the background to why this is theoretically unsafe.
Here's a potential race:
Process A (on CPU core A) writes to a new shared memory region
Process A puts that shared memory ID into a shared 32-bit variable (that is 32-bit aligned - any compiler will try to align like this if you let it).
Process B (on CPU core B) reads the variable. Assuming 32-bit size and 32-bit alignment, it shouldn't get garbage in practise.
Process B tries to read from the shared memory region. Now, there is no guarantee that it'll see the data A wrote, because you missed out the memory barrier. (In practise, there probably happened to be memory barriers on CPU B in the library code that maps the shared memory segment; the problem is that process A didn't use a memory barrier).
Also, it's not clear how you can safely free the shared memory region with this design.
With the latest kernel and libc, you can put a pthreads mutex into a shared memory region. (This does need a recent version with NPTL - I'm using Debian 5.0 "lenny" and it works fine). A simple lock around the shared variable would mean you don't have to worry about arcane memory barrier issues.
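A minimal sketch of the mutex-in-shared-memory approach, assuming the "master" segment is laid out as the hypothetical `struct master_segment` below (that struct and the three function names are mine, not part of any API). The key call is `pthread_mutexattr_setpshared` with `PTHREAD_PROCESS_SHARED`; without it the mutex is only valid within one process. In real use the pointer would come from `shmat`, and only the creating process would call `master_init`.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout of the "master" segment: a lock plus the id. */
struct master_segment {
    pthread_mutex_t lock;
    int32_t         shmid;
};

/* Call once, in the creating process, before any other process attaches.
   Returns 0 on success or a pthread error number. */
int master_init(struct master_segment *m)
{
    assert(m != NULL);
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0)
        return rc;
    rc = pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    if (rc == 0) {
        m->shmid = -1;                       /* nothing published yet */
        rc = pthread_mutex_init(&m->lock, &attr);
    }
    pthread_mutexattr_destroy(&attr);
    return rc;
}

/* Writer process: publish a new shmid under the lock. */
void master_publish(struct master_segment *m, int32_t shmid)
{
    pthread_mutex_lock(&m->lock);
    m->shmid = shmid;
    pthread_mutex_unlock(&m->lock);
}

/* Reader process: fetch the current shmid under the lock. */
int32_t master_read(struct master_segment *m)
{
    pthread_mutex_lock(&m->lock);
    int32_t id = m->shmid;
    pthread_mutex_unlock(&m->lock);
    return id;
}
```

On older toolchains you may need to build with `-pthread`. This still doesn't answer how to safely free the old segments; it only covers the handoff of the id.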

I can't believe you're asking this. NO it's not safe necessarily. At the very least, this will depend on whether the compiler produces code that will atomically set the shared memory location when you set the shmid.
Now, I don't know Linux, but I suspect that a shmid is 16 to 64 bits. That means it's at least possible that all platforms would have some instruction that could write this value atomically. But you can't depend on the compiler doing this without being asked somehow.
Details of memory implementation are among the most platform-specific things there are!
BTW, it may not matter in your case, but in general, you have to worry about locking, even on a single CPU system. In general, some device could write to the shared memory.

I agree that it might work - so it might be safe, but not sane.
The main question is whether this low-level sharing is really needed - I am not an expert on Linux, but I would consider using, for instance, a FIFO queue for the master shared memory segment, so that the OS does the locking work for you. Consumers/producers usually need queues for synchronization anyway.

Legal? I suppose. Depends on your "jurisdiction". Safe and sane? Almost certainly not.
Edit: I'll update this with more information.
You might want to take a look at this Wikipedia page; particularly the section on "Coordinating access to resources". In particular, the Wikipedia discussion essentially describes a confidence failure; non-locked access to shared resources can, even for atomic resources, cause a misreporting / misrepresentation of the confidence that an action was done. Essentially, in the time period between checking to see whether or not it CAN modify the resource, the resource gets externally modified, and therefore, the confidence inherent in the conditional check is busted.

I don't believe anybody here has discussed how much of an impact lock contention can have over the bus, especially on bus bandwidth constrained systems.
Here is an article about this issue in some depth; they discuss some alternative scheduling algorithms which reduce the overall demand for exclusive access through the bus. This increases total throughput, in some cases by over 60% compared with a naive scheduler (when considering the cost of an explicit lock prefix instruction or implicit xchg cmpx..). The paper is not the most recent work and has not much in the way of real code (dang academics), but it's worth the read and consideration for this problem.
More recent CPU instruction sets provide alternatives to a simple lock whatever.
Jeffr, from FreeBSD (author of many internal kernel components), discusses monitor and mwait, 2 instructions added with SSE3, where a simple test case identified an improvement of 20%. He later postulates:
So this is now the first stage in the
adaptive algorithm, we spin a while,
then sleep at a high power state, and
then sleep at a low power state
depending on load.
...
In most cases we're still idling in
hlt as well, so there should be no
negative effect on power. In fact, it
wastes a lot of time and energy to
enter and exit the idle states so it
might improve power under load by
reducing the total cpu time required.
I wonder what would be the effect of using pause instead of hlt.
From Intel's TBB;
ALIGN 8
PUBLIC __TBB_machine_pause
__TBB_machine_pause:
L1:
dw 090f3H; pause
add ecx,-1
jne L1
ret
end
Art of Assembly also uses synchronization w/o the use of the lock prefix or xchg. I haven't read that book in a while and won't speak directly to its applicability in a user-land protected mode SMP context, but it's worth a look.
Good luck!

If the shmid has some type other than volatile sig_atomic_t then you can be pretty sure that separate threads will get in trouble even on the very same CPU. If the type is volatile sig_atomic_t then you can't be quite as sure, but you still might get lucky because multithreading can do more interleaving than signals can do.
If the shmid crosses cache lines (partly in one cache line and partly in another) then while the writing cpu is writing you may well find a reading cpu reading part of the new value and part of the old value.
This is exactly why instructions like "compare and swap" were invented.
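For reference, a compare-and-swap on a 32-bit slot can be sketched with C11 atomics like this (the function name `replace_shmid` is hypothetical; `atomic_compare_exchange_strong` is the standard C11 call). The swap only happens if the slot still holds the value the caller expected, and the whole check-and-write is one atomic step.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Replace *slot with new_id only if it still holds expected_id.
   Returns 1 if the swap happened, 0 if someone else changed the slot
   in the meantime (in which case *slot is left untouched). */
int replace_shmid(_Atomic int32_t *slot, int32_t expected_id, int32_t new_id)
{
    assert(slot != NULL);
    return atomic_compare_exchange_strong(slot, &expected_id, new_id);
}
```

On x86 this typically compiles down to a `lock cmpxchg`, but that is up to the compiler and target.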

Sounds like you need a Reader-Writer Lock : http://en.wikipedia.org/wiki/Readers-writer_lock.

The answer is - it's absolutely safe to do reads and writes simultaneously.
It is clear that the shm mechanism
provides bare-bones tools for the
user. All access control must be taken
care of by the programmer. Locking and
synchronization is being kindly
provided by the kernel, this means the
user have less worries about race
conditions. Note that this model
provides only a symmetric way of
sharing data between processes. If a
process wishes to notify another
process that new data has been
inserted to the shared memory, it will
have to use signals, message queues,
pipes, sockets, or other types of IPC.
From Shared Memory in Linux article.
The latest Linux shm implementation just uses copy_to_user and copy_from_user calls, which are synchronised with memory bus internally.
