When are page frame specific cache management policies useful? - linux-kernel

I'm reading the O'Reilly Linux Kernel book and one of the things that was pointed out during the chapter on paging is that the Pentium cache lets the operating system associate a different cache management policy with each page frame. So I get that there could be scenarios where a program has very little spacial/temporal locality and memory accesses are random/infrequent enough that the probability of cache hits is below some sort of threshold.
I was wondering whether this mechanism is actually used in practice today? Or is it more of a feature that was necessary back when caches where fairly small and not as efficient as they are now? I could see it being useful for an embedded system with little overhead as far as system calls are necessary, are there other applications I am missing?

Having multiple cache management policies is widely used, whether by assigning whole regions using MTRRs (fixed/dynamic, as explained in Intel's PRM), MMIO regions, or through special instructions (e.g. streaming loads/stores, non-temporal prefetches, etc..). The use-cases also vary a lot, whether you're trying to map an external I/O device into virtual memory (and don't want CPU caching to impact its coherence), or whether you want to define a writethrough region for better integrity management of some database, or just want plain writeback to maximize the cache-hierarchy capacity and replacement efficiency (which means performance).
These usages often overlap (especially when multiple applications are running), so the flexibility is very much needed, as you said - you don't want data with little to no spatial/temporal locality to thrash out other lines you use all the time.
By the way, caches are never going to be big enough in the foreseeable future (with any known technology), since increasing them requires you to locate them further away from the core and pay in latency. So cache management is still, and is going to be for a long while, one of the most important things for performance critical systems and applications

Related

What is the point of MESI on Intel 64 and IA-32

The point of MESI is to retain a notion of a shared memory system.
However, with store buffers, things are complicated:
Memory is coherent downstream of once the data hits the MESI-implementing caches.
However, upstream of that, each core may disagree on what is in memory location X, dependent on what is in each core's local store buffer.
As such, it seems like, from the viewpoint of each core, that the state of memory is different - it is not coherent.
So, why do we bother "partially" enforcing coherency with MESI?
Edit: A substantial edit was made, after some further narrowing of what was really confusing me. I have tried to keep the general notion of the question the same, to preserve the relevance of the great answers received.
The point of MESI on x86 is the same as on pretty much any multiple core/CPU system: to enforce cache consistency. There is no "partial coherency" used for the cache coherency part of the equation on x86: the caches are fully coherent. The possible re-orderings, then, are a result of both the coherent caching system and the interaction with core-local components such as the load/store subsystem (especially store buffers) and other out-of-order machinery.
The result of that interaction is the architected strong memory model that x86 provides, with only limited re-ordering. Without coherent caches, you couldn't reasonably implement this model at all, or almost any model that was anything other than completely weak1.
Your question seems to embed the assumption that there are only possible states "coherent" and "everything every else". Also, there is some mixing of the ideas of cache coherency (which mostly deals with the caches specifically, and is mostly a hidden detail), and the memory consistency model which is architecturally defined and will be implemented by each architecture2. Wikipedia explains that one difference between cache coherency and memory consistency is that the rules for the former applies only to one location at a time, whereas consistency rules apply across locations. In practice, the more important distinction is that the memory consistency model is the only architecturally documented one.
Briefly, Intel (and AMD likewise) define a specific memory consistency model, x86-TSO3 - which is relatively strong as far as memory models go, but is still weaker than sequential consistency. The primary behaviors weakened compared to sequential consistency are:
That later loads can pass earlier stores.
That stores can be seen in a different order from the total store order, but only by cores that performed one of the stores.
To order to implement this memory model, various parts must play by the rules to achieve it. On all recent x86, this means ordered load and store buffers, which avoid the disallowed re-orderings. The use of a store buffer results in the two re-orderings mentioned above: without allowing those, the implementation would be very restricted and probably much slower. In practice it also means fully coherent data caches, since many of the guarantees (e.g., no load-load reordering) would be very difficult to implement without that.
To wrap it all up:
Memory consistency is different than cache coherency: the former is what is documented and forms part of the programming model.
In practice, x86 implementations have fully coherent caches, which helps them implement their x86-TSO memory model, which is fairly strong but weaker than sequential consistency.
Finally, perhaps the answer you were looking for, in different words: a memory model weaker than sequential consistency is still very useful since you can program against it, and in the case you need sequential consistency for some particular operations(s) you insert the right memory barriers4.
If you program against a language supplied memory model, such as Java's or C++11's you don't need to worry about the hardware specifics, but rather than language memory model, and the compiler inserts the barriers required to match the language memory model semantics to the hardware one. The stronger the hardware model, the fewer the barriers required.
1 If your memory model was completely weak, i.e., not really placing any restrictions on cross-core reordering, I suppose you could implement it directly on a non-cache coherent system in a cheap way for normal operations, but then memory barriers potentially become very expensive since they would need to flush a potentially large part of the local private cache.
2 Various chips may implement in differently internally, and in particular some chips may implement stronger semantics than the model (i.e., some allowed re-orderings can never be observed), but absent bugs none will implement a weaker one.
3 This is the name given to it in that paper, which I used because Intel themselves doesn't give it a name, and the paper is a more formal definition than the one Intel gives a less formal model as a series of litmus tests.
4 It practice on x86 you usually use locked instructions (using the lock prefix) rather than separate barriers, although standalone barriers exist also. Here's I'll just use the term barries to refer to both standalone barriers and the barrier semantics embedded into locked instructions.
Re: your edit, which seems to be a new question: right, store-forwarding "violates" coherency. A core can see its own stores earlier than any other core can see them. The store buffer is not coherent.
The x86 memory ordering rules require that loads become globally visible in program order, but allows a core to load data from its own stores before they become globally visible. It doesn't have to pretend it waited and check for memory-order mis-speculation, like it does in other cases of doing loads earlier than the memory model says it should.
Also related; Can x86 reorder a narrow store with a wider load that fully contains it? is a specific example of the store buffer + store-forwarding violating the usual memory-ordering rules. See this collection of mailing list posts by Linus Torvalds explaining the effect of store-forwarding on memory ordering (and how it means that the proposed locking scheme doesn't work).
Without coherency at all, how would you atomically increment a shared counter, or implement other atomic read-modify-write operations that are essential for implementing locks, or for use directly in lockless code. (See Can num++ be atomic for 'int num'?).
lock add [shared_counter], 1 in multiple threads at the same time never loses any counts on actual x86, because the lock prefix makes a core keep exclusive ownership of the cache line from the load until the store commits to L1d (and thus becomes globally visible).
A system without coherent caches would let each thread increment its own copy of a shared counter, and then the final value in memory would come from whichever thread last flushed that line.
Allowing different caches to hold conflicting data for the same line long-term even while other loads/stores happened, and across memory barriers, would allow all sorts of weirdness.
It would also violate the assumption that a pure store becomes visible to other cores promptly. If you didn't have coherency at all, then cores could keep using their cached copy of a shared variable. So if you wanted readers to notice updates, you'd have to clflush before every read of the shared variable, making the common case expensive (when nobody has modified the data since you last checked).
MESI is like a push notification system, instead of forcing every reader to re-validate their cache on every read.
MESI (or coherency in general) allows things like RCU (Read-Copy-Update) to have zero overhead for readers (compared to single threaded) in the case where the shared data structure hasn't been modified. See https://lwn.net/Articles/262464/, and https://en.wikipedia.org/wiki/Read-copy-update. The basic idea is that instead of locking a data structure, a writer copies the whole thing, modifies the copy, and then updates a shared pointer to point to the new version. So readers are always entirely wait-free; they just dereference an (atomic) pointer, and data stays hot in their L1d caches.
Hardware-supported coherency is extremely valuable, and almost every shared-memory SMP architecture uses it. Even ISAs with much weaker memory-ordering rules than x86, like PowerPC, use MESI.

What is the performance impact of virtual memory relative to direct mapped memory?

Virtual memory is a convenient way to isolate memory among processes and give each process its own address space. It works by translating virtual addresses to physical addresses.
I'm already very familiar with how virtual memory works and is implemented. What I don't know about is the performance impact of virtual memory relative to direct mapped memory, which requires no overhead for translation.
Please don't say that there is no overhead. This is obviously false since traversing page tables requires several memory accesses. It is possible that TLB misses are infrequent enough that the performance impacts are negligible, however, if this is the case there should be evidence for it.
I also realize the importance of virtual memory for many of the functions a modern OS provides, so this question isn't about whether virtual memory is good or bad (it is clearly a good thing for most use cases), I'm asking purely about the performance effects of virtual memory.
The answer I'm looking for is ideally something like: virtual memory imposes an x% overhead over direct mapping and here is a paper showing that. I tried to look for papers with such results, but was unable to find any.
This question is difficult to answer definitively because virtual memory is an integral part of modern systems are designed to support virtual memory and most software is written and optimized using systems with virtual memory.
However, in the early 2000s Microsoft Research developed a research OS called Signularity that, among other things, did not rely on virtual memory for process isolation. As part of this project they published a paper where they analyzed the overhead of hardware support for process isolation. The paper is entitled Deconstructing Process Isolation (non-paywall link here). In the paper the researchers write:
Most operating systems use a CPU’s memory management hardware to
provide process isolation, using two mechanisms. First, processes are
only allowed access to certain pages of physical memory. Second,
privilege levels prevent untrusted code from manipulating the system
resources that implement processes, for example, the memory management
unit (MMU) or interrupt controllers. These mechanisms’ non-trivial
performance costs are largely hidden, since there is no widely used
alternative approach to compare them to. Mapping from virtual to
physical addresses can incur overheads up to 10–30% due to exception
handling, inline TLB lookup, TLB reloads, and maintenance of kernel
data structures such as page tables [29]. In addition, virtual memory
and privilege levels increase the cost of inter-process communication.
Later in the paper they write:
Virtual memory systems (with the exception of software-only systems
such as SPUR [46]) rely on a hardware cache of address translations to
avoid accessing page tables at every processor cache miss. Managing
TLB entries has a cost, which Jacob and Mudge estimated at 5–10% on a
simulated MIPS-like processor [29]. The virtual memory system also
brings its data, and in some systems, code as well, into a processor’s
caches, which evicts user code and data. Jacob and Mudge estimate
that, with small caches, these induced misses can increase the
overhead to 10–20%. Furthermore, they found that virtual memory
induced interrupts can increase the overhead to 10–30%. Other studies
found similar or even higher overheads, though the actual costs are
very dependent on system details and benchmarks [3, 6, 10, 26, 36, 40,
41]. In addition, TLB access is on the critical path of many processor
designs [2, 30] and so might affect processor clock speed.
Overall I would take these results with a grain of salt since the research is promoting an alternative system. But clearly there is some overhead associated with implementing virtual memory, and this paper gives one attempt to quantify some of these overheads (within the context of evaluating a possible alternative). I recommend reading the paper for more detail.

How to avoid bottleneck performance?

A distributed system is described as scalable if it remains effective when there is a significant increase the number of resources and the number of users. However, these systems sometimes face performance bottlenecks. How can these be avoided?
The question is pretty broad, and depends entirely on what the system is doing.
Here are some things I've seen in systems to reduce bottlenecks.
Use caches, reducing network and disk bottlenecks. But remember that knowing when to evict from a cache eviction can a hard problem in some situations.
Use message queues to decouple components in the system. This way you can add more hardware to specific parts of the system that need it.
Delay computation when possible (often by using message queues). This takes the heat off the system during high-processing times.
Of course, design the system for parallel processing wherever possible. One host doing processing is NOT scalable. Note: most relational databases fall into the one-host bucket, this is why NoSQL has become suddenly popular; but not always appropriate (theoretically).
Use eventual consistency if possible. Strong consistency is much harder to scale.
Some are proponents for CQRS and DDD. Though I have never seen or designed a "CQRS system" nor a "DDD system," those have definitely affected the way I design systems.
There is a lot of overlap in the points above; some the techniques may use some of the others.
But, experience (your own and others) eventually teaches you about scalable systems. I keep up-to-date by reading about designs from google, amazon, twitter, facebook, and the like. Another good starting point is the high-scalability blog.
Just to build on a point discussed in the abover post, I would like to add that what you need for your distributed system is a distributed cache, so that when you intend on scaling your application the distributed cache acts like a "elastic" data-fabric meaning that you can increase storage capacity of the cache without compromising on performance and at the same time giving you a relaible platform that is accessible to multiple applications.
One such distributed caching solution is NCache. Do take a look!

is the jvm faster under load?

Lots of personal experience, anecdotal evidence, and some rudimentary analysis suggests that a Java server (running, typically, Oracle's 1.6 JVM) has faster response times when it's under a decent amount of load (only up to a point, obviously).
I don't think this is purely hotspot, since response times slow down a bit again when the traffic dies down.
In a number of cases we can demonstrate this by averaging response times from server logs ... in some cases it's as high as 20% faster, on average, and with a smaller standard deviation.
Can anyone explain why this is so? Is it likely a genuine effect, or are the averages simply misleading? I've seen this for years now, through several jobs, and tend to state it as a fact, but have no explanation for why.
Thanks,
Eric
EDIT a fairly large edit for wording and adding more detail throughout.
A few thoughts:
Hotspot kicks in when a piece of code is being executed significantly more than other pieces (it's the hot spot of the program). This makes that piece of code significantly faster (for the normal path) from that point forward. The rate of call after the hotspot compilation is not important, so I don't think this is causing the effect you are mentioning.
Is the effect real? It's very easy to trick yourself with statistics. Not saying you are, but be sure that all your runs are included in the result, and that all other effects (such as other programs, activity, and your monitoring program are the same in all cases. I have more than one had my monitoring program, such as top, cause a difference in behaviour). On one occasion, the performance of the application went up appreciably when the caches warmed up on the database - there was memory pressure from other applications on the same DB instance.
The Operating System and/or CPU may well be involved. The OS and CPU both actively and passively do things to improve the responsiveness of the main program as it moves from being mainly running to being mainly waiting for I/O and vice versa, including:
OS paging memory to disk while it's not being used, and back to RAM when the program is running
OS will cache frequently used disk blocks, which again may improve the application performance
CPU instruction and memory caches fill with the active program's instruction and data
Java applications particularly sensitive to memory paging effects because:
A typical Java application server will pre-allocate almost all free memory to Java. The large memory makes the application inherently more sensitive to memory effects
The generational garbage collector used to manage Java memory ends up creating new objects over a lot of pages, so each request to the application will need more page requests than in other languages. (this is true principally for 'new' objects that have not been through many garbage collections. Objects promoted to the permanent generation are actually very compactly stored)
As most available physical memory is allocated on the system, there is always a pressure on memory, and the largest, least recently run application is a perfect candidate to be pages out.
With these considerations, there is much more probability that there will be page misses and therefore a performance hit than environments with smaller memory requirements. These will be particularly manifest after Java has been idle for some time.
If you use Solaris or Mac, the excellent dTrace can trace memory and disk paging specific to an application. The JVM has numerous dTrace hooks that can be used as triggers to start and stop page monitoring.
On Solaris, you can use large memory pages (even over 1GB in size) and pin them to RAM so they will never be paged out. This should eliminate the memory page problem stated above. Remember to leave a good chunk of free memory for disk caching and for other system/maintenance/backup/management apps. I am sure that other OSes support similar features.
TL/DR: The currently running program in modern operating systems will appear to run faster after a few seconds as the OS brings the program and data pages back from disk, places frequently used disk pages in disk cache and the OS instruction and data caches will tend to be "warmer" for the main program. This effect is not unique to the JVM but is more visible due to the memory requirements of typical Java applications and the garbage collection memory model.

In what applications caching does not give any advantage?

Our professor asked us to think of an embedded system design where caches cannot be used to their full advantage. I have been trying to find such a design but could not find one yet. If you know such a design, can you give a few tips?
Caches exploit the fact data (and code) exhibit locality.
So an embedded system wich does not exhibit locality, will not benefit from a cache.
Example:
An embedded system has 1MB of memory and 1kB of cache.
If this embedded system is accessing memory with short jumps it will stay long in the same 1kB area of memory, which could be successfully cached.
If this embedded system is jumping in different distant places inside this 1MB and does that frequently, then there is no locality and cache will be used badly.
Also note that depending on architecture you can have different caches for data and code, or a single one.
More specific example:
If your embedded system spends most of its time accessing the same data and (e.g.) running in a tight loop that will fit in cache, then you're using cache to a full advantage.
If your system is something like a database that will be fetching random data from any memory range, then cache can not be used to it's full advantage. (Because the application is not exhibiting locality of data/code.)
Another, but weird example
Sometimes if you are building safety-critical or mission-critical system, you will want your system to be highly predictable. Caches makes your code execution being very unpredictable, because you can't predict if a certain memory is cached or not, thus you don't know how long it will take to access this memory. Thus if you disable cache it allows you to judge you program's performance more precisely and calculate worst-case execution time. That is why it is common to disable cache in such systems.
I do not know what you background is but I suggest to read about what the "volatile" keyword does in the c language.
Think about how a cache works. For example if you want to defeat a cache, depending on the cache, you might try having your often accessed data at 0x10000000, 0x20000000, 0x30000000, 0x40000000, etc. It takes very little data at each location to cause cache thrashing and a significant performance loss.
Another one is that caches generally pull in a "cache line" A single instruction fetch may cause 8 or 16 or more bytes or words to be read. Any situation where on average you use a small percentage of the cache line before it is evicted to bring in another cache line, will make your performance with the cache on go down.
In general you have to first understand your cache, then come up with ways to defeat the performance gain, then think about any real world situations that would cause that. Not all caches are created equal so there is no one good or bad habit or attack that will work for all caches. Same goes for the same cache with different memories behind it or a different processor or memory interface or memory cycles in front of it. You also need to think of the system as a whole.
EDIT:
Perhaps I answered the wrong question. not...full advantage. that is a much simpler question. In what situations does the embedded application have to touch memory beyond the cache (after the initial fill)? Going to main memory wipes out the word full in "full advantage". IMO.
Caching does not offer an advantage, and is actually a hindrance, in controlling memory-mapped peripherals. Things like coprocessors, motor controllers, and UARTs often appear as just another memory location in the processor's address space. Instead of simply storing a value, those locations can cause something to happen in the real world when written to or read from.
Cache causes problems for these devices because when software writes to them, the peripheral doesn't immediately see the write. If the cache line never gets flushed, the peripheral may never actually receive a command even after the CPU has sent hundreds of them. If writing 0xf0 to 0x5432 was supposed to cause the #3 spark plug to fire, or the right aileron to tilt down 2 degrees, then the cache will delay or stop that signal and cause the system to fail.
Similarly, the cache can prevent the CPU from getting fresh data from sensors. The CPU reads repeatedly from the address, and cache keeps sending back the value that was there the first time. On the other side of the cache, the sensor waits patiently for a query that will never come, while the software on the CPU frantically adjusts controls that do nothing to correct gauge readings that never change.
In addition to almost complete answer by Halst, I would like to mention one additional case where caches may be far from being an advantage. If you have multiple-core SoC where all cores, of course, have own cache(s) and depending on how program code utilizes these cores - caches can be very ineffective. This may happen if ,for example, due to incorrect design or program specific (e.g. multi-core communication) some data block in RAM is concurrently used by 2 or more cores.

Resources