Should a software cache improve performance on a NUMA machine?

Since NUMA machines do not have local cache, would a software cache implementation improve performance for tasks that require access to remote memory?

Some NUMA machines do have local cache. If you have a multi-socket Opteron or Xeon system, each socket is a NUMA domain with multiple levels of cache, some shared between cores and some not. At least for Intel chips since Nehalem, all of those caches can store remote memory references. This is good for performance on 2-8 socket systems, and it continues to be a benefit on larger systems built on longer-range cache-coherent interconnects like NumaConnect or SGI NUMALink.
With that said, if you're stuck on a non-coherent system, you'll need to narrow down a bunch of other parameters before a yes/no answer is possible. How expensive is each state transition in your software coherency protocol? How often are those transitions happening for a trace of an app you're concerned about? If transitions are cheap enough or lines stay resident long enough, then sure, it could help... but that depends on the implementation, the underlying architecture and the behavior of the app itself.
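For a sense of what a "software cache" means here, below is a minimal sketch, not anyone's actual implementation. The `remote_read_line` primitive is a stand-in assumption for the expensive non-coherent remote access (interconnect read, RDMA get, or similar), and the line and table sizes are arbitrary; whether this wins depends entirely on how costly that primitive is relative to the local lookup and on how often lines must be invalidated.

```cpp
#include <array>
#include <cstdint>
#include <cstring>
#include <vector>

// Simulated "remote" memory plus a hypothetical fetch primitive. On a real
// non-coherent system this call is the expensive operation being amortised.
static std::vector<uint8_t> g_remote(1 << 20, 0);
static void remote_read_line(uint64_t remote_addr, void* dst, size_t line) {
    std::memcpy(dst, g_remote.data() + remote_addr, line);  // stand-in for a remote read
}

constexpr size_t LINE = 64;    // software line size (assumption)
constexpr size_t SETS = 4096;  // direct-mapped table of 4096 lines (assumption)

struct CacheLine {
    uint64_t tag = UINT64_MAX;  // remote line address; UINT64_MAX = invalid
    uint8_t  data[LINE];
};

class SoftwareCache {
    std::array<CacheLine, SETS> lines_{};
public:
    // Read `len` bytes (assumed to fit in one line) at `remote_addr` through the cache.
    void read(uint64_t remote_addr, void* dst, size_t len) {
        uint64_t line_addr = remote_addr & ~uint64_t(LINE - 1);
        size_t   offset    = size_t(remote_addr & (LINE - 1));
        CacheLine& cl = lines_[(line_addr / LINE) % SETS];
        if (cl.tag != line_addr) {                  // miss: pay the remote cost once
            remote_read_line(line_addr, cl.data, LINE);
            cl.tag = line_addr;
        }
        std::memcpy(dst, cl.data + offset, len);    // hit: purely local copy
    }
    // Software "coherence": the user must invalidate when remote data may have changed.
    void invalidate_all() { for (auto& cl : lines_) cl.tag = UINT64_MAX; }
};
```

Each hit saves a remote access, but every miss adds a local copy on top of the remote read, and the explicit `invalidate_all` plays the role of the coherence protocol's state transitions, which is exactly the trade-off discussed above.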
Here's a group experimenting with some related performance issues: http://www.lfbs.rwth-aachen.de/content/17.html. You might also find some interesting work done relating to the Cell BE architecture used in the Playstation 3, for example: http://researcher.ibm.com/files/us-alexe/paper-gonzalez-pact08.pdf.

Related

What is the performance impact of virtual memory relative to direct mapped memory?

Virtual memory is a convenient way to isolate memory among processes and give each process its own address space. It works by translating virtual addresses to physical addresses.
I'm already very familiar with how virtual memory works and is implemented. What I don't know about is the performance impact of virtual memory relative to direct mapped memory, which requires no overhead for translation.
Please don't say that there is no overhead. This is obviously false since traversing page tables requires several memory accesses. It is possible that TLB misses are infrequent enough that the performance impacts are negligible, however, if this is the case there should be evidence for it.
I also realize the importance of virtual memory for many of the functions a modern OS provides, so this question isn't about whether virtual memory is good or bad (it is clearly a good thing for most use cases), I'm asking purely about the performance effects of virtual memory.
The answer I'm looking for is ideally something like: virtual memory imposes an x% overhead over direct mapping and here is a paper showing that. I tried to look for papers with such results, but was unable to find any.
This question is difficult to answer definitively because virtual memory is an integral part of modern systems: the hardware is designed to support virtual memory, and most software is written and optimized on systems that use it.
However, in the early 2000s Microsoft Research developed a research OS called Singularity that, among other things, did not rely on virtual memory for process isolation. As part of this project they published a paper in which they analyzed the overhead of hardware support for process isolation. The paper is entitled Deconstructing Process Isolation (non-paywall link here). In the paper the researchers write:
Most operating systems use a CPU’s memory management hardware to provide process isolation, using two mechanisms. First, processes are only allowed access to certain pages of physical memory. Second, privilege levels prevent untrusted code from manipulating the system resources that implement processes, for example, the memory management unit (MMU) or interrupt controllers. These mechanisms’ non-trivial performance costs are largely hidden, since there is no widely used alternative approach to compare them to. Mapping from virtual to physical addresses can incur overheads up to 10–30% due to exception handling, inline TLB lookup, TLB reloads, and maintenance of kernel data structures such as page tables [29]. In addition, virtual memory and privilege levels increase the cost of inter-process communication.
Later in the paper they write:
Virtual memory systems (with the exception of software-only systems such as SPUR [46]) rely on a hardware cache of address translations to avoid accessing page tables at every processor cache miss. Managing TLB entries has a cost, which Jacob and Mudge estimated at 5–10% on a simulated MIPS-like processor [29]. The virtual memory system also brings its data, and in some systems, code as well, into a processor’s caches, which evicts user code and data. Jacob and Mudge estimate that, with small caches, these induced misses can increase the overhead to 10–20%. Furthermore, they found that virtual memory induced interrupts can increase the overhead to 10–30%. Other studies found similar or even higher overheads, though the actual costs are very dependent on system details and benchmarks [3, 6, 10, 26, 36, 40, 41]. In addition, TLB access is on the critical path of many processor designs [2, 30] and so might affect processor clock speed.
Overall I would take these results with a grain of salt since the research is promoting an alternative system. But clearly there is some overhead associated with implementing virtual memory, and this paper gives one attempt to quantify some of these overheads (within the context of evaluating a possible alternative). I recommend reading the paper for more detail.
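If you want to see part of this overhead on your own machine rather than take the paper's numbers, a crude micro-benchmark can expose the translation component. The sketch below is my illustration, not from the paper: it walks a buffer once touching every cache line, then again touching only one line per 4 KiB page; the per-access gap is dominated by TLB misses and page-table walks, plus the loss of prefetching across page boundaries.

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

// Compare two access patterns over the same 256 MiB buffer:
//  - "dense": every consecutive cache line, so one TLB entry covers 64 accesses
//  - "paged": one line per 4 KiB page, so almost every access needs a new translation
int main() {
    constexpr size_t PAGE = 4096, LINE = 64;
    constexpr size_t N_PAGES = 1 << 16;              // 65536 pages = 256 MiB
    std::vector<char> buf(N_PAGES * PAGE, 1);

    volatile long sink = 0;
    auto time_it = [&](size_t stride, size_t touches) {
        auto t0 = std::chrono::steady_clock::now();
        for (size_t i = 0; i < touches; ++i)
            sink = sink + buf[(i * stride) % buf.size()];
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration<double, std::nano>(t1 - t0).count() / touches;
    };

    double dense = time_it(LINE, N_PAGES * (PAGE / LINE));  // every line, sequential
    double paged = time_it(PAGE, N_PAGES);                  // one line per page
    std::printf("dense: %.2f ns/access, page-stride: %.2f ns/access\n", dense, paged);
}
```

The absolute numbers depend heavily on the TLB sizes, page-walk caches, and huge-page settings of the machine, which is consistent with the paper's point that the costs are "very dependent on system details and benchmarks".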

Why can't one processor be given direct access to the cache of another processor?

In a NUMA (non-uniform memory access) architecture each processor has its own first-level cache, so there is a protocol (MESI) for processor communication. But why can't each processor be connected directly to the others' caches? I read that "the connection simply isn't fast enough", but that didn't explain much.
Thanks.
First, having an L1 cache doesn't imply a NUMA architecture; the motherboard topology is still the primary element that makes a machine UMA or NUMA.
Second, the cache coherence protocol in use is architecture-dependent and may differ from MESI (actually MESIF is a better fit for NUMA machines).
Turning to your question
Each processor can be connected to every other processor's cache. Indeed, every cache coherence protocol does this, just not by allowing direct read/write, since that would take a lot of effort with poor reusability.
However, it is possible to connect a CPU directly to another CPU's cache, and in a sense this is actually implemented on Intel CPUs.
Logical cores (i.e. HyperThreading cores) may share the L2 cache, and physical cores in the same package may share the L3 cache.
However, there are two important aspects here: first, the number of CPUs sharing a cache is low, and second, they are in the same core/package.
Connecting all the caches directly would lose the boundary between what is inside the CPU (as a whole) and what is outside of the CPU.
Isolating the CPU lets us create very customizable and modular systems; an external protocol is an interface that lets us hide the implementation details, and that is worth more than the gain in speed of closely connected caches.
When we need such speed, we build dedicated integrated components, like a coprocessor.
There are various reasons why caches are not directly connected. I cannot speak for industry leaders, but here are some generic thoughts.
It doesn't scale.
2 processors means 1 link, 3 processors means 3 links, 4 processors means 6 links and so on.
n processors need C(n, 2) links, that is n * (n - 1) / 2 links.
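A quick tally of that formula (illustrative arithmetic only):

```cpp
#include <cstdio>

// Fully connecting n caches needs C(n, 2) = n * (n - 1) / 2 point-to-point links.
int main() {
    int ns[] = {2, 4, 8, 16, 32, 64};
    for (int n : ns)
        std::printf("%2d processors -> %4d links\n", n, n * (n - 1) / 2);
}
```

Eight processors already need 28 dedicated links; 64 would need 2016.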
Also, you could only connect CPUs with compatible cache interfaces, which may imply that you could only connect identical CPUs. Cache architecture is something that changes frequently: lines may become bigger, associativity may change, signal timings may get faster.
Lastly, if a CPU has enough pins to connect to only four more CPUs, you can only create quad-CPU systems.
It requires a lot of pins.
Giving access to the caches requires a lot of pins: there are two or three caches per core and each one needs to be addressed and controlled, which means exposing a lot of pins; a serial interface is not an option, as it would be too slow.
If you add that each processor must be connected to every other one, then the number of pins explodes quadratically.
If you use a shared bus between caches, you are actually using a protocol like MESI, a protocol that tries to avoid congesting the bus, because with even a few CPUs the traffic on the shared bus is quite intense, and the time spent waiting for a turn to use it slows down the CPU (even with store buffers and invalidation queues).
It is slow.
The cache is highly integrated with the core; it may support multiple read/write ports and other interfaces that increase parallelization. All of this cannot be exposed out of the package/core without a large number of pins (and a huge increase in size and cost).
The cache is physically close to the core, which minimizes the propagation delay. Consider that the period of a 3 GHz clock is 1/(3 * 10^9) s, roughly 0.33 ns; in that time light can travel at most about 10 cm (5 cm for a round trip), and a signal does not propagate at the speed of light.
Furthermore, when a cache is accessed by only one core, the designer can make some optimizations based on the internal architecture of that core. This is not possible if the accessing core belongs to another, possibly different, CPU.
It is complex.
Letting a cache be accessed by multiple CPUs requires replicating a lot of circuitry. For example, since caches are associative, whenever an address is requested a tag must be compared against a set of possible candidates, and this circuitry must be replicated to allow other CPUs to read/write the cache asynchronously.
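As a purely software illustration of that replication argument (a toy model, not how the hardware is actually wired), here is the lookup that each extra asynchronous reader would need its own copy of:

```cpp
#include <array>
#include <cstdint>

// Minimal model of one set of an 8-way set-associative cache: a lookup compares
// the requested tag against all 8 candidate ways. In hardware these comparisons
// run in parallel, so giving another CPU asynchronous access means duplicating
// the comparators, the way-selection mux, and the read port.
struct Set {
    std::array<uint64_t, 8> tags{};
    std::array<bool, 8>     valid{};

    // Returns the matching way, or -1 on a miss.
    int lookup(uint64_t tag) const {
        for (int way = 0; way < 8; ++way)
            if (valid[way] && tags[way] == tag)
                return way;
        return -1;
    }
};
```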
So briefly: it would be possible to connect caches directly, it is just not worth it for discrete components. It is done for integrated components.

Why does Windows switch processes between processors?

If a single-threaded process is busy and uses 100% of a single core, it seems like Windows is switching this process between the cores, because in Task Manager's core overview all cores are equally used.
Why does Windows do that? Isn't this destroying L1/L2 caches?
There are advantages to pinning a process to one core, primarily caching which you already mentioned.
There are also disadvantages -- you get unequal heating, which can create mechanical stresses that do not improve the expected lifetime of the silicon die.
To avoid this, OSes tend to keep all cores at equal utilization. When there's only one active thread, it periodically has to be moved, which invalidates the caches it has warmed. As long as this happens infrequently (in CPU time), the impact of the extra cache misses during migration is negligible.
For example, the abstract of "Energy and thermal tradeoffs in hardware-based load balancing for clustered multi-core architectures implementing power gating" explicitly lists this as a design goal of scheduling algorithms (emphasis mine):
In this work, a load-balancing technique for these clustered multi-core architectures is presented that provides both a low overhead in energy and a smooth temperature distribution across the die, increasing the reliability of the processor by evenly stressing the cores.
Spreading the heat dissipation throughout the die is also essential for techniques such as Turbo Boost, where cores are clocked temporarily at a rate that is unsustainable long term. By moving load to a different core regularly, the average heat dissipation remains sustainable even though the instantaneous power is not.
Your process may be the only one doing a lot of work, but it is not the only thing running. There are lots of other processes that need to run occasionally. When your process gets evicted and eventually re-scheduled, the core on which it was running previously might not be available. It's better to run the waiting process immediately on a free core than to wait for the previous core to be available (and in any case its data will likely have been bumped from the caches by the other thread).
In addition, modern CPUs allow all the cores in a package to share high-level caches. See the "Smart Cache" feature in this Intel Core i5 spec sheet. You still lose the lower-level cache(s) on core switch, but those are small and will probably churn somewhat anyway if you're running more than just a small tight loop.
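If, for a particular workload, keeping the low-level caches warm matters more than the scheduler's reasons above, you can opt out yourself. A minimal sketch using the Win32 SetThreadAffinityMask API (the choice of logical processor 2 is arbitrary, for illustration only; error handling kept minimal):

```cpp
#include <windows.h>
#include <cstdio>

// Pin the calling thread to logical processor 2 so the scheduler stops
// migrating it between cores and its L1/L2 working set stays warm.
int main() {
    DWORD_PTR mask = 1ull << 2;   // bit n selects logical CPU n
    DWORD_PTR prev = SetThreadAffinityMask(GetCurrentThread(), mask);
    if (prev == 0) {
        std::printf("SetThreadAffinityMask failed: %lu\n", GetLastError());
        return 1;
    }
    // ... run the hot single-threaded loop here ...
    return 0;
}
```

Whether this helps is workload-dependent: you keep the caches warm but give up the thermal spreading and the scheduling flexibility described above.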

How to measure memory bandwidth utilization on Windows?

I have a highly threaded program but I believe it is not able to scale well across multiple cores because it is already saturating all the memory bandwidth.
Is there any tool out there which allows measuring how much of the memory bandwidth is being used?
Edit: Please note that typical profilers show things like memory leaks and memory allocation, which I am not interested in.
I am only interested in whether the memory bandwidth is being saturated or not.
If you have a recent Intel processor, you might try to use Intel(r) Performance Counter Monitor: http://software.intel.com/en-us/articles/intel-performance-counter-monitor/ It can directly measure consumed memory bandwidth from the memory controllers.
I'd recommend the Visual Studio Sample Profiler which can collect sample events on specific hardware counters. For example, you can choose to sample on cache misses. Here's an article explaining how to choose the CPU counter, though there are other counters you can play with as well.
It would be hard to find a tool that measures memory bandwidth utilization for your application.
But since the issue you face is a suspected memory bandwidth problem, you could try to measure whether your application is generating a lot of page faults per second, which would definitely mean that you are nowhere near the theoretical memory bandwidth.
You should also measure how cache-friendly your algorithms are. If they are thrashing the cache, your memory bandwidth utilization will be severely hampered. Google "measuring cache misses" for good sources that tell you how to do this.
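If you just want a rough yes/no on saturation without hardware counters, one approach (not mentioned in the answers here, purely a sketch) is to compare your application's data rate against what a trivial streaming kernel achieves on the same machine. Single-threaded below; run several copies or add threads to approach the platform's practical peak.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <vector>

// Measure achievable read+write bandwidth with a simple copy over buffers much
// larger than the last-level cache, then compare against your app's data rate.
int main() {
    constexpr size_t N = 256u << 20;              // 256 MiB per buffer
    std::vector<char> src(N, 1), dst(N, 0);

    constexpr int REPS = 10;
    auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < REPS; ++r)
        std::copy(src.begin(), src.end(), dst.begin());
    auto t1 = std::chrono::steady_clock::now();

    double secs  = std::chrono::duration<double>(t1 - t0).count();
    double bytes = 2.0 * N * REPS;                // N read + N written per pass
    std::printf("~%.1f GB/s copy bandwidth\n", bytes / secs / 1e9);
}
```

This only gives an upper-bound estimate of what the machine can sustain, not a per-application measurement; for that you still need the counter-based tools mentioned above.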
It isn't possible to properly measure memory bus utilisation with any kind of software-only solution (it used to be, back in the '80s or so, but then we got pipelining, caches, out-of-order execution, multiple cores, non-uniform memory architectures with multiple busses, etc.).
You absolutely have to have hardware monitoring the memory bus, to determine how 'busy' it is.
Fortunately, most PC platforms do have some, so you just need the drivers and other software to talk to it:
wenjianhn comments that there is a project specifically for Intel hardware (which they call the Processor Counter Monitor) at https://github.com/opcm/pcm
For other architectures on Windows, I am not sure. But there is a project (for linux) which has a grab-bag of support for different architectures at https://github.com/RRZE-HPC/likwid
In principle, a computer engineer could attach a suitable oscilloscope to almost any PC and do the monitoring 'directly', although this is likely to require both a suitably trained computer engineer and quite high-performance test instruments (read: both very costly).
If you try this yourself, know that you'll likely need instruments or at least analysis which is aware of the protocol of the bus you're intending to monitor for utilisation.
This can sometimes be really easy, with some busses - eg old parallel FIFO hardware, which usually has a separate wire for 'fifo full' and another for 'fifo empty'.
Such chips are usually used between a faster bus and a slower one, on a one-way link. The 'fifo full' signal, even if it normally only triggers occasionally, can be monitored for excessively 'long' levels: for the example of a USB 2.0 Hi-Speed link, this happens when the OS isn't polling the USB FIFO hardware on time. Measuring the frequency and duration of these 'holdups' then lets you measure bus utilisation, but only for this USB 2.0 bus.
For a PC memory bus, I guess you could also try just monitoring how much power your RAM interface is using - which perhaps may scale with use. This might be quite difficult to do, but you may 'get lucky'. You want the current of the supply which feeds VccIO for the bus. This should actually work much better for newer PC hardware than those ancient 80's systems (which always just ran at full power when on).
A fairly ordinary oscilloscope is enough for either of those examples - you just need one that can trigger only on 'pulses longer than a given width', and leave it running until it does, which is a good way to do 'soak testing' over long periods.
Either way, you monitor utilisation by looking for the change in 'idle' time.
But modern PC memory busses are quite a bit more complex, and also much faster.
To do it directly by tapping the bus, you'll need at least an oscilloscope (and active probes) designed explicitly for monitoring the generation of DDR bus your PC has, along with the software analysis option (usually sold separately) to decode the protocol enough to figure out the kind of activity which is occurring on it, from which you can figure out what kind of activity you want to measure as 'idle'.
You may even need a motherboard designed to allow you to make those measurements also.
This isn't so straightforward as just looking for periods of no activity - all DRAM needs regular refresh cycles at the very least, which may or may not happen along with obvious bus activity (some DRAMs do it automatically, some need a specific command to trigger it, some can continue to address and transfer data from banks not in refresh, some can't, etc.).
So the instrument needs to be able to analyse the data deeply enough for you to extract how busy it is.
Your best and simplest bet is to find a PC hardware (CPU) vendor who has tools that do what you want, and buy that hardware so you can use those tools.
This might even involve running your application in a VM, so you can benefit from better tools in a different OS hosting it.
To this end, you'll likely want to try Linux KVM (yes, even for Windows - there are Windows guest drivers for it), and also pin your VM to specific CPUs, whilst you configure Linux to avoid putting other jobs on those same CPUs.

shared memory vs distributed memory and multithread vs multiprocess

I am learning parallel programming by myself. I wonder: is distributed memory always multiprocess, and is multithreading always shared memory? Can multiprocess be used both for distributed memory and for shared memory?
Thanks and regards!
Yes, yes, and "yes, in a sense"
In a distributed memory system, different CPU units have their own memory systems. Access from another CPU will most likely be slower or with a more limited coherency model, if indeed it is possible at all. This will be more typical of a message-passing multiprocessor.
Using multiple threads for parallel programming is more of a software paradigm than a hardware issue, but you are correct, use of the term thread essentially specifies that a single shared memory is in use, and it may or may not include actual multiple processors. It may not even include multiple kernel threads, in which case the threads will not execute in parallel.
I'm not completely clear on the meaning of the last question. Certainly, saying "distributed memory" or "shared memory" implies "distributed over processors" and "shared by processors", so I suppose the terms only reasonably apply to multiprocessor or potentially multiprocessor systems. If we are talking about multiple processes in the software sense, I guess that's pretty much a requirement for distributed memory systems, and essentially a requirement (they might be called threads) for a shared memory system.
I should add that distributed-memory-but-cache-coherent systems do exist and are a type of shared memory multiprocessor design called NUMA. Only a few years ago these machines were the lunatic fringe of parallel computing, but now the Intel Core i7 processors have brought NUMA into the mainstream.
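To make the terminology concrete, here is a minimal shared-memory sketch (illustrative only, not tied to any particular answer above): all threads read and write the same vector through ordinary pointers. In a distributed-memory, multiprocess version each process would own its own slice of the data and the partial results would be combined with messages (e.g. MPI_Reduce).

```cpp
#include <cstdio>
#include <thread>
#include <vector>

// Shared-memory parallelism: every thread accesses the same vector directly,
// with no explicit communication. (Note: adjacent slots of `partial` share a
// cache line, so this simple version has false sharing; padding would fix it.)
int main() {
    std::vector<int> data(1'000'000, 1);
    const unsigned nthreads = 4;
    std::vector<long long> partial(nthreads, 0);
    std::vector<std::thread> workers;

    for (unsigned t = 0; t < nthreads; ++t)
        workers.emplace_back([&, t] {
            size_t begin = t * data.size() / nthreads;
            size_t end   = (t + 1) * data.size() / nthreads;
            for (size_t i = begin; i < end; ++i)
                partial[t] += data[i];          // direct access to shared memory
        });
    for (auto& w : workers) w.join();

    long long total = 0;
    for (auto p : partial) total += p;
    std::printf("sum = %lld\n", total);
}
```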

Resources