Explanation for why effective DRAM bandwidth reduces upon adding CPUs

This question is a spin-off of the one posted here: Measuring bandwidth on a ccNUMA system
I've written a micro-benchmark for the memory bandwidth on a ccNUMA system with 2x Intel(R) Xeon(R) Platinum 8168:
24 cores @ 2.70 GHz,
L1 cache 32 kB, L2 cache 1 MB and L3 cache 33 MB.
As a reference, I'm using the Intel Advisor's roof-line plot, which depicts the bandwidths of each CPU data-path available. According to this, the bandwidth is 230 GB/s.
Strong scaling of bandwidth:
Question: If you look at the strong scaling diagram, you can see that the peak effective bandwidth is actually achieved at 33 CPUs, following which adding CPUs only reduces it. Why is this happening?

Overview
This answer provides probable explanations. To put it shortly, not all parallel workloads scale indefinitely. When many cores compete for the same shared resource (e.g. DRAM), using too many cores is often detrimental: there is a point where there are enough cores to saturate a given shared resource, and adding more cores only increases the overheads.
More specifically, in your case, the L3 cache and the IMCs are likely the problem. Enabling Sub-NUMA Clustering and non-temporal prefetch should improve the performance and scalability of your benchmark a bit. Still, there are other architectural hardware limitations that can cause the benchmark not to scale well. The next section describes how Intel Skylake SP processors deal with memory accesses and how to find the bottlenecks.
Under the hood
The layout of Intel Xeon Skylake SP processors is like the following in your case:
Source: Intel
There are two sockets connected with a UPI interconnect and each processor is connected to its own set of DRAM. There are 2 Integrated Memory Controllers (IMCs) per processor and each is connected to 3 DDR4 DRAM channels @ 2666 MHz. This means the theoretical bandwidth is 2 × 2 × 3 × 2666e6 × 8 = 256 GB/s = 238 GiB/s.
Assuming your benchmark is well designed and each processor accesses only its own NUMA node, I expect a very low UPI throughput and a very low number of remote NUMA pages. You can check this with hardware counters. Linux perf or VTune lets you check this relatively easily.
The L3 cache is split into slices. All physical addresses are distributed across the cache slices using a hash function (see here for more information). This method enables the processor to balance the throughput between all the L3 slices. It also enables the processor to balance the throughput between the two IMCs, so that in the end the processor looks like an SMP architecture instead of a NUMA one. This was also used in Sandy Bridge and Xeon Phi processors (mainly to mitigate NUMA effects).
Hashing does not guarantee a perfect balancing though (no hash function is perfect, especially the ones that are fast to compute), but it is often quite good in practice, especially for contiguous accesses. A bad balancing decreases the memory throughput due to partial stalls. This is one reason you cannot reach the theoretical bandwidth.
With a good hash function, the balancing should be independent of the number of cores used. If the hash function is not good enough, one IMC can be more saturated than the other, oscillating over time. The bad news is that the hash function is undocumented and checking this behaviour is complex: AFAIK you can get hardware counters for each IMC's throughput, but their granularity is quite coarse. On my Skylake machine the hardware counters are named uncore_imc/data_reads/ and uncore_imc/data_writes/, but on your platform you certainly have 4 counters for that (one for each IMC).
Fortunately, Intel provides a feature called Sub-NUMA Clustering (SNC) on Xeon SP processors like yours. The idea is to split the processor into two NUMA nodes that have their own dedicated IMC. This solves the balancing issue due to the hash function and so results in faster memory operations, as long as your application is NUMA-friendly. Otherwise, it can actually be significantly slower due to NUMA effects. In the worst case, the pages of an application can all be mapped to the same NUMA node, resulting in only half the bandwidth being usable. Since your benchmark is supposed to be NUMA-friendly, SNC should be more efficient.
Source: Intel
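As an aside, here is a minimal sketch (not from the question's benchmark; it assumes pinned OpenMP threads, e.g. OMP_PROC_BIND=close OMP_PLACES=cores, and Linux's default first-touch page placement, and the function name is made up) of the kind of NUMA-friendly allocation that SNC rewards: each thread first-touches, and later reads back, the same chunk, so its pages stay on the local node.

// Sketch: first-touch placement so each thread's pages land on its local NUMA node.
// Assumes pinned OpenMP threads and Linux's default first-touch page placement.
#include <stdlib.h>

double *alloc_numa_friendly(size_t n)
{
    double *a = malloc(n * sizeof *a);
    if (!a) return NULL;

    // Each thread writes ("first-touches") its own chunk, so the pages of that
    // chunk are physically allocated on the NUMA node of the core running it.
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++)
        a[i] = 0.0;

    // Later bandwidth loops must use the same static schedule so that each
    // thread reads back the chunk it initialized (mostly local DRAM accesses).
    return a;
}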
Furthermore, having more cores accessing the L3 in parallel can cause more early evictions of prefetched cache lines, which then need to be fetched again later when the core actually needs them (paying an additional DRAM latency). This effect is not as unusual as it seems. Indeed, due to the high latency of DDR4 DRAM, hardware prefetching units have to prefetch data a long time in advance to reduce the impact of the latency, and they also need to perform a lot of requests concurrently. This is generally not a problem with sequential accesses, but more cores cause accesses to look more random from the caches' and IMCs' point of view. The thing is, DRAM is designed so that contiguous accesses are faster than random ones (multiple contiguous cache lines should be loaded consecutively to fully saturate the bandwidth). You can analyse the value of the LLC-load-misses hardware counter to check if more data is re-fetched with more threads (I see such an effect on my Skylake-based PC with only 6 cores, but it is not strong enough to cause any visible impact on the final throughput).
To mitigate this problem, you can use software non-temporal prefetch (prefetchnta) to request the processor to load data directly into the line fill buffer instead of the L3 cache, resulting in less pollution (here is a related answer). This may be slower with fewer cores due to lower concurrency, but it should be a bit faster with a lot of cores. Note that this does not solve the problem of fetched addresses looking more random from the IMCs' point of view, and there is not much to do about that.
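For illustration, a hedged sketch of such a non-temporal-prefetch read loop using the _mm_prefetch intrinsic with the NTA hint; the prefetch distance is a guess that would need tuning on the actual machine:

// Sketch of a streaming read using software non-temporal prefetch (prefetchnta)
// to limit cache pollution. PREFETCH_DISTANCE is a guess and must be tuned.
#include <immintrin.h>
#include <stddef.h>

#define PREFETCH_DISTANCE 4096          /* bytes ahead of the current position */

double sum_stream_nta(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if ((i % 8) == 0)               /* one prefetch per 64-byte line (8 doubles) */
            _mm_prefetch((const char *)(a + i) + PREFETCH_DISTANCE, _MM_HINT_NTA);
        sum += a[i];
    }
    return sum;
}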
The low-level architecture of DRAM and caches is very complex in practice. More information about memory can be found in the following links:
What Every Programmer Should Know About Memory
Introduction to High Performance Scientific Computing (Section 1.3)
Lecture: Main Memory and the DRAM System
Short lectures: Dynamic Random Access Memory (in 7 parts)
Intel® 64 and IA-32 Architectures Software Developer's Manual (Volume 3)

Related

Are caches of different level operating in the same frequency domain?

Larger caches usually have longer bitlines or wordlines, and thus most likely higher access latency and cycle time.
So, do L2 caches work in the same clock domain as L1 caches? How about the L3 cache (slices), since it is now non-inclusive and shared among all the cores?
And related questions are:
Are all function units in a core in the same clock domain?
Are the uncore part all in the same clock domain?
Are cores in the multi-core system synchronous?
I believe clock domain crossing would introduce extra latency. Do most parts of a CPU chip work in the same clock domain?
The private L1i/d caches are always part of each core, not on a separate clock, in modern CPUs1. L1d is very tightly coupled with load execution units, and the L1dTLB. This is pretty universally true across architectures. (VIPT Cache: Connection between TLB & Cache?).
On CPUs with per-core private L2 cache, it's also part of the core, in the same frequency domain. This keeps L2 latency very low by keeping timing (in core clock cycles) fixed, and not requiring any async logic to transfer data across clock domains. This is true on Intel and AMD x86 CPUs, and I assume most other designs.
Footnote 1: Decades ago, when even having the L1 caches on-chip was a stretch for transistor budgets, sometimes just the comparators and maybe tags were on-chip, so that part could go fast while starting to set up the access to the data on external SRAM. (Or if not external, sometimes a separate die (piece of silicon) in the same plastic / ceramic package, so the wires could be very short and not exposed as external pins that might need ESD protection, etc).
Or for example early Pentium II ran its off-die / on-package L2 cache at half core clock speed (down from full speed in PPro). (But all the same "frequency domain"; this was before DVFS dynamic frequency/voltage for power management.) L1i/d was tightly integrated into the core like they still are today; you have to go farther back to find CPUs with off-die L1, like maybe early classic RISC CPUs.
The rest of this answer is mostly about Intel x86 CPUs, because from your mention of L3 slices I think that's what you're imagining.
How about L3 cache (slices) since they are now non-inclusive and shared among all the cores?
Of mainstream Intel CPUs (P6 / SnB-family), only Skylake-X has non-inclusive L3 cache. Intel since Nehalem has used inclusive last-level cache so its tags can be a snoop filter. See Which cache mapping technique is used in intel core i7 processor?. But SKX changed from a ring to a mesh, and made L3 non-inclusive / non-exclusive.
On Intel desktop/laptop CPUs (dual/quad), all cores (including their L1+L2 caches) are in the same frequency domain. The uncore (the L3 cache + ring bus) is in a separate frequency domain, but I think normally runs at the speed of the cores. It might clock higher than the cores if the GPU is busy but the cores are all idle.
The memory clock stays high even when the CPU clocks down. (Still, single-core bandwidth can suffer if the CPU decides to clock down from 4.0 to 2.7GHz because it's running memory-bound code on the only active core. Single-core bandwidth is limited by max_concurrency / latency, not by DRAM bandwidth itself if you have dual-channel DDR4 or DDR3. Why is Skylake so much better than Broadwell-E for single-threaded memory throughput? I think this is because of increased uncore latency.)
The wikipedia Uncore article mentions overclocking it separately from the cores to reduce L3 / memory latency.
On Haswell and later Xeons (E5 v3), the uncore (the ring bus and L3 slices) and each individual core have separate frequency domains. (Source: Frank Denneman's NUMA Deep Dive Part 2: System Architecture. It has a typo, saying Haswell (v4) when Haswell is actually Xeon E[357]-xxxx v3, but other sources like the paper Comparisons of core and uncore frequency scaling modes in quantum chemistry application GAMESS confirm that Haswell does have those features.) Uncore Frequency Scaling (UFS) and Per Core Power States (PCPS) were both new in Haswell.
On Xeons before Haswell, the uncore runs at the speed of the current fastest core on that package. On a dual-socket NUMA setup, this can badly bottleneck the other socket, by making it slow to keep up with snoop requests. See John "Dr. Bandwidth" McCalpin's post on this Intel forum thread:
On the Xeon E5-26xx processors, the "uncore" (containing the L3 cache, ring interconnect, memory controllers, etc), runs at a speed that is no faster than the fastest core, so the "package C1E state" causes the uncore to also drop to 1.2 GHz. When in this state, the chip takes longer to respond to QPI snoop requests, which increases the effective local memory latency seen by the processors and DMA engines on the other chip!
... On my Xeon E5-2680 chips, the "package C1E" state increases local latency on the other chip by almost 20%
The "package C1E state" also reduces sustained bandwidth to memory located on the "idle" chip by up to about 25%, so any NUMA placement errors generate even larger performance losses.
Dr. Bandwidth ran a simple infinite-loop pinned to a core on the other socket to keep it clocked up, and was able to measure the difference.
Quad-socket-capable Xeons (E7-xxxx) have a small snoop filter cache in each socket. Dual-socket systems simply spam the other socket with every snoop request, using a good fraction of the QPI bandwidth even when they're accessing their own local DRAM after an L3 miss.
I think Broadwell and Haswell Xeon can keep their uncore clock high even when all cores are idle, exactly to avoid this bottleneck.
Dr. Bandwidth says he disables package C1E state on his Haswell Xeons, but that probably wasn't necessary. He also posted some stuff about using Uncore perf counters to measure uncore frequency to find out what your CPU is really doing, and about BIOS settings that can affect the uncore frequency decision-making.
More background: I found https://www.anandtech.com/show/8423/intel-xeon-e5-version-3-up-to-18-haswell-ep-cores-/4 about some changes like new snoop mode options (which hop on the ring bus sends snoops to the other core), but it doesn't mention clocks.
A larger cache may have a higher access time, but it can still sustain one access per cycle per port by being fully pipelined. It may, however, constrain the maximum supported frequency.
In modern Intel processors, the L1i/L1d and L2 caches and all functional units of a core are in the same frequency domain. On client processors, all cores of the same socket are also in the same frequency domain because they share the same frequency regulator. On server processors (starting with Haswell, I think), each core is in a separate frequency domain.
In modern Intel processors (since Nehalem, I think), the uncore (which includes the L3) is in a separate frequency domain. One interesting case is when a socket is used in a dual-NUMA-node configuration. In that case, I think the uncore partitions of the two NUMA nodes would still be in the same frequency domain.
There is special circuitry used to cross frequency domains, and all cross-domain communication has to pass through it. So yes, I think it incurs a small performance overhead.
There are other frequency domains. In particular, each DRAM channel operates in its own frequency domain. I don't know whether current processors support running different channels at different frequencies.

Could multi-cpu access memory simultaneously in common home computer?

As far as I know, in a modern multi-core CPU system, different CPUs share one memory bus. Does that mean only one CPU can access the memory at any one moment, since there is only one memory bus, which cannot be used by more than one CPU at a time?
Yes, at the simplest level, a single memory bus will only be doing one thing at once. For memory busses, it's normal for them to be half-duplex (i.e. either loading or storing, not sending data in both directions at once like gigabit Ethernet or PCIe).
Requests can be pipelined to minimize the gaps between requests, but transferring a cache-line of data takes multiple back-to-back cycles.
First of all, remember that when a CPU core "accesses the memory", it doesn't have to read directly from DRAM. The cache maintains a coherent view of memory shared by all cores, using (a variant of) the MESI cache coherency protocol.
Essential reading for the low-level details about how cache + memory works:
Ulrich Drepper's 2007 article What Every Programmer Should Know About Memory?, and my 2017 update on what's changed and what hasn't. e.g. a single core can barely saturate the memory controllers on a low-latency dual/quad core Intel CPU, and not even close on a many-core Xeon where max_concurrency / latency is the bottleneck, not the DRAM controller bandwidth. (Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?).
All high-performance / multi-core systems use caches, and normally every core has its own private L1i/L1d cache. In most modern multi-core CPUs, there are 2 levels of private cache per core, with a large shared cache. Earlier CPUs (like Intel Core2) only had private L1 caches, and the large shared last-level cache was L2.
Multi-level caches are essential to give low latency / high bandwidth for the most-hot data while still being large enough to have a high hit rate over a large working set.
Intel divides up their L3 caches into slices on the ring bus that connects cores together. So multiple accesses to different slices of L3 can happen simultaneously. See David Kanter's write-up of Sandybridge. Only on an L3 miss does the request need to be sent to a memory controller. (The memory controllers themselves have some buffering / reordering capability.)
Data written by one core can be read by another core without ever being written back to DRAM. A shared last-level cache acts as a backstop for shared data. (Intel CPUs with inclusive L3 cache also use it as a snoop filter to avoid broadcasting cache-coherency traffic to all cores: Which cache mapping technique is used in intel core i7 processor?).
But the writer will have the cache line in Modified state (and all other cores have it Invalid), so the reader has to request it from the writer to get it in Shared state. This is somewhat slow. See What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?, and What will be used for data exchange between threads are executing on one Core with HT?.
On modern Xeon multi-socket systems, I think it's still the case that dirty data can be sent between sockets without writing back to DRAM. But I'm not sure.
AMD Ryzen has separate L3 for each quad-core cluster, so data transfer between core-clusters is slower than within a single core cluster. (And if all the cores are working on the same data, it will end up replicated in the L3 of each cluster.)
Typical Intel/AMD desktop/laptop systems have dual-channel memory controllers, so (if both memory channels are populated) there can be two burst transfers in flight simultaneously, one to each DIMM.
But if only one channel is populated, or they're mismatched and the BIOS doesn't run them in dual-channel mode, or there are no outstanding accesses to cache lines that map to one of the channels, then memory parallelism is limited to pipelining access to one channel.
I know that modern CPUs use caches to achieve low latency. So my question is based on the scenario where the computer has just started: there is no data in the caches, so the CPUs will fetch data directly from memory.
Nobody would design a multi-core system with no caches at all. That would be terribly inefficient because the cores would block each other from accessing the bus to fetch instructions as well as data, as you suspect.
One fast CPU can do everything that two half-speed CPUs can do, and some things it can't (like run a single thread fast).
If you can build a CPU complex enough to support SMP operation, you can (and should) first make it support some cache. Maybe just internal tags for external data (for faster hit/miss checking), if we're talking about really old CPUs where the transistor budget for the whole chip was too low for much/any internal cache.
Or you could always have fully external cache outside the CPU, as part of an SMP interconnect. But the CPU has to know about it, at least to be able to mark some memory regions uncacheable so MMIO works, and (if it's not write-through) for consistent DMA. If you want private caches for each core, it can't just be a transparent memory-side cache (i.e. caching just the DRAM, not even seeing accesses to physical memory addresses that aren't backed by DRAM).
Multiple cores on a single piece of silicon only makes sense once you've pushed single-core performance to the point of diminishing returns with pipelining, caches, and superscalar execution. Maybe even out-of-order execution, although there are some multi-core in-order x86 and ARM chips. If running carefully-tuned code, out-of-order execution isn't always necessary for some kinds of problems. For example, GPUs don't use OoO exec because they're just designed for massive throughput with simple control.
Pipelining and caching can give huge speed improvements. See http://www.lighterra.com/papers/modernmicroprocessors/
Summary: it's generally possible for a single core to saturate the memory bus if memory access is all it does.
If you establish the memory bandwidth of your machine, you should be able to see if a single-threaded process can really achieve this and, if not, how the effective bandwidth use scales with the number of processors.
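For example, a rough sketch of such a scaling measurement (buffer size and thread counts are illustrative; compile with OpenMP, e.g. -O2 -fopenmp) could look like this:

// Sketch: measure how effective read bandwidth scales with the number of threads.
// Buffer size and thread counts are illustrative, not tuned for any machine.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void)
{
    const size_t n = 1ull << 27;               // 128 Mi doubles = 1 GiB, far bigger than L3
    double *a = malloc(n * sizeof *a);
    if (!a) return 1;
    for (size_t i = 0; i < n; i++) a[i] = 1.0; // touch every page once

    for (int t = 1; t <= omp_get_max_threads(); t *= 2) {
        double sum = 0.0;
        double t0 = omp_get_wtime();
        #pragma omp parallel for schedule(static) num_threads(t) reduction(+:sum)
        for (size_t i = 0; i < n; i++)
            sum += a[i];
        double t1 = omp_get_wtime();
        printf("%2d thread(s): %6.1f GB/s (checksum %g)\n",
               t, n * sizeof *a / (t1 - t0) / 1e9, sum);
    }
    free(a);
    return 0;
}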
Now I'll explain further.
It all depends on the architecture you're using; for now, let's say modern SMP and SDRAM:
1) If two cores tried to access the same address in RAM
It could go several ways:
If they both want to read, simultaneously: two cores on the same chip will probably share an intermediate cache at some level (2 or 3), so the read will only be done once. On a modern architecture, each core may be able to keep executing µ-ops from one or more pipelines until the cache line is ready. Two cores on different chips may not share a cache, but still need to co-ordinate access to the bus: ideally, whichever chip didn't issue the read will simply snoop the response.
If they both want to write: two cores on the same chip will just be writing to the same cache, and that only needs to be flushed to RAM once. In fact, since memory will be read from and written to RAM per cache line, writes at distinct but sufficiently close addresses can be coalesced into a single write to RAM. Two cores on different chips do have a conflict, and the cache line will need to be written back to RAM by chip1, fetched into chip2's cache, modified and then written back again (no idea whether the write/fetch can be coalesced by snooping).
2) If two cores tried to access different addresses
For a single access, the CAS latency means two operations can potentially be interleaved to take no longer (or perhaps only a little longer) than if the bus were idle.
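The cache-line granularity mentioned above also means that two cores writing to different addresses in the same line pay the same ping-pong cost (false sharing). A minimal hedged demo of that effect, with made-up names and iteration counts (not from the question):

// Sketch: two threads repeatedly writing to the same cache line vs. separate lines.
// The shared-line case makes the line ping-pong between the cores' private caches.
#include <stdio.h>
#include <omp.h>

static _Alignas(64) struct { long a, b; } same_line;                 // a and b in one 64 B line
static _Alignas(64) struct { long a; char pad[64]; long b; } padded; // b pushed to its own line

static double run(volatile long *x, volatile long *y, long iters)
{
    double t0 = omp_get_wtime();
    #pragma omp parallel sections num_threads(2)
    {
        #pragma omp section
        for (long i = 0; i < iters; i++) (*x)++;   // thread 1 hammers x
        #pragma omp section
        for (long i = 0; i < iters; i++) (*y)++;   // thread 2 hammers y
    }
    return omp_get_wtime() - t0;
}

int main(void)
{
    const long iters = 100 * 1000 * 1000;
    printf("same cache line : %.3f s\n", run(&same_line.a, &same_line.b, iters));
    printf("separate lines  : %.3f s\n", run(&padded.a, &padded.b, iters));
    return 0;
}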

How come register file size in GPU's (eg GTX 1080) bigger than L2 cache size?

Looking at this fact, I've started wondering how registers work in GPUs. Before knowing this, I thought that going higher and higher up the hierarchical memory ladder, the size keeps on decreasing (which is intuitive: latency decreases, size decreases). What is the purpose of registers in GPUs, and why is their size greater than the L2/L1 cache?
Thanks.
In CPUs caches serve two basic purposes:
They enable temporal and spatial reuse of data already fetched from DRAM. This reduces the required bandwidth of the DRAM.
CPU caches provide a huge reduction of latency, which is extremely important for single threaded performance.
GPUs are not focused on single-thread performance, but are focused on throughput instead. Most of the time they also deal with working sets that are too big to fit into any reasonably sized cache. Small caches help in some situations, but overall caches are not nearly as important for GPUs as they are for CPUs.
Now to the second part of the question: Why huge register files? GPUs reach their performance by exploiting thread-level parallelism. Many threads need to be active at the same time to reach high performance levels. But every thread needs to store its own set of registers. In Maxwell GPUs, and likely in GP104/GTX 1080, every SM can host up to 2048 threads. Every SM has a 256 KB register file, so if all threads are used, 32 × 32-bit registers are available per thread.
I mentioned earlier that CPUs use caches to reduce memory latency, but GPUs must also somehow deal with memory latency. They just switch to a different thread while a thread is waiting for an answer from the memory. Latency, throughput and the number of threads are connected by Little's law:
(data in flight per thread) × threads = latency × throughput
The memory latency is likely a few hundred ns to a thousand ns (let's use 1000 ns). The throughput here would be the memory bandwidth (320 GB/s). To fully utilize the available memory bandwidth, we need (320 GB/s × 1000 ns =) 320 KB in flight. The GTX 1080 should have 20 SMs, so each SM would need to have 16 KB in flight to fully use the memory bandwidth. Even if all 2048 threads are used for memory access all the time, every thread would still be required to have 8 bytes of outstanding memory requests. If some threads are busy with calculations and cannot send out new memory requests, even more memory requests are required from the remaining threads. If threads use more than 32 registers each, fewer threads can be resident, so even more memory requests per thread are needed.
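Putting those numbers into a tiny sanity-check program (the values are the illustrative ones from the paragraph above, not measurements):

// Sanity-checking the Little's law numbers from the text (illustrative values only).
#include <stdio.h>

int main(void)
{
    const double latency        = 1000e-9;  // assumed ~1000 ns memory latency
    const double bandwidth      = 320e9;    // 320 GB/s DRAM bandwidth
    const int    sms            = 20;       // GTX 1080: 20 SMs
    const int    threads_per_sm = 2048;     // max resident threads per SM

    double bytes_in_flight = latency * bandwidth;          // ~320 KB total
    double per_sm          = bytes_in_flight / sms;        // ~16 KB per SM
    double per_thread      = per_sm / threads_per_sm;      // ~8 B per resident thread

    printf("in flight: %.0f KB, per SM: %.0f KB, per thread: %.0f B\n",
           bytes_in_flight / 1e3, per_sm / 1e3, per_thread);
    return 0;
}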
If GPUs used smaller register files, they could not use the full bandwidth of their memory. They would send out some work to the memory interface and then all threads would be waiting for answers from the memory interface, and no new work could be submitted to it. The huge register files are required to have enough threads available. Careful coding is still required to really get the maximum power of the GPU.
GPUs are built for 3D and compute, so vendors dedicate more die area to cores. More cores need more data to feed them, and that needs more GPU area for scheduling mechanisms to keep occupancy as high as possible.
There are many cores, many 3D pipeline units such as TMUs and ROPs, many scheduling parts, and very wide memory controllers to feed those cores.
GPU area is just not enough for everything. The least important part seems to be caches: even texture memory is more important than that, and it is faster too.
Making the GPU bigger means lower production yield, and that means less profit. Since GPU vendors are not charity organisations, they choose maximum profit, optimum performance and power savings (as of lately). Cache is expensive.
A compute unit in a GPU can have more than a kilobyte of registers per thread, so any data used repeatedly does not need to travel long distances (such as between cache and cores), which helps energy efficiency.
You can also hide the latency of some parts by having a good occupancy ratio for large-enough calculations; local shared memory (per compute unit) and registers (per thread) have the more important role in achieving that.
While the memory controller, L1 and L2 can handle only around 100 GB/s, 200 GB/s and 300 GB/s, local shared memory and registers can reach 5 TB/s and 15 TB/s of bandwidth for a GPU.

Hyper-threading and gaming (and other computing applications)?

I was wondering what the real-world performance effects are of hyperthreading (multiple logical cores for each physical core) in different situations. Intel advertises this as being effective for when threads of execution are waiting for I/O, however in memory intensive applications, it can be ineffective because when a switch occurs between logical cores, locality is lost in the processor cache. The second application's data is loaded into cache, forcing the first application's memory out of cache. Upon returning to the first application, its references are all cache misses and performance is lost. I know several super computer managers and they claim that they turn off hyperthreading because doing so is more efficient in their cases. Are there "normal" user cases where disabling hyperthreading is more efficient? Gaming can be pretty memory intensive--would it be better without hyperthreading?
First, it should be recognized that hyperthreading is an Intel marketing term labelling Switch-on-Event MultiThreading (on Itanium) and Simultaneous MultiThreading (on x86). SoEMT is primarily beneficial in hiding high latency events such as last level cache misses, is easier to implement, and is friendlier to VLIW-like scheduling. SoEMT is also a better fit for a small L1 (given a somewhat fast L2) than SMT since cache contention is moved more to L2 or L3 (thousands of accesses between thread switches) which can better handle contention given their greater capacity and higher associativity. SMT can be useful in hiding smaller latencies like branch resolution delay or L2 cache hits and provides instruction level parallelism, but introduces more intense contention for resources.
(There is also a difference between disabling hyperthreading and not using hyperthreading. Disabling hyperthreading might provide a small performance benefit in that some shareable resources will be used even by an inactive but enabled thread and some partitioned resources may still use a small amount of power, but the primary benefit would be in preventing the OS from making disruptive scheduling decisions.)
For "normal" code, the available thread-level parallelism may well be lower than the number of cores available. In that case, a modern OS typically will not use the hardware multithreading since it recognizes that a full core has more performance than a core shared by more than one thread. (Sharing a core can theoretically improve performance in special cases where using L1 to communicate between threads is unusually helpful. In addition, waking an inactive thread on an active core is much faster and requires less energy than waking up a core, so using multithreading might be helpful for energy efficiency in some special cases.)
HPC codes tend to be the worst case for SMT. HPC code is more likely to be friendly to static scheduling. This means that the latency hiding benefits of SMT tend to be minimized. (Similarly, HPC code tends to benefit less from out-of-order execution.) HPC code also tends to be constrained by memory bandwidth rather than memory latency. SMT can increase the bandwidth demand per unit of execution (by increasing cache misses) and reduce the actual achieved memory bandwidth by contention at the memory controller. (DRAM is not friendly to random access; such causes excessive refresh and row active cycles.) SMT may also cause the number of data streams that are active to exceed the hardware's support for prefetching. HPC code is also more likely to be blocked according to cache sizes assuming one thread per core; in such cases SMT will produce significant cache thrashing.
Disabling hyperthreading may also be friendlier to gang-scheduled operation, which is common in HPC. If only some of the cores are using multithreading, those cores might have higher performance per core yet would have lower performance per thread; that forces other cores to idly wait for the slowed threads to complete. (HPC systems may have dedicated OS cores and spare cores to avoid similar problems, where OS activity would slow down one core/thread and force hundreds of others to wait or where a failed core could cause, e.g., a 16-thread gang scheduled program to run 15 threads and then one thread, doubling execution time.)
(In theory, SMT could be used in HPC to reduce register pressure in some optimized loops since the effective latency of operations like FMADD in a dual threaded core may be viewed as roughly being halved. Since compilers generally use a fixed latency for scheduling [SMT is treated as a transparent feature], exploiting this feature is not generally practical even when it could be beneficial.)
Rather like out-of-order execution, SMT is most beneficial for irregular code. (OoO looks ahead in a single code stream for instruction level and memory level parallelism; SMT looks "sideways" across threads for such parallelism.) If branch mispredictions and cache misses are common, SMT can use existing thread-level parallelism to hide such latencies (the cost of a branch misprediction is largely in the latency of resolution).
The benefit from SMT varies by workload and by the specific hardware. A deeply pipelined in-order microarchitecture like the initial Intel Atom benefits more from SMT than a shallower pipelined OoO microarchitecture would (latencies, especially branch resolution latency, being generally higher with longer pipelines and OoO providing some parallelism that would otherwise be used by SMT's thread-level parallelism).
Enabled hyperthreading may also have the disadvantage of increasing the number of threads used by an application where performance scaling with increased thread count is sufficiently sublinear that the lower performance per thread with hyperthreading would result in a net loss of performance. E.g., if two-thread-per-core hyperthreading provided a 30% increase in per-core performance and doubling thread count increased performance by only 50%, then total performance would decrease by 2.5% (the doubled thread count extracts only 1.5/2 = 75% of an ideal doubling, and 1.3 × 0.75 = 0.975).
The standard advice of "when in doubt, measure" obviously applies.
Obviously some people don't understand some things. I have done so; here is what I copied from a site:
Depending on when you last bought a computer, you may remember Hyper-Threading as a feature that Intel introduced and then discontinued. This could understandably leave a sour taste in your mouth – why would Intel discontinue it if it wasn’t trouble?
The truth isn’t so grim. Hyper-Threading was for a time made available on certain Intel Pentium 4 and Intel Xeon processors. It was discontinued not because the feature itself was bad, but rather because the processor that used it turned out to be a bit of a misstep for other reasons. The Pentium 4 architecture was a minor disaster for Intel because it was incapable of going the direction Intel hoped (Intel wanted to have Pentium 4 processors with clock speeds of up to 10 GHz). As a result, Intel jumped back to designing processors based on the Pentium Pro family tree.
Hyper-Threading was gone, but not forgotten. Intel eventually found the time and resources to integrate it into another new processor architecture - Nehalem. This is the architecture that is the basis for all current Intel Core i3, i5 and i7 processors.
Source: http://www.makeuseof.com/tag/hyperthreading-technology-explained/

What is locality of reference?

I am having a problem understanding locality of reference. Can anyone please help me understand what it means, and what the following are:
Spatial Locality of reference
Temporal Locality of reference
This would not matter if your computer was filled with super-fast memory.
But unfortunately that's not the case and computer-memory looks something like this1:
+----------+
| CPU | <<-- Our beloved CPU, superfast and always hungry for more data.
+----------+
|L1 - Cache| <<-- ~4 CPU-cycles access latency (very fast), 2 loads/clock throughput
+----------+
|L2 - Cache| <<-- ~12 CPU-cycles access latency (fast)
+----+-----+
|
+----------+
|L3 - Cache| <<-- ~35 CPU-cycles access latency (medium)
+----+-----+ (usually shared between CPU-cores)
|
| <<-- This thin wire is the memory bus, it has limited bandwidth.
+----+-----+
| main-mem | <<-- ~100 CPU-cycles access latency (slow)
+----+-----+ <<-- The main memory is big but slow (because we are cheap-skates)
|
| <<-- Even slower wire to the harddisk
+----+-----+
| harddisk | <<-- Works at 0.001% of CPU speed
+----------+
Spatial Locality
In this diagram, the closer data is to the CPU the faster the CPU can get at it.
This is related to Spatial Locality. Data has spatial locality if it is located close together in memory.
Because of the cheap-skates that we are, RAM is not really Random Access Memory; it is really "Slow if Random, Less Slow if Accessed Sequentially" Memory (SIRLSIAS-AM). DDR SDRAM transfers a whole burst of 32 or 64 bytes for one read or write command.
That is why it is smart to keep related data close together, so you can do a sequential read of a bunch of data and save time.
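As a hedged illustration (the array size is arbitrary), the classic example is traversing a 2D array in row-major versus column-major order: the arithmetic is identical, but only the first order uses every byte of the cache lines and DRAM bursts it pulls in.

// Sketch: the same summation with good vs. poor spatial locality.
// Row-major order touches consecutive addresses; column-major order strides
// N doubles per access and wastes most of each fetched cache line / burst.
#include <stddef.h>

#define N 4096

double sum_row_major(const double (*m)[N])      /* good spatial locality */
{
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += m[i][j];                       /* consecutive addresses */
    return s;
}

double sum_col_major(const double (*m)[N])      /* poor spatial locality */
{
    double s = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += m[i][j];                       /* stride of N*8 bytes between accesses */
    return s;
}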
Temporal locality
Data stays in main memory, but it cannot all stay in the cache, or the cache would stop being useful. Only the most recently used data can be found in the cache; old data gets pushed out.
This is related to temporal locality. Data has strong temporal locality if its accesses are close together in time.
This is important because if item A is in the cache (good), then item B (with strong temporal locality to A) is very likely to also be in the cache.
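A minimal sketch of exploiting temporal locality (function names and parameters are made up): both routines do the same number of additions, but the blocked one re-reads each chunk while it is still hot in cache, while the naive one streams the whole array from DRAM on every pass (assuming block fits in L1/L2 and n is much larger than L3).

// Sketch: the same total work with good vs. poor temporal locality.
#include <stddef.h>

double sum_blocked(const double *a, size_t n, size_t block, int passes)
{
    double s = 0.0;
    for (size_t start = 0; start < n; start += block)
        for (int p = 0; p < passes; p++)              /* reuse the block while cached */
            for (size_t i = start; i < start + block && i < n; i++)
                s += a[i];
    return s;
}

double sum_naive(const double *a, size_t n, int passes)
{
    double s = 0.0;
    for (int p = 0; p < passes; p++)                  /* every pass re-fetches from DRAM */
        for (size_t i = 0; i < n; i++)
            s += a[i];
    return s;
}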
Footnote 1:
This is a simplification, with latency cycle counts estimated from various CPUs for example purposes, but they give you the right order-of-magnitude idea for typical CPUs.
In reality latency and bandwidth are separate factors, with latency harder to improve for memory farther from the CPU. But HW prefetching and/or out-of-order exec can hide latency in some cases, like looping over an array. With unpredictable access patterns, effective memory throughput can be much lower than 10% of L1d cache.
For example, L2 cache bandwidth is not necessarily 3x worse than L1d bandwidth. (But it is lower if you're using AVX SIMD to do 2x 32-byte loads per clock cycle from L1d on a Haswell or Zen2 CPU.)
This simplified version also leaves out TLB effects (page-granularity locality) and DRAM-page locality. (Not the same thing as virtual memory pages). For a much deeper dive into memory hardware and tuning software for it, see What Every Programmer Should Know About Memory?
Related: Why is the size of L1 cache smaller than that of the L2 cache in most of the processors? explains why a multi-level cache hierarchy is necessary to get the combination of latency/bandwidth and capacity (and hit-rate) we want.
One huge fast L1-data cache would be prohibitively power-expensive, and still not even possible with as low latency as the small fast L1d cache in modern high-performance CPUs.
In multi-core CPUs, L1i/L1d and L2 cache are typically per-core private caches, with a shared L3 cache. Different cores have to compete with each other for L3 and memory bandwidth, but each has its own L1 and L2 bandwidth. See How can cache be that fast? for a benchmark result from a dual-core 3GHz IvyBridge CPU: aggregate L1d cache read bandwidth on both cores of 186 GB/s vs. 9.6 GB/s DRAM read bandwidth with both cores active. (So memory = 10% L1d for single-core is a good bandwidth estimate for desktop CPUs of that generation, with only 128-bit SIMD load/store data paths). And L1d latency of 1.4 ns vs. DRAM latency of 72 ns.
It is a principle which states that if some variables are referenced by a program, it is highly likely that the same variables will be referenced again later in time (temporal locality). It is also highly likely that consecutive storage locations in memory will be referenced soon (spatial locality).
First of all, note that these concepts are not universal laws, they are observations about common forms of code behavior that allow CPU designers to optimize their system to perform better over most of the programs. At the same time, these are properties that programmers seek to adopt in their programs as they know that's how memory systems are built and that's what CPU designers optimize for.
Spatial locality refers to the property of some (most, actually) applications to access memory in a sequential or strided manner. This usually stems from the fact that the most basic data structure building blocks are arrays and structs, both of which store multiple elements adjacently in memory. In fact, many implementations of data structures that are semantically linked (graphs, trees, skip lists) are using arrays internally to improve performance.
Spatial locality allows a CPU to improve the memory access performance thanks to:
Memory caching granularities such as cache lines, page-table entries and memory-controller pages are by design larger than what is needed for a single access. This means that once you pay the memory penalty for bringing data in from far memory or from a lower-level cache, the more additional data you consume from it, the better your utilization.
Hardware prefetching, which exists on almost all CPUs today, often covers spatial accesses. Every time you fetch address X, the prefetcher will likely fetch the next cache line, and possibly others further ahead. If the program exhibits a constant stride, most CPUs would be able to detect that as well and extrapolate to prefetch even further steps of the same stride. Modern spatial prefetchers may even predict variable recurring strides (e.g. VLDP, SPP).
Temporal locality refers to the property of memory accesses or access patterns to repeat themselves. In the most basic form this could mean that if address X was once accessed it may also be accessed in the future, but since caches already store recent data for a certain duration this form is less interesting (although there are mechanisms on some CPUs aimed to predict which lines are likely to be accessed again soon and which are not).
A more interesting form of temporal locality is that two (or more) temporally adjacent accesses observed once, may repeat together again. That is - if you once accessed address A and soon after that address B, and at some later point the CPU detects another access to address A - it may predict that you will likely access B again soon, and proceed to prefetch it in advance.
Prefetchers aimed to extract and predict this type of relations (temporal prefetchers) are often using relatively large storage to record many such relations. (See Markov prefetching, and more recently ISB, STMS, Domino, etc..)
By the way, these concepts are in no way exclusive, and a program can exhibit both types of localities (as well as other, more irregular forms). Sometimes both are even grouped together under the term spatio-temporal locality to represent the "common" forms of locality, or a combined form where the temporal correlation connects spatial constructs (like address delta always following another address delta).
Temporal locality of reference - A memory location that has been used recently is more likely to be accessed again. For example, variables in a loop: the same set of variables (symbolic names for memory locations) is used across many iterations of the loop.
Spatial locality of reference - A memory location that is close to the currently accessed memory location is more likely to be accessed soon. For example, if you declare int a,b; float c,d; the compiler is likely to assign them consecutive memory locations. So if a is being used, it is very likely that b, c or d will be used in the near future. This is one way in which cache lines of 32 or 64 bytes help: they are not of size 4 or 8 bytes (the typical sizes of int, float, long and double variables).

Resources