Which is optimal: a bigger cache block size or a smaller one?

Given a cache with constant capacity and associativity, for the given code that computes the average of an array's elements, would a cache with a larger block size be preferred?
[from comments]
Examine the code given below to compute the average of an array:
total = 0;
for (j = 0; j < k; j++) {
    sub_total = 0;  /* Nested loops to avoid overflow */
    for (i = 0; i < N; i++) {
        sub_total += A[j*N + i];
    }
    total += sub_total / N;
}
average = total / k;

Related: in the more general case of typical access patterns with some but limited spatial locality, larger lines help up to a point. These "Memory Hierarchy: Set-Associative Cache" (PowerPoint) slides by Hong Jiang and/or Yifeng Zhu (U. Maine) have a graph of AMAT (Average Memory Access Time) vs. block size, showing a curve and also breaking it down into miss penalty vs. miss rate. The model looks like a simple one, for a simple in-order CPU that is bad at hiding memory latency (e.g. maybe not even pipelining multiple independent misses, i.e. no miss-under-miss).
There is a lot of good stuff in those slides, including a compiler-optimization section that mentions loop interchange (to fix nested loops with column-major vs. row-major order) and even cache blocking for more reuse. A lot of stuff on the Internet is crap, but I looked through these slides and they have some solid info on how caches are designed and what the tradeoffs are. The performance-analysis stuff is only really accurate for simple CPUs, not modern out-of-order CPUs that can overlap some computation with cache-miss latency, so more shorter misses is different from fewer longer misses.
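For reference, here's a toy version of that AMAT breakdown in C (every number below is invented, just to show the shape of the curve those slides plot):

#include <stdio.h>

/* Toy AMAT model: AMAT = hit_time + miss_rate * miss_penalty.
 * Larger blocks capture more spatial locality (lower miss rate) up to a
 * point, but take longer to transfer (higher miss penalty).
 * All values are made up for illustration. */
int main(void) {
    double hit_time = 1.0;                                    /* cycles */
    int    block_bytes[] = {16, 32, 64, 128, 256};
    double miss_rate[]   = {0.10, 0.06, 0.04, 0.035, 0.034};  /* assumed */
    for (int i = 0; i < 5; i++) {
        /* assume ~40 cycles fixed latency + burst time proportional to block size */
        double miss_penalty = 40.0 + block_bytes[i] / 8.0;
        double amat = hit_time + miss_rate[i] * miss_penalty;
        printf("block %3d B: AMAT = %.2f cycles\n", block_bytes[i], amat);
    }
    return 0;
}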
Specific answer to this question:
So the only workload you care about is a linear traversal of your elements? That makes cache line size nearly irrelevant for performance, assuming good hardware prefetching. (So larger lines mean less HW complexity and power usage for the same performance.)
With software prefetch, larger lines mean less prefetch overhead (although depending on the CPU design, that may not hurt performance if you still max out memory bandwidth.)
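As a rough sketch of what that looks like for a linear traversal (the 64-byte line size, the prefetch distance, and the _MM_HINT_T0 hint are assumptions that would need tuning; with larger lines you issue proportionally fewer prefetch instructions):

#include <stddef.h>
#include <xmmintrin.h>   /* _mm_prefetch */

/* Sum an array, software-prefetching PF_DIST lines ahead.
 * LINE_BYTES and PF_DIST are tuning knobs, not recommended values. */
#define LINE_BYTES 64
#define PF_DIST    8
float sum_with_prefetch(const float *a, size_t n) {
    size_t per_line = LINE_BYTES / sizeof(float);
    float total = 0.0f;
    for (size_t i = 0; i < n; i++) {
        if (i % per_line == 0 && i + PF_DIST * per_line < n)
            _mm_prefetch((const char *)&a[i + PF_DIST * per_line], _MM_HINT_T0);
        total += a[i];
    }
    return total;
}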
Without any prefetching, a larger line/block size would mean more hits following every demand-miss. A single traversal of an array has perfect spatial locality and no temporal locality. (Actually not quite perfect spatial locality at the start/end, if the array isn't aligned to the start of a cache line, and/or ends in the middle of a line.)
If a miss has to wait until the entire line is present in cache before the load that caused the miss can be satisfied, this slightly reduces the advantage of larger blocks. (But most of the latency of a cache miss is in the signalling and request overhead, not in waiting for the burst transfer to complete after it's already started.)
A larger block size means fewer requests in flight with the same bandwidth and latency, and limited concurrency is a real limiting factor in memory bandwidth in real CPUs. (See the latency-bound platforms part of this answer about x86 memory bandwidth: many-core Xeons with higher latency to L3 cache have lower single-threaded bandwidth than a dual or quad-core of the same clock speed. Each core only has 10 line-fill buffers to track outstanding L1 misses, and bandwidth = concurrency / latency.)
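To put numbers on bandwidth = concurrency / latency (the 80 ns round-trip latency below is an assumed figure, not a measurement):

#include <stdio.h>

/* Per-core bandwidth ceiling from limited miss concurrency:
 * bandwidth = outstanding_lines * line_size / latency.
 * 10 line-fill buffers, 64-byte lines and 80 ns are illustrative values. */
int main(void) {
    double lfbs = 10, line_bytes = 64, latency_ns = 80;
    double gb_per_s = lfbs * line_bytes / latency_ns;   /* bytes per ns == GB/s */
    printf("~%.1f GB/s single-core ceiling\n", gb_per_s);  /* ~8 GB/s */
    return 0;
}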
If your cache-miss handling has an early-restart design, even that bit of extra latency can be avoided. (That's very common, but Paul says it's theoretically possible for a CPU design not to have it.) The load that caused the miss gets its data as soon as it arrives. The rest of the cache-line fill happens "in the background", and hopefully later loads can also be satisfied from the partially received cache line.
Critical word first is a related feature, where the needed word is sent first (for use with early restart), and the burst transfer then wraps around to transfer the earlier words of the block. In this case, the critical word will always be the first word, so no special hardware support is needed beyond early restart. (The U. Maine slides I linked above mention early restart / critical word first and point out that it decreases the miss penalty for large cache lines.)
An out-of-order execution CPU (or software pipelining on an in-order CPU) could give you the equivalent of HW prefetch by having multiple demand-misses outstanding at once. If the CPU "sees" the loads to another cache line while a miss to the current cache line is still outstanding, the demand-misses can be pipelined, again hiding some of the difference between larger or smaller lines.
If lines are too small, you'll run into a limit on how many outstanding misses for different lines your L1D can track. With larger lines or smaller out-of-order windows, you might have some "slack" when there's no outstanding request for the next cache line, so you're not maxing out the bandwidth. And you pay for it with bubbles in the pipeline when you get to the end of a cache line and the start of the next line hasn't arrived yet, because it started too late (while ALU execution units were using data from too close to the end of the current cache line.)
Related: these slides don't say much about the tradeoff of larger vs. smaller lines, but look pretty good.

The simplistic answer is that larger cache blocks would be preferred, since the workload has no (data) temporal locality (no data reuse), perfect spatial locality (aside from the potentially inadequate alignment of the array for the first block and the potentially insufficient size of the array for the last block, every part of every block of data will be used), and a single access stream (no potential for conflict misses).
A more nuanced answer would consider several factors:
The size and alignment of the array: the fraction of the first and last cache blocks that will be unused and what fraction of the memory transfer time that represents. For a 1 GiB array, even 4 KiB blocks would waste less than 0.0008% of the memory bandwidth (see the worked arithmetic after this list).
The ability of the system to use critical word first: if the array is of modest size and there is no support for early use of data as it becomes available rather than waiting for the entire block to be filled, then the start-up overhead will remove much of the prefetching advantage of larger cache blocks.
The use of prefetching: software or hardware prefetching reduces the benefit of large cache blocks, and this workload is extremely friendly to prefetching.
The configuration of the memory system: e.g., using DRAM with an immediate page-close controller policy would increase the benefit of larger cache blocks, because each access would involve a row activate and a row close, often to the same DRAM bank, preventing latency overlap.
Whether the same block size is used for instructions and page-table accesses, and whether those accesses share the cache: instruction accesses provide a second "stream" which can introduce conflict misses, and with shared caching of a two-level hierarchical page table, TLB misses would access two cache blocks.
Whether simple way prediction is used: a larger block would increase prediction accuracy, reducing misprediction overhead.
And perhaps other factors.
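A quick check of the alignment-waste figure from the first point above (the exact waste depends on where the array happens to start and end; this takes the worst case of one partially used block at each end):

#include <stdio.h>

int main(void) {
    double array_bytes = 1024.0 * 1024 * 1024;   /* 1 GiB */
    double block_bytes = 4096.0;                 /* 4 KiB blocks */
    double worst_waste_pct = 2 * block_bytes / array_bytes * 100.0;
    printf("worst-case wasted bandwidth: %.5f%%\n", worst_waste_pct);  /* ~0.00076%, i.e. < 0.0008% */
    return 0;
}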

From your example code we can't say either way, as long as the hardware prefetcher can keep up a memory stream at maximum memory throughput.
In a random-access scenario a shorter cache line might be preferable, as you then don't need to fill the whole line. But the total amount of cached memory would go down, as you need more circuitry for tags and potentially more time for comparing.
So a compromise must be made: Intel has chosen 64 bytes per line (and fetches 2 lines), others have chosen 32 bytes per line.

Related

Is cache miss rate the only thing that matters when deciding which is better or performs better?

If I have two different cache subsystem designs C1 and C2 that both have roughly the same hardware complexity, can I decide which one is the better choice, given that the effectiveness of the cache subsystem is the prime factor, i.e., the number of misses should be minimized?
Given the total miss rates below:
miss_rate = (number of cache misses) / (number of cache references)
miss rate of C1 = 0.77
miss rate of C2 = 0.73
Is the given miss-rate information sufficient to decide which subsystem is better?
Yes, assuming hit latency is the same for both caches, actual miss rate on the workload you care about is the ultimate factor for that workload. It doesn't always generalize.
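For example, plugging the given miss rates into the usual AMAT formula with an assumed common hit time and miss penalty (the 1-cycle hit and 100-cycle miss penalty are made-up values, not part of the question):

#include <stdio.h>

/* AMAT = hit_time + miss_rate * miss_penalty.
 * Hit time and miss penalty are assumed equal for C1 and C2. */
int main(void) {
    double hit_time = 1.0, miss_penalty = 100.0;        /* cycles, illustrative */
    double amat_c1 = hit_time + 0.77 * miss_penalty;    /* 78 cycles */
    double amat_c2 = hit_time + 0.73 * miss_penalty;    /* 74 cycles */
    printf("C1: %.0f cycles, C2: %.0f cycles\n", amat_c1, amat_c2);
    return 0;   /* under these assumptions, C2 is the better choice */
}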
All differences in size, associativity, eviction policy, all matter because of their impact on miss rate on any given workload. Even cache line (block) size factors in to this: a cache with twice as many 32-byte lines vs. a cache with half as many 64-byte lines would be able to cache more scattered words, but pull in less nearby data on a miss. (Unless you have hardware prefetching, but again prefetch algorithms ultimately just affect miss rate.)
If hit and miss latencies are fixed, then all misses are equal and you just want fewer of them.
Well, not just latency, but overall effect on the pipeline, if the CPU isn't a simple in-order design from the 1980s that simply stalls on a miss. Which is what homework usually assumes, because otherwise the miss cost depends on details of the context, making it impossible to calculate performance based on just instruction mix, hit/miss rate, and miss costs.
An out-of-order exec CPU can hide the latency of some misses better than others. (On the critical path of some larger dependency chain vs. not.) Even an in-order CPU that can scoreboard loads can get work done in the shadow of a cache miss load, up until it reaches an instruction that reads the load result. (And with a store buffer, can usually hide store miss latency.) So miss penalty can differ depending on which loads miss, whether it's one that software instruction scheduling was able to hide more vs. less of the latency for. (If the independent work after a load includes other loads, then you'd need a non-blocking cache that handles hit-under-miss. Miss-under-miss to memory-level parallelism of multiple misses in flight also helps, as well as being able to get to a hit after 2 or more cache-miss loads.)
I think usually for most workloads with different cache geometries and sizes, there won't be significant bias towards more of the misses being easier to hide or not, so you could still say that miss-rate is the only thing that ultimately matters.
Miss-rate for a cache depends on workload, so you can't say that a lower miss rate on just one workload or trace makes it better on average or for all cases. e.g. an 8-way associative 16 KiB cache might have a higher hit rate than a 32 KiB 2-way cache on one workload (with a lot of conflict misses for the 2-way cache), but on a different workload where the working set is mostly one contiguous 24KiB array, the 32K 2-way cache might have a higher hit rate.
The term "better" is subjective as follows:
Hardware cost, in terms of silicon real-estate, meaning that a larger chip is more expensive to produce and thus costs more per chip.  (A larger cache may not even fit on the chip in question.)
Hardware cost, in terms of silicon process technology, meaning that a faster cache requires a more advanced chip process, so will increase costs per chip.
A miss rate on a given cache is workload specific (e.g. application specific or algorithm specific).  Thus, two different workloads may have different miss rates on each of the caches in question.  So, "better" here may mean across an average workload (or an average across several different workloads), but there's a lot of room for variability.
We would have to know the performance of the caches upon hit, and also upon miss — as a more complex cache with a higher hit rate might have longer timings.
In summary, in order to say that lower miss rate is better, we would have to know that all the other factors are equal.  Otherwise, the notion of better needs to be defined, perhaps to include cost/benefit definition.

Does anyone have an example where _mm256_stream_load_si256 (non-temporal load to bypass cache) actually improves performance?

Consider massively SIMD-vectorized loops over very large amounts of floating-point data (hundreds of GB) that, in theory, should benefit from non-temporal ("streaming", i.e. cache-bypassing) loads/stores.
Using a non-temporal store (_mm256_stream_ps) actually does significantly improve throughput, by about 25%, over a plain store (_mm256_store_ps).
However, I could not measure any difference when using _mm256_stream_load_si256 instead of _mm256_load_ps.
Does anyone have an example where _mm256_stream_load_si256 can be used to actually improve performance?
(Instruction set & hardware: AVX2 on AMD Zen 2, 64 cores.)
for (size_t i = 0; i < 1000000000 /* larger than L3 cache size */; i += 8)
{
#ifdef USE_STREAM_LOAD
    __m256 a = _mm256_castsi256_ps(_mm256_stream_load_si256((__m256i *)(source + i)));
#else
    __m256 a = _mm256_load_ps(source + i);
#endif
    a = _mm256_mul_ps(a, a);   /* a *= a */
#ifdef USE_STREAM_STORE
    _mm256_stream_ps(destination + i, a);
#else
    _mm256_store_ps(destination + i, a);
#endif
}
stream_load (vmovntdqa) is just a slower version of normal load (extra ALU uop) unless you use it on a WC memory region (uncacheable, write-combining).
The non-temporal hint is ignored by current CPUs, because unlike NT stores, the instruction doesn't override the memory ordering semantics. We know that's true on Intel CPUs, and your test results suggest the same is true on AMD.
Its purpose is for copying from video RAM back to main memory, as described in an Intel whitepaper. On current CPUs, it's useless unless you're copying from some kind of uncacheable device memory.
See also What is the difference between MOVDQA and MOVNTDQA, and VMOVDQA and VMOVNTDQ for WB/WC marked region? for more details. As my answer there points out, what can sometimes help if tuned carefully for your hardware and workload, is NT prefetch to reduce cache pollution. But tuning the prefetch distance is pretty brittle; too far and data will be fully evicted by the time you read it, instead of just missing L1 and hitting in L2.
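For what it's worth, a sketch of what NT prefetch looks like (the 16-line prefetch distance and the 64-byte line size are guesses, and the arrays are assumed 32-byte aligned; as said above, this is brittle and needs per-CPU, per-workload tuning):

#include <stddef.h>
#include <immintrin.h>   /* AVX intrinsics and _mm_prefetch / _MM_HINT_NTA */

/* Square each element, prefetching well ahead with the NTA hint so the
 * streamed-through source data pollutes the caches less. */
#define PF_LINES 16
#define FLOATS_PER_LINE 16   /* 64-byte line / 4-byte float */
void square_all(const float *src, float *dst, size_t n) {
    for (size_t i = 0; i + 8 <= n; i += 8) {
        if (i + PF_LINES * FLOATS_PER_LINE < n)
            _mm_prefetch((const char *)&src[i + PF_LINES * FLOATS_PER_LINE], _MM_HINT_NTA);
        __m256 a = _mm256_load_ps(&src[i]);
        _mm256_store_ps(&dst[i], _mm256_mul_ps(a, a));
    }
}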
There wouldn't be much if anything to gain in bandwidth anyway. Normal stores cost a read plus an eventual write-back on eviction for each cache line: the Read For Ownership (RFO) is required for cache coherency, and because write-back caches only track dirty status on a whole-line basis. NT stores can increase bandwidth by avoiding those RFO loads.
But plain loads aren't wasting anything; the only downside is evicting other data as you loop over huge arrays generating boatloads of cache misses, if you can't change your algorithm to have any locality.
If cache-blocking is possible for your algorithm, there's much more to gain from that, so you don't just bottleneck on DRAM bandwidth. e.g. do multiple steps over a subset of your data, then move on to the next.
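A minimal sketch of that idea (the two "steps" and the 256 KiB chunk size are placeholders, with the chunk assumed to fit comfortably in L2):

#include <stddef.h>
#include <math.h>

/* Instead of two full passes over a huge array (each one bound by DRAM
 * bandwidth), do both steps on one cache-sized chunk at a time. */
#define CHUNK (256 * 1024 / sizeof(float))
void two_steps_blocked(float *a, size_t n) {
    for (size_t base = 0; base < n; base += CHUNK) {
        size_t end = base + CHUNK < n ? base + CHUNK : n;
        for (size_t i = base; i < end; i++)   /* step 1 */
            a[i] = sqrtf(a[i]);
        for (size_t i = base; i < end; i++)   /* step 2: chunk is still hot in cache */
            a[i] = a[i] * 2.0f + 1.0f;
    }
}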
See also How much of ‘What Every Programmer Should Know About Memory’ is still valid? - most of it; go read Ulrich Drepper's paper.
Anything you can do to increase computational intensity helps (ALU work per time the data is loaded into L1d cache, or into registers).
Even better, make a custom loop that combines multiple steps that you were going to do on each element. Avoid stuff like for(i) A[i] = sqrt(B[i]) if there is an earlier or later step that also does something simple to each element of the same array.
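For instance (the steps here are hypothetical, just to show the fusion):

#include <stddef.h>
#include <math.h>

/* Unfused: B is streamed from memory twice, and A is written and then re-read. */
void unfused(float *A, const float *B, size_t n) {
    for (size_t i = 0; i < n; i++) A[i] = sqrtf(B[i]);
    for (size_t i = 0; i < n; i++) A[i] += B[i];
}

/* Fused: each element of B is loaded once and used for both steps,
 * and A is written only once. */
void fused(float *A, const float *B, size_t n) {
    for (size_t i = 0; i < n; i++) A[i] = sqrtf(B[i]) + B[i];
}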
If you're using NumPy or something, and just gluing together optimized building blocks that operate on large arrays, it's kind of expected that you'll bottleneck on memory bandwidth for algorithms with low computational intensity (like STREAM add or triad type of things).
If you're using C with intrinsics, you should be aiming higher. You might still bottleneck on memory bandwidth, but your goal should be to saturate the ALUs, or at least bottleneck on L2 cache bandwidth.
Sometimes it's hard, or you haven't gotten around to all the optimizations on your TODO list that you can think of, so NT stores can be good for memory bandwidth if nothing is going to re-read this data any time soon. But consider that a sign of failure, not success. CPUs have large fast caches, use them.
Further reading:
Enhanced REP MOVSB for memcpy - RFO vs. no-RFO stores (including NT stores), and how per-core memory bandwidth can be limited to concurrency / latency, given the latency of handing off cache lines to lower levels and the limited number of LFBs to track them. Especially on Intel server chips.
Non-temporal loads and the hardware prefetcher, do they work together? - no, NT loads are only useful on WC memory, where HW prefetch doesn't work. They kind of exist to fill that gap.

What is the advantage of caching an entire line instead of a single byte or word at a time?

To use cache memory, main memory is divided into cache lines, typically 32 or 64 bytes long. An entire cache line is cached at once. What is the advantage of caching an entire line instead of a single byte or word at a time?
This is done to exploit the principle of locality; spatial locality to be precise. This principle states that the data bytes which lie close together in memory are likely to be referenced together in a program. This is immediately apparent when accessing large arrays in loops. However, this is not always true (e.g. pointer based memory access) and hence it is not advisable to fetch data from memory at more than the granularity of cache lines (in case the program does not have locality of reference) since cache is a very limited and important resource.
Having the cache block size equal to the smallest addressable size would mean that, if a larger access size is supported, multiple tags would have to be checked for such larger accesses. While parallel tag checking is often used for set-associative caches, a four-fold increase (8-bit compared to 32-bit) in the number of tags to check would increase access latency and greatly increase energy cost. In addition, this introduces the possibility of partial hits for larger accesses, increasing the complexity of sending the data to a dependent operation or internal storage. While data can be speculatively sent by assuming a full hit (so latency need not be hurt by the possibility of partial hits), the complexity budget is better not spent on supporting partial hits.
32-bit cache blocks, when the largest access size is 32 bits, would avoid the above-mentioned issues, but would use a significant fraction of storage for tags. E.g., a 16KiB direct-mapped cache in a 32-bit address space would use 18 bits for the address portion of the tag; even without additional metadata such as coherence state, tags would use 36% of the storage. (Additional metadata might be avoided by having a 16KiB region of the address space be non-cacheable; a tag matching this address region would be interpreted as "invalid".)
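The 36% figure checks out (assuming a 32-bit address and no extra metadata bits, as stated):

#include <stdio.h>

/* 16 KiB direct-mapped cache with 4-byte (32-bit) blocks in a 32-bit
 * address space: 2 offset bits, 12 index bits (4096 blocks),
 * so the tag is 32 - 2 - 12 = 18 bits per 32-bit data block. */
int main(void) {
    int tag_bits = 32 - 2 - 12;
    int data_bits = 32;
    printf("tag overhead = %d / %d bits = %.0f%%\n",
           tag_bits, tag_bits + data_bits,
           100.0 * tag_bits / (tag_bits + data_bits));   /* 36% */
    return 0;
}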
Besides the storage overhead, having more tag data tends to increase latency (smaller tag storage facilitates earlier way selection) and access energy. In addition, having a smaller number of blocks for a cache of a given size makes way prediction and memoization easier, these are used to reduce latency and/or access energy.
(The storage overhead can be a significant factor when it allows tags to be on chip while data is too large to fit on chip. If data uses a denser storage type than tags — e.g., data in DRAM and tags in SRAM with a four-fold difference in storage density —, lower tag overhead becomes more significant.)
If caches only exploited temporal locality (the reuse of a memory location within a "short" period of time), this would typically be the most attractive block size. However, spatial locality of access (accesses to locations near an earlier access often being close in time) is common. Taken control-flow instructions are typically less than a sixth of all instructions, and many branches and jumps are short (so the branch/jump target is somewhat likely to be within the same cache block as the branch/jump instruction if each cache block holds four or more instructions). Stack frames are local to a function (concentrating the timing of accesses, especially for leaf functions, which are common). Array accesses often use unit stride or very small strides. Members of a structure/object tend to be accessed nearby in time (conceptually related data tends to be related in action/purpose and so accessed nearer in time). Even some memory allocation patterns bias access toward spatial locality: related structures/objects are often allocated nearby in time, and if the preferred free memory is not fragmented (which is likely if spatially local allocations are freed nearby in time, if little memory has been freed, or if the allocator is clever about reducing fragmentation), such allocations are more likely to be spatially local.
With multiple caches, coherence overhead also tends to be reduced with larger cache blocks (under the assumption of spatial locality). False sharing increases coherence overhead (similar to a lack of spatial locality increasing capacity and conflict misses).
In this sense, larger cache blocks can be viewed as a simple form of prefetching (even with respect to coherence). Prefetching trades bandwidth and cache capacity for a reduction in latency via cache hits (as well as from increasing the useful queue size and scheduling flexibility). One could gain the same benefit by always prefetching a chunk of memory into multiple small cache blocks, but the capacity benefit of finer-grained eviction would be modest because spatial locality of use is common. In addition, to avoid prefetching data that is already in the cache, the tags for the other blocks would have to be probed to check for hits.
With simple modulo-power-of-two indexing and modest associativity, two spatially nearby blocks are more likely to conflict with, and evict earlier, other blocks that also have spatial locality (index A and index B will have the same spatial-locality relationship for all addresses mapping to indexes within a larger address range). With LRU-oriented replacement, accesses within a larger cache block reduce the chance of a too-early eviction when spatial locality is common, at the cost of some capacity and conflict misses.
(For a direct-mapped cache, there is no difference between always prefetching a multi-block aligned chunk and using a larger cache block, so paying the extra tag overhead would be pointless.)
Prefetching into a smaller buffer would avoid cache pollution from used data, increasing the benefit of smaller block size, but such also reduces the temporal scope of the spatial locality. A four-entry prefetch buffer would only support spatial locality within four cache misses; this would catch most stream-like accesses (rarely will more than four new "streams" be active at the same time) and many other cases of spatial locality but some spatial locality is over a larger span of time.
Mandatory prefetching (whether from larger cache blocks or a more flexible mechanism) provides significant bandwidth advantages. First, the address and request type overhead is spread over a larger amount of data. 32 bits of address and request type overhead per 32 bit access uses 50% of the bandwidth for non-data but less than 12% when 256 bits of data are transferred.
Second, the memory controller processing and scheduling overhead can be more easily averaged over more transferred data.
Finally, DRAM chips can provide greater bandwidth by exploiting internal prefetch. Even in the days of Fast Page Mode DRAM, accesses within the same DRAM page were faster and higher bandwidth (less page precharge and activation overhead); while non-mandatory prefetch could exploit this and be more general, the control and communication overheads would be larger. Modern DRAMs have minimum burst lengths (and burst chop merely drops part of the DRAM-chip-internal prefetch; the internal access energy and array occupation are not reduced).
The ideal cache block size depends on workload ('natural' algorithm choices and legacy optimization assumptions, data set sizes and complexity, etc.), cache sizes and associativity (larger and more associative caches encourage larger blocks), available bandwidth, use of in-cache data compression (which tends to encourage larger blocks), cache block sectoring (where validity/coherence state is tracked at finer granularity than the address), and other factors.
The main advantage of caching an entire line is that the probability of the next access being a cache hit is increased.
From Tanenbaum's "Modern Operating Systems" book:
Cache hit: when the program needs to read a memory word, the cache hardware checks to see whether the line needed is in the cache.
If we don't have a cache hit, then a cache miss occurs and a memory request is sent to main memory.
As a result, more time is spent completing the access, since fetching from main memory is costly.
We can tell that caching an entire line increases the probability of completing the access in a couple of cycles (a cache hit) rather than going out to main memory.

Is memory latency affected by CPU frequency? Is it a result of memory power management by the memory controller?

I basically need some help to explain/confirm some experimental results.
Basic Theory
A common idea expressed in papers on DVFS is that execution times have on-chip and off-chip components. On-chip components of execution time scale linearly with CPU frequency whereas the off-chip components remain unaffected.
Therefore, for CPU-bound applications, there is a linear relationship between CPU frequency and instruction-retirement rate. On the other hand, for a memory bound application where the caches are often missed and DRAM has to be accessed frequently, the relationship should be affine (one is not just a multiple of the other, you also have to add a constant).
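A toy version of that on-chip/off-chip split (the instruction count, cycle count, and memory time below are invented; only the shape matters):

#include <stdio.h>

/* Execution time = on-chip cycles / f + off-chip (memory) time.
 * For a CPU-bound run the second term is ~0, so the retirement rate
 * scales linearly with f; otherwise the relationship is affine. */
int main(void) {
    double instructions   = 1e9;   /* assumed */
    double onchip_cycles  = 2e9;   /* assumed */
    double offchip_time_s = 0.5;   /* assumed, independent of f */
    for (double f_ghz = 1.0; f_ghz <= 3.0; f_ghz += 0.5) {
        double t = onchip_cycles / (f_ghz * 1e9) + offchip_time_s;
        printf("f = %.1f GHz: retirement rate = %.3f instr/ns\n",
               f_ghz, instructions / (t * 1e9));
    }
    return 0;
}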
Experiment
I was doing experiments looking at how CPU frequency affects instruction-retirement rate and execution time under different levels of memory-boundedness.
I wrote a test application in C that traverses a linked list. I effectively create a linked list whose individual nodes have sizes equal to the size of a cache-line (64 bytes). I allocated a large amount of memory that is a multiple of the cache-line size.
The linked list is circular such that the last element links to the first element. Also, this linked list randomly traverses through the cache-line sized blocks in the allocated memory. Every cache-line sized block in the allocated memory is accessed, and no block is accessed more than once.
Because of the random traversal, I assumed it should not be possible for the hardware to use any prefetching. Basically, by traversing the list, you have a sequence of memory accesses with no stride pattern, no temporal locality, and no spatial locality. Also, because this is a linked list, one memory access cannot begin until the previous one completes. Therefore, the memory accesses should not be parallelizable.
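A minimal sketch of that kind of pointer-chasing setup (the node layout and shuffle here are one plausible implementation, not necessarily the code the experiment actually used):

#include <stdlib.h>
#include <stddef.h>

/* One node per 64-byte cache line; 'next' is the only field used. */
struct node { struct node *next; char pad[64 - sizeof(struct node *)]; };

/* Shuffle the node indices, then link the nodes in shuffled order into a
 * single ring. Each load depends on the previous one, so misses can't be
 * overlapped and hardware prefetch has no stride pattern to latch onto. */
struct node *build_ring(struct node *nodes, size_t n) {
    size_t *perm = malloc(n * sizeof *perm);
    for (size_t i = 0; i < n; i++) perm[i] = i;
    for (size_t i = n - 1; i > 0; i--) {           /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (size_t i = 0; i < n; i++)
        nodes[perm[i]].next = &nodes[perm[(i + 1) % n]];
    struct node *head = &nodes[perm[0]];
    free(perm);
    return head;
}

struct node *chase(struct node *p, size_t steps) {
    while (steps--) p = p->next;
    return p;   /* return the final pointer so the loop isn't optimized away */
}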
When the amount of allocated memory is small enough, you should have no cache misses beyond initial warm up. In this case, the workload is effectively CPU bound and the instruction-retirement rate scales very cleanly with CPU frequency.
When the amount of allocated memory is large enough (bigger than the LLC), you should be missing the caches. The workload is memory bound and the instruction-retirement rate should not scale as well with CPU frequency.
The basic experimental setup is similar to the one described here:
"Actual CPU Frequency vs CPU Frequency Reported by the Linux "cpufreq" Subsystem".
The above application is run repeatedly for some duration. At the start and end of the duration, the hardware performance counter is sampled to determine the number of instructions retired over the duration. The length of the duration is measured as well. The average instruction-retirement rate is measured as the ratio between these two values.
This experiment is repeated across all the possible CPU frequency settings using the "userspace" CPU-frequency governor in Linux. Also, the experiment is repeated for the CPU-bound case and the memory-bound case as described above.
Results
The two following plots show results for the CPU-bound case and memory-bound case respectively. On the x-axis, the CPU clock frequency is specified in GHz. On the y-axis, the instruction-retirement rate is specified in (1/ns).
A marker is placed for each repetition of the experiment described above. The line shows what the result would be if the instruction-retirement rate increased at the same rate as the CPU frequency and passed through the lowest-frequency marker.
Results for the CPU-bound case.
Results for the memory-bound case.
The results make sense for the CPU-bound case, but not as much for the memory-bound case. All the markers for the memory-bound case fall below the line, which is expected because the instruction-retirement rate should not increase at the same rate as CPU frequency for a memory-bound application. The markers appear to fall on straight lines, which is also expected.
However, there appear to be step changes in the instruction-retirement rate with changes in CPU frequency.
Question
What is causing the step changes in the instruction-retirement rate? The only explanation I could think of is that the memory controller is somehow changing the speed and power-consumption of memory with changes in the rate of memory requests. (As instruction-retirement rate increases, the rate of memory requests should increase as well.) Is this a correct explanation?
You seem to have exactly the results you expected: a roughly linear trend for the CPU-bound program, and a shallow(er) affine one for the memory-bound case (which is less affected by CPU frequency). You will need a lot more data to determine whether they are consistent steps or whether they are, as I suspect, mostly random jitter depending on how 'good' the list is.
The CPU clock will affect bus clocks, which will affect timings and so on; synchronisation between differently clocked buses is always challenging for hardware designers. The spacing of your steps is interestingly 400 MHz, but I wouldn't draw too much from this; generally, this kind of thing is way too complex and hardware-specific to be properly analysed without 'inside' knowledge of the memory controller used, etc.
(Please draw nicer lines of best fit.)

CUDA memory coalescing

I would like first to confirm the following:
The elementary global-memory transaction to shared memory is either 32 bytes, 64 bytes, or 128 bytes, but only if the memory accesses can be coalesced. The latencies of these transactions are all equal. Is that right?
Second question: if the memory reads can't be coalesced, does each thread read only 4 bytes (is that right?), and will all the threads' memory accesses be made sequential?
It depends on the architecture you are working on. However, on Fermi and Kepler you have:
Memory transactions are always 32 bytes or 128 bytes, called segments.
32-byte segments are used when only the L2 cache is used; 128-byte segments when L2+L1 are used.
If two threads of the same warp fall into the same segment, the data is delivered in a single transaction.
If, on the other hand, there is data in a segment you fetch that no thread requested, it is read anyway and you (probably) waste bandwidth.
Whole segments fall into the L1 & L2 caches and may reduce your bandwidth pressure when neighbouring warps need the same segment.
L1 & L2 are fairly small compared to the number of threads they usually serve. That is why you should not expect a piece of data to stay in the cache for long (in contrast to CPU programming).
You can disable L1 caching, which may help if you over-fetch in random memory access patterns.
As you can see, there are several variables which decide how much time your memory access is going to take. The general rule of thumb is: the denser your access pattern, the better! Strides or misalignment are not as costly now as they were in the past, so don't worry too much about that unless you are doing some late-stage optimizations.
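To make the "denser is better" rule concrete, here is a small back-of-the-envelope helper in plain C (warp size 32, 4-byte elements, and 32-byte segments are the assumptions): it counts how many segments one warp touches for a given element stride, which is roughly proportional to the bandwidth the access pattern costs.

#include <stdio.h>

/* Count distinct 32-byte segments touched by one warp (32 threads)
 * reading 4-byte elements with a given element stride. */
int segments_touched(int stride_elems) {
    int seen[1024] = {0}, count = 0;
    for (int t = 0; t < 32; t++) {
        int byte_addr = t * stride_elems * 4;
        int seg = byte_addr / 32;
        if (!seen[seg]) { seen[seg] = 1; count++; }
    }
    return count;
}

int main(void) {
    printf("stride 1:  %d segments\n", segments_touched(1));   /* 4: fully coalesced */
    printf("stride 2:  %d segments\n", segments_touched(2));   /* 8: half the fetched data unused */
    printf("stride 32: %d segments\n", segments_touched(32));  /* 32: one segment per thread */
    return 0;
}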

Resources