____cacheline_aligned_in_smp for structure in the Linux kernel - linux-kernel

In the Linux kernel, why do many structures use the ____cacheline_aligned_in_smp macro? Does it help increase performance when accessing the structure? If yes then how?

____cacheline_aligned instructs the compiler to instantiate a struct or variable at an address corresponding to the beginning of an L1 cache line for the specific architecture, i.e., so that it is L1 cache-line aligned. ____cacheline_aligned_in_smp is similar, but takes effect only when the kernel is compiled in SMP configuration (i.e., with option CONFIG_SMP). Both are defined in include/linux/cache.h.
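For reference, the definitions look roughly like this (a simplified sketch of include/linux/cache.h; exact details vary by kernel version and architecture):

/* simplified sketch of include/linux/cache.h; SMP_CACHE_BYTES is the
 * per-architecture L1 cache-line size */
#ifndef ____cacheline_aligned
#define ____cacheline_aligned __attribute__((__aligned__(SMP_CACHE_BYTES)))
#endif

#ifdef CONFIG_SMP
#define ____cacheline_aligned_in_smp ____cacheline_aligned
#else
#define ____cacheline_aligned_in_smp /* no-op on uniprocessor builds */
#endif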
These definitions are useful for variables (and data structures) that are not allocated dynamically via some allocator, but are global, compiler-allocated variables (a similar effect can be achieved with dynamic memory allocators that can allocate memory at a specified alignment).
The reason for cache-line aligning variables is to control the cache-to-cache transfer of these variables by the hardware cache-coherence mechanisms in SMP systems, so that they do not move implicitly when other variables move. This matters for performance-critical code, where one expects contention from multiple cpus (cores) accessing the same variables. The usual problem one tries to avoid here is false sharing.
A variable's memory starting at the beginning of a cache line is half the work for this purpose; one also needs to "pack with it" only variables that should move together. An example is an array of variables, where each element of the array is to be accessed by only one cpu (core):
struct my_data {
    long int a;
    int b;
} ____cacheline_aligned_in_smp cpu_data[NR_CPUS];
This kind of definition requires the compiler (in an SMP configuration of the kernel) to start each cpu's struct at a cache-line boundary. The compiler will implicitly allocate extra space after each cpu's struct, so that the next cpu's struct also begins at a cache-line boundary.
An alternative is to pad the data structure with a cache line's size of dummy, unused bytes:
struct my_data {
    long int a;
    int b;
    char dummy[L1_CACHE_BYTES];
} cpu_data[NR_CPUS];
In this case, only the dummy, unused data will move unintentionally; the data actually accessed by each cpu will only move between cache and memory, and vice versa, due to cache capacity misses.

Each cache line in any cache (dcache or icache) is 64 bytes on the x86 architecture. Cache alignment is required to avoid false sharing of cache lines. If a cache line is shared between global variables (which happens more in the kernel) and one of those variables is changed by a processor in its cache, that cache line is marked dirty. In all other CPUs' caches the line becomes a stale entry, which must be flushed and re-fetched from memory. This leads to cache line misses, which cost more CPU cycles and reduce the performance of the system. Remember this is for global variables. Most kernel data structures use this to avoid cache line misses.
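As an illustration, here is a hedged userspace sketch of false sharing (not kernel code; it assumes a 64-byte line, GCC-style attributes, and pthreads). Two threads increment adjacent counters; padding and aligning each counter onto its own line avoids the cross-CPU line ping-pong, and deleting the pad and alignment brings it back:

#include <pthread.h>
#include <stdio.h>

#define CACHE_LINE 64
#define ITERS 100000000UL

struct padded_counter {
    volatile unsigned long value;
    /* pad so each counter occupies a full line; remove the pad and
     * the aligned attribute to observe the falsely-shared version */
    char pad[CACHE_LINE - sizeof(unsigned long)];
} __attribute__((aligned(CACHE_LINE)));

static struct padded_counter counters[2];

static void *worker(void *arg)
{
    struct padded_counter *c = arg;
    unsigned long i;
    for (i = 0; i < ITERS; i++)
        c->value++;    /* each thread touches only its own counter */
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    int i;
    for (i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, &counters[i]);
    for (i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    printf("%lu %lu\n", counters[0].value, counters[1].value);
    return 0;
}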

Linux manages the CPU cache in a very similar fashion to the TLB. CPU caches, like TLB caches, take advantage of the fact that programs tend to exhibit locality of reference. To avoid having to fetch data from main memory for each reference, the CPU will instead cache very small amounts of data in the CPU cache. Frequently, there are two levels, called the Level 1 and Level 2 CPU caches. The Level 2 caches are larger but slower than the L1 cache, but Linux concerns itself only with the Level 1, or L1, cache.
CPU caches are organised into lines. Each line is typically quite small, usually 32 bytes, and each line is aligned to its boundary size. In other words, a cache line of 32 bytes will be aligned on a 32-byte address. With Linux, the size of the line is L1_CACHE_BYTES, which is defined by each architecture.
How addresses are mapped to cache lines varies between architectures, but the mappings come under three headings: direct mapping, associative mapping and set associative mapping. Direct mapping is the simplest approach, where each block of memory maps to only one possible cache line. With associative mapping, any block of memory can map to any cache line. Set associative mapping is a hybrid approach where any block of memory can map to any line, but only within a subset of the available lines.
Regardless of the mapping scheme, they each have one thing in common: addresses that are close together and aligned to the cache size are likely to use different lines. Hence Linux employs simple tricks to try to maximise cache usage:
Frequently accessed structure fields are at the start of the structure to increase the chance that only one line is needed to address the common fields;
Unrelated items in a structure should try to be at least cache-size bytes apart to avoid false sharing between CPUs;
Objects in the general caches, such as the mm_struct cache, are aligned to the L1 CPU cache to avoid false sharing.
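Putting those tricks together, a kernel-style structure might be laid out like this sketch (the struct and field names are hypothetical; only the layout rules come from the list above):

struct example_obj {
    /* hot, frequently accessed fields first, so they likely share one line */
    unsigned long state;
    void *owner;

    /* a field hammered by other CPUs is pushed onto its own line to
     * avoid false sharing with the hot fields above */
    unsigned long stats ____cacheline_aligned_in_smp;
};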
If the CPU references an address that is not in the cache, a cache miss occurs and the data is fetched from main memory. The cost of cache misses is quite high as a reference to cache can typically be performed in less than 10ns where a reference to main memory typically will cost between 100ns and 200ns. The basic objective is then to have as many cache hits and as few cache misses as possible.
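As a back-of-envelope illustration of why the hit rate matters so much, plug the figures just quoted into the standard average-memory-access-time formula (a sketch; 10ns and 150ns are simply taken from the ranges above):

/* AMAT = hit time + miss rate * miss penalty, using the ~10ns cache
 * and ~150ns memory reference costs quoted above */
double amat(double hit_ns, double miss_ns, double miss_rate)
{
    return hit_ns + miss_rate * miss_ns;
}
/* amat(10, 150, 0.01) = 11.5ns, amat(10, 150, 0.10) = 25ns: a tenfold
 * worse miss rate more than doubles the effective access time */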
Just as some architectures do not automatically manage their TLBs, some do not automatically manage their CPU caches. The hooks are placed in locations where the virtual to physical mapping changes, such as during a page table update. The CPU cache flushes should always take place first as some CPUs require a virtual to physical mapping to exist when the virtual address is being flushed from the cache.

Related

VIPT Cache: Connection between TLB & Cache?

I just want to clarify the concept, and could not find detailed enough answers that throw some light on how everything actually works out in the hardware. Please provide any relevant details.
In the case of VIPT caches, the memory request is sent in parallel to both the TLB and the cache.
From the TLB we get the translated physical address.
From the cache indexing we get a list of tags (e.g. from all the cache lines belonging to a set).
Then the physical address from the TLB is matched against the list of tags to find a candidate.
My question is: where is this check performed?
In the cache?
If not in the cache, where else?
If the check is performed in the cache, then is there a side-band connection from the TLB to the cache module to get the translated physical address needed for comparison with the tag addresses?
Can somebody please throw some light on how this is "actually" implemented, and on the connection between the cache module and the TLB (MMU) module?
I know this depends on the specific architecture and implementation.
But what is the implementation you know of when there is a VIPT cache?
Thanks.
At this level of detail, you have to break "the cache" and "the TLB" down into their component parts. They're very tightly interconnected in a design that uses the VIPT speed hack of translating in parallel with tag fetch (i.e. taking advantage of the index bits all being below the page offset and thus being translated "for free". Related: Why is the size of L1 cache smaller than that of the L2 cache in most of the processors?)
The L1dTLB itself is a small/fast Content addressable memory with (for example) 64 entries and 4-way set associative (Intel Skylake). Hugepages are often handled with a second (and 3rd) array checked in parallel, e.g. 32-entry 4-way for 2M pages, and for 1G pages: 4-entry fully (4-way) associative.
But for now, simplify your mental model and forget about hugepages.
The L1dTLB is a single CAM, and checking it is a single lookup operation.
"The cache" consists of at least these parts:
the SRAM array that stores the tags + data in sets
control logic to fetch a set of data+tags based on the index bits. (High-performance L1d caches typically fetch data for all ways of the set in parallel with tags, to reduce hit latency vs. waiting until the right tag is selected like you would with larger more highly associative caches.)
comparators to check the tags against a translated address, and select the right data if one of them matches, or trigger miss-handling. (And on hit, update the LRU bits to mark this way as Most Recently Used). For a diagram of the basics for a 2-way associative cache without a TLB, see https://courses.cs.washington.edu/courses/cse378/09wi/lectures/lec16.pdf#page=17. The = inside a circle is the comparator: producing a boolean true output if the tag-width inputs are equal.
The L1dTLB is not really separate from the L1D cache. I don't actually design hardware, but I think a load execution unit in a modern high-performance design works something like this:
AGU generates an address from register(s) + offset.
(Fun fact: Sandybridge-family optimistically shortcuts this process for simple addressing mode: [reg + 0-2047] has 1c lower load-use latency than other addressing modes, if the reg value is in the same 4k page as reg+disp. Is there a penalty when base+offset is in a different page than the base?)
The index bits come from the offset-within-page part of the address, so they don't need translating from virtual to physical; or rather, the translation is a no-op. This VIPT speed with the non-aliasing behaviour of a PIPT cache works as long as L1_size / associativity <= page_size, e.g. 32kiB / 8-way = 4k pages. (A bit-breakdown sketch follows these steps.)
The index bits select a set. Tags+data are fetched in parallel for all ways of that set. (This costs power to save latency, and is probably only worth it for L1; higher-associativity (more ways per set) L3 caches definitely don't do it.)
The high bits of the address are looked up in the L1dTLB CAM array.
The tag comparator receives the translated physical-address tag and the fetched tags from that set.
If there's a tag match, the cache extracts the right bytes from the data for the way that matched (using the offset-within-line low bits of the address, and the operand-size).
Or instead of fetching the full 64-byte line, it could have used the offset bits earlier to fetch just one (aligned) word from each way. CPUs without efficient unaligned loads are certainly designed this way. I don't know if this is worth doing to save power for simple aligned loads on a CPU which supports unaligned loads.
But modern Intel CPUs (P6 and later) have no penalty for unaligned load uops, even for 32-byte vectors, as long as they don't cross a cache-line boundary. Byte-granularity indexing for 8 ways in parallel probably costs more than just fetching the whole 8 x 64 bytes and setting up the muxing of the output while the fetch+TLB is happening, based on offset-within-line, operand-size, and special attributes like zero- or sign-extension, or broadcast-load. So once the tag-compare is done, the 64 bytes of data from the selected way might just go into an already-configured mux network that grabs the right bytes and broadcasts or sign-extends.
AVX512 CPUs can even do 64-byte full-line loads.
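For concreteness, here is the bit-breakdown sketch promised above, for an assumed 32 KiB / 8-way / 64 B-line VIPT L1d with 4 KiB pages (the sizes used in this answer; the helper names are made up):

#include <stdint.h>

#define LINE_BITS 6 /* 64-byte lines: bits [5:0] are the offset within a line */
#define SET_BITS  6 /* 32 KiB / 8 ways / 64 B = 64 sets: bits [11:6] are the index */

static inline uint64_t line_offset(uint64_t va) { return va & ((1u << LINE_BITS) - 1); }
static inline uint64_t set_index(uint64_t va) { return (va >> LINE_BITS) & ((1u << SET_BITS) - 1); }
static inline uint64_t tag_of(uint64_t pa) { return pa >> (LINE_BITS + SET_BITS); }

Index + offset together are 12 bits, exactly the 4 KiB page offset, so set selection can start in parallel with the TLB lookup; only the final tag compare needs the translated physical address.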
If there's no match in the L1dTLB CAM, the whole cache fetch operation can't continue. I'm not sure if / how CPUs manage to pipeline this so other loads can keep executing while the TLB miss is resolved. That process involves checking the L2TLB (Skylake: unified 1536-entry 12-way for 4k and 2M, 16-entry for 1G), and if that fails, a page walk.
I assume that a TLB miss results in the tag+data fetch being thrown away. They'll be re-fetched once the needed translation is found. There's nowhere to keep them while other loads are running.
At the simplest, it could just re-run the whole operation (including fetching the translation from L1dTLB) when the translation is ready, but it could lower the latency for L2TLB hits by short-cutting the process and using the translation directly instead of putting it into L1dTLB and getting it back out again.
Obviously that requires that the dTLB and L1D are really designed together and tightly integrated. Since they only need to talk to each other, this makes sense. Hardware page walks fetch data through the L1D cache. (Page tables always have known physical addresses to avoid a catch 22 / chicken-egg problem).
is there a side-band connection from TLB to the Cache?
I wouldn't call it a side-band connection. The L1D cache is the only thing that uses the L1dTLB. Similarly, L1iTLB is used only by the L1I cache.
If there's a 2nd-level TLB, it's usually unified, so both the L1iTLB and L1dTLB check it if they miss. Just like split L1I and L1D caches usually check a unified L2 cache if they miss.
Outer caches (L2, L3) are pretty universally PIPT. Translation happens during the L1 check, so physical addresses can be sent to other caches.

Is the Translation Lookaside Buffer (TLB) at the same level as the L1 cache to the CPU? So, can I overlap virtual address translation with the L1 cache access?

I am trying to understand the whole structure and the concepts of caching. Since we use a TLB for fast mapping of virtual addresses to physical addresses: if we use a virtually-indexed, physically-tagged L1 cache, can one overlap the virtual address translation with the L1 cache access?
Yes, that's the whole point of a VIPT cache.
Since the virtual and physical addresses match in their lower bits (the page offset is unchanged by translation), you don't need to translate them. Most VIPT caches are built around this (note that it limits the number of sets you can use, but you can grow their associativity instead), so you can use the lower bits to do a lookup in that cache even before you have found the translation in the TLB.
This is critical because the TLB lookup itself takes time, and the L1 caches are usually designed to provide as much BW and low latency as possible to avoid stalling the often much-faster execution.
If you miss the TLB and suffer an even greater latency (either a level-2 TLB lookup or, god forbid, a page walk), it's less critical, since you can't really do anything with the cache lookup until you can compare the tag; but the few cycles saved in the TLB-hit + cache-hit case are the common case in many applications, so that is usually considered worth optimizing and aligning the pipelines for.

Is it true that a modern OS may skip the copy when realloc is called?

While reading https://stackoverflow.com/a/3190489/196561 I had a question about what the Qt authors say in Inside the Qt 4 Containers:
... QVector uses realloc() to grow by increments of 4096 bytes. This makes sense because modern operating systems don't copy the entire data when reallocating a buffer; the physical memory pages are simply reordered, and only the data on the first and last pages need to be copied.
My questions are:
1) Is it true that modern OSes (Linux is the most interesting for me; also FreeBSD, OSX, Windows) and their realloc implementations are really capable of reallocating pages of data by reordering the virtual-to-physical mapping, without a byte-by-byte copy?
2) What is the system call used to achieve this memory move? (I think it could be splice with SPLICE_F_MOVE, but that was flawed and is a no-op now (?))
3) Is it profitable to use such page shuffling instead of a byte-by-byte copy, especially in a multicore, multithreaded world where every change of the virtual-to-physical mapping needs to flush (invalidate) the changed page table entries from the TLBs in all tens of CPU cores via IPI? (In Linux this is something like flush_tlb_range or flush_tlb_page.)
upd for q3: some tests of mremap vs memcpy
Is it true that modern OSes (Linux is the most interesting for me; also FreeBSD, OSX, Windows) and their realloc implementations are really capable of reallocating pages of data by reordering the virtual-to-physical mapping, without a byte-by-byte copy?
What is the system call used to achieve this memory move? (I think it could be splice with SPLICE_F_MOVE, but that was flawed and is a no-op now (?))
See thejh's answer.
Who are the actors?
You have at least three actors with your Qt example.
Qt's QVector class
glibc's realloc()
Linux's mremap
QVector::capacity() shows that Qt allocates more elements than required. This means that a typical addition of an element will not realloc() anything. The glibc allocator is based on Doug Lea's allocator. This is a binning allocator which supports the use of Linux's mremap. A binning allocator groups similar-sized allocations into bins, so a typical random-sized allocation will still have some room to grow without needing to call into the system. I.e., the free pool or slack is located at the end of the allocated memory.
An answer
Is it profitable to use such page shuffling instead of a byte-by-byte copy, especially in a multicore, multithreaded world where every change of the virtual-to-physical mapping needs to flush (invalidate) the changed page table entries from the TLBs in all tens of CPU cores via IPI? (In Linux this is something like flush_tlb_range or flush_tlb_page.)
First off, the "faster ... than mremap" result linked above is mis-using mremap(), as R notes there.
There are several things that make mremap() valuable as a primitive for realloc().
Reduced memory consumption.
Preserving page mappings.
Avoiding moving data.
Everything in this answer is based upon Linux's implementation, but the semantics can be transferred to other OS's.
Reduce memory consumption
Consider a naive realloc().
void *realloc(void *ptr, size_t size)
{
    size_t old_size = get_sz(ptr); /* from bin, address, map table, etc. */

    if (size <= old_size) {
        resize(ptr);               /* shrink bookkeeping in place */
        return ptr;
    }

    void *new_p = malloc(size);
    if (new_p) {
        memcpy(new_p, ptr, old_size); /* fully committed old_size + new size */
        free(ptr);
    }
    return new_p;
}
In order to support this, you may need double the memory of the realloc()'d region before you go to swap, or may simply fail to reallocate.
Preserve page mappings
Linux will by default map new allocations to a zero page: a 4k page full of zero data. This is useful for sparsely mapped data structures. If no one writes to the data page, then no physical memory is allocated besides a possible PTE table. These mappings are copy-on-write, or COW. With the naive realloc(), these mappings will not be preserved, and full physical memory is allocated for all the zero pages.
If the task is involved in a fork(), the initial realloc() data may be shared between parent and child. Again, COW will cause physical allocation of pages. The naive implementation will discount this and require separate physical memory per process.
If the system is under memory pressure, the existing realloc() pages may not be in physical memory but in swap. The naive realloc will cause disk reads of the swapped pages into memory, a copy to the updated location, and then likely a write of the data back out to disk.
Avoid moving data
The issue you raise of updating TLBs is minimal compared to the data. A single TLB entry is typically 4 bytes and represents a 4K page of physical data. If you flush the entire TLB for a 4GB system, that is 4MB of page-table data that needs to be restored. Copying large amounts of data will blow the L1 and L2 caches. TLB fetches naturally pipeline better than d-cache and i-cache fetches. It is rare for code to take two TLB misses in a row, as most code is sequential.
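For concreteness, the arithmetic behind that 4MB figure, assuming 4 KiB pages and 4-byte PTEs as above:

#include <stdio.h>

int main(void)
{
    unsigned long long mapped = 4ULL << 30;          /* 4 GiB of mapped memory */
    unsigned long long pages  = mapped / (4 << 10);  /* -> 1,048,576 pages     */
    unsigned long long pte_bytes = pages * 4;        /* -> 4 MiB of PTE data   */
    printf("%llu pages, %llu MiB of PTEs\n", pages, pte_bytes >> 20);
    return 0;
}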
CPU caches come in two variants: VIVT (non-x86) and VIPT (as on x86). The VIVT versions usually have mechanisms to invalidate single TLB entries. For a VIPT system, the caches should not need to be invalidated, as they are physically tagged.
On multi-core systems, it is atypical to run one process on all cores. Only cores running the process performing mremap() need to have page table updates. As a process is migrated to a core (a typical context switch), it needs to have the page table migrated anyway.
Conclusion
You can construct some pathological cases where a naive copy will work better. As Linux (and most OSes) are multi-tasking, more than one process will be running. Also, the worst case will be when swapping, and the naive implementation will always be worse there (unless you have a disk faster than memory). For minimal realloc() sizes, dlmalloc or QVector should have fallow space to avoid a system-level mremap(). A typical mremap() may just expand a virtual address range by growing the region with a random page from the free pool. It is only when the virtual address range must move that mremap() may need a TLB flush, with all of the following being true:
The realloc() memory should not be shared with a parent or child process.
The memory should not be sparse (mostly zero or untouched).
The system should not be under memory pressure using swap.
The TLB flush and IPI need to take place only if the same process is current on other cores. L1-cache loading is not needed for the mremap(), but it is for the naive version. The L2 is usually shared between cores and will be current in all cases. The naive version will force the L2 to reload. The mremap() might leave some unused data out of the L2 cache; this is normally a good thing, but could be a disadvantage under some workloads. There are probably better ways to do this, such as pre-fetching data.
For Linux, see man 2 mremap:
void *mremap(void *old_address, size_t old_size,
             size_t new_size, int flags, ... /* void *new_address */);
mremap() expands (or shrinks) an existing memory mapping, potentially moving it at the same
time (controlled by the flags argument and the available virtual address space).
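A minimal usage sketch (Linux-specific; error handling abbreviated). MREMAP_MAYMOVE lets the kernel relocate the mapping by remapping pages rather than copying bytes:

#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdio.h>

int main(void)
{
    size_t old_sz = 4096, new_sz = 64 * 4096;
    char *p = mmap(NULL, old_sz, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;
    p[0] = 42; /* touch the first page so there is data to preserve */

    /* grow the mapping; the kernel may move it by reordering the
     * virtual-to-physical mapping instead of copying the contents */
    char *q = mremap(p, old_sz, new_sz, MREMAP_MAYMOVE);
    if (q == MAP_FAILED)
        return 1;
    printf("old=%p new=%p moved=%s data=%d\n",
           (void *)p, (void *)q, p == q ? "no" : "yes", q[0]);
    munmap(q, new_sz);
    return 0;
}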

Line size of L1 and L2 caches

From a previous question on this forum, I learned that in most memory systems the L1 cache is a subset of the L2 cache, which means any entry removed from L2 is also removed from L1.
So now my question is: how do I determine the corresponding entry in the L1 cache for an entry in the L2 cache? The only information stored in the L2 entry is the tag. Based on this tag information, if I re-create the address, it may span multiple lines in the L1 cache if the line sizes of the L1 and L2 caches are not the same.
Does the architecture really bother about flushing both lines, or does it just maintain L1 and L2 with the same line size?
I understand that this is a policy decision but I want to know the commonly used technique.
Cache-line size is (typically) 64 bytes.
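If you want to verify this on a particular machine, glibc exposes the line sizes through sysconf (a sketch; the _SC_LEVEL* names are glibc extensions rather than POSIX):

#include <unistd.h>
#include <stdio.h>

int main(void)
{
    /* may return 0 or -1 if the value is unknown on this system */
    printf("L1d line: %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
    printf("L2 line:  %ld bytes\n", sysconf(_SC_LEVEL2_CACHE_LINESIZE));
    return 0;
}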
Moreover, take a look at this very interesting article about processor caches:
Gallery of Processor Cache Effects
You will find the following chapters:
Memory accesses and performance
Impact of cache lines
L1 and L2 cache sizes
Instruction-level parallelism
Cache associativity
False cache line sharing
Hardware complexities
In the Core i7, the line sizes of L1, L2 and L3 are the same: 64 bytes.
I guess this simplifies maintaining the inclusive property, and coherence.
See page 10 of: https://www.aristeia.com/TalkNotes/ACCU2011_CPUCaches.pdf
The most common technique of handling cache block size in a strictly inclusive cache hierarchy is to use the same size cache blocks for all levels of cache for which the inclusion property is enforced. This results in greater tag overhead than if the higher level cache used larger blocks, which not only uses chip area but can also increase latency since higher level caches generally use phased access (where tags are checked before the data portion is accessed). However, it also simplifies the design somewhat and reduces the wasted capacity from unused portions of the data. It does not take a large fraction of unused 64-byte chunks in 128-byte cache blocks to compensate for the area penalty of an extra 32-bit tag. In addition, the larger cache block effect of exploiting broader spatial locality can be provided by relatively simple prefetching, which has the advantages that no capacity is left unused if the nearby chunk is not loaded (to conserve memory bandwidth or reduce latency on a conflicting memory read) and that the adjacency prefetching need not be limited to a larger aligned chunk.
A less common technique divides the cache block into sectors. Having the sector size the same as the block size for lower level caches avoids the problem of excess back-invalidation since each sector in the higher level cache has its own valid bit. (Providing all the coherence state metadata for each sector rather than just validity can avoid excessive writeback bandwidth use when at least one sector in a block is not dirty/modified and some coherence overhead [e.g., if one sector is in shared state and another is in the exclusive state, a write to the sector in the exclusive state could involve no coherence traffic—if snoopy rather than directory coherence is used].)
The area savings from sectored cache blocks were especially significant when tags were on the processor chip but the data was off-chip. Obviously, if the data storage takes area comparable to the size of the processor chip (which is not unreasonable), then 32-bit tags with 64-byte blocks would take roughly a 16th (~6%) of the processor area while 128-byte blocks would take half as much. (IBM's POWER6+, introduced in 2009, is perhaps the most recent processor to use on-processor-chip tags and off-processor data. Storing data in higher-density embedded DRAM and tags in lower-density SRAM, as IBM did, exaggerates this effect.)
It should be noted that Intel uses "cache line" to refer to the smaller unit and "cache sector" for the larger unit. (This is one reason why I used "cache block" in my explanation.) Using Intel's terminology it would be very unusual for cache lines to vary in size among levels of cache regardless of whether the levels were strictly inclusive, strictly exclusive, or used some other inclusion policy.
(Strict exclusion typically uses the higher level cache as a victim cache where evictions from the lower level cache are inserted into the higher level cache. Obviously, if the block sizes were different and sectoring was not used, then an eviction would require the rest of the larger block to be read from somewhere and invalidated if present in the lower level cache. [Theoretically, strict exclusion could be used with inflexible cache bypassing where an L1 eviction would bypass L2 and go to L3 and L1/L2 cache misses would only be allocated to either L1 or L2, bypassing L1 for certain accesses. The closest to this being implemented that I am aware of is Itanium's bypassing of L1 for floating-point accesses; however, if I recall correctly, the L2 was inclusive of L1.])
Typically, one access to main memory fetches 64 bytes of data and 8 bytes of parity/ECC (I don't remember exactly which). And it is rather complicated to maintain different cache line sizes at the various memory levels. Note that cache line size is more correlated with the word alignment size of the architecture than with anything else; based on that, a cache line size is highly unlikely to differ from the memory access size. Now, the parity bits are for the use of the memory controller, so the cache line size is typically 64 bytes. The processor really controls very little beyond the registers. Everything else in the computer is more about hardware added to optimize CPU performance. In that sense too, it really would not make sense to import extra complexity by making cache line sizes different at different levels of memory.

Information on N-way set associative cache strides

Several of the resources I've consulted on the internet disagree on how set associative caching works.
For example, Hardware Secrets seems to believe it works like this:
Then the main RAM memory is divided in the same number of blocks available in the memory cache. Keeping the 512 KB 4-way set associative example, the main RAM would be divided into 2,048 blocks, the same number of blocks available inside the memory cache. Each memory block is linked to a set of lines inside the cache, just like in the direct mapped cache.
http://www.hardwaresecrets.com/printpage/481/8
They seem to be saying that each cache block (4 cache lines) maps to a particular block of contiguous RAM. They are saying non-contiguous blocks of system memory (RAM) can't map to the same cache block.
This is their picture of how Hardware Secrets thinks it works:
http://www.hardwaresecrets.com/fullimage.php?image=7864
Contrast that with wikipedia's picture of set associative cache
http://upload.wikimedia.org/wikipedia/commons/9/93/Cache%2Cassociative-fill-both.png.
Brown disagrees with Hardware Secrets:
Consider what might happen if each cache line had two sets of fields: two valid bits, two dirty bits, two tag fields, and two data fields. One set of fields could cache data for one area of main memory, and the other for another area which happens to map to the same cache line.
http://www.spsu.edu/cs/faculty/bbrown/web_lectures/cache/
That is, non-contiguous blocks of system memory can map to the same cache block.
How are the relationships between non-contiguous blocks of system memory and cache blocks created? I read somewhere that these relationships are based on cache strides, but I can't find any information on cache strides other than that they exist.
Who is right?
If striding is actually used, how does striding work, and do I have the correct technical name? How do I find the stride for a particular system? Is it based on the paging system? Can someone point me to a URL that explains N-way set associative caches in great detail?
also see:
http://www.cs.umd.edu/class/sum2003/cmsc311/Notes/Memory/set.html
When I teach cache memory architecture to my students, I start with a direct-mapped cache. Once that is understood, you can think of N-way set associative caches as parallel blocks of direct-mapped cache. To understand that both figures may be correct, you need to first understand the purpose of set-assoc caches.
They are designed to work around the problem of 'aliasing' in a direct-mapped cache, where multiple memory locations can map to a specific cache entry. This is illustrated in the Wikipedia figure. So, instead of evicting a cache entry, we can use an N-way cache to store the other 'aliased' memory locations.
In effect, the Hardware Secrets diagram would be correct assuming the order of replacement is such that the first chunk of main memory is mapped to Way-1, the second chunk to Way-2, and so on and so forth. However, it is equally possible to have the first chunk of main memory spread over multiple Ways.
Hope this explanation helps!
PS: Contiguous memory locations are only needed within a single cache line, exploiting spatial locality. As for the latter part of your question, I believe you may be confusing several different concepts.
The replacement policy decides where in the cache a copy of a particular entry of main memory will go. If the replacement policy is free to choose any entry in the cache to hold the copy, the cache is called fully associative. At the other extreme, if each entry in main memory can go in just one place in the cache, the cache is direct mapped. Many caches implement a compromise in which each entry in main memory can go to any one of N places in the cache, and are described as N-way set associative.
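To tie the three policies of that quote together, here is a toy model (sizes are hypothetical) showing which lines a given memory block may occupy under each scheme:

#include <stdio.h>

#define NUM_LINES 8 /* total cache lines (toy size) */
#define NUM_WAYS  2 /* ways per set in the set-associative case */

int main(void)
{
    unsigned long block = 27; /* an arbitrary main-memory block number */

    /* direct mapped: exactly one candidate line */
    printf("direct-mapped   -> line %lu\n", block % NUM_LINES);

    /* N-way set associative: one set, any of its NUM_WAYS lines */
    unsigned long sets = NUM_LINES / NUM_WAYS; /* 4 sets */
    unsigned long set  = block % sets;
    printf("2-way set-assoc -> set %lu, lines %lu..%lu\n",
           set, set * NUM_WAYS, set * NUM_WAYS + NUM_WAYS - 1);

    /* fully associative: any line at all */
    printf("fully-assoc     -> any of lines 0..%d\n", NUM_LINES - 1);
    return 0;
}

Blocks 3, 7, 11, ..., 27 all land in set 3, so non-contiguous blocks of memory do compete for the same set, which is exactly the aliasing the answer above describes.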
