I want to get the sizes of the L1i, L1d, L2 and L3 caches on Solaris from the command line. The closest I have got is:
prtpicl -v| grep cache
:l3-cache-size 0x400000
:l3-cache-line-size 0x40
:l3-cache-associativity 0x10
:l1-dcache-size 0x8000
:l1-dcache-line-size 0x40
:l1-dcache-associativity 0x8
But this shows neither the L2 cache size nor separate L1i and L1d cache information. How can I get the sizes of all the L1, L2 and L3 caches?
OS info:
Version:
5.11
Release :
Oracle Solaris 11.4 X86
Copyright (c) 1983, 2018, Oracle and/or its affiliates. All rights reserved.
Every modern high-performance CPU of the x86/x86_64 architecture has some hierarchy of data caches: L1, L2, and sometimes L3 (and L4 in very rare cases), and data loaded from/to main RAM is cached in some of them.
Sometimes the programmer may want some data to not be cached in some or all cache levels (for example, when wanting to memset 16 GB of RAM and keep some data still in the cache): there are some non-temporal (NT) instructions for this like MOVNTDQA (https://stackoverflow.com/a/37092 http://lwn.net/Articles/255364/)
But is there a programmatic way (for some AMD or Intel CPU families like P3, P4, Core, Core i*, ...) to completely (but temporarily) turn off some or all levels of the cache, to change how every memory access instruction (globally or for some applications / regions of RAM) uses the memory hierarchy? For example: turn off L1, turn off L1 and L2? Or change every memory access type to "uncached" UC (CD+NW bits of CR0??? SDM vol3a pages 423 424, 425 and "Third-Level Cache Disable flag, bit 6 of the IA32_MISC_ENABLE MSR (Available only in processors based on Intel NetBurst microarchitecture) — Allows the L3 cache to be disabled and enabled, independently of the L1 and L2 caches.").
I think such action will help to protect data from cache side channel attacks/leaks like stealing AES keys, covert cache channels, Meltdown/Spectre. Although this disabling will have an enormous performance cost.
PS: I remember such a program posted many years ago on some technical news website, but can't find it now. It was just a Windows exe to write some magical values into an MSR and make every Windows program running after it very slow. The caches were turned off until reboot or until starting the program with the "undo" option.
Intel's manual 3A, Section 11.5.3, provides an algorithm to globally disable the caches:
11.5.3 Preventing Caching
To disable the L1, L2, and L3 caches after they have been enabled and have received cache fills, perform the following steps:
1. Enter the no-fill cache mode. (Set the CD flag in control register CR0 to 1 and the NW flag to 0.)
2. Flush all caches using the WBINVD instruction.
3. Disable the MTRRs and set the default memory type to uncached or set all MTRRs for the uncached memory type (see the discussion of the TYPE field and the E flag in Section 11.11.2.1, “IA32_MTRR_DEF_TYPE MSR”).
The caches must be flushed (step 2) after the CD flag is set to ensure system memory coherency. If the caches are
not flushed, cache hits on reads will still occur and data will be read from valid cache lines.
The intent of the three separate steps listed above addresses three distinct requirements: (i) discontinue new data
replacing existing data in the cache (ii) ensure data already in the cache are evicted to memory, (iii) ensure subsequent memory references observe UC memory type semantics. Different processor implementation of caching
control hardware may allow some variation of software implementation of these three requirements. See note below.
NOTES
Setting the CD flag in control register CR0 modifies the processor’s caching behaviour as indicated
in Table 11-5, but setting the CD flag alone may not be sufficient across all processor families to
force the effective memory type for all physical memory to be UC nor does it force strict memory
ordering, due to hardware implementation variations across different processor families. To force
the UC memory type and strict memory ordering on all of physical memory, it is sufficient to either
program the MTRRs for all physical memory to be UC memory type or disable all MTRRs.
For the Pentium 4 and Intel Xeon processors, after the sequence of steps given above has been
executed, the cache lines containing the code between the end of the WBINVD instruction and
before the MTRRS have actually been disabled may be retained in the cache hierarchy. Here, to remove code from the cache completely, a second WBINVD instruction must be executed after the
MTRRs have been disabled.
That's a long quote but it boils down to this code
;Step 1 - Enter no-fill mode
mov eax, cr0
or eax, 1<<30 ; Set bit CD
and eax, ~(1<<29) ; Clear bit NW
mov cr0, eax
;Step 2 - Invalidate all the caches
wbinvd
;All memory accesses now go to/from memory, but UC memory ordering may still not be enforced.
;For Atom processors, we are done: UC semantics are automatically enforced.
;Step 3 - Disable the MTRRs and set the default memory type to UC
xor eax, eax
xor edx, edx
mov ecx, IA32_MTRR_DEF_TYPE ;MSR number is 2FFH
wrmsr
;P4 only, remove this code from the L1I
wbinvd
most of which is not executable from user mode.
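The bit manipulation in step 1 can be checked in isolation from ordinary user-mode C. This is only a minimal sketch: the helper name cr0_no_fill is made up for illustration, and the function just computes the new value. It does not write CR0, which cannot be done from user mode.

```c
#include <stdint.h>

/* CR0 bit positions, matching the assembly above. */
#define CR0_NW (UINT32_C(1) << 29) /* Not Write-through */
#define CR0_CD (UINT32_C(1) << 30) /* Cache Disable */

/* Hypothetical helper: returns the CR0 value for no-fill mode
 * (CD = 1, NW = 0), leaving every other bit unchanged. */
static uint32_t cr0_no_fill(uint32_t cr0)
{
    cr0 |= CR0_CD;   /* set bit 30 */
    cr0 &= ~CR0_NW;  /* clear bit 29 */
    return cr0;
}
```

For instance, an input of 0x80050033 (an arbitrary sample value) becomes 0xC0050033: only bit 30 changes because bit 29 was already clear.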
AMD's manual 2 provides a similar algorithm in section 7.6.2
7.6.2 Cache Control Mechanisms
The AMD64 architecture provides a number of mechanisms for controlling the cacheability of memory. These are described in the following sections.
Cache Disable. Bit 30 of the CR0 register is the cache-disable bit, CR0.CD. Caching is enabled
when CR0.CD is cleared to 0, and caching is disabled when CR0.CD is set to 1. When caching is
disabled, reads and writes access main memory.
Software can disable the cache while the cache still holds valid data (or instructions). If a read or write
hits the L1 data cache or the L2 cache when CR0.CD=1, the processor does the following:
Writes the cache line back if it is in the modified or owned state.
Invalidates the cache line.
Performs a non-cacheable main-memory access to read or write the data.
If an instruction fetch hits the L1 instruction cache when CR0.CD=1, some processor models may read
the cached instructions rather than access main memory. When CR0.CD=1, the exact behavior of L2
and L3 caches is model-dependent, and may vary for different types of memory accesses.
The processor also responds to cache probes when CR0.CD=1. Probes that hit the cache cause the
processor to perform Step 1. Step 2 (cache-line invalidation) is performed only if the probe is
performed on behalf of a memory write or an exclusive read.
Writethrough Disable. Bit 29 of the CR0 register is the not writethrough disable bit, CR0.NW. In
early x86 processors, CR0.NW is used to control cache writethrough behavior, and the combination of
CR0.NW and CR0.CD determines the cache operating mode.
[...]
In implementations of the AMD64 architecture, CR0.NW is not used to qualify the cache operating
mode established by CR0.CD.
This translates to this code (very similar to Intel's):
;Step 1 - Disable the caches
mov eax, cr0
or eax, 1<<30
mov cr0, eax
;For some models we need to invalidate the L1I
wbinvd
;Step 2 - Disable speculative accesses
xor eax, eax
xor edx, edx
mov ecx, MTRRdefType ;MSR number is 2FFH
wrmsr
Caches can also be selectively disabled at:
Page level, with the attribute bits PCD (Page Cache Disable) and PWT (Page Write-Through) [Only for Pentium Pro and Pentium II].
When both are clear the relevant MTRR is used; if PCD is set, caching is disabled.
Page level, with the PAT (Page Attribute Table) mechanism.
By filling the IA32_PAT with caching types and using the bits PAT, PCD, PWT as a 3-bit index it's possible to select one of the six caching types (UC-, UC, WC, WT, WP, WB).
Using the MTRRs (fixed or variable).
By setting the caching type to UC or UC- for specific physical areas.
Of these options only the page attributes can be exposed to user mode programs (see for example this).
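The 3-bit PAT index described above can be sketched in C. The helper names are hypothetical, and the IA32_PAT value is the documented power-on default rather than something read from a real machine, since reading the MSR itself needs kernel mode.

```c
#include <stdint.h>

/* Power-on default of IA32_PAT: entries 0..7 = WB, WT, UC-, UC, WB, WT, UC-, UC. */
#define IA32_PAT_DEFAULT UINT64_C(0x0007040600070406)

/* Memory-type encodings stored in each IA32_PAT entry. */
enum pat_type { PAT_UC = 0, PAT_WC = 1, PAT_WT = 4, PAT_WP = 5, PAT_WB = 6, PAT_UC_MINUS = 7 };

/* Hypothetical helper: build the 3-bit index from the page-table bits. */
static unsigned pat_index(unsigned pat, unsigned pcd, unsigned pwt)
{
    return ((pat & 1u) << 2) | ((pcd & 1u) << 1) | (pwt & 1u);
}

/* Hypothetical helper: extract entry idx from an IA32_PAT value.
 * Each entry occupies one byte; only the low 3 bits encode the type. */
static unsigned pat_entry(uint64_t ia32_pat, unsigned idx)
{
    return (unsigned)((ia32_pat >> (8 * (idx & 7u))) & 0x7u);
}
```

With the default table, PAT=0, PCD=1, PWT=1 selects entry 3, which is UC, while all three bits clear selects entry 0, which is WB.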
I have used the cpuid | grep -i tlb command on the terminal to try to determine the exact number of page table entries (and the corresponding page-sizes) being used by the machine. This is what I've got.
cache and TLB information (2):
0x63: data TLB: 1G pages, 4-way, 4 entries
0x03: data TLB: 4K pages, 4-way, 64 entries
0x76: instruction TLB: 2M/4M pages, fully, 8 entries
0xb5: instruction TLB: 4K, 8-way, 64 entries
L1 TLB/cache information: 2M/4M pages & L1 TLB (0x80000005/eax):
L1 TLB/cache information: 4K pages & L1 TLB (0x80000005/ebx):
L2 TLB/cache information: 2M/4M pages & L2 TLB (0x80000006/eax):
L2 TLB/cache information: 4K pages & L2 TLB (0x80000006/ebx):
cache and TLB information (2):
0x63: data TLB: 1G pages, 4-way, 4 entries
0x03: data TLB: 4K pages, 4-way, 64 entries
0x76: instruction TLB: 2M/4M pages, fully, 8 entries
0xb5: instruction TLB: 4K, 8-way, 64 entries
L1 TLB/cache information: 2M/4M pages & L1 TLB (0x80000005/eax):
L1 TLB/cache information: 4K pages & L1 TLB (0x80000005/ebx):
L2 TLB/cache information: 2M/4M pages & L2 TLB (0x80000006/eax):
L2 TLB/cache information: 4K pages & L2 TLB (0x80000006/ebx):
cache and TLB information (2):
0x63: data TLB: 1G pages, 4-way, 4 entries
0x03: data TLB: 4K pages, 4-way, 64 entries
0x76: instruction TLB: 2M/4M pages, fully, 8 entries
0xb5: instruction TLB: 4K, 8-way, 64 entries
L1 TLB/cache information: 2M/4M pages & L1 TLB (0x80000005/eax):
L1 TLB/cache information: 4K pages & L1 TLB (0x80000005/ebx):
L2 TLB/cache information: 2M/4M pages & L2 TLB (0x80000006/eax):
L2 TLB/cache information: 4K pages & L2 TLB (0x80000006/ebx):
cache and TLB information (2):
0x63: data TLB: 1G pages, 4-way, 4 entries
0x03: data TLB: 4K pages, 4-way, 64 entries
0x76: instruction TLB: 2M/4M pages, fully, 8 entries
0xb5: instruction TLB: 4K, 8-way, 64 entries
L1 TLB/cache information: 2M/4M pages & L1 TLB (0x80000005/eax):
L1 TLB/cache information: 4K pages & L1 TLB (0x80000005/ebx):
L2 TLB/cache information: 2M/4M pages & L2 TLB (0x80000006/eax):
L2 TLB/cache information: 4K pages & L2 TLB (0x80000006/ebx):
I wanted to know whether I can interpret this information to say that the data TLB on all the processors (I have an Intel® Core™ i7-6500U CPU @ 2.50GHz with 4 cores) has a level 1 with 1G pages and 4 entries, and a level 2 with 4K pages and 64 entries. The part that I am confused about is how to distinguish between these two pieces of information:
0x63: data TLB: 1G pages, 4-way, 4 entries
0x03: data TLB: 4K pages, 4-way, 64 entries
0x76: instruction TLB: 2M/4M pages, fully, 8 entries
0xb5: instruction TLB: 4K, 8-way, 64 entries
L1 TLB/cache information: 2M/4M pages & L1 TLB (0x80000005/eax):
L1 TLB/cache information: 4K pages & L1 TLB (0x80000005/ebx):
L2 TLB/cache information: 2M/4M pages & L2 TLB (0x80000006/eax):
L2 TLB/cache information: 4K pages & L2 TLB (0x80000006/ebx):
Does my TLB have 1G and 4K pages in addition to 2M pages (as specified in the second paragraph)? Am I reading this completely incorrectly?
This is a question on my exam study guide and we have not yet covered how to calculate data transfer. Any help would be greatly appreciated.
Given is an 8 way set associative level 2 data cache with a capacity of 2 MByte (1MByte = 2^20 Byte)
and a block size 128 Bytes. The cache is connected to the main memory by a shared 32 bit address and
data bus. The cache and the RISC-CPU are connected by a separated address and data bus, each with a
width of 32 bit. The CPU is executing a load word instruction
a) How much user data is transferred from the main memory to the cache in case of a cache miss?
b) How much user data is transferred from the cache to the CPU in case of a cache miss?
First work out the cache geometry:
Number of cache blocks: 2MB / 128B = 16384 blocks
Number of sets: 16384 / 8 ways = 2048 sets (11 index bits)
Address width: 32 bits
Block offset bits: log2(128) = 7 bits
Tag bits: 32 - 11 - 7 = 14 bits
So the cache line size is 128B. A line is the same thing as a block, so this is given directly in the problem, but the computation above is good to know.
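The arithmetic above can be reproduced in a few lines of C. This is a sketch under the problem's own numbers (2 MByte, 128-byte blocks, 8 ways, 32-bit addresses); the struct and function names are made up for illustration, and all sizes are assumed to be powers of two. It also reports the usual tag/index/offset split of the address.

```c
#include <stdint.h>

struct cache_geometry {
    uint32_t blocks;      /* total number of cache blocks (lines) */
    uint32_t sets;        /* blocks / ways */
    unsigned offset_bits; /* log2(block size) */
    unsigned index_bits;  /* log2(sets) */
    unsigned tag_bits;    /* address bits - index bits - offset bits */
};

/* log2 for exact powers of two. */
static unsigned log2_u32(uint32_t x)
{
    unsigned n = 0;
    while (x > 1) {
        x >>= 1;
        n++;
    }
    return n;
}

/* Hypothetical helper: derive the geometry of a set-associative cache. */
static struct cache_geometry compute_geometry(uint32_t capacity, uint32_t block_size,
                                              uint32_t ways, unsigned addr_bits)
{
    struct cache_geometry g;
    g.blocks = capacity / block_size;
    g.sets = g.blocks / ways;
    g.offset_bits = log2_u32(block_size);
    g.index_bits = log2_u32(g.sets);
    g.tag_bits = addr_bits - g.index_bits - g.offset_bits;
    return g;
}
```

Calling compute_geometry(2u << 20, 128, 8, 32) gives 16384 blocks, 2048 sets, and a 14/11/7 split into tag, index and offset bits.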
a) How much user data is transferred from the main memory to the cache
in case of a cache miss?
In your problem, the L2 cache is the last level cache before main memory. So if you miss in the L2 cache (you don't find the line you are looking for), you need to fetch the whole line from main memory. So 128B of user data will be transferred from the main memory. The fact that the address bus and data bus are shared does not change this.
b) How much user data is transferred from the cache to the CPU in case
of a cache miss?
If you reached the L2 cache, that means you missed in the L1 cache. So a full L1 cache line has to be transferred from L2 to L1. If the L1 line size is 128B, then 128B of data will go from L2 to L1. The CPU will then use only a fraction of that line to satisfy the instruction that generated the L1 miss. Whether or not that line is then evicted from L2 depends on whether the cache is inclusive or exclusive, which the problem statement should have specified.
My Intel i3 processor has an L3 cache of 3072 KB. I want to partition (divide) the L3 cache between the cores, of which there are 2 in the Intel i3 Clarkdale. Has anyone done something like this with caches?
I am trying to profile and optimize algorithms and I would like to understand the specific impact of the caches on various processors. For recent Intel x86 processors (e.g. Q9300), it is very hard to find detailed information about cache structure. In particular, most web sites (including Intel.com) that post processor specs do not include any reference to L1 cache. Is this because the L1 cache does not exist or is this information for some reason considered unimportant? Are there any articles or discussions about the elimination of the L1 cache?
[edit]
After running various tests and diagnostic programs (mostly those discussed in the answers below), I have concluded that my Q9300 seems to have a 32K L1 data cache. I still haven't found a clear explanation as to why this information is so difficult to come by. My current working theory is that the details of L1 caching are now being treated as trade secrets by Intel.
It is near impossible to find specs on Intel caches. When I was teaching a class on caches last year, I asked friends inside Intel (in the compiler group) and they couldn't find specs.
But wait!!! Jed, bless his soul, tells us that on Linux systems, you can squeeze lots of information out of the kernel:
grep . /sys/devices/system/cpu/cpu0/cache/index*/*
This will give you associativity, set size, and a bunch of other information (but not latency).
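The same sysfs files can be read from a program. A minimal sketch in C, assuming a Linux system that exposes this directory; the function name read_cache_attr is hypothetical, and on other systems it simply reports failure.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: reads one attribute file such as "size", "level",
 * "type", "ways_of_associativity", "number_of_sets" or
 * "coherency_line_size" for a given CPU and cache index.
 * Returns 0 on success, -1 if the file cannot be read. */
static int read_cache_attr(int cpu, int index, const char *attr,
                           char *out, size_t out_len)
{
    char path[128];
    FILE *f;

    if (out_len == 0)
        return -1;
    snprintf(path, sizeof path,
             "/sys/devices/system/cpu/cpu%d/cache/index%d/%s", cpu, index, attr);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fgets(out, (int)out_len, f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);
    out[strcspn(out, "\n")] = '\0'; /* strip the trailing newline */
    return 0;
}
```

Looping over index 0, 1, 2, ... until the call fails and printing "level", "type" and "size" for each should reproduce the interesting part of the grep output.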
For example, I learned that although AMD advertises their 128K L1 cache, my AMD machine has a split I and D cache of 64K each.
Two suggestions which are now mostly obsolete thanks to Jed:
AMD publishes a lot more information about its caches, so you can at least get some information about a modern cache. For example, last year's AMD L1 caches delivered two words per cycle (peak).
The open-source tool valgrind has all sorts of cache models inside it, and it is invaluable for profiling and understanding cache behavior. It comes with a very nice visualization tool kcachegrind which is part of the KDE SDK.
For example: in Q3 2008, AMD K8/K10 CPUs use 64 byte cache lines, with a 64kB each L1I/L1D split cache. L1D is 2-way associative and exclusive with L2, with latency of 3 cycles. L2 cache is 16-way associative and latency is about 12 cycles.
AMD Bulldozer-family CPUs use a split L1 with a 16kiB 4-way associative L1D per cluster (2 per core).
Intel CPUs have kept L1 the same for a long time (from Pentium M to Haswell to Skylake, and presumably many generations after that): Split 32kB each I and D caches, with L1D being 8-way associative. 64 byte cache lines, matching the burst-transfer size of DDR DRAM. Load-use latency is ~4 cycles.
Also see the x86 tag wiki for links to more performance and microarchitectural data.
This Intel Manual: Intel® 64 and IA-32 Architectures Optimization Reference Manual has a decent discussion of cache considerations.
Page 46, Section 2.2.5.1 Intel® 64 and IA-32 Architectures Optimization Reference Manual
Even MicroSlop is waking up to the need for more tools to monitor cache usage and performance, and has a GetLogicalProcessorInformation() function example (...while blazing new trails in creating ridiculously long function names in the process) I think I'll code up.
UPDATE I: Haswell increases cache load performance 2X, from Inside the Tock; Haswell's Architecture
If there were any doubt how critical it is to make the best possible use of cache, this presentation by Cliff Click, formerly of Azul, should dispel any and all doubt. In his words, "memory is the new disk!".
UPDATE II: SkyLake's significantly improved cache performance specifications.
You are looking at the consumer specifications, not the developer specifications. Here is the documentation you want. The cache sizes vary by processor family sub-models, so they typically are not in the IA-32 development manuals, but you can easily look them up on NewEgg and such.
Edit: More specifically: Chapter 10 of Volume 3A (Systems Programming Guide), Chapter 7 of the Optimization Reference Manual, and potentially something in the TLB page-caching manual, although I would assume that one is further out from the L1 than you care about.
I did some more investigating. There is a group at ETH Zurich who built a memory-performance evaluation tool which might be able to get information about the size at least (and maybe also associativity) of L1 and L2 caches. The program works by trying different read patterns experimentally and measuring the resulting throughput. A simplified version was used for the popular textbook by Bryant and O'Hallaron.
L1 caches exist on these platforms. This will almost definitely remain true until memory and front side bus speeds exceed the speed of the CPU, which is very likely a long way off.
On Windows, you can use GetLogicalProcessorInformation to get some level of cache information (size, line size, associativity, etc.). The Ex version on Win7 will give even more data, like which cores share which cache. CpuZ also gives this information.
Locality of Reference has a major impact on performance of some algorithms; The size and speed of L1, L2 (and on newer CPUs L3) cache obviously play a large part in this. Matrix multiplication is one such algorithm.
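To illustrate the matrix multiplication case, here is a minimal sketch in C of two loop orders that compute the same product. The function names are made up for illustration. For row-major storage, the second order walks memory with stride 1 in its inner loop and is usually the more cache-friendly one, though the actual gain depends on the matrix size and the cache sizes of the machine.

```c
#include <stddef.h>

/* C += A * B for n x n row-major matrices; C must be zero-initialised.
 * i-j-k order: the inner loop walks B down a column (stride n). */
static void matmul_ijk(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

/* i-k-j order: the inner loop walks B and C along a row (stride 1). */
static void matmul_ikj(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++) {
            double aik = a[i * n + k];
            for (size_t j = 0; j < n; j++)
                c[i * n + j] += aik * b[k * n + j];
        }
}
```

Timing both on matrices much larger than the L2 cache is a simple way to see the effect of locality on a given processor.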
Intel Manual Vol. 2 specifies the following formula to compute cache size:
This Cache Size in Bytes
= (Ways + 1) * (Partitions + 1) * (Line_Size + 1) * (Sets + 1)
= (EBX[31:22] + 1) * (EBX[21:12] + 1) * (EBX[11:0] + 1) * (ECX + 1)
Where the Ways, Partitions, Line_Size and Sets are queried using cpuid with eax set to 0x04.
Providing the header file declaration
x86_cache_size.h:
unsigned int get_cache_line_size(unsigned int cache_level);
The implementation looks as follows:
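;1st argument (edi) - the CPUID leaf 0x04 subleaf index
;Note: this is not strictly the cache level; subleaf 0 is typically L1d and subleaf 1 L1i
;exported so the C code can link against it (NASM syntax)
global get_cache_line_size
section .text
get_cache_line_size:
push rbx
;set the subleaf index to be used with the CPUID instruction
mov ecx, edi
;select CPUID leaf 0x04
mov eax, 0x04
cpuid
;cache line size: EBX[11:0] + 1
mov eax, ebx
and eax, 0xfff
inc eax
;partitions: EBX[21:12] + 1
shr ebx, 12
mov edx, ebx
and edx, 0x3ff
inc edx
mul edx
;ways of associativity: EBX[31:22] + 1
shr ebx, 10
mov edx, ebx
and edx, 0x3ff
inc edx
mul edx
;number of sets: ECX + 1
inc ecx
mul ecx
pop rbx
ret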
Which on my machine works as follows:
#include <stdio.h>
#include "x86_cache_size.h"

int main(void){
    unsigned int L1_cache_size = get_cache_line_size(1);
    unsigned int L2_cache_size = get_cache_line_size(2);
    unsigned int L3_cache_size = get_cache_line_size(3);
    //L1 size = 32768, L2 size = 262144, L3 size = 8388608
    printf("L1 size = %u, L2 size = %u, L3 size = %u\n", L1_cache_size, L2_cache_size, L3_cache_size);
    return 0;
}
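The formula can also be applied in plain C to register values that have already been read, which makes the field widths easy to check. This is a sketch only: the function name is hypothetical, and the sample EBX/ECX values mentioned below are constructed by hand to describe a 32 KiB, 8-way cache with 64-byte lines, not captured from my machine.

```c
#include <stdint.h>

/* Hypothetical helper: applies the Intel formula to raw EBX/ECX values
 * returned by CPUID leaf 0x04. */
static uint32_t cache_size_from_cpuid4(uint32_t ebx, uint32_t ecx)
{
    uint32_t line_size  = (ebx & 0xFFFu) + 1;          /* EBX[11:0]  */
    uint32_t partitions = ((ebx >> 12) & 0x3FFu) + 1;  /* EBX[21:12] */
    uint32_t ways       = ((ebx >> 22) & 0x3FFu) + 1;  /* EBX[31:22] */
    uint32_t sets       = ecx + 1;                     /* ECX        */

    return ways * partitions * line_size * sets;
}
```

With EBX = 0x01C0003F and ECX = 63 this gives 8 * 1 * 64 * 64 = 32768 bytes. Note that the line-size field is 12 bits wide and the other two EBX fields are 10 bits wide.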