How to clear CPU L1 and L2 cache [duplicate]

This question already has answers here: How can I do a CPU cache flush in x86 Windows? (4 answers). Closed 4 years ago.
I'm running a benchmark on a Xeon server, and I repeat the executions 2-3 times. I'd like to erase the contents of the L1 and L2 caches between runs. Can you suggest any methods for doing so?

Try repeatedly reading a large amount of data through the CPU (i.e. not by DMA), like:
#include <stdlib.h>

int main(void) {
    const int size = 20 * 1024 * 1024; // allocate 20 MB, much larger than L2
    char *c = malloc(size);
    for (int i = 0; i < 0xffff; i++)
        for (int j = 0; j < size; j++)
            c[j] = i * j; // touch every byte so the sweep evicts older lines
    free(c);
    return 0;
}
However, depending on the server, a bigger problem may be the disk cache (in memory) rather than the L1/L2 cache. On Linux, for example, drop it using:
sync
echo 3 > /proc/sys/vm/drop_caches
Edit: It is trivial to generate a large program which does nothing:
#!/usr/bin/ruby
puts "main:"
200000.times { puts " nop" }
puts " xor rax, rax"
puts " ret"
Running the produced code (not the script) a few times under different names should do the job.
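Alternatively, on x86 you can evict specific lines explicitly with the clflush instruction instead of thrashing the whole cache with a big sweep. A minimal sketch, assuming GCC or Clang with SSE2 available (the _mm_clflush intrinsic) and a 64-byte line size:

#include <emmintrin.h>  /* _mm_clflush, _mm_mfence (SSE2) */
#include <stddef.h>

/* Flush every cache line of a buffer from all cache levels. */
static void flush_buffer(const char *buf, size_t size)
{
    const size_t line = 64;            /* assumed x86 cache line size */
    for (size_t i = 0; i < size; i += line)
        _mm_clflush(buf + i);
    _mm_mfence();                      /* order the flushes before continuing */
}

Calling flush_buffer over the benchmark's working set between runs evicts those lines from L1 and L2 (and L3) without disturbing anything else.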

Related

how do we calculate the number of reads/misses of the cache in this code snippet?

Given this code snippet from the textbook I am currently studying: Randal E. Bryant, David R. O'Hallaron, Computer Systems: A Programmer's Perspective, 3rd ed. (Pearson, 2016) (the global edition, so the book's exercises could be wrong):
for (i = 31; i >= 0; i--) {
for (j = 31; j >= 0; j--) {
total_x += grid[i][j].x;
}
}
for (i = 31; i >= 0; i--) {
for (j = 31; j >= 0; j--) {
total_y += grid[i][j].y;
}
}
and this is the information given:
The heart of the recent hit game SimAquarium is a tight loop that calculates the
average position of 512 algae. You are evaluating its cache performance on a
machine with a 2,048-byte direct-mapped data cache with 32-byte blocks (B = 32).
struct algae_position {
int x;
int y;
};
struct algae_position grid[32][32];
int total_x = 0, total_y = 0;
int i, j;
You should also assume the following:
sizeof(int) = 4.
grid begins at memory address 0.
The cache is initially empty.
The only memory accesses are to the entries of the array grid.
Variables i, j, total_x, and total_y are stored in registers.
The book gives the following questions as practice:
A. What is the total number of reads?
Answer given: 2048
B. What is the total number of reads that miss in the cache?
Answer given: 1024
C. What is the miss rate?
Answer given: 50%
I'm guessing for A the answer is derived from 32*32*2: 32*32 for the dimensions of the matrix, and 2 because there are 2 separate loops for the x and y values. Is this correct? How should the total number of reads be counted?
How do we calculate the total number of misses that happen in the cache, and the miss rate? I read that the miss rate is (1 - hit rate).
Question A
You are correct about 32 x 32 x 2 reads.
Question B
The loops count down from 31 towards 0, but that doesn't matter for this question. The answer is the same for loops going from 0 to 31. Since that is a bit easier to explain, I'll assume increasing loop counters.
When you read grid[0][0], you'll get a cache miss. This will bring grid[0][0], grid[0][1], grid[0][2] and grid[0][3] into the cache. This is because each element is 2x4 = 8 bytes and the block size is 32. In other words: 32 / 8 = 4 grid elements in one block.
So the next cache miss is for grid[0][4] which again will bring the next 4 grid elements into the cache. And so on... like:
miss
hit
hit
hit
miss
hit
hit
hit
miss
hit
hit
hit
...
So in the first loop you simply have:
"Number of grid elements" divided by 4.
or
32 * 32 / 4 = 256
In general in the first loop:
Misses = NumberOfElements / (BlockSize / ElementSize)
so here:
Misses = 32*32 / (32 / 8) = 256
Since the cache size is only 2048 and the whole grid is 32 x 32 x 8 = 8192 bytes, nothing read into the cache in the first loop will generate a cache hit in the second loop. In other words, both loops will have 256 misses.
So the total number of cache misses is 2 x 256 = 512.
Also notice that there seems to be a bug in the book.
Here:
The heart of the recent hit game SimAquarium is a tight loop that calculates the
average position of 512 algae.
^^^
Hmmm... 512 elements...
Here:
for (i = 31; i >= 0; i--) {
for (j = 31; j >= 0; j--) {
^^^^^^
hmmm... 32 x 32 is 1024
So the loops access 1024 elements but the text says 512. Something is wrong in the book.
Question C
Miss rate = 512 misses / 2048 reads = 25 %
note:
Being very strict, we cannot say for sure that the element size is two times the integer size. The C standard allows structs to contain padding. So in principle there could be 8 bytes of padding in the struct (i.e. an element size of 16), and that would give the results that the book says.
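To sanity-check these numbers, here is a minimal direct-mapped cache simulation (my sketch, not from the book) under the stated parameters: a 2,048-byte cache with 32-byte blocks, 8-byte elements, and grid starting at address 0. The field offset (0 for x, 4 for y) models the two loops:

#include <stdio.h>

#define CACHE_BYTES 2048
#define BLOCK_BYTES 32
#define NSETS (CACHE_BYTES / BLOCK_BYTES)   /* 64 sets, direct-mapped */

int main(void) {
    long tags[NSETS];
    for (int s = 0; s < NSETS; s++) tags[s] = -1;    /* cache starts empty */
    int reads = 0, misses = 0;
    for (int field = 0; field < 2; field++)          /* loop 1 reads .x, loop 2 reads .y */
        for (int i = 0; i < 32; i++)
            for (int j = 0; j < 32; j++) {
                long addr = ((long)i * 32 + j) * 8 + field * 4;  /* grid begins at 0 */
                long block = addr / BLOCK_BYTES;
                int set = block % NSETS;
                reads++;
                if (tags[set] != block) { misses++; tags[set] = block; }
            }
    printf("reads=%d misses=%d rate=%.0f%%\n", reads, misses,
           100.0 * misses / reads);
    return 0;
}

It prints reads=2048 misses=512 rate=25%, agreeing with the calculation above.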

Minimize the number of page faults by loop interchange

Assume page size is 1024 words and each row is stored in one page.
If the OS allocates 512 frames for a program and uses LRU page replacement algorithm,
What will be the number of page faults in the following programs?
int A[][] = new int[1024][1024];
Program 1:
for (j = 0; j < A.length; j++)
for (i = 0; i < A.length; i++)
A[i][j] = 0;
Program 2:
for (i = 0; i < A.length; i++)
for(j = 0; j < A.length; j++)
A[i][j] = 0;
I assume that bringing pages in by row is better than bringing them in by column, but I cannot support my claim. Can you help me calculate the number of page faults?
One way to answer this is by simulation. You could change your loops to output the address of the assignment rather than setting it to zero:
printf("%p\n", &A[i][j]);
Then, write a second program that simulates page placement, so it would do something like:
#include <stdio.h>

#define NWORKING_SET 512               /* frames allocated to the program */

int main(void) {
    const unsigned long pagesize = 4096;   /* 1024 words x 4 bytes each */
    unsigned long h;
    unsigned long work[NWORKING_SET] = {0};
    int lru = 0;
    int fault = 0;
    while (scanf("%lx", &h) == 1) {    /* read one hex address per line */
        h /= pagesize;                 /* address -> page number */
        int i;
        for (i = 0; i < NWORKING_SET && work[i] != h; i++) {
        }
        if (i == NWORKING_SET) {       /* page not resident: fault */
            work[lru] = h;
            fault++;
            lru = (lru + 1) % NWORKING_SET;
        }
    }
    printf("%d\n", fault);
    return 0;
}
With that program in place, you can try multiple traversal strategies. PS: my lru just happens to work; I'm sure you can do much better.
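For reference, a trace generator for the two traversal orders might look like this (a sketch in C; the question declares the array Java-style, so this assumes a plain C int array):

#include <stdio.h>

#define N 1024

int A[N][N];   /* each row is 1024 ints = one 4 KiB page */

int main(void) {
    /* Program 1 order: column by column */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            printf("%p\n", (void *)&A[i][j]);
    /* Pipe this output into the simulator above; swap the two loops
       for Program 2's row-by-row order and compare the fault counts. */
    return 0;
}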
For the second program: the CPU accesses the first int in a row, causing a page fault, then accesses the other ints in the row while the page is still present. This means that (if the rows start on page boundaries) you'll get a page fault per row, plus probably one when the program's code is first started, plus probably another when the program's stack is first used (and one more if the array isn't aligned on a page boundary); which probably works out to 1026 or 1027 page faults.
For the first program: the CPU accesses the first int in a row, causing a page fault; but by the time it accesses the second int in that same row, the page has been evicted (it became "least recently used" and was replaced with a different page). This means that you'll get 1024*1024 page faults while accessing the array (plus one for the program's code, stack, etc.). That probably works out to 1048578 page faults (as long as the start of the array is aligned to "sizeof(int)").
However, this all assumes that the compiler failed to optimize anything. In reality, it's extremely likely that any compiler worth using would have transformed both programs into something a little more like "memset(array, 0, sizeof(int)*1024*1024);" that does consecutive writes (possibly writing multiple ints in a single larger write if the underlying CPU supports larger writes). This implies that both programs would probably cause 1026 or 1027 page faults.

Why is my program slow when accessing array elements at 4K offsets?

I wrote a program to measure the cache and cache line size of my computer, but I got a result that I can't explain. Could anyone help me explain it?
Here is my program: access_array() traverses the array with different step sizes, and I measure the execution time for each step size.
// Program to calculate L1 cache line size, compile in g++ -O1
#include <iostream>
#include <string>
#include <sys/time.h>
#include <cstdlib>
using namespace std;
#define ARRAY_SIZE (256 * 1024) // arbitrary array size; must be a power of two so the index mask below works
void access_array(char* arr, int steps)
{
const int loop_cnt = 1024 * 1024 * 32; // arbitrary loop count
int idx = 0;
for (int i = 0; i < loop_cnt; i++)
{
arr[idx] += 10;
idx = (idx + steps) & (ARRAY_SIZE - 1); // if use %, the latency will be too high to see the gap
}
}
int main(int argc, char** argv){
double cpu_us_used;
struct timeval start, end;
for(int step = 1 ; step <= ARRAY_SIZE ; step *= 2){
char* arr = new char[ARRAY_SIZE];
for(int i = 0 ; i < ARRAY_SIZE ; i++){
arr[i] = 0;
}
gettimeofday(&start, NULL); // get start clock
access_array(arr, step);
gettimeofday(&end, NULL); // get end clock
cpu_us_used = 1000000 * (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec);
cout << step << " , " << cpu_us_used << endl;
delete[] arr;
}
return 0;
}
Result: (plot of execution time vs. step size omitted)
My question is:
From 64 to 512, I can't explain why the execution time is almost the same, and why there is linear growth from 1K to 4K.
Here are my assumptions.
For step = 1, every 64 iterations cause 1 cache line miss. And after 32K iterations, the L1 cache is full, so we have an L1 conflict & capacity miss every 64 iterations.
For step = 64, every iteration causes 1 cache line miss. And after 512 iterations, the L1 cache is full, so we have an L1 conflict & capacity miss every iteration.
As a result, there is a gap between step = 32 and 64.
By observing the first gap, I can conclude that the L1 cache line size is 64 bytes.
For step = 512, every iteration causes 1 cache line miss. And after 64 iterations, sets 0, 8, 16, 24, 32, 40, 48, and 56 of the L1 cache are full, so we have an L1 conflict miss every iteration.
For step = 4K, every iteration causes 1 cache line miss. And after 8 iterations, set 0 of the L1 cache is full, so we have an L1 conflict miss every iteration.
The cases from 128 to 4K all incur L1 conflict misses; the difference is that with a larger step, the conflict misses start earlier.
The only idea I can come up with is that there are other mechanisms (maybe paging, TLB, etc.) impacting the execution time.
Here is the cache size & CPU info of my workstation. By the way, I ran this program on my PC as well and got similar results.
Platform : Intel Xeon(R) CPU E5-2667 0 @ 2.90GHz
LEVEL1_ICACHE_SIZE 32768
LEVEL1_ICACHE_ASSOC 8
LEVEL1_ICACHE_LINESIZE 64
LEVEL1_DCACHE_SIZE 32768
LEVEL1_DCACHE_ASSOC 8
LEVEL1_DCACHE_LINESIZE 64
LEVEL2_CACHE_SIZE 262144
LEVEL2_CACHE_ASSOC 8
LEVEL2_CACHE_LINESIZE 64
LEVEL3_CACHE_SIZE 15728640
LEVEL3_CACHE_ASSOC 20
LEVEL3_CACHE_LINESIZE 64
LEVEL4_CACHE_SIZE 0
LEVEL4_CACHE_ASSOC 0
LEVEL4_CACHE_LINESIZE 0
This CPU probably has:
a hardware cache line prefetcher, that detects linear access patterns within the same physical 4 KiB page and prefetches them before an access is made. This stops prefetching at 4 KiB boundaries (because the physical address is likely to be very different and unknown).
a hardware TLB prefetcher, that detects linear access patterns in TLB usage and prefetches TLB entries.
From 1 to 16 the cache line prefetcher is doing its job, fetching cache lines before you access them, so execution time remains the same (unaffected by cache misses).
At 32, the cache line prefetcher starts to struggle (due to the "stop at 4 KiB page boundary" thing).
From 64 to 512 the TLB prefetcher is doing its job, fetching TLB entries before you access them, so execution time remains the same (unaffected by TLB misses).
From 512 to 4096 the TLB prefetcher is failing to keep up. The CPU stalls waiting for TLB info once every "4096/step" accesses; and these stalls cause the "linear-ish" growth in execution time.
From 4096 to 131072, I'd like to assume that the "new char[ARRAY_SIZE];" allocates so much space that the library and/or OS decided to give you 2 MiB pages and/or 1 GiB pages, eliminating some TLB misses and improving execution time as the number of pages being accessed decreases.
For "larger than 131072"; I'd assume you start to see the effects of "1 GiB page TLB miss".
Note that it's probably easier (and less error prone) to get the cache characteristics (size, associativity, how many logical CPUs are sharing it, ..) and cache line size from the CPUID instruction. The approach you're using is more suited to measuring cache latency (how long it takes to get data from one of the caches).
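For instance, a sketch using GCC's <cpuid.h> to enumerate caches via CPUID leaf 4 (Intel's deterministic cache parameters; field layout per the Intel SDM, and assuming an Intel CPU):

#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    /* Sub-leaf n of leaf 4 describes the n-th cache; type 0 = no more. */
    for (unsigned n = 0; ; n++) {
        unsigned eax, ebx, ecx, edx;
        if (!__get_cpuid_count(4, n, &eax, &ebx, &ecx, &edx))
            break;
        unsigned type = eax & 0x1f;
        if (type == 0)
            break;
        unsigned level = (eax >> 5) & 0x7;
        unsigned line  = (ebx & 0xfff) + 1;
        unsigned parts = ((ebx >> 12) & 0x3ff) + 1;
        unsigned ways  = ((ebx >> 22) & 0x3ff) + 1;
        unsigned sets  = ecx + 1;
        printf("L%u %-7s: %7u bytes, %2u-way, %u-byte lines\n",
               level, type == 1 ? "data" : type == 2 ? "instr" : "unified",
               ways * parts * line * sets, ways, line);
    }
    return 0;
}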
Also; to reduce TLB interference the OS might allow you to explicitly ask for 1 GiB pages (e.g. mmap(..., MAP_POPULATE | MAP_HUGE_1GB, ...) on Linux); and you can "pre-warm" the TLB by doing a "touch then CLFLUSH" warm-up loop before you start measuring. The hardware cache line prefetcher can be disabled via a flag in an MSR (if you have permission), or can be defeated by using a "random" (unpredictable) access pattern, as in the sketch below.
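As an illustration of that last point, here's a sketch (my addition, not from the answer) of a pointer-chasing loop over a randomly shuffled cycle; each load depends on the previous one, so the prefetcher cannot predict the next address:

#include <stdio.h>
#include <stdlib.h>

#define N (256 * 1024)   /* slots to chase through (2 MiB of pointers) */

int main(void)
{
    size_t *next = malloc(N * sizeof *next);
    size_t *perm = malloc(N * sizeof *perm);
    for (size_t i = 0; i < N; i++)
        perm[i] = i;
    for (size_t i = N - 1; i > 0; i--) {   /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (size_t i = 0; i < N; i++)         /* link slots into one random cycle */
        next[perm[i]] = perm[(i + 1) % N];
    size_t p = 0;
    for (long k = 0; k < 10L * N; k++)     /* each load depends on the previous */
        p = next[p];
    printf("%zu\n", p);                    /* use p so the chase isn't optimized out */
    free(next);
    free(perm);
    return 0;
}

Timing this loop (divided by the iteration count) approximates the load-to-use latency of whatever level of the hierarchy the working set lands in.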
Finally I found the answer.
I tried setting the array size to 16 KiB and smaller, but it was slow at step = 4 KiB too.
On the other hand, I tried changing the step offset from multiplying by 2 each iteration to adding 1 each iteration, and it still slowed down at step = 4 KiB.
Code
#define ARRAY_SIZE (4200)
void access_array(char* arr, int steps)
{
const int loop_cnt = 1024 * 1024 * 32; // arbitrary loop count
int idx = 0;
for (int i = 0; i < loop_cnt; i++)
{
arr[idx] += 10;
idx = idx + steps;
if(idx >= ARRAY_SIZE)
idx = 0;
}
}
for(int step = 4090 ; step <= 4100 ; step ++){
char* arr = new char[ARRAY_SIZE];
for(int i = 0 ; i < ARRAY_SIZE ; i++){
arr[i] = 0;
}
gettimeofday(&start, NULL); // get start clock
access_array(arr, step);
gettimeofday(&end, NULL); // get end clock
cpu_us_used = 1000000 * (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec);
cout << step << " , " << cpu_us_used << endl;
delete[] arr;
}
Result
4090 , 48385
4091 , 48497
4092 , 48136
4093 , 48520
4094 , 48090
4095 , 48278
4096 , **51818**
4097 , 48196
4098 , 48600
4099 , 48185
4100 , 63149
As a result, I suspected it was not related to any cache / TLB / prefetch mechanism.
With more googling on the relationship between performance and the magic number 4K, I found the 4K aliasing problem on Intel platforms, which slows down loads.
This occurs when a load is issued soon after a store and their memory addresses differ by a multiple of 4K. When this is processed in the pipeline, the load will match the previous store (the full address is not used at this point), so the pipeline will try to forward the result of the store and avoid doing the load (this is store forwarding). Later on, when the address of the load is fully resolved, it will not match the store, so the load has to be re-issued from a later point in the pipeline. This has a 5-cycle penalty in the normal case, but can be worse in certain situations, like with unaligned loads that span two cache lines.
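To see the effect in isolation, here is a hypothetical micro-benchmark (my sketch, not from the thread) that stores to one address and immediately loads from an address offset by roughly 4K; you would expect a spike at offset 4096, matching the numbers above:

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define LOOPS (1024 * 1024 * 32)

int main(void)
{
    volatile char *a = calloc(8192, 1);  /* volatile keeps both accesses in the loop */
    struct timeval start, end;
    for (int off = 4090; off <= 4100; off++) {
        gettimeofday(&start, NULL);
        for (int i = 0; i < LOOPS; i++) {
            a[0] = (char)i;   /* store ... */
            char x = a[off];  /* ... then a load whose low 12 address bits
                                 collide with the store when off == 4096 */
            (void)x;
        }
        gettimeofday(&end, NULL);
        long us = 1000000L * (end.tv_sec - start.tv_sec)
                + (end.tv_usec - start.tv_usec);
        printf("%d , %ld\n", off, us);
    }
    free((void *)a);
    return 0;
}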

how to calculate cache misses?

I got the following question:
A and B are arrays of 4 integers (integer = 4 bytes = one word) on a computer that uses a cache with a size of 64 bytes and a block size of one word.
A starts at address 0 and B starts at address 16.
Assume the cache is initially empty.
A user run the following code:
for (i=0; i<2; i++)
{
for (j=0; j<4; j++) {
read A[j]
read B[j]
}
}
I'm asked to answer and explain how many cache misses you would expect in the following cases:
a) The cache uses direct mapping.
b) The cache uses 2-Way Set Associativity
What does it mean that 'A starts at address 0 and B starts at address 16'? I'm not sure how to approach this question.
It's saying:
&A[0] == 0
&B[0] == 16
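From there, you can enumerate the accesses and simulate both organizations yourself. A minimal sketch (my addition, not part of the original answer) with LRU replacement in the 2-way case:

#include <stdio.h>

#define CACHE_BYTES 64
#define BLOCK_BYTES 4   /* one 4-byte word per block */

/* Count misses for an n-way set-associative cache with LRU replacement. */
static int count_misses(int ways) {
    int sets = CACHE_BYTES / BLOCK_BYTES / ways;
    long tag[16][2];                       /* up to 16 sets, up to 2 ways */
    int age[16][2];
    for (int s = 0; s < sets; s++)
        for (int w = 0; w < ways; w++) { tag[s][w] = -1; age[s][w] = 0; }
    int misses = 0, clock = 0;
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 4; j++)
            for (int k = 0; k < 2; k++) {  /* k=0: read A[j], k=1: read B[j] */
                long addr = (k == 0 ? 0 : 16) + 4L * j;
                long blk = addr / BLOCK_BYTES;
                int set = blk % sets;
                int hit = -1, victim = 0;
                for (int w = 0; w < ways; w++) {
                    if (tag[set][w] == blk) hit = w;
                    if (age[set][w] < age[set][victim]) victim = w;
                }
                if (hit < 0) { misses++; tag[set][victim] = blk; hit = victim; }
                age[set][hit] = ++clock;   /* LRU bookkeeping */
            }
    return misses;
}

int main(void) {
    printf("direct-mapped: %d misses\n", count_misses(1));
    printf("2-way:         %d misses\n", count_misses(2));
    return 0;
}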

printf performance issue in openmp

I have been told not to use printf in OpenMP programs, as it degrades the performance of a parallel simulation program.
I want to know what the substitute for it is, i.e. how to display the output of a program without using printf.
I have the following AES-128 simulation problem using openmp which needs further comments
Parallel simulation of AES in C using Openmp
I want to know how to output the ciphertext without degrading the simulation performance.
Thanks in advance.
You cannot have your cake and eat it too. Decide whether you want great parallel performance or whether it's important to see the output of the algorithm while the parallel loop is running.
The obvious offline solution is to store the plaintexts, keys, and ciphertexts in arrays. In your case that would require 119 MiB (= 650000*(3*4*16) bytes) in the original case and only 12 MiB in the case with 65000 trials. Nothing that a modern machine with GiBs of RAM cannot handle. The latter case even fits in the last-level cache of some server-class CPUs.
#define TRIALS 65000
int (*key)[16];
int (*pt)[16];
int (*ct)[16];
double timer;
key = malloc(TRIALS * sizeof(*key));
pt = malloc(TRIALS * sizeof(*pt));
ct = malloc(TRIALS * sizeof(*ct));
timer = -omp_get_wtime();
#pragma omp parallel for private(rnd,j)
for(i = 0; i < TRIALS; i++)
{
...
for(j = 0; j < 4; j++)
{
key[i][4*j] = (rnd[j] & 0xff);
pt[i][4*j] = key[i][4*j];
key[i][4*j+1] = ((rnd[j] >> 8) & 0xff) ;
pt[i][4*j+1] = key[i][4*j+1];
key[i][4*j+2] = ((rnd[j] >> 16) & 0xff) ;
pt[i][4*j+2] = key[i][4*j+2];
key[i][4*j+3] = ((rnd[j] >> 24) & 0xff) ;
pt[i][4*j+3] = key[i][4*j+3];
}
encrypt(key[i],pt[i],ct[i]);
}
timer += omp_get_wtime();
printf("Encryption took %.6f seconds\n", timer);
// Now display the results serially
for (i = 0; i < TRIALS; i++)
{
/* display pt[i], key[i] -> ct[i] */
}
free(key); free(pt); free(ct);
To see the speed-up, you have to measure only the time spent in the parallel region. If you also measure the time it takes to display the results, you will be back to where you started.
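If holding the full binary result arrays is a concern, a variation (my sketch, not from the answer) is to have each thread format its text into a preassigned slot of one big character buffer and write everything out serially afterwards. This assumes compilation with OpenMP enabled (e.g. -fopenmp):

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define TRIALS 65000
#define LINE_LEN 128

int main(void)
{
    /* One preallocated text slot per trial; each thread formats only its
       own slots, so no locking is needed and stdout is touched once. */
    char *lines = malloc((size_t)TRIALS * LINE_LEN);
    #pragma omp parallel for
    for (int i = 0; i < TRIALS; i++) {
        /* ... run trial i here; record a placeholder result ... */
        snprintf(lines + (size_t)i * LINE_LEN, LINE_LEN,
                 "trial %d done on thread %d\n", i, omp_get_thread_num());
    }
    for (int i = 0; i < TRIALS; i++)   /* serial output phase */
        fputs(lines + (size_t)i * LINE_LEN, stdout);
    free(lines);
    return 0;
}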
