how to calculate cache misses?

how to calculate cache misses? - caching

I got the following question:
A and B are arrays of 4 integers (integer = 4 bytes = one word) on a computer that uses a cache with cache size of 64 Bytes, and with block size of one word.
A starts in address 0 and B starts in address 16
Assume the cache is initially empty.
A user run the following code:
for (i=0; i<2; i++)
{
for (j=0; j<4; j++) {
read A[j]
read B[j]
}
}
I'm asked to answer&explain how many cache misses would you expect at the following cases:
a) The cache uses direct mapping.
b) The cache uses 2-Way Set Associativity
What does it mean that 'A starts in address 0 and B starts in address 16'? don't sure how to access this question

It's saying:
&A[0] == 0
&B[0] == 16

Related

how do we calculate the number of reads/misses of the cache in this code snippet?

Given this code snippet from this textbook that I am currently studying. Randal E. Bryant, David R. O’Hallaron - Computer Systems. A Programmer’s Perspective [3rd ed.] (2016, Pearson) (global edition, so the book's exercises could be wrong.)
for (i = 31; i >= 0; i--) {
for (j = 31; j >= 0; j--) {
total_x += grid[i][j].x;
}
}
for (i = 31; i >= 0; i--) {
for (j = 31; j >= 0; j--) {
total_y += grid[i][j].y;
}
}
and this is the information given
The heart of the recent hit game SimAquarium is a tight loop that calculates the
average position of 512 algae. You are evaluating its cache performance on a
machine with a 2,048-byte direct-mapped data cache with 32-byte blocks (B = 32).
struct algae_position {
int x;
int y;
};
struct algae_position grid[32][32];
int total_x = 0, total_y = 0;
int i, j;
You should also assume the following:
sizeof(int) = 4.
grid begins at memory address 0.
The cache is initially empty.
The only memory accesses are to the entries of the array grid.
Variables i, j,
total_x, and total_y are stored in registers
The book gives the following questions as practice:
A. What is the total number of reads?
Answer given : 2048
B. What is the total number of reads that miss in the cache?
Answer given : 1024
C. What is the miss rate?
Answer given: 50%
I'm guessing for A, the answer is derived from 32*32 *2? 32*32 for the dimensions of the matrix and 2 because there are 2 separate loops for x and y vals. Is this correct? How should the total number of reads be counted?
How do we calculate the total number of misses that happen in the cache and the miss rate? I read that the miss rate is (1- hit-rate)

Question A
You are correct about 32 x 32 x 2 reads.
Question B
The loops counts down from 31 towards 0 but that doesn't matter for this question. The answer is the same for loops going from 0 to 31. Since that is a bit easier to explain I'll assume increasing loop counters.
When you read grid[0][0], you'll get a cache miss. This will bring grid[0][0], grid[0][1], grid[0][2] and grid[0][3] into the cache. This is because each element is 2x4 = 8 bytes and the block size is 32. In other words: 32 / 8 = 4 grid elements in one block.
So the next cache miss is for grid[0][4] which again will bring the next 4 grid elements into the cache. And so on... like:
miss
hit
hit
hit
miss
hit
hit
hit
miss
hit
hit
hit
...
So in the first loop you simply have:
"Number of grid elements" divided by 4.
or
32 * 32 / 4 = 256
In general in the first loop:
Misses = NumberOfElements / (BlockSize / ElementSize)
so here:
Misses = 32*32 / (32 / 8) = 256
Since the cache size is only 2048 and the whole grid is 32 x 32 x 8 = 8192, nothing read into the cache in the first loop will generate cache hit in the second loop. In other words - both loops will have 256 misses.
So the total number of cache misses are 2 x 256 = 512.
Also notice that there seem to be a bug in the book.
Here:
The heart of the recent hit game SimAquarium is a tight loop that calculates the
average position of 512 algae.
^^^
Hmmm... 512 elements...
Here:
for (i = 31; i >= 0; i--) {
for (j = 31; j >= 0; j--) {
^^^^^^
hmmm... 32 x 32 is 1024
So the loop access 1024 elements but the text says 512. So something is wrong in the book.
Question C
Miss rate = 512 misses / 2048 reads = 25 %
note:
Being very strict we cannot say for sure that the element size is two times the integer size. The C standard allow that structs contain padding. So in principle there could be 8 bytes padding in the struct (i.e. element size being 16) and that would give the results that the book says.

nodemcu - problem writing to arbitrary location in memory

From esp8266 documentation:
You need to call EEPROM.begin(size) before you start reading or writing, size being the number of bytes you want to use. Size can be anywhere between 4 and 4096 bytes.
I had problem reading/writing to memory when start address was not 0 but some xx address and had to write small test program to check it out.
First code ...
EEPROM.begin(16);
for (int i = 0; i < 16; i++)
{
EEPROM.write(i, i);
}
EEPROM.commit();
EEPROM.end();
this is OK. Everything is written correctly.
But if I change start address …
EEPROM.begin(20);
for (int i = 0; i < 20; i++)
{
EEPROM.write(i+16, i);
}
EEPROM.commit();
EEPROM.end();
it only writes first 4 bytes since in begin I requested 20 bytes.
My question is: Is this normal or is it a bug? In documentation states
size being the number of bytes you want to use
so if I want to use only 20 bytes from random address should I write EEPROM.begin(20);?
If it's not bug how to read from address 5000 if max number for begin method is 4096?

Understanding Cache memory

I am trying to understand how cache memory reads and writes. Also I am trying to determine the hit and miss rate. I have tried reading and reading the textbook "Computer Systems - A Programmer Perspective" over and over and can't seem to grasp this idea. Maybe someone can help me understand this:
I am working with a two-dimensional array which has 480 rows and 640 columns. The cache is direct-mapped and 64 KB with 4 byte lines. Below is the C-code:
struct pixel {
char r;
char g;
char b;
char a;
};
struct pixel buffer[480][640];
register int i, j;
register char *cptr;
register int *iptr;
sizeof(char) == 1 (meaning an index in the array consists of 4 byte each (if I am understanding that correctly)). The buffer begins at memory address 0 and the cache is initially empty (cold cache). The only memory accesses are to the entries of the array. All other variables are stored in registers.
for (j=0; j < 640; j++) {
for (i=0; i < 480; i++){
buffer[i][j].r = 0;
buffer[i][j].g = 0;
buffer[i][j].b = 0;
buffer[i][j].a = 0;
}
}
For the code above then it is initializing all the elements in the array to 0, so it must be writing. I can see that this is bad locality because the array is writing column by column instead of row by row. Doesn't that affect the miss rate? I am trying to determine the miss rate for this code based on the cache size. I think the miss rate is 100% and if the locality was row by row then it would be 25%. But I am not totally understanding how cache-memory works so... Can anyone tell me something that could help me understand this better?

I would recommend you to watch the whole Tutorial if you are a beginner.
But for your question, lecture 27 to 31 would explain everything.
https://www.youtube.com/watch?v=tGarzP488Wc&index=29&list=PL2F82ECDF8BB71B0C
IISc Bangalore.

Array of 10000 having 16bit elements, find bits set (unlimited RAM) - Google interview

This was asked in my Google interview recently and I offered an answer which involved bit shift and was O(n) but she said this is not the fastest way to go about doing it. I don't understand, is there a way to count the bits set without having to iterate over the entire bits provided?

Brute force: 10000 * 16 * 4 = 640,000 ops. (shift, compare, increment and iteration for each 16 bits word)
Faster way:
We can build table 00-FF -> number of bits set. 256 * 8 * 4 = 8096 ops
I.e. we build a table where for each byte we calculate a number of bits set.
Then for each 16-bit int we split it to upper and lower
for (n in array)
byte lo = n & 0xFF; // lower 8-bits
byte hi = n >> 8; // higher 8-bits
// simply add number of bits in the upper and lower parts
// of each 16-bits number
// using the pre-calculated table
k += table[lo] + table[hi];
}
60000 ops in total in the iteration. I.e. 68096 ops in total. It's O(n) though, but with less constant (~9 times less).
In other words, we calculate number of bits for every 8-bits number, and then split each 16-bits number into two 8-bits in order to count bits set using the pre-built table.

There's (almost) always a faster way. Read up about lookup tables.

I don't know what the correct answer was when this question was asked, but I believe the most sensible way to solve this today is to use the POPCNT instruction. Specifically, you should use the 64-bit version. Since we just want the total number of set bits, boundaries between 16-bit elements are of no interest to us. Since the 32-bit and 64-bit POPCNT instructions are equally fast, you should use the 64-bit version to count four elements' worth of bits per cycle.

I just implemented it in Java:
import java.util.Random;
public class Main {
static int array_size = 1024;
static int[] array = new int[array_size];
static int[] table = new int[257];
static int total_bits_in_the_array = 0;
private static void create_table(){
int i;
int bits_set = 0;
for (i = 0 ; i <= 256 ; i++){
bits_set = 0;
for (int z = 0; z <= 8 ; z++){
bits_set += i>>z & 0x1;
}
table[i] = bits_set;
//System.out.println("i = " + i + " bits_set = " + bits_set);
}
}
public static void main(String args[]){
create_table();
fill_array();
parse_array();
System.out.println("The amount of bits in the array is: " + total_bits_in_the_array);
}
private static void parse_array() {
int current;
for (int i = 0; i < array.length; i++){
current = array[i];
int down = current & 0xff;
int up = current & 0xff00;
int sum = table[up] + table[down];
total_bits_in_the_array += sum;
}
}
private static void fill_array() {
Random ran = new Random();
for (int i = 0; i < array.length; i++){
array[i] = Math.abs(ran.nextInt()%512);
}
}
}
Also at https://github.com/leitao/bits-in-a-16-bits-integer-array/blob/master/Main.java

You can pre-calculate the bit counts in bytes and then use that for lookup. It is faster, if you make certain assumptions.
Number of operations (just computation, not reading input) should take the following
Shift approach:
For each byte:
2 ops (shift, add) times 16 bits = 32 ops, 0 mem access times 10000 = 320 000 ops + 0 mem access
Pre-calculation approach:
255 times 2 ops (shift, add) times 8 bits = 4080 ops + 255 mem access (write the result)
For each byte:
2 ops (compute addresses) + 2 mem access + op (add the results) = 30 000 ops + 20 000 mem access
Total of 30 480 ops + 20 255 mem access
So a lot more memory access with lot fewer operations
Thus, assuming everything else being equal, pre-calculation for 10 000 bytes is faster if we can assume memory access is faster than an operation by a factor of (320 000 - 30 480)/20 255 = 14.29
Which is probably true if you are alone on a dedicated core on a reasonably modern box as the 255 bytes should fit into a cache. If you start getting cache misses, the assumption might no longer hold.
Also, this math assumes pointer arithmetic and direct memory access as well as atomic operations and atomic memory access. Depending on your language of choice (and, apparently, based on previous answers, your choice of compiler switches), that assumption might not hold.
Finally, things get more interesting if you consider scalability: shifting can be easily parallelised onto up to 10000 cores but pre-computation not necessarily. As byte number goes up, however, lookup gets more and more advantageous.
So, in short. Yes, pre-calculation is faster under pretty reasonable assumptions but no, it is not guaranteed to be faster.

How to clear CPU L1 and L2 cache [duplicate]

This question already has answers here:
How can I do a CPU cache flush in x86 Windows?
(4 answers)
Closed 4 years ago.
I'm running a benchmark on xeon server , and i repeat the executions 2-3 times. I'd like to erase the cache contents in L1 and L2 while repeating the runs. Can you suggest any methods for doing so ?

Try to read repetitly large data via CPU (i.e. not by DMA).
Like:
int main() {
const int size = 20*1024*1024; // Allocate 20M. Set much larger then L2
char *c = (char *)malloc(size);
for (int i = 0; i < 0xffff; i++)
for (int j = 0; j < size; j++)
c[j] = i*j;
}
However depend on server a bigger problem may be a disk cache (in memory) then L1/L2 cache. On Linux (for example) drop using:
sync
echo 3 > /proc/sys/vm/drop_caches
Edit: It is trivial to generate large program which do nothing:
#!/usr/bin/ruby
puts "main:"
200000.times { puts " nop" }
puts " xor rax, rax"
puts " ret"
Running a few times under different names (code produced not the script) should do the work

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio