Minimize the number of page faults by loop interchange - memory-management

Assume the page size is 1024 words and each row is stored in one page.
If the OS allocates 512 frames for the program and uses the LRU page replacement algorithm,
what will be the number of page faults in the following programs?
int A[][] = new int[1024][1024];
Program 1:
for (j = 0; j < A.length; j++)
for (i = 0; i < A.length; i++)
A[i][j] = 0;
Program 2:
for (i = 0; i < A.length; i++)
for(j = 0; j < A.length; j++)
A[i][j] = 0;
I assume that bringing in pages by row is better than bringing them in by column, but I cannot support my claim. Can you help me calculate the number of page faults?

One way to answer this is by simulation. You could change your loops to output the address of the assignment rather than setting it to zero:
printf("%p\n", &A[i][j]);
Then, write a second program that simulates page placement, so it would do something like:
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

#define NWORKING_SET 512    /* frames allocated to the program */
#define PAGESIZE     4096   /* 1024 words per page, assuming 4-byte words */

/* read one hexadecimal address (as printed by %p) per line from stdin */
static int gethex(uintptr_t *h) {
    return scanf("%" SCNxPTR, h) == 1;
}

int main(void) {
    uintptr_t h;
    uintptr_t work[NWORKING_SET] = {0};   /* resident page numbers */
    int lru = 0;        /* next slot to reuse (round-robin approximation) */
    int fault = 0;
    while (gethex(&h)) {
        h /= PAGESIZE;                    /* address -> page number */
        int i;
        for (i = 0; i < NWORKING_SET && work[i] != h; i++) {
        }
        if (i == NWORKING_SET) {          /* not resident: page fault */
            work[lru] = h;
            fault++;
            lru = (lru + 1) % NWORKING_SET;
        }
    }
    printf("%d\n", fault);
    return 0;
}
With that program in place, you can try multiple traversal strategies. PS: my lru just happens to work; I'm sure you can do much better.
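For example, a trace generator for the two traversal orders in the question could look like this (a C sketch; the static array stands in for the Java allocation, and the traversal order is picked on the command line):
#include <stdio.h>

#define N 1024

static int A[N][N];   /* stands in for "new int[1024][1024]" from the question */

int main(int argc, char **argv) {
    int column_major = (argc > 1) && (argv[1][0] == '1');   /* "1" selects Program 1 */
    if (column_major) {
        for (int j = 0; j < N; j++)          /* Program 1: column by column */
            for (int i = 0; i < N; i++)
                printf("%p\n", (void *)&A[i][j]);
    } else {
        for (int i = 0; i < N; i++)          /* Program 2: row by row */
            for (int j = 0; j < N; j++)
                printf("%p\n", (void *)&A[i][j]);
    }
    return 0;
}
Piping its output into the simulator above then gives a fault count for each traversal order.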

For the second program: the CPU accesses the first int in a row, causing a page fault, then accesses the other ints in the row while the page is already present. This means that (if the rows start on page boundaries) you'll get a page fault per row, plus probably one when the program's code is first started, plus probably another when the program's stack is first used (and one more if the array isn't aligned on a page boundary); which probably works out to 1026 or 1027 page faults.
For the first program: the CPU accesses the first int in a row, causing a page fault; but by the time it accesses the second int in that same row the page has been evicted (it became "least recently used" and was replaced with a different page). This means that you'll get 1024*1024 page faults while accessing the array (plus one each for the program's code, stack, etc.). That probably works out to 1048578 page faults (as long as the start of the array is aligned to sizeof(int)).
However, this all assumes that the compiler failed to optimize anything. In reality it's extremely likely that any compiler worth using would have transformed both programs into something a little more like memset(array, 0, sizeof(int)*1024*1024); which does consecutive writes (possibly writing multiple ints in a single larger write if the underlying CPU supports larger writes). This implies that both programs would probably cause 1026 or 1027 page faults.
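As a quick sanity check on the arithmetic above, the estimates can be reproduced directly (only a sketch; the "+2" stands in for the rough extras for the program's code and stack mentioned above):
#include <stdio.h>

int main(void) {
    long rows = 1024, cols = 1024, frames = 512;
    /* Program 2 (row by row): one fault per row, plus a couple of extras */
    long row_major = rows + 2;
    /* Program 1 (column by column): with only 512 of the 1024 row-pages
       resident and LRU replacement, every access faults */
    long col_major = rows * cols + 2;
    printf("with %ld frames: row-major ~%ld faults, column-major ~%ld faults\n",
           frames, row_major, col_major);
    return 0;
}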

Related

how do we calculate the number of reads/misses of the cache in this code snippet?

I'm trying to get an understanding of how to calculate the cache misses in the code below, which is an example given in a textbook. I can see where the calculations come from, but as the values are the same (32) I cannot work out how to do the calculation when the values in the two loops differ. Using different-sized loops, what would the calculations be, please?
for (i = 32; i >= 0; i--) {
    for (j = 128; j >= 0; j--) {
        total_x += grid[i][j].x;
    }
}
for (i = 128; i >= 0; i--) {
    for (j = 32; j >= 0; j--) {
        total_y += grid[i][j].y;
    }
}
If we had a matrix with 128 rows and 24 columns (instead of the 32 x 32 in the example), using 32-bit integers, and with each memory block able to hold 16 bytes, how do we calculate the number of compulsory misses on the top loop?
Also, if we use a direct-mapped cache holding 256 bytes of data, how would we calculate the number of all the data cache misses when running the top loop?
Finally, if we flip it and use the bottom loop, how does the maths change (if it does) for the points above?
Apologies as this is all new to me and I just want to understand the maths behind it so I can answer the problem, rather than just be given an answer.
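One way to organize the arithmetic for the compulsory misses is sketched below, assuming row-major storage, a block-aligned array, and each grid element holding two 32-bit ints (.x and .y); the sizes are the ones named in the question:
#include <stdio.h>

int main(void) {
    long rows = 128, cols = 24;      /* matrix size from the question */
    long elem_size  = 2 * 4;         /* .x and .y, one 32-bit int each */
    long block_size = 16;            /* bytes per memory block */

    /* Compulsory (cold) misses = number of distinct blocks touched for the
       first time. A sweep over the whole array touches every block it spans,
       even if only the .x field is read. */
    long bytes_spanned = rows * cols * elem_size;
    long compulsory    = bytes_spanned / block_size;

    printf("elements per block: %ld\n", block_size / elem_size);
    printf("compulsory misses : %ld\n", compulsory);
    return 0;
}
For the total number of misses with a particular cache (such as the 256-byte direct-mapped one you mention), the easier route is to replay the accesses in the same way, mapping each address to (address / block_size) % number_of_blocks and counting how often the resident block differs from the one being accessed.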

CUDA kernel with single branch runs 1.5x faster than kernel without branch

I've got a strange performance inversion on a filter kernel with and without branching. The kernel with branching runs ~1.5x faster than the kernel without branching.
Basically I need to sort a bunch of radiance rays and then apply interaction kernels. Since there is a lot of accompanying data, I can't use something like thrust::sort_by_key() many times.
Idea of the algorithm:
Run a loop for all possible interaction types (which is five)
At every cycle a warp thread votes for its interaction type
After loop completion every warp thread knows about the other threads with the same interaction type
Threads elect their leader (per interaction type)
Leader updates interactions offsets table using atomicAdd
Each thread writes its data to corresponding offset
I used techniques described in this Nvidia post https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-optimized-filtering-warp-aggregated-atomics/
My first kernel contains a branch inside the loop and runs in ~5 ms:
int active;
int leader;
int warp_progress;
for (int i = 0; i != hit_interaction_count; ++i)
{
if (i == decision)
{
active = __ballot(1);
leader = __ffs(active) - 1;
warp_progress = __popc(active);
}
}
My second kernel uses a lookup table of two elements, uses no branching, and runs in ~8 ms:
int active = 0;
for (int i = 0; i != hit_interaction_count; ++i)
{
const int masks[2] = { 0, ~0 };
int mask = masks[i == decision];
active |= (mask & __ballot(mask));
}
int leader = __ffs(active) - 1;
int warp_progress = __popc(active);
Common part:
int warp_offset;
if (lane_id() == leader)
warp_offset = atomicAdd(&interactions_offsets[decision], warp_progress);
warp_offset = warp_broadcast(warp_offset, leader);
...copy data here...
How can that be? Is there any way to implement such a filter kernel so that it runs faster than the branching one?
UPD: The complete source code can be found in filter_kernel in cuda_equation/radiance_cuda.cu at https://bitbucket.org/radiosity/engine/src
I think this is CPU-programmer brain deformation. On a CPU I expect a performance boost from eliminating the branch and its misprediction penalty.
But there is no branch prediction on the GPU, and hence no penalty, so only the instruction count matters.
First I need to rewrite the code into a simpler form.
With branch:
int active;
for (int i = 0; i != hit_interaction_count; ++i)
if (i == decision)
active = __ballot(1);
Without branch:
int active = 0;
for (int i = 0; i != hit_interaction_count; ++i)
{
int mask = 0 - (i == decision);
active |= (mask & __ballot(mask));
}
In the first version there are ~3 operations: the compare, the if, and __ballot().
In the second version there are ~5 operations: the compare, building the mask, __ballot(), &, and |=.
And there are ~15 ops in the common code.
Both loops run for 5 iterations. In total that is roughly 35 ops in the first version and 45 in the second, and this calculation can explain the performance difference.

observation about false sharing

I have an observation about false sharing; please look at this code:
struct foo {
int x;
int y;
};
static struct foo f;
/* The two following functions are running concurrently: */
int sum_a(void)
{
int s = 0;
int i;
for (i = 0; i < 1000000; ++i)
s += f.x;
return s;
}
void inc_b(void)
{
int i;
for (i = 0; i < 1000000; ++i)
++f.y;
}
Here, sum_a may need to continually re-read x from main memory
(instead of from cache) even though inc_b's concurrent modification of
y should be irrelevant.
On Oracle's website I read this explanation:
simultaneous updates of individual elements in the same cache line
coming from different processors invalidates entire cache lines, even
though these updates are logically independent of each other. Each
update of an individual element of a cache line marks the line as
invalid. Other processors accessing a different element in the same
line see the line marked as invalid. They are forced to fetch a more
recent copy of the line from memory or elsewhere, even though the
element accessed has not been modified. This is because cache
coherency is maintained on a cache-line basis, and not for individual
elements. As a result there will be an increase in interconnect traffic and overhead. Also, while the cache-line update is in progress, access to the elements in the line is inhibited.
My observation is simple:
if what I read is true, how is it possible that the "dirty read" problem exists?
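If you want to observe the effect rather than just reason about it, a minimal harness that actually runs the two functions concurrently might look like this (a sketch: build with -pthread and -O0 so the reads of f.x are not hoisted out of the loop, and compare the runtime against a version where x and y are padded onto separate cache lines):
#include <pthread.h>
#include <stdio.h>

/* same layout as in the question: x and y share one cache line;
   the data race on f is deliberate, it is what the question is about */
struct foo { int x; int y; };
static struct foo f;

static void *sum_a(void *arg) {
    long s = 0;
    for (int i = 0; i < 1000000; ++i)
        s += f.x;                 /* repeatedly reads x */
    *(long *)arg = s;
    return NULL;
}

static void *inc_b(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; ++i)
        ++f.y;                    /* repeatedly writes y from another thread */
    return NULL;
}

int main(void) {
    pthread_t ta, tb;
    long s = 0;
    pthread_create(&ta, NULL, sum_a, &s);
    pthread_create(&tb, NULL, inc_b, NULL);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    printf("s = %ld, f.y = %d\n", s, f.y);
    return 0;
}
The padded version would simply put enough bytes between x and y (or align y to the cache-line size) so that the two fields no longer share a line.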

Understanding Cache memory

I am trying to understand how cache memory reads and writes, and I am trying to determine the hit and miss rate. I have read the textbook "Computer Systems: A Programmer's Perspective" over and over and can't seem to grasp this idea. Maybe someone can help me understand it:
I am working with a two-dimensional array which has 480 rows and 640 columns. The cache is direct-mapped, 64 KB with 4-byte lines. Below is the C code:
struct pixel {
char r;
char g;
char b;
char a;
};
struct pixel buffer[480][640];
register int i, j;
register char *cptr;
register int *iptr;
sizeof(char) == 1, meaning each entry in the array is 4 bytes (if I am understanding that correctly). The buffer begins at memory address 0 and the cache is initially empty (a cold cache). The only memory accesses are to the entries of the array; all other variables are stored in registers.
for (j=0; j < 640; j++) {
for (i=0; i < 480; i++){
buffer[i][j].r = 0;
buffer[i][j].g = 0;
buffer[i][j].b = 0;
buffer[i][j].a = 0;
}
}
The code above initializes all the elements in the array to 0, so it must be writing. I can see that this is bad locality because the array is written column by column instead of row by row. Doesn't that affect the miss rate? I am trying to determine the miss rate for this code based on the cache size. I think the miss rate is 100%, and if the locality were row by row then it would be 25%. But I am not totally understanding how cache memory works, so... can anyone tell me something that could help me understand this better?
I would recommend you watch the whole tutorial if you are a beginner, but for your question, lectures 27 to 31 explain everything:
https://www.youtube.com/watch?v=tGarzP488Wc&index=29&list=PL2F82ECDF8BB71B0C
(IISc Bangalore)
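If it helps to check the numbers, here is a small direct-mapped cache simulator for exactly this access pattern (a sketch; the 64 KB size, 4-byte lines, buffer at address 0, and the column-major loop are all taken from the question):
#include <stdio.h>

#define CACHE_SIZE 65536              /* 64 KB, direct-mapped */
#define LINE_SIZE  4                  /* 4-byte lines */
#define NLINES     (CACHE_SIZE / LINE_SIZE)

int main(void) {
    static long tags[NLINES];
    static int  valid[NLINES];
    long hits = 0, misses = 0;

    /* replay the byte writes of the column-major loop; buffer starts at 0 */
    for (int j = 0; j < 640; j++) {
        for (int i = 0; i < 480; i++) {
            for (int field = 0; field < 4; field++) {     /* .r .g .b .a */
                long addr = ((long)i * 640 + j) * 4 + field;
                long line = (addr / LINE_SIZE) % NLINES;
                long tag  = addr / LINE_SIZE / NLINES;
                if (valid[line] && tags[line] == tag) hits++;
                else { misses++; valid[line] = 1; tags[line] = tag; }
            }
        }
    }
    printf("miss rate = %.1f%%\n", 100.0 * misses / (hits + misses));
    return 0;
}
Changing LINE_SIZE, or swapping the loop order, makes it easy to see how line size and traversal order interact with the miss rate.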

How can I get my CPU's branch target buffer (BTB) size?

It's useful to transform a loop this way when LOOPS > BTB_SIZE, e.g. from
int n = 0;
for (int i = 0; i < LOOPS; i++)
    n++;
to
int n = 0;
for (int i = 0; i < LOOPS; i += 2)
    n += 2;
which can reduce branch misses.
BTB reference: http://www-ee.eng.hawaii.edu/~tep/EE461/Notes/ILP/buffer.html, but it doesn't explain how to get the BTB size.
Any modern compiler worth its salt should optimise this code to int n = LOOPS;, and even in a more complex example the compiler will take care of such optimisations; see LLVM's auto-vectorisation, for instance, which handles many kinds of loop unrolling. Rather than trying to optimise your code by hand, find appropriate compiler flags to get the compiler to do all the hard work.
From the BTB's point of view, both versions are the same. In both versions (if compiled unoptimized) there is only one conditional jump, originating from the i < LOOPS check, so there is only one jump target in the code and thus only one branch target buffer (BTB) entry is used. You can see the resulting assembler code using Matt Godbolt's compiler explorer.
There would be a difference between
for(int i=0;i<n;i++){
if(i%2==0)
do_something();
}
and
for(int i=0;i<n;i++){
if(i%2==0)
do_something();
if(i%3==0)
do_something_different();
}
The first version would need 2 BTB entries (one for the for and one for the if); the second would need 3 (one for the for and one for each of the two ifs).
However, as Matt Godbolt found out, there are 4096 BTB entries (at least on the CPU he measured), so I would not worry too much about them.
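To check this empirically rather than by inspection, one could count branch statistics for both loop shapes; a minimal sketch (the file name loops.c is just for illustration, the perf invocation in the comment assumes Linux, and -O0 matches the "compiled unoptimized" assumption above):
/* Build and compare, e.g. with Linux perf:
 *   gcc -O0 loops.c -o loops
 *   perf stat -e branches,branch-misses ./loops 1
 *   perf stat -e branches,branch-misses ./loops 2
 */
#include <stdio.h>
#include <stdlib.h>

#define LOOPS 100000000L

int main(int argc, char **argv) {
    volatile long n = 0;                      /* keeps the loops from being folded away */
    int version = (argc > 1) ? atoi(argv[1]) : 1;
    if (version == 1) {
        for (long i = 0; i < LOOPS; i++)      /* original loop */
            n++;
    } else {
        for (long i = 0; i < LOOPS; i += 2)   /* unrolled-by-two variant */
            n += 2;
    }
    printf("%ld\n", n);
    return 0;
}
Both runs should report essentially the same branch-miss behaviour, since each version contains a single, highly predictable conditional branch.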

Resources