I'm trying to understand how to calculate the cache misses for the code below, based on the textbook example linked on this page. I can see where the calculations come from, but because both loop bounds are the same (32) in the example, I cannot work out how the calculation would change if the bounds of the two loops differed. With differently sized loops, what would the calculations be, please?
for (i = 32; i >= 0; i--) {
    for (j = 128; j >= 0; j--) {
        total_x += grid[i][j].x;
    }
}
for (i = 128; i >= 0; i--) {
    for (j = 32; j >= 0; j--) {
        total_y += grid[i][j].y;
    }
}
If we had a matrix with 128 rows and 24 columns (instead of the 32 x 32 in the example), using 32-bit integers, and with each memory block able to hold 16 bytes, how do we calculate the number of compulsory misses on the top loop?
Also, if we use a direct-mapped cache holding 256 bytes of data, how would we calculate the total number of data cache misses when running the top loop?
Finally, if we flip it and use the bottom loop, how does the maths change (if it does) for the points above?
Apologies as this is all new to me and I just want to understand the maths behind it so I can answer the problem, rather than just be given an answer.
Nothing - it's a theoretical question
This code snippet is from the textbook I am currently studying: Randal E. Bryant and David R. O'Hallaron, Computer Systems: A Programmer's Perspective, 3rd ed., Pearson, 2016 (the global edition, so the book's exercises could be wrong).
for (i = 31; i >= 0; i--) {
    for (j = 31; j >= 0; j--) {
        total_x += grid[i][j].x;
    }
}
for (i = 31; i >= 0; i--) {
    for (j = 31; j >= 0; j--) {
        total_y += grid[i][j].y;
    }
}
and this is the information given
The heart of the recent hit game SimAquarium is a tight loop that calculates the
average position of 512 algae. You are evaluating its cache performance on a
machine with a 2,048-byte direct-mapped data cache with 32-byte blocks (B = 32).
struct algae_position {
    int x;
    int y;
};
struct algae_position grid[32][32];
int total_x = 0, total_y = 0;
int i, j;
You should also assume the following:
sizeof(int) = 4.
grid begins at memory address 0.
The cache is initially empty.
The only memory accesses are to the entries of the array grid.
Variables i, j, total_x, and total_y are stored in registers.
The book gives the following questions as practice:
A. What is the total number of reads?
Answer given: 2048
B. What is the total number of reads that miss in the cache?
Answer given: 1024
C. What is the miss rate?
Answer given: 50%
I'm guessing that for A the answer is derived from 32*32*2? 32*32 for the dimensions of the matrix, and 2 because there are two separate loops for the x and y values. Is this correct? How should the total number of reads be counted?
How do we calculate the total number of misses that happen in the cache, and the miss rate? I read that the miss rate is (1 - hit rate).
Question A
You are correct about 32 x 32 x 2 reads.
Question B
The loops count down from 31 towards 0, but that doesn't matter for this question; the answer is the same for loops going from 0 to 31. Since that is a bit easier to explain, I'll assume increasing loop counters.
When you read grid[0][0], you'll get a cache miss. This will bring grid[0][0], grid[0][1], grid[0][2] and grid[0][3] into the cache, because each element is 2 x 4 = 8 bytes and the block size is 32 bytes. In other words: 32 / 8 = 4 grid elements fit in one block.
So the next cache miss is for grid[0][4], which again will bring the next 4 grid elements into the cache. And so on, like:
miss
hit
hit
hit
miss
hit
hit
hit
miss
hit
hit
hit
...
So in the first loop you simply have:
"Number of grid elements" divided by 4.
or
32 * 32 / 4 = 256
In general in the first loop:
Misses = NumberOfElements / (BlockSize / ElementSize)
so here:
Misses = 32*32 / (32 / 8) = 256
Since the cache size is only 2048 bytes and the whole grid is 32 x 32 x 8 = 8192 bytes, nothing read into the cache in the first loop will still be there to produce a cache hit in the second loop: between the first loop's access to any given block and the second loop's access to that same block, every other block of the grid is touched (including other blocks that map to the same cache line), so the block has been evicted before it is reused. In other words, both loops have 256 misses.
So the total number of cache misses is 2 x 256 = 512.
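If you want to check this, a minimal direct-mapped cache simulation is easy to write. The sketch below is mine, not from the book; it just hard-codes the assumptions given above (grid at address 0, 8-byte elements with no padding, a 2048-byte cache with 32-byte blocks, i.e. 64 lines) and counts the misses of both loops. It prints 512.

#include <stdio.h>

#define BLOCK 32            /* block size in bytes */
#define LINES 64            /* 2048-byte cache / 32-byte blocks */

static long tags[LINES];
static int  valid[LINES];
static int  misses;

static void touch(long addr)
{
    long block = addr / BLOCK;
    int  set   = (int)(block % LINES);
    long tag   = block / LINES;
    if (!valid[set] || tags[set] != tag) {  /* cold or conflict miss */
        misses++;
        valid[set] = 1;
        tags[set]  = tag;
    }
}

int main(void)
{
    int i, j;
    for (i = 31; i >= 0; i--)               /* first loop: reads of .x */
        for (j = 31; j >= 0; j--)
            touch((long)(i * 32 + j) * 8);
    for (i = 31; i >= 0; i--)               /* second loop: reads of .y */
        for (j = 31; j >= 0; j--)
            touch((long)(i * 32 + j) * 8 + 4);
    printf("misses = %d\n", misses);        /* prints: misses = 512 */
    return 0;
}

Changing the element size from 8 to 16 in the address calculation (the padded-struct case mentioned in the note further down) makes it print 1024, which matches the book's answer.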
Also notice that there seems to be a bug in the book.
Here:
The heart of the recent hit game SimAquarium is a tight loop that calculates the
average position of 512 algae.
^^^
Hmmm... 512 elements...
Here:
for (i = 31; i >= 0; i--) {
for (j = 31; j >= 0; j--) {
^^^^^^
hmmm... 32 x 32 is 1024
So the loops access 1024 elements but the text says 512. So something is wrong in the book.
Question C
Miss rate = 512 misses / 2048 reads = 25 %
note:
Being very strict, we cannot say for sure that the element size is two times the integer size. The C standard allows structs to contain padding, so in principle there could be 8 bytes of padding in the struct (i.e. an element size of 16 bytes), and that would give the results that the book says.
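As a quick sanity check of that assumption, you can print the struct size on your own machine; this little test program is mine, not from the book. On typical platforms it prints 8 (no padding); a value of 16 would explain the book's answers.

#include <stdio.h>

struct algae_position {
    int x;
    int y;
};

int main(void)
{
    /* 8 means no padding; 16 would match the book's figure of 1024 misses */
    printf("sizeof(struct algae_position) = %zu\n", sizeof(struct algae_position));
    return 0;
}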
I have some code and I need to analyze how many cache misses and cache hits occur when it is executed.
The cache has 1024 lines, organized in blocks of 4 one-byte words, and is direct-mapped.
The code:
int N = 1024;
byte x[N], y[N];    /* "byte" stands for a 1-byte type, e.g. unsigned char */
for (int i = 0; i < N-1; i++) {
    for (int j = 0; j < N; j++) {
        x[i] = x[i+1]*y[j];
    }
}
Can someone help me?
Assume page size is 1024 words and each row is stored in one page.
If the OS allocates 512 frames for a program and uses the LRU page replacement algorithm, what will be the number of page faults in the following programs?
int A[][] = new int[1024][1024];
Program 1:
for (j = 0; j < A.length; j++)
    for (i = 0; i < A.length; i++)
        A[i][j] = 0;
Program 2:
for (i = 0; i < A.length; i++)
    for (j = 0; j < A.length; j++)
        A[i][j] = 0;
I assume that bringing in the pages by row is better than bringing them in by column, but I cannot support my claim. Can you help me calculate the number of page faults?
One way to answer this is by simulation. You could change your loops to output the address of the assignment rather than setting it to zero:
printf("%p\n", &A[i][j]);
Then, write a second program that simulates page placement, so it would do something like:
uintptr_t h;
uintptr_t work[NWORKING_SET] = {0};   /* the resident set; NWORKING_SET would be 512 frames here */
int lru = 0;
int fault = 0;
while (gethex(&h)) {                  /* gethex() reads the next address from the trace */
    h /= pagesize;                    /* turn the address into a page number */
    int i;
    for (i = 0; i < NWORKING_SET && work[i] != h; i++) {
        /* linear search for the page in the resident set */
    }
    if (i == NWORKING_SET) {          /* page not resident: a page fault */
        work[lru] = h;
        fault++;
        lru = (lru+1) % NWORKING_SET;
    }
}
printf("%d\n", fault);
With that program in place, you can try multiple traversal strategies. PS: my lru just happens to work; I'm sure you can do much better.
For the second program: the CPU accesses the first int in a row, causing a page fault, then accesses the other ints in the row while the page is already present. This means that (if the rows start on page boundaries) you'll get one page fault per row, plus probably one when the program's code is first started, plus probably another when the program's stack is first used (and one more if the array isn't aligned on a page boundary), which probably works out to 1026 or 1027 page faults.
For the first program: the CPU accesses the first int in a row, causing a page fault; but by the time it accesses the second int in that same row the page has been evicted (it became "least recently used" and was replaced with a different page), because the column-wise traversal touches 1024 different rows, i.e. 1024 different pages, before returning to the same row, and only 512 frames are available. This means that you'll get 1024*1024 page faults while accessing the array (plus one for the program's code, stack, etc.). That probably works out to 1048578 page faults (as long as the start of the array is aligned to "sizeof(int)").
However, this all assumes that the compiler failed to optimize anything. In reality it's extremely likely that any compiler worth using would have transformed both programs into something a little more like "memset(array, 0, sizeof(int)*1024*1024);", which does consecutive writes (possibly writing multiple ints in a single larger write if the underlying CPU supports larger writes). This implies that both programs would probably cause 1026 or 1027 page faults.
I'm a student learning CUDA, and I would like to optimize the execution time of my kernel function. To that end, I wrote a short program computing the difference between two pictures, and I compared the execution time of a classic CPU implementation in C with a GPU implementation in CUDA C.
Here you can find the code I'm talking about:
int *imgresult_data = (int *) malloc(width*height*sizeof(int));
int size = width*height;
switch(computing_type)
{
    case GPU:
        HANDLE_ERROR(cudaMalloc((void**)&dev_data1, size*sizeof(unsigned char)));
        HANDLE_ERROR(cudaMalloc((void**)&dev_data2, size*sizeof(unsigned char)));
        HANDLE_ERROR(cudaMalloc((void**)&dev_data_res, size*sizeof(int)));
        HANDLE_ERROR(cudaMemcpy(dev_data1, img1_data, size*sizeof(unsigned char), cudaMemcpyHostToDevice));
        HANDLE_ERROR(cudaMemcpy(dev_data2, img2_data, size*sizeof(unsigned char), cudaMemcpyHostToDevice));
        HANDLE_ERROR(cudaMemcpy(dev_data_res, imgresult_data, size*sizeof(int), cudaMemcpyHostToDevice));
        float time;
        cudaEvent_t start, stop;
        HANDLE_ERROR( cudaEventCreate(&start) );
        HANDLE_ERROR( cudaEventCreate(&stop) );
        HANDLE_ERROR( cudaEventRecord(start, 0) );
        for(int m = 0; m < nb_loops ; m++)
        {
            diff<<<height, width>>>(dev_data1, dev_data2, dev_data_res);
        }
        HANDLE_ERROR( cudaEventRecord(stop, 0) );
        HANDLE_ERROR( cudaEventSynchronize(stop) );
        HANDLE_ERROR( cudaEventElapsedTime(&time, start, stop) );
        HANDLE_ERROR(cudaMemcpy(imgresult_data, dev_data_res, size*sizeof(int), cudaMemcpyDeviceToHost));
        printf("Time to generate: %4.4f ms \n", time/nb_loops);
        break;
    case CPU:
        clock_t begin = clock(), diff;
        for (int z=0; z<nb_loops; z++)
        {
            // Apply the difference between 2 images
            for (int i = 0; i < height; i++)
            {
                tmp = i*imgresult_pitch;
                for (int j = 0; j < width; j++)
                {
                    imgresult_data[j + tmp] = (int) img2_data[j + tmp] - (int) img1_data[j + tmp];
                }
            }
        }
        diff = clock() - begin;
        float msec = diff*1000/CLOCKS_PER_SEC;
        msec = msec/nb_loops;
        printf("Time taken %4.4f milliseconds", msec);
        break;
}
And here is my kernel function:
__global__ void diff(unsigned char *data1, unsigned char *data2, int *data_res)
{
    int row = blockIdx.x;
    int col = threadIdx.x;
    int v = col + row*blockDim.x;
    if (row < MAX_H && col < MAX_W)
    {
        data_res[v] = (int) data2[v] - (int) data1[v];
    }
}
I obtained these execution times for each one:
CPU: 1.3210 ms
GPU: 0.3229 ms
I wonder why the GPU result is not as low as it should be. I am a beginner in CUDA, so please bear with me if I have made some classic errors.
EDIT1:
Thank you for your feedback. I tried to remove the 'if' condition from the kernel, but it didn't change my program's execution time much.
However, after installing the CUDA profiler, it told me that my threads weren't running concurrently. I don't understand why I get this kind of message, but it seems true, because my application is only 5 or 6 times faster with the GPU than with the CPU. This ratio should be greater, because each thread is supposed to process one pixel concurrently with all the other ones. If you have an idea of what I am doing wrong, that would be helpful...
Flow.
Here are a few things you could do which may improve the performance of your diff kernel:
1. Let each thread do more work
In your kernel, each thread handles just a single element; but having a thread do anything at all already carries a bunch of overhead, at the block and the thread level, including obtaining the parameters, checking the condition and doing address arithmetic. Now, you could say "Oh, but the reads and writes take much more time than that; this overhead is negligible" - but you would be ignoring the fact that the latency of these reads and writes is hidden by the presence of many other warps which may be scheduled to do their work.
So, let each thread process more than a single element. Say, 4, as each thread can easily read 4 bytes at once into a register. Or even 8 or 16; experiment with it. Of course you'll need to adjust your grid and block parameters accordingly; a sketch of this is given below.
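As a rough sketch of that idea (the kernel name diff4, the pixel-count parameter n and the launch configuration below are my own illustrative choices, not from the question, and the sketch assumes the number of pixels is a multiple of 4):

__global__ void diff4(const unsigned char *data1, const unsigned char *data2,
                      int *data_res, int n)        // n = total number of pixels
{
    // Each thread loads 4 pixels from each image as one 32-bit uchar4 read
    // and writes 4 int differences (same data2 - data1 order as in the question).
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
    if (i + 3 < n) {
        uchar4 a = *reinterpret_cast<const uchar4*>(data1 + i);
        uchar4 b = *reinterpret_cast<const uchar4*>(data2 + i);
        data_res[i + 0] = (int)b.x - (int)a.x;
        data_res[i + 1] = (int)b.y - (int)a.y;
        data_res[i + 2] = (int)b.z - (int)a.z;
        data_res[i + 3] = (int)b.w - (int)a.w;
    }
}

// A possible launch, covering n pixels with 256 threads per block:
// diff4<<<((n / 4) + 255) / 256, 256>>>(dev_data1, dev_data2, dev_data_res, n);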
2. "Restrict" your pointers
__restrict is not part of C++, but it is supported in CUDA. It tells the compiler that accesses through different pointers passed to the function never overlap. See:
What does the restrict keyword mean in C++?
Realistic usage of the C99 'restrict' keyword?
Using it allows the CUDA compiler to apply additional optimizations, e.g. loading or storing data via non-coherent cache. Indeed, this happens with your kernel although I haven't measured the effects.
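For your kernel, that would just mean adding the qualifiers to the parameters, something like this (const added as well, since the two inputs are only read; the body is unchanged from the question):

__global__ void diff(const unsigned char * __restrict__ data1,
                     const unsigned char * __restrict__ data2,
                     int * __restrict__ data_res)
{
    // body unchanged from the question
    int row = blockIdx.x;
    int col = threadIdx.x;
    int v = col + row * blockDim.x;
    if (row < MAX_H && col < MAX_W)
    {
        data_res[v] = (int) data2[v] - (int) data1[v];
    }
}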
3. Consider using a "SIMD" instruction
CUDA offers this intrinsic:
__device__ unsigned int __vsubss4 ( unsigned int a, unsigned int b )
which subtracts each signed byte of b from the corresponding byte of a (i.e. it computes a - b per byte, with signed saturation). If you can "live" with that packed result, rather than expecting a larger int per pixel, it could save you some work - and it goes very well with increasing the number of elements per thread. In fact, it might let you increase it even further to get to the optimum.
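A sketch of how that could look, combined with processing four pixels per thread (diff_simd and nwords are illustrative names of mine; note that the pixels are treated as signed bytes and the result is four packed byte differences per 32-bit word rather than four ints, which only helps if the rest of the program can accept that narrower result):

__global__ void diff_simd(const unsigned int * __restrict__ data1,
                          const unsigned int * __restrict__ data2,
                          unsigned int * __restrict__ data_res, int nwords)
{
    // One 32-bit word = four pixels; one saturated per-byte subtraction
    // computes data2 - data1 for all four of them at once.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nwords) {
        data_res[i] = __vsubss4(data2[i], data1[i]);
    }
}

The existing unsigned char buffers would have to be passed as unsigned int pointers over size/4 words for this to work.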
I don't think you are measuring times correctly; memory copies are a time-consuming step on the GPU that you should take into account when measuring your time.
I see some details that you can test:
I suppose you are using MAX_H and MAX_W as constants; you may consider doing so using cudaMemcpyToSymbol().
Remember to sync your threads using __syncthreads(), so you don't get issues between each loop iteration.
CUDA works with warps, so the block count and the number of threads per block work better as multiples of 8, but no more than 512 threads per block unless your hardware supports it. Here is an example using 128 threads per block: <<<(cols*rows+127)/128, 128>>>.
Remember as well to free the memory you allocated on the GPU and to destroy the timing events you created (see the sketch after this list).
In your kernel function you can use a single index variable: int v = threadIdx.x + blockIdx.x * blockDim.x.
Have you tested, besides the execution time, that your result is correct? I think you should use cudaMallocPitch() and cudaMemcpy2D() when working with 2D arrays, due to padding.
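Regarding the cleanup point above, a minimal sketch using the variable names from the question:

// Release the timing events and the device buffers once they are no longer needed.
HANDLE_ERROR( cudaEventDestroy(start) );
HANDLE_ERROR( cudaEventDestroy(stop) );
HANDLE_ERROR( cudaFree(dev_data1) );
HANDLE_ERROR( cudaFree(dev_data2) );
HANDLE_ERROR( cudaFree(dev_data_res) );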
Probably there are other issues with the code, but here's what I see. The following lines in __global__ void diff are not optimal:
if (row < MAX_H && col < MAX_W)
{
    data_res[v] = (int) data2[v] - (int) data1[v];
}
Conditional operators inside a kernel result in warp divergence. It means that the if and else parts inside a warp are executed in sequence, not in parallel. Also, as you might have realized, the if evaluates to false only at the borders. To avoid the divergence and needless computation, split your image in two parts:
Central part, where row < MAX_H && col < MAX_W is always true. Create an additional kernel for this area; the if is unnecessary here.
Border areas that will use your diff kernel.
Obviously you'll have to modify the code that calls the kernels.
And on a separate note:
A GPU has a throughput-oriented architecture, not a latency-oriented one like a CPU. This means a CPU may be faster than CUDA when it comes to processing small amounts of data. Have you tried using larger data sets?
The CUDA Profiler is a very handy tool that will tell you where your code is not optimal.