Change in Apple Metal kernel does not change execution time - gpgpu

I modified the kernel code in the project below
https://developer.apple.com/documentation/metal/basic_tasks_and_concepts/performing_calculations_on_a_gpu?preferredLanguage=occ
from this:
kernel void add_arrays(device const float* inA,
                       device const float* inB,
                       device float* result,
                       uint index [[thread_position_in_grid]])
{
    // The for-loop is replaced with a collection of threads, each of which
    // calls this function.
    result[index] = inA[index] + inB[index];
}
to this, just embedding the same calculation inside a for loop:
kernel void add_arrays(device const float* inA,
                       device const float* inB,
                       device float* result,
                       uint index [[thread_position_in_grid]])
{
    // The for-loop is replaced with a collection of threads, each of which
    // calls this function.
    for (int i = 0; i < 1000000; i++) { // added
        result[index] = inA[index] + inB[index];
    }
}
but the execution time of the program does not change. Am I doing something wrong?

As #frank-schlegel suggested, I first added
for (int i = 0; i < 1000000; i++) {
    result[index] = inA[index] + inB[index] + i;
}
and it did not change the performance,
but when I added the code below
for (int i = 0; i < 1000000; i++) {
    result[index] += i;
}
it did, significantly.
My conclusion is that the Metal compiler optimizes code very aggressively: in the first two variants each iteration simply overwrites result[index], so every store except the last one is dead and the whole loop can be collapsed into a single store, whereas the running sum in the last variant carries a dependency from one iteration to the next and cannot be folded away.
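If you want to benchmark real work without fighting the optimizer, the usual trick is to accumulate into a local variable with a loop-carried dependency and store the result once at the end. Below is a minimal sketch in CUDA syntax (the related questions on this page are CUDA, and the same pattern applies almost verbatim in Metal); the kernel name and the specific arithmetic are illustrative, not from the original post:
// The accumulator carries a dependency from one iteration to the next, and
// floating-point addition is not associative, so the compiler cannot replace
// the loop with a closed-form expression; the single store at the end keeps
// the result observable, so the loop is not dead code either.
__global__ void add_arrays_busy(const float* inA, const float* inB, float* result)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = inA[index] + inB[index];
    for (int i = 0; i < 1000000; i++) {
        acc += (float)i;   // mirrors the post's result[index] += i, but in a register
    }
    result[index] = acc;
}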

Related

CUDA Initialize Array on Device

I am very new to CUDA and I am trying to initialize an array on the device and return the result back to the host to print out, to show whether it was correctly initialized. I am doing this because the end goal is a dot-product solution in which I multiply two arrays together, store the results in another array, and then sum up the entire thing so that I only need to return one value to the host.
In the code I am working on, all I am trying to do is see whether I am initializing the array correctly. I am trying to create an array of size N following the pattern 1,2,3,4,5,6,7,8,1,2,3,...
This is the code I've written. It compiles without issue, but when I run it the terminal hangs and I have no clue why. Could someone help me out here? I'm so incredibly confused :\
#include <stdio.h>
#include <stdlib.h>
#include <chrono>
#define ARRAY_SIZE 100
#define BLOCK_SIZE 32
__global__ void cu_kernel(int *a_d, int *b_d, int *c_d, int size)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    __shared__ int temp;
    if (temp != 8) {
        a_d[x] = temp;
        temp++;
    } else {
        a_d[x] = temp;
        temp = 1;
    }
}
int main(int argc, char *argv[])
{
    // declare pointers for arrays
    int *a_d, *b_d, *c_d, *sum_h, *sum_d, a_h[ARRAY_SIZE];
    // set space for device variables
    cudaMalloc((void**) &a_d, sizeof(int) * ARRAY_SIZE);
    cudaMalloc((void**) &b_d, sizeof(int) * ARRAY_SIZE);
    cudaMalloc((void**) &c_d, sizeof(int) * ARRAY_SIZE);
    cudaMalloc((void**) &sum_d, sizeof(int));
    // set execution configuration
    dim3 dimblock(BLOCK_SIZE);
    dim3 dimgrid(ARRAY_SIZE / BLOCK_SIZE);
    // actual computation: call the kernel
    cu_kernel<<<dimgrid, dimblock>>>(a_d, b_d, c_d, ARRAY_SIZE);
    cudaError_t result;
    // transfer results back to host
    result = cudaMemcpy(a_h, a_d, sizeof(int) * ARRAY_SIZE, cudaMemcpyDeviceToHost);
    if (result != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed.");
        exit(1);
    }
    // print the array
    printf("Final state of the array:\n");
    for (int i = 0; i < ARRAY_SIZE; i++) {
        printf("%d ", a_h[i]);
    }
    printf("\n");
}
There are at least 3 issues with your kernel code.
1) You are using the shared memory variable temp without initializing it.
2) You are not resolving the order in which threads access a shared variable, as discussed here.
3) You are (perhaps) imagining a particular order of thread execution, and CUDA provides no guarantees in that area.
The first item seems self-evident; however, naive methods of initializing it in a multi-threaded environment like CUDA are not going to work. First, there is the multi-threaded access pattern again. Furthermore, in a multi-block scenario, shared memory in one block is logically distinct from shared memory in another block.
Rather than wrestle with mechanisms unsuited to the task (informed by notions carried over from a serial processing environment), I would simply do something trivial like this to create the pattern you desire:
__global__ void cu_kernel(int *a_d, int *b_d, int *c_d, int size)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    if (x < size) a_d[x] = (x & 7) + 1;
}
Are there other ways to do it? Certainly.
__global__ void cu_kernel(int *a_d, int *b_d, int *c_d, int size)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    __shared__ int temp;
    if (!threadIdx.x) temp = blockIdx.x * blockDim.x;
    __syncthreads();
    if (x < size) a_d[x] = ((temp + threadIdx.x) & 7) + 1;
}
You can get as fancy as you like.
These changes will still leave a few values at zero at the end of the array, because your grid sizing rounds down; fixing that requires changing how you compute the grid dimensions. There are many questions about this already, or you can study a sample code like vectorAdd.
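For completeness, the usual fix is to round the grid size up and keep a bounds check in the kernel, as the kernels above already do with if (x < size). A minimal sketch, reusing the ARRAY_SIZE and BLOCK_SIZE definitions from the question:
// Round the grid up: 100 elements with 32-thread blocks gives 4 blocks
// (128 threads) instead of 3 (96 threads); the if (x < size) check in the
// kernel makes the 28 surplus threads do nothing.
dim3 dimblock(BLOCK_SIZE);
dim3 dimgrid((ARRAY_SIZE + BLOCK_SIZE - 1) / BLOCK_SIZE);
cu_kernel<<<dimgrid, dimblock>>>(a_d, b_d, c_d, ARRAY_SIZE);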

Floating point min/max in CUDA slower than CPU version. Why?

I wrote a kernel for computing the min and max values of an array of about 100,000 floats using reduction (see code below). I use thread blocks to reduce chunks of 1024 values to a single value (in shared memory), and then do the final reduction among the blocks on the CPU.
I then compared this with a serial calculation just on the CPU. The CUDA version takes 2.2ms, and the CPU version takes 0.21ms. Why is the CUDA version much slower? Is the array size not large enough to take advantage of the parallelism, or is my code not optimized somehow?
This is part of an exercise in the Udacity Parallel Programming class. I am running this through their web site, so I don't know what the exact hardware is, but they claim the code runs on actual GPUs.
Here is the CUDA code:
__global__ void min_max_kernel(const float* const d_logLuminance,
                               const size_t length,
                               float* d_min_logLum,
                               float* d_max_logLum) {
    // Shared working memory
    extern __shared__ float sh_logLuminance[];

    int blockWidth = blockDim.x;
    int x = blockDim.x * blockIdx.x + threadIdx.x;

    float* min_logLuminance = sh_logLuminance;
    float* max_logLuminance = sh_logLuminance + blockWidth;

    // Copy this block's chunk of the data to shared memory
    // We copy twice so we compute min and max at the same time
    if (x < length) {
        min_logLuminance[threadIdx.x] = d_logLuminance[x];
        max_logLuminance[threadIdx.x] = min_logLuminance[threadIdx.x];
    }
    else {
        // Pad if we're out of range
        min_logLuminance[threadIdx.x] = FLT_MAX;
        max_logLuminance[threadIdx.x] = -FLT_MAX;
    }
    __syncthreads();

    // Reduce
    for (int s = blockWidth / 2; s > 0; s /= 2) {
        if (threadIdx.x < s) {
            if (min_logLuminance[threadIdx.x + s] < min_logLuminance[threadIdx.x]) {
                min_logLuminance[threadIdx.x] = min_logLuminance[threadIdx.x + s];
            }
            if (max_logLuminance[threadIdx.x + s] > max_logLuminance[threadIdx.x]) {
                max_logLuminance[threadIdx.x] = max_logLuminance[threadIdx.x + s];
            }
        }
        __syncthreads();
    }

    // Write to global memory
    if (threadIdx.x == 0) {
        d_min_logLum[blockIdx.x] = min_logLuminance[0];
        d_max_logLum[blockIdx.x] = max_logLuminance[0];
    }
}
size_t get_num_blocks(size_t inputLength, size_t threadsPerBlock) {
    return inputLength / threadsPerBlock +
        ((inputLength % threadsPerBlock == 0) ? 0 : 1);
}

/*
 * Compute min, max over the data by first reducing on the device, then
 * doing the final reduction on the host.
 */
void compute_min_max(const float* const d_logLuminance,
                     float& min_logLum,
                     float& max_logLum,
                     const size_t numRows,
                     const size_t numCols) {
    // Compute min, max
    printf("\n=== computing min/max ===\n");
    const size_t blockWidth = 1024;
    const size_t numPixels = numRows * numCols;
    size_t numBlocks = get_num_blocks(numPixels, blockWidth);
    printf("Num min/max blocks = %zu\n", numBlocks);

    float* d_min_logLum;
    float* d_max_logLum;
    int alloc_size = sizeof(float) * numBlocks;
    checkCudaErrors(cudaMalloc(&d_min_logLum, alloc_size));
    checkCudaErrors(cudaMalloc(&d_max_logLum, alloc_size));

    min_max_kernel<<<numBlocks, blockWidth, sizeof(float) * blockWidth * 2>>>
        (d_logLuminance, numPixels, d_min_logLum, d_max_logLum);

    float* h_min_logLum = (float*) malloc(alloc_size);
    float* h_max_logLum = (float*) malloc(alloc_size);
    checkCudaErrors(cudaMemcpy(h_min_logLum, d_min_logLum, alloc_size, cudaMemcpyDeviceToHost));
    checkCudaErrors(cudaMemcpy(h_max_logLum, d_max_logLum, alloc_size, cudaMemcpyDeviceToHost));

    min_logLum = FLT_MAX;
    max_logLum = -FLT_MAX;
    // Reduce over the block results
    // (would be a bit faster to do it on the GPU, but it's just 96 numbers)
    for (int i = 0; i < numBlocks; i++) {
        if (h_min_logLum[i] < min_logLum) {
            min_logLum = h_min_logLum[i];
        }
        if (h_max_logLum[i] > max_logLum) {
            max_logLum = h_max_logLum[i];
        }
    }
    printf("min_logLum = %.2f\nmax_logLum = %.2f\n", min_logLum, max_logLum);

    checkCudaErrors(cudaFree(d_min_logLum));
    checkCudaErrors(cudaFree(d_max_logLum));
    free(h_min_logLum);
    free(h_max_logLum);
}
And here is the host version:
void compute_min_max_on_host(const float* const d_logLuminance, size_t numPixels) {
    int alloc_size = sizeof(float) * numPixels;
    float* h_logLuminance = (float*) malloc(alloc_size);
    checkCudaErrors(cudaMemcpy(h_logLuminance, d_logLuminance, alloc_size, cudaMemcpyDeviceToHost));

    float host_min_logLum = FLT_MAX;
    float host_max_logLum = -FLT_MAX;
    printf("HOST ");
    for (int i = 0; i < numPixels; i++) {
        if (h_logLuminance[i] < host_min_logLum) {
            host_min_logLum = h_logLuminance[i];
        }
        if (h_logLuminance[i] > host_max_logLum) {
            host_max_logLum = h_logLuminance[i];
        }
    }
    printf("host_min_logLum = %.2f\nhost_max_logLum = %.2f\n",
           host_min_logLum, host_max_logLum);
    free(h_logLuminance);
}
As #talonmies suggests, behavior may be different for larger sizes; 100,000 elements is really not that much: much of it fits within the combined L1 caches of the cores on a modern CPU, and half of it fits in a single core's L2 cache.
Transfer over PCI Express takes time; in your case roughly double the time it needs to, since you don't use pinned memory (see the sketch after these points).
You're not overlapping computation and PCI Express I/O (not that it would make much sense for only 100,000 elements).
Your kernel is rather slow, for more than one reason, not the least of which is the extensive use of shared memory, most of which is unnecessary.
More generally: always profile your code using nvvp (or nvprof for textual information you can analyse further).
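To illustrate the pinned-memory point, here is a minimal sketch; the variable names reuse the ones from the code above, and the same checkCudaErrors macro is assumed:
// Pinned (page-locked) host memory lets cudaMemcpy run at full PCIe bandwidth
// and is required for truly asynchronous copies with cudaMemcpyAsync.
float* h_logLuminance = NULL;
checkCudaErrors(cudaMallocHost((void**)&h_logLuminance, sizeof(float) * numPixels));
checkCudaErrors(cudaMemcpy(h_logLuminance, d_logLuminance,
                           sizeof(float) * numPixels, cudaMemcpyDeviceToHost));
// ... reduce on the host as before ...
checkCudaErrors(cudaFreeHost(h_logLuminance));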

Optimising Matrix Multiplication OpenCL - Purpose: learn how to manage memory

I'm new to OpenCL and trying to understand how to optimise matrix multiplication to become familiar with the various paradigms. Here's the current code.
If I'm multiplying matrices A and B, I allocate a row of A in private memory to start with (because each work item uses it), and a column of B in local memory (because each work group uses it).
1) The code is currently incorrect. Unfortunately I'm struggling with how to use the local work IDs to get it right, and I can't find my mistake. I'm basing myself on http://www.cs.bris.ac.uk/home/simonm/workshops/OpenCL_lecture3.pdf (slide 27), but it seems that this is wrong too, as they don't make use of loc_size in their inner loop.
2) Are there any other optimisations you would suggest with this code?
__kernel void mmul(
    __global int* C,
    __global int* A,
    __global int* B,
    const int rA,
    const int rB,
    const int cC,
    __local char* local_mem)
{
    int k, ty;
    int tx = get_global_id(0);
    int loctx = get_local_id(0);
    int loc_size = get_local_size(0);
    int value = 0;

    int tmp_array[1000];
    for (k = 0; k < rB; k++) {
        tmp_array[k] = A[tx * cA + k];
    }

    for (ty = 0; ty < cC; ty++) {
        for (k = loctx; k < rB; k += loc_size) {
            local_mem[k] = B[ty + k * cC];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
        value = 0;
        for (k = 0; k < rB; k += 1) {
            int i = loctx + k * loc_size;
            value += tmp_array[k] * local_mem[i];
        }
        C[ty + (tx * cC)] = value;
    }
}
where I set the global and local work items as follows
const size_t globalWorkItems[1] = {result_row};
const size_t localWorkItems[1] = {(size_t)local_wi_size};
local_wi_size is result_row divided by the number of compute units (chosen such that result_row % (number of compute units) == 0).
Your code is pretty close, but the indexing into the local memory array is actually simpler than you think. You have a row in private memory and a column in local memory, and you need to compute the dot product of these two vectors. You just need to sum row[k]*col[k], for k = 0 up to N-1:
for (k = 0; k < rB; k += 1) {
    value += tmp_array[k] * local_mem[k];
}
There's actually a second, more subtle bug that is also present in the example solution given on the slides you are using. Since you are reading and writing local memory inside a loop, you actually need two barriers, in order to make sure that work-items writing to local memory on iteration i don't overwrite values that are being read by other work-items executing iteration i-1.
Therefore, the full code for your kernel (tested and working), should look something like this:
__kernel void mmul(
    __global int* C,
    __global int* A,
    __global int* B,
    const int rA,
    const int rB,
    const int cC,
    __local char* local_mem)
{
    int k, ty;
    int tx = get_global_id(0);
    int loctx = get_local_id(0);
    int loc_size = get_local_size(0);
    int value = 0;

    int tmp_array[1000];
    for (k = 0; k < rB; k++) {
        tmp_array[k] = A[tx * cA + k];
    }

    for (ty = 0; ty < cC; ty++) {
        for (k = loctx; k < rB; k += loc_size) {
            local_mem[k] = B[ty + k * cC];
        }
        barrier(CLK_LOCAL_MEM_FENCE); // First barrier to ensure writes have finished

        value = 0;
        for (k = 0; k < rB; k += 1) {
            value += tmp_array[k] * local_mem[k];
        }
        C[ty + (tx * cC)] = value;

        barrier(CLK_LOCAL_MEM_FENCE); // Second barrier to ensure reads have finished
    }
}
You can find the full set of exercises and solutions that go with the slides you are looking at on the HandsOnOpenCL GitHub page. There's also a more complete set of slides from the same tutorial available here, which go on to show a much more optimised matrix multiply example that uses a blocking approach to better exploit temporal and spatial locality. The aforementioned missing barrier bug has been fixed in the example solution code, but not on the slides (yet).
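To give a flavour of what that blocking approach looks like, here is a minimal sketch in CUDA syntax (the tutorial's version is OpenCL, but the structure is identical); the kernel name, the TILE size, and the assumption that N is a multiple of TILE are mine, not from the tutorial:
#define TILE 16

// Each 16x16 thread block stages one tile of A and one tile of B in shared
// memory, so every element loaded from global memory is reused TILE times.
__global__ void mmul_tiled(const float* A, const float* B, float* C, int N)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();                 // tiles fully loaded before anyone reads

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                 // everyone done reading before the next load
    }
    C[row * N + col] = acc;
}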

A question about the details of how blocks are distributed to SMs in CUDA

Let me take hardware with compute capability 1.3 as an example.
30 SMs are available, so at most 240 blocks can be running at the same time (considering the limits on registers and shared memory, the actual number of resident blocks may be much lower). Blocks beyond the first 240 have to wait for hardware resources to become available.
My question is when the blocks beyond the first 240 are assigned to SMs: as soon as some of the first 240 blocks complete, or only when all of the first 240 blocks have finished?
I wrote the following piece of code.
#include <stdio.h>
#include <string.h>
#include <cuda_runtime.h>
#include <cutil_inline.h>

const int BLOCKNUM = 1024;
const int N = 240;

__global__ void kernel(volatile int* mark) {
    if (blockIdx.x == 0) while (mark[N] == 0);
    if (threadIdx.x == 0) mark[blockIdx.x] = 1;
}

int main() {
    int *mark;
    cudaMalloc((void**) &mark, sizeof(int) * BLOCKNUM);
    cudaMemset(mark, 0, sizeof(int) * BLOCKNUM);
    kernel<<<BLOCKNUM, 1>>>(mark);
    cudaFree(mark);
    return 0;
}
This code causes a deadlock and fails to terminate. But if I change N from 240 to 239, the code is able to terminate. So I want to know some details about the scheduling of blocks.
On the GT200, it has been demonstrated through micro-benchmarking that new blocks are scheduled whenever an SM has retired all of the blocks it was currently running. So the answer is when some blocks are finished, and the scheduling granularity is the SM level. There seems to be a consensus that Fermi GPUs have a finer scheduling granularity than previous generations of hardware.
I can't find any reference about this for compute capabilities < 1.3.
The Fermi architecture introduces a new block dispatcher called the GigaThread engine.
GigaThread enables immediate replacement of blocks on an SM when one completes executing, and also enables concurrent kernel execution.
While there is no official answer to this, you can measure through atomic operations when your blocks begin their work and when they end.
Try playing with the following code:
#include <stdio.h>

const int maxBlocks = 60; // Number of blocks of size 512 threads on current device required to achieve full occupancy

__global__ void emptyKernel() {}

__global__ void myKernel(int *control, int *output) {
    if (threadIdx.x == 1) {
        // register that we enter
        int enter = atomicAdd(control, 1);
        output[blockIdx.x] = enter;

        // some intensive and long task
        int &var = output[blockIdx.x + gridDim.x]; // var references global memory
        var = 1;
        for (int i = 0; i < 12345678; ++i) {
            var += 1 + tanhf(var);
        }

        // register that we quit
        var = atomicAdd(control, 1);
    }
}

int main() {
    int *gpuControl;
    cudaMalloc((void**)&gpuControl, sizeof(int));
    int cpuControl = 0;
    cudaMemcpy(gpuControl, &cpuControl, sizeof(int), cudaMemcpyHostToDevice);

    int *gpuOutput;
    cudaMalloc((void**)&gpuOutput, sizeof(int) * maxBlocks * 2);
    int cpuOutput[maxBlocks * 2];
    for (int i = 0; i < maxBlocks * 2; ++i) // clear the host array just to be on the safe side
        cpuOutput[i] = -1;

    // play with these values
    const int thr = 479;
    const int p = 13;
    const int q = maxBlocks;

    // I found that this may actually affect the scheduler! Try with and without this call.
    emptyKernel<<<p, thr>>>();

    cudaEvent_t timerStart;
    cudaEvent_t timerStop;
    cudaEventCreate(&timerStart);
    cudaEventCreate(&timerStop);
    cudaThreadSynchronize();

    cudaEventRecord(timerStart, 0);
    myKernel<<<q, 512>>>(gpuControl, gpuOutput);
    cudaEventRecord(timerStop, 0);
    cudaEventSynchronize(timerStop);
    cudaMemcpy(cpuOutput, gpuOutput, sizeof(int) * maxBlocks * 2, cudaMemcpyDeviceToHost);
    cudaThreadSynchronize();

    float thisTime;
    cudaEventElapsedTime(&thisTime, timerStart, timerStop);
    cudaEventDestroy(timerStart);
    cudaEventDestroy(timerStop);

    printf("Elapsed time: %f\n", thisTime);
    for (int i = 0; i < q; ++i)
        printf("%d: %d-%d\n", i, cpuOutput[i], cpuOutput[i + q]);
}
What you get in the output is the block ID, followed by the enter "time" and exit "time". This way you can learn in which order those events occurred.
On Fermi, I'm sure that a block is scheduled on an SM as soon as there is room for it, i.e. whenever an SM finishes executing one block, it will start another block if there are any left. (However, the actual order is not deterministic.)
In older versions, I don't know, but you can verify it by using the built-in clock() function.
For example, I used the following OpenCL kernel code (you can easily convert it to CUDA):
__kernel void test(__global uint* start, __global uint* end, __global float* buffer)
{
    int id = get_global_id(0);
    start[id] = clock();
    __do_something_here;
    end[id] = clock();
}
Then output it to a file and build a graph. You will see how visual it is.
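Since the answer notes the kernel is easy to convert to CUDA, here is what a minimal conversion might look like (the kernel and parameter names are kept from the OpenCL version, and the body remains a placeholder). Note that CUDA's clock() reads a per-SM cycle counter, so the values are directly comparable only between blocks that ran on the same SM, but the relative ordering of blocks is still visible:
__global__ void test(unsigned int* start, unsigned int* end, float* buffer)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    start[id] = (unsigned int)clock();  // record the cycle counter on entry
    // __do_something_here
    end[id] = (unsigned int)clock();    // and again on exit
}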

Passing CUDA Random Generator State by reference

Is the following code correct when passing the random generator state (CUDA Toolkit 3.2, curand.lib) by reference in the functions CalculateValue(curandState *localState) and GetExponential(curandState *localState)?
Thanks
__device__ double GetExponential(curandState *localState) {
    double u1 = curand_uniform_double(localState);
    return u1;
}

__device__ double CalculateValue(curandState *localState) {
    double x = GetExponential(localState);
    return x;
}

__global__ void RunMonteCarloKernel(curandState *state, double *results) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    /* Copy state to local memory for efficiency */
    curandState localState = state[threadIdx.x + blockIdx.x * blockDim.x];
    results[i] = CalculateValue(&localState);
    /* Copy state back to global memory */
    state[threadIdx.x + blockIdx.x * blockDim.x] = localState;
}

__global__ void setup_kernel(curandState *state) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    /* Each thread gets a different seed, a different sequence number, no offset */
    curand_init(i, i, 0, &state[i]);
}

int main(void) {
    double *devResults;
    curandState *devStates;
    /* Allocate space for prng states on device */
    CUDA_CALL(cudaMalloc((void **)&devStates, totalThreads * sizeof(curandState)));
    /* Setup prng states */
    setup_kernel<<<totalBlocks, threadsPerBlock>>>(devStates);
    for (int i = 0; i < 1000; i++)
    {
        RunMonteCarloKernel<<<totalBlocks, threadsPerBlock>>>(devStates, devResults);
    }
}
Is there a problem? It looks ok.
You may want to check out the EstimatePiInlineP sample which is in the MonteCarloCURAND directory of the 3.2 SDK. It uses C++ style pass by reference to avoid taking the address of a local variable. You would need to store the state back to memory at the end of the kernel (as you do in your code).
Passing by C++ reference can assist the compiler by clearly showing that the function can operate on the data directly in the original registers. Taking the address of a local array in a GPU can be detrimental to performance if the compiler cannot be certain that all threads handle the pointer identically (i.e. identical operations on the pointer), in which case it will spill the array to local memory. It'll work, but it may be slower.
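As a rough illustration of the pass-by-reference style described above (the function and variable names here are illustrative, not taken from the EstimatePiInlineP sample):
#include <curand_kernel.h>

// The state is copied into a local variable once, passed on by C++ reference
// (mirroring the style the answer describes), and written back to global
// memory at the end of the kernel.
__device__ double GetUniform(curandState &state)
{
    return curand_uniform_double(&state);
}

__global__ void MonteCarloByReference(curandState *state, double *results)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    curandState localState = state[i];    // local copy, ideally register-resident
    results[i] = GetUniform(localState);  // passed by reference rather than by pointer
    state[i] = localState;                // store the advanced state back
}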
