CUDA: measure execution time per GPU core

I am really new to CUDA programming (I just started a few weeks ago) and I have an assignment to multiply large matrices (e.g. 960x960) and measure the execution time overall and per GPU core. I looked into the CUDA samples that come with the Toolkit installation (more precisely the matrixMul project in the 0_Simple folder) and altered the sample to multiply big matrices. The sample already measures the overall execution time, but my question is how I can measure the execution time per GPU core. I am confused.
Also, less importantly, why does the kernel function in this example get called inside a for loop with 300 iterations?
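For context, the overall-time measurement in the sample works roughly like the sketch below (my own self-contained reconstruction, not the sample code itself): the kernel is launched many times between two CUDA events and the elapsed time is divided by the iteration count. That is also why the sample calls the kernel in a loop of about 300 iterations, to average out launch overhead and timing noise. The dummy kernel here just stands in for the matrix multiply.

// timing_sketch.cu -- reconstruction of the event-based timing used in the sample
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;   // stand-in for the real work
}

int main()
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int nIter = 300;                // the sample also launches ~300 times and averages
    cudaEventRecord(start);
    for (int i = 0; i < nIter; ++i)
        dummyKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float msTotal = 0.0f;
    cudaEventElapsedTime(&msTotal, start, stop);
    printf("average kernel time: %.3f ms\n", msTotal / nIter);

    cudaFree(d_data);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}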

Each CUDA device has multiple streaming multiprocessors (SMs). Each SM can have multiple warp schedulers and multiple execution units. CUDA cores are execution units, not "cores", so I will avoid that term for the rest of this discussion.
The NVIDIA profiling tools
CUDA command line profiler
nvprof command line profiler (new in CUDA 5.0)
Visual Profiler
Nsight VSE CUDA profiler
support the ability to collect the duration and PM counters for CUDA grid launches. A subset of the PM counters can be collected per SM.
I've provided the nvprof command lines for collecting the two pieces of information. Both examples run a debug build of the matrixMul sample on a GTX 480 with 15 SMs.
COLLECTING GRID EXECUTION TIME
Each of the tools listed above has a simplified mode to collect the execution duration of each kernel grid launch. The graphical tools can display this on a timeline or in a table.
nvprof --print-gpu-trace matrixMul.exe
======== NVPROF is profiling matrixMul.exe...
======== Command: matrixMul.exe
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "GeForce GTX 480" with compute capability 2.0
MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
done
Performance= 39.40 GFlop/s, Time= 3.327 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: OK
Note: For peak performance, please refer to the matrixMulCUBLAS example.
======== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput Device Context Stream Name
267.83ms 71.30us - - - - - 409.60KB 5.74GB/s 0 1 2 [CUDA memcpy HtoD]
272.72ms 139.20us - - - - - 819.20KB 5.88GB/s 0 1 2 [CUDA memcpy HtoD]
272.86ms 3.33ms (20 10 1) (32 32 1) 20 8.19KB 0B - - 0 1 2 void matrixMulCUDA<int=32>(float*, float*, float*, int, int)
277.29ms 3.33ms (20 10 1) (32 32 1) 20 8.19KB 0B - - 0 1 2 void matrixMulCUDA<int=32>(float*, float*, float*, int, int)
To collect this in the other tools:
CUDA command line profiler - specify timestamps
Visual Profiler - run generate timeline
Nsight VSE - New Analysis Activity | Trace | Enable CUDA
COLLECTING SM ACTIVITY
Your question states that you need the execution time per GPU core. This can mean per GPU (see above) or per SM. SM execution time can be collected using the SM PM counter active_cycles, which counts the number of cycles in which the SM has at least one active warp.
For each line in the output there will be 15 values (one for each SM).
nvprof --events active_cycles --aggregate-mode-off matrixMul.exe
======== NVPROF is profiling matrixMul.exe...
======== Command: matrixMul.exe
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "GeForce GTX 480" with compute capability 2.0
MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
done
Performance= 12.07 GFlop/s, Time= 10.860 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: OK
Note: For peak performance, please refer to the matrixMulCUBLAS example.
======== Profiling result:
Device Context Stream, Event Name, Kernel, Values
0 1 2, active_cycles, void matrixMulCUDA<int=32>(float*, float*, float*, int, int), 2001108 2001177 2000099 2002857 2152562 2153254 2001086 2153043 2001015 2001192 2000065 2154293 2000071 2000238 2154905
0 1 2, active_cycles, void matrixMulCUDA<int=32>(float*, float*, float*, int, int), 2155340 2002145 2155289 2002374 2003336 2002498 2001865 2155503 2156271 2156429 2002108 2002836 2002461 2002695 2002098
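If you want to turn active_cycles into an approximate per-SM time, divide by the SM clock frequency. Below is a minimal sketch, assuming the counter accumulates in the clock domain reported by cudaGetDeviceProperties (clockRate is in kHz); profiling overhead means the result will only roughly track the kernel durations in the timeline.

// cycles_to_time.cu -- convert an active_cycles value into an approximate per-SM active time
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Example value copied from the first SM of the first launch above.
    const double activeCycles = 2001108.0;

    // prop.clockRate is reported in kHz, so cycles / (kHz * 1000) gives seconds.
    double ms = activeCycles / (prop.clockRate * 1000.0) * 1e3;
    printf("SM 0 was active for roughly %.3f ms\n", ms);
    return 0;
}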

Related

Performance Analysis of Multiple Kernels (CUDA C)

I have a CUDA program with multiple kernels that run in series (in the same stream, the default one). I want to do a performance analysis of the program as a whole, specifically the GPU portion. I'm doing the analysis with metrics such as achieved_occupancy, inst_per_warp, gld_efficiency and so on, using the nvprof tool.
But the profiler gives metric values separately for each kernel, while I want to compute them for all kernels together to see the total usage of the GPU by the program.
Should I take the average, the largest value, or the total over all kernels for each metric?
One possible approach would be to use a weighted average method.
Suppose we had 3 non-overlapping kernels in our timeline. Let's say kernel 1 runs for 10 milliseconds, kernel 2 runs for 20 milliseconds, and kernel 3 runs for 30 milliseconds. Collectively, all 3 kernels occupy 60 milliseconds of our overall application timeline.
Let's also suppose that the profiler reports the gld_efficiency metric as follows:
kernel duration gld_efficiency
1 10ms 88%
2 20ms 76%
3 30ms 50%
You could compute the weighted average as follows:
"overall" global load efficiency = (88*10 + 76*20 + 50*30) / 60 = 65%
I'm sure there are other approaches that make sense as well. For example, a better approach might be to have the profiler report the total number of global load transactions for each kernel, and weight by that rather than by kernel duration:
kernel gld_transactions gld_efficiency
1 1000 88%
2 2000 76%
3 3000 50%
"overall" global load efficiency = (88*1000 + 76*2000 + 50*3000) / 6000 = 65%
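If it helps, here is a tiny sketch of the duration-weighted calculation above (the numbers are just the ones from the example table); the transaction-weighted variant is identical except for the weights:

// weighted_metric.cpp -- duration-weighted average of a per-kernel metric
#include <cstdio>

int main()
{
    // Durations (ms) and gld_efficiency (%) of the three example kernels.
    // Use gld_transactions as the weights instead for the second variant.
    const double weight[3]     = {10.0, 20.0, 30.0};
    const double efficiency[3] = {88.0, 76.0, 50.0};

    double totalWeight = 0.0, weightedSum = 0.0;
    for (int i = 0; i < 3; ++i) {
        totalWeight += weight[i];
        weightedSum += efficiency[i] * weight[i];
    }
    // (88*10 + 76*20 + 50*30) / 60 = 65
    printf("overall gld_efficiency = %.1f%%\n", weightedSum / totalWeight);
    return 0;
}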

FFT performance figures

I am wondering what performance one can achieve nowadays when computing 2D FFTs. Just an order of magnitude, for 1K x 1K or 2K x 2K images.
Links or personal experience are welcome.
I reran a simple test for reference:
FFTW library 3.3.5 (2016). I used precompiled DLLs; they exploit SSE, but I am not sure about AVX.
Windows 7 32-bit, Intel i5-4670 (Haswell, 4 cores).
Single precision, real-to-complex, out-of-place 2D transform (using fftwf_plan_dft_r2c_2d).
1024 x 1024:
Single thread: 5 ms per iteration
Two threads: 3.8 ms per iteration
Four threads: 2.4 ms per iteration
2048 x 2048:
Single thread: 28 ms per iteration
Two threads: 16 ms per iteration
Four threads: 12 ms per iteration
Double precision, real-to-complex, out-of-place 2D transform (using fftw_plan_dft_r2c_2d).
1024 x 1024:
Single thread: 7 ms per iteration
Four threads: 3 ms per iteration
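For reference, the single-precision measurements above came from a loop roughly like the sketch below. This is my own reconstruction, not the original benchmark: the size, iteration count and timing code are mine, and the multi-threaded runs would additionally call fftwf_init_threads/fftwf_plan_with_nthreads (which requires linking the FFTW threads library).

// fftw_bench.c -- rough single-precision 2D r2c benchmark sketch
// build e.g.: gcc fftw_bench.c -lfftw3f -lm
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <fftw3.h>

int main(void)
{
    const int N = 1024, ITER = 100;

    float *in = fftwf_alloc_real((size_t)N * N);
    fftwf_complex *out = fftwf_alloc_complex((size_t)N * (N / 2 + 1));

    // For the multi-threaded timings, something like this would be added
    // before planning (requires linking against fftw3f_threads):
    //   fftwf_init_threads();
    //   fftwf_plan_with_nthreads(4);

    // FFTW_MEASURE may overwrite the input during planning, so plan first.
    fftwf_plan plan = fftwf_plan_dft_r2c_2d(N, N, in, out, FFTW_MEASURE);

    for (size_t i = 0; i < (size_t)N * N; ++i)
        in[i] = (float)rand() / RAND_MAX;

    clock_t t0 = clock();
    for (int i = 0; i < ITER; ++i)
        fftwf_execute(plan);
    double ms = 1000.0 * (clock() - t0) / CLOCKS_PER_SEC / ITER;
    printf("%dx%d r2c: %.2f ms per iteration\n", N, N, ms);

    fftwf_destroy_plan(plan);
    fftwf_free(in);
    fftwf_free(out);
    return 0;
}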

Unenhanced performance of matlab GPU computing

With the intention of comparing the speed of GPU vs. CPU computing, I ran the example code available here (a Mandelbrot set on the GPU) from MATLAB Central. Below are the results that I obtained:
Case 1 (without GPU): 6.2 secs
Case 2 (using parallel.gpu.GPUArray): 6.518 secs (1.39 secs in the example)
Case 3 (Using Element-wise Operation): 1.259 secs (0.14 secs in the example)
As can be seen, there is no improvement in case 2 and only a modest improvement of around 4 times in case 3. As the example did not state which GPU was used, may I know if this is simply due to the "incompetency" of my graphics card or am I missing something important?
The graphics card is also responsible for driving my display (HP Z Display Z23i 23-inch IPS LED Backlit Monitor).
CPU: Intel i7-4790, 3.6 GHz (8 cores)
GPU:
Name: 'NVS 510'
Index: 1
ComputeCapability: '3.0'
SupportsDouble: 1
DriverVersion: 6
ToolkitVersion: 5
MaxThreadsPerBlock: 1024
MaxShmemPerBlock: 49152
MaxThreadBlockSize: [1024 1024 64]
MaxGridSize: [2.1475e+09 65535 65535]
SIMDWidth: 32
TotalMemory: 2.1475e+09
FreeMemory: 1.6934e+09
MultiprocessorCount: 1
ClockRateKHz: 797000
ComputeMode: 'Default'
GPUOverlapsTransfers: 1
KernelExecutionTimeout: 1
CanMapHostMemory: 1
DeviceSupported: 1
DeviceSelected: 1
Thank you!
Edit
The GPU used in the example here is a Tesla C2050. (Credits to @Sam Roberts)
The times at that link are most likely for a different GPU than yours. They don't specify what kind of graphics card they're using, but my guess is that it's a higher-end card.
Googling the NVS 510 shows that its specs are similar to the card in my machine. However, your card is geared towards business while mine is geared towards gaming: I have a GTX 660, which is one of the higher-end GPUs available on the market.
These are the attributes of my graphics card:
CUDADevice with properties:
Name: 'GeForce GTX 660'
Index: 1
ComputeCapability: '3.0'
SupportsDouble: 1
DriverVersion: 6.5000
ToolkitVersion: 5.5000
MaxThreadsPerBlock: 1024
MaxShmemPerBlock: 49152
MaxThreadBlockSize: [1024 1024 64]
MaxGridSize: [2.1475e+09 65535 65535]
SIMDWidth: 32
TotalMemory: 2.1475e+09
FreeMemory: 1.5357e+09
MultiprocessorCount: 5
ClockRateKHz: 1084500
ComputeMode: 'Default'
GPUOverlapsTransfers: 1
KernelExecutionTimeout: 1
CanMapHostMemory: 1
DeviceSupported: 1
DeviceSelected: 1
The differences between my card and yours are that I have 5 multiprocessors and my clock rate is about 300 MHz faster than yours. For a side-by-side comparison, check out my card versus yours:
NVS 510: http://www.nvidia.ca/object/nvs-510-graphics-card.html#pdpContent=2
GTX 660: http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-660/specifications
Upon further inspection, my card has much higher memory bandwidth than yours. I also have 960 GPU cores compared to your 192.
I decided to run those examples to compare my performance with your timings. My CPU is an Intel i7-4770 at 3.6 GHz and I have 16 GB of RAM on my machine.
The times that I get by running those examples are the following:
Case #1 - Without GPU: 6.46 seconds
Case #2 - Naive GPU: 0.82 seconds - 7.9x faster
Case #3 - Through CUDA: 0.09 seconds - 71.7x faster
With this, my guess is that your graphics card is lower-end than the one used in the tests MathWorks performed. Maybe try updating your graphics drivers and see if that helps. However, my guess is that my performance is much better due to the higher multiprocessor count, the faster clock, the higher number of cores, and the higher memory bandwidth.

CUDA 5.5 samples compile fine on OS X 10.9 but error out immediately when run

This is on a MacBookPro7,1 with a GeForce 320M (compute capability 1.2). Previously, with OS X 10.7.8, Xcode 4.x and CUDA 5.0, CUDA code compiled and ran fine.
Then I updated to OS X 10.9.2, Xcode 5.1 and CUDA 5.5. At first, deviceQuery failed. I read elsewhere that 5.5.28 (the driver CUDA 5.5 shipped with) did not support compute capability 1.x (sm_10), but that 5.5.43 did. After updating the CUDA driver to the even more recent 5.5.47 (GPU driver version 8.24.11 310.90.9b01), deviceQuery indeed passes with the following output.
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce 320M"
CUDA Driver Version / Runtime Version 5.5 / 5.5
CUDA Capability Major/Minor version number: 1.2
Total amount of global memory: 253 MBytes (265027584 bytes)
( 6) Multiprocessors, ( 8) CUDA Cores/MP: 48 CUDA Cores
GPU Clock rate: 950 MHz (0.95 GHz)
Memory Clock rate: 1064 Mhz
Memory Bus Width: 128-bit
Maximum Texture Dimension Size (x,y,z) 1D=(8192), 2D=(65536, 32768), 3D=(2048, 2048, 2048)
Maximum Layered 1D Texture Size, (num) layers 1D=(8192), 512 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(8192, 8192), 512 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per multiprocessor: 1024
Maximum number of threads per block: 512
Max dimension size of a thread block (x,y,z): (512, 512, 64)
Max dimension size of a grid size (x,y,z): (65535, 65535, 1)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: Yes
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 0 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.5, CUDA Runtime Version = 5.5, NumDevs = 1, Device0 = GeForce 320M
Result = PASS
Furthermore, I can successfully compile the CUDA 5.5 samples without modification, though I have not tried to compile all of them.
However, samples such as matrixMul, simpleCUFFT, and simpleCUBLAS all fail immediately when run.
$ ./matrixMul
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "GeForce 320M" with compute capability 1.2
MatrixA(160,160), MatrixB(320,160)
cudaMalloc d_A returned error code 2, line(164)
$ ./simpleCUFFT
[simpleCUFFT] is starting...
GPU Device 0: "GeForce 320M" with compute capability 1.2
CUDA error at simpleCUFFT.cu:105 code=2(cudaErrorMemoryAllocation) "cudaMalloc((void **)&d_signal, mem_size)"
Error Code 2 is cudaErrorMemoryAllocation, but I suspect it hides a failed CUDA initialization somehow.
$ ./simpleCUBLAS
GPU Device 0: "GeForce 320M" with compute capability 1.2
simpleCUBLAS test running..
!!!! CUBLAS initialization error
The actual error code is CUBLAS_STATUS_NOT_INITIALIZED, returned from the call to cublasCreate().
Has anyone run into this before and found a fix? Thanks in advance.
I would guess you are running out of memory. Your GPU is being used by the display manager, and it has only 256 MB of RAM. The combined memory footprint of the OS X 10.9 display manager and the CUDA 5.5 runtime might be leaving you with almost no free memory. I would recommend writing and running a small test program like this:
#include <iostream>
#include <cuda_runtime.h>

int main(void)
{
    size_t mfree, mtotal;

    // Establishing the context is what consumes memory, so select the device
    // first and then query how much is left.
    cudaSetDevice(0);
    cudaMemGetInfo(&mfree, &mtotal);
    std::cout << mfree << " bytes of " << mtotal << " available." << std::endl;

    return cudaDeviceReset();
}
[disclaimer: written in browser, never compiled or tested, use at your own risk]
That should give you a picture of the available free memory after context establishment on the device. You might be surprised at how little there is to work with.
EDIT: Here is an even lighter-weight alternative test which doesn't even attempt to establish a context on the device. Instead, it only uses the driver API to check the device. If this succeeds, then either the runtime API shipped for OS X is broken somehow, or you have no memory available on the device for establishing a context. If it fails, then you truly have a broken CUDA installation. Either way, I would consider opening a bug report with NVIDIA:
#include <iostream>
#include <cuda.h>

int main(void)
{
    CUdevice d;
    size_t b;

    // Driver API only: initialise the driver and query the device's total
    // memory without ever creating a context on it.
    cuInit(0);
    cuDeviceGet(&d, 0);
    cuDeviceTotalMem(&b, d);
    std::cout << "Total memory = " << b << std::endl;
    return 0;
}
Note that you will need to explicitly link the CUDA driver library to get this to work (pass -lcuda to nvcc, for example).

optimal number of CUDA parallel blocks

Can there be any performance advantage to launching a grid of blocks simultaneously over launching blocks one at a time if the number of threads in each block is already larger than the number of CUDA cores?
I think there is: a thread block is assigned to a streaming multiprocessor (SM), and the SM further divides the threads of each block into warps of 32 threads that are scheduled to execute (more or less) sequentially. Considering this, it will be faster to break each computation into blocks so that they occupy as many SMs as possible. It is also meaningful to build blocks whose size is a multiple of the warp size (a block of 32 or 64 threads rather than 40 threads, given that SMs use 32-thread warps).
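A simple way to check this empirically is to time one launch of many blocks against many launches of a single block. The sketch below is my own example (the kernel and sizes are arbitrary), not taken from any particular source:

// grid_vs_single_block.cu -- compare one launch of many blocks with many single-block launches
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *data, int offset)
{
    data[offset + blockIdx.x * blockDim.x + threadIdx.x] *= 2.0f;
}

int main()
{
    const int threads = 256, nBlocks = 240;   // enough blocks to cover many SMs
    float *d;
    cudaMalloc(&d, threads * nBlocks * sizeof(float));

    // Warm-up launch so context creation does not skew the first timing.
    scale<<<1, threads>>>(d, 0);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float msGrid, msSerial;

    // One launch: the hardware distributes all blocks across the SMs at once.
    cudaEventRecord(start);
    scale<<<nBlocks, threads>>>(d, 0);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&msGrid, start, stop);

    // Many launches of one block each: pays launch latency per block and
    // keeps at most one SM busy at a time.
    cudaEventRecord(start);
    for (int b = 0; b < nBlocks; ++b)
        scale<<<1, threads>>>(d, b * threads);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&msSerial, start, stop);

    printf("one grid: %.3f ms, one block at a time: %.3f ms\n", msGrid, msSerial);
    cudaFree(d);
    return 0;
}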
Launch Latency
Launch latency (from the API call until the work starts on the GPU) for a grid is 3-8 µs on Linux and 30-80 µs on Windows Vista/Win7.
Distributing a block to an SM takes on the order of 10-100 ns.
Launching a warp in a block (32 threads) takes a few cycles and happens in parallel on each SM.
Resource Limitations
Concurrent Kernels
- Tesla: N/A, only 1 grid at a time
- Fermi: 16 grids at a time
- Kepler: 16 grids (Kepler2: 32 grids)
Maximum Blocks (not considering occupancy limitations)
- Tesla: SmCount * 8 (GTX 280 = 30 * 8 = 240)
- Fermi: SmCount * 16 (GF100 = 16 * 16 = 256)
- Kepler: SmCount * 16 (GK104 = 8 * 16 = 128)
See the occupancy calculator for limitations on threads per block, threads per SM, registers per SM, registers per thread, etc.
Warps Scheduling and CUDA Cores
CUDA cores are floating-point/ALU units. Each SM has other types of execution units, including load/store, special function, branch, etc. A CUDA core is equivalent to a SIMD unit in an x86 processor; it is not equivalent to an x86 core.
Occupancy is the ratio of active warps per SM to the maximum number of warps per SM. The more warps per SM, the higher the chance that the warp scheduler has an eligible warp to issue. However, the higher the occupancy, the fewer resources are available per thread. As a basic goal you want to target more than
25% or 8 warps on Tesla
50% or 24 warps on Fermi
50% or 32 warps on Kepler (generally higher)
You'll notice there is no real relationship to CUDA cores in these calculations.
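As an aside (and not something the tools above rely on), newer CUDA toolkits (6.5 and later) let you query the achievable occupancy for a given block size programmatically. A minimal sketch, with an arbitrary example kernel:

// occupancy_sketch.cu -- query achievable occupancy for a block size (CUDA 6.5+)
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *data)
{
    data[blockIdx.x * blockDim.x + threadIdx.x] += 1.0f;
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    const int blockSize = 256;
    int maxActiveBlocks = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxActiveBlocks, myKernel,
                                                  blockSize, 0 /* dynamic smem */);

    int warpsPerSM    = maxActiveBlocks * blockSize / prop.warpSize;
    int maxWarpsPerSM = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("occupancy: %d/%d warps per SM (%.0f%%)\n",
           warpsPerSM, maxWarpsPerSM, 100.0 * warpsPerSM / maxWarpsPerSM);
    return 0;
}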
To understand this better, read the Fermi whitepaper and, if you can use the Nsight Visual Studio Edition CUDA profiler, look at the Issue Efficiency experiment (not yet available in the CUDA Profiler or Visual Profiler) to understand how well your kernel is hiding execution and memory latency.
