Why are OpenCL enqueue commands so time-consuming?

I am trying to do live audio processing using the Intel HD Graphics GPU. In theory it should be perfect for this, but I am surprised at the cost of the enqueue commands. They look like a prohibitive factor, and are by far the most time-consuming step.
In short, calling the enqueueXXXXX commands takes a long time, while actually copying the data and executing the kernel is sufficiently fast. Is this an inherent problem with the OpenCL implementation, or am I doing something wrong?
Data copying + kernel execution takes about 10us
Calling the enqueue commands takes about 300us - 500us
The code is available at https://github.com/tblum/opencl_enqueue/blob/master/main.cpp
for (int i = 0; i < 10; ++i) {
    cl::Event copyToEvent;
    cl::Event copyFromEvent;
    cl::Event kernelEvent;
    auto t1 = Clock::now();
    commandQueue.enqueueWriteBuffer(clIn, CL_FALSE, 0, 10 * 48 * sizeof(float), frameBufferIn, nullptr, &copyToEvent);
    OCLdownMix.setArg(0,clIn);
    OCLdownMix.setArg(1,clOut);
    OCLdownMix.setArg(2,(unsigned int)480);
    commandQueue.enqueueNDRangeKernel(OCLdownMix, cl::NullRange, cl::NDRange(480), cl::NDRange(48), nullptr, &kernelEvent);
    commandQueue.enqueueReadBuffer(clOut, CL_FALSE, 0, 10 * 48 * sizeof(float), clResult, nullptr, &copyFromEvent);
    auto t2 = Clock::now();
    commandQueue.finish();
    auto t3 = Clock::now();
    cl_ulong copyToTime = copyToEvent.getProfilingInfo<CL_PROFILING_COMMAND_END>() -
                          copyToEvent.getProfilingInfo<CL_PROFILING_COMMAND_START>();
    cl_ulong kernelTime = kernelEvent.getProfilingInfo<CL_PROFILING_COMMAND_END>() -
                          kernelEvent.getProfilingInfo<CL_PROFILING_COMMAND_START>();
    cl_ulong copyFromTime = copyFromEvent.getProfilingInfo<CL_PROFILING_COMMAND_END>() -
                            copyFromEvent.getProfilingInfo<CL_PROFILING_COMMAND_START>();
    std::cout << "Enqueue: " << t2 - t1 << ", Total: " << t3 - t1 << ", GPU: " << (copyToTime+kernelTime+copyFromTime) / 1000.0 << "us"<< std::endl;
}
Output:
Enqueue: 1804us, Total: 4322us, GPU: 10.832us
Enqueue: 485us, Total: 668us, GPU: 10.666us
Enqueue: 237us, Total: 419us, GPU: 10.499us
Enqueue: 282us, Total: 474us, GPU: 10.832us
Enqueue: 345us, Total: 531us, GPU: 10.082us
Enqueue: 359us, Total: 555us, GPU: 10.915us
Enqueue: 345us, Total: 524us, GPU: 10.082us
Enqueue: 327us, Total: 504us, GPU: 10.416us
Enqueue: 363us, Total: 540us, GPU: 10.333us
Enqueue: 442us, Total: 595us, GPU: 10.916us
I found this related question: "How to reduce OpenCL enqueue time/any other ideas?"
But it has no useful answers for my situation.
Any help or ideas would be appreciated.
Thanks
BR Troels

Related

Inconsistent delay of ReadFile on a serial port device

I need to read data from a serial port device (which sends data every second) on Windows in REALTIME (<= 5 ms). But the time taken by ReadFile is unpredictable, which is driving me crazy. Part of the code can be found at:
https://gist.github.com/morris-stock/62b1674b4cda0e9df84d4738e54773f8
the delay is dumped at https://gist.github.com/morris-stock/62b1674b4cda0e9df84d4738e54773f8#file-serialport_win-cc-L283
Poco::Timestamp now;
if (!ReadFile(_handle, buffer, size, &bytesRead, NULL))
throw Poco::IOException("failed to read from serial port");
Poco::Timestamp::TimeDiff elapsed = now.elapsed();
std::cout << Poco::DateTimeFormatter::format(now, "%Y-%m-%d %H:%M:%S.%i")
<< ", elapsed: " << elapsed << ", data len: " << bytesRead << std::endl << std::flush;
Sometimes ReadFile takes about 3000 us to return (which is OK, and is affected by COMMTIMEOUTS), but sometimes it takes 15600 us (NOT affected by COMMTIMEOUTS).
Please let me know if there is anything I can do to make the problem clear.
P.S.
COMMTIMEOUTS:
COMMTIMEOUTS cto;
cto.ReadIntervalTimeout = 1;
cto.ReadTotalTimeoutConstant = 1;
cto.ReadTotalTimeoutMultiplier = 0;
cto.WriteTotalTimeoutConstant = MAXDWORD;
cto.WriteTotalTimeoutMultiplier = 0;
The main reading thread:
https://gist.github.com/morris-stock/62b1674b4cda0e9df84d4738e54773f8#file-serialdevice-cc-L31
Device data type:
Baud rate: 9600. The device sends about 400 bytes per second (continuously, then no data for the rest of the second).
Console output:
wPacketLength: 64
wPacketVersion: 2
dwServiceMask: 1
dwReserved1: 0
dwMaxTxQueue: 0
dwMaxRxQueue: 0
dwMaxBaud: 268435456
dwProvSubType: 1
dwProvCapabilities: 255
dwSettableParams: 127
dwSettableBaud: 268959743
wSettableData: 15
wSettableStopParity: 7943
dwCurrentTxQueue: 0
dwCurrentRxQueue: 68824
dwProvSpec1: 0
dwProvSpec2: 1128813859
wcProvChar: 0039F16C
2018-01-22 03:35:52.658, elapsed: 15600, data len: 0
2018-01-22 03:35:52.673, elapsed: 15600, data len: 0
2018-01-22 03:35:52.689, elapsed: 15600, data len: 0
2018-01-22 03:35:52.704, elapsed: 15600, data len: 0
2018-01-22 03:35:52.720, elapsed: 15600, data len: 0
2018-01-22 03:35:52.736, elapsed: 15600, data len: 0
2018-01-22 03:35:52.751, elapsed: 15600, data len: 0
In my case, it's the Windows system clock resolution that matters.
ClockRes gives me:
C:\work\utils\ClockRes>Clockres.exe
Clockres v2.1 - Clock resolution display utility
Copyright (C) 2016 Mark Russinovich
Sysinternals
Maximum timer interval: 15.600 ms
Minimum timer interval: 0.500 ms
Current timer interval: 1.000 ms
and
C:\work\utils\ClockRes>Clockres.exe
Clockres v2.1 - Clock resolution display utility
Copyright (C) 2016 Mark Russinovich
Sysinternals
Maximum timer interval: 15.600 ms
Minimum timer interval: 0.500 ms
Current timer interval: 15.600 ms
By calling timeBeginPeriod(1) when my app starts, I get more consistent results.
Thanks everyone for your kind help.
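To make the effect of the timer interval concrete, here is a minimal sketch that is not taken from the original code. It assumes a simplified model in which a timed wait can only complete on a timer tick, so the observed delay is rounded up to a whole number of ticks. The names quantized_wait_us and TimerResolutionGuard are hypothetical. The Windows-only part wraps timeBeginPeriod/timeEndPeriod in a small RAII guard so that the resolution request is released again when the guard is destroyed.

```cpp
#include <cmath>

#ifdef _WIN32
#include <windows.h>
#include <mmsystem.h>  // timeBeginPeriod / timeEndPeriod; link with winmm.lib
#endif

// Simplified model (an assumption, not measured behaviour): a timed wait
// completes on the first timer tick at or after the requested duration.
inline double quantized_wait_us(double requested_us, double tick_us) {
    return std::ceil(requested_us / tick_us) * tick_us;
}

#ifdef _WIN32
// Hypothetical helper: requests a finer timer interval for its lifetime.
class TimerResolutionGuard {
public:
    explicit TimerResolutionGuard(UINT period_ms)
        : period_ms_(period_ms),
          active_(timeBeginPeriod(period_ms) == TIMERR_NOERROR) {}

    ~TimerResolutionGuard() {
        // Every successful timeBeginPeriod must be matched by timeEndPeriod.
        if (active_) timeEndPeriod(period_ms_);
    }

    TimerResolutionGuard(const TimerResolutionGuard&) = delete;
    TimerResolutionGuard& operator=(const TimerResolutionGuard&) = delete;

    bool active() const { return active_; }

private:
    UINT period_ms_;
    bool active_;
};
#endif
```

Under this model, a 3000 us wait is stretched to quantized_wait_us(3000, 15600) = 15600 us when the current timer interval is 15.6 ms, and stays at 3000 us when the interval is 1 ms, which is consistent with the two ClockRes readings above. On Windows, constructing TimerResolutionGuard guard(1); at the start of main would play the role of the timeBeginPeriod(1) call.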

Formulas in perf stat

I am wondering about the formulas used in perf stat to calculate figures from the raw data.
perf stat -e task-clock,cycles,instructions,cache-references,cache-misses ./myapp
1080267.226401 task-clock (msec) # 19.062 CPUs utilized
1,592,123,216,789 cycles # 1.474 GHz (50.00%)
871,190,006,655 instructions # 0.55 insn per cycle (75.00%)
3,697,548,810 cache-references # 3.423 M/sec (75.00%)
459,457,321 cache-misses # 12.426 % of all cache refs (75.00%)
In this context, how do you calculate M/sec from cache-references?
The formulas do not seem to be implemented in builtin-stat.c (where the default event sets for perf stat are defined); they are probably calculated (and averaged with stddev) in perf_stat__print_shadow_stats(), with some stats collected into arrays in perf_stat__update_shadow_stats():
http://elixir.free-electrons.com/linux/v4.13.4/source/tools/perf/util/stat-shadow.c#L626
When HW_INSTRUCTIONS is counted:
"Instructions per clock" = HW_INSTRUCTIONS / HW_CPU_CYCLES; "stalled cycles per instruction" = HW_STALLED_CYCLES_FRONTEND / HW_INSTRUCTIONS
if (perf_evsel__match(evsel, HARDWARE, HW_INSTRUCTIONS)) {
total = avg_stats(&runtime_cycles_stats[ctx][cpu]);
if (total) {
ratio = avg / total;
print_metric(ctxp, NULL, "%7.2f ",
"insn per cycle", ratio);
} else {
print_metric(ctxp, NULL, NULL, "insn per cycle", 0);
}
Branch misses are from print_branch_misses as HW_BRANCH_MISSES / HW_BRANCH_INSTRUCTIONS
There are several cache miss ratio calculations in perf_stat__print_shadow_stats() too like HW_CACHE_MISSES / HW_CACHE_REFERENCES and some more detailed (perf stat -d mode).
Stalled percents are computed as HW_STALLED_CYCLES_FRONTEND / HW_CPU_CYCLES and HW_STALLED_CYCLES_BACKEND / HW_CPU_CYCLES
GHz is computed as HW_CPU_CYCLES / runtime_nsecs_stats, where runtime_nsecs_stats is updated from either of the software events task-clock or cpu-clock (SW_TASK_CLOCK and SW_CPU_CLOCK; the exact difference between the two is still unclear, having been asked about on LKML in 2010 and on SO in 2014):
if (perf_evsel__match(counter, SOFTWARE, SW_TASK_CLOCK) ||
perf_evsel__match(counter, SOFTWARE, SW_CPU_CLOCK))
update_stats(&runtime_nsecs_stats[cpu], count[0]);
There are also several formulas for transactions (perf stat -T mode).
"CPUs utilized" is task-clock (or cpu-clock) divided by walltime_nsecs_stats, where the wall time is measured by perf stat itself in userspace, by reading CLOCK_MONOTONIC before and after the command runs:
static inline unsigned long long rdclock(void)
{
struct timespec ts;
clock_gettime(CLOCK_MONOTONIC, &ts);
return ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}
...
static int __run_perf_stat(int argc, const char **argv)
{
...
/*
* Enable counters and exec the command:
*/
t0 = rdclock();
clock_gettime(CLOCK_MONOTONIC, &ref_time);
if (forks) {
....
}
t1 = rdclock();
update_stats(&walltime_nsecs_stats, t1 - t0);
There are also some estimates from the Top-Down methodology ("Tuning Applications Using a Top-down Microarchitecture Analysis Method"; "Software Optimizations Become Simple with Top-Down Analysis .. Name Skylake", IDF2015; #22 in Gregg's Methodology List), described in 2016 by Andi Kleen in https://lwn.net/Articles/688335/ "Add top down metrics to perf stat" (perf stat --topdown -I 1000 cmd mode).
And finally, if there is no specific formula for the event being printed, there is a generic "%c/sec" (K/sec or M/sec) metric: http://elixir.free-electrons.com/linux/v4.13.4/source/tools/perf/util/stat-shadow.c#L845 It is the event count divided by the runtime in nanoseconds (taken from the task-clock or cpu-clock events, if they were present in the perf stat event set):
} else if (runtime_nsecs_stats[cpu].n != 0) {
    char unit = 'M';
    char unit_buf[10];
    total = avg_stats(&runtime_nsecs_stats[cpu]);
    if (total)
        ratio = 1000.0 * avg / total;
    if (ratio < 0.001) {
        ratio *= 1000;
        unit = 'K';
    }
    snprintf(unit_buf, sizeof(unit_buf), "%c/sec", unit);
    print_metric(ctxp, NULL, "%8.3f", unit_buf, ratio);
}

Minimize cudaDeviceSynchronize launch overhead

I'm currently doing a project with CUDA where a pipeline is refreshed with 200-10000 new events every 1 ms. Each time, I want to call one or two kernels which compute a small list of outputs, then feed those outputs to the next element of the pipeline.
The theoretical flow is:
receive data in an std::vector
cudaMemcpy the vector to GPU
processing
generate small list of outputs
cudaMemcpy to the output std::vector
But when I call cudaDeviceSynchronize after a 1-block/1-thread empty kernel with no processing, it already takes 0.7 to 1.4 ms on average, which is more than my 1 ms timeframe.
I could change the timeframe of the pipeline to receive events every 5 ms, with 5x more events each time, but that wouldn't be ideal.
What would be the best way to minimize the overhead of cudaDeviceSynchronize? Could streams help in this situation? Or is there another way to run the pipeline efficiently?
(Jetson TK1, compute capabilities 3.2)
Here's a nvprof log of the applications:
==8285== NVPROF is profiling process 8285, command: python player.py test.rec
==8285== Profiling application: python player.py test.rec
==8285== Profiling result:
Time(%) Time Calls Avg Min Max Name
94.92% 47.697ms 5005 9.5290us 1.7500us 13.083us reset_timesurface(__int64, __int64*, __int64*, __int64*, __int64*, float*, float*, bool*, bool*, Event*)
5.08% 2.5538ms 8 319.23us 99.750us 413.42us [CUDA memset]
==8285== API calls:
Time(%) Time Calls Avg Min Max Name
75.00% 5.03966s 5005 1.0069ms 25.083us 11.143ms cudaDeviceSynchronize
17.44% 1.17181s 5005 234.13us 83.750us 3.1391ms cudaLaunch
4.71% 316.62ms 9 35.180ms 23.083us 314.99ms cudaMalloc
2.30% 154.31ms 50050 3.0830us 1.0000us 2.6866ms cudaSetupArgument
0.52% 34.857ms 5005 6.9640us 2.5000us 464.67us cudaConfigureCall
0.02% 1.2048ms 8 150.60us 71.917us 183.33us cudaMemset
0.01% 643.25us 83 7.7490us 1.3330us 287.42us cuDeviceGetAttribute
0.00% 12.916us 2 6.4580us 2.0000us 10.916us cuDeviceGetCount
0.00% 5.3330us 1 5.3330us 5.3330us 5.3330us cuDeviceTotalMem
0.00% 4.0830us 1 4.0830us 4.0830us 4.0830us cuDeviceGetName
0.00% 3.4160us 2 1.7080us 1.5830us 1.8330us cuDeviceGet
A small reproduction of the program (nvprof log at the end). For some reason, the average time of cudaDeviceSynchronize is 4 times lower here, but it's still really high for an empty 1-thread kernel:
/* Compile with `nvcc test.cu -I.`
* with -I pointing to "helper_cuda.h" and "helper_string.h" from CUDA samples
**/
#include <iostream>
#include <cuda.h>
#include <helper_cuda.h>
#define MAX_INPUT_BUFFER_SIZE 131072
typedef struct {
unsigned short x;
unsigned short y;
short a;
long long b;
} Event;
long long *d_a_[2], *d_b_[2];
float *d_as_, *d_bs_;
bool *d_some_bool_[2];
Event *d_data_;
int width_ = 320;
int height_ = 240;
__global__ void reset_timesurface(long long ts,
long long *d_a_0, long long *d_a_1,
long long *d_b_0, long long *d_b_1,
float *d_as, float *d_bs,
bool *d_some_bool_0, bool *d_some_bool_1, Event *d_data) {
// nothing here
}
void reset_errors(long long ts) {
static const int n = 1024;
static const dim3 grid_size(width_ * height_ / n
+ (width_ * height_ % n != 0), 1, 1);
static const dim3 block_dim(n, 1, 1);
reset_timesurface<<<1, 1>>>(ts, d_a_[0], d_a_[1],
d_b_[0], d_b_[1],
d_as_, d_bs_,
d_some_bool_[0], d_some_bool_[1], d_data_);
cudaDeviceSynchronize();
// static long long *h_holder = (long long*)malloc(sizeof(long long) * 2000);
// cudaMemcpy(h_holder, d_a_[0], 0, cudaMemcpyDeviceToHost);
}
int main(void) {
checkCudaErrors(cudaMalloc(&(d_a_[0]), sizeof(long long)*width_*height_*2));
checkCudaErrors(cudaMemset(d_a_[0], 0, sizeof(long long)*width_*height_*2));
checkCudaErrors(cudaMalloc(&(d_a_[1]), sizeof(long long)*width_*height_*2));
checkCudaErrors(cudaMemset(d_a_[1], 0, sizeof(long long)*width_*height_*2));
checkCudaErrors(cudaMalloc(&(d_b_[0]), sizeof(long long)*width_*height_*2));
checkCudaErrors(cudaMemset(d_b_[0], 0, sizeof(long long)*width_*height_*2));
checkCudaErrors(cudaMalloc(&(d_b_[1]), sizeof(long long)*width_*height_*2));
checkCudaErrors(cudaMemset(d_b_[1], 0, sizeof(long long)*width_*height_*2));
checkCudaErrors(cudaMalloc(&d_as_, sizeof(float)*width_*height_*2));
checkCudaErrors(cudaMemset(d_as_, 0, sizeof(float)*width_*height_*2));
checkCudaErrors(cudaMalloc(&d_bs_, sizeof(float)*width_*height_*2));
checkCudaErrors(cudaMemset(d_bs_, 0, sizeof(float)*width_*height_*2));
checkCudaErrors(cudaMalloc(&(d_some_bool_[0]), sizeof(bool)*width_*height_*2));
checkCudaErrors(cudaMemset(d_some_bool_[0], 0, sizeof(bool)*width_*height_*2));
checkCudaErrors(cudaMalloc(&(d_some_bool_[1]), sizeof(bool)*width_*height_*2));
checkCudaErrors(cudaMemset(d_some_bool_[1], 0, sizeof(bool)*width_*height_*2));
checkCudaErrors(cudaMalloc(&d_data_, sizeof(Event)*MAX_INPUT_BUFFER_SIZE));
for (int i = 0; i < 5005; ++i)
reset_errors(16487L);
cudaFree(d_a_[0]);
cudaFree(d_a_[1]);
cudaFree(d_b_[0]);
cudaFree(d_b_[1]);
cudaFree(d_as_);
cudaFree(d_bs_);
cudaFree(d_some_bool_[0]);
cudaFree(d_some_bool_[1]);
cudaFree(d_data_);
cudaDeviceReset();
}
/* nvprof ./a.out
==9258== NVPROF is profiling process 9258, command: ./a.out
==9258== Profiling application: ./a.out
==9258== Profiling result:
Time(%) Time Calls Avg Min Max Name
92.64% 48.161ms 5005 9.6220us 6.4160us 13.250us reset_timesurface(__int64, __int64*, __int64*, __int64*, __int64*, float*, float*, bool*, bool*, Event*)
7.36% 3.8239ms 8 477.99us 148.92us 620.17us [CUDA memset]
==9258== API calls:
Time(%) Time Calls Avg Min Max Name
53.12% 1.22036s 5005 243.83us 9.6670us 8.5762ms cudaDeviceSynchronize
25.10% 576.78ms 5005 115.24us 44.250us 11.888ms cudaLaunch
9.13% 209.77ms 9 23.308ms 16.667us 208.54ms cudaMalloc
6.56% 150.65ms 1 150.65ms 150.65ms 150.65ms cudaDeviceReset
5.33% 122.39ms 50050 2.4450us 833ns 6.1167ms cudaSetupArgument
0.60% 13.808ms 5005 2.7580us 1.0830us 104.25us cudaConfigureCall
0.10% 2.3845ms 9 264.94us 22.333us 537.75us cudaFree
0.04% 938.75us 8 117.34us 58.917us 169.08us cudaMemset
0.02% 461.33us 83 5.5580us 1.4160us 197.58us cuDeviceGetAttribute
0.00% 15.500us 2 7.7500us 3.6670us 11.833us cuDeviceGetCount
0.00% 7.6670us 1 7.6670us 7.6670us 7.6670us cuDeviceTotalMem
0.00% 4.8340us 1 4.8340us 4.8340us 4.8340us cuDeviceGetName
0.00% 3.6670us 2 1.8330us 1.6670us 2.0000us cuDeviceGet
*/
As detailed in the comments on the original question, my problem was entirely related to the GPU I'm using (Tegra K1). Here's an answer I found for this particular problem; it might be useful for other GPUs as well. The average for cudaDeviceSynchronize on my Jetson TK1 went from 250us to 10us.
The rate of the Tegra was 72000 kHz by default; it has to be set to 852000 kHz with these commands:
$ echo 852000000 > /sys/kernel/debug/clock/override.gbus/rate
$ echo 1 > /sys/kernel/debug/clock/override.gbus/state
The list of available frequencies can be found with this command:
$ cat /sys/kernel/debug/clock/gbus/possible_rates
72000 108000 180000 252000 324000 396000 468000 540000 612000 648000 684000 708000 756000 804000 852000 (kHz)
More performance can be obtained (again, in exchange for a higher power draw) on both the CPU and GPU; check this link for more information.

Timer resolution in OpenCL profiling

I need some clarification on timer resolution. I'm trying to learn profiling in OpenCL. I have a reduction algorithm implemented in OpenCL and want to measure the kernel execution time by getting the total elapsed time in the code given below. I ran this code on different devices and here are the results:
On CPU -- AMD FX 770K
Total time = 352,855,601
CL_DEVICE_PROFILING_TIMER_RESOLUTION = 69 ns
On GPU -- AMD Radeon R7 240
Total time = 172,297
CL_DEVICE_PROFILING_TIMER_RESOLUTION = 1 ns
On another GPU -- GeForce GT 610
Total time = 1,725,504
CL_DEVICE_PROFILING_TIMER_RESOLUTION = 1000 ns
Is the "Total time" given above in actual nanoseconds, or do I need to divide it by the timer resolution to get the actual execution time? How can the timer resolution help us?
Here is a part of the code:
/* Enqueue kernel */
err = clEnqueueNDRangeKernel(queue, kernel[i], 1, NULL, &global_size,
&local_size, 0, NULL, &prof_event);
if (err < 0) {
perror("Couldn't enqueue the kernel");
exit(1);
}
/* Finish processing the queue and get profiling information */
clFinish(queue);
clGetEventProfilingInfo(prof_event, CL_PROFILING_COMMAND_START,
sizeof(time_start), &time_start, NULL);
clGetEventProfilingInfo(prof_event, CL_PROFILING_COMMAND_END,
sizeof(time_end), &time_end, NULL);
total_time = time_end - time_start;
printf("Total time = %lu\n\n", total_time);
The specification is pretty clear on this: "current device time counter in nanoseconds".
The times are always in nanoseconds. The resolution query is there so you can find out how accurate the data is. For example, given the measurements and resolutions you posted, you can deduce the error margin of each measurement:
AMD FX 770K:
Measured: 352,855,601 ± 69 ns
Actual: 352,855,532 - 352,855,670
AMD Radeon R7 240:
Measured: 172,297 ± 1 ns
Actual: 172,296 - 172,298
GeForce GT 610:
Measured: 1,725,504 ± 1000 ns
Actual: 1,724,504 - 1,726,504
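The ranges above can be reproduced with a few lines of code. This is only a sketch of the arithmetic in this answer; TimeRange, error_range and ns_to_ms are hypothetical helper names, and the inputs are assumed to be the total time from clGetEventProfilingInfo and the value of CL_DEVICE_PROFILING_TIMER_RESOLUTION, both in nanoseconds.

```cpp
#include <cstdint>

// Interval in which the true duration is assumed to lie (hypothetical type).
struct TimeRange {
    std::uint64_t low_ns;
    std::uint64_t high_ns;
};

// measured_ns: END - START from the profiling query, already in nanoseconds.
// resolution_ns: the device's profiling timer resolution in nanoseconds.
inline TimeRange error_range(std::uint64_t measured_ns, std::uint64_t resolution_ns) {
    // Clamp at zero so a very short measurement cannot underflow.
    std::uint64_t low = measured_ns > resolution_ns ? measured_ns - resolution_ns : 0;
    return {low, measured_ns + resolution_ns};
}

// No division by the resolution is needed; only a unit conversion if desired.
inline double ns_to_ms(std::uint64_t ns) {
    return static_cast<double>(ns) / 1e6;
}
```

For example, error_range(1725504, 1000) gives the 1,724,504 - 1,726,504 interval listed for the GeForce GT 610.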

DirectX 11 Compute Shader device synchronization?

Background: performing benchmarking/comparison across GPGPU platforms.
Problem: device synchronization when dispatching a DirectX 11 Compute Shader.
I am looking for the equivalent of cudaDeviceSynchronize() or clFinish(...) to make a fair comparison of how my algorithm performs.
CUDA and OpenCL functions are clearer on the blocking/non-blocking issue. DirectCompute, however, is more closely tied to the graphics pipeline (which I am still learning and am very unfamiliar with), so I have trouble finding out whether a Dispatch call is blocking, or whether earlier memory allocations/transfers have finished.
Code DX_1:
// Setup
...
for (...) {
startTimer();
context->Dispatch(number_of_groups, 1, 1);
times[i] = stopTimer();
}
// Release
...
Code DX_2:
for (...) {
// Setup
...
startTimer();
context->Dispatch(number_of_groups, 1, 1);
times[i] = stopTimer();
// Release
...
}
Results (average times of 2^2 to 2^11 elements):
DX_1 DX_2 CUDA
1.6 205.5 24.8
1.8 133.4 24.8
29.1 186.5 25.6
18.6 175.0 25.6
11.4 187.5 26.6
85.2 127.7 26.3
166.4 151.1 28.1
98.2 149.5 35.2
26.8 203.5 31.6
Notice: these times are run on a desktop GPU with a screen connected, some erratic timings are expected. Times are not supposed to include host to device buffer transfers.
Notice 2: These are very short sequences (4 - 2048 elements) the interesting tests are performed on problem sizes of up to 2^26 elements.
My new solution is to avoid synchronizing with the device. I have looked into some methods of retrieving timestamps instead; the results look OK and I'm fairly sure the comparisons are fair enough. I compared my CUDA times (Event Record vs. QPC) and the difference is small, a seemingly constant overhead.
CUDA Event Host QPC
4,6 30,0
4,8 30,0
5,0 31,0
5,2 32,0
5,6 34,0
6,1 34,0
6,9 31,0
8,3 47,0
9,2 34,0
12,0 39,0
16,7 46,0
20,5 55,0
32,1 69,0
48,5 111,0
86,0 134,0
182,4 237,0
419,0 473,0
In case my question brings someone here hoping to find out how to do GPGPU benchmarking, I will leave some code behind demonstrating my current benchmarking strategy.
Code Examples, CUDA
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
float milliseconds = 0;
cudaEventRecord(start);
...
// Launch my algorithm
...
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&milliseconds, start, stop);
OpenCL
cl_event start_event, end_event;
cl_ulong start = 0, end = 0;
// Enqueue a dummy kernel for the start event.
clEnqueueNDRangeKernel(..., &start_event);
...
// Launch my algorithm
...
// Enqueue a dummy kernel for the end event.
clEnqueueNDRangeKernel(..., &end_event);
clWaitForEvents(1, &end_event);
clGetEventProfilingInfo(start_event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &start, NULL);
clGetEventProfilingInfo(end_event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &end, NULL);
timeInMS = (double)(end - start)*(double)(1e-06);
DirectCompute
Here I followed the suggestion from Adam Miles and looked into that source. It looks something like this:
ID3D11Device* device = nullptr;
...
// Setup
...
ID3D11QueryPtr disjoint_query;
ID3D11QueryPtr q_start;
ID3D11QueryPtr q_end;
...
if (disjoint_query == NULL)
{
D3D11_QUERY_DESC desc;
desc.Query = D3D11_QUERY_TIMESTAMP_DISJOINT;
desc.MiscFlags = 0;
device->CreateQuery(&desc, &disjoint_query);
desc.Query = D3D11_QUERY_TIMESTAMP;
device->CreateQuery(&desc, &q_start);
device->CreateQuery(&desc, &q_end);
}
context->Begin(disjoint_query);
context->End(q_start);
...
// Launch my algorithm
...
context->End(q_end);
context->End(disjoint_query);
UINT64 start, end;
D3D11_QUERY_DATA_TIMESTAMP_DISJOINT q_freq;
while (S_OK != context->GetData(q_start, &start, sizeof(UINT64), 0)){};
while (S_OK != context->GetData(q_end, &end, sizeof(UINT64), 0)){};
while (S_OK != context->GetData(disjoint_query, &q_freq, sizeof(D3D11_QUERY_DATA_TIMESTAMP_DISJOINT), 0)){};
timeInMS = (((double)(end - start)) / ((double)q_freq.Frequency)) * 1000.0;
C/C++/OpenMP
static LARGE_INTEGER StartingTime, EndingTime, ElapsedMicroseconds, Frequency;
static void __inline startTimer()
{
QueryPerformanceFrequency(&Frequency);
QueryPerformanceCounter(&StartingTime);
}
static double __inline stopTimer()
{
QueryPerformanceCounter(&EndingTime);
ElapsedMicroseconds.QuadPart = EndingTime.QuadPart - StartingTime.QuadPart;
ElapsedMicroseconds.QuadPart *= 1000000;
ElapsedMicroseconds.QuadPart /= Frequency.QuadPart;
return (double)ElapsedMicroseconds.QuadPart;
}
My code examples are taken out of context and I tried to do some clean-up, but errors might be present.
If you're interested in how long a particular Draw or Dispatch is taking on the GPU then you should take a look at DirectX 11's Timestamp queries. You can query the GPU's clock frequency and current clock value before and after some GPU work and figure out how long that took in wall time.
This is probably a good primer / example on how to do it:
https://mynameismjp.wordpress.com/2011/10/13/profiling-in-dx11-with-queries/

Resources