Formulas in perf stat

I am wondering about the formulas used in perf stat to calculate figures from the raw data.
perf stat -e task-clock,cycles,instructions,cache-references,cache-misses ./myapp
   1080267.226401      task-clock (msec)     #   19.062 CPUs utilized
1,592,123,216,789      cycles                #    1.474 GHz                    (50.00%)
  871,190,006,655      instructions          #    0.55  insn per cycle         (75.00%)
    3,697,548,810      cache-references      #    3.423 M/sec                  (75.00%)
      459,457,321      cache-misses          #   12.426 % of all cache refs    (75.00%)
In this context, how do you calculate M/sec from cache-references?

The formulas do not seem to be implemented in builtin-stat.c (where the default event sets for perf stat are defined); they are calculated (and averaged with stddev) in perf_stat__print_shadow_stats(), and some stats are collected into arrays in perf_stat__update_shadow_stats():
http://elixir.free-electrons.com/linux/v4.13.4/source/tools/perf/util/stat-shadow.c#L626
When HW_INSTRUCTIONS is counted:
"Instructions per clock" = HW_INSTRUCTIONS / HW_CPU_CYCLES; "stalled cycles per instruction" = HW_STALLED_CYCLES_FRONTEND / HW_INSTRUCTIONS
if (perf_evsel__match(evsel, HARDWARE, HW_INSTRUCTIONS)) {
        total = avg_stats(&runtime_cycles_stats[ctx][cpu]);

        if (total) {
                ratio = avg / total;
                print_metric(ctxp, NULL, "%7.2f ",
                             "insn per cycle", ratio);
        } else {
                print_metric(ctxp, NULL, NULL, "insn per cycle", 0);
        }
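Checking this against the output above: avg is the instruction count and total is the averaged cycle count, so ratio = 871,190,006,655 / 1,592,123,216,789 ≈ 0.547, which perf prints as 0.55 insn per cycle.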
Branch misses are computed in print_branch_misses as HW_BRANCH_MISSES / HW_BRANCH_INSTRUCTIONS.
There are several cache miss ratio calculations in perf_stat__print_shadow_stats() too, like HW_CACHE_MISSES / HW_CACHE_REFERENCES, and some more detailed ones (perf stat -d mode).
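For the output above that is 459,457,321 / 3,697,548,810 ≈ 0.12426, printed as 12.426 % of all cache refs.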
Stalled-cycle percentages are computed as HW_STALLED_CYCLES_FRONTEND / HW_CPU_CYCLES and HW_STALLED_CYCLES_BACKEND / HW_CPU_CYCLES.
GHz is computed as HW_CPU_CYCLES / runtime_nsecs_stats, where runtime_nsecs_stats is updated from either of the software events task-clock or cpu-clock (SW_TASK_CLOCK and SW_CPU_CLOCK; the exact difference between these two has been an open question since 2010 on LKML and 2014 on SO):
if (perf_evsel__match(counter, SOFTWARE, SW_TASK_CLOCK) ||
    perf_evsel__match(counter, SOFTWARE, SW_CPU_CLOCK))
        update_stats(&runtime_nsecs_stats[cpu], count[0]);
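Checking against the output above: task-clock was 1080267.226401 msec ≈ 1,080,267,226,401 ns, so 1,592,123,216,789 cycles / 1,080,267,226,401 ns ≈ 1.474 cycles per nanosecond, printed as 1.474 GHz.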
There are also several formulas for transactions (perf stat -T mode).
"CPU utilized" is from task-clock or cpu-clock / walltime_nsecs_stats, where walltime is calculated by the perf stat itself (in userspace using clock from the wall (astronomic time, ):
static inline unsigned long long rdclock(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}
...
static int __run_perf_stat(int argc, const char **argv)
{
        ...
        /*
         * Enable counters and exec the command:
         */
        t0 = rdclock();
        clock_gettime(CLOCK_MONOTONIC, &ref_time);

        if (forks) {
                ...
        }

        t1 = rdclock();

        update_stats(&walltime_nsecs_stats, t1 - t0);
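The walltime itself does not appear in the output, but it can be backed out of the numbers above: 19.062 CPUs utilized = 1,080,267.226401 ms of task-clock / walltime, so the run took roughly 1,080,267 / 19.062 ≈ 56,671 ms ≈ 56.7 s of wall-clock time.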
There are also some estimations from the Top-Down methodology (Tuning Applications Using a Top-down Microarchitecture Analysis Method; "Software Optimizations Become Simple with Top-Down Analysis .. Name Skylake", IDF2015; #22 in Gregg's Methodology List), described in 2016 by Andi Kleen in "Add top down metrics to perf stat", https://lwn.net/Articles/688335/ (the perf stat --topdown -I 1000 cmd mode).
And finally, if there is no exact formula for the event currently being printed, there is the universal "%c/sec" (K/sec or M/sec) metric: http://elixir.free-electrons.com/linux/v4.13.4/source/tools/perf/util/stat-shadow.c#L845. Any count is divided by the runtime in nanoseconds (from the task-clock or cpu-clock events, if one of them was present in the perf stat event set):
} else if (runtime_nsecs_stats[cpu].n != 0) {
        char unit = 'M';
        char unit_buf[10];

        total = avg_stats(&runtime_nsecs_stats[cpu]);

        if (total)
                ratio = 1000.0 * avg / total;
        if (ratio < 0.001) {
                ratio *= 1000;
                unit = 'K';
        }
        snprintf(unit_buf, sizeof(unit_buf), "%c/sec", unit);
        print_metric(ctxp, NULL, "%8.3f", unit_buf, ratio);
}
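This branch is exactly where the M/sec for cache-references in the question comes from: avg is the cache-references count and total is the task-clock runtime in nanoseconds, so ratio = 1000.0 * 3,697,548,810 / 1,080,267,226,401 ≈ 3.423, printed as 3.423 M/sec. The factor of 1000 turns events-per-nanosecond into millions of events per second.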

Related

Timer resolution in OpenCL profiling

I need some clarification on timer resolution. I'm trying to learn profiling in OpenCL. I have a reduction algorithm implemented in OpenCL and want to measure the kernel execution time by getting the total elapsed time in the code given below. I ran this code on different devices, and here are the results:
On CPU -- AMD FX 770K
Total time = 352,855,601
CL_DEVICE_PROFILING_TIMER_RESOLUTION = 69 ns
On GPU -- AMD Radeon R7 240
Total time = 172,297
CL_DEVICE_PROFILING_TIMER_RESOLUTION = 1 ns
On another GPU -- GeForce GT 610
Total time = 1,725,504
CL_DEVICE_PROFILING_TIMER_RESOLUTION = 1000 ns
The "Total time" given above is in actual nanoseconds? or I need to divide them by the time resolution to get the actual execution time? How the timer resolution can help us?
Here is a part of the code:
/* Enqueue kernel */
err = clEnqueueNDRangeKernel(queue, kernel[i], 1, NULL, &global_size,
                             &local_size, 0, NULL, &prof_event);
if (err < 0) {
    perror("Couldn't enqueue the kernel");
    exit(1);
}

/* Finish processing the queue and get profiling information */
clFinish(queue);
clGetEventProfilingInfo(prof_event, CL_PROFILING_COMMAND_START,
                        sizeof(time_start), &time_start, NULL);
clGetEventProfilingInfo(prof_event, CL_PROFILING_COMMAND_END,
                        sizeof(time_end), &time_end, NULL);
total_time = time_end - time_start;
printf("Total time = %lu\n\n", total_time);
The specification is pretty clear on this: "current device time counter in nanoseconds".
The times are always in nanoseconds. The resolution query is there so you can find out how accurate the data is. For example, given the measurements and resolutions you posted, you can deduce the error margin of each measurement:
AMD FX 770K:
  Measured: 352,855,601 ± 69 ns
  Actual:   352,855,532 - 352,855,670
AMD Radeon R7 240:
  Measured: 172,297 ± 1 ns
  Actual:   172,296 - 172,298
GeForce GT 610:
  Measured: 1,725,504 ± 1000 ns
  Actual:   1,724,504 - 1,726,504
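For completeness, here is a minimal sketch of how to query that resolution at runtime, assuming device is a valid cl_device_id (error handling omitted):

size_t resolution; /* reported in nanoseconds */
clGetDeviceInfo(device, CL_DEVICE_PROFILING_TIMER_RESOLUTION,
                sizeof(resolution), &resolution, NULL);
printf("Profiling timer resolution: %zu ns\n", resolution);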

DirectX 11 Compute Shader device synchronization?

Background: performing benchmarking/comparisons across GPGPU platforms.
Problem: Device synchronization when dispatching a DirectX 11 Compute Shader.
I am looking for the equivalent of cudaDeviceSynchronize() or clFinish(...) to make a fair comparison of how my algorithm performs.
The CUDA and OpenCL functions are clearer about blocking/non-blocking behavior. DirectCompute, however, is more related to the graphics pipeline (which I am learning and am very unfamiliar with), and therefore I have trouble finding out whether a Dispatch call is blocking, or whether previous memory allocations/transfers have finished.
Code DX_1:
// Setup
...
for (...) {
    startTimer();
    context->Dispatch(number_of_groups, 1, 1);
    times[i] = stopTimer();
}
// Release
...
Code DX_2:
for (...) {
    // Setup
    ...
    startTimer();
    context->Dispatch(number_of_groups, 1, 1);
    times[i] = stopTimer();
    // Release
    ...
}
Results (average times of 2^2 to 2^11 elements):

 DX_1    DX_2   CUDA
  1.6   205.5   24.8
  1.8   133.4   24.8
 29.1   186.5   25.6
 18.6   175.0   25.6
 11.4   187.5   26.6
 85.2   127.7   26.3
166.4   151.1   28.1
 98.2   149.5   35.2
 26.8   203.5   31.6
Notice: these times were measured on a desktop GPU with a screen connected, so some erratic timings are expected. The times are not supposed to include host-to-device buffer transfers.
Notice 2: These are very short sequences (4 - 2048 elements); the interesting tests are performed on problem sizes of up to 2^26 elements.
My new solution is to avoid synchronization with the device. I have looked into some methods of retrieving timestamps instead; the results look OK and I'm fairly sure the comparisons are fair enough. I compared my CUDA times (event record vs. QPC) and the difference is small, a seemingly constant overhead.
CUDA Event   Host QPC
       4.6       30.0
       4.8       30.0
       5.0       31.0
       5.2       32.0
       5.6       34.0
       6.1       34.0
       6.9       31.0
       8.3       47.0
       9.2       34.0
      12.0       39.0
      16.7       46.0
      20.5       55.0
      32.1       69.0
      48.5      111.0
      86.0      134.0
     182.4      237.0
     419.0      473.0
In case my question brings someone here in hopes of finding out how to do GPGPU benchmarking, I will leave some code behind demonstrating my current benchmarking strategy.
Code Examples, CUDA
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
float milliseconds = 0;
cudaEventRecord(start);
...
// Launch my algorithm
...
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&milliseconds, start, stop);
OpenCL
cl_event start_event, end_event;
cl_ulong start = 0, end = 0;
// Enqueue a dummy kernel for the start event.
clEnqueueNDRangeKernel(..., &start_event);
...
// Launch my algorithm
...
// Enqueue a dummy kernel for the end event.
clEnqueueNDRangeKernel(..., &end_event);
clWaitForEvents(1, &end_event);
clGetEventProfilingInfo(start_event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &start, NULL);
clGetEventProfilingInfo(end_event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &end, NULL);
timeInMS = (double)(end - start)*(double)(1e-06);
DirectCompute
Here I followed the suggestion from Adam Miles and looked into that source. It will look something like this:
ID3D11Device* device = nullptr;
...
// Setup
...
ID3D11QueryPtr disjoint_query;
ID3D11QueryPtr q_start;
ID3D11QueryPtr q_end;
...
if (disjoint_query == NULL)
{
    D3D11_QUERY_DESC desc;
    desc.Query = D3D11_QUERY_TIMESTAMP_DISJOINT;
    desc.MiscFlags = 0;
    device->CreateQuery(&desc, &disjoint_query);
    desc.Query = D3D11_QUERY_TIMESTAMP;
    device->CreateQuery(&desc, &q_start);
    device->CreateQuery(&desc, &q_end);
}

context->Begin(disjoint_query);
context->End(q_start);
...
// Launch my algorithm
...
context->End(q_end);
context->End(disjoint_query);

UINT64 start, end;
D3D11_QUERY_DATA_TIMESTAMP_DISJOINT q_freq;
while (S_OK != context->GetData(q_start, &start, sizeof(UINT64), 0)) {}
while (S_OK != context->GetData(q_end, &end, sizeof(UINT64), 0)) {}
while (S_OK != context->GetData(disjoint_query, &q_freq,
                                sizeof(D3D11_QUERY_DATA_TIMESTAMP_DISJOINT), 0)) {}
timeInMS = (((double)(end - start)) / ((double)q_freq.Frequency)) * 1000.0;
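One detail worth adding to the sketch above: D3D11_QUERY_DATA_TIMESTAMP_DISJOINT also carries a Disjoint flag, and the documentation says the timestamps should be discarded when it is set (e.g. the GPU clock changed mid-measurement). A minimal guard:

if (q_freq.Disjoint) {
    // The timestamps are unreliable for this interval; discard the sample.
    timeInMS = -1.0;
}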
C/C++/OpenMP
static LARGE_INTEGER StartingTime, EndingTime, ElapsedMicroseconds, Frequency;

static void __inline startTimer()
{
    QueryPerformanceFrequency(&Frequency);
    QueryPerformanceCounter(&StartingTime);
}

static double __inline stopTimer()
{
    QueryPerformanceCounter(&EndingTime);
    ElapsedMicroseconds.QuadPart = EndingTime.QuadPart - StartingTime.QuadPart;
    ElapsedMicroseconds.QuadPart *= 1000000;
    ElapsedMicroseconds.QuadPart /= Frequency.QuadPart;
    return (double)ElapsedMicroseconds.QuadPart;
}
My code examples are taken out of context, and I tried to do some clean-up, but errors might be present.
If you're interested in how long a particular Draw or Dispatch is taking on the GPU then you should take a look at DirectX 11's Timestamp queries. You can query the GPU's clock frequency and current clock value before and after some GPU work and figure out how long that took in wall time.
This is probably a good primer / example on how to do it:
https://mynameismjp.wordpress.com/2011/10/13/profiling-in-dx11-with-queries/

Measure elapsed time in OS X

I need to measure elapsed time, in order to know when a certain period of time has been exceeded.
I used to use Ticks() and Microseconds() for this, but both functions are now deprecated.
CFAbsoluteTimeGetCurrent is not the right function to use, because it may run backwards, as explained in the docs:
Repeated calls to this function do not guarantee monotonically
increasing results. The system time may decrease due to
synchronization with external time references or due to an explicit
user change of the clock.
What else is there that's not deprecated and fairly future-proof?
One way, as explained in Q&A 1398, is to use mach_absolute_time as follows:
static mach_timebase_info_data_t sTimebaseInfo;
mach_timebase_info(&sTimebaseInfo); // Determines the time scale
uint64_t t1 = mach_absolute_time();
...
uint64_t t2 = mach_absolute_time();
uint64_t elapsedNano = (t2-t1) * sTimebaseInfo.numer / sTimebaseInfo.denom;
This may not be fool-proof either, though. The values could overflow in some cases, as pointed out in this answer.
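One common way to reduce that overflow risk, sketched here against the same snippet: convert the whole-second part and the sub-second remainder separately, so the intermediate multiplication stays small.

uint64_t delta = t2 - t1;
// (delta / denom) * numer avoids multiplying the full delta by numer,
// and the remainder term multiplies a value smaller than denom.
uint64_t elapsedNano = (delta / sTimebaseInfo.denom) * sTimebaseInfo.numer
                     + (delta % sTimebaseInfo.denom) * sTimebaseInfo.numer
                       / sTimebaseInfo.denom;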
Use NSTimeInterval:
Used to specify a time interval, in seconds.
Example:
- (void)loop {
    NSDate *startTime = [NSDate date];
    sleep(90); // sleep for 90 seconds
    [self elapsedTime:startTime];
}

- (void)elapsedTime:(NSDate *)startTime {
    NSTimeInterval elapsedTime = fabs([startTime timeIntervalSinceNow]);
    int intSeconds = (int)elapsedTime;
    int intMinutes = intSeconds / 60;
    intSeconds = intSeconds % 60;
    NSLog(@"Elapsed Time: %d minute(s) %d seconds", intMinutes, intSeconds);
}
Result:
Elapsed Time: 1 minute(s) 29 seconds
It's unclear what type of precision you are looking for, although NSTimeInterval can accommodate fractions of a second (e.g. tenths, hundredths, thousandths, etc.).

Units of QueryPerformanceFrequency

A simple question:
Which is the QueryPerformanceFrequency unit?
Hz (ticks per second)?
Thank you very much,
Bruno
Q: Units of QueryPerformanceFrequency?
A: Hz (counts per second)
=========== DETAILS ==============================================
QueryPerformanceFrequency reports the frequency of the performance counter in counts per second. The caveat that trips people up: the performance counter is a dedicated hardware timer, not the CPU clock, so its frequency usually bears no relation to the CPU frequency. When you divide a tick delta by the frequency, ticks / (ticks * sec^-1), everything cancels except seconds.
Here is an example C program stripped to just the essentials:
#include "stdio.h"
#include <windows.h> // Needed for LARGE_INTEGER
// gcc cpu.freq.test.c -o cft.exe
// cft.exe -> Sleep d_KLICKS=3417790, d_time=0.999182880 sec, CPU_Freq=3420585 KILO-Hz
void main(int argc, char *argv[]) {
// Clock KILO-ticks start, end, CPU_Freq in kHz. KILOs cancel
LARGE_INTEGER sklick, eklick, cpu_khz;
double delta_time; // Expected time in SECONDS. All units above are k.
QueryPerformanceFrequency(&cpu_khz); // Gets clock KILO-tics, Klicks/sec
QueryPerformanceCounter(&sklick); // Capture cpu Start Klicks
Sleep(1000); // Sleep 1000 MILLI-seconds
QueryPerformanceCounter(&eklick); // Capture cpu End Klicks
delta_time = (eklick.QuadPart-sklick.QuadPart) / (double)cpu_khz.QuadPart;
printf("Sleep d_KLICKS=%lld, d_time=%4.9lf sec, CPU_Freq=%lld KILO-Hz\n",
eklick.QuadPart-sklick.QuadPart, delta_time, cpu_khz.QuadPart);
}
It actually compiles! Running...
Sleep d_ticks=3418803, d_time=0.999479036 sec, counter_freq=3420585 Hz
The counter frequency reads 3,420,585, i.e. about 3.42 MHz, while the CPU in this machine runs at about 3.4 GHz, roughly a factor of 1000 higher. That gap can make the frequency look like "kilo-Hz", but the counter is simply a separate timer source that does not tick once per CPU cycle. Sanity check: 3,418,803 ticks / 3,420,585 ticks-per-second ≈ 0.99948 s, matching the Sleep(1000).
Microsoft's documentation is consistent with this:
https://msdn.microsoft.com/en-us/library/windows/desktop/dn553408%28v=vs.85%29.aspx
QueryPerformanceFrequency(&Frequency);
QueryPerformanceCounter(&StartingTime);
// Activity to be timed
QueryPerformanceCounter(&EndingTime);
ElapsedMicroseconds.QuadPart = EndingTime.QuadPart - StartingTime.QuadPart;
// We now have the elapsed number of ticks, along with the
// number of ticks-per-second.
The number of "elapsed ticks" in 1 second is in the MILLIONS, NOT BILLIONS so they are NOT UNIT-CPU-CLOCK-TICKS but KILO-CPU-CLOCK-TICKS
Same off-by-3-orders-of-magnitude error for FREQ: 3.4 MILLION is not "ticks-per-second" but THOUSAND-ticks-per-second.
As long as you divide one by the other, the ?clicks cancel with a result in seconds. If one were so fatuous as to take ms at their document and try to use their "ticks-per-second" in some other calculation, you would wind up off by a factor of 1000 or ~1 standard_ms_error!
Perhaps we should call Heinrich in to check HIS units? Oops! 153 years too late. :(

Program execution taking almost same usertime on CPU as well as GPU?

The program for finding prime numbers using OpenCL 1.1 gave the following benchmarks:

Device: CPU
Realtime: approx. 3 sec
Usertime: approx. 32 sec

Device: GPU
Realtime: approx. 37 sec
Usertime: approx. 32 sec

Why is the usertime of execution on the GPU not less than that on the CPU? Is data/task parallelization not occurring?
System specifications: 64-bit CentOS 5.3 system with two ATI Radeon 5970 graphics cards + Intel Core i7 processor (12 cores)
Your kernel is rather inefficient; I have an adjusted one below for you to consider. As for why it runs better on a CPU device:
Using your algorithm, the work items take varying amounts of time to execute, and they take longer as the numbers tested grow larger. A work group on a GPU will not finish until all of its items are finished, so some of the hardware is left idle until the last item is done. On a CPU it behaves more like a loop iterating over the kernel items, so the difference in cycles needed to compute each item won't drastically affect the performance.
'A' is not used by the kernel. It should not be copied unless it is used. It looks like you wanted to test A[i] rather than 'i' itself, though.
I think the gpu would be much better at FFT-based prime calculations, or even a sieve algorithm.
__kernel void find_primes(__global int *B) // signature missing from the post; reconstructed here
{
    int t;
    int i = get_global_id(0);
    int end = (int)sqrt((float)i); // largest divisor worth testing

    if (i % 2) {
        B[i] = 1; // odd: candidate prime (assuming only that it should be non-zero)
    } else {
        B[i] = 0; // even: not prime (note: this also marks 2 as non-prime)
    }
    for (t = 3; (t <= end) && (B[i] > 0); t += 2) {
        if (i % t == 0) {
            B[i] = 0; // found an odd divisor, so i is not prime
        }
    }
}
