I am following Tim Mattson's lectures on OpenMP to learn how to implement some parallel programming concepts.
I was trying to observe the running-time behavior of a parallel program that computes the value of PI using 3x10^8 steps.
Here is the code:
#include <omp.h>
#include <stdio.h>

static long num_steps = 300000000;
double step;
#define PAD 8 // tried 50 too
#define NUM_THREADS 4

int main()
{
    int i, nthreads;
    double pi, sum[NUM_THREADS][PAD];
    double ts, te;

    ts = omp_get_wtime();
    step = 1.0/(double) num_steps;
    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel
    {
        int i, id, nthrds;
        double x;
        id = omp_get_thread_num();
        nthrds = omp_get_num_threads();
        if (id == 0) nthreads = nthrds;
        for (i = id, sum[id][0] = 0.0; i < num_steps; i = i + nthrds) {
            x = (i + 0.5) * step;
            sum[id][0] += 4.0/(1.0 + x*x);
        }
    }
    for (i = 0, pi = 0.0; i < nthreads; i++)
        pi += sum[i][0] * step;
    te = omp_get_wtime();
    printf("%.10f\n", pi);
    printf("%f\n", te - ts);
    return 0;
}
I was on Ubuntu 14.04 LTS running on a dual-core machine; a call to omp_get_num_procs() returned 2. The running time was essentially random, ranging from 1.31 to 4.46 seconds, whereas the serial program almost always took about 2.31 seconds.
I tried creating 1, 2, 3, 4, up to 10 threads. The running time varies too much in every case, though the average is smaller with more threads. I wasn't running any other applications.
Can anyone explain why the running time varied so much?
How can I measure the running time accurately? The lecturer gave the running time on his computer, which seems consistent, and he was also using a dual-core processor.
Dual-CPU comparison, using OpenMP :
Result : 3.1415926536
Number of CPU-s : 2
Duration : 2.4025482161
There seems to be a pretty consistent set of resulting code-execution times:
/* Duration : 2.3984972970
Duration : 2.4004815188
Duration : 2.3814983589
Duration : 2.4070654172
Duration : 2.3964317020
Duration : 2.3858104548
Duration : 2.3765923560
Duration : 2.3734730321
-O3:
Duration : 0.4159400249
Duration : 0.3089567909
Duration : 0.3106977220
Duration : 0.3312316008
Duration : 0.2856188160
Duration : 0.2984415500
Duration : 0.3282426349
Duration : 0.2836121118
+ FYI: #pragma-overheads alone:
Duration : 0.0001377461
Duration : 0.0001228561
Duration : 0.0001215260

REF: Amdahl's Law >>> https://stackoverflow.com/revisions/18374629/3
     (a criticism of (not-)including also the real-world infrastructure add-on
      { setup | termination }-overhead costs of a #pragma omp parallel section;
      this simplified test omits the add-on costs of global OpenMP setup & configuration)
*/
which turns attention to the background workload noise on your System-under-Test.
Best re-test your code on a headless platform, so as to keep any GUI-related workloads from interfering with the computing part of the test.
You may also enjoy the sandboxed online TiO platform to re-run the experiments.
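If you do re-test, a simple way to tame the noise is to repeat the whole timed region several times inside one process and report the minimum (the least-disturbed run) rather than a single measurement. Below is a minimal sketch of such a harness; compute_pi() is a hypothetical wrapper around the parallel loop from the question, introduced here only for illustration:

#include <omp.h>
#include <stdio.h>

double compute_pi(void);                     /* assumed: the question's parallel loop, unchanged */

int main(void)
{
    double pi = 0.0, best = 1e30;

    for (int r = 0; r < 10; r++) {           /* repeat the measurement several times   */
        double ts = omp_get_wtime();
        pi = compute_pi();
        double te = omp_get_wtime();
        double dt = te - ts;

        if (dt < best) best = dt;            /* keep the least-disturbed (fastest) run */
        printf("run %2d : %.10f s\n", r, dt);
    }

    printf("pi = %.10f, best time = %.10f s (timer resolution = %g s)\n",
           pi, best, omp_get_wtick());
    return 0;
}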
Related
I am testing the performance of a cluster where I use 64 threads. I have written some simple code:
#include <iostream>
#include <vector>
#include <omp.h>

using namespace std;

int main() {
    unsigned int m(67000);
    double start_time_i(0.), end_time_i(0.), start_time_init(0.), end_time_init(0.),
           start_time_j(0.), end_time_j(0.);

    cout << "omp_get_max_threads : " << omp_get_max_threads() << endl;
    cout << "omp_get_num_procs : " << omp_get_num_procs() << endl;
    omp_set_num_threads(omp_get_max_threads());
    unsigned int dim_i = omp_get_max_threads();
    unsigned int dim_j = dim_i * m;

    std::vector<std::vector<unsigned int>> vector;
    vector.resize(dim_i, std::vector<unsigned int>(dim_j, 0));

    start_time_init = omp_get_wtime();
    for (unsigned int j = 0; j < dim_j; j++) {
        vector[0][j] = j;
    }
    end_time_init = omp_get_wtime();

    start_time_i = omp_get_wtime();
    #pragma omp parallel for
    for (unsigned int i = 0; i < dim_i; i++) {
        start_time_j = omp_get_wtime();
        for (unsigned int j = 0; j < dim_j; j++) vector[i][j] = i + j;
        end_time_j = omp_get_wtime();
        cout << "i " << i << " thread " << omp_get_thread_num()
             << " int_time = " << (end_time_j - start_time_j) * 1000 << endl;
    }
    end_time_i = omp_get_wtime();

    cout << "time_final = " << (end_time_i - start_time_i) * 1000 << endl;
    cout << "initial non parallel region time = " << (end_time_init - start_time_init) * 1000 << endl;
    return 0;
}
I do not understand why "(end_time_j-start_time_j)*1000" is much bigger (around 50) than the time it takes to go through the same loop over j outside the parallel region, i.e. "end_time_init-start_time_init" (around 1).
omp_get_max_threads() and omp_get_num_procs() are both equal to 64.
In your loop you just fill memory locations with a lot of values. This task is not computationally expensive; it depends on the speed of memory writes. One thread can do it at a certain rate, but when you use N threads simultaneously, the total memory bandwidth remains the same on shared-memory multicore systems (i.e. most PCs and laptops), while it increases on distributed-memory multicore systems (high-end servers). For more details please read this.
So, depending on the system, the speed of memory writes either remains the same or decreases when running several loops concurrently. To me a 50x difference seems a bit large. I got the following results on Compiler Explorer (which suggests it behaves like a distributed-memory multicore system):
omp_get_max_threads : 4
omp_get_num_procs : 2
i 2 thread 2 int_time = 0.095537
i 0 thread 0 int_time = 0.084061
i 1 thread 1 int_time = 0.099578
i 3 thread 3 int_time = 0.10519
time_final = 0.868523
initial non parallel region time = 0.090862
On my laptop I got the following (so it is a shared-memory multicore system):
omp_get_max_threads : 8
omp_get_num_procs : 8
i 7 thread 7 int_time = 0.7518
i 5 thread 5 int_time = 1.0555
i 1 thread 1 int_time = 1.2755
i 6 thread 6 int_time = 1.3093
i 2 thread 2 int_time = 1.3093
i 3 thread 3 int_time = 1.3093
i 4 thread 4 int_time = 1.3093
i 0 thread 0 int_time = 1.3093
time_final = 1.915
initial non parallel region time = 0.1578
In conclusion it does depend on the system you are using...
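One way to sanity-check the bandwidth explanation is to convert the measured times into an effective write bandwidth and compare the single-thread and all-thread figures. A rough sketch; the sizes follow the question, and the two timings plugged in are just the laptop numbers printed above:

#include <iostream>

int main() {
    // Sizes from the question: dim_j = dim_i * m, one unsigned int per element.
    const unsigned int m = 67000;
    const unsigned int dim_i = 8;                       // the laptop run above used 8 threads
    const unsigned long long dim_j = 1ULL * dim_i * m;

    const double bytes_per_row = double(sizeof(unsigned int)) * dim_j;   // one inner j-loop
    const double bytes_total   = bytes_per_row * dim_i;                  // whole parallel region

    // Example timings in seconds, taken from the laptop output above (printed values are in ms).
    const double t_single_row   = 0.1578 / 1000.0;      // "initial non parallel region"
    const double t_parallel_all = 1.915  / 1000.0;      // "time_final"

    std::cout << "single-thread write bandwidth  ~ " << bytes_per_row / t_single_row  / 1e9 << " GB/s\n";
    std::cout << "aggregate (8-thread) bandwidth ~ " << bytes_total   / t_parallel_all / 1e9 << " GB/s\n";

    // If the aggregate figure stays close to the single-thread one, the loop is
    // limited by memory bandwidth rather than by the number of threads.
    return 0;
}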
I am wondering about the formulas used in perf stat to calculate figures from the raw data.
perf stat -e task-clock,cycles,instructions,cache-references,cache-misses ./myapp
1080267.226401 task-clock (msec) # 19.062 CPUs utilized
1,592,123,216,789 cycles # 1.474 GHz (50.00%)
871,190,006,655 instructions # 0.55 insn per cycle (75.00%)
3,697,548,810 cache-references # 3.423 M/sec (75.00%)
459,457,321 cache-misses # 12.426 % of all cache refs (75.00%)
In this context, how do you calculate M/sec from cache-references?
The formulas do not seem to be implemented in builtin-stat.c (where the default event sets for perf stat are defined); they are probably calculated (and averaged with stddev) in perf_stat__print_shadow_stats() (and some stats are collected into arrays in perf_stat__update_shadow_stats()):
http://elixir.free-electrons.com/linux/v4.13.4/source/tools/perf/util/stat-shadow.c#L626
When HW_INSTRUCTIONS is counted:
"Instructions per clock" = HW_INSTRUCTIONS / HW_CPU_CYCLES; "stalled cycles per instruction" = HW_STALLED_CYCLES_FRONTEND / HW_INSTRUCTIONS
if (perf_evsel__match(evsel, HARDWARE, HW_INSTRUCTIONS)) {
        total = avg_stats(&runtime_cycles_stats[ctx][cpu]);
        if (total) {
                ratio = avg / total;
                print_metric(ctxp, NULL, "%7.2f ",
                                "insn per cycle", ratio);
        } else {
                print_metric(ctxp, NULL, NULL, "insn per cycle", 0);
        }
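Plugging the counts from the question's output into that formula: 871,190,006,655 instructions / 1,592,123,216,789 cycles ≈ 0.55, which is exactly the "0.55 insn per cycle" shown above.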
Branch misses are computed in print_branch_misses as HW_BRANCH_MISSES / HW_BRANCH_INSTRUCTIONS.
There are several cache-miss ratio calculations in perf_stat__print_shadow_stats() too, like HW_CACHE_MISSES / HW_CACHE_REFERENCES, and some more detailed ones (perf stat -d mode).
Stalled percentages are computed as HW_STALLED_CYCLES_FRONTEND / HW_CPU_CYCLES and HW_STALLED_CYCLES_BACKEND / HW_CPU_CYCLES.
GHz is computed as HW_CPU_CYCLES / runtime_nsecs_stats, where runtime_nsecs_stats is updated from either of the software events task-clock or cpu-clock (SW_TASK_CLOCK & SW_CPU_CLOCK; the exact difference between the two has been asked about since 2010 on LKML and 2014 on SO without a clear answer):
if (perf_evsel__match(counter, SOFTWARE, SW_TASK_CLOCK) ||
    perf_evsel__match(counter, SOFTWARE, SW_CPU_CLOCK))
        update_stats(&runtime_nsecs_stats[cpu], count[0]);
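With the numbers from the question: 1,592,123,216,789 cycles / 1,080,267,226,401 ns of task-clock ≈ 1.474, printed as "1.474 GHz".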
There are also several formulas for transactions (perf stat -T mode).
"CPU utilized" is from task-clock or cpu-clock / walltime_nsecs_stats, where walltime is calculated by the perf stat itself (in userspace using clock from the wall (astronomic time, ):
static inline unsigned long long rdclock(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}
...
static int __run_perf_stat(int argc, const char **argv)
{
        ...
        /*
         * Enable counters and exec the command:
         */
        t0 = rdclock();
        clock_gettime(CLOCK_MONOTONIC, &ref_time);

        if (forks) {
                ....
        }

        t1 = rdclock();
        update_stats(&walltime_nsecs_stats, t1 - t0);
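As a check against the output above: 1,080,267.226401 ms of task-clock divided by the reported 19.062 CPUs utilized implies a measured wall time of roughly 56.7 seconds.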
There are also some estimations from the Top-Down methodology (Tuning Applications Using a Top-down Microarchitecture Analysis Method; Software Optimizations Become Simple with Top-Down Analysis ... Skylake, IDF2015; #22 in Gregg's Methodology List), described in 2016 by Andi Kleen in https://lwn.net/Articles/688335/ "Add top down metrics to perf stat" (the perf stat --topdown -I 1000 cmd mode).
And finally, if there is no exact formula for the event currently being printed, there is the universal "%c/sec" (K/sec or M/sec) metric: http://elixir.free-electrons.com/linux/v4.13.4/source/tools/perf/util/stat-shadow.c#L845. Anything gets divided by the runtime in nanoseconds (the task-clock or cpu-clock event, if one was present in the perf stat event set):
} else if (runtime_nsecs_stats[cpu].n != 0) {
        char unit = 'M';
        char unit_buf[10];

        total = avg_stats(&runtime_nsecs_stats[cpu]);

        if (total)
                ratio = 1000.0 * avg / total;
        if (ratio < 0.001) {
                ratio *= 1000;
                unit = 'K';
        }
        snprintf(unit_buf, sizeof(unit_buf), "%c/sec", unit);
        print_metric(ctxp, NULL, "%8.3f", unit_buf, ratio);
}
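Applied to the cache-references line from the question: avg = 3,697,548,810 events and total = 1,080,267,226,401 ns of task-clock, so ratio = 1000.0 * 3,697,548,810 / 1,080,267,226,401 ≈ 3.423, printed as "3.423 M/sec" (events per microsecond, i.e. millions of events per second).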
Background: benchmarking/comparison across GPGPU platforms.
Problem: device synchronization when dispatching a DirectX 11 compute shader.
I am looking for the equivalent of cudaDeviceSynchronize() or clFinish(...) so I can make a fair comparison of how my algorithm performs.
CUDA and OpenCL are clearer about blocking/non-blocking behavior. DirectCompute, however, is tied to the graphics pipeline (which I am still learning and am very unfamiliar with), and I therefore have trouble finding out whether a Dispatch call is blocking, or whether previous memory allocations/transfers have finished.
Code DX_1:
// Setup
...
for (...) {
    startTimer();
    context->Dispatch(number_of_groups, 1, 1);
    times[i] = stopTimer();
}
// Release
...
Code DX_2:
for (...) {
    // Setup
    ...
    startTimer();
    context->Dispatch(number_of_groups, 1, 1);
    times[i] = stopTimer();
    // Release
    ...
}
Results (average times of 2^2 to 2^11 elements):
  DX_1     DX_2    CUDA
   1.6    205.5    24.8
   1.8    133.4    24.8
  29.1    186.5    25.6
  18.6    175.0    25.6
  11.4    187.5    26.6
  85.2    127.7    26.3
 166.4    151.1    28.1
  98.2    149.5    35.2
  26.8    203.5    31.6
Notice: these timings are from a desktop GPU with a screen connected, so some erratic timings are expected. The times are not supposed to include host-to-device buffer transfers.
Notice 2: these are very short sequences (4 - 2048 elements); the interesting tests are performed on problem sizes of up to 2^26 elements.
My new solution is to avoid synchronizing with the device. I have looked into some methods of retrieving timestamps instead; the results look OK and I'm fairly sure the comparisons are fair enough. I compared my CUDA times (Event Record vs. QPC) and the difference is small, a seemingly constant overhead.
CUDA Event   Host QPC
       4.6       30.0
       4.8       30.0
       5.0       31.0
       5.2       32.0
       5.6       34.0
       6.1       34.0
       6.9       31.0
       8.3       47.0
       9.2       34.0
      12.0       39.0
      16.7       46.0
      20.5       55.0
      32.1       69.0
      48.5      111.0
      86.0      134.0
     182.4      237.0
     419.0      473.0
In case my question brings someone here hoping to find out how to do GPGPU benchmarking, I will leave some code behind demonstrating my current benchmarking strategy.
Code Examples, CUDA
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
float milliseconds = 0;
cudaEventRecord(start);
...
// Launch my algorithm
...
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&milliseconds, start, stop);
OpenCL
cl_event start_event, end_event;
cl_ulong start = 0, end = 0;
// Enqueue a dummy kernel for the start event.
clEnqueueNDRangeKernel(..., &start_event);
...
// Launch my algorithm
...
// Enqueue a dummy kernel for the end event.
clEnqueueNDRangeKernel(..., &end_event);
clWaitForEvents(1, &end_event);
clGetEventProfilingInfo(start_event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &start, NULL);
clGetEventProfilingInfo(end_event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &end, NULL);
timeInMS = (double)(end - start)*(double)(1e-06);
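One detail worth noting that is not shown in the snippet above: clGetEventProfilingInfo only returns valid timestamps if the command queue was created with profiling enabled; otherwise it fails with CL_PROFILING_INFO_NOT_AVAILABLE. A short sketch ('context' and 'device' stand in for the usual setup code elided above):

// Queue must be created with CL_QUEUE_PROFILING_ENABLE for event profiling to work.
cl_int err = CL_SUCCESS;
cl_command_queue queue =
    clCreateCommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE, &err);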
DirectCompute
Here I followed the suggestion from Adam Miles and looked into that source. It will look something like this:
ID3D11Device* device = nullptr;
...
// Setup
...
ID3D11QueryPtr disjoint_query;
ID3D11QueryPtr q_start;
ID3D11QueryPtr q_end;
...
if (disjoint_query == NULL)
{
    D3D11_QUERY_DESC desc;
    desc.Query = D3D11_QUERY_TIMESTAMP_DISJOINT;
    desc.MiscFlags = 0;
    device->CreateQuery(&desc, &disjoint_query);
    desc.Query = D3D11_QUERY_TIMESTAMP;
    device->CreateQuery(&desc, &q_start);
    device->CreateQuery(&desc, &q_end);
}

context->Begin(disjoint_query);
context->End(q_start);
...
// Launch my algorithm
...
context->End(q_end);
context->End(disjoint_query);

UINT64 start, end;
D3D11_QUERY_DATA_TIMESTAMP_DISJOINT q_freq;
while (S_OK != context->GetData(q_start, &start, sizeof(UINT64), 0)) {}
while (S_OK != context->GetData(q_end, &end, sizeof(UINT64), 0)) {}
while (S_OK != context->GetData(disjoint_query, &q_freq, sizeof(D3D11_QUERY_DATA_TIMESTAMP_DISJOINT), 0)) {}

timeInMS = (((double)(end - start)) / ((double)q_freq.Frequency)) * 1000.0;
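One caveat the snippet above skips: D3D11_QUERY_DATA_TIMESTAMP_DISJOINT also carries a Disjoint flag, and the two timestamps are only meaningful when it is FALSE (i.e. the GPU clock did not change during the interval), so a slightly more defensive version of the last line would be:

// Only trust the timestamps if the GPU clock was stable over the measured interval.
if (q_freq.Disjoint == FALSE)
    timeInMS = (((double)(end - start)) / ((double)q_freq.Frequency)) * 1000.0;
else
    timeInMS = -1.0;   // marker: measurement unreliable, repeat the run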
C/C++/OpenMP
static LARGE_INTEGER StartingTime, EndingTime, ElapsedMicroseconds, Frequency;

static void __inline startTimer()
{
    QueryPerformanceFrequency(&Frequency);
    QueryPerformanceCounter(&StartingTime);
}

static double __inline stopTimer()
{
    QueryPerformanceCounter(&EndingTime);
    ElapsedMicroseconds.QuadPart = EndingTime.QuadPart - StartingTime.QuadPart;
    ElapsedMicroseconds.QuadPart *= 1000000;
    ElapsedMicroseconds.QuadPart /= Frequency.QuadPart;
    return (double)ElapsedMicroseconds.QuadPart;
}
My code examples are taken out of context, and although I tried to clean them up, errors may be present.
If you're interested in how long a particular Draw or Dispatch is taking on the GPU then you should take a look at DirectX 11's Timestamp queries. You can query the GPU's clock frequency and current clock value before and after some GPU work and figure out how long that took in wall time.
This is probably a good primer / example on how to do it:
https://mynameismjp.wordpress.com/2011/10/13/profiling-in-dx11-with-queries/
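If what is needed is only a blocking wait comparable to cudaDeviceSynchronize() / clFinish() rather than a GPU-side timestamp, an event query can serve as a crude fence. A minimal sketch (my own, not taken from the primer above):

// Rough equivalent of cudaDeviceSynchronize() on a D3D11 immediate context.
void WaitForGpuIdle(ID3D11Device* device, ID3D11DeviceContext* context)
{
    D3D11_QUERY_DESC desc = {};
    desc.Query = D3D11_QUERY_EVENT;          // signalled once all preceding work has completed

    ID3D11Query* query = nullptr;
    if (FAILED(device->CreateQuery(&desc, &query))) return;

    context->End(query);                     // place the fence after all queued work
    context->Flush();                        // make sure the work is actually submitted

    BOOL done = FALSE;
    while (context->GetData(query, &done, sizeof(done), 0) != S_OK || !done)
    {
        // busy-wait; a Sleep(0) or YieldProcessor() could be added here
    }
    query->Release();
}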
A simple question:
What is the unit of QueryPerformanceFrequency?
Hz (ticks per second)?
Thank you very much,
Bruno
Q: Units of QueryPerformanceFrequency?
A: KILO-HERTZ (NOT Hz)
=========== DETAILS ==============================================
My research indicates that both the counters and the frequency are in KILOs: KILO-clock-ticks and KILO-HERTZ!
The counters register KILO-clicks (KLICKS) and the frequency is either in kHz or I am woefully UnderClocked. When you divide the Clock_Ticks by the Clock_Frequency, kclicks/(kclicks*sec^-1), everything cancels except seconds.
Here is an example C program stripped to just the essentials:
#include "stdio.h"
#include <windows.h> // Needed for LARGE_INTEGER
// gcc cpu.freq.test.c -o cft.exe
// cft.exe -> Sleep d_KLICKS=3417790, d_time=0.999182880 sec, CPU_Freq=3420585 KILO-Hz
void main(int argc, char *argv[]) {
// Clock KILO-ticks start, end, CPU_Freq in kHz. KILOs cancel
LARGE_INTEGER sklick, eklick, cpu_khz;
double delta_time; // Expected time in SECONDS. All units above are k.
QueryPerformanceFrequency(&cpu_khz); // Gets clock KILO-tics, Klicks/sec
QueryPerformanceCounter(&sklick); // Capture cpu Start Klicks
Sleep(1000); // Sleep 1000 MILLI-seconds
QueryPerformanceCounter(&eklick); // Capture cpu End Klicks
delta_time = (eklick.QuadPart-sklick.QuadPart) / (double)cpu_khz.QuadPart;
printf("Sleep d_KLICKS=%lld, d_time=%4.9lf sec, CPU_Freq=%lld KILO-Hz\n",
eklick.QuadPart-sklick.QuadPart, delta_time, cpu_khz.QuadPart);
}
It actually compiles! Running...
Sleep d_KLICKS=3418803, d_time=0.999479036 sec, CPU_Freq=3420585 KILO-Hz
The CPU freq reads 3420585, i.e. 3.420585E6 or 3.4 MHz? <- MEGA-HURTS! OUCH!
The actual CPU freq is 3.4 Mega-Kilo-Hz, i.e. 3.4 GHz.
Microsoft appears to be confused (some things Never Change):
https://msdn.microsoft.com/en-us/library/windows/desktop/dn553408%28v=vs.85%29.aspx
QueryPerformanceFrequency(&Frequency);
QueryPerformanceCounter(&StartingTime);
// Activity to be timed
QueryPerformanceCounter(&EndingTime);
ElapsedMicroseconds.QuadPart = EndingTime.QuadPart - StartingTime.QuadPart;
// We now have the elapsed number of ticks, along with the
// number of ticks-per-second.
The number of "elapsed ticks" in 1 second is in the MILLIONS, NOT BILLIONS so they are NOT UNIT-CPU-CLOCK-TICKS but KILO-CPU-CLOCK-TICKS
Same off-by-3-orders-of-magnitude error for FREQ: 3.4 MILLION is not "ticks-per-second" but THOUSAND-ticks-per-second.
As long as you divide one by the other, the ?clicks cancel with a result in seconds. If one were so fatuous as to take ms at their document and try to use their "ticks-per-second" in some other calculation, you would wind up off by a factor of 1000 or ~1 standard_ms_error!
Perhaps we should call Heinrich in to check HIS units? Oops! 153 years too late. :(
A program for finding prime numbers using OpenCL 1.1 gave the following benchmarks:

Device : CPU
Realtime : approx. 3 sec
Usertime : approx. 32 sec

Device : GPU
Realtime : approx. 37 sec
Usertime : approx. 32 sec

Why is the user time of execution on the GPU not less than that on the CPU? Is data/task parallelization not occurring?
System specifications: 64-bit CentOS 5.3 system with two ATI Radeon 5970 graphics cards + an Intel Core i7 processor (12 cores).
Your kernel is rather inefficient; I have an adjusted one below for you to consider. As to why it runs better on a CPU device:
With your algorithm, the work items take varying amounts of time to execute, and they take longer as the numbers tested grow larger. A work group on a GPU will not finish until all of its items are finished, so some of the hardware sits idle until the last item is done. On a CPU it behaves more like a loop iterating over the kernel items, so the difference in cycles needed to compute each item won't drastically affect the performance.
'A' is not used by the kernel, so it should not be copied unless it is used. It looks like you wanted to test A[i] rather than 'i' itself, though.
I think the GPU would be much better at FFT-based prime calculations, or even a sieve algorithm.
// Kernel signature assumed here (the original post omitted it); 'B' is the output flag array.
__kernel void find_primes(__global int *B)
{
    int t;
    int i = get_global_id(0);
    int end = (int)sqrt((float)i);
    if (i % 2 == 0) {
        B[i] = 0;              // even numbers are not prime (0, 1 and 2 are not special-cased)
    } else {
        B[i] = 1;              // assuming only that it should be non-zero
    }
    for (t = 3; (t <= end) && (B[i] > 0); t += 2) {
        if (i % t == 0) {
            B[i] = 0;          // found an odd divisor, so not prime
        }
    }
}