perf_event with OpenMP

I'm using perf_event to measure performance information. I confirmed that the instruction counter works correctly on a single core.
However, when I try parallel computing with OpenMP, the results seem wrong.
I expected the instruction count to stay the same when the number of cores changes.
for(int i = 1; i <= 8; i++)
{
    cnt = 0;
    omp_set_num_threads(i);
    ioctl(pc_c, PERF_EVENT_IOC_RESET, 0);
    ioctl(pc_c, PERF_EVENT_IOC_ENABLE, 0);
    #pragma omp parallel for
    for(int tid = 0; tid < 1000000; tid++)
    {
        //sleep(0.2);
        cnt++;
    }
    ioctl(pc_c, PERF_EVENT_IOC_DISABLE, 0);
    read(pc_c, &pc_c_result, sizeof(long long));
}
When I use sleep(0.2), the results look regular.
// result
core[1] perf count = 25006756
core[2] perf count = 14681730
core[3] perf count = 10166403
core[4] perf count = 7601514
core[5] perf count = 7165846
core[6] perf count = 4202816
core[7] perf count = 3621566
core[8] perf count = 3247411
I understand this result is measured for one core only, so it looks correct.
But when I use cnt++ instead of the sleep call, the results are totally different.
core[1] perf count = 5735
core[2] perf count = 74244
core[3] perf count = 57295
core[4] perf count = 2976047
core[5] perf count = 35821
core[6] perf count = 2112339
core[7] perf count = 10487
core[8] perf count = 3885038
I can't find any pattern in these results.
Does anyone know what is going on here?
I referred to this site:
http://man7.org/linux/man-pages/man2/perf_event_open.2.html#EXAMPLE
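For reference, the question does not show how pc_c was opened; presumably it follows the man-page example linked above. The following sketch is an assumption reconstructed from that example, not code from the question:

// Assumed setup for pc_c, following the perf_event_open(2) EXAMPLE section.
// Needs <linux/perf_event.h>, <sys/syscall.h>, <string.h>, <unistd.h>.
struct perf_event_attr pe;
memset(&pe, 0, sizeof(pe));
pe.type = PERF_TYPE_HARDWARE;
pe.size = sizeof(pe);
pe.config = PERF_COUNT_HW_INSTRUCTIONS;   // count retired instructions
pe.disabled = 1;                          // start disabled, enable via ioctl
pe.exclude_kernel = 1;
pe.exclude_hv = 1;
// perf_event_open has no glibc wrapper; it is invoked via syscall(2).
pc_c = syscall(__NR_perf_event_open, &pe, 0 /* this process */,
               -1 /* any CPU */, -1 /* no group */, 0);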

Related

omp parallel for loop (reduction to find max) ran slower than serial codes

I am new to OpenMP.
I thought that using the max reduction clause to find the maximum element of an array would not be such a bad idea, but in fact the parallel for loop ran much slower than the serial one.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <omp.h>

int main() {
    double sta, end;
    int bsize = 46000;
    int q = bsize;
    int max_val = 0;
    double *buffer = (double*)malloc(bsize*sizeof(double));
    srand(time(NULL));
    for(int i = 0; i < q; i++)
        buffer[i] = rand()%10000;

    sta = omp_get_wtime();
    #pragma omp parallel for reduction(max : max_val)
    for(int i = 0; i < q; i++)
    {
        max_val = max_val > buffer[i] ? max_val : buffer[i];
    }
    end = omp_get_wtime();
    printf("parallel maximum time %f\n", end-sta);

    sta = omp_get_wtime();
    for(int i = 0; i < q; i++)
    {
        max_val = max_val > buffer[i] ? max_val : buffer[i];
    }
    end = omp_get_wtime();
    printf("serial maximum time %f\n", end-sta);

    free(buffer);
    return 0;
}
Compile command
gcc-7 kp_omp.cpp -o kp_omp -fopenmp
Execution results
./kp_omp
parallel maximum time 0.000505
serial maximum time 0.000266
As for the CPU, it is an Intel Core i7-6700 with 8 cores.
Whenever you parallelise a loop, OpenMP needs to perform some extra work, for example creating the threads. These operations incur some overhead, which in turn implies that, for each loop, there is a minimum number of iterations below which parallelising is not worthwhile.
If I execute your code I obtain the same results you have:
./kp_omp
parallel maximum time 0.000570
serial maximum time 0.000253
However, if I modify bsize so that
int bsize = 100000;
I obtain
./kp_omp
parallel maximum time 0.000323
serial maximum time 0.000552
So the parallelised version became faster than the sequential one. Part of the challenge in speeding up a code is understanding when it is worth parallelising and when the overhead of parallelisation would kill your expected gain in performance.
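To get a feel for that overhead, one can time an (almost) empty parallel region; this is a rough sketch under the assumption that 1000 repetitions are enough to average out noise:

#include <stdio.h>
#include <omp.h>

int main(void) {
    /* Time an (almost) empty parallel region to estimate the
       per-region cost of spawning/waking the thread team. */
    double t = omp_get_wtime();
    for (int rep = 0; rep < 1000; rep++) {
        #pragma omp parallel
        {
            /* no work: the measured time is mostly OpenMP overhead */
        }
    }
    t = omp_get_wtime() - t;
    printf("average parallel-region overhead: %g s\n", t / 1000);
    return 0;
}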

MPI Latency measuring

I am trying to understand some aspects of MPI.
While writing a program that measures the send/recv latency between two processes, I came across some strange effects.
I measured the average over many iterations and got a result that matches other benchmarks. Then I decided to print the value after each iteration and was surprised: the values alternated between four numbers that never changed. I also noticed some occasional very large values.
The code that calculates the latency, and some sample values, are below:
#include <mpi.h>
#include <cstdio>

void latency_test(int Proc_Rank, int Iterations_Num, int Size);

int main()
{
    MPI::Init();
    int Proc_Rank = MPI::COMM_WORLD.Get_rank();
    for(int i = 0; i < 100; ++i)
        latency_test(Proc_Rank, 1, 0);
    MPI::Finalize();
    return 0;
}

void latency_test(int Proc_Rank, int Iterations_Num, int Size)
{
    double Total_Time, Latency;
    double t1, t2;
    char *Send_Buffer = new char[Size];
    char *Recv_Buffer = new char[Size];
    for(int i = 0; i < Size; i++){
        Send_Buffer[i] = 'a';
    }
    for(int i = 0; i < Size; i++){
        Recv_Buffer[i] = 'b';
    }
    MPI::COMM_WORLD.Barrier();
    t1 = MPI::Wtime();
    for(int i = 0; i < Iterations_Num; i++){
        if (Proc_Rank == 0){
            MPI::COMM_WORLD.Send(Send_Buffer, Size, MPI::CHAR, 1, 0);
            MPI::COMM_WORLD.Recv(Recv_Buffer, Size, MPI::CHAR, 1, MPI::ANY_TAG);
        }
        else if (Proc_Rank == 1){
            MPI::COMM_WORLD.Recv(Recv_Buffer, Size, MPI::CHAR, 0, MPI::ANY_TAG);
            MPI::COMM_WORLD.Send(Send_Buffer, Size, MPI::CHAR, 0, 0);
        }
    }
    t2 = MPI::Wtime();
    delete [] Send_Buffer;
    delete [] Recv_Buffer;
    Total_Time = t2 - t1;
    if(Proc_Rank == 0){
        Latency = (Total_Time / (Iterations_Num * 2.0)) * 1000000.0;
        printf("%10.10f\n", Latency);
    }
}
Part of the result:
5.4836273193
1.0728836060
0.9536743164
1.0728836060
0.4768371582
0.9536743164
0.5960464478
6.5565109253
0.9536743164
0.9536743164
1.0728836060
0.5960464478
0.4768371582
0.4768371582
Why do four fixed values repeat seemingly at random? And why are there occasional very large values?
As pointed out by Zulan, the resolution of the timer used by MPI_Wtime is not infinite. You can query the timer resolution by calling MPI_Wtick (MPI::Wtick in the C++ bindings). Measuring a single ping-pong round that lasts less than a microsecond is prone to very high statistical uncertainty, especially since the OS jitter, which is the random delay of the process execution due to other OS activities or processes being scheduled on the same CPU, could be several microseconds. No respectable MPI benchmark would do a single ping-pong round with empty messages.
As a side note, you are using a wildcard receive (MPI_ANY_TAG) in one of the processes. Those tend to be slower than fully-specified receives, especially when it comes to network equipment.
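As a sketch of what that implies for the code in the question, one could query the timer resolution and time many ping-pong rounds in a single block (the iteration count of 100000 is an arbitrary assumption):

int main()
{
    MPI::Init();
    int Proc_Rank = MPI::COMM_WORLD.Get_rank();
    double tick = MPI::Wtick();                 // timer resolution in seconds
    if (Proc_Rank == 0)
        printf("MPI::Wtick() = %.9f s\n", tick);
    // One call, many ping-pong rounds: the latency printed by latency_test
    // is now an average, well above the timer granularity.
    latency_test(Proc_Rank, 100000, 0);
    MPI::Finalize();
    return 0;
}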

OpenCL slow -- not sure why

I'm teaching myself OpenCL by trying to optimize the mpeg4dst reference audio encoder. I achieved a 3x speedup by using vector instructions on the CPU, but I figured the GPU could probably do better.
I'm focusing on computing auto-correlation vectors in OpenCL as my first area of improvement. The CPU code is:
for (int i = 0; i < NrOfChannels; i++) {
    for (int shift = 0; shift <= PredOrder[ChannelFilter[i]]; shift++)
        vDSP_dotpr(Signal[i] + shift, 1, Signal[i], 1, &out, NrOfChannelBits - shift);
}
NrOfChannels = 6
PredOrder = 129
NrOfChannelBits = 150528.
On my test file, this function takes approximately 188 ms to complete.
Here's my OpenCL method:
kernel void calculateAutocorrelation(size_t offset,
                                     global const float *input,
                                     global float *output,
                                     size_t size) {
    size_t index = get_global_id(0);
    size_t end = size - index;
    float sum = 0.0;
    for (size_t i = 0; i < end; i++)
        sum += input[i + offset] * input[i + offset + index];
    output[index] = sum;
}
This is how it is called:
gcl_memcpy(gpu_signal_in, Signal, sizeof(float) * NrOfChannels * MAXCHBITS);
for (int i = 0; i < NrOfChannels; i++) {
    size_t sz = PredOrder[ChannelFilter[i]] + 1;
    cl_ndrange range = { 1, { 0, 0, 0 }, { sz, 0, 0 }, { 0, 0, 0 } };
    calculateAutocorrelation_kernel(&range, i * MAXCHBITS, (cl_float *)gpu_signal_in, (cl_float *)gpu_out, NrOfChannelBits);
    gcl_memcpy(out, gpu_out, sizeof(float) * sz);
}
According to Instruments, my OpenCL implementation seems to take about 13ms, with about 54ms of memory copy overhead (gcl_memcpy).
When I use a much larger test file (1 minute of 2-channel music vs. 1 second of 6-channel), the measured performance of the OpenCL code seems to stay the same, but CPU usage falls to about 50% and the whole program takes about twice as long to run.
I can't find a cause for this in Instruments and I haven't read anything yet that suggests that I should expect very heavy overhead switching in and out of OpenCL.
If I'm reading your kernel code correctly, each work item iterates over all of the data from its location to the end. This isn't going to be efficient. For one (and this is the primary performance concern), the memory accesses won't be coalesced, so they won't run at full memory bandwidth. Secondly, because each work item has a different amount of work, there will be branch divergence within a work-group, which will leave some threads idle while they wait for others.
This seems like it has a lot in common with a reduction problem and I'd suggest reading up on "parallel reduction" to get some hints about doing an operation like this in parallel.
To see how memory is being read, work out how 16 work items (say, global_id 0 to 15) will be reading data for each step.
Note that if every work item in a work-group accesses the same memory location, there is a "broadcast" optimization the hardware can make. So just reversing the order of your loop could improve things.
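As a rough illustration of the "parallel reduction" idea (the kernel name, argument list, and launch layout are assumptions, not the original code), each work-group could compute one shift using coalesced strided loads and a local-memory tree reduction:

// Sketch: one work-group per shift value; assumes get_local_size(0) is a power of two.
kernel void autocorrShift(global const float *input,   // one channel's signal
                          global float *output,        // one result per shift
                          uint offset,                 // start of the channel
                          uint n,                      // NrOfChannelBits
                          local float *scratch) {      // one float per work item
    size_t lid   = get_local_id(0);
    size_t lsz   = get_local_size(0);
    size_t shift = get_group_id(0);                    // one work-group per shift

    // Strided loop: neighbouring work items read neighbouring elements,
    // so global memory accesses are coalesced.
    float sum = 0.0f;
    for (size_t i = lid; i < n - shift; i += lsz)
        sum += input[offset + i] * input[offset + i + shift];

    // Tree reduction in local memory.
    scratch[lid] = sum;
    barrier(CLK_LOCAL_MEM_FENCE);
    for (size_t s = lsz / 2; s > 0; s >>= 1) {
        if (lid < s)
            scratch[lid] += scratch[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (lid == 0)
        output[shift] = scratch[0];
}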

printf performance issue in openmp

I have been told not to use printf in OpenMP programs because it degrades the performance of a parallel simulation program.
I want to know what the substitute for it is, i.e. how to display the output of a program without using printf.
I have the following AES-128 simulation problem using OpenMP which needs further comments:
Parallel simulation of AES in C using Openmp
I want to know how to output the cipher text without degrading the simulation performance.
Thanks in advance.
You cannot have your cake and eat it too. Decide if you want great parallel performance or if it's important to see the output of the algorithm while the parallel loop is running.
The obvious offline solution is to store the plaintexts, keys, and ciphertexts in arrays. In your case that would require 119 MiB (= 650000*(3*4*16) bytes) in the original case and only 12 MiB in the case with 65000 trials. Nothing that a modern machine with GiBs of RAM cannot handle. The latter case even fits in the last-level cache of some server-class CPUs.
#define TRIALS 65000

int (*key)[16];
int (*pt)[16];
int (*ct)[16];
double timer;

key = malloc(TRIALS * sizeof(*key));
pt  = malloc(TRIALS * sizeof(*pt));
ct  = malloc(TRIALS * sizeof(*ct));

timer = -omp_get_wtime();
#pragma omp parallel for private(rnd,j)
for(i = 0; i < TRIALS; i++)
{
    ...
    for(j = 0; j < 4; j++)
    {
        key[i][4*j]   = (rnd[j] & 0xff);
        pt[i][4*j]    = key[i][4*j];
        key[i][4*j+1] = ((rnd[j] >> 8) & 0xff);
        pt[i][4*j+1]  = key[i][4*j+1];
        key[i][4*j+2] = ((rnd[j] >> 16) & 0xff);
        pt[i][4*j+2]  = key[i][4*j+2];
        key[i][4*j+3] = ((rnd[j] >> 24) & 0xff);
        pt[i][4*j+3]  = key[i][4*j+3];
    }
    encrypt(key[i], pt[i], ct[i]);
}
timer += omp_get_wtime();

printf("Encryption took %.6f seconds\n", timer);

// Now display the results serially
for (i = 0; i < TRIALS; i++)
{
    display pt[i], key[i] -> ct[i]
}

free(key); free(pt); free(ct);
To see the speed-up, you have to measure only the time spent in the parallel region. If you also measure the time it takes to display the results, you will be back to where you started.

OpenMP program is slower than sequential one

When I try the following code
double start = omp_get_wtime();
long i;
#pragma omp parallel for
for (i = 0; i <= 1000000000; i++) {
    double x = rand();
}
double end = omp_get_wtime();
printf("%f\n", end - start);
Execution time is about 168 seconds, while the sequential version takes only 20 seconds.
I'm still a newbie in parallel programming. How can I get a parallel version that is faster than the sequential one?
The random number generator rand(3) uses global state variables (hidden inside the (g)libc implementation). Accessing them from multiple threads leads to cache problems and is also not thread safe. You should use the rand_r(3) call with a seed parameter that is private to each thread:
long i;
unsigned seed;

#pragma omp parallel private(seed)
{
    // Initialise the random number generator with different seed in each thread
    // The following constants are chosen arbitrarily... use something more sensible
    seed = 25234 + 17*omp_get_thread_num();
    #pragma omp for
    for (i = 0; i <= 1000000000; i++) {
        double x = rand_r(&seed);
    }
}
Note that this will produce a different stream of random numbers when executed in parallel than when executed serially. I would also recommend erand48(3) as a better (pseudo-)random number source.
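A minimal sketch of the erand48(3) variant (the seeding values here are arbitrary assumptions, as in the example above):

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        /* Per-thread 48-bit state, seeded differently in each thread
           (the seeding scheme here is an arbitrary example). */
        unsigned short xsubi[3];
        xsubi[0] = 0x330E;
        xsubi[1] = (unsigned short)omp_get_thread_num();
        xsubi[2] = 0x1234;
        double x = 0.0;
        #pragma omp for
        for (long i = 0; i <= 1000000000; i++) {
            x += erand48(xsubi);   /* uniform double in [0.0, 1.0) */
        }
        /* Use x so the compiler cannot optimise the loop away. */
        printf("thread %d sum = %f\n", omp_get_thread_num(), x);
    }
    return 0;
}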
