Performance with thread pooling in OpenMP

I have the following code:
float arr[1000];
for (int j = 0; j < 1000; ++j)
{
    #pragma omp parallel num_threads(t1)
    {
        #pragma omp for
        for (int i = 0; i < 1000; ++i) arr[i] = std::pow(i, 2);
    }
    #pragma omp parallel num_threads(t2)
    {
        #pragma omp for
        for (int i = 0; i < 1000; ++i) arr[i] = std::pow(i, 2);
    }
}
for t1=1, t2=36 time is 160
for t1=2, t2=36 time is 3800
for t1=8, t2=36 time is 2020
for t1=18, t2=36 time is 1370
for t1=24, t2=36 time is 880
for t1=30, t2=36 time is 450
for t1=36, t2=36 time is 146
Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz, cores per socket: 18, virtualization: VT-x, sockets: 2, L1d cache: 32K, L1i cache: 32K, L2 cache: 256K, L3 cache: 46080K, CentOS 7, gcc 4.8.3
The time around the main for() loop was measured with rdtsc or with omp_get_wtime; the trend is very similar with either.
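For reference, a minimal self-contained sketch of the omp_get_wtime variant of the measurement (the thread counts here are just example values):

#include <omp.h>
#include <cmath>
#include <cstdio>

int main()
{
    const int t1 = 1, t2 = 36;              // thread counts under test
    static float arr[1000];

    double start = omp_get_wtime();         // wall-clock time before the outer loop
    for (int j = 0; j < 1000; ++j)
    {
        #pragma omp parallel num_threads(t1)
        {
            #pragma omp for
            for (int i = 0; i < 1000; ++i) arr[i] = std::pow(i, 2);
        }
        #pragma omp parallel num_threads(t2)
        {
            #pragma omp for
            for (int i = 0; i < 1000; ++i) arr[i] = std::pow(i, 2);
        }
    }
    double end = omp_get_wtime();           // wall-clock time after the outer loop

    std::printf("elapsed: %f s (arr[1]=%f)\n", end - start, arr[1]);
    return 0;
}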
Is there any problem with thread pooling between consecutive parallel regions (teams) in OpenMP?
Thanks in advance.

Related

Why do I get 128 teams/thread blocks and 96 threads in each team/thread block using #pragma omp target teams distribute parallel for in OpenMP?

I am running this code on Ubuntu 18.04, with the clang/LLVM compiler and an Nvidia GTX 1070 GPU:
#pragma omp target data map(to: A,B) map(from: C)
{
    #pragma omp target teams distribute
    for (int n = 0; n < Row; n++)
    {
        int team_id = omp_get_team_num();
        #pragma omp parallel for default(shared) schedule(auto)
        for (int j = 0; j < Col; j++)
        {
            int thread_id = omp_get_thread_num();
            printf("Iteration= c[ %d ][ %d ], Team=%d, Thread=%d\n", n, j, team_id, thread_id);
            C[n][j] = A[n][j] + B[n][j];
        }
    }
}
In the above code, the maximum value printed for team_id is 127 and for thread_id is 95.
compile flags: clang++ -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -Xopenmp-target -march=sm_61 -Wall -O3 debug.cpp -o debug
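For reference, the number of teams and threads per team can be capped explicitly with the num_teams and thread_limit clauses. Below is a small self-contained sketch along the lines of the code above (the matrix size and the limits 128/96 are illustrative; the runtime may still grant fewer):

#include <cstdio>

int main()
{
    const int Row = 256, Col = 256;                 // illustrative sizes
    static float A[256][256], B[256][256], C[256][256];

    #pragma omp target data map(to: A, B) map(from: C)
    {
        // Request at most 128 teams with at most 96 threads each; the ids
        // printed by omp_get_team_num()/omp_get_thread_num() reflect what
        // the runtime actually granted, which may be less.
        #pragma omp target teams distribute num_teams(128) thread_limit(96)
        for (int n = 0; n < Row; n++)
        {
            #pragma omp parallel for
            for (int j = 0; j < Col; j++)
                C[n][j] = A[n][j] + B[n][j];
        }
    }
    std::printf("C[0][0] = %f\n", C[0][0]);
    return 0;
}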

What does the "page-faults" in `perf stat` mean with respect to "backend cycles idle"?

I wrote a simple matrix multiplication programme to demonstrate the effects of caching and to measure performance via perf. The two functions for comparing the relative efficiencies are sqmat_mult and sqmat_mult_efficient:
void sqmat_mult(int x, const int a[x][x], const int b[x][x], int m[x][x])
{
    for (int i = 0; i < x; i++) {
        for (int j = 0; j < x; j++) {
            int sum = 0;
            for (int k = 0; k < x; k++) {
                sum += a[i][k] * b[k][j]; // access of b array is non sequential
            }
            m[i][j] = sum;
        }
    }
}

static void sqmat_transpose(int x, int a[x][x])
{
    for (int i = 0; i < x; i++) {
        for (int j = i+1; j < x; j++) {
            int temp = a[i][j];
            a[i][j] = a[j][i];
            a[j][i] = temp;
        }
    }
}

void sqmat_mult_efficient(int x, const int a[x][x], int b[x][x], int m[x][x])
{
    sqmat_transpose(x, b);
    for (int i = 0; i < x; i++) {
        for (int j = 0; j < x; j++) {
            int sum = 0;
            for (int k = 0; k < x; k++) {
                sum += a[i][k] * b[j][k]; // access of b array is sequential
            }
            m[i][j] = sum;
        }
    }
    sqmat_transpose(x, b);
}
However, when I run perf stat based on these two functions, I am confused about the "page-faults" statistics. The output for sqmat_mult is:
428374.070363 task-clock (msec) # 1.000 CPUs utilized
128 context-switches # 0.000 K/sec
127 cpu-migrations # 0.000 K/sec
12,334 page-faults # 0.029 K/sec
8,63,12,33,75,858 cycles # 2.015 GHz (83.33%)
2,89,73,31,370 stalled-cycles-frontend # 0.34% frontend cycles idle (83.33%)
7,90,36,10,13,864 stalled-cycles-backend # 91.57% backend cycles idle (33.34%)
2,24,41,64,76,049 instructions # 0.26 insn per cycle
# 3.52 stalled cycles per insn (50.01%)
8,84,30,79,219 branches # 20.643 M/sec (66.67%)
1,04,85,342 branch-misses # 0.12% of all branches (83.34%)
428.396804101 seconds time elapsed
While the output for sqmat_mult_efficient is:
39876.726670 task-clock (msec) # 1.000 CPUs utilized
11 context-switches # 0.000 K/sec
11 cpu-migrations # 0.000 K/sec
12,334 page-faults # 0.309 K/sec
1,19,87,36,75,397 cycles # 3.006 GHz (83.33%)
49,19,07,231 stalled-cycles-frontend # 0.41% frontend cycles idle (83.33%)
50,40,53,90,525 stalled-cycles-backend # 42.05% backend cycles idle (33.35%)
2,24,10,77,52,356 instructions # 1.87 insn per cycle
# 0.22 stalled cycles per insn (50.01%)
8,75,27,87,494 branches # 219.496 M/sec (66.68%)
50,48,390 branch-misses # 0.06% of all branches (83.33%)
39.879816492 seconds time elapsed
With the transpose-based multiplication, the "backend cycles idle" is considerably reduced, as expected. This is due to the low wait time of fetching coherent memory. What confuses me is that the number of "page-faults" is the same. At first I thought this might be a context switching issue, so I made the programme run with scheduling SCHED_FIFO and priority 1. The number of context switches decreased from roughly 800 to 11, but the "page-faults" remained exactly the same as before.
The dimension of the matrix I am using is 2048 x 2048, so the entire array will not fit in a 4K page. Nor does it fit in my entire cache (4 MiB in my case), since the size of the three matrices (3*2048*2048*sizeof(int) == 48 MiB) far exceeds the total available cache. Assuming that "page-faults" here refers to TLB misses or cache misses, I don't see how the two algorithms can have the exact same number. I understand that a lot of the efficiency improvement I see comes from L1 cache hits, but that does not mean that RAM-to-cache transfers have no role to play.
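(For concreteness: in sqmat_mult the inner-loop access b[k][j] jumps by x * sizeof(int) = 2048 * 4 bytes = 8 KiB per increment of k, so every access lands on a different cache line and a different 4 KiB page, while b[j][k] in the transposed version walks memory 4 bytes at a time.)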
I also did another experiment to see if the "page-faults" scale with the array size -- they do, as expected.
So my questions are -
Am I correct in assuming that "page-faults" refer to a TLB miss or a cache-miss?
How come the "page-faults" are exactly the same for both algorithms?
Am I correct in assuming that "page-faults" refer to a TLB miss or a cache-miss?
No. A page fault happens when the virtual memory is not mapped to physical memory. There are two typical¹ reasons for this:
Data was moved from physical memory to disk (swapping). When the page fault happens, the data will be moved back to physical memory and a mapping from virtual memory will be restored.
A memory region was lazily allocated. Only upon the first access is physical memory actually allocated and mapped to the corresponding virtual memory.
Now a page fault will probably² only happen after a cache miss and a TLB miss, because the TLB is only storing actual mappings and a cache will only contain data that is also in the main memory.
Here is a simplified chart of what's happening:
[Simplified chart of page fault handling -- image by Aravind krishna, CC BY-SA 4.0, from Wikimedia Commons.]
How come the "page-faults" are exactly the same for both algorithms?
All your data fits into the physical memory, so there is no swapping in your case. This leaves you with 2), which happens once for each virtual page. You use 48 MiB of memory, which uses 12288 pages of 4 kiB size. This is the same for both algorithms and almost exactly what you measure.
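To make point 2) concrete, here is a minimal sketch (not your programme, just a plain allocation demo, assuming the allocator maps large allocations lazily) showing that the fault count tracks the number of distinct 4 KiB pages touched, not the access pattern:

#include <cstddef>
#include <cstdio>

int main()
{
    const size_t bytes = 48u << 20;        // 48 MiB, as in the question
    // A large allocation is typically served by mmap and mapped lazily,
    // so no physical pages are committed yet.
    char *p = new char[bytes];
    // The first touch of each 4 KiB page triggers one minor page fault,
    // roughly 48 MiB / 4 KiB = 12288 faults in total.
    for (size_t i = 0; i < bytes; i += 4096)
        p[i] = 1;
    std::printf("touched %zu pages\n", bytes / 4096);
    delete[] p;
    return 0;
}

Running this under perf stat -e page-faults should report on the order of 12288 faults, independent of the order in which the pages are later read or written.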
¹ Additional reasons can come up with memory-mapped IO or NUMA balancing.
² I suppose you could build an architecture where the TLB contains information about unmapped virtual addresses, or where a cache can hold data that is not also present in main memory, but I doubt that such an architecture exists.

Intel VTune Results Understanding - Naive Questions

The application I want to speed up performs element-wise processing of a large array (about 1e8 elements).
The processing procedure for each element is very simple and I suspect that the bottleneck could be DRAM bandwidth rather than the CPU.
So I decided to study the single-threaded version first.
The system is: Windows 10 64bit, 32 GB RAM, Intel Core i7-3770S Ivybridge 1.10 GHz 4 cores, Hyperthreading enabled
Concurrency Analysis
Elapsed Time: 34.425s
    CPU Time: 14.908s
        Effective Time: 14.908s
            Idle: 0.005s
            Poor: 14.902s
            Ok: 0s
            Ideal: 0s
            Over: 0s
        Spin Time: 0s
        Overhead Time: 0s
    Wait Time: 0.000s
        Idle: 0.000s
        Poor: 0s
        Ok: 0s
        Ideal: 0s
        Over: 0s
    Total Thread Count: 2
    Paused Time: 18.767s
Memory Access Analysis
Memory Access Analysis provides different CPU times for three consecutive runs on the same amount of data.
Actual execution time was about 23 seconds, as the Concurrency Analysis says.
Elapsed Time: 33.526s
    CPU Time: 5.740s
    Memory Bound: 38.3%
        L1 Bound: 10.4%
        L2 Bound: 0.0%
        L3 Bound: 0.1%
        DRAM Bound: 0.8%
            Memory Bandwidth: 36.1%
            Memory Latency: 60.4%
    Loads: 12,912,960,000
    Stores: 7,720,800,000
    LLC Miss Count: 420,000
    Average Latency (cycles): 15
    Total Thread Count: 4
    Paused Time: 18.081s

Elapsed Time: 33.011s
    CPU Time: 4.501s
    Memory Bound: 36.9%
        L1 Bound: 10.6%
        L2 Bound: 0.0%
        L3 Bound: 0.2%
        DRAM Bound: 0.6%
            Memory Bandwidth: 36.5%
            Memory Latency: 62.7%
    Loads: 9,836,100,000
    Stores: 5,876,400,000
    LLC Miss Count: 180,000
    Average Latency (cycles): 15
    Total Thread Count: 4
    Paused Time: 17.913s

Elapsed Time: 33.738s
    CPU Time: 5.999s
    Memory Bound: 38.5%
        L1 Bound: 10.8%
        L2 Bound: 0.0%
        L3 Bound: 0.1%
        DRAM Bound: 0.9%
            Memory Bandwidth: 57.8%
            Memory Latency: 37.3%
    Loads: 13,592,760,000
    Stores: 8,125,200,000
    LLC Miss Count: 660,000
    Average Latency (cycles): 15
    Total Thread Count: 4
    Paused Time: 18.228s
As far as I understand the Summary Page, the situation is not very good.
The paper Finding your Memory Access performance bottlenecks says that the reason is so-called false sharing. But I do not use multithreading; all processing is performed by just one thread.
On the other hand, according to the Memory Access Analysis/Platform page, DRAM bandwidth is not the bottleneck.
So the questions are
Why are the CPU Time metric values different for Concurrency Analysis and Memory Access Analysis?
What is the reason for the bad memory metric values, especially L1 Bound?
The main loop is a lambda function, where:
tasklets: a std::vector of simple structures that contain coefficients for data processing
points: the data itself, an Eigen::Matrix
projections: an Eigen::Matrix to put the results of processing into
The code is:
#include <iostream>
#include <future>
#include <random>
#include <sstream>   // for istringstream
#include <Eigen/Dense>
#include <ittnotify.h>

using namespace std;

using Vector3 = Eigen::Matrix<float, 3, 1>;
using Matrix3X = Eigen::Matrix<float, 3, Eigen::Dynamic>;

uniform_real_distribution<float> rnd(0.1f, 100.f);
default_random_engine gen;

class Tasklet {
public:
    Tasklet(int p1, int p2)
        :
        p1Id(p1), p2Id(p2), Loc0(p1)
    {
        RestDistance = rnd(gen);
        Weight_2 = rnd(gen);
    }

    __forceinline void solve(const Matrix3X& q, Matrix3X& p)
    {
        Vector3 q1 = q.col(p1Id);
        Vector3 q2 = q.col(p2Id);
        for (int i = 0; i < 0; ++i) {
            Vector3 delta = q2 - q1;
            float norm = delta.blueNorm() * delta.hypotNorm();
        }
        Vector3 deltaQ = q2 - q1;
        float dist = deltaQ.norm();
        Vector3 deltaUnitVector = deltaQ / dist;
        p.col(Loc0) = deltaUnitVector * RestDistance * Weight_2;
    }

    int p1Id;
    int p2Id;
    int Loc0;
    float RestDistance;
    float Weight_2;
};

typedef vector<Tasklet*> TaskList;

void
runTest(const Matrix3X& points, Matrix3X& projections, TaskList& tasklets)
{
    size_t num = tasklets.size();
    for (size_t i = 0; i < num; ++i) {
        Tasklet* t = tasklets[i];
        t->solve(points, projections);
    }
}

void
prepareData(Matrix3X& points, Matrix3X& projections, int numPoints, TaskList& tasklets)
{
    points.resize(3, numPoints);
    projections.resize(3, numPoints);
    points.setRandom();
    /*
    for (int i = 0; i < numPoints; ++i) {
        points.col(i) = Vector3(1, 0, 0);
    }
    */
    tasklets.reserve(numPoints - 1);
    for (int i = 1; i < numPoints; ++i) {
        tasklets.push_back(new Tasklet(i - 1, i));
    }
}

int
main(int argc, const char** argv)
{
    // Pause VTune data collection
    __itt_pause();
    cout << "Usage: <exefile> <number of points (in thousands)> <#runs for averaging>" << endl;
    int numPoints = 150 * 1000;
    int numRuns = 1;
    int argNo = 1;
    if (argc > argNo) {
        istringstream in(argv[argNo]);
        int i;
        in >> i;
        if (in) {
            numPoints = i * 1000;
        }
    }
    ++argNo;
    if (argc > argNo) {
        istringstream in(argv[argNo]);
        int i;
        in >> i;
        if (in) {
            numRuns = i;
        }
    }
    cout
        << "Running test" << endl
        << "\t NumPoints (thousands): " << numPoints / 1000. << endl
        << "\t # of runs for averaging: " << numRuns << endl;
    Matrix3X q, projections;
    TaskList tasklets;
    cout << "Preparing test data" << endl;
    prepareData(q, projections, numPoints, tasklets);
    cout << "Running test" << endl;
    // Resume VTune data collection
    __itt_resume();
    for (int r = 0; r < numRuns; ++r) {
        runTest(q, projections, tasklets);
    }
    // Pause VTune data collection
    __itt_pause();
    for (auto* t : tasklets) {
        delete t;
    }
    return 0;
}
Thank you.

openMP reduction and thread number control

I use OpenMP as:
#pragma omp parallel for reduction(+:average_stroke_width)
for(int i = 0; i < GB_name.size(); ++i) {...}
I know I can use:
#pragma omp parallel for num_threads(thread)
for(int index = 0; index < GB_name.size(); ++index){...}
How can I control the thread number when I use reduction?
How can I control the thread number when I use reduction?
Both clauses can be used together:
#pragma omp parallel for reduction(+:average_stroke_width) num_threads(thread)
for(int i = 0; i < GB_name.size(); ++i) {...}
Note that reduction involves all threads, so you cannot have a parallel loop with 8 threads and then perform a reduction with 4 threads only. Reduction combines the local values in all threads and therefore all of them need to participate.
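A minimal self-contained sketch of the combined clauses (the data and the thread count of 4 are purely illustrative):

#include <cstdio>
#include <vector>

int main()
{
    std::vector<float> stroke_widths(1000, 2.0f);     // stand-in for the real data
    float average_stroke_width = 0.0f;

    // Each of the 4 threads accumulates into a private copy of the reduction
    // variable; at the end of the loop all private copies are combined.
    #pragma omp parallel for reduction(+:average_stroke_width) num_threads(4)
    for (int i = 0; i < (int)stroke_widths.size(); ++i) {
        average_stroke_width += stroke_widths[i];
    }
    average_stroke_width /= stroke_widths.size();

    std::printf("average stroke width: %f\n", average_stroke_width);
    return 0;
}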

Low performance in an OpenMP program

I am trying to understand an OpenMP code from here. You can see the code below.
In order to measure the speedup (the difference between the serial and OMP versions) I use time.h; is this approach right?
The program runs on a 4-core machine. I specify export OMP_NUM_THREADS="4" but cannot see a substantial speedup; usually I get 1.2 - 1.7. Which problems am I facing in this parallelization?
Which debug/performance tool could I use to see the loss of performance?
Code (for compilation I use xlc_r -qsmp=omp omp_workshare1.c -o omp_workshare1.exe):
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define CHUNKSIZE 1000000
#define N 100000000

int main (int argc, char *argv[])
{
    int nthreads, tid, i, chunk;
    float a[N], b[N], c[N];
    unsigned long elapsed;
    unsigned long elapsed_serial;
    unsigned long elapsed_omp;
    struct timeval start;
    struct timeval stop;
    chunk = CHUNKSIZE;

    // ================= SERIAL start =======================
    /* Some initializations */
    for (i=0; i < N; i++)
        a[i] = b[i] = i * 1.0;

    gettimeofday(&start,NULL);
    for (i=0; i<N; i++)
    {
        c[i] = a[i] + b[i];
        //printf("Thread %d: c[%d]= %f\n",tid,i,c[i]);
    }
    gettimeofday(&stop,NULL);

    elapsed = 1000000 * (stop.tv_sec - start.tv_sec);
    elapsed += stop.tv_usec - start.tv_usec;
    elapsed_serial = elapsed ;
    printf (" \n Time SEQ= %lu microsecs\n", elapsed_serial);
    // ================= SERIAL end =======================

    // ================= OMP start =======================
    /* Some initializations */
    for (i=0; i < N; i++)
        a[i] = b[i] = i * 1.0;

    gettimeofday(&start,NULL);
    #pragma omp parallel shared(a,b,c,nthreads,chunk) private(i,tid)
    {
        tid = omp_get_thread_num();
        if (tid == 0)
        {
            nthreads = omp_get_num_threads();
            printf("Number of threads = %d\n", nthreads);
        }
        //printf("Thread %d starting...\n",tid);

        #pragma omp for schedule(static,chunk)
        for (i=0; i<N; i++)
        {
            c[i] = a[i] + b[i];
            //printf("Thread %d: c[%d]= %f\n",tid,i,c[i]);
        }
    } /* end of parallel section */
    gettimeofday(&stop,NULL);

    elapsed = 1000000 * (stop.tv_sec - start.tv_sec);
    elapsed += stop.tv_usec - start.tv_usec;
    elapsed_omp = elapsed ;
    printf (" \n Time OMP= %lu microsecs\n", elapsed_omp);
    // ================= OMP end =======================

    printf (" \n speedup= %f \n\n", ((float) elapsed_serial) / ((float) elapsed_omp)) ;
}
There's nothing really wrong with the code as above, but your speedup is going to be limited by the fact that the main loop, c=a+b, has very little work -- the time required to do the computation (a single addition) is going to be dominated by memory access time (2 loads and one store), and there's more contention for memory bandwidth with more threads acting on the array.
We can test this by making the work inside the loop more compute-intensive:
c[i] = exp(sin(a[i])) + exp(cos(b[i]));
And then we get
$ ./apb
Time SEQ= 17678571 microsecs
Number of threads = 4
Time OMP= 4703485 microsecs
speedup= 3.758611
which is obviously a lot closer to the 4x speedup one would expect.
Update: Oh, and to the other questions -- gettimeofday() is probably fine for timing, and on a system where you're using xlc - is this AIX? In that case, peekperf is a good overall performance tool, and the hardware performance monitors will give you access to memory access times. On x86 platforms, free tools for performance monitoring of threaded code include cachegrind/valgrind for cache performance debugging (not the problem here), scalasca for general OpenMP issues, and OpenSpeedShop is pretty useful, too.
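If you prefer a timer that comes with the OpenMP runtime itself, omp_get_wtime() is a drop-in alternative to gettimeofday() -- a small sketch, with the timed section left as a placeholder:

#include <omp.h>
#include <stdio.h>

int main(void)
{
    double start = omp_get_wtime();    /* wall-clock time in seconds */

    /* ... timed section, e.g. the parallel c = a + b loop ... */

    double stop = omp_get_wtime();
    printf("elapsed: %f seconds\n", stop - start);
    return 0;
}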
