I am creating several tasks in OpenMP, but for some reason the tasks are executed by the same thread. The code has the following pattern:
#pragma omp parallel num_threads(n_threads)
#pragma omp single
while(!found){
/.... operations with k, e, y, dist, u, step
#pragma omp task firstprivate(k, e, y, dist, u, step)
process(k, e, y, dist, u, step, h, out);
}
Is there any way of forcing the OpenMP runtime system to assign the tasks to all threads but the one that creates the task?
Thanks
Edit:
void process(int a[]){
printf("Thread %d processing\n", omp_get_thread_num());
/* Heavy Operations (used sleep to simulate them)*/
sleep(10);
}
int main(){
int a[10];
int i, j;
#pragma omp parallel num_threads(4)
{
#pragma omp single
{
for (j = 0; j < 10; j++) {
/* Compute a */
for (i = 0; i < 10; i++) {
a[i] += 1;
}
#pragma omp task firstprivate(a)
process(a);
}
printf("Thread %d finished preprocessing\n", omp_get_thread_num());
}
}
}
Output:
Thread 1 processing
Thread 1 processing
Thread 1 processing
Thread 1 processing
Thread 1 processing
Thread 1 processing
Thread 1 processing
Thread 1 processing
Thread 1 processing
Thread 1 processing
Thread 1 finished preprocessing
Isn't that the same question as in your other post (Task scheduling points of OpenMP tasks)
The answer to this one here is: no, you cannot control how the OpenMP implementation assigns the tasks to the OpenMP threads of the parallel team.
May I suggest that you post a piece of code that actually compiles? Then it would be easier to check why the code does not behave as you think it should behave.
With the example code added by the OP, here's the behavior on my machine:
$> gcc -fopenmp omp.c
$> ./a.out
Thread 0 finished preprocessing
Thread 0 processing
Thread 3 processing
Thread 2 processing
Thread 1 processing
Thread 0 processing
Thread 3 processing
Thread 2 processing
Thread 1 processing
Thread 0 processing
Thread 3 processing
$> gcc --version
gcc (SUSE Linux) 4.7.2 20130108 [gcc-4_7-branch revision 195012]
Copyright (C) 2012 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$> icc -openmp omp.c
$> ./a.out
Thread 0 finished preprocessing
Thread 0 processing
Thread 3 processing
Thread 2 processing
Thread 1 processing
Thread 0 processing
Thread 3 processing
Thread 2 processing
Thread 1 processing
Thread 0 processing
Thread 3 processing
mklemm-mobl2-vm:~/tmp [0]> icc -V
Intel(R) C Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 14.0.1.106 Build 20131008
Copyright (C) 1985-2013 Intel Corporation. All rights reserved.
What compiler are you using?
Cheers,
-michael
Related
I'm dealing with a routine which I want the first time to be executed by the CPU and every other time by the GPU. This routine contains the loop:
for (k = kb; k <= ke; k++){
for (j = jb; j <= je; j++){
for (i = ib; i <= ie; i++){
...
}}}
I tried with adding #pragma acc loop collapse(3) to the loop and #pragma acc routine(routine) vector just before the calls where I want the GPU to execute the routine. -Minfo=accel doesn't report any message and with Nsight-System I see that the routine is always executed by the CPU so in this way it doesn't work.
Why the compiler is reading neither of the two #pragma?
To follow on to Thomas' answer, here's an example of using the "if" clause:
% cat test.c
#include <stdlib.h>
#include <stdio.h>
void compute(int * Arr, int size, int use_gpu) {
#pragma acc parallel loop copyout(Arr[:size]) if(use_gpu)
for (int i=0; i < size; ++i) {
Arr[i] = i;
}
}
int main() {
int *Arr;
int size;
int use_gpu;
size=1024;
Arr = (int*) malloc(sizeof(int)*size);
// Run on the host
use_gpu=0;
compute(Arr,size,use_gpu);
// Run on the GPU
use_gpu=1;
compute(Arr,size,use_gpu);
free(Arr);
}
% nvc -acc -Minfo=accel test.c
compute:
4, Generating copyout(Arr[:size]) [if not already present]
Generating NVIDIA GPU code
7, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
% setenv NV_ACC_TIME 1
% a.out
Accelerator Kernel Timing data
test.c
compute NVIDIA devicenum=0
time(us): 48
4: compute region reached 1 time
4: kernel launched 1 time
grid: [8] block: [128]
device time(us): total=5 max=5 min=5 avg=5
elapsed time(us): total=331 max=331 min=331 avg=331
4: data region reached 2 times
9: data copyout transfers: 1
device time(us): total=43 max=43 min=43 avg=43
I'm using nvc and set the compiler's runtime profiler (NV_ACC_TIME=1) to show that the kernel is launched only once.
You need to enable OpenACC processing: -acc (with NVHPC tools) or -fopenacc (with GCC), for example, and then you need to use an OpenACC compute construct (parallel, kernels) to actually launch parallel GPU execution (plus host/device memory management, as necessary). For example, you could call your routine from that compute construct, and the routine would annotate the loop nest with OpenACC loop directives, as you've mentioned, to actually make use of the GPU parallelism.
Then, to answer your actual question: the OpenACC compute constructs then support an if clause to specify whether the region will execute on the current device ("GPU") vs. the local thread will execute the region ("CPU").
I am trying to make a fast parallel loop. In each iteration of the loop, I build an array which is costly so I want it distributed over many threads. After the array is built, I use it to update a matrix. Here it gets tricky because the matrix is common to all threads so only 1 thread can modify parts of the matrix at one time, but when I work on the matrix, it turns out I can distribute that work too since I can work on different parts of the matrix at the same time.
Here is what I currently am doing:
#pragma omp parallel for
for (i = 0; i < n; ++i)
{
... build array bi ...
#pragma omp critical
{
update_matrix(A, bi)
}
}
...
subroutine update_matrix(A, b)
{
printf("id0 = %d\n", omp_get_thread_num());
#pragma omp parallel sections
{
#pragma omp section
{
printf("id1 = %d\n", omp_get_thread_num());
modify columns 1 to j of A using b
}
#pragma omp section
{
printf("id2 = %d\n", omp_get_thread_num());
modify columns j+1 to k of A using b
}
}
}
The problem is that the two different sections of the update_matrix() routine are not being parallelized. The output I get looks like this:
id0 = 19
id1 = 0
id2 = 0
id0 = 5
id1 = 0
id2 = 0
...
So the two sections are being executed by the same thread (0). I tried removing the #pragma omp critical in the main loop but it gives the same result. Does anyone know what I'm doing wrong?
#pragma omp parallel sections should not work there because you are already in a parallel part of the code distributed by the #pragma omp prallel for clause. Unless you have enabled nested parallelization with omp_set_nested(1);, the parallel sections clause will be ignored.
Please not that it is not necessarily efficient as spawning new threads has an overhead cost which may not be worth if the update_matrix part is not too CPU intensive.
You have several options:
Forget about that. If the non-critical part of the loop is really what takes most calculations and you already have as many threads as CPUs, spwaning extra threads for a simple operations will do no good. Just remove the parallel sections clause in the subroutine.
Try enable nesting with omp_set_nested(1);
Another option, which comes at the cost of a double synchronization overhead and would be use named critical sections. There may be only one thread in critical section ONE_TO_J and one on critical section J_TO_K so basically up to two threads may update the matrix in parallel. This is costly in term of synchronization overhead.
#pragma omp parallel for
for (i = 0; i < n; ++i)
{
... build array bi ...
update_matrix(A, bi); // not critical
}
...
subroutine update_matrix(A, b)
{
printf("id0 = %d\n", omp_get_thread_num());
#pragma omp critical(ONE_TO_J)
{
printf("id1 = %d\n", omp_get_thread_num());
modify columns 1 to j of A using b
}
#pragma omp critical(J_TO_K)
{
printf("id2 = %d\n", omp_get_thread_num());
modify columns j+1 to k of A using b
}
}
Or use atomic operations to edit the matrix, if this is suitable.
#pragma omp parallel for
for (i = 0; i < n; ++i)
{
... build array bi ...
update_matrix(A, bi); // not critical
}
...
subroutine update_matrix(A, b)
{
float tmp;
printf("id0 = %d\n", omp_get_thread_num());
for (int row=0; row<max_row;row++)
for (int column=0;column<k;column++){
float(tmp)=some_function(b,row,column);
#pragma omp atomic
A[column][row]+=tmp;
}
}
By the way, data is stored in row major order in C, so you should be updating the matrix row by row rather than column by column. This will prevent false-sharing and will improve the algorithm memory-access performance.
I just started learning OpenMP with C++, and I used a very simple program to check if I can get some speedup from parallelize the program:
#include <iostream>
#include <ctime>
#include "omp.h"
int main() {
const uint N = 1000000000;
clock_t start_time = clock();
#pragma omp parallel for
for (uint i = 0; i < N; i++) {
int x = 1+1;
}
clock_t end_time = clock();
std::cout << "total_time: " << double(end_time - start_time) / CLOCKS_PER_SEC << " seconds." << std::endl;
}
The program takes 2.2 seconds without parallel #pragma, and takes 2.8 seconds with parallel #pragma 4 threads. What mistake did I make in the program? My compiler is clang++ 6.0, and the computer is Macbook Pro with 2.6G i5 CPU and MacOS 10.13.6.
EDIT:
I realized I used the wrong function for measuring execution time. Instead of clock() from library ctime, I should use high_resolution_clock from library chrono library. In that case, I get 80 seconds for 1 thread, 47 seconds for 2 threads, 35 seconds for 3 threads. Should the speedup be better than what I get here, since the program is embarrassingly parallel?
As with anything in parallel programming, there is a startup cost to creating new threads. For simple programs, the overhead of creating and managing threads is often great enough that it actually slows down the target program compared to when the program is run in a single thread.
In other words, you didn't make a mistake - this is an inherent part of using threads.
[Background: OpenMP v4+ on Intel's icc compiler]
I want to parallelize tasks inside a loop that is already parallelized. I saw quite a few queries on subjects close to this one, e.g.:
Parallel sections in OpenMP using a loop
Doing a section with one thread and a for-loop with multiple threads
and others with more concentrated wisdom still.
but I could not get a definite answer other than a compile time error message when trying it.
Code:
#pragma omp parallel for private(a,bd) reduction(+:sum)
for (int i=0; i<128; i++) {
a = i%2;
for (int j=a; j<128; j=j+2) {
u_n = 0.25 * ( u[ i*128 + (j-3) ]+
u[ i*128 + (j+3) ]+
u[ (i-1)*128 + j ]+
u[ (i+1)*128 + j ]);
// #pragma omp single nowait
// {
// #pragma omp task shared(sum1) firstprivate(i,j)
// sum1 = (u[i*128+(j-3)]+u[i*128+(j-2)] + u[i*128+(j-1)])/3;
// #pragma omp task shared(sum2) firstprivate(i,j)
// sum2 = (u[i*128+(j+3)]+u[i*128+(j+2)]+u[i*128+(j+1)])/3;
// #pragma omp task shared(sum3) firstprivate(i,j)
// sum3 = (u[(i-1)*128+j]+u[(i-2)*128+j]+u[(i-3)*128+j])/3;
// #pragma omp task shared(sum4) firstprivate(i,j)
// sum4 = (u[(i+1)*128+j]+u[(i+2)*128+j]+u[(i+3)*128+j])/3;
// }
// #pragma omp taskwait
// {
// u_n = 0.25*(sum1+sum2+sum3+sum4);
// }
bd = u_n - u[i*128+ j];
sum += diff * diff;
u[i*128+j]=u_n;
}
}
In the above code, I tried replacing the u_n = 0.25 *(...); line with the 15 commented lines, to try not only to paralllelize the iterations over the 2 for loops, but also to acheive a degree of parallelism on each of the 4 calculations (sum1 to sum4) involving array u[].
The compile error is fairly explicit:
error: the OpenMP "single" pragma must not be enclosed by the
"parallel for" pragma
Is there a way around this so I can optimize that calculation further with OpenMP ?
The single worksharing construct within a loop worksharing construct is prohibited by the standard, but you don't need it there.
The usual parallel -> single -> task setup for tasking is to ensure that you have a thread team setup for your tasks (parallel), but then only spawn each task once (single). You wouldn't need the latter in a parallel for context because each iteration is already executed only once. So you could spawn tasks directly within the loop. This seems to have the expected behavior on both gnu and Intel compilers, i.e. threads that have completed their own loop iterations do help other threads to execute their tasks.
However, that is a bad idea to do in your case. A tiny computation such as the one of sum1 will be much faster on it's own compared to the overhead of spawning a task.
Removing all pragmas except for the parallel for, this is a very reasonable parallelization. Before further optimizing the calculation, you should measure! In particularly, you are interested in whether all your available threads are always computing something, or whether some threads finish early and wait for others (load imbalance). To measure, you should look for a parallel performance analysis tool for your platform. If that is the case, you can address it with scheduling policies, or possibly by nested parallelism in the inner loop.
A full discussion of the performance of your code is more complex, and requires a minimal, complete and verifiable example, a detailed system description, and actual measured performance numbers.
My openMP version did not give any speed boost. I have a dual core machine and the CPU usage is always 50%. So I tried the sample program given in Wiki. Looks like the openMP compiler (Visual Studio 2008) is not creating more than one thread.
This is the program:
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
int main (int argc, char *argv[]) {
int th_id, nthreads;
#pragma omp parallel private(th_id)
{
th_id = omp_get_thread_num();
printf("Hello World from thread %d\n", th_id);
#pragma omp barrier
if ( th_id == 0 ) {
nthreads = omp_get_num_threads();
printf("There are %d threads\n",nthreads);
}
}
return EXIT_SUCCESS;
}
This is the output that I get:
Hello World from thread 0
There are 1 threads
Press any key to continue . . .
There's nothing wrong with the program - so presumably there's some issue with how it's being compiled or run. Is this VS2008 Pro? A quick google around suggests OpenMP is not enabled in Standard. Is OpenMP enabled in Properties -> C/C++ -> Language -> OpenMP? (Eg, are you compiling with /openmp)? Is the environment variable OMP_NUM_THREADS being set to 1 somewhere when you run this?
If you want to test out your program with more than one thread, there are several constructs for specifying the number of threads in an OpenMP parallel region. They are, in order of precedence:
Evaluation of the if clause
Setting of the num_threads clause
Use of the omp_set_num_threads() library function
Setting of the OMP_NUM_THREADS environment variable
Implementation default
It sounds like your implementation is defaulting to one thread (assuming you don't have OMP_NUM_THREADS=1 set in your environment).
To test with 4 threads, for instance, you could add num_threads(4) to your #pragma omp parallel directive.
As the other answer noted, you won't really see any "speedup" because you aren't exploiting any parallelism. But it is reasonable to want to run a "hello world" program with several threads to test it out.
As mentioned here, http://docs.oracle.com/cd/E19422-01/819-3694/5_compiling.html I got it working by setting the environment variable OMP_DYNAMIC to FALSE
Why would you need more than one thread for that program? It's clearly the case that OpenMP realizes that it doesn't need to create an extra thread to run a program with no loops, no code that could run in parallel whatsoever.
Try running some parallel stuff with OpenMP. Something like this:
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#define CHUNKSIZE 10
#define N 100
int main (int argc, char *argv[])
{
int nthreads, tid, i, chunk;
float a[N], b[N], c[N];
/* Some initializations */
for (i=0; i < N; i++)
a[i] = b[i] = i * 1.0;
chunk = CHUNKSIZE;
#pragma omp parallel shared(a,b,c,nthreads,chunk) private(i,tid)
{
tid = omp_get_thread_num();
if (tid == 0)
{
nthreads = omp_get_num_threads();
printf("Number of threads = %d\n", nthreads);
}
printf("Thread %d starting...\n",tid);
#pragma omp for schedule(dynamic,chunk)
for (i=0; i<N; i++)
{
c[i] = a[i] + b[i];
printf("Thread %d: c[%d]= %f\n",tid,i,c[i]);
}
} /* end of parallel section */
}
If you want some hard core stuff, try running one of these.