I have a question about OpenMP. Does it make any difference whether I declare i in the outer loop as private or shared?
int i,j;
#pragma omp parallel for private(j)
for (i=0; i<n; i++) {
for(j=0; j<n; j++) {
//do something
}
}
i should be defined as private in your code. As each thread will run a proportion of the i loop, each thread should keep a private i so that the loop body knows which iteration he is in.
However a better way is to define variables when you use them, so that you don't need to specify their properties in most cases. In the following example, every variable (i and j) defined in the parallel for loop is private by default.
#pragma omp parallel for
for(int i=0; i<n; i++) {
for(int j=0; j<n; j++) {
//do something
}
}
Technically it does not make a difference, because the standard explicitly states the loop variable in the canonical loop form is always implicitly private.
If this variable would otherwise be shared, it is implicitly made private in the loop construct.
[2.6 in OpenMP API 4.5]
I would advise against declaring it shared, it would be very confusing to to anyone trying to read the program. You might argue that you could have the variable being shared outside of the loop and private inside, but I can't think of a case where this would make sense.
I'd hoped the compiler would give you a warning in that case, but at least GCC does not.
Related
I was trying to implement some piece of parallel code and tried to synchronize the threads using an array of flags as shown below
// flags array set to zero initially
#pragma omp parallel for num_threads (n_threads) schedule(static, 1)
for(int i = 0; i < n; i ++){
for(int j = 0; j < i; j++) {
while(!flag[j]);
y[i] -= L[i][j]*y[j];
}
y[i] /= L[i][i];
flag[i] = 1;
}
However, the code always gets stuck after a few iterations when I try to compile it using gcc -O3 -fopenmp <file_name>. I have tried different number of threads like 2, 4, 8 all of them leads to the loop getting stuck. On putting print statements inside critical sections, I figured out that even though the value of flag[i] gets updated to 1, the while loop is still stuck or maybe there is some other problem with the code, I am not aware of.
I also figured out that if I try to do something inside the while block like printf("Hello\n") the problem goes away. I think there is some problem with the memory consistency across threads but I do not know how to resolve this. Any help would be appreciated.
Edit: The single threaded code I am trying to parallelise is
for(int i=0; i<n; i++){
for(int j=0; j < i; j++){
y[i]-=L[i][j]*y[j];
}
y[i]/=L[i][i];
}
You have data race in your code, which is easy to fix, but the bigger problem is that you also have loop carried dependency. The result of your code does depend on the order of execution. Try reversing the i loop without OpenMP, you will get different result, so your code cannot be parallelized efficiently.
One possibility is to parallelize the j loop, but the workload is very small inside this loop, so the OpenMP overheads will be significantly bigger than the speed gain by parallelization.
EDIT: In the case of your updated code I suggest to forget parallelization (because of loop carried dependency) and make sure that inner loop is properly vectorized, so I suggest the following:
for(int i = 0; i < n; i ++){
double sum_yi=y[i];
#pragma GCC ivdep
for(int j = 0; j < i; j++) {
sum_yi -= L[i][j]*y[j];
}
y[i] = sum_yi/L[i][i];
}
#pragma GCC ivdep tells the compiler that there is no loop carried dependency in the loop, so it can vectorize it safely. Do not forget to inform compiler the about the vectorization capabilities of your processor (e.g. use -mavx2 flag if your processor is AVX2 capable).
I have the following nested loops that I want to collapse into one for parallelization. Unfortunately the inner loop is a max-reduction rather than standard for loop thus collapse(2) directive apparently can't be used here. Is there any way to collapse these two loops anyway? Thanks!
(note that s is the number of sublists and n is the length of each sublist and suppose n >> s)
#pragma omp parallel for default(shared) private(i,j)
for (i=0; i<n; i++) {
rank[i] = 0;
for (j=0; j<s; j++)
if (rank[i] < sublistrank[j][i])
rank[i] = sublistrank[j][i];
}
In this code the best idea is not to parallelize the inner loop at all, but make sure it is properly vectorized. The inner loop does not access the memory continuously, which prevents vectorization and results in a poor cache utilization. You should rewrite your entire code to ensure continuous memory access (e.g. change the order of indices and use sublistrank[i][j] instead of sublistrank[j][i]).
If also beneficial to use a temporary variable for comparisons and assign it to rank[i] after the loop.
Another comment is that always use your variables in their minimum required scope, it also helps the compiler to create more optimized code. Putting it together your code should look like something like this (assuming you use unsigned int for rank and loop variables)
#pragma omp parallel for default(none) shared(sublistrank, rank)
for (unsigned int i=0; i<n; i++) {
unsigned int max=0;
for (unsigned int j=0; j<s; j++)
if (max < sublistrank[i][j])
max = sublistrank[i][j];
rank[i]=max;
}
I have compared your code and this code on CompilerExporer. You can see that the compiler is able to vectorize it, but not the old one.
Note also that if n is small, the parallel overhead may be bigger than the benefit of parallelization.
I'd like to use openMP to apply multi-thread.
Here is simple code that I wrote.
vector<Vector3f> a;
int i, j;
for (i = 0; i<10; i++)
{
Vector3f b;
#pragma omp parallel for private(j)
for (j = 0; j < 3; j++)
{
b[j] = j;
}
a.push_back(b);
}
for (i = 0; i < 10; i++)
{
cout << a[i] << endl;
}
I want to change it to works lik :
parallel for1
{
for2
}
or
for1
{
parallel for2
}
Code works when #pragma line is deleted. but it does not work when I use it. What's the problem?
///////// Added
Actually I use OpenMP to more complicated example,
double for loop question.
here, also When I do not apply MP, it works well.
But When I apply it,
the error occurs at vector push_back line.
vector<Class> B;
for 1
{
#pragma omp parallel for private(j)
parallel for j
{
Class A;
B.push_back(A); // error!!!!!!!
}
}
If I erase B.push_back(A) line, it works as well when I applying MP.
I could not find exact error message, but it looks like exception error about vector I guess. Debug stops at
void _Reallocate(size_type _Count)
{ // move to array of exactly _Count elements
pointer _Ptr = this->_Getal().allocate(_Count);
_TRY_BEGIN
_Umove(this->_Myfirst, this->_Mylast, _Ptr);
std::vector::push_back is not thread safe, you cannot call that without any protection against race conditions from multiple threads.
Instead, prepare the vector such that it's size is already correct and then insert the elements via operator[].
Alternatively you can protect the insertion with a critical region:
#pragma omp critical
B.push_back(A);
This way only one thread at a time will do the insertion which will fix the error but slow down the code.
In general I think you don't approach parallelization the right way, but there is no way to give better advise without a clearer and more representative problem description.
I'm using OMP to try to get some speedup in a small kernel. It's basically just querying a vector of unordered_sets for membership. I tried to make an optimization, but surprisingly I got a slowdown, and am really curious why.
My first pass was:
vector<unordered_set<uint16_t> > setList = getData();
#pragma omp parallel for default(shared) private(i, j) schedule(dynamic, 50)
for(i = 0; i < size; i++){
for(j = 0; j < 500; j++){
count = count + setList[i].count(val[j]);
}
}
Then I thought I could maybe get a speedup by moving the setList[i] sub expression up one level of nesting and save it in a temp variable, by doing the following:
#pragma omp parallel for default(shared) private(i, j, currSet) schedule(dynamic, 50)
for(i = 0; i < size; i++){
currSet = setList[i];
for(j = 0; j < 500; j++){
count = count + currSet.count(val[j]);
}
}
I had thought this would maybe save a load each iteration of the "j" for loop, and get a speedup, but it actually SLOWED DOWN by about 3x. By this I mean the entire kernel took about 3 times as long to run. Thoughts on why this would occur?
Thanks!
Adding up a few integers is really not enough work to warrant starting threads for.
If you forget to add the reduction clause, you'll suffer from true sharing - all threads want to update that count variable at the same time. This makes all cores fight for the cache line containing tha variable, which will considerably impact your performance.
I just noticed that you set the schedule to be dynamic. You shouldn't. This workload can be divided at compile time already. So don't specify a schedule.
As has already been stated inter-loop dependencies, i.e. threads waiting for data from other threads, or data being accessed by multiple threads successively, can cause a paralleled program to experience slow down and should be avoided as a rule of thumb. Built in functions like reductions can collect individual results and compile them together in an optimised fashion.
Here is a good example of reduction being used in a similar case to yours from the university of Utah
int array[8] = { 1, 1, 1, 1, 1, 1, 1, 1};
int sum = 0, i;
#pragma omp parallel for reduction(+:sum)
for (i = 0; i < 8; i++) {
sum += array[i];
}
printf("total %d\n", sum);
source: http://www.eng.utah.edu/~cs4960-01/lecture9.pdf
as an aside: private variables need only be assigned when they are local variables inside a parallel region In both cases it is not necessary for i to be declared private.
see wikipedia: https://en.wikipedia.org/wiki/OpenMP#Data_sharing_attribute_clauses
Data sharing attribute clauses
shared: the data within a parallel region is shared, which means visible and accessible by all threads simultaneously. By default, all variables in the work sharing region are shared except the loop iteration counter.
private: the data within a parallel region is private to each thread, which means each thread will have a local copy and use it as a temporary variable. A private variable is not initialized and the value is not maintained for use outside the parallel region. By default, the loop iteration counters in the OpenMP loop constructs are private.
see stack exchange answer here: OpenMP: are local variables automatically private?
What is the difference in combining 2 for loops and parallizing together and parallizing separately
Example
1. not paralleling together
#pragma omp parallel for
for(i = 0; i < 100; i++) {
//.... some code
}
#pragma omp parallel for
for(i = 0; i < 1000; i++) {
//.... some code
}
2. paralleling together
#pragma omp parallel
{
#pragma omp for
for(i = 0; i < 100; i++) {
//.... some code
}
#pragma omp for
for(i = 0; i < 1000; i++) {
//.... some code
}
}
which code is better and why????
One might expect a small win in the second, because one is fork/joining (or the functional equivalent) the OMP threads twice, rather than once. Whether it makes any actual difference for your code is an empirical question best answered by measurement.
The second can also have a more significant advantage if the work in the two loops are independant, and you can start the second at any time, and there's reason to expect some load imbalance in the first loop. In that case, you can add a nowait clause to the firs tomp for and, rather than all threads waiting until the for loop ends, whoever's done first can immediately go on to start working on the second loop. Or, one could put the two chunks of codes each in a section, or task. In general, you have a lot of control over what threads do and how they do it within a parallel section; whereas once you end the parallel section, you lose that flexibility - everything has to join together and you're done.