I am a complete noob in OpenMP and just started by exploring some simple test script below.
#pragma omp parallel
{
#pragma omp for
for(int i=0;i<10;++i)
std::cout<<i<<" "<<endl;
// printf("%d \n",i);
}
}
I tried the C and C++ version and the C version seems to work fine whereas the C++ version gives me a wrong output.
Many implementations of printf acquire a lock to ensure that each printf call is not interrupted by other threads.
In contrast, std::cout's overloaded << operator means that (even with a lock) one thread's printing of i and ' ' and '\n' can be interleaved with another thread's output, because std::cout<<i<<" "<<endl; is translated to three operator<<() function calls by the C++ compiler.
This is outdated but perhaps this could be still of help to anyone:
It's not really clear what you expect the output to be but be aware of:
Your variable "i" is possibly shared amongst threads. You have a race-condition for the contents of "i". One thread needs to wait for another when it wants to access "i". Further 1 thread can change "i" and another thread doesn't take note of it meaning it will output a wrong value.
The endl() flushes the memory after ending the line. If you use \n for newline the effect is similar but without the flush. And std is an object too so multiple threads race for std access. When the memory isn't flushed after every access you may experience interferences.
To make sure those are not related to your problems you could declare the "i" as private so every thread counts "i" itself. And you could play with flushing the memory at output so see if it has to do with the problem you experience.
Related
I am creating a wrapper over c library that recieves some financial data and I want to collect it into python data type (dict with list of field names and list of lists with financial data fields).
On the c level there is function that starts "listening" to some port and when any event appears some user-defined function is called. This function is written in cython. Simplified example of such function is here:
cdef void default_listener(const event_data_t* data, int data_count, void* user_data):
cdef trade_t* trades = <trade_t*>data # cast recieved data according to expected type
cdef dict py_data = <dict>user_data # cast user_data to initial type(dict in our case)
for i in range(data_count):
# append to list in the dict that we passed to the function
# fields of recieved struct
py_data['data'].append([trades[i].price,
trades[i].size,
]
)
The problem: when there is only one python process with this function started, there are no problems, but if I start another python process and run the same function one of the processes will be terminated in undetermiined amount of time. I suppose that this happens because two functions that are called simultaniously in different processes may try to write to the same part of the memory. May this be the case?
If this is the case, are there any ways to prevent two processes use the same memory? Or maybe some lock can be established before the cython code starts to write?
P.S.: I also have read this article and according to it for each python process there is some memory allocated that does not intersect with parts for other processes. But it is unclear for me, is this allocated memory also available for underlying c functions or these functions have acces to another fields that may intersect
I'm taking a guess at the answer based on your comment - if it's wrong then I'll delete it, but I think it's likely enough to be right to be worth posting as an answer.
Python has a locking mechanism known as the Global Interpreter Lock (or GIL). This ensures that multiple threads don't attempt to access the same memory simultaneously (including memory internal to Python, that may not be obvious to the user).
Your Cython code will be working on the assumption that its thread holds the GIL. I strongly suspect that this isn't true, and so performing any operations on a Python object will likely cause a crash. One way to deal with this would be to follow this section of documentation in the C code that calls the Cython code. However, I suspect it's easier to handle in Cython.
First tell Cython that the function is "nogil" - it does not require the GIL:
cdef void default_listener(const event_data_t* data, int data_count, void* user_data) nogil:
If you try to compile now it will fail, since you use Python types within the function. To fix this, claim the GIL within your Cython code.
cdef void default_listener(...) nogil:
with gil:
default_listener_impl(...)
What I've done is put the implementation in a separate function that does require the GIL (i.e. doesn't have a nogil attached). The reason for this is that you can't put cdef statements in the with gil section (as you say in your comment) - they have to be outside it. However, you can't put cdef dict outside it, because it's a Python object. Therefore a separate function is the easiest solution. The separate function looks almost exactly like default_listener does now.
It's worth knowing that this isn't a complete locking mechanism - it's really only to protect the Python internals from being corrupted - an ordinary Python thread will release and regain the GIL periodically automatically, and that may be while you're "during" an operation. Cython won't release the GIL unless you tell it to (in this case, at the end of the with gil: block) so does hold an exclusive lock during this time. If you need finer control of locking then you may want to look at either the multithreading library, or wrapping some C locking library.
a frame is shared with a kernel.
User-space code:
read frame // read frame content
_mm_mfence // prevent before "releasing" a frame before we read everything.
frame.status = 0 // "release" a frame
Kernel code:
poll for frame.status // reads a frame's status
_mm_lfence
Kernel can poll it asynchronically, in another thread. So, there is no syscall between userspace code and kernelspace.
Is it correctly synchronized?
I doubt because of the following situation:
A compiler has a weak memory model and we have to assume that it can do wild changes as you can imagine if optimizied/changed program is consistent within one-thread.
So, on my eye we need a second barrier because it is possible that a compiler optimize out store frame.status, 0.
Yes, it will be a very wild optimization but if a compiler would be able to prove that noone in the context (within thread) reads that field it can optimize out it.
I believe that it is theoretically possibe, isn't it?
So, to prevent that we can put the second barrier:
User-space code:
read frame // read frame content
_mm_mfence // prevent before "releasing" a frame before we read everything.
frame.status = 0 // "release" a frame
_mm_fence
Ok, now compiler restrain itself before optimization.
What do you think?
EDIT
[The question is raised by the issue that __mm_fence does not prevent before optimizations-out.
#PeterCordes, to make sure myself: __mm_fence does not prevent before optimizations out (it is just x86 memory barrier, not compiler). However, atomic_thread_fence(any_order) prevents before reorderings (it depends on any_order, obviously) but it also prevents before optimizations out?
For example:
// x is an int pointer
*x = 5
*(x+4) = 6
std::atomic_thread_barrier(memory_order_release)
prevents before optimizations out of stores to x? It seems that it must- otherwise every store to x should be volatile.
However, I saw a lot of lock-free code and there is no making fields as volatile.
_mm_mfence is also a compiler barrier. (See When should I use _mm_sfence _mm_lfence and _mm_mfence, and also BeeOnRope's answer there).
atomic_thread_fence with release, rel_acq, or seq_cst stops earlier stores from merging with later stores. But mo_acquire doesn't have to.
Writes to non-atomic globals variables can only be optimized out by merging with other writes to the same non-atomic variables, not by optimizing them away entirely. So the real question is what reorderings can happen that can let two non-atomic assignments come together.
There has to be an assignment to an atomic variable in there somewhere for there to be anything that another thread could synchronize with. Some compilers might give atomic_thread_fence stronger behaviour wrt. non-atomic variables, but in C++11 there's no way for another thread to legally observer anything about the ordering of *x and x[4] in
#include <atomic>
std::atomic<int> shared_flag {0};
int x[8];
void writer() {
*x = 0;
x[4] = 0;
atomic_thread_fence(mo_release);
x[4] = 1;
atomic_thread_fence(mo_release);
shared_flag.store(1, mo_relaxed);
}
The store to shared_flag has to appear after the stores to x[0] and x[4], but it's only an implementation detail what order the stores to x[0] and x[4] happen in, and whether there are 2 stores to x[4].
For example, on the Godbolt compiler explorer gcc7 and earlier merge the stores to x[4], but gcc8 doesn't, and neither do clang or ICC. The old gcc behaviour does not violate the ISO C++ standard, but I think they strengthened gcc's thread_fence because it wasn't strong enough to prevent bugs in other cases.
For example,
void writer_gcc_bug() {
*x = 0;
std::atomic_thread_fence(std::memory_order_release);
shared_flag.store(1, std::memory_order_relaxed);
std::atomic_thread_fence(std::memory_order_release);
*x = 2; // gcc7 and earlier merge this, which arguably a bug
}
gcc only does shared_flag = 1; *x = 2; in that order. You could argue that there's no way for another thread to safely observe *x after seeing shared_flag == 1, because this thread writes it again right away with no synchronization. (i.e. data race UB in any potential observer makes this reordering arguably legal).
But gcc developers don't think that's enough reason, (it may be violating the guarantees of the builtin __atomic functions that the <atomic> header uses to implement the API). And there may be other cases where there is a real bug that even a standards-conforming program could observe the aggressive reordering that violated the standard.
Apparently this changed on 2017-09 with the fix for gcc bug 80640.
Alexander Monakov wrote:
I think the bug is that on x86 __atomic_thread_fence(x) is expanded into nothing for x!=__ATOMIC_SEQ_CST, it should place a compiler barrier similar to expansion of __atomic_signal_fence.
(__atomic_signal_fence includes something as strong as asm("" ::: "memory" ).)
Yup that would definitely be a bug. So it's not that gcc was being really clever and doing allowed reorderings, it was just mostly failing at thread_fence, and any correctness that did happen was due to other factors, like non-inline function boundaries! (And that it doesn't optimize atomics, only non-atomics.)
I am trying to build something with MPI, so since i am not very familiar with it, i started with some arrays and printing stuff. I noticed that a plain C command (not an MPI one) works simultaneously on every process, i.e. printing something like that :
printf("Process No.%d",rank);
Them i noticed that the numbers of the processes got all scrambled and because the right sequence of the processes would fit me, i tried using a for-loop like that :
for(rank=0; rank<processes; rank++) printf("Process No.%d",rank);
And that started a third world war in my computer. Lots of strange errors in a strange format that i couldn't understand and that made me suspicious. How is it possible since an if-loop stating a ranks value , like the master rank:
if(rank==0) printf("Process No.%d",rank);
cant use a for-loop for the same reason. Well, that is my first question.
My second question is about an other for-loop i used, that it got ignored.
printf("PROCESS --------------->**%d**\n",id);
for (i = 0; i < PARTS; ++i){
printf("Array No.%d\n", i+1);
for (j = 0; j < MAXWORDS; ++j)
printf("%d, ",0);
printf("\n\n");
}
I run that for-loop and every process printed only the first line:
$ mpiexec -n 6 `pwd`/test
PROCESS --------------->**0**
PROCESS --------------->**1**
PROCESS --------------->**3**
PROCESS --------------->**2**
PROCESS --------------->**4**
PROCESS --------------->**5**
And not the following amount of zeros (there was an array there at first that i removed cause i was trying to figure out why it didn't get printed).
So, why is it about MPI and for-loops that don't get along?
--edit 1: grammar
--edit 2: Code paste
It is not the same as above, but same problem in the the last for-loop with fprintf.
This is a paste zone, sorry for that, i couldn't deal with the code system here
--edit 3: fixed
Well i finally figured it out. For first i have to say that the fprintf function when used inside MPI is a mess. Apparently there is a kind of overlap while every process writes in a text file. I tested it with the printf function and it worked. The second thing i was doing is, i was calling the MPI_Scatter function from inside root:
if(rank==root) MPI_Scatter();
..which only scatters the data inside the process and not the others.
Now that i have fixed those two issues, the program works as it should, apart a minor problem when i printf the my_list arrays. It seems like every array has a random number of inputs, but when i tested using a counter for every array, it's only the data that is printed like this. Tried using fflush(stdout); but it returned me an error.
usr/lib/gcc/x86_64-pc-linux-gnu/4.2.2/../../../../x86_64-pc-linux-gnu/bin/ld: final link failed: `Input/output error collect2: ld returned 1 exit status`
MPI in and of itself does not have a problem with for loops. However, just like with any other software, you should always remember that it will work the way you code it, not the way you intend it. It appears that you are having two distinct issues, both of which are only tangentially related to MPI.
The first issue is that the variable PARTS is defined in such a way that it depends on another variable, procs, which is not initialized at before-hand. This means that the value of PARTS is undefined, and probably ends up causing a divide-by-zero as often as not. PARTS should actually be set after line 44, when procs is initialized.
The second issue is with the loop for(i = 0; i = LISTS; i++) labeled /*Here is the problem*/. First of all, the test condition of the loop always sets i to the value of LISTS, regardless of the initial value of 0 and the increment at the end of the loop. Perhaps it was intended to be i < LISTS? Secondly, LISTS is initialized in a way that depends on PARTS, which depends on procs, before that variable is initialized. As with PARTS, LISTS must be initialized after the statement MPI_Comm_size(MPI_COMM_WORLD, &procs); on line 44.
Please be more careful when you write your loops. Also, make sure that you initialize variables correctly. I highly recommend using print statements (for small programs) or the debugger to make sure your variables are being set to the expected values.
We've been using CygWin (/usr/bin/x86_64-w64-mingw32-gcc) to generate Windows 64-bit executable files and it has been working fine through yesterday. Today it stopped working in a bizarre way--it "caches" standard output until the program ends. I wrote a six line example
that did the same thing. Since we use the code in batch, I wouldn't worry except when I run a test case on the now-strangely-caching executable, it opens the output files, ends early, and does not fill them with data. (The same code on Linux works fine, but these guys are using Windows.) I know it's not gorgeous code, but it demonstrates my problem, printing the numbers "1 2 3 4 5 6 7 8 9 10" only after I press the key.
#include <stdio.h>
main ()
{
char q[256];
int i;
for (i = 1; i <= 10; i++)
printf ("%d ", i);
gets (q);
printf ("\n");
}
Does anybody know enough CygWin to help me out here? What do I try? (I don't know how to get version numbers--I did try to get them.) I found a 64-bit cygwin1.dll in /cygdrive/c/cygwin64/bin and that didn't help a bit. The 32-bit gcc compilation works fine, but I need 64-bit to work. Any suggestions will be appreciated.
Edit: we found and corrected an unexpected error in the original code that caused the program not to populate the output files. At this point, the remaining problem is that cygwin won't show the output of the program.
For months, the 64-bit executable has properly generated the expected output, just as the 32-bit version did. Just today, it has started exhibiting the "caching" behavior described above. The program sends many hundreds of lines with many newline characters through stdout. Now, when the 64-bit executable is created as above, none of these lines are shown until the program completes and the entire output it printed at once. Can anybody provide insight into this problem?
This is quite normal. printf outputs to stdout which is a FILE* and is normally line buffered when connected to a terminal. This means you will not see any output until you write a newline, or the internal buffer of the stdout FILE* is full (A common buffer size is 4096 bytes).
If you write to a file or pipe, output might be fully buffered, in which case output is flushed when the internal buffer is full and not when you write a newline.
In all cases the buffers of a FILE* are flushed when: you call fflush(..). You call fclose(..) or the program ends normally.
Your program will behave the same on windows/cygwin as on linux.
You can add a call to fflush(stdout) to see the output immediately.
for (i = 1; i <= 10; i++) {
printf ("%d ", i);
fflush(stdout);
}
Also, do not use the gets() function.
If your real programs "ends early" and does not write data in text files that it's supposed to, it may be it crashes due to a bug of yours before it finishes, in which case the buffered output will not be flushed out. Or, more unlikely, you call the _exit() function, which will terminate the program without flushing FILE* buffers (in contrast to the exit() function)
One OpenMP directive I have never used and don't know when to use is flush(with and without a list).
I have two questions:
1.) When is an explicit `omp flush` or `omp flush(var1, ...) necessary?
2.) Is it sometimes not necessary but helpful (i.e. can it make the code fast)?
The main reason I can't understand when to use an explicit flush is that flushes are done implicitly after many directives (e.g. as barrier, single, ...) which synchronize the threads. I can't, for example, see way using flush and not synchronizing (e.g. with nowait) would be helpful.
I understand that different compilers may implement omp flush in different ways. Some may interpret a flush with a list as as one without (i.e. flush all shared objects) OpenMP flush vs flush(list). But I only care about what the specification requires. In other words, I want to know where an explicit flush in principle may be necessary or helpful.
Edit: I think I need to clarify my second question. Let me give an example. I would like to know if there are cases where removing an implicit flush (e.g. with nowait) and instead using an explicit flush instead but only on certain shared variables would be faster (and still give the correct result). Something like the following:
float a,b;
#pragma omp parallel
{
#pragma omp for nowait // No barrier. Do not flush on exit.
//code which uses only shared variable a
#pragma omp flush(a) // Flush only variable a rather than all shared variables.
#pragma omp for
//Code which uses both shared variables a and b.
}
I think that code still needs a barrier after the the first for loop but all barriers have an implicit flush so that defeats the purpose. Is it possible to have a barrier which does not do a flush?
The flush directive tells the OpenMP compiler to generate code to make the thread's private view on the shared memory consistent again. OpenMP usually handles this pretty well and does the right thing for typical programs. Hence, there's no need for flush.
However, there are cases where the OpenMP compiler needs some help. One of these cases is when you try to implement your own spin lock. In these cases, you would need a combination of flushes to make things work, since otherwise the spin variables will not be updated. Getting the sequence of flushes correct will be tough and will be very, very error prone.
The general recommendation is that flushes should not be used. If at all, programmers should avoid flush with a list (flush(var,...)) at all means. Some folks are actually talking about deprecating it in future OpenMP.
Performance-wise the impact of flush should be more negative than positive. Since it causes the compiler to generate memory fences and additional load/store operations, I would expect it to slow down things.
EDIT: For your second question, the answer is no. OpenMP makes sure that each thread has a consistent view on the shared memory when it needs to. If threads do not synchronize, they do not need to update their view on the shared memory, because they do not see any "interesting" change there. That means that any read a thread makes does not read any data that has been changed by some other thread. If that would be the case, then you'd have a race condition and a potential bug in your program. To avoid the race, you need to synchronize (which then implies a flush to make each participating thread's view consistent again). A similar argument applies to barriers. You use barriers to start a new epoch in the computation of a parallel region. Since you're keeping the threads in lock-step, you will very likely also have some shared state between the threads that has been computed in the previous epoch.
BTW, OpenMP may keep private data for a thread, but it does not have to. So, it is likely that the OpenMP compiler will keep variables in registers for a while, which causes them to be out of sync with the shared memory. However, updates to array elements are typically reflected pretty soon in the shared memory, since the amount of private storage for a thread is usually small (register sets, caches, scratch memory, etc.). OpenMP only gives you some weak restrictions on what you can expect. An actual OpenMP implementation (or the hardware) may be as strict as it wishes to be (e.g., write back any change immediately and to flushes all the time).
Not exactly an answer, but Michael Klemm's question is closed for comments. I think an excellent example of why flushes are so hard to understand and use properly is the following one copied (and shortened a bit) from the OpenMP Examples:
//http://www.openmp.org/wp-content/uploads/openmp-examples-4.0.2.pdf
//Example mem_model.2c, from Chapter 2 (The OpenMP Memory Model)
int main() {
int data, flag = 0;
#pragma omp parallel num_threads(2)
{
if (omp_get_thread_num()==0) {
/* Write to the data buffer that will be read by thread */
data = 42;
/* Flush data to thread 1 and strictly order the write to data
relative to the write to the flag */
#pragma omp flush(flag, data)
/* Set flag to release thread 1 */
flag = 1;
/* Flush flag to ensure that thread 1 sees S-21 the change */
#pragma omp flush(flag)
}
else if (omp_get_thread_num()==1) {
/* Loop until we see the update to the flag */
#pragma omp flush(flag, data)
while (flag < 1) {
#pragma omp flush(flag, data)
}
/* Values of flag and data are undefined */
printf("flag=%d data=%d\n", flag, data);
#pragma omp flush(flag, data)
/* Values data will be 42, value of flag still undefined */
printf("flag=%d data=%d\n", flag, data);
}
}
return 0;
}
Read the comments and try to understand.