Barrier before MPI_Bcast()?

I see some open source code use MPI_Barrier before broadcasting the root value:
MPI_Barrier(MPI_COMM_WORLD);
MPI_Bcast(buffer, N, MPI_FLOAT, 0, MPI_COMM_WORLD);
MPI_Barrier(MPI_COMM_WORLD);
I am not sure whether MPI_Bcast() is already blocking/synchronising by nature. If it is, I may not need MPI_Barrier() to synchronize the progress of all the cores and can simply use:
MPI_Bcast(buffer, N, MPI_FLOAT, 0, MPI_COMM_WORLD);
So which one is correct?

There is rarely a need to perform explicit synchronisation in MPI, and code like that makes little sense in general. Ranks in MPI mostly process data locally, share no access to global objects, and synchronise implicitly following the semantics of the send and receive operations. Why should rank i care whether some other rank j has received the broadcast when i is processing the received data locally?
Explicit barriers are generally needed in the following situations:
benchmarking - a barrier before a timed region of the code removes any extraneous waiting times resulting from one or more ranks being late to the party (see the sketch after this list)
parallel I/O - in this case, there is a global object (a shared file) and the consistency of its content may depend on the proper order of I/O operations, hence the need for explicit synchronisation
one-sided operations (RMA) - similarly to the parallel I/O case, some RMA scenarios require explicit synchronisation
shared-memory windows - a subset of RMA where access to memory shared between several ranks doesn't go through MPI calls; instead, direct memory read and write instructions are issued, which brings all the problems inherent to shared-memory programming, such as data races and the resulting need for locks and barriers, into MPI
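For the benchmarking case, the typical pattern looks roughly like this (a sketch; do_timed_work() is a hypothetical placeholder for the measured region, and the MPI_Init/MPI_Finalize boilerplate is omitted as in the other snippets):
double t_start, t_local, t_max;

MPI_Barrier(MPI_COMM_WORLD);   /* line all ranks up before starting the clock */
t_start = MPI_Wtime();

do_timed_work();               /* hypothetical placeholder for the measured region */

t_local = MPI_Wtime() - t_start;
/* report the slowest rank's time as the duration of the timed region */
MPI_Reduce(&t_local, &t_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);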
There are rare cases when the code actually makes sense. Depending on the number of ranks, their distribution throughout the network of processing elements, the size of the data to broadcast, the latency and bandwidth of the interconnect, and the algorithm the MPI library uses to actually implement the data distribution, the broadcast may take much longer to complete when the ranks are even slightly out of alignment in time, due to the phenomenon of delay propagation, which may also apply to the user code itself. Those are pathological cases that usually occur under specific conditions, which is why you may sometimes see code like:
#ifdef UNALIGNED_BCAST_IS_SLOW
MPI_Barrier(MPI_COMM_WORLD);
#endif
MPI_Bcast(buffer, N, MPI_FLOAT, 0, MPI_COMM_WORLD);
or even
if (config.unaligned_bcast_performance_remedy)
    MPI_Barrier(MPI_COMM_WORLD);
MPI_Bcast(buffer, N, MPI_FLOAT, 0, MPI_COMM_WORLD);
I've seen at least one MPI-enabled quantum chemistry simulation software package include similar code.
That said, collective operations in MPI are not necessarily synchronising. The only one to guarantee that there is a point in time where all ranks are simultaneously inside the call is MPI_BARRIER. MPI allows ranks to exit early once their participation in the collective operation has finished. For example, MPI_BCAST may be implemented as a linear sequence of sends from the root:
int rank, size;
MPI_Comm_rank(comm, &rank);
MPI_Comm_size(comm, &size);
if (rank == root)
{
    for (int i = 0; i < size; i++)
        if (i != rank)
            MPI_Send(buffer, count, type, i, SPECIAL_BCAST_TAG, comm);
}
else
{
    MPI_Recv(buffer, count, type, root, SPECIAL_BCAST_TAG, comm, MPI_STATUS_IGNORE);
}
In this case, rank 0 (if root is not 0) or rank 1 (when root is 0) will be the first one to receive the data, and since no further communication is directed to or from it, it can safely return from the broadcast call. If the data buffer is large and the interconnect is slow, this creates quite some temporal staggering between the ranks.
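A quick way to observe that staggering is to time the call on each rank (a sketch, reusing the buffer from the question and assuming the usual MPI and stdio includes):
int rank;
double t0, t_in_bcast;

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
t0 = MPI_Wtime();
MPI_Bcast(buffer, N, MPI_FLOAT, 0, MPI_COMM_WORLD);
t_in_bcast = MPI_Wtime() - t0;   /* compare across ranks to see the staggering */
printf("rank %d spent %f s inside MPI_Bcast\n", rank, t_in_bcast);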

Limits of workload that can be put into hardware accelerators

I am interested in understanding what percentage of workloads can almost never be put into hardware accelerators. While more and more tasks are becoming amenable to domain-specific accelerators, I wonder whether it is possible to have tasks that are not going to benefit from an accelerator. Put simply, what are the tasks that are less likely to be accelerator-compatible?
I would love to have pointers to resources that speak to this question.
So you have the following question(s) in your original post:
Question:
I wonder whether it is possible to have tasks that are not going to benefit from an accelerator. Put simply, what are the tasks that are less likely to be accelerator-compatible?
Answer:
Of course it's possible. First and foremost, a workload that is to be accelerated on a hardware accelerator should not involve the following:
dynamic polymorphism and dynamic memory allocation
runtime type information (RTTI)
system calls
........... (some more depending on the hardware accelerator)
Explaining each of the above points would make the post too lengthy, so I will explain just a few. There is no support for dynamic memory allocation because hardware accelerators have a fixed set of resources on silicon, and the dynamic creation and freeing of memory resources is not supported. Similarly, dynamic polymorphism is only supported if the pointed-to object can be determined at compile time. And there should be no system calls because these are requests to the operating system to perform some task; therefore OS operations, such as file reads/writes or OS queries like time and date, are not supported.
Having said that, the workloads that are less likely to be accelerator-compatible are mostly communication-intensive kernels. Such kernels often incur a serious data-transfer overhead compared to CPU execution, which can be detected by measuring the CPU-FPGA or CPU-GPU communication time.
For better understanding, let's take the following example:
Communication Intensive Breadth-First Search (BFS):
procedure BFS(G, root) is
    let Q be a queue
    label root as explored
    Q.enqueue(root)
    while Q is not empty do
        v := Q.dequeue()
        if v is the goal then
            return v
        for all edges from v to w in G.adjacentEdges(v) do
            if w is not labeled as explored then
                label w as explored
                Q.enqueue(w)
The above pseudo code is the famous breadth-first search (BFS). Why is it not a good candidate for acceleration? Because it traverses all the nodes in a graph without doing any significant computation; it is immensely communication intensive rather than compute intensive. Furthermore, for a data-driven algorithm like BFS, the shape and structure of the input can actually dictate runtime characteristics like locality and branch behaviour, making it a poor candidate for hardware acceleration.
Now the question arises: why have I focused on compute-intensive vs. communication-intensive workloads?
As you have tagged FPGA in your post, I can explain this concept with respect to FPGAs. For instance, in a system that uses a PCIe connection between the CPU and the FPGA, we calculate the PCIe transfer time as the elapsed time of data movement from the host memory to the device memory through PCIe-based direct memory access (DMA).
The PCIe transfer time is a significant factor when filtering out FPGA acceleration for communication-bound workloads. The above-mentioned BFS can therefore show severe PCIe transfer overheads and hence is not acceleration-compatible.
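As a back-of-the-envelope illustration of that filter, you can compare an estimated transfer time against the CPU time before committing to an offload. The bandwidth and timing numbers below are hypothetical placeholders, not measurements:
#include <stdio.h>

int main(void)
{
    double bytes_to_move   = 256.0 * 1024 * 1024;   /* hypothetical graph data to copy to the device */
    double pcie_bw_gbs     = 12.0;                  /* assumed effective GB/s over PCIe DMA */
    double kernel_time_s   = 0.004;                 /* assumed on-device kernel time */
    double cpu_time_s      = 0.015;                 /* assumed CPU-only time */

    double transfer_time_s = bytes_to_move / (pcie_bw_gbs * 1e9);
    double offload_time_s  = transfer_time_s + kernel_time_s;

    printf("transfer: %.4f s, offload total: %.4f s, CPU: %.4f s\n",
           transfer_time_s, offload_time_s, cpu_time_s);
    /* If offload_time_s >= cpu_time_s, the PCIe transfer alone kills the speedup. */
    return 0;
}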
On the other hand, consider the family of object-recognition algorithms implemented as deep neural networks. If you go through these algorithms you will find that a significant amount of time (perhaps more than 90%) is spent in the convolution function. The input data is relatively small and the convolutions are embarrassingly parallel, which makes them an ideal workload for moving to a hardware accelerator.
Let's take another example showing a perfect workload for hardware acceleration:
Compute Intensive General Matrix Multiply (GEMM):
void gemm(TYPE m1[N], TYPE m2[N], TYPE prod[N]){
    int i, k, j, jj, kk;
    int i_row, k_row;
    TYPE temp_x, mul;
    loopjj: for (jj = 0; jj < row_size; jj += block_size){
        loopkk: for (kk = 0; kk < row_size; kk += block_size){
            loopi: for (i = 0; i < row_size; ++i){
                loopk: for (k = 0; k < block_size; ++k){
                    i_row = i * row_size;
                    k_row = (k + kk) * row_size;
                    temp_x = m1[i_row + k + kk];
                    loopj: for (j = 0; j < block_size; ++j){
                        mul = temp_x * m2[k_row + j + jj];
                        prod[i_row + j + jj] += mul;
                    }
                }
            }
        }
    }
}
The above code example is General Matrix Multiply (GEMM). It is a common algorithm in linear algebra, machine learning, statistics, and many other domains. The matrix multiplication in this code is computed using a blocked loop structure: reordering the arithmetic to reuse all of the elements in one block before moving on to the next dramatically improves memory locality. Hence it is extremely compute intensive and a perfect candidate for acceleration.
Hence, to name only a few, the following are the deciding factors for hardware acceleration:
how compute-heavy your workload is
the data your workload accesses
how parallel your workload is
the underlying silicon available for acceleration
the bandwidth and latency of the communication channels
Do not forget Amdahl's Law:
Even if you have found the right workload that is an ideal candidate for hardware acceleration, the struggle does not end there. Why? Because the famous Amdahl's law comes into play. You might be able to significantly speed up a workload, but if it is only 2% of the runtime of the application, then even if you speed it up infinitely (take its run time to zero) you will only speed up the overall application by about 2% at the system level. Hence, your ideal workload should not only be ideal algorithmically; it should also contribute significantly to the overall runtime of your system.
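For reference, Amdahl's law says the overall speedup is 1 / ((1 - p) + p / s), where p is the fraction of runtime being accelerated and s is the speedup of that fraction. A minimal sketch with illustrative numbers (not taken from any particular workload):
#include <stdio.h>

/* p: fraction of total runtime that is accelerated, s: speedup of that fraction */
static double amdahl_speedup(double p, double s)
{
    return 1.0 / ((1.0 - p) + p / s);
}

int main(void)
{
    printf("p=0.02, s=inf -> %.4f\n", amdahl_speedup(0.02, 1e9));  /* ~1.02x: the 2%% case above */
    printf("p=0.90, s=10  -> %.4f\n", amdahl_speedup(0.90, 10.0)); /* ~5.26x */
    return 0;
}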

OpenCL Pseudo random generator race condition

Using this question as a basis, I implemented a pseudo-random number generator with a global state:
__global uint global_random_state;

void set_random_seed(uint seed){
    global_random_state = seed;
}

uint get_random_number(uint range){
    uint seed = global_random_state + get_global_id(0);
    uint t = seed ^ (seed << 11);
    uint result = seed ^ (seed >> 19) ^ (t ^ (t >> 8));
    global_random_state = result; /* race condition? */
    return result % range;
}
Since these functions will be used from multiple threads, there will be a race condition present when writing to global_random_state.
This might actually help the system to be more unpredictable, so it seems like a good thing, but I'd like to know if there are any consequences to this that might not surface immediately. Are there any side-effects inside the GPU which might cause problems later on when the kernel is run?
In theory you want atom_cmpxchg for correctness here (or its equivalent on your GPGPU platform). However, a grave note of warning: having the entire machine serialize through a single cache line is going to strangle your performance fundamentally. Atomics on the same address must form a queue and wait. Atomics on different locations can parallelize (more details at the end).
Generally, algorithms that leverage random variables on GPGPUs keep a separate copy of the random-variable generator per work item. This enables each work item to cache and potentially reuse its own random state without glutting the bus with memory traffic on every new random number. Search for "OpenCL Monte Carlo" "Simulation" or "Example" for samples. CUDA has some nice examples too.
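A minimal sketch of that per-work-item approach, assuming the host pre-seeds a states buffer with one distinct 32-bit seed per work item (the buffer name, kernel name, and seeding scheme are placeholders):
/* Each work item owns states[get_global_id(0)]; no two items touch the same slot,
   so there is no race and no atomic traffic. */
uint xorshift32(uint x)
{
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return x;
}

__kernel void use_random(__global uint *states, __global float *out, uint range)
{
    size_t gid = get_global_id(0);
    uint s = states[gid];       /* load this work item's private state */
    s = xorshift32(s);
    out[gid] = (float)(s % range);
    states[gid] = s;            /* store it back for the next launch */
}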
Another option is to use a random generator that allows one to skip ahead and have different work items move forward in the sequence by different amounts. This can be more compute intensive, but the tradeoff is that you don't strain the memory hierarchy as much.
More gory details on atomics: (1) GPU cache atomics are designed to expect contiguous arrays and atomic ALUs are per bank, (2) each dword in a cache line will be processed by the same atomic ALU each time, and (3) neighbouring cache lines will hash to different banks. So, if every clock you are doing atomics on contiguous cache lines of data, then the work should be perfectly spread out (or statistically so). Conversely, if you make every work item atomically modify the same 32 bits, then the cache system cannot spread the 16/32/64 lanes (whatever your system uses) across atomic ALUs: by #2 above, it must break the operation up into 16/32/64 separate atomic operations and apply them iteratively. In a system where you have 512 ALUs to process atomics, you would be using 1 of those ALUs each clock (the same one). Spread the work out and you can use all 512 per clock.

Faster memory allocation and freeing algorithm than multiple Free List method

We allocate and free many memory blocks. We use the memory heap; however, heap access is costly.
For faster memory allocation and freeing, we adopted a global Free List. As this is a multithreaded program, the Free List is protected by a Critical Section. However, the Critical Section causes a bottleneck in parallelism.
To remove the Critical Section, we assigned a Free List to each thread, i.e. in Thread Local Storage. However, thread T1 always allocates memory blocks and thread T2 always frees them, so the Free List in thread T2 keeps growing and the Free List gives no benefit.
Despite the Critical Section bottleneck, we then adopted Critical Sections again with a different method. We prepare several Free Lists as well as one Critical Section per Free List, thus Free Lists 0~N-1 and Critical Sections 0~N-1. We keep an atomically operated integer value which cycles through 0, 1, 2, ... N-1 and then back to 0, 1, 2, ... again. For each allocation and freeing, we read the integer value X, advance it, access the X-th Critical Section, then access the X-th Free List. However, this is quite a bit slower than the previous method (using Thread Local Storage); the atomic operation gets slow as there are more threads.
Since mutating the integer value non-atomically causes no corruption, we did the mutation in a non-atomic way. However, because the integer value is sometimes stale, there are many chances of different threads accessing the same Critical Section and Free List. This causes the bottleneck again, though it is much rarer than in the previous method.
Instead of the integer value, we then used the thread ID hashed into the range 0~N-1, and the performance got better.
I guess there must be a much better way of doing this, but I cannot find an exact one. Are there any ideas for improving what we have made?
Dealing with heap memory is a task for the OS. Nothing guarantees you can do a better/faster job than the OS does.
But there are some conditions where you can get a bit of improvement, especially when you know something about your memory usage that is unknown to the OS.
I'm writing down my untested idea here; I hope you'll get some profit from it.
Let's say you have T threads, all of them reserving and freeing memory. The main goal is speed, so I'll try not to use TLS, nor critical sections, nor atomic ops.
If (repeat: if, if, if) the app can work with several discrete sizes of memory blocks (not random sizes, so as to avoid fragmentation and useless holes) then start by asking the OS for a number of these discrete blocks.
For example, you have an array of n1 blocks each of size size1, an array of n2 blocks each of size size2, an array of n3... and so on. Each array is two-dimensional; the second field just stores a used/free flag for the block. If your arrays are very large then it's better to use a dedicated array for the flags (because contiguous memory access is always faster).
Now, someone asks for a block of memory of size sB. A specialized function (or object or whatever) searches the array of blocks of size greater than or equal to sB, and then selects a block by looking at the used/free flag. Just before ending this task the proper block-flag is set to "used".
When two or more threads ask for blocks of the same size there may be a corruption of the flag. Using TLS would solve this issue, and so would critical sections. I think you can set a bool flag at the beginning of the search into the flags-array, which makes the other threads wait until the flag changes, which only happens after the block-flag changes. With pseudo code:
MemoryGetter(sB)
{
    //select which array depending on 'sB'
    for (i=0, i < numOfarrays, i++)
        if (sizeOfArr(i) >= sB)
            arrMatch = i
            break //exit for

    //wait if another thread wants a block from the same arrMatch array
    while ( searching(arrMatch) == true )
        ; //wait

    //block other threads wanting a block from the same arrMatch array
    searching(arrMatch) = true

    //Get the first free block
    for (i=0, i < numOfBlocks, i++)
        if ( arrOfUsed(arrMatch, i) != true )
            selectedBlock = addressOf(....)
            //mark the block as used
            arrOfUsed(arrMatch, i) = true
            break //exit for

    //Allow other threads
    searching(arrMatch) = false

    return selectedBlock //NOTE: selectedBlock==NULL means no free block
}
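One caveat if you go down this route: a plain bool for searching() is itself racy, since two threads can both read it as false and enter the search at the same time. The guard needs to be an atomic test-and-set; a minimal C++ sketch (the names and the array count are placeholders, and a plain std::mutex per size-class works just as well):
#include <atomic>

constexpr int kNumArrays = 8;              // hypothetical number of size-class arrays
std::atomic<bool> searching[kNumArrays];   // set every element to false at startup

void lock_array(int arrMatch)              // spin until we flip the flag false -> true
{
    bool expected = false;
    while (!searching[arrMatch].compare_exchange_weak(expected, true))
        expected = false;                  // another thread holds it; reset and retry
}

void unlock_array(int arrMatch)
{
    searching[arrMatch].store(false);
}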
Freeing a block is easier, just mark it as free, no thread concurrency issue.
Dealing with no free blocks is up to you (wait, use a bigger block, ask OS for more, etc).
Note that the whole memory is reserved from the OS at app start, which can be a problem.
If this idea makes your app faster, let me know. What I can say for sure is that the memory used is greater than if you use normal OS requests, but not by much if you choose "good" sizes, the ones most used.
Some improvements can be done:
Cache the last freed block (per size) so as to avoid the search.
Start with not that many blocks, and ask the OS for more memory only when needed. Play with the 'number of blocks' for each size depending on your app. Find the optimal case.

Why is it a Bad Thing for one process to be able to read, or even write, to memory occupied by a different process? [duplicate]

It's my understanding that if two threads are reading from the same piece of memory, and no thread is writing to that memory, then the operation is safe. However, I'm not sure what happens if one thread is reading and the other is writing. What would happen? Is the result undefined? Or would the read just be stale? If a stale read is not a concern, is it OK to have unsynchronized read-write access to a variable? Or is it possible the data would be corrupted, so that neither the read nor the write would be correct and one should always synchronize in this case?
I want to say that I've learned it is the latter case, that a race on memory access leaves the state undefined... but I don't remember where I may have learned that and I'm having a hard time finding the answer on Google. My intuition is that a variable is operated on in registers, and that true (as in hardware) concurrency is impossible (or is it?), so that the worst that could happen is stale data, i.e. the following:
WriteThread: copy value from memory to register
WriteThread: update value in register
ReadThread: copy value of memory to register
WriteThread: write new value to memory
At which point the read thread has stale data.
Usually memory is read or written in atomic units determined by the CPU architecture (32-bit and 64-bit items aligned on 32-bit and 64-bit boundaries are common these days).
In this case, what happens depends on the amount of data being written.
Let's consider the case of 32 bit atomic read/write cells.
If two threads write 32 bits into such an aligned cell, then it is absolutely well defined what happens: one of the two written values is retained. Unfortunately for you (well, the program), you don't know which value. By extremely clever programming, you can actually use this atomicity of reads and writes to build synchronization algorithms (e.g., Dekker's algorithm), but it is typically faster to use architecturally defined locks instead.
If two threads write more than an atomic unit (e.g., they both write a 128-bit value), then the atomic-unit-sized pieces of the values written will be stored in an absolutely well-defined way, but you won't know which pieces of which value get written in what order. So what may end up in storage is the value from the first thread, the value from the second thread, or a mix of atomic-unit-sized pieces of both.
Similar ideas hold for one thread reading, and one thread writing in atomic units, and larger.
Basically, you don't want to do unsynchronized reads and writes to memory locations, because you won't know the outcome, even though it may be very well defined by the architecture.
The result is undefined. Corrupted data is entirely possible. For an obvious example, consider a 64-bit value being manipulated by a 32-bit processor. Let's assume the value is a simple counter, and we increment it when the lower 32-bits contain 0xffffffff. The increment produces 0x00000000. When we detect that, we increment the upper word. If, however, some other thread read the value between the time the lower word was incremented and the upper word was incremented, they get a value with an un-incremented upper word, but the lower word set to 0 -- a value completely different from what it would have been either before or after the increment is complete.
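Spelled out as code, the window described above sits between the two stores (illustrative only; whether a reader actually hits it depends on scheduling):
#include <stdint.h>

/* The 64-bit counter 0x1FFFFFFFF stored as two 32-bit halves on a 32-bit CPU. */
volatile uint32_t counter_lo = 0xFFFFFFFFu;
volatile uint32_t counter_hi = 0x00000001u;

void increment(void)
{
    counter_lo = counter_lo + 1;      /* lower word wraps to 0x00000000 */
    /* A reader interleaved here sees hi = 1, lo = 0, i.e. 0x100000000,
       which is neither the old value (0x1FFFFFFFF) nor the new one (0x200000000). */
    if (counter_lo == 0)
        counter_hi = counter_hi + 1;  /* upper word catches up afterwards */
}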
As I hinted in Ira Baxter's answer, CPU cache also plays a part on multicore systems. Consider the following test code:
DANGER WILL ROBINSON!
The following code boosts priority to realtime to achieve somewhat more consistent results - while doing so requires admin privileges, be careful if running the code on dual- or single-core systems, since your machine will lock up for the duration of the test run.
#include <windows.h>
#include <stdio.h>
const int RUNFOR = 5000;
volatile bool terminating = false;
volatile int value;
static DWORD WINAPI CountErrors(LPVOID parm)
{
    int errors = 0;
    while(!terminating)
    {
        value = (int) parm;
        if(value != (int) parm)
            errors++;
    }
    printf("\tThread %08X: %d errors\n", parm, errors);
    return 0;
}
static void RunTest(int affinity1, int affinity2)
{
    terminating = false;
    DWORD dummy;
    HANDLE t1 = CreateThread(0, 0, CountErrors, (void*)0x1000, CREATE_SUSPENDED, &dummy);
    HANDLE t2 = CreateThread(0, 0, CountErrors, (void*)0x2000, CREATE_SUSPENDED, &dummy);
    SetThreadAffinityMask(t1, affinity1);
    SetThreadAffinityMask(t2, affinity2);
    ResumeThread(t1);
    ResumeThread(t2);
    printf("Running test for %d milliseconds with affinity %d and %d\n", RUNFOR, affinity1, affinity2);
    Sleep(RUNFOR);
    terminating = true;
    Sleep(100); // let threads have a chance of picking up the "terminating" flag.
}
int main()
{
    SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS);
    RunTest(1, 2); // core 1 & 2
    RunTest(1, 4); // core 1 & 3
    RunTest(4, 8); // core 3 & 4
    RunTest(1, 8); // core 1 & 4
}
On my quad-core Intel Q6600 system (which IIRC has two pairs of cores where each pair shares L2 cache - that would explain the results anyway ;)), I get the following results:
Running test for 5000 milliseconds with affinity 1 and 2
Thread 00002000: 351883 errors
Thread 00001000: 343523 errors
Running test for 5000 milliseconds with affinity 1 and 4
Thread 00001000: 48073 errors
Thread 00002000: 59813 errors
Running test for 5000 milliseconds with affinity 4 and 8
Thread 00002000: 337199 errors
Thread 00001000: 335467 errors
Running test for 5000 milliseconds with affinity 1 and 8
Thread 00001000: 55736 errors
Thread 00002000: 72441 errors

What is the difference between mutex and critical section?

Please explain from the Linux and Windows perspectives.
I am programming in C#; would these two terms make a difference?
Please post as much as you can, with examples and such....
Thanks
For Windows, critical sections are lighter-weight than mutexes.
Mutexes can be shared between processes, but always result in a system call to the kernel which has some overhead.
Critical sections can only be used within one process, but have the advantage that they only switch to kernel mode in the case of contention - uncontended acquires, which should be the common case, are incredibly fast. In the case of contention, they enter the kernel to wait on some synchronization primitive (like an event or semaphore).
I wrote a quick sample app that compares the time between the two of them. On my system for 1,000,000 uncontended acquires and releases, a mutex takes over one second. A critical section takes ~50 ms for 1,000,000 acquires.
Here's the test code. I ran this and got similar results whether the mutex was first or second, so we aren't seeing any other effects.
HANDLE mutex = CreateMutex(NULL, FALSE, NULL);
CRITICAL_SECTION critSec;
InitializeCriticalSection(&critSec);
LARGE_INTEGER freq;
QueryPerformanceFrequency(&freq);
LARGE_INTEGER start, end;
// Force code into memory, so we don't see any effects of paging.
EnterCriticalSection(&critSec);
LeaveCriticalSection(&critSec);
QueryPerformanceCounter(&start);
for (int i = 0; i < 1000000; i++)
{
    EnterCriticalSection(&critSec);
    LeaveCriticalSection(&critSec);
}
QueryPerformanceCounter(&end);
int totalTimeCS = (int)((end.QuadPart - start.QuadPart) * 1000 / freq.QuadPart);
// Force code into memory, so we don't see any effects of paging.
WaitForSingleObject(mutex, INFINITE);
ReleaseMutex(mutex);
QueryPerformanceCounter(&start);
for (int i = 0; i < 1000000; i++)
{
    WaitForSingleObject(mutex, INFINITE);
    ReleaseMutex(mutex);
}
QueryPerformanceCounter(&end);
int totalTime = (int)((end.QuadPart - start.QuadPart) * 1000 / freq.QuadPart);
printf("Mutex: %d CritSec: %d\n", totalTime, totalTimeCS);
From a theoretical perspective, a critical section is a piece of code that must not be run by multiple threads at once because the code accesses shared resources.
A mutex is an algorithm (and sometimes the name of a data structure) that is used to protect critical sections.
Semaphores and Monitors are common implementations of a mutex.
In practice there are many mutex implementations available in Windows. They mainly differ, as a consequence of their implementation, in their level of locking, their scope, their cost, and their performance under different levels of contention. See CLR Inside Out - Using concurrency for scalability for a chart of the costs of different mutex implementations.
Available synchronization primitives:
Monitor
Mutex
Semaphore
ReaderWriterLock
ReaderWriterLockSlim
Interlocked
The lock(object) statement is implemented using a Monitor - see MSDN for reference.
In recent years much research has been done on non-blocking synchronization. The goal is to implement algorithms in a lock-free or wait-free way. In such algorithms a process helps other processes to finish their work so that it can finally finish its own work. As a consequence, a process can finish its work even when other processes that tried to perform some work hang. With locks, those processes would not release their locks and would prevent other processes from continuing.
In addition to the other answers, the following details are specific to critical sections on Windows:
in the absence of contention, acquiring a critical section is as simple as an InterlockedCompareExchange operation
the critical section structure holds room for a mutex. It is initially unallocated
if there is contention between threads for a critical section, the mutex will be allocated and used. The performance of the critical section will degrade to that of the mutex
if you anticipate high contention, you can initialize the critical section with a spin count (see the sketch after this list).
if there is contention on a critical section with a spin count, the thread attempting to acquire the critical section will spin (busy-wait) for that many processor cycles. This can result in better performance than sleeping, as the number of cycles to perform a context switch to another thread can be much higher than the number of cycles taken by the owning thread to release the mutex
if the spin count expires, the mutex will be allocated
when the owning thread releases the critical section, it checks whether the mutex has been allocated; if it has, it sets the mutex to release a waiting thread
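For the spin-count case, the initialization looks like this (a sketch; 4000 is just an example value and <windows.h> is assumed, as in the earlier snippets):
CRITICAL_SECTION critSec;

// Spin roughly this many iterations before falling back to a kernel wait.
InitializeCriticalSectionAndSpinCount(&critSec, 4000);

EnterCriticalSection(&critSec);
// ... touch the shared data ...
LeaveCriticalSection(&critSec);

DeleteCriticalSection(&critSec);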
In Linux, I think they have a "spin lock" that serves a similar purpose to the critical section with a spin count.
Critical sections and mutexes are not operating-system specific; they are concepts of multithreading/multiprocessing.
Critical Section
A piece of code that must only be run by one thread at any given time (for example: there are 5 threads running simultaneously and a function called "critical_section_function" which updates an array... you don't want all 5 threads updating the array at once, so while the program is running critical_section_function(), none of the other threads must run their critical_section_function).
Mutex
A mutex is a way of implementing the critical section code (think of it like a token... the thread must have possession of it to run the critical_section_code).
A mutex is an object that a thread can acquire, preventing other threads from acquiring it. It is advisory, not mandatory; a thread can use the resource the mutex represents without acquiring it.
A critical section is a length of code that is guaranteed by the operating system not to be interrupted. In pseudo-code, it would be like:
StartCriticalSection();
DoSomethingImportant();
DoSomeOtherImportantThing();
EndCriticalSection();
The 'fast' Linux equivalent of a Windows critical section would be a futex, which stands for fast userspace mutex. The difference between a futex and a mutex is that with a futex, the kernel only becomes involved when arbitration is required, so you save the overhead of talking to the kernel each time the atomic counter is modified. That can save a significant amount of time negotiating locks in some applications.
A futex can also be shared amongst processes, using the means you would employ to share a mutex.
Unfortunately, futexes can be very tricky to implement (PDF). (2018 update, they aren't nearly as scary as they were in 2009).
Beyond that, it's pretty much the same across both platforms. You're making atomic, token-driven updates to a shared structure in a manner that (hopefully) does not cause starvation. What remains is simply the method of accomplishing that.
In Windows, a critical section is local to your process. A mutex can be shared/accessed across processes. Basically, critical sections are much cheaper. Can't comment on Linux specifically, but on some systems they're just aliases for the same thing.
Just to add my 2 cents: critical sections are defined as a structure, and operations on them are performed in a user-mode context.
ntdll!_RTL_CRITICAL_SECTION
+0x000 DebugInfo : Ptr32 _RTL_CRITICAL_SECTION_DEBUG
+0x004 LockCount : Int4B
+0x008 RecursionCount : Int4B
+0x00c OwningThread : Ptr32 Void
+0x010 LockSemaphore : Ptr32 Void
+0x014 SpinCount : Uint4B
Whereas mutexes are kernel objects (ExMutantObjectType) created in the Windows object directory. Mutex operations are mostly implemented in kernel mode. For instance, when creating a mutex, you end up calling nt!NtCreateMutant in the kernel.
Great answer from Michael. I've added a third test for the mutex class introduced in C++11. The result is somewhat interesting, and still supports his original endorsement of CRITICAL_SECTION objects for single processes.
mutex m;
HANDLE mutex = CreateMutex(NULL, FALSE, NULL);
CRITICAL_SECTION critSec;
InitializeCriticalSection(&critSec);
LARGE_INTEGER freq;
QueryPerformanceFrequency(&freq);
LARGE_INTEGER start, end;
// Force code into memory, so we don't see any effects of paging.
EnterCriticalSection(&critSec);
LeaveCriticalSection(&critSec);
QueryPerformanceCounter(&start);
for (int i = 0; i < 1000000; i++)
{
    EnterCriticalSection(&critSec);
    LeaveCriticalSection(&critSec);
}
QueryPerformanceCounter(&end);
int totalTimeCS = (int)((end.QuadPart - start.QuadPart) * 1000 / freq.QuadPart);
// Force code into memory, so we don't see any effects of paging.
WaitForSingleObject(mutex, INFINITE);
ReleaseMutex(mutex);
QueryPerformanceCounter(&start);
for (int i = 0; i < 1000000; i++)
{
    WaitForSingleObject(mutex, INFINITE);
    ReleaseMutex(mutex);
}
QueryPerformanceCounter(&end);
int totalTime = (int)((end.QuadPart - start.QuadPart) * 1000 / freq.QuadPart);
// Force code into memory, so we don't see any effects of paging.
m.lock();
m.unlock();
QueryPerformanceCounter(&start);
for (int i = 0; i < 1000000; i++)
{
    m.lock();
    m.unlock();
}
QueryPerformanceCounter(&end);
int totalTimeM = (int)((end.QuadPart - start.QuadPart) * 1000 / freq.QuadPart);
printf("C++ Mutex: %d Mutex: %d CritSec: %d\n", totalTimeM, totalTime, totalTimeCS);
My results were 217, 473, and 19 (note that my ratio of times for the last two is roughly comparable to Michael's, but my machine is at least four years younger than his, so you can see evidence of increased speed between 2009 and 2013, when the XPS-8700 came out). The new mutex class is twice as fast as the Windows mutex, but still less than a tenth the speed of the Windows CRITICAL_SECTION object. Note that I only tested the non-recursive mutex. CRITICAL_SECTION objects are recursive (one thread can enter them repeatedly, provided it leaves the same number of times).
I found the explanations stating that critical sections protect a code section from being entered by multiple threads quite misleading.
There is no point in protecting code, as code is read only and can't be modified by multiple threads. What one usually wants is to protect data from being modified by multiple threads, which leads to incoherent state. Commonly a mutex (or a critical section, fulfilling the same purpose) should be associated with some piece of data. Each code section accessing this data should acquire the mutex/critical section and release it when it's finished accessing the data. This may be considerably more fine grained than just locking threads out of entering a function. Also, from my experience, locking functions with some synchronisation is much more prone to errors, in particular deadlocks. A good article covering that topic can be found here:
https://www.bogotobogo.com/cplusplus/multithreaded4_cplusplus11B.php
So, in summary, (recursive) mutexes and critical sections basically fulfill the same purpose, which is not to protect code, but rather to protect data.
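To make the "associate the lock with the data" point concrete, here is a minimal C++ sketch (the struct and function names are hypothetical):
#include <mutex>
#include <vector>

// The mutex lives next to the data it protects, and every access path goes through it.
struct SharedValues {
    std::mutex m;
    std::vector<int> values;
};

void add_value(SharedValues& s, int v)
{
    std::lock_guard<std::mutex> lock(s.m);  // acquire before touching the data
    s.values.push_back(v);
}                                           // lock released automatically here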
Critical sections may be implemented more efficiently than plain kernel mutexes. The example given in the first answer is a bit misleading, because it does not depict what the synchronisation primitive is designed for: synchronising access to something from multiple threads. The example just measures the trivial case, when the critical section/mutex is never owned by another thread.
While critical sections could be more efficient if, for instance, two threads access data in short, interlocked periods, they could prove less efficient if many threads access the same piece of data. Each thread would spin until giving up and waiting on the semaphore that is part of the critical section's implementation.
Such a case should also be considered when measuring execution times.
A C function is called reentrant if it uses only its actual parameters and local variables.
Reentrant functions can be called by multiple threads at the same time.
Example of reentrant function:
int reentrant_function (int a, int b)
{
    int c;
    c = a + b;
    return c;
}
Example of non reentrant function:
int result;
void non_reentrant_function (int a, int b)
{
    int c;
    c = a + b;
    result = c;
}
The C standard library strtok() is not reentrant and can't be used by 2 or more threads at the same time.
Some platform SDKs come with a reentrant version of strtok() called strtok_r().
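For example, with the POSIX strtok_r() each caller supplies its own save pointer, so threads tokenising different strings do not interfere (a minimal sketch):
#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[] = "a,b,c";
    char *saveptr;                       /* per-caller state instead of a hidden global */
    char *tok = strtok_r(line, ",", &saveptr);
    while (tok != NULL) {
        printf("%s\n", tok);
        tok = strtok_r(NULL, ",", &saveptr);
    }
    return 0;
}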
