Unable to understand the correctness of Peterson's algorithm

I have a scenario to discuss here for Peterson Algorithm:
flag[0] = 0;
flag[1] = 0;
turn;

P0: flag[0] = 1;
    turn = 1;
    while (flag[1] == 1 && turn == 1)
    {
        // busy wait
    }
    // critical section
    ...
    // end of critical section
    flag[0] = 0;

P1: flag[1] = 1;
    turn = 0;
    while (flag[0] == 1 && turn == 0)
    {
        // busy wait
    }
    // critical section
    ...
    // end of critical section
    flag[1] = 0;
Suppose both processes start executing concurrently. P0 sets flag[0] = 1 and then dies. Then P1 starts. Its while condition is satisfied, since flag[0] = 1 (set by P0) and turn = 0, so it is stuck in this loop forever, which is a deadlock.
So does Peterson's algorithm not account for this case?
If the death of a process is not to be considered when analyzing such algorithms, then note that Appendix A of William Stallings' Operating Systems book contains a series of concurrency algorithms, starting with four incorrect ones for demonstration. It proves them incorrect by considering, among other cases, a process dying before completion, but then claims that Peterson's algorithm is correct.
I came across this thread, which gave me a clue that there is a problem when considering N processes (N > 2), but that for two processes the algorithm works fine.
So is there a problem in the analysis discussed above (suggested by one of my classmates, and I fully second him), or does Peterson's algorithm not claim deadlock freedom even for two processes?
In the paper Some myths about famous mutual exclusion algorithms, the author concluded that Peterson never claimed his algorithm assures bounded bypass.
Can unbounded bypass be thought of as the bypass count going to infinity, as in the case of deadlock? In that case we would have less trouble accepting that Peterson's algorithm can lead to deadlock.

Certainly you could write code that would throw an unhandled exception, but the algorithm supposes that the executing process will always set its flag to false after its critical section has executed. Therefore Peterson's algorithm does pass the three tests for critical sections.
1) Mutual exclusion - flag[0] and flag[1] can both be true, but turn can only be 0 or 1. Therefore only one of the two critical sections can be executed. The other will spin-wait.
2) Progress - If process 0 is in the critical section, then turn = 0 and flag[0] is true. Once it has completed its critical section (even if something catastrophic happens), it must set flag[0] to false. If process 1 was spin-waiting, it will now be freed, as flag[0] != true.
3) Bounded waiting - If process 1 is waiting, process 0 can only enter its critical section once before process 1 is given the green light, as explained in #2.
The problem with Peterson's algorithm is that on modern architectures a CPU cache can break the mutual exclusion requirement. The problem relates to cache coherence: it is possible that the cache used by process 0 on CPU 0 sets flag[0] to true, while process 1 on CPU 1 still sees flag[0] as false. In that case both processes would enter their critical sections, and BANG... mutual exclusion fails, and race conditions are now possible.

You are right, Peterson's algorithm assumes processes do not fail while executing the synchronization part of the algorithm. I do not have access to the book you mention, but perhaps the other algorithms are incorrect because they do not account for processes failing outside of the synchronization parts (which is worse)?
Note: while still interesting historically, Peterson's algorithm also makes assumptions about the way memory works that are not valid on today's hardware.
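For reference, here is a minimal sketch (my own, not from any of the books mentioned) of the two-process entry/exit protocol expressed with C++11 atomics; using the default sequentially consistent operations is one way to restore the strict memory ordering the original algorithm assumes:
#include <atomic>

std::atomic<bool> flag[2] = {false, false};  // "I want to enter"
std::atomic<int>  turn{0};                   // who must wait when both want in

void lock(int self)                          // self is 0 or 1
{
    int other = 1 - self;
    flag[self].store(true);                  // announce intent (seq_cst by default)
    turn.store(other);                       // give priority to the other process
    while (flag[other].load() && turn.load() == other)
        ; // busy wait
}

void unlock(int self)
{
    flag[self].store(false);
}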

Most locking algorithms don't account for a process dying while it is within the critical section (how can other processes distinguish between a process having died after taking a lock and a process merely taking a long time?).
A process dying when it is not in a critical section, however, shouldn't prevent other processes from entering or leaving. For example, a scheme where two processes "take turns" to enter the critical section is problematic: if one process dies outside the critical section, the second waits forever for the first to have its turn. This is perhaps what your teacher was referring to.
(As a hacky solution, you could attempt to handle processes dying within a critical section with timeouts; if a process takes too long, you assume it has died. This comes at the risk of allowing two processes into the critical section if one merely takes too long, though. A rough sketch of this idea follows.)
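As an illustration of that timeout idea (my own construction, not part of Peterson's algorithm): a lease-based lock whose lock word stores the holder's expiry time, so a waiter may steal the lock once the lease has expired.
#include <atomic>
#include <chrono>
#include <cstdint>

// Holding the lock means owning an unexpired lease. A waiter may take over
// once the lease has expired, which (as noted above) risks two holders if the
// original owner was merely slow.
using Clock = std::chrono::steady_clock;
std::atomic<int64_t> lease_expiry_ns{0};   // 0 means unlocked

bool try_acquire(std::chrono::nanoseconds lease)
{
    int64_t now = std::chrono::duration_cast<std::chrono::nanoseconds>(
                      Clock::now().time_since_epoch()).count();
    int64_t current = lease_expiry_ns.load();
    if (current != 0 && current > now)
        return false;                       // held and the lease has not expired
    // Free, or the holder's lease expired: try to take (or steal) the lock.
    return lease_expiry_ns.compare_exchange_strong(current, now + lease.count());
}

void release() { lease_expiry_ns.store(0); }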

Even some of the semaphore-based methods fail if we assume this premature dying of processes (try the producer/consumer problem).
So we cannot say that the algorithm is incorrect, just that it wasn't designed for the purpose we have in mind. These are all misconceptions.

I agree with Fabio. Instead of saying "P0 sets flag[0] = 1 and dies", I would like to consider the scenario in which P0 is preempted by the scheduler and P1 is scheduled immediately after P0 sets its flag[0] to 1.
Then both processes get trapped in a busy wait, as follows:
flag[0] = 1, flag[1] = 1 and turn = 0.
This means P1 will busy-wait, since the condition in its while loop is true.
Now, if P1 is preempted, say due to a timeout, and P0 is scheduled again, P0 sets turn = 1, giving:
flag[0] = 1, flag[1] = 1 and turn = 1.
This means P0 will also busy-wait, since the condition in its while loop is true.
Now both processes are busy-waiting for each other and deadlock occurs.
I don't know why this solution is so famous, or are we missing something?

Related

Data races with MESI optimization

I don't really understand what exactly is causing the problem in this example:
Here is a snippet from my book:
Based on the discussion of the MESI protocol in the preceding section, it would seem that the problem of data sharing between L1 caches in a multicore machine has been solved in a watertight way. How, then, can the memory ordering bugs we've hinted at actually happen?
There's a one-word answer to that question: Optimization. On most hardware, the MESI protocol is highly optimized to minimize latency. This means that some operations aren't actually performed immediately when messages are received over the ICB. Instead, they are deferred to save time. As with compiler optimizations and CPU out-of-order execution optimizations, MESI optimizations are carefully crafted so as to be undetectable by a single thread. But, as you might expect, concurrent programs once again get the raw end of this deal.
For example, our producer (running on Core 1) writes 42 into g_data and then immediately writes 1 into g_ready. Under certain circumstances, optimizations in the MESI protocol can cause the new value of g_ready to become visible to other cores within the cache coherency domain before the updated value of g_data becomes visible. This can happen, for example, if Core 1 already has g_ready's cache line in its local L1 cache, but does not have g_data's line yet. This means that the consumer (on Core 2) can potentially see a value of 1 for g_ready before it sees a value of 42 in g_data, resulting in a data race bug.
Here is the code:
int32_t g_data = 0;
int32_t g_ready = 0;

void ProducerThread() // running on Core 1
{
    g_data = 42;
    // assume no instruction reordering across this line
    g_ready = 1;
}

void ConsumerThread() // running on Core 2
{
    while (!g_ready)
        PAUSE();
    // assume no instruction reordering across this line
    ASSERT(g_data == 42);
}
How can g_data be computed but not present in the cache?
This can happen, for example, if Core
1 already has g_ready’s cache line in its local L1 cache, but does not have
g_data’s line yet.
If g_data is not in the cache, then why does the previous sentence end with "yet"? Would the CPU load the cache line containing g_data after it has been computed?
If we read this sentence:
This means that some operations aren’t actually performed immediately when messages are received over the ICB. Instead, they are deferred to save time.
Then what operation is deferred in our example with producer and consumer threads?
So basically I don't understand how, under the MESI protocol, some operations become visible to other cores in the wrong order, despite being performed in the right order by a specific core.
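For reference, here is a minimal sketch (my own, using C++11 std::atomic rather than the book's plain int32_t variables) of how the "assume no instruction reordering across this line" comments can be made explicit; the release store and acquire load are what forbid the consumer from observing g_ready == 1 before g_data == 42:
#include <atomic>
#include <cstdint>

int32_t g_data = 0;
std::atomic<int32_t> g_ready{0};

void ProducerThread() // running on Core 1
{
    g_data = 42;
    // release: no earlier write may become visible after this store
    g_ready.store(1, std::memory_order_release);
}

void ConsumerThread() // running on Core 2
{
    // acquire: no later read may be reordered before this load
    while (!g_ready.load(std::memory_order_acquire))
        ; // spin
    // here g_data == 42 is guaranteed to be visible
}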
PS:
This example is from the book "Game Engine Architecture, Third Edition" by Jason Gregory; it's on page 309. Here is the book

last warp loop unrolling in Nvidia's parallel reduction tutorial problem

I ran into a problem understanding the logic behind the "last warp loop unrolling" technique in Nvidia's parallel reduction tutorial, available here.
In the case of thread 31 (for which tid = 31), before unrolling the loop,
this thread only executes these operations:
sdata[31] += sdata[31+64]
sdata[31] += sdata[31+32]
But after the loop unrolling (as shown below):
The condition if (tid < 32) becomes true for thread 31, so the warpReduce function will be executed for it, and therefore all of these operations, which would not be executed in the version without unrolling, will now be executed:
sdata[31] += sdata[31+32] //for second time
sdata[31] += sdata[31+16]
...
sdata[31] += sdata[31+1]
What's the logic behind it?
First:
sdata[31] += sdata[31+32] //for second time
No, that's not the case; it doesn't get executed a second time. The loop terminates when the s variable is shifted right from 64 to 32, and the body of the loop is not executed for s = 32. Therefore the above statement is not executed in the body of the loop, because that would imply s = 32, which is excluded by the loop termination condition.
Now, on to your question. It's true that there is a behavioral difference between the two cases; however, the only result that matters at the end is sdata[0], and this behavioral difference does not affect the result calculated for sdata[0]. So the only thing left would be "does it matter for performance?"
I don't have an answer for you, but I doubt it would make a significant difference. In the non-warp-reduce case, at each loop iteration there is a shift-right operation on a register variable, followed by a test, followed by a predicated set of shared memory instructions. In the warp-reduce case, there is some extra shared memory load/store activity and add arithmetic, but no shift arithmetic or testing per reduction step.
With respect to the extra load/store activity, the only portion of this that matters is the portion that reaches "above" the warp range (i.e. 0-31). There is extra shared-memory loading activity going on here. The extra store activity and extra add arithmetic are irrelevant, because constraining these operations to less than a single warp is not any better performance-wise (this point is covered in the presentation itself: "We don’t need if (tid < s) because it doesn’t save any work"). So the only consideration here is the once-per-step "extra" read of shared memory, basically one additional transaction per step. Against that we have the shifting, conditional test, and predication.
I don't know which is faster, but my guess as to the "logic" would be:
The difference would be small. Shared memory pressure is unlikely to be an issue at this point in this code.
The person who wrote it either didn't consider this at all, or considered it and decided it was probably so trivial as to be not worthy of cluttering a presentation that is really focused on other things, and will be read by many people.
EDIT: Based on comments, there appears to still be some question about my claim that the behavioral difference does not affect the results calculation for sdata[0].
First, let's acknowledge that the only item we care about at the end is sdata[0]. sdata[1] or any other "result" is irrelevant for this discussion.
Let's make an observation about which thread calculations matter at each step. We can observe that at a given step in the final-warp reduction, the only threads that matter (i.e. that can have an effect on the final value in sdata[0]) are those whose index is less than the offset value:
sdata[tid] += sdata[tid + offset]; // where offset is 32, then 16, then 8, etc.
Why is this? In order to understand that, we need to understand 2 things. First, we must understand at this point that there is an expectation of warp-synchronous behavior. This is already identified in the presentation (slide 21) as a necessary precondition to convert the loop reduction to the unrolled final warp reduction. I'm not going to spend a lot of time on the definition of warp-synchronous, but it essentially means we are depending on the warp to execute in lockstep. A warp is 32 threads, and it means that when one thread is executing a particular instruction, every thread in the warp is executing that instruction, at that point in the instruction stream. Second, we need to carefully decompose the above line to understand the sequence of operations. The above line of C++ code will decompose into the following pseudo-machine-language code that the GPU is actually executing:
LD R0, sdata[tid]
LD R1, sdata[tid+offset]
ADD R3, R0, R1
ST sdata[tid], R3
In English: at each step in the final-warp unrolled reduction, each thread will load its sdata[tid] value, then each thread will load its sdata[tid+offset] value, then each thread will add those two values together, then each thread will store the result. Because the warp is executing in lockstep at this point, when each thread loads its sdata[tid] value, it means that every thread is loading its respective value at that instruction cycle/clock cycle, i.e. at that instant.
Now, let's revisit the overall operation. At the point in the sequence where we have:
sdata[tid] += sdata[tid + 16];
how can we justify the statement that the only threads that matter here are those whose tid value is less than the offset? The first thing each thread does is load sdata[tid]. Then each thread loads sdata[tid+16]. So at this point, threads 0-15 have loaded their own value, plus the values from locations 16-31. Threads 16-31 have loaded their own value, plus the values from locations 32-47. Then all 32 threads perform the addition, then all 32 threads perform the store operation. So thread 16, which also picked up the value from location 32, did not update the location 16 value until after the previous value at location 16 had been consumed (by thread 0 in this case). So the behavior of threads 16-31 at this point has no impact on the value computed for thread 0.
We can repeat the above process to show that for each offset, the threads whose indexes lie at or above the offset have no impact on the calculation for thread 0.
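To make the load-before-store argument concrete, here is a small host-side C++ sketch (my own, not from the presentation) that simulates each final-warp step in lockstep, with all 32 "threads" loading both operands before any of them stores, and checks that sdata[0] still ends up with the full sum:
#include <array>
#include <cassert>
#include <numeric>

// Simulate one step of the final-warp reduction under warp-synchronous
// (lockstep) execution: every "thread" performs LD R0, then LD R1, then the
// ADD/ST, before any thread moves on to the next instruction.
static void lockstep_step(std::array<int, 64>& sdata, int offset)
{
    std::array<int, 32> r0, r1;
    for (int tid = 0; tid < 32; ++tid) r0[tid] = sdata[tid];            // LD R0, sdata[tid]
    for (int tid = 0; tid < 32; ++tid) r1[tid] = sdata[tid + offset];   // LD R1, sdata[tid+offset]
    for (int tid = 0; tid < 32; ++tid) sdata[tid] = r0[tid] + r1[tid];  // ADD + ST
}

int main()
{
    std::array<int, 64> sdata;
    std::iota(sdata.begin(), sdata.end(), 1);                           // fill with 1..64
    const int expected = std::accumulate(sdata.begin(), sdata.end(), 0);
    for (int offset = 32; offset >= 1; offset /= 2)
        lockstep_step(sdata, offset);
    assert(sdata[0] == expected);   // only sdata[0] matters
    return 0;
}
Threads 16-31 do write stale-looking partial sums into sdata[16..31] along the way, exactly as described above, but the assertion shows that this never propagates into sdata[0].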

How to properly implement waiting for async computations?

I am having a little trouble and am asking for a hint. I am on the Windows platform, doing calculations in the following manner:
int input = 0;
int output; // junk bytes here
while (true) {
    async_enqueue_upload(input);    // completes instantly, but transfer will take 10us
    async_enqueue_calculate();      // completes instantly, but computation will take 80us
    async_enqueue_download(output); // completes instantly, but transfer will take 10us
    sync_wait_finish();             // must wait while output is fully calculated, and there is no junk
    input = process(output);        // I cannot launch the next step without doing this on the host
}
I am asking about the sync_wait_finish() part. I must wait for all devices to finish in order to combine all the results, process the data somehow, and upload a new portion that is based on the previous computation step. I need to sync data between steps, so I can't parallelize the steps. I know this is not a very performant setup. So let's proceed to the question.
I have two ways of checking completion within sync_wait_finish(). The first is to put the thread to sleep until completion:
while (!is_completed())
    Sleep(1);
This has very low performance, because the actual calculation takes, say, 100us, while the minimal Windows scheduler timestep is 1ms, so it gives an unacceptable 10x performance loss.
The second way is to check completion in an empty infinite loop:
while (!is_completed())
{} // do_nothing();
This gives good computation performance (the 10x back). But it is also an unsuitable solution, because it fully utilizes a CPU core doing absolutely useless work. How can I make the CPU "sleep" for exactly the time I need? (Each step has an equal amount of work.)
How is this case usually solved, when the calculation time is too long for an active spin-wait but too short compared to the scheduler timestep? Also, a related sub-question: how can this be done on Linux?
Fortunately, I succeeded in finding an answer on my own. In short: I should use Linux for this.
My investigation shows the following. On Windows there is a hidden function in ntdll, NtDelayExecution(). It is not exposed through the SDK, but it can be loaded in the following manner:
static int (__stdcall *NtDelayExecution)(BOOL Alertable, PLARGE_INTEGER DelayInterval) =
    (int (__stdcall*)(BOOL, PLARGE_INTEGER)) GetProcAddress(GetModuleHandleW(L"ntdll.dll"), "NtDelayExecution");
It allows setting sleep intervals in 100ns units. However, even that did not work well, as shown in the following benchmark:
SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS); // requires admin privileges
SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);
uint64_t hpf = qpf(); // QueryPerformanceFrequency()
uint64_t s0 = qpc();  // QueryPerformanceCounter()
uint64_t n = 0;
while (1) {
    sleep_precise(1); // NtDelayExecution(-1); waits one 100-nanosecond interval
    auto s1 = qpc();
    n++;
    auto passed = s1 - s0;
    if (passed >= hpf) {
        std::cout << "freq=" << (n * hpf / passed) << " hz\n";
        s0 = s1;
        n = 0;
    }
}
That yields something less than a 2000 Hz loop rate, and the result varies from line to line. That led me to the Windows thread-switching scheduler, which is totally unsuited to real-time tasks, with its minimum interval of 0.5ms (plus overhead). By the way, does anyone know how to tune that value?
Next came the Linux question: what can it do? I built a tiny custom 4.14 kernel with Buildroot and tested the benchmark code there. I replaced qpc() to return clock_gettime() data using the CLOCK_MONOTONIC clock, qpf() to simply return the number of nanoseconds in a second, and sleep_precise() to simply call clock_nanosleep(). I failed to find out what the difference between CLOCK_MONOTONIC and CLOCK_REALTIME is.
I was quite surprised to get a whopping 18.4 kHz frequency just out of the box, and it was quite stable. Testing several intervals, I found that I can set the loop to almost any frequency up to 18.4 kHz, but also that the actual measured wait time differs from what I asked for by about 1.6x. For example, if I ask to sleep for 100 us, it actually sleeps for ~160 us, giving a ~6.25 kHz frequency. Nothing else is running on the system, just the kernel, BusyBox and this test. I am not an experienced Linux user, and I am still wondering how I can tune this to be more real-time and deterministic. Can I push that maximum frequency even higher?
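For completeness, a minimal sketch of the Linux timing loop described above (assuming POSIX clock_gettime/clock_nanosleep are available); sleeping to an absolute CLOCK_MONOTONIC deadline avoids the drift that relative sleeps accumulate:
#include <cstdint>
#include <ctime>

static int64_t now_ns()
{
    timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return int64_t(ts.tv_sec) * 1000000000 + ts.tv_nsec;
}

// Sleep until an absolute deadline on CLOCK_MONOTONIC; retry if interrupted by a signal.
static void sleep_until_ns(int64_t deadline_ns)
{
    timespec ts;
    ts.tv_sec  = deadline_ns / 1000000000;
    ts.tv_nsec = deadline_ns % 1000000000;
    while (clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &ts, nullptr) != 0)
        ; // interrupted: sleep again until the same deadline
}

int main()
{
    const int64_t period_ns = 100000; // 100 us per step (illustrative)
    int64_t next = now_ns();
    for (;;) {
        next += period_ns;
        sleep_until_ns(next);
        // poll is_completed() / kick off the next step here
    }
}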

Two Process Solution Algorithm 1

Here is the two process solution algorithm 1:
turn = 0;
i = 0, j = 1;

do
{
    while (turn != i) ;  // if not i's turn, wait indefinitely
    // critical section
    turn = j;            // after i leaves critical section, lets j in
    // remainder section
} while (1);             // loop again
I understand that mutual exclusion is satisfied, because when P0 is in its critical section, P1 waits until it leaves, and after P0 updates turn, P1 enters its critical section. I don't understand why progress is not satisfied in this algorithm.
Progress means that if there is no process in the critical section, a waiting process should be able to enter the critical section without delay.
P0 updates turn after leaving the critical section, so P1, which waits in the while loop, should be able to enter the critical section. Can you please tell me why there is no progress, then?
Forward progress is defined as follows:
If no process is executing in its CS and there exist some processes that wish to enter their CS, then the selection of the process that will enter the CS next cannot be postponed indefinitely.
The code you wrote above does not satisfy this in the case where the threads are not balanced. Consider the following scenario:
P0 has entered the critical section, finished it, and set the turn to P1.
P1 enters the section, completes it, sets the turn back to P0.
P1 quickly completes the remainder section, and wishes to enter the critical section again. However, P0 still holds the turn.
P0 gets stalled somewhere in its remainder section indefinitely. P1 is starved.
In other words, this algorithm can't support a system where one of the processes runs much faster. It forces the critical section to be owned in strictly alternating turns: P0 -> P1 -> P0 -> P1 -> ... For forward progress we would like to allow a scenario where it is owned, for example, as P0 -> P1 -> P1 -> ..., continuing with P1 while P0 isn't ready to enter again. Otherwise P1 may be starved.
Peterson's algorithm fixes this by adding flags to indicate when a thread is ready to enter the critical section, on top of the turn-based fairness you have here. This guarantees that no one is stalled by the other thread's slowness, and that no one can enter multiple times in a row unless the other permits it.
You cannot be sure about the order in which the code in the two processes runs. If P1 runs first and tries to enter the critical section, it is not allowed to, because it is P0's turn. So P1 cannot enter the critical section even though there is no other process in it. Therefore progress is not fulfilled.
The problem here is that this depends entirely on the lower-level process scheduling. The OS usually takes a while to wake up a sleeping process, and this happens at a point when the process currently running on the CPU voluntarily gives up control by executing some blocking system call, or when a timer interrupt fires because the time quantum has expired. On a full SMP system this also requires some non-trivial in-kernel synchronization and signaling.
This means that process 0 can just loop leaving and entering critical section again without process 1 ever having a chance to run.
Also, I hope you are not relying on bare integer variables for mutual exclusion. These might be cached in a register by the compiler, and even if not, processor caches come into play. This is supposed to be done with special CPU instructions like test-and-set; a sketch follows.
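As an illustration of that last point, a minimal sketch using C++11's std::atomic_flag, whose test_and_set maps to an atomic CPU instruction rather than a plain cached read and write:
#include <atomic>

// Minimal test-and-set spinlock: the flag is modified only through an atomic
// read-modify-write, so it cannot be hoisted into a register the way a bare
// int can, and acquire/release ordering publishes the protected data.
class SpinLock {
    std::atomic_flag locked = ATOMIC_FLAG_INIT;
public:
    void lock()
    {
        while (locked.test_and_set(std::memory_order_acquire))
            ; // busy wait
    }
    void unlock()
    {
        locked.clear(std::memory_order_release);
    }
};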

What is the difference between mutex and critical section?

Please explain from the Linux and Windows perspectives.
I am programming in C#; would these two terms make a difference? Please post as much as you can, with examples and such.
Thanks
For Windows, critical sections are lighter-weight than mutexes.
Mutexes can be shared between processes, but always result in a system call to the kernel which has some overhead.
Critical sections can only be used within one process, but have the advantage that they only switch to kernel mode in the case of contention - Uncontended acquires, which should be the common case, are incredibly fast. In the case of contention, they enter the kernel to wait on some synchronization primitive (like an event or semaphore).
I wrote a quick sample app that compares the time between the two of them. On my system for 1,000,000 uncontended acquires and releases, a mutex takes over one second. A critical section takes ~50 ms for 1,000,000 acquires.
Here's the test code. I ran this and got similar results whether the mutex was first or second, so we aren't seeing any other effects.
HANDLE mutex = CreateMutex(NULL, FALSE, NULL);
CRITICAL_SECTION critSec;
InitializeCriticalSection(&critSec);

LARGE_INTEGER freq;
QueryPerformanceFrequency(&freq);
LARGE_INTEGER start, end;

// Force code into memory, so we don't see any effects of paging.
EnterCriticalSection(&critSec);
LeaveCriticalSection(&critSec);
QueryPerformanceCounter(&start);
for (int i = 0; i < 1000000; i++)
{
    EnterCriticalSection(&critSec);
    LeaveCriticalSection(&critSec);
}
QueryPerformanceCounter(&end);
int totalTimeCS = (int)((end.QuadPart - start.QuadPart) * 1000 / freq.QuadPart);

// Force code into memory, so we don't see any effects of paging.
WaitForSingleObject(mutex, INFINITE);
ReleaseMutex(mutex);
QueryPerformanceCounter(&start);
for (int i = 0; i < 1000000; i++)
{
    WaitForSingleObject(mutex, INFINITE);
    ReleaseMutex(mutex);
}
QueryPerformanceCounter(&end);
int totalTime = (int)((end.QuadPart - start.QuadPart) * 1000 / freq.QuadPart);

printf("Mutex: %d CritSec: %d\n", totalTime, totalTimeCS);
From a theoretical perspective, a critical section is a piece of code that must not be run by multiple threads at once because the code accesses shared resources.
A mutex is an algorithm (and sometimes the name of a data structure) that is used to protect critical sections.
Semaphores and Monitors are common implementations of a mutex.
In practice there are many mutex implementations available in Windows. They mainly differ, as a consequence of their implementation, in their level of locking, their scope, their cost, and their performance under different levels of contention. See CLR Inside Out - Using concurrency for scalability for a chart of the costs of different mutex implementations.
Available synchronization primitives:
Monitor
Mutex
Semaphore
ReaderWriterLock
ReaderWriterLockSlim
Interlocked
The lock(object) statement is implemented using a Monitor - see MSDN for reference.
In recent years much research has been done on non-blocking synchronization. The goal is to implement algorithms in a lock-free or wait-free way. In such algorithms a process helps other processes finish their work so that it can finally finish its own work. As a consequence, a process can finish its work even when other processes that tried to perform some work hang. Using locks, those processes would not release their locks and would prevent other processes from continuing.
In addition to the other answers, the following details are specific to critical sections on Windows:
in the absence of contention, acquiring a critical section is as simple as an InterlockedCompareExchange operation (see the sketch after this list)
the critical section structure holds room for a mutex. It is initially unallocated
if there is contention between threads for a critical section, the mutex will be allocated and used. The performance of the critical section will degrade to that of the mutex
if you anticipate high contention, you can allocate the critical section specifying a spin count.
if there is contention on a critical section with a spin count, the thread attempting to acquire the critical section will spin (busy-wait) for that many processor cycles. This can result in better performance than sleeping, as the number of cycles to perform a context switch to another thread can be much higher than the number of cycles taken by the owning thread to release the mutex
if the spin count expires, the mutex will be allocated
when the owning thread releases the critical section, it is required to check if the mutex is allocated, if it is then it will set the mutex to release a waiting thread
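Here is a minimal sketch of the uncontended fast path mentioned in the first point (illustrative only, not the actual CRITICAL_SECTION implementation, which additionally handles recursion, spin counts and the lazily allocated kernel wait object):
#include <windows.h>

volatile LONG lockWord = 0;  // 0 = free, 1 = held

void Acquire()
{
    // Uncontended case: a single atomic compare-exchange, no kernel call.
    while (InterlockedCompareExchange(&lockWord, 1, 0) != 0)
        Sleep(0);  // contended: yield; the real thing spins, then blocks on a kernel object
}

void Release()
{
    InterlockedExchange(&lockWord, 0);
}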
In Linux, I think they have a "spin lock" that serves a similar purpose to the critical section with a spin count.
Critical sections and mutexes are not operating-system specific; they are concepts of multithreading/multiprocessing.
Critical Section
A piece of code that must only be run by itself at any given time (for example, there are 5 threads running simultaneously and a function called "critical_section_function" which updates an array... you don't want all 5 threads updating the array at once. So while the program is running critical_section_function(), none of the other threads must run their critical_section_function).
Mutex
A mutex is a way of implementing the critical section code (think of it like a token... a thread must have possession of it to run the critical_section_code).
A mutex is an object that a thread can acquire, preventing other threads from acquiring it. It is advisory, not mandatory; a thread can use the resource the mutex represents without acquiring it.
A critical section is a length of code that is guaranteed by the operating system not to be interrupted. In pseudo-code, it would be like:
StartCriticalSection();
DoSomethingImportant();
DoSomeOtherImportantThing();
EndCriticalSection();
The Linux equivalent of the 'fast' Windows critical section would be a futex, which stands for fast userspace mutex. The difference between a futex and a mutex is that with a futex, the kernel only becomes involved when arbitration is required, so you save the overhead of talking to the kernel each time the atomic counter is modified. That can save a significant amount of time negotiating locks in some applications.
A futex can also be shared amongst processes, using the means you would employ to share a mutex.
Unfortunately, futexes can be very tricky to implement (PDF). (2018 update: they aren't nearly as scary as they were in 2009.)
Beyond that, it's pretty much the same across both platforms. You're making atomic, token-driven updates to a shared structure in a manner that (hopefully) does not cause starvation. What remains is simply the method of accomplishing that; a rough sketch follows.
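To give a feel for what that looks like, here is a heavily simplified sketch of a futex-based lock in the style of Drepper's paper (assuming Linux and the raw futex syscall; a real implementation needs more care, which is exactly why the paper calls futexes tricky). State 0 means free, 1 means locked, 2 means locked with possible waiters:
#include <atomic>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static std::atomic<int> state{0};

static long futex(std::atomic<int>* addr, int op, int val)
{
    return syscall(SYS_futex, reinterpret_cast<int*>(addr), op, val, nullptr, nullptr, 0);
}

void lock()
{
    int expected = 0;
    if (state.compare_exchange_strong(expected, 1))
        return;                                    // uncontended: no kernel involvement
    do {
        // Mark the lock as contended, then sleep in the kernel until woken.
        if (expected == 2 || state.compare_exchange_strong(expected, 2))
            futex(&state, FUTEX_WAIT, 2);
        expected = 0;
    } while (!state.compare_exchange_strong(expected, 2));
}

void unlock()
{
    if (state.exchange(0) == 2)                    // there may be waiters
        futex(&state, FUTEX_WAKE, 1);              // wake one of them
}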
In Windows, a critical section is local to your process. A mutex can be shared/accessed across processes. Basically, critical sections are much cheaper. Can't comment on Linux specifically, but on some systems they're just aliases for the same thing.
Just to add my 2 cents: critical sections are defined as a structure, and operations on them are performed in the user-mode context.
ntdll!_RTL_CRITICAL_SECTION
+0x000 DebugInfo : Ptr32 _RTL_CRITICAL_SECTION_DEBUG
+0x004 LockCount : Int4B
+0x008 RecursionCount : Int4B
+0x00c OwningThread : Ptr32 Void
+0x010 LockSemaphore : Ptr32 Void
+0x014 SpinCount : Uint4B
Mutexes, by contrast, are kernel objects (ExMutantObjectType) created in the Windows object directory. Mutex operations are mostly implemented in kernel mode. For instance, when creating a mutex, you end up calling nt!NtCreateMutant in the kernel.
Great answer from Michael. I've added a third test for the mutex class introduced in C++11. The result is somewhat interesting, and still supports his original endorsement of CRITICAL_SECTION objects for single processes.
std::mutex m; // C++11 mutex; requires #include <mutex>
HANDLE mutex = CreateMutex(NULL, FALSE, NULL);
CRITICAL_SECTION critSec;
InitializeCriticalSection(&critSec);

LARGE_INTEGER freq;
QueryPerformanceFrequency(&freq);
LARGE_INTEGER start, end;

// Force code into memory, so we don't see any effects of paging.
EnterCriticalSection(&critSec);
LeaveCriticalSection(&critSec);
QueryPerformanceCounter(&start);
for (int i = 0; i < 1000000; i++)
{
    EnterCriticalSection(&critSec);
    LeaveCriticalSection(&critSec);
}
QueryPerformanceCounter(&end);
int totalTimeCS = (int)((end.QuadPart - start.QuadPart) * 1000 / freq.QuadPart);

// Force code into memory, so we don't see any effects of paging.
WaitForSingleObject(mutex, INFINITE);
ReleaseMutex(mutex);
QueryPerformanceCounter(&start);
for (int i = 0; i < 1000000; i++)
{
    WaitForSingleObject(mutex, INFINITE);
    ReleaseMutex(mutex);
}
QueryPerformanceCounter(&end);
int totalTime = (int)((end.QuadPart - start.QuadPart) * 1000 / freq.QuadPart);

// Force code into memory, so we don't see any effects of paging.
m.lock();
m.unlock();
QueryPerformanceCounter(&start);
for (int i = 0; i < 1000000; i++)
{
    m.lock();
    m.unlock();
}
QueryPerformanceCounter(&end);
int totalTimeM = (int)((end.QuadPart - start.QuadPart) * 1000 / freq.QuadPart);

printf("C++ Mutex: %d Mutex: %d CritSec: %d\n", totalTimeM, totalTime, totalTimeCS);
My results were 217, 473, and 19 (note that my ratio of times for the last two is roughly comparable to Michael's, but my machine is at least four years younger than his, so you can see evidence of increased speed between 2009 and 2013, when the XPS-8700 came out). The new mutex class is twice as fast as the Windows mutex, but still less than a tenth the speed of the Windows CRITICAL_SECTION object. Note that I only tested the non-recursive mutex. CRITICAL_SECTION objects are recursive (one thread can enter them repeatedly, provided it leaves the same number of times).
I found the explanations stating that critical sections protect a code section from being entered by multiple threads quite misleading.
There is no point in protecting code, since code is read-only and can't be modified by multiple threads. What one usually wants is to protect data from being modified by multiple threads, which leads to incoherent state. Commonly a mutex (or critical section, fulfilling the same purpose) should be associated with some piece of data. Each code section accessing this data should acquire the mutex/critical section and release it when it has finished accessing the data. This may be considerably more fine-grained than just locking threads out of a function. Also, in my experience, locking whole functions with some synchronisation mechanism is much more error-prone, in particular more prone to deadlocks. A good article covering this topic can be found here:
https://www.bogotobogo.com/cplusplus/multithreaded4_cplusplus11B.php
So, in summary, (recursive) mutexes and critical sections basically fulfill the same purpose, which is not protecting code, but rather protecting data.
Critical sections may be implemented more efficiently than plain kernel mutexes. The example given in the first answer is a bit misleading, because it does not depict what the synchronisation primitive is designed for: synchronizing access to something from multiple threads. The example just measures the trivial case, where the critical section/mutex is never owned by another thread.
While critical sections could be more efficient if, for instance, two threads access data in short, interleaved periods, they could prove less efficient if many threads access the same piece of data. Each thread would spin until giving up and waiting on the semaphore that is part of the critical section's implementation.
Such a case should also be considered when measuring execution times.
A C function is called reentrant if it uses only its parameters and local variables.
Reentrant functions can be called by multiple threads at the same time.
Example of reentrant function:
int reentrant_function(int a, int b)
{
    int c;
    c = a + b;
    return c;
}
Example of non reentrant function:
int result;

void non_reentrant_function(int a, int b)
{
    int c;
    c = a + b;
    result = c;
}
The C standard library's strtok() is not reentrant and can't be used by two or more threads at the same time.
Some platform SDKs come with a reentrant version of strtok() called strtok_r().
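For example (assuming a POSIX platform where strtok_r is declared), strtok_r keeps its position in a caller-supplied pointer instead of hidden static state, so each thread can tokenize its own string independently:
#include <cstdio>
#include <cstring>

int main()
{
    char line[] = "a,b,c";
    char* save = nullptr;  // per-call state lives here, not in a static variable
    for (char* tok = strtok_r(line, ",", &save); tok != nullptr;
         tok = strtok_r(nullptr, ",", &save))
    {
        std::printf("%s\n", tok);
    }
    return 0;
}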

Resources