Atomicity, Volatility and Thread Safety in Windows

Atomicity, Volatility and Thread Safety in Windows - windows

It's my understanding of atomicity that it's used to make sure a value will be read/written in whole rather than in parts. For example, a 64-bit value that is really two 32-bit DWORDs (assume x86 here) must be atomic when shared between threads so that both DWORDs are read/written at the same time. That way one thread can't read half variable that's not updated. How do you guarantee atomicity?
Furthermore it's my understanding that volatility does not guarantee thread safety at all. Is that true?
I've seen it implied many places that simply being atomic/volatile is thread-safe. I don't see how that is. Won't I need a memory barrier as well to ensure that any values, atomic or otherwise, are read/written before they can actually be guaranteed to be read/written in the other thread?
So for example let's say I create a thread suspended, do some calculations to change some values to a struct available to the thread and then resume, for example:
HANDLE hThread = CreateThread(NULL, 0, thread_entry, (void *)&data, CREATE_SUSPENDED, NULL);
data->val64 = SomeCalculation();
ResumeThread(hThread);
I suppose this would depend on any memory barriers in ResumeThread? Should I do an interlocked exchange for val64? What if the thread were running, how does that change things?
I'm sure I'm asking a lot here but basically what I'm trying to figure out is what I asked in the title: a good explanation for atomicity, volatility and thread safety in Windows. Thanks

it's used to make sure a value will be read/written in whole
That's just a small part of atomicity. At its core it means "uninterruptible", an instruction on a processor whose side-effects cannot be interleaved with another instruction. By design, a memory update is atomic when it can be executed with a single memory-bus cycle. Which requires the address of the memory location to be aligned so that a single cycle can update it. An unaligned access requires extra work, part of the bytes written by one cycle and part by another. Now it is not uninterruptible anymore.
Getting aligned updates is pretty easy, it is a guarantee provided by the compiler. Or, more broadly, by the memory model implemented by the compiler. Which simply chooses memory addresses that are aligned, sometimes intentionally leaving unused gaps of a few bytes to get the next variable aligned. An update to a variable that's larger than the native word size of the processor can never be atomic.
But much more important are the kind of processor instructions you need to make threading work. Every processor implements a variant of the CAS instruction, compare-and-swap. It is the core atomic instruction you need to implement synchronization. Higher level synchronization primitives, like monitors (aka condition variables), mutexes, signals, critical sections and semaphores are all built on top of that core instruction.
That's the minimum, a processor usually provide extra ones to make simple operations atomic. Like incrementing a variable, at its core an interruptible operation since it requires a read-modify-write operation. Having a need for it be atomic is very common, most any C++ program relies on it for example to implement reference counting.
volatility does not guarantee thread safety at all
It doesn't. It is an attribute that dates from much easier times, back when machines only had a single processor core. It only affects code generation, in particular the way a code optimizer tries to eliminate memory accesses and use a copy of the value in a processor register instead. Makes a big, big difference to code execution speed, reading a value from a register is easily 3 times faster than having to read it from memory.
Applying volatile ensures that the code optimizer does not consider the value in the register to be accurate and forces it to read memory again. It truly only matters on the kind of memory values that are not stable by themselves, devices that expose their registers through memory-mapped I/O. It has been abused heavily since that core meaning to try to put semantics on top of processors with a weak memory model, Itanium being the most egregious example. What you get with volatile today is strongly dependent on the specific compiler and runtime you use. Never use it for thread-safety, always use a synchronization primitive instead.
simply being atomic/volatile is thread-safe
Programming would be much simpler if that was true. Atomic operations only cover the very simple operations, a real program often needs to keep an entire object thread-safe. Having all its members updated atomically and never expose a view of the object that is partially updated. Something as simple as iterating a list is a core example, you can't have another thread modifying the list while you are looking at its elements. That's when you need to reach for the higher-level synchronization primitives, the kind that can block code until it is safe to proceed.
Real programs often suffer from this synchronization need and exhibit Amdahls' law behavior. In other words, adding an extra thread does not actually make the program faster. Sometimes actually making it slower. Whomever finds a better mouse-trap for this is guaranteed a Nobel, we're still waiting.

In general, C and C++ don't give any guarantees about how reading or writing a 'volatile' object behaves in multithreaded programs. (The 'new' C++11 probably does since it now includes threads as part of the standard, but tradiationally threads have not been part of standard C or C++.) Using volatile and making assumptions about atomicity and cache-coherence in code that's meant to be portable is a problem. It's a crap-shoot as to whether a particular compiler and platform will treat accesses to 'volatile' objects in a thread-safe way.
The general rule is: 'volatile' is not enough to ensure thread safe access. You should use some platform-provided mechanism (usually some functions or synchronisation objects) to access thread-shared values safely.
Now, specifically on Windows, specifically with the VC++ 2005+ compiler, and specifically on x86 and x64 systems, accessing a primitive object (like an int) can be made thread-safe if:
On 64- and 32-bit Windows, the object has to be a 32-bit type, and it has to be 32-bit aligned.
On 64-bit Windows, the object may also be a 64-bit type, and it has to be 64-bit aligned.
It must be declared volatile.
If those are true, then accesses to the object will be volatile, atomic and be surrounded by instructions that ensure cache-coherency. The size and alignment conditions must be met so that the compiler makes code that performs atomic operations when accessing the object. Declaring the object volatile ensures that the compiler doesn't make code optimisations related to caching previous values it may have read into a register and ensures that code generated includes appropriate memory barrier instructions when it's accessed.
Even so, you're probably still better off using something like the Interlocked* functions for accessing small things, and bog standard synchronisation objects like Mutexes or CriticalSections for larger objects and data structures. Ideally, get libraries for and use data structures that already include appropriate locks. Let your libraries & OS do the hard work as much as possible!
In your example, I expect you do need to use a thread-safe access to update val64 whether the thread is started yet or not.
If the thread was already running, then you would definitely need some kind of thread-safe write to val64, either using InterchangeExchange64 or similar, or by acquiring and releasing some kind of synchronisation object which will perform appropriate memory barrier instructions. Similarly, the thread would need to use a thread-safe accessor to read it as well.
In the case where the thread hasn't been resumed yet, it's a bit less clear. It's possible that ResumeThread might use or act like a synchronisation function and do the memory barrier operations, but the documentation doesn't specify that it does, so it is better to assume that it doesn't.
References:
On atomicity of 32- and 64- bit aligned types... https://msdn.microsoft.com/en-us/library/windows/desktop/ms684122%28v=vs.85%29.aspx
On 'volatile' including memory fences... https://msdn.microsoft.com/en-us/library/windows/desktop/ms686355%28v=vs.85%29.aspx

Related

Which std::sync::atomic::Ordering to use?

All the methods of std::sync::atomic::AtomicBool take a memory ordering (Relaxed, Release, Acquire, AcqRel, and SeqCst), which I have not used before. Under what circumstances should these values be used? The documentation uses confusing “load” and “store” terms which I don’t really understand. For example:
A producer thread mutates some state held by a Mutex, then calls AtomicBool::compare_and_swap(false, true, ordering) (to coalesce invalidations), and if it swapped, posts an “invalidate” message to a concurrent queue (e.g. mpsc or a winapi PostMessage). A consumer thread resets the AtomicBool, reads from the queue, and reads the state held by the Mutex. Can the producer use Relaxed ordering because it is preceded by a mutex, or must it use Release? Can the consumer use store(false, Relaxed), or must it use compare_and_swap(true, false, Acquire) to receive the changes from the mutex?
What if the producer and consumer share a RefCell instead of a Mutex?

I'm not an expert on this, and it's really complicated, so please feel free to critique my post. As pointed out by mdh.heydari, cppreference.com has much better documentation of orderings than Rust (C++ has an almost identical API).
For your question
You'd need to use "release" ordering in your producer and "acquire" ordering in your consumer. This ensures that the data mutation occurs before the AtomicBool is set to true.
If your queue is asynchronous, then the consumer will need to keep trying to read from it in a loop, since the producer could get interrupted between setting the AtomicBool and putting something in the queue.
If the producer code might run multiple times before client runs, then you can't use RefCell because they could mutate the data while the client is reading it. Otherwise it's fine.
There are other better and simpler ways to implement this pattern, but I assume you were just giving it as an example.
What are orderings?
The different orderings have to do with what another thread sees happen when an atomic operation occurs. Compilers and CPUs are normally both allowed to reorder instructions in order to optimize code, and the orderings effect how much they're allowed to reorder instructions.
You could just always use SeqCst, which basically guarantees everyone will see that instruction as having occurred wherever you put it relative to other instructions, but in some cases if you specify a less restrictive ordering then LLVM and the CPU can better optimize your code.
You should think of these orderings as applying to a memory location (instead of applying to an instruction).
Ordering Types
Relaxed Ordering
There are no constraints besides any modification to the memory location being atomic (so it either happens completely or not at all). This is fine for something like a counter if the values retrieved by/set by individual threads don't matter as long as they're atomic.
Acquire Ordering
This constraint says that any variable reads that occur in your code after "acquire" is applied can't be reordered to occur before it. So, say in your code you read some shared memory location and get value X, which was stored in that memory location at time T, and then you apply the "acquire" constraint. Any memory locations that you read from after applying the constraint will have the value they had at time T or later.
This is probably what most people would expect to happen intuitively, but because a CPU and optimizer are allowed to reorder instructions as long as they don't change the result, it isn't guaranteed.
In order for "acquire" to be useful, it has to be paired with "release", because otherwise there's no guarantee that the other thread didn't reorder its write instructions that were supposed to occur at time T to an earlier time.
Acquire-reading the flag value you're looking for means you won't see a stale value somewhere else that was actually changed by a write before the release-store to the flag.
Release Ordering
This constraint says that any variable writes that occur in your code before "release" is applied can't be reordered to occur after it. So, say in your code you write to a few shared memory locations and then set some memory location t at time T, and then you apply the "release" constraint. Any writes that appear in your code before "release" is applied are guaranteed to have occurred before it.
Again, this is what most people would expect to happen intuitively, but it isn't guaranteed without constraints.
If the other thread trying to read value X doesn't use "acquire", then it isn't guaranteed to see the new value with respect to changes in other variable values. So it could get the new value, but it might not see new values for any other shared variables. Also keep in mind that testing is hard. Some hardware won't in practice show re-ordering with some unsafe code, so problems can go undetected.
Jeff Preshing wrote a nice explanation of acquire and release semantics, so read that if this isn't clear.
AcqRel Ordering
This does both Acquire and Release ordering (ie. both restrictions apply). I'm not sure when this is necessary - it might be helpful in situations with 3 or more threads if some Release, some Acquire, and some do both, but I'm not really sure.
SeqCst Ordering
This is most restrictive and, therefore, slowest option. It forces memory accesses to appear to occur in one, identical order to every thread. This requires an MFENCE instruction on x86 on all writes to atomic variables (full memory barrier, including StoreLoad), while the weaker orderings don't. (SeqCst loads don't require a barrier on x86, as you can see in this C++ compiler output.)
Read-Modify-Write accesses, like atomic increment, or compare-and-swap, are done on x86 with locked instructions, which are already full memory barriers. If you care at all about compiling to efficient code on non-x86 targets, it makes sense to avoid SeqCst when you can, even for atomic read-modify-write ops. There are cases where it's needed, though.
For more examples of how atomic semantics turn into ASM, see this larger set of simple functions on C++ atomic variables. I know this is a Rust question, but it's supposed to have basically the same API as C++. godbolt can target x86, ARM, ARM64, and PowerPC. Interestingly, ARM64 has load-acquire (ldar) and store-release (stlr) instructions, so it doesn't always have to use separate barrier instructions.
By the way, x86 CPUs are always "strongly ordered" by default, which means they always act as if at least AcqRel mode was set. So for x86 "ordering" only affects how LLVM's optimizer behaves. ARM, on the other hand, is weakly ordered. Relaxed is set by default, to allow the compiler full freedom to reorder things, and to not require extra barrier instructions on weakly-ordered CPUs.

How to understand acquire and release semantics?

I found out three function from MSDN , below:
1.InterlockedDecrement().
2.InterlockedDecrementAcquire().
3.InterlockedDecrementRelease().
I knew those fucntion use to decrement a value as an atomic operation, but i don't know distinction between the three function

(um... but don't ask me what does it mean exactly)
I'll take a stab at that.
Something to remember is that the compiler, or the CPU itself, may reorder memory reads and writes if they appear to not deal with each other.
This is useful, for instance, if you have some code that, maybe is updating a structure:
if ( playerMoved ) {
playerPos.X += dx;
playerPos.Y += dy;
// Keep the player above the world's surface.
if ( playerPos.Z + dz > 0 ) {
playerPos.Z += dz;
}
else {
playerPos.Z = 0;
}
}
Most of above statements may be reordered because there's no data dependency between them, in fact, a superscalar CPU may execute most of those statements simultaneously, or maybe would start working on the Z section sooner, since it doesn't affect X or Y, but might take longer.
Here's the problem with that - lets say that you're attempting lock-free programming. You want to perform a whole bunch of memory writes to, maybe, fill in a shared queue. You signal that you're done by finally writing to a flag.
Well, since that flag appears to have nothing to do with the rest of the work being done, the compiler and the CPU may reorder those instructions, and now you may set your 'done' flag before you've actually committed the rest of the structure to memory, and now your "lock-free" queue doesn't work.
This is where Acquire and Release ordering semantics come into play. I set that I'm doing work by setting a flag or so with an Acquire semantic, and the CPU guarantees that any memory games I play after that instruction stay actually below that instruction. I set that I'm done by setting a flag or so with a Release semantic, and the CPU guarantees that any memory games I had done just before the release actually stay before the release.
Normally, one would do this using explicit locks - mutexes, semaphores, etc, in which the CPU already knows it has to pay attention to memory ordering. The point of attempting to create 'lock free' data structures is to provide data structures that are thread safe (for some meaning of thread safe), that don't use explicit locks (because they are very slow).
Creating lock-free data structures is possible on a CPU or compiler that doesn't support acquire/release ordering semantics, but it usually means that some slower memory ordering semantic is used. For instance, you could issue a full memory barrier - everything that came before this instruction has to actually be committed before this instruction, and everything that came after this instruction has to be committed actually after this instruction. But that might mean that I wait for a bunch of actually irrelevant memory writes from earlier in the instruction stream (perhaps function call prologue) that has nothing to do with the memory safety I'm trying to implement.
Acquire says "only worry about stuff after me". Release says "only worry about stuff before me". Combining those both is a full memory barrier.

http://preshing.com/20120913/acquire-and-release-semantics/
Acquire semantics is a property which can only apply to operations
which read from shared memory, whether they are read-modify-write
operations or plain loads. The operation is then considered a
read-acquire. Acquire semantics prevent memory reordering of the
read-acquire with any read or write operation which follows it in
program order.
Release semantics is a property which can only apply to operations
which write to shared memory, whether they are read-modify-write
operations or plain stores. The operation is then considered a
write-release. Release semantics prevent memory reordering of the
write-release with any read or write operation which precedes it in
program order.
(um... but don't ask me what does it mean exactly)

How does my OS know in which direction to flush if I use OpenMP flush function?

A thread that reads a shared variable has first to call flush, and a thread that writes to a shared variable has to call OpenMP flush afterwards, to keep the shared variable in main memory and cache synchronized. How does the flush function know in which direction to flush? It needs to know which of both variables (main memory or cache) is newer. I assume, but I am not sure, that the OS or CPU take care of this somehow. Does someone know?

flush is not a function - it is an OpenMP compiler directive. It affects the way the compiler generates the executable code and instructs it to synchronise the values of all optimised variables (stored in CPU registers or other explicitly programmable cache / thread-local memory) in the flush-set. This is similar to the effect that the volatile storage modifier has on code generation, but has more limited point-local effect.
How does it work? While parsing the source code, the compiler analyses the flow of statements and the data (variables) that gets affected by those statements. Consequently the compiler builds an execution graph and a data dependency graph from the code. It knows exactly where and how the value of each variable is being used and the execution of which code block affects which variables. Then the compiler tries to optimise the code by simplifying the graph and to reduce the number of expensive memory operations by either using CPU registers to store intermediate values or by using another for of faster thread-addressable local memory. The flush directive adds special points in the execution graph, where the compiler must explicitly synchronise the memory view of the thread (register variables and local-memory variables) with the global shared memory. Since the compiler has built the dependency graph in the first place, it knows exactly which variables in the flush-set were modified and hence have to be written to the shared memory; all other variables in the flush-set have to be read from the shared memory.
So the answer to your question is that it is usually the compiler who processes the flush directive, not the OS, although the compiler might call into the OS to actually implement the flush, e.g. on systems with explicitly programmable caches/local memories. But one should also note that OpenMP is an abstract standard, which can be implemented on many different hardware platforms and that some of those platforms provide certain hardware that can help with implementing the OpenMP abstractions more efficiently (e.g. the CPU ASIC in IBM's Blue Gene/Q provides many such features).

You don't need to call flush to keep shared variables synchronized.
The hardware (CPU) does keep track of cached memory and if there are conflicting accesses, they will slow down your program, because the cache will be flushed by CPU.
I understand the flush directive more like a conditional barrier.
A flush containing the same variable must be encountered by at least two threads to have an effect.
When this directive is met by two threads with say variable a in common, if they have modified it they will write back their modifications to memory (as opposed to keep it in a local variable or register), and then I suppose there is a barrier for both thread to get to that point before they continue.
If the variable a is used after the flush it is reread from memory.

Atomic load/store for OSs other than BSD?

Among the atomic operations provided by BSD (as given on the atomic(9) man page), there are atomic_load_acq_int() and atomic_store_rel_int(). In looking for the equivalent for other OSs (for example, by reading the atomic(3) man page for Mac OS X, the atomic_ops(3C) man page for Solaris, and the Interlocked*() functions for Windows), there don't seem to be any (obvious) equivalents for just atomically reading/writing an int.
Is this because that it's implied for those OSs that reads/writes for int are guaranteed to be atomic by default? (Or must you use declare them volatile in C/C++?)
If not, then how does one do atomic reads/writes of an int on those OSs?
(Atomic reads can be simulated by returning the result of an atomic add of 0, but there's no equivalent for doing atomic writes.)

I think you are mixing together atomic memory access with cache coherence. The former is the required hardware support for building synchronization primitives in software (spin-locks, semaphores, and mutexes), while the latter is the hardware support for multiple chips (several CPUs, and peripheral devices) working over the same bus, and having consistent view of the main memory.
Different compilers/libraries provide different utilities for the first. Here's, for example, GCC intrinsics for atomic memory access. They all boil down to generating either compare-and-swap or load-linked/store-conditional based instruction blocks depending on the platform support. Compile your source with, say, -S for GCC and see the assembler generated.
You don't have to do anything explicitly for cache coherency - it's all handled in hardware - but it definitely helps to understand how it works to avoid things like cache line ping-pong.
With all that, aligned single word reads and writes are atomic on all commodity platforms (somebody correct me if I'm wrong here). Since ints are less or equal to processor word in size, you are covered (see the GCC builtins link above).
It's the order of reads and writes that is important. Here's where architecture memory model is important. It dictates what operations can and cannot be re-ordered by the hardware. Example would be updating a linked list - you don't want other CPUs see a new item linked until the item itself is in consistent state. Explicit memory barriers (also often called "memory fences") might be required. Acquire barrier ensures that subsequent operations are not re-ordererd before the barrier (say you read the linked-list item pointer before the content of the item), Release barrier ensures that previous operations are not re-ordered after the barrier (you write the item content before writing the new link pointer).
volatile is often misunderstood as being related to all the above. In fact it is just an instruction to the compiler not to cache variable value in register, but read it from memory on each access. Many argue that it's "almost useless" for concurrent programming.
Apologies for lengthy reply. Hope this clears it a bit.
Edit:
Upcoming C++0x standard finally addresses concurrency, see Hans Boehm's C++ memory model papers for many details.

Seeking articles on shared memory locking issues

I'm reviewing some code and feel suspicious of the technique being used.
In a linux environment, there are two processes that attach multiple
shared memory segments. The first process periodically loads a new set
of files to be shared, and writes the shared memory id (shmid) into
a location in the "master" shared memory segment. The second process
continually reads this "master" location and uses the shmid to attach
the other shared segments.
On a multi-cpu host, it seems to me it might be implementation dependent
as to what happens if one process tries to read the memory while it's
being written by the other. But perhaps hardware-level bus locking prevents
mangled bits on the wire? It wouldn't matter if the reading process got
a very-soon-to-be-changed value, it would only matter if the read was corrupted
to something that was neither the old value nor the new value. This is an edge case: only 32 bits are being written and read.
Googling for shmat stuff hasn't led me to anything that's definitive in this
area.
I suspect strongly it's not safe or sane, and what I'd really
like is some pointers to articles that describe the problems in detail.

It is legal -- as in the OS won't stop you from doing it.
But is it smart? No, you should have some type of synchronization.
There wouldn't be "mangled bits on the wire". They will come out either as ones or zeros. But there's nothing to say that all your bits will be written out before another process tries to read them. And there are NO guarantees on how fast they'll be written vs how fast they'll be read.
You should always assume there is absolutely NO relationship between the actions of 2 processes (or threads for that matter).
Hardware level bus locking does not happen unless you get it right. It can be harder then expected to make your compiler / library / os / cpu get it right. Synchronization primitives are written to makes sure it happens right.
Locking will make it safe, and it's not that hard to do. So just do it.
#unknown - The question has changed somewhat since my answer was posted. However, the behavior you describe is defiantly platform (hardware, os, library and compiler) dependent.
Without giving the compiler specific instructions, you are actually not guaranteed to have 32 bits written out in one shot. Imagine a situation where the 32 bit word is not aligned on a word boundary. This unaligned access is acceptable on x86, and in the case of the x68, the access is turned into a series of aligned accesses by the cpu.
An interrupt can occurs between those operations. If a context switch happens in the middle, some of the bits are written, some aren't. Bang, You're Dead.
Also, lets think about 16 bit cpus or 64 bit cpus. Both of which are still popular and don't necessarily work the way you think.
So, actually you can have a situation where "some other cpu-core picks up a word sized value 1/2 written to". You write you code as if this type of thing is expected to happen if you are not using synchronization.
Now, there are ways to preform your writes to make sure that you get a whole word written out. Those methods fall under the category of synchronization, and creating synchronization primitives is the type of thing that's best left to the library, compiler, os, and hardware designers. Especially if you are interested in portability (which you should be, even if you never port your code)

The problem's actually worse than some of the people have discussed. Zifre is right that on current x86 CPUs memory writes are atomic, but that is rapidly ceasing to be the case - memory writes are only atomic for a single core - other cores may not see the writes in the same order.
In other words if you do
a = 1;
b = 2;
on CPU 2 you might see location b modified before location 'a' is. Also if you're writing a value that's larger than the native word size (32 bits on an x32 processor) the writes are not atomic - so the high 32 bits of a 64 bit write will hit the bus at a different time from the low 32 bits of the write. This can complicate things immensely.
Use a memory barrier and you'll be ok.

You need locking somewhere. If not at the code level, then at the hardware memory cache and bus.
You are probably OK on a post-PentiumPro Intel CPU. From what I just read, Intel made their later CPUs essentially ignore the LOCK prefix on machine code. Instead the cache coherency protocols make sure that the data is consistent between all CPUs. So if the code writes data that doesn't cross a cache-line boundary, it will work. The order of memory writes that cross cache-lines isn't guaranteed, so multi-word writes are risky.
If you are using anything other than x86 or x86_64 then you are not OK. Many non-Intel CPUs (and perhaps Intel Itanium) gain performance by using explicit cache coherency machine commands, and if you do not use them (via custom ASM code, compiler intrinsics, or libraries) then writes to memory via cache are not guaranteed to ever become visible to another CPU or to occur in any particular order.
So just because something works on your Core2 system doesn't mean that your code is correct. If you want to check portability, try your code also on other SMP architectures like PPC (an older MacPro or a Cell blade) or an Itanium or an IBM Power or ARM. The Alpha was a great CPU for revealing bad SMP code, but I doubt you can find one.

Two processes, two threads, two cpus, two cores all require special attention when sharing data through memory.
This IBM article provides an excellent overview of your options.
Anatomy of Linux synchronization methods
Kernel atomics, spinlocks, and mutexes
by M. Tim Jones (mtj#mtjones.com), Consultant Engineer, Emulex
http://www.ibm.com/developerworks/linux/library/l-linux-synchronization.html

I actually believe this should be completely safe (but is depends on the exact implementation). Assuming the "master" segment is basically an array, as long as the shmid can be written atomically (if it's 32 bits then probably okay), and the second process is just reading, you should be okay. Locking is only needed when both processes are writing, or the values being written cannot be written atomically. You will never get a corrupted (half written values). Of course, there may be some strange architectures that can't handle this, but on x86/x64 it should be okay (and probably also ARM, PowerPC, and other common architectures).

Read Memory Ordering in Modern Microprocessors, Part I and Part II
They give the background to why this is theoretically unsafe.
Here's a potential race:
Process A (on CPU core A) writes to a new shared memory region
Process A puts that shared memory ID into a shared 32-bit variable (that is 32-bit aligned - any compiler will try to align like this if you let it).
Process B (on CPU core B) reads the variable. Assuming 32-bit size and 32-bit alignment, it shouldn't get garbage in practise.
Process B tries to read from the shared memory region. Now, there is no guarantee that it'll see the data A wrote, because you missed out the memory barrier. (In practise, there probably happened to be memory barriers on CPU B in the library code that maps the shared memory segment; the problem is that process A didn't use a memory barrier).
Also, it's not clear how you can safely free the shared memory region with this design.
With the latest kernel and libc, you can put a pthreads mutex into a shared memory region. (This does need a recent version with NPTL - I'm using Debian 5.0 "lenny" and it works fine). A simple lock around the shared variable would mean you don't have to worry about arcane memory barrier issues.

I can't believe you're asking this. NO it's not safe necessarily. At the very least, this will depend on whether the compiler produces code that will atomically set the shared memory location when you set the shmid.
Now, I don't know Linux, but I suspect that a shmid is 16 to 64 bits. That means it's at least possible that all platforms would have some instruction that could write this value atomically. But you can't depend on the compiler doing this without being asked somehow.
Details of memory implementation are among the most platform-specific things there are!
BTW, it may not matter in your case, but in general, you have to worry about locking, even on a single CPU system. In general, some device could write to the shared memory.

I agree that it might work - so it might be safe, but not sane.
The main question is if this low-level sharing is really needed - I am not an expert on Linux, but I would consider to use for instance a FIFO queue for the master shared memory segment, so that the OS does the locking work for you. Consumer/producers usually need queues for synchronization anyway.

Legal? I suppose. Depends on your "jurisdiction". Safe and sane? Almost certainly not.
Edit: I'll update this with more information.
You might want to take a look at this Wikipedia page; particularly the section on "Coordinating access to resources". In particular, the Wikipedia discussion essentially describes a confidence failure; non-locked access to shared resources can, even for atomic resources, cause a misreporting / misrepresentation of the confidence that an action was done. Essentially, in the time period between checking to see whether or not it CAN modify the resource, the resource gets externally modified, and therefore, the confidence inherent in the conditional check is busted.

I don't believe anybody here has discussed how much of an impact lock contention can have over the bus, especially on bus bandwith constrained systems.
Here is an article about this issue in some depth, they discuss some alternative schedualing algorythems which reduse the overall demand on exclusive access through the bus. Which increases total throughput in some cases over 60% than a naieve scheduler (when considering the cost of an explicit lock prefix instruction or implicit xchg cmpx..). The paper is not the most recent work and not much in the way of real code (dang academic's) but it worth the read and consideration for this problem.
More recent CPU ABI's provide alternative operations than simple lock whatever.
Jeffr, from FreeBSD (author of many internal kernel components), discusses monitor and mwait, 2 instructions added for SSE3, where in a simple test case identified an improvement of 20%. He later postulates;
So this is now the first stage in the
adaptive algorithm, we spin a while,
then sleep at a high power state, and
then sleep at a low power state
depending on load.
...
In most cases we're still idling in
hlt as well, so there should be no
negative effect on power. In fact, it
wastes a lot of time and energy to
enter and exit the idle states so it
might improve power under load by
reducing the total cpu time required.
I wonder what would be the effect of using pause instead of hlt.
From Intel's TBB;
ALIGN 8
PUBLIC __TBB_machine_pause
__TBB_machine_pause:
L1:
dw 090f3H; pause
add ecx,-1
jne L1
ret
end
Art of Assembly also uses syncronization w/o the use of lock prefix or xchg. I haven't read that book in a while and won't speak directly to it's applicability in a user-land protected mode SMP context, but it's worth a look.
Good luck!

If the shmid has some type other than volatile sig_atomic_t then you can be pretty sure that separate threads will get in trouble even on the very same CPU. If the type is volatile sig_atomic_t then you can't be quite as sure, but you still might get lucky because multithreading can do more interleaving than signals can do.
If the shmid crosses cache lines (partly in one cache line and partly in another) then while the writing cpu is writing you sure find a reading cpu reading part of the new value and part of the old value.
This is exactly why instructions like "compare and swap" were invented.

Sounds like you need a Reader-Writer Lock : http://en.wikipedia.org/wiki/Readers-writer_lock.

The answer is - it's absolutely safe to do reads and writes simultaneously.
It is clear that the shm mechanism
provides bare-bones tools for the
user. All access control must be taken
care of by the programmer. Locking and
synchronization is being kindly
provided by the kernel, this means the
user have less worries about race
conditions. Note that this model
provides only a symmetric way of
sharing data between processes. If a
process wishes to notify another
process that new data has been
inserted to the shared memory, it will
have to use signals, message queues,
pipes, sockets, or other types of IPC.
From Shared Memory in Linux article.
The latest Linux shm implementation just uses copy_to_user and copy_from_user calls, which are synchronised with memory bus internally.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio