Seeking articles on shared memory locking issues - shared-memory

I'm reviewing some code and feel suspicious of the technique being used.
In a linux environment, there are two processes that attach multiple
shared memory segments. The first process periodically loads a new set
of files to be shared, and writes the shared memory id (shmid) into
a location in the "master" shared memory segment. The second process
continually reads this "master" location and uses the shmid to attach
the other shared segments.
On a multi-cpu host, it seems to me it might be implementation dependent
as to what happens if one process tries to read the memory while it's
being written by the other. But perhaps hardware-level bus locking prevents
mangled bits on the wire? It wouldn't matter if the reading process got
a very-soon-to-be-changed value, it would only matter if the read was corrupted
to something that was neither the old value nor the new value. This is an edge case: only 32 bits are being written and read.
Googling for shmat stuff hasn't led me to anything that's definitive in this
area.
I suspect strongly it's not safe or sane, and what I'd really
like is some pointers to articles that describe the problems in detail.

It is legal -- as in the OS won't stop you from doing it.
But is it smart? No, you should have some type of synchronization.
There wouldn't be "mangled bits on the wire". They will come out either as ones or zeros. But there's nothing to say that all your bits will be written out before another process tries to read them. And there are NO guarantees on how fast they'll be written vs how fast they'll be read.
You should always assume there is absolutely NO relationship between the actions of 2 processes (or threads for that matter).
Hardware-level bus locking does not happen unless you get it right. It can be harder than expected to make your compiler / library / OS / CPU get it right. Synchronization primitives are written to make sure it happens right.
Locking will make it safe, and it's not that hard to do. So just do it.
#unknown - The question has changed somewhat since my answer was posted. However, the behavior you describe is definitely platform (hardware, OS, library and compiler) dependent.
Without giving the compiler specific instructions, you are actually not guaranteed to have 32 bits written out in one shot. Imagine a situation where the 32-bit word is not aligned on a word boundary. Such an unaligned access is acceptable on x86, and the CPU turns it into a series of aligned accesses.
An interrupt can occur between those accesses. If a context switch happens in the middle, some of the bits are written and some aren't. Bang, you're dead.
Also, let's think about 16-bit or 64-bit CPUs, both of which are still in use and don't necessarily work the way you think.
So you actually can have a situation where "some other cpu-core picks up a word-sized value half-written". Write your code as if this kind of thing is expected to happen whenever you are not using synchronization.
Now, there are ways to perform your writes so that you get a whole word written out. Those methods fall under the category of synchronization, and creating synchronization primitives is the kind of thing best left to the library, compiler, OS, and hardware designers, especially if you are interested in portability (which you should be, even if you never port your code).

The problem's actually worse than some of the people have discussed. Zifre is right that on current x86 CPUs aligned memory writes are atomic, but that alone is not enough - each write is only atomic individually, and other cores may not see a sequence of writes in the same order.
In other words if you do
a = 1;
b = 2;
on CPU 2 you might see location 'b' modified before location 'a' is. Also, if you're writing a value that's larger than the native word size (32 bits on a 32-bit processor) the writes are not atomic - so the high 32 bits of a 64-bit write will hit the bus at a different time from the low 32 bits of the write. This can complicate things immensely.
Use a memory barrier and you'll be ok.
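As an illustration, here is a minimal sketch of that publish/consume pattern with C11 atomics: a release store on the writer side, an acquire load on the reader side. The struct and field names are made up for the example, and it assumes both processes already have the "master" segment attached.

#include <stdatomic.h>
#include <stdint.h>

struct master_seg {
    _Atomic int32_t current_shmid;   /* the 32-bit slot the reader polls */
};

/* Writer: fill the new segment completely, then publish its id with a
 * release store so the data is visible before the id is. */
void publish(struct master_seg *master, int32_t new_shmid)
{
    /* ... write the files into the newly created segment here ... */
    atomic_store_explicit(&master->current_shmid, new_shmid,
                          memory_order_release);
}

/* Reader: load the id with acquire so later reads of the new segment
 * cannot be reordered before the load of the id. */
int32_t poll_id(struct master_seg *master)
{
    return atomic_load_explicit(&master->current_shmid,
                                memory_order_acquire);
}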

You need locking somewhere. If not at the code level, then at the hardware memory cache and bus.
You are probably OK on a post-Pentium Pro Intel CPU. From what I just read, later Intel CPUs no longer need to lock the whole bus for a LOCK-prefixed instruction; instead the cache coherency protocols make sure that the data is consistent between all CPUs. So if the code writes data that doesn't cross a cache-line boundary, it will work. The ordering of memory writes that cross cache lines isn't guaranteed, so multi-word writes are risky.
If you are using anything other than x86 or x86_64 then you are not OK. Many non-Intel CPUs (and perhaps Intel Itanium) gain performance by using explicit cache coherency machine commands, and if you do not use them (via custom ASM code, compiler intrinsics, or libraries) then writes to memory via cache are not guaranteed to ever become visible to another CPU or to occur in any particular order.
So just because something works on your Core2 system doesn't mean that your code is correct. If you want to check portability, try your code also on other SMP architectures like PPC (an older MacPro or a Cell blade) or an Itanium or an IBM Power or ARM. The Alpha was a great CPU for revealing bad SMP code, but I doubt you can find one.

Two processes, two threads, two cpus, two cores all require special attention when sharing data through memory.
This IBM article provides an excellent overview of your options.
Anatomy of Linux synchronization methods
Kernel atomics, spinlocks, and mutexes
by M. Tim Jones (mtj#mtjones.com), Consultant Engineer, Emulex
http://www.ibm.com/developerworks/linux/library/l-linux-synchronization.html

I actually believe this should be completely safe (but it depends on the exact implementation). Assuming the "master" segment is basically an array, as long as the shmid can be written atomically (if it's 32 bits then probably okay), and the second process is just reading, you should be okay. Locking is only needed when both processes are writing, or the values being written cannot be written atomically. You will never get a corrupted (half-written) value. Of course, there may be some strange architectures that can't handle this, but on x86/x64 it should be okay (and probably also ARM, PowerPC, and other common architectures).

Read Memory Ordering in Modern Microprocessors, Part I and Part II
They give the background to why this is theoretically unsafe.
Here's a potential race:
Process A (on CPU core A) writes to a new shared memory region
Process A puts that shared memory ID into a shared 32-bit variable (that is 32-bit aligned - any compiler will try to align like this if you let it).
Process B (on CPU core B) reads the variable. Assuming 32-bit size and 32-bit alignment, it shouldn't get garbage in practice.
Process B tries to read from the shared memory region. Now, there is no guarantee that it'll see the data A wrote, because you missed out the memory barrier. (In practice, there probably happened to be memory barriers on CPU B in the library code that maps the shared memory segment; the problem is that process A didn't use a memory barrier.)
Also, it's not clear how you can safely free the shared memory region with this design.
With the latest kernel and libc, you can put a pthreads mutex into a shared memory region. (This does need a recent version with NPTL - I'm using Debian 5.0 "lenny" and it works fine). A simple lock around the shared variable would mean you don't have to worry about arcane memory barrier issues.
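For reference, a minimal sketch of that approach: a PTHREAD_PROCESS_SHARED mutex living in the master segment next to the shmid slot. The struct layout and function names are illustrative, not from the original code; build with -lpthread.

#include <pthread.h>
#include <stdint.h>

struct master_seg {
    pthread_mutex_t lock;        /* lives in the shared segment itself */
    int32_t current_shmid;
};

/* Run once by the process that creates the master segment. */
int master_init(struct master_seg *m)
{
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    int rc = pthread_mutex_init(&m->lock, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}

void master_set(struct master_seg *m, int32_t shmid)
{
    pthread_mutex_lock(&m->lock);
    m->current_shmid = shmid;
    pthread_mutex_unlock(&m->lock);
}

int32_t master_get(struct master_seg *m)
{
    pthread_mutex_lock(&m->lock);
    int32_t id = m->current_shmid;
    pthread_mutex_unlock(&m->lock);
    return id;
}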

I can't believe you're asking this. No, it's not necessarily safe. At the very least, this will depend on whether the compiler produces code that will atomically set the shared memory location when you set the shmid.
Now, I don't know Linux, but I suspect that a shmid is 16 to 64 bits. That means it's at least possible that all platforms would have some instruction that could write this value atomically. But you can't depend on the compiler doing this without being asked somehow.
Details of memory implementation are among the most platform-specific things there are!
BTW, it may not matter in your case, but in general, you have to worry about locking, even on a single CPU system. In general, some device could write to the shared memory.

I agree that it might work - so it might be safe, but not sane.
The main question is whether this low-level sharing is really needed - I am not an expert on Linux, but I would consider using, for instance, a FIFO queue in place of the master shared memory segment, so that the OS does the locking work for you. Producers and consumers usually need queues for synchronization anyway.
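To make the suggestion concrete, here is a sketch of passing each new shmid through a POSIX named pipe instead of a raw shared slot; writes of sizeof(int) bytes are far below PIPE_BUF, so the kernel delivers them whole. The path and function names are invented for the example.

#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

#define SHMID_FIFO "/tmp/shmid_fifo"   /* made-up path */

/* Producer: announce a freshly loaded segment (blocks until a reader opens). */
int announce_shmid(int shmid)
{
    mkfifo(SHMID_FIFO, 0600);              /* harmless if it already exists */
    int fd = open(SHMID_FIFO, O_WRONLY);
    if (fd < 0) return -1;
    ssize_t n = write(fd, &shmid, sizeof shmid);
    close(fd);
    return n == (ssize_t)sizeof shmid ? 0 : -1;
}

/* Consumer: block until the next shmid arrives. */
int wait_for_shmid(void)
{
    int fd = open(SHMID_FIFO, O_RDONLY);
    if (fd < 0) return -1;
    int shmid = -1;
    read(fd, &shmid, sizeof shmid);
    close(fd);
    return shmid;
}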

Legal? I suppose. Depends on your "jurisdiction". Safe and sane? Almost certainly not.
Edit: I'll update this with more information.
You might want to take a look at this Wikipedia page, particularly the section on "Coordinating access to resources". In particular, the Wikipedia discussion essentially describes a check-then-act failure: non-locked access to shared resources can, even for atomic resources, misrepresent whether an action was done safely. In the time between checking whether it CAN modify the resource and the modification itself, the resource gets externally modified, and therefore the confidence inherent in the conditional check is busted.

I don't believe anybody here has discussed how much of an impact lock contention can have over the bus, especially on bus-bandwidth-constrained systems.
Here is an article that covers this issue in some depth. It discusses alternative scheduling algorithms that reduce the overall demand for exclusive access over the bus, which increases total throughput in some cases by over 60% compared with a naive scheduler (when considering the cost of an explicit LOCK-prefixed instruction or the implicit lock in xchg/cmpxchg). The paper is not the most recent work and is light on real code (dang academics), but it is worth the read and worth considering for this problem.
More recent CPU instruction sets provide alternatives to a plain locked operation.
Jeffr from FreeBSD (author of many internal kernel components) discusses monitor and mwait, two instructions added with SSE3, which in a simple test case showed an improvement of 20%. He later postulates:
So this is now the first stage in the adaptive algorithm, we spin a while, then sleep at a high power state, and then sleep at a low power state depending on load.
...
In most cases we're still idling in hlt as well, so there should be no negative effect on power. In fact, it wastes a lot of time and energy to enter and exit the idle states so it might improve power under load by reducing the total cpu time required.
I wonder what would be the effect of using pause instead of hlt.
From Intel's TBB;
ALIGN 8
PUBLIC __TBB_machine_pause
__TBB_machine_pause:
L1:
dw 090f3H; pause
add ecx,-1
jne L1
ret
end
The Art of Assembly also covers synchronization without the lock prefix or xchg. I haven't read that book in a while and won't speak directly to its applicability in a user-land protected-mode SMP context, but it's worth a look.
Good luck!

If the shmid has some type other than volatile sig_atomic_t then you can be pretty sure that separate threads will get in trouble even on the very same CPU. If the type is volatile sig_atomic_t then you can't be quite as sure, but you still might get lucky because multithreading can do more interleaving than signals can do.
If the shmid crosses cache lines (partly in one cache line and partly in another), then while the writing CPU is writing you may well find a reading CPU picking up part of the new value and part of the old value.
This is exactly why instructions like "compare and swap" were invented.
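For illustration, this is what a compare-and-swap update of the shared 32-bit slot looks like with C11 atomics (a sketch; the function name is made up):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Replace the shmid only if the slot still holds the value we last saw.
 * Returns true on success; on failure, *expected receives the value found. */
bool replace_shmid(_Atomic int32_t *slot, int32_t *expected, int32_t desired)
{
    return atomic_compare_exchange_strong(slot, expected, desired);
}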

Sounds like you need a Reader-Writer Lock: http://en.wikipedia.org/wiki/Readers-writer_lock.
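POSIX has these as pthread_rwlock_t; a process-shared variant can live in the shared segment just like the mutex mentioned elsewhere in this thread. A rough sketch (initialization only, error checking omitted):

#include <pthread.h>

/* Initialize a reader-writer lock placed in shared memory. */
int shared_rwlock_init(pthread_rwlock_t *rw)
{
    pthread_rwlockattr_t attr;
    pthread_rwlockattr_init(&attr);
    pthread_rwlockattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    int rc = pthread_rwlock_init(rw, &attr);
    pthread_rwlockattr_destroy(&attr);
    return rc;
}

/* Readers use pthread_rwlock_rdlock(rw), the writer uses
 * pthread_rwlock_wrlock(rw); both release with pthread_rwlock_unlock(rw). */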

The answer is - it's absolutely safe to do reads and writes simultaneously.
It is clear that the shm mechanism provides bare-bones tools for the user. All access control must be taken care of by the programmer. Locking and synchronization is being kindly provided by the kernel, this means the user have less worries about race conditions. Note that this model provides only a symmetric way of sharing data between processes. If a process wishes to notify another process that new data has been inserted to the shared memory, it will have to use signals, message queues, pipes, sockets, or other types of IPC.
From Shared Memory in Linux article.
The latest Linux shm implementation just uses copy_to_user and copy_from_user calls, which are internally synchronized with the memory bus.

Related

Can you directly access the cache using assembly?

Caching is a core thing when it comes to efficiency.
I know that caching usually happens automatically.
However, I'd like to control cache usage myself, because I think that I can do better than some heuristics that don't know the exact program.
Therefore I would need assembly instructions to directly move to or from cache memory cells.
like:
movL1 address content
I know that there are some instructions that give the "caching system" hints, but I'm not sure if that's enough, because the hints could be ignored or they might not be sufficient to express everything expressible by such a move-to/from-cache instruction.
Are there any assemblers that allow for complete cache control?
Side note: why I'd like to improve caching:
consider a hypothetical CPU with 1 register and a cache containing 2 cells.
consider the following two programs:
(where x,y,z,a are memory cells)
"START"
"move 1 to x"
"move 2 to y"
"move 3 to z"
"move 4 to a"
"move z to x"
"move y to x"
"END"
"START"
"move 1 to x"
"move 2 to y"
"move 3 to z"
"move 4 to a"
"move a to x"
"move y to x"
"END"
In the first case, you'd use the register and the cache for x,y,z (a is only written to once)
In the second case, you'd use the register and the cache for a,x,y (z is only written to once)
If the CPU does the caching, it simply can't decide ahead of time which of the two above cases it's facing.
It has to decide, for each of the memory cells x, y, z, whether its contents should be cached before it knows whether the program being executed is no. 1 or no. 2, because both programs start out the same.
The programmer on the other hand knows ahead of time which memory cells are reused, and when they are reused.
Peter Cordes wrote:
On most microarchitectures for most ISAs, no, you can't pin a line in cache to stop it from being evicted. The only way to use cache is as a transparent cache that you load/store through.
This is correct, but the exceptions are of interest....
It is common in DSP ("Digital Signal Processing") chips to provide a limited ability to partition SRAM between "cache" and "scratchpad memory" functionality. There are lots of white papers and reference guides on this topic -- an example is http://www.ti.com/lit/ug/sprug82a/sprug82a.pdf. In this chip, there are three blocks of SRAM -- a small "Level-1 Instruction" SRAM, a small "Level-1 Data" SRAM, and a larger "Level-2" SRAM. Each of the three can be partitioned between Cache and directly-addressed memory, with the details depending on the specific chip. For example, a chip may allow no cache, 1/4 SRAM as cache, 1/2 SRAM as cache, or all SRAM as cache. (The ratios are limited so the allowed cache sizes can be indexed efficiently.)
The IBM "Cell" processor (used in the Sony PlayStation 3, released in 2006) was a multi-core chip with one ordinary general-purpose core and eight co-processor cores. The co-processor cores had a limited instruction set, with load and store instructions that could only access their private 128KiB "scratchpad" memory. In order to access main memory, the co-processors had to program a DMA engine to perform a block copy of main memory to local scratchpad memory (or vice versa). This approach provided (and required) perfect control over data motion, resulting in (a very small amount of) very high-performance software.
Some GPUs also have small on-chip SRAMs that can be configured as either an L1 cache or as explicitly controlled local memory.
All of these are considered to be "very hard" (or worse) to use, but this can be the right approach if the product requires very low cost, completely predictable performance, or very low power.
On most microarchitectures for most ISAs, no, you can't pin a line in cache to stop it from being evicted. The only way to use cache is as a transparent cache that you load/store through.
Of course, a normal load will definitely bring a cache line into L1d cache, at least temporarily. Nothing stops it from being evicted later, though. e.g. on x86-64: mov eax, [rdi] instead of prefetcht0 [rdi].
Before dedicated prefetch instructions existed, using a plain load as a prefetch was sometimes done (e.g. ahead of some loop-bounds calculations before entering a loop that would start looping over an array). For performance purposes, best-effort software prefetch instructions that the CPU can ignore are usually better.
A plain load has the downside of not being able to retire from the out-of-order back-end until the loaded data actually arrives. (At least I think it can't on x86 CPUs with x86's strongly ordered memory model. Weakly-ordered ISAs that allow out-of-order loads might let the load retire even if it hasn't truly completed yet.) Software prefetch instructions exist to allow prefetch as a hint without bottlenecking the CPU on waiting for the load to finish.
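To show what best-effort software prefetch looks like in practice, here is a sketch using GCC/Clang's __builtin_prefetch (which compiles to prefetcht0 on x86, or is simply dropped on targets without such an instruction); the prefetch distance of 16 elements is an arbitrary example value, not a tuned one.

#include <stddef.h>

/* Sum an array, hinting the element 16 iterations ahead into cache. */
long sum_with_prefetch(const long *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0 /* read */, 3 /* keep in all levels */);
        sum += a[i];
    }
    return sum;
}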
On modern x86, forced eviction of a cache line is possible. NT stores guarantee eviction on Pentium-M or newer (or possibly only CPUs after Pentium-M; I forget which). Also, clflush and clflushopt exist specifically for that.
clflush is not just a hint that the CPU can drop; it guarantees correctness for non-volatile DIMMs like Optane DC PM. Why does CLFLUSH exist in x86?
Being guaranteed, not just a hint, makes it slow. You generally don't want to do this for performance. As #old_timer says, burning instructions / cycles micro-managing the cache is almost always a waste of time. Leaving things up to the hardware's pseudo-LRU replacement and HW prefetch algorithms usually provides good results in the long run. SW prefetch can help in a few cases.
Xeon Phi can configure its MCDRAM as a large last-level cache, or as architecturally visible "local memory" that's part of physical address space. But at 6 to 16GiB, it's vastly bigger than the on-die L1/L2 caches, or the L1/L2/L3 caches of modern mainstream CPUs.
Also, x86 CPUs can run in cache-as-RAM no-fill mode, used by the BIOS in early startup before configuring DRAM controllers. But that's really just no fills on read or write, and read-as-zero for invalid lines, so you can't use DRAM at all when no-fill-mode is activated. i.e. only cache is available, and you have to be careful not to evict anything that was cached. It's not usable for any practical purpose except early-boot.
What use is the INVD instruction? and Cache-as-Ram (no fill mode) Executable Code have some details.
I know that there are some instructions that give the "caching system" hints, but I'm not sure if that's enough, because the hints could be ignored or they might not be sufficient to express everything expressible by such a move-to/from-cache instruction.
Direct access to the cache SRAMs has nothing to do with the instruction set; if you have access then you have access, and you access it however the chip/system designers implemented it. It could be as simple as an address space, or it may be some indirect, peripheral-like access where you poke at control registers and that logic accesses the item in the cache for you.
And this doesn't mean that all ARM processors can gain access to their cache in the same way. (ARM is an IP company, not a chip company.) But it might mean that, no, you can't do this on any existing x86. I know for a fact that on the product I am part of we can do this, because we have ECC on those SRAMs and have an access method to initialize the RAMs from software before enabling the monitor. Some of the SRAMs you can reach through normal accesses, but for example the ARM we are using was implemented with parity checking, not ECC, so we added ECC on the SRAM and a side-door access for init, because trying to go through the cache with normal accesses and get 100% coverage was a PITA and in the end not the right solution.
I also worked on a product where the DRAM controller's cache can be accessed directly as an on-chip RAM; it is up to software to decide whether to use it as an L2 cache or as on-chip RAM.
So it has been done and can be done, but these are isolated examples. As part of screening the parts there are MBIST tests that run, but often those are driven through JTAG and are not directly available to the processor, and/or the RAM isn't; sometimes the MBIST can be started and checked by software but the RAM can't be touched; and in some implementations the designers made it so software can touch all of it, including the tag RAM.
Which leads to this: if you think you can do a better job than the hardware and want to move stuff around, then you will likely also need access to the tag RAM so that you can trace/drive where you want the cache line, its status, etc.
Based on this comment:
Sorry, I'm a [beginner] at assembly, could you please explain this simpler? whats a CPU "mode"? What's that HBM? How to set a CPU mode? what are NDAs? – KGM
Two things: one, you can't do better than the cache, and two, you are not ready for this task.
Even with experience you generally can't do better than the cache. If you want to help the cache, you use your knowledge of how you write your code and where you place it in memory, as well as where the data you are using lives, and then the logic's implementation can work better for you. Burning instructions and cycles trying to reposition things at runtime isn't going to help. You generally need access to the design at a level that is not available to the general public. Thus an NDA (non-disclosure agreement), and even then it is extremely unlikely that you will get the info you need and/or the gains will be minimal; it may only work on one implementation and not across the whole family of products, etc.
More interesting is: what do you think you can do better, and how are you thinking you can do it? (Also understand that many of us here can make any cache implementation fail and run slower than if it weren't there; even if you create a newer, better cache, by definition it only improves performance in certain cases.)

Will gettimeofday() be slowed due to the fix to the recently announced Intel bug?

I have been estimating the impact of the recently announced Intel bug on my packet processing application using netmap. So far, I have measured that I process about 50 packets per each poll() system call made, but this figure doesn't include gettimeofday() calls. I have also measured that I can read from a non-existing file descriptor (which is about the cheapest thing that a system call can do) 16.5 million times per second. My packet processing rate is 1.76 million packets per second, or in terms of system calls, 0.0352 million system calls per second. This means performance reduction would be 0.0352 / 16.5 = 0.21333% if system call penalty doubles, hardly something I should worry about.
However, my application may use gettimeofday() system calls quite often. My understanding is that these are not true system calls, but rather implemented as virtual system calls, as described in What are vdso and vsyscall?.
Now, my question is, does the fix to the recently announced Intel bug (that may affect ARM as well and that probably won't affect AMD) slow down gettimeofday() system calls? Or is gettimeofday() an entirely different animal due to being implemented as a different kind of virtual system call?
In general, no.
The current patches keep things like the vDSO pages mapped in user-space, and only change the behavior for the remaining vast majority of kernel-only pages which will no longer be mapped in user-space.
On most architectures, gettimeofday() is implemented as a purely userspace call; it never enters the kernel and so doesn't incur the TLB flush or CR3 switch that KPTI implies, so you shouldn't see a performance impact.
Exceptions include unusual kernel or hardware configurations that don't use the vDSO mechanisms, e.g., if you don't have a constant rdtsc or if you have explicitly disabled rdtsc timekeeping via a boot parameter. You'd probably already know if that were the case, since it means gettimeofday would take 100-200ns rather than 15-20ns, because it is already making a real kernel call.
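If you would rather measure than guess, a quick micro-benchmark like the sketch below (entirely illustrative, no error handling) gives the average per-call cost on your box; tens of nanoseconds means the vDSO path, hundreds means a real syscall.

#include <stdio.h>
#include <sys/time.h>

int main(void)
{
    const long iters = 10 * 1000 * 1000;
    struct timeval start, end, tmp;

    gettimeofday(&start, NULL);
    for (long i = 0; i < iters; i++)
        gettimeofday(&tmp, NULL);          /* the call being measured */
    gettimeofday(&end, NULL);

    double ns_per_call = ((end.tv_sec - start.tv_sec) * 1e9 +
                          (end.tv_usec - start.tv_usec) * 1e3) / iters;
    printf("gettimeofday: %.1f ns per call\n", ns_per_call);
    return 0;
}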
Good question: the vDSO pages are kernel memory mapped into user space. If you single-step into gettimeofday(), you see a call into the vDSO page, where code uses rdtsc and scales the result with scale factors it reads from another data page.
But these pages are supposed to be readable from user-space, so Linux can keep them mapped without any risk. The Meltdown vulnerability is that the U/S bit (user/supervisor) in page-table / TLB entries doesn't stop unprivileged loads (and further dependent instructions) from happening microarchitecturally, producing a change in the microarchitectural state which can then be read with cache-timing.

Are one-sided RDMA reads atomic for single cache lines?

My group (a project called Isis2) is experimenting with RDMA. We're puzzled by the lack of documentation for the atomicity guarantees of one-sided RDMA reads. I've spent the past hour and a half hunting for any kind of information at all on this to no avail. This includes close reading of the blog at rdmamojo.com, famous for having answers to every RDMA question...
In the case we are focused on, we want to have writers doing atomic writes for objects that will always fit within a single cache line. Say this happens on machine A. Then we plan to have a one-sided atomic RDMA reader on machine B, who might read chunks of memory from A, spanning many of these objects (but again, no object would ever be written non-atomically, and all will fit within some single cache line). So B reads X, Y and Z, and each of those objects lives in one cache line on A, and was written with atomic writes.
Thus the atomic writes will be local, but the RDMA reads will arrive from remote machines and are done with no local CPU involvement.
Are our one-sided reads "semantically equivalent" to atomic local reads despite being initiated on the remote machine? (I suspect so: otherwise, one-sided RDMA reads would be useless for data that is ever modified...). And where are the "rules" documented?
Ok, meanwhile I seem to have found the correct answer, and I believe that Roland's response is not quite right -- partly right but not entirely.
In http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf, which is the Intel architecture manual (I'll need to check again for AMD...), I found this: "Atomic memory operation in Intel 64 and IA-32 architecture is guaranteed only for a subset of memory operand sizes and alignment scenarios. The list of guaranteed atomic operations are described in Section 8.1.1 of IA-32 Intel® Architecture Software Developer's Manual, Volumes 3A."
Then in that section, which is entitled MULTIPLE-PROCESSOR MANAGEMENT, one finds a lot of information about guaranteed atomic operations (page 2210). In particular, Intel guarantees that its memory subsystems will be atomic for native types (bit, byte, integers of various sizes, float). These objects must be aligned so as to fit within a cache line (64 bytes on the current Intel platforms), not crossing a cache line boundary. But then Intel guarantees that no matter what device is using the memory bus, stores and fetches will be atomic.
For more complex objects, locking is required if you want to be sure of a safe execution. Further, if you are doing multicore operations you have to use the locked (atomic) variants of the Intel instructions to be sure of coherency for concurrent writes. In C# and Java, marking a variable volatile gives you atomic, ordered loads and stores; C/C++ volatile by itself does not guarantee atomicity.
What this adds up to is that local writes to native types can be paired with remotely initiated RDMA reads safely.
But notice that strings, byte arrays -- those would not be atomic because they could easily cross a cache line. Also, operations on complex objects with more than one data field might not be atomic -- for such things you would need a more complex approach, such as the one in the FaRM paper (Fast Remote Memory) by MSR. My own need is simpler and won't require the elaborate version numbering scheme FaRM implements...
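As a concrete sketch of the writer-side discipline being described (one object per 64-byte cache line, with the word the reader keys on updated last via a release store), something like the following; the layout and names are hypothetical, and this only covers the local write side - whether the remote read observes it consistently is exactly what this thread is debating.

#include <stdalign.h>
#include <stdatomic.h>
#include <stdint.h>

struct slot {
    alignas(64) _Atomic uint64_t version;   /* native-sized, 64-byte aligned */
    uint8_t payload[56];                    /* rest of the same cache line   */
};

/* Fill the payload first, then bump the version with a release store. */
void slot_update(struct slot *s, const uint8_t *data, uint64_t new_version)
{
    for (int i = 0; i < 56; i++)
        s->payload[i] = data[i];
    atomic_store_explicit(&s->version, new_version, memory_order_release);
}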
The cache coherence protocol implemented in the PCIe controller should guarantee atomicity for single cache line RDMA reads. The PCIe controller has to snoop the caches of CPU cores and take ownership of the cache line (RFO) before returning data to the RDMA adapter. So it should see some snapshot of the cache line.
I don't know of any such guarantee of atomicity. Of course RDMA reads are executed by the remote adapter, and cache-line size is a CPU concept. I don't believe anything ensures that the granularity of reads used by the remote RDMA adapter matches the size of writes performed by the remote CPU.
In practice it is likely to work, since the remote adapter will probably issue a single PCIe transaction, etc., but I don't think there is anything architectural that guarantees you don't get "torn" data.

Atomic operations in ARM strex and ldrex - can they work on I/O registers?

Suppose I'm modifying a few bits in a memory-mapped I/O register, and it's possible that another process or an ISR could be modifying other bits in the same register.
Can ldrex and strex be used to protect against this? I mean, they can in principle because you can ldrex, and then change the bit(s), and strex it back, and if the strex fails it means another operation may have changed the reg and you have to start again. But can the strex/ldrex mechanism be used on a non-cacheable area?
I have tried this on a Raspberry Pi, with an I/O register mapped into userspace, and the ldrex operation gives me a bus error. If I change the ldrex/strex to a simple ldr/str it works fine (but is not atomic any more...). Also, the ldrex/strex routines work fine on ordinary RAM. The pointer is 32-bit aligned.
So is this a limitation of the ldrex/strex mechanism, or a problem with the BCM2708 implementation, or the way the kernel has set it up? (Or something else - maybe I've mapped it wrong?)
Thanks for mentioning me...
You do not use ldrex/strex pairs on the resource itself, just as with swp or test-and-set or whatever your instruction set supports (for ARM it is swp and, more recently, ldrex/strex). You use these instructions on RAM, some RAM location agreed upon by all the parties involved. The processes sharing the resource use that RAM location to fight over control of the resource; whoever wins gets to actually address the resource. You would never use swp or ldrex/strex on a peripheral itself; that makes no sense. And I could see the memory system not giving you an exclusive okay response (EXOKAY), which is what you need to get out of the ldrex/strex retry loop.
You have two basic methods for sharing a resource (well, maybe more, but here are two). One: you use this shared memory location, and each user of the shared resource fights to win control over the memory location. When you win, you talk to the resource directly; when finished, you give up control over the shared memory location.
The other method: only one piece of software is allowed to talk to the peripheral; nobody else ever talks to it directly. Anyone wishing to have something done on the peripheral asks that one owner to do it for them. It is like everyone being able to share the soft drink fountain, versus the fountain being behind the counter where only the employee is allowed to use it; then you need a scheme, either having folks stand in line or having them take a number and be called to have their drink filled. Along with the single owner talking to the peripheral, you have to come up with a scheme, a FIFO for example, to essentially make the requests serial in nature.
These are both on the honor system. You expect nobody to talk to the peripheral who is not supposed to, or who has not won the right to. If you are looking for hardware solutions to prevent folks from talking to it, well, use the MMU, but now you need to manage who won the lock and how the MMU gets unblocked (without using the honor system) and re-blocked in a way that does not itself rely on the honor system.
In situations where you have an interrupt handler and a foreground task sharing a resource, you let one or the other be the one that can touch the resource, and the other submits requests. For example, the resource might be interrupt driven (a serial port, say) and you have the interrupt handlers talk to the serial port hardware directly; if the application/foreground task wants something done, it fills out a request (puts something in a FIFO/buffer), and the interrupt handler then looks to see if there is anything in the request queue and, if so, operates on it. A sketch of that request queue follows below.
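Here is what the request-queue half of that pattern can look like for one foreground producer and one ISR consumer: a single-producer/single-consumer ring buffer with separate head and tail indices needs no lock at all. Everything here (names, sizes, the request layout) is illustrative.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define QLEN 16                       /* power of two */

struct request { uint32_t reg; uint32_t value; };

static struct request queue[QLEN];
static _Atomic unsigned head;         /* written only by the foreground task */
static _Atomic unsigned tail;         /* written only by the ISR             */

/* Foreground task: post a request for the ISR to carry out. */
bool post_request(struct request r)
{
    unsigned h = atomic_load_explicit(&head, memory_order_relaxed);
    unsigned t = atomic_load_explicit(&tail, memory_order_acquire);
    if (h - t == QLEN)
        return false;                 /* queue full */
    queue[h % QLEN] = r;
    atomic_store_explicit(&head, h + 1, memory_order_release);
    return true;
}

/* ISR: pull the next request, if any. */
bool next_request(struct request *out)
{
    unsigned t = atomic_load_explicit(&tail, memory_order_relaxed);
    unsigned h = atomic_load_explicit(&head, memory_order_acquire);
    if (t == h)
        return false;                 /* queue empty */
    *out = queue[t % QLEN];
    atomic_store_explicit(&tail, t + 1, memory_order_release);
    return true;
}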
Of course there are also disable-interrupts/re-enable critical sections, but those are scary if you want your interrupts to have some notion of timing/latency... Understand what you are doing, though, and they can be used to solve this app+ISR two-user problem.
ldrex/strex on non-cached memory space:
My extest perhaps has more text on when you can and can't use ldrex/strex; unfortunately the ARM docs are not that good in this area. They tell you to stop using swp, which implies you should use ldrex/strex. But then you switch to the hardware manual, which says you don't have to support exclusive operations on a uniprocessor system. That says two things: ldrex/strex are meant for multiprocessor systems, for sharing resources between processors on a multiprocessor system; and ldrex/strex are not necessarily supported on uniprocessor systems. Then it gets worse. ARM logic generally stops either at the edge of the processor core (the L1 cache is contained within this boundary; it is not on the AXI/AMBA bus), or, if you purchased/use the L2 cache, at the edge of that layer. Then you get into the chip-vendor-specific logic. That is the logic the hardware manual is talking about when it says you don't NEED to support exclusive accesses on a uniprocessor system. So the problem is vendor specific. And it gets worse: ARM's L1 and L2 caches, so far as I have found, do support ldrex/strex, so if you have the caches on, then ldrex/strex will work on a system whose vendor logic does not support them. If you don't have the cache on, that is when you get into trouble on those systems (that is the extest thing I wrote).
The processors that have ldrex/strex are new enough to have a big bank of config registers accessed through coprocessor reads; buried in there is a "swp instruction supported" bit to determine whether you have swp. Didn't the Cortex-M3 folks run into the situation of no swp and no ldrex/strex?
The bug in the Linux kernel (there are many others as well, for other misunderstandings of ARM hardware and documentation) is that on a processor that supports ldrex/strex, the ldrex/strex solution is chosen without determining whether the system is multiprocessor, so you can (and I know of two instances) get into an infinite ldrex/strex loop. If you modify the Linux code so that it uses the swp solution (there is code there for either solution), then Linux will work. The reason only two people that I know of have talked about this on the internet is that you have to turn off the caches to make it happen (so far as I know), and who would turn off both caches and try to run Linux? It actually takes a fair amount of work to successfully turn off the caches; modifications to Linux are required to get it to work without crashing.
No, I can't tell you the systems, and no, I do not now nor have I ever worked for ARM. This stuff is all in the ARM documentation if you know where to look and how to interpret it.
Generally, ldrex and strex need support from the memory system. You may wish to refer to some answers by dwelch as well as his extest application. I believe you cannot do this for memory-mapped I/O. ldrex and strex are intended more for lock-free algorithms, in normal memory.
Generally only one driver should be in charge of a bank of I/O registers. Software makes requests to that driver via semaphores, etc., which can be implemented with ldrex and strex in normal SDRAM. So you can inter-lock these I/O registers, but not in the direct sense.
Often, the I/O registers will support atomic access through write-one-to-clear, multiplexed access, and other schemes.
Write-one-to-clear - typically used with hardware events. If code handles an event, it writes only that bit. In this way, multiple routines can handle different bits in the same register.
Multiplexed access - often an interrupt enable/disable will have a register bitmap. However, there are also alternate registers to which you can write an interrupt number to enable or disable a particular interrupt. For instance, intmask may be two 32-bit registers. To enable int3, you could OR 1<<3 into intmask, or write just 3 to an intenable register. intmask and intenable are hooked to the same bits in hardware.
So you can emulate an inter-lock with a driver, or the hardware itself may support atomic operations through normal register writes. These schemes served systems well for quite some time before people even started to talk about lock-free and wait-free algorithms.
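To illustrate those hardware idioms in code, here is a sketch with completely made-up register names and addresses: a write-one-to-clear status register, plus the set/clear-register variant of the multiplexed-access idea, both of which avoid the read-modify-write race.

#include <stdint.h>

#define IRQ_STATUS   (*(volatile uint32_t *)0x40000010)  /* write 1 to clear a bit */
#define IRQ_ENAB_SET (*(volatile uint32_t *)0x40000014)  /* write a bit to enable  */
#define IRQ_ENAB_CLR (*(volatile uint32_t *)0x40000018)  /* write a bit to disable */

/* Acknowledge one event: only the bit we own is affected, so another
 * routine can clear a different bit of the same register concurrently. */
void ack_event(unsigned bit)
{
    IRQ_STATUS = 1u << bit;
}

void enable_irq(unsigned n)  { IRQ_ENAB_SET = 1u << n; }
void disable_irq(unsigned n) { IRQ_ENAB_CLR = 1u << n; }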
Like previous answers state, ldrex/strex are not intended for accessing the resource itself, but rather for implementing the synchronization primitives required to protect it.
However, I feel the need to expand a bit on the architectural bits:
ldrex/strex (pronounced load-exclusive/store-exclusive) are supported by all ARM architecture version 6 and later processors, minus the M0/M1 microcontrollers (ARMv6-M).
It is not architecturally guaranteed that load-exclusive/store-exclusive will work on memory types other than "Normal" - so any clever usage of them on peripherals would not be portable.
The SWP instruction isn't recommended against merely because its very nature is counterproductive in a multi-core system - it was deprecated in ARMv6, is "optional" to implement in certain ARMv7-A revisions, and most ARMv7-A processors already require it to be explicitly enabled in the cp15 SCTLR. Linux by default does not enable it, and instead emulates the operation through the undef handler using ... load-exclusive and store-exclusive (what #dwelch refers to above). So please don't recommend SWP as a valid alternative if you are expecting code to be portable across ARMv7-A platforms.
Synchronization with bus masters not in the inner-shareable domain (your cache-coherency island, as it were) requires additional external hardware - referred to as a global monitor - in order to track which masters have requested exclusive access to which regions.
The "not required on uniprocessor systems" bit sounds like the ARM terminology getting in the way. A quad-core Cortex-A15 is considered one processor... So testing for "uniprocessor" in Linux would not make one iota of a difference - the architecture and the interconnect specifications remain the same regardless, and SWP is still optional and may not be present at all.
Cortex-M3 supports ldrex/strex, but its interconnect (AHB-Lite) does not support propagating it, so it cannot use them to synchronize with external masters. It does not support SWP, which was never introduced in the Thumb instruction set and which its interconnect would likewise be unable to propagate.
If the chip in question has a toggle register (which is essentially XORed with the output latch when written to) there is a work around.
load port latch
mask off unrelated bits
xor with desired output
write to toggle register
as long as two processes do not modify the same pins (as opposed to "the same port") there is no race condition.
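Spelled out as code (GPIO_LATCH and GPIO_TOGGLE are placeholder addresses, not real BCM2708 registers), the recipe above looks roughly like this:

#include <stdint.h>

#define GPIO_LATCH  (*(volatile uint32_t *)0x40001000)  /* current output latch */
#define GPIO_TOGGLE (*(volatile uint32_t *)0x40001004)  /* XORed into the latch */

/* Drive only the pins in my_pins_mask to 'desired' without a
 * read-modify-write of the whole latch. */
void set_pins(uint32_t my_pins_mask, uint32_t desired)
{
    uint32_t latch = GPIO_LATCH;                       /* load port latch          */
    uint32_t mine  = latch & my_pins_mask;             /* mask off unrelated bits  */
    uint32_t flip  = mine ^ (desired & my_pins_mask);  /* xor with desired output  */
    GPIO_TOGGLE = flip;                                /* write to toggle register */
}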
In the case of the BCM2708 you could choose an output pin whose neighbors are either unused or never changed, and write to GPFSELn in byte mode. This will, however, only ensure that you do not corrupt others. If others are writing in 32-bit mode and you interrupt them, they can still corrupt you. So it's kind of a hack.
Hope this helps

Atomic load/store for OSs other than BSD?

Among the atomic operations provided by BSD (as given on the atomic(9) man page), there are atomic_load_acq_int() and atomic_store_rel_int(). In looking for the equivalent for other OSs (for example, by reading the atomic(3) man page for Mac OS X, the atomic_ops(3C) man page for Solaris, and the Interlocked*() functions for Windows), there don't seem to be any (obvious) equivalents for just atomically reading/writing an int.
Is this because it's implied for those OSs that reads/writes of an int are guaranteed to be atomic by default? (Or must you declare them volatile in C/C++?)
If not, then how does one do atomic reads/writes of an int on those OSs?
(Atomic reads can be simulated by returning the result of an atomic add of 0, but there's no equivalent for doing atomic writes.)
I think you are mixing together atomic memory access with cache coherence. The former is the required hardware support for building synchronization primitives in software (spin-locks, semaphores, and mutexes), while the latter is the hardware support for multiple chips (several CPUs, and peripheral devices) working over the same bus, and having consistent view of the main memory.
Different compilers/libraries provide different utilities for the first. Here's, for example, GCC intrinsics for atomic memory access. They all boil down to generating either compare-and-swap or load-linked/store-conditional based instruction blocks depending on the platform support. Compile your source with, say, -S for GCC and see the assembler generated.
You don't have to do anything explicitly for cache coherency - it's all handled in hardware - but it definitely helps to understand how it works to avoid things like cache line ping-pong.
With all that, aligned single-word reads and writes are atomic on all commodity platforms (somebody correct me if I'm wrong here). Since ints are less than or equal to the processor word in size, you are covered (see the GCC builtins link above).
It's the order of reads and writes that is important. Here's where the architecture's memory model matters. It dictates what operations can and cannot be re-ordered by the hardware. An example would be updating a linked list - you don't want other CPUs to see a new item linked until the item itself is in a consistent state. Explicit memory barriers (also often called "memory fences") might be required. An acquire barrier ensures that subsequent operations are not re-ordered before the barrier (say, you read the linked-list item pointer before the contents of the item); a release barrier ensures that previous operations are not re-ordered after the barrier (you write the item contents before writing the new link pointer).
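That linked-list example reads naturally in C11 atomics (which post-date this answer's C++0x reference but standardize the same idea): a release store to publish the node, an acquire load to traverse. A sketch for a single writer, with hypothetical types:

#include <stdatomic.h>
#include <stdlib.h>

struct node {
    int payload;
    struct node *_Atomic next;
};

/* Single writer: initialize the node completely, then link it in with a
 * release store so its contents are visible before the pointer is. */
void push_front(struct node *_Atomic *head, int value)
{
    struct node *n = malloc(sizeof *n);
    if (!n)
        return;
    n->payload = value;
    n->next = atomic_load_explicit(head, memory_order_relaxed);
    atomic_store_explicit(head, n, memory_order_release);
}

/* Reader: the acquire load pairs with the writer's release store. */
int read_front(struct node *_Atomic *head, int fallback)
{
    struct node *n = atomic_load_explicit(head, memory_order_acquire);
    return n ? n->payload : fallback;
}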
volatile is often misunderstood as being related to all of the above. In fact it is just an instruction to the compiler not to cache the variable's value in a register but to read it from memory on each access. Many argue that it's "almost useless" for concurrent programming.
Apologies for lengthy reply. Hope this clears it a bit.
Edit:
Upcoming C++0x standard finally addresses concurrency, see Hans Boehm's C++ memory model papers for many details.
