In some high-level programming environments (Java, .NET), when accessing the same memory from multiple threads, you have to explicitly mark it as volatile or synchronized; otherwise you could get stale results from some cache, or out-of-order values due to out-of-order execution by the CPU or other optimizations.
MRI Ruby has used native OS threads for some time now. Each of those threads sometimes executes Ruby code (I assume, but am not sure), even if never truly in parallel because of the VM lock.
I guess MRI solves this stale/out-of-order values issue somehow, because there is no volatile construct in the Ruby language and I have never heard of stale-value issues.
What guarantees does the Ruby language, or MRI specifically, give regarding memory access from multiple threads? I would be extremely grateful if someone could point me to any documentation regarding this. Thanks!
It sounds like your specific question is whether Ruby implicitly provides a memory barrier when switching threads, such that all caching/reordering concerns that occur at the processor level are resolved automatically.
I believe MRI does provide this, as otherwise the GVL would be pointless; why restrict execution to one thread at a time if, even then, threads can end up reading/writing stale data? It is difficult to find the precise place where this is provided, but I believe the entry point is RB_VM_LOCK_ENTER, which is called throughout the codebase and ultimately calls vm_lock_enter. That function has code which strongly implies that memory barriers are in place:
// lock
rb_native_mutex_lock(&vm->ractor.sync.lock);
VM_ASSERT(vm->ractor.sync.lock_owner == NULL);
vm->ractor.sync.lock_owner = cr;

if (!no_barrier) {
    // barrier
    while (vm->ractor.sync.barrier_waiting) {
        unsigned int barrier_cnt = vm->ractor.sync.barrier_cnt;
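For intuition about why holding the GVL also settles visibility: POSIX requires that pthread_mutex_lock/pthread_mutex_unlock synchronize memory, so a write made while holding a mutex is visible to the next thread that acquires the same mutex. A minimal C sketch of that guarantee (illustrative only, not MRI code):

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long shared_value = 0;        /* plain variable, no volatile needed */

static void *writer(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);       /* acquire: acts as a memory barrier */
    shared_value = 42;
    pthread_mutex_unlock(&lock);     /* release: the write is published here */
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, writer, NULL);
    pthread_join(t, NULL);           /* joining also synchronizes, but the
                                        locks alone would already suffice */
    pthread_mutex_lock(&lock);
    printf("%ld\n", shared_value);   /* guaranteed to print 42 */
    pthread_mutex_unlock(&lock);
    return 0;
}

On pthread platforms, MRI's rb_native_mutex_lock is essentially a thin wrapper over such a native mutex, which is why taking the lock doubles as the barrier described above.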
I've been learning about parallel/GPU programming a lot recently, and I've encountered a situation that's stumped me. What happens when two threads in a warp/wave attempt to write to the same exact location in shared memory? Specifically, I'm confused as to how this can occur when warp threads each execute the exact same instruction at the same time (to my understanding).
For instance, say you dispatch a shader that runs 32 threads, the size of a normal non-AMD warp. Assuming no dynamic branching (which, as I understand it, will normally call up a second warp to execute the branched code? I could be very wrong about that), what happens if we have every single thread try to write to a single location in shared memory?
Though I believe my question applies to any kind of GPU code, here's a simple example in HLSL:
groupshared uint test_target;
#pragma kernel WarpWriteTest
[numthreads(32, 1, 1)]
void WarpWriteTest (uint thread_id: SV_GroupIndex) {
    test_target = thread_id;
}
I understand this is almost certainly implementation-specific, but I'm just curious what would generally happen in a situation like this. Obviously, you'd end up with an unpredictable value stored in test_target, but what I'm really curious about is what happens on a hardware level. Does the entire warp have to wait until every write is complete, at which point it will continue executing code in lockstep (and would this result in noticeable latency)? Or is there some other mechanism to GPU shared memory/cache that I'm not understanding?
Let me clarify: I'm not asking what happens when multiple threads try to access a value in global memory/DRAM (I'd be curious to know, but my question is specifically concerned with the shared memory in a threadgroup). I also apologize if this information is readily available somewhere else; as anyone reading might know, GPU terminology in general can be very nebulous and non-standardized, so I've had difficulty even knowing what I should be looking for.
Thank you so much!
It's my understanding of atomicity that it's used to make sure a value will be read/written in whole rather than in parts. For example, a 64-bit value that is really two 32-bit DWORDs (assume x86 here) must be accessed atomically when shared between threads so that both DWORDs are read/written at the same time. That way one thread can't read half of a variable that hasn't been updated yet. How do you guarantee atomicity?
Furthermore it's my understanding that volatility does not guarantee thread safety at all. Is that true?
I've seen it implied in many places that simply being atomic/volatile is thread-safe. I don't see how that is. Won't I need a memory barrier as well to ensure that any values, atomic or otherwise, are actually written before they can be guaranteed to be visible when read in the other thread?
So for example, let's say I create a thread suspended, do some calculations that change some values in a struct available to the thread, and then resume it, for example:
HANDLE hThread = CreateThread(NULL, 0, thread_entry, (void *)data, CREATE_SUSPENDED, NULL);
data->val64 = SomeCalculation();
ResumeThread(hThread);
I suppose this would depend on any memory barriers in ResumeThread? Should I do an interlocked exchange for val64? What if the thread were running, how does that change things?
I'm sure I'm asking a lot here but basically what I'm trying to figure out is what I asked in the title: a good explanation for atomicity, volatility and thread safety in Windows. Thanks
it's used to make sure a value will be read/written in whole
That's just a small part of atomicity. At its core it means "uninterruptible", an instruction on a processor whose side-effects cannot be interleaved with another instruction. By design, a memory update is atomic when it can be executed with a single memory-bus cycle. Which requires the address of the memory location to be aligned so that a single cycle can update it. An unaligned access requires extra work, part of the bytes written by one cycle and part by another. Now it is not uninterruptible anymore.
Getting aligned updates is pretty easy, it is a guarantee provided by the compiler. Or, more broadly, by the memory model implemented by the compiler. Which simply chooses memory addresses that are aligned, sometimes intentionally leaving unused gaps of a few bytes to get the next variable aligned. An update to a variable that's larger than the native word size of the processor can never be atomic.
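As an illustration of that compiler guarantee, a struct layout on a typical x64 compiler ends up padded so the wide member stays naturally aligned (the exact padding is implementation-defined):

#include <stdint.h>
#include <stdio.h>

struct sample {
    char     flag;       /* 1 byte at offset 0                      */
                         /* 7 padding bytes typically inserted here */
    uint64_t counter;    /* 8 bytes at offset 8, naturally aligned  */
};

int main(void)
{
    printf("%zu\n", sizeof(struct sample));   /* usually prints 16 on x64 */
    return 0;
}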
But much more important are the kind of processor instructions you need to make threading work. Every processor implements a variant of the CAS instruction, compare-and-swap. It is the core atomic instruction you need to implement synchronization. Higher level synchronization primitives, like monitors (aka condition variables), mutexes, signals, critical sections and semaphores are all built on top of that core instruction.
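To make that concrete, here is a deliberately naive spinlock built directly on the Windows CAS primitive; this is a sketch of the idea only, and real code should use a CRITICAL_SECTION or slim reader/writer lock instead:

#include <windows.h>

static volatile LONG lock_word = 0;            /* 0 = free, 1 = held */

static void spin_lock(void)
{
    /* atomically: if lock_word == 0, set it to 1; otherwise retry */
    while (InterlockedCompareExchange(&lock_word, 1, 0) != 0) {
        YieldProcessor();                      /* be polite while spinning */
    }
}

static void spin_unlock(void)
{
    InterlockedExchange(&lock_word, 0);        /* atomic store, full barrier */
}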
That's the minimum; a processor usually provides extra ones to make simple operations atomic. Like incrementing a variable, at its core an interruptible operation since it requires a read-modify-write sequence. Having a need for it to be atomic is very common; most any C++ program relies on it, for example, to implement reference counting.
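The reference-counting case mentioned above maps directly onto an atomic increment/decrement; a hedged sketch using the Windows Interlocked intrinsics:

#include <windows.h>
#include <stdlib.h>

struct refcounted {
    volatile LONG refs;
    /* ... payload ... */
};

static void obj_addref(struct refcounted *obj)
{
    InterlockedIncrement(&obj->refs);          /* atomic read-modify-write */
}

static void obj_release(struct refcounted *obj)
{
    /* the decremented value is returned atomically, so exactly one
       thread can observe the count reaching zero and free the object */
    if (InterlockedDecrement(&obj->refs) == 0)
        free(obj);
}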
volatility does not guarantee thread safety at all
It doesn't. It is an attribute that dates from much easier times, back when machines only had a single processor core. It only affects code generation, in particular the way a code optimizer tries to eliminate memory accesses and use a copy of the value in a processor register instead. Makes a big, big difference to code execution speed, reading a value from a register is easily 3 times faster than having to read it from memory.
Applying volatile ensures that the code optimizer does not consider the value in the register to be accurate and forces it to read memory again. It truly only matters on the kind of memory values that are not stable by themselves, devices that expose their registers through memory-mapped I/O. It has been abused heavily since that core meaning to try to put semantics on top of processors with a weak memory model, Itanium being the most egregious example. What you get with volatile today is strongly dependent on the specific compiler and runtime you use. Never use it for thread-safety, always use a synchronization primitive instead.
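The memory-mapped I/O case, the one volatile was really designed for, looks roughly like this (the register address is made up for illustration):

#include <stdint.h>

/* hypothetical status register of a device, mapped into the address space */
#define STATUS_REG ((volatile uint32_t *)0x4000A000u)
#define READY_BIT  0x1u

static void wait_until_ready(void)
{
    /* without volatile the optimizer could hoist the load out of the
       loop and spin forever on a stale copy of the register */
    while ((*STATUS_REG & READY_BIT) == 0) {
        /* busy-wait */
    }
}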
simply being atomic/volatile is thread-safe
Programming would be much simpler if that were true. Atomic operations only cover very simple operations; a real program often needs to keep an entire object thread-safe, having all its members updated atomically and never exposing a view of the object that is partially updated. Something as simple as iterating a list is a core example: you can't have another thread modifying the list while you are looking at its elements. That's when you need to reach for the higher-level synchronization primitives, the kind that can block code until it is safe to proceed.
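The list example, sketched with a Win32 CRITICAL_SECTION; the point is that every reader and writer of the list has to go through the same lock (names are illustrative):

#include <windows.h>

struct node {
    struct node *next;
    int value;
};

static CRITICAL_SECTION list_lock;   /* InitializeCriticalSection() once at startup */
static struct node *list_head;

static long sum_list(void)
{
    long sum = 0;
    EnterCriticalSection(&list_lock);            /* block writers while iterating */
    for (struct node *n = list_head; n != NULL; n = n->next)
        sum += n->value;
    LeaveCriticalSection(&list_lock);
    return sum;
}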
Real programs often suffer from this synchronization need and exhibit Amdahl's law behavior. In other words, adding an extra thread does not actually make the program faster, and sometimes actually makes it slower. Whoever finds a better mousetrap for this is guaranteed a Nobel; we're still waiting.
In general, C and C++ don't give any guarantees about how reading or writing a 'volatile' object behaves in multithreaded programs. (The 'new' C++11 does, since it now includes threads and a memory model as part of the standard, but traditionally threads have not been part of standard C or C++.) Using volatile and making assumptions about atomicity and cache-coherence in code that's meant to be portable is a problem. It's a crap-shoot as to whether a particular compiler and platform will treat accesses to 'volatile' objects in a thread-safe way.
The general rule is: 'volatile' is not enough to ensure thread safe access. You should use some platform-provided mechanism (usually some functions or synchronisation objects) to access thread-shared values safely.
Now, specifically on Windows, specifically with the VC++ 2005+ compiler, and specifically on x86 and x64 systems, accessing a primitive object (like an int) can be made thread-safe if:
On 64- and 32-bit Windows, the object has to be a 32-bit type, and it has to be 32-bit aligned.
On 64-bit Windows, the object may also be a 64-bit type, and it has to be 64-bit aligned.
It must be declared volatile.
If those are true, then accesses to the object will be volatile, atomic and be surrounded by instructions that ensure cache-coherency. The size and alignment conditions must be met so that the compiler makes code that performs atomic operations when accessing the object. Declaring the object volatile ensures that the compiler doesn't make code optimisations related to caching previous values it may have read into a register and ensures that code generated includes appropriate memory barrier instructions when it's accessed.
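Under those specific conditions (MSVC on x86/x64 with Microsoft's extended volatile semantics, exposed in later compilers as /volatile:ms), the flag-publication pattern described above would look roughly like the sketch below; with /volatile:iso or other compilers there is no such guarantee:

#include <windows.h>

static int payload;                  /* ordinary data written before the flag */
static volatile LONG ready = 0;      /* 32-bit, naturally aligned, volatile   */

static DWORD WINAPI consumer(LPVOID arg)
{
    (void)arg;
    while (!ready)                   /* MSVC gives volatile reads acquire semantics */
        Sleep(0);
    return (DWORD)payload;           /* safe to read once the flag is observed */
}

static void producer(void)
{
    payload = 123;
    ready = 1;                       /* MSVC gives volatile writes release semantics */
}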
Even so, you're probably still better off using something like the Interlocked* functions for accessing small things, and bog standard synchronisation objects like Mutexes or CriticalSections for larger objects and data structures. Ideally, get libraries for and use data structures that already include appropriate locks. Let your libraries & OS do the hard work as much as possible!
In your example, I expect you do need to use a thread-safe access to update val64 whether the thread is started yet or not.
If the thread was already running, then you would definitely need some kind of thread-safe write to val64, either using InterlockedExchange64 or similar, or by acquiring and releasing some kind of synchronisation object which will perform appropriate memory barrier instructions. Similarly, the thread would need to use a thread-safe accessor to read it as well.
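For example (val64 is taken from the question; the surrounding struct and function names are just for illustration):

#include <windows.h>

struct thread_data {
    volatile LONG64 val64;
    /* ... */
};

/* writer side: publish the computed value atomically, with a full barrier */
static void publish_value(struct thread_data *data, LONG64 value)
{
    InterlockedExchange64(&data->val64, value);
}

/* reader side: a compare-exchange with identical comparand and exchange
   values is a common idiom for an atomic 64-bit read, even on 32-bit Windows */
static LONG64 read_value(struct thread_data *data)
{
    return InterlockedCompareExchange64(&data->val64, 0, 0);
}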
In the case where the thread hasn't been resumed yet, it's a bit less clear. It's possible that ResumeThread might use or act like a synchronisation function and do the memory barrier operations, but the documentation doesn't specify that it does, so it is better to assume that it doesn't.
References:
On atomicity of 32- and 64- bit aligned types... https://msdn.microsoft.com/en-us/library/windows/desktop/ms684122%28v=vs.85%29.aspx
On 'volatile' including memory fences... https://msdn.microsoft.com/en-us/library/windows/desktop/ms686355%28v=vs.85%29.aspx
I have a general question about the Ruby VM (the Ruby interpreter). How does it work with multiprocessors? Regarding parallelism and concurrency in Ruby, let's say that I have 4 processors. Will the VM automatically assign the tasks to the processors through the kernel? With scaling, let's say that my Ruby process is taking a lot of CPU resources; what will happen if I add a new processor? Is the OS responsible for assigning the tasks to the processors, or will each VM work on one processor? What would be the best way to scale my Ruby application? I have tried as much as possible to separate my processes and use AMQP queuing. Any other ideas?
It would be great if you can send me links for more explanation.
Thanks in advance.
Ruby Threading
The Ruby language itself supports concurrent execution through a threading model; however, the implementation dictates whether additional hardware resources get used. The "gold standard" interpreter (MRI Ruby) uses a "green threading" model in 1.8: threading is done within the interpreter and only uses a single system thread for execution. Others (such as JRuby) leverage the Java VM to create actual system-level threads for execution. MRI Ruby 1.9 moves to native system threads, but (afaik) the global VM lock still means only one of them executes Ruby code at a time.
Advanced Threading
Typically the OS manages assignment of threads to logical cores since most application software doesn't actually care. In some high performance compute cases, the software will specifically request certain threads to execute on specific logical cores for architecture specific performance. It's highly unlikely anything written in Ruby would fall into this category.
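For completeness, pinning a thread to a core in those high-performance cases is typically a single call; a Linux/glibc sketch (pthread_setaffinity_np is a non-portable GNU extension):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* pin the calling thread to logical core 'core_id'; returns 0 on success */
static int pin_to_core(int core_id)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core_id, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}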
Refactoring
Per application performance limits can usually be addressed by refactoring the code first. Leveraging a language or other environment more suited to the specific problem is likely the best first step instead of immediately jumping to threading in the existing implementation.
Example
I once worked on a Ruby on Rails app with a massive hash-mapping step that ran when data was uploaded. The initial implementation was written completely in Ruby and took ~80s to complete. Rewriting the code in ANSI C and using more careful memory allocation brought the execution time to under a second (without even using threads). The next bottleneck was inserting the massive amount of data back into MySQL, which eventually also moved out of the Ruby code and into threaded C code. I specifically went this route since the MRI Ruby interpreter binds easily to C code. The final result has Ruby preparing the environment for the C code and calling it as an instance method on a class, a single thread of C code doing the hash mapping, and finally an OpenMP worker-queue model generating and executing the inserts into MySQL.
Previously I thought the visibility problem was caused by the CPU cache, for performance reasons.
But I saw this article: http://www.ibm.com/developerworks/java/library/j-5things15/index.html
In paragraph 3, "Volatile variables", it says that the thread holds a cache of the variable, which sounds like the caching is done by the JVM.
What's the answer? JVM or Hardware?
JVM gives you some weak guarantees. Compiler and Hardware cause you problems. :-)
When a thread reads a variable, it is not necessarily getting the latest value from memory. The processor might return a cached value. Additionally, even though the programmer authored code where a variable is first written and later read, the compiler might reorder the statements as long as it does not change the program semantics. It is quite common for processors and compilers to do this for performance optimization. As a result, a thread might not see the values it expects to see. This can result in hard to fix bugs in concurrent programs.
Most programmers are familiar with the fact that entering a synchronized block means obtaining a lock on a monitor that ensures that no other thread can enter the synchronized block. Less familiar but equally important are the facts that
(1) Acquiring a lock and entering a synchronized block forces the thread to refresh data from memory.
(2) Upon exiting the synchronized block, data written is flushed to memory.
http://www.javacodegeeks.com/2011/02/java-memory-model-quick-overview-and.html
See also JSR 133 (Java Memory Model and Thread Specification Revision) http://jcp.org/en/jsr/detail?id=133 It was released with JDK 1.5.
In other languages you usually have a number of possibilities for memory reclamation:
Mark objects and then remove them
Explicit retain and release
Count references to objects
Internal heap disposition
How does Ruby work?
The garbage collector in Ruby 1.8 is actually quite awful. Every 7 MB of allocations it will perform a mark phase from all root objects and try to find which objects can still be reached. Those that cannot be reached will be freed.
However, to find out which objects are reachable, it checks the stack, the registers and the allocated object memory. This allows for some false positives but eases writing C extensions: C extensions don't have to explicitly register and unregister their references, since the stack and registers they use are automatically scanned.
Furthermore, the mark state of each object is kept inside the object itself. This is quite bad for cache behaviour and for copy-on-write behaviour: a lot of cache lines are touched during this process, and Ruby interpreters do not share as much memory as they could if you've got more than one of them (relevant for server deployments like Ruby on Rails). Therefore, other implementations exist (Ruby Enterprise Edition) which keep this state in a separate part of memory to speed GC up.
Long linked lists are also a problem: since the mark phase uses the C stack for its recursion, a sufficiently long linked list can make Ruby segfault.
The GC also does no compaction, and this becomes problematic in the long run.
If you run JRuby, however, those problems disappear while keeping Ruby 1.8 compatibility to some extent.
"conservative mark and sweep"
See this thread which includes Matz' description, which ought to be definitive.
Ruby's GC uses the mark-and-sweep strategy.
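A heavily simplified, non-conservative mark-and-sweep sketch in C, just to illustrate the two phases the term refers to (MRI's real collector is considerably more involved):

#include <stdlib.h>

struct object {
    struct object *next;     /* every allocated object sits on one big list */
    struct object *ref;      /* a single outgoing reference, for brevity    */
    int            marked;
};

static struct object *all_objects;   /* the sweep phase walks this list */

/* mark phase: flag everything reachable from a root */
static void mark(struct object *obj)
{
    if (obj == NULL || obj->marked)
        return;
    obj->marked = 1;
    mark(obj->ref);          /* recursion: this is why very deep structures
                                can blow the stack, as noted above */
}

/* sweep phase: anything left unmarked is garbage and gets freed */
static void sweep(void)
{
    struct object **link = &all_objects;
    while (*link != NULL) {
        struct object *obj = *link;
        if (!obj->marked) {
            *link = obj->next;
            free(obj);
        } else {
            obj->marked = 0; /* reset for the next collection cycle */
            link = &obj->next;
        }
    }
}

static void collect(struct object **roots, size_t nroots)
{
    for (size_t i = 0; i < nroots; i++)
        mark(roots[i]);
    sweep();
}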