I'm using ruby-head on Debian wheezy x64. When I run a multithreaded Ruby script, the CPU bars at the top of htop show several cores in use, but the process list shows the script at 100% CPU, i.e. the full capacity of only one core. I assume it's possible to have multiple cores running at 100%, and this number seems too convenient to come from a bottleneck in the program logic or in some other part of the hardware. Is the OS limiting the CPU time available to me, and if so, how do I stop this?
EDIT more info:
By "visually using multiple cores" I mean, e.g., 47% on core 1, 29% on core 2, and 24% on core 3. These percentages constantly shift up and down and move to different sets of cores, but they always add up to 100%-102%. More than 3 of the 8 cores are being used, but every core other than the three most heavily loaded stays at 2% or less. I should also mention that this is a Linode VPS.
EDIT:
Well, it looks like I was reading promises that 2.0 would feature true parallel threads, not actual release information. Time to switch to JRuby...
You failed to mention which Ruby implementation you are using. Not all Ruby implementations are capable of scheduling Ruby threads to multiple CPUs.
In particular:
MRI implements Ruby threads as green threads inside the interpreter and schedules them itself; it cannot schedule more than one thread at a time and it cannot schedule them to multiple CPUs
YARV implements Ruby threads as native OS threads (POSIX threads or Windows threads) and lets the OS schedule them, however it puts a Giant VM Lock (GVL) around them, so that only one Ruby thread can be running at any given time
Rubinius implements Ruby threads as native OS threads (POSIX threads or Windows threads) and lets the OS schedule them, however it puts a Global Interpreter Lock (GIL) around them, so that only one Ruby thread can be running at any given time; Rubinius 2.0 is going to have fine-grained locks so that multiple Ruby threads can run at any given time
JRuby implements Ruby threads as JVM threads and uses fine-grained locking so that multiple threads can be running; however, whether or not those threads are scheduled to multiple CPUs depends on the JVM being used, some allow this, some don't
IronRuby implements Ruby threads as CLI threads and uses fine-grained locking so that multiple threads can be running; however, whether or not those threads are scheduled to multiple CPUs depends on the VES being used, some allow this, some don't
MacRuby implements Ruby threads as native OS threads and uses fine-grained locking so that multiple threads can be running on multiple CPUs at the same time
I don't know enough about Topaz, Cardinal, MagLev, MRuby and all the others.
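One way to check what a given implementation does is to time the same CPU-bound work serially and in threads. The following is only a minimal sketch: the method name `burn` and the constants are made up for illustration, and the timings depend entirely on the machine and interpreter. On an implementation with a global lock, the two timings would be expected to come out roughly equal; on one with parallel threads, the threaded run may be noticeably shorter.

```ruby
require 'benchmark'

# CPU-bound work in pure Ruby: no I/O, no sleeping.
# (Hypothetical helper, used only for this illustration.)
def burn(n)
  sum = 0
  n.times { |i| sum += i * i }
  sum
end

N = 1_000_000
THREADS = 4

# Run the work THREADS times, one after another.
serial = Benchmark.realtime do
  Array.new(THREADS) { burn(N) }
end

# Run the same work in THREADS Ruby threads and wait for all of them.
threaded = Benchmark.realtime do
  Array.new(THREADS) { Thread.new { burn(N) } }.map(&:value)
end

puts "engine:   #{defined?(RUBY_ENGINE) ? RUBY_ENGINE : 'ruby'}"
puts format('serial:   %.2fs', serial)
puts format('threaded: %.2fs', threaded)
```

Watching htop while this runs shows the same picture as in the question when a global lock is present: the load may hop between cores, but the total stays near 100%.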
MRI implements Ruby threads as green threads within its interpreter. Unfortunately, it doesn't allow those threads to be scheduled in parallel; only one thread can run at a time.
See similar question here
Brian Goetz got me excited about project Loom and, in order to fully appreciate it, I'll need some clarification on the status quo.
My understanding is as follows: currently, in order to have real parallelism, we need one thread per CPU core. 1) Is there then any point in having n+1 threads on an n-core machine? Project Loom will bring us virtually limitless threads/fibres by relying on the JVM to carry out each task on a virtual thread managed inside the JVM. 2) Will that be truly parallel? 3) How, specifically, will that differ from the aforementioned scenario "n+1 threads on an n-core machine"?
Thanks for your time.
Virtual threads allow for concurrency (IO bound), not parallelism (CPU bound). They represent causal simultaneity, but not resource usage simultaneity.
In fact, if two virtual threads are both in an IO bound* state (awaiting a return from a REST call, for example), then no OS thread is being used at all. With normal threads (and without a reactive or CompletableFuture-style approach), both threads would be blocked and unavailable for other work until the calls complete.
*Except under certain conditions (e.g., use of synchronized vs ReentrantLock, blocking that occurs in a native method, and possibly some other minor areas).
is there then any point in having n+1 threads on an n-core machine?
For one, many modern n-core machines have n*2 hardware threads, because each core runs 2 hardware threads (simultaneous multithreading).
Sometimes it does make sense to spawn more OS threads than hardware threads. That’s the case when some OS threads are asleep waiting for something. For instance, on Linux, until io_uring arrived a couple of years ago, there was no good way to implement asynchronous I/O for files on local disks. Traditionally, disk-heavy applications spawned more threads than CPU cores and used blocking I/O.
Will that be truly parallel?
Depends on the implementation: not just the language runtime, but also the I/O-related parts of the standard library. For instance, on Windows, when doing disk or network I/O in C# with async/await (an equivalent of Project Loom, released around 2012), these tasks are truly parallel: the OS kernel and drivers are indeed doing more work at the same time. AFAIK, on Linux async/await is only truly parallel for sockets but not for files; for asynchronous file I/O it uses a pool of OS threads under the hood.
How, specifically, will that differ from the aforementioned scenario "n+1 threads on an n-core machine "?
OS threads are more expensive for a few reasons. (1) They require a native stack, so each OS thread consumes memory. (2) Memory is slow and processors have caches to compensate; switching between OS threads increases RAM traffic because the cached data of the previous thread is of little use after a context switch. (3) OS schedulers have been improving for decades, but they are still not free. One reason is that saving and restoring thread state to and from memory takes time.
The higher-level cooperative multitasking implemented in C# async/await or Java’s Loom causes way less overhead when switching contexts, compared to switching OS threads. At least in theory, this should improve both throughput and latency for I/O heavy applications.
I've read that Ruby code (CRuby/YARV) only "runs" on a single processor core, but something is not clear yet:
I understand that the GIL prevents threads from running concurrently and that in recent Ruby versions threads are scheduled by the operating system.
Couldn't a thread possibly be "placed" on core 1 and the other on core 2, even if they're not actually running at the same time?
Just trying to understand if the OS scheduler actually puts all Ruby threads on a single core. Thanks!
Edit: Another answer mentions that C++ uses pthreads and those are scheduled across cores, and that Ruby uses the same. I guess that's what I was looking for, but since most answers seem to equate not running threads in parallel with never running on multiple cores, I just wanted to confirm.
First off, we have to clearly distinguish between "Ruby Threads" and "Ruby Threads as implemented by YARV". Ruby Threads make no guarantees how they are scheduled. They might be scheduled concurrently, they might not. They might be scheduled on multiple CPUs, they might not. They might be implemented as native platform threads, they might be implemented as green threads, they might be implemented as something else.
YARV implements Ruby Threads as native platform threads (e.g. pthreads on POSIX and Windows threads on Windows). However, unlike other Ruby implementations which use native platform threads (e.g. JRuby, IronRuby, Rubinius), YARV has a Giant VM Lock (GVL) which prevents two threads from entering the YARV bytecode interpreter at the same time. This makes it effectively impossible to run Ruby code in multiple threads at the same time.
Note, however, that the GVL only protects the YARV interpreter and runtime. This means that, for example, multiple threads can execute C code at the same time, and at the same time as another thread executes Ruby code. It just means that no two threads can execute Ruby code at the same time on YARV.
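A small sketch of that effect, assuming that `sleep` stands in for any blocking call that gives up the lock while it waits (the helper name `blocked_elapsed` is made up for illustration): four threads that each block for 0.2 seconds should finish in roughly 0.2 seconds of wall-clock time, not 0.8, because none of them holds the lock while waiting.

```ruby
require 'benchmark'

# Start `count` threads that each block for `seconds`, wait for all of
# them, and return the total wall-clock time in seconds.
# (Hypothetical helper, used only for this illustration.)
def blocked_elapsed(count, seconds)
  Benchmark.realtime do
    Array.new(count) { Thread.new { sleep(seconds) } }.each(&:join)
  end
end

elapsed = blocked_elapsed(4, 0.2)
puts format('4 threads, each blocked for 0.2s, finished in %.2fs', elapsed)
```

The same reasoning applies to threads waiting on sockets or files, which is why threads remain useful for I/O-heavy Ruby programs even with the lock in place.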
Note also that in recent versions of YARV, the "Giant" VM Lock is becoming ever smaller. Sections of code are moved out from under the lock, and the lock itself is broken down into smaller, more fine-grained locks. That is a very long process, but it means that in the future more and more Ruby code will be able to run in parallel on YARV.
But all of this has nothing to do with how the platform schedules the threads. Many platforms have some sort of heuristics for thread affinity to CPU cores, e.g. they may try to schedule the same thread to the same core, under the assumption that its working set is still in that core's cache, or they may try to identify threads that operate on shared data and schedule those threads to the same CPU, and so on. Therefore, it is hard or even impossible to predict how and where a thread will be scheduled.
Many platforms also provide a way to influence this CPU affinity, e.g. on Linux and Windows, you can set a thread to only be scheduled on one specific or a set of specific cores. However, YARV does not do that by default. (In fact, on some platforms influencing CPU affinity requires elevated privileges, so it would mean that YARV would have to run with elevated privileges, which is not a good idea.)
So, in short: yes, depending on the platform, the hardware, and the environment, YARV threads may and probably will be scheduled on different cores. But, they won't be able to take advantage of that fact, i.e. they won't be able to run faster than on a single core (at least when running Ruby code).
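To see what the scheduler has to work with, a Ruby process can report how many processors it sees and, on Linux, which CPUs its affinity mask allows. This is only a sketch: `allowed_cpu_list` is a made-up helper, and reading `Cpus_allowed_list` from `/proc/self/status` is a Linux-specific assumption, so the helper returns nil elsewhere.

```ruby
require 'etc'

# Linux-only: read the list of CPUs this process may be scheduled on.
# Returns a string such as "0-7", or nil if the file or field is missing.
# (Hypothetical helper, used only for this illustration.)
def allowed_cpu_list(status_path = '/proc/self/status')
  return nil unless File.readable?(status_path)

  line = File.foreach(status_path).find { |l| l.start_with?('Cpus_allowed_list:') }
  line && line.split(':', 2).last.strip
end

puts "processors visible to Ruby: #{Etc.nprocessors}"
puts "CPUs allowed by affinity:   #{allowed_cpu_list || 'unknown on this platform'}"
```

If the allowed list covers several CPUs, the OS is free to move the interpreter's threads between them, which matches the shifting per-core load that tools like htop display.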
I am running a parallel algorithm using light threads and I am wondering how are these assigned to different cores when the system provides several cores and several chips. Are threads assigned to a single chip until all the cores on the chip are exhausted? Are threads assigned to cores on different chips in order to better distribute the work between chips?
You don't say what OS you're on, but in Linux, threads are assigned to a core based on the load on that core. A thread that is ready to run will be assigned to a core with lowest load unless you specify otherwise by setting thread affinity. You can do this with sched_setaffinity(). See the man page for more details. In general, as meyes1979 said, this is something that is decided by the scheduler implemented in the OS you are using.
Depending upon the version of Linux you're using, there are two articles that might be helpful: this article describes early 2.6 kernels, up through 2.6.22, and this article describes kernels newer than 2.6.23.
Different threading libraries perform threading operations differently. The "standard" in Linux these days is NPTL, which schedules threads at the same level as processes. This is quite fine, as process creation is fast on Linux, and is intended to always remain fast.
The Linux kernel attempts to provide very strong CPU affinity with executing processes and threads to increase the ratio of cache hits to cache misses -- if a task always executes on the same core, it'll more likely have pre-populated cache lines.
This is usually a good thing, but I have noticed the kernel might not always migrate tasks away from busy cores to idle cores. This behavior is liable to change from version to version, but I have found multiple CPU-bound tasks all running on one core while three other cores were idle. (I found it by noticing that one core was six or seven degrees Celsius warmer than the other three.)
In general, the right thing should just happen; but when the kernel does not automatically migrate tasks to other processors, you can use the taskset(1) command to restrict the processors allowed to programs or you could modify your program to use the pthread_setaffinity_np(3) function to ask for individual threads to be migrated. (This is perhaps best for in-house applications -- one of your users might not want your program to use all available cores. If you do choose to include calls to this function within your program, make sure it is configurable via configuration files to provide functionality similar to the taskset(1) program.)
I have a general question about the Ruby VM (Ruby interpreter). How does it work with multiprocessors? Regarding parallelism and concurrency in Ruby, let's say that I have 4 processors. Will the VM automatically assign the tasks to the processors through the kernel? As for scaling, let's say that my Ruby process is taking a lot of CPU resources; what will happen if I add a new processor? Is the OS responsible for assigning the tasks to the processors, or will each VM work on one processor? What would be the best way to scale my Ruby application? I tried as much as possible to separate my processes and use AMQP queuing. Any other ideas?
It would be great if you can send me links for more explanation.
Thanks in advance.
Ruby Threading
The Ruby language itself supports parallel execution through a threading model; however, the implementation dictates if additional hardware resources get used. The "gold standard" interpreter (MRI Ruby) uses a "green threading" model in 1.8; threading is done within the interpreter and only uses a single system thread for execution. However, others (such as JRuby) leverage the Java VM to create actual system level threads for execution. MRI Ruby 1.9 moves to native system threads but (afaik) a global lock still lets only one of them run Ruby code at a time, so another thread mainly gets its turn when the running one stalls, e.g. on an I/O event.
Advanced Threading
Typically the OS manages assignment of threads to logical cores since most application software doesn't actually care. In some high performance compute cases, the software will specifically request certain threads to execute on specific logical cores for architecture specific performance. It's highly unlikely anything written in Ruby would fall into this category.
Refactoring
Per application performance limits can usually be addressed by refactoring the code first. Leveraging a language or other environment more suited to the specific problem is likely the best first step instead of immediately jumping to threading in the existing implementation.
Example
I once worked on a Ruby on Rails app with a massive hash mapping function step in it when data was uploaded. The initial implementation was written completely in Ruby and took ~80s to complete. Rewriting the code in ANSI C and leveraging more specific memory allocation, the execution time fell to under a second (without even using threads). The next bottleneck was inserting the massive amount of data back into MySQL which eventually also moved out of the Ruby code and into threaded C code. I specifically went this route since the MRI Ruby interpreter easily binds to C code. The final result has Ruby preparing the environment for the C code, calling it as a Ruby instance method on a class with parameters, hash mapping by a single thread of C code, and finally finishes with an OpenMP worker queue model of generating and executing inserts into MySQL.
I'm trying to understand the practical impact of different threading models between MRI Ruby 1.8 and JRuby.
What does this difference mean to me as a developer?
And also, are there any practical examples of code in MRI Ruby 1.8 that will have worse performance characteristics on JRuby due to different threading models?
State
ruby 1.8 has green threads; these are fast to create/delete (as objects) but do not truly execute in parallel, and they are scheduled by the virtual machine rather than by the operating system
ruby 1.9 has real threads; these are slow to create/delete (as objects) because of OS calls, and because the GIL (global interpreter lock) allows only one thread to execute at a time, these are not truly parallel either
JRuby also has real threads, scheduled by the OS, and these are truly parallel
Conclusion
A threaded program running on a 2-core CPU will run faster on JRuby than on the other implementations, as far as threading is concerned
Notice!
Many existing Ruby libraries are not thread-safe, so the advantage of JRuby is often useless.
Also note that many Ruby programming techniques (for example class vars) need additional programming effort to ensure thread safety (mutex locks, monitors etc.) if one is to use threads.
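As a minimal sketch of the extra effort involved, here is a class variable guarded by a `Mutex`. The class `HitCounter` and its methods are hypothetical, made up for illustration. Without the lock, `@@hits += 1` is a read-modify-write sequence that can lose updates when threads really do run in parallel.

```ruby
# Hypothetical counter class, used only for this illustration.
class HitCounter
  @@hits = 0
  @@lock = Mutex.new

  # Increment the shared counter; the mutex makes the
  # read-modify-write sequence atomic across threads.
  def self.record
    @@lock.synchronize { @@hits += 1 }
  end

  # Read the counter under the same lock.
  def self.hits
    @@lock.synchronize { @@hits }
  end
end

threads = Array.new(8) do
  Thread.new { 1_000.times { HitCounter.record } }
end
threads.each(&:join)

puts HitCounter.hits
```

With the lock in place, every one of the 8 x 1,000 increments is counted on any implementation, at the cost of serializing access to the counter.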
JRuby's threads are native system threads, so they give you all the benefits of threaded programming (including the use of multiple processor cores, if applicable). However, MRI/YARV Ruby has a Global Interpreter Lock (GIL), which prevents multiple threads from running simultaneously. So the only real performance difference is the fact that your MRI/YARV Ruby applications won't be able to utilize all of your processor cores, but your JRuby applications will happily do so.
However, if that isn't an issue, MRI's threads are (theoretically, I haven't tested this) a little faster because they are green threads, which use fewer system resources. YARV (Ruby 1.9) uses native system threads.
I am a regular JRuby user and the biggest difference is that JRuby threads are truly concurrent. They are actually system level threads, so they can be executed concurrently on multiple cores. I do not know of any place where MRI Ruby 1.8 code runs slower on JRuby. You might consider checking out this question: Does ruby have real multithreading?