Ruby 1.8 and disk I/O in a multi-threaded setting

Ruby 1.8 uses userspace threads, not operating system threads. This means that Ruby 1.8 can only utilize a single CPU core no matter how many Ruby threads you create.
On the bright side, not all is bad. Ruby 1.8 internally uses non-blocking I/O while Ruby 1.9 unlocks the global interpreter lock while doing I/O. So if one Ruby thread is blocked on I/O, another Ruby thread can continue execution. Likewise, Ruby is smart enough to cause things like sleep() and even waitpid() to preempt to other threads.
The above is an excerpt from a recent blog post by the Phusion folks.
How does MRI handle disk I/O internally?
From what I gather, doing disk I/O in a non-blocking manner via select/epoll/kqueue is not possible, since file descriptors for regular files always report as readable/writable. So I would expect MRI to block when it does file I/O, but if it blocks, there is little point in writing a multi-threaded program. Does MRI have an internal thread pool to which these blocking I/O calls are offloaded?

Yehuda Katz, one of the core contributors to Rails 3, has blogged about this in some detail:
http://yehudakatz.com/2010/08/14/threads-in-ruby-enough-already

Related

Ruby On Rails, and multi threading

I read that Ruby does not really gain any performance from multi-threading because of the GIL.
So I see no point in using multi-threading in a Rails app.
What is the use case for multi-threading in a Rails app?
An IO (input/output) operation is one that does not run on your CPU, such as reading from a hard drive, making an API call to a service, or performing a database operation of some kind.
Anything that is IO heavy benefits from multi-threading even with the GIL. An IO call blocks the calling thread while it waits for the result, so it is only reasonable, while you are waiting, to switch to another thread and do some work.
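As a rough illustration of the point above, the sketch below runs three simulated IO calls in threads. The slow_io method is hypothetical and uses sleep as a stand-in for a disk read, database query, or API call; it is an assumption for the demo, not code from a real application. On MRI the three waits should overlap, so the total time is close to one call rather than three.

```ruby
# Hypothetical stand-in for an IO-bound call; sleep simulates the wait.
def slow_io(id)
  sleep 0.2
  id * 2
end

started = Process.clock_gettime(Process::CLOCK_MONOTONIC)

# Start all three calls first, then collect their return values.
io_threads = [1, 2, 3].map { |id| Thread.new { slow_io(id) } }
io_results = io_threads.map(&:value)

io_elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started

puts io_results.inspect
puts format("elapsed: %.2fs", io_elapsed)
```

Run serially, the same three calls would take about 0.6 seconds. CPU-bound work would not show this speed-up on MRI, because only one thread executes Ruby code at a time.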

When making network requests, when should I use Threads vs Processes?

I'm working on a Ruby script that will be making hundreds of network requests (via open-uri) to various APIs and I'd like to do this in parallel since each request is slow, and blocking.
I have been looking at using Thread or Process to achieve this but I'm not sure which method to use.
With regard to network requests, when should I use a Thread over a Process, or does it not matter?
Before going into detail, there is already a library solving your problem. Typhoeus is optimized to run a large number of HTTP requests in parallel and is based on the libcurl library.
Like a modern code version of the mythical beast with 100 serpent
heads, Typhoeus runs HTTP requests in parallel while cleanly
encapsulating handling logic.
Threads run in the same process as your application. Since Ruby 1.9, native threads are used as the underlying implementation. Resources can be shared easily across threads, as they all have access to the shared state of the application. The problem, however, is that you cannot utilize the multiple cores of your CPU with most Ruby implementations.
Ruby uses the Global Interpreter Lock (GIL). The GIL is a locking mechanism that ensures the shared state is not corrupted by parallel modifications from different threads. Other Ruby implementations like JRuby, Rubinius or MacRuby offer an approach without a GIL.
Processes run separately from each other. Processes do not share resources, which means every process has its own state. This can be a problem if you want to share data across your requests. Each process also allocates its own memory. You could still share data by using a message bus like RabbitMQ.
I cannot recommend using only threads or only processes. If you want to implement this yourself, you should use both: for every n requests, fork a new process, which in turn spawns a number of threads to issue the HTTP requests. Why?
If you fork another process for every HTTP request, you will end up with too many processes. Although your operating system might be able to handle this, the overhead is still tremendous. Some HTTP requests might finish very quickly, so why bother with an extra process? Just run them in another thread.
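The combined approach described above might look roughly like the following sketch. Everything named here is an assumption for illustration: fake_request is a hypothetical stand-in for a real HTTP call (it only sleeps), the example.invalid URLs are placeholders, and the batch size of 4 is arbitrary. Each forked child runs its batch in threads and sends the results back to the parent through a pipe. It relies on fork, so it assumes a Unix-like system.

```ruby
# Hypothetical stand-in for a real HTTP request; sleep simulates latency.
def fake_request(url)
  sleep 0.05
  "response for #{url}"
end

urls = (1..12).map { |i| "http://example.invalid/#{i}" }
per_process = 4 # arbitrary batch size: one child process per 4 requests

children = urls.each_slice(per_process).map do |batch|
  reader, writer = IO.pipe
  reader.binmode
  writer.binmode

  pid = fork do
    reader.close
    # Inside the child: one thread per request in this batch.
    bodies = batch.map { |url| Thread.new { fake_request(url) } }.map(&:value)
    writer.write(Marshal.dump(bodies))
    writer.close
    exit!(0)
  end

  writer.close
  [pid, reader]
end

# Back in the parent: read each child's results, then reap the child.
responses = children.flat_map do |pid, reader|
  data = reader.read
  reader.close
  Process.wait(pid)
  Marshal.load(data)
end

puts responses.size
```

Passing results through a pipe is only one option; it keeps the sketch self-contained, whereas a message bus would be the heavier alternative for sharing data between processes.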

Ruby 1.9 thread pools

As I understand it, Ruby 1.9 uses OS threads, but only one thread will actually be running at any given time (though one thread may be doing blocking IO while another thread is doing processing). The threading examples I've seen just use Thread.new to launch a new thread. Coming from a Java background, I typically use thread pools so as not to launch too many new threads, since they are "heavyweight."
Is there a thread pool construct built into Ruby? I didn't see one in the default language libraries. Or is there a standard gem that is typically used? Since OS-level threading is a newer feature of Ruby, I don't know how mature the libraries for it are.
You are correct that the default C Ruby interpreter only executes one thread at a time (other C-based dynamic languages such as Python have similar restrictions). Because of this restriction, threading is not really that common in Ruby, and as a result there is no default thread pool library. If there are tasks to be done in parallel, people typically use processes, since processes can scale over multiple servers.
If you do need to use threads, I would recommend you use https://github.com/meh/ruby-threadpool on the JRuby platform, which is a Ruby interpreter running on the JVM. That should be right up your alley, and because it is running on the virtual machine it will have true threading.
The accepted answer is correct, but there are many tasks for which threads are fine. After all, there are reasons why they exist. Even though Ruby can only run one thread at a time, threads can still be considered parallel in many real-life situations.
For example, suppose we have 100 long-running tasks, each taking approximately 10 minutes to complete, most of it spent waiting rather than computing. By using threads in Ruby, even with all those restrictions, if we define a thread pool that runs 10 tasks at a time, the whole job will finish much faster than the 100*10 minutes it takes without threads. Examples include live capturing of file changes and sending a large number of web requests (such as status checks).
You can understand how pooling works by reading https://blog.codeship.com/understanding-fundamental-ruby-abstraction-concurrency/ . In production code, use https://github.com/meh/ruby-thread#pool
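To make the pooling idea concrete, here is a minimal sketch built only on Queue from the standard library. It is not the API of the gem linked above; the pool size, the :done sentinel, and the squaring "work" are all assumptions chosen for the demo. A fixed number of worker threads pull jobs from a shared queue until they hit a sentinel.

```ruby
require "thread" # Queue; loaded by default on modern Rubies

pool_size = 3
jobs = Queue.new
results = Queue.new

# Enqueue the work, then one :done sentinel per worker so each one stops.
10.times { |n| jobs << n }
pool_size.times { jobs << :done }

workers = pool_size.times.map do
  Thread.new do
    while (job = jobs.pop) != :done
      sleep 0.01            # stands in for waiting on IO
      results << job * job  # the "work": square the job number
    end
  end
end
workers.each(&:join)

# Drain the results queue; completion order is not guaranteed, so sort.
squares = []
squares << results.pop until results.empty?
puts squares.sort.inspect
```

The queue does the synchronization here, so no explicit Mutex is needed. A production pool would also need error handling inside the workers, which this sketch leaves out.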

Speeding up IPC with Ruby

I am trying to do IPC between 2 processes on the same Linux box in Ruby, and I need to optimize the solution as far as practicable.
I had begun with a TCPSocket but I see that using UNIXSocket is probably faster, and perhaps does not copy data to kernel buffers.
I have been reading SO threads, and it looks like mmap might be interesting to look at. But for mmap I need to a) install the mmap gem and b) provide locking, since multiple client processes would likely try to connect to the server process (both running on the same box).
My questions:
What other options would you recommend?
How do you recommend locking the memory with ruby mmap?
How do the numbers, if there's any available, stack up for UNIXSocket versus mmap?

Ruby threading deadlocks

I'm writing a project at the moment that involves running two parallel threads to pull data from different sources at regular intervals. I am using the Threads functionality in ruby 1.9 to do this but am unfortunately running up against deadlock problems. Also I have a feeling that the Thread.join method is causing the threads to queue rather than run in parallel.
I'm new to multithreading programming and any advice would be greatly appreciated
Cheers
Patrick
EDIT: The shared resource that both these threads are accessing is a MySQL database, which could be the problem. The deadlock arises after a few iterations of these threads being run.
You can use synchronization mechanisms such as Mutex, Monitor, Queue, and SizedQueue from the standard library. Or is there a problem with using them?
It's very difficult to diagnose what could be going wrong without more details but deadlock is (obviously) caused by multiple threads trying to acquire resources held by others. That really means that you must have at least two mutexes and two threads. Could that be happening in your code?
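One common way to avoid the two-mutex situation described above is to make every thread acquire the locks in the same order. The sketch below is an assumption about what such code might look like, not a diagnosis of the original program: lock_a, lock_b, and the counter are hypothetical names. If one thread took lock_b first while the other took lock_a first, each could end up holding one lock while waiting for the other.

```ruby
require "thread"

lock_a = Mutex.new
lock_b = Mutex.new
counter = 0

# Both threads always take lock_a before lock_b, so neither can hold
# one lock while waiting on a thread that holds the other.
ordered_threads = 2.times.map do
  Thread.new do
    5.times do
      lock_a.synchronize do
        lock_b.synchronize { counter += 1 }
      end
    end
  end
end
ordered_threads.each(&:join)

puts counter
```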
Thread.join doesn't have anything to do with parallel execution. It is a synchronization method that lets one thread (usually the master) wait for one or more other threads to complete.
Which Ruby 1.9 implementation are you using? YARV cannot run Ruby threads in parallel. At the moment, there is no production-ready implementation of Ruby 1.9 which can run threads in parallel. JRuby can run threads in parallel, but its Ruby 1.9 implementation is not quite complete yet. (Although it is stable, so if all the features you need are there, you can use it.)
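A small sketch of that point: join only makes threads appear to queue if it is called right after each Thread.new, before the next thread has been started. Starting every thread first and joining afterwards lets the waits overlap. The pull method and the source names are hypothetical, with sleep standing in for fetching data from a real source.

```ruby
# Hypothetical stand-in for pulling data from a source.
def pull(source)
  sleep 0.2
  "data from #{source}"
end

join_started = Process.clock_gettime(Process::CLOCK_MONOTONIC)

# Start both threads before joining either of them.
pullers = %w[source_a source_b].map { |name| Thread.new { pull(name) } }
pullers.each(&:join)

join_elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - join_started
pulled = pullers.map(&:value)

puts pulled.inspect
puts format("elapsed: %.2fs", join_elapsed)
```

Writing Thread.new { ... }.join inside the loop instead would run the two pulls one after the other, taking roughly twice as long.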
