I need a process pool manager for MRI, with multiple worker subprocesses which in turn contain a number of worker threads. With good signal handling already built-in (that it kills all the threads in all the child processes, handles graceful/non-graceful quit and so on). Pretty much every Ruby webserver already has one, but I was wondering if there are canned gems offering this kind of pooling?
Related
I'm working on a Ruby script that will be making hundreds of network requests (via open-uri) to various APIs and I'd like to do this in parallel since each request is slow, and blocking.
I have been looking at using Thread or Process to achieve this but I'm not sure which method to use.
With regard to network request, when should i use a Thread over Process, or does it not matter?
Before going into detail, there is already a library solving your problem. Typhoeus is optimized to run a large number of HTTP requests in parallel and is based on the libcurl library.
Like a modern code version of the mythical beast with 100 serpent
heads, Typhoeus runs HTTP requests in parallel while cleanly
encapsulating handling logic.
Threads will be run in the same process as your application. Since Ruby 1.9 native threads are used as the underlying implementation. Resources can be easily shared across threads, as they all can access the mutual state of the application. The problem, however, is that you cannot utilize the multiple cores of your CPU with most Ruby implementations.
Ruby uses the Global Interpreter Lock (GIL). GIL is a locking mechanism to ensure that the mutual state is not corrupted due to parallel modifications from different threads. Other Ruby implementations like JRuby, Rubinius or MacRuby offer an approach without GIL.
Processes run separately from each other. Processes do not share resources, which means every process has its own state. This can be a problem, if you want to share data across your requests. A process also allocates its own stack of memory. You could still share data by using a messaging bus like RabitMQ.
I cannot recommend to use either only threads or only processes. If you want to implement that yourself, you should use both. Fork for every n requests a new processes which then again spawns a number of threads to issue the HTTP requests. Why?
If you fork for every HTTP request another process, this will result in too many processes. Although your operating system might be able to handle this, the overhead is still tremendous. Some HTTP requests might finish very fast, so why bother with an extra process, just run them in another thread.
I wanted to know how I can I make the io do something like a thread.join() wait for all tasks to finish.
io_type->post( strand->wrap(boost::bind &somemethod,ptr,parameter)));
In the above code if 4 threads were initially launched this would give work to the next available thread. However I want to know how I could actually wait for all the threads to finish work. Like we do with threads.join().
If this really needs to be done, then you could setup a mutex or critical section to stop your io handlers from processing messages off of the socket. This would need to be activated from another thread. But, more importantly...
Perhaps you should rethink your design. The problem with having the io wait for other threads to finish is that the io would then be unresponsive. In general, not a good idea. I suspect that most developers working on networking software would not even consider it. If you are receiving messages that are not ready to be processed yet due to other processing that is going on, then consider storing them in a queue and process them on a different thread when the other threads have signaled that they have completed their work.
I am looking for a good way to poll a lot of servers for their status through TCP. I am currently using synchronous code and the Minecraft Query Protocol, but whenever a server is offline the rest of the queue gets hold up.
Another problem I am experiencing with my current code is that some servers tend to block my server I use for polling in their firewall, and thus their servers appear offline on my serverlist.
I am using a Ruby rake task with an infinite loop in which every Minecraft server in my MongoDB database gets checked and updated every +- 10 minutes (I try to set this interval by letting the loop sleep (600/ s.count.to_i).ceil seconds.
Is there any way I can do this task efficiently (and prevent servers from blacklisting my IP in their firewall), preferably with Async code in Ruby?
You need to use non-blocking sockets to check - multithreading. The best thing to do is spawn several threads at once to check several servers at once - that way your main thread won't get held up.
This question contains a lot of information about multithreading in Ruby - you should be able to spawn multiple concurrent threads at once, or at least use non-blocking sockets.
Another point given by #Lie Ryan, you can use IO.Select to poll a array of servers, all at once. It will return an array of "online" servers when it's done - this could be more elegant than spawning multiple threads.
I was pretty sure, that all Rack application servers (I had some experience with Unicorn and Passenger) were creating single process for every worker when they were created, and its state was "frozen".
Whenever app server receives request to handle, it forks from the master process and all further changes to forked process are separated from the original process. They benefit from copy-on-write optimizations and are safe to be "damaged" by processing request. All changes to the environment affect only single process that will be preempted anyway.
If my vision of RoR application stack was true, there should be almost no need of garbage collection, unless serving single request would take a lot of time and memory (which usually is not the case).
On the other hand, question about GC measurements done with NewRelic and it's answers led me to conclusion that I must be completely wrong.
Can someone clarify this process?
Rack applications servers are not forking at each request, only during initialization:
First, the environment is loaded in one process
Then, the server fork several workers
Then all requests are distributed among these processes
That's why the Garbage collector is used to keep each process memory clean & stable.
As I understand, Ruby 1.9 uses OS threads but only one thread will still actually be running concurrently (though one thread may be doing blocking IO while another thread is doing processing). The threading examples I've seen just use Thread.new to launch a new thread. Coming from a Java background, I typically use thread pools as to not launch to many new threads since they are "heavyweight."
Is there a thread pool construct built into ruby? I didn't see one in the default language libraries. Or are there is a standard gem that is typically used? Since OS level threading is a newer feature of ruby, I don't know how mature the libraries are for it.
You are correct in that the default C Ruby interpreter only executes one thread at a time (other C based dynamic languages such as Python have similar restrictions). Because of this restriction, threading is not really that common in Ruby and as a result there is no default threadpool library. If there are tasks to be done in parallel, people typically uses processes since processes can scale over multiple servers.
If you do need to use threads, I would recommend you use https://github.com/meh/ruby-threadpool on the JRuby platform, which is a Ruby interpreter running on the JVM. That should be right up your alley, and because it is running on the virtual machine it will have true threading.
The accepted answer is correct, But, there are many tasks in which threads are fine. after all there are some reasons why it is there. even though it can only run a thread at a time. it is still can be considered parallel in many real life situations.
for example when we have 100 long running process in which each takes approximate 10 minutes to complete. by using threads in ruby, even with all those restrictions, if we define a threadpool of 10 tasks at time, it will run much faster than 100*10 minutes when running without threads. examples include, live capturing of file changes, sending large number of web requests (such as status check)
You can understand how pooling works by reading https://blog.codeship.com/understanding-fundamental-ruby-abstraction-concurrency/ . in production code use https://github.com/meh/ruby-thread#pool