I'm new to Celluloid and have some questions about pools and futures. I'm building a simple web crawler (see the example at the bottom). My URLS array holds tens of thousands of URLs, so the example is trimmed to a few hundred.
What I want to do is process the URLs in groups of at most 50 requests using futures, collect their results, then crawl the next 50 URLs, and so on. The problem with this code: I would expect it to use at most 50 threads, but it spawns 400 or more in my case. If the input grows, the snippet aborts because it cannot spawn further threads (OS limits; OS X in my case).
Why are so many threads spawned, and how can I avoid this? I need a fast crawler that uses all the resources the OS provides, but no more than that :) About 2,000 threads seems to be the limit on OS X; anything above that makes the code crash.
#!/usr/bin/env jruby
require 'celluloid'
require 'open-uri'
class Crawler
  include Celluloid
  def fetch(id)
    uri = URI("#{id}")
    req = open(uri).read
  end
end
URLS.each_slice(50).map do |idset|
  # a new pool is created for every slice of 50
  pool = Crawler.pool(size: 50)
  crawlers = idset.map do |id|
    pool.future(:fetch, id)
  end
  crawlers.compact.each do |resp|
    puts resp.value.size rescue nil
  end
end

Split the class. The Celluloid wiki warns never to create a pool of a worker from inside that worker:
Gotcha: Don't make pools inside workers!
Using MyWorker.pool within MyWorker will result in an unbounded
explosion of worker threads.
If you want to limit your pool, create it once outside the each_slice block so that every slice reuses the same threads:
pool = Crawler.pool(size: 50)
URLS.each_slice(50).map do |idset|
  crawlers = idset.map do |id|
    pool.future(:fetch, id)
  end
  # ...
end

On each iteration through a slice of 50 you're resetting the value of pool, which likely dereferences your pool manager. Since actors aren't garbage collected just by being dereferenced (you have to call #terminate), you're probably piling up your old pools. It should be OK to make just one pool and create all your futures at once (if you keep the return value small, the future object itself is small). If you do find that you have to slice, instantiate your pool outside the each_slice and it will keep using the same pool without making a new one each time around. If for some other reason you want a new pool each time, call terminate on the pool before you dereference it. Also be sure you're working with Celluloid 0.12.0+, as it fixes an issue where pool workers weren't being terminated when the pool was.
When I iterate around actors, I've found this bit of logging to be useful to be sure I don't have any actor leaks: "Actors left: #{Celluloid::Actor.all.to_set.length} Alive: #{Celluloid::Actor.all.to_set.reject { |a| a.nil? || !a.alive? }.length}"


How do I properly use Threads to ping a URL?

I am trying to ping a large number of URLs and retrieve information about each URL's certificate. From a thoughtbot article on threads and other sources, I've read that the best way to do this is with Threads. When I implement threads, however, I keep running into Timeout errors and other problems for URLs that I can retrieve successfully on their own. I was told in a related question I asked earlier that I should not use Timeout with Threads. However, the examples I see wrap API/Net::HTTP/TCPSocket calls in a Timeout block, and based on what I've read, that entire API/Net::HTTP/TCPSocket call will be nested within the Thread. Here is my code:
require 'socket'
require 'openssl'
require 'timeout'
class SslClient
  attr_reader :url, :port, :timeout
  def initialize(url, port = '443', timeout = 30)
    @url = url
    @port = port
    @timeout = timeout
  end
  def ping_for_certificate_info
    context = OpenSSL::SSL::SSLContext.new
    certificates = nil
    verify_result = nil
    Timeout.timeout(timeout) do
      tcp_client = TCPSocket.new(url, port)
      ssl_client = OpenSSL::SSL::SSLSocket.new tcp_client, context
      ssl_client.hostname = url
      ssl_client.sync_close = true
      ssl_client.connect # TLS handshake; peer_cert_chain is nil before it
      certificates = ssl_client.peer_cert_chain
      verify_result = ssl_client.verify_result
    end
    {certificate: certificates.first, verify_result: verify_result }
  rescue => error
    puts url
    puts error.inspect
  end
end
[VERY LARGE LIST OF URLS].map do |url|
  # one thread per URL
  Thread.new do
    ssl_client = SslClient.new(url)
    cert_info = ssl_client.ping_for_certificate_info
    puts cert_info
  end
end
If you run this code in your terminal, you will see many Timeout errors and Errno::ETIMEDOUT errors for sites that should return information. When I run these individually, however, I get the information I need. When I run them in threads they tend to fail, especially for domains with a foreign domain name. What I'm asking is whether I am using Threads correctly. The snippet I've pasted is part of a larger piece of code that interacts with ActiveRecord objects in Rails depending on the results. Am I using Timeout and Threads correctly? What do I need to do to make this work? Why would a ping work individually but not when wrapped in a thread? Help would be greatly appreciated.
There are several issues:
You should not spawn thousands of threads. Use a connection pool so that you have at most 20-30 concurrent requests going (determine this maximum by testing at which point network performance drops and you start getting these timeouts).
It's difficult to guarantee that your code is not broken when you use threads, which is why I suggest you use a library where others have figured it out for you, one that documents thread safety and concurrent requests. There are other libs out there (Typhoeus, patron), but a pure-Ruby client makes basic thread safety easier to achieve.
You should not use Timeout; use an alternative instead.
Also, I suggest you learn about threading issues like deadlocks, starvation and all the other gotchas. In your case you are starving network resources because all the threads are fighting for bandwidth.
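As a rough illustration of the first point, here is a minimal sketch of capping concurrency with a fixed set of worker threads that pull jobs from a Queue, using only the standard library. The names bounded_map and fake_ping are hypothetical, and the lambda merely sleeps where the real certificate check would go; this is not the API of any particular connection-pool gem.

```ruby
require "thread"

# Run the given block for every item using at most `size` threads at once.
# Results come back in the same order as `items`; a failed job yields the
# exception object instead of a value. (Hypothetical helper, stdlib only.)
def bounded_map(items, size: 20, &work)
  jobs = Queue.new
  items.each_with_index { |item, index| jobs << [item, index] }
  results = Array.new(items.size)

  workers = Array.new(size) do
    Thread.new do
      loop do
        begin
          item, index = jobs.pop(true) # non-blocking pop raises when empty
        rescue ThreadError
          break                        # no jobs left, let this worker finish
        end
        results[index] = begin
          work.call(item)
        rescue StandardError => e
          e
        end
      end
    end
  end

  workers.each(&:join)
  results
end

# Stand-in for the real network call (assumption): it only sleeps.
fake_ping = ->(url) { sleep 0.01; "checked #{url}" }
p bounded_map(%w[a.example b.example c.example], size: 2, &fake_ping)
```

The pool size is the knob to tune: raise it until timeouts start to appear, then back off.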

Which of Ruby's concurrency devices would be best suited for this scenario?

The whole threads/fibers/processes thing is confusing me a little. I have a practical problem that can be solved with some concurrency, so I thought this was a good opportunity to ask professionals and people more knowledgeable than me about it.
I have a long array, let's say 3,000 items. I want to send an HTTP request for each item in the array.
Actually iterating over the array, generating requests, and sending them is very rapid. What takes time is waiting for each item to be received, processed, and acknowledged by the party I'm sending to. I'm essentially sending 100 bytes, waiting 2 seconds, sending 100 bytes, waiting 2 seconds.
What I would like to do instead is send these requests asynchronously. I want to send a request, specify what to do when I get the response, and in the meantime, send the next request.
From what I can see, there are four concurrency options I could use here.
1. Threads.
2. Fibers.
3. Processes; unsuitable as far as I know because multiple processes accessing the same array isn't feasible/safe.
4. Asynchronous functionality like JavaScript's XMLHttpRequest.
The simplest would seem to be the last one. But what is the best, simplest way to do that using Ruby?
Failing #4, which of the remaining three is the most sensible choice here?
Would any of these options also allow me to say "Have no more than 10 pending requests at any time"?
This is your classic producer/consumer problem and is nicely suited for threads in Ruby. Just create a Queue:
urls = [...] # array with bunches of urls
require "thread"
queue = SizedQueue.new(10) # this will only allow 10 items on the queue at once
p1 = Thread.new do
  urls.each do |url|
    response = do_http_request(url)
    queue << response # blocks while the queue already holds 10 items
  end
  queue << "done"
end
consumer = Thread.new do
  loop do
    http_response = queue.pop # blocks until an item is available
    Thread.exit if http_response == "done"
    # handle http_response here
  end
end
# wait for the consumer to finish
consumer.join
Use EventMachine as the event loop and em-synchrony as a Fiber wrapper that turns its callbacks into synchronous-looking code.
Copy-paste from the em-synchrony README:
require "em-synchrony"
require "em-synchrony/em-http"
require "em-synchrony/fiber_iterator"
EM.synchrony do
  concurrency = 2
  urls = ['', '']
  results = []
  EM::Synchrony::FiberIterator.new(urls, concurrency).each do |url|
    resp = EventMachine::HttpRequest.new(url).get
    results.push resp.response
  end
  p results # all completed requests
  EventMachine.stop
end
This is an IO-bound case that fits both of these models:
Threading model: no problem with MRI Ruby in this case, because threads work well for IO-bound work; the GIL's effect is almost zero.
Asynchronous model, which proves (in practice and in theory) to be far superior to threads for IO-specific problems.
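A rough way to see the first point for yourself, assuming sleep is an acceptable stand-in for waiting on the network (like blocking IO, it lets other Ruby threads run while it waits). The helper names fake_request, run_sequential and run_threaded are made up for this sketch.

```ruby
require "benchmark"

# Simulated IO-bound request: the sleep stands in for network latency.
def fake_request(wait)
  sleep wait
  :ok
end

# One request after another: total time is roughly count * wait.
def run_sequential(count, wait)
  Array.new(count) { fake_request(wait) }
end

# One thread per request: the waits overlap, so total time is roughly wait.
def run_threaded(count, wait)
  Array.new(count) { Thread.new { fake_request(wait) } }.map(&:value)
end

sequential = Benchmark.realtime { run_sequential(10, 0.05) }
threaded   = Benchmark.realtime { run_threaded(10, 0.05) }
puts format("sequential: %.2fs, threaded: %.2fs", sequential, threaded)
```

The threaded run should finish in a fraction of the sequential time, because the threads spend nearly all of it waiting rather than executing Ruby code.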
For this specific case, and to keep things far simpler, I would go with the Typhoeus HTTP client, whose parallel-request support follows the evented (asynchronous) concurrency model.
hydra = Typhoeus::Hydra.new
%w(url1 url2 url3).each do |url|
  request = Typhoeus::Request.new(url, followlocation: true)
  request.on_complete do |response|
    # do something with response
  end
  hydra.queue(request)
end
hydra.run # this is a blocking call that returns once all requests are complete

Does this code require a Mutex to access the @clients variable thread safely?

For fun I wrote this Ruby socket server, which actually works quite nicely. I'm planning on using it for the backend of an iOS app. My question for now is: when in the thread do I need a Mutex? Will I need one when accessing a shared variable such as @clients?
require 'rubygems'
require 'socket'
module Server
  @server = nil
  @clients = []
  def self.run(port)
    @server = TCPServer.new port
    while (socket = @server.accept)
      @clients << socket
      Thread.start(socket) do |socket|
        begin
          loop do
            msg = String.new
            # read one message in chunks of up to 1024 bytes
            while (data = socket.readpartial(1024))
              msg << data
              break if data.to_s.length < 1024
            end
            @clients.each do |client| client.write "#{socket} says: #{msg}" unless client == socket end
          end
        rescue => e
          @clients.delete socket
          puts e
          puts "Killed client #{socket}"
          Thread.exit
        end
      end
    end
  end
end
According to the answer from John Bollinger I need to synchronize the thread any time that a thread needs to access a shared resource. Does this apply to database queries? Can I read/write from a postgres database with ActiveRecord ORM inside multiple threads at once?
Any data that may be modified by one thread and read by a different one must be protected by a Mutex or a similar synchronization construct. Inasmuch as multiple threads may safely read the same data at the same time, a synchronization construct a bit more sophisticated than a single Mutex might yield better performance.
In your code, it looks like not only does @clients need to be properly synchronized, but so do all its elements, because writing to a socket is a modification.
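One way this could look, as a minimal sketch rather than a drop-in fix: a small wrapper that guards the shared list with a Mutex. The ClientRegistry class and its method names are hypothetical, and StringIO objects stand in for sockets so the example runs without a network.

```ruby
require "thread"
require "stringio"

# Hypothetical wrapper that serializes all access to the shared client list.
class ClientRegistry
  def initialize
    @clients = []
    @lock = Mutex.new
  end

  def add(client)
    @lock.synchronize { @clients << client }
  end

  def remove(client)
    @lock.synchronize { @clients.delete(client) }
  end

  # Write to every client except the sender while holding the lock, so the
  # list cannot change mid-iteration and writes to a socket do not interleave.
  def broadcast(sender, message)
    @lock.synchronize do
      @clients.each { |client| client.write(message) unless client.equal?(sender) }
    end
  end

  def size
    @lock.synchronize { @clients.size }
  end
end

registry = ClientRegistry.new
alice = StringIO.new # stand-ins for accepted sockets
bob = StringIO.new
registry.add(alice)
registry.add(bob)
registry.broadcast(alice, "hello")
puts bob.string
```

Holding the lock during the writes keeps the sketch simple, but a slow client then stalls every other broadcaster; copying the list under the lock and writing outside it is a possible trade-off.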
Don't use a mutex unless you really have to.
It's a pity the literature on Ruby multi-threading is so scarce; the only good book written on the topic is Working With Ruby Threads by Jesse Storimer. I've learned a lot of useful principles from it, one of which is: don't use a mutex if there are better alternatives. In your case, there are. If you use Ruby without any gems, the only thread-safe data structure is a Queue. An Array is not safe. However, with the thread_safe gem you can create one:
require 'thread_safe'
sa = ThreadSafe::Array.new # supports standard Array.new forms
sh = ThreadSafe::Hash.new # supports standard Hash.new forms
Regarding your question: you need to protect a shared data structure with a mutex only if some thread modifies it. If all the threads just read from that data structure and none writes to it, no mutex is needed (see John's comment for an explanation of a case where you might need one because one thread is reading while another is writing). You don't need one for accessing unchanging data. If you're using Active Record + Postgres: yes, Active Record is thread-safe; as for Postgres, you might want to follow these instructions (Behavior in Threaded Programs) to check that.
Also, be aware of race conditions (see How to Make ActiveRecord ThreadSafe), an inherent problem you should keep in mind when coding multi-threaded apps.
Avdi Grimm had one very sound piece of advice for multi-threaded apps: when testing them, make them fail loud and fast. So don't forget to add at the top:
Thread.abort_on_exception = true
so your threads don't silently fail if something wrong happens.

Ruby and Celluloid

Due to some limitations I want to switch my current project from EventMachine/EM-Synchrony to Celluloid, but I'm having some trouble getting started with it. The project I am working on is a web harvester that should crawl tons of pages as fast as possible.
To get a basic understanding of Celluloid I've generated 10,000 dummy pages on a local web server and want to crawl them with this simple Celluloid snippet:
#!/usr/bin/env jruby --1.9
require 'celluloid'
require 'open-uri'
IDS = 1..9999
class Crawler
  include Celluloid
  def read(id)
    url = "#{BASE_URL}/#{id}"
    puts "URL: " + url
    open(url) { |x| x.read }
  end
end
pool = Crawler.pool(size: 100)
IDS.to_a.map do |id|
  pool.future(:read, id)
end
As far as I understand Celluloid, futures are the way to get the response of a fired request (comparable to callbacks in EventMachine), right? The other thing is that every actor runs in its own thread, so I need some way of batching the requests, because 10,000 threads would result in errors on my OS X dev machine.
So creating a pool is the way to go, right? BUT: the code above iterates over the 9999 URLs but only 1300 HTTP requests are sent to the web server. So something goes wrong with limiting the requests and iterating over all URLs.
Likely your program is exiting as soon as all of your futures are created. With Celluloid a future will start execution but you can't be assured of it finishing until you call #value on the future object. This holds true for futures in pools as well. Probably what you need to do is change it to something like this:
crawlers = IDS.to_a.map do |id|
  begin
    pool.future(:read, id)
  rescue DeadActorError, MailboxError
  end
end
crawlers.compact.each { |crawler| crawler.value rescue nil }
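The same "start everything, then wait for every result" shape can be tried without Celluloid. The sketch below is only an analogy built on plain threads from the standard library, not Celluloid's API: Thread#value blocks until the thread has finished, much as calling #value on a future does. The helpers slow_read, start_all and collect are hypothetical, and the sleep stands in for the HTTP request.

```ruby
# Stand-in for a slow fetch (assumption): the sleep simulates network latency.
def slow_read(id)
  sleep 0.01
  "page #{id}"
end

# Start all the work. Nothing here waits for it to finish, so a script that
# ended right after this call could exit with requests still in flight.
def start_all(ids)
  ids.map { |id| Thread.new { slow_read(id) } }
end

# Thread#value joins each thread and returns the block's result.
def collect(threads)
  threads.map(&:value)
end

pending = start_all(1..5)
puts collect(pending).inspect
```

Without the collect step the program may reach its end while work is still running, which matches the symptom of only part of the requests being sent.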

EventMachine: What is the maximum of parallel HTTP requests EM can handle?

I'm building a distributed web crawler and trying to get the maximum out of the resources of each machine. I run parsing functions in EventMachine through Iterator and use em-http-request to make asynchronous HTTP requests. For now I have 100 iterations running at the same time, and it seems that I can't get past this level. Increasing the number of iterations doesn't affect the crawling speed. However, I get only 10-15% CPU load and 20-30% network load, so there's plenty of room to crawl faster.
I'm using Ruby 1.9.2. Is there any way to improve the code to use resources effectively or maybe I'm even doing it wrong?
def start_job_crawl
  @redis.lpop @queue do |link|
    if link.nil?
      EventMachine::add_timer( 1 ){ start_job_crawl() }
    else
      #parsing link, using asynchronous http request,
      #doing something with the content
    end
  end
end
#main reactor loop
EM.run {
  EM.kqueue
  @redis = EM::Protocols::Redis.connect(:host => "")
  @redis.errback do |code|
    puts "Redis error: #{code}"
  end
  #100 parallel 'threads'. Want to increase this
  EM::Iterator.new(0..99, 100).each do |num, iter|
    start_job_crawl()
  end
}
If you are using select() (which is the default for EM), the maximum is 1024, because select() is limited to 1024 file descriptors.
However, it seems like you are using kqueue, so it should be able to handle many more than 1024 file descriptors at once.
What is the value of your EM.threadpool_size?
Try enlarging it; I suspect the limit is not in kqueue but in the pool handling the requests...
