Handle Sinatra and Faye in same EventMachine Performantly - ruby

I am writing a web application that uses both Sinatra—for general single-client synchronous gets—and Faye—for multiple-client asynchronous server-based broadcasts.
My (limited) understanding of EventMachine was that it would allow me to put both of these in a single process and get parallel requests handled for me. However, my testing shows that if either Sinatra or Faye takes a long time on a particular action (which may happen regularly with my real application) it blocks the other.
How can I rewrite the simple test application below so that if either sleep command is uncommented the Faye-pushes and the AJAX poll responses are not delayed?
%w[eventmachine thin sinatra faye json].each{ |lib| require lib }
def run!
EM.run do
Faye::WebSocket.load_adapter('thin')
webapp = MyWebApp.new
server = Faye::RackAdapter.new(mount:'/', timeout:25)
dispatch = Rack::Builder.app do
map('/'){ run webapp }
map('/faye'){ run server }
end
Rack::Server.start({
app: dispatch,
Host: '0.0.0.0',
Port: 8090,
server: 'thin',
signals: false,
})
end
end
class MyWebApp < Sinatra::Application
# http://stackoverflow.com/q/10881594/405017
configure{ set threaded:false }
def initialize
super
#faye = Faye::Client.new("http://localhost:8090/faye")
EM.add_periodic_timer(0.5) do
# uncommenting the following line should not
# prevent Sinatra from responding to "pull"
# sleep 5
#faye.publish( '/push', { faye:Time.now.to_f } )
end
end
get ('/pull') do
# uncommenting the following line should not
# prevent Faye from sending "push" updates rapidly
# sleep 5
content_type :json
{ sinatra:Time.now.to_f }.to_json
end
get '/' do
"<!DOCTYPE html>
<html lang='en'><head>
<meta charset='utf-8'>
<title>PerfTest</title>
<script src='https://code.jquery.com/jquery-2.2.0.min.js'></script>
<script src='/faye/client.js'></script>
<script>
var faye = new Faye.Client('/faye', { retry:2, timeout:10 } );
faye.subscribe('/push',console.log.bind(console));
setInterval(function(){
$.get('/pull',console.log.bind(console))
}, 500 );
</script>
</head><body>
Check the logs, yo.
</body></html>"
end
end
run!

How does sleep differ from, say, 999999.times{ Math.sqrt(rand) } or exec("sleep 5")? Those also block any single-thread, right? That's what I'm trying to simulate, a blocking command that takes a long time.
Both cases would block your reactor/event queue. With the reactor pattern, you want to avoid any CPU intensive work, and focus purely on IO (i.e. network programming).
The reason why the single-threaded reactor pattern works so well with I/O is because IO is not CPU intensive - instead it just blocks your programs while the system kernel handles your I/O request.
The reactor pattern takes advantage of this by immediately switching your single thread to potentially work on something different (perhaps the response of some other request has completed) until the I/O operation is completed by the OS.
Once the OS has the result of your IO request, EventMachine finds the callback you had initially registered with your I/O request and passes it the response data.
So instead of something like
# block here for perhaps 50 ms
r = RestClient.get("http://www.google.ca")
puts r.body
EventMachine is more like
# Absolutely no blocking
response = EventMachine::HttpRequest.new('http://google.ca/').get
# add to event queue for when kernel eventually delivers result
response.callback {
puts response.response
}
In the first example, you would need the multi-threaded model for your web server, since a single thread making a network request can block for potentially seconds.
In the second example, you don't have blocking operations, so one thread works great (and is generally faster than a multi-thread app!)
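The register-a-callback, get-called-on-readiness flow described above can be sketched with nothing but Ruby's standard library. This is only an illustration, not how EventMachine is implemented: TinyReactor, on_readable and demo_reactor are hypothetical names, and a helper thread stands in for the remote servers.

```ruby
# Minimal single-threaded reactor built on IO.select (stdlib only).
# TinyReactor is a hypothetical name used for illustration.
class TinyReactor
  def initialize
    @callbacks = {} # IO object => block to run once it becomes readable
  end

  # Register interest in an IO; returns immediately without blocking.
  def on_readable(io, &callback)
    @callbacks[io] = callback
  end

  # Loop until every registered IO has delivered its data.
  def run
    until @callbacks.empty?
      ready, = IO.select(@callbacks.keys) # wait for any registered IO
      ready.each do |io|
        callback = @callbacks.delete(io)
        callback.call(io.read_nonblock(1024))
      end
    end
  end
end

def demo_reactor
  results = []
  reactor = TinyReactor.new
  slow_r, slow_w = IO.pipe
  fast_r, fast_w = IO.pipe

  # Registration order is slow, then fast; nothing has been read yet.
  reactor.on_readable(slow_r) { |data| results << data }
  reactor.on_readable(fast_r) { |data| results << data }

  # Helper thread plays the remote peers: "fast" answers before "slow".
  peers = Thread.new do
    sleep 0.05
    fast_w.syswrite("fast")
    sleep 0.15
    slow_w.syswrite("slow")
  end

  reactor.run
  peers.join
  [slow_r, slow_w, fast_r, fast_w].each(&:close)
  results
end

p demo_reactor
```

The callbacks fire in the order the data arrives, not the order in which they were registered, and the reactor thread never waits on one particular request.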
If you ever do have a CPU intensive operation, EventMachine allows you to cheat a little bit, and start a new thread so that the reactor doesn't block. Read more about EM.defer here.
One final note is that this is the reason Node.js is so popular. For Ruby we need EventMachine + compatible libraries for the reactor pattern (can't just use the blocking RestClient for example), but Node.js and all of its libraries are written from the start for the reactor design pattern (they are callback based).

Related

Single thread still handles concurrency request?

A Ruby process runs on a single thread. When we start a single process using the Thin server, why are we still able to handle concurrent requests?
require 'sinatra'
require 'thin'
set :server, %w[thin]
get '/test' do
sleep 2 # <----
"success"
end
What inside Thin lets it handle concurrent requests? If it is due to the EventMachine framework, the code above is synchronous code, which is not written for EM.
Quoting the chapter: "Non blocking IOs/Reactor pattern" in
http://merbist.com/2011/02/22/concurrency-in-ruby-explained/:
"this is the approach used by Twisted, EventMachine and Node.js. Ruby developers can use EventMachine or
an EventMachine based webserver like Thin as well as EM clients/drivers to make non blocking async calls."
The heart of the matter is EventMachine.defer:
*
used for integrating blocking operations into EventMachine's control flow.
The action of defer is to take the block specified in the first parameter (the "operation")
and schedule it for asynchronous execution on an internal thread pool maintained by EventMachine.
When the operation completes, it will pass the result computed by the block (if any)
back to the EventMachine reactor.
Then, EventMachine calls the block specified in the second parameter to defer (the "callback"),
as part of its normal event handling loop.
The result computed by the operation block is passed as a parameter to the callback.
You may omit the callback parameter if you don't need to execute any code after the operation completes.
*
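The operation/callback hand-off described in that quote can be modelled with plain threads and queues. This is a minimal sketch under stated assumptions, not EventMachine's implementation: MiniDefer, run_callbacks and demo_defer are hypothetical names, and an operation that raises is not handled here.

```ruby
# Stdlib-only model of the defer pattern (hypothetical MiniDefer class).
class MiniDefer
  def initialize(pool_size)
    @jobs = Queue.new    # operations waiting for a worker thread
    @results = Queue.new # finished results waiting for the "reactor"
    @pending = 0
    pool_size.times do
      Thread.new do
        loop do
          operation, callback = @jobs.pop
          @results << [callback, operation.call] # runs on a worker thread
        end
      end
    end
  end

  # Schedule a blocking operation; the callback is optional.
  def defer(operation, callback = nil)
    @pending += 1
    @jobs << [operation, callback]
  end

  # Called on the "reactor" thread: callbacks never run on a worker.
  def run_callbacks
    while @pending > 0
      callback, result = @results.pop
      @pending -= 1
      callback&.call(result)
    end
  end
end

def demo_defer
  pool = MiniDefer.new(2)
  main = Thread.current
  seen = []
  operation = -> { sleep 0.05; 21 * 2 }
  callback = ->(value) { seen << [value, Thread.current == main] }
  pool.defer(operation, callback)
  pool.defer(-> { sleep 0.01 }) # callback omitted, as the quote allows
  pool.run_callbacks
  seen
end

p demo_defer
```

The result of the operation is passed to the callback, and the callback runs on the thread that drives the loop rather than on the pool thread that did the blocking work.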
Essentially, in response to an HTTP request, the server executes the code that you wrote by invoking the process method in the Connection class.
Have a look at the code in $GEM_HOME/gems/thin-1.6.2/lib/thin/connection.rb:
# Connection between the server and client.
# This class is instanciated by EventMachine on each new connection
# that is opened.
class Connection < EventMachine::Connection
# Called when all data was received and the request
# is ready to be processed.
def process
if threaded?
@request.threaded = true
EventMachine.defer(method(:pre_process), method(:post_process))
else
@request.threaded = false
post_process(pre_process)
end
end
...here is where a threaded connection invokes EventMachine.defer.
The reactor
To see where the EventMachine reactor is activated,
follow the initialization of the program.
Notice that any Sinatra application or middleware ($GEM_HOME/gems/sinatra-1.4.5/base.rb)
can run as a self-hosted server using Thin, Puma, Mongrel, or WEBrick.
def run!(options = {}, &block)
return if running?
set options
handler = detect_rack_handler
....
The method detect_rack_handler returns the first Rack::Handler:
return Rack::Handler.get(server_name.to_s)
In our test we require thin, so it returns a Thin Rack handler and sets up a threaded server:
# Starts the server by running the Rack Handler.
def start_server(handler, server_settings, handler_name)
handler.run(self, server_settings) do |server|
....
server.threaded = settings.threaded if server.respond_to? :threaded=
$GEM_HOME/gems/thin-1.6.2/lib/thin/server.rb
# Start the server and listen for connections.
def start
raise ArgumentError, 'app required' unless @app
log_info "Thin web server (v#{VERSION::STRING} codename #{VERSION::CODENAME})"
...
log_info "Listening on #{@backend}, CTRL+C to stop"
@backend.start { setup_signals if @setup_signals }
end
$GEM_HOME/gems/thin-1.6.2/lib/thin/backends/base.rb
# Start the backend and connect it.
def start
@stopping = false
starter = proc do
connect
yield if block_given?
@running = true
end
# Allow for early run up of eventmachine.
if EventMachine.reactor_running?
starter.call
else
@started_reactor = true
EventMachine.run(&starter)
end
end

Asynchronous IO server : Thin(Ruby) and Node.js. Any difference?

I want to clarify my understanding of asynchronous I/O and non-blocking servers.
With Node.js, the concept is easy to understand:
var express = require('express');
var app = express();
app.get('/test', function(req, res){
setTimeout(function(){
console.log("sleep doesn't block, and now return");
res.send('success');
}, 2000);
});
var server = app.listen(3000, function() {
console.log('Listening on port %d', server.address().port);
});
I know that while Node.js is waiting out the 2 seconds of setTimeout, it is able to serve another request at the same time; once the 2 seconds have passed, it calls its callback function.
How about in Ruby world, thin server?
require 'sinatra'
require 'thin'
set :server, %w[thin]
get '/test' do
sleep 2 # <----
"success"
end
The code snippet above uses the Thin server (non-blocking, asynchronous I/O). My question is: when execution reaches sleep 2, is the server able to serve another request at the same time, given that sleep 2 is a blocking call?
The difference between the Node.js and Sinatra code is that
Node.js is written in an asynchronous way (callback approach)
Ruby is written in a synchronous way (but works asynchronously under the covers? Is that true?)
If the above statement is true,
it seems that Ruby is better, as the code looks cleaner than a bunch of callback code in Node.js.
Kit
Sinatra / Thin
Thin will be started in threaded mode,
if it is started by Sinatra (i.e. with ruby asynchtest.rb)
This means that your assumptions are correct; when reaching sleep 2, the server is able to serve another request at the same time, but on another thread.
I would like to show this behavior with a simple test:
#asynchtest.rb
require 'sinatra'
require 'thin'
set :server, %w[thin]
get '/test' do
puts "[#{Time.now.strftime("%H:%M:%S")}] logging /test starts on thread_id:#{Thread.current.object_id} \n"
sleep 10
"[#{Time.now.strftime("%H:%M:%S")}] success - id:#{Thread.current.object_id} \n"
end
Let's test it by starting three concurrent HTTP requests (the timestamps and thread IDs are the relevant parts to observe):
The test demonstrates that we got three different threads (one for each concurrent request), namely:
70098572502680
70098572602260
70098572485180
Each of them starts concurrently (the start is nearly immediate, as we can see from the execution of the puts statement), then waits (sleeps) ten seconds and after that flushes the response to the client (the curl process).
deeper understanding
Quoting wikipedia - Asynchronous_I/O:
In computer science, asynchronous I/O, or non-blocking I/O is a form of input/output processing that permits
other processing to continue before the transmission has finished .
The above test (Sinatra/Thin) demonstrates that it is possible to start a first request from curl (the client) to Thin (the server)
and, before we get the response to the first (before the transmission has finished), to start a second and a third
request; these later requests are not queued but start concurrently with the first one, or in other words: permits other processing to continue.
Basically this is a confirmation of @Holger Just's comment: sleep blocks the current thread, but not the whole process. That said, in thin, most stuff is handled in the main reactor thread which thus works similar to the one thread available in node.js: if you block it, nothing else scheduled in this thread will run. In thin/eventmachine, you can however defer stuff to other threads.
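The point that sleep blocks only the current thread can be checked without a web server. The following is a minimal stdlib sketch; timed_sleepers is a hypothetical helper, and the 0.2 second sleep stands in for the 10 second request above.

```ruby
# Start `count` threads that each sleep, and measure total wall time.
def timed_sleepers(count, seconds)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  threads = Array.new(count) do
    Thread.new do
      sleep seconds            # blocks this thread only
      Thread.current.object_id # report which thread did the work
    end
  end
  ids = threads.map(&:value)   # wait for all threads and collect ids
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  [ids, elapsed]
end

ids, elapsed = timed_sleepers(3, 0.2)
puts "#{ids.uniq.size} distinct threads, #{elapsed.round(2)}s elapsed"
```

If the sleeps were serialized, the three of them would take about 0.6 seconds; because each one blocks only its own thread, the total stays close to a single sleep.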
These linked answers have more details: Is Sinatra multi threaded? and Single thread still handles concurrency request?
Node.js
To compare the behavior of the two platforms, let's run an equivalent asynchtest.js on Node.js; as we did in asynchtest.rb, we add a log line when processing starts to understand what happens.
Here is the code of asynchtest.js:
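var express = require('express');
var app = express();
// getTime is not defined in the original snippet; this helper is an assumed
// minimal stand-in that returns the current time as HH:MM:SS
function getTime(){
return new Date().toTimeString().slice(0, 8);
}
app.get('/test', function(req, res){
console.log("[" + getTime() + "] logging /test starts\n");
setTimeout(function(){
console.log("sleep doesn't block, and now return");
res.send('[' + getTime() + '] success \n');
},10000);
});
var server = app.listen(3000,function(){
console.log("listening on port %d", server.address().port);
});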
Starting three concurrent requests against Node.js shows the same behavior,
very similar to what we saw in the previous case.
This response doesn't claim to be exhaustive; the subject is very complex and deserves further study and specific evidence before you draw conclusions for your own purposes.
There are lots of subtle differences, almost too many to list here.
First, don't confuse "coding style" with "event model". There's no reason you need to use callbacks in Node.js (see various 'promise' libraries). And Ruby has EventMachine if you like callback-structured code.
Second, Thin (and Ruby) can have many different multi-tasking models. You didn't specify which one.
In Ruby 1.8.7, "Thread" will create green threads. The language actually turns a "sleep N" into a timer call, and allows other statements to execute. But it's got a lot of limitations.
Ruby 1.9.x can create native OS threads. But those can be hard to use (spinning up 1000's is bad for performance, etc.)
Ruby 1.9.x has "Fibers" which are a much better abstraction, very similar to Node.
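To make the Fiber point concrete, here is a minimal sketch of cooperative switching using only core Ruby. The round-robin loop is a toy scheduler written for this example, and interleave_with_fibers is a hypothetical name; a real evented setup would resume a fiber when its I/O completes rather than in turn.

```ruby
# Two fibers take turns: each yields where real code would wait on I/O.
def interleave_with_fibers
  log = []
  make_worker = lambda do |name|
    Fiber.new do
      3.times do |step|
        log << "#{name}#{step}"
        Fiber.yield :waiting # hand control back to the scheduler
      end
      :done # value returned by the final resume
    end
  end

  fibers = [make_worker.call("a"), make_worker.call("b")]
  # Toy scheduler: resume each fiber in turn until it reports :done.
  until fibers.empty?
    fibers = fibers.reject { |fiber| fiber.resume == :done }
  end
  log
end

p interleave_with_fibers
```

The code inside each fiber reads top to bottom like synchronous code, yet the two workers interleave on a single thread.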
In any comparison, you also have to take into account the entire ecosystem: Pretty much any node.js code will work in a callback. It's really hard to write blocking code. But many Ruby libraries are not Thread-aware out of the box (require special configuration, etc). Many seemingly simple things (DNS) can block the entire ruby process.
You also need to consider the language. Node.JS, is built on JavaScript, which has a lot of dark corners to trip you up. For example, it's easy to assume that JavaScript has Integers, but it doesn't. Ruby has fewer dark corners (such as Metaprogramming).
If you are really into evented architectures, you should really consider Go. It has the best of all worlds: The evented architecture is built in (just like in Node, except it's multiprocessor-aware), there are no callbacks (just like in Ruby), plus it has first-class messaging (very similar to Erlang). As a bonus, it will use a fraction of the memory of a Node or Ruby process.
No, Node.js is fully asynchronous: setTimeout will not block script execution, it just delays the part inside it. So these two pieces of code are not equivalent. Choosing a platform for your project depends on the tasks you want to accomplish.

Why Sinatra request takes EM thread?

Sinatra app receives requests for long running tasks and EM.defer them, launching them in EM's internal pool of 20 threads. When there are more than 20 EM.defer running, they are stored in EM's threadqueue by EM.defer.
However, it seems Sinatra won't service any requests until there is an EM thread available to handle them. My question is, isn't Sinatra supposed to use the reactor of the main thread to service all requests? Why am I seeing an add on the threadqueue when I make a new request?
Steps to reproduce:
Access /track/
Launch 30 /sleep/ reqs to fill the threadqueue
Access /ping/ and notice the add in the threadqueue as well as the delay
Code to reproduce it:
require 'sinatra'
#monkeypatch EM so we can access threadpools
module EventMachine
def self.queuedDefers
@threadqueue==nil ? 0: @threadqueue.size
end
def self.availThreads
@threadqueue==nil ? 0: @threadqueue.num_waiting
end
def self.busyThreads
@threadqueue==nil ? 0: @threadpool_size - @threadqueue.num_waiting
end
end
get '/track/?' do
EM.add_periodic_timer(1) do
p "Busy: " + EventMachine.busyThreads.to_s + "/" +EventMachine.threadpool_size.to_s + ", Available: " + EventMachine.availThreads.to_s + "/" +EventMachine.threadpool_size.to_s + ", Queued: " + EventMachine.queuedDefers.to_s
end
end
get '/sleep/?' do
EM.defer(Proc.new {sleep 20}, Proc.new {body "DONE"})
end
get '/ping/?' do
body "pong"
end
I tried the same thing on Rack/Thin (no Sinatra) and works as it's supposed to, so I guess Sinatra is causing it.
Ruby version: 1.9.3.p125
EventMachine: 1.0.0.beta.4.1
Sinatra: 1.3.2
OS: Windows
Ok, so it seems Sinatra starts Thin in threaded mode by default causing the above behavior.
You can add
set :threaded, false
in your Sinatra configure section. This prevents the reactor from deferring requests to a separate thread, and from blocking when under load.
Source1
Source2
Unless I'm misunderstanding something about your question, this is pretty much how EventMachine works. If you check out the docs for EM.defer, they state:
Don't write a deferred operation that will block forever. If so, the
current implementation will not detect the problem, and the thread
will never be returned to the pool. EventMachine limits the number of
threads in its pool, so if you do this enough times, your subsequent
deferred operations won't get a chance to run.
Basically, there's a finite number of threads, and if you use them up, any pending operations will block until a thread is available.
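The effect of a finite pool can be reproduced with a small stdlib sketch. It does not use EventMachine; run_on_pool and monotonic_now are hypothetical helpers, and a pool of 2 stands in for EM's pool of 20.

```ruby
def monotonic_now
  Process.clock_gettime(Process::CLOCK_MONOTONIC)
end

# Run job_count sleeping jobs on pool_size threads and return, per job,
# how long after the start it finished.
def run_on_pool(pool_size, job_count, job_seconds)
  jobs = Queue.new
  job_count.times { |index| jobs << index }
  jobs.close # once drained, pop returns nil and the workers exit

  finished = Queue.new
  started = monotonic_now
  workers = Array.new(pool_size) do
    Thread.new do
      while (job = jobs.pop)
        sleep job_seconds # the blocking "deferred operation"
        finished << [job, monotonic_now - started]
      end
    end
  end
  workers.each(&:join)

  Array.new(job_count) { finished.pop }.sort.map { |_, seconds| seconds }
end

times = run_on_pool(2, 4, 0.1)
times.each_with_index { |t, job| puts "job #{job} finished after #{t.round(2)}s" }
```

With two threads and four jobs, the last two jobs cannot start until one of the first two has finished, so they complete roughly one job duration later. The same thing happens to any request that needs a pool thread while all of them are busy.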
It might be possible to bump threadpool_size if you just need more threads, although ultimately that's not a long-term solution.
Is Sinatra multi threaded? is a really good question here on SO about Sinatra and threads. In short, Sinatra is awesome but if you need decent threading you might need to look elsewhere.

EventMachine: What is the maximum of parallel HTTP requests EM can handle?

I'm building a distributed web-crawler and trying to get maximum out of resources of each single machine. I run parsing functions in EventMachine through Iterator and use em-http-request to make asynchronous HTTP requests. For now I have 100 iterations that run at the same time and it seems that I can't pass over this level. If I increase a number of iteration it doesn't affect the speed of crawling. However, I get only 10-15% cpu load and 20-30% of network load, so there's plenty of room to crawl faster.
I'm using Ruby 1.9.2. Is there any way to improve the code to use resources effectively or maybe I'm even doing it wrong?
def start_job_crawl
@redis.lpop @queue do |link|
if link.nil?
EventMachine::add_timer( 1 ){ start_job_crawl() }
else
#parsing link, using asynchronous http request,
#doing something with the content
parse(link)
end
end
end
#main reactor loop
EM.run {
EM.kqueue
@redis = EM::Protocols::Redis.connect(:host => "127.0.0.1")
@redis.errback do |code|
puts "Redis error: #{code}"
end
#100 parallel 'threads'. Want to increase this
EM::Iterator.new(0..99, 100).each do |num, iter|
start_job_crawl()
end
}
If you are using select() (which is the default for EM), the maximum is 1024, because select() is limited to 1024 file descriptors.
However, it seems you are using kqueue, so it should be able to handle many more than 1024 file descriptors at once.
What is the value of your EM.threadpool_size?
Try enlarging it; I suspect the limit is not in kqueue but in the pool handling the requests...

Ruby Eventmachine queueing problem

I have a Http client written in Ruby that can make synchronous requests to URLs. However, to quickly execute multiple requests I decided to use Eventmachine. The idea is to
queue all the requests and execute them using eventmachine.
class EventMachineBackend
...
...
def execute(request)
$q ||= EM::Queue.new
$q.push(request)
$q.pop {|request| request.invoke}
EM.run{EM.next_tick {EM.stop}}
end
...
end
Forgive my use of a global queue variable. I will refactor it later. Is what I am doing in EventMachineBackend#execute the right way of using Eventmachine queues?
One problem I see in my implementation is it is essentially synchronous. I push a request, pop and execute the request and wait for it to complete.
Could anyone suggest a better implementation.
Your request logic has to be asynchronous for it to work with EventMachine, so I suggest that you use em-http-request. You can find an example on how to use it here; it shows how to run the requests in parallel. An even better interface for running multiple connections in parallel is the MultiRequest class from the same gem.
If you want to queue requests and only run a fixed number of them in parallel you can do something like this:
EM.run do
urls = [...] # regular array with URLs
active_requests = 0
# declared before when_done so that its block can refer to launch_next
launch_next = nil
# this routine will be used as callback and will
# be run when each request finishes
when_done = proc do
active_requests -= 1
if urls.empty? && active_requests == 0
# if there are no more urls, and there are no active
# requests it means we're done, so shut down the reactor
EM.stop
elsif !urls.empty?
# if there are more urls launch a new request
launch_next.call
end
end
# this routine launches a request
launch_next = proc do
# get the next url to fetch
url = urls.pop
# launch the request, and register the callback
request = EM::HttpRequest.new(url).get
request.callback(&when_done)
request.errback(&when_done)
# increment the number of active requests, this
# is important since it will tell us when all requests
# are done
active_requests += 1
end
# launch three requests in parallel, each will launch
# a new requests when done, so there will always be
# three requests active at any one time, unless there
# are no more urls to fetch
3.times do
launch_next.call
end
end
Caveat emptor, there may very well be some detail I've missed in the code above.
If you think it's hard to follow the logic in my example, welcome to the world of evented programming. It's really tricky to write readable evented code. It all goes backwards. Sometimes it helps to start reading from the end.
I've assumed that you don't want to add more requests after you've started downloading, it doesn't look like it from the code in your question, but should you want to you can rewrite my code to use an EM::Queue instead of a regular array, and remove the part that does EM.stop, since you will not be stopping. You can probably remove the code that keeps track of the number of active requests too, since that's not relevant. The important part would look something like this:
launch_next = proc do
urls.pop do |url|
request = EM::HttpRequest.new(url).get
request.callback(&launch_next)
request.errback(&launch_next)
end
end
Also, bear in mind that my code doesn't actually do anything with the response. The response will be passed as an argument to the when_done routine (in the first example). I also do the same thing for success and error, which you may not want to do in a real application.
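The bookkeeping in the first example can be exercised without a network or a reactor. The sketch below is a simulation under stated assumptions: fake_fetch is a hypothetical stand-in for an asynchronous HTTP call that parks its callback on an in-memory event list, and crawl_with_limit is a made-up name. It is meant only to show that the launch_next/when_done pair keeps at most `limit` requests in flight.

```ruby
def crawl_with_limit(urls, limit)
  urls = urls.dup
  events = []    # parked callbacks: our stand-in for the event queue
  active = 0     # requests currently in flight
  max_active = 0 # highest value `active` ever reached
  done = []

  # Hypothetical async fetch: returns at once, "completes" later.
  fake_fetch = lambda { |url, &callback| events << -> { callback.call(url) } }

  launch_next = nil # declared first so when_done can refer to it

  when_done = lambda do |url|
    active -= 1
    done << url
    launch_next.call unless urls.empty? # refill the freed slot
  end

  launch_next = lambda do
    url = urls.shift
    active += 1
    max_active = [max_active, active].max
    fake_fetch.call(url, &when_done)
  end

  [limit, urls.size].min.times { launch_next.call }
  events.shift.call until events.empty? # drain the simulated event loop
  { done: done, max_active: max_active }
end

p crawl_with_limit(%w[a b c d e f g], 3)
```

Every URL is processed, and the in-flight count never exceeds the limit, because a new request is launched only from the completion callback of an earlier one.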
