Ruby Eventmachine queueing problem - ruby

I have a Http client written in Ruby that can make synchronous requests to URLs. However, to quickly execute multiple requests I decided to use Eventmachine. The idea is to
queue all the requests and execute them using eventmachine.
class EventMachineBackend
...
...
def execute(request)
$q ||= EM.Queue.new
$q.push(request)
$q.pop {|request| request.invoke}
EM.run{EM.next_tick {EM.stop}}
end
...
end
Forgive my use of a global queue variable. I will refactor it later. Is what I am doing in EventMachineBackend#execute the right way of using Eventmachine queues?
One problem I see in my implementation is it is essentially synchronous. I push a request, pop and execute the request and wait for it to complete.
Could anyone suggest a better implementation.

Your the request logic has to be asynchronous for it to work with EventMachine, I suggest that you use em-http-request. You can find an example on how to use it here, it shows how to run the requests in parallel. An even better interface for running multiple connections in parallel is the MultiRequest class from the same gem.
If you want to queue requests and only run a fixed number of them in parallel you can do something like this:
EM.run do
urls = [...] # regular array with URLs
active_requests = 0
# this routine will be used as callback and will
# be run when each request finishes
when_done = proc do
active_requests -= 1
if urls.empty? && active_requests == 0
# if there are no more urls, and there are no active
# requests it means we're done, so shut down the reactor
EM.stop
elsif !urls.empty?
# if there are more urls launch a new request
launch_next.call
end
end
# this routine launches a request
launch_next = proc do
# get the next url to fetch
url = urls.pop
# launch the request, and register the callback
request = EM::HttpRequest.new(url).get
request.callback(&when_done)
request.errback(&when_done)
# increment the number of active requests, this
# is important since it will tell us when all requests
# are done
active_requests += 1
end
# launch three requests in parallel, each will launch
# a new requests when done, so there will always be
# three requests active at any one time, unless there
# are no more urls to fetch
3.times do
launch_next.call
end
end
Caveat emptor, there may very well be some detail I've missed in the code above.
If you think it's hard to follow the logic in my example, welcome to the world of evented programming. It's really tricky to write readable evented code. It all goes backwards. Sometimes it helps to start reading from the end.
I've assumed that you don't want to add more requests after you've started downloading, it doesn't look like it from the code in your question, but should you want to you can rewrite my code to use an EM::Queue instead of a regular array, and remove the part that does EM.stop, since you will not be stopping. You can probably remove the code that keeps track of the number of active requests too, since that's not relevant. The important part would look something like this:
launch_next = proc do
urls.pop do |url|
request = EM::HttpRequest.new(url).get
request.callback(&launch_next)
request.errback(&launch_next)
end
end
Also, bear in mind that my code doesn't actually do anything with the response. The response will be passed as an argument to the when_done routine (in the first example). I also do the same thing for success and error, which you may not want to do in a real application.

Related

How to connect to multiple WebSockets with Ruby?

Using faye-websocket and EventMachine the code looks very similar to faye-websocket's client example:
require 'faye/websocket'
require 'eventmachine'
def setup_socket(url)
EM.run {
ws = Faye::WebSocket::Client.new(url)
ws.on :open do ... end
ws.on :message do ... end
ws.on :close do ... end
}
end
I'd like to have multiple connections open parallely. I can't simply call setup_socket multiple times as the execution won't exit the EM.run clause. I've tried to run setup_socket multiple times in separate threads as:
urls.each do |url|
Thread.new { setup_socket(url) }
end
But it doesn't seem to do anyhting as the puts statements don't reach the output.
I'm not restricted to use faye-websocket but it seemed most people use this library. If possible I'd like to avoid multithreading. I'd also not like to lose the possiblity to make changes (e.g. add a new websocket) over time. Therefore moving the iteration of URLs inside the EM.run clause is not desired but instead starting multiple EMs would be more beneficial. I found an example for starting multiple servers via EM in a very clean way. I'm looking for something similar.
How can I connect to multiple WebSockets at the same time?
Here's one way to do it.
First, you have to accept that the EM thread needs to be running. Without this thread you won't be able to process any current connections. So you just can't get around that.
Then, in order to add new URLs to the EM thread you then need some way to communicate from the main thread to the EM thread, so you can tell it to launch a new connection. This can be done with EventMachine::Channel.
So what we can build now is something like this:
#channel = EventMachine::Channel.new
Thread.new {
EventMachine.run {
#channel.subscribe { |url|
ws = Faye::...new(url)
...
}
}
}
Then in the main thread, any time you want to add a new URL to the event loop, you just use this:
def setup_socket(url)
#channel.push(url)
end
Here's another way to do it... Use Iodine's native websocket support (or the Plezi framework) instead of em-websocket...
...I'm biased (I'm the author), but I think they make it a lot easier. Also, Plezi offers automatic scaling with Redis, so it's easy to grow.
Here's an example using Plezi, where each Controller acts like a channel, with it's own URL and Websocket callback (although I think Plezi's Auto-Dispatch is easier than the lower level on_message callback). This code can be placed in a config.ru file:
require 'plezi'
# Once controller / channel for all members of the "Red" group
class RedGroup
def index # HTTP index for the /red URL
"return the RedGroup client using `:render`".freeze
end
# handle websocket messages
def on_message data
# in this example, we'll send the data to all the members of the other group.
BlueGroup.broadcast :handle_message, data
end
# This is the method activated by the "broadcast" message
def handle_message data
write data # write the data to the client.
end
end
# the blue group controller / channel
class BlueGroup
def index # HTTP index for the /blue URL
"return the BlueGroup client using `:render`".freeze
end
# handle websocket messages
def on_message data
# in this example, we'll send the data to all the members of the other group.
RedGroup.broadcast :handle_message, data
end
# This is the method activated by the "broadcast" message
def handle_message data
write data
end
end
# the routes
Plezi.route '/red', RedGroup
Plezi.route '/blue', BlueGroup
# Set the Rack application
run Plezi.app
P.S.
I wrote this answer also because em-websocket might fail or hog resources in some cases. I'm not sure about the details, but it was noted both on the websocket-shootout benchmark and the AnyCable Websocket Benchmarks.

Run when you can

In my sinatra web application, I have a route:
get "/" do
temp = MyClass.new("hello",1)
redirect "/home"
end
Where MyClass is:
class MyClass
#instancesArray = []
def initialize(string,id)
#string = string
#id = id
#instancesArray[id] = this
end
def run(id)
puts #instancesArray[id].string
end
end
At some point I would want to run MyClass.run(1), but I wouldn't want it to execute immediately because that would slow down the servers response to some clients. I would want the server to wait to run MyClass.run(temp) until there was some time with a lighter load. How could I tell it to wait until there is an empty/light load, then run MyClass.run(temp)? Can I do that?
Addendum
Here is some sample code for what I would want to do:
$var = 0
get "/" do
$var = $var+1 # each time a request is recieved, it incriments
end
After that I would have a loop that would count requests/minute (so after a minute it would reset $var to 0, and if $var was less than some number, then it would run tasks util the load increased.
As Andrew mentioned (correctly—not sure why he was voted down), Sinatra stops processing a route when it sees a redirect, so any subsequent statements will never execute. As you stated, you don't want to put those statements before the redirect because that will block the request until they complete. You could potentially send the redirect status and header to the client without using the redirect method and then call MyClass#run. This will have the desired effect (from the client's perspective), but the server process (or thread) will block until it completes. This is undesirable because that process (or thread) will not be able to serve any new requests until it unblocks.
You could fork a new process (or spawn a new thread) to handle this background task asynchronously from the main process associated with the request. Unfortunately, this approach has the potential to get messy. You would have to code around different situations like the background task failing, or the fork/spawn failing, or the main request process not ending if it owns a running thread or other process. (Disclaimer: I don't really know enough about IPC in Ruby and Rack under different application servers to understand all of the different scenarios, but I'm confident that here there be dragons.)
The most common solution pattern for this type of problem is to push the task into some kind of work queue to be serviced later by another process. Pushing a task onto the queue is ideally a very quick operation, and won't block the main process for more than a few milliseconds. This introduces a few new challenges (where is the queue? how is the task described so that it can be facilitated at a later time without any context? how do we maintain the worker processes?) but fortunately a lot of the leg work has already been done by other people. :-)
There is the delayed_job gem, which seems to provide a nice all-in-one solution. Unfortunately, it's mostly geared towards Rails and ActiveRecord, and the efforts people have made in the past to make it work with Sinatra look to be unmaintained. The contemporary, framework-agnostic solutions are Resque and Sidekiq. It might take some effort to get up and running with either option, but it would be well worth it if you have several "run when you can" type functions in your application.
MyClass.run(temp) is never actually executing. In your current request to / path you instantiate a new instance of MyClass then it will immediately do a get request to /home. I'm not entirely sure what the question is though. If you want something to execute after the redirect, that functionality needs to exist within the /home route.
get '/home' do
# some code like MyClass.run(some_arg)
end

Which of Ruby's concurrency devices would be best suited for this scenario?

The whole threads/fibers/processes thing is confusing me a little. I have a practical problem that can be solved with some concurrency, so I thought this was a good opportunity to ask professionals and people more knowledgable than me about it.
I have a long array, let's say 3,000 items. I want to send a HTTP request for each item in the array.
Actually iterating over the array, generating requests, and sending them is very rapid. What takes time is waiting for each item to be received, processed, and acknowledged by the party I'm sending to. I'm essentially sending 100 bytes, waiting 2 seconds, sending 100 bytes, waiting 2 seconds.
What I would like to do instead is send these requests asynchronously. I want to send a request, specify what to do when I get the response, and in the meantime, send the next request.
From what I can see, there are four concurrency options I could use here.
Threads.
Fibers.
Processes; unsuitable as far as I know because multiple processes accessing the same array isn't feasible/safe.
Asynchronous functionality like JavaScript's XMLHttpRequest.
The simplest would seem to be the last one. But what is the best, simplest way to do that using Ruby?
Failing #4, which of the remaining three is the most sensible choice here?
Would any of these options also allow me to say "Have no more than 10 pending requests at any time"?
This is your classic producer/consumer problem and is nicely suited for threads in Ruby. Just create a Queue
urls = [...] # array with bunches of urls
require "thread"
queue = SizedQueue.new(10) # this will only allow 10 items on the queue at once
p1 = Thread.new do
url_slice = urls.each do |url|
response = do_http_request(url)
queue << response
end
queue << "done"
end
consumer = Thread.new do
http_response = queue.pop(true) # don't block when zero items are in queue
Thread.exit if http_response == "done"
process(http_response)
end
# wait for the consumer to finish
consumer.join
EventMachine as an event loop and em-synchrony as a Fiber wrapper for it's callbacks into synchronous code
Copy Paste from em-synchrony README
require "em-synchrony"
require "em-synchrony/em-http"
require "em-synchrony/fiber_iterator"
EM.synchrony do
concurrency = 2
urls = ['http://url.1.com', 'http://url2.com']
results = []
EM::Synchrony::FiberIterator.new(urls, concurrency).each do |url|
resp = EventMachine::HttpRequest.new(url).get
results.push resp.response
end
p results # all completed requests
EventMachine.stop
end
This is an IO bounded case that fits more in both:
Threading model: no problem with MRI Ruby in this case cause threads work well with IO cases; GIL effect is almost zero.
Asynchronous model, which proves(in practice and theory) to be far superior than threads when it comes to IO specific problems.
For this specific case and to make things far simpler, I would have gone with Typhoeus HTTP client which has a parallel support that works as the evented(Asynchronous) concurrency model.
Example:
hydra = Typhoeus::Hydra.new
%w(url1 url2 url3).each do |url|
request = Typhoeus::Request.new(url, followlocation: true)
request.on_complete do |response|
# do something with response
end
hydra.queue(request)
end
hydra.run # this is a blocking call that returns once all requests are complete

Typhoeus retry if fail

Currently, Typhoeus doesn't have automatic re-download in case of failure. What would be the best way of ensuring a retry if the download is not successful?
def request
request ||= Typhoeus::Request.new("www.example.com")
request.on_complete do |response|
if response.success?
xml = Nokogiri::XML(response.body)
else
# retry to download it
end
end
end
I think you need to refactor your code. You should have two queues and threads you're working with, at a minimum.
The first is a queue of URLs that you pull from to read via Typhoeus::Request.
If the queue is empty you sleep that thread for a minute, then look for a URL to retrieve. If you successfully read the page, parse it and push the resulting XML doc into a second queue of DOMs to work on. Process that from a second thread. And, if the second queue is empty, sleep that second thread until there is something to work on.
If reading a URL fails, automatically re-push it onto the first queue.
If both queues are empty you could exit the code, or let both threads sleep until something says to start processing URLs again and you repopulate the first queue.
You also need a retries-counter associated with the URL, otherwise if a site goes down you could retry forever. You could push little sub-arrays onto the queue as:
["url", 0]
where 0 is the retry, or get more complex using an object or define a class. Whatever you do, increment that counter until it hits a drop-dead value, then stop adding that to the queue and report it or remove it from your source of URLs database somehow.
That's somewhat similar to code I've written a couple times to handle big spidering tasks.
See Ruby's Thread and Queue classes for examples of this.
Also:
request ||= Typhoeus::Request.new("www.example.com")
makes no sense. request will be nil when that code runs, so the ||= will always fire. Instead use:
request = Typhoeus::Request.new("www.example.com")
modified with the appropriate code to pull the next value from the first queue mentioned above.

Typhoeus: How to write code in order to send and run a request more times but stop this process on the first successful response?

I am using the Typhoeus gem. The official documentation refers to Memoization:
Memoization: Hydra memoizes requests within a single run call. You
can also disable memoization.
hydra = Typhoeus::Hydra.new
2.times do
r = Typhoeus::Request.new("http://localhost/3000/users/1")
hydra.queue r
end
hydra.run # this will result in a single request being issued. However, the on_complete handlers of both will be called.
hydra.disable_memoization
2.times do
r = Typhoeus::Request.new("http://localhost/3000/users/1")
hydra.queue r
end
hydra.run # this will result in a two requests.
How do I write code to send and run a request multiple times but stop on the first successful response? Also, I would like to skip the current request if it has timed-out.
Take a look at Typhoeus's times.rb example.
Don't submit multiple requests to a URL to the Hydra queue, only do one per URL.
Inside the on_complete block you have access to the response object. The response object has a timed_out? method which checks to see if the request timed out. If it did, resubmit your request to Hydra then exit the block, otherwise process the content as normal.

Resources