Asynchronously call multiple GET requests from Ruby

I have a bunch of web requests I am making in parallel right now using the parallel gem. This is causing all kinds of memory issues due to vfork. These web requests take around 30 seconds each. Is there a way I can queue them all up asynchronously and have them start at the same time without using the parallel gem?
Right now I use Faraday to do the web requests. The code for each request looks like this:
conn = Faraday.new(url: TRIGGER_URL)
conn.post do |req|
  req.headers['Content-Type'] = 'application/json'
  req.options.timeout = 540
  req.body = {
    auth_key: AUTH_KEY,
    image_url: image_url,
    space_id: space_id,
    scene_num: scene_num,
    cylinder_mode: cylinder_mode
  }.to_json
end.body

The async-http gem did exactly what I wanted:
https://github.com/socketry/async-http
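A minimal sketch of what that looks like, assuming an image_urls collection and the same TRIGGER_URL and AUTH_KEY constants as the Faraday snippet above:

require 'async'
require 'async/http/internet'

Async do
  internet = Async::HTTP::Internet.new
  headers  = [['content-type', 'application/json']]

  # One lightweight task per request; all of them run concurrently
  # on a single reactor instead of one forked process each.
  tasks = image_urls.map do |image_url|
    Async do
      body = {auth_key: AUTH_KEY, image_url: image_url, space_id: space_id,
              scene_num: scene_num, cylinder_mode: cylinder_mode}.to_json
      internet.post(TRIGGER_URL, headers, body).read
    end
  end

  bodies = tasks.map(&:wait) # resolves once every request has finished
ensure
  internet&.close
end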

Related

Run code in sinatra after the response has been sent

I'm pretty new to Ruby and Sinatra and I'm trying to set up a basic Sinatra server to listen for HTTP POST requests and then process the data.
I need to send the response within 5 seconds, or the server (Shopify) which sends the POST thinks the request has failed and sends it again. To avoid that, Shopify advises deferring processing until after the response has been sent.
I'm not sure how to trigger my processing once Sinatra has sent the response.
Will this work?
require 'sinatra'
require 'json'

webhook_data = Order.new

post '/' do
  request.body.rewind
  data = request.body.read
  webhook_data.parsed_json = JSON.parse(data)
  puts "My response gets sent here, right?"
end

after do
  # DO MY PROCESSING HERE
end
Is there any better way to do this?
You can use any solution for background job processing. Sidekiq, for example, works well with Sinatra.
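A minimal sketch of that approach (the WebhookWorker class name is illustrative, not something Sidekiq provides):

require 'sinatra'
require 'sidekiq'
require 'json'

# Runs in a separate Sidekiq process, so the HTTP response is never delayed.
class WebhookWorker
  include Sidekiq::Worker

  def perform(raw_body)
    data = JSON.parse(raw_body)
    # slow webhook processing goes here
  end
end

post '/' do
  request.body.rewind
  WebhookWorker.perform_async(request.body.read)
  status 200 # Shopify gets its response well within the 5-second window
end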
You can try to use Threads as well:
set :threaded, true

post '/' do
  request.body.rewind
  data = request.body.read
  Thread.new do
    # data processing stuff goes here
  end
  # the response goes here
end

Sinatra - How to calculate the response time for every request for stats purposes?

I would like to measure the execution time of each Sinatra route, including the complete request/response cycle duration, and then send these metrics to a Graphite server.
My first approach was to use Rack::Runtime and then fetch the values I needed from the response headers in the Sinatra after filter, but I discovered that this filter is actually executed before the response is completely sent to the client.
So not only can I not access much of the response information in the after block, I also cannot use this block to send the metrics to Graphite in any other way, because they wouldn't reflect the real response time.
I've read in other topics that a possible approach is to create a Rack middleware that wraps the application and performs the benchmark, and I ended up with something like this:
class GraphiteRoutesReporter
  def initialize(app)
    @app = app
  end

  def call(env)
    start_time = Time.now
    status, headers, body = @app.call(env)
    time_taken = 1000 * (Time.now - start_time)
    # send #{time_taken} to my stats server
    [status, headers, body]
  end
end
I can include this in config.ru, and it seems to work fine.
But I worry that this code messes with the core Rack request chain and that I am making incorrect use of the Sinatra public API.
What is the proper way to get the full response time of a Sinatra request?
If I needed a solution for a non-business-critical reason (a "fun" scenario, so to speak), I would periodically parse (awk) the default log output of Sinatra, where the response time is included at the very end (0.1093 seconds in the example below, if I'm not wrong):
179.24.226.1 - felixb [22/Aug/2016:13:30:46 +0200] "GET /index HTTP/1.0" 200 11546 0.1093
That might lead to the idea of implementing a simple Logger that does whatever should happen with the output (yes, that's a hack).
That said, your approach looks fine to me; just make sure to offload the # send #{time_taken} to my stats server step: you don't want to make your users wait just because your Graphite server is slow to accept the metric.
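One minimal way to do that offloading inside the middleware's call method, sketched below (send_to_graphite is a hypothetical helper standing in for your reporting code):

def call(env)
  start_time = Time.now
  status, headers, body = @app.call(env)
  time_taken = 1000 * (Time.now - start_time)
  # report in a background thread so the client never waits on Graphite
  Thread.new { send_to_graphite(time_taken) }
  [status, headers, body]
end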
Also if it is about profiling your web-app/server, take a look at https://github.com/MiniProfiler/rack-mini-profiler .

Handle Sinatra and Faye in same EventMachine Performantly

I am writing a web application that uses both Sinatra—for general single-client synchronous gets—and Faye—for multiple-client asynchronous server-based broadcasts.
My (limited) understanding of EventMachine was that it would allow me to put both of these in a single process and get parallel requests handled for me. However, my testing shows that if either Sinatra or Faye takes a long time on a particular action (which may happen regularly with my real application) it blocks the other.
How can I rewrite the simple test application below so that if either sleep command is uncommented the Faye-pushes and the AJAX poll responses are not delayed?
%w[eventmachine thin sinatra faye json].each{ |lib| require lib }

def run!
  EM.run do
    Faye::WebSocket.load_adapter('thin')
    webapp = MyWebApp.new
    server = Faye::RackAdapter.new(mount: '/', timeout: 25)
    dispatch = Rack::Builder.app do
      map('/'){ run webapp }
      map('/faye'){ run server }
    end
    Rack::Server.start({
      app:     dispatch,
      Host:    '0.0.0.0',
      Port:    8090,
      server:  'thin',
      signals: false,
    })
  end
end

class MyWebApp < Sinatra::Application
  # http://stackoverflow.com/q/10881594/405017
  configure{ set threaded: false }

  def initialize
    super
    @faye = Faye::Client.new("http://localhost:8090/faye")
    EM.add_periodic_timer(0.5) do
      # uncommenting the following line should not
      # prevent Sinatra from responding to "pull"
      # sleep 5
      @faye.publish( '/push', { faye: Time.now.to_f } )
    end
  end

  get '/pull' do
    # uncommenting the following line should not
    # prevent Faye from sending "push" updates rapidly
    # sleep 5
    content_type :json
    { sinatra: Time.now.to_f }.to_json
  end

  get '/' do
    "<!DOCTYPE html>
    <html lang='en'><head>
      <meta charset='utf-8'>
      <title>PerfTest</title>
      <script src='https://code.jquery.com/jquery-2.2.0.min.js'></script>
      <script src='/faye/client.js'></script>
      <script>
        var faye = new Faye.Client('/faye', { retry:2, timeout:10 } );
        faye.subscribe('/push',console.log.bind(console));
        setInterval(function(){
          $.get('/pull',console.log.bind(console))
        }, 500 );
      </script>
    </head><body>
      Check the logs, yo.
    </body></html>"
  end
end

run!
How does sleep differ from, say, 999999.times{ Math.sqrt(rand) } or exec("sleep 5")? Those also block any single thread, right? That's what I'm trying to simulate: a blocking command that takes a long time.
Both cases would block your reactor/event queue. With the reactor pattern, you want to avoid any CPU-intensive work and focus purely on I/O (i.e. network programming).
The reason the single-threaded reactor pattern works so well with I/O is that I/O is not CPU-intensive: instead it just blocks your program while the system kernel handles your I/O request.
The reactor pattern takes advantage of this by immediately switching your single thread to potentially work on something different (perhaps the response of some other request has completed) until the I/O operation is completed by the OS.
Once the OS has the result of your IO request, EventMachine finds the callback you had initially registered with your I/O request and passes it the response data.
So instead of something like:

# block here for perhaps 50 ms
r = RestClient.get("http://www.google.ca")
puts r.body

EventMachine is more like:

# absolutely no blocking
http = EventMachine::HttpRequest.new('http://google.ca/').get

# the callback is added to the event queue and runs once the
# kernel eventually delivers the result
http.callback {
  puts http.response
}
In the first example, you would need the multi-threaded model for your web server, since a single thread making a network request can block for potentially seconds.
In the second example, you don't have blocking operations, so one thread works great (and is generally faster than a multi-threaded app!)
If you ever do have a CPU-intensive operation, EventMachine allows you to cheat a little bit and start a new thread so that the reactor doesn't block. Read more about EM.defer in the EventMachine documentation.
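A quick sketch of that escape hatch, reusing the CPU-bound example from above:

require 'eventmachine'

EM.run do
  # the operation runs on a thread from EM's internal pool, so the
  # reactor keeps servicing I/O while the CPU-heavy work proceeds
  operation = proc { 999_999.times { Math.sqrt(rand) }; :done }
  # the callback is invoked on the reactor with the operation's result
  callback  = proc { |result| puts "finished: #{result}"; EM.stop }
  EM.defer(operation, callback)
end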
One final note: this is the reason Node.js is so popular. For Ruby we need EventMachine plus compatible libraries for the reactor pattern (you can't just use the blocking RestClient, for example), whereas Node.js and all of its libraries are written from the start for the reactor design pattern (they are callback-based).

Ruby and Celluloid

Due to some limitations I want to switch my current project from EventMachine/EM-Synchrony to Celluloid, but I'm having some trouble getting started with it. The project I am working on is a web harvester which should crawl tons of pages as fast as possible.
To get a basic understanding of Celluloid, I've generated 10,000 dummy pages on a local web server and want to crawl them with this simple Celluloid snippet:
#!/usr/bin/env jruby --1.9

require 'celluloid'
require 'open-uri'

IDS = 1..9999
BASE_URL = "http://192.168.0.20/files"

class Crawler
  include Celluloid

  def read(id)
    url = "#{BASE_URL}/#{id}"
    puts "URL: " + url
    open(url) { |x| x.read }
  end
end

pool = Crawler.pool(size: 100)

IDS.to_a.map do |id|
  pool.future(:read, id)
end
As far as I understand Celluloid, futures are the way to get the response of a fired request (comparable to callbacks in EventMachine), right? The other thing is that every actor runs in its own thread, so I need some way of batching the requests, because 10,000 threads would cause errors on my OS X dev machine.
So creating a pool is the way to go, right? BUT: the code above iterates over the 9999 URLs, yet only 1300 HTTP requests reach the web server. So something goes wrong with limiting the requests and iterating over all the URLs.
Likely your program is exiting as soon as all of your futures are created. With Celluloid a future will start execution but you can't be assured of it finishing until you call #value on the future object. This holds true for futures in pools as well. Probably what you need to do is change it to something like this:
crawlers = IDS.to_a.map do |id|
  begin
    pool.future(:read, id)
  rescue DeadActorError, MailboxError
  end
end

crawlers.compact.each { |crawler| crawler.value rescue nil }

What is the preferred way of performing non-blocking I/O in Ruby?

Say I want to retrieve a web page for parsing, but not block the CPU while the I/O is taking place. Is there something equivalent to Python's Eventlet library?
The best HTTP client library for Ruby is Typhoeus; it can be used to perform multiple HTTP requests in parallel in a non-blocking fashion. There are both blocking and non-blocking interfaces:
# blocking
response = Typhoeus::Request.get("http://stackoverflow.com/")
puts response.body

# non-blocking
request1 = Typhoeus::Request.new("http://stackoverflow.com/")
request1.on_complete do |response|
  puts response.body
end

request2 = Typhoeus::Request.new("http://stackoverflow.com/questions")
request2.on_complete do |response|
  puts response.body
end

hydra = Typhoeus::Hydra.new
hydra.queue(request1)
hydra.queue(request2)
hydra.run # this call is blocking, though
Another option is em-http-request, which runs on top of EventMachine. It has a completely non-blocking interface:
EventMachine.run do
  request = EventMachine::HttpRequest.new('http://stackoverflow.com/').get
  request.callback do
    puts request.response
    EventMachine.stop
  end
end
There's also an interface for making many requests in parallel, similar to Typhoeus's Hydra.
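A sketch of that parallel interface, based on em-http-request's MultiRequest to the best of my knowledge of its API:

EventMachine.run do
  multi = EventMachine::MultiRequest.new
  multi.add :home,      EventMachine::HttpRequest.new('http://stackoverflow.com/').get
  multi.add :questions, EventMachine::HttpRequest.new('http://stackoverflow.com/questions').get

  # fires once every request has either succeeded or failed
  multi.callback do
    multi.responses[:callback].each do |name, http|
      puts "#{name}: #{http.response_header.status}"
    end
    EventMachine.stop
  end
end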
The downside of em-http-request is that it is tied to EventMachine. EventMachine is an awesome framework in itself, but it's an all-or-nothing deal. You need to write your whole application in an evented/continuation-passing-style fashion, and that has been known to cause brain damage. Typhoeus is much better suited to applications that are not already evented.
I'm not sure what Eventlet does, but Ruby has EventMachine, a library for non-blocking IO (amongst other things).
