Is there a way to flush html to the wire in Sinatra - ruby

I have a Sinatra app with a long-running process (a web scraper). I'd like the app to flush the results of the crawler's progress as the crawler is running, instead of all at once at the end.
I've considered forking the request and doing something fancy with Ajax, but this is a really basic one-page app that just needs to output a log to the browser as it happens. Any suggestions?

Update (2012-03-21)
As of Sinatra 1.3.0, you can use the new streaming API:
get '/' do
  stream do |out|
    out << "foo\n"
    sleep 10
    out << "bar\n"
  end
end
Old Answer
Unfortunately you don't have a stream you can simply flush to (that would not work with Rack middleware). The result returned from a route block only has to respond to each. The Rack handler will then call each with a block, and in that block it flushes the given part of the body to the client.
All Rack responses have to respond to each, and each must always hand strings to the given block. Sinatra takes care of this for you if you just return a string.
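For illustration, here is a minimal bare-Rack sketch of that contract (my example, not part of the original answer): any body object whose each yields strings gets flushed piece by piece by the handler:
# config.ru -- run with: rackup
class SlowBody
  def each
    ["this", " takes", " some", " time"].each do |str|
      yield str
      sleep 0.3
    end
  end
end

run lambda { |env| [200, { 'Content-Type' => 'text/plain' }, SlowBody.new] }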
A simple streaming example would be:
require 'sinatra'

get '/' do
  result = ["this", " takes", " some", " time"]

  class << result
    def each
      super do |str|
        yield str
        sleep 0.3
      end
    end
  end

  result
end
Now you could simply place all your crawling in the each method:
require 'sinatra'
require 'open-uri' # needed so open can fetch an http URL

class Crawler
  def initialize(url)
    @url = url
  end

  def each
    yield "opening url\n"
    result = open(@url).read
    yield "searching for foo\n"
    if result.include? "foo"
      yield "found it\n"
    else
      yield "not there, sorry\n"
    end
  end
end

get '/' do
  Crawler.new 'http://mysite'
end

Related

Faye ruby client publishing only once

I have a Faye server (Node.js) running on localhost, and I am trying to set up a server-side Ruby client which needs to publish to the server on a regular basis. This is the code I am trying to use.
(Please ignore the commented code to start with.)
I make a class variable @@client and initialize it as soon as the class loads. I define a class method pub whose task is to publish something to the Faye server.
In the end, I just call the pub method twice. The first publication's callback is received successfully, but the second publication triggers neither the callback nor the errback. And since control has not been given back to the app, the app just hangs there.
If I make it a global variable $client (currently commented out), the behaviour is the same. But if I create the client every time pub is called, the publishing goes smoothly. Whether I initiate it inside the EM.run loop or outside, the behaviour is the same (as expected).
I don't want to make a new connection every time I want to publish something, since that defeats the purpose. Also, if I create a new client in EM.run every time I call the method, the client connections don't close by themselves: I can see them as open files in lsof output, and I expect to soon start getting "too many open files" errors.
I don't really understand EventMachine properly; maybe I am missing something there.
require 'faye'
require 'eventmachine'

# $client = Faye::Client.new('http://localhost:5050/faye')

class Fayeclient
  puts "#{__LINE__}: Reactor running: " + EM.reactor_running?.to_s

  # if !defined? @@client or @@client.nil?
  @@client = Faye::Client.new('http://localhost:5050/faye')
  puts "Created client: " + @@client.inspect
  # end

  def self.pub
    puts "#{__LINE__}: Reactor running: " + EM.reactor_running?.to_s
    # client = Faye::Client.new('http://localhost:5050/faye') # $client
    # client = @@client
    EM.run {
      # client = Faye::Client.new('http://localhost:5050/faye') # $client
      puts "#{__LINE__}: Reactor running: " + EM.reactor_running?.to_s
      puts @@client.inspect
      publication = @@client.publish('/foo', 'text' => 'Hello world')
      puts "Publishing: #{publication.inspect}"
      # puts "Publication methods: #{publication.methods}"
      publication.callback do
        puts "Did it #{publication.inspect}"
        EM.stop_event_loop
        puts "#{__LINE__}: Reactor running: " + EM.reactor_running?.to_s
        # puts "#{client.methods}"
        # puts client.inspect
        # client.remove_all_listeners
        # puts client.inspect
      end
      publication.errback do |error|
        puts error.inspect
        EM.stop_event_loop
      end
    }
    puts "Outside event loop"
    puts "#{__LINE__}: Reactor running: " + EM.reactor_running?.to_s
  end
end

Fayeclient.pub
Fayeclient.pub
The EM.run call is blocking, so you have to run it on a separate thread and eventually join that thread when everything is done. In the example I'm using Singleton, but that choice is up to you.
This performs the two Faye calls correctly.
#!/usr/bin/env ruby
#
require 'faye'
require 'singleton'
require 'eventmachine'

class Fayeclient
  include Singleton

  attr_accessor :em_thread, :client

  def initialize
    self.em_thread = Thread.new do
      EM.run
    end
    self.client = Faye::Client.new('http://localhost:8890/faye')
  end

  def pub
    puts "#{__LINE__}: Reactor running: " + EM.reactor_running?.to_s
    puts client.inspect
    publication = client.publish('/foo', 'text' => 'Hello world')
    puts "Publishing: #{publication.inspect}"
    publication.callback do
      puts "Did it #{publication.inspect}"
      EM.stop_event_loop
      puts "#{__LINE__}: Reactor running: " + EM.reactor_running?.to_s
    end
    publication.errback do |error|
      puts error.inspect
      EM.stop_event_loop
    end
  end
end

Fayeclient.instance.pub
Fayeclient.instance.pub
Fayeclient.instance.em_thread.join
In my personal experience, anyway, dealing with EventMachine inside a Rails application can be a mess: some web servers use EM, others do not, and when you want to test from the console it may not work as expected.
My solution is to fall back to plain HTTP calls:
RestClient.post "http://localhost:#{Rails.configuration.faye_port}/faye", message: {foo: 'bar'}.to_json
I found this solution simpler and easier to customize, as long as you don't need to receive messages from this piece of code.
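For instance, a minimal sketch of wrapping that fallback in a helper (the FayePublisher name and the endpoint URL are my assumptions, not part of the answer; Faye expects a Bayeux message, i.e. a JSON object with channel and data fields):
require 'rest-client'
require 'json'

# Hypothetical helper: publishes to Faye over plain HTTP,
# keeping EventMachine out of the publishing side entirely.
module FayePublisher
  FAYE_URL = 'http://localhost:5050/faye' # assumed endpoint

  def self.publish(channel, data)
    RestClient.post FAYE_URL,
                    message: { channel: channel, data: data }.to_json
  end
end

FayePublisher.publish('/foo', 'text' => 'Hello world')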

Efficient way to render ton of JSON on Heroku

I built a simple API with one endpoint. It scrapes files and currently has around 30,000 records. I would ideally like to be able to fetch all of those records in JSON with one HTTP call.
Here is my Sinatra view code:
require 'sinatra'
require 'json'
require 'mongoid'

Mongoid.identity_map_enabled = false

get '/' do
  content_type :json
  Book.all
end
I've tried the following: using multi_json with the yajl engine:
require './require.rb'
require 'sinatra'
require 'multi_json'

MultiJson.engine = :yajl
Mongoid.identity_map_enabled = false

get '/' do
  content_type :json
  MultiJson.encode(Book.all)
end
The problem with this approach is that I get Error R14 (Memory quota exceeded). I get the same error when I try to use the 'oj' gem.
I would just concatenate everything into one long Redis string, but Heroku's Redis service is $30 per month for the instance size I would need (> 10mb).
My current solution is to use a background task that creates objects and stuffs them full of jsonified records, close to the Mongoid object size limit (16mb). The problems with this approach: it still takes nearly 30 seconds to render, and I have to run post-processing on the receiving app to properly extract the JSON from the objects.
Does anyone have any better idea for how I can render json for 30k records in one call without switching away from Heroku?
Sounds like you want to stream the JSON directly to the client instead of building it all up in memory. That's probably the best way to cut down memory usage. You could, for example, use yajl to encode JSON directly to a stream.
Edit: I rewrote the entire code for yajl, because its API is much more compelling and allows for much cleaner code. I also included an example for reading the response in chunks. Here's the streamed JSON array helper I wrote:
require 'yajl'

module JsonArray
  class StreamWriter
    def initialize(out)
      super()
      @out = out
      @encoder = Yajl::Encoder.new
      @first = true
    end

    def <<(object)
      @out << ',' unless @first
      @out << @encoder.encode(object)
      @out << "\n"
      @first = false
    end
  end

  def self.write_stream(app, &block)
    app.stream do |out|
      out << '['
      block.call StreamWriter.new(out)
      out << ']'
    end
  end
end
Usage:
require 'sinatra'
require 'mongoid'

Mongoid.identity_map_enabled = false

# use a server that supports streaming
set :server, :thin

get '/' do
  content_type :json
  JsonArray.write_stream(self) do |json|
    Book.all.each do |book|
      json << book.attributes
    end
  end
end
To decode on the client side you can read and parse the response in chunks, for example with em-http. Note that this solution requires the client's memory to be large enough to store the entire array of objects. Here's the corresponding streamed parser helper:
require 'yajl'

module JsonArray
  class StreamParser
    def initialize(&callback)
      @parser = Yajl::Parser.new
      @parser.on_parse_complete = callback
    end

    def <<(str)
      @parser << str
    end
  end

  def self.parse_stream(&callback)
    StreamParser.new(&callback)
  end
end
Usage:
require 'em-http'

parser = JsonArray.parse_stream do |object|
  # the block is called when we are done parsing the
  # entire array; now we can handle the data
  p object
end

EventMachine.run do
  http = EventMachine::HttpRequest.new('http://localhost:4567').get
  http.stream do |chunk|
    parser << chunk
  end
  http.callback do
    EventMachine.stop
  end
end
Alternative solution
You could actually simplify the whole thing a lot if you give up the need to generate a "proper" JSON array. What the above solution generates is JSON in this form:
[{ ... book_1 ... }
,{ ... book_2 ... }
,{ ... book_3 ... }
...
,{ ... book_n ... }
]
We could however stream each book as a separate JSON and thus reduce the format to the following:
{ ... book_1 ... }
{ ... book_2 ... }
{ ... book_3 ... }
...
{ ... book_n ... }
The code on the server would then be much simpler:
require 'sinatra'
require 'mongoid'
require 'yajl'

Mongoid.identity_map_enabled = false
set :server, :thin

get '/' do
  content_type :json
  encoder = Yajl::Encoder.new
  stream do |out|
    Book.all.each do |book|
      out << encoder.encode(book.attributes) << "\n"
    end
  end
end
As well as the client:
require 'em-http'
require 'yajl'

parser = Yajl::Parser.new
parser.on_parse_complete = Proc.new do |book|
  # this will now be called separately for every book
  p book
end

EventMachine.run do
  http = EventMachine::HttpRequest.new('http://localhost:4567').get
  http.stream do |chunk|
    parser << chunk
  end
  http.callback do
    EventMachine.stop
  end
end
The great thing is that now the client does not have to wait for the entire response, but instead parses every book separately. However, this will not work if one of your clients expects one single big JSON array.
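If EventMachine feels heavyweight on the consuming side, the same newline-delimited stream can also be read with Net::HTTP from the standard library; here is a minimal sketch under that assumption (host and port taken from the examples above):
require 'net/http'
require 'yajl'

parser = Yajl::Parser.new
parser.on_parse_complete = proc { |book| p book } # fires once per book

uri = URI('http://localhost:4567/')
Net::HTTP.start(uri.host, uri.port) do |http|
  http.request_get(uri.path) do |response|
    # read_body yields the body in chunks as they arrive
    response.read_body do |chunk|
      parser << chunk
    end
  end
end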

How to use fibers with ruby eventmachine in non-rack?

So basically my goal is to get some sort of lightweight Ruby daemon (or sidekiq/resque worker) that processes jobs and notifies other apps over HTTP. The app itself does not need to receive HTTP requests, so no Rack, to remain as lightweight as possible. Pretty much a bit of Ruby code I can run in loop {}.
So I'm trying to avoid EventMachine's reactor pattern and use the fiber approach instead. Where would I put EM.run or EM.stop in this context? Thread.new { EM.run } doesn't seem to be fiber-aware, so adding it gave no callbacks. Is there an em-synchrony alternative to this?
# slow=true injects a sleep 3, so the page 2 callback should output faster
require 'em-http-request'
require 'fiber'

def http_get(url)
  f = Fiber.current
  http = EventMachine::HttpRequest.new(url).get
  # resume the fiber once the http call is done
  http.callback { f.resume(http) }
  http.errback { f.resume(http) }
  return Fiber.yield
end

puts "fetching some data from database for request params"

EventMachine.run do
  Fiber.new {
    page = http_get('http://localhost:3000/status?slow=true')
    puts "notified external page it responded with: #{page.response_header.status}"
  }.resume

  Fiber.new {
    page = http_get('http://localhost:4000/status')
    puts "notified external page 2 it responded with: #{page.response_header.status}"
  }.resume

  puts "Finished notification task"
end

puts "Moving on to next task as fast as possible"
Avoid reinventing the wheel: use EM::Synchrony, or even better switch to Celluloid or Celluloid::IO, as EM seems to have fallen out of maintenance.
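As a rough illustration of the EM::Synchrony route (a minimal sketch assuming the em-synchrony and em-http-request gems; the URLs are the ones from the question):
require 'em-synchrony'
require 'em-synchrony/em-http'

# EM.synchrony wraps the block in a fiber inside EM.run, and the
# em-synchrony adapter for em-http makes HttpRequest#get pause that
# fiber until the response arrives -- no manual callback wiring.
EM.synchrony do
  page = EventMachine::HttpRequest.new('http://localhost:3000/status?slow=true').get
  puts "notified external page it responded with: #{page.response_header.status}"

  page = EventMachine::HttpRequest.new('http://localhost:4000/status').get
  puts "notified external page 2 it responded with: #{page.response_header.status}"

  EM.stop
end
Note that written this way the two calls run sequentially; wrap each in its own Fiber (or use EM::Synchrony::Iterator) to keep them concurrent as in the question's version.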

Eventmachine calls callback twice

I tried to run the EventMachine httpserver example, but I added a simple puts to the process_http_request method. To my surprise, when I access localhost:8080 from a browser, I see the puts output in the terminal twice.
Why is it printed twice? Is it a bug? Maybe I misunderstand something in EventMachine.
You can see my example below.
require 'eventmachine'
require 'evma_httpserver'

class MyHttpServer < EM::Connection
  include EM::HttpServer

  def post_init
    super
    no_environment_strings
  end

  def process_http_request
    response = EM::DelegatedHttpResponse.new(self)
    response.status = 200
    response.content_type 'text/html'
    response.content = '<center><h1>Hi there</h1></center>'
    puts 'my_test_string'
    response.send_response
  end
end

EM.run do
  EM.start_server '0.0.0.0', 8080, MyHttpServer
end
The first one is a request for the favicon; the second one is a request for the page body. If you want to call it a bug, it is your bug, not the library's.
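You can verify this yourself: evma_httpserver parses the request into instance variables such as @http_request_uri and @http_request_method before calling process_http_request, so you can log which resource each hit was for (a sketch of the modified method):
def process_http_request
  response = EM::DelegatedHttpResponse.new(self)
  response.status = 200
  response.content_type 'text/html'
  response.content = '<center><h1>Hi there</h1></center>'
  # expect one line for "/" and a second one for "/favicon.ico"
  puts "my_test_string (#{@http_request_method} #{@http_request_uri})"
  response.send_response
end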

How can I use EventMachine from within a Sinatra app?

I use an API that is written on top of EM. This means that to make a call, I need to write something like the following:
EventMachine.run do
  api.query do |result|
    # Do stuff with result
  end
  EventMachine.stop
end
Works fine.
But now I want to use this same API within a Sinatra controller. I tried this:
get "/foo" do
output = ""
EventMachine.run do
api.query do |result|
output = "Result: #{result}"
end
EventMachine.stop
end
output
end
But this doesn't work: the run block is bypassed, so an empty response is returned, and once stop is called, Sinatra shuts down.
Not sure if it's relevant, but my Sinatra app runs on Thin.
What am I doing wrong?
I've found a workaround: busy-waiting until the data becomes available. Possibly not the best solution, but at least it works:
helpers do
  def wait_for(&block)
    while (return_val = block.call).nil?
      sleep(0.1)
    end
    return_val
  end
end

get "/foo" do
  output = nil
  EventMachine.run do
    api.query do |result|
      output = "Result: #{result}"
    end
  end
  wait_for { output }
end
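An alternative that avoids polling, assuming Sinatra 1.3+ on an EM-based server such as Thin: combine the streaming API from the first answer above with the EM callback, and close the stream once the result arrives (api is the same object as in the question):
get "/foo" do
  stream(:keep_open) do |out|
    api.query do |result|
      out << "Result: #{result}"
      out.close # end the response once the query has answered
    end
  end
end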
