Efficient way to render a ton of JSON on Heroku - ruby

I built a simple API with one endpoint. It scrapes files and currently has around 30,000 records. I would ideally like to be able to fetch all those records in JSON with one http call.
Here is my Sinatra view code:
require 'sinatra'
require 'json'
require 'mongoid'

Mongoid.identity_map_enabled = false

get '/' do
  content_type :json
  Book.all
end
I've tried the following:
using multi_json with
require './require.rb'
require 'sinatra'
require 'multi_json'

MultiJson.engine = :yajl
Mongoid.identity_map_enabled = false

get '/' do
  content_type :json
  MultiJson.encode(Book.all)
end
The problem with this approach is that I get Error R14 (Memory quota exceeded). I get the same error when I try to use the 'oj' gem.
I would just concatenate everything into one long Redis string, but Heroku's Redis service is $30 per month for the instance size I would need (> 10 MB).
My current solution is to use a background task that creates objects and stuffs them full of jsonified records, keeping each just under MongoDB's 16 MB document size limit. The problems with this approach: it still takes nearly 30 seconds to render, and I have to run post-processing on the receiving app to properly extract the JSON from the objects.
Does anyone have a better idea for how I can render JSON for 30k records in one call without switching away from Heroku?

Sounds like you want to stream the JSON directly to the client instead of building it all up in memory. It's probably the best way to cut down memory usage. You could for example use yajl to encode JSON directly to a stream.
Edit: I rewrote the entire code for yajl, because its API is much more compelling and allows for much cleaner code. I also included an example for reading the response in chunks. Here's the streamed JSON array helper I wrote:
require 'yajl'

module JsonArray
  class StreamWriter
    def initialize(out)
      @out = out
      @encoder = Yajl::Encoder.new
      @first = true
    end

    def <<(object)
      @out << ',' unless @first
      @out << @encoder.encode(object)
      @out << "\n"
      @first = false
    end
  end

  def self.write_stream(app, &block)
    app.stream do |out|
      out << '['
      block.call StreamWriter.new(out)
      out << ']'
    end
  end
end
Usage:
require 'sinatra'
require 'mongoid'

Mongoid.identity_map_enabled = false

# use a server that supports streaming
set :server, :thin

get '/' do
  content_type :json
  JsonArray.write_stream(self) do |json|
    Book.all.each do |book|
      json << book.attributes
    end
  end
end
To decode on the client side you can read and parse the response in chunks, for example with em-http. Note that this solution requires the client's memory to be large enough to store the entire objects array. Here's the corresponding streamed parser helper:
require 'yajl'

module JsonArray
  class StreamParser
    def initialize(&callback)
      @parser = Yajl::Parser.new
      @parser.on_parse_complete = callback
    end

    def <<(str)
      @parser << str
    end
  end

  def self.parse_stream(&callback)
    StreamParser.new(&callback)
  end
end
Usage:
require 'em-http'

parser = JsonArray.parse_stream do |object|
  # block is called when we are done parsing the
  # entire array; now we can handle the data
  p object
end

EventMachine.run do
  http = EventMachine::HttpRequest.new('http://localhost:4567').get
  http.stream do |chunk|
    parser << chunk
  end
  http.callback do
    EventMachine.stop
  end
end
Alternative solution
You can actually simplify the whole thing a lot if you give up generating a "proper" JSON array. What the above solution generates is JSON in this form:
[{ ... book_1 ... }
,{ ... book_2 ... }
,{ ... book_3 ... }
...
,{ ... book_n ... }
]
We could, however, stream each book as a separate JSON document and reduce the format to the following:
{ ... book_1 ... }
{ ... book_2 ... }
{ ... book_3 ... }
...
{ ... book_n ... }
The code on the server would then be much simpler:
require 'sinatra'
require 'mongoid'
require 'yajl'

Mongoid.identity_map_enabled = false
set :server, :thin

get '/' do
  content_type :json
  encoder = Yajl::Encoder.new
  stream do |out|
    Book.all.each do |book|
      out << encoder.encode(book.attributes) << "\n"
    end
  end
end
As well as the client:
require 'em-http'
require 'yajl'

parser = Yajl::Parser.new
parser.on_parse_complete = Proc.new do |book|
  # this will now be called separately for every book
  p book
end

EventMachine.run do
  http = EventMachine::HttpRequest.new('http://localhost:4567').get
  http.stream do |chunk|
    parser << chunk
  end
  http.callback do
    EventMachine.stop
  end
end
The great thing is that now the client does not have to wait for the entire response, but instead parses every book separately. However, this will not work if one of your clients expects one single big JSON array.
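If a consumer isn't running EventMachine, the line-delimited format is just as easy to read with plain Net::HTTP from the standard library. A minimal sketch, assuming the server above on localhost:4567 and one JSON document per line:
require 'net/http'
require 'json'

uri = URI('http://localhost:4567/')

Net::HTTP.start(uri.host, uri.port) do |http|
  buffer = ""
  http.request_get(uri.path) do |response|
    response.read_body do |chunk|
      buffer << chunk
      # every complete line is one self-contained book document
      while line = buffer.slice!(/.+\n/)
        p JSON.parse(line)
      end
    end
  end
end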

Related

How to test HTTParty API call with Ruby and RSpec

I am using the HTTParty gem to make a call to the GitHub API to access a list of a user's repos.
It is a very simple application using Sinatra that displays a user's favourite programming language, based on the most common language that appears in their repos.
I am a bit stuck on how I can write an RSpec expectation that mocks out the actual API call and instead just checks that JSON data is being returned.
I have a mock .json file but am not sure how to use it in my test.
Any ideas?
github_api.rb
require 'httparty'

class GithubApi
  attr_reader :username, :data, :languages

  def initialize(username)
    @username = username
    @response = HTTParty.get("https://api.github.com/users/#{@username}/repos")
    @data = JSON.parse(@response.body)
  end
end
github_api_spec.rb
require './app/models/github_api'
require 'spec_helper'

describe GithubApi do
  let(:github_api) { GithubApi.new('mock_user') }

  it "receives a json response" do
  end
end
Rest of the files for clarity:
results.rb
require 'httparty'
require_relative 'github_api'

class Results
  def initialize(github_api = Github.new(username))
    @github_api = github_api
    @languages = []
  end

  def get_languages
    @github_api.data.each do |repo|
      @languages << repo["language"]
    end
  end

  def favourite_language
    get_languages
    @languages.group_by(&:itself).values.max_by(&:size).first
  end
end
application_controller.rb
require './config/environment'
require 'sinatra/base'
require './app/models/github_api'

class ApplicationController < Sinatra::Base
  configure do
    enable :sessions
    set :session_secret, "#3x!ilt£"
    set :views, 'app/views'
  end

  get "/" do
    erb :index
  end

  post "/user" do
    @github = GithubApi.new(params[:username])
    @results = Results.new(@github)
    @language = @results.favourite_language
    session[:language] = @language
    session[:username] = params[:username]
    redirect '/results'
  end

  get "/results" do
    @language = session[:language]
    @username = session[:username]
    erb :results
  end

  run! if app_file == $0
end
There are multiple ways you could approach this problem.
You could, as @anil suggested, use a library like webmock to mock the underlying HTTP call. You could also do something similar with VCR (https://github.com/vcr/vcr), which records the result of an actual call to the HTTP endpoint and plays back that response on subsequent requests.
But, given your question, I don't see why you couldn't just use an RSpec double. I'll show you how below. First, though, it would be a bit easier to test the code if it were not all in the constructor.
github_api.rb
require 'httparty'

class GithubApi
  attr_reader :username

  def initialize(username)
    @username = username
  end

  def favorite_language
    # method to calculate which language is used most by username
  end

  def languages
    # method to grab languages from repos
  end

  def repos
    @repos ||= begin
      response = HTTParty.get("https://api.github.com/users/#{username}/repos")
      JSON.parse(response.body)
    end
  end
end
Note that you do not need to reference the @username instance variable in the url because you have an attr_reader.
github_api_spec.rb
require './app/models/github_api'
require 'spec_helper'

describe GithubApi do
  subject(:api) { described_class.new(username) }

  let(:username) { 'username' }

  describe '#repos' do
    let(:github_url) { "https://api.github.com/users/#{username}/repos" }
    let(:github_response) { instance_double(HTTParty::Response, body: github_response_body) }
    let(:github_response_body) { 'response_body' }

    before do
      allow(HTTParty).to receive(:get).and_return(github_response)
      allow(JSON).to receive(:parse)
      api.repos
    end

    it 'fetches the repos from Github api' do
      expect(HTTParty).to have_received(:get).with(github_url)
    end

    it 'parses the Github response' do
      expect(JSON).to have_received(:parse).with(github_response_body)
    end
  end
end
Note that there is no need to actually load or parse any real JSON. What we're testing here is that we made the correct HTTP call and that we called JSON.parse on the response. Once you start testing the languages method you'd need to actually load and parse your test file, like this:
let(:parsed_response) { JSON.parse(File.read('path/to/test/file.json')) }
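For example, a spec for the languages method could feed that parsed fixture in through repos. The fixture path here is hypothetical, and the assertion assumes languages returns an array of language names:
describe '#languages' do
  let(:parsed_response) { JSON.parse(File.read('spec/fixtures/mock.json')) }

  before do
    # bypass the HTTP layer entirely; it is already covered by the #repos specs
    allow(api).to receive(:repos).and_return(parsed_response)
  end

  it 'returns one language name per repo' do
    expect(api.languages).to all(be_a(String))
  end
end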
You can mock those API calls using webmock (https://github.com/bblimke/webmock) and send back mock.json. This post, https://robots.thoughtbot.com/how-to-stub-external-services-in-tests, walks you through the setup of webmock with RSpec (the tests in the post mock a GitHub API call too).
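A minimal webmock stub for the original constructor-based GithubApi might look like the following sketch; the fixture path is an assumption, so adjust it to wherever your mock .json file lives:
require 'webmock/rspec'
require './app/models/github_api'

describe GithubApi do
  it 'parses the repos returned by the Github API' do
    stub_request(:get, "https://api.github.com/users/mock_user/repos")
      .to_return(
        status: 200,
        body: File.read('spec/fixtures/mock.json'),
        headers: { 'Content-Type' => 'application/json' }
      )

    api = GithubApi.new('mock_user')
    expect(api.data).to be_an(Array)
  end
end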

case sensitive headers in get request using httparty in rails

I'm currently getting an error when I make a GET request using httparty. The call works when I use curl. The error is as follows:
\"Authdate\":\"1531403501\"}" }, { "error_code":
"external_auth_error", "error_message": "Date header is missing or
timestamp out of bounds" } ] }
When I make the request via curl this is the header I use.
curl -X GET -H "AuthDate: 1531403501"
However, as you can see, the header changes from AuthDate to Authdate, causing the error. Here is how I'm making the call:
require 'openssl'
require 'base64'

module SeamlessGov
  class Form
    include HTTParty

    attr_accessor :form_id

    base_uri "https://nycopp.seamlessdocs.com/api"

    def initialize(id)
      @api_key = ENV['SEAMLESS_GOV_API_KEY']
      @form_id = id
      @timestamp = Time.now.to_i
      @signature = generate_signature # must run after @timestamp is set
    end

    def relative_uri
      "/form/#{@form_id}/elements"
    end

    def create_form
      self.class.get(relative_uri, headers: generate_headers)
    end

    private

    def generate_signature
      OpenSSL::HMAC.hexdigest('sha256', ENV['SEAMLESS_GOV_SECRET'], "GET+#{relative_uri}+#{@timestamp}")
    end

    def generate_headers
      {
        "Authorization" => "HMAC-SHA256 api_key='#{@api_key}' signature='#{@signature}'",
        "AuthDate" => @timestamp.to_s
      }
    end
  end
end
Any workaround for this?
Headers are case-insensitive per the spec https://stackoverflow.com/a/41169947/1518336, so it seems like the server you're accessing is in the wrong.
Looking at Net::HTTPHeader, on which HTTParty is implemented
Unlike raw hash access, HTTPHeader provides access via case-insensitive keys
It looks like the class downcases the header keys for uniformity.
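You can verify the downcasing in irb without sending anything over the wire:
require 'net/http'

req = Net::HTTP::Get.new('/')
req['AuthDate'] = '1531403501'
req.each_header { |key, value| puts "#{key}: #{value}" }
# prints "authdate: 1531403501"; when the request is actually written,
# Net::HTTP re-capitalizes each dash-separated token, yielding "Authdate"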
You'll likely need to look at a different networking library that doesn't rely on net/http. Perhaps curb?
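For instance, curb hands the header hash to libcurl verbatim, so the casing survives. A rough sketch; the URL and form id are placeholders based on the question:
require 'curb'

timestamp = Time.now.to_i

http = Curl::Easy.new("https://nycopp.seamlessdocs.com/api/form/FORM_ID/elements")
http.headers['Authorization'] = "HMAC-SHA256 api_key='...' signature='...'"
http.headers['AuthDate'] = timestamp.to_s # sent exactly as written
http.perform

puts http.body_str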
There is a work around this in the following article
https://github.com/jnunemaker/httparty/issues/406#issuecomment-239542015
I created the file lib/net_http.rb
require 'net/http'

class Net::HTTP::ImmutableHeaderKey
  attr_reader :key

  def initialize(key)
    @key = key
  end

  def downcase
    self
  end

  def capitalize
    self
  end

  def split(*)
    [self]
  end

  def hash
    key.hash
  end

  def eql?(other)
    key.eql? other.key
  end

  # Net::HTTP calls to_s on the key while normalizing it. The first call
  # returns self (so the normalization is a no-op) and redefines to_s on
  # this one instance, so the second call, made when the request is
  # written to the socket, returns the raw key unchanged.
  def to_s
    def self.to_s
      key
    end
    self
  end
end
Then in the headers
def generate_headers
  {
    "Authorization" => "HMAC-SHA256 api_key='#{@api_key}' signature='#{@signature}'",
    Net::HTTP::ImmutableHeaderKey.new('AuthDate') => "#{@timestamp}"
  }
end

How to stop a background thread in Sinatra once the connection is closed

I'm trying to consume the twitter streaming API with Sinatra and give users real-time updates when they search for a keyword.
require 'sinatra'
require 'eventmachine'
require 'em-http'
require 'json'

STREAMING_URL = 'https://stream.twitter.com/1/statuses/sample.json'

get '/' do
  stream(:keep_open) do |out|
    http = EM::HttpRequest.new(STREAMING_URL).get :head => { 'Authorization' => [ 'USERNAME', 'PASS' ] }
    buffer = ""
    http.stream do |chunk|
      puts "still chugging"
      buffer += chunk
      while line = buffer.slice!(/.+\r?\n/)
        tweet = JSON.parse(line)
        unless tweet.length == 0 or tweet['user'].nil?
          out << "<p><b>#{tweet['user']['screen_name']}</b>: #{tweet['text']}</p>"
        end
      end
    end
  end
end
I want the processing of the em-http-request stream to stop if the user closes the connection. Does anyone know how to do this?
Eric's answer was close, but what it does is close the response body (not the client connection, btw) once your twitter stream closes, which normally never happens. This should work:
require 'sinatra/streaming' # gem install sinatra-contrib

# ...

get '/' do
  stream(:keep_open) do |out|
    # ...
    out.callback { http.conn.close_connection }
    out.errback  { http.conn.close_connection }
  end
end
I'm not quite familiar with the Sinatra stream API yet, but did you try this?
http.callback { out.close }

stream multiple body using async sinatra

I would like to start a long-poll request from JavaScript, which is fine, and I expect my Ruby program to stream multiple body sections back to it. Why doesn't the following (pseudo)code work?
require 'rubygems'
require 'sinatra/async'
require 'eventmachine'
require 'thin'
require 'json'

class Test < Sinatra::Base
  register Sinatra::Async

  aget '/process' do
    for c in 1..10
      body { { :data => [ "this is part #{c}" ] }.to_json }
    end
  end

  run!
end
Maybe I misunderstood what long polling and async are supposed to do, but my expectation is that I get multiple bodies sent back to the client. Do I need to use EventMachine or something?
Thanks
require 'rubygems'
require 'sinatra/async'
require 'thin'
require 'json'

class Test < Sinatra::Base
  register Sinatra::Async

  class JSONStream
    include EventMachine::Deferrable

    def stream(object)
      @block.call object.to_json + "\n"
    end

    def each(&block)
      @block = block
    end
  end

  aget '/process' do
    puts 'ok'
    out = JSONStream.new
    body out

    EM.next_tick do
      c = 0
      timer = EM.add_periodic_timer(0.3) do
        c += 1
        out.stream :data => ["this is part #{c}"]
        if c == 100
          timer.cancel
          out.succeed
        end
      end
    end
  end

  run!
end
See also: http://confreaks.net/videos/564-scotlandruby2011-real-time-rack
It appears in the example below that you need an EventMachine event to trigger the sending of the multiple bodies. Also see this previous answer.
require 'sinatra/async'

class AsyncTest < Sinatra::Base
  register Sinatra::Async

  aget '/' do
    body "hello async"
  end

  aget '/delay/:n' do |n|
    EM.add_timer(n.to_i) { body { "delayed for #{n} seconds" } }
  end
end

Is there a way to flush html to the wire in Sinatra

I have a Sinatra app with a long-running process (a web scraper). I'd like the app to flush the crawler's progress as it runs instead of at the end.
I've considered forking the request and doing something fancy with Ajax, but this is a basic one-pager app that just needs to output a log to the browser as it happens. Any suggestions?
Update (2012-03-21)
As of Sinatra 1.3.0, you can use the new streaming API:
get '/' do
  stream do |out|
    out << "foo\n"
    sleep 10
    out << "bar\n"
  end
end
Old Answer
Unfortunately you don't have a stream you can simply flush to (that would not work with Rack middleware). The result returned from a route block simply has to respond to each. The Rack handler will then call each with a block and flush each part of the body to the client from within that block.
All Rack responses always have to respond to each and always hand strings to the given block. Sinatra takes care of this for you if you just return a string.
A simple streaming example would be:
require 'sinatra'

get '/' do
  result = ["this", " takes", " some", " time"]

  class << result
    def each
      super do |str|
        yield str
        sleep 0.3
      end
    end
  end

  result
end
Now you could simply place all your crawling in the each method:
require 'sinatra'
require 'open-uri' # provides open() for URLs

class Crawler
  def initialize(url)
    @url = url
  end

  def each
    yield "opening url\n"
    result = open(@url).read
    yield "searching for foo\n"
    if result.include? "foo"
      yield "found it\n"
    else
      yield "not there, sorry\n"
    end
  end
end

get '/' do
  Crawler.new 'http://mysite'
end
