Most efficient way to parse and reformat data with Nokogiri & Sinatra - ruby

I'm working on reformatting HTML output from a search query for an inventory manager for a number of car dealers. There's no direct DB access and no information available from the service creators, so I decided to attempt to parse and reformat the data with Nokogiri and generate new pages of results based on the search query.
On first load of the page, I'm just using a default search to generate the first results.
For the search to work, I'm sending the query to a URL like this:
post '/search/?:search_query' do
  url = "" + settings.dealer_id + "&maxrows=10&#{params[:search_query]}"
  doc = Nokogiri::HTML(open(url))
  doc.css("td:nth-child(5) .ForeColor4").each do |msrp|
    session["msrp"] = msrp.inner_html
  end
  doc.css("td:nth-child(4) .ForeColor4").each do |price|
    session["price"] = price.inner_html
  end
  erb :index
end
I know there's got to be a smarter way to do this.
An example URL to request data:
A description of the HTML generated:
Unfortunately, it's old code that's almost entirely table-based, has inline-styles and lacks classes or ids in most areas.
An example of a CSS selector:
td:nth-child(5) .ForeColor4
An XPath selector:
//td[(((count(preceding-sibling::*) + 1) = 5) and parent::*)]//*[contains(concat( " ", @class, " " ), concat( " ", "ForeColor4", " " ))]
I've also looked at mechanize or hpricot as possibilities but I'm not aware of the best tools for the job as I haven't attempted screen-scraping before.
Summary: I want to pull the data from the HTML, temporarily store it in a variable / session / cookie (data changes several times per day), and then be able to reformat the output into my own HTML/CSS styling.

Personally, I'd decouple the scraping from the user action. Have an independent process scrape and fill your database. This will improve performance drastically, as the fetching, creating a DOM, parsing, then rendering output on every action is going to be slow.

doc.css("td:nth-child(5) .ForeColor4").each do |msrp|
  session["msrp"] = msrp.inner_html
end
doc.css("td:nth-child(4) .ForeColor4").each do |price|
  session["price"] = price.inner_html
end
You might want to use Nokogiri's at_css() method instead of the regular css(). at_css() finds the first occurrence of your target and only returns that one node, similar to doing a .first against the nodeset that .css() returns.
That would simplify your lookups to this form:
session["msrp"] = doc.at_css("td:nth-child(5) .ForeColor4").inner_html
I'd probably add something like rescue 'msrp lookup failed' while testing to the end of the lookups just in case you've got bad accessors. Or you could let the code fail when inner_html() got mad trying to read from a nil. It's just a bit friendlier way to debug.
Otherwise your lookups seem to be decent.


How to check that a PDF file has some link with Ruby/Rspec?

I am using prawnpdf/pdf-inspector to test that content of a PDF generated in my Rails app is correct.
I want to check that the PDF file contains a link with a certain URL. I looked at yob/pdf-reader but haven't found any useful information on this topic.
Is it possible to test URLs within PDF with Ruby/RSpec?
I would want the following:
expect(urls_in_pdf(pdf)).to include ''
pdf-reader provides a method called text for each page.
Do something like
pdf = PDF::Reader.new("tmp/pdf.pdf")
assert pdf.pages[0].text.include? ''
assuming what you are looking for is on the first page
Since pdf-inspector seems only to return text, you could try to use the pdf-reader directly (pdf-inspector uses it anyways).
reader = PDF::Reader.new("somefile.pdf")
reader.pages.each do |page|
  puts page.raw_content # This should also give you the link
end
Anyway I only did a quick look at the github page. I am not sure what raw_content exactly returns. But there is also a low-level method to directly access the objects of the pdf:
reader = PDF::Reader.new("somefile.pdf")
puts reader.objects.inspect
With that it surely is possible to get the url.

Querying Twilio calls list resource doesn't paginate the results using Ruby or PHP

According to Twilio's documentation here regarding "paging":
The list returned to you includes paging information. If you plan on requesting more records than will fit on a single page, you may want to use the provided nextpageuri rather than incrementing through the pages by page number.
It then gives an example:
# Initialize Twilio Client
@client = Twilio::REST::Client.new(account_sid, auth_token)
@client.calls.list.each do |call|
  puts call.direction
end
However, doing this just returns an array of all calls; there isn't any paging information, any limiting of results, or any "pages".
For my purposes I'm actually filtering the query like this:
@calls = @client.calls.list(
  start_time_after: @time,
  start_time_before: @another_time
)
Because my date filter range is a 1 month period and there are currently about 4.5k calls to retrieve, it's taking quite a while to process (and sometimes it just never finishes).
I'm using the twilio helper library ruby gem "twilio-ruby" and running ruby 2.5
I've also tried using PHP with the respective twilio helper library and have found the same result.
Using curl, however, does work and gives paging information. It's also incredibly fast compared to using the helper libraries.
Twilio developer evangelist here.
list will paginate through, loading all the resources it can.
There are other calls that will stream the API in a lazier fashion, if that is more useful for your use case. For example, you can use each and it will load the calls lazily until they have run out.
@calls = @client.calls.each(
  start_time_after: @time,
  start_time_before: @another_time
) do |call|
  puts call.direction
end
If you do want to paginate manually, you can use the page method to get a CallPage object and iterate from there.
page = @client.calls.page(
  start_time_after: @time,
  start_time_before: @another_time
)
while !page.nil? do
  page.each { |call| puts call.direction }
  page = page.next_page
end
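The same page-by-page loop can be wrapped in an Enumerator so the caller pulls records lazily and stops early without walking every page. This is only a sketch: it assumes nothing beyond a page object that responds to each and next_page, and FakePage and each_record are hypothetical names for illustration, not part of twilio-ruby.

```ruby
# Hypothetical stand-in for a page object: responds to #each and #next_page.
FakePage = Struct.new(:items, :next_page) do
  def each(&block)
    items.each(&block)
  end
end

# Hypothetical helper: yields records page by page until next_page is nil.
def each_record(first_page)
  Enumerator.new do |yielder|
    page = first_page
    while page
      page.each { |record| yielder << record }
      page = page.next_page
    end
  end
end

last_page  = FakePage.new([5, 6], nil)
first_page = FakePage.new([1, 2, 3], FakePage.new([4], last_page))

first_four  = each_record(first_page).first(4) # stops before the last page
all_records = each_record(first_page).to_a
```

With a real client, the first page would come from the page call and the rest of the code would stay the same.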
Let me know if that helps at all.

Extracting value from complex hash in Ruby

I am using an API (zillow) which returns a complex hash. A sample result is
"xmlns:SearchResults"=>"", "request"=>[{"address"=>["305 Vinton St"], "citystatezip"=>["Melrose, MA 02176"]}],
"message"=>[{"text"=>["Request successfully processed"], "code"=>["0"]}],
"response"=>[{"results"=>[{"result"=>[{"zpid"=>["56291382"], "links"=>[{"homedetails"=>[""],
"graphsanddata"=>[""], "mapthishome"=>[""],
"comparables"=>[""]}], "address"=>[{"street"=>["305 Vinton St"], "zipcode"=>["02176"], "city"=>["Melrose"], "state"=>["MA"], "latitude"=>["42.466805"],
"longitude"=>["-71.072515"]}], "zestimate"=>[{"amount"=>[{"currency"=>"USD", "content"=>"562170"}], "last-updated"=>["06/01/2014"], "oneWeekChange"=>[{"deprecated"=>"true"}], "valueChange"=>[{"duration"=>"30", "currency"=>"USD", "content"=>"42749"}], "valuationRange"=>[{"low"=>[{"currency"=>"USD",
"content"=>"534062"}], "high"=>[{"currency"=>"USD", "content"=>"590278"}]}], "percentile"=>["0"]}], "localRealEstate"=>[{"region"=>[{"id"=>"23017", "type"=>"city",
"name"=>"Melrose", "links"=>[{"overview"=>[""], "forSaleByOwner"=>[""],
I can extract a specific value using the following:
result = result.to_hash
p result["response"][0]["results"][0]["result"][0]["zestimate"][0]["amount"][0]["content"]
It seems odd to have to specify the index of each element in this fashion. Is there a simpler way to obtain a named value?
It looks like this should be parsed into XML. According to the Zillow API Docs, it returns XML by default. Apparently, "to_hash" was able to turn this into a hash (albeit, a very ugly one), but you are really trying to swim upstream by using it this way. I would recommend using it as intended (xml) at the start, and then maybe parsing it into an easier to use format (like a JSON/Hash structure) later.
Nokogiri is GREAT at parsing XML! You can use the xpath syntax for grabbing elements from the dom, or even css selectors.
For example, to get an array of the "content" in every result:
response = #get xml response from zillow
results = Nokogiri::XML(response).remove_namespaces!
#using css
content_array = results.css("result content")
#same thing using xpath:
content_array = results.xpath("//result//content")
If you just want the content from the first result, you can do this as a shortcut:
content = results.at_css("result content").content
Since it is indeed XML dumped into a JSON, you could use JSONPath to query the JSON
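If you stay with the hash, Hash#dig and Array#dig (Ruby 2.3 and later) at least shorten the chain of brackets and return nil when a level is missing. A minimal sketch, using a trimmed copy of the hash from the question; the "rentzestimate" key is only there to show the nil case and is an assumption about what might be absent.

```ruby
# Trimmed copy of the hash from the question (only the zestimate branch).
result = {
  "response" => [{
    "results" => [{
      "result" => [{
        "zpid" => ["56291382"],
        "zestimate" => [{
          "amount" => [{ "currency" => "USD", "content" => "562170" }],
          "last-updated" => ["06/01/2014"]
        }]
      }]
    }]
  }]
}

# dig walks mixed hash/array nesting and returns nil as soon as a step
# is missing, instead of raising NoMethodError on nil.
amount = result.dig("response", 0, "results", 0, "result", 0,
                    "zestimate", 0, "amount", 0, "content")
missing = result.dig("response", 0, "results", 0, "result", 0,
                     "rentzestimate", 0, "amount", 0, "content")
```

The indexes are still there because the XML-to-hash conversion wraps every element in an array, but the lookup becomes one call and a missing branch no longer raises.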

Ruby -- Using facebook's Graph API Explorer in conjunction with the koala gem

I've found Facebook's 'Graph API Explorer' tool to be an incredibly easy, welcoming (for beginners) and effective way to use Facebook's Graph API via its GUI.
I'd like to be able to use the koala gem to pass these generated URLs to facebook's api.
Right now, let's say I had a query like this
url = "me?fields=id,name,posts.fields(likes.fields(id,name),comments.fields(parent,likes.fields(id,name)),message)"
I'd like to be able to pass that directly into koala as a single string.
It doesn't like that so I separate out the uid and the ? operator like the gem seems to want
url = "fields=id,name,posts.fields(likes.fields(id,name),comments.fields(parent,likes.fields(id,name)),message)"
@graph.get_connections("me", url)
This however, returns an error as well:
type: OAuthException, code: 2500,
message: Unknown path components: /fields=id,name,posts.fields(likes.fields(id,name),comments.fields(parent,likes.fields(id,name)),message) [HTTP 400]
Currently this is where I am stuck. I'd like to continue using koala because I like the gem-approach to working with API's, especially when it comes to using OAuth & OAuth2.
I'm starting to break down the request into pieces which the koala gem can handle, for example
posts = @graph.get_connections("me", "posts")
postids = posts.map { |p| p['id'] }
likes = postids.inject([]) {|ary, id| ary << @graph.get_connection(id, "likes") }
So that's a long way of getting two arrays, one of posts, one of like data.
But I'd quickly burn through my API request limit using this kind of approach.
I was kind of hoping I'd just be able to pass the whole string from the Graph API Explorer and just get what I wanted rather than having to manually parse all this stuff.
I don't really know about your posts.fields(likes.fields(id,name) -this does not work in the Graph API Explorer- and stuff like that but I know you can do this:
fb_api =
# => => {"id"=>"71170", "name"=>"My Name", "posts"=>{"paging"=>{"next"=>"", "previous"=>""}, "data"=>[{"id"=>"71170_1013572471", "comments"=>{"count"=>0}, "created_time"=>"2013-06-09T08:03:43+0000", "from"=>{"id"=>"71170", "name"=>"My Name"}, "updated_time"=>"2013-06-09T08:03:43+0000", "privacy"=>{"value"=>""}, "type"=>"status", "story_tags"=>{"0"=>[{"id"=>"71170", "name"=>" ", "length"=>8, "type"=>"user", "offset"=>0}]}, "story"=>" likes a photo."}]}}
And you will receive in a hash what you asked for.
From time to time, you must pass nil as a param to koala:
result += graph_api.batch do |batch_api|
  facebook_page_ids.each do |facebook_page_id|
    batch_api.get_connections(facebook_page_id, nil, {"fields"=>"posts"})
  end
end
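The "Unknown path components" error in the question suggests the fields string was sent as part of the path rather than as a parameter. One way around that, sketched here on the assumption that the Explorer string always looks like id?key=value, is to split it with the standard library and pass the pieces to Koala separately. split_explorer_query is a hypothetical helper, not part of Koala, and the final call is left as a comment because it needs a real access token.

```ruby
require "uri"

# Hypothetical helper: splits "me?fields=..." into ["me", {"fields" => "..."}].
def split_explorer_query(query)
  path, query_string = query.split("?", 2)
  args = URI.decode_www_form(query_string.to_s).to_h
  [path, args]
end

explorer_query = "me?fields=id,name,posts.fields(likes.fields(id,name)," \
                 "comments.fields(parent,likes.fields(id,name)),message)"
path, args = split_explorer_query(explorer_query)

# With a Koala client in @graph (assumed to be set up elsewhere),
# the request would then be something like:
#   @graph.get_object(path, args)
```

Whether Facebook accepts the nested fields expression itself is a separate question; this only keeps it out of the URL path.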

Avoid repeated calls to an API in Jekyll Ruby plugin

I have written a Jekyll plugin to display the number of pageviews on a page by calling the Google Analytics API using the garb gem. The only trouble with my approach is that it makes a call to the API for each page, slowing down build time and also potentially hitting the user call limits on the API.
It would be possible to return all the data in a single call and store it locally, and then look up the pageview count from each page, but my Jekyll/Ruby-fu isn't up to scratch. I do not know how to write the plugin to run once to get all the data and store it locally where my current function could then access it, rather than calling the API page by page.
Basically my code is written as a liquid block that can be put into my page layout:
class GoogleAnalytics < Liquid::Block
  def initialize(tag_name, markup, tokens)
    super # options that appear in block (between tag and endtag)
    @options = markup # optional options passed in by opening tag
  end

  def render(context)
    path = super
    # Read in credentials and authenticate
    cred = YAML.load_file("/home/cboettig/.garb_auth.yaml")
    Garb::Session.api_key = cred[:api_key]
    token = Garb::Session.login(cred[:username], cred[:password])
    profile = Garb::Management::Profile.all.detect {|p| p.web_property_id == cred[:ua]}
    # place query, customize to modify results
    data = Exits.results(profile,
      :filters => {:page_path.eql => path},
      :start_date => Chronic.parse("2011-01-01"))
  end
end
Full version of my plugin is here
How can I move all the calls to the API to some other function and make sure jekyll runs that once at the start, and then adjust the tag above to read that local data?
EDIT: It looks like this can be done with a Generator that writes the data to a file. See the example on this branch. Now I just need to figure out how to subset the results.
To store the data, I had to:
1. Write a Generator class (see Jekyll wiki plugins) to call the API.
2. Convert data to a hash (for easy lookup by path, see 5):
   result = Hash[data.collect{|row| [row.page_path, [row.exits, row.pageviews]]}]
3. Write the data hash to a JSON file.
4. Read in the data from the file in my existing Liquid block class. Note that the block tag works from the _includes dir, while the generator works from the root directory.
5. Match the page path, which is easy once the data is converted to a hash.
Code for the full plugin, showing how to create the generator and write files, etc, here
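The store-then-look-up part of those steps can be sketched with the standard library alone. This is an illustration rather than the plugin's actual code: the Row struct, the sample rows, and the temporary file location are all assumptions standing in for the Garb results and the real cache path.

```ruby
require "json"
require "tmpdir"

# Hypothetical rows standing in for the Garb query results.
Row = Struct.new(:page_path, :exits, :pageviews)
rows = [
  Row.new("/index.html", "12", "340"),
  Row.new("/about.html", "3", "57")
]

# Convert to a hash keyed by path, as in the step above.
lookup = Hash[rows.collect { |row| [row.page_path, [row.exits, row.pageviews]] }]

# Generator side: write the hash to a JSON file once per build.
cache_file = File.join(Dir.mktmpdir, "analytics.json")
File.write(cache_file, JSON.generate(lookup))

# Liquid block side: read the file and match on the page path.
cached = JSON.parse(File.read(cache_file))
exits, pageviews = cached.fetch("/about.html", [nil, nil])
unknown = cached.fetch("/nope.html", [nil, nil])
```

Using fetch with a default keeps pages that have no analytics row from raising during the build.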
And thanks to Sija on GitHub for help on this.
