Parsing Liquid in a Jekyll generator before converting to JSON

Best to start by saying that I am very new to Ruby and Liquid. I have searched around looking for some resource on this issue, but as yet haven't been able to find anything of real use.
I have a Jekyll site, which utilises the HTML5 History API. I have a Jekyll generator plugin which creates a single JSON file which holds all the post and page content, ready for use with HTML5 PushState and PopState. This part is functioning properly and is tested.
My problem comes when I have a post/page on the site which has Liquid tags in it. I am guessing I need to parse these Liquid tags to get the template output before I create my JSON object for each post/page. Here is what I have for pages as an example:
# Iterate over all pages
site.pages.each do |page|
  # Encode the page HTML content to JSON
  link = page.url
  @content = Liquid::Template.parse(page.content)
  hash[link] = {
    "body_class" => page.data['body_class'],
    "content"    => converter.convert(@content.render),
    "title"      => '<h1>' + page.data["content_title"] + '</h1>'
  }
end
At the minute, this basically removes all Liquid tags from the generated JSON file, leaving nothing in their place.
Here is my full generator file on GitHub, which is based very heavily on nice work by Jezen Thomas.
The output JSON file is also in that repo with the site, or can be accessed quickly here. The blog.html content is the last item in the JSON file and shows the empty h1 and div tags which should have content.
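My guess is that the tags come out empty because the template is rendered without any context. A minimal sketch of what I think is needed, passing Jekyll's site_payload and the registers convention its own renderer uses (untested):
site.pages.each do |page|
  link = page.url
  template = Liquid::Template.parse(page.content)
  # Untested assumption: give the template the site payload and
  # registers so tags like {% post_url %} have the context they need;
  # rendering with no arguments leaves them empty.
  rendered = template.render(site.site_payload,
                             :registers => { :site => site, :page => page })
  hash[link] = { "content" => converter.convert(rendered) }
end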

Related

Parse data from an ICS file and expose it to a liquid variable in the template

I have searched tirelessly for this on Google, but unfortunately every search collides with the fact that Jekyll is a site generator, and the results do not help.
I'm looking for a small example of how to read an ICS file from a plugin/generator so that it is then accessible with Liquid from the templates.
I've tried creating collections and appending to them in plugins, and I've tried creating site.data arrays. Nothing seems to work. Is there a small example of a Jekyll plugin that reads a file or URL and creates data that is then stored in a site variable and can be accessed via Liquid? Specifically, I'll be reading an ICS feed and creating a calendar.
Data files might not be able to fulfil your requirement. You need a Jekyll Plugin.
Here's an ad-hoc solution I would implement to read ics files and expose calendar events to Liquid variables in Jekyll:
Add icalendar to your Gemfile and install it:
bundle add icalendar
bundle install
Put this file in _plugins/calendar_reader.rb
require "jekyll"
require "icalendar"
module Jekyll
module ICSReader
def read_calendar(input)
begin
calendar_file = File.open(input)
events = Icalendar::Event.parse(calendar_file)
hash = {}
counter = 0
# loop through the events in the calendars
# and map the values you want into a variable and then return it:
events.each do |event|
hash[counter] = {
"summary" => event.summary,
"dtstart" => event.dtstart,
"dtend" => event.dtend,
"description" => event.description
}
counter += 1
end
return hash
rescue
# Handle errors
Jekyll.logger.error "Calendar Reader:", "An error occurred!"
return {}
end
end
end
end
Liquid::Template.register_filter(Jekyll::ICSReader)
The icalendar README will help you understand how data is read from the file. Basically, we parse the events in the file, map them into a hash, and return it.
Now take an ics file and put it into the _data folder.
_data/my_calendar.ics
BEGIN:VEVENT
DTSTAMP:20050118T211523Z
UID:bsuidfortestabc123
DTSTART;TZID=US-Mountain:20050120T170000
DTEND;TZID=US-Mountain:20050120T184500
CLASS:PRIVATE
GEO:37.386013;-122.0829322
ORGANIZER:mailto:joebob@random.net
PRIORITY:2
SUMMARY:This is a really long summary to test the method of unfolding lines
 \, so I'm just going to make it a whole bunch of lines.
ATTACH:http://bush.sucks.org/impeach/him.rhtml
ATTACH:http://corporations-dominate.existence.net/why.rhtml
RDATE;TZID=US-Mountain:20050121T170000,20050122T170000
X-TEST-COMPONENT;QTEST="Hello, World":Shouldn't double double quotes
END:VEVENT
BEGIN:VEVENT
DTSTAMP:20110118T211523Z
UID:uid-1234-uid-4321
DTSTART;TZID=US-Mountain:20110120T170000
DTEND;TZID=US-Mountain:20110120T184500
CLASS:PRIVATE
GEO:37.386013;-122.0829322
ORGANIZER:mailto:jmera@jmera.human
PRIORITY:2
SUMMARY:This is a very short summary.
RDATE;TZID=US-Mountain:20110121T170000,20110122T170000
END:VEVENT
This sample ics file is taken from the icalendar repository.
Now you can use the plugin's filter from your Markdown/HTML:
{% assign events = "_data/my_calendar.ics" | read_calendar %}
Here read_calendar is the filter defined in _plugins/calendar_reader.rb and _data/my_calendar.ics is the file you want to read the data from. The filter receives _data/my_calendar.ics as its input, reads it, and returns a hash, which is stored in the events variable.
You can now use {{ events }} to access the hash of the data that you return from the function in the plugin file.
// {{ events }}
{0=>{"summary"=>"This is a really long summary to test the method of unfolding lines, so I'm just going to make it a whole bunch of lines.", "dtstart"=>#<DateTime: 2005-01-20T17:00:00+00:00 ((2453391j,61200s,0n),+0s,2299161j)>, "dtend"=>#<DateTime: 2005-01-20T18:45:00+00:00 ((2453391j,67500s,0n),+0s,2299161j)>, "description"=>nil}, 1=>{"summary"=>"This is a very short summary.", "dtstart"=>#<DateTime: 2011-01-20T17:00:00+00:00 ((2455582j,61200s,0n),+0s,2299161j)>, "dtend"=>#<DateTime: 2011-01-20T18:45:00+00:00 ((2455582j,67500s,0n),+0s,2299161j)>, "description"=>nil}}
// {{ events[0] }}
{"summary"=>"This is a really long summary to test the method of unfolding lines, so I'm just going to make it a whole bunch of lines.", "dtstart"=>#<DateTime: 2005-01-20T17:00:00+00:00 ((2453391j,61200s,0n),+0s,2299161j)>, "dtend"=>#<DateTime: 2005-01-20T18:45:00+00:00 ((2453391j,67500s,0n),+0s,2299161j)>, "description"=>nil}
// {{ events[1] }}
{"summary"=>"This is a very short summary.", "dtstart"=>#<DateTime: 2011-01-20T17:00:00+00:00 ((2455582j,61200s,0n),+0s,2299161j)>, "dtend"=>#<DateTime: 2011-01-20T18:45:00+00:00 ((2455582j,67500s,0n),+0s,2299161j)>, "description"=>nil}
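If you only need a single field, you can index straight into the hash from Liquid as well (assuming the same events variable as above):
// {{ events[0]["summary"] }}
This is a really long summary to test the method of unfolding lines, so I'm just going to make it a whole bunch of lines.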
Those are the bare bones of how a Jekyll filter works. You can dive deeper into the other types of Jekyll plugins as explained in the docs.

How to check that a PDF file has some link with Ruby/Rspec?

I am using prawnpdf/pdf-inspector to test that the content of a PDF generated in my Rails app is correct.
I want to check that the PDF file contains a link with a certain URL. I looked at yob/pdf-reader but haven't found any useful information related to this topic.
Is it possible to test URLs within a PDF with Ruby/RSpec?
I want something like the following:
expect(urls_in_pdf(pdf)).to include 'https://example.com/users/1'
The yob/pdf-reader library (https://github.com/yob/pdf-reader) provides a text method on each page.
Do something like
pdf = PDF::Reader.new("tmp/pdf.pdf")
assert pdf.pages[0].text.include? 'https://example.com/users/1'
assuming what you are looking for is on the first page. Note that text returns only the page's visible text, so this will find the URL only when it is written out on the page, not when it is hidden behind a link annotation.
Since pdf-inspector seems only to return text, you could try using pdf-reader directly (pdf-inspector uses it anyway).
require 'pdf-reader'

reader = PDF::Reader.new("somefile.pdf")
reader.pages.each do |page|
  puts page.raw_content # This should also give you the link
end
Anyway, I only took a quick look at the GitHub page, and I am not sure exactly what raw_content returns. But there is also a low-level method to access the objects of the PDF directly:
reader = PDF::Reader.new("somefile.pdf")
puts reader.objects.inspect
With that it is surely possible to get the URL.
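Building on that low-level access, here is a rough, untested sketch of the urls_in_pdf helper from the question: it walks each page's /Annots array and collects the URI actions of link annotations (the traversal details are my assumption, not a documented pdf-reader recipe):
require "pdf-reader"

# Hypothetical helper: collect the target URIs of all link
# annotations in the PDF.
def urls_in_pdf(path)
  reader = PDF::Reader.new(path)
  reader.pages.flat_map do |page|
    annots = reader.objects.deref(page.attributes[:Annots]) || []
    annots.map do |ref|
      annot = reader.objects.deref(ref)
      action = reader.objects.deref(annot[:A]) if annot[:A]
      action[:URI] if action
    end.compact
  end
end

# In an RSpec example:
#   expect(urls_in_pdf("tmp/pdf.pdf")).to include 'https://example.com/users/1'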

How to scrape "data:image" URIs; Encountering Errno::ENAMETOOLONG?

I have been trying to write a script that scrapes a page for images, the way it is outlined in
"Save all image files from a website".
I tested that method with another page and it worked fine, but when I insert my link to a page that serves data:image URIs, which look like:
data:image/jpg;base64,/9j/4FEJFOIEJNFOEJOIAD//gAQTGFGRGREGg2LjEwMAD/2wBDAAgEBAQEREGREWGRWEGUFBQYGBgYGBgYGB...
I get an error beginning with initialize': File name too long and ending in (Errno::ENAMETOOLONG).
Has anyone found a way to deal with situations like this?
data:image URIs actually contain the image inline as Base64. All you need to do is grab that data and decode it:
require 'base64'

# Write the decoded payload under a short name derived from the media
# type; using the huge data URI itself as a file name is exactly what
# raises Errno::ENAMETOOLONG.
ext = url[%r{data:image/(\w+)}, 1] || 'bin'
File.open("image.#{ext}", 'wb') { |f| f.write(Base64.decode64(url[/base64,(.*)/, 1])) }
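And if you are walking a whole page, here is a rough, untested sketch of pulling every inline image with Nokogiri (the page URL and file-naming scheme are made up for illustration):
require 'base64'
require 'nokogiri'
require 'open-uri'

# Hypothetical page URL for illustration.
doc = Nokogiri::HTML(URI.open("http://example.com/gallery"))
doc.css('img').each_with_index do |img, i|
  src = img['src'].to_s
  next unless src.start_with?('data:image')

  # Decode each inline image and write it under a short generated name.
  ext = src[%r{data:image/(\w+)}, 1] || 'bin'
  File.binwrite("image-#{i}.#{ext}", Base64.decode64(src[/base64,(.*)/, 1]))
end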

Avoid repeated calls to an API in Jekyll Ruby plugin

I have written a Jekyll plugin to display the number of pageviews on a page by calling the Google Analytics API using the garb gem. The only trouble with my approach is that it makes a call to the API for each page, slowing down build time and also potentially hitting the user call limits on the API.
It would be possible to return all the data in a single call and store it locally, and then look up the pageview count from each page, but my Jekyll/Ruby-fu isn't up to scratch. I do not know how to write the plugin to run once to get all the data and store it locally where my current function could then access it, rather than calling the API page by page.
Basically, my code is written as a Liquid block that can be put into my page layout:
class GoogleAnalytics < Liquid::Block
  def initialize(tag_name, markup, tokens)
    super # options that appear in block (between tag and endtag)
    @options = markup # optional options passed in by opening tag
  end

  def render(context)
    path = super

    # Read in credentials and authenticate
    cred = YAML.load_file("/home/cboettig/.garb_auth.yaml")
    Garb::Session.api_key = cred[:api_key]
    token = Garb::Session.login(cred[:username], cred[:password])
    profile = Garb::Management::Profile.all.detect { |p| p.web_property_id == cred[:ua] }

    # Place query; customize to modify results
    data = Exits.results(profile,
                         :filters => { :page_path.eql => path },
                         :start_date => Chronic.parse("2011-01-01"))
    data.first.pageviews
  end
end
The full version of my plugin is here.
How can I move all the calls to the API into some other function, make sure Jekyll runs it once at the start, and then adjust the tag above to read that local data?
EDIT: It looks like this can be done with a Generator that writes the data to a file; see the example on this branch. Now I just need to figure out how to subset the results: https://github.com/Sija/garb/issues/22
To store the data, I had to:
1. Write a Generator class (see the Jekyll wiki on plugins) to call the API.
2. Convert the data to a hash, for easy lookup by path (see step 5):
result = Hash[data.collect { |row| [row.page_path, [row.exits, row.pageviews]] }]
3. Write the data hash to a JSON file.
4. Read the data back in from the file in my existing Liquid block class. (Note that the block tag works from the _includes dir, while the generator works from the root directory.)
5. Match the page path, which is easy once the data is converted to a hash:
result[path][1]
Code for the full plugin, showing how to create the generator, write the files, etc., is here.
And thanks to Sija on GitHub for help on this.
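For orientation, here is a stripped-down, untested sketch of that generator-plus-file pattern (the fetch_all_pageviews helper and the _data/analytics.json path are placeholders, not taken from the actual plugin):
require "json"

module Analytics
  class Generator < Jekyll::Generator
    def generate(site)
      # One API call for the whole site instead of one per page.
      data = fetch_all_pageviews # placeholder for the Garb query
      result = Hash[data.collect { |row| [row.page_path, [row.exits, row.pageviews]] }]
      File.write("_data/analytics.json", JSON.generate(result))
    end
  end
end

# The Liquid block's render method can then read the file instead of
# hitting the API:
#   result = JSON.parse(File.read("_data/analytics.json"))
#   result[path][1]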

Most efficient way to parse and reformat data with Nokogiri & Sinatra

I'm working on reformatting HTML output from a search query for an inventory manager used by a number of car dealers. There's no direct DB access and no information available from the service creators, so I decided to attempt to parse and reformat the data with Nokogiri and generate new pages of results based on the search query.
On first load of the page, I'm just using a default search to generate the first results.
For the search to work, I'm sending the query to a URL like this:
post '/search/?:search_query' do
  url = "http://domain.com/v/?DealerId=" + settings.dealer_id + "&maxrows=10&#{params[:search_query]}"
  doc = Nokogiri::HTML(open(url))

  doc.css("td:nth-child(5) .ForeColor4").each do |msrp|
    session["msrp"] = msrp.inner_html
  end

  doc.css("td:nth-child(4) .ForeColor4").each do |price|
    session["price"] = price.inner_html
  end

  erb :index
end
I know there's got to be a smarter way to do this.
Edit:
An example URL to request data:
http://domain.com/?DealerId=1234&object=list&lang=en&MAKE=&MODEL=&maxrows=50&MinYear=&MaxYear=2011&Type=N&MinPrice=&MaxPrice=&STYLE=&ExtColor=&MaxMiles=&StockNo=
A description of the HTML generated:
Unfortunately, it's old code that's almost entirely table-based, has inline styles, and lacks classes or IDs in most areas.
An example of a CSS selector:
td:nth-child(5) .ForeColor4
An XPath selector:
//td[(((count(preceding-sibling::*) + 1) = 5) and parent::*)]//*[contains(concat( " ", @class, " " ), concat( " ", "ForeColor4", " " ))]
I've also looked at Mechanize and Hpricot as possibilities, but I'm not sure of the best tool for the job, as I haven't attempted screen-scraping before.
Summary: I want to pull the data from the HTML, temporarily store it in a variable / session / cookie (data changes several times per day), and then be able to reformat the output into my own HTML/CSS styling.
Personally, I'd decouple the scraping from the user action. Have an independent process scrape and fill your database. This will improve performance drastically, as fetching the page, creating a DOM, parsing it, and then rendering output on every request is going to be slow.
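As an untested sketch of that split, a standalone script run on a schedule could scrape into a local file that the Sinatra action then reads (everything here, apart from the selectors carried over from the question, is made up for illustration):
require 'json'
require 'nokogiri'
require 'open-uri'

# Standalone scraper, run on a schedule (e.g. cron), not per request.
url = "http://domain.com/v/?DealerId=1234&maxrows=10"
doc = Nokogiri::HTML(URI.open(url))

records = {
  "msrp"  => doc.at_css("td:nth-child(5) .ForeColor4")&.inner_html,
  "price" => doc.at_css("td:nth-child(4) .ForeColor4")&.inner_html
}
File.write("inventory.json", JSON.generate(records))

# The Sinatra action then just reads the cached file:
#   get '/search' do
#     @inventory = JSON.parse(File.read("inventory.json"))
#     erb :index
#   end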
doc.css("td:nth-child(5) .ForeColor4").each do |msrp|
session["msrp"] = msrp.inner_html
end
doc.css("td:nth-child(4) .ForeColor4").each do |price|
session["price"] = price.inner_html
end
You might want to use Nokogiri's at_css() method instead of the regular css(). at_css() finds the first occurrence of your target and only returns that one node, similar to doing a .first against the nodeset that .css() returns.
That would simplify your lookups to this form:
session["msrp"] = doc.at_css("td:nth-child(5) .ForeColor4").inner_html
While testing, I'd probably add something like rescue 'msrp lookup failed' to the end of the lookups, just in case you've got bad accessors. Or you could let the code fail when inner_html() gets mad trying to read from nil. The rescue is just a slightly friendlier way to debug.
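For example, with the same selector as above:
session["msrp"] = doc.at_css("td:nth-child(5) .ForeColor4").inner_html rescue 'msrp lookup failed'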
Otherwise your lookups seem to be decent.
