Process Jekyll content to replace first occurrence of any post title with a hyperlink of the post with that title - ruby

What I'm trying to do
I am building a Jekyll ruby plugin that will replace the first occurrence of any word in the post copy text content with a hyperlink linking to the URL of a post by the same name.
The problems I'm having
I've gotten this to work but I can't figure out two problems in the process_words method:
How to only search for a post title in the main content copy text of the post, and not the meta tags before the post or the table of contents (which is also generated before main post copy text)? I can't get this to work with Nokogiri, even though that seems to be the tool of choice here.
If a post's URL is not at post.data['url'], where is it?
Also, is there a more efficient, cleaner way to do this?
The current code works but will replace the first occurrence even if it's the value of an HTML attribute, like an anchor or a meta tag.
Example result
We have a blog with 3 posts:
Hobbies
Food
Bicycles
And in the "Hobbies" post body text, we have a sentence with each word appearing in it for the first time in the post, like so:
I love mountain biking and bicycles in general.
The plugin would process that sentence and output it as:
I love mountain biking and bicycles in general.
My current code (UPDATED 1)
# _plugins/hyperlink_first_word_occurance.rb
require "jekyll"
require 'uri'
module Jekyll
# Replace the first occurrence of each post title in the content with the post's title hyperlink
module HyperlinkFirstWordOccurance
POST_CONTENT_CLASS = "page__content"
BODY_START_TAG = "<body"
ASIDE_START_TAG = "<aside"
OPENING_BODY_TAG_REGEX = %r!<body(.*)>\s*!
CLOSING_ASIDE_TAG_REGEX = %r!</aside(.*)>\s*!
class << self
# Public: Processes the content and updates the
# first occurrence of each word that also has a post
# of the same title, into a hyperlink.
#
# content - the document or page to be processed.
def process(content)
@title = content.data['title']
@posts = content.site.posts
content.output = if content.output.include? BODY_START_TAG
process_html(content)
else
process_words(content.output)
end
end
# Public: Determines if the content should be processed.
#
# doc - the document being processed.
def processable?(doc)
(doc.is_a?(Jekyll::Page) || doc.write?) &&
doc.output_ext == ".html" || (doc.permalink&.end_with?("/"))
end
private
# Private: Processes html content which has a body opening tag.
#
# content - html to be processed.
def process_html(content)
content.output = if content.output.include? ASIDE_START_TAG
head, opener, tail = content.output.partition(CLOSING_ASIDE_TAG_REGEX)
else
head, opener, tail = content.output.partition(POST_CONTENT_CLASS)
end
body_content, *rest = tail.partition("</body>")
processed_markup = process_words(body_content)
content.output = String.new(head) << opener << processed_markup << rest.join
end
# Private: Processes each word of the content and makes
# the first occurrence of each word that also has a post
# of the same title, into a hyperlink.
#
# html - the html which includes all the content.
def process_words(html)
page_content = html
@posts.docs.each do |post|
post_title = post.data['title'] || post.name
post_title_lowercase = post_title.downcase
if post_title != @title
if page_content.include?(" " + post_title_lowercase + " ") ||
page_content.include?(post_title_lowercase + " ") ||
page_content.include?(post_title_lowercase + ",") ||
page_content.include?(post_title_lowercase + ".")
page_content = page_content.sub(post_title_lowercase, "#{ post_title.downcase }")
elsif page_content.include?(" " + post_title + " ") ||
page_content.include?(post_title + " ") ||
page_content.include?(post_title + ",") ||
page_content.include?(post_title + ".")
page_content = page_content.sub(post_title, "#{ post_title }")
end
end
end
page_content
end
end
end
end
Jekyll::Hooks.register %i[posts pages], :post_render do |doc|
# code to call after Jekyll renders a post
Jekyll::HyperlinkFirstWordOccurance.process(doc) if Jekyll::HyperlinkFirstWordOccurance.processable?(doc)
end
Update 1
Updated my code with @Keith Mifsud's advice. Now using either the sidebar's aside element or the page__content class to select body content to work on.
Also improved checking and replacing the correct term.
PS: The code base example I started with when working on my plugin was @Keith Mifsud's jekyll-target-blank plugin

This code looks very familiar :) I suggest you look into the RSpec test file to test against your issues: https://github.com/keithmifsud/jekyll-target-blank
I'll try to answer your questions; sorry, I couldn't test these myself at the time of writing.
How to only search for a post title in the main content copy text of the post, and not the meta tags before the post or the table of contents (which is also generated before main post copy text)? I can't get this to work with Nokogiri, even though that seems to be the tool of choice here.
Your requirements here are:
1) Ignore content outside the <body></body> tags.
This seems to already be implemented in the process_html() method. This method states that only body_content is processed, and it should work as it is. Do you have tests for it? How are you debugging it? The same string splitting works in my plugin, i.e. only content inside the body is processed.
2) Ignore content inside the Table of Contents (TOC).
I suggest you extend the process_html() method by further splitting the body_content variable. Search for the content between the opening and closing tags of your TOC (by id, CSS class, etc.) and exclude it, then add it back in its position before or after the string returned by process_words.
3) Whether to use the Nokogiri library?
Nokogiri is great for parsing HTML. I think you are parsing strings and then creating HTML, so vanilla Ruby and the URI library should suffice. You can still use Nokogiri if you want, but it won't be any faster than splitting strings in Ruby.
If a post's URL is not at post.data['url'], where is it?
I think you should have a method that gets all post titles and then match the "words" against that array. You can get the posts collection from the doc itself with doc.site.posts and return the title for each post. The process_words() method can then check each word to see if it matches an item from the array. But what if the title is made of more than one word?
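A minimal sketch of the idea above in plain Ruby, with no Jekyll or Nokogiri dependency. `link_first_occurrences` and the `posts` hash (title => URL) are hypothetical names used for illustration only. The sketch splits the markup on tags so that only the text between tags is searched, which leaves attribute values alone, and it matches whole titles rather than single words, so multi-word titles work too. It does not skip text inside existing links or a table of contents; that would still need the extra splitting described earlier.

```ruby
require "set"

# Hypothetical helper: links the first occurrence of each post title found
# in the text parts of +html+, leaving tags and their attributes untouched.
#
# html  - String of rendered markup.
# posts - Hash of post title => post URL (assumed to be built elsewhere).
def link_first_occurrences(html, posts)
  return html if posts.empty?

  urls = posts.to_h { |title, url| [title.downcase, url] }
  # Longest titles first so a longer title wins over a shorter one inside it.
  alternatives = urls.keys.sort_by { |t| -t.length }.map { |t| Regexp.escape(t) }
  pattern = /\b(?:#{alternatives.join("|")})\b/i
  linked = Set.new

  # Splitting on a capture group keeps the tags in the resulting array.
  html.split(/(<[^>]+>)/).map do |segment|
    next segment if segment.start_with?("<") # tags pass through unchanged

    segment.gsub(pattern) do |match|
      key = match.downcase
      # Set#add? returns nil when the key was already present.
      linked.add?(key) ? %(<a href="#{urls[key]}">#{match}</a>) : match
    end
  end.join
end
```

In a plugin, the `posts` hash could presumably be built from the site's posts, for example by pairing each post's `data['title']` with its `url`; treat that wiring as an assumption to verify against your Jekyll version.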
Also, is there a more efficient, cleaner way to do this?
So far so good. I'll start with getting the issues fixed and then refactor for speed and coding standards.
Again I suggest you use testing to help you with this.
Let me know if I can help more :)

Related

Concept for recipe-based parsing of webpages needed

I'm working on a web-scraping solution that grabs totally different webpages and lets the user define rules/scripts in order to extract information from the page.
I started scraping from a single domain and built a parser based on Nokogiri.
Basically everything works fine.
I could now add a ruby class each time somebody wants to add a webpage with a different layout/style.
Instead I thought about using an approach where the user specifies elements where content is stored using xpath and storing this as a sort of recipe for this webpage.
Example: The user wants to scrape a table-structure extracting the rows using a hash (column-name => cell-content)
I was thinking about writing a ruby function for extraction of this generic table information once:
require 'nokogiri'

# extracts a table's rows as an array of hashes (column_name => cell content)
# html - the html-file as a string
# xpath_table - specifies the html table as xpath which hold the data to be extracted
def basic_table(html, xpath_table)
  html_doc = Nokogiri::HTML(html)

  # Column names come from the header cells.
  row_headers = html_doc.xpath("#{xpath_table}/thead/tr/th").map(&:inner_text)

  # Double quotes are required here so that xpath_table is interpolated.
  html_doc.xpath("#{xpath_table}/tbody/tr").map do |table_row|
    cells = table_row.xpath('td').map(&:inner_text)
    # Pair each header with the cell in the same column.
    row_headers.zip(cells).to_h
  end
end
The user could now specify a website-recipe-file like this:
<basic_table xpath='//div[@id="grid"]/table[@id="displayGrid"]'
The function basic_table is referenced here, so that by parsing the website-recipe-file I would know that I can use the function basic_table to extract the content from the table referenced by the xPath.
This way the user can specify simple recipe-scripts and only has to dive into writing actual code if he needs a new way of extracting information.
The code would not change every time a new webpage needs to be parsed.
Whenever the structure of a webpage changes only the recipe-script would need to be changed.
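One way to picture the recipe idea is a small registry that maps a recipe name to an extractor, so that supporting a new page means adding a recipe line rather than a class. The sketch below is only an illustration under assumptions: `EXTRACTORS`, `parse_recipe`, and `run_recipe` are hypothetical names, the recipe syntax is reduced to a single line of the form shown above, and the registered extractor is a stand-in lambda rather than the real Nokogiri-backed basic_table.

```ruby
# Hypothetical registry: recipe name => callable taking (html, xpath).
EXTRACTORS = {
  # Stand-in for the real basic_table; it only echoes what it was given.
  "basic_table" => lambda do |html, xpath|
    { extractor: "basic_table", xpath: xpath, bytes: html.bytesize }
  end
}.freeze

# Parses one recipe line such as: <basic_table xpath='//table[@id="x"]'
# Returns [name, xpath], or nil when the line is not a recipe.
def parse_recipe(line)
  match = line.match(/\A\s*<(\w+)\s+xpath='([^']*)'/)
  match && [match[1], match[2]]
end

# Looks up the extractor named by the recipe and applies it to the page.
def run_recipe(line, html)
  name, xpath = parse_recipe(line)
  raise ArgumentError, "not a recipe line: #{line.inspect}" unless name

  extractor = EXTRACTORS.fetch(name) { raise ArgumentError, "unknown recipe: #{name}" }
  extractor.call(html, xpath)
end
```

With this shape, a changed page layout only touches the recipe line, and a new kind of extraction only adds one entry to the registry.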
I was thinking that someone might be able to tell me how he would approach this. Rules/rule engines pop into my mind, but I'm not sure if that really is the solution to my problem.
Somehow I have the feeling that I don't want to "invent" my own solution to handle this problem.
Does anybody have a suggestion?
J.

Ruby RSS/Atom creation - including content

I am creating an Atom feed using Ruby's stdlib rss library. This library is essentially undocumented, but I have it working using the example provided on this page:
require 'rss'
rss = RSS::Maker.make("atom") do |m|
m.channel.author = "Steve Wattam"
m.channel.updated = Time.now
m.channel.about = "http://stephenwattam.com/blog/"
m.channel.title = "Steve W's Blog"
storage.posts.each do |p|
m.items.new_item do |item|
item.link = p.link
item.title = p.title
item.updated = p.edited
item.pubDate = p.date
item.summary = p.summary
end
end
end
This works fine. I am unable, however, to add a content element. There is no such thing as item.content=, and I can't seem to find any example code online. A browse of the source indicates that content is stored in the item (docs here), but I lack the knowledge to tease it out.
Does anyone know how I might go about adding a content element?
Incidentally, I'm aware other libraries exist to do this, but would ideally like to get this working without requiring any gems.
By digging through the source of the library, I've discovered that item.content yields an object of type RSS::Maker::Atom::Feed::Items::Item::Content. It's possible to set the content on that object:
item.content.content = 'text to set as content'
This object also responds to #xml_content.
Hope this helps someone!

YQL Yahoo Finance Scraper on XML in Ruby

I am using a YQL query (the standard example query, with GOOG, YHOO, MSFT and AAPL) to generate XML for all of the available fields. I wanted to scrape the YQL site for the XML output once it is generated using a Ruby script, so that I could run it over and over again for different stocks and store the data somewhere. I haven't finished my script yet, but what I have seems to just not run. Here is the code:
yahoo_finance_scrape.rb
require 'rubygems'
require 'nokogiri'
require 'restclient'
PAGE_URL = "http://developer.yahoo.com/yql/console/"
yql_query = 'use "http://github.com/spullara/yql-tables/raw/d60732fd4fbe72e5d5bd2994ff27cf58ba4d3f84/yahoo/finance/yahoo.finance.quotes.xml"
as quotes; select * from quotes where symbol in ("YHOO","AAPL","GOOG","MSFT") '
if page = RestClient.post(PAGE_URL, {'name' => yql_query, 'submit' => 'Test'})
puts "YQL query: #{yql_query}, is valid"
xml_output = Nokogiri::HTML(page)
lines = xml_output.css('#container #layout-doc #yui-gen3000008 #yui-gen3000009 #yui_3_11_0_3_1393417778356_354
#yui-gen3000015 #yui-gen3000016 div#yui_3_11_0_2_1393417778356_10 #centerBottomView
#outputContainer div#output #outputTabContent #formattedView #viewContent #prexml')
lines.each do |line|
puts line.css('span').map{|span| span.text}.join(' ')
sleep 0.03
end
end
When I run the program, it only prints
"YQL query: use "http://github.com/spullara/yql-tables/raw/d60732fd4fbe72e5d5bd2994ff27cf58ba4d3f84/yahoo/finance/yahoo.finance.quotes.xml"
as quotes; select * from quotes where symbol in ("YHOO","AAPL","GOOG","MSFT") , is valid"
And then just stops. Oh, I am using that Github url because yahoo.finance.quotes was not working, and someone else on Stackoverflow suggested to use it.
If you want to check the css tags, just go to http://developer.yahoo.com/yql/console/ and enter my query and do an inspect element on it. I would post it here, but I don't know how.
The output is just the content of your yql_query variable, so this does not help much.
You probably should not put the "use xxxx as quotes" statement as a string in your code.
Check out what "someone else" had in mind.
The RestClient.post() method returns a response object. With all HTTP operations, always check the response.code; otherwise you won't know about errors.
response = RestClient.post(...)
puts "HTTP Response code: #{response.code}"
if response.code == 200
page = response.to_str
...
end
According to the Nokogiri website, the xml_output.css() method filters like a CSS selector. For example, "#container #layout-doc" means "select elements with the id 'layout-doc' inside elements with the id 'container'", and so on. Is this really what you intend to do? If yes, the last "#prexml" should be enough and much less error-prone, as ids should normally be unique.

Disable HTML within XML escaping with Nokogiri

I'm trying to parse an XML document from the Google Directions API.
This is what I've got so far:
x = Nokogiri::XML(GoogleDirections.new("48170", "48104").xml)
x.xpath("//DirectionsResponse//route//leg//step").each do |q|
q.xpath("html_instructions").each do |h|
puts h.inner_html
end
end
The output looks like this:
Head &lt;b&gt;south&lt;/b&gt; on &lt;b&gt;Hidden Pond Dr&lt;/b&gt; toward &lt;b&gt;Ironwood Ct&lt;/b&gt;
Turn &lt;b&gt;right&lt;/b&gt; onto &lt;b&gt;N Territorial Rd&lt;/b&gt;
Turn &lt;b&gt;left&lt;/b&gt; onto &lt;b&gt;Gotfredson Rd&lt;/b&gt;
...
I would like the output to be:
Turn <b>right</b> onto <b>N Territorial Rd</b>
The problem seems to be Nokogiri escaping the HTML within the XML.
I trust Google, but I think it would be also good to sanitize it further to:
Turn right onto N Territorial Rd
But I can't (using sanitize perhaps) without the raw xml. Ideas?
Because I don't have the Google Directions API installed I can't access the XML, but I have a strong suspicion the problem is the result of telling Nokogiri you're dealing with XML. As a result it's going to return you the HTML encoded like it should be in XML.
You can unescape the HTML using something like:
CGI::unescape_html('Head &lt;b&gt;south&lt;/b&gt; on &lt;b&gt;Hidden Pond Dr&lt;/b&gt; toward &lt;b&gt;Ironwood Ct&lt;/b&gt;')
=> "Head <b>south</b> on <b>Hidden Pond Dr</b> toward <b>Ironwood Ct</b>"
unescape_html is an alias to unescapeHTML:
Unescape a string that has been HTML-escaped
CGI::unescapeHTML("Usage: foo &quot;bar&quot; &lt;baz&gt;")
# => "Usage: foo \"bar\" <baz>"
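As a small illustration of the unescaping step, the sketch below decodes one escaped instruction and then strips the tags to get the plain text the question asks for. The sample string comes from the question; `plain_instruction` is a hypothetical helper name, and the regex-based tag stripping is a simplification that assumes simple inline tags such as the bold tags in the sample.

```ruby
require 'cgi'

# Hypothetical helper: decodes HTML entities, then drops the tags.
def plain_instruction(escaped)
  html = CGI.unescapeHTML(escaped) # entity-encoded markup becomes real tags
  html.gsub(/<[^>]+>/, '')         # naive tag removal, enough for inline tags
end

escaped = 'Turn &lt;b&gt;right&lt;/b&gt; onto &lt;b&gt;N Territorial Rd&lt;/b&gt;'

puts CGI.unescapeHTML(escaped)  # prints: Turn <b>right</b> onto <b>N Territorial Rd</b>
puts plain_instruction(escaped) # prints: Turn right onto N Territorial Rd
```

For untrusted input, a real sanitizer would be a safer choice than a regex.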
I had to think about this a bit more. It's something I've run into, but it was one of those things that escaped me during the rush at work. The fix is simple: You're using the wrong method to retrieve the content. Instead of:
puts h.inner_html
Use:
puts h.text
I proved this using:
require 'httpclient'
require 'nokogiri'
# This URL comes from: https://developers.google.com/maps/documentation/directions/#XML
url = 'http://maps.googleapis.com/maps/api/directions/xml?origin=Chicago,IL&destination=Los+Angeles,CA&waypoints=Joplin,MO|Oklahoma+City,OK&sensor=false'
clnt = HTTPClient.new
doc = Nokogiri::XML(clnt.get_content(url))
doc.search('html_instructions').each do |html|
puts html.text
end
Which outputs:
Head <b>south</b> on <b>S Federal St</b> toward <b>W Van Buren St</b>
Turn <b>right</b> onto <b>W Congress Pkwy</b>
Continue onto <b>I-290 W</b>
[...]
The difference is that inner_html is reading the content of the node directly, without decoding. text decodes it for you. text, to_str and inner_text are aliased to content internally in Nokogiri::XML::Node for our parsing pleasure.
Wrap your nodes in CDATA:
def wrap_in_cdata(node)
# Using Nokogiri::XML::Node#content instead of #inner_html (which
# escapes HTML entities) so nested nodes will not work
node.inner_html = node.document.create_cdata(node.content)
node
end
Nokogiri::XML::Node#inner_html escapes HTML entities except in CDATA sections.
fragment = Nokogiri::HTML.fragment "<div>Here is an unescaped string: <span>Turn left > right > straight & reach your destination.</span></div>"
puts fragment.inner_html
# <div>Here is an unescaped string: <span>Turn left &gt; right &gt; straight &amp; reach your destination.</span></div>
fragment.xpath(".//span").each {|node| node.inner_html = node.document.create_cdata(node.content) }
fragment.inner_html
# <div>Here is an unescaped string: <span>Turn left > right > straight & reach your destination.</span>\n</div>
This is not a great or DRY solution, but it works:
puts h.inner_html.gsub("&lt;b&gt;" , "").gsub("&lt;/b&gt;", "").gsub("&lt;div style=\"font-size:0.9em\"&gt;", "").gsub("&lt;/div&gt;", "")

Get all attributes or xPath of element in Webdriver

I'm trying to write a simple monkey test for my web page that gets all active elements on the page and clicks on them in random order.
When I do this, I want to write a log so I know which element the test clicked on and on which element the test crashed.
So I want the log file to look like this:
01.01.11 11.01.01 Clicked on Element <span id='myspan' class ='myclass .....>
01.01.11 11.01.01 Clicked on Element <span id='button' class ='myclass title = 'Button'.....>
or
01.01.11 11.01.01 Clicked on Element //*[@id='myspan']
01.01.11 11.01.01 Clicked on Element //*[@id='button']
Is there any way to do this in WebDriver + Ruby?
I don't think there is a way but you could always do something like this (with watir-webdriver):
browser.divs.each do |div|
puts '<span ' + ['id','class','title'].map{|x| "#{x}='#{div.attribute_value(x)}'"}.join(' ') + '>'
end
WebDriver does not provide this type of functionality, you would have to get the page source and do some of your own parsing - I've done this in Html Agility Pack with C#, you would need to find a similar library for ruby (see: Options for HTML scraping?)
You can do this:
Get all elements, which are clickable
For example, find all links, find all clickable spans. Put those candidates in a list
Randomly pick an element from that candidate list
Click that element and write a log entry
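The steps above can be sketched as follows. This is an illustration under assumptions, not a WebDriver API: `FakeElement` is a stand-in so the code runs without a browser, and `describe` and `monkey_click` are hypothetical helper names. The helpers only rely on `tag_name`, `attribute`, and `click`, which real Selenium elements also respond to. Each entry is logged before the click, so the last log line names the element that was being clicked if the test crashes.

```ruby
# Stand-in for a WebDriver element so the sketch runs without a browser.
# It mimics only the three calls used below: tag_name, attribute, click.
FakeElement = Struct.new(:tag_name, :attributes) do
  def attribute(name)
    attributes[name]
  end

  def click
    true
  end
end

# Builds a log-friendly description such as <span id='myspan' class='myclass'>.
def describe(element, names = %w[id class title])
  pairs = names.map { |name| [name, element.attribute(name)] }
               .reject { |_, value| value.nil? || value.empty? }
  "<#{element.tag_name}" + pairs.map { |name, value| " #{name}='#{value}'" }.join + ">"
end

# Clicks every candidate once, in random order, logging before each click.
def monkey_click(candidates, log: $stdout, random: Random.new)
  candidates.shuffle(random: random).each do |element|
    stamp = Time.now.strftime('%d.%m.%y %H.%M.%S')
    log.puts "#{stamp} Clicked on Element #{describe(element)}"
    element.click
  end
end
```

With a real driver, the candidate list would come from the driver's element lookups instead of `FakeElement` instances.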
I tweaked @pguardiario's answer to come up with this method:
def get_element_dom_info(e)
  unless e.is_a?(Selenium::WebDriver::Element)
    raise "No valid element passed: #{e.class}"
  end

  attrs = ['id', 'class', 'title', 'href', 'src', 'type', 'name']
  # Emit only the attributes that are present and non-empty.
  "<" + e.tag_name + attrs.map { |x| " #{x}='#{e.attribute(x)}'" if e.attribute(x) && e.attribute(x) != "" }.join + ">"
end
Of course, it expects that the single parameter you pass into it is an actual Selenium element. Also, it doesn't include every possible attribute, but that's the majority of them (and you can always add your extra attributes if needed).
I suppose you can integrate this via some code like this:
def clickElement(*args)
... # parse vars
@e = @driver.find_element(...)
puts get_timestamp + " Clicked on Element: " + get_element_dom_info(@e)
end
UPDATE
I recently realized that I could get the full html of the element using native javascript (d'oh!). You have to use a hack to get the "outerHTML". Here is my new method:
def get_element_dom_info(how, what)
  e = @driver.find_element(how, what)
  # Use native JavaScript to return the element's DOM info: clone the node into
  # a wrapper div and read the innerHTML of that wrapper.
  @driver.execute_script("var f = document.createElement('div').appendChild(arguments[0].cloneNode(true)); return f.parentNode.innerHTML", e)
end
