Disable HTML within XML escaping with Nokogiri - ruby

I'm trying to parse an XML document from the Google Directions API.
This is what I've got so far:
x = Nokogiri::XML(GoogleDirections.new("48170", "48104").xml)
x.xpath("//DirectionsResponse//route//leg//step").each do |q|
  q.xpath("html_instructions").each do |h|
    puts h.inner_html
  end
end
The output looks like this:
Head &lt;b&gt;south&lt;/b&gt; on &lt;b&gt;Hidden Pond Dr&lt;/b&gt; toward &lt;b&gt;Ironwood Ct&lt;/b&gt;
Turn &lt;b&gt;right&lt;/b&gt; onto &lt;b&gt;N Territorial Rd&lt;/b&gt;
Turn &lt;b&gt;left&lt;/b&gt; onto &lt;b&gt;Gotfredson Rd&lt;/b&gt;
...
I would like the output to be:
Turn <b>right</b> onto <b>N Territorial Rd</b>
The problem seems to be that Nokogiri escapes the HTML within the XML.
I trust Google, but I think it would also be good to sanitize it further to:
Turn right onto N Territorial Rd
But I can't (using sanitize perhaps) without the raw xml. Ideas?

Because I don't have the Google Directions API installed I can't access the XML, but I have a strong suspicion the problem is the result of telling Nokogiri you're dealing with XML. As a result it's going to return you the HTML encoded like it should be in XML.
You can unescape the HTML using something like:
CGI::unescape_html('Head &lt;b&gt;south&lt;/b&gt; on &lt;b&gt;Hidden Pond Dr&lt;/b&gt; toward &lt;b&gt;Ironwood Ct&lt;/b&gt;')
=> "Head <b>south</b> on <b>Hidden Pond Dr</b> toward <b>Ironwood Ct</b>"
unescape_html is an alias to unescapeHTML:
Unescape a string that has been HTML-escaped
CGI::unescapeHTML("Usage: foo &quot;bar&quot; &lt;baz&gt;")
# => "Usage: foo \"bar\" <baz>"
I had to think about this a bit more. It's something I've run into, but it was one of those things that escaped me during the rush at work. The fix is simple: You're using the wrong method to retrieve the content. Instead of:
puts h.inner_html
Use:
puts h.text
I proved this using:
require 'httpclient'
require 'nokogiri'
# This URL comes from: https://developers.google.com/maps/documentation/directions/#XML
url = 'http://maps.googleapis.com/maps/api/directions/xml?origin=Chicago,IL&destination=Los+Angeles,CA&waypoints=Joplin,MO|Oklahoma+City,OK&sensor=false'
clnt = HTTPClient.new
doc = Nokogiri::XML(clnt.get_content(url))
doc.search('html_instructions').each do |html|
puts html.text
end
Which outputs:
Head <b>south</b> on <b>S Federal St</b> toward <b>W Van Buren St</b>
Turn <b>right</b> onto <b>W Congress Pkwy</b>
Continue onto <b>I-290 W</b>
[...]
The difference is that inner_html is reading the content of the node directly, without decoding. text decodes it for you. text, to_str and inner_text are aliased to content internally in Nokogiri::XML::Node for our parsing pleasure.
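To cover the second half of the question (getting from the decoded HTML to plain text), here is a minimal sketch using only the standard library. The sample string is taken from the question; the regex-based tag stripping is my own assumption and is only adequate for simple markup such as <b> and <div>, not a real sanitizer:

```ruby
require 'cgi'

# Escaped text as it sits inside the XML element (sample from the question).
escaped = 'Turn &lt;b&gt;right&lt;/b&gt; onto &lt;b&gt;N Territorial Rd&lt;/b&gt;'

# Step 1: decode the entities to recover the embedded HTML.
html = CGI.unescapeHTML(escaped)

# Step 2: naive tag stripping (assumption: no attributes containing ">").
plain = html.gsub(/<[^>]+>/, '')

puts html
puts plain
```

With Nokogiri already loaded, Nokogiri::HTML.fragment(h.text).text would probably be a sturdier way to drop the tags than the regex.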

Wrap your nodes in CDATA:
def wrap_in_cdata(node)
  # Using Nokogiri::XML::Node#content instead of #inner_html (which
  # escapes HTML entities), so nested nodes will not work
  node.inner_html = node.document.create_cdata(node.content)
  node
end
Nokogiri::XML::Node#inner_html escapes HTML entities except in CDATA sections.
fragment = Nokogiri::HTML.fragment "<div>Here is an unescaped string: <span>Turn left > right > straight & reach your destination.</span></div>"
puts fragment.inner_html
# <div>Here is an unescaped string: <span>Turn left &gt; right &gt; straight &amp; reach your destination.</span></div>
fragment.xpath(".//span").each {|node| node.inner_html = node.document.create_cdata(node.content) }
fragment.inner_html
# <div>Here is an unescaped string: <span>Turn left > right > straight & reach your destination.</span>\n</div>

This is not a great or DRY solution, but it works:
puts h.inner_html.gsub("&lt;b&gt;", "").gsub("&lt;/b&gt;", "").gsub("&lt;div style=\"font-size:0.9em\"&gt;", "").gsub("&lt;/div&gt;", "")

Related

Process Jekyll content to replace first occurrence of any post title with a hyperlink of the post with that title

What I'm trying to do
I am building a Jekyll ruby plugin that will replace the first occurrence of any word in the post copy text content with a hyperlink linking to the URL of a post by the same name.
The problems I'm having
I've gotten this to work but I can't figure out two problems in the process_words method:
How to only search for a post title in the main content copy text of the post, and not the meta tags before the post or the table of contents (which is also generated before main post copy text)? I can't get this to work with Nokogiri, even though that seems to be the tool of choice here.
If a post's URL is not at post.data['url'], where is it?
Also, is there a more efficient, cleaner way to do this?
The current code works but will replace the first occurrence even if it's the value of an HTML attribute, like an anchor or a meta tag.
Example result
We have a blog with 3 posts:
Hobbies
Food
Bicycles
And in the "Hobbies" post body text, we have a sentence with each word appearing in it for the first time in the post, like so:
I love mountain biking and bicycles in general.
The plugin would process that sentence and output it as:
I love mountain biking and <a href="...">bicycles</a> in general.
My current code (UPDATED 1)
# _plugins/hyperlink_first_word_occurance.rb
require "jekyll"
require 'uri'
module Jekyll
# Replace the first occurance of each post title in the content with the post's title hyperlink
module HyperlinkFirstWordOccurance
POST_CONTENT_CLASS = "page__content"
BODY_START_TAG = "<body"
ASIDE_START_TAG = "<aside"
OPENING_BODY_TAG_REGEX = %r!<body(.*)>\s*!
CLOSING_ASIDE_TAG_REGEX = %r!</aside(.*)>\s*!
class << self
# Public: Processes the content and updates the
# first occurrence of each word that also has a post
# of the same title, into a hyperlink.
#
# content - the document or page to be processed.
def process(content)
@title = content.data['title']
@posts = content.site.posts
content.output = if content.output.include? BODY_START_TAG
process_html(content)
else
process_words(content.output)
end
end
# Public: Determines if the content should be processed.
#
# doc - the document being processes.
def processable?(doc)
(doc.is_a?(Jekyll::Page) || doc.write?) &&
doc.output_ext == ".html" || (doc.permalink&.end_with?("/"))
end
private
# Private: Processes html content which has a body opening tag.
#
# content - html to be processes.
def process_html(content)
content.output = if content.output.include? ASIDE_START_TAG
head, opener, tail = content.output.partition(CLOSING_ASIDE_TAG_REGEX)
else
head, opener, tail = content.output.partition(POST_CONTENT_CLASS)
end
body_content, *rest = tail.partition("</body>")
processed_markup = process_words(body_content)
content.output = String.new(head) << opener << processed_markup << rest.join
end
# Private: Processes each word of the content and makes
# the first occurance of each word that also has a post
# of the same title, into a hyperlink.
#
# html = the html which includes all the content.
def process_words(html)
page_content = html
@posts.docs.each do |post|
post_title = post.data['title'] || post.name
post_title_lowercase = post_title.downcase
if post_title != @title
if page_content.include?(" " + post_title_lowercase + " ") ||
page_content.include?(post_title_lowercase + " ") ||
page_content.include?(post_title_lowercase + ",") ||
page_content.include?(post_title_lowercase + ".")
page_content = page_content.sub(post_title_lowercase, "#{ post_title.downcase }")
elsif page_content.include?(" " + post_title + " ") ||
page_content.include?(post_title + " ") ||
page_content.include?(post_title + ",") ||
page_content.include?(post_title + ".")
page_content = page_content.sub(post_title, "#{ post_title }")
end
end
end
page_content
end
end
end
end
Jekyll::Hooks.register %i[posts pages], :post_render do |doc|
# code to call after Jekyll renders a post
Jekyll::HyperlinkFirstWordOccurance.process(doc) if Jekyll::HyperlinkFirstWordOccurance.processable?(doc)
end
Update 1
Updated my code with @Keith Mifsud's advice. Now using either the sidebar's aside element or the page__content class to select body content to work on.
Also improved checking and replacing the correct term.
PS: The code base example I started with when working on my plugin was @Keith Mifsud's jekyll-target-blank plugin
This code looks very familiar :) I suggest you look into the RSpec test file to test against your issues: https://github.com/keithmifsud/jekyll-target-blank
I'll try to answer your questions; sorry I couldn't test these myself at the time of writing.
How to only search for a post title in the main content copy text of the post, and not the meta tags before the post or the table of contents (which is also generated before main post copy text)? I can't get this to work with Nokogiri, even though that seems to be the tool of choice here.
Your requirements here are:
1) Ignore content outside the <body></body> tags.
This seems to already be implemented in the process_html() method. This method states that only body_content is processed, and it should work as it is. Have you got tests for it? How are you debugging it? The same string splitting works in my plugin, i.e. only content inside the body is processed.
2) Ignore content inside the Table of Contents (TOC).
I suggest you extend the process_html() method by further splitting the body_content variable. Search for the content between the opening and closing tags of your TOC (by id, CSS class, etc.) and exclude it, then add it back in its position before or after the process_words string.
3) Whether to use the Nokogiri plugin?
This plugin is great for parsing HTML. I think you are parsing strings and then creating HTML, so vanilla Ruby and the URI plugin should suffice. You can still use it if you want, but it won't be any faster than splitting strings in Ruby.
If a post's URL is not at post.data['url'], where is it?
I think you should have a method that collects all post titles and then matches the "words" against that array. You can get the posts collection from the doc itself via doc.site.posts and return the title for each post. Then the process_words() method can check each word to see if it matches an item in the array. But what if the title is made of more than one word?
Also, is there a more efficient, cleaner way to do this?
So far so good. I'll start with getting the issues fixed and then refactor for speed and coding standards.
Again I suggest you use testing to help you with this.
Let me know if I can help more :)
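To make the "split the strings, skip what you don't want" idea concrete, here is a rough sketch in plain Ruby. The helper name link_first_occurrences, the tag pattern and the title-to-URL hash are all hypothetical; in a real plugin the pairs would presumably be built from doc.site.posts (as far as I know, Jekyll documents expose their URL through the url method, i.e. post.url, rather than through post.data). It only touches text between tags, so attribute values such as title="..." are left alone:

```ruby
# Matches a single HTML tag; a deliberately simple pattern, not a full parser.
TAG_PATTERN = /(<[^>]+>)/

# Hypothetical helper: links the first whole-word occurrence of each title,
# looking only at text between tags so attribute values are never rewritten.
def link_first_occurrences(html, urls_by_title)
  urls_by_title.each do |title, url|
    word = /\b#{Regexp.escape(title)}\b/i
    # The capture group makes split keep the tags as separate segments.
    segments = html.split(TAG_PATTERN)
    index = segments.index { |s| !s.start_with?('<') && s.match?(word) }
    next unless index

    # sub replaces only the first match inside that text segment.
    segments[index] = segments[index].sub(word) { |m| %(<a href="#{url}">#{m}</a>) }
    html = segments.join
  end
  html
end

html = '<p title="bicycles">I love mountain biking and bicycles in general.</p>'
linked = link_first_occurrences(html, { 'Bicycles' => '/bicycles/' })
puts linked
```

This still links words that already sit inside an existing anchor and knows nothing about a TOC, so those parts would need the extra splitting described above.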

Parsing XML document missing enclosing parent entity

I want to process an XML document that lacks an overarching enclosing entity. (Yes, that's the file I'm given. No, I didn't create it.) For example:
<DeviceInfo>
<Greeting>Crunchy bacon!</Greeting>
</DeviceInfo>
<InstantaneousDemand>
<TimeStamp>0x1c722845</TimeStamp>
</InstantaneousDemand>
<InstantaneousDemand>
<TimeStamp>0x1c72284a</TimeStamp>
</InstantaneousDemand>
When I parse the file using Nokogiri's XML method, it (predictably) only reads the first entity:
>> doc = Nokogiri::XML(File.open("x.xml"))
>> doc.children.count
=> 1
>> doc.text
=> "\n Crunchy bacon!\n"
I could read the file as a string and wrap a fake enclosing entity around the whole thing, but that seems heavy handed. Is there a better way to get Nokogiri to read in all the entities?
You might create a DocumentFragment rather than Document (especially taking into account that your content is actually a document fragment):
▶ doc = Nokogiri::XML::DocumentFragment.parse File.read("x.xml")
#⇒ #<Nokogiri::XML::DocumentFragment:0x14efa38 name="#document-fragment"
# ...
# #<Nokogiri::XML::Element:0x14ef68c name="InstantaneousDemand"
# ...
▶ doc.children.count
#⇒ 6
Hope it helps.

YQL Yahoo Finance Scraper on XML in Ruby

I am using a YQL query (the standard example query, with GOOG, YHOO, MSFT and AAPL) to generate XML for all of the available fields. I wanted to scrape the YQL site for the XML output once it is generated using a Ruby script, so that I could run it over and over again for different stocks and store the data somewhere. I haven't finished my script yet, but what I have seems to just not run. Here is the code:
yahoo_finance_scrape.rb
require 'rubygems'
require 'nokogiri'
require 'restclient'
PAGE_URL = "http://developer.yahoo.com/yql/console/"
yql_query = 'use "http://github.com/spullara/yql-tables/raw/d60732fd4fbe72e5d5bd2994ff27cf58ba4d3f84/yahoo/finance/yahoo.finance.quotes.xml"
as quotes; select * from quotes where symbol in ("YHOO","AAPL","GOOG","MSFT") '
if page = RestClient.post(PAGE_URL, {'name' => yql_query, 'submit' => 'Test'})
puts "YQL query: #{yql_query}, is valid"
xml_output = Nokogiri::HTML(page)
lines = xml_output.css('#container #layout-doc #yui-gen3000008 #yui-gen3000009 #yui_3_11_0_3_1393417778356_354
#yui-gen3000015 #yui-gen3000016 div#yui_3_11_0_2_1393417778356_10 #centerBottomView
#outputContainer div#output #outputTabContent #formattedView #viewContent #prexml')
lines.each do |line|
puts line.css('span').map{|span| span.text}.join(' ')
sleep 0.03
end
end
When I run the program, it only prints
"YQL query: use "http://github.com/spullara/yql-tables/raw/d60732fd4fbe72e5d5bd2994ff27cf58ba4d3f84/yahoo/finance/yahoo.finance.quotes.xml"
as quotes; select * from quotes where symbol in ("YHOO","AAPL","GOOG","MSFT") , is valid"
And then just stops. Oh, I am using that Github url because yahoo.finance.quotes was not working, and someone else on Stackoverflow suggested to use it.
If you want to check the css tags, just go to http://developer.yahoo.com/yql/console/ and enter my query and do an inspect element on it. I would post it here, but I don't know how.
The output is just the content of your yql_query variable, so it does not help much.
You probably should not put the "use xxxx as quotes" statement as a string in your code.
Check out what "someone else" had in mind.
The RestClient.post() method returns a response object. With all HTTP operations, always check the response.code, otherwise you don't know about errors.
response = RestClient.post(...)
puts "HTTP Response code: #{response.code}"
if response.code == 200
  page = response.to_str
  ...
end
According to the Nokogiri website, the xml_output.css() method filters like a CSS selector. For example, "#container #layout-doc" means "select elements with the id 'layout-doc' inside elements with the id 'container'", and so on. Is this really what you intend to do? If so, the last "#prexml" should be enough and much less error-prone, as ids should normally be unique.

Regex to get ID from link URL

I have links like this:
<div class="zg_title">
Thermos Foogo Leak-Proof Stainless St...
</div>
And I'm scraping them like this:
product_asin = product.xpath('//div[@class="zg_title"]/a/@href').first.value
The problem is that it takes the whole URL and I want to just get the ID:
B000O3GCFU
I think I need to do something like this:
product_asin = product.xpath('//div[@class="zg_title"]/a/@href').first.value[ReGEX_HERE]
What's the simplest regex I can use in this case?
EDIT:
Strange the link URL doesn't appear complete:
http://www.amazon.com/Thermos-Foogo-Leak-Proof-Stainless-10-Ounce/dp/B000O3GCFU/ref=zg_bs_baby-products_1
Use /\w+$/:
p doc.xpath('//div[@class="zg_title"]/a/@href').first.value[/\w+$/]
/\w+$/ matches trailing letters, digits, and underscores.
require 'nokogiri'
s = <<EOF
<div class="zg_title">
Thermos Foogo Leak-Proof Stainless St...
</div>
EOF
doc = Nokogiri::HTML(s)
p doc.xpath('//div[@class="zg_title"]/a/@href').first.value[/\w+$/]
# => "B000O3GCFU"
Given that the product code is always preceded by /dp/ and followed by a /:
url[/(?<=\/dp\/)[^\/]+/]
Or, perhaps more readable:
url[%r{(?<=/dp/)[^/]+}]
Alternatively, without using regular expressions:
parts = url.split('/')
parts[parts.index('dp') + 1]
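For comparison, here is a self-contained run of these variants against the full URL quoted in the question's edit. It also shows why a plain trailing match is not enough once the /ref=... suffix is present:

```ruby
url = 'http://www.amazon.com/Thermos-Foogo-Leak-Proof-Stainless-10-Ounce/dp/B000O3GCFU/ref=zg_bs_baby-products_1'

# Lookbehind: everything after "/dp/" up to the next slash.
by_lookbehind = url[%r{(?<=/dp/)[^/]+}]

# No regex: take the path segment that follows "dp".
parts = url.split('/')
by_split = parts[parts.index('dp') + 1]

# Trailing word characters only: on the full URL this grabs part of the suffix.
by_tail = url[/\w+$/]

puts by_lookbehind
puts by_split
puts by_tail
```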
An approach based on available parsers (to please Nicolas Tyler or anyone else who would rather avoid regex for parsing in this sort of case)
require 'uri'
product_uri = product.xpath('//div[@class="zg_title"]/a/@href').first.value
# e.g. http://www.amazon.com/Thermos-Foogo-Leak-Proof-Stainless-10-Ounce/dp/B000O3GCFU/ref=zg_bs_baby-products_1
product_path = URI.parse(product_uri).path.split('/')
# => ["", "Thermos-Foogo-Leak-Proof-Stainless-10-Ounce",
# "dp", "B000O3GCFU", "ref=zg_bs_baby-products_1"]
# This relies on the (un-researched) assumption that the location in the path is consistent.
# Now that we have the components, though, we can look at Amazon's documentation and
# select based on position in path, relative position from some other identifier,
# etc., without risk of a regex mismatch
product_asin = product_path[3]
# => "B000O3GCFU"

Adding formatted break with Nokogiri

I'm trying to add a few elements to an already existing XML document. The following code is successful at adding the desired nodes and content, however it doesn't format the inserted elements. All the added elements end up on one line instead of with line breaks and indentations after each element.
Any suggestions about how I could add this formatting?
The code is:
doc.xpath("//tei:div[@xml:id='versionlog']", {"tei" => "http://www.tei-c.org/ns/1.0"}).each do |node|
  new_entry = Nokogiri::XML::Node.new "div", doc
  new_entry["xml:id"] = "v_#{ed_no}"
  head = Nokogiri::XML::Node.new "head", doc
  head.content = "Description of changes for #{ed_no}"
  new_entry.add_child(head)
  para = Nokogiri::XML::Node.new "p", doc
  para.content = "#{version_description}"
  new_entry.add_child(para)
  node.add_child(new_entry)
end
Why is it important that the XML not be on one line? It's purely cosmetic having "pretty-printed" XML, and not required by the XML spec or the parser when the XML is reloaded. Personally, I'd recommend having no formatting for your transfer speed and reduced disk size, but YMMV.
You can either run the XML through an XML beautifier, or play a game with Nokogiri along the lines of:
new_entry.add_child(para.to_xml + "\n")
The line break will be added as a text node between the tags, but it's benign and not significant to XML's ability to deliver its payload.
If you insist, "How do I pretty-print HTML with Nokogiri?" describes how to get there.
