How to to parse HTML contents of a page using Nokogiri - ruby

require 'rubygems'
require 'nokogiri'
require 'open-uri'
url = 'https://www.trumba.com/calendars/smithsonian-events.xml'
doc = Nokogiri::XML(open url)
I am trying to fetch the basic set of information like:
event_name
categories
sponsor
venue
event_location
cost
For example, for event_name I have this xpath:
"/html/body/div[2]/div[2]/div[1]/h3/a/span"
And use it like:
puts doc.xpath "/html/body/div[2]/div[2]/div[1]/h3/a/span"
This returns nil for event_name.
If I save the URL contents locally then above XPath works.
Along with this, I need above mentioned information as well. I checked the other XPaths too, but the result turns out to be blank.

Here's how I'd go about doing this:
require 'nokogiri'
doc = Nokogiri::XML(open('/Users/gferguson/smithsonian-events.xml'))
namespaces = doc.collect_namespaces
entries = doc.search('entry').map { |entry|
entry_title = entry.at('title').text
entry_time_start, entry_time_end = ['startTime', 'endTime'].map{ |p|
entry.at('gd|when', namespaces)[p]
}
entry_notes = entry.at('gc|notes', namespaces).text
{
title: entry_title,
start_time: entry_time_start,
end_time: entry_time_end,
notes: entry_notes
}
}
Which, when run, results in entries being an array of hashes:
require 'awesome_print'
ap entries [0, 3]
# >> [
# >> [0] {
# >> :title => "Conservation Clinics",
# >> :start_time => "2016-11-09T14:00:00Z",
# >> :end_time => "2016-11-09T17:00:00Z",
# >> :notes => "Have questions about the condition of a painting, frame, drawing,\n print, or object that you own? Our conservators are available by\n appointment to consult with you about the preservation of your art.\n \n To request an appointment or to learn more,\n e-mail DWRCLunder#si.edu and specify CLINIC in the subject line."
# >> },
# >> [1] {
# >> :title => "Castle Highlights Tour",
# >> :start_time => "2016-11-09T14:00:00Z",
# >> :end_time => "2016-11-09T14:45:00Z",
# >> :notes => "Did you know that the Castle is the Smithsonian’s first and oldest building? Join us as one of our dynamic volunteer docents takes you on a tour to explore the highlights of the Smithsonian Castle. Come learn about the founding and early history of the Smithsonian; its original benefactor, James Smithson; and the incredible history and architecture of the Castle. Here is your opportunity to discover the treasured stories revealed within James Smithson's crypt, the Gre...
# >> },
# >> [2] {
# >> :title => "Exhibition Interpreters/Navigators (throughout the day)",
# >> :start_time => "2016-11-09T15:00:00Z",
# >> :end_time => "2016-11-09T15:00:00Z",
# >> :notes => "Museum volunteer interpreters welcome visitors, answer questions, and help visitors navigate exhibitions. Interpreters may be stationed in several of the following exhibitions at various times throughout the day, subject to volunteer interpreter availability. <ul> \t<li><em>The David H. Koch Hall of Human Origins: What Does it Mean to be Human?</em></li> \t<li><em>The Sant Ocean Hall</em></li> </ul>"
# >> }
# >> ]
I didn't try to gather the specific information you asked for because event_name doesn't exist and what you're doing is very generic and easily done once you understand a few rules.
XML is generally very repetitive because it represents tables of data. The "cells" of the table might vary but there's repetition you can use to help you. In this code
doc.search('entry')
loops over the <entry> nodes. Then it's easy to look inside them to find the information needed.
The XML uses namespaces to help avoid tag-name collisions. At first those seem really hard, but Nokogiri provides the collect_namespaces method for the document that returns a hash of all namespaces in the document. If you're looking for a namespaces-tag, pass that hash as the second parameter.
Nokogiri allows us to use XPath and CSS for selectors. I almost always go with CSS for readability. ns|tag is the format to tell Nokogiri to use a CSS-based namespaced tag. Again, pass it the hash of namespaces in the document and Nokogiri will do the rest.
If you're familiar with working with Nokogiri you'll see the above code is very similar to normal code used to pull the content of <td> cells inside <tr> rows in an HTML <table>.
You should be able to modify that code to gather the data you need without risking namespace collisions.

The provided link contains XML, so your XPath expressions should work with XML structure.
The key thing is that the document has namespaces. As I understand all XPath expressions should keep that in mind and specify namespaces too.
In order to simply XPath expressions one can use the remove_namespaces! method:
require 'nokogiri'
require 'open-uri'
url = 'https://www.trumba.com/calendars/smithsonian-events.xml'
doc = Nokogiri::XML(open(url)); nil # nil is used to avoid huge output
doc.remove_namespaces!; nil
event = doc.xpath('//feed/entry[1]') # it will give you the first event
event.xpath('./title').text # => "Conservation Clinics"
event.xpath('./categories').text # => "Demonstrations,Lectures & Discussions"
Most likely you would like to have array of all event hashes.
You can do it like:
doc.xpath('//feed/entry').reduce([]) do |memo, event|
event_hash = {
title: event.xpath('./title').text,
categories: event.xpath('./categories').text
# all other attributes you need ...
}
memo << event_hash
end
It will give you an array like:
[
{:title=>"Conservation Clinics", :categories=>"Demonstrations,Lectures & Discussions"},
{:title=>"Castle Highlights Tour", :categories=>"Gallery Talks & Tours"},
...
]

Related

Nokogiri children method

I have the following XML here:
<listing>
<seller_info>
<payment_types>Visa, Mastercard, , , , 0, Discover, American Express </payment_types>
<shipping_info>siteonly, Buyer Pays Shipping Costs </shipping_info>
<buyer_protection_info/>
<auction_info>
<bid_history>
<item_info>
</listing>
The following code works fine for displaying first child of the first //listing node:
require 'nokogiri'
require 'open-uri'
html_data = open('http://aiweb.cs.washington.edu/research/projects/xmltk/xmldata/data/auctions/321gone.xml')
nokogiri_object = Nokogiri::XML(html_data)
listing_elements = nokogiri_object.xpath("//listing")
puts listing_elements[0].children[1]
This also works:
puts listing_elements[0].children[3]
I tried to access the second node <payment_types> with the the following code:
puts listing_elements[0].children[2]
but a blank line was displayed. Looking through Firebug, it is clearly the 2nd child of the listing node. In general, only odd numbers work with the children method.
Is this a bug in Nokogiri? Any thoughts?
It's not a bug, its the space created while parsing strings that contain "\n" (or empty nodes), but you could use the noblanks option to avoid them:
nokogiri_object = Nokogiri::XML(html_data) { |conf| conf.noblanks }
Use that and you will have no blanks in your array.
The problem is you are not parsing the document correctly. children returns more than you think, and its use is painting you into a corner.
Here's a simplified example of how I'd do it:
require 'nokogiri'
doc = Nokogiri::XML(DATA.read)
auctions = doc.search('listing').map do |listing|
seller_info = listing.at('seller_info')
auction_info = listing.at('auction_info')
hash = [:seller_name, :seller_rating].each_with_object({}) do |s, h|
h[s] = seller_info.at(s.to_s).text.strip
end
[:current_bid, :time_left].each do |s|
hash[s] = auction_info.at(s.to_s).text.strip
end
hash
end
__END__
<?xml version='1.0' ?>
<!DOCTYPE root SYSTEM "http://www.cs.washington.edu/research/projects/xmltk/xmldata/data/auctions/321gone.dtd">
<root>
<listing>
<seller_info>
<seller_name>537_sb_3 </seller_name>
<seller_rating> 0</seller_rating>
</seller_info>
<auction_info>
<current_bid> $839.93</current_bid>
<time_left> 1 Day, 6 Hrs</time_left>
</auction_info>
</listing>
<listing>
<seller_info>
<seller_name> lapro8</seller_name>
<seller_rating> 0</seller_rating>
</seller_info>
<auction_info>
<current_bid> $210.00</current_bid>
<time_left> 4 Days, 21 Hrs</time_left>
</auction_info>
</listing>
</root>
After running, auctions will be:
auctions
# => [{:seller_name=>"537_sb_3",
# :seller_rating=>"0",
# :current_bid=>"$839.93",
# :time_left=>"1 Day, 6 Hrs"},
# {:seller_name=>"lapro8",
# :seller_rating=>"0",
# :current_bid=>"$210.00",
# :time_left=>"4 Days, 21 Hrs"}]
Notice there are no empty text nodes to deal with because I told Nokogiri exactly which nodes to grab text from. You should be able to extend the code to grab any information you want easily.
A typically formatted XML or HTML document that displays nesting or indentation uses text nodes to provide that indenting:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<p>foo</p>
</body>
</html>
EOT
Here's what your code is seeing:
doc.at('body').children.map(&:to_html)
# => ["\n" +
# " ", "<p>foo</p>", "\n" +
# " "]
The Text nodes are what are confusing you:
doc.at('body').children.first.class # => Nokogiri::XML::Text
doc.at('body').children.first.text # => "\n "
If you don't drill down far enough you will pick up the Text nodes and have to clean up the results:
doc.at('body')
.text # => "\n foo\n "
.strip # => "foo"
Instead, explicitly find the node you want and extract the information:
doc.at('body p').text # => "foo"
In the suggested code above I used strip because the incoming XML had spaces surrounding some text:
h[s] = seller_info.at(s.to_s).text.strip
which is the result of the original XML creation code not cleaning the lines prior to generating the XML. So sometimes we have to clean up their mess, but the proper accessing of the node can reduce that a lot.
The problem is that children includes text nodes such as the whitespace between elements. If instead you use element_children you get just the child elements (i.e. the contents of the tags, not the surrounding whitespace).

Add Nokogiri parse result to variable

I have an XML document:
<cred>
<login>Tove</login>
<pass>Jani</pass>
</cred>
My code is:
require 'nokogiri'
require 'selwet'
context "parse xml" do doc = Nokogiri::XML(File.open("test.xml"))
doc.xpath("cred/login").each do
|char_element|
puts char_element.text
end
should "check" do
Unit.go_to "http://www.ya.ru/"
Unit.click '.b-inline'
Unit.fill '[name="login"]', #login
end
When I run my test I get:
Tove
0
But I want to insert the parse result to #login. How can I get variables with the parsing result? Do I need to insert the login and pass values from the XML to fields in the web page?
You can get value of login from your XML with
#login = doc.xpath('//cred/login').text
I'd use something like this to get the values:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<cred>
<login>Tove</login>
<pass>Jani</pass>
</cred>
EOT
login = doc.at('login').text # => "Tove"
pass = doc.at('pass').text # => "Jani"
Nokogiri makes it really easy to access values using CSS, so use it for readability when possible. The same thing can be done using XPath:
login = doc.at('//login').text # => "Tove"
pass = doc.at('//pass').text # => "Jani"
but having to add // twice to accomplish the same thing is usually wasted effort.
The important part is at, which returns the first occurrence of the target. at allows us to use either CSS or XPath, but CSS is usually less visually noisy.

Watir method (or monkey-patch) to select span (or other) tags with custom ("data-*") attribute values equaling a string value (or matching a regex)

So this is ruby right, and while I do have a solution already, which I'll show below, its not tight. Feels like I'm using ahem "C++ iterators", if you will. Too many lines of code. Not like ruby.
Anyway, I'm wondering if there is classier way to do this:
b = Watir::Browser.new
b.goto "javascriptinjectedtablevalues.com" #not real website url:)
# desired urls in list are immediately located within <span> tags with a "class" of
#"name" plus a custom html attribute attribute of "data-bind" = "name: $data". that's it
# unless I wanted to use child-selectors which I'm not very good at
allrows = b.spans(:class => "name").each_with_index.map do |x, i|
[0, x.attribute_value("data-bind")]
end
real_row_ids = allrows.select{|i, databind| databind == "name: $data" }.map(&:first) #now I have all correct span ids
spans = real_row_ids.map {|id| b.spans(:class => "name")[id] }
Now that's a little messy in my opinion. But it leaves artifacts so I can debug and go back and stuff.
I could use this command to just grab a just the spans
spans = b.spans(:class => "name").map do |span|
[span, span.attribute_value("data-bind")]
end.select {|span, databind| databind == "name: $data"}.map(&:first)
but that still feels messy having no artifacts to show for it to use for later when trying to isolate other html tags nearby the span.
I'm hoping there is something like this pseudo code for watir:
b.spans(:class => "name").with_custom_attributes(:key => "data-bind", :value => "name: $data")
that's what I'd really like to do. superman-patching this custom method onto Watir within a rails initializer would be the optimal solution second to it already existing within Watir!
Watir already supports using data attributes for locators. You simply need to replace the dashes with underscores.
For example:
b.spans(:class => 'name', :data_bind => "name: $data")
Would match elements like:
<span class="name" data-bind="name: $data">
Similarly, you can use a regex when matching the data attribute:
b.spans(:class => 'name', :data_bind => /name/)

How to get RSS feed in xml format for ruby script

I am using the following ruby script from this dashing widget that retrieves an RSS feed and parses it and sends that parsed title and description to a widget.
require 'net/http'
require 'uri'
require 'nokogiri'
require 'htmlentities'
news_feeds = {
"seattle-times" => "http://seattletimes.com/rss/home.xml",
}
Decoder = HTMLEntities.new
class News
def initialize(widget_id, feed)
#widget_id = widget_id
# pick apart feed into domain and path
uri = URI.parse(feed)
#path = uri.path
#http = Net::HTTP.new(uri.host)
end
def widget_id()
#widget_id
end
def latest_headlines()
response = #http.request(Net::HTTP::Get.new(#path))
doc = Nokogiri::XML(response.body)
news_headlines = [];
doc.xpath('//channel/item').each do |news_item|
title = clean_html( news_item.xpath('title').text )
summary = clean_html( news_item.xpath('description').text )
news_headlines.push({ title: title, description: summary })
end
news_headlines
end
def clean_html( html )
html = html.gsub(/<\/?[^>]*>/, "")
html = Decoder.decode( html )
return html
end
end
#News = []
news_feeds.each do |widget_id, feed|
begin
#News.push(News.new(widget_id, feed))
rescue Exception => e
puts e.to_s
end
end
SCHEDULER.every '60m', :first_in => 0 do |job|
#News.each do |news|
headlines = news.latest_headlines()
send_event(news.widget_id, { :headlines => headlines })
end
end
The example rss feed works correctly because the URL is for an xml file. However I want to use this for a different rss feed that does not provide an actual xml file. This rss feed I want is at http://www.ttc.ca/RSS/Service_Alerts/index.rss
This doesn't seem to display anything on the widget. Instead of using "http://www.ttc.ca/RSS/Service_Alerts/index.rss", I also tried "http://www.ttc.ca/RSS/Service_Alerts/index.rss?format=xml" and "view-source:http://www.ttc.ca/RSS/Service_Alerts/index.rss" but with no luck. Does anyone know how I can get the actual xml data related to this rss feed so that I can use it with this ruby script?
You're right, that link does not provide regular XML, so that script won't work in parsing it since it's written specifically to parse the example XML. The rss feed you're trying to parse is providing RDF XML and you can use the Rubygem: RDFXML to parse it.
Something like:
require 'nokogiri'
require 'rdf/rdfxml'
rss_feed = 'http://www.ttc.ca/RSS/Service_Alerts/index.rss'
RDF::RDFXML::Reader.open(rss_feed) do |reader|
# use reader to iterate over elements within the document
end
From here you can try learning how to use RDFXML to extract the content you want. I'd begin by inspecting the reader object for methods I could use:
puts reader.methods.sort - Object.methods
That will print out the reader's own methods, look for one you might be able to use for your purposes, such as reader.each_entry
To further dig down you can inspect what each entry looks like:
reader.each_entry do |entry|
puts "----here's an entry----"
puts entry.inspect
end
or see what methods you can call on the entry:
reader.each_entry do |entry|
puts "----here's an entry's methods----"
puts entry.methods.sort - Object.methods
break
end
I was able to crudely find some titles and descriptions using this hack job:
RDF::RDFXML::Reader.open('http://www.ttc.ca/RSS/Service_Alerts/index.rss') do |reader|
reader.each_object do |object|
puts object.to_s if object.is_a? RDF::Literal
end
end
# returns:
# TTC Service Alerts
# http://www.ttc.ca/Service_Advisories/index.jsp
# TTC Service Alerts.
# TTC.ca
# http://www.ttc.ca
# http://www.ttc.ca/images/ttc-main-logo.gif
# Service Advisory
# http://www.ttc.ca/Service_Advisories/all_service_alerts.jsp#Service+Advisory
# 196 York University Rocket route diverting northbound via Sentinel, Finch due to a collision that has closed the York U Bus way.
# - Affecting: Bus Routes: 196 York University Rocket
# 2013-12-17T13:49:03.800-05:00
# Service Advisory (2)
# http://www.ttc.ca/Service_Advisories/all_service_alerts.jsp#Service+Advisory+(2)
# 107B Keele North route diverting northbound via Keele, Lepage due to a collision that has closed the York U Bus way.
# - Affecting: Bus Routes: 107 Keele North
# 2013-12-17T13:51:08.347-05:00
But I couldn't quickly find a way to know which one was a title, and which a description :/
Finally, if you still can't find how to extract what you want, start a new question with this info.
Good luck!

How to parse SOAP response from ruby client?

I am learning Ruby and I have written the following code to find out how to consume SOAP services:
require 'soap/wsdlDriver'
wsdl="http://www.abundanttech.com/webservices/deadoralive/deadoralive.wsdl"
service=SOAP::WSDLDriverFactory.new(wsdl).create_rpc_driver
weather=service.getTodaysBirthdays('1/26/2010')
The response that I get back is:
#<SOAP::Mapping::Object:0x80ac3714
{http://www.abundanttech.com/webservices/deadoralive} getTodaysBirthdaysResult=#<SOAP::Mapping::Object:0x80ac34a8
{http://www.w3.org/2001/XMLSchema}schema=#<SOAP::Mapping::Object:0x80ac3214
{http://www.w3.org/2001/XMLSchema}element=#<SOAP::Mapping::Object:0x80ac2f6c
{http://www.w3.org/2001/XMLSchema}complexType=#<SOAP::Mapping::Object:0x80ac2cc4
{http://www.w3.org/2001/XMLSchema}choice=#<SOAP::Mapping::Object:0x80ac2a1c
{http://www.w3.org/2001/XMLSchema}element=#<SOAP::Mapping::Object:0x80ac2774
{http://www.w3.org/2001/XMLSchema}complexType=#<SOAP::Mapping::Object:0x80ac24cc
{http://www.w3.org/2001/XMLSchema}sequence=#<SOAP::Mapping::Object:0x80ac2224
{http://www.w3.org/2001/XMLSchema}element=[#<SOAP::Mapping::Object:0x80ac1f7c>,
#<SOAP::Mapping::Object:0x80ac13ec>,
#<SOAP::Mapping::Object:0x80ac0a28>,
#<SOAP::Mapping::Object:0x80ac0078>,
#<SOAP::Mapping::Object:0x80abf6c8>,
#<SOAP::Mapping::Object:0x80abed18>]
>>>>>>> {urn:schemas-microsoft-com:xml-diffgram-v1}diffgram=#<SOAP::Mapping::Object:0x80abe6c4
{}NewDataSet=#<SOAP::Mapping::Object:0x80ac1220
{}Table=[#<SOAP::Mapping::Object:0x80ac75e4
{}FullName="Cully, Zara"
{}BirthDate="01/26/1892"
{}DeathDate="02/28/1979"
{}Age="(87)"
{}KnownFor="The Jeffersons"
{}DeadOrAlive="Dead">,
#<SOAP::Mapping::Object:0x80b778f4
{}FullName="Feiffer, Jules"
{}BirthDate="01/26/1929"
{}DeathDate=#<SOAP::Mapping::Object:0x80c7eaf4>
{}Age="81"
{}KnownFor="Cartoonists"
{}DeadOrAlive="Alive">]>>>>
I am having a great deal of difficulty figuring out how to parse and show the returned information in a nice table, or even just how to loop through the records and have access to each element (ie. FullName,Age,etc). I went through the whole "getTodaysBirthdaysResult.methods - Object.new.methods" and kept working down to try and work out how to access the elements, but then I get to the array and I got lost.
Any help that can be offered would be appreciated.
If you're going to parse the XML anyway, you might as well skip SOAP4r and go with Handsoap. Disclaimer: I'm one of the authors of Handsoap.
An example implementation:
# wsdl: http://www.abundanttech.com/webservices/deadoralive/deadoralive.wsdl
DEADORALIVE_SERVICE_ENDPOINT = {
:uri => 'http://www.abundanttech.com/WebServices/DeadOrAlive/DeadOrAlive.asmx',
:version => 1
}
class DeadoraliveService < Handsoap::Service
endpoint DEADORALIVE_SERVICE_ENDPOINT
def on_create_document(doc)
# register namespaces for the request
doc.alias 'tns', 'http://www.abundanttech.com/webservices/deadoralive'
end
def on_response_document(doc)
# register namespaces for the response
doc.add_namespace 'ns', 'http://www.abundanttech.com/webservices/deadoralive'
end
# public methods
def get_todays_birthdays
soap_action = 'http://www.abundanttech.com/webservices/deadoralive/getTodaysBirthdays'
response = invoke('tns:getTodaysBirthdays', soap_action)
(response/"//NewDataSet/Table").map do |table|
{
:full_name => (table/"FullName").to_s,
:birth_date => Date.strptime((table/"BirthDate").to_s, "%m/%d/%Y"),
:death_date => Date.strptime((table/"DeathDate").to_s, "%m/%d/%Y"),
:age => (table/"Age").to_s.gsub(/^\(([\d]+)\)$/, '\1').to_i,
:known_for => (table/"KnownFor").to_s,
:alive? => (table/"DeadOrAlive").to_s == "Alive"
}
end
end
end
Usage:
DeadoraliveService.get_todays_birthdays
SOAP4R always returns a SOAP::Mapping::Object which is sometimes a bit difficult to work with unless you are just getting the hash values that you can access using hash notation like so
weather['fullName']
However, it does not work when you have an array of hashes. A work around is to get the result in xml format instead of SOAP::Mapping::Object. To do that I will modify your code as
require 'soap/wsdlDriver'
wsdl="http://www.abundanttech.com/webservices/deadoralive/deadoralive.wsdl"
service=SOAP::WSDLDriverFactory.new(wsdl).create_rpc_driver
service.return_response_as_xml = true
weather=service.getTodaysBirthdays('1/26/2010')
Now the above would give you an xml response which you can parse using nokogiri or REXML. Here is the example using REXML
require 'rexml/document'
rexml = REXML::Document.new(weather)
birthdays = nil
rexml.each_recursive {|element| birthdays = element if element.name == 'getTodaysBirthdaysResult'}
birthdays.each_recursive{|element| puts "#{element.name} = #{element.text}" if element.text}
This will print out all elements that have any text.
So once you have created an xml document you can pretty much do anything depending upon the methods the library you choose has ie. REXML or Nokogiri
Well, Here's my suggestion.
The issue is, you have to snag the right part of the result, one that is something you can actually iterator over. Unfortunately, all the inspecting in the world won't help you because it's a huge blob of unreadable text.
What I do is this:
File.open('myresult.yaml', 'w') {|f| f.write(result.to_yaml) }
This will be a much more human readable format. What you are probably looking for is something like this:
--- !ruby/object:SOAP::Mapping::Object
__xmlattr: {}
__xmlele:
- - &id024 !ruby/object:XSD::QName
name: ListAddressBooksResult <-- Hash name, so it's resul["ListAddressBooksResult"]
namespace: http://apiconnector.com
source:
- !ruby/object:SOAP::Mapping::Object
__xmlattr: {}
__xmlele:
- - &id023 !ruby/object:XSD::QName
name: APIAddressBook <-- this bastard is enumerable :) YAY! so it's result["ListAddressBooksResult"]["APIAddressBook"].each
namespace: http://apiconnector.com
source:
- - !ruby/object:SOAP::Mapping::Object
The above is a result from DotMailer's API, which I spent the last hour trying to figure out how to enumerate over the results. The above is the technique I used to figure out what the heck is going on. I think it beats using REXML etc this way, I could do something like this:
result['ListAddressBooksResult']['APIAddressBook'].each {|book| puts book["Name"]}
Well, I hope this helps anyone else who is looking.
/jason

Resources