How to parse a web page using Nokogiri in Ruby?

I am using Nokogiri to parse HTML. For the website shown, I am trying to create an array of hashes where each hash contains the pros, cons, and advice sections for a given review on the site. I am having trouble doing this and was hoping for some advice. When I select a certain element, I don't get the content that the site shows for that review. Any ideas?
require 'open-uri'
require 'nokogiri'
# Fetch the Glassdoor reviews page
doc = Nokogiri::HTML(URI.open('http://www.glassdoor.com/Reviews/Microsoft-Reviews-E1651.htm'))
reviews = []
current_review = Hash.new
doc.css('.employerReview').each do |item|
  pro = item.parent.css('p:nth-child(1) .notranslate').text
  con = item.parent.css('p:nth-child(2) .notranslate').text
  advice = item.parent.css('p:nth-child(3) .notranslate').text
  current_review = {'pro' => pro, 'con' => con, 'advice' => advice}
  reviews << current_review
end

Try this instead:
reviews = []
doc.css('.employerReview').each do |item|
  pro, con, advice = item.css('.description .notranslate text()').map(&:to_s)
  reviews << {'pro' => pro, 'con' => con, 'advice' => advice}
end
In Ruby it's also idiomatic to use symbol keys, so unless you need the keys to be strings, I'd write:
reviews << { pro: pro, con: con, advice: advice }

Related

Yahoo Finance news pubDate not accessible by ruby nokogiri

I'm able to access the titles of Yahoo Finance news headlines, but I'm having a hard time parsing pubDate so that I can look only at, say, the last week's news and ignore anything older.
require 'open-uri'
require 'nokogiri'
sym = "1313.HK"
url = "https://feeds.finance.yahoo.com/rss/2.0/headline?s=#{sym}&region=US&lang=en-US"
doc = Nokogiri::HTML(URI.open(url))
titles = doc.css("title")
puts titles.length # works, comes back with 0-20
puts titles.text # works
pubDates = doc.css("pubDate")
puts pubDates.length #does NOT work, always 0
puts pubDates.text #does NOT work, always blank
keywordregex = /bad news/
nodes = doc.search('title') # search title tags only, for keywords
puts found_title = nodes.select{ |n| n.name=='title' && n.text =~ keywordregex } # TODO && pubDate > 7 days old
Try it with Nokogiri::XML; RSS is XML, not HTML.
doc = Nokogiri::XML(URI.open(url))
Alternatively, if you keep Nokogiri::HTML, note that the HTML parser lowercases element names, so the pubDate nodes are reachable as pubdate:
> doc.css("pubdate").length
=> 7

How to parse HTML contents of a page using Nokogiri

require 'rubygems'
require 'nokogiri'
require 'open-uri'
url = 'https://www.trumba.com/calendars/smithsonian-events.xml'
doc = Nokogiri::XML(URI.open(url))
I am trying to fetch the basic set of information like:
event_name
categories
sponsor
venue
event_location
cost
For example, for event_name I have this xpath:
"/html/body/div[2]/div[2]/div[1]/h3/a/span"
And use it like:
puts doc.xpath "/html/body/div[2]/div[2]/div[1]/h3/a/span"
This returns nil for event_name.
If I save the URL contents locally, the above XPath works.
I need the other fields listed above as well. I tried XPaths for those too, but the results come back blank.
Here's how I'd go about doing this:
require 'nokogiri'
doc = Nokogiri::XML(open('/Users/gferguson/smithsonian-events.xml'))
namespaces = doc.collect_namespaces
entries = doc.search('entry').map { |entry|
  entry_title = entry.at('title').text
  entry_time_start, entry_time_end = ['startTime', 'endTime'].map { |attr|
    entry.at('gd|when', namespaces)[attr]
  }
  entry_notes = entry.at('gc|notes', namespaces).text
  {
    title: entry_title,
    start_time: entry_time_start,
    end_time: entry_time_end,
    notes: entry_notes
  }
}
Which, when run, results in entries being an array of hashes:
require 'awesome_print'
ap entries[0, 3]
# >> [
# >> [0] {
# >> :title => "Conservation Clinics",
# >> :start_time => "2016-11-09T14:00:00Z",
# >> :end_time => "2016-11-09T17:00:00Z",
# >> :notes => "Have questions about the condition of a painting, frame, drawing,\n print, or object that you own? Our conservators are available by\n appointment to consult with you about the preservation of your art.\n \n To request an appointment or to learn more,\n e-mail DWRCLunder#si.edu and specify CLINIC in the subject line."
# >> },
# >> [1] {
# >> :title => "Castle Highlights Tour",
# >> :start_time => "2016-11-09T14:00:00Z",
# >> :end_time => "2016-11-09T14:45:00Z",
# >> :notes => "Did you know that the Castle is the Smithsonian’s first and oldest building? Join us as one of our dynamic volunteer docents takes you on a tour to explore the highlights of the Smithsonian Castle. Come learn about the founding and early history of the Smithsonian; its original benefactor, James Smithson; and the incredible history and architecture of the Castle. Here is your opportunity to discover the treasured stories revealed within James Smithson's crypt, the Gre...
# >> },
# >> [2] {
# >> :title => "Exhibition Interpreters/Navigators (throughout the day)",
# >> :start_time => "2016-11-09T15:00:00Z",
# >> :end_time => "2016-11-09T15:00:00Z",
# >> :notes => "Museum volunteer interpreters welcome visitors, answer questions, and help visitors navigate exhibitions. Interpreters may be stationed in several of the following exhibitions at various times throughout the day, subject to volunteer interpreter availability. <ul> \t<li><em>The David H. Koch Hall of Human Origins: What Does it Mean to be Human?</em></li> \t<li><em>The Sant Ocean Hall</em></li> </ul>"
# >> }
# >> ]
I didn't try to gather the specific information you asked for because event_name doesn't exist and what you're doing is very generic and easily done once you understand a few rules.
XML is generally very repetitive because it represents tables of data. The "cells" of the table might vary but there's repetition you can use to help you. In this code
doc.search('entry')
loops over the <entry> nodes. Then it's easy to look inside them to find the information needed.
The XML uses namespaces to help avoid tag-name collisions. At first those seem really hard, but Nokogiri provides the collect_namespaces method on the document, which returns a hash of all namespaces in the document. If you're looking for a namespaced tag, pass that hash as the second parameter.
Nokogiri allows us to use XPath and CSS for selectors. I almost always go with CSS for readability. ns|tag is the format to tell Nokogiri to use a CSS-based namespaced tag. Again, pass it the hash of namespaces in the document and Nokogiri will do the rest.
If you're familiar with working with Nokogiri you'll see the above code is very similar to normal code used to pull the content of <td> cells inside <tr> rows in an HTML <table>.
You should be able to modify that code to gather the data you need without risking namespace collisions.
The provided link contains XML, so your XPath expressions need to follow the XML structure rather than an HTML one.
The key thing is that the document has namespaces. As I understand it, every XPath expression has to take that into account and specify the namespaces too.
To simplify the XPath expressions, you can use the remove_namespaces! method:
require 'nokogiri'
require 'open-uri'
url = 'https://www.trumba.com/calendars/smithsonian-events.xml'
doc = Nokogiri::XML(URI.open(url)); nil # nil is used to avoid huge output
doc.remove_namespaces!; nil
event = doc.xpath('//feed/entry[1]') # it will give you the first event
event.xpath('./title').text # => "Conservation Clinics"
event.xpath('./categories').text # => "Demonstrations,Lectures & Discussions"
Most likely you want an array of hashes, one per event.
You can build it like this:
doc.xpath('//feed/entry').map do |event|
  {
    title: event.xpath('./title').text,
    categories: event.xpath('./categories').text
    # all other attributes you need ...
  }
end
It will give you an array like:
[
{:title=>"Conservation Clinics", :categories=>"Demonstrations,Lectures & Discussions"},
{:title=>"Castle Highlights Tour", :categories=>"Gallery Talks & Tours"},
...
]

How do I parse a page using Nokogiri?

I am trying to parse the URL shown in the doc variable below. My issue is with the job variable. When I return it, it returns every job title on the page instead of that specific job title for the given review. Does anyone have advice how to return the specific job title I'm referring to?
require 'nokogiri'
require 'open-uri'
# Fetch the Glassdoor reviews page
doc = Nokogiri::HTML(URI.open('http://www.glassdoor.com/Reviews/Microsoft-Reviews-E1651.htm'))
reviews = []
current_review = Hash.new
doc.css('.employerReview').each do |item|
  pro = item.parent.css('p:nth-child(1) .notranslate').text
  con = item.parent.css('p:nth-child(2) .notranslate').text
  job = item.parent.css('.review-microdata-heading .i-occ').text
  puts job
  advice = item.parent.css('p:nth-child(3) .notranslate').text
  current_review = {'pro' => pro, 'con' => con, 'advice' => advice}
  reviews << current_review
end
It looks like item.parent is #MainCol in each case, in other words the entire column, so every query runs against all reviews at once.
Changing item.parent.css to item.css should solve your problem.

Ruby - Mechanize: Select link by classname and other questions

At the moment I'm having a look at Mechanize.
I am pretty new to Ruby, so please be patient.
I wrote a little test script:
require 'rubygems'
require 'mechanize'
agent = WWW::Mechanize.new
page = agent.get('http://www.google.de')
pp page.title
google_form = page.form_with(:name => 'f')
google_form.q = 'test'
page = agent.submit(google_form)
pp page.title
page_links = Array.new
page.links.each do |ll|
page_links << ll
end
puts page_links.size
This works, but page_links includes more than the search results: it also contains Google's own links such as Login, Pictures, ...
The result links have the style class "l". Is it possible to select only the links with that class? How do I achieve this?
Is it possible to modify the user agent alias? If I own a website that includes Google Analytics or something similar, which browser client will I see in GA when I visit my site with Mechanize?
Can I select elements by their ID instead of their name? I tried to use
my_form = page.form_with(:id => 'myformid')
But this does not work.
In cases like yours I use Nokogiri's DOM search.
Here is your code, slightly rewritten:
require 'rubygems'
require 'mechanize'
agent = Mechanize.new
page = agent.get('http://www.google.de')
pp page.title
google_form = page.form_with(:name => 'f')
google_form.q = 'test'
page = agent.submit(google_form)
pp page.title
page_links = Array.new
# 'h3.r > a.l' may be a more precise selector here
page.parser.css("a.l").each do |ll|
  # page.parser here is a Nokogiri::HTML::Document
  page_links << ll
  puts ll.text + "=>" + ll["href"]
end
This article is probably a good place to start:
getting-started-with-nokogiri
By the way, the samples in the article also deal with Google search ;)
You can build a list of just the search result links by changing your code as follows:
page.links.each do |ll|
  cls = ll.attributes.attributes['class']
  page_links << ll if cls && cls.value == 'l'
end
For each element ll in page.links, ll.attributes is a Nokogiri::XML::Element and ll.attributes.attributes is a Hash containing the attributes on the link. That is why ll.attributes.attributes is needed to get at the actual class, and why there is a nil check before comparing the value to 'l'.
The problem with using :id in the criteria to find a form is that it clashes with Ruby's Object#id method for returning a Ruby object's internal id. I'm not sure what the workaround for this is. You would have no problem selecting the form by some other attribute (e.g. its action).
I believe the selector you are looking for is :dom_id, e.g. in your case:
my_form = page.form_with(:dom_id => 'myformid')

Word Automation using WIN32OLE

I am trying to insert an image (jpg) into a Word document, and Selection.InlineShapes.AddPicture does not seem to be supported by win32ole, or I am doing something wrong. Has anyone had any luck inserting images?
You can do this by calling the Document.InlineShapes.AddPicture() method.
The following example inserts an image into the active document, before the second sentence.
require 'win32ole'
word = WIN32OLE.connect('Word.Application')
doc = word.ActiveDocument
image = 'C:\MyImage.jpg'
range = doc.Sentences(2)
params = { 'FileName' => image, 'LinkToFile' => false,
'SaveWithDocument' => true, 'Range' => range }
pic = doc.InlineShapes.AddPicture( params )
Documentation on the AddPicture() method can be found here.
Additional details on automating Word with Ruby can be found here.
This is the answer by David Mullet and can be found here
Running on WinXP, Ruby 1.8.6, Word 2002/XP SP3, I recorded macros and translated them, as far as I could understand them, into this:
require 'win32ole'
begin
  word = WIN32OLE.new('Word.Application') # create the OLE automation object
  doc = word.Documents.Add
  word.Selection.InlineShapes.AddPicture "C:\\pictures\\some_picture.jpg", false, true
  word.ChangeFileOpenDirectory "C:\\docs\\"
  doc.SaveAs "doc_with_pic.doc"
rescue StandardError => e
  puts e
ensure
  # Quit exactly once, whether or not an error occurred
  word.Quit unless word.nil?
end
It seems to work. Any use?
