Getting link from Mechanize/Nokogiri - ruby

I am trying to find the best way to retrieve the href of an <a> link from a Nokogiri node. Here is what I have so far:
mech = Mechanize.new
mech.get(HOME_URL)
mech.page.search('.listing_content').each do |business|
website = business.css('.website-feature')
puts website.class
puts website.inner_html
end
output =>
Nokogiri::XML::NodeSet
<span class="raquo">»</span> Website
Basically, I just need to get the http://urlofsite.com out of the inner_html, and I'm not sure how to do that. I've read about doing it with CSS and XPATH but I can't get either to work at this point. Thanks for any help

First, tell Nokogiri to get a node, rather than a NodeSet. at_css will retrieve the Node and css retrieves a NodeSet, which is like an Array.
Instead of:
website = business.css('.website-feature')
Try:
website = business.at_css('a.website-feature')
to retrieve the first instance of an <a> node with class="website-feature". If it's not the first instance you want then you'll need to narrow it down by grabbing the NodeSet and then indexing into it. Without the surrounding HTML it's difficult to help more.
To get the href attribute from a Node, simply treat the node like a hash:
website['href']
should return:
http://urlofsite.com
Here's a little sample from IRB:
irb(main):001:0> require 'nokogiri'
=> true
irb(main):002:0>
irb(main):003:0* html = '<a class="this_node" href="http://example.com">'
=> "<a class=\"this_node\" href=\"http://example.com\">"
irb(main):004:0> doc = Nokogiri::HTML.parse(html)
=> #<Nokogiri::HTML::Document:0x8041e2ec name="document" children=[#<Nokogiri::XML::DTD:0x8041d20c name="html">, #<Nokogiri::XML::Element:0x805a2a14 name="html" children=[#<Nokogiri::XML::Element:0x805df8b0 name="body" children=[#<Nokogiri::XML::Element:0x8084c5d0 name="a" attributes=[#<Nokogiri::XML::Attr:0x80860170 name="class" value="this_node">, #<Nokogiri::XML::Attr:0x8086047c name="href" value="http://example.com">]>]>]>]>
irb(main):005:0>
irb(main):006:0* doc.at_css('a.this_node')['href']
=> "http://example.com"
irb(main):007:0>

Related

Ruby: How do I parse links with Nokogiri with content/text all the same?

What I am trying to do: Parse links from website (http://nytm.org/made-in-nyc) that all have the exact same content. "(hiring)" Then I will write to a file 'jobs.html' a list of links. (If it is a violation to publish these websites I will quickly take down the direct URL. I thought it might be useful as a reference to what I am trying to do. First time posting on stack)
DOM Structure:
<article>
<ol>
<li>#waywire</li>
<li><a href="http://1800Postcards.com" target="_self" class="vt-p">1800Postcards.com</a></li>
<li>Adafruit Industries</li>
<li><a href="http://www.adafruit.com/jobs/" target="_self" class="vt-p">(hiring)</a></li>
etc...
What I have tried:
require 'nokogiri'
require 'open-uri'
def find_jobs
doc = Nokogiri::HTML(open('http://nytm.org/made-in-nyc'))
hire_links = doc.css("a").select{|link| link.text == "(hiring)"}
results = hire_links.each{|link| puts link['href']}
begin
file = File.open("./jobs.html", "w")
file.write("#{results}")
rescue IOError => e
ensure
file.close unless file == nil
end
puts hire_links
end
find_jobs
Here is a Gist
Example Result:
[344] #<Nokogiri::XML::Element:0x3fcfa2e2276c name="a" attributes=[#<Nokogiri::XML::Attr:0x3fcfa2e226e0 name="href" value="http://www.zocdoc.com/careers">, #<Nokogiri::XML::Attr:0x3fcfa2e2267c name="target" value="_blank">] children=[#<Nokogiri::XML::Text:0x3fcfa2e1ff1c "(hiring)">]>
So it successfully writes these entries into the jobs.html file but it is in XML format? Not sure how to target just the value and create a link from that. Not sure where to go from here. Thanks!
The problem is with how results is defined. results is an array of Nokogiri::XML::Element:
results = hire_links.each{|link| puts link['href']}
p results.class
#=> Array
p results.first.class
#=> Nokogiri::XML::Element
When you go to write the Nokogiri::XML::Element to the file, you get the results of inspecting it:
puts results.first.inspect
#=> "#<Nokogiri::XML::Element:0x15b9694 name="a" attributes=...."
Given that you want the href attribute of each link, you should collect that in the results instead:
results = hire_links.map{ |link| link['href'] }
Assuming you want each href/link displayed as a line in the file, you can join the array:
File.write('./jobs.html', results.join("\n"))
The modified script:
require 'nokogiri'
require 'open-uri'
def find_jobs
doc = Nokogiri::HTML(URI.open('http://nytm.org/made-in-nyc')) # URI.open: plain open no longer accepts URLs on Ruby 3.0+
hire_links = doc.css("a").select { |link| link.text == "(hiring)"}
results = hire_links.map { |link| link['href'] }
File.write('./jobs.html', results.join("\n"))
end
find_jobs
#=> produces a jobs.html with:
#=> http://www.20x200.com/jobs/
#=> http://www.8coupons.com/home/jobs
#=> http://jobs.about.com/index.html
#=> ...
Try using Mechanize. It leverages Nokogiri, and you can do something like
require 'mechanize'
browser = Mechanize.new
page = browser.get('http://nytm.org/made-in-nyc')
links = page.links_with(text: /\(hiring\)/) # escape the parentheses so they match literally
Then you will have an array of link objects from which you can get whatever info you want. You can also use the link.click method that Mechanize provides.
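One thing to watch with a regex filter like this: unescaped parentheses in a Ruby regex form a capture group, they do not match literal "(" and ")". A quick sketch in plain Ruby (the sample strings are invented) shows the difference:

```ruby
# Invented link texts for illustration.
texts = ['(hiring)', 'We are hiring engineers', 'About']

loose  = /(hiring)/   # capture group: matches any text containing "hiring"
strict = /\(hiring\)/ # escaped: matches the literal "(hiring)"

loose_matches  = texts.grep(loose)
strict_matches = texts.grep(strict)

# Regexp.escape builds the escaped pattern for you.
escaped = Regexp.new(Regexp.escape('(hiring)'))

p loose_matches
p strict_matches
p escaped == strict
```

On a page where every job link reads exactly "(hiring)" both patterns may happen to return the same links, but the escaped form will not pick up unrelated links that merely mention hiring.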

How do I print XPath value?

I want to print the contents of an XPath node. Here is what I have:
require "mechanize"
agent = Mechanize.new
agent.get("http://store.steampowered.com/promotion/snowglobefaq")
puts agent.xpath("//*[@id='item_52b3985a70d58']/div[4]")
This returns: <main>: undefined method xpath for #<Mechanize:0x2fa18c0> (NoMethodError).
I just started using Mechanize and I'm not sure what I'm doing. I've used Watir and assumed this would work, but it didn't.
You can use Nokogiri to parse the page after retrieving it. Here is the example code:
m = Mechanize.new
result = m.get("http://google.com")
html = Nokogiri::HTML(result.body)
divs = html.xpath('//div').map { |div| div.content } # here you can do whatever is needed with the divs
# I've mapped their content into an array
There are two things wrong:
The ID doesn't exist on that page. Try this to see the list of tag IDs available:
require "open-uri"
require 'nokogiri'
doc = Nokogiri::HTML(URI.open("http://store.steampowered.com/promotion/snowglobefaq")) # URI.open: plain open no longer accepts URLs on Ruby 3.0+
puts doc.search('[id*="item"]').map{ |n| n['id'] }.sort
The correct chain of methods is agent.page.xpath.
Because there is no sample HTML showing exactly which tag you want, we can't help you much.

Using Nokogiri to scrape a value from Yahoo Finance?

I wrote a simple script:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
url = "http://au.finance.yahoo.com/q/bs?s=MYGN"
doc = Nokogiri::HTML(open(url))
name = doc.at_css("#yfi_rt_quote_summary h2").text
market_cap = doc.at_css("#yfs_j10_mygn").text
ebit = doc.at("//*[@id='yfncsumtab']/tbody/tr[2]/td/table[2]/tbody/tr/td/table/tbody/tr[11]/td[2]/strong").text
puts "#{name} - #{market_cap} - #{ebit}"
The script grabs three values from Yahoo finance. The problem is that the ebit XPath returns nil. The way I got the XPath was using the Chrome developer tools and copy and pasting.
This is the page I'm trying to get the value from http://au.finance.yahoo.com/q/bs?s=MYGN and the actual value is 483,992 in the total current assets row.
Any help would be appreciated, especially if there is a way to get this value with CSS selectors.
Nokogiri supports the :contains pseudo-class, so you can find the cell by its label:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(URI.open("http://au.finance.yahoo.com/q/bs?s=MYGN")) # URI.open: plain open no longer accepts URLs on Ruby 3.0+
ebit = doc.at('strong:contains("Total Current Assets")').parent.next_sibling.text.gsub(/[^,\d]+/, '')
puts ebit
# >> 483,992
I'm using the <strong> tag as a place-marker with the :contains pseudo-class, then backing up to the containing <td>, moving to the next <td> and grabbing its text, then finally stripping the white-space using gsub(/[^,\d]+/, ''), which removes everything that isn't a digit or a comma.
Nokogiri supports a number of jQuery's JavaScript extensions, which is why :contains works.
This seems to work for me
doc.css("table.yfnc_tabledata1 tr[11] td[2]").text.tr(",","").to_i
#=> 483992
Or as a string
doc.css("table.yfnc_tabledata1 tr[11] td[2]").text.strip.gsub(/\u00A0/,"")
#=> "483,992"

Return just the value of an xpath - Nokogiri Ruby

I'm using xpath to get some values on a website like this
auction_page = Nokogiri::HTML open(a, "User-Agent" => theagent)
auction_links = auction_page.xpath('//iframe[contains(@src, "near")]/@src')
Which returns what I need like this
#<Nokogiri::XML::Attr:0x3fcd7bef5730 name="src" value="http://thevalue.com">
I just want to get the value, not the name or anything else. How do I do this?
I think you are looking for the .text method.
So auction_links.text should return "http://thevalue.com".
Edit:
If that doesn't work, try auction_links.first.value. xpath returns a NodeSet, so first gives you the first matching attribute node and value returns its string. ; )
For further reference, here is a great tutorial for basic Nokogiri Crawling/Parsing.
You could do this as below:
require 'nokogiri'
doc = Nokogiri::HTML::Document.parse <<-end
<a id = "foo" class="bar baz" href = "www.test.com"> click here </a>
end
doc.at_xpath("//a[contains(@class,'bar')]/@href").to_s
# => "www.test.com"
So in your case you can write:
auction_page.at_xpath('//iframe[contains(@src, "near")]/@src').to_s
# => "http://thevalue.com"

How to navigate a XML object in Ruby

I have a regular xml object created from a response of a web service.
I need to get some specific values from some specific keys... for example:
<tag>
<tag2>
<tag3>
<needThisValue>3</needThisValue>
<tag4>
<needThisValue2>some text</needThisValue2>
</tag4>
</tag3>
</tag2>
</tag>
How can I get <needThisValue> and <needThisValue2> in Ruby?
I'm a big fan of Nokogiri:
xml = <<EOT
<tag>
<tag2>
<tag3>
<needThisValue>3</needThisValue>
<tag4>
<needThisValue2>some text</needThisValue2>
</tag4>
</tag3>
</tag2>
</tag>
EOT
This creates a document for parsing:
require 'nokogiri'
doc = Nokogiri::XML(xml)
Use at to find the first node matching the accessor:
doc.at('needThisValue2').class # => Nokogiri::XML::Element
Or search to find all nodes matching the accessor as a NodeSet, which acts like an Array:
doc.search('needThisValue2').class # => Nokogiri::XML::NodeSet
doc.search('needThisValue2')[0].class # => Nokogiri::XML::Element
This uses a CSS accessor to locate the first instance of each node:
doc.at('needThisValue').text # => "3"
doc.at('needThisValue2').text # => "some text"
Again with the NodeSet using CSS:
doc.search('needThisValue')[0].text # => "3"
doc.search('needThisValue2')[0].text # => "some text"
You can use XPath accessors instead of CSS if you want:
doc.at('//needThisValue').text # => "3"
doc.search('//needThisValue2').first.text # => "some text"
Go through the tutorials to get a jumpstart. It's very powerful and quite easy to use.
require "rexml/document"
include REXML
doc = Document.new(xml) # xml holds the XML string from the web service response
puts XPath.first(doc, "//tag/tag2/tag3/needThisValue").text
puts XPath.first(doc, "//tag/tag2/tag3/tag4/needThisValue2").text
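For anyone who wants to run the REXML version as-is, here is a self-contained sketch. It uses the XML from the question, spells out the REXML:: prefix in place of include REXML, and shows that XPath.first returns nil for a path that matches nothing (noSuchTag is just a made-up name):

```ruby
require 'rexml/document'

# The sample XML from the question.
xml = <<~XML
  <tag>
    <tag2>
      <tag3>
        <needThisValue>3</needThisValue>
        <tag4>
          <needThisValue2>some text</needThisValue2>
        </tag4>
      </tag3>
    </tag2>
  </tag>
XML

doc = REXML::Document.new(xml)

first_value  = REXML::XPath.first(doc, '//tag/tag2/tag3/needThisValue').text
second_value = REXML::XPath.first(doc, '//needThisValue2').text # a shorter path also works
missing      = REXML::XPath.first(doc, '//noSuchTag')           # nil when nothing matches

puts first_value
puts second_value
p missing
```

Because a missing element comes back as nil, check the result (or use &.text) before calling text on it when the response shape is not guaranteed.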
Try this Nokogiri tutorial.
You'll need to install nokogiri gem.
Good luck.
Check out the Nokogiri gem. You can read some tutorials. It's fast and simple.

Resources