Get element text from XML doc - ruby

I'm trying to extract some information from XML from Weather Underground.
I can open the resource and pull out the desired elements, but I really want to return the element text as a variable, without the containing XML element tags, so I can manipulate it and display it on a web page.
Perhaps there is a way to do this using regexp to strip off the tags, but I suspect/hope I can do this in a more elegant fashion directly in Nokogiri.
Currently I am using irb to work out the syntax:
irb>require 'rubygems'
irb>require 'nokogiri'
irb>require 'open-uri'
irb>doc = Nokogiri::XML(open('http://api.wunderground.com/auto/wui/geo/WXCurrentObXML/index.xml?query=KBHB'))
=> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
=> <?xml version="1.0"?>
# [...]
<!-- 0.036:0 -->
irb>doc.xpath('/current_observation/weather')
=> <weather>Clear</weather>
irb>doc.xpath('/current_observation/wind_dir')
=> <wind_dir>North</wind_dir>
irb>doc.xpath('/current_observation/wind_mph')
=> <wind_mph>10</wind_mph>
irb>doc.xpath('/current_observation/pressure_string')
=> <pressure_string>31.10 in (1053 mb)</pressure_string>
I need help with the specific syntax. I have tried constructs such as:
doc.xpath.element('/current_observation/weather')
doc.xpath.text('/current_observation/weather')
doc.xpath.node('/current_observation/weather')
doc.xpath.element.text('/current_observation/weather')
All return errors.

As per XPath, you can return the text node of an element with text().
In your example it should be doc.xpath('/current_observation/weather/text()') to get the content of weather's text node.

Something like this works for me:
irb(main):019:0> doc.xpath('//current_observation/weather').first.content
=> "Clear"

One of the nice things about Nokogiri is its flexibility when writing accessors. You're not limited to XPath; you can use CSS accessors too:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::XML(open('http://api.wunderground.com/auto/wui/geo/WXCurrentObXML/index.xml?query=KBHB'))
weather_report = %w[weather wind_dir wind_mph pressure_string].inject({}) { |h, n|
h[n.to_sym] = doc.at('current_observation ' << n).text
h
}
weather_report # => {:weather=>"Overcast", :wind_dir=>"South", :wind_mph=>"6", :pressure_string=>"29.67 in (1005 mb)"}

Related

Nokogiri : find all the anchors that match a name

I'm trying to save only the links to the sample pages on this website:
MusicRadar
require 'open-uri'
require 'nokogiri'
link = 'https://www.musicradar.com/news/tech/free-music-samples-royalty-free-loops-hits-and-multis-to-download'
html = OpenURI.open_uri(link)
doc = Nokogiri::HTML(html)
#used grep because every sample link in that page ends with '-samples'
doc.xpath('//div/a/@href').grep(/-samples/)
The problem is that it only finds 3 of those links.
What am I doing wrong?
And what if I wanted to open each of those links?
CSS selectors are often more convenient than XPath, provided the document structure is good enough for them.
Your XPath is roughly equivalent to the CSS selector div > a, which matches only links that are direct children of a div. That misses some of the links, for example those inside a p.
If you need all links whose href contains -samples, you can use the *= attribute selector:
doc.css('a[href*="-samples"]') # return Nokogiri::XML::NodeSet with matched elements
doc.css('a[href*="-samples"]').map { |a| a[:href] } # return array of URLS

Add Nokogiri parse result to variable

I have an XML document:
<cred>
<login>Tove</login>
<pass>Jani</pass>
</cred>
My code is:
require 'nokogiri'
require 'selwet'
context "parse xml" do
doc = Nokogiri::XML(File.open("test.xml"))
doc.xpath("cred/login").each do
|char_element|
puts char_element.text
end
should "check" do
Unit.go_to "http://www.ya.ru/"
Unit.click '.b-inline'
Unit.fill '[name="login"]', @login
end
When I run my test I get:
Tove
0
But I want to insert the parse result into @login. How can I get variables with the parsing result? I need to insert the login and pass values from the XML into fields on the web page.
You can get value of login from your XML with
@login = doc.xpath('//cred/login').text
I'd use something like this to get the values:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<cred>
<login>Tove</login>
<pass>Jani</pass>
</cred>
EOT
login = doc.at('login').text # => "Tove"
pass = doc.at('pass').text # => "Jani"
Nokogiri makes it really easy to access values using CSS, so use it for readability when possible. The same thing can be done using XPath:
login = doc.at('//login').text # => "Tove"
pass = doc.at('//pass').text # => "Jani"
but having to add // twice to accomplish the same thing is usually wasted effort.
The important part is at, which returns the first occurrence of the target. at allows us to use either CSS or XPath, but CSS is usually less visually noisy.

How do I print XPath value?

I want to print the contents of an XPath node. Here is what I have:
require "mechanize"
agent = Mechanize.new
agent.get("http://store.steampowered.com/promotion/snowglobefaq")
puts agent.xpath("//*[@id='item_52b3985a70d58']/div[4]")
This returns: <main>: undefined method xpath for #<Mechanize:0x2fa18c0> (NoMethodError).
I just started using Mechanize and have no idea what I'm doing, however, I've used Watir and thought this would work but it didn't.
You can use Nokogiri to parse the page after retrieving it. Here is some example code:
m = Mechanize.new
result = m.get("http://google.com")
html = Nokogiri::HTML(result.body)
divs = html.xpath('//div').map { |div| div.content } # here you can do whatever is needed with the divs
# I've mapped their content into an array
There are two things wrong:
The ID doesn't exist on that page. Try this to see the list of tag IDs available:
require "open-uri"
require 'nokogiri'
doc = Nokogiri::HTML(open("http://store.steampowered.com/promotion/snowglobefaq"))
puts doc.search('[id*="item"]').map{ |n| n['id'] }.sort
The correct chain of methods is agent.page.xpath.
Because there is no sample HTML showing exactly which tag you want, we can't help you much.

Using Nokogiri to scrape a value from Yahoo Finance?

I wrote a simple script:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
url = "http://au.finance.yahoo.com/q/bs?s=MYGN"
doc = Nokogiri::HTML(open(url))
name = doc.at_css("#yfi_rt_quote_summary h2").text
market_cap = doc.at_css("#yfs_j10_mygn").text
ebit = doc.at("//*[@id='yfncsumtab']/tbody/tr[2]/td/table[2]/tbody/tr/td/table/tbody/tr[11]/td[2]/strong").text
puts "#{name} - #{market_cap} - #{ebit}"
The script grabs three values from Yahoo finance. The problem is that the ebit XPath returns nil. The way I got the XPath was using the Chrome developer tools and copy and pasting.
This is the page I'm trying to get the value from http://au.finance.yahoo.com/q/bs?s=MYGN and the actual value is 483,992 in the total current assets row.
Any help would be appreciated, especially if there is a way to get this value with CSS selectors.
Nokogiri supports CSS selectors for this:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open("http://au.finance.yahoo.com/q/bs?s=MYGN"))
ebit = doc.at('strong:contains("Total Current Assets")').parent.next_sibling.text.gsub(/[^,\d]+/, '')
puts ebit
# >> 483,992
I'm using the <strong> tag as a placeholder with the :contains pseudo-class, then backing up to the containing <td>, moving to the next <td> and grabbing its text, then finally stripping the whitespace using gsub(/[^,\d]+/, '') which removes everything that isn't a digit or a comma.
Nokogiri supports a number of jQuery's JavaScript extensions, which is why :contains works.
This seems to work for me
doc.css("table.yfnc_tabledata1 tr[11] td[2]").text.tr(",","").to_i
#=> 483992
Or as a string
doc.css("table.yfnc_tabledata1 tr[11] td[2]").text.strip.gsub(/\u00A0/,"")
#=> "483,992"

How to navigate a XML object in Ruby

I have a regular xml object created from a response of a web service.
I need to get some specific values from some specific keys... for example:
<tag>
<tag2>
<tag3>
<needThisValue>3</needThisValue>
<tag4>
<needThisValue2>some text</needThisValue2>
</tag4>
</tag3>
</tag2>
</tag>
How can I get <needThisValue> and <needThisValue2> in Ruby?
I'm a big fan of Nokogiri:
xml = <<EOT
<tag>
<tag2>
<tag3>
<needThisValue>3</needThisValue>
<tag4>
<needThisValue2>some text</needThisValue2>
</tag4>
</tag3>
</tag2>
</tag>
EOT
This creates a document for parsing:
require 'nokogiri'
doc = Nokogiri::XML(xml)
Use at to find the first node matching the accessor:
doc.at('needThisValue2').class # => Nokogiri::XML::Element
Or search to find all nodes matching the accessor as a NodeSet, which acts like an Array:
doc.search('needThisValue2').class # => Nokogiri::XML::NodeSet
doc.search('needThisValue2')[0].class # => Nokogiri::XML::Element
This uses a CSS accessor to locate the first instance of each node:
doc.at('needThisValue').text # => "3"
doc.at('needThisValue2').text # => "some text"
Again with the NodeSet using CSS:
doc.search('needThisValue')[0].text # => "3"
doc.search('needThisValue2')[0].text # => "some text"
You can use XPath accessors instead of CSS if you want:
doc.at('//needThisValue').text # => "3"
doc.search('//needThisValue2').first.text # => "some text"
Go through the tutorials to get a jumpstart. It's very powerful and quite easy to use.
require "rexml/document"
include REXML
doc = Document.new(string) # string holds the XML text shown in the question
puts XPath.first(doc, "//tag/tag2/tag3/needThisValue").text
puts XPath.first(doc, "//tag/tag2/tag3/tag4/needThisValue2").text
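For completeness, here is a self-contained version of the REXML approach, with the XML from the question inlined so it can be run as-is. It assumes the rexml library is available; it ships with older Rubies and is a bundled gem in newer ones. The module is referenced by its full name rather than included.

```ruby
require 'rexml/document'

# The XML from the question, inlined so the example runs on its own.
xml = <<~XML
  <tag>
    <tag2>
      <tag3>
        <needThisValue>3</needThisValue>
        <tag4>
          <needThisValue2>some text</needThisValue2>
        </tag4>
      </tag3>
    </tag2>
  </tag>
XML

doc = REXML::Document.new(xml)

# XPath.first returns the first matching element; .text returns its text content.
value  = REXML::XPath.first(doc, '//tag/tag2/tag3/needThisValue').text
value2 = REXML::XPath.first(doc, '//tag/tag2/tag3/tag4/needThisValue2').text

puts value   # 3
puts value2  # some text
```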
Try this Nokogiri tutorial.
You'll need to install nokogiri gem.
Good luck.
Check out the Nokogiri gem. You can read some tutorials. It's fast and simple.
