How to navigate an XML object in Ruby

I have a regular xml object created from a response of a web service.
I need to get some specific values from some specific keys... for example:
<tag>
<tag2>
<tag3>
<needThisValue>3</needThisValue>
<tag4>
<needThisValue2>some text</needThisValue2>
</tag4>
</tag3>
</tag2>
</tag>
How can I get <needThisValue> and <needThisValue2> in Ruby?

I'm a big fan of Nokogiri:
xml = <<EOT
<tag>
<tag2>
<tag3>
<needThisValue>3</needThisValue>
<tag4>
<needThisValue2>some text</needThisValue2>
</tag4>
</tag3>
</tag2>
</tag>
EOT
This creates a document for parsing:
require 'nokogiri'
doc = Nokogiri::XML(xml)
Use at to find the first node matching the accessor:
doc.at('needThisValue2').class # => Nokogiri::XML::Element
Or search to find all nodes matching the accessor as a NodeSet, which acts like an Array:
doc.search('needThisValue2').class # => Nokogiri::XML::NodeSet
doc.search('needThisValue2')[0].class # => Nokogiri::XML::Element
This uses a CSS accessor to locate the first instance of each node:
doc.at('needThisValue').text # => "3"
doc.at('needThisValue2').text # => "some text"
Again with the NodeSet using CSS:
doc.search('needThisValue')[0].text # => "3"
doc.search('needThisValue2')[0].text # => "some text"
You can use XPath accessors instead of CSS if you want:
doc.at('//needThisValue').text # => "3"
doc.search('//needThisValue2').first.text # => "some text"
Go through the tutorials to get a jumpstart. It's very powerful and quite easy to use.

require "rexml/document"
include REXML
# string holds the XML text, e.g. the body of the web service response
doc = Document.new string
puts XPath.first(doc, "//tag/tag2/tag3/needThisValue").text
puts XPath.first(doc, "//tag/tag2/tag3/tag4/needThisValue2").text
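For completeness, here is a self-contained sketch of the REXML approach that runs against the sample XML from the question. It avoids the top-level include by spelling out the REXML:: prefix; the variable names are mine:

```ruby
require 'rexml/document'

xml = <<~EOT
  <tag>
    <tag2>
      <tag3>
        <needThisValue>3</needThisValue>
        <tag4>
          <needThisValue2>some text</needThisValue2>
        </tag4>
      </tag3>
    </tag2>
  </tag>
EOT

doc = REXML::Document.new(xml)

# XPath.first returns the first matching element
value  = REXML::XPath.first(doc, '//needThisValue').text
value2 = REXML::XPath.first(doc, '//needThisValue2').text

# XPath.match returns every match as an Array
all_values = REXML::XPath.match(doc, '//tag4/needThisValue2').map(&:text)
```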

Try this Nokogiri tutorial.
You'll need to install nokogiri gem.
Good luck.

Check out the Nokogiri gem and read some of its tutorials. It's fast and simple.

Related

How to get Nokogiri inner_HTML object to ignore/remove escape sequences

Currently, I am trying to get the inner HTML of an element on a page using Nokogiri. However, I'm not just getting the text of the element, I'm also getting its escape sequences. Is there a way I can suppress or remove them with Nokogiri?
require 'nokogiri'
require 'open-uri'
page = Nokogiri::HTML(URI.open("http://the.page.url.com")) # Kernel#open no longer opens URLs as of Ruby 3.0
page.at_css("td[custom-attribute='foo']").parent.css('td').css('a').inner_html
This returns => "\r\n\t\t\t\t\t\t\t\tTheActuallyInnerContentThatIWant\r\n\t"
What is the most effective and direct Nokogiri (or Ruby) way of doing this?
page.at_css("td[custom-attribute='foo']")
.parent
.css('td')
.css('a')
.text # you want the text, not inner_html
.strip # removes leading and trailing whitespace
See String#strip.
Sidenote: css('td a') is likely more efficient than css('td').css('a').
It's important to drill in to the closest node containing the text you want. Consider this:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<p>foo</p>
</body>
</html>
EOT
doc.at('body').inner_html # => "\n <p>foo</p>\n "
doc.at('body').text # => "\n foo\n "
doc.at('p').inner_html # => "foo"
doc.at('p').text # => "foo"
at, at_css and at_xpath return a Node/XML::Element. search, css and xpath return a NodeSet. There's a big difference in how text or inner_html return information when looking at a Node or NodeSet:
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<p>foo</p>
<p>bar</p>
</body>
</html>
EOT
doc.at('p') # => #<Nokogiri::XML::Element:0x3fd635cf36f4 name="p" children=[#<Nokogiri::XML::Text:0x3fd635cf3514 "foo">]>
doc.search('p') # => [#<Nokogiri::XML::Element:0x3fd635cf36f4 name="p" children=[#<Nokogiri::XML::Text:0x3fd635cf3514 "foo">]>, #<Nokogiri::XML::Element:0x3fd635cf32bc name="p" children=[#<Nokogiri::XML::Text:0x3fd635cf30dc "bar">]>]
doc.at('p').class # => Nokogiri::XML::Element
doc.search('p').class # => Nokogiri::XML::NodeSet
doc.at('p').text # => "foo"
doc.search('p').text # => "foobar"
Notice that using search returned a NodeSet and that text returned the text of all its nodes concatenated together. This is rarely what you want.
Also notice that Nokogiri is smart enough to figure out whether a selector is CSS or XPath 99% of the time, so using the generic search and at for either type of selector is very convenient.

Remove only anchor tag from string

In controller:
str= "Employee <b><a href=http://xyz.localhost.in:3000/admin/company>Uday Das</a></b> has applied for leave."
I want to remove the anchor tag from the above string, so that I get Employee <b>Uday Das</b> has applied for leave.
I used this code:
ActionView::Base.full_sanitizer.sanitize(str)
But it removes all the HTML tags from the string, so I end up with Employee Uday Das has applied for leave.
NOTE: The strings are dynamic; the anchor tag's position is not fixed and it could be anywhere in the string.
You can use the nokogiri gem.
Something like:
require 'nokogiri'
doc = Nokogiri::HTML str
node = doc.at("a")
node.replace(node.text)
puts doc.inner_html
# <html><body><p>Employee <b>Uday Das</b> has applied for leave.</p></body></html>
or to match your exact output:
puts doc.at("p").inner_html
# Employee <b>Uday Das</b> has applied for leave.
I got a simple solution:
include ActionView::Helpers::SanitizeHelper
sanitize(str, :tags=>["b"])
For links, you can use strip_links method from ActionView::Helpers::SanitizeHelper
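strip_links('<a href="http://www.rubyonrails.org/">Ruby on Rails</a>')
# => Ruby on Rails
strip_links('Please e-mail me at <a href="mailto:me@email.com">me@email.com</a>.')
# => Please e-mail me at me@email.com.
strip_links('Blog: <a href="http://www.myblog.com/" class="nav" target="_blank">Visit</a>.')
# => Blog: Visit.
strip_links('<<a href="https://example.org">malformed & link</a>')
# => &lt;malformed &amp; link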

Nokogiri check XML root/file validity

Is there a simple method/way to check if a Nokogiri XML file has a proper root, like xml.valid? A way to check if the XML file contains specific content is very welcome as well.
I'm thinking of something like xml.valid? or xml.has_valid_root?. Thanks!
How are you going to determine what is a proper root?
<foo></foo>
has a proper root:
require 'nokogiri'
xml = '<foo></foo>'
doc = Nokogiri::XML(xml)
doc.root # => #<Nokogiri::XML::Element:0x3fd3a9471b7c name="foo">
Nokogiri has no way of determining that something else should have been the root. You might be able to test if you have foreknowledge of what the root node's name should be:
doc_root_ok = (doc.root.name == 'foo')
doc_root_ok # => true
You can see if the document parsed was well-formed (not needing any fixup), by looking at errors:
doc.errors # => []
If Nokogiri had to fix up the document in order to parse it, errors will return a list of the problems it found:
xml = '<foo><bar><bar></foo>'
doc = Nokogiri::XML(xml)
doc.errors # => [#<Nokogiri::XML::SyntaxError: Opening and ending tag mismatch: bar line 1 and foo>, #<Nokogiri::XML::SyntaxError: Premature end of data in tag bar line 1>, #<Nokogiri::XML::SyntaxError: Premature end of data in tag foo line 1>]
A common and useful pattern is
doc = Nokogiri::XML(xml) do |config|
config.strict
end
This will throw a wobbly (it raises Nokogiri::XML::SyntaxError) if the document is not well formed. I like to do this in order to prevent Nokogiri from being too kind to my XML.

Getting link from Mechanize/Nokogiri

I am trying to discover the best way to retrieve the a href link from a Nokogiri Node. Here is where I am so far:
mech = Mechanize.new
mech.get(HOME_URL)
mech.page.search('.listing_content').each do |business|
website = business.css('.website-feature')
puts website.class
puts website.inner_html
end
output =>
Nokogiri::XML::NodeSet
<span class="raquo">»</span> Website
Basically, I just need to get the http://urlofsite.com out of the inner_html, and I'm not sure how to do that. I've read about doing it with CSS and XPATH but I can't get either to work at this point. Thanks for any help
First, tell Nokogiri to get a node, rather than a NodeSet. at_css will retrieve the Node and css retrieves a NodeSet, which is like an Array.
Instead of:
website = business.css('.website-feature')
Try:
website = business.at_css('a.website-feature')
to retrieve the first instance of an <a> node with class="website-feature". If it's not the first instance you want then you'll need to narrow it down by grabbing the NodeSet and then indexing into it. Without the surrounding HTML it's difficult to help more.
To get the href parameter from a Node, simply treat the node like a hash:
website['href']
should return:
http://urlofsite.com
Here's a little sample from IRB:
irb(main):001:0> require 'nokogiri'
=> true
irb(main):002:0>
irb(main):003:0* html = '<a class="this_node" href="http://example.com">'
=> "<a class=\"this_node\" href=\"http://example.com\">"
irb(main):004:0> doc = Nokogiri::HTML.parse(html)
=> #<Nokogiri::HTML::Document:0x8041e2ec name="document" children=[#<Nokogiri::XML::DTD:0x8041d20c name="html">, #<Nokogiri::XML::Element:0x805a2a14 name="html" children=[#<Nokogiri::XML::Element:0x805df8b0 name="body" children=[#<Nokogiri::XML::Element:0x8084c5d0 name="a" attributes=[#<Nokogiri::XML::Attr:0x80860170 name="class" value="this_node">, #<Nokogiri::XML::Attr:0x8086047c name="href" value="http://example.com">]>]>]>]>
irb(main):005:0>
irb(main):006:0* doc.at_css('a.this_node')['href']
=> "http://example.com"
irb(main):007:0>

Get element text from XML doc

I'm trying to extract some information from XML from Weather Underground.
I can open the resource and pull out the desired elements, but I really want to return the element text as a variable, without the containing XML element tags, so I can manipulate it and display it on a web page.
Perhaps there is a way to do this using regexp to strip off the tags, but I suspect/hope I can do this in a more elegant fashion directly in Nokogiri.
Currently I am using irb to work out the syntax:
irb>require 'rubygems'
irb>require 'nokogiri'
irb>require 'open-uri'
irb>doc = Nokogiri::XML(URI.open('http://api.wunderground.com/auto/wui/geo/WXCurrentObXML/index.xml?query=KBHB'))
=> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
=> <?xml version="1.0"?>
# [...]
<!-- 0.036:0 -->
irb>doc.xpath('/current_observation/weather')
=> <weather>Clear</weather>
irb>doc.xpath('/current_observation/wind_dir')
=> <wind_dir>North</wind_dir>
irb>doc.xpath('/current_observation/wind_mph')
=> <wind_mph>10</wind_mph>
irb>doc.xpath('/current_observation/pressure_string')
=> <pressure_string>31.10 in (1053 mb)</pressure_string>
I need help with the specific syntax while using constructs such as:
doc.xpath.element('/current_observation/weather')
doc.xpath.text('/current_observation/weather')
doc.xpath.node('/current_observation/weather')
doc.xpath.element.text('/current_observation/weather')
All return errors.
As per XPath, you can return the text node of an element with text().
In your example it should be doc.xpath('/current_observation/weather/text()') to get the content of weather's text node.
Something like this works for me:
irb(main):019:0> doc.xpath('//current_observation/weather').first.content
=> "Clear"
One of the nice things about Nokogiri is its flexibility when writing accessors. You're not limited to XPath; you can use CSS accessors instead:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::XML(URI.open('http://api.wunderground.com/auto/wui/geo/WXCurrentObXML/index.xml?query=KBHB'))
weather_report = %w[weather wind_dir wind_mph pressure_string].inject({}) { |h, n|
h[n.to_sym] = doc.at('current_observation ' << n).text
h
}
weather_report # => {:weather=>"Overcast", :wind_dir=>"South", :wind_mph=>"6", :pressure_string=>"29.67 in (1005 mb)"}
