Parsing an XML file with Nokogiri to determine the path (Ruby) - ruby

My code is supposed to "guess" the path(s) that lies before the relevant text nodes in my XML file. Relevant in this case means: text nodes nested within the recurring product/person/something tag, but not text nodes that are used outside of it.
This code:
@doc, items = Nokogiri.XML(@file), []
path = []
@doc.traverse do |node|
if node.class.to_s == "Nokogiri::XML::Element"
is_path_element = false
node.children.each do |child|
is_path_element = true if child.class.to_s == "Nokogiri::XML::Element"
end
path.push(node.name) if is_path_element == true && !path.include?(node.name)
end
end
final_path = "/"+path.reverse.join("/")
works for simple XML files, for example:
<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
<channel>
<title>Some XML file title</title>
<description>Some XML file description</description>
<item>
<title>Some product title</title>
<brand>Some product brand</brand>
</item>
<item>
<title>Some product title</title>
<brand>Some product brand</brand>
</item>
</channel>
</rss>
puts final_path # => "/rss/channel/item"
But when it gets more complicated, how should I then approach the challenge? For example with this one:
<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
<channel>
<title>Some XML file title</title>
<description>Some XML file description</description>
<item>
<titles>
<title>Some product title</title>
</titles>
<brands>
<brand>Some product brand</brand>
</brands>
</item>
<item>
<titles>
<title>Some product title</title>
</titles>
<brands>
<brand>Some product brand</brand>
</brands>
</item>
</channel>
</rss>

If you are looking for a list of deepest "parent" paths in the XML, there is more than one way to view that.
Although I think your own code could be adjusted to produce the same output, I was convinced the same thing could be achieved with XPath. My other motivation is to shake the rust off my XML skills (I have not used Nokogiri yet, but will need to professionally soon). So here is how to get all parent paths that have just one level of children beneath them, using XPath:
xml.xpath('//*[child::* and not(child::*/*)]').each { |node| puts node.path }
The output of this for your second example file is:
/rss/channel/item[1]/titles
/rss/channel/item[1]/brands
/rss/channel/item[2]/titles
/rss/channel/item[2]/brands
If you take this list, gsub out the indexes, and then make the array unique, the result looks a lot like the output of your loop:
paths = xml.xpath('//*[child::* and not(child::*/*)]').map { |node| node.path }
paths.map! { |path| path.gsub(/\[[0-9]+\]/,'') }.uniq!
=> ["/rss/channel/item/titles", "/rss/channel/item/brands"]
Or in one line:
paths = xml.xpath('//*[* and not(*/*)]').map { |node| node.path.gsub(/\[[0-9]+\]/,'') }.uniq
=> ["/rss/channel/item/titles", "/rss/channel/item/brands"]
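The index-stripping step can be tried on its own, without Nokogiri. A minimal sketch, assuming the four paths printed above as input; the final step (taking the parent of each group to get back to the recurring record element) is my assumption about what the question is after, not something the answer states:

```ruby
# Paths as printed by node.path for the second example file.
indexed_paths = [
  "/rss/channel/item[1]/titles",
  "/rss/channel/item[1]/brands",
  "/rss/channel/item[2]/titles",
  "/rss/channel/item[2]/brands"
]

# Drop positional predicates such as [1], then de-duplicate.
# Plain uniq is used here because uniq! returns nil when nothing was removed.
generic_paths = indexed_paths.map { |path| path.gsub(/\[\d+\]/, "") }.uniq

# Assumed extra step: cut off the last segment to get the recurring element.
record_paths = generic_paths.map { |path| path.rpartition("/").first }.uniq

puts generic_paths
puts record_paths
```

For the second example file this leaves a single record path, which matches what the original loop produced for the simple file.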

I created a library to build XPath expressions.
xpath = Jini.new
.add_path('parent')
.add_path('child')
.add_all('toys')
.add_attr('name', 'plane')
.to_s
puts xpath # -> /parent/child//toys[@name="plane"]

Related

How can I make all XML tags lowercase in Nokogiri?

I'm parsing some XML that I get from various feeds. Apparently some of the XML has an occasional tag that is all upper case. I'd like to normalize the XML to be all lower case tags to make searching, etc. easier.
What I want to do is something like:
parsed = Nokogiri::XML.parse(xml_content)
node = parsed.css("title") # => should return a Nokogiri node for the title tag
However, some of the XML documents have "TITLE" for that tag.
What are my options for getting that node whether its tag is "title", "TITLE", or even "Title"?
Thanks!
If you want to transform your xml document by downcase'ing all tag names, here's one way to do it:
parsed = Nokogiri::XML.parse(xml_content)
parsed.traverse do |node|
node.name = node.name.downcase if node.kind_of?(Nokogiri::XML::Element)
end
As a general approach you could transform all element (tag) names to lower case (e.g. by using XSLT or another solution) and then do all of your XPath/CSS queries using lower case only.
This XSLT solution should work; however, my version of Ruby (2.0.0p481) and/or Nokogiri (1.5.6) complains mysteriously (perhaps about the use of the "lower-case(...)" function? Perhaps Nokogiri doesn't support XSLT v2?)
Here's a solution that seems to work:
require 'nokogiri'
xslt = Nokogiri::XSLT(File.read('lower.xslt'))
# <?xml version="1.0" encoding="UTF-8"?>
# <xsl:transform version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
# <xsl:variable name="lowercase" select="'abcdefghijklmnopqrstuvwxyz'" />
# <xsl:variable name="uppercase" select="'ABCDEFGHIJKLMNOPQRSTUVWXYZ'" />
# <xsl:template match="*">
# <xsl:element name="{translate(local-name(), $uppercase, $lowercase)}">
# <xsl:apply-templates />
# </xsl:element>
# </xsl:template>
# </xsl:transform>
doc = Nokogiri::XML(File.read('doc.xml'))
# <?xml version="1.0" encoding="UTF-8"?>
# <FOO>
# <BAR>Bar</BAR>
# <GAH>Gah</GAH>
# <ZIP><DOO><DAH/></DOO></ZIP>
# </FOO>
puts xslt.transform(doc)
# <?xml version="1.0"?>
# <foo>
# <bar>Bar</bar>
# <gah>Gah</gah>
# <zip><doo><dah/></doo></zip>
# </foo>
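Another option is to leave the document alone and compare downcased names when selecting. A minimal sketch of that idea, written against REXML (bundled with Ruby) instead of Nokogiri so it runs without extra gems; the sample feed is made up for illustration:

```ruby
require "rexml/document"

# Made-up feed with the same tag in three different casings.
xml = "<FEED><TITLE>One</TITLE><item><Title>Two</Title><title>Three</title></item></FEED>"
doc = REXML::Document.new(xml)

# Select every element, then keep those whose name matches case-insensitively.
titles = REXML::XPath.match(doc, "//*").select { |el| el.name.downcase == "title" }
title_texts = titles.map(&:text)

puts title_texts
```

The same select-and-filter approach should carry over to a Nokogiri node set, since each node there also exposes its name, but that variant is not tested here.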

Encode content as CDATA in generated RSS feed

I'm generating an RSS feed using Ruby's built-in RSS library, which seems to escape HTML when generating feeds. For certain elements I'd prefer that it preserve the original HTML by wrapping it in a CDATA block.
A minimal working example:
require 'rss/2.0'
feed = RSS::Rss.new("2.0")
feed.channel = RSS::Rss::Channel.new
feed.channel.title = "Title & Show"
feed.channel.link = "http://foo.net"
feed.channel.description = "<strong>Description</strong>"
item = RSS::Rss::Channel::Item.new
item.title = "Foo & Bar"
item.description = "<strong>About</strong>"
feed.channel.items << item
puts feed
...which generates the following RSS:
<?xml version="1.0"?>
<rss version="2.0">
<channel>
<title>Title &amp; Show</title>
<link>http://foo.net</link>
<description>&lt;strong&gt;Description&lt;/strong&gt;</description>
<item>
<title>Foo &amp; Bar</title>
<description>&lt;strong&gt;About&lt;/strong&gt;</description>
</item>
</channel>
</rss>
Instead of HTML-encoding the channel and item descriptions, I'd like to keep the original HTML and wrap them in CDATA blocks, e.g.:
<description><![CDATA[<strong>Description</strong>]]></description>
Monkey-patching the element-generating method works for me:
require 'rss/2.0'
class RSS::Rss::Channel
def description_element need_convert, indent
markup = "#{indent}<description>"
markup << "<![CDATA[#{@description}]]>"
markup << "</description>"
markup
end
end
# ...
This prevents the call to Utils.html_escape, which escapes a few special entities.
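One thing to watch with hand-built CDATA: the content must not contain the sequence ]]> itself, because that ends the section early. A minimal sketch of a helper that handles this by splitting the sequence across two sections; cdata_wrap is a hypothetical name, not part of the RSS library:

```ruby
# Hypothetical helper: wrap text in a CDATA section.
# "]]>" cannot appear inside CDATA, so each occurrence is split
# across two adjacent sections ("]]" ends one, ">" starts the next).
def cdata_wrap(text)
  "<![CDATA[" + text.to_s.gsub("]]>", "]]]]><![CDATA[>") + "]]>"
end

plain  = cdata_wrap("<strong>Description</strong>")
tricky = cdata_wrap("a ]]> b")

puts plain   # <![CDATA[<strong>Description</strong>]]>
puts tricky  # <![CDATA[a ]]]]><![CDATA[> b]]>
```

A helper like this could replace the inline string interpolation inside the patched method if the descriptions come from untrusted input.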

How to use Nokogiri's noblanks

I have an XML document:
<?xml version="1.0"?>
<installation id="ayfw-a">
</installation>
I am adding a child node to this document like this:
data = Nokogiri::XML(IO.read('file')) { |doc| doc.noblanks }
new_record = Nokogiri::XML::Node.new('tag', data)
data.root.add_child(new_record)
File.open('file', 'w') { |dh_file| dh_file.write(data.to_xml(:indent => 4)) }
With this code I get this inside my file:
<?xml version="1.0"?>
<installation id="ayfw-a">
<tag/></installation>
Here the noblanks does not work.
However, if before inserting the new node my file already has a child node, noblanks works fine:
Before inserting new node:
<?xml version="1.0"?>
<installation id="ayfw-a">
<!---->
</installation>
After inserting new node:
<?xml version="1.0"?>
<installation id="ayfw-a">
<!---->
<tag/>
</installation>
So, it looks like noblanks works only if it already sees the "pattern". Is there any way I can correctly indent my XML if it does not have any children yet?
Perhaps noblanks is not the right option to use, but for some reason it works if I already have some nodes under <installation>. Basically what I currently have when adding a child node is this:
<?xml version="1.0"?>
<installation id="ayfw-a">
<tag/></installation>
What I need to have is this:
<?xml version="1.0"?>
<installation id="ayfw-a">
<tag/>
</installation>
And the child nodes I add must be empty, with some attributes which I suppressed for simplicity.
Your two examples are befuddling: they both show the exact same behavior, yet you say one of them does something different.
As far as I can tell, specifying noblanks never gets rid of an empty node:
xml.xml:
<?xml version="1.0"?>
<root>
<installation id="ayfw-a"></installation>
<dog></dog>
<cat/>
</root>
.
require 'nokogiri'
data = Nokogiri::XML(IO.read('xml.xml')) { |doc| doc.noblanks }
puts data
--output:--
<?xml version="1.0"?>
<root>
<installation id="ayfw-a"/>
<dog/>
<cat/>
</root>
I would expect the output to be:
<root>
<installation id="ayfw-a"></installation>
</root>
Of course, the terrible Nokogiri docs (typical of Ruby) do not define what a blank node is. Apparently, the extent of what noblanks does is convert nodes like this:
<dog></dog>
to:
<dog/>
Ahh, so your problem is with the pretty printing of your XML. Okay, I see the same problem you do. Let me show you how you could have asked your question:
I am having trouble formatting my XML the way I want to:
xml.xml:
<?xml version="1.0"?>
<installation id="ayfw-a">
</installation>
.
require 'nokogiri'
data = Nokogiri::XML(IO.read('xml.xml')) {|doc| doc.noblanks}
new_record = Nokogiri::XML::Node.new('tag', data)
data.root.add_child(new_record)
puts data.to_xml(indent: 4, indent_text: ".")
--output:--
<?xml version="1.0"?>
<installation id="ayfw-a">
<tag/></installation>
The to_xml() method doesn't seem to work correctly. I expected the output to be:
<?xml version="1.0"?>
<installation id="ayfw-a">
....<tag/>
</installation>
But the to_xml() method does format the output the way I want when the tag has a pre-existing child node:
xml.xml:
<?xml version="1.0"?>
<installation id="ayfw-a">
<dog>Rover</dog>
</installation>
.
require 'nokogiri'
data = Nokogiri::XML(IO.read('xml.xml')) {|doc| doc.noblanks}
new_record = Nokogiri::XML::Node.new('tag', data)
data.root.add_child(new_record)
puts data.to_xml(indent: 4, indent_text: ".")
--output:--
<?xml version="1.0"?>
<installation id="ayfw-a">
....<dog>Rover</dog>
....<tag/>
</installation>
How do I get Nokogiri to format the output the way I want it in the first case?
It doesn't look like Nokogiri has a very good pretty printer. It seems that REXML has a better pretty printer than Nokogiri:
xml.xml:
<?xml version="1.0"?>
<installation id="ayfw-a">
</installation>
.
require 'nokogiri'
data = Nokogiri::XML(IO.read('xml.xml')) {|doc| doc.noblanks}
new_record = Nokogiri::XML::Node.new('tag', data)
data.root.add_child(new_record)
puts data.to_xml(indent: 4, indent_text: ".")
require "rexml/document"
File.open("output.txt", "w") { |f| REXML::Document.new(data.to_xml).write(f, 4) }
--output:--
<installation id="ayfw-a">
<tag/></installation>
$ cat output.txt
<?xml version='1.0'?>
<installation id='ayfw-a'>
<tag/>
</installation>
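If you want more control over the REXML side, the pretty formatter can also be used directly. A minimal sketch, assuming the Nokogiri output has already been turned into a string like the one below (hard-coded here so the snippet runs on its own):

```ruby
require "rexml/document"
require "rexml/formatters/pretty"

# Stand-in for data.to_xml from the Nokogiri code above.
doc = REXML::Document.new('<installation id="ayfw-a"><tag/></installation>')

formatter = REXML::Formatters::Pretty.new(4) # indent by 4 spaces
formatter.compact = true                     # keep short text-only elements on one line

pretty = +""
formatter.write(doc.root, pretty)
puts pretty
```

Writing only doc.root skips the XML declaration; pass the whole document instead if the declaration is needed in the output.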
Pretty printing XML is not a guarantee of correct XML; it's just "pretty". Nokogiri generates valid XML, which is much more important.
If you have to have a certain starting format, create a small template for Nokogiri to parse, then build upon it:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<?xml version="1.0"?>
<installation id="ayfw-a">
<tag/>
</installation>
EOT
puts doc.to_xml
Which generates:
<?xml version="1.0"?>
<installation id="ayfw-a">
<tag/>
</installation>
Adjusting the code a little lets me set the starting root node's ID and the name of the embedded tag:
require 'nokogiri'
ID = 'ayfw-a'
TAG = 'foo'
doc = Nokogiri::XML(<<EOT)
<?xml version="1.0"?>
<installation id="#{ ID }">
<#{ TAG }/>
</installation>
EOT
puts doc.to_xml
Which outputs:
<?xml version="1.0"?>
<installation id="ayfw-a">
<foo/>
</installation>
An alternate way to write this is:
require 'nokogiri'
ID = 'ayfw-a'
TAG = 'foo'
doc = Nokogiri::XML(<<EOT)
<?xml version="1.0"?>
<installation>
<tag/>
</installation>
EOT
doc.root['id'] = ID
doc.at('tag').name = TAG
puts doc.to_xml
Which outputs:
<?xml version="1.0"?>
<installation id="ayfw-a">
<foo/>
</installation>
Whatever you do, it lets you work around the issue and be productive.

LIBXML-RUBY > Xpath context

Context: I'm parsing an XML file using the libxml-ruby gem. I need to query the XML document for a set of nodes using the XPath find method. I then need to process each node individually, querying them once again using the XPath find method.
Issue: When I attempt to query the returned nodes individually, the XPath find method is querying the entire document rather than just the node:
Code Example:
require 'xml'
string = %{<?xml version="1.0" encoding="iso-8859-1"?>
<bookstore>
<book>
<title lang="eng">Harry Potter</title>
<price>29.99</price>
</book>
<book>
<title lang="eng">Learning XML</title>
<price>39.95</price>
</book>
</bookstore>}
xml = XML::Parser.string(string, :encoding => XML::Encoding::ISO_8859_1).parse
books = xml.find("//book")
books.each do |book|
price = book.find("//price").first.content
puts price
end
This script returns 29.99 twice. I think this must have something to do with setting the XPath context, but I have not figured out how to accomplish that yet.
The first problem I see is book.find("//price").
//price means "start at the top of the document and look downward". That's most certainly NOT what you want to do. Instead, I think you want to look inside book for the first price, which is what a relative expression such as price or .//price does.
Using Nokogiri, I'd use CSS selectors because they're easier on the eyes and can usually accomplish the same thing:
require 'nokogiri'
string = %{<?xml version="1.0" encoding="iso-8859-1"?>
<bookstore>
<book>
<title lang="eng">Harry Potter</title>
<price>29.99</price>
</book>
<book>
<title lang="eng">Learning XML</title>
<price>39.95</price>
</book>
</bookstore>}
xml = Nokogiri::XML(string)
books = xml.search("book")
books.each do |book|
price = book.at("price").content
puts price
end
After running that I get:
29.99
39.95
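The relative-path idea can also be shown with XPath itself. A minimal sketch using REXML (bundled with Ruby) rather than libxml-ruby or Nokogiri, so it runs without extra gems; passing a relative expression to libxml-ruby's find should work along the same lines, but that is not tested here:

```ruby
require "rexml/document"

xml = <<~XML
  <bookstore>
    <book><title lang="eng">Harry Potter</title><price>29.99</price></book>
    <book><title lang="eng">Learning XML</title><price>39.95</price></book>
  </bookstore>
XML

doc = REXML::Document.new(xml)

# "//book" is absolute: it searches the whole document.
# "price" is relative: it is evaluated against each book element in turn.
prices = REXML::XPath.match(doc, "//book").map do |book|
  REXML::XPath.first(book, "price").text
end

puts prices
```

Each book now yields its own price, because the second query never leaves the node it was given.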

How to add a new node to XML

I have a simple XML file, items.xml:
<?xml version="1.0" encoding="UTF-8" ?>
<items>
<item>
<name>mouse</name>
<manufacturer>Logicteh</manufacturer>
</item>
<item>
<name>keyboard</name>
<manufacturer>Logitech - Inc.</manufacturer>
</item>
<item>
<name>webcam</name>
<manufacturer>Logistech</manufacturer>
</item>
</items>
I am trying to insert a new node with the following code:
require 'rubygems'
require 'nokogiri'
f = File.open('items.xml')
@items = Nokogiri::XML(f)
f.close
price = Nokogiri::XML::Node.new "price", @items
price.content = "10"
@items.xpath('//items/item/manufacturer').each do |node|
node.add_next_sibling(price)
end
file = File.open("items_fixed.xml",'w')
file.puts @items.to_xml
file.close
However this code adds a new node only after the last <manufacturer> node, items_fixed.xml:
<?xml version="1.0" encoding="UTF-8"?>
<items>
<item>
<name>mouse</name>
<manufacturer>Logitech</manufacturer>
</item>
<item>
<name>keyboard</name>
<manufacturer>Logitech</manufacturer>
</item>
<item>
<name>webcam</name>
<manufacturer>Logitech</manufacturer><price>10</price>
</item>
</items>
Why?
It would be helpful to distinguish between a Node (a particular piece of structured XML data at a particular place in a tree), and a "node template" which is the structure of the data.
Nokogiri (and most other XML libraries) only allows you to specify Nodes, not node templates. So when you created price = Nokogiri::XML::Node.new "price", @items, you had a particular piece of data that belongs in a particular place, but hadn't defined the place yet.
When you added it to the first <item>, you defined its place. When you added it to the second <item>, you uprooted it from its place and put it in a new place. At that point this Node appeared only in the second <item>. This continues when you add the same Node to each item, until you reach the last <item>, which is where the node stays.
Nokogiri doesn't have any way to specify a node template. What you need to do is:
@items.xpath('//items/item/manufacturer').each do |node|
price = Nokogiri::XML::Node.new "price", @items
price.content = "10"
node.add_next_sibling(price)
end
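The move-versus-copy behaviour is not specific to Nokogiri. A minimal sketch that reproduces it with REXML (bundled with Ruby), used here only so the snippet runs without extra gems; the load_items helper and the trimmed-down XML are made up for illustration:

```ruby
require "rexml/document"

XML_SOURCE = "<items><item><name>mouse</name></item><item><name>keyboard</name></item></items>"

# Hypothetical helper: parse a fresh copy and return its <item> elements.
def load_items
  REXML::XPath.match(REXML::Document.new(XML_SOURCE), "//item")
end

# One element object added to every item: each add moves it,
# so it only survives in the last item.
price = REXML::Element.new("price")
price.text = "10"
items = load_items
items.each { |item| item.add_element(price) }
moved_counts = items.map { |item| item.get_elements("price").size }

# Treat the element as a template and add a deep copy to each item instead.
template = REXML::Element.new("price")
template.text = "10"
items = load_items
items.each { |item| item.add_element(template.deep_clone) }
copied_counts = items.map { |item| item.get_elements("price").size }

p moved_counts
p copied_counts
```

In Nokogiri the same template pattern can be written by calling dup on the node before adding it, as an alternative to building a new node in every iteration.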
I'd start with this:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<?xml version="1.0" encoding="UTF-8"?>
<items>
<item>
<name>mouse</name>
<manufacturer>Logitech</manufacturer>
</item>
<item>
<name>keyboard</name>
<manufacturer>Logitech - Inc.</manufacturer>
</item>
</items>
EOT
doc.search('manufacturer').each { |n| n.after('<price>10</price>') }
Which results in:
puts doc.to_xml
# >> <?xml version="1.0" encoding="UTF-8"?>
# >> <items>
# >> <item>
# >> <name>mouse</name>
# >> <manufacturer>Logitech</manufacturer><price>10</price>
# >> </item>
# >> <item>
# >> <name>keyboard</name>
# >> <manufacturer>Logitech - Inc.</manufacturer><price>10</price>
# >> </item>
# >> </items>
It's easy to build upon this to insert different values for the price.
