How to compact existing XML using Nokogiri - ruby

I'm trying to compact an existing XML file using Nokogiri. I have the following demo code:
#!/usr/bin/env ruby
require 'nokogiri'
doc = Nokogiri.XML <<-XML.strip
<?xml version="1.0" encoding="UTF-8"?>
<root>
<foo>
<bar>test</bar>
</foo>
</root>
XML
doc.write_xml_to($stdout, indent: 0)
I expected to see:
<?xml version="1.0" encoding="UTF-8"?>
<root><foo><bar>test</bar></foo></root>
but instead I saw:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<foo>
<bar>test</bar>
</foo>
</root>
I tried:
doc.write_to($stdout, indent: 0, save_with: Nokogiri::XML::Node::SaveOptions::AS_XML)
but that doesn't work either.
How can I remove the ignorable whitespace?

You can tell Nokogiri to ignore empty text nodes and then to output without indentation:
require 'nokogiri'
xml = <<EOT
<?xml version="1.0" encoding="UTF-8"?>
<root>
<foo>
<bar>test</bar>
</foo>
</root>
EOT
doc = Nokogiri::XML(xml) { |opts|
# strict refuses malformed input; noblanks drops whitespace-only text nodes
opts.strict.noblanks
}
doc.to_xml(:indent_text => '', :indent => 0)
# => "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" +
# "<root>\n" +
# "<foo>\n" +
# "<bar>test</bar>\n" +
# "</foo>\n" +
# "</root>\n"

Okay, I'll answer my own question.
Nokogiri does not remove the whitespace because it cannot know whether the whitespace is ignorable (there is no DTD or schema), so it keeps every whitespace-only run as a text node. I have to remove those nodes manually before writing the XML document to the IO device.
#!/usr/bin/env ruby
require 'bundler'
Bundler.require :default
doc = Nokogiri.XML <<-XML.strip
<?xml version="1.0" encoding="UTF-8"?>
<root>
<foo>
<bar>test</bar>
</foo>
</root>
XML
# remove ignorable white spaces
doc.xpath('//text()').each do |node|
node.content = '' if node.text =~ /\A\s+\z/m
end
doc.write_xml_to($stdout, indent: 0)
This is big progress for me, but I still haven't reached my goal: the XML file I'm working on has inline self-closing tags, and the whitespace-only text nodes between those tags should not be compacted. I'm still trying to figure out a way to handle this corner case.

Related

How can I make all XML tags lowercase in Nokogiri?

I'm parsing some XML that I get from various feeds. Apparently some of the XML has an occasional tag that is all upper case. I'd like to normalize the XML to be all lower case tags to make searching, etc. easier.
What I want to do is something like:
parsed = Nokogiri::XML.parse(xml_content)
node = parsed.css("title") # => should return a Nokogiri node for the title tag
However, some of the XML documents have "TITLE" for that tag.
What are my options for getting that node whether its tag is "title", "TITLE", or even "Title"?
Thanks!
If you want to transform your xml document by downcase'ing all tag names, here's one way to do it:
parsed = Nokogiri::XML.parse(xml_content)
parsed.traverse do |node|
node.name = node.name.downcase if node.kind_of?(Nokogiri::XML::Element)
end
As a general approach you could transform all element (tag) names to lower case (e.g. by using XSLT or another solution) and then do all of your XPath/CSS queries using lower case only.
This XSLT solution should work; however, my version of Ruby (2.0.0p481) and/or Nokogiri (1.5.6) complains mysteriously (perhaps about the use of the "lower-case(...)" function? Perhaps Nokogiri doesn't support XSLT v2?)
Here's a solution that seems to work:
require 'nokogiri'
xslt = Nokogiri::XSLT(File.read('lower.xslt'))
# <?xml version="1.0" encoding="UTF-8"?>
# <xsl:transform version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
# <xsl:variable name="lowercase" select="'abcdefghijklmnopqrstuvwxyz'" />
# <xsl:variable name="uppercase" select="'ABCDEFGHIJKLMNOPQRSTUVWXYZ'" />
# <xsl:template match="*">
# <xsl:element name="{translate(local-name(), $uppercase, $lowercase)}">
# <xsl:apply-templates />
# </xsl:element>
# </xsl:template>
# </xsl:transform>
doc = Nokogiri::XML(File.read('doc.xml'))
# <?xml version="1.0" encoding="UTF-8"?>
# <FOO>
# <BAR>Bar</BAR>
# <GAH>Gah</GAH>
# <ZIP><DOO><DAH/></DOO></ZIP>
# </FOO>
puts xslt.transform(doc)
# <?xml version="1.0"?>
# <foo>
# <bar>Bar</bar>
# <gah>Gah</gah>
# <zip><doo><dah/></doo></zip>
# </foo>

Output array of tag contents using REXML?

This has been asked before in "REXML - How to extract a single element" but the answer doesn't work. Apparently, the text method is no longer available.
I have an XML file:
<?xml version="1.0" encoding="UTF-8"?>
<ice_cream>
<flavor>Vanilla</flavor>
</ice_cream>
and I can place its contents into an array using REXML:
flavors = xml_file.get_elements('//flavor')
I get an array:
puts flavors[0]
Which returns:
<flavor>Vanilla</flavor>
Instead, I want:
Vanilla
I've tried:
flavors = xml_file.get_elements('//flavor').text
But, I get:
NoMethodError: undefined method `text' for #<Array:0x007fa7a3b94220>
What's the correct way to accomplish this? I'm open to using other libraries, too.
Use Nokogiri. Your code will thank you.
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<?xml version="1.0" encoding="UTF-8"?>
<ice_cream>
<flavor>Vanilla</flavor>
</ice_cream>
EOT
doc.search('flavor') # => [#<Nokogiri::XML::Element:0x3feb8182fc60 name="flavor" children=[#<Nokogiri::XML::Text:0x3feb8182fa44 "Vanilla">]>]
doc.search('flavor').map(&:text) # => ["Vanilla"]
search finds all nodes, as a NodeSet, that match the CSS selector 'flavor'.
search('flavor').map(&:text) walks the NodeSet and calls the text method on each Node (via map), returning each node's text content as a string.
If your XML is actually something more complex:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<?xml version="1.0" encoding="UTF-8"?>
<ice_cream>
<flavor>Vanilla</flavor>
<flavor>Chocolate</flavor>
<flavor>Strawberry</flavor>
</ice_cream>
EOT
doc.search('flavor') # => [#<Nokogiri::XML::Element:0x3fcc2a577afc name="flavor" children=[#<Nokogiri::XML::Text:0x3fcc2a5778e0 "Vanilla">]>, #<Nokogiri::XML::Element:0x3fcc2a5776c4 name="flavor" children=[#<Nokogiri::XML::Text:0x3fcc2a5774bc "Chocolate">]>, #<Nokogiri::XML::Element:0x3fcc2a5772b4 name="flavor" children=[#<Nokogiri::XML::Text:0x3fcc2a572c78 "Strawberry">]>]
doc.search('flavor').map(&:text) # => ["Vanilla", "Chocolate", "Strawberry"]
Or:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<?xml version="1.0" encoding="UTF-8"?>
<ice_creams>
<ice_cream>
<flavor>Vanilla</flavor>
</ice_cream>
<ice_cream>
<flavor>Chocolate</flavor>
</ice_cream>
<ice_cream>
<flavor>Strawberry</flavor>
</ice_cream>
</ice_creams>
EOT
ice_cream = doc.search('ice_cream') # => [#<Nokogiri::XML::Element:0x3fe6a91f6b00 name="ice_cream" children=[#<Nokogiri::XML::Text:0x3fe6a91f68f8 "\n ">, #<Nokogiri::XML::Element:0x3fe6a91f681c name="flavor" children=[#<Nokogiri::XML::Text:0x3fe6a91f6600 "Vanilla">]>, #<Nokogiri::XML::Text:0x3fe6a91f63f8 "\n ">]>, #<Nokogiri::XML::Element:0x3fe6a91f1de4 name="ice_cream" children=[#<Nokogiri::XML::Text:0x3fe6a91f1bdc "\n ">, #<Nokogiri::XML::Element:0x3fe6a91f1ac4 name="flavor" children=[#<Nokogiri::XML::Text:0x3fe6a91f1880 "Chocolate">]>, #<Nokogiri::XML::Text:0x3fe6a91f1678 "\n ">]>, #<Nokogiri::XML::Element:0x3fe6a91f13f8 name="ice_cream" children=[#<Nokogiri::XML::Text:0x3fe6a91f1074 "\n ">, #<Nokogiri::XML::Element:0x3fe6a91f0e80 name="flavor" children=[#<Nokogiri::XML::Text:0x3fe6a91f0a98 "Strawberry">]>, #<Nokogiri::XML::Text:0x3fe6a91f0840 "\n ">]>]
ice_cream.search('flavor').map(&:text) # => ["Vanilla", "Chocolate", "Strawberry"]
For searching, Nokogiri supports both CSS and XPath selectors. search accepts either kind, and css and xpath are its selector-specific counterparts. at returns a single Node, much like search('some_node').first, and has at_css and at_xpath as its selector-specific counterparts.
Here is the code:
require 'rexml/document'
doc = <<-xml
<?xml version="1.0" encoding="UTF-8"?>
<ice_cream>
<flavor>Vanilla</flavor>
</ice_cream>
xml
xml_doc = REXML::Document.new(doc)
xml_doc.get_elements('//flavor').class # => Array
xml_doc.get_elements('//flavor')[0].class # => REXML::Element
xml_doc.get_elements('//flavor')[0].text # => "Vanilla"
xml_doc.get_elements('//flavor') gives you an Array of REXML::Element objects. You then need to iterate over that Array and call #text on each REXML::Element to get its text.
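That iteration can be sketched with map, which collects the text of every matching element into a plain Array of strings. The second flavor is added here only to show more than one result.

```ruby
require 'rexml/document'

source = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <ice_cream>
    <flavor>Vanilla</flavor>
    <flavor>Chocolate</flavor>
  </ice_cream>
XML

xml_doc = REXML::Document.new(source)

# get_elements returns an Array of REXML::Element; map(&:text) extracts each element's text.
flavors = xml_doc.get_elements('//flavor').map(&:text)
puts flavors.inspect
```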

How to use Nokogiri's noblanks

I have an XML document:
<?xml version="1.0"?>
<installation id="ayfw-a">
</installation>
I am adding a child node to this document like this:
data = Nokogiri::XML(IO.read('file')) { |doc| doc.noblanks }
new_record = Nokogiri::XML::Node.new('tag', data)
data.root.add_child(new_record)
File.open('file', 'w') { |dh_file| dh_file.write(data.to_xml(:indent => 4)) }
With this code I get this inside my file:
<?xml version="1.0"?>
<installation id="ayfw-a">
<tag/></installation>
Here the noblanks does not work.
However, if before inserting the new node my file already has a child node, noblanks works fine:
Before inserting new node:
<?xml version="1.0"?>
<installation id="ayfw-a">
<!---->
</installation>
After inserting new node:
<?xml version="1.0"?>
<installation id="ayfw-a">
<!---->
<tag/>
</installation>
So, it looks like noblanks works only if it already sees the "pattern". Is there any way I can correctly indent my XML if it does not have any children yet?
Perhaps noblanks is not the right option to use, but for some reason it works if I already have some nodes under <installation>. Basically what I currently have when adding a child node is this:
<?xml version="1.0"?>
<installation id="ayfw-a">
<tag/></installation>
What I need to have is this:
<?xml version="1.0"?>
<installation id="ayfw-a">
<tag/>
</installation>
And the child nodes I add must be empty, with some attributes which I suppressed for simplicity.
Your two examples are befuddling: they both show the exact same behavior, yet you say one of them does something different.
As far as I can tell, specifying noblanks never gets rid of an empty node:
xml.xml:
<?xml version="1.0"?>
<root>
<installation id="ayfw-a"></installation>
<dog></dog>
<cat/>
</root>
.
require 'nokogiri'
data = Nokogiri::XML(IO.read('xml.xml')) { |doc| doc.noblanks }
puts data
--output:--
<?xml version="1.0"?>
<root>
<installation id="ayfw-a"/>
<dog/>
<cat/>
</root>
I would expect the output to be:
<root>
<installation id="ayfw-a"></installation>
</root>
Of course, the terrible Nokogiri docs (typical of Ruby) do not define what a blank node is. Apparently, the extent of what noblanks does is convert nodes like this:
<dog></dog>
to:
<dog/>
Ahh, so your problem is with the pretty printing of your XML. Okay, I see the same problem you do. Let me show you how you could have asked your question:
I am having trouble formatting my XML the way I want to:
xml.xml:
<?xml version="1.0"?>
<installation id="ayfw-a">
</installation>
.
require 'nokogiri'
data = Nokogiri::XML(IO.read('xml.xml')) {|doc| doc.noblanks}
new_record = Nokogiri::XML::Node.new('tag', data)
data.root.add_child(new_record)
puts data.to_xml(indent: 4, indent_text: ".")
--output:--
<?xml version="1.0"?>
<installation id="ayfw-a">
<tag/></installation>
The to_xml() method doesn't seem to work correctly. I expected the output to be:
<?xml version="1.0"?>
<installation id="ayfw-a">
....<tag/>
</installation>
But the to_xml() method does format the output the way I want when the tag has a pre-existing child node:
xml.xml:
<?xml version="1.0"?>
<installation id="ayfw-a">
<dog>Rover</dog>
</installation>
.
require 'nokogiri'
data = Nokogiri::XML(IO.read('xml.xml')) {|doc| doc.noblanks}
new_record = Nokogiri::XML::Node.new('tag', data)
data.root.add_child(new_record)
puts data.to_xml(indent: 4, indent_text: ".")
--output:--
<?xml version="1.0"?>
<installation id="ayfw-a">
....<dog>Rover</dog>
....<tag/>
</installation>
How do I get Nokogiri to format the output the way I want it in the first case?
It doesn't look like Nokogiri has a very good pretty printer. It seems that REXML has a better pretty printer than Nokogiri:
xml.xml:
<?xml version="1.0"?>
<installation id="ayfw-a">
</installation>
.
require 'nokogiri'
data = Nokogiri::XML(IO.read('xml.xml')) {|doc| doc.noblanks}
new_record = Nokogiri::XML::Node.new('tag', data)
data.root.add_child(new_record)
puts data.to_xml(indent: 4, indent_text: ".")
require "rexml/document"
REXML::Document.new(data.to_xml).write(File.open("output.txt", "w"), indent_spaces = 4)
--output:--
<installation id="ayfw-a">
<tag/></installation>
$ cat output.txt
<?xml version='1.0'?>
<installation id='ayfw-a'>
<tag/>
</installation>
Pretty printing XML is not a guarantee of correct XML; it's just "pretty". Nokogiri generates valid XML, which is much more important.
If you have to have a certain starting format, create a small template for Nokogiri to parse, then build upon it:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<?xml version="1.0"?>
<installation id="ayfw-a">
<tag/>
</installation>
EOT
puts doc.to_xml
Which generates:
<?xml version="1.0"?>
<installation id="ayfw-a">
<tag/>
</installation>
Adjusting the code a little lets me set the starting root node's ID and the name of the embedded tag:
require 'nokogiri'
ID = 'ayfw-a'
TAG = 'foo'
doc = Nokogiri::XML(<<EOT)
<?xml version="1.0"?>
<installation id="#{ ID }">
<#{ TAG }/>
</installation>
EOT
puts doc.to_xml
Which outputs:
<?xml version="1.0"?>
<installation id="ayfw-a">
<foo/>
</installation>
An alternate way to write this is:
require 'nokogiri'
ID = 'ayfw-a'
TAG = 'foo'
doc = Nokogiri::XML(<<EOT)
<?xml version="1.0"?>
<installation>
<tag/>
</installation>
EOT
doc.root['id'] = ID
doc.at('tag').name = TAG
puts doc.to_xml
Which outputs:
<?xml version="1.0"?>
<installation id="ayfw-a">
<foo/>
</installation>
Whatever you do, it lets you work around the issue and be productive.

Check if xml response has type="array" using Nokogiri?

Given the following xml which has been parsed into @response using Nokogiri:
<?xml version="1.0" encoding="UTF-8"?>
<foos type="array">
<foo>
<id type="integer">1</id>
<name>bar</name>
</foo>
</foos>
Does an xpath exist such that @response.xpath(xpath) returns array?
Assume that this xpath must be reused across multiple documents where the naming of foo is inconsistent.
If an xpath is not the correct tool to solve this problem, does Nokogiri provide a method that is?
This xml is automatically generated by the rails framework, and the answer to this question is intended to be used to create an XML equivalent to this Cucumber feature for JSON responses.
If you want to select the root node when its type attribute is array (regardless of the root element's name), then use this:
/*[@type='array']
For its children, use:
/*[@type='array']/*
Simply:
if doc.root['type']=='array'
Here's a test case:
@response = <<ENDXML
<?xml version="1.0" encoding="UTF-8"?>
<foos type="array">
<foo>
<id type="integer">1</id>
<name>bar</name>
</foo>
</foos>
ENDXML
require 'nokogiri'
doc = Nokogiri.XML(@response)
if doc.root['type']=='array'
puts "It is!"
else
puts "Nope"
end
Depending on your needs, you might want to:
case doc.root['type']
when 'array'
#...
when 'string'
#...
else
#...
end

How do I Transform an .xml file to an instance of a ruby array?

I have the following xml file:
/my_file.xml
<?xml version="1.0" encoding="utf-8" ?>
<words>
<w>my_word</w>
<w>second_word</w>
</words>
How can I do the following using Ruby:
Load
Parse
Transform an xml file to an instance of a ruby array:
words = ["my_word","second_word"]
With the Nokogiri gem...
require 'rubygems'
require 'nokogiri'
xml = '<?xml version="1.0" encoding="utf-8" ?>
<words>
<w>my_word</w>
<w>second_word</w>
</words>'
doc = Nokogiri::XML(xml)
words = doc.xpath("//w").map {|x| x.text}
