Getting the value of duplicate nodes in XML? - ruby

How can I get the values of two nodes if they have the same name, using LibXML for Ruby, or any other Ruby library? I have this XML:
<?xml version="1.0" encoding="ISO-8859-1"?>
<test>
<test1>
<foo>534569</foo>
</test1>
<test1>
<foo>534570</foo>
</test1>
</test>
I want both values of foo.

Personally, I'd recommend using Nokogiri. It has become the de facto standard for XML/HTML parsing in Ruby.
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<?xml version="1.0" encoding="ISO-8859-1"?>
<test>
<test1>
<foo>534569</foo>
</test1>
<test1>
<foo>534570</foo>
</test1>
</test>
EOT
doc.search('foo').map(&:text)
which returns:
[
[0] "534569",
[1] "534570"
]

You can use the find method, which returns all nodes that match the specified XPath expression.
Below is an example of how to output the content of each foo element:
require 'libxml'
xml_sample = %q[<?xml version="1.0" encoding="ISO-8859-1"?>
<test>
<test1>
<foo>534569</foo>
</test1>
<test1>
<foo>534570</foo>
</test1>
</test>]
doc = LibXML::XML::Document.string(xml_sample)
doc.find('test1/foo').each{ |foo| puts foo.content }
#=> 534569
#=> 534570

Related

How to compact existing XML using Nokogiri

I'm trying to compact an existing XML file using Nokogiri. I have the following demo code:
#!/usr/bin/env ruby
require 'nokogiri'
doc = Nokogiri.XML <<-XML.strip
<?xml version="1.0" encoding="UTF-8"?>
<root>
<foo>
<bar>test</bar>
</foo>
</root>
XML
doc.write_xml_to($stdout, indent: 0)
I expected to see:
<?xml version="1.0" encoding="UTF-8"?>
<root><foo><bar>test</bar></foo></root>
but instead I saw:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<foo>
<bar>test</bar>
</foo>
</root>
I tried:
doc.write_to($stdout, indent: 0, save_with: Nokogiri::XML::Node::SaveOptions::AS_XML)
but that doesn't work either.
How can I remove the ignorable whitespaces?
You can tell Nokogiri to ignore empty text nodes and then to output without indentation:
require 'nokogiri'
xml = <<EOT
<?xml version="1.0" encoding="UTF-8"?>
<root>
<foo>
<bar>test</bar>
</foo>
</root>
EOT
doc = Nokogiri::XML(xml) { |opts|
# strict.noblanks already enables noblanks, so a separate opts.noblanks call is redundant
opts.strict.noblanks
}
doc.to_xml(:indent_text => '', :indent => 0)
# => "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" +
# "<root>\n" +
# "<foo>\n" +
# "<bar>test</bar>\n" +
# "</foo>\n" +
# "</root>\n"
Okay, I'll answer my own question.
Nokogiri does not remove the whitespace because it cannot tell whether the whitespace is ignorable (there is no DTD or schema), so it keeps every whitespace-only run as a text node. I have to clear those nodes manually before writing the XML document to the IO device.
#!/usr/bin/env ruby
require 'bundler'
Bundler.require :default
doc = Nokogiri.XML <<-XML.strip
<?xml version="1.0" encoding="UTF-8"?>
<root>
<foo>
<bar>test</bar>
</foo>
</root>
XML
# remove ignorable white spaces
doc.xpath('//text()').each do |node|
node.content = '' if node.text =~ /\A\s+\z/m
end
doc.write_xml_to($stdout, indent: 0)
This is big progress for me, but I still haven't reached my goal: the XML file I'm working on has inline self-closing tags, and the whitespace-only text nodes between those tags must not be compacted. I'm now trying to figure out how to handle this corner case.

Output array of tag contents using REXML?

This has been asked before in "REXML - How to extract a single element" but the answer doesn't work. Apparently, the text method is no longer available.
I have an XML file:
<?xml version="1.0" encoding="UTF-8"?>
<ice_cream>
<flavor>Vanilla</flavor>
</ice_cream>
and I can place its contents into an array using REXML:
flavors = xml_file.get_elements('//flavor')
I get an array:
puts flavors[0]
Which returns:
<flavor>Vanilla</flavor>
Instead, I want:
Vanilla
I've tried:
flavors = xml_file.get_elements('//flavor').text
But, I get:
NoMethodError: undefined method `text' for #<Array:0x007fa7a3b94220>
What's the correct way to accomplish this? I'm open to using other libraries, too.
Use Nokogiri. Your code will thank you.
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<?xml version="1.0" encoding="UTF-8"?>
<ice_cream>
<flavor>Vanilla</flavor>
</ice_cream>
EOT
doc.search('flavor') # => [#<Nokogiri::XML::Element:0x3feb8182fc60 name="flavor" children=[#<Nokogiri::XML::Text:0x3feb8182fa44 "Vanilla">]>]
doc.search('flavor').map(&:text) # => ["Vanilla"]
search finds all nodes, as a NodeSet, that match the CSS selector 'flavor'.
search('flavor').map(&:text) walks the NodeSet and calls the text method on each Node (via map), returning each node's text content as a String.
If your XML is actually something more complex:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<?xml version="1.0" encoding="UTF-8"?>
<ice_cream>
<flavor>Vanilla</flavor>
<flavor>Chocolate</flavor>
<flavor>Strawberry</flavor>
</ice_cream>
EOT
doc.search('flavor') # => [#<Nokogiri::XML::Element:0x3fcc2a577afc name="flavor" children=[#<Nokogiri::XML::Text:0x3fcc2a5778e0 "Vanilla">]>, #<Nokogiri::XML::Element:0x3fcc2a5776c4 name="flavor" children=[#<Nokogiri::XML::Text:0x3fcc2a5774bc "Chocolate">]>, #<Nokogiri::XML::Element:0x3fcc2a5772b4 name="flavor" children=[#<Nokogiri::XML::Text:0x3fcc2a572c78 "Strawberry">]>]
doc.search('flavor').map(&:text) # => ["Vanilla", "Chocolate", "Strawberry"]
Or:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<?xml version="1.0" encoding="UTF-8"?>
<ice_creams>
<ice_cream>
<flavor>Vanilla</flavor>
</ice_cream>
<ice_cream>
<flavor>Chocolate</flavor>
</ice_cream>
<ice_cream>
<flavor>Strawberry</flavor>
</ice_cream>
</ice_creams>
EOT
ice_cream = doc.search('ice_cream') # => [#<Nokogiri::XML::Element:0x3fe6a91f6b00 name="ice_cream" children=[#<Nokogiri::XML::Text:0x3fe6a91f68f8 "\n ">, #<Nokogiri::XML::Element:0x3fe6a91f681c name="flavor" children=[#<Nokogiri::XML::Text:0x3fe6a91f6600 "Vanilla">]>, #<Nokogiri::XML::Text:0x3fe6a91f63f8 "\n ">]>, #<Nokogiri::XML::Element:0x3fe6a91f1de4 name="ice_cream" children=[#<Nokogiri::XML::Text:0x3fe6a91f1bdc "\n ">, #<Nokogiri::XML::Element:0x3fe6a91f1ac4 name="flavor" children=[#<Nokogiri::XML::Text:0x3fe6a91f1880 "Chocolate">]>, #<Nokogiri::XML::Text:0x3fe6a91f1678 "\n ">]>, #<Nokogiri::XML::Element:0x3fe6a91f13f8 name="ice_cream" children=[#<Nokogiri::XML::Text:0x3fe6a91f1074 "\n ">, #<Nokogiri::XML::Element:0x3fe6a91f0e80 name="flavor" children=[#<Nokogiri::XML::Text:0x3fe6a91f0a98 "Strawberry">]>, #<Nokogiri::XML::Text:0x3fe6a91f0840 "\n ">]>]
ice_cream.search('flavor').map(&:text) # => ["Vanilla", "Chocolate", "Strawberry"]
For searching, Nokogiri supports both CSS and XPath selectors. search accepts either kind; css and xpath are the counterparts that accept only CSS or only XPath. at returns a single Node, much like search('some_node').first, and has the counterparts at_css and at_xpath.
Here is the code:
require 'rexml/document'
doc = <<-xml
<?xml version="1.0" encoding="UTF-8"?>
<ice_cream>
<flavor>Vanilla</flavor>
</ice_cream>
xml
xml_doc = REXML::Document.new(doc)
xml_doc.get_elements('//flavor').class # => Array
xml_doc.get_elements('//flavor')[0].class # => REXML::Element
xml_doc.get_elements('//flavor')[0].text # => "Vanilla"
xml_doc.get_elements('//flavor') gives you an Array of REXML::Element objects. You then need to iterate through that Array and call #text on each REXML::Element to get its text.
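The iteration step can be sketched like this, assuming a document with several flavor elements instead of one:

```ruby
require 'rexml/document'

xml = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <ice_cream>
    <flavor>Vanilla</flavor>
    <flavor>Chocolate</flavor>
    <flavor>Strawberry</flavor>
  </ice_cream>
XML

xml_doc = REXML::Document.new(xml)

# get_elements returns an Array of REXML::Element; call #text on each one.
flavors = xml_doc.get_elements('//flavor').map(&:text)
puts flavors.inspect
```

This is why calling .text directly on the result of get_elements raises NoMethodError: the receiver is the Array, not an element.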

How to use Nokogiri's noblanks

I have an XML document:
<?xml version="1.0"?>
<installation id="ayfw-a">
</installation>
I am adding a child node to this document like this:
data = Nokogiri::XML(IO.read('file')) { |doc| doc.noblanks }
new_record = Nokogiri::XML::Node.new('tag', data)
data.root.add_child(new_record)
File.open('file', 'w') { |dh_file| dh_file.write(data.to_xml(:indent => 4)) }
With this code I get this inside my file:
<?xml version="1.0"?>
<installation id="ayfw-a">
<tag/></installation>
Here the noblanks does not work.
However, if before inserting the new node my file already has a child node, noblanks works fine:
Before inserting new node:
<?xml version="1.0"?>
<installation id="ayfw-a">
<!---->
</installation>
After inserting new node:
<?xml version="1.0"?>
<installation id="ayfw-a">
<!---->
<tag/>
</installation>
So, it looks like noblanks works only if it already sees the "pattern". Is there any way I can correctly indent my XML if it does not have any children yet?
Perhaps noblanks is not the right option to use, but for some reason it works if I already have some nodes under <installation>. Basically what I currently have when adding a child node is this:
<?xml version="1.0"?>
<installation id="ayfw-a">
<tag/></installation>
What I need to have is this:
<?xml version="1.0"?>
<installation id="ayfw-a">
<tag/>
</installation>
And the child nodes I add must be empty, with some attributes which I suppressed for simplicity.
Your two examples are befuddling: they both show the exact same behavior, yet you say one of them does something different.
As far as I can tell, specifying noblanks never gets rid of an empty node:
xml.xml:
<?xml version="1.0"?>
<root>
<installation id="ayfw-a"></installation>
<dog></dog>
<cat/>
</root>
.
require 'nokogiri'
data = Nokogiri::XML(IO.read('xml.xml')) { |doc| doc.noblanks }
puts data
--output:--
<?xml version="1.0"?>
<root>
<installation id="ayfw-a"/>
<dog/>
<cat/>
</root>
I would expect the output to be:
<root>
<installation id="ayfw-a"></installation>
</root>
Of course, the terrible Nokogiri docs (typical of Ruby) do not define what a blank node is. Apparently, the extent of what noblanks does is convert nodes like this:
<dog></dog>
to:
<dog/>
Ahh, so your problem is with the pretty printing of your XML. Okay, I see the same problem you do. Let me show you how you could have asked your question:
I am having trouble formatting my XML the way I want to:
xml.xml:
<?xml version="1.0"?>
<installation id="ayfw-a">
</installation>
.
require 'nokogiri'
data = Nokogiri::XML(IO.read('xml.xml')) {|doc| doc.noblanks}
new_record = Nokogiri::XML::Node.new('tag', data)
data.root.add_child(new_record)
puts data.to_xml(indent: 4, indent_text: ".")
--output:--
<?xml version="1.0"?>
<installation id="ayfw-a">
<tag/></installation>
The to_xml() method doesn't seem to work correctly. I expected the output to be:
<?xml version="1.0"?>
<installation id="ayfw-a">
....<tag/>
</installation>
But the to_xml() method does format the output the way I want when the tag has a pre-existing child node:
xml.xml:
<?xml version="1.0"?>
<installation id="ayfw-a">
<dog>Rover</dog>
</installation>
.
require 'nokogiri'
data = Nokogiri::XML(IO.read('xml.xml')) {|doc| doc.noblanks}
new_record = Nokogiri::XML::Node.new('tag', data)
data.root.add_child(new_record)
puts data.to_xml(indent: 4, indent_text: ".")
--output:--
<?xml version="1.0"?>
<installation id="ayfw-a">
....<dog>Rover</dog>
....<tag/>
</installation>
How do I get Nokogiri to format the output the way I want it in the first case?
It doesn't look like Nokogiri has a very good pretty printer. It seems that REXML has a better pretty printer than Nokogiri:
xml.xml:
<?xml version="1.0"?>
<installation id="ayfw-a">
</installation>
.
require 'nokogiri'
data = Nokogiri::XML(IO.read('xml.xml')) {|doc| doc.noblanks}
new_record = Nokogiri::XML::Node.new('tag', data)
data.root.add_child(new_record)
puts data.to_xml(indent: 4, indent_text: ".")
require "rexml/document"
# The second positional argument to write is the indent width; the block form closes the file.
File.open("output.txt", "w") { |file| REXML::Document.new(data.to_xml).write(file, 4) }
--output:--
<installation id="ayfw-a">
<tag/></installation>
$ cat output.txt
<?xml version='1.0'?>
<installation id='ayfw-a'>
<tag/>
</installation>
Pretty printing XML is not a guarantee of correct XML; it's just "pretty". Nokogiri generates valid XML, which is much more important.
If you have to have a certain starting format, create a small template for Nokogiri to parse, then build upon it:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<?xml version="1.0"?>
<installation id="ayfw-a">
<tag/>
</installation>
EOT
puts doc.to_xml
Which generates:
<?xml version="1.0"?>
<installation id="ayfw-a">
<tag/>
</installation>
Adjusting the code a little lets me set the starting root node's ID and the name of the embedded tag:
require 'nokogiri'
ID = 'ayfw-a'
TAG = 'foo'
doc = Nokogiri::XML(<<EOT)
<?xml version="1.0"?>
<installation id="#{ ID }">
<#{ TAG }/>
</installation>
EOT
puts doc.to_xml
Which outputs:
<?xml version="1.0"?>
<installation id="ayfw-a">
<foo/>
</installation>
An alternate way to write this is:
require 'nokogiri'
ID = 'ayfw-a'
TAG = 'foo'
doc = Nokogiri::XML(<<EOT)
<?xml version="1.0"?>
<installation>
<tag/>
</installation>
EOT
doc.root['id'] = ID
doc.at('tag').name = TAG
puts doc.to_xml
Which outputs:
<?xml version="1.0"?>
<installation id="ayfw-a">
<foo/>
</installation>
Whatever you do, it lets you work around the issue and be productive.

Ruby: How do I get attribute values from XML with Nokogiri?

How to get the value of the message value ("ready to use")?
<?xml version="1.0" encoding="UTF-8"?>
<response status="ok" permission_level="admin" message="ready to use" cached="0">
<title>kit</title>
</response>
Thanks
require 'rubygems'
require 'nokogiri'
string = %Q{
<?xml version="1.0" encoding="UTF-8"?>
<response status="ok" permission_level="admin" message="ready to use" cached="0">
<title>kit</title>
</response>
}
doc = Nokogiri::XML(string)
doc.css("response").each do |response_node|
puts response_node["message"]
end
Save and run this Ruby file, and you will get the result:
#=> ready to use
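Index the node with the attribute name, as you would a Hash.
require 'nokogiri'
require 'open-uri'
# On current Ruby versions, Kernel#open no longer fetches URLs; use URI.open from open-uri.
doc = Nokogiri::HTML(URI.open('http://google.com'))
doc.css('img:first').first['alt']
=> "Google"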

How do I transform an .xml file into an instance of a Ruby array?

I have the following xml file:
/my_file.xml
<?xml version="1.0" encoding="utf-8" ?>
<words>
<w>my_word</w>
<w>second_word</w>
</words>
How can I do the following using Ruby:
Load
Parse
Transform an XML file into an instance of a Ruby array:
words = ["my_word","second_word"]
With the Nokogiri gem...
require 'rubygems'
require 'nokogiri'
xml = '<?xml version="1.0" encoding="utf-8" ?>
<words>
<w>my_word</w>
<w>second_word</w>
</words>'
doc = Nokogiri::XML(xml)
words = doc.xpath("//w").map {|x| x.text}
