Can someone please explain why the Nokogiri and REXML outputs differ in the code below?
require 'rubygems'
require 'nokogiri'
require 'rexml/document'
xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>
<yml>
<a>TM and © 2009</a>
</yml>"
puts 'nokogiri'
doc = Nokogiri::XML(xml)
puts doc.to_s, "\n"
puts 'rexml'
doc = REXML::Document.new(xml)
puts doc.to_s
outputs:
nokogiri
<?xml version="1.0" encoding="ISO-8859-1"?>
<yml>
<a>TM and ? 2009</a>
</yml>
rexml
<?xml version='1.0' encoding='ISO-8859-1'?>
<yml>
<a>TM and © 2009</a>
</yml>
Sure. Nokogiri is converting the text using the declared ISO-8859-1 encoding, whereas REXML is just outputting what you put in. If you change the encoding in the XML declaration to UTF-8, then you'll get:
nokogiri:
<?xml version="1.0" encoding="utf-8"?>
<yml>
<a>TM and © 2009</a>
</yml>
rexml:
<?xml version='1.0' encoding='UTF-8'?>
<yml>
<a>TM and © 2009</a>
</yml>
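To see why the two outputs can look different on screen, it may help to compare the bytes involved. The sketch below uses only core Ruby String methods (no Nokogiri or REXML) and assumes the source string is UTF-8. Whether a terminal shows `?` for the ISO-8859-1 byte depends on the terminal's own encoding, which is an assumption here and not something the original post confirms.

```ruby
# encoding: UTF-8

utf8   = "TM and © 2009"            # Ruby source strings default to UTF-8
latin1 = utf8.encode("ISO-8859-1")  # transcode to the encoding the XML declares

# The copyright sign is two bytes in UTF-8 but a single byte in ISO-8859-1.
copyright_utf8   = "©".bytes
copyright_latin1 = "©".encode("ISO-8859-1").bytes

# The lone ISO-8859-1 byte is not valid UTF-8, so a terminal that expects
# UTF-8 (an assumption about the asker's setup) cannot render it.
readable_as_utf8 = latin1.dup.force_encoding("UTF-8").valid_encoding?

puts copyright_utf8.inspect
puts copyright_latin1.inspect
puts readable_as_utf8
```

If the last line prints false, the ISO-8859-1 output is correct as bytes and only its display is off.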
I'm trying to compact an existing XML file using Nokogiri. I have the following demo code:
#!/usr/bin/env ruby
require 'nokogiri'
doc = Nokogiri.XML <<-XML.strip
<?xml version="1.0" encoding="UTF-8"?>
<root>
<foo>
<bar>test</bar>
</foo>
</root>
XML
doc.write_xml_to($stdout, indent: 0)
I expected to see:
<?xml version="1.0" encoding="UTF-8"?>
<root><foo><bar>test</bar></foo></root>
but instead I saw:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<foo>
<bar>test</bar>
</foo>
</root>
I tried:
doc.write_to($stdout, indent: 0, save_with: Nokogiri::XML::Node::SaveOptions::AS_XML)
but that doesn't work either.
How can I remove the ignorable whitespaces?
You can tell Nokogiri to ignore empty text nodes and then to output without indentation:
require 'nokogiri'
xml = <<EOT
<?xml version="1.0" encoding="UTF-8"?>
<root>
<foo>
<bar>test</bar>
</foo>
</root>
EOT
doc = Nokogiri::XML(xml) { |opts|
opts.noblanks
opts.strict.noblanks
}
doc.to_xml(:indent_text => '', :indent => 0)
# => "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" +
# "<root>\n" +
# "<foo>\n" +
# "<bar>test</bar>\n" +
# "</foo>\n" +
# "</root>\n"
Okay, I'll answer my own question.
Nokogiri does not remove the white spaces because Nokogiri doesn't know if the white spaces are ignorable or not (no DTD, no schema), so it keeps all the whitespace-only text as text nodes. I should remove them manually before writing the XML doc to the IO device.
#!/usr/bin/env ruby
require 'bundler'
Bundler.require :default
doc = Nokogiri.XML <<-XML.strip
<?xml version="1.0" encoding="UTF-8"?>
<root>
<foo>
<bar>test</bar>
</foo>
</root>
XML
# remove ignorable white spaces
doc.xpath('//text()').each do |node|
node.content = '' if node.text =~ /\A\s+\z/m
end
doc.write_xml_to($stdout, indent: 0)
This is big progress for me, but I still haven't reached my goal, because the XML file I'm working on has inline self-closing tags, and there are whitespace-only text nodes between those tags that should not be compacted. I'm trying to figure out a way to handle this corner case now.
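For comparison, the same manual cleanup can be sketched with REXML, which ships with Ruby. This is a swapped-in library and not the Nokogiri approach above, and `strip_blank_text!` is a hypothetical helper name. It removes every whitespace-only text node, so it does not solve the inline-tag corner case either.

```ruby
require 'rexml/document'

xml = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <root>
    <foo>
      <bar>test</bar>
    </foo>
  </root>
XML

# Hypothetical helper: recursively drop text nodes that hold only whitespace.
def strip_blank_text!(element)
  # children returns a copy of the child list, so removing nodes is safe here.
  element.children.each do |child|
    case child
    when REXML::Text
      child.remove if child.value.strip.empty?
    when REXML::Element
      strip_blank_text!(child)
    end
  end
end

doc = REXML::Document.new(xml)
strip_blank_text!(doc.root)

compact = doc.root.to_s
puts compact
```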
This has been asked before in "REXML - How to extract a single element" but the answer doesn't work. Apparently, the text method is no longer available.
I have an XML file:
<?xml version="1.0" encoding="UTF-8"?>
<ice_cream>
<flavor>Vanilla</flavor>
</ice_cream>
and I can place its contents into an array using REXML:
flavors = xml_file.get_elements('//flavor')
I get an array:
puts flavors[0]
Which returns:
<flavor>Vanilla</flavor>
Instead, I want:
Vanilla
I've tried:
flavors = xml_file.get_elements('//flavor').text
But, I get:
NoMethodError: undefined method `text' for #<Array:0x007fa7a3b94220>
What's the correct way to accomplish this? I'm open to using other libraries, too.
Use Nokogiri. Your code will thank you.
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<?xml version="1.0" encoding="UTF-8"?>
<ice_cream>
<flavor>Vanilla</flavor>
</ice_cream>
EOT
doc.search('flavor') # => [#<Nokogiri::XML::Element:0x3feb8182fc60 name="flavor" children=[#<Nokogiri::XML::Text:0x3feb8182fa44 "Vanilla">]>]
doc.search('flavor').map(&:text) # => ["Vanilla"]
search finds all nodes, as a NodeSet, that match the CSS selector 'flavor'.
search('flavor').map(&:text) walks the NodeSet and applies (map) the text method to each Node, returning the text content of each one as a string.
If your XML is actually something more complex:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<?xml version="1.0" encoding="UTF-8"?>
<ice_cream>
<flavor>Vanilla</flavor>
<flavor>Chocolate</flavor>
<flavor>Strawberry</flavor>
</ice_cream>
EOT
doc.search('flavor') # => [#<Nokogiri::XML::Element:0x3fcc2a577afc name="flavor" children=[#<Nokogiri::XML::Text:0x3fcc2a5778e0 "Vanilla">]>, #<Nokogiri::XML::Element:0x3fcc2a5776c4 name="flavor" children=[#<Nokogiri::XML::Text:0x3fcc2a5774bc "Chocolate">]>, #<Nokogiri::XML::Element:0x3fcc2a5772b4 name="flavor" children=[#<Nokogiri::XML::Text:0x3fcc2a572c78 "Strawberry">]>]
doc.search('flavor').map(&:text) # => ["Vanilla", "Chocolate", "Strawberry"]
Or:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<?xml version="1.0" encoding="UTF-8"?>
<ice_creams>
<ice_cream>
<flavor>Vanilla</flavor>
</ice_cream>
<ice_cream>
<flavor>Chocolate</flavor>
</ice_cream>
<ice_cream>
<flavor>Strawberry</flavor>
</ice_cream>
</ice_creams>
EOT
ice_cream = doc.search('ice_cream') # => [#<Nokogiri::XML::Element:0x3fe6a91f6b00 name="ice_cream" children=[#<Nokogiri::XML::Text:0x3fe6a91f68f8 "\n ">, #<Nokogiri::XML::Element:0x3fe6a91f681c name="flavor" children=[#<Nokogiri::XML::Text:0x3fe6a91f6600 "Vanilla">]>, #<Nokogiri::XML::Text:0x3fe6a91f63f8 "\n ">]>, #<Nokogiri::XML::Element:0x3fe6a91f1de4 name="ice_cream" children=[#<Nokogiri::XML::Text:0x3fe6a91f1bdc "\n ">, #<Nokogiri::XML::Element:0x3fe6a91f1ac4 name="flavor" children=[#<Nokogiri::XML::Text:0x3fe6a91f1880 "Chocolate">]>, #<Nokogiri::XML::Text:0x3fe6a91f1678 "\n ">]>, #<Nokogiri::XML::Element:0x3fe6a91f13f8 name="ice_cream" children=[#<Nokogiri::XML::Text:0x3fe6a91f1074 "\n ">, #<Nokogiri::XML::Element:0x3fe6a91f0e80 name="flavor" children=[#<Nokogiri::XML::Text:0x3fe6a91f0a98 "Strawberry">]>, #<Nokogiri::XML::Text:0x3fe6a91f0840 "\n ">]>]
ice_cream.search('flavor').map(&:text) # => ["Vanilla", "Chocolate", "Strawberry"]
For searching, Nokogiri supports using both CSS and XPath selectors, and allows you to use either in the methods, if you want. search accepts both CSS and XPath, and has corollaries of css and xpath for the CSS or XPath specific methods. at returns a single Node and is similar to search('some_node').first and has at_css and at_xpath respectively.
Here is the code:
require 'rexml/document'
doc = <<-xml
<?xml version="1.0" encoding="UTF-8"?>
<ice_cream>
<flavor>Vanilla</flavor>
</ice_cream>
xml
xml_doc = REXML::Document.new(doc)
xml_doc.get_elements('//flavor').class # => Array
xml_doc.get_elements('//flavor')[0].class # => REXML::Element
xml_doc.get_elements('//flavor')[0].text # => "Vanilla"
Actually, xml_doc.get_elements('//flavor') gives you a collection (an Array) of REXML::Element objects. You then need to iterate through the collection and call the #text method on each REXML::Element object to get its text.
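A minimal sketch of that iteration, assuming a document with two flavor elements (the second one is added here for illustration):

```ruby
require 'rexml/document'

xml = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <ice_cream>
    <flavor>Vanilla</flavor>
    <flavor>Chocolate</flavor>
  </ice_cream>
XML

xml_doc = REXML::Document.new(xml)

# get_elements returns an Array of REXML::Element objects;
# map calls #text on each element and collects the results.
flavors = xml_doc.get_elements('//flavor').map(&:text)

puts flavors.inspect
```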
How do I get the value of the message attribute ("ready to use")?
<?xml version="1.0" encoding="UTF-8"?>
<response status="ok" permission_level="admin" message="ready to use" cached="0">
<title>kit</title>
</response>
Thanks
require 'rubygems'
require 'nokogiri'
string = %Q{
<?xml version="1.0" encoding="UTF-8"?>
<response status="ok" permission_level="admin" message="ready to use" cached="0">
<title>kit</title>
</response>
}
doc = Nokogiri::XML(string)
doc.css("response").each do |response_node|
puts response_node["message"]
end
Save and run this Ruby file, and you will get the result:
#=> ready to use
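If Nokogiri is not available, a similar lookup can be sketched with REXML from Ruby's bundled libraries. This is an alternative to the answer's code and not part of it. It assumes the attribute sits on the root element, as it does in the question's XML.

```ruby
require 'rexml/document'

xml = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <response status="ok" permission_level="admin" message="ready to use" cached="0">
    <title>kit</title>
  </response>
XML

doc = REXML::Document.new(xml)

# attributes[] returns the attribute's value as a String, or nil if absent.
message = doc.root.attributes['message']

puts message
```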
You subscript the node with the attribute name.
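require 'nokogiri'
require 'open-uri'
# URI.open comes from open-uri; a bare open() no longer fetches URLs on current Rubies.
doc = Nokogiri::HTML(URI.open('http://google.com'))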
doc.css('img:first').first['alt']
=> "Google"
I have the following xml file:
/my_file.xml
<?xml version="1.0" encoding="utf-8" ?>
<words>
<w>my_word</w>
<w>second_word</w>
</words>
How can I do the following using Ruby:
Load
Parse
Transform the XML file into a Ruby array:
words = ["my_word","second_word"]
With the Nokogiri gem...
require 'rubygems'
require 'nokogiri'
xml = '<?xml version="1.0" encoding="utf-8" ?>
<words>
<w>my_word</w>
<w>second_word</w>
</words>'
doc = Nokogiri::XML(xml)
words = doc.xpath("//w").map {|x| x.text}
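The snippet above parses an inline string, so the "Load" step is not shown. Here is a rough sketch of all three steps reading from disk. It uses REXML so that it runs without extra gems (a swapped-in library; the XPath is the same), and a temporary file stands in for /my_file.xml, which is an assumption made only so the example is self-contained.

```ruby
require 'rexml/document'
require 'tempfile'

# Stand-in for /my_file.xml so the sketch can run anywhere.
file = Tempfile.new(['my_file', '.xml'])
file.write(<<~XML)
  <?xml version="1.0" encoding="utf-8" ?>
  <words>
    <w>my_word</w>
    <w>second_word</w>
  </words>
XML
file.close

xml   = File.read(file.path)                        # Load
doc   = REXML::Document.new(xml)                    # Parse
words = REXML::XPath.match(doc, '//w').map(&:text)  # Transform

puts words.inspect
file.unlink
```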
I am trying to read an XML file that also contains Hebrew letters. Its content is:
<?xml version="1.0" encoding="UTF-8"?>
<keywords type="array">
<keyword>seo software</keyword>
<keyword>ipad</keyword>
<keyword>muffuletta manhattanization</keyword>
<keyword>cheap motels</keyword>
<keyword>שפות תכנות</keyword>
</keywords>
And my code to do it is:
# encoding: UTF-8
def use
#require "rexml/document"
file = File.new( "sources/rankabove-test.xml" )
puts file.read
end
However, it doesn't help me, and the output of the 'puts' command is gibberish for the Hebrew letters:
╫⌐╫ñ╫ץ╫¬ ╫¬╫¢╫á╫ץ╫¬
I am using Windows XP 32-bit. Is anyone familiar with this problem? Is there anything I can do?
I don't think the problem is Ruby:
# encoding: UTF-8
puts RUBY_VERSION
# >> 1.9.2
xml = '
<?xml version="1.0" encoding="UTF-8"?>
<keywords type="array">
<keyword>seo software</keyword>
<keyword>ipad</keyword>
<keyword>muffuletta manhattanization</keyword>
<keyword>cheap motels</keyword>
<keyword>שפות תכנות</keyword>
</keywords>
'
require 'nokogiri'
doc = Nokogiri::XML(xml)
puts doc.search('//keyword').last.text
# >> שפות תכנות
require "rexml/document"
require 'rexml/node'
require 'rexml/xpath'
doc = REXML::Document.new(xml)
puts REXML::XPath.match(doc, '//keyword').last.text
# >> שפות תכנות
Using both Nokogiri and REXML I get the same output on Mac OS.
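One way to check whether Ruby read the file correctly, independent of what the console displays, is to inspect the string's encoding and bytes. The sketch below is only an illustration: the temporary file is a hypothetical stand-in for sources/rankabove-test.xml, and the idea that the console code page causes the gibberish is an inference, not something confirmed in the question.

```ruby
# encoding: UTF-8
require 'tempfile'

xml = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <keywords type="array">
    <keyword>ipad</keyword>
    <keyword>שפות תכנות</keyword>
  </keywords>
XML

# Hypothetical stand-in for sources/rankabove-test.xml.
file = Tempfile.new(['keywords', '.xml'])
file.close
File.binwrite(file.path, xml)

# Read the file back, telling Ruby explicitly that it holds UTF-8.
content = File.read(file.path, encoding: 'UTF-8')
hebrew  = content[/שפות תכנות/]

puts content.encoding
puts content.valid_encoding?
# First letter of the Hebrew keyword as hex bytes.
puts hebrew.bytes.first(2).map { |b| format('%02X', b) }.join(' ')
file.unlink
```

If the encoding is UTF-8 and the string is valid, the data is intact and the remaining suspect is how the console renders those bytes.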