Nokogiri xpath query results in String instead of NodeSet

I'm running an XPath query on a Nokogiri node, and it should return a NodeSet. Instead it returns a String. I checked the XML source and found that in this case the data contains only one element instead of many.
Shouldn't it return a NodeSet with only one entry, instead of a String? How do I deal with this?
Here's the pseudo xml which correctly returns a NodeSet with 2 entries:
<root>
<products>
<product>
<productID>1</productID>
</product>
<product>
<productID>2</productID>
</product>
</products>
</root>
Here's the pseudo xpath query:
//root/products/product
If the XML contains only one product, I get a String instead of a NodeSet with 1 entry:
<root>
<products>
<product>
<productID>1</productID>
</product>
</products>
</root>
Update 6/12/2012: I still believe this is a bug in Nokogiri. The above pseudo XML does not reproduce the condition, but I have several XML examples from a client which do reproduce the issue. I could probably post an obfuscated version of the XML. In any case, I have changed the code to use XmlSimple instead of Nokogiri.

Works for me:
require 'nokogiri'
xml = "<root><products>
<product><productID>1</productID></product>
</products></root>"
p Nokogiri.XML(xml).xpath('//root/products/product').class,
#=> Nokogiri::XML::NodeSet
Nokogiri::VERSION,
#=> "1.5.2"
RUBY_DESCRIPTION
#=> "ruby 1.9.3p125 (2012-02-16) [x86_64-darwin11.3.0]"
Either your version of Nokogiri is bad (leaning on a bad libxml2 version, likely), or your code is sufficiently different that you need to provide us with a way to reproduce your problem.

I ran into this "issue" as well, but after a bit of head scratching, I found out what I was doing wrong... I was trying to debug the xpath by printing out the results as in
product_element = Nokogiri.XML(xml).xpath('//root/products/product')
print "product_element is - #{product_element}\n"
that prints out the string version of the element, but instead when I used
product_element = Nokogiri.XML(xml).xpath('//root/products/product')
p product_element
that correctly showed it as a NodeSet.
... This may not be what was happening to you, but

Related

Ruby hash to XML: How would I create duplicate keys in a hash for repeated XML xpaths?

I have to create XML that looks something like this:
<?xml version="1.0" ?>
<FirstLevel>
<Package>
<Name></Name>
</Package>
<Package>
<Name></Name>
</Package>
...
</FirstLevel>
As you can see, Package shows up multiple times at the same level in the structure.
I know you can't have duplicate keys in a Ruby hash, so I don't know how I would be able to go from a hash to XML when there are duplicate keys. Does anyone have any ideas?
I'm using Hash#to_xml to convert my hash to XML (made available by ActiveSupport I believe).
By the way, I'm using Rails.

Okay I believe I figured it out. You have to use Hash#compare_by_identity. I believe this makes it so that the key lookups are done using object id as opposed to string matches.
I found it in "Ruby Hash with duplicate keys?".
{}.compare_by_identity
h1 = {}
h1.compare_by_identity
h1["a"] = 1
h1["a"] = 2
p h1 # => {"a"=>1, "a"=>2}
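Applied to the Package case from the question, the same idea might look like the sketch below. It uses only core Ruby; the "Name" values are placeholders, and whether Hash#to_xml emits one element per entry for such a hash is an assumption this example does not verify. The keys are created with dup so that each one is a distinct object even when string literals are frozen.

```ruby
packages = {}.compare_by_identity

# Each key must be a distinct String object; equal contents are not enough.
packages["Package".dup] = { "Name" => "first" }
packages["Package".dup] = { "Name" => "second" }

puts packages.size    # two entries, although the keys are equal as strings
p packages.keys

# Reusing the same key object overwrites the entry instead of adding one.
key = "Package".dup
packages[key] = { "Name" => "third" }
packages[key] = { "Name" => "fourth" }
puts packages.size
```

Note that lookups on such a hash also go by identity, so packages["Package"] with a fresh string returns nil.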

Why can't REXML parse CDATA preceded by a line break?

I'm very new to Ruby, and trying to parse an XML document with REXML that has been previously pretty-printed (by REXML) with some slightly erratic results.
Some CDATA sections have a line break after the opening XML tag, but before the opening of the CDATA block, in these cases REXML parses the text of the tag as empty.
Any idea if I can get REXML to read these lines?
If not, could I re-write them before hand with a regex or something?
Is this even Valid XML?
Here's an example XML document (much abridged):
<?xml version="1.0" encoding="utf-8"?>
<root-tag>
<content type="base64"><![CDATA[V2VsbCBkb25lISBJdCB3b3JrcyA6KQ==]]></content>
<content type="base64">
<![CDATA[VGhpcyB3b250IHdvcms=]]></content>
<content><![CDATA[This will work]]></content>
<content>
<![CDATA[This will not appear]]></content>
<content>
Seems happy</content>
<content>Obviously no problem</content>
</root-tag>
and here's my Ruby script (distilled down to a minimal example):
require 'rexml/document'
require 'base64'
include REXML
module RexmlSpike
file = File.new("ex.xml")
doc = Document.new file
doc.elements.each("root-tag/content") do |contentElement|
if contentElement.attributes["type"] == "base64"
puts "decoded: " << Base64.decode64(contentElement.text)
else
puts "raw: " << contentElement.text
end
end
puts "Finished."
end
The output I get is:
>> ruby spike.rb
decoded: Well done! It works :)
decoded:
raw: This will work
raw:
raw:
Seems happy
raw: Obviously no problem
Finished.
I'm using Ruby 1.9.3p392 on OSX Lion. The object of the exercise is ultimately to parse comments from some BlogML into the custom import XML used by Disqus.

Why
Element#text returns only the element's first text node, and whitespace counts as text. So anything that precedes the <![CDATA[]]> section, whether a letter, a single space, or a newline (as you've discovered), becomes that first text node and hides the CDATA content. In the examples where you can read the <![CDATA[]]> content, nothing precedes it, so the CDATA section itself is the first text node.
Solution
If you look at the documentation for Element, you'll see that it has a function called cdatas() that:
Get an array of all CData children. IMMUTABLE.
So, in your example, if you do an inner loop on contentElement.cdatas() you would see the content of all your missing tags.
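A minimal sketch of that inner loop, assuming the rexml library is available (it ships as a bundled gem with recent Ruby versions) and using a shortened copy of the document from the question. The texts and cdata_values arrays are only there to make the difference visible.

```ruby
require 'rexml/document'

xml = <<~XML
  <root-tag>
  <content><![CDATA[This will work]]></content>
  <content>
  <![CDATA[This will not appear]]></content>
  </root-tag>
XML

doc = REXML::Document.new(xml)
texts = []
cdata_values = []

doc.elements.each("root-tag/content") do |content_element|
  # text gives only the first text node; cdatas gives every CDATA child.
  texts << content_element.text
  cdata_values << content_element.cdatas.map(&:value).join
  puts "text: #{texts.last.inspect}  cdatas: #{cdata_values.last.inspect}"
end
```

For the second element, text holds only the leading newline, while cdatas still exposes the CDATA content.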

I'd recommend using Nokogiri, which is the de facto XML/HTML parser for Ruby. Using it to access the contents of the <content> tags, I get:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<?xml version="1.0" encoding="utf-8"?>
<root-tag>
<content type="base64"><![CDATA[V2VsbCBkb25lISBJdCB3b3JrcyA6KQ==]]></content>
<content type="base64">
<![CDATA[VGhpcyB3b250IHdvcms=]]></content>
<content><![CDATA[This will work]]></content>
<content>
<![CDATA[This will not appear]]></content>
<content>
Seems happy</content>
<content>Obviously no problem</content>
</root-tag>
EOT
doc.search('content').each do |n|
puts n.content
end
Which outputs:
V2VsbCBkb25lISBJdCB3b3JrcyA6KQ==
VGhpcyB3b250IHdvcms=
This will work
This will not appear
Seems happy
Obviously no problem

Your XML is valid, but it does not mean what you expect, as @lightswitch05 pointed out. You can check it with the W3C XML validator.
If you are using XML from the wild world web, it is a good idea to use Nokogiri because it usually works as you think it should, not as it really should.
Side note: this is exactly why I avoid XML and use JSON instead: XML has a proper definition, but no one seems to follow it anyway.

Problem reading XML with Nokogiri

My Ruby script is supposed to read in an XML doc from a URL and check it for well-formedness, returning any errors. I have a sample bad XML document hosted with the following text (from the Nokogiri tutorial):
<?xml version="1.0"?>
<root>
<open>foo
<closed>bar</closed>
</root>
My test script is as follows (url refers to the above xml file hosted on my personal server):
require 'nokogiri'
document = Nokogiri::XML(url)
puts document
puts document.errors
The output is:
<?xml version="1.0"?>
Start tag expected, '<' not found
Why is it only capturing the first line of the XML file? It does this even with known good XML files.

It is trying to parse the URL itself, not the content behind it. The first parameter to Nokogiri::XML must be a string containing the document, or an IO object, since it is just a shortcut to Nokogiri::XML::Document.parse as stated here.
EDIT: For reading from a URI:
require 'open-uri'
# Kernel#open no longer opens URLs as of Ruby 3.0; use URI.open instead
URI.open(uri).read

I'm not too sure what code you are using to actually output the contents of the XML. I only see error printing code. However, I have posted some sample code to effectively move through XML with Nokogiri below:
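<items>
<item>
Something
</item>
<item>
Else
</item>
</items>
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::XML(URI.open(url))
set = doc.xpath('//item')
# item.text returns the element's text content; to_s would print the tags as well
set.each { |item| puts item.text.strip }
#=> Something
#=> Else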
In general, the tutorial here should help you.

If you are getting the XML from an existing Nokogiri document, make sure you call '.to_s' on it before passing it to the XML function.
For example:
xml = Nokogiri::XML(existing_nokogiri_xml_doc.to_s)

Remove all but certain tags in an XML document with Ruby

require 'nokogiri'
doc = Nokogiri::XML "<root>
<a>foo<c>bar</c></a>
<b>jim<d>jam></d></b>
<a>more</a>
<x>no no no</x>
</root>"
doc.css("a, b").each {|o| p o.to_s}
# "<a>foo<c>bar</c></a>"
# "<a>more</a>"
# "<b>jim<d>jam></d></b>"
How can I keep tags in their original order? Or also remove nested tags?

You might want to look at whitelist/blacklist/scrubbing gems. Sanitize and Loofah come to mind.
From Sanitize's description:
Given a list of acceptable elements and attributes, Sanitize will remove all unacceptable HTML from a string.
From Loofah's description:
Loofah excels at HTML sanitization (XSS prevention). It includes some nice HTML sanitizers, which are based on HTML5lib’s whitelist, so it most likely won’t make your codes less secure. (These statements have not been evaluated by Netexperts.)
In either case, they'll save you from reinventing a wheel.

require 'nokogiri'
doc = Nokogiri::XML "
<root>
<a>foo<c>bar</c></a>
<b>jim<d>jam></d></b>
<a>more</a>
<x>no no no</x>
</root>"
doc.xpath('root//*[name()!="a"][name()!="b"]').remove
puts doc
#=> <?xml version="1.0"?>
#=> <root>
#=> <a>foo</a>
#=> <b>jim</b>
#=> <a>more</a>
#=>
#=> </root>

If this is just an issue of order and none of the tags you need to isolate are nested, using XPath instead of CSS selectors in Nokogiri should return the tags in the same order they appear in the document:
doc.xpath("//a | //b").each { |o| puts o }
I'm not sure if this behavior is in any spec for Nokogiri, so you may want to be careful, but in my experience it is true.
Of course, if the tags you're after are ever nested you may need to define what it means to "remove all but certain tags" (e.g. what happens to removed tags and their contents that exist inside non-removed tags and their contents, etc.).
If your requirement is sufficiently complicated such that XPath queries won't cut it, you may need to "walk the DOM" using something like doc.root.children and recursively examine the children of each node.

Nokogiri and XPath help

Admittedly, I'm a Nokogiri newbie and I must be missing something...
I'm simply trying to print the author > name node out of this XML:
<?xml version="1.0" encoding="UTF-8"?>
<entry xmlns:gd="http://schemas.google.com/g/2005" xmlns:docs="http://schemas.google.com/docs/2007" xmlns="http://www.w3.org/2005/Atom" gd:etag="">
<category term="http://schemas.google.com/docs/2007#document" scheme="http://schemas.google.com/g/2005#kind"/>
<author>
<name>Matt</name>
<email>Darby</email>
</author>
<title>Title</title>
</entry>
I'm trying to use this, but it prints nothing. Seemingly no query (even '*') returns anything.
Nokogiri::XML(@xml_string).xpath("//author/name").each do |node|
puts node
end

Alejandro already answered this in his comment (+1) but I'm adding this answer too because he left out the Nokogiri code.
Selecting elements in some namespace using Nokogiri with XPath
The elements you are trying to select are in the default namespace, which in this case is http://www.w3.org/2005/Atom. Note the xmlns="http://www.w3.org/2005/Atom" attribute on the entry element. Your XPath expression instead matches elements that are not in any namespace, which is why the same expression works on a document without namespaces.
You need to define a namespace context for your XPath expression and point your XPath steps at elements in that namespace. AFAIK there are a few different ways to accomplish this with Nokogiri; one of them is shown below:
xml.xpath("//a:author/a:name", {"a" => "http://www.w3.org/2005/Atom"})
Note that here we define a namespace-to-prefix mapping and use this prefix (a) in the XPath expression.

For some reason, using remove_namespaces! makes the above bit work as expected.
xml = Nokogiri::XML(@xml_string)
xml.remove_namespaces!
xml.xpath("//author/name").each do |node|
puts node.text
end
=> "Matt"
