Nokogiri and XPath help - ruby

Admittedly, I'm a Nokogiri newbie and I must be missing something...
I'm simply trying to print the author > name node out of this XML:
<?xml version="1.0" encoding="UTF-8"?>
<entry xmlns:gd="http://schemas.google.com/g/2005" xmlns:docs="http://schemas.google.com/docs/2007" xmlns="http://www.w3.org/2005/Atom" gd:etag="">
<category term="http://schemas.google.com/docs/2007#document" scheme="http://schemas.google.com/g/2005#kind"/>
<author>
<name>Matt</name>
<email>Darby</email>
</author>
<title>Title</title>
</entry>
I'm trying to use this, but it prints nothing. Seemingly no node test (even '*') returns anything.
Nokogiri::XML(@xml_string).xpath("//author/name").each do |node|
  puts node
end

Alejandro already answered this in his comment (+1) but I'm adding this answer too because he left out the Nokogiri code.
Selecting elements in some namespace using Nokogiri with XPath
The elements you are trying to select are in the default namespace, which in this case is http://www.w3.org/2005/Atom. Note the xmlns="http://www.w3.org/2005/Atom" attribute on the entry element. Your XPath expression instead matches elements that are not in any namespace. This is why your code worked once the namespaces were removed.
You need to define a namespace context for your XPath expression and point your XPath steps at elements in that namespace. AFAIK there are a few different ways to accomplish this with Nokogiri; one of them is shown below:
xml.xpath("//a:author/a:name", {"a" => "http://www.w3.org/2005/Atom"})
Note that here we define a namespace-to-prefix mapping and use this prefix (a) in the XPath expression.

For some reason, using remove_namespaces! makes the above bit work as expected.
xml = Nokogiri::XML(@xml_string)
xml.remove_namespaces!
xml.xpath("//author/name").each do |node|
  puts node.text
end
=> "Matt"

Related

Nokogiri not parsing XML in ruby - xmlns issue?

Given the following ruby code :
require 'nokogiri'
xml = "<?xml version='1.0' encoding='UTF-8'?>
<ProgramList xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance' xmlns:xsd='http://www.w3.org/2001/XMLSchema' xmlns='http://publisher.webservices.affili.net/'>
<TotalRecords>145</TotalRecords>
<Programs>
<ProgramSummary>
<ProgramID>6540</ProgramID>
<Title>Matalan</Title>
<Limitations>A bit of text
</Limitations>
<URL>http://www.matalan.co.uk</URL>
<ScreenshotURL>http://www.matalan.co.uk/</ScreenshotURL>
<LaunchDate>2009-11-02T00:00:00</LaunchDate>
<Status>1</Status>
</ProgramSummary>
<ProgramSummary>
<ProgramID>11787</ProgramID>
<Title>Club 18-30</Title>
<Limitations/>
<URL>http://www.club18-30.com/</URL>
<ScreenshotURL>http://www.club18-30.com</ScreenshotURL>
<LaunchDate>2013-05-16T00:00:00</LaunchDate>
<Status>1</Status>
</ProgramSummary>
</Programs>
</ProgramList>"
doc = Nokogiri::XML(xml)
p doc.xpath("//Programs")
gives :
=> []
Not what is expected.
On further investigation if I remove xmlns='http://publisher.webservices.affili.net/' from the initial <ProgramList> tag I get the expected output.
Indeed if I change xmlns='http://publisher.webservices.affili.net/' to xmlns:anything='http://publisher.webservices.affili.net/' I get the expected output.
So my question is what is going on here? Is this malformed XML? And what is the best strategy for dealing with it?
While it's hardcoded in this example the XML is (will be) coming from a web service.
Update
I realise I can use the remove_namespaces! method but the Nokogiri docs do say that it's "...probably is not a good thing in general" to do this. Also I'm interested in why it's happening and what the 'correct' XML should be.
The xmlns='http://publisher.webservices.affili.net/' indicates the default namespace for all elements under the one where it appears (including the element itself). That means that all elements that don’t otherwise have an explicit namespace fall under this namespace.
XPath queries don’t have default namespaces (at least in XPath 1.0), so any name that appears in one without a prefix refers to that element in no namespace.
In your code, you want to find Programs elements in the http://publisher.webservices.affili.net/ namespace (since that is the default namespace), but your XPath query looks for Programs elements in no namespace.
To explicitly specify the namespace in the query, you can do something like this:
doc.xpath("//pub:Programs", "pub" => "http://publisher.webservices.affili.net/")
Nokogiri makes this a little easier for namespaces declared on the root element (as in this case), declaring them for you with the same prefix. It will also declare the default namespace using the xmlns prefix, so you can also do:
doc.xpath("//xmlns:Programs")
which will give you the same result.

Why can't REXML parse CDATA preceded by a line break?

I'm very new to Ruby, and trying to parse an XML document with REXML that has been previously pretty-printed (by REXML) with some slightly erratic results.
Some CDATA sections have a line break after the opening XML tag but before the opening of the CDATA block. In these cases REXML parses the text of the tag as empty.
Any idea if I can get REXML to read these lines?
If not, could I rewrite them beforehand with a regex or something?
Is this even valid XML?
Here's an example XML document (much abridged):
<?xml version="1.0" encoding="utf-8"?>
<root-tag>
<content type="base64"><![CDATA[V2VsbCBkb25lISBJdCB3b3JrcyA6KQ==]]></content>
<content type="base64">
<![CDATA[VGhpcyB3b250IHdvcms=]]></content>
<content><![CDATA[This will work]]></content>
<content>
<![CDATA[This will not appear]]></content>
<content>
Seems happy</content>
<content>Obviously no problem</content>
</root-tag>
and here's my Ruby script (distilled down to a minimal example):
require 'rexml/document'
require 'base64'
include REXML
module RexmlSpike
file = File.new("ex.xml")
doc = Document.new file
doc.elements.each("root-tag/content") do |contentElement|
if contentElement.attributes["type"] == "base64"
puts "decoded: " << Base64.decode64(contentElement.text)
else
puts "raw: " << contentElement.text
end
end
puts "Finished."
end
The output I get is:
>> ruby spike.rb
decoded: Well done! It works :)
decoded:
raw: This will work
raw:
raw:
Seems happy
raw: Obviously no problem
Finished.
I'm using Ruby 1.9.3p392 on OSX Lion. The object of the exercise is ultimately to parse comments from some BlogML into the custom import XML used by Disqus.
Why
Any text before the <![CDATA[]]> section takes precedence over the section's contents, whether that text is a letter, a newline (as you've discovered), or a single space. This makes sense, because your example asks for the text of the element, and whitespace counts as text. In the examples where you can read the <![CDATA[]]> contents, nothing precedes the section, so it is the first text the element contains.
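One way to see this, as a rough sketch, is to list the child nodes REXML builds for one of the problem elements. The inline XML string is a cut-down copy of the example from the question, and the variable names are illustrative.

```ruby
require 'rexml/document'

doc = REXML::Document.new("<content>\n<![CDATA[This will not appear]]></content>")
element = doc.root

# The newline becomes its own text node, ahead of the CDATA node.
kinds = element.children.map(&:class)

# Element#text reports the first text node, which here is just the newline.
first_text = element.text

puts kinds.inspect
puts first_text.inspect
```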
Solution
If you look at the documentation for Element, you'll see that it has a function called cdatas() that:
Get an array of all CData children. IMMUTABLE.
So, in your example, if you do an inner loop on contentElement.cdatas() you would see the content of all your missing tags.
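A minimal sketch of that inner loop, assuming a two-element excerpt of the question's document (the xml string and the names texts and content_element are illustrative):

```ruby
require 'rexml/document'

xml = "<root-tag>" \
      "<content><![CDATA[This will work]]></content>" \
      "<content>\n<![CDATA[This will not appear]]></content>" \
      "</root-tag>"
doc = REXML::Document.new(xml)

texts = doc.elements.to_a("root-tag/content").map do |content_element|
  # cdatas returns every CData child; value is the raw string inside it.
  content_element.cdatas.map(&:value).join
end

puts texts.inspect
```

Because this reads only the CDATA children, the leading newline no longer hides the content. Elements that hold plain text with no CDATA section would come back as empty strings and still need contentElement.text.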
I'd recommend using Nokogiri, which is the de facto XML/HTML parser for Ruby. Using it to access the contents of the <content> tags, I get:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<?xml version="1.0" encoding="utf-8"?>
<root-tag>
<content type="base64"><![CDATA[V2VsbCBkb25lISBJdCB3b3JrcyA6KQ==]]></content>
<content type="base64">
<![CDATA[VGhpcyB3b250IHdvcms=]]></content>
<content><![CDATA[This will work]]></content>
<content>
<![CDATA[This will not appear]]></content>
<content>
Seems happy</content>
<content>Obviously no problem</content>
</root-tag>
EOT
doc.search('content').each do |n|
puts n.content
end
Which outputs:
V2VsbCBkb25lISBJdCB3b3JrcyA6KQ==
VGhpcyB3b250IHdvcms=
This will work
This will not appear
Seems happy
Obviously no problem
Your XML is valid, but it does not mean what you expect, as @lightswitch05 pointed out. You can check it with the W3C XML validator.
If you are using XML from the wild world web, it is a good idea to use Nokogiri because it usually works as you think it should, not as it really should.
Side note: this is exactly why I avoid XML and use JSON instead: XML has a proper definition, but no one seems to follow it anyway.

Nokogiri xpath query results in String instead of NodeSet

I have a Nokogiri node which I'm doing an xpath query on which should return a NodeSet. Instead it returns a String. I checked the xml source and found that the data only contains one element instead of many.
Shouldn't it return a NodeSet with only one value, instead of a String? How do I deal with this?
Here's the pseudo xml which correctly returns a NodeSet with 2 entries:
<root>
<products>
<product>
<productID>1</productID>
</product>
<product>
<productID>2</productID>
</product>
</products>
</root>
Here's the pseudo xpath query:
//root/products/product
If the xml only contains one product, I get a String instead of a NodeSet with 1 entry
<root>
<products>
<product>
<productID>1</productID>
</product>
</products>
</root>
Update 6/12/2012: I still believe this is a bug in Nokogiri. The above pseudo XML does not reproduce the condition; however, I have several XML examples from a client which do reproduce the issue. I could probably post an obfuscated version of the XML. In any case I have changed the code to use XmlSimple instead of Nokogiri.
Works for me:
require 'nokogiri'
xml = "<root><products>
<product><productID>1</productID></product>
</products></root>"
p Nokogiri.XML(xml).xpath('//root/products/product').class,
#=> Nokogiri::XML::NodeSet
Nokogiri::VERSION,
#=> "1.5.2"
RUBY_DESCRIPTION
#=> "ruby 1.9.3p125 (2012-02-16) [x86_64-darwin11.3.0]"
Either your version of Nokogiri is bad (leaning on a bad libxml2 version, likely), or your code is sufficiently different that you need to provide us with a way to reproduce your problem.
I ran into this "issue" as well, but after a bit of head scratching, I found out what I was doing wrong... I was trying to debug the xpath by printing out the results as in
product_element = Nokogiri.XML(xml).xpath('//root/products/product')
print "product_element is - #{product_element}\n"
that prints out the string version of the element, but instead when I used
product_element = Nokogiri.XML(xml).xpath('//root/products/product')
p product_element
that correctly showed it as a NodeSet.
... This may not be what was happening to you, but

Remove all but certain tags in an XML document with Ruby

require 'nokogiri'
doc = Nokogiri::XML "<root>
<a>foo<c>bar</c></a>
<b>jim<d>jam></d></b>
<a>more</a>
<x>no no no</x>
</root>"
doc.css("a, b").each {|o| p o.to_s}
# "<a>foo<c>bar</c></a>"
# "<a>more</a>"
# "<b>jim<d>jam></d></b>"
How can I keep tags in their original order? Or also remove nested tags?
You might want to look at whitelist/blacklist/scrubbing gems. Sanitize and Loofah come to mind.
From Sanitize's description:
Given a list of acceptable elements and attributes, Sanitize will remove all unacceptable HTML from a string.
From Loofah's description:
Loofah excels at HTML sanitization (XSS prevention). It includes some nice HTML sanitizers, which are based on HTML5lib’s whitelist, so it most likely won’t make your codes less secure. (These statements have not been evaluated by Netexperts.)
In either case, they'll save you from reinventing a wheel.
require 'nokogiri'
doc = Nokogiri::XML "
<root>
<a>foo<c>bar</c></a>
<b>jim<d>jam></d></b>
<a>more</a>
<x>no no no</x>
</root>"
doc.xpath('root//*[name()!="a"][name()!="b"]').remove
puts doc
#=> <?xml version="1.0"?>
#=> <root>
#=> <a>foo</a>
#=> <b>jim</b>
#=> <a>more</a>
#=>
#=> </root>
If this is just an issue of order and none of the tags you need to isolate are nested, using XPath instead of CSS selectors in Nokogiri should return the tags in the same order they are in the document:
doc.xpath("//a | //b").each { |o| puts o }
I'm not sure if this behavior is in any spec for Nokogiri, so you may want to be careful, but in my experience it is true.
Of course, if the tags you're after are ever nested you may need to define what it means to "remove all but certain tags" (e.g. what happens to removed tags and their contents that exist inside non-removed tags and their contents, etc.).
If your requirement is sufficiently complicated such that XPath queries won't cut it, you may need to "walk the DOM" using something like doc.root.children and recursively examine the children of each node.

Traverse xml structure to determine if a certain text node exists

Alright I have an xml document that looks something like this:
<xml>
<list>
<partner>
<name>Some Name</name>
<status>active</status>
<id>0</id>
</partner>
<partner>
<name>Another Name</name>
<status>active</status>
<id>1</id>
</partner>
</list>
</xml>
I am using Ruby's libxml bindings to parse it.
I want to find out whether there is a partner with the name 'Some Name' in a quick and idiomatic Ruby way.
How can I do this in one line of Ruby code, assuming I have the document parsed in a variable named document, such that I can call document.find(xpath) to retrieve nodes? I have had to do this multiple times in slightly different scenarios and now it's starting to bug me.
I know I can do the following:
found = false
document.find('//partner/name').each do |name|
if (name.content == 'Some Name')
found = true
break
end
end
assert(found, "Some Name should have been found")
but I find this really ugly. I thought about using Enumerable's include? method, but that still won't work because I need the .content of each node as opposed to the node itself...
While writing this, I thought of this (it seems somewhat inefficient, albeit elegant):
found = document.find('//partner/name').collect{|name| name.content}.member?("Some Name")
Are there any other ways of doing this?
What about this?
found = !document.find("//partner[name='Some Name']").empty?
I tried this solution:
found = document.find("//partner[name='Some Name']") != nil
but I got an error saying the xpath expression was invalid.
However, I was reading some XPath documentation, and it looks like you can call the text() function in the expression to get the text node. I tried the following and it appears to work:
found = document.find("//partner/name/text()='Some Name'")
Here found is not an XML node but a true/false value, so this works.
I would use a language that natively operates on XML (XQuery, for example). With XQuery it is possible to formulate this sort of query over XML data in a concise and elegant way.

Resources