XML Namespace issue with Nokogiri - ruby

I have the following XML:
<body>
<hello xmlns='http://...'>
<world>yes</world>
</hello>
</body>
When I load that into a Nokogiri XML document, and call document.at_css "world", I receive nil back. But when I remove the namespace for hello, it works perfectly. I know I can call document.remove_namespaces!, but why is it that it will not work with the namespace?

Because Nokogiri requires you to register the XML namespaces you are querying within (read more about XML Namespaces). But you should still be able to query the element if you specify its namespace when calling at_css. To see the exact usage, check out the css method documentation. It should end up looking something like this:
document.at_css "ns|world", 'ns' => 'namespace URI'

Related

Access deep nested node from document.xml using nokogiri

I am using nokogiri to access a docx's document xml file.
here is a sample of it:
<w:document>
<w:body>
<w:p w:rsidR="00454EDC" w:rsidRDefault="00454EDC" w:rsidP="00454EDC">
<w:drawing>
<wp:inline distT="0" distB="0" distL="0" distR="0">
<wp:extent cx="1926590" cy="1088571"/>
<wp:effectExtent l="0" t="0" r="0" b="0"/>
<wp:docPr id="1" name="Picture 1"/>
<wp:cNvGraphicFramePr>
<a:graphicFrameLocks xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main" noChangeAspect="1"/>
</wp:cNvGraphicFramePr>
<a:graphic xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
<a:graphicData uri="http://schemas.openxmlformats.org/drawingml/2006/picture">
<pic:pic xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture">
<pic:nvPicPr>
<pic:cNvPr id="0" name="Picture 1"/>
<pic:cNvPicPr>
<a:picLocks noChangeAspect="1" noChangeArrowheads="1"/>
</pic:cNvPicPr>
</pic:nvPicPr>
<pic:blipFill>
<a:blip r:embed="rId5" cstate="print">
<a:extLst>
<a:ext uri="{28A0092B-C50C-407E-A947-70E740481C1C}">
<a14:useLocalDpi xmlns:a14="http://schemas.microsoft.com/office/drawing/2010/main" val="0"/>
</a:ext>
</a:extLst>
</a:blip>
<a:srcRect/>
<a:stretch>
<a:fillRect/>
</a:stretch>
</pic:blipFill>
<pic:spPr bwMode="auto">
<a:xfrm>
<a:off x="0" y="0"/>
<a:ext cx="1951299" cy="1102532"/>
</a:xfrm>
<a:prstGeom prst="rect">
<a:avLst/>
</a:prstGeom>
<a:noFill/>
<a:ln>
<a:noFill/>
</a:ln>
</pic:spPr>
</pic:pic>
</a:graphicData>
</a:graphic>
</wp:inline>
</w:drawing>
</w:p>
</w:body>
</w:document>
Now I want to access all <w:drawing> tags, and from each of them access the <a:blip> tag and extract the value of its r:embed attribute.
In this case, as you can see, it is rId5.
I am able to access the <w:drawing> tags using xml.xpath('//w:drawing'), but when I call xml.xpath('//w:drawing').xpath('//a:blip'), it throws this error:
Nokogiri::XML::XPath::SyntaxError: Undefined namespace prefix: //a:blip
What am I doing wrong? Can anyone point me in the right direction?
The error is telling you that in your XPath query, //a:blip, Nokogiri doesn’t know what namespace a refers to. You need to specify the namespaces that you are targeting in your query, not just the prefix. The fact that the prefix a is defined in the document doesn’t really matter, it is the actual namespace URI that is important. It is possible to use completely different prefixes in the query than those used in the document, as long as the namespace URIs match.
You may be wondering why the query //w:drawing works. You don’t include the full XML, but I suspect that the w prefix is defined on the root node (something like xmlns:w="http://some.uri.here"). If you don’t specify any namespaces, Nokogiri will automatically register any defined in the root node so they will be available in your query. The namespace corresponding to the a prefix isn’t defined on the root, so it is unavailable, and so you get the error you see.
To specify namespaces in Nokogiri, you pass a hash mapping each prefix (as used in the query) to its namespace URI to the xpath method (or whichever query method you’re using). Since you are providing your own namespace mappings, you also need to include any you use from the root node, because Nokogiri doesn’t register them automatically in this case.
In your case, the code would look something like this:
namespaces = {
'w' => 'http://some.uri', # whatever the URI is for this namespace
'a' => 'http://schemas.openxmlformats.org/drawingml/2006/main'
}
# You can combine this into a single query.
# Note the double slash in front of `a:blip`: it is a descendant of
# `w:drawing`, not a direct child, so a single slash would match nothing.
xml.xpath('//w:drawing//a:blip', namespaces)
Have a look at the Nokogiri tutorial section on namespaces for more info.
I would say that this is a bug in the XML parser that you are using:
Indeed, the error says that the namespace prefix a is undefined; however, it has been defined in <a:graphic xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">, which is an ancestor of the <a:blip> element.
See here if you want to know more about XML namespaces.
It seems that there are a few other questions about problems with namespace prefixes in Nokogiri, for example: Undefined namespace prefix in Nokogiri and XPath

How to use "doc" tag in Nokogiri to build an XML document

I have a problem: I must build an XML document with a <doc> tag. I can use any custom tag except "doc".
I need to use "doc". How can I fix this?
You can add an underscore to the name to prevent it being seen as an existing method. See the section “Special Tags” in the Nokogiri Builder docs.
Something like:
Nokogiri::XML::Builder.new do |xml|
# Note the underscore here:
xml.doc_ "A doc tag"
end
This example produces the following XML (the underscore isn’t included in the tag name):
<?xml version="1.0"?>
<doc>A doc tag</doc>

Schema Validation using Nokogiri

I am trying to validate an XML document against a dozen or so schemas using Nokogiri. Currently I have a root schema document that imports all the other schemas, and I validate against that.
Can I point to each schema file from the XML file itself, and have Nokogiri look in the XML file for the schemas to validate against?
The proper way to reference multiple schemata against which to validate an XML file is with the schemaLocation attribute:
<?xml version="1.0"?>
<foo xmlns="http://bar.com/foo"
xmlns:bz="http://biz.biz/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://bar.com/foo http://www.bar.com/schemas/foo.xsd
http://biz.biz/ http://biz.biz/xml/ns/bz.xsd">
For each namespace in your document you list a pair of whitespace-delimited values: the namespace URI followed by a 'hint' as to where to find the schema for that namespace. If you provide a full URI for each hint, then you can process this with Nokogiri as such:
require 'nokogiri'
require 'open-uri'
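doc = Nokogiri.XML( my_xml )
# Split the schemaLocation value into namespace => schema URI pairs
schemata_by_ns = Hash[ doc.root['schemaLocation'].scan(/(\S+)\s+(\S+)/) ]
schemata_by_ns.each do |ns,xsd_uri|
  # URI.open comes from open-uri; on Ruby 3.0 and later a bare open(url) no longer fetches URLs
  xsd = Nokogiri::XML.Schema(URI.open(xsd_uri))
  xsd.validate(doc).each do |error|
    puts error.message
  end
end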
Disclaimer: I have never attempted to validate a single XML document using multiple namespaced schemata with Nokogiri before. As such, I have no direct experience to guarantee that the above validation will work. The validation code is based solely on Nokogiri's schema validation documentation.

How to parse html source code with ruby/nokogiri?

I've successfully used ruby (1.8) and nokogiri's css parsing to pull out front facing data from web pages.
However I now need to pull out some data from a series of pages where the data is in the "meta" tags in the source code of the page.
One of the lines I need is the following:
<meta name="geo.position" content="35.667459;139.706256" />
I've tried using xpath but haven't been able to get it right.
Any help as to what syntax is needed would be much appreciated.
Thanks
This is a good case for a CSS attribute selector. For example:
doc.css('meta[name="geo.position"]').each do |meta_tag|
puts meta_tag['content'] # => 35.667459;139.706256
end
The equivalent XPath expression is almost identical:
doc.xpath('//meta[@name = "geo.position"]').each do |meta_tag|
puts meta_tag['content'] # => 35.667459;139.706256
end
require 'nokogiri'
doc = Nokogiri::HTML('<meta name="geo.position" content="35.667459;139.706256" />')
doc.at('//meta[@name="geo.position"]')['content'] # => "35.667459;139.706256"

HTML Entity problems using Nokogiri::XML.fragment

It seems that all entities are stripped when using
tags = "<p>test umlauts &ouml;</p>"
Nokogiri::XML.fragment(tags)
Result:
<p>test umlauts </p>
The above method calls Nokogiri::XML::DocumentFragment.parse(tags), and that method calls
Nokogiri::XML::DocumentFragment.new(XML::Document.new, tags).
According to the Nokogiri documentation, this code will be executed:
def initialize document, tags=nil
if tags
parser = if self.kind_of?(Nokogiri::HTML::DocumentFragment)
HTML::SAX::Parser.new(FragmentHandler.new(self, tags))
else
XML::SAX::Parser.new(FragmentHandler.new(self, tags))
end
parser.parse(tags)
end
end
I think we are dealing with the XML::SAX::Parser and the corresponding FragmentHandler. Digging around the code gives no hint; which parameters do I have to set to get the correct result?
&ouml; is not a predefined entity in XML. If you want to allow the HTML entity references in XHTML, you'd need to use a parser that reads the external DTD in the doctype. This is a lot of effort; you may prefer to just use the HTML parser if you have HTML-compatible XHTML with entity references.

Resources