Access deep nested node from document.xml using nokogiri - ruby

I am using nokogiri to access a docx's document xml file.
here is a sample of it:
<w:document>
<w:body>
<w:p w:rsidR="00454EDC" w:rsidRDefault="00454EDC" w:rsidP="00454EDC">
<w:drawing>
<wp:inline distT="0" distB="0" distL="0" distR="0">
<wp:extent cx="1926590" cy="1088571"/>
<wp:effectExtent l="0" t="0" r="0" b="0"/>
<wp:docPr id="1" name="Picture 1"/>
<wp:cNvGraphicFramePr>
<a:graphicFrameLocks xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main" noChangeAspect="1"/>
</wp:cNvGraphicFramePr>
<a:graphic xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
<a:graphicData uri="http://schemas.openxmlformats.org/drawingml/2006/picture">
<pic:pic xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture">
<pic:nvPicPr>
<pic:cNvPr id="0" name="Picture 1"/>
<pic:cNvPicPr>
<a:picLocks noChangeAspect="1" noChangeArrowheads="1"/>
</pic:cNvPicPr>
</pic:nvPicPr>
<pic:blipFill>
<a:blip r:embed="rId5" cstate="print">
<a:extLst>
<a:ext uri="{28A0092B-C50C-407E-A947-70E740481C1C}">
<a14:useLocalDpi xmlns:a14="http://schemas.microsoft.com/office/drawing/2010/main" val="0"/>
</a:ext>
</a:extLst>
</a:blip>
<a:srcRect/>
<a:stretch>
<a:fillRect/>
</a:stretch>
</pic:blipFill>
<pic:spPr bwMode="auto">
<a:xfrm>
<a:off x="0" y="0"/>
<a:ext cx="1951299" cy="1102532"/>
</a:xfrm>
<a:prstGeom prst="rect">
<a:avLst/>
</a:prstGeom>
<a:noFill/>
<a:ln>
<a:noFill/>
</a:ln>
</pic:spPr>
</pic:pic>
</a:graphicData>
</a:graphic>
</wp:inline>
</w:drawing>
</w:p>
</w:body>
</w:document>
Now I want to access all <w:drawing> tags and from them I wan to access <a:blip> tag and extract the value of attribute of r:embed from it.
In this case as you can see it is rId5
I am able to access the <w:drawing> tag by using xml.xpath('//w:drawing') but when I do so xml.xpath('//w:drawing').xpath('//a:blip'), it throws error :
Nokogiri::XML::XPath::SyntaxError: Undefined namespace prefix: //a:blip
What am I doing wrong, can anyone point me in the right direction?

The error is telling you that in your XPath query, //a:blip, Nokogiri doesn’t know what namespace a refers to. You need to specify the namespaces that you are targeting in your query, not just the prefix. The fact that the prefix a is defined in the document doesn’t really matter, it is the actual namespace URI that is important. It is possible to use completely different prefixes in the query than those used in the document, as long as the namespace URIs match.
You may be wondering why the query //w:drawing works. You don’t include the full XML, but I suspect that the w prefix is defined on the root node (something like xmlns:w="http://some.uri.here"). If you don’t specify any namespaces, Nokogiri will automatically register any defined in the root node so they will be available in your query. The namespace corresponding to the a prefix isn’t defined on the root, so it is unavailable, and so you get the error you see.
To specify namespaces in Nokogiri you pass a hash, mapping the prefix (as used in the query) to namespace URI, to the xpath method (or which ever query method you’re using). Since you are providing your own namespace mappings, you also need to include any you use from the root node, Nokogiri doesn’t include them in this case.
In your case, the code would look something like this:
namespaces = {
'w' => 'http://some.uri', # whatever the URI is for this namespace
'a' => 'http://schemas.openxmlformats.org/drawingml/2006/main'
}
# You can combine this to a single query.
# Also note you don’t want a double slash infront of
# the `/a:blip` part, just one.
xml.xpath('//w:drawing/a:blip', namespaces)
Have a look at the Nokogiri tutorial section on namespaces for more info.

I would say that this is a bug in the xml parser that you are using :
Indeed, the error seems to be that the namespace prefix a is undefined, however, it has been defined in <a:graphic xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">, which is a parent of the <a:blip> element.
See here if you want to know more about xml namespaces
It seems that they are a few other questions about problems with namespace prefixes in nokogiri, for example : Undefined namespace prefix in Nokogiri and XPath

Related

using <#include> resoults in error: Can't find resource

I am implementing a freemarker code in an environment that stores the templates in an database.
for example
${bundle.key}
will display the value of the row with row_id = 'key'
However when I use include directive something doesn't work.
I have a template with a key GenF as follows
<#function PriceFormat Number>
<#return Number?string['0.0000']>
</#function>
if i run
${GenF.PriceFormat(1.568)}
I get the output
1.5680
as expected.
but when i run
<#include bundle.GenF>
${PriceFormat(1.568)}
I receive an error message:
Can't find resource for bundle ...structures.shared.localization.bl.MultiResourceBundle, key
do I use the include directive wrong, or is something was not defined correctly in the Data model by our programmers?
#include expects the name
(path, "file" name) of a template, not the template content itself. See: https://freemarker.apache.org/docs/ref_directive_include.html
What you seem to want is <#bundle.GenF?interpret />. Though note that the parsed template won't be cached that way, unlike when you invoke a template with #include. For #include to be able to resolve "bundle.GenF" as template name (or rather something like "bundle:/GenF", but it's up to you), you have to use a custom TemplateLoader (see Configuration.setTemplateLoader).
As far as you only need this for defining custom number formats, you may also want to consider using custom number formats (https://freemarker.apache.org/docs/pgui_config_custom_formats.html), like ${1.538?string.#bundle_GenF}.

simplexml_load_file with xPath returns empty array

Getting XML from this URL:
$xml = simplexml_load_file('http://geocode-maps.yandex.ru/1.x/?geocode=37.71677,55.75208&kind=metro&spn=1,1&rspn=1');
print_r($xml) shows that XML loaded, but xpath always returns empty array. I tried:
$xml->xpath('/');
$xml->xpath('/ymaps');
$xml->xpath('/GeoObjectCollection');
$xml->xpath('/ymaps/GeoObjectCollection');
$xml->xpath('//GeoObjectCollection');
$xml->xpath('precision');
Why I got empty array? Hope I just missing something easy.
It might be rather easy, but I guess it is also the most common mistake in the history of XML: You are forgetting namespaces!
A lot of elements in the given XML are changing the default namespace and you have to consider that in your XPath.
You can first register your namespace like so:
$xml->registerXPathNamespace('y', 'http://maps.yandex.ru/ymaps/1.x');
$xml->registerXPathNamespace('a', 'http://maps.yandex.ru/attribution/1.x');
and then you can query your data:
$xml->xpath('//y:ymaps/y:GeoObjectCollection');

How to get namespace names in XPath?

I have this .xml file
<root xmlns:h="http://www.w3.org/TR/html4/"
xmlns:f="http://www.w3.org/TR/html4/">
<h:table>
<h:tr>
<h:td>Apples</h:td>
<h:td>Bananas</h:td>
</h:tr>
</h:table>
<f:table>
<f:tr>
<f:td>Red</f:td>
<f:td>Yellow</f:td>
</f:tr>
</f:table>
</root>
How can i get only the elements with a specify namespace?
For example i want to retrieve only that elements in 'h' namespace.
How can i get it? In exist-db the 'namespace::' axis is not more working
Try using the in-scope-prefixes() function in a predicate:
//*[in-scope-prefixes(.)='h']
In the comments you show a solution that parses the return value from name():
//*[substring-before(name(), ":")='h']
There is a far simpler way to get all elements in the namespace that's mapped to the h prefix:
//h:*
Note: when I first tested this, I was getting back all elements in the document. That's because both of your prefixes are mapped to the same namespace:
xmlns:h="http://www.w3.org/TR/html4/"
xmlns:f="http://www.w3.org/TR/html4/"
You should also fix this.

XML Namespace issue with Nokogiri

I have the following XML:
<body>
<hello xmlns='http://...'>
<world>yes</world>
</hello>
</body>
When I load that into a Nokogiri XML document, and call document.at_css "world", I receive nil back. But when I remove the namespace for hello, it works perfectly. I know I can call document.remove_namespaces!, but why is it that it will not work with the namespace?
Because Nokogiri requires you to register the XML namespaces you are querying within (read more about XML Namespaces). But you should still be able to query the element if you specify its namespace when calling at_css. To see the exact usage, check out the css method documentation. It should end up looking something like this:
document.at_css "world", 'namespace_name' => 'namespace URI'

XPath using string functions in the middle of the path

I'm trying to use Web Deploy 3.0 to make changes to my web.config before deployment. Let's say I have the following xml:
<node>
<subnode>
<connectInfo httpURL="http://LookImAUrl.com" />
</subnode>
<node>
And I'd like to match just the "http" in "http://..." so that I can potentially replace it with https.
I looked into XPath string functions and understand them -- I just don't know how to put them in the middle of an expression, for example:
"//node/subnode/connectInfo/#httpURL/substring-before(../#httpURL,':')"
That's basically what I want to do, but it doesn't look right.
"//node/subnode/connectInfo/#httpURL/substring-before(../#httpURL,':')"
That's basically what I want to do, but it doesn't look right.
But it is right and will match the http.
(Btw, you could write it shorter without ..
//node/subnode/connectInfo/#httpURL/substring-before(.,':')
)
However, it will return the string "http" not some kind of pointer pointing to the value of #httpUrl, which is not possible, since there are no partial nodes within the value.
(In XPath 2,) you can return the attribute and a new value, and then perhaps change it in the calling language
//node/subnode/connectInfo/#httpURL/(., concat("https:", substring-after(.,':')))
Using XPath 1.0, if you want to return the initial part of the URL use:
substring-before(//node/subnode/connectInfo/#httpURL,':')
Note though that this will return the value of ONLY the first connectInfo element.
If you want to get the connectInfo nodes that use HTTP:
//node/subnode/connectInfo[starts-with(#httpURL,'http:')]
If you wan to get all httpURL that use HTTP:
//node/subnode/connectInfo/#httpURL[starts-with(.,'http:')]

Resources