How to use xpath from lxml on null namespaced nodes?

What is the best way to handle the lack of a namespace on some of the nodes in an xml document using lxml? Should I first modify all None named nodes to add the "gmd" name and then change the tree attributes to name http://www.isotc211.org/2005/gmd as "gmd"? If so, is there a clean way to do this with lxml or something else that would be relatively clean/safe?
from lxml import etree
nsmap = charts_tree.nsmap
nsmap.pop(None) # complains without this on the xpath with
# TypeError: empty namespace prefix is not supported in XPath
len (charts_tree.xpath('//*/gml:Polygon',namespaces=nsmap))
# 1180
len (charts_tree.xpath('//*/DS_DataSet',namespaces=nsmap))
# 0 ... Bummer!
len (charts_tree.xpath('//*/DS_DataSet'))
# 0 ... Also a bummer
e.g. http://www.charts.noaa.gov/ENCs/ENCProdCat_19115.xml
<DS_Series xmlns="http://www.isotc211.org/2005/gmd" xmlns:gco="http://www.isotc211.org/2005/gco" xmlns:gml="http://www.opengis.net/gml/3.2" xmlns:gsr="http://www.isotc211.org/2005/gsr" xmlns:gss="http://www.isotc211.org/2005/gss" xmlns:gts="http://www.isotc211.org/2005/gts" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.isotc211.org/2005/gmd http://schemas.opengis.net/iso/19139/20070417/gmd/gmd.xsd">
<composedOf>
<DS_DataSet>
<has>
<MD_Metadata>
<parentIdentifier>
<gco:CharacterString>NOAA ENC Product Catalog</gco:CharacterString>
</parentIdentifier>
...
<EX_BoundingPolygon>
<polygon>
<gml:Polygon gml:id="US1AK90M_P1">
<gml:exterior>
<gml:LinearRing>
<gml:pos>67.61505 -178.99979</gml:pos>
<gml:pos>73.99999 -178.99979</gml:pos>
...
<gml:pos>64.99997 -178.99979</gml:pos>
<gml:pos>67.61505 -178.99979</gml:pos>
</gml:LinearRing>

I believe your DS_DataSet does carry a namespace: because it sits inside DS_Series, it inherits the default namespace declared there, "http://www.isotc211.org/2005/gmd", even though it has no prefix.
XPath has no notion of a default namespace, so map that URI to a prefix of your own in the namespace dictionary and use that prefix in the expression (print nsmap first to see whether a suitable key is already there; otherwise add one):
nsmap['some_ns'] = "http://www.isotc211.org/2005/gmd"
len (charts_tree.xpath('//*/some_ns:DS_DataSet',namespaces=nsmap))
Since the URI is already in nsmap under the None key, this becomes:
nsmap['gmd'] = nsmap[None]
nsmap.pop(None)
len(charts_tree.xpath('//*/gmd:DS_DataSet',namespaces=nsmap))
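The effect of the prefix mapping can be seen in a minimal, self-contained sketch. It uses the standard library's xml.etree.ElementTree rather than lxml so that it runs without extra packages; the trimmed document and the variable names are made up for illustration. lxml's xpath() takes the same kind of prefix-to-URI dictionary through its namespaces= argument.

```python
import xml.etree.ElementTree as ET

GMD = "http://www.isotc211.org/2005/gmd"
GML = "http://www.opengis.net/gml/3.2"

# Hypothetical, trimmed-down document modelled on the catalog in the question.
doc = f"""<DS_Series xmlns="{GMD}" xmlns:gml="{GML}">
  <composedOf><DS_DataSet><gml:Polygon gml:id="P1"/></DS_DataSet></composedOf>
  <composedOf><DS_DataSet><gml:Polygon gml:id="P2"/></DS_DataSet></composedOf>
</DS_Series>"""

root = ET.fromstring(doc)

# The prefix is local to the query; only the namespace URI has to match.
ns = {"gmd": GMD, "gml": GML}

unprefixed = root.findall(".//DS_DataSet")        # searches in no namespace
prefixed = root.findall(".//gmd:DS_DataSet", ns)  # searches in the gmd namespace
polygons = root.findall(".//gml:Polygon", ns)

print(len(unprefixed), len(prefixed), len(polygons))
```

The unprefixed query finds nothing because the elements' real names include the inherited namespace URI; the prefixed query finds both DS_DataSet elements.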


Declare additional dependency to sphinx-build in an extension

TL;DR: From a Sphinx extension, how do I tell sphinx-build to treat an additional file as a dependency? In my immediate use case, this is the extension's source code, but the question could equally apply to some auxiliary file used by the extension.
I'm generating documentation with Sphinx using a custom extension. I'm using sphinx-build to build the documentation. For example, I use this command to generate the HTML (this is the command in the makefile generated by sphinx-quickstart):
sphinx-build -b html -d _build/doctrees . _build/html
Since my custom extension is maintained together with the source of the documentation, I want sphinx-build to treat it as a dependency of the generated HTML (and LaTeX, etc.). So whenever I change my extension's source code, I want sphinx-build to regenerate the output.
How do I tell sphinx-build to treat an additional file as a dependency? That is not mentioned in the toctree, since it isn't part of the source. Logically, this should be something I do from my extension's setup function.
Sample extension (my_extension.py):
from docutils import nodes
from docutils.parsers.rst import Directive
class Foo(Directive):
    def run(self):
        node = nodes.paragraph(text='Hello world\n')
        return [node]

def setup(app):
    app.add_directive('foo', Foo)
Sample source (index.rst):
.. toctree::
   :maxdepth: 2

.. foo::
Sample conf.py (basically the output of sphinx-quickstart plus my extension):
import sys
import os
sys.path.insert(0, os.path.abspath('.'))
extensions = ['my_extension']
templates_path = ['_templates']
source_suffix = '.rst'
master_doc = 'index'
project = 'Hello directive'
copyright = '2019, Gilles'
author = 'Gilles'
version = '1'
release = '1'
language = None
exclude_patterns = ['_build']
pygments_style = 'sphinx'
todo_include_todos = False
html_theme = 'alabaster'
html_static_path = ['_static']
htmlhelp_basename = 'Hellodirectivedoc'
latex_elements = {
}
latex_documents = [
(master_doc, 'Hellodirective.tex', 'Hello directive Documentation',
'Gilles', 'manual'),
]
man_pages = [
(master_doc, 'hellodirective', 'Hello directive Documentation',
[author], 1)
]
texinfo_documents = [
(master_doc, 'Hellodirective', 'Hello directive Documentation',
author, 'Hellodirective', 'One line description of project.',
'Miscellaneous'),
]
Validation of a solution:
Run make html (or sphinx-build as above).
Modify my_extension.py to replace Hello world by Hello again.
Run make html again.
The generated HTML (_build/html/index.html) must now contain Hello again instead of Hello world.
It looks like the note_dependency method in the build environment API should do what I want. But when should I call it? I tried various events but none seemed to hit the environment object in the right state. What did work was to call it from a directive.
import os
from docutils import nodes
from docutils.parsers.rst import Directive
import sphinx.application
class Foo(Directive):
    def run(self):
        self.state.document.settings.env.note_dependency(__file__)
        node = nodes.paragraph(text='Hello done\n')
        return [node]

def setup(app):
    app.add_directive('foo', Foo)
If a document contains at least one foo directive, it'll get marked as stale when the extension that introduces this directive changes. This makes sense, although it could get tedious if an extension adds many directives or makes different changes. I don't know if there's a better way.
Inspired by Luc Van Oostenryck's autodoc-C.
As far as I know, app.env.note_dependency can be called from a doctree-read event handler to add any file as a dependency of the document currently being read.
So in your use case, I assume this would work:
from sphinx.application import Sphinx
import docutils.nodes as nodes

def on_doctree_read(app: Sphinx, doctree: nodes.document):
    # Record this extension's own source file as a dependency of the current document.
    app.env.note_dependency(__file__)

def setup(app: Sphinx):
    app.connect("doctree-read", on_doctree_read)

pyyaml parse data with tag

I have YAML data like the input below and I need the output as key-value pairs.
Input
a="""
--- !ruby/hash:ActiveSupport::HashWithIndifferentAccess
code:
- '716'
- '718'
id:
- 488
- 499
"""
Output needed
{'code': ['716', '718'], 'id': [488, 499]}
The default constructor was giving me an error. I tried adding a new constructor; now it no longer gives an error, but I still cannot get the key-value pairs.
FYI, if I remove the !ruby/hash:ActiveSupport::HashWithIndifferentAccess line from my YAML, I get the desired output.
def new_constructor(loader, tag_suffix, node):
    if type(node.value)=='list':
        val=''.join(node.value)
    else:
        val=node.value
    val=node.value
    ret_val="""
    {0}
    """.format(val)
    return ret_val

yaml.add_multi_constructor('', new_constructor)
yaml.load(a)
output
"\n [(ScalarNode(tag=u'tag:yaml.org,2002:str', value=u'code'), SequenceNode(tag=u'tag:yaml.org,2002:seq', value=[ScalarNode(tag=u'tag:yaml.org,2002:str', value=u'716'), ScalarNode(tag=u'tag:yaml.org,2002:str', value=u'718')])), (ScalarNode(tag=u'tag:yaml.org,2002:str', value=u'id'), SequenceNode(tag=u'tag:yaml.org,2002:seq', value=[ScalarNode(tag=u'tag:yaml.org,2002:int', value=u'488'), ScalarNode(tag=u'tag:yaml.org,2002:int', value=u'499')]))]\n "
Please suggest.
This is not a solution using PyYAML, but I recommend using ruamel.yaml instead. If for no other reason, it's more actively maintained than PyYAML. A quote from the overview:
"Many of the bugs filed against PyYAML, but that were never acted upon, have been fixed in ruamel.yaml"
To load that string, you can do:
import ruamel.yaml
parser = ruamel.yaml.YAML()
obj = parser.load(a) # as defined above.
I strongly recommend following @Andrew F's answer, but in case you
wonder why your code did not get the proper result, that is because
you don't correctly process the node under the tag in your tag
handling.
Although the node's value is a list (of tuples with key value pairs),
you should test for the type of the node itself (using isinstance)
and then hand it over to the "normal" mapping processing routine as
the tag is on a mapping:
import yaml
from yaml.loader import SafeLoader
a = """\
--- !ruby/hash:ActiveSupport::HashWithIndifferentAccess
code:
- '716'
- '718'
id:
- 488
- 499
"""
def new_constructor(loader, tag_suffix, node):
    if isinstance(node, yaml.nodes.MappingNode):
        return loader.construct_mapping(node, deep=True)
    raise NotImplementedError

yaml.add_multi_constructor('', new_constructor, Loader=SafeLoader)
data = yaml.load(a, Loader=SafeLoader)
print(data)
which gives:
{'code': ['716', '718'], 'id': [488, 499]}
You should not use PyYAML's yaml.load() with its default loader: it is documented to be potentially unsafe,
and above all it is not necessary here. Just add the new constructor to the SafeLoader.
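One possible refinement, offered as a sketch and not as part of the answers above: a constructor registered on SafeLoader itself applies to every later use of that class in the process, and the empty prefix '' matches any tag that has no constructor of its own. The variant below registers the constructor only for tags starting with !ruby/hash: and does so on a subclass. The names RubyHashLoader and construct_ruby_hash are hypothetical.

```python
import yaml


class RubyHashLoader(yaml.SafeLoader):
    """SafeLoader subclass, so the extra constructor stays local to this loader."""


def construct_ruby_hash(loader, tag_suffix, node):
    # tag_suffix is whatever follows the registered prefix, for example
    # "ActiveSupport::HashWithIndifferentAccess"; it is ignored here.
    return loader.construct_mapping(node, deep=True)


# Register on the subclass only; yaml.SafeLoader itself is left unchanged.
RubyHashLoader.add_multi_constructor("!ruby/hash:", construct_ruby_hash)

text = """\
--- !ruby/hash:ActiveSupport::HashWithIndifferentAccess
code:
- '716'
- '718'
id:
- 488
- 499
"""

data = yaml.load(text, Loader=RubyHashLoader)
print(data)
```

Any other unknown tag still raises an error with this loader, which is usually what you want when the input is not fully trusted.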

Get value of XML attribute with namespace

I'm parsing a pptx file and ran into an issue. This is a sample of the source XML:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<p:presentation xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main" xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
<p:sldMasterIdLst>
<p:sldMasterId id="2147483648" r:id="rId2"/>
</p:sldMasterIdLst>
<p:sldIdLst>
<p:sldId id="256" r:id="rId3"/>
</p:sldIdLst>
<p:sldSz cx="10080625" cy="7559675"/>
<p:notesSz cx="7772400" cy="10058400"/>
</p:presentation>
I need to get the r:id attribute value in the sldMasterId tag.
doc = Nokogiri::XML(path_to_pptx)
doc.xpath('p:presentation/p:sldMasterIdLst/p:sldMasterId').attr('id').value
returns 2147483648 but I need rId2, which is the r:id attribute value.
I found the attribute_with_ns(name, namespace) method, but
doc.xpath('p:presentation/p:sldMasterIdLst/p:sldMasterId').attribute_with_ns('id', 'r')
returns nil.
You can reference the namespace of attributes in your xpath the same way you reference element namespaces:
doc.xpath('p:presentation/p:sldMasterIdLst/p:sldMasterId/@r:id')
If you want to use attribute_with_ns, you need to use the actual namespace, not just the prefix:
doc.at_xpath('p:presentation/p:sldMasterIdLst/p:sldMasterId')
.attribute_with_ns('id', "http://schemas.openxmlformats.org/officeDocument/2006/relationships")
http://nokogiri.org/Nokogiri/XML/Node.html#method-i-attributes
If you need to distinguish attributes that have the same name but different namespaces, use attribute_nodes instead:
doc.xpath('p:presentation/p:sldMasterIdLst/p:sldMasterId').each do |element|
  element.attribute_nodes.each do |node|
    puts node if node.namespace && node.namespace.prefix == "r"
  end
end

Xml formatting using Node

The following method is used to write an entry to the XML file:
def write_entry(entry)
  node = Nokogiri::XML::Node.new("url", @xml_document)
  node["loc"] = entry[:url]
  node["lastmod"] = entry[:lastmod].to_s
  node["changefreq"] = entry[:frequency].to_s
  node["priority"] = entry[:priority].to_s
  node.to_xml
end
The entry looks like this:
<urlset>
<url loc="http://www.experteer.co.uk/vacaturebank/banen/vacatures/xing-ag" lastmod="2011-11-23 16:58:27 UTC" changefreq="0.8" priority="monthly"/>
</urlset>
I want the entry of xml to be like this
<urlset>
<url>
<loc> http://www.experteer.co.uk/vacaturebank/banen/vacatures/xing-ag </loc>
<lastmod> 2011-11-23 16:58:27 UTC </lastmod>
<changefreq> 0.8 </changefreq>
<priority> monthly </priority>
</url>
</urlset>
Is this possible using Node, or do I have to use Builder?
If it is possible with Node, how?
And if I have to use Builder, it writes a header for each entry; how can I stop it from writing the header for every entry?
You can use << or add_child to append child nodes to a node:
def write_entry(entry)
  url = Nokogiri::XML::Node.new("url", @xml_document)
  %w{loc lastmod changefreq priority}.each do |node|
    url << Nokogiri::XML::Node.new(node, @xml_document).tap do |n|
      # Convert to a string first, as the original method did with to_s.
      n.content = entry[node.to_sym].to_s
    end
  end
  url.to_xml
end
For this to work correctly, you have to change entry[:url] to entry[:loc] and entry[:frequency] to entry[:changefreq], which shouldn't be a bad thing (it's best to have the same name for the same thing everywhere, isn't it?).
Alternatively, if your entry hash only contains what you need to convert to xml, use entry.each do |key,value| instead of the array.

Ruby libxml: format XMLParser to expand closing tags [duplicate]

libxml2 (for C) is not preserving empty elements in their original form on a save. It replaces <tag></tag> with <tag/> which is technically correct but causes problems for us.
xmlDocPtr doc = xmlParseFile("myfile.xml");
int rc = xmlSaveFile("mynewfile.xml", doc);
I've tried playing with the various options (using xmlReadFile) but none seem to affect the output. One post here mentioned disabling tag compression, but the example was for Perl and I've found no analog for C.
Is there an option to disable this behavior?
Just found this enum in the xmlsave module documentation:
Enum xmlSaveOption {
XML_SAVE_FORMAT = 1 : format save output
XML_SAVE_NO_DECL = 2 : drop the xml declaration
XML_SAVE_NO_EMPTY = 4 : no empty tags
XML_SAVE_NO_XHTML = 8 : disable XHTML1 specific rules
XML_SAVE_XHTML = 16 : force XHTML1 specific rules
XML_SAVE_AS_XML = 32 : force XML serialization on HTML doc
XML_SAVE_AS_HTML = 64 : force HTML serialization on XML doc
XML_SAVE_WSNONSIG = 128 : format with non-significant whitespace
}
Maybe you can refactor your application to use this module for serialization and experiment with these options, especially XML_SAVE_NO_EMPTY.
Your code may look like this:
xmlSaveCtxt *ctxt = xmlSaveToFilename("mynewfile.xml", "UTF-8", XML_SAVE_FORMAT | XML_SAVE_NO_EMPTY);
if (!ctxt || xmlSaveDoc(ctxt, doc) < 0 || xmlSaveClose(ctxt) < 0) {
    //...deal with the error
}
