Selecting and manipulating mixed nodes - xpath

I have thousands of poorly formatted HTML documents and have to fix the formatting errors using only PHP. So far I have done well with SimpleXML and XPath, but now I have stumbled over this:
<ul>
Lorem ipsum <strong>dolor sit amet,</strong> consectetur
adipiscing elit, <em>sed</em> do eiusmod tempor
<li>incididunt</li>
<li>ut</li>
<li>labo</li>
</ul>
The text Lorem…tempor belongs outside of the <ul>, while everything else (incididunt…labo) should remain list items.
My idea was to select the child nodes of <ul> that are not <li>, including text nodes. Can I do this with XPath?

You can take the union of two XPath expressions. The first finds all child elements of ul that are not li, the second finds the text nodes directly under ul:
//ul/*[name() != "li"] | //ul/text()
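As an illustration of what can be done with the selected nodes, here is a minimal sketch in Python (not PHP, which the question asks for) using the standard library's xml.dom.minidom. It walks the child nodes of each ul, which is the DOM counterpart of the union above, and moves everything that is not an li into a new p placed before the list. The wrapping div, the choice of a p element, the decision to leave whitespace-only text nodes in place, and the helper names is_stray and hoist_stray_nodes are all assumptions made for this example. It also assumes the input parses as well-formed XML.

```python
from xml.dom import minidom

# Sample from the question, wrapped in a <div> (an assumption) so the
# new <p> has a parent element to live in.
SOURCE = """<div><ul>
Lorem ipsum <strong>dolor sit amet,</strong> consectetur
adipiscing elit, <em>sed</em> do eiusmod tempor
<li>incididunt</li>
<li>ut</li>
<li>labo</li>
</ul></div>"""


def is_stray(node):
    """True for any child of <ul> that should not be there.

    Whitespace-only text nodes are treated as harmless formatting;
    everything else that is not an <li> element counts as stray.
    """
    if node.nodeType == node.TEXT_NODE:
        return bool(node.data.strip())
    return not (node.nodeType == node.ELEMENT_NODE and node.tagName == "li")


def hoist_stray_nodes(doc):
    """Move the stray children of every <ul> into a <p> before that list."""
    for ul in doc.getElementsByTagName("ul"):
        stray = [n for n in list(ul.childNodes) if is_stray(n)]
        if not stray:
            continue
        p = doc.createElement("p")
        ul.parentNode.insertBefore(p, ul)
        for node in stray:
            # appendChild detaches the node from <ul> before re-attaching it.
            p.appendChild(node)
    return doc


doc = hoist_stray_nodes(minidom.parseString(SOURCE))
print(doc.documentElement.toxml())
```

In XPath terms the same node set could also be written as //ul/node()[not(self::li)], with the difference that node() additionally matches comments and processing instructions.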

Related

Xquery copy all 'content' of an element (text() + all child nodes) into new element

In XQuery 3.1 I am transforming an XML file (containing data originally held in SQL tables) into another XML file format. Generally it's quite straightforward, but for one particular element I would like to copy the text() as well as any child nodes and their respective content into a new element. For example, I want to transform the following XML:
<table tablename="collections">
<record id="1">
<field fieldname="id">1</field>
<field fieldname="title">Quisque elementum cursus nunc non aliquam</field>
<field fieldname="desc">Quisque elementum cursus nunc non aliquam;
also known under the title <i>Lorem ipsum dolor sit amet</i>.<br/> The
same author compiled a large encyclopedia <i>Liber de natura
rerum,</i> synthesizing much knowledge from his period.</field>
<field fieldname="author">Thomas de Cantimpré</field>
</record>
</table>
into:
<list xml:id="collections">
<item n="1">
<list>
<item type="title">Quisque elementum cursus nunc non aliquam</item>
<item type="desc">Quisque elementum cursus nunc non aliquam;
also known under the title <i>Lorem ipsum dolor sit amet</i>.<br/> The
same author compiled a large encyclopedia <i>Liber de natura
rerum,</i> synthesizing much knowledge from his period.</item>
<item type="author">Thomas de Cantimpré</item>
</list>
</item>
</list>
In general much of this hasn't posed a problem. However, I'm stumped on how to get the text() and all child nodes inside the element <field fieldname="desc">. A solution in XPath has eluded me.
Thanks in advance for any advice.
What you describe as "all content" is, in XPath terminology, all child nodes: text nodes are child nodes of an element just as child elements are. To select all child nodes of a container node, you just need node() in the context of that node; for example, /table/record/field[@fieldname="desc"]/node() selects all child nodes of the field element with fieldname="desc" in your input sample.
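To make the "all child nodes" idea concrete outside XQuery, here is a small Python sketch using the standard library's xml.dom.minidom. Iterating over childNodes plays the role of node(): it yields text nodes and element nodes alike, in document order. The input is an abbreviated version of the sample from the question, and the helper name field_to_item is made up for this illustration.

```python
from xml.dom import minidom

# Abbreviated version of the question's input (text shortened for brevity).
SOURCE = """<table tablename="collections">
<record id="1">
<field fieldname="id">1</field>
<field fieldname="title">Quisque elementum</field>
<field fieldname="desc">Also known as <i>Lorem ipsum</i>.<br/> The same author compiled <i>Liber de natura rerum</i>.</field>
</record>
</table>"""


def field_to_item(doc, field):
    """Build an <item type="..."> holding deep copies of all child nodes of field."""
    item = doc.createElement("item")
    item.setAttribute("type", field.getAttribute("fieldname"))
    # childNodes is the DOM counterpart of node(): text and element children alike.
    for child in field.childNodes:
        item.appendChild(child.cloneNode(True))
    return item


doc = minidom.parseString(SOURCE)
desc = next(
    f for f in doc.getElementsByTagName("field")
    if f.getAttribute("fieldname") == "desc"
)
print(field_to_item(doc, desc).toxml())
```

The copy keeps the mixed content intact: the inline i and br elements stay interleaved with the surrounding text.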

Improve XPath-query to distinguish text-nodes correctly

I have used XPath extensively in the past. Currently I am facing a problem that I am unable to solve.
Constraints
pure XPath 1.0
no aux-functions (e.g. no "concat()")
HTML-Markup
<span class="container">
Peter: Lorem Impsum
<i class="divider" role="img" aria-label="|"></i>
Paul Smith: Foo Bar BAZ
<i class="divider" role="img" aria-label="|"></i>
Mary: One Two Three
</span>
Challenge
I want to extract the three coherent strings:
Peter: Lorem Impsum
Paul Smith: Foo Bar BAZ
Mary: One Two Three
XPath
The following XPath queries are the best I've come up with after HOURS of research:
XPath-query 1
//span[contains(@class, "container")]
=> Peter: Lorem ImpsumPaul Smith: Foo Bar BAZMary: One Two Three
XPath-query 2
//span[contains(@class, "container")]//text()
=> Peter: Lorem Impsum Paul Smith: Foo Bar BAZ Mary: One Two Three
Problem
Although it is possible to post-process the resulting string using (PHP) string functions afterwards, I am not able to split it into the correct three chunks: I need an XPath-query which enables me to distinguish the text-nodes correctly.
Is it possible to integrate some "artificial separators" between the text-nodes?
You're expecting too much from XPath 1.0. XPath 1.0, itself, can help you here to select
a string, or
a set of text nodes
Then, you'll have to complete your processing outside of XPath (as Mads suggests in the comments).
To understand the limits you're hitting against, your first XPath,
//span[contains(@class, "container")]
selects a nodeset of span elements. The environment in which XPath 1.0 is operating is showing you (some variation of) the string value of the single such node in your document:
Peter: Lorem ImpsumPaul Smith: Foo Bar BAZMary: One Two Three
But be clear: Your XPath is selecting a nodeset of span elements, not strings here.
Your second XPath,
//span[contains(@class, "container")]//text()
selects a nodeset of text() nodes. The environment in which XPath 1.0 is operating is showing the string value of each selected text() node.
If you could use XPath 2.0, you could directly, within XPath, select a sequence of strings,
//span[contains(@class, "container")]/text()/string()
or you could join them,
string-join(//span[contains(@class, "container")]/text(), "|")
and directly get
Peter: Lorem Impsum
|
Paul Smith: Foo Bar BAZ
|
Mary: One Two Three
or
string-join(//span[contains(@class, "container")]/text()/normalize-space(), "|")
to get
Peter: Lorem Impsum|Paul Smith: Foo Bar BAZ|Mary: One Two Three
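If you are limited to XPath 1.0, the splitting has to happen in the host language, as noted above. The sketch below shows that step in Python with the standard library's xml.dom.minidom (the question uses PHP, where the text nodes returned by the XPath query could be looped over in much the same way). It treats each direct text-node child of the span as one chunk, collapses whitespace in a way comparable to normalize-space(), and drops chunks that end up empty. The function name text_chunks is hypothetical, and the markup is assumed to parse as well-formed XML.

```python
from xml.dom import minidom

SOURCE = """<span class="container">
Peter: Lorem Impsum
<i class="divider" role="img" aria-label="|"></i>
Paul Smith: Foo Bar BAZ
<i class="divider" role="img" aria-label="|"></i>
Mary: One Two Three
</span>"""


def text_chunks(element):
    """Return the whitespace-normalized text of each direct text-node child."""
    chunks = []
    for node in element.childNodes:
        if node.nodeType == node.TEXT_NODE:
            # Collapse runs of whitespace and trim both ends.
            text = " ".join(node.data.split())
            if text:
                chunks.append(text)
    return chunks


span = minidom.parseString(SOURCE).documentElement
chunks = text_chunks(span)
print(chunks)
print("|".join(chunks))
```

Because the divider elements sit between the text nodes, each text node already is one of the wanted strings; no artificial separator is needed until the chunks are joined.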

Substituting field attributes from another document in restructured text + sphinx

I am working on some documentation in restructured text using sphinx and want to substitute values defined as fields from one document into another.
Given a file foo.rst I would like to use the value of its Author field in another document. For instance, it is defined as follows:
Foo
===
:Author: Random, Joe;
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do
eiusmod tempor incididunt ut labore et dolore magna aliqua ...
And I have the following snippet in some other file bar.rst
In "Foo" <insert author field from foo.rst>
writes ...
Once compiled and converted to HTML using sphinx, I want the output of bar.rst to be:
In "Foo" Random, Joe writes ...
I've looked at the documentation and it only specifies setting fields and doesn't have anything on accessing these fields from another document.
As far as I know there is no direct way; it may depend on the theme you use (see the html_theme setting in conf.py).
But you can do this:
in foo.rst:
.. _author_random:
:Author: Random, Joe;
in bar.rst:
:ref:`any text <author_random>`
The generated bar.html then contains a link to the author_random target in foo.html, displayed with the text "any text".

Creating a node cross-referencing another domain in Sphinx

Within a custom Sphinx domain, I'd like to create a reference to another node in a different domain. For example:
.. py:class:: foo.bar
Lorem ipsum.
.. example:directive:: baz -> foo.bar
Sit amet, sit.
My example:directive:: says that my "method" baz returns something of type foo.bar, which is a Python class. So I'd like to cross-reference that to the other py:class:: foo.bar description.
from sphinx import addnodes
from sphinx.directives import ObjectDescription

class ExampleDescription(ObjectDescription):
    def handle_signature(self, sig, signode):
        # lots of parsing and node creation here
        # parsed_annotation = "foo.bar"
        signode += addnodes.desc_returns(parsed_annotation, parsed_annotation)
Within my custom domain I'm parsing my directives and building the elements and it's all fine, even cross-referencing within my example domain works just fine by subclassing the sphinx.domains.Domain:resolve_xref method. I'm just unsure how I would programmatically insert a node in my handle_signature method which is later resolved to a node in another domain. Would I somehow have to instantiate a sphinx.domains.python.PyXRefRole?
The expected result in HTML would be something like:
<dl>
<dt>
<code>baz</code>
→
<a href="example.html#py.class.foo.bar">
<code>foo.bar</code>
</a>
</dt>
</dl>

How can I insert image between text in sphinx

I would like to learn how an image is inserted between words such as:
Lorem ipsum :imageshouldcomehere: dolor sit amet.
Is it possible in sphinx? If yes, how?
Yes, you can do this with a substitution definition:
Lorem ipsum |image_reference| dolor sit amet.
.. |image_reference| image:: <path/to/img>
