In XPath I want an expression to select everything under div except the a nodes. The tree structure must remain the same - html-agility-pack

How can I exclude specific descendants of a node? The expression *[not(self::nodetag)] seems to discriminate only at the child level of the node, accepting all other descendants in the returned node set. I want an expression to select everything under div except the a nodes; see the example below. The tree structure must remain the same.
The approach posted by @Dimitri Novatchev seems to be right, but not with the HAP implementation:
Using this example document:
<div>
<span>
<a>lala</a>
</span>
</div>
The HAP would return the following structure with his suggested expression /div/descendant::node()[not(self::a)]
<div>
<span>
<a>lala</a>
</span>
</div>
<span>
<a>lala</a>
</span>
If there were another tag other than a nested in span, it would also be returned as a separate tree. Does anyone know about this strange behavior? Is it a HAP bug?
Thanks

I want an expression to select all under div but those nodes that are
not a. The tree structure must remain the same.
Use:
/div/descendant::node()[not(self::a)]
This selects any descendant of the top element div that (the descendant) is not an a.
XSLT - based verification:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="/">
<xsl:for-each select="/div/descendant::node()[not(self::a)]">
<xsl:value-of select="concat('&#xA;', position(), '. &quot;')"/>
<xsl:copy-of select="."/>"
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
When this transformation is applied on the provided XML document:
<div>
<span>
<a>lala</a>
</span>
</div>
the XPath expression is evaluated and all selected nodes are output, numbered and quoted so that each one is clearly visible:
1. "
"
2. "<span>
<a>lala</a>
</span>"
3. "
"
4. "lala"
5. "
"
6. "
"
As we can see, 6 nodes are selected -- one span element, four whitespace-only text nodes and one non-whitespace-only text node -- and none of them is an a.
Update:
In a comment the OP has clarified that he actually wants the XML document to be transformed into another, in which any a descendant of a div is omitted.
Here is one such transformation:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
<xsl:template match="div//a"/>
</xsl:stylesheet>
When this transformation is applied on the same XML document (above), what I guess is the wanted result is produced:
<div>
<span/>
</div>
If we want to produce only the descendants of any div that has an a descendant, then we need almost the same transformation:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
<xsl:template match="div[.//a]"><xsl:apply-templates/></xsl:template>
<xsl:template match="div//a"/>
</xsl:stylesheet>
The result of this applied to the same XML document as above is:
<span/>

@Devela: you are confusing the set of nodes selected by the XPath expression with the way that they are then displayed by the application that issued the request. It's quite common for an application to display a node by showing the whole subtree rooted at that node. So if your query is //div, and one of the selected div elements contains an <a> node as a descendant, the results will be shown including that <a> element. You can't change that by changing the XPath expression, because the XPath expression didn't select the <a> element; you can only change it by changing the way the results are displayed.
Now, if you want to display a <div> element that is like the <div> element in your source except that the <a> is omitted, then you are outside the scope of what XPath can do. XPath can only choose a subset of the nodes in your input tree, it can't create a modified tree. For that, you need XSLT or XQuery.
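To make the distinction concrete, here is a minimal Python sketch. It uses the standard library's xml.etree.ElementTree instead of HAP, which is a substitution on my part, and the helper name without_elements is hypothetical. It shows that a selected span still carries its a child when it is serialized, and that dropping the a means building a modified copy in the host language.

```python
import copy
import xml.etree.ElementTree as ET


def without_elements(root, tag):
    """Return a deep copy of root with every descendant element named tag removed."""
    clone = copy.deepcopy(root)
    # Collect (parent, child) pairs first so the tree is not mutated while iterating.
    doomed = [(parent, child)
              for parent in clone.iter()
              for child in parent
              if child.tag == tag]
    for parent, child in doomed:
        # Note: ElementTree stores trailing text on the element, so it is removed too.
        parent.remove(child)
    return clone


source = ET.fromstring("<div><span><a>lala</a></span></div>")

# Selecting span does not select a, yet serializing span shows its whole subtree.
selected = source.find("span")
print(ET.tostring(selected, encoding="unicode"))

# A modified tree has to be constructed; the source tree is left untouched.
pruned = without_elements(source, "a")
print(ET.tostring(pruned, encoding="unicode"))
```

The first print still contains the a element even though only span was selected; the second shows the copy with the a removed, which is the job the answers above hand to XSLT or XQuery.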

Related

find next-to-last node with xpath

I have an XML document with chapters and nested sections.
I am trying to find, for any section, the first second-level section ancestor.
That is the next-to-last section in the ancestor-or-self axis.
pseudo-code:
<chapter><title>mychapter</title>
<section><title>first</title>
<section><title>second</title>
<more/><stuff/>
</section>
</section>
</chapter>
my selector:
<xsl:apply-templates
select="ancestor-or-self::section[last()-1]" mode="title.markup" />
Of course that works until last()-1 isn't defined (the current node is the first section).
If the current node is below the second section, I want the title second.
Otherwise I want the title first.
Replace your xpath with this:
ancestor-or-self::section[position()=last()-1 or count(ancestor::section)=0][1]
Since you can already find the right node in all cases except one, I updated your xpath to also find the first section (or count(ancestor::section)=0), and then select ([1]) the first match (in reverse document order, since we are using the ancestor-or-self axis).
Here is a shorter and more efficient solution:
(ancestor-or-self::section[position() > last() -2])[last()]
This selects the last of the possibly first two topmost ancestors named section. If there is only one such ancestor, then it itself is the last.
Here is a complete transformation:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="section">
<xsl:value-of select="title"/>
<xsl:text> --> </xsl:text>
<xsl:value-of select=
"(ancestor-or-self::section[position() > last() -2])[last()]/title"/>
<xsl:text>
</xsl:text>
<xsl:apply-templates/>
</xsl:template>
<xsl:template match="text()"/>
</xsl:stylesheet>
When this transformation is applied on the following document (based on the provided one, with more nested section elements added):
<chapter>
<title>mychapter</title>
<section>
<title>first</title>
<section>
<title>second</title>
<more/>
<stuff/>
<section>
<title>third</title>
</section>
</section>
</section>
</chapter>
the correct results are produced:
first --> first
second --> second
third --> second
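The selection rule used above (take the two outermost section ancestors-or-self, then the deeper of the two) can also be sketched outside XSLT. This is an illustrative Python version using the standard library's xml.etree.ElementTree; the function name second_level_title and the parent map are my own assumptions, needed because ElementTree has no ancestor axis.

```python
import xml.etree.ElementTree as ET

DOC = """<chapter><title>mychapter</title>
<section><title>first</title>
<section><title>second</title>
<section><title>third</title></section>
</section>
</section>
</chapter>"""


def second_level_title(root, section):
    """Title of the second-level section above (or equal to) section, else the first-level one."""
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    chain = []
    node = section
    while node is not None:
        if node.tag == "section":
            chain.append(node)
        node = parent_of.get(node)
    chain.reverse()  # outermost section first, i.e. document order
    # Keep at most the two outermost sections and take the last of them.
    return chain[:2][-1].findtext("title")


root = ET.fromstring(DOC)
results = [(s.findtext("title"), second_level_title(root, s))
           for s in root.iter("section")]
for own_title, picked in results:
    print(own_title, "-->", picked)
```

Slicing with chain[:2] plays the role of the predicate position() > last() - 2, and [-1] plays the role of the trailing [last()].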

XPath 1.0 Order of returned attributes in a UNION

<merge>
<text>
<div begin="A" end="B" />
<div begin="C" end="D" />
<div begin="E" end="F" />
<div begin="G" end="H" />
</text>
</merge>
I need a UNIONed set of attribute nodes, in the order A,B,C,D,E,F,G,H, and this will work:
/merge/text/div/@begin | /merge/text/div/@end
but only if each @begin comes before each @end, since the UNION operator is spec'd to return nodes in document order. (Yes?)
I need the nodeset to be in the same order, even if the attributes appear in a different order in the document, as here:
<merge>
<text>
<div end="B" begin="A" />
<div begin="C" end="D" />
<div end="F" begin="E" />
<div begin="G" end="H" />
</text>
</merge>
That is, I need elements to follow document order, but the attributes in each element to follow a determined order (either specified or alphabetical by attribute name).
This simply isn't possible in pure XPath. First of all, attributes in XML are unordered. From the XML 1.0 Recommendation:
Note that the order of attribute specifications in a start-tag or
empty-element tag is not significant.
An XPath engine might be reading and storing them in the order they appear in the document, but in terms of the spec, this is just a happy coincidence that cannot be relied upon.
Second, XPath has no sorting functionality. So, your best option is to sort the elements in your host language (e.g. XSLT or a general-purpose PL) after they've been selected.
Here's how to sort those attributes by value in XSLT:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="/">
<xsl:apply-templates
select="/merge/text/div/@*[name()='begin' or name()='end']">
<xsl:sort select="."/>
</xsl:apply-templates>
</xsl:template>
</xsl:stylesheet>
Note that I also merged your two expressions into one.
Edit: Use the following to output begin/end pairs in document order (as described in the comments):
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:strip-space elements="*"/>
<xsl:template match="div">
<xsl:value-of select="concat(@begin, @end)"/>
</xsl:template>
</xsl:stylesheet>
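For the "general-purpose PL" route mentioned above, a minimal sketch in Python with the standard library's xml.etree.ElementTree might look like this. The constant ATTRIBUTE_ORDER is a hypothetical name for the fixed per-element order the question asks for; elements are visited in document order and the attributes of each element are read in that fixed order, regardless of how they were written in the source.

```python
import xml.etree.ElementTree as ET

DOC = """<merge>
  <text>
    <div end="B" begin="A" />
    <div begin="C" end="D" />
    <div end="F" begin="E" />
    <div begin="G" end="H" />
  </text>
</merge>"""

ATTRIBUTE_ORDER = ("begin", "end")  # assumed fixed order within each element

root = ET.fromstring(DOC)
values = [div.get(name)
          for div in root.findall("./text/div")   # elements in document order
          for name in ATTRIBUTE_ORDER             # attributes in the chosen order
          if div.get(name) is not None]           # skip attributes that are absent
print(values)
```

Because the order comes from ATTRIBUTE_ORDER rather than from the document, the result does not depend on how any parser happens to store attributes.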

Return full text element (including child/descendant elements)

I'm trying to get the text from the first occurrence on the page of div/p, and only the first p. The <p> contains other tags (<b>, <a href>) and the returned text from <p> stops at any other tag. Is there a way to get this line to return all the text between <p> and </p>, even between embedded tags?
puts doc.xpath('html/body/div/p[1]/text()').first
Use:
string((//div/p)[1])
When this XPath expression is evaluated the result is the string value of the first p in the document that is a child of a div.
By definition, the string value of an element is the concatenation (in document order) of all of its text-node descendants.
Therefore, you get exactly all the text in the subtree rooted by this p element, with any other nodes (elements, comments, PIs) skipped.
XSLT - based verification:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="/">
<xsl:copy-of select="string(p)"/>
</xsl:template>
</xsl:stylesheet>
When this transformation is applied on the following XML document (none was provided!):
<p>
Hello <b>
XML
World!</b>
</p>
the result of the evaluated XPath expression is output:
Hello XML
World!
With Nokogiri, as an alternative to writing more XPath, you can use Nokogiri::XML::Node#inner_text:
puts doc.xpath('html/body/div/p[1]').inner_text
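The string-value rule described above (concatenate all descendant text nodes in document order) can be sketched in Python with the standard library's xml.etree.ElementTree. This is only an illustration, not Nokogiri: the sample markup is made up, and it is parsed as well-formed XML rather than as real-world HTML.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page; the first p contains nested inline elements.
doc = ET.fromstring(
    "<html><body><div><p>Hello <b>XML <a href='#'>World</a></b>!</p>"
    "<p>second paragraph</p></div></body></html>"
)

# find() returns only the first matching element in document order.
first_p = doc.find("./body/div/p")

# itertext() yields every piece of text inside the element, nested tags included.
text = "".join(first_p.itertext())
print(text)
```

Asking for text() children alone would stop at the first nested tag, which is the symptom described in the question; walking all descendant text avoids that.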

XSLT 1.0: restrict entries in a nodeset

Being relatively new to XSLT I have what I hope is a simple question. I have some flat XML files, which can be pretty big (eg. 7MB) that I need to make 'more hierarchical'. For example, the flat XML might look like this:
<D0011>
<b/>
<c/>
<d/>
<e/>
<b/>
....
....
</D0011>
and it should end up looking like this:
<D0011>
<b>
<c/>
<d/>
<e/>
</b>
<b>
....
....
</b>
</D0011>
I have a working XSLT for this, and it essentially gets a nodeset of all the b elements and then uses the 'following-sibling' axis to get a nodeset of the nodes following the current b node (ie. following-sibling::*[position()=$nodePos]). Then recursion is used to add the siblings into the result tree until another b element is found (I have parameterised it of course, to make it more generic).
I also have a solution that just sends the position in the XML of the next b node and selects the nodes after that one after the other (using recursion) via a *[position() = $nodePos] selection.
The problem is that the time to execute the transformation increases unacceptably with the size of the XML file. Looking into it with XML Spy it seems that it is the 'following-sibling' and 'position()=' that take the time in the two respective methods.
What I really need is a way of restricting the number of nodes in the above selections, so that fewer comparisons are performed: every time the position is tested, every node in the nodeset is checked to see whether its position is the right one. Is there a way to do that? Any other suggestions?
Thanks,
Mike
Yes there is a way to do it much more efficiently: See Muenchian grouping. If having looked at this you need more help with the details, let us know. The key you'll need is something like:
<xsl:key name="elements-by-group" match="*[not(self::b)]"
use="generate-id(preceding-sibling::b[1])" />
Then you can iterate over the <b> elements, and for each one, use key('elements-by-group', generate-id()) to get the elements that immediately follow that <b>.
The task of "making the XML more hierarchical" is sometimes called up-conversion, and your scenario is a classic case for it. As you may know, XSLT 2.0 has very useful grouping features that are easier to use than the Muenchian method.
In your case it sounds like you would use <xsl:for-each-group group-starting-with="b" /> or, to parameterize the element name, <xsl:for-each-group group-starting-with="*[local-name() = 'b']" />. But maybe you already considered that and can't use XSLT 2.0 in your environment.
Update:
In response to the request for parameterization, here's a way to do it without a key.
Note though that it may be much slower, depending on your XSLT processor.
<xsl:template match="D0011">
<xsl:for-each select="*[local-name() = $sep]">
<xsl:copy>
<xsl:copy-of select="following-sibling::*[not(local-name() = $sep)
and generate-id(preceding-sibling::*[local-name() = $sep][1]) =
generate-id(current())]" />
</xsl:copy>
</xsl:for-each>
</xsl:template>
As noted in the comment, you can keep the performance benefit of keys by defining several different keys, one for each possible value of the parameter. You then select which key to use by using an <xsl:choose>.
Update 2:
To make the group-starting element be defined based on /*/*[2], instead of based on a parameter, use
<xsl:key name="elements-by-group"
match="*[not(local-name(.) = local-name(/*/*[2]))]"
use="generate-id(preceding-sibling::*
[local-name(.) = local-name(/*/*[2])][1])" />
<xsl:template match="D0011">
<xsl:for-each select="*[local-name(.) = local-name(../*[2])]">
<xsl:copy>
<xsl:copy-of select="key('elements-by-group', generate-id())"/>
</xsl:copy>
</xsl:for-each>
</xsl:template>
<xsl:key name="k1" match="D0011/*[not(self::b)]" use="generate-id(preceding-sibling::b[1])"/>
<xsl:template match="D0011">
<xsl:copy>
<xsl:apply-templates select="b"/>
</xsl:copy>
</xsl:template>
<xsl:template match="D0011/b">
<xsl:copy>
<xsl:copy-of select="key('k1', generate-id())"/>
</xsl:copy>
</xsl:template>
This is the fine-grained traversal pattern:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="node()|@*" name="identity">
<xsl:copy>
<xsl:apply-templates select="node()[1]|@*"/>
</xsl:copy>
<xsl:apply-templates select="following-sibling::node()[1]"/>
</xsl:template>
<xsl:template match="b[1]" name="group">
<xsl:copy>
<xsl:apply-templates select="following-sibling::node()[1]"/>
</xsl:copy>
<xsl:apply-templates select="following-sibling::b[1]" mode="group"/>
</xsl:template>
<xsl:template match="b[position()!=1]"/>
<xsl:template match="b" mode="group">
<xsl:call-template name="group"/>
</xsl:template>
</xsl:stylesheet>
Output:
<D0011>
<b>
<c></c>
<d></d>
<e></e>
</b>
<b>
....
....
</b>
</D0011>
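The idea shared by the key-based answers above is that each non-separator element belongs to the nearest preceding separator. It can be done in one linear pass, which is why it scales better than repeated following-sibling or position() tests. A minimal Python sketch using the standard library's xml.etree.ElementTree follows; the function name group_starting_with is hypothetical, borrowed from the XSLT 2.0 attribute, and the sample input is made up.

```python
import xml.etree.ElementTree as ET


def group_starting_with(parent, sep):
    """Build a new tree where each sep child wraps the siblings that follow it."""
    grouped = ET.Element(parent.tag, parent.attrib)
    current = None
    for child in parent:          # single pass over the children, in document order
        if child.tag == sep:
            # A separator starts a new group.
            current = ET.SubElement(grouped, sep, child.attrib)
        elif current is not None:
            # Any other element joins the most recently started group.
            current.append(child)
        # Children before the first separator are dropped (assumed acceptable here).
    return grouped


flat = ET.fromstring("<D0011><b/><c/><d/><e/><b/><f/></D0011>")
nested = group_starting_with(flat, "b")
print(ET.tostring(nested, encoding="unicode"))
```

Each child is visited exactly once, so the running time grows linearly with the number of children instead of quadratically.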

how to for every parent node select every not first child node in a tree with multiple parent nodes

Hi,
I think I've got a tricky question for XPath experts. There is a node structure like this:
A(1)-|
|-B(1)
|-B(2)
|-B(3)
A(2)-|
|-B(2.1)
|-B(2.2)
|-B(2.3)
...
How to, with a single XPath-expression, extract only the following nodes
A(1)-|
|-B(2)
|-B(3)
A(2)-|
|-B(2.2)
|-B(2.3)
...
That is, for every parent node, its first child element should be excluded.
I tried A/B[position() != 1], but this seems to filter out only B(1.1) and still selects B(2.1).
Thanks
This XPath expression (no preceding-sibling:: axis used):
/*/a/*[not(position()=1)]
when applied on this XML document:
<t>
<a>
<b11/>
<b12/>
<b13/>
</a>
<a>
<b21/>
<b22/>
<b23/>
</a>
</t>
selects the wanted nodes:
<b12 />
<b13 />
<b22 />
<b23 />
This can be verified with this XSLT transformation, producing the above result:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="/">
<xsl:copy-of select="/*/a/*[not(position()=1)]"/>
</xsl:template>
</xsl:stylesheet>
Tricky. You could select nodes that have preceding siblings:
A/B[preceding-sibling::*]
This will fail for the first element and succeed for the rest.
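Both expressions apply the "not first" test per parent, not across the whole result. The same per-parent logic, written out in Python with the standard library's xml.etree.ElementTree purely as an illustration, looks like this:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<t><a><b11/><b12/><b13/></a><a><b21/><b22/><b23/></a></t>"
)

# For every a element, skip its first child and keep the rest.
selected = [child.tag
            for a in root.findall("a")
            for child in list(a)[1:]]
print(selected)
```

The slice [1:] is evaluated once per a element, which mirrors how position() is evaluated relative to each parent's children in the XPath step.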
