Is it possible to truncate an XPath axis at a given node? - xpath

I've been writing some code that extracts the main textual content from web pages. One strategy that's been useful is to locate the first paragraph of content, then select all of the following sibling elements up to, but not including, the first one that isn't a p, ul, ol, or blockquote element. In Perl, the code looks something like this:
my ($firstpara) = $document->findnodes('//p[whatever]');
my @content = ($firstpara);
for my $sibling ($firstpara->findnodes('following-sibling::*')) {
    last if $sibling->tag !~ /^(?:p|ol|ul|blockquote)\z/;
    push @content, $sibling;
}
This isn't too bad, but it would be cool to be able to get the nodes I want using only XPath, so I could write something like this instead:
my ($firstpara) = $document->findnodes('//p[whatever]');
my @content = ($firstpara, $firstpara->findnodes('<query>'));
I've done a lot of experimentation, but haven't been able to figure out how to write that last query. The closest to a valid-looking solution I've been able to find is something like:
$firstpara->findnodes('following-sibling::*[position() < $EXPR]');
...where $EXPR is some expression that returns the position of the next sibling whose tag is not p, ul, ol, or blockquote, but I haven't been able to work out if such an expression is expressible in XPath.
Is there any way to do what I've described in XPath?
Example:
Suppose my document looks like this:
<h1>Header</h1>
<p>Paragraph 1</p>
<p id="first">Paragraph 2</p>
<p>Paragraph 3</p>
<ul><li>Item 1</li><li>Item 2</li></ul>
<p>Paragraph 4</p>
<hr>
<p>Paragraph 5</p>
<blockquote>Blockquote 1</blockquote>
...
I have a reference to the <p> element with id first. I'm after an XPath expression, using that element as the context node, that will give me the following siblings Paragraph 3, the unordered list, and Paragraph 4. The <hr> element is not among those I want (<p>, <ul>, <ol>, and <blockquote>), so that element and all siblings after it should not be part of the returned node set.

As the OP explained, he wants:
all of the following sibling elements up to, but not including, the
first one that isn't a p, ul, ol, or blockquote element
I. XPath 1.0 solution:
The nodes that are wanted are the intersection of two nodesets:
All elements that are following siblings of the p with id with value 'first'.
All elements that are preceding siblings of hr.
To find this with XPath 1.0 we use the Kayessian formula for nodeset intersection:
$ns1[count(.|$ns2) = count($ns2)]
The above XPath expression selects all nodes that belong both to the node-set $ns1 and to the node-set $ns2.
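The membership test behind the formula can be sketched outside XPath. The following is only an illustration using Python sets; the function name and the string labels standing in for nodes are assumptions made for this example, not part of the answer:

```python
def kayessian_intersection(ns1, ns2):
    """Mimic $ns1[count(.|$ns2) = count($ns2)] with Python sets."""
    ns2 = set(ns2)
    # A node of ns1 is kept when adding it to ns2 does not grow ns2,
    # which is only possible if the node is already a member of ns2.
    return [node for node in ns1 if len({node} | ns2) == len(ns2)]


# Hypothetical labels for the sibling elements of the example document.
following_first = ["p3", "ul", "p4", "hr", "p5", "blockquote"]
preceding_hr = ["h1", "p1", "first", "p3", "ul", "p4"]

print(kayessian_intersection(preceding_hr, following_first))
# prints ['p3', 'ul', 'p4']
```

The union with a single node leaves the count unchanged exactly when that node is already in the second set, which is why the count comparison works as an intersection test.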
Let $vP1 be defined as /*/p[@id='first'].
Let $vFirstNotInRange is:
$vP1/following-sibling::*
[not(self::p or self::ul
or self::ol or self::blockquote)
] [1]
This selects the first unwanted node (in this case hr), or more precisely: the first element that is a following sibling of $vP1 and that is not a p, a ul, an ol or a blockquote.
Then the two node-sets we want to intersect are all following siblings of $vP1 and all preceding siblings of $vFirstNotInRange:
Let us denote with $vFollowingP1 the first node-set -- this is:
$vP1/following-sibling::*
And let us denote with $vPreceedingNotInRange the second node-set -- this is:
$vFirstNotInRange/preceding-sibling::*
Finally, we substitute $vPreceedingNotInRange for $ns1 and $vFollowingP1 for $ns2 in the Kayessian formula. The result of these substitutions selects exactly the wanted nodes:
$vPreceedingNotInRange
[count(.|$vFollowingP1)
=
count($vFollowingP1)
]
If we substitute all variables until we get an expression that doesn't contain any variables, we get:
/*/p[@id='first']/following-sibling::*
[not(self::p or self::ul
or self::ol or self::blockquote
)
] [1]
/preceding-sibling::*
[count(.| /*/p[@id='first']/following-sibling::*)
=
count(/*/p[@id='first']/following-sibling::*)
]
This expression selects exactly the wanted nodes.
Here is an XSLT - based verification:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:variable name="vP1" select="/*/p[@id='first']"/>
<xsl:variable name="vFirstNotInRange" select=
"$vP1/following-sibling::*
[not(self::p or self::ul
or self::ol or self::blockquote)
] [1]"/>
<xsl:variable name="vFollowingP1"
select="$vP1/following-sibling::*"/>
<xsl:variable name="vPreceedingNotInRange"
select="$vFirstNotInRange/preceding-sibling::*"/>
<xsl:template match="/">
<xsl:copy-of select=
"$vPreceedingNotInRange
[count(.|$vFollowingP1)
=
count($vFollowingP1)
]"/>
================
<xsl:copy-of select=
"/*/p[@id='first']/following-sibling::*
[not(self::p or self::ul
or self::ol or self::blockquote
)
] [1]
/preceding-sibling::*
[count(.| /*/p[@id='first']/following-sibling::*)
=
count(/*/p[@id='first']/following-sibling::*)
]
"/>
</xsl:template>
</xsl:stylesheet>
When this transformation is applied on the following XML document (the provided non-wellformed XML fragment -- corrected and wrapped in order to be made wellformed):
<html>
<h1>Header</h1>
<p>Paragraph 1</p>
<p id="first">Paragraph 2</p>
<p>Paragraph 3</p>
<ul>
<li>Item 1</li>
<li>Item 2</li>
</ul>
<p>Paragraph 4</p>
<hr/>
<p>Paragraph 5</p>
<blockquote>Blockquote 1</blockquote>
</html>
the two XPath expressions (one with variables and one with all variables substituted) are evaluated, and the wanted, correctly selected nodes are output:
<p>Paragraph 3</p>
<ul>
<li>Item 1</li>
<li>Item 2</li>
</ul>
<p>Paragraph 4</p>
================
<p>Paragraph 3</p>
<ul>
<li>Item 1</li>
<li>Item 2</li>
</ul>
<p>Paragraph 4</p>
II. XPath 2.0 solution:
$vFirstNotInRange/preceding-sibling::*
[. >> $vP1]
This selects any preceding sibling of $vFirstNotInRange that is also following $vP1 and selects the same wanted nodes:
<p>Paragraph 3</p>
<ul>
<li>Item 1</li>
<li>Item 2</li>
</ul>
<p>Paragraph 4</p>
Explanation: Here we use the XPath 2.0 "follows" operator >>.
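As a cross-check that does not depend on any XPath engine, the same selection can be sketched procedurally. This is a minimal sketch that assumes Python's standard xml.etree.ElementTree (which does not support the following-sibling axis, so the siblings are walked by hand); the names WANTED and content_after are made up for this example:

```python
import xml.etree.ElementTree as ET
from itertools import takewhile

DOC = """<html>
  <h1>Header</h1>
  <p>Paragraph 1</p>
  <p id="first">Paragraph 2</p>
  <p>Paragraph 3</p>
  <ul><li>Item 1</li><li>Item 2</li></ul>
  <p>Paragraph 4</p>
  <hr/>
  <p>Paragraph 5</p>
  <blockquote>Blockquote 1</blockquote>
</html>"""

WANTED = {"p", "ul", "ol", "blockquote"}


def content_after(root, start_id):
    """Return the run of wanted siblings directly after the start element."""
    children = list(root)
    # Position of the element that plays the role of $vP1.
    start = next(i for i, el in enumerate(children) if el.get("id") == start_id)
    # Keep following siblings until the first one with an unwanted tag.
    return list(takewhile(lambda el: el.tag in WANTED, children[start + 1:]))


root = ET.fromstring(DOC)
selected = content_after(root, "first")
print([el.tag for el in selected])
# prints ['p', 'ul', 'p']
```

One difference worth noting: if no unwanted sibling exists at all, $vFirstNotInRange is empty and the XPath expressions select nothing, whereas this loop would return every following sibling.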

Related

Select only the second preceding tag if there is one, or then just the first one

I have this HTML:
<h4>block 1</h4>
<p>paragraph 1</p>
<p>paragraph 2</p>
<table></table>
<h4>block 2</h4>
<p>paragraph 1</p>
<table></table>
As you can see, the first block contains two <p></p> tags, while the second block only has one.
I am currently using this XPath: //table/preceding::p[1], which returns:
1. <p>paragraph 2</p>
2. <p>paragraph 1</p>
However, this is what I'd like to have:
1. <p>paragraph 1</p>
2. <p>paragraph 1</p>
So basically I want the farthest preceding p tag of each table, as described in my question title.
I want to keep using //table/preceding, as this is very important in my case.
I already tried //table/preceding::p[1 or 2], but that selects both.
I also tried //table/preceding::p[2] but that will select both paragraphs from the first block, and none from the second one.
As you can probably notice, I'm pretty new to XPath. How can I achieve the desired result?
Try this one to select the desired paragraphs:
//table/preceding-sibling::h4[1]/following-sibling::p[1]

xpath for locating li with text does not work

Using the xpath //ul//li[contains(text(),"outer")] to find a li in the outer ul does not work
<ul>
<li>
<span> not unique text, </span>
<span> not unique text, </span>
outer ul li 1
<ul >
<li> inner ul li 1 </li>
<li> inner ul li 2 </li>
</ul>
</li>
<li>
<span> not unique text, </span>
<span> not unique text, </span>
outer ul li 2
<ul >
<li> inner ul li 1 </li>
<li> inner ul li 2 </li>
</ul>
</li>
</ul>
Any idea how to find a li with a specific text in the outer ul?
Thank you
This will work for you: //ul//li[contains(.,"outer")]
I would expect that you only want to consider the text nodes that are direct children of the li. In that case you are right to use text() (if you use contains(.,"outer"), it will also consider text from any descendants of the li).
Therefore try this:
//ul/li[text()[contains(.,'outer')]]
Running this with Saxon, the original XPath expression gives:
XPTY0004: A sequence of more than one item is not allowed as the first argument of
contains() ("", "", ...)
Now, I guess Selenium is probably using XPath 1.0 rather than XPath 2.0, and in 1.0 the contains() function has "first item semantics" - it converts its argument to a string, which if the argument is a node-set containing more than one node, involves considering only the first node. And the first text node is probably whitespace.
If you want to test whether some child text node contains "outer", use
//ul//li[text()[contains(.,"outer")]]
Another reason for switching to XPath 2.0...
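The difference between testing only the first text node and testing every child text node can be illustrated without an XPath engine. The sketch below assumes Python's standard xml.etree.ElementTree, where the child text nodes of an element are its own text plus the tail of each child; the helper name child_text_nodes is made up for this example:

```python
import xml.etree.ElementTree as ET

LI = ET.fromstring(
    "<li>\n"
    "<span> not unique text, </span>\n"
    "<span> not unique text, </span>\n"
    "outer ul li 1\n"
    "<ul><li> inner ul li 1 </li></ul>\n"
    "</li>"
)


def child_text_nodes(el):
    """Text nodes that are direct children of el, in document order."""
    nodes = [el.text] + [child.tail for child in el]
    return [text for text in nodes if text]


texts = child_text_nodes(LI)
# XPath 1.0 contains(text(), "outer") only looks at the first text node,
# which here is the line break before the first span.
first_only = "outer" in texts[0]
# text()[contains(., "outer")] tests each child text node in turn.
any_node = any("outer" in text for text in texts)
print(first_only, any_node)
# prints False True
```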
For the above issue, this solution will work:
//ul//li[contains(.,"outer")]
"." selects the current node, so contains() tests the string value of the whole li.

XPath and negation searches

I have the following code sample in an xmlns root:
<ol class="stan">
<li>Item one.</li>
<li>
<p>Paragraph one.</p>
<p>Paragraph two.</p>
</li>
<li>
<pre>Preformated one.</pre>
<p>Paragraph one.</p>
</li>
</ol>
I would like to perform a different operation on the first item in <li> depending on the type of tag it resides in, or no tag, i.e. the first <li> in the sample.
EDIT:
My logic in pursuing the task turns out to be incorrect.
How do I query a <li> that has no descendants as in the first list item?
I tried negation:
@doc.xpath("//xmlns:ol[@class='stan']//xmlns:li/xmlns:*[1][not(p|pre)]")
That gives me the exact opposite for what I think I am asking for.
I think I am making the expression more complicated since I can't find the right solution.
UPDATE:
Navin Rawat has answered this one in the comments. The correct code would be:
@doc.xpath("//xmlns:ol[@class='stan']/xmlns:li[not(xmlns:*)]")
CORRECTION:
The correct question involves both an XPath search and a Nokogiri method.
Given the above xhtml code, how do I search for first descendant using xpath? And how do I use xpath in a conditional statement, e.g.:
@doc.xpath("//xmlns:ol[@class='stan']/xmlns:li").each do |e|
if e.xpath("e has no descendants")
perform task
elsif e.xpath("e first descendant is <p>")
perform second task
elsif e.xpath("e first descendant is <pre>")
perform third task
end
end
I am not asking for complete code, just the part in parentheses in the above Nokogiri code.
Pure XPath answer...
If you have the following XML :
<ol class="stan">
<li>Item one.</li>
<li>
<p>Paragraph one.</p>
<p>Paragraph two.</p>
</li>
<li>
<pre>Preformated one.</pre>
<p>Paragraph one.</p>
</li>
</ol>
And want to select <li> that has no child element as in the first list item, use :
//ol/li[count(*)=0]
If you have namespaces problem, please give to whole XML (with the root element and namespaces declaration) so that we can help you dealing with it.
EDIT after our discussion, here is your final tested code :):
@doc.xpath("//xmlns:ol[@class='footnotes']/xmlns:li").each do |e|
  if e.xpath("count(*)=0")
    puts "No children"
  elsif e.xpath("count(*[1]/self::xmlns:p)=1")
    puts "First child is <p>"
  elsif e.xpath("count(*[1]/self::xmlns:pre)=1")
    puts "First child is <pre>"
  end
end
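For comparison, the same three-way test can be sketched without XPath. This is only an illustration that assumes Python's standard xml.etree.ElementTree and a namespace-free copy of the sample list; the function name classify is made up for this example:

```python
import xml.etree.ElementTree as ET

OL = ET.fromstring("""<ol class="stan">
  <li>Item one.</li>
  <li><p>Paragraph one.</p><p>Paragraph two.</p></li>
  <li><pre>Preformated one.</pre><p>Paragraph one.</p></li>
</ol>""")


def classify(li):
    """Describe a list item by its first child element, if it has one."""
    if len(li) == 0:  # same idea as count(*)=0
        return "No children"
    # li[0] is the first child element, like *[1] in XPath.
    return f"First child is <{li[0].tag}>"


labels = [classify(li) for li in OL.findall("li")]
print(labels)
# prints ['No children', 'First child is <p>', 'First child is <pre>']
```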

xpath: count preceding elements

I have an xml structure that looks like this:
<document>
<body>
<section>
<title>something</title>
<subtitle>Something again</subtitle>
<section>
<p xml:id="1234">Some text</p>
<figure id="2121"></figure>
<section>
<p xml:id="somethingagain">Some text</p>
<figure id="939393"></figure>
<p xml:id="countelement"></p>
</section>
</section>
</section>
<section>
<title>something2</title>
<subtitle>Something again2</subtitle>
<section>
<p xml:id="12345678">Some text2</p>
<figure id="939394"></figure>
<p xml:id="countelement2"></p>
</section>
</section>
</body>
</document>
How can I count the figure elements I have before the <p xml:id="countelement"></p> element using XPath?
Edit:
And I only want to count figure elements within the parent section; in the next section it should start from 0 again.
If you're using an XPath 2.0 compatible engine, find the "countelement" elements and call fn:count() for each of them, using the preceding figure elements as input.
This will return the number of figures preceding each "countelement" on the same level (I guess this is what you actually want):
//p[@xml:id="countelement"]/count(preceding-sibling::figure)
This will return the number of figures preceding each "countelement" on the same level and on the level above:
//p[@xml:id="countelement"]/count(preceding-sibling::figure | parent::*/preceding-sibling::figure)
This will return the number of all figures preceding each "countelement" anywhere in the document:
//p[@xml:id="countelement"]/count(preceding::figure)
If you're bound to XPath 1.0, you won't be able to get multiple results. If @id really is an id (and thus unique), you will be able to use this query:
count(//p[@xml:id="countelement"]/preceding::figure)
If there are "countelements" which are not <p/> elements, replace p by *.
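The per-section count (figures that are preceding siblings of the marker element) can also be sketched procedurally. The following assumes Python's standard xml.etree.ElementTree, which exposes xml:id under the XML namespace URI; the function name figures_before and the trimmed sample document are made up for this example:

```python
import xml.etree.ElementTree as ET

# ElementTree stores xml:id under the reserved XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

DOC = """<body>
  <section>
    <p xml:id="1234">Some text</p>
    <figure id="2121"/>
    <section>
      <p xml:id="somethingagain">Some text</p>
      <figure id="939393"/>
      <p xml:id="countelement"/>
    </section>
  </section>
  <section>
    <p xml:id="12345678">Some text2</p>
    <figure id="939394"/>
    <p xml:id="countelement2"/>
  </section>
</body>"""


def figures_before(root, xml_id):
    """Count figure siblings that precede the p carrying the given xml:id."""
    for parent in root.iter():
        children = list(parent)
        for index, child in enumerate(children):
            if child.tag == "p" and child.get(XML_ID) == xml_id:
                # Only siblings before the marker, like preceding-sibling::figure.
                return sum(1 for el in children[:index] if el.tag == "figure")
    raise LookupError(xml_id)


root = ET.fromstring(DOC)
print(figures_before(root, "countelement"), figures_before(root, "1234"))
# prints 1 0
```

Because only the marker's own siblings are inspected, the figure with id 2121 in the enclosing section is not counted, which matches the "start from 0 again" requirement in the question.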
count(id("countelement")/preceding-sibling::figure)
Please note that the xml:id attributes of two different elements cannot have the same value, such as "countelement". If you want two different elements to have a same-named attribute with the same value "countelement", it must be some other attribute, perhaps "kind", that is not of DTD attribute type ID. In that case, in place of id("countelement") you would use *[@kind="countelement"].

how do I formulate this xpath expression?

given the following div element
<div class="info">
title
<span class="a">123</span>
<span class="b">456</span>
<span class="c">789</span>
</div>
I want to retrieve contents of the span with class "b". However, some divs I want to parse lack the second two spans (of class "b" and "c"). For these divs, I want the contents of the span with class "a". Is it possible to create a single XPath expression that selects this?
If it is not possible, is it possible to create a selector that retrieves the entire contents of the div? ie retrieves
title
<span class="a">123</span>
<span class="b">456</span>
<span class="c">789</span>
If I can do that, I can use a regex to find the data I want. (I can select the text within the div, but I'm not sure how to select the tags also. Just the text yields 123456789.)
More efficient -- requires no union:
//div/span
[@class='b'
or
@class='a'
and
not(parent::*[span[@class='b']])
]
An expression (like the one below) that is the union of two absolute "//" expressions typically performs two complete document tree traversals, and then the union operation does deduplication and sorting in document order. All this can be significantly less efficient than a single tree traversal, unless the XPath processor has an intelligent optimizer.
An example of such an inefficient expression:
//div/span[@class='b'] | //div[not(./span[@class='b'])]/span[@class='a']
XSLT - based verification:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="/">
<xsl:copy-of select=
"//div/span
[@class='b'
or
@class='a'
and
not(parent::*[span[@class='b']])
]"/>
</xsl:template>
</xsl:stylesheet>
When this transformation is applied on the provided XML document:
<div class="info">
title
<span class="a">123</span>
<span class="b">456</span>
<span class="c">789</span>
</div>
The Xpath expression is evaluated and the selected elements (in this case just one) are copied to the output:
<span class="b">456</span>
When the same transformation is applied on a different XML document, where there is no class='b':
<div class="info">
title
<span class="a">123</span>
<span class="x">456</span>
<span class="c">789</span>
</div>
the same XPath expression is evaluated and the correctly selected element is copied to the output:
<span class="a">123</span>
The xpath expression should be something like:
//div/span[@class='b'] | //div[not(./span[@class='b'])]/span[@class='a']
The expression to the left of the union operator | selects all the b-class spans inside all divs. The expression on the right-hand side first queries all divs that do not have a b-class span and then selects their a-class span. The | operator combines the results of the two sets.
See here for selecting nodes with not() and here for combining results with the | operator.
Also, to refer to the second part of your question have a look here.
Using node() in your xpath you can select everything (nodes + text) that is below the node selected. So you can get everything in the div returned by
//div/node()
for future processing by other means.
An expression that works on your input without the union operator:
//div/span[@class='a' or @class='b'][count(../span[@class='b']) + 1]
This is just for fun. I'd probably use something more like @inVader's answer in production code.

Resources