XPath: how to select items between item A and item B

I have an HTML page with this structure:
<big><b>Staff in:</b></big>
<br>
<a href='...'>Movie 1</a>
<br>
<a href='...'>Movie 2</a>
<br>
<a href='...'>Movie 3</a>
<br>
<br>
<big><b>Cast in:</b></big>
<br>
<a href='...'>Movie 4</a>
How do I select Movies 1, 2, and 3 using XPath?
I wrote this query:
'//big/b[text()="Staff in:"]/following::a'
but it returns Movies 1, 2, 3, and 4. I guess I need to find a way to get items after <big><b>Staff in: but before the next <big>.
Thanks,

Assuming that <big><b>Staff in:</b></big> is a unique element that we can use as an 'anchor', you can try this:
//big[b='Staff in:']/following-sibling::a[preceding-sibling::big[1][b='Staff in:']]
Basically, the XPath finds every <a> that is a following sibling of the 'anchor' <big> element mentioned above, and restricts the result to those whose nearest preceding sibling <big> is that anchor element.
Output in an XPath tester, given the markup in the question as input (with minimal adjustment to make it well-formed XML):
Element='Movie 1'
Element='Movie 2'
Element='Movie 3'
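The same "nearest preceding <big> is the anchor" rule can also be sketched procedurally. This is only an illustration in Python using the standard library's xml.etree.ElementTree, which does not support the following-sibling axis, so the siblings are walked by hand. The <root> wrapper, the href values, and the function name links_under_heading are assumptions added to make the sample well-formed and runnable.

```python
import xml.etree.ElementTree as ET

# Well-formed version of the markup from the question.
# The <root> wrapper and the href values are assumptions for this sketch.
MARKUP = """
<root>
  <big><b>Staff in:</b></big>
  <br/>
  <a href="#1">Movie 1</a>
  <br/>
  <a href="#2">Movie 2</a>
  <br/>
  <a href="#3">Movie 3</a>
  <br/>
  <br/>
  <big><b>Cast in:</b></big>
  <br/>
  <a href="#4">Movie 4</a>
</root>
"""


def links_under_heading(root, heading):
    """Return the text of each <a> whose nearest preceding <big> has <b> == heading."""
    current_heading = None
    titles = []
    for child in root:
        if child.tag == "big":
            # Every <big> starts a new section; remember its <b> text.
            current_heading = child.findtext("b")
        elif child.tag == "a" and current_heading == heading:
            titles.append(child.text)
    return titles


root = ET.fromstring(MARKUP.strip())
print(links_under_heading(root, "Staff in:"))  # ['Movie 1', 'Movie 2', 'Movie 3']
```

Like the XPath above, this relies on the <big> and <a> elements being siblings and on the heading text being unique.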

//a[preceding::b[text()="Staff in:"] and following::b[text()="Cast in:"]]
Returns all a after the element b with text Staff in: but before the element b with the text Cast in:.
You may need to add some more conditions to make it more specific depending on whether or not these b elements are unique on the page.

To add to the answers above, and following the Stack Overflow question "XPath axis, get all following nodes until", here is the complete solution that I worked out in an XSLT editor. First, /*/ is used instead of // as this is usually faster. Second, the logic is that the anchor (<a>) nodes that are following siblings of a <big> are returned only if their nearest preceding-sibling <big> is the very <big> they follow. This also presumes that each <big> node is distinct.
The XPath looks like this:
/*/big[b="Cast in:"]/following-sibling::a [1 = count(preceding-sibling::big[1]| ../big[b="Cast in:"])]
The XSLT solution looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/">
<html>
<body>
<h2>My Movie Collection</h2>
<table border="1">
<tr bgcolor="#9acd32">
<th>Title</th>
</tr>
<xsl:variable name="placeholder" select="/*/big" />
<xsl:for-each select="$placeholder">
<xsl:variable name="i" select="position()" />
<b>
<xsl:value-of select="$i" />
<xsl:value-of select="$placeholder[$i]" />
</b>
<xsl:for-each
select="following-sibling::a [1 = count(preceding-sibling::big[1]
| ../big[b=$placeholder[$i]])]">
<tr>
<td>
<xsl:value-of select="." />
</td>
</tr>
</xsl:for-each>
</xsl:for-each>
</table>
</body>
</html>
</xsl:template>
</xsl:stylesheet>


Using fn:path() to select an element in XML

I have an XML with a HTML-like structure:
<h1 id="1">
<table>
<tr>
<td>
<p>text</p>
</td>
</tr>
</table>
</h1>
<h1 id="2">
<table>
<tr>
<td><p>translated text</p>
</td>
</tr>
</table>
</h1>
I want to copy the text from nodes in h1 id="2" to the node that's at the same position in h1 id="1".
Required result:
<p>text/translated text</p>
I can create an XPath that addresses a single node:
/h1[2]/table[1]/tr[1]/td[1]/p[1]
but I can't figure out how to create an XPath that finds "the node in h1 id="2" that's at the same position as the node I'm working on in h1 id="1""
i.e. when I'm in
/h1[1]/table[1]/tr[1]/td[1]/p[1]
I want to address
/h1[2]/table[1]/tr[1]/td[1]/p[1]
and also
/h1[3]/table[1]/tr[1]/td[1]/p[1]
etc. if more h1 elements are present in the XML.
I tried using the path() function. This returns the path of the current node:
/h1[1]/table[1]/tr[1]/td[1]/p[1]
I'll modify this string by replacing the first part:
<xsl:variable name="newpath" select="concat('/Q{}h1[1]', substring-after(path(),'/Q{}h1[2]'))"/>
and then read the contents of that xPath:
<xsl:apply-templates select="$newpath"/>
this fails because $newpath is seen as a string instead of a path.
How can I get the output of path() to be treated as a node set instead of a string?
To dynamically evaluate an XPath expression you have as a string, in XSLT 3 and where xsl:evaluate is supported (Saxon PE/EE 9.8 and later, Saxon HE 10 and later, SaxonJS 2 and later, Altova XML 2017 R3 and later) you can use e.g.
<xsl:evaluate context-item="/" xpath="$newpath"/>
to select and output the element(s) selected by $newpath or you can of course store the result of xsl:evaluate in a variable and push the nodes to apply-templates with e.g.
<xsl:variable name="nodes" as="item()*"><xsl:evaluate context-item="/" xpath="$newpath"/></xsl:variable>
<xsl:apply-templates select="$nodes"/>
Online sample using SaxonJS.
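Outside XSLT, the underlying idea (take a positional path, swap the leading h1 step, and resolve the new path against the document) can be illustrated with a small Python sketch. It uses the standard library's xml.etree.ElementTree, which accepts simple positional steps such as table[1]. The <doc> wrapper element and the helper name counterpart are assumptions made for this example and are not part of the original XML.

```python
import xml.etree.ElementTree as ET

# The <doc> wrapper is an assumption: the sample in the question has no single root.
DOC = ET.fromstring("""<doc>
  <h1 id="1"><table><tr><td><p>text</p></td></tr></table></h1>
  <h1 id="2"><table><tr><td><p>translated text</p></td></tr></table></h1>
</doc>""")


def counterpart(root, path, target_index):
    """Swap the leading h1[n] step of a positional path and resolve the result."""
    steps = path.strip("/").split("/")
    steps[0] = f"h1[{target_index}]"
    # ElementTree paths are relative to the element they are called on.
    return root.find("./" + "/".join(steps))


source_path = "/h1[1]/table[1]/tr[1]/td[1]/p[1]"
source = DOC.find("." + source_path)
target = counterpart(DOC, source_path, 2)

merged = f"{source.text}/{target.text}"
print(merged)  # text/translated text
```

Here the path string is rewritten and then evaluated at run time, which is the role xsl:evaluate plays in the stylesheet.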

XSLT split sorted data into different tables

I'm trying to format my XML data into two HTML tables. I can successfully sort some dummy data with xsl:sort, but I can't split the sorted data into different tables.
My xml:
<a>
<b id="N">text1</b>
<b id="N">text2</b>
<b id="N+1">text3</b>
<b id="N">text4</b>
<b id="N+2">text5</b>
<b id="N+3">text6</b>
<b id="N">text7</b>
<b id="N+2">text8</b>
</a>
N is in this case a number, but I don't know which number. It could be 2 and 55, 3 and 4, 44 and 52 and 78 and 98.
Each number I wish to send to their own table, so the result would be:
<table>
<tr><td>text1</td></tr>
<tr><td>text2</td></tr>
<tr><td>text4</td></tr>
<tr><td>text7</td></tr>
</table>
<table>
<tr><td>text3</td></tr>
</table>
<table>
<tr><td>text5</td></tr>
<tr><td>text8</td></tr>
</table>
<table>
<tr><td>text6</td></tr>
</table>
How can I divide the sorted data into different tables depending on their attribute?
Any pointers would be appreciated.
The standard approach to this kind of problem in XSLT 1.0 is called Muenchian grouping. You define a key that groups your target elements in the way you want
<xsl:key name="bsById" match="b" use="@id" />
then use a trick with generate-id to extract just the first node in each group as a proxy for the group as a whole
<xsl:apply-templates select="b[generate-id()
= generate-id(key('bsById', @id)[1])]"
mode="group">
<xsl:sort select="@id" />
</xsl:apply-templates>
So now the following template would fire once per group, and you can use the key function within it to get all the nodes in the group
<xsl:template match="b" mode="group">
<table>
<!-- extract all the nodes that are grouped with this one -->
<xsl:apply-templates select="key('bsById', @id)">
<!-- you could <xsl:sort> here if you want to sort within groups -->
</xsl:apply-templates>
</table>
</xsl:template>
<xsl:template match="b">
<tr><td>...</td></tr>
</xsl:template>
All the above is fine if that example is your entire XML document, but if there's more than one a element within the document each with its own set of b elements that need grouping independently, then the key needs to be more complex. The usual trick here is to use the generate-id of the parent a node as part of the grouping key value for its b children:
<xsl:key name="bsByParentAndId" match="a/b" use="concat(generate-id(..), '|', @id)" />
and for the Muenchian grouping expression
<xsl:template match="a">
<xsl:apply-templates select="b[generate-id()
= generate-id(key('bsByParentAndId', concat(
generate-id(current()), '|', @id))[1])]"
mode="group"/>
</xsl:template>
For the record, if you could use XSLT 2.0 then it becomes significantly easier. No need to define a complex key, you simply use for-each-group
<xsl:template match="a">
<xsl:for-each-group select="b" group-by="@id">
<xsl:sort select="current-grouping-key()" />
<table>
<xsl:apply-templates select="current-group()" />
</table>
</xsl:for-each-group>
</xsl:template>
<xsl:template match="b">
<tr><td>...</td></tr>
</xsl:template>
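For comparison, the grouping step that both the Muenchian key and for-each-group perform can be sketched outside XSLT. The following Python example is only an illustration: it uses the standard library, replaces the unknown N with the hypothetical value 3, and the function names group_rows and to_tables are made up for this sketch.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Sample input with the unknown N replaced by 3 (an assumption for illustration).
SOURCE = ET.fromstring("""<a>
  <b id="3">text1</b>
  <b id="3">text2</b>
  <b id="4">text3</b>
  <b id="3">text4</b>
  <b id="5">text5</b>
  <b id="6">text6</b>
  <b id="3">text7</b>
  <b id="5">text8</b>
</a>""")


def group_rows(parent):
    """Group the text of <b> children by their id attribute, ordered numerically by id."""
    groups = defaultdict(list)
    for b in parent.findall("b"):
        groups[b.get("id")].append(b.text)
    return [groups[key] for key in sorted(groups, key=int)]


def to_tables(groups):
    """Build one <table> per group, with one <tr><td> row per item."""
    tables = []
    for texts in groups:
        table = ET.Element("table")
        for text in texts:
            cell = ET.SubElement(ET.SubElement(table, "tr"), "td")
            cell.text = text
        tables.append(ET.tostring(table, encoding="unicode"))
    return tables


grouped = group_rows(SOURCE)
tables = to_tables(grouped)
print(grouped)
```

The dictionary plays the role of the xsl:key: each distinct id maps to the list of nodes that share it, and iterating over the sorted keys corresponds to visiting the first node of each group.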

extract tags not having sub elements

I have a web page which is something like this
<p> Content </p>
...........
..........
<p> Other content
<b> Use link <b>
<h3> some text <h3>
</p>
...........
........... and some other elements starting with <p> tag having
different sub-elements inside it
What I want to do is extract the text of only those <p> tags which don't have any sub-elements.
The proposed solution is correct when using XSLT. As you tagged this just with xpath, here is the XPath version:
//p[count(*) = 0]/text()
Simply checking whether any child element exists is often faster than counting them:
//p[not(*)]/text()
You can use the template
<xsl:template match="p">
<xsl:if test="count(*) = 0">
<xsl:value-of select="."/>
</xsl:if>
</xsl:template>
This will print the value of a p element only if it has no child elements.
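The "no child elements" test can also be tried in a few lines of Python. This is a minimal sketch using the standard library; the sample page below is a well-formed reconstruction of the fragment in the question (the <body> wrapper and the closing tags are assumptions), and leaf_paragraph_texts is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Well-formed reconstruction of the page fragment; wrapper and closing tags are assumed.
PAGE = ET.fromstring("""<body>
  <p> Content </p>
  <p> Other content
    <b> Use link </b>
    <h3> some text </h3>
  </p>
</body>""")


def leaf_paragraph_texts(root):
    """Return the stripped text of every <p> that has no child elements."""
    # len(element) is the number of child elements, so 0 mirrors not(*) in XPath.
    return [p.text.strip() for p in root.iter("p") if len(p) == 0 and p.text]


leaf_texts = leaf_paragraph_texts(PAGE)
print(leaf_texts)  # ['Content']
```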

Get all elements by partial match of class attribute

I'm trying to use Nokogiri to display results from a URL. (essentially scraping a URL).
I have some HTML which is similar to:
<p class="mattFacer">Matty</p>
<p class="mattSmith">Matthew</p>
<p class="suzieSmith">Suzie</p>
So I then need to find all the elements whose class begins with the word "matt". I also need to save the value of the element and the element name so I can reference it next time, so I need to capture
"Matty" and "<p class='mattFacer'>"
"Matthew" and "<p class='mattSmith'>"
I haven't worked out how to capture the element HTML, but here's what I have so far for the element (it doesn't work!):
doc = Nokogiri::HTML(open(url))
tmp = ""
doc.xpath("[class*=matt").each do |item|
tmp += item.text
end
@testy2 = tmp
This should get you started:
doc.xpath('//p[starts-with(@class, "matt")]').each do |el|
p [el.attributes['class'].value, el.children[0].text]
end
["mattFacer", "Matty"]
["mattSmith", "Matthew"]
Use:
/*/p[starts-with(@class, 'matt')] | /*/p[starts-with(@class, 'matt')]/text()
This selects any p element that is a child of the top element of the XML document and whose class attribute value starts with "matt", together with any text-node child of any such p element.
When evaluated against this XML document (none was provided!):
<html>
<p class="mattFacer">Matty</p>
<p class="mattSmith">Matthew</p>
<p class="suzieSmith">Suzie</p>
</html>
the following nodes are selected (each on a separate line) and can be accessed by position:
<p class="mattFacer">Matty</p>
Matty
<p class="mattSmith">Matthew</p>
Matthew
Here is a quick XSLT verification:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="/">
<xsl:for-each select=
"/*/p[starts-with(@class, 'matt')]
|
/*/p[starts-with(@class, 'matt')]/text()
">
<xsl:copy-of select="."/>
<xsl:text>
</xsl:text>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
The result of this transformation, when applied on the same XML document (above) is the expected, correct sequence of selected nodes:
<p class="mattFacer">Matty</p>
Matty
<p class="mattSmith">Matthew</p>
Matthew
doc = Nokogiri::HTML(open(url))
tmp = ""
items = doc.css("p[class*=matt]").map(&:text).join
The accepted answer is great, but another approach would be to use Nikkou, which lets you match via regular expressions (without needing to be familiar with XPath functions):
doc.attr_matches('class', /^matt/).collect do |item|
[item.attributes['class'].value, item.text]
end
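The prefix match itself is easy to reproduce without any XPath engine. The following is a minimal Python sketch using only the standard library; it captures the same (class, text) pairs as the Nokogiri answers above. The <html> wrapper follows the sample document shown earlier, and class_starts_with is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

PAGE = ET.fromstring("""<html>
  <p class="mattFacer">Matty</p>
  <p class="mattSmith">Matthew</p>
  <p class="suzieSmith">Suzie</p>
</html>""")


def class_starts_with(root, tag, prefix):
    """Return (class, text) pairs for elements whose class attribute starts with prefix."""
    matches = []
    for el in root.iter(tag):
        # A missing class attribute is treated as an empty string.
        if el.get("class", "").startswith(prefix):
            matches.append((el.get("class"), el.text))
    return matches


matt_items = class_starts_with(PAGE, "p", "matt")
print(matt_items)  # [('mattFacer', 'Matty'), ('mattSmith', 'Matthew')]
```

Note that the CSS selector p[class*=matt] matches "matt" anywhere in the attribute value, while p[class^=matt] is the prefix form that corresponds to starts-with().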

XPath "following siblings before"

I'm trying to select elements (a) with XPath 1.0 (or possibly with a regex) that are following siblings of a particular element (b) but that precede the next b element.
<img><b>First</b><br>
<img> <a href="...">First Href</a> - 19:30<br>
<img><b>Second</b><br>
<img> <a href="...">Second Href</a> - 19:30<br>
<img> <a href="...">Third Href</a> - 19:30<br>
I tried to make the sample as close to real world as possible. So in this scenario when I'm at element
<b>First</b>
I need to select
First Href
and when I'm at
<b>Second</b>
I need to select
Second Href
Third Href
Any idea how to achieve that? Thank you!
Dynamically create this XPath:
following-sibling::a[preceding-sibling::b[1][.='xxxx']]
where 'xxxx' is replaced with the text of the current <b>.
This assumes that all the elements actually are siblings. If they are not, you can try working with the preceding and following axes, or write a more specific XPath that better reflects the document structure.
In XSLT you could also use:
following-sibling::a[
generate-id(preceding-sibling::b[1]) = generate-id(current())
]
Here is a solution which is just a single XPath expression.
Using the Kaysian formula for intersection of two nodesets $ns1 and $ns2:
$ns1[count(. | $ns2) = count($ns2)]
We simply substitute $ns1 with the nodeset of <a> siblings that follow the current <b> node, and we substitute $ns2 with the nodeset of <a> siblings that precede the next <b> node.
Here is a complete transformation that uses this:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="/">
<xsl:apply-templates select="*/b"/>
</xsl:template>
<xsl:template match="b">
At: <xsl:value-of select="."/>
<xsl:variable name="vNextB" select="following-sibling::b[1]"/>
<xsl:variable name="vA-sAfterCurrentB" select="following-sibling::a"/>
<xsl:variable name="vA-sBeforeNextB" select=
"$vNextB/preceding-sibling::a
|
$vA-sAfterCurrentB[not($vNextB)]
"/>
<xsl:copy-of select=
"$vA-sAfterCurrentB
[count(.| $vA-sBeforeNextB)
=
count($vA-sBeforeNextB)
]
"/>
</xsl:template>
</xsl:stylesheet>
When this transformation is applied on the following XML document:
<t>
<img/>
<b>First</b>
<br />
<img/>
<a href="...">First Href</a> - 19:30
<br />
<img/>
<b>Second</b>
<br />
<img/>
<a href="...">Second Href</a> - 19:30
<br />
<img/>
<a href="...">Third Href</a> - 19:30
<br />
</t>
the correct result is produced:
At: First<a href="...">First Href</a>
At: Second<a href="...">Second Href</a>
<a href="...">Third Href</a>
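The Kaysian intersection used in the stylesheet above can be illustrated with ordinary sets. The following Python sketch is only an approximation of the idea: a node belongs to both node-sets exactly when adding it to $ns2 does not make $ns2 any larger. The simplified <t> sample and the function names kaysian_intersection and links_for are assumptions for this example.

```python
import xml.etree.ElementTree as ET

# Simplified sample (an assumption): only the <b> and <a> siblings are kept.
ROOT = ET.fromstring("""<t>
  <b>First</b><a>First Href</a>
  <b>Second</b><a>Second Href</a><a>Third Href</a>
</t>""")


def kaysian_intersection(ns1, ns2):
    """Mirror $ns1[count(. | $ns2) = count($ns2)] using Python sets of nodes."""
    ns2 = set(ns2)
    return [node for node in ns1 if len({node} | ns2) == len(ns2)]


def links_for(root, b):
    """Return the text of the <a> siblings after b and before the next <b>."""
    children = list(root)
    pos = children.index(b)
    # following-sibling::a
    after_current = [n for n in children[pos + 1:] if n.tag == "a"]
    # following-sibling::b[1]
    next_b = next((n for n in children[pos + 1:] if n.tag == "b"), None)
    if next_b is None:
        # No later <b>: every following <a> qualifies, as in the stylesheet.
        before_next = after_current
    else:
        # $vNextB/preceding-sibling::a
        before_next = [n for n in children[:children.index(next_b)] if n.tag == "a"]
    return [a.text for a in kaysian_intersection(after_current, before_next)]


first_b, second_b = ROOT.findall("b")
print(links_for(ROOT, first_b))   # ['First Href']
print(links_for(ROOT, second_b))  # ['Second Href', 'Third Href']
```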
