Selecting a sibling with constraints - xpath

I've got an XML file in the vein of
<chapter template="A" id='1'/>
<chapter template="B"/>
<chapter template="B"/>
<chapter template="A" id='2'/>
<chapter template="B"/>
<chapter template="B"/>
<chapter template="B"/>
<chapter template="B"/>
<chapter template="C"/>
And I've got an XSLFO to process these chapters. The last chapter that has template "A" needs some special processing. The id attribute isn't something I can filter for (just added it here for illustration).
I've got a template match="chapter" that does the general processing for all chapters. Somewhere inside that code block is a section that only applies to the last chapter that uses template A. To execute this code block for that chapter only, I've got this test:
if test="following-sibling::chapter[not(@template = 'B')]/@template = 'C'"
So I'm trying to find the first following sibling that isn't template B, and checking if this sibling has template C. (it's been set up so that template C is always the last chapter).
The XPath above returns 'true' when processing chapter id=1 and id=2, so it's too broad. It's true when the chapter has any following sibling with template C.
So I'm thinking: I'll constrain the test to look only at the first following sibling that isn't template B:
if test="following-sibling::chapter[not(@template = 'B')][1]/@template = 'C'"
But this returns false for all chapters. Where am I going wrong?

How about something like //chapter[@template='A'][last()]?
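As a rough cross-check outside XSLT, the same "last chapter with template A" selection can be sketched in Python with the standard library. The wrapping book element is an assumption (the question only shows the chapter elements), and the position is taken with list indexing rather than an XPath position predicate, because ElementTree only supports a subset of XPath.

```python
import xml.etree.ElementTree as ET

# The <book> wrapper is hypothetical; the question shows only the chapters.
XML = """<book>
<chapter template="A" id="1"/>
<chapter template="B"/>
<chapter template="B"/>
<chapter template="A" id="2"/>
<chapter template="B"/>
<chapter template="B"/>
<chapter template="B"/>
<chapter template="B"/>
<chapter template="C"/>
</book>"""

root = ET.fromstring(XML)

# All chapters that use template A, in document order.
template_a = root.findall("chapter[@template='A']")

# The last one in that list plays the role of [@template='A'][last()].
last_a = template_a[-1]
print(last_a.get("id"))  # prints 2
```

The id attribute is only read here to show which chapter was picked; the selection itself does not depend on it.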

Related

how can I select the text in the parent node after the current tag in xsl?

I have the following XML:
<titleGroup>
<title type="main" xml:lang="en">Synthesis of <i>N</i>‐Heterocyclic Carbenes and Their Complexes by
Chloronium Ion Abstraction from 2‐Chloroazolium Salts Using Electron‐Rich Phosphines
</title>
</titleGroup>
If I'm in the template that matches i, how can I get the text "‐Heterocyclic Carbenes and Their Complexes by Chloronium Ion Abstraction from 2‐Chloroazolium Salts Using Electron‐Rich Phosphines" in XSL? I want only the part after the current node, not the part before.
From the context of i, the instruction:
<xsl:value-of select="following-sibling::text()[1]"/>
will return the value of the closest following sibling text node.
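For comparison, here is a minimal Python sketch of the same idea using the standard library. In ElementTree the text that directly follows an element is stored in its tail attribute, which corresponds to following-sibling::text()[1] only when that text comes immediately after the element. The title is shortened here for readability, which is an assumption of this example.

```python
import xml.etree.ElementTree as ET

# Shortened version of the title from the question (assumption for brevity).
XML = (
    '<titleGroup><title type="main" xml:lang="en">Synthesis of <i>N</i>'
    '\u2010Heterocyclic Carbenes and Their Complexes</title></titleGroup>'
)

root = ET.fromstring(XML)
i_elem = root.find("./title/i")

# Text before <i> belongs to the parent; text right after <i> is its tail.
before = root.find("./title").text
after = i_elem.tail

print(repr(before))
print(repr(after))
```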

XQuery: the function parse-xml() produces an error on &amp;?

As XML content in an HTTP POST request, I receive the following which I process in Xquery 3.1 (eXist-db 5.2):
<request id="foo">
<p>The is a description with a line break&lt;br/&gt;and another linebreak&lt;br/&gt;and
here is an ampersand&amp;.</p>
</request>
My objective is to take the node <p> and insert it into a TEI file in eXist-db. If I just insert the fragment as-is, no errors are thrown.
However, I need to transform any instances of the string &lt;br/&gt; into the element <lb/> before adding it to the TEI document. I try that with fn:parse-xml.
Applying the following, however, throws an error on &amp;, which surprises me:
let $xml := <request id="foo">
<p>The is a description with a line break&lt;br/&gt;and
another linebreak&lt;br/&gt;and here is an ampersand&amp;.</p>
</request>
let $newxml := <p>{replace($xml//p/text(),"<br/>","<lb/>")}</p>
return <p>{fn:parse-xml($newxml)}</p>
error:
Description: err:FODC0006 String passed to fn:parse-xml is not a well-formed XML document.: Document is not valid.
Fatal : The entity name must immediately follow the '&' in the entity reference.
If I remove &amp; the fragment parses just fine. Why is this producing an error if it is legal XML? How can I achieve the needed result?
Many thanks in advance.
ps. I am open to both Xquery and XSLT solutions.
It seems that the issue is the HTML entities. It would work with numeric entities (i.e. &#60; instead of &lt; and &#62; instead of &gt;), but the XML parser doesn't know about HTML character entities.
Use util:parse-html() instead of fn:parse-xml().
let $xml := <request id="foo">
<p>The is a description with a line break&lt;br/&gt;and
another linebreak&lt;br/&gt;and here is an ampersand&amp;.</p>
</request>
return <p>{util:parse-html($xml/p/text())/HTML/BODY/node()}</p>
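The parser message quoted in the question ("The entity name must immediately follow the '&'") is what any XML parser reports for a bare ampersand. A small Python sketch, using only the standard library and made-up sample strings, shows the difference. It also shows that reading a text node hands back the decoded characters, so the ampersand has to be escaped again before the string is re-parsed. This illustrates the general XML behaviour, not eXist-db specifically.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment: str) -> bool:
    """Return True if the string parses as an XML document."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


print(is_well_formed("<p>here is an ampersand&.</p>"))      # False
print(is_well_formed("<p>here is an ampersand&amp;.</p>"))  # True

# Reading the text decodes the entities, so a bare '&' reappears.
text = ET.fromstring("<p>a&lt;br/&gt;b &amp; c</p>").text   # 'a<br/>b & c'

# Re-escape '&' first, then turn the <br/> string into <lb/> markup.
fixed = text.replace("&", "&amp;").replace("<br/>", "<lb/>")
p = ET.fromstring(f"<p>{fixed}</p>")
print([child.tag for child in p])  # ['lb']
```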

XPath 1.0 to find a sibling of an attribute node whose name is based on the attribute, but has a suffix

Given the following Xml:
<Root><Foo Bar="" Bar_Baz="12" /></Root>
Is there an XPath statement (using version 1.0 functions only) that can return Root/Foo/@Bar where there exists some sibling attribute starting with Bar (determined by context), and ending in _Baz, where that node has the value 12?
Bar should be anonymous - the XPath shouldn't care what it's called - but whatever it is called, if it is returned or not should be determined by whether X_Baz exists, and has the value of 12.
I was looking into something like:
//@*[sibling::@*[concat(local-name(), '_Baz') = '12']
But fairly obviously, this would just compare the text Bar_Baz to 12, not the value of that sibling attribute.
I'm making use of this using the .Net XmlDocument class, meaning I'm limited to Microsoft's XPath 1.0 implementation, so please don't make use of subsequent versions of the spec!
EDIT: Per the comment requesting a more diverse set of examples, see below:
<Root>
<Item Foo="" Foo_Baz="12">Yes - @Foo_Baz is 12, and @Foo exists</Item>
<Item Bar="" Bar_Baz="12">Yes - @Bar_Baz is 12, and @Bar exists</Item>
<Item Foo="" Foo_Baz="1">No - Foo_Baz != 12</Item>
<Item Baz="" Foo_Baz="12">No - No @Foo to return</Item>
<Item Foo_Baz="12">No - No @Foo to return</Item>
<Item Foo="" Foo_Haz="12">No - No @Foo_Baz node to check the value of</Item>
</Root>
Edit 2:
Looking at the first couple of answers proposed, I think there is something I haven't been clear on: the names, Foo or Bar, are unknown. The only things that are known are:
There are one or more attributes with a suffix _Baz that have the value 12
They may have siblings whose entire name is whatever came before the suffix
If they do, then that sibling is the node I want to match, provided the _Baz attribute has the value of 12
Another option :
//item[substring-after(local-name(./@*[last()]),"_")="baz" and ./@*[last()]="12"][local-name(./@*[1])=substring-before(local-name(./@*[last()]),"_")]
Shortest form :
//item[@foo or @bar][@bar_baz="12" or @foo_baz="12"]
EDIT: Massive and horrible XPath here, but it should work. It supports up to 5 attributes per item, regardless of the position of these attributes inside each item tag.
//item[contains(local-name(@*[1]),"_baz") and @*[1]=12][local-name(@*[1])=substring-before(local-name(@*[1]),"_")]|//item[contains(local-name(@*[1]),"_baz") and @*[1]=12][local-name(@*[3])=substring-before(local-name(@*[1]),"_")]|//item[contains(local-name(@*[1]),"_baz") and @*[1]=12][local-name(@*[4])=substring-before(local-name(@*[1]),"_")]|//item[contains(local-name(@*[1]),"_baz") and @*[1]=12][local-name(@*[5])=substring-before(local-name(@*[1]),"_")]|//item[contains(local-name(@*[2]),"_baz") and @*[2]=12][local-name(@*[1])=substring-before(local-name(@*[2]),"_")]|//item[contains(local-name(@*[2]),"_baz") and @*[2]=12][local-name(@*[3])=substring-before(local-name(@*[2]),"_")]|//item[contains(local-name(@*[2]),"_baz") and @*[2]=12][local-name(@*[4])=substring-before(local-name(@*[2]),"_")]|//item[contains(local-name(@*[2]),"_baz") and @*[2]=12][local-name(@*[5])=substring-before(local-name(@*[2]),"_")]|//item[contains(local-name(@*[3]),"_baz") and @*[3]=12][local-name(@*[1])=substring-before(local-name(@*[3]),"_")]|//item[contains(local-name(@*[3]),"_baz") and @*[3]=12][local-name(@*[3])=substring-before(local-name(@*[3]),"_")]|//item[contains(local-name(@*[3]),"_baz") and @*[3]=12][local-name(@*[4])=substring-before(local-name(@*[3]),"_")]|//item[contains(local-name(@*[3]),"_baz") and @*[3]=12][local-name(@*[5])=substring-before(local-name(@*[3]),"_")]|//item[contains(local-name(@*[4]),"_baz") and @*[4]=12][local-name(@*[1])=substring-before(local-name(@*[4]),"_")]|//item[contains(local-name(@*[4]),"_baz") and @*[4]=12][local-name(@*[3])=substring-before(local-name(@*[4]),"_")]|//item[contains(local-name(@*[4]),"_baz") and @*[4]=12][local-name(@*[4])=substring-before(local-name(@*[4]),"_")]|//item[contains(local-name(@*[4]),"_baz") and @*[4]=12][local-name(@*[5])=substring-before(local-name(@*[4]),"_")]|//item[contains(local-name(@*[5]),"_baz") and @*[5]=12][local-name(@*[1])=substring-before(local-name(@*[5]),"_")]|//item[contains(local-name(@*[5]),"_baz") and @*[5]=12][local-name(@*[3])=substring-before(local-name(@*[5]),"_")]|//item[contains(local-name(@*[5]),"_baz") and @*[5]=12][local-name(@*[4])=substring-before(local-name(@*[5]),"_")]|//item[contains(local-name(@*[5]),"_baz") and @*[5]=12][local-name(@*[5])=substring-before(local-name(@*[5]),"_")]
Working sample (4 nodes selected) :
Strictly in terms of xpath, this expression
//Item[attribute::*[contains(local-name(), '_Baz')]='12'][attribute::*[local-name()='Foo'] | attribute::*[local-name()='Bar']]
should get you your desired output.
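If post-filtering in host code is acceptable, the rule from the question (names unknown, only the _Baz suffix and the value 12 known) can be written down directly. The sketch below uses Python's standard library purely to illustrate the logic; the asker is on .NET, so this is not a drop-in solution, and the function name matching_attributes is made up for this example.

```python
import xml.etree.ElementTree as ET

# Sample data from the question, with the element text dropped for brevity.
XML = """<Root>
<Item Foo="" Foo_Baz="12"/>
<Item Bar="" Bar_Baz="12"/>
<Item Foo="" Foo_Baz="1"/>
<Item Baz="" Foo_Baz="12"/>
<Item Foo_Baz="12"/>
<Item Foo="" Foo_Haz="12"/>
</Root>"""

SUFFIX = "_Baz"


def matching_attributes(root, suffix=SUFFIX, wanted="12"):
    """Return (element, attribute name) pairs for every attribute X
    that has a sibling attribute X + suffix whose value equals wanted."""
    hits = []
    for elem in root.iter():
        for name, value in elem.attrib.items():
            if name.endswith(suffix) and value == wanted:
                base = name[: -len(suffix)]
                # The un-suffixed sibling must exist on the same element.
                if base and base in elem.attrib:
                    hits.append((elem, base))
    return hits


root = ET.fromstring(XML)
print([base for _, base in matching_attributes(root)])  # ['Foo', 'Bar']
```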

XPath - Nested path scraping

I'm trying to scrape the HTML of a webpage. I'd like to fetch the three alternate texts (alt, highlighted) from the three "img" elements.
I'm using the following code to extract the whole "img" element of slide-1.
from lxml import html
import requests
page = requests.get('sample.html')
tree = html.fromstring(page.content)
text_val = tree.xpath('//a[class="cover-wrapper"][id = "slide-1"]/text()')
print text_val
I'm not getting the alternate text values; the result is an empty list.
HTML Script used:
This is one possible XPath :
//div[@id='slide-1']/a[@class='cover-wrapper']/img/@alt
Explanation :
//div[@id='slide-1'] : This part finds the target <div> element by comparing the id attribute value. Notice the use of the @attribute_name syntax to reference an attribute in XPath. Omitting the @ symbol would change the meaning of the selector so that it references a child element with the same name instead of an attribute.
/a[@class='cover-wrapper'] : from each <div> element found by the previous bit of the XPath, find the child element <a> whose class attribute value equals 'cover-wrapper'
/img/@alt : then from each of those <a> elements, find the child element <img> and return its alt attribute
You might want to change the id filter to starts-with(@id,'slide-') if you meant to return all 3 alt attributes in the screenshot.
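The HTML from the question is not available, so here is a minimal sketch against made-up markup that follows the structure described above (div with an id, a with class cover-wrapper, img with alt). It uses the standard library's ElementTree so that it is self-contained. ElementTree cannot select attributes in a path, so the img elements are selected and alt is read from each one. With lxml, as in the question, the full expression ending in /@alt can be passed to tree.xpath() directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; the real page from the question is not shown.
HTML = """<body>
<div class="showcase-wrapper" id="slide-1"><a class="cover-wrapper"><img src="1.jpg" alt="First cover"/></a></div>
<div class="showcase-wrapper" id="slide-2"><a class="cover-wrapper"><img src="2.jpg" alt="Second cover"/></a></div>
<div class="showcase-wrapper" id="slide-3"><a class="cover-wrapper"><img src="3.jpg" alt="Third cover"/></a></div>
</body>"""

root = ET.fromstring(HTML)


def alt_texts(root, slide_id):
    """alt values of the images inside one slide, selected by div id."""
    path = f".//div[@id='{slide_id}']/a[@class='cover-wrapper']/img"
    return [img.get("alt") for img in root.findall(path)]


def all_alt_texts(root):
    """alt values for every div whose id starts with 'slide-'."""
    result = []
    for div in root.iter("div"):
        if div.get("id", "").startswith("slide-"):
            result.extend(img.get("alt") for img in div.findall("./a/img"))
    return result


print(alt_texts(root, "slide-1"))  # ['First cover']
print(all_alt_texts(root))         # ['First cover', 'Second cover', 'Third cover']
```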
Try this:
//a[@class="cover-wrapper"]/img/@alt
So, I am first selecting the a element whose class is cover-wrapper, then its img child, and then the alt attribute of that img.
To find the whole image element :
//a[@class="cover-wrapper"]
I think you want:
//div[@class="showcase-wrapper"][@id="slide-1"]/a/img/@alt

how to match no following sibling

Here's my xml,
<w:tc>
<w:p>
<w:pPr></w:pPr>
<w:r></w:r>
</w:p>
</w:tc>
<w:tc>
<w:p>
<w:pPr></w:pPr>
</w:p>
</w:tc>
I want to match the w:p that is inside a w:tc and whose w:pPr has no following-sibling w:r. Precisely, I want the second w:tc. Here is the code I have tried:
<xsl:template match="w:pPr[ancestor::w:p[ancestor::w:tc] and not(following-sibling::w:r)]">
I need an XPath for a w:pPr that has no following sibling.
The problem is when w:pPr is followed by w:hyperlink. Now I need to ignore w:hyperlink too.
If you want to match a w:pPr that has no following sibling elements at all (regardless of name), then just use a match pattern of
w:pPr[ancestor::w:p[ancestor::w:tc] and not(following-sibling::*)]
or equivalently (and slightly shorter)
w:tc//w:p//w:pPr[not(following-sibling::*)]
Using XPath here is simple and straightforward: you only have to filter elements. The filtering can be based on the content of the element (using [] with a path inside the brackets). You can then work with the filtered elements the same way as with the XML tree (filter again or select the final elements).
In your case, first you have to choose the correct tc element (filter the element as you need):
Based on the count of elements: //tc[count(./p/*) = 1], or
Based on a non-existing r element: //tc[not(./p/r)], or
Based on non-existing r and hyperlink elements: //tc[not(./p/r) and not(./p/hyperlink)], or
Based on an existing pPr and a non-existing r (not necessary, because pPr is filtered in the second step): //tc[./p/pPr and not(./p/r)]
It returns the following XML.
<tc>
<p>
<pPr>pPr</pPr>
</p>
</tc>
Then simply say what you want from the new XML:
Do you want the pPr element? Use: /p/pPr
All together:
//tc[count(./p/*) = 1]/p/pPr
or
//tc[not(./p/r)]/p/pPr
Note: // means find the element anywhere in the document.
Update 1: Hyperlink condition added.
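To see how the two filters differ, here is a small Python sketch with the standard library. ElementTree has no not() or following-sibling axis, so both conditions are written as plain Python checks. The tbl wrapper, the third cell with a hyperlink, and the un-prefixed element names are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample without the w: prefix; the third cell has a hyperlink.
XML = """<tbl>
  <tc><p><pPr/><r/></p></tc>
  <tc><p><pPr/></p></tc>
  <tc><p><pPr/><hyperlink/></p></tc>
</tbl>"""

root = ET.fromstring(XML)


def ppr_without_following_sibling(root):
    """Rough equivalent of //tc/p/pPr[not(following-sibling::*)]."""
    hits = []
    for p in root.findall("./tc/p"):
        children = list(p)
        # A pPr has no following sibling exactly when it is the last child.
        if children and children[-1].tag == "pPr":
            hits.append(children[-1])
    return hits


def ppr_without_run(root):
    """Rough equivalent of //tc[not(./p/r)]/p/pPr."""
    hits = []
    for tc in root.iter("tc"):
        if tc.find("./p/r") is None:
            hits.extend(tc.findall("./p/pPr"))
    return hits


print(len(ppr_without_following_sibling(root)))  # 1
print(len(ppr_without_run(root)))                # 2
```

The second filter still returns the cell whose pPr is followed by a hyperlink, which is why the hyperlink condition matters.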

Resources