XPath expression to match across two associated elements - xpath

I’ve got the following XML of associated elements:
<doc>
<!-- A block of style elements. -->
<styles>
<style id='style-1' class='bar'>…</style>
<style id='style-2' class='baz'>…</style>
…
</styles>
<!-- Document content. -->
<p style='style-1'>…</p>
<p style='style-2'>…</p>
…
</doc>
For an XSLT template, I'm looking for an XPath expression that matches "an element p whose style is of class bar".

A pure XPath 1.0 expression that returns all p elements whose style is of class bar:
//p[@style = //style[@class='bar']/@id]
Basically, the XPath looks for <p> elements whose style attribute equals the id of a <style class='bar'> element.

Presuming that is an accurate representation of your document's structure, I would advise using this version without double slashes (//), since they can be very inefficient:
/doc/p[@style = /doc/styles/style[@class = 'bar']/@id]
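Where no full XPath engine is at hand, the same two-step lookup can be sketched with Python's standard-library xml.etree.ElementTree. That module supports only a subset of XPath and cannot evaluate the nested predicate directly, so the join is done in Python. The sample content and the function name p_with_style_class are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample mirroring the structure in the question.
XML = """<doc>
  <styles>
    <style id='style-1' class='bar'>s1</style>
    <style id='style-2' class='baz'>s2</style>
  </styles>
  <p style='style-1'>first</p>
  <p style='style-2'>second</p>
</doc>"""


def p_with_style_class(root, cls):
    """Return <p> children of root whose style attribute names a <style> of class cls."""
    # Step 1: ids of all styles with the requested class (the inner part of the XPath).
    ids = {s.get("id") for s in root.findall("./styles/style") if s.get("class") == cls}
    # Step 2: keep the <p> elements whose style attribute is one of those ids.
    return [p for p in root.findall("./p") if p.get("style") in ids]


root = ET.fromstring(XML)
matches = p_with_style_class(root, "bar")
print([p.text for p in matches])
```

Like the /doc/... expression, this sketch assumes the styles block and the p elements sit directly under the root.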

Related

How to exclude a child node from xpath?

I have the following code :
<div class = "content">
<table id="detailsTable">...</table>
<div class = "desc">
<p>Some text</p>
</div>
<p>Another text</p>
</div>
I want to select all the text within the 'content' class, which I would get using this XPath:
doc.xpath('string(//div[@class="content"])')
The problem is that it selects all the text including text within the 'table' tag. I need to exclude the 'table' from the xPath. How would I achieve that?
XPath 1.0 solutions:
substring-after(string(//div[@class="content"]),string(//div[@class="content"]/table))
Or just use concat:
concat(//table/following::p[1]," ",//table/following::p[2])
The XPath expression //div[@class="content"] selects the div element, nothing more and nothing less, and applying the string() function gives you the string value of the element, which is the concatenation of all its descendant text nodes.
Getting all the text except for that contained in one particular child is probably not possible in XPath 1.0. With XPath 2.0 it can be done as
string-join(//div[@class="content"]/(node() except table)//text(), '')
But for this kind of manipulation, you're really in the realm of transformation rather than pure selection, so you're stretching the limits of what XPath is designed for.
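As an illustration of doing this on the transformation side instead, here is a minimal sketch using Python's standard-library xml.etree.ElementTree. It skips only direct children with the given tag, much like the `node() except table` step above. The table contents and the helper name text_without are assumptions added for the example.

```python
import xml.etree.ElementTree as ET

# The question's markup, with a hypothetical table body filled in.
XML = """<div class="content">
<table id="detailsTable"><tr><td>cell</td></tr></table>
<div class="desc">
<p>Some text</p>
</div>
<p>Another text</p>
</div>"""


def text_without(elem, skip_tag):
    """Concatenate the text under elem, leaving out direct children named skip_tag."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag != skip_tag:
            # itertext() walks the child's own text and all of its descendants.
            parts.append("".join(child.itertext()))
        # Text following a child belongs to elem itself, so keep it either way.
        parts.append(child.tail or "")
    return "".join(parts)


content = ET.fromstring(XML)
# Collapse runs of whitespace so the result is easy to read.
result = " ".join(text_without(content, "table").split())
print(result)
```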

How to stop at a specific tag?

How do I get all the text from one h1 tag up to the next h1 tag?
I have the class name of the starting h1 tag:
...
<h1 class="something">...</h1>
...
<h1 ...>...</h1>
...
I tried: //*[@class='something']//text()
I want to scrape the text from all children and siblings. I don't need the text of the h1 tags themselves. I don't know how to stop scraping at the next h1 tag.
With a proper example:
<root>
<h1 class="something">.1.</h1>
.2.
<p>.3.</p>
.4.
<h1 class="other">.5.</h1>
</root>
This XPath 1.0 expression:
/root//text()[not(ancestor::h1)][preceding::h1[1][@class='something']]
Meaning: "descendant text nodes of the root element whose first preceding h1 element has a class attribute equal to 'something' and which have no h1 ancestor"
And it selects
.2.
.3.
.4.
Test in http://www.xpathtester.com/xpath/ecd4f379b13558572ffd62d0db3a3f98
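The same selection can be sketched without the preceding:: axis, which Python's standard-library xml.etree.ElementTree does not support, by walking the tree in document order and tracking the most recent h1. The function name text_after_heading is hypothetical, and the sample is the one from the answer.

```python
import xml.etree.ElementTree as ET

XML = """<root>
<h1 class="something">.1.</h1>
.2.
<p>.3.</p>
.4.
<h1 class="other">.5.</h1>
</root>"""


def text_after_heading(root, cls):
    """Collect text whose nearest preceding <h1> has class cls, skipping text inside any <h1>."""
    collected = []
    state = {"active": False}

    def walk(elem, inside_h1):
        if elem.tag == "h1":
            # Each h1 switches collection on or off for the text that follows it.
            state["active"] = elem.get("class") == cls
            inside_h1 = True
        if elem.text and not inside_h1 and state["active"]:
            collected.append(elem.text)
        for child in elem:
            walk(child, inside_h1)
            # child.tail is the text that follows the child inside elem.
            if child.tail and not inside_h1 and state["active"]:
                collected.append(child.tail)

    walk(root, False)
    return collected


raw = text_after_heading(ET.fromstring(XML), "something")
# Drop whitespace-only nodes and trim the rest for display.
parts = [t.strip() for t in raw if t.strip()]
print(parts)
```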

XPath based on node indexes only

I have an XML :
<Section>
<Paragraph>
<Text>t1</Text>
<Text>t2</Text>
</Paragraph>
<Paragraph>
<Text>t3</Text>
<Text>t4</Text>
</Paragraph>
</Section>
and I know only the element indexes, e.g. /0/1/0, i.e. the first Section, its second Paragraph, and that Paragraph's first Text. How can I translate '0/1/0' into a valid XPath that returns the element containing t3?
Note that I don't know element names because they can differ but I only know sequence of indexes as in above example.
Many thanks
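For the example given, this will work. Note that XPath positions start at 1, so each index is incremented by one. The element() test is XPath 2.0; in XPath 1.0, use * in its place.
/element()[1]/element()[2]/element()[1]/text()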
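The translation from a 0-based index path to a 1-based XPath is mechanical. Here is a small sketch, using `*` steps so the result is also valid XPath 1.0. The function names index_path_to_xpath and resolve are made up for illustration, and resolve follows the same path by hand with Python's standard-library xml.etree.ElementTree to check the result.

```python
import xml.etree.ElementTree as ET

XML = """<Section>
<Paragraph><Text>t1</Text><Text>t2</Text></Paragraph>
<Paragraph><Text>t3</Text><Text>t4</Text></Paragraph>
</Section>"""


def index_path_to_xpath(path):
    """Turn a 0-based index path such as '0/1/0' into a 1-based XPath of * steps."""
    steps = [int(s) for s in path.strip("/").split("/")]
    return "".join(f"/*[{i + 1}]" for i in steps)


def resolve(root, path):
    """Follow the same index path by hand, counting element children only."""
    steps = [int(s) for s in path.strip("/").split("/")]
    if steps[0] != 0:
        raise IndexError("a document has exactly one root element")
    node = root
    for i in steps[1:]:
        node = list(node)[i]  # list(elem) gives the child elements in order
    return node


xpath = index_path_to_xpath("0/1/0")
target = resolve(ET.fromstring(XML), "0/1/0")
print(xpath, target.text)
```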

xpath: check if element is within other element

I have quite a large XML structure that in its simplest form looks kinda like this:
<document>
<body>
<section>
<p>Some text</p>
</section>
</body>
<backm>
<section>
<p>Some text</p>
<figure><title>This</title></figure>
</section>
</backm>
</document>
The section levels can be almost limitless (both within the body and backm elements), so I can have a section in a section in a section, etc., and the figure element can be within a numlist, an itemlist, a p, and a lot more elements.
What I want to do is to check if the title in figure element is somewhere within the backm element. Is this possible?
A document could have multiple <backm> elements and it could have multiple <figure><title>Title</title></figure> elements in it. How you build your query depends on the situations you're trying to distinguish between.
//backm/descendant::figure/title
Will return the <title> elements that are the child of a <figure> element and the descendant of a <backm> element.
So:
count(//backm/descendant::figure/title) > 0
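Will return true if there are one or more such title elements.
You can also express a similar check using double negation:
not(//backm[not(descendant::figure/title)])
This is true when no <backm> element lacks such a title, which also holds for a document with no <backm> at all, so it is not a drop-in replacement for the count test. I'm under the impression that this should have better performance.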
//title[parent::figure][ancestor::backm]
Lists all <title> elements with a <figure> parent and a <backm> ancestor.
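For comparison, the same containment check can be sketched with Python's standard-library xml.etree.ElementTree by nesting iter() calls, which search at any depth. The function name figure_titles_in is hypothetical, and the sample is a condensed copy of the question's document.

```python
import xml.etree.ElementTree as ET

XML = """<document>
<body><section><p>Some text</p></section></body>
<backm><section><p>Some text</p><figure><title>This</title></figure></section></backm>
</document>"""


def figure_titles_in(root, container):
    """Return <title> children of any <figure> nested at any depth inside a container element."""
    # Note: if container elements were nested in each other, a title would be
    # listed once per enclosing container; an XPath node-set would not repeat it.
    return [
        title
        for outer in root.iter(container)
        for figure in outer.iter("figure")
        for title in figure.findall("title")
    ]


root = ET.fromstring(XML)
in_backm = figure_titles_in(root, "backm")
in_body = figure_titles_in(root, "body")
print(len(in_backm) > 0, len(in_body) > 0)
```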

xPath help > Find img with alt='My Keyword'

How can I alter this xpath to find if the content contains an img tag with an alt containing my keyword phrase?
$xPath->evaluate('count(/html/body//'.$tag.'[contains(.,"'.$keyword.'")])');
Use:
boolean(//img[contains(@alt, 'yourKeywordHere')])
to find out (true(), false()) whether there is an img element in the XML document whose alt attribute contains 'yourKeywordHere'.
Use:
boolean(//yourTag//img[contains(@alt, 'yourKeywordHere')])
to find out whether there is an element in the document named yourTag that has a descendant img whose alt attribute contains 'yourKeywordHere'.
I don't understand exactly which elements you are looking for, but here is an example that returns all h1 elements containing at least one image with your_keyword in its alt attribute:
//h1[.//img[contains(@alt, 'your_keyword')]]
You should also decide whether the match is case sensitive. You can use the XPath below, but be careful: lower-case() is an XPath 2.0 function, so XPath 1.0 evaluators don't support it (translate() is the usual XPath 1.0 workaround).
//h1[.//img[contains(lower-case(@alt), lower-case('your_keyword'))]]
Here is an example:
//h1[.//img[contains(@alt, 'key')]]
<html>
<h1> <!-- found -->
<img alt='here is my key' />
</h1>
<h1><!-- not found -->
<img alt='here is not' />
</h1>
<h1> <!-- found -->
<h2>
<img alt='the key is also here' />
</h2>
</h1>
<h1></h1> <!-- not found -->
</html>
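When the evaluator lacks lower-case(), the case-insensitive test can also be done in the host language. Below is a minimal sketch with Python's standard-library xml.etree.ElementTree. The id attributes and the upper-case KEY in the sample were added for illustration, and headings_with_img_alt is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Sample based on the example above; ids added so the matches can be named.
XML = """<html>
<h1 id="a"><img alt='here is my key' /></h1>
<h1 id="b"><img alt='here is not' /></h1>
<h1 id="c"><h2><img alt='The KEY is also here' /></h2></h1>
<h1 id="d"></h1>
</html>"""


def headings_with_img_alt(root, keyword, tag="h1"):
    """Return tag elements that contain, at any depth, an img whose alt includes keyword (case-insensitive)."""
    needle = keyword.lower()
    return [
        h
        for h in root.iter(tag)
        if any(needle in img.get("alt", "").lower() for img in h.iter("img"))
    ]


root = ET.fromstring(XML)
found = [h.get("id") for h in headings_with_img_alt(root, "key")]
print(found)
```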

Resources