XPath selection while excluding elements having certain attribute values - xpath

My first post here - it's a great site and I will certainly do my best to give back as much as I can.
I have seen different variations of the following question, but my attempts to resolve it don't appear to work.
Consider this simple tree:
<root>
<div>
<p>hello</p>
<p>hello2</p>
<p><span class="bad">hello3</span></p>
</div>
</root>
I would like to come up with an XPath expression that will select all child nodes of "div", except for elements that have their "class" attribute equal to "bad".
Here is what I have tried:
/root/div/node()[not (@class='bad')]
... However this doesn't seem to work.
What am I missing here?
Cheers,
Isaac

When testing your XPath here with the provided XML document, the XPath seems to be indeed selecting all child nodes that do not have an attribute class="bad" - these are all the <p> elements in the document.
You will note that the only child node that has such an attribute is the <span>, which indeed does not get selected.
Are you expecting the p node surrounding your span not to be selected?
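To make the point above concrete, here is a minimal Python sketch using the standard-library xml.etree.ElementTree module (an assumption of this example; the question does not say which XPath engine is in use). It lists the element children of div together with their class attribute, and separately locates the element that actually carries class="bad". ElementTree only exposes element children, so text nodes matched by node() are not shown.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<root><div>"
    "<p>hello</p>"
    "<p>hello2</p>"
    '<p><span class="bad">hello3</span></p>'
    "</div></root>"
)

div = doc.find("div")

# Direct element children of div: three <p> elements, none with a class attribute.
child_summary = [(child.tag, child.get("class")) for child in div]

# The only element carrying class="bad" is a grandchild of div, not a child.
bad_tags = [elem.tag for elem in doc.findall(".//*[@class='bad']")]

print(child_summary)
print(bad_tags)
```

None of the children of div has class="bad", so a predicate applied to those children filters nothing out.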

I have been working with XPath in a Java program I'm writing. If you want to select every descendant element that doesn't have class="bad" (this excludes the <span>, but still includes its surrounding <p>), you could use:
/root/div/descendant::*[not (@class='bad')]
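Otherwise, if you want to select the <p> nodes that don't have a child with class='bad', put the test in a predicate on p:
/root/div/p[not(*[@class='bad'])]
Note that a path such as /root/div/p/*[not (@class='bad')]/.. does something different: the .. part selects the immediate parent node, so it returns each p that has at least one child element without class='bad'.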

The identity transform just matches and copies everything:
<xsl:template match="@*|node()" >
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
Then add an empty template that more specifically matches the pattern you want to exclude:
<xsl:template match="span[@class='bad']" />
(You can also add a priority attribute if you want to be more explicit about which template takes precedence.)

Welcome to SO, Isaac!
I'd try this:
/root/div/*[./*[@class != "bad"]]
This ought to select all child elements (*) of the div element that do not have a child element with a class attribute equal to bad.
Edit:
As per @Alejandro's comment:
/root/div/*[not(*/@class = "bad")]
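The intent of the edited expression (keep each child of div unless one of its own children has class="bad") can be sketched in Python. This is only an illustration: it uses the standard-library xml.etree.ElementTree module, whose XPath subset has no not() function, so the negation is done in plain Python. The helper name has_bad_child is hypothetical.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<root><div>"
    "<p>hello</p>"
    "<p>hello2</p>"
    '<p><span class="bad">hello3</span></p>'
    "</div></root>"
)

div = doc.find("div")


def has_bad_child(elem):
    """True if any element child of elem has class="bad" (hypothetical helper)."""
    return elem.find("*[@class='bad']") is not None


# Equivalent in spirit to the not(...) predicate: keep children without a "bad" child.
kept = [child for child in div if not has_bad_child(child)]
kept_text = [child.text for child in kept]

print(kept_text)
```

With the sample tree, only the first two p elements are kept; the third is dropped because its span child has class="bad".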

Related

How to access second element using relative Xpath

Given this page snippet
<section id="mysection">
<div>
<div>
<div>
<a href="">
<div>first</div>
</a>
</div>
<div>
<a href="">
<div>second</div>
</a>
</div>
</div>
</div>
</section>
I want to access the second a element using a relative XPath. In FF (and locating with Selenium IDE) this
//section[@id='mysection']//a[1]
works but this does not match
//section[@id='mysection']//a[2]
What is wrong with the second expression?
EDIT: Actually I do not care so much about Selenium IDE (just use it for quick verification). I want to get it going with selenium2library in Robot Framework. Here, the output is:
ValueError: Element locator with prefix '(//section[@id' is not supported
for the suggested solution (//section[@id='mysection']//a)[2]
You can use this. It selects the anchor descendants of section and gives you the second one. It works with an XSLT processor; hopefully it works with Selenium too.
//section[@id='mysection']/descendant::a[2]
Try this way instead:
(//section[@id='mysection']//a)[2]
//a[2] looks for an <a> element that is the second <a> child of its own parent. Since each parent <div> contains only one <a> child, your XPath didn't match anything.
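The difference between a per-parent position and a position in the whole result can be reproduced with Python's standard-library xml.etree.ElementTree (chosen here only for illustration; it is not the Selenium engine from the question). ElementTree's [n] predicate counts among same-tag siblings of one parent, and indexing the full result list plays the role of the parenthesized form.

```python
import xml.etree.ElementTree as ET

section = ET.fromstring("""
<section id="mysection">
  <div><div>
    <div><a href=""><div>first</div></a></div>
    <div><a href=""><div>second</div></a></div>
  </div></div>
</section>
""")

# Position is evaluated per parent: each <a> is the first <a> of its own <div>.
first_in_parent = section.findall(".//a[1]")
second_in_parent = section.findall(".//a[2]")

# Emulates (//section[@id='mysection']//a)[2]: collect all links, then index.
all_links = section.findall(".//a")
second_overall = all_links[1]

print(len(first_in_parent), len(second_in_parent))
print(second_overall.find("div").text)
```

Here the [1] query returns both links, the [2] query returns none, and indexing the collected list yields the link containing "second".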
With this:
//section[@id='mysection']//a[1]
you are matching every 'a' element that is the first 'a' within its own parent (inside one div, for example), but with this
//section[@id='mysection']//a[2]
you are trying to match every 'a' element that is the second 'a' within its parent, and none of the nodes contains more than one 'a' element.
The node whose position you index should therefore be the parent div of those 'a' tags.
Very simple:
//section[@id='mysection']//a[1] - both elements
This is why the previous answer with parentheses around the whole expression is correct.
//section[@id='mysection']//div[1]/a - only the first element
//section[@id='mysection']//div[2]/a - only the second element
Another way to match each 'a' separately:
//section[@id='mysection']//a[div[text()='first']]
//section[@id='mysection']//a[div[text()='second']]
Another way to reach the second a element is to start from
<div>second</div> (call this the bottom-up approach)
instead of starting from the section element
<section id="mysection"> (call this the top-down approach).
Using the div child of the a element, the solution looks like this:
//div[.='second']/..

How to get node text without children?

I use Nokogiri to parse an HTML page with content like this:
<p class="parent">
Useful text
<br>
<span class="child">Useless text</span>
</p>
When I call the method page.css('p.parent').text Nokogiri returns 'Useful text Useless text'. But I need only 'Useful text'.
How to get node text without children?
XPath includes the text() node test for selecting text nodes, so you could do:
page.xpath('//p[@class="parent"]/text()')
Using XPath to select HTML classes can become quite tricky if the element in question could belong to more than one class, so this might not be ideal.
Fortunately Nokogiri adds the text() selector to CSS, so you can use:
page.css('p.parent > text()')
to get the text nodes that are direct children of p.parent. This will also return some nodes that are whitespace only, so you may have to filter them out.
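The same idea (take only the text nodes that are direct children, then discard whitespace) can be sketched outside Nokogiri. The following uses Python's standard-library xml.etree.ElementTree purely as an illustration, with the snippet rewritten as well-formed XML (<br/> instead of <br>). In ElementTree, the direct text of an element is its own .text plus the .tail of each child; own_text is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

p = ET.fromstring(
    '<p class="parent">\n'
    "Useful text\n"
    "<br/>\n"
    '<span class="child">Useless text</span>\n'
    "</p>"
)


def own_text(elem):
    """Join the text nodes that are direct children of elem (hypothetical helper)."""
    pieces = [elem.text or ""]
    pieces += [child.tail or "" for child in elem]
    # split()/join collapses the whitespace-only pieces and stray newlines.
    return " ".join("".join(pieces).split())


all_text = " ".join("".join(p.itertext()).split())

print(own_text(p))
print(all_text)
```

The first value contains only the direct text of the paragraph, while the second also includes the text of the span.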
You should be able to use page.css('p.parent > *').remove to remove the child elements (calling .children.remove would also remove the text nodes themselves).
Then your page.css('p.parent').text will return the text without the child elements.
Note: the page will be modified by the remove.

how to find the preceding sibling of a link

I have the following I am trying to analyse using xpath
<table>
<tr>
<td>Name</td>
<td>Info</td>
<td>Download</td>
</tr>
<tr>
<td>Name2</td>
<td>Info</td>
<td>Download</td>
</tr>
....
<tr>
..
</tr>
</table>
I have the following xpath to grab the download links
$xpath->query("//a[text()='Download']/@href");
What I am trying to figure out is the query to send to grab the Name of each of the downloads.
The page has no div id markups at all, just plain table, tr, td tags.
I have tried something like
$xpath->query("//preceding-sibling::a[text()='Download']");
Does anyone have any idea on this?
Close!
Given a particular context node (here, the href attribute for a download), you want to find the eldest sibling of the td containing the context node. So your relative path should first ascend to the td and then find the oldest sibling:
parent::a/parent::td/preceding-sibling::td[last()]
or more briefly (and without assuming that there are no elements like p or span intervening between the td and the a):
ancestor::td[1]/preceding-sibling::td[last()]
Some users find the reverse numbering of nodes on the preceding-sibling axis confusing, so it may feel simpler to say that what you really want is the first td child of the smallest containing tr:
ancestor::tr[1]/child::td[1]
If you need in a single pass to pick up all the download links and the textual label for them, then how you do it depends on the context in which you're using XPath. In XSLT, for example, you might write:
<xsl:apply-templates select="//a[text()='Download']/@href"/>
and then fetch the label in the appropriate template:
<xsl:template match="a/@href">
<xsl:value-of select="string(ancestor::tr[1]/td[1])"/>
:
<xsl:value-of select="."/>
</xsl:template>
In other host languages, you will want to do something similar. The key problem is that you have to iterate over the nodes matching your expression for href, and then for each of those nodes you need to move back in the document to pick up the label. How you say "evaluate this second XPath expression based on the current node from the first XPath expression" will vary with your environment.
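As one possible shape of that two-step iteration, here is a sketch in Python with the standard-library xml.etree.ElementTree. Several things are assumptions: the sample table wraps each "Download" in an <a> element (the snippet in the question shows only the cell text), the href values are made up, and because ElementTree has no ancestor axis the loop walks the rows and looks for links inside each one rather than climbing up from the link.

```python
import xml.etree.ElementTree as ET

# Assumed markup: the <a> elements and href values are hypothetical.
table = ET.fromstring("""
<table>
  <tr><td>Name</td><td>Info</td><td><a href="/files/1">Download</a></td></tr>
  <tr><td>Name2</td><td>Info</td><td><a href="/files/2">Download</a></td></tr>
</table>
""")

pairs = []
for row in table.iter("tr"):
    for link in row.iter("a"):
        if link.text == "Download":
            # Same target as ancestor::tr[1]/td[1]: the first cell of the link's row.
            label = row.find("td").text
            pairs.append((label, link.get("href")))

print(pairs)
```

Each download link ends up paired with the name taken from the first cell of its row.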

Find an element that only has one other kind of child

I want to use XPath to find every <blockquote> element that has at least one child <pre> element, no other kinds of child elements, and optionally text nodes as children:
<body><div><!-- arbitrary nesting -->
<blockquote><pre>YES</pre></blockquote>
<blockquote><p>NO</p></blockquote>
<blockquote><pre>NO</pre><p>NO</p></blockquote>
<blockquote><p>NO</p><pre>NO</pre></blockquote>
<blockquote><pre>YES</pre> <pre>YES</pre></blockquote>
<blockquote>NO</blockquote>
</div></body>
This XPath appears to work, but I suspect that it's overly complicated:
//blockquote[pre][not(*[not(name()="pre")])]
Is there a better (less code, more efficient, more DRY) way to select what I want?
//blockquote[pre][count(pre)=count(*)]
Use:
//blockquote[* and not(*[not(self::pre)])]
This selects all blockquote elements in the XML document that have at least one element child and don't have any element child that isn't a pre element.
This is just an application of the double negation law :).
Do note, that this expression is more efficient than one that counts all element children (because the selection stops right at the moment a non-pre child is found).
XSLT - based verification:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="/">
<xsl:copy-of select="//blockquote[* and not(*[not(self::pre)])]"/>
</xsl:template>
</xsl:stylesheet>
When this transformation is applied on the provided XML document:
<body><div><!-- arbitrary nesting -->
<blockquote><pre>YES</pre></blockquote>
<blockquote><p>NO</p></blockquote>
<blockquote><pre>NO</pre><p>NO</p></blockquote>
<blockquote><p>NO</p><pre>NO</pre></blockquote>
<blockquote><pre>YES</pre> <pre>YES</pre></blockquote>
<blockquote>NO</blockquote>
</div></body>
the XPath expression is evaluated and the selected nodes are copied to the output:
<blockquote>
<pre>YES</pre>
</blockquote>
<blockquote>
<pre>YES</pre>
<pre>YES</pre>
</blockquote>
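For comparison, the same selection rule (at least one element child, and no element child that is not a pre) can be checked in Python with the standard-library xml.etree.ElementTree. This is only a sketch of the logic, not an XPath evaluation; only_pre_children is a hypothetical helper name. Like the double-negation expression, all() stops at the first non-pre child.

```python
import xml.etree.ElementTree as ET

body = ET.fromstring("""
<body><div><!-- arbitrary nesting -->
<blockquote><pre>YES</pre></blockquote>
<blockquote><p>NO</p></blockquote>
<blockquote><pre>NO</pre><p>NO</p></blockquote>
<blockquote><p>NO</p><pre>NO</pre></blockquote>
<blockquote><pre>YES</pre> <pre>YES</pre></blockquote>
<blockquote>NO</blockquote>
</div></body>
""")


def only_pre_children(bq):
    """At least one element child, and every element child is <pre> (hypothetical helper)."""
    children = list(bq)
    return bool(children) and all(child.tag == "pre" for child in children)


matches = [bq for bq in body.iter("blockquote") if only_pre_children(bq)]
texts = [[pre.text for pre in bq] for bq in matches]

print(texts)
```

Two blockquote elements pass the check, matching the two elements copied by the XSLT verification.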

Xpath getting node without node child contents

Hey guys, I couldn't get around this. I have HTML structured as follows:
<div class="review-text">
<div id="reviewerprofile">
<div id="revimg"></div>
<div id="reviewr">marc</div>
<div id="revdate">2011-07-06</div>
</div>
this is an awesome review
</div>
What I am trying to get is just the text "this is an awesome review", but every time I query the node I also get the content of the child elements. I'm currently using something like ".//div[@class='review-text']". How do I get just that text? Thank you very much.
You're almost there! Just add /text() at the end of your XPath to get the text node.
An XPath expression such as //div returns a set of nodes, in this case div elements. These are in effect pointers to the original nodes in the original tree; the nodes are still connected to their parents, children, ancestors, and siblings. If you see the children of the div element and don't want them, that's not the fault of the XPath processor, it's the fault of whatever software is processing the results returned by the XPath expression.
You can get the text that's an immediate child of the div element by using /text() as suggested. However, that assumes that you know exactly what you are expecting to find in the HTML page - if "awesome" were in italics, it would give you something different.
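The caveat about italics can be demonstrated with a small Python sketch using the standard-library xml.etree.ElementTree. The <i> element around "awesome" is added here as an assumption to show the effect, and direct_text is a hypothetical helper that joins only the text nodes that are immediate children of the element, which is roughly what /text() would return.

```python
import xml.etree.ElementTree as ET

# Variant of the question's markup with "awesome" wrapped in <i> (an assumption).
review = ET.fromstring("""
<div class="review-text">
<div id="reviewerprofile">
<div id="revimg"></div>
<div id="reviewr">marc</div>
<div id="revdate">2011-07-06</div>
</div>
this is an <i>awesome</i> review
</div>
""")


def direct_text(elem):
    """Join only the text nodes that are immediate children of elem (hypothetical helper)."""
    pieces = [elem.text or ""] + [child.tail or "" for child in elem]
    return " ".join("".join(pieces).split())


print(direct_text(review))
```

The word inside <i> belongs to a child element, so it is missing from the result, which is why the /text() approach depends on knowing the exact markup in advance.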

Resources